Re: rtl8821ae keep alive not set, connection lost

2018-02-04 Thread Larry Finger

On 02/02/2018 10:45 PM, Pkshih wrote:

On Fri, 2018-02-02 at 14:13 -0600, Larry Finger wrote:

On 02/02/2018 01:50 AM, Pkshih wrote:

Hi James,
  
In my experiment, unaligned-word-access may get wrong values that

are different from the value by byte-access. Actually, it can simply
verified by using 'lspci' to check PCI configuration space.
  
DBI read 0x70f:

_rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
_rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
_rtl8821ae_dbi_read:1136 r16 0x350 = 0x
  
DBI read 0x719:

_rtl8821ae_dbi_read:1127 r8 0x34d = 0x
_rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
_rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200
  
  
According to the wrong and original value of 0x70f is 0xff, I think

larger L1 latency 0x70f[5:3] may be helpful. Please help to try
below patch. If it works, quirk table won't be necessary.
  
PK
  
  
diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c

index 7d43ba002..e53af06ed 100644
--- a/rtl8821ae/hw.c
+++ b/rtl8821ae/hw.c
@@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, 
u16 addr)
    }
    if (0 == tmp) {
    read_addr = REG_DBI_RDATA + addr % 4;
-   ret = rtl_read_word(rtlpriv, read_addr);
+
+   ret = rtl_read_byte(rtlpriv, read_addr);
    }
    return ret;
    }
@@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct 
ieee80211_hw *hw)
    }

    	tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);

-   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
+   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);

    	tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);

    _rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));
  



PK,

This patch works perfectly on my x86_64 system. With it, the interface handled a
10 million count flood ping with <1000 packets dropped. It has now been running
for at least 11 hours. Good work.

Can you explain that magic 0x38? I'm quite sure that Kalle will not accept the
patch as is. He already thinks that rtlwifi and friends already have too many
magic numbers.


The magic 0x38 represents L1 latency 0x70f[5:3]=b'111, and I will add definition
or macro to describe
0x70f and move all ASPM code into pci.c, if the problem was
solved. Hence, rtl8723de and rtl8723be
will use the same values, too.

Could you also help to test L1 latency as recommended value b'100?
(i.e. 0x20 instead)


I have run several tests with the following results:

b'010   Failed very quickly
b'100   Did not fail, but the ping test had errors
b'110   No failure and no errors

My tests seem to indicate that we should use at least 0x30 in the field, and 
perhaps we should stay at b'111 or 0x38.


To fix the current alignment errors for ARM architecture, I will push a patch 
for mainline and stable that fixes rtl8821ae. Then you can do the patch that 
moves all ASPM code into pci.c.


Larry


Re: rtl8821ae keep alive not set, connection lost

2018-02-02 Thread Pkshih
On Fri, 2018-02-02 at 14:13 -0600, Larry Finger wrote:
> On 02/02/2018 01:50 AM, Pkshih wrote:
> > Hi James,
> > 
> > In my experiment, unaligned-word-access may get wrong values that
> > are different from the value by byte-access. Actually, it can simply
> > verified by using 'lspci' to check PCI configuration space.
> > 
> > DBI read 0x70f:
> > _rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
> > _rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
> > _rtl8821ae_dbi_read:1136 r16 0x350 = 0x
> > 
> > DBI read 0x719:
> > _rtl8821ae_dbi_read:1127 r8 0x34d = 0x
> > _rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
> > _rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200
> > 
> > 
> > According to the wrong and original value of 0x70f is 0xff, I think
> > larger L1 latency 0x70f[5:3] may be helpful. Please help to try
> > below patch. If it works, quirk table won't be necessary.
> > 
> > PK
> > 
> > 
> > diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
> > index 7d43ba002..e53af06ed 100644
> > --- a/rtl8821ae/hw.c
> > +++ b/rtl8821ae/hw.c
> > @@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv 
> > *rtlpriv, u16 addr)
> >     }
> >     if (0 == tmp) {
> >     read_addr = REG_DBI_RDATA + addr % 4;
> > -   ret = rtl_read_word(rtlpriv, read_addr);
> > +
> > +   ret = rtl_read_byte(rtlpriv, read_addr);
> >     }
> >     return ret;
> >   }
> > @@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct 
> > ieee80211_hw *hw)
> >     }
> >   
> >     tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);
> > -   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
> > +   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);
> >   
> >     tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);
> >     _rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));
> > 
> 
> 
> PK,
> 
> This patch works perfectly on my x86_64 system. With it, the interface 
> handled a 
> 10 million count flood ping with <1000 packets dropped. It has now been 
> running 
> for at least 11 hours. Good work.
> 
> Can you explain that magic 0x38? I'm quite sure that Kalle will not accept 
> the 
> patch as is. He already thinks that rtlwifi and friends already have too many 
> magic numbers.
> 
The magic 0x38 represents L1 latency 0x70f[5:3]=b'111, and I will add definition
or macro to describe
0x70f and move all ASPM code into pci.c, if the problem was
solved. Hence, rtl8723de and rtl8723be
will use the same values, too.

Could you also help to test L1 latency as recommended value b'100? 
(i.e. 0x20 instead)

Thanks
PK



Re: rtl8821ae keep alive not set, connection lost

2018-02-02 Thread Larry Finger

On 02/02/2018 01:50 AM, Pkshih wrote:


Hi James,

In my experiment, unaligned-word-access may get wrong values that
are different from the value by byte-access. Actually, it can simply
verified by using 'lspci' to check PCI configuration space.

DBI read 0x70f:
_rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
_rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
_rtl8821ae_dbi_read:1136 r16 0x350 = 0x

DBI read 0x719:
_rtl8821ae_dbi_read:1127 r8 0x34d = 0x
_rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
_rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200


According to the wrong and original value of 0x70f is 0xff, I think
larger L1 latency 0x70f[5:3] may be helpful. Please help to try
below patch. If it works, quirk table won't be necessary.

PK


diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
index 7d43ba002..e53af06ed 100644
--- a/rtl8821ae/hw.c
+++ b/rtl8821ae/hw.c
@@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, 
u16 addr)
}
if (0 == tmp) {
read_addr = REG_DBI_RDATA + addr % 4;
-   ret = rtl_read_word(rtlpriv, read_addr);
+
+   ret = rtl_read_byte(rtlpriv, read_addr);
}
return ret;
  }
@@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct 
ieee80211_hw *hw)
}
  
  	tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);

-   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
+   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);
  
  	tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);

_rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));

PK,

One more question: Will we need to make the same change to rtl8723be and 
rtl8723de?

Larry



Re: rtl8821ae keep alive not set, connection lost

2018-02-02 Thread Larry Finger

On 02/02/2018 01:50 AM, Pkshih wrote:

Hi James,

In my experiment, unaligned-word-access may get wrong values that
are different from the value by byte-access. Actually, it can simply
verified by using 'lspci' to check PCI configuration space.

DBI read 0x70f:
_rtl8821ae_dbi_read:1127 r8 0x34f = 0x0017
_rtl8821ae_dbi_read:1131 r8 0x350 = 0x000c
_rtl8821ae_dbi_read:1136 r16 0x350 = 0x

DBI read 0x719:
_rtl8821ae_dbi_read:1127 r8 0x34d = 0x
_rtl8821ae_dbi_read:1131 r8 0x34e = 0x0002
_rtl8821ae_dbi_read:1136 r16 0x34e = 0x0200


According to the wrong and original value of 0x70f is 0xff, I think
larger L1 latency 0x70f[5:3] may be helpful. Please help to try
below patch. If it works, quirk table won't be necessary.

PK


diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
index 7d43ba002..e53af06ed 100644
--- a/rtl8821ae/hw.c
+++ b/rtl8821ae/hw.c
@@ -1123,7 +1123,8 @@ static u8 _rtl8821ae_dbi_read(struct rtl_priv *rtlpriv, 
u16 addr)
}
if (0 == tmp) {
read_addr = REG_DBI_RDATA + addr % 4;
-   ret = rtl_read_word(rtlpriv, read_addr);
+
+   ret = rtl_read_byte(rtlpriv, read_addr);
}
return ret;
  }
@@ -1165,7 +1166,7 @@ static void _rtl8821ae_enable_aspm_back_door(struct 
ieee80211_hw *hw)
}
  
  	tmp = _rtl8821ae_dbi_read(rtlpriv, 0x70f);

-   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7));
+   _rtl8821ae_dbi_write(rtlpriv, 0x70f, tmp | BIT(7) | 0x38);
  
  	tmp = _rtl8821ae_dbi_read(rtlpriv, 0x719);

_rtl8821ae_dbi_write(rtlpriv, 0x719, tmp | BIT(3) | BIT(4));




PK,

This patch works perfectly on my x86_64 system. With it, the interface handled a 
10 million count flood ping with <1000 packets dropped. It has now been running 
for at least 11 hours. Good work.


Can you explain that magic 0x38? I'm quite sure that Kalle will not accept the 
patch as is. He already thinks that rtlwifi and friends already have too many 
magic numbers.


Larry



RE: rtl8821ae keep alive not set, connection lost

2018-02-01 Thread Pkshih

> -Original Message-
> From: linux-wireless-ow...@vger.kernel.org 
> [mailto:linux-wireless-ow...@vger.kernel.org] On Behalf
> Of James Cameron
> Sent: Thursday, February 01, 2018 2:22 PM
> To: Larry Finger
> Cc: linux-wireless@vger.kernel.org; Pkshih
> Subject: Re: rtl8821ae keep alive not set, connection lost
> 
> On Wed, Jan 31, 2018 at 11:06:12AM -0600, Larry Finger wrote:
> > On 09/12/2017 05:09 PM, James Cameron wrote:
> > >Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> > >rtl8821ae keep alive, causing "Connection to AP lost" and deauth,
> > >but why?
> > >
> > >Wireless connection is lost after a few seconds or minutes, on
> > >every OLPC NL3 laptop with rtl8821ae, with any stable kernel after
> > >4.10.1, and any kernel with 40b368af4b75.
> > >
> > >dmesg contains
> > >
> > >   wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
> > >
> > >iw event shows
> > >
> > >   wlp2s0: del station 2c:b0:5d:a6:86:eb
> > >   wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 
> > > 4: Disassociated due to
> inactivity
> > >   wlp2s0 (phy #0): disconnected (local request)
> > >
> > >Workaround is to bounce the link, then reconnect;
> > >
> > >   ip link set wlp2s0 down
> > >   ip link set wlp2s0 up
> > >   iw dev wlp2s0 connect qz
> > >
> > >A nearby monitor host captures a deauthentication packet sent by
> > >the device.
> > >
> > >Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> > >issues") which changes the width of DBI register read.
> > >
> > >On the face of it, 40b368af4b75 looks correct, especially compared
> > >against same function in rtl8723be.
> > >
> > >I've no idea why reverting fixes the problem.  I'm hoping someone
> > >here might speculate and suggest ways to test.
> > >
> > >As keep alive is set through this path, my guess is that keep alive
> > >is not being set in the device.  Or perhaps reading 16-bits
> > >perturbs another register.  Is there a way to test?
> > >
> > >http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
> > >
> > >http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and
> > >revert of 40b368af4b75
> >
> > James,
> >
> > I'm afraid we are needing to revisit this problem again. Changing
> > that 8-bit read to a 16-bit version causes an unaligned memory
> > reference in AARCH64, thus we will need to re-revert. To prevent
> > problems on systems such as yours, PK plans to turn off ASPM
> > capability and backdoor in certain platforms that will be listed in
> > a quirks table. Please report the output of 'dmidecode -t system'
> > for you affected system(s).
> 
> Thanks for letting me know.
> 
> We made three production runs, and I'm waiting to get a hold of the
> dmidecode for two of them.  This may take some weeks; we have to find
> stock and ship it, or we have to ask our contract manufacturer (CM) if
> they have kept data or units.
> 
> I've dmidecode for one production run.
> 
> http://dev.laptop.org/~quozl/z/1eh7JF.txt (my unit nl3-e)
> 
> I've dmidecode for prototypes, but they have clearly been programmed
> badly.  We did not ask our CM for Windows compatibility, so they may
> have had no step to verify the data.  We also went through several
> iterations to get serial numbers assigned, so the data I have does not
> have good provenance.
> 
> http://dev.laptop.org/~quozl/z/1eh7EE.txt (my unit nl3-c)
> http://dev.laptop.org/~quozl/z/1eh7EV.txt (my unit nl3-d)
> http://dev.laptop.org/~quozl/z/1eh7He.txt (my unit nl3-a)
> http://dev.laptop.org/~quozl/z/1eh8DR.txt (my unit nl3-b)
> 
> > We hope you will be able to test any proposed patches.
> 
> Yes, can do.
> 
> I've just tested v4.15.
> 
> However, I'm concerned about your plan to use quirks;
> 
> 1.  turning off ASPM may decrease run time on battery, which if it is
> significant, across several thousand laptops will yield generator fuel
> or solar budget failure; can the power impact be quantified?
> 
> 2.  why not keep ASPM enabled, and use 8-bit when quirked, or on
> x86_64, or when not AARCH64?
> 
> 3.  why not find the underlying problem; PK is in the same company as
> the device firmware engineers, so it should be possible for them to
> find out why 16-bit access causes the device firmware to hang?  We
> drew a blank trying to reach firmware engineers through our CM and
> module maker; perha

Re: rtl8821ae keep alive not set, connection lost

2018-01-31 Thread James Cameron
On Wed, Jan 31, 2018 at 11:06:12AM -0600, Larry Finger wrote:
> On 09/12/2017 05:09 PM, James Cameron wrote:
> >Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
> >rtl8821ae keep alive, causing "Connection to AP lost" and deauth,
> >but why?
> >
> >Wireless connection is lost after a few seconds or minutes, on
> >every OLPC NL3 laptop with rtl8821ae, with any stable kernel after
> >4.10.1, and any kernel with 40b368af4b75.
> >
> >dmesg contains
> >
> >   wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost
> >
> >iw event shows
> >
> >   wlp2s0: del station 2c:b0:5d:a6:86:eb
> >   wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: 
> > Disassociated due to inactivity
> >   wlp2s0 (phy #0): disconnected (local request)
> >
> >Workaround is to bounce the link, then reconnect;
> >
> >   ip link set wlp2s0 down
> >   ip link set wlp2s0 up
> >   iw dev wlp2s0 connect qz
> >
> >A nearby monitor host captures a deauthentication packet sent by
> >the device.
> >
> >Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
> >issues") which changes the width of DBI register read.
> >
> >On the face of it, 40b368af4b75 looks correct, especially compared
> >against same function in rtl8723be.
> >
> >I've no idea why reverting fixes the problem.  I'm hoping someone
> >here might speculate and suggest ways to test.
> >
> >As keep alive is set through this path, my guess is that keep alive
> >is not being set in the device.  Or perhaps reading 16-bits
> >perturbs another register.  Is there a way to test?
> >
> >http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13
> >
> >http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and
> >revert of 40b368af4b75
> 
> James,
> 
> I'm afraid we are needing to revisit this problem again. Changing
> that 8-bit read to a 16-bit version causes an unaligned memory
> reference in AARCH64, thus we will need to re-revert. To prevent
> problems on systems such as yours, PK plans to turn off ASPM
> capability and backdoor in certain platforms that will be listed in
> a quirks table. Please report the output of 'dmidecode -t system'
> for you affected system(s).

Thanks for letting me know.

We made three production runs, and I'm waiting to get a hold of the
dmidecode for two of them.  This may take some weeks; we have to find
stock and ship it, or we have to ask our contract manufacturer (CM) if
they have kept data or units.

I've dmidecode for one production run.

http://dev.laptop.org/~quozl/z/1eh7JF.txt (my unit nl3-e)

I've dmidecode for prototypes, but they have clearly been programmed
badly.  We did not ask our CM for Windows compatibility, so they may
have had no step to verify the data.  We also went through several
iterations to get serial numbers assigned, so the data I have does not
have good provenance.

http://dev.laptop.org/~quozl/z/1eh7EE.txt (my unit nl3-c)
http://dev.laptop.org/~quozl/z/1eh7EV.txt (my unit nl3-d)
http://dev.laptop.org/~quozl/z/1eh7He.txt (my unit nl3-a)
http://dev.laptop.org/~quozl/z/1eh8DR.txt (my unit nl3-b)

> We hope you will be able to test any proposed patches.

Yes, can do.

I've just tested v4.15.

However, I'm concerned about your plan to use quirks;

1.  turning off ASPM may decrease run time on battery, which if it is
significant, across several thousand laptops will yield generator fuel
or solar budget failure; can the power impact be quantified?

2.  why not keep ASPM enabled, and use 8-bit when quirked, or on
x86_64, or when not AARCH64?

3.  why not find the underlying problem; PK is in the same company as
the device firmware engineers, so it should be possible for them to
find out why 16-bit access causes the device firmware to hang?  We
drew a blank trying to reach firmware engineers through our CM and
module maker; perhaps we were not large or noisy enough.

4.  it's not just me; there are others who have reported similar
problems, so won't re-reverting affect them?  They haven't engaged in
the process as thoroughly, and may not be in the quirks table.  You
also reproduced the problem with different hardware.

> Thanks,
> 
> Larry

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2018-01-31 Thread Larry Finger

On 09/12/2017 05:09 PM, James Cameron wrote:

Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
why?

Wireless connection is lost after a few seconds or minutes, on every
OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
and any kernel with 40b368af4b75.

dmesg contains

   wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost

iw event shows

   wlp2s0: del station 2c:b0:5d:a6:86:eb
   wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: 
Disassociated due to inactivity
   wlp2s0 (phy #0): disconnected (local request)

Workaround is to bounce the link, then reconnect;

   ip link set wlp2s0 down
   ip link set wlp2s0 up
   iw dev wlp2s0 connect qz

A nearby monitor host captures a deauthentication packet sent by the
device.

Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
issues") which changes the width of DBI register read.

On the face of it, 40b368af4b75 looks correct, especially compared
against same function in rtl8723be.

I've no idea why reverting fixes the problem.  I'm hoping someone here
might speculate and suggest ways to test.

As keep alive is set through this path, my guess is that keep alive is
not being set in the device.  Or perhaps reading 16-bits perturbs
another register.  Is there a way to test?

http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13

http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
of 40b368af4b75


James,

I'm afraid we are needing to revisit this problem again. Changing that 8-bit 
read to a 16-bit version causes an unaligned memory reference in AARCH64, thus 
we will need to re-revert. To prevent problems on systems such as yours, PK 
plans to turn off ASPM capability and backdoor in certain platforms that will be 
listed in a quirks table. Please report the output of 'dmidecode -t system' for 
you affected system(s).


We hope you will be able to test any proposed patches.

Thanks,

Larry




Re: rtl8821ae keep alive not set, connection lost

2017-09-21 Thread James Cameron
On Thu, Sep 21, 2017 at 09:40:14AM -0500, Larry Finger wrote:
> On 09/21/2017 03:07 AM, James Cameron wrote:
> >My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
> >read of REG_DBI_RDATA, and has been stable for four hours.  I'll
> >focus on some more testing of this one.  It is a surprise.
> >
> >http://dev.laptop.org/~quozl/z/1dutXk.txt  (dmesg)
> >
> >Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't
> >happen with write_readback = true.
> 
> Again, thanks for your efforts.
> 
> At this point, my system has been up over 17 hours without a single
> drop. As a result, I will leave the reversion of commit 40b368af4b75
> in place. It seems safer than turning off write_readback. After we
> get more testing, that could still be an option.

Thanks for the reversion commit, I'll point others to it.

My apologies for sloppy work, the test kernel features got swapped!
"-qb" above was with write_readback off, and 16-bit read of
REG_DBI_RDATA, not 8-bit.  Verified with objdump.  It has run for 24
hours without a drop.

So at conclusion;

- the 16-bit read is good with or without write_readback.

- the 8-bit read is bad with or without write_readback, and tends to
  lose connection much quicker without write_readback.

Been a pleasure working with you.  Back to lurk mode.

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2017-09-21 Thread Larry Finger

On 09/21/2017 03:07 AM, James Cameron wrote:

My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
read of REG_DBI_RDATA, and has been stable for four hours.  I'll focus
on some more testing of this one.  It is a surprise.

http://dev.laptop.org/~quozl/z/1dutXk.txt  (dmesg)

Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen
with write_readback = true.


Again, thanks for your efforts.

At this point, my system has been up over 17 hours without a single drop. As a 
result, I will leave the reversion of commit 40b368af4b75 in place. It seems 
safer than turning off write_readback. After we get more testing, that could 
still be an option.


Larry



Re: rtl8821ae keep alive not set, connection lost

2017-09-21 Thread James Cameron
On Thu, Sep 21, 2017 at 09:22:28AM +1000, James Cameron wrote:
> On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote:
> > On 09/20/2017 04:36 AM, James Cameron wrote:
> > >When the problem occurs, register 0x350 bit 25 is set, for which a
> > >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> > >hang.
> > >
> > >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> > >and _rtl8821ae_reset_pcie_interface_dma.
> > >
> > >Any ideas where to do this?
> > 
> > Thanks for the extended debugging.
> > 
> > I was able to repeat your findings. With the 8-bit read of
> > REG_DBI_RDATA, I got poor connection stability. Reverting that part
> > made it stable again. For that reason, I pushed the partial
> > reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues").
> 
> That's great you were able to reproduce, thanks!
> [...]
> I'm still pondering a few more theories;
> 
> - change write_readback, it is true now, and the while()/udelay in
>   _rtl8821ae_dbi_read seems a waste, it never executes,

My test kernel "-qb" was write_readback = false in sw.c, with 8-bit
read of REG_DBI_RDATA, and has been stable for four hours.  I'll focus
on some more testing of this one.  It is a surprise.

http://dev.laptop.org/~quozl/z/1dutXk.txt (dmesg)

Observe how REG_DBI_FLAG+0 is briefly seen as 1, which doesn't happen
with write_readback = true.

> - clearing REG_DBI_CTRL write enable bits at the end of
>   _rtl8821ae_dbi_write,

My test kernel "-qc" had reset of REG_DBI_ADDR as last step in both
_rtl8821ae_dbi_read and _rtl8821ae_dbi_write, and was very unstable,
not able to connect.

http://dev.laptop.org/~quozl/y/1dutbX.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1dutuM.txt (dmesg)

My test kernel "-qd" had reset of REG_DBI_ADDR as last step in only
_rtl8821ae_dbi_write, and had poor connection stability.

http://dev.laptop.org/~quozl/y/1dutr3.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1duuDc.txt (dmesg connection lost)

Based on the above two kernels, clearing REG_DBI_ADDR after a read is
a bad idea, and suggests there is some underlying asynchronicity about
the DBI access.  Almost as if some other condition should signal
completion rather than zero in REG_DBI_FLAG+0.

> - switching to 32-bit access as used by rtl8192de.

My test kernel "-qe" changed RED_DBI_RDATA read to 32-bit, then used a
union hack to pull out the desired byte, and had poor connection
stability.

http://dev.laptop.org/~quozl/y/1duvIC.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1duwI1.txt (dmesg connection lost)

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2017-09-20 Thread James Cameron
On Wed, Sep 20, 2017 at 04:48:23PM -0500, Larry Finger wrote:
> On 09/20/2017 04:36 AM, James Cameron wrote:
> >When the problem occurs, register 0x350 bit 25 is set, for which a
> >comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
> >hang.
> >
> >So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
> >and _rtl8821ae_reset_pcie_interface_dma.
> >
> >Any ideas where to do this?
> 
> Thanks for the extended debugging.
> 
> I was able to repeat your findings. With the 8-bit read of
> REG_DBI_RDATA, I got poor connection stability. Reverting that part
> made it stable again. For that reason, I pushed the partial
> reversion of commit 40b368af4b75 ("rtlwifi: Fix alignment issues").

That's great you were able to reproduce, thanks!

> Where did you detect that bit 25 of register 0x350 was set?

In _rtl8821ae_check_pcie_dma_hang on link up.

REG_DBI_FLAG (0x350 bits 16-31) is observed as;

- 0x on entry to function after warm boot,

- 0x0400 on exit from function; debug bit 23 is set by the function,

- 0x0400 on entry to function on link up when the problem has not
  happened,

- 0x0600 on entry to function on link up when the problem has
  happened.

But I don't know if 0x0600 is useful to detect earlier, or if it is
only a symptom of link down while device active.  Either way, if it
truly does signal an RX hang or firmware RX queue full, it's useful.

My "-q9" and "-qa" test kernels dump REG_DBI_CTRL and REG_DBI_FLAG.

"-q9" is with 8-bit read of REG_DBI_RDATA.

"-qa" is with 16-bit read of REG_DBI_DATA.

My "-qa" test kernel;
http://dev.laptop.org/~quozl/y/1dunwN.txt (git diff v4.13)
http://dev.laptop.org/~quozl/z/1dubX7.txt (dmesg)

REG_DBI_CTRL+3 used by _rtl8821ae_check_pcie_dma_hang is effectively
REG_DBI_FLAG+1 (0x353).

REG_DBI_CTRL is REG_DBI_ADDR; a duplicate register definition.

I'm still pondering a few more theories;

- change write_readback, it is true now, and the while()/udelay in
  _rtl8821ae_dbi_read seems a waste, it never executes,

- clearing REG_DBI_CTRL write enable bits at the end of
  _rtl8821ae_dbi_write,

- switching to 32-bit access as used by rtl8192de.

And a giggle from reviewing the code,
_rtl8821ae_wowlan_initialize_adapter says "Patch Pcie Rx DMA hang
after S3/S4 several times.  The root cause has not be found."
... I've learned that root causes that aren't found tend to cause
further problems later.  ;-)

Given this, my gut feel is firmware or silicon problem; RX DMA ceases,
the driver does not detect it, the connection is lost.

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2017-09-20 Thread Larry Finger

On 09/20/2017 04:36 AM, James Cameron wrote:

When the problem occurs, register 0x350 bit 25 is set, for which a
comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
hang.

So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
and _rtl8821ae_reset_pcie_interface_dma.

Any ideas where to do this?


Thanks for the extended debugging.

I was able to repeat your findings. With the 8-bit read of REG_DBI_RDATA, I got 
poor connection stability. Reverting that part made it stable again. For that 
reason, I pushed the partial reversion of commit 40b368af4b75 ("rtlwifi: Fix 
alignment issues").


Where did you detect that bit 25 of register 0x350 was set?

Larry


Re: rtl8821ae keep alive not set, connection lost

2017-09-20 Thread James Cameron
On Tue, Sep 19, 2017 at 07:42:04PM +1000, James Cameron wrote:
> On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> > On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > > On 09/13/2017 04:46 PM, James Cameron wrote:
> > > >
> > > >I'll give it some more testing and let you know, but it seems as
> > > >capable of keeping a connection as 4.13 plus my earlier revert.
> > > >
> > 
> > Testing went well; removing the call to enable ASPM was as good as
> > changing the DBI read back to 16-bit width.
> > 
> > > The change I sent earlier should be as good as reverting the change
> > > to write_byte in your reversion.
> > 
> > Yes, that would be the hope.
> > 
> > But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> > read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> > so perhaps reading that register has an unexpected side-effect.
> > 
> 
> I've ruled that out after testing for several days different kernels
> based on v4.13;
> 
> - add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
>   after the call to enable_aspm; does not solve problem,
> 
> - add an rtl_read_byte of REG_DBI_CTRL+0 at the start of
>   _rtl8821ae_check_pcie_dma_hang; does not solve problem,

When the problem occurs, register 0x350 bit 25 is set, for which a
comment in _rtl8821ae_check_pcie_dma_hang says means there is an RX
hang.

So perhaps driver should call _rtl8821ae_check_pcie_dma_hang
and _rtl8821ae_reset_pcie_interface_dma.

Any ideas where to do this?

> [...]

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2017-09-19 Thread James Cameron
On Thu, Sep 14, 2017 at 07:27:39PM +1000, James Cameron wrote:
> On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> > On 09/13/2017 04:46 PM, James Cameron wrote:
> > >
> > >I'll give it some more testing and let you know, but it seems as
> > >capable of keeping a connection as 4.13 plus my earlier revert.
> > >
> 
> Testing went well; removing the call to enable ASPM was as good as
> changing the DBI read back to 16-bit width.
> 
> > The change I sent earlier should be as good as reverting the change
> > to write_byte in your reversion.
> 
> Yes, that would be the hope.
> 
> But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
> read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
> so perhaps reading that register has an unexpected side-effect.
> 

I've ruled that out after testing for several days different kernels
based on v4.13;

- add an rtl_read_byte of REG_DBI_CTRL+0 in rtl8821ae_hw_init just
  after the call to enable_aspm; does not solve problem,

- add an rtl_read_byte of REG_DBI_CTRL+0 at the start of
  _rtl8821ae_check_pcie_dma_hang; does not solve problem,

Only way to solve the problem at the moment is either;

- reverting 40b368af4b75 ("rtlwifi: Fix alignment issues"), which
  means using rtl_read_word in _rtl8821ae_dbi_read,

or

- removing the two lines that enable ASPM, as you asked me to try.

> Is there any documentation for that register?  I see other code writes
> to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

I'll repeat and expand on this.  Is there any documentation for this
register, or the other REG_DBI_* registers?

I see that DBI windowed access in rtl8192de is different and yet very
similar.

In rtl8821ae, rtl8723be, and rtl8192de the method seems straightforward;
there are bits for address, bits for write enable by byte, and flag
bits for starting the transfer and completing.

> Evidence of read from REG_DBI_CTRL was captured with an instrumented
> kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
> these dmesg lines;
> 
> [6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
> [6.010338] rtl_pci: rtl_pci_enable_aspm
> [6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
> [6.034806] rtlwifi: rtlwifi: wireless switch is on
> [6.196958] rtl8821ae :02:00.0 wlp2s0: renamed from wlan0
> [7.979186] rtl_pci: rtl_pci_disable_aspm
> [7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [8.295437] rtl8821ae: _rtl8821ae_dbi_read  070f ->  (@034f)
> [8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [8.295462] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0200 (@034d)
> [8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [8.295477] rtl_pci: rtl_pci_enable_aspm
> [8.469734] rtl_pci: rtl_pci_disable_aspm
> [8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
> [8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
> [8.687013] rtl8821ae: _rtl8821ae_dbi_read  070f ->  (@034f)
> [8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
> [8.687038] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0218 (@034d)
> [8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
> [8.687053] rtl_pci: rtl_pci_enable_aspm
> 
> Observe how the windowed read of DBI register 0x70f causes a read of
> 16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL.
> 
> By the way, the cold boot value of DBI register 0x719 is 0x00, and
> the warm boot value is 0x18, so I'm confident there isn't a
> comprehensive register reset.  It means that BIOS has relevance; and
> this BIOS is outside my control.  BIOS variation may explain
> difficulty reproducing.

Is there a register for device reset that I can try?  It would help
to exclude BIOS.

> 
> > There has been a report (in Russian unfortunately) at
> > https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> > handling.
> 
> Thanks.  I've considered and excluded ARP handling delay.  Though ARP
> renewal is typical reason for device sleep to end.
> 
> With the call to enable ASPM disabled, instead of changing the DBI
> read to 16-bit width, what happens is that the device stops accepting
> data from the access point, packets are buffered there, and are
> transmitted as soon as the device makes the next transmission.
> 
> http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
> to confirm this.
> 
> I've a monitor mode tcpdump I can send by private mail if required.
> In that the burst of packets shows ICMP echo requests were buffered by
> the access point.
> 
> > According to Google translate is as follows:
> > 
> > 
> > Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> > which causes the Internet to end. Wireshark looks quite interesting:
> > ARP replays can be sent by one large 

Re: rtl8821ae keep alive not set, connection lost

2017-09-14 Thread James Cameron
On Wed, Sep 13, 2017 at 07:39:35PM -0500, Larry Finger wrote:
> On 09/13/2017 04:46 PM, James Cameron wrote:
> >
> >I'll give it some more testing and let you know, but it seems as
> >capable of keeping a connection as 4.13 plus my earlier revert.
> >

Testing went well; removing the call to enable ASPM was as good as
changing the DBI read back to 16-bit width.

> The change I sent earlier should be as good as reverting the change
> to write_byte in your reversion.

Yes, that would be the hope.

But with the 16-bit DBI read, the register REG_DBI_CTRL+0 is being
read as well, in the first read in _rtl8821ae_enable_aspm_back_door,
so perhaps reading that register has an unexpected side-effect.

Is there any documentation for that register?  I see other code writes
to REG_DBI_CTRL+3, in _rtl8821ae_check_pcie_dma_hang

Evidence of read from REG_DBI_CTRL was captured with an instrumented
kernel; git diff http://dev.laptop.org/~quozl/y/1dsQ6B.txt yielding
these dmesg lines;

[6.010255] rtl_pci: _rtl_pci_update_default_setting const_amdpci_aspm=03
[6.010338] rtl_pci: rtl_pci_enable_aspm
[6.034295] ieee80211 phy0: Selected rate control algorithm 'rtl_rc'
[6.034806] rtlwifi: rtlwifi: wireless switch is on
[6.196958] rtl8821ae :02:00.0 wlp2s0: renamed from wlan0
[7.979186] rtl_pci: rtl_pci_disable_aspm
[7.979306] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
[8.295360] rtl8821ae: _rtl8821ae_enable_aspm_back_door
[8.295437] rtl8821ae: _rtl8821ae_dbi_read  070f ->  (@034f)
[8.295449] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
[8.295462] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0200 (@034d)
[8.295474] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
[8.295477] rtl_pci: rtl_pci_enable_aspm
[8.469734] rtl_pci: rtl_pci_disable_aspm
[8.469857] rtl8821ae: _rtl8821ae_check_pcie_dma_hang
[8.686955] rtl8821ae: _rtl8821ae_enable_aspm_back_door
[8.687013] rtl8821ae: _rtl8821ae_dbi_read  070f ->  (@034f)
[8.687025] rtl8821ae: _rtl8821ae_dbi_write 070f <- ff (@870c)
[8.687038] rtl8821ae: _rtl8821ae_dbi_read  0719 -> 0218 (@034d)
[8.687050] rtl8821ae: _rtl8821ae_dbi_write 0719 <- 18 (@2718)
[8.687053] rtl_pci: rtl_pci_enable_aspm

Observe how the windowed read of DBI register 0x70f causes a read of
16-bits at 0x34f, which includes first 8-bits of 0x350 REG_DBI_CTRL.

By the way, the cold boot value of DBI register 0x719 is 0x00, and
the warm boot value is 0x18, so I'm confident there isn't a
comprehensive register reset.  It means that BIOS has relevance; and
this BIOS is outside my control.  BIOS variation may explain
difficulty reproducing.

> There has been a report (in Russian unfortunately) at
> https://www.linux.org.ru/forum/desktop/12620193 of delays in ARP
> handling.

Thanks.  I've considered and excluded ARP handling delay.  Though ARP
renewal is typical reason for device sleep to end.

With the call to enable ASPM disabled, instead of changing the DBI
read to 16-bit width, what happens is that the device stops accepting
data from the access point, packets are buffered there, and are
transmitted as soon as the device makes the next transmission.

http://dev.laptop.org/~quozl/z/1dsQBf.txt has the ping and IP tcpdump
to confirm this.

I've a monitor mode tcpdump I can send by private mail if required.
In that the burst of packets shows ICMP echo requests were buffered by
the access point.

> According to Google translate is as follows:
> 
> 
> Periodically, Wi-Fi networker rtl8821ae ceases to respond to ARP,
> which causes the Internet to end. Wireshark looks quite interesting:
> ARP replays can be sent by one large packet a few seconds after
> receiving the requests, ie. they seem to be buffered somewhere.

Yes, buffering at access point.

> I need to explore that ENOBUFS return code.

I've seen ENOBUFS up at the application level with ping too, when the
original problem happens with v4.10 plus stable.

> Your case where the device is unresponsive to pings from another NIC
> until the device transmits may also be an ARP problem.
> 
> For completeness, are you using the 2.4 of 5 GHz band? What is the
> make/model your AP? If possible for you to determine, what firmware
> is it running?

2.4 GHz and 5 GHz reproduces the problem.

Open or WPA reproduces the problem.

Netgear WNDR3800 OpenWrt 12.09-beta, r33312.

Several other access points reproduce the problem, including a
customer's TP-Link TL-WR1042ND with unknown firmware version.

No access point as yet does not reproduce the problem.

Hope that helps, thanks for your ideas.

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2017-09-13 Thread James Cameron
On Wed, Sep 13, 2017 at 10:01:37AM -0500, Larry Finger wrote:
> Thank you very much for making the effort to bisect this problem. I
> know that several people have reported the problem, which we cannot
> duplicate; however, most of them just say it drops the connection
> and do nothing more.  In fact, we are lucky to have them even report
> which kernel version they are running!

Yes, in the reported bugs that style is common; almost animistic, very
mystical, and based on heuristics rather than analysis.  ;-)

> As we do not see the problem, we will be relying on you to help
> diagnose the issue. Merely changing the read from 8 to 16 bits
> should not cause any change.

Agreed.

> As _rtl8821ae_dbi_read() is only called from
> _rtl8821ae_enable_aspm_back_door(), we want to test turning off
> ASPM. The following patch will accomplish this. Unfortunately, the
> patch is white-space damaged, thus you will need to apply it
> manually. Please try it to see if it helps your connection
> loss. Note that ASPM settings are preserved through a module
> unload/reload sequence. Thus you will need to reboot after
> rebuilding the driver.

Went back to 4.13, added your test patch, and built kernel.

http://dev.laptop.org/~quozl/z/1dsFOW.txt is dmesg.

New symptom occurs; after 23 seconds since last transmission, the
device becomes unresponsive to ping from another host, but begins to
respond if the device transmits.  Flurry of responses then it settles
down to regular ping.

64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=39 ttl=64 time=1.71 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=40 ttl=64 time=1.93 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=41 ttl=64 time=1.71 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=42 ttl=64 time=1.66 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=43 ttl=64 time=1.70 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=44 ttl=64 time=1.69 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=45 ttl=64 time=37.7 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=46 ttl=64 time=383 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=47 ttl=64 time=11464 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=48 ttl=64 time=10465 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=49 ttl=64 time=9465 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=50 ttl=64 time=8466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=51 ttl=64 time=7466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=52 ttl=64 time=6466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=53 ttl=64 time=5466 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=54 ttl=64 time=4467 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=55 ttl=64 time=3467 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=56 ttl=64 time=2468 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=57 ttl=64 time=1468 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=58 ttl=64 time=469 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=59 ttl=64 time=1.79 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=60 ttl=64 time=1.75 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=61 ttl=64 time=1.72 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=62 ttl=64 time=1.68 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=63 ttl=64 time=1.68 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=64 ttl=64 time=1.95 ms
64 bytes from nl3-e.lan (10.0.0.94): icmp_seq=65 ttl=64 time=1.68 ms

I'll give it some more testing and let you know, but it seems as
capable of keeping a connection as 4.13 plus my earlier revert.

-- 
James Cameron
http://quozl.netrek.org/


Re: rtl8821ae keep alive not set, connection lost

2017-09-13 Thread Larry Finger

On 09/12/2017 05:09 PM, James Cameron wrote:

Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
why?

Wireless connection is lost after a few seconds or minutes, on every
OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
and any kernel with 40b368af4b75.

dmesg contains

   wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost

iw event shows

   wlp2s0: del station 2c:b0:5d:a6:86:eb
   wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: 
Disassociated due to inactivity
   wlp2s0 (phy #0): disconnected (local request)

Workaround is to bounce the link, then reconnect;

   ip link set wlp2s0 down
   ip link set wlp2s0 up
   iw dev wlp2s0 connect qz

A nearby monitor host captures a deauthentication packet sent by the
device.

Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
issues") which changes the width of DBI register read.

On the face of it, 40b368af4b75 looks correct, especially compared
against same function in rtl8723be.

I've no idea why reverting fixes the problem.  I'm hoping someone here
might speculate and suggest ways to test.

As keep alive is set through this path, my guess is that keep alive is
not being set in the device.  Or perhaps reading 16-bits perturbs
another register.  Is there a way to test?

http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13

http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
of 40b368af4b75


James,

Thank you very much for making the effort to bisect this problem. I know that 
several people have reported the problem, which we cannot duplicate; however, 
most of them just say it drops the connection and do nothing more. In fact, we 
are lucky to have them even report which kernel version they are running!


As we do not see the problem, we will be relying on you to help diagnose the 
issue. Merely changing the read from 8 to 16 bits should not cause any change.


As _rtl8821ae_dbi_read() is only called from _rtl8821ae_enable_aspm_back_door(), 
we want to test turning off ASPM. The following patch will accomplish this. 
Unfortunately, the patch is white-space damaged, thus you will need to apply it 
manually. Please try it to see if it helps your connection loss. Note that ASPM 
settings are preserved through a module unload/reload sequence. Thus you will 
need to reboot after rebuilding the driver.


diff --git a/rtl8821ae/hw.c b/rtl8821ae/hw.c
index 305b3abbf..755d3704b 100644
--- a/rtl8821ae/hw.c
+++ b/rtl8821ae/hw.c
@@ -1982,8 +1982,8 @@ int rtl8821ae_hw_init(struct ieee80211_hw *hw)
ppsc->rfpwr_state = ERFON;

rtlpriv->cfg->ops->set_hw_reg(hw, HW_VAR_ETHER_ADDR, mac->mac_addr);
-   _rtl8821ae_enable_aspm_back_door(hw);
-   rtlpriv->intf_ops->enable_aspm(hw);
+   //_rtl8821ae_enable_aspm_back_door(hw);
+   //rtlpriv->intf_ops->enable_aspm(hw);

if (rtlhal->hw_type == HARDWARE_TYPE_RTL8812AE &&
(rtlhal->rfe_type == 1 || rtlhal->rfe_type == 5))


Thanks,

Larry


rtl8821ae keep alive not set, connection lost

2017-09-12 Thread James Cameron
Summary: 40b368af4b75 ("rtlwifi: Fix alignment issues") breaks
rtl8821ae keep alive, causing "Connection to AP lost" and deauth, but
why?

Wireless connection is lost after a few seconds or minutes, on every
OLPC NL3 laptop with rtl8821ae, with any stable kernel after 4.10.1,
and any kernel with 40b368af4b75.

dmesg contains

  wlp2s0: Connection to AP 2c:b0:5d:a6:86:eb lost

iw event shows

  wlp2s0: del station 2c:b0:5d:a6:86:eb
  wlp2s0 (phy #0): deauth 74:c6:3b:09:b5:0d -> 2c:b0:5d:a6:86:eb reason 4: 
Disassociated due to inactivity
  wlp2s0 (phy #0): disconnected (local request)

Workaround is to bounce the link, then reconnect;

  ip link set wlp2s0 down
  ip link set wlp2s0 up
  iw dev wlp2s0 connect qz

A nearby monitor host captures a deauthentication packet sent by the
device.

Bisection showed cause is 40b368af4b75 ("rtlwifi: Fix alignment
issues") which changes the width of DBI register read.

On the face of it, 40b368af4b75 looks correct, especially compared
against same function in rtl8723be.

I've no idea why reverting fixes the problem.  I'm hoping someone here
might speculate and suggest ways to test.

As keep alive is set through this path, my guess is that keep alive is
not being set in the device.  Or perhaps reading 16-bits perturbs
another register.  Is there a way to test?

http://dev.laptop.org/~quozl/z/1drtGD.txt dmesg of 4.13

http://dev.laptop.org/~quozl/z/1drt7c.txt dmesg with 4.13 and revert
of 40b368af4b75

-- 
James Cameron
http://quozl.netrek.org/