On 12/9/24 4:37 AM, James Prestwood wrote:
On 12/8/24 10:48 PM, Baochen Qiang wrote:
On 12/6/2024 8:27 PM, James Prestwood wrote:
Hi Baochen,
On 12/5/24 6:47 PM, Baochen Qiang wrote:
On 9/5/2024 9:46 AM, Baochen Qiang wrote:
On 9/5/2024 2:03 AM, Jeff Johnson wrote:
On 8/16/2024 5:04 AM, James Prestwood wrote:
Hi Baochen,
On 8/16/24 3:19 AM, Baochen Qiang wrote:
On 7/12/2024 9:11 PM, James Prestwood wrote:
Hi,
I've seen this error mentioned on random forum posts, but it's always
associated with a kernel crash/warning or some very obvious negative
behavior. I've noticed this occasionally, and at one location very
frequently, during FT roaming, specifically just after CMD_ASSOCIATE is
issued. For our company-run networks I'm not seeing any negative
behavior apart from a 3 second delay in sending the re-association
frame, since the kernel waits for this timeout. But we have some
networks our clients run on that we do not own (different vendor), and
we are seeing association timeouts after this error occurs, and in some
cases the AP is sending a deauthentication with reason code 8 instead
of replying with a reassociation reply and an error status, which is
quite odd.
We are chasing this down with the vendor of these APs as well, but the
behavior always happens after we see this key removal failure/timeout
on the client side. So it would appear there is potentially a problem
on both the client and AP. My guess is _something_ about the
re-association frame changes when this error is encountered, but I
cannot see how that would be the case. We are working to get PCAPs now,
but it's through a 3rd party, so that timing is out of my control.
From the kernel code this error would appear innocuous: the old key
fails to be removed, but it gets immediately replaced by the new key,
and we don't see that addition failing. Am I understanding that logic
correctly? I.e. this logic:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/mac80211/key.c#n503
Below are a few kernel logs of the issue happening, some with the
deauth being sent by the AP, some with just timeouts:
--- No deauth frame sent, just association timeouts after the error ---
Jul 11 00:05:30 kernel: wlan0: disconnect from AP <previous BSS> for new assoc to <new BSS>
Jul 11 00:05:33 kernel: ath10k_pci 0000:02:00.0: failed to install key for vdev 0 peer <previous BSS>: -110
Jul 11 00:05:33 kernel: wlan0: failed to remove key (0, <previous BSS>) from hardware (-110)
Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 1/3)
Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 2/3)
Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 3/3)
Jul 11 00:05:33 kernel: wlan0: association with <new BSS> timed out
Jul 11 00:05:36 kernel: wlan0: authenticate with <new BSS>
Jul 11 00:05:36 kernel: wlan0: send auth to <new BSS>a (try 1/3)
Jul 11 00:05:36 kernel: wlan0: authenticated
Jul 11 00:05:36 kernel: wlan0: associate with <new BSS> (try 1/3)
Jul 11 00:05:36 kernel: wlan0: RX AssocResp from <new BSS> (capab=0x1111 status=0 aid=16)
Jul 11 00:05:36 kernel: wlan0: associated
--- Deauth frame sent amidst the association timeouts ---
Jul 11 00:43:18 kernel: wlan0: disconnect from AP <previous BSS> for new assoc to <new BSS>
Jul 11 00:43:21 kernel: ath10k_pci 0000:02:00.0: failed to install key for vdev 0 peer <previous BSS>: -110
Jul 11 00:43:21 kernel: wlan0: failed to remove key (0, <previous BSS>) from hardware (-110)
Jul 11 00:43:21 kernel: wlan0: associate with <new BSS> (try 1/3)
Jul 11 00:43:21 kernel: wlan0: deauthenticated from <new BSS> while associating (Reason: 8=DISASSOC_STA_HAS_LEFT)
Jul 11 00:43:24 kernel: wlan0: authenticate with <new BSS>
Jul 11 00:43:24 kernel: wlan0: send auth to <new BSS> (try 1/3)
Jul 11 00:43:24 kernel: wlan0: authenticated
Jul 11 00:43:24 kernel: wlan0: associate with <new BSS> (try 1/3)
Jul 11 00:43:24 kernel: wlan0: RX AssocResp from <new BSS> (capab=0x1111 status=0 aid=101)
Jul 11 00:43:24 kernel: wlan0: associated
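For anyone trying to quantify how often this happens across many devices, a small log scanner along these lines might help. This is only a sketch based on the message formats visible in the excerpts above: the function name summarize_roams, the dictionary keys, and the placeholder year are my own assumptions, and real journal output may need a different prefix pattern. (-110 is ETIMEDOUT on Linux.)

```python
import re
from datetime import datetime

# Matches the syslog-style prefix seen in the excerpts, e.g.
# "Jul 11 00:05:30 kernel: wlan0: associated"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) kernel: (.*)$")

# Syslog timestamps carry no year, so a placeholder is assumed here.
# Deltas that cross a year boundary would come out wrong.
ASSUMED_YEAR = "2024"


def summarize_roams(lines):
    """Group kernel log lines into roam attempts.

    A roam starts at "disconnect from AP ... for new assoc" and ends at
    the next "associated" line. Returns one dict per roam.
    """
    roams = []
    current = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(ASSUMED_YEAR + " " + m.group(1),
                               "%Y %b %d %H:%M:%S")
        msg = m.group(2)
        if "disconnect from AP" in msg and "for new assoc" in msg:
            current = {
                "start": ts,
                "key_remove_failed": False,
                "assoc_timed_out": False,
                "deauthenticated": False,
                "seconds_to_associated": None,
            }
            roams.append(current)
        elif current is None:
            continue  # not inside a roam, ignore
        elif "failed to remove key" in msg:
            current["key_remove_failed"] = True
        elif "association with" in msg and "timed out" in msg:
            current["assoc_timed_out"] = True
        elif "deauthenticated from" in msg:
            current["deauthenticated"] = True
        elif msg.endswith(": associated"):
            delta = ts - current["start"]
            current["seconds_to_associated"] = delta.total_seconds()
            current = None
    return roams


# Abbreviated copy of the first excerpt, one log entry per line.
SAMPLE_LOG = [
    "Jul 11 00:05:30 kernel: wlan0: disconnect from AP <previous BSS> for new assoc to <new BSS>",
    "Jul 11 00:05:33 kernel: ath10k_pci 0000:02:00.0: failed to install key for vdev 0 peer <previous BSS>: -110",
    "Jul 11 00:05:33 kernel: wlan0: failed to remove key (0, <previous BSS>) from hardware (-110)",
    "Jul 11 00:05:33 kernel: wlan0: associate with <new BSS> (try 1/3)",
    "Jul 11 00:05:33 kernel: wlan0: association with <new BSS> timed out",
    "Jul 11 00:05:36 kernel: wlan0: authenticate with <new BSS>",
    "Jul 11 00:05:36 kernel: wlan0: associated",
]

if __name__ == "__main__":
    for roam in summarize_roams(SAMPLE_LOG):
        print(roam)
```

Feeding it the output of something like `journalctl -k` should give a per-roam view of whether the key removal failed and how long it took to get back to "associated", which makes it easier to compare AP vendors.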
Hi James, this is QCA6174, right? Could you also share the firmware
version?
Yep, using:
qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1dac:0261
firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp crc32 bf907c7c
I did try the latest firmware, 309, in one instance and still saw the
same behavior, but 288 is what all our devices are running.
Thanks,
James
Baochen, are you looking more into this? Would prefer to fix the root
cause rather than take "[RFC 0/1] wifi: ath10k: improvement on key
removal failure".
I asked the CST team to try to reproduce this issue so that we can get a
firmware dump for further debugging. What I heard back is that the CST
team is currently busy with other critical schedules and plans to debug
this ath10k issue once those are finished.
Jeff, I have been notified that the CST team cannot reproduce this issue.
Thanks for reaching out to them at least. Maybe the firmware team can
provide some info about how long it _should_ take to remove a key and we
can make the timeout reflect that?
Are you implying that the failure is due to a not-long-enough wait in
the host driver? Or do you want to know the maximum time the firmware
needs to remove a key, so that if it is less than 3s we can reduce the
current timeout to work around the issue you hit?
No, I'm not implying the wait isn't long enough. I would like to know
the maximum time the firmware should take normally and only wait that
amount of time, which would fix the issues we see with Cisco APs.
Thanks,
James
Attempting to revive this thread again with additional information.
After initially discovering this I have been carrying a patch which
lowers the timeout to 1 second instead of 3. Though undesirable (since
it delays roams by 1 second) it did work around the issue with Cisco
APs. Unfortunately we now see the same issue with another vendor,
"Extreme Networks", despite the delay being only 1 second.
I can't remember if it was mentioned but we do not see this failure with
other AP vendors like Meraki or Aruba, and even some clients that use
Cisco don't experience it. But it appears to happen more (sometimes 90%+
of the time) with certain AP vendors. I cannot begin to imagine how the
AP would have any effect on the driver/firmware's ability to remove a
key locally, but here we are.
Currently I'm thinking I have 2 options:
- Further reduce the wait, but given the failure happens so
consistently the roaming time will be at minimum whatever I set the
timeout to.
- Remove the wait entirely for DISABLE_KEY. I have no idea if this is
safe/recommended, but given the failure isn't handled (only an error log)
it feels like I could remove it.
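To make the trade-off between those two options concrete, here is a rough userspace model of the "send command, then wait for completion with a timeout" pattern. It is not driver code: install_key, timed, and the use of threading.Event as a stand-in for the kernel completion are all assumptions for illustration, and the timeouts are scaled down so it runs quickly.

```python
import errno
import threading
import time


def install_key(send_cmd, done, timeout_s, wait=True):
    """Model of a key command followed by an optional wait.

    send_cmd:  callable standing in for sending the command to firmware
    done:      threading.Event standing in for the completion
    timeout_s: how long to wait for the firmware to signal completion
    wait:      False models skipping the wait (e.g. for DISABLE_KEY)

    Returns 0 on success or -errno.ETIMEDOUT if the wait expires.
    """
    done.clear()
    send_cmd()
    if not wait:
        return 0
    if not done.wait(timeout_s):
        return -errno.ETIMEDOUT
    return 0


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.monotonic()
    ret = fn()
    return ret, time.monotonic() - start


if __name__ == "__main__":
    silent_firmware = lambda: None  # never signals completion

    # Option 1: shorter timeout. The delay shrinks but never goes away.
    for timeout_s in (0.3, 0.1):
        ret, elapsed = timed(
            lambda: install_key(silent_firmware, threading.Event(), timeout_s))
        print(f"timeout={timeout_s}s ret={ret} elapsed={elapsed:.2f}s")

    # Option 2: no wait at all. Returns immediately, failure goes unseen.
    ret, elapsed = timed(
        lambda: install_key(silent_firmware, threading.Event(), 0.3, wait=False))
    print(f"no wait ret={ret} elapsed={elapsed:.2f}s")
```

With a firmware that never answers, the first option always costs the full timeout per roam, while the second costs nothing but also never notices whether the removal actually happened. Whether that is acceptable in the real driver is exactly the open question.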
Thanks,
James