6.9amd64 athn(4) AR7010+AR9280 & AR9271 USB WiFi don't work in AP mode (frames corruption)
Hi list,

I use AR9280 PCIe card in AP mode for about two years in Lenovo x230 laptop
(whitelist removed). Works perfectly.

After moving to a modern laptop there is no ability to install PCIe card into
it.
I tried to use one of USB dongles I have based on:
1. AR9280+AR7010
2. AR9271

After attaching AR9271 USB dongle and restarting it by '/etc/netstart athn0'
It broadcasts BSSID, and client can get IP address from it. But no data flow
between USB AP and any client device connected. Most frames are dropped or
corrupted on the 'air' level even USB AP and client on the same table.

ICMP between AP and client goes with huge delays in both ways, most packages
drop. Delay is about ~3.5ms but sometimes 337.3ms and more.

The same issue with both USB dongles on x230 laptop (original PCIe was
removed and nothing changed in configuration of PF and /etc/hostname.athn0).

I tried three computers with OpenBSD 6.9amd64 GENERIC installed. The
behaviour is the same with USB AR9280+AR7010 and AR9271 dongles.

Once I return back PCIe version of AR9280 card all works like a charm.

It looks like a bug in athn(4) driver related USB based devices.

Martin
intel(4): edp_panel_vdd_on calls task_del(9) with NULL taskq
Hi,

On a hunch I added additional parameter checks to task_add(9) and
task_del(9) and caught intel(4) doing something strange.

The patch is straightforward: check that the taskq pointer tq is not
NULL. In the current code we return early if a flag is set or cleared
in the task w, in which case we don't catch bogus taskq inputs, which
is why the machine boots fine without the extra checks.

The patch:

Index: kern_task.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_task.c,v
retrieving revision 1.31
diff -u -p -r1.31 kern_task.c
--- kern_task.c	1 Aug 2020 08:40:20 -	1.31
+++ kern_task.c	5 May 2021 21:29:08 -
@@ -354,6 +354,9 @@ task_add(struct taskq *tq, struct task *
 {
 	int rv = 0;
 
+	if (tq == NULL)
+		panic("%s: NULL taskq", __func__);
+
 	if (ISSET(w->t_flags, TASK_ONQUEUE))
 		return (0);
 
@@ -378,6 +381,9 @@ int
 task_del(struct taskq *tq, struct task *w)
 {
 	int rv = 0;
+
+	if (tq == NULL)
+		panic("%s: NULL taskq", __func__);
 
 	if (!ISSET(w->t_flags, TASK_ONQUEUE))
 		return (0);

And here is the panic on my machine. I had to reconstruct it from
OCR, the machine has no serial port, sorry if there are typos.

panic: task_del: NULL taskq
Stopped at      db_enter+0xa:   popq    %rbp
    TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
 513524  44883    0  0x14000   0x200    3  update
 352928  82402    0  0x14000   0x200    2  cleaner
 382195  66035    0  0x14000   0x200    1  reaper
...
db_enter() at db_enter+0xa
panic(81db24fb) at panic+0x12f
task_del(0,810633e0) at task_del+0xa8
edp_panel_vdd_on(81063128) at edp_panel_vdd_on+0x6a
intel_dp_aux_xfer(81063128, 82512a20,4, 82512400,2,0) at intel_dp_aux_xfer+0x18b
intel_dp_aux_transfer(810631e8, 82512a88) at intel_dp_aux_transfer+0x183
drm_dp_dpcd_access(810631e8,9,0,8106313a, 1) at drm_dp_dpcd_access+0xa9
drm_dp_dpcd_read(810631e8,0,8106313a, f) at drm_dp_dpcd_read+0x61
intel_dp_read_dpcd(81063128) at intel_dp_read_dpcd+0x45
intel_dp_init_connector(81063000, 81064000) at intel_dp_init_connector+0x988
intel_ddi_init(80272000,0) at intel_ddi_init+0x454
intel_modeset_init(80272000) at intel_modeset_init+0x1c9f
i915_driver_probe(80272000, 82052f98) at i915_driver_probe+0x7df
inteldrm_attachhook(80272000) at inteldrm_attachhook+0x46
end trace frame: 0x82512700, count: 0

From the backtrace, I gather the following:

edp_panel_vdd_on() calls clear_delayed_work() which is just a macro
that calls task_del(). And for whatever reason the taskq passed
to task_del() is NULL.

Maybe there is a missing INIT_DELAYED_WORK() call somewhere prior to
this point?

And yes, I know, this isn't a bug in the code as-is, but I'm putting
this on bugs@ because I'm pretty certain the taskq shouldn't be NULL.
It "works" without the additional checks, yes, but something is off.

My dmesg is attached. Happy to provide more detail, reproduce it,
etc.

CC dlg@: Maybe there is a discussion to be had about always entering
the taskq mutex during task_add() and task_del() to catch this problem
in the future?

-Scott

OpenBSD 6.9-current (GENERIC.MP) #0: Tue May 4 14:43:41 CDT 2021
    ssc@jetsam.local:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 16895528960 (16112MB)
avail mem = 16368230400 (15609MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.0 @ 0x9f03b000 (63 entries)
bios0: vendor LENOVO version "N23ET59W (1.34 )" date 11/08/2018
bios0: LENOVO 20KHCTO1WW
acpi0 at bios0: ACPI 5.0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT SSDT TPM2 UEFI SSDT SSDT HPET APIC MCFG ECDT SSDT SSDT BOOT BATB SSDT SSDT SSDT LPIT WSMT SSDT SSDT SSDT DBGP DBG2 MSDM DMAR NHLT ASF! FPDT UEFI
acpi0: wakeup devices GLAN(S4) XHC_(S3) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) RP06(S4) PXSX(S4) RP07(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz, 1791.41 MHz, 06-8e-0a
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,TSXFA,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0:
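
The point made in the report above, that an early return on the task flag
can hide a NULL taskq, can be sketched in userland C. This is only an
illustrative model under stated assumptions: the struct layouts, the enum
and the model_* function names are hypothetical and are not the kernel
definitions; only TASK_ONQUEUE and the order of the checks follow the
description in the mail.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; NOT the real layouts. */
#define TASK_ONQUEUE 0x1

struct taskq { int tq_unused; };
struct task  { unsigned int t_flags; };

/* Hypothetical result codes, used here instead of panicking. */
enum del_result { DEL_NOT_QUEUED, DEL_REMOVED, DEL_NULL_TASKQ };

/*
 * Models the unpatched order of checks: the flag is tested first, so
 * a task that is not queued returns before tq is ever looked at.
 */
enum del_result
model_task_del_unchecked(struct taskq *tq, struct task *w)
{
	if (!(w->t_flags & TASK_ONQUEUE))
		return DEL_NOT_QUEUED;	/* a NULL tq goes unnoticed here */
	if (tq == NULL)
		return DEL_NULL_TASKQ;	/* only reached for queued tasks */
	w->t_flags &= ~TASK_ONQUEUE;
	return DEL_REMOVED;
}

/*
 * Models the patched order: tq is validated before anything else.
 * Returning DEL_NULL_TASKQ stands in for the panic() in the diff.
 */
enum del_result
model_task_del_checked(struct taskq *tq, struct task *w)
{
	if (tq == NULL)
		return DEL_NULL_TASKQ;
	if (!(w->t_flags & TASK_ONQUEUE))
		return DEL_NOT_QUEUED;
	w->t_flags &= ~TASK_ONQUEUE;
	return DEL_REMOVED;
}
```

With a task whose t_flags is 0 and a NULL tq, the unchecked model reports
"not queued" while the checked model reports the NULL taskq, which matches
the observation that the machine boots fine without the extra checks.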
Re: 6.9amd64 athn(4) AR7010+AR9280 & AR9271 USB WiFi don't work in AP mode (frames corruption)
On Wed, May 05, 2021 at 06:14:52PM +, Martin wrote:
> Hi list,
>
> I use AR9280 PCIe card in AP mode for about two years in Lenovo x230 laptop
> (whitelist removed). Works perfectly.
>
> After moving to a modern laptop there is no ability to install PCIe card into
> it.
> I tried to use one of USB dongles I have based on:
> 1. AR9280+AR7010
> 2. AR9271
>
> After attaching AR9271 USB dongle and restarting it by '/etc/netstart athn0'
> It broadcasts BSSID, and client can get IP address from it. But no data flow
> between USB AP and any client device connected. Most frames are dropped or
> corrupted on the 'air' level even USB AP and client on the same table.
>
> ICMP between AP and client goes with huge delays in both ways, most packages
> drop. Delay is about ~3.5ms but sometimes 337.3ms and more.
>
> The same issue with both USB dongles on x230 laptop (original PCIe was
> removed and nothing changed in configuration of PF and /etc/hostname.athn0).
>
> I tried three computers with OpenBSD 6.9amd64 GENERIC installed. The
> behaviour is the same with USB AR9280+AR7010 and AR9271 dongles.
>
> Once I return back PCIe version of AR9280 card all works like a charm.
>
> It looks like a bug in athn(4) driver related USB based devices.
>
>
> Martin

What you are seeing does not match what I see with this device
plugged into an APU2 board:

athn1 at uhub0 port 4 configuration 1 interface 0 "ATHEROS UB91C" rev 2.00/1.08 addr 2
athn1: AR9271 rev 1 (1T1R), ROM rev 15, address 00:c0:ca:xx:xx:xx

tcpbench sending to an iwm client associated to athn1:

$ tcpbench 192.168.1.2
  elapsed_ms          bytes         mbps   bwidth
        1000        2195168       17.561  100.00%
Conn:   1 Mbps:       17.561 Peak Mbps:       17.561 Avg Mbps:       17.561
        2011        2264672       17.938  100.00%
Conn:   1 Mbps:       17.938 Peak Mbps:       17.938 Avg Mbps:       17.938
        3012        2292184       18.337  100.00%
Conn:   1 Mbps:       18.337 Peak Mbps:       18.337 Avg Mbps:       18.337
        4014        2274808       18.162  100.00%
Conn:   1 Mbps:       18.162 Peak Mbps:       18.337 Avg Mbps:       18.162
        5016        2231368       17.815  100.00%
Conn:   1 Mbps:       17.815 Peak Mbps:       18.337 Avg Mbps:       17.815
        6017        2228472       17.828  100.00%
Conn:   1 Mbps:       17.828 Peak Mbps:       18.337 Avg Mbps:       17.828
        7017        2200960       17.608  100.00%
Conn:   1 Mbps:       17.608 Peak Mbps:       18.337 Avg Mbps:       17.608
        8023        2293632       18.240  100.00%
Conn:   1 Mbps:       18.240 Peak Mbps:       18.337 Avg Mbps:       18.240
        9023        2274808       18.198  100.00%
Conn:   1 Mbps:       18.198 Peak Mbps:       18.337 Avg Mbps:       18.198
^C
--- 192.168.1.2 tcpbench statistics ---
21104600 bytes sent over 9.393 seconds
bandwidth min/avg/max/std-dev = 17.561/17.965/18.337/0.267 Mbps

I'd suggest you try your adapters on some different machines that use
different USB host controllers. There are known issues with these devices
on some controllers which I believe relate to (lack of?) USB power management
in the USB stack, though I don't know for certain. In some known cases the
firmware wouldn't even boot.

Cheers,
Stefan
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Olivier,

I've seen your report here
https://marc.info/?l=openbsd-bugs=161968896108810

your crash is slightly different. Sebastien is lucky enough to trip
crash in assert, when state key is dereferenced. in your case we've
missed the assert and are dying on uvm fault.

you both seem to be using rdr-to. your pf seems to use also divert-to
rule. I suspect something is going wrong when we deal with traffic,
which matches rdr-to rule.

would you be so kind and try diff below on your AP box. The diff removes
my change to pf_state_key_link_reverse(). Which is a primary suspect
at the moment.

I'm not able to trigger the panic on my notebook, nor on my home router.

thank you for your help
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 23eebf4a274..12d05976f0b 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -7368,19 +7368,11 @@ pf_inp_unlink(struct inpcb *inp)
 void
 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev)
 {
-	struct pf_state_key *old_reverse;
-
-	old_reverse = atomic_cas_ptr(&sk->reverse, NULL, skrev);
-	if (old_reverse != NULL)
-		KASSERT(old_reverse == skrev);
-	else
-		pf_state_key_ref(skrev);
-
-	old_reverse = atomic_cas_ptr(&skrev->reverse, NULL, sk);
-	if (old_reverse != NULL)
-		KASSERT(old_reverse == sk);
-	else
-		pf_state_key_ref(sk);
+	/* Note that sk and skrev may be equal, then we refcount twice. */
+	KASSERT(sk->reverse == NULL);
+	KASSERT(skrev->reverse == NULL);
+	sk->reverse = pf_state_key_ref(skrev);
+	skrev->reverse = pf_state_key_ref(sk);
 }
 
 #if NPFLOG > 0
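
For readers comparing the two versions of pf_state_key_link_reverse() in the
diff above, here is a small userland model written with C11 atomics. It is a
sketch under assumptions, not pf code: struct key, key_init(), key_ref() and
the link_reverse_* names are hypothetical, atomic_compare_exchange_strong()
stands in for the kernel's atomic_cas_ptr(), and assert() stands in for
KASSERT().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a state key; NOT struct pf_state_key. */
struct key {
	_Atomic(struct key *) reverse;
	atomic_uint refcnt;
};

void
key_init(struct key *k)
{
	atomic_init(&k->reverse, NULL);
	atomic_init(&k->refcnt, 0);
}

/* Take one reference and hand the key back, like the ref helper in the diff. */
struct key *
key_ref(struct key *k)
{
	atomic_fetch_add(&k->refcnt, 1);
	return k;
}

/* Install a->reverse = b only if it is still NULL; ref b only on success. */
static void
link_one_cas(struct key *a, struct key *b)
{
	struct key *expected = NULL;

	if (atomic_compare_exchange_strong(&a->reverse, &expected, b))
		key_ref(b);
	else
		assert(expected == b);	/* already linked to the same peer */
}

/* Models the removed (compare-and-swap) variant. */
void
link_reverse_cas(struct key *sk, struct key *skrev)
{
	link_one_cas(sk, skrev);
	link_one_cas(skrev, sk);
}

/* Models the restored variant: both links must be unset on entry. */
void
link_reverse_plain(struct key *sk, struct key *skrev)
{
	assert(atomic_load(&sk->reverse) == NULL);
	assert(atomic_load(&skrev->reverse) == NULL);
	atomic_store(&sk->reverse, key_ref(skrev));
	atomic_store(&skrev->reverse, key_ref(sk));
}
```

In this model the two variants agree for two distinct keys (one reference
each), and the CAS variant tolerates being called again on an already linked
pair. They differ when both arguments are the same key: the plain variant
takes two references, as its comment says, while the CAS variant takes one.
Whether that difference is related to the reported panic is not established
by this sketch.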
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 12:26:17PM +0200, sema...@online.fr wrote:
> Date: Tue, 4 May 2021 12:26:17 +0200
> From: Sebastien Marie
> To: Alexandr Nedvedicky
> Cc: bugs@openbsd.org
> Subject: Re: [External] : pf_state_key_unref: panic: kernel diagnostic
> assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c",
> line 826
>
> On Tue, May 04, 2021 at 11:47:55AM +0200, Alexandr Nedvedicky wrote:
> > Hello Sebastien,
> >
> > On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote:
> > > Hi,
> > >
> > > Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD
> > > 6.9-current (GENERIC.MP) #492: Sat May 1 17:37:28 MDT 2021).
>
> Previous working kernel was OpenBSD 6.9-current (GENERIC.MP) #477: Sat Apr 24
> 16:08:13 MDT 2021

It looks similar to this one:
https://marc.info/?l=openbsd-bugs=161968896108810

--
Olivier Cherrier
Phone: +352691570680
mailto:o...@symacx.com
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 04:50:06PM +0200, Sebastien Marie wrote:
> On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote:
> > Hello Sebastien,
> >
> > thank you for additional info about previously working kernel.
> >
> > it looks like your older kernel, which works, might be running without
> > my commit
> >
> > revision 1.1116
> > date: 2021/04/27 09:38:29; author: sashan; state: Exp;\
> > lines: +14 -6; commitid: 3W1fRTkLb3ZlUanF;
> > pf_state_key_link_reverse() is prone to race on parallel forwarding
> >
> > we need to adjust assertions. at time we call
> > pf_state_key_link_reverse()
> > is state_key either linked to correct reverse peer or not linked at all.
> > The pf_state_key_link_reverse() is being called as a reader ons
> > tate_lock.
> > There might be more packets, which try to update the state key.
> >
> > OK bluhm@
> >
> > > # cat /etc/pf.conf
> > > # See pf.conf(5) and /etc/examples/pf.conf
> > > #
> > > # avoid skip on lo to get tryton redir
> > > #set skip on lo
> > >
> > > block return	# block stateless traffic
> > > pass		# establish keep-state
> > >
> > > # redir 80 -> 8000
> > > pass in proto tcp to (self) port 80 rdr-to
> > > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> > >
> >
> > the rule above is also interesting bit. this may trigger NAT-64.
> > I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
> > to try to reproduce the crash.
>
> I will first try with another rule:
>
> pass in inet6 proto tcp to (self) port 80 rdr-to
> 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
>
> to avoid the possible NAT-64.

the assert still triggers.

> And next (if I still trigger the assert) I will rebuild a kernel and
> backout this specific commit.

trying with a kernel without this commit

--
Sebastien Marie