6.9amd64 athn(4) AR7010+AR9280 & AR9271 USB WiFi don't work in AP mode (frames corruption)

2021-05-05 Thread Martin
Hi list,

I have used an AR9280 PCIe card in AP mode for about two years in a Lenovo
x230 laptop (whitelist removed). It works perfectly.

After moving to a modern laptop, I can no longer install a PCIe card.
I tried the USB dongles I have, which are based on:
1. AR9280+AR7010
2. AR9271

After attaching the AR9271 USB dongle and restarting the interface with
'/etc/netstart athn0', it broadcasts its BSSID and a client can get an IP
address from it. But no data flows between the USB AP and any connected
client. Most frames are dropped or corrupted over the air, even with the
USB AP and the client on the same table.

ICMP between the AP and a client shows huge delays in both directions, and
most packets are dropped. The delay is typically about 3.5ms but sometimes
337.3ms or more.

The same issue occurs with both USB dongles on the x230 laptop (the original
PCIe card was removed; nothing else changed in the PF configuration or
/etc/hostname.athn0).

I tried three computers with OpenBSD 6.9amd64 GENERIC installed. The behaviour 
is the same with USB AR9280+AR7010 and AR9271 dongles.

Once I put the PCIe AR9280 card back, everything works like a charm.

It looks like a bug in the athn(4) driver related to USB-based devices.


Martin



intel(4): edp_panel_vdd_on calls task_del(9) with NULL taskq

2021-05-05 Thread Scott Cheloha
Hi,

On a hunch I added additional parameter checks to task_add(9) and
task_del(9) and caught intel(4) doing something strange.

The patch is straightforward: check that the taskq pointer tq is not
NULL.  In the current code we return early if a flag is set or cleared
in the task w, in which case we don't catch bogus taskq inputs, which
is why the machine boots fine without the extra checks.

The patch:

Index: kern_task.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_task.c,v
retrieving revision 1.31
diff -u -p -r1.31 kern_task.c
--- kern_task.c 1 Aug 2020 08:40:20 -   1.31
+++ kern_task.c 5 May 2021 21:29:08 -
@@ -354,6 +354,9 @@ task_add(struct taskq *tq, struct task *
 {
 	int rv = 0;
 
+	if (tq == NULL)
+		panic("%s: NULL taskq", __func__);
+
 	if (ISSET(w->t_flags, TASK_ONQUEUE))
 		return (0);
 
@@ -378,6 +381,9 @@ int
 task_del(struct taskq *tq, struct task *w)
 {
 	int rv = 0;
+
+	if (tq == NULL)
+		panic("%s: NULL taskq", __func__);
 
 	if (!ISSET(w->t_flags, TASK_ONQUEUE))
 		return (0);
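
To illustrate why the unpatched code never notices a NULL taskq, here is
a minimal userland sketch. The struct layouts and the model_* names are
hypothetical stand-ins, not the kernel's definitions, and the checked
variant returns -1 where the kernel patch would panic.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_ONQUEUE 0x1

/* Stand-ins for illustration only; not the kernel structures. */
struct taskq { int tq_len; };
struct task  { unsigned int t_flags; };

/*
 * Same shape as the unpatched task_del(): the flag test comes first,
 * so tq is never touched when the task is not on a queue.
 */
static int
model_task_del(struct taskq *tq, struct task *w)
{
	assert(w != NULL);

	if (!(w->t_flags & TASK_ONQUEUE))
		return (0);		/* a NULL tq goes unnoticed here */

	tq->tq_len--;			/* only this path would fault on NULL */
	w->t_flags &= ~TASK_ONQUEUE;
	return (1);
}

/* The same function with the argument check moved to the front. */
static int
model_task_del_checked(struct taskq *tq, struct task *w)
{
	if (tq == NULL)
		return (-1);		/* the kernel patch panics instead */

	return (model_task_del(tq, w));
}
```

With a task whose TASK_ONQUEUE flag is clear, model_task_del(NULL, &w)
returns 0 without complaint, while model_task_del_checked(NULL, &w)
reports the bogus argument.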

And here is the panic on my machine.  I had to reconstruct it from
OCR, the machine has no serial port, sorry if there are typos.

panic: task_del: NULL taskq
Stopped at db_enter+0xa: popq %rbp
    TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
 513524  44883    0  0x14000   0x200    3  update
 352928  82402    0  0x14000   0x200    2  cleaner
 382195  66035    0  0x14000   0x200    1  reaper
...
db_enter() at db_enter+0xa
panic(81db24fb) at panic+0x12f
task_del(0,810633e0) at task_del+0xa8
edp_panel_vdd_on(81063128) at edp_panel_vdd_on+0x6a
intel_dp_aux_xfer(81063128, 82512a20,4, 82512400,2,0) at intel_dp_aux_xfer+0x18b
intel_dp_aux_transfer(810631e8, 82512a88) at intel_dp_aux_transfer+0x183
drm_dp_dpcd_access(810631e8,9,0,8106313a, 1) at drm_dp_dpcd_access+0xa9
drm_dp_dpcd_read(810631e8,0,8106313a, f) at drm_dp_dpcd_read+0x61
intel_dp_read_dpcd(81063128) at intel_dp_read_dpcd+0x45
intel_dp_init_connector(81063000, 81064000) at intel_dp_init_connector+0x988
intel_ddi_init(80272000,0) at intel_ddi_init+0x454
intel_modeset_init(80272000) at intel_modeset_init+0x1c9f
i915_driver_probe(80272000, 82052f98) at i915_driver_probe+0x7df
inteldrm_attachhook(80272000) at inteldrm_attachhook+0x46
end trace frame: 0x82512700, count: 0

From the backtrace, I gather the following:

edp_panel_vdd_on() calls cancel_delayed_work(), which is just a macro
that calls task_del().  And for whatever reason the taskq passed to
task_del() is NULL.  Maybe there is a missing INIT_DELAYED_WORK() call
somewhere prior to this point?

And yes, I know, this isn't a bug in the code as-is, but I'm putting
this on bugs@ because I'm pretty certain the taskq shouldn't be NULL.
It "works" without the additional checks, yes, but something is off.

My dmesg is attached.

Happy to provide more detail, reproduce it, etc.

CC dlg@: Maybe there is a discussion to be had about always entering
the taskq mutex during task_add() and task_del() to catch this problem
in the future?

-Scott

OpenBSD 6.9-current (GENERIC.MP) #0: Tue May  4 14:43:41 CDT 2021
ssc@jetsam.local:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 16895528960 (16112MB)
avail mem = 16368230400 (15609MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.0 @ 0x9f03b000 (63 entries)
bios0: vendor LENOVO version "N23ET59W (1.34 )" date 11/08/2018
bios0: LENOVO 20KHCTO1WW
acpi0 at bios0: ACPI 5.0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT SSDT TPM2 UEFI SSDT SSDT HPET APIC MCFG ECDT SSDT SSDT BOOT BATB SSDT SSDT SSDT LPIT WSMT SSDT SSDT SSDT DBGP DBG2 MSDM DMAR NHLT ASF! FPDT UEFI
acpi0: wakeup devices GLAN(S4) XHC_(S3) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) RP06(S4) PXSX(S4) RP07(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz, 1791.41 MHz, 06-8e-0a
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,TSXFA,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: 

Re: 6.9amd64 athn(4) AR7010+AR9280 & AR9271 USB WiFi don't work in AP mode (frames corruption)

2021-05-05 Thread Stefan Sperling
On Wed, May 05, 2021 at 06:14:52PM +, Martin wrote:
> Hi list,
> 
> I have used an AR9280 PCIe card in AP mode for about two years in a Lenovo
> x230 laptop (whitelist removed). It works perfectly.
> 
> After moving to a modern laptop, I can no longer install a PCIe card.
> I tried the USB dongles I have, which are based on:
> 1. AR9280+AR7010
> 2. AR9271
> 
> After attaching the AR9271 USB dongle and restarting the interface with
> '/etc/netstart athn0', it broadcasts its BSSID and a client can get an IP
> address from it. But no data flows between the USB AP and any connected
> client. Most frames are dropped or corrupted over the air, even with the
> USB AP and the client on the same table.
> 
> ICMP between the AP and a client shows huge delays in both directions, and
> most packets are dropped. The delay is typically about 3.5ms but sometimes
> 337.3ms or more.
> 
> The same issue occurs with both USB dongles on the x230 laptop (the original
> PCIe card was removed; nothing else changed in the PF configuration or
> /etc/hostname.athn0).
> 
> I tried three computers with OpenBSD 6.9amd64 GENERIC installed. The 
> behaviour is the same with USB AR9280+AR7010 and AR9271 dongles.
> 
> Once I put the PCIe AR9280 card back, everything works like a charm.
> 
> It looks like a bug in the athn(4) driver related to USB-based devices.
> 
> 
> Martin

What you are seeing does not match what I see with this device
plugged into an APU2 board:

athn1 at uhub0 port 4 configuration 1 interface 0 "ATHEROS UB91C" rev 2.00/1.08 addr 2
athn1: AR9271 rev 1 (1T1R), ROM rev 15, address 00:c0:ca:xx:xx:xx

tcpbench sending to an iwm client associated to athn1:

$ tcpbench 192.168.1.2
  elapsed_ms        bytes     mbps   bwidth
        1000      2195168   17.561  100.00%
Conn:   1 Mbps:   17.561 Peak Mbps:   17.561 Avg Mbps:   17.561
        2011      2264672   17.938  100.00%
Conn:   1 Mbps:   17.938 Peak Mbps:   17.938 Avg Mbps:   17.938
        3012      2292184   18.337  100.00%
Conn:   1 Mbps:   18.337 Peak Mbps:   18.337 Avg Mbps:   18.337
        4014      2274808   18.162  100.00%
Conn:   1 Mbps:   18.162 Peak Mbps:   18.337 Avg Mbps:   18.162
        5016      2231368   17.815  100.00%
Conn:   1 Mbps:   17.815 Peak Mbps:   18.337 Avg Mbps:   17.815
        6017      2228472   17.828  100.00%
Conn:   1 Mbps:   17.828 Peak Mbps:   18.337 Avg Mbps:   17.828
        7017      2200960   17.608  100.00%
Conn:   1 Mbps:   17.608 Peak Mbps:   18.337 Avg Mbps:   17.608
        8023      2293632   18.240  100.00%
Conn:   1 Mbps:   18.240 Peak Mbps:   18.337 Avg Mbps:   18.240
        9023      2274808   18.198  100.00%
Conn:   1 Mbps:   18.198 Peak Mbps:   18.337 Avg Mbps:   18.198
^C
--- 192.168.1.2 tcpbench statistics ---
21104600 bytes sent over 9.393 seconds
bandwidth min/avg/max/std-dev = 17.561/17.965/18.337/0.267 Mbps
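
The figures above can be cross-checked by hand. The helper below is a
hypothetical sketch (mbps() is not part of tcpbench) that converts a byte
count and an elapsed time in milliseconds into Mbit/s. It uses the rounded
elapsed times printed in the output, so it only approximates the values
tcpbench reports.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: throughput in Mbit/s for `bytes` transferred
 * in `ms` milliseconds.  bits / seconds / 10^6 == bytes * 8 / (ms * 1000).
 */
static double
mbps(uint64_t bytes, uint64_t ms)
{
	assert(ms > 0);
	return ((double)bytes * 8.0 / ((double)ms * 1000.0));
}
```

For example, mbps(2195168, 1000) comes to about 17.561, matching the first
sample, and mbps(21104600, 9393) comes to roughly 17.97, close to the
reported 17.965 average.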

I'd suggest you try your adapters on some different machines that
use different USB host controllers. There are known issues with
these devices on some controllers which I believe relate to (lack of?)
USB power management in the USB stack, though I don't know for certain.
In some known cases the firmware wouldn't even boot.

Cheers,
Stefan



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-05 Thread Alexandr Nedvedicky
Hello Olivier,

I've seen your report here

https://marc.info/?l=openbsd-bugs&m=161968896108810

your crash is slightly different. Sebastien was lucky enough
to trip the assert when the state key is dereferenced.

in your case we missed the assert and die on a uvm fault.

you both seem to be using rdr-to. your pf also seems to use a divert-to
rule. I suspect something goes wrong when we deal with traffic that
matches an rdr-to rule.


would you be so kind as to try the diff below on your AP box? The diff
removes my change to pf_state_key_link_reverse(), which is the primary
suspect at the moment.

I'm not able to trigger the panic on my notebook, nor on my
home router.

thank you for your help
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 23eebf4a274..12d05976f0b 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -7368,19 +7368,11 @@ pf_inp_unlink(struct inpcb *inp)
 void
 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev)
 {
-	struct pf_state_key *old_reverse;
-
-	old_reverse = atomic_cas_ptr(&sk->reverse, NULL, skrev);
-	if (old_reverse != NULL)
-		KASSERT(old_reverse == skrev);
-	else
-		pf_state_key_ref(skrev);
-
-	old_reverse = atomic_cas_ptr(&skrev->reverse, NULL, sk);
-	if (old_reverse != NULL)
-		KASSERT(old_reverse == sk);
-	else
-		pf_state_key_ref(sk);
+	/* Note that sk and skrev may be equal, then we refcount twice. */
+	KASSERT(sk->reverse == NULL);
+	KASSERT(skrev->reverse == NULL);
+	sk->reverse = pf_state_key_ref(skrev);
+	skrev->reverse = pf_state_key_ref(sk);
 }
 
 #if NPFLOG > 0
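
For readers unfamiliar with the pattern in the lines being backed out,
here is a userland sketch of "link with compare-and-swap, take a reference
only if this caller won". It uses C11 atomics rather than the kernel's
atomic_cas_ptr(), and struct key, key_ref(), key_link_one() and
key_link_reverse() are hypothetical names, not pf code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a state key; only the fields used here. */
struct key {
	_Atomic(struct key *) reverse;
	atomic_uint refcnt;
};

static struct key *
key_ref(struct key *k)
{
	atomic_fetch_add(&k->refcnt, 1);
	return (k);
}

/*
 * Point a->reverse at b only if it is still NULL.  A reference on b is
 * taken only by the caller whose compare-and-swap succeeded.
 * Returns 1 if this call installed the link, 0 if it already existed.
 */
static int
key_link_one(struct key *a, struct key *b)
{
	struct key *expected = NULL;

	if (atomic_compare_exchange_strong(&a->reverse, &expected, b)) {
		key_ref(b);
		return (1);
	}
	assert(expected == b);	/* already linked: must be the same peer */
	return (0);
}

static void
key_link_reverse(struct key *a, struct key *b)
{
	key_link_one(a, b);
	key_link_one(b, a);
}
```

In this model, calling key_link_reverse() twice on the same pair takes
each reference only once. When both arguments are the same key, the second
compare-and-swap fails, so only one reference is taken, whereas the
comment in the restored code says it refcounts twice in that case. Whether
that difference is related to the panic is not established here.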



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-05 Thread Olivier Cherrier
On Tue, May 04, 2021 at 12:26:17PM +0200, sema...@online.fr wrote:
> Date: Tue, 4 May 2021 12:26:17 +0200
> From: Sebastien Marie 
> To: Alexandr Nedvedicky 
> Cc: bugs@openbsd.org
> Subject: Re: [External] : pf_state_key_unref: panic: kernel diagnostic
>  assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c",
>  line 826
> 
> On Tue, May 04, 2021 at 11:47:55AM +0200, Alexandr Nedvedicky wrote:
> > Hello Sebastien,
> > 
> > On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote:
> > > Hi,
> > > 
> > > > Currently, I regularly (~1 per day) get a panic on an amd64 host (OpenBSD 
> > > > 6.9-current (GENERIC.MP) #492: Sat May  1 17:37:28 MDT 2021).
> 
> Previous working kernel was OpenBSD 6.9-current (GENERIC.MP) #477: Sat Apr 24 
> 16:08:13 MDT 2021
 

It looks similar to this one:
https://marc.info/?l=openbsd-bugs&m=161968896108810


-- 
Olivier Cherrier
Phone: +352691570680
mailto:o...@symacx.com



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-05 Thread Sebastien Marie
On Tue, May 04, 2021 at 04:50:06PM +0200, Sebastien Marie wrote:
> On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote:
> > Hello Sebastien,
> > 
> > thank you for additional info about previously working kernel.
> > 
> > it looks like your older kernel, which works, might be running without
> > my commit
> > 
> > revision 1.1116
> > date: 2021/04/27 09:38:29;  author: sashan;  state: Exp;\
> >  lines: +14 -6;  commitid: 3W1fRTkLb3ZlUanF;
> > pf_state_key_link_reverse() is prone to race on parallel forwarding
> > 
> > > we need to adjust assertions. at time we call pf_state_key_link_reverse()
> > > is state_key either linked to correct reverse peer or not linked at all.
> > > The pf_state_key_link_reverse() is being called as a reader on state_lock.
> > > There might be more packets, which try to update the state key.
> > 
> > OK bluhm@
> > 
> > > # cat /etc/pf.conf
> > > # See pf.conf(5) and /etc/examples/pf.conf
> > > # 
> > > # avoid skip on lo to get tryton redir
> > > #set skip on lo
> > > 
> > > > block return	# block stateless traffic
> > > > pass		# establish keep-state
> > > > 
> > > > # redir 80 -> 8000
> > > > pass in proto tcp to (self) port 80 rdr-to 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> > > 
> > 
> > > the rule above is also an interesting bit. this may trigger NAT-64.
> > I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
> > to try to reproduce the crash.
> 
> I will first try with another rule:
> 
>   pass in inet6 proto tcp to (self) port 80 rdr-to 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> 
> to avoid the possible NAT-64.

the assert still triggers.

> And next (if I still trigger the assert) I will rebuild a kernel and
> backout this specific commit.

trying with a kernel without this commit

-- 
Sebastien Marie