Re: crash on urtwn removal

2015-04-27 Thread Remi Locherer
On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote:
 On 27/04/15(Mon) 13:55, Remi Locherer wrote:
  On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote:
   On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote:
  With a series of removals and insertions of the urtwn device I could
  provoke another panic. I didn't wait for a timeout message to apear.
 
 Great !  You found another bug :)

Hmmm 
 
  [...]
  urtwn0 detached
  urtwn0 at uhub2 port 2 Realtek 802.11n WLAN Adapter rev 2.00/2.00 addr 3
  urtwn0: MAC/BB RTL8188CUS, RF 6052 1T1R, address 80:1f:02:da:13:e3
  usb_transfer_complete: actlen  len 4294951948  4
  usb_transfer_complete: actlen  len 4294951180  4
  usb_insert_transfer: xfer=0xff011a23e118 not free
 
 That's the real problem, some code path tries to reuse a xfer that's
 still on the ring.
 
  kernel: protection fault trap, code=0
  Stopped at  ehci_check_intr+0xb:movq0x10(%rax),%rax
  ddb{0} ehci_check_intr() at ehci_check_intr+0xb
  ehci_softintr() at ehci_softintr+0x3f
  softintr_dispatch() at softintr_dispatch+0x7f
  Xsoftnet() at Xsoftnet+0x2d
 
 This trace tells use that the pipe is no longer valid, which means that
 the device has been removed but a xfer is still referenced by ehci.
 
 The output of ps could help understand what's going wrong in such
 case.  If you can, please get it next time :)
 
 If you think you can reproduce this bug too, here's a diff that would
 get us a useful trace:

It seams to be easier to trigger this bug than the first one.
ddb trace and ps output:

https://relo.ch/urtwncrash_trace_part1.jpg
https://relo.ch/urtwncrash_trace_part2.jpg
https://relo.ch/urtwncrash_ps_part1.jpg
https://relo.ch/urtwncrash_ps_part2.jpg

Unfortunately boot reboot in ddb did not work so I had to upload 
the photos. But at least one line number appeared in the output so now I
know how to build a kernel with debug symbols ;)


 
 Index: usbdi.c
 ===
 RCS file: /cvs/src/sys/dev/usb/usbdi.c,v
 retrieving revision 1.81
 diff -u -p -r1.81 usbdi.c
 --- usbdi.c   14 Mar 2015 03:38:50 -  1.81
 +++ usbdi.c   27 Apr 2015 13:08:33 -
 @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf
   DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n,
   pipe, pipe-running, xfer-timeout));
  #ifdef DIAGNOSTIC
 - if (xfer-busy_free != XFER_FREE) {
 - printf(%s: xfer=%p not free\n, __func__, xfer);
 - return (USBD_INVAL);
 - }
 + if (xfer-busy_free != XFER_FREE)
 + panic(%s: xfer=%p not free\n, __func__, xfer);
   xfer-busy_free = XFER_ONQU;
  #endif
   s = splusb();
 



Re: crash on urtwn removal

2015-04-27 Thread Stefan Sperling
On Mon, Apr 27, 2015 at 01:55:30PM +0200, Remi Locherer wrote:
 I rebuilt a kernel with /etc/mk.conf containing this:
 
 SUDO=/usr/bin/sudo
 DEBUG=-g
 #DEBUGLIB=-g
 
 But there are no line numbers in the ddb output (see end of dmesg). Is there
 another why to build a kernel with debug symbols?

It doesn't work that way.

To get line numbers, create a new kernel config file:

  cp GENERIC URTWN_DEBUG

add this line to the URTWN_DEBUG config file:

  makeoptions DEBUG=-g

run 'config URTWN_DEBUG; cd ../compile/URTWN_DEBUG; make'
and boot the resulting kernel.

  Something is going wrong in the IPv6 code. This problem is not specific
  to urtwn(4). The relevant part of the trace is:
 
 With a series of removals and insertions of the urtwn device I could
 provoke another panic. I didn't wait for a timeout message to apear.
 
 This time no nd6_* or in6_* are visible in the trace.
 
 I updated kernel sources short before building the kernel.

The suspected cause of your problem is a use-after-free.
So panic details might vary since they depend on when and where
the freed-and-subsequently-modified chunk of memory ends up being
allocated again.



Re: crash on urtwn removal

2015-04-27 Thread Remi Locherer
On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote:
 On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote:
  Since a couple of month every now and then I'm getting a 
  urtwn0: device timeout. Usually I deal with it by unlugging the 
  urtwn device and plug it in again, then run sh /etc/netstart urtwn0.
 
 Is ifconfig down/up (or just 'sh /etc/netstart urtwn0') not enough?
 That is essentially what the driver should be doing in this situation.
 However, USB wifi drivers which need to load firmware currently don't
 reset properly when transmit times out -- they simply do nothing instead
 of resetting the interface.
 
 
  Sometimes the systems crashes when I unplug the urtwn device. 
  
  Photos from ddb after such a crash:
  
  https://relo.ch/crash_on_urtwn_removal-trace_part1.jpg
  https://relo.ch/crash_on_urtwn_removal-trace_part2.jpg
  https://relo.ch/crash_on_urtwn_removal-ps_part1.jpg
  https://relo.ch/crash_on_urtwn_removal-ps_part2.jpg
  https://relo.ch/crash_on_urtwn_removal-ps_part3.jpg
 
 Can you please recompile your kernel with
   makeoptions DEBUG=-g
 in the config when reproducing crashes? That's slightly more convenient
 for us to debug since ddb output will contain line numbers.

I rebuilt a kernel with /etc/mk.conf containing this:

SUDO=/usr/bin/sudo
DEBUG=-g
#DEBUGLIB=-g

But there are no line numbers in the ddb output (see end of dmesg). Is there
another why to build a kernel with debug symbols?
 
 Something is going wrong in the IPv6 code. This problem is not specific
 to urtwn(4). The relevant part of the trace is:

With a series of removals and insertions of the urtwn device I could
provoke another panic. I didn't wait for a timeout message to apear.

This time no nd6_* or in6_* are visible in the trace.

I updated kernel sources short before building the kernel.


OpenBSD 5.7-current (GENERIC.MP) #9: Sun Apr 26 21:27:24 CEST 2015
r...@mistral.relo.ch:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 3881746432 (3701MB)
avail mem = 3760246784 (3586MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xe0970 (63 entries)
bios0: vendor Phoenix Technologies Ltd. version P00ACX date 04/26/2013
bios0: SAMSUNG ELECTRONICS CO., LTD. 900X3F
acpi0 at bios0: rev 2
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SLIC SSDT ASF! HPET APIC MCFG FPDT SSDT SSDT UEFI MSDM 
UEFI UEFI
acpi0: wakeup devices P0P1(S4) GLAN(S4) XHC_(S4) HDEF(S4) PXSX(S4) RP01(S4) 
PXSX(S4) RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) PXSX(S4) RP06(S4) 
PXSX(S4) RP07(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 14318179 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.44 MHz
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE
cpu1 at mainbus0: apid 1 (application processor)
cpu1: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 1, core 0, package 0
cpu2 at mainbus0: apid 2 (application processor)
cpu2: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz
cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 1, package 0
cpu3 at mainbus0: apid 3 (application processor)
cpu3: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz
cpu3: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 1, core 1, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins
acpimcfg0 at acpi0 addr 0xf800, bus 0-63
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus -1 (P0P1)
acpiprt2 at acpi0: bus 1 

Re: crash on urtwn removal

2015-04-27 Thread Martin Pieuchot
On 27/04/15(Mon) 13:55, Remi Locherer wrote:
 On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote:
  On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote:
 With a series of removals and insertions of the urtwn device I could
 provoke another panic. I didn't wait for a timeout message to apear.

Great !  You found another bug :)

 [...]
 urtwn0 detached
 urtwn0 at uhub2 port 2 Realtek 802.11n WLAN Adapter rev 2.00/2.00 addr 3
 urtwn0: MAC/BB RTL8188CUS, RF 6052 1T1R, address 80:1f:02:da:13:e3
 usb_transfer_complete: actlen  len 4294951948  4
 usb_transfer_complete: actlen  len 4294951180  4
 usb_insert_transfer: xfer=0xff011a23e118 not free

That's the real problem, some code path tries to reuse a xfer that's
still on the ring.

 kernel: protection fault trap, code=0
 Stopped atehci_check_intr+0xb:movq0x10(%rax),%rax
 ddb{0} ehci_check_intr() at ehci_check_intr+0xb
 ehci_softintr() at ehci_softintr+0x3f
 softintr_dispatch() at softintr_dispatch+0x7f
 Xsoftnet() at Xsoftnet+0x2d

This trace tells use that the pipe is no longer valid, which means that
the device has been removed but a xfer is still referenced by ehci.

The output of ps could help understand what's going wrong in such
case.  If you can, please get it next time :)

If you think you can reproduce this bug too, here's a diff that would
get us a useful trace:

Index: usbdi.c
===
RCS file: /cvs/src/sys/dev/usb/usbdi.c,v
retrieving revision 1.81
diff -u -p -r1.81 usbdi.c
--- usbdi.c 14 Mar 2015 03:38:50 -  1.81
+++ usbdi.c 27 Apr 2015 13:08:33 -
@@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf
DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n,
pipe, pipe-running, xfer-timeout));
 #ifdef DIAGNOSTIC
-   if (xfer-busy_free != XFER_FREE) {
-   printf(%s: xfer=%p not free\n, __func__, xfer);
-   return (USBD_INVAL);
-   }
+   if (xfer-busy_free != XFER_FREE)
+   panic(%s: xfer=%p not free\n, __func__, xfer);
xfer-busy_free = XFER_ONQU;
 #endif
s = splusb();