Re: crash on urtwn removal
On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote: On 27/04/15(Mon) 13:55, Remi Locherer wrote: On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. Great ! You found another bug :) Hmmm [...] urtwn0 detached urtwn0 at uhub2 port 2 Realtek 802.11n WLAN Adapter rev 2.00/2.00 addr 3 urtwn0: MAC/BB RTL8188CUS, RF 6052 1T1R, address 80:1f:02:da:13:e3 usb_transfer_complete: actlen len 4294951948 4 usb_transfer_complete: actlen len 4294951180 4 usb_insert_transfer: xfer=0xff011a23e118 not free That's the real problem, some code path tries to reuse a xfer that's still on the ring. kernel: protection fault trap, code=0 Stopped at ehci_check_intr+0xb:movq0x10(%rax),%rax ddb{0} ehci_check_intr() at ehci_check_intr+0xb ehci_softintr() at ehci_softintr+0x3f softintr_dispatch() at softintr_dispatch+0x7f Xsoftnet() at Xsoftnet+0x2d This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: It seams to be easier to trigger this bug than the first one. ddb trace and ps output: https://relo.ch/urtwncrash_trace_part1.jpg https://relo.ch/urtwncrash_trace_part2.jpg https://relo.ch/urtwncrash_ps_part1.jpg https://relo.ch/urtwncrash_ps_part2.jpg Unfortunately boot reboot in ddb did not work so I had to upload the photos. But at least one line number appeared in the output so now I know how to build a kernel with debug symbols ;) Index: usbdi.c === RCS file: /cvs/src/sys/dev/usb/usbdi.c,v retrieving revision 1.81 diff -u -p -r1.81 usbdi.c --- usbdi.c 14 Mar 2015 03:38:50 - 1.81 +++ usbdi.c 27 Apr 2015 13:08:33 - @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n, pipe, pipe-running, xfer-timeout)); #ifdef DIAGNOSTIC - if (xfer-busy_free != XFER_FREE) { - printf(%s: xfer=%p not free\n, __func__, xfer); - return (USBD_INVAL); - } + if (xfer-busy_free != XFER_FREE) + panic(%s: xfer=%p not free\n, __func__, xfer); xfer-busy_free = XFER_ONQU; #endif s = splusb();
Re: crash on urtwn removal
On Mon, Apr 27, 2015 at 01:55:30PM +0200, Remi Locherer wrote: I rebuilt a kernel with /etc/mk.conf containing this: SUDO=/usr/bin/sudo DEBUG=-g #DEBUGLIB=-g But there are no line numbers in the ddb output (see end of dmesg). Is there another why to build a kernel with debug symbols? It doesn't work that way. To get line numbers, create a new kernel config file: cp GENERIC URTWN_DEBUG add this line to the URTWN_DEBUG config file: makeoptions DEBUG=-g run 'config URTWN_DEBUG; cd ../compile/URTWN_DEBUG; make' and boot the resulting kernel. Something is going wrong in the IPv6 code. This problem is not specific to urtwn(4). The relevant part of the trace is: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. This time no nd6_* or in6_* are visible in the trace. I updated kernel sources short before building the kernel. The suspected cause of your problem is a use-after-free. So panic details might vary since they depend on when and where the freed-and-subsequently-modified chunk of memory ends up being allocated again.
Re: crash on urtwn removal
On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: Since a couple of month every now and then I'm getting a urtwn0: device timeout. Usually I deal with it by unlugging the urtwn device and plug it in again, then run sh /etc/netstart urtwn0. Is ifconfig down/up (or just 'sh /etc/netstart urtwn0') not enough? That is essentially what the driver should be doing in this situation. However, USB wifi drivers which need to load firmware currently don't reset properly when transmit times out -- they simply do nothing instead of resetting the interface. Sometimes the systems crashes when I unplug the urtwn device. Photos from ddb after such a crash: https://relo.ch/crash_on_urtwn_removal-trace_part1.jpg https://relo.ch/crash_on_urtwn_removal-trace_part2.jpg https://relo.ch/crash_on_urtwn_removal-ps_part1.jpg https://relo.ch/crash_on_urtwn_removal-ps_part2.jpg https://relo.ch/crash_on_urtwn_removal-ps_part3.jpg Can you please recompile your kernel with makeoptions DEBUG=-g in the config when reproducing crashes? That's slightly more convenient for us to debug since ddb output will contain line numbers. I rebuilt a kernel with /etc/mk.conf containing this: SUDO=/usr/bin/sudo DEBUG=-g #DEBUGLIB=-g But there are no line numbers in the ddb output (see end of dmesg). Is there another why to build a kernel with debug symbols? Something is going wrong in the IPv6 code. This problem is not specific to urtwn(4). The relevant part of the trace is: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. This time no nd6_* or in6_* are visible in the trace. I updated kernel sources short before building the kernel. OpenBSD 5.7-current (GENERIC.MP) #9: Sun Apr 26 21:27:24 CEST 2015 r...@mistral.relo.ch:/usr/src/sys/arch/amd64/compile/GENERIC.MP real mem = 3881746432 (3701MB) avail mem = 3760246784 (3586MB) mpath0 at root scsibus0 at mpath0: 256 targets mainbus0 at root bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xe0970 (63 entries) bios0: vendor Phoenix Technologies Ltd. version P00ACX date 04/26/2013 bios0: SAMSUNG ELECTRONICS CO., LTD. 900X3F acpi0 at bios0: rev 2 acpi0: sleep states S0 S3 S4 S5 acpi0: tables DSDT FACP SLIC SSDT ASF! HPET APIC MCFG FPDT SSDT SSDT UEFI MSDM UEFI UEFI acpi0: wakeup devices P0P1(S4) GLAN(S4) XHC_(S4) HDEF(S4) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) PXSX(S4) RP06(S4) PXSX(S4) RP07(S4) [...] acpitimer0 at acpi0: 3579545 Hz, 24 bits acpihpet0 at acpi0: 14318179 Hz acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.44 MHz cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges cpu0: apic clock running at 99MHz cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE cpu1 at mainbus0: apid 1 (application processor) cpu1: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu1: 256KB 64b/line 8-way L2 cache cpu1: smt 1, core 0, package 0 cpu2 at mainbus0: apid 2 (application processor) cpu2: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu2: 256KB 64b/line 8-way L2 cache cpu2: smt 0, core 1, package 0 cpu3 at mainbus0: apid 3 (application processor) cpu3: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu3: 256KB 64b/line 8-way L2 cache cpu3: smt 1, core 1, package 0 ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins acpimcfg0 at acpi0 addr 0xf800, bus 0-63 acpiprt0 at acpi0: bus 0 (PCI0) acpiprt1 at acpi0: bus -1 (P0P1) acpiprt2 at acpi0: bus 1
Re: crash on urtwn removal
On 27/04/15(Mon) 13:55, Remi Locherer wrote: On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. Great ! You found another bug :) [...] urtwn0 detached urtwn0 at uhub2 port 2 Realtek 802.11n WLAN Adapter rev 2.00/2.00 addr 3 urtwn0: MAC/BB RTL8188CUS, RF 6052 1T1R, address 80:1f:02:da:13:e3 usb_transfer_complete: actlen len 4294951948 4 usb_transfer_complete: actlen len 4294951180 4 usb_insert_transfer: xfer=0xff011a23e118 not free That's the real problem, some code path tries to reuse a xfer that's still on the ring. kernel: protection fault trap, code=0 Stopped atehci_check_intr+0xb:movq0x10(%rax),%rax ddb{0} ehci_check_intr() at ehci_check_intr+0xb ehci_softintr() at ehci_softintr+0x3f softintr_dispatch() at softintr_dispatch+0x7f Xsoftnet() at Xsoftnet+0x2d This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: Index: usbdi.c === RCS file: /cvs/src/sys/dev/usb/usbdi.c,v retrieving revision 1.81 diff -u -p -r1.81 usbdi.c --- usbdi.c 14 Mar 2015 03:38:50 - 1.81 +++ usbdi.c 27 Apr 2015 13:08:33 - @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n, pipe, pipe-running, xfer-timeout)); #ifdef DIAGNOSTIC - if (xfer-busy_free != XFER_FREE) { - printf(%s: xfer=%p not free\n, __func__, xfer); - return (USBD_INVAL); - } + if (xfer-busy_free != XFER_FREE) + panic(%s: xfer=%p not free\n, __func__, xfer); xfer-busy_free = XFER_ONQU; #endif s = splusb();