Re: crash on urtwn removal
On 28/04/15(Tue) 13:15, Remi Locherer wrote: On Tue, Apr 28, 2015 at 10:17:16AM +0200, Martin Pieuchot wrote: On 27/04/15(Mon) 22:45, Remi Locherer wrote: On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote: This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: It seams to be easier to trigger this bug than the first one. What did you do to trigger it? While lynx was loading a website (via ipv6) I unplugged the urtwn device. The panic doesn't happen every time I try this. Maybe every 10th time. What I don't understand is that your trace shows that the process triggering the panic is doing an ioctl(2). In your ps output you have ifconfig running, but since you did ps on CPU3 I can only guess that that's the program who triggered the panic. So did you unplugged the device while doing anything with ifconfig? Where you able to use your device before the kernel crashes? Did you see anything in the dmesg before? I understand that you unplugged the device, but the important parts are what did you do *before* or *during* 8)
Re: crash on urtwn removal
On Tue, Apr 28, 2015 at 01:41:29PM +0200, Martin Pieuchot wrote: On 28/04/15(Tue) 13:15, Remi Locherer wrote: On Tue, Apr 28, 2015 at 10:17:16AM +0200, Martin Pieuchot wrote: On 27/04/15(Mon) 22:45, Remi Locherer wrote: On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote: This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: It seams to be easier to trigger this bug than the first one. What did you do to trigger it? While lynx was loading a website (via ipv6) I unplugged the urtwn device. The panic doesn't happen every time I try this. Maybe every 10th time. What I don't understand is that your trace shows that the process triggering the panic is doing an ioctl(2). In your ps output you have ifconfig running, but since you did ps on CPU3 I can only guess that that's the program who triggered the panic. Should I run ps on all CPUs next time? Or better on the cpu that is active at the beginning of the ddb session? So did you unplugged the device while doing anything with ifconfig? I'm using Bob's wifinwid script which does this: while :; do (ifconfig $1 | grep 'status: active' /dev/null) if [ $? -ne 0 ]; then [...] sleep 2 fi done Full script: http://foad2.obtuse.com/beck/wifinwid Where you able to use your device before the kernel crashes? Did you see anything in the dmesg before? Before I unplugged the urtwn I could use it normaly. I didn't noticed any messages on the console but didn't check dmesg befor I unplugged urtwn. I understand that you unplugged the device, but the important parts are what did you do *before* or *during* 8) Besides lynx in the foreground on the console the wifinwid script and offlineimap (python) were active in the background.
Re: crash on urtwn removal
On Tue, Apr 28, 2015 at 10:17:16AM +0200, Martin Pieuchot wrote: On 27/04/15(Mon) 22:45, Remi Locherer wrote: On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote: This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: It seams to be easier to trigger this bug than the first one. What did you do to trigger it? While lynx was loading a website (via ipv6) I unplugged the urtwn device. The panic doesn't happen every time I try this. Maybe every 10th time. ddb trace and ps output: https://relo.ch/urtwncrash_trace_part1.jpg https://relo.ch/urtwncrash_trace_part2.jpg https://relo.ch/urtwncrash_ps_part1.jpg https://relo.ch/urtwncrash_ps_part2.jpg Unfortunately boot reboot in ddb did not work so I had to upload the photos. But at least one line number appeared in the output so now I know how to build a kernel with debug symbols ;) Index: usbdi.c === RCS file: /cvs/src/sys/dev/usb/usbdi.c,v retrieving revision 1.81 diff -u -p -r1.81 usbdi.c --- usbdi.c 14 Mar 2015 03:38:50 - 1.81 +++ usbdi.c 27 Apr 2015 13:08:33 - @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n, pipe, pipe-running, xfer-timeout)); #ifdef DIAGNOSTIC - if (xfer-busy_free != XFER_FREE) { - printf(%s: xfer=%p not free\n, __func__, xfer); - return (USBD_INVAL); - } + if (xfer-busy_free != XFER_FREE) + panic(%s: xfer=%p not free\n, __func__, xfer); xfer-busy_free = XFER_ONQU; #endif s = splusb();
Re: crash on urtwn removal
On 27/04/15(Mon) 22:45, Remi Locherer wrote: On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote: This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: It seams to be easier to trigger this bug than the first one. What did you do to trigger it? ddb trace and ps output: https://relo.ch/urtwncrash_trace_part1.jpg https://relo.ch/urtwncrash_trace_part2.jpg https://relo.ch/urtwncrash_ps_part1.jpg https://relo.ch/urtwncrash_ps_part2.jpg Unfortunately boot reboot in ddb did not work so I had to upload the photos. But at least one line number appeared in the output so now I know how to build a kernel with debug symbols ;) Index: usbdi.c === RCS file: /cvs/src/sys/dev/usb/usbdi.c,v retrieving revision 1.81 diff -u -p -r1.81 usbdi.c --- usbdi.c 14 Mar 2015 03:38:50 - 1.81 +++ usbdi.c 27 Apr 2015 13:08:33 - @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n, pipe, pipe-running, xfer-timeout)); #ifdef DIAGNOSTIC - if (xfer-busy_free != XFER_FREE) { - printf(%s: xfer=%p not free\n, __func__, xfer); - return (USBD_INVAL); - } + if (xfer-busy_free != XFER_FREE) + panic(%s: xfer=%p not free\n, __func__, xfer); xfer-busy_free = XFER_ONQU; #endif s = splusb();
Re: crash on urtwn removal
On Mon, Apr 27, 2015 at 03:13:06PM +0200, Martin Pieuchot wrote: On 27/04/15(Mon) 13:55, Remi Locherer wrote: On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. Great ! You found another bug :) Hmmm [...] urtwn0 detached urtwn0 at uhub2 port 2 Realtek 802.11n WLAN Adapter rev 2.00/2.00 addr 3 urtwn0: MAC/BB RTL8188CUS, RF 6052 1T1R, address 80:1f:02:da:13:e3 usb_transfer_complete: actlen len 4294951948 4 usb_transfer_complete: actlen len 4294951180 4 usb_insert_transfer: xfer=0xff011a23e118 not free That's the real problem, some code path tries to reuse a xfer that's still on the ring. kernel: protection fault trap, code=0 Stopped at ehci_check_intr+0xb:movq0x10(%rax),%rax ddb{0} ehci_check_intr() at ehci_check_intr+0xb ehci_softintr() at ehci_softintr+0x3f softintr_dispatch() at softintr_dispatch+0x7f Xsoftnet() at Xsoftnet+0x2d This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: It seams to be easier to trigger this bug than the first one. ddb trace and ps output: https://relo.ch/urtwncrash_trace_part1.jpg https://relo.ch/urtwncrash_trace_part2.jpg https://relo.ch/urtwncrash_ps_part1.jpg https://relo.ch/urtwncrash_ps_part2.jpg Unfortunately boot reboot in ddb did not work so I had to upload the photos. But at least one line number appeared in the output so now I know how to build a kernel with debug symbols ;) Index: usbdi.c === RCS file: /cvs/src/sys/dev/usb/usbdi.c,v retrieving revision 1.81 diff -u -p -r1.81 usbdi.c --- usbdi.c 14 Mar 2015 03:38:50 - 1.81 +++ usbdi.c 27 Apr 2015 13:08:33 - @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n, pipe, pipe-running, xfer-timeout)); #ifdef DIAGNOSTIC - if (xfer-busy_free != XFER_FREE) { - printf(%s: xfer=%p not free\n, __func__, xfer); - return (USBD_INVAL); - } + if (xfer-busy_free != XFER_FREE) + panic(%s: xfer=%p not free\n, __func__, xfer); xfer-busy_free = XFER_ONQU; #endif s = splusb();
Re: crash on urtwn removal
On Mon, Apr 27, 2015 at 01:55:30PM +0200, Remi Locherer wrote: I rebuilt a kernel with /etc/mk.conf containing this: SUDO=/usr/bin/sudo DEBUG=-g #DEBUGLIB=-g But there are no line numbers in the ddb output (see end of dmesg). Is there another why to build a kernel with debug symbols? It doesn't work that way. To get line numbers, create a new kernel config file: cp GENERIC URTWN_DEBUG add this line to the URTWN_DEBUG config file: makeoptions DEBUG=-g run 'config URTWN_DEBUG; cd ../compile/URTWN_DEBUG; make' and boot the resulting kernel. Something is going wrong in the IPv6 code. This problem is not specific to urtwn(4). The relevant part of the trace is: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. This time no nd6_* or in6_* are visible in the trace. I updated kernel sources short before building the kernel. The suspected cause of your problem is a use-after-free. So panic details might vary since they depend on when and where the freed-and-subsequently-modified chunk of memory ends up being allocated again.
Re: crash on urtwn removal
On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: Since a couple of month every now and then I'm getting a urtwn0: device timeout. Usually I deal with it by unlugging the urtwn device and plug it in again, then run sh /etc/netstart urtwn0. Is ifconfig down/up (or just 'sh /etc/netstart urtwn0') not enough? That is essentially what the driver should be doing in this situation. However, USB wifi drivers which need to load firmware currently don't reset properly when transmit times out -- they simply do nothing instead of resetting the interface. Sometimes the systems crashes when I unplug the urtwn device. Photos from ddb after such a crash: https://relo.ch/crash_on_urtwn_removal-trace_part1.jpg https://relo.ch/crash_on_urtwn_removal-trace_part2.jpg https://relo.ch/crash_on_urtwn_removal-ps_part1.jpg https://relo.ch/crash_on_urtwn_removal-ps_part2.jpg https://relo.ch/crash_on_urtwn_removal-ps_part3.jpg Can you please recompile your kernel with makeoptions DEBUG=-g in the config when reproducing crashes? That's slightly more convenient for us to debug since ddb output will contain line numbers. I rebuilt a kernel with /etc/mk.conf containing this: SUDO=/usr/bin/sudo DEBUG=-g #DEBUGLIB=-g But there are no line numbers in the ddb output (see end of dmesg). Is there another why to build a kernel with debug symbols? Something is going wrong in the IPv6 code. This problem is not specific to urtwn(4). The relevant part of the trace is: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. This time no nd6_* or in6_* are visible in the trace. I updated kernel sources short before building the kernel. OpenBSD 5.7-current (GENERIC.MP) #9: Sun Apr 26 21:27:24 CEST 2015 r...@mistral.relo.ch:/usr/src/sys/arch/amd64/compile/GENERIC.MP real mem = 3881746432 (3701MB) avail mem = 3760246784 (3586MB) mpath0 at root scsibus0 at mpath0: 256 targets mainbus0 at root bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xe0970 (63 entries) bios0: vendor Phoenix Technologies Ltd. version P00ACX date 04/26/2013 bios0: SAMSUNG ELECTRONICS CO., LTD. 900X3F acpi0 at bios0: rev 2 acpi0: sleep states S0 S3 S4 S5 acpi0: tables DSDT FACP SLIC SSDT ASF! HPET APIC MCFG FPDT SSDT SSDT UEFI MSDM UEFI UEFI acpi0: wakeup devices P0P1(S4) GLAN(S4) XHC_(S4) HDEF(S4) PXSX(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) PXSX(S4) RP06(S4) PXSX(S4) RP07(S4) [...] acpitimer0 at acpi0: 3579545 Hz, 24 bits acpihpet0 at acpi0: 14318179 Hz acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.44 MHz cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges cpu0: apic clock running at 99MHz cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE cpu1 at mainbus0: apid 1 (application processor) cpu1: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu1: 256KB 64b/line 8-way L2 cache cpu1: smt 1, core 0, package 0 cpu2 at mainbus0: apid 2 (application processor) cpu2: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu2: 256KB 64b/line 8-way L2 cache cpu2: smt 0, core 1, package 0 cpu3 at mainbus0: apid 3 (application processor) cpu3: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz, 1696.15 MHz cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,XSAVE,AVX,F16C,RDRAND,NXE,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS cpu3: 256KB 64b/line 8-way L2 cache cpu3: smt 1, core 1, package 0 ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins acpimcfg0 at acpi0 addr 0xf800, bus 0-63 acpiprt0 at acpi0: bus 0 (PCI0) acpiprt1 at acpi0: bus -1 (P0P1) acpiprt2 at acpi0: bus 1
Re: crash on urtwn removal
On 27/04/15(Mon) 13:55, Remi Locherer wrote: On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: With a series of removals and insertions of the urtwn device I could provoke another panic. I didn't wait for a timeout message to apear. Great ! You found another bug :) [...] urtwn0 detached urtwn0 at uhub2 port 2 Realtek 802.11n WLAN Adapter rev 2.00/2.00 addr 3 urtwn0: MAC/BB RTL8188CUS, RF 6052 1T1R, address 80:1f:02:da:13:e3 usb_transfer_complete: actlen len 4294951948 4 usb_transfer_complete: actlen len 4294951180 4 usb_insert_transfer: xfer=0xff011a23e118 not free That's the real problem, some code path tries to reuse a xfer that's still on the ring. kernel: protection fault trap, code=0 Stopped atehci_check_intr+0xb:movq0x10(%rax),%rax ddb{0} ehci_check_intr() at ehci_check_intr+0xb ehci_softintr() at ehci_softintr+0x3f softintr_dispatch() at softintr_dispatch+0x7f Xsoftnet() at Xsoftnet+0x2d This trace tells use that the pipe is no longer valid, which means that the device has been removed but a xfer is still referenced by ehci. The output of ps could help understand what's going wrong in such case. If you can, please get it next time :) If you think you can reproduce this bug too, here's a diff that would get us a useful trace: Index: usbdi.c === RCS file: /cvs/src/sys/dev/usb/usbdi.c,v retrieving revision 1.81 diff -u -p -r1.81 usbdi.c --- usbdi.c 14 Mar 2015 03:38:50 - 1.81 +++ usbdi.c 27 Apr 2015 13:08:33 - @@ -824,10 +824,8 @@ usb_insert_transfer(struct usbd_xfer *xf DPRINTFN(5,(usb_insert_transfer: pipe=%p running=%d timeout=%d\n, pipe, pipe-running, xfer-timeout)); #ifdef DIAGNOSTIC - if (xfer-busy_free != XFER_FREE) { - printf(%s: xfer=%p not free\n, __func__, xfer); - return (USBD_INVAL); - } + if (xfer-busy_free != XFER_FREE) + panic(%s: xfer=%p not free\n, __func__, xfer); xfer-busy_free = XFER_ONQU; #endif s = splusb();
Re: crash on urtwn removal
On Sun, Apr 26, 2015 at 10:07:19AM +0200, Stefan Sperling wrote: On Sun, Apr 26, 2015 at 09:36:06AM +0200, Remi Locherer wrote: Since a couple of month every now and then I'm getting a urtwn0: device timeout. Usually I deal with it by unlugging the urtwn device and plug it in again, then run sh /etc/netstart urtwn0. Is ifconfig down/up (or just 'sh /etc/netstart urtwn0') not enough? No, it's not enough. After the first timeout message on the console I executed sh /etc/netstart urtwn0. The result was: Console log for mistral.relo.ch urtwn0: device timeout usb_transfer_complete: actlen len 4294951684 4 usb_transfer_complete: actlen len 4294951044 4 usb_transfer_complete: actlen len 4294951300 4 usb_transfer_complete: actlen len 4294951308 4 urtwn0: device timeout usb_transfer_complete: actlen len 4294951300 4 urtwn0: device timeout usb_transfer_complete: actlen len 4294951308 4 urtwn0: device timeout usb_transfer_complete: actlen len 4294951948 4 usb_transfer_complete: actlen len 4294951684 4 urtwn0: device timeout usb_transfer_complete: actlen len 4294951304 2 urtwn0: device timeout usb_transfer_complete: actlen len 4294951300 4 urtwn0: device timeout usb_transfer_complete: actlen len 4294951682 2 usb_transfer_complete: actlen len 4294951300 4 urtwn0: device timeout usb_transfer_complete: actlen len 4294951300 4 usb_transfer_complete: actlen len 4294951172 4 urtwn0: device timeout My /etc/hostname.urtwn0: nwid tsunami wpakey XXX dhcp inet6 autoconf !pkill -9 -lf wifinwid !/etc/wifinwid \$if The script is the one from here: http://undeadly.org/cgi?action=articlesid=20130208141628 That is essentially what the driver should be doing in this situation. However, USB wifi drivers which need to load firmware currently don't reset properly when transmit times out -- they simply do nothing instead of resetting the interface. Sometimes the systems crashes when I unplug the urtwn device. Photos from ddb after such a crash: https://relo.ch/crash_on_urtwn_removal-trace_part1.jpg https://relo.ch/crash_on_urtwn_removal-trace_part2.jpg https://relo.ch/crash_on_urtwn_removal-ps_part1.jpg https://relo.ch/crash_on_urtwn_removal-ps_part2.jpg https://relo.ch/crash_on_urtwn_removal-ps_part3.jpg Can you please recompile your kernel with makeoptions DEBUG=-g in the config when reproducing crashes? That's slightly more convenient for us to debug since ddb output will contain line numbers. I'll do that and report back. Something is going wrong in the IPv6 code. This problem is not specific to urtwn(4). The relevant part of the trace is: Stopped at find_pfxlist_reachable_router+0x15: movq 0x10(%eax),%rdi find_pfxlist_reachable_router() pfxlist_onlink_check() prelist_remove() nd6_purge() in6_ifdetach() if_detach() urtwn_detach() config_detach() usbd_detach() uhub_explore() uhub_explore() usb_explore() usb_task_thread()