Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Fri, 10 Aug 2012, Tomas Sokorai wrote: > I used a very stupid/simplistic logic I already had for debugging, to > detect the condition: at the fourth (since its normally just one) pass > over the SF interrupt clear without being it cleared, I assume it is > stuck, and if ed_rm_list is the only one element long, I put it in > ed_to_check. > > This seems to work, but its very odd: in my first test, after the > first instance of the occurrence, every 5 second this condition kept > popping up, 6 times, until the communication died definitely with > -EPIPE: > > But neither the USB stack or app froze, just plug & unplug the device > and good to go again. > The second test, the "highlander ed" condition popped up, this time > twice, also with a 5 second between them, but no further problems for > quite a while after this, and no communication errors. > Then three more events 5 sec. apart, and -EPIPE again. > > It seems this condition comes in a "cluster", or the simplistic logic > of detection of this condition is not very well suited. I suspect what you're seeing is somehow indicative of a nasty underlying bug in the OHCI controller. Have you been able to try testing an OHCI-based system with a non-NVIDIA chipset? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Thu, Aug 9, 2012 at 10:29 PM, Alan Stern wrote: > In theory a similar quirk could be written to fix your problem. An > easy way to test this, if you can automatically detect the "hung" > condition, would be to set ohci->ed_to_check to the "unkillable" ed. > I used a very stupid/simplistic logic I already had for debugging, to detect the condition: at the fourth (since its normally just one) pass over the SF interrupt clear without being it cleared, I assume it is stuck, and if ed_rm_list is the only one element long, I put it in ed_to_check. This seems to work, but its very odd: in my first test, after the first instance of the occurrence, every 5 second this condition kept popping up, 6 times, until the communication died definitely with -EPIPE: But neither the USB stack or app froze, just plug & unplug the device and good to go again. The second test, the "highlander ed" condition popped up, this time twice, also with a 5 second between them, but no further problems for quite a while after this, and no communication errors. Then three more events 5 sec. apart, and -EPIPE again. It seems this condition comes in a "cluster", or the simplistic logic of detection of this condition is not very well suited. -- Tomas J. Sokorai Sch. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Thu, 9 Aug 2012, Tomas Sokorai wrote: > OK, more info gathered: > > The skip when "Highlander" ed is unkillable, is entered from the > "skip_ed" label jump, not from the tick_before() check. > ed->ed_next is indeed NULL at that point. That's what I was afraid of. This means your OHCI controller has a hardware bug; either it doesn't add the completed TD to the done list or else it doesn't tell the computer that it has done so. This is in fact the sort of bug that the ZFMicro quirk was written for, except that the ZFMicro quirk affects Interrupt endpoints and in your case a Bulk endpoint is affected. In theory a similar quirk could be written to fix your problem. An easy way to test this, if you can automatically detect the "hung" condition, would be to set ohci->ed_to_check to the "unkillable" ed. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Wed, Aug 8, 2012 at 11:45 AM, Alan Stern wrote: > The only way for it not be executed is if the "skip_ed" case occurs. > Therefore your next task is to determine what's going on. Does the > tick_before() test succeed? Does we follow the "goto skip_ed"? Or is > the list pointer messed up? > OK, more info gathered: The skip when "Highlander" ed is unkillable, is entered from the "skip_ed" label jump, not from the tick_before() check. ed->ed_next is indeed NULL at that point. -- Tomas J. Sokorai Sch. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Tue, 7 Aug 2012, Tomas Sokorai wrote: > On Tue, Aug 7, 2012 at 3:42 PM, Alan Stern wrote: > > > > I don't have time today to look further into this, but I'll get back to > > you later. > > No hurries, in fact I was gathering a bit more info about this behavior. > I dumped the ed_rm_list when it is hung, and we have only one element > that's unkillable: > > [ 1108.841482] ohci_hcd :00:04.0: ed_rm_list, ed 8800c781f140 > state 0x1 type bulk; next ed > [ 1108.841489] ohci_hcd :00:04.0: info 08405103 MAX=64 DQ SKIP > EP=2-IN DEV=3 > [ 1108.841494] ohci_hcd :00:04.0: tds: head c78261e0 DATA0 tail > c78261e0 (not listing) Yep; that's undoubtedly the one your program is trying to kill. > Also, there's only one non-empty ed_rm_list "pass" after the > finish_unlinks between SR intr disables, when not hung. As it should be. So the next step is to see what's happening inside finish_unlinks(). The function is a big loop over the entries in the ed_rm_list, but we're considered just with the first part of the loop. "last" is initially set to &ohci->ed_rm_list, and about line 945 we have: *last = ed->ed_next; If everything were working right, ed->ed_next would be NULL (no other ED's on the list). So either ed->ed_next points back to ed itself or else this line doesn't get executed at all. The only way for it not be executed is if the "skip_ed" case occurs. Therefore your next task is to determine what's going on. Does the tick_before() test succeed? Does we follow the "goto skip_ed"? Or is the list pointer messed up? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Tue, Aug 7, 2012 at 3:42 PM, Alan Stern wrote: > > I don't have time today to look further into this, but I'll get back to > you later. No hurries, in fact I was gathering a bit more info about this behavior. I dumped the ed_rm_list when it is hung, and we have only one element that's unkillable: [ 1108.841482] ohci_hcd :00:04.0: ed_rm_list, ed 8800c781f140 state 0x1 type bulk; next ed [ 1108.841489] ohci_hcd :00:04.0: info 08405103 MAX=64 DQ SKIP EP=2-IN DEV=3 [ 1108.841494] ohci_hcd :00:04.0: tds: head c78261e0 DATA0 tail c78261e0 (not listing) Also, there's only one non-empty ed_rm_list "pass" after the finish_unlinks between SR intr disables, when not hung. -- Tomas J. Sokorai Sch. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Mon, 6 Aug 2012, Tomas Sokorai wrote: > Ding, Ding, Ding!, we have a winner :-) > I did an ugly check: > > if (ohci->ed_rm_list) > finish_unlinks (ohci, ohci_frame_no(ohci)); > if ((ints & OHCI_INTR_SF) != 0 > && !ohci->ed_rm_list > && !ohci->ed_to_check > && ohci->rh_state == OHCI_RH_RUNNING) > ohci_writel (ohci, OHCI_INTR_SF, ®s->intrdisable); > else if ((ints & OHCI_INTR_SF) != 0 > && ohci->ed_rm_list > && !ohci->ed_to_check > && ohci->rh_state == OHCI_RH_RUNNING) > ohci_warn(ohci,"SF intr should have been disabled, but > ed_rm_list is > not empty\n"); > spin_unlock (&ohci->lock); > > Result: The ed_rm_list definitely stays not null forever after the hang. > > This non-empty ed_rm_list condition is raised even when not hung at > quite a rate, but after its hung, this lovely message does a while(1) > dmesg benchmark :-) It's normal to see this happen during ordinary use, although perhaps your rate is higher than it should be. Anyway, the next step is to look inside finish_unlinks() in ohci-q.c to figure out what's going wrong. I don't have time today to look further into this, but I'll get back to you later. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Mon, Aug 6, 2012 at 12:24 PM, Alan Stern wrote: > any lines in the dmesg log from boot-up about "enabled Compaq ZFMicro > chipset quirks"? BTW, I printk'ed the ohci->flags just to be sure what quirks were enabled, and only the "do not trust power" was enabled. > My guess is that ed_rm_list is non-NULL, but for some reason > finish_unlinks() doesn't do anything -- in particular, it doesn't clear > ed_rm_list. As a result the condition in the "if" statement fails and > OHCI_INTR_SF doesn't get written to regs->intrdisable. Ding, Ding, Ding!, we have a winner :-) I did an ugly check: if (ohci->ed_rm_list) finish_unlinks (ohci, ohci_frame_no(ohci)); if ((ints & OHCI_INTR_SF) != 0 && !ohci->ed_rm_list && !ohci->ed_to_check && ohci->rh_state == OHCI_RH_RUNNING) ohci_writel (ohci, OHCI_INTR_SF, ®s->intrdisable); else if ((ints & OHCI_INTR_SF) != 0 && ohci->ed_rm_list && !ohci->ed_to_check && ohci->rh_state == OHCI_RH_RUNNING) ohci_warn(ohci,"SF intr should have been disabled, but ed_rm_list is not empty\n"); spin_unlock (&ohci->lock); Result: The ed_rm_list definitely stays not null forever after the hang. This non-empty ed_rm_list condition is raised even when not hung at quite a rate, but after its hung, this lovely message does a while(1) dmesg benchmark :-) -- Tomas J. Sokorai Sch. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Mon, Aug 6, 2012 at 12:24 PM, Alan Stern wrote: > any lines in the dmesg log from boot-up about "enabled Compaq ZFMicro > chipset quirks"? Nope, no quirks message at all in the boot log. > > Are you comfortable writing your own debugging patches, or would you Quite comfortable, I'll gather more info as soon as I get home from work. > My guess is that ed_rm_list is non-NULL, but for some reason > finish_unlinks() doesn't do anything -- in particular, it doesn't clear > ed_rm_list. As a result the condition in the "if" statement fails and > OHCI_INTR_SF doesn't get written to regs->intrdisable. I'll instrument-ate that area, thanks a lot for your input, Alan. -- Tomas J. Sokorai Sch. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Sun, 5 Aug 2012, Tomas Sokorai wrote: > Here's after plugging in, but before the hang: > - > bus pci, device :00:04.0 > OHCI Host Controller > ohci_hcd > OHCI 1.0, NO legacy support registers, rh state running > control 0x68f RWE RWC HCFS=operational IE PLE CBSR=3 > cmdstatus 0x0 SOC=0 > intrstatus 0x0024 FNO SF > intrenable 0x805a MIE RHSC UE RD WDH ... > And finally the one after the USB is hung: > - > bus pci, device :00:04.0 > OHCI Host Controller > ohci_hcd > OHCI 1.0, NO legacy support registers, rh state running > control 0x68f RWE RWC HCFS=operational IE PLE CBSR=3 > cmdstatus 0x0 SOC=0 > intrstatus 0x0020 FNO > intrenable 0x805e MIE RHSC UE RD SF WDH There is a notable difference: After the hang, SF is set in intrenable. The driver does this in only a few places, basically when an Endpoint Descriptor is unlinked or when the ZFMicro quirk is set. Do you see any lines in the dmesg log from boot-up about "enabled Compaq ZFMicro chipset quirks"? > The "async" pseudofile yields nothing, every time . Most likely something is going wrong with the ED unlink. Are you comfortable writing your own debugging patches, or would you prefer me to send you something? The place to check is these lines near the end of ohci_irq() in drivers/usb/host/ohci-hcd.c: if (ohci->ed_rm_list) finish_unlinks (ohci, ohci_frame_no(ohci)); if ((ints & OHCI_INTR_SF) != 0 && !ohci->ed_rm_list && !ohci->ed_to_check && ohci->rh_state == OHCI_RH_RUNNING) ohci_writel (ohci, OHCI_INTR_SF, ®s->intrdisable); My guess is that ed_rm_list is non-NULL, but for some reason finish_unlinks() doesn't do anything -- in particular, it doesn't clear ed_rm_list. As a result the condition in the "if" statement fails and OHCI_INTR_SF doesn't get written to regs->intrdisable. If we can verify that, the next step will be to look inside finish_unlinks() to see why it's not doing anything. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Sun, Aug 5, 2012 at 5:36 PM, Alan Stern wrote: > Build a kernel with CONFIG_USB_DEBUG enabled (and also > CONFIG_DEBUG_FS). When the hang occurs, go into the subdirectory of > /sys/kernel/debug/usb/ohci/ corresponding to the bus the device is > plugged into (:00:04.0 if your setup hasn't changed), and post > copies of the "registers" and "async" files. Thanks for the input, Alan. Here's the "registers", before plugging my device in: - bus pci, device :00:04.0 OHCI Host Controller ohci_hcd OHCI 1.0, NO legacy support registers, rh state running control 0x68f RWE RWC HCFS=operational IE PLE CBSR=3 cmdstatus 0x0 SOC=0 intrstatus 0x0024 FNO SF intrenable 0x805a MIE RHSC UE RD WDH ed_controlhead c781f000 hcca frame 0xcca2 fmintvl 0xa7782edf FIT FSMPS=0xa778 FI=0x2edf fmremaining 0x80001500 FRT FR=0x1500 periodicstart 0x2a2f lsthresh 0x0700 hub poll timer off roothub.a 01001206 POTPGT=1 NOCP NPS NDP=6(6) roothub.b PPCM= DR= roothub.status 8000 DRWE roothub.portstatus [0] 0x0100 PPS roothub.portstatus [1] 0x0100 PPS roothub.portstatus [2] 0x0303 LSDA PPS PES CCS roothub.portstatus [3] 0x0100 PPS roothub.portstatus [4] 0x0100 PPS roothub.portstatus [5] 0x0100 PPS - Here's after plugging in, but before the hang: - bus pci, device :00:04.0 OHCI Host Controller ohci_hcd OHCI 1.0, NO legacy support registers, rh state running control 0x68f RWE RWC HCFS=operational IE PLE CBSR=3 cmdstatus 0x0 SOC=0 intrstatus 0x0024 FNO SF intrenable 0x805a MIE RHSC UE RD WDH ed_controlhead c781f0a0 hcca frame 0x0db6 fmintvl 0xa7782edf FIT FSMPS=0xa778 FI=0x2edf fmremaining 0x8296 FRT FR=0x0296 periodicstart 0x2a2f lsthresh 0x0700 hub poll timer off roothub.a 01001206 POTPGT=1 NOCP NPS NDP=6(6) roothub.b PPCM= DR= roothub.status 8000 DRWE roothub.portstatus [0] 0x0100 PPS roothub.portstatus [1] 0x0103 PPS PES CCS roothub.portstatus [2] 0x0303 LSDA PPS PES CCS roothub.portstatus [3] 0x0100 PPS roothub.portstatus [4] 0x0100 PPS roothub.portstatus [5] 0x0100 PPS - And finally the one after the USB is hung: - bus pci, device :00:04.0 OHCI Host Controller ohci_hcd OHCI 1.0, NO legacy support registers, rh state running control 0x68f RWE RWC HCFS=operational IE PLE CBSR=3 cmdstatus 0x0 SOC=0 intrstatus 0x0020 FNO intrenable 0x805e MIE RHSC UE RD SF WDH ed_controlhead c781f0a0 ed_bulkhead c781f140 hcca frame 0x2b5d fmintvl 0xa7782edf FIT FSMPS=0xa778 FI=0x2edf fmremaining 0x800024e2 FRT FR=0x24e2 periodicstart 0x2a2f lsthresh 0x0700 hub poll timer off roothub.a 01001206 POTPGT=1 NOCP NPS NDP=6(6) roothub.b PPCM= DR= roothub.status 8000 DRWE roothub.portstatus [0] 0x0100 PPS roothub.portstatus [1] 0x0103 PPS PES CCS roothub.portstatus [2] 0x0303 LSDA PPS PES CCS roothub.portstatus [3] 0x0100 PPS roothub.portstatus [4] 0x0100 PPS roothub.portstatus [5] 0x0100 PPS - The "async" pseudofile yields nothing, every time . -- Tomas J. Sokorai Sch. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Continuous stream of small bulk transfers hangs on OHCI-based system
On Sun, 5 Aug 2012, Tomas Sokorai wrote: > Hi guys! > > I think I might have hit a bug: > > * Description of the issue * > Continuous (variable time 1-10 mins) small bulk transfers (between 1-6 > bytes payload) either with libusb, or generic USB serial driver, leaves the > app in uninterruptible wait state (i.e., hung). > > The kernel hung task stack dump consistently show the USB stack getting stuck: ... > All the stacks are consistent of getting stuck in usb_kill_urb(). > > * Environment * > > This happens only in a OHCI-based system (AMD CPU, nVidia chipset), > since if I test the same program and USB device in an Intel-based, > EHCI laptop, I never hit this problem, and what's more interesting: > the operations are about 5x faster in the EHCI system, even when my > device is only 12Mbit-capable. > > I can reproduce this issue with a vanilla 3.5.0 kernel. ... > Any further ideas to shed more light into this issue? Build a kernel with CONFIG_USB_DEBUG enabled (and also CONFIG_DEBUG_FS). When the hang occurs, go into the subdirectory of /sys/kernel/debug/usb/ohci/ corresponding to the bus the device is plugged into (:00:04.0 if your setup hasn't changed), and post copies of the "registers" and "async" files. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Continuous stream of small bulk transfers hangs on OHCI-based system
Hi guys! I think I might have hit a bug: * Description of the issue * Continuous (variable time 1-10 mins) small bulk transfers (between 1-6 bytes payload) either with libusb, or generic USB serial driver, leaves the app in uninterruptible wait state (i.e., hung). The kernel hung task stack dump consistently show the USB stack getting stuck: [ 3118.808597] INFO: task reprogJTAG:6993 blocked for more than 120 seconds. [ 3118.808603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 3118.808606] reprogJTAG D 8800b026d260 0 6993 5917 0x [ 3118.808614] 88008ac69d88 0086 8801 88008ac69b88 [ 3118.808621] 88008ac68000 8800b03cea40 88008ac69fd8 00011080 [ 3118.808628] 00011080 4000 88008ac69fd8 00011080 [ 3118.808634] Call Trace: [ 3118.808648] [] ? _raw_spin_lock_irqsave+0x18/0x37 [ 3118.808655] [] ? _raw_spin_unlock_irqrestore+0x13/0x2e [ 3118.808661] [] ? ohci_urb_dequeue+0xdb/0xea [ 3118.808667] [] ? unlink1+0xcd/0xe0 [ 3118.808672] [] ? _raw_spin_lock_irqsave+0x18/0x37 [ 3118.808676] [] ? _raw_spin_unlock_irqrestore+0x13/0x2e [ 3118.808681] [] schedule+0x5f/0x61 [ 3118.808686] [] usb_kill_urb+0xb1/0xd1 [ 3118.808692] [] ? wake_up_bit+0x25/0x25 [ 3118.808696] [] ? mutex_lock+0x1f/0x40 [ 3118.808702] [] usbdev_do_ioctl+0xbb5/0xf71 [ 3118.808707] [] ? _raw_spin_unlock_irqrestore+0x13/0x2e [ 3118.808713] [] ? __hrtimer_start_range_ns+0x2a4/0x2b6 [ 3118.808717] [] ? _raw_spin_lock_irqsave+0x18/0x37 [ 3118.808722] [] usbdev_ioctl+0x9/0xd [ 3118.808727] [] do_vfs_ioctl+0x4ed/0x52e [ 3118.808733] [] ? sys_timerfd_settime+0x2f6/0x33f [ 3118.808737] [] sys_ioctl+0x42/0x66 [ 3118.808743] [] system_call_fastpath+0x1a/0x1f All the stacks are consistent of getting stuck in usb_kill_urb(). * Environment * This happens only in a OHCI-based system (AMD CPU, nVidia chipset), since if I test the same program and USB device in an Intel-based, EHCI laptop, I never hit this problem, and what's more interesting: the operations are about 5x faster in the EHCI system, even when my device is only 12Mbit-capable. I can reproduce this issue with a vanilla 3.5.0 kernel. The userspace program is a small libusb-based JTAG programmer I wrote and the USB device it talks to, is also designed, built, and firmware written by me, but the flawless work in the Intel system, rules out any issues on both the USB device and userspace program sides. Here is the lsusb output (running the 3.5.0 Gentoo kernel, not vanilla, but this version only has two minor patches, nothing USB-related, so can be regarded as a vanilla kernel for the practical purposes of this issue, my device is the 0x666:0x1 ): - Bus 003 Device 002: ID 04f9:0028 Brother Industries, Ltd Printer Device Descriptor: bLength18 bDescriptorType 1 bcdUSB 2.00 bDeviceClass0 (Defined at Interface level) bDeviceSubClass 0 bDeviceProtocol 0 bMaxPacketSize0 8 idVendor 0x04f9 Brother Industries, Ltd idProduct 0x0028 Printer bcdDevice1.00 iManufacturer 1 iProduct2 iSerial 3 bNumConfigurations 1 Configuration Descriptor: bLength 9 bDescriptorType 2 wTotalLength 32 bNumInterfaces 1 bConfigurationValue 1 iConfiguration 0 bmAttributes 0xc0 Self Powered MaxPower2mA Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber0 bAlternateSetting 0 bNumEndpoints 2 bInterfaceClass 7 Printer bInterfaceSubClass 1 Printer bInterfaceProtocol 2 Bidirectional iInterface 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x01 EP 1 OUT bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0040 1x 64 bytes bInterval 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x82 EP 2 IN bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0010 1x 16 bytes bInterval 0 Bus 004 Device 003: ID 0666:0001 Device Descriptor: bLength18 bDescriptorType 1 bcdUSB 2.00 bDeviceClass0 (Defined at Interface level) bDeviceSubClass 0 bDeviceProtocol 0 bMaxPacketSize