RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: walt > On 01/21/2014 01:51 AM, David Laight wrote: > > From: Sarah Sharp > >> On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote: > > ... > >>> A guess... > >>> > >>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead > >>> of xhci_td_remainder(). > >> > David, I tried the one-liner below, which changed nothing AFAICS, but > then I'm not sure it's the change you intended: ... > /* Set the TRB length, TD size, and interrupter fields. */ > - if (xhci->hci_version < 0x100) { > + if (xhci->hci_version > 0x100) { > remainder = xhci_td_remainder( > urb->transfer_buffer_length - > running_total); So my wild guess wasn't right. Can't win them all. David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On 01/21/2014 01:51 AM, David Laight wrote: > From: Sarah Sharp >> On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote: > ... >>> A guess... >>> >>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead >>> of xhci_td_remainder(). >> >> Why? Walt has a 0.96 xHCI host controller, and the format for how to >> calculate the TD remainder changed between the 0.96 and the 1.0 spec. >> That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder(). > > I just wonder how many of those differences are just differences in the > specification, rather than differences in the hardware implementation. > In some cases it might be that the old hardware just ignored the value. > > I know that the xhci hardware on my ivy bridge cpu does look at that > value (at least checking for zero), since things failed in subtle ways > when I got it wrong. > > In this case it was just something easy to change that might be worth > trying. I didn't necessarily expect it to make a positive difference. David, I tried the one-liner below, which changed nothing AFAICS, but then I'm not sure it's the change you intended: --- xhci-ring.c.orig2014-01-21 13:28:36.396278813 -0800 +++ xhci-ring.c 2014-01-21 13:35:11.410312814 -0800 @@ -3335,7 +3335,7 @@ } /* Set the TRB length, TD size, and interrupter fields. */ - if (xhci->hci_version < 0x100) { + if (xhci->hci_version > 0x100) { remainder = xhci_td_remainder( urb->transfer_buffer_length - running_total); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: Sarah Sharp > On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote: ... > > A guess... > > > > In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead > > of xhci_td_remainder(). > > Why? Walt has a 0.96 xHCI host controller, and the format for how to > calculate the TD remainder changed between the 0.96 and the 1.0 spec. > That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder(). I just wonder how many of those differences are just differences in the specification, rather than differences in the hardware implementation. In some cases it might be that the old hardware just ignored the value. I know that the xhci hardware on my ivy bridge cpu does look at that value (at least checking for zero), since things failed in subtle ways when I got it wrong. In this case it was just something easy to change that might be worth trying. I didn't necessarily expect it to make a positive difference. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote: > From: walt > > On 01/17/2014 06:34 AM, David Laight wrote: > > > > > Can you try the patch I posted that stops the ownership on LINK TRBs > > > being changed before that on the linked-to TRB? > > > > Please disregard my earlier post about the patch not applying cleanly. > > That was the usual html corruption, so I found the original on the usb > > list and it was okay. > > > > Sadly, the patch didn't fix the ASMedia lockup behavior, however :( > > > > I did notice that the lockup occurred only when copying *to* the usb3 > > drive, and not when copying from it. I think that may be new behavior > > but I can't swear to it. > > Consistent with another report that says that ethernet worked provided > that TSO was disabled (ie no sg tx). > (Without the patch to delay he ownership change on link trbs it didn't > work at all.) Please be more clear. What do you mean by these statements? That someone privately reported that your earlier patch [1] did not help them, but applying your new patch [2] on top of the old patch did? [1] http://marc.info/?l=linux-usb&m=138418996717941&w=2 [2] http://marc.info/?l=linux-usb&m=138996538403468&w=2 In general, will you please Cc me and the USB list when replying to privately reported bugs/confirmations that patches work? Or if the confirmation was reported, please provide a link to the mailing list discussion or bugzilla entry. We need to keep bug and fix confirmations publicly archived. Please keep me on Cc since I filter mail based on that. > A guess... > > In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead > of xhci_td_remainder(). Why? Walt has a 0.96 xHCI host controller, and the format for how to calculate the TD remainder changed between the 0.96 and the 1.0 spec. That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder(). Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: walt > On 01/17/2014 06:34 AM, David Laight wrote: > > > Can you try the patch I posted that stops the ownership on LINK TRBs > > being changed before that on the linked-to TRB? > > Please disregard my earlier post about the patch not applying cleanly. > That was the usual html corruption, so I found the original on the usb > list and it was okay. > > Sadly, the patch didn't fix the ASMedia lockup behavior, however :( > > I did notice that the lockup occurred only when copying *to* the usb3 > drive, and not when copying from it. I think that may be new behavior > but I can't swear to it. Consistent with another report that says that ethernet worked provided that TSO was disabled (ie no sg tx). (Without the patch to delay he ownership change on link trbs it didn't work at all.) A guess... In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead of xhci_td_remainder(). You can try that on top of either of my other patches. David
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: > On 01/17/2014 06:34 AM, David Laight wrote: > > > Can you try the patch I posted that stops the ownership on LINK TRBs > > being changed before that on the linked-to TRB? > > Sadly, the patch didn't fix the ASMedia lockup behavior, however :( > > I did notice that the lockup occurred only when copying *to* the usb3 > drive, and not when copying from it. I think that may be new behavior > but I can't swear to it. If the behaviour has changed then the fix is on the right lines. If reads used to lock up, and don't with the patch then that is an improvement. It might be that the tx issue is actually a different 'bug'. > Just to confirm, here are the first few lines of the patch I used. > Please let me know if it's the wrong patch: > ... > + > + field4 = (field4 & ~TRB_CYCLE) | ring->cycle_state; > + if (trb == &ring->enqueue_first->generic) > + field4 ^= TRB_CYCLE; > + > trb->field[0] = cpu_to_le32(field1); That looks like my earlier 'test' patch. The fuller patch is http://article.gmane.org/gmane.linux.usb.general/101784 There shouldn't be any difference in the way the chip is driven, the later patch just rips out a load of code that is no longer needed. Mostly it simplifies the way the ownership of ring entries is passed to the hardware. David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On 01/17/2014 06:34 AM, David Laight wrote: > Can you try the patch I posted that stops the ownership on LINK TRBs > being changed before that on the linked-to TRB? Please disregard my earlier post about the patch not applying cleanly. That was the usual html corruption, so I found the original on the usb list and it was okay. Sadly, the patch didn't fix the ASMedia lockup behavior, however :( I did notice that the lockup occurred only when copying *to* the usb3 drive, and not when copying from it. I think that may be new behavior but I can't swear to it. Just to confirm, here are the first few lines of the patch I used. Please let me know if it's the wrong patch: diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 53c2e29..589d336 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2928,6 +2928,11 @@ static void queue_trb(struct xhci_hcd *xhci, struct xhci_ring *ring, struct xhci_generic_trb *trb; trb = &ring->enqueue->generic; + + field4 = (field4 & ~TRB_CYCLE) | ring->cycle_state; + if (trb == &ring->enqueue_first->generic) + field4 ^= TRB_CYCLE; + trb->field[0] = cpu_to_le32(field1); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On 01/17/2014 06:34 AM, David Laight wrote: > From: walt >> Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, >> well, >> the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk >> docking station, too.) >> >> I can't believe the adapter works perfectly in a different computer. Have >> you >> seen this kind of thing before? > > Could be a horrid timing race between the cpu and xchi controller. > If the cpu manages to write a NOP or LINK TRB for a following transfer > before the controller polls the next entry (after raising the IRQ) > then the controller might process the LINK and then get confused > when it can't process the linked-to TRB. > This might not sound likely, but PCIe has significant latency. > Can you try the patch I posted that stops the ownership on LINK TRBs > being changed before that on the linked-to TRB? David, the patch doesn't apply cleanly to any 3.12.x sources I can find, including the latest git from Linus. Half of the hunks fail. Could you create a fresh patch against vanilla 3.12.8 so I can test it? Or point me to the correct kernel sources, perhaps. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: walt > Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well, > the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk > docking station, too.) > > I can't believe the adapter works perfectly in a different computer. Have you > seen this kind of thing before? Could be a horrid timing race between the cpu and xchi controller. If the cpu manages to write a NOP or LINK TRB for a following transfer before the controller polls the next entry (after raising the IRQ) then the controller might process the LINK and then get confused when it can't process the linked-to TRB. This might not sound likely, but PCIe has significant latency. > At the moment I have two machines using your xhci driver and both work > perfectly, > so I thank you again :) > > I'm not sure where to go with this next. I could put the adapter back in the > other machine again if you have more patches to test. Can you try the patch I posted that stops the ownership on LINK TRBs being changed before that on the linked-to TRB? I got a private mail from someone indicating that my earlier 'minimal' patch helped an ASMedia controller talking to the asx189_178a ethernet hardware. David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On Tue, Jan 14, 2014 at 01:27:25PM -0800, walt wrote: > On 01/14/2014 09:20 AM, Sarah Sharp wrote: > > On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote: > > >> Sarah, I just fixed my xhci bug for US$19.99 :) > >> > >> #lspci | tail -1 > >> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller > >> (rev 03) > >> > >> This new NEC usb3 controller does everything the ASMedia controller should > >> have > >> done from the start. > > > > I just got a similar report from someone with a Fresco Logic host > > controller, so I may need you to test a work-around patch for the > > ASMedia host at some point. Hmm, the Fresco Logic host issue seems unrelated. > Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well, > the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk > docking station, too.) Ugh. Well, I suppose we can chalk it up to hardware failure? I think you're the only one to report a verified issue with the Link TRB patch. You are sure you're running a vanilla 3.12.7 kernel, right? > I can't believe the adapter works perfectly in a different computer. Have you > seen this kind of thing before? No, at least not this particular host-dying-only-on-one-machine failure mode. > At the moment I have two machines using your xhci driver and both work > perfectly, > so I thank you again :) > > I'm not sure where to go with this next. I could put the adapter back in the > other machine again if you have more patches to test. I think any patches I was going to send are moot with this new information. The issue with your PCI add-in card only happens in combination with a specific motherboard, so I don't think it makes sense to disable the no-op TRBs for that host. If lots of other people start reporting the same issue with the ASMedia 0.96 host, and reverting commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e "usb: xhci: Link TRB must not occur within a USB payload burst" helps them, then I'll reconsider that decision. Thank you so much for your patience while debugging this issue, and being willing to try all sorts of kernels and patches. I'm glad we finally figured out what the issue was, and you have working xHCI hosts now. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On 01/14/2014 09:20 AM, Sarah Sharp wrote: > On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote: >> Sarah, I just fixed my xhci bug for US$19.99 :) >> >> #lspci | tail -1 >> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller >> (rev 03) >> >> This new NEC usb3 controller does everything the ASMedia controller should >> have >> done from the start. > I just got a similar report from someone with a Fresco Logic host > controller, so I may need you to test a work-around patch for the > ASMedia host at some point. Oy, Sarah! ;) I put the ASMedia adapter in my older amd64 machine, and, well, the stupid thing Just Works(TM) with kernel 3.12.7! (Yes, with the same disk docking station, too.) I can't believe the adapter works perfectly in a different computer. Have you seen this kind of thing before? At the moment I have two machines using your xhci driver and both work perfectly, so I thank you again :) I'm not sure where to go with this next. I could put the adapter back in the other machine again if you have more patches to test. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote: > On 01/09/2014 03:50 PM, Sarah Sharp wrote: > > >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: > >> > >> I've wondered if my xhci problems might be caused by hardware quirks, and > >> wondering why I seem to be the only one who has this problem. > >> > >> Maybe I could "take one for the team" by buying new hardware toys that I > >> don't really need but I could use to test the xhci driver? (I do enjoy > >> buying new toys, necessary, or, um, maybe not :) > > > > It would be appreciated if you could see if your device causes other > > host controllers to fail. Who am I to keep a geek from new toys? ;) > > Sarah, I just fixed my xhci bug for US$19.99 :) > > #lspci | tail -1 > 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller > (rev 03) > > This new NEC usb3 controller does everything the ASMedia controller should > have > done from the start. I can even power-up the outboard usb3 disk docking > station > after the computer is running and still everything Just Works :) > > I appreciate all the debugging work you've done to fix the ASMedia problem but > I think it's time to quit now. If hundreds of irate linux users complain > about > the same bug then I'll be happy to test more patches. I just got a similar report from someone with a Fresco Logic host controller, so I may need you to test a work-around patch for the ASMedia host at some point. I'm considering disabling the effect of David's patch for those two host controllers. That will mean USB storage works fine, but USB ethernet may fail. I had considered just disabling scatter-gather for the hosts, but we can still run into the ethernet issue if we need to break a TRB at a 64KB boundary. So disabling scatter-gather would make USB ethernet work _most of the time_, but fail intermittently, and USB storage performance would be impacted. Since USB ethernet will fail in either case, I would rather keep USB storage performance and sacrifice USB ethernet on those particular hosts. So please keep the ASMedia host around for testing, if possible. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: walt > On 01/09/2014 03:50 PM, Sarah Sharp wrote: > > >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: > >> > >> I've wondered if my xhci problems might be caused by hardware quirks, and > >> wondering why I seem to be the only one who has this problem. > >> > >> Maybe I could "take one for the team" by buying new hardware toys that I > >> don't really need but I could use to test the xhci driver? (I do enjoy > >> buying new toys, necessary, or, um, maybe not :) > > > > It would be appreciated if you could see if your device causes other > > host controllers to fail. Who am I to keep a geek from new toys? ;) > > Sarah, I just fixed my xhci bug for US$19.99 :) > > #lspci | tail -1 > 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller > (rev 03) Do you know which version of the xhci spec it conforms to? (Will probably be 0.96 or 1.00). Of course, ASMedia will probably change the silicon they are using without changing anything on the packaging. David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
On 01/09/2014 03:50 PM, Sarah Sharp wrote: >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: >> >> I've wondered if my xhci problems might be caused by hardware quirks, and >> wondering why I seem to be the only one who has this problem. >> >> Maybe I could "take one for the team" by buying new hardware toys that I >> don't really need but I could use to test the xhci driver? (I do enjoy >> buying new toys, necessary, or, um, maybe not :) > > It would be appreciated if you could see if your device causes other > host controllers to fail. Who am I to keep a geek from new toys? ;) Sarah, I just fixed my xhci bug for US$19.99 :) #lspci | tail -1 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03) This new NEC usb3 controller does everything the ASMedia controller should have done from the start. I can even power-up the outboard usb3 disk docking station after the computer is running and still everything Just Works :) I appreciate all the debugging work you've done to fix the ASMedia problem but I think it's time to quit now. If hundreds of irate linux users complain about the same bug then I'll be happy to test more patches. Until then... :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Thu, 9 Jan 2014, Sarah Sharp wrote: > > I can't see anything obvious either. > > However there is no response to the 'stop endpoint' command. > > Section 4.6.9 (page 107 of rev1.0) states that the controller will complete > > any USB IN or OUT transaction before raising the command completion event. > > Possibly it is too 'stuck' to complete the transaction? > > The host has to stop processing the transaction, it can't "wait" for the > transaction to finish. "The Stop Endpoint Command is expected to stop > endpoint activity as soon as possible, which may mean that it stops in > the middle of a TRB." Just to clarify for Walt: There's a difference between a transaction and a transfer. Transfers can take an indefinitely long time to complete, because the device doesn't have to accept or send any data until it is ready. By contrast, transactions have sharply bounded lifetimes. A transaction consists of some maximum number of packets (usually 3, sometimes a little more) with upper limits on the time intervals between them. A TRB lies somewhere between a transfer and a transaction. A single TRB can encompass a single transaction or multiple transactions, and a single transfer can involve more than one TRB. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
From: walt > > In the meantime, try this patch, which is something of a long shot. > > No difference. But I notice the code enables the TRB quirk only if the > xhci_version is specifically 0x95. My debug messages claim that "xHCI > doesn't need link TRB QUIRK" so I'm wondering if adding my asmedia device > to the quirks list really doesn't change anything unless it's xhci 0.95. I think the intention was that the quirk would be applied for your hardware even though it claims to be version 0.96. That was gust a hopeful guess that the same change would help. > Does lspci provide that information? I'm not seeing anything obvious. I've only seen it in the diagnostic prints (that aren't normally output). David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/09/2014 03:50 PM, Sarah Sharp wrote: >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: >> >> The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking >> station are the only two usb3 devices I have. >> >> I've wondered if my xhci problems might be caused by hardware quirks, and >> wondering why I seem to be the only one who has this problem. >> >> Maybe I could "take one for the team" by buying new hardware toys that I >> don't really need but I could use to test the xhci driver? (I do enjoy >> buying new toys, necessary, or, um, maybe not :) > > It would be appreciated if you could see if your device causes other > host controllers to fail. Who am I to keep a geek from new toys? ;) Thanks :) Just to clarify, when you say 'your device' do you mean the outboard disk docking station? So I need to buy other host controllers, not new docking stations? > > In the meantime, try this patch, which is something of a long shot. No difference. But I notice the code enables the TRB quirk only if the xhci_version is specifically 0x95. My debug messages claim that "xHCI doesn't need link TRB QUIRK" so I'm wondering if adding my asmedia device to the quirks list really doesn't change anything unless it's xhci 0.95. Does lspci provide that information? I'm not seeing anything obvious. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
[Walt, please use reply-all to keep the list in the loop, thanks.] On Wed, Jan 08, 2014 at 04:09:14PM +, David Laight wrote: > > From: Sarah Sharp > > On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: > > > On 01/07/2014 01:21 PM, Sarah Sharp wrote: > > > > > > > Can you please try the attached patch, on top of the previous three > > > > patches, and send me dmesg? > > > > > > Hi Sarah, I just now finished running 0001-More-debugging.patch for the > > > first time. The previous dmesg didn't include that patch, but this one > > > does. > > > > > > I read through this dmesg but I nodded off somewhere around line 500. > > > I hope you can stay awake :) > > > > Well, it has all the info I need, but the results don't make me too > > happy. Everything I've checked seems consistent, and I don't know why > > the host stopped. The link TRBs are intact, the dequeue pointer for the > > endpoint was pointing to the transfer that timed out and it had the > > cycle bit set correctly, etc. Perhaps the no-op TRBs are really the > > issue. > > > > I'll have to take a look at the log again tomorrow. I posted the dmesg > > on pastebin if David wants to check it out as well: > > http://pastebin.com/a4AUpsL1 > > I can't see anything obvious either. > However there is no response to the 'stop endpoint' command. > Section 4.6.9 (page 107 of rev1.0) states that the controller will complete > any USB IN or OUT transaction before raising the command completion event. > Possibly it is too 'stuck' to complete the transaction? The host has to stop processing the transaction, it can't "wait" for the transaction to finish. "The Stop Endpoint Command is expected to stop endpoint activity as soon as possible, which may mean that it stops in the middle of a TRB." Usually when hosts get into this kind of mode, something has seriously gone wrong, like bus error when it issues a bad memory access. > The endpoint status is also still '1' (running). > This also means that the 'TR dequeue pointer' is undefined - so the > controller could easily be processing a later TRB. > This field might even still contain the ring base address written by > the driver much earlier. > > This might mean that something 'catastrophic' has happened earlier. > Maybe the controller isn't actually seeing any doorbell writes at all. > Maybe the base addresses it has for the rings have all got corrupted. > At least this looks like amd64 - so there aren't memory coherency issues. > > Some hacks that might help isolate the problem: > 1) Request an interrupt from the last nop data TRB. > 2) Put a command nop (decimal 23) TRB into the command ring before >the 'stop endpoint'. > 3) Comment out the code that adds the nop data TRBs. > The first two might need code adding to handle the responses. > > Do we know the actual xhci device? > I think it reports version 0x96. > (Sarah - it might be useful if that version were in one of the trace > messages that is output by default.) You mean print the PCI device and vendor ID? Perhaps Subsystem vendor as well? On Tue, Jan 07, 2014 at 05:26:37PM -0800, walt wrote: > On 01/07/2014 04:47 PM, Sarah Sharp wrote: > > > Can you send me the output of `sudo lspci -vvv -n`? Maybe we can just > > turn off scatter-gather for your host controller until we get a proper > > fix in that uses link TRBs instead of no-op TRBs. > > The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking > station are the only two usb3 devices I have. > > I've wondered if my xhci problems might be caused by hardware quirks, and > wondering why I seem to be the only one who has this problem. > > Maybe I could "take one for the team" by buying new hardware toys that I > don't really need but I could use to test the xhci driver? (I do enjoy > buying new toys, necessary, or, um, maybe not :) It would be appreciated if you could see if your device causes other host controllers to fail. Who am I to keep a geek from new toys? ;) In the meantime, try this patch, which is something of a long shot. Sarah Sharp >From 19e2ab85ac2cc0d84f56247dcf29bdce14bd70d5 Mon Sep 17 00:00:00 2001 From: Sarah Sharp Date: Thu, 9 Jan 2014 15:46:04 -0800 Subject: [PATCH] xhci: Enable Link TRB quirk for 0.96 ASMedia host. A recent bug fix commit causes an ASMedia host to stop responding to commands. See if it needs the link TRB quirk. This was generally only necessary for 0.95 hosts, but maybe this 0.96 host needs it. Signed-off-by: Sarah Sharp --- drivers/usb/host/xhci-pci.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c index 3c898c12a06b..8196ac2289e4 100644 --- a/drivers/usb/host/xhci-pci.c +++ b/drivers/usb/host/xhci-pci.c @@ -92,6 +92,8 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci) xhci->quirks |= XHCI_TRUST_TX_LENGTH; } + if (pdev->vendor == PCI_VENDOR_ID_ASMEDIA && pdev->device == 1042) + xhci->quirks |= XHCI_LINK_TRB_QUIRK;
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/09/2014 02:05 AM, David Laight wrote: >> From: walt > ... >> I'm still wondering if I'm suffering from hardware quirks. From the >> first day I installed my usb3 adapter card and the usb3 disk docking >> station I've noticed some quirky behavior. > > Ah - this isn't an 'on chip' usb3 adapter. > Some kind of PCIe card ? This one: http://www.sabrent.com/category/controller-cards/PCIX-USB3/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
> From: walt ... > I'm still wondering if I'm suffering from hardware quirks. From the > first day I installed my usb3 adapter card and the usb3 disk docking > station I've noticed some quirky behavior. Ah - this isn't an 'on chip' usb3 adapter. Some kind of PCIe card ? > e.g. I boot the machine with the docking station powered-off, and then > later I power it on, the usb3 disk is not detected at all -- until I > reboot the machine with the docking station still powered on. That will be something completely different to failures when running. David N�r��yb�X��ǧv�^�){.n�+{zX����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a��� 0��h���i
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/03/2014 03:29 PM, Sarah Sharp wrote: > I'll let you know when I have some diagnostic patches ready. Hi Sarah. I see today gregkh committed the patches you've already sent me, so I assume someone (other than me) has tested those patches and discovered some benefit from them? I'm still wondering if I'm suffering from hardware quirks. From the first day I installed my usb3 adapter card and the usb3 disk docking station I've noticed some quirky behavior. e.g. I boot the machine with the docking station powered-off, and then later I power it on, the usb3 disk is not detected at all -- until I reboot the machine with the docking station still powered on. Minor stuff, yes, but maybe relevant? I dunno. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
> From: Alan Stern > On Wed, 8 Jan 2014, David Laight wrote: > > > > From: Alan Stern > > > > > > This may be a foolish question, but why is xhci-hcd using no-op TRBs in > > > the first place? > > > > Because it can't write in a link TRB because other parts of the > > code use link TRBs to detect the end of the ring. > > > > The problem is that it can't put a link TRB in the middle of > > a chain of data fragments unless it is at a 'suitable' offset > > from the start of the data TD. Given arbitrary input fragmentation > > this means that you can't put a link TRB in the middle of a TD. > > (The documented alignment might be as high as 16kB.) > > > > If the rest of the code used a 'ring end pointer' then a link TRB > > could be used instead. > > I see. Sounds like a poor design decision in hindsight. Can it be > changed? Anything can be changed :-) But it is a bit pervasive. Padding out with nops isolated the change. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Wed, 8 Jan 2014, David Laight wrote: > > From: Alan Stern > > > > This may be a foolish question, but why is xhci-hcd using no-op TRBs in > > the first place? > > Because it can't write in a link TRB because other parts of the > code use link TRBs to detect the end of the ring. > > The problem is that it can't put a link TRB in the middle of > a chain of data fragments unless it is at a 'suitable' offset > from the start of the data TD. Given arbitrary input fragmentation > this means that you can't put a link TRB in the middle of a TD. > (The documented alignment might be as high as 16kB.) > > If the rest of the code used a 'ring end pointer' then a link TRB > could be used instead. I see. Sounds like a poor design decision in hindsight. Can it be changed? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
> From: Alan Stern > > This may be a foolish question, but why is xhci-hcd using no-op TRBs in > the first place? Because it can't write in a link TRB because other parts of the code use link TRBs to detect the end of the ring. The problem is that it can't put a link TRB in the middle of a chain of data fragments unless it is at a 'suitable' offset from the start of the data TD. Given arbitrary input fragmentation this means that you can't put a link TRB in the middle of a TD. (The documented alignment might be as high as 16kB.) If the rest of the code used a 'ring end pointer' then a link TRB could be used instead. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Tue, 7 Jan 2014, Sarah Sharp wrote: > On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: > > On 01/07/2014 01:21 PM, Sarah Sharp wrote: > > > > > Can you please try the attached patch, on top of the previous three > > > patches, and send me dmesg? > > > > Hi Sarah, I just now finished running 0001-More-debugging.patch for the > > first time. The previous dmesg didn't include that patch, but this one > > does. > > > > I read through this dmesg but I nodded off somewhere around line 500. > > I hope you can stay awake :) > > Well, it has all the info I need, but the results don't make me too > happy. Everything I've checked seems consistent, and I don't know why > the host stopped. The link TRBs are intact, the dequeue pointer for the > endpoint was pointing to the transfer that timed out and it had the > cycle bit set correctly, etc. Perhaps the no-op TRBs are really the > issue. This may be a foolish question, but why is xhci-hcd using no-op TRBs in the first place? It makes sense that they would be needed if you have to unlink an URB that isn't the first one on the endpoint ring. But usb-storage never does that; whenever it unlinks an URB, it always unlinks the earliest entry in the endpoint's queue. After unlinking the first URB on the ring, you don't need to fill in its TRBs with no-ops. Instead, when you are ready to start the ring againk, just tell the host controller to move the dequeue pointer up to the start of the next surviving URB. (You'll also have to adjust the CYCLE bits of the TRBs that get skipped over, but that's trivial.) Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
> From: Sarah Sharp > On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: > > On 01/07/2014 01:21 PM, Sarah Sharp wrote: > > > > > Can you please try the attached patch, on top of the previous three > > > patches, and send me dmesg? > > > > Hi Sarah, I just now finished running 0001-More-debugging.patch for the > > first time. The previous dmesg didn't include that patch, but this one > > does. > > > > I read through this dmesg but I nodded off somewhere around line 500. > > I hope you can stay awake :) > > Well, it has all the info I need, but the results don't make me too > happy. Everything I've checked seems consistent, and I don't know why > the host stopped. The link TRBs are intact, the dequeue pointer for the > endpoint was pointing to the transfer that timed out and it had the > cycle bit set correctly, etc. Perhaps the no-op TRBs are really the > issue. > > I'll have to take a look at the log again tomorrow. I posted the dmesg > on pastebin if David wants to check it out as well: > http://pastebin.com/a4AUpsL1 I can't see anything obvious either. However there is no response to the 'stop endpoint' command. Section 4.6.9 (page 107 of rev1.0) states that the controller will complete any USB IN or OUT transaction before raising the command completion event. Possibly it is too 'stuck' to complete the transaction? The endpoint status is also still '1' (running). This also means that the 'TR dequeue pointer' is undefined - so the controller could easily be processing a later TRB. This field might even still contain the ring base address written by the driver much earlier. This might mean that something 'catastrophic' has happened earlier. Maybe the controller isn't actually seeing any doorbell writes at all. Maybe the base addresses it has for the rings have all got corrupted. At least this looks like amd64 - so there aren't memory coherency issues. Some hacks that might help isolate the problem: 1) Request an interrupt from the last nop data TRB. 2) Put a command nop (decimal 23) TRB into the command ring before the 'stop endpoint'. 3) Comment out the code that adds the nop data TRBs. The first two might need code adding to handle the responses. Do we know the actual xhci device? I think it reports version 0x96. (Sarah - it might be useful if that version were in one of the trace messages that is output by default.) David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote: > On 01/07/2014 01:21 PM, Sarah Sharp wrote: > > > Can you please try the attached patch, on top of the previous three > > patches, and send me dmesg? > > Hi Sarah, I just now finished running 0001-More-debugging.patch for the > first time. The previous dmesg didn't include that patch, but this one > does. > > I read through this dmesg but I nodded off somewhere around line 500. > I hope you can stay awake :) Well, it has all the info I need, but the results don't make me too happy. Everything I've checked seems consistent, and I don't know why the host stopped. The link TRBs are intact, the dequeue pointer for the endpoint was pointing to the transfer that timed out and it had the cycle bit set correctly, etc. Perhaps the no-op TRBs are really the issue. I'll have to take a look at the log again tomorrow. I posted the dmesg on pastebin if David wants to check it out as well: http://pastebin.com/a4AUpsL1 Can you send me the output of `sudo lspci -vvv -n`? Maybe we can just turn off scatter-gather for your host controller until we get a proper fix in that uses link TRBs instead of no-op TRBs. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Tue, Jan 07, 2014 at 12:00:00PM -0800, walt wrote: > Okay, I used log_buf_len to make dmesg bigger and now I think I have > the whole thing. It's attached. Walt, can you make sure the patch I sent you was applied? The output doesn't look like it is. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Tue, Jan 07, 2014 at 01:58:32PM +, David Laight wrote: > The dmesg contains: > > [ 538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error > writing to inode 23330865 (offset 0 size 8388608 starting block 812628) > > An 8MB transfer will need at least 128 ring entries (TRB) even if the request > is a single contiguous memory block. > > Are you using the patch that increases the ring size from 64 to 256? It's likely that the block layer is breaking up the EXT4 write into several transfers, since usb-storage limits overall transfer size to 120 KB. In any case, I added more debugging in the last patch to print the number of TRBs necessary. That way we can verify the patch to limit the number of scatter-gather list entries is working. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Tue, Jan 07, 2014 at 05:29:48AM -0800, walt wrote: > On 01/06/2014 04:31 PM, Sarah Sharp wrote: > > Hi Walt, > > > > I have a couple of patches for you to test. > > > Please only apply the first patch (which is diagnostic only), trigger > > your issue, and send me the resulting dmesg. Then try applying the > > other two patches, and see if the issue goes away. (I suspect it won't > > but I can't be sure.) > > Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all > three patches applied. Some of the messages in dmesg1 fell off the end of > the kernel buffer, so I may need to make the buffer larger next time but > I'll need a reminder of how to do it. Set CONFIG_LOG_BUF_SHIFT to 21. > As you suspected, the patches didn't fix the problem, sorry. Yep, I thought so. I did glean one bit of information from the logs: it seems that your host does handle no-op TRBs, at least for a while. However, after a bigger chunk of TRBs, it goes off into la-la-land. Assuming one of the rings is comprised of two segments: 0xbb711000 (start) 0xbb7113f0 (end) 0xbb711400 (start) 0xbb7117f0 (end) The log show no-ops were inserted at: 0xbb7207d0 0xbb7206a0 0xbb720be0 0xbb720be0 0xbb720bd0 0xbb7207e0 0xbb711370 = 8 no-ops 0xbb7117c0 = 3 0xbb7113b0 = 4 0xbb7113a0 = 5 0xbb7117d0 = 2 0xbb711340 = 11 0xbb711770 = 8 0xbb711230 = 28 0xbb7117e0 = 1 0xbb7117b0 = 4 0xbb7113d0 = 2 0xbb7117b0 = 4 0xbb711340 = 11 0xbb711690 = 22 So the host was able to process 28 no-op TRBs, but failed on 22 no-ops later. The event ring debugging shows the last event was for 0xbb711680, which is the last TRB before the first no-op inserted before the host died. There's no Stop Endpoint Command completion, and it looks like the command was correctly put on the command ring, so it seems the host is actually hanging for some reason. Unfortunately, I made a mistake in the debugging patch I sent you, so it didn't print out the endpoint rings when the host died. I need that info, to see whether the link TRB was still intact, or if we over-wrote it and caused the host to go fetch some invalid memory. Can you please try the attached patch, on top of the previous three patches, and send me dmesg? > I find that I can tell in advance whether the copy is going to succeed, > just by watching the light flicker on the usb3 drive. When the flicker > is absolutely regular, with no variation whatever, I can tell in 10 or > 15 seconds that the copy will fail. > > At the same time the light on the main drive goes dark after 10 seconds, > implying that the usb3 drive stops receiving any data from the main drive > after 10 seconds, yet the light on the usb3 drive continues to flicker as > if writing data -- even after the cp officially fails. The light on the > usb3 drive never stops flickering until I reboot the machine or unplug > the usb cable. Interesting. Without a USB analyzer, we can't really tell what's happening. However, one hypothesis could be that the blinking light is triggered by an active SCSI command (read/write, etc). There are three phases of the command: setup, data, and status. I think your device is getting the setup phase, and the host is dying before it sends the data phase. If the light blinks when it gets a setup phase, and turns off when the devices sends a status phase, that would explain its behavior. But that's just a hypothesis, I have no idea whether it's correct. Sarah Sharp >From d085fb5b9630e935d7954fe5947b48402e43bdc1 Mon Sep 17 00:00:00 2001 From: Sarah Sharp Date: Tue, 7 Jan 2014 12:39:47 -0800 Subject: [PATCH] More debugging. Signed-off-by: Sarah Sharp --- drivers/usb/host/xhci-ring.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 2afaf15009e8..228ab8cf868e 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -993,6 +993,9 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg) for (i = 0; i < MAX_HC_SLOTS; i++) { if (!xhci->devs[i]) continue; + + xhci_dbg(xhci, "Slot %d output context\n", i); + xhci_dbg_ctx(xhci, xhci->devs[i]->out_ctx, 30); for (j = 0; j < 31; j++) { temp_ep = &xhci->devs[i]->eps[j]; ring = temp_ep->ring; @@ -1001,6 +1004,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg) xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb, "Killing URBs for slot ID %u, " "ep index %u", i, j); + xhci_dbg(xhci, "Dev %i Ep 0x%x:\n", i, + xhci_get_endpoint_address(j)); + xhci_debug_ring(xhci, ring); + xhci_dbg_ring_ptrs(xhci, ring); while (!list_empty(&ring->td_list)) { cur_td = list_first_entry(&ring->td_list, struct xhci_td, @@ -1011,12 +1018,6 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg) xhci_giveback_urb_in_irq(xhci, cur_td, -ESHUTDOWN, "killed"); } - if (!list_empty(&temp_ep->cancelled_td_list)) { -xhci_dbg(xhci, "Dev %i Ep 0x%x:\n", i,
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
Okay, I used log_buf_len to make dmesg bigger and now I think I have the whole thing. It's attached. dmesg2.gz Description: application/gzip
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/07/2014 05:58 AM, David Laight wrote: > The dmesg contains: > > [ 538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error > writing to inode 23330865 (offset 0 size 8388608 starting block 812628) > > An 8MB transfer will need at least 128 ring entries (TRB) even if the request > is a single contiguous memory block. > > Are you using the patch that increases the ring size from 64 to 256? Yes, 256. I'll work on getting more debugging output. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
The dmesg contains: [ 538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error writing to inode 23330865 (offset 0 size 8388608 starting block 812628) An 8MB transfer will need at least 128 ring entries (TRB) even if the request is a single contiguous memory block. Are you using the patch that increases the ring size from 64 to 256? David
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
> From: walt ... > Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all > three patches applied. Some of the messages in dmesg1 fell off the end of > the kernel buffer, so I may need to make the buffer larger next time but > I'll need a reminder of how to do it. > > As you suspected, the patches didn't fix the problem, sorry. I think Sarah will want the traces of the transfer being setup (ie just before the error). Some 'normal' completing transfers are also useful. You might also find the full trace output in one of the files in /var/log. David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/06/2014 04:31 PM, Sarah Sharp wrote: > On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote: >> With the dmesg, I can finally see what happened: >> >> [ 188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep >> 0x2, starting at offset 0xbb7b9000 >> [ 188.703072] xhci_hcd :03:00.0: // Ding dong! >> [ 193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop >> endpoint command. >> [ 193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host. >> [ 193.711046] xhci_hcd :03:00.0: // Halt the HC >> [ 193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0 >> [ 193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2 >> [ 193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3 >> [ 193.711096] xhci_hcd :03:00.0: Calling usb_hc_died() >> [ 193.711103] xhci_hcd :03:00.0: HC died; cleaning up >> [ 193.76] xhci_hcd :03:00.0: xHCI host controller is dead. >> >> It seems that the xHCI driver tried to stop the endpoint ring in order >> to cancel a SCSI transfer, and the driver never got a response for that. >> >> The offset is rather suspicious (0xbb7b9000), and it probably means the >> driver attempted to cancel a transfer that had been moved to the >> beginning of the ring segment, with no-op TRBs before the link TRB. >> >> I suspect David's patch triggers a bug in the command cancellation code. >> There's also the unlikely possibility that the no-op TRBs did indeed >> cause the host to hang. Either way, I'll have to look into it. >> >> I'll let you know when I have some diagnostic patches ready. > > Hi Walt, > > I have a couple of patches for you to test. > Please only apply the first patch (which is diagnostic only), trigger > your issue, and send me the resulting dmesg. Then try applying the > other two patches, and see if the issue goes away. (I suspect it won't > but I can't be sure.) Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all three patches applied. Some of the messages in dmesg1 fell off the end of the kernel buffer, so I may need to make the buffer larger next time but I'll need a reminder of how to do it. As you suspected, the patches didn't fix the problem, sorry. I find that I can tell in advance whether the copy is going to succeed, just by watching the light flicker on the usb3 drive. When the flicker is absolutely regular, with no variation whatever, I can tell in 10 or 15 seconds that the copy will fail. At the same time the light on the main drive goes dark after 10 seconds, implying that the usb3 drive stops receiving any data from the main drive after 10 seconds, yet the light on the usb3 drive continues to flicker as if writing data -- even after the cp officially fails. The light on the usb3 drive never stops flickering until I reboot the machine or unplug the usb cable. dmesg0.gz Description: application/gzip dmesg1.gz Description: application/gzip
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote: > On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote: > > I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm > > not > > going to publish it here :( > > > > Once I finally ran the kernel with debugging actually compiled in, dmesg > > contains > > xhci debugging messages. Wow :) > > > > It's a big file so I zipped and attached it, which I hope is acceptable in > > lkml. > > Yep, that's fine. Sticking it in pastebin (or up on your server) is > also fine, if it gets really big. > > > BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said > > before > > doesn't fix the problem. The cp stopped around 7GB just as before. > > > > Sorry for the noise... > > No worries! :) With the dmesg, I can finally see what happened: > > [ 188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep > 0x2, starting at offset 0xbb7b9000 > [ 188.703072] xhci_hcd :03:00.0: // Ding dong! > [ 193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop > endpoint command. > [ 193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host. > [ 193.711046] xhci_hcd :03:00.0: // Halt the HC > [ 193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0 > [ 193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2 > [ 193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3 > [ 193.711096] xhci_hcd :03:00.0: Calling usb_hc_died() > [ 193.711103] xhci_hcd :03:00.0: HC died; cleaning up > [ 193.76] xhci_hcd :03:00.0: xHCI host controller is dead. > > It seems that the xHCI driver tried to stop the endpoint ring in order > to cancel a SCSI transfer, and the driver never got a response for that. > > The offset is rather suspicious (0xbb7b9000), and it probably means the > driver attempted to cancel a transfer that had been moved to the > beginning of the ring segment, with no-op TRBs before the link TRB. > > I suspect David's patch triggers a bug in the command cancellation code. > There's also the unlikely possibility that the no-op TRBs did indeed > cause the host to hang. Either way, I'll have to look into it. > > I'll let you know when I have some diagnostic patches ready. Hi Walt, I have a couple of patches for you to test. You can either apply the attached three patches, or you can pull down a kernel with: git clone git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci.git -b 3.12-td-fragment-failure Please only apply the first patch (which is diagnostic only), trigger your issue, and send me the resulting dmesg. Then try applying the other two patches, and see if the issue goes away. (I suspect it won't but I can't be sure.) Sarah Sharp >From 0261dcd2711c010d786dcd940803a44e1bc19512 Mon Sep 17 00:00:00 2001 From: Sarah Sharp Date: Mon, 6 Jan 2014 16:06:27 -0800 Subject: [PATCH 1/3] TD fragment debugging Signed-off-by: Sarah Sharp --- drivers/usb/host/xhci-ring.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 55fc0c39b7e1..d05f61dc8359 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -982,6 +982,14 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg) * doesn't touch the memory. */ } + + xhci_dbg(xhci, "Command ring:\n"); + xhci_debug_ring(xhci, xhci->cmd_ring); + xhci_dbg_ring_ptrs(xhci, xhci->cmd_ring); + xhci_dbg(xhci, "Event ring:\n"); + xhci_debug_ring(xhci, xhci->event_ring); + xhci_dbg_ring_ptrs(xhci, xhci->event_ring); + for (i = 0; i < MAX_HC_SLOTS; i++) { if (!xhci->devs[i]) continue; @@ -1003,6 +1011,12 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg) xhci_giveback_urb_in_irq(xhci, cur_td, -ESHUTDOWN, "killed"); } + if (!list_empty(&temp_ep->cancelled_td_list)) { +xhci_dbg(xhci, "Dev %i Ep 0x%x:\n", i, + xhci_get_endpoint_address(j)); +xhci_debug_ring(xhci, ring); +xhci_dbg_ring_ptrs(xhci, ring); + } while (!list_empty(&temp_ep->cancelled_td_list)) { cur_td = list_first_entry( &temp_ep->cancelled_td_list, @@ -2966,6 +2980,10 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring, num_trbs, TRBS_PER_SEGMENT - 1); return -ENOMEM; } + xhci_dbg(xhci, "Insert no-op TRBs at 0x%llx\n", + (unsigned long long) + xhci_trb_virt_to_dma(ep_ring->enq_seg, + ep_ring->enqueue)); nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state); -- 1.8.3.3 >From 380071d6fa2430c7141faefc8acfc0909c75a0ed Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Mon, 6 Jan 2014 03:16:32 + Subject: [PATCH 2/3] xhci: Avoid infinite loop when sg urb requires too many trbs Currently prepare_ring() returns -ENOMEM if the urb won't fit into a single ring segment. usb_sg_wait() treats this error as a temporary conditio
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
> From: walt ... > /* Accept arbitrarily long scatter-gather lists */ > - hcd->self.sg_tablesize = ~0; > + hcd->self.sg_tablesize = 31; Even if that reduces the number of fragments passed to the xhci driver it may not be enough to limit the actual number of fragments that need to be placed in the ring itself. The xhci driver has to split every fragment on any 64k address boundary. One possibility is to scan long SG lists to see it they are actually problematic. If all the fragments are suitably aligned let them through (without any nops). My gut feeling is that problems only arise when the ring end isn't at a 1k boundary in the data. So provided all the fragments are multiples of 1k (after splitting on 64k boundaries) the transfer will be processed correctly. Alternatively, if the fragments are all longer than 1k (after the 64k split), the one that crosses the ring end can be split in two. A quick 'fix' would be to assume that anything with too many fragments is probably ok - maybe check the first fragment is suitably aligned. That would recover the old behaviour for usb disk transfers with a lot of fragments - yes it is a hack... David
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 14-01-03 10:40 AM, walt wrote: > On 01/02/2014 11:15 AM, Sarah Sharp wrote: >> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: >>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: 3.12-stable review patch. If anyone has any objections, please let me know. -- From: David Laight commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB can only occur at a boundary between underlying USB frames (512 bytes for high speed devices). If this isn't done the USB frames aren't formatted correctly and, for example, the USB3 ethernet ax88179_178a card will stop sending... >>> >>> >>> Unfortunately this patch causes a regression when copying large files to my >>> outboard USB3 drive. (Nothing at all to do with networking.) > >> Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see >> dmesg output from this statement shortly before your drive fails: >> >> if (num_trbs >= TRBS_PER_SEGMENT) { >> xhci_err(xhci, "Too many fragments %d, max %d\n", >> num_trbs, TRBS_PER_SEGMENT - 1); >> return -ENOMEM; >> } > > Well, the answers depend on whether the usb3 drive uses logical volumes or not > (lvm2), which I can't explain. What I've described so far is with lvm2. > > When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the > console prints two or three lines stating that the ext4 journal has quit and > the drive is remounted ro. That particular drive stays wedged until the next > reboot, but no other ill effects to the system. > > OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock, > (no logical volumes) the copy failure becomes catastrophic, with kernel panic > messages, leaving the system unresponsive and needing a hard reset to recover. > > I also tried your other suggestion: > > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c > index 4265b48..1a6a43d 100644 > --- a/drivers/usb/host/xhci.c > +++ b/drivers/usb/host/xhci.c > @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, > xhci_get_quirks_t get_quirks) > int retval; > > /* Accept arbitrarily long scatter-gather lists */ > - hcd->self.sg_tablesize = ~0; > + hcd->self.sg_tablesize = 31; > > /* support to build packet from discontinuous buffers */ > hcd->self.no_sg_constraint = 1; > > Sadly it didn't fix the problem. Did I get the patch right? That sounds almost as if the old version is still being loaded/run, possibly from the initramfs image? -- Mark Lord Real-Time Remedies Inc. ml...@pobox.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote: > I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm > not > going to publish it here :( > > Once I finally ran the kernel with debugging actually compiled in, dmesg > contains > xhci debugging messages. Wow :) > > It's a big file so I zipped and attached it, which I hope is acceptable in > lkml. Yep, that's fine. Sticking it in pastebin (or up on your server) is also fine, if it gets really big. > BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said > before > doesn't fix the problem. The cp stopped around 7GB just as before. > > Sorry for the noise... No worries! :) With the dmesg, I can finally see what happened: [ 188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 0x2, starting at offset 0xbb7b9000 [ 188.703072] xhci_hcd :03:00.0: // Ding dong! [ 193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop endpoint command. [ 193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host. [ 193.711046] xhci_hcd :03:00.0: // Halt the HC [ 193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0 [ 193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2 [ 193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3 [ 193.711096] xhci_hcd :03:00.0: Calling usb_hc_died() [ 193.711103] xhci_hcd :03:00.0: HC died; cleaning up [ 193.76] xhci_hcd :03:00.0: xHCI host controller is dead. It seems that the xHCI driver tried to stop the endpoint ring in order to cancel a SCSI transfer, and the driver never got a response for that. The offset is rather suspicious (0xbb7b9000), and it probably means the driver attempted to cancel a transfer that had been moved to the beginning of the ring segment, with no-op TRBs before the link TRB. I suspect David's patch triggers a bug in the command cancellation code. There's also the unlikely possibility that the no-op TRBs did indeed cause the host to hang. Either way, I'll have to look into it. I'll let you know when I have some diagnostic patches ready. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/03/2014 11:54 AM, Sarah Sharp wrote: > On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote: >> On 01/02/2014 11:15 AM, Sarah Sharp wrote: >>> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: > 3.12-stable review patch. If anyone has any objections, please let me > know. > > -- > > From: David Laight > > commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. > > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link > TRB > can only occur at a boundary between underlying USB frames (512 bytes for > high speed devices). > > If this isn't done the USB frames aren't formatted correctly and, for > example, > the USB3 ethernet ax88179_178a card will stop sending... Unfortunately this patch causes a regression when copying large files to my outboard USB3 drive. (Nothing at all to do with networking.) >> >>> Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see >>> dmesg output from this statement shortly before your drive fails: >>> >>> if (num_trbs >= TRBS_PER_SEGMENT) { >>> xhci_err(xhci, "Too many fragments %d, max %d\n", >>> num_trbs, TRBS_PER_SEGMENT - 1); >>> return -ENOMEM; >>> } >> >> Well, the answers depend on whether the usb3 drive uses logical volumes or >> not >> (lvm2), which I can't explain. What I've described so far is with lvm2. >> >> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect I'm so sorry Sarah, that was another mistake. The mistake is so stupid I'm not going to publish it here :( Once I finally ran the kernel with debugging actually compiled in, dmesg contains xhci debugging messages. Wow :) It's a big file so I zipped and attached it, which I hope is acceptable in lkml. BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before doesn't fix the problem. The cp stopped around 7GB just as before. Sorry for the noise... xhci.dmesg.gz Description: application/gzip
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote: > On 01/02/2014 11:15 AM, Sarah Sharp wrote: > > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: > >> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: > >>> 3.12-stable review patch. If anyone has any objections, please let me > >>> know. > >>> > >>> -- > >>> > >>> From: David Laight > >>> > >>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. > >>> > >>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link > >>> TRB > >>> can only occur at a boundary between underlying USB frames (512 bytes for > >>> high speed devices). > >>> > >>> If this isn't done the USB frames aren't formatted correctly and, for > >>> example, > >>> the USB3 ethernet ax88179_178a card will stop sending... > >> > >> > >> Unfortunately this patch causes a regression when copying large files to my > >> outboard USB3 drive. (Nothing at all to do with networking.) > > > Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see > > dmesg output from this statement shortly before your drive fails: > > > > if (num_trbs >= TRBS_PER_SEGMENT) { > > xhci_err(xhci, "Too many fragments %d, max %d\n", > > num_trbs, TRBS_PER_SEGMENT - 1); > > return -ENOMEM; > > } > > Well, the answers depend on whether the usb3 drive uses logical volumes or not > (lvm2), which I can't explain. What I've described so far is with lvm2. > > When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the > console prints two or three lines stating that the ext4 journal has quit and > the drive is remounted ro. That particular drive stays wedged until the next > reboot, but no other ill effects to the system. Odd. In 3.12 xHCI has dynamic debugging, and turning on CONFIG_USB_DEBUG should turn on debugging by default, so it's confusing that you didn't see any messages. Can I see your .config from /boot/? Also, did you try capturing dmesg with `tail -f /var/log/kern.log` or just dmesg? Perhaps you need to run `sudo dmesg -n 7`? > OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock, > (no logical volumes) the copy failure becomes catastrophic, with kernel panic > messages, leaving the system unresponsive and needing a hard reset to recover. > > I also tried your other suggestion: > > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c > index 4265b48..1a6a43d 100644 > --- a/drivers/usb/host/xhci.c > +++ b/drivers/usb/host/xhci.c > @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, > xhci_get_quirks_t get_quirks) > int retval; > > /* Accept arbitrarily long scatter-gather lists */ > - hcd->self.sg_tablesize = ~0; > + hcd->self.sg_tablesize = 31; > > /* support to build packet from discontinuous buffers */ > hcd->self.no_sg_constraint = 1; > > Sadly it didn't fix the problem. Did I get the patch right? Yes, you did. So perhaps the patch triggers a different bug. I can't tell until I see xHCI debugging output. > Thanks for your help, and I'm happy to try more ideas, as always. Thanks for your patience. :) Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 01/02/2014 11:15 AM, Sarah Sharp wrote: > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: >> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: >>> 3.12-stable review patch. If anyone has any objections, please let me know. >>> >>> -- >>> >>> From: David Laight >>> >>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. >>> >>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB >>> can only occur at a boundary between underlying USB frames (512 bytes for >>> high speed devices). >>> >>> If this isn't done the USB frames aren't formatted correctly and, for >>> example, >>> the USB3 ethernet ax88179_178a card will stop sending... >> >> >> Unfortunately this patch causes a regression when copying large files to my >> outboard USB3 drive. (Nothing at all to do with networking.) > Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see > dmesg output from this statement shortly before your drive fails: > > if (num_trbs >= TRBS_PER_SEGMENT) { > xhci_err(xhci, "Too many fragments %d, max %d\n", > num_trbs, TRBS_PER_SEGMENT - 1); > return -ENOMEM; > } Well, the answers depend on whether the usb3 drive uses logical volumes or not (lvm2), which I can't explain. What I've described so far is with lvm2. When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the console prints two or three lines stating that the ext4 journal has quit and the drive is remounted ro. That particular drive stays wedged until the next reboot, but no other ill effects to the system. OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock, (no logical volumes) the copy failure becomes catastrophic, with kernel panic messages, leaving the system unresponsive and needing a hard reset to recover. I also tried your other suggestion: diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index 4265b48..1a6a43d 100644 --- a/drivers/usb/host/xhci.c +++ b/drivers/usb/host/xhci.c @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks) int retval; /* Accept arbitrarily long scatter-gather lists */ - hcd->self.sg_tablesize = ~0; + hcd->self.sg_tablesize = 31; /* support to build packet from discontinuous buffers */ hcd->self.no_sg_constraint = 1; Sadly it didn't fix the problem. Did I get the patch right? Thanks for your help, and I'm happy to try more ideas, as always. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Thu, Jan 02, 2014 at 04:01:29PM -0500, Mark Lord wrote: > On 14-01-02 02:15 PM, Sarah Sharp wrote: > > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: > .. > >> Unfortunately this patch causes a regression when copying large files to my > >> outboard USB3 drive. (Nothing at all to do with networking.) > >> > >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies > >> after > >> about 7GB, the ext4 journal aborts and the drive is remounted read-only. > >> > >> This bug is 100% reproducible (always pretty close to 7GB) and reverting > >> this > >> patch completely fixes the problem. > > > > Ok, I had feared that would be a consequence of this patch. I think the > > problem is that the usb-storage driver submitted an URB with more > > scatter-gather entries than would fit on the ring segment, the xHCI > > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave > > up on the SCSI command. > > Is there not a block layer / scheduler tunable for max sg entries or > something? There is a USB host controller tunable for max number of sg entries that's passed up to the SCSI or block layer. We discussed changing it, but there wasn't a good consensus on what to change it to: http://marc.info/?l=linux-usb&m=138496358223904&w=2 http://marc.info/?l=linux-netdev&m=138496706007262&w=2 In the end, we thought we didn't need to limit the sglist size because Paul thought usb-storage limits the overall transfer size to 120K, which should fit in 31 TRBs: http://marc.info/?l=linux-usb&m=138498190419312&w=2 Walt, could you see if limiting the sglist size helps, by setting hcd->self.sg_tablesize in xhci.c to 31? Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Thu, 2014-01-02 at 16:01 -0500, Mark Lord wrote: > On 14-01-02 02:15 PM, Sarah Sharp wrote: > > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: > .. > >> Unfortunately this patch causes a regression when copying large files to my > >> outboard USB3 drive. (Nothing at all to do with networking.) > >> > >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies > >> after > >> about 7GB, the ext4 journal aborts and the drive is remounted read-only. > >> > >> This bug is 100% reproducible (always pretty close to 7GB) and reverting > >> this > >> patch completely fixes the problem. > > > > Ok, I had feared that would be a consequence of this patch. I think the > > problem is that the usb-storage driver submitted an URB with more > > scatter-gather entries than would fit on the ring segment, the xHCI > > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave > > up on the SCSI command. > > > Is there not a block layer / scheduler tunable for max sg entries or > something? Yes, it's blk_queue_max_segments() You can also add an indirect limit from userspace using the queue/max_sectors_kb parameter (it doesn't limit the number of sg elements directly, but it does limit the overall size of the transfer, and since each segment is 4k or larger except at the beginning and end, that limits the total number of sg elements as well). James -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 14-01-02 02:15 PM, Sarah Sharp wrote: > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: .. >> Unfortunately this patch causes a regression when copying large files to my >> outboard USB3 drive. (Nothing at all to do with networking.) >> >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after >> about 7GB, the ext4 journal aborts and the drive is remounted read-only. >> >> This bug is 100% reproducible (always pretty close to 7GB) and reverting this >> patch completely fixes the problem. > > Ok, I had feared that would be a consequence of this patch. I think the > problem is that the usb-storage driver submitted an URB with more > scatter-gather entries than would fit on the ring segment, the xHCI > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave > up on the SCSI command. Is there not a block layer / scheduler tunable for max sg entries or something? -- Mark Lord Real-Time Remedies Inc. ml...@pobox.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: > On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: > > 3.12-stable review patch. If anyone has any objections, please let me know. > > > > -- > > > > From: David Laight > > > > commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. > > > > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB > > can only occur at a boundary between underlying USB frames (512 bytes for > > high speed devices). > > > > If this isn't done the USB frames aren't formatted correctly and, for > > example, > > the USB3 ethernet ax88179_178a card will stop sending... > > > Unfortunately this patch causes a regression when copying large files to my > outboard USB3 drive. (Nothing at all to do with networking.) > > When I try to copy a large (20GB) file to the USB3 drive, the copy dies after > about 7GB, the ext4 journal aborts and the drive is remounted read-only. > > This bug is 100% reproducible (always pretty close to 7GB) and reverting this > patch completely fixes the problem. Ok, I had feared that would be a consequence of this patch. I think the problem is that the usb-storage driver submitted an URB with more scatter-gather entries than would fit on the ring segment, the xHCI driver rejected the URB with -ENOMEM, and the SCSI core eventually gave up on the SCSI command. Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see dmesg output from this statement shortly before your drive fails: if (num_trbs >= TRBS_PER_SEGMENT) { xhci_err(xhci, "Too many fragments %d, max %d\n", num_trbs, TRBS_PER_SEGMENT - 1); return -ENOMEM; } > (Note to Sarah: I recently emailed you about this problem, and I *wrongly* > said that reverting the patch doesn't help. That was a mistake, sorry.) > > I'm happy to try any debugging suggestions/tricks. Unfortunately a real fix for this is going to take a bit. I have a couple different solutions to the bug the patch solved, but they're much more invasive than the original patch and will take a couple weeks to implement and thoroughly test. If David's patch is just reverted, USB ethernet on 3.12 and later breaks under xHCI. The networking folks added scatter-gather support in 3.12. Those patches could be reverted, but I suspect David Miller will not be happy with that solution, since the real problem is the xHCI driver itself, and EHCI scatter-gather works fine. I think the short term solution is to simply turn off scatter-gather all together under xHCI until this gets fixed. It could mean a big performance hit for USB storage devices, but that means we get correct behavior for both USB ethernet and USB storage. > BTW, please tell me if I've cc'd too many people. Nope, you're fine. I've Cc'ed the USB and SCSI mailing lists as well. Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 13-12-31 03:40 PM, walt wrote: > On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: >> 3.12-stable review patch. If anyone has any objections, please let me know. >> >> -- >> >> From: David Laight >> >> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. >> >> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB >> can only occur at a boundary between underlying USB frames (512 bytes for >> high speed devices). >> >> If this isn't done the USB frames aren't formatted correctly and, for >> example, >> the USB3 ethernet ax88179_178a card will stop sending... > > > Unfortunately this patch causes a regression when copying large files to my > outboard USB3 drive. (Nothing at all to do with networking.) > > When I try to copy a large (20GB) file to the USB3 drive, the copy dies after > about 7GB, the ext4 journal aborts and the drive is remounted read-only. > > This bug is 100% reproducible (always pretty close to 7GB) and reverting this > patch completely fixes the problem. > > (Note to Sarah: I recently emailed you about this problem, and I *wrongly* > said that reverting the patch doesn't help. That was a mistake, sorry.) .. I have also had USB3 mass-storage issues with kernels since this patch. Dunno if the patch itself is the problem, but just too much to do with USB3 is very flakey in the most recent 3.12.* kernels, as well as the 3.13-rc* ones. So I have gone back to running older kernels and patching them. Cheers -- Mark Lord Real-Time Remedies Inc. ml...@pobox.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: > 3.12-stable review patch. If anyone has any objections, please let me know. > > -- > > From: David Laight > > commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. > > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB > can only occur at a boundary between underlying USB frames (512 bytes for > high speed devices). > > If this isn't done the USB frames aren't formatted correctly and, for example, > the USB3 ethernet ax88179_178a card will stop sending... Unfortunately this patch causes a regression when copying large files to my outboard USB3 drive. (Nothing at all to do with networking.) When I try to copy a large (20GB) file to the USB3 drive, the copy dies after about 7GB, the ext4 journal aborts and the drive is remounted read-only. This bug is 100% reproducible (always pretty close to 7GB) and reverting this patch completely fixes the problem. (Note to Sarah: I recently emailed you about this problem, and I *wrongly* said that reverting the patch doesn't help. That was a mistake, sorry.) I'm happy to try any debugging suggestions/tricks. BTW, please tell me if I've cc'd too many people. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/