Re: EPROTO when USB 3 GbE adapters are under load

Alan Stern Fri, 26 Oct 2018 07:16:16 -0700

On Fri, 26 Oct 2018, Mathias Nyman wrote:

> On 25.10.2018 20:28, Alan Stern wrote:
> > On Thu, 25 Oct 2018, Mathias Nyman wrote:
> > 
> >> On 25.10.2018 12:52, Hao Wei Tee wrote:
> >>> On 25/10/18 4:45 PM, Mathias Nyman wrote:
> >>>> Reproducing the issue with a recent kernel with xhci traces enabled 
> >>>> should show the reason for EPROTO error.
> >>>>
> >>>> Add xhci traces before triggering the issue with:
> >>>>
> >>>> mount -t debugfs none /sys/kernel/debug
> >>>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> >>>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> >>>>
> >>>> after issue is triggered save and send the trace at 
> >>>> /sys/kernel/debug/tracing/trace
> >>>> Note that it might be huge
> >>>
> >>> Thanks for the suggestion.
> >>>
> >>> Here[1] is (part of) the trace starting about 250 lines before the EPROTO 
> >>> happens.
> >>>
> >>> [1]: 
> >>> https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
> >>>
> >>> The first error happens at line 243 (timestamp 8144.248398) coinciding 
> >>> with the start of errors spewed into dmesg:
> >>>
> >>> [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> >>> [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> >>> [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> >>> [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
> >>
> >> Thanks,
> >> xHC controller reports that there was a transaction error on one of the 
> >> bulk TRBs.
> >>
> >> The transaction error causes the endpoint to halt (host side halt only).
> >> Xhci driver resets the host side endpoint to recover from the halt,
> >> then returns the broken URB (TRB) with -EPROTO status, and then moves past 
> >> this TRB.
> > 
> > The host side of the endpoint should remain stopped until after the
> > URB's completion routine has had a chance to carry out error recovery.
> > Doesn't this imply the xHCI driver shouldn't reset the host-side
> > endpoint until after the giveback call returns?
> 
> True, on xhci side we could probably reset the endpoint, and even move the
> dequeue pointer to the next TRB, but make sure the endpoint is not restarted 
> yet.
> 
> The URB with -EPROTO status is given back in interrupt context, so this might 
> limit
> a bit what the higher-layer drivers can do in giveback.


One thing they can do is unlink any URBs still remaining in the 
endpoint's queue, thus preventing any confusion from stale data when 
the endpoint restarts.  It's okay to call usb_unlink_urb() in interrupt 
context.

> Now thinking about it, xhci driver calls the URB giveback in the same 
> "Transaction error"
> interrupt handler, after first queuing areset endpoint and a set TR Deq 
> pointer command.
> The endpoint is only restarted after those commands finish, in the command 
> completion interrupt
> handler.
> 
> So in that sense the endpoint shouldn't be restarted until the next interrupt 
> is handled,
> which shouldn't be possible before the URB giveback call returned in the 
> previous interrupt handler.
> 
> Well, at least not as long as we are in hard interrupt.
> 
> I think I need to dig a bit more into this.
> 
> > 
> >> Interesting thing here is that each TRB in the queue after the transaction 
> >> error
> >> also triggers a transaction error.
> >>   
> >> This might be a data toggle/sequence number sync issue.
> > 
> > It's more likely to be a problem on the device side.  Data toggle or
> > sequence number issues tend to be self-repairing (albeit with some data
> > loss) after a little while.
> 
> Ok, thanks, not spending too much time looking into that then.

Important point: The device's problem might be caused by the kernel
sending it a command it can't handle.  So maybe the way to fix the
problem may be to change the upper-layer driver; this happens 
sometimes.  Other times it really is just a bug in the device.

> >> The host side endpoint reset clears the host side sequence number,
> >> and host expects device side endpoint to be reset and sequence to be 
> >> cleared as well
> >> as a result of returning -EPROTO.
> >> If I remember correctly xhci driver does not wait for device side endpoint 
> >> to be reset,
> >> so if there are  TRBs in the queue they will be transferred, with a 
> >> cleared sequence number
> >> out of sync with the device side.
> > 
> > That's why it's important to wait until after the higher-layer driver
> > has had a chance to unlink the URBs that may be in the endpoint queue.
> > The driver may even want to reset the device.
> 
> Would it make sense to prevent endpoint from running until usb core calls
> hcd->driver->endpoint_reset?
> That is for halted endpoints, that returned URB with -EPROTO status.

The HCD shouldn't worry about that.  The higher-layer driver is
responsible for fixing the error that caused the endpoint to halt,
unlinking any remaining URBs, and clearing the halt.

Alan Stern

Re: EPROTO when USB 3 GbE adapters are under load

Reply via email to