Re: EPROTO when USB 3 GbE adapters are under load

2018-11-27 Thread Hao Wei Tee

On 19/11/18 10:30 pm, Hao Wei Tee wrote:
> On 25/10/18 11:37 am, Hao Wei Tee wrote:
>
> I got another RTL8153-powered adapter and, guess what, I can't seem to
> reproduce this anymore. Not sure if it is something that changed in 4.19 or
> something to do with the adapters themselves (I don't have the old adapter
> with me right now).

It finally happened again after a week of use. I guess it still happens, but
it's less common.

Oh well.

--
Hao Wei



Re: EPROTO when USB 3 GbE adapters are under load

2018-11-19 Thread Hao Wei Tee

On 25/10/18 11:37 am, Hao Wei Tee wrote:
> Hi,
>
> There are multiple reports[1][2][3] (more elsewhere on the internet) of USB 3
> GbE adapters throwing EPROTO errors on USB transfer especially when the devices
> are under load. Both of the two common chipsets (Realtek RTL8153 (r8152[4]) and
> Asix AX88179 (ax88179_178a[5])) seem to exhibit this behaviour.

I got another RTL8153-powered adapter and, guess what, I can't seem to
reproduce this anymore. Not sure if it is something that changed in 4.19 or
something to do with the adapters themselves (I don't have the old adapter
with me right now).

--
Hao Wei



Re: EPROTO when USB 3 GbE adapters are under load

2018-10-30 Thread Hao Wei Tee

On 25/10/18 11:04 PM, Mathias Nyman wrote:
> There is a patch in usb-next that might help.
> f8f80be xhci: Use soft retry to recover faster from transaction errors
>
> It soft resets the halted host side endpoint, clears the halt without
> clearing the sequence number.


FWIW, although I guess you might've guessed, the patch didn't seem to change 
the behaviour at all. But thanks in any case.

I'll see what else I can figure out.

--
Hao Wei


Re: EPROTO when USB 3 GbE adapters are under load

2018-10-26 Thread Alan Stern
On Fri, 26 Oct 2018, Mathias Nyman wrote:

> On 25.10.2018 20:28, Alan Stern wrote:
> > On Thu, 25 Oct 2018, Mathias Nyman wrote:
> >
> >> On 25.10.2018 12:52, Hao Wei Tee wrote:
> >>> On 25/10/18 4:45 PM, Mathias Nyman wrote:
> >>>> Reproducing the issue with a recent kernel with xhci traces enabled
> >>>> should show the reason for EPROTO error.
> >>>>
> >>>> Add xhci traces before triggering the issue with:
> >>>>
> >>>> mount -t debugfs none /sys/kernel/debug
> >>>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> >>>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> >>>>
> >>>> after issue is triggered save and send the trace at
> >>>> /sys/kernel/debug/tracing/trace
> >>>> Note that it might be huge
> >>>
> >>> Thanks for the suggestion.
> >>>
> >>> Here[1] is (part of) the trace starting about 250 lines before the EPROTO
> >>> happens.
> >>>
> >>> [1]:
> >>> https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
> >>>
> >>> The first error happens at line 243 (timestamp 8144.248398) coinciding
> >>> with the start of errors spewed into dmesg:
> >>>
> >>> [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> >>> [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> >>> [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> >>> [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
> >>
> >> Thanks,
> >> xHC controller reports that there was a transaction error on one of the
> >> bulk TRBs.
> >>
> >> The transaction error causes the endpoint to halt (host side halt only).
> >> Xhci driver resets the host side endpoint to recover from the halt,
> >> then returns the broken URB (TRB) with -EPROTO status, and then moves past
> >> this TRB.
> >
> > The host side of the endpoint should remain stopped until after the
> > URB's completion routine has had a chance to carry out error recovery.
> > Doesn't this imply the xHCI driver shouldn't reset the host-side
> > endpoint until after the giveback call returns?
>
> True, on xhci side we could probably reset the endpoint, and even move the
> dequeue pointer to the next TRB, but make sure the endpoint is not restarted
> yet.
>
> The URB with -EPROTO status is given back in interrupt context, so this might
> limit a bit what the higher-layer drivers can do in giveback.

One thing they can do is unlink any URBs still remaining in the
endpoint's queue, thus preventing any confusion from stale data when
the endpoint restarts.  It's okay to call usb_unlink_urb() in interrupt
context.

> Now thinking about it, xhci driver calls the URB giveback in the same
> "Transaction error" interrupt handler, after first queuing a reset endpoint
> and a set TR Deq pointer command.
> The endpoint is only restarted after those commands finish, in the command
> completion interrupt handler.
>
> So in that sense the endpoint shouldn't be restarted until the next interrupt
> is handled, which shouldn't be possible before the URB giveback call returned
> in the previous interrupt handler.
>
> Well, at least not as long as we are in hard interrupt.
>
> I think I need to dig a bit more into this.
>
> >
> >> Interesting thing here is that each TRB in the queue after the transaction
> >> error also triggers a transaction error.
> >>   
> >> This might be a data toggle/sequence number sync issue.
> > 
> > It's more likely to be a problem on the device side.  Data toggle or
> > sequence number issues tend to be self-repairing (albeit with some data
> > loss) after a little while.
> 
> Ok, thanks, not spending too much time looking into that then.

Important point: The device's problem might be caused by the kernel
sending it a command it can't handle.  So the way to fix the problem
may be to change the upper-layer driver; this happens sometimes.  Other
times it really is just a bug in the device.

> >> The host side endpoint reset clears the host side sequence number,
> >> and host expects device side endpoint to be reset and sequence to be
> >> cleared as well as a result of returning -EPROTO.
> >> If I remember correctly xhci driver does not wait for device side endpoint
> >> to be reset, so if there are TRBs in the queue they will be transferred,
> >> with a cleared sequence number out of sync with the device side.
> > 
> > That's why it's important to wait until after the higher-layer driver
> > has had a chance to unlink the URBs that may be in the endpoint queue.
> > The driver may even want to reset the device.
> 
> Would it make sense to prevent endpoint from running until usb core calls
> hcd->driver->endpoint_reset?
> That is for halted endpoints, that returned URB with -EPROTO status.

The HCD shouldn't worry about that.  The higher-layer driver is
responsible for fixing the error that caused the endpoint to halt,
unlinking any remaining URBs, and clearing the halt.

Alan Stern



Re: EPROTO when USB 3 GbE adapters are under load

2018-10-26 Thread Mathias Nyman

On 25.10.2018 20:28, Alan Stern wrote:
> On Thu, 25 Oct 2018, Mathias Nyman wrote:
>
>> On 25.10.2018 12:52, Hao Wei Tee wrote:
>>> On 25/10/18 4:45 PM, Mathias Nyman wrote:
>>>> Reproducing the issue with a recent kernel with xhci traces enabled
>>>> should show the reason for EPROTO error.
>>>>
>>>> Add xhci traces before triggering the issue with:
>>>>
>>>> mount -t debugfs none /sys/kernel/debug
>>>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
>>>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
>>>>
>>>> after issue is triggered save and send the trace at
>>>> /sys/kernel/debug/tracing/trace
>>>> Note that it might be huge
>>>
>>> Thanks for the suggestion.
>>>
>>> Here[1] is (part of) the trace starting about 250 lines before the EPROTO
>>> happens.
>>>
>>> [1]:
>>> https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
>>>
>>> The first error happens at line 243 (timestamp 8144.248398) coinciding
>>> with the start of errors spewed into dmesg:
>>>
>>> [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
>>> [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
>>> [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
>>> [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
>>
>> Thanks,
>> xHC controller reports that there was a transaction error on one of the
>> bulk TRBs.
>>
>> The transaction error causes the endpoint to halt (host side halt only).
>> Xhci driver resets the host side endpoint to recover from the halt,
>> then returns the broken URB (TRB) with -EPROTO status, and then moves past
>> this TRB.
>
> The host side of the endpoint should remain stopped until after the
> URB's completion routine has had a chance to carry out error recovery.
> Doesn't this imply the xHCI driver shouldn't reset the host-side
> endpoint until after the giveback call returns?

True, on xhci side we could probably reset the endpoint, and even move the
dequeue pointer to the next TRB, but make sure the endpoint is not restarted
yet.

The URB with -EPROTO status is given back in interrupt context, so this might
limit a bit what the higher-layer drivers can do in giveback.

Now thinking about it, xhci driver calls the URB giveback in the same
"Transaction error" interrupt handler, after first queuing a reset endpoint
and a set TR Deq pointer command.
The endpoint is only restarted after those commands finish, in the command
completion interrupt handler.

So in that sense the endpoint shouldn't be restarted until the next interrupt
is handled, which shouldn't be possible before the URB giveback call returned
in the previous interrupt handler.

Well, at least not as long as we are in hard interrupt.

I think I need to dig a bit more into this.

>> Interesting thing here is that each TRB in the queue after the transaction
>> error also triggers a transaction error.
>>
>> This might be a data toggle/sequence number sync issue.
>
> It's more likely to be a problem on the device side.  Data toggle or
> sequence number issues tend to be self-repairing (albeit with some data
> loss) after a little while.

Ok, thanks, not spending too much time looking into that then.

>> The host side endpoint reset clears the host side sequence number,
>> and host expects device side endpoint to be reset and sequence to be
>> cleared as well as a result of returning -EPROTO.
>> If I remember correctly xhci driver does not wait for device side endpoint
>> to be reset, so if there are TRBs in the queue they will be transferred,
>> with a cleared sequence number out of sync with the device side.
>
> That's why it's important to wait until after the higher-layer driver
> has had a chance to unlink the URBs that may be in the endpoint queue.
> The driver may even want to reset the device.

Would it make sense to prevent endpoint from running until usb core calls
hcd->driver->endpoint_reset?
That is for halted endpoints, that returned URB with -EPROTO status.

-Mathias   


Re: EPROTO when USB 3 GbE adapters are under load

2018-10-25 Thread Alan Stern
On Thu, 25 Oct 2018, Mathias Nyman wrote:

> On 25.10.2018 12:52, Hao Wei Tee wrote:
> > On 25/10/18 4:45 PM, Mathias Nyman wrote:
> >> Reproducing the issue with a recent kernel with xhci traces enabled should 
> >> show the reason for EPROTO error.
> >>
> >> Add xhci traces before triggering the issue with:
> >>
> >> mount -t debugfs none /sys/kernel/debug
> >> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> >> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> >>
> >> after issue is triggered save and send the trace at 
> >> /sys/kernel/debug/tracing/trace
> >> Note that it might be huge
> > 
> > Thanks for the suggestion.
> > 
> > Here[1] is (part of) the trace starting about 250 lines before the EPROTO 
> > happens.
> > 
> > [1]: 
> > https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
> > 
> > The first error happens at line 243 (timestamp 8144.248398) coinciding with 
> > the start of errors spewed into dmesg:
> > 
> > [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> > [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
> 
> Thanks,
> xHC controller reports that there was a transaction error on one of the bulk 
> TRBs.
> 
> The transaction error causes the endpoint to halt (host side halt only).
> Xhci driver resets the host side endpoint to recover from the halt,
> then returns the broken URB (TRB) with -EPROTO status, and then moves past 
> this TRB.

The host side of the endpoint should remain stopped until after the
URB's completion routine has had a chance to carry out error recovery.
Doesn't this imply the xHCI driver shouldn't reset the host-side
endpoint until after the giveback call returns?

> Interesting thing here is that each TRB in the queue after the transaction
> error also triggers a transaction error.
>  
> This might be a data toggle/sequence number sync issue.

It's more likely to be a problem on the device side.  Data toggle or
sequence number issues tend to be self-repairing (albeit with some data
loss) after a little while.

> The host side endpoint reset clears the host side sequence number,
> and host expects device side endpoint to be reset and sequence to be cleared
> as well as a result of returning -EPROTO.
> If I remember correctly xhci driver does not wait for device side endpoint to
> be reset, so if there are TRBs in the queue they will be transferred, with a
> cleared sequence number out of sync with the device side.

That's why it's important to wait until after the higher-layer driver 
has had a chance to unlink the URBs that may be in the endpoint queue.  
The driver may even want to reset the device.

> There is a patch in usb-next that might help.
> f8f80be xhci: Use soft retry to recover faster from transaction errors
> 
> It soft resets the halted host side endpoint, clears the halt without 
> clearing the sequence number.
> 
> -Mathias

Alan Stern



Re: EPROTO when USB 3 GbE adapters are under load

2018-10-25 Thread Mathias Nyman

On 25.10.2018 12:52, Hao Wei Tee wrote:
> On 25/10/18 4:45 PM, Mathias Nyman wrote:
>> Reproducing the issue with a recent kernel with xhci traces enabled should
>> show the reason for EPROTO error.
>>
>> Add xhci traces before triggering the issue with:
>>
>> mount -t debugfs none /sys/kernel/debug
>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
>>
>> after issue is triggered save and send the trace at
>> /sys/kernel/debug/tracing/trace
>> Note that it might be huge
>
> Thanks for the suggestion.
>
> Here[1] is (part of) the trace starting about 250 lines before the EPROTO
> happens.
>
> [1]:
> https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace
>
> The first error happens at line 243 (timestamp 8144.248398) coinciding with
> the start of errors spewed into dmesg:
>
> [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
> [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71


Thanks,
xHC controller reports that there was a transaction error on one of the bulk 
TRBs.

The transaction error causes the endpoint to halt (host side halt only).
Xhci driver resets the host side endpoint to recover from the halt,
then returns the broken URB (TRB) with -EPROTO status, and then moves past this 
TRB.

Interesting thing here is that each TRB in the queue after the transaction
error also triggers a transaction error.

This might be a data toggle/sequence number sync issue.
The host side endpoint reset clears the host side sequence number,
and host expects device side endpoint to be reset and sequence to be cleared
as well as a result of returning -EPROTO.
If I remember correctly xhci driver does not wait for device side endpoint to
be reset, so if there are TRBs in the queue they will be transferred, with a
cleared sequence number out of sync with the device side.

There is a patch in usb-next that might help.
f8f80be xhci: Use soft retry to recover faster from transaction errors

It soft resets the halted host side endpoint, clears the halt without clearing 
the sequence number.
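
The desync hypothesis above can be illustrated with a toy model. This is only
a sketch of the reasoning, not xHCI or device behaviour: the class and function
names are made up, a sequence mismatch is simply treated as a transaction
error, and no retry or recovery mechanism is modelled at all. The only
protocol fact assumed is that USB 3 data packet sequence numbers wrap at 32.

```python
SEQ_MODULUS = 32  # USB 3 data packet sequence numbers wrap at 32


class ToyEndpoint:
    """Host and device each track the sequence number they expect next."""

    def __init__(self):
        self.host_seq = 0
        self.dev_seq = 0

    def transfer(self):
        """Attempt one packet; return True on success, False on error."""
        if self.host_seq != self.dev_seq:
            return False  # mismatch: modelled here as a transaction error
        self.host_seq = (self.host_seq + 1) % SEQ_MODULUS
        self.dev_seq = (self.dev_seq + 1) % SEQ_MODULUS
        return True

    def host_reset(self, clear_sequence):
        """Host-side-only recovery; the device side is left untouched."""
        if clear_sequence:
            self.host_seq = 0


def run_after_error(clear_sequence, packets_before=5, queued=4):
    """Move some packets, recover on the host side, then run the queue."""
    ep = ToyEndpoint()
    for _ in range(packets_before):
        ep.transfer()
    ep.host_reset(clear_sequence)
    return [ep.transfer() for _ in range(queued)]


print(run_after_error(clear_sequence=True))   # every queued transfer fails
print(run_after_error(clear_sequence=False))  # queued transfers succeed
```

In this model, clearing only the host-side sequence number makes every queued
transfer fail, while a recovery that keeps the sequence number (as the soft
retry patch is described as doing) stays in sync. A real link has recovery
paths that this sketch deliberately leaves out.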

-Mathias


Re: EPROTO when USB 3 GbE adapters are under load

2018-10-25 Thread Hao Wei Tee

On 25/10/18 4:45 PM, Mathias Nyman wrote:
> Reproducing the issue with a recent kernel with xhci traces enabled should
> show the reason for EPROTO error.
>
> Add xhci traces before triggering the issue with:
>
> mount -t debugfs none /sys/kernel/debug
> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
>
> after issue is triggered save and send the trace at
> /sys/kernel/debug/tracing/trace
> Note that it might be huge


Thanks for the suggestion.

Here[1] is (part of) the trace starting about 250 lines before the EPROTO 
happens.

[1]: 
https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace

The first error happens at line 243 (timestamp 8144.248398) coinciding with the 
start of errors spewed into dmesg:

[ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
[ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
[ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
[ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71
...
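
For reference, status -71 in these lines is -EPROTO. A minimal sketch that
pulls the timestamp and status out of such dmesg lines and names the error is
given here. The function name and the regular expression are illustrative
assumptions; the errno table is hard-coded with the Linux asm-generic values
(as used on x86 and ARM) so the lookup does not depend on the machine running
the script.

```python
import re

# Linux errno values from the asm-generic headers, hard-coded so the lookup
# does not depend on the platform running this script.
LINUX_ERRNO = {32: "EPIPE", 71: "EPROTO", 75: "EOVERFLOW", 108: "ESHUTDOWN"}

# Matches lines such as:
#   [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71
STATUS_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(\S+)\s.*status (-?\d+)\s*$")


def parse_status_lines(lines):
    """Return (timestamp, driver, errno_name) for each 'status N' line."""
    results = []
    for line in lines:
        m = STATUS_LINE.match(line)
        if not m:
            continue  # not a status line; ignore it
        ts, driver, status = float(m.group(1)), m.group(2), int(m.group(3))
        name = LINUX_ERRNO.get(abs(status), f"errno {abs(status)}")
        results.append((ts, driver, name))
    return results


log = [
    "[ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71",
    "[ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71",
]
for ts, driver, name in parse_status_lines(log):
    print(f"{ts:.6f} {driver}: {name}")
```

Feeding it the full dmesg output gives the time of the first EPROTO, which can
then be matched against the timestamps in the xhci trace.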

It doesn't seem to point to anything in particular, but I'm not really familiar 
with USB. I'll do some digging in any case...

Thanks!

--
Hao Wei


Re: EPROTO when USB 3 GbE adapters are under load

2018-10-25 Thread Mathias Nyman

On 25.10.2018 06:37, Hao Wei Tee wrote:
> Hi,
>
> There are multiple reports[1][2][3] (more elsewhere on the internet) of USB 3
> GbE adapters throwing EPROTO errors on USB transfer especially when the devices
> are under load. Both of the two common chipsets (Realtek RTL8153 (r8152[4]) and
> Asix AX88179 (ax88179_178a[5])) seem to exhibit this behaviour.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=75381
> [2]: https://bugzilla.kernel.org/show_bug.cgi?id=196747
> [3]: https://bugzilla.kernel.org/show_bug.cgi?id=198931
> [4]:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/r8152.c
> [5]:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/ax88179_178a.c
>
> I'm trying to figure out why this happens (while it doesn't seem to happen on
> other OSes, but I'm not sure). I think it's unlikely that both drivers are
> buggy, so perhaps it is something to do with the USB stack instead of the
> device drivers.
>
> It wouldn't be surprising if both devices actually don't adhere to the USB specs
> properly and other OSes are just more tolerant of that (?) but that is just
> conjecture on my part.
>
> Does anyone have any ideas?


Reproducing the issue with a recent kernel with xhci traces enabled should show 
the reason for EPROTO error.

Add xhci traces before triggering the issue with:

mount -t debugfs none /sys/kernel/debug
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable

after issue is triggered save and send the trace at 
/sys/kernel/debug/tracing/trace
Note that it might be huge
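
The capture steps above can also be sketched as a small Python helper, which
makes the order of the writes explicit. It assumes debugfs is already mounted
at /sys/kernel/debug (the mount step is not repeated) and that it runs as
root. The function names and the out_path parameter are made up for
illustration; only the three file names under the tracing directory come from
the commands above.

```python
import shutil
from pathlib import Path

# Default tracing directory, as used in the commands above.
TRACING_DIR = Path("/sys/kernel/debug/tracing")


def enable_xhci_tracing(tracing_dir=TRACING_DIR, buffer_kb=81920):
    """Enlarge the trace buffer, then enable all xhci-hcd trace events."""
    tracing_dir = Path(tracing_dir)
    (tracing_dir / "buffer_size_kb").write_text(f"{buffer_kb}\n")
    (tracing_dir / "events" / "xhci-hcd" / "enable").write_text("1\n")


def save_trace(out_path, tracing_dir=TRACING_DIR):
    """Copy the current trace to out_path and return its size in bytes."""
    out_path = Path(out_path)
    with open(Path(tracing_dir) / "trace", "rb") as src, \
            open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path.stat().st_size
```

Call enable_xhci_tracing() before reproducing the problem and
save_trace("xhci-trace.txt") once the errors show up; as noted, the resulting
file can be large.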

-Mathias



EPROTO when USB 3 GbE adapters are under load

2018-10-24 Thread Hao Wei Tee

Hi,

There are multiple reports[1][2][3] (more elsewhere on the internet) of USB 3
GbE adapters throwing EPROTO errors on USB transfer especially when the devices
are under load. Both of the two common chipsets (Realtek RTL8153 (r8152[4]) and
Asix AX88179 (ax88179_178a[5])) seem to exhibit this behaviour.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=75381
[2]: https://bugzilla.kernel.org/show_bug.cgi?id=196747
[3]: https://bugzilla.kernel.org/show_bug.cgi?id=198931
[4]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/r8152.c
[5]: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/ax88179_178a.c

I'm trying to figure out why this happens (while it doesn't seem to happen on
other OSes, but I'm not sure). I think it's unlikely that both drivers are
buggy, so perhaps it is something to do with the USB stack instead of the
device drivers.

It wouldn't be surprising if both devices actually don't adhere to the USB specs
properly and other OSes are just more tolerant of that (?) but that is just
conjecture on my part.

Does anyone have any ideas?

Thanks.

--
Hao Wei