Re: USB instabilities with Atheros AR9344

2013-12-01 Thread Kristian Evensen
Thank you very much for another detailed reply.

On Sat, Nov 30, 2013 at 4:55 PM, Alan Stern st...@rowland.harvard.edu wrote:
 By device, I mean the piece of hardware that is supposed to reply to
 the host.  In your case that would be the modem (the hub does not make
 up replies to packets that were sent to the modem).

 On the other hand, it is true that in some circumstances, problems in
 the hub could mess up communications between the host and the modem.
 This could happen if the hub communicates at high speed (480 Mb/s) and
 the modem communicates at full speed (12 Mb/s).

Thanks for this pointer. It turns out there was one detail I had
completely overlooked. Even though I used different modems, they are
all based on the same chipset (Qualcomm's MDM9200). Since I did not
see this issue with another SoC, I doubt it is a chipset issue, but
it is worth investigating. Both modems and hub are high-speed (and
they are enumerated as high-speed), so that part should be fine.
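
One way to double-check the enumerated speeds is to read the "speed" attribute that Linux exposes for each USB device under /sys/bus/usb/devices. The following is only a sketch and not something from this thread: the helper name usb_speeds and its sysfs_root parameter are made up for illustration, and it assumes a Linux host with sysfs mounted in the usual place.

```python
from pathlib import Path


def usb_speeds(sysfs_root="/sys/bus/usb/devices"):
    """Return {device_name: speed_string} for every entry that has a 'speed' file.

    Interface entries (names such as '1-1:1.0') carry no 'speed' attribute
    and are therefore skipped.
    """
    speeds = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return speeds
    for entry in sorted(root.iterdir()):
        speed_file = entry / "speed"
        if speed_file.is_file():
            # The kernel reports Mb/s: "1.5" low, "12" full, "480" high speed.
            speeds[entry.name] = speed_file.read_text().strip()
    return speeds
```

Calling usb_speeds() on the router and checking that both the hub and the modem report "480" would confirm that no high-speed to full-speed translation is involved.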

Thanks again,
Kristian
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB instabilities with Atheros AR9344

2013-11-30 Thread Alan Stern
On Fri, 29 Nov 2013, Kristian Evensen wrote:

 Hi,
 
 Thank you very much for the quick reply.
 
 On Fri, Nov 29, 2013 at 4:13 PM, Alan Stern st...@rowland.harvard.edu wrote:
  The most common reason for -71 errors is that the device failed to send
  a reply or handshake packet back to the host.  Generally this is caused
  by a bug in the device's firmware (it can also be caused by unplugging
  the USB cable, but obviously that didn't happen here).  Ideally, if you
  knew what caused the device to go into this buggy state, you could
  avoid the situation.
 
 Thanks for the pointer. I have to admit I am a little bit unsure about
 what you refer to by device, do you mean the modem or the SoC USB hub?
 As it seems like most of the retransmitted packets are the big ones
 (1514 bytes), I guess it is the hub that does not ACK?

By device, I mean the piece of hardware that is supposed to reply to
the host.  In your case that would be the modem (the hub does not make
up replies to packets that were sent to the modem).

On the other hand, it is true that in some circumstances, problems in 
the hub could mess up communications between the host and the modem.  
This could happen if the hub communicates at high speed (480 Mb/s) and 
the modem communicates at full speed (12 Mb/s).

  It would not help.  Once the device stops replying to the host, it
  pretty much doesn't matter what you do on the host.  The only way to
  address the problem is to do some sort of error recovery on the device.
 
 One interesting observation is that the modems seem to work fine
 after this happens. They reattach to the network, switch between
 UMTS/LTE and so forth.

This may indicate the problem is in the hub.  But if it is, there isn't
much you can do about it.

  You could try doing a USB reset of the device.  Of course, this is
  likely to cause the device to lose all its settings, so it may end up
  being worse than the original problem.
 
 Thanks, this is what we are currently experimenting with. Since it
 seems to be a bug in the device, we made a quick hack where we monitor
 the output from the kernel and reset USB as soon as -71 is seen. We
 have also patched the qh_completions() function to drop packets where
 qtd->length - QTD_LENGTH(token) == 0, to shorten the wait for the USB
 reset to be detected. After a reset, the USB hub + modems work fine.
 
 One thing I have noticed is that when this error occurs with
 option_serial, a usb reset (by using gpio) is detected immediately.
 This is not the case with qmi-modems, which use cdc_wdm; they hang
 until packets on the queue have been retransmitted (and we have
 disconnected the devices). Is this expected behavior?

I have no idea.  I don't know how the various CDC drivers work.

Alan Stern



USB instabilities with Atheros AR9344

2013-11-29 Thread Kristian Evensen
Hello,

I am currently working on an embedded project based on the Atheros
AR9344 SoC. As a prototype device, we are using the TP-Link TL-WDR4300
router (http://wiki.openwrt.org/toh/tp-link/tl-wdr4300) and latest
OpenWRT trunk. The kernel is 3.10.18.

We have over the last couple of weeks experienced a USB problem that
we have not been able to solve. The USB hub works fine most of the
time, but when event X happens, USB becomes unusable for extended
periods of time. We have to disable/enable the power on the USB port
(using GPIO) and then wait until a timeout expires/queue is flushed.

The devices we have been able to trigger event X with are different
3G/LTE modems. We have not been able to figure out exactly what
triggers the event, but it happens when we move into areas with poor
or no coverage and then move back into coverage. We see the error both
with QMI-modems (qmi_wwan driver), AT-modems (option_serial driver)
and WebUI-modems (cdc_ether driver). When looking in dmesg after this
event has happened, the following messages appear based on the modem
type:

QMI:
Thu Nov 21 09:44:53 2013 kern.err kernel: [  490.60] qmi_wwan 1-1.1.2:1.4: nonzero urb status received: -71
Thu Nov 21 09:44:53 2013 kern.err kernel: [  490.60] qmi_wwan 1-1.1.2:1.4: wdm_int_callback - 0 bytes

Serial:
[62979.28] option1 ttyUSB7: option_instat_callback: error -71

WebUI:
[ 1192.68] hub 1-1:1.0: cannot reset port 1 (err = -71)
[ 1192.69] hub 1-1:1.0: Cannot enable port 1.  Maybe the USB cable is bad?

The common denominator seems to be the -71 error code, which is a
generic Protocol Error if I have understood correctly. When I search
for this error code, it seems that most problems have been due to
power. However, this does not seem to be the issue here. The modems are
connected to an active hub and event X happens with only a single
modem connected, so it seems unlikely that it is power.

In order to rule out the TP-Link router, we have also tested with
another router based on the same SoC (Netgear WNDR4300). The same
issue is seen. We also made some tests on a device with a different
SoC (Raspberry Pi, BCM2835) and do not see this issue.

We have mostly focused on the QMI modems and when using dynamic
debugging, dmesg also contains these errors (repeated many times):
[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/1514 retry 26
[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/64 retry 14

Each packet is, as expected, retried 32 times. The data we sent when
these messages appeared was normal TCP traffic, which explains the
packet sizes. If we leave the router alone long enough, it is able to
restart the modems (they disconnect and then connect). However, this
can take many minutes (I guess the packet queue has to be flushed?),
and while this happens the USB hub is blocked (no traffic can pass
through it).
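
To see which transfers are being retried and how far each one gets, the XactErr debug lines can be summarised with a few lines of Python. This is only a sketch under stated assumptions: the function name summarize_xacterr is made up for illustration, the input is assumed to be unwrapped dmesg lines in the format shown above, and the two numbers after "len" are taken to be bytes transferred so far and bytes requested.

```python
import re
from collections import defaultdict

# Matches e.g. "detected XactErr len 0/1514 retry 26"
XACTERR_RE = re.compile(r"detected XactErr len (\d+)/(\d+) retry (\d+)")


def summarize_xacterr(lines):
    """Map each requested transfer length to the highest retry count seen."""
    worst = defaultdict(int)
    for line in lines:
        match = XACTERR_RE.search(line)
        if match:
            _done, total, retry = (int(g) for g in match.groups())
            worst[total] = max(worst[total], retry)
    return dict(worst)


sample = [
    "[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/1514 retry 26",
    "[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/64 retry 14",
]
print(summarize_xacterr(sample))  # {1514: 26, 64: 14}
```

If every requested length climbs toward the retry limit while the transferred byte count stays at 0, that is consistent with the endpoint not answering at all rather than with occasional corruption.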

When running usbmon, we see the following around the time of the crash
(with QMI modem):

86abea80 1428742032 S Bi:1:115:7 -150 1514 
86abeb00 1428801536 C Bi:1:115:7 0 226 = 024b322c fd930250 f300
08004500 00d4bba7 4000fd06 08728027 245d2e0f
86abeb00 1428801554 S Bi:1:115:7 -150 1514 
84895c00 1428802518 S Bo:1:115:5 -150 66 = 0250f300 024b 322cfd93
08004500 00349c42 40003f06 e6772e0f e6768027
84895c00 1428802660 C Bo:1:115:5 0 66 
86abeb80 1428982112 C Bi:1:115:7 0 1354 = 024b322c fd930250 f300
08004500 053cbbaa 4000fd06 04078027 245d2e0f
86abeb80 1428982141 S Bi:1:115:7 -150 1514 
86abec00 1429021624 C Bi:1:115:7 0 226 = 024b322c fd930250 f300
08004500 00d4bbab 4000fd06 086e8027 245d2e0f
86abec00 1429021653 S Bi:1:115:7 -150 1514 
84895480 1429022660 S Bo:1:115:5 -150 66 = 0250f300 024b 322cfd93
08004500 00349c43 40003f06 e6762e0f e6768027
84895480 1429022746 C Bo:1:115:5 0 66 
86b1dc00 1430690752 C Ii:1:115:6 0:16 8 = a101 0400
86b03d80 1430690765 S Ci:1:115:0 s a1 01  0004 1000 4096 
86b1dc00 1430690787 S Ii:1:115:6 -150:16 64 
86b03d80 1430691369 C Ci:1:115:0 0 39 = 01260080 03010400 0024001a
001e0400 9f0c 1d0200db 0e110200 01050106
86abec80 1430896349 C Bi:1:115:7 -71 0
84895800 1431014639 S Bi:1:115:7 -150 1514 
86abed00 1431066817 C Bi:1:115:7 -71 0
84895480 1431184603 S Bi:1:115:7 -150 1514 
86abed80 1431307124 C Bi:1:115:7 -71 0
86b03c00 1431330567 S Co:1:115:0 s 21 00  0004 0012 18 = 0111
0301 0125 00100200 ff00
86b03c00 1431331498 C Co:1:115:0 0 18 
86b1dc00 1431332988 C Ii:1:115:6 0:16 8 = a101 0400
86b03d80 1431332996 S Ci:1:115:0 s a1 01  0004 1000 4096 
86b1dc00 1431333012 S Ii:1:115:6 -150:16 64 
86b03d80 1431333484 C Ci:1:115:0 0 58 = 01390080 03010200 0120002d
00020400  01020092 05110400 01006e05
86b03c00 1431346879 S Co:1:115:0 s 21 00  0004 000d 13 = 010c
0301 004d 00
86b03c00 1431347879 C Co:1:115:0 0 13 
86b1dc00 1431348994 C Ii:1:115:6 0:16 8 = a101 0400
86b03d80 1431349002 S Ci:1:115:0 s a1 01  0004 1000 4096 
86b1dc00 1431349021 S Ii:1:115:6 -150:16 
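
A trace like the one above is easier to read once the failed completions are counted per endpoint. The sketch below is not part of the original report. It assumes the usbmon text format (URB tag, timestamp in microseconds, event type, address word, status word) with each event on one unwrapped line, and the names UsbmonEvent, parse_usbmon_line and count_failed_completions are made up for illustration.

```python
from collections import Counter
from typing import NamedTuple, Optional

EPROTO = -71  # the status value reported in this thread


class UsbmonEvent(NamedTuple):
    tag: str
    timestamp_us: int
    event: str             # 'S' submission, 'C' callback, 'E' submission error
    xfer: str              # e.g. 'Bi' bulk in, 'Co' control out, 'Ii' interrupt in
    bus: int
    device: int
    endpoint: int
    status: Optional[int]  # None when the status word is a setup tag such as 's'


def parse_usbmon_line(line):
    """Parse the first five fields of a usbmon text line, or return None."""
    fields = line.split()
    if len(fields) < 5 or fields[2] not in ("S", "C", "E"):
        return None  # continuation line with data words, or truncated line
    parts = fields[3].split(":")
    if len(parts) != 4:
        return None
    try:
        timestamp = int(fields[1])
        bus, device, endpoint = (int(p) for p in parts[1:])
    except ValueError:
        return None
    try:
        # Interrupt URBs append ":interval" to the status, e.g. "-150:16".
        status = int(fields[4].split(":")[0])
    except ValueError:
        status = None
    return UsbmonEvent(fields[0], timestamp, fields[2], parts[0],
                       bus, device, endpoint, status)


def count_failed_completions(lines, status=EPROTO):
    """Count completions with the given status, keyed by (transfer type, endpoint)."""
    failed = Counter()
    for line in lines:
        ev = parse_usbmon_line(line)
        if ev and ev.event == "C" and ev.status == status:
            failed[(ev.xfer, ev.endpoint)] += 1
    return failed


sample = [
    "86b1dc00 1430690752 C Ii:1:115:6 0:16 8 = a101 0400",
    "86abec80 1430896349 C Bi:1:115:7 -71 0",
    "84895800 1431014639 S Bi:1:115:7 -150 1514",
    "86abed00 1431066817 C Bi:1:115:7 -71 0",
]
print(count_failed_completions(sample))
```

On the four sample lines, taken from the trace above, this reports two -71 completions on bulk-in endpoint 7 and none on the interrupt endpoint. In the excerpt itself, control and interrupt transfers to device 115 keep completing with status 0 after the first -71.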

Re: USB instabilities with Atheros AR9344

2013-11-29 Thread Alan Stern
On Fri, 29 Nov 2013, Kristian Evensen wrote:

 Hello,
 
 I am currently working on an embedded project based on the Atheros
 AR9344 SoC. As a prototype device, we are using the TP-Link TL-WDR4300
 router (http://wiki.openwrt.org/toh/tp-link/tl-wdr4300) and latest
 OpenWRT trunk. The kernel is 3.10.18.
 
 We have over the last couple of weeks experienced a USB problem that
 we have not been able to solve. The USB hub works fine most of the
 time, but when event X happens, USB becomes unusable for extended
 periods of time. We have to disable/enable the power on the USB port
 (using GPIO) and then wait until a timeout expires/queue is flushed.
 
 The devices we have been able to trigger event X with are different
 3G/LTE modems. We have not been able to figure out exactly what
 triggers the event, but it happens when we move into areas with poor
 or no coverage and then move back into coverage. We see the error both
 with QMI-modems (qmi_wwan driver), AT-modems (option_serial driver)
 and WebUI-modems (cdc_ether driver). When looking in dmesg after this
 event has happened, the following messages appear based on the modem
 type:
 
 QMI:
 Thu Nov 21 09:44:53 2013 kern.err kernel: [  490.60] qmi_wwan 1-1.1.2:1.4: nonzero urb status received: -71
 Thu Nov 21 09:44:53 2013 kern.err kernel: [  490.60] qmi_wwan 1-1.1.2:1.4: wdm_int_callback - 0 bytes
 
 Serial:
 [62979.28] option1 ttyUSB7: option_instat_callback: error -71
 
 WebUI:
 [ 1192.68] hub 1-1:1.0: cannot reset port 1 (err = -71)
 [ 1192.69] hub 1-1:1.0: Cannot enable port 1.  Maybe the USB cable is bad?
 
 The common denominator seems to be the -71 error code, which is a
 generic Protocol Error if I have understood correctly. When I search
 for this error code, it seems that most problems have been due to
 power. However, this does not seem to be the issue here. The modems are
 connected to an active hub and event X happens with only a single
 modem connected, so it seems unlikely that it is power.

The most common reason for -71 errors is that the device failed to send
a reply or handshake packet back to the host.  Generally this is caused
by a bug in the device's firmware (it can also be caused by unplugging
the USB cable, but obviously that didn't happen here).  Ideally, if you 
knew what caused the device to go into this buggy state, you could 
avoid the situation.

 My question is, has anyone experienced anything similar and know how
 to solve this problem, or have any ideas on how to proceed? Since the
 error seems to be independent of drivers, I guess it points to this
 being hardware related. Would for example reducing QH_XACTERR_MAX be a
 possible (temporary) solution,

It would not help.  Once the device stops replying to the host, it 
pretty much doesn't matter what you do on the host.  The only way to 
address the problem is to do some sort of error recovery on the device.

 or are there any ways to flush this
 queue once we see the error? The most critical part for us is that USB
 is blocked for such extended periods of time.

You could try doing a USB reset of the device.  Of course, this is 
likely to cause the device to lose all its settings, so it may end up 
being worse than the original problem.

Alan Stern



Re: USB instabilities with Atheros AR9344

2013-11-29 Thread Kristian Evensen
Hi,

Thank you very much for the quick reply.

On Fri, Nov 29, 2013 at 4:13 PM, Alan Stern st...@rowland.harvard.edu wrote:
 The most common reason for -71 errors is that the device failed to send
 a reply or handshake packet back to the host.  Generally this is caused
 by a bug in the device's firmware (it can also be caused by unplugging
 the USB cable, but obviously that didn't happen here).  Ideally, if you
 knew what caused the device to go into this buggy state, you could
 avoid the situation.

Thanks for the pointer. I have to admit I am a little bit unsure about
what you refer to by device, do you mean the modem or the SoC USB hub?
As it seems like most of the retransmitted packets are the big ones
(1514 bytes), I guess it is the hub that does not ACK?

 It would not help.  Once the device stops replying to the host, it
 pretty much doesn't matter what you do on the host.  The only way to
 address the problem is to do some sort of error recovery on the device.

One interesting observation is that the modems seem to work fine
after this happens. They reattach to the network, switch between
UMTS/LTE and so forth.

 You could try doing a USB reset of the device.  Of course, this is
 likely to cause the device to lose all its settings, so it may end up
 being worse than the original problem.

Thanks, this is what we are currently experimenting with. Since it
seems to be a bug in the device, we made a quick hack where we monitor
the output from the kernel and reset USB as soon as -71 is seen. We
have also patched the qh_completions() function to drop packets where
qtd->length - QTD_LENGTH(token) == 0, to shorten the wait for the USB
reset to be detected. After a reset, the USB hub + modems work fine.
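
The monitoring part of such a hack could look roughly like the sketch below. It is not the code used here, only an illustration: watch_for_eproto and the reset callback are hypothetical names, the regular expression covers just the three -71 message formats quoted in this thread, and how the reset is actually carried out (for example toggling port power via GPIO) is left to the caller.

```python
import re

# Covers "status received: -71", "error -71" and "(err = -71)".
EPROTO_RE = re.compile(r"(?:status received:|error|err =)\s*-71\b")


def watch_for_eproto(lines, reset):
    """Call reset(line) for the first kernel log line reporting -71, then stop.

    lines: any iterable of log lines (a file object, a pipe, a list).
    reset: caller-supplied callable that performs the actual USB reset.
    Returns the matching line, or None if the input ended without a match.
    """
    for line in lines:
        if EPROTO_RE.search(line):
            reset(line)
            return line
    return None


sample = [
    "[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/1514 retry 26",
    "[62979.28] option1 ttyUSB7: option_instat_callback: error -71",
]
watch_for_eproto(sample, lambda line: print("would reset USB after:", line))
```

Returning after the first hit keeps the sketch from firing repeated resets while the same burst of errors is still being logged; the caller can restart the watch once the devices have re-enumerated.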

One thing I have noticed is that when this error occurs with
option_serial, a usb reset (by using gpio) is detected immediately.
This is not the case with qmi-modems, which use cdc_wdm; they hang
until packets on the queue have been retransmitted (and we have
disconnected the devices). Is this expected behavior?

Thanks again for the help!

-Kristian