Re: USB instabilities with Atheros AR9344
Thank you very much for another detailed reply.

On Sat, Nov 30, 2013 at 4:55 PM, Alan Stern st...@rowland.harvard.edu wrote:
> By device, I mean the piece of hardware that is supposed to reply to the host. In your case that would be the modem (the hub does not make up replies to packets that were sent to the modem). On the other hand, it is true that in some circumstances, problems in the hub could mess up communications between the host and the modem. This could happen if the hub communicates at high speed (480 Mb/s) and the modem communicates at full speed (12 Mb/s).

Thanks for this pointer. It turns out there was one detail I had completely overlooked. Even though I used different modems, they are all based on the same chipset (Qualcomm's MDM9200). Since I did not see this issue with another SoC, I doubt it is a chipset issue, but it is worth investigating. Both the modems and the hub are high-speed (and they are enumerated as high-speed), so that part should be fine.

Thanks again,
Kristian

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
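For readers who want to check the enumerated speeds mentioned above, here is a minimal sketch. It assumes a Linux system that exposes the usual `speed` attribute (in Mb/s) under `/sys/bus/usb/devices`; the function names `usb_speeds` and `mixed_speed` are made up for illustration and do not come from the thread.

```python
from pathlib import Path

# Values of the sysfs "speed" attribute (Mb/s) mapped to the USB speed names.
SPEED_NAMES = {"1.5": "low", "12": "full", "480": "high", "5000": "super"}


def usb_speeds(sysfs_root="/sys/bus/usb/devices"):
    """Return {device_name: speed_name} for every entry with a speed file."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    result = {}
    for entry in sorted(root.iterdir()):
        speed_file = entry / "speed"
        if speed_file.is_file():
            raw = speed_file.read_text().strip()
            # Unknown values are reported as-is rather than guessed at.
            result[entry.name] = SPEED_NAMES.get(raw, raw + " Mb/s")
    return result


def mixed_speed(speeds):
    """True if full-speed and high-speed devices are both present."""
    values = set(speeds.values())
    return "full" in values and "high" in values
```

Calling `usb_speeds()` on the router and passing the result to `mixed_speed()` would flag the high-speed hub plus full-speed modem combination that Alan describes; if every entry reports "high", that particular explanation is unlikely.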
Re: USB instabilities with Atheros AR9344
On Fri, 29 Nov 2013, Kristian Evensen wrote:

> Hi,
>
> Thank you very much for the quick reply.
>
> On Fri, Nov 29, 2013 at 4:13 PM, Alan Stern st...@rowland.harvard.edu wrote:
> > The most common reason for -71 errors is that the device failed to send a reply or handshake packet back to the host. Generally this is caused by a bug in the device's firmware (it can also be caused by unplugging the USB cable, but obviously that didn't happen here). Ideally, if you knew what caused the device to go into this buggy state, you could avoid the situation.
>
> Thanks for the pointer. I have to admit I am a little bit unsure about what you refer to by device, do you mean the modem or the SoC USB hub? As it seems like most of the retransmitted packets are the big ones (1514 bytes), I guess it is the hub that does not ACK?

By device, I mean the piece of hardware that is supposed to reply to the host. In your case that would be the modem (the hub does not make up replies to packets that were sent to the modem). On the other hand, it is true that in some circumstances, problems in the hub could mess up communications between the host and the modem. This could happen if the hub communicates at high speed (480 Mb/s) and the modem communicates at full speed (12 Mb/s).

> > It would not help. Once the device stops replying to the host, it pretty much doesn't matter what you do on the host. The only way to address the problem is to do some sort of error recovery on the device.
>
> One interesting observation is that the modems seem to work fine after this happens. They reattach to the network, switch between UMTS/LTE and so forth.

This may indicate the problem is in the hub. But if it is, there isn't much you can do about it.

> > You could try doing a USB reset of the device. Of course, this is likely to cause the device to lose all its settings, so it may end up being worse than the original problem.
>
> Thanks, this is what we are currently experimenting with. Since it seems to be a bug in the device, we made a quick hack where we monitor the output from the kernel and reset USB as soon as -71 is seen. We have also patched the qh_completions() function to drop packets where qtd->length - QTD_LENGTH(token) == 0, to shorten the wait for the USB reset to be detected. After a reset, the USB hub + modems work fine.
>
> One thing I have noticed is that when this error occurs with option_serial, a USB reset (using GPIO) is detected immediately. This is not the case with QMI modems, which use cdc_wdm; they hang until the packets on the queue have been retransmitted (and we have disconnected the devices). Is this expected behavior?

I have no idea. I don't know how the various CDC drivers work.

Alan Stern
USB instabilities with Atheros AR9344
Hello,

I am currently working on an embedded project based on the Atheros AR9344 SoC. As a prototype device, we are using the TP-Link TL-WDR4300 router (http://wiki.openwrt.org/toh/tp-link/tl-wdr4300) and the latest OpenWRT trunk. The kernel is 3.10.18.

Over the last couple of weeks we have experienced a USB problem that we have not been able to solve. The USB hub works fine most of the time, but when event X happens, USB becomes unusable for extended periods of time. We have to disable/enable the power on the USB port (using GPIO) and then wait until a timeout expires/queue is flushed.

The devices we have been able to trigger event X with are different 3G/LTE modems. We have not been able to figure out exactly what triggers the event, but it happens when we move into areas with poor or no coverage and then move back into coverage. We see the error with QMI modems (qmi_wwan driver), AT modems (option_serial driver) and WebUI modems (cdc_ether driver). When looking in dmesg after this event has happened, the following messages appear, depending on the modem type:

QMI:
Thu Nov 21 09:44:53 2013 kern.err kernel: [ 490.60] qmi_wwan 1-1.1.2:1.4: nonzero urb status received: -71
Thu Nov 21 09:44:53 2013 kern.err kernel: [ 490.60] qmi_wwan 1-1.1.2:1.4: wdm_int_callback - 0 bytes

Serial:
[62979.28] option1 ttyUSB7: option_instat_callback: error -71

WebUI:
[ 1192.68] hub 1-1:1.0: cannot reset port 1 (err = -71)
[ 1192.69] hub 1-1:1.0: Cannot enable port 1. Maybe the USB cable is bad?

The common denominator seems to be the -71 error code, which is a generic Protocol Error if I have understood correctly. When I search for this error code, it seems that most problems have been due to power. However, this does not seem to be the issue here. The modems are connected to an active hub and event X happens with only a single modem connected, so it seems unlikely that it is power.

In order to rule out the TP-Link router, we have also tested with another router based on the same SoC (Netgear WNDR4300). The same issue is seen. We also made some tests on a device with a different SoC (Raspberry Pi, BCM2835) and do not see this issue.

We have mostly focused on the QMI modems, and when using dynamic debugging, dmesg also contains these errors (repeated many times):

[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/1514 retry 26
[ 1911.20] ehci-platform ehci-platform: detected XactErr len 0/64 retry 14

Each packet is, as expected, retried 32 times. The data we sent when these messages appeared was normal TCP traffic, which explains the packet sizes.

If we leave the router alone long enough, it is able to restart the modems (they disconnect and then connect). However, this can take many minutes (I guess the packet queue has to be flushed?), and while this happens the USB hub is blocked (no traffic can pass through it).

When running usbmon, we see the following around the time of the crash (with a QMI modem):

86abea80 1428742032 S Bi:1:115:7 -150 1514
86abeb00 1428801536 C Bi:1:115:7 0 226 = 024b322c fd930250 f300 08004500 00d4bba7 4000fd06 08728027 245d2e0f
86abeb00 1428801554 S Bi:1:115:7 -150 1514
84895c00 1428802518 S Bo:1:115:5 -150 66 = 0250f300 024b 322cfd93 08004500 00349c42 40003f06 e6772e0f e6768027
84895c00 1428802660 C Bo:1:115:5 0 66
86abeb80 1428982112 C Bi:1:115:7 0 1354 = 024b322c fd930250 f300 08004500 053cbbaa 4000fd06 04078027 245d2e0f
86abeb80 1428982141 S Bi:1:115:7 -150 1514
86abec00 1429021624 C Bi:1:115:7 0 226 = 024b322c fd930250 f300 08004500 00d4bbab 4000fd06 086e8027 245d2e0f
86abec00 1429021653 S Bi:1:115:7 -150 1514
84895480 1429022660 S Bo:1:115:5 -150 66 = 0250f300 024b 322cfd93 08004500 00349c43 40003f06 e6762e0f e6768027
84895480 1429022746 C Bo:1:115:5 0 66
86b1dc00 1430690752 C Ii:1:115:6 0:16 8 = a101 0400
86b03d80 1430690765 S Ci:1:115:0 s a1 01 0004 1000 4096
86b1dc00 1430690787 S Ii:1:115:6 -150:16 64
86b03d80 1430691369 C Ci:1:115:0 0 39 = 01260080 03010400 0024001a 001e0400 9f0c 1d0200db 0e110200 01050106
86abec80 1430896349 C Bi:1:115:7 -71 0
84895800 1431014639 S Bi:1:115:7 -150 1514
86abed00 1431066817 C Bi:1:115:7 -71 0
84895480 1431184603 S Bi:1:115:7 -150 1514
86abed80 1431307124 C Bi:1:115:7 -71 0
86b03c00 1431330567 S Co:1:115:0 s 21 00 0004 0012 18 = 0111 0301 0125 00100200 ff00
86b03c00 1431331498 C Co:1:115:0 0 18
86b1dc00 1431332988 C Ii:1:115:6 0:16 8 = a101 0400
86b03d80 1431332996 S Ci:1:115:0 s a1 01 0004 1000 4096
86b1dc00 1431333012 S Ii:1:115:6 -150:16 64
86b03d80 1431333484 C Ci:1:115:0 0 58 = 01390080 03010200 0120002d 00020400 01020092 05110400 01006e05
86b03c00 1431346879 S Co:1:115:0 s 21 00 0004 000d 13 = 010c 0301 004d 00
86b03c00 1431347879 C Co:1:115:0 0 13
86b1dc00 1431348994 C Ii:1:115:6 0:16 8 = a101 0400
86b03d80 1431349002 S Ci:1:115:0 s a1 01 0004 1000 4096
86b1dc00 1431349021 S Ii:1:115:6 -150:16
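A trace like the one above gets long quickly, so it can help to filter it mechanically. The following is only a sketch: it assumes the usbmon text layout seen in this trace (URB tag, timestamp, event type, address word, status), and the helper names are invented for illustration.

```python
from collections import Counter

EPROTO = 71  # Linux errno that shows up as "-71" in the trace


def parse_usbmon_line(line):
    """Parse the first five words of one usbmon text record.

    Returns None for lines that do not look like a record.
    """
    fields = line.split()
    if len(fields) < 5 or not fields[1].isdigit():
        return None
    tag, timestamp, event, address, status_word = fields[:5]
    try:
        # Interrupt URBs carry "status:interval" (for example "0:16").
        status = int(status_word.split(":")[0])
    except ValueError:
        # Control submissions have a setup tag ("s") here, not a status.
        status = None
    return {"tag": tag, "timestamp_us": int(timestamp), "event": event,
            "address": address, "status": status}


def failed_completions(lines, errno_value=EPROTO):
    """Return the completion ("C") records whose status is -errno_value."""
    failed = []
    for line in lines:
        record = parse_usbmon_line(line)
        if record and record["event"] == "C" and record["status"] == -errno_value:
            failed.append(record)
    return failed


def failures_per_endpoint(lines):
    """Count failed completions per address word (type:bus:device:endpoint)."""
    return Counter(record["address"] for record in failed_completions(lines))
```

Run over the trace in this message, `failures_per_endpoint` reports three failed completions, all on `Bi:1:115:7`, i.e. the bulk IN endpoint 7 of device 115 on bus 1, while the control and interrupt endpoints of the same device keep completing with status 0.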
Re: USB instabilities with Atheros AR9344
On Fri, 29 Nov 2013, Kristian Evensen wrote:

> Hello,
>
> I am currently working on an embedded project based on the Atheros AR9344 SoC. As a prototype device, we are using the TP-Link TL-WDR4300 router (http://wiki.openwrt.org/toh/tp-link/tl-wdr4300) and the latest OpenWRT trunk. The kernel is 3.10.18.
>
> Over the last couple of weeks we have experienced a USB problem that we have not been able to solve. The USB hub works fine most of the time, but when event X happens, USB becomes unusable for extended periods of time. We have to disable/enable the power on the USB port (using GPIO) and then wait until a timeout expires/queue is flushed.
>
> The devices we have been able to trigger event X with are different 3G/LTE modems. We have not been able to figure out exactly what triggers the event, but it happens when we move into areas with poor or no coverage and then move back into coverage. We see the error with QMI modems (qmi_wwan driver), AT modems (option_serial driver) and WebUI modems (cdc_ether driver). When looking in dmesg after this event has happened, the following messages appear, depending on the modem type:
>
> QMI:
> Thu Nov 21 09:44:53 2013 kern.err kernel: [ 490.60] qmi_wwan 1-1.1.2:1.4: nonzero urb status received: -71
> Thu Nov 21 09:44:53 2013 kern.err kernel: [ 490.60] qmi_wwan 1-1.1.2:1.4: wdm_int_callback - 0 bytes
>
> Serial:
> [62979.28] option1 ttyUSB7: option_instat_callback: error -71
>
> WebUI:
> [ 1192.68] hub 1-1:1.0: cannot reset port 1 (err = -71)
> [ 1192.69] hub 1-1:1.0: Cannot enable port 1. Maybe the USB cable is bad?
>
> The common denominator seems to be the -71 error code, which is a generic Protocol Error if I have understood correctly. When I search for this error code, it seems that most problems have been due to power. However, this does not seem to be the issue here. The modems are connected to an active hub and event X happens with only a single modem connected, so it seems unlikely that it is power.

The most common reason for -71 errors is that the device failed to send a reply or handshake packet back to the host. Generally this is caused by a bug in the device's firmware (it can also be caused by unplugging the USB cable, but obviously that didn't happen here). Ideally, if you knew what caused the device to go into this buggy state, you could avoid the situation.

> My question is, has anyone experienced anything similar and know how to solve this problem, or have any ideas on how to proceed? Since the error seems to be independent of drivers, I guess it points to this being hardware related. Would for example reducing QH_XACTERR_MAX be a possible (temporary) solution,

It would not help. Once the device stops replying to the host, it pretty much doesn't matter what you do on the host. The only way to address the problem is to do some sort of error recovery on the device.

> or are there any ways to flush this queue once we see the error? The most critical part for us is that USB is blocked for such extended periods of time.

You could try doing a USB reset of the device. Of course, this is likely to cause the device to lose all its settings, so it may end up being worse than the original problem.

Alan Stern
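One way to issue such a reset from userspace is the usbfs `USBDEVFS_RESET` ioctl. The sketch below is an assumption about how that could be done, not what was used in this thread (the reporter power-cycles the port through GPIO instead). It needs root, and the bus and device numbers must be looked up at run time because they change whenever the device re-enumerates.

```python
import fcntl
import os

# _IO('U', 20) from <linux/usbdevice_fs.h>
USBDEVFS_RESET = 0x5514


def usbfs_path(busnum, devnum):
    """Build the usbfs node path, e.g. bus 1, device 115 -> /dev/bus/usb/001/115."""
    return "/dev/bus/usb/%03d/%03d" % (busnum, devnum)


def reset_usb_device(busnum, devnum):
    """Ask the kernel to reset one USB device; raises OSError on failure."""
    fd = os.open(usbfs_path(busnum, devnum), os.O_WRONLY)
    try:
        fcntl.ioctl(fd, USBDEVFS_RESET, 0)
    finally:
        os.close(fd)
```

For the modem in the usbmon trace (bus 1, device 115) the call would be `reset_usb_device(1, 115)`. As noted above, the device is likely to lose its settings, so whatever configured the modem would have to run again afterwards.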
Re: USB instabilities with Atheros AR9344
Hi,

Thank you very much for the quick reply.

On Fri, Nov 29, 2013 at 4:13 PM, Alan Stern st...@rowland.harvard.edu wrote:
> The most common reason for -71 errors is that the device failed to send a reply or handshake packet back to the host. Generally this is caused by a bug in the device's firmware (it can also be caused by unplugging the USB cable, but obviously that didn't happen here). Ideally, if you knew what caused the device to go into this buggy state, you could avoid the situation.

Thanks for the pointer. I have to admit I am a little bit unsure about what you refer to by device, do you mean the modem or the SoC USB hub? As it seems like most of the retransmitted packets are the big ones (1514 bytes), I guess it is the hub that does not ACK?

> It would not help. Once the device stops replying to the host, it pretty much doesn't matter what you do on the host. The only way to address the problem is to do some sort of error recovery on the device.

One interesting observation is that the modems seem to work fine after this happens. They reattach to the network, switch between UMTS/LTE and so forth.

> You could try doing a USB reset of the device. Of course, this is likely to cause the device to lose all its settings, so it may end up being worse than the original problem.

Thanks, this is what we are currently experimenting with. Since it seems to be a bug in the device, we made a quick hack where we monitor the output from the kernel and reset USB as soon as -71 is seen. We have also patched the qh_completions() function to drop packets where qtd->length - QTD_LENGTH(token) == 0, to shorten the wait for the USB reset to be detected. After a reset, the USB hub + modems work fine.

One thing I have noticed is that when this error occurs with option_serial, a USB reset (using GPIO) is detected immediately. This is not the case with QMI modems, which use cdc_wdm; they hang until the packets on the queue have been retransmitted (and we have disconnected the devices). Is this expected behavior?

Thanks again for the help!

-Kristian
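The "quick hack" itself is not shown in the thread. A minimal sketch of the monitoring half might look like the following; the pattern is derived from the three dmesg formats quoted earlier, and `watch`, `on_error` and the cooldown are hypothetical names and choices, not the reporter's code.

```python
import re
import time

# Matches the three -71 forms seen in this thread:
# "status received: -71", "error -71" and "(err = -71)".
EPROTO_PATTERN = re.compile(r"(?:status received:|error|err =)\s*-71\b")


def watch(lines, on_error, cooldown_s=30.0, clock=time.monotonic):
    """Call on_error(line) for kernel log lines that report -71.

    Matches arriving within cooldown_s of the last trigger are skipped,
    on the assumption that a reset is already in progress.
    Returns the number of times on_error was called.
    """
    last_trigger = None
    triggers = 0
    for line in lines:
        if not EPROTO_PATTERN.search(line):
            continue
        now = clock()
        if last_trigger is not None and now - last_trigger < cooldown_s:
            continue
        last_trigger = now
        on_error(line)
        triggers += 1
    return triggers
```

In practice `lines` could be any iterable of kernel log lines, for example the output of a log follower piped into the script's standard input, and `on_error` would toggle the GPIO that powers the USB port. The cooldown is there because a single failure produces a burst of -71 messages, as the XactErr retries in the original report suggest.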