[PATCH v2] usb: Quiet down false peer failure messages
My recent Intel box is spewing these messages: xhci_hcd :00:14.0: xHCI Host Controller xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2 usb usb2: New USB device found, idVendor=1d6b, idProduct=0003 usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 usb usb2: Product: xHCI Host Controller usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd usb usb2: SerialNumber: :00:14.0 hub 2-0:1.0: USB hub found hub 2-0:1.0: 6 ports detected usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) (usb1-port1:usb2-port1) usb usb2-port2: failed to peer to usb1-port1 (-16) usb: port power management may be unreliable usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) (usb1-port1:usb2-port1) usb usb2-port3: failed to peer to usb1-port1 (-16) usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) (usb1-port1:usb2-port1) usb usb2-port5: failed to peer to usb1-port1 (-16) usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) (usb1-port1:usb2-port1) usb usb2-port6: failed to peer to usb1-port1 (-16) Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the XHCI hub has 8 ports. Most of those ports are of connect type USB_PORT_NOT_USED (including port 1 of the EHCI hub). Further the unused ports have location data initialized to 0x8000. Now each unused port on the xhci hub walks the port list and finds a matching peer with port1 of the EHCI hub because the zero'd out group id bits falsely match. After port1 of the XHCI hub, each following matching peer will generate the above warning. These warnings seem to be harmless for this scenario as I don't think it matters that unused ports could not create a peer link. The attached patch utilizes that assumption and just turns the pr_warn into pr_debug to quiet things down. Tested on my Intel box. Signed-off-by: Don Zickus --- v2: modified patch to just quiet down the warns instead of skipping the ports --- drivers/usb/core/port.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c index 2106183..5487fe3 100644 --- a/drivers/usb/core/port.c +++ b/drivers/usb/core/port.c @@ -206,7 +206,7 @@ static int link_peers(struct usb_port *left, struct usb_port *right) else method = "default"; - pr_warn("usb: failed to peer %s and %s by %s (%s:%s) (%s:%s)\n", + pr_debug("usb: failed to peer %s and %s by %s (%s:%s) (%s:%s)\n", dev_name(&left->dev), dev_name(&right->dev), method, dev_name(&left->dev), lpeer ? dev_name(&lpeer->dev) : "none", @@ -265,7 +265,7 @@ static void link_peers_report(struct usb_port *left, struct usb_port *right) if (rc == 0) { dev_dbg(&left->dev, "peered to %s\n", dev_name(&right->dev)); } else { - dev_warn(&left->dev, "failed to peer to %s (%d)\n", + dev_dbg(&left->dev, "failed to peer to %s (%d)\n", dev_name(&right->dev), rc); pr_warn_once("usb: port power management may be unreliable\n"); usb_port_block_power_off = 1; -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb: Quiet down false peer failure messages
On Mon, Nov 30, 2015 at 12:57:17PM -0800, Dan Williams wrote: > On Tue, Nov 17, 2015 at 1:00 PM, Don Zickus wrote: > > My recent Intel box is spewing these messages: > > > > xhci_hcd :00:14.0: xHCI Host Controller > > xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2 > > usb usb2: New USB device found, idVendor=1d6b, idProduct=0003 > > usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 > > usb usb2: Product: xHCI Host Controller > > usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd > > usb usb2: SerialNumber: :00:14.0 > > hub 2-0:1.0: USB hub found > > hub 2-0:1.0: 6 ports detected > > usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) > > (usb1-port1:usb2-port1) > > usb usb2-port2: failed to peer to usb1-port1 (-16) > > usb: port power management may be unreliable > > usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) > > (usb1-port1:usb2-port1) > > usb usb2-port3: failed to peer to usb1-port1 (-16) > > usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) > > (usb1-port1:usb2-port1) > > usb usb2-port5: failed to peer to usb1-port1 (-16) > > usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) > > (usb1-port1:usb2-port1) > > usb usb2-port6: failed to peer to usb1-port1 (-16) > > > > Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the > > XHCI > > hub has 8 ports. Most of those ports are of connect type USB_PORT_NOT_USED > > (including port 1 of the EHCI hub). > > > > Further the unused ports have location data initialized to 0x8000. > > > > Now each unused port on the xhci hub walks the port list and finds a > > matching > > peer with port1 of the EHCI hub because the zero'd out group id bits > > falsely match. > > After port1 of the XHCI hub, each following matching peer will generate the > > above warning. > > > > These warnings seem to be harmless for this scenario as I don't think it > > matters that unused ports could not create a peer link. > > > > The attached patch utilizes that assumption and just skips the > > find_and_link_peer setup if a port is determined to be unused. This further > > assumes that port's status will never change. > > > > Tested on my Intel box. > > > > Signed-off-by: Don Zickus > > --- > > drivers/usb/core/port.c | 7 +++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c > > index 2106183..3a8f84a 100644 > > --- a/drivers/usb/core/port.c > > +++ b/drivers/usb/core/port.c > > @@ -353,6 +353,13 @@ static void find_and_link_peer(struct usb_hub *hub, > > int port1) > > struct usb_hub *peer_hub; > > > > /* > > +* Un-used ports have zero'd out data that can create a false > > +* peer in-use failure. > > +*/ > > + if (port_dev->connect_type == USB_PORT_NOT_USED) > > + return; > > + > > + /* > > * If location data is available then we can only peer this port > > * by a location match, not the default peer (lest we create a > > * situation where we need to go back and undo a default peering > > Looks reasonable, but it may hide real errors in the ACPI tables. > > How about just changing the dev_warn() in link_peers_report() to > dev_dbg? We'll still get the pr_warn_once() to get the notification > that something went wrong, but won't continue to spam the logs if the > user does not care about peer port power management. Ok. Sounds fair. Curious, I couldn't figure out from the specs if this was valid or not, or if my Intel pre-production machine has an early firmware on it that might resolve this issue later.. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb: Quiet down false peer failure messages
On Mon, Nov 30, 2015 at 03:49:51PM -0500, Alan Stern wrote: > On Mon, 30 Nov 2015, Don Zickus wrote: > > > On Tue, Nov 17, 2015 at 04:00:58PM -0500, Don Zickus wrote: > > > My recent Intel box is spewing these messages: > > > > poke? > > > > Cheers, > > Don > > It's up to Greg KH to accept this patch; maybe you need to mail it to > him. He seems to be busy these days -- at least, he hasn't responded > yet to some patches I sent shortly after the last merge window ended. Ok. > > I'm okay with the basic idea, but it would be good to get Dan Williams > to vet it. He has spent more time tussling with USB-3/2 peering issues > than I have. He responded with some feedback. I will spin a v2 based on it. Thanks! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb: Quiet down false peer failure messages
On Tue, Nov 17, 2015 at 04:00:58PM -0500, Don Zickus wrote: > My recent Intel box is spewing these messages: poke? Cheers, Don > > xhci_hcd :00:14.0: xHCI Host Controller > xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2 > usb usb2: New USB device found, idVendor=1d6b, idProduct=0003 > usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 > usb usb2: Product: xHCI Host Controller > usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd > usb usb2: SerialNumber: :00:14.0 > hub 2-0:1.0: USB hub found > hub 2-0:1.0: 6 ports detected > usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) > (usb1-port1:usb2-port1) > usb usb2-port2: failed to peer to usb1-port1 (-16) > usb: port power management may be unreliable > usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) > (usb1-port1:usb2-port1) > usb usb2-port3: failed to peer to usb1-port1 (-16) > usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) > (usb1-port1:usb2-port1) > usb usb2-port5: failed to peer to usb1-port1 (-16) > usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) > (usb1-port1:usb2-port1) > usb usb2-port6: failed to peer to usb1-port1 (-16) > > Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the > XHCI > hub has 8 ports. Most of those ports are of connect type USB_PORT_NOT_USED > (including port 1 of the EHCI hub). > > Further the unused ports have location data initialized to 0x8000. > > Now each unused port on the xhci hub walks the port list and finds a matching > peer with port1 of the EHCI hub because the zero'd out group id bits falsely > match. > After port1 of the XHCI hub, each following matching peer will generate the > above warning. > > These warnings seem to be harmless for this scenario as I don't think it > matters that unused ports could not create a peer link. > > The attached patch utilizes that assumption and just skips the > find_and_link_peer setup if a port is determined to be unused. This further > assumes that port's status will never change. > > Tested on my Intel box. > > Signed-off-by: Don Zickus > --- > drivers/usb/core/port.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c > index 2106183..3a8f84a 100644 > --- a/drivers/usb/core/port.c > +++ b/drivers/usb/core/port.c > @@ -353,6 +353,13 @@ static void find_and_link_peer(struct usb_hub *hub, int > port1) > struct usb_hub *peer_hub; > > /* > + * Un-used ports have zero'd out data that can create a false > + * peer in-use failure. > + */ > + if (port_dev->connect_type == USB_PORT_NOT_USED) > + return; > + > + /* >* If location data is available then we can only peer this port >* by a location match, not the default peer (lest we create a >* situation where we need to go back and undo a default peering > -- > 2.6.0.rc0.2.g7662973 > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] usb: Quiet down false peer failure messages
My recent Intel box is spewing these messages: xhci_hcd :00:14.0: xHCI Host Controller xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2 usb usb2: New USB device found, idVendor=1d6b, idProduct=0003 usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 usb usb2: Product: xHCI Host Controller usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd usb usb2: SerialNumber: :00:14.0 hub 2-0:1.0: USB hub found hub 2-0:1.0: 6 ports detected usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) (usb1-port1:usb2-port1) usb usb2-port2: failed to peer to usb1-port1 (-16) usb: port power management may be unreliable usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) (usb1-port1:usb2-port1) usb usb2-port3: failed to peer to usb1-port1 (-16) usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) (usb1-port1:usb2-port1) usb usb2-port5: failed to peer to usb1-port1 (-16) usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) (usb1-port1:usb2-port1) usb usb2-port6: failed to peer to usb1-port1 (-16) Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the XHCI hub has 8 ports. Most of those ports are of connect type USB_PORT_NOT_USED (including port 1 of the EHCI hub). Further the unused ports have location data initialized to 0x8000. Now each unused port on the xhci hub walks the port list and finds a matching peer with port1 of the EHCI hub because the zero'd out group id bits falsely match. After port1 of the XHCI hub, each following matching peer will generate the above warning. These warnings seem to be harmless for this scenario as I don't think it matters that unused ports could not create a peer link. The attached patch utilizes that assumption and just skips the find_and_link_peer setup if a port is determined to be unused. This further assumes that port's status will never change. Tested on my Intel box. Signed-off-by: Don Zickus --- drivers/usb/core/port.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c index 2106183..3a8f84a 100644 --- a/drivers/usb/core/port.c +++ b/drivers/usb/core/port.c @@ -353,6 +353,13 @@ static void find_and_link_peer(struct usb_hub *hub, int port1) struct usb_hub *peer_hub; /* +* Un-used ports have zero'd out data that can create a false +* peer in-use failure. +*/ + if (port_dev->connect_type == USB_PORT_NOT_USED) + return; + + /* * If location data is available then we can only peer this port * by a location match, not the default peer (lest we create a * situation where we need to go back and undo a default peering -- 2.6.0.rc0.2.g7662973 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND] usbhid: Fix the check for HID_RESET_PENDING in hid_io_error
Bump. :-) On Mon, Aug 10, 2015 at 12:06:53PM -0400, Don Zickus wrote: > It was reported that after 10-20 reboots, a usb keyboard plugged > into a docking station would not work unless it was replugged in. > > Using usbmon, it turns out the interrupt URBs were streaming with > callback errors of -71 for some reason. The hid-core.c::hid_io_error was > supposed to retry and then reset, but the reset wasn't really happening. > > The check for HID_NO_BANDWIDTH was inverted. Fix was simple. > > Tested by reporter and locally by me by unplugging a keyboard halfway until I > could recreate a stream of errors but no disconnect. > > Signed-off-by: Don Zickus > --- > > I mistyped the usb mailing list address, so this patch never made it there. > Resending with the correct address. Sorry for the spam. > > > > drivers/hid/usbhid/hid-core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/hid/usbhid/hid-core.c b/drivers/hid/usbhid/hid-core.c > index b37d944..9935a64 100644 > --- a/drivers/hid/usbhid/hid-core.c > +++ b/drivers/hid/usbhid/hid-core.c > @@ -164,7 +164,7 @@ static void hid_io_error(struct hid_device *hid) > if (time_after(jiffies, usbhid->stop_retry)) { > > /* Retries failed, so do a port reset unless we lack bandwidth*/ > - if (test_bit(HID_NO_BANDWIDTH, &usbhid->iofl) > + if (!test_bit(HID_NO_BANDWIDTH, &usbhid->iofl) >&& !test_and_set_bit(HID_RESET_PENDING, &usbhid->iofl)) { > > schedule_work(&usbhid->reset_work); > -- > 1.8.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-usb" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND] usbhid: Fix the check for HID_RESET_PENDING in hid_io_error
It was reported that after 10-20 reboots, a usb keyboard plugged into a docking station would not work unless it was replugged in. Using usbmon, it turns out the interrupt URBs were streaming with callback errors of -71 for some reason. The hid-core.c::hid_io_error was supposed to retry and then reset, but the reset wasn't really happening. The check for HID_NO_BANDWIDTH was inverted. Fix was simple. Tested by reporter and locally by me by unplugging a keyboard halfway until I could recreate a stream of errors but no disconnect. Signed-off-by: Don Zickus --- I mistyped the usb mailing list address, so this patch never made it there. Resending with the correct address. Sorry for the spam. drivers/hid/usbhid/hid-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hid/usbhid/hid-core.c b/drivers/hid/usbhid/hid-core.c index b37d944..9935a64 100644 --- a/drivers/hid/usbhid/hid-core.c +++ b/drivers/hid/usbhid/hid-core.c @@ -164,7 +164,7 @@ static void hid_io_error(struct hid_device *hid) if (time_after(jiffies, usbhid->stop_retry)) { /* Retries failed, so do a port reset unless we lack bandwidth*/ - if (test_bit(HID_NO_BANDWIDTH, &usbhid->iofl) + if (!test_bit(HID_NO_BANDWIDTH, &usbhid->iofl) && !test_and_set_bit(HID_RESET_PENDING, &usbhid->iofl)) { schedule_work(&usbhid->reset_work); -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb, misc: Fix a warning for using an uninitialized variable
On Wed, Oct 01, 2014 at 09:15:23PM -0400, Don Zickus wrote: > Recently our build environment changed and started turning some warnings into > errors. One of the fallouts is this warning: > > CC [M] drivers/usb/misc/usb3503.o > drivers/usb/misc/usb3503.c: In function ‘usb3503_probe’: > drivers/usb/misc/usb3503.c:195: warning: ‘err’ may be used uninitialized in > this function Doh. I rushed this and didn't realize this was already fixed in v3.17. Sorry for the noise. My tree was only at v3.16. :-( self-nak. Cheers, Don > > Fixed it by using the proper error. Compiled only as I don't have the > hardware > to test correctly. > > Signed-off-by: Don Zickus > --- > drivers/usb/misc/usb3503.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/drivers/usb/misc/usb3503.c b/drivers/usb/misc/usb3503.c > index f43c619..773a3a6 100644 > --- a/drivers/usb/misc/usb3503.c > +++ b/drivers/usb/misc/usb3503.c > @@ -192,7 +192,7 @@ static int usb3503_probe(struct usb3503 *hub) > > clk = devm_clk_get(dev, "refclk"); > if (IS_ERR(clk) && PTR_ERR(clk) != -ENOENT) { > - dev_err(dev, "unable to request refclk (%d)\n", err); > + dev_err(dev, "unable to request refclk (%ld)\n", > PTR_ERR(clk)); > return PTR_ERR(clk); > } > > -- > 1.7.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] usb, misc: Fix a warning for using an uninitialized variable
Recently our build environment changed and started turning some warnings into errors. One of the fallouts is this warning: CC [M] drivers/usb/misc/usb3503.o drivers/usb/misc/usb3503.c: In function ‘usb3503_probe’: drivers/usb/misc/usb3503.c:195: warning: ‘err’ may be used uninitialized in this function Fixed it by using the proper error. Compiled only as I don't have the hardware to test correctly. Signed-off-by: Don Zickus --- drivers/usb/misc/usb3503.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/usb/misc/usb3503.c b/drivers/usb/misc/usb3503.c index f43c619..773a3a6 100644 --- a/drivers/usb/misc/usb3503.c +++ b/drivers/usb/misc/usb3503.c @@ -192,7 +192,7 @@ static int usb3503_probe(struct usb3503 *hub) clk = devm_clk_get(dev, "refclk"); if (IS_ERR(clk) && PTR_ERR(clk) != -ENOENT) { - dev_err(dev, "unable to request refclk (%d)\n", err); + dev_err(dev, "unable to request refclk (%ld)\n", PTR_ERR(clk)); return PTR_ERR(clk); } -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFT] xhci: Fix resume issues on Renesas chips in Samsung laptops
On Wed, Jan 22, 2014 at 02:04:19PM -0800, Sarah Sharp wrote: > On Wed, Jan 22, 2014 at 09:18:27AM -0500, Prarit Bhargava wrote: > > > > > > > Don Zickus writes: > > > > > > Some co-workers of mine bought Samsung laptops that had mostly usb3 ports. > > > Those ports did not resume correctly (the driver would timeout > > > communicating > > > and fail). This led to frustration as suspend/resume is a common use for > > > laptops. > > > > > > Poking around, I applied the reset on resume quirk to this chipset and the > > > resume started working. Reloading the xhci_hcd module had been the > > > temporary > > > workaround. > > > > > > Signed-off-by: Sarah Sharp > > > Reported-by: Don Zickus > > > Cc: stable # 2.6.37 > > > --- > > > > > > Don, please ask your co-worker to apply this patch and test. Let me > > > know if your other co-worker has a different subsystem device ID in his > > > Samsung laptop, when he gets back from vacation. If so, we'll extend it > > > to all Samsung laptops with that particular Renesas host. > > > > > > Sarah Sharp > > > > Sarah, > > > > I tested a USB2 hard drive and a USB3 memory stick in the previously > > non-functional ports. Both now work AFAICT. > > > > Please add > > > > Tested-by: Prarit Bhargava > > Thanks for testing! This patch will be queued to the USB subsystem > maintainer after 3.14-rc1 is out. Thanks Sarah! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xhci: fix resume issues on Renesas chips
On Mon, Jan 13, 2014 at 03:17:48PM -0800, Sarah Sharp wrote: > On Mon, Jan 13, 2014 at 10:49:49AM -0500, Don Zickus wrote: > > Some co-workers of mine bought Samsung laptops that had mostly usb3 ports. > > Those ports did not resume correctly (the driver would timeout communicating > > and fail). This led to frustration as suspend/resume is a common use for > > laptops. > > > > Poking around, I applied the reset on resume quirk to this chipset and the > > resume started working. Reloading the xhci_hcd module had been the > > temporary > > workaround. > > > > Not sure if I should restrict this further or not. Probably should be sent > > to stable too. > > Can you provide the `sudo lspci -vvv -n` for the xHCI host? I would > rather restrict this quirk further, since this is the first report I've > seen of an NEC host needing this quirk. I would rather limit it to a > Samsung subsystem vendor ID if that's available. I figured. I attached the output of one laptop here. The other laptop's owner is on vacation for two weeks. :-( Cheers, Don 00:00.0 0600: 8086:0154 (rev 09) Subsystem: 144d:c0d3 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- 00:02.0 0300: 8086:0166 (rev 09) (prog-if 00 [VGA controller]) Subsystem: 144d:c0d3 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- [disabled] Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit- Address: fee0f00c Data: 41e1 Capabilities: [d0] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [a4] PCI Advanced Features AFCap: TP+ FLR+ AFCtrl: FLR- AFStatus: TP- Kernel driver in use: i915 Kernel modules: i915 00:16.0 0780: 8086:1e3a (rev 04) Subsystem: 144d:c0d3 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend- LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <16us ClockPM- Surprise- LLActRep+ BwNot- LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise- Slot #0, PowerLimit 10.000W; Interlock- NoCompl+ SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg- Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock- SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- Changed: MRL- PresDet- LinkState+ RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID , PMEStatus- PMEPending- DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Address: Data: Capabiliti
[PATCH] xhci: fix resume issues on Renesas chips
Some co-workers of mine bought Samsung laptops that had mostly usb3 ports. Those ports did not resume correctly (the driver would timeout communicating and fail). This led to frustration as suspend/resume is a common use for laptops. Poking around, I applied the reset on resume quirk to this chipset and the resume started working. Reloading the xhci_hcd module had been the temporary workaround. Not sure if I should restrict this further or not. Probably should be sent to stable too. Signed-off-by: Don Zickus --- drivers/usb/host/xhci-pci.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c index 73f5208..2a21892 100644 --- a/drivers/usb/host/xhci-pci.c +++ b/drivers/usb/host/xhci-pci.c @@ -144,6 +144,8 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci) } if (pdev->vendor == PCI_VENDOR_ID_VIA) xhci->quirks |= XHCI_RESET_ON_RESUME; + if (pdev->vendor == PCI_VENDOR_ID_RENESAS) + xhci->quirks |= XHCI_RESET_ON_RESUME; } /* called during probe() after chip reset completes */ -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Looking for a USB device that uses cdc-wdm driver
On Tue, Oct 08, 2013 at 11:10:52AM -0500, Dan Williams wrote: > On Tue, 2013-10-08 at 11:29 -0400, Don Zickus wrote: > > Hi Oliver, > > > > I am trying to verify a fix for the cdc-wdm driver and am having trouble > > figuring out what devices requires that driver. It seems related to cell > > phones but I wasn't sure if there was a particular one or all of them in > > general (I couldn't get my cell to use it). Thanks in advance! > > Find an Ericsson F3507 or F3607 PCIe minicard WWAN modem, like: > > http://www.ebay.com/itm/New-Unlocked-Ericsson-F3607GW-DELL-Wireless-5540-3G-GPS-Mini-WWAN-Card-/290659200462?pt=US_Internal_Network_Cards&hash=item43aca36dce > > They're cheap. They expose a couple cdc-wdm ports, one of which speaks > AT commands and another which might do the GPS stuff. You don't need a > SIM. If you're at a loss for how to connect it to your computer, or > your BIOS locks out unauthorized cards, grab one of these: > > http://www.hwtools.net/Adapter/USBMS-F.html > > which is a small PCIe minicard carrier with a USB connector. Thanks! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Looking for a USB device that uses cdc-wdm driver
On Tue, Oct 08, 2013 at 05:36:46PM +0200, Bjørn Mork wrote: > Don Zickus writes: > > > I am trying to verify a fix for the cdc-wdm driver and am having trouble > > figuring out what devices requires that driver. It seems related to cell > > phones but I wasn't sure if there was a particular one or all of them in > > general (I couldn't get my cell to use it). Thanks in advance! > > The Ericsson MBM devices use it. E.g F3507g, F5521gw etc. Thanks! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Looking for a USB device that uses cdc-wdm driver
Hi Oliver, I am trying to verify a fix for the cdc-wdm driver and am having trouble figuring out what devices requires that driver. It seems related to cell phones but I wasn't sure if there was a particular one or all of them in general (I couldn't get my cell to use it). Thanks in advance! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb: xhci: Disable runtime PM suspend for quirky controllers.
On Wed, May 29, 2013 at 12:38:28PM -0700, Sarah Sharp wrote: > On Tue, May 28, 2013 at 02:41:18PM -0700, Julius Werner wrote: > > The policy we want to achieve is to disable runtime PM iff there is a > > device connected that doesn't have persist_enabled or a reset_resume() > > handler and whose parent/root hub resets on resume, right? > > Makes sense. However, not all distros may want that policy, so there > should be a way to change that policy via sysfs. Some distros may > choose to take the power savings over having a particular USB device > work, especially in the server market. > > Don, Oliver, what do you think of this patch: > http://marc.info/?l=linux-usb&m=136941922715772&w=2 That is limited only to certain controllers right? RHEL6 doesn't support runtime suspend, so we don't hear to many complaints. Most of our server customers don't have much plugged into USB, so I don't expect much problems there. Our laptop customers prefer the power savings, but I don't know how many of them have chipsets with XHCI_RESET_ON_RESUME. > > Julius is proposing to limit the scope of the patch a bit, but the > impact will still be that TI hosts will be active more often than not. Hmm, for some reason I don't see TI having the XHCI_RESET_ON_RESUME quirk set, just VIA and ETRON. Neither of which seem to be normally shipped with servers. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] xhci - correct comp_mode_recovery_timer on return from hibernate
On Tue, Mar 05, 2013 at 03:14:10PM -0800, Sarah Sharp wrote: > > */ > > static void compliance_mode_recovery_timer_init(struct xhci_hcd *xhci) > > { > > + if (timer_pending(&xhci->comp_mode_recovery_timer)) { > > + xhci_dbg(xhci, "Compliance Mode Recovery Timer already > > active.\n"); > > + return; > > + } > > + > > xhci->port_status_u0 = 0; > > init_timer(&xhci->comp_mode_recovery_timer); > > Isn't this racy? There's no locking here, so what if the timer fired > just before the call to timer_pending happened? (The documentation on > the timer interface doesn't describe whether timer_pending returns true > if we're in the middle of a timer callback.) Ah, yes. I think I misled Tony here. I thought the timer_pending() check just checked to see if the timer was on the linked list, not realizing the timer is removed from that list when executed. Instead I thought the expires was set to -1 (or something large that it never expired) but it stayed on the list. We might have to dumb it down and just use a global init variable that is set/cleared by the compliance functions with the assumption they are called serially (as not to race). Though this all seems very hackish and ugly. There has to be a better way to do this.. :-( > > Also, looking at the compliance timer callback brings up further > questions: > > static void compliance_mode_recovery(unsigned long arg) > { > ... > for (i = 0; i < xhci->num_usb3_ports; i++) { > temp = xhci_readl(xhci, xhci->usb3_ports[i]); > if ((temp & PORT_PLS_MASK) == USB_SS_PORT_LS_COMP_MOD) { > /* > * Compliance Mode Detected. Letting USB Core > * handle the Warm Reset > */ > > What happens when the xHCI host controller goes into D3cold due to > runtime PM? The port status registers will read as all f's, so we will > miss any transitions to the compliance mode that happened before or > during the transition to D3cold. > > This code probably needs to wake up the host controller and keep it from > suspending until all the ports can be read. Fun... I take it there is no document that describes the various paths PM can take during each of these phases? Cheers, Don > > Alan, would the right way to do that be a get/put call into the runtime > PM core? > > Sarah Sharp -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xhci - clarify compliance mode debug messages
On Wed, Feb 20, 2013 at 10:54:40AM -0500, Alan Stern wrote: > On Wed, 20 Feb 2013, Tony Camuso wrote: > > > On 02/20/2013 10:15 AM, Alan Stern wrote: > > > On Tue, 19 Feb 2013, Sarah Sharp wrote: > > > > > >> The one thing I wanted to check was my understanding of the hibernate > > >> flow path. As you mentioned, I thought that xhci_suspend would be > > >> called in the hibernate path, but it's not. > > > > > > Are you sure about that? AFAICT it should be called -- although it > > > gets called during the poweroff phase of hibernation, not the freeze > > > phase. > > > > > > core/hcd-pci.c: usb_hcd_pci_pm_ops.poweroff = hcd_pci_suspend > > > core/hcd-pci.c: hcd_pci_suspend calls suspend_common > > > core/hcd-pci.c: suspend_common calls hcd->driver->pci_suspend > > > host/xhci-pci.c: xhci_pci_hc_driver.pci_suspend = xhci_pci_suspend > > > host/xhci-pci.c: xhci_pci_suspend calls xhci_suspend > > > > > We instrumented the code, and when invoking pm-hibernate, we did > > not flow through xhci_suspend(), where the timer is deleted. > > If you instrumented the code then you ought to know how the actual > behavior differs from what I outlined above. Where is the difference? Hi Alan, I worked with Tony on this issue. The instrumentation we did was to add printks in the xhci_suspend, xhci_resume and compliance_mode_recovery_timer_init. What we noticed was that the xhci_suspend was not called during hibernate. As a result the timer was not deleted. During the resume, xhci_init called compliance_mode_recovery_timer_init which caused the initial panic (adding an existing timer). We added code to compliance_mode_recovery_timer_init to bail if the timer was already enabled. This lead us to the second problem at the bottom of xhci_resume where compliance_mode_recovery_timer_init was called again. We did not attempt to understand the mechanics of how suspend/hibernate and resume worked. We just assumed based on the hibernate flag to xhci_resume there were two code paths. If the expectation is that xhci_suspend should be called during hibernate, I can work with Tony to figure out why this is not happening. Solving that problem would partially solve our problem. We would still need the !hibernate flag at the bottom of xhci_resume to surround the timer check to deal with xhci_init calling the same init function. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xhci - clarify compliance mode debug messages
On Tue, Feb 19, 2013 at 01:48:51PM -0800, Sarah Sharp wrote: > Hi Tony, > > Greg closed the usb-next tree for 3.9 two weeks ago. The bug fix > patches will have to go in after the 3.9 merge window closes > (approximately two weeks from now). > > Sorry for the lack of response, I was on vacation. I'll review your > patch in the next few days. > > The one thing I wanted to check was my understanding of the hibernate > flow path. As you mentioned, I thought that xhci_suspend would be > called in the hibernate path, but it's not. Perhaps there is a better > function that's called in both the hibernate and suspend path where we > can stop the compliance mode timer. Hi Sarah, That was my understanding to, which is why I asked Tony to cc Rafael. I was hoping he could point us in the right direction.. Cheers, Don > > Sarah Sharp > > On Tue, Feb 19, 2013 at 11:01:16AM -0500, Tony Camuso wrote: > > Sarah, > > > > This patch is pursuant to the patch at ... > > http://www.mail-archive.com/linux-usb@vger.kernel.org/msg14965.html, > > Message ID <1361209953-5195-1-git-send-email-tcam...@redhat.com> > > ... and must be applied after that patch is applied. > > > > Can these patches be included in the next merge? > > > > Thanks, > > Tony > > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: uhci irq race before term_td is allocated
On Tue, Jan 22, 2013 at 06:17:00AM +, Huang, Adrian (ISS Linux TW) wrote: > Hi Alan, > > Recently, I'm debugging this issue (the same symptom described by Don) on my > server. Fortunately, the issue was fixed by applying your patch posted in > this email thread. Just checked the latest kernel (3.7.4) and the mainline > (3.8-rc4), the patch was not included in those kernels. Do you have any > schedule to submit the patch to the upstream? Thanks in advance. I don't think Alan ever committed that patch because I never gave him feedback on it. Of course, the reason I never gave feedback was the customer I was working with (HP??) updated their firmware and couldn't reproduce the problem anymore. Therefore the patch couldn't be verified to work or not. But if you are able to show the patch working, I'll leave it up to Alan to decide what to do. Cheers, Don > > The list of the allocated IRQ (uhci_hcd used the shared interrupt) on my > server: >CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 > CPU6 CPU7 > 0:120 0 0 0 0 0 > 0 0 IR-IO-APIC-edge timer > 1: 2 0 0 0 0 0 > 0 0 IR-IO-APIC-edge i8042 > 3: 2 0 0 0 0 0 > 0 0 IR-IO-APIC-edge > 4: 1 0 0 0 0 0 > 0 0 IR-IO-APIC-edge > 8: 1 0 0 0 0 0 > 0 0 IR-IO-APIC-edge rtc0 > 9: 0 0 0 0 0 0 > 0 0 IR-IO-APIC-fasteoi acpi > 12: 4 0 0 0 0 0 > 0 0 IR-IO-APIC-edge i8042 > 16: 1187 0 0 0 0 0 > 0 0 IR-IO-APIC-fasteoi uhci_hcd:usb3, hpilo > 20: 1385 0 0 0 0 0 > 0 0 IR-IO-APIC-fasteoi ehci_hcd:usb2 > 21:167 0 0 0 0 0 > 0 0 IR-IO-APIC-fasteoi ehci_hcd:usb1 > > -- Adrian > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: uhci irq race before term_td is allocated
On Thu, Oct 18, 2012 at 03:59:44PM -0400, Alan Stern wrote: > On Tue, 16 Oct 2012, Don Zickus wrote: > > > > > I did not get a chance to test the patch yet. > > > > > > Let me know what happens. > > > > It might take a while as our customer's customers have trouble reproducing > > this which means it is difficult to tell if the problem is fixed or just > > harder to reproduce. :-/ > > > > Hopefully after a couple of weeks I will have some indication of which. > > If you want to make the problem more likely to occur, just sleep for a > few seconds in usb_add_hcd() after calling usb_hcd_request_irqs() and > before calling hcd->driver->start(). > > Doing that both with and without the patch installed should show pretty > quickly whether the patch does what you want. That makes sense. For some reason kdump is failing me and I had to give back the machine. I'll try again later this week. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: uhci irq race before term_td is allocated
On Tue, Oct 16, 2012 at 12:01:29PM -0400, Alan Stern wrote: > On Tue, 16 Oct 2012, Don Zickus wrote: > > > > I dislike adding an extra test to a hot path, but there doesn't seem to > > > be any way around it. Some of the other HCDs may need a similar > > > change. > > > > I understand your concern. I was curious in uhci_irq, I see the following > > fragment: > > > > status = uhci_readw(uhci, USBSTS); > > if (!(status & ~USBSTS_HCH))/* shared interrupt, not mine */ > > return IRQ_NONE; > > uhci_writew(uhci, status, USBSTS); /* Clear it */ > > > > Should that be helping in filtering out shared interrupts or not in this > > case? > > If it did help, your system wouldn't have crashed. :-) Hehe. True. > > What must have happened is that one of the bits in the USBSTS register > was set. It didn't cause an IRQ because the USBINTR register was > clear, but an IRQ arrived anyway from some other device sharing the IRQ > line. > > Since USBSTS had bits set, the test above did not filter out the > interrupt. In other words, even though it really was a shared > interrupt it didn't look that way, because the UHCI controller would > have generated an interrupt request if it had been allowed to. So this could have been a leftover interrupt from the boot kernel? > > > I did not get a chance to test the patch yet. > > Let me know what happens. It might take a while as our customer's customers have trouble reproducing this which means it is difficult to tell if the problem is fixed or just harder to reproduce. :-/ Hopefully after a couple of weeks I will have some indication of which. Thanks for the help. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: uhci irq race before term_td is allocated
On Mon, Oct 15, 2012 at 05:31:01PM -0400, Alan Stern wrote: > > > > uhci_scan_schedule-> > > uhci_clear_next_interrupt-> > > uhci->term_td->status &= ~cpu_to_hc32(uhci, TD_CTRL_IOC); > > > > This panics becase term_td is not allocated yet. > > > > Now I could be wrong about the interrupts and the uhci_start routine and > > perhaps this is prevented somehow. This is why I am asking what is the > > expectation for the above scenario. > > > I was just wondering if you had a quick thought about this or not. Hi Alan, Thanks for your respone. > > I suspect you're getting a shared interrupt, that is, one generated by > another device using the same IRQ line and not by the UHCI controller > itself. Can you check the IRQ assignments to see if that's possible? A box in our lab (which should have similar hardware) shows (only first cpu shown, other 159 snipped) 17: 183 IR-IO-APIC-fasteoi uhci_hcd:usb6, ata_piix, hpilo 20: 3 IR-IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2 22: 0 IR-IO-APIC-fasteoi uhci_hcd:usb4 23: 53 IR-IO-APIC-fasteoi uhci_hcd:usb3, uhci_hcd:usb5, radeon So yes, you are probably correct with regards to shared interrupt. > > I'm reasonably sure that the controller didn't generate the bad > interrupt, because it's not allowed to generate any IRQs until the > USBINTR register is set to a nonzero value. That doesn't happen until > start_rh() is called near the end of uhci_start(). > > Since uhci_irq() doesn't check to see whether the uhci->is_initialized > flag has been set, a shared interrupt arriving too early could cause > exactly the sort of problem you see here. > > I dislike adding an extra test to a hot path, but there doesn't seem to > be any way around it. Some of the other HCDs may need a similar > change. I understand your concern. I was curious in uhci_irq, I see the following fragment: status = uhci_readw(uhci, USBSTS); if (!(status & ~USBSTS_HCH))/* shared interrupt, not mine */ return IRQ_NONE; uhci_writew(uhci, status, USBSTS); /* Clear it */ Should that be helping in filtering out shared interrupts or not in this case? I did not get a chance to test the patch yet. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
uhci irq race before term_td is allocated
Hi Alan, I am seeing an odd panic with uhci when a 160 cpu box panics and starts running a kdump kernel (which is the same exact image as the boot kernel) for our RHEL-6 (2.6.32) kernel. Now I understand 2.6.32 is not something upstream supports. However, my question is what is expected to happen in the following situtation. The attached panic below shows usb_hcd_pci_probe-> usb_add_hcd-> /* enable irqs just before we start the controller, * if the BIOS provides legacy PCI irqs. */ if (usb_hcd_is_primary_hcd(hcd) && irqnum) { retval = usb_hcd_request_irqs(hcd, irqnum, irqflags); if (retval) goto err_request_irq; } hcd->state = HC_STATE_RUNNING; retval = hcd->driver->start(hcd); if (retval < 0) { dev_err(hcd->self.controller, "startup error %d\n", retval); goto err_hcd_driver_start; } That interrupts are setup and enabled before hcd->driver_start is called. Now in uhci_start-> uhci->term_td = uhci_alloc_td(uhci); Happens. But what if an interrupt comes in before that call runs, for example: uhci_irq-> if (status & USBSTS_RD) usb_hcd_poll_rh_status(hcd); else { spin_lock(&uhci->lock); uhci_scan_schedule(uhci); spin_unlock(&uhci->lock); } In uhci_scan_schedule(uhci) uhci_scan_schedule-> uhci_clear_next_interrupt-> uhci->term_td->status &= ~cpu_to_hc32(uhci, TD_CTRL_IOC); This panics becase term_td is not allocated yet. Now I could be wrong about the interrupts and the uhci_start routine and perhaps this is prevented somehow. This is why I am asking what is the expectation for the above scenario. Below is an actual panic that was reported against 2.6.32 which I think shows the scenario I described above (that could be wrong too). Though I did expect an interrupt exception frame in there to backup my theory... usb usb5: Product: UHCI Host Controller usb usb5: Manufacturer: Linux 2.6.32-220.7.1.el6.x86_64 uhci_hcd usb usb5: SerialNumber: :00:1d.3 usb usb5: configuration #1 chosen from 1 choice hub 5-0:1.0: USB hub found hub 5-0:1.0: 2 ports detected uhci_hcd :02:00.4: PCI INT B -> GSI 17 (level, low) -> IRQ 17 uhci_hcd :02:00.4: UHCI Host Controller uhci_hcd :02:00.4: new USB bus registered, assigned bus number 6 uhci_hcd :02:00.4: port count misdetected? forcing to 2 ports BUG: unable to handle kernel NULL pointer dereference at 0004 IP: [] uhci_scan_schedule+0x46/0xb20 PGD 0 Oops: 0002 [#1] SMP last sysfs file: CPU 0 Modules linked in: Pid: 56, comm: work_for_cpu Not tainted 2.6.32-220.7.1.el6.x86_64 #1 HP ProLiant DL980 G7 RIP: 0010:[] [ <0> 3731 880019104620 Call Trace: [] uhci_irq+0x7c/0x180 [] ? disable_irq_nosync+0x64/0xa0 [] usb_hcd_irq+0x3f/0x90 [] request_threaded_irq+0x1bc/0x2f0 [] ? usb_hcd_irq+0x0/0x90 [] usb_add_hcd+0x3be/0x800 [] usb_hcd_pci_probe+0x158/0x3d0 [] ? do_work_for_cpu+0x0/0x30 [] local_pci_probe+0x17/0x20 [] do_work_for_cpu+0x18/0x30 [] kthread+0x96/0xa0 [] child_rip+0xa/0x20 [] ? kthread+0x0/0xa0 [] ? child_rip+0x0/0x20 Code: 00 00 48 89 fb a8 01 0f 85 69 0a 00 00 48 8d 57 50 83 c8 01 88 87 c8 00 00 00 48 89 55 98 83 e0 bd 88 83 c8 00 00 00 48 8b 43 20 <81> 60 04 ff ff ff fe 83 bb bc 00 00 00 00 0f 84 09 0a 00 00 8b RIP [] uhci_scan_schedule+0x46/0xb20 RSP CR2: 0004 ---[ end trace b5ebaaef70b2501d ]--- If you aren't familar with kdump issues. Then it is not uncommon for interrupts to still be active on devices (though irqpoll is set to prevent irq flooding on boot up). This box doesn't have much for usb devices. The only thing people think might be helping cause this problem is from customers running '/usr/sbin/gpm -m /dev/input/mice -t exps2'. On the other hand, we are not to sure how to duplicate the problem other than 'echo c > /proc/sysrq-trigger' about 100 times and see if we get lucky. I was just wondering if you had a quick thought about this or not. Thanks, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFT] usb: Don't enable LPM if the exit latency is zero.
On Wed, Oct 03, 2012 at 11:29:00AM -0700, Sarah Sharp wrote: > Hi Don, > > Please test this patch on top of the other patch. So yeah, that seemed to work. :-) Thanks! Cheers, Don > > Sarah Sharp > > >8---8< > Some USB 3.0 devices signal that they don't implement Link PM by having > all zeroes in the U1/U2 exit latencies in their SuperSpeed BOS > descriptor. Don found that a Western Digital device he has experiences > transfer errors when LPM is enabled. The lsusb shows the U1/U2 exit > latencies are set to zero: > > Binary Object Store Descriptor: > bLength 5 > bDescriptorType15 > wTotalLength 22 > bNumDeviceCaps 2 > SuperSpeed USB Device Capability: > bLength10 > bDescriptorType16 > bDevCapabilityType 3 > bmAttributes 0x00 > Latency Tolerance Messages (LTM) Supported > wSpeedsSupported 0x000e > Device can operate at Full Speed (12Mbps) > Device can operate at High Speed (480Mbps) > Device can operate at SuperSpeed (5Gbps) > bFunctionalitySupport 1 > Lowest fully-functional device speed is Full Speed (12Mbps) > bU1DevExitLat 0 micro seconds > bU2DevExitLat 0 micro seconds > > The fix is to not enable LPM for a particular link state if we find its > corresponding exit latency is zero. > > Signed-off-by: Sarah Sharp > Reported-by: Don Zickus > --- > drivers/usb/core/hub.c | 14 ++ > 1 files changed, 10 insertions(+), 4 deletions(-) > > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c > index 55bef91..2568441 100644 > --- a/drivers/usb/core/hub.c > +++ b/drivers/usb/core/hub.c > @@ -3484,8 +3484,16 @@ static void usb_enable_link_state(struct usb_hcd *hcd, > struct usb_device *udev, > enum usb3_link_state state) > { > int timeout; > - __u8 u1_mel; > - __le16 u2_mel; > + __u8 u1_mel = udev->bos->ss_cap->bU1devExitLat; > + __le16 u2_mel = udev->bos->ss_cap->bU2DevExitLat; > + > + /* If the device says it doesn't have *any* exit latency to come out of > + * U1 or U2, it's probably lying. Assume it doesn't implement that link > + * state. > + */ > + if ((state == USB3_LPM_U1 && u1_mel == 0) || > + (state == USB3_LPM_U2 && u2_mel == 0)) > + return; > > /* We allow the host controller to set the U1/U2 timeout internally >* first, so that it can change its schedule to account for the > @@ -3512,8 +3520,6 @@ static void usb_enable_link_state(struct usb_hcd *hcd, > struct usb_device *udev, >* link commands. This can cause transfer errors, so only enable >* device-initiated LPM. >*/ > - u1_mel = udev->bos->ss_cap->bU1devExitLat; > - u2_mel = udev->bos->ss_cap->bU2DevExitLat; > if ((state == USB3_LPM_U1 && u1_mel == USB_U1_MAX_VALID_MEL) || > (state == USB3_LPM_U2 && >le16_to_cpu(u2_mel) == USB_U2_MAX_VALID_MEL)) { > -- > 1.7.9 > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFT] usb: Fix TX errors on devices with unsupported LPM states.
On Tue, Oct 02, 2012 at 02:42:41PM -0700, Sarah Sharp wrote: > > > USB 3.0 devices are required to support Link PM. However, some of > > > them don't support U1, or don't support U2, or don't support either. > > > There is no way in the USB 3.0 specification to say that a device > > > doesn't support U1 or U2, so these devices set the U1 or U2 exit > > > latency to the maximum possible values (10ms for U1, 2047ms for U2). > > > > It still failed for me. :-( > > Can you send me the `sudo lsusb -v` output for the device? Also, please > see if you can get a new cable and retest. I attached the lsusb -v output (had to use another usb3 controller to obtain it). I can't really use another cable because it comes with its own usb3 mini cable. All the other usb3 cables I have are the normal sized ones (with the fatter end). Cheers, Don Bus 006 Device 002: ID 1058:0730 Western Digital Technologies, Inc. Device Descriptor: bLength18 bDescriptorType 1 bcdUSB 3.00 bDeviceClass0 (Defined at Interface level) bDeviceSubClass 0 bDeviceProtocol 0 bMaxPacketSize0 9 idVendor 0x1058 Western Digital Technologies, Inc. idProduct 0x0730 bcdDevice 10.16 iManufacturer 1 Western Digital iProduct2 My Passport 0730 iSerial 3 575837314132314434373235 bNumConfigurations 1 Configuration Descriptor: bLength 9 bDescriptorType 2 wTotalLength 44 bNumInterfaces 1 bConfigurationValue 1 iConfiguration 0 bmAttributes 0x80 (Bus Powered) MaxPower 224mA Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber0 bAlternateSetting 0 bNumEndpoints 2 bInterfaceClass 8 Mass Storage bInterfaceSubClass 6 SCSI bInterfaceProtocol 80 Bulk-Only iInterface 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x84 EP 4 IN bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0400 1x 1024 bytes bInterval 0 bMaxBurst 15 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x03 EP 3 OUT bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0400 1x 1024 bytes bInterval 0 bMaxBurst 15 Binary Object Store Descriptor: bLength 5 bDescriptorType15 wTotalLength 22 bNumDeviceCaps 2 SuperSpeed USB Device Capability: bLength10 bDescriptorType16 bDevCapabilityType 3 bmAttributes 0x00 Latency Tolerance Messages (LTM) Supported wSpeedsSupported 0x000e Device can operate at Full Speed (12Mbps) Device can operate at High Speed (480Mbps) Device can operate at SuperSpeed (5Gbps) bFunctionalitySupport 1 Lowest fully-functional device speed is Full Speed (12Mbps) bU1DevExitLat 0 micro seconds bU2DevExitLat 0 micro seconds USB 2.0 Extension Device Capability: bLength 7 bDescriptorType16 bDevCapabilityType 2 bmAttributes 0x0002 Link Power Management (LPM) Supported Device Status: 0x0002 (Bus Powered) Remote Wakeup Enabled -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFT] usb: Fix TX errors on devices with unsupported LPM states.
On Tue, Oct 02, 2012 at 12:15:46PM -0700, Sarah Sharp wrote: > Hi Don, > > The WD drive arrived today, but I can't reproduce your I/O errors on > 3.6. I didn't try with 3.5 yet. However, I did notice that the device > really doesn't want to enter U1, but it will enter U2. I did update the firmware but I think I had the problem before that happened. Also I don't use the VIA hub anymore. I plugged the harddrive directly into my mahobay system's usb3 port and see the problem. > > Can you test the following patch with your device and see if it helps? > If that doesn't work, I suggest getting a high-quality new cable and > seeing if that helps the transfer issues. > > Gabor, this might fix your issues as well, and should allow your system > to consume less power than completely disabling LPM. > > Sarah Sharp > > >8-8< > > USB 3.0 devices are required to support Link PM. However, some of > them don't support U1, or don't support U2, or don't support either. > There is no way in the USB 3.0 specification to say that a device > doesn't support U1 or U2, so these devices set the U1 or U2 exit > latency to the maximum possible values (10ms for U1, 2047ms for U2). It still failed for me. :-( Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: xhci: LPM issues using Western Digital harddrive
On Fri, Sep 28, 2012 at 11:39:48AM -0400, Alan Stern wrote: > On Fri, 28 Sep 2012, Don Zickus wrote: > > > > Probably the device was already hung. The log showed that an earlier > > > transfer completed with a STALL. The reason wasn't apparent, although > > > a usbmon trace might give a clue. It might also show why the device > > > had to be reset. > > > > Here is the output of 'cat 4u' > > > 880240121860 3207118259 S Bo:4:002:3 -115 31 = 55534243 6c00 > > 0600 00 > > 880240121860 3207118359 C Bo:4:002:3 0 31 > > > 880240121860 3207118430 S Bi:4:002:4 -115 13 < > > 880240121860 3207118511 C Bi:4:002:4 0 13 = 55534253 6c00 > > 00 > > 880240121860 3207127088 S Bo:4:002:3 -115 31 = 55534243 6d00 > > 0002 8ca1 082e0001 ec00 00 > > 880240121860 3207127197 C Bo:4:002:3 0 31 > > > 880240122698 3207127238 S Bi:4:002:4 -115 512 < > > 880240122698 3207143925 C Bi:4:002:4 -32 0 > > 880240121860 3207144021 S Co:4:002:0 s 02 01 0084 0 > > 880240121860 3207220654 C Co:4:002:0 -32 0 > > 880240122698 3207220814 S Co:4:001:0 s 23 03 0004 0004 0 > > 880240122698 3207226644 C Co:4:001:0 0 0 > > This shows that the problem began when the device was sent a command it > didn't recognize: 0xA1, which is a 12-byte ATA pass-through, in this > case for an IDENTIFY DEVICE command (0xEC). Presumably the Western > Digital device doesn't support ATA pass-through. The device halted its > bulk-IN endpoint and then replied with a STALL to the > Clear-Endpoint-Halt request (which is an invalid response). This is > why the reset was tried. > > Don, what does a comparable usbmon trace look like when the drive is > attached to an EHCI controller? I would expect to see basically the > same thing. I can grab that. I just wanted to note that this drive works fine under a Fresco Logic or NEC xhci controller. But those controllers don't seem to invoke the LPM code paths either. I'll provide an EHCI usbmon trace from the panther point controller and if interesting a usbmon trace from an NEC xhci controller. Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: xhci: LPM issues using Western Digital harddrive
On Thu, Sep 27, 2012 at 04:42:52PM -0400, Alan Stern wrote: > On Thu, 27 Sep 2012, Sarah Sharp wrote: > > > On Wed, Sep 26, 2012 at 02:03:06PM -0400, Don Zickus wrote: > > > Hi Sarah, > > > > > > As discussed privately, here is the debug output of what happens when I > > > plug in the hard drive into my panther point system. > > > > > > kernel based off of greg's usb-next (0220a3f01a) with CONFIG_USB_DEBUG and > > > CONFIG_USB_XHCI_HCD_DEBUGGING turned on. > > > > Thanks for the dmesg Don. Unfortunately, it doesn't tell me anything > > terribly interesting. The address device control message completed with > > a transfer error twice, and the USB core gave up on it. > > > > I'll have to see if I can figure anything out with a USB bus analyzer > > when I get the drive next week. > > Probably the device was already hung. The log showed that an earlier > transfer completed with a STALL. The reason wasn't apparent, although > a usbmon trace might give a clue. It might also show why the device > had to be reset. Here is the output of 'cat 4u' Cheers, Don 880240121860 3194234363 S Ci:4:001:0 s a3 00 0001 0004 4 < 880240121860 3194238462 C Ci:4:001:0 0 4 = a002 880240121860 3194238531 S Ci:4:001:0 s a3 00 0002 0004 4 < 880240121860 3194242468 C Ci:4:001:0 0 4 = a002 880240121860 3194242529 S Ci:4:001:0 s a3 00 0003 0004 4 < 880240121860 3194247552 C Ci:4:001:0 0 4 = a002 880240121860 3194247612 S Ci:4:001:0 s a3 00 0004 0004 4 < 880240121860 3194251702 C Ci:4:001:0 0 4 = 03020100 880240121860 3194253923 S Co:4:001:0 s 23 01 0010 0004 0 880240121860 3194256746 C Co:4:001:0 0 0 88023dfd6080 3194357546 S Ii:4:001:1 -115:2048 4 < 880240121860 3194360035 S Ci:4:001:0 s a3 00 0004 0004 4 < 880240121860 3194363928 C Ci:4:001:0 0 4 = 0302 880240121860 3194375687 S Co:4:001:0 s 23 03 0004 0004 0 880240121860 3194378308 C Co:4:001:0 0 0 880240121860 3194428559 S Ci:4:001:0 s a3 00 0004 0004 4 < 880240121860 3194432085 C Ci:4:001:0 0 4 = 03021000 880240121860 3194482579 S Co:4:001:0 s 23 01 0014 0004 0 880240121860 3194484962 C Co:4:001:0 0 0 880240121860 3194753586 S Ci:4:002:0 s 80 06 0100 0008 8 < 880240121860 3194753860 C Ci:4:002:0 0 8 = 12010003 0009 880240121860 3194753956 S Ci:4:002:0 s 80 06 0100 0012 18 < 880240121860 3194754080 C Ci:4:002:0 0 18 = 12010003 0009 58103007 16100102 0301 880240121860 3194754222 S Ci:4:002:0 s 80 06 0f00 0005 5 < 880240121860 3194754355 C Ci:4:002:0 0 5 = 050f1600 02 880240121860 3194754516 S Ci:4:002:0 s 80 06 0f00 0016 22 < 880240121860 3194754598 C Ci:4:002:0 0 22 = 050f1600 020a1003 000e0001 0007 10020200 880240121860 3194754764 S Ci:4:002:0 s 80 06 0200 0009 9 < 880240121860 3194754925 C Ci:4:002:0 0 9 = 09022c00 01010080 70 880240121860 3194755041 S Ci:4:002:0 s 80 06 0200 002c 44 < 880240121860 3194755166 C Ci:4:002:0 0 44 = 09022c00 01010080 70090400 00020806 5705 84020004 0006300f 0007 880240121860 3194758636 S Ci:4:002:0 s 80 06 0300 00ff 255 < 880240121860 3194760861 C Ci:4:002:0 0 4 = 04030904 880240121860 3194762999 S Ci:4:002:0 s 80 06 0302 0409 00ff 255 < 880240121860 3194765295 C Ci:4:002:0 0 34 = 22034d00 79002000 50006100 73007300 70006f00 72007400 20003000 37003300 880240121860 3194765434 S Ci:4:002:0 s 80 06 0301 0409 00ff 255 < 880240121860 3194767615 C Ci:4:002:0 0 32 = 20035700 65007300 74006500 72006e00 20004400 69006700 69007400 61006c00 880240121860 3194767769 S Ci:4:002:0 s 80 06 0303 0409 00ff 255 < 880240121860 3194770006 C Ci:4:002:0 0 50 = 32033500 37003500 38003300 37003300 31003400 31003300 32003300 31003400 880240122698 3195163067 S Co:4:002:0 s 00 09 0001 0 880240122698 3195163251 C Co:4:002:0 0 0 880240122698 3195295743 S Co:4:001:0 s 23 03 0017 3204 0 880240122698 3195295762 C Co:4:001:0 0 0 880240122698 3195295792 S Co:4:002:0 s 00 30 0006 6 = 0a0a0002 0002 880240122698 3195295899 C Co:4:002:0 0 6 > 880240122698 3195296015 S Co:4:002:0 s 00 03 0030 0 880240122698 3195296100 C Co:4:002:0 0 0 880240122698 3195435238 S Co:4:001:0 s 23 03 0018 2804 0 880240122698 3195435268 C Co:4:001:0 0 0 880240122698 3195435337 S Co:4:002:0 s 00 30 0006 6 = 0a0a0002 0002 880240122698 3195435466 C Co:4:002:0 0 6 > 880240122698 3195435595 S Co:4:002:0 s 00 03 0031 0 880240122698 3195435711 C Co:4:002:0 0 0 880240121860 319786 S Co:4:001:0 s 23 03 0017 0004 0 880240121860 319804 C Co:4:001:0 0 0 880240121860 31
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Wed, Sep 26, 2012 at 12:55:26PM -0400, Alan Stern wrote: > On Wed, 26 Sep 2012, Don Zickus wrote: > > > Hi Alan, > > > > This patch seemed to be successful. I copied and pasted the response from > > our customer (we backported the patch to RHEL-6/2.6.32): > > ... > > > Is there anything else you need from us or can we move forward with this > > patch? > > That's good enough; I'll submit it to Greg. Awesome. Thanks! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Mon, Sep 24, 2012 at 12:59:51PM -0400, Alan Stern wrote: > > If you want to track down what's going wrong, you'll have to add some > > debugging code to usb_device_read() and usb_remove_hcd(). By the time > > usb_device_dump() starts running, it's already too late. > > After thinking about this some more, I realized that my patch still > leaves a race -- although the oops would occur in a different place > (where usb_device_read checks bus->root_hub->devnum). > > Here's a different patch which should work better. It relies on the > rh_registered flag in the usb_hcd structure, which persists as long as > the usb_bus structure does, rather than on anything stored in the > root-hub device structure. Hi Alan, This patch seemed to be successful. I copied and pasted the response from our customer (we backported the patch to RHEL-6/2.6.32): " This new patch testing went very well. In previous tests, the kernel paniced during the first or second surprise removal of the ehci_hcd PCI device on an idle system. I ran an entire night of surprise and polite device removals with no kernel panic or Oops. slub_debug=FZPU was used to poison deallocated storage blocks and check for use after free. No BUGs were logged by the allocator. Numerous messages like the following were seen at the console, indicating that the stimulus leading to the panic should be occurring: cat: /proc/bus/usb/001/006: No such device Surprise removals occur when the device is electrically disconnected from the PCI bus while in use. During a following clean-up operation, the driver's remove function called. Polite removals occur when the driver's remove function is called while the device is in use. After return from the driver, the device is electrically disconnected from the PCI bus. I ran 5 hours of surprise and polite removals on the same idle system that was used for the previous tests, about 60 trials of each type of removal. Then a workload was started to cause mid-level CPU, Memory and Disk stress; the test continued to run for 8.5 hours more executing about 38 more trials of each type of PCI removal. " Is there anything else you need from us or can we move forward with this patch? Thanks again! Cheers, Don > > > > Index: usb-3.6/drivers/usb/core/devices.c > === > --- usb-3.6.orig/drivers/usb/core/devices.c > +++ usb-3.6/drivers/usb/core/devices.c > @@ -624,7 +624,7 @@ static ssize_t usb_device_read(struct fi > /* print devices for all busses */ > list_for_each_entry(bus, &usb_bus_list, bus_list) { > /* recurse through all children of the root hub */ > - if (!bus->root_hub) > + if (!bus_to_hcd(bus)->rh_registered) > continue; > usb_lock_device(bus->root_hub); > ret = usb_device_dump(&buf, &nbytes, &skip_bytes, ppos, > Index: usb-3.6/drivers/usb/core/hcd.c > === > --- usb-3.6.orig/drivers/usb/core/hcd.c > +++ usb-3.6/drivers/usb/core/hcd.c > @@ -1011,10 +1011,7 @@ static int register_root_hub(struct usb_ > if (retval) { > dev_err (parent_dev, "can't register root hub for %s, %d\n", > dev_name(&usb_dev->dev), retval); > - } > - mutex_unlock(&usb_bus_list_lock); > - > - if (retval == 0) { > + } else { > spin_lock_irq (&hcd_root_hub_lock); > hcd->rh_registered = 1; > spin_unlock_irq (&hcd_root_hub_lock); > @@ -1023,6 +1020,7 @@ static int register_root_hub(struct usb_ > if (HCD_DEAD(hcd)) > usb_hc_died (hcd); /* This time clean up */ > } > + mutex_unlock(&usb_bus_list_lock); > > return retval; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Mon, Sep 24, 2012 at 12:59:51PM -0400, Alan Stern wrote: > On Mon, 17 Sep 2012, Alan Stern wrote: > > > On Mon, 17 Sep 2012, Don Zickus wrote: > > > > > I have added some in-depth analysis from our customer. > > > > > > > > > The problem is that the failing routine was called with a pointer to > > > usbdev in > > > memory that has already been freed and overwritten with the pool poison > > > pattern. > > > This causes an access violation on line 502: > > > > The stack trace shows that the failure occurred in usb_device_dump(), > > which was called from usb_device_read(). Since it wasn't a recursive > > call, the usbdev argument must have pointed to the root-hub device. > > > > I don't see how this could have happened if the kernel included the > > patch I sent earlier. usb_device_read() holds the usb_bus_list_lock > > mutex throughout, and it tests that the root hub's devnum field is not > > equal to 0 before calling usb_dump_device(). > > > > Meanwhile, usb_remove_hcd() sets hcd->self.bus.root_hub->devnum to 0 > > while holding usb_bus_list_lock, and it doesn't deallocate the root hub > > device until after the mutex is dropped. > > > > Are you certain that the test was conducted with the patch in place? > > > > If you want to track down what's going wrong, you'll have to add some > > debugging code to usb_device_read() and usb_remove_hcd(). By the time > > usb_device_dump() starts running, it's already too late. > > After thinking about this some more, I realized that my patch still > leaves a race -- although the oops would occur in a different place > (where usb_device_read checks bus->root_hub->devnum). > > Here's a different patch which should work better. It relies on the > rh_registered flag in the usb_hcd structure, which persists as long as > the usb_bus structure does, rather than on anything stored in the > root-hub device structure. Thanks! I will pass this on. Sorry for the lack of feedback. The customer I am coordinating with was on vacation last week. I am still trying to set up their machine on my end to help debug this faster. I appreciate you still thinking about this. Cheers, Don > > Alan Stern > > > > Index: usb-3.6/drivers/usb/core/devices.c > === > --- usb-3.6.orig/drivers/usb/core/devices.c > +++ usb-3.6/drivers/usb/core/devices.c > @@ -624,7 +624,7 @@ static ssize_t usb_device_read(struct fi > /* print devices for all busses */ > list_for_each_entry(bus, &usb_bus_list, bus_list) { > /* recurse through all children of the root hub */ > - if (!bus->root_hub) > + if (!bus_to_hcd(bus)->rh_registered) > continue; > usb_lock_device(bus->root_hub); > ret = usb_device_dump(&buf, &nbytes, &skip_bytes, ppos, > Index: usb-3.6/drivers/usb/core/hcd.c > === > --- usb-3.6.orig/drivers/usb/core/hcd.c > +++ usb-3.6/drivers/usb/core/hcd.c > @@ -1011,10 +1011,7 @@ static int register_root_hub(struct usb_ > if (retval) { > dev_err (parent_dev, "can't register root hub for %s, %d\n", > dev_name(&usb_dev->dev), retval); > - } > - mutex_unlock(&usb_bus_list_lock); > - > - if (retval == 0) { > + } else { > spin_lock_irq (&hcd_root_hub_lock); > hcd->rh_registered = 1; > spin_unlock_irq (&hcd_root_hub_lock); > @@ -1023,6 +1020,7 @@ static int register_root_hub(struct usb_ > if (HCD_DEAD(hcd)) > usb_hc_died (hcd); /* This time clean up */ > } > + mutex_unlock(&usb_bus_list_lock); > > return retval; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Thu, Sep 13, 2012 at 04:31:34PM -0400, Alan Stern wrote: > On Thu, 13 Sep 2012, Don Zickus wrote: > > > Hi Alan, > > > > I adapted your patch to our 2.6.32 tree and the customer tested it without > > success. The output panic is attached below. I will work on getting a > > machine with the latest kernel to reproduce this problem so I don't waste > > your time chasing something that might be fixed upstream. > > > > But if you could take a quick glance at the panic below to see if anything > > comes to mind I would appreciate it. > > > The test failed on the first surprise removal of PCI devices. Last gasp > > from the console is posted below. I would guess the faulting process was > > reading a file under /proc/bus/usb. I should have a dump if more info is > > needed. > > > ehci_hcd :2c:00.0: HC died; cleaning up > > ehci_hcd :2c:00.0: force halt; handhake c9654024 4000 > > 4000 -> -19 > > ehci_hcd :2c:00.0: HC died; cleaning up > > ehci_hcd :2c:00.0: remove, state 0 > > usb usb1: USB disconnect, device number 1 > > usb 1-1: USB disconnect, device number 2 > > usb 1-1.1: USB disconnect, device number 3 > > hub 4-1:1.0: unable to enumerate USB device on port 3 > > usb 1-1.3: USB disconnect, device number 4 > > usb 1-1.6: USB disconnect, device number 5 > > usb 1-1.6.1: USB disconnect, device number 6 > > ehci_hcd :2c:00.0: USB bus 1 deregistered > > hub 4-1:1.0: unable to enumerate USB device on port 3 > > general protection fault: [#1] SMP > > last sysfs file: > > /sys/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.2/input/input12/event12/uevent > > CPU 0 > > Modules linked in: autofs4 sunrpc configfs cachefiles fscache(T) bonding > > 8021q garp stp llc vhost_net macvtap macvlan tun uinput ipmi_devintf > > ftmod(P)(U) ipmi_msghandler sg matroxfb(U) fosil(U) ext4 mbcache jbd2 > > raid1 sr_mod cdrom sd_mod(U) crc_t10dif usb_storage mpt2sas(U) > > scsi_hbas(U) scsi_transport_sas raid_class igb(U) dca dm_mirror > > dm_region_hash dm_log dm_mod ipv6 cxgb4 cxgb3 mdio libiscsi_tcp libiscsi > > scsi_transport_iscsi [last unloaded: scsi_wait_scan] > > > > Pid: 32752, comm: cat Tainted: P --- T > > 2.6.32-279.el6.bz849188.test01.x86_64 #1 Stratus ftServer 2700/G7LAY > > RIP: 0010:[] [] > > usb_device_dump+0x87/0xa70 > > It would help to know what source statement that memory address > corresponds to. Offhand I can't see any remaining races between > usb_device_dump() and usb_disconnect(). I have added some in-depth analysis from our customer. The problem is that the failing routine was called with a pointer to usbdev in memory that has already been freed and overwritten with the pool poison pattern. This causes an access violation on line 502: 470 /* This is a recursive function. Parameters: 471 * buffer - the user-space buffer to write data into 472 * nbytes - the maximum number of bytes to write 473 * skip_bytes - the number of bytes to skip before writing anything 474 * file_offset - the offset into the devices file on completion 475 * The caller must own the device lock. 476 */ 477 static ssize_t usb_device_dump(char __user **buffer, size_t *nbytes, 478loff_t *skip_bytes, loff_t *file_offset, 479struct usb_device *usbdev, struct usb_bus *bus, 480int level, int index, int count) 481 { 482 int chix; 483 int ret, cnt = 0; 484 int parent_devnum = 0; 485 char *pages_start, *data_end, *speed; 486 unsigned int length; 487 ssize_t total_written = 0; 488 489 /* don't bother with anything else if we're not writing any data */ 490 if (*nbytes <= 0) 491 return 0; 492 493 if (level > MAX_TOPO_LEVEL) 494 return 0; 495 /* allocate 2^1 pages = 8K (on i386); 496 * should be more than enough for one device */ 497 pages_start = (char *)__get_free_pages(GFP_NOIO, 1); 498 if (!pages_start) 499 return -ENOMEM; 500 501 if (usbdev->parent && usbdev->parent->devnum != -1) 502 parent_devnum = usbdev->parent->devnum; 503 /* 504 * So the root hub's parent is 0 and any device that is 505 * plugged into the root hub has a parent of 0. 506 */ 507 switch (usbdev->speed) { crash> hex output radix: 16 (hex) crash> sym 813b8167 813b8167 (t) usb_device_dump+0x87 ../debug/kernel-2.6.32-279.el6/linux
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Thu, Sep 13, 2012 at 04:31:34PM -0400, Alan Stern wrote: > On Thu, 13 Sep 2012, Don Zickus wrote: > > > Hi Alan, > > > > I adapted your patch to our 2.6.32 tree and the customer tested it without > > success. The output panic is attached below. I will work on getting a > > machine with the latest kernel to reproduce this problem so I don't waste > > your time chasing something that might be fixed upstream. > > > > But if you could take a quick glance at the panic below to see if anything > > comes to mind I would appreciate it. > > > The test failed on the first surprise removal of PCI devices. Last gasp > > from the console is posted below. I would guess the faulting process was > > reading a file under /proc/bus/usb. I should have a dump if more info is > > needed. > > > ehci_hcd :2c:00.0: HC died; cleaning up > > ehci_hcd :2c:00.0: force halt; handhake c9654024 4000 > > 4000 -> -19 > > ehci_hcd :2c:00.0: HC died; cleaning up > > ehci_hcd :2c:00.0: remove, state 0 > > usb usb1: USB disconnect, device number 1 > > usb 1-1: USB disconnect, device number 2 > > usb 1-1.1: USB disconnect, device number 3 > > hub 4-1:1.0: unable to enumerate USB device on port 3 > > usb 1-1.3: USB disconnect, device number 4 > > usb 1-1.6: USB disconnect, device number 5 > > usb 1-1.6.1: USB disconnect, device number 6 > > ehci_hcd :2c:00.0: USB bus 1 deregistered > > hub 4-1:1.0: unable to enumerate USB device on port 3 > > general protection fault: [#1] SMP > > last sysfs file: > > /sys/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.2/input/input12/event12/uevent > > CPU 0 > > Modules linked in: autofs4 sunrpc configfs cachefiles fscache(T) bonding > > 8021q garp stp llc vhost_net macvtap macvlan tun uinput ipmi_devintf > > ftmod(P)(U) ipmi_msghandler sg matroxfb(U) fosil(U) ext4 mbcache jbd2 > > raid1 sr_mod cdrom sd_mod(U) crc_t10dif usb_storage mpt2sas(U) > > scsi_hbas(U) scsi_transport_sas raid_class igb(U) dca dm_mirror > > dm_region_hash dm_log dm_mod ipv6 cxgb4 cxgb3 mdio libiscsi_tcp libiscsi > > scsi_transport_iscsi [last unloaded: scsi_wait_scan] > > > > Pid: 32752, comm: cat Tainted: P --- T > > 2.6.32-279.el6.bz849188.test01.x86_64 #1 Stratus ftServer 2700/G7LAY > > RIP: 0010:[] [] > > usb_device_dump+0x87/0xa70 > > It would help to know what source statement that memory address > corresponds to. Offhand I can't see any remaining races between > usb_device_dump() and usb_disconnect(). I'll dig that up.. Thanks, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Tue, Sep 11, 2012 at 11:34:56AM -0400, Alan Stern wrote: > > Deregistering the bus first to prevent future readers from accessing the > > device > > before cleaning up the hcd, seemed like an appropriate way to go. I'll > > leave > > it to the experts to validate that or provide a better solution. :-) > > > > I adapted their usb_remove_hcd fix for usb_add_hcd error path too to mimic > > the same behaviour (a little code shuffling to handle the gotos cleanly). > > Moving things around in usb_add_hcd() is not a good way to attack this > problem. Here's a better approach; please make sure that it does what > you want. The idea is to indicate that a root hub is not registered by > setting its devnum field to 0. > > Alan Stern Hi Alan, I adapted your patch to our 2.6.32 tree and the customer tested it without success. The output panic is attached below. I will work on getting a machine with the latest kernel to reproduce this problem so I don't waste your time chasing something that might be fixed upstream. But if you could take a quick glance at the panic below to see if anything comes to mind I would appreciate it. Thanks for any help. Cheers, Don The test failed on the first surprise removal of PCI devices. Last gasp from the console is posted below. I would guess the faulting process was reading a file under /proc/bus/usb. I should have a dump if more info is needed. usb 3-1.6: New USB device found, idVendor=0557, idProduct=8021 usb 3-1.6: New USB device strings: Mfr=0, Product=0, SerialNumber=0 usb 3-1.6: configuration #1 chosen from 1 choice hub 3-1.6:1.0: USB hub found hub 3-1.6:1.0: 4 ports detected hub 4-1:1.0: unable to enumerate USB device on port 3 usb 3-1.6.1: new low speed USB device number 5 using ehci_hcd usb 3-1.6.1: New USB device found, idVendor=0557, idProduct=2260 usb 3-1.6.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0 usb 3-1.6.1: Product: CS1308R1 V1.2.116 usb 3-1.6.1: Manufacturer: ATEN International Co. Ltd usb 3-1.6.1: configuration #1 chosen from 1 choice input: ATEN International Co. Ltd CS1308R1 V1.2.116 as /devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.0/input/input10 generic-usb 0003:0557:2260.0009: input,hidraw8: USB HID v1.00 Keyboard [ATEN International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input0 input: ATEN International Co. Ltd CS1308R1 V1.2.116 as /devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.1/input/input11 generic-usb 0003:0557:2260.000A: input,hidraw9: USB HID v1.00 Device [ATEN International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input1 input: ATEN International Co. Ltd CS1308R1 V1.2.116 as /devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.2/input/input12 generic-usb 0003:0557:2260.000B: input,hidraw10: USB HID v1.00 Mouse [ATEN International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input2 input: ATEN International Co. Ltd CS1308R1 V1.2.116 as /devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.3/input/input13 generic-usb 0003:0557:2260.000C: input,hidraw11: USB HID v1.10 Mouse [ATEN International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input3 hub 4-1:1.0: unable to enumerate USB device on port 3 usb 3-1.3: new high speed USB device number 6 using ehci_hcd sd 0:0:0:0: [sdq] Unhandled error code sd 0:0:0:0: [sdq] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK sd 0:0:0:0: [sdq] CDB: Write(10): 2a 00 02 08 08 08 00 00 01 00 md: super_written gets error=-5, uptodate=0 md/raid1:md2: Disk failure on sdq3, disabling device. md/raid1:md2: Operation continuing on 1 devices. usb 3-1.3: New USB device found, idVendor=152d, idProduct=2339 usb 3-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=5 usb 3-1.3: Product: USB to ATA/ATAPI Bridge usb 3-1.3: Manufacturer: JMicron usb 3-1.3: SerialNumber: MATSHITADVYN14 003100 usb 3-1.3: configuration #1 chosen from 1 choice scsi3 : SCSI emulation for USB Mass Storage devices hub 4-1:1.0: unable to enumerate USB device on port 3 hub 4-1:1.0: unable to enumerate USB device on port 3 hub 4-1:1.0: unable to enumerate USB device on port 3 hub 4-1:1.0: unable to enumerate USB device on port 3 scsi 3:0:0:0: CD-ROMMATSHITA DVD-RAM UJ8A0AS 1.20 PQ: 0 ANSI: 0 sr1: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray sr 3:0:0:0: Attached scsi generic sg3 type 5 hub 4-1:1.0: unable to enumerate USB device on port 3 bonding: bond0: Warning: the permanent HWaddr of eth100600 - 00:25:5c:a6:f2:60 - is still in use by bond0. Set the HWaddr of eth100600 to a different address to avoid conflicts. bonding: bond0: releasing backup interface eth100600 hub 4-1:1.0: unable to enumerate USB device on por
Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices
On Tue, Sep 11, 2012 at 11:34:56AM -0400, Alan Stern wrote: > On Mon, 10 Sep 2012, Don Zickus wrote: > > > A customer of ours noticed that after a bunch of hot removals of the EHCI > > PCI device, a panic occurs. This happened on a 2.6.32 RHEL-6 kernel, but > > I believe is still applicable upstream. > > > > The panic was further simplified to enabling SLUB debug poisoning and > > running > > the following command: > > > > while true; do find /proc/bus/usb -type f -exec cat {} >/dev/null \; ; done > > & > > > > This gets the machine to panic after 1 or 2 removals of the device. > > > This is likely because the following thread was in the process of removing > > the > > device: > > > The fix that solved their problem was to deregister the usb bus first before > > running usb_put_dev. After running multiple tests the panic disappeared. > > > > Deregistering the bus first to prevent future readers from accessing the > > device > > before cleaning up the hcd, seemed like an appropriate way to go. I'll > > leave > > it to the experts to validate that or provide a better solution. :-) > > > > I adapted their usb_remove_hcd fix for usb_add_hcd error path too to mimic > > the same behaviour (a little code shuffling to handle the gotos cleanly). > > Moving things around in usb_add_hcd() is not a good way to attack this > problem. Here's a better approach; please make sure that it does what > you want. The idea is to indicate that a root hub is not registered by > setting its devnum field to 0. Thanks for the feedback. I will pass this along for testing and get back to you. Thanks, Don > > Alan Stern > > > > Index: usb-3.6/drivers/usb/core/devices.c > === > --- usb-3.6.orig/drivers/usb/core/devices.c > +++ usb-3.6/drivers/usb/core/devices.c > @@ -624,7 +624,7 @@ static ssize_t usb_device_read(struct fi > /* print devices for all busses */ > list_for_each_entry(bus, &usb_bus_list, bus_list) { > /* recurse through all children of the root hub */ > - if (!bus->root_hub) > + if (!bus->root_hub || bus->root_hub->devnum == 0) > continue; > usb_lock_device(bus->root_hub); > ret = usb_device_dump(&buf, &nbytes, &skip_bytes, ppos, > Index: usb-3.6/drivers/usb/core/hcd.c > === > --- usb-3.6.orig/drivers/usb/core/hcd.c > +++ usb-3.6/drivers/usb/core/hcd.c > @@ -980,6 +980,8 @@ static int register_root_hub(struct usb_ > const int devnum = 1; > int retval; > > + mutex_lock(&usb_bus_list_lock); > + > usb_dev->devnum = devnum; > usb_dev->bus->devnum_next = devnum + 1; > memset (&usb_dev->bus->devmap.devicemap, 0, > @@ -987,8 +989,6 @@ static int register_root_hub(struct usb_ > set_bit (devnum, usb_dev->bus->devmap.devicemap); > usb_set_device_state(usb_dev, USB_STATE_ADDRESS); > > - mutex_lock(&usb_bus_list_lock); > - > usb_dev->ep0.desc.wMaxPacketSize = cpu_to_le16(64); > retval = usb_get_device_descriptor(usb_dev, USB_DT_DEVICE_SIZE); > if (retval != sizeof usb_dev->descriptor) { > @@ -1011,6 +1011,7 @@ static int register_root_hub(struct usb_ > if (retval) { > dev_err (parent_dev, "can't register root hub for %s, %d\n", > dev_name(&usb_dev->dev), retval); > + usb_dev->devnum = 0; > } > mutex_unlock(&usb_bus_list_lock); > > @@ -2527,6 +2528,7 @@ error_create_attr_group: > #endif > mutex_lock(&usb_bus_list_lock); > usb_disconnect(&rhdev); /* Sets rhdev to NULL */ > + hcd->self.root_hub->devnum = 0; > mutex_unlock(&usb_bus_list_lock); > err_register_root_hub: > hcd->rh_pollable = 0; > @@ -2583,6 +2585,7 @@ void usb_remove_hcd(struct usb_hcd *hcd) > > mutex_lock(&usb_bus_list_lock); > usb_disconnect(&rhdev); /* Sets rhdev to NULL */ > + hcd->self.root_hub->devnum = 0; > mutex_unlock(&usb_bus_list_lock); > > /* Prevent any more root-hub status calls from the timer. > -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] usb/core: Fix race condition when removing EHCI PCI devices
A customer of ours noticed that after a bunch of hot removals of the EHCI PCI device, a panic occurs. This happened on a 2.6.32 RHEL-6 kernel, but I believe is still applicable upstream. The panic was further simplified to enabling SLUB debug poisoning and running the following command: while true; do find /proc/bus/usb -type f -exec cat {} >/dev/null \; ; done & This gets the machine to panic after 1 or 2 removals of the device. The backtrace looks like: crash> bt PID: 23674 TASK: 8801ab411540 CPU: 2 COMMAND: "sosreport" #0 [8801b9073980] machine_kexec at 8103281b #1 [8801b90739e0] crash_kexec at 810ba662 #2 [8801b9073ab0] oops_end at 81501290 #3 [8801b9073ae0] no_context at 81043bab #4 [8801b9073b30] __bad_area_nosemaphore at 81043e35 #5 [8801b9073b80] bad_area at 81043f5e #6 [8801b9073bb0] __do_page_fault at 81044710 #7 [8801b9073cd0] do_page_fault at 8150326e #8 [8801b9073d00] page_fault at 81500625 [exception RIP: __list_add+0x26] RIP: 81282fb6 RSP: 8801b9073db8 RFLAGS: 00010046 RAX: 0292 RBX: 8801b9073de8 RCX: RDX: 8802782788f8 RSI: RDI: 8801b9073de8 RBP: 8801b9073dd8 R8: R9: 0001 R10: 880028401fc0 R11: R12: 8802782788f8 R13: R14: 7fff R15: 8801b9073e98 ORIG_RAX: CS: 0010 SS: 0018 #9 [8801b9073de0] __down at 814ff58e #10 [8801b9073e30] down at 81097ef1 #11 [8801b9073e60] usb_device_read at 813b78ce #12 [8801b9073ef0] vfs_read at 8117b8b5 #13 [8801b9073f30] sys_read at 8117b9f1 #14 [8801b9073f80] tracesys at 8100b308 (via system_call) RIP: 0030f90da360 RSP: 7fffef71dda8 RFLAGS: 0246 RAX: ffda RBX: 8100b308 RCX: RDX: 1000 RSI: 7f1926568000 RDI: 0003 RBP: R8: 084023c4 R9: 7f1926530700 R10: 2028 R11: 0246 R12: 08403b7e R13: 0846 R14: 083159c0 R15: 7fffef71de40 ORIG_RAX: CS: 0033 SS: 002b This is likely because the following thread was in the process of removing the device: PID: 3070 TASK: 8802735fd540 CPU: 1 COMMAND: "io_poll" #0 [8802788a97b0] schedule at 814fd7e2 #1 [8802788a9878] __mutex_lock_slowpath at 814fee8e #2 [8802788a98e8] mutex_lock at 814fed2b #3 [8802788a9908] usb_deregister_bus at 813a59ff #4 [8802788a9938] usb_remove_hcd at 813a5b9d #5 [8802788a9978] usb_hcd_pci_remove at 813b80b3 #6 [8802788a9998] pci_device_remove at 812930b7 #7 [8802788a99b8] __device_release_driver at 8135006f #8 [8802788a99d8] device_release_driver at 813501dd #9 [8802788a99f8] bus_remove_device at 8134f073 #10 [8802788a9a28] device_del at 8134ccbd #11 [8802788a9a58] device_unregister at 8134cd92 #12 [8802788a9a78] pci_stop_bus_device at 8128c4cc #13 [8802788a9aa8] pci_stop_bus_device at 8128c47b #14 [8802788a9ad8] pci_stop_bus_device at 8128c47b #15 [8802788a9b08] pci_remove_bus_device at 8128c57a #16 [8802788a9b38] remove_io at a0361db8 [ftmod] #17 [8802788a9b98] ioboard_event at a0362130 [ftmod] #18 [8802788a9c18] io_state_change at a0365c9e [ftmod] #19 [8802788a9c38] OsIoStateChange at a0406638 [ftmod] #20 [8802788a9c58] IoStateChange at a0407a4b [ftmod] #21 [8802788a9db8] CcIoPoll at a040af64 [ftmod] #22 [8802788a9e98] io_poll at a0365b00 [ftmod] #23 [8802788a9f48] kernel_thread at 8100c14a The fix that solved their problem was to deregister the usb bus first before running usb_put_dev. After running multiple tests the panic disappeared. Deregistering the bus first to prevent future readers from accessing the device before cleaning up the hcd, seemed like an appropriate way to go. I'll leave it to the experts to validate that or provide a better solution. :-) I adapted their usb_remove_hcd fix for usb_add_hcd error path too to mimic the same behaviour (a little code shuffling to handle the gotos cleanly). Signed-off-by: Don Zickus --- drivers/usb/core/hcd.c | 20 ++-- 1 files changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c index bc84106..d155bbf 100644 --- a/drivers/usb/core/hcd.c +++ b/drivers/usb/core/hcd.c @@ -2428,13 +2428,6 @@
Re: [PATCH] usb, ehci: Avoid deadlock of ehci->lock by disabling interrupts
On Wed, Aug 01, 2012 at 10:42:37AM -0400, Alan Stern wrote: > On Tue, 31 Jul 2012, Don Zickus wrote: > > > We ran into an interesting deadlock on RHEL-5 (2.6.18) that I believe > > still appiles to the current kernel involving the ehci->lock. > > > > CPU A: > > submits a bulk transfer urb > > ehci_urb_enqueue calls submit_async > > submit_async blocks on ehci->lock with irq disabled (the result > > of spin_lock_irqsave) for CPU B > > > > CPU B: > > takes an ehci interrupt > > locks ehci->lock > > pre-empted by an IPI handler which spins waiting for CPU C > > > > CPU C: > > takes an MTRR request > > sends an IPI to all cpus to block > > spins waiting for all cpus to block > > > > CPU A nevers processes IPI because its interrupts are disabled, > > this creates the 3-way deadlock. > > > > This deadlock is hard to reproduce by our customer, but based on their > > vmcore > > it seems clear the above is what happened. I attatched a suggested patch > > from a colleague that would seem to resolve the problem. Because it is > > hard to reproduce, I have not been able to test it to verify it resolves > > the problem. > > > > The patch just turns spin_locks in the spin_lock_irqsaves in the ehci_irq > > function. This would essentially block the IPI handler and let the > > interrupt > > handler finish before processing the IPI. Then CPU A would get a chance to > > finish and process its IPI. > > > > Looking at the code paths in 2.6.18 and 3.5, the locking still seems the > > same > > which is why I believe the problem still exists. However, someone in the > > office > > thought the MTRR code has been re-written, so the problem we are seeing > > might > > be more difficult to see with the current kernel. > > > > This patch does feel awkward, disabling interrupts in the irq handler. It > > seems > > like it would make more sense to remove the locking from the irq handler. > > But > > that is probably more work and my knowledge of USB is limited. I'll start > > with > > this patch and see where the conversation goes. > > > > Any feedback would be appreciated. > > 2.6.18 is awfully old -- almost 6 years! Heh. Yes, that is world of RHEL and supporting kernels for 10 years. :-( I only brought up the issue because I thought it was still relevant, but.. > > Anyway, commit de85422b94ddb23c021126815ea49414047c13dc (USB: fix > interrupt disabling for HCDs with shared interrupt handlers) took care > of this problem way back in 2.6.26. You should be able to back-port > the patch. I see it was fixed at a higher level. Sorry I missed that. Thanks for pointing out the commit. Again sorry for the noise. Thanks! Cheers, Don -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] usb, ehci: Avoid deadlock of ehci->lock by disabling interrupts
We ran into an interesting deadlock on RHEL-5 (2.6.18) that I believe still appiles to the current kernel involving the ehci->lock. CPU A: submits a bulk transfer urb ehci_urb_enqueue calls submit_async submit_async blocks on ehci->lock with irq disabled (the result of spin_lock_irqsave) for CPU B CPU B: takes an ehci interrupt locks ehci->lock pre-empted by an IPI handler which spins waiting for CPU C CPU C: takes an MTRR request sends an IPI to all cpus to block spins waiting for all cpus to block CPU A nevers processes IPI because its interrupts are disabled, this creates the 3-way deadlock. This deadlock is hard to reproduce by our customer, but based on their vmcore it seems clear the above is what happened. I attatched a suggested patch from a colleague that would seem to resolve the problem. Because it is hard to reproduce, I have not been able to test it to verify it resolves the problem. The patch just turns spin_locks in the spin_lock_irqsaves in the ehci_irq function. This would essentially block the IPI handler and let the interrupt handler finish before processing the IPI. Then CPU A would get a chance to finish and process its IPI. Looking at the code paths in 2.6.18 and 3.5, the locking still seems the same which is why I believe the problem still exists. However, someone in the office thought the MTRR code has been re-written, so the problem we are seeing might be more difficult to see with the current kernel. This patch does feel awkward, disabling interrupts in the irq handler. It seems like it would make more sense to remove the locking from the irq handler. But that is probably more work and my knowledge of USB is limited. I'll start with this patch and see where the conversation goes. Any feedback would be appreciated. [dzickus - rewrote the description, ported to v3.5] Signed-off-by: Casey Dahlin Signed-off-by: Don Zickus --- drivers/usb/host/ehci-hcd.c |7 --- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c index 800be38..a31d9cc 100644 --- a/drivers/usb/host/ehci-hcd.c +++ b/drivers/usb/host/ehci-hcd.c @@ -849,8 +849,9 @@ static irqreturn_t ehci_irq (struct usb_hcd *hcd) struct ehci_hcd *ehci = hcd_to_ehci (hcd); u32 status, masked_status, pcd_status = 0, cmd; int bh; + unsigned long flags; - spin_lock (&ehci->lock); + spin_lock_irqsave(&ehci->lock, flags); status = ehci_readl(ehci, &ehci->regs->status); @@ -868,7 +869,7 @@ static irqreturn_t ehci_irq (struct usb_hcd *hcd) /* Shared IRQ? */ if (!masked_status || unlikely(ehci->rh_state == EHCI_RH_HALTED)) { - spin_unlock(&ehci->lock); + spin_unlock_irqrestore(&ehci->lock, flags); return IRQ_NONE; } @@ -969,7 +970,7 @@ dead: if (bh) ehci_work (ehci); - spin_unlock (&ehci->lock); + spin_unlock_irqrestore(&ehci->lock, flags); if (pcd_status) usb_hcd_poll_rh_status(hcd); return IRQ_HANDLED; -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html