from:"Don Zickus"

[PATCH v2] usb: Quiet down false peer failure messages

2015-12-03 Thread Don Zickus

My recent Intel box is spewing these messages:

xhci_hcd :00:14.0: xHCI Host Controller
xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: xHCI Host Controller
usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd
usb usb2: SerialNumber: :00:14.0
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 6 ports detected
usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) 
(usb1-port1:usb2-port1)
usb usb2-port2: failed to peer to usb1-port1 (-16)
usb: port power management may be unreliable
usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) 
(usb1-port1:usb2-port1)
usb usb2-port3: failed to peer to usb1-port1 (-16)
usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) 
(usb1-port1:usb2-port1)
usb usb2-port5: failed to peer to usb1-port1 (-16)
usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) 
(usb1-port1:usb2-port1)
usb usb2-port6: failed to peer to usb1-port1 (-16)

Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the XHCI
hub has 8 ports.  Most of those ports are of connect type USB_PORT_NOT_USED
(including port 1 of the EHCI hub).

Further the unused ports have location data initialized to 0x8000.

Now each unused port on the xhci hub walks the port list and finds a matching
peer with port1 of the EHCI hub because the zero'd out group id bits falsely 
match.
After port1 of the XHCI hub, each following matching peer will generate the
above warning.

These warnings seem to be harmless for this scenario as I don't think it
matters that unused ports could not create a peer link.

The attached patch utilizes that assumption and just turns the pr_warn into
pr_debug to quiet things down.

Tested on my Intel box.

Signed-off-by: Don Zickus 
---
v2: modified patch to just quiet down the warns instead of skipping the ports

---
 drivers/usb/core/port.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
index 2106183..5487fe3 100644
--- a/drivers/usb/core/port.c
+++ b/drivers/usb/core/port.c
@@ -206,7 +206,7 @@ static int link_peers(struct usb_port *left, struct 
usb_port *right)
else
method = "default";
 
-   pr_warn("usb: failed to peer %s and %s by %s (%s:%s) (%s:%s)\n",
+   pr_debug("usb: failed to peer %s and %s by %s (%s:%s) 
(%s:%s)\n",
dev_name(&left->dev), dev_name(&right->dev), method,
dev_name(&left->dev),
lpeer ? dev_name(&lpeer->dev) : "none",
@@ -265,7 +265,7 @@ static void link_peers_report(struct usb_port *left, struct 
usb_port *right)
if (rc == 0) {
dev_dbg(&left->dev, "peered to %s\n", dev_name(&right->dev));
} else {
-   dev_warn(&left->dev, "failed to peer to %s (%d)\n",
+   dev_dbg(&left->dev, "failed to peer to %s (%d)\n",
dev_name(&right->dev), rc);
pr_warn_once("usb: port power management may be unreliable\n");
usb_port_block_power_off = 1;
-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb: Quiet down false peer failure messages

2015-11-30 Thread Don Zickus

On Mon, Nov 30, 2015 at 12:57:17PM -0800, Dan Williams wrote:
> On Tue, Nov 17, 2015 at 1:00 PM, Don Zickus  wrote:
> > My recent Intel box is spewing these messages:
> >
> > xhci_hcd :00:14.0: xHCI Host Controller
> > xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
> > usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
> > usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> > usb usb2: Product: xHCI Host Controller
> > usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd
> > usb usb2: SerialNumber: :00:14.0
> > hub 2-0:1.0: USB hub found
> > hub 2-0:1.0: 6 ports detected
> > usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) 
> > (usb1-port1:usb2-port1)
> > usb usb2-port2: failed to peer to usb1-port1 (-16)
> > usb: port power management may be unreliable
> > usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) 
> > (usb1-port1:usb2-port1)
> > usb usb2-port3: failed to peer to usb1-port1 (-16)
> > usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) 
> > (usb1-port1:usb2-port1)
> > usb usb2-port5: failed to peer to usb1-port1 (-16)
> > usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) 
> > (usb1-port1:usb2-port1)
> > usb usb2-port6: failed to peer to usb1-port1 (-16)
> >
> > Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the 
> > XHCI
> > hub has 8 ports.  Most of those ports are of connect type USB_PORT_NOT_USED
> > (including port 1 of the EHCI hub).
> >
> > Further the unused ports have location data initialized to 0x8000.
> >
> > Now each unused port on the xhci hub walks the port list and finds a 
> > matching
> > peer with port1 of the EHCI hub because the zero'd out group id bits 
> > falsely match.
> > After port1 of the XHCI hub, each following matching peer will generate the
> > above warning.
> >
> > These warnings seem to be harmless for this scenario as I don't think it
> > matters that unused ports could not create a peer link.
> >
> > The attached patch utilizes that assumption and just skips the
> > find_and_link_peer setup if a port is determined to be unused.  This further
> > assumes that port's status will never change.
> >
> > Tested on my Intel box.
> >
> > Signed-off-by: Don Zickus 
> > ---
> >  drivers/usb/core/port.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> > index 2106183..3a8f84a 100644
> > --- a/drivers/usb/core/port.c
> > +++ b/drivers/usb/core/port.c
> > @@ -353,6 +353,13 @@ static void find_and_link_peer(struct usb_hub *hub, 
> > int port1)
> > struct usb_hub *peer_hub;
> >
> > /*
> > +* Un-used ports have zero'd out data that can create a false
> > +* peer in-use failure.
> > +*/
> > +   if (port_dev->connect_type == USB_PORT_NOT_USED)
> > +   return;
> > +
> > +   /*
> >  * If location data is available then we can only peer this port
> >  * by a location match, not the default peer (lest we create a
> >  * situation where we need to go back and undo a default peering
> 
> Looks reasonable, but it may hide real errors in the ACPI tables.
> 
> How about just changing the dev_warn() in link_peers_report() to
> dev_dbg?  We'll still get the pr_warn_once() to get the notification
> that something went wrong, but won't continue to spam the logs if the
> user does not care about peer port power management.

Ok.  Sounds fair.  Curious, I couldn't figure out from the specs if this was
valid or not, or if my Intel pre-production machine has an early firmware on it
that might resolve this issue later..

Cheers,
Don

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb: Quiet down false peer failure messages

2015-11-30 Thread Don Zickus

On Mon, Nov 30, 2015 at 03:49:51PM -0500, Alan Stern wrote:
> On Mon, 30 Nov 2015, Don Zickus wrote:
> 
> > On Tue, Nov 17, 2015 at 04:00:58PM -0500, Don Zickus wrote:
> > > My recent Intel box is spewing these messages:
> > 
> > poke?
> > 
> > Cheers,
> > Don
> 
> It's up to Greg KH to accept this patch; maybe you need to mail it to
> him.  He seems to be busy these days -- at least, he hasn't responded
> yet to some patches I sent shortly after the last merge window ended.

Ok.


> 
> I'm okay with the basic idea, but it would be good to get Dan Williams 
> to vet it.  He has spent more time tussling with USB-3/2 peering issues 
> than I have.

He responded with some feedback.  I will spin a v2 based on it.

Thanks!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb: Quiet down false peer failure messages

2015-11-30 Thread Don Zickus

On Tue, Nov 17, 2015 at 04:00:58PM -0500, Don Zickus wrote:
> My recent Intel box is spewing these messages:

poke?

Cheers,
Don

> 
> xhci_hcd :00:14.0: xHCI Host Controller
> xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
> usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
> usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
> usb usb2: Product: xHCI Host Controller
> usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd
> usb usb2: SerialNumber: :00:14.0
> hub 2-0:1.0: USB hub found
> hub 2-0:1.0: 6 ports detected
> usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) 
> (usb1-port1:usb2-port1)
> usb usb2-port2: failed to peer to usb1-port1 (-16)
> usb: port power management may be unreliable
> usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) 
> (usb1-port1:usb2-port1)
> usb usb2-port3: failed to peer to usb1-port1 (-16)
> usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) 
> (usb1-port1:usb2-port1)
> usb usb2-port5: failed to peer to usb1-port1 (-16)
> usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) 
> (usb1-port1:usb2-port1)
> usb usb2-port6: failed to peer to usb1-port1 (-16)
> 
> Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the 
> XHCI
> hub has 8 ports.  Most of those ports are of connect type USB_PORT_NOT_USED
> (including port 1 of the EHCI hub).
> 
> Further the unused ports have location data initialized to 0x8000.
> 
> Now each unused port on the xhci hub walks the port list and finds a matching
> peer with port1 of the EHCI hub because the zero'd out group id bits falsely 
> match.
> After port1 of the XHCI hub, each following matching peer will generate the
> above warning.
> 
> These warnings seem to be harmless for this scenario as I don't think it
> matters that unused ports could not create a peer link.
> 
> The attached patch utilizes that assumption and just skips the
> find_and_link_peer setup if a port is determined to be unused.  This further
> assumes that port's status will never change.
> 
> Tested on my Intel box.
> 
> Signed-off-by: Don Zickus 
> ---
>  drivers/usb/core/port.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> index 2106183..3a8f84a 100644
> --- a/drivers/usb/core/port.c
> +++ b/drivers/usb/core/port.c
> @@ -353,6 +353,13 @@ static void find_and_link_peer(struct usb_hub *hub, int 
> port1)
>   struct usb_hub *peer_hub;
>  
>   /*
> +  * Un-used ports have zero'd out data that can create a false
> +  * peer in-use failure.
> +  */
> + if (port_dev->connect_type == USB_PORT_NOT_USED)
> + return;
> +
> + /*
>* If location data is available then we can only peer this port
>* by a location match, not the default peer (lest we create a
>* situation where we need to go back and undo a default peering
> -- 
> 2.6.0.rc0.2.g7662973
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] usb: Quiet down false peer failure messages

2015-11-17 Thread Don Zickus

My recent Intel box is spewing these messages:

xhci_hcd :00:14.0: xHCI Host Controller
xhci_hcd :00:14.0: new USB bus registered, assigned bus number 2
usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: xHCI Host Controller
usb usb2: Manufacturer: Linux 4.3.0+ xhci-hcd
usb usb2: SerialNumber: :00:14.0
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 6 ports detected
usb: failed to peer usb2-port2 and usb1-port1 by location (usb2-port2:none) 
(usb1-port1:usb2-port1)
usb usb2-port2: failed to peer to usb1-port1 (-16)
usb: port power management may be unreliable
usb: failed to peer usb2-port3 and usb1-port1 by location (usb2-port3:none) 
(usb1-port1:usb2-port1)
usb usb2-port3: failed to peer to usb1-port1 (-16)
usb: failed to peer usb2-port5 and usb1-port1 by location (usb2-port5:none) 
(usb1-port1:usb2-port1)
usb usb2-port5: failed to peer to usb1-port1 (-16)
usb: failed to peer usb2-port6 and usb1-port1 by location (usb2-port6:none) 
(usb1-port1:usb2-port1)
usb usb2-port6: failed to peer to usb1-port1 (-16)

Diving into the acpi tables, I noticed the EHCI hub has 12 ports while the XHCI
hub has 8 ports.  Most of those ports are of connect type USB_PORT_NOT_USED
(including port 1 of the EHCI hub).

Further the unused ports have location data initialized to 0x8000.

Now each unused port on the xhci hub walks the port list and finds a matching
peer with port1 of the EHCI hub because the zero'd out group id bits falsely 
match.
After port1 of the XHCI hub, each following matching peer will generate the
above warning.

These warnings seem to be harmless for this scenario as I don't think it
matters that unused ports could not create a peer link.

The attached patch utilizes that assumption and just skips the
find_and_link_peer setup if a port is determined to be unused.  This further
assumes that port's status will never change.

Tested on my Intel box.

Signed-off-by: Don Zickus 
---
 drivers/usb/core/port.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
index 2106183..3a8f84a 100644
--- a/drivers/usb/core/port.c
+++ b/drivers/usb/core/port.c
@@ -353,6 +353,13 @@ static void find_and_link_peer(struct usb_hub *hub, int 
port1)
struct usb_hub *peer_hub;
 
/*
+* Un-used ports have zero'd out data that can create a false
+* peer in-use failure.
+*/
+   if (port_dev->connect_type == USB_PORT_NOT_USED)
+   return;
+
+   /*
 * If location data is available then we can only peer this port
 * by a location match, not the default peer (lest we create a
 * situation where we need to go back and undo a default peering
-- 
2.6.0.rc0.2.g7662973

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RESEND] usbhid: Fix the check for HID_RESET_PENDING in hid_io_error

2015-08-18 Thread Don Zickus

Bump. :-)


On Mon, Aug 10, 2015 at 12:06:53PM -0400, Don Zickus wrote:
> It was reported that after 10-20 reboots, a usb keyboard plugged
> into a docking station would not work unless it was replugged in.
> 
> Using usbmon, it turns out the interrupt URBs were streaming with
> callback errors of -71 for some reason.  The hid-core.c::hid_io_error was
> supposed to retry and then reset, but the reset wasn't really happening.
> 
> The check for HID_NO_BANDWIDTH was inverted.  Fix was simple.
> 
> Tested by reporter and locally by me by unplugging a keyboard halfway until I
> could recreate a stream of errors but no disconnect.
> 
> Signed-off-by: Don Zickus 
> ---
> 
> I mistyped the usb mailing list address, so this patch never made it there.
> Resending with the correct address.  Sorry for the spam.
> 
> 
> 
>  drivers/hid/usbhid/hid-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hid/usbhid/hid-core.c b/drivers/hid/usbhid/hid-core.c
> index b37d944..9935a64 100644
> --- a/drivers/hid/usbhid/hid-core.c
> +++ b/drivers/hid/usbhid/hid-core.c
> @@ -164,7 +164,7 @@ static void hid_io_error(struct hid_device *hid)
>   if (time_after(jiffies, usbhid->stop_retry)) {
>  
>   /* Retries failed, so do a port reset unless we lack bandwidth*/
> - if (test_bit(HID_NO_BANDWIDTH, &usbhid->iofl)
> + if (!test_bit(HID_NO_BANDWIDTH, &usbhid->iofl)
>&& !test_and_set_bit(HID_RESET_PENDING, &usbhid->iofl)) {
>  
>   schedule_work(&usbhid->reset_work);
> -- 
> 1.8.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RESEND] usbhid: Fix the check for HID_RESET_PENDING in hid_io_error

2015-08-10 Thread Don Zickus

It was reported that after 10-20 reboots, a usb keyboard plugged
into a docking station would not work unless it was replugged in.

Using usbmon, it turns out the interrupt URBs were streaming with
callback errors of -71 for some reason.  The hid-core.c::hid_io_error was
supposed to retry and then reset, but the reset wasn't really happening.

The check for HID_NO_BANDWIDTH was inverted.  Fix was simple.

Tested by reporter and locally by me by unplugging a keyboard halfway until I
could recreate a stream of errors but no disconnect.

Signed-off-by: Don Zickus 
---

I mistyped the usb mailing list address, so this patch never made it there.
Resending with the correct address.  Sorry for the spam.



 drivers/hid/usbhid/hid-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hid/usbhid/hid-core.c b/drivers/hid/usbhid/hid-core.c
index b37d944..9935a64 100644
--- a/drivers/hid/usbhid/hid-core.c
+++ b/drivers/hid/usbhid/hid-core.c
@@ -164,7 +164,7 @@ static void hid_io_error(struct hid_device *hid)
if (time_after(jiffies, usbhid->stop_retry)) {
 
/* Retries failed, so do a port reset unless we lack bandwidth*/
-   if (test_bit(HID_NO_BANDWIDTH, &usbhid->iofl)
+   if (!test_bit(HID_NO_BANDWIDTH, &usbhid->iofl)
 && !test_and_set_bit(HID_RESET_PENDING, &usbhid->iofl)) {
 
schedule_work(&usbhid->reset_work);
-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb, misc: Fix a warning for using an uninitialized variable

2014-10-02 Thread Don Zickus

On Wed, Oct 01, 2014 at 09:15:23PM -0400, Don Zickus wrote:
> Recently our build environment changed and started turning some warnings into
> errors.  One of the fallouts is this warning:
> 
>   CC [M]  drivers/usb/misc/usb3503.o
> drivers/usb/misc/usb3503.c: In function ‘usb3503_probe’:
> drivers/usb/misc/usb3503.c:195: warning: ‘err’ may be used uninitialized in 
> this function

Doh.  I rushed this and didn't realize this was already fixed in v3.17.
Sorry for the noise.  My tree was only at v3.16. :-(

self-nak.

Cheers,
Don

> 
> Fixed it by using the proper error.  Compiled only as I don't have the 
> hardware
> to test correctly.
> 
> Signed-off-by: Don Zickus 
> ---
>  drivers/usb/misc/usb3503.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/usb/misc/usb3503.c b/drivers/usb/misc/usb3503.c
> index f43c619..773a3a6 100644
> --- a/drivers/usb/misc/usb3503.c
> +++ b/drivers/usb/misc/usb3503.c
> @@ -192,7 +192,7 @@ static int usb3503_probe(struct usb3503 *hub)
>  
>   clk = devm_clk_get(dev, "refclk");
>   if (IS_ERR(clk) && PTR_ERR(clk) != -ENOENT) {
> - dev_err(dev, "unable to request refclk (%d)\n", err);
> + dev_err(dev, "unable to request refclk (%ld)\n", 
> PTR_ERR(clk));
>   return PTR_ERR(clk);
>   }
>  
> -- 
> 1.7.1
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] usb, misc: Fix a warning for using an uninitialized variable

2014-10-01 Thread Don Zickus

Recently our build environment changed and started turning some warnings into
errors.  One of the fallouts is this warning:

  CC [M]  drivers/usb/misc/usb3503.o
drivers/usb/misc/usb3503.c: In function ‘usb3503_probe’:
drivers/usb/misc/usb3503.c:195: warning: ‘err’ may be used uninitialized in 
this function

Fixed it by using the proper error.  Compiled only as I don't have the hardware
to test correctly.

Signed-off-by: Don Zickus 
---
 drivers/usb/misc/usb3503.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/misc/usb3503.c b/drivers/usb/misc/usb3503.c
index f43c619..773a3a6 100644
--- a/drivers/usb/misc/usb3503.c
+++ b/drivers/usb/misc/usb3503.c
@@ -192,7 +192,7 @@ static int usb3503_probe(struct usb3503 *hub)
 
clk = devm_clk_get(dev, "refclk");
if (IS_ERR(clk) && PTR_ERR(clk) != -ENOENT) {
-   dev_err(dev, "unable to request refclk (%d)\n", err);
+   dev_err(dev, "unable to request refclk (%ld)\n", 
PTR_ERR(clk));
return PTR_ERR(clk);
}
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFT] xhci: Fix resume issues on Renesas chips in Samsung laptops

2014-01-22 Thread Don Zickus

On Wed, Jan 22, 2014 at 02:04:19PM -0800, Sarah Sharp wrote:
> On Wed, Jan 22, 2014 at 09:18:27AM -0500, Prarit Bhargava wrote:
> > 
> > 
> > > Don Zickus  writes:
> > > 
> > > Some co-workers of mine bought Samsung laptops that had mostly usb3 ports.
> > > Those ports did not resume correctly (the driver would timeout 
> > > communicating
> > > and fail).  This led to frustration as suspend/resume is a common use for
> > > laptops.
> > > 
> > > Poking around, I applied the reset on resume quirk to this chipset and the
> > > resume started working.  Reloading the xhci_hcd module had been the 
> > > temporary
> > > workaround.
> > > 
> > > Signed-off-by: Sarah Sharp 
> > > Reported-by: Don Zickus 
> > > Cc: stable # 2.6.37
> > > ---
> > > 
> > > Don, please ask your co-worker to apply this patch and test.  Let me
> > > know if your other co-worker has a different subsystem device ID in his
> > > Samsung laptop, when he gets back from vacation.  If so, we'll extend it
> > > to all Samsung laptops with that particular Renesas host.
> > > 
> > > Sarah Sharp
> > 
> > Sarah,
> > 
> > I tested a USB2 hard drive and a USB3 memory stick in the previously
> > non-functional ports.  Both now work AFAICT.
> > 
> > Please add
> > 
> > Tested-by: Prarit Bhargava 
> 
> Thanks for testing!  This patch will be queued to the USB subsystem
> maintainer after 3.14-rc1 is out.

Thanks Sarah!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] xhci: fix resume issues on Renesas chips

2014-01-14 Thread Don Zickus

On Mon, Jan 13, 2014 at 03:17:48PM -0800, Sarah Sharp wrote:
> On Mon, Jan 13, 2014 at 10:49:49AM -0500, Don Zickus wrote:
> > Some co-workers of mine bought Samsung laptops that had mostly usb3 ports.
> > Those ports did not resume correctly (the driver would timeout communicating
> > and fail).  This led to frustration as suspend/resume is a common use for
> > laptops.
> > 
> > Poking around, I applied the reset on resume quirk to this chipset and the
> > resume started working.  Reloading the xhci_hcd module had been the 
> > temporary
> > workaround.
> > 
> > Not sure if I should restrict this further or not.  Probably should be sent
> > to stable too.
> 
> Can you provide the `sudo lspci -vvv -n` for the xHCI host?  I would
> rather restrict this quirk further, since this is the first report I've
> seen of an NEC host needing this quirk.  I would rather limit it to a
> Samsung subsystem vendor ID if that's available.

I figured.  I attached the output of one laptop here.  The other laptop's
owner is on vacation for two weeks. :-(

Cheers,
Don
00:00.0 0600: 8086:0154 (rev 09)
Subsystem: 144d:c0d3
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- 

00:02.0 0300: 8086:0166 (rev 09) (prog-if 00 [VGA controller])
Subsystem: 144d:c0d3
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR-  [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee0f00c  Data: 41e1
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [a4] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: i915
Kernel modules: i915

00:16.0 0780: 8086:1e3a (rev 04)
Subsystem: 144d:c0d3
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, 
L1 <1us
ExtTag- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ 
TransPend-
LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 
<512ns, L1 <16us
ClockPM- Surprise- LLActRep+ BwNot-
LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- Retrain- 
CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- 
Surprise-
Slot #0, PowerLimit 10.000W; Interlock- NoCompl+
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- 
LinkChg-
Control: AttnInd Unknown, PwrInd Unknown, Power- 
Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ 
Interlock-
Changed: MRL- PresDet- LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- 
CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID , PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF 
Not Supported ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, 
OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
 Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
 Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, 
EqualizationComplete-, EqualizationPhase1-
 EqualizationPhase2-, EqualizationPhase3-, 
LinkEqualizationRequest-
Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
Address:   Data: 
Capabiliti

[PATCH] xhci: fix resume issues on Renesas chips

2014-01-13 Thread Don Zickus

Some co-workers of mine bought Samsung laptops that had mostly usb3 ports.
Those ports did not resume correctly (the driver would timeout communicating
and fail).  This led to frustration as suspend/resume is a common use for
laptops.

Poking around, I applied the reset on resume quirk to this chipset and the
resume started working.  Reloading the xhci_hcd module had been the temporary
workaround.

Not sure if I should restrict this further or not.  Probably should be sent
to stable too.

Signed-off-by: Don Zickus 
---
 drivers/usb/host/xhci-pci.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 73f5208..2a21892 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -144,6 +144,8 @@ static void xhci_pci_quirks(struct device *dev, struct 
xhci_hcd *xhci)
}
if (pdev->vendor == PCI_VENDOR_ID_VIA)
xhci->quirks |= XHCI_RESET_ON_RESUME;
+   if (pdev->vendor == PCI_VENDOR_ID_RENESAS)
+   xhci->quirks |= XHCI_RESET_ON_RESUME;
 }
 
 /* called during probe() after chip reset completes */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Looking for a USB device that uses cdc-wdm driver

2013-10-08 Thread Don Zickus

On Tue, Oct 08, 2013 at 11:10:52AM -0500, Dan Williams wrote:
> On Tue, 2013-10-08 at 11:29 -0400, Don Zickus wrote:
> > Hi Oliver,
> > 
> > I am trying to verify a fix for the cdc-wdm driver and am having trouble
> > figuring out what devices requires that driver.  It seems related to cell
> > phones but I wasn't sure if there was a particular one or all of them in
> > general (I couldn't get my cell to use it).  Thanks in advance!
> 
> Find an Ericsson F3507 or F3607 PCIe minicard WWAN modem, like:
> 
> http://www.ebay.com/itm/New-Unlocked-Ericsson-F3607GW-DELL-Wireless-5540-3G-GPS-Mini-WWAN-Card-/290659200462?pt=US_Internal_Network_Cards&hash=item43aca36dce
> 
> They're cheap.  They expose a couple cdc-wdm ports, one of which speaks
> AT commands and another which might do the GPS stuff.  You don't need a
> SIM.  If you're at a loss for how to connect it to your computer, or
> your BIOS locks out unauthorized cards, grab one of these:
> 
> http://www.hwtools.net/Adapter/USBMS-F.html
> 
> which is a small PCIe minicard carrier with a USB connector.

Thanks!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Looking for a USB device that uses cdc-wdm driver

2013-10-08 Thread Don Zickus

On Tue, Oct 08, 2013 at 05:36:46PM +0200, Bjørn Mork wrote:
> Don Zickus  writes:
> 
> > I am trying to verify a fix for the cdc-wdm driver and am having trouble
> > figuring out what devices requires that driver.  It seems related to cell
> > phones but I wasn't sure if there was a particular one or all of them in
> > general (I couldn't get my cell to use it).  Thanks in advance!
> 
> The Ericsson MBM devices use it. E.g F3507g, F5521gw etc.

Thanks!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Looking for a USB device that uses cdc-wdm driver

2013-10-08 Thread Don Zickus

Hi Oliver,

I am trying to verify a fix for the cdc-wdm driver and am having trouble
figuring out what devices requires that driver.  It seems related to cell
phones but I wasn't sure if there was a particular one or all of them in
general (I couldn't get my cell to use it).  Thanks in advance!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb: xhci: Disable runtime PM suspend for quirky controllers.

2013-05-29 Thread Don Zickus

On Wed, May 29, 2013 at 12:38:28PM -0700, Sarah Sharp wrote:
> On Tue, May 28, 2013 at 02:41:18PM -0700, Julius Werner wrote:
> > The policy we want to achieve is to disable runtime PM iff there is a
> > device connected that doesn't have persist_enabled or a reset_resume()
> > handler and whose parent/root hub resets on resume, right?
> 
> Makes sense.  However, not all distros may want that policy, so there
> should be a way to change that policy via sysfs.  Some distros may
> choose to take the power savings over having a particular USB device
> work, especially in the server market.
> 
> Don, Oliver, what do you think of this patch:
> http://marc.info/?l=linux-usb&m=136941922715772&w=2

That is limited only to certain controllers right?  RHEL6 doesn't support
runtime suspend, so we don't hear to many complaints.  Most of our server
customers don't have much plugged into USB, so I don't expect much
problems there.  Our laptop customers prefer the power savings, but I
don't know how many of them have chipsets with XHCI_RESET_ON_RESUME.

> 
> Julius is proposing to limit the scope of the patch a bit, but the
> impact will still be that TI hosts will be active more often than not.

Hmm, for some reason I don't see TI having the XHCI_RESET_ON_RESUME quirk
set, just VIA and ETRON.  Neither of which seem to be normally shipped
with servers.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4] xhci - correct comp_mode_recovery_timer on return from hibernate

2013-03-06 Thread Don Zickus

On Tue, Mar 05, 2013 at 03:14:10PM -0800, Sarah Sharp wrote:
> >   */
> >  static void compliance_mode_recovery_timer_init(struct xhci_hcd *xhci)
> >  {
> > +   if (timer_pending(&xhci->comp_mode_recovery_timer)) {
> > +   xhci_dbg(xhci, "Compliance Mode Recovery Timer already 
> > active.\n");
> > +   return;
> > +   }
> > +
> > xhci->port_status_u0 = 0;
> > init_timer(&xhci->comp_mode_recovery_timer);
> 
> Isn't this racy?  There's no locking here, so what if the timer fired
> just before the call to timer_pending happened?  (The documentation on
> the timer interface doesn't describe whether timer_pending returns true
> if we're in the middle of a timer callback.)

Ah, yes.  I think I misled Tony here.  I thought the timer_pending() check
just checked to see if the timer was on the linked list, not realizing the
timer is removed from that list when executed.  Instead I thought the
expires was set to -1 (or something large that it never expired) but it
stayed on the list.

We might have to dumb it down and just use a global init variable that is
set/cleared by the compliance functions with the assumption they are
called serially (as not to race).

Though this all seems very hackish and ugly.  There has to be a better way
to do this.. :-(


> 
> Also, looking at the compliance timer callback brings up further
> questions:
> 
> static void compliance_mode_recovery(unsigned long arg)
> {
> ...
> for (i = 0; i < xhci->num_usb3_ports; i++) {
> temp = xhci_readl(xhci, xhci->usb3_ports[i]);
> if ((temp & PORT_PLS_MASK) == USB_SS_PORT_LS_COMP_MOD) {
> /*
>  * Compliance Mode Detected. Letting USB Core
>  * handle the Warm Reset
>  */
> 
> What happens when the xHCI host controller goes into D3cold due to
> runtime PM?  The port status registers will read as all f's, so we will
> miss any transitions to the compliance mode that happened before or
> during the transition to D3cold.
> 
> This code probably needs to wake up the host controller and keep it from
> suspending until all the ports can be read.

Fun...  I take it there is no document that describes the various paths PM
can take during each of these phases?

Cheers,
Don

> 
> Alan, would the right way to do that be a get/put call into the runtime
> PM core?
> 
> Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] xhci - clarify compliance mode debug messages

2013-02-20 Thread Don Zickus

On Wed, Feb 20, 2013 at 10:54:40AM -0500, Alan Stern wrote:
> On Wed, 20 Feb 2013, Tony Camuso wrote:
> 
> > On 02/20/2013 10:15 AM, Alan Stern wrote:
> > > On Tue, 19 Feb 2013, Sarah Sharp wrote:
> > >
> > >> The one thing I wanted to check was my understanding of the hibernate
> > >> flow path.  As you mentioned, I thought that xhci_suspend would be
> > >> called in the hibernate path, but it's not.
> > >
> > > Are you sure about that?  AFAICT it should be called -- although it
> > > gets called during the poweroff phase of hibernation, not the freeze
> > > phase.
> > >
> > > core/hcd-pci.c: usb_hcd_pci_pm_ops.poweroff = hcd_pci_suspend
> > > core/hcd-pci.c: hcd_pci_suspend calls suspend_common
> > > core/hcd-pci.c: suspend_common calls hcd->driver->pci_suspend
> > > host/xhci-pci.c: xhci_pci_hc_driver.pci_suspend = xhci_pci_suspend
> > > host/xhci-pci.c: xhci_pci_suspend calls xhci_suspend
> > >
> > We instrumented the code, and when invoking pm-hibernate, we did
> > not flow through xhci_suspend(), where the timer is deleted.
> 
> If you instrumented the code then you ought to know how the actual 
> behavior differs from what I outlined above.  Where is the difference?

Hi Alan,

I worked with Tony on this issue.  The instrumentation we did was to add
printks in the xhci_suspend, xhci_resume and
compliance_mode_recovery_timer_init.

What we noticed was that the xhci_suspend was not called during hibernate.
As a result the timer was not deleted.  During the resume, xhci_init
called compliance_mode_recovery_timer_init which caused the initial panic
(adding an existing timer).

We added code to compliance_mode_recovery_timer_init to bail if the timer
was already enabled.  This lead us to the second problem at the bottom of
xhci_resume where compliance_mode_recovery_timer_init was called again.

We did not attempt to understand the mechanics of how suspend/hibernate
and resume worked.  We just assumed based on the hibernate flag to
xhci_resume there were two code paths.

If the expectation is that xhci_suspend should be called during hibernate,
I can work with Tony to figure out why this is not happening.  Solving
that problem would partially solve our problem.  We would still need the
!hibernate flag at the bottom of xhci_resume to surround the timer check
to deal with xhci_init calling the same init function.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] xhci - clarify compliance mode debug messages

2013-02-19 Thread Don Zickus

On Tue, Feb 19, 2013 at 01:48:51PM -0800, Sarah Sharp wrote:
> Hi Tony,
> 
> Greg closed the usb-next tree for 3.9 two weeks ago.  The bug fix
> patches will have to go in after the 3.9 merge window closes
> (approximately two weeks from now).
> 
> Sorry for the lack of response, I was on vacation.  I'll review your
> patch in the next few days.
> 
> The one thing I wanted to check was my understanding of the hibernate
> flow path.  As you mentioned, I thought that xhci_suspend would be
> called in the hibernate path, but it's not.  Perhaps there is a better
> function that's called in both the hibernate and suspend path where we
> can stop the compliance mode timer.

Hi Sarah,

That was my understanding to, which is why I asked Tony to cc Rafael.  I
was hoping he could point us in the right direction..

Cheers,
Don

> 
> Sarah Sharp
> 
> On Tue, Feb 19, 2013 at 11:01:16AM -0500, Tony Camuso wrote:
> > Sarah,
> > 
> > This patch is pursuant to the patch at ...
> > http://www.mail-archive.com/linux-usb@vger.kernel.org/msg14965.html,
> > Message ID <1361209953-5195-1-git-send-email-tcam...@redhat.com>
> > ... and must be applied after that patch is applied.
> > 
> > Can these patches be included in the next merge?
> > 
> > Thanks,
> > Tony
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: uhci irq race before term_td is allocated

2013-01-22 Thread Don Zickus

On Tue, Jan 22, 2013 at 06:17:00AM +, Huang, Adrian (ISS Linux TW) wrote:
> Hi Alan,
> 
> Recently, I'm debugging this issue (the same symptom described by Don) on my 
> server. Fortunately, the issue was fixed by applying your patch posted in 
> this email thread. Just checked the latest kernel (3.7.4) and the mainline 
> (3.8-rc4), the patch was not included in those kernels. Do you have any 
> schedule to submit the patch to the upstream? Thanks in advance. 

I don't think Alan ever committed that patch because I never gave him
feedback on it.  Of course, the reason I never gave feedback was the
customer I was working with (HP??) updated their firmware and couldn't
reproduce the problem anymore.  Therefore the patch couldn't be verified
to work or not.

But if you are able to show the patch working, I'll leave it up to Alan to
decide what to do.

Cheers,
Don

> 
> The list of the allocated IRQ (uhci_hcd used the shared interrupt) on my 
> server: 
>CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   
> CPU6   CPU7
>   0:120  0  0  0  0  0
>   0  0  IR-IO-APIC-edge  timer
>   1:  2  0  0  0  0  0
>   0  0  IR-IO-APIC-edge  i8042
>   3:  2  0  0  0  0  0
>   0  0  IR-IO-APIC-edge
>   4:  1  0  0  0  0  0
>   0  0  IR-IO-APIC-edge
>   8:  1  0  0  0  0  0
>   0  0  IR-IO-APIC-edge  rtc0
>   9:  0  0  0  0  0  0
>   0  0  IR-IO-APIC-fasteoi   acpi
>  12:  4  0  0  0  0  0
>   0  0  IR-IO-APIC-edge  i8042
>  16:   1187  0  0  0  0  0
>   0  0  IR-IO-APIC-fasteoi   uhci_hcd:usb3, hpilo
>  20:   1385  0  0  0  0  0
>   0  0  IR-IO-APIC-fasteoi   ehci_hcd:usb2
>  21:167  0  0  0  0  0
>   0  0  IR-IO-APIC-fasteoi   ehci_hcd:usb1
> 
> -- Adrian
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: uhci irq race before term_td is allocated

2012-10-22 Thread Don Zickus

On Thu, Oct 18, 2012 at 03:59:44PM -0400, Alan Stern wrote:
> On Tue, 16 Oct 2012, Don Zickus wrote:
> 
> > > > I did not get a chance to test the patch yet.
> > > 
> > > Let me know what happens.
> > 
> > It might take a while as our customer's customers have trouble reproducing
> > this which means it is difficult to tell if the problem is fixed or just
> > harder to reproduce. :-/
> > 
> > Hopefully after a couple of weeks I will have some indication of which.
> 
> If you want to make the problem more likely to occur, just sleep for a 
> few seconds in usb_add_hcd() after calling usb_hcd_request_irqs() and 
> before calling hcd->driver->start().
> 
> Doing that both with and without the patch installed should show pretty 
> quickly whether the patch does what you want.

That makes sense.  For some reason kdump is failing me and I had to give
back the machine.  I'll try again later this week.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: uhci irq race before term_td is allocated

2012-10-16 Thread Don Zickus

On Tue, Oct 16, 2012 at 12:01:29PM -0400, Alan Stern wrote:
> On Tue, 16 Oct 2012, Don Zickus wrote:
> 
> > > I dislike adding an extra test to a hot path, but there doesn't seem to 
> > > be any way around it.  Some of the other HCDs may need a similar 
> > > change.
> > 
> > I understand your concern.  I was curious in uhci_irq, I see the following
> > fragment:
> > 
> > status = uhci_readw(uhci, USBSTS);
> > if (!(status & ~USBSTS_HCH))/* shared interrupt, not mine */
> > return IRQ_NONE;
> > uhci_writew(uhci, status, USBSTS);  /* Clear it */
> > 
> > Should that be helping in filtering out shared interrupts or not in this
> > case?
> 
> If it did help, your system wouldn't have crashed.  :-)

Hehe.  True.

> 
> What must have happened is that one of the bits in the USBSTS register 
> was set.  It didn't cause an IRQ because the USBINTR register was 
> clear, but an IRQ arrived anyway from some other device sharing the IRQ 
> line.
> 
> Since USBSTS had bits set, the test above did not filter out the 
> interrupt.  In other words, even though it really was a shared 
> interrupt it didn't look that way, because the UHCI controller would 
> have generated an interrupt request if it had been allowed to.

So this could have been a leftover interrupt from the boot kernel?

> 
> > I did not get a chance to test the patch yet.
> 
> Let me know what happens.

It might take a while as our customer's customers have trouble reproducing
this which means it is difficult to tell if the problem is fixed or just
harder to reproduce. :-/

Hopefully after a couple of weeks I will have some indication of which.

Thanks for the help.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: uhci irq race before term_td is allocated

2012-10-16 Thread Don Zickus

On Mon, Oct 15, 2012 at 05:31:01PM -0400, Alan Stern wrote:
> > 
> > uhci_scan_schedule->
> >   uhci_clear_next_interrupt->
> > uhci->term_td->status &= ~cpu_to_hc32(uhci, TD_CTRL_IOC);
> > 
> > This panics becase term_td is not allocated yet.
> > 
> > Now I could be wrong about the interrupts and the uhci_start routine and
> > perhaps this is prevented somehow.  This is why I am asking what is the
> > expectation for the above scenario.
> 
> > I was just wondering if you had a quick thought about this or not.

Hi Alan,

Thanks for your respone.

> 
> I suspect you're getting a shared interrupt, that is, one generated by 
> another device using the same IRQ line and not by the UHCI controller 
> itself.  Can you check the IRQ assignments to see if that's possible?

A box in our lab (which should have similar hardware) shows
(only first cpu shown, other 159 snipped)

17: 183 IR-IO-APIC-fasteoi uhci_hcd:usb6, ata_piix, hpilo
20: 3 IR-IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
22: 0 IR-IO-APIC-fasteoi uhci_hcd:usb4
23: 53 IR-IO-APIC-fasteoi uhci_hcd:usb3, uhci_hcd:usb5, radeon

So yes, you are probably correct with regards to shared interrupt.

> 
> I'm reasonably sure that the controller didn't generate the bad 
> interrupt, because it's not allowed to generate any IRQs until the 
> USBINTR register is set to a nonzero value.  That doesn't happen until 
> start_rh() is called near the end of uhci_start().
> 
> Since uhci_irq() doesn't check to see whether the uhci->is_initialized 
> flag has been set, a shared interrupt arriving too early could cause 
> exactly the sort of problem you see here.
> 
> I dislike adding an extra test to a hot path, but there doesn't seem to 
> be any way around it.  Some of the other HCDs may need a similar 
> change.

I understand your concern.  I was curious in uhci_irq, I see the following
fragment:

status = uhci_readw(uhci, USBSTS);
if (!(status & ~USBSTS_HCH))/* shared interrupt, not mine */
return IRQ_NONE;
uhci_writew(uhci, status, USBSTS);  /* Clear it */

Should that be helping in filtering out shared interrupts or not in this
case?

I did not get a chance to test the patch yet.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

uhci irq race before term_td is allocated

2012-10-15 Thread Don Zickus

Hi Alan,

I am seeing an odd panic with uhci when a 160 cpu box panics and starts
running a kdump kernel (which is the same exact image as the boot kernel)
for our RHEL-6 (2.6.32) kernel.  Now I understand 2.6.32 is not something
upstream supports.

However, my question is what is expected to happen in the following
situtation.  The attached panic below shows

usb_hcd_pci_probe->
  usb_add_hcd->

/* enable irqs just before we start the controller,
 * if the BIOS provides legacy PCI irqs.
 */
if (usb_hcd_is_primary_hcd(hcd) && irqnum) {
retval = usb_hcd_request_irqs(hcd, irqnum, irqflags);
if (retval)
goto err_request_irq;
}

hcd->state = HC_STATE_RUNNING;
retval = hcd->driver->start(hcd);
if (retval < 0) {
dev_err(hcd->self.controller, "startup error %d\n",
retval);
goto err_hcd_driver_start;
}


That interrupts are setup and enabled before hcd->driver_start is called.

Now in

uhci_start->

uhci->term_td = uhci_alloc_td(uhci);


Happens.  But what if an interrupt comes in before that call runs, for
example:

uhci_irq->

if (status & USBSTS_RD)
usb_hcd_poll_rh_status(hcd);
else {
spin_lock(&uhci->lock);
uhci_scan_schedule(uhci);
spin_unlock(&uhci->lock);
}


In uhci_scan_schedule(uhci)

uhci_scan_schedule->
  uhci_clear_next_interrupt->
uhci->term_td->status &= ~cpu_to_hc32(uhci, TD_CTRL_IOC);

This panics becase term_td is not allocated yet.

Now I could be wrong about the interrupts and the uhci_start routine and
perhaps this is prevented somehow.  This is why I am asking what is the
expectation for the above scenario.

Below is an actual panic that was reported against 2.6.32 which I think
shows the scenario I described above (that could be wrong too).  Though I
did expect an interrupt exception frame in there to backup my theory...

usb usb5: Product: UHCI Host Controller
usb usb5: Manufacturer: Linux 2.6.32-220.7.1.el6.x86_64 uhci_hcd
usb usb5: SerialNumber: :00:1d.3
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
hub 5-0:1.0: 2 ports detected
uhci_hcd :02:00.4: PCI INT B -> GSI 17 (level, low) -> IRQ 17
uhci_hcd :02:00.4: UHCI Host Controller
uhci_hcd :02:00.4: new USB bus registered, assigned bus number 6
uhci_hcd :02:00.4: port count misdetected? forcing to 2 ports
BUG: unable to handle kernel NULL pointer dereference at 0004
IP: [] uhci_scan_schedule+0x46/0xb20
PGD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:

Pid: 56, comm: work_for_cpu Not tainted 2.6.32-220.7.1.el6.x86_64 #1 HP
ProLiant DL980 G7
RIP: 0010:[]  [    
<0>   3731 880019104620
Call Trace:
 [] uhci_irq+0x7c/0x180
 [] ? disable_irq_nosync+0x64/0xa0
 [] usb_hcd_irq+0x3f/0x90
 [] request_threaded_irq+0x1bc/0x2f0
 [] ? usb_hcd_irq+0x0/0x90
 [] usb_add_hcd+0x3be/0x800
 [] usb_hcd_pci_probe+0x158/0x3d0
 [] ? do_work_for_cpu+0x0/0x30
 [] local_pci_probe+0x17/0x20
 [] do_work_for_cpu+0x18/0x30
 [] kthread+0x96/0xa0
 [] child_rip+0xa/0x20
 [] ? kthread+0x0/0xa0
 [] ? child_rip+0x0/0x20
Code: 00 00 48 89 fb a8 01 0f 85 69 0a 00 00 48 8d 57 50 83 c8 01 88 87 c8
00 00 00 48 89 55 98 83 
e0 bd 88 83 c8 00 00 00 48 8b 43 20 <81> 60 04 ff ff ff fe 83 bb bc 00 00
00 00 0f 84 09 0a 00 00 
8b
RIP  [] uhci_scan_schedule+0x46/0xb20
 RSP 
CR2: 0004
---[ end trace b5ebaaef70b2501d ]---

If you aren't familar with kdump issues.  Then it is not uncommon for
interrupts to still be active on devices (though irqpoll is set to prevent
irq flooding on boot up).  This box doesn't have much for usb devices.
The only thing people think might be helping cause this problem is from
customers running '/usr/sbin/gpm -m /dev/input/mice -t exps2'.

On the other hand, we are not to sure how to duplicate the problem other
than 'echo c > /proc/sysrq-trigger' about 100 times and see if we get
lucky.

I was just wondering if you had a quick thought about this or not.

Thanks,
Don

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFT] usb: Don't enable LPM if the exit latency is zero.

2012-10-03 Thread Don Zickus

On Wed, Oct 03, 2012 at 11:29:00AM -0700, Sarah Sharp wrote:
> Hi Don,
> 
> Please test this patch on top of the other patch.

So yeah, that seemed to work. :-)

Thanks!

Cheers,
Don

> 
> Sarah Sharp
> 
> >8---8<
> Some USB 3.0 devices signal that they don't implement Link PM by having
> all zeroes in the U1/U2 exit latencies in their SuperSpeed BOS
> descriptor.  Don found that a Western Digital device he has experiences
> transfer errors when LPM is enabled.  The lsusb shows the U1/U2 exit
> latencies are set to zero:
> 
> Binary Object Store Descriptor:
>   bLength 5
>   bDescriptorType15
>   wTotalLength   22
>   bNumDeviceCaps  2
>   SuperSpeed USB Device Capability:
> bLength10
> bDescriptorType16
> bDevCapabilityType  3
> bmAttributes 0x00
>   Latency Tolerance Messages (LTM) Supported
> wSpeedsSupported   0x000e
>   Device can operate at Full Speed (12Mbps)
>   Device can operate at High Speed (480Mbps)
>   Device can operate at SuperSpeed (5Gbps)
> bFunctionalitySupport   1
>   Lowest fully-functional device speed is Full Speed (12Mbps)
> bU1DevExitLat   0 micro seconds
> bU2DevExitLat   0 micro seconds
> 
> The fix is to not enable LPM for a particular link state if we find its
> corresponding exit latency is zero.
> 
> Signed-off-by: Sarah Sharp 
> Reported-by: Don Zickus 
> ---
>  drivers/usb/core/hub.c |   14 ++
>  1 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 55bef91..2568441 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -3484,8 +3484,16 @@ static void usb_enable_link_state(struct usb_hcd *hcd, 
> struct usb_device *udev,
>   enum usb3_link_state state)
>  {
>   int timeout;
> - __u8 u1_mel;
> - __le16 u2_mel;
> + __u8 u1_mel = udev->bos->ss_cap->bU1devExitLat;
> + __le16 u2_mel = udev->bos->ss_cap->bU2DevExitLat;
> +
> + /* If the device says it doesn't have *any* exit latency to come out of
> +  * U1 or U2, it's probably lying.  Assume it doesn't implement that link
> +  * state.
> +  */
> + if ((state == USB3_LPM_U1 && u1_mel == 0) ||
> + (state == USB3_LPM_U2 && u2_mel == 0))
> + return;
>  
>   /* We allow the host controller to set the U1/U2 timeout internally
>* first, so that it can change its schedule to account for the
> @@ -3512,8 +3520,6 @@ static void usb_enable_link_state(struct usb_hcd *hcd, 
> struct usb_device *udev,
>* link commands.  This can cause transfer errors, so only enable
>* device-initiated LPM.
>*/
> - u1_mel = udev->bos->ss_cap->bU1devExitLat;
> - u2_mel = udev->bos->ss_cap->bU2DevExitLat;
>   if ((state == USB3_LPM_U1 && u1_mel == USB_U1_MAX_VALID_MEL) ||
>   (state == USB3_LPM_U2 &&
>le16_to_cpu(u2_mel) == USB_U2_MAX_VALID_MEL)) {
> -- 
> 1.7.9
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFT] usb: Fix TX errors on devices with unsupported LPM states.

2012-10-03 Thread Don Zickus

On Tue, Oct 02, 2012 at 02:42:41PM -0700, Sarah Sharp wrote:
> > > USB 3.0 devices are required to support Link PM.  However, some of
> > > them don't support U1, or don't support U2, or don't support either.
> > > There is no way in the USB 3.0 specification to say that a device
> > > doesn't support U1 or U2, so these devices set the U1 or U2 exit
> > > latency to the maximum possible values (10ms for U1, 2047ms for U2).
> > 
> > It still failed for me. :-(
> 
> Can you send me the `sudo lsusb -v` output for the device?  Also, please
> see if you can get a new cable and retest.

I attached the lsusb -v output (had to use another usb3 controller to
obtain it).  I can't really use another cable because it comes with its
own usb3 mini cable.  All the other usb3 cables I have are the normal
sized ones (with the fatter end).

Cheers,
Don


Bus 006 Device 002: ID 1058:0730 Western Digital Technologies, Inc. 
Device Descriptor:
  bLength18
  bDescriptorType 1
  bcdUSB   3.00
  bDeviceClass0 (Defined at Interface level)
  bDeviceSubClass 0 
  bDeviceProtocol 0 
  bMaxPacketSize0 9
  idVendor   0x1058 Western Digital Technologies, Inc.
  idProduct  0x0730 
  bcdDevice   10.16
  iManufacturer   1 Western Digital
  iProduct2 My Passport 0730
  iSerial 3 575837314132314434373235
  bNumConfigurations  1
  Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength   44
bNumInterfaces  1
bConfigurationValue 1
iConfiguration  0 
bmAttributes 0x80
  (Bus Powered)
MaxPower  224mA
Interface Descriptor:
  bLength 9
  bDescriptorType 4
  bInterfaceNumber0
  bAlternateSetting   0
  bNumEndpoints   2
  bInterfaceClass 8 Mass Storage
  bInterfaceSubClass  6 SCSI
  bInterfaceProtocol 80 Bulk-Only
  iInterface  0 
  Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x84  EP 4 IN
bmAttributes2
  Transfer TypeBulk
  Synch Type   None
  Usage Type   Data
wMaxPacketSize 0x0400  1x 1024 bytes
bInterval   0
bMaxBurst  15
  Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x03  EP 3 OUT
bmAttributes2
  Transfer TypeBulk
  Synch Type   None
  Usage Type   Data
wMaxPacketSize 0x0400  1x 1024 bytes
bInterval   0
bMaxBurst  15
Binary Object Store Descriptor:
  bLength 5
  bDescriptorType15
  wTotalLength   22
  bNumDeviceCaps  2
  SuperSpeed USB Device Capability:
bLength10
bDescriptorType16
bDevCapabilityType  3
bmAttributes 0x00
  Latency Tolerance Messages (LTM) Supported
wSpeedsSupported   0x000e
  Device can operate at Full Speed (12Mbps)
  Device can operate at High Speed (480Mbps)
  Device can operate at SuperSpeed (5Gbps)
bFunctionalitySupport   1
  Lowest fully-functional device speed is Full Speed (12Mbps)
bU1DevExitLat   0 micro seconds
bU2DevExitLat   0 micro seconds
  USB 2.0 Extension Device Capability:
bLength 7
bDescriptorType16
bDevCapabilityType  2
bmAttributes   0x0002
  Link Power Management (LPM) Supported
Device Status: 0x0002
  (Bus Powered)
  Remote Wakeup Enabled
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFT] usb: Fix TX errors on devices with unsupported LPM states.

2012-10-02 Thread Don Zickus

On Tue, Oct 02, 2012 at 12:15:46PM -0700, Sarah Sharp wrote:
> Hi Don,
> 
> The WD drive arrived today, but I can't reproduce your I/O errors on
> 3.6.  I didn't try with 3.5 yet.  However, I did notice that the device
> really doesn't want to enter U1, but it will enter U2.

I did update the firmware but I think I had the problem before that
happened.  Also I don't use the VIA hub anymore.  I plugged the harddrive
directly into my mahobay system's usb3 port and see the problem.

> 
> Can you test the following patch with your device and see if it helps?
> If that doesn't work, I suggest getting a high-quality new cable and
> seeing if that helps the transfer issues.
> 
> Gabor, this might fix your issues as well, and should allow your system
> to consume less power than completely disabling LPM.
> 
> Sarah Sharp
> 
> >8-8<
> 
> USB 3.0 devices are required to support Link PM.  However, some of
> them don't support U1, or don't support U2, or don't support either.
> There is no way in the USB 3.0 specification to say that a device
> doesn't support U1 or U2, so these devices set the U1 or U2 exit
> latency to the maximum possible values (10ms for U1, 2047ms for U2).

It still failed for me. :-(

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: xhci: LPM issues using Western Digital harddrive

2012-09-28 Thread Don Zickus

On Fri, Sep 28, 2012 at 11:39:48AM -0400, Alan Stern wrote:
> On Fri, 28 Sep 2012, Don Zickus wrote:
> 
> > > Probably the device was already hung.  The log showed that an earlier 
> > > transfer completed with a STALL.  The reason wasn't apparent, although 
> > > a usbmon trace might give a clue.  It might also show why the device 
> > > had to be reset.
> > 
> > Here is the output of 'cat 4u'
> 
> > 880240121860 3207118259 S Bo:4:002:3 -115 31 = 55534243 6c00 
> >  0600    00
> > 880240121860 3207118359 C Bo:4:002:3 0 31 >
> > 880240121860 3207118430 S Bi:4:002:4 -115 13 <
> > 880240121860 3207118511 C Bi:4:002:4 0 13 = 55534253 6c00  
> > 00
> > 880240121860 3207127088 S Bo:4:002:3 -115 31 = 55534243 6d00 
> > 0002 8ca1 082e0001  ec00 00
> > 880240121860 3207127197 C Bo:4:002:3 0 31 >
> > 880240122698 3207127238 S Bi:4:002:4 -115 512 <
> > 880240122698 3207143925 C Bi:4:002:4 -32 0
> > 880240121860 3207144021 S Co:4:002:0 s 02 01  0084  0
> > 880240121860 3207220654 C Co:4:002:0 -32 0
> > 880240122698 3207220814 S Co:4:001:0 s 23 03 0004 0004  0
> > 880240122698 3207226644 C Co:4:001:0 0 0
> 
> This shows that the problem began when the device was sent a command it
> didn't recognize: 0xA1, which is a 12-byte ATA pass-through, in this
> case for an IDENTIFY DEVICE command (0xEC).  Presumably the Western
> Digital device doesn't support ATA pass-through.  The device halted its
> bulk-IN endpoint and then replied with a STALL to the
> Clear-Endpoint-Halt request (which is an invalid response).  This is
> why the reset was tried.
> 
> Don, what does a comparable usbmon trace look like when the drive is 
> attached to an EHCI controller?  I would expect to see basically the 
> same thing.

I can grab that.  I just wanted to note that this drive works fine under a
Fresco Logic or NEC xhci controller.  But those controllers don't seem to
invoke the LPM code paths either.

I'll provide an EHCI usbmon trace from the panther point controller and if
interesting a usbmon trace from an NEC xhci controller.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: xhci: LPM issues using Western Digital harddrive

2012-09-28 Thread Don Zickus

On Thu, Sep 27, 2012 at 04:42:52PM -0400, Alan Stern wrote:
> On Thu, 27 Sep 2012, Sarah Sharp wrote:
> 
> > On Wed, Sep 26, 2012 at 02:03:06PM -0400, Don Zickus wrote:
> > > Hi Sarah,
> > > 
> > > As discussed privately, here is the debug output of what happens when I
> > > plug in the hard drive into my panther point system.
> > > 
> > > kernel based off of greg's usb-next (0220a3f01a) with CONFIG_USB_DEBUG and
> > > CONFIG_USB_XHCI_HCD_DEBUGGING turned on.
> > 
> > Thanks for the dmesg Don.  Unfortunately, it doesn't tell me anything
> > terribly interesting.  The address device control message completed with
> > a transfer error twice, and the USB core gave up on it.
> > 
> > I'll have to see if I can figure anything out with a USB bus analyzer
> > when I get the drive next week.
> 
> Probably the device was already hung.  The log showed that an earlier 
> transfer completed with a STALL.  The reason wasn't apparent, although 
> a usbmon trace might give a clue.  It might also show why the device 
> had to be reset.

Here is the output of 'cat 4u'

Cheers,
Don

880240121860 3194234363 S Ci:4:001:0 s a3 00  0001 0004 4 <
880240121860 3194238462 C Ci:4:001:0 0 4 = a002
880240121860 3194238531 S Ci:4:001:0 s a3 00  0002 0004 4 <
880240121860 3194242468 C Ci:4:001:0 0 4 = a002
880240121860 3194242529 S Ci:4:001:0 s a3 00  0003 0004 4 <
880240121860 3194247552 C Ci:4:001:0 0 4 = a002
880240121860 3194247612 S Ci:4:001:0 s a3 00  0004 0004 4 <
880240121860 3194251702 C Ci:4:001:0 0 4 = 03020100
880240121860 3194253923 S Co:4:001:0 s 23 01 0010 0004  0
880240121860 3194256746 C Co:4:001:0 0 0
88023dfd6080 3194357546 S Ii:4:001:1 -115:2048 4 <
880240121860 3194360035 S Ci:4:001:0 s a3 00  0004 0004 4 <
880240121860 3194363928 C Ci:4:001:0 0 4 = 0302
880240121860 3194375687 S Co:4:001:0 s 23 03 0004 0004  0
880240121860 3194378308 C Co:4:001:0 0 0
880240121860 3194428559 S Ci:4:001:0 s a3 00  0004 0004 4 <
880240121860 3194432085 C Ci:4:001:0 0 4 = 03021000
880240121860 3194482579 S Co:4:001:0 s 23 01 0014 0004  0
880240121860 3194484962 C Co:4:001:0 0 0
880240121860 3194753586 S Ci:4:002:0 s 80 06 0100  0008 8 <
880240121860 3194753860 C Ci:4:002:0 0 8 = 12010003 0009
880240121860 3194753956 S Ci:4:002:0 s 80 06 0100  0012 18 <
880240121860 3194754080 C Ci:4:002:0 0 18 = 12010003 0009 58103007 
16100102 0301
880240121860 3194754222 S Ci:4:002:0 s 80 06 0f00  0005 5 <
880240121860 3194754355 C Ci:4:002:0 0 5 = 050f1600 02
880240121860 3194754516 S Ci:4:002:0 s 80 06 0f00  0016 22 <
880240121860 3194754598 C Ci:4:002:0 0 22 = 050f1600 020a1003 000e0001 
0007 10020200 
880240121860 3194754764 S Ci:4:002:0 s 80 06 0200  0009 9 <
880240121860 3194754925 C Ci:4:002:0 0 9 = 09022c00 01010080 70
880240121860 3194755041 S Ci:4:002:0 s 80 06 0200  002c 44 <
880240121860 3194755166 C Ci:4:002:0 0 44 = 09022c00 01010080 70090400 
00020806 5705 84020004 0006300f 0007
880240121860 3194758636 S Ci:4:002:0 s 80 06 0300  00ff 255 <
880240121860 3194760861 C Ci:4:002:0 0 4 = 04030904
880240121860 3194762999 S Ci:4:002:0 s 80 06 0302 0409 00ff 255 <
880240121860 3194765295 C Ci:4:002:0 0 34 = 22034d00 79002000 50006100 
73007300 70006f00 72007400 20003000 37003300
880240121860 3194765434 S Ci:4:002:0 s 80 06 0301 0409 00ff 255 <
880240121860 3194767615 C Ci:4:002:0 0 32 = 20035700 65007300 74006500 
72006e00 20004400 69006700 69007400 61006c00
880240121860 3194767769 S Ci:4:002:0 s 80 06 0303 0409 00ff 255 <
880240121860 3194770006 C Ci:4:002:0 0 50 = 32033500 37003500 38003300 
37003300 31003400 31003300 32003300 31003400
880240122698 3195163067 S Co:4:002:0 s 00 09 0001   0
880240122698 3195163251 C Co:4:002:0 0 0
880240122698 3195295743 S Co:4:001:0 s 23 03 0017 3204  0
880240122698 3195295762 C Co:4:001:0 0 0
880240122698 3195295792 S Co:4:002:0 s 00 30   0006 6 = 0a0a0002 
0002
880240122698 3195295899 C Co:4:002:0 0 6 >
880240122698 3195296015 S Co:4:002:0 s 00 03 0030   0
880240122698 3195296100 C Co:4:002:0 0 0
880240122698 3195435238 S Co:4:001:0 s 23 03 0018 2804  0
880240122698 3195435268 C Co:4:001:0 0 0
880240122698 3195435337 S Co:4:002:0 s 00 30   0006 6 = 0a0a0002 
0002
880240122698 3195435466 C Co:4:002:0 0 6 >
880240122698 3195435595 S Co:4:002:0 s 00 03 0031   0
880240122698 3195435711 C Co:4:002:0 0 0
880240121860 319786 S Co:4:001:0 s 23 03 0017 0004  0
880240121860 319804 C Co:4:001:0 0 0
880240121860 31

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-26 Thread Don Zickus

On Wed, Sep 26, 2012 at 12:55:26PM -0400, Alan Stern wrote:
> On Wed, 26 Sep 2012, Don Zickus wrote:
> 
> > Hi Alan,
> > 
> > This patch seemed to be successful.  I copied and pasted the response from
> > our customer (we backported the patch to RHEL-6/2.6.32):
> 
> ...
> 
> > Is there anything else you need from us or can we move forward with this
> > patch?
> 
> That's good enough; I'll submit it to Greg.

Awesome. Thanks!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-26 Thread Don Zickus

On Mon, Sep 24, 2012 at 12:59:51PM -0400, Alan Stern wrote:
> > If you want to track down what's going wrong, you'll have to add some 
> > debugging code to usb_device_read() and usb_remove_hcd().  By the time 
> > usb_device_dump() starts running, it's already too late.
> 
> After thinking about this some more, I realized that my patch still 
> leaves a race -- although the oops would occur in a different place 
> (where usb_device_read checks bus->root_hub->devnum).
> 
> Here's a different patch which should work better.  It relies on the
> rh_registered flag in the usb_hcd structure, which persists as long as
> the usb_bus structure does, rather than on anything stored in the
> root-hub device structure.

Hi Alan,

This patch seemed to be successful.  I copied and pasted the response from
our customer (we backported the patch to RHEL-6/2.6.32):

"
This new patch testing went very well.  In previous tests, the kernel paniced
during the first or second surprise removal of the ehci_hcd PCI device on an
idle system.  I ran an entire night of surprise and polite device removals with
no kernel panic or Oops.  slub_debug=FZPU was used to poison deallocated
storage blocks and check for use after free.  No BUGs were logged by the
allocator.

Numerous messages like the following were seen at the console, indicating that
the stimulus leading to the panic should be occurring:
 cat: /proc/bus/usb/001/006: No such device

Surprise removals occur when the device is electrically disconnected from the
PCI bus while in use.  During a following clean-up operation, the driver's
remove function called.

Polite removals occur when the driver's remove function is called while the
device is in use.  After return from the driver, the device is electrically
disconnected from the PCI bus.

I ran 5 hours of surprise and polite removals on the same idle system that was
used for the previous tests, about 60 trials of each type of removal.  Then a
workload was started to cause mid-level CPU, Memory and Disk stress; the test
continued to run for 8.5 hours more executing about 38 more trials of each type
of PCI removal.
"

Is there anything else you need from us or can we move forward with this
patch?

Thanks again!

Cheers,
Don

> 
> 
> 
> Index: usb-3.6/drivers/usb/core/devices.c
> ===
> --- usb-3.6.orig/drivers/usb/core/devices.c
> +++ usb-3.6/drivers/usb/core/devices.c
> @@ -624,7 +624,7 @@ static ssize_t usb_device_read(struct fi
>   /* print devices for all busses */
>   list_for_each_entry(bus, &usb_bus_list, bus_list) {
>   /* recurse through all children of the root hub */
> - if (!bus->root_hub)
> + if (!bus_to_hcd(bus)->rh_registered)
>   continue;
>   usb_lock_device(bus->root_hub);
>   ret = usb_device_dump(&buf, &nbytes, &skip_bytes, ppos,
> Index: usb-3.6/drivers/usb/core/hcd.c
> ===
> --- usb-3.6.orig/drivers/usb/core/hcd.c
> +++ usb-3.6/drivers/usb/core/hcd.c
> @@ -1011,10 +1011,7 @@ static int register_root_hub(struct usb_
>   if (retval) {
>   dev_err (parent_dev, "can't register root hub for %s, %d\n",
>   dev_name(&usb_dev->dev), retval);
> - }
> - mutex_unlock(&usb_bus_list_lock);
> -
> - if (retval == 0) {
> + } else {
>   spin_lock_irq (&hcd_root_hub_lock);
>   hcd->rh_registered = 1;
>   spin_unlock_irq (&hcd_root_hub_lock);
> @@ -1023,6 +1020,7 @@ static int register_root_hub(struct usb_
>   if (HCD_DEAD(hcd))
>   usb_hc_died (hcd);  /* This time clean up */
>   }
> + mutex_unlock(&usb_bus_list_lock);
>  
>   return retval;
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-24 Thread Don Zickus

On Mon, Sep 24, 2012 at 12:59:51PM -0400, Alan Stern wrote:
> On Mon, 17 Sep 2012, Alan Stern wrote:
> 
> > On Mon, 17 Sep 2012, Don Zickus wrote:
> > 
> > > I have added some in-depth analysis from our customer.
> > > 
> > > 
> > > The problem is that the failing routine was called with a pointer to 
> > > usbdev in
> > > memory that has already been freed and overwritten with the pool poison 
> > > pattern.
> > > This causes an access violation on line 502:
> > 
> > The stack trace shows that the failure occurred in usb_device_dump(), 
> > which was called from usb_device_read().  Since it wasn't a recursive 
> > call, the usbdev argument must have pointed to the root-hub device.
> > 
> > I don't see how this could have happened if the kernel included the 
> > patch I sent earlier.  usb_device_read() holds the usb_bus_list_lock 
> > mutex throughout, and it tests that the root hub's devnum field is not 
> > equal to 0 before calling usb_dump_device().
> > 
> > Meanwhile, usb_remove_hcd() sets hcd->self.bus.root_hub->devnum to 0
> > while holding usb_bus_list_lock, and it doesn't deallocate the root hub
> > device until after the mutex is dropped.
> > 
> > Are you certain that the test was conducted with the patch in place?
> > 
> > If you want to track down what's going wrong, you'll have to add some 
> > debugging code to usb_device_read() and usb_remove_hcd().  By the time 
> > usb_device_dump() starts running, it's already too late.
> 
> After thinking about this some more, I realized that my patch still 
> leaves a race -- although the oops would occur in a different place 
> (where usb_device_read checks bus->root_hub->devnum).
> 
> Here's a different patch which should work better.  It relies on the
> rh_registered flag in the usb_hcd structure, which persists as long as
> the usb_bus structure does, rather than on anything stored in the
> root-hub device structure.

Thanks!  I will pass this on. Sorry for the lack of feedback.  The
customer I am coordinating with was on vacation last week.  I am still
trying to set up their machine on my end to help debug this faster.

I appreciate you still thinking about this.

Cheers,
Don

> 
> Alan Stern
> 
> 
> 
> Index: usb-3.6/drivers/usb/core/devices.c
> ===
> --- usb-3.6.orig/drivers/usb/core/devices.c
> +++ usb-3.6/drivers/usb/core/devices.c
> @@ -624,7 +624,7 @@ static ssize_t usb_device_read(struct fi
>   /* print devices for all busses */
>   list_for_each_entry(bus, &usb_bus_list, bus_list) {
>   /* recurse through all children of the root hub */
> - if (!bus->root_hub)
> + if (!bus_to_hcd(bus)->rh_registered)
>   continue;
>   usb_lock_device(bus->root_hub);
>   ret = usb_device_dump(&buf, &nbytes, &skip_bytes, ppos,
> Index: usb-3.6/drivers/usb/core/hcd.c
> ===
> --- usb-3.6.orig/drivers/usb/core/hcd.c
> +++ usb-3.6/drivers/usb/core/hcd.c
> @@ -1011,10 +1011,7 @@ static int register_root_hub(struct usb_
>   if (retval) {
>   dev_err (parent_dev, "can't register root hub for %s, %d\n",
>   dev_name(&usb_dev->dev), retval);
> - }
> - mutex_unlock(&usb_bus_list_lock);
> -
> - if (retval == 0) {
> + } else {
>   spin_lock_irq (&hcd_root_hub_lock);
>   hcd->rh_registered = 1;
>   spin_unlock_irq (&hcd_root_hub_lock);
> @@ -1023,6 +1020,7 @@ static int register_root_hub(struct usb_
>   if (HCD_DEAD(hcd))
>   usb_hc_died (hcd);  /* This time clean up */
>   }
> + mutex_unlock(&usb_bus_list_lock);
>  
>   return retval;
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-17 Thread Don Zickus

On Thu, Sep 13, 2012 at 04:31:34PM -0400, Alan Stern wrote:
> On Thu, 13 Sep 2012, Don Zickus wrote:
> 
> > Hi Alan,
> > 
> > I adapted your patch to our 2.6.32 tree and the customer tested it without
> > success.  The output panic is attached below.  I will work on getting a
> > machine with the latest kernel to reproduce this problem so I don't waste
> > your time chasing something that might be fixed upstream.
> > 
> > But if you could take a quick glance at the panic below to see if anything
> > comes to mind I would appreciate it.
> 
> > The test failed on the first surprise removal of PCI devices.  Last gasp
> > from the console is posted below.  I would guess the faulting process was
> > reading a file under /proc/bus/usb.  I should have a dump if more info is
> > needed.
> 
> > ehci_hcd :2c:00.0: HC died; cleaning up
> > ehci_hcd :2c:00.0: force halt; handhake c9654024 4000
> > 4000 -> -19
> > ehci_hcd :2c:00.0: HC died; cleaning up
> > ehci_hcd :2c:00.0: remove, state 0
> > usb usb1: USB disconnect, device number 1
> > usb 1-1: USB disconnect, device number 2
> > usb 1-1.1: USB disconnect, device number 3
> > hub 4-1:1.0: unable to enumerate USB device on port 3
> > usb 1-1.3: USB disconnect, device number 4
> > usb 1-1.6: USB disconnect, device number 5
> > usb 1-1.6.1: USB disconnect, device number 6
> > ehci_hcd :2c:00.0: USB bus 1 deregistered
> > hub 4-1:1.0: unable to enumerate USB device on port 3
> > general protection fault:  [#1] SMP 
> > last sysfs file:
> > /sys/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.2/input/input12/event12/uevent
> > CPU 0 
> > Modules linked in: autofs4 sunrpc configfs cachefiles fscache(T) bonding
> > 8021q garp stp llc vhost_net macvtap macvlan tun uinput ipmi_devintf
> > ftmod(P)(U) ipmi_msghandler sg matroxfb(U) fosil(U) ext4 mbcache jbd2
> > raid1 sr_mod cdrom sd_mod(U) crc_t10dif usb_storage mpt2sas(U)
> > scsi_hbas(U) scsi_transport_sas raid_class igb(U) dca dm_mirror
> > dm_region_hash dm_log dm_mod ipv6 cxgb4 cxgb3 mdio libiscsi_tcp libiscsi
> > scsi_transport_iscsi [last unloaded: scsi_wait_scan]
> > 
> > Pid: 32752, comm: cat Tainted: P   ---  T
> > 2.6.32-279.el6.bz849188.test01.x86_64 #1 Stratus ftServer 2700/G7LAY
> > RIP: 0010:[]  [] 
> > usb_device_dump+0x87/0xa70
> 
> It would help to know what source statement that memory address
> corresponds to.  Offhand I can't see any remaining races between
> usb_device_dump() and usb_disconnect().

I have added some in-depth analysis from our customer.


The problem is that the failing routine was called with a pointer to usbdev in
memory that has already been freed and overwritten with the pool poison pattern.
This causes an access violation on line 502:

470 /* This is a recursive function. Parameters:
471  * buffer - the user-space buffer to write data into
472  * nbytes - the maximum number of bytes to write
473  * skip_bytes - the number of bytes to skip before writing anything
474  * file_offset - the offset into the devices file on completion
475  * The caller must own the device lock.
476  */
477 static ssize_t usb_device_dump(char __user **buffer, size_t *nbytes,
478loff_t *skip_bytes, loff_t *file_offset,
479struct usb_device *usbdev, struct usb_bus 
*bus,
480int level, int index, int count)
481 {
482 int chix;
483 int ret, cnt = 0;
484 int parent_devnum = 0;
485 char *pages_start, *data_end, *speed;
486 unsigned int length;
487 ssize_t total_written = 0;
488 
489 /* don't bother with anything else if we're not writing any data */
490 if (*nbytes <= 0)
491 return 0;
492 
493 if (level > MAX_TOPO_LEVEL)
494 return 0;
495 /* allocate 2^1 pages = 8K (on i386);
496  * should be more than enough for one device */
497 pages_start = (char *)__get_free_pages(GFP_NOIO, 1);
498 if (!pages_start)
499 return -ENOMEM;
500 
501 if (usbdev->parent && usbdev->parent->devnum != -1)
502 parent_devnum = usbdev->parent->devnum;
503 /*
504  * So the root hub's parent is 0 and any device that is
505  * plugged into the root hub has a parent of 0.
506  */
507 switch (usbdev->speed) {

crash> hex
output radix: 16 (hex)
crash> sym 813b8167
813b8167 (t) usb_device_dump+0x87 
../debug/kernel-2.6.32-279.el6/linux

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-13 Thread Don Zickus

On Thu, Sep 13, 2012 at 04:31:34PM -0400, Alan Stern wrote:
> On Thu, 13 Sep 2012, Don Zickus wrote:
> 
> > Hi Alan,
> > 
> > I adapted your patch to our 2.6.32 tree and the customer tested it without
> > success.  The output panic is attached below.  I will work on getting a
> > machine with the latest kernel to reproduce this problem so I don't waste
> > your time chasing something that might be fixed upstream.
> > 
> > But if you could take a quick glance at the panic below to see if anything
> > comes to mind I would appreciate it.
> 
> > The test failed on the first surprise removal of PCI devices.  Last gasp
> > from the console is posted below.  I would guess the faulting process was
> > reading a file under /proc/bus/usb.  I should have a dump if more info is
> > needed.
> 
> > ehci_hcd :2c:00.0: HC died; cleaning up
> > ehci_hcd :2c:00.0: force halt; handhake c9654024 4000
> > 4000 -> -19
> > ehci_hcd :2c:00.0: HC died; cleaning up
> > ehci_hcd :2c:00.0: remove, state 0
> > usb usb1: USB disconnect, device number 1
> > usb 1-1: USB disconnect, device number 2
> > usb 1-1.1: USB disconnect, device number 3
> > hub 4-1:1.0: unable to enumerate USB device on port 3
> > usb 1-1.3: USB disconnect, device number 4
> > usb 1-1.6: USB disconnect, device number 5
> > usb 1-1.6.1: USB disconnect, device number 6
> > ehci_hcd :2c:00.0: USB bus 1 deregistered
> > hub 4-1:1.0: unable to enumerate USB device on port 3
> > general protection fault:  [#1] SMP 
> > last sysfs file:
> > /sys/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.2/input/input12/event12/uevent
> > CPU 0 
> > Modules linked in: autofs4 sunrpc configfs cachefiles fscache(T) bonding
> > 8021q garp stp llc vhost_net macvtap macvlan tun uinput ipmi_devintf
> > ftmod(P)(U) ipmi_msghandler sg matroxfb(U) fosil(U) ext4 mbcache jbd2
> > raid1 sr_mod cdrom sd_mod(U) crc_t10dif usb_storage mpt2sas(U)
> > scsi_hbas(U) scsi_transport_sas raid_class igb(U) dca dm_mirror
> > dm_region_hash dm_log dm_mod ipv6 cxgb4 cxgb3 mdio libiscsi_tcp libiscsi
> > scsi_transport_iscsi [last unloaded: scsi_wait_scan]
> > 
> > Pid: 32752, comm: cat Tainted: P   ---  T
> > 2.6.32-279.el6.bz849188.test01.x86_64 #1 Stratus ftServer 2700/G7LAY
> > RIP: 0010:[]  [] 
> > usb_device_dump+0x87/0xa70
> 
> It would help to know what source statement that memory address
> corresponds to.  Offhand I can't see any remaining races between
> usb_device_dump() and usb_disconnect().

I'll dig that up..

Thanks,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-13 Thread Don Zickus

On Tue, Sep 11, 2012 at 11:34:56AM -0400, Alan Stern wrote:
> > Deregistering the bus first to prevent future readers from accessing the 
> > device
> > before cleaning up the hcd, seemed like an appropriate way to go.  I'll 
> > leave
> > it to the experts to validate that or provide a better solution. :-)
> > 
> > I adapted their usb_remove_hcd fix for usb_add_hcd error path too to mimic
> > the same behaviour (a little code shuffling to handle the gotos cleanly).
> 
> Moving things around in usb_add_hcd() is not a good way to attack this
> problem.  Here's a better approach; please make sure that it does what
> you want.  The idea is to indicate that a root hub is not registered by
> setting its devnum field to 0.
> 
> Alan Stern

Hi Alan,

I adapted your patch to our 2.6.32 tree and the customer tested it without
success.  The output panic is attached below.  I will work on getting a
machine with the latest kernel to reproduce this problem so I don't waste
your time chasing something that might be fixed upstream.

But if you could take a quick glance at the panic below to see if anything
comes to mind I would appreciate it.

Thanks for any help.

Cheers,
Don

The test failed on the first surprise removal of PCI devices.  Last gasp
from the console is posted below.  I would guess the faulting process was
reading a file under /proc/bus/usb.  I should have a dump if more info is
needed.

usb 3-1.6: New USB device found, idVendor=0557, idProduct=8021
usb 3-1.6: New USB device strings: Mfr=0, Product=0, SerialNumber=0
usb 3-1.6: configuration #1 chosen from 1 choice
hub 3-1.6:1.0: USB hub found
hub 3-1.6:1.0: 4 ports detected
hub 4-1:1.0: unable to enumerate USB device on port 3
usb 3-1.6.1: new low speed USB device number 5 using ehci_hcd
usb 3-1.6.1: New USB device found, idVendor=0557, idProduct=2260
usb 3-1.6.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 3-1.6.1: Product: CS1308R1 V1.2.116
usb 3-1.6.1: Manufacturer: ATEN International Co. Ltd
usb 3-1.6.1: configuration #1 chosen from 1 choice
input: ATEN International Co. Ltd CS1308R1 V1.2.116 as
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.0/input/input10
generic-usb 0003:0557:2260.0009: input,hidraw8: USB HID v1.00 Keyboard
[ATEN International Co. Ltd CS1308R1 V1.2.116] on
usb-:66:00.0-1.6.1/input0
input: ATEN International Co. Ltd CS1308R1 V1.2.116 as
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.1/input/input11
generic-usb 0003:0557:2260.000A: input,hidraw9: USB HID v1.00 Device [ATEN
International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input1
input: ATEN International Co. Ltd CS1308R1 V1.2.116 as
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.2/input/input12
generic-usb 0003:0557:2260.000B: input,hidraw10: USB HID v1.00 Mouse [ATEN
International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input2
input: ATEN International Co. Ltd CS1308R1 V1.2.116 as
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:3d:00.0/:3e:01.0/:66:00.0/usb3/3-1/3-1.6/3-1.6.1/3-1.6.1:1.3/input/input13
generic-usb 0003:0557:2260.000C: input,hidraw11: USB HID v1.10 Mouse [ATEN
International Co. Ltd CS1308R1 V1.2.116] on usb-:66:00.0-1.6.1/input3
hub 4-1:1.0: unable to enumerate USB device on port 3
usb 3-1.3: new high speed USB device number 6 using ehci_hcd
sd 0:0:0:0: [sdq] Unhandled error code
sd 0:0:0:0: [sdq] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 0:0:0:0: [sdq] CDB: Write(10): 2a 00 02 08 08 08 00 00 01 00
md: super_written gets error=-5, uptodate=0
md/raid1:md2: Disk failure on sdq3, disabling device.
md/raid1:md2: Operation continuing on 1 devices.
usb 3-1.3: New USB device found, idVendor=152d, idProduct=2339
usb 3-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=5
usb 3-1.3: Product: USB to ATA/ATAPI Bridge
usb 3-1.3: Manufacturer: JMicron
usb 3-1.3: SerialNumber: MATSHITADVYN14  003100
usb 3-1.3: configuration #1 chosen from 1 choice
scsi3 : SCSI emulation for USB Mass Storage devices
hub 4-1:1.0: unable to enumerate USB device on port 3
hub 4-1:1.0: unable to enumerate USB device on port 3
hub 4-1:1.0: unable to enumerate USB device on port 3
hub 4-1:1.0: unable to enumerate USB device on port 3
scsi 3:0:0:0: CD-ROMMATSHITA DVD-RAM UJ8A0AS  1.20 PQ: 0 ANSI: 0
sr1: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
sr 3:0:0:0: Attached scsi generic sg3 type 5
hub 4-1:1.0: unable to enumerate USB device on port 3
bonding: bond0: Warning: the permanent HWaddr of eth100600 -
00:25:5c:a6:f2:60 - is still in use by bond0. Set the HWaddr of eth100600
to a different address to avoid conflicts.
bonding: bond0: releasing backup interface eth100600
hub 4-1:1.0: unable to enumerate USB device on por

Re: [PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-11 Thread Don Zickus

On Tue, Sep 11, 2012 at 11:34:56AM -0400, Alan Stern wrote:
> On Mon, 10 Sep 2012, Don Zickus wrote:
> 
> > A customer of ours noticed that after a bunch of hot removals of the EHCI
> > PCI device, a panic occurs. This happened on a 2.6.32 RHEL-6 kernel, but
> > I believe is still applicable upstream.
> > 
> > The panic was further simplified to enabling SLUB debug poisoning and 
> > running
> > the following command:
> > 
> > while true; do find /proc/bus/usb -type f -exec cat {} >/dev/null \; ; done 
> > &
> > 
> > This gets the machine to panic after 1 or 2 removals of the device.
> 
> > This is likely because the following thread was in the process of removing 
> > the
> > device:
> 
> > The fix that solved their problem was to deregister the usb bus first before
> > running usb_put_dev.  After running multiple tests the panic disappeared.
> > 
> > Deregistering the bus first to prevent future readers from accessing the 
> > device
> > before cleaning up the hcd, seemed like an appropriate way to go.  I'll 
> > leave
> > it to the experts to validate that or provide a better solution. :-)
> > 
> > I adapted their usb_remove_hcd fix for usb_add_hcd error path too to mimic
> > the same behaviour (a little code shuffling to handle the gotos cleanly).
> 
> Moving things around in usb_add_hcd() is not a good way to attack this
> problem.  Here's a better approach; please make sure that it does what
> you want.  The idea is to indicate that a root hub is not registered by
> setting its devnum field to 0.

Thanks for the feedback.  I will pass this along for testing and get back
to you.

Thanks,
Don

> 
> Alan Stern
> 
> 
> 
> Index: usb-3.6/drivers/usb/core/devices.c
> ===
> --- usb-3.6.orig/drivers/usb/core/devices.c
> +++ usb-3.6/drivers/usb/core/devices.c
> @@ -624,7 +624,7 @@ static ssize_t usb_device_read(struct fi
>   /* print devices for all busses */
>   list_for_each_entry(bus, &usb_bus_list, bus_list) {
>   /* recurse through all children of the root hub */
> - if (!bus->root_hub)
> + if (!bus->root_hub || bus->root_hub->devnum == 0)
>   continue;
>   usb_lock_device(bus->root_hub);
>   ret = usb_device_dump(&buf, &nbytes, &skip_bytes, ppos,
> Index: usb-3.6/drivers/usb/core/hcd.c
> ===
> --- usb-3.6.orig/drivers/usb/core/hcd.c
> +++ usb-3.6/drivers/usb/core/hcd.c
> @@ -980,6 +980,8 @@ static int register_root_hub(struct usb_
>   const int devnum = 1;
>   int retval;
>  
> + mutex_lock(&usb_bus_list_lock);
> +
>   usb_dev->devnum = devnum;
>   usb_dev->bus->devnum_next = devnum + 1;
>   memset (&usb_dev->bus->devmap.devicemap, 0,
> @@ -987,8 +989,6 @@ static int register_root_hub(struct usb_
>   set_bit (devnum, usb_dev->bus->devmap.devicemap);
>   usb_set_device_state(usb_dev, USB_STATE_ADDRESS);
>  
> - mutex_lock(&usb_bus_list_lock);
> -
>   usb_dev->ep0.desc.wMaxPacketSize = cpu_to_le16(64);
>   retval = usb_get_device_descriptor(usb_dev, USB_DT_DEVICE_SIZE);
>   if (retval != sizeof usb_dev->descriptor) {
> @@ -1011,6 +1011,7 @@ static int register_root_hub(struct usb_
>   if (retval) {
>   dev_err (parent_dev, "can't register root hub for %s, %d\n",
>   dev_name(&usb_dev->dev), retval);
> + usb_dev->devnum = 0;
>   }
>   mutex_unlock(&usb_bus_list_lock);
>  
> @@ -2527,6 +2528,7 @@ error_create_attr_group:
>  #endif
>   mutex_lock(&usb_bus_list_lock);
>   usb_disconnect(&rhdev); /* Sets rhdev to NULL */
> + hcd->self.root_hub->devnum = 0;
>   mutex_unlock(&usb_bus_list_lock);
>  err_register_root_hub:
>   hcd->rh_pollable = 0;
> @@ -2583,6 +2585,7 @@ void usb_remove_hcd(struct usb_hcd *hcd)
>  
>   mutex_lock(&usb_bus_list_lock);
>   usb_disconnect(&rhdev); /* Sets rhdev to NULL */
> + hcd->self.root_hub->devnum = 0;
>   mutex_unlock(&usb_bus_list_lock);
>  
>   /* Prevent any more root-hub status calls from the timer.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] usb/core: Fix race condition when removing EHCI PCI devices

2012-09-10 Thread Don Zickus

A customer of ours noticed that after a bunch of hot removals of the EHCI
PCI device, a panic occurs. This happened on a 2.6.32 RHEL-6 kernel, but
I believe is still applicable upstream.

The panic was further simplified to enabling SLUB debug poisoning and running
the following command:

while true; do find /proc/bus/usb -type f -exec cat {} >/dev/null \; ; done &

This gets the machine to panic after 1 or 2 removals of the device.

The backtrace looks like:

  crash> bt
  PID: 23674  TASK: 8801ab411540  CPU: 2   COMMAND: "sosreport"
   #0 [8801b9073980] machine_kexec at 8103281b
   #1 [8801b90739e0] crash_kexec at 810ba662
   #2 [8801b9073ab0] oops_end at 81501290
   #3 [8801b9073ae0] no_context at 81043bab
   #4 [8801b9073b30] __bad_area_nosemaphore at 81043e35
   #5 [8801b9073b80] bad_area at 81043f5e
   #6 [8801b9073bb0] __do_page_fault at 81044710
   #7 [8801b9073cd0] do_page_fault at 8150326e
   #8 [8801b9073d00] page_fault at 81500625
  [exception RIP: __list_add+0x26]
  RIP: 81282fb6  RSP: 8801b9073db8  RFLAGS: 00010046
  RAX: 0292  RBX: 8801b9073de8  RCX: 
  RDX: 8802782788f8  RSI:   RDI: 8801b9073de8
  RBP: 8801b9073dd8   R8:    R9: 0001
  R10: 880028401fc0  R11:   R12: 8802782788f8
  R13:   R14: 7fff  R15: 8801b9073e98
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [8801b9073de0] __down at 814ff58e
  #10 [8801b9073e30] down at 81097ef1
  #11 [8801b9073e60] usb_device_read at 813b78ce
  #12 [8801b9073ef0] vfs_read at 8117b8b5
  #13 [8801b9073f30] sys_read at 8117b9f1
  #14 [8801b9073f80] tracesys at 8100b308 (via system_call)
  RIP: 0030f90da360  RSP: 7fffef71dda8  RFLAGS: 0246
  RAX: ffda  RBX: 8100b308  RCX: 
  RDX: 1000  RSI: 7f1926568000  RDI: 0003
  RBP:    R8: 084023c4   R9: 7f1926530700
  R10: 2028  R11: 0246  R12: 08403b7e
  R13: 0846  R14: 083159c0  R15: 7fffef71de40
  ORIG_RAX:   CS: 0033  SS: 002b

This is likely because the following thread was in the process of removing the
device:

  PID: 3070   TASK: 8802735fd540  CPU: 1   COMMAND: "io_poll"
   #0 [8802788a97b0] schedule at 814fd7e2
   #1 [8802788a9878] __mutex_lock_slowpath at 814fee8e
   #2 [8802788a98e8] mutex_lock at 814fed2b
   #3 [8802788a9908] usb_deregister_bus at 813a59ff
   #4 [8802788a9938] usb_remove_hcd at 813a5b9d
   #5 [8802788a9978] usb_hcd_pci_remove at 813b80b3
   #6 [8802788a9998] pci_device_remove at 812930b7
   #7 [8802788a99b8] __device_release_driver at 8135006f
   #8 [8802788a99d8] device_release_driver at 813501dd
   #9 [8802788a99f8] bus_remove_device at 8134f073
  #10 [8802788a9a28] device_del at 8134ccbd
  #11 [8802788a9a58] device_unregister at 8134cd92
  #12 [8802788a9a78] pci_stop_bus_device at 8128c4cc
  #13 [8802788a9aa8] pci_stop_bus_device at 8128c47b
  #14 [8802788a9ad8] pci_stop_bus_device at 8128c47b
  #15 [8802788a9b08] pci_remove_bus_device at 8128c57a
  #16 [8802788a9b38] remove_io at a0361db8 [ftmod]
  #17 [8802788a9b98] ioboard_event at a0362130 [ftmod]
  #18 [8802788a9c18] io_state_change at a0365c9e [ftmod]
  #19 [8802788a9c38] OsIoStateChange at a0406638 [ftmod]
  #20 [8802788a9c58] IoStateChange at a0407a4b [ftmod]
  #21 [8802788a9db8] CcIoPoll at a040af64 [ftmod]
  #22 [8802788a9e98] io_poll at a0365b00 [ftmod]
  #23 [8802788a9f48] kernel_thread at 8100c14a

The fix that solved their problem was to deregister the usb bus first before
running usb_put_dev.  After running multiple tests the panic disappeared.

Deregistering the bus first to prevent future readers from accessing the device
before cleaning up the hcd, seemed like an appropriate way to go.  I'll leave
it to the experts to validate that or provide a better solution. :-)

I adapted their usb_remove_hcd fix for usb_add_hcd error path too to mimic
the same behaviour (a little code shuffling to handle the gotos cleanly).

Signed-off-by: Don Zickus 
---
 drivers/usb/core/hcd.c |   20 ++--
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index bc84106..d155bbf 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -2428,13 +2428,6 @@

Re: [PATCH] usb, ehci: Avoid deadlock of ehci->lock by disabling interrupts

2012-08-01 Thread Don Zickus

On Wed, Aug 01, 2012 at 10:42:37AM -0400, Alan Stern wrote:
> On Tue, 31 Jul 2012, Don Zickus wrote:
> 
> > We ran into an interesting deadlock on RHEL-5 (2.6.18) that I believe
> > still appiles to the current kernel involving the ehci->lock.
> > 
> > CPU A:
> > submits a bulk transfer urb
> > ehci_urb_enqueue calls submit_async
> > submit_async blocks on ehci->lock with irq disabled (the result
> > of spin_lock_irqsave) for CPU B
> > 
> > CPU B:
> > takes an ehci interrupt
> > locks ehci->lock
> > pre-empted by an IPI handler which spins waiting for CPU C
> > 
> > CPU C:
> > takes an MTRR request
> > sends an IPI to all cpus to block
> > spins waiting for all cpus to block
> > 
> > CPU A nevers processes IPI because its interrupts are disabled,
> > this creates the 3-way deadlock.
> > 
> > This deadlock is hard to reproduce by our customer, but based on their 
> > vmcore
> > it seems clear the above is what happened.  I attatched a suggested patch
> > from a colleague that would seem to resolve the problem.  Because it is
> > hard to reproduce, I have not been able to test it to verify it resolves
> > the problem.
> > 
> > The patch just turns spin_locks in the spin_lock_irqsaves in the ehci_irq
> > function.  This would essentially block the IPI handler and let the 
> > interrupt
> > handler finish before processing the IPI.  Then CPU A would get a chance to
> > finish and process its IPI.
> > 
> > Looking at the code paths in 2.6.18 and 3.5, the locking still seems the 
> > same
> > which is why I believe the problem still exists.  However, someone in the 
> > office
> > thought the MTRR code has been re-written, so the problem we are seeing 
> > might
> > be more difficult to see with the current kernel.
> > 
> > This patch does feel awkward, disabling interrupts in the irq handler.  It 
> > seems
> > like it would make more sense to remove the locking from the irq handler.  
> > But
> > that is probably more work and my knowledge of USB is limited.  I'll start 
> > with
> > this patch and see where the conversation goes.
> > 
> > Any feedback would be appreciated.
> 
> 2.6.18 is awfully old -- almost 6 years!

Heh.  Yes, that is world of RHEL and supporting kernels for 10 years. :-(
I only brought up the issue because I thought it was still relevant, but..

> 
> Anyway, commit de85422b94ddb23c021126815ea49414047c13dc (USB: fix
> interrupt disabling for HCDs with shared interrupt handlers) took care
> of this problem way back in 2.6.26.  You should be able to back-port
> the patch.

I see it was fixed at a higher level.  Sorry I missed that.  Thanks for
pointing out the commit.  Again sorry for the noise.  Thanks!

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] usb, ehci: Avoid deadlock of ehci->lock by disabling interrupts

2012-07-31 Thread Don Zickus

We ran into an interesting deadlock on RHEL-5 (2.6.18) that I believe
still appiles to the current kernel involving the ehci->lock.

CPU A:
submits a bulk transfer urb
ehci_urb_enqueue calls submit_async
submit_async blocks on ehci->lock with irq disabled (the result
of spin_lock_irqsave) for CPU B

CPU B:
takes an ehci interrupt
locks ehci->lock
pre-empted by an IPI handler which spins waiting for CPU C

CPU C:
takes an MTRR request
sends an IPI to all cpus to block
spins waiting for all cpus to block

CPU A nevers processes IPI because its interrupts are disabled,
this creates the 3-way deadlock.

This deadlock is hard to reproduce by our customer, but based on their vmcore
it seems clear the above is what happened.  I attatched a suggested patch
from a colleague that would seem to resolve the problem.  Because it is
hard to reproduce, I have not been able to test it to verify it resolves
the problem.

The patch just turns spin_locks in the spin_lock_irqsaves in the ehci_irq
function.  This would essentially block the IPI handler and let the interrupt
handler finish before processing the IPI.  Then CPU A would get a chance to
finish and process its IPI.

Looking at the code paths in 2.6.18 and 3.5, the locking still seems the same
which is why I believe the problem still exists.  However, someone in the office
thought the MTRR code has been re-written, so the problem we are seeing might
be more difficult to see with the current kernel.

This patch does feel awkward, disabling interrupts in the irq handler.  It seems
like it would make more sense to remove the locking from the irq handler.  But
that is probably more work and my knowledge of USB is limited.  I'll start with
this patch and see where the conversation goes.

Any feedback would be appreciated.

[dzickus - rewrote the description, ported to v3.5]

Signed-off-by: Casey Dahlin 
Signed-off-by: Don Zickus 
---
 drivers/usb/host/ehci-hcd.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 800be38..a31d9cc 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -849,8 +849,9 @@ static irqreturn_t ehci_irq (struct usb_hcd *hcd)
struct ehci_hcd *ehci = hcd_to_ehci (hcd);
u32 status, masked_status, pcd_status = 0, cmd;
int bh;
+   unsigned long   flags;
 
-   spin_lock (&ehci->lock);
+   spin_lock_irqsave(&ehci->lock, flags);
 
status = ehci_readl(ehci, &ehci->regs->status);
 
@@ -868,7 +869,7 @@ static irqreturn_t ehci_irq (struct usb_hcd *hcd)
 
/* Shared IRQ? */
if (!masked_status || unlikely(ehci->rh_state == EHCI_RH_HALTED)) {
-   spin_unlock(&ehci->lock);
+   spin_unlock_irqrestore(&ehci->lock, flags);
return IRQ_NONE;
}
 
@@ -969,7 +970,7 @@ dead:
 
if (bh)
ehci_work (ehci);
-   spin_unlock (&ehci->lock);
+   spin_unlock_irqrestore(&ehci->lock, flags);
if (pcd_status)
usb_hcd_poll_rh_status(hcd);
return IRQ_HANDLED;
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

39 matches

Mail list logo