Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-03-09 Thread Martin Pieuchot
On 26/02/15(Thu) 01:51, Martin Pieuchot wrote:
 On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
  On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
   In the following configuration I can pretty easily trigger endless
   screenfulls of scrolling ehci_idone: ex=%p is done! messages,
   where %p is to a constant pointer value (same in each message).
   
   [ehci host]--[usb extension cable (hub 1)]--[usb keyboard (hub 
   2)]--[usb mouse]
   
   So there's a USB extension cable (hub 1) that I plug a keyboard into.
   The keyboard in turn has a built-in hub that has a mouse plugged into it.
   The problem happens almost every time I pull the keyboard out of hub 1.
   
   I've also seen it happen without any intervention on my part (In fact
   I was in a different city and couldn't use the box until I got home
   several hours later and hit the reset switch... that's no fun)
   
   So if a transfer is cancelled (e.g. as a result of pulling the plug), 
   then:
   
   - usbd_abort_pipe wants to abort a related transfer
   - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
 the softintr routine to deal with the cancelled transfer:
   
 /*
  * Step 3: Make sure the soft interrupt routine has run. This
  * should remove any completed items off the queue.
  * The hardware has no reference to completed items (TDs).
  * It's safe to remove them at any time.
  */
 s = splusb();
 sc-sc_softwake = 1;
 usb_schedsoftintr(sc-sc_bus);
 tsleep(sc-sc_softwake, PZERO, ehciab, 0);
   
   - ehci_softintr gets scheduled
   - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
 invokes ehci_check_intr on each
   - ehci_check_intr eventually ends up calling ehci_idone
   - ehci_idone does nothing for cancelled transfers... ?!?
   
 if (xfer-status == USBD_CANCELLED ||
 xfer-status == USBD_TIMEOUT) {
 DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
 return;
 }
   
   - something else happens
   
   - ehci_abort_xfer awakes from tsleep and sets ex-isdone, since it
 expects the softinterrupt routine to have dealt with the xfer
   
   - something else happens
   
   - the host controller sends an INTERR interrupt
   - ehci_intr1 schedules ehci_softintr
   - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
 invokes ehci_check_intr on each
   - the cancelled xfer is still in the intrhead TAILQ and ends up in 
   ehci_idone
   - ehci_idone looks for the isdone flag which is now set, then it
 complains and does nothing
   
   - the host controller sends an INTERR interrupt
   ... same story again, we get an endless loop
   
   This diff breaks the chain of events and fixes the endless loop for me.
   I can't reproduce the problem anymore by pulling the keyboard out.
   I don't quite understand how this prevents the flood of INTERR interrupts
   but it seems to work.
   
   I assume there are nasty tentacles in USB land which I'm unfamiliar with.
   Is there any reason this could be a bad idea?
  
  Stefan that's a really good analysis.  I think the diff might not be
  completely correct though.
  
  So basically you're removing the transfer from the active list.  That's
  generally done after the USB callback has been executed, in your case in
  ehci_device_intr_done().
  
  For interrupt transfers (pipe-repeat is 1) the transfer is kept on the
  list while the descriptors are freed/reallocated.  That should be safe
  since we should be reusing the sames.  
  
  So I don't know if we are missing a spl protection of if there's an xfer 
  leak but I'm afraid that with your diff usb_transfer_complete() might
  not be called for the failing xfer.
  
  That's easy to check, look if the ehcixfer pool counter increase when
  you detach your device.
  
  I'm afraid I cannot help more as I am currently traveling :)
 
 Here's an alternative diff that removes xfers from the list of
 currently pending transfers as soon as they are finished or
 cancelled.  This should also prevent the race you analysed.
 
 With this diff ehci_idone() should no longer be called on a
 CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
 be removed.

I reverted the previous diff because it did not properly abort transfer
that were on the pipe's list but not on the interrupt list.

Diff below should take care of that and add an extra check for
suspend/resume.  I'm looking for test reports and oks.

I tried to keep the diff small, cleanups can be done afterward.

Index: ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.176
diff -u -p -r1.176 ehci.c
--- ehci.c  6 Mar 2015 22:53:03 -   1.176
+++ ehci.c  9 Mar 2015 16:03:56 -
@@ -206,11 +206,7 @@ void   ehci_dump_exfer(struct ehci_xfer *
 #define ehci_add_intr_list(sc, ex) \
TAILQ_INSERT_TAIL((sc)-sc_intrhead, (ex), inext);
 #define ehci_del_intr_list(sc, ex) \
-   

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Martin Pieuchot
On 27/02/15(Fri) 12:07, Stefan Sperling wrote:
 On Thu, Feb 26, 2015 at 10:45:01PM +0100, Stefan Sperling wrote:
  There are other issues which are not fixed by this diff and which
  my diff did not fix either.
 
  But these are bugs we can investigate another day. Happy travels!
 
 I suppose there might well be power supply issues when chaining too many
 passive hubs and devices off a single USB port on the host controller.
 I've now put an active USB hub (separate power source) at the end of
 my long-cable hub and so far things are looking good. I'll report back
 if my problems appear in this configuration, too.
 
 My setup changed from
 
 [ehci host]--[usb extension cable hub]--[usb keyboard (+hub)]--[usb mouse]
 
 to this:
  [power adapter]
|
V
 [ehci host]--[usb extension cable hub]--[active hub]--[usb keyboard (+hub)]
   ^
   |
   +--[usb mouse]

Well there's an obvious bug to fix in our stack regarding HUB power
sources, I don't know if that's the one you're hitting but we can try :)

When a device is attached the stack gets its power usage from the
configuration descriptor.  But the logic to check if the hub this device
is attached to has enough power budget is dumb and does not work well
with multiple devices below the same hub.

Instead of making a guess about how much power can be drained by a port
we should really keep track of the power available per hub. 

If you're curious grep for 'powersrc' in sys/dev/usb.

I have this on my TODO list because SuperSpeed devices do things
slightly differently.  I'll try to cook a diff, but it will be for after
5.7.



Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Martin Pieuchot
On 26/02/15(Thu) 22:45, Stefan Sperling wrote:
 On Thu, Feb 26, 2015 at 01:51:29AM +0100, Martin Pieuchot wrote:
  On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
   On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
In the following configuration I can pretty easily trigger endless
screenfulls of scrolling ehci_idone: ex=%p is done! messages,
where %p is to a constant pointer value (same in each message).

[ehci host]--[usb extension cable (hub 1)]--[usb keyboard (hub 
2)]--[usb mouse]

So there's a USB extension cable (hub 1) that I plug a keyboard into.
The keyboard in turn has a built-in hub that has a mouse plugged into 
it.
The problem happens almost every time I pull the keyboard out of hub 1.

I've also seen it happen without any intervention on my part (In fact
I was in a different city and couldn't use the box until I got home
several hours later and hit the reset switch... that's no fun)

So if a transfer is cancelled (e.g. as a result of pulling the plug), 
then:

- usbd_abort_pipe wants to abort a related transfer
- ehci_abort_xfer schedules and waits for ehci_softintr, expecting
  the softintr routine to deal with the cancelled transfer:

/*
 * Step 3: Make sure the soft interrupt routine has run. This
 * should remove any completed items off the queue.
 * The hardware has no reference to completed items (TDs).
 * It's safe to remove them at any time.
 */
s = splusb();
sc-sc_softwake = 1;
usb_schedsoftintr(sc-sc_bus);
tsleep(sc-sc_softwake, PZERO, ehciab, 0);

- ehci_softintr gets scheduled
- ehci_softintr loops over xfers on the sc_intrhead TAILQ and
  invokes ehci_check_intr on each
- ehci_check_intr eventually ends up calling ehci_idone
- ehci_idone does nothing for cancelled transfers... ?!?

if (xfer-status == USBD_CANCELLED ||
xfer-status == USBD_TIMEOUT) {
DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
return;
}

- something else happens

- ehci_abort_xfer awakes from tsleep and sets ex-isdone, since it
  expects the softinterrupt routine to have dealt with the xfer

- something else happens

- the host controller sends an INTERR interrupt
- ehci_intr1 schedules ehci_softintr
- ehci_softintr loops over xfers on the sc_intrhead TAILQ and
  invokes ehci_check_intr on each
- the cancelled xfer is still in the intrhead TAILQ and ends up in 
ehci_idone
- ehci_idone looks for the isdone flag which is now set, then it
  complains and does nothing

- the host controller sends an INTERR interrupt
... same story again, we get an endless loop

This diff breaks the chain of events and fixes the endless loop for me.
I can't reproduce the problem anymore by pulling the keyboard out.
I don't quite understand how this prevents the flood of INTERR 
interrupts
but it seems to work.

I assume there are nasty tentacles in USB land which I'm unfamiliar 
with.
Is there any reason this could be a bad idea?
   
   Stefan that's a really good analysis.  I think the diff might not be
   completely correct though.
   
   So basically you're removing the transfer from the active list.  That's
   generally done after the USB callback has been executed, in your case in
   ehci_device_intr_done().
   
   For interrupt transfers (pipe-repeat is 1) the transfer is kept on the
   list while the descriptors are freed/reallocated.  That should be safe
   since we should be reusing the sames.  
   
   So I don't know if we are missing a spl protection of if there's an xfer 
   leak but I'm afraid that with your diff usb_transfer_complete() might
   not be called for the failing xfer.
   
   That's easy to check, look if the ehcixfer pool counter increase when
   you detach your device.
   
   I'm afraid I cannot help more as I am currently traveling :)
  
  Here's an alternative diff that removes xfers from the list of
  currently pending transfers as soon as they are finished or
  cancelled.  This should also prevent the race you analysed.
  
  With this diff ehci_idone() should no longer be called on a
  CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
  be removed.
  
  Does it work for you?
 
 It does, yes. Thanks.
 Your diff also stops the vomit of ehci_idone messages for me when
 I unplug devices or hubs.

Does that mean you're ok with the diff?

 There are other issues which are not fixed by this diff and which
 my diff did not fix either.
 
 Sometimes my USB keyboard seems to go insane and keeps pressing a key
 repeatedly even though I'm not pressing it. For example, when I typed
   make install
 today the keyboard kept repeating the letter l so the line looked
 like this:
   

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Stefan Sperling
On Fri, Feb 27, 2015 at 06:15:32PM +0100, Martin Pieuchot wrote:
 On 26/02/15(Thu) 22:45, Stefan Sperling wrote:
  Your diff also stops the vomit of ehci_idone messages for me when
  I unplug devices or hubs.
 
 Does that mean you're ok with the diff?

Yes.



Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Stefan Sperling
On Thu, Feb 26, 2015 at 10:45:01PM +0100, Stefan Sperling wrote:
 There are other issues which are not fixed by this diff and which
 my diff did not fix either.

 But these are bugs we can investigate another day. Happy travels!

I suppose there might well be power supply issues when chaining too many
passive hubs and devices off a single USB port on the host controller.
I've now put an active USB hub (separate power source) at the end of
my long-cable hub and so far things are looking good. I'll report back
if my problems appear in this configuration, too.

My setup changed from

[ehci host]--[usb extension cable hub]--[usb keyboard (+hub)]--[usb mouse]

to this:
 [power adapter]
   |
   V
[ehci host]--[usb extension cable hub]--[active hub]--[usb keyboard (+hub)]
  ^
  |
  +--[usb mouse]



Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-26 Thread Stefan Sperling
On Thu, Feb 26, 2015 at 01:51:29AM +0100, Martin Pieuchot wrote:
 On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
  On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
   In the following configuration I can pretty easily trigger endless
   screenfulls of scrolling ehci_idone: ex=%p is done! messages,
   where %p is to a constant pointer value (same in each message).
   
   [ehci host]--[usb extension cable (hub 1)]--[usb keyboard (hub 
   2)]--[usb mouse]
   
   So there's a USB extension cable (hub 1) that I plug a keyboard into.
   The keyboard in turn has a built-in hub that has a mouse plugged into it.
   The problem happens almost every time I pull the keyboard out of hub 1.
   
   I've also seen it happen without any intervention on my part (In fact
   I was in a different city and couldn't use the box until I got home
   several hours later and hit the reset switch... that's no fun)
   
   So if a transfer is cancelled (e.g. as a result of pulling the plug), 
   then:
   
   - usbd_abort_pipe wants to abort a related transfer
   - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
 the softintr routine to deal with the cancelled transfer:
   
 /*
  * Step 3: Make sure the soft interrupt routine has run. This
  * should remove any completed items off the queue.
  * The hardware has no reference to completed items (TDs).
  * It's safe to remove them at any time.
  */
 s = splusb();
 sc-sc_softwake = 1;
 usb_schedsoftintr(sc-sc_bus);
 tsleep(sc-sc_softwake, PZERO, ehciab, 0);
   
   - ehci_softintr gets scheduled
   - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
 invokes ehci_check_intr on each
   - ehci_check_intr eventually ends up calling ehci_idone
   - ehci_idone does nothing for cancelled transfers... ?!?
   
 if (xfer-status == USBD_CANCELLED ||
 xfer-status == USBD_TIMEOUT) {
 DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
 return;
 }
   
   - something else happens
   
   - ehci_abort_xfer awakes from tsleep and sets ex-isdone, since it
 expects the softinterrupt routine to have dealt with the xfer
   
   - something else happens
   
   - the host controller sends an INTERR interrupt
   - ehci_intr1 schedules ehci_softintr
   - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
 invokes ehci_check_intr on each
   - the cancelled xfer is still in the intrhead TAILQ and ends up in 
   ehci_idone
   - ehci_idone looks for the isdone flag which is now set, then it
 complains and does nothing
   
   - the host controller sends an INTERR interrupt
   ... same story again, we get an endless loop
   
   This diff breaks the chain of events and fixes the endless loop for me.
   I can't reproduce the problem anymore by pulling the keyboard out.
   I don't quite understand how this prevents the flood of INTERR interrupts
   but it seems to work.
   
   I assume there are nasty tentacles in USB land which I'm unfamiliar with.
   Is there any reason this could be a bad idea?
  
  Stefan that's a really good analysis.  I think the diff might not be
  completely correct though.
  
  So basically you're removing the transfer from the active list.  That's
  generally done after the USB callback has been executed, in your case in
  ehci_device_intr_done().
  
  For interrupt transfers (pipe-repeat is 1) the transfer is kept on the
  list while the descriptors are freed/reallocated.  That should be safe
  since we should be reusing the sames.  
  
  So I don't know if we are missing a spl protection of if there's an xfer 
  leak but I'm afraid that with your diff usb_transfer_complete() might
  not be called for the failing xfer.
  
  That's easy to check, look if the ehcixfer pool counter increase when
  you detach your device.
  
  I'm afraid I cannot help more as I am currently traveling :)
 
 Here's an alternative diff that removes xfers from the list of
 currently pending transfers as soon as they are finished or
 cancelled.  This should also prevent the race you analysed.
 
 With this diff ehci_idone() should no longer be called on a
 CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
 be removed.
 
 Does it work for you?

It does, yes. Thanks.
Your diff also stops the vomit of ehci_idone messages for me when
I unplug devices or hubs.

There are other issues which are not fixed by this diff and which
my diff did not fix either.

Sometimes my USB keyboard seems to go insane and keeps pressing a key
repeatedly even though I'm not pressing it. For example, when I typed
  make install
today the keyboard kept repeating the letter l so the line looked
like this:
  make 
install

I've seen this problem with 2 different USB keyboards.
Re-plugging the keyboard or the hub it is connected to now restores normal
behaviour -- before this discussion unplugging usually 

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-25 Thread Martin Pieuchot
On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
 On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
  In the following configuration I can pretty easily trigger endless
  screenfulls of scrolling ehci_idone: ex=%p is done! messages,
  where %p is to a constant pointer value (same in each message).
  
  [ehci host]--[usb extension cable (hub 1)]--[usb keyboard (hub 2)]--[usb 
  mouse]
  
  So there's a USB extension cable (hub 1) that I plug a keyboard into.
  The keyboard in turn has a built-in hub that has a mouse plugged into it.
  The problem happens almost every time I pull the keyboard out of hub 1.
  
  I've also seen it happen without any intervention on my part (In fact
  I was in a different city and couldn't use the box until I got home
  several hours later and hit the reset switch... that's no fun)
  
  So if a transfer is cancelled (e.g. as a result of pulling the plug), then:
  
  - usbd_abort_pipe wants to abort a related transfer
  - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
the softintr routine to deal with the cancelled transfer:
  
  /*
   * Step 3: Make sure the soft interrupt routine has run. This
   * should remove any completed items off the queue.
   * The hardware has no reference to completed items (TDs).
   * It's safe to remove them at any time.
   */
  s = splusb();
  sc-sc_softwake = 1;
  usb_schedsoftintr(sc-sc_bus);
  tsleep(sc-sc_softwake, PZERO, ehciab, 0);
  
  - ehci_softintr gets scheduled
  - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
invokes ehci_check_intr on each
  - ehci_check_intr eventually ends up calling ehci_idone
  - ehci_idone does nothing for cancelled transfers... ?!?
  
  if (xfer-status == USBD_CANCELLED ||
  xfer-status == USBD_TIMEOUT) {
  DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
  return;
  }
  
  - something else happens
  
  - ehci_abort_xfer awakes from tsleep and sets ex-isdone, since it
expects the softinterrupt routine to have dealt with the xfer
  
  - something else happens
  
  - the host controller sends an INTERR interrupt
  - ehci_intr1 schedules ehci_softintr
  - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
invokes ehci_check_intr on each
  - the cancelled xfer is still in the intrhead TAILQ and ends up in 
  ehci_idone
  - ehci_idone looks for the isdone flag which is now set, then it
complains and does nothing
  
  - the host controller sends an INTERR interrupt
  ... same story again, we get an endless loop
  
  This diff breaks the chain of events and fixes the endless loop for me.
  I can't reproduce the problem anymore by pulling the keyboard out.
  I don't quite understand how this prevents the flood of INTERR interrupts
  but it seems to work.
  
  I assume there are nasty tentacles in USB land which I'm unfamiliar with.
  Is there any reason this could be a bad idea?
 
 Stefan that's a really good analysis.  I think the diff might not be
 completely correct though.
 
 So basically you're removing the transfer from the active list.  That's
 generally done after the USB callback has been executed, in your case in
 ehci_device_intr_done().
 
 For interrupt transfers (pipe-repeat is 1) the transfer is kept on the
 list while the descriptors are freed/reallocated.  That should be safe
 since we should be reusing the sames.  
 
 So I don't know if we are missing a spl protection of if there's an xfer 
 leak but I'm afraid that with your diff usb_transfer_complete() might
 not be called for the failing xfer.
 
 That's easy to check, look if the ehcixfer pool counter increase when
 you detach your device.
 
 I'm afraid I cannot help more as I am currently traveling :)

Here's an alternative diff that removes xfers from the list of
currently pending transfers as soon as they are finished or
cancelled.  This should also prevent the race you analysed.

With this diff ehci_idone() should no longer be called on a
CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
be removed.

Does it work for you?

Index: ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.174
diff -u -p -r1.174 ehci.c
--- ehci.c  9 Feb 2015 22:14:43 -   1.174
+++ ehci.c  25 Feb 2015 06:04:49 -
@@ -206,11 +206,7 @@ void   ehci_dump_exfer(struct ehci_xfer *
 #define ehci_add_intr_list(sc, ex) \
TAILQ_INSERT_TAIL((sc)-sc_intrhead, (ex), inext);
 #define ehci_del_intr_list(sc, ex) \
-   do { \
-   TAILQ_REMOVE(sc-sc_intrhead, (ex), inext); \
-   (ex)-inext.tqe_prev = NULL; \
-   } while (0)
-#define ehci_active_intr_list(ex) ((ex)-inext.tqe_prev != NULL)
+   TAILQ_REMOVE((sc)-sc_intrhead, (ex), inext);
 
 struct usbd_bus_methods ehci_bus_methods = {
.open_pipe = ehci_open,
@@ -753,6 +749,7 @@ ehci_check_qh_intr(struct ehci_softc *sc
}
  

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-21 Thread Martin Pieuchot
On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
 In the following configuration I can pretty easily trigger endless
 screenfulls of scrolling ehci_idone: ex=%p is done! messages,
 where %p is to a constant pointer value (same in each message).
 
 [ehci host]--[usb extension cable (hub 1)]--[usb keyboard (hub 2)]--[usb 
 mouse]
 
 So there's a USB extension cable (hub 1) that I plug a keyboard into.
 The keyboard in turn has a built-in hub that has a mouse plugged into it.
 The problem happens almost every time I pull the keyboard out of hub 1.
 
 I've also seen it happen without any intervention on my part (In fact
 I was in a different city and couldn't use the box until I got home
 several hours later and hit the reset switch... that's no fun)
 
 So if a transfer is cancelled (e.g. as a result of pulling the plug), then:
 
 - usbd_abort_pipe wants to abort a related transfer
 - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
   the softintr routine to deal with the cancelled transfer:
 
   /*
* Step 3: Make sure the soft interrupt routine has run. This
* should remove any completed items off the queue.
* The hardware has no reference to completed items (TDs).
* It's safe to remove them at any time.
*/
   s = splusb();
   sc-sc_softwake = 1;
   usb_schedsoftintr(sc-sc_bus);
   tsleep(sc-sc_softwake, PZERO, ehciab, 0);
 
 - ehci_softintr gets scheduled
 - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
   invokes ehci_check_intr on each
 - ehci_check_intr eventually ends up calling ehci_idone
 - ehci_idone does nothing for cancelled transfers... ?!?
 
   if (xfer-status == USBD_CANCELLED ||
   xfer-status == USBD_TIMEOUT) {
   DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
   return;
   }
 
 - something else happens
 
 - ehci_abort_xfer awakes from tsleep and sets ex-isdone, since it
   expects the softinterrupt routine to have dealt with the xfer
 
 - something else happens
 
 - the host controller sends an INTERR interrupt
 - ehci_intr1 schedules ehci_softintr
 - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
   invokes ehci_check_intr on each
 - the cancelled xfer is still in the intrhead TAILQ and ends up in ehci_idone
 - ehci_idone looks for the isdone flag which is now set, then it
   complains and does nothing
 
 - the host controller sends an INTERR interrupt
 ... same story again, we get an endless loop
 
 This diff breaks the chain of events and fixes the endless loop for me.
 I can't reproduce the problem anymore by pulling the keyboard out.
 I don't quite understand how this prevents the flood of INTERR interrupts
 but it seems to work.
 
 I assume there are nasty tentacles in USB land which I'm unfamiliar with.
 Is there any reason this could be a bad idea?

Stefan that's a really good analysis.  I think the diff might not be
completely correct though.

So basically you're removing the transfer from the active list.  That's
generally done after the USB callback has been executed, in your case in
ehci_device_intr_done().

For interrupt transfers (pipe-repeat is 1) the transfer is kept on the
list while the descriptors are freed/reallocated.  That should be safe
since we should be reusing the sames.  

So I don't know if we are missing a spl protection of if there's an xfer 
leak but I'm afraid that with your diff usb_transfer_complete() might
not be called for the failing xfer.

That's easy to check, look if the ehcixfer pool counter increase when
you detach your device.

I'm afraid I cannot help more as I am currently traveling :)

 
 Index: ehci.c
 ===
 RCS file: /cvs/src/sys/dev/usb/ehci.c,v
 retrieving revision 1.174
 diff -u -p -r1.174 ehci.c
 --- ehci.c9 Feb 2015 22:14:43 -   1.174
 +++ ehci.c20 Feb 2015 21:32:40 -
 @@ -811,6 +811,7 @@ done:
  void
  ehci_idone(struct usbd_xfer *xfer)
  {
 + struct ehci_softc *sc = (struct ehci_softc *)xfer-device-bus;
   struct ehci_xfer *ex = (struct ehci_xfer *)xfer;
  #ifdef EHCI_DEBUG
   struct ehci_pipe *epipe = (struct ehci_pipe *)xfer-pipe;
 @@ -839,6 +840,8 @@ ehci_idone(struct usbd_xfer *xfer)
  #endif
   if (xfer-status == USBD_CANCELLED ||
   xfer-status == USBD_TIMEOUT) {
 + if (ehci_active_intr_list(ex))
 + ehci_del_intr_list(sc, ex);
   DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
   return;
   }
 



fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-20 Thread Stefan Sperling
In the following configuration I can pretty easily trigger endless
screenfulls of scrolling ehci_idone: ex=%p is done! messages,
where %p is to a constant pointer value (same in each message).

[ehci host]--[usb extension cable (hub 1)]--[usb keyboard (hub 2)]--[usb 
mouse]

So there's a USB extension cable (hub 1) that I plug a keyboard into.
The keyboard in turn has a built-in hub that has a mouse plugged into it.
The problem happens almost every time I pull the keyboard out of hub 1.

I've also seen it happen without any intervention on my part (In fact
I was in a different city and couldn't use the box until I got home
several hours later and hit the reset switch... that's no fun)

So if a transfer is cancelled (e.g. as a result of pulling the plug), then:

- usbd_abort_pipe wants to abort a related transfer
- ehci_abort_xfer schedules and waits for ehci_softintr, expecting
  the softintr routine to deal with the cancelled transfer:

/*
 * Step 3: Make sure the soft interrupt routine has run. This
 * should remove any completed items off the queue.
 * The hardware has no reference to completed items (TDs).
 * It's safe to remove them at any time.
 */
s = splusb();
sc-sc_softwake = 1;
usb_schedsoftintr(sc-sc_bus);
tsleep(sc-sc_softwake, PZERO, ehciab, 0);

- ehci_softintr gets scheduled
- ehci_softintr loops over xfers on the sc_intrhead TAILQ and
  invokes ehci_check_intr on each
- ehci_check_intr eventually ends up calling ehci_idone
- ehci_idone does nothing for cancelled transfers... ?!?

if (xfer-status == USBD_CANCELLED ||
xfer-status == USBD_TIMEOUT) {
DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
return;
}

- something else happens

- ehci_abort_xfer awakes from tsleep and sets ex-isdone, since it
  expects the softinterrupt routine to have dealt with the xfer

- something else happens

- the host controller sends an INTERR interrupt
- ehci_intr1 schedules ehci_softintr
- ehci_softintr loops over xfers on the sc_intrhead TAILQ and
  invokes ehci_check_intr on each
- the cancelled xfer is still in the intrhead TAILQ and ends up in ehci_idone
- ehci_idone looks for the isdone flag which is now set, then it
  complains and does nothing

- the host controller sends an INTERR interrupt
... same story again, we get an endless loop

This diff breaks the chain of events and fixes the endless loop for me.
I can't reproduce the problem anymore by pulling the keyboard out.
I don't quite understand how this prevents the flood of INTERR interrupts
but it seems to work.

I assume there are nasty tentacles in USB land which I'm unfamiliar with.
Is there any reason this could be a bad idea?

Index: ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.174
diff -u -p -r1.174 ehci.c
--- ehci.c  9 Feb 2015 22:14:43 -   1.174
+++ ehci.c  20 Feb 2015 21:32:40 -
@@ -811,6 +811,7 @@ done:
 void
 ehci_idone(struct usbd_xfer *xfer)
 {
+   struct ehci_softc *sc = (struct ehci_softc *)xfer-device-bus;
struct ehci_xfer *ex = (struct ehci_xfer *)xfer;
 #ifdef EHCI_DEBUG
struct ehci_pipe *epipe = (struct ehci_pipe *)xfer-pipe;
@@ -839,6 +840,8 @@ ehci_idone(struct usbd_xfer *xfer)
 #endif
if (xfer-status == USBD_CANCELLED ||
xfer-status == USBD_TIMEOUT) {
+   if (ehci_active_intr_list(ex))
+   ehci_del_intr_list(sc, ex);
DPRINTF((ehci_idone: aborted xfer=%p\n, xfer));
return;
}