Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-17 Thread Alan Stern
On Wed, 17 Sep 2014, Joe Lawrence wrote:

> On Tue, 16 Sep 2014 15:29:20 -0400
> Alan Stern  wrote:
> 
> > ... And now I see the problem.  It's these two lines just before the 
> > "switch":
> > 
> > if (ehci->rh_state < EHCI_RH_RUNNING)
> > qh->qh_state = QH_STATE_IDLE;
> > 
> > That undoubtedly caused us to destroy the QH directly without unlinking 
> > it first.
> > 
> > I'm pretty sure those two lines aren't needed any more.  Try removing 
> > them and see if the problem persists.
> 
> Much better!  Two machines running the same set of device removal tests
> are still running (kicked off ~5pm Tues).  Typically the qh_destroy
> BUG would be hit within a few hours of testing.
> 
> (I left the "BUG_ON(ehci->async->qh_next.qh == qh)" in
> ehci_endpoint_disable as an assertion.  Obviously no hits on that
> either.)

Very good.  I reviewed the rest of the code to make sure that removing 
those lines won't cause any other problems.  It looks safe, so I have 
submitted a patch.

> Thanks for the debug!

You're welcome.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-17 Thread Joe Lawrence
On Tue, 16 Sep 2014 15:29:20 -0400
Alan Stern  wrote:

> ... And now I see the problem.  It's these two lines just before the 
> "switch":
> 
>   if (ehci->rh_state < EHCI_RH_RUNNING)
>   qh->qh_state = QH_STATE_IDLE;
> 
> That undoubtedly caused us to destroy the QH directly without unlinking 
> it first.
> 
> I'm pretty sure those two lines aren't needed any more.  Try removing 
> them and see if the problem persists.

Much better!  Two machines running the same set of device removal tests
are still running (kicked off ~5pm Tues).  Typically the qh_destroy
BUG would be hit within a few hours of testing.

(I left the "BUG_ON(ehci->async->qh_next.qh == qh)" in
ehci_endpoint_disable as an assertion.  Obviously no hits on that
either.)

Thanks for the debug!

-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-16 Thread Alan Stern
On Tue, 16 Sep 2014, Joe Lawrence wrote:

> Quick turn around on this one :)
> 
> crash> log | grep :2c:00.0
> 
> pci :2c:00.0: [8086:1d26] type 00 class 0x0c0320
> pci :2c:00.0: reg 0x10: [mem 0x9000-0x93ff]
> pci :2c:00.0: PME# supported from D0 D3hot D3cold
> ehci-pci :2c:00.0: EHCI Host Controller
> ehci-pci :2c:00.0: new USB bus registered, assigned bus number 1
> ehci-pci :2c:00.0: debug port 2
> ehci-pci :2c:00.0: cache line size of 64 is not supported
> ehci-pci :2c:00.0: irq 10, io mem 0x9000
> ehci-pci :2c:00.0: USB 2.0 started, EHCI 1.00
> usb usb1: SerialNumber: :2c:00.0
> ehci-pci :2c:00.0: qh_link_async:1003 ehci = 88084ff1b088, 
> head->qh_next.qh =   (null), qh = 88083eacae50
> ehci-pci :2c:00.0: HC died; cleaning up
> ehci-pci :2c:00.0: ehci_endpoint_disable:944 ep->hcpriv = 
> 88083eacae50, ehci(88084ff1b088)->num_async = 1, 
> ehci->async->qh_next.qh = 88083eacae50
> 
> [ cut here ]
> kernel BUG at drivers/usb/host/ehci-hcd.c:1002!

Aha!  So we triggered the BUG_ON you added to ehci_endpoint_disable.

> invalid opcode:  [#1] SMP 
> ...
> CPU: 0 PID: 207 Comm: khubd Tainted: PF  O--   
> 3.10.0-123.6.3.el7.bigphysarea_expedient1_usb4.x86_64 #1
> Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 
> 12/25/2013
> task: 880853ff8cb0 ti: 880853434000 task.ti: 880853434000
> RIP: 0010:[]  [] 
> ehci_endpoint_disable+0x1f6/0x200

This means we have to check what's going on inside
ehci_endpoint_disable.  We know qh wasn't NULL and we know that qh->hw
wasn't NULL (otherwise we would have jumped over the BUG_ON).

In particular, which case of the "switch" statement did we hit?

... And now I see the problem.  It's these two lines just before the 
"switch":

if (ehci->rh_state < EHCI_RH_RUNNING)
qh->qh_state = QH_STATE_IDLE;

That undoubtedly caused us to destroy the QH directly without unlinking 
it first.

I'm pretty sure those two lines aren't needed any more.  Try removing 
them and see if the problem persists.

> Just to review, I pasted the USB related patches to this kernel below.  The
> drivers/usb/core/hub.c changes include:
> 
>   d8521af "usb: assign default peer ports for root hubs" (maxchild parts)
>   c605f3c "usb: hub: take hub->hdev reference when processing from eventlist"
> 
> the rest are debugging for the qh_destroy BUG.

Yes, the patches look fine.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-16 Thread Joe Lawrence
On Tue, 16 Sep 2014 11:56:14 -0400
Alan Stern  wrote:

> On Tue, 16 Sep 2014, Joe Lawrence wrote:
> 
> > > Anyway, the log above means that a QH was linked before the HC died,
> > > but then it was never unlinked.  Please add a line at the start of
> > > ehci_endpoint_disable, printing out the value of ep->hcpriv for
> > > comparison with the values listed above.
> > 
> > Will do.
> > 
> > > Also, let's see the contents of *qh and *(qh->hw) for the left-over QH
> > > at *(ehci->async->qh_next.qh).  You can get those right away from your 
> > > crash dump data, right?
> > 
> > Yep, interestingly enough, *(ehci->async->qh_next.qh) appears to be
> > mostly full of zeros:
> 
> > crash> struct ehci_qh 0x88083d6b4558 << 
> > *(ehci->async->qh_next.qh)
> > struct ehci_qh {
> >   hw = 0x0,
> >   qh_dma = 0x0, 
> >   qh_next = {
> > qh = 0x0, 
> > itd = 0x0, 
> > sitd = 0x0, 
> > fstn = 0x0, 
> > hw_next = 0x0, 
> > ptr = 0x0
> >   }, 
> >   qtd_list = {
> > next = 0x0, 
> > prev = 0x0
> >   }, 
> >   intr_node = {
> > next = 0x0, 
> > prev = 0x0
> >   }, 
> >   dummy = 0x0, 
> >   unlink_node = {
> > next = 0x0, 
> > prev = 0x0
> >   }, 
> >   unlink_cycle = 0x6, 
> >   qh_state = 0x0, 
> >   xacterrs = 0x0, 
> >   usecs = 0x0, 
> >   gap_uf = 0x0, 
> >   c_usecs = 0x0, 
> >   tt_usecs = 0x0, 
> >   period = 0x0, 
> >   start = 0x0, 
> >   dev = 0x6, 
> >   is_out = 0x0, 
> >   clearing_tt = 0x0, 
> >   dequeue_during_giveback = 0x0, 
> >   exception = 0x0
> > }
> 
> So this QH was destroyed properly, but somehow the link pointer in
> ehci->async continued to point at it.  And it got destroyed without
> passing through single_unlink_async.
> 
> Maybe you should also add a check at the end of ehci_endpoint_disable, 
> just before the done: label.  For example,
> 
>   BUG_ON(ehci->async->qh_next.qh == qh);
> 
> Then if you really want to be thorough, in single_unlink_async, just
> after the "while" loop, print the values of prev, prev->qh_next.qh, qh,
> and qh->qh_next.qh.  And in qh_link_async, just after the wmb(), print
> the values of head->qh_next.qh and qh.  (This could end up generating
> quite a lot of output...)
> 
> Alan Stern
> 

Quick turn around on this one :)

crash> log | grep :2c:00.0

pci :2c:00.0: [8086:1d26] type 00 class 0x0c0320
pci :2c:00.0: reg 0x10: [mem 0x9000-0x93ff]
pci :2c:00.0: PME# supported from D0 D3hot D3cold
ehci-pci :2c:00.0: EHCI Host Controller
ehci-pci :2c:00.0: new USB bus registered, assigned bus number 1
ehci-pci :2c:00.0: debug port 2
ehci-pci :2c:00.0: cache line size of 64 is not supported
ehci-pci :2c:00.0: irq 10, io mem 0x9000
ehci-pci :2c:00.0: USB 2.0 started, EHCI 1.00
usb usb1: SerialNumber: :2c:00.0
ehci-pci :2c:00.0: qh_link_async:1003 ehci = 88084ff1b088, 
head->qh_next.qh =   (null), qh = 88083eacae50
ehci-pci :2c:00.0: HC died; cleaning up
ehci-pci :2c:00.0: ehci_endpoint_disable:944 ep->hcpriv = 88083eacae50, 
ehci(88084ff1b088)->num_async = 1, ehci->async->qh_next.qh = 
88083eacae50

[ cut here ]
kernel BUG at drivers/usb/host/ehci-hcd.c:1002!
invalid opcode:  [#1] SMP 
...
CPU: 0 PID: 207 Comm: khubd Tainted: PF  O--   
3.10.0-123.6.3.el7.bigphysarea_expedient1_usb4.x86_64 #1
Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
task: 880853ff8cb0 ti: 880853434000 task.ti: 880853434000
RIP: 0010:[]  [] 
ehci_endpoint_disable+0x1f6/0x200
RSP: 0018:880853435ca0  EFLAGS: 00010046
RAX: 8800372cc720 RBX: 88083eacae50 RCX: dead00200200
RDX: 88085f800710 RSI: 0047 RDI: 0047
RBP: 880853435ce0 R08: 88085f800710 R09: 000100230022
R10: 88085f807600 R11: 81412158 R12: 88084ff1ae68
R13: 88084ff1b174 R14: 880851914a88 R15: 0246
FS:  () GS:88085fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ffc501a2000 CR3: 000850402000 CR4: 000407f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Stack:
 88080001 88083eacae50 88084ff1b088 880851914a40
 880851914a88 0001 880851914a40 ff00
 880853435d00 813fabf3 880851914a40 880851914a88
Call Trace:
 [] usb_hcd_disable_endpoint+0x33/0x40
 [] usb_disable_endpoint+0x69/0x90
 [] usb_ep0_reinit+0x1c/0x40
 [] hub_port_connect_change+0x33a/0xb50
 [] hub_events+0x379/0xa30
 [] hub_thread+0x45/0x1a0
 [] ? wake_up_bit+0x30/0x30
 [] ? hub_events+0xa30/0xa30
 [] kthread+0xcf/0xe0
 [] ? kthread_create_on_node+0x140/0x140
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x140/0x140


Just to review, I pasted the USB related patches to this kernel below.  The
drivers/usb/core/h

Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-16 Thread Alan Stern
On Tue, 16 Sep 2014, Joe Lawrence wrote:

> > Anyway, the log above means that a QH was linked before the HC died,
> > but then it was never unlinked.  Please add a line at the start of
> > ehci_endpoint_disable, printing out the value of ep->hcpriv for
> > comparison with the values listed above.
> 
> Will do.
> 
> > Also, let's see the contents of *qh and *(qh->hw) for the left-over QH
> > at *(ehci->async->qh_next.qh).  You can get those right away from your 
> > crash dump data, right?
> 
> Yep, interestingly enough, *(ehci->async->qh_next.qh) appears to be
> mostly full of zeros:

> crash> struct ehci_qh 0x88083d6b4558 << *(ehci->async->qh_next.qh)
> struct ehci_qh {
>   hw = 0x0,
>   qh_dma = 0x0, 
>   qh_next = {
> qh = 0x0, 
> itd = 0x0, 
> sitd = 0x0, 
> fstn = 0x0, 
> hw_next = 0x0, 
> ptr = 0x0
>   }, 
>   qtd_list = {
> next = 0x0, 
> prev = 0x0
>   }, 
>   intr_node = {
> next = 0x0, 
> prev = 0x0
>   }, 
>   dummy = 0x0, 
>   unlink_node = {
> next = 0x0, 
> prev = 0x0
>   }, 
>   unlink_cycle = 0x6, 
>   qh_state = 0x0, 
>   xacterrs = 0x0, 
>   usecs = 0x0, 
>   gap_uf = 0x0, 
>   c_usecs = 0x0, 
>   tt_usecs = 0x0, 
>   period = 0x0, 
>   start = 0x0, 
>   dev = 0x6, 
>   is_out = 0x0, 
>   clearing_tt = 0x0, 
>   dequeue_during_giveback = 0x0, 
>   exception = 0x0
> }

So this QH was destroyed properly, but somehow the link pointer in
ehci->async continued to point at it.  And it got destroyed without
passing through single_unlink_async.

Maybe you should also add a check at the end of ehci_endpoint_disable, 
just before the done: label.  For example,

BUG_ON(ehci->async->qh_next.qh == qh);

Then if you really want to be thorough, in single_unlink_async, just
after the "while" loop, print the values of prev, prev->qh_next.qh, qh,
and qh->qh_next.qh.  And in qh_link_async, just after the wmb(), print
the values of head->qh_next.qh and qh.  (This could end up generating
quite a lot of output...)

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-16 Thread Joe Lawrence
On Tue, 16 Sep 2014 10:44:45 -0400
Alan Stern  wrote:

> On Tue, 16 Sep 2014, Joe Lawrence wrote:
> 
> > > You can check for this.  Sprinkle ehci_info messages throughout 
> > > ehci_stop, printing the value of ehci->async->qh_next.qh.  It should be 
> > > NULL the entire time.
> > 
> > I re-ran the tests with the following changes:
> > 
> >   - moved num_async debug up above struct ehci_hcd.priv[0] :)
> >   - added verbose debug msg in check_async_ring to display the new
> > num_async value
> >   - added ehci_info to every other line in ehci_stop
> > 
> > % grep ':66:00.1' vmcore-dmesg.txt
> > 
> > [ 4936.956739] pci :66:00.1: [8086:1d2d] type 00 class 0x0c0320
> > [ 4936.956789] pci :66:00.1: reg 0x10: [mem 0xa400-0xa7ff]
> > [ 4936.957019] pci :66:00.1: PME# supported from D0 D3hot D3cold
> > [ 4939.564933] ehci-pci :66:00.1: EHCI Host Controller
> > [ 4939.565462] ehci-pci :66:00.1: new USB bus registered, assigned bus 
> > number 4
> > [ 4939.565498] ehci-pci :66:00.1: debug port 2
> > [ 4939.569466] ehci-pci :66:00.1: cache line size of 64 is not supported
> > [ 4939.569533] ehci-pci :66:00.1: irq 13, io mem 0xa400
> > [ 4939.574794] ehci-pci :66:00.1: USB 2.0 started, EHCI 1.00
> > [ 4939.575410] usb usb4: SerialNumber: :66:00.1
> > [ 4940.089044] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> > [ 4941.795700] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 0, qh=  (null)
> > [ 4941.895713] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> > [ 4941.943761] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 0, qh=  (null)
> > [ 4942.002715] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> > [ 4942.064346] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 0, qh=  (null)
> > [ 4942.156646] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> > [ 4942.226028] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 0, qh=  (null)
> > [ 4942.274185] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> > [ 4942.329196] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 0, qh=  (null)
> > [ 4942.391055] ehci-pci :66:00.1: check_async_ring:970 
> > ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> > [ 4942.629653] ehci-pci :66:00.1: HC died; cleaning up
> > [ 4945.663153] ehci-pci :66:00.1: remove, state 1
> > [ 4945.733934] ehci-pci :66:00.1: ehci_stop:420 
> > ehci(8808419839d0)->num_async = 1
> > [ 4945.774242] ehci-pci :66:00.1: ehci_stop:427 
> > ehci(8808419839d0)->num_async = 1
> > [ 4945.814791] ehci-pci :66:00.1: ehci_stop:429 
> > ehci(8808419839d0)->num_async = 1
> > [ 4945.855938] ehci-pci :66:00.1: ehci_stop:431 
> > ehci(8808419839d0)->num_async = 1
> > [ 4945.897145] ehci-pci :66:00.1: ehci_stop:433 
> > ehci(8808419839d0)->num_async = 1
> > [ 4945.938791] ehci-pci :66:00.1: ehci_stop:436 
> > ehci(8808419839d0)->num_async = 1
> > [ 4945.980491] ehci-pci :66:00.1: ehci_stop:438 
> > ehci(8808419839d0)->num_async = 1
> > [ 4946.01] ehci-pci :66:00.1: ehci_stop:440 
> > ehci(8808419839d0)->num_async = 1
> > [ 4946.063936] ehci-pci :66:00.1: ehci_stop:444 
> > ehci(8808419839d0)->num_async = 1
> > [ 4946.105631] ehci-pci :66:00.1: ehci_stop:446 
> > ehci(8808419839d0)->num_async = 1
> > [ 4946.241802] ehci-pci :66:00.1: ehci_stop:448 
> > ehci(8808419839d0)->num_async = 1
> > 
> > where ehci_stop line 420 is function entry and line 448 function
> > return.  Notice that check_async_ring incremented num_async just before
> > the HC died/removed and then num_async was 1 throughout ehci_stop.
> > 
> > qh_destroy then BUGs with a struct ehci_hcd *ehci of 8808419839d0.
> 
> This says that ehci->num_async ended up equal to 1.  But in your 
> previous email, you said the final printout indicated it was equal to 
> 0.  Did something change in the meantime?

I just double checked the logs, and that is correct.  However, the
earlier reports of a dozen or so crashes included num_async at the end
of struct ehci_hcd.  Perhaps the priv[] array clobbered the value?
Sorry for the confusion.

> Anyway, the log above means that a QH was linked before the HC died,
> but then it was never unlinked.  Please add a line at the start of
> ehci_endpoint_disable, printing out the value of ep->hcpriv for
> comparison with the values listed above.

Will do.

> Also, let's see the contents of *qh and *(qh->hw) for the left-over QH
> at *(ehci->async->qh

Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-16 Thread Alan Stern
On Tue, 16 Sep 2014, Joe Lawrence wrote:

> > You can check for this.  Sprinkle ehci_info messages throughout 
> > ehci_stop, printing the value of ehci->async->qh_next.qh.  It should be 
> > NULL the entire time.
> 
> I re-ran the tests with the following changes:
> 
>   - moved num_async debug up above struct ehci_hcd.priv[0] :)
>   - added verbose debug msg in check_async_ring to display the new
> num_async value
>   - added ehci_info to every other line in ehci_stop
> 
> % grep ':66:00.1' vmcore-dmesg.txt
> 
> [ 4936.956739] pci :66:00.1: [8086:1d2d] type 00 class 0x0c0320
> [ 4936.956789] pci :66:00.1: reg 0x10: [mem 0xa400-0xa7ff]
> [ 4936.957019] pci :66:00.1: PME# supported from D0 D3hot D3cold
> [ 4939.564933] ehci-pci :66:00.1: EHCI Host Controller
> [ 4939.565462] ehci-pci :66:00.1: new USB bus registered, assigned bus 
> number 4
> [ 4939.565498] ehci-pci :66:00.1: debug port 2
> [ 4939.569466] ehci-pci :66:00.1: cache line size of 64 is not supported
> [ 4939.569533] ehci-pci :66:00.1: irq 13, io mem 0xa400
> [ 4939.574794] ehci-pci :66:00.1: USB 2.0 started, EHCI 1.00
> [ 4939.575410] usb usb4: SerialNumber: :66:00.1
> [ 4940.089044] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> [ 4941.795700] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 0, qh=  (null)
> [ 4941.895713] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> [ 4941.943761] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 0, qh=  (null)
> [ 4942.002715] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> [ 4942.064346] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 0, qh=  (null)
> [ 4942.156646] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> [ 4942.226028] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 0, qh=  (null)
> [ 4942.274185] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> [ 4942.329196] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 0, qh=  (null)
> [ 4942.391055] ehci-pci :66:00.1: check_async_ring:970 
> ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
> [ 4942.629653] ehci-pci :66:00.1: HC died; cleaning up
> [ 4945.663153] ehci-pci :66:00.1: remove, state 1
> [ 4945.733934] ehci-pci :66:00.1: ehci_stop:420 
> ehci(8808419839d0)->num_async = 1
> [ 4945.774242] ehci-pci :66:00.1: ehci_stop:427 
> ehci(8808419839d0)->num_async = 1
> [ 4945.814791] ehci-pci :66:00.1: ehci_stop:429 
> ehci(8808419839d0)->num_async = 1
> [ 4945.855938] ehci-pci :66:00.1: ehci_stop:431 
> ehci(8808419839d0)->num_async = 1
> [ 4945.897145] ehci-pci :66:00.1: ehci_stop:433 
> ehci(8808419839d0)->num_async = 1
> [ 4945.938791] ehci-pci :66:00.1: ehci_stop:436 
> ehci(8808419839d0)->num_async = 1
> [ 4945.980491] ehci-pci :66:00.1: ehci_stop:438 
> ehci(8808419839d0)->num_async = 1
> [ 4946.01] ehci-pci :66:00.1: ehci_stop:440 
> ehci(8808419839d0)->num_async = 1
> [ 4946.063936] ehci-pci :66:00.1: ehci_stop:444 
> ehci(8808419839d0)->num_async = 1
> [ 4946.105631] ehci-pci :66:00.1: ehci_stop:446 
> ehci(8808419839d0)->num_async = 1
> [ 4946.241802] ehci-pci :66:00.1: ehci_stop:448 
> ehci(8808419839d0)->num_async = 1
> 
> where ehci_stop line 420 is function entry and line 448 function
> return.  Notice that check_async_ring incremented num_async just before
> the HC died/removed and then num_async was 1 throughout ehci_stop.
> 
> qh_destroy then BUGs with a struct ehci_hcd *ehci of 8808419839d0.

This says that ehci->num_async ended up equal to 1.  But in your 
previous email, you said the final printout indicated it was equal to 
0.  Did something change in the meantime?

Anyway, the log above means that a QH was linked before the HC died,
but then it was never unlinked.  Please add a line at the start of
ehci_endpoint_disable, printing out the value of ep->hcpriv for
comparison with the values listed above.

Also, let's see the contents of *qh and *(qh->hw) for the left-over QH
at *(ehci->async->qh_next.qh).  You can get those right away from your 
crash dump data, right?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-16 Thread Joe Lawrence
On Mon, 15 Sep 2014 11:59:15 -0400
Alan Stern  wrote:

> On Sat, 13 Sep 2014, Joe Lawrence wrote:
> 
> > Hi Alan,
> > 
> > I've collected 16 crashes since kicking off automated tests a little
> > over 24 hrs ago.
> > 
> > Each crash hit the BUG in qh_destroy() and the only unique debugging
> > printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0".
> 
> What about error messages from the check_async_ring routine?

In the crash instances, the error messages never occurred.  (I had
added a BUG() in that code block, just to be really clear if/when that
event occurred).

> You can check for this.  Sprinkle ehci_info messages throughout 
> ehci_stop, printing the value of ehci->async->qh_next.qh.  It should be 
> NULL the entire time.

I re-ran the tests with the following changes:

  - moved num_async debug up above struct ehci_hcd.priv[0] :)
  - added verbose debug msg in check_async_ring to display the new
num_async value
  - added ehci_info to every other line in ehci_stop

% grep ':66:00.1' vmcore-dmesg.txt

[ 4936.956739] pci :66:00.1: [8086:1d2d] type 00 class 0x0c0320
[ 4936.956789] pci :66:00.1: reg 0x10: [mem 0xa400-0xa7ff]
[ 4936.957019] pci :66:00.1: PME# supported from D0 D3hot D3cold
[ 4939.564933] ehci-pci :66:00.1: EHCI Host Controller
[ 4939.565462] ehci-pci :66:00.1: new USB bus registered, assigned bus 
number 4
[ 4939.565498] ehci-pci :66:00.1: debug port 2
[ 4939.569466] ehci-pci :66:00.1: cache line size of 64 is not supported
[ 4939.569533] ehci-pci :66:00.1: irq 13, io mem 0xa400
[ 4939.574794] ehci-pci :66:00.1: USB 2.0 started, EHCI 1.00
[ 4939.575410] usb usb4: SerialNumber: :66:00.1
[ 4940.089044] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
[ 4941.795700] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 0, qh=  (null)
[ 4941.895713] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
[ 4941.943761] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 0, qh=  (null)
[ 4942.002715] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
[ 4942.064346] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 0, qh=  (null)
[ 4942.156646] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
[ 4942.226028] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 0, qh=  (null)
[ 4942.274185] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
[ 4942.329196] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 0, qh=  (null)
[ 4942.391055] ehci-pci :66:00.1: check_async_ring:970 
ehci(8808419839d0)->num_async = 1, qh=88083d6b4558
[ 4942.629653] ehci-pci :66:00.1: HC died; cleaning up
[ 4945.663153] ehci-pci :66:00.1: remove, state 1
[ 4945.733934] ehci-pci :66:00.1: ehci_stop:420 
ehci(8808419839d0)->num_async = 1
[ 4945.774242] ehci-pci :66:00.1: ehci_stop:427 
ehci(8808419839d0)->num_async = 1
[ 4945.814791] ehci-pci :66:00.1: ehci_stop:429 
ehci(8808419839d0)->num_async = 1
[ 4945.855938] ehci-pci :66:00.1: ehci_stop:431 
ehci(8808419839d0)->num_async = 1
[ 4945.897145] ehci-pci :66:00.1: ehci_stop:433 
ehci(8808419839d0)->num_async = 1
[ 4945.938791] ehci-pci :66:00.1: ehci_stop:436 
ehci(8808419839d0)->num_async = 1
[ 4945.980491] ehci-pci :66:00.1: ehci_stop:438 
ehci(8808419839d0)->num_async = 1
[ 4946.01] ehci-pci :66:00.1: ehci_stop:440 
ehci(8808419839d0)->num_async = 1
[ 4946.063936] ehci-pci :66:00.1: ehci_stop:444 
ehci(8808419839d0)->num_async = 1
[ 4946.105631] ehci-pci :66:00.1: ehci_stop:446 
ehci(8808419839d0)->num_async = 1
[ 4946.241802] ehci-pci :66:00.1: ehci_stop:448 
ehci(8808419839d0)->num_async = 1

where ehci_stop line 420 is function entry and line 448 function
return.  Notice that check_async_ring incremented num_async just before
the HC died/removed and then num_async was 1 throughout ehci_stop.

qh_destroy then BUGs with a struct ehci_hcd *ehci of 8808419839d0.

-- Joe

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-15 Thread Alan Stern
On Sat, 13 Sep 2014, Joe Lawrence wrote:

> Hi Alan,
> 
> I've collected 16 crashes since kicking off automated tests a little
> over 24 hrs ago.
> 
> Each crash hit the BUG in qh_destroy() and the only unique debugging
> printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0".

What about error messages from the check_async_ring routine?

> I can include (or upload) a full (or filtered) vmcore-dmesg.txt if that
> would be more helpful.

Only if it includes one of those messages.

> The debug code I added as you suggested is provided below...
> 
> Thanks,
> 
> -- Joe
> 
> 
> diff -Nupr before/drivers/usb/host/ehci.h after/drivers/usb/host/ehci.h
> --- before/drivers/usb/host/ehci.h2014-07-16 14:25:31.0 -0400
> +++ after/drivers/usb/host/ehci.h 2014-09-12 15:47:18.943478397 -0400
> @@ -231,6 +231,8 @@ struct ehci_hcd { /* one per controlle
>  
>   /* platform-specific data -- must come last */
>   unsigned long   priv[0] __aligned(sizeof(s64));
> +
> + int num_async;
>  };

Not a good idea to put your new field _after_ a field that is
documented as coming last, but in this case it probably won't hurt.

>  /* convert between an HCD pointer and the corresponding EHCI_HCD */
> diff -Nupr before/drivers/usb/host/ehci-hcd.c 
> after/drivers/usb/host/ehci-hcd.c
> --- before/drivers/usb/host/ehci-hcd.c2014-07-16 14:25:31.0 
> -0400
> +++ after/drivers/usb/host/ehci-hcd.c 2014-09-12 15:53:16.863163627 -0400
> @@ -440,6 +440,8 @@ static void ehci_stop (struct usb_hcd *h
>   if (ehci->amd_pll_fix == 1)
>   usb_amd_dev_put();
>  
> + pr_err("%s: ehci->num_async = %d\n", __func__, ehci->num_async);
> +
>   dbg_status (ehci, "ehci_stop completed",
>   ehci_readl(ehci, &ehci->regs->status));
>  }

Always 0, as it should be.

> diff -Nupr before/drivers/usb/host/ehci-q.c after/drivers/usb/host/ehci-q.c
> --- before/drivers/usb/host/ehci-q.c  2014-07-16 14:25:31.0 -0400
> +++ after/drivers/usb/host/ehci-q.c   2014-09-12 15:52:08.023292291 -0400
> @@ -959,6 +959,24 @@ static void disable_async(struct ehci_hc
>   ehci_poll_ASS(ehci);
>  }
>  
> +static void check_async_ring(struct ehci_hcd *ehci, int add)
> +{
> + struct ehci_qh *qh;
> + int n;
> +
> + qh = ehci->async->qh_next.qh;
> + n = ehci->num_async += add;
> + while (qh && n > 0) {
> + qh = qh->qh_next.qh;
> + --n;
> + }
> + if (qh || n != 0) {
> + ehci_err(ehci, "EHCI async list corrupted: num %d n %d qh %p\n",
> + ehci->num_async, n, qh);
> + BUG();
> + }
> +}

In the last call to this function, ehci->num_async must get set to 0 
(because that is the final value as printed out in ehci_stop).  This 
means n = 0, which means ehci->async->qh_next.qh must be NULL.  And 
this routine is the only place where ehci->num_async gets changed.

But then ehci->async->qh_next.qh isn't NULL when the final qh_destroy 
call is made.  This suggests that something changes it in the meantime.

You can check for this.  Sprinkle ehci_info messages throughout 
ehci_stop, printing the value of ehci->async->qh_next.qh.  It should be 
NULL the entire time.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-13 Thread Joe Lawrence
On Fri, 12 Sep 2014, Alan Stern wrote:

> On Fri, 12 Sep 2014, Joe Lawrence wrote:
> 
> > On Fri, 12 Sep 2014 11:31:46 -0400
> > Alan Stern  wrote:
> > 
> > > On Thu, 11 Sep 2014, Joe Lawrence wrote:
> > > 
> > > > Hi Alan,
> > > > 
> > > > I've got another USB bug to report that manifests during automated
> > > > device removal testing on RHEL7.  This one hits the BUG() inside
> > > > qh_destroy:
> > > 
> > > How reliably can you trigger this bug?
> > 
> > I have collected a few crashes within a few days, so somewhat
> > frequently.
> > 
> > > >  67 static void qh_destroy(struct ehci_hcd *ehci, struct ehci_qh *qh)
> > > >  68 {
> > > >  69 /* clean qtds first, and know this is not linked */
> > > >  70 if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
> > > >  71 ehci_dbg (ehci, "unused qh not empty!\n");
> > > >  72 BUG ();
> > > >  73 }
> > > 
> > > > and finally a dump of the ehci_qh in question:
> > > > 
> > > > crash> struct ehci_qh 88084b84dc80
> > > > struct ehci_qh {
> > > >   hw = 0x880078d1a000, 
> > > 
> > > It would be good to see the contents of the ehci_qh_hw structure.  That 
> > > would tell us what device and endpoint this QH was for.
> > 
> > crash> struct ehci_qh_hw 0x880078d1a000
> > struct ehci_qh_hw {
> >   hw_next = 0x78d1a062, 
> >   hw_info1 = 0x8000, 
> 
> No maxpacket value, device address, or endpoint number, but the QH_HEAD
> bit is set.  That happens only with the head of the async ring.  And
> indeed, the QH address agrees with ehci->async = 0x88084b84dc80 in
> your earlier email.
> 
> >   hw_info2 = 0x0, 
> >   hw_current = 0x0, 
> >   hw_qtd_next = 0x1, 
> >   hw_alt_next = 0x78d22000, 
> >   hw_token = 0x40, 
> >   hw_buf = {0x0, 0x0, 0x0, 0x0, 0x0}, 
> >   hw_buf_hi = {0x0, 0x0, 0x0, 0x0, 0x0}
> > }
> > 
> > > >   qh_dma = 0x78d1a000, 
> > > >   qh_next = {
> > > > qh = 0x88084efe6730, 
> > > > itd = 0x88084efe6730, 
> > > > sitd = 0x88084efe6730, 
> > > > fstn = 0x88084efe6730, 
> > > > hw_next = 0x88084efe6730, 
> > > > ptr = 0x88084efe6730 << !NULL
> > > >   }, 
> 
> So there's a leftover qh_next pointer, presumably to a QH that used to
> be on the async list but no longer exists.
> 
> This means the list pointers got corrupted somehow.  No way at this
> point to know just how.  You can add some debugging code to check the
> links at the end of qh_link_async (which adds a new QH to the async
> list) and single_unlink_async (which removes a QH from the list).

Hi Alan,

I've collected 16 crashes since kicking off automated tests a little
over 24 hrs ago.

Each crash hit the BUG in qh_destroy() and the only unique debugging
printk from ehci_stop() was: "ehci_stop: ehci->num_async = 0".

I can include (or upload) a full (or filtered) vmcore-dmesg.txt if that
would be more helpful.

The debug code I added as you suggested is provided below...

Thanks,

-- Joe


diff -Nupr before/drivers/usb/host/ehci.h after/drivers/usb/host/ehci.h
--- before/drivers/usb/host/ehci.h  2014-07-16 14:25:31.0 -0400
+++ after/drivers/usb/host/ehci.h   2014-09-12 15:47:18.943478397 -0400
@@ -231,6 +231,8 @@ struct ehci_hcd {   /* one per controlle
 
/* platform-specific data -- must come last */
unsigned long   priv[0] __aligned(sizeof(s64));
+
+   int num_async;
 };
 
 /* convert between an HCD pointer and the corresponding EHCI_HCD */
diff -Nupr before/drivers/usb/host/ehci-hcd.c after/drivers/usb/host/ehci-hcd.c
--- before/drivers/usb/host/ehci-hcd.c  2014-07-16 14:25:31.0 -0400
+++ after/drivers/usb/host/ehci-hcd.c   2014-09-12 15:53:16.863163627 -0400
@@ -440,6 +440,8 @@ static void ehci_stop (struct usb_hcd *h
if (ehci->amd_pll_fix == 1)
usb_amd_dev_put();
 
+   pr_err("%s: ehci->num_async = %d\n", __func__, ehci->num_async);
+
dbg_status (ehci, "ehci_stop completed",
ehci_readl(ehci, &ehci->regs->status));
 }
diff -Nupr before/drivers/usb/host/ehci-q.c after/drivers/usb/host/ehci-q.c
--- before/drivers/usb/host/ehci-q.c2014-07-16 14:25:31.0 -0400
+++ after/drivers/usb/host/ehci-q.c 2014-09-12 15:52:08.023292291 -0400
@@ -959,6 +959,24 @@ static void disable_async(struct ehci_hc
ehci_poll_ASS(ehci);
 }
 
+static void check_async_ring(struct ehci_hcd *ehci, int add)
+{
+   struct ehci_qh *qh;
+   int n;
+
+   qh = ehci->async->qh_next.qh;
+   n = ehci->num_async += add;
+   while (qh && n > 0) {
+   qh = qh->qh_next.qh;
+   --n;
+   }
+   if (qh || n != 0) {
+   ehci_err(ehci, "EHCI async list corrupted: num %d n %d qh %p\n",
+   ehci->num_async, n, qh);
+   BUG();
+   }
+}
+
 /* move qh (and its qtds) onto async queue; maybe enable queue.  */
 
 static void qh_link_async (struct ehci_hcd *ehci, s

Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-12 Thread Alan Stern
On Fri, 12 Sep 2014, Joe Lawrence wrote:

> On Fri, 12 Sep 2014 11:31:46 -0400
> Alan Stern  wrote:
> 
> > On Thu, 11 Sep 2014, Joe Lawrence wrote:
> > 
> > > Hi Alan,
> > > 
> > > I've got another USB bug to report that manifests during automated
> > > device removal testing on RHEL7.  This one hits the BUG() inside
> > > qh_destroy:
> > 
> > How reliably can you trigger this bug?
> 
> I have collected a few crashes within a few days, so somewhat
> frequently.
> 
> > >  67 static void qh_destroy(struct ehci_hcd *ehci, struct ehci_qh *qh)
> > >  68 {
> > >  69 /* clean qtds first, and know this is not linked */
> > >  70 if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
> > >  71 ehci_dbg (ehci, "unused qh not empty!\n");
> > >  72 BUG ();
> > >  73 }
> > 
> > > and finally a dump of the ehci_qh in question:
> > > 
> > > crash> struct ehci_qh 88084b84dc80
> > > struct ehci_qh {
> > >   hw = 0x880078d1a000, 
> > 
> > It would be good to see the contents of the ehci_qh_hw structure.  That 
> > would tell us what device and endpoint this QH was for.
> 
> crash> struct ehci_qh_hw 0x880078d1a000
> struct ehci_qh_hw {
>   hw_next = 0x78d1a062, 
>   hw_info1 = 0x8000, 

No maxpacket value, device address, or endpoint number, but the QH_HEAD
bit is set.  That happens only with the head of the async ring.  And
indeed, the QH address agrees with ehci->async = 0x88084b84dc80 in
your earlier email.

>   hw_info2 = 0x0, 
>   hw_current = 0x0, 
>   hw_qtd_next = 0x1, 
>   hw_alt_next = 0x78d22000, 
>   hw_token = 0x40, 
>   hw_buf = {0x0, 0x0, 0x0, 0x0, 0x0}, 
>   hw_buf_hi = {0x0, 0x0, 0x0, 0x0, 0x0}
> }
> 
> > >   qh_dma = 0x78d1a000, 
> > >   qh_next = {
> > > qh = 0x88084efe6730, 
> > > itd = 0x88084efe6730, 
> > > sitd = 0x88084efe6730, 
> > > fstn = 0x88084efe6730, 
> > > hw_next = 0x88084efe6730, 
> > > ptr = 0x88084efe6730 << !NULL
> > >   }, 

So there's a leftover qh_next pointer, presumably to a QH that used to
be on the async list but no longer exists.

This means the list pointers got corrupted somehow.  No way at this
point to know just how.  You can add some debugging code to check the
links at the end of qh_link_async (which adds a new QH to the async
list) and single_unlink_async (which removes a QH from the list).

Something like this:

static void check_async_ring(struct ehci_hcd *ehci, int add)
{
struct ehci_qh *qh;
int n;

qh = ehci->async->qh_next.qh;
n = ehci->num_async += add;
while (qh && n > 0) {
qh = qh->qh_next.qh;
--n;
}
if (qh || n != 0)
ehci_err(ehci, "EHCI async list corrupted: num %d n %d qh %p\n",
ehci->num_async, n, qh);
}

Add an int num_async field to the ehci_hcd structure, and then add

check_async_ring(ehci, 1);

at the end of qh_link_async and

check_async_ring(ehci, -1);

at the end of single_unlink_async.  Then maybe for thoroughness, print 
out the value of ehci->num_async in ehci_stop if it is nonzero.

> > The place to look is ehci_endpoint_disable.  Did that routine get 
> > called for this QH?  Did it hit the default case of the big switch 
> > statement (with its ehci_err statement)?
> 
> Not sure if there is enough residual side-effect data in a crash dump
> to determine if ehci_endpoint_disable executed.  However, the QH that
> qh_destroy was handling did *not* have the exception bit set.  (See the
> first mail for the structure dump.)

Yeah, the head of the ring isn't a "real" QH, so it never gets
disabled.  Whatever it was pointing to must have been unlinked and
disabled at some time, though.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-12 Thread Joe Lawrence
On Fri, 12 Sep 2014 11:31:46 -0400
Alan Stern  wrote:

> On Thu, 11 Sep 2014, Joe Lawrence wrote:
> 
> > Hi Alan,
> > 
> > I've got another USB bug to report that manifests during automated
> > device removal testing on RHEL7.  This one hits the BUG() inside
> > qh_destroy:
> 
> How reliably can you trigger this bug?

I have collected a few crashes within a few days, so somewhat
frequently.

> >  67 static void qh_destroy(struct ehci_hcd *ehci, struct ehci_qh *qh)
> >  68 {
> >  69 /* clean qtds first, and know this is not linked */
> >  70 if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
> >  71 ehci_dbg (ehci, "unused qh not empty!\n");
> >  72 BUG ();
> >  73 }
> 
> > and finally a dump of the ehci_qh in question:
> > 
> > crash> struct ehci_qh 88084b84dc80
> > struct ehci_qh {
> >   hw = 0x880078d1a000, 
> 
> It would be good to see the contents of the ehci_qh_hw structure.  That 
> would tell us what device and endpoint this QH was for.

crash> struct ehci_qh_hw 0x880078d1a000
struct ehci_qh_hw {
  hw_next = 0x78d1a062, 
  hw_info1 = 0x8000, 
  hw_info2 = 0x0, 
  hw_current = 0x0, 
  hw_qtd_next = 0x1, 
  hw_alt_next = 0x78d22000, 
  hw_token = 0x40, 
  hw_buf = {0x0, 0x0, 0x0, 0x0, 0x0}, 
  hw_buf_hi = {0x0, 0x0, 0x0, 0x0, 0x0}
}

> >   qh_dma = 0x78d1a000, 
> >   qh_next = {
> > qh = 0x88084efe6730, 
> > itd = 0x88084efe6730, 
> > sitd = 0x88084efe6730, 
> > fstn = 0x88084efe6730, 
> > hw_next = 0x88084efe6730, 
> > ptr = 0x88084efe6730 << !NULL
> >   }, 
> >   qtd_list = {   << list_empty
> > next = 0x88084b84dc98, 
> > prev = 0x88084b84dc98
> >   }, 
> >   intr_node = {
> > next = 0x0, 
> > prev = 0x0
> >   }, 
> >   dummy = 0x880078d22000, 
> >   unlink_node = {
> > next = 0x88084b84dcc0, 
> > prev = 0x88084b84dcc0
> >   }, 
> >   unlink_cycle = 0x0, 
> >   qh_state = 0x1,<< QH_STATE_LINKED
> ...
> > }
> > 
> > The qtd_list is empty, contains only one entry, itself.
> > 
> > crash> struct -o ehci_qh | grep td_list
> >   [0x18] struct list_head qtd_list;
> > crash> p/x 0x88084b84dc80 + 0x18
> > $1 = 0x88084b84dc98
> > 
> > but qh->qh_next.ptr is !NULL, so we hit the BUG.  However, it seems that
> > the memory at qh->qh_next.ptr has been freed:
> 
> > I'm not too familiar with the USB code stack, so any suggestions on
> > instrumentation that I can add to aid in debugging would be helpful.
> > Maybe some tracing in qh_link_async / single_unlink_async /
> > end_unlink_async /qh_link_periodic can reveal the sequence that is
> > leaving this dangling qh_next.ptr?
> 
> The place to look is ehci_endpoint_disable.  Did that routine get 
> called for this QH?  Did it hit the default case of the big switch 
> statement (with its ehci_err statement)?

Not sure if there is enough residual side-effect data in a crash dump
to determine if ehci_endpoint_disable executed.  However, the QH that
qh_destroy was handling did *not* have the exception bit set.  (See the
first mail for the structure dump.)

Would it be reasonable to add printk debugging messages to
ehci_endpoint_disable to trace the QH in question and its qh_state?

> > Note: This does bear some resemblance to a bug that Stratus hit a few
> > years ago [1] [2], however enough of the code has changed that I'm not
> > sure the fix for that one would apply to a modern kernel.
> 
> What version of the driver are you currently running?

The driver is built into a slightly modified RHEL7 3.10.0-123.6.3.el7.x86_64 
kernel.

Regards,

-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hitting "unused qh not empty" BUG in qh_destroy

2014-09-12 Thread Alan Stern
On Thu, 11 Sep 2014, Joe Lawrence wrote:

> Hi Alan,
> 
> I've got another USB bug to report that manifests during automated
> device removal testing on RHEL7.  This one hits the BUG() inside
> qh_destroy:

How reliably can you trigger this bug?

>  67 static void qh_destroy(struct ehci_hcd *ehci, struct ehci_qh *qh)
>  68 {
>  69 /* clean qtds first, and know this is not linked */
>  70 if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
>  71 ehci_dbg (ehci, "unused qh not empty!\n");
>  72 BUG ();
>  73 }

> and finally a dump of the ehci_qh in question:
> 
> crash> struct ehci_qh 88084b84dc80
> struct ehci_qh {
>   hw = 0x880078d1a000, 

It would be good to see the contents of the ehci_qh_hw structure.  That 
would tell us what device and endpoint this QH was for.

>   qh_dma = 0x78d1a000, 
>   qh_next = {
> qh = 0x88084efe6730, 
> itd = 0x88084efe6730, 
> sitd = 0x88084efe6730, 
> fstn = 0x88084efe6730, 
> hw_next = 0x88084efe6730, 
> ptr = 0x88084efe6730 << !NULL
>   }, 
>   qtd_list = {   << list_empty
> next = 0x88084b84dc98, 
> prev = 0x88084b84dc98
>   }, 
>   intr_node = {
> next = 0x0, 
> prev = 0x0
>   }, 
>   dummy = 0x880078d22000, 
>   unlink_node = {
> next = 0x88084b84dcc0, 
> prev = 0x88084b84dcc0
>   }, 
>   unlink_cycle = 0x0, 
>   qh_state = 0x1,<< QH_STATE_LINKED
...
> }
> 
> The qtd_list is empty, contains only one entry, itself.
> 
> crash> struct -o ehci_qh | grep td_list
>   [0x18] struct list_head qtd_list;
> crash> p/x 0x88084b84dc80 + 0x18
> $1 = 0x88084b84dc98
> 
> but qh->qh_next.ptr is !NULL, so we hit the BUG.  However, it seems that
> the memory at qh->qh_next.ptr has been freed:

> I'm not too familiar with the USB code stack, so any suggestions on
> instrumentation that I can add to aid in debugging would be helpful.
> Maybe some tracing in qh_link_async / single_unlink_async /
> end_unlink_async /qh_link_periodic can reveal the sequence that is
> leaving this dangling qh_next.ptr?

The place to look is ehci_endpoint_disable.  Did that routine get 
called for this QH?  Did it hit the default case of the big switch 
statement (with its ehci_err statement)?

> Note: This does bear some resemblance to a bug that Stratus hit a few
> years ago [1] [2], however enough of the code has changed that I'm not
> sure the fix for that one would apply to a modern kernel.

What version of the driver are you currently running?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html