ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-19 Thread Heiko Przybyl
Hi all,

been redirected here from bug #91511 [1].

I'm getting sporadic crashes in io_watchdog_func() in drivers/usb/host/ohci-
hcd.c:761:
"
list_for_each_entry(ed, &ohci->eds_in_use, in_use_list) {
if (ed->pending_td) {
"
with the in_use list getting corrupted:

from ohci_urb_enqueue():
[43656.918055] list_add double add: new=8800cbaa8040, 
prev=8800cb8aa5b8, 
next=8800cbaa8040.
from ohci_work.part():
[43656.920980] list_del corruption. next->prev should be 8800cbaa8040, but 
was 8800cb8aa5b8

One or both set the pointer to 0xdead00100100 and 0xdead001000c0, 
where io_watchdog_func() chokes on [2].

It seems to be related to keyboard input (at least it happens when using the 
keyboard), without relation to system load. Can happen within a day after boot 
or after several days of hibernated uptime. Unfortunately, I haven't found a 
way to reliably reproduce the issue, yet.

The box is a "Gigabyte GA-78LMT-USB3" with "AMD FX(tm)-6300 Six-Core 
Processor" and "[AMD/ATI] SB7x0 USB OHCI1 Controller".

For more info including crash trace, just have a look at the bug report [1]


My (pretty wild) guess is, that the corruption happens through a race in the 
interrupt handler ohci_irq(), which calls ohci_work(), which calls 
finish_urb(), which states:
" * PRECONDITION:  ohci lock held, irqs blocked"

But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
restore](). All other functions that call ohci_work() do at least 
spin_[un]lock_irq. So irqs could still be enabled and possibly the event 
triggered twice, thus the double list add?


If you need more information just let me know (per the bug [1] or CC, since 
I'm not subscribed to this list).


Kind regards,

Heiko


[1] https://bugzilla.kernel.org/show_bug.cgi?id=91511
[2] https://bugzilla.kernel.org/attachment.cgi?id=163721

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-06-28 Thread Heiko Przybyl
On Monday 11 May 2015 13:14:52 Alan Stern wrote:
> On Sun, 10 May 2015, Heiko Przybyl wrote:
> > Hello again.
> > 
> > The issue still bites me with 4.0.2.
> > 
> > Though I was able to narrow down the issue, it's still hard to
> > deterministically trigger it.
> > 
> > Anyway. The setup:
> > - OHCI controller with hooked up keyboard
> > - xhddled (program that toggles scrolllock on disk activity)
> > - disk activity
> > 
> > I'm very certain that it happens, when usbhid issues an urb for scrolllock
> > led toggling, the ohci driver adding a duplicate to its eds_in_use list.
> > That's why I hit it way more often with disk activity. I also tried to
> > alter the program to just periodically toggle the scrolllock, but that
> > didn't seem to help for reproducibility.
> 
> As long as you have a way to trigger the problem, you're in a good
> position to track it down.
> 
> > On Friday 23 January 2015 15:21:11 Alan Stern wrote:
> > 
> > As suggested, checking the eds_in_use list was a good idea. Now, I'm even
> > more convinced duplicates are being added to the eds_in_use list.
> 
> Your tests gave me a clue as to where the problem might be.  It looks
> like finish_unlinks() changes ed->state to ED_IDLE too soon.  The state
> should remain set to ED_UNLINK while finish_urb() is called.
> 
> See if the patch below eliminates your problem.

Yes it indeed does eliminate the problem. Took some time to check if the 
problem is really gone, though. Without the patch I get lockups up to and 
including release 4.1; and not a single lockup with the patch applied.

So for the patch I'd say

Tested-By: Heiko Przybyl 


Thanks for looking into and fixing this issue,

   Heiko

> 
> Alan Stern
> 
> 
> 
> Index: usb-4.0/drivers/usb/host/ohci-q.c
> ===
> --- usb-4.0.orig/drivers/usb/host/ohci-q.c
> +++ usb-4.0/drivers/usb/host/ohci-q.c
> @@ -980,10 +980,6 @@ rescan_all:
>   int completed, modified;
>   __hc32  *prev;
> 
> - /* Is this ED already invisible to the hardware? */
> - if (ed->state == ED_IDLE)
> - goto ed_idle;
> -
>   /* only take off EDs that the HC isn't using, accounting for
>* frame counter wraps and EDs with partially retired TDs
>*/
> @@ -1011,12 +1007,10 @@ skip_ed:
>   }
> 
>   /* ED's now officially unlinked, hc doesn't see */
> - ed->state = ED_IDLE;
>   ed->hwHeadP &= ~cpu_to_hc32(ohci, ED_H);
>   ed->hwNextED = 0;
>   wmb();
>   ed->hwINFO &= ~cpu_to_hc32(ohci, ED_SKIP | ED_DEQUEUE);
> -ed_idle:
> 
>   /* reentrancy:  if we drop the schedule lock, someone might
>* have modified this list.  normally it's just prepending
> @@ -1087,6 +1081,7 @@ rescan_this:
>   if (list_empty(&ed->td_list)) {
>   *last = ed->ed_next;
>   ed->ed_next = NULL;
> + ed->state = ED_IDLE;
>   list_del(&ed->in_use_list);
>   } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>   *last = ed->ed_next;

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-19 Thread Alan Stern
On Mon, 19 Jan 2015, Heiko Przybyl wrote:

> Hi all,
> 
> been redirected here from bug #91511 [1].
> 
> I'm getting sporadic crashes in io_watchdog_func() in drivers/usb/host/ohci-
> hcd.c:761:
> "
> list_for_each_entry(ed, &ohci->eds_in_use, in_use_list) {
> if (ed->pending_td) {
> "
> with the in_use list getting corrupted:
> 
> from ohci_urb_enqueue():
> [43656.918055] list_add double add: new=8800cbaa8040, 
> prev=8800cb8aa5b8, 
> next=8800cbaa8040.
> from ohci_work.part():
> [43656.920980] list_del corruption. next->prev should be 8800cbaa8040, 
> but 
> was 8800cb8aa5b8
> 
> One or both set the pointer to 0xdead00100100 and 0xdead001000c0, 
> where io_watchdog_func() chokes on [2].

That is indeed a bad bug.

> It seems to be related to keyboard input (at least it happens when using the 
> keyboard), without relation to system load. Can happen within a day after 
> boot 
> or after several days of hibernated uptime. Unfortunately, I haven't found a 
> way to reliably reproduce the issue, yet.
> 
> The box is a "Gigabyte GA-78LMT-USB3" with "AMD FX(tm)-6300 Six-Core 
> Processor" and "[AMD/ATI] SB7x0 USB OHCI1 Controller".
> 
> For more info including crash trace, just have a look at the bug report [1]
> 
> 
> My (pretty wild) guess is, that the corruption happens through a race in the 
> interrupt handler ohci_irq(), which calls ohci_work(), which calls 
> finish_urb(), which states:
> " * PRECONDITION:  ohci lock held, irqs blocked"
> 
> But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
> restore](). All other functions that call ohci_work() do at least 
> spin_[un]lock_irq. So irqs could still be enabled and possibly the event 
> triggered twice, thus the double list add?

That's easy enough to test.  All you have to do is change the 
spin_lock/unlock statements to their irq_save/restore variants.

ohci_irq() is an interrupt handler.  In the absence of threaded IRQs,
he kernel should always call interrupt handlers with interrupts 
disabled.  Do you specify "threadirqs" on your boot command line?

If that's not the explanation then we'll have to dig deeper.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-19 Thread Heiko Przybyl
On Monday 19 January 2015 11:17:59 Alan Stern wrote:
> On Mon, 19 Jan 2015, Heiko Przybyl wrote:
> > It seems to be related to keyboard input (at least it happens when using
> > the keyboard), without relation to system load. Can happen within a day
> > after boot or after several days of hibernated uptime. Unfortunately, I
> > haven't found a way to reliably reproduce the issue, yet.
> > 
> > [..]
> > 
> > My (pretty wild) guess is, that the corruption happens through a race in
> > the interrupt handler ohci_irq(), which calls ohci_work(), which calls
> > finish_urb(), which states:
> > " * PRECONDITION:  ohci lock held, irqs blocked"
> > 
> > But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
> > restore](). All other functions that call ohci_work() do at least
> > spin_[un]lock_irq. So irqs could still be enabled and possibly the event
> > triggered twice, thus the double list add?
> 
> That's easy enough to test.  All you have to do is change the
> spin_lock/unlock statements to their irq_save/restore variants.

Well, thought about that as well, but I'm not sure when to take it as fixed and 
when to take it as issue-just-didn't-happen-yet, because of the not-so-
deterministic occurrence of the error. But I can try it out anyway, just 
wanted to have some feedback before trying.

> 
> ohci_irq() is an interrupt handler.  In the absence of threaded IRQs,
> he kernel should always call interrupt handlers with interrupts
> disabled.  Do you specify "threadirqs" on your boot command line?
> 

Never used "threadirqs".

# cat /proc/cmdline 
BOOT_IMAGE=/boot/gentoo root=/dev/sda2 ro rootfstype=ext4 resume=/dev/sda3 
init=/usr/lib/systemd/systemd quiet libahci.ignore_sss=1 i8042.nopnp 
crashkernel=64M

> If that's not the explanation then we'll have to dig deeper.

I can still work on a saved vmcore dump of a crash. Btw. using crash(1) and 
its command `bt -E`shows two kernel mode exceptions. Though, I can't figure out 
where the first one originates from

CPU 3 IRQ STACK:
  KERNEL-MODE EXCEPTION FRAME AT: 88022ecc3638
[exception RIP: _raw_spin_unlock_irqrestore+9]
RIP: 814774b9  RSP: 88022ecc36e8  RFLAGS: 0202
RAX: 88022ecc36a8  RBX: 88022ecc36b0  RCX: 81290279
RDX: 2dff  RSI:   RDI: 88022ecc3788
RBP: 88022ecc36e8   R8: 0080   R9: 0023
R10: 813e6407  R11: ea000863ad80  R12: 88022ecc3658
R13: 81478b2a  R14: 88022ecc36e8  R15: 0001
ORIG_RAX: 81471cfd  CS: 0010  SS: 0018

0x814774b9 <+9>: decl   %gs:0xa860

CPU 5 IRQ STACK:
  KERNEL-MODE EXCEPTION FRAME AT: 88022ed43d98
[exception RIP: io_watchdog_func+112]
RIP: 81394b80  RSP: 88022ed43e48  RFLAGS: 00010006
RAX: 8800cb8aa598  RBX: 0296  RCX: 8800cbaa8030
RDX: dead00100100  RSI: cbaa91e0  RDI: 8800cbaa8030
RBP: 88022ed43e88   R8: 8800cbaa7fe8   R9: 0205
R10: 8800cbaa8030  R11: 8800cb8aa5a0  R12: dead001000c0
R13: 8800cb8aa248  R14: 8800cb8aa5b8  R15: 
ORIG_RAX:   CS: 0010  SS: 


> Alan Stern

Kind regards,

   Heiko
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-20 Thread Alan Stern
On Mon, 19 Jan 2015, Heiko Przybyl wrote:

> On Monday 19 January 2015 11:17:59 Alan Stern wrote:
> > On Mon, 19 Jan 2015, Heiko Przybyl wrote:
> > > It seems to be related to keyboard input (at least it happens when using
> > > the keyboard), without relation to system load. Can happen within a day
> > > after boot or after several days of hibernated uptime. Unfortunately, I
> > > haven't found a way to reliably reproduce the issue, yet.
> > > 
> > > [..]
> > > 
> > > My (pretty wild) guess is, that the corruption happens through a race in
> > > the interrupt handler ohci_irq(), which calls ohci_work(), which calls
> > > finish_urb(), which states:
> > > " * PRECONDITION:  ohci lock held, irqs blocked"
> > > 
> > > But ohci_irq() seems to only spin_[un]lock(), not spin_[un]lock_irq[save|
> > > restore](). All other functions that call ohci_work() do at least
> > > spin_[un]lock_irq. So irqs could still be enabled and possibly the event
> > > triggered twice, thus the double list add?
> > 
> > That's easy enough to test.  All you have to do is change the
> > spin_lock/unlock statements to their irq_save/restore variants.
> 
> Well, thought about that as well, but I'm not sure when to take it as fixed 
> and 
> when to take it as issue-just-didn't-happen-yet, because of the not-so-
> deterministic occurrence of the error. But I can try it out anyway, just 
> wanted to have some feedback before trying.

By the way, failing to disable interrupts when acquiring a spinlock 
generally does not lead to data corruption -- it leads to deadlocks.  
So I doubt this is the cause of your problem.  If you really want to, 
you could add a

WARN_ON(!irqs_disabled());

line to ohci_irq().

> > If that's not the explanation then we'll have to dig deeper.
> 
> I can still work on a saved vmcore dump of a crash. Btw. using crash(1) and 
> its command `bt -E`shows two kernel mode exceptions. Though, I can't figure 
> out 
> where the first one originates from
> 
> CPU 3 IRQ STACK:
>   KERNEL-MODE EXCEPTION FRAME AT: 88022ecc3638
> [exception RIP: _raw_spin_unlock_irqrestore+9]
> RIP: 814774b9  RSP: 88022ecc36e8  RFLAGS: 0202
> RAX: 88022ecc36a8  RBX: 88022ecc36b0  RCX: 81290279
> RDX: 2dff  RSI:   RDI: 88022ecc3788
> RBP: 88022ecc36e8   R8: 0080   R9: 0023
> R10: 813e6407  R11: ea000863ad80  R12: 88022ecc3658
> R13: 81478b2a  R14: 88022ecc36e8  R15: 0001
> ORIG_RAX: 81471cfd  CS: 0010  SS: 0018
> 
> 0x814774b9 <+9>: decl   %gs:0xa860

No idea.

It might be a good idea for you to try something a little more 
invasive.  How about writing a routine to check the entire 
ohci->eds_in_use list for validity (each forward pointer is matched by 
the corresponding backward pointer), and calling this routine at each 
place where the list gets modified, before the modification happens?

You could also make sure that an entry being added to the list isn't on 
the list already, and whenever an entry is deleted from the list 
either it really is on the list or else its list pointers point to 
themselves.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-20 Thread Heiko Przybyl
On Tuesday 20 January 2015 10:49:29 Alan Stern wrote:
> On Mon, 19 Jan 2015, Heiko Przybyl wrote:
> > On Monday 19 January 2015 11:17:59 Alan Stern wrote:
> > > 
> > > That's easy enough to test.  All you have to do is change the
> > > spin_lock/unlock statements to their irq_save/restore variants.
> > 
> > Well, thought about that as well, but I'm not sure when to take it as
> > fixed and when to take it as issue-just-didn't-happen-yet, because of the
> > not-so- deterministic occurrence of the error. But I can try it out
> > anyway, just wanted to have some feedback before trying.
> 
> By the way, failing to disable interrupts when acquiring a spinlock
> generally does not lead to data corruption -- it leads to deadlocks.
> So I doubt this is the cause of your problem.  If you really want to,
> you could add a
> 
>   WARN_ON(!irqs_disabled());
> 
> line to ohci_irq().

You're right. After thinking/reading a bit more about the topic, not disabled 
IRQs would cause deadlocks, because there's already one thread in the critical 
section.

> 
> No idea.
> 
> It might be a good idea for you to try something a little more
> invasive.  How about writing a routine to check the entire
> ohci->eds_in_use list for validity (each forward pointer is matched by
> the corresponding backward pointer), and calling this routine at each
> place where the list gets modified, before the modification happens?
> 
> You could also make sure that an entry being added to the list isn't on
> the list already, and whenever an entry is deleted from the list
> either it really is on the list or else its list pointers point to
> themselves.
> 

I'm not 100% sure, but then it's probably a race between urb 
enqueuing (duplicates?) and the watchdog orphan cleanup.

The crash log already shows the double list add in ohci_urb_enqueue
"
ohci-hcd.c:238: list_add(&ed->in_use_list, &ohci->eds_in_use);
"
This is probably due to the ed returned by ed_get() being reused before the
watchdog ran, thus the same in_use_list re-added to ohci.eds_in_use.

Entries seem to get removed in finish_unlinks()
"
ohci-q.c:1090: list_del(&ed->in_use_list);
"
with list_del() poisoning the next/prev pointers of the removed entry.

Now with the watchdog starting cleanup it iterates over the ohci.eds_in_use 
list
that still has the second very same entry of in_use_list we double-added (but 
now with 0xdead... pointers) and we fault on 
"
ohci-hcd.c:761: if (ed->pending_td) {
"

I hope that makes any sense? I'll hook up the list checking tomorrow. Though I 
haven't hit the (double-add) problem again, since the bug report. Seems pretty 
specific the whole thing.


> Alan Stern

Heiko
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ohci: sporadic crash/lockup in ohci-hcd io_watchdog_func()

2015-01-23 Thread Alan Stern
On Tue, 20 Jan 2015, Heiko Przybyl wrote:

> I'm not 100% sure, but then it's probably a race between urb 
> enqueuing (duplicates?) and the watchdog orphan cleanup.
> 
> The crash log already shows the double list add in ohci_urb_enqueue
> "
> ohci-hcd.c:238: list_add(&ed->in_use_list, &ohci->eds_in_use);
> "
> This is probably due to the ed returned by ed_get() being reused before the
> watchdog ran, thus the same in_use_list re-added to ohci.eds_in_use.

But this won't happen unless ed->state was ED_IDLE; and then
ed_schedule() changes ed->state to ED_OPER before ed is added to the
in_use_list.

> Entries seem to get removed in finish_unlinks()
> "
> ohci-q.c:1090: list_del(&ed->in_use_list);
> "
> with list_del() poisoning the next/prev pointers of the removed entry.

And finish_unlinks() is where ed->state gets set back to ED_IDLE.  
Furthermore, all these operations take place under the protection of 
ohci->lock, so they can't race with each other.

They have to happen in strict sequence: list_add, list_del, list_add.  
The second list_add can't occur until after the list_del, because 
ed->state won't be equal to ED_IDLE until then.

At least, that's how it's _supposed_ to work...

> Now with the watchdog starting cleanup it iterates over the ohci.eds_in_use 
> list
> that still has the second very same entry of in_use_list we double-added (but 
> now with 0xdead... pointers) and we fault on 
> "
> ohci-hcd.c:761: if (ed->pending_td) {
> "
> 
> I hope that makes any sense? I'll hook up the list checking tomorrow. Though 
> I 
> haven't hit the (double-add) problem again, since the bug report. Seems 
> pretty 
> specific the whole thing.

It's a puzzling problem.  I hope you can find the cause.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html