Re: USB OOPS 2.6.25-rc2-git1

2008-02-25 Thread David Brownell
On Thursday 21 February 2008, Alan Stern wrote:
> > = CUT HERE
> > Modify EHCI irq handling on the theory that at least some of the
> > "lost" IRQs are caused by goofage between multiple lowlevel IRQ
> > acking mechanisms:  try rescanning before we exit the handler, in
> > case the EHCI-internal ack (by clearing the irq status) doesn't
> > always suffice for IRQs triggered nearly back-to-back.
> > 
> > ---
> >  drivers/usb/host/ehci-hcd.c |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > --- g26.orig/drivers/usb/host/ehci-hcd.c	2008-02-20 13:26:00.0 -0800
> > +++ g26/drivers/usb/host/ehci-hcd.c	2008-02-20 13:54:37.0 -0800
> > @@ -638,6 +638,8 @@ static irqreturn_t ehci_irq (struct usb_
> >  		return IRQ_NONE;
> >  	}
> >  
> > +retrigger:
> > +
> >  	/* clear (just) interrupts */
> >  	ehci_writel(ehci, status, &ehci->regs->status);
> >  	cmd = ehci_readl(ehci, &ehci->regs->command);
> > @@ -725,6 +727,12 @@ dead:
> >  
> >  	if (bh)
> >  		ehci_work (ehci);
> > +
> > +	status = ehci_readl(ehci, &ehci->regs->status);
> > +	status &= INTR_MASK;
> > +	if (status)
> > +		goto retrigger;
> > +
> >  	spin_unlock (&ehci->lock);
> >  	if (pcd_status & STS_PCD)
> >  		usb_hcd_poll_rh_status(hcd);
> 
> There's one little problem here.  As a result of this change, the line 
> where pcd_status gets set (not shown in this patch) needs to be changed 
> to:
> 
> pcd_status |= (status & STS_PCD);

Actually, no change is needed.  It's initialized to zero, then
set to "status" given "if (status & STS_PCD)", and never cleared.
So if it's ever set, it stays set...

> 
> Then the test shown above can be simplified to:
> 
> if (pcd_status)

True with the current code too ...
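
The point about pcd_status can be checked with a small user-space sketch. It is not the driver code: the function names and the per-pass status values are made up for illustration, and only STS_PCD and STS_INT follow the EHCI status register layout. Each array element stands for the status word read on one pass through the retrigger loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STS_PCD (1u << 2)   /* port change detect */
#define STS_INT (1u << 0)   /* normal transaction completion */

/* Form described for the current code: plain assignment, guarded by the
 * PCD test.  pcd_status starts at zero and is never cleared afterwards. */
uint32_t pcd_after_passes_assign(const uint32_t *status, size_t passes)
{
    uint32_t pcd_status = 0;
    size_t i;

    for (i = 0; i < passes; i++) {
        if (status[i] & STS_PCD)
            pcd_status = status[i];
    }
    return pcd_status;
}

/* Suggested alternative: accumulate only the PCD bit on every pass. */
uint32_t pcd_after_passes_or(const uint32_t *status, size_t passes)
{
    uint32_t pcd_status = 0;
    size_t i;

    for (i = 0; i < passes; i++)
        pcd_status |= (status[i] & STS_PCD);
    return pcd_status;
}

/* Hypothetical input: PCD is reported on the first pass only. */
void check_pcd_is_sticky(void)
{
    const uint32_t status[] = { STS_PCD | STS_INT, STS_INT };

    assert(pcd_after_passes_assign(status, 2) & STS_PCD);
    assert(pcd_after_passes_or(status, 2) == STS_PCD);
}
```

Both forms leave the PCD bit set once any pass has seen it, so the final "pcd_status & STS_PCD" test gives the same answer either way. The "|=" form only makes the stored value cleaner, which is what allows the shorter "if (pcd_status)" test.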

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: USB OOPS 2.6.25-rc2-git1

2008-02-21 Thread Alan Stern
On Wed, 20 Feb 2008, David Brownell wrote:

> > CPU 0                           CPU 1
> > -----                           -----
> > Watchdog timer expires
> > Timer routine acquires spinlock
> >                                 IAA IRQ arrives
> >                                 ehci_irq tries to acquire
> >                                         spinlock...

The following comment refers to the "Timer routine either sets" below, 
right?

> There's another condition here, and
> another action.  The condition is
> that ehci->reclaim must first be set;
> the action is to clear STS_IAA (and,
> given the previous patch, maybe IAAD).
> 
> And this "either" is more concisely
> written as "call end_unlink_async()"
> (point made just for clarity).

Correct on both counts.  I had forgotten that the watchdog routine 
clears STS_IAA.

> > Timer routine either sets
> >         ehci->reclaim to NULL
> >         or else starts a new
> >         IAA cycle
> > Timer routine releases spinlock
> >         and returns
> >                                 ehci_irq acquires spinlock
> >                                         and sees IAA is set
> 
>   Can only happen if a new IAA
>   cycle was started by CPU0, and
>   the IAA condition triggered
>   that quickly.
> 
> >                                 Call end_unlink_async()!

Okay, so this isn't as bad as it seemed.  I don't have a copy of your 
most recent patch, but it seems clear that the watchdog routine must:

First remove the circumstances that would cause the controller 
to set IAA.  I guess that means clearing IAAD; it's not
entirely clear from the spec whether this will do what we 
want.

Then clear IAA (if it happens to be set).

This is the only way to avoid the race, and I know that my original
version of the routine does these steps in the wrong order (if at all).  
That should be fixed.  Given sufficiently bizarre hardware we can't be
certain that things won't still go wrong on occasion, but this is the
best we can do for now -- weird hardware can be handled as it arises.

The other change to make (which you have already anticipated) is to 
guard against ehci->reclaim == NULL in end_unlink_async().  There's no 
real need for a warning or stack dump; it should just return silently 
when this happens.  If there is a warning, maybe it should be placed at 
the site of the caller (for example, in ehci_irq() when STS_IAA is 
detected).
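
The ordering and the NULL guard described above can be sketched against a fake register model in user space. Everything here is an assumption made for illustration: the struct layout, the "_sketch" function names, and the plain bit-clear used in place of the real write-1-to-clear status register. Only the STS_IAA and CMD_IAAD bit positions follow the EHCI register layout. Whether clearing IAAD really stops the controller from raising IAA is the open spec question already mentioned, and the sketch does not settle it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STS_IAA  (1u << 5)   /* interrupt on async advance (status) */
#define CMD_IAAD (1u << 6)   /* async advance doorbell (command) */

struct fake_qh { int unlinked; };

struct fake_ehci {
    uint32_t command;
    uint32_t status;
    struct fake_qh *reclaim;   /* QH waiting for IAA, or NULL */
    int completions;
};

/* Guarded completion: returns silently when there is nothing to reclaim.
 * Returns 1 if a QH was completed, 0 otherwise. */
int end_unlink_async_sketch(struct fake_ehci *ehci)
{
    struct fake_qh *qh = ehci->reclaim;

    if (!qh)
        return 0;
    ehci->reclaim = NULL;
    qh->unlinked = 1;
    ehci->completions++;
    return 1;
}

/* Watchdog: first withdraw the doorbell, then ack IAA, then complete. */
void iaa_watchdog_sketch(struct fake_ehci *ehci)
{
    if (!ehci->reclaim)
        return;
    ehci->command &= ~CMD_IAAD;   /* step 1: remove the cause of IAA */
    ehci->status &= ~STS_IAA;     /* step 2: clear IAA if it is set */
    end_unlink_async_sketch(ehci);
}

/* IRQ path: only acts when IAA is still latched. */
void irq_sketch(struct fake_ehci *ehci)
{
    if (ehci->status & STS_IAA) {
        ehci->status &= ~STS_IAA;
        end_unlink_async_sketch(ehci);
    }
}

void check_watchdog_then_irq(void)
{
    struct fake_qh qh = { 0 };
    struct fake_ehci ehci = { CMD_IAAD, STS_IAA, &qh, 0 };

    iaa_watchdog_sketch(&ehci);
    irq_sketch(&ehci);            /* late IRQ finds nothing left to do */
    assert(qh.unlinked == 1);
    assert(ehci.completions == 1);
}
```

In this model a late ehci_irq() after the watchdog has run sees neither a stale IAA bit nor a pending reclaim, and a stray call to the completion routine is harmless.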

Alan Stern



Re: USB OOPS 2.6.25-rc2-git1

2008-02-21 Thread Alan Stern
On Wed, 20 Feb 2008, David Brownell wrote:

> = CUT HERE
> Modify EHCI irq handling on the theory that at least some of the
> "lost" IRQs are caused by goofage between multiple lowlevel IRQ
> acking mechanisms:  try rescanning before we exit the handler, in
> case the EHCI-internal ack (by clearing the irq status) doesn't
> always suffice for IRQs triggered nearly back-to-back.
> 
> ---
>  drivers/usb/host/ehci-hcd.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- g26.orig/drivers/usb/host/ehci-hcd.c	2008-02-20 13:26:00.0 -0800
> +++ g26/drivers/usb/host/ehci-hcd.c	2008-02-20 13:54:37.0 -0800
> @@ -638,6 +638,8 @@ static irqreturn_t ehci_irq (struct usb_
>  		return IRQ_NONE;
>  	}
>  
> +retrigger:
> +
>  	/* clear (just) interrupts */
>  	ehci_writel(ehci, status, &ehci->regs->status);
>  	cmd = ehci_readl(ehci, &ehci->regs->command);
> @@ -725,6 +727,12 @@ dead:
>  
>  	if (bh)
>  		ehci_work (ehci);
> +
> +	status = ehci_readl(ehci, &ehci->regs->status);
> +	status &= INTR_MASK;
> +	if (status)
> +		goto retrigger;
> +
>  	spin_unlock (&ehci->lock);
>  	if (pcd_status & STS_PCD)
>  		usb_hcd_poll_rh_status(hcd);

There's one little problem here.  As a result of this change, the line 
where pcd_status gets set (not shown in this patch) needs to be changed 
to:

pcd_status |= (status & STS_PCD);

Then the test shown above can be simplified to:

if (pcd_status)

Alan Stern



Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread David Brownell
On Wednesday 20 February 2008, Andre Tomt wrote:
> David Brownell wrote:
> > On Wednesday 20 February 2008, Andre Tomt wrote:
> >> It has not crashed yet with the patch though.
> > 
> > It seems that one of the tweaks in this patch made the watchdog
> > act better than before.  So unless I hear from you (before the
> > start of next week) that some other message appears, or that your
> > oops re-appears, I'll submit some version of this patch for RC3.
> 
> OOPS'ed again after some hours. The OOPS looks identical to me besides 
> all kinds of other crap mixed in the trace due to a lot of unrelated 
> activity going on.
> 
> Quite a lot of the same IAA messages (status 8029 and 8028, cmd 10021) 
> in /var/log/debug prior to the crash, over the entire uptime time span.
> 
> This was with the first patch posted only. Not any of the other ones.

Hmm ... I'd have expected some other IAA/IAAD message too.


> > And if you're up for it, I may have another patch for you
> > to try on top of this one ... I had an idea about IRQ trigger
> > modes that might be causing this problem.
> 
> It'll have to be tomorrow. Should I throw in the anti-oops patch too?

Sure.  I expect you'll see the stacktrace then, instead of oopsing.

You might turn that one IAA message into an ehci_vdbg() call
instead of an ehci_dbg() call, since the data it gives isn't
useful.  That would reduce the amount of log noise you see.

- Dave
 



Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread Andre Tomt

David Brownell wrote:
> On Wednesday 20 February 2008, Andre Tomt wrote:
>> It has not crashed yet with the patch though.
>
> It seems that one of the tweaks in this patch made the watchdog
> act better than before.  So unless I hear from you (before the
> start of next week) that some other message appears, or that your
> oops re-appears, I'll submit some version of this patch for RC3.

OOPS'ed again after some hours. The OOPS looks identical to me besides 
all kinds of other crap mixed in the trace due to a lot of unrelated 
activity going on.

Quite a lot of the same IAA messages (status 8029 and 8028, cmd 10021) 
in /var/log/debug prior to the crash, over the entire uptime time span.

This was with the first patch posted only. Not any of the other ones.

> And if you're up for it, I may have another patch for you
> to try on top of this one ... I had an idea about IRQ trigger
> modes that might be causing this problem.

It'll have to be tomorrow. Should I throw in the anti-oops patch too?


Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread David Brownell
On Wednesday 20 February 2008, Alan Stern wrote:
> On Wed, 20 Feb 2008, David Brownell wrote:
> 
> > On Wednesday 20 February 2008, Alan Stern wrote:
> > > > ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> > > 
> > > lines in the log brings up some ideas that have been percolating in my 
> > > mind for a while.  They have to do with the possibility of a race 
> > > between the watchdog routine and assertion of IAA.
> > 
> > The curious bit IMO being STS_INT (0001), which should also have
> > triggered an IRQ.  Suggesting to me that the race might be lower
> > level than that ... at the level of a conflict between the various
> > mechanisms to ack irqs.
> 
> Maybe it did trigger an IRQ.  Inside the watchdog routine interrupts 
> are disabled.
> 
> > > In fact, if the timing comes out just wrong then it's possible (on SMP
> > > systems) for an IAA interrupt to arrive when the watchdog
> > > routine has already started running.  Then end_unlink_async() might get 
> > > called right at the start of a new IAA cycle, or when the reclaim list 
> > > is empty.
> > 
> > The driver's spinlock should prevent that particular problem from
> > appearing.
> 
> I don't think so:
> 
>   CPU 0                           CPU 1
>   -----                           -----
>   Watchdog timer expires
>   Timer routine acquires spinlock
>                                   IAA IRQ arrives
>                                   ehci_irq tries to acquire
>                                           spinlock...

There's another condition here, and
another action.  The condition is
that ehci->reclaim must first be set;
the action is to clear STS_IAA (and,
given the previous patch, maybe IAAD).

And this "either" is more concisely
written as "call end_unlink_async()"
(point made just for clarity).

>   Timer routine either sets
>           ehci->reclaim to NULL
>           or else starts a new
>           IAA cycle
>   Timer routine releases spinlock
>           and returns
>                                   ehci_irq acquires spinlock
>                                           and sees IAA is set

Can only happen if a new IAA
cycle was started by CPU0, and
the IAA condition triggered
that quickly.

>                                   Call end_unlink_async()!
> 


Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread Alan Stern
On Wed, 20 Feb 2008, David Brownell wrote:

> On Wednesday 20 February 2008, Alan Stern wrote:
> > > ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> > 
> > lines in the log brings up some ideas that have been percolating in my 
> > mind for a while.  They have to do with the possibility of a race 
> > between the watchdog routine and assertion of IAA.
> 
> The curious bit IMO being STS_INT (0001), which should also have
> triggered an IRQ.  Suggesting to me that the race might be lower
> level than that ... at the level of a conflict between the various
> mechanisms to ack irqs.

Maybe it did trigger an IRQ.  Inside the watchdog routine interrupts 
are disabled.

> > In fact, if the timing comes out just wrong then it's possible (on SMP
> > systems) for an IAA interrupt to arrive when the watchdog
> > routine has already started running.  Then end_unlink_async() might get 
> > called right at the start of a new IAA cycle, or when the reclaim list 
> > is empty.
> 
> The driver's spinlock should prevent that particular problem from
> appearing.

I don't think so:

        CPU 0                           CPU 1
        -----                           -----
        Watchdog timer expires
        Timer routine acquires spinlock
                                        IAA IRQ arrives
                                        ehci_irq tries to acquire
                                                spinlock...
        Timer routine either sets
                ehci->reclaim to NULL
                or else starts a new
                IAA cycle
        Timer routine releases spinlock
                and returns
                                        ehci_irq acquires spinlock
                                                and sees IAA is set
                                        Call end_unlink_async()!

> = CUT HERE
> Modify EHCI irq handling on the theory that at least some of the
> "lost" IRQs are caused by goofage between multiple lowlevel IRQ
> acking mechanisms:  try rescanning before we exit the handler, in
> case the EHCI-internal ack (by clearing the irq status) doesn't
> always suffice for IRQs triggered nearly back-to-back.

This might help, but it won't fix the race outlined above.
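
The CPU 0 / CPU 1 interleaving above can be replayed step by step in a single-threaded user-space sketch. This is an illustration under stated assumptions, not the driver: the struct, the function names and the watchdog_clears_iaa parameter are invented here, and only the "sets ehci->reclaim to NULL" branch of the timer routine is modeled (the "starts a new IAA cycle" branch is left out). The spinlock is represented only by the order of the steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STS_IAA (1u << 5)   /* interrupt on async advance */

struct race_sim {
    uint32_t status;
    void *reclaim;            /* non-NULL while an unlink is pending */
    int null_reclaim_calls;   /* completions attempted with nothing pending */
};

/* Stand-in for end_unlink_async(): records a call made while reclaim is
 * NULL instead of dereferencing it. */
void sim_end_unlink_async(struct race_sim *s)
{
    if (!s->reclaim) {
        s->null_reclaim_calls++;
        return;
    }
    s->reclaim = NULL;
}

/* Replays the interleaving and returns how many times the completion
 * routine ran with reclaim == NULL. */
int replay_race(int watchdog_clears_iaa)
{
    int pending_qh = 0;
    struct race_sim s = { 0, &pending_qh, 0 };

    /* CPU 0: watchdog timer expires, timer routine takes the lock. */
    /* CPU 1: IAA is latched, the IRQ arrives, ehci_irq spins on the lock. */
    s.status |= STS_IAA;

    /* CPU 0: timer routine finishes the unlink; reclaim becomes NULL. */
    if (watchdog_clears_iaa)
        s.status &= ~STS_IAA;
    sim_end_unlink_async(&s);

    /* CPU 0 drops the lock; CPU 1: ehci_irq takes it and tests IAA. */
    if (s.status & STS_IAA) {
        s.status &= ~STS_IAA;
        sim_end_unlink_async(&s);
    }
    return s.null_reclaim_calls;
}

void check_replay(void)
{
    assert(replay_race(0) == 1);   /* stale IAA reaches the IRQ path */
    assert(replay_race(1) == 0);   /* cleared IAA: IRQ path does nothing */
}
```

In this model the spinlock alone does not prevent the second call; what matters is whether the IAA bit is still latched when ehci_irq() finally gets the lock.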

Alan Stern



Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread David Brownell
On Wednesday 20 February 2008, Alan Stern wrote:
> > ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> 
> lines in the log brings up some ideas that have been percolating in my 
> mind for a while.  They have to do with the possibility of a race 
> between the watchdog routine and assertion of IAA.

The curious bit IMO being STS_INT (0001), which should also have
triggered an IRQ.  Suggesting to me that the race might be lower
level than that ... at the level of a conflict between the various
mechanisms to ack irqs.

See the appended patch (Andre, this is the additional one I meant)
for a tweak at that level.


> In fact, if the timing comes out just wrong then it's possible (on SMP
> systems) for an IAA interrupt to arrive when the watchdog
> routine has already started running.  Then end_unlink_async() might get 
> called right at the start of a new IAA cycle, or when the reclaim list 
> is empty.

The driver's spinlock should prevent that particular problem from
appearing.

- Dave


= CUT HERE
Modify EHCI irq handling on the theory that at least some of the
"lost" IRQs are caused by goofage between multiple lowlevel IRQ
acking mechanisms:  try rescanning before we exit the handler, in
case the EHCI-internal ack (by clearing the irq status) doesn't
always suffice for IRQs triggered nearly back-to-back.

---
 drivers/usb/host/ehci-hcd.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- g26.orig/drivers/usb/host/ehci-hcd.c	2008-02-20 13:26:00.0 -0800
+++ g26/drivers/usb/host/ehci-hcd.c	2008-02-20 13:54:37.0 -0800
@@ -638,6 +638,8 @@ static irqreturn_t ehci_irq (struct usb_
 		return IRQ_NONE;
 	}
 
+retrigger:
+
 	/* clear (just) interrupts */
 	ehci_writel(ehci, status, &ehci->regs->status);
 	cmd = ehci_readl(ehci, &ehci->regs->command);
@@ -725,6 +727,12 @@ dead:
 
 	if (bh)
 		ehci_work (ehci);
+
+	status = ehci_readl(ehci, &ehci->regs->status);
+	status &= INTR_MASK;
+	if (status)
+		goto retrigger;
+
 	spin_unlock (&ehci->lock);
 	if (pcd_status & STS_PCD)
 		usb_hcd_poll_rh_status(hcd);


Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread David Brownell
On Wednesday 20 February 2008, Andre Tomt wrote:
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8028 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8028 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8028 cmd 10021
> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021

... etc.

OK, the relevant bits are:

   status  0001 == some transaction completed normally (ignored here)
   status  0020 == IAA set, which should have triggered an IRQ
   command 0040 == IAAD clear, meaning IAA should have triggered

Meaning the hardware is misbehaving in a "traditional" way, one
that the watchdog is supposed to catch:  IAA set, but no IRQ.

If you see any "IAA" messages *other* than those, please report
them ASAP.  They'll indicate "nontraditional" misbehavior.
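
As a cross-check of the bit reading above, here is a small user-space decoder for the logged values, which are treated as hexadecimal as in the breakdown. The bit names follow the EHCI status and command register layout; the helper names are made up for this sketch, and it only looks at the bits that occur in the quoted log lines.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Status register bits seen in the log */
#define STS_ASS  (1u << 15)  /* async schedule status */
#define STS_IAA  (1u << 5)   /* interrupt on async advance */
#define STS_FLR  (1u << 3)   /* frame list rollover */
#define STS_INT  (1u << 0)   /* normal transaction completion */

/* Command register bits seen in the log */
#define CMD_IAAD (1u << 6)   /* async advance doorbell */
#define CMD_ASE  (1u << 5)   /* async schedule enable */
#define CMD_RUN  (1u << 0)   /* controller running */

/* "Traditional" lost IAA: IAA latched in status, doorbell already clear. */
int is_traditional_lost_iaa(uint32_t status, uint32_t cmd)
{
    return (status & STS_IAA) != 0 && (cmd & CMD_IAAD) == 0;
}

/* Print the names of the bits listed above that are set. */
void print_bits(uint32_t status, uint32_t cmd)
{
    printf("status %04x:%s%s%s%s\n", (unsigned)status,
           (status & STS_ASS) ? " ASS" : "",
           (status & STS_IAA) ? " IAA" : "",
           (status & STS_FLR) ? " FLR" : "",
           (status & STS_INT) ? " INT" : "");
    printf("cmd %05x:%s%s%s\n", (unsigned)cmd,
           (cmd & CMD_IAAD) ? " IAAD" : "",
           (cmd & CMD_ASE) ? " ASE" : "",
           (cmd & CMD_RUN) ? " RUN" : "");
}

/* The two register snapshots reported in the log. */
void check_logged_values(void)
{
    assert(is_traditional_lost_iaa(0x8029, 0x10021));
    assert(is_traditional_lost_iaa(0x8028, 0x10021));
    assert((0x8028 & STS_INT) == 0);   /* 8028 differs only in STS_INT */
}
```

Both logged combinations fall into the "traditional" category; a message with IAAD still set in the command value, or with IAA clear in the status value, would be one of the other kind.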


> It has not crashed yet with the patch though.

It seems that one of the tweaks in this patch made the watchdog
act better than before.  So unless I hear from you (before the
start of next week) that some other message appears, or that your
oops re-appears, I'll submit some version of this patch for RC3.

And if you're up for it, I may have another patch for you
to try on top of this one ... I had an idea about IRQ trigger
modes that might be causing this problem.

- Dave


Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread Alan Stern
On Wed, 20 Feb 2008, Andre Tomt wrote:

> David Brownell wrote:
> > On Tuesday 19 February 2008, Andre Tomt wrote:
> >>> Can you try this diagnostic patch, to see if it reports any messages
> >>> about IAA and/or IAAD oddities?  There's surely a quick workaround
> >>> for this, but I'd rather understand the root cause before patching.
> >> Doesn't seem to have triggered anything. dmesg attached in case I missed 
> >> anything.
> > 
> > You don't seem to have enabled CONFIG_USB_DEBUG, as the patch instructions
> > say is needed to get such diagnostics ... I can tell because the startup
> > messages from USB are pretty minimal.  (See appended, vs what you sent...)
> > 
> > Please try again with USB debugging enabled.
> 
> Argh, silly me. Here you go (attached). It has not crashed yet with the 
> patch though.

You know, Dave, seeing all those

> ehci_hcd :00:1d.7: IAA watchdog, lost IAA: status 8029 cmd 10021

lines in the log brings up some ideas that have been percolating in my 
mind for a while.  They have to do with the possibility of a race 
between the watchdog routine and assertion of IAA.

In fact, if the timing comes out just wrong then it's possible (on SMP
systems) for an IAA interrupt to arrive when the watchdog
routine has already started running.  Then end_unlink_async() might get 
called right at the start of a new IAA cycle, or when the reclaim list 
is empty.

Alan Stern



Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread Andre Tomt

David Brownell wrote:
> On Tuesday 19 February 2008, Andre Tomt wrote:
>>> Can you try this diagnostic patch, to see if it reports any messages
>>> about IAA and/or IAAD oddities?  There's surely a quick workaround
>>> for this, but I'd rather understand the root cause before patching.
>> Doesn't seem to have triggered anything. dmesg attached in case I missed 
>> anything.
>
> You don't seem to have enabled CONFIG_USB_DEBUG, as the patch instructions
> say is needed to get such diagnostics ... I can tell because the startup
> messages from USB are pretty minimal.  (See appended, vs what you sent...)
>
> Please try again with USB debugging enabled.

Argh, silly me. Here you go (attached). It has not crashed yet with the 
patch though.

Initializing cgroup subsys cpuset
Linux version 2.6.25-rc2-git1 ([EMAIL PROTECTED]) (gcc version 4.2.3 (Debian 
4.2.3-1)) #5 SMP Wed Feb 20 09:40:27 CET 2008
Command line: root=/dev/sda1 ro rootflags=noatime rootfstype=ext2 
console=ttyS0,38400 verbose 
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009bc00 (usable)
 BIOS-e820: 0009bc00 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fee (usable)
 BIOS-e820: 7fee - 7fee3000 (ACPI NVS)
 BIOS-e820: 7fee3000 - 7fef (ACPI data)
 BIOS-e820: 7fef - 7ff0 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 155) 0 entries of 3200 used
Entering add_active_range(0, 256, 524000) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.4 present.
ACPI: RSDP 000F7850, 0014 (r0 IntelR)
ACPI: RSDT 7FEE3040, 0038 (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: FACP 7FEE30C0, 0074 (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: DSDT 7FEE3180, 463D (r1 INTELR AWRDACPI 1000 MSFT  300)
ACPI: FACS 7FEE, 0040
ACPI: HPET 7FEE7900, 0038 (r1 IntelR AWRDACPI 42302E31 AWRD   98)
ACPI: MCFG 7FEE7980, 003C (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: APIC 7FEE7800, 0084 (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: SSDT 7FEE8020, 02F1 (r1  PmRefCpuPm 3000 INTL 20040311)
No NUMA configuration found
Faking a node at -7fee
Entering add_active_range(0, 0, 155) 0 entries of 3200 used
Entering add_active_range(0, 256, 524000) 1 entries of 3200 used
Bootmem setup node 0 -7fee
  NODE_DATA [c000 - 00012fff]
  bootmap [00013000 -  00022fdf] pages 10
early res: 0 [0-fff] BIOS data page
early res: 1 [6000-7fff] SMP_TRAMPOLINE
early res: 2 [20-565977] TEXT DATA BSS
early res: 3 [37a43000-37fef4eb] RAMDISK
early res: 4 [9bc00-abbff] EBDA
early res: 5 [8000-bfff] PGTABLE
 [e200-e21f] PMD ->81000120 on node 0
 [e220-e23f] PMD ->81000160 on node 0
 [e240-e25f] PMD ->810001a0 on node 0
 [e260-e27f] PMD ->810001e0 on node 0
 [e280-e29f] PMD ->81000220 on node 0
 [e2a0-e2bf] PMD ->81000260 on node 0
 [e2c0-e2df] PMD ->810002a0 on node 0
 [e2e0-e2ff] PMD ->810002e0 on node 0
 [e2000100-e200011f] PMD ->81000320 on node 0
 [e2000120-e200013f] PMD ->81000360 on node 0
 [e2000140-e200015f] PMD ->810003a0 on node 0
 [e2000160-e200017f] PMD ->810003e0 on node 0
 [e2000180-e200019f] PMD ->81000420 on node 0
 [e20001a0-e20001bf] PMD ->81000460 on node 0
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0:0 ->  155
0:  256 ->   524000
On node 0 totalpages: 523899
  DMA zone: 56 pages used for memmap
  DMA zone: 894 pages reserved
  DMA zone: 3045 pages, LIFO batch:0
  DMA32 zone: 7108 pages used for memmap
  DMA32 zone: 512796 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x408
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: IOAPIC (id[0x04] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 4, address 0xfec0, GSI 0-23
ACPI: INT_S

Re: USB OOPS 2.6.25-rc2-git1

2008-02-20 Thread Alan Stern
On Tue, 19 Feb 2008, David Brownell wrote:

> Please try that diagnostic patch I sent ... with CONFIG_USB_DEBUG.
> 
> Near as I can tell this is caused by some hardware oddity that needs
> to be worked around.  We seem to be at a stage where we've fixed some
> problems, nudging code paths around so another one shows up, and have
> incidentally had a new silicon-specific hardware erratum reported
> in this area.  So more info is needed...
> 
> A quick anti-oops patch is appended, it should work OK on top of that
> diagnostic patch, but won't necessarily resolve the underlying problem.
> 
> - Dave
> 
> 
> --- g26.orig/drivers/usb/host/ehci-q.c	2008-02-19 16:15:04.0 -0800
> +++ g26/drivers/usb/host/ehci-q.c	2008-02-19 16:15:59.0 -0800
> @@ -993,6 +993,11 @@ static void end_unlink_async (struct ehc
>  
>  	iaa_watchdog_done(ehci);
>  
> +	if (!qh) {
> +		WARN_ON(1);
> +		return;
> +	}
> +

It will be interesting to see the stack dump.  As far as I can tell,
there are two pathways which could lead to qh being NULL.  One is the IAA
hardware peculiarity (setting the status bit very late, after the
watchdog timer has already expired), and the other is in unlink_async()  
if the controller isn't running.  That second one may be just a simple
bug -- I doubt it would show up unless the controller got a fatal 
error and stopped.

Alan Stern



Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread David Miller
From: David Brownell <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 16:19:54 -0800

> On Tuesday 19 February 2008, David Miller wrote:
> > From: Andre Tomt <[EMAIL PROTECTED]>
> > Date: Tue, 19 Feb 2008 16:19:08 +0100
> > 
> > > Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
> > > not doing anything interesting at the time, but has its / and kernel on 
> > > a usb-storage device (usb pen drive).
> > > 
> > > Intel ICH8R chipset (and USB controller), running x86_64 kernel. I'll 
> > > post .config and some additional info when I get home later if it isn't 
> > > obvious what broke.
> > 
> > FWIW, I've seen a near identical crash on my Niagara system.
> 
> Please try that diagnostic patch I sent ... with CONFIG_USB_DEBUG.

I have that patch applied with USB_DEBUG enabled, I'll let you know
if it triggers.

It doesn't happen often, say once in every 20 or so boots.


Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread David Brownell
On Tuesday 19 February 2008, Andre Tomt wrote:
> 
> > Can you try this diagnostic patch, to see if it reports any messages
> > about IAA and/or IAAD oddities?  There's surely a quick workaround
> > for this, but I'd rather understand the root cause before patching.
> 
> Doesn't seem to have triggered anything. dmesg attached in case I missed 
> anything.

You don't seem to have enabled CONFIG_USB_DEBUG, as the patch instructions
say is needed to get such diagnostics ... I can tell because the startup
messages from USB are pretty minimal.  (See appended, vs what you sent...)

Please try again with USB debugging enabled.

- Dave

ehci_hcd :00:02.2: new USB bus registered, assigned bus number 3
ehci_hcd :00:02.2: reset hcs_params 0x102486 dbg=1 cc=2 pcc=4 !ppc ports=6
ehci_hcd :00:02.2: reset portroute 0 0 1 1 1 0 
ehci_hcd :00:02.2: reset hcc_params a086 caching frame 256/512/1024 park
ehci_hcd :00:02.2: park 0
ehci_hcd :00:02.2: reset command 080b02 park=3 ithresh=8 period=1024 Reset HALT
PCI: cache line size of 64 is not supported by device :00:02.2
ehci_hcd :00:02.2: supports USB remote wakeup
ehci_hcd :00:02.2: irq 22, io mem 0xe8004000
ehci_hcd :00:02.2: reset command 080b02 park=3 ithresh=8 period=1024 Reset HALT
ehci_hcd :00:02.2: init command 010009 (park)=0 ithresh=1 period=256 RUN
ehci_hcd :00:02.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
...



Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread David Brownell
On Tuesday 19 February 2008, David Miller wrote:
> From: Andre Tomt <[EMAIL PROTECTED]>
> Date: Tue, 19 Feb 2008 16:19:08 +0100
> 
> > Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
> > not doing anything interesting at the time, but has its / and kernel on 
> > a usb-storage device (usb pen drive).
> > 
> > Intel ICH8R chipset (and USB controller), running x86_64 kernel. I'll 
> > post .config and some additional info when I get home later if it isn't 
> > obvious what broke.
> 
> FWIW, I've seen a near identical crash on my Niagara system.

Please try that diagnostic patch I sent ... with CONFIG_USB_DEBUG.

Near as I can tell this is caused by some hardware oddity that needs
to be worked around.  We seem to be at a stage where we've fixed some
problems, nudging code paths around so another one shows up, and have
incidentally had a new silicon-specific hardware erratum reported
in this area.  So more info is needed...

A quick anti-oops patch is appended, it should work OK on top of that
diagnostic patch, but won't necessarily resolve the underlying problem.

- Dave


--- g26.orig/drivers/usb/host/ehci-q.c	2008-02-19 16:15:04.0 -0800
+++ g26/drivers/usb/host/ehci-q.c	2008-02-19 16:15:59.0 -0800
@@ -993,6 +993,11 @@ static void end_unlink_async (struct ehc
 
 	iaa_watchdog_done(ehci);
 
+	if (!qh) {
+		WARN_ON(1);
+		return;
+	}
+
 	// qh->hw_next = cpu_to_hc32(qh->qh_dma);
 	qh->qh_state = QH_STATE_IDLE;
 	qh->qh_next.qh = NULL;


Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread Andre Tomt

David Brownell wrote:
> On Tuesday 19 February 2008, Andre Tomt wrote:
>> Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
>> not doing anything interesting at the time, but has its / and kernel on 
>> a usb-storage device (usb pen drive).
>
> Can you try this diagnostic patch, to see if it reports any messages
> about IAA and/or IAAD oddities?  There's surely a quick workaround
> for this, but I'd rather understand the root cause before patching.

Doesn't seem to have triggered anything. dmesg attached in case I missed 
anything.

Note that this does not happen at all with 2.6.24, but with 
2.6.25-rc2-git1 it usually crashes at boot. If not, it crashes after 
about an hour (pretty random, but it never lasts very long.)

Initializing cgroup subsys cpuset
Linux version 2.6.25-rc2-git1 ([EMAIL PROTECTED]) (gcc version 4.2.3 (Debian 
4.2.3-1)) #4 SMP Tue Feb 19 23:24:03 CET 2008
Command line: root=/dev/sda1 ro rootflags=noatime rootfstype=ext2 
console=ttyS0,38400 verbose
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009bc00 (usable)
 BIOS-e820: 0009bc00 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fee (usable)
 BIOS-e820: 7fee - 7fee3000 (ACPI NVS)
 BIOS-e820: 7fee3000 - 7fef (ACPI data)
 BIOS-e820: 7fef - 7ff0 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
end_pfn_map = 1048576
DMI 2.4 present.
ACPI: RSDP 000F7850, 0014 (r0 IntelR)
ACPI: RSDT 7FEE3040, 0038 (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: FACP 7FEE30C0, 0074 (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: DSDT 7FEE3180, 463D (r1 INTELR AWRDACPI 1000 MSFT  300)
ACPI: FACS 7FEE, 0040
ACPI: HPET 7FEE7900, 0038 (r1 IntelR AWRDACPI 42302E31 AWRD   98)
ACPI: MCFG 7FEE7980, 003C (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: APIC 7FEE7800, 0084 (r1 IntelR AWRDACPI 42302E31 AWRD0)
ACPI: SSDT 7FEE8020, 02F1 (r1  PmRefCpuPm 3000 INTL 20040311)
No NUMA configuration found
Faking a node at -7fee
Bootmem setup node 0 -7fee
  NODE_DATA [c000 - 00012fff]
  bootmap [00013000 -  00022fdf] pages 10
early res: 0 [0-fff] BIOS data page
early res: 1 [6000-7fff] SMP_TRAMPOLINE
early res: 2 [20-565977] TEXT DATA BSS
early res: 3 [37a4a000-37fefe19] RAMDISK
early res: 4 [9bc00-abbff] EBDA
early res: 5 [8000-bfff] PGTABLE
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
    0:        0 ->      155
    0:      256 ->   524000
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: IOAPIC (id[0x04] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 4, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Setting APIC routing to flat
ACPI: HPET id: 0x8086a201 base: 0xfed0
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 8000 (gap: 7ff0:6010)
SMP: Allowing 4 CPUs, 2 hotplug CPUs
PERCPU: Allocating 34472 bytes of per cpu data
Built 1 zonelists in Node order, mobility grouping on.  Total pages: 515841
Policy zone: DMA32
Kernel command line: root=/dev/sda1 ro rootflags=noatime rootfstype=ext2 console=ttyS0,38400 verbose
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
TSC calibrated against PM_TIMER
time.c: Detected 1862.005 MHz processor.
Console: colour VGA+ 80x25
console [ttyS0] enabled
Checking aperture...
Memory: 2057284k/2096000k available (1802k kernel code, 38312k reserved, 791k data, 328k init)
Calibrating delay using timer specific routine.. 3726.31 BogoMIPS (lpj=18631578)
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 256
Initializing cgroup subsys ns
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU 0/0 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring enabled (TM2)
using mwait in idle threads.
checking if image is initramfs... it is
Freeing initrd memory: 5783k freed
ACPI: Core revision 20070126
ACPI: Checking initramfs for custom DSDT
Using local APIC timer 

Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread David Miller
From: Andre Tomt <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 16:19:08 +0100

> Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
> not doing anything interesting at the time, but has its / and kernel on 
> a usb-storage device (usb pen drive).
> 
> Intel ICH8R chipset (and USB controller), running x86_64 kernel. I'll 
> post .config and some additional info when I get home later if it isn't 
> obvious what broke.

FWIW, I've seen a near identical crash on my Niagara system.

The only USB device attached is a CD/DVDW drive sitting behind
usb-storage.


Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread Andre Tomt

Alan Stern wrote:
> On Tue, 19 Feb 2008, Andre Tomt wrote:
> 
> > Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
> > not doing anything interesting at the time, but has its / and kernel on 
> > a usb-storage device (usb pen drive).
> > 
> > Intel ICH8R chipset (and USB controller), running x86_64 kernel. I'll 
> > post .config and some additional info when I get home later if it isn't 
> > obvious what broke.
> > 
> > > BUG: unable to handle kernel NULL pointer dereference at 0080
> > > IP: [] :ehci_hcd:end_unlink_async+0x17/0xfa
> 
> Can you provide some sort of disassembly listing of end_unlink_async, 
> to determine which C statement contained the NULL pointer dereference?


Here you go:

[EMAIL PROTECTED]:~/work/pkg-linux/linux-2.6.25$ gdb /lib/modules/2.6.25-rc2-git1/kernel/drivers/usb/host/ehci-hcd.ko
GNU gdb 6.7.1-debian
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) disassemble end_unlink_async
Dump of assembler code for function end_unlink_async:
0x0d1e:  push   %r12
0x0d20:  push   %rbp
0x0d21:  mov    %rdi,%rbp
0x0d24:  push   %rbx
0x0d25:  mov    0x28(%rdi),%rbx
0x0d29:  lea    0x110(%rdi),%rdi
0x0d30:  callq  0xd35
0x0d35:  mov    0x80(%rbx),%eax
0x0d3b:  movb   $0x3,0x88(%rbx)
0x0d42:  movq   $0x0,0x50(%rbx)
0x0d4a:  dec    %eax
0x0d4c:  test   %eax,%eax
0x0d4e:  mov    %eax,0x80(%rbx)
0x0d54:  jne    0xd5e
0x0d56:  mov    %rbx,%rdi
0x0d59:  callq  0x84d
0x0d5e:  mov    0x70(%rbx),%r12
0x0d62:  mov    %rbx,%rsi
0x0d65:  mov    %rbp,%rdi
0x0d68:  mov    %r12,0x28(%rbp)
0x0d6c:  movq   $0x0,0x70(%rbx)
0x0d74:  callq  0xf6a
0x0d79:  lea    0x58(%rbx),%rax
0x0d7d:  cmp    %rax,0x58(%rbx)
0x0d81:  je     0xd96
0x0d83:  testb  $0x1,-0x8(%rbp)
0x0d87:  je     0xd96
0x0d89:  mov    %rbx,%rsi
0x0d8c:  mov    %rbp,%rdi
0x0d8f:  callq  0x6b0
0x0d94:  jmp    0xdfa
0x0d96:  mov    0x80(%rbx),%eax
0x0d9c:  dec    %eax
0x0d9e:  test   %eax,%eax
0x0da0:  mov    %eax,0x80(%rbx)
0x0da6:  jne    0xdb0
0x0da8:  mov    %rbx,%rdi
0x0dab:  callq  0x84d
0x0db0:  testb  $0x1,-0x8(%rbp)
0x0db4:  je     0xdfa
0x0db6:  mov    0x20(%rbp),%rax
0x0dba:  cmpq   $0x0,0x50(%rax)
0x0dbf:  jne    0xdfa
0x0dc1:  lock btsl $0x2,0x1b0(%rbp)
0x0dca:  sbb    %eax,%eax
0x0dcc:  test   %eax,%eax
0x0dce:  jne    0xdfa
0x0dd0:  mov    0x0(%rip),%rax        # 0xdd7
0x0dd7:  lea    0x5(%rax),%rsi
0x0ddb:  cmp    %rsi,0x170(%rbp)
0x0de2:  js     0xdee
0x0de4:  cmpq   $0x0,0x160(%rbp)
0x0dec:  jne    0xdfa
0x0dee:  lea    0x160(%rbp),%rdi
0x0df5:  callq  0xdfa
0x0dfa:  test   %r12,%r12
0x0dfd:  je     0xe13
0x0dff:  movq   $0x0,0x28(%rbp)
0x0e07:  mov    %rbp,%rdi
0x0e0a:  mov    %r12,%rsi
0x0e0d:  pop    %rbx
0x0e0e:  pop    %rbp
0x0e0f:  pop    %r12
0x0e11:  jmp    0xe18
0x0e13:  pop    %rbx
0x0e14:  pop    %rbp
0x0e15:  pop    %r12
0x0e17:  retq
End of assembler dump.
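
For reference, the oops offset can be matched against this listing by hand. The following is only a minimal sketch in C, assuming the function starts at 0x0d1e as in the dump; `struct fake_obj` and its member names are hypothetical and exist only to show how a member offset of 0x80 turns into a fault address of 0x80 when the base pointer is NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for whatever %rbx points to.  Only the 0x80
 * offset is taken from the listing; the names are invented.
 */
struct fake_obj {
	unsigned char	pad[0x80];	/* everything before offset 0x80 */
	int		field_at_0x80;	/* read by: mov 0x80(%rbx),%eax */
};

/* Address of the faulting instruction: function start plus oops offset. */
static inline uintptr_t faulting_ip(uintptr_t func_start, uintptr_t offset)
{
	return func_start + offset;
}

/* Address touched by an operand of the form disp(%base). */
static inline uintptr_t effective_address(uintptr_t base, uintptr_t disp)
{
	return base + disp;
}
```

Calling `faulting_ip(0xd1e, 0x17)` gives 0xd35, which is the `mov 0x80(%rbx),%eax` line, and `effective_address(0, offsetof(struct fake_obj, field_at_0x80))` gives 0x80, matching the reported fault address if %rbx (loaded from 0x28(%rdi) a few instructions earlier) was NULL. Which C-level pointer that register holds is not something the listing alone shows.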




Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread Alan Stern
On Tue, 19 Feb 2008, Andre Tomt wrote:

> Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
> not doing anything interesting at the time, but has its / and kernel on 
> a usb-storage device (usb pen drive).
> 
> Intel ICH8R chipset (and USB controller), running x86_64 kernel. I'll 
> post .config and some additional info when I get home later if it isn't 
> obvious what broke.
> 
> > BUG: unable to handle kernel NULL pointer dereference at 0080
> > IP: [] :ehci_hcd:end_unlink_async+0x17/0xfa

Can you provide some sort of disassembly listing of end_unlink_async, 
to determine which C statement contained the NULL pointer dereference?

Alan Stern



Re: USB OOPS 2.6.25-rc2-git1

2008-02-19 Thread David Brownell
On Tuesday 19 February 2008, Andre Tomt wrote:
> Got this on a serial console today, using 2.6.25-rc2-git1. Machine was 
> not doing anything interesting at the time, but has its / and kernel on 
> a usb-storage device (usb pen drive).

Can you try this diagnostic patch, to see if it reports any messages
about IAA and/or IAAD oddities?  There's surely a quick workaround
for this, but I'd rather understand the root cause before patching.

- Dave


Workaround for an evident bug in one EHCI controller:  IAA didn't get
set when IAAD was cleared.  Evidently writing the status register can
prevent setting IAA; someone's VHDL (or whatever) code was wrong.
This workaround catches that specific bug (in the IRQ handler and in
the IAA watchdog) and treats it as if IAA was properly set.
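
A minimal standalone model of that check, for illustration only and not the driver code itself: `fixup_irq_status()` and its `reclaim_pending` parameter are hypothetical names (the latter standing in for "an unlink is waiting for IAA"), while the two bit positions follow the EHCI USBSTS/USBCMD register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STS_IAA		(1u << 5)	/* USBSTS: Interrupt on Async Advance */
#define CMD_IAAD	(1u << 6)	/* USBCMD: Async Advance Doorbell */

/*
 * If an unlink is waiting for IAA, and the controller has cleared the
 * doorbell (IAAD) without ever reporting IAA, pretend IAA was reported.
 * In every other case the status value is returned unchanged.
 */
static inline uint32_t fixup_irq_status(bool reclaim_pending,
					uint32_t status, uint32_t cmd)
{
	if (reclaim_pending && !(status & STS_IAA) && !(cmd & CMD_IAAD))
		status |= STS_IAA;
	return status;
}
```

With a pending unlink, `fixup_irq_status(true, 0, 0)` yields STS_IAA, while `fixup_irq_status(true, 0, CMD_IAAD)` leaves the status at 0 because the doorbell is still outstanding.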

The patch also adds *LOTS* of related paranoia, insisting IAAD is clear
(or set, as appropriate) at various points, and adding code to improve
the handling of some such cases.  It also raises the volume and precision
of debug messaging related to IAA problems.

This patch is EXPERIMENTAL and DIAGNOSTIC ... not intended for merge.
It's also in *addition* to the IAA watchdog timer rework that's already
been merged into 2.6.25-rc1, which will help some systems.

If you use this, run with CONFIG_USB_DEBUG enabled.  Most messages
with IAA or IAAD should then be "interesting" in the sense that they
will indicate something odd happening ... maybe something that's
fully worked around, maybe not.

---
 drivers/usb/host/ehci-hcd.c |   38 ++
 drivers/usb/host/ehci-q.c   |   27 ++-
 2 files changed, 56 insertions(+), 9 deletions(-)

--- g26.orig/drivers/usb/host/ehci-hcd.c	2008-02-11 19:18:39.0 -0800
+++ g26/drivers/usb/host/ehci-hcd.c	2008-02-13 15:30:56.0 -0800
@@ -255,21 +255,33 @@ static void ehci_iaa_watchdog(unsigned l
 	u32 status, cmd;
 
 	spin_lock_irqsave (&ehci->lock, flags);
-	WARN_ON(!ehci->reclaim);
 
 	status = ehci_readl(ehci, &ehci->regs->status);
 	cmd = ehci_readl(ehci, &ehci->regs->command);
-	ehci_dbg(ehci, "IAA watchdog: status %x cmd %x\n", status, cmd);
 
 	/* lost IAA irqs wedge things badly; seen first with a vt8235 */
 	if (ehci->reclaim) {
-		if (status & STS_IAA) {
-			ehci_vdbg (ehci, "lost IAA\n");
+		/* STS_IAA means we got the status, but no IRQ.  (Or at
+		 * best, the IRQ took until just now to arrive.)  Missing
+		 * CMD_IAAD means we got the effect, but no status or IRQ.
+		 */
+		if ((status & STS_IAA) || !(cmd & CMD_IAAD)) {
 			COUNT (ehci->stats.lost_iaa);
-			ehci_writel(ehci, STS_IAA, &ehci->regs->status);
+			if (status & STS_IAA)
+				ehci_writel(ehci, STS_IAA,
+						&ehci->regs->status);
 		}
-		ehci_writel(ehci, cmd & ~CMD_IAAD, &ehci->regs->command);
+		ehci_dbg(ehci, "IAA watchdog%s: status %x cmd %x\n",
+			((status & STS_IAA) || !(cmd & CMD_IAAD))
+				? ", lost IAA" : "",
+			status, cmd);
 		end_unlink_async(ehci);
+
+	} else if (status & STS_IAA) {
+		ehci_writel(ehci, STS_IAA, &ehci->regs->status);
+		ehci_dbg(ehci, "IAA watchdog%s: status %x cmd %x\n",
+			", IAA with empty reclaim",
+			status, cmd);
 	}
 
 	spin_unlock_irqrestore(&ehci->lock, flags);
@@ -602,7 +614,7 @@ static int ehci_run (struct usb_hcd *hcd
 static irqreturn_t ehci_irq (struct usb_hcd *hcd)
 {
 	struct ehci_hcd *ehci = hcd_to_ehci (hcd);
-	u32 status, pcd_status = 0;
+	u32 status, pcd_status = 0, cmd;
 	int bh;
 
 	spin_lock (&ehci->lock);
@@ -623,7 +635,7 @@ static irqreturn_t ehci_irq (struct usb_
 
 	/* clear (just) interrupts */
 	ehci_writel(ehci, status, &ehci->regs->status);
-	ehci_readl(ehci, &ehci->regs->command);	/* unblock posted write */
+	cmd = ehci_readl(ehci, &ehci->regs->command);
 	bh = 0;
 
 #ifdef VERBOSE_DEBUG
@@ -642,6 +654,16 @@ static irqreturn_t ehci_irq (struct usb_
 		bh = 1;
 	}
 
+	/* Cope with silicon bug where IAA sometimes isn't set, but
+	 * IAAD is cleared ... clearing should be a side effect of
+	 * setting IAA.  So assume that when we expect IAA but neither
+	 * IAA nor IAAD are set, we should act as if IAA was reported.
+	 */
+	if (ehci->reclaim && !(status & STS_IAA) && !(cmd & CMD_IAAD)) {
+		ehci_dbg(ehci, "IAAD cleared without IAA\n");
+		status |= STS_IAA;
+