Re: Testing for hardware bug in EHCI controllers

Alan Stern Tue, 26 Mar 2013 08:23:29 -0700

On Tue, 26 Mar 2013, Noone Nowhere wrote:

> Hello again Alan, we read that the patch broke.


What patch?  The one that works around the Intel/AMD hardware problem?  
Yes, it had a mistake, which has now been fixed.  (Although the fix has 
not yet been released in a 3.8.stable kernel.)

>  Damn, our fear was
> justified. If we are right, it will break again because it affects
> various
> devices using different busses, each one with its limitations and
> defects.

That reasoning doesn't make sense.  Practically every line of code in 
ehci-hcd.c and its associated files affects various devices using 
different buses.  If your fear was justified, the driver would never 
work at all.

>  That's why we wrote to Sarah(and got no reply yet) that

Your message to Sarah was little more than a rant about issues that are 
out of her control and took place long before she started working at 
Intel.  Surely you didn't expect her to have a constructive reply?

> vendors
> should fix their h/w bugs in-house or give open source developers full
> documentation (like nda datasheets,bios spec,bios spec up)
> to cope with their >buggy< hardware!

Well, it's questionable whether this really should be called a bug.  
Clearly it is undesirable behavior, but strictly speaking it is not in
violation of the EHCI spec.

> > Probably it is cured.  But something is still wrong, even though it may
> > be unrelated.
> 
> Since something is broken and the last program tested exits at A50M,
> we have some questions for today prior testing A50M, moschip and
> ICH4M:
> 
> 1)Do we need to apply a third
> patch(http://marc.info/?l=linux-usb&m=136379040918329&w=2) [it's upper
> part to be specific]?

Are you considering the hardware workaround and the bug-detection patch
as the first two?  Then yes, a third patch is needed, but not the one
you mentioned.  The third patch is here:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=d714aaf649460cbfd5e82e75520baa856b4fa0a0

Oddly, I can't find the patch submission anywhere in the email
archives.  Maybe I forgot to CC: linux-usb when sending it in.

> If yes, please make
> a cumulative patch for 3.8 series before we mess things up and post it
> under this subject.

It is below.

> At the same post, add the final test program with
> program version number printing as well as a "do NOT unplug the usb
> and a leave the program to go up to 1000" message. Also point out that
> the
> device must not be inside the unusual devs file as you told us. This
> will make it easier for others to follow and help you.

You can easily make these changes and post the result yourself.

> 2)If A50M still fails, do you have time to analyze our usbmon output?

Yes.

> 3)Has any type of transfer corruption been observed from this bug?

Not as far as I know.

Alan Stern



Index: 3.8/drivers/usb/host/ehci-q.c
===================================================================
--- 3.8.orig/drivers/usb/host/ehci-q.c
+++ 3.8/drivers/usb/host/ehci-q.c
@@ -547,7 +547,7 @@ qh_completions (struct ehci_hcd *ehci, s
        if (stopped != 0 || hw->hw_qtd_next == EHCI_LIST_END(ehci)) {
                switch (state) {
                case QH_STATE_IDLE:
-                       qh_refresh(ehci, qh);
+//                     qh_refresh(ehci, qh);
                        break;
                case QH_STATE_LINKED:
                        /* We won't refresh a QH that's linked (after the HC
@@ -1170,7 +1170,7 @@ static void single_unlink_async(struct e
        struct ehci_qh          *prev;
 
        /* Add to the end of the list of QHs waiting for the next IAAD */
-       qh->qh_state = QH_STATE_UNLINK;
+       qh->qh_state = QH_STATE_UNLINK_WAIT;
        if (ehci->async_unlink)
                ehci->async_unlink_last->unlink_next = qh;
        else
@@ -1213,9 +1213,19 @@ static void start_iaa_cycle(struct ehci_
 
                /* Do only the first waiting QH (nVidia bug?) */
                qh = ehci->async_unlink;
-               ehci->async_iaa = qh;
-               ehci->async_unlink = qh->unlink_next;
-               qh->unlink_next = NULL;
+
+               /*
+                * Intel (?) bug: The HC can write back the overlay region
+                * even after the IAA interrupt occurs.  In self-defense,
+                * always go through two IAA cycles for each QH.
+                */
+               if (qh->qh_state == QH_STATE_UNLINK_WAIT) {
+                       qh->qh_state = QH_STATE_UNLINK;
+               } else {
+                       ehci->async_iaa = qh;
+                       ehci->async_unlink = qh->unlink_next;
+                       qh->unlink_next = NULL;
+               }
 
                /* Make sure the unlinks are all visible to the hardware */
                wmb();
@@ -1232,6 +1242,7 @@ static void start_iaa_cycle(struct ehci_
 static void end_unlink_async(struct ehci_hcd *ehci)
 {
        struct ehci_qh          *qh;
+       __hc32                  tok1, tok2;
 
        if (ehci->has_synopsys_hc_bug)
                ehci_writel(ehci, (u32) ehci->async->qh_dma,
@@ -1242,6 +1253,7 @@ static void end_unlink_async(struct ehci
        ehci->async_unlinking = true;
        while (ehci->async_iaa) {
                qh = ehci->async_iaa;
+               tok1 = ACCESS_ONCE(qh->hw->hw_token);
                ehci->async_iaa = qh->unlink_next;
                qh->unlink_next = NULL;
 
@@ -1250,8 +1262,14 @@ static void end_unlink_async(struct ehci
 
                qh_completions(ehci, qh);
                if (!list_empty(&qh->qtd_list) &&
-                               ehci->rh_state == EHCI_RH_RUNNING)
+                               ehci->rh_state == EHCI_RH_RUNNING) {
+                       udelay(10);
+                       tok2 = ACCESS_ONCE(qh->hw->hw_token);
+                       if (tok1 != tok2)
+                               ehci_err(ehci, "EHCI hardware bug detected: 
%08x %08x\n",
+                                               tok1, tok2);
                        qh_link_async(ehci, qh);
+               }
                disable_async(ehci);
        }
        ehci->async_unlinking = false;
Index: 3.8/drivers/usb/host/ehci-hcd.c
===================================================================
--- 3.8.orig/drivers/usb/host/ehci-hcd.c
+++ 3.8/drivers/usb/host/ehci-hcd.c
@@ -748,11 +748,9 @@ static irqreturn_t ehci_irq (struct usb_
                /* guard against (alleged) silicon errata */
                if (cmd & CMD_IAAD)
                        ehci_dbg(ehci, "IAA with IAAD still set?\n");
-               if (ehci->async_iaa) {
+               if (ehci->async_iaa)
                        COUNT(ehci->stats.iaa);
-                       end_unlink_async(ehci);
-               } else
-                       ehci_dbg(ehci, "IAA with nothing unlinked?\n");
+               end_unlink_async(ehci);
        }
 
        /* remote wakeup [4.3.1] */
Index: 3.8/drivers/usb/host/ehci-timer.c
===================================================================
--- 3.8.orig/drivers/usb/host/ehci-timer.c
+++ 3.8/drivers/usb/host/ehci-timer.c
@@ -305,7 +305,7 @@ static void ehci_iaa_watchdog(struct ehc
         * (a) SMP races against real IAA firing and retriggering, and
         * (b) clean HC shutdown, when IAA watchdog was pending.
         */
-       if (ehci->async_iaa) {
+       if (1) {
                u32 cmd, status;
 
                /* If we get here, IAA is *REALLY* late.  It's barely

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Testing for hardware bug in EHCI controllers

Reply via email to