Re: [linux-usb-devel] [BUG] 2.6.3-bk9 Badness in ohci_endpoint_disable

Martin Diehl Mon, 01 Mar 2004 11:38:19 -0800

On Mon, 1 Mar 2004, David Brownell wrote:

> > As I said most of the bad entry addresses aren't 5a5a5a-like, I just 
> > selected this one because I considered the slab poison-byte remarkable 
> > too. In most cases the look more like being "normal".
> 
> Right, I suspect there are actually multiple bugs showing up
> here. One slab-related, another might be in OHCI, but given
> those slab problems it's hard to know for sure.


Seems you are right. More information:

After I fixed the "you shall not DMA to the stack while using 
usb_control_msg" issue and stopped the tx-path from using transfer_buffers 
directly from skb (and thus probably not cache-line aligned) the 
disconnect-badness warning didn't appear anymore. Neither does any 5a5a5a 
bad entry. So it might be the slab corruption was caused by these driver 
issues.

Unfortunately the bad entry errors persist - always with what appears a 
BULK-IN and a SETUP TD, the latter one with CC=5, len=0. However, the new 
observation is, while it happens with ohci_hcd on NEC-companion with or 
without external hub - it seems it never happens when the device is 
connected to a SiS USB-1.1 OHCI-HC! So it might even be a silicon issue 
with the nec-ohci. Or the pci-interface of this guy is somewhat faster 
exposing some race with the hcd.

Have you ever heard anything about such silicon-issues with nec-ohci?

> Are you using uniprocessor x86, or something else?  "-mregparam"
> build option?

SMP-compiled kernel running on UP, CONFIG_REGPARAM _not_ set, it's just 
the default i386 Makefile from the vanilla 2.6.3-bk.

> > But AFAICS the box continous to work rock solid - it's just usb became 
> > unuseable. To me it looks like the corruption remains confined in 
> > pci_pool.
> 
> Depends how much memory you have, and what got corrupted.
> There's no rule saying that corrupting a few 16-byte headers
> must have immediate negative effects; statistically, it won't.

Sure, but with 192MB total RAM I assume it's not very likely the box 
continues without any symptoms...

> > Is there anything we might learn from the td-hash dump?
> 
> Yes, pretty much a confirm that you're seeing a problem I
> recently found out about this one.  I'm tracking it down,
> it might be a real OHCI issue (not triggered by slab or
> dma_pool corruption in recent kernels).

Hm, just speculation of course, but in td_submit_urb I see quite a number 
of writel - could it be there might a readl missing somewhere do deal with 
posted-writes on pci?

> By the way:  thanks for the great debug information.  It
> makes it a lot easier to sort through a complex problem!

You're welcome - and thanks for your help, it's not all due to ohci ;-)

Martin

> > * using the device (ifup, start discovery)
> > 
> > irlap_change_speed(), setting speed to 9600
> 
> Does this do anything more than wait for a bulk packet to complete,
> and then send another one?

Well, much more. Basically something like:

for (;;) {
        for (i = 0; i < 7; i++) {
                usb_bulk_msg(tx_buf);   /* bulk-out */
                while (usb_control_msg(read_regs) != error) {
                        if (sufficient_space_in_fifo(regs))
                                break;
                        mdelay();
                }
        }

        usb_submit_urb(rx_urb);         /* bulk-in */

        while(<500 ms after submit) {

            [ /* stuff in brackets executed outside kernel_thread */
                rx_urb->complete:
                        stuff data into skb
                        start_timer(4 ms)
                timer_expires:
                        usb_submit_urb(rx_urb)
            ]

        }
        del_timer_sync();
        usb_unlink_urb(rx_urb);         /* synch */
}

> > * only a few seconds later
> > 
> > ohci_hcd 0000:00:0c.1: bad entry  afbc0c0
> > ohci_hcd 0000:00:0c.1: td_hash[0]
> > ohci_hcd 0000:00:0c.1:  entry td cafbc000; urb cb23a2e0 index 0; hw next td 
> > 0afbc040
> > ohci_hcd 0000:00:0c.1:      info 00140000 CC=0 (CARRY) DI=0 IN R
> > ohci_hcd 0000:00:0c.1:      cbp 0afbe020 be 0afbe020 (len 1)
> 
> That one would be the hub status IRQ transfer.  Your later post,
> without the hub, didn't show this.

Yep.

> > ohci_hcd 0000:00:0c.1: td_hash[4]
> > ohci_hcd 0000:00:0c.1:  entry td cafbc100; urb c8298ae4 index 0; hw next td 
> > 00000000
> > ohci_hcd 0000:00:0c.1:      info 03140000 CC=0 DATA1 DI=0 IN R
> > ohci_hcd 0000:00:0c.1:      cbp 08264000 be 08264fff (len 4096)
> 
> Did you recently issue an IN packet on this same endpoint?

As you can see from the scheme above, there are almost always bulk-reads 
submitted. Usually they will complete very fast because the device is 
(mis-) designed to return actual_length==0 instead of NAK when there is no 
rx-data available.

> > ohci_hcd 0000:00:0c.1: bad? td cafbc0c0; urb 00000000 index 0; hw next td 0afbc100
> > ohci_hcd 0000:00:0c.1:      info 5e000000 CC=5 DATA0 DI=0 SETUP 
> > ohci_hcd 0000:00:0c.1:      cbp 00000000 be 00000000 (len 0)
> 
> Those are the two that look suspicious.  And they look a lot less
> cryptic once you know that I've seen almost the same "bad?" TD in
> some unlink testing I've been doing, with pure bulk traffic,
> paired with the strange "good" one (hwNextTD also null/bogus, that
> should only happen with dummy TDs).

So it could be related to the bulk-unlink? This would at least not 
contradict the the observation, a former version of the driver did not 
trigger the issue. This version (as of 2.6.3-bk6 or later) uses an INT-IN 
urb instead of the BULK_IN.

So what was changed is:
* BULK-IN instead of INT-IN transfer used
* resubmit defered to timer instead directly from completion callback
* synch unlink instead of async on line turnaround rx->tx

> Notice how bad->hwNextTD points to the first/only td_hash[4] entry.
> Everything except its hwNextTd and one "info" byte is zero ... the
> "SETUP" comes from a bitfield that's zero.   And when I dumped just
> a few more values from the TD, everything exept bad->td_dma was
> was also zero.  As if a just-allocated TD got sent to the HC, and
> it came back with CC=5/timeout because of course a bulk endpoint
> isn't going to accept a SETUP packet.

Ok. When I said it looked like control-transfer this was due to the SETUP 
packet assumption - (almost) all-zero TD would create this impression too, 
yes. But why should it have DATA1 in many cases? And look for the strange 
unaligned hw_next_td field in some of the additional examples below:

ohci_hcd 0000:00:0c.1: bad entry  69f6040
ohci_hcd 0000:00:0c.1: td_hash[2]
ohci_hcd 0000:00:0c.1:  entry td c69f6080; urb c4abf274 index 0; hw next td 00000000
ohci_hcd 0000:00:0c.1:      info 03140000 CC=0 DATA1 DI=0 IN R
ohci_hcd 0000:00:0c.1:      cbp 05080000 be 05080fff (len 4096)
ohci_hcd 0000:00:0c.1: bad? td c69f6040; urb 00000000 index 0; hw next td 069f6080
ohci_hcd 0000:00:0c.1:      info 5e000000 CC=5 DATA0 DI=0 SETUP 
ohci_hcd 0000:00:0c.1:      cbp 00000000 be 00000000 (len 0)
ohci_hcd 0000:00:0c.1: urb c4abf274 path 2 ep2in 5f160000 cc 5 --> status -110

ohci_hcd 0000:00:0c.1: bad entry  69f6080
ohci_hcd 0000:00:0c.1: td_hash[1]
ohci_hcd 0000:00:0c.1:  entry td c69f6040; urb c4abf34c index 0; hw next td 00000000
ohci_hcd 0000:00:0c.1:      info 03140000 CC=0 DATA1 DI=0 IN R
ohci_hcd 0000:00:0c.1:      cbp 02018000 be 02018fff (len 4096)
ohci_hcd 0000:00:0c.1: bad? td c69f6080; urb 00000000 index 0; hw next td 069f6083
ohci_hcd 0000:00:0c.1:      info 5f000000 CC=5 DATA1 DI=0 SETUP 
ohci_hcd 0000:00:0c.1:      cbp 00000000 be 00000000 (len 0)

ohci_hcd 0000:00:0c.1: bad entry  69f6000
ohci_hcd 0000:00:0c.1: td_hash[2]
ohci_hcd 0000:00:0c.1:  entry td c69f6080; urb c4abf34c index 0; hw next td 00000000
ohci_hcd 0000:00:0c.1:      info 02140000 CC=0 DATA0 DI=0 IN R
ohci_hcd 0000:00:0c.1:      cbp 02018000 be 02018fff (len 4096)
ohci_hcd 0000:00:0c.1: bad? td c69f6000; urb 00000000 index 0; hw next td 069f6003
ohci_hcd 0000:00:0c.1:      info 5f000000 CC=5 DATA1 DI=0 SETUP 
ohci_hcd 0000:00:0c.1:      cbp 00000000 be 00000000 (len 0)

ohci_hcd 0000:00:0c.1: bad entry  69f6040
ohci_hcd 0000:00:0c.1: td_hash[2]
ohci_hcd 0000:00:0c.1:  entry td c69f6080; urb c4abf34c index 0; hw next td 00000000
ohci_hcd 0000:00:0c.1:      info 02140000 CC=0 DATA0 DI=0 IN R
ohci_hcd 0000:00:0c.1:      cbp 02018000 be 02018fff (len 4096)
ohci_hcd 0000:00:0c.1: bad? td c69f6040; urb 00000000 index 0; hw next td 069f6043
ohci_hcd 0000:00:0c.1:      info 5f000000 CC=5 DATA1 DI=0 SETUP 
ohci_hcd 0000:00:0c.1:      cbp 00000000 be 00000000 (len 0)



-------------------------------------------------------
SF.Net is sponsored by: Speed Start Your Linux Apps Now.
Build and deploy apps & Web services for Linux with
a free DVD software kit from IBM. Click Now!
http://ads.osdn.com/?ad_id=1356&alloc_id=3438&op=click
_______________________________________________
[EMAIL PROTECTED]
To unsubscribe, use the last form field at:
https://lists.sourceforge.net/lists/listinfo/linux-usb-devel

Re: [linux-usb-devel] [BUG] 2.6.3-bk9 Badness in ohci_endpoint_disable

Reply via email to