On Mon, 1 Mar 2004, David Brownell wrote: > > As I said most of the bad entry addresses aren't 5a5a5a-like, I just > > selected this one because I considered the slab poison-byte remarkable > > too. In most cases the look more like being "normal". > > Right, I suspect there are actually multiple bugs showing up > here. One slab-related, another might be in OHCI, but given > those slab problems it's hard to know for sure.
Seems you are right. More information: After I fixed the "you shall not DMA to the stack while using usb_control_msg" issue and stopped the tx-path from using transfer_buffers directly from skb (and thus probably not cache-line aligned) the disconnect-badness warning didn't appear anymore. Neither does any 5a5a5a bad entry. So it might be the slab corruption was caused by these driver issues. Unfortunately the bad entry errors persist - always with what appears a BULK-IN and a SETUP TD, the latter one with CC=5, len=0. However, the new observation is, while it happens with ohci_hcd on NEC-companion with or without external hub - it seems it never happens when the device is connected to a SiS USB-1.1 OHCI-HC! So it might even be a silicon issue with the nec-ohci. Or the pci-interface of this guy is somewhat faster exposing some race with the hcd. Have you ever heard anything about such silicon-issues with nec-ohci? > Are you using uniprocessor x86, or something else? "-mregparam" > build option? SMP-compiled kernel running on UP, CONFIG_REGPARAM _not_ set, it's just the default i386 Makefile from the vanilla 2.6.3-bk. > > But AFAICS the box continous to work rock solid - it's just usb became > > unuseable. To me it looks like the corruption remains confined in > > pci_pool. > > Depends how much memory you have, and what got corrupted. > There's no rule saying that corrupting a few 16-byte headers > must have immediate negative effects; statistically, it won't. Sure, but with 192MB total RAM I assume it's not very likely the box continues without any symptoms... > > Is there anything we might learn from the td-hash dump? > > Yes, pretty much a confirm that you're seeing a problem I > recently found out about this one. I'm tracking it down, > it might be a real OHCI issue (not triggered by slab or > dma_pool corruption in recent kernels). Hm, just speculation of course, but in td_submit_urb I see quite a number of writel - could it be there might a readl missing somewhere do deal with posted-writes on pci? > By the way: thanks for the great debug information. It > makes it a lot easier to sort through a complex problem! You're welcome - and thanks for your help, it's not all due to ohci ;-) Martin > > * using the device (ifup, start discovery) > > > > irlap_change_speed(), setting speed to 9600 > > Does this do anything more than wait for a bulk packet to complete, > and then send another one? Well, much more. Basically something like: for (;;) { for (i = 0; i < 7; i++) { usb_bulk_msg(tx_buf); /* bulk-out */ while (usb_control_msg(read_regs) != error) { if (sufficient_space_in_fifo(regs)) break; mdelay(); } } usb_submit_urb(rx_urb); /* bulk-in */ while(<500 ms after submit) { [ /* stuff in brackets executed outside kernel_thread */ rx_urb->complete: stuff data into skb start_timer(4 ms) timer_expires: usb_submit_urb(rx_urb) ] } del_timer_sync(); usb_unlink_urb(rx_urb); /* synch */ } > > * only a few seconds later > > > > ohci_hcd 0000:00:0c.1: bad entry afbc0c0 > > ohci_hcd 0000:00:0c.1: td_hash[0] > > ohci_hcd 0000:00:0c.1: entry td cafbc000; urb cb23a2e0 index 0; hw next td > > 0afbc040 > > ohci_hcd 0000:00:0c.1: info 00140000 CC=0 (CARRY) DI=0 IN R > > ohci_hcd 0000:00:0c.1: cbp 0afbe020 be 0afbe020 (len 1) > > That one would be the hub status IRQ transfer. Your later post, > without the hub, didn't show this. Yep. > > ohci_hcd 0000:00:0c.1: td_hash[4] > > ohci_hcd 0000:00:0c.1: entry td cafbc100; urb c8298ae4 index 0; hw next td > > 00000000 > > ohci_hcd 0000:00:0c.1: info 03140000 CC=0 DATA1 DI=0 IN R > > ohci_hcd 0000:00:0c.1: cbp 08264000 be 08264fff (len 4096) > > Did you recently issue an IN packet on this same endpoint? As you can see from the scheme above, there are almost always bulk-reads submitted. Usually they will complete very fast because the device is (mis-) designed to return actual_length==0 instead of NAK when there is no rx-data available. > > ohci_hcd 0000:00:0c.1: bad? td cafbc0c0; urb 00000000 index 0; hw next td 0afbc100 > > ohci_hcd 0000:00:0c.1: info 5e000000 CC=5 DATA0 DI=0 SETUP > > ohci_hcd 0000:00:0c.1: cbp 00000000 be 00000000 (len 0) > > Those are the two that look suspicious. And they look a lot less > cryptic once you know that I've seen almost the same "bad?" TD in > some unlink testing I've been doing, with pure bulk traffic, > paired with the strange "good" one (hwNextTD also null/bogus, that > should only happen with dummy TDs). So it could be related to the bulk-unlink? This would at least not contradict the the observation, a former version of the driver did not trigger the issue. This version (as of 2.6.3-bk6 or later) uses an INT-IN urb instead of the BULK_IN. So what was changed is: * BULK-IN instead of INT-IN transfer used * resubmit defered to timer instead directly from completion callback * synch unlink instead of async on line turnaround rx->tx > Notice how bad->hwNextTD points to the first/only td_hash[4] entry. > Everything except its hwNextTd and one "info" byte is zero ... the > "SETUP" comes from a bitfield that's zero. And when I dumped just > a few more values from the TD, everything exept bad->td_dma was > was also zero. As if a just-allocated TD got sent to the HC, and > it came back with CC=5/timeout because of course a bulk endpoint > isn't going to accept a SETUP packet. Ok. When I said it looked like control-transfer this was due to the SETUP packet assumption - (almost) all-zero TD would create this impression too, yes. But why should it have DATA1 in many cases? And look for the strange unaligned hw_next_td field in some of the additional examples below: ohci_hcd 0000:00:0c.1: bad entry 69f6040 ohci_hcd 0000:00:0c.1: td_hash[2] ohci_hcd 0000:00:0c.1: entry td c69f6080; urb c4abf274 index 0; hw next td 00000000 ohci_hcd 0000:00:0c.1: info 03140000 CC=0 DATA1 DI=0 IN R ohci_hcd 0000:00:0c.1: cbp 05080000 be 05080fff (len 4096) ohci_hcd 0000:00:0c.1: bad? td c69f6040; urb 00000000 index 0; hw next td 069f6080 ohci_hcd 0000:00:0c.1: info 5e000000 CC=5 DATA0 DI=0 SETUP ohci_hcd 0000:00:0c.1: cbp 00000000 be 00000000 (len 0) ohci_hcd 0000:00:0c.1: urb c4abf274 path 2 ep2in 5f160000 cc 5 --> status -110 ohci_hcd 0000:00:0c.1: bad entry 69f6080 ohci_hcd 0000:00:0c.1: td_hash[1] ohci_hcd 0000:00:0c.1: entry td c69f6040; urb c4abf34c index 0; hw next td 00000000 ohci_hcd 0000:00:0c.1: info 03140000 CC=0 DATA1 DI=0 IN R ohci_hcd 0000:00:0c.1: cbp 02018000 be 02018fff (len 4096) ohci_hcd 0000:00:0c.1: bad? td c69f6080; urb 00000000 index 0; hw next td 069f6083 ohci_hcd 0000:00:0c.1: info 5f000000 CC=5 DATA1 DI=0 SETUP ohci_hcd 0000:00:0c.1: cbp 00000000 be 00000000 (len 0) ohci_hcd 0000:00:0c.1: bad entry 69f6000 ohci_hcd 0000:00:0c.1: td_hash[2] ohci_hcd 0000:00:0c.1: entry td c69f6080; urb c4abf34c index 0; hw next td 00000000 ohci_hcd 0000:00:0c.1: info 02140000 CC=0 DATA0 DI=0 IN R ohci_hcd 0000:00:0c.1: cbp 02018000 be 02018fff (len 4096) ohci_hcd 0000:00:0c.1: bad? td c69f6000; urb 00000000 index 0; hw next td 069f6003 ohci_hcd 0000:00:0c.1: info 5f000000 CC=5 DATA1 DI=0 SETUP ohci_hcd 0000:00:0c.1: cbp 00000000 be 00000000 (len 0) ohci_hcd 0000:00:0c.1: bad entry 69f6040 ohci_hcd 0000:00:0c.1: td_hash[2] ohci_hcd 0000:00:0c.1: entry td c69f6080; urb c4abf34c index 0; hw next td 00000000 ohci_hcd 0000:00:0c.1: info 02140000 CC=0 DATA0 DI=0 IN R ohci_hcd 0000:00:0c.1: cbp 02018000 be 02018fff (len 4096) ohci_hcd 0000:00:0c.1: bad? td c69f6040; urb 00000000 index 0; hw next td 069f6043 ohci_hcd 0000:00:0c.1: info 5f000000 CC=5 DATA1 DI=0 SETUP ohci_hcd 0000:00:0c.1: cbp 00000000 be 00000000 (len 0) ------------------------------------------------------- SF.Net is sponsored by: Speed Start Your Linux Apps Now. Build and deploy apps & Web services for Linux with a free DVD software kit from IBM. Click Now! http://ads.osdn.com/?ad_id=1356&alloc_id=3438&op=click _______________________________________________ [EMAIL PROTECTED] To unsubscribe, use the last form field at: https://lists.sourceforge.net/lists/listinfo/linux-usb-devel