On Friday 15 April 2005 1:58 pm, Alan Stern wrote:
> On Fri, 15 Apr 2005, David Brownell wrote:
> >
> > Could you summarize what tools you used to generate those numbers?
> > Like what kind of driver(s) were active, with what kind of loads.
> > Audio? Storage? Networking? How about other statistics, like
> > minimum, mean, and standard deviation?
>
> No special measures were taken. This was done on two ordinary
> workstations. Networking was up on the P4 but not on the P2. No user
> programs running other than the shell and the normal background daemons,
> none of which did any USB activity (in particular haldaemon was off). I
> used usb-storage (with debugging turned off, although that shouldn't
> matter much). The P4 has EHCI controllers but ehci-hcd wasn't loaded --
> otherwise the test device wouldn't have used uhci-hcd!
So basically this was a usb-storage measurement. That's probably a
worst case from the HCD perspective, since virtually everything else only
uses very short queues. (Other than "usbnet", which at full speed uses
shorter queue lengths; or usb audio, which usually keeps only two ISO
transfers of a handful of msec each.) Only usb-storage is routinely
queueing more than a dozen KBytes or so (at full speed).

> (A lot of kernel debugging features, like cache poisoning, were turned on
> since that's how I normally do my development. They may have had a
> significant impact.)

Cache/memory poisoning certainly would; I've seen it. Of course, I
normally leave that on too! Though thankfully it's been ages since I've
had to chase a bug where an HC was using memory after it had been freed
by an HCD. Remember back a few years when that was sadly common? :)

> I didn't keep any statistics other than what you see above, and I only
> ran the test a few times. It's possible that the numbers are incorrect
> because, as I realized later, I stored the initial timer value
> immediately before calling spin_lock_irqsave instead of immediately
> after. I can do it over again if you want.
>
> > It'd also be interesting to compare them for OHCI and EHCI. I'd
> > expect UHCI would be worse, because of the TD-per-packet thing,
> > but also having some common baselines would be good.
>
> Would you like to see my test code? I'll send it to you off-list if you
> want -- not because it's big but because it's so ugly. It should be easy
> enough to adapt it to OHCI and EHCI.

Sure, please do. It'd be worth gathering statistics at the usbcore level,
IMO, for numbers that are directly comparable.

> > Heck, even just the usbcore/hcd hooks to let the HCDs cache a list of
> > TDs onto the URB would help, without needing any new API... so the
> > invasive changes could be invisible (at first) to device drivers. TDs
> > could be freed to the per-urb list, and on some architectures (like
> > x86) the re-enqueue path might well be able to use cache-hot memory.
>
> I'm not sure what would be the best/easiest approach. Preallocating TDs
> may not be good if the URB is going to live for a long time.

It'd be "good" in the sense of "when that URB is used, it'll have TDs
available". The "not good" would be limited to memory from that dma_pool
not being easily shared ... a non-issue unless urbs sit idle.

> And it's not clear how much of the time for enqueue is spent
> _allocating_ the TDs as opposed to _preparing_ them.

With memory/cache poisoning, allocating and freeing each write over the
whole TD. So the cost to allocate a new one from a dma_pool will be more
than the cost to initialize one with data (since it's got to find a TD to
allocate). Ergo my observation that a freelist would be quicker. It can
also be used to prioritize handing out TDs that are already cache-hot.
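Just to make that concrete -- this is only a sketch, none of these names
exist in uhci-hcd or usbcore, I'm inventing them for illustration -- a
per-urb (or per-endpoint) TD freelist could be as simple as:

#include <linux/list.h>
#include <linux/dmapool.h>
#include <linux/gfp.h>

/* Illustrative only; not the real uhci_td or urb_priv layout. */
struct cached_td {
	struct list_head	list;
	dma_addr_t		dma_handle;
	/* ... hardware-visible TD fields would follow ... */
};

struct td_cache {
	struct list_head	free_tds;	/* TDs parked for reuse */
	struct dma_pool		*pool;		/* fallback allocator */
};

static struct cached_td *td_cache_get(struct td_cache *cache, gfp_t flags)
{
	struct cached_td *td;
	dma_addr_t dma;

	if (!list_empty(&cache->free_tds)) {
		/* Reuse a TD freed at completion time; likely cache-hot. */
		td = list_entry(cache->free_tds.next, struct cached_td, list);
		list_del(&td->list);
		return td;
	}

	/* Only fall back to the dma_pool when the cache runs dry. */
	td = dma_pool_alloc(cache->pool, flags, &dma);
	if (td)
		td->dma_handle = dma;
	return td;
}

static void td_cache_put(struct td_cache *cache, struct cached_td *td)
{
	/* Park the TD for the next enqueue instead of returning it. */
	list_add(&td->list, &cache->free_tds);
}

Note that td_cache_put() never gives memory back to the dma_pool; that's
exactly the "not easily shared" cost mentioned above, so a real version
would want a cap, or a way to trim the list when an urb sits idle.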
> > Alternatively, a per-endpoint cache of TDs might be even better ... less
> > invasive to usbcore. That wouldn't help with urb-private data, but for
> > HCDs that need those it'd still just be a single kmalloc/free per
> > submit. That might facilitate addressing the UHCI-specific "lots of
> > TDs" issue. (By a scheme I once sketched: only URBs to the front of
> > the queue would need TDs allocated, and as TDs get freed they could be
> > mapped onto URBs towards the end. That'd put a ceiling on the enqueue
> > costs, which is a fine thing from real-time perspectives...)
>
> This is one of those changes I mentioned earlier. It shouldn't be
> necessary to have more than, say, 500 TDs allocated for an endpoint at
> any time. That's about 26 ms worth, or 31 KB of data. So long as a
> completion interrupt is issued every 200 TDs, it should work fine.

I think this approach is the one I'd take if I had time to do that kind of
work. I suspect it'd be a much bigger win with UHCI though!

- Dave
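P.S. Again purely as illustration -- same caveat, these names don't exist
in uhci-hcd, and the helpers reuse the invented ones from the sketch
above -- that bounded per-endpoint scheme might be shaped roughly like
this: keep a per-endpoint queue of submitted URBs, map TDs only until the
ceiling is hit, and let the completion path top things up as TDs come back.

#define EP_TD_CEILING	500	/* ~26 ms / ~31 KB of full-speed data */
#define EP_IRQ_INTERVAL	200	/* request a completion IRQ this often */

struct queued_urb {
	struct list_head	list;
	unsigned		tds_needed;	/* packets not yet mapped to TDs */
};

struct ep_td_cache {
	struct td_cache		cache;		/* freelist from the sketch above */
	struct list_head	urb_queue;	/* URBs queued to this endpoint */
	unsigned		tds_in_flight;
	unsigned		tds_since_irq;
};

/* Called at submit time, and again whenever completions free up TDs. */
static void ep_fill_tds(struct ep_td_cache *ep)
{
	struct queued_urb *qurb;

	list_for_each_entry(qurb, &ep->urb_queue, list) {
		while (qurb->tds_needed &&
				ep->tds_in_flight < EP_TD_CEILING) {
			struct cached_td *td;

			td = td_cache_get(&ep->cache, GFP_ATOMIC);
			if (!td)
				return;
			/* ... map the URB's next packet onto 'td' here ... */
			qurb->tds_needed--;
			ep->tds_in_flight++;
			if (++ep->tds_since_irq >= EP_IRQ_INTERVAL) {
				/* set this TD's interrupt-on-complete bit */
				ep->tds_since_irq = 0;
			}
		}
		if (ep->tds_in_flight >= EP_TD_CEILING)
			break;	/* later URBs wait for TDs to be freed */
	}
}

That keeps the per-submit cost bounded no matter how much usb-storage
queues, which is the real-time property mentioned above.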