On Mon, May 21, 2001 at 08:48:03PM -0700, David Brownell wrote:
> Do you have enough statistics about those scatter/gather segments
> to come up with a reasonable first-cut performance model?  For
> example, how big would each s/g segment be?  How many segments
> would get queued at a time?  How many other (control/interrupt)
> transfers go between each group of s/g bulk transfers?

Statistics?  Nah... not even the committee has stats yet.  But here's what I
can tell you.

All storage device transfers can be looked at in terms of 3 phases:
command, (optional) data, and status.

All service options require an URB for the command, either to a control or
bulk endpoint.

Data is always passed via bulk endpoint.  The SCSI layer will allocate the
scatter-gather segments for me, and those can vary significantly.  Initial
performance tests suggest that memory allocation is _much_ faster (read: 4-6
times) when smaller segments are used.  That stat is based on the current
codebase -- when the number of segments is increased (and thus the size of
each decreases), total throughput jumps dramatically. A "big" segment is
4K, a small segment is 512 bytes.

Status is passed either via bulk URB or interrupt URB, depending on service
options.  Lather, rinse, repeat.

Currently, since I need to maintain the synchronization between endpoints
manually, I handle each URB individually.  This works well when I've only
got one command to deal with at a time, but I'd like to be able to handle
multiple commands in the queue to improve performance.  This only makes
sense when I can use my CPU time to construct the URB chains ahead of time,
and submit them all at once, letting the DMA hardware take care of that
series.  Note that I don't actually need any completion handler code except
for the last (and possibly next-to-last in _some_ data-in cases) URB.  The
URBs really do take care of themselves.  The problem is, they're not all
bulk transfers.

Here's what I'd like in my dream world:  I allocate a largeish pool of URBs
at init time.  I then take a command off the command queue and allocate
URBs to handle it, and submit them.  While my ?HCI controller is happily
DMAing data all over the place, I take the next command off the queue and
begin constructing an URB chain for it.  In the nominal "working" case, I
get a signal from the final completion handler that tells me to submit the
next chain.

When something goes wrong, the device STALLs the endpoint.  At which point,
the data stage URBs all get -EPIPE and the status stage comes up.  We get
the status, and are ready to decide what to do about it -- including
possibly sending an already pre-constructed URB sequence to retrieve the
REQUEST_SENSE data (again, min 3 URBs).  Note that CPU time is optimized
to always be ahead of the transfer in progress, and minimal interaction is
needed once an URB chain is submitted.

It really sounds like the UHCI controllers may already do what I want.  Is
that the case?  Or am I mis-reading something here?

> I think the current bulk queuing has lots of mileage left.  I'd rather
> not change that unless/until we find we're hitting a wall with it.

Wall?  No.  Performance gain?  Yes.  Right now, usb-storage consumes an
_enormous_ amount of CPU time juggling URBs and checking status.  And
implementing a command queue is almost pointless.

Matt

-- 
Matthew Dharm                              Home: [EMAIL PROTECTED] 
Maintainer, Linux USB Mass Storage Driver

We can customize our colonels.
                                        -- Tux
User Friendly, 12/1/1998
