Anton Rang writes:
 > On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance  
 > Engineering wrote:
 > 
 > > I'm not taking a stance on this, but if I keep a controller
 > > full of 128K I/Os, and assuming they are targeting contiguous
 > > physical blocks, how different is that from issuing a very
 > > large I/O?
 > 
 > There are differences at the host, the HBA, the disk or RAID
 > controller, and on the wire.
 > 
 > At the host:
 > 
 >    The SCSI/FC/ATA stack is run once for each I/O.  This takes
 >    a bit of CPU.  We generally take one interrupt for each I/O
 >    (if the CPU is fast enough), so instead of taking one
 >    interrupt for 8 MB (for instance), we take 64.
 > 
 >    We run through the IOMMU or page translation code once per
 >    page, but the overhead of initially setting up the IOMMU or
 >    starting the translation loop happens once per I/O.
 > 
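
To put rough numbers on the host-side cost, here is a minimal
back-of-envelope sketch in C; the per-interrupt, per-I/O-setup and
per-page costs in it are placeholders I picked for illustration, not
measured values:

    #include <stdio.h>

    int
    main(void)
    {
            const double total_bytes   = 8.0 * 1024 * 1024;  /* 8 MB */
            const double small_io      = 128.0 * 1024;       /* 128 KB */
            const double page_size     = 4096.0;             /* 4 KB pages */

            /* Assumed placeholder costs, not measurements. */
            const double cost_per_irq  = 5e-6;   /* per interrupt */
            const double cost_io_setup = 10e-6;  /* per-I/O stack/IOMMU setup */
            const double cost_per_page = 0.2e-6; /* per page of translation */

            double n_small = total_bytes / small_io;   /* 64 I/Os */
            double n_pages = total_bytes / page_size;  /* 2048, same either way */

            double t_small = n_small * (cost_per_irq + cost_io_setup)
                + n_pages * cost_per_page;
            double t_large = 1.0 * (cost_per_irq + cost_io_setup)
                + n_pages * cost_per_page;

            (void) printf("64 x 128K I/Os: %.0f us of host CPU\n", t_small * 1e6);
            (void) printf("1 x 8M I/O:     %.0f us of host CPU\n", t_large * 1e6);
            return (0);
    }

The per-page translation work is the same either way; what multiplies
with the smaller I/Os is the per-I/O setup and interrupt cost.
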
 > At the HBA:
 > 
 >    There is some overhead each time that the controller switches
 >    processing from one I/O to another.  This isn't too large on a
 >    modern system, but it does add up.
 > 
 >    There is overhead on the PCI (or other) bus for the small
 >    transfers that make up the command block and scatter/gather
 >    list for each I/O.  Again, it adds up (faster than you might
 >    expect, since PCI Express can move 128 KB very quickly).
 > 
 >    There is a limit on the maximum number of outstanding I/O
 >    requests, but we're unlikely to hit this in normal use; it
 >    is typically at least 256 and more often 1024 or more on
 >    newer hardware.  (This is shared for the whole channel
 >    in the FC and SCSI case, and may be shared between multiple
 >    channels for SAS or multi-port FC cards.)
 > 
 >    There is often a small cache of commands which can be handled
 >    quickly; commands outside of this cache (which may hold 4 to
 >    16 or so) are much slower to "context-switch" in when their
 >    data is needed; in particular, the scatter/gather list may
 >    need to be read again.
 > 
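
As a rough sketch of the per-command setup traffic Anton mentions, the
little C program below counts the command-block and scatter/gather
bytes for 64 x 128K vs. one 8 MB I/O; the 64-byte command block and
16-byte S/G entry sizes are assumptions for illustration, not figures
for any particular HBA:

    #include <stdio.h>

    int
    main(void)
    {
            /* Assumed sizes for illustration only. */
            const int page      = 4096;
            const int cmd_block = 64;         /* command block per I/O */
            const int sg_entry  = 16;         /* one scatter/gather entry */
            const int io_small  = 128 * 1024;
            const int total     = 8 * 1024 * 1024;

            int n_small      = total / io_small;  /* 64 commands */
            int sg_per_small = io_small / page;   /* 32 entries each */
            int sg_large     = total / page;      /* 2048 entries */

            int setup_small = n_small * (cmd_block + sg_per_small * sg_entry);
            int setup_large = cmd_block + sg_large * sg_entry;

            (void) printf("setup bytes, 64 x 128K: %d\n", setup_small);
            (void) printf("setup bytes, 1 x 8M:    %d\n", setup_large);
            return (0);
    }

The scatter/gather bytes come out nearly the same either way; the
extra cost with small I/Os is the 64 separate command blocks and, more
to Anton's point, the many small bus transactions and context switches
they imply.
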
 > At the disk or RAID:
 > 
 >    There is a fixed overhead for processing each command.  This
 >    can be fairly readily measured, and roughly reflects the
 >    difference between delivered 512-byte IOPs and bandwidth for
 >    a large I/O.  Some of it is related to parsing the CDB and
 >    starting command execution; some of it is related to cache
 >    management.
 > 
 >    There is some overhead for switching between data transfers
 >    for each command.  A typical track on a disk may hold 400K
 >    or so of data, and a full-track transfer is optimal (runs at
 >    platter speed).  A partial-track transfer immediately followed
 >    by another may take enough time to switch that we sometimes
 >    lose one revolution (particularly on disks which do not have
 >    sector headers).  Write caching should nearly eliminate this
 >    as a concern, however.
 > 
 >    There is a fixed-size window of commands that can be
 >    reordered on the device.  Data transfer within a command can
 >    be reordered arbitrarily (for parallel SCSI and FC, though
 >    not for ATA or SAS).  It's good to have lots of outstanding
 >    commands, but if they are all sequential, there's not much
 >    point (no reason to reorder them, except perhaps if you're
 >    going backwards, and FC/SCSI can handle this anyway).
 > 
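
Two quick disk-side estimates along those lines, assuming a 10,000 RPM
drive, roughly 400K per track, and a 20 us fixed cost per command (all
round numbers I picked for illustration):

    #include <stdio.h>

    int
    main(void)
    {
            /* Assumed round numbers, not specs for any particular drive. */
            const double rpm          = 10000.0;
            const double track_bytes  = 400.0 * 1024;
            const double cmd_overhead = 20e-6;

            double rev_time   = 60.0 / rpm;              /* 6 ms per revolution */
            double platter_bw = track_bytes / rev_time;  /* ~68 MB/s at the head */

            (void) printf("one lost revolution:         %.1f ms\n",
                rev_time * 1e3);
            (void) printf("command overhead, 64 I/Os:   %.2f ms\n",
                64 * cmd_overhead * 1e3);
            (void) printf("8 MB moved at platter speed: %.1f ms\n",
                8.0 * 1024 * 1024 / platter_bw * 1e3);
            return (0);
    }

With those assumptions, 64 commands' worth of fixed overhead (about
1.3 ms) is small next to the ~120 ms the 8 MB transfer itself takes at
platter speed; a lost revolution (6 ms) per partial-track transfer
would hurt more, which is why the write cache matters.
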
 > On the wire:
 > 
 >    Sending a command and its completion takes time that could
 >    be spent moving data instead; but for most protocols this
 >    probably isn't significant.
 > 
 > You can actually see most of this with a PCI and protocol
 > analyzer.
 > 

So the main question: does any of this cause a full flush of the
pipelined operations? If it's just extra busy-ness of the individual
components, all operating concurrently, and if we don't saturate any
of them because of the extra work, then it seems to me that we are
fine. Clearly a few extra bubbles may find their way into the pipe,
and we may lose the last few bleeding-edge percent of throughput.
Those guys are on QFS and delighted to be (and they should be; QFS is
outstanding in that market).
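
A quick sketch of that argument, again with assumed per-I/O costs (the
15/5/20 microsecond figures below are placeholders, not measurements),
to see how busy each component gets at, say, 80 MB/s of 128K I/Os:

    #include <stdio.h>

    int
    main(void)
    {
            const double target_bw = 80.0 * 1024 * 1024;  /* 80 MB/s */
            const double io_size   = 128.0 * 1024;        /* 128K I/Os */

            /* Assumed per-I/O costs; placeholders, not measured values. */
            const double host_cost = 15e-6;  /* host: stack + interrupt */
            const double hba_cost  = 5e-6;   /* HBA: command switch + setup */
            const double disk_cost = 20e-6;  /* disk: fixed per-command work */

            double iops = target_bw / io_size;            /* 640 commands/s */

            (void) printf("host CPU: %.1f%% of one CPU\n",
                iops * host_cost * 100);
            (void) printf("HBA:      %.1f%% busy on command work\n",
                iops * hba_cost * 100);
            (void) printf("disk:     %.1f%% busy on command work\n",
                iops * disk_cost * 100);
            return (0);
    }

If those numbers are anywhere near right, the extra per-I/O work shows
up as a percent or two of busy-ness on components that are otherwise
waiting on the media: bubbles, not a flush.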

-r

 > -- Anton
 > 
