Hi Bob ... as Richard has mentioned, allocation to vdevs
is done in fixed-size chunks (Richard specs 1MB, but I
remember a 512KB number from the original spec; the exact
value is not very important), and the allocation algorithm
is basically doing load balancing.
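
If it helps to picture it, here is a tiny Python sketch of the idea.
This is my simplification, not the actual ZFS metaslab allocator; the
chunk size, vdev names, and pick_vdev heuristic are all illustrative:

# Hypothetical sketch of load-balanced chunk allocation across top-level
# vdevs.  NOT the real ZFS metaslab code; it only illustrates the idea
# that writes are spread over vdevs in fixed-size chunks.

CHUNK = 1 << 20  # 1MB allocation chunk (512KB per the spec I remember)

class Vdev:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def free(self):
        return self.capacity - self.used

def pick_vdev(vdevs):
    # Load balance by allocating from the vdev with the most free
    # space, so all drives fill (and stay busy) roughly uniformly.
    return max(vdevs, key=lambda v: v.free())

def allocate(vdevs, nbytes):
    placed = []
    while nbytes > 0:
        v = pick_vdev(vdevs)
        v.used += CHUNK
        placed.append(v.name)
        nbytes -= CHUNK
    return placed

vdevs = [Vdev("c1t%dd0" % i, 146 * 10**9) for i in range(5)]
print(allocate(vdevs, 10 * CHUNK))  # chunks rotate across all 5 drives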

For your non-raid pool, this chunk size will stay fixed regardless
of the block size you choose when creating the file system or the
IO unit size your application(s) use. (The stripe size can
dynamically change in a raidz pool, but not in your non-raid pool.)

Measuring bandwidth for your application load is tricky with ZFS,
since there are many hidden IO operations (besides the ones that
your application is requesting) that ZFS must perform.  If you collect
iostats on bytes transferred to the hard drives and compare those numbers
to the amount of data your application(s) transferred, you can find
potentially large differences.  The differences in these scenarios are
largely driven by the IO size your application(s) use. For example, when
I ran the following test, here is what I observed:
-using a dual Xeon server with a QLogic 2G FC interface
-using a pool of 5 10Krpm 146GB FC drives
-sequentially rewriting four previously written 15GB files in one
 file system in the pool (this file system is using a 128KB
 block size), with a separate thread writing
 each file concurrently, for a total of 60GB written
application IO size   app data written   disk IO observed   BW MB/s   %CPU
                4KB               60GB            227.3GB      34.2   20.4
               32KB               60GB            216.5GB      36.1   13.9
              128KB               60GB             63.6GB      69.6   31.0

You can see that a small application IO size causes a great deal
of metadata-driven IO (more than 3 times the actual application
IO requirement), while the 128KB application writes induce
only marginally more disk IO than the application actually uses.
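
The amplification factor is just bytes at the disks divided by bytes
the application wrote.  Using the figures from the table above (the
script is only illustrative arithmetic on my measurements):

# Write amplification = physical bytes at the disks / application bytes.
# The physical total is what you get by summing the per-device byte
# counts from iostat; the figures below come from the table above.
app_gb = 60.0
tests = {"4KB": 227.3, "32KB": 216.5, "128KB": 63.6}  # disk IO observed, GB
for size, disk_gb in tests.items():
    print("%6s IO size: %.1fx amplification" % (size, disk_gb / app_gb))
# prints 3.8x for 4KB, 3.6x for 32KB, 1.1x for 128KB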

The BW numbers here are for just the application data; when you
consider all the IO from the disks over the test times, the
physical BW is obviously greater in all cases.
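
You can back the physical BW out of the same numbers: the application
BW fixes the elapsed time, and the observed disk traffic divided by
that time gives the disk-side BW.  Rough arithmetic, same table:

# Physical BW = disk bytes observed / elapsed time, where elapsed time
# is recovered from the application data volume and application BW.
tests = {  # IO size: (disk IO observed in GB, application BW in MB/s)
    "4KB":   (227.3, 34.2),
    "32KB":  (216.5, 36.1),
    "128KB": ( 63.6, 69.6),
}
app_mb = 60 * 1024.0  # 60GB of application data, in MB
for size, (disk_gb, app_bw) in tests.items():
    elapsed = app_mb / app_bw             # seconds for the whole run
    phys_bw = disk_gb * 1024.0 / elapsed  # MB/s actually moved by disks
    print("%6s: %4.0fs elapsed, ~%.0f MB/s at the disks"
          % (size, elapsed, phys_bw))
# ~130 MB/s at the disks for 4KB and 32KB, ~74 MB/s for 128KB

In other words, the drives were moving roughly 130MB/s in the small-IO
runs; they were saturated with overhead traffic, not sitting idle.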

All my drives were uniformly busy in these tests, but the
small application IO sizes forced much more total IO against
the drives.  In your case the application IO rate would be
degraded even further by the mirror configuration.  The extra
load of reading and writing meta-data (including ditto blocks)
and of writing to mirror devices conspires to reduce the application
IO rate, even though the disk device IO rates may be quite good.
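
As a very rough model of your mirrored case (the metadata fraction
below is an assumption for illustration, not a measurement; only the
2-way mirror factor and the 2 ditto copies of metadata are real ZFS
behavior):

# Very rough, hypothetical estimate of physical write traffic in a
# 2-way mirror.  meta_fraction is assumed, not measured.
app_gb        = 60.0
meta_fraction = 0.06  # assumed metadata bytes per application byte
ditto_copies  = 2     # ZFS stores at least 2 copies of most metadata
mirror_ways   = 2     # a 2-way mirror doubles every physical write

data_gb = app_gb * mirror_ways
meta_gb = app_gb * meta_fraction * ditto_copies * mirror_ways
print("~%.0fGB of data + ~%.0fGB of metadata hit the disks"
      % (data_gb, meta_gb))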

File system block size reduction only exacerbates the problem by
requiring more meta-data to support the same quantity of
application data, and for sequential IO this is a loser.  In any
case, for a non-raid pool, the allocation chunk size per drive 
(the stripe size) is not influenced by file system block size.
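
The reason smaller blocks cost more is easy to quantify: every file
system block needs a 128-byte block pointer, so the pointer overhead
scales inversely with block size.  A back-of-the-envelope sketch,
counting only the first level of indirect blocks:

# Back-of-the-envelope metadata cost of one 15GB file at different
# block sizes.  A ZFS block pointer is 128 bytes, so a 128KB indirect
# block holds 1024 of them; only the first level of indirect blocks
# is counted here.
FILE_BYTES = 15 * 2**30
BLKPTR     = 128
for bs in (4 * 2**10, 32 * 2**10, 128 * 2**10):
    nblocks   = FILE_BYTES // bs
    ptr_bytes = nblocks * BLKPTR
    print("%3dKB blocks: %7d pointers, %5.0fMB of block pointer metadata"
          % (bs // 2**10, nblocks, ptr_bytes / 2.0**20))
# 4KB blocks need ~480MB of pointers per 15GB file; 128KB need ~15MB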

When application IO sizes get small, the overhead in ZFS goes
up dramatically.

regards, Bill

> The application is spending almost all the time blocked on I/O.  I
> see that the number of device writes per second seems pretty high.
> The application is doing I/O in 128K blocks.  How many IOPS does a
> modern 300GB 15K RPM SAS drive typically deliver?  Of course the
> IOPS capacity depends on whether the access is random or sequential.
> At the application level, the access is completely sequential but
> ZFS is likely doing some extra seeks.
 
 