Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Mario Goebbels
 Similarly, read block size does not make a 
 significant difference to the sequential read speed.

Last time I ran a simple benchmark using dd, supplying the record size as
the blocksize (instead of omitting the blocksize parameter) bumped the
mirror pool speed from 90MB/s to 130MB/s.
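A runnable sketch of that sort of comparison; /tmp/ddtest is a stand-in
path (on a real run the file would live on the ZFS pool and be read with
a cold cache):

```shell
# Create a 16MB scratch file, then read it back at two block sizes.
dd if=/dev/zero of=/tmp/ddtest bs=1024k count=16 2>/dev/null
dd if=/tmp/ddtest of=/dev/null bs=128k    # blocksize matched to the ZFS recordsize
dd if=/tmp/ddtest of=/dev/null            # no bs= means 512-byte reads
rm -f /tmp/ddtest
```

The difference shows up in dd's transfer-rate summary; with no bs=, dd
issues one read(2) per 512 bytes, so syscall overhead dominates.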

-mg



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Bob Friesenhahn
On Thu, 20 Mar 2008, Mario Goebbels wrote:

 Similarly, read block size does not make a
 significant difference to the sequential read speed.

 Last time I did a simple bench using dd, supplying the record size as
 blocksize to it instead of no blocksize parameter bumped the mirror pool
 speed from 90MB/s to 130MB/s.

Indeed.  However, as an interesting twist to things, in my own 
benchmark runs I see two behaviors.  When the file size is smaller 
than the amount of RAM the ARC can reasonably grow to, the write block 
size does make a clear difference.  When the file size is larger than 
RAM, the write block size no longer makes much difference and 
sometimes larger block sizes actually go slower.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Jonathan Edwards

On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote:
 On Thu, 20 Mar 2008, Mario Goebbels wrote:

 Similarly, read block size does not make a
 significant difference to the sequential read speed.

 Last time I did a simple bench using dd, supplying the record size as
 blocksize to it instead of no blocksize parameter bumped the mirror  
 pool
 speed from 90MB/s to 130MB/s.

 Indeed.  However, as an interesting twist to things, in my own
 benchmark runs I see two behaviors.  When the file size is smaller
 than the amount of RAM the ARC can reasonably grow to, the write block
 size does make a clear difference.  When the file size is larger than
 RAM, the write block size no longer makes much difference and
 sometimes larger block sizes actually go slower.

in that case .. try fixing the ARC size .. the dynamic resizing on the  
ARC can be less than optimal IMHO

---
.je


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Bob Friesenhahn
On Thu, 20 Mar 2008, Jonathan Edwards wrote:

 in that case .. try fixing the ARC size .. the dynamic resizing on the ARC 
 can be less than optimal IMHO

Is a 16GB ARC size not considered to be enough? ;-)

I was only describing the behavior that I observed.  It seems to me 
that when large files are written very quickly and the file becomes 
bigger than the ARC, what is contained in the ARC is mostly stale and 
does not help much any more.  If the file is smaller than the ARC, 
then there is likely to be more useful caching.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Jonathan Edwards

On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote:
 On Thu, 20 Mar 2008, Jonathan Edwards wrote:

 in that case .. try fixing the ARC size .. the dynamic resizing on  
 the ARC
 can be less than optimal IMHO

 Is a 16GB ARC size not considered to be enough? ;-)

 I was only describing the behavior that I observed.  It seems to me
 that when large files are written very quickly, that when the file
 becomes bigger than the ARC, that what is contained in the ARC is
 mostly stale and does not help much any more.  If the file is smaller
 than the ARC, then there is likely to be more useful caching.

sure i got that - it's not the size of the arc in this case since  
caching is going to be a lost cause.. but explicitly setting a  
zfs_arc_max should result in fewer calls to arc_shrink() when you hit  
memory pressure between the application's page buffer competing with  
the arc

in other words, as soon as the arc is 50% full of dirty pages (8GB)  
it'll start evicting pages .. you can't avoid that .. but what you can  
avoid is the additional weight of constantly growing and shrinking the  
cache as it tries to keep up with your constantly changing blocks in a  
large file
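On Solaris the cap goes in /etc/system (a sketch; the value is in
bytes, and 0x400000000 = 16GB is shown only as an example figure):

```
* /etc/system -- pin the ARC ceiling instead of letting it resize dynamically
* (value is in bytes; 0x400000000 = 16GB, an example only)
set zfs:zfs_arc_max = 0x400000000
```

A reboot is required for /etc/system changes to take effect.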

---
.je


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-19 Thread Bill Moloney
Hi Bob ... as Richard has mentioned, allocation to vdevs
is done in fixed-size chunks (Richard specs 1MB; I
remember a 512KB number from the original spec, but this
is not very important), and the allocation algorithm is
basically doing load balancing.

for your non-raid pool, this chunk size will stay fixed regardless
of the block size you choose when creating the file system or the
IO unit size your application(s) use. (The stripe size can
dynamically change in a raidz pool, but not in your non-raid pool.)

Measuring bandwidth for your application load is tricky with ZFS,
since there are many hidden IO operations (beyond the ones that
your application is requesting) that ZFS must perform.  If you collect
iostats on bytes transferred to the hard drives and compare those numbers
to the amount of data your application(s) transferred, you can find
potentially large differences.  The differences in these scenarios are
largely driven by the IO size your application(s) use.  For example, when
I run the following tests, here are my observations:
-using dual xeon server with qlogic FC 2G interface
-using a pool with 5 10Krpm FC 146GB drives
-sequentially writing 4 15GB previously written files in one
 file system in the pool (this file system is using 128KB
 block size), with a separate thread writing
 each file concurrently, for a total of 60GB written

block size   written   actual disk IO   BW (MB/s)   %CPU
  4KB        60GB      227.3GB          34.2        20.4
 32KB        60GB      216.5GB          36.1        13.9
128KB        60GB       63.6GB          69.6        31.0

You can see that a small application IO size causes far more
meta-data-related IO (more than 3 times the actual application
IO requirements), while the 128KB application writes induce
only marginally more disk IO than the application actually uses.
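Expressed as write amplification (actual disk bytes divided by
application bytes), the test figures work out to roughly:

```shell
# Write amplification per test case, computed from the figures above.
awk 'BEGIN {
    printf "4KB:   %.2fx\n", 227.3 / 60
    printf "32KB:  %.2fx\n", 216.5 / 60
    printf "128KB: %.2fx\n",  63.6 / 60
}'
```

About 3.8x amplification for 4KB writes versus only 1.06x at 128KB.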

the BW numbers here are for just the application data, but when
you consider all the IO from the disks over the test times, the 
physical BW is obviously greater in all cases.

All my drives were uniformly busy in these tests, but the 
small application IO sizes forced much more total IO against
the drives.  In your case the application IO rate would be even
further degraded due to the mirror configuration.  The extra
load of reading and writing meta-data (including ditto-blocks) 
and mirror devices conspire to reduce the application IO rate, 
even though the disk device IO rates may be quite good.  

File system block size reduction only exacerbates the problem by
requiring more meta-data to support the same quantity of
application data, and for sequential IO this is a loser.  In any
case, for a non-raid pool, the allocation chunk size per drive 
(the stripe size) is not influenced by file system block size.

When application IO sizes get small, the overhead in ZFS goes
up dramatically.

regards, Bill

 The application is spending almost all the time blocked on I/O.  I see
 that the number of device writes per second seems pretty high.  The
 application is doing I/O in 128K blocks.  How many IOPS does a modern
 300GB 15K RPM SAS drive typically deliver?  Of course the IOPS
 capacity depends on if the access is random or sequential.  At the
 application level, the access is completely sequential but ZFS is
 likely doing some extra seeks.
 
 


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-19 Thread Bob Friesenhahn
On Wed, 19 Mar 2008, Bill Moloney wrote:

 When application IO sizes get small, the overhead in ZFS goes
 up dramatically.

Thanks for the feedback.  However, from what I have observed, it is 
not the full story.  On my own system, when a new file is 
written, the write block size does not make a significant difference 
to the write speed.  Similarly, read block size does not make a 
significant difference to the sequential read speed.  I do see a 
large difference in rates when an existing file is updated 
sequentially.  There is a many orders of magnitude difference for 
random I/O type updates.

I think that there are some rather obvious reasons for the difference 
between writing a new file and updating an existing file.  When 
writing a new file, the system can buffer up to a disk block's worth 
of data prior to issuing a disk I/O, or it can immediately write what 
it has; since the write is sequential, it does not need to re-read 
prior to writing (though there may be more metadata I/Os).  For the 
case of updating part of a disk block, there needs to be a read prior 
to the write if the block is not cached in RAM.
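As a back-of-the-envelope sketch (assuming a 128KB record that is not
cached), a partial-block update moves far more raw I/O than the data
actually changed:

```shell
# Update 4KB inside a 128KB record: read the full record, then write it back.
awk 'BEGIN { printf "%.0fx raw I/O per byte changed\n", (128 + 128) / 4 }'
```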

If the system is short on RAM, it may be that ZFS issues many more 
write I/Os than if it has a lot of RAM.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS I/O algorithms

2008-03-19 Thread Bill Moloney
 On my own system, when a new file is 
 written, the write block size does not make 
 a significant difference to the write speed

Yes, I've observed the same result ... when a new file is being written 
sequentially, the file data and newly constructed meta-data can be 
built in cache and written in large sequential chunks periodically,
without the need to read in existing meta-data and/or data.  It
seems that data and meta-data that are newly constructed in cache for 
sequential operations will persist in cache effectively, 
and the application IO size is a much less sensitive parameter.
Monitoring disks with iostat in these cases shows the disk IO to 
be only marginally greater than the application IO.
This is why I specified that the write tests
described in my previous post were to existing files.

The overhead of doing small sequential writes to an 
existing object is so much greater than writing to a new 
object that it begs for some reasonable explanation.
The only one that I've been able to assemble through various
experiments is that data/meta-data for existing objects 
is not retained effectively in cache if ZFS detects that such an
object is being sequentially written.  This forces the
constant re-reading of the data/meta-data associated with
such an object, causing a huge increase in device IO
traffic that does not seem to accompany the writing of a
brand new object.  The size of RAM seems to make little
difference in this case.

As small sequential writes accumulate in the 5 second cache, the 
chain of meta-data leading to the newly constructed data block may 
see only one pointer (of the 128 in the final set) changing to point to
this newly constructed data block, but all the meta-data from the
uber block to the target must be rewritten on the 5 second flush.
Of course this is not much different from what's happening in the
newly created object scenario, so it must be the behavior that follows 
this flush that's different.  It seems to me that after this flush, some,
or all of the data/meta-data that will be affected next is re-read even
though much of what's needed for subsequent operations should already
be in cache.

My experience with large RAM systems and with the use of SSDs 
as ZFS cache devices has convinced me that data/meta-data associated
with sequential write operations to existing objects (and ZFS seems 
very good at detecting this association) does not get retained 
in cache very effectively.

You can see this very clearly if you look at the IO to a cache
device (ZFS allows you to easily attach a device to a pool as a 
cache device which acts as a sort of L2 type cache for RAM).  
When I do random IO operations to existing objects I
see a large amount of IO to my cache device as RAM fills and ZFS 
pushes cached information (that would otherwise be evicted)
to the SSD cache device.  If I repeat 
the random IO test over the same total file space I see improved
performance as I get occasional hits from the RAM cache and the
SSD cache.  As this extended cache hierarchy warms up with each
test run, my results continue to improve.  If I run sequential write 
operations to existing objects, however, I see very little activity to 
my SSD cache, and virtually no change in performance when I 
immediately run the same test again.
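For anyone wanting to reproduce this, attaching a cache device and
watching its traffic takes two commands (this is only an admin sketch
for a live system; the pool and device names below are hypothetical):

```shell
zpool add tank cache c2t0d0    # attach an SSD to the pool as a cache device
zpool iostat -v tank 5         # per-vdev stats; watch the cache device's line
```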

It seems that ZFS is still in need of some fine-tuning for small
sequential write operations to existing objects.

regards, Bill
 
 


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-16 Thread Mario Goebbels
 I do see that all the devices are quite evenly busy.  There is no 
 doubt that the load balancing is quite good.  The main question is if 
 there is any actual striping going on (breaking the data into 
 smaller chunks), or if the algorithm is simply load balancing. 
 Striping trades IOPS for bandwidth.

There's no striping across vdevs going on. It's simple load balancing,
i.e. blocks are spread semi-randomly across the vdevs. I say semi
because apparently the average bandwidth load of each vdev influences
the outcome.

-mg





Re: [zfs-discuss] ZFS I/O algorithms

2008-03-16 Thread Richard Elling
Bob Friesenhahn wrote:
 On Sat, 15 Mar 2008, Richard Elling wrote:

 My observation, is that each metaslab is, by default, 1 MByte in 
 size.  Each
 top-level vdev is allocated by metaslabs.  ZFS tries to allocate a 
 top-level
 vdev's metaslab before moving onto another one.  So you should see eight
 128kByte allocs per top-level vdev before the next top-level vdev is
 allocated.

 That said, the actual iops are sent in parallel.  So it is not 
 unusual to see
 many, most, or all of the top-level vdevs concurrently busy.

 Does this match your experience?

 I do see that all the devices are quite evenly busy.  There is no 
 doubt that the load balancing is quite good.  The main question is if 
 there is any actual striping going on (breaking the data into 
 smaller chunks), or if the algorithm is simply load balancing. 
 Striping trades IOPS for bandwidth.

By my definition of striping, yes it is going on.  But there are
different ways to spread the data.  The way that writes are handled,
ZFS rewards devices which can provide good sequential write
bandwidth, like disks.  Reads are another story, they read from
where the data is, which in turn depends on the conditions at
write time.

The other behaviour you may see is that reads and writes are
coalesced, when possible.  At the device level you may see your
smaller blocks being coalesced into larger iops.


 Using my application, I did some tests today.  The application was 
 used to do balanced read/write of about 500GB of data in some tens of 
 thousand of reasonably large files.  The application sequentially 
 reads a file, then sequentially writes a file.  Several copies (2-6) 
 of the application were run at once for concurrency.  What I noticed 
 is that with hardly any CPU being used, the read+write bandwidth 
 seemed to be bottlenecked at about 280MB/second with 'zfs iostat' 
 showing very balanced I/O between the reads and the writes.

But where is the bottleneck?  iostat will show bottlenecks in the
physical disks and channels.  vmstat or mpstat will show the
bottlenecks in cpus.  To see if the app is the bottleneck will
require some analysis of the app itself.  Is it spending its time
blocked on I/O?


 The system I set up is performing quite a bit differently than I 
 anticipated.  The I/O is bottlenecked and I find that my application 
 can do significant processing of the data without significantly 
 increasing the application run time.  So CPU time is almost free.

 If I was to assign a smaller block size for the filesystem, would that 
 provide more of the benefits of striping or would it be detrimental to 
 performance due to the number of I/Os?

I would not expect to see much difference, but the proof is in the pudding.
Let us know what you find.
 -- richard



Re: [zfs-discuss] ZFS I/O algorithms

2008-03-16 Thread Bob Friesenhahn
On Sun, 16 Mar 2008, Richard Elling wrote:

 But where is the bottleneck?  iostat will show bottlenecks in the
 physical disks and channels.  vmstat or mpstat will show the
 bottlenecks in cpus.  To see if the app is the bottleneck will
 require some analysis of the app itself.  Is it spending its time
 blocked on I/O?

The application is spending almost all the time blocked on I/O.  I see 
that the number of device writes per second seems pretty high.  The 
application is doing I/O in 128K blocks.  How many IOPS does a modern 
300GB 15K RPM SAS drive typically deliver?  Of course the IOPS 
capacity depends on if the access is random or sequential.  At the 
application level, the access is completely sequential but ZFS is 
likely doing some extra seeks.
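A rough ceiling for random IOPS on a 15K RPM drive is one I/O per
(average seek + half a rotation); the 3.5ms average seek below is an
assumed figure, not any particular drive's spec:

```shell
# Half a rotation at 15,000 RPM is 60000/15000/2 = 2ms; assume ~3.5ms average seek.
awk 'BEGIN {
    half_rot_ms = 60000 / 15000 / 2
    printf "%.0f IOPS\n", 1000 / (3.5 + half_rot_ms)
}'
```

So on the order of 180 random IOPS per spindle; sequential access can
sustain far more because the seeks mostly disappear.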

iostat output (atime=off):

  extended device statistics 
devicer/sw/s   Mr/s   Mw/s wait actv  svc_t  %w  %b 
sd0   0.00.00.00.0  0.0  0.00.0   0   0 
sd1   0.00.00.00.0  0.0  0.02.8   0   0 
sd2   0.00.00.00.0  0.0  0.00.0   0   0 
sd10 80.4  170.7   10.0   19.9  0.0  9.2   36.5   0  54 
sd11 82.1  170.2   10.2   20.0  0.0 13.3   52.9   0  71 
sd12 79.3  168.39.9   20.0  0.0 13.1   53.1   0  69 
sd13 80.6  173.0   10.0   19.9  0.0  9.3   36.7   0  56 
sd14 80.9  167.8   10.1   20.0  0.0 13.4   53.8   0  70 
sd15 77.7  168.79.7   19.9  0.0  9.1   37.1   0  52 
sd16 77.3  170.69.6   20.0  0.0 13.3   53.7   0  70 
sd17 76.4  168.29.5   20.0  0.0  9.1   37.2   0  52 
sd18 76.7  172.29.5   19.9  0.0 13.5   54.2   0  70 
sd19 83.8  173.2   10.4   20.0  0.0 13.7   53.4   0  74 
sd20 73.3  174.39.1   20.0  0.0  9.1   36.9   0  56 
sd21 75.3  170.29.4   20.0  0.0 13.2   53.9   0  69 
nfs1  0.00.00.00.0  0.0  0.00.0   0   0

% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0  288   1  189  1018  413  815   26  102   880  30463   3   0  94
   1  185   1  180   6341  830   43  111   740  31173   2   0  94
   2  284   1  183   5216  617   27   98   670  49544   3   0  93
   3  176   1  239   748  353  555   25   76   620  39334   3   0  93

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS I/O algorithms

2008-03-15 Thread Richard Elling
Bob Friesenhahn wrote:
 Can someone please describe to me the actual underlying I/O operations 
 which occur when a 128K block of data is written to a storage pool 
 configured as shown below (with default ZFS block sizes)?  I am 
 particularly interested in the degree of striping across mirrors 
 which occurs.  This would be for Solaris 10 U4.
   

My observation, is that each metaslab is, by default, 1 MByte in size.  Each
top-level vdev is allocated by metaslabs.  ZFS tries to allocate a top-level
vdev's metaslab before moving onto another one.  So you should see eight
128kByte allocs per top-level vdev before the next top-level vdev is
allocated.

That said, the actual iops are sent in parallel.  So it is not unusual 
to see
many, most, or all of the top-level vdevs concurrently busy.

Does this match your experience?
 -- richard


   NAME   STATE READ WRITE CKSUM
   Sun_2540   ONLINE   0 0 0
 mirror   ONLINE   0 0 0
   c4t600A0B80003A8A0B096A47B4559Ed0  ONLINE   0 0 0
   c4t600A0B800039C9B50AA047B4529Bd0  ONLINE   0 0 0
 mirror   ONLINE   0 0 0
   c4t600A0B80003A8A0B096E47B456DAd0  ONLINE   0 0 0
   c4t600A0B800039C9B50AA447B4544Fd0  ONLINE   0 0 0
 mirror   ONLINE   0 0 0
   c4t600A0B80003A8A0B096147B451BEd0  ONLINE   0 0 0
   c4t600A0B800039C9B50AA847B45605d0  ONLINE   0 0 0
 mirror   ONLINE   0 0 0
   c4t600A0B80003A8A0B096647B453CEd0  ONLINE   0 0 0
   c4t600A0B800039C9B50AAC47B45739d0  ONLINE   0 0 0
 mirror   ONLINE   0 0 0
   c4t600A0B80003A8A0B097347B457D4d0  ONLINE   0 0 0
   c4t600A0B800039C9B50AB047B457ADd0  ONLINE   0 0 0
 mirror   ONLINE   0 0 0
   c4t600A0B800039C9B50A9C47B4522Dd0  ONLINE   0 0 0
   c4t600A0B800039C9B50AB447B4595Fd0  ONLINE   0 0 0


 Thanks,

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   



Re: [zfs-discuss] ZFS I/O algorithms

2008-03-15 Thread Bob Friesenhahn
On Sat, 15 Mar 2008, Richard Elling wrote:

 My observation, is that each metaslab is, by default, 1 MByte in size.  Each
 top-level vdev is allocated by metaslabs.  ZFS tries to allocate a top-level
 vdev's metaslab before moving onto another one.  So you should see eight
 128kByte allocs per top-level vdev before the next top-level vdev is
 allocated.

 That said, the actual iops are sent in parallel.  So it is not unusual to see
 many, most, or all of the top-level vdevs concurrently busy.

 Does this match your experience?

I do see that all the devices are quite evenly busy.  There is no 
doubt that the load balancing is quite good.  The main question is if 
there is any actual striping going on (breaking the data into 
smaller chunks), or if the algorithm is simply load balancing. 
Striping trades IOPS for bandwidth.

Using my application, I did some tests today.  The application was 
used to do balanced read/write of about 500GB of data in some tens of 
thousand of reasonably large files.  The application sequentially 
reads a file, then sequentially writes a file.  Several copies (2-6) 
of the application were run at once for concurrency.  What I noticed 
is that with hardly any CPU being used, the read+write bandwidth 
seemed to be bottlenecked at about 280MB/second with 'zfs iostat' 
showing very balanced I/O between the reads and the writes.

The system I set up is performing quite a bit differently than I 
anticipated.  The I/O is bottlenecked and I find that my application 
can do significant processing of the data without significantly 
increasing the application run time.  So CPU time is almost free.

If I was to assign a smaller block size for the filesystem, would that 
provide more of the benefits of striping or would it be detrimental to 
performance due to the number of I/Os?
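If you try it, the knob is the per-dataset recordsize property; note it
only affects blocks written after the change (the dataset name below is
hypothetical):

```shell
zfs set recordsize=32k tank/data   # existing blocks keep their old size
zfs get recordsize tank/data       # verify the setting
```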

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
