Re: [zfs-discuss] IOzone benchmarking

2012-05-08 Thread Richard Elling
On May 7, 2012, at 1:53 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>> 
>> Has someone done real-world measurements which indicate that raidz*
>> actually provides better sequential read or write than simple
>> mirroring with the same number of disks?  While it seems that there
>> should be an advantage, I don't recall seeing posted evidence of such.
>> If there was a measurable advantage, it would be under conditions
>> which are unlikely in the real world.
> 
> Apparently I pulled it down at some point, so I don't have a URL for you
> anymore, but I did, and I posted.  Long story short, both raidzN and mirror
> configurations behave approximately the way you would hope they do.  That
> is...
> 
> Approximately, as compared to a single disk:  And I *mean* approximately,
> because I'm just pulling it back from memory the way I chose to remember it,
> which is to say, a simplified model that I felt comfortable with:

This model is completely wrong for writes. Suggest you deal with writes 
separately.

Also, the random reads must be small random reads, where I/O size << 128k.
For most common use cases, expect random reads to be 4k or 8k.
  -- richard

>                 seq rd  seq wr  rand rd rand wr
> 2-disk mirror   2x      1x      2x      1x
> 3-disk mirror   3x      1x      3x      1x
> 2x 2disk mirr   4x      2x      4x      2x
> 3x 2disk mirr   6x      3x      6x      3x
> 3-disk raidz    2x      2x      1x      1x
> 4-disk raidz    3x      3x      1x      1x
> 5-disk raidz    4x      4x      1x      1x
> 6-disk raidz    5x      5x      1x      1x
> 
> I went on to test larger and more complex arrangements...  Started getting
> things like 1.9x and 1.8x where I would have expected 2x and so forth...
> Sorry for being vague now, but the data isn't in front of me anymore.  Might
> not ever be again.
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-07 Thread Bob Friesenhahn

On Mon, 7 May 2012, Edward Ned Harvey wrote:


Apparently I pulled it down at some point, so I don't have a URL for you
anymore, but I did, and I posted.  Long story short, both raidzN and mirror
configurations behave approximately the way you would hope they do.  That
is...

Approximately, as compared to a single disk:  And I *mean* approximately,


Yes, I remember your results.

In a few weeks I should be setting up a new system with OpenIndiana 
and 8 SAS disks.  This will give me an opportunity to test again. 
Last time I got to play was back in February 2008 and I did not bother 
to test raidz 
(http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf).


Most common benchmarking is sequential read/write and rarely 
read-file/write-file where 'file' is a megabyte or two and the file is 
different for each iteration.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-07 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Paul Kraus
> 
> Even with uncompressible data I measure better performance with
> compression turned on rather than off. 

*cough*

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-07 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
> 
> Has someone done real-world measurements which indicate that raidz*
> actually provides better sequential read or write than simple
> mirroring with the same number of disks?  While it seems that there
> should be an advantage, I don't recall seeing posted evidence of such.
> If there was a measurable advantage, it would be under conditions
> which are unlikely in the real world.

Apparently I pulled it down at some point, so I don't have a URL for you
anymore, but I did, and I posted.  Long story short, both raidzN and mirror
configurations behave approximately the way you would hope they do.  That
is...

Approximately, as compared to a single disk:  And I *mean* approximately,
because I'm just pulling it back from memory the way I chose to remember it,
which is to say, a simplified model that I felt comfortable with:
                seq rd  seq wr  rand rd rand wr
2-disk mirror   2x      1x      2x      1x
3-disk mirror   3x      1x      3x      1x
2x 2disk mirr   4x      2x      4x      2x
3x 2disk mirr   6x      3x      6x      3x
3-disk raidz    2x      2x      1x      1x
4-disk raidz    3x      3x      1x      1x
5-disk raidz    4x      4x      1x      1x
6-disk raidz    5x      5x      1x      1x

I went on to test larger and more complex arrangements...  Started getting
things like 1.9x and 1.8x where I would have expected 2x and so forth...
Sorry for being vague now, but the data isn't in front of me anymore.  Might
not ever be again.
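
For anyone who wants to play with that rule of thumb, here is a minimal
Python sketch of the simplified model in the table above.  It is purely
illustrative (ideal scaling; the real measurements came in closer to
1.8x-1.9x), and the function name and layout are mine, not from any tool:

    # Idealized scaling factors relative to a single disk, per the table above.
    def approx_scaling(vdev_type, disks_per_vdev, vdevs=1):
        """Return (seq_rd, seq_wr, rand_rd, rand_wr) multipliers."""
        if vdev_type == "mirror":
            per_vdev = (disks_per_vdev, 1, disks_per_vdev, 1)
        elif vdev_type.startswith("raidz"):
            parity = int(vdev_type[5:] or 1)   # "raidz" -> 1, "raidz2" -> 2, ...
            data = disks_per_vdev - parity
            per_vdev = (data, data, 1, 1)
        else:
            per_vdev = (1, 1, 1, 1)            # single disk
        return tuple(x * vdevs for x in per_vdev)

    print(approx_scaling("mirror", 2, vdevs=3))   # 3x 2disk mirr -> (6, 3, 6, 3)
    print(approx_scaling("raidz", 5))             # 5-disk raidz  -> (4, 4, 1, 1)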

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-05 Thread Richard Elling
On May 5, 2012, at 8:04 AM, Bob Friesenhahn wrote:

> On Fri, 4 May 2012, Erik Trimble wrote:
>> predictable, and the backing store is still only giving 1 disk's IOPS.   The 
>> RAIDZ* may, however, give you significantly more throughput (in MB/s) than a 
>> single disk if you do a lot of sequential read or write.
> 
> Has someone done real-world measurements which indicate that raidz* actually 
> provides better sequential read or write than simple mirroring with the same 
> number of disks?  While it seems that there should be an advantage, I don't 
> recall seeing posted evidence of such. If there was a measurable advantage, 
> it would be under conditions which are unlikely in the real world.

Why would one expect raidz to be faster? Mirrors will always win on reads 
because you
read from all sides of the mirror. 

Writes are a bit more difficult to predict and measure, mostly because ZFS 
writes to the 
pool are async.

> The only thing totally clear to me is that raidz* provides better storage 
> efficiency than mirroring and that raidz1 is dangerous with large disks.

space, performance, dependability: pick two
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-05 Thread Erik Trimble

On 5/5/2012 8:04 AM, Bob Friesenhahn wrote:

On Fri, 4 May 2012, Erik Trimble wrote:
predictable, and the backing store is still only giving 1 disk's 
IOPS.   The RAIDZ* may, however, give you significantly more 
throughput (in MB/s) than a single disk if you do a lot of sequential 
read or write.


Has someone done real-world measurements which indicate that raidz* 
actually provides better sequential read or write than simple 
mirroring with the same number of disks?  While it seems that there 
should be an advantage, I don't recall seeing posted evidence of such. 
If there was a measurable advantage, it would be under conditions 
which are unlikely in the real world.


The only thing totally clear to me is that raidz* provides better 
storage efficiency than mirroring and that raidz1 is dangerous with 
large disks.


Provided that the media reliability is sufficiently high, there are 
still many performance and operational advantages obtained from simple 
mirroring (duplex mirroring) with zfs.


Bob



I'll see what I can do about actual measurements.  Given that we're 
really recommending a minimum of RAIDZ2 nowadays (with disks > 1TB), that 
means, for N disks, you get N-2 data disks in a RAIDZ2, and N/2 disks in 
a standard striped mirror.   My brain says that even with the overhead 
of parity calculation, for doing sequential read/write of at least the 
slab size (i.e. involving all the data drives in a RAIDZ2),  performance 
for the RAIDZ2 should be better for N >= 6.  But, that's my theoretical 
brain, and we should do some decent benchmarking, to put some hard fact 
to that.
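
As a quick sanity check on that N >= 6 intuition, here is a small sketch
using the idealized per-disk scaling model posted elsewhere in this thread
(raidz sequential work scales with the number of data disks, mirrors write
one copy per pair but read from both sides).  Ideal scaling is assumed; no
parity-calculation overhead or bus limits:

    # N total disks: one N-wide raidz2 vs N/2 two-way mirrors (idealized).
    for n in range(4, 13, 2):
        raidz2_seq = n - 2      # N-2 data disks carry sequential rd/wr
        mirror_wr  = n // 2     # each pair writes one copy of the data
        mirror_rd  = n          # reads can hit both sides of each mirror
        print("%2d disks: raidz2 ~%dx rd/wr, mirrors ~%dx rd / ~%dx wr"
              % (n, raidz2_seq, mirror_rd, mirror_wr))

By that model the raidz2 pulls ahead of the mirrors on sequential writes
once N reaches 6, while the mirrors keep the edge on sequential reads.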


-Erik
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-05 Thread Bob Friesenhahn

On Fri, 4 May 2012, Erik Trimble wrote:
predictable, and the backing store is still only giving 1 disk's IOPS.   The 
RAIDZ* may, however, give you significantly more throughput (in MB/s) than a 
single disk if you do a lot of sequential read or write.


Has someone done real-world measurements which indicate that raidz* 
actually provides better sequential read or write than simple 
mirroring with the same number of disks?  While it seems that there 
should be an advantage, I don't recall seeing posted evidence of such. 
If there was a measurable advantage, it would be under conditions 
which are unlikely in the real world.


The only thing totally clear to me is that raidz* provides better 
storage efficiency than mirroring and that raidz1 is dangerous with 
large disks.


Provided that the media reliability is sufficiently high, there are 
still many performance and operational advantages obtained from simple 
mirroring (duplex mirroring) with zfs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-04 Thread Erik Trimble

On 5/4/2012 1:24 PM, Peter Tribble wrote:

On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey
  wrote:

I think you'll get better, both performance & reliability, if you break each
of those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:

Incorrect on reliability; see below.


Now, to put some numbers on this...
A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
sequential.  This means resilvering the entire disk sequentially, including
unused space, (which is not what ZFS does) would require 2.2 hours.  In
practice, on my 1T disks, which are in a mirrored configuration, I find
resilvering takes 12 hours.  I would expect this to be ~4 days if I were
using 5-disk raidz1, and I would expect it to be ~12 days if I were using
15-disk raidz3.

Based on your use of "I would expect", I'm guessing you haven't
done the actual measurement.

I see ~12-16 hour resilver times on pools using 1TB drives in
raidz configurations. The resilver times don't seem to vary
with whether I'm using raidz1 or raidz2.


Suddenly the prospect of multiple failures overlapping doesn't seem so
unlikely.

Which is *exactly* why you need multiple-parity solutions. Put
simply, if you're using single-parity redundancy with 1TB drives
or larger (raidz1 or 2-way mirroring) then you're putting your
data at risk. I'm seeing - at a very low level, but clearly non-zero -
occasional read errors during rebuild of raidz1 vdevs, leading to
data loss. Usually just one file, so it's not too bad (and zfs will tell
you which file has been lost). And the observed error rates we're
seeing in terms of uncorrectable (and undetectable) errors from
drives are actually slightly better than you would expect from the
manufacturers' spec sheets.

So you definitely need raidz2 rather than raidz1; I'm looking at
going to raidz3 for solutions using current high capacity (ie 3TB)
drives.

(On performance, I know what the theory says about getting one
disk's worth of IOPS out of each vdev in a raidz configuration. In
practice we're finding that our raidz systems actually perform
pretty well when compared with dynamic stripes, mirrors, and
hardware raid LUNs.)




Really, guys:  Richard, myself, and several others have covered how ZFS 
does resilvering (and on disk reliability, a related issue), and 
included very detailed calculations on IOPS required and discussions 
about slabs, recordsize, and how disks operate with regards to 
seek/access times and OS caching.


Please search the archives, as it's not fruitful to repost the exact 
same thing repeatedly.



Short version:  assuming identical drives and the exact same usage 
pattern and /amount/ of data, the time it takes the various ZFS 
configurations to resilver is N for ANY mirrored config and  a bit less 
than N*M for a M-disk RAIDZ*, where M = the number of data disks in the 
RAIDZ* - thus a 6-drive (total) RAIDZ2 will have the same resilver time 
as a 5-drive (total) RAIDZ1.  Calculating what N is depends entirely on 
the pattern which the data was written on the drive.  You're always 
going to be IOPS-bound on the disk being resilvered.


Which RAIDZ* config to use (assuming you have a fixed tolerance for data 
loss) depends entirely on what your data usage pattern does to resilver 
times; configurations needing very long resilver times better have more 
redundancy. And, remember, larger configs will allow for more data to be 
stored, that also increases resilver time.


Oh, and a RAIDZ* will /only/ ever get you slightly more than 1 disk's 
worth of IOPS (averaged over a reasonable time period).  Caching may 
make it appear to give more IOPS in certain cases, but that's neither 
sustainable nor predictable, and the backing store is still only giving 
1 disk's IOPS.   The RAIDZ* may, however, give you significantly more 
throughput (in MB/s) than a single disk if you do a lot of sequential 
read or write.


-Erik

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-04 Thread Peter Tribble
On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey
 wrote:
>
> I think you'll get better, both performance & reliability, if you break each
> of those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:

Incorrect on reliability; see below.

> Now, to put some numbers on this...
> A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
> sequential.  This means resilvering the entire disk sequentially, including
> unused space, (which is not what ZFS does) would require 2.2 hours.  In
> practice, on my 1T disks, which are in a mirrored configuration, I find
> resilvering takes 12 hours.  I would expect this to be ~4 days if I were
> using 5-disk raidz1, and I would expect it to be ~12 days if I were using
> 15-disk raidz3.

Based on your use of "I would expect", I'm guessing you haven't
done the actual measurement.

I see ~12-16 hour resilver times on pools using 1TB drives in
raidz configurations. The resilver times don't seem to vary
with whether I'm using raidz1 or raidz2.

> Suddenly the prospect of multiple failures overlapping doesn't seem so
> unlikely.

Which is *exactly* why you need multiple-parity solutions. Put
simply, if you're using single-parity redundancy with 1TB drives
or larger (raidz1 or 2-way mirroring) then you're putting your
data at risk. I'm seeing - at a very low level, but clearly non-zero -
occasional read errors during rebuild of raidz1 vdevs, leading to
data loss. Usually just one file, so it's not too bad (and zfs will tell
you which file has been lost). And the observed error rates we're
seeing in terms of uncorrectable (and undetectable) errors from
drives are actually slightly better than you would expect from the
manufacturers' spec sheets.

So you definitely need raidz2 rather than raidz1; I'm looking at
going to raidz3 for solutions using current high capacity (ie 3TB)
drives.

(On performance, I know what the theory says about getting one
disk's worth of IOPS out of each vdev in a raidz configuration. In
practice we're finding that our raidz systems actually perform
pretty well when compared with dynamic stripes, mirrors, and
hardware raid LUNs.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-04 Thread Ray Van Dolson
On Thu, May 03, 2012 at 07:35:45AM -0700, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> > 
> > System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs of
> > 15
> > disks each -- RAIDZ3.  NexentaStor 3.1.2.
> 
> I think you'll get better, both performance & reliability, if you break each
> of those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:
> 
> Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
> operation, and on the 4th failure, you're toast.
> Obviously, with raidz1, if any 1 of 5 disks fail, you're still in operation,
> and on the 2nd failure, you're toast.
> 
> So it's all about computing the probability of 4 overlapping failures in the
> 15-disk raidz3, or 2 overlapping failures in a smaller 5-disk raidz1.  In
> order to calculate that, you need to estimate the time to resilver any one
> failed disk...
> 
> In ZFS, suppose you have a record of 128k, and suppose you have a 2-way
> mirror vdev.  Then each disk writes 128k.  If you have a 3-disk raidz1, then
> each disk writes 64k.   If you have a 5-disk raidz1, then each disk writes
> 32k.  If you have a 15-disk raidz3, then each disk writes 10.6k.  
> 
> Assuming you have a machine in production, and you are doing autosnapshots.
> And your data is volatile.  Over time, it serves to fragment your data, and
> after a year or two of being in production, your resilver will be composed
> almost entirely of random IO.  Each of the non-failed disks must read their
> segment of the stripe, in order to reconstruct the data that will be written
> to the new good disk.  If you're in the 15-disk raidz3 configuration...
> Your segment size is approx 3x smaller, which means approx 3x more IO
> operations.
> 
> Another way of saying that...  Assuming the amount of data you will write to
> your pool is the same regardless of which architecture you chose...  For
> discussion purposes, let's say you write 3T to your pool.  And let's
> momentarily assume your whole pool will be composed of 15 disks, in either a
> single raidz3, or in 3x 5-disk raidz1.  If you use one big raidz3, then the
> 3T will require at least 24million 128k records to hold it all, and each
> 128k record will be divided up onto all the disks.  If you use the smaller
> raidz1, then only 1T will get written to each vdev, and you will only need
> 8million records on each disk.  Thus, to resilver the large vdev, you will
> require 3x more IO operations.
> 
> Worse still, on each IO request, you have to wait for the slowest of all
> disks to return.  If you were in a 2-way mirror situation, your seek time
> would be the average seek time of a single disk.  But if you were in an
> infinite-disk situation, your seek time would be the worst case seek time on
> every single IO operation, which is about 2x longer than the average seek
> time.  So not only do you have 3x more seeks to perform, you have up to 2x
> longer to wait upon each seek...
> 
> Now, to put some numbers on this...
> A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
> sequential.  This means resilvering the entire disk sequentially, including
> unused space, (which is not what ZFS does) would require 2.2 hours.  In
> practice, on my 1T disks, which are in a mirrored configuration, I find
> resilvering takes 12 hours.  I would expect this to be ~4 days if I were
> using 5-disk raidz1, and I would expect it to be ~12 days if I were using
> 15-disk raidz3.
> 
> Your disks are all 2T, so you should double all the times I just wrote.
> Your raidz3 should be able to resilver a single disk in approx 24 days.
> Your 5-disk raidz1 should be able to do one in ~8 days.  If you were using
> mirrors, ~ 1 day.
> 
> Suddenly the prospect of multiple failures overlapping doesn't seem so
> unlikely.

Ed, thanks for taking the time to write this all out.  Definitely food
for thought.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-03 Thread Gary
On Thu, May 3, 2012 at 7:47 AM, Edward Ned Harvey wrote:

> Given the amount of ram you have, I really don't think you'll be able to get
> any useful metric out of iozone in this lifetime.

I still think it would be apropos if dedup and compression were being
used. In that case, does filebench have an option for testing either
of those?

-Gary
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-03 Thread Paul Kraus
On Thu, May 3, 2012 at 10:39 AM, Edward Ned Harvey
 wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Paul Kraus
>>
>>     If you have compression turned on (and I highly recommend turning
>> it on if you have the CPU power to handle it),
>
> What if he's storing video files, compressed files, or encrypted data?  Then
> compression is 100% waste.  So you should qualify a statement like that...
> Compression can be great, depending on the type of data to be stored.  In my
> usage scenarios, I usually benefit a lot, both in terms of capacity and
> speed, by enabling compression.

Even with uncompressible data I measure better performance with
compression turned on rather than off. I have been testing with random
data that shows a compressratio of 1:1. I will test with some real
data that is already highly compressed and see if that agrees with my
prior testing.
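
For reference, a minimal way to generate genuinely incompressible test
data for that kind of run (os.urandom output should leave `zfs get
compressratio` at about 1.00x).  The path and sizes below are made up
for illustration:

    import os

    def write_random_file(path, size_mb, chunk_mb=8):
        # Incompressible data: the compressor gets nothing to squeeze.
        chunk = chunk_mb * 1024 * 1024
        remaining = size_mb * 1024 * 1024
        with open(path, "wb") as f:
            while remaining > 0:
                f.write(os.urandom(min(chunk, remaining)))
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())   # push it to the pool, not just the page cache

    write_random_file("/tank/test/random.dat", size_mb=1024)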

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Assistant Technical Director, LoneStarCon 3 (http://lonestarcon3.org/)
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-03 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
> 
> Zfs is all about caching so the cache really does need to be included
> (and not intentionally broken) in any realistic measurement of how the
> system will behave.

I agree with what others have said - and this comment in particular.

The only useful thing you can do is to NOT break your system intentionally,
and instead find ways to emulate the real life jobs you want to do.  This is
exceptionally difficult, because in real life, your system will be on for a
long time, doing periodic snapshot rotation, and periodic scrubs, and people
will be doing all sorts of work scattered about on disk...  Sometimes
writing, sometimes reading, sometimes modifying, sometimes deleting.

The modifies and deletes are particularly important.  Because when you mix a
bunch of reads/writes/overwrites/deletes in with a bunch of snapshots
automatically being created & destroyed over time, these behaviors totally
change the way data gets distributed throughout your pool.  And the periodic
scrub will also affect your memory usage and therefore distribution
patterns.

Given the amount of ram you have, I really don't think you'll be able to get
any useful metric out of iozone in this lifetime.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-03 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Paul Kraus
> 
> If you have compression turned on (and I highly recommend turning
> it on if you have the CPU power to handle it), 

What if he's storing video files, compressed files, or encrypted data?  Then
compression is 100% waste.  So you should qualify a statement like that...
Compression can be great, depending on the type of data to be stored.  In my
usage scenarios, I usually benefit a lot, both in terms of capacity and
speed, by enabling compression.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-03 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> 
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs of
> 15
> disks each -- RAIDZ3.  NexentaStor 3.1.2.

I think you'll get better, both performance & reliability, if you break each
of those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:

Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
operation, and on the 4th failure, you're toast.
Obviously, with raidz1, if any 1 of 5 disks fail, you're still in operation,
and on the 2nd failure, you're toast.

So it's all about computing the probability of 4 overlapping failures in the
15-disk raidz3, or 2 overlapping failures in a smaller 5-disk raidz1.  In
order to calculate that, you need to estimate the time to resilver any one
failed disk...

In ZFS, suppose you have a record of 128k, and suppose you have a 2-way
mirror vdev.  Then each disk writes 128k.  If you have a 3-disk raidz1, then
each disk writes 64k.   If you have a 5-disk raidz1, then each disk writes
32k.  If you have a 15-disk raidz3, then each disk writes 10.6k.  

Assuming you have a machine in production, and you are doing autosnapshots.
And your data is volatile.  Over time, it serves to fragment your data, and
after a year or two of being in production, your resilver will be composed
almost entirely of random IO.  Each of the non-failed disks must read their
segment of the stripe, in order to reconstruct the data that will be written
to the new good disk.  If you're in the 15-disk raidz3 configuration...
Your segment size is approx 3x smaller, which means approx 3x more IO
operations.

Another way of saying that...  Assuming the amount of data you will write to
your pool is the same regardless of which architecture you chose...  For
discussion purposes, let's say you write 3T to your pool.  And let's
momentarily assume your whole pool will be composed of 15 disks, in either a
single raidz3, or in 3x 5-disk raidz1.  If you use one big raidz3, then the
3T will require at least 24million 128k records to hold it all, and each
128k record will be divided up onto all the disks.  If you use the smaller
raidz1, then only 1T will get written to each vdev, and you will only need
8million records on each disk.  Thus, to resilver the large vdev, you will
require 3x more IO operations.

Worse still, on each IO request, you have to wait for the slowest of all
disks to return.  If you were in a 2-way mirror situation, your seek time
would be the average seek time of a single disk.  But if you were in an
infinite-disk situation, your seek time would be the worst case seek time on
every single IO operation, which is about 2x longer than the average seek
time.  So not only do you have 3x more seeks to perform, you have up to 2x
longer to wait upon each seek...

Now, to put some numbers on this...
A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
sequential.  This means resilvering the entire disk sequentially, including
unused space, (which is not what ZFS does) would require 2.2 hours.  In
practice, on my 1T disks, which are in a mirrored configuration, I find
resilvering takes 12 hours.  I would expect this to be ~4 days if I were
using 5-disk raidz1, and I would expect it to be ~12 days if I were using
15-disk raidz3.

Your disks are all 2T, so you should double all the times I just wrote.
Your raidz3 should be able to resilver a single disk in approx 24 days.
Your 5-disk raidz1 should be able to do one in ~8 days.  If you were using
mirrors, ~ 1 day.

Suddenly the prospect of multiple failures overlapping doesn't seem so
unlikely.
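
For what it's worth, the arithmetic above is easy to reproduce as a
back-of-envelope script.  These are the same assumptions stated in the
text (1.0 Gbit/s sequential, 128k records, 3T of data), not measurements:

    TB = 1e12
    seq_bw = 1.0e9 / 8                        # 1.0 Gbit/s in bytes/s
    print("full 1T sequential pass: %.1f hours" % (1 * TB / seq_bw / 3600))   # ~2.2

    recordsize = 128 * 1024
    records = 3 * TB / recordsize
    # lands near the "24 million" figure above (TB vs TiB rounding)
    print("128k records for 3T: ~%.0f million" % (records / 1e6))
    print("records per 5-disk raidz1 vdev: ~%.0f million (1T each)" % (records / 3 / 1e6))

    for data_disks, label in [(1, "2-way mirror"), (2, "3-disk raidz1"),
                              (4, "5-disk raidz1"), (12, "15-disk raidz3")]:
        print("%-14s writes %5.1f KB of each 128k record per disk"
              % (label, recordsize / data_disks / 1024.0))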

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Richard Elling
more comments...

On May 1, 2012, at 10:41 AM, Ray Van Dolson wrote:

> On Tue, May 01, 2012 at 07:18:18AM -0700, Bob Friesenhahn wrote:
>> On Mon, 30 Apr 2012, Ray Van Dolson wrote:
>> 
>>> I'm trying to run some IOzone benchmarking on a new system to get a
>>> feel for baseline performance.
>> 
>> Unfortunately, benchmarking with IOzone is a very poor indicator of 
>> what performance will be like during normal use.  Forcing the system 
>> to behave like it is short on memory only tests how the system will 
>> behave when it is short on memory.
>> 
>> Testing multi-threaded synchronous writes with IOzone might actually 
>> mean something if it is representative of your work-load.
>> 
>> Bob
> 
> Sounds like IOzone may not be my best option here (though it does
> produce pretty graphs).

For performance analysis of ZFS systems, you need to consider the advantages
of the hybrid storage pool. I wrote a white paper last summer describing a model
that you can use with your performance measurements or data from vendor 
datasheets. 

http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf

And in presentation form, 
http://www.slideshare.net/relling/nexentastor-performance-tuning-openstorage-summit-2011

Recently, this model has been expanded and enhanced. Contact me offline, if you
are interested.

I have used IOzone, filebench, and vdbench for a lot of performance
characterization lately. Each has its own strengths, but all can build a
full characterization profile of a system.

For IOzone, I like to run a full characterization run (which precludes
multithreaded runs) for a spectrum of I/O sizes and WSS. Such info can be
useful to explore the boundaries of your system's performance and compare
it to other systems.

Also, for systems with > 50GB of RAM, there are some tunables needed for
good scaling under heavy write workloads. Alas, there is no perfect answer
and no single tunable setting works optimally for all cases. WIP. YMMV.

A single, summary metric is not very useful...

> bonnie++ actually gave me more realistic sounding numbers, and I've
> been reading good things about fio.

IMNSHO, bonnie++ is a totally useless benchmark. Roch dissected it rather nicely at
https://bigip-blogs-cms-adc.oracle.com/roch/entry/decoding_bonnie

[gag me!  Does Oracle have butt-ugly URLs or what? ;-)]
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Bob Friesenhahn

On Tue, 1 May 2012, Ray Van Dolson wrote:


Testing multi-threaded synchronous writes with IOzone might actually
mean something if it is representative of your work-load.


Sounds like IOzone may not be my best option here (though it does
produce pretty graphs).

bonnie++ actually gave me more realistic sounding numbers, and I've
been reading good things about fio.


None of these benchmarks is really useful other than to stress-test 
your hardware.  Assuming that the hardware is working properly, when 
you intentionally break the cache, IOzone should produce numbers 
similar to what you could have estimated from hardware specification 
sheets and an understanding of the algorithms.
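
For the cache-broken case that estimate is mostly arithmetic.  A rough
sketch for the 16x 15-disk raidz3 pool discussed in this thread, using
made-up per-disk figures from a typical 7200 RPM datasheet; real results
will be capped well below the sequential number by HBAs, expanders, and
the wire:

    disk_mb_s, disk_iops = 130, 75          # assumed datasheet figures
    vdevs, width, parity = 16, 15, 3
    data_disks = vdevs * (width - parity)

    print("sequential ceiling: ~%d MB/s" % (data_disks * disk_mb_s))
    print("small random IOPS ceiling: ~%d (about one disk's IOPS per vdev)"
          % (vdevs * disk_iops))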


Sun engineers used 'filebench' to do most of their performance testing 
because it allowed configuring the behavior to emulate various usage 
models.  You can get it from 
"https://sourceforge.net/projects/filebench/";.


Zfs is all about caching so the cache really does need to be included 
(and not intentionally broken) in any realistic measurement of how the 
system will behave.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Paul Kraus
On Tue, May 1, 2012 at 1:45 PM, Gary  wrote:

> The idea of benchmarking -- IMHO -- is to vaguely attempt to reproduce
> real world loads. Obviously, this is an imperfect science but if
> you're going to be writing a lot of small files (e.g. NNTP or email
> servers used to be a good real world example) then you're going to
> want to benchmark for that. If you're going to want to write a bunch
> of huge files (are you writing a lot of 16GB files?) then you'll want
> to test for that. Caching anywhere in the pipeline is important for
> benchmarks because you aren't going to turn off a cache or remove RAM
> in production are you?

It also depends on what you are going to be tuning. When I needed
to decide on a zpool configuration (# of vdevs, type of vdev, etc.)
I did not want the effect of the cache "hiding" the underlying
performance limitations of the physical drive configuration. In that
case I either needed to use a very large test data set or reduce the
size (effect) of the RAM. By limiting the ARC to 2 GB for my test, I
was able to relatively easily quantify the performance differences
between the various configurations. Once we picked a configuration, we
let the ARC take as much RAM as it wanted and re-ran the benchmark to
see what kind of real world performance we would get. Unfortunately,
we could not easily simulate 400 real world people sitting at desktops
accessing the data. So our ARC limited benchmark was effectively a
"worst case" number and the full ARC the "best case". The real world,
as usual, fell somewhere in between.

   Finding a benchmark tool that matches _my_ work load is why I have
started kludging together my own.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Gary
On 5/1/12, Ray Van Dolson wrote:

> The problem is this box has 144GB of memory.  If I go with a 16GB file
> size (which I did), then memory and caching influence the results
> pretty severely (I get around 3GB/sec for writes!).

The idea of benchmarking -- IMHO -- is to vaguely attempt to reproduce
real world loads. Obviously, this is an imperfect science but if
you're going to be writing a lot of small files (e.g. NNTP or email
servers used to be a good real world example) then you're going to
want to benchmark for that. If you're going to want to write a bunch
of huge files (are you writing a lot of 16GB files?) then you'll want
to test for that. Caching anywhere in the pipeline is important for
benchmarks because you aren't going to turn off a cache or remove RAM
in production are you?

-Gary
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Ray Van Dolson
On Tue, May 01, 2012 at 07:18:18AM -0700, Bob Friesenhahn wrote:
> On Mon, 30 Apr 2012, Ray Van Dolson wrote:
> 
> > I'm trying to run some IOzone benchmarking on a new system to get a
> > feel for baseline performance.
> 
> Unfortunately, benchmarking with IOzone is a very poor indicator of 
> what performance will be like during normal use.  Forcing the system 
> to behave like it is short on memory only tests how the system will 
> behave when it is short on memory.
> 
> Testing multi-threaded synchronous writes with IOzone might actually 
> mean something if it is representative of your work-load.
> 
> Bob

Sounds like IOzone may not be my best option here (though it does
produce pretty graphs).

bonnie++ actually gave me more realistic sounding numbers, and I've
been reading good things about fio.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Ray Van Dolson
On Tue, May 01, 2012 at 03:21:05AM -0700, Gary Driggs wrote:
> On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote:
> 
> > Throughput:
> >iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
> >
> > IOPS:
> >iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt
> 
> Do you expect to be reading or writing 36 or 288GB files very often on
> this array? The largest file size I've used in my still lengthy
> benchmarks was 16GB. If you use the sizes you've proposed, it could
> take several days or weeks to complete. Try a web search for "iozone
> examples" if you want more details on the command switches.
> 
> -Gary

The problem is this box has 144GB of memory.  If I go with a 16GB file
size (which I did), then memory and caching influence the results
pretty severely (I get around 3GB/sec for writes!).

Obviously, I could yank RAM for purposes of benchmarking. :)

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Bob Friesenhahn

On Mon, 30 Apr 2012, Ray Van Dolson wrote:


I'm trying to run some IOzone benchmarking on a new system to get a
feel for baseline performance.


Unfortunately, benchmarking with IOzone is a very poor indicator of 
what performance will be like during normal use.  Forcing the system 
to behave like it is short on memory only tests how the system will 
behave when it is short on memory.


Testing multi-threaded synchronous writes with IOzone might actually 
mean something if it is representative of your work-load.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Paul Kraus
On Mon, Apr 30, 2012 at 4:15 PM, Ray Van Dolson  wrote:

> I'm trying to run some IOzone benchmarking on a new system to get a
> feel for baseline performance.

If you have compression turned on (and I highly recommend turning
it on if you have the CPU power to handle it), the IOzone data will be
flawed. I did not look deeper into it, but the data that IOzone uses
compresses very, very well. Much more so than any real data out there.
I used a combination of Filebench and Oracle's Orion to test ZFS
performance. Recently I started writing my own utilities for testing,
as _none_ of the existing offerings tested what I needed (lots and
lots of small, less than 64KB, files). My tool is only OK for relative
measures.
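
In case it helps anyone, the small-file kludge can be as simple as the
sketch below.  This is a generic illustration, not the actual utility
mentioned above; the path is a placeholder and, like that tool, it is
only good for relative comparisons:

    import os, random, time

    def small_file_run(directory, nfiles=10000, max_kb=64):
        # Write lots of small (< 64KB) files and report files/sec.  Without
        # per-file fsync most of this lands in the ARC/dirty data first.
        os.makedirs(directory, exist_ok=True)
        start = time.time()
        for i in range(nfiles):
            size = random.randint(1, max_kb) * 1024
            with open(os.path.join(directory, "f%06d" % i), "wb") as f:
                f.write(os.urandom(size))
        return nfiles / (time.time() - start)

    print("%.0f files/sec" % small_file_run("/tank/bench/smallfiles"))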

> Unfortunately, the system has a lot of memory (144GB), but I have some
> time so am approaching my runs as follows:

 When I was testing systems with more RAM than I wanted (when does
that ever happen :-), I capped the ARC at something rational (2GB, 4GB
etc) and ran the tests with file sizes four times the ARC limit.
Unfortunately, the siwiki site appears to be down (gone ???).

On Solaris 10, the following in /etc/system (and a reboot) will cap
the ZFS ARC to the amount of RAM specified (in bytes). Not sure on
Nexenta (and I have not had to cap the ARC on my Nexenta Core system
at home).

set zfs:zfs_arc_max = 4294967296
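
A trivial helper along those lines -- pick a cap, emit the /etc/system
line, and size the test data at roughly four times the cap, as described
above.  The sizes here are just examples:

    def arc_cap(cap_gib):
        # Returns the /etc/system line plus a suggested test file size.
        cap_bytes = cap_gib * 1024**3
        return "set zfs:zfs_arc_max = %d" % cap_bytes, 4 * cap_gib

    line, test_gib = arc_cap(4)
    print(line)                                  # set zfs:zfs_arc_max = 4294967296
    print("test with files of at least %d GiB" % test_gib)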

> Throughput:
>    iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
>
> IOPS:
>    iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt
>
> Not sure what I gain/lose by using threads or not.

IOzone without threads is single threaded and will demonstrate the
performance a single user or application will achieve. When you use
threads in IOzone you see performance for N simultaneous users (or
applications). In my experience, the knee in the performance vs. # of
threads curve happens somewhere between one and two times the number
of CPUs in the system. In other words, with a 16 CPU system,
performance scales linearly as the number of threads increases until
you get to somewhere between 16 and 32. At that point the performance
will start flattening out and eventually _decreases_ as you add more
threads.

 Using multiple threads (or processes or clients or etc.) is a
good way to measure how many simultaneous users your system can handle
(at a certain performance level).
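
If you want to find that knee empirically, a simple driver can just sweep
the -t value and save each run's output.  The flags below are the ones
used earlier in this thread; the 4G file size is only a placeholder, and
the script assumes it is run from a directory on the pool under test:

    import subprocess

    for threads in (1, 2, 4, 8, 16, 32):
        cmd = ["iozone", "-m", "-t", str(threads), "-T",
               "-r", "128k", "-o", "-s", "4G"]
        with open("iozone_t%02d.txt" % threads, "w") as out:
            subprocess.run(cmd, stdout=out, check=True)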

> Am I off on this?
>
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs of 15
> disks each -- RAIDZ3.  NexentaStor 3.1.2.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Assistant Technical Director, LoneStarCon 3 (http://lonestarcon3.org/)
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Gary Driggs
On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote:

> Throughput:
>iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
>
> IOPS:
>iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Do you expect to be reading or writing 36 or 288GB files very often on
this array? The largest file size I've used in my still lengthy
benchmarks was 16GB. If you use the sizes you've proposed, it could
take several days or weeks to complete. Try a web search for "iozone
examples" if you want more details on the command switches.

-Gary
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] IOzone benchmarking

2012-05-01 Thread Ray Van Dolson
I'm trying to run some IOzone benchmarking on a new system to get a
feel for baseline performance.

Unfortunately, the system has a lot of memory (144GB), but I have some
time so am approaching my runs as follows:

Throughput:
iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls

IOPS:
iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Not sure what I gain/lose by using threads or not.

Am I off on this?

System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs of 15
disks each -- RAIDZ3.  NexentaStor 3.1.2.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss