On 5/4/2012 1:24 PM, Peter Tribble wrote:
On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey
<opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
I think you'll get better, both performance & reliability, if you break each
of those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:
Incorrect on reliability; see below.

Now, to put some numbers on this...
A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
sequential.  This means resilvering the entire disk sequentially, including
unused space (which is not what ZFS does), would require 2.2 hours.  In
practice, on my 1T disks, which are in a mirrored configuration, I find
resilvering takes 12 hours.  I would expect this to be ~4 days if I were
using 5-disk raidz1, and I would expect it to be ~12 days if I were using
15-disk raidz3.
Based on your use of "I would expect", I'm guessing you haven't
done the actual measurement.

I see ~12-16 hour resilver times on pools using 1TB drives in
raidz configurations. The resilver times don't seem to vary
with whether I'm using raidz1 or raidz2.
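
To redo the arithmetic, here's a rough back-of-envelope sketch (Python; the
1 Gbit/sec streaming rate, 128 KiB records, and ~100 random IOPS per drive
are illustrative assumptions, not measurements from any of these pools):

    # Rough resilver-time estimates; the drive numbers are assumptions.
    TB = 10**12                               # bytes in a marketing terabyte

    def sequential_hours(capacity_bytes, gbit_per_sec=1.0):
        """Time to stream the whole disk end to end."""
        seconds = capacity_bytes * 8 / (gbit_per_sec * 10**9)
        return seconds / 3600

    def iops_bound_hours(used_bytes, record_bytes=128 * 1024, iops=100):
        """Time if every record costs a random I/O on the rebuilding disk."""
        ios = used_bytes / record_bytes
        return ios / iops / 3600

    print(sequential_hours(1 * TB))           # ~2.2 h, the ideal streaming case
    print(iops_bound_hours(1 * TB))           # ~21 h if fully used, fully random
    print(iops_bound_hours(0.6 * TB))         # ~13 h for a 60% full disk

With these made-up numbers the streaming case reproduces the 2.2-hour figure
above, while the IOPS-bound estimates land in the same general range as the
12-16 hours actually observed.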

Suddenly the prospect of multiple failures overlapping doesn't seem so
unlikely.
Which is *exactly* why you need multiple-parity solutions. Put
simply, if you're using single-parity redundancy with 1TB drives
or larger (raidz1 or 2-way mirroring) then you're putting your
data at risk. I'm seeing - at a very low level, but clearly non-zero -
occasional read errors during rebuild of raidz1 vdevs, leading to
data loss. Usually just one file, so it's not too bad (and zfs will tell
you which file has been lost). And the observed error rates we're
seeing in terms of uncorrectable (and undetectable) errors from
drives are actually slightly better than you would expect from the
manufacturers' spec sheets.

So you definitely need raidz2 rather than raidz1; I'm looking at
going to raidz3 for solutions using current high-capacity (i.e. 3TB)
drives.
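
To put a rough number on that risk, here's a sketch using the 10^-14
unrecoverable-read-error rate that consumer spec sheets typically quote (an
assumption, and it treats errors as independent, which the observations
above suggest is a bit pessimistic):

    # Chance of hitting at least one unrecoverable read error (URE) while
    # reading the surviving disks to rebuild a single failed drive.
    def p_ure_during_rebuild(surviving_bytes, ber=1e-14):
        bits = surviving_bytes * 8
        return 1 - (1 - ber) ** bits

    # 5-disk raidz1 of full 1TB drives: rebuild reads ~4TB from the survivors.
    print(p_ure_during_rebuild(4 * 10**12))   # ~0.27
    # Same layout with 3TB drives: ~12TB to read.
    print(p_ure_during_rebuild(12 * 10**12))  # ~0.62

With raidz2 or raidz3, a single URE hit during a rebuild can still be
reconstructed from the remaining parity, which is the point of the extra
redundancy.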

(On performance, I know what the theory says about getting one
disk's worth of IOPS out of each vdev in a raidz configuration. In
practice we're finding that our raidz systems actually perform
pretty well when compared with dynamic stripes, mirrors, and
hardware raid LUNs.)



Really, guys: Richard, I, and several others have already covered how ZFS does resilvering (and disk reliability, a related issue), including very detailed calculations of the IOPS required and discussions of slabs, recordsize, and how disks behave with regard to seek/access times and OS caching.

Please search the archives, as it's not fruitful to repost the exact same thing repeatedly.


Short version: assuming identical drives and the exact same usage pattern and /amount/ of data, the time it takes the various ZFS configurations to resilver is N for ANY mirrored config and a bit less than N*M for a RAIDZ* with M data disks - thus a 6-drive (total) RAIDZ2 will have the same resilver time as a 5-drive (total) RAIDZ1. What N actually is depends entirely on the pattern in which the data was written to the drive. You're always going to be IOPS-bound on the disk being resilvered.
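
A trivial sketch of that rule of thumb (it only encodes the heuristic above;
N itself depends on your write pattern and has to be measured):

    # Relative resilver time: N for any mirror, roughly N * data disks for RAIDZ*.
    def mirror_resilver(N):
        return N                               # one disk's worth, any mirror width

    def raidz_resilver(N, total_disks, parity):
        data_disks = total_disks - parity
        return N * data_disks                  # "a bit less than" this, per above

    print(raidz_resilver(1, total_disks=6, parity=2))   # 4 data disks -> 4*N
    print(raidz_resilver(1, total_disks=5, parity=1))   # 4 data disks -> 4*N, same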

Which RAIDZ* config to use (assuming you have a fixed tolerance for data loss) depends entirely on what your data usage pattern does to resilver times; configurations needing very long resilver times had better have more redundancy. And remember, larger configs allow more data to be stored, which also increases resilver time.

Oh, and a RAIDZ* will /only/ ever get you slightly more than 1 disk's worth of IOPS (averaged over a reasonable time period). Caching may make it appear to give more IOPS in certain cases, but that's neither sustainable nor predictable, and the backing store is still only giving 1 disk's IOPS. The RAIDZ* may, however, give you significantly more throughput (in MB/s) than a single disk if you do a lot of sequential read or write.
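
As a minimal illustration of that IOPS-versus-throughput distinction (the
100 IOPS and 120 MB/s per-disk figures below are generic 7200rpm
assumptions, not measurements):

    # A raidz vdev gives ~1 disk of random IOPS but ~M data disks of streaming.
    def raidz_random_iops(per_disk_iops=100):
        return per_disk_iops                   # ~one disk's worth, regardless of width

    def raidz_sequential_mbps(data_disks, per_disk_mbps=120):
        return data_disks * per_disk_mbps      # sequential scales with data disks

    print(raidz_random_iops())                 # ~100 IOPS for the whole vdev
    print(raidz_sequential_mbps(4))            # ~480 MB/s for a 6-disk raidz2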

-Erik
