> Hello Kyle,
> 
> Wednesday, January 10, 2007, 5:33:12 PM, you wrote:
> 
> KM> Remember though that it's been mathematically figured that the
> KM> disadvantages to RaidZ start to show up after 9 or 10 drives. (That's
> 
> Well, nothing like this was proved, and definitely not mathematically.
> 
> It's just common sense advice - for many users, keeping raidz groups
> below 9 disks should give good enough performance. However, if someone
> creates a raidz group of 48 disks, he/she probably also expects
> performance, and in general raid-z won't offer it.

Wow, lots of good discussion here.  I started the idea of allowing a RAIDZ 
group to grow to an arbitrary number of drives because I was unaware of the 
downsides to massive pools.  From my RAID5 experience, a perfect world would be 
large numbers of data spindles and a sufficient number of parity spindles, e.g. 
99+17 (99 data drives and 17 parity drives).  In RAID5 this would give massive 
IOPS and redundancy.

After studying the code and reading the blogs, a few things have jumped out, 
with some interesting (and sometimes goofy) implications.  Since I am still 
learning, I could be wrong on any of the following.

RAIDZ pools operate with a storage granularity of one stripe.  If you request a 
read of a block within the stripe, you get the whole stripe.  If you modify a 
block within the stripe, the whole stripe is rewritten to a different location 
(a la copy-on-write).
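
To make that full-stripe, copy-on-write behavior concrete, here is a toy 
Python sketch.  It is my own illustration, not ZFS code; the sector size, 
helper names, and single-parity layout are all simplifying assumptions:

    from functools import reduce

    SECTOR = 512   # assumed sector size for the illustration

    def xor_parity(chunks):
        # XOR the corresponding bytes of every data chunk
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

    def make_stripe(block, n_data):
        # A logical block always maps onto a whole stripe:
        # n_data data sectors plus one parity sector.
        block = block.ljust(n_data * SECTOR, b'\0')
        chunks = [block[i*SECTOR:(i+1)*SECTOR] for i in range(n_data)]
        return chunks + [xor_parity(chunks)]

    # Modifying the block never touches the old stripe on disk; the caller
    # simply writes make_stripe(new_block, n_data) to a freshly allocated
    # location and updates the block pointer.

The point is that there is no in-place, partial-stripe update anywhere in 
this model.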

This implies that ANY read requires the whole stripe, and therefore requires 
every spindle to seek and read a sector.  All drives will return their sectors 
(mostly) simultaneously.  For performance purposes, a RAIDZ pool seeks like a 
single drive but has the throughput of multiple drives.  Unlike traditional 
RAID5, adding more spindles does NOT increase read IOPS.
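
A rough back-of-the-envelope calculation shows the consequence for random 
reads.  The per-spindle numbers below are hypothetical, picked only to make 
the arithmetic visible:

    # Hypothetical spindle figures: 100 random IOPS and 80 MB/s each,
    # in a 9-disk group (8 data + 1 parity).
    per_disk_iops = 100
    per_disk_mbs  = 80
    n_disks       = 9

    raid5_read_iops = n_disks * per_disk_iops      # small reads each hit one disk
    raidz_read_iops = per_disk_iops                # every read touches every disk
    raidz_read_mbs  = (n_disks - 1) * per_disk_mbs # but streams at near full width

    print(raid5_read_iops, raidz_read_iops, raidz_read_mbs)   # 900 100 640

So a wide RAIDZ group buys bandwidth, not random-read operations per second.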

Another implication is that ZFS checksums the stripe, not the component 
sectors.  If a drive silently returns a bad sector, all ZFS knows is that the 
whole stripe is bad (which could probably also be inferred from a bogus parity 
sector).  ZFS has no clue which drive produced the bad data, only that the 
whole stripe failed the checksum.  ZFS finds the offending sector by process of 
elimination: going through the sectors one at a time, throwing away the data 
actually read, reconstructing it from the parity, then determining whether the 
stripe passes the checksum.
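
Here is a sketch of that process of elimination for the single-parity case 
(again my own simplification, not the actual ZFS code; I am using SHA-256 as 
a stand-in for the block checksum, and fixed-size sectors rather than 
RAID-Z's variable-width stripes):

    import hashlib
    from functools import reduce

    def xor_all(sectors):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*sectors))

    def checksum(data):
        return hashlib.sha256(data).digest()   # stand-in for the real checksum

    def find_bad_sector(data_sectors, parity, expected_checksum):
        # Discard each data sector in turn, rebuild it from the parity plus
        # the remaining sectors, and see if the stripe then checksums clean.
        for i in range(len(data_sectors)):
            others = data_sectors[:i] + data_sectors[i+1:]
            rebuilt = xor_all(others + [parity])
            candidate = data_sectors[:i] + [rebuilt] + data_sectors[i+1:]
            if checksum(b"".join(candidate)) == expected_checksum:
                return i, rebuilt     # disk i returned bad data; here is the fix
        return None                   # more damage than one parity can repair

Crucially, the checksum stored in the block pointer is what arbitrates; the 
parity alone cannot say which drive lied.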

Two parity drives make this a bigger problem still, almost squaring the number 
of computations needed.  If a stripe has enough parity drives, then the cost of 
determining N bad data sectors in a stripe is roughly O(k^N), where k is some 
constant.
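
The combinatorics behind that estimate: if a stripe has D data sectors and N 
of them returned silently corrupted data, a brute-force search has to try 
every size-N subset as the "guilty" set, reconstruct it from parity, and 
re-checksum - roughly C(D, N), on the order of D^N / N!, attempts.  A quick 
count (the 48-disk raidz2 figure is just a hypothetical example):

    from math import comb

    D = 46                           # data sectors in a hypothetical 48-disk raidz2 stripe
    for n_bad in (1, 2):
        print(n_bad, comb(D, n_bad)) # 1 bad -> 46 tries, 2 bad -> 1035 tries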

Another implication is that there is no RAID5 "write penalty."  More 
accurately, the write penalty is incurred during the read operation where an 
entire stripe is read.
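
Counting the I/Os makes the comparison explicit.  The classic RAID5 
small-write penalty is four I/Os: read old data, read old parity, write new 
data, write new parity.  Under the simplest assumptions (nothing cached, one 
I/O per disk), a RAIDZ update of a cold block looks more like this sketch of 
mine:

    # Illustrative accounting only, not a benchmark.
    def raid5_small_write_ios():
        # read old data + read old parity + write new data + write new parity
        return 4

    def raidz_small_update_ios(n_disks):
        # read the whole old stripe (if it is not already cached),
        # then write a whole new stripe elsewhere (copy-on-write)
        return n_disks + n_disks

    print(raid5_small_write_ios(), raidz_small_update_ios(9))   # 4 18

In practice caching hides much of the read side, but the accounting shows 
where the cost moved.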

Finally, there is no need to rotate parity.  Rotating parity was introduced in 
RAID5 because every write of a single sector in a stripe also necessitated the 
read and subsequent write of the parity sector.  Since there are no partial 
stripe writes in ZFS, there is no need to read then write the parity sector.
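
For contrast, here is how parity placement rotates across spindles in a plain 
left-symmetric RAID5 layout - the pattern that full-stripe writes make 
unnecessary (the function and layout choice are mine, purely for 
illustration):

    # Which disk holds parity for each stripe in a simple rotating
    # (left-symmetric) RAID5 layout.
    def raid5_parity_disk(stripe, n_disks):
        return (n_disks - 1 - stripe) % n_disks

    print([raid5_parity_disk(s, 5) for s in range(5)])   # [4, 3, 2, 1, 0]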

For those in the know, where am I off base here?

Thanks!
Marty
 
 