> Quoth Steven Sim on Thu, May 17, 2007 at 09:55:37AM +0800:
> >    Gurus;
> >    I am exceedingly impressed by the ZFS although it is my humble
> >    opinion that Sun is not doing enough evangelizing for it.
> 
> What else do you think we should be doing?
> 
> 
> David

I'll jump in here.  I am a huge fan of ZFS.  At the same time, I know about 
some of its warts.

ZFS promises agility in data management and is a wonderful system.  At the 
same time, it rests on some assumptions that are antithetical to data 
agility, including:
* no online restriping: you cannot add or remove data or parity disks in an existing vdev
* no effective use of disks of varying sizes

In one breath ZFS says, "Look how well you can dynamically alter filesystem 
storage."

In another breath ZFS says, "Make sure your pools have identical spindles and 
that you have accurately predicted future bandwidth, access time, vdev size, 
and parity count, because you can't change any of that later."

I know you can tack new vdevs onto the pool down the road, but that misses 
the point.  Worse, if I accidentally add a vdev to a pool and then realize my 
mistake, I am sunk: once a vdev is added to a pool, it is attached to the 
pool forever.
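
To make that one-way door concrete, here is roughly how it plays out at the 
command line (pool and disk names are hypothetical):

  # zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0
  # zpool add tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0   <- the accidental add
  # zpool remove tank c1t0d0                           <- no undo; this fails

As of today, zpool remove only works on inactive hot spares, not on disks in 
a top-level vdev.  The closest thing to a safety net is "zpool add -n", which 
prints what would be added without actually doing it.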

Ideally, I could provision a vdev, later decide that I need a disk/LUN from 
that vdev, and simply remove it, shrinking the vdev's capacity.  I should be 
able to decide that my current redundancy is insufficient and allocate 
[b]any[/b] number of new parity disks.  I should be able to build a pool from 
a rack of 15x250GB disks and later add a rack of 11x750GB disks [b]to that 
same vdev[/b], not as another vdev.

I should have the luxury of deciding to put hot Oracle indexes on their own 
vdev, deallocating spindles from an existing vdev to build the new one.  And 
I should be able to change my mind later and put it all back.
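
Mirrors already offer a taste of this flexibility; RAIDZ offers none.  Again 
with hypothetical devices, on a mirrored pool:

  # zpool create mpool mirror c3t0d0 c3t1d0
  # zpool attach mpool c3t0d0 c3t2d0   <- now a three-way mirror
  # zpool detach mpool c3t2d0          <- and back to two-way

There is no attach/detach analogue that changes the data or parity width of 
a RAIDZ vdev.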

Most important is the access-time issue.  Since there are no partial-stripe 
reads in ZFS, access time for a RAIDZ vdev is the same as single-disk access 
time no matter how wide the stripe is: every read must wait on every data 
disk in the stripe.
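
Back-of-the-envelope, assuming a (made-up but typical) 8 ms per random read 
on a 7200 RPM drive:

  $ echo "scale=0; 1000/8" | bc
  125

That is ~125 random reads per second from one disk, and still ~125 from a 
10-wide RAIDZ, because every read touches every data disk.  The same ten 
disks as five mirror pairs would deliver several times that, since each 
spindle can service an independent read.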

How to evangelize better?

Get rid of the glaring "you can't change it later" problems.

Another thought: flash storage shows all the indicators of a disruptive 
technology as described in [i]The Innovator's Dilemma[/i], which means flash 
[b]will[/b] take over from hard disks.  It is inevitable.  ZFS is weak on 
access times but handles single-block corruption very nicely, and it can do 
very wide RAIDZ stripes, up to 256(?) devices, for mind-numbing throughput.

Flash has near-zero access times and relatively low throughput.  Flash is 
also prone to single-block failures once a block reaches its erase limit.

ZFS + Flash = near-zero access time, very high throughput and high data 
integrity.

To answer the question: get rid of the limitations and build a Thumper-like 
device using flash.  Market it for Oracle redo logs, temp space, swap space 
(flash is now cheaper per gigabyte than RAM), anything that needs massive 
throughput and ridiculous IOPS but not necessarily huge capacity.

Each month, the cost of flash will fall 4% anyway, so get ahead of the curve 
now.
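
Taking that 4%-per-month figure at face value, the compounding is dramatic:

  $ echo "scale=4; 0.96^12" | bc
  .6127

In a year flash costs about 61% of what it does today, a ~39% annual 
decline, halving roughly every 17 months.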

My 2 cents, at least.

Marty
 
 