Hi All,

As part of my CS dissertation I was playing with ZFS, attempting to
improve upon the standard RAID-5-esque feature set of RAID-Z.  This is
one of the features I had hoped to get working (though in the end I
didn't finish it), and I think I have a reasonable idea of how it
could be done if anyone else is interested in lending a hand...

The project aimed to do two things:
 1) Use all available space on a RAID-Z with mismatched disks (i.e.
    don't cap every disk to the size of the smallest).
 2) Allow the user to grow (or shrink) the array.

With 1) there is the difficulty that the full stripe width changes
with logical volume offset.  What I ended up doing was specializing
the space map initialization code to be vdev-specific, with the
RAID-Z vdev configuring the space map so that metaslab boundaries
align with changes in the number of disks in a full-sized stripe.
This makes it much easier to perform the translation of lsize to
asize, which now needs to be done once per metaslab during allocation
rather than once per vdev...
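
To make that concrete, here's a rough sketch of the kind of thing I
mean (hypothetical structure and function names, not the code from my
tree): regions of constant stripe width are derived from the sorted
disk sizes, each region becomes one or more metaslabs, and the
lsize-to-asize translation only needs the width of the region the
allocation falls in.

  #include <stdint.h>

  /*
   * Sketch only: given per-disk sizes sorted ascending, the number of
   * disks in a full stripe drops each time we pass the end of a
   * smaller disk.  Metaslab boundaries are placed at those offsets so
   * the stripe width is constant within any one metaslab.
   */
  typedef struct raidz_region {
          uint64_t rr_start;    /* first per-disk offset of the region */
          uint64_t rr_end;      /* end of the region (exclusive) */
          int      rr_ndisks;   /* disks contributing to a full stripe */
  } raidz_region_t;

  static int
  raidz_build_regions(const uint64_t *sizes, int ndisks,
      raidz_region_t *regions)
  {
          uint64_t prev = 0;
          int nregions = 0;

          for (int i = 0; i < ndisks; i++) {
                  if (sizes[i] == prev)
                          continue;   /* same size, just a wider region */
                  regions[nregions].rr_start = prev;
                  regions[nregions].rr_end = sizes[i];
                  regions[nregions].rr_ndisks = ndisks - i;
                  prev = sizes[i];
                  nregions++;
          }
          return (nregions);
  }

  /* lsize -> asize for an allocation in a region of a given width. */
  static uint64_t
  raidz_region_asize(uint64_t lsize, int ndisks, int nparity,
      uint64_t ashift)
  {
          uint64_t cols = lsize >> ashift;       /* data sectors */
          uint64_t ndata = ndisks - nparity;     /* data cols per stripe */
          uint64_t nstripes = (cols + ndata - 1) / ndata;

          return ((cols + nstripes * nparity) << ashift);
  }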

I tested these changes by modifying small amounts of infrastructure
(ztest/zdb) to allow the creation of arrays with varying-sized disks
(ztest reported no problems).  I also tested performance at a macro
level (IOzone) and at a micro level with DTrace, averaging the thread
time taken to execute the modified functions.  In this implementation
there was minimal (I think!) additional computational overhead:
O(ndisks) for metaslab_distance and free_dva, and O(lg n) for
raidz_asize.  DTrace showed almost no change in execution times (for
reasonable numbers of disks -- it starts getting interesting/visible
at 64 devices in the RAID :) ).  IOzone showed write speed to be
almost identical and read speed to be ~7% worse -- I suspect this is
down to my dodgy disk array (1x40GB plus 3x120GB disks).  With these
changes total available space was 257G as opposed to standard
RAID-Z's 113G :D
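
As a round-number sanity check on those figures (GB rather than the
GiB the tools report, and ignoring pool overhead), with one 40GB and
three 120GB disks under single parity:

  #include <stdint.h>
  #include <stdio.h>

  int
  main(void)
  {
          /* Standard RAID-Z: every disk capped to the smallest (40GB). */
          uint64_t standard = (4 - 1) * 40;

          /* Variable width: 4 disks wide up to 40GB, 3 wide for the rest. */
          uint64_t variable = (4 - 1) * 40 + (3 - 1) * (120 - 40);

          printf("standard %lluGB, variable %lluGB\n",
              (unsigned long long)standard, (unsigned long long)variable);
          return (0);
  }

That prints 120GB vs 280GB, which lines up roughly with the 113G and
257G above once you convert GB to GiB.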

There was one regression, however: replacing disks.  This will need
the same infrastructure as adding a disk to the array (point 2 above).

My idea for fixing this was to use the GRID part of the block pointer
to allow versioning of the RAID-Z disk arrangement.  Whenever a disk
is added/replaced, the space map is 'munged', the layout for the new
array is created and stored, and the GRID is incremented.  The space
map then always contains a map of free space in the current disk
arrangement; an in-memory array of disk arrays is used for resolving
blocks, with each block individually addressed by <vdev,offset,grid>.
New blocks are always written using the most recent value of GRID,
and my vdev_raidz code currently takes GRID into account when
resolving blocks.
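
Roughly what the in-memory side looks like in my head (sketch only,
names hypothetical; the 8-bit GRID field in the block pointer gives
256 possible layouts):

  #include <assert.h>
  #include <stdint.h>

  typedef struct raidz_layout {
          int       rl_ndisks;       /* disk count when this layout was live */
          uint64_t *rl_disk_sizes;   /* per-disk sizes at that time */
  } raidz_layout_t;

  typedef struct raidz_vdev {
          int            rv_ngrids;        /* layouts created so far */
          raidz_layout_t rv_layouts[256];  /* indexed by 8-bit GRID */
  } raidz_vdev_t;

  /* Reads use the layout the block was written under (the bp's GRID)... */
  static const raidz_layout_t *
  raidz_layout_for_grid(const raidz_vdev_t *rv, uint8_t grid)
  {
          assert(grid < rv->rv_ngrids);
          return (&rv->rv_layouts[grid]);
  }

  /* ...while new writes always use the newest layout. */
  static uint8_t
  raidz_current_grid(const raidz_vdev_t *rv)
  {
          return ((uint8_t)(rv->rv_ngrids - 1));
  }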

What I haven't implemented is updating the space map and persisting
the versions of the array.  Both strike me as relatively hard.  For
example, I'm unsure how to go about locking and modifying the whole
space map while the pool is mounted.  Also, when freeing blocks there
needs to be a mechanism to iterate through the RAID-Z versions so
that a previous grid/offset can be translated to blocks in the
current grid's space map.
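
My rough thinking for that free path, building on the layout structs
sketched above (the helpers here are hypothetical, and mapping a
column between layouts is where the real work would be):

  /* One column of a block: a physical disk plus an offset on it. */
  typedef struct raidz_col {
          int      rc_disk;       /* physical disk index */
          uint64_t rc_doffset;    /* offset on that disk */
          uint64_t rc_size;       /* column size */
  } raidz_col_t;

  /* Hypothetical helpers: split a block into columns under one layout,
     re-express a column as an offset under another layout, and free a
     range in the current space map. */
  extern int raidz_map_to_cols(const raidz_layout_t *, uint64_t offset,
      uint64_t asize, raidz_col_t *cols, int maxcols);
  extern uint64_t raidz_col_to_offset(const raidz_layout_t *,
      const raidz_col_t *col);
  extern void space_map_free_at(uint64_t offset, uint64_t size);

  static void
  raidz_free_old_grid(raidz_vdev_t *rv, uint8_t grid, uint64_t offset,
      uint64_t asize)
  {
          const raidz_layout_t *old = raidz_layout_for_grid(rv, grid);
          const raidz_layout_t *cur =
              raidz_layout_for_grid(rv, raidz_current_grid(rv));
          raidz_col_t cols[16];
          int ncols = raidz_map_to_cols(old, offset, asize, cols, 16);

          for (int c = 0; c < ncols; c++) {
                  /* Free each column at its offset in the current layout. */
                  space_map_free_at(raidz_col_to_offset(cur, &cols[c]),
                      cols[c].rc_size);
          }
  }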

Also difficult is the way RAID-Z currently accounts for single-block
holes of unallocatable space by rounding up writes.  If a disk is
added, we end up with single-block holes everywhere.  One possible
solution is to let the metaslab handle it, passivating the metaslab
when there are no block runs of the minimum size remaining -- though
to me the metaslab code is quite scary, with magical bit manipulation,
and the hairy interaction between alloc_dva and group_alloc isn't
well commented and isn't intuitive...
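
For reference, the roundup I'm talking about is (roughly, from memory)
the tail of the existing RAID-Z asize calculation, which pads every
allocation to a multiple of nparity + 1 sectors so that any leftover
free space is always at least that big:

  #include <stdint.h>

  /* Paraphrased from vdev_raidz_asize() as I remember it, not my
     modified version: cols is the disk count, nparity the parity level. */
  static uint64_t
  raidz_asize_roundup(uint64_t psize, uint64_t cols, uint64_t nparity,
      uint64_t ashift)
  {
          uint64_t asize = ((psize - 1) >> ashift) + 1;     /* data sectors */

          /* one parity sector per stripe's worth of data sectors */
          asize += nparity * ((asize + cols - nparity - 1) /
              (cols - nparity));

          /* pad to a multiple of nparity + 1 so no tiny hole is left */
          asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);

          return (asize << ashift);
  }

Once a disk has been added, that padding was computed against the old
width, which is why the single-block holes appear.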

The beauty of this is that with RAID-Z you don't have to rewrite all
the data in one go when adding to the array, which, as has already
been mentioned, is cumbersome and slow.  Of course you won't get all
the disk space available at once, but with copy-on-write, data will
spread over all the disks gradually.  The 256-change limit (GRID is
only 8 bits) is restrictive, but if Sun adds the 'rewrite data'
facility this can be overcome (and support for removing disks would
be a simple extension).

While I didn't get a chance to finish the above (it turns out the
dissertation text is more important than the code...), having
graduated, started work and moved house just this weekend, it would
be nice to continue working on the project in the evenings/weekends.
At the very least I still have a bunch of random disks that need a
reliable filesystem ;).

If anyone is interested, or is familiar with ZFS internals and has
any input on the issues I've mentioned above, I'd be hugely grateful.
In particular, on rewriting the space map on the fly (and updating
metaslabs to point at the new regions of the array), persisting the
array layout to disk, and the RAID-Z roundup, I'm all ears!

Cheers,

James

On 7/31/07, Adam Leventhal <ahl at eng.sun.com> wrote:
> Thanks for explaining the constraints you'd like to see on any potential
> solution. It would be possible to create some sort of method for extending
> an existing RAID-Z stripe, but it would be quite complicated.
>
> I think it's fair to say that while the ZFS team at Sun is working on some
> facilities that will be required for this sort of migration, their priorities
> lie elsewhere. The OpenSolaris community at large, however, may see this as
> a high enough priority that some group wants to give it a shot. I suggest
> that you file an RFE at least.
>
> Adam
>
> On Mon, Jul 30, 2007 at 05:55:11PM -0700, Dave Johnson wrote:
> > They perform it while online.  The operation takes an extensive amount of
> > time... presumably due to the overhead involved in performing such an
> > exhaustive amount of data manipulation.
> >
> > There are optimizations one could make, but for simplicity, I expect this
> > would be one way a hardware controller could expand a RAID-5 array:
> >
> > -Keep track of an "access method" address using a "utility area" on the
> > existing array (used to keep track of the address in the array beyond which
> > the "new" stripe size is used; needs to be kept updated on disk in case of
> > power outage during array expansion).
> >
> > -Logically "relocate" first stripe of data on existing array to an area
> > inside the "utility area" created previously for this purpose
> >
> > -Modify controller logic to add a temporary "stripe access method" check to
> > the access algorithm (used from this point forward until expansion is
> > complete)
> >
> > -Read data from full stripe on the disk starting at address 00 (Stripe "A")
> >
> > -Read additional data from additional stripes on disk until the aggregation
> > of stripe reads is greater than or equal to the new stripe size
> >
> > -Write aggregated data in new stripe layout to previously empty stripe, plus
> > blocks from newly added stripe members
> >
> > -Update "stripe access method" address
> >
> > -Read next stripe
> >
> > -Aggregate data left over from previously read stripe with next stripe
> >
> > -Write new stripe in similar fashion as above
> >
> > -Update "stripe access method" address
> >
> > -Wash, rinse, repeat
> >
> > -Write relocated stripe 00 back to beginning of array
> >
> > -Remove additional logic to check for "access method" for array
> >
> >
> > How one would perform such an operation in ZFS is left as an exercise for
> > the reader :)
> >
> > -=dave
> >
> > ----- Original Message -----
> > From: "Adam Leventhal" <ahl at eng.sun.com>
> > To: "MC" <rac at eastlink.ca>
> > Cc: <zfs-code at opensolaris.org>
> > Sent: Monday, July 30, 2007 4:06 PM
> > Subject: Re: [zfs-code] Raid-Z expansion
> >
> >
> > >> RAIDz does not let you do this:  Start from one disk, add another disk
> > >> to mirror the data, add another disk to make it a RAIDz array, and add
> > >> another disk to increase the size of the RAIDz array.
> > >
> > > That's true: today you can't expand a RAID-Z stripe or 'promote' a mirror
> > > to be a RAID-Z stripe. Given the current architecture, I'm not sure how
> > > that would be done exactly, but it's an interesting thought experiment.
> > >
> > > How do other systems work? Do they take the pool offline while they
> > > migrate data to the new device in the RAID stripe, or do they do this
> > > online? How
> > > would you propose this work with ZFS?
> > >
> > > Adam
> > >
> > > --
> > > Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
>
> --
> Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
> _______________________________________________
> zfs-code mailing list
> zfs-code at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-code
>
