Re: [zfs-discuss] Re: Re: Re: Re: Proposal: multiple copies of user data

Bill Moore Fri, 15 Sep 2006 09:26:03 -0700

On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
> Implementing it at the directory and file levels would be even more
> flexible:  redundancy strategy would no longer be tightly tied to path
> location, but directories and files could themselves still inherit
> defaults from the filesystem and pool when appropriate (but could be
> individually handled when desirable).


The problem boils down to not having a way to express your intent that
works over NFS (where you're basically limited by POSIX) that you can
use from any platform (esp. ones where ZFS isn't installed).  If you
have some ideas, this is something we'd love to hear about.

> I've never understood why redundancy was a pool characteristic in ZFS
> - and the addition of 'ditto blocks' and now this new proposal (both
> of which introduce completely new forms of redundancy to compensate
> for the fact that pool-level redundancy doesn't satisfy some needs)
> just makes me more skeptical about it.

We have thought long and hard about this problem and even know how to
implement it (the name we've been using is Metaslab Grids, which isn't
terribly descriptive, or as Matt put it "a bag o' disks").  There are
two main problems with it, though.  One is failures.  The problem is
that you want the set of disks implementing redundancy (mirror, RAID-Z,
etc.) to be spread across fault domains (controller, cable, fans, power
supplies, geographic sites) as much as possible.  There is no generic
mechanism to obtain this information and act upon it.  We could ask the
administrator to supply it somehow, but writing such a description takes
effort and is prone to error.  That's why we have the model right now
where the administrator specifies how they want the disks spread out
across fault groups (vdevs).
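
To make the fault-domain concern concrete, here is a minimal Python sketch
of the check an automatic placement scheme would need to run.  It is not
ZFS code: the function name, the disk names, and the disk-to-controller
map are hypothetical, and the map stands in for exactly the
administrator-supplied description discussed above.

```python
from collections import Counter

def shared_fault_domains(vdev_disks, domain_of):
    """Return the fault domains that hold more than one disk of this vdev."""
    counts = Counter(domain_of[disk] for disk in vdev_disks)
    return sorted(dom for dom, n in counts.items() if n > 1)

# Hypothetical administrator-supplied map: disk name -> controller.
domain_of = {
    "c0t0d0": "ctrl0",
    "c0t1d0": "ctrl0",
    "c1t0d0": "ctrl1",
}

good_mirror = ["c0t0d0", "c1t0d0"]  # one disk per controller
bad_mirror = ["c0t0d0", "c0t1d0"]   # both disks behind ctrl0

good_conflicts = shared_fault_domains(good_mirror, domain_of)
bad_conflicts = shared_fault_domains(bad_mirror, domain_of)
```

The check itself is trivial; the hard part is that it is only as good as
the map, and the map has to cover every domain (cables, power, sites),
not just controllers.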

The second problem comes back to accounting.  If you can specify, on a
per-file or per-directory basis, what kind of replication you want, how
do you answer the statvfs() question?  I think the recent "discussions"
on this list illustrate the complexity and passion on both sides of the
argument.
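
A small Python sketch of why the statvfs() answer becomes ambiguous.  The
model is deliberately simplistic and is an assumption of this example: it
ignores metadata, RAID-Z parity, and compression, and the function name
and the 300 GiB figure are made up for illustration.

```python
def bytes_available(raw_free_bytes, copies):
    """User data that still fits if every new block is stored `copies` times."""
    return raw_free_bytes // copies

GIB = 2**30
raw_free = 300 * GIB  # hypothetical raw free space in the pool

# The same pool gives a different "space available" answer for each policy.
answers = {copies: bytes_available(raw_free, copies) for copies in (1, 2, 3)}

for copies, avail in answers.items():
    print(f"copies={copies}: {avail // GIB} GiB available")
```

statvfs() reports a single free-space figure per filesystem, so once the
replication level can differ per file or per directory, no one figure is
right for every future write.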

> (Not that I intend in any way to minimize the effort it might take to
> change that decision now.)

The effort is not actually that great.  All the hard problems we needed
to solve in order to implement this were basically solved when we did
the RAID-Z code.  As a matter of fact, you can see it in the on-disk
specification as well.  In the DVA, you'll notice an 8-bit field labeled
"GRID".  These are the bits that would describe, on a per-block basis,
what kind of redundancy we used.
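
For readers who want to find that field, here is a hedged Python sketch
that unpacks the first 64-bit word of a DVA.  The bit positions (ASIZE in
bits 0-23, GRID in bits 24-31, vdev in bits 32-63) are my reading of the
on-disk format document and should be treated as an assumption, not as a
substitute for the specification; the function name is hypothetical.

```python
def decode_dva_word0(word):
    """Split the first 64-bit DVA word into its fields (assumed layout)."""
    return {
        "asize": word & 0xFFFFFF,            # bits 0-23: allocated size field
        "grid": (word >> 24) & 0xFF,         # bits 24-31: the 8-bit GRID field
        "vdev": (word >> 32) & 0xFFFFFFFF,   # bits 32-63: vdev id
    }

# Build a sample word by hand: vdev 5, GRID 0xAB, ASIZE field 0x10.
sample = (5 << 32) | (0xAB << 24) | 0x10
fields = decode_dva_word0(sample)
```

In today's pools the GRID bits are simply carried along unused; the point
is only that space for per-block redundancy information already exists.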


--Bill
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
