(I looked at my email before checking here, so I'll just cut-and-paste the 
email response in here rather than send it.  By the way, is there a way to view 
just the responses that have accumulated in this forum since I last visited - 
or just those I've never looked at before?)

Bill Moore wrote:
> On Fri, Sep 15, 2006 at 01:23:31AM -0700, can you guess? wrote:
>> Implementing it at the directory and file levels would be even more
>> flexible:  redundancy strategy would no longer be tightly tied to path
>> location, but directories and files could themselves still inherit
>> defaults from the filesystem and pool when appropriate (but could be
>> individually handled when desirable).
> 
> The problem boils down to not having a way to express your intent that
> works over NFS (where you're basically limited by POSIX) that you can
> use from any platform (esp. ones where ZFS isn't installed).  If you
> have some ideas, this is something we'd love to hear about.

Well, one observation is that it seems downright silly to gate ZFS 
facilities on two-decade-old network file access technology:  sure, it's 
important to be able to *access* ZFS files using NFS, but does anyone really 
care if NFS can't express the full range of ZFS features - at least to the 
point of insisting that such features be suppressed entirely, rather than 
made available to local users (plus any remote users employing some future 
mechanism that *can* support them)?

That being said, you could always adopt the ReiserFS approach of 
allowing access to file/directory metadata via extended path 
specifications in environments like NFS where richer forms of 
interaction aren't available:  yes, it may feel a bit kludgey, but it gets the 
job done.
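
To make that concrete, here's a rough sketch (in Python, purely for 
illustration - the '..meta' path component is something I just made up, not 
actual ReiserFS or ZFS syntax) of how a server could peel a metadata 
reference off an ordinary-looking path, so that an NFS client limited to 
POSIX operations could still reach something like a per-file redundancy 
setting:

    from typing import Optional, Tuple

    META_MARKER = "..meta"   # hypothetical reserved path component

    def split_extended_path(path: str) -> Tuple[str, Optional[str]]:
        """Return (ordinary_path, metadata_attribute_or_None).

        "/export/home/foo.txt"               -> ("/export/home/foo.txt", None)
        "/export/home/foo.txt/..meta/copies" -> ("/export/home/foo.txt", "copies")
        """
        parts = path.split("/")
        if META_MARKER in parts:
            i = parts.index(META_MARKER)
            real = "/".join(parts[:i]) or "/"
            attr = "/".join(parts[i + 1:]) or None
            return real, attr
        return path, None

    print(split_extended_path("/export/home/foo.txt"))
    print(split_extended_path("/export/home/foo.txt/..meta/copies"))

Anything that can open or write a path - which is to say, anything - could 
then get at the metadata, at the cost of reserving one magic name.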

And, of course, even if you did nothing to help NFS, its users would still 
benefit from inheriting whatever arbitrarily fine-grained redundancy levels 
had been established via more comprehensive means:  they just wouldn't be 
able to tweak those levels themselves (any more, or any less, than they can 
today).
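
For what it's worth, the inheritance I have in mind is nothing fancier than 
"walk up until someone has set a value, else fall back to the pool default" - 
roughly like this (illustrative Python only; the names are mine, not ZFS's):

    from typing import Optional

    class Node:
        """A file, directory, or filesystem; copies=None means 'inherit'."""
        def __init__(self, name: str, parent: Optional["Node"] = None,
                     copies: Optional[int] = None):
            self.name = name
            self.parent = parent
            self.copies = copies

        def effective_copies(self, pool_default: int = 1) -> int:
            node: Optional[Node] = self
            while node is not None:
                if node.copies is not None:
                    return node.copies
                node = node.parent
            return pool_default

    fs   = Node("tank/home", copies=2)                  # filesystem default
    home = Node("alice", parent=fs)                     # inherits 2
    temp = Node("scratch.dat", parent=home, copies=1)   # explicitly overridden

    print(home.effective_copies())   # 2
    print(temp.effective_copies())   # 1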

> 
>> I've never understood why redundancy was a pool characteristic in ZFS
>> - and the addition of 'ditto blocks' and now this new proposal (both
>> of which introduce completely new forms of redundancy to compensate
>> for the fact that pool-level redundancy doesn't satisfy some needs)
>> just makes me more skeptical about it.
> 
> We have thought long and hard about this problem and even know how to
> implement it (the name we've been using is Metaslab Grids, which isn't
> terribly descriptive, or as Matt put it "a bag o' disks").

Yes, 'a bag o' disks' - used intelligently at a higher level - is pretty much 
what I had in mind.

> There are
> two main problems with it, though.  One is failures.  The problem is
> that you want the set of disks implementing redundancy (mirror, RAID-Z,
> etc.) to be spread across fault domains (controller, cable, fans, power
> supplies, geographic sites) as much as possible.  There is no generic
> mechanism to obtain this information and act upon it.  We could ask the
> administrator to supply it somehow, but such a description takes effort,
> is not easy, and prone to error.  That's why we have the model right now
> where the administrator specifies how they want the disks spread out
> across fault groups (vdevs).

Without having looked at the code, I may be missing something here.  But 
even with your current implementation, if there's indeed no automated way to 
obtain such information, the administrator already has to exercise manual 
control over disk groupings in order to attain higher availability by 
avoiding other single points of failure, rather than just guarding against 
unrecoverable data loss from disk failure.  Once that information has been 
made available to the system, letting it use that information at a finer 
granularity, rather than just for aggregating entire physical disks, should 
not entail additional administrator effort.

I admit that I haven't considered the problem in great detail, since my bias 
is toward solutions that scale up using redundant arrays of inexpensive 
nodes rather than a small number of very large nodes (in part because a 
single large node can itself be a single point of failure even if many of 
its subsystems carefully avoid being so in the manner that you suggest). 
Each such small node has a relatively low disk count and little or no 
internal redundancy, and thus constitutes its own little fault-containment 
environment, avoiding most such issues.  As a plus, such node sizes mesh 
well with the bandwidth available from very inexpensive Gigabit Ethernet 
interconnects and switches (even when streaming data sequentially, as for 
video on demand), and they allow fine-grained incremental system scaling (by 
the time faster interconnects become inexpensive, disk bandwidth should have 
increased enough that the balance will still be fairly good).

Still, if you can group whole disks intelligently in a large system so that 
simple redundancy is supplemented by higher overall subsystem availability, 
then you ought to be able to use exactly the same information to make 
higher-level decisions about where to place redundant data at finer than 
whole-disk granularity.
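
In other words, given the same admin-supplied mapping from disk to fault 
domain, placement at any granularity largely reduces to "pick N devices 
spanning as many distinct domains as possible".  A toy sketch of that 
selection (the data model is invented for illustration, and is certainly not 
how the metaslab/grid code would actually do it):

    from itertools import combinations
    from typing import Dict, List

    # disk -> fault domain (controller, cabinet, site, ...), per the admin
    FAULT_DOMAIN: Dict[str, str] = {
        "c0t0d0": "ctrl0", "c0t1d0": "ctrl0",
        "c1t0d0": "ctrl1", "c1t1d0": "ctrl1",
        "c2t0d0": "ctrl2",
    }

    def pick_devices(copies: int) -> List[str]:
        """Choose `copies` disks covering as many distinct domains as possible."""
        best: List[str] = []
        best_domains = -1
        for combo in combinations(FAULT_DOMAIN, copies):
            domains = len({FAULT_DOMAIN[d] for d in combo})
            if domains > best_domains:
                best, best_domains = list(combo), domains
        return best

    print(pick_devices(3))   # one disk from each of ctrl0, ctrl1, ctrl2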

> 
> The second problem comes back to accounting.  If you can specify, on a
> per-file or per-directory basis, what kind of replication you want, how
> do you answer the statvfs() question?  I think the recent "discussions"
> on this list illustrate the complexity and passion on both sides of the
> argument.

I rather liked the idea of using the filesystem's *default* redundancy 
level as the basis for reporting free space, though in environments where 
different users are set up with different defaults, using the per-user 
default might make sense (then less obvious things would happen only if that 
default were manually changed, presumably by the user in question).

Overall, I think free space should perhaps be reported on the basis of 
things that the user does *not* have control over, such as the default 
flavor of redundancy established by the administrator - i.e., as the number 
of bytes the user could write using that default flavor (which is what I was 
starting to converge on just above).  Then the user would mostly see only 
discrepancies caused by his or her own departures from that default, and 
should be able to understand them (though if the user has personal 'temp' 
space, I suppose the admin might have special-cased it by making it 
non-redundant).
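
Just to spell out the arithmetic of that reporting rule (made-up numbers, 
and obviously not the actual statvfs() plumbing):

    def reportable_free_bytes(raw_free_bytes: int, default_copies: int) -> int:
        """Free space as seen through the administrator-set default redundancy."""
        return raw_free_bytes // default_copies

    raw_free = 3 * 2**40   # 3 TiB of raw free space in the pool

    print(reportable_free_bytes(raw_free, default_copies=2))  # ~1.5 TiB reported
    print(reportable_free_bytes(raw_free, default_copies=1))  # 3 TiB if non-redundant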

Then again, whenever one traverses a mount point today (not always an 
obvious transition), the whole world of free space (and, I'd expect, quotas) 
changes anyway, and users don't seem to find that an insurmountable 
obstacle.  So I find it difficult to see free-space reporting as a real 
show-stopper here, regardless of how it's done (though, like most people who 
contributed to that topic, I have a preference).

> 
>> (Not that I intend in any way to minimize the effort it might take to
>> change that decision now.)
> 
> The effort is not actually that great.  All the hard problems we needed
> to solve in order to implement this were basically solved when we did
> the RAID-Z code.  As a matter of fact, you can see it in the on-disk
> specification as well.  In the DVA, you'll notice an 8-bit field labeled
> "GRID".  These are the bits that would describe, on a per-block basis,
> what kind of redundancy we used.

The only reason I can think of for specifying that per block (rather than 
per object) is that you might keep per-block access-rate information around 
so that you could distribute really hot blocks more widely.  Given that such 
blocks would normally be in cache anyway, that seems to make sense only in a 
distributed environment (where you're trying to spread load across multiple 
nodes more because of interconnect bandwidth limits than disk bandwidth 
limits - though even there you could do it at the cache level rather than 
the on-disk level, based on dynamic needs).
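
For what it's worth, 8 bits per block is certainly enough room to describe a 
redundancy flavor.  A purely hypothetical packing (and I should stress this 
is *not* the actual GRID encoding, since nothing here describes one) might 
look like:

    def pack_grid(level: int, width_sel: int) -> int:
        """Hypothetically: 3 bits of redundancy level, 5 bits of width selector."""
        assert 0 <= level < 8 and 0 <= width_sel < 32
        return (level << 5) | width_sel

    def unpack_grid(grid: int):
        return (grid >> 5) & 0x7, grid & 0x1f

    g = pack_grid(level=2, width_sel=9)
    print(hex(g), unpack_grid(g))   # 0x49 (2, 9)

so the field itself doesn't constrain the design much.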

- bill
 
 