>>>>> "j" == jkait  <jkait...@gmail.com> writes:

     j> [b] The Lustre team, I believe, is looking at porting the DMU
     j> **not** the entire zfs stack.

wow.  That's even more awesome.  In that case, since they are more or
less making their own filesystem, maybe it will be natural to validate
checksums on the clients.

     j> http://www.enterprisestorageforum.com/continuity/news/article.php/3672651

meh, wake me when it's over.

Another thing which interests me, in light of recent discussion, is
checksums that can be broken if write barriers are violated.  With
just a checksum it's forever impossible to tell whether your data is
``up-to-date'', because a checksum that's valid today will still be
valid tomorrow.  But you can tell whether a bag of checksums is
consistent with itself, and perhaps be warned if the filesystem has
recovered to some new and seemingly-valid state through which, were
it respecting fsync() barriers, it could never have passed before the
data loss.  With this feature, instead of just flagging the insides
of files as invalid, ZFS could put seals on whole datasets, and we
would see those checksum seals broken if we disabled the ZIL.  It
could become meaningful to put a seal on a hierarchy of datasets, a
seal which would be broken if you mounted a tree of snapshots of
those datasets that were not taken atomically.  This also becomes
more meaningful with filesystems like HAMMER that have infinite
snapshots, where you may want metadata checksums to seal the
filesystem's history, a history which could be broken if drives write
checksum-sized blocks but write them in the wrong order.
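
Here is roughly what I have in mind, sketched in Python with made-up
names (nothing here is ZFS's actual on-disk format): fold each
transaction group's checksum into a running hash chain, and treat any
recovered state whose chain head was never observed as proof that the
pool passed through a state it couldn't have reached while respecting
barriers, even though every individual checksum in it still verifies.

    import hashlib

    def fold_seal(prev_seal: bytes, txg_checksum: bytes) -> bytes:
        # The seal is a hash chain: it commits to the *order* of
        # states, not just their contents, so losing or reordering a
        # txg changes every seal after it.
        return hashlib.sha256(prev_seal + txg_checksum).digest()

    # Hypothetical checksums of five transaction groups, oldest first.
    txgs = [hashlib.sha256(f"txg-{i}".encode()).digest() for i in range(5)]

    # A client (or any higher layer) remembers every seal it has seen.
    seen = set()
    seal = b"\x00" * 32
    for c in txgs:
        seal = fold_seal(seal, c)
        seen.add(seal)

    # After a crash, suppose recovery lands on a state where txg 4's
    # writes survived but txgs 1-3 were silently dropped.  Every block
    # checksum still verifies, yet the resulting seal is one nobody
    # has ever seen:
    recovered = fold_seal(fold_seal(b"\x00" * 32, txgs[0]), txgs[4])
    print("seal recognized:", recovered in seen)   # prints False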

I don't see how raw storage can do anything but put checksums on
block-sized chunks, which is useful for data in flight but not that
useful to store.  The stored checksum can prove ``this exact block
was once handed to me, and I was once told to write it to this LBA on
this LUN.''  So what?  Yes, I agree that happened, but it might have
been two years ago.  That doesn't mean the block is what belongs
there _right now_.  I could have overwritten that block 100 times
since then.  You need a metadata hierarchy to know that.
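
To make that concrete, here's a toy contrast in Python (names
invented, not any real format): a checksum stored next to a block
only proves the block was once written there, while a checksum held
by the parent metadata pins down what is supposed to be there right
now.

    import hashlib

    def cksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    old = b"version written two years ago"
    new = b"version written yesterday"

    # Self-checksummed block: the checksum travels with the data it
    # covers, so a stale block is still internally consistent.
    lba_7 = (old, cksum(old))
    data, stored = lba_7
    print("block self-check:", cksum(data) == stored)   # True, but says nothing about freshness

    # Parent-held checksum (Merkle-style): the metadata pointing at
    # the block records what its checksum must be *now*.
    parent = {"lba": 7, "expected": cksum(new)}
    print("parent check:", cksum(data) == parent["expected"])   # False -> stale block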

What the SCSI extensions could do is extend the checksums that all the
big storage vendors are already doing over the FC/iSCSI SAN, and thus
stop ZFS advocates from pointing at weak TCP checksums, ancient
routers, and SAN bitflip gremlins when pools with single-LUN vdevs
become corrupt.  The storage vendor pitch about helping to _find_ the
corruption problems---I buy that one.  ZFS is notoriously poor at that
job.  But I don't think the SCSI extension is helpful for extending
the halo of the on-disk protection domain through the filesystem and
above it, past a network filesystem.  They can't do that by adding
SCSI commands.  It's simply irrelevant to the task, unless SCSI is
going to become its own non-POSIX filesystem with snapshots and a
virtual clock, which it had better not.

Lustre could do it, though, especially if they are building their own
filesystem from zpool pieces right above the transactional layer, not
just using ZFS as a POSIX backing store.
