> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy.  However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices.  As you note,
some disks flat-out lie: you issue the synchronize-cache command,
they say "got it, boss", yet the data is still not on stable storage.
Why do they do this?  Because "it performs better".  Well, duh --
you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my
chest: people who knowingly make such disks should be in federal prison.
It is *fraud* to win benchmarks this way.  Doing so causes real harm
to real people.  Same goes for NFS implementations that ignore sync.
We have specifications for a reason.  People assume that you honor them,
and build higher-level systems on top of them.  Change the mass of
the proton by a few percent, and the stars explode.  It is impossible
to build a functioning civil society in a culture that tolerates lies.
We need a little more Code of Hammurabi in the storage industry.

Now:

The uberblock ring buffer in ZFS gives us a way to cope with this,
as long as we don't reuse freed blocks for a few transaction groups.
The basic idea: if we can't read the pool starting from the most
recent uberblock, then we should be able to use the one before it,
or the one before that, etc, as long as we haven't yet reused any
blocks that were freed in those earlier txgs.  This allows us to
use the normal load on the pool, plus the passage of time, as a
displacement flush for disk caches that ignore the sync command.
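To make the idea concrete, here's a hypothetical sketch (not actual ZFS code -- all names are illustrative): walk the uberblock ring buffer from newest to oldest and pick the first txg whose dependent blocks all actually made it to stable storage.

```python
# Hypothetical sketch of uberblock fallback, not real ZFS code.
# An "uberblock" here is just a txg number plus the set of block ids
# its tree depends on; "on_disk" models what the disk really wrote.
from dataclasses import dataclass

@dataclass
class Uberblock:
    txg: int        # transaction group number
    blocks: set     # block ids this txg's tree depends on

def pick_uberblock(ring, on_disk):
    """Return the newest uberblock whose whole tree is on stable storage."""
    for ub in sorted(ring, key=lambda u: u.txg, reverse=True):
        if ub.blocks <= on_disk:   # every dependent block actually landed
            return ub
    return None                    # no consistent txg found

# A lying disk acked the sync for txg 102 but dropped block 7:
ring = [Uberblock(100, {1, 2, 3}),
        Uberblock(101, {1, 4, 5}),
        Uberblock(102, {1, 6, 7})]
on_disk = {1, 2, 3, 4, 5, 6}       # block 7 never hit the platter
best = pick_uberblock(ring, on_disk)
print(best.txg)                    # falls back to txg 101
```

The point is that as long as txg 101's blocks haven't been reused, the pool opens cleanly one txg back instead of being unopenable.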

If we go back far enough in (txg) time, we will eventually find an
uberblock all of whose dependent data blocks have made it to disk.
I'll run tests with known-broken disks to determine how far back we
need to go in practice -- I'll bet one txg is almost always enough.
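The rollback is only safe if freed blocks aren't recycled right away.  A hypothetical sketch of that constraint (again, illustrative names, not the real allocator): park freed blocks for a window of txgs before returning them to the free pool, so any uberblock within the window still points at intact data.

```python
# Hypothetical sketch: defer reuse of freed blocks for WINDOW txgs,
# so rolling back up to WINDOW txgs never finds overwritten data.
WINDOW = 3   # how many txgs back we promise to be able to roll

class DeferredFreeAllocator:
    def __init__(self):
        self.free = set(range(100))   # free block ids
        self.deferred = {}            # txg -> blocks freed in that txg

    def alloc(self, txg):
        # Blocks freed more than WINDOW txgs ago are safe to recycle.
        for old in [t for t in self.deferred if t <= txg - WINDOW]:
            self.free |= self.deferred.pop(old)
        return self.free.pop()

    def free_block(self, blk, txg):
        # Don't reuse immediately; park it until the window passes.
        self.deferred.setdefault(txg, set()).add(blk)

alloc = DeferredFreeAllocator()
b = alloc.alloc(txg=10)
alloc.free_block(b, txg=10)
# b stays unreusable through txgs 10..12, so a rollback to the
# txg-10 uberblock still finds its data intact.
assert all(alloc.alloc(txg=t) != b for t in (11, 12))
```

With a scheme like this, the cost of the workaround is just a few txgs of delayed space reclamation, which the normal write load amortizes away.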

Jeff
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss