Hi Jeff,

On Fri, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote:
> > The circumstances where I have lost data have been when ZFS has not
> > handled a layer of redundancy.  However, I am not terribly optimistic
> > of the prospects of ZFS on any device that hasn't committed writes
> > that ZFS thinks are committed.
> 
> FYI, I'm working on a workaround for broken devices.  As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.

It's not just about ignoring the synchronize-cache command; there is
another weak spot as well.

ZFS is quite resilient against so-called phantom writes, provided they
occur sporadically. If a disk were to _randomly_ ignore, say, 10% of
writes, ZFS could probably survive that fairly well even on single-vdev
pools, thanks to ditto blocks.
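
To put a rough number on that intuition: if each write is dropped
independently with probability p, a block stored with N ditto copies is
lost only when every copy is dropped. A back-of-the-envelope sketch (in
Python, purely hypothetical, not real ZFS code; the copy counts are
just the usual ditto levels):

  p = 0.10                   # assumed independent drop rate per write
  for copies in (1, 2, 3):   # typical ditto levels: data, metadata, pool-wide metadata
      print(copies, "copies -> loss probability", p ** copies)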

However, it is much less resilient when the storage system suffers a
hiccup that causes phantom writes to occur continuously, even if only
for a short period of time (say, less than 10 seconds), before
returning to normal. This could happen for several reasons, including
network problems, or bugs in software or even firmware.
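
A quick simulation shows why the correlated case is so much worse, even
with the same overall fraction of writes dropped. This is a toy model
(Python), not ZFS code: each block is written as two ditto copies, and
a block counts as lost only when both copies are dropped.

  import random

  random.seed(1)
  BLOCKS, P = 100000, 0.10

  # independent drops: each copy is lost on its own with probability P
  random_lost = sum(random.random() < P and random.random() < P
                    for _ in range(BLOCKS))

  # one hiccup covering 10% of the timeline: every write issued inside
  # the window is dropped, so both copies of those blocks are lost together
  start = random.randrange(BLOCKS - BLOCKS // 10)
  burst_lost = sum(start <= i < start + BLOCKS // 10 for i in range(BLOCKS))

  print("independent drops:", random_lost, "blocks lost")  # roughly 1%
  print("hiccup window:    ", burst_lost, "blocks lost")   # roughly 10%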

I think that in this case, going back to a previous uberblock could
also be enough to recover from such a scenario most of the time, unless
perhaps the error occurred too long ago and the unwritten metadata was
flushed out of the ARC without ever being rewritten.
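
The recovery idea, as I understand it, would look roughly like the
sketch below. This is hypothetical pseudocode (Python), not the actual
import path: the vdev labels keep a ring of recent uberblocks, and
instead of trusting only the one with the highest txg, the importer
would walk backwards until it finds one whose metadata tree still
checksums cleanly. tree_verifies() here stands in for a full traversal
of the block-pointer tree.

  def pick_recoverable_uberblock(uberblock_ring, tree_verifies):
      # Walk the candidates from newest txg to oldest; the first one
      # whose metadata tree verifies completely becomes the active
      # uberblock, and anything written in later txgs is discarded.
      for ub in sorted(uberblock_ring, key=lambda u: u.txg, reverse=True):
          if tree_verifies(ub):
              return ub
      return None              # nothing in the ring is usable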

In any case, a more generic solution for repairing all kinds of
metadata corruption, such as space map corruption, would be very
desirable, as I think everyone can agree.

Best regards,
Ricardo


