On 12-Jan-10, at 10:40 PM, Brad wrote:

"(Caching isn't the problem; ordering is.)"

Weird, I was reading about a problem with SSDs (Intel X25-E): if the power goes out and the data in cache is not flushed, you would have loss of data.

Could you elaborate on "ordering"?

ZFS integrity is maintained if the device correctly respects flush/barrier semantics, which, as required, enforce an ordering of operations. The synchronous completion of a flush guarantees that prior writes have durably completed. This is irrespective of write caching.

When a device does not properly flush, all bets are off, because in-flight data (including unwritten data in the write cache) is not written in any determinate manner (you cannot know what was written, or in what order). The precondition for an atomic überblock update is that the tree of blocks it references has been fully written.
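
To make the ordering point concrete, here is a toy Python model (my own sketch, not ZFS code). The names CachedDisk, commit_steps, is_consistent and all_crashes_consistent are hypothetical, and the model assumes that a power failure may leave any subset of the cached writes on media. It tries every crash point and every such subset, and checks that an uberblock found on media never references a block that is missing.

```python
from itertools import combinations, islice


class CachedDisk:
    """Toy disk with a volatile write cache (hypothetical model, not a ZFS API)."""

    def __init__(self, honors_flush=True):
        self.media = {}   # durable contents: address -> value
        self.cache = []   # acknowledged but volatile (address, value) writes
        self.honors_flush = honors_flush

    def write(self, addr, value):
        self.cache.append((addr, value))

    def flush(self):
        # An honest device drains its cache before the flush completes.
        # A broken device acknowledges the flush and does nothing.
        if self.honors_flush:
            self.media.update(self.cache)
            self.cache.clear()

    def crash_outcomes(self):
        """Yield every media state possible if power fails right now,
        assuming any subset of the cached writes may have reached media."""
        for k in range(len(self.cache) + 1):
            for subset in combinations(self.cache, k):
                media = dict(self.media)
                media.update(subset)
                yield media


def commit_steps(disk, txg, blocks):
    """Perform one commit, yielding after each step so a caller can
    stop partway through and simulate a power failure."""
    for addr, value in blocks.items():
        disk.write(addr, value)
        yield
    disk.flush()   # barrier: the tree is durable before the root is written
    yield
    disk.write("uberblock", {"txg": txg, "refs": tuple(blocks)})
    yield
    disk.flush()   # barrier: the root is durable before the commit is reported
    yield


def is_consistent(media):
    """True if the uberblock on media (if any) references only blocks on media."""
    uberblock = media.get("uberblock")
    if uberblock is None:
        return True   # no new root yet, so the previous state is still the pool
    return all(ref in media for ref in uberblock["refs"])


def all_crashes_consistent(honors_flush):
    """Crash after every step, in every possible way, and check consistency."""
    blocks = {"blk1": "a", "blk2": "b"}
    total_steps = len(blocks) + 3   # block writes, flush, uberblock write, flush
    for crash_after in range(total_steps + 1):
        disk = CachedDisk(honors_flush)
        for _ in islice(commit_steps(disk, 1, blocks), crash_after):
            pass
        if not all(is_consistent(m) for m in disk.crash_outcomes()):
            return False
    return True
```

In this model all_crashes_consistent(True) holds, while all_crashes_consistent(False) does not: with a device that ignores flushes, the uberblock write can reach media while the blocks it references are still sitting in the cache. The cache is present in both cases; only the ordering guarantee differs.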

This has been mentioned periodically on the list. I thought somebody (Richard Elling?) did a nice capsule summary recently, but I can't find it, so here are some other past list snippets by people more knowledgeable than I.

Neil Perrin, 6 Dec, 2009:

ZFS uses a 3 stage transaction model: Open, Quiescing and Syncing.
Transactions enter in Open. Quiescing is where a new Open stage has
started and waits for transactions that have yet to commit to finish.
Syncing is where all the completed transactions are pushed to the pool
in an atomic manner with the last write being the root of the new tree
of blocks (uberblock).

All the guarantees assume good hardware. As part of the new uberblock
update we flush the write caches of the pool devices. If this is broken
all bets are off.

14 Oct, 2009, James R. Van Artsdalen:

ZFS is different because it uses a different "superblock" every few
seconds (every transaction commit), and more importantly, the top levels
of the filesystem and some pool metadata are moved too. After every tx
commit the uberblock is in a different place and some of its pointers
are to different places.

Moreover, blocks that were freed by this process are rapidly reclaimed.
The uberblock itself is not reclaimed for another 127 commits - several
minutes - but the things it points to are. In other words as soon as tx
group N is committed, blocks from N-1 that are no longer referenced are
reclaimed as free space.

What goes wrong when the write fence / cache flush doesn't happen:

As soon as the uberblock for tx group N is written everything from N-1
that is no longer referenced is marked free for reallocation, and these
newly-freed blocks often contain part of the top levels of the N-1 pool
/ filesystems and metadata.

If the uberblock for N is _not_ written to media when it was supposed to
be then ZFS may happily reuse the blocks from N-1 while the uberblock
for N-1 is still the most recent on media, instead of N as ZFS expects.
In other words there might be a window where the most recent uberblock
on disk media (N-1) points to a toplevel directory block that is
overwritten with unrelated data - disaster.

That window closes once uberblock N hits media.  Unfortunately with no
write fence it might be a long time before that happens.  ...
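
(An aside of mine, not part of the quoted post: the window described above can be shown with a toy Python replay. The block names, the write log and the helpers replay, root_is_intact and reuse_window_demo are all hypothetical. The second replay assumes the uberblock for txg 2 is still stuck in the drive cache when the freed root block of txg 1 is overwritten.)

```python
def replay(write_log, persisted):
    """Return the media contents if only the writes whose indices are in
    `persisted` reached the disk, applied in log order."""
    media = {}
    for index, (addr, value) in enumerate(write_log):
        if index in persisted:
            media[addr] = value
    return media


def root_is_intact(media):
    """True if the newest uberblock on media points at the root block
    that was written for its own transaction group."""
    txg, root_addr = media["uberblock"]
    return media.get(root_addr) == f"root-of-txg-{txg}"


def reuse_window_demo():
    log = [
        ("blk7", "root-of-txg-1"),    # 0: tree for txg 1
        ("uberblock", (1, "blk7")),   # 1: uberblock for txg 1
        ("blk9", "root-of-txg-2"),    # 2: tree for txg 2, written to a new location
        ("uberblock", (2, "blk9")),   # 3: uberblock for txg 2; blk7 is now free
        ("blk7", "unrelated-data"),   # 4: blk7 reallocated for something else
    ]
    with_fence = replay(log, {0, 1, 2, 3, 4})
    without_fence = replay(log, {0, 1, 2, 4})   # write 3 never left the cache
    return root_is_intact(with_fence), root_is_intact(without_fence)
```

Here reuse_window_demo() returns (True, False): in the second replay the newest uberblock on media is still the one for txg 1, and the block it points to now holds unrelated data.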

10 Oct, 2009, James Relph quotes Dominic Giampaolo:

"Last, I do not believe that the crash protection scheme used
by ZFS can ever work reliably on drives that drop the flush
track cache request.  The only approach that is guaranteed to
work is to keep enough data in a log that when you remount the
drive, you can replay more data than the drive could have kept

Nicolas Williams, 13 Feb, 2009:

Also, note that ignoring barriers is effectively as bad as dropping
writes if there's any chance that some writes will never hit the disk
because of, say, power failures. Imagine 100 txgs, but some writes from
the first txg never hitting the disk because the drive keeps them in the
cache without flushing them for too long, then you pull out the disk, or
power fails -- in that case not even fallback to older txgs will help
you, there'd be nothing that ZFS could do to help you.

Peter Schuller, 10 Feb, 2009:

What's stopping a RAID device from, for example, ACK:ing an I/O before
it is even in the cache? I have not designed RAID controller firmware so
I am not sure how likely that is, but I don't see it as an
impossibility. Disabling flushing because you have battery backed nvram
implies that your battery-backed nvram guarantees ordering of all
writes, and that nothing is ever placed in said battery backed cache out
of order.

Jeff Bonwick, 12 Feb, 2007:

Even if you disable the intent log, the transactional nature
of ZFS ensures preservation of event ordering.  Note that disk caches
don't come into it: ZFS builds up a wad of transactions in memory,
then pushes them out as a transaction group.  That entire group will
either commit or not.  ZFS writes all the new data to new locations,
then flushes all disk write caches, then writes the new uberblock,
then flushes the caches again.  Thus you can lose power at any point
in the middle of committing transaction group N, and you're guaranteed
that upon reboot, everything will either be at state N or state N-1.

I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
every system call.
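
The write / flush / write-uberblock / flush sequence in the snippet above can be sketched against an ordinary file. This is only an illustration of mine, not how ZFS is implemented: os.fsync stands in for the disk cache flush, the 4-slot uberblock ring, the 512-byte block size, the SHA-256 checksum and the function names are all assumptions made for the example, and os.pwrite / os.pread are available on Unix only.

```python
import hashlib
import os
import struct
import tempfile

BLOCK = 512    # assumed block size for this toy layout
N_SLOTS = 4    # assumed ring size; kept small for the example


def pack_uberblock(txg, root_offset):
    """Serialize (txg, root_offset) followed by a checksum of those fields."""
    body = struct.pack("<QQ", txg, root_offset)
    return body + hashlib.sha256(body).digest()


def unpack_uberblock(raw):
    """Return (txg, root_offset), or None if the slot is empty or torn."""
    body, digest = raw[:16], raw[16:48]
    if hashlib.sha256(body).digest() != digest:
        return None
    return struct.unpack("<QQ", body)


def commit_txg(fd, txg, data, data_offset):
    os.pwrite(fd, data, data_offset)   # 1. new data goes to a new location
    os.fsync(fd)                       # 2. flush: data is durable first
    slot_offset = (txg % N_SLOTS) * BLOCK
    os.pwrite(fd, pack_uberblock(txg, data_offset), slot_offset)  # 3. new root
    os.fsync(fd)                       # 4. flush again: the commit is durable


def latest_uberblock(fd):
    """On 'reboot', pick the valid uberblock with the highest txg."""
    best = None
    for slot in range(N_SLOTS):
        uberblock = unpack_uberblock(os.pread(fd, 48, slot * BLOCK))
        if uberblock is not None and (best is None or uberblock[0] > best[0]):
            best = uberblock
    return best


def demo():
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        data_start = N_SLOTS * BLOCK
        commit_txg(fd, 1, b"state-1", data_start)
        commit_txg(fd, 2, b"state-2", data_start + BLOCK)
        # Simulate losing power partway through txg 3: the data was
        # written, but the uberblock for txg 3 never was.
        os.pwrite(fd, b"state-3", data_start + 2 * BLOCK)
        txg, root_offset = latest_uberblock(fd)
        return txg, os.pread(fd, 7, root_offset)
```

With the interrupted third commit, demo() comes back with txg 2 and b"state-2", i.e. state N-1. The guarantee rests entirely on the two flushes doing what they claim.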

(This issue also arises with respect to the questionable VirtualBox default setting of "Ignore Flush".)


This message posted from opensolaris.org
zfs-discuss mailing list
