Re: [zfs-discuss] zfs streams & data corruption

Miles Nordin Wed, 25 Feb 2009 12:41:37 -0800

>>>>> "tt" == Toby Thain <t...@telegraphics.com.au> writes:


     c> so it's ok for snapshots but not cord-yanks if VBox never
     c> bothers to call fsync().

    tt> Taking good host snapshots may require VB to do that, though.

AIUI the contents of a snapshot on the host will be invariant no
matter where VBox places host fsync() calls along the timeline, or if
it makes them at all.

The host snapshot will not be invariant of when applications running
inside the guest call fsync(), because this inner fsync() implicates
the buffer cache in the guest OS, possibly flush commands at the
guest/VBox driver/virtualdisk boundary, and stdio buffers inside the
VBox app.

so...in the sense that, in a hypothetical nonexistent working overall
system, a guest app calling fsync() eventually propagates out until
finally VBox calls fsync() on the host's kernel, then yeah, observing
a lack of fsync()'s coming out of VBox probably means host snapshots
won't be crash-consistent.  BUT the effect of the fsync() on the host
itself is not what's needed for host snapshots (only needed for host
cord-yanks).  It's all the other stuff that's needed for host
snapshots---flushing the buffer cache inside the guest OS, flushing
VBox's stdio buffers, u.s.w., that makes a bunch of write()'s spew out
just before the fsync() and dams up other write()s inside VBox and the
guest OS until after the fsync() comes out.

     c>   But ext3's supposed ability to mostly work ok without
     c> barriers

    tt> Without *working* barriers, you mean? I haven't RTFS but I
    tt> suspect ext3 needs functioning barriers to maintain "crash
    tt> consistency".

no, the lwn article says that ext3 is just like Solaris UFS and never
issues a cache flush to the drive (except on SLES where Novell made
local patches to their kernel).

ext3 probably does still use an internal Linux barrier API to stop
dangerous kinds of reordering within the Linux buffer cache, but
nothing that makes it down to the drive (nor into VBox).  so I think
even if you turn on the flush-respecting feature in VBox, Linux ext3
and Solaris UFS would both still be necessarily unsafe (according to
our theory so far), at least unsafe from: (1) host cord-yanking, (2)
host snapshots taken without ``pausing'' the VM.

If you're going to turn on the VBox flush option, maybe it would be
worth trying XFS or ext4 or ZFS inside the guest and comparing their
corruptibility.

For VBox to simulate a real disk with its write cache turned off, and
thus work better with UFS and ext3, VBox would need to make sure
writes are not re-ordered.  For the unpaused-host-snapshot case this
should be relatively easy---just make VBox stop using stdio, and call
write() exactly once for every disk command the guest issues and call
it in the same order the guest passed it.  It's not necessary to call
fsync() at all, so it should not make things too much slower.
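
A minimal sketch of the one-write()-per-command idea, in Python rather
than whatever VBox really uses.  The (sector, data) command shape, the
512-byte sector size and the function name are all made up for
illustration.  The point is only that nothing sits between the guest's
command stream and the host kernel that could merge or reorder writes,
and that fsync() is never called.

```python
import os

SECTOR = 512  # assumed sector size for this sketch


def replay_guest_writes(fd, commands):
    """Issue guest disk writes to the host file in the guest's order.

    `commands` is an iterable of (sector_number, data) pairs; this shape
    is hypothetical.  os.pwrite() is a direct system call, so there is no
    user-space (stdio-style) buffer where writes could be held back or
    reordered.  fsync() is deliberately not called.
    """
    for sector, data in commands:
        offset = sector * SECTOR
        written = 0
        # pwrite() may write fewer bytes than requested; finish this
        # command completely before moving on to the next one.
        while written < len(data):
            written += os.pwrite(fd, data[written:], offset + written)
```

A later write to the same sector simply overwrites the earlier one, as
it would on a real disk with no reordering.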

For the host cord-yanking case, I don't think POSIX gives enough to
achieve this and still be fast because you'd be expected to call
fsync() between each write.  What we really want is some flag, ``make
sure my writes appear to have been done in order after a crash.''  I
don't think there's such a thing as a write barrier in POSIX, only the
fsync() flush command?  
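
For comparison, the only ordering tool POSIX seems to offer looks
roughly like the sketch below: one fsync() per write.  The function
name and the (offset, data) pairs are hypothetical.  It returns the
number of flushes just to make the cost visible, one per write.

```python
import os


def write_in_order_durably(fd, commands):
    """pwrite() each (offset, data) pair, then fsync() before the next.

    Returns the number of fsync() calls issued, which is one per write.
    """
    flushes = 0
    for offset, data in commands:
        written = 0
        while written < len(data):  # handle short writes
            written += os.pwrite(fd, data[written:], offset + written)
        os.fsync(fd)  # full flush used as a stand-in for a barrier
        flushes += 1
    return flushes
```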

Maybe it should be a new rule of zvol's that they always act this
way. It need not slow things down much for the host to arrange that
writes not appear to have been reordered: all you have to do is batch
them into chunks along the timeline, and make sure all the writes in a
chunk commit, or none of them do.  It doesn't matter how big the
chunks are nor where they start and end.  It's sort of a degenerate
form of the snapshot case: with the fwrite()-to-write() change above
we can already take a clean snapshot without fsync(), so just pretend
as though you were taking a snapshot a couple times a minute, and after
losing power roll back to the newest one that survived.  I'm not sure
real snapshots are the right way to implement it, but the idea is with
a COW backing store it should be well within-reach to provide the
illusion writes are never reordered (and thus that your virtual hard
disk has its write cache turned off) without adding lots of io/s the
way fsync() does.  This still compromises the D in ACID for databases
running inside the guest, in the host cord-yank case, but it should
stop the corruption.
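
A toy in-memory model of the chunking idea, not anything a zvol
actually does.  The class and method names are invented for
illustration.  Writes pile up in a pending chunk, a commit publishes
the whole chunk in one step, and a crash throws away whatever was
pending, so what survives is always the state as of some earlier point
on the timeline.

```python
class ChunkedDisk:
    """Toy model: writes commit in all-or-nothing chunks."""

    def __init__(self):
        self.committed = {}  # sector -> data; survives a crash
        self.pending = {}    # current chunk; lost on a crash

    def write(self, sector, data):
        # No flush here: the write only joins the current chunk.
        self.pending[sector] = data

    def read(self, sector):
        # The running guest sees its own uncommitted writes.
        if sector in self.pending:
            return self.pending[sector]
        return self.committed.get(sector)

    def commit_chunk(self):
        # COW-style: build the new version on the side, then switch to
        # it with a single assignment, so the chunk is all-or-nothing.
        new_version = dict(self.committed)
        new_version.update(self.pending)
        self.committed = new_version
        self.pending = {}

    def crash(self):
        # Power loss: roll back to the newest committed chunk.
        self.pending = {}
```

Where the chunk boundaries fall does not matter for consistency; it
only decides how much recent work a cord-yank can lose.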


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
