>>>>> "tt" == Toby Thain <t...@telegraphics.com.au> writes:
     c> so it's ok for snapshots but not cord-yanks if VBox never
     c> bothers to call fsync().

    tt> Taking good host snapshots may require VB to do that, though.

AIUI the contents of a snapshot on the host will be invariant no matter where VBox places host fsync() calls along the timeline, or whether it makes them at all. The host snapshot will not be invariant with respect to when applications running inside the guest call fsync(), because this inner fsync() implicates the buffer cache in the guest OS, possibly flush commands at the guest/VBox driver/virtual disk boundary, and stdio buffers inside the VBox app.

So, in the sense that, in a hypothetical nonexistent working overall system, a guest app calling fsync() eventually propagates out until finally VBox calls fsync() on the host's kernel, then yeah, observing a lack of fsync()s coming out of VBox probably means host snapshots won't be crash-consistent. BUT the effect of the fsync() on the host itself is not what's needed for host snapshots (it's only needed for host cord-yanks). What host snapshots need is all the other stuff: flushing the buffer cache inside the guest OS, flushing VBox's stdio buffers, and so on. That is what makes a bunch of write()s spew out just before the fsync() and dams up other write()s inside VBox and the guest OS until after the fsync() comes out.

     c> But ext3's supposed ability to mostly work ok without
     c> barriers

    tt> Without *working* barriers, you mean? I haven't RTFS but I
    tt> suspect ext3 needs functioning barriers to maintain "crash
    tt> consistency".

No, the LWN article says that ext3 is just like Solaris UFS and never issues a cache flush to the drive (except on SLES, where Novell made local patches to their kernel). ext3 probably does still use an internal Linux barrier API to stop dangerous kinds of reordering within the Linux buffer cache, but nothing that makes it down to the drive (nor into VBox).
So I think even if you turn on the flush-respecting feature in VBox, Linux ext3 and Solaris UFS would both still be necessarily unsafe (according to our theory so far), at least unsafe from: (1) host cord-yanking, (2) host snapshots taken without ``pausing'' the VM. If you're going to turn on the VBox flush option, maybe it would be worth trying XFS or ext4 or ZFS inside the guest and comparing their corruptibility.

For VBox to simulate a real disk with its write cache turned off, and thus work better with UFS and ext3, VBox would need to make sure writes are not reordered. For the unpaused-host-snapshot case this should be relatively easy: just make VBox stop using stdio, call write() exactly once for every disk command the guest issues, and call it in the same order the guest passed them. It's not necessary to call fsync() at all, so it should not make things too much slower.

For the host cord-yanking case, I don't think POSIX gives enough to achieve this and still be fast, because you'd be expected to call fsync() between every two writes. What we really want is some flag, ``make sure my writes appear to have been done in order after a crash.'' I don't think there's such a thing as a write barrier in POSIX, only the fsync() flush command? Maybe it should be a new rule for zvols that they always act this way.

It need not slow things down much for the host to arrange that writes not appear to have been reordered: all you have to do is batch them into chunks along the timeline, and make sure either all the writes in a chunk commit, or none of them do. It doesn't matter how big the chunks are nor where they start and end. It's sort of a degenerate form of the snapshot case: with the fwrite()-to-write() change above we can already take a clean snapshot without fsync(), so just pretend as though you were taking a snapshot a couple of times a minute, and after losing power roll back to the newest one that survived.
I'm not sure real snapshots are the right way to implement it, but the idea is that with a COW backing store it should be well within reach to provide the illusion that writes are never reordered (and thus that your virtual hard disk has its write cache turned off) without adding lots of io/s the way fsync() does. This still compromises the D in ACID for databases running inside the guest, in the host cord-yank case, but it should stop the corruption.
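
To make the chunking idea a bit more concrete, here is a toy in-memory model in Python. It is only a sketch under my own assumptions, not VBox or zvol code: the names EpochDisk, replay, is_prefix_state and the 512-byte SECTOR are all made up for illustration, and swapping one Python list reference stands in for whatever atomic commit a COW store would really do. The point it tries to show is that if each chunk commits all-or-nothing, whatever survives a cord-yank always equals the result of applying some prefix of the guest's writes in order.

```python
SECTOR = 512  # assumed sector size for this toy model


class EpochDisk:
    """Hypothetical virtual disk whose writes commit in all-or-nothing chunks."""

    def __init__(self, nsectors):
        # Image as of the last committed chunk: this is what survives a crash.
        self.committed = [bytes(SECTOR)] * nsectors
        # Writes issued since the last commit, kept in the guest's order.
        self.pending = []

    def write(self, sector, data):
        if len(data) != SECTOR:
            raise ValueError("toy model only accepts whole sectors")
        self.pending.append((sector, data))

    def read(self, sector):
        # The running guest sees its newest write, committed or not.
        for s, d in reversed(self.pending):
            if s == sector:
                return d
        return self.committed[sector]

    def commit_epoch(self):
        # Build the new image off to the side, then switch to it in one step.
        image = list(self.committed)
        for s, d in self.pending:
            image[s] = d
        self.committed = image
        self.pending = []

    def crash(self):
        # Cord-yank: everything since the last committed chunk is lost.
        self.pending = []


def replay(writes, nsectors):
    """Image obtained by applying `writes` in order to a zeroed disk."""
    image = [bytes(SECTOR)] * nsectors
    for s, d in writes:
        image[s] = d
    return image


def is_prefix_state(image, writes, nsectors):
    """True if `image` equals the result of some in-order prefix of `writes`."""
    return any(
        image == replay(writes[:k], nsectors) for k in range(len(writes) + 1)
    )
```

Where commit_epoch() gets called along the timeline does not matter for correctness in this model, only for how much recent work a crash throws away, which is the durability trade-off mentioned above.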
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss