>>>>> "jm" == Moore, Joe <joe.mo...@siemens.com> writes:
jm> This is correct.  The general term for these sorts of
jm> point-in-time backups is "crash consistent".

phew, thanks, glad I wasn't talking out my ass again.

jm> In-flight transactions (ones that have not been committed) at
jm> the database level are rolled back.  Applications using the
jm> database will be confused by this in a recovery scenario,
jm> since transactions that were reported as committed are gone
jm> when the database comes back.  But that's the case any time a
jm> database moves "backward" in time.

hm.  I thought a database would not return success to the app until it
was actually certain the data was on disk, with fsync() or whatever, and
that this is why databases like NVRAMs and slogs.  Are you saying it's a
common ``optimisation'' for a DBMS to worry about write barriers only,
not about flushing?

jm> Snapshots of a virtual disk are also crash-consistent.  If the
jm> VM has not committed its transactionally-committed data and is
jm> still holding it in volatile memory, that VM is not maintaining
jm> its ACID requirements, and that's a bug in either the database
jm> or in the OS running on the VM.

I'm betting mostly ``the OS running inside the VM'' and ``the
virtualizer itself''.  For the latter, from Toby's thread:

-----8<-----
If desired, the virtual disk images (VDI) can be flushed when the guest
issues the IDE FLUSH CACHE command.  Normally these requests are ignored
for improved performance.  To enable flushing, issue the following
command:

VBoxManage setextradata VMNAME "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0
-----8<-----

Virtualizers are able to take snapshots themselves without help from the
host OS, so I would expect at least those to work, and host snapshots to
be fixable.  VirtualBox has a ``pause'' feature---it could pretend it's
received a flush command from the guest, and flush whatever internal
virtualizer buffers it has to the host OS when paused.
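For reference, the durable-commit discipline I'm assuming a DBMS follows
looks roughly like this sketch (file name is made up; ``sync FILE''
fsync()ing a single named file is GNU coreutils behaviour, so on other
systems plain sync(1) flushes everything instead):

```shell
# Sketch: don't acknowledge a commit until the data is on stable storage.
# "sync FILE" fsync()s just that file on GNU coreutils (an assumption to
# check on non-GNU systems).
db=$(mktemp)
printf 'COMMIT txn-1\n' >> "$db"   # write the transaction record
sync "$db"                         # force it to disk first...
echo "commit acknowledged"         # ...and only then report success
rm -f "$db"
```

The whole question above is whether a DBMS really does that last
forced-flush step, or only orders its writes with barriers and hopes.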
Also, a host snapshot is a little more forgiving than a host cord-yank,
because the snapshot will capture things applications like VBox have
written to files but not fsync()ed yet.  So it's okay for snapshots, but
not for cord-yanks, if VBox never bothers to call fsync().  It's just
not okay that VBox might buffer data internally sometimes.

Even if that's all sorted, though, there's still ``the OS running inside
the VM'': neither UFS nor ext3 sends these cache-flush commands to
virtual drives.  At least for ext3, the story is pretty long:

  http://lwn.net/Articles/283161/

  So, for those that wish to enable them, barriers apparently are
  turned on by giving "barrier=1" as an option to the mount(8) command,
  either on the command line or in /etc/fstab:

    mount -t ext3 -o barrier=1 <device> <mount point>

(but this does not help at all if you're using LVM2, because LVM2 drops
the barriers)

ext3 gets away with it because drive write caches are small enough that
it can mostly survive flushing only the journal, and the journal is
written in LBA order, so except when it wraps around there's little
incentive for drives to reorder it.  But ext3's supposed ability to
mostly work okay without barriers depends on assumptions about physical
disks---the write cache being <32MB, the drive's reordering algorithm
being elevator-like---that probably don't apply to a virtual disk, so a
Linux guest OS very likely is ``broken'' w.r.t. taking these
crash-consistent virtual disk snapshots.

And also a Solaris guest: we've been told UFS+logging expects the write
cache to be *off* for correctness.  I don't know whether UFS is less
good at evading the problem than ext3, or whether Solaris users are just
more conservative.  But with a virtual disk the write cache will always
be effectively on, no matter what simon-sez flags you pass to that awful
'format' tool.  That was never on the bargaining table because there's
no other way a virtual disk can have remotely reasonable performance.
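For completeness, the persistent form of that barrier=1 option is an
/etc/fstab entry like the one below (device and mount point are made-up
placeholders, and per the above it still does nothing if LVM2 sits
underneath the filesystem):

```
# /etc/fstab (sketch -- device and mount point are hypothetical)
/dev/sdb1   /mnt/data   ext3   rw,barrier=1   0 2
```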
Possibly the ``pause'' command would be a workaround for this, because
it could let you force a barrier into the write stream yourself (one the
guest OS never sent) and then take a snapshot right after the barrier,
with no writes allowed between barrier and snapshot.  If the fake
barrier is inserted into the stack right at the guest/VBox boundary,
then it should make the overall system behave as well as the guest
running on a drive with the write cache disabled.  I'm not sure such a
barrier is actually implied by VBox ``pause'', but if I were designing
the pause feature it would be.
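The pause-then-snapshot idea would look something like the sequence
below.  To be clear about the assumptions: the VM name and ZFS dataset
are placeholders, and the whole point in question---whether ``pause''
actually flushes VBox's internal buffers to the host---is not verified;
the run() wrapper just echoes the commands so the sketch can be
previewed without VirtualBox or ZFS installed (change it to execute
them for real).

```shell
#!/bin/sh
# Sketch of pause -> flush -> snapshot -> resume.  ASSUMPTION (unverified):
# VirtualBox "pause" pushes its internal write buffers out to the host OS.
set -eu
VM="myguest"          # hypothetical VM name
DATASET="tank/vms"    # hypothetical ZFS dataset holding the .vdi

run() { echo "+ $*"; }   # preview only; use  run() { "$@"; }  to execute

run VBoxManage controlvm "$VM" pause         # quiesce guest I/O
run sync                                     # flush host page cache too
run zfs snapshot "$DATASET@${VM}-snap"       # crash-consistent point
run VBoxManage controlvm "$VM" resume
```

Even then, the snapshot is only as good as whatever the guest filesystem
had pushed out, per the barrier discussion above.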
_______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss