Thanks for following up with this, Russel.

On Jul 31, 2009, at 7:11 AM, Russel wrote:

After all the discussion here about VB, and all the finger pointing,
I raised a bug against VB about flushing.

Remember I am using RAW disks via the SATA emulation in VB;
the disks are WD 2TB drives. Also remember the HOST machine
NEVER crashed or stopped, BUT the guest OS (OpenSolaris) was
hung, and so I powered off the VIRTUAL host.

OK, this is what the VB engineer had to say after reading this and
another thread I had pointed him to. (He missed the fact that I was
using RAW disks -- not surprising, as it's a rather long thread now!)

===============================
Just looked at those two threads, and from what I saw all vital information is missing - no hint whatsoever on how the user set up his disks, nothing about what errors should be dealt with and so on. So hard to say anything sensible, especially as people seem most interested in assigning blame to some product. ZFS doesn't deserve this, and VirtualBox doesn't deserve this either.

In the first place, there is absolutely no difference in how the IDE and SATA devices handle the flush command. The documentation just wasn't updated to talk about the SATA controller. Thanks for pointing this out; it will be fixed in the next major release. If you want the information straight away: just replace "piix3ide" with "ahci", and all the other flushing behavior settings apply as well. See a bit further below for what it buys you (or not).
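For reference, the setting the engineer is alluding to is the IgnoreFlush extradata key from the VirtualBox manual of that era. A sketch of the commands (the VM name "MyVM" and the LUN number 0 are placeholders; substitute your own):

```shell
# Tell VirtualBox to honor (not ignore) flush requests from the guest.
# IDE controller (piix3ide), as described in the manual:
VBoxManage setextradata "MyVM" \
  "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

# SATA controller: per the engineer, just replace "piix3ide" with "ahci":
VBoxManage setextradata "MyVM" \
  "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0
```

Setting the value to 0 makes VirtualBox pass flushes through to the host; the default (1) acks them without committing anything.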

What I haven't mentioned is the rationale behind the current behavior. The reason for ignoring flushes is simple: the biggest competitor does it by default as well, and one gets beaten up by every reviewer if VirtualBox is just a few percent slower than you know what. Forget about arguing with reviewers.

That said, a bit about what flushing can achieve - or not. Just keep in mind that VirtualBox doesn't really buffer anything. In the IDE case every read and write request gets handed more or less straight (depending on the image format complexity) to the host OS. So there is absolutely nothing which can be lost if one assumes the host OS doesn't crash.

In the SATA case things are slightly more complicated. If you're using anything but raw disks or flat file VMDKs, the behavior is 100% identical to IDE. If you use raw disks or flat file VMDKs, we activate NCQ support in the SATA device code, which means that the guest can push through a number of commands at once, and they get handled on the host via async I/O. Again - if the host OS works reliably there is nothing to lose.

The problem with this line of reasoning is that since the data is not
on the medium, a fault that occurs between the flush request and
the bogus ack goes undetected. When the disk says "the data is on
the medium," the OS trusts that the data is on the medium with no
errors.

This problem also affects "hardware" RAID arrays which provide
nonvolatile caches.  If the array acks a write and flush, but the
data is not yet committed to the medium, then the data must remain
in nonvolatile cache until it can be committed to the medium. A use
case may help: suppose the power goes out. Most arrays have enough
battery to last for some time. But if power isn't restored before
the batteries discharge, then there is a risk of data loss.

For ZFS, cache flush requests are not gratuitous. One critical
case is the uberblock or label update. ZFS does:
        1. update labels 0 and 2
        2. flush
        3. check for errors
        4. update labels 1 and 3
        5. flush
        6. check for errors

Making flush a no-op destroys the ability to check for errors,
thus breaking the trust between ZFS and the data on the medium.
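To make the failure mode concrete, here is a small sketch (a hypothetical
model, not ZFS source -- the Disk class and its methods are invented for
illustration) of the two-phase label update above, with a disk that either
honors flushes or acks them as no-ops:

```python
# Hypothetical model of ZFS's two-phase label update. A disk that
# honors flushes reports faults at the flush; one that treats flush
# as a no-op acks unconditionally, so the fault goes undetected.

class Disk:
    def __init__(self, honor_flush=True):
        self.cache = {}              # writes held in volatile cache
        self.media = {}              # data actually on the medium
        self.honor_flush = honor_flush
        self.failed = False          # simulated fault

    def write(self, label, data):
        self.cache[label] = data     # write lands in cache first

    def flush(self):
        if not self.honor_flush:
            return                   # no-op flush: bogus ack
        if self.failed:
            raise IOError("flush failed")
        self.media.update(self.cache)
        self.cache.clear()

def update_uberblock(disk, data):
    """Steps 1-6 above: update labels 0/2, flush, then labels 1/3, flush.
    Errors surface as exceptions raised by flush()."""
    disk.write("L0", data); disk.write("L2", data)
    disk.flush()                     # steps 2-3: commit, check for errors
    disk.write("L1", data); disk.write("L3", data)
    disk.flush()                     # steps 5-6

good = Disk()
update_uberblock(good, "txg-100")    # all four labels reach the medium

bad = Disk(honor_flush=False)
bad.failed = True                    # fault between request and bogus ack
update_uberblock(bad, "txg-101")     # "succeeds" -- no error ever raised
# bad.media is still empty: the loss is silent.
```

With an honest disk the fault would surface as an error at step 3, and
ZFS could react before touching labels 1 and 3; with the no-op flush the
update appears to succeed while nothing reached the medium.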
 -- richard


The only thing flushing can potentially improve is the behavior when the host OS crashes. But that depends on many assumptions about what the respective OS does, what the filesystems do, etc.

Hope those facts can be the basis of a real discussion. Feel free to raise any issue you have in this context, as long as it's not purely hypothetical.

===================================
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
