On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
All read and write requests SHOULD avoid any type of caching in the
host. Any write request MUST complete after the next level of storage
reports that the write request has completed. A flush from the guest
MUST complete after all pending I/O requests for the guest have been
completed.
As an implementation detail, with the raw format, these guarantees are
only in place for preallocated images. Sparse images do not provide as
strong a guarantee.
That's not how cache=none ever worked nor works currently.
How does it work today compared to what I wrote above?
From the guest's point of view it works exactly as you describe
cache=writeback. There are no ordering or cache flushing guarantees. By
using O_DIRECT we do bypass the host file cache, but we don't even try
on the others (disk cache, committing the metadata transactions that are
required to actually see the committed data for sparse, preallocated or
growing images).
O_DIRECT alone to a pre-allocated file on a normal file system should
result in the data being visible without any additional metadata
transactions.
The only time when that isn't true is when dealing with CoW or other
special filesystem features.
What you describe above is the equivalent of O_DSYNC|O_DIRECT which
doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
guarantees the semantics for sparse images. Sparse images really aren't
special in any way - preallocation using posix_fallocate or COW
filesystems like btrfs, nilfs2, or zfs have exactly the same issues.
                      | WC enable | WC disable
----------------------+-----------+-----------
direct                |           |
buffer                |           |
buffer + ignore flush |           |

currently we only have:

cache=none          direct + WC enable
cache=writeback     buffer + WC enable
cache=writethrough  buffer + WC disable
cache=unsafe        buffer + ignore flush + WC enable
Where does O_DSYNC fit into this chart?
O_DSYNC is used for all WC disable modes.
Do all modern filesystems implement O_DSYNC without generating
additional barriers per request?
Having a barrier per write request is ultimately not the right semantic
for any of the modes. However, without the use of O_DSYNC (or
sync_file_range(), which I know you dislike), I don't see how we can
have reasonable semantics without always implementing writeback caching
in the host.
Barriers are a Linux-specific implementation detail that is in the
process of going away, probably in Linux 2.6.37. But if you want
O_DSYNC semantics with a volatile disk write cache there is no way
around using a cache flush or the FUA bit on all I/O caused by it.
If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.
We currently use the cache flush, and although I plan to experiment a bit
more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be very
surprised if they are actually any faster.
The thing I struggle with understanding is that if the guest is sending
us a write request, why are we sending the underlying disk a write +
flush request? That doesn't seem logical at all to me.
Even if we advertise WC disable, it should be up to the guest to decide
when to issue flushes.
I'm certainly happy to break up the caching option. However, I still
don't know how we get a reasonable equivalent to cache=writethrough
without assuming that ext4 is mounted with barriers disabled.
There are two problems here - one is a Linux-wide problem, and that's the
barrier primitive, which is currently the only way to flush a volatile
disk cache. We've sorted this out for 2.6.37. The other is that
ext3 and ext4 have really bad fsync implementations. Just use a better
filesystem or bug one of its developers if you want that fixed. But
except for disabling the disk cache there is no way to get data integrity
without cache flushes (the FUA bit is nothing but an implicit flush).
But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated storage
or physical devices)?
Regards,
Anthony Liguori