On 09/21/2010 09:26 AM, Christoph Hellwig wrote:
On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
O_DIRECT alone to a pre-allocated file on a normal file system should
result in the data being visible without any additional metadata
transactions.
Anthony, for the third time: no. O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file.
In general this will degrade performance, but it is useful in
special situations, such as when applications do their own
caching. File I/O is done directly to/from user space buffers.
The O_DIRECT flag on its own makes an effort to transfer data
synchronously, but does not give the guarantees of the O_SYNC
that data and necessary metadata are transferred. To guarantee
synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block
devices is described in raw(8).
O_DIRECT does not have any meaning for data integrity; it just tells the
filesystem it *should* not use the pagecache. Even then, various
filesystems fall back to buffered I/O for corner cases.
It does *not* mean the actual disk cache gets flushed, and it does
*not* guarantee anything about metadata, which is very important.
Yes, I understand all of this but I was trying to avoid accepting it.
But after the call today, I'm convinced that this is fundamentally a
filesystem problem.
I think what we need to do is:
1) make virtual WC guest controllable. If a guest enables WC, &=
~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user
specify the virtual WC mode but it has to be changeable during live
migration.
2) only let the user choose between using and not using the host page
cache. IOW, direct=on|off. cache=XXX is deprecated.
3) make O_DIRECT | O_DSYNC not suck so badly on ext4.
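A rough sketch of how items 1 and 2 could map onto open(2) flags. The
function name and its two boolean knobs are hypothetical and only
illustrate the proposal; this is not existing QEMU code.

```c
#define _GNU_SOURCE        /* must precede the includes to expose O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper, not QEMU code: derive open(2) flags for an image
 * from the two knobs proposed in the list above. */
int image_open_flags(bool direct, bool guest_wc_enabled)
{
    int flags = O_RDWR;

    if (direct)
        flags |= O_DIRECT;      /* ask the kernel to avoid the host page cache */

    if (guest_wc_enabled)
        flags &= ~O_DSYNC;      /* guest is expected to send its own flushes */
    else
        flags |= O_DSYNC;       /* each write should be stable when it returns */

    return flags;
}

/* Usage example: the two knobs are independent of each other. */
void check_image_open_flags(void)
{
    assert((image_open_flags(true, true) & O_DSYNC) == 0);
    assert((image_open_flags(true, false) & O_DSYNC) != 0);
    assert((image_open_flags(true, false) & O_DIRECT) != 0);
    assert((image_open_flags(false, false) & O_DIRECT) == 0);
}
```

One caveat for the "changeable" part: on Linux, fcntl(F_SETFL) does not
change O_DSYNC on a descriptor that is already open, so switching the
virtual WC at runtime would presumably mean reopening the image or
issuing fdatasync() explicitly after each write.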
Barriers are a Linux-specific implementation detail that is in the
process of going away, probably in Linux 2.6.37. But if you want
O_DSYNC semantics with a volatile disk write cache there is no way
around using a cache flush or the FUA bit on all I/O caused by it.
If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.
If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache. But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct. O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.
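To make the equivalence concrete, here is a minimal sketch of the two
variants side by side. The function names and the scratch-file demo are
made up for illustration. Note that fdatasync() covers the whole file,
whereas the implicit flush described above only needs to cover the
range just written.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Variant A: the descriptor was opened with O_DSYNC, so pwrite() itself
 * only returns once the kernel considers the written data stable. */
ssize_t write_dsync_fd(int dsync_fd, const void *buf, size_t len, off_t off)
{
    return pwrite(dsync_fd, buf, len, off);
}

/* Variant B: plain descriptor, explicit flush after the write. */
ssize_t write_then_fdatasync(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0 || fdatasync(fd) < 0)
        return -1;
    return n;
}

/* Exercise both variants on a scratch file; returns 0 on success. */
int demo_roundtrip(void)
{
    char path[] = "/tmp/dsync-demo-XXXXXX";
    char out[8] = {0};
    int ok = 0;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int dfd = open(path, O_WRONLY | O_DSYNC);
    if (dfd >= 0) {
        ok = write_then_fdatasync(fd, "abcd", 4, 0) == 4 &&
             write_dsync_fd(dfd, "efgh", 4, 4) == 4 &&
             pread(fd, out, 8, 0) == 8 &&
             memcmp(out, "abcdefgh", 8) == 0;
        close(dfd);
    }
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Either variant ends up paying for a cache flush (or FUA) on a disk with
a volatile write cache; the difference is only who asks for it and how
often.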
Yes. I was stuck on O_DSYNC being independent of the virtual WC but
it's clear to me now that it cannot be.
ext3 and ext4 have really bad fsync implementations. Just use a better
filesystem or bug one of its developers if you want that fixed. But
except for disabling the disk cache there is no way to get data integrity
without cache flushes (the FUA bit is nothing but an implicit flush).
But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated storage
or physical devices)?
Who is "we" and what is the workload/filesystem/kernel combination?
Specific details and numbers please.
My concern is ext4. With a preallocated file and cache=none as
implemented today, performance is good even when barrier=1. If we
enable O_DSYNC, performance will plummet. Ultimately, this is an ext4
problem, not a QEMU problem.
Perhaps we can issue a warning if the WC is disabled and we do an fstatfs()
and see that it's ext4 with barriers enabled.
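A sketch of what such a check might look like. Everything here is
hypothetical: the function names are invented, and it assumes the
filesystem type and mount options are taken from the image's entry in
/proc/mounts, and that disabled barriers show up there as "nobarrier"
or "barrier=0". The f_type magic reported by statfs is the same 0xEF53
for ext2, ext3 and ext4, so the magic number alone would not be enough
to single out ext4, let alone its barrier setting.

```c
#include <stdbool.h>
#include <string.h>

/* True if the comma-separated option list contains exactly this token. */
static bool has_opt(const char *opts, const char *name)
{
    size_t n = strlen(name);
    const char *p = opts;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len == n && strncmp(p, name, n) == 0)
            return true;
        if (!end)
            break;
        p = end + 1;
    }
    return false;
}

/* Hypothetical policy: warn when the guest has turned the virtual WC off
 * (so every write would carry O_DSYNC) and the image sits on ext4 whose
 * mount options do not show barriers as disabled. */
bool should_warn_ext4_barriers(const char *fstype, const char *opts,
                               bool guest_wc_enabled)
{
    if (guest_wc_enabled)
        return false;           /* no per-write flush, nothing to warn about */
    if (strcmp(fstype, "ext4") != 0)
        return false;
    return !(has_opt(opts, "nobarrier") || has_opt(opts, "barrier=0"));
}
```

For example, should_warn_ext4_barriers("ext4", "rw,relatime", false)
would ask for a warning, while the same call with "rw,nobarrier" would
not.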
I think it's more common for a user to want to disable a virtual WC
because they have less faith in the hypervisor than they have in the
underlying storage.
The scenarios I am concerned about:
1) User has enterprise storage, but has an image on ext4 with
barrier=1. User explicitly disables WC in guest because they have
enterprise storage but not a UPS for the hypervisor.
2) User does not have enterprise storage, but has an image on ext4 with
barrier=1. User explicitly disables WC in guest because they don't know
what they're doing.
In the case of (1), the answer may be "ext4 sucks, remount with
barrier=0" but I think we need to at least warn the user of this.
For (2), again it's probably the user doing the wrong thing because if
they don't have enterprise storage, then they shouldn't care about a
virtual WC. Practically though, I've seen a lot of this with users.
Regards,
Anthony Liguori