[Qemu-devel] Re: Caching modes

Anthony Liguori Tue, 21 Sep 2010 08:19:34 -0700

On 09/21/2010 09:26 AM, Christoph Hellwig wrote:

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:

O_DIRECT alone to a pre-allocated file on a normal file system should
result in the data being visible without any additional metadata
transactions.

Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:



        O_DIRECT (Since Linux 2.4.10)
               Try  to minimize cache effects of the I/O to and from this file.
               In general this will degrade performance, but it  is  useful  in
               special  situations,  such  as  when  applications  do their own
               caching.  File I/O is done directly to/from user space  buffers.
               The O_DIRECT flag on its own makes at an effort to transfer data
               synchronously, but does not give the guarantees  of  the  O_SYNC
               that  data and necessary metadata are transferred.  To guarantee
               synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
               See NOTES below for further discussion.

               A  semantically  similar  (but  deprecated)  interface for block
               devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity, it just tells the
filesystem it *should* not use the pagecache.  Even if it should not,
various filesystems have fallbacks to buffered I/O for corner cases.
It does *not* mean the actual disk cache gets flushed, and it *does*
not guarantee anything about metadata which is very important.

Yes, I understand all of this but I was trying to avoid accepting it. But after the call today, I'm convinced that this is fundamentally a filesystem problem.


I think what we need to do is:

1) make virtual WC guest controllable. If a guest enables WC, &= ~O_DSYNC. If it disables WC, |= O_DSYNC. Obviously, we can let a user specify the virtual WC mode but it has to be changeable during live migration.

2) only let the user choose between using and not using the host pagecache. IOW, direct=on|off. cache=XXX is deprecated.


3) make O_DIRECT | O_DSYNC not suck so badly on ext4.
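
As a rough illustration of points 1 and 2 above, the flag handling could be sketched as follows. This is not QEMU code: the helper names `base_flags` and `apply_virtual_wc` are hypothetical, the sketch assumes Linux (where Python exposes `os.O_DIRECT`), and it only computes open flags. On Linux, fcntl(F_SETFL) does not change O_DSYNC on an already-open descriptor, so a real implementation would probably have to reopen the image or issue explicit flushes instead.

```python
import os


def base_flags(direct: bool) -> int:
    """Point 2: the only user-visible choice is host page cache on or off."""
    flags = os.O_RDWR
    if direct:
        # direct=on: ask the host to bypass its page cache (Linux-specific flag).
        flags |= os.O_DIRECT
    return flags


def apply_virtual_wc(flags: int, wc_enabled: bool) -> int:
    """Point 1: the guest-controlled write cache (WC) toggles O_DSYNC."""
    if wc_enabled:
        # Guest enabled WC: flags &= ~O_DSYNC, the guest is expected to flush.
        return flags & ~os.O_DSYNC
    # Guest disabled WC: flags |= O_DSYNC, every write must reach stable storage.
    return flags | os.O_DSYNC
```

With these helpers, `apply_virtual_wc(base_flags(True), False)` yields O_DIRECT | O_DSYNC, which is the combination point 3 is concerned about on ext4.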

Barriers are a Linux-specific implementation detail that is in the
process of going away, probably in Linux 2.6.37.  But if you want
O_DSYNC semantics with a volatile disk write cache there is no way
around using a cache flush or the FUA bit on all I/O caused by it.

If you have a volatile disk write cache, then we don't need O_DSYNC
semantics.

If you present a volatile write cache to the guest you do indeed not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct.  O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.
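
The equivalence described above can be sketched with two small writers, one relying on O_DSYNC and one calling fdatasync explicitly after the write. The function names and the `demo` helper are made up for illustration, and the sketch uses ordinary buffered files in a temporary directory rather than a disk image. Note that `os.fdatasync` covers the whole file rather than just the written range, so the second variant is at least as strong as the first.

```python
import os
import tempfile


def write_with_odsync(path: str, data: bytes) -> None:
    """Each write() returns only after the data has been flushed (O_DSYNC)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def write_then_fdatasync(path: str, data: bytes) -> None:
    """Plain write() followed by an explicit fdatasync() on the descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fdatasync(fd)  # flushes the file's data, not only the written range
    finally:
        os.close(fd)


def demo() -> list:
    """Write the same payload both ways and read both files back."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, writer in (("a", write_with_odsync), ("b", write_then_fdatasync)):
            path = os.path.join(tmp, name)
            writer(path, b"guest sector data")
            with open(path, "rb") as f:
                results.append(f.read())
    return results
```

Both variants leave the same bytes in the file; the difference is only whether the flush is implied by the open flags or requested by the caller.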

Yes. I was stuck on O_DSYNC being independent of the virtual WC but it's clear to me now that it cannot be.

ext3 and ext4 have really bad fsync implementations.  Just use a better
filesystem or bug one of its developers if you want that fixed.  But
except for disabling the disk cache there is no way to get data integrity
without cache flushes (the FUA bit is nothing but an implicit flush).

But why are we issuing more flushes than the guest is issuing if we
don't have to worry about filesystem metadata (i.e. preallocated storage
or physical devices)?

Who is "we" and what is workload/filesystem/kernel combination?
Specific details and numbers please.

My concern is ext4. With a preallocated file and cache=none as implemented today, performance is good even when barrier=1. If we enable O_DSYNC, performance will plummet. Ultimately, this is an ext4 problem, not a QEMU problem.

Perhaps we can issue a warning if the WC is disabled and we do an fsstat and see that it's ext4 with barriers enabled.
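
A minimal sketch of such a check, under several assumptions: instead of an fsstat-style call it parses text in the /proc/mounts format (device, mount point, type, options), it treats ext4 barriers as enabled unless the options contain barrier=0 or nobarrier, and the helper names `find_mount` and `should_warn` are hypothetical.

```python
def find_mount(mounts_text: str, path: str):
    """Return (fstype, options) of the longest mount point containing path."""
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point, fstype, options = fields[1], fields[2], fields[3].split(",")
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fstype, options)
    return None if best is None else (best[1], best[2])


def should_warn(fstype: str, options: list, wc_enabled: bool) -> bool:
    """Warn when the virtual WC is off and the image sits on ext4 with barriers."""
    # Assumption: barriers are on unless explicitly disabled in the options.
    barriers_off = "barrier=0" in options or "nobarrier" in options
    return (not wc_enabled) and fstype == "ext4" and not barriers_off
```

A caller would read /proc/mounts on the host, look up the image path with `find_mount`, and print the warning when `should_warn` returns True.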

I think it's more common for a user to want to disable a virtual WC because they have less faith in the hypervisor than they have in the underlying storage.


The scenarios I am concerned about:

1) User has enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in guest because they have enterprise storage but not a UPS for the hypervisor.

2) User does not have enterprise storage, but has an image on ext4 with barrier=1. User explicitly disables WC in guest because they don't know what they're doing.

In the case of (1), the answer may be "ext4 sucks, remount with barrier=0" but I think we need to at least warn the user of this.

For (2), again it's probably the user doing the wrong thing because if they don't have enterprise storage, then they shouldn't care about a virtual WC. Practically though, I've seen a lot of this with users.


Regards,

Anthony Liguori
