On 21.09.2010 02:18, Anthony Liguori wrote:
> On 09/20/2010 06:17 PM, Christoph Hellwig wrote:
>> On Mon, Sep 20, 2010 at 03:11:31PM -0500, Anthony Liguori wrote:
>>
>>>>> All read and write requests SHOULD avoid any type of caching in the
>>>>> host. Any write request MUST complete after the next level of storage
>>>>> reports that the write request has completed. A flush from the guest
>>>>> MUST complete after all pending I/O requests for the guest have been
>>>>> completed.
>>>>>
>>>>> As an implementation detail, with the raw format, these guarantees are
>>>>> only in place for preallocated images. Sparse images do not provide as
>>>>> strong a guarantee.
>>>>>
>>>> That's not how cache=none ever worked, nor how it works currently.
>>>>
>>> How does it work today compared to what I wrote above?
>>>
>> From the guest's point of view it works exactly as you describe
>> cache=writeback. There are no ordering or cache flushing guarantees. By
>> using O_DIRECT we do bypass the host file cache, but we don't even try
>> for the others (the disk cache, and committing the metadata transactions
>> that are required to actually see the committed data for sparse,
>> preallocated or growing images).
>
> O_DIRECT alone to a pre-allocated file on a normal file system should
> result in the data being visible without any additional metadata
> transactions.
>
> The only time when that isn't true is when dealing with CoW or other
> special filesystem features.
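To make the distinction being argued here concrete, a minimal sketch (the helper name, path and 4096-byte alignment are assumptions for illustration, not qemu code): O_DIRECT makes the write bypass the host page cache, but draining the disk's volatile write cache still requires an explicit fdatasync() (or opening with O_DSYNC).

```c
/* Sketch, not qemu code: O_DIRECT bypasses the host page cache, but
 * durability against the disk's volatile write cache still needs an
 * explicit fdatasync() (or O_DSYNC on open). */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes at offset 0 and make them
 * durable. `buf` and `len` must be block-aligned for O_DIRECT. */
int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 && errno == EINVAL) {
        /* Filesystem (e.g. tmpfs) rejects O_DIRECT; fall back so the
         * sketch still runs, losing only the page-cache bypass. */
        fd = open(path, O_WRONLY | O_CREAT, 0644);
    }
    if (fd < 0)
        return -1;

    if (pwrite(fd, buf, len, 0) != (ssize_t)len) {  /* skips host cache */
        close(fd);
        return -1;
    }
    if (fdatasync(fd) < 0) {  /* this is what drains the disk cache */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Note that for a sparse or growing file, fdatasync() is also what commits the block-allocation metadata; with O_DIRECT alone that metadata can still be lost, which is exactly the sparse-image caveat above.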
I think preallocated files are the exception; usually people use sparse
files. And even with preallocation, the disk cache is still left.

>> What you describe above is the equivalent of O_DSYNC|O_DIRECT, which
>> doesn't exist in current qemu, except that O_DSYNC|O_DIRECT also
>> guarantees the semantics for sparse images. Sparse images really aren't
>> special in any way - preallocation using posix_fallocate or CoW
>> filesystems like btrfs, nilfs2 or zfs have exactly the same issues.
>>
>>>>                       | WC enable | WC disable
>>>> -----------------------------------------------
>>>> direct                |           |
>>>> buffer                |           |
>>>> buffer + ignore flush |           |
>>>>
>>>> currently we only have:
>>>>
>>>> cache=none          direct + WC enable
>>>> cache=writeback     buffer + WC enable
>>>> cache=writethrough  buffer + WC disable
>>>> cache=unsafe        buffer + ignore flush + WC enable
>>>>
>>> Where does O_DSYNC fit into this chart?
>>>
>> O_DSYNC is used for all WC disable modes.
>>
>>> Do all modern filesystems implement O_DSYNC without generating
>>> additional barriers per request?
>>>
>>> Having a barrier per write request is ultimately not the right semantic
>>> for any of the modes. However, without the use of O_DSYNC (or
>>> sync_file_range(), which I know you dislike), I don't see how we can
>>> have reasonable semantics without always implementing write back caching
>>> in the host.
>>>
>> Barriers are a Linux-specific implementation detail that is in the
>> process of going away, probably in Linux 2.6.37. But if you want
>> O_DSYNC semantics with a volatile disk write cache, there is no way
>> around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC
> semantics.

What do the semantics of a qemu option have to do with the host disk
write cache? We always need to provide the same semantics. If anything,
we can take advantage of a host providing write-through/no caches, so
that we don't have to issue the flushes ourselves.
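The mapping between the cache modes in the table above and open(2) flags can be sketched as follows (an assumption-laden simplification of this thread, not qemu's actual code; the "directsync" name for the missing O_DIRECT|O_DSYNC combination is hypothetical here):

```c
/* Sketch: cache modes from the table mapped to open(2) flags, under
 * the proposal that all WC-disable modes use O_DSYNC. Not qemu code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>

static int cache_mode_flags(const char *mode)
{
    if (!strcmp(mode, "none"))          /* direct + WC enable  */
        return O_DIRECT;
    if (!strcmp(mode, "writeback"))     /* buffer + WC enable  */
        return 0;
    if (!strcmp(mode, "writethrough"))  /* buffer + WC disable */
        return O_DSYNC;
    if (!strcmp(mode, "unsafe"))        /* buffer + WC enable; guest
                                           flushes are dropped elsewhere,
                                           not via open(2) flags */
        return 0;
    if (!strcmp(mode, "directsync"))    /* direct + WC disable: the
                                           O_DIRECT|O_DSYNC combination
                                           the thread says is missing */
        return O_DIRECT | O_DSYNC;
    return -1;                          /* unknown mode */
}
```

Note how writeback and unsafe map to the same flags: the difference between them (ignoring guest flushes) is behavioral, not expressible at open(2) time, which is why the table needs the separate "ignore flush" row.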
>> We currently use the cache flush, and although I plan to experiment a
>> bit more with the FUA bit for O_DIRECT | O_DSYNC writes, I would be
>> very surprised if it actually were any faster.
>
> The thing I struggle with understanding is that if the guest is sending
> us a write request, why are we sending the underlying disk a write +
> flush request? That doesn't seem logical at all to me.
>
> Even if we advertise WC disable, it should be up to the guest to decide
> when to issue flushes.

Why should a guest ever flush a cache when it's told that this cache
doesn't exist?

Kevin