Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Izik Eidus wrote: But when using O_DIRECT you actually make the pages not swappable at all... or am I wrong? Only for the duration of the I/O operation, which is typically very short. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Avi Kivity wrote: Chris Wright wrote: I think it's safe to say the perf folks are concerned w/ data integrity first, stable/reproducible results second, and raw performance third. So seeing data cached in host was simply not what they expected. I think write through is sufficient. However I think that uncached vs. wt will show up on the radar under reproducible results (need to tune based on cache size). And in most overcommit scenarios memory is typically more precious than cpu, it's unclear to me if the extra buffering is anything other than memory overhead. As long as it's configurable then it's comparable and benchmarking and best practices can dictate best choice. Getting good performance because we have a huge amount of free memory in the host is not a good benchmark. Under most circumstances, the free memory will be used either for more guests, or will be given to the existing guests, which can utilize it more efficiently than the host. I can see two cases where this is not true: - using older, 32-bit guests which cannot utilize all of the cache. I think Windows XP is limited to 512MB of cache, and usually doesn't utilize even that. So if you have an application running on 32-bit Windows (or on 32-bit Linux with PAE disabled), and a huge host, you will see a significant boost from cache=writethrough. This is a case where performance can exceed native, simply because native cannot exploit all the resources of the host. - if cache requirements vary in time across the different guests, and if some smart ballooning is not in place, having free memory on the host means we utilize it for whichever guest has the greatest need, so overall performance improves. Another justification for O_DIRECT is that many production systems will use base images for their VMs. It's mainly true for desktop virtualization but probably for some server virtualization deployments too. 
In these types of scenarios, we can have all of the base image chain opened as default with caching for read-only while the leaf images are open with cache=off. Since there is ongoing effort (both by IT and developers) to keep the base images as big as possible, it guarantees that this data is best suited for caching in the host while the private leaf images will be uncached. This way we provide good performance and caching for the shared parent images while also promising correctness. Actually this is what happens on mainline qemu with cache=off. Cheers, Dor
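The scheme Dor describes, cached read-only base images with an uncached private leaf, maps onto a qcow2 backing chain along these lines (paths, sizes, and the guest flags are illustrative only, and option spellings vary between qemu versions):

```shell
# Shared, read-only base image: opened without O_DIRECT, so its blocks
# land in the host page cache and are shared across all guests using it.
qemu-img create -f qcow2 base.qcow2 10G

# Private copy-on-write leaf backed by the base; only this file is
# ever written by the guest.
qemu-img create -f qcow2 -b base.qcow2 leaf.qcow2

# cache=off applies O_DIRECT to the image; per this thread, mainline
# qemu applied it only to the leaf, leaving the base chain cached.
qemu -hda leaf.qcow2 -m 512    # illustrative guest flags only
```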
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Dor Laor wrote: Actually this is what happens on mainline qemu with cache=off. Have I understood right that cache=off on a qcow2 image only uses O_DIRECT for the leaf image, and the chain of base images don't use O_DIRECT? Sometimes on a memory constrained host, where the (collective) guest memory is nearly as big as the host memory, I'm not sure this is what I want. -- Jamie
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Chris Wright wrote: Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get it through to host's storage subsystem, and I think that's been the core of the discussion (plus defaults, etc). Just want to point out that the storage commitment from O_DIRECT can be _weaker_ than O_DSYNC. On Linux, O_DIRECT never uses storage-device barriers or transactions, but O_DSYNC sometimes does, and fsync is even more likely to than O_DSYNC. I'm not certain, but I think the same applies to other host OSes too - including Windows, which has its own equivalents to O_DSYNC and O_DIRECT, and extra documented semantics when they are used together. Although this is a host implementation detail, unfortunately it means that O_DIRECT=no-cache and O_DSYNC=write-through-cache is not an accurate characterisation. Some might be misled into assuming that cache=off is as strongly committing their data to hard storage as cache=wb would. I think you can assume this only when the underlying storage devices' write caches are disabled. You cannot assume this if the host filesystem uses barriers instead of disabling the storage devices' write cache. Unfortunately there's not a lot qemu can do about these various quirks, but at least it should be documented, so that someone requiring storage commitment (e.g. for a critical guest database) is advised to investigate whether O_DIRECT and/or O_DSYNC give them what they require with their combination of host kernel, filesystem, filesystem options and storage device(s). -- Jamie
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Jamie Lokier wrote: Dor Laor wrote: Actually this is what happens on mainline qemu with cache=off. Have I understood right that cache=off on a qcow2 image only uses O_DIRECT for the leaf image, and the chain of base images don't use O_DIRECT? Yeah, that's a bug IMHO and in my patch to add O_DSYNC, I fix that. I think an argument for O_DIRECT in a leaf and wb in the base is seriously flawed... Regards, Anthony Liguori Sometimes on a memory constrained host, where the (collective) guest memory is nearly as big as the host memory, I'm not sure this is what I want. -- Jamie
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Dor Laor wrote: Avi Kivity wrote: Since there is ongoing effort (both by IT and developers) to keep the base images as big as possible, it guarantees that this data is best suited for caching in the host while the private leaf images will be uncached. A proper CAS solution is really such a better approach. qcow2 deduplication is an interesting concept, but such a hack :-) This way we provide good performance and caching for the shared parent images while also promising correctness. You get correctness by using O_DSYNC. cache=off should disable the use of the page cache everywhere. Regards, Anthony Liguori Actually this is what happens on mainline qemu with cache=off. Cheers, Dor
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Avi Kivity wrote: LRU typically makes fairly bad decisions since it throws most of the information it has away. I recommend looking up LRU-K and similar algorithms, just to get a feel for this; it is basically the simplest possible algorithm short of random selection. Note that Linux doesn't even have an LRU; it has to approximate since it can't sample all of the pages all of the time. With a hypervisor that uses Intel's EPT, it's even worse since we don't have an accessed bit. On silly benchmarks that just exercise the disk and touch no memory, and if you tune the host very aggressively, LRU will win on long running guests since it will eventually page out all unused guest memory (with Linux guests, it will never even page guest memory in). On real life applications I don't think there is much chance. But when using O_DIRECT you actually make the pages not swappable at all... or am I wrong? Maybe some kind of combination with the mm shrink could be good; do_try_to_free_pages is a good point of reference.
Re: [Qemu-devel] [RFC] Disk integrity in QEMU
* Mark Wagner ([EMAIL PROTECTED]) wrote: I think there are two distinct arguments going on here. My main concern is that I don't think that this is a simple "what do we make the default cache policy" issue. I think that regardless of the cache policy, if something in the guest requests O_DIRECT, the host must honor that and not cache the data. OK, O_DIRECT in the guest is just one example of the guest requesting data to be synchronously written to disk. It bypasses guest page cache, but even page cached writes need to be written at some point. Any time the disk driver issues an io where it expects the data to be on disk (possible low-level storage subsystem caching) is the area of concern. * Mark Wagner ([EMAIL PROTECTED]) wrote: Anthony Liguori wrote: It's extremely important to understand what the guarantee is. The guarantee is that upon completion of write(), the data will have been reported as written by the underlying storage subsystem. This does *not* mean that the data is on disk. I apologize if I worded it poorly; I assume that the guarantee is that the data has been sent to the storage controller and said controller sent an indication that the write has completed. This could mean multiple things, like it's in the controller's cache, on the disk, etc. I do not believe that this means that the data is still sitting in the host cache. I realize it may not yet be on a disk, but, at a minimum, I would expect that it has been sent to the storage controller. Do you consider the host's cache to be part of the storage subsystem? Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get it through to host's storage subsystem, and I think that's been the core of the discussion (plus defaults, etc). In the case of KVM, even using write-back caching with the host page cache, we are still honoring the guarantee of O_DIRECT. We just have another level of caching that happens to be write-back. I still don't get it. 
If I have something running on the host that I open with O_DIRECT, do you still consider it not to be a violation of the system call if that data ends up in the host cache instead of being sent to the storage controller? I suppose an argument could be made for host caching and write-back to be considered part of the storage subsystem from the guest POV, but then we also need to bring in the requirement for proper cache flushing. Given that a popular Linux guest fs can be a little fast and loose, wb and flushing isn't really the optimal choice for the integrity case. thanks, -chris