On Wed, Sep 11, 2013 at 05:30:10PM +0300, Filippos Giannakos wrote:
> I stumbled upon this link [1] which, among other things, contains the following:
>
> "iSCSI, FC, or other forms of direct attached storage are only safe to use
> with live migration if you use cache=none."
>
> How valid is this assertion with current QEMU versions?
>
> I checked out the source code and was left with the impression that during
> migration, and *before* handing control to the destination, a flush is
> performed on all disks of the VM. Since the VM is started on the destination
> only after the flush is done, its very first read will bring consistent data
> from disk.
>
> I can understand that in the corner case where the storage device has
> already been mapped, and perhaps has data in the page cache of the
> destination node, there is no way to invalidate that data, so the VM will
> read stale data despite the flushes that happened on the source node.
>
> In our case, we provision VMs using our custom storage layer, called
> Archipelago [2], which presents volumes as block devices on the host. We
> would like to run VMs in cache=writeback mode. If we guarantee externally
> that there will be no incoherent cached data on the destination host of the
> migration (e.g., by making sure the volume is not mapped on the destination
> node before the migration), would it be safe to do so?
>
> Can you comment on the aforementioned approach? Please let me know if
> there's something I have misunderstood.
>
> [1] http://wiki.qemu.org/Migration/Storage
> [2] http://www.synnefo.org/docs/archipelago/latest
Hi Filippos,

Late response, but this may help start the discussion...

Cache consistency during migration was discussed a lot on the mailing list.
You might be able to find threads from about 2 years ago that discuss this in
detail. Here is what I remember:

During migration, the QEMU process on the destination host must be started.
When QEMU starts up, it opens the image file and reads the first sector (for
disk geometry and image format probing). At this point the destination
populates its page cache while the source is still running the guest. We're
in trouble because the destination host now has stale pages in its page
cache. Hence the recommendation to use cache=none.

There are a few things to look at if you are really eager to use
cache=writeback:

1. Can you avoid geometry probing? I think that by setting the geometry
   options on the -drive you can skip probing. See hw/block/hd-geometry.c.

2. Can you avoid format probing? Use -drive format=raw to skip format
   probing.

3. Make sure to use raw image files. Do not use an image format, since that
   would require reading a header and metadata before migration handover.

4. Check whether ioctl(BLKFLSBUF) can be used. Unfortunately it requires
   CAP_SYS_ADMIN, so the QEMU process cannot issue it when running without
   privileges. Perhaps an external tool like libvirt could issue it, but
   that's tricky since live migration handover is a delicate operation - it's
   important to avoid dependencies between multiple processes, to keep guest
   downtime low and to avoid the possibility of failures.

So you might be able to get away with cache=writeback *if* you carefully
study the code and double-check with strace that the destination QEMU
process does not access the image file before handover has completed.

Stefan