I did a brain dump of my understanding of the various storage requirements for
live migration. I think it's accurate but I may have misunderstood some details
so I would appreciate review.
I think given sections (1) and (2), the only viable thing is to require
cache=none unless we get new interfaces to flush caches.
Section (3) talks about image formats. As I mentioned elsewhere in the thread,
I think the best we can do right now is have a block layer interface to quiesce
the image format. I think reopen may be a viable short term strategy for qcow2
but I think for raw, we should just make the quiesce operation a nop.
http://wiki.qemu.org/Migration/Storage
Inlined below for ease of review.
Regards,
Anthony Liguori
Migration in QEMU is designed assuming cache coherent shared storage and raw
format block devices. There are some cases where migration will also work
with more weakly coherent shared storage. This wiki page attempts to outline
those scenarios. It also attempts to iterate through the reasons why various
image formats do not support migration even with shared storage.
== NFS ==
=== Background ===
NFS only offers close-to-open cache coherence. This means that the only
guarantee provided by the protocol is that if you close a file in a client A and
then open the file in another client B, client B will see client A's changes.
The way migration works in QEMU, the source stops the guest after it sends all
of the required data but does not immediately free any resources. This makes
migration more reliable since it avoids the Two Generals Problem by allowing a
reliable third node to make the final decision about whether migration was
successful.
As soon as the destination receives all of the data, it immediately starts the
guest. This means that the reliable third node is not in the critical path of
migration downtime but can still recover a failed migration.
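The handover described above can be sketched as a small decision function run
by the third node. This is only an illustration: the names Outcome and
arbitrate, and the assumption that the third node can find out whether the
destination is running, are hypothetical and not part of QEMU.

```python
from enum import Enum


class Outcome(Enum):
    """What the (hypothetical) third node tells the two hosts to do."""
    COMMIT_TO_DESTINATION = "free the paused source, keep the destination"
    ROLL_BACK_TO_SOURCE = "discard the destination, resume the paused source"


def arbitrate(source_paused: bool, destination_running: bool) -> Outcome:
    """Decide the fate of a migration from the third node's point of view.

    The source stops the guest after sending the last of the data but keeps
    its resources, so both outcomes are still possible at this point.
    """
    if not source_paused:
        raise ValueError("source must have stopped the guest before arbitration")
    if destination_running:
        return Outcome.COMMIT_TO_DESTINATION
    return Outcome.ROLL_BACK_TO_SOURCE
```

Because the decision happens after the destination has already started the
guest, it adds nothing to migration downtime.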
Since the source never knows that the destination is okay, the only way to
support NFS robustly would be to close all files on the source before sending
the last chunk of migration data. This would mean that if any failure occurred
after this point, the VM would be lost.
=== In Practice ===
A Linux NFS server that exports with 'sync' offers a stronger coherency than NFS
guarantees. This is an implementation detail, not a guarantee as far as I know.
If a client sends a read request, then any data that another client has written
with a stable write, and that the server has acknowledged, will be returned
without the need to close and reopen the file.
A file opened with O_DIRECT through the Linux NFS client will always issue a
protocol read operation for each userspace read() call. This means that if you
issue stable writes (fsync) on the source and then use O_DIRECT to read on the
destination, you can safely access the same file without reopening.
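A minimal sketch of that access pattern, assuming Linux and a file system that
accepts O_DIRECT. The helper names stable_write and direct_read are made up for
this example; QEMU itself does this in C inside its block layer.

```python
import mmap
import os


def stable_write(path: str, data: bytes) -> None:
    """Source side: write data, then fsync so it reaches stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)


def direct_read(path: str, length: int = 4096) -> bytes:
    """Destination side: read from offset 0, bypassing the page cache.

    O_DIRECT requires an aligned buffer and length. An anonymous mmap is
    page-aligned, and length should be a multiple of the block size.
    """
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        with mmap.mmap(-1, length) as buf:
            n = os.readv(fd, [buf])
            return buf[:n]
    finally:
        os.close(fd)
```

On the source, stable_write() would be called before handing over; on the
destination, direct_read() goes to the server on every call instead of being
satisfied from a possibly stale local cache.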
=== Conclusion ===
Migration with QEMU is safe, in practice, when using Linux as both the NFS
server and client, provided that both the source and destination use cache=none
for the disks and a raw file.
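For reference, a raw image with cache=none corresponds to a -drive argument
like the one built below. The helper name raw_uncached_drive and the example
path are invented for illustration; the file=, format= and cache= suboptions
are the ones QEMU's -drive accepts.

```python
from typing import List


def raw_uncached_drive(path: str) -> List[str]:
    """Build a -drive argument for a raw image opened with cache=none.

    QEMU option strings use ',' as a separator, so a literal comma in the
    file name has to be written as ',,'.
    """
    escaped = path.replace(",", ",,")
    return ["-drive", f"file={escaped},format=raw,cache=none"]
```

For example, raw_uncached_drive("/mnt/nfs/guest.img") yields the two arguments
-drive and file=/mnt/nfs/guest.img,format=raw,cache=none, which would need to
be used on both the source and the destination.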
== iSCSI/Direct Attached Storage ==
iSCSI has a similar cache coherency guarantee to direct attached storage (via
fibre channel). Any read request will return data that has been acknowledged as
written by another client.
Since QEMU issues read() requests in userspace, Linux normally uses the page
cache. The Linux page cache is not coherent across multiple nodes so the only
way to safely access storage coherently is to bypass the Linux page cache via
cache=none.
=== Conclusion ===
iSCSI, FC, or other forms of direct attached storage are only safe to use with
live migration if you use cache=none and a raw image.
== Clustered File Systems ==
Clustered File Systems such as GPFS, Ceph, GlusterFS, or GFS2 are safe to use
with live migration regardless of the caching option used as long as raw images
are used.
== Image Formats ==
Image formats are not safe to use with live migration. The reason is that QEMU
caches data for image formats and does not have a mechanism to flush those
caches. The following attempts to describe the issues with the various formats.
=== QCOW2 ===
QCOW2 caches two forms of data: cluster metadata (L1/L2 data, refcount table,
etc.) and mutable header information (file size, snapshot entries, etc.).
This cached data needs to be discarded on the destination after migration
completes, before the guest resumes there.
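The stale-cache problem can be shown with a toy model. Nothing here is QCOW2
code: ToyImage, its fields and the dict standing in for shared storage are all
hypothetical, and reopen() simply stands for whatever mechanism ends up
discarding the cached metadata.

```python
class ToyImage:
    """Toy stand-in for an image format driver that caches metadata on open.

    `storage` is a dict playing the role of coherent shared storage.
    """

    def __init__(self, storage: dict):
        self.storage = storage
        self._load()

    def _load(self) -> None:
        # Keep a private copy of the metadata, as a format driver would.
        self.cached_l2 = dict(self.storage["l2_table"])

    def lookup(self, guest_cluster: int):
        """Map a guest cluster to a host offset using only the cache."""
        return self.cached_l2.get(guest_cluster)

    def reopen(self) -> None:
        """Discard the cache and re-read metadata from shared storage."""
        self._load()


shared = {"l2_table": {0: 0x10000}}
dest = ToyImage(shared)            # destination opens the image early
shared["l2_table"][1] = 0x20000    # source allocates a cluster afterwards
stale = dest.lookup(1)             # None: the destination's cache is stale
dest.reopen()
fresh = dest.lookup(1)             # 0x20000 once the cache is rebuilt
```

Even with perfectly coherent storage underneath, the destination gives the
wrong answer until its own cache is dropped.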
=== QED ===
QED caches similar data to QCOW2. In addition, the QED header has a dirty flag
that must be handled specially in the case of live migration.
=== Raw Files ===
Technically, the file size of a raw file is mutable metadata that QEMU caches.
This is only applicable when using online image resizing. If you avoid online
image resizing during live migration, raw files are completely safe provided the
storage used meets the above requirements.
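Taken together, the conclusions on this page can be written down as a small
lookup. This is a summary sketch only: the function name migration_safe and the
storage labels are invented for this example, and the NFS entry assumes Linux
as both server and client as described above.

```python
# Storage where the Linux page cache must be bypassed (cache=none).
NEEDS_CACHE_NONE = {"nfs", "iscsi", "fc"}
# Clustered file systems that are coherent regardless of the caching option.
CLUSTERED = {"gpfs", "ceph", "glusterfs", "gfs2"}


def migration_safe(storage: str, cache: str, image_format: str) -> bool:
    """Return True if the combination is considered safe on this page.

    Assumes no online image resize happens during migration, since the size
    of a raw file is cached too.
    """
    if image_format != "raw":
        return False  # image formats cache metadata that cannot be flushed
    storage = storage.lower()
    if storage in CLUSTERED:
        return True
    if storage in NEEDS_CACHE_NONE:
        return cache == "none"
    raise ValueError(f"storage type not covered here: {storage}")
```

For instance, migration_safe("nfs", "writeback", "raw") is False, while
migration_safe("gfs2", "writeback", "raw") is True.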