Dear list,

my research revolves around cloud computing, virtual machines and migration. In this context I came across the following: a recent study by IBM indicates that a typical VM only migrates between a small set of physical servers, often just two.
The potential for optimization is clear. By storing a snapshot of the VM's memory on the migration source, we can reuse (some of) this information on a subsequent incoming migration. In the course of our research we implemented a prototype of this feature within kvm/qemu. We would like to contribute it to mainline, but it needs cleanup and proper testing. As is the nature of research prototypes, the code is ugly and not well integrated with the existing kvm/qemu codebase.

To avoid confusion and irritation, I want to mention that I have little experience in contributing to large open-source projects. So if I violate some unwritten protocol or best practices, please be patient.

Initially, I'm hoping to get some feedback on the current state of the implementation. It would be immensely helpful if someone more intimately familiar with the migration code/framework could comment on the prototype's current state. The code very likely needs restructuring to make it fit better into the overall codebase. My goal is to find out what needs to change and how to change it.

The prototype also touches the migration protocol. Changes in this part probably need discussion. The basic idea is that if a block of memory (e.g., a 4 KiB page) already exists at the migration destination, then the source only sends a checksum of the block (currently MD5). The destination uses the checksum to find the corresponding block, e.g., by reading it from local storage (instead of transferring it over the network). This definitely reduces the migration traffic and usually also the overall migration time.

We currently use MD5 checksums to identify (un)modified blocks. For strict ping-pong migration, where a VM only migrates between two servers, it would also be possible to use dirty page tracking to identify modified pages. This has not been implemented so far.
We are also unclear about the potential performance tradeoffs such tracking might entail and how it would interact with the existing dirty page tracking code during a live migration.

Our research also includes a look at real-world data to show that this optimization actually makes sense in practice. If you are interested, you can find a draft of the relevant paper at:

https://www.dropbox.com/s/v7qzim8exmji6j5/paper.pdf?dl=0

Keep in mind that the paper is not published yet and, hence, is still work in progress.

As you can see, there are many open/unanswered questions, but I'm hopeful that this feature will eventually become part of kvm/qemu such that everyone can benefit from it.

Please find the current code at

https://bitbucket.org/tknauth/vecycle-qemu/branch/checkpoint-assisted-migration

Looking forward to your feedback,
Thomas.
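
P.S. To make the basic idea of the protocol change more concrete, here is a minimal sketch in Python. It is not taken from the prototype and does not reflect the actual qemu migration stream format. All function names, the in-memory "stream" of tuples, and the assumption that the source already knows which digests the destination holds are purely illustrative.

```python
import hashlib

PAGE_SIZE = 4096  # 4 KiB blocks, as in the example in the text


def split_pages(memory, page_size=PAGE_SIZE):
    """Cut a flat memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


def build_checkpoint_index(checkpoint):
    """Destination side: map MD5 digest -> page content of the local snapshot."""
    return {hashlib.md5(page).digest(): page for page in split_pages(checkpoint)}


def encode_migration(memory, known_digests):
    """Source side: emit only the digest for pages the destination
    is assumed to hold already (hypothetical 'known_digests' set),
    and the full page otherwise."""
    stream = []
    for page in split_pages(memory):
        digest = hashlib.md5(page).digest()
        if digest in known_digests:
            stream.append(("digest", digest))  # 16 bytes instead of 4 KiB
        else:
            stream.append(("page", page))
    return stream


def decode_migration(stream, index):
    """Destination side: rebuild memory, resolving digests via the local index."""
    memory = bytearray()
    for kind, payload in stream:
        memory += index[payload] if kind == "digest" else payload
    return bytes(memory)


# Two-page VM: the first page is unchanged since the snapshot, the second is modified.
old_memory = bytes(PAGE_SIZE) + b"\x01" * PAGE_SIZE
new_memory = bytes(PAGE_SIZE) + b"\x02" * PAGE_SIZE

index = build_checkpoint_index(old_memory)
stream = encode_migration(new_memory, set(index))
restored = decode_migration(stream, index)
```

In this toy example only the modified second page travels in full; the unchanged first page is replaced by its digest and recovered from the destination's local snapshot. A real implementation would also have to handle a digest that the destination cannot resolve, which the sketch ignores.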