Dear list,

My research revolves around cloud computing, virtual machines and
migration. In this context I came across the following: a recent study
by IBM indicates that a typical VM only migrates between a small set
of physical servers; often just two.

The potential for optimization is clear. By storing a snapshot of the
VM's memory on the migration source, we can reuse (some of) this
information on a subsequent incoming migration.

In the course of our research we implemented a prototype of this
feature within kvm/qemu. We would like to contribute it to mainline,
but it needs cleanup and proper testing. As is the nature of
research prototypes, the code is ugly and not well integrated with the
existing kvm/qemu codebase. To avoid confusion and irritation, I want
to mention that I have little experience in contributing to large
open-source projects. So if I violate some unwritten protocol or best
practices, please be patient.

Initially, I'm hoping to get some feedback on the current state of the
implementation. It would be immensely helpful if someone more
intimately familiar with the migration code/framework could comment on
the prototype's current state. The code very likely needs restructuring
to make it fit better into the overall codebase. Getting information
on what needs to change and how to change it would be my goal.

The prototype also touches the migration protocol. Changes in this
part probably need discussion. The basic idea is that if a block of
memory (e.g., a 4 KiB page) already exists at the migration
destination, then the source only sends a checksum of the block
(currently MD5). The destination uses the checksum to find the
corresponding block, e.g., by reading it from local storage (instead
of transferring it over the network). This definitely reduces the
migration traffic and usually also the overall migration time.
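
To make the idea concrete, here is a minimal sketch of the exchange in
Python. It is not the prototype's code, and it does not reflect the
actual qemu migration stream format. All names (build_index,
source_encode, destination_decode) and the ("sum", ...)/("page", ...)
message tuples are made up for illustration. It also assumes the source
somehow knows which checksums the destination holds; how that set is
obtained is left open here.

```python
import hashlib

PAGE_SIZE = 4096  # 4 KiB blocks, as in the example above


def page_digest(page: bytes) -> bytes:
    """MD5 checksum of one block."""
    return hashlib.md5(page).digest()


def build_index(snapshot: bytes) -> dict:
    """Destination side: map checksum -> offset in the local snapshot."""
    index = {}
    for off in range(0, len(snapshot), PAGE_SIZE):
        # Keep the first offset seen for each distinct checksum.
        index.setdefault(page_digest(snapshot[off:off + PAGE_SIZE]), off)
    return index


def source_encode(memory: bytes, known: set) -> list:
    """Source side: send only the checksum for blocks the destination has.

    `known` is the (assumed) set of checksums present at the destination.
    """
    msgs = []
    for off in range(0, len(memory), PAGE_SIZE):
        page = memory[off:off + PAGE_SIZE]
        digest = page_digest(page)
        msgs.append(("sum", digest) if digest in known else ("page", page))
    return msgs


def destination_decode(msgs: list, snapshot: bytes, index: dict) -> bytes:
    """Destination side: rebuild memory from the snapshot and full pages."""
    out = bytearray()
    for kind, payload in msgs:
        if kind == "sum":
            off = index[payload]  # read the block from local storage
            out += snapshot[off:off + PAGE_SIZE]
        else:
            out += payload  # block was transferred over the network
    return bytes(out)
```

In this sketch, a block that did not change since the snapshot was taken
costs 16 bytes (one MD5 digest) on the wire instead of 4096.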

We currently use MD5 checksums to identify (un)modified blocks. For
strict ping-pong migration, where a VM only migrates between two
servers, there is also the possibility to use dirty page tracking to
identify modified pages. This has not been implemented so far. We are
also unclear about the potential performance tradeoffs this might
entail and how it would interact with the dirty page tracking code
during a live migration.
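
For comparison, a rough sketch of what the dirty page tracking variant
could look like in the strict ping-pong case. Again, this is not
implemented in the prototype. The function names and the `dirty` set of
page numbers are hypothetical, and the sketch assumes the guest memory
size is unchanged since the snapshot and that the destination's snapshot
is exactly the state the VM had when it last left that host.

```python
PAGE_SIZE = 4096  # 4 KiB pages


def encode_dirty(memory: bytes, dirty: set) -> list:
    """Source side: send (page number, contents) for dirtied pages only."""
    return [(n, memory[n * PAGE_SIZE:(n + 1) * PAGE_SIZE])
            for n in sorted(dirty)]


def apply_dirty(snapshot: bytes, updates: list) -> bytes:
    """Destination side: start from the snapshot, overwrite dirtied pages."""
    out = bytearray(snapshot)
    for n, page in updates:
        out[n * PAGE_SIZE:(n + 1) * PAGE_SIZE] = page
    return bytes(out)
```

Unlike the checksum approach, no hashing is needed here, but the result
is only correct if every write since the last migration was tracked.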

Our research also includes a look at real world data to motivate that
this optimization actually does make sense in practice. If you are
interested, you can find a draft of the relevant paper at:

https://www.dropbox.com/s/v7qzim8exmji6j5/paper.pdf?dl=0

Keep in mind that the paper is not published yet and, hence, is a work
in progress.

As you can see, there are many open/unanswered questions, but I'm
hopeful that this feature will eventually become part of kvm/qemu such
that everyone can benefit from it.

Please find the current code at
https://bitbucket.org/tknauth/vecycle-qemu/branch/checkpoint-assisted-migration

Looking forward to your feedback,
Thomas.
