Hi Everyone,

I have been trying for some time to get the code for the live-snapshot
blueprint[1] merged. Going through the review process for the rpc and interface
code[2] was easy. I suspect the api-extension code[3] will also be relatively
trivial to get in. The main concern is with the libvirt driver
implementation[4]. I'd like to discuss the concerns and see if we can make some
progress.

Short Summary (tl;dr)
=====================

I propose we merge live-cloning as an experimental feature for Havana and have
the api extension disabled by default.

Overview
========

First of all, let me express the value of live snapshotting. The slowest part of
the vm provisioning process is generally booting the OS. The advantage of
live-snapshotting is that it makes it possible to bring up application
servers while skipping the overhead of vm (and application) startup.

I recognize that this capability comes with some security concerns, so I don't
expect this feature to go in and be ready for use in production right away.
Similarly, containers have a lot of the same benefits, but have had their own
security issues which are gradually being resolved. My hope is that getting
this feature in would allow people to start experimenting with live-booting so
that we could uncover some of these security issues.

There are two specific concerns that have been raised regarding my patch. The 
first concern is related to my use of libvirt. The second concern is related to 
the security issues above. Let me address them separately.

1. Libvirt Issues
=================

The only feature I require from the hypervisor is to load memory/processor 
state for a vm from a file. Qemu supports this directly. The only way that 
libvirt exposes this functionality is via its restore command which is 
specifically for restoring the previous state of an existing vm. "Cloning", or 
restoring the memory state of a cloned vm is considered unsafe (which I will 
address in the second point, below).
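
For reference, the documented save/restore path can be sketched with the
libvirt Python bindings roughly as follows. This is only an illustration of
the limitation described above, not code from the patch: the helper name and
the state file path are made up, and `conn` is assumed to behave like a
`libvirt.virConnect` object. It brings back the same domain, not a clone.

```python
def save_and_restore(conn, domain_name, state_path):
    """Save a domain's memory/CPU state to a file, then restore it.

    `conn` is assumed to behave like libvirt.virConnect, for example the
    result of libvirt.open("qemu:///system"). The restored domain is the
    original one, which is why this path cannot be used as-is for cloning.
    """
    dom = conn.lookupByName(domain_name)
    # virDomain.save() writes the state to state_path and stops the domain.
    dom.save(state_path)
    # virConnect.restore() starts the domain again from the saved state file.
    conn.restore(state_path)
    return conn.lookupByName(domain_name)
```
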

The result of the limited api is that I must include some hacks to make the
restore command actually allow me to restore the state of the new vm. I
recognize that this is using an undocumented libvirt api and isn't the ideal
solution, but it seemed "better" than avoiding libvirt and talking directly to
qemu.

This is obviously not ideal. It is my hope that this 0.1 version of the feature
will allow us to iteratively improve the live-snapshot/clone process and get
the security to a point where the libvirt maintainers would be willing to
accept a patch to directly expose an api to load memory from a file.

2. Security Concerns
====================

There are a number of security issues with loading state from another vm. Here 
is a short list of things that need to be done just to make a cloned vm usable:

a) mac address needs to be recreated
b) entropy pool needs to be reset
c) host name must be reset
d) host keys must be regenerated

There are others, and trying to clone a running application as well may expose
other sensitive data, especially if users are snapshotting vms and making them
public.

The only issue that I address on the driver side is the mac addresses. This is 
the minimum that needs to be done just to be able to access the vm over the 
network. This is implemented by unplugging all network devices before the 
snapshot and plugging new network devices in on clone. This isn't the most 
friendly thing to guest applications, but it seems like the safest option for 
the first version of this feature.
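
As a rough illustration of the "new network devices on clone" step, the
sketch below generates a fresh MAC address and builds a libvirt <interface>
device definition for it. It is not taken from the patch: the function names,
the choice of the locally administered fa:16:3e prefix, the virtio model, and
the bridge name in the usage note are all assumptions made for the example.

```python
import random
import xml.etree.ElementTree as ET


def generate_mac(rng=random):
    """Return a random unicast MAC under the locally administered fa:16:3e prefix."""
    octets = [0xFA, 0x16, 0x3E,
              rng.randint(0x00, 0xFF),
              rng.randint(0x00, 0xFF),
              rng.randint(0x00, 0xFF)]
    return ":".join("%02x" % octet for octet in octets)


def build_interface_xml(mac, bridge):
    """Build a libvirt <interface> device definition for a bridged NIC."""
    iface = ET.Element("interface", type="bridge")
    ET.SubElement(iface, "mac", address=mac)
    ET.SubElement(iface, "source", bridge=bridge)
    ET.SubElement(iface, "model", type="virtio")
    return ET.tostring(iface, encoding="unicode")
```

A driver could then pass the result of
`build_interface_xml(generate_mac(), "br100")` to `virDomain.attachDevice()`
on the cloned domain, where "br100" is just a placeholder bridge name.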

So cloning vms must be done with care. Sensitive data must be removed from the
vm pre-clone and new data needs to be generated post-clone. Ultimately this
should all be done via a guest agent of some sort. I have found some volunteers
to make the guest agent a reality, but it will take a bit of time to get
something workable, and it will be much more difficult if there isn't a way to
test the feature.
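
To make the post-clone side a little more concrete, here is a minimal sketch
of what such an agent might run inside a Linux guest for items (c) and (d)
above. No guest agent exists yet, so everything here is hypothetical: the
function names are invented, and the commands assume a guest with OpenSSH and
the standard `hostname` utility. Reseeding the entropy pool is left out
because it needs a fresh entropy source from outside the clone.

```python
import subprocess


def post_clone_commands(new_hostname):
    """Return the commands a hypothetical guest agent would run after a clone."""
    return [
        # Remove the host keys inherited from the source vm.
        ["sh", "-c", "rm -f /etc/ssh/ssh_host_*"],
        # ssh-keygen -A generates any host keys that are missing.
        ["ssh-keygen", "-A"],
        # Give the clone its own host name.
        ["hostname", new_hostname],
    ]


def run_post_clone(new_hostname):
    """Run each command in order, stopping at the first failure (needs root)."""
    for command in post_clone_commands(new_hostname):
        subprocess.check_call(command)
```
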

Conclusion
==========

There are obviously problems to be solved with the whole idea of live cloning, 
but I think it enables some important new ways of deploying applications. 
Imagine for example a PaaS built on fast-cloning vms instead of containers. 
This is clearly a long term project but if we block it now it may never get the 
support it needs to become a real option.

Proposal
========

I propose we allow the patch in and we leave the live-snapshot extension
disabled by default. Deployers can turn on the extension to experiment with the
feature. This will allow other hypervisors to do an implementation, and the
community in general to improve the security and usefulness of live-cloned
virtual machines.

I'm very interested in your thoughts and feedback. Thank you to everyone who 
made it this far.

Vish 

[1] https://blueprints.launchpad.net/nova/+spec/live-snapshot-vms
[2] https://review.openstack.org/#/c/33697/
[3] https://review.openstack.org/#/c/34036/
[4] https://review.openstack.org/#/c/33698/
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
