On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote:
> On Thu, Jun 7, 2012 at 3:14 PM, Jeff Cody <jc...@redhat.com> wrote:
>> On 06/07/2012 02:19 AM, Taisuke Yamada wrote:
>>> I attended Paolo Bonzini's qemu session ("Live Disk Operations:
>>> Juggling Data and Trying to go Unnoticed") in LinuxCon Japan, and he
>>> advised me to post the bits I have regarding my question about
>>> qemu's support for shrinking a CoW image.
>>>
>>> Here's my problem description.
>>>
>>> I recently designed an experimental system which holds VM master
>>> images on a HDD and CoW snapshots on an SSD. VMs run on CoW
>>> snapshots only. This split-image configuration is done to keep VM
>>> I/Os on the SSD.
>>>
>>> As SSD capacity is rather limited, I need to do a writeback commit
>>> from the SSD to the HDD from time to time, and that is done during
>>> weekend/midnight hours. The problem is that although a commit is
>>> made, that alone won't shrink the CoW image - all unused blocks are
>>> still kept in the snapshot and use up space.
>>>
>>> The attached patch is a workaround I added to cope with the problem,
>>> but the basic issue I faced was that neither the QCOW2 nor the QED
>>> format supports the "bdrv_make_empty" API yet.
>>>
>>> Implementing the API (say, by hole punching) seemed like a lot of
>>> effort, so I ended up creating a new CoW image and then replacing
>>> the current CoW snapshot with the new (empty) one. But I find the
>>> code ugly.
>>>
>>> In his talk, Paolo suggested the possibility of using a new "live
>>> op" API for this task, but I'm not aware of the actual API. Is there
>>> any documentation or source code I can look at to re-implement the
>>> above feature?
>>>
>>> Best Regards,
>>
>> Hello Taisuke-san,
>>
>> I am working on a document now for a live commit proposal, with the
>> API being similar to the block-stream command, but for a live commit.
>> Here is what I am thinking about proposing for the command:
>>
>> { 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str',
>>   '*top': 'str', '*speed': 'int' } }
>>
>> I think something similar to the above would be good for a 'live
>> commit', and it would be somewhat analogous to block streaming, but
>> in the other direction.
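
To make the intended usage concrete, here is a rough sketch of how the
proposed command might be driven over QMP for Taisuke's case (commit
the overlay on the SSD down into the master image on the HDD while the
guest keeps running). The command does not exist yet, and the device
name, file paths and speed value below are made up purely for
illustration:

  -> { "execute": "block-commit",
       "arguments": { "device": "virtio0",
                      "top": "/ssd/snap.qcow2",
                      "base": "/hdd/master.img",
                      "speed": 10000000 } }
  <- { "return": {} }

As with block-stream, I would expect this to merely start a background
job, so the empty return only means the job was accepted, not that the
commit has completed.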
>> One issue I see with the attached patch is its reliance on
>> bdrv_close() / bdrv_open(): once the image has been closed, you no
>> longer have the ability to safely recover from error, because it is
>> possible for the recovery bdrv_open() to fail for some reason.
>>
>> The live block commit command I am working on operates like the block
>> streaming code, and like transactional commands, in that the use of
>> bdrv_close() / bdrv_open() to change an image is avoided, so that
>> error recovery can be safely done by just abandoning the operation.
>> A key point that needs to be done 'transactionally' is opening the
>> base or intermediate target image with file access mode r/w, as the
>> backing files are opened r/o by default.
>>
>> I am going to be putting all my documentation into the qemu wiki
>> today / tomorrow, and I will follow up with a link to that if you
>> like.
>
> Thanks for sharing. This is also something Zhi Hui and I have been
> thinking about, my notes are below. The key difference from Taisuke's
> requirement is that I imagined we would simply not support merging the
> top image down while the VM is running. You could only merge down an
> image which is not top-most.
>
> <quote>
> For incremental backup we typically have a backing file chain like
> this:
>
> vm001.img <-- snap1.qcow2 <-- snap2.qcow2
>
> The guest is writing to snap2.qcow2. vm001.img and snap1.qcow2 are
> read-only and the guest cannot write to them.
>
> We want to commit snap1.qcow2 down into vm001.img while the guest is
> running:
>
> vm001.img <-- snap2.qcow2
>
> This means copying allocated blocks from snap1.qcow2 and writing them
> into vm001.img. Once this process is complete it is safe to delete
> snap1.qcow2 since all data is now in vm001.img.

Yes, this is the same as what we are wanting to accomplish. The trick
here is opening vm001.img r/w in a safe manner (by safe, I mean being
able to abort in case of error while keeping the guest running live).
My thoughts on this have revolved around something similar to what was
done in bdrv_append(): a duplicate BDS is created, a new file-open is
performed with the appropriate access mode flags, and, if successful,
the duplicate is swapped in for the originally opened BDS of vm001.img.
If there is an error, the new BDS is abandoned without modifying the
BDS list.

> As a result we have made the backing file chain shorter. This is
> important because otherwise incremental backup would grow the backing
> file chain forever - each time it takes a new snapshot the chain
> becomes longer and I/O accesses can become slower!
>
> The task is to add a new block job type called "commit". It is like
> the qemu-img commit command except it works while the guest is
> running.
>
> The new QMP command should look like this:
>
> { 'command': 'block-commit', 'data': { 'device': 'str', 'image':
>   'str', 'base': 'str', '*speed': 'int' } }

This is very similar to what I was thinking as well - I think the only
difference is that what you called 'image' I called 'top', and that it
comes after 'base' in my argument order. Here is what I had for the
command:

{ 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str',
  '*top': 'str', '*speed': 'int' } }

I don't think I have a strong preference for either of our proposed
commands - they are essentially the same.

> This command can take a backing file chain:
>
> base <- a <- b <- image <- c
>
> It copies allocated blocks from a <- b <- image into base:
>
> base <- c
>
> After the operation completes a, b, and image can be deleted.

Yes - and of course, any other child / leaf images of base are now
invalid, but that is just a consequence of a commit operation.
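
As a concrete illustration of that intermediate case, with your
argument names the sub-chain above might be collapsed with something
like the following (the device and file names are invented, and the
exact argument names are of course what we still need to settle on):

  -> { "execute": "block-commit",
       "arguments": { "device": "virtio0",
                      "image": "image.qcow2",
                      "base": "base.img" } }
  <- { "return": {} }

leaving just "base <- c" once the job finishes, at which point a, b and
image can be deleted.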
> Note that block-commit cannot work on the top-most image since the
> guest is still writing to that image and we might never be able to
> copy all the data into the base image (the guest could write new data
> as quickly as we copy it to the base). The command should check for
> this and reject the top-most image.

By this you mean that you would like to disallow committing the
top-level image into the base? Perhaps there is a way to attempt to
converge, and adaptively give more time to the coroutine if we are able
to detect divergence. This may require violating the 'speed' parameter,
however, making the commit less 'live'.

> This command is similar to block-stream but it copies data "down" to
> the backing file instead of "up" from the backing file. It's
> necessary to add this command because in most cases block-commit is
> much more efficient than block-stream (the CoW file usually has much
> less data than the backing file so less data needs to be copied).
> </quote>

Definitely, that has been my mental model as well - block-stream in
reverse.

> Let's figure out how to specify block-commit so we're all happy, that
> way we can avoid duplicating work. Any comments on my notes above?

I think we are almost completely on the same page - the devil is in the
details, of course (for instance, on how to convert the destination
base from r/o to r/w).

> Stefan
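
One last thought: assuming the final command ends up being implemented
as a block job like block-stream, I would expect the existing block-job
commands and events to apply to it as well. Purely as an illustrative
sketch (the device name, numbers and the "commit" job type are invented
here, the field names are modeled on the existing BLOCK_JOB_COMPLETED
event, and event timestamps are omitted):

  -> { "execute": "block-job-set-speed",
       "arguments": { "device": "virtio0", "speed": 20000000 } }
  <- { "return": {} }

  <- { "event": "BLOCK_JOB_COMPLETED",
       "data": { "type": "commit", "device": "virtio0",
                 "len": 1073741824, "offset": 1073741824,
                 "speed": 20000000 } }

That way a management tool could throttle, cancel, and wait for a live
commit exactly as it already does for a streaming job.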