On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <jc...@redhat.com> wrote:
> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote:
>> On Thu, Jun 7, 2012 at 3:14 PM, Jeff Cody <jc...@redhat.com> wrote:
>>> On 06/07/2012 02:19 AM, Taisuke Yamada wrote:
>> We want to commit snap1.qcow2 down into vm001.img while the guest is running:
>>
>> vm001.img <-- snap2.qcow2
>>
>> This means copying allocated blocks from snap1.qcow2 and writing them
>> into vm001.img.  Once this process is complete it is safe to delete
>> snap1.qcow2 since all data is now in vm001.img.
>
> Yes, this is the same as what we are wanting to accomplish.  The trick
> here is open vm001.img r/w, in a safe manner (by safe, I mean able to
> abort in case of error while keeping the guest running live).
>
> My thoughts on this has revolved around something similar to what was
> done in bdrv_append(), where a duplicate BDS is created, a new file-open
> performed with the appropriate access mode flags, and if successful
> swapped out for the originally opened BDS for vm001.img.  If there is an
> error, the new BDS is abandoned without modifying the BDS list.

Yes, there needs to be an atomic way to try opening the image read/write.

>>
>> As a result we have made the backing file chain shorter.  This is
>> improtant because otherwise incremental backup would grow the backing
>> file chain forever - each time it takes a new snapshot the chain
>> becomes longer and I/O accesses can become slower!
>>
>> The task is to add a new block job type called "commit".  It is like
>> the qemu-img commit command except it works while the guest is
>> running.
>>
>> The new QMP command should look like this:
>>
>> { 'command': 'block-commit', 'data': { 'device': 'str', 'image':
>> 'str', 'base': 'str', '*speed': 'int' }
>
> This is very similar to what I was thinking as well - I think the only
> difference in the command is that I what you called 'image' I called
> 'top', and the argument order was after base.
>
> Here is what I had for the command:
>
> { 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str',
>                                       '*top': 'str', '*speed': 'int' } }
>
> I don't think I have a strong preference for either of our proposed
> commands - they are essentially the same.

Yes, they are basically the same.  I don't mind which one we use.

>>
>> Note that block-commit cannot work on the top-most image since the
>> guest is still writing to that image and we might never be able to
>> copy all the data into the base image (the guest could write new data
>> as quickly as we copy it to the base).  The command should check for
>> this and reject the top-most image.
>
> By this you mean that you would like to disallow committing the
> top-level image to the base?  Perhaps there is a way to attempt to
> converge, and adaptively give more time to the co-routine if we are able
> to detect divergence.  This may require violating the 'speed' parameter,
> however, and make the commit less 'live'.

Yes, I think we should disallow merging down the topmost image.  The
convergence problem is the same issue that live migration has.  It's a
hard problem and not something that is essential for block-commit.  It
would add complexity into the block-commit implementation - the only
benefit would be that you can merge down the last COW snapshot.  We
already have an implementation that does this: the "commit" command
which stops the guest :).

If we take Taisuke's scenario, just create a snapshot on the SSD
immediately before merging down:

  /big-slow-cheap-disk/master.img <- /small-fast-expensive-ssd/cow.qcow2

(qemu) snapshot_blkdev ...

  /big-slow-cheap-disk/master.img <-
/small-fast-expensive-ssd/cow.qcow2 <-
/small-fast-expensive-ssd/newcow.qcow2

(qemu) block-commit cow.qcow2

  /big-slow-cheap-disk/master.img <- /small-fast-expensive-ssd/newcow.qcow2

>>
>> Let's figure out how to specify block-commit so we're all happy, that
>> way we can avoid duplicating work.  Any comments on my notes above?
>>
>
> I think we are almost completely on the same page - devil is in the
> details, of course (for instance, on how to convert the destination base
> from r/o to r/w).

Great.  The atomic r/o -> r/w transition and the commit coroutine can
be implemented on in parallel.  Are you happy to implement the atomic
r/o -> r/w transition since you wrote bdrv_append()?  Then Zhi Hui can
assume that part already works and focus on implementing the commit
coroutine in the meantime.  I'm just suggesting a way to split up the
work, please let me know if you think this is good.

Stefan

Reply via email to