Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support

Jeff Cody Sat, 09 Jun 2012 09:52:41 -0700

On 06/08/2012 05:08 PM, Kevin Wolf wrote:
> Am 08.06.2012 20:33, schrieb Jeff Cody:
>> On 06/08/2012 01:57 PM, Kevin Wolf wrote:
>>> Am 08.06.2012 19:46, schrieb Jeff Cody:
>>>> On 06/08/2012 12:11 PM, Kevin Wolf wrote:
>>>>> Am 08.06.2012 16:32, schrieb Jeff Cody:
>>>>>> On 06/08/2012 09:53 AM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <jc...@redhat.com> wrote:
>>>>>>>> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote:
>>>>>>>>> Note that block-commit cannot work on the top-most image since the
>>>>>>>>> guest is still writing to that image and we might never be able to
>>>>>>>>> copy all the data into the base image (the guest could write new data
>>>>>>>>> as quickly as we copy it to the base).  The command should check for
>>>>>>>>> this and reject the top-most image.
>>>>>>>>
>>>>>>>> By this you mean that you would like to disallow committing the
>>>>>>>> top-level image to the base?  Perhaps there is a way to attempt to
>>>>>>>> converge, and adaptively give more time to the co-routine if we are 
>>>>>>>> able
>>>>>>>> to detect divergence.  This may require violating the 'speed' 
>>>>>>>> parameter,
>>>>>>>> however, and make the commit less 'live'.
>>>>>>>
>>>>>>> Yes, I think we should disallow merging down the topmost image.  The
>>>>>>> convergence problem is the same issue that live migration has.  It's a
>>>>>>> hard problem and not something that is essential for block-commit.  It
>>>>>>> would add complexity into the block-commit implementation - the only
>>>>>>> benefit would be that you can merge down the last COW snapshot.  We
>>>>>>> already have an implementation that does this: the "commit" command
>>>>>>> which stops the guest :).
>>>>>>
>>>>>> OK.  And, it is better to get this implemented now, and if desired, I
>>>>>> suppose we can always revisit the convergence at a later date (or not at
>>>>>> all, if there is no pressing case for it).
>>>>>
>>>>> Not allowing live commit for the topmost image may be a good way to
>>>>> break the problem down into more manageable pieces, but eventually the
>>>>> feature isn't complete until you can do it.
>>>>>
>>>>> Basically what you need there is the equivalent of an active mirror. I
>>>>> wouldn't be surprised if actually code could be reused for both.
>>>>
>>>> I agree, doing it like mirroring for new writes on the top layer makes
>>>> sense, as long as you are willing to violate the (optional) speed
>>>> parameter.  I wouldn't think violating the speed parameter in that case
>>>> would be an issue, as long as it is a documented affect of performing a
>>>> live commit on the active (top) layer.
>>>
>>> The question is what the speed parameter really means for a live commit
>>> block job (or for an active mirror). I think it would make sense to
>>> limit only the speed of I/O that doesn't come from the guest but is from
>>> the background copying.
>>>
>>> If you did apply it to guest I/O as well, then I think the logical
>>> consequence would be that guest I/O is throttled as well because
>>> requests only complete when they have reached both source and target.
>>>
>>> (I know I'm happily mixing terminology of all kinds of live block jobs
>>> here, hope you understand anyway what I mean ;-))
>>
>> I understand what you mean :) My thought is that the block commit
>> 'speed' parameter would essentially only apply to 'old data', that is
>> being copied in the commit coroutine, and it would be ignored (violated)
>> for 'new data' in the mirroring-like case.
>>
>> I assumed we wouldn't want to throttle the guest I/O, but that may be a
>> bad assumption on my part.
> 
> I don't disagree, I just wanted to point out that throttling the guest
> would be another option. Applying the limit only to old data probably
> makes most sense indeed.
>


Given that, perhaps it does make sense to get it implemented in two
phases - first the straightforward block commit for images below the
active layer, and then a follow-up patch set to implement committing in
the top layer alongside an active mirroring approach for convergence.

>>>>>> I am happy to do it that way.  I'll shift my focus to the atomic image
>>>>>> reopen in r/w mode.  I'll go ahead and post my diagrams and other info
>>>>>> for block-commit on the wiki, because I don't think it conflicts with we
>>>>>> discussed above (although I will modify my diagrams to not show commit
>>>>>> from the top-level image).  Of course, feel free to change it as
>>>>>> necessary.
>>>>>
>>>>> I may have mentioned it before, but just in case: I think Supriya's
>>>>> bdrv_reopen() patches are a good starting point. I don't know why they
>>>>> were never completed, but I think we all agreed on the general design,
>>>>> so it should be possible to pick that up.
>>>>>
>>>>> Though if you have already started with your own work on it, Jeff, I
>>>>> expect that it won't be much different because it's basically the same
>>>>> transactional approach that you know and that we already used for group
>>>>> snapshots.
>>>>
>>>> I will definitely use parts of Supriya's as it makes sense - what I
>>>> started work on is similar to bdrv_append() and the current transaction
>>>> approach, so there will be plenty in common to reuse, even with some
>>>> differences.
>>>
>>> I'm not sure how far a bdrv_append-like approach will work in this case.
>>> The problem is that you must not open any image r/w twice. The policy is
>>> that you can open an image either once r/w or multiple times read-only.
>>
>> I think there are 3 possibilities that would be OK: open once r/w, one
>> or more times r/o, or once r/w + one or more times r/o.  Do you agree
>> with this?  Is the latter case against block-layer policy?
> 
> Yes, it is. The problem is that image formats have data cached in
> memory, and as soon as you have a writer, the other users of the same
> image file - no matter whether r/w or r/o - get out of sync. They might
> mix old (cached) matadata with other new metadata, and I think you can
> imagine what mess would result.
> 
> Strictly speaking even raw caches things like the image size, so if the
> r/w instance happens to resize the image, the additional r/o readers
> would have a problem even with raw.
> 

OK, thanks. I should have qualified that with saying, for this
application, that it is not intended to have the images open r/w and r/o
for ongoing operations; it is just for the purpose of overlapping the
closing of the r/o instance.

>> In this special case for block commit, we need to do the reopen() to go
>> from a r/o image to a r/w image.  If an image is opened r/o, the drv
>> layer can do another open with a new descriptor, this time r/w, without
>> closing the r/o instance.  If the new r/w open was successful, then the
>> r/o instance can be closed, and some bdrv_append()-like operations done.
>> Otherwise, like the current transactional code, we can just 'walk away'
>> by only closing the new image (if it got that far), and the existing BDS
>> with the r/o image is safely unchanged (and never closed, so no
>> unrecoverable error paths).
> 
> Yes. It's just that you need to be very careful. You would ask the qcow2
> driver to reopen the image, and it would take care to invalidate any
> cached data and then pass the request on to the next layer. When you
> have drilled down through the stack to raw-posix, it would actually open
> the second file descriptor and wait for a commit/abort that goes through
> the whole stack again.
>

That makes sense, and I agree that it is a bit tricky.  The other
question is what protocol / drivers will support this sort of operation,
as well. And I certainly think that using Supriya's patches will prove
useful.

Thanks,
Jeff

Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support

Reply via email to