On 06/08/2012 05:08 PM, Kevin Wolf wrote: > Am 08.06.2012 20:33, schrieb Jeff Cody: >> On 06/08/2012 01:57 PM, Kevin Wolf wrote: >>> Am 08.06.2012 19:46, schrieb Jeff Cody: >>>> On 06/08/2012 12:11 PM, Kevin Wolf wrote: >>>>> Am 08.06.2012 16:32, schrieb Jeff Cody: >>>>>> On 06/08/2012 09:53 AM, Stefan Hajnoczi wrote: >>>>>>> On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <jc...@redhat.com> wrote: >>>>>>>> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote: >>>>>>>>> Note that block-commit cannot work on the top-most image since the >>>>>>>>> guest is still writing to that image and we might never be able to >>>>>>>>> copy all the data into the base image (the guest could write new data >>>>>>>>> as quickly as we copy it to the base). The command should check for >>>>>>>>> this and reject the top-most image. >>>>>>>> >>>>>>>> By this you mean that you would like to disallow committing the >>>>>>>> top-level image to the base? Perhaps there is a way to attempt to >>>>>>>> converge, and adaptively give more time to the co-routine if we are >>>>>>>> able >>>>>>>> to detect divergence. This may require violating the 'speed' >>>>>>>> parameter, >>>>>>>> however, and make the commit less 'live'. >>>>>>> >>>>>>> Yes, I think we should disallow merging down the topmost image. The >>>>>>> convergence problem is the same issue that live migration has. It's a >>>>>>> hard problem and not something that is essential for block-commit. It >>>>>>> would add complexity into the block-commit implementation - the only >>>>>>> benefit would be that you can merge down the last COW snapshot. We >>>>>>> already have an implementation that does this: the "commit" command >>>>>>> which stops the guest :). >>>>>> >>>>>> OK. And, it is better to get this implemented now, and if desired, I >>>>>> suppose we can always revisit the convergence at a later date (or not at >>>>>> all, if there is no pressing case for it). >>>>> >>>>> Not allowing live commit for the topmost image may be a good way to >>>>> break the problem down into more manageable pieces, but eventually the >>>>> feature isn't complete until you can do it. >>>>> >>>>> Basically what you need there is the equivalent of an active mirror. I >>>>> wouldn't be surprised if actually code could be reused for both. >>>> >>>> I agree, doing it like mirroring for new writes on the top layer makes >>>> sense, as long as you are willing to violate the (optional) speed >>>> parameter. I wouldn't think violating the speed parameter in that case >>>> would be an issue, as long as it is a documented affect of performing a >>>> live commit on the active (top) layer. >>> >>> The question is what the speed parameter really means for a live commit >>> block job (or for an active mirror). I think it would make sense to >>> limit only the speed of I/O that doesn't come from the guest but is from >>> the background copying. >>> >>> If you did apply it to guest I/O as well, then I think the logical >>> consequence would be that guest I/O is throttled as well because >>> requests only complete when they have reached both source and target. >>> >>> (I know I'm happily mixing terminology of all kinds of live block jobs >>> here, hope you understand anyway what I mean ;-)) >> >> I understand what you mean :) My thought is that the block commit >> 'speed' parameter would essentially only apply to 'old data', that is >> being copied in the commit coroutine, and it would be ignored (violated) >> for 'new data' in the mirroring-like case. >> >> I assumed we wouldn't want to throttle the guest I/O, but that may be a >> bad assumption on my part. > > I don't disagree, I just wanted to point out that throttling the guest > would be another option. Applying the limit only to old data probably > makes most sense indeed. >
Given that, perhaps it does make sense to get it implemented in two phases - first the straightforward block commit for images below the active layer, and then a follow-up patch set to implement committing in the top layer alongside an active mirroring approach for convergence. >>>>>> I am happy to do it that way. I'll shift my focus to the atomic image >>>>>> reopen in r/w mode. I'll go ahead and post my diagrams and other info >>>>>> for block-commit on the wiki, because I don't think it conflicts with we >>>>>> discussed above (although I will modify my diagrams to not show commit >>>>>> from the top-level image). Of course, feel free to change it as >>>>>> necessary. >>>>> >>>>> I may have mentioned it before, but just in case: I think Supriya's >>>>> bdrv_reopen() patches are a good starting point. I don't know why they >>>>> were never completed, but I think we all agreed on the general design, >>>>> so it should be possible to pick that up. >>>>> >>>>> Though if you have already started with your own work on it, Jeff, I >>>>> expect that it won't be much different because it's basically the same >>>>> transactional approach that you know and that we already used for group >>>>> snapshots. >>>> >>>> I will definitely use parts of Supriya's as it makes sense - what I >>>> started work on is similar to bdrv_append() and the current transaction >>>> approach, so there will be plenty in common to reuse, even with some >>>> differences. >>> >>> I'm not sure how far a bdrv_append-like approach will work in this case. >>> The problem is that you must not open any image r/w twice. The policy is >>> that you can open an image either once r/w or multiple times read-only. >> >> I think there are 3 possibilities that would be OK: open once r/w, one >> or more times r/o, or once r/w + one or more times r/o. Do you agree >> with this? Is the latter case against block-layer policy? > > Yes, it is. The problem is that image formats have data cached in > memory, and as soon as you have a writer, the other users of the same > image file - no matter whether r/w or r/o - get out of sync. They might > mix old (cached) matadata with other new metadata, and I think you can > imagine what mess would result. > > Strictly speaking even raw caches things like the image size, so if the > r/w instance happens to resize the image, the additional r/o readers > would have a problem even with raw. > OK, thanks. I should have qualified that with saying, for this application, that it is not intended to have the images open r/w and r/o for ongoing operations; it is just for the purpose of overlapping the closing of the r/o instance. >> In this special case for block commit, we need to do the reopen() to go >> from a r/o image to a r/w image. If an image is opened r/o, the drv >> layer can do another open with a new descriptor, this time r/w, without >> closing the r/o instance. If the new r/w open was successful, then the >> r/o instance can be closed, and some bdrv_append()-like operations done. >> Otherwise, like the current transactional code, we can just 'walk away' >> by only closing the new image (if it got that far), and the existing BDS >> with the r/o image is safely unchanged (and never closed, so no >> unrecoverable error paths). > > Yes. It's just that you need to be very careful. You would ask the qcow2 > driver to reopen the image, and it would take care to invalidate any > cached data and then pass the request on to the next layer. When you > have drilled down through the stack to raw-posix, it would actually open > the second file descriptor and wait for a commit/abort that goes through > the whole stack again. > That makes sense, and I agree that it is a bit tricky. The other question is what protocol / drivers will support this sort of operation, as well. And I certainly think that using Supriya's patches will prove useful. Thanks, Jeff