On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven <p...@kamp.de> wrote: > > Am 05.10.21 um 09:54 schrieb Ilya Dryomov: > > On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven <p...@kamp.de> wrote: > >> the qemu rbd driver currently lacks support for bdrv_co_block_status. > >> This results mainly in incorrect progress during block operations (e.g. > >> qemu-img convert with an rbd image as source). > >> > >> This patch utilizes the rbd_diff_iterate2 call from librbd to detect > >> allocated and unallocated (all zero areas). > >> > >> To avoid querying the ceph OSDs for the answer this is only done if > >> the image has the fast-diff feature which depends on the object-map and > >> exclusive-lock features. In this case it is guaranteed that the information > >> is present in memory in the librbd client and thus very fast. > >> > >> If fast-diff is not available all areas are reported to be allocated > >> which is the current behaviour if bdrv_co_block_status is not implemented. > >> > >> Signed-off-by: Peter Lieven <p...@kamp.de> > >> --- > >> V2->V3: > >> - check rbd_flags every time (they can change during runtime) [Ilya] > >> - also check for fast-diff invalid flag [Ilya] > >> - *map and *file cant be NULL [Ilya] > >> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an > >> unallocated area [Ilya] > >> - typo: catched -> caught [Ilya] > >> - changed wording about fast-diff, object-map and exclusive lock in > >> commit msg [Ilya] > >> > >> V1->V2: > >> - add commit comment [Stefano] > >> - use failed_post_open [Stefano] > >> - remove redundant assert [Stefano] > >> - add macro+comment for the magic -9000 value [Stefano] > >> - always set *file if its non NULL [Stefano] > >> > >> block/rbd.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 126 insertions(+) > >> > >> diff --git a/block/rbd.c b/block/rbd.c > >> index dcf82b15b8..3cb24f9981 100644 > >> --- a/block/rbd.c > >> +++ b/block/rbd.c > >> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific > >> *qemu_rbd_get_specific_info(BlockDriverState *bs, > >> return spec_info; > >> } > >> > >> +typedef struct rbd_diff_req { > >> + uint64_t offs; > >> + uint64_t bytes; > >> + int exists; > > Hi Peter, > > > > Nit: make exists a bool. The one in the callback has to be an int > > because of the callback signature but let's not spread that. > > > >> +} rbd_diff_req; > >> + > >> +/* > >> + * rbd_diff_iterate2 allows to interrupt the exection by returning a > >> negative > >> + * value in the callback routine. Choose a value that does not conflict > >> with > >> + * an existing exitcode and return it if we want to prematurely stop the > >> + * execution because we detected a change in the allocation status. > >> + */ > >> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000 > >> + > >> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len, > >> + int exists, void *opaque) > >> +{ > >> + struct rbd_diff_req *req = opaque; > >> + > >> + assert(req->offs + req->bytes <= offs); > >> + > >> + if (req->exists && offs > req->offs + req->bytes) { > >> + /* > >> + * we started in an allocated area and jumped over an unallocated > >> area, > >> + * req->bytes contains the length of the allocated area before the > >> + * unallocated area. stop further processing. > >> + */ > >> + return QEMU_RBD_EXIT_DIFF_ITERATE2; > >> + } > >> + if (req->exists && !exists) { > >> + /* > >> + * we started in an allocated area and reached a hole. req->bytes > >> + * contains the length of the allocated area before the hole. > >> + * stop further processing. > >> + */ > >> + return QEMU_RBD_EXIT_DIFF_ITERATE2; > > Do you have a test case for when this branch is taken? > > > That would happen if you diff from a snapshot, the question is if it can also > happen if the image is a clone from a snapshot? > > > > > >> + } > >> + if (!req->exists && exists && offs > req->offs) { > >> + /* > >> + * we started in an unallocated area and hit the first allocated > >> + * block. req->bytes must be set to the length of the unallocated > >> area > >> + * before the allocated area. stop further processing. > >> + */ > >> + req->bytes = offs - req->offs; > >> + return QEMU_RBD_EXIT_DIFF_ITERATE2; > >> + } > >> + > >> + /* > >> + * assert that we caught all cases above and allocation state has not > >> + * changed during callbacks. > >> + */ > >> + assert(exists == req->exists || !req->bytes); > >> + req->exists = exists; > >> + > >> + /* > >> + * assert that we either return an unallocated block or have got > >> callbacks > >> + * for all allocated blocks present. > >> + */ > >> + assert(!req->exists || offs == req->offs + req->bytes); > >> + req->bytes = offs + len - req->offs; > >> + > >> + return 0; > >> +} > >> + > >> +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, > >> + bool want_zero, int64_t > >> offset, > >> + int64_t bytes, int64_t > >> *pnum, > >> + int64_t *map, > >> + BlockDriverState **file) > >> +{ > >> + BDRVRBDState *s = bs->opaque; > >> + int ret, r; > > Nit: I would rename ret to status or something like that to make > > it clear(er) that it is an actual value and never an error. Or, > > even better, drop it entirely and return one of the two bitmasks > > directly. > > > >> + struct rbd_diff_req req = { .offs = offset }; > >> + uint64_t features, flags; > >> + > >> + assert(offset + bytes <= s->image_size); > >> + > >> + /* default to all sectors allocated */ > >> + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; > >> + *map = offset; > >> + *file = bs; > >> + *pnum = bytes; > >> + > >> + /* check if RBD image supports fast-diff */ > >> + r = rbd_get_features(s->image, &features); > >> + if (r < 0) { > >> + goto out; > >> + } > >> + if (!(features & RBD_FEATURE_FAST_DIFF)) { > >> + goto out; > >> + } > >> + > >> + /* check if RBD fast-diff result is valid */ > >> + r = rbd_get_flags(s->image, &flags); > >> + if (r < 0) { > >> + goto out; > >> + } > >> + if (flags & RBD_FLAG_FAST_DIFF_INVALID) { > >> + goto out; > > Nit: I'm not a fan of labels that cover just the return statement. > > Feel free to ignore this one but I would get rid of it and replace > > these gotos with returns. > > > That would be return with the bitmask directly coded in if I also > > drop the ret variable. I can change that, no problem. > > > > > >> + } > >> + > >> + r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true, > >> + qemu_rbd_co_block_status_cb, &req); > >> + if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) { > >> + goto out; > >> + } > >> + assert(req.bytes <= bytes); > >> + if (!req.exists) { > >> + if (r == 0 && !req.bytes) { > >> + /* > >> + * rbd_diff_iterate2 does not invoke callbacks for > >> unallocated areas > >> + * except for the case where an overlay has a hole where the > >> parent > >> + * has not. This here catches the case where no callback was > >> + * invoked at all. > >> + */ > > The above is true in the case of diffing against a snapshot, i.e. when > > the "from" snapshot has some data where the "to" revision (whether HEAD > > or another snapshot) has a hole. I don't think it is true for child vs > > parent (but it has been a while since I looked at the diff code). As > > long as NULL is passed for fromsnapname, I would expect the callback to > > be invoked only for allocated areas. If I'm right, we could simplify > > qemu_rbd_co_block_status_cb() considerably. > > See my comment in the callback. Can you look at the diff code or give > me at least a pointer where I can find it. I am not that familiar with > the librbd code yet.
I assumed that you added !exists handling because it came up in your testing. If you don't have a test case then let's proceed under the assumption that it doesn't happen for clones. Thanks, Ilya