On Tue, Oct 12, 2021 at 5:22 PM Peter Lieven <p...@kamp.de> wrote: > > the qemu rbd driver currently lacks support for bdrv_co_block_status. > This results mainly in incorrect progress during block operations (e.g. > qemu-img convert with an rbd image as source). > > This patch utilizes the rbd_diff_iterate2 call from librbd to detect > allocated and unallocated (all zero areas). > > To avoid querying the ceph OSDs for the answer this is only done if > the image has the fast-diff feature which depends on the object-map and > exclusive-lock features. In this case it is guaranteed that the information > is present in memory in the librbd client and thus very fast. > > If fast-diff is not available all areas are reported to be allocated > which is the current behaviour if bdrv_co_block_status is not implemented. > > Signed-off-by: Peter Lieven <p...@kamp.de> > --- > block/rbd.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 112 insertions(+) > > V4->V5: > - rename rbd_diff_req to RBDDiffIterateReq, use typedef and move > defintion to top [Ilya] > - rename callback to qemu_rbd_diff_iterate_cb [Ilya] > - assert that req.bytes == 0 if !req.exists and r == 0 [Ilya] > > V3->V4: > - make req.exists a bool [Ilya] > - simplify callback under the assuption that we never receive a cb > for a hole since we do not diff against a snapshot [Ilya] > - remove out label [Ilya] > - rename ret to status [Ilya] > > V2->V3: > - check rbd_flags every time (they can change during runtime) [Ilya] > - also check for fast-diff invalid flag [Ilya] > - *map and *file cant be NULL [Ilya] > - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an > unallocated area [Ilya] > - typo: catched -> caught [Ilya] > - changed wording about fast-diff, object-map and exclusive lock in > commit msg [Ilya] > > V1->V2: > - add commit comment [Stefano] > - use failed_post_open [Stefano] > - remove redundant assert [Stefano] > - add macro+comment for the magic -9000 value [Stefano] > - always set *file if its non NULL [Stefano] > > > diff --git a/block/rbd.c b/block/rbd.c > index 701fbf2b0c..def96292e0 100644 > --- a/block/rbd.c > +++ b/block/rbd.c > @@ -97,6 +97,12 @@ typedef struct RBDTask { > int64_t ret; > } RBDTask; > > +typedef struct RBDDiffIterateReq { > + uint64_t offs; > + uint64_t bytes; > + bool exists; > +} RBDDiffIterateReq; > + > static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx, > BlockdevOptionsRbd *opts, bool cache, > const char *keypairs, const char *secretid, > @@ -1259,6 +1265,111 @@ static ImageInfoSpecific > *qemu_rbd_get_specific_info(BlockDriverState *bs, > return spec_info; > } > > +/* > + * rbd_diff_iterate2 allows to interrupt the exection by returning a negative > + * value in the callback routine. Choose a value that does not conflict with > + * an existing exitcode and return it if we want to prematurely stop the > + * execution because we detected a change in the allocation status. > + */ > +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000 > + > +static int qemu_rbd_diff_iterate_cb(uint64_t offs, size_t len, > + int exists, void *opaque) > +{ > + RBDDiffIterateReq *req = opaque; > + > + assert(req->offs + req->bytes <= offs); > + /* > + * we do not diff against a snapshot so we should never receive a > callback > + * for a hole. > + */ > + assert(exists); > + > + if (!req->exists && offs > req->offs) { > + /* > + * we started in an unallocated area and hit the first allocated > + * block. req->bytes must be set to the length of the unallocated > area > + * before the allocated area. stop further processing. > + */ > + req->bytes = offs - req->offs; > + return QEMU_RBD_EXIT_DIFF_ITERATE2; > + } > + > + if (req->exists && offs > req->offs + req->bytes) { > + /* > + * we started in an allocated area and jumped over an unallocated > area, > + * req->bytes contains the length of the allocated area before the > + * unallocated area. stop further processing. > + */ > + return QEMU_RBD_EXIT_DIFF_ITERATE2; > + } > + > + req->bytes += len; > + req->exists = true; > + > + return 0; > +} > + > +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs, > + bool want_zero, int64_t > offset, > + int64_t bytes, int64_t > *pnum, > + int64_t *map, > + BlockDriverState **file) > +{ > + BDRVRBDState *s = bs->opaque; > + int status, r; > + RBDDiffIterateReq req = { .offs = offset }; > + uint64_t features, flags; > + > + assert(offset + bytes <= s->image_size); > + > + /* default to all sectors allocated */ > + status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; > + *map = offset; > + *file = bs; > + *pnum = bytes; > + > + /* check if RBD image supports fast-diff */ > + r = rbd_get_features(s->image, &features); > + if (r < 0) { > + return status; > + } > + if (!(features & RBD_FEATURE_FAST_DIFF)) { > + return status; > + } > + > + /* check if RBD fast-diff result is valid */ > + r = rbd_get_flags(s->image, &flags); > + if (r < 0) { > + return status; > + } > + if (flags & RBD_FLAG_FAST_DIFF_INVALID) { > + return status; > + } > + > + r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true, > + qemu_rbd_diff_iterate_cb, &req); > + if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) { > + return status; > + } > + assert(req.bytes <= bytes); > + if (!req.exists) { > + if (r == 0) { > + /* > + * rbd_diff_iterate2 does not invoke callbacks for unallocated > + * areas. This here catches the case where no callback was > + * invoked at all (req.bytes == 0). > + */ > + assert(req.bytes == 0); > + req.bytes = bytes; > + } > + status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID; > + } > + > + *pnum = req.bytes; > + return status; > +} > + > static int64_t qemu_rbd_getlength(BlockDriverState *bs) > { > BDRVRBDState *s = bs->opaque; > @@ -1494,6 +1605,7 @@ static BlockDriver bdrv_rbd = { > #ifdef LIBRBD_SUPPORTS_WRITE_ZEROES > .bdrv_co_pwrite_zeroes = qemu_rbd_co_pwrite_zeroes, > #endif > + .bdrv_co_block_status = qemu_rbd_co_block_status, > > .bdrv_snapshot_create = qemu_rbd_snap_create, > .bdrv_snapshot_delete = qemu_rbd_snap_remove, > -- > 2.17.1 > >
Reviewed-by: Ilya Dryomov <idryo...@gmail.com> Thanks, Ilya