On Wed, Jan 12, 2022 at 9:37 PM Peter Lieven <p...@kamp.de> wrote: > > Am 12.01.22 um 20:51 schrieb Ilya Dryomov: > > On Wed, Jan 12, 2022 at 1:32 PM Peter Lieven <p...@kamp.de> wrote: > >> Am 12.01.22 um 13:22 schrieb Ilya Dryomov: > >>> On Wed, Jan 12, 2022 at 12:55 PM Peter Lieven <p...@kamp.de> wrote: > >>>> Am 12.01.22 um 10:59 schrieb Ilya Dryomov: > >>>>> On Mon, Jan 10, 2022 at 12:43 PM Peter Lieven <p...@kamp.de> wrote: > >>>>>> librbd had a bug until early 2022 that affected all versions of ceph > >>>>>> that > >>>>>> supported fast-diff. This bug results in reporting of incorrect offsets > >>>>>> if the offset parameter to rbd_diff_iterate2 is not object aligned. > >>>>>> Work around this bug by rounding down the offset to object boundaries. > >>>>>> > >>>>>> Fixes: https://tracker.ceph.com/issues/53784 > >>>>> I don't think the Fixes tag is appropriate here. Linking librbd > >>>>> ticket is fine but this patch doesn't really fix anything. > >>>> Okay, I will change that to See: > >>> It's already linked in the source code, up to you if you also want to > >>> link it in the description. > >>> > >>>>>> Cc: qemu-sta...@nongnu.org > >>>>>> Signed-off-by: Peter Lieven <p...@kamp.de> > >>>>>> --- > >>>>>> block/rbd.c | 17 ++++++++++++++++- > >>>>>> 1 file changed, 16 insertions(+), 1 deletion(-) > >>>>>> > >>>>>> diff --git a/block/rbd.c b/block/rbd.c > >>>>>> index 5e9dc91d81..260cb9f4b4 100644 > >>>>>> --- a/block/rbd.c > >>>>>> +++ b/block/rbd.c > >>>>>> @@ -1333,6 +1333,7 @@ static int coroutine_fn > >>>>>> qemu_rbd_co_block_status(BlockDriverState *bs, > >>>>>> int status, r; > >>>>>> RBDDiffIterateReq req = { .offs = offset }; > >>>>>> uint64_t features, flags; > >>>>>> + int64_t head; > >>>>>> > >>>>>> assert(offset + bytes <= s->image_size); > >>>>>> > >>>>>> @@ -1360,6 +1361,19 @@ static int coroutine_fn > >>>>>> qemu_rbd_co_block_status(BlockDriverState *bs, > >>>>>> return status; > >>>>>> } > >>>>>> > >>>>>> + /* > >>>>>> + * librbd had a bug until early 2022 that affected all versions > >>>>>> of ceph that > >>>>>> + * supported fast-diff. This bug results in reporting of > >>>>>> incorrect offsets > >>>>>> + * if the offset parameter to rbd_diff_iterate2 is not object > >>>>>> aligned. > >>>>>> + * Work around this bug by rounding down the offset to object > >>>>>> boundaries. > >>>>>> + * > >>>>>> + * See: https://tracker.ceph.com/issues/53784 > >>>>>> + */ > >>>>>> + head = offset & (s->object_size - 1); > >>>>>> + offset -= head; > >>>>>> + req.offs -= head; > >>>>>> + bytes += head; > >>>>> So it looks like the intention is to have more or less a permanent > >>>>> workaround since all librbd versions are affected, right? For that, > >>>>> I think we would need to also reject custom striping patterns and > >>>>> clones. For the above to be reliable, the image has to be standalone > >>>>> and have a default striping pattern (stripe_unit == object_size && > >>>>> stripe_count == 1). Otherwise, behave as if fast-diff is disabled or > >>>>> invalid. > >>>> Do you have a fealing how many users use a different striping pattern > >>>> than default? > >>> Very few. > >>> > >>>> What about EC backed pools? > >>> In this context EC pools behave exactly the same as replicated pools. > >>> > >>>> Do you have another idea how we can detect if the librbd version is > >>>> broken? > >>> No. Initially I wanted to just fix these bugs in librbd, relying on > >>> the assumption that setups with a new QEMU should also have a fairly > >>> new librbd. But after looking at various distros and realizing the > >>> extent of rbd_diff_iterate2() issues, I think a long-term workaround > >>> in QEMU makes sense. A configure-time check for known good versions > >>> of librbd can be added later if someone feels like it. > >> > >> Okay, I will add a check on image open if the striping fits or > >> > >> we have a clone. Can you explain how I would best check for it? > > ... if _we_ are a clone. Use rbd_get_parent_info(), it returns 0 if an > > image has a parent and -ENOENT if an image is standalone. Keep in mind > > that the image can be flattened (i.e. made standalone) at any moment so > > this should be checked on each invocation. > > > > To get striping parameters, use rbd_get_stripe_unit() and > > rbd_get_stripe_count() after checking for RBD_FEATURE_STRIPINGV2 > > feature. This can be done just on image open, but since features > > are queried on each invocation anyway and very few users are going > > to have a custom striping pattern (and therefore have this feature > > enabled), I'd do it here as well. > > > > Thanks, > > > > Ilya > > The flattening is not a problem because if we detect a clone we disable > get_block_status > > for the session. If it gets flattened later it doesn't hurt we have just no > allocation info. > > So at least these checks could be done at image open.
I see. Doing both on image open and handling clones this way is fine too. > > > To be honest, all these checks and workaround stuff doesn't feel right. We > try to fix something that is broken > > from scratch and make the qemu block driver quite ugly doing so. > > > I would personally go and drop get_block_status support for all existing > librbd versions and add a flag to librbd that reports that rbd_diff_iterate2 > is not broken > > and check for that and only support get_block_status if it is present. It's a couple of dozen lines of code which I certainly wouldn't consider ugly. A flag would be taking it too far IMO, a simple lower-bound version check on the upcoming Quincy release should be enough. > > > Or if you can confirm that rbd_diff_iterate2 does at least work for a query > on the full image support get_block_status only for read-only > > opened image files (this is true for our both current use cases backing file > and qemu-img convert with rbd source) and read and cache > > the full allocation status at the first invocation of get_block_status (and > of course drop it on invalidate_cache or bdrv_reopen). > > This would also work around the nasty algorithm that tries to handle all > callback cases. What nasty algorithm? If you are talking about the first patch, then I think it can be reduced to the addition of if (exists) return 0; at the top of the callback. Thanks, Ilya