On 17.01.2022 at 09:46, Hanna Reitz wrote:
> On 16.01.22 19:09, Nir Soffer wrote:
> > On Sun, Jan 16, 2022 at 1:17 PM Nir Soffer <nsof...@redhat.com> wrote:
> > > Some of our tests started to fail since qemu-img 6.2.0 released.
> > > 
> > > Here is an example failure:
> > > 
> > > $ truncate -s1g /data/scratch/empty-1g.raw
> > > 
> > > $ qemu-nbd -r -t -e8 -k /tmp/nbd.sock -A -f raw /data/scratch/empty-1g.raw &
> > git bisect points to this commit:
> > 
> > $ git bisect good
> > 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2 is the first bad commit
> > commit 0bc329fbb009f8601cec23bf2bc48ead0c5a5fa2
> > Author: Hanna Reitz <hre...@redhat.com>
> > Date:   Thu Aug 12 10:41:44 2021 +0200
> > 
> >      block: block-status cache for data regions
> > 
> > The commit message says:
> > 
> >      (Note that only caching data regions but not zero regions means that
> >      returning false information from the cache is not catastrophic:
> >      Treating zeroes as data is fine.  While we try to invalidate the
> >      cache on zero writes and discards, such incongruences may still occur
> >      when there are other processes writing to the image.)
> > 
> > I don't think it is fine to report zero as data. This can cause severe
> > performance issues when users copy unneeded zero extents over the network,
> > instead of doing no work with zero bandwidth.
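
To make the bandwidth cost concrete, here is a minimal, self-contained sketch
(not QEMU or NBD client code; the Extent struct and the helper are
hypothetical) of how a copy client that trusts block status treats the two
cases: an extent reported as zero costs no payload on the wire, while the same
extent misreported as data has to be read in full.

/*
 * Minimal sketch, assuming a hypothetical copy client: an extent reported as
 * zero is skipped (or zeroed cheaply on the destination), an extent reported
 * as data is read in full over the network.
 */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct Extent {
    uint64_t length;
    bool     is_zero;    /* e.g. reported via NBD base:allocation metadata */
} Extent;

/* Returns the number of payload bytes that have to cross the network. */
static uint64_t copy_extent(const Extent *e)
{
    if (e->is_zero) {
        /* Destination is zeroed (or left sparse); no data transfer. */
        return 0;
    }
    /* Full read over the network plus a write on the destination. */
    return e->length;
}

int main(void)
{
    Extent accurate    = { 1ULL << 30, true  };  /* 1 GiB hole, reported as zero */
    Extent misreported = { 1ULL << 30, false };  /* same hole, reported as data */

    printf("reported as zero: %" PRIu64 " bytes on the wire\n",
           copy_extent(&accurate));
    printf("reported as data: %" PRIu64 " bytes on the wire\n",
           copy_extent(&misreported));
    return 0;
}
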
> 
> You’re right, it was meant as a contrast to whether this might lead to
> catastrophic data failure.
> 
> But it was also meant as a risk estimation.  There wasn’t supposed to be a
> situation where the cache would report zeroes as data (other than with
> external writers), and so that’s why I thought it’d be fine.  But, well,
> things are supposed to be one way, and then you (me) write buggy code...
> 
> Thanks for the reproducer steps; I can reproduce the bug with your script
> (not with nbdinfo, FWIW), and the problem seems to be this:
> 
> diff --git a/block/io.c b/block/io.c
> index bb0a254def..83e94faeaf 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2498,7 +2498,8 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>               * the cache requires an RCU update, so double check here to avoid
>               * such an update if possible.
>               */
> -            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
> +            if (want_zero &&
> +                ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>                  QLIST_EMPTY(&bs->children))
>              {
>                  /*
> 
> We should update the cache only when we have accurate zero information
> (with want_zero=false, drivers may report zero regions as data without
> actually checking), so only if want_zero was true.

Either that, or store the want_zero value as well and compare it before
using the cache.

Depends on whether we need the optimisation for want_zero=false cases,
too. I think this mostly means bdrv_is_allocated(_above)(), which seems
to be relevant at least for the commit and stream jobs and 'qemu-img
rebase', plus some less commonly used code paths. There are some other
cases of is_allocated (like in mirror) where it doesn't make a difference
because we always process the maximum number of bytes with the first call.
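
A rough sketch of that alternative, assuming a simplified cache entry (the
struct and function names are hypothetical, not the real block-status cache
in block/io.c): the entry remembers the want_zero value it was filled with,
and a result obtained with want_zero=false is never used to answer a
want_zero=true query, since it may hide a zero region.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache entry: like the real cache it records a data range, but
 * additionally stores the want_zero value of the request that filled it. */
typedef struct BlockStatusCacheEntry {
    bool    valid;
    bool    want_zero;   /* want_zero of the request that filled the entry */
    int64_t offset;
    int64_t bytes;
} BlockStatusCacheEntry;

/* A want_zero=false result may call a zero region "data", so it must not
 * serve a want_zero=true query.  The reverse is fine: a want_zero=true
 * result is at least as precise as a want_zero=false one. */
static bool bsc_entry_usable(const BlockStatusCacheEntry *e,
                             int64_t offset, bool want_zero)
{
    return e->valid &&
           offset >= e->offset &&
           offset < e->offset + e->bytes &&
           (e->want_zero || !want_zero);
}
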

> > Do we have a way to disable this cache, at least for nbd?
> 
> No, unfortunately not.  We could implement it, but I think the management
> layer would need to specify it for all protocol nodes where it wants the
> cache to be disabled.
> 
> Auto-detecting would be difficult for qemu, because it would basically mean
> detecting whether there are any exports in the block graph somewhere above
> the node in question (not even immediately above it; in your example, for
> instance, there's a raw node between the export and the protocol node whose
> block-status information we're caching).
> 
> I assume you’re now very skeptical of the cache, so even if we implement a
> fix for this problem, you’ll probably still want that option to disable it,
> right?

I don't think we can realistically provide an option for every
individual optimisation where there could be a bug. We just need to fix
the problem and make sure to return correct data when there is no
external writer.

Kevin

