Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter

Kevin Wolf Mon, 05 Aug 2024 05:00:33 -0700

Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
> On 7/18/24 17:51, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> > > From: "Denis V. Lunev" <d...@openvz.org>
> > > 
> > > We have observed that some clusters in the QCOW2 files are zeroed
> > > while preallocation filter is used.
> > > 
> > > We are able to trace down the following sequence when prealloc-filter
> > > is used:
> > >      co=0x55e7cbed7680 qcow2_co_pwritev_task()
> > >      co=0x55e7cbed7680 preallocate_co_pwritev_part()
> > >      co=0x55e7cbed7680 handle_write()
> > >      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7fe500 do_fallocate()
> > > 
> > > Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> > > 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> > > time to handle next coroutine, which
> > >      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> > >      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> > >      co=0x55e7cbee91b0 handle_write()
> > >      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7deb00 do_fallocate()
> > > 
> > > The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> > > file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> > > the same area. This means that if (once fallocate is started inside
> > > 0x7f9edb7deb00) original fallocate could end and the real write will
> > > be executed. In that case write() request is handled at the same time
> > > as fallocate().
> > > 
> > > The patch moves s->file_lock assignment before fallocate and that is
> > s/file_lock/file_end/?
> > 
> > > crucial. The idea is that all subsequent requests into the area
> > > being preallocation will be issued as just writes without fallocate
> > > to this area and they will not proceed thanks to overlapping
> > > requests mechanics. If preallocation will fail, we will just switch
> > > to the normal expand-by-write behavior and that is not a problem
> > > except performance.
> > > 
> > > Signed-off-by: Denis V. Lunev <d...@openvz.org>
> > > Tested-by: Andrey Drobyshev <andrey.drobys...@virtuozzo.com>
> > > ---
> > >   block/preallocate.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/preallocate.c b/block/preallocate.c
> > > index d215bc5d6d..ecf0aa4baa 100644
> > > --- a/block/preallocate.c
> > > +++ b/block/preallocate.c
> > > @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, 
> > > int64_t bytes,
> > >       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
> > > +    /*
> > > +     * Assign file_end before making actual preallocation. This will 
> > > ensure
> > > +     * that next request performed while preallocation is in progress 
> > > will
> > > +     * be passed without preallocation.
> > > +     */
> > > +    s->file_end = prealloc_end;
> > > +
> > >       ret = bdrv_co_pwrite_zeroes(
> > >               bs->file, prealloc_start, prealloc_end - prealloc_start,
> > >               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | 
> > > BDRV_REQ_NO_WAIT);
> > > @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, 
> > > int64_t bytes,
> > >           return false;
> > >       }
> > > -    s->file_end = prealloc_end;
> > >       return want_merge_zero;
> > >   }
> > Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> > unchanged. In other words, if anything calls preallocate_co_getlength()
> > in the meantime, don't we run into...
> > 
> >      ret = bdrv_co_getlength(bs->file->bs);
> > 
> >      if (has_prealloc_perms(bs)) {
> >          s->file_end = s->zero_start = s->data_end = ret;
> >      }
> > 
> > ...and reset s->file_end back to the old value, re-enabling the bug
> > you're trying to fix here?
> > 
> > Or do we know that no such code path can be called concurrently for some
> > reason?
> > 
> > Kevin
> > 
> After more detailed thinking I tend to disagree.
> Normally we would not hit the problem. Though
> this was not obvious at the beginning :)
> 
> The point in handle_write() where we move
> file_end assignment is reachable only if the
> following code has been executed
> 
>     if (s->data_end < 0) {
>         s->data_end = bdrv_co_getlength(bs->file->bs);
>         if (s->data_end < 0) {
>             return false;
>         }
> 
>         if (s->file_end < 0) {
>             s->file_end = s->data_end;
>         }
>     }
> 
>     if (end <= s->data_end) {
>         return false;
>     }
> 
> which means that s->data_end > 0.
> 
> Thus
> 
> static int64_t coroutine_fn GRAPH_RDLOCK
> preallocate_co_getlength(BlockDriverState *bs)
> {
>     int64_t ret;
>     BDRVPreallocateState *s = bs->opaque;
> 
>     if (s->data_end >= 0) {
>         return s->data_end; <--- we will return here
>     }
> 
>     ret = bdrv_co_getlength(bs->file->bs);
> 
>     if (has_prealloc_perms(bs)) {
>         s->file_end = s->zero_start = s->data_end = ret;
>     }
> 
>     return ret;
> }


I think you're right there. And the other places that update s->file_end
should probably be okay because of the request serialisation.

I'm okay with applying this patch as it seems to fix a corruption, but
the way this whole block driver operates doesn't feel very robust to me.
There seem to be a lot of implicit assumptions that make the code hard
to understand even though it's quite short.

Kevin

Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter

Reply via email to