On 7/16/24 4:32 PM, Denis V. Lunev wrote: > On 7/12/24 13:55, Vladimir Sementsov-Ogievskiy wrote: >> On 12.07.24 12:46, Andrey Drobyshev wrote: >>> From: "Denis V. Lunev" <d...@openvz.org> >>> >>> We have observed that some clusters in the QCOW2 files are zeroed >>> while preallocation filter is used. >>> >>> We are able to trace down the following sequence when prealloc-filter >>> is used: >>> co=0x55e7cbed7680 qcow2_co_pwritev_task() >>> co=0x55e7cbed7680 preallocate_co_pwritev_part() >>> co=0x55e7cbed7680 handle_write() >>> co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes() >>> co=0x55e7cbed7680 raw_do_pwrite_zeroes() >>> co=0x7f9edb7fe500 do_fallocate() >>> >>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine >>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is >>> time to handle next coroutine, which >>> co=0x55e7cbee91b0 qcow2_co_pwritev_task() >>> co=0x55e7cbee91b0 preallocate_co_pwritev_part() >>> co=0x55e7cbee91b0 handle_write() >>> co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes() >>> co=0x55e7cbee91b0 raw_do_pwrite_zeroes() >>> co=0x7f9edb7deb00 do_fallocate() >>> >>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced >>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for >>> the same area. This means that if (once fallocate is started inside >>> 0x7f9edb7deb00) original fallocate could end and the real write will >>> be executed. In that case write() request is handled at the same time >>> as fallocate(). >>> >>> The patch moves s->file_lock assignment before fallocate and that is >> >> text need to be updated >> >>> crucial. The idea is that all subsequent requests into the area >>> being preallocation will be issued as just writes without fallocate >>> to this area and they will not proceed thanks to overlapping >>> requests mechanics. If preallocation will fail, we will just switch >>> to the normal expand-by-write behavior and that is not a problem >>> except performance. >>> >>> Signed-off-by: Denis V. Lunev <d...@openvz.org> >>> Tested-by: Andrey Drobyshev <andrey.drobys...@virtuozzo.com> >>> --- >>> block/preallocate.c | 8 +++++++- >>> 1 file changed, 7 insertions(+), 1 deletion(-) >>> >>> diff --git a/block/preallocate.c b/block/preallocate.c >>> index d215bc5d6d..ecf0aa4baa 100644 >>> --- a/block/preallocate.c >>> +++ b/block/preallocate.c >>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t >>> offset, int64_t bytes, >>> want_merge_zero = want_merge_zero && (prealloc_start <= offset); >>> + /* >>> + * Assign file_end before making actual preallocation. This will >>> ensure >>> + * that next request performed while preallocation is in >>> progress will >>> + * be passed without preallocation. >>> + */ >>> + s->file_end = prealloc_end; >>> + >>> ret = bdrv_co_pwrite_zeroes( >>> bs->file, prealloc_start, prealloc_end - prealloc_start, >>> BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | >>> BDRV_REQ_NO_WAIT); >>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t >>> offset, int64_t bytes, >>> return false; >>> } >>> - s->file_end = prealloc_end; >>> return want_merge_zero; >>> } >> >> >> Hmm. But this way we set both s->file_end and s->zero_start prior to >> actual write_zero operation. This means that next write-zero operation >> may go fast-path (see preallocate_co_pwrite_zeroes()) and return >> success, even before actual finish of preallocation write_zeroes >> operation (which may also fail). Seems we need to update logic around >> s->zero_start too. >> > Yes. This is not a problem at all. We go fast path and this new > fast-pathed write request will stuck on overlapped request check. > This if fine on success path. > > But error path is a trickier question. > > iris ~/src/qemu $ cat 1.c > #include <stdio.h> > #include <unistd.h> > #include <string.h> > #include <fcntl.h> > > int main() > { > int fd = open("file", O_RDWR | O_CREAT); > char buf[4096]; > > memset(buf, 'a', sizeof(buf)); > pwrite(fd, buf, sizeof(buf), 4096); > > return 0; > } > iris ~/src/qemu $ > > This works just fine, thus error path would be also fine. > > Den
This would also be relatively easy to check by modifying our original test case in 298 so that half of aio_write requests write actual data, and other half cause write_zeroes operations. It doesn't seem to fail with this v2 patch. I'll modify the testcase accordingly in v3.