On 7/16/24 4:32 PM, Denis V. Lunev wrote:
> On 7/12/24 13:55, Vladimir Sementsov-Ogievskiy wrote:
>> On 12.07.24 12:46, Andrey Drobyshev wrote:
>>> From: "Denis V. Lunev" <d...@openvz.org>
>>>
>>> We have observed that some clusters in QCOW2 files are zeroed
>>> while the preallocation filter is in use.
>>>
>>> We were able to trace the following sequence when the prealloc-filter
>>> is used:
>>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>>      co=0x55e7cbed7680 handle_write()
>>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>>      co=0x7f9edb7fe500 do_fallocate()
>>>
>>> Here coroutine 0x55e7cbed7680 is blocked waiting for coroutine
>>> 0x7f9edb7fe500 to finish the fallocate() of the file area. OK. It is
>>> time to handle the next coroutine:
>>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>>      co=0x55e7cbee91b0 handle_write()
>>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>>      co=0x7f9edb7deb00 do_fallocate()
>>>
>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>> file_end yet, so coroutine 0x55e7cbee91b0 starts fallocate() for the
>>> same area. This means that, once the fallocate() inside 0x7f9edb7deb00
>>> is started, the original fallocate() could complete and the real write
>>> would then be executed. In that case a write() request is handled at
>>> the same time as a fallocate() over the same area.
>>>
>>> The patch moves s->file_lock assignment before fallocate and that is
>>
>> text needs to be updated
>>
>>> crucial. The idea is that all subsequent requests into the area
>>> being preallocated will be issued as plain writes without fallocate
>>> to this area, and they will not proceed thanks to the
>>> overlapping-requests mechanics. If the preallocation fails, we will
>>> just switch to the normal expand-by-write behavior, which is not a
>>> problem except for performance.
>>>
>>> Signed-off-by: Denis V. Lunev <d...@openvz.org>
>>> Tested-by: Andrey Drobyshev <andrey.drobys...@virtuozzo.com>
>>> ---
>>>   block/preallocate.c | 8 +++++++-
>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>> index d215bc5d6d..ecf0aa4baa 100644
>>> --- a/block/preallocate.c
>>> +++ b/block/preallocate.c
>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>  
>>>      want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>>  
>>> +    /*
>>> +     * Assign file_end before making actual preallocation. This will ensure
>>> +     * that next request performed while preallocation is in progress will
>>> +     * be passed without preallocation.
>>> +     */
>>> +    s->file_end = prealloc_end;
>>> +
>>>      ret = bdrv_co_pwrite_zeroes(
>>>              bs->file, prealloc_start, prealloc_end - prealloc_start,
>>>              BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>          return false;
>>>      }
>>>  
>>> -    s->file_end = prealloc_end;
>>>      return want_merge_zero;
>>>  }
>>
>>
>> Hmm. But this way we set both s->file_end and s->zero_start prior to
>> the actual write-zeroes operation. This means that the next write-zeroes
>> operation may go the fast path (see preallocate_co_pwrite_zeroes()) and
>> return success even before the preallocation write_zeroes operation
>> actually finishes (and that operation may also fail). It seems we need
>> to update the logic around s->zero_start too.
>>
> Yes. This is not a problem at all. We go the fast path, and this new
> fast-pathed write request will get stuck on the overlapping request
> check. This is fine on the success path.
> 
> But the error path is a trickier question.
> 
> iris ~/src/qemu $ cat 1.c
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <fcntl.h>
> 
> int main(void)
> {
>     /* O_CREAT requires a mode argument */
>     int fd = open("file", O_RDWR | O_CREAT, 0644);
>     char buf[4096];
> 
>     memset(buf, 'a', sizeof(buf));
>     /* write past the current end of the freshly created empty file */
>     pwrite(fd, buf, sizeof(buf), 4096);
> 
>     close(fd);
>     return 0;
> }
> iris ~/src/qemu $
> 
> This works just fine, thus the error path would also be fine.
> 
> Den

This would also be relatively easy to check by modifying our original
test case in 298 so that half of the aio_write requests write actual
data and the other half trigger write_zeroes operations (roughly as in
the sketch below).  The test doesn't seem to fail with this v2 patch.

I'll modify the test case accordingly in v3.
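Just to illustrate the idea, here is a rough, untested sketch of such a
test method (not the actual 298 code): it assumes the test class already
has a running VM (self.vm) with the preallocate filter attached to a
drive named 'drive0'; the request count, offsets and pattern are
arbitrary.

    def test_racing_data_and_zero_writes(self):
        # Flood the preallocate filter with parallel asynchronous
        # requests, alternating real data writes and write-zeroes,
        # so that they race with the in-flight preallocation.
        for i in range(64):
            off = i * 64 * 1024
            if i % 2 == 0:
                self.vm.hmp_qemu_io('drive0', f'aio_write -P 0xaa {off} 64k')
            else:
                self.vm.hmp_qemu_io('drive0', f'aio_write -z {off} 64k')
        # Wait for all outstanding requests to complete.
        self.vm.hmp_qemu_io('drive0', 'aio_flush')

The real test would of course reuse the existing 298 setup and verify
the written data afterwards.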
