Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Vladimir Sementsov-Ogievskiy Wed, 20 Nov 2019 05:37:53 -0800

20.11.2019 15:04, Vladimir Sementsov-Ogievskiy wrote:
> 20.11.2019 14:44, Kevin Wolf wrote:
>> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I wanted to understand, what is the real difference between 
>>>> bdrv_block_status_above
>>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work 
>>>> through
>>>> bdrv_block_status_above..
>>>>
>>>> And I found the problem: bdrv_is_allocated_above considers space after EOF 
>>>> as
>>>> UNALLOCATED for intermediate nodes..
>>>>
>>>> UNALLOCATED is not about allocation at fs level, but about should we go to 
>>>> backing or
>>>> not.. And it seems incorrect for me, as in case of short backing file, 
>>>> we'll read
>>>> zeroes after EOF, instead of going further by backing chain.
>>>>
>>>> This leads to the following effect:
>>>>
>>>> ./qemu-img create -f qcow2 base.qcow2 2M
>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2
>>>>
>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M
>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M
>>>>
>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes:
>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec)
>>>>
>>>> But after commit guest visible state is changed, which seems wrong for me:
>>>> ./qemu-img commit top.qcow2 -b mid.qcow2
>>>>
>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2
>>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec)
>>>>
>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2
>>>> read 1048576/1048576 bytes at offset 1048576
>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec)
>>>>
>>>>
>>>> I don't know, is it a real bug, as I don't know, do we support backing 
>>>> file larger than
>>>> its parent. Still, I'm not sure that this behavior of 
>>>> bdrv_is_allocated_above don't lead
>>>> to other problems.
>>>>
>>>> =====
>>>>
>>>> Hmm, bdrv_block_allocated_above behaves strange too:
>>>>
>>>> with want_zero=true, it may report unallocated zeroes because of short 
>>>> backing files, which
>>>> are actually "allocated" in POV of backing chains. But I see this may 
>>>> influence only
>>>> qemu-img compare, and I don't see can it trigger some bug..
>>>>
>>>> with want_zero=false, it may do no progress because of short backing file. 
>>>> Moreover it may
>>>> report EOF in the middle!! But want_zero=false used only in 
>>>> bdrv_is_allocated, which considers
>>>> onlyt top layer, so it seems OK.
>>>>
>>>> =====
>>>>
>>>> So, I propose these series, still I'm not sure is there a real bug.
>>>>
>>>> Vladimir Sementsov-Ogievskiy (4):
>>>>     block/io: fix bdrv_co_block_status_above
>>>>     block/io: bdrv_common_block_status_above: support include_base
>>>>     block/io: bdrv_common_block_status_above: support bs == base
>>>>     block/io: fix bdrv_is_allocated_above
>>>>
>>>>    block/io.c                 | 104 ++++++++++++++++++-------------------
>>>>    tests/qemu-iotests/154.out |   4 +-
>>>>    2 files changed, 53 insertions(+), 55 deletions(-)
>>>>
>>>
>>>
>>> Interesting that the problem illustrated here is not fixed by the series, 
>>> it's actually
>>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, 
>>> which leads
>>> to unallocated qcow2 clusters, which I think should be fixed too.
>>
>> Yes, this is what I posted yesterday. (With a suggested quick fix, but
>> it turns out it was not quite correct, see below.)
>>
>>> To illustrate the problem fixed by the series, we should commit to base:
>>>
>>> # ./qemu-img commit top.qcow2 -b base.qcow2
>>> Image committed.
>>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2
>>> Pattern verification failed at offset 1048576, 1048576 bytes
>>> read 1048576/1048576 bytes at offset 1048576
>>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec)
>>
>> Ok, I'll try that later.
>>
>>> Hmm, but how to fix the problem about truncate? I think truncate must
>>> not make underlying backing available for read.. Discard operation
>>> doesn't do it.
>>>
>>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark
>>> new clusters as ZERO?
>>
>> Yes, we need to write zeroes to the new area if the backing file covers
>> it. We need to do this not only in mirror/commit/bdrv_commit(), but in
>> fact for all truncate operations: Berto mentioned on IRC yesterday that
>> you can get into the same situation with 'block_resize' monitor
>> commands.
>>
>> So I tried to fix this yesterday, and I thought that I had a fix, when I
>> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter.
>> So I'll still need to fix this. Other than that, I suppose the following
>> fix should work (but is probably a bit too invasive for -rc3).
>>
>> Kevin
>>
>> diff --git a/block/io.c b/block/io.c
>> index f75777f5ea..4118bf0118 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, 
>> int64_t offset, bool exact,
>>           goto out;
>>       }
>>
>> +    /*
>> +     * If the image has a backing file that is large enough that it would
>> +     * provide data for the new area, we cannot leave it unallocated because
>> +     * then the backing file content would become visible. Instead, 
>> zero-fill
>> +     * the area where backing file and new area overlap.
>> +     */
>> +    if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) {
>> +        int64_t backing_len;
>> +
>> +        backing_len = bdrv_getlength(backing_bs(bs));
>> +        if (backing_len < 0) {
>> +            ret = backing_len;
>> +            goto out;
>> +        }
>> +
>> +        if (backing_len > old_size) {
>> +            /* FIXME bytes parameter is 32 bits */
>> +            ret = bdrv_co_do_zero_pwritev(child, old_size,
>> +                                          MIN(new_bytes, backing_len - 
>> old_size),
>> +                                          BDRV_REQ_ZERO_WRITE | 
>> BDRV_REQ_MAY_UNMAP, &req);
>> +            if (ret < 0) {
>> +                goto out;
>> +            }
>> +        }
>> +    }
>> +
>>       ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
>>       if (ret < 0) {
>>           error_setg_errno(errp, -ret, "Could not refresh total sector 
>> count");
>>
> 
> I'm not sure that it is safe enough: we may not have opened backing at the 
> moment, but
> still it may exist and managed by user.
> 
> PREALLOC_MODE_OFF is documented as
> # @off: no preallocation
> 
> - not very descriptive, but I think it's nothing about making backing file 
> available
> through new clusters.
> 
> I think PREALLOC_MODE_OFF should always make new clusters 
> "BDRV_BLOCK_ALLOCATED". If
> for some scenarios (are they exist at all?) we need to preallocate cluster in 
> manner
> that backing file would be visible through them, we'd better add another 
> preallocation
> mode which will directly document this behaviour, like PREALLOC_MODE_BACKING.
> 
> So, I'd consider PREALLOC_MODE_OFF as something that must not create 
> UNALLOCATED (in POV
> of backing chains) clusters, and should be fixed in all formats.. Or as a 
> quick fix may
> we may write zeros from bdrv_co_truncate, but independently of backing file 
> existence
> and length.
> 
> ===


Or visa-versa, if we can't change current qemu-img-create default, which means 
preallocation="off"
  === UNALLOCATED transaprent image, we may improve preallocation:"off" 
specification, mentioning
backing files, and add preallocation:"zero" mode, to be used in truncate.

> 
> Also I think it's a wrong thing at all that qcow2 new file is transparent by 
> default..
> It should be transparent only when we create snapshots and incremental 
> backups. But when
> we create new disk for new vm it should be zeroed (and extending L1 table 
> entry spec by
> "zero bit" may help)
> 

-- 
Best regards,
Vladimir

Re: [PATCH 0/4] fix & merge block_status_above and is_allocated_above

Reply via email to