19.11.2019 15:34, Vladimir Sementsov-Ogievskiy wrote: > 19.11.2019 15:32, Vladimir Sementsov-Ogievskiy wrote: >> 19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote: >>> 19.11.2019 15:05, Kevin Wolf wrote: >>>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben: >>>>> Hi all! >>>>> >>>>> I wanted to understand, what is the real difference between >>>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO >>>>> bdrv_is_allocated_above should work through bdrv_block_status_above.. >>>>> >>>>> And I found the problem: bdrv_is_allocated_above considers space after >>>>> EOF as UNALLOCATED for intermediate nodes.. >>>>> >>>>> UNALLOCATED is not about allocation at fs level, but about should we >>>>> go to backing or not.. And it seems incorrect for me, as in case of >>>>> short backing file, we'll read zeroes after EOF, instead of going >>>>> further by backing chain. >>>> >>>> We actually have documentation what it means: >>>> >>>> * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this >>>> * layer rather than any backing, set by block layer >>>> >>>> Say we have a short overlay, like this: >>>> >>>> base.qcow2: AAAAAAAA >>>> overlay.qcow2: BBBB >>>> >>>> Then of course the content of block 5 (the one after EOF of >>>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily >>>> verified by reading it from overlay.qcow2 (produces zeros) and from >>>> base.qcow2 (produces As). >>>> >>>> So the correct result when querying the block status of block 5 on >>>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO. >>>> >>>> Interestingly, we already fixed the opposite case (large overlay over >>>> short backing file) in commit e88ae2264d9 from May 2014 according to the >>>> same logic. >>>> >>>>> This leads to the following effect: >>>>> >>>>> ./qemu-img create -f qcow2 base.qcow2 2M >>>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2 >>>>> >>>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M >>>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M >>>>> >>>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes: >>>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2 >>>>> read 1048576/1048576 bytes at offset 1048576 >>>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec) >>>>> >>>>> But after commit guest visible state is changed, which seems wrong for me: >>>>> ./qemu-img commit top.qcow2 -b mid.qcow2 >>>>> >>>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2 >>>>> Pattern verification failed at offset 1048576, 1048576 bytes >>>>> read 1048576/1048576 bytes at offset 1048576 >>>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec) >>>>> >>>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2 >>>>> read 1048576/1048576 bytes at offset 1048576 >>>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec) >>>>> >>>>> >>>>> I don't know, is it a real bug, as I don't know, do we support backing >>>>> file larger than its parent. Still, I'm not sure that this behavior of >>>>> bdrv_is_allocated_above don't lead to other problems. >>>> >>>> I agree it's a bug. >>>> >>>> Your fix doesn't look right to me, though. You leave the buggy behaviour >>>> of bdrv_co_block_status() as it is and then add four patches to work >>>> around it in some (but not all) callers of it.
The only caller is bdrv_co_block_status_above. >>>> >>>> All that it should take to fix this is making the bs->backing check >>>> independent from want_zero and let it set ALLOCATED. What I expected >>>> would be something like the below patch. >>>> >>>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in >>>> qemu-io shows that the range is now considered allocated), so probably >>>> there is still a separate bug in bdrv_is_allocated_above(). >>>> >>>> And I think we'll want an iotests case for both cases (short overlay, >>>> short backing file). >>>> >>>> Kevin >>>> >>>> >>>> diff --git a/block/io.c b/block/io.c >>>> index f75777f5ea..5eafcff01a 100644 >>>> --- a/block/io.c >>>> +++ b/block/io.c >>>> @@ -2359,16 +2359,17 @@ static int coroutine_fn >>>> bdrv_co_block_status(BlockDriverState *bs, >>>> if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) { >>>> ret |= BDRV_BLOCK_ALLOCATED; >>>> - } else if (want_zero) { >>>> - if (bdrv_unallocated_blocks_are_zero(bs)) { >>>> - ret |= BDRV_BLOCK_ZERO; >>>> - } else if (bs->backing) { >>>> - BlockDriverState *bs2 = bs->backing->bs; >>>> - int64_t size2 = bdrv_getlength(bs2); >>>> - >>>> - if (size2 >= 0 && offset >= size2) { >>>> + } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) { >>>> + ret |= BDRV_BLOCK_ZERO; >>>> + } else if (bs->backing) { >>>> + BlockDriverState *bs2 = bs->backing->bs; >>>> + int64_t size2 = bdrv_getlength(bs2); >>>> + >>>> + if (size2 >= 0 && offset >= size2) { >>>> + if (want_zero) { >>>> ret |= BDRV_BLOCK_ZERO; >>>> } >>>> + ret |= BDRV_BLOCK_ALLOCATED; >>> >>> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs. >>> >>> >>> So, bdrv_co_block_status works correct, what we can change about it, is not >>> to return pnum=0 if requested range after eof, but return pnum=bytes, >>> together >>> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF. >> >> But this will break users which rely on pnum=0. > > Possibly, we may add flag to bdrv_co_block_status, to behave in such manner, > and > set it to true when call from bdrv_block_status_above/bdrv_is_allocated_above. > >> >>> >>> And actual problem is in bdrv_block_status_above and >>> bdrv_is_allocated_above, which >>> I'm fixing. >>> >>> >>> >> >> > > -- Best regards, Vladimir