19.11.2019 15:32, Vladimir Sementsov-Ogievskiy wrote: > 19.11.2019 15:17, Vladimir Sementsov-Ogievskiy wrote: >> 19.11.2019 15:05, Kevin Wolf wrote: >>> Am 16.11.2019 um 17:34 hat Vladimir Sementsov-Ogievskiy geschrieben: >>>> Hi all! >>>> >>>> I wanted to understand, what is the real difference between >>>> bdrv_block_status_above and bdrv_is_allocated_above, IMHO >>>> bdrv_is_allocated_above should work through bdrv_block_status_above.. >>>> >>>> And I found the problem: bdrv_is_allocated_above considers space after >>>> EOF as UNALLOCATED for intermediate nodes.. >>>> >>>> UNALLOCATED is not about allocation at fs level, but about should we >>>> go to backing or not.. And it seems incorrect for me, as in case of >>>> short backing file, we'll read zeroes after EOF, instead of going >>>> further by backing chain. >>> >>> We actually have documentation what it means: >>> >>> * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this >>> * layer rather than any backing, set by block layer >>> >>> Say we have a short overlay, like this: >>> >>> base.qcow2: AAAAAAAA >>> overlay.qcow2: BBBB >>> >>> Then of course the content of block 5 (the one after EOF of >>> overlay.qcow2) is still determined by overlay.qcow2, which can be easily >>> verified by reading it from overlay.qcow2 (produces zeros) and from >>> base.qcow2 (produces As). >>> >>> So the correct result when querying the block status of block 5 on >>> overlay.qcow2 is BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO. >>> >>> Interestingly, we already fixed the opposite case (large overlay over >>> short backing file) in commit e88ae2264d9 from May 2014 according to the >>> same logic. >>> >>>> This leads to the following effect: >>>> >>>> ./qemu-img create -f qcow2 base.qcow2 2M >>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2 >>>> >>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M >>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M >>>> >>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes: >>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2 >>>> read 1048576/1048576 bytes at offset 1048576 >>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec) >>>> >>>> But after commit guest visible state is changed, which seems wrong for me: >>>> ./qemu-img commit top.qcow2 -b mid.qcow2 >>>> >>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2 >>>> Pattern verification failed at offset 1048576, 1048576 bytes >>>> read 1048576/1048576 bytes at offset 1048576 >>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec) >>>> >>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2 >>>> read 1048576/1048576 bytes at offset 1048576 >>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec) >>>> >>>> >>>> I don't know, is it a real bug, as I don't know, do we support backing >>>> file larger than its parent. Still, I'm not sure that this behavior of >>>> bdrv_is_allocated_above don't lead to other problems. >>> >>> I agree it's a bug. >>> >>> Your fix doesn't look right to me, though. You leave the buggy behaviour >>> of bdrv_co_block_status() as it is and then add four patches to work >>> around it in some (but not all) callers of it. >>> >>> All that it should take to fix this is making the bs->backing check >>> independent from want_zero and let it set ALLOCATED. What I expected >>> would be something like the below patch. >>> >>> But it doesn't seem to fully fix the problem (though 'alloc 1M 1M' in >>> qemu-io shows that the range is now considered allocated), so probably >>> there is still a separate bug in bdrv_is_allocated_above(). >>> >>> And I think we'll want an iotests case for both cases (short overlay, >>> short backing file). >>> >>> Kevin >>> >>> >>> diff --git a/block/io.c b/block/io.c >>> index f75777f5ea..5eafcff01a 100644 >>> --- a/block/io.c >>> +++ b/block/io.c >>> @@ -2359,16 +2359,17 @@ static int coroutine_fn >>> bdrv_co_block_status(BlockDriverState *bs, >>> if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) { >>> ret |= BDRV_BLOCK_ALLOCATED; >>> - } else if (want_zero) { >>> - if (bdrv_unallocated_blocks_are_zero(bs)) { >>> - ret |= BDRV_BLOCK_ZERO; >>> - } else if (bs->backing) { >>> - BlockDriverState *bs2 = bs->backing->bs; >>> - int64_t size2 = bdrv_getlength(bs2); >>> - >>> - if (size2 >= 0 && offset >= size2) { >>> + } else if (want_zero && bdrv_unallocated_blocks_are_zero(bs)) { >>> + ret |= BDRV_BLOCK_ZERO; >>> + } else if (bs->backing) { >>> + BlockDriverState *bs2 = bs->backing->bs; >>> + int64_t size2 = bdrv_getlength(bs2); >>> + >>> + if (size2 >= 0 && offset >= size2) { >>> + if (want_zero) { >>> ret |= BDRV_BLOCK_ZERO; >>> } >>> + ret |= BDRV_BLOCK_ALLOCATED; >> >> No, I think it's wrong, as it's not allocated at bs, but at bs->backing->bs. >> >> >> So, bdrv_co_block_status works correct, what we can change about it, is not >> to return pnum=0 if requested range after eof, but return pnum=bytes, >> together >> with BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO | BDRV_BLOCK_EOF. > > But this will break users which rely on pnum=0.
Possibly, we may add flag to bdrv_co_block_status, to behave in such manner, and set it to true when call from bdrv_block_status_above/bdrv_is_allocated_above. > >> >> And actual problem is in bdrv_block_status_above and >> bdrv_is_allocated_above, which >> I'm fixing. >> >> >> > > -- Best regards, Vladimir