20.11.2019 15:04, Vladimir Sementsov-Ogievskiy wrote: > 20.11.2019 14:44, Kevin Wolf wrote: >> Am 20.11.2019 um 11:20 hat Vladimir Sementsov-Ogievskiy geschrieben: >>> 16.11.2019 19:34, Vladimir Sementsov-Ogievskiy wrote: >>>> Hi all! >>>> >>>> I wanted to understand, what is the real difference between >>>> bdrv_block_status_above >>>> and bdrv_is_allocated_above, IMHO bdrv_is_allocated_above should work >>>> through >>>> bdrv_block_status_above.. >>>> >>>> And I found the problem: bdrv_is_allocated_above considers space after EOF >>>> as >>>> UNALLOCATED for intermediate nodes.. >>>> >>>> UNALLOCATED is not about allocation at fs level, but about should we go to >>>> backing or >>>> not.. And it seems incorrect for me, as in case of short backing file, >>>> we'll read >>>> zeroes after EOF, instead of going further by backing chain. >>>> >>>> This leads to the following effect: >>>> >>>> ./qemu-img create -f qcow2 base.qcow2 2M >>>> ./qemu-io -c "write -P 0x1 0 2M" base.qcow2 >>>> >>>> ./qemu-img create -f qcow2 -b base.qcow2 mid.qcow2 1M >>>> ./qemu-img create -f qcow2 -b mid.qcow2 top.qcow2 2M >>>> >>>> Region 1M..2M is shadowed by short middle image, so guest sees zeroes: >>>> ./qemu-io -c "read -P 0 1M 1M" top.qcow2 >>>> read 1048576/1048576 bytes at offset 1048576 >>>> 1 MiB, 1 ops; 00.00 sec (22.795 GiB/sec and 23341.5807 ops/sec) >>>> >>>> But after commit guest visible state is changed, which seems wrong for me: >>>> ./qemu-img commit top.qcow2 -b mid.qcow2 >>>> >>>> ./qemu-io -c "read -P 0 1M 1M" mid.qcow2 >>>> Pattern verification failed at offset 1048576, 1048576 bytes >>>> read 1048576/1048576 bytes at offset 1048576 >>>> 1 MiB, 1 ops; 00.00 sec (4.981 GiB/sec and 5100.4794 ops/sec) >>>> >>>> ./qemu-io -c "read -P 1 1M 1M" mid.qcow2 >>>> read 1048576/1048576 bytes at offset 1048576 >>>> 1 MiB, 1 ops; 00.00 sec (3.365 GiB/sec and 3446.1606 ops/sec) >>>> >>>> >>>> I don't know, is it a real bug, as I don't know, do we support backing >>>> file larger than >>>> its parent. Still, I'm not sure that this behavior of >>>> bdrv_is_allocated_above don't lead >>>> to other problems. >>>> >>>> ===== >>>> >>>> Hmm, bdrv_block_allocated_above behaves strange too: >>>> >>>> with want_zero=true, it may report unallocated zeroes because of short >>>> backing files, which >>>> are actually "allocated" in POV of backing chains. But I see this may >>>> influence only >>>> qemu-img compare, and I don't see can it trigger some bug.. >>>> >>>> with want_zero=false, it may do no progress because of short backing file. >>>> Moreover it may >>>> report EOF in the middle!! But want_zero=false used only in >>>> bdrv_is_allocated, which considers >>>> onlyt top layer, so it seems OK. >>>> >>>> ===== >>>> >>>> So, I propose these series, still I'm not sure is there a real bug. >>>> >>>> Vladimir Sementsov-Ogievskiy (4): >>>> block/io: fix bdrv_co_block_status_above >>>> block/io: bdrv_common_block_status_above: support include_base >>>> block/io: bdrv_common_block_status_above: support bs == base >>>> block/io: fix bdrv_is_allocated_above >>>> >>>> block/io.c | 104 ++++++++++++++++++------------------- >>>> tests/qemu-iotests/154.out | 4 +- >>>> 2 files changed, 53 insertions(+), 55 deletions(-) >>>> >>> >>> >>> Interesting that the problem illustrated here is not fixed by the series, >>> it's actually >>> relates to the fact that mirror does truncation with PREALLOC_MODE_OFF, >>> which leads >>> to unallocated qcow2 clusters, which I think should be fixed too. >> >> Yes, this is what I posted yesterday. (With a suggested quick fix, but >> it turns out it was not quite correct, see below.) >> >>> To illustrate the problem fixed by the series, we should commit to base: >>> >>> # ./qemu-img commit top.qcow2 -b base.qcow2 >>> Image committed. >>> # ./qemu-io -c "read -P 0 1M 1M" base.qcow2 >>> Pattern verification failed at offset 1048576, 1048576 bytes >>> read 1048576/1048576 bytes at offset 1048576 >>> 1 MiB, 1 ops; 00.00 sec (5.366 GiB/sec and 5494.4149 ops/sec) >> >> Ok, I'll try that later. >> >>> Hmm, but how to fix the problem about truncate? I think truncate must >>> not make underlying backing available for read.. Discard operation >>> doesn't do it. >>> >>> So, actually on PREALLOC_MODE_OFF we must allocated L2 tables and mark >>> new clusters as ZERO? >> >> Yes, we need to write zeroes to the new area if the backing file covers >> it. We need to do this not only in mirror/commit/bdrv_commit(), but in >> fact for all truncate operations: Berto mentioned on IRC yesterday that >> you can get into the same situation with 'block_resize' monitor >> commands. >> >> So I tried to fix this yesterday, and I thought that I had a fix, when I >> noticed that bdrv_co_do_zero_pwritev() takes a 32 bit bytes parameter. >> So I'll still need to fix this. Other than that, I suppose the following >> fix should work (but is probably a bit too invasive for -rc3). >> >> Kevin >> >> diff --git a/block/io.c b/block/io.c >> index f75777f5ea..4118bf0118 100644 >> --- a/block/io.c >> +++ b/block/io.c >> @@ -3382,6 +3382,32 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, >> int64_t offset, bool exact, >> goto out; >> } >> >> + /* >> + * If the image has a backing file that is large enough that it would >> + * provide data for the new area, we cannot leave it unallocated because >> + * then the backing file content would become visible. Instead, >> zero-fill >> + * the area where backing file and new area overlap. >> + */ >> + if (new_bytes && bs->backing && prealloc == PREALLOC_MODE_OFF) { >> + int64_t backing_len; >> + >> + backing_len = bdrv_getlength(backing_bs(bs)); >> + if (backing_len < 0) { >> + ret = backing_len; >> + goto out; >> + } >> + >> + if (backing_len > old_size) { >> + /* FIXME bytes parameter is 32 bits */ >> + ret = bdrv_co_do_zero_pwritev(child, old_size, >> + MIN(new_bytes, backing_len - >> old_size), >> + BDRV_REQ_ZERO_WRITE | >> BDRV_REQ_MAY_UNMAP, &req); >> + if (ret < 0) { >> + goto out; >> + } >> + } >> + } >> + >> ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); >> if (ret < 0) { >> error_setg_errno(errp, -ret, "Could not refresh total sector >> count"); >> > > I'm not sure that it is safe enough: we may not have opened backing at the > moment, but > still it may exist and managed by user. > > PREALLOC_MODE_OFF is documented as > # @off: no preallocation > > - not very descriptive, but I think it's nothing about making backing file > available > through new clusters. > > I think PREALLOC_MODE_OFF should always make new clusters > "BDRV_BLOCK_ALLOCATED". If > for some scenarios (are they exist at all?) we need to preallocate cluster in > manner > that backing file would be visible through them, we'd better add another > preallocation > mode which will directly document this behaviour, like PREALLOC_MODE_BACKING. > > So, I'd consider PREALLOC_MODE_OFF as something that must not create > UNALLOCATED (in POV > of backing chains) clusters, and should be fixed in all formats.. Or as a > quick fix may > we may write zeros from bdrv_co_truncate, but independently of backing file > existence > and length. > > ===
Or visa-versa, if we can't change current qemu-img-create default, which means preallocation="off" === UNALLOCATED transaprent image, we may improve preallocation:"off" specification, mentioning backing files, and add preallocation:"zero" mode, to be used in truncate. > > Also I think it's a wrong thing at all that qcow2 new file is transparent by > default.. > It should be transparent only when we create snapshots and incremental > backups. But when > we create new disk for new vm it should be zeroed (and extending L1 table > entry spec by > "zero bit" may help) > -- Best regards, Vladimir