On 12.05.2016 at 11:00, Denis V. Lunev wrote:
> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
> >On 11.05.2016 at 09:00, Denis V. Lunev wrote:
> >>There is a possibility that qcow2_co_write_zeroes() will be called
> >>with the partial block. This could be synthetically triggered with
> >>    qemu-io -c "write -z 32k 4k"
> >>and can happen in the real life in qemu-nbd. The latter happens under
> >>the following conditions:
> >>    (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
> >>        kernel NBD client
> >>    (2) third party program opens kernel NBD device with O_DIRECT
> >>    (3) third party program performs write operation with memory buffer
> >>        not aligned to the page
> >>In this case qcow2_co_write_zeroes() is unable to perform the operation
> >>and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
> >>switches to non-optimized version and writes real zeroes to the disk.
> >>
> >>The patch creates a shortcut. If the block is read as zeroes, f.e. if
> >>it is unallocated, the request is extended to cover full block.
> >>User-visible situation with this block is not changed. Before the patch
> >>the block is filled in the image with real zeroes. After that patch the
> >>block is marked as zeroed in metadata. Thus any subsequent changes in
> >>backing store chain are not affected.
> >>
> >>Kevin, thank you for a cool suggestion.
> >>
> >>Signed-off-by: Denis V. Lunev <d...@openvz.org>
> >>Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
> >>CC: Kevin Wolf <kw...@redhat.com>
> >>CC: Max Reitz <mre...@redhat.com>
> >>---
> >>Changes from v2:
> >>- checked head/tail clusters separately (one can be zeroed, one unallocated)
> >>- fixed range calculations
> >>- fixed race when the block can become used just after the check
> >>- fixed zero cluster detection
> >>- minor tweaks in the description
> >>
> >>Changes from v1:
> >>- description rewritten completely
> >>- new approach suggested by Kevin is implemented
> >>
> >> block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> >> 1 file changed, 59 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/block/qcow2.c b/block/qcow2.c
> >>index 470734b..c2474c1 100644
> >>--- a/block/qcow2.c
> >>+++ b/block/qcow2.c
> >>@@ -2411,21 +2411,74 @@ finish:
> >>     return ret;
> >> }
> >>+
> >>+static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
> >>+{
> >>+    BDRVQcow2State *s = bs->opaque;
> >>+    int nr;
> >>+    BlockDriverState *file;
> >>+    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
> >>+                                              s->cluster_sectors, &nr, &file);
> >>+    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
> >Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
> >all unallocated clusters return true, even if the backing file contains
> >non-zero data for them.
>
> this is correct. From my POW this means that this area is unallocated
> in the entire backing chain and thus it will be read as zeroes. Thus
> we could cover it with zeroes.
You're right that I made a mistake, I was thinking of the non-recursive
bdrv_get_block_status(). However, I still think that we may not assume
that !BDRV_BLOCK_DATA means zero data, even though that affects only
more obscure cases. We have bdrv_unallocated_blocks_are_zero() to check
whether the assumption is true.

However, bdrv_co_get_block_status() already checks this internally and
sets BDRV_BLOCK_ZERO in this case, so just checking BDRV_BLOCK_ZERO in
qcow2 should be good.

Did you find a case where you got !DATA, but not ZERO, and assuming
zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().

> Indeed,
>
> # create top image
> hades ~/src/qemu $ qemu-img create -f qcow2 test2.qcow2 -b test.qcow2 64G
> Formatting 'test2.qcow2', fmt=qcow2 size=68719476736
> backing_file='test.qcow2' encryption=off cluster_size=65536
> lazy_refcounts=off refcount_bits=16
>
> # create backing store
> hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
> Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off
> cluster_size=65536 lazy_refcounts=off refcount_bits=16
>
> # write to 2 first data clusters into backing store
> hades ~/src/qemu $ ./qemu-io -c "write -P 11 62k 4k" test.qcow2
> qcow2_co_writev off=f800 bytes nr=1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.1349 sec (29.642 KiB/sec and 7.4106 ops/sec)
>
> # write zeroes to 2 first data clusters into top image
> hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test2.qcow2
> qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
> is_zero_cluster off=0x0 bytes res=0x50015 ZERO=0 DATA=1
> qcow2_co_writev off=f800 bytes nr=1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.1371 sec (29.167 KiB/sec and 7.2919 ops/sec)
> # so far, so good. No zeroes, data is reported
>
> # writing to top image at offset 252 Kb (unallocated in both images)
> hades ~/src/qemu $ ./qemu-io -c "write -z 254k 4k" test2.qcow2
> qcow2_co_write_zeroes off=0x3f800 bytes nr=0x1000 bytes
> is_zero_cluster off=0x30000 bytes res=0x2 ZERO=1 DATA=0
> is_zero_cluster off=0x40000 bytes res=0x2 ZERO=1 DATA=0
> is_zero_cluster_top_locked off=0x30000 bytes ret=0x0
> is_zero_cluster_top_locked off=0x40000 bytes ret=0x0
> wrote 4096/4096 bytes at offset 260096
> 4 KiB, 1 ops; 0.0297 sec (134.391 KiB/sec and 33.5976 ops/sec)
> hades ~/src/qemu $
> # you're correct, zeroes reported, but data is not reported too.
> # maybe my condition means the same in both parts.
>
> OK. Can be cleaned up.

I think it means the same in all cases where
bdrv_unallocated_blocks_are_zero(bs) == true.

> >>+}
> >>+
> >>+static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
> >>+{
> >>+    BDRVQcow2State *s = bs->opaque;
> >>+    int nr = s->cluster_sectors;
> >>+    uint64_t off;
> >>+    int ret;
> >>+
> >>+    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
> >>+    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
> >>+}
> >This part is new, but it also returns true for clusters that are
> >allocated in the backing file.
>
> this is checked only after the first check, i.e. when is_zero_cluster
> has already returned success. I assume here that backing stores are
> read-only and cannot be changed. In the other case I'll have to call
> bdrv_get_block_status_above(backing_bs(bs), NULL, ...) with s->lock
> held, which I have not wanted to do.
> >> static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
> >>     int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
> >> {
> >>     int ret;
> >>     BDRVQcow2State *s = bs->opaque;
> >>-    /* Emulate misaligned zero writes */
> >>-    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
> >>-        return -ENOTSUP;
> >>+    int head = sector_num % s->cluster_sectors;
> >>+    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
> >>+
> >>+    if (head != 0 || tail != 0) {
> >>+        int64_t cl_end = -1;
> >>+
> >>+        sector_num -= head;
> >>+        nb_sectors += head;
> >>+
> >>+        if (tail != 0) {
> >>+            nb_sectors += s->cluster_sectors - tail;
> >>+        }
> >>+
> >>+        if (!is_zero_cluster(bs, sector_num)) {
> >>+            return -ENOTSUP;
> >>+        }
> >>+
> >>+        if (nb_sectors > s->cluster_sectors) {
> >>+            /* Technically the request can cover 2 clusters, f.e. 4k write
> >>+               at s->cluster_sectors - 2k offset. One of these cluster can
> >>+               be zeroed, one unallocated */
> >This cannot happen. bdrv_co_write_zeroes() splits requests not only
> >based on the length, but so that they are actually aligned at cluster
> >boundaries.
>
> This does happen.
>
> hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test.qcow2
> qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.0670 sec (59.662 KiB/sec and 14.9156 ops/sec)
> hades ~/src/qemu $ ./qemu-img info test.qcow2
> image: test.qcow2
> file format: qcow2
> virtual size: 64G (68719476736 bytes)
> disk size: 260K
> cluster_size: 65536
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>     refcount bits: 16
>     corrupt: false
> hades ~/src/qemu $
>
> Does this mean that I should check upper layer and fix?

Hm, I see:

    if (bs->bl.write_zeroes_alignment
        && num > bs->bl.write_zeroes_alignment) {

Removing the second part should fix this, i.e. it would split a request
into two unaligned halves even if there is no aligned "bulk" in the
middle. I think it would match my expectations better, but maybe that's
just me. What do you think?

> >>+            cl_end = sector_num + nb_sectors - s->cluster_sectors;
> >>+            if (!is_zero_cluster(bs, cl_end)) {
> >>+                return -ENOTSUP;
> >>+            }
> >>+        }
> >>+
> >>+        qemu_co_mutex_lock(&s->lock);
> >>+        /* We can have new write after previous check */
> >>+        if (!is_zero_cluster_top_locked(bs, sector_num) ||
> >>+            (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
> >>+            qemu_co_mutex_unlock(&s->lock);
> >>+            return -ENOTSUP;
> >>+        }
> >Just lock the mutex before the check, the possible optimisation for the
> >emulation case (which is slow anyway) isn't worth the additional code
> >complexity.
>
> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
> recursive, thus the code will hang. This is the problem being
> addressed with this split of checks.
>
> Maybe we could make the lock recursive...

Maybe your version is not far from the best we can do then. It deserves
a comment, though, because it's not completely obvious.

The other option that we have and that looks reasonable enough to me is
checking is_zero_cluster_top_locked() first and only if that returns
false, checking the block status of the backing chain, starting at
bs->backing->bs. This way we would bypass the recursive call and could
take the lock from the beginning. If we go that way, it deserves a
comment as well.

Kevin