On 12.05.2016 at 11:00, Denis V. Lunev wrote:
> On 05/11/2016 02:28 PM, Kevin Wolf wrote:
> >On 11.05.2016 at 09:00, Denis V. Lunev wrote:
> >>There is a possibility that qcow2_co_write_zeroes() will be called
> >>with the partial block. This could be synthetically triggered with
> >>    qemu-io -c "write -z 32k 4k"
> >>and can happen in the real life in qemu-nbd. The latter happens under
> >>the following conditions:
> >>    (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
> >>        kernel NBD client
> >>    (2) third party program opens kernel NBD device with O_DIRECT
> >>    (3) third party program performs write operation with memory buffer
> >>        not aligned to the page
> >>In this case qcow2_co_write_zeroes() is unable to perform the operation
> >>and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
> >>switches to non-optimized version and writes real zeroes to the disk.
> >>
> >>The patch creates a shortcut. If the block is read as zeroes, f.e. if
> >>it is unallocated, the request is extended to cover full block.
> >>User-visible situation with this block is not changed. Before the patch
> >>the block is filled in the image with real zeroes. After that patch the
> >>block is marked as zeroed in metadata. Thus any subsequent changes in
> >>backing store chain are not affected.
> >>
> >>Kevin, thank you for a cool suggestion.
> >>
> >>Signed-off-by: Denis V. Lunev <d...@openvz.org>
> >>Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
> >>CC: Kevin Wolf <kw...@redhat.com>
> >>CC: Max Reitz <mre...@redhat.com>
> >>---
> >>Changes from v2:
> >>- checked head/tail clusters separately (one can be zeroed, one unallocated)
> >>- fixed range calculations
> >>- fixed race when the block can become used just after the check
> >>- fixed zero cluster detection
> >>- minor tweaks in the description
> >>
> >>Changes from v1:
> >>- description rewritten completely
> >>- new approach suggested by Kevin is implemented
> >>
> >> block/qcow2.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> >> 1 file changed, 59 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/block/qcow2.c b/block/qcow2.c
> >>index 470734b..c2474c1 100644
> >>--- a/block/qcow2.c
> >>+++ b/block/qcow2.c
> >>@@ -2411,21 +2411,74 @@ finish:
> >>     return ret;
> >> }
> >>+
> >>+static bool is_zero_cluster(BlockDriverState *bs, int64_t start)
> >>+{
> >>+    BDRVQcow2State *s = bs->opaque;
> >>+    int nr;
> >>+    BlockDriverState *file;
> >>+    int64_t res = bdrv_get_block_status_above(bs, NULL, start,
> >>+                                              s->cluster_sectors, &nr, &file);
> >>+    return res >= 0 && ((res & BDRV_BLOCK_ZERO) || !(res & BDRV_BLOCK_DATA));
> >Why did you add the !(res & BDRV_BLOCK_DATA) condition? This means that
> >all unallocated clusters return true, even if the backing file contains
> >non-zero data for them.
>
> this is correct. From my POW this means that this area is unallocated
> in the entire backing chain and thus it will be read as zeroes. Thus
> we could cover it with zeroes.
You're right that I made a mistake, I was thinking of the non-recursive
bdrv_get_block_status(). However, I still think that we may not assume
that !BDRV_BLOCK_DATA means zero data, even though that affects only
more obscure cases. We have bdrv_unallocated_blocks_are_zero() to check
whether the assumption is true.

However, bdrv_co_get_block_status() already checks this internally and
sets BDRV_BLOCK_ZERO in this case, so just checking BDRV_BLOCK_ZERO in
qcow2 should be good.

Did you find a case where you got !DATA, but not ZERO, and assuming
zeroes was valid? If so, we may need to fix bdrv_co_get_block_status().

> Indeed,
>
> # create top image
> hades ~/src/qemu $ qemu-img create -f qcow2 test2.qcow2 -b test.qcow2 64G
> Formatting 'test2.qcow2', fmt=qcow2 size=68719476736
> backing_file='test.qcow2' encryption=off cluster_size=65536
> lazy_refcounts=off refcount_bits=16
>
> # create backing store
> hades ~/src/qemu $ qemu-img create -f qcow2 test.qcow2 64G
> Formatting 'test.qcow2', fmt=qcow2 size=68719476736 encryption=off
> cluster_size=65536 lazy_refcounts=off refcount_bits=16
>
> # write to 2 first data clusters into backing store
> hades ~/src/qemu $ ./qemu-io -c "write -P 11 62k 4k" test.qcow2
> qcow2_co_writev off=f800 bytes nr=1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.1349 sec (29.642 KiB/sec and 7.4106 ops/sec)
>
> # write zeroes to 2 first data clusters into top image
> hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test2.qcow2
> qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
> is_zero_cluster off=0x0 bytes res=0x50015 ZERO=0 DATA=1
> qcow2_co_writev off=f800 bytes nr=1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.1371 sec (29.167 KiB/sec and 7.2919 ops/sec)
> # so far, so good. No zeroes, data is reported
>
> # writing to top image at offset 252 Kb (unallocated in both images)
> hades ~/src/qemu $ ./qemu-io -c "write -z 254k 4k" test2.qcow2
> qcow2_co_write_zeroes off=0x3f800 bytes nr=0x1000 bytes
> is_zero_cluster off=0x30000 bytes res=0x2 ZERO=1 DATA=0
> is_zero_cluster off=0x40000 bytes res=0x2 ZERO=1 DATA=0
> is_zero_cluster_top_locked off=0x30000 bytes ret=0x0
> is_zero_cluster_top_locked off=0x40000 bytes ret=0x0
> wrote 4096/4096 bytes at offset 260096
> 4 KiB, 1 ops; 0.0297 sec (134.391 KiB/sec and 33.5976 ops/sec)
> hades ~/src/qemu $
> # you're correct, zeroes reported, but data is not reported too.
> # maybe my condition means the same in both parts.
>
> OK. Can be cleaned up.

I think it means the same in all cases where
bdrv_unallocated_blocks_are_zero(bs) == true.

> >>+}
> >>+
> >>+static bool is_zero_cluster_top_locked(BlockDriverState *bs, int64_t start)
> >>+{
> >>+    BDRVQcow2State *s = bs->opaque;
> >>+    int nr = s->cluster_sectors;
> >>+    uint64_t off;
> >>+    int ret;
> >>+
> >>+    ret = qcow2_get_cluster_offset(bs, start << BDRV_SECTOR_BITS, &nr, &off);
> >>+    return ret == QCOW2_CLUSTER_UNALLOCATED || ret == QCOW2_CLUSTER_ZERO;
> >>+}
> >This part is new, but it also returns true for clusters that are
> >allocated in the backing file.
>
> this is checked only after the first check, i.e. when is_zero_cluster
> has already returned success. I assume here that backing stores are
> read-only and cannot be changed. In the other case I'll have to call
> bdrv_get_block_status_above(backing_bs(bs), NULL, ...) with s->lock
> held, which I have not wanted to do.
> >> static coroutine_fn int qcow2_co_write_zeroes(BlockDriverState *bs,
> >>     int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
> >> {
> >>     int ret;
> >>     BDRVQcow2State *s = bs->opaque;
> >>-    /* Emulate misaligned zero writes */
> >>-    if (sector_num % s->cluster_sectors || nb_sectors % s->cluster_sectors) {
> >>-        return -ENOTSUP;
> >>+    int head = sector_num % s->cluster_sectors;
> >>+    int tail = (sector_num + nb_sectors) % s->cluster_sectors;
> >>+
> >>+    if (head != 0 || tail != 0) {
> >>+        int64_t cl_end = -1;
> >>+
> >>+        sector_num -= head;
> >>+        nb_sectors += head;
> >>+
> >>+        if (tail != 0) {
> >>+            nb_sectors += s->cluster_sectors - tail;
> >>+        }
> >>+
> >>+        if (!is_zero_cluster(bs, sector_num)) {
> >>+            return -ENOTSUP;
> >>+        }
> >>+
> >>+        if (nb_sectors > s->cluster_sectors) {
> >>+            /* Technically the request can cover 2 clusters, f.e. 4k write
> >>+               at s->cluster_sectors - 2k offset. One of these cluster can
> >>+               be zeroed, one unallocated */
> >This cannot happen. bdrv_co_write_zeroes() splits requests not only
> >based on the length, but so that they are actually aligned at cluster
> >boundaries.
>
> This does happen.
>
> hades ~/src/qemu $ ./qemu-io -c "write -z 62k 4k" test.qcow2
> qcow2_co_write_zeroes off=0xf800 bytes nr=0x1000 bytes
> wrote 4096/4096 bytes at offset 63488
> 4 KiB, 1 ops; 0.0670 sec (59.662 KiB/sec and 14.9156 ops/sec)
> hades ~/src/qemu $ ./qemu-img info test.qcow2
> image: test.qcow2
> file format: qcow2
> virtual size: 64G (68719476736 bytes)
> disk size: 260K
> cluster_size: 65536
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>     refcount bits: 16
>     corrupt: false
> hades ~/src/qemu $
>
> Does this mean that I should check upper layer and fix?

Hm, I see:

    if (bs->bl.write_zeroes_alignment
        && num > bs->bl.write_zeroes_alignment) {

Removing the second part should fix this, i.e. it would split a request
into two unaligned halves even if there is no aligned "bulk" in the
middle. I think it would match my expectations better, but maybe that's
just me. What do you think?

> >>+            cl_end = sector_num + nb_sectors - s->cluster_sectors;
> >>+            if (!is_zero_cluster(bs, cl_end)) {
> >>+                return -ENOTSUP;
> >>+            }
> >>+        }
> >>+
> >>+        qemu_co_mutex_lock(&s->lock);
> >>+        /* We can have new write after previous check */
> >>+        if (!is_zero_cluster_top_locked(bs, sector_num) ||
> >>+            (cl_end > 0 && !is_zero_cluster_top_locked(bs, cl_end))) {
> >>+            qemu_co_mutex_unlock(&s->lock);
> >>+            return -ENOTSUP;
> >>+        }
> >Just lock the mutex before the check, the possible optimisation for the
> >emulation case (which is slow anyway) isn't worth the additional code
> >complexity.
>
> bdrv_get_block_status_above(bs) takes s->lock inside. This lock is not
> recursive, thus the code will hang. This is the problem being
> addressed with this split of checks.
>
> Maybe we could make the lock recursive...

Maybe your version is not far from the best we can do then. It deserves
a comment, though, because it's not completely obvious.

The other option that we have and that looks reasonable enough to me is
checking is_zero_cluster_top_locked() first and only if that returns
false, checking the block status of the backing chain, starting at
bs->backing->bs. This way we would bypass the recursive call and could
take the lock from the beginning. If we go that way, it deserves a
comment as well.

Kevin