Re: query dirty areas according to bitmap via QMP or qemu-nbd

2024-07-29 Thread Fiona Ebner
Am 26.07.24 um 17:38 schrieb Eric Blake:
> On Fri, Jul 26, 2024 at 04:16:41PM GMT, Fiona Ebner wrote:
>> Hi,
>>
>> sorry if I'm missing the obvious, but is there a way to get the dirty
>> areas according to a dirty bitmap via QMP? I mean as something like
>> offset + size + dirty-flag triples. In my case, the bitmap is also
>> exported via NBD, so same question for qemu-nbd being the client.
> 
> Over QMP, no - that can produce a potentially large response and
> possible long time in computing the data, so we have never felt the
> need to introduce a new QMP command for that purpose.  So over NBD is
> the preferred solution.
> 
>>
>> I can get the info with "nbdinfo --map", but would like to avoid
>> requiring a tool outside QEMU.
> 
> By default, QEMU as an NBD client only reads the "base:allocation" NBD
> metacontext, and is not wired to read more than one NBD metacontext at
> once (weaker than nbdinfo's capabilities).  But I have intentionally
> left in a hack (accessible through QMP as well as from the command
> line) for connecting a qemu NBD client to an alternative NBD
> metacontext that feeds the block status, at which point 2 bits of
> information from the alternative context are observable through the
> result of block status calls.  Note that using such an NBD connection
> for anything OTHER than block status calls is inadvisable (qemu might
> incorrectly optimize reads based on its misinterpretation of those
> block status bits); but as long as you limit the client to block
> status calls, it's a great way to read out a "qemu:dirty-bitmap:..."
> metacontext using only a qemu NBD client connection.
> 
> git grep -l x-dirty-bitmap tests/qemu-iotests
> 
> shows several of the iotests using the backdoor in just that manner.
> In particular, tests/qemu-img-bitmaps gives the magic decoder ring:
> 
> | # x-dirty-bitmap is a hack for reading bitmaps; it abuses block status to
> | # report "data":false for portions of the bitmap which are set
> | IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
> | nbd_server_start_unix_socket -r -f qcow2 \
> | -B b0 -B b1 -B b2 -B b3 "$TEST_IMG"
> | $QEMU_IMG map --output=json --image-opts \
> | "$IMG,x-dirty-bitmap=qemu:dirty-bitmap:b0" | _filter_qemu_img_map
> 
> meaning the JSON map output reports "data":false for the dirty
> portions and "data":true for the clean portions recorded in bitmap b0.
> 
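
For anyone wanting the offset + length + dirty triples asked about above, the JSON map output is easy to post-process. A minimal sketch (the function name and sample extents are illustrative, not part of QEMU):

```python
import json

def map_to_triples(map_output):
    # With the x-dirty-bitmap hack, "data": false marks extents where
    # the bitmap bit is set (dirty); "data": true marks clean extents.
    return [(e["start"], e["length"], not e["data"])
            for e in json.loads(map_output)]

# Hypothetical `qemu-img map --output=json` result for illustration
sample = ('[{"start": 0, "length": 65536, "data": true, "zero": false},'
          ' {"start": 65536, "length": 131072, "data": false, "zero": false}]')
print(map_to_triples(sample))  # [(0, 65536, False), (65536, 131072, True)]
```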

Oh, I didn't think about checking the NBD block driver for such an
option :) And thank you for all the explanations!

>>
>> If it is not currently possible, would upstream also be interested in
>> the feature, either for QMP or qemu-nbd?
> 
> Improving qemu-img to get at the information without quite the hacky
> post-processing deciphering would indeed be a useful patch, but it has
> never risen to the level of enough of an itch for me to write it
> myself (especially since 'nbdinfo --map's output works just as well).
> 

I might just go with the above for now, but who knows if I'll get around
to this some day. Three approaches that come to mind are:

1. qemu-img bitmap --dump

Other bitmap actions won't be supported in combination with NBD.

2. qemu-img map --bitmap NAME

Should it use a dedicated output format, rather than the usual "map"
output (both human and json), with just "start/offset + length + dirty
bit" triples?

3. qemu-nbd --map CONTEXT

Supporting only one context at a time? It would of course be limited to
NBD, unlike the other two.


All would require connecting to the NBD export with the correct meta
context, which currently means using x_dirty_bitmap internally. So would
that even be okay as part of a non-experimental command, or would it
require teaching the NBD client code to deal with multiple meta contexts
first?

Best Regards,
Fiona




query dirty areas according to bitmap via QMP or qemu-nbd

2024-07-26 Thread Fiona Ebner
Hi,

sorry if I'm missing the obvious, but is there a way to get the dirty
areas according to a dirty bitmap via QMP? I mean as something like
offset + size + dirty-flag triples. In my case, the bitmap is also
exported via NBD, so same question for qemu-nbd being the client.

I can get the info with "nbdinfo --map", but would like to avoid
requiring a tool outside QEMU.

If it is not currently possible, would upstream also be interested in
the feature, either for QMP or qemu-nbd?

Best Regards,
Fiona




[PATCH v2] block/reqlist: allow adding overlapping requests

2024-07-12 Thread Fiona Ebner
Allow overlapping requests by removing the assert that made them
impossible. There are only two callers:

1. block_copy_task_create()

It already asserts the very same condition before calling
reqlist_init_req().

2. cbw_snapshot_read_lock()

There is no need to have read requests be non-overlapping in
copy-before-write when used for snapshot-access. In fact, there was no
protection against two callers of cbw_snapshot_read_lock() calling
reqlist_init_req() with overlapping ranges and this could lead to an
assertion failure [1].

In particular, with the reproducer script below [0], two
cbw_co_snapshot_block_status() callers could race, with the second
calling reqlist_init_req() before the first one finishes and removes
its conflicting request.
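
The behavioral difference can be illustrated with a toy model (plain Python, not the actual QEMU code): conflicts remain detectable, but adding an overlapping request no longer aborts, matching the v2 approach of leaving the conflict check to the block-copy caller:

```python
# Toy model of a request list tracking in-flight byte ranges.
class ReqList:
    def __init__(self):
        self.reqs = []  # list of (offset, nbytes) tuples

    def find_conflict(self, offset, nbytes):
        # True if [offset, offset + nbytes) overlaps any tracked request
        return any(o < offset + nbytes and offset < o + n
                   for o, n in self.reqs)

    def init_req(self, offset, nbytes):
        # After this patch there is no conflict assertion here anymore;
        # block-copy checks for conflicts itself before calling this.
        self.reqs.append((offset, nbytes))

reqs = ReqList()
reqs.init_req(0, 4096)
print(reqs.find_conflict(2048, 4096))  # True: the ranges overlap
reqs.init_req(2048, 4096)              # no longer triggers an assertion
print(len(reqs.reqs))                  # 2
```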

[0]:

> #!/bin/bash -e
> dd if=/dev/urandom of=/tmp/disk.raw bs=1M count=1024
> ./qemu-img create /tmp/fleecing.raw -f raw 1G
> (
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw \
> --blockdev raw,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.raw \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "nbd-server-start", "arguments": {"addr": { "type": "unix", "data": { "path": "/tmp/nbd.socket" } } } }
> {"execute": "block-export-add", "arguments": {"id": "exp0", "node-name": "snap0", "type": "nbd", "name": "exp0"}}
> EOF
> ) &
> sleep 5
> while true; do
> ./qemu-nbd -d /dev/nbd0
> ./qemu-nbd -c /dev/nbd0 nbd:unix:/tmp/nbd.socket:exportname=exp0 -f raw -r
> nbdinfo --map 'nbd+unix:///exp0?socket=/tmp/nbd.socket'
> done

[1]:

> #5  0x71e5f0088eb2 in __GI___assert_fail (...) at ./assert/assert.c:101
> #6  0x615285438017 in reqlist_init_req (...) at ../block/reqlist.c:23
> #7  0x6152853e2d98 in cbw_snapshot_read_lock (...) at ../block/copy-before-write.c:237
> #8  0x6152853e3068 in cbw_co_snapshot_block_status (...) at ../block/copy-before-write.c:304
> #9  0x6152853f4d22 in bdrv_co_snapshot_block_status (...) at ../block/io.c:3726
> #10 0x61528543a63e in snapshot_access_co_block_status (...) at ../block/snapshot-access.c:48
> #11 0x6152853f1a0a in bdrv_co_do_block_status (...) at ../block/io.c:2474
> #12 0x6152853f2016 in bdrv_co_common_block_status_above (...) at ../block/io.c:2652
> #13 0x6152853f22cf in bdrv_co_block_status_above (...) at ../block/io.c:2732
> #14 0x6152853d9a86 in blk_co_block_status_above (...) at ../block/block-backend.c:1473
> #15 0x61528538da6c in blockstatus_to_extents (...) at ../nbd/server.c:2374
> #16 0x61528538deb1 in nbd_co_send_block_status (...) at ../nbd/server.c:2481
> #17 0x61528538f424 in nbd_handle_request (...) at ../nbd/server.c:2978
> #18 0x61528538f906 in nbd_trip (...) at ../nbd/server.c:3121
> #19 0x6152855a7caf in coroutine_trampoline (...) at ../util/coroutine-ucontext.c:175

Cc: qemu-sta...@nongnu.org
Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---

Changes in v2:
* different approach, allowing overlapping requests for
  copy-before-write rather than waiting for them. block-copy already
  asserts there are no conflicts before adding a request.

 block/copy-before-write.c | 3 ++-
 block/reqlist.c   | 2 --
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 853e01a1eb..28f6a096cd 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -66,7 +66,8 @@ typedef struct BDRVCopyBeforeWriteState {
 
 /*
  * @frozen_read_reqs: current read requests for fleecing user in bs->file
- * node. These areas must not be rewritten by guest.
+ * node. These areas must not be rewritten by guest. There can be multiple
+ * overlapping read requests.
  */
 BlockReqList frozen_read_reqs;
 
diff --git a/block/reqlist.c b/block/reqlist.c
index 08cb57cfa4..098e807378 100644
--- a/block/reqlist.c
+++ b/block/reqlist.c
@@ -20,8 +20,6 @@
 void reqlist_init_req(BlockReqList *reqs, BlockReq *req, int64_t offset,
   int64_t bytes)
 {
-assert(!reqlist_find_conflict(reqs, offset, bytes));
-
 *req = (BlockReq) {
 .offset = offset,
 .bytes = bytes,
-- 
2.39.2





[PATCH] block/copy-before-write: wait for conflicts when read locking to avoid assertion failure

2024-07-11 Thread Fiona Ebner
There is no protection against two callers of cbw_snapshot_read_lock()
calling reqlist_init_req() with overlapping ranges, and
reqlist_init_req() asserts that there are no conflicting requests.

In particular, two cbw_co_snapshot_block_status() callers can race,
with the second calling reqlist_init_req() before the first one
finishes and removes its conflicting request, leading to an assertion
failure.

Reproducer script [0] and backtrace [1] are attached below.

[0]:

> #!/bin/bash -e
> dd if=/dev/urandom of=/tmp/disk.raw bs=1M count=1024
> ./qemu-img create /tmp/fleecing.raw -f raw 1G
> (
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw \
> --blockdev raw,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.raw \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "nbd-server-start", "arguments": {"addr": { "type": "unix", "data": { "path": "/tmp/nbd.socket" } } } }
> {"execute": "block-export-add", "arguments": {"id": "exp0", "node-name": "snap0", "type": "nbd", "name": "exp0"}}
> EOF
> ) &
> sleep 5
> while true; do
> ./qemu-nbd -d /dev/nbd0
> ./qemu-nbd -c /dev/nbd0 nbd:unix:/tmp/nbd.socket:exportname=exp0 -f raw -r
> nbdinfo --map 'nbd+unix:///exp0?socket=/tmp/nbd.socket'
> done

[1]:

> #5  0x71e5f0088eb2 in __GI___assert_fail (...) at ./assert/assert.c:101
> #6  0x615285438017 in reqlist_init_req (...) at ../block/reqlist.c:23
> #7  0x6152853e2d98 in cbw_snapshot_read_lock (...) at ../block/copy-before-write.c:237
> #8  0x6152853e3068 in cbw_co_snapshot_block_status (...) at ../block/copy-before-write.c:304
> #9  0x6152853f4d22 in bdrv_co_snapshot_block_status (...) at ../block/io.c:3726
> #10 0x61528543a63e in snapshot_access_co_block_status (...) at ../block/snapshot-access.c:48
> #11 0x6152853f1a0a in bdrv_co_do_block_status (...) at ../block/io.c:2474
> #12 0x6152853f2016 in bdrv_co_common_block_status_above (...) at ../block/io.c:2652
> #13 0x6152853f22cf in bdrv_co_block_status_above (...) at ../block/io.c:2732
> #14 0x6152853d9a86 in blk_co_block_status_above (...) at ../block/block-backend.c:1473
> #15 0x61528538da6c in blockstatus_to_extents (...) at ../nbd/server.c:2374
> #16 0x61528538deb1 in nbd_co_send_block_status (...) at ../nbd/server.c:2481
> #17 0x61528538f424 in nbd_handle_request (...) at ../nbd/server.c:2978
> #18 0x61528538f906 in nbd_trip (...) at ../nbd/server.c:3121
> #19 0x6152855a7caf in coroutine_trampoline (...) at ../util/coroutine-ucontext.c:175

Signed-off-by: Fiona Ebner 
---
 block/copy-before-write.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 853e01a1eb..376ff3f3e1 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -234,6 +234,7 @@ cbw_snapshot_read_lock(BlockDriverState *bs, int64_t offset, int64_t bytes,
 *req = (BlockReq) {.offset = -1, .bytes = -1};
 *file = s->target;
 } else {
+reqlist_wait_all(&s->frozen_read_reqs, offset, bytes, &s->lock);
 reqlist_init_req(&s->frozen_read_reqs, req, offset, bytes);
 *file = bs->file;
 }
-- 
2.39.2





[PATCH v3 0/2] backup: allow specifying minimum cluster size

2024-07-11 Thread Fiona Ebner
Discussion for v2:
https://lore.kernel.org/qemu-devel/20240528120114.344416-1-f.eb...@proxmox.com/

Changes in v3:
* Pass min_cluster_size option directly without checking
  has_min_cluster_size, because the default is 0 anyway.
* Calculate maximum of passed-in argument and default once at the
  beginning of block_copy_calculate_cluster_size()
* Update warning message to reflect actual value used
* Do not leak qdict in error case
* Use PRI{i,u}64 macros

Discussion for v1:
https://lore.kernel.org/qemu-devel/20240308155158.830258-1-f.eb...@proxmox.com/
-
Changes in v2:
* Use 'size' type in QAPI.
* Remove option in cbw_parse_options(), i.e. before parsing generic
  blockdev options.
* Reword commit messages hoping to describe the issue in a more
  straightforward way.

In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.
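
For illustration, with this series applied the new option could be passed through the existing experimental @x-perf member of the backup commands roughly like this (node names and the 1 MiB value are made up; the exact shape depends on the final QAPI schema):

```json
{"execute": "blockdev-backup",
 "arguments": {"device": "drive0", "target": "backup0", "sync": "full",
               "x-perf": {"min-cluster-size": 1048576}}}
```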

Fiona Ebner (2):
  copy-before-write: allow specifying minimum cluster size
  backup: add minimum cluster size to performance options

 block/backup.c |  2 +-
 block/block-copy.c | 36 ++--
 block/copy-before-write.c  | 14 +-
 block/copy-before-write.h  |  1 +
 blockdev.c |  3 +++
 include/block/block-copy.h |  1 +
 qapi/block-core.json   | 17 ++---
 7 files changed, 59 insertions(+), 15 deletions(-)

-- 
2.39.2





[PATCH v3 1/2] copy-before-write: allow specifying minimum cluster size

2024-07-11 Thread Fiona Ebner
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

The type 'size' (corresponding to uint64_t in C) is used in QAPI to
rule out negative inputs and for consistency with already existing
@cluster-size parameters. Since block_copy_calculate_cluster_size()
uses int64_t for its result, a check that the input is not too large
is added in block_copy_state_new() before calling it. The calculation
in block_copy_calculate_cluster_size() is done in the target int64_t
type.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Markus Armbruster  (QAPI schema)
Signed-off-by: Fiona Ebner 
---

Changes in v3:
* Pass min_cluster_size option directly without checking
  has_min_cluster_size, because the default is 0 anyway.
* Calculate maximum of passed-in argument and default once at the
  beginning of block_copy_calculate_cluster_size()
* Update warning message to reflect actual value used
* Use PRI{i,u}64 macros

 block/block-copy.c | 36 ++--
 block/copy-before-write.c  |  5 -
 include/block/block-copy.h |  1 +
 qapi/block-core.json   |  8 +++-
 4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 7e3b378528..59bee538eb 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range,
 }
 
 static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
+ int64_t min_cluster_size,
  Error **errp)
 {
 int ret;
@@ -319,6 +320,9 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 GLOBAL_STATE_CODE();
 GRAPH_RDLOCK_GUARD_MAINLOOP();
 
+min_cluster_size = MAX(min_cluster_size,
+   (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
+
 target_does_cow = bdrv_backing_chain_next(target);
 
 /*
@@ -329,13 +333,13 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 ret = bdrv_get_info(target, &bdi);
 if (ret == -ENOTSUP && !target_does_cow) {
 /* Cluster size is not defined */
-warn_report("The target block device doesn't provide "
-"information about the block size and it doesn't have a "
-"backing file. The default block size of %u bytes is "
-"used. If the actual block size of the target exceeds "
-"this default, the backup may be unusable",
-BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
-return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+warn_report("The target block device doesn't provide information about "
+"the block size and it doesn't have a backing file. The "
+"(default) block size of %" PRIi64 " bytes is used. If the "
+"actual block size of the target exceeds this value, the "
+"backup may be unusable",
+min_cluster_size);
+return min_cluster_size;
 } else if (ret < 0 && !target_does_cow) {
 error_setg_errno(errp, -ret,
 "Couldn't determine the cluster size of the target image, "
@@ -345,16 +349,17 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 return ret;
 } else if (ret < 0 && target_does_cow) {
 /* Not fatal; just trudge on ahead. */
-return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+return min_cluster_size;
 }
 
-return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
+return MAX(min_cluster_size, bdi.cluster_size);
 }
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
  BlockDriverState *copy_bitmap_bs,
  const BdrvDirtyBitmap *bitmap,
  bool discard_source,
+ uint64_t min_cluster_size,
  Error **errp)
 {
 ERRP_GUARD();
@@ -365,7 +370,18 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
 
 GLOBAL_STATE_CODE();
 
-cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
+if (min

[PATCH v3 2/2] backup: add minimum cluster size to performance options

2024-07-11 Thread Fiona Ebner
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Markus Armbruster  (QAPI schema)
Signed-off-by: Fiona Ebner 
---

Changes in v3:
* Use PRI{i,u}64 macros
* Do not leak qdict in error case

 block/backup.c| 2 +-
 block/copy-before-write.c | 9 +
 block/copy-before-write.h | 1 +
 blockdev.c| 3 +++
 qapi/block-core.json  | 9 +++--
 5 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3dd2e229d2..a1292c01ec 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
 }
 
 cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source,
-  &bcs, errp);
+  perf->min_cluster_size, &bcs, errp);
 if (!cbw) {
 goto error;
 }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index a919b1f41b..e835987e52 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -548,6 +548,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
   BlockDriverState *target,
   const char *filter_node_name,
   bool discard_source,
+  uint64_t min_cluster_size,
   BlockCopyState **bcs,
   Error **errp)
 {
@@ -567,6 +568,14 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
 qdict_put_str(opts, "file", bdrv_get_node_name(source));
 qdict_put_str(opts, "target", bdrv_get_node_name(target));
 
+if (min_cluster_size > INT64_MAX) {
+error_setg(errp, "min-cluster-size too large: %" PRIu64 " > %" PRIi64,
+   min_cluster_size, INT64_MAX);
+qobject_unref(opts);
+return NULL;
+}
+qdict_put_int(opts, "min-cluster-size", (int64_t)min_cluster_size);
+
 top = bdrv_insert_node(source, opts, flags, errp);
 if (!top) {
 return NULL;
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 01af0cd3c4..2a5d4ba693 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
   BlockDriverState *target,
   const char *filter_node_name,
   bool discard_source,
+  uint64_t min_cluster_size,
   BlockCopyState **bcs,
   Error **errp);
 void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/blockdev.c b/blockdev.c
index 835064ed03..6740663fda 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2655,6 +2655,9 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 if (backup->x_perf->has_max_chunk) {
 perf.max_chunk = backup->x_perf->max_chunk;
 }
+if (backup->x_perf->has_min_cluster_size) {
+perf.min_cluster_size = backup->x_perf->min_cluster_size;
+}
 }
 
 if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 80e32db8aa..9a54bfb15f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1551,11 +1551,16 @@
 # it should not be less than job cluster size which is calculated
 # as maximum of target image cluster size and 64k.  Default 0.
 #
+# @min-cluster-size: Minimum size of blocks used by copy-before-write
+# and background copy operations.  Has to be a power of 2.  No
+# effect if smaller than the maximum of the target's cluster size
+# and 64 KiB.  Default 0.  (Since 9.1)
+#
 # Since: 6.0
 ##
 { 'struct': 'BackupPerf',
-  'data': { '*use-copy-range': 'bool',
-'*max-workers': 'int', '*max-chunk': 'int64' } }
+  'data': { '*use-copy-range': 'bool', '*max-workers': 'int',
+'*max-chunk': 'int64', '*min-cluster-size': 'size' } }
 
 ##
 # @BackupCommon:
-- 
2.39.2





Re: [PATCH] scsi: Don't ignore most usb-storage properties

2024-07-01 Thread Fiona Ebner
Hi,

we got a user report about bootindex not working anymore for a
'usb-storage' device [0]; I reproduced the issue and bisected it to this
patch.

Am 31.01.24 um 14:06 schrieb Kevin Wolf:
> @@ -399,11 +397,10 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
>  object_property_add_child(OBJECT(bus), name, OBJECT(dev));
>  g_free(name);
>  
> +s = SCSI_DEVICE(dev);
> +s->conf = *conf;
> +
>  qdev_prop_set_uint32(dev, "scsi-id", unit);
> -if (bootindex >= 0) {
> -object_property_set_int(OBJECT(dev), "bootindex", bootindex,
> -&error_abort);
> -}

The fact that this is not called anymore means that the 'set' method for
the property is also not called. Here, that method is
device_set_bootindex() (as configured by scsi_dev_instance_init() ->
device_add_bootindex_property()). Therefore, the device is never
registered via add_boot_device_path() meaning that the bootindex
property does not have the desired effect anymore.

Is it necessary to keep the object_property_set_{bool,int} and
qdev_prop_set_enum calls around for these potential side effects? Would
it even be necessary to introduce new similar calls for the newly
supported properties? Or is there an easy alternative to
s->conf = *conf;
that does trigger the side effects?

>  if (object_property_find(OBJECT(dev), "removable")) {
>  qdev_prop_set_bit(dev, "removable", removable);
>  }
> @@ -414,19 +411,12 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk,
>  object_unparent(OBJECT(dev));
>  return NULL;
>  }
> -if (!object_property_set_bool(OBJECT(dev), "share-rw", share_rw, errp)) {
> -object_unparent(OBJECT(dev));
> -return NULL;
> -}
> -
> -qdev_prop_set_enum(dev, "rerror", rerror);
> -qdev_prop_set_enum(dev, "werror", werror);
>  
>  if (!qdev_realize_and_unref(dev, &bus->qbus, errp)) {
>  object_unparent(OBJECT(dev));
>  return NULL;
>  }
> -return SCSI_DEVICE(dev);
> +return s;
>  }
>  
>  void scsi_bus_legacy_handle_cmdline(SCSIBus *bus)
[0]: https://forum.proxmox.com/threads/149772/post-679433

Best Regards,
Fiona




Re: [PATCH v4 0/5] mirror: allow specifying working bitmap

2024-06-10 Thread Fiona Ebner
Ping

Am 21.05.24 um 14:20 schrieb Fiona Ebner:
> Changes from v3 (discussion here [3]):
> * Improve/fix QAPI documentation.
> 
> Changes from v2 (discussion here [2]):
> * Cluster size caveats only apply to non-COW diff images; adapt the
>   cluster size check and documentation accordingly.
> * In the IO test, use backing files (rather than stand-alone diff
>   images) in combination with copy-mode=write-blocking and larger
>   cluster size for target images, to have a more realistic use-case
>   and show that COW prevents ending up with cluster with partial data
>   upon unaligned writes.
> * Create a separate patch for replacing is_none_mode with sync_mode in
>   MirrorBlockJob struct.
> * Disallow using read-only bitmap (cannot be used as working bitmap).
> * Require that bitmap is enabled at the start of the job.
> * Avoid IO test script potentially waiting on non-existent job when
>   blockdev-mirror QMP command fails.
> * Fix pylint issues in IO test.
> * Rename IO test from sync-bitmap-mirror to mirror-bitmap.
> 
> Changes from RFC/v1 (discussion here [0]):
> * Add patch to split BackupSyncMode and MirrorSyncMode.
> * Drop bitmap-mode parameter and use passed-in bitmap as the working
>   bitmap instead. Users can get the desired behaviors by
>   using the block-dirty-bitmap-clear and block-dirty-bitmap-merge
>   calls (see commit message in patch 2/4 for how exactly).
> * Add patch to check whether target image's cluster size is at most
>   mirror job's granularity. Optional, it's an extra safety check
>   that's useful when the target is a "diff" image that does not have
>   previously synced data.
> 
> Use cases:
> * Possibility to resume a failed mirror later.
> * Possibility to only mirror deltas to a previously mirrored volume.
> * Possibility to (efficiently) mirror a drive that was previously
>   mirrored via some external mechanism (e.g. ZFS replication).
> 
> We have been using the last one in production without any issues for
> about four years now. In particular, like mentioned in [1]:
> 
>> - create bitmap(s)
>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>> - incrementally drive mirror as part of a live migration of VM
>> - drop bitmap(s)
> 
> 
> Now, the IO test added in patch 4/5 actually contains yet another use
> case, namely doing incremental mirrors to qcow2 "diff" images, that
> only contain the delta and can be rebased later. I had to adapt the IO
> test, because its output expected the mirror bitmap to still be dirty,
> but nowadays the mirror is apparently already done when the bitmaps
> are queried. So I thought, I'll just use 'write-blocking' mode to
> avoid any potential timing issues.
> 
> Initially, the qcow2 diff image targets were stand-alone and that
> suffers from an issue when 'write-blocking' mode is used. If a write
> is not aligned to the granularity of the mirror target, then rebasing
> the diff image onto a backing image will not yield the desired result,
> because the full cluster is considered to be allocated and will "hide"
> some part of the base/backing image. The failure can be seen by either
> using 'write-blocking' mode in the IO test or setting the (bitmap)
> granularity to 32 KiB rather than the current 64 KiB.
> 
> The test thus uses a more realistic approach where the qcow2 targets
> have backing images and a check is added in patch 5/5 for the cluster
> size for non-COW targets. However, with e.g. NBD, the cluster size
> cannot be queried and prohibiting all bitmap mirrors to NBD targets
> just to prevent the highly specific edge case seems not worth it, so
> the limitation is rather documented and the check ignores cases where
> querying the target image's cluster size returns -ENOTSUP.
> 
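
The partial-cluster effect described above can be quantified with a little arithmetic (a toy model; the 64 KiB cluster size and write sizes are illustrative):

```python
CLUSTER = 64 * 1024  # illustrative target/bitmap granularity

def shadowed_backing_bytes(offset, nbytes, cluster=CLUSTER):
    # A write into a stand-alone diff image allocates whole clusters.
    # After rebasing the diff onto a backing image, the parts of those
    # clusters the write did not touch "hide" the backing image's data.
    start = (offset // cluster) * cluster
    end = ((offset + nbytes + cluster - 1) // cluster) * cluster
    return (end - start) - nbytes

print(shadowed_backing_bytes(0, 4096))   # an unaligned 4 KiB write hides 60 KiB
print(shadowed_backing_bytes(0, 65536))  # an aligned full-cluster write hides nothing
```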
> 
> [0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
> [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/
> [2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/
> [3]: https://lore.kernel.org/qemu-devel/20240510131647.1256467-1-f.eb...@proxmox.com/
> 
> 
> Fabian Grünbichler (1):
>   iotests: add test for bitmap mirror
> 
> Fiona Ebner (3):
>   qapi/block-core: avoid the re-use of MirrorSyncMode for backup
>   block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob
> struct
>   blockdev: mirror: check for target's cluster size when using bitmap
> 
> John Snow (1):
>   mirror: allow specifying working bitmap
> 
>  block/backup.c |   18 +-
>  block/mir

Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-05 Thread Fiona Ebner
Am 04.06.24 um 17:28 schrieb Kevin Wolf:
> Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
>> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
>>> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
>>>> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
>>>>>
>>>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
>>>>> with a coroutine wrapper so that the graph lock is held for the whole
>>>>> function. Then calling bdrv_co_flush() while iterating the list is safe
>>>>> and doesn't allow concurrent graph modifications.
>>>>
>>>> The second is that iotest 255 ran into an assertion failure upon QMP 
>>>> 'quit':
>>>>
>>>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
>>>>> `!qemu_in_coroutine()' failed.
>>>>
>>>> Looking at the backtrace:
>>>>
>>>>> #5  0x762a90cc3eb2 in __GI___assert_fail (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 "../block/graph-lock.c", line=113, function=0x5afb07991f20 <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock") at ./assert/assert.c:101
>>>>> #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
>>>>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at ../block/block-backend.c:901
>>>>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at ../block/block-backend.c:487
>>>>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at ../block/block-backend.c:547
>>>>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at ../block/block-backend.c:618
>>>>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
>>>>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) at block/block-gen.c:1391
>>>>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>>>>
>>>> So I guess calling bdrv_next() is not safe from a coroutine, because
>>>> the function doing the iteration could end up holding the last
>>>> reference to the BB.
>>>
>>> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
>>> is surprising, because while we hold the graph lock, no reference should
>>> be able to go away - you need the writer lock for that and you won't get
>>> it as long as bdrv_co_flush_all() locks the graph. So whatever had a
>>> reference before the bdrv_next() loop must still have it now. Do you
>>> know where it gets dropped?
>>>
>>
>> AFAICT, yes, it does hold the graph reader lock. The generated code is:
>>
>>> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
>>> {
>>> BdrvFlushAll *s = opaque;
>>>
>>> bdrv_graph_co_rdlock();
>>> s->ret = bdrv_co_flush_all();
>>> bdrv_graph_co_rdunlock();
>>> s->poll_state.in_progress = false;
>>>
>>> aio_wait_kick();
>>> }
>>
>> Apparently when the mirror job is aborted/exits, which can happen during
>> the polling for bdrv_co_flush_all_entry(), a reference can go away
>> without the write lock (at least my breakpoints didn't trigger) being held:
>>
>>> #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
>>> #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
>>> ../block/mirror.c:710
>>> #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
>>> ../block/mirror.c:823
>>> #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
>>> #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) 
>>> at ../job.c:855
>>> #5  0x5cdefb223852 in job_completed_txn_abort_locked 
>>> (job=0x5cdefeb53000) at ../job.c:958
>>> #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
>>> ../job.c:1065
>>> #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
>>> #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
>>> ../util/async.c:171
>>> #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
>>> ../util/async.c:218
>>> #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdef

[RFC PATCH] block-coroutine-wrapper: support generating wrappers for functions without arguments

2024-06-04 Thread Fiona Ebner
Signed-off-by: Fiona Ebner 
---

An alternative would be to detect whether the argument list is 'void'
in FuncDecl's __init__, assign the empty list to self.args there and
special case based on that in the rest of the code.

Not super happy about the introduction of the 'void_value' parameter,
but the different callers seem to make something like it necessary.
Could be avoided if there were a nice way to map a format which
contains no other keys besides '{name}' to the empty list if the
argument's 'name' is 'None'. At least until there is a format that
contains both '{name}' and another key which would require special
handling again.

The generated code unfortunately does contain a few extra blank lines.
Avoiding that would require turning some of the (currently static)
formatting surrounding gen_block() dynamic based upon whether the
argument list is 'void'.

Happy about any feedback/suggestions!
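
The alternative mentioned above could look roughly like this. This is a
simplified, standalone sketch (the regex and class layout only loosely
mirror the real script), just to illustrate detecting 'void' in
FuncDecl's __init__ and assigning the empty argument list there:

```python
import re

class ParamDecl:
    # Simplified stand-in for the script's parameter regex.
    param_re = re.compile(r'(?P<decl>(?P<type>.+?)\s*(?P<name>\w+))$')

    def __init__(self, param_decl: str) -> None:
        m = self.param_re.match(param_decl.strip())
        if m is None:
            raise ValueError(f'Wrong parameter declaration: "{param_decl}"')
        self.decl = m.group('decl')
        self.type = m.group('type')
        self.name = m.group('name')

class FuncDecl:
    def __init__(self, args_str: str) -> None:
        # Detect 'void' here and use an empty argument list, so that
        # gen_list() and gen_block() need no special casing later.
        if args_str.strip() == 'void':
            self.args = []
        else:
            self.args = [ParamDecl(a) for a in args_str.split(',')]

    def gen_list(self, fmt: str) -> str:
        # An empty args list naturally yields the empty string here.
        return ', '.join(fmt.format_map(a.__dict__) for a in self.args)

print(FuncDecl('void').gen_list('{name}'))
print(FuncDecl('BlockDriverState *bs, int flags').gen_list('{name}'))
```

The remaining wart is the declaration list: gen_list('{decl}') on an
empty list yields '' where the C prototype needs the literal 'void', so
some special casing survives either way.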

 scripts/block-coroutine-wrapper.py | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index dbbde99e39..533f6dbe12 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -54,6 +54,11 @@ class ParamDecl:
   r')')
 
 def __init__(self, param_decl: str) -> None:
+if param_decl.strip() == 'void':
+self.decl = 'void'
+self.type = 'void'
+self.name = None
+return
 m = self.param_re.match(param_decl.strip())
 if m is None:
 raise ValueError(f'Wrong parameter declaration: "{param_decl}"')
@@ -114,10 +119,14 @@ def gen_ctx(self, prefix: str = '') -> str:
 else:
 return 'qemu_get_aio_context()'
 
-def gen_list(self, format: str) -> str:
+def gen_list(self, format: str, void_value='') -> str:
+if len(self.args) == 1 and self.args[0].type == 'void':
+return void_value
 return ', '.join(format.format_map(arg.__dict__) for arg in self.args)
 
 def gen_block(self, format: str) -> str:
+if len(self.args) == 1 and self.args[0].type == 'void':
+return ''
 return '\n'.join(format.format_map(arg.__dict__) for arg in self.args)
 
 
@@ -158,7 +167,7 @@ def create_mixed_wrapper(func: FuncDecl) -> str:
 graph_assume_lock = 'assume_graph_lock();' if func.graph_rdlock else ''
 
 return f"""\
-{func.return_type} {func.name}({ func.gen_list('{decl}') })
+{func.return_type} {func.name}({ func.gen_list('{decl}', 'void') })
 {{
 if (qemu_in_coroutine()) {{
 {graph_assume_lock}
@@ -186,7 +195,7 @@ def create_co_wrapper(func: FuncDecl) -> str:
 name = func.target_name
 struct_name = func.struct_name
 return f"""\
-{func.return_type} {func.name}({ func.gen_list('{decl}') })
+{func.return_type} {func.name}({ func.gen_list('{decl}', 'void') })
 {{
 {struct_name} s = {{
 .poll_state.ctx = qemu_get_current_aio_context(),
@@ -284,7 +293,7 @@ def gen_no_co_wrapper(func: FuncDecl) -> str:
 aio_co_wake(s->co);
 }}
 
-{func.return_type} coroutine_fn {func.name}({ func.gen_list('{decl}') })
+{func.return_type} coroutine_fn {func.name}({ func.gen_list('{decl}', 'void') })
 {{
 {struct_name} s = {{
 .co = qemu_coroutine_self(),
-- 
2.39.2





Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-04 Thread Fiona Ebner
Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
>> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
>>>
>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
>>> with a coroutine wrapper so that the graph lock is held for the whole
>>> function. Then calling bdrv_co_flush() while iterating the list is safe
>>> and doesn't allow concurrent graph modifications.
>>
>> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
>>
>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
>>> `!qemu_in_coroutine()' failed.
>>
>> Looking at the backtrace:
>>
>>> #5  0x762a90cc3eb2 in __GI___assert_fail
>>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
>>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
>>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
>>> at ./assert/assert.c:101
>>> #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
>>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
>>> ../block/block-backend.c:901
>>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
>>> ../block/block-backend.c:487
>>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
>>> ../block/block-backend.c:547
>>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
>>> ../block/block-backend.c:618
>>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
>>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
>>> at block/block-gen.c:1391
>>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>>
>> So I guess calling bdrv_next() is not safe from a coroutine, because
>> the function doing the iteration could end up being the last thing to
>> have a reference for the BB.
> 
> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> is surprising, because while we hold the graph lock, no reference should
> be able to go away - you need the writer lock for that and you won't get
> it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> reference before the bdrv_next() loop must still have it now. Do you
> know where it gets dropped?
> 

AFAICT, yes, it does hold the graph reader lock. The generated code is:

> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> {
> BdrvFlushAll *s = opaque;
> 
> bdrv_graph_co_rdlock();
> s->ret = bdrv_co_flush_all();
> bdrv_graph_co_rdunlock();
> s->poll_state.in_progress = false;
> 
> aio_wait_kick();
> }

Apparently when the mirror job is aborted/exits, which can happen during
the polling for bdrv_co_flush_all_entry(), a reference can go away
without the write lock (at least my breakpoints didn't trigger) being held:

> #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
> ../block/mirror.c:710
> #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
> ../block/mirror.c:823
> #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) at 
> ../job.c:855
> #5  0x5cdefb223852 in job_completed_txn_abort_locked (job=0x5cdefeb53000) 
> at ../job.c:958
> #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
> ../job.c:1065
> #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
> ../util/async.c:171
> #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
> ../util/async.c:218
> #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at 
> ../util/aio-posix.c:722
> #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at 
> ../block/block-gen.h:43
> #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, 
> send_stop=false)
> at ../system/cpus.c:297
> #14 0x5cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
> #15 0x5cdefae6d892 in qemu_cleanup (status=0) at ../system/runstate.c:871
> #16 0x5cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
> #17 0x5cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at 
> ../system/main.c:48

Looking at the code in mirror_exit_common(), it doesn't seem to acquire
a write lock:

> bdrv_graph_rdunlock_main_loop();
> 
> /*
>  * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>  * inserting target_bs at s->to_replace, where we might not be able to get
>  * these permissions.
>  */
> blk_unref(s->target);
> s->target = NULL;

The write lock is taken in blk_remove_bs() when the refcount drops to 0
and the BB is actually removed:

> bdrv_graph_wrlock();
> bdrv_root_unref_child(root);
> bdrv_graph_wrunlock();

Best Regards,
Fiona




Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-06-03 Thread Fiona Ebner
Am 28.05.24 um 18:06 schrieb Kevin Wolf:
> Am 29.04.2024 um 16:19 hat Fiona Ebner geschrieben:
>> rather than the uint32_t for which the maximum is slightly more than 4
>> seconds and larger values would overflow. The QAPI interface allows
>> specifying the number of seconds, so only values 0 to 4 are safe right
>> now, other values lead to a much lower timeout than a user expects.
>>
>> The block_copy() call where this is used already takes a uint64_t for
>> the timeout, so no change required there.
>>
>> Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
>> Reported-by: Friedrich Weber 
>> Signed-off-by: Fiona Ebner 
> 
> Thanks, applied to the block branch.
> 
> But I don't think our job is done yet with this. Increasing the limit is
> good and useful, but even if it's now unlikely to hit with sane values,
> we should still catch integer overflows in cbw_open() and return an
> error on too big values instead of silently wrapping around.

NANOSECONDS_PER_SECOND is 10^9 and the QAPI type for cbw-timeout is
uint32_t, so even with the maximum allowed value, there is no overflow.
Should I still add such a check?
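
For reference, the ranges involved (a quick Python check of the
arithmetic, not QEMU code):

```python
NANOSECONDS_PER_SECOND = 10**9
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1

# Before the fix: the timeout in nanoseconds was stored in a uint32_t,
# which overflows for anything above ~4.29 seconds.
print(UINT32_MAX // NANOSECONDS_PER_SECOND)  # -> 4

# After the fix: even the maximum cbw-timeout value (uint32 seconds)
# fits comfortably into a uint64_t nanosecond count.
print(UINT32_MAX * NANOSECONDS_PER_SECOND <= UINT64_MAX)  # -> True
```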

Best Regards,
Fiona




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-03 Thread Fiona Ebner
Hi Kevin,

Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
>> The old_bs variable in bdrv_next() is currently determined by looking
>> at the old block backend. However, if the block graph changes before
>> the next bdrv_next() call, it might be that the associated BDS is not
>> the same that was referenced previously. In that case, the wrong BDS
>> is unreferenced, leading to an assertion failure later:
>>
>>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
> 
> Your change makes sense, but in theory it shouldn't make a difference.
> The real bug is in the caller, you can't allow graph modifications while
> iterating the list of nodes. Even if it doesn't crash (like after your
> patch), you don't have any guarantee that you will have seen every node
> that exists at the end - and maybe not even that you don't get the
> same node twice.
> 
>> In particular, this can happen in the context of bdrv_flush_all(),
>> when polling for bdrv_co_flush() in the generated co-wrapper leads to
>> a graph change (for example with a stream block job [0]).
> 
> The whole locking around this case is a bit tricky and would deserve
> some cleanup.
> 
> The basic rule for bdrv_next() callers is that they need to hold the
> graph reader lock as long as they are iterating the graph, otherwise
> it's not safe. This implies that the ref/unref pairs in it should never
> make a difference either - which is important, because at least
> releasing the last reference is forbidden while holding the graph lock.
> I intended to remove the ref/unref for bdrv_next(), but I didn't because
> I realised that the callers need to be audited first that they really
> obey the rules. You found one that would be problematic.
> 
> The thing that bdrv_flush_all() gets wrong is that it promises to follow
> the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> something that polls. The compiler can't catch this because bdrv_flush()
> is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
> 
> - If called outside of coroutine context, they are GRAPH_UNLOCKED
> - If called in coroutine context, they are GRAPH_RDLOCK
> 
> We should probably try harder to get rid of the mixed functions, because
> a synchronous co_wrapper_bdrv_rdlock could actually be marked
> GRAPH_UNLOCKED in the code and then the compiler could catch this case.
> 
> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> with a coroutine wrapper so that the graph lock is held for the whole
> function. Then calling bdrv_co_flush() while iterating the list is safe
> and doesn't allow concurrent graph modifications.

I attempted this now, but ran into two issues:

The first is that I had to add support for a function without arguments
to the block-coroutine-wrapper.py script. I can send this as an RFC in
any case if desired.

The second is that iotest 255 ran into an assertion failure upon QMP 'quit':

> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> `!qemu_in_coroutine()' failed.

Looking at the backtrace:

> #5  0x762a90cc3eb2 in __GI___assert_fail
> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> at ./assert/assert.c:101
> #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> ../block/block-backend.c:901
> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> ../block/block-backend.c:487
> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> ../block/block-backend.c:547
> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> ../block/block-backend.c:618
> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) at 
> block/block-gen.c:1391
> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)

So I guess calling bdrv_next() is not safe from a coroutine, because the
function doing the iteration could end up being the last thing to have a
reference for the BB.
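
To make the failure mode concrete, here is a small Python model (not
QEMU code) of the ref/unref pattern in bdrv_next(): if another party
drops its reference while the iterating coroutine is preempted, the
iterator's own unref becomes the last one and the deletion runs from
inside the iteration:

```python
class Blk:
    """Toy stand-in for a BlockBackend with a reference count."""
    def __init__(self, name):
        self.name = name
        self.refcnt = 1        # initial reference held by the mirror job
        self.deleted = False

    def ref(self):
        self.refcnt += 1

    def unref(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            # In QEMU this is blk_delete() -> blk_remove_bs(), which
            # asserts that it is not running in coroutine context.
            self.deleted = True

def flush_all(blks, preempt):
    """Model of the bdrv_next() loop: take a reference to the current
    element, drop the one from the previous step, with a point where
    the coroutine can be preempted (e.g. while polling a flush)."""
    prev = None
    for blk in blks:
        blk.ref()
        if prev is not None:
            prev.unref()
        preempt()          # the mirror job aborts here and drops its ref
        prev = blk
    if prev is not None:
        prev.unref()       # last unref happens inside the iteration

blk = Blk("target")
flush_all([blk], preempt=blk.unref)   # models blk_unref(s->target)
print(blk.deleted)  # -> True: deletion triggered from the iterating coroutine
```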

Best Regards,
Fiona




Re: [PATCH 1/2] Revert "monitor: use aio_co_reschedule_self()"

2024-05-29 Thread Fiona Ebner
CC-ing stable since 1f25c172f83704e350c0829438d832384084a74d is in 9.0.0

Am 06.05.24 um 21:06 schrieb Stefan Hajnoczi:
> Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code
> cleanup that uses aio_co_reschedule_self() instead of open coding
> coroutine rescheduling.
> 
> Bug RHEL-34618 was reported and Kevin Wolf  identified
> the root cause. I missed that aio_co_reschedule_self() ->
> qemu_get_current_aio_context() only knows about
> qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It
> does not function correctly when going back from the iohandler_ctx to
> qemu_aio_context.
> 
> Go back to open coding the AioContext transitions to avoid this bug.
> 
> This reverts commit 1f25c172f83704e350c0829438d832384084a74d.
> 
> Buglink: https://issues.redhat.com/browse/RHEL-34618
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qapi/qmp-dispatch.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index f3488afeef..176b549473 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList *cmds, QObject *requ
>   * executing the command handler so that it can make progress if it
>   * involves an AIO_WAIT_WHILE().
>   */
> -aio_co_reschedule_self(qemu_get_aio_context());
> +aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
> +qemu_coroutine_yield();
>  }
>  
>  monitor_set_cur(qemu_coroutine_self(), cur_mon);
> @@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const QmpCommandList *cmds, QObject *requ
>   * Move back to iohandler_ctx so that nested event loops for
>   * qemu_aio_context don't start new monitor commands.
>   */
> -aio_co_reschedule_self(iohandler_get_aio_context());
> +aio_co_schedule(iohandler_get_aio_context(),
> +qemu_coroutine_self());
> +qemu_coroutine_yield();
>  }
>  } else {
> /*




Re: block snapshot issue with RBD

2024-05-29 Thread Fiona Ebner
Hi,

Am 28.05.24 um 20:19 schrieb Jin Cao:
> Hi Ilya
> 
> On 5/28/24 11:13 AM, Ilya Dryomov wrote:
>> On Mon, May 27, 2024 at 9:06 PM Jin Cao  wrote:
>>>
>>> Supplementary info: VM is paused after "migrate" command. After being
>>> resumed with "cont", snapshot_delete_blkdev_internal works again, which
>>> is confusing, as disk snapshots generally recommend that I/O be
>>> paused, and a frozen VM satisfies this requirement.
>>
>> Hi Jin,
>>
>> This doesn't seem to be related to RBD.  Given that the same error is
>> observed when using the RBD driver with the raw format, I would dig in
>> the direction of migration somehow "installing" the raw format (which
>> is on-disk compatible with the rbd format).
>>
> 
> Thanks for the hint.
> 
>> Also, did you mean to say "snapshot_blkdev_internal" instead of
>> "snapshot_delete_blkdev_internal" in both instances?
> 
> Sorry for my copy-and-paste mistake. Yes, it's snapshot_blkdev_internal.
> 
> -- 
> Sincerely,
> Jin Cao
> 
>>
>> Thanks,
>>
>>  Ilya
>>
>>>
>>> -- 
>>> Sincerely
>>> Jin Cao
>>>
>>> On 5/27/24 10:56 AM, Jin Cao wrote:
 CC block and migration related address.

 On 5/27/24 12:03 AM, Jin Cao wrote:
> Hi,
>
> I encountered RBD block snapshot issue after doing migration.
>
> Steps
> -
>
> 1. Start QEMU with:
> ./qemu-system-x86_64 -name VM -machine q35 -accel kvm -cpu
> host,migratable=on -m 2G -boot menu=on,strict=on
> rbd:image/ubuntu-22.04-server-cloudimg-amd64.raw -net nic -net user
> -cdrom /home/my/path/of/cloud-init.iso -monitor stdio
>
> 2. Do block snapshot in monitor cmd: snapshot_delete_blkdev_internal.
> It works as expected: the snapshot is visible with command `rbd snap ls
> pool_name/image_name`.
>
> 3. Do pseudo migration with monitor cmd: migrate -d
> exec:cat>/tmp/vm.out
>
> 4. Do block snapshot again with snapshot_delete_blkdev_internal, then
> I get:
>  Error: Block format 'raw' used by device 'ide0-hd0' does not
> support internal snapshots
>
> I was hoping to do the second block snapshot successfully, and it
> feels abnormal that the RBD block snapshot function is disrupted after
> migration.
>
> BTW, I get the same block snapshot error when I start QEMU with:
>   "-drive format=raw,file=rbd:pool_name/image_name"
>
> My question is: how could I proceed with RBD block snapshot after the
> pseudo migration?
> 
> 

I bisected this issue to d3007d348a ("block: Fix crash when loading
snapshot on inactive node").

> diff --git a/block/snapshot.c b/block/snapshot.c
> index ec8cf4810b..c4d40e80dd 100644
> --- a/block/snapshot.c
> +++ b/block/snapshot.c
> @@ -196,8 +196,10 @@ bdrv_snapshot_fallback(BlockDriverState *bs)
>  int bdrv_can_snapshot(BlockDriverState *bs)
>  {
>  BlockDriver *drv = bs->drv;
> +
>  GLOBAL_STATE_CODE();
> -if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) {
> +
> +if (!drv || !bdrv_is_inserted(bs) || !bdrv_is_writable(bs)) {
>  return 0;
>  }
>  

So I guess the issue is that the blockdev is not writable in the
"postmigrate" state?

Best Regards,
Fiona




[PATCH v2 1/2] copy-before-write: allow specifying minimum cluster size

2024-05-28 Thread Fiona Ebner
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

The type 'size' (corresponding to uint64_t in C) is used in QAPI to
rule out negative inputs and for consistency with already existing
@cluster-size parameters. Since block_copy_calculate_cluster_size()
uses int64_t for its result, a check that the input is not too large
is added in block_copy_state_new() before calling it. The calculation
in block_copy_calculate_cluster_size() is done in the target int64_t
type.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---

Changes in v2:
* Use 'size' type in QAPI.
* Remove option in cbw_parse_options(), i.e. before parsing generic
  blockdev options.

 block/block-copy.c | 22 ++
 block/copy-before-write.c  | 10 +-
 include/block/block-copy.h |  1 +
 qapi/block-core.json   |  8 +++-
 4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 7e3b378528..36eaecaaf4 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range,
 }
 
 static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
+ int64_t min_cluster_size,
  Error **errp)
 {
 int ret;
@@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 "used. If the actual block size of the target exceeds "
 "this default, the backup may be unusable",
 BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
-return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
 } else if (ret < 0 && !target_does_cow) {
 error_setg_errno(errp, -ret,
 "Couldn't determine the cluster size of the target image, "
@@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 return ret;
 } else if (ret < 0 && target_does_cow) {
 /* Not fatal; just trudge on ahead. */
-return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
 }
 
-return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
+return MAX(min_cluster_size,
+   (int64_t)MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size));
 }
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
  BlockDriverState *copy_bitmap_bs,
  const BdrvDirtyBitmap *bitmap,
  bool discard_source,
+ uint64_t min_cluster_size,
  Error **errp)
 {
 ERRP_GUARD();
@@ -365,7 +368,18 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, 
BdrvChild *target,
 
 GLOBAL_STATE_CODE();
 
-cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
+if (min_cluster_size > INT64_MAX) {
+error_setg(errp, "min-cluster-size too large: %lu > %ld",
+   min_cluster_size, INT64_MAX);
+return NULL;
+} else if (min_cluster_size && !is_power_of_2(min_cluster_size)) {
+error_setg(errp, "min-cluster-size needs to be a power of 2");
+return NULL;
+}
+
+cluster_size = block_copy_calculate_cluster_size(target->bs,
+ (int64_t)min_cluster_size,
+ errp);
 if (cluster_size < 0) {
 return NULL;
 }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index cd65524e26..ef0bc4dcfe 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -417,6 +417,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp)
 qdict_extract_subqdict(options, NULL, "bitmap");
 qdict_del(options, "on-cbw-error");
 qdict_del(options, "cbw-timeout");
+qdict_del(options, "min-cluster-size");
 
 out:
 visit_free(v);
@@ -432,6 +433,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 BDRVCopyBeforeWriteState *s = bs->opaque;
 BdrvD

[PATCH v2 2/2] backup: add minimum cluster size to performance options

2024-05-28 Thread Fiona Ebner
In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---

Changes in v2:
* Use 'size' type in QAPI.

 block/backup.c| 2 +-
 block/copy-before-write.c | 8 
 block/copy-before-write.h | 1 +
 blockdev.c| 3 +++
 qapi/block-core.json  | 9 +++--
 5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3dd2e229d2..a1292c01ec 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
 }
 
 cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source,
-  &bcs, errp);
+  perf->min_cluster_size, &bcs, errp);
 if (!cbw) {
 goto error;
 }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index ef0bc4dcfe..183eed42e5 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -553,6 +553,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
   BlockDriverState *target,
   const char *filter_node_name,
   bool discard_source,
+  uint64_t min_cluster_size,
   BlockCopyState **bcs,
   Error **errp)
 {
@@ -572,6 +573,13 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
 qdict_put_str(opts, "file", bdrv_get_node_name(source));
 qdict_put_str(opts, "target", bdrv_get_node_name(target));
 
+if (min_cluster_size > INT64_MAX) {
+error_setg(errp, "min-cluster-size too large: %lu > %ld",
+   min_cluster_size, INT64_MAX);
+return NULL;
+}
+qdict_put_int(opts, "min-cluster-size", (int64_t)min_cluster_size);
+
 top = bdrv_insert_node(source, opts, flags, errp);
 if (!top) {
 return NULL;
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 01af0cd3c4..2a5d4ba693 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
   BlockDriverState *target,
   const char *filter_node_name,
   bool discard_source,
+  uint64_t min_cluster_size,
   BlockCopyState **bcs,
   Error **errp);
 void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/blockdev.c b/blockdev.c
index 835064ed03..6740663fda 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2655,6 +2655,9 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 if (backup->x_perf->has_max_chunk) {
 perf.max_chunk = backup->x_perf->max_chunk;
 }
+if (backup->x_perf->has_min_cluster_size) {
+perf.min_cluster_size = backup->x_perf->min_cluster_size;
+}
 }
 
 if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 8fc0a4b234..f1219a9dfb 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1551,11 +1551,16 @@
 # it should not be less than job cluster size which is calculated
 # as maximum of target image cluster size and 64k.  Default 0.
 #
+# @min-cluster-size: Minimum size of blocks used by copy-before-write
+# and background copy operations.  Has to be a power of 2.  No
+# effect if smaller than the maximum of the target's cluster size
+# and 64 KiB.  Default 0.  (Since 9.1)
+#
 # Since: 6.0
 ##
 { 'struct': 'BackupPerf',
-  'data': { '*use-copy-range': 'bool',
-'*max-workers': 'int', '*max-chunk': 'int64' } }
+  'data': { '*use-copy-range': 'bool', '*max-workers': 'int',
+'*max-chunk': 'int64', '*min-cluster-size': 'size' } }
 
 ##
 # @BackupCommon:
-- 
2.39.2





[PATCH v2 0/2] backup: allow specifying minimum cluster size

2024-05-28 Thread Fiona Ebner
Based-on: https://lore.kernel.org/qemu-devel/20240429115157.2260885-1-vsement...@yandex-team.ru/

Discussion for v1:
https://lore.kernel.org/qemu-devel/20240308155158.830258-1-f.eb...@proxmox.com/

Changes in v2:
* Use 'size' type in QAPI.
* Remove option in cbw_parse_options(), i.e. before parsing generic
  blockdev options.
* Reword commit messages hoping to describe the issue in a more
  straightforward way.

In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.
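
Numerically, the problem can be sketched as follows (hypothetical
helper names; the real alignment happens inside
cbw_co_pdiscard_snapshot()): a discard request smaller than the
fleecing image's granularity shrinks to nothing once aligned inward.

```python
def align_down(n, alignment):
    return n - (n % alignment)

def align_up(n, alignment):
    return align_down(n + alignment - 1, alignment)

granularity = 128 * 1024                 # fleecing image granularity
offset, length = 64 * 1024, 64 * 1024    # discard for one 64 KiB copy cluster

# Shrinking the request inward to the larger granularity leaves nothing:
start = align_up(offset, granularity)            # 131072
end = align_down(offset + length, granularity)   # 131072
print(max(0, end - start))  # -> 0: the discard is effectively ignored
```

With min-cluster-size raised to the fleecing granularity, block-copy
requests are at least one full granule, so the aligned region is no
longer empty.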

Fiona Ebner (2):
  copy-before-write: allow specifying minimum cluster size
  backup: add minimum cluster size to performance options

 block/backup.c |  2 +-
 block/block-copy.c | 22 ++
 block/copy-before-write.c  | 18 +-
 block/copy-before-write.h  |  1 +
 blockdev.c |  3 +++
 include/block/block-copy.h |  1 +
 qapi/block-core.json   | 17 ++---
 7 files changed, 55 insertions(+), 9 deletions(-)

-- 
2.39.2





[PATCH v4 0/5] mirror: allow specifying working bitmap

2024-05-21 Thread Fiona Ebner
Changes from v3 (discussion here [3]):
* Improve/fix QAPI documentation.

Changes from v2 (discussion here [2]):
* Cluster size caveats only apply to non-COW diff image, adapt the
  cluster size check and documentation accordingly.
* In the IO test, use backing files (rather than stand-alone diff
  images) in combination with copy-mode=write-blocking and larger
  cluster size for target images, to have a more realistic use-case
  and show that COW prevents ending up with cluster with partial data
  upon unaligned writes.
* Create a separate patch for replacing is_none_mode with sync_mode in
  MirrorBlockJob struct.
* Disallow using read-only bitmap (cannot be used as working bitmap).
* Require that bitmap is enabled at the start of the job.
* Avoid IO test script potentially waiting on non-existent job when
  blockdev-mirror QMP command fails.
* Fix pylint issues in IO test.
* Rename IO test from sync-bitmap-mirror to mirror-bitmap.

Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop bitmap-mode parameter and use passed-in bitmap as the working
  bitmap instead. Users can get the desired behaviors by
  using the block-dirty-bitmap-clear and block-dirty-bitmap-merge
  calls (see commit message in patch 2/4 for how exactly).
* Add patch to check whether target image's cluster size is at most
  mirror job's granularity. Optional, it's an extra safety check
  that's useful when the target is a "diff" image that does not have
  previously synced data.

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We are using the last one in production without any issues for about
4 years now. In particular, like mentioned in [1]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)


Now, the IO test added in patch 4/5 actually contains yet another use
case, namely doing incremental mirrors to qcow2 "diff" images, which
only contain the delta and can be rebased later. I had to adapt the IO
test, because its output expected the mirror bitmap to still be dirty,
but nowadays the mirror is apparently already done when the bitmaps
are queried. So I thought, I'll just use 'write-blocking' mode to
avoid any potential timing issues.

Initially, the qcow2 diff image targets were stand-alone and that
suffers from an issue when 'write-blocking' mode is used. If a write
is not aligned to the granularity of the mirror target, then rebasing
the diff image onto a backing image will not yield the desired result,
because the full cluster is considered to be allocated and will "hide"
some part of the base/backing image. The failure can be seen by either
using 'write-blocking' mode in the IO test or setting the (bitmap)
granularity to 32 KiB rather than the current 64 KiB.
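To make the failure mode concrete, here is a toy byte-level model (with hypothetical helper names, not real qcow2 code) of how an allocated-but-partially-filled target cluster masks the backing image after a rebase:

```python
# Toy model of the "partial cluster hides backing data" pitfall.
# Only models cluster allocation, nothing else of the qcow2 format.

CLUSTER = 8   # toy cluster size of the diff image
SIZE = 16

def write_to_diff(diff, offset, data):
    """An unaligned write allocates whole clusters, zero-filling the rest."""
    for i, b in enumerate(data):
        pos = offset + i
        cl = pos // CLUSTER
        if cl not in diff:
            diff[cl] = bytearray(CLUSTER)  # allocated, but mostly empty
        diff[cl][pos % CLUSTER] = b

def rebase(base, diff):
    """Allocated diff clusters completely mask the base image."""
    out = bytearray(base)
    for cl, data in diff.items():
        out[cl * CLUSTER:(cl + 1) * CLUSTER] = data
    return bytes(out)

base = b'B' * SIZE
diff = {}
write_to_diff(diff, 2, b'XX')  # 2-byte write, much smaller than CLUSTER
result = rebase(base, diff)
# result is b'\x00\x00XX\x00\x00\x00\x00BBBBBBBB': six 'B' bytes of the
# base image were replaced by zeros from the partially filled cluster.
```

With a backing file on the diff image, COW would instead fill the rest of the cluster from the backing image at write time, which is exactly what the test's more realistic setup exercises.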

The test thus uses a more realistic approach where the qcow2 targets
have backing images and a check is added in patch 5/5 for the cluster
size for non-COW targets. However, with e.g. NBD, the cluster size
cannot be queried and prohibiting all bitmap mirrors to NBD targets
just to prevent the highly specific edge case seems not worth it, so
the limitation is rather documented and the check ignores cases where
querying the target image's cluster size returns -ENOTSUP.


[0]: 
https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
[1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/
[2]: 
https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/
[3]: 
https://lore.kernel.org/qemu-devel/20240510131647.1256467-1-f.eb...@proxmox.com/


Fabian Grünbichler (1):
  iotests: add test for bitmap mirror

Fiona Ebner (3):
  qapi/block-core: avoid the re-use of MirrorSyncMode for backup
  block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob
struct
  blockdev: mirror: check for target's cluster size when using bitmap

John Snow (1):
  mirror: allow specifying working bitmap

 block/backup.c                             |   18 +-
 block/mirror.c                             |  101 +-
 block/monitor/block-hmp-cmds.c             |    2 +-
 block/replication.c                        |    2 +-
 blockdev.c                                 |  127 +-
 include/block/block_int-global-state.h     |    7 +-
 qapi/block-core.json                       |   62 +-
 tests/qemu-iotests/tests/mirror-bitmap     |  603
 tests/qemu-iotests/tests/mirror-bitmap.out | 3198
 tests/unit/test-block-iothread.c           |    2 +-
 10 files changed, 4053 insertions(+)

[PATCH v4 3/5] mirror: allow specifying working bitmap

2024-05-21 Thread Fiona Ebner
From: John Snow 

Allow specifying a working bitmap for the mirror job. The bitmap's
granularity is used as the job's granularity.

The new @bitmap parameter is marked unstable in the QAPI and can
currently only be used for @sync=full mode.

Clusters initially dirty in the bitmap as well as new writes are
copied to the target.

Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API,
callers can simulate the three kinds of @BitmapSyncMode (which is used
by backup):
1. always: default, just pass bitmap as working bitmap.
2. never: copy bitmap and pass copy to the mirror job.
3. on-success: copy bitmap and pass copy to the mirror job and if
   successful, merge bitmap into original afterwards.
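As a sketch, the three recipes above translate into QMP command sequences roughly like the following (node, target, and bitmap names are placeholders; "work" is a hypothetical temporary bitmap; error handling and job-completion waiting are omitted):

```python
# Build the QMP command sequences that emulate the three BitmapSyncMode
# behaviors using the new @bitmap parameter, per the recipes above.

def qmp_bitmap_mode(mode, node="drive0", bitmap="bitmap0"):
    """Return the list of QMP commands to issue for the given mode."""
    mirror = {"execute": "blockdev-mirror",
              "arguments": {"job-id": "mirror0", "device": node,
                            "target": "target0", "sync": "full",
                            "bitmap": bitmap}}
    if mode == "always":
        return [mirror]  # pass the bitmap directly as working bitmap
    # "never" and "on-success" work on a copy of the bitmap:
    copy = [{"execute": "block-dirty-bitmap-add",
             "arguments": {"node": node, "name": "work"}},
            {"execute": "block-dirty-bitmap-merge",
             "arguments": {"node": node, "target": "work",
                           "bitmaps": [bitmap]}}]
    mirror["arguments"]["bitmap"] = "work"
    if mode == "never":
        return copy + [mirror]
    if mode == "on-success":
        # after successful completion, merge back into the original
        merge_back = {"execute": "block-dirty-bitmap-merge",
                      "arguments": {"node": node, "target": bitmap,
                                    "bitmaps": ["work"]}}
        return copy + [mirror, merge_back]
    raise ValueError(mode)
```

The "on-success" sequence would only issue the final merge after observing a successful job completion event, which this sketch does not model.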

When the target image is a non-COW "diff image", i.e. one that was not
used as the target of a previous mirror and the target image's cluster
size is larger than the bitmap's granularity, or when
@copy-mode=write-blocking is used, there is a pitfall, because the
cluster in the target image will be allocated, but not contain all the
data corresponding to the same region in the source image.

An idea to avoid the limitation would be to mark clusters which are
affected by unaligned writes and are not allocated in the target image
dirty, so they would be copied fully later. However, for migration,
the invariant that an actively synced mirror stays actively synced
(unless an error happens) is useful, because without that invariant,
migration might inactivate block devices while the mirror job still
has work to do and run into an assertion failure [0].

Another approach would be to read the missing data from the source
upon unaligned writes to be able to write the full target cluster
instead.
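Such a read-modify-write scheme would look roughly like this (stand-alone sketch on byte buffers, not QEMU code):

```python
# Sketch of read-modify-write for unaligned writes: widen the write to
# full target clusters, filling the edges with data read from the source.

def align_down(x, a):
    return x - x % a

def align_up(x, a):
    return x + (-x) % a

def rmw_write(source, target, offset, data, cluster):
    """Write data at offset, padding to cluster boundaries from source."""
    start = align_down(offset, cluster)
    end = min(align_up(offset + len(data), cluster), len(source))
    buf = bytearray(source[start:end])              # read missing data
    buf[offset - start:offset - start + len(data)] = data
    target[start:end] = buf                         # full-cluster write

source = bytearray(b'S' * 16)
target = bytearray(16)
rmw_write(source, target, offset=2, data=b'XX', cluster=8)
# target[:8] is now b'SSXXSSSS': the cluster edges were filled from the
# source, so no partially empty cluster is created in the target.
```

As the next paragraph notes, this is not generally applicable, since targets like NBD do not expose their cluster size.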

But certain targets like NBD do not allow querying the cluster size.
To avoid limiting/breaking the use case of syncing to an existing
target, which is arguably more common than the diff image use case,
document the limitation in QAPI.

This patch was originally based on one by Ma Haocong, but it has since
been modified pretty heavily, first by John and then again by Fiona.

[0]: 
https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/

Suggested-by: Ma Haocong 
Signed-off-by: Ma Haocong 
Signed-off-by: John Snow 
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.1
 get rid of bitmap mode parameter
 use caller-provided bitmap as working bitmap
 turn bitmap parameter experimental]
Signed-off-by: Fiona Ebner 
Acked-by: Markus Armbruster 
---
 block/mirror.c | 80 +-
 blockdev.c | 44 +++---
 include/block/block_int-global-state.h |  5 +-
 qapi/block-core.json   | 35 ++-
 tests/unit/test-block-iothread.c   |  2 +-
 5 files changed, 141 insertions(+), 25 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ca23d6ef65..d3d0698116 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,11 @@ typedef struct MirrorBlockJob {
 size_t buf_size;
 int64_t bdev_length;
 unsigned long *cow_bitmap;
+/*
+ * Whether the bitmap is created locally or provided by the caller (for
+ * incremental sync).
+ */
+bool dirty_bitmap_is_local;
 BdrvDirtyBitmap *dirty_bitmap;
 BdrvDirtyBitmapIter *dbi;
 uint8_t *buf;
@@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job)
 bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs);
 }
 
-bdrv_release_dirty_bitmap(s->dirty_bitmap);
+if (s->dirty_bitmap_is_local) {
+bdrv_release_dirty_bitmap(s->dirty_bitmap);
+} else {
+bdrv_enable_dirty_bitmap(s->dirty_bitmap);
+}
 
 /* Make sure that the source BDS doesn't go away during bdrv_replace_node,
  * before we can call bdrv_drained_end */
@@ -820,6 +829,16 @@ static void mirror_abort(Job *job)
 assert(ret == 0);
 }
 
+/* Always called after commit/abort. */
+static void mirror_clean(Job *job)
+{
+MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
+
+if (!s->dirty_bitmap_is_local && s->dirty_bitmap) {
+bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false);
+}
+}
+
 static void coroutine_fn mirror_throttle(MirrorBlockJob *s)
 {
 int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_free_init(s);
 
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-if (s->sync_mode != MIRROR_SYNC_MODE_NONE) {
+if (s->sync_mode != MIRROR_SYNC_MODE_NONE && s->dirty_bitmap_is_local) {
 ret = mirror_dirty_init(s);
 if (ret < 0 || job_is_cancelled(&s->common.job)) {
 goto immediate_exit;
@@ -1029,6 +1048,14 @@ static int coroutine_fn mirror_run(Job *job, Error 
**errp)
  */
 mirror_top_op

[PATCH v4 5/5] blockdev: mirror: check for target's cluster size when using bitmap

2024-05-21 Thread Fiona Ebner
When using mirror with a bitmap and the target does not do COW and is
a diff image, i.e. one that should only contain the delta and was
not synced to previously, a too large cluster size for the target can
be problematic. In particular, when the mirror sends data to the
target aligned to the jobs granularity, but not aligned to the larger
target image's cluster size, the target's cluster would be allocated
but only be filled partially. When rebasing such a diff image later,
the corresponding cluster of the base image would get "masked" and the
part of the cluster not in the diff image is not accessible anymore.

Unfortunately, it is not always possible to check for the target
image's cluster size, e.g. when it's NBD. Because the limitation is
already documented in the QAPI description for the @bitmap parameter
and it's only required for the special diff image use case, simply skip
the check then.
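The check's decision logic can be summarized in a stand-alone sketch (mirroring blockdev_mirror_check_bitmap_granularity() in the diff below; errno conventions as in the C code):

```python
import errno

def check_bitmap_granularity(granularity, cluster_size, has_backing,
                             info_ret=0):
    """Return 0 if the combination is safe, a negative errno otherwise."""
    if has_backing:
        # COW target: partially written clusters are completed from the
        # backing file, so nothing gets masked.
        return 0
    if info_ret == -errno.ENOTSUP:
        # e.g. NBD: cluster size cannot be queried; the limitation is
        # documented, so skip the check.
        return 0
    if info_ret < 0:
        return info_ret  # propagate other bdrv_get_info() failures
    if granularity < cluster_size or granularity % cluster_size != 0:
        return -errno.EINVAL
    return 0
```

For example, a 64 KiB bitmap granularity with a 128 KiB non-COW target cluster size fails the check, while 128 KiB granularity on a 64 KiB cluster size passes.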

Signed-off-by: Fiona Ebner 
---
 blockdev.c | 57 ++
 tests/qemu-iotests/tests/mirror-bitmap |  6 +++
 tests/qemu-iotests/tests/mirror-bitmap.out |  7 +++
 3 files changed, 70 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 4f72a72dc7..468974108e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error 
**errp)
 blockdev_do_action(&action, errp);
 }
 
+static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target,
+BdrvDirtyBitmap *bitmap,
+Error **errp)
+{
+int ret;
+BlockDriverInfo bdi;
+uint32_t bitmap_granularity;
+
+GLOBAL_STATE_CODE();
+GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+if (bdrv_backing_chain_next(target)) {
+/*
+ * No need to worry about creating clusters with partial data when the
+ * target does COW.
+ */
+return 0;
+}
+
+/*
+ * If there is no backing file on the target, we cannot rely on COW if our
+ * backup cluster size is smaller than the target cluster size. Even for
+ * targets with a backing file, try to avoid COW if possible.
+ */
+ret = bdrv_get_info(target, &bdi);
+if (ret == -ENOTSUP) {
+/*
+ * Ignore if unable to get the info, e.g. when target is NBD. It's only
+ * relevant for syncing to a diff image and the documentation already
+ * states that the target's cluster size needs to be small enough then.
+ */
+return 0;
+} else if (ret < 0) {
+error_setg_errno(errp, -ret,
+"Couldn't determine the cluster size of the target image, "
+"which has no backing file");
+return ret;
+}
+
+bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap);
+if (bitmap_granularity < bdi.cluster_size ||
+bitmap_granularity % bdi.cluster_size != 0) {
+error_setg(errp, "Bitmap granularity %u is not a multiple of the "
+   "target image's cluster size %u and the target image has "
+   "no backing file",
+   bitmap_granularity, bdi.cluster_size);
+return -EINVAL;
+}
+
+return 0;
+}
+
+
 /* Parameter check and block job starting for drive mirroring.
  * Caller should hold @device and @target's aio context (must be the same).
  **/
@@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 return;
 }
 
+if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) {
+return;
+}
+
 if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) {
 return;
 }
diff --git a/tests/qemu-iotests/tests/mirror-bitmap 
b/tests/qemu-iotests/tests/mirror-bitmap
index 37bbe0f241..e8cd482a19 100755
--- a/tests/qemu-iotests/tests/mirror-bitmap
+++ b/tests/qemu-iotests/tests/mirror-bitmap
@@ -584,6 +584,12 @@ def test_mirror_api():
 bitmap=bitmap)
 log('')
 
+log("-- Test bitmap with too small granularity to non-COW target --\n")
+vm.qmp_log("block-dirty-bitmap-add", node=drive0.node,
+   name="bitmap-small", granularity=GRANULARITY)
+blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full",
+job_id='api_job', bitmap="bitmap-small")
+log('')
 
 def main():
 for bsync_mode in ("never", "on-success", "always"):
diff --git a/tests/qemu-iotests/tests/mirror-bitmap.out 
b/tests/qemu-iotests/tests/mirror-bitmap.out
index 5c8acc1d69..af605f3803 100644
--- a/tests/qemu-iotests/tests/mirror-bitmap.out
+++ b/tests/qemu-iotests/tests/mir

[PATCH v4 4/5] iotests: add test for bitmap mirror

2024-05-21 Thread Fiona Ebner
From: Fabian Grünbichler 

heavily based on/practically forked off iotest 257 for bitmap backups,
but:

- no writes to filter node 'mirror-top' between completion and
finalization, as those seem to deadlock?
- extra set of reference/test mirrors to verify that writes in parallel
with active mirror work

Intentionally keeping copyright and ownership of original test case to
honor provenance.

The test was originally adapted by Fabian from 257, but has seen
rather big changes, because the interface for mirror with bitmap was
changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as
the working bitmap, and the test was changed to use backing images and
@copy-mode=write-blocking.

Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.1
 adapt to changes to mirror bitmap interface
 rename test from '384' to 'mirror-bitmap'
 use backing files, copy-mode=write-blocking, larger cluster size]
Signed-off-by: Fiona Ebner 
---
 tests/qemu-iotests/tests/mirror-bitmap |  597 
 tests/qemu-iotests/tests/mirror-bitmap.out | 3191 
 2 files changed, 3788 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

diff --git a/tests/qemu-iotests/tests/mirror-bitmap 
b/tests/qemu-iotests/tests/mirror-bitmap
new file mode 100755
index 00..37bbe0f241
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-bitmap
@@ -0,0 +1,597 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=js...@redhat.com
+
+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+IMAGE_CLUSTER_SIZE = 128 * 1024
+
+
+class Pattern:
+def __init__(self, byte, offset, size=GRANULARITY):
+self.byte = byte
+self.offset = offset
+self.size = size
+
+def bits(self, granularity):
+lower = self.offset // granularity
+upper = (self.offset + self.size - 1) // granularity
+return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+"""Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+def __init__(self, patterns):
+self.patterns = patterns
+
+def bits(self, granularity):
+"""Calculate the unique bits dirtied by this pattern grouping"""
+res = set()
+for pattern in self.patterns:
+res |= pattern.bits(granularity)
+return res
+
+
+GROUPS = [
+PatternGroup([
+# Batch 0: 4 clusters
+Pattern('0x49', 0x000),
+Pattern('0x6c', 0x010),   # 1M
+Pattern('0x6f', 0x200),   # 32M
+Pattern('0x76', 0x3ff)]), # 64M - 64K
+PatternGroup([
+# Batch 1: 6 clusters (3 new)
+Pattern('0x65', 0x000),   # Full overwrite
+Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K)
+PatternGroup([
+# Batch 2: 7 clusters (3 new)
+Pattern('0x74', 0x001),   # Adjacent-right
+Pattern('0x69', 0x00e8000),   # Partial-left  (1M-96K)
+Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+Pattern('0x67', 0x3fe,
+2*GRANULARITY)]), # Overwrite [(64M-128K)-64M)
+PatternGroup([
+# Batch 3: 8 clusters (5 new)
+# Carefully chosen such that nothing re-dirties the one cluster
+# that copies out successfully before failure in Group #1.
+Pattern('0xaa', 0x001,
+3*GRANULARITY),   # Overwrite and 2x Adjacent-right
+Pattern('0xbb', 0x00d8000),   # Partial-left (1M-160K)
+Pattern('0xcc', 0x2028000),   # Partial-right (32M+160K)
+Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right
+]
+
+
+class EmulatedBitmap:
+def __init__(self, granularity=GRANULA

[PATCH v4 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct

2024-05-21 Thread Fiona Ebner
It is more flexible and is done in preparation to support specifying a
working bitmap for mirror jobs. In particular, this makes it possible
to assert that @sync_mode=full when a bitmap is used. That assertion
is just to be sure; of course, the mirror QMP commands will be made to
fail earlier with a clean error.

Signed-off-by: Fiona Ebner 
---
 block/mirror.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c0597039a5..ca23d6ef65 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -51,7 +51,7 @@ typedef struct MirrorBlockJob {
 BlockDriverState *to_replace;
 /* Used to block operations on the drive-mirror-replace target */
 Error *replace_blocker;
-bool is_none_mode;
+MirrorSyncMode sync_mode;
 BlockMirrorBackingMode backing_mode;
 /* Whether the target image requires explicit zero-initialization */
 bool zero_target;
@@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job)
  &error_abort);
 
 if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
-BlockDriverState *backing = s->is_none_mode ? src : s->base;
+BlockDriverState *backing;
+backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base;
 BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs);
 
 if (bdrv_cow_bs(unfiltered_target) != backing) {
@@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_free_init(s);
 
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-if (!s->is_none_mode) {
+if (s->sync_mode != MIRROR_SYNC_MODE_NONE) {
 ret = mirror_dirty_init(s);
 if (ret < 0 || job_is_cancelled(&s->common.job)) {
 goto immediate_exit;
@@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job(
  BlockCompletionFunc *cb,
  void *opaque,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base,
+ MirrorSyncMode sync_mode,
+ BlockDriverState *base,
  bool auto_complete, const char *filter_node_name,
  bool is_mirror, MirrorCopyMode copy_mode,
  Error **errp)
@@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job(
 s->replaces = g_strdup(replaces);
 s->on_source_error = on_source_error;
 s->on_target_error = on_target_error;
-s->is_none_mode = is_none_mode;
+s->sync_mode = sync_mode;
 s->backing_mode = backing_mode;
 s->zero_target = zero_target;
 qatomic_set(&s->copy_mode, copy_mode);
@@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
   bool unmap, const char *filter_node_name,
   MirrorCopyMode copy_mode, Error **errp)
 {
-bool is_none_mode;
 BlockDriverState *base;
 
 GLOBAL_STATE_CODE();
 
 bdrv_graph_rdlock_main_loop();
-is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
 bdrv_graph_rdunlock_main_loop();
 
 mirror_start_job(job_id, bs, creation_flags, target, replaces,
  speed, granularity, buf_size, backing_mode, zero_target,
  on_source_error, on_target_error, unmap, NULL, NULL,
- &mirror_job_driver, is_none_mode, base, false,
+ &mirror_job_driver, mode, base, false,
  filter_node_name, true, copy_mode, errp);
 }
 
@@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, 
BlockDriverState *bs,
  job_id, bs, creation_flags, base, NULL, speed, 0, 0,
  MIRROR_LEAVE_BACKING_CHAIN, false,
  on_error, on_error, true, cb, opaque,
- &commit_active_job_driver, false, base, auto_complete,
- filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND,
- errp);
+ &commit_active_job_driver, MIRROR_SYNC_MODE_FULL, base,
+ auto_complete, filter_node_name, false,
+ MIRROR_COPY_MODE_BACKGROUND, errp);
 if (!job) {
 goto error_restore_flags;
 }
-- 
2.39.2





[PATCH v4 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup

2024-05-21 Thread Fiona Ebner
Backup supports all modes listed in MirrorSyncMode, while mirror does
not. Introduce BackupSyncMode by copying the current MirrorSyncMode
and drop the variants mirror does not support from MirrorSyncMode as
well as the corresponding manual check in mirror_start().

A consequence is also tighter introspection: query-qmp-schema no
longer reports drive-mirror and blockdev-mirror accepting @sync values
they actually reject.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
Acked-by: Markus Armbruster 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/backup.c | 18 -
 block/mirror.c |  7 ---
 block/monitor/block-hmp-cmds.c |  2 +-
 block/replication.c|  2 +-
 blockdev.c | 26 -
 include/block/block_int-global-state.h |  2 +-
 qapi/block-core.json   | 27 +-
 7 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ec29d6b810..1cc4e055c6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -37,7 +37,7 @@ typedef struct BackupBlockJob {
 
 BdrvDirtyBitmap *sync_bitmap;
 
-MirrorSyncMode sync_mode;
+BackupSyncMode sync_mode;
 BitmapSyncMode bitmap_mode;
 BlockdevOnError on_source_error;
 BlockdevOnError on_target_error;
@@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 
 assert(block_job_driver(job) == &backup_job_driver);
 
-if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) {
 error_setg(errp, "The backup job only supports block checkpoint in"
" sync=none mode");
 return;
@@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job)
 uint64_t estimate;
 BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs);
 
-if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) {
+if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) {
 bdrv_clear_dirty_bitmap(bcs_bitmap, NULL);
 bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL,
  true);
-} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
+} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) {
 /*
  * We can't hog the coroutine to initialize this thoroughly.
  * Set a flag and resume work when we are able to yield safely.
@@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 
 backup_init_bcs_bitmap(s);
 
-if (s->sync_mode == MIRROR_SYNC_MODE_TOP) {
+if (s->sync_mode == BACKUP_SYNC_MODE_TOP) {
 int64_t offset = 0;
 int64_t count;
 
@@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 block_copy_set_skip_unallocated(s->bcs, false);
 }
 
-if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
+if (s->sync_mode == BACKUP_SYNC_MODE_NONE) {
 /*
  * All bits are set in bcs bitmap to allow any cluster to be copied.
  * This does not actually require them to be copied.
@@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = {
 
 BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
   BlockDriverState *target, int64_t speed,
-  MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
+  BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
   BitmapSyncMode bitmap_mode,
   bool compress,
   const char *filter_node_name,
@@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 GLOBAL_STATE_CODE();
 
 /* QMP interface protects us from these cases */
-assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
-assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP);
+assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL);
+assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP);
 
 if (bs == target) {
 error_setg(errp, "Source and target cannot be the same");
diff --git a/block/mirror.c b/block/mirror.c
index 1bdce3b657..c0597039a5 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
 
 GLOBAL_STATE_CODE();
 
-if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
-(mode == MIRROR_SYNC_MODE_BITMAP)) {
-error_setg(errp, "Sync mode '%s' not supported",
-   MirrorSyncMode_str(mode));
-return;
-}
-
 bdrv_graph_rdlock_main_loop();
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c

Re: [PATCH v4] hw/pflash: fix block write start

2024-05-16 Thread Fiona Ebner
Am 16.05.24 um 10:46 schrieb Gerd Hoffmann:
> Move the pflash_blk_write_start() call.  We need the offset of the
> first data write, not the offset for the setup (number-of-bytes)
> write.  Without this fix u-boot can do block writes to the first
> flash block only.
> 
> While being at it drop a leftover FIXME.
> 
> Cc: qemu-sta...@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2343
> Fixes: fcc79f2e0955 ("hw/pflash: implement update buffer for block writes")

Just a minor thing I noticed: this is the commit in v8.1.5.
The commit in v9.0.0 is 284a7ee2e2 ("hw/pflash: implement update buffer
for block writes").

Best Regards,
Fiona




Re: [PATCH 1/2] copy-before-write: allow specifying minimum cluster size

2024-05-13 Thread Fiona Ebner
Am 26.03.24 um 10:06 schrieb Markus Armbruster:
>> @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, 
>> BdrvChild *target,
>>  
>>  GLOBAL_STATE_CODE();
>>  
>> -cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
>> +if (min_cluster_size && !is_power_of_2(min_cluster_size)) {
> 
> min_cluster_size is int64_t, is_power_of_2() takes uint64_t.  Bad if
> min_cluster_size is negative.  Could this happen?
> 

No, because it comes in as a uint32_t via the QAPI (the internal caller
added by patch 2/2 from the backup code also gets the value via QAPI and
there uint32_t is used too).
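For illustration, the pitfall Markus is guarding against can be modeled in isolation: C's implicit int64_t-to-uint64_t conversion is modular, so a negative value could slip past a power-of-two check (Python model, not the actual QEMU helpers):

```python
# Model of passing a negative int64_t to a function taking uint64_t:
# C performs modular conversion, so a negative value can masquerade as
# a (huge) power of two and wrongly pass an is_power_of_2() check.

def to_uint64(x):
    return x % (1 << 64)  # C's implicit signed-to-unsigned conversion

def is_power_of_2(u):
    return u != 0 and (u & (u - 1)) == 0

neg = -(1 << 63)                       # INT64_MIN
converted = to_uint64(neg)             # becomes 2**63
# is_power_of_2(converted) is True, even though the input was negative.
```

This is why the value being constrained to uint32_t at the QAPI level (as explained above) makes the concern moot.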

---snip---

>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 0a72c590a8..85c8f88f6e 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -4625,12 +4625,18 @@
>>  # @on-cbw-error parameter will decide how this failure is handled.
>>  # Default 0. (Since 7.1)
>>  #
>> +# @min-cluster-size: Minimum size of blocks used by copy-before-write
>> +# operations.  Has to be a power of 2.  No effect if smaller than
>> +# the maximum of the target's cluster size and 64 KiB.  Default 0.
>> +# (Since 9.0)
>> +#
>>  # Since: 6.2
>>  ##
>>  { 'struct': 'BlockdevOptionsCbw',
>>'base': 'BlockdevOptionsGenericFormat',
>>'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap',
>> -'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } }
>> +'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32',
>> +'*min-cluster-size': 'uint32' } }
> 
> Elsewhere in the schema, we use either 'int' or 'size' for cluster-size.
> Why the difference?
> 

The motivation was to disallow negative values up front and have it work
with block_copy_calculate_cluster_size(), whose result is an int64_t. If
I go with 'int', I'll have to add a check to disallow negative values.
If I go with 'size', I'll have to add a check to disallow too large
values.

Which approach should I go with?

Best Regards,
Fiona




[PATCH v3 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct

2024-05-10 Thread Fiona Ebner
It is more flexible and is done in preparation to support specifying a
working bitmap for mirror jobs. In particular, this makes it possible
to assert that @sync_mode=full when a bitmap is used. That assertion
is just to be sure; of course, the mirror QMP commands will be made to
fail earlier with a clean error.

Signed-off-by: Fiona Ebner 
---

New in v3.

 block/mirror.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c0597039a5..ca23d6ef65 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -51,7 +51,7 @@ typedef struct MirrorBlockJob {
 BlockDriverState *to_replace;
 /* Used to block operations on the drive-mirror-replace target */
 Error *replace_blocker;
-bool is_none_mode;
+MirrorSyncMode sync_mode;
 BlockMirrorBackingMode backing_mode;
 /* Whether the target image requires explicit zero-initialization */
 bool zero_target;
@@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job)
  &error_abort);
 
 if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
-BlockDriverState *backing = s->is_none_mode ? src : s->base;
+BlockDriverState *backing;
+backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base;
 BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs);
 
 if (bdrv_cow_bs(unfiltered_target) != backing) {
@@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_free_init(s);
 
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-if (!s->is_none_mode) {
+if (s->sync_mode != MIRROR_SYNC_MODE_NONE) {
 ret = mirror_dirty_init(s);
 if (ret < 0 || job_is_cancelled(&s->common.job)) {
 goto immediate_exit;
@@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job(
  BlockCompletionFunc *cb,
  void *opaque,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base,
+ MirrorSyncMode sync_mode,
+ BlockDriverState *base,
  bool auto_complete, const char *filter_node_name,
  bool is_mirror, MirrorCopyMode copy_mode,
  Error **errp)
@@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job(
 s->replaces = g_strdup(replaces);
 s->on_source_error = on_source_error;
 s->on_target_error = on_target_error;
-s->is_none_mode = is_none_mode;
+s->sync_mode = sync_mode;
 s->backing_mode = backing_mode;
 s->zero_target = zero_target;
 qatomic_set(&s->copy_mode, copy_mode);
@@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
   bool unmap, const char *filter_node_name,
   MirrorCopyMode copy_mode, Error **errp)
 {
-bool is_none_mode;
 BlockDriverState *base;
 
 GLOBAL_STATE_CODE();
 
 bdrv_graph_rdlock_main_loop();
-is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
 bdrv_graph_rdunlock_main_loop();
 
 mirror_start_job(job_id, bs, creation_flags, target, replaces,
  speed, granularity, buf_size, backing_mode, zero_target,
  on_source_error, on_target_error, unmap, NULL, NULL,
- &mirror_job_driver, is_none_mode, base, false,
+ &mirror_job_driver, mode, base, false,
  filter_node_name, true, copy_mode, errp);
 }
 
@@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, 
BlockDriverState *bs,
  job_id, bs, creation_flags, base, NULL, speed, 0, 0,
  MIRROR_LEAVE_BACKING_CHAIN, false,
  on_error, on_error, true, cb, opaque,
- &commit_active_job_driver, false, base, auto_complete,
- filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND,
- errp);
+ &commit_active_job_driver, MIRROR_SYNC_MODE_FULL, base,
+ auto_complete, filter_node_name, false,
+ MIRROR_COPY_MODE_BACKGROUND, errp);
 if (!job) {
 goto error_restore_flags;
 }
-- 
2.39.2





[PATCH v3 0/5] mirror: allow specifying working bitmap

2024-05-10 Thread Fiona Ebner
Changes from v2 (discussion here [2]):
* Cluster size caveats only apply to non-COW diff image, adapt the
  cluster size check and documentation accordingly.
* In the IO test, use backing files (rather than stand-alone diff
  images) in combination with copy-mode=write-blocking and larger
  cluster size for target images, to have a more realistic use-case
  and show that COW prevents ending up with clusters with partial data
  upon unaligned writes.
* Create a separate patch for replacing is_none_mode with sync_mode in
  MirrorBlockJob struct.
* Disallow using read-only bitmap (cannot be used as working bitmap).
* Require that bitmap is enabled at the start of the job.
* Avoid IO test script potentially waiting on non-existent job when
  blockdev-mirror QMP command fails.
* Fix pylint issues in IO test.
* Rename IO test from sync-bitmap-mirror to mirror-bitmap.

Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop bitmap-mode parameter and use passed-in bitmap as the working
  bitmap instead. Users can get the desired behaviors by
  using the block-dirty-bitmap-clear and block-dirty-bitmap-merge
  calls (see commit message in patch 2/4 for how exactly).
* Add patch to check whether target image's cluster size is at most
  mirror job's granularity. This is optional; it's an extra safety
  check that's useful when the target is a "diff" image that does not
  have previously synced data.

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for
about four years now. In particular, like mentioned in [1]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)


Now, the IO test added in patch 4/5 actually contains yet another use
case, namely doing incremental mirrors to qcow2 "diff" images, which
only contain the delta and can be rebased later. I had to adapt the IO
test, because its output expected the mirror bitmap to still be dirty,
but nowadays the mirror is apparently already done when the bitmaps
are queried. So I thought, I'll just use 'write-blocking' mode to
avoid any potential timing issues.

Initially, the qcow2 diff image targets were stand-alone and that
suffers from an issue when 'write-blocking' mode is used. If a write
is not aligned to the granularity of the mirror target, then rebasing
the diff image onto a backing image will not yield the desired result,
because the full cluster is considered to be allocated and will "hide"
some part of the base/backing image. The failure can be seen by either
using 'write-blocking' mode in the IO test or setting the (bitmap)
granularity to 32 KiB rather than the current 64 KiB.
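To make the failure mode concrete, here is a minimal sketch (not QEMU code; the byte values and sizes are illustrative assumptions) of why a target cluster size larger than the mirror granularity "hides" backing data once the diff image is rebased: the unaligned write allocates a full, mostly zero-filled cluster, and an allocated cluster completely shadows the backing file.

```python
GRANULARITY = 64 * 1024
TARGET_CLUSTER = 128 * 1024

backing = bytearray(b'\x42' * (2 * TARGET_CLUSTER))  # old data in the base image
diff_clusters = {}                                   # allocated clusters of the diff image

def mirror_write(offset, data):
    """Mirror copies one granule; the diff image allocates the full cluster."""
    cluster = offset // TARGET_CLUSTER
    buf = diff_clusters.setdefault(cluster, bytearray(TARGET_CLUSTER))  # zero-filled
    start = offset % TARGET_CLUSTER
    buf[start:start + len(data)] = data

def read_after_rebase(offset, length):
    """After rebasing, an allocated cluster completely hides the backing file."""
    cluster = offset // TARGET_CLUSTER
    if cluster in diff_clusters:
        start = offset % TARGET_CLUSTER
        return bytes(diff_clusters[cluster][start:start + length])
    return bytes(backing[offset:offset + length])

# Only the first 64 KiB granule was dirty and gets mirrored:
mirror_write(0, b'\x99' * GRANULARITY)

# The second half of the allocated cluster now reads as zeroes instead of
# the backing data, while untouched clusters still show the backing image:
assert read_after_rebase(GRANULARITY, GRANULARITY) == b'\x00' * GRANULARITY
assert read_after_rebase(TARGET_CLUSTER, 16) == b'\x42' * 16
```

With a backing file on the target, COW would fill the rest of the cluster from the backing image at allocation time, which is why the test switched to backing files.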

The test thus uses a more realistic approach where the qcow2 targets
have backing images and a check is added in patch 5/5 for the cluster
size for non-COW targets. However, with e.g. NBD, the cluster size
cannot be queried and prohibiting all bitmap mirrors to NBD targets
just to prevent the highly specific edge case seems not worth it, so
the limitation is rather documented and the check ignores cases where
querying the target image's cluster size returns -ENOTSUP.


[0]: 
https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
[1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/
[2]: 
https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/


Fabian Grünbichler (1):
  iotests: add test for bitmap mirror

Fiona Ebner (3):
  qapi/block-core: avoid the re-use of MirrorSyncMode for backup
  block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob
struct
  blockdev: mirror: check for target's cluster size when using bitmap

John Snow (1):
  mirror: allow specifying working bitmap

 block/backup.c |   18 +-
 block/mirror.c |  101 +-
 block/monitor/block-hmp-cmds.c |2 +-
 block/replication.c|2 +-
 blockdev.c |  127 +-
 include/block/block_int-global-state.h |7 +-
 qapi/block-core.json   |   64 +-
 tests/qemu-iotests/tests/mirror-bitmap |  603 
 tests/qemu-iotests/tests/mirror-bitmap.out | 3198 
 tests/unit/test-block-iothread.c   |2 +-
 10 files changed, 4055 insertions(+), 69 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

-- 
2.39.2





[PATCH v3 4/5] iotests: add test for bitmap mirror

2024-05-10 Thread Fiona Ebner
From: Fabian Grünbichler 

heavily based on/practically forked off iotest 257 for bitmap backups,
but:

- no writes to filter node 'mirror-top' between completion and
finalization, as those seem to deadlock?
- extra set of reference/test mirrors to verify that writes in parallel
with active mirror work

Intentionally keeping copyright and ownership of original test case to
honor provenance.

The test was originally adapted by Fabian from 257, but has seen
rather big changes, because the interface for mirror with bitmap was
changed (no @bitmap-mode parameter anymore; the bitmap is used as the
working bitmap), and the test was changed to use backing images and
@sync-mode=write-blocking.

Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.1
 adapt to changes to mirror bitmap interface
 rename test from '384' to 'mirror-bitmap'
 use backing files, copy-mode=write-blocking, larger cluster size]
Signed-off-by: Fiona Ebner 
---

Changes in v3:
* avoid script potentially waiting on non-existent job when
  blockdev-mirror QMP command fails by asserting that there is no
  error when none is expected.
* fix pylint issues
* rename test from sync-bitmap-mirror to mirror-bitmap
* use backing files (rather than stand-alone diff images) in
  combination with copy-mode=write-blocking and larger cluster size
  for target images, to have a more realistic use-case and show that
  COW prevents ending up with clusters with partial data upon unaligned
  writes

 tests/qemu-iotests/tests/mirror-bitmap |  597 
 tests/qemu-iotests/tests/mirror-bitmap.out | 3191 
 2 files changed, 3788 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

diff --git a/tests/qemu-iotests/tests/mirror-bitmap 
b/tests/qemu-iotests/tests/mirror-bitmap
new file mode 100755
index 00..37bbe0f241
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-bitmap
@@ -0,0 +1,597 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=js...@redhat.com
+
+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+IMAGE_CLUSTER_SIZE = 128 * 1024
+
+
+class Pattern:
+def __init__(self, byte, offset, size=GRANULARITY):
+self.byte = byte
+self.offset = offset
+self.size = size
+
+def bits(self, granularity):
+lower = self.offset // granularity
+upper = (self.offset + self.size - 1) // granularity
+return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+"""Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+def __init__(self, patterns):
+self.patterns = patterns
+
+def bits(self, granularity):
+"""Calculate the unique bits dirtied by this pattern grouping"""
+res = set()
+for pattern in self.patterns:
+res |= pattern.bits(granularity)
+return res
+
+
+GROUPS = [
+PatternGroup([
+# Batch 0: 4 clusters
+Pattern('0x49', 0x000),
+Pattern('0x6c', 0x010),   # 1M
+Pattern('0x6f', 0x200),   # 32M
+Pattern('0x76', 0x3ff)]), # 64M - 64K
+PatternGroup([
+# Batch 1: 6 clusters (3 new)
+Pattern('0x65', 0x000),   # Full overwrite
+Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K)
+PatternGroup([
+# Batch 2: 7 clusters (3 new)
+Pattern('0x74', 0x001),   # Adjacent-right
+Pattern('0x69', 0x00e8000),   # Partial-left  (1M-96K)
+Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+Pattern('0x67', 0x3fe,
+2*GRANULARITY)]), # Overwrite [(64M-128K)-64M)
+PatternGroup([
+# Batch 3: 8 clusters (5 new)
+# Carefully chosen such that nothing re-d

[PATCH v3 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup

2024-05-10 Thread Fiona Ebner
Backup supports all modes listed in MirrorSyncMode, while mirror does
not. Introduce BackupSyncMode by copying the current MirrorSyncMode
and drop the variants mirror does not support from MirrorSyncMode as
well as the corresponding manual check in mirror_start().

A consequence is also tighter introspection: query-qmp-schema no
longer reports drive-mirror and blockdev-mirror accepting @sync values
they actually reject.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
Acked-by: Markus Armbruster 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

Changes in v3:
* add comment about introspection to commit message as suggested by
  Markus

 block/backup.c | 18 -
 block/mirror.c |  7 ---
 block/monitor/block-hmp-cmds.c |  2 +-
 block/replication.c|  2 +-
 blockdev.c | 26 -
 include/block/block_int-global-state.h |  2 +-
 qapi/block-core.json   | 27 +-
 7 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ec29d6b810..1cc4e055c6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -37,7 +37,7 @@ typedef struct BackupBlockJob {
 
 BdrvDirtyBitmap *sync_bitmap;
 
-MirrorSyncMode sync_mode;
+BackupSyncMode sync_mode;
 BitmapSyncMode bitmap_mode;
 BlockdevOnError on_source_error;
 BlockdevOnError on_target_error;
@@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 
 assert(block_job_driver(job) == &backup_job_driver);
 
-if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) {
 error_setg(errp, "The backup job only supports block checkpoint in"
" sync=none mode");
 return;
@@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job)
 uint64_t estimate;
 BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs);
 
-if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) {
+if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) {
 bdrv_clear_dirty_bitmap(bcs_bitmap, NULL);
 bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL,
  true);
-} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
+} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) {
 /*
  * We can't hog the coroutine to initialize this thoroughly.
  * Set a flag and resume work when we are able to yield safely.
@@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 
 backup_init_bcs_bitmap(s);
 
-if (s->sync_mode == MIRROR_SYNC_MODE_TOP) {
+if (s->sync_mode == BACKUP_SYNC_MODE_TOP) {
 int64_t offset = 0;
 int64_t count;
 
@@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 block_copy_set_skip_unallocated(s->bcs, false);
 }
 
-if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
+if (s->sync_mode == BACKUP_SYNC_MODE_NONE) {
 /*
  * All bits are set in bcs bitmap to allow any cluster to be copied.
  * This does not actually require them to be copied.
@@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = {
 
 BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
   BlockDriverState *target, int64_t speed,
-  MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
+  BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
   BitmapSyncMode bitmap_mode,
   bool compress,
   const char *filter_node_name,
@@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 GLOBAL_STATE_CODE();
 
 /* QMP interface protects us from these cases */
-assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
-assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP);
+assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL);
+assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP);
 
 if (bs == target) {
 error_setg(errp, "Source and target cannot be the same");
diff --git a/block/mirror.c b/block/mirror.c
index 1bdce3b657..c0597039a5 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
 
 GLOBAL_STATE_CODE();
 
-if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
-(mode == MIRROR_SYNC_MODE_BITMAP)) {
-error_setg(errp, "Sync mode '%s' not supported",
-   MirrorSyncMode_str(mode));
-return;
-}
-
 bdrv_graph_rdlock_main_loop();
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_c

[PATCH v3 5/5] blockdev: mirror: check for target's cluster size when using bitmap

2024-05-10 Thread Fiona Ebner
When using mirror with a bitmap and the target does not do COW and
is a diff image, i.e. one that should only contain the delta and was
not synced to previously, a too large cluster size for the target can
be problematic. In particular, when the mirror sends data to the
target aligned to the jobs granularity, but not aligned to the larger
target image's cluster size, the target's cluster would be allocated
but only be filled partially. When rebasing such a diff image later,
the corresponding cluster of the base image would get "masked" and the
part of the cluster not in the diff image is not accessible anymore.

Unfortunately, it is not always possible to check the target image's
cluster size, e.g. when it's NBD. Because the limitation is already
documented in the QAPI description for the @bitmap parameter and it's
only required for the special diff image use case, simply skip the
check then.
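The condition the patch enforces (when the target has no backing chain and its cluster size can be queried) boils down to a small predicate; this is a hedged restatement in Python, not the patch itself:

```python
def bitmap_granularity_ok(granularity, target_cluster_size):
    # Mirrors the check in blockdev_mirror_check_bitmap_granularity(): the
    # job granularity must be at least the target's cluster size and a
    # multiple of it, so each copied chunk fills whole target clusters.
    return (granularity >= target_cluster_size
            and granularity % target_cluster_size == 0)

assert bitmap_granularity_ok(128 * 1024, 64 * 1024)      # two clusters per chunk
assert bitmap_granularity_ok(64 * 1024, 64 * 1024)
assert not bitmap_granularity_ok(64 * 1024, 128 * 1024)  # partial cluster: rejected
assert not bitmap_granularity_ok(96 * 1024, 64 * 1024)   # not a multiple: rejected
```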

Signed-off-by: Fiona Ebner 
---

Changes in v3:
* detect when the target does COW and do not error out in that case
* treat ENOTSUP differently from other failure when querying the
  cluster size

 blockdev.c | 57 ++
 tests/qemu-iotests/tests/mirror-bitmap |  6 +++
 tests/qemu-iotests/tests/mirror-bitmap.out |  7 +++
 3 files changed, 70 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 4f72a72dc7..468974108e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error 
**errp)
 blockdev_do_action(&action, errp);
 }
 
+static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target,
+BdrvDirtyBitmap *bitmap,
+Error **errp)
+{
+int ret;
+BlockDriverInfo bdi;
+uint32_t bitmap_granularity;
+
+GLOBAL_STATE_CODE();
+GRAPH_RDLOCK_GUARD_MAINLOOP();
+
+if (bdrv_backing_chain_next(target)) {
+/*
+ * No need to worry about creating clusters with partial data when the
+ * target does COW.
+ */
+return 0;
+}
+
+/*
+ * If there is no backing file on the target, we cannot rely on COW if our
+ * backup cluster size is smaller than the target cluster size. Even for
+ * targets with a backing file, try to avoid COW if possible.
+ */
+ret = bdrv_get_info(target, &bdi);
+if (ret == -ENOTSUP) {
+/*
+ * Ignore if unable to get the info, e.g. when target is NBD. It's only
+ * relevant for syncing to a diff image and the documentation already
+ * states that the target's cluster size needs to be small enough then.
+ */
+return 0;
+} else if (ret < 0) {
+error_setg_errno(errp, -ret,
+"Couldn't determine the cluster size of the target image, "
+"which has no backing file");
+return ret;
+}
+
+bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap);
+if (bitmap_granularity < bdi.cluster_size ||
+bitmap_granularity % bdi.cluster_size != 0) {
+error_setg(errp, "Bitmap granularity %u is not a multiple of the "
+   "target image's cluster size %u and the target image has "
+   "no backing file",
+   bitmap_granularity, bdi.cluster_size);
+return -EINVAL;
+}
+
+return 0;
+}
+
+
 /* Parameter check and block job starting for drive mirroring.
  * Caller should hold @device and @target's aio context (must be the same).
  **/
@@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 return;
 }
 
+if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) {
+return;
+}
+
 if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) {
 return;
 }
diff --git a/tests/qemu-iotests/tests/mirror-bitmap 
b/tests/qemu-iotests/tests/mirror-bitmap
index 37bbe0f241..e8cd482a19 100755
--- a/tests/qemu-iotests/tests/mirror-bitmap
+++ b/tests/qemu-iotests/tests/mirror-bitmap
@@ -584,6 +584,12 @@ def test_mirror_api():
 bitmap=bitmap)
 log('')
 
+log("-- Test bitmap with too small granularity to non-COW target --\n")
+vm.qmp_log("block-dirty-bitmap-add", node=drive0.node,
+   name="bitmap-small", granularity=GRANULARITY)
+blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full",
+job_id='api_job', bitmap="bitmap-small")
+log('')
 
 def main():
 for bsync_mode in ("never", "on-success", "always"):
diff --git a/tests/qemu-iotests/tests/mirror-bitm

[PATCH v3 3/5] mirror: allow specifying working bitmap

2024-05-10 Thread Fiona Ebner
From: John Snow 

for the mirror job. The bitmap's granularity is used as the job's
granularity.

The new @bitmap parameter is marked unstable in the QAPI and can
currently only be used for @sync=full mode.

Clusters initially dirty in the bitmap as well as new writes are
copied to the target.

Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API,
callers can simulate the three kinds of @BitmapSyncMode (which is used
by backup):
1. always: default, just pass bitmap as working bitmap.
2. never: copy bitmap and pass copy to the mirror job.
3. on-success: copy bitmap and pass copy to the mirror job and if
   successful, merge bitmap into original afterwards.
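As a hedged illustration (not part of the patch; node, bitmap, and job names are made up), behaviour 2 ("never") could be driven by a management client with the following QMP sequence, leaving the original bitmap untouched because a temporary copy serves as the working bitmap:

```python
def never_mode_qmp(node, target, bitmap):
    """Return the (command, arguments) pairs to send, in order."""
    return [
        # Create a temporary bitmap and merge the original into it...
        ("block-dirty-bitmap-add",
         {"node": node, "name": "tmp-work"}),
        ("block-dirty-bitmap-merge",
         {"node": node, "target": "tmp-work", "bitmaps": [bitmap]}),
        # ...then pass the copy as the working bitmap of the mirror job.
        ("blockdev-mirror",
         {"job-id": "mirror0", "device": node, "target": target,
          "sync": "full", "bitmap": "tmp-work"}),
    ]

cmds = never_mode_qmp("drive0", "target0", "bitmap0")
assert [c for c, _ in cmds] == ["block-dirty-bitmap-add",
                                "block-dirty-bitmap-merge",
                                "blockdev-mirror"]
assert cmds[2][1]["bitmap"] == "tmp-work"
```

Note the temporary bitmap would need the same granularity as the original, since the working bitmap's granularity becomes the job's granularity.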

When the target image is a non-COW "diff image", i.e. one that was
not used as the target of a previous mirror, and the target image's
cluster size is larger than the bitmap's granularity, or when
@copy-mode=write-blocking is used, there is a pitfall: the cluster in
the target image will be allocated, but not contain all the data
corresponding to the same region in the source image.

An idea to avoid the limitation would be to mark clusters that are
affected by unaligned writes and not allocated in the target image as
dirty, so they would be copied fully later. However, for migration,
the invariant that an actively synced mirror stays actively synced
(unless an error happens) is useful, because without that invariant,
migration might inactivate block devices while the mirror still has
work to do and run into an assertion failure [0].

Another approach would be to read the missing data from the source
upon unaligned writes to be able to write the full target cluster
instead.

But certain targets like NBD do not allow querying the cluster size.
To avoid limiting/breaking the use case of syncing to an existing
target, which is arguably more common than the diff image use case,
document the limitation in QAPI.

This patch was originally based on one by Ma Haocong, but it has since
been modified pretty heavily, first by John and then again by Fiona.

[0]: 
https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/

Suggested-by: Ma Haocong 
Signed-off-by: Ma Haocong 
Signed-off-by: John Snow 
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.1
 get rid of bitmap mode parameter
 use caller-provided bitmap as working bitmap
 turn bitmap parameter experimental]
Signed-off-by: Fiona Ebner 
---

Changes in v3:
* remove duplicate "use" in QAPI description
* clarify that the cluster size caveat only applies to non-COW diff images
* split changing is_none_mode to sync_mode in job struct into a
  separate patch
* use shorter sync_mode != none rather than sync_mode == top || full
  in an if condition
* also disallow read-only bitmap (cannot be used as working bitmap)
* require that bitmap is enabled at the start of the job

 block/mirror.c | 80 +-
 blockdev.c | 44 +++---
 include/block/block_int-global-state.h |  5 +-
 qapi/block-core.json   | 37 +++-
 tests/unit/test-block-iothread.c   |  2 +-
 5 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ca23d6ef65..d3d0698116 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,11 @@ typedef struct MirrorBlockJob {
 size_t buf_size;
 int64_t bdev_length;
 unsigned long *cow_bitmap;
+/*
+ * Whether the bitmap is created locally or provided by the caller (for
+ * incremental sync).
+ */
+bool dirty_bitmap_is_local;
 BdrvDirtyBitmap *dirty_bitmap;
 BdrvDirtyBitmapIter *dbi;
 uint8_t *buf;
@@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job)
 bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs);
 }
 
-bdrv_release_dirty_bitmap(s->dirty_bitmap);
+if (s->dirty_bitmap_is_local) {
+bdrv_release_dirty_bitmap(s->dirty_bitmap);
+} else {
+bdrv_enable_dirty_bitmap(s->dirty_bitmap);
+}
 
 /* Make sure that the source BDS doesn't go away during bdrv_replace_node,
  * before we can call bdrv_drained_end */
@@ -820,6 +829,16 @@ static void mirror_abort(Job *job)
 assert(ret == 0);
 }
 
+/* Always called after commit/abort. */
+static void mirror_clean(Job *job)
+{
+MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
+
+if (!s->dirty_bitmap_is_local && s->dirty_bitmap) {
+bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false);
+}
+}
+
 static void coroutine_fn mirror_throttle(MirrorBlockJob *s)
 {
 int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_free_init(s);
 
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALT

Re: [PATCH v2 2/4] mirror: allow specifying working bitmap

2024-05-08 Thread Fiona Ebner
Am 07.05.24 um 14:15 schrieb Fiona Ebner:
> Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy:
>> On 07.03.24 16:47, Fiona Ebner wrote:
>>> +# @bitmap: The name of a bitmap to use as a working bitmap for
>>> +# sync=full mode.  This argument must be not be present for other
>>> +# sync modes and not at the same time as @granularity.  The
>>> +# bitmap's granularity is used as the job's granularity.  When
>>> +# the target is a diff image, i.e. one that should only contain
>>> +# the delta and was not synced to previously, the target's
>>> +# cluster size must not be larger than the bitmap's granularity.
>>
>> Could we check this? Like in block_copy_calculate_cluster_size(), we can
>> check if target does COW, and if not, we can check that we are safe with
>> granularity.
>>
> 
> The issue here is (in particular) present when the target does COW, i.e.
> in qcow2 diff images, allocated clusters which end up with partial data,
> when we don't have the right cluster size. Patch 4/4 adds the check for
> the target's cluster size.
> 

Sorry, no. What I said is wrong. It's just that the test does something
very pathological and does not even use COW/backing files. All the
mirror targets are separate diff images there. So yes, we can do the
same as block_copy_calculate_cluster_size() and the issue only appears
in the same edge cases as for backup where we can error out early. This
also applies to copy-mode=write-blocking AFAICT.

>>> +# For a diff image target, using copy-mode=write-blocking should
>>> +# not be used, because unaligned writes will lead to allocated
>>> +# clusters with partial data in the target image!
>>
>> Could this be checked?
>>
> 
> I don't think so. How should we know if the target already contains data
> from a previous full sync or not?
> 
> Those caveats when using diff images are unfortunate, and users should
> be warned about them of course, but the main/expected use case for the
> feature is to sync to the same target multiple times, so I'd hope the
> cluster size check in patch 4/4 and mentioning the edge cases in the
> documentation is enough here.
> 




Re: [PATCH v2 2/4] mirror: allow specifying working bitmap

2024-05-07 Thread Fiona Ebner
Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy:
> On 07.03.24 16:47, Fiona Ebner wrote:
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 1609354db3..5c9a00b574 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob {
>>   BlockDriverState *to_replace;
>>   /* Used to block operations on the drive-mirror-replace target */
>>   Error *replace_blocker;
>> -    bool is_none_mode;
>> +    MirrorSyncMode sync_mode;
> 
> Could you please split this change to separate preparation patch?
> 

Will do.

>> +    if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO,
>> errp)) {
> 
> Why allow read-only bitmaps?
> 

Good catch! This is a left-over from an earlier version. Now that the
bitmap shall be used as the working bitmap, it cannot be read-only. I'll
change it to BDRV_BITMAP_DEFAULT in v3 of the series.

>> +# @bitmap: The name of a bitmap to use as a working bitmap for
>> +# sync=full mode.  This argument must be not be present for other
>> +# sync modes and not at the same time as @granularity.  The
>> +# bitmap's granularity is used as the job's granularity.  When
>> +# the target is a diff image, i.e. one that should only contain
>> +# the delta and was not synced to previously, the target's
>> +# cluster size must not be larger than the bitmap's granularity.
> 
> Could we check this? Like in block_copy_calculate_cluster_size(), we can
> check if target does COW, and if not, we can check that we are safe with
> granularity.
> 

The issue here is (in particular) present when the target does COW, i.e.
in qcow2 diff images, allocated clusters which end up with partial data,
when we don't have the right cluster size. Patch 4/4 adds the check for
the target's cluster size.

>> +# For a diff image target, using copy-mode=write-blocking should
>> +# not be used, because unaligned writes will lead to allocated
>> +# clusters with partial data in the target image!
> 
> Could this be checked?
> 

I don't think so. How should we know if the target already contains data
from a previous full sync or not?

Those caveats when using diff images are unfortunate, and users should
be warned about them of course, but the main/expected use case for the
feature is to sync to the same target multiple times, so I'd hope the
cluster size check in patch 4/4 and mentioning the edge cases in the
documentation is enough here.

>>  The bitmap
>> +# will be enabled after the job finishes.  (Since 9.0)
> 
> Hmm. That looks correct. At least for the case, when bitmap is enabled
> at that start of job. Suggest to require this.
> 

It's true for any provided bitmap: it will be disabled when the mirror
job starts, because we manually set it in bdrv_mirror_top_do_write() and
then in mirror_exit_common(), the bitmap will be enabled.

Okay, I'll require that it is enabled at the beginning.

>> +#
>>   # @granularity: granularity of the dirty bitmap, default is 64K if the
>>   # image format doesn't have clusters, 4K if the clusters are
>>   # smaller than that, else the cluster size.  Must be a power of 2
>> @@ -2548,6 +2578,10 @@
>>   # disappear from the query list without user intervention.
>>   # Defaults to true.  (Since 3.1)
>>   #
>> +# Features:
>> +#
>> +# @unstable: Member @bitmap is experimental.
>> +#
>>   # Since: 2.6
> 
> Y_MODE_BACKGROUND,
>>    &error_abort);
> 
> [..]
> 
> Generally looks good to me.
> 

Thank you for the review!




Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-04-29 Thread Fiona Ebner
Am 29.04.24 um 16:36 schrieb Philippe Mathieu-Daudé:
> Hi Fiona,
> 
> On 29/4/24 16:19, Fiona Ebner wrote:
> 
> Not everybody uses an email client that shows the patch content just
> after the subject (your first lines wasn't making sense at first).
> 
> Simply duplicating the subject helps to understand:
> 
>   Use uint64_t for timeout in nanoseconds ...
> 

Oh, sorry. I'll try to remember that for the future. Should I re-send as
a v2?

Best Regards,
Fiona




[PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds

2024-04-29 Thread Fiona Ebner
rather than uint32_t, for which the maximum is slightly more than 4
seconds; larger values would overflow. The QAPI interface allows
specifying the number of seconds, so only values 0 to 4 are safe right
now; other values lead to a much lower timeout than the user expects.

The block_copy() call where this is used already takes a uint64_t for
the timeout, so no change required there.
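The arithmetic behind the bug, as a quick sanity check (the 45-second value is just an illustrative example):

```python
# A uint32_t holds nanosecond counts only up to ~4.29 s; larger
# second values wrap around modulo 2**32 when stored.
NS_PER_SEC = 10**9
UINT32_MAX = 2**32 - 1

assert UINT32_MAX // NS_PER_SEC == 4  # only 0..4 whole seconds fit

def stored_timeout_ns(seconds):
    """Value a uint32_t field would actually end up holding."""
    return (seconds * NS_PER_SEC) % 2**32

# A user asking for a 45-second timeout gets roughly 2 seconds instead:
assert abs(stored_timeout_ns(45) / NS_PER_SEC - 2.05) < 0.01
```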

Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
Reported-by: Friedrich Weber 
Signed-off-by: Fiona Ebner 
---
 block/copy-before-write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 8aba27a71d..026fa9840f 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -43,7 +43,7 @@ typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
 OnCbwError on_cbw_error;
-uint32_t cbw_timeout_ns;
+uint64_t cbw_timeout_ns;
 
 /*
  * @lock: protects access to @access_bitmap, @done_bitmap and
-- 
2.39.2





Re: [PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()

2024-03-28 Thread Fiona Ebner
Am 27.03.24 um 20:27 schrieb Kevin Wolf:
> Coverity complains that the check introduced in commit 3f934817 suggests
> that qiov could be NULL and we dereference it before reaching the check.
> In fact, all of the callers pass a non-NULL pointer, so just remove the
> misleading check.
> 
> Resolves: Coverity CID 1542668
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fiona Ebner 

Thank you for the fix,
Fiona




Question about block graph lock limitation with generated co-wrappers

2024-03-26 Thread Fiona Ebner
Hi,
we have a custom block driver downstream, which currently calls
bdrv_get_info() (for its file child) in the bdrv_refresh_limits()
callback. However, with graph locking, this doesn't work anymore.
AFAICT, the reason is the following:

The block driver has a backing file option.
During initialization, in bdrv_set_backing_hd(), the graph lock is
acquired exclusively.
Then the bdrv_refresh_limits() callback is invoked.
Now bdrv_get_info() is called, which is a generated co-wrapper.
The bdrv_co_get_info_entry() function tries to acquire the graph lock
for reading, sees that has_writer is true and so the coroutine will be
put to wait, leading to a deadlock.

For my specific case, I can move the bdrv_get_info() call to bdrv_open()
as a workaround. But I wanted to ask if there is a way to make generated
co-wrappers inside an exclusively locked section work? And if not, could
we introduce/extend the annotations, so the compiler can catch this kind
of issue, i.e. calling a generated co-wrapper while in an exclusively
locked section?

Best Regards,
Fiona




[PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request

2024-03-22 Thread Fiona Ebner
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which however trips bdrv_pad_request(),
since it wants to extend the passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.
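The offset/bytes adjustment in question can be sketched like this (a simplified model of what bdrv_pad_request() computes, not the actual implementation; the 4 KiB alignment is an example value):

```python
def pad_request(offset, nbytes, align):
    """Widen [offset, offset + nbytes) to the request alignment boundaries."""
    head = offset % align              # bytes added before the request
    tail = -(offset + nbytes) % align  # bytes added after the request
    return offset - head, nbytes + head + tail

# An unaligned request gets widened to full alignment units:
offset, nbytes = pad_request(1000, 3000, 4096)
assert (offset, nbytes) == (0, 4096)
assert offset % 4096 == 0 and nbytes % 4096 == 0

# An already-aligned request is left untouched:
assert pad_request(8192, 4096, 4096) == (8192, 4096)
```

Even with no qiov to extend, these two values still have to be updated, which is what the patch below does.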

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> < {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": 
> "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
---

No changes in v3.
No changes in v2.

 block/io.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  &sliced_head, &sliced_tail,
-  &sliced_niov);
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  &sliced_head, &sliced_tail,
+  &sliced_niov);
 
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = &pad->local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = &pad->local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.39.2





[PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read

2024-03-22 Thread Fiona Ebner
Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.

By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.

Signed-off-by: Fiona Ebner 
---

No changes in v3.
New in v2.

 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 2 files changed, 91 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 00..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+def setUp(self) -> None:
+"""
+Create two images:
+- base image {base} with {cluster_size // 2} bytes allocated
+- top image {top} without any data allocated and coarser
+  cluster size
+
+Attach a compress filter for the top image, because that
+requires that the request alignment is the top image's cluster
+size.
+"""
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size // 2),
+base, str(image_size))
+qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size),
+top, str(image_size))
+
+self.vm = iotests.VM()
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': imgfmt,
+'node-name': 'base',
+'file': {
+'driver': 'file',
+'filename': base
+}
+}))
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': 'compress',
+'node-name': 'compress-top',
+'file': {
+'driver': imgfmt,
+'node-name': 'top',
+'file': {
+'driver': 'file',
+'filename': top
+},
+'backing': 'base'
+}
+}))
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(top)
+os.remove(base)
+
+def test_stream_unaligned_prefetch(self) -> None:
+self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.39.2





[PATCH v3 0/4] fix two edge cases related to stream block jobs

2024-03-22 Thread Fiona Ebner
Changes in v3:
* Also deal with edge case in bdrv_next_cleanup(). Haven't run
  into an actual issue there, but at least the caller in
  migration/block.c uses bdrv_nb_sectors() which, while not a
  coroutine wrapper itself (it's written manually), may call
  bdrv_refresh_total_sectors(), which is a generated coroutine
  wrapper, so AFAIU, the block graph can change during that call.
  And even without that, it's just better to be more consistent
  with bdrv_next().

Changes in v2:
* Ran into another issue while writing the IO test Stefan wanted
  to have (good call :)), so include a fix for that and add the
  test. I didn't notice during manual testing, because I hadn't
  used a scripted QMP 'quit', so there was no race.

Fiona Ebner (3):
  block-backend: fix edge case in bdrv_next() where BDS associated to BB
changes
  block-backend: fix edge case in bdrv_next_cleanup() where BDS
associated to BB changes
  iotests: add test for stream job with an unaligned prefetch read

Stefan Reiter (1):
  block/io: accept NULL qiov in bdrv_pad_request

 block/block-backend.c | 18 ++--
 block/io.c| 31 ---
 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 4 files changed, 117 insertions(+), 23 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

-- 
2.39.2





[PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-22 Thread Fiona Ebner
The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
---

No changes in v3.
New in v2.

 block/block-backend.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+old_bs = it->bs;
+
 /* First, return all root nodes of BlockBackends. In order to avoid
  * returning a BDS twice when multiple BBs refer to it, we only return it
  * if the BB is the first one in the parent list of the BDS. */
 if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
 BlockBackend *old_blk = it->blk;
 
-old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
 do {
 it->blk = blk_all_next(it->blk);
 bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 if (bs) {
 bdrv_ref(bs);
 bdrv_unref(old_bs);
+it->bs = bs;
 return bs;
 }
 it->phase = BDRV_NEXT_MONITOR_OWNED;
-} else {
-old_bs = it->bs;
 }
 
 /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.39.2





[PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes

2024-03-22 Thread Fiona Ebner
Same rationale as for commit "block-backend: fix edge case in
bdrv_next() where BDS associated to BB changes". The block graph might
change between the bdrv_next() call and the bdrv_next_cleanup() call,
so it could be that the associated BDS is not the same that was
referenced previously anymore. Instead, rely on bdrv_next() to set
it->bs to the BDS it referenced and unreference that one in any case.

Signed-off-by: Fiona Ebner 
---

New in v3.

 block/block-backend.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 28af1eb17a..db6f9b92a3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
-if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
-if (it->blk) {
-bdrv_unref(blk_bs(it->blk));
-blk_unref(it->blk);
-}
-} else {
-bdrv_unref(it->bs);
+bdrv_unref(it->bs);
+
+if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) {
+blk_unref(it->blk);
 }
 
 bdrv_next_reset(it);
-- 
2.39.2





[PATCH v2 0/3] fix two edge cases related to stream block jobs

2024-03-21 Thread Fiona Ebner
Changes in v2:
* Ran into another issue while writing the IO test Stefan wanted
  to have (good call :)), so include a fix for that and add the
  test. I didn't notice during manual testing, because I hadn't
  used a scripted QMP 'quit', so there was no race.

Fiona Ebner (2):
  block-backend: fix edge case in bdrv_next() where BDS associated to BB
changes
  iotests: add test for stream job with an unaligned prefetch read

Stefan Reiter (1):
  block/io: accept NULL qiov in bdrv_pad_request

 block/block-backend.c |  7 +-
 block/io.c| 31 ---
 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 4 files changed, 113 insertions(+), 16 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

-- 
2.39.2





[PATCH v2 3/3] iotests: add test for stream job with an unaligned prefetch read

2024-03-21 Thread Fiona Ebner
Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.

By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.

Signed-off-by: Fiona Ebner 
---

New in v2.

 .../tests/stream-unaligned-prefetch   | 86 +++
 .../tests/stream-unaligned-prefetch.out   |  5 ++
 2 files changed, 91 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
 create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out

diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 00..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+def setUp(self) -> None:
+"""
+Create two images:
+- base image {base} with {cluster_size // 2} bytes allocated
+- top image {top} without any data allocated and coarser
+  cluster size
+
+Attach a compress filter for the top image, because that
+requires that the request alignment is the top image's cluster
+size.
+"""
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size // 2),
+base, str(image_size))
+qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+qemu_img_create('-f', imgfmt,
+'-o', 'cluster_size={}'.format(cluster_size),
+top, str(image_size))
+
+self.vm = iotests.VM()
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': imgfmt,
+'node-name': 'base',
+'file': {
+'driver': 'file',
+'filename': base
+}
+}))
+self.vm.add_blockdev(self.vm.qmp_to_opts({
+'driver': 'compress',
+'node-name': 'compress-top',
+'file': {
+'driver': imgfmt,
+'node-name': 'top',
+'file': {
+'driver': 'file',
+'filename': top
+},
+'backing': 'base'
+}
+}))
+self.vm.launch()
+
+def tearDown(self) -> None:
+self.vm.shutdown()
+os.remove(top)
+os.remove(base)
+
+def test_stream_unaligned_prefetch(self) -> None:
+self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.39.2





[PATCH v2 1/3] block/io: accept NULL qiov in bdrv_pad_request

2024-03-21 Thread Fiona Ebner
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
---

No changes in v2.

 block/io.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  &sliced_head, &sliced_tail,
-  &sliced_niov);
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  &sliced_head, &sliced_tail,
+  &sliced_niov);
 
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = &pad->local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = &pad->local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.39.2





[PATCH v2 2/3] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-21 Thread Fiona Ebner
The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
---

Not sure if this is the correct fix, or if the call site should rather
be adapted somehow?

New in v2.

 block/block-backend.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 /* Must be called from the main loop */
 assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+old_bs = it->bs;
+
 /* First, return all root nodes of BlockBackends. In order to avoid
  * returning a BDS twice when multiple BBs refer to it, we only return it
  * if the BB is the first one in the parent list of the BDS. */
 if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
 BlockBackend *old_blk = it->blk;
 
-old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
 do {
 it->blk = blk_all_next(it->blk);
 bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
 if (bs) {
 bdrv_ref(bs);
 bdrv_unref(old_bs);
+it->bs = bs;
 return bs;
 }
 it->phase = BDRV_NEXT_MONITOR_OWNED;
-} else {
-old_bs = it->bs;
 }
 
 /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.39.2





[PATCH] block/io: accept NULL qiov in bdrv_pad_request

2024-03-19 Thread Fiona Ebner
From: Stefan Reiter 

Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter 
Signed-off-by: Thomas Lamprecht 
[FE: do update bytes and offset in any case
 add reproducer to commit message]
Signed-off-by: Fiona Ebner 
---
 block/io.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
 return 0;
 }
 
-sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-  &sliced_head, &sliced_tail,
-  &sliced_niov);
+/*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+if (qiov && *qiov) {
+sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+  &sliced_head, &sliced_tail,
+  &sliced_niov);
 
-/* Guaranteed by bdrv_check_request32() */
-assert(*bytes <= SIZE_MAX);
-ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-  sliced_head, *bytes);
-if (ret < 0) {
-bdrv_padding_finalize(pad);
-return ret;
+/* Guaranteed by bdrv_check_request32() */
+assert(*bytes <= SIZE_MAX);
+ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+  sliced_head, *bytes);
+if (ret < 0) {
+bdrv_padding_finalize(pad);
+return ret;
+}
+*qiov = &pad->local_qiov;
+*qiov_offset = 0;
 }
+
 *bytes += pad->head + pad->tail;
 *offset -= pad->head;
-*qiov = &pad->local_qiov;
-*qiov_offset = 0;
 if (padded) {
 *padded = true;
 }
-- 
2.39.2





[PATCH 1/2] copy-before-write: allow specifying minimum cluster size

2024-03-08 Thread Fiona Ebner
Useful to make discard-source work in the context of backup fleecing
when the fleecing image has a larger granularity than the backup
target.

Copy-before-write operations will use at least this granularity and in
particular, discard requests to the source node will too. If the
granularity is too small, they will just be aligned down in
cbw_co_pdiscard_snapshot() and thus effectively ignored.

The QAPI uses uint32 so the value will be non-negative, but still fit
into a uint64_t.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---
 block/block-copy.c | 17 +
 block/copy-before-write.c  |  3 ++-
 include/block/block-copy.h |  1 +
 qapi/block-core.json   |  8 +++-
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index 7e3b378528..adb1cbb440 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range,
 }
 
 static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
+ int64_t min_cluster_size,
  Error **errp)
 {
 int ret;
@@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 "used. If the actual block size of the target exceeds "
 "this default, the backup may be unusable",
 BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
-return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
 } else if (ret < 0 && !target_does_cow) {
 error_setg_errno(errp, -ret,
 "Couldn't determine the cluster size of the target image, "
@@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target,
 return ret;
 } else if (ret < 0 && target_does_cow) {
 /* Not fatal; just trudge on ahead. */
-return BLOCK_COPY_CLUSTER_SIZE_DEFAULT;
+return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT);
 }
 
-return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size);
+return MAX(min_cluster_size,
+   MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size));
 }
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
  BlockDriverState *copy_bitmap_bs,
  const BdrvDirtyBitmap *bitmap,
  bool discard_source,
+ int64_t min_cluster_size,
  Error **errp)
 {
 ERRP_GUARD();
@@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
 
 GLOBAL_STATE_CODE();
 
-cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
+if (min_cluster_size && !is_power_of_2(min_cluster_size)) {
+error_setg(errp, "min-cluster-size needs to be a power of 2");
+return NULL;
+}
+
+cluster_size = block_copy_calculate_cluster_size(target->bs,
+ min_cluster_size, errp);
 if (cluster_size < 0) {
 return NULL;
 }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index dac57481c5..f9896c6c1e 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -476,7 +476,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 
 s->discard_source = flags & BDRV_O_CBW_DISCARD_SOURCE;
 s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap,
-  flags & BDRV_O_CBW_DISCARD_SOURCE, errp);
+  flags & BDRV_O_CBW_DISCARD_SOURCE,
+  opts->min_cluster_size, errp);
 if (!s->bcs) {
 error_prepend(errp, "Cannot create block-copy-state: ");
 return -EINVAL;
diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index bdc703bacd..77857c6c68 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -28,6 +28,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
  BlockDriverState *copy_bitmap_bs,
  const BdrvDirtyBitmap *bitmap,
  bool discard_source,
+ int64_t min_cluster_size,
  Error **errp);
 
 /* Function should be called prior any actual copy request */
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 0a72c590a8..85c8f88f6e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4625,12 +4625,18 @@
 # @on-cbw-err

[PATCH 0/2] backup: allow specifying minimum cluster size

2024-03-08 Thread Fiona Ebner
Based-on: https://lore.kernel.org/qemu-devel/20240228141501.455989-1-vsement...@yandex-team.ru/

Useful to make discard-source work in the context of backup fleecing
when the fleecing image has a larger granularity than the backup
target.

Backup/block-copy will use at least this granularity for copy operations
and in particular, discard requests to the backup source will too. If
the granularity is too small, they will just be aligned down in
cbw_co_pdiscard_snapshot() and thus effectively ignored.

Fiona Ebner (2):
  copy-before-write: allow specifying minimum cluster size
  backup: add minimum cluster size to performance options

 block/backup.c |  2 +-
 block/block-copy.c | 17 +
 block/copy-before-write.c  |  5 -
 block/copy-before-write.h  |  1 +
 blockdev.c |  3 +++
 include/block/block-copy.h |  1 +
 qapi/block-core.json   | 17 ++---
 7 files changed, 37 insertions(+), 9 deletions(-)

-- 
2.39.2





[PATCH 2/2] backup: add minimum cluster size to performance options

2024-03-08 Thread Fiona Ebner
Useful to make discard-source work in the context of backup fleecing
when the fleecing image has a larger granularity than the backup
target.

Backup/block-copy will use at least this granularity for copy operations
and in particular, discard requests to the backup source will too. If
the granularity is too small, they will just be aligned down in
cbw_co_pdiscard_snapshot() and thus effectively ignored.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---
 block/backup.c| 2 +-
 block/copy-before-write.c | 2 ++
 block/copy-before-write.h | 1 +
 blockdev.c| 3 +++
 qapi/block-core.json  | 9 +++--
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3dd2e229d2..a1292c01ec 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
 }
 
 cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source,
-  &bcs, errp);
+  perf->min_cluster_size, &bcs, errp);
 if (!cbw) {
 goto error;
 }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index f9896c6c1e..55a9272485 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -545,6 +545,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
   BlockDriverState *target,
   const char *filter_node_name,
   bool discard_source,
+  int64_t min_cluster_size,
   BlockCopyState **bcs,
   Error **errp)
 {
@@ -563,6 +564,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
 }
 qdict_put_str(opts, "file", bdrv_get_node_name(source));
 qdict_put_str(opts, "target", bdrv_get_node_name(target));
+qdict_put_int(opts, "min-cluster-size", min_cluster_size);
 
 top = bdrv_insert_node(source, opts, flags, errp);
 if (!top) {
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 01af0cd3c4..dc6cafe7fa 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
   BlockDriverState *target,
   const char *filter_node_name,
   bool discard_source,
+  int64_t min_cluster_size,
   BlockCopyState **bcs,
   Error **errp);
 void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/blockdev.c b/blockdev.c
index daceb50460..8e6bdbc94a 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2653,6 +2653,9 @@ static BlockJob *do_backup_common(BackupCommon *backup,
 if (backup->x_perf->has_max_chunk) {
 perf.max_chunk = backup->x_perf->max_chunk;
 }
+if (backup->x_perf->has_min_cluster_size) {
+perf.min_cluster_size = backup->x_perf->min_cluster_size;
+}
 }
 
 if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 85c8f88f6e..ba0836892f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1551,11 +1551,16 @@
 # it should not be less than job cluster size which is calculated
 # as maximum of target image cluster size and 64k.  Default 0.
 #
+# @min-cluster-size: Minimum size of blocks used by copy-before-write
+# and background copy operations.  Has to be a power of 2.  No
+# effect if smaller than the maximum of the target's cluster size
+# and 64 KiB.  Default 0.  (Since 9.0)
+#
 # Since: 6.0
 ##
 { 'struct': 'BackupPerf',
-  'data': { '*use-copy-range': 'bool',
-'*max-workers': 'int', '*max-chunk': 'int64' } }
+  'data': { '*use-copy-range': 'bool', '*max-workers': 'int',
+'*max-chunk': 'int64', '*min-cluster-size': 'uint32' } }
 
 ##
 # @BackupCommon:
-- 
2.39.2





Re: [PATCH v3 3/5] block/copy-before-write: create block_copy bitmap in filter node

2024-03-08 Thread Fiona Ebner
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
> Currently block_copy creates copy_bitmap in source node. But that is in
> bad relation with .independent_close=true of copy-before-write filter:
> source node may be detached and removed before .bdrv_close() handler
> called, which should call block_copy_state_free(), which in turn should
> remove copy_bitmap.
> 
> That's all not ideal: it would be better if internal bitmap of
> block-copy object is not attached to any node. But that is not possible
> now.
> 
> The simplest solution is just create copy_bitmap in filter node, where
> anyway two other bitmaps are created.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Reviewed-by: Fiona Ebner 




Re: [PATCH v3 0/5] backup: discard-source parameter

2024-03-08 Thread Fiona Ebner
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
> Hi all! The main patch is 04, please look at it for description and
> diagram.
> 
> v3:
> 02: new patch
> 04: take WRITE permission only when discard_source is required
> 
> Vladimir Sementsov-Ogievskiy (5):
>   block/copy-before-write: fix permission
>   block/copy-before-write: support unaligned snapshot-discard
>   block/copy-before-write: create block_copy bitmap in filter node
>   qapi: blockdev-backup: add discard-source parameter
>   iotests: add backup-discard-source
> 
>  block/backup.c                                |   5 +-
>  block/block-copy.c                            |  12 +-
>  block/copy-before-write.c                     |  39 -
>  block/copy-before-write.h                     |   1 +
>  block/replication.c                           |   4 +-
>  blockdev.c                                    |   2 +-
>  include/block/block-common.h                  |   2 +
>  include/block/block-copy.h                    |   2 +
>  include/block/block_int-global-state.h        |   2 +-
>  qapi/block-core.json                          |   4 +
>  tests/qemu-iotests/257.out                    | 112 ++---
>  .../qemu-iotests/tests/backup-discard-source  | 151 ++
>  .../tests/backup-discard-source.out           |   5 +
>  13 files changed, 271 insertions(+), 70 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/backup-discard-source
>  create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out
> 

Tested-by: Fiona Ebner 




Re: [PATCH v3 5/5] iotests: add backup-discard-source

2024-03-08 Thread Fiona Ebner
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy:
> Add test for a new backup option: discard-source.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  .../qemu-iotests/tests/backup-discard-source  | 151 ++
>  .../tests/backup-discard-source.out   |   5 +
>  2 files changed, 156 insertions(+)
>  create mode 100755 tests/qemu-iotests/tests/backup-discard-source
>  create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out
> 
> diff --git a/tests/qemu-iotests/tests/backup-discard-source b/tests/qemu-iotests/tests/backup-discard-source
> new file mode 100755
> index 00..8a88b0f6c4
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/backup-discard-source
> @@ -0,0 +1,151 @@
> +#!/usr/bin/env python3
> +#
> +# Test removing persistent bitmap from backing
> +#
> +# Copyright (c) 2022 Virtuozzo International GmbH.
> +#

Title and copyright year are wrong.

Apart from that:

Reviewed-by: Fiona Ebner 




Re: [PATCH v3 2/5] block/copy-before-write: support unaligned snapshot-discard

2024-03-08 Thread Fiona Ebner
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
> The first thing that crashes on unaligned access here is
> bdrv_reset_dirty_bitmap(). The correct way is to align down the
> snapshot-discard request.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Reviewed-by: Fiona Ebner 




Re: [PATCH v3 4/5] qapi: blockdev-backup: add discard-source parameter

2024-03-08 Thread Fiona Ebner
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy:
> Add a parameter that enables discard-after-copy. That is mostly useful
> in "push backup with fleecing" scheme, when source is snapshot-access
> format driver node, based on copy-before-write filter snapshot-access
> API:
> 
> [guest]  [snapshot-access] ~~ blockdev-backup ~~> [backup target]
>||
>| root   | file
>vv
> [copy-before-write]
>| |
>| file| target
>v v
> [active disk]   [temp.img]
> 
> In this case discard-after-copy does two things:
> 
>  - discard data in temp.img to save disk space
>  - avoid further copy-before-write operation in discarded area
> 
> Note that we have to declare WRITE permission on source in
> copy-before-write filter, for discard to work. Still we can't take it
> unconditionally, as it will break normal backup from RO source. So, we
> have to add a parameter and pass it through bdrv_open flags.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Reviewed-by: Fiona Ebner 




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2024-03-08 Thread Fiona Ebner
Am 07.03.24 um 20:42 schrieb Vladimir Sementsov-Ogievskiy:
> On 04.03.24 14:09, Peter Krempa wrote:
>> On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote:
>>> Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben:
 On 03.11.23 18:56, Markus Armbruster wrote:
> Kevin Wolf  writes:
>>
>> [...]
>>
> Is the job abstraction a failure?
>
> We have
>
>   block-job- command      since   job- command    since
>   -----------------------------------------------------
>   block-job-set-speed     1.1
>   block-job-cancel        1.1     job-cancel      3.0
>   block-job-pause         1.3     job-pause       3.0
>   block-job-resume        1.3     job-resume      3.0
>   block-job-complete      1.3     job-complete    3.0
>   block-job-dismiss       2.12    job-dismiss     3.0
>   block-job-finalize      2.12    job-finalize    3.0
>   block-job-change        8.2
>   query-block-jobs        1.1     query-jobs
>>
>> [...]
>>
>>> I consider these strictly optional. We don't really have strong reasons
>>> to deprecate these commands (they are just thin wrappers), and I think
>>> libvirt still uses block-job-* in some places.
>>
>> Libvirt uses 'block-job-cancel' because it has different semantics from
>> 'job-cancel' which libvirt documented as the behaviour of the API that
>> uses it. (Semantics regarding the expectation of what is written to the
>> destination node at the point when the job is cancelled).
>>
> 
> That's the following semantics:
> 
>   # Note that if you issue 'block-job-cancel' after 'drive-mirror' has
>   # indicated (via the event BLOCK_JOB_READY) that the source and
>   # destination are synchronized, then the event triggered by this
>   # command changes to BLOCK_JOB_COMPLETED, to indicate that the
>   # mirroring has ended and the destination now has a point-in-time copy
>   # tied to the time of the cancellation.
> 
> Hmm. Looking at this, it looks for me, that should probably a
> 'block-job-complete" command (as leading to BLOCK_JOB_COMPLETED).
> 
> Actually, what is the difference between block-job-complete and
> block-job-cancel(force=false) for mirror in ready state?
> 
> I only see the following differences:
> 
> 1. block-job-complete documents that it completes the job
> synchronously.. But looking at mirror code I see it just set
> s->should_complete = true, which will be then handled asynchronously..
> So I doubt that documentation is correct.
> 
> 2. block-job-complete will trigger final graph changes. block-job-cancel
> will not.
> 
> Is [2] really useful? Seems yes: in case of some failure before starting
> migration target, we'd like to continue executing source. So, no reason
> to break block-graph in source, better keep it unchanged.
> 

FWIW, we also rely on these special semantics. We allow cloning the disk
state of a running guest using drive-mirror (and before finishing,
fsfreeze in the guest for consistency). We cannot use block-job-complete
there, because we do not want to switch the source's drive.

> But I think such behavior had better be set up by a mirror-job start
> parameter, rather than by a special option for the cancel (or even
> complete) command, useful only for mirror.
> 
> So, what about the following substitution for block-job-cancel:
> 
> block-job-cancel(force=true)  -->  use job-cancel
> 
> block-job-cancel(force=false) for backup, stream, commit  -->  use
> job-cancel
> 
> block-job-cancel(force=false) for mirror in ready mode  -->
> 
>   instead, use block-job-complete. If you don't need final graph
> modification which mirror job normally does, use graph-change=false
> parameter for blockdev-mirror command.
> 

But yes, having a graph-change parameter would work for us too :)

Best Regards,
Fiona




[PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup

2024-03-07 Thread Fiona Ebner
Backup supports all modes listed in MirrorSyncMode, while mirror does
not. Introduce BackupSyncMode by copying the current MirrorSyncMode
and drop the variants mirror does not support from MirrorSyncMode as
well as the corresponding manual check in mirror_start().

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---

I felt like keeping the "Since: X.Y" as before makes the most sense as
to not lose history. Or is it necessary to change this for
BackupSyncMode (and its members) since it got a new name?

 block/backup.c                         | 18 -
 block/mirror.c                         |  7 ---
 block/monitor/block-hmp-cmds.c         |  2 +-
 block/replication.c                    |  2 +-
 blockdev.c                             | 26 -
 include/block/block_int-global-state.h |  2 +-
 qapi/block-core.json                   | 27 +-
 7 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ec29d6b810..1cc4e055c6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -37,7 +37,7 @@ typedef struct BackupBlockJob {
 
 BdrvDirtyBitmap *sync_bitmap;
 
-MirrorSyncMode sync_mode;
+BackupSyncMode sync_mode;
 BitmapSyncMode bitmap_mode;
 BlockdevOnError on_source_error;
 BlockdevOnError on_target_error;
@@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 
 assert(block_job_driver(job) == &backup_job_driver);
 
-if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) {
 error_setg(errp, "The backup job only supports block checkpoint in"
" sync=none mode");
 return;
@@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job)
 uint64_t estimate;
 BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs);
 
-if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) {
+if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) {
 bdrv_clear_dirty_bitmap(bcs_bitmap, NULL);
 bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL,
  true);
-} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
+} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) {
 /*
  * We can't hog the coroutine to initialize this thoroughly.
  * Set a flag and resume work when we are able to yield safely.
@@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 
 backup_init_bcs_bitmap(s);
 
-if (s->sync_mode == MIRROR_SYNC_MODE_TOP) {
+if (s->sync_mode == BACKUP_SYNC_MODE_TOP) {
 int64_t offset = 0;
 int64_t count;
 
@@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 block_copy_set_skip_unallocated(s->bcs, false);
 }
 
-if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
+if (s->sync_mode == BACKUP_SYNC_MODE_NONE) {
 /*
  * All bits are set in bcs bitmap to allow any cluster to be copied.
  * This does not actually require them to be copied.
@@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = {
 
 BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
   BlockDriverState *target, int64_t speed,
-  MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
+  BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
   BitmapSyncMode bitmap_mode,
   bool compress,
   const char *filter_node_name,
@@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
 GLOBAL_STATE_CODE();
 
 /* QMP interface protects us from these cases */
-assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
-assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP);
+assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL);
+assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP);
 
 if (bs == target) {
 error_setg(errp, "Source and target cannot be the same");
diff --git a/block/mirror.c b/block/mirror.c
index 5145eb53e1..1609354db3 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -2011,13 +2011,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
 
 GLOBAL_STATE_CODE();
 
-if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
-(mode == MIRROR_SYNC_MODE_BITMAP)) {
-error_setg(errp, "Sync mode '%s' not supported",
-   MirrorSyncMode_str(mode));
-return;
-}
-
 bdrv_graph_rdlock_main_loop();
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index d954bec6f1..9633d000

[PATCH v2 0/4] mirror: allow specifying working bitmap

2024-03-07 Thread Fiona Ebner
Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop bitmap-mode parameter and use passed-in bitmap as the working
  bitmap instead. Users can get the desired behaviors by
  using the block-dirty-bitmap-clear and block-dirty-bitmap-merge
  calls (see commit message in patch 2/4 for how exactly).
* Add patch to check whether target image's cluster size is at most
  mirror job's granularity. Optional, it's an extra safety check
  that's useful when the target is a "diff" image that does not have
  previously synced data.

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for
about four years now. In particular, as mentioned in [1]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)


Now, the IO test added in patch 4/4 actually contains yet another use
case, namely doing incremental mirrors to stand-alone qcow2 "diff"
images, that only contain the delta and can be rebased later. I had to
adapt the IO test, because its output expected the mirror bitmap to
still be dirty, but nowadays the mirror is apparently already done
when the bitmaps are queried. So I thought, I'll just use
'write-blocking' mode to avoid any potential timing issues.

But this exposed an issue with the diff image approach. If a write is
not aligned to the granularity of the mirror target, then rebasing the
diff image onto a backing image will not yield the desired result,
because the full cluster is considered to be allocated and will "hide"
some part of the base/backing image. The failure can be seen by either
using 'write-blocking' mode in the IO test or setting the (bitmap)
granularity to 32 KiB rather than the current 64 KiB.
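
The failure mode can be demonstrated with a toy model (illustration only; CLUSTER, write_to_diff and rebase are invented names, not QEMU/qcow2 code): a write that does not cover a whole target cluster still allocates that cluster in the diff image, and the zero-filled remainder then masks the base image when rebasing.

```python
CLUSTER = 8  # toy cluster size; think 64 KiB in the real setup

def write_to_diff(diff, offset, data):
    """Write into a sparse 'diff image' (dict: cluster index -> bytes).
    Allocating a cluster zero-fills whatever the write does not cover."""
    for i, byte in enumerate(data):
        pos = offset + i
        cluster = diff.setdefault(pos // CLUSTER, bytearray(CLUSTER))
        cluster[pos % CLUSTER] = byte

def rebase(diff, base):
    """Rebase the diff onto a base image: any allocated cluster in the
    diff completely hides the corresponding cluster of the base."""
    out = bytearray(base)
    for index, cluster in diff.items():
        out[index * CLUSTER:(index + 1) * CLUSTER] = cluster
    return bytes(out)
```

A 2-byte write into an otherwise clean cluster leaves the rest of that cluster as zeroes after rebasing, even though the base image had data there.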

For the latter case, patch 4/4 adds a check. For the former, the
limitation is documented (I'd expect this to be a niche use case in
practice).

[0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
[1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/


Fabian Grünbichler (1):
  iotests: add test for bitmap mirror

Fiona Ebner (2):
  qapi/block-core: avoid the re-use of MirrorSyncMode for backup
  blockdev: mirror: check for target's cluster size when using bitmap

John Snow (1):
  mirror: allow specifying working bitmap

 block/backup.c                                |   18 +-
 block/mirror.c                                |  102 +-
 block/monitor/block-hmp-cmds.c                |    2 +-
 block/replication.c                           |    2 +-
 blockdev.c                                    |   84 +-
 include/block/block_int-global-state.h        |    7 +-
 qapi/block-core.json                          |   64 +-
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  571
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2946 +
 tests/unit/test-block-iothread.c              |    2 +-
 10 files changed, 3729 insertions(+), 69 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

-- 
2.39.2





[PATCH v2 3/4] iotests: add test for bitmap mirror

2024-03-07 Thread Fiona Ebner
From: Fabian Grünbichler 

heavily based on/practically forked off iotest 257 for bitmap backups,
but:

- no writes to filter node 'mirror-top' between completion and
finalization, as those seem to deadlock?
- extra set of reference/test mirrors to verify that writes in parallel
with active mirror work

Intentionally keeping copyright and ownership of original test case to
honor provenance.

The test was originally adapted by Fabian from 257, but has seen
rather big changes, because the interface for mirror with bitmap was
changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as
the working bitmap.

Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0
 adapt to changes to mirror bitmap interface
 rename test from '384' to 'bitmap-sync-mirror']
Signed-off-by: Fiona Ebner 
---
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  565 
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2939 +
 2 files changed, 3504 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror
new file mode 100755
index 00..898f1f4ba4
--- /dev/null
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror
@@ -0,0 +1,565 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=js...@redhat.com
+
+import math
+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+
+
+class Pattern:
+def __init__(self, byte, offset, size=GRANULARITY):
+self.byte = byte
+self.offset = offset
+self.size = size
+
+def bits(self, granularity):
+lower = self.offset // granularity
+upper = (self.offset + self.size - 1) // granularity
+return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+"""Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+def __init__(self, patterns):
+self.patterns = patterns
+
+def bits(self, granularity):
+"""Calculate the unique bits dirtied by this pattern grouping"""
+res = set()
+for pattern in self.patterns:
+res |= pattern.bits(granularity)
+return res
+
+
+GROUPS = [
+PatternGroup([
+# Batch 0: 4 clusters
+Pattern('0x49', 0x000),
+Pattern('0x6c', 0x010),   # 1M
+Pattern('0x6f', 0x200),   # 32M
+Pattern('0x76', 0x3ff)]), # 64M - 64K
+PatternGroup([
+# Batch 1: 6 clusters (3 new)
+Pattern('0x65', 0x000),   # Full overwrite
+Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K)
+PatternGroup([
+# Batch 2: 7 clusters (3 new)
+Pattern('0x74', 0x001),   # Adjacent-right
+Pattern('0x69', 0x00e8000),   # Partial-left  (1M-96K)
+Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+Pattern('0x67', 0x3fe,
+2*GRANULARITY)]), # Overwrite [(64M-128K)-64M)
+PatternGroup([
+# Batch 3: 8 clusters (5 new)
+# Carefully chosen such that nothing re-dirties the one cluster
+# that copies out successfully before failure in Group #1.
+Pattern('0xaa', 0x001,
+3*GRANULARITY),   # Overwrite and 2x Adjacent-right
+Pattern('0xbb', 0x00d8000),   # Partial-left (1M-160K)
+Pattern('0xcc', 0x2028000),   # Partial-right (32M+160K)
+Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right
+]
+
+
+class EmulatedBitmap:
+def __init__(self, granularity=GRANULARITY):
+self._bits = set()
+self.granularity = granularity
+
+def dirty_bits(self, bits):
+self._bits

[PATCH v2 2/4] mirror: allow specifying working bitmap

2024-03-07 Thread Fiona Ebner
From: John Snow 

for the mirror job. The bitmap's granularity is used as the job's
granularity.

The new @bitmap parameter is marked unstable in the QAPI and can
currently only be used for @sync=full mode.

Clusters initially dirty in the bitmap as well as new writes are
copied to the target.

Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API,
callers can simulate the three kinds of @BitmapSyncMode (which is used
by backup):
1. always: default, just pass bitmap as working bitmap.
2. never: copy bitmap and pass copy to the mirror job.
3. on-success: copy bitmap and pass copy to the mirror job and if
   successful, merge bitmap into original afterwards.
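
The three emulations can be sketched with Python sets standing in for dirty bitmaps (a toy model; the real building blocks are the block-dirty-bitmap-add/clear/merge QMP commands, and all helper names here are invented):

```python
def run_mirror(working_bitmap, fail=False):
    """Toy mirror job: copies dirty clusters, clearing their bits in the
    working bitmap as it goes; on failure some bits remain dirty."""
    if fail:
        working_bitmap.discard(min(working_bitmap))  # one cluster made it
        return False
    working_bitmap.clear()
    return True

def mirror_always(user_bitmap, fail=False):
    # 'always': the user's bitmap is the working bitmap; afterwards it
    # tracks exactly the remaining (not successfully copied) work
    return run_mirror(user_bitmap, fail)

def mirror_never(user_bitmap, fail=False):
    # 'never': merge into a temporary bitmap and mirror that; the
    # user's bitmap is never modified
    return run_mirror(set(user_bitmap), fail)

def mirror_on_success(user_bitmap, fail=False):
    # 'on-success': mirror a temporary copy; only on success sync the
    # user's bitmap to the copy's final state (clear + merge) -- one
    # plausible reading of the mode, for illustration
    working = set(user_bitmap)
    ok = run_mirror(working, fail)
    if ok:
        user_bitmap.clear()
        user_bitmap |= working
    return ok
```

On failure, only 'always' leaves the user's bitmap reflecting the remaining work; the other two modes leave it untouched.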

When the target image is a fresh "diff image", i.e. one that was not
used as the target of a previous mirror and the target image's cluster
size is larger than the bitmap's granularity, or when
@copy-mode=write-blocking is used, there is a pitfall, because the
cluster in the target image will be allocated, but not contain all the
data corresponding to the same region in the source image.

An idea to avoid the limitation would be to mark clusters which are
affected by unaligned writes and are not allocated in the target image
dirty, so they would be copied fully later. However, for migration,
the invariant that an actively synced mirror stays actively synced
(unless an error happens) is useful, because without that invariant,
migration might inactivate block devices while the mirror still has
work to do and run into an assertion failure [0].

Another approach would be to read the missing data from the source
upon unaligned writes to be able to write the full target cluster
instead.

But certain targets like NBD do not allow querying the cluster size.
To avoid limiting/breaking the use case of syncing to an existing
target, which is arguably more common than the diff image use case,
document the limitation in QAPI.

This patch was originally based on one by Ma Haocong, but it has since
been modified pretty heavily, first by John and then again by Fiona.

[0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/

Suggested-by: Ma Haocong 
Signed-off-by: Ma Haocong 
Signed-off-by: John Snow 
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0
 get rid of bitmap mode parameter
 use caller-provided bitmap as working bitmap
 turn bitmap parameter experimental]
Signed-off-by: Fiona Ebner 
---
 block/mirror.c                         | 95 --
 blockdev.c                             | 39 +--
 include/block/block_int-global-state.h |  5 +-
 qapi/block-core.json                   | 37 +-
 tests/unit/test-block-iothread.c       |  2 +-
 5 files changed, 146 insertions(+), 32 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1609354db3..5c9a00b574 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -51,7 +51,7 @@ typedef struct MirrorBlockJob {
 BlockDriverState *to_replace;
 /* Used to block operations on the drive-mirror-replace target */
 Error *replace_blocker;
-bool is_none_mode;
+MirrorSyncMode sync_mode;
 BlockMirrorBackingMode backing_mode;
 /* Whether the target image requires explicit zero-initialization */
 bool zero_target;
@@ -73,6 +73,11 @@ typedef struct MirrorBlockJob {
 size_t buf_size;
 int64_t bdev_length;
 unsigned long *cow_bitmap;
+/*
+ * Whether the bitmap is created locally or provided by the caller (for
+ * incremental sync).
+ */
+bool dirty_bitmap_is_local;
 BdrvDirtyBitmap *dirty_bitmap;
 BdrvDirtyBitmapIter *dbi;
 uint8_t *buf;
@@ -687,7 +692,11 @@ static int mirror_exit_common(Job *job)
 bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs);
 }
 
-bdrv_release_dirty_bitmap(s->dirty_bitmap);
+if (s->dirty_bitmap_is_local) {
+bdrv_release_dirty_bitmap(s->dirty_bitmap);
+} else {
+bdrv_enable_dirty_bitmap(s->dirty_bitmap);
+}
 
 /* Make sure that the source BDS doesn't go away during bdrv_replace_node,
  * before we can call bdrv_drained_end */
@@ -718,7 +727,8 @@ static int mirror_exit_common(Job *job)
  &error_abort);
 
 if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
-BlockDriverState *backing = s->is_none_mode ? src : s->base;
+BlockDriverState *backing;
+backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base;
 BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs);
 
 if (bdrv_cow_bs(unfiltered_target) != backing) {
@@ -815,6 +825,16 @@ static void mirror_abort(Job *job)
 assert(ret == 0);
 }
 
+/* Always called after commit/abort. */
+static void mirror_clean(Job *job)
+{
+MirrorBlockJob *s = container_of(job, MirrorBloc

[PATCH v2 4/4] blockdev: mirror: check for target's cluster size when using bitmap

2024-03-07 Thread Fiona Ebner
When using mirror with a bitmap and the target is a diff image, i.e.
one that should only contain the delta and was not synced to
previously, a too large cluster size for the target can be
problematic. In particular, when the mirror sends data to the target
aligned to the job's granularity, but not aligned to the larger target
image's cluster size, the target's cluster would be allocated but only
be filled partially. When rebasing such a diff image later, the
corresponding cluster of the base image would get "masked" and the
part of the cluster not in the diff image is not accessible anymore.

Unfortunately, it is not always possible to check the target image's
cluster size, e.g. when it's NBD. Because the limitation is already
documented in the QAPI description for the @bitmap parameter and the
check is only required for the special diff image use case, simply
skip the check then.
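
The condition enforced by the new check boils down to the following arithmetic (a Python sketch of the C logic, not the patch itself):

```python
def bitmap_granularity_ok(granularity, target_cluster_size):
    """With a diff-image target, every chunk the job copies must cover
    whole target clusters, so the bitmap granularity must be at least
    the target's cluster size and a multiple of it."""
    return (granularity >= target_cluster_size
            and granularity % target_cluster_size == 0)
```

This is why a 32 KiB bitmap granularity is rejected against the 64 KiB cluster size used in the iotest.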

Signed-off-by: Fiona Ebner 
---
 blockdev.c                                    | 19 +++
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  6 ++
 .../qemu-iotests/tests/bitmap-sync-mirror.out |  7 +++
 3 files changed, 32 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index c76eb97a4c..968d44cd3b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2847,6 +2847,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
 }
 
 if (bitmap_name) {
+BlockDriverInfo bdi;
+uint32_t bitmap_granularity;
+
 if (sync != MIRROR_SYNC_MODE_FULL) {
 error_setg(errp, "Sync mode '%s' not supported with bitmap.",
MirrorSyncMode_str(sync));
@@ -2863,6 +2866,22 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
 return;
 }
 
+bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap);
+/*
+ * Ignore if unable to get the info, e.g. when target is NBD. It's only
+ * relevant for syncing to a diff image and the documentation already
+ * states that the target's cluster size needs to be small enough then.
+ */
+if (bdrv_get_info(target, &bdi) >= 0) {
+if (bitmap_granularity < bdi.cluster_size ||
+bitmap_granularity % bdi.cluster_size != 0) {
+error_setg(errp, "Bitmap granularity %u is not a multiple of "
+   "the target image's cluster size %u",
+   bitmap_granularity, bdi.cluster_size);
+return;
+}
+}
+
 if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) {
 return;
 }
diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror
index 898f1f4ba4..cbd5cc99cc 100755
--- a/tests/qemu-iotests/tests/bitmap-sync-mirror
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror
@@ -552,6 +552,12 @@ def test_mirror_api():
 bitmap=bitmap)
 log('')
 
+log("-- Test bitmap with too small granularity --\n".format(sync_mode))
+vm.qmp_log("block-dirty-bitmap-add", node=drive0.node,
+   name="bitmap-small", granularity=(GRANULARITY // 2))
+blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full",
+job_id='api_job', bitmap="bitmap-small")
+log('')
 
 def main():
 for bsync_mode in ("never", "on-success", "always"):
diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror.out b/tests/qemu-iotests/tests/bitmap-sync-mirror.out
index c05b4788c6..d40ea7d689 100644
--- a/tests/qemu-iotests/tests/bitmap-sync-mirror.out
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror.out
@@ -2937,3 +2937,10 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fmirror3" ==> Identical, OK!
 {"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap0", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "none", "target": "mirror_target"}}
 {"error": {"class": "GenericError", "desc": "Sync mode 'none' not supported with bitmap."}}
 
+-- Test bitmap with too small granularity --
+
+{"execute": "block-dirty-bitmap-add", "arguments": {"granularity": 32768, "name": "bitmap-small", "node": "drive0"}}
+{"return": {}}
+{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap-small", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "full", "target": "mirror_target"}}
+{"error": {"class": "GenericError", "desc": "Bitmap granularity 32768 is not a multiple of the target image's cluster size 65536"}}
+
-- 
2.39.2





Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-03-06 Thread Fiona Ebner
Am 29.02.24 um 13:47 schrieb Fiona Ebner:
> Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy:
>> On 29.02.24 13:11, Fiona Ebner wrote:
>>>
>>> The iotest creates a new target image for each incremental sync which
>>> only records the diff relative to the previous mirror and those diff
>>> images are later rebased onto each other to get the full picture.
>>>
>>> Thus, it can be that a previous mirror job (not just background process
>>> or previous write) already copied a cluster, and in particular, copied
>>> it to a different target!
>>
>> Aha understand.
>>
>> For simplicity, let's consider case, when source "cluster size" = "job
>> cluster size" = "bitmap granularity" = "target cluster size".
>>
>> Which types of clusters we should consider, when we want to handle guest
>> write?
>>
>> 1. Clusters, that should be copied by background process
>>
>> These are dirty clusters from user-given bitmap, or if we do a full-disk
>> mirror, all clusters, not yet copied by background process.
>>
>> For such clusters we simply ignore the unaligned write. We can even
>> ignore the aligned write too: less disturbing the guest by delays.
>>
> 
> Since do_sync_target_write() currently doesn't ignore aligned writes, I
> wouldn't change it. Of course they can count towards the "done_bitmap"
> you propose below.
> 
>> 2. Clusters, already copied by background process during this mirror job
>> and not dirtied by guest since this time.
>>
>> For such clusters we are safe to do unaligned write, as target cluster
>> must be allocated.
>>
> 
> Right.
> 
>> 3. Clusters, not marked initially by dirty bitmap.
>>
>> What to do with them? We can't do unaligned write. I see two variants:
>>
>> - do additional read from source, to fill the whole cluster, which seems
>> a bit too heavy
>>
> 
> Yes, I'd rather only do that as a last resort.
> 
>> - just mark the cluster as dirty for background job. So we behave like
>> in "background" mode. But why not? The maximum count of such "hacks" is
>> limited to number of "clear" clusters at start of mirror job, which
>> means that we don't seriously affect the convergence. Mirror is
>> guaranteed to converge anyway. And the whole sense of "write-blocking"
>> mode is to have a guaranteed convergence. What do you think?
>>
> 
> It could lead to a lot of flips between job->actively_synced == true and
> == false. AFAIU, currently, we only switch back from true to false when
> an error happens. While I don't see a concrete issue with it, at least
> it might be unexpected to users, so it better be documented.
> 
> I'll try going with this approach, thanks!
> 

These flips are actually a problem. When using live migration with disk
mirroring, it's good that an actively synced image stays actively
synced. Otherwise, migration could finish at an inconvenient time and
try to inactivate the block device while the mirror still has something
to do, which would lead to an assertion failure [0].

The IO test added by this series is what uses the possibility to sync to
"diff images" which contain only the delta. In production, we only sync
to a previously mirrored target image. Unaligned writes are not an issue
for such a target, unlike with a diff image (even if the initial
mirroring happened via ZFS replication outside of QEMU).

So copy-mode=write-blocking would work fine for our use case, but if I
go with the "mark clusters with unaligned writes dirty" approach, it
would not work fine anymore.

Should I rather just document the limitation for the combination "target
is a diff image" and copy-mode=write-blocking?

I'd still add the check for the bitmap granularity against the target
cluster size. While it is also only needed for diff images, it would
allow using background mode safely for those.

Best Regards,
Fiona

[0]:
https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-03-03 Thread Fiona Ebner
On 01.03.24 at 16:46, Vladimir Sementsov-Ogievskiy wrote:
> On 01.03.24 18:14, Fiona Ebner wrote:
>> On 01.03.24 at 16:02, Vladimir Sementsov-Ogievskiy wrote:
>>>>> About documentation: actually, I never liked that we use for backup
>>>>> job
>>>>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
>>>>> values supported only by backup.
>>>>>
>>>>> I'm also unsure how mode=full&bitmap=some_bitmap differs from
>>>>> mode=bitmap&bitmap=some_bitmap..
>>>>>
>>>>
>>>> With the current patches, it was an error to specify @bitmap for other
>>>> modes than 'incremental' and 'bitmap'.
>>>
>>> Current documentation says:
>>>    # @bitmap: The name of a dirty bitmap to use.  Must be present if
>>> sync
>>>    # is "bitmap" or "incremental". Can be present if sync is "full"
>>>    # or "top".  Must not be present otherwise.
>>>    # (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>>>
>>>
>>
>> This is for backup. The documentation (and behavior) for @bitmap added
>> by these patches for mirror is different ;)
> 
> I meant backup in "I'm also unsure", just as one more point not consider
> backup-bitmap-API as a prototype for mirror-bitmap-API.
> 

Oh, I see. Sorry for the confusion!

Best Regards,
Fiona




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-03-01 Thread Fiona Ebner
On 01.03.24 at 16:02, Vladimir Sementsov-Ogievskiy wrote:
> On 01.03.24 17:52, Fiona Ebner wrote:
>> On 01.03.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> As we already understood, (block-)job-api needs some spring-cleaning.
>>> Unfortunately I don't have much time on it, but still I decided to start
>>> from finally deprecating the block-job-* API and moving to job-*. Probably
>>> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in
>>> mind new block-dirty-bitmap-merge api.
>>>
>>> So, what I could advise in this situation for new interfaces:
>>>
>>> 1. be minimalistic
>>> 2. add `x-` prefix when unsure
>>>
>>> So, following these two rules, what about x-bitmap field, which may be
>>> combined only with 'full' mode, and do what you need?
>>>
>>
>> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
>> doesn't need to be renamed if it becomes stable later.
> 
> Right, unstable feature is needed, using "x-" is optional.
> 
> Recent discussion about it was in my "vhost-user-blk: live resize
> additional APIs" series:
> 
> https://patchew.org/QEMU/20231006202045.1161543-1-vsement...@yandex-team.ru/20231006202045.1161543-5-vsement...@yandex-team.ru/
> 
> Following it, I think it's OK to not care anymore with "x-" prefixes,
> and rely on unstable feature.
> 

Thanks for the confirmation! I'll go without the prefix in the name then.

>>
>>> About documentation: actually, I never liked that we use for backup job
>>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
>>> values supported only by backup.
>>>
>>> I'm also unsure how mode=full&bitmap=some_bitmap differs from
>>> mode=bitmap&bitmap=some_bitmap..
>>>
>>
>> With the current patches, it was an error to specify @bitmap for other
>> modes than 'incremental' and 'bitmap'.
> 
> Current documentation says:
>   # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
>   # is "bitmap" or "incremental". Can be present if sync is "full"
>   # or "top".  Must not be present otherwise.
>   # (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
> 
> 

This is for backup. The documentation (and behavior) for @bitmap added
by these patches for mirror is different ;)

Best Regards,
Fiona




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-03-01 Thread Fiona Ebner
On 01.03.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote:
> 
> As we already understood, (block-)job-api needs some spring-cleaning.
> Unfortunately I don't have much time on it, but still I decided to start
> from finally deprecating the block-job-* API and moving to job-*. Probably
> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in
> mind new block-dirty-bitmap-merge api.
> 
> So, what I could advise in this situation for new interfaces:
> 
> 1. be minimalistic
> 2. add `x-` prefix when unsure
> 
> So, following these two rules, what about x-bitmap field, which may be
> combined only with 'full' mode, and do what you need?
> 

AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
doesn't need to be renamed if it becomes stable later.

> About documentation: actually, I never liked that we use for backup job
> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two
> values supported only by backup.
> 
> I'm also unsure how mode=full&bitmap=some_bitmap differs from
> mode=bitmap&bitmap=some_bitmap..
> 

With the current patches, it was an error to specify @bitmap for other
modes than 'incremental' and 'bitmap'.

> So, I'd suggest simply rename MirrorSyncMode to BackupSyncMode, and add
> separate MirrorSyncMode with only "full", "top" and "none" values.
> 

Sounds good to me!

[0]:
https://gitlab.com/qemu-project/qemu/-/commit/a3c45b3e62962f99338716b1347cfb0d427cea44

Best Regards,
Fiona




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-02-29 Thread Fiona Ebner
On 29.02.24 at 13:00, Vladimir Sementsov-Ogievskiy wrote:
> 
> But anyway, this all could be simply achieved with
> bitmap-copying/merging API, if we allow to pass user-given bitmap to the
> mirror as working bitmap.
> 
>>
>> I see, I'll drop the 'bitmap-mode' in the next version if nobody
>> complains :)
>>
> 
> Good. It's a golden rule: never make public interfaces which you don't
> actually need for production. I myself sometimes violate it and spend
> extra time on developing features, which we later have to just drop as
> "not needed downstream, no sense in upstreaming".
> 

Just wondering which new mode I should allow for @MirrorSyncMode
then? The documentation states:

> # @incremental: only copy data described by the dirty bitmap.
> # (since: 2.4)
> #
> # @bitmap: only copy data described by the dirty bitmap.  (since: 4.2)
> # Behavior on completion is determined by the BitmapSyncMode.

For backup, do_backup_common() just maps @incremental to @bitmap +
@bitmap-mode == @on-success.

Using @bitmap for mirror would lead to being at odds with the
documentation, because it mentions the BitmapSyncMode, which mirror
won't have.

Using @incremental for mirror would be consistent with the
documentation, but behave a bit differently from backup.

Opinions?

Best Regards,
Fiona




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-02-29 Thread Fiona Ebner
On 29.02.24 at 12:48, Vladimir Sementsov-Ogievskiy wrote:
> On 29.02.24 13:11, Fiona Ebner wrote:
>>
>> The iotest creates a new target image for each incremental sync which
>> only records the diff relative to the previous mirror and those diff
>> images are later rebased onto each other to get the full picture.
>>
>> Thus, it can be that a previous mirror job (not just background process
>> or previous write) already copied a cluster, and in particular, copied
>> it to a different target!
> 
> Ah, understood.
> 
> For simplicity, let's consider case, when source "cluster size" = "job
> cluster size" = "bitmap granularity" = "target cluster size".
> 
> Which types of clusters we should consider, when we want to handle guest
> write?
> 
> 1. Clusters, that should be copied by background process
> 
> These are dirty clusters from user-given bitmap, or if we do a full-disk
> mirror, all clusters, not yet copied by background process.
> 
> For such clusters we simply ignore the unaligned write. We can even
> ignore the aligned write too: less disturbing the guest by delays.
> 

Since do_sync_target_write() currently doesn't ignore aligned writes, I
wouldn't change it. Of course they can count towards the "done_bitmap"
you propose below.

> 2. Clusters, already copied by background process during this mirror job
> and not dirtied by guest since this time.
> 
> For such clusters we are safe to do unaligned write, as target cluster
> must be allocated.
> 

Right.

> 3. Clusters, not marked initially by dirty bitmap.
> 
> What to do with them? We can't do unaligned write. I see two variants:
> 
> - do additional read from source, to fill the whole cluster, which seems
> a bit too heavy
> 

Yes, I'd rather only do that as a last resort.

> - just mark the cluster as dirty for background job. So we behave like
> in "background" mode. But why not? The maximum count of such "hacks" is
> limited to number of "clear" clusters at start of mirror job, which
> means that we don't seriously affect the convergence. Mirror is
> guaranteed to converge anyway. And the whole sense of "write-blocking"
> mode is to have a guaranteed convergence. What do you think?
> 

It could lead to a lot of flips between job->actively_synced == true and
== false. AFAIU, currently, we only switch back from true to false when
an error happens. While I don't see a concrete issue with it, at least
it might be unexpected to users, so it better be documented.

I'll try going with this approach, thanks!

> 
> 
> 
> Of course, we can't distinguish 3 types by on dirty bitmap, so we need
> the second one. For example "done_bitmap", where we can mark clusters
> that were successfully copied. That would be a kind of block-status of
> target image. But using bitmap is a lot better than querying
> block-status from target.
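
The three cluster types above, together with the proposed "done_bitmap",
can be sketched as a write-handling decision. A hypothetical Python
model (not QEMU code; names and return values are illustrative):

```python
def handle_guest_write(cluster: int, dirty_bitmap: set, done_bitmap: set) -> str:
    """Decide how to handle an unaligned guest write to one cluster.

    dirty_bitmap: clusters still to be copied by the background process
    done_bitmap:  clusters already copied during this mirror job
    """
    if cluster in dirty_bitmap:
        # Type 1: background process will copy it anyway,
        # so the unaligned write can simply be ignored.
        return "ignore"
    if cluster in done_bitmap:
        # Type 2: already copied during this job, so the target
        # cluster must be allocated; unaligned write is safe.
        return "write-unaligned"
    # Type 3: initially clear in the user-given bitmap; mark it
    # dirty and fall back to background-mode behavior.
    dirty_bitmap.add(cluster)
    return "mark-dirty"
```

Note that the "mark-dirty" branch is what can flip
job->actively_synced back from true to false, as discussed below.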

Best Regards,
Fiona




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-02-29 Thread Fiona Ebner
On 28.02.24 at 17:24, Vladimir Sementsov-Ogievskiy wrote:
> On 16.02.24 13:55, Fiona Ebner wrote:
>> Previous discussion from when this was sent upstream [0] (it's been a
>> while). I rebased the patches and re-ordered and squashed like
>> suggested back then [1].
>>
>> This implements two new mirror modes:
>>
>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>> - incremental mirror mode as sugar for bitmap + on-success
>>
>> Use cases:
>> * Possibility to resume a failed mirror later.
>> * Possibility to only mirror deltas to a previously mirrored volume.
>> * Possibility to (efficiently) mirror a drive that was previously
>>    mirrored via some external mechanism (e.g. ZFS replication).
>>
>> We are using the last one in production without any issues for about
>> 4 years now. In particular, as mentioned in [2]:
>>
>>> - create bitmap(s)
>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>> - incrementally drive mirror as part of a live migration of VM
>>> - drop bitmap(s)
> 
> Actually which mode you use, "never", "always" or "conditional"? Or in
> downstream you have different approach?
> 

We are using "conditional", but I think we don't really require any
specific mode, because we drop the bitmaps after mirroring (even in
the failure case). Fabian, please correct me if I'm wrong.

> Why am I asking:
> 
> These modes (for backup) were developed prior to
> block-dirty-bitmap-merge command, which allowed to copy bitmaps as you
> want. With that API, we actually don't need all these modes, instead
> it's enough to pass a bitmap, which would be _actually_ used by mirror.
> 
> So, if you need "never" mode, you just copy your bitmap by
> block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to
> mirror job.
> 
> Or, you pass your bitmap to mirror-job, and have a "always" mode.
> 
> And I don't see, why we need a "conditional" mode, which actually just
> drops away the progress we actually made. (OK, we failed, but why to
> drop the progress of successfully copied clusters?)
> 

I'm not sure actually. Maybe John remembers?

I see, I'll drop the 'bitmap-mode' in the next version if nobody
complains :)

> 
> Using user-given bitmap in the mirror job has also an additional
> advantage of live progress: up to visualization of disk copying by
> visualization of the dirty bitmap contents.
> 

Best Regards,
Fiona




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-02-29 Thread Fiona Ebner
On 28.02.24 at 17:06, Vladimir Sementsov-Ogievskiy wrote:
> On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote:
>> On 16.02.24 13:55, Fiona Ebner wrote:
>>> Now, the IO test added in patch 4/4 actually contains yet another use
>>> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
>>> images, that only contain the delta and can be rebased later. I had to
>>> adapt the IO test, because its output expected the mirror bitmap to
>>> still be dirty, but nowadays the mirror is apparently already done
>>> when the bitmaps are queried. So I thought, I'll just use
>>> 'write-blocking' mode to avoid any potential timing issues.
>>>
>>> But this exposed an issue with the diff image approach. If a write is
>>> not aligned to the granularity of the mirror target, then rebasing the
>>> diff image onto a backing image will not yield the desired result,
>>> because the full cluster is considered to be allocated and will "hide"
>>> some part of the base/backing image. The failure can be seen by either
>>> using 'write-blocking' mode in the IO test or setting the (bitmap)
>>> granularity to 32 KiB rather than the current 64 KiB.
>>>
>>> The question is how to deal with these edge cases? Some possibilities
>>> that would make sense to me:
>>>
>>> For 'background' mode:
>>> * prohibit if target's cluster size is larger than the bitmap
>>>    granularity
>>> * document the limitation
>>>
>>> For 'write-blocking' mode:
>>> * disallow in combination with bitmap mode (would not be happy about
>>>    it, because I'd like to use this without diff images)
>>
>> why not just require the same: bitmap granularity must be >= target
>> granularity
>>

For the iotest's use-case, that only works for background mode. I'll
explain below.

>>> * for writes that are not aligned to the target's cluster size, read
>>>    the relevant/missing parts from the source image to be able to write
>>>    whole target clusters (seems rather complex)
>>
>> There is another approach: consider an unaligned part of the request
>> that fits in one cluster (we can always split any request into an "aligned"
>> middle part and at most two small "unaligned" parts, each fitting into one
>> cluster).
>>
>> We have two possibilities:
>>
>> 1. the cluster is dirty (marked dirty in the bitmap used by background
>> process)
>>
>> We can simply ignore this part and rely on background process. This
>> will not affect the convergence of the mirror job.
>>

Agreed.

>> 2. the cluster is clear (i.e. background process, or some previous
>> write already copied it)
>>

The iotest creates a new target image for each incremental sync which
only records the diff relative to the previous mirror and those diff
images are later rebased onto each other to get the full picture.

Thus, it can be that a previous mirror job (not just background process
or previous write) already copied a cluster, and in particular, copied
it to a different target!

>> In this case, we are safe to do unaligned write, as target cluster
>> must be allocated.

Because the diff image is new, the target's cluster is not necessarily
allocated. When using write-blocking and a write of, e.g., 9 bytes to a
clear source cluster comes in, only those 9 bytes are written to the
target. Now the target's cluster is allocated but with only those 9
bytes of data. When rebasing, the previously copied cluster is "masked"
and when reading the rebased image, we only see the cluster with those 9
bytes (and IIRC, zeroes for the rest of the cluster rather than the
previously copied data).
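
The masking effect can be modeled in a few lines of Python (a toy model
of qcow2 rebase semantics, not QEMU code; names are illustrative):

```python
CLUSTER = 64 * 1024  # qcow2 default cluster size

def read_after_rebase(diff_allocated: bool, diff_data: bytes,
                      backing_data: bytes) -> bytes:
    # An allocated cluster in the diff image fully masks the backing
    # cluster; bytes never written to it read as zeroes, not as
    # backing data.
    return diff_data if diff_allocated else backing_data

backing = b"\xaa" * CLUSTER      # data copied by a previous mirror job
diff = bytearray(CLUSTER)        # freshly created diff image cluster
diff[:9] = b"\xbb" * 9           # unaligned 9-byte guest write

result = read_after_rebase(True, bytes(diff), backing)
assert result[:9] == b"\xbb" * 9                  # the new write is visible
assert result[9:] == b"\x00" * (CLUSTER - 9)      # backing data is "hidden"
```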

>>
>> (for bitmap-mode, I don't consider here clusters that are clear from
>> the start, which we shouldn't copy in any case)
>>

We do need to copy new writes to any cluster, and with a clear cluster
and write-blocking, the issue can manifest.

> 
> Hmm, right, and that's exactly the logic we already have in
> do_sync_target_write(). So that's enough just to require that
> bitmap_granularity >= target_granularity
> 

Best Regards,
Fiona




Re: [RFC 1/4] drive-mirror: add support for sync=bitmap mode=never

2024-02-21 Thread Fiona Ebner
On 21.02.24 at 07:55, Markus Armbruster wrote:
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index ab5a93a966..ac05483958 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2181,6 +2181,15 @@
>>  # destination (all the disk, only the sectors allocated in the
>>  # topmost image, or only new I/O).
>>  #
>> +# @bitmap: The name of a bitmap to use for sync=bitmap mode.  This
>> +# argument must be present for bitmap mode and absent otherwise.
>> +# The bitmap's granularity is used instead of @granularity.
>> +# (Since 9.0).
> 
> What happens when the user specifies @granularity anyway?  Error or
> silently ignored?
> 

It's an error:

>> +if (bitmap) {
>> +if (granularity) {
>> +error_setg(errp, "granularity (%d)"
>> +   "cannot be specified when a bitmap is provided",
>> +   granularity);
>> +return NULL;
>> +}

>> +#
>> +# @bitmap-mode: Specifies the type of data the bitmap should contain
>> +# after the operation concludes.  Must be present if sync is
>> +# "bitmap".  Must NOT be present otherwise.  (Since 9.0)
> 
> Members that must be present when and only when some enum member has a
> certain value should perhaps be in a union branch.  Perhaps the block
> maintainers have an opinion here.
> 

Sounds sensible to me. Considering also the next patches, in the end it
could be a union discriminated by the @sync which contains @bitmap and
@bitmap-mode when it's the 'bitmap' sync mode, @bitmap when it's the
'incremental' sync mode (@bitmap-sync mode needs to be 'on-success'
then, so there is no choice for the user) and which contains
@granularity for the other sync modes.

Best Regards,
Fiona




[RFC 1/4] drive-mirror: add support for sync=bitmap mode=never

2024-02-16 Thread Fiona Ebner
From: John Snow 

This patch adds support for the "BITMAP" sync mode to drive-mirror and
blockdev-mirror. It adds support only for the BitmapSyncMode "never,"
because it's the simplest mode.

This mode simply uses a user-provided bitmap as an initial copy
manifest, and then does not clear any bits in the bitmap at the
conclusion of the operation.

Any new writes dirtied during the operation are copied out, in contrast
to backup. Note that whether these writes are reflected in the bitmap
at the conclusion of the operation depends on whether that bitmap is
actually recording!

This patch was originally based on one by Ma Haocong, but it has since
been modified pretty heavily.

Suggested-by: Ma Haocong 
Signed-off-by: Ma Haocong 
Signed-off-by: John Snow 
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0
 update version and formatting in QAPI]
Signed-off-by: Fiona Ebner 
---
 block/mirror.c | 96 --
 blockdev.c | 38 +-
 include/block/block_int-global-state.h |  4 +-
 qapi/block-core.json   | 25 ++-
 tests/unit/test-block-iothread.c   |  4 +-
 5 files changed, 139 insertions(+), 28 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 5145eb53e1..315dff11e2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -51,7 +51,7 @@ typedef struct MirrorBlockJob {
 BlockDriverState *to_replace;
 /* Used to block operations on the drive-mirror-replace target */
 Error *replace_blocker;
-bool is_none_mode;
+MirrorSyncMode sync_mode;
 BlockMirrorBackingMode backing_mode;
 /* Whether the target image requires explicit zero-initialization */
 bool zero_target;
@@ -73,6 +73,8 @@ typedef struct MirrorBlockJob {
 size_t buf_size;
 int64_t bdev_length;
 unsigned long *cow_bitmap;
+BdrvDirtyBitmap *sync_bitmap;
+BitmapSyncMode bitmap_mode;
 BdrvDirtyBitmap *dirty_bitmap;
 BdrvDirtyBitmapIter *dbi;
 uint8_t *buf;
@@ -718,7 +720,8 @@ static int mirror_exit_common(Job *job)
  &error_abort);
 
 if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
-BlockDriverState *backing = s->is_none_mode ? src : s->base;
+BlockDriverState *backing;
+backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base;
 BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs);
 
 if (bdrv_cow_bs(unfiltered_target) != backing) {
@@ -815,6 +818,16 @@ static void mirror_abort(Job *job)
 assert(ret == 0);
 }
 
+/* Always called after commit/abort. */
+static void mirror_clean(Job *job)
+{
+MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
+
+if (s->sync_bitmap) {
+bdrv_dirty_bitmap_set_busy(s->sync_bitmap, false);
+}
+}
+
 static void coroutine_fn mirror_throttle(MirrorBlockJob *s)
 {
 int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1011,7 +1024,8 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 mirror_free_init(s);
 
 s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-if (!s->is_none_mode) {
+if ((s->sync_mode == MIRROR_SYNC_MODE_TOP) ||
+(s->sync_mode == MIRROR_SYNC_MODE_FULL)) {
 ret = mirror_dirty_init(s);
 if (ret < 0 || job_is_cancelled(&s->common.job)) {
 goto immediate_exit;
@@ -1302,6 +1316,7 @@ static const BlockJobDriver mirror_job_driver = {
 .run= mirror_run,
 .prepare= mirror_prepare,
 .abort  = mirror_abort,
+.clean  = mirror_clean,
 .pause  = mirror_pause,
 .complete   = mirror_complete,
 .cancel = mirror_cancel,
@@ -1320,6 +1335,7 @@ static const BlockJobDriver commit_active_job_driver = {
 .run= mirror_run,
 .prepare= mirror_prepare,
 .abort  = mirror_abort,
+.clean  = mirror_clean,
 .pause  = mirror_pause,
 .complete   = mirror_complete,
 .cancel = commit_active_cancel,
@@ -1712,7 +1728,10 @@ static BlockJob *mirror_start_job(
  BlockCompletionFunc *cb,
  void *opaque,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base,
+ MirrorSyncMode sync_mode,
+ BdrvDirtyBitmap *bitmap,
+ BitmapSyncMode bitmap_mode,
+ BlockDriverState *base,
  bool auto_complete, const char *filter_n

[RFC 3/4] mirror: move some checks to qmp

2024-02-16 Thread Fiona Ebner
From: Fabian Grünbichler 

and assert the passing conditions in block/mirror.c. While incremental
mode was never available for drive-mirror, this makes the interface more
uniform w.r.t. backup block jobs.

Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0]
Signed-off-by: Fiona Ebner 
---
 block/mirror.c | 28 +---
 blockdev.c | 29 +
 2 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 84155b1f78..15d1c060eb 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1755,31 +1755,13 @@ static BlockJob *mirror_start_job(
 
 GLOBAL_STATE_CODE();
 
-if (sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
-error_setg(errp, "Sync mode '%s' not supported",
-   MirrorSyncMode_str(sync_mode));
-return NULL;
-} else if (sync_mode == MIRROR_SYNC_MODE_BITMAP) {
-if (!bitmap) {
-error_setg(errp, "Must provide a valid bitmap name for '%s'"
-   " sync mode",
-   MirrorSyncMode_str(sync_mode));
-return NULL;
-}
-} else if (bitmap) {
-error_setg(errp,
-   "sync mode '%s' is not compatible with bitmaps",
-   MirrorSyncMode_str(sync_mode));
-return NULL;
-}
+/* QMP interface protects us from these cases */
+assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
+assert((bitmap && sync_mode == MIRROR_SYNC_MODE_BITMAP) ||
+   (!bitmap && sync_mode != MIRROR_SYNC_MODE_BITMAP));
+assert(!(bitmap && granularity));
 
 if (bitmap) {
-if (granularity) {
-error_setg(errp, "granularity (%d)"
-   "cannot be specified when a bitmap is provided",
-   granularity);
-return NULL;
-}
 granularity = bdrv_dirty_bitmap_granularity(bitmap);
 
 if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) {
diff --git a/blockdev.c b/blockdev.c
index aeb9fde9f3..519f408359 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2852,7 +2852,36 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 sync = MIRROR_SYNC_MODE_FULL;
 }
 
+if ((sync == MIRROR_SYNC_MODE_BITMAP) ||
+(sync == MIRROR_SYNC_MODE_INCREMENTAL)) {
+/* done before desugaring 'incremental' to print the right message */
+if (!bitmap_name) {
+error_setg(errp, "Must provide a valid bitmap name for "
+   "'%s' sync mode", MirrorSyncMode_str(sync));
+return;
+}
+}
+
+if (sync == MIRROR_SYNC_MODE_INCREMENTAL) {
+if (has_bitmap_mode &&
+bitmap_mode != BITMAP_SYNC_MODE_ON_SUCCESS) {
+error_setg(errp, "Bitmap sync mode must be '%s' "
+   "when using sync mode '%s'",
+   BitmapSyncMode_str(BITMAP_SYNC_MODE_ON_SUCCESS),
+   MirrorSyncMode_str(sync));
+return;
+}
+has_bitmap_mode = true;
+sync = MIRROR_SYNC_MODE_BITMAP;
+bitmap_mode = BITMAP_SYNC_MODE_ON_SUCCESS;
+}
+
 if (bitmap_name) {
+if (sync != MIRROR_SYNC_MODE_BITMAP) {
+error_setg(errp, "Sync mode '%s' not supported with bitmap.",
+   MirrorSyncMode_str(sync));
+return;
+}
 if (granularity) {
 error_setg(errp, "Granularity and bitmap cannot both be set");
 return;
-- 
2.39.2
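
The QMP-level checks and desugaring in blockdev_mirror_common() above
can be sketched as follows (a simplified Python model, not QEMU code;
the error strings follow the patch):

```python
def desugar_sync(sync: str, bitmap_name, bitmap_mode):
    """Model of the checks above: 'incremental' is sugar for
    'bitmap' + bitmap-mode 'on-success'."""
    if sync in ("bitmap", "incremental") and not bitmap_name:
        # done before desugaring 'incremental' to print the right message
        raise ValueError(
            f"Must provide a valid bitmap name for '{sync}' sync mode")
    if sync == "incremental":
        if bitmap_mode is not None and bitmap_mode != "on-success":
            raise ValueError("Bitmap sync mode must be 'on-success' "
                             "when using sync mode 'incremental'")
        return "bitmap", "on-success"
    if bitmap_name and sync != "bitmap":
        raise ValueError(f"Sync mode '{sync}' not supported with bitmap.")
    return sync, bitmap_mode

# 'incremental' desugars to 'bitmap' + 'on-success'
assert desugar_sync("incremental", "bitmap0", None) == ("bitmap", "on-success")
```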





[RFC 4/4] iotests: add test for bitmap mirror

2024-02-16 Thread Fiona Ebner
From: Fabian Grünbichler 

heavily based on/practically forked off iotest 257 for bitmap backups,
but:

- no writes to filter node 'mirror-top' between completion and
finalization, as those seem to deadlock?
- no inclusion of not-yet-available full/top sync modes in combination
with bitmaps
- extra set of reference/test mirrors to verify that writes in parallel
with active mirror work

intentionally keeping copyright and ownership of original test case to
honor provenance.

Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0, i.e. adapt to renames like vm.command -> vm.cmd,
 specifying explicit image format for rebase,
 adapt to new behavior of qemu_img(),
 dropping of 'status' field in output, etc.
 rename test from '384' to 'bitmap-sync-mirror']
Signed-off-by: Fiona Ebner 
---
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  550 
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 +
 2 files changed, 3360 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror 
b/tests/qemu-iotests/tests/bitmap-sync-mirror
new file mode 100755
index 00..6cd9b74dac
--- /dev/null
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror
@@ -0,0 +1,550 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=js...@redhat.com
+
+import math
+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+
+
+class Pattern:
+    def __init__(self, byte, offset, size=GRANULARITY):
+        self.byte = byte
+        self.offset = offset
+        self.size = size
+
+    def bits(self, granularity):
+        lower = self.offset // granularity
+        upper = (self.offset + self.size - 1) // granularity
+        return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+    """Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+    def __init__(self, patterns):
+        self.patterns = patterns
+
+    def bits(self, granularity):
+        """Calculate the unique bits dirtied by this pattern grouping"""
+        res = set()
+        for pattern in self.patterns:
+            res |= pattern.bits(granularity)
+        return res
+
+
+GROUPS = [
+    PatternGroup([
+        # Batch 0: 4 clusters
+        Pattern('0x49', 0x0000000),
+        Pattern('0x6c', 0x0100000),   # 1M
+        Pattern('0x6f', 0x2000000),   # 32M
+        Pattern('0x76', 0x3ff0000)]), # 64M - 64K
+    PatternGroup([
+        # Batch 1: 6 clusters (3 new)
+        Pattern('0x65', 0x0000000),   # Full overwrite
+        Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+        Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+        Pattern('0x69', 0x3fe0000)]), # Adjacent-left (64M - 128K)
+    PatternGroup([
+        # Batch 2: 7 clusters (3 new)
+        Pattern('0x74', 0x0010000),   # Adjacent-right
+        Pattern('0x69', 0x00e8000),   # Partial-left  (1M-96K)
+        Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+        Pattern('0x67', 0x3fe0000,
+                2*GRANULARITY)]),     # Overwrite [(64M-128K)-64M)
+    PatternGroup([
+        # Batch 3: 8 clusters (5 new)
+        # Carefully chosen such that nothing re-dirties the one cluster
+        # that copies out successfully before failure in Group #1.
+        Pattern('0xaa', 0x0010000,
+                3*GRANULARITY),       # Overwrite and 2x Adjacent-right
+        Pattern('0xbb', 0x00d8000),   # Partial-left (1M-160K)
+        Pattern('0xcc', 0x2028000),   # Partial-right (32M+160K)
+        Pattern('0xdd', 0x3fc0000)]), # New; leaving a gap to the right
+]
+
+
+class EmulatedBitmap:
+    def __init__(self, granularity=GRANULARITY):
+        self._bits = set()
+        self.granularity = granularity
+
+    def dirty_bits(self, bits):
+ 

[RFC 0/4] mirror: implement incremental and bitmap modes

2024-02-16 Thread Fiona Ebner
Previous discussion from when this was sent upstream [0] (it's been a
while). I rebased the patches and re-ordered and squashed them as
suggested back then [1].

This implements two new mirror modes:

- bitmap mirror mode with always/on-success/never bitmap sync mode
- incremental mirror mode as sugar for bitmap + on-success

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for
about 4 years now. In particular, as mentioned in [2]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)


Now, the IO test added in patch 4/4 actually contains yet another use
case, namely doing incremental mirrors to stand-alone qcow2 "diff"
images, which only contain the delta and can be rebased later. I had to
adapt the IO test, because its output expected the mirror bitmap to
still be dirty, but nowadays the mirror is apparently already done
when the bitmaps are queried. So I thought, I'll just use
'write-blocking' mode to avoid any potential timing issues.

But this exposed an issue with the diff image approach. If a write is
not aligned to the granularity of the mirror target, then rebasing the
diff image onto a backing image will not yield the desired result,
because the full cluster is considered to be allocated and will "hide"
some part of the base/backing image. The failure can be seen by either
using 'write-blocking' mode in the IO test or setting the (bitmap)
granularity to 32 KiB rather than the current 64 KiB.
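To illustrate the failure mode with a bit of arithmetic (a standalone
sketch, not QEMU code): any copy that does not cover a whole target
cluster still allocates the full cluster in the diff image, and every
byte in that cluster which the copy did not write will shadow the
backing image after a rebase.

```python
def shadowed_bytes(offset, size, cluster_size):
    """Bytes of backing data hidden when copying [offset, offset+size)
    allocates whole clusters of cluster_size in the diff image."""
    start = (offset // cluster_size) * cluster_size
    end = -(-(offset + size) // cluster_size) * cluster_size  # ceil division
    return (end - start) - size

KiB = 1024
# 64 KiB bitmap granularity into 64 KiB target clusters: no problem.
assert shadowed_bytes(0, 64 * KiB, 64 * KiB) == 0
# 32 KiB granularity into 64 KiB clusters: each unaligned copy hides
# 32 KiB of the base image behind zeroes in the diff image.
assert shadowed_bytes(0, 32 * KiB, 64 * KiB) == 32 * KiB
```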

The question is how to deal with these edge cases. Some possibilities
that would make sense to me:

For 'background' mode:
* prohibit if target's cluster size is larger than the bitmap
  granularity
* document the limitation

For 'write-blocking' mode:
* disallow in combination with bitmap mode (would not be happy about
  it, because I'd like to use this without diff images)
* for writes that are not aligned to the target's cluster size, read
  the relevant/missing parts from the source image to be able to write
  whole target clusters (seems rather complex)
* document the limitation
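The read-from-source option would essentially be a read-modify-write at
target cluster granularity. A rough sketch of the required alignment
math (hypothetical helper, not the actual mirror code):

```python
def align_to_clusters(offset, size, cluster_size):
    """Expand a write request to whole target clusters. The returned
    head and tail paddings would have to be read from the source image
    before whole clusters can be written to the target."""
    aligned_start = (offset // cluster_size) * cluster_size
    aligned_end = -(-(offset + size) // cluster_size) * cluster_size
    head = offset - aligned_start
    tail = aligned_end - (offset + size)
    return aligned_start, aligned_end - aligned_start, head, tail

KiB = 1024
# A 32 KiB write at offset 80 KiB with 64 KiB target clusters needs
# 16 KiB read from the source on each side:
assert align_to_clusters(80 * KiB, 32 * KiB, 64 * KiB) == \
    (64 * KiB, 64 * KiB, 16 * KiB, 16 * KiB)
```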


[0]: 
https://lore.kernel.org/qemu-devel/20200218100740.2228521-1-f.gruenbich...@proxmox.com/
[1]: 
https://lore.kernel.org/qemu-devel/d35a76de-78d5-af56-0b34-f7bd2bbd3...@redhat.com/
[2]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/


Fabian Grünbichler (2):
  mirror: move some checks to qmp
  iotests: add test for bitmap mirror

John Snow (2):
  drive-mirror: add support for sync=bitmap mode=never
  drive-mirror: add support for conditional and always bitmap sync modes

 block/mirror.c|   94 +-
 blockdev.c|   70 +-
 include/block/block_int-global-state.h|4 +-
 qapi/block-core.json  |   25 +-
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  550 
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 +
 tests/unit/test-block-iothread.c  |4 +-
 7 files changed, 3527 insertions(+), 30 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

-- 
2.39.2





[RFC 2/4] drive-mirror: add support for conditional and always bitmap sync modes

2024-02-16 Thread Fiona Ebner
From: John Snow 

Teach mirror two new tricks for using bitmaps:

Always: no matter what, we synchronize the copy_bitmap back to the
sync_bitmap. In effect, this allows us to resume a failed mirror at a
later date.

Conditional: On success only, we sync the bitmap. This is akin to
incremental backup modes; we can use this bitmap to later refresh a
successfully created mirror.

Originally-by: John Snow 
[FG: add check for bitmap-mode without bitmap
 switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0]
Signed-off-by: Fiona Ebner 
---

The original patch this was based on came from a WIP git branch and
thus has no Signed-off-by trailer from John, see [0]. I added an
Originally-by trailer for now. Let me know if I should drop that and
wait for John's Signed-off-by instead.

[0] https://lore.kernel.org/qemu-devel/1599140071.n44h532eeu.astr...@nora.none/

 block/mirror.c | 24 ++--
 blockdev.c |  3 +++
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 315dff11e2..84155b1f78 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -689,8 +689,6 @@ static int mirror_exit_common(Job *job)
 bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs);
 }
 
-bdrv_release_dirty_bitmap(s->dirty_bitmap);
-
 /* Make sure that the source BDS doesn't go away during bdrv_replace_node,
  * before we can call bdrv_drained_end */
 bdrv_ref(src);
@@ -796,6 +794,18 @@ static int mirror_exit_common(Job *job)
 bdrv_drained_end(target_bs);
 bdrv_unref(target_bs);
 
+if (s->sync_bitmap) {
+if (s->bitmap_mode == BITMAP_SYNC_MODE_ALWAYS ||
+(s->bitmap_mode == BITMAP_SYNC_MODE_ON_SUCCESS &&
+ job->ret == 0 && ret == 0)) {
+/* Success; synchronize copy back to sync. */
+bdrv_clear_dirty_bitmap(s->sync_bitmap, NULL);
+bdrv_dirty_bitmap_merge_internal(s->sync_bitmap, s->dirty_bitmap,
+ NULL, true);
+}
+}
+bdrv_release_dirty_bitmap(s->dirty_bitmap);
+
 bs_opaque->job = NULL;
 
 bdrv_drained_end(src);
@@ -1755,10 +1765,6 @@ static BlockJob *mirror_start_job(
" sync mode",
MirrorSyncMode_str(sync_mode));
 return NULL;
-} else if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) {
-error_setg(errp,
-   "Bitmap Sync Mode '%s' is not supported by Mirror",
-   BitmapSyncMode_str(bitmap_mode));
 }
 } else if (bitmap) {
 error_setg(errp,
@@ -1775,6 +1781,12 @@ static BlockJob *mirror_start_job(
 return NULL;
 }
 granularity = bdrv_dirty_bitmap_granularity(bitmap);
+
+if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) {
+if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) {
+return NULL;
+}
+}
 } else if (granularity == 0) {
 granularity = bdrv_get_default_bitmap_granularity(target);
 }
diff --git a/blockdev.c b/blockdev.c
index c65d9ded70..aeb9fde9f3 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2873,6 +2873,9 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) {
 return;
 }
+} else if (has_bitmap_mode) {
+error_setg(errp, "Cannot specify bitmap sync mode without a bitmap");
+return;
 }
 
 if (!replaces) {
-- 
2.39.2





[PATCH] iotests: adapt to output change for recently introduced 'detached header' field

2024-02-16 Thread Fiona Ebner
Failure was noticed when running the tests for the qcow2 image format.

Fixes: 0bd779e27e ("crypto: Introduce 'detached-header' field in 
QCryptoBlockInfoLUKS")
Signed-off-by: Fiona Ebner 
---
 tests/qemu-iotests/198.out | 2 ++
 tests/qemu-iotests/206.out | 1 +
 2 files changed, 3 insertions(+)

diff --git a/tests/qemu-iotests/198.out b/tests/qemu-iotests/198.out
index 805494916f..62fb73fa3e 100644
--- a/tests/qemu-iotests/198.out
+++ b/tests/qemu-iotests/198.out
@@ -39,6 +39,7 @@ Format specific information:
 compression type: COMPRESSION_TYPE
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha256
 cipher alg: aes-256
 uuid: ----
@@ -84,6 +85,7 @@ Format specific information:
 compression type: COMPRESSION_TYPE
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha256
 cipher alg: aes-256
 uuid: ----
diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out
index 7e95694777..979f00f9bf 100644
--- a/tests/qemu-iotests/206.out
+++ b/tests/qemu-iotests/206.out
@@ -114,6 +114,7 @@ Format specific information:
 refcount bits: 16
 encrypt:
 ivgen alg: plain64
+detached header: false
 hash alg: sha1
 cipher alg: aes-128
 uuid: ----
-- 
2.39.2





Re: double free or corruption (out) in iscsi virtual machine

2024-02-15 Thread Fiona Ebner
Am 17.01.24 um 08:23 schrieb M_O_Bz:
> Basic Info:
> 1. Issue: I got a "double free or corruption (out)"; see the attached
> debug.log for details, which prints the backtrace of one virtual machine
> 2. Reproduce: currently I can't describe how to reproduce this bug,
> because it's in my production environment which includes some special stuffs
> 3. qemu version: I'm using qemu-6.0.1
> 4. qemu cmdline in short: (check out details in the virtual machine log
> message)

Hi,
sounds like it might be the issue fixed by:
https://github.com/qemu/qemu/commit/5080152e2ef6cde7aa692e29880c62bd54acb750

Best Regards,
Fiona




Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter

2024-01-26 Thread Fiona Ebner
Am 25.01.24 um 18:22 schrieb Vladimir Sementsov-Ogievskiy:
> 
> Hmm. Taking maximum is not optimal for usual case without
> discard-source: user may want to work in smaller granularity than
> source, to save disk space.
> 
> In case with discarding we have two possibilities:
> 
> - either take larger granularity for the whole process like you propose
> (but this will need and option for CBW?)
> - or, fix discarding bitmap in CBW to work like normal discard: it
> should be aligned down. This will lead actually to discard-source option
> doing nothing..
> 
> ==
> But why do you want fleecing image with larger granularity? Is that a
> real case or just experimenting? Still we should fix assertion anyway.
> 

Yes, it's a real use case. We do support different storage types and
want to allow users to place the fleecing image on a different storage
than the original image for flexibility.

I ran into the issue when backing up to a target with 1 MiB cluster_size
while using a fleecing image on RBD (which has 4 MiB cluster_size by
default).

In theory, I guess I could look into querying the cluster_size of the
backup target and trying to allocate the fleecing image with a small
enough cluster_size. But not sure if that would work on all storage
combinations, and would require teaching our storage plugin API (which
also supports third-party plugins) to perform allocation with a specific
cluster size. So not an ideal solution for us.

> I think:
> 
> 1. fix discarding bitmap to make aligning-down (will do that for v3)
> 

Thanks!

> 2. if we need another logic for block_copy_calculate_cluster_size() it
> should be an option. May be explicit "copy-cluster-size" or
> "granularity" option for CBW driver and for backup job. And we'll just
> check that given cluster-size is power of two >= target_size.
> 

I'll try to implement point 2. That should resolve the issue for our use
case.

Best Regards,
Fiona




Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter

2024-01-25 Thread Fiona Ebner
Am 24.01.24 um 16:03 schrieb Fiona Ebner:
> Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy:
>> Add a parameter that enables discard-after-copy. That is mostly useful
>> in "push backup with fleecing" scheme, when source is snapshot-access
>> format driver node, based on copy-before-write filter snapshot-access
>> API:
>>
>> [guest]                   [snapshot-access] ~~ blockdev-backup ~~> [backup target]
>>    |                              |
>>    | root                         | file
>>    v                              v
>> [copy-before-write]
>>    |             |
>>    | file        | target
>>    v             v
>> [active disk]   [temp.img]
>>
>> In this case discard-after-copy does two things:
>>
>>  - discard data in temp.img to save disk space
>>  - avoid further copy-before-write operation in discarded area
>>
>> Note that we have to declare WRITE permission on source in
>> copy-before-write filter, for discard to work.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> 
> Ran into another issue when the cluster_size of the fleecing image is
> larger than for the backup target, e.g.
> 
>> #!/bin/bash
>> rm /tmp/fleecing.qcow2
>> ./qemu-img create /tmp/disk.qcow2 -f qcow2 1G
>> ./qemu-img create /tmp/fleecing.qcow2 -o cluster_size=2M -f qcow2 1G
>> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G
>> ./qemu-system-x86_64 --qmp stdio \
>> --blockdev 
>> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \
>> --blockdev 
>> qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2,discard=unmap
>>  \
>> --blockdev 
>> qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \
>> <<EOF
>> {"execute": "qmp_capabilities"}
>> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", 
>> "file": "node0", "target": "node1", "node-name": "node3" } }
>> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", 
>> "file": "node3", "discard": "unmap", "node-name": "snap0" } }
>> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": 
>> "node2", "sync": "full", "job-id": "backup0", "discard-source": true } }
>> EOF
> 
> will fail with
> 
>> qemu-system-x86_64: ../util/hbitmap.c:570: hbitmap_reset: Assertion 
>> `QEMU_IS_ALIGNED(count, gran) || (start + count == hb->orig_size)' failed.
> 
> Backtrace shows the assert happens while discarding, when resetting the
> BDRVCopyBeforeWriteState access_bitmap
>  > #6  0x56142a2a in hbitmap_reset (hb=0x57e01b80, start=0,
> count=1048576) at ../util/hbitmap.c:570
>> #7  0x55f80764 in bdrv_reset_dirty_bitmap_locked 
>> (bitmap=0x5850a660, offset=0, bytes=1048576) at 
>> ../block/dirty-bitmap.c:563
>> #8  0x55f807ab in bdrv_reset_dirty_bitmap (bitmap=0x5850a660, 
>> offset=0, bytes=1048576) at ../block/dirty-bitmap.c:570
>> #9  0x55f7bb16 in cbw_co_pdiscard_snapshot (bs=0x581a7f60, 
>> offset=0, bytes=1048576) at ../block/copy-before-write.c:330
>> #10 0x55f8d00a in bdrv_co_pdiscard_snapshot (bs=0x581a7f60, 
>> offset=0, bytes=1048576) at ../block/io.c:3734
>> #11 0x55fd2380 in snapshot_access_co_pdiscard (bs=0x582b4f60, 
>> offset=0, bytes=1048576) at ../block/snapshot-access.c:55
>> #12 0x55f8b65d in bdrv_co_pdiscard (child=0x584fe790, offset=0, 
>> bytes=1048576) at ../block/io.c:3144
>> #13 0x55f78650 in block_copy_task_entry (task=0x57f588f0) at 
>> ../block/block-copy.c:597
> 
> My guess for the cause is that in block_copy_calculate_cluster_size() we
> only look at the target. But now that we need to discard the source,
> we'll also need to consider that for the calculation?
> 

Just querying the source and picking the maximum won't work either,
because snapshot-access does not currently implement .bdrv_co_get_info
and because copy-before-write (doesn't implement .bdrv_co_get_info and
is a filter) will just return the info of its file child. But the
discard will go to the target child.

If I do

1. .bdrv_co_get_info in snapshot-access: return info from file child
2. .bdrv_co_get_info in copy-before-write: return maximum cluster_size
from file child and target child
3. block_copy_calculate_cluster_size: return maximum from source and target

then the issue does go away, but I don't know if that's not violating
any assumptions and probably there's a better way to avoid the issue?
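Assuming the approach from points 1-3 above, the cluster size selection
could look roughly like this (a sketch with hypothetical names, not the
actual block_copy_calculate_cluster_size()):

```python
def copy_cluster_size(source_cluster, target_cluster, discard_source):
    """Sketch: without discard-source, only the target's cluster size
    matters; with it, discards on the source must also be aligned to
    the source's clusters, so take the maximum of both sides."""
    size = target_cluster
    if discard_source:
        size = max(size, source_cluster)
    assert size & (size - 1) == 0, "cluster size must be a power of two"
    return size

MiB = 1024 * 1024
# 4 MiB RBD fleecing image, 1 MiB backup target:
assert copy_cluster_size(4 * MiB, 1 * MiB, discard_source=True) == 4 * MiB
assert copy_cluster_size(4 * MiB, 1 * MiB, discard_source=False) == 1 * MiB
```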

Best Regards,
Fiona




Re: [PATCH 2/2] virtio: Keep notifications disabled during drain

2024-01-25 Thread Fiona Ebner
Am 24.01.24 um 18:38 schrieb Hanna Czenczek:
> During drain, we do not care about virtqueue notifications, which is why
> we remove the handlers on it.  When removing those handlers, whether vq
> notifications are enabled or not depends on whether we were in polling
> mode or not; if not, they are enabled (by default); if so, they have
> been disabled by the io_poll_start callback.
> 
> Because we do not care about those notifications after removing the
> handlers, this is fine.  However, we have to explicitly ensure they are
> enabled when re-attaching the handlers, so we will resume receiving
> notifications.  We do this in virtio_queue_aio_attach_host_notifier*().
> If such a function is called while we are in a polling section,
> attaching the notifiers will then invoke the io_poll_start callback,
> re-disabling notifications.
> 
> Because we will always miss virtqueue updates in the drained section, we
> also need to poll the virtqueue once after attaching the notifiers.
> 
> Buglink: https://issues.redhat.com/browse/RHEL-3934
> Signed-off-by: Hanna Czenczek 

Tested-by: Fiona Ebner 
Reviewed-by: Fiona Ebner 




Re: [PATCH 1/2] virtio-scsi: Attach event vq notifier with no_poll

2024-01-25 Thread Fiona Ebner
Am 24.01.24 um 18:38 schrieb Hanna Czenczek:
> As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi:
> don't waste CPU polling the event virtqueue"), we only attach an io_read
> notifier for the virtio-scsi event virtqueue instead, and no polling
> notifiers.  During operation, the event virtqueue is typically
> non-empty, but none of the buffers are intended to be used immediately.
> Instead, they only get used when certain events occur.  Therefore, it
> makes no sense to continuously poll it when non-empty, because it is
> supposed to be and stay non-empty.
> 
> We do this by using virtio_queue_aio_attach_host_notifier_no_poll()
> instead of virtio_queue_aio_attach_host_notifier() for the event
> virtqueue.
> 
> Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement
> BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use
> virtio_queue_aio_attach_host_notifier() for all virtqueues, including
> the event virtqueue.  This can lead to it being polled again, undoing
> the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552.
> 
> Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the
> event virtqueue.
> 
> Reported-by: Fiona Ebner 
> Fixes: 766aa2de0f29b657148e04599320d771c36fd126
>    ("virtio-scsi: implement BlockDevOps->drained_begin()")
> Signed-off-by: Hanna Czenczek 

Tested-by: Fiona Ebner 
Reviewed-by: Fiona Ebner 




Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter

2024-01-24 Thread Fiona Ebner
Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy:
> Add a parameter that enables discard-after-copy. That is mostly useful
> in "push backup with fleecing" scheme, when source is snapshot-access
> format driver node, based on copy-before-write filter snapshot-access
> API:
> 
> [guest]                   [snapshot-access] ~~ blockdev-backup ~~> [backup target]
>    |                              |
>    | root                         | file
>    v                              v
> [copy-before-write]
>    |             |
>    | file        | target
>    v             v
> [active disk]   [temp.img]
> 
> In this case discard-after-copy does two things:
> 
>  - discard data in temp.img to save disk space
>  - avoid further copy-before-write operation in discarded area
> 
> Note that we have to declare WRITE permission on source in
> copy-before-write filter, for discard to work.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Ran into another issue when the cluster_size of the fleecing image is
larger than for the backup target, e.g.

> #!/bin/bash
> rm /tmp/fleecing.qcow2
> ./qemu-img create /tmp/disk.qcow2 -f qcow2 1G
> ./qemu-img create /tmp/fleecing.qcow2 -o cluster_size=2M -f qcow2 1G
> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \
> --blockdev 
> qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2,discard=unmap
>  \
> --blockdev 
> qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", 
> "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", 
> "file": "node3", "discard": "unmap", "node-name": "snap0" } }
> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": 
> "node2", "sync": "full", "job-id": "backup0", "discard-source": true } }
> EOF

will fail with

> qemu-system-x86_64: ../util/hbitmap.c:570: hbitmap_reset: Assertion 
> `QEMU_IS_ALIGNED(count, gran) || (start + count == hb->orig_size)' failed.

Backtrace shows the assert happens while discarding, when resetting the
BDRVCopyBeforeWriteState access_bitmap
> #6  0x56142a2a in hbitmap_reset (hb=0x57e01b80, start=0, count=1048576) at ../util/hbitmap.c:570
> #7  0x55f80764 in bdrv_reset_dirty_bitmap_locked 
> (bitmap=0x5850a660, offset=0, bytes=1048576) at 
> ../block/dirty-bitmap.c:563
> #8  0x55f807ab in bdrv_reset_dirty_bitmap (bitmap=0x5850a660, 
> offset=0, bytes=1048576) at ../block/dirty-bitmap.c:570
> #9  0x55f7bb16 in cbw_co_pdiscard_snapshot (bs=0x581a7f60, 
> offset=0, bytes=1048576) at ../block/copy-before-write.c:330
> #10 0x55f8d00a in bdrv_co_pdiscard_snapshot (bs=0x581a7f60, 
> offset=0, bytes=1048576) at ../block/io.c:3734
> #11 0x55fd2380 in snapshot_access_co_pdiscard (bs=0x582b4f60, 
> offset=0, bytes=1048576) at ../block/snapshot-access.c:55
> #12 0x55f8b65d in bdrv_co_pdiscard (child=0x584fe790, offset=0, 
> bytes=1048576) at ../block/io.c:3144
> #13 0x55f78650 in block_copy_task_entry (task=0x57f588f0) at 
> ../block/block-copy.c:597

My guess for the cause is that in block_copy_calculate_cluster_size() we
only look at the target. But now that we need to discard the source,
we'll also need to consider that for the calculation?

Best Regards,
Fiona




[PATCH] block/io_uring: improve error message when init fails

2024-01-23 Thread Fiona Ebner
The man page for io_uring_queue_init states:

> io_uring_queue_init(3) returns 0 on success and -errno on failure.

and the man page for io_uring_setup (which is one of the functions
where the return value of io_uring_queue_init() can come from) states:

> On error, a negative error code is returned. The caller should not
> rely on errno variable.

Tested using 'sysctl kernel.io_uring_disabled=2'. Output before this
change:

> failed to init linux io_uring ring

Output after this change:

> failed to init linux io_uring ring: Operation not permitted

Signed-off-by: Fiona Ebner 
---
 block/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/io_uring.c b/block/io_uring.c
index d77ae55745..d11b2051ab 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -432,7 +432,7 @@ LuringState *luring_init(Error **errp)
 
 rc = io_uring_queue_init(MAX_ENTRIES, ring, 0);
 if (rc < 0) {
-error_setg_errno(errp, errno, "failed to init linux io_uring ring");
+error_setg_errno(errp, -rc, "failed to init linux io_uring ring");
 g_free(s);
 return NULL;
 }
-- 
2.39.2





Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-23 Thread Fiona Ebner
Am 22.01.24 um 18:52 schrieb Hanna Czenczek:
> On 22.01.24 18:41, Hanna Czenczek wrote:
>> On 05.01.24 15:30, Fiona Ebner wrote:
>>> Am 05.01.24 um 14:43 schrieb Fiona Ebner:
>>>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini:
>>>>> On 1/3/24 12:40, Fiona Ebner wrote:
>>>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>>>>>> with the patch, but I did run into an assertion failure when
>>>>>> trying to
>>>>>> verify that it fixes my original stuck-guest-IO issue. See below
>>>>>> for the
>>>>>> backtrace [0]. Hanna wrote in
>>>>>> https://issues.redhat.com/browse/RHEL-3934
>>>>>>
>>>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call,
>>>>>>> because
>>>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>>>>>> acquire the device’s context, so this should be directly callable
>>>>>>> from
>>>>>>> any context.
>>>>>> I guess this is not true anymore now that the AioContext locking was
>>>>>> removed?
>>>>> Good point and, in fact, even before it was much safer to use
>>>>> virtio_queue_notify() instead.  Not only does it use the event
>>>>> notifier
>>>>> handler, but it also calls it in the right thread/AioContext just by
>>>>> doing event_notifier_set().
>>>>>
>>>> But with virtio_queue_notify() using the event notifier, the
>>>> CPU-usage-spike issue is present:
>>>>
>>>>>> Back to the CPU-usage-spike issue: I experimented around and it
>>>>>> doesn't
>>>>>> seem to matter whether I notify the virt queue before or after
>>>>>> attaching
>>>>>> the notifiers. But there's another functional difference. My patch
>>>>>> called virtio_queue_notify() which contains this block:
>>>>>>
>>>>>>>  if (vq->host_notifier_enabled) {
>>>>>>>  event_notifier_set(&vq->host_notifier);
>>>>>>>  } else if (vq->handle_output) {
>>>>>>>  vq->handle_output(vdev, vq);
>>>>>> In my testing, the first branch was taken, calling
>>>>>> event_notifier_set().
>>>>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>>>>> vq->handle_output() will be called. That seems to be the relevant
>>>>>> difference regarding the CPU-usage-spike issue.
>>>> I should mention that this is with a VirtIO SCSI disk. I also attempted
>>>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
>>>> didn't manage yet.
>>>>
>>>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
>>>> the queues (but only one) will always show as nonempty. And then,
>>>> run_poll_handlers_once() will always detect progress which explains the
>>>> CPU usage.
>>>>
>>>> The following shows
>>>> 1. vq address
>>>> 2. number of times vq was passed to
>>>> virtio_queue_host_notifier_aio_poll()
>>>> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
>>>> was true for the vq
>>>>
>>>>> 0x555fd93f9c80 17162000 0
>>>>> 0x555fd93f9e48 17162000 6
>>>>> 0x555fd93f9ee0 17162000 0
>>>>> 0x555fd93f9d18 17162000 17162000
>>>>> 0x555fd93f9db0 17162000 0
>>>>> 0x555fd93f9f78 17162000 0
>>>> And for the problematic one, the reason it is seen as nonempty is:
>>>>
>>>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
>>> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
>>> s->events_dropped is false in my testing, so
>>> virtio_scsi_handle_event_vq() doesn't do anything.
>>>
>>>> Those values stay like this while the call counts above increase.
>>>>
>>>> So something going wrong with the indices when the event notifier is
>>>> set
>>>> from QEMU side (in the main thread)?
>>>>
>>>> The guest is Debian 12 with a 6.1 kernel.
>>
>> So, trying to figure out a new RFC version:
>

Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter

2024-01-19 Thread Fiona Ebner
Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy:
> Add a parameter that enables discard-after-copy. That is mostly useful
> in "push backup with fleecing" scheme, when source is snapshot-access
> format driver node, based on copy-before-write filter snapshot-access
> API:
> 
> [guest]                   [snapshot-access] ~~ blockdev-backup ~~> [backup target]
>    |                              |
>    | root                         | file
>    v                              v
> [copy-before-write]
>    |             |
>    | file        | target
>    v             v
> [active disk]   [temp.img]
> 
> In this case discard-after-copy does two things:
> 
>  - discard data in temp.img to save disk space
>  - avoid further copy-before-write operation in discarded area
> 
> Note that we have to declare WRITE permission on source in
> copy-before-write filter, for discard to work.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Unfortunately, setting BLK_PERM_WRITE unconditionally breaks
blockdev-backup for a read-only node (even when not using discard-source):

> #!/bin/bash
> ./qemu-img create /tmp/disk.raw -f raw 1G
> ./qemu-img create /tmp/backup.raw -f raw 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw,read-only=true
>  \
> --blockdev raw,node-name=node1,file.driver=file,file.filename=/tmp/backup.raw 
> \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-backup", "arguments": { "device": "node0", "target": 
> "node1", "sync": "full", "job-id": "backup0" } }
> EOF

The above works before applying this patch, but leads to an error
afterwards:

> {"error": {"class": "GenericError", "desc": "Block node is read-only"}}

Best Regards,
Fiona




Re: [PATCH 3/3] monitor: only run coroutine commands in qemu_aio_context

2024-01-18 Thread Fiona Ebner
Am 16.01.24 um 20:00 schrieb Stefan Hajnoczi:
> monitor_qmp_dispatcher_co() runs in the iohandler AioContext that is not
> polled during nested event loops. The coroutine currently reschedules
> itself in the main loop's qemu_aio_context AioContext, which is polled
> during nested event loops. One known problem is that QMP device-add
> calls drain_call_rcu(), which temporarily drops the BQL, leading to all
> sorts of havoc like other vCPU threads re-entering device emulation code
> while another vCPU thread is waiting in device emulation code with
> aio_poll().
> 
> Paolo Bonzini suggested running non-coroutine QMP handlers in the
> iohandler AioContext. This avoids trouble with nested event loops. His
> original idea was to move coroutine rescheduling to
> monitor_qmp_dispatch(), but I resorted to moving it to qmp_dispatch()
> because we don't know if the QMP handler needs to run in coroutine
> context in monitor_qmp_dispatch(). monitor_qmp_dispatch() would have
> been nicer since it's associated with the monitor implementation and not
> as general as qmp_dispatch(), which is also used by qemu-ga.
> 
> A number of qemu-iotests need updated .out files because the order of
> QMP events vs QMP responses has changed.
> 
> Solves Issue #1933.
> 
> Fixes: 7bed89958bfbf40df9ca681cefbdca63abdde39d ("device_core: use 
> drain_call_rcu in in qmp_device_add")
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2215192
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2214985
> Buglink: https://issues.redhat.com/browse/RHEL-17369
> Signed-off-by: Stefan Hajnoczi 

With the patch I can no longer see any do_qmp_dispatch_bh() calls run in
vCPU threads.

I also did a bit of smoke testing with some other QMP and QGA commands
and didn't find any obvious breakage, so:

Tested-by: Fiona Ebner 

P.S.

Unfortunately, the patch does not solve the other issue I came across
back then [0] with snapshot_save_job_bh() being executed during a vCPU
thread's aio_poll(). See also [1].

I suppose this would need some other mechanism to solve or could it also
be scheduled to the iohandler AioContext? It's not directly related to
your patch of course, just mentioning it, because it's a similar theme.

[0]:
https://lore.kernel.org/qemu-devel/31757c45-695d-4408-468c-c2de560af...@proxmox.com/
[1]: https://gitlab.com/qemu-project/qemu/-/issues/2111

Best Regards,
Fiona




[PATCH] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status

2024-01-16 Thread Fiona Ebner
Using fleecing backup like in [0] on a qcow2 image (with metadata
preallocation) can lead to the following assertion failure:

> bdrv_co_do_block_status: Assertion `!(ret & BDRV_BLOCK_ZERO)' failed.

In the reproducer [0], it happens because the BDRV_BLOCK_RECURSE flag
will be set by the qcow2 driver, so the caller will recursively check
the file child. Then the BDRV_BLOCK_ZERO set too. Later up the call
chain, in bdrv_co_do_block_status() for the snapshot-access driver,
the assertion failure will happen, because both flags are set.

To fix it, clear the recurse flag after the recursive check was done.

In detail:

> #0  qcow2_co_block_status

Returns 0x45 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA |
BDRV_BLOCK_OFFSET_VALID.

> #1  bdrv_co_do_block_status

Because of the data flag, bdrv_co_do_block_status() will now also set
BDRV_BLOCK_ALLOCATED. Because of the recurse flag,
bdrv_co_do_block_status() for the bdrv_file child will be called,
which returns 0x16 = BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID |
BDRV_BLOCK_ZERO. Now the return value inherits the zero flag.

Returns 0x57 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA |
BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

> #2  bdrv_co_common_block_status_above
> #3  bdrv_co_block_status_above
> #4  bdrv_co_block_status
> #5  cbw_co_snapshot_block_status
> #6  bdrv_co_snapshot_block_status
> #7  snapshot_access_co_block_status
> #8  bdrv_co_do_block_status

Return value is propagated all the way up to here, where the assertion
failure happens, because BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO are
both set.

> #9  bdrv_co_common_block_status_above
> #10 bdrv_co_block_status_above
> #11 block_copy_block_status
> #12 block_copy_dirty_clusters
> #13 block_copy_common
> #14 block_copy_async_co_entry
> #15 coroutine_trampoline

[0]:

> #!/bin/bash
> rm /tmp/disk.qcow2
> ./qemu-img create /tmp/disk.qcow2 -o preallocation=metadata -f qcow2 1G
> ./qemu-img create /tmp/fleecing.qcow2 -f qcow2 1G
> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \
> --blockdev qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2 \
> --blockdev qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": "node1", "sync": "full", "job-id": "backup0" } }
> EOF

Signed-off-by: Fiona Ebner 
---

I'm new to this part of the code, so I'm not sure whether it is actually
safe to clear the flag. Intuitively, I'd expect it to only be relevant
until it has been acted upon, but no clue.

 block/io.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/io.c b/block/io.c
index 8fa7670571..33150c0359 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2584,6 +2584,16 @@ bdrv_co_do_block_status(BlockDriverState *bs, bool want_zero,
                 ret |= (ret2 & BDRV_BLOCK_ZERO);
             }
         }
+
+        /*
+         * Now that the recursive search was done, clear the flag. Otherwise,
+         * with more complicated block graphs like snapshot-access ->
+         * copy-before-write -> qcow2, where the return value will be propagated
+         * further up to a parent bdrv_co_do_block_status() call, both the
+         * BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO flags would be set, which is
+         * not allowed.
+         */
+        ret &= ~BDRV_BLOCK_RECURSE;
     }
 
 out:
-- 
2.39.2





Re: [PATCH 2/3] qapi: blockdev-backup: add discard-source parameter

2024-01-11 Thread Fiona Ebner
Hi Vladimir,

I hope I didn't miss a newer version of this series. I'm currently
evaluating fleecing backup for Proxmox downstream, so I pulled in this
series and wanted to let you know about two issues I encountered while
testing. We are still based on 8.1, but if I'm not mistaken, they are
still relevant:

On 31.03.22 at 21:57, Vladimir Sementsov-Ogievskiy wrote:
> @@ -575,6 +577,10 @@ static coroutine_fn int block_copy_task_entry(AioTask *task)
>      co_put_to_shres(s->mem, t->req.bytes);
>      block_copy_task_end(t, ret);
>  
> +    if (s->discard_source && ret == 0) {
> +        bdrv_co_pdiscard(s->source, t->req.offset, t->req.bytes);
> +    }
> +
>      return ret;
>  }
>  

If the image size is not aligned to the cluster size, passing
t->req.bytes when calling bdrv_co_pdiscard() can lead to an assertion
failure at the end of the image:

> kvm: ../block/io.c:1982: bdrv_co_write_req_prepare: Assertion `offset + bytes <= bs->total_sectors * BDRV_SECTOR_SIZE || child->perm & BLK_PERM_RESIZE' failed.

block_copy_do_copy() already has a line to clamp this down:

> int64_t nbytes = MIN(offset + bytes, s->len) - offset;

If I do the same before calling bdrv_co_pdiscard(), the failure goes away.


For the second one, the following code saw some changes since the series
was sent:

> diff --git a/block/copy-before-write.c b/block/copy-before-write.c
> index 79cf12380e..3e77313a9a 100644
> --- a/block/copy-before-write.c
> +++ b/block/copy-before-write.c
> @@ -319,7 +319,7 @@ static void cbw_child_perm(BlockDriverState *bs, BdrvChild *c,
>          bdrv_default_perms(bs, c, role, reopen_queue,
>                             perm, shared, nperm, nshared);
>  
> -        *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
> +        *nperm = *nperm | BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
>          *nshared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
>      }
>  }

It's now:

> bdrv_default_perms(bs, c, role, reopen_queue,
>                    perm, shared, nperm, nshared);
> 
> if (!QLIST_EMPTY(&bs->parents)) {
>     if (perm & BLK_PERM_WRITE) {
>         *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
>     }
>     *nshared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
> }

So I wasn't sure how to adapt the patch:

- If setting BLK_PERM_WRITE unconditionally, it seems to break usual
drive-backup (with no fleecing set up):

> permissions 'write' are both required by node '#block691' (uses node '#block151' as 'file' child) and unshared by block device 'drive-scsi0' (uses node '#block151' as 'root' child).

- If I only do it within the if block, it doesn't work when I try to set
up fleecing, because bs->parents is empty for me, i.e. when passing the
snapshot-access node to backup_job_create() while the usual cbw for
backup is appended. I should note I'm doing it manually in a custom QMP
command, not in a transaction (which requires the not-yet-merged
blockdev-replace AFAIU).

Not sure if I'm doing something wrong, but maybe what you wrote in the
commit message is necessary after all?

> Alternative is to pass
> an option to bdrv_cbw_append(), add some internal open-option for
> copy-before-write filter to require WRITE permission only for backup
> with discard-source=true. But I'm not sure it worth the complexity.

Best Regards,
Fiona




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-05 Thread Fiona Ebner
On 05.01.24 at 14:43, Fiona Ebner wrote:
> On 03.01.24 at 14:35, Paolo Bonzini wrote:
>> On 1/3/24 12:40, Fiona Ebner wrote:
>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>>> with the patch, but I did run into an assertion failure when trying to
>>> verify that it fixes my original stuck-guest-IO issue. See below for the
>>> backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934
>>>
>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>>> acquire the device’s context, so this should be directly callable from
>>>> any context.
>>>
>>> I guess this is not true anymore now that the AioContext locking was
>>> removed?
>>
>> Good point and, in fact, even before it was much safer to use
>> virtio_queue_notify() instead.  Not only does it use the event notifier
>> handler, but it also calls it in the right thread/AioContext just by
>> doing event_notifier_set().
>>
> 
> But with virtio_queue_notify() using the event notifier, the
> CPU-usage-spike issue is present:
> 
>>> Back to the CPU-usage-spike issue: I experimented around and it doesn't
>>> seem to matter whether I notify the virt queue before or after attaching
>>> the notifiers. But there's another functional difference. My patch
>>> called virtio_queue_notify() which contains this block:
>>>
>>>> if (vq->host_notifier_enabled) {
>>>> event_notifier_set(&vq->host_notifier);
>>>> } else if (vq->handle_output) {
>>>> vq->handle_output(vdev, vq);
>>>
>>> In my testing, the first branch was taken, calling event_notifier_set().
>>> Hanna's patch uses virtio_queue_notify_vq() and there,
>>> vq->handle_output() will be called. That seems to be the relevant
>>> difference regarding the CPU-usage-spike issue.
> 
> I should mention that this is with a VirtIO SCSI disk. I also attempted
> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
> didn't manage yet.
> 
> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
> the queues (but only one) will always show as nonempty. And then,
> run_poll_handlers_once() will always detect progress which explains the
> CPU usage.
> 
> The following shows
> 1. vq address
> 2. number of times vq was passed to virtio_queue_host_notifier_aio_poll()
> 3. number of times the result of virtio_queue_host_notifier_aio_poll()
> was true for the vq
> 
>> 0x555fd93f9c80 17162000 0
>> 0x555fd93f9e48 17162000 6
>> 0x555fd93f9ee0 17162000 0
>> 0x555fd93f9d18 17162000 17162000
>> 0x555fd93f9db0 17162000 0
>> 0x555fd93f9f78 17162000 0
> 
> And for the problematic one, the reason it is seen as nonempty is:
> 
>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0
> 

vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and
s->events_dropped is false in my testing, so
virtio_scsi_handle_event_vq() doesn't do anything.

> Those values stay like this while the call counts above increase.
> 
> So something going wrong with the indices when the event notifier is set
> from QEMU side (in the main thread)?
> 
> The guest is Debian 12 with a 6.1 kernel.

Best Regards,
Fiona




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-05 Thread Fiona Ebner
On 03.01.24 at 14:35, Paolo Bonzini wrote:
> On 1/3/24 12:40, Fiona Ebner wrote:
>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue
>> with the patch, but I did run into an assertion failure when trying to
>> verify that it fixes my original stuck-guest-IO issue. See below for the
>> backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934
>>
>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq)
>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, because
>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations
>>> acquire the device’s context, so this should be directly callable from
>>> any context.
>>
>> I guess this is not true anymore now that the AioContext locking was
>> removed?
> 
> Good point and, in fact, even before it was much safer to use
> virtio_queue_notify() instead.  Not only does it use the event notifier
> handler, but it also calls it in the right thread/AioContext just by
> doing event_notifier_set().
> 

But with virtio_queue_notify() using the event notifier, the
CPU-usage-spike issue is present:

>> Back to the CPU-usage-spike issue: I experimented around and it doesn't
>> seem to matter whether I notify the virt queue before or after attaching
>> the notifiers. But there's another functional difference. My patch
>> called virtio_queue_notify() which contains this block:
>> 
>>> if (vq->host_notifier_enabled) {
>>> event_notifier_set(&vq->host_notifier);
>>> } else if (vq->handle_output) {
>>> vq->handle_output(vdev, vq);
>> 
>> In my testing, the first branch was taken, calling event_notifier_set().
>> Hanna's patch uses virtio_queue_notify_vq() and there,
>> vq->handle_output() will be called. That seems to be the relevant
>> difference regarding the CPU-usage-spike issue.

I should mention that this is with a VirtIO SCSI disk. I also attempted
to reproduce the CPU-usage-spike issue with a VirtIO block disk, but
didn't manage yet.

What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of
the queues (but only one) will always show as nonempty. And then,
run_poll_handlers_once() will always detect progress which explains the
CPU usage.

The following shows
1. vq address
2. number of times vq was passed to virtio_queue_host_notifier_aio_poll()
3. number of times the result of virtio_queue_host_notifier_aio_poll()
was true for the vq

> 0x555fd93f9c80 17162000 0
> 0x555fd93f9e48 17162000 6
> 0x555fd93f9ee0 17162000 0
> 0x555fd93f9d18 17162000 17162000
> 0x555fd93f9db0 17162000 0
> 0x555fd93f9f78 17162000 0

And for the problematic one, the reason it is seen as nonempty is:

> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0

Those values stay like this while the call counts above increase.

So something going wrong with the indices when the event notifier is set
from QEMU side (in the main thread)?

The guest is Debian 12 with a 6.1 kernel.

Best Regards,
Fiona




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2024-01-03 Thread Fiona Ebner
On 02.01.24 at 16:24, Hanna Czenczek wrote:
> 
> I’ve attached the preliminary patch that I didn’t get to send (or test
> much) last year.  Not sure if it has the same CPU-usage-spike issue
> Fiona was seeing, the only functional difference is that I notify the vq
> after attaching the notifiers instead of before.
> 

Applied the patch on top of c12887e1b0 ("block-coroutine-wrapper: use
qemu_get_current_aio_context()") because it conflicts with b6948ab01d
("virtio-blk: add iothread-vq-mapping parameter").

I'm happy to report that I cannot reproduce the CPU-usage-spike issue
with the patch, but I did run into an assertion failure when trying to
verify that it fixes my original stuck-guest-IO issue. See below for the
backtrace [0]. Hanna wrote in https://issues.redhat.com/browse/RHEL-3934

> I think it’s sufficient to simply call virtio_queue_notify_vq(vq) after the 
> virtio_queue_aio_attach_host_notifier(vq, ctx) call, because both 
> virtio-scsi’s and virtio-blk’s .handle_output() implementations acquire the 
> device’s context, so this should be directly callable from any context.

I guess this is not true anymore now that the AioContext locking was
removed?

Back to the CPU-usage-spike issue: I experimented around and it doesn't
seem to matter whether I notify the virt queue before or after attaching
the notifiers. But there's another functional difference. My patch
called virtio_queue_notify() which contains this block:

> if (vq->host_notifier_enabled) {
>     event_notifier_set(&vq->host_notifier);
> } else if (vq->handle_output) {
>     vq->handle_output(vdev, vq);

In my testing, the first branch was taken, calling event_notifier_set().
Hanna's patch uses virtio_queue_notify_vq() and there,
vq->handle_output() will be called. That seems to be the relevant
difference regarding the CPU-usage-spike issue.

Best Regards,
Fiona

[0]:

> #0  __pthread_kill_implementation (threadid=, 
> signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
> #1  0x760e3d9f in __pthread_kill_internal (signo=6, 
> threadid=) at ./nptl/pthread_kill.c:78
> #2  0x76094f32 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/posix/raise.c:26
> #3  0x7607f472 in __GI_abort () at ./stdlib/abort.c:79
> #4  0x7607f395 in __assert_fail_base (fmt=0x761f3a90 "%s%s%s:%u: 
> %s%sAssertion `%s' failed.\n%n", 
> assertion=assertion@entry=0x56246bf8 "ctx == 
> qemu_get_current_aio_context()", 
> file=file@entry=0x56246baf "../system/dma-helpers.c", 
> line=line@entry=123, 
> function=function@entry=0x56246c70 <__PRETTY_FUNCTION__.1> 
> "dma_blk_cb") at ./assert/assert.c:92
> #5  0x7608de32 in __GI___assert_fail (assertion=0x56246bf8 "ctx 
> == qemu_get_current_aio_context()", 
> file=0x56246baf "../system/dma-helpers.c", line=123, 
> function=0x56246c70 <__PRETTY_FUNCTION__.1> "dma_blk_cb")
> at ./assert/assert.c:101
> #6  0x55b83425 in dma_blk_cb (opaque=0x5804f150, ret=0) at 
> ../system/dma-helpers.c:123
> #7  0x55b839ec in dma_blk_io (ctx=0x57404310, sg=0x588ca6f8, 
> offset=70905856, align=512, 
> io_func=0x55a94a87 , io_func_opaque=0x5817ea00, 
> cb=0x55a8d99f , opaque=0x5817ea00, 
> dir=DMA_DIRECTION_FROM_DEVICE) at ../system/dma-helpers.c:236
> #8  0x55a8de9a in scsi_do_read (r=0x5817ea00, ret=0) at 
> ../hw/scsi/scsi-disk.c:431
> #9  0x55a8e249 in scsi_read_data (req=0x5817ea00) at 
> ../hw/scsi/scsi-disk.c:501
> #10 0x55a897e3 in scsi_req_continue (req=0x5817ea00) at 
> ../hw/scsi/scsi-bus.c:1478
> #11 0x55d8270e in virtio_scsi_handle_cmd_req_submit 
> (s=0x58669af0, req=0x588ca6b0) at ../hw/scsi/virtio-scsi.c:828
> #12 0x55d82937 in virtio_scsi_handle_cmd_vq (s=0x58669af0, 
> vq=0x58672550) at ../hw/scsi/virtio-scsi.c:870
> #13 0x55d829a9 in virtio_scsi_handle_cmd (vdev=0x58669af0, 
> vq=0x58672550) at ../hw/scsi/virtio-scsi.c:883
> #14 0x55db3784 in virtio_queue_notify_vq (vq=0x58672550) at 
> ../hw/virtio/virtio.c:2268
> #15 0x55d8346a in virtio_scsi_drained_end (bus=0x58669d88) at 
> ../hw/scsi/virtio-scsi.c:1179
> #16 0x55a8a549 in scsi_device_drained_end (sdev=0x58105000) at 
> ../hw/scsi/scsi-bus.c:1774
> #17 0x55a931db in scsi_disk_drained_end (opaque=0x58105000) at 
> ../hw/scsi/scsi-disk.c:2369
> #18 0x55ee439c in blk_root_drained_end (child=0x574065d0) at 
> ../block/block-backend.c:2829
> #19 0x55ef0ac3 in bdrv_parent_drained_end_single (c=0x574065d0) 
> at ../block/io.c:74
> #20 0x55ef0b02 in bdrv_parent_drained_end (bs=0x57409f80, 
> ignore=0x0) at ../block/io.c:89
> #21 0x55ef1b1b in bdrv_do_drained_end (bs=0x57409f80, parent=0x0) 
> at ../block/io.c:421
> #22 0x55ef1b5a in bdrv_drained_end (bs=0x57409f80) at 
> ../block/io.c:428
> #23 0x55efcf64 in 

Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2023-12-19 Thread Fiona Ebner
On 18.12.23 at 15:49, Paolo Bonzini wrote:
> On Mon, Dec 18, 2023 at 1:41 PM Fiona Ebner  wrote:
>> I think it's because of nested drains, because when additionally
>> checking that the drain count is zero and only executing the loop then,
>> that issue doesn't seem to manifest
> 
> But isn't virtio_scsi_drained_end only run if bus->drain_count == 0?
> 
> if (bus->drain_count-- == 1) {
>     trace_scsi_bus_drained_end(bus, sdev);
>     if (bus->info->drained_end) {
>         bus->info->drained_end(bus);
>     }
> }
> 

You're right. Sorry, I must've messed up my testing yesterday :(
Sometimes the CPU spikes are very short-lived. Now I see the same issue
with both variants.

Unfortunately, I haven't been able to figure out why it happens yet.

Best Regards,
Fiona




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2023-12-18 Thread Fiona Ebner
On 14.12.23 at 20:53, Stefan Hajnoczi wrote:
> 
> I will still try the other approach that Hanna and Paolo have suggested.
> It seems more palatable. I will send a v2.
> 

FYI, what I already tried downstream (for VirtIO SCSI):

> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 9c751bf296..a6449b04d0 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1166,6 +1166,8 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>  
>      for (uint32_t i = 0; i < total_queues; i++) {
>          VirtQueue *vq = virtio_get_queue(vdev, i);
> +        virtio_queue_set_notification(vq, 1);
> +        virtio_queue_notify(vdev, i);
>          virtio_queue_aio_attach_host_notifier(vq, s->ctx);
>      }
>  }

But this introduces an issue where e.g. a 'backup' QMP command would put
the iothread into a bad state. After the command, whenever the guest
issues IO, the thread will temporarily spike to using 100% CPU. Using
QMP stop+cont is a way to make it go back to normal.

I think it's because of nested drains: when additionally checking that
the drain count is zero and only executing the loop in that case, the
issue doesn't seem to manifest, i.e.:

> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
> index 9c751bf296..d22c586b38 100644
> --- a/hw/scsi/virtio-scsi.c
> +++ b/hw/scsi/virtio-scsi.c
> @@ -1164,9 +1164,13 @@ static void virtio_scsi_drained_end(SCSIBus *bus)
>          return;
>      }
>  
> -    for (uint32_t i = 0; i < total_queues; i++) {
> -        VirtQueue *vq = virtio_get_queue(vdev, i);
> -        virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +    if (s->bus.drain_count == 0) {
> +        for (uint32_t i = 0; i < total_queues; i++) {
> +            VirtQueue *vq = virtio_get_queue(vdev, i);
> +            virtio_queue_set_notification(vq, 1);
> +            virtio_queue_notify(vdev, i);
> +            virtio_queue_aio_attach_host_notifier(vq, s->ctx);
> +        }
>      }
>  }
>  

Best Regards,
Fiona




Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler

2023-12-14 Thread Fiona Ebner
On 13.12.23 at 22:15, Stefan Hajnoczi wrote:
> But there you have it. Please let me know what you think and try your
> reproducers to see if this fixes the missing io_poll_end() issue. Thanks!
> 

Thanks to you! I applied the RFC (and the series it depends on) on top
of 8.2.0-rc4 and this fixes my reproducer which drains VirtIO SCSI or
VirtIO block devices in a loop. Also didn't encounter any other issues
while playing around a bit with backup and mirror jobs.

The changes look fine to me, but this issue is also the first time I
came in close contact with this code, so that unfortunately does not say
much.

Best Regards,
Fiona




Re: [PULL 29/32] virtio-blk: implement BlockDevOps->drained_begin()

2023-12-11 Thread Fiona Ebner
On 08.12.23 at 09:32, Kevin Wolf wrote:
> 
> I'm not involved in it myself, but the kind of theme reminds me of this
> downstream bug that Hanna analysed recently:
> 
> https://issues.redhat.com/browse/RHEL-3934
> 
> Does it look like the same root cause to you?
> 

Thank you for the reference! Yes, that does sound like the same root
cause. And the workaround I ended up with is also very similar, but it
was missing kicking the virt queue.

Best Regards,
Fiona




Re: [PULL 29/32] virtio-blk: implement BlockDevOps->drained_begin()

2023-12-07 Thread Fiona Ebner
On 03.11.23 at 14:12, Fiona Ebner wrote:
> Hi,
> 
> I ran into a strange issue where guest IO would get completely stuck
> during certain block jobs a while ago and finally managed to find a
> small reproducer [0]. I'm using a VM with virtio-blk-pci (or
> virtio-scsi-pci) with an iothread and running
> 
> fio --name=file --size=100M --direct=1 --rw=randwrite --bs=4k
> --ioengine=psync --numjobs=5 --runtime=1200 --time_based
> 
> in the guest. Then I'm issuing the QMP command with the reproducer in a
> loop. Usually, the guest IO will get stuck after about 1-3 minutes,
> sometimes fio can manage to continue with a lower speed for a while (but
> trying to Ctrl+C it or doing other IO in the guest will already be
> broken), which I guess could be a hint that it's an issue with notifiers?
> 
> Bisecting (to declare a commit good, I waited 10 minutes) led me to this
> patch, i.e. commit 1665d9326f ("virtio-blk: implement
> BlockDevOps->drained_begin()") and for SCSI, I verified that the issue
> similarly starts happening after 766aa2de0f ("virtio-scsi: implement
> BlockDevOps->drained_begin()").
> 
> Both issues are still present on current master (i.e. 1c98a821a2
> ("tests/qtest: Introduce tests for AMD/Xilinx Versal TRNG device"))
> 
> Happy to provide more information and hints about how to debug the issue
> further.
> 

I think I was finally able to get to the bottom of this and have a
plausible-sounding pet theory now. It involves the VirtIO notifier
optimization during poll mode.

Let's step through some debug prints I added. First number is always the
thread ID (I'm sorry that I used warn_report rather than proper tracing):

> 247050 nodefd 29 poll_set_started 1

The iothread starts poll mode for the node with fd 29 which is the
virtio host notifier.

> 247050 0x55e515185270 poll begin for vq
> 247050 0x55e515185270 setting notification for vq 0

virtio_queue_set_notification is called to disable notification.

> 247050 nodefd 29 poll_set_started 1 done
> 247050 0x55e515185270 handle vq suppress_notifications 0 num_reqs 1
> 247050 0x55e515185270 handle vq suppress_notifications 0 num_reqs 4

virtio-blk handling some requests, note that suppress_notifications is 0
because we are in poll mode.

> 247048 nodefd 29 addr 0x55e51496ed70 marking as deleted

Main thread marks the node for deletion when beginning drain, i.e.
detaches the host notifier.

> 247048 nodefd 29 addr 0x55e513cdcd20 is_new 1 adding node

Main thread adds a new node when ending drain, i.e. attaches the host
notifier.

> 247050 nodefd 29 addr 0x55e51496ed70 remove deleted handler

The iothread removes the handler marked for removal. In particular from
the node_poll list: QLIST_SAFE_REMOVE(node, node_poll);

> 247050 disabling poll mode before fdmon_ops->wait

This is just before the call to
poll_set_started(ctx, &ready_list, false)

Whoops!! Nobody ends poll mode for the node with fd 29, because the old
node was deleted from the node_poll list already and new node is not
part of it, i.e. nobody has started poll mode for the new node.

> 247050 0x55e515185270 handle vq suppress_notifications 0 num_reqs 0

fdmon_ops->wait() returns one last time (not sure why), but with no
actual requests.

> 247050 disabling poll mode before fdmon_ops->wait

After this, the fdmon_ops->wait() (it's fdmon_poll_wait in my case) will
just wait forever (or until triggering QMP 'stop' and 'cont' which
restarts the dataplane).


A minimal workaround seems to be either calling
event_notifier_set(virtio_queue_get_host_notifier(vq));
or
virtio_queue_set_notification(vq, true);
in drained_end (for both VirtIO SCSI/block).

But is this an actual issue with the AIO interface/implementation? Or
should it rather be considered a bug in the VirtIO SCSI/block drain
implementation, because of the notification optimization?

Best Regards,
Fiona

> 
> [0]:
> 
>> diff --git a/blockdev.c b/blockdev.c
>> index db2725fe74..bf2e0fc22c 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -2986,6 +2986,11 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>      bool zero_target;
>>      int ret;
>>  
>> +    bdrv_drain_all_begin();
>> +    bdrv_drain_all_end();
>> +    return;
>> +
>> +
>>      bs = qmp_get_root_bs(arg->device, errp);
>>      if (!bs) {
>>          return;
> 
> 
> 




Re: [PULL 29/32] virtio-blk: implement BlockDevOps->drained_begin()

2023-11-13 Thread Fiona Ebner
On 03.11.23 at 14:12, Fiona Ebner wrote:
> Hi,
> 
> On 30.05.23 at 18:32, Kevin Wolf wrote:
>> From: Stefan Hajnoczi 
>>
>> Detach ioeventfds during drained sections to stop I/O submission from
>> the guest. virtio-blk is no longer reliant on aio_disable_external()
>> after this patch. This will allow us to remove the
>> aio_disable_external() API once all other code that relies on it is
>> converted.
>>
>> Take extra care to avoid attaching/detaching ioeventfds if the data
>> plane is started/stopped during a drained section. This should be rare,
>> but maybe the mirror block job can trigger it.
>>
>> Signed-off-by: Stefan Hajnoczi 
>> Message-Id: <20230516190238.8401-18-stefa...@redhat.com>
>> Signed-off-by: Kevin Wolf 
> 
> I ran into a strange issue where guest IO would get completely stuck
> during certain block jobs a while ago and finally managed to find a
> small reproducer [0]. I'm using a VM with virtio-blk-pci (or
> virtio-scsi-pci) with an iothread and running
> 
> fio --name=file --size=100M --direct=1 --rw=randwrite --bs=4k
> --ioengine=psync --numjobs=5 --runtime=1200 --time_based
> 
> in the guest. Then I'm issuing the QMP command with the reproducer in a
> loop. Usually, the guest IO will get stuck after about 1-3 minutes,
> sometimes fio can manage to continue with a lower speed for a while (but
> trying to Ctrl+C it or doing other IO in the guest will already be
> broken), which I guess could be a hint that it's an issue with notifiers?
> 
> Bisecting (to declare a commit good, I waited 10 minutes) led me to this
> patch, i.e. commit 1665d9326f ("virtio-blk: implement
> BlockDevOps->drained_begin()") and for SCSI, I verified that the issue
> similarly starts happening after 766aa2de0f ("virtio-scsi: implement
> BlockDevOps->drained_begin()").
> 
> Both issues are still present on current master (i.e. 1c98a821a2
> ("tests/qtest: Introduce tests for AMD/Xilinx Versal TRNG device"))
> 
> Happy to provide more information and hints about how to debug the issue
> further.
> 

Of course, I meant "and for hints" ;)

I should also mention that when IO is stuck, for the two
BlockDriverStates (i.e. bdrv_raw and bdrv_file) and BlockBackend,
in_flight and quiesce_counter are 0, tracked_requests, respectively
queued_requests, are empty and quiesced_parent is false for the parents.

Two observations:

1. I found that using QMP 'stop' and 'cont' will allow guest IO to get
unstuck. I'm pretty sure, it's the virtio_blk_data_plane_stop/start
calls it triggers.

2. While experimenting, I found that after the below change [1] in
aio_poll(), I wasn't able to trigger the issue anymore (letting my
reproducer run for 40 minutes).

Best Regards,
Fiona

[1]:

> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 7f2c99729d..dff9ad4148 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -655,7 +655,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      /* If polling is allowed, non-blocking aio_poll does not need the
>       * system call---a single round of run_poll_handlers_once suffices.
>       */
> -    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
> +    if (1) { //timeout || ctx->fdmon_ops->need_wait(ctx)) {
>          /*
>           * Disable poll mode. poll mode should be disabled before the call
>           * of ctx->fdmon_ops->wait() so that guest's notification can wake


> [0]:
> 
>> diff --git a/blockdev.c b/blockdev.c
>> index db2725fe74..bf2e0fc22c 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -2986,6 +2986,11 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
>>      bool zero_target;
>>      int ret;
>>  
>> +    bdrv_drain_all_begin();
>> +    bdrv_drain_all_end();
>> +    return;
>> +
>> +
>>      bs = qmp_get_root_bs(arg->device, errp);
>>      if (!bs) {
>>          return;
> 
> 
> 




Re: deadlock when using iothread during backup_clean()

2023-11-03 Thread Fiona Ebner
On 20.10.23 at 15:52, Fiona Ebner wrote:
> And I found that with drive-mirror, the issue during starting seems to
> manifest with the bdrv_open() call. Adding a return before it, the guest
> IO didn't get stuck in my testing, but adding a return after it, it can
> get stuck. I'll try to see if I can further narrow it down next week,
> but maybe that's already a useful hint?
> 

In the end, I was able to find a reproducer that just does draining and
bisected the issue (doesn't seem related to the graph lock after all). I
replied there, to avoid all the overhead from this thread:

https://lists.nongnu.org/archive/html/qemu-devel/2023-11/msg00681.html

Best Regards,
Fiona



