Re: [PATCH] meson: add 'qemuutil' dependency for block.c
Am 15.08.24 um 17:58 schrieb Daniel P. Berrangé: > On Wed, Aug 14, 2024 at 12:00:52PM +0200, Fiona Ebner wrote: >> The macro block_module_load() used by block.c is a wrapper around >> module_load(), which is implemented in util/module.c. >> >> Fixes linking for a future binary or downstream binary that does not >> depend on 'qemuutil' directly, but does depend on 'block'. > > Such a scenario is impossible surely, even in future. Every file in > QEMU pulls in osdep.h, and as a result effectively gets a dep on > on qemuutil, not to mention the block layer using countless APIs > present in qemuutil > Yes, you are right. Sorry, I missed this dependency. The sources for both of our affected downstream binaries do include "qemu/osdep.h" and thus have a direct dependency on qemuutil. So my patch can be disregarded. Build for the mentioned binaries broke after, IIRC, 414b180d42 ("meson: Pass objects and dependencies to declare_dependency()"), because they didn't explicitly specify the qemuutil dependency in meson. The error message I got was about "module_load" used by the block layer. Best Regards, Fiona
[PATCH] meson: add 'qemuutil' dependency for block.c
The macro block_module_load() used by block.c is a wrapper around module_load(), which is implemented in util/module.c. Fixes linking for a future binary or downstream binary that does not depend on 'qemuutil' directly, but does depend on 'block'. Signed-off-by: Fiona Ebner --- meson.build | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/meson.build b/meson.build index 81ecd4bae7..efa0ac8d0b 100644 --- a/meson.build +++ b/meson.build @@ -3555,7 +3555,7 @@ if have_block 'blockjob.c', 'job.c', 'qemu-io-cmds.c', - )) + ), qemuutil) if config_host_data.get('CONFIG_REPLICATION') block_ss.add(files('replication.c')) endif -- 2.39.2
Re: query dirty areas according to bitmap via QMP or qemu-nbd
Am 26.07.24 um 17:38 schrieb Eric Blake: > On Fri, Jul 26, 2024 at 04:16:41PM GMT, Fiona Ebner wrote: >> Hi, >> >> sorry if I'm missing the obvious, but is there a way to get the dirty >> areas according to a dirty bitmap via QMP? I mean as something like >> offset + size + dirty-flag triples. In my case, the bitmap is also >> exported via NBD, so same question for qemu-nbd being the client. > > Over QMP, no - that can produce a potentially large response and > possible long time in computing the data, so we have never felt the > need to introduce a new QMP command for that purpose. So over NBD is > the preferred solution. > >> >> I can get the info with "nbdinfo --map", but would like to avoid >> requiring a tool outside QEMU. > > By default, QEMU as an NBD client only reads the "base:allocation" NBD > metacontext, and is not wired to read more than one NBD metacontext at > once (weaker than nbdinfo's capabilities). But I have intentionally > left in a hack (accessible through QMP as well as from the command > line) for connecting a qemu NBD client to an alternative NBD > metacontext that feeds the block status, at which point 2 bits of > information from the alternative context are observable through the > result of block status calls. Note that using such an NBD connection > for anything OTHER than block status calls is inadvisable (qemu might > incorrectly optimize reads based on its misinterpretation of those > block status bits); but as long as you limit the client to block > status calls, it's a great way to read out a "qemu:dirty-bitmap:..." > metacontext using only a qemu NBD client connection. > > git grep -l x-dirty-bitmap tests/qemu-iotests > > shows several of the iotests using the backdoor in just that manner. > In particular, tests/qemu-img-bitmaps gives the magic decoder ring: > > | # x-dirty-bitmap is a hack for reading bitmaps; it abuses block status to > | # report "data":false for portions of the bitmap which are set > | IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket" > | nbd_server_start_unix_socket -r -f qcow2 \ > | -B b0 -B b1 -B b2 -B b3 "$TEST_IMG" > | $QEMU_IMG map --output=json --image-opts \ > | "$IMG,x-dirty-bitmap=qemu:dirty-bitmap:b0" | _filter_qemu_img_map > > meaning the map output includes "data":false for the dirty portions > and "data":true for the unchanged portions recorded in bitmap b0 as > read from the JSON map output. > Oh, I didn't think about checking the NBD block driver for such an option :) And thank you for all the explanations! >> >> If it is not currently possible, would upstream be interested too in the >> feature, either for QMP or qemu-nbd? > > Improving qemu-img to get at the information without quite the hacky > post-processing deciphering would indeed be a useful patch, but it has > never risen to the level of enough of an itch for me to write it > myself (especially since 'nbdinfo --map's output works just as well). > I might just go with the above for now, but who knows if I'll get around to this some day. Three approaches that come to mind are: 1. qemu-img bitmap --dump Other bitmap actions won't be supported in combination with NBD. 2. qemu-img map --bitmap NAME Should it use a dedicated output format compared to the usual "map" output (both human and json) with just "start/offset + length + dirty bit" triples? 3. qemu-nbd --map CONTEXT With only supporting one context at a time? Would be limited to NBD of course which the other two won't be. All would require connecting to the NBD export with the correct meta context, which currently means using x_dirty_bitmap internally. So would that even be okay as part of a non-experimental command, or would it require to teach the NBD client code to deal with multiple meta contexts first? Best Regards, Fiona
query dirty areas according to bitmap via QMP or qemu-nbd
Hi, sorry if I'm missing the obvious, but is there a way to get the dirty areas according to a dirty bitmap via QMP? I mean as something like offset + size + dirty-flag triples. In my case, the bitmap is also exported via NBD, so same question for qemu-nbd being the client. I can get the info with "nbdinfo --map", but would like to avoid requiring a tool outside QEMU. If it is not currently possible, would upstream be interested too in the feature, either for QMP or qemu-nbd? Best Regards, Fiona
[PATCH] hw/scsi/lsi53c895a: bump instruction limit in scripts processing to fix regression
Commit 9876359990 ("hw/scsi/lsi53c895a: add timer to scripts processing") reduced the maximum allowed instruction count by a factor of 100 all the way down to 100. This causes the "Check Point R81.20 Gaia" appliance [0] to fail to boot after fully finishing the installation via the appliance's web interface (there is already one reboot before that). With a limit of 150, the appliance still fails to boot, while with a limit of 200, it works. Bump to 500 to fix the regression and be on the safe side. Originally reported in the Proxmox community forum[1]. [0]: https://support.checkpoint.com/results/download/124397 [1]: https://forum.proxmox.com/threads/149772/post-683459 Cc: qemu-sta...@nongnu.org Fixes: 9876359990 ("hw/scsi/lsi53c895a: add timer to scripts processing") Signed-off-by: Fiona Ebner --- hw/scsi/lsi53c895a.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/scsi/lsi53c895a.c b/hw/scsi/lsi53c895a.c index eb9828dd5e..f1935e5328 100644 --- a/hw/scsi/lsi53c895a.c +++ b/hw/scsi/lsi53c895a.c @@ -188,7 +188,7 @@ static const char *names[] = { #define LSI_TAG_VALID (1 << 16) /* Maximum instructions to process. */ -#define LSI_MAX_INSN100 +#define LSI_MAX_INSN500 typedef struct lsi_request { SCSIRequest *req; -- 2.39.2
[PATCH v2] block/reqlist: allow adding overlapping requests
Allow overlapping request by removing the assert that made it impossible. There are only two callers: 1. block_copy_task_create() It already asserts the very same condition before calling reqlist_init_req(). 2. cbw_snapshot_read_lock() There is no need to have read requests be non-overlapping in copy-before-write when used for snapshot-access. In fact, there was no protection against two callers of cbw_snapshot_read_lock() calling reqlist_init_req() with overlapping ranges and this could lead to an assertion failure [1]. In particular, with the reproducer script below [0], two cbw_co_snapshot_block_status() callers could race, with the second calling reqlist_init_req() before the first one finishes and removes its conflicting request. [0]: > #!/bin/bash -e > dd if=/dev/urandom of=/tmp/disk.raw bs=1M count=1024 > ./qemu-img create /tmp/fleecing.raw -f raw 1G > ( > ./qemu-system-x86_64 --qmp stdio \ > --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw \ > --blockdev > raw,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.raw \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", > "file": "node0", "target": "node1", "node-name": "node3" } } > {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", > "file": "node3", "node-name": "snap0" } } > {"execute": "nbd-server-start", "arguments": {"addr": { "type": "unix", > "data": { "path": "/tmp/nbd.socket" } } } } > {"execute": "block-export-add", "arguments": {"id": "exp0", "node-name": > "snap0", "type": "nbd", "name": "exp0"}} > EOF > ) & > sleep 5 > while true; do > ./qemu-nbd -d /dev/nbd0 > ./qemu-nbd -c /dev/nbd0 nbd:unix:/tmp/nbd.socket:exportname=exp0 -f raw -r > nbdinfo --map 'nbd+unix:///exp0?socket=/tmp/nbd.socket' > done [1]: > #5 0x71e5f0088eb2 in __GI___assert_fail (...) at ./assert/assert.c:101 > #6 0x615285438017 in reqlist_init_req (...) at ../block/reqlist.c:23 > #7 0x6152853e2d98 in cbw_snapshot_read_lock (...) at > ../block/copy-before-write.c:237 > #8 0x6152853e3068 in cbw_co_snapshot_block_status (...) at > ../block/copy-before-write.c:304 > #9 0x6152853f4d22 in bdrv_co_snapshot_block_status (...) at > ../block/io.c:3726 > #10 0x61528543a63e in snapshot_access_co_block_status (...) at > ../block/snapshot-access.c:48 > #11 0x6152853f1a0a in bdrv_co_do_block_status (...) at ../block/io.c:2474 > #12 0x6152853f2016 in bdrv_co_common_block_status_above (...) at > ../block/io.c:2652 > #13 0x6152853f22cf in bdrv_co_block_status_above (...) at > ../block/io.c:2732 > #14 0x6152853d9a86 in blk_co_block_status_above (...) at > ../block/block-backend.c:1473 > #15 0x61528538da6c in blockstatus_to_extents (...) at ../nbd/server.c:2374 > #16 0x61528538deb1 in nbd_co_send_block_status (...) at > ../nbd/server.c:2481 > #17 0x61528538f424 in nbd_handle_request (...) at ../nbd/server.c:2978 > #18 0x61528538f906 in nbd_trip (...) at ../nbd/server.c:3121 > #19 0x6152855a7caf in coroutine_trampoline (...) at > ../util/coroutine-ucontext.c:175 Cc: qemu-sta...@nongnu.org Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * different approach, allowing overlapping requests for copy-before-write rather than waiting for them. block-copy already asserts there are no conflicts before adding a request. block/copy-before-write.c | 3 ++- block/reqlist.c | 2 -- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/block/copy-before-write.c b/block/copy-before-write.c index 853e01a1eb..28f6a096cd 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -66,7 +66,8 @@ typedef struct BDRVCopyBeforeWriteState { /* * @frozen_read_reqs: current read requests for fleecing user in bs->file - * node. These areas must not be rewritten by guest. + * node. These areas must not be rewritten by guest. There can be multiple + * overlapping read requests. */ BlockReqList frozen_read_reqs; diff --git a/block/reqlist.c b/block/reqlist.c index 08cb57cfa4..098e807378 100644 --- a/block/reqlist.c +++ b/block/reqlist.c @@ -20,8 +20,6 @@ void reqlist_init_req(BlockReqList *reqs, BlockReq *req, int64_t offset, int64_t bytes) { -assert(!reqlist_find_conflict(reqs, offset, bytes)); - *req = (BlockReq) { .offset = offset, .bytes = bytes, -- 2.39.2
[PATCH] block/copy-before-write: wait for conflicts when read locking to avoid assertion failure
There is no protection against two callers of cbw_snapshot_read_lock() calling reqlist_init_req() with overlapping ranges, and reqlist_init_req() asserts that there are no conflicting requests. In particular, two cbw_co_snapshot_block_status() callers can race, with the second calling reqlist_init_req() before the first one finishes and removes its conflicting request, leading to an assertion failure. Reproducer script [0] and backtrace [1] are attached below. [0]: > #!/bin/bash -e > dd if=/dev/urandom of=/tmp/disk.raw bs=1M count=1024 > ./qemu-img create /tmp/fleecing.raw -f raw 1G > ( > ./qemu-system-x86_64 --qmp stdio \ > --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw \ > --blockdev > raw,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.raw \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", > "file": "node0", "target": "node1", "node-name": "node3" } } > {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", > "file": "node3", "node-name": "snap0" } } > {"execute": "nbd-server-start", "arguments": {"addr": { "type": "unix", > "data": { "path": "/tmp/nbd.socket" } } } } > {"execute": "block-export-add", "arguments": {"id": "exp0", "node-name": > "snap0", "type": "nbd", "name": "exp0"}} > EOF > ) & > sleep 5 > while true; do > ./qemu-nbd -d /dev/nbd0 > ./qemu-nbd -c /dev/nbd0 nbd:unix:/tmp/nbd.socket:exportname=exp0 -f raw -r > nbdinfo --map 'nbd+unix:///exp0?socket=/tmp/nbd.socket' > done [1]: > #5 0x71e5f0088eb2 in __GI___assert_fail (...) at ./assert/assert.c:101 > #6 0x615285438017 in reqlist_init_req (...) at ../block/reqlist.c:23 > #7 0x6152853e2d98 in cbw_snapshot_read_lock (...) at > ../block/copy-before-write.c:237 > #8 0x6152853e3068 in cbw_co_snapshot_block_status (...) at > ../block/copy-before-write.c:304 > #9 0x6152853f4d22 in bdrv_co_snapshot_block_status (...) at > ../block/io.c:3726 > #10 0x61528543a63e in snapshot_access_co_block_status (...) at > ../block/snapshot-access.c:48 > #11 0x6152853f1a0a in bdrv_co_do_block_status (...) at ../block/io.c:2474 > #12 0x6152853f2016 in bdrv_co_common_block_status_above (...) at > ../block/io.c:2652 > #13 0x6152853f22cf in bdrv_co_block_status_above (...) at > ../block/io.c:2732 > #14 0x6152853d9a86 in blk_co_block_status_above (...) at > ../block/block-backend.c:1473 > #15 0x61528538da6c in blockstatus_to_extents (...) at ../nbd/server.c:2374 > #16 0x61528538deb1 in nbd_co_send_block_status (...) at > ../nbd/server.c:2481 > #17 0x61528538f424 in nbd_handle_request (...) at ../nbd/server.c:2978 > #18 0x61528538f906 in nbd_trip (...) at ../nbd/server.c:3121 > #19 0x6152855a7caf in coroutine_trampoline (...) at > ../util/coroutine-ucontext.c:175 Signed-off-by: Fiona Ebner --- block/copy-before-write.c | 1 + 1 file changed, 1 insertion(+) diff --git a/block/copy-before-write.c b/block/copy-before-write.c index 853e01a1eb..376ff3f3e1 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -234,6 +234,7 @@ cbw_snapshot_read_lock(BlockDriverState *bs, int64_t offset, int64_t bytes, *req = (BlockReq) {.offset = -1, .bytes = -1}; *file = s->target; } else { +reqlist_wait_all(&s->frozen_read_reqs, offset, bytes, &s->lock); reqlist_init_req(&s->frozen_read_reqs, req, offset, bytes); *file = bs->file; } -- 2.39.2
[PATCH v3 0/2] backup: allow specifying minimum cluster size
Discussion for v2: https://lore.kernel.org/qemu-devel/20240528120114.344416-1-f.eb...@proxmox.com/ Changes in v3: * Pass min_cluster_size option directly without checking has_min_cluster_size, because the default is 0 anyways. * Calculate maximum of passed-in argument and default once at the beginning of block_copy_calculate_cluster_size() * Update warning message to reflect actual value used * Do not leak qdict in error case * Use PRI{i,u}64 macros Discussion for v1: https://lore.kernel.org/qemu-devel/20240308155158.830258-1-f.eb...@proxmox.com/ - Changes in v2: * Use 'size' type in QAPI. * Remove option in cbw_parse_options(), i.e. before parsing generic blockdev options. * Reword commit messages hoping to describe the issue in a more straight-forward way. In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore then. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. Fiona Ebner (2): copy-before-write: allow specifying minimum cluster size backup: add minimum cluster size to performance options block/backup.c | 2 +- block/block-copy.c | 36 ++-- block/copy-before-write.c | 14 +- block/copy-before-write.h | 1 + blockdev.c | 3 +++ include/block/block-copy.h | 1 + qapi/block-core.json | 17 ++--- 7 files changed, 59 insertions(+), 15 deletions(-) -- 2.39.2
[PATCH v3 2/2] backup: add minimum cluster size to performance options
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore then. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. Suggested-by: Vladimir Sementsov-Ogievskiy Acked-by: Markus Armbruster (QAPI schema) Signed-off-by: Fiona Ebner --- Changes in v3: * Use PRI{i,u}64 macros * Do not leak qdict in error case block/backup.c| 2 +- block/copy-before-write.c | 9 + block/copy-before-write.h | 1 + blockdev.c| 3 +++ qapi/block-core.json | 9 +++-- 5 files changed, 21 insertions(+), 3 deletions(-) diff --git a/block/backup.c b/block/backup.c index 3dd2e229d2..a1292c01ec 100644 --- a/block/backup.c +++ b/block/backup.c @@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, } cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source, - &bcs, errp); + perf->min_cluster_size, &bcs, errp); if (!cbw) { goto error; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index a919b1f41b..e835987e52 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -548,6 +548,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp) { @@ -567,6 +568,14 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, qdict_put_str(opts, "file", bdrv_get_node_name(source)); qdict_put_str(opts, "target", bdrv_get_node_name(target)); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %" PRIu64 " > %" PRIi64, + min_cluster_size, INT64_MAX); +qobject_unref(opts); +return NULL; +} +qdict_put_int(opts, "min-cluster-size", (int64_t)min_cluster_size); + top = bdrv_insert_node(source, opts, flags, errp); if (!top) { return NULL; diff --git a/block/copy-before-write.h b/block/copy-before-write.h index 01af0cd3c4..2a5d4ba693 100644 --- a/block/copy-before-write.h +++ b/block/copy-before-write.h @@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp); void bdrv_cbw_drop(BlockDriverState *bs); diff --git a/blockdev.c b/blockdev.c index 835064ed03..6740663fda 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2655,6 +2655,9 @@ static BlockJob *do_backup_common(BackupCommon *backup, if (backup->x_perf->has_max_chunk) { perf.max_chunk = backup->x_perf->max_chunk; } +if (backup->x_perf->has_min_cluster_size) { +perf.min_cluster_size = backup->x_perf->min_cluster_size; +} } if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) || diff --git a/qapi/block-core.json b/qapi/block-core.json index 80e32db8aa..9a54bfb15f 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1551,11 +1551,16 @@ # it should not be less than job cluster size which is calculated # as maximum of target image cluster size and 64k. Default 0. # +# @min-cluster-size: Minimum size of blocks used by copy-before-write +# and background copy operations. Has to be a power of 2. No +# effect if smaller than the maximum of the target's cluster size +# and 64 KiB. Default 0. (Since 9.1) +# # Since: 6.0 ## { 'struct': 'BackupPerf', - 'data': { '*use-copy-range': 'bool', -'*max-workers': 'int', '*max-chunk': 'int64' } } + 'data': { '*use-copy-range': 'bool', '*max-workers': 'int', +'*max-chunk': 'int64', '*min-cluster-size': 'size' } } ## # @BackupCommon: -- 2.39.2
[PATCH v3 1/2] copy-before-write: allow specifying minimum cluster size
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore then. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. The type 'size' (corresponding to uint64_t in C) is used in QAPI to rule out negative inputs and for consistency with already existing @cluster-size parameters. Since block_copy_calculate_cluster_size() uses int64_t for its result, a check that the input is not too large is added in block_copy_state_new() before calling it. The calculation in block_copy_calculate_cluster_size() is done in the target int64_t type. Suggested-by: Vladimir Sementsov-Ogievskiy Acked-by: Markus Armbruster (QAPI schema) Signed-off-by: Fiona Ebner --- Changes in v3: * Pass min_cluster_size option directly without checking has_min_cluster_size, because the default is 0 anyways. * Calculate maximum of passed-in argument and default once at the beginning of block_copy_calculate_cluster_size() * Update warning message to reflect actual value used * Use PRI{i,u}64 macros block/block-copy.c | 36 ++-- block/copy-before-write.c | 5 - include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 38 insertions(+), 12 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..59bee538eb 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -319,6 +320,9 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, GLOBAL_STATE_CODE(); GRAPH_RDLOCK_GUARD_MAINLOOP(); +min_cluster_size = MAX(min_cluster_size, + (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); + target_does_cow = bdrv_backing_chain_next(target); /* @@ -329,13 +333,13 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, ret = bdrv_get_info(target, &bdi); if (ret == -ENOTSUP && !target_does_cow) { /* Cluster size is not defined */ -warn_report("The target block device doesn't provide " -"information about the block size and it doesn't have a " -"backing file. The default block size of %u bytes is " -"used. If the actual block size of the target exceeds " -"this default, the backup may be unusable", -BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +warn_report("The target block device doesn't provide information about " +"the block size and it doesn't have a backing file. The " +"(default) block size of %" PRIi64 " bytes is used. If the " +"actual block size of the target exceeds this value, the " +"backup may be unusable", +min_cluster_size); +return min_cluster_size; } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +349,17 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. */ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return min_cluster_size; } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, bdi.cluster_size); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + uint64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +370,18 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min
[PATCH] scsi: fix regression and honor bootindex again for legacy drives
Commit 3089637461 ("scsi: Don't ignore most usb-storage properties") removed the call to object_property_set_int() and thus the 'set' method for the bootindex property was also not called anymore. Here that method is device_set_bootindex() (as configured by scsi_dev_instance_init() -> device_add_bootindex_property()) which as a side effect registers the device via add_boot_device_path(). As reported by a downstream user [0], the bootindex property did not have the desired effect anymore for legacy drives. Fix the regression by explicitly calling the add_boot_device_path() function after checking that the bootindex is not yet used (to avoid add_boot_device_path() calling exit()). [0]: https://forum.proxmox.com/threads/149772/post-679433 Cc: qemu-sta...@nongnu.org Fixes: 3089637461 ("scsi: Don't ignore most usb-storage properties") Suggested-by: Kevin Wolf Signed-off-by: Fiona Ebner --- hw/scsi/scsi-bus.c | 9 + 1 file changed, 9 insertions(+) diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c index 9e40b0c920..53eff5dd3d 100644 --- a/hw/scsi/scsi-bus.c +++ b/hw/scsi/scsi-bus.c @@ -384,6 +384,7 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk, DeviceState *dev; SCSIDevice *s; DriveInfo *dinfo; +Error *local_err = NULL; if (blk_is_sg(blk)) { driver = "scsi-generic"; @@ -403,6 +404,14 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk, s = SCSI_DEVICE(dev); s->conf = *conf; +check_boot_index(conf->bootindex, &local_err); +if (local_err) { +object_unparent(OBJECT(dev)); +error_propagate(errp, local_err); +return NULL; +} +add_boot_device_path(conf->bootindex, dev, NULL); + qdev_prop_set_uint32(dev, "scsi-id", unit); if (object_property_find(OBJECT(dev), "removable")) { qdev_prop_set_bit(dev, "removable", removable); -- 2.39.2
Re: [PATCH] scsi: Don't ignore most usb-storage properties
Hi, we got a user report about bootindex for an 'usb-storage' device not working anymore [0] and I reproduced it and bisected it to this patch. Am 31.01.24 um 14:06 schrieb Kevin Wolf: > @@ -399,11 +397,10 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, > BlockBackend *blk, > object_property_add_child(OBJECT(bus), name, OBJECT(dev)); > g_free(name); > > +s = SCSI_DEVICE(dev); > +s->conf = *conf; > + > qdev_prop_set_uint32(dev, "scsi-id", unit); > -if (bootindex >= 0) { > -object_property_set_int(OBJECT(dev), "bootindex", bootindex, > -&error_abort); > -} The fact that this is not called anymore means that the 'set' method for the property is also not called. Here, that method is device_set_bootindex() (as configured by scsi_dev_instance_init() -> device_add_bootindex_property()). Therefore, the device is never registered via add_boot_device_path() meaning that the bootindex property does not have the desired effect anymore. Is it necessary to keep the object_property_set_{bool,int} and qdev_prop_set_enum calls around for these potential side effects? Would it even be necessary to introduce new similar calls for the newly supported properties? Or is there an easy alternative to s->conf = *conf; that does trigger the side effects? > if (object_property_find(OBJECT(dev), "removable")) { > qdev_prop_set_bit(dev, "removable", removable); > } > @@ -414,19 +411,12 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, > BlockBackend *blk, > object_unparent(OBJECT(dev)); > return NULL; > } > -if (!object_property_set_bool(OBJECT(dev), "share-rw", share_rw, errp)) { > -object_unparent(OBJECT(dev)); > -return NULL; > -} > - > -qdev_prop_set_enum(dev, "rerror", rerror); > -qdev_prop_set_enum(dev, "werror", werror); > > if (!qdev_realize_and_unref(dev, &bus->qbus, errp)) { > object_unparent(OBJECT(dev)); > return NULL; > } > -return SCSI_DEVICE(dev); > +return s; > } > > void scsi_bus_legacy_handle_cmdline(SCSIBus *bus) [0]: https://forum.proxmox.com/threads/149772/post-679433 Best Regards, Fiona
Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context
Am 12.06.24 um 17:34 schrieb Stefan Hajnoczi: > > Thank you for investigating! It looks like we would be trading one > issue (the assertion failures you mentioned) for another (a rare, but > possible, hang). > > I'm not sure what the best solution is. It seems like vm_stop() is the > first place where things go awry. It's where we should exit device > emulation code. Doing that probably requires an asynchronous API that > takes a callback. Do you want to try that? > I can try, but I'm afraid it will be a while (at least a few weeks) until I can get around to it. Best Regards, Fiona
Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context
Am 11.06.24 um 16:04 schrieb Stefan Hajnoczi: > On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote: >> Am 06.06.24 um 20:36 schrieb Stefan Hajnoczi: >>> On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote: >>>> The fact that the snapshot_save_job_bh() is scheduled in the main >>>> loop's qemu_aio_context AioContext means that it might get executed >>>> during a vCPU thread's aio_poll(). But saving of the VM state cannot >>>> happen while the guest or devices are active and can lead to assertion >>>> failures. See issue #2111 for two examples. Avoid the problem by >>>> scheduling the snapshot_save_job_bh() in the iohandler AioContext, >>>> which is not polled by vCPU threads. >>>> >>>> Solves Issue #2111. >>>> >>>> This change also solves the following issue: >>>> >>>> Since commit effd60c878 ("monitor: only run coroutine commands in >>>> qemu_aio_context"), the 'snapshot-save' QMP call would not respond >>>> right after starting the job anymore, but only after the job finished, >>>> which can take a long time. The reason is, because after commit >>>> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext. >>>> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the >>>> coroutine cannot be entered immediately anymore, but needs to be >>>> scheduled to the main loop's qemu_aio_context AioContext. But >>>> snapshot_save_job_bh() was scheduled first to the same AioContext and >>>> thus gets executed first. >>>> >>>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111 >>>> Signed-off-by: Fiona Ebner >>>> --- >>>> >>>> While initial smoke testing seems fine, I'm not familiar enough with >>>> this to rule out any pitfalls with the approach. Any reason why >>>> scheduling to the iohandler AioContext could be wrong here? >>> >>> If something waits for a BlockJob to finish using aio_poll() from >>> qemu_aio_context then a deadlock is possible since the iohandler_ctx >>> won't get a chance to execute. The only suspicious code path I found was >>> job_completed_txn_abort_locked() -> job_finish_sync_locked() but I'm not >>> sure whether it triggers this scenario. Please check that code path. >>> >> >> Sorry, I don't understand. Isn't executing the scheduled BH the only >> additional progress that the iohandler_ctx needs to make compared to >> before the patch? How exactly would that cause issues when waiting for a >> BlockJob? >> >> Or do you mean something waiting for the SnapshotJob from >> qemu_aio_context before snapshot_save_job_bh had the chance to run? > > Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx has > no chance to execute. But I haven't audited the code to understand > whether this can happen. So job_finish_sync_locked() is executed in job_completed_txn_abort_locked() when the following branch is taken > if (!job_is_completed_locked(other_job)) and there is no other job in the transaction, so we can assume other_job being the snapshot-save job itself. The callers of job_completed_txn_abort_locked(): 1. in job_do_finalize_locked() if job->ret is non-zero. The callers of which are: 1a. in job_finalize_locked() if JOB_VERB_FINALIZE is allowed, meaning job->status is JOB_STATUS_PENDING, so job_is_completed_locked() will be true. 1b. job_completed_txn_success_locked() sets job->status to JOB_STATUS_WAITING before, so job_is_completed_locked() will be true. 2. in job_completed_locked() it is only done if job->ret is non-zero, in which case job->status was set to JOB_STATUS_ABORTING by the preceding job_update_rc_locked(), and thus job_is_completed_locked() will be true. 3. in job_cancel_locked() if job->deferred_to_main_loop is true, which is set in job_co_entry() before job_exit() is scheduled as a BH and is also set in job_do_dismiss_locked(). In the former case, the snapshot_save_job_bh has already been executed. In the latter case, job_is_completed_locked() will be true (since job_early_fail() is not used for the snapshot job). However, job_finish_sync_locked() is also executed via job_cancel_sync_all() during qemu_cleanup(). With bad timing there, I suppose the issue could arise. In fact, I could trigger it with the following hack on top: > diff --git a/migration/savevm.c b/migration/savevm.c > index 0086b76ab0..42c93176ba 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -3429,6 +3429,17 @@ static void sn
Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context
Am 06.06.24 um 20:36 schrieb Stefan Hajnoczi: > On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote: >> The fact that the snapshot_save_job_bh() is scheduled in the main >> loop's qemu_aio_context AioContext means that it might get executed >> during a vCPU thread's aio_poll(). But saving of the VM state cannot >> happen while the guest or devices are active and can lead to assertion >> failures. See issue #2111 for two examples. Avoid the problem by >> scheduling the snapshot_save_job_bh() in the iohandler AioContext, >> which is not polled by vCPU threads. >> >> Solves Issue #2111. >> >> This change also solves the following issue: >> >> Since commit effd60c878 ("monitor: only run coroutine commands in >> qemu_aio_context"), the 'snapshot-save' QMP call would not respond >> right after starting the job anymore, but only after the job finished, >> which can take a long time. The reason is, because after commit >> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext. >> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the >> coroutine cannot be entered immediately anymore, but needs to be >> scheduled to the main loop's qemu_aio_context AioContext. But >> snapshot_save_job_bh() was scheduled first to the same AioContext and >> thus gets executed first. >> >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111 >> Signed-off-by: Fiona Ebner >> --- >> >> While initial smoke testing seems fine, I'm not familiar enough with >> this to rule out any pitfalls with the approach. Any reason why >> scheduling to the iohandler AioContext could be wrong here? > > If something waits for a BlockJob to finish using aio_poll() from > qemu_aio_context then a deadlock is possible since the iohandler_ctx > won't get a chance to execute. The only suspicious code path I found was > job_completed_txn_abort_locked() -> job_finish_sync_locked() but I'm not > sure whether it triggers this scenario. Please check that code path. > Sorry, I don't understand. Isn't executing the scheduled BH the only additional progress that the iohandler_ctx needs to make compared to before the patch? How exactly would that cause issues when waiting for a BlockJob? Or do you mean something waiting for the SnapshotJob from qemu_aio_context before snapshot_save_job_bh had the chance to run? Best Regards, Fiona
Re: [PATCH v4 0/5] mirror: allow specifying working bitmap
Ping Am 21.05.24 um 14:20 schrieb Fiona Ebner: > Changes from v3 (discussion here [3]): > * Improve/fix QAPI documentation. > > Changes from v2 (discussion here [2]): > * Cluster size caveats only apply to non-COW diff image, adapt the > cluster size check and documentation accordingly. > * In the IO test, use backing files (rather than stand-alone diff > images) in combination with copy-mode=write-blocking and larger > cluster size for target images, to have a more realistic use-case > and show that COW prevents ending up with cluster with partial data > upon unaligned writes. > * Create a separate patch for replacing is_none_mode with sync_mode in > MirrorBlockJobx struct. > * Disallow using read-only bitmap (cannot be used as working bitmap). > * Require that bitmap is enabled at the start of the job. > * Avoid IO test script potentially waiting on non-existent job when > blockdev-mirror QMP command fails. > * Fix pylint issues in IO test. > * Rename IO test from sync-bitmap-mirror to mirror-bitmap. > > Changes from RFC/v1 (discussion here [0]): > * Add patch to split BackupSyncMode and MirrorSyncMode. > * Drop bitmap-mode parameter and use passed-in bitmap as the working > bitmap instead. Users can get the desired behaviors by > using the block-dirty-bitmap-clear and block-dirty-bitmap-merge > calls (see commit message in patch 2/4 for how exactly). > * Add patch to check whether target image's cluster size is at most > mirror job's granularity. Optional, it's an extra safety check > that's useful when the target is a "diff" image that does not have > previously synced data. > > Use cases: > * Possibility to resume a failed mirror later. > * Possibility to only mirror deltas to a previously mirrored volume. > * Possibility to (efficiently) mirror an drive that was previously > mirrored via some external mechanism (e.g. ZFS replication). > > We are using the last one in production without any issues since about > 4 years now. In particular, like mentioned in [1]: > >> - create bitmap(s) >> - (incrementally) replicate storage volume(s) out of band (using ZFS) >> - incrementally drive mirror as part of a live migration of VM >> - drop bitmap(s) > > > Now, the IO test added in patch 4/5 actually contains yet another use > case, namely doing incremental mirrors to qcow2 "diff" images, that > only contain the delta and can be rebased later. I had to adapt the IO > test, because its output expected the mirror bitmap to still be dirty, > but nowadays the mirror is apparently already done when the bitmaps > are queried. So I thought, I'll just use 'write-blocking' mode to > avoid any potential timing issues. > > Initially, the qcow2 diff image targets were stand-alone and that > suffers from an issue when 'write-blocking' mode is used. If a write > is not aligned to the granularity of the mirror target, then rebasing > the diff image onto a backing image will not yield the desired result, > because the full cluster is considered to be allocated and will "hide" > some part of the base/backing image. The failure can be seen by either > using 'write-blocking' mode in the IO test or setting the (bitmap) > granularity to 32 KiB rather than the current 64 KiB. > > The test thus uses a more realistic approach where the qcow2 targets > have backing images and a check is added in patch 5/5 for the cluster > size for non-COW targets. However, with e.g. NBD, the cluster size > cannot be queried and prohibiting all bitmap mirrors to NBD targets > just to prevent the highly specific edge case seems not worth it, so > the limitation is rather documented and the check ignores cases where > querying the target image's cluster size returns -ENOTSUP. > > > [0]: > https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c > [1]: > https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ > [2]: > https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/ > [3]: > https://lore.kernel.org/qemu-devel/20240510131647.1256467-1-f.eb...@proxmox.com/ > > > Fabian Grünbichler (1): > iotests: add test for bitmap mirror > > Fiona Ebner (3): > qapi/block-core: avoid the re-use of MirrorSyncMode for backup > block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob > struct > blockdev: mirror: check for target's cluster size when using bitmap > > John Snow (1): > mirror: allow specifying working bitmap > > block/backup.c | 18 +- > block/mir
Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
Am 04.06.24 um 17:28 schrieb Kevin Wolf: > Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben: >> Am 03.06.24 um 18:21 schrieb Kevin Wolf: >>> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben: >>>> Am 26.03.24 um 13:44 schrieb Kevin Wolf: >>>>> >>>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all() >>>>> with a coroutine wrapper so that the graph lock is held for the whole >>>>> function. Then calling bdrv_co_flush() while iterating the list is safe >>>>> and doesn't allow concurrent graph modifications. >>>> >>>> The second is that iotest 255 ran into an assertion failure upon QMP >>>> 'quit': >>>> >>>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion >>>>> `!qemu_in_coroutine()' failed. >>>> >>>> Looking at the backtrace: >>>> >>>>> #5 0x762a90cc3eb2 in __GI___assert_fail >>>>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 >>>>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 >>>>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock") >>>>> at ./assert/assert.c:101 >>>>> #6 0x5afb07585311 in bdrv_graph_wrlock () at >>>>> ../block/graph-lock.c:113 >>>>> #7 0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at >>>>> ../block/block-backend.c:901 >>>>> #8 0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at >>>>> ../block/block-backend.c:487 >>>>> #9 0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at >>>>> ../block/block-backend.c:547 >>>>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at >>>>> ../block/block-backend.c:618 >>>>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347 >>>>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) >>>>> at block/block-gen.c:1391 >>>>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291) >>>> >>>> So I guess calling bdrv_next() is not safe from a coroutine, because >>>> the function doing the iteration could end up being the last thing to >>>> have a reference for the BB. >>> >>> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this >>> is surprising, because while we hold the graph lock, no reference should >>> be able to go away - you need the writer lock for that and you won't get >>> it as long as bdrv_co_flush_all() locks the graph. So whatever had a >>> reference before the bdrv_next() loop must still have it now. Do you >>> know where it gets dropped? >>> >> >> AFAICT, yes, it does hold the graph reader lock. The generated code is: >> >>> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque) >>> { >>> BdrvFlushAll *s = opaque; >>> >>> bdrv_graph_co_rdlock(); >>> s->ret = bdrv_co_flush_all(); >>> bdrv_graph_co_rdunlock(); >>> s->poll_state.in_progress = false; >>> >>> aio_wait_kick(); >>> } >> >> Apparently when the mirror job is aborted/exits, which can happen during >> the polling for bdrv_co_flush_all_entry(), a reference can go away >> without the write lock (at least my breakpoints didn't trigger) being held: >> >>> #0 blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537 >>> #1 0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at >>> ../block/mirror.c:710 >>> #2 0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at >>> ../block/mirror.c:823 >>> #3 0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825 >>> #4 0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) >>> at ../job.c:855 >>> #5 0x5cdefb223852 in job_completed_txn_abort_locked >>> (job=0x5cdefeb53000) at ../job.c:958 >>> #6 0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at >>> ../job.c:1065 >>> #7 0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088 >>> #8 0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at >>> ../util/async.c:171 >>> #9 0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at >>> ../util/async.c:218 >>> #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdef
[RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context
The fact that the snapshot_save_job_bh() is scheduled in the main loop's qemu_aio_context AioContext means that it might get executed during a vCPU thread's aio_poll(). But saving of the VM state cannot happen while the guest or devices are active and can lead to assertion failures. See issue #2111 for two examples. Avoid the problem by scheduling the snapshot_save_job_bh() in the iohandler AioContext, which is not polled by vCPU threads. Solves Issue #2111. This change also solves the following issue: Since commit effd60c878 ("monitor: only run coroutine commands in qemu_aio_context"), the 'snapshot-save' QMP call would not respond right after starting the job anymore, but only after the job finished, which can take a long time. The reason is, because after commit effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext. When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the coroutine cannot be entered immediately anymore, but needs to be scheduled to the main loop's qemu_aio_context AioContext. But snapshot_save_job_bh() was scheduled first to the same AioContext and thus gets executed first. Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111 Signed-off-by: Fiona Ebner --- While initial smoke testing seems fine, I'm not familiar enough with this to rule out any pitfalls with the approach. Any reason why scheduling to the iohandler AioContext could be wrong here? Should the same be done for the snapshot_load_job_bh and snapshot_delete_job_bh to keep it consistent? migration/savevm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/migration/savevm.c b/migration/savevm.c index c621f2359b..0086b76ab0 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job, Error **errp) SnapshotJob *s = container_of(job, SnapshotJob, common); s->errp = errp; s->co = qemu_coroutine_self(); -aio_bh_schedule_oneshot(qemu_get_aio_context(), +aio_bh_schedule_oneshot(iohandler_get_aio_context(), snapshot_save_job_bh, job); qemu_coroutine_yield(); return s->ret ? 0 : -1; -- 2.39.2
Re: [PATCH] target/i386: fix SSE and SSE2 featue check
Am 02.06.24 um 12:48 schrieb Zhao Liu: > On Sun, Jun 02, 2024 at 06:09:04PM +0800, lixinyu...@ict.ac.cn wrote: >> Date: Sun, 2 Jun 2024 18:09:04 +0800 >> From: lixinyu...@ict.ac.cn >> Subject: [PATCH] target/i386: fix SSE and SSE2 featue check >> X-Mailer: git-send-email 2.34.1 >> >> From: Xinyu Li >> >> Featues check of CPUID_SSE and CPUID_SSE2 shoule use cpuid_features, >> rather than cpuid_ext_features > > It's better to add a Fixes tag, > > Fixes: caa01fadbef1 ("target/i386: add CPUID feature checks to new decoder") > CC-ing stable, because caa01fadbef1 has been there since v7.2.0
[RFC PATCH] block-coroutine-wrapper: support generating wrappers for functions without arguments
Signed-off-by: Fiona Ebner --- An alternative would be to detect whether the argument list is 'void' in FuncDecl's __init__, assign the empty list to self.args there and special case based on that in the rest of the code. Not super happy about the introduction of the 'void_value' parameter, but the different callers seem to make something like it necessary. Could be avoided if there were a nice way to map a format which contains no other keys besides '{name}' to the empty list if the argument's 'name' is 'None'. At least until there is a format that contains both '{name}' and another key which would require special handling again. The generated code unfortunately does contain a few extra blank lines. Avoiding that would require turning some of the (currently static) formatting surrounding gen_block() dynamic based upon whether the argument list is 'void'. Happy about any feedback/suggestions! scripts/block-coroutine-wrapper.py | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py index dbbde99e39..533f6dbe12 100644 --- a/scripts/block-coroutine-wrapper.py +++ b/scripts/block-coroutine-wrapper.py @@ -54,6 +54,11 @@ class ParamDecl: r')') def __init__(self, param_decl: str) -> None: +if param_decl.strip() == 'void': +self.decl = 'void' +self.type = 'void' +self.name = None +return m = self.param_re.match(param_decl.strip()) if m is None: raise ValueError(f'Wrong parameter declaration: "{param_decl}"') @@ -114,10 +119,14 @@ def gen_ctx(self, prefix: str = '') -> str: else: return 'qemu_get_aio_context()' -def gen_list(self, format: str) -> str: +def gen_list(self, format: str, void_value='') -> str: +if len(self.args) == 1 and self.args[0].type == 'void': +return void_value return ', '.join(format.format_map(arg.__dict__) for arg in self.args) def gen_block(self, format: str) -> str: +if len(self.args) == 1 and self.args[0].type == 'void': +return '' return '\n'.join(format.format_map(arg.__dict__) for arg in self.args) @@ -158,7 +167,7 @@ def create_mixed_wrapper(func: FuncDecl) -> str: graph_assume_lock = 'assume_graph_lock();' if func.graph_rdlock else '' return f"""\ -{func.return_type} {func.name}({ func.gen_list('{decl}') }) +{func.return_type} {func.name}({ func.gen_list('{decl}', 'void') }) {{ if (qemu_in_coroutine()) {{ {graph_assume_lock} @@ -186,7 +195,7 @@ def create_co_wrapper(func: FuncDecl) -> str: name = func.target_name struct_name = func.struct_name return f"""\ -{func.return_type} {func.name}({ func.gen_list('{decl}') }) +{func.return_type} {func.name}({ func.gen_list('{decl}', 'void') }) {{ {struct_name} s = {{ .poll_state.ctx = qemu_get_current_aio_context(), @@ -284,7 +293,7 @@ def gen_no_co_wrapper(func: FuncDecl) -> str: aio_co_wake(s->co); }} -{func.return_type} coroutine_fn {func.name}({ func.gen_list('{decl}') }) +{func.return_type} coroutine_fn {func.name}({ func.gen_list('{decl}', 'void') }) {{ {struct_name} s = {{ .co = qemu_coroutine_self(), -- 2.39.2
Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
Am 03.06.24 um 18:21 schrieb Kevin Wolf: > Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben: >> Am 26.03.24 um 13:44 schrieb Kevin Wolf: >>> >>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all() >>> with a coroutine wrapper so that the graph lock is held for the whole >>> function. Then calling bdrv_co_flush() while iterating the list is safe >>> and doesn't allow concurrent graph modifications. >> >> The second is that iotest 255 ran into an assertion failure upon QMP 'quit': >> >>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion >>> `!qemu_in_coroutine()' failed. >> >> Looking at the backtrace: >> >>> #5 0x762a90cc3eb2 in __GI___assert_fail >>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 >>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 >>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock") >>> at ./assert/assert.c:101 >>> #6 0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113 >>> #7 0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at >>> ../block/block-backend.c:901 >>> #8 0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at >>> ../block/block-backend.c:487 >>> #9 0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at >>> ../block/block-backend.c:547 >>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at >>> ../block/block-backend.c:618 >>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347 >>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) >>> at block/block-gen.c:1391 >>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291) >> >> So I guess calling bdrv_next() is not safe from a coroutine, because >> the function doing the iteration could end up being the last thing to >> have a reference for the BB. > > Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this > is surprising, because while we hold the graph lock, no reference should > be able to go away - you need the writer lock for that and you won't get > it as long as bdrv_co_flush_all() locks the graph. So whatever had a > reference before the bdrv_next() loop must still have it now. Do you > know where it gets dropped? > AFAICT, yes, it does hold the graph reader lock. The generated code is: > static void coroutine_fn bdrv_co_flush_all_entry(void *opaque) > { > BdrvFlushAll *s = opaque; > > bdrv_graph_co_rdlock(); > s->ret = bdrv_co_flush_all(); > bdrv_graph_co_rdunlock(); > s->poll_state.in_progress = false; > > aio_wait_kick(); > } Apparently when the mirror job is aborted/exits, which can happen during the polling for bdrv_co_flush_all_entry(), a reference can go away without the write lock (at least my breakpoints didn't trigger) being held: > #0 blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537 > #1 0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at > ../block/mirror.c:710 > #2 0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at > ../block/mirror.c:823 > #3 0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825 > #4 0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) at > ../job.c:855 > #5 0x5cdefb223852 in job_completed_txn_abort_locked (job=0x5cdefeb53000) > at ../job.c:958 > #6 0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at > ../job.c:1065 > #7 0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088 > #8 0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at > ../util/async.c:171 > #9 0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at > ../util/async.c:218 > #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at > ../util/aio-posix.c:722 > #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at > ../block/block-gen.h:43 > #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410 > #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, > send_stop=false) > at ../system/cpus.c:297 > #14 0x5cdefae5c850 in vm_shutdown () at ../system/cpus.c:308 > #15 0x5cdefae6d892 in qemu_cleanup (status=0) at ../system/runstate.c:871 > #16 0x5cdefb1a7e78 in qemu_default_main () at ../system/main.c:38 > #17 0x5cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at > ../system/main.c:48 Looking at the code in mirror_exit_common(), it doesn't seem to acquire a write lock: > bdrv_graph_rdunlock_main_loop(); > > /* > * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before > * inserting target_bs at s->to_replace, where we might not be able to get > * these permissions. > */ > blk_unref(s->target); > s->target = NULL; The write lock is taken in blk_remove_bs() when the refcount drops to 0 and the BB is actually removed: > bdrv_graph_wrlock(); > bdrv_root_unref_child(root); > bdrv_graph_wrunlock(); Best Regards, Fiona
Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
Am 28.05.24 um 18:06 schrieb Kevin Wolf: > Am 29.04.2024 um 16:19 hat Fiona Ebner geschrieben: >> rather than the uint32_t for which the maximum is slightly more than 4 >> seconds and larger values would overflow. The QAPI interface allows >> specifying the number of seconds, so only values 0 to 4 are safe right >> now, other values lead to a much lower timeout than a user expects. >> >> The block_copy() call where this is used already takes a uint64_t for >> the timeout, so no change required there. >> >> Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option") >> Reported-by: Friedrich Weber >> Signed-off-by: Fiona Ebner > > Thanks, applied to the block branch. > > But I don't think our job is done yet with this. Increasing the limit is > good and useful, but even if it's now unlikely to hit with sane values, > we should still catch integer overflows in cbw_open() and return an > error on too big values instead of silently wrapping around. NANOSECONDS_PER_SECOND is 10^9 and the QAPI type for cbw-timeout is uint32_t, so even with the maximum allowed value, there is no overflow. Should I still add such a check? Best Regards, Fiona
Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
Hi Kevin, Am 26.03.24 um 13:44 schrieb Kevin Wolf: > Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben: >> The old_bs variable in bdrv_next() is currently determined by looking >> at the old block backend. However, if the block graph changes before >> the next bdrv_next() call, it might be that the associated BDS is not >> the same that was referenced previously. In that case, the wrong BDS >> is unreferenced, leading to an assertion failure later: >> >>> bdrv_unref: Assertion `bs->refcnt > 0' failed. > > Your change makes sense, but in theory it shouldn't make a difference. > The real bug is in the caller, you can't allow graph modifications while > iterating the list of nodes. Even if it doesn't crash (like after your > patch), you don't have any guarantee that you will have seen every node > that exists that the end - and maybe not even that you don't get the > same node twice. > >> In particular, this can happen in the context of bdrv_flush_all(), >> when polling for bdrv_co_flush() in the generated co-wrapper leads to >> a graph change (for example with a stream block job [0]). > > The whole locking around this case is a bit tricky and would deserve > some cleanup. > > The basic rule for bdrv_next() callers is that they need to hold the > graph reader lock as long as they are iterating the graph, otherwise > it's not safe. This implies that the ref/unref pairs in it should never > make a difference either - which is important, because at least > releasing the last reference is forbidden while holding the graph lock. > I intended to remove the ref/unref for bdrv_next(), but I didn't because > I realised that the callers need to be audited first that they really > obey the rules. You found one that would be problematic. > > The thing that bdrv_flush_all() gets wrong is that it promises to follow > the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls > something that polls. The compiler can't catch this because bdrv_flush() > is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is: > > - If called outside of coroutine context, they are GRAPH_UNLOCKED > - If called in coroutine context, they are GRAPH_RDLOCK > > We should probably try harder to get rid of the mixed functions, because > a synchronous co_wrapper_bdrv_rdlock could actually be marked > GRAPH_UNLOCKED in the code and then the compiler could catch this case. > > The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all() > with a coroutine wrapper so that the graph lock is held for the whole > function. Then calling bdrv_co_flush() while iterating the list is safe > and doesn't allow concurrent graph modifications. I attempted this now, but ran into two issues: The first is that I had to add support for a function without arguments to the block-coroutine-wrapper.py script. I can send this as an RFC in any case if desired. The second is that iotest 255 ran into an assertion failure upon QMP 'quit': > ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion > `!qemu_in_coroutine()' failed. Looking at the backtrace: > #5 0x762a90cc3eb2 in __GI___assert_fail > (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 > "../block/graph-lock.c", line=113, function=0x5afb07991f20 > <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock") > at ./assert/assert.c:101 > #6 0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113 > #7 0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at > ../block/block-backend.c:901 > #8 0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at > ../block/block-backend.c:487 > #9 0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at > ../block/block-backend.c:547 > #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at > ../block/block-backend.c:618 > #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347 > #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) at > block/block-gen.c:1391 > #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291) So I guess calling bdrv_next() is not safe from a coroutine, because the function doing the iteration could end up being the last thing to have a reference for the BB. Best Regards, Fiona
Re: [PATCH] virtio-pci: Fix the use of an uninitialized irqfd.
Hi, Am 22.05.24 um 07:10 schrieb Cindy Lu: > The crash was reported in MAC OS and NixOS, here is the link for this bug > https://gitlab.com/qemu-project/qemu/-/issues/2334 > https://gitlab.com/qemu-project/qemu/-/issues/2321 > > The root cause is that the function virtio_pci_set_guest_notifiers() only > initializes the irqfd when the use_guest_notifier_mask and guest_notifier_mask > are set. Sorry, I'm just trying to understand the fix and I'm probably missing something, but in virtio_pci_set_guest_notifiers() there is: > bool with_irqfd = msix_enabled(&proxy->pci_dev) && > kvm_msi_via_irqfd_enabled(); and then: > if ((with_irqfd || > (vdev->use_guest_notifier_mask && k->guest_notifier_mask)) && > assign) { > if (with_irqfd) { > proxy->vector_irqfd = > g_malloc0(sizeof(*proxy->vector_irqfd) * > msix_nr_vectors_allocated(&proxy->pci_dev)); > r = kvm_virtio_pci_vector_vq_use(proxy, nvqs); Meaning proxy->vector_irqfd is allocated when with_irqfd is true (even if vdev->use_guest_notifier_mask && k->guest_notifier_mask is false). > However, this check is missing in virtio_pci_set_vector(). > So the fix is to add this check. > > This fix is verified in vyatta,MacOS,NixOS,fedora system. > > The bt tree for this bug is: > Thread 6 "CPU 0/KVM" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7c817be006c0 (LWP 1269146)] > kvm_virtio_pci_vq_vector_use () at ../qemu-9.0.0/hw/virtio/virtio-pci.c:817 > 817 if (irqfd->users == 0) { The crash happens because the irqfd is NULL/invalid here, right? proxy->vector_irqfd = NULL happens when virtio_pci_set_guest_notifiers() is called with assign=false or for an unsuccessful call to virtio_pci_set_guest_notifiers() with assign=true. AFAIU, the issue is that virtio_pci_set_vector() is called between a call to virtio_pci_set_guest_notifiers() with assign=false and a successful virtio_pci_set_guest_notifiers() with assign=true (or before the first such call). So I'm trying to understand why adding the check for vdev->use_guest_notifier_mask && k->guest_notifier_mask is sufficient to fix the issue. Thanks! Best Regards, Fiona
Re: [PULL 6/9] virtio-gpu: fix v2 migration
CC-ing stable, because this already is an issue in 9.0.0 Am 23.05.24 um 00:20 schrieb Fabiano Rosas: > From: Marc-André Lureau > > Commit dfcf74fa ("virtio-gpu: fix scanout migration post-load") broke > forward/backward version migration. Versioning of nested VMSD structures > is not straightforward, as the wire format doesn't have nested > structures versions. Introduce x-scanout-vmstate-version and a field > test to save/load appropriately according to the machine version. > > Fixes: dfcf74fa ("virtio-gpu: fix scanout migration post-load") > Signed-off-by: Marc-André Lureau > Signed-off-by: Peter Xu > Reviewed-by: Fiona Ebner > Tested-by: Fiona Ebner > [fixed long lines] > Signed-off-by: Fabiano Rosas > --- > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 30 ++ > include/hw/virtio/virtio-gpu.h | 1 + > 3 files changed, 24 insertions(+), 8 deletions(-) > > diff --git a/hw/core/machine.c b/hw/core/machine.c > index c7ceb11501..8d6dc69f0e 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -42,6 +42,7 @@ GlobalProperty hw_compat_8_2[] = { > { "migration", "zero-page-detection", "legacy"}, > { TYPE_VIRTIO_IOMMU_PCI, "granule", "4k" }, > { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "64" }, > +{ "virtio-gpu-device", "x-scanout-vmstate-version", "1" }, > }; > const size_t hw_compat_8_2_len = G_N_ELEMENTS(hw_compat_8_2); > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c > index ae831b6b3e..d60b1b2973 100644 > --- a/hw/display/virtio-gpu.c > +++ b/hw/display/virtio-gpu.c > @@ -1166,10 +1166,17 @@ static void virtio_gpu_cursor_bh(void *opaque) > virtio_gpu_handle_cursor(&g->parent_obj.parent_obj, g->cursor_vq); > } > > +static bool scanout_vmstate_after_v2(void *opaque, int version) > +{ > +struct VirtIOGPUBase *base = container_of(opaque, VirtIOGPUBase, > scanout); > +struct VirtIOGPU *gpu = container_of(base, VirtIOGPU, parent_obj); > + > +return gpu->scanout_vmstate_version >= 2; > +} > + > static const VMStateDescription vmstate_virtio_gpu_scanout = { > .name = "virtio-gpu-one-scanout", > -.version_id = 2, > -.minimum_version_id = 1, > +.version_id = 1, > .fields = (const VMStateField[]) { > VMSTATE_UINT32(resource_id, struct virtio_gpu_scanout), > VMSTATE_UINT32(width, struct virtio_gpu_scanout), > @@ -1181,12 +1188,18 @@ static const VMStateDescription > vmstate_virtio_gpu_scanout = { > VMSTATE_UINT32(cursor.hot_y, struct virtio_gpu_scanout), > VMSTATE_UINT32(cursor.pos.x, struct virtio_gpu_scanout), > VMSTATE_UINT32(cursor.pos.y, struct virtio_gpu_scanout), > -VMSTATE_UINT32_V(fb.format, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.bytes_pp, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.width, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.height, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.stride, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.offset, struct virtio_gpu_scanout, 2), > +VMSTATE_UINT32_TEST(fb.format, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.bytes_pp, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.width, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.height, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.stride, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.offset, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > VMSTATE_END_OF_LIST() > }, > }; > @@ -1659,6 +1672,7 @@ static Property virtio_gpu_properties[] = { > DEFINE_PROP_BIT("blob", VirtIOGPU, parent_obj.conf.flags, > VIRTIO_GPU_FLAG_BLOB_ENABLED, false), > DEFINE_PROP_SIZE("hostmem", VirtIOGPU, parent_obj.conf.hostmem, 0), > +DEFINE_PROP_UINT8("x-scanout-vmstate-version", VirtIOGPU, > scanout_vmstate_version, 2), > DEFINE_PROP_END_OF_LIST(), > }; > > diff --git a/include/hw/virtio/virtio-gpu.h b/include/hw/virtio/virtio-gpu.h > index 56d6e821bf..7a59379f5a 100644 > --- a/include/hw/virtio/virtio-gpu.h > +++ b/include/hw/virtio/virtio-gpu.h > @@ -177,6 +177,7 @@ typedef struct VGPUDMABuf { > struct VirtIOGPU { > VirtIOGPUBase parent_obj; > > +uint8_t scanout_vmstate_version; > uint64_t conf_max_hostmem; > > VirtQueue *ctrl_vq;
Re: [PATCH 1/2] Revert "monitor: use aio_co_reschedule_self()"
CC-ing stable since 1f25c172f83704e350c0829438d832384084a74d is in 9.0.0 Am 06.05.24 um 21:06 schrieb Stefan Hajnoczi: > Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code > cleanup that uses aio_co_reschedule_self() instead of open coding > coroutine rescheduling. > > Bug RHEL-34618 was reported and Kevin Wolf identified > the root cause. I missed that aio_co_reschedule_self() -> > qemu_get_current_aio_context() only knows about > qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It > does not function correctly when going back from the iohandler_ctx to > qemu_aio_context. > > Go back to open coding the AioContext transitions to avoid this bug. > > This reverts commit 1f25c172f83704e350c0829438d832384084a74d. > > Buglink: https://issues.redhat.com/browse/RHEL-34618 > Signed-off-by: Stefan Hajnoczi > --- > qapi/qmp-dispatch.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c > index f3488afeef..176b549473 100644 > --- a/qapi/qmp-dispatch.c > +++ b/qapi/qmp-dispatch.c > @@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const > QmpCommandList *cmds, QObject *requ > * executing the command handler so that it can make progress if > it > * involves an AIO_WAIT_WHILE(). > */ > -aio_co_reschedule_self(qemu_get_aio_context()); > +aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self()); > +qemu_coroutine_yield(); > } > > monitor_set_cur(qemu_coroutine_self(), cur_mon); > @@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const > QmpCommandList *cmds, QObject *requ > * Move back to iohandler_ctx so that nested event loops for > * qemu_aio_context don't start new monitor commands. > */ > -aio_co_reschedule_self(iohandler_get_aio_context()); > +aio_co_schedule(iohandler_get_aio_context(), > +qemu_coroutine_self()); > +qemu_coroutine_yield(); > } > } else { > /*
Re: block snapshot issue with RBD
Hi, Am 28.05.24 um 20:19 schrieb Jin Cao: > Hi Ilya > > On 5/28/24 11:13 AM, Ilya Dryomov wrote: >> On Mon, May 27, 2024 at 9:06 PM Jin Cao wrote: >>> >>> Supplementary info: VM is paused after "migrate" command. After being >>> resumed with "cont", snapshot_delete_blkdev_internal works again, which >>> is confusing, as disk snapshot generally recommend I/O is paused, and a >>> frozen VM satisfy this requirement. >> >> Hi Jin, >> >> This doesn't seem to be related to RBD. Given that the same error is >> observed when using the RBD driver with the raw format, I would dig in >> the direction of migration somehow "installing" the raw format (which >> is on-disk compatible with the rbd format). >> > > Thanks for the hint. > >> Also, did you mean to say "snapshot_blkdev_internal" instead of >> "snapshot_delete_blkdev_internal" in both instances? > > Sorry for my copy-and-paste mistake. Yes, it's snapshot_blkdev_internal. > > -- > Sincerely, > Jin Cao > >> >> Thanks, >> >> Ilya >> >>> >>> -- >>> Sincerely >>> Jin Cao >>> >>> On 5/27/24 10:56 AM, Jin Cao wrote: CC block and migration related address. On 5/27/24 12:03 AM, Jin Cao wrote: > Hi, > > I encountered RBD block snapshot issue after doing migration. > > Steps > - > > 1. Start QEMU with: > ./qemu-system-x86_64 -name VM -machine q35 -accel kvm -cpu > host,migratable=on -m 2G -boot menu=on,strict=on > rbd:image/ubuntu-22.04-server-cloudimg-amd64.raw -net nic -net user > -cdrom /home/my/path/of/cloud-init.iso -monitor stdio > > 2. Do block snapshot in monitor cmd: snapshot_delete_blkdev_internal. > It works as expected: the snapshot is visable with command`rbd snap ls > pool_name/image_name`. > > 3. Do pseudo migration with monitor cmd: migrate -d > exec:cat>/tmp/vm.out > > 4. Do block snapshot again with snapshot_delete_blkdev_internal, then > I get: > Error: Block format 'raw' used by device 'ide0-hd0' does not > support internal snapshots > > I was hoping to do the second block snapshot successfully, and it > feels abnormal the RBD block snapshot function is disrupted after > migration. > > BTW, I get the same block snapshot error when I start QEMU with: > "-drive format=raw,file=rbd:pool_name/image_name" > > My questions is: how could I proceed with RBD block snapshot after the > pseudo migration? > > I bisected this issue to d3007d348a ("block: Fix crash when loading snapshot on inactive node"). > diff --git a/block/snapshot.c b/block/snapshot.c > index ec8cf4810b..c4d40e80dd 100644 > --- a/block/snapshot.c > +++ b/block/snapshot.c > @@ -196,8 +196,10 @@ bdrv_snapshot_fallback(BlockDriverState *bs) > int bdrv_can_snapshot(BlockDriverState *bs) > { > BlockDriver *drv = bs->drv; > + > GLOBAL_STATE_CODE(); > -if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) { > + > +if (!drv || !bdrv_is_inserted(bs) || !bdrv_is_writable(bs)) { > return 0; > } > So I guess the issue is that the blockdev is not writable when "postmigrate" state? Best Regards, Fiona
[PATCH v2 0/2] backup: allow specifying minimum cluster size
Based-on: https://lore.kernel.org/qemu-devel/20240429115157.2260885-1-vsement...@yandex-team.ru/ Discussion for v1: https://lore.kernel.org/qemu-devel/20240308155158.830258-1-f.eb...@proxmox.com/ Changes in v2: * Use 'size' type in QAPI. * Remove option in cbw_parse_options(), i.e. before parsing generic blockdev options. * Reword commit messages hoping to describe the issue in a more straight-forward way. In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore then. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. Fiona Ebner (2): copy-before-write: allow specifying minimum cluster size backup: add minimum cluster size to performance options block/backup.c | 2 +- block/block-copy.c | 22 ++ block/copy-before-write.c | 18 +- block/copy-before-write.h | 1 + blockdev.c | 3 +++ include/block/block-copy.h | 1 + qapi/block-core.json | 17 ++--- 7 files changed, 55 insertions(+), 9 deletions(-) -- 2.39.2
[PATCH v2 2/2] backup: add minimum cluster size to performance options
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore then. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * Use 'size' type in QAPI. block/backup.c| 2 +- block/copy-before-write.c | 8 block/copy-before-write.h | 1 + blockdev.c| 3 +++ qapi/block-core.json | 9 +++-- 5 files changed, 20 insertions(+), 3 deletions(-) diff --git a/block/backup.c b/block/backup.c index 3dd2e229d2..a1292c01ec 100644 --- a/block/backup.c +++ b/block/backup.c @@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, } cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source, - &bcs, errp); + perf->min_cluster_size, &bcs, errp); if (!cbw) { goto error; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index ef0bc4dcfe..183eed42e5 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -553,6 +553,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp) { @@ -572,6 +573,13 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, qdict_put_str(opts, "file", bdrv_get_node_name(source)); qdict_put_str(opts, "target", bdrv_get_node_name(target)); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %lu > %ld", + min_cluster_size, INT64_MAX); +return NULL; +} +qdict_put_int(opts, "min-cluster-size", (int64_t)min_cluster_size); + top = bdrv_insert_node(source, opts, flags, errp); if (!top) { return NULL; diff --git a/block/copy-before-write.h b/block/copy-before-write.h index 01af0cd3c4..2a5d4ba693 100644 --- a/block/copy-before-write.h +++ b/block/copy-before-write.h @@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp); void bdrv_cbw_drop(BlockDriverState *bs); diff --git a/blockdev.c b/blockdev.c index 835064ed03..6740663fda 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2655,6 +2655,9 @@ static BlockJob *do_backup_common(BackupCommon *backup, if (backup->x_perf->has_max_chunk) { perf.max_chunk = backup->x_perf->max_chunk; } +if (backup->x_perf->has_min_cluster_size) { +perf.min_cluster_size = backup->x_perf->min_cluster_size; +} } if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) || diff --git a/qapi/block-core.json b/qapi/block-core.json index 8fc0a4b234..f1219a9dfb 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1551,11 +1551,16 @@ # it should not be less than job cluster size which is calculated # as maximum of target image cluster size and 64k. Default 0. # +# @min-cluster-size: Minimum size of blocks used by copy-before-write +# and background copy operations. Has to be a power of 2. No +# effect if smaller than the maximum of the target's cluster size +# and 64 KiB. Default 0. (Since 9.1) +# # Since: 6.0 ## { 'struct': 'BackupPerf', - 'data': { '*use-copy-range': 'bool', -'*max-workers': 'int', '*max-chunk': 'int64' } } + 'data': { '*use-copy-range': 'bool', '*max-workers': 'int', +'*max-chunk': 'int64', '*min-cluster-size': 'size' } } ## # @BackupCommon: -- 2.39.2
[PATCH v2 1/2] copy-before-write: allow specifying minimum cluster size
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore then. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. The type 'size' (corresponding to uint64_t in C) is used in QAPI to rule out negative inputs and for consistency with already existing @cluster-size parameters. Since block_copy_calculate_cluster_size() uses int64_t for its result, a check that the input is not too large is added in block_copy_state_new() before calling it. The calculation in block_copy_calculate_cluster_size() is done in the target int64_t type. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * Use 'size' type in QAPI. * Remove option in cbw_parse_options(), i.e. before parsing generic blockdev options. block/block-copy.c | 22 ++ block/copy-before-write.c | 10 +- include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 35 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..36eaecaaf4 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, "used. If the actual block size of the target exceeds " "this default, the backup may be unusable", BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. */ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, + (int64_t)MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size)); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + uint64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +368,18 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %lu > %ld", + min_cluster_size, INT64_MAX); +return NULL; +} else if (min_cluster_size && !is_power_of_2(min_cluster_size)) { +error_setg(errp, "min-cluster-size needs to be a power of 2"); +return NULL; +} + +cluster_size = block_copy_calculate_cluster_size(target->bs, + (int64_t)min_cluster_size, + errp); if (cluster_size < 0) { return NULL; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index cd65524e26..ef0bc4dcfe 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -417,6 +417,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp) qdict_extract_subqdict(options, NULL, "bitmap"); qdict_del(options, "on-cbw-error"); qdict_del(options, "cbw-timeout"); +qdict_del(options, "min-cluster-size"); out: visit_free(v); @@ -432,6 +433,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags, BDRVCopyBeforeWriteState *s = bs->opaque; BdrvD
[PATCH v4 4/5] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap, and the test was changed to use backing images and @sync-mode=write-blocking. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 adapt to changes to mirror bitmap interface rename test from '384' to 'mirror-bitmap' use backing files, copy-mode=write-blocking, larger cluster size] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/mirror-bitmap | 597 tests/qemu-iotests/tests/mirror-bitmap.out | 3191 2 files changed, 3788 insertions(+) create mode 100755 tests/qemu-iotests/tests/mirror-bitmap create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap new file mode 100755 index 00..37bbe0f241 --- /dev/null +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -0,0 +1,597 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 +IMAGE_CLUSTER_SIZE = 128 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. +Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULA
[PATCH v4 0/5] mirror: allow specifying working bitmap
Changes from v3 (discussion here [3]): * Improve/fix QAPI documentation. Changes from v2 (discussion here [2]): * Cluster size caveats only apply to non-COW diff image, adapt the cluster size check and documentation accordingly. * In the IO test, use backing files (rather than stand-alone diff images) in combination with copy-mode=write-blocking and larger cluster size for target images, to have a more realistic use-case and show that COW prevents ending up with cluster with partial data upon unaligned writes. * Create a separate patch for replacing is_none_mode with sync_mode in MirrorBlockJobx struct. * Disallow using read-only bitmap (cannot be used as working bitmap). * Require that bitmap is enabled at the start of the job. * Avoid IO test script potentially waiting on non-existent job when blockdev-mirror QMP command fails. * Fix pylint issues in IO test. * Rename IO test from sync-bitmap-mirror to mirror-bitmap. Changes from RFC/v1 (discussion here [0]): * Add patch to split BackupSyncMode and MirrorSyncMode. * Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly). * Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data. Use cases: * Possibility to resume a failed mirror later. * Possibility to only mirror deltas to a previously mirrored volume. * Possibility to (efficiently) mirror an drive that was previously mirrored via some external mechanism (e.g. ZFS replication). We are using the last one in production without any issues since about 4 years now. In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/5 actually contains yet another use case, namely doing incremental mirrors to qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. Initially, the qcow2 diff image targets were stand-alone and that suffers from an issue when 'write-blocking' mode is used. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The test thus uses a more realistic approach where the qcow2 targets have backing images and a check is added in patch 5/5 for the cluster size for non-COW targets. However, with e.g. NBD, the cluster size cannot be queried and prohibiting all bitmap mirrors to NBD targets just to prevent the highly specific edge case seems not worth it, so the limitation is rather documented and the check ignores cases where querying the target image's cluster size returns -ENOTSUP. [0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ [2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/ [3]: https://lore.kernel.org/qemu-devel/20240510131647.1256467-1-f.eb...@proxmox.com/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (3): qapi/block-core: avoid the re-use of MirrorSyncMode for backup block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c | 18 +- block/mirror.c | 101 +- block/monitor/block-hmp-cmds.c |2 +- block/replication.c|2 +- blockdev.c | 127 +- include/block/block_int-global-state.h |7 +- qapi/block-core.json | 62 +- tests/qemu-iotests/tests/mirror-bitmap | 603 tests/qemu-iotests/tests/mirror-bitmap.out | 3198 tests/unit/test-block-iothread.c |2 +- 10 files changed, 4053 in
[PATCH v4 3/5] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a non-COW "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when mirror still got work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limitation in QAPI. This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster --- block/mirror.c | 80 +- blockdev.c | 44 +++--- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 35 ++- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 141 insertions(+), 25 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index ca23d6ef65..d3d0698116 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). + */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -820,6 +829,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (!s->dirty_bitmap_is_local && s->dirty_bitmap) { +bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE && s->dirty_bitmap_is_local) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(&s->common.job)) { goto immediate_exit; @@ -1029,6 +1048,14 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) */ mirror_top_op
[PATCH v4 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). A consequence is also tighter introspection: query-qmp-schema no longer reports drive-mirror and blockdev-mirror accepting @sync values they actually reject. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == &backup_job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 1bdce3b657..c0597039a5 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
[PATCH v4 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
It is more flexible and is done in preparation to support specifying a working bitmap for mirror jobs. In particular, this makes it possible to assert that @sync_mode=full when a bitmap is used. That assertion is just to be sure, of course the mirror QMP commands will be made to fail earlier with a clean error. Signed-off-by: Fiona Ebner --- block/mirror.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c0597039a5..ca23d6ef65 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job) &error_abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(&s->common.job)) { goto immediate_exit; @@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name, bool is_mirror, MirrorCopyMode copy_mode, Error **errp) @@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job( s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; -s->is_none_mode = is_none_mode; +s->sync_mode = sync_mode; s->backing_mode = backing_mode; s->zero_target = zero_target; qatomic_set(&s->copy_mode, copy_mode); @@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState *bs, bool unmap, const char *filter_node_name, MirrorCopyMode copy_mode, Error **errp) { -bool is_none_mode; BlockDriverState *base; GLOBAL_STATE_CODE(); bdrv_graph_rdlock_main_loop(); -is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; bdrv_graph_rdunlock_main_loop(); mirror_start_job(job_id, bs, creation_flags, target, replaces, speed, granularity, buf_size, backing_mode, zero_target, on_source_error, on_target_error, unmap, NULL, NULL, - &mirror_job_driver, is_none_mode, base, false, + &mirror_job_driver, mode, base, false, filter_node_name, true, copy_mode, errp); } @@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs, job_id, bs, creation_flags, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN, false, on_error, on_error, true, cb, opaque, - &commit_active_job_driver, false, base, auto_complete, - filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND, - errp); + &commit_active_job_driver, MIRROR_SYNC_MODE_FULL, base, + auto_complete, filter_node_name, false, + MIRROR_COPY_MODE_BACKGROUND, errp); if (!job) { goto error_restore_flags; } -- 2.39.2
[PATCH v4 5/5] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target does not do COW and is is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the jobs granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- blockdev.c | 57 ++ tests/qemu-iotests/tests/mirror-bitmap | 6 +++ tests/qemu-iotests/tests/mirror-bitmap.out | 7 +++ 3 files changed, 70 insertions(+) diff --git a/blockdev.c b/blockdev.c index 4f72a72dc7..468974108e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp) blockdev_do_action(&action, errp); } +static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target, +BdrvDirtyBitmap *bitmap, +Error **errp) +{ +int ret; +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + +GLOBAL_STATE_CODE(); +GRAPH_RDLOCK_GUARD_MAINLOOP(); + +if (bdrv_backing_chain_next(target)) { +/* + * No need to worry about creating clusters with partial data when the + * target does COW. + */ +return 0; +} + +/* + * If there is no backing file on the target, we cannot rely on COW if our + * backup cluster size is smaller than the target cluster size. Even for + * targets with a backing file, try to avoid COW if possible. + */ +ret = bdrv_get_info(target, &bdi); +if (ret == -ENOTSUP) { +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to small enough then. + */ +return 0; +} else if (ret < 0) { +error_setg_errno(errp, -ret, +"Couldn't determine the cluster size of the target image, " +"which has no backing file"); +return ret; +} + +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of the " + "target image's cluster size %u and the target image has " + "no backing file", + bitmap_granularity, bdi.cluster_size); +return -EINVAL; +} + +return 0; +} + + /* Parameter check and block job starting for drive mirroring. * Caller should hold @device and @target's aio context (must be the same). **/ @@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) { +return; +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { return; } diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap index 37bbe0f241..e8cd482a19 100755 --- a/tests/qemu-iotests/tests/mirror-bitmap +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -584,6 +584,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity to non-COW target --\n") +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=GRANULARITY) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/mirror-bitmap.out b/tests/qemu-iotests/tests/mirror-bitmap.out index 5c8acc1d69..af605f3803 100644 --- a/tests/qemu-iotests/tests/mirror-bitmap.out +++ b/tests/qemu-iotests/tests/mir
Re: [PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1
Am 21.05.24 um 00:22 schrieb Fabiano Rosas: > Fiona Ebner writes: > >> Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine >> version 8.1 can fail with: >> >>> kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 >>> kvm: Failed to load virtio-net:virtio >>> kvm: error while loading state for instance 0x0 of device >>> ':00:12.0/virtio-net' >>> kvm: load of migration failed: Operation not permitted >> >> The series >> >> 53da8b5a99 virtio-net: Add support for USO features >> 9da1684954 virtio-net: Add USO flags to vhost support. >> f03e0cf63b tap: Add check for USO features >> 2ab0ec3121 tap: Add USO support to tap device. >> >> only landed in QEMU 8.2, so the compatibility flags should be part of >> machine version 8.1. >> >> Moving the flags unfortunately breaks forward migration with machine >> version 8.1 from a binary without this patch to a binary with this >> patch. >> >> Fixes: 53da8b5a99 ("virtio-net: Add support for USO features") >> Signed-off-by: Fiona Ebner > > Reviewed-by: Fabiano Rosas > > I'll get to it eventually, but is this another one where just having > -device virtio-net in the command line when testing cross-version > migration would already have caught the issue? > AFAIU, the guest kernel needs to be recent enough to support the feature too. I don't seem to run into the issue with a Debian 11 (kernel 5.10) guest, but I do run into the issue with an Ubuntu 23.10 (kernel 6.5) guest. Seems like it got added in kernel 6.2 with 418044e1de30 ("drivers/net/virtio_net.c: Added USO support.") Best Regards, Fiona
[PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1
Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine version 8.1 can fail with: > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 > kvm: Failed to load virtio-net:virtio > kvm: error while loading state for instance 0x0 of device > ':00:12.0/virtio-net' > kvm: load of migration failed: Operation not permitted The series 53da8b5a99 virtio-net: Add support for USO features 9da1684954 virtio-net: Add USO flags to vhost support. f03e0cf63b tap: Add check for USO features 2ab0ec3121 tap: Add USO support to tap device. only landed in QEMU 8.2, so the compatibility flags should be part of machine version 8.1. Moving the flags unfortunately breaks forward migration with machine version 8.1 from a binary without this patch to a binary with this patch. Fixes: 53da8b5a99 ("virtio-net: Add support for USO features") Signed-off-by: Fiona Ebner --- hw/core/machine.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index c7ceb11501..95051b80db 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -50,15 +50,15 @@ GlobalProperty hw_compat_8_1[] = { { "ramfb", "x-migrate", "off" }, { "vfio-pci-nohotplug", "x-ramfb-migrate", "off" }, { "igb", "x-pcie-flr-init", "off" }, +{ TYPE_VIRTIO_NET, "host_uso", "off"}, +{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, +{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, }; const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1); GlobalProperty hw_compat_8_0[] = { { "migration", "multifd-flush-after-each-section", "on"}, { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" }, -{ TYPE_VIRTIO_NET, "host_uso", "off"}, -{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, -{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, }; const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); -- 2.39.2
Re: [PULL 04/17] virtio-net: Add support for USO features
Hi, Am 08.09.23 um 08:44 schrieb Jason Wang: > diff --git a/hw/core/machine.c b/hw/core/machine.c > index da699cf..230aab8 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -38,6 +38,7 @@ > #include "exec/confidential-guest-support.h" > #include "hw/virtio/virtio.h" > #include "hw/virtio/virtio-pci.h" > +#include "hw/virtio/virtio-net.h" > > GlobalProperty hw_compat_8_1[] = {}; > const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1); > @@ -45,6 +46,9 @@ const size_t hw_compat_8_1_len = > G_N_ELEMENTS(hw_compat_8_1); > GlobalProperty hw_compat_8_0[] = { > { "migration", "multifd-flush-after-each-section", "on"}, > { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" }, > +{ TYPE_VIRTIO_NET, "host_uso", "off"}, > +{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, > +{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, > }; > const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); > unfortunately, this broke backwards migration with machine version 8.1 from 8.2 and 9.0 binaries to a 8.1 binary: > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 > kvm: Failed to load virtio-net:virtio > kvm: error while loading state for instance 0x0 of device > ':00:12.0/virtio-net' > kvm: load of migration failed: Operation not permitted Since the series here only landed in 8.2, shouldn't these flags have been added to hw_compat_8_1[] instead? Attempting to fix it by moving the flags will break migration with machine version 8.1 between patched 9.0 and unpatched 9.0 however :( Is there anything that can be done or will it need to stay broken now? CC-ing the migration maintainers. Best Regards, Fiona
Re: [PATCH v4] hw/pflash: fix block write start
Am 16.05.24 um 10:46 schrieb Gerd Hoffmann: > Move the pflash_blk_write_start() call. We need the offset of the > first data write, not the offset for the setup (number-of-bytes) > write. Without this fix u-boot can do block writes to the first > flash block only. > > While being at it drop a leftover FIXME. > > Cc: qemu-sta...@nongnu.org > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2343 > Fixes: fcc79f2e0955 ("hw/pflash: implement update buffer for block writes") Just a minor thing I noticed: this is the commit in v8.1.5. The commit in v9.0.0 is 284a7ee2e2 ("hw/pflash: implement update buffer for block writes"). Best Regards, Fiona
Re: [PATCH v3 5/5] accel/tcg: Always call tcg_flush_jmp_cache() on reset
Hi, Am 03.05.24 um 14:34 schrieb Philippe Mathieu-Daudé: > In commit bb6cf6f016 ("accel/tcg: Factor tcg_cpu_reset_hold() out") > we unfortunately restricted the tcg_flush_jmp_cache() to system > emulation. Move it to the common tcg_exec_cpu_reset_hold() handler > so user emulation gets the jmp_cache initialized when threads > are created. > > Remove the NULL check in tcg_flush_jmp_cache() from commit 4e4fa6c12d > ("accel/tcg: Complete cpu initialization before registration") which > was a band-aid fix for incorrect commit bb6cf6f016. > AFAICT, commit 4e4fa6c12d was already present in v7.2.0, while commit bb6cf6f016 only later in v8.2.0. So is it really fine to remove the NULL check? Best Regards, Fiona
Re: [PATCH v4 0/3] Fix "virtio-gpu: fix scanout migration post-load"
Am 16.05.24 um 10:40 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > v2: > - use a manual version field test (instead of the more complex struct > variant) > > v3: > - introduce machine_check_version() > - drop the VMSD version, and use machine version field test > > v4: > - drop machine_check_version() approach > - property renamed to x-scanout-vmstate-version > > Marc-André Lureau (3): > migration: add "exists" info to load-state-field trace > migration: fix a typo > virtio-gpu: fix v2 migration > > include/hw/virtio/virtio-gpu.h | 1 + > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 24 > migration/vmstate.c| 7 --- > migration/trace-events | 2 +- > 5 files changed, 23 insertions(+), 12 deletions(-) > Reviewed-by: Fiona Ebner Tested-by: Fiona Ebner Tested the same things as for v1/v2, with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues 🙂 Best Regards, Fiona
Re: [PATCH v2 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Am 13.05.24 um 15:21 schrieb Marc-André Lureau: > > Indeed, it needs: > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c > index 5de90bb62f..3a88eb5e3a 100644 > --- a/hw/display/virtio-gpu.c > +++ b/hw/display/virtio-gpu.c > @@ -1201,7 +1201,7 @@ static const VMStateDescription > vmstate_virtio_gpu_scanout = { > > static const VMStateDescription vmstate_virtio_gpu_scanouts = { > .name = "virtio-gpu-scanouts", > -.version_id = 1, > +.version_id = 2, > > Thanks! With that on top: Tested-by: Fiona Ebner Tested with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues 🙂 Best Regards, Fiona
Re: [PATCH 1/2] copy-before-write: allow specifying minimum cluster size
Am 26.03.24 um 10:06 schrieb Markus Armbruster: >> @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, >> BdrvChild *target, >> >> GLOBAL_STATE_CODE(); >> >> -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); >> +if (min_cluster_size && !is_power_of_2(min_cluster_size)) { > > min_cluster_size is int64_t, is_power_of_2() takes uint64_t. Bad if > min_cluster_size is negative. Could this happen? > No, because it comes in as a uint32_t via the QAPI (the internal caller added by patch 2/2 from the backup code also gets the value via QAPI and there uint32_t is used too). ---snip--- >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 0a72c590a8..85c8f88f6e 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -4625,12 +4625,18 @@ >> # @on-cbw-error parameter will decide how this failure is handled. >> # Default 0. (Since 7.1) >> # >> +# @min-cluster-size: Minimum size of blocks used by copy-before-write >> +# operations. Has to be a power of 2. No effect if smaller than >> +# the maximum of the target's cluster size and 64 KiB. Default 0. >> +# (Since 9.0) >> +# >> # Since: 6.2 >> ## >> { 'struct': 'BlockdevOptionsCbw', >>'base': 'BlockdevOptionsGenericFormat', >>'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap', >> -'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } } >> +'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32', >> +'*min-cluster-size': 'uint32' } } > > Elsewhere in the schema, we use either 'int' or 'size' for cluster-size. > Why the difference? > The motivation was to disallow negative values up front and have it work with block_copy_calculate_cluster_size(), whose result is an int64_t. If I go with 'int', I'll have to add a check to disallow negative values. If I go with 'size', I'll have to add a check for to disallow too large values. Which approach should I go with? Best Regards, Fiona
Re: [PATCH v2 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Hi, Am 13.05.24 um 09:19 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > v2: > - use a manual version field test (instead of the more complex struct > variant) > Unfortunately, when creating a snapshot with machine type pc-i440fx-9.0 and trying to load it afterwards (both times with patches on top of current master), it'll fail with: > qemu-system-x86_64: virtio-gpu-scanouts: incoming version_id 2 is too new for > local version_id 1 > qemu-system-x86_64: Missing section footer for :00:02.0/virtio-gpu > qemu-system-x86_64: Error -22 while loading VM state Is there a bump to virtio-gpu-scanouts' version_id missing? Best Regards, Fiona
[PATCH v3 5/5] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target does not do COW and is is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the jobs granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- Changes in v3: * detect when the target does COW and do not error out in that case * treat ENOTSUP differently from other failure when querying the cluster size blockdev.c | 57 ++ tests/qemu-iotests/tests/mirror-bitmap | 6 +++ tests/qemu-iotests/tests/mirror-bitmap.out | 7 +++ 3 files changed, 70 insertions(+) diff --git a/blockdev.c b/blockdev.c index 4f72a72dc7..468974108e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp) blockdev_do_action(&action, errp); } +static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target, +BdrvDirtyBitmap *bitmap, +Error **errp) +{ +int ret; +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + +GLOBAL_STATE_CODE(); +GRAPH_RDLOCK_GUARD_MAINLOOP(); + +if (bdrv_backing_chain_next(target)) { +/* + * No need to worry about creating clusters with partial data when the + * target does COW. + */ +return 0; +} + +/* + * If there is no backing file on the target, we cannot rely on COW if our + * backup cluster size is smaller than the target cluster size. Even for + * targets with a backing file, try to avoid COW if possible. + */ +ret = bdrv_get_info(target, &bdi); +if (ret == -ENOTSUP) { +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to small enough then. + */ +return 0; +} else if (ret < 0) { +error_setg_errno(errp, -ret, +"Couldn't determine the cluster size of the target image, " +"which has no backing file"); +return ret; +} + +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of the " + "target image's cluster size %u and the target image has " + "no backing file", + bitmap_granularity, bdi.cluster_size); +return -EINVAL; +} + +return 0; +} + + /* Parameter check and block job starting for drive mirroring. * Caller should hold @device and @target's aio context (must be the same). **/ @@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) { +return; +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { return; } diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap index 37bbe0f241..e8cd482a19 100755 --- a/tests/qemu-iotests/tests/mirror-bitmap +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -584,6 +584,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity to non-COW target --\n") +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=GRANULARITY) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/mirror-bitm
[PATCH v3 0/5] mirror: allow specifying working bitmap
Changes from v2 (discussion here [2]): * Cluster size caveats only apply to non-COW diff image, adapt the cluster size check and documentation accordingly. * In the IO test, use backing files (rather than stand-alone diff images) in combination with copy-mode=write-blocking and larger cluster size for target images, to have a more realistic use-case and show that COW prevents ending up with cluster with partial data upon unaligned writes. * Create a separate patch for replacing is_none_mode with sync_mode in MirrorBlockJobx struct. * Disallow using read-only bitmap (cannot be used as working bitmap). * Require that bitmap is enabled at the start of the job. * Avoid IO test script potentially waiting on non-existent job when blockdev-mirror QMP command fails. * Fix pylint issues in IO test. * Rename IO test from sync-bitmap-mirror to mirror-bitmap. Changes from RFC/v1 (discussion here [0]): * Add patch to split BackupSyncMode and MirrorSyncMode. * Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly). * Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data. Use cases: * Possibility to resume a failed mirror later. * Possibility to only mirror deltas to a previously mirrored volume. * Possibility to (efficiently) mirror an drive that was previously mirrored via some external mechanism (e.g. ZFS replication). We are using the last one in production without any issues since about 4 years now. In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/5 actually contains yet another use case, namely doing incremental mirrors to qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. Initially, the qcow2 diff image targets were stand-alone and that suffers from an issue when 'write-blocking' mode is used. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The test thus uses a more realistic approach where the qcow2 targets have backing images and a check is added in patch 5/5 for the cluster size for non-COW targets. However, with e.g. NBD, the cluster size cannot be queried and prohibiting all bitmap mirrors to NBD targets just to prevent the highly specific edge case seems not worth it, so the limitation is rather documented and the check ignores cases where querying the target image's cluster size returns -ENOTSUP. [0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ [2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (3): qapi/block-core: avoid the re-use of MirrorSyncMode for backup block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c | 18 +- block/mirror.c | 101 +- block/monitor/block-hmp-cmds.c |2 +- block/replication.c|2 +- blockdev.c | 127 +- include/block/block_int-global-state.h |7 +- qapi/block-core.json | 64 +- tests/qemu-iotests/tests/mirror-bitmap | 603 tests/qemu-iotests/tests/mirror-bitmap.out | 3198 tests/unit/test-block-iothread.c |2 +- 10 files changed, 4055 insertions(+), 69 deletions(-) create mode 100755 tests/qemu-iotests/tests/mirror-bitmap create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out -- 2.39.2
[PATCH v3 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). A consequence is also tighter introspection: query-qmp-schema no longer reports drive-mirror and blockdev-mirror accepting @sync values they actually reject. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster Reviewed-by: Vladimir Sementsov-Ogievskiy --- Changes in v3: * add comment about introspection to commit message as suggested by Markus block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == &backup_job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 1bdce3b657..c0597039a5 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_c
[PATCH v3 4/5] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap, and the test was changed to use backing images and @sync-mode=write-blocking. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 adapt to changes to mirror bitmap interface rename test from '384' to 'mirror-bitmap' use backing files, copy-mode=write-blocking, larger cluster size] Signed-off-by: Fiona Ebner --- Changes in v3: * avoid script potentially waiting on non-existent job when blockdev-mirror QMP command fails by asserting that there is no error when none is expected. * fix pylint issues * rename test from sync-bitmap-mirror to mirror-bitmap * use backing files (rather than stand-alone diff images) in combination with copy-mode=write-blocking and larger cluster size for target images, to have a more realistic use-case and show that COW prevents ending up with cluster with partial data upon unaligned writes tests/qemu-iotests/tests/mirror-bitmap | 597 tests/qemu-iotests/tests/mirror-bitmap.out | 3191 2 files changed, 3788 insertions(+) create mode 100755 tests/qemu-iotests/tests/mirror-bitmap create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap new file mode 100755 index 00..37bbe0f241 --- /dev/null +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -0,0 +1,597 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 +IMAGE_CLUSTER_SIZE = 128 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-d
[PATCH v3 3/5] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a non-COW "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when mirror still got work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limitation in QAPI. This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner --- Changes in v3: * remove duplicate "use" in QAPI description * clarify that cluster size caveats only applies to non-COW diff image * split changing is_none_mode to sync_mode in job struct into a separate patch * use shorter sync_mode != none rather than sync_mode == top || full in an if condition * also disallow read-only bitmap (cannot be used as working bitmap) * require that bitmap is enabled at the start of the job block/mirror.c | 80 +- blockdev.c | 44 +++--- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 37 +++- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 143 insertions(+), 25 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index ca23d6ef65..d3d0698116 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). + */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -820,6 +829,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (!s->dirty_bitmap_is_local && s->dirty_bitmap) { +bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALT
[PATCH v3 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
It is more flexible and is done in preparation to support specifying a working bitmap for mirror jobs. In particular, this makes it possible to assert that @sync_mode=full when a bitmap is used. That assertion is just to be sure, of course the mirror QMP commands will be made to fail earlier with a clean error. Signed-off-by: Fiona Ebner --- New in v3. block/mirror.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c0597039a5..ca23d6ef65 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job) &error_abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(&s->common.job)) { goto immediate_exit; @@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name, bool is_mirror, MirrorCopyMode copy_mode, Error **errp) @@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job( s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; -s->is_none_mode = is_none_mode; +s->sync_mode = sync_mode; s->backing_mode = backing_mode; s->zero_target = zero_target; qatomic_set(&s->copy_mode, copy_mode); @@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState *bs, bool unmap, const char *filter_node_name, MirrorCopyMode copy_mode, Error **errp) { -bool is_none_mode; BlockDriverState *base; GLOBAL_STATE_CODE(); bdrv_graph_rdlock_main_loop(); -is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; bdrv_graph_rdunlock_main_loop(); mirror_start_job(job_id, bs, creation_flags, target, replaces, speed, granularity, buf_size, backing_mode, zero_target, on_source_error, on_target_error, unmap, NULL, NULL, - &mirror_job_driver, is_none_mode, base, false, + &mirror_job_driver, mode, base, false, filter_node_name, true, copy_mode, errp); } @@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs, job_id, bs, creation_flags, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN, false, on_error, on_error, true, cb, opaque, - &commit_active_job_driver, false, base, auto_complete, - filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND, - errp); + &commit_active_job_driver, MIRROR_SYNC_MODE_FULL, base, + auto_complete, filter_node_name, false, + MIRROR_COPY_MODE_BACKGROUND, errp); if (!job) { goto error_restore_flags; } -- 2.39.2
Re: [PATCH v2 2/4] mirror: allow specifying working bitmap
Am 07.05.24 um 14:15 schrieb Fiona Ebner: > Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy: >> On 07.03.24 16:47, Fiona Ebner wrote: >>> +# @bitmap: The name of a bitmap to use as a working bitmap for >>> +# sync=full mode. This argument must be not be present for other >>> +# sync modes and not at the same time as @granularity. The >>> +# bitmap's granularity is used as the job's granularity. When >>> +# the target is a diff image, i.e. one that should only contain >>> +# the delta and was not synced to previously, the target's >>> +# cluster size must not be larger than the bitmap's granularity. >> >> Could we check this? Like in block_copy_calculate_cluster_size(), we can >> check if target does COW, and if not, we can check that we are safe with >> granularity. >> > > The issue here is (in particular) present when the target does COW, i.e. > in qcow2 diff images, allocated clusters which end up with partial data, > when we don't have the right cluster size. Patch 4/4 adds the check for > the target's cluster size. > Sorry, no. What I said is wrong. It's just that the test does something very pathological and does not even use COW/backing files. All the mirror targets are separate diff images there. So yes, we can do the same as block_copy_calculate_cluster_size() and the issue only appears in the same edge cases as for backup where we can error out early. This also applies to copy-mode=write-blocking AFAICT. >>> +# For a diff image target, using copy-mode=write-blocking should >>> +# not be used, because unaligned writes will lead to allocated >>> +# clusters with partial data in the target image! >> >> Could this be checked? >> > > I don't think so. How should we know if the target already contains data > from a previous full sync or not? > > Those caveats when using diff images are unfortunate, and users should > be warned about them of course, but the main/expected use case for the > feature is to sync to the same target multiple times, so I'd hope the > cluster size check in patch 4/4 and mentioning the edge cases in the > documentation is enough here. >
Re: [PATCH 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Am 07.05.24 um 13:19 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > To resolve this, we need to propagate the `vmstate` `version_id` through the > nested structures. Additionally, we should tie specific machine version to a > corresponding `version_id` to maintain migration compatibility. > > `VMS_VSTRUCT` allows specifying the appropriate version of the nested > structure > to use. > > Marc-André Lureau (4): > migration: add "exists" info to load-state-field trace > include/migration: add VMSTATE_VSTRUCT_TEST_VARRAY_UINT32 > virtio-gpu: use a VMState variant for the scanout field > virtio-gpu: add x-vmstate-version > > include/hw/virtio/virtio-gpu.h | 1 + > include/migration/vmstate.h| 12 > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 28 +--- > migration/vmstate.c| 5 +++-- > migration/trace-events | 2 +- > 6 files changed, 39 insertions(+), 10 deletions(-) > Tested-by: Fiona Ebner Thank you for the fix! Tested with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues :) Best Regards, Fiona
Re: [PATCH v2 2/4] mirror: allow specifying working bitmap
Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy: > On 07.03.24 16:47, Fiona Ebner wrote: >> diff --git a/block/mirror.c b/block/mirror.c >> index 1609354db3..5c9a00b574 100644 >> --- a/block/mirror.c >> +++ b/block/mirror.c >> @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { >> BlockDriverState *to_replace; >> /* Used to block operations on the drive-mirror-replace target */ >> Error *replace_blocker; >> - bool is_none_mode; >> + MirrorSyncMode sync_mode; > > Could you please split this change to separate preparation patch? > Will do. >> + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, >> errp)) { > > Why allow read-only bitmaps? > Good catch! This is a left-over from an earlier version. Now that the bitmap shall be used as the working bitmap, it cannot be read-only. I'll change it to BDRV_BITMAP_DEFAULT in v3 of the series. >> +# @bitmap: The name of a bitmap to use as a working bitmap for >> +# sync=full mode. This argument must be not be present for other >> +# sync modes and not at the same time as @granularity. The >> +# bitmap's granularity is used as the job's granularity. When >> +# the target is a diff image, i.e. one that should only contain >> +# the delta and was not synced to previously, the target's >> +# cluster size must not be larger than the bitmap's granularity. > > Could we check this? Like in block_copy_calculate_cluster_size(), we can > check if target does COW, and if not, we can check that we are safe with > granularity. > The issue here is (in particular) present when the target does COW, i.e. in qcow2 diff images, allocated clusters which end up with partial data, when we don't have the right cluster size. Patch 4/4 adds the check for the target's cluster size. >> +# For a diff image target, using copy-mode=write-blocking should >> +# not be used, because unaligned writes will lead to allocated >> +# clusters with partial data in the target image! > > Could this be checked? > I don't think so. How should we know if the target already contains data from a previous full sync or not? Those caveats when using diff images are unfortunate, and users should be warned about them of course, but the main/expected use case for the feature is to sync to the same target multiple times, so I'd hope the cluster size check in patch 4/4 and mentioning the edge cases in the documentation is enough here. >> The bitmap >> +# will be enabled after the job finishes. (Since 9.0) > > Hmm. That looks correct. At least for the case, when bitmap is enabled > at that start of job. Suggest to require this. > It's true for any provided bitmap: it will be disabled when the mirror job starts, because we manually set it in bdrv_mirror_top_do_write() and then in mirror_exit_common(), the bitmap will be enabled. Okay, I'll require that it is enabled at the beginning. >> +# >> # @granularity: granularity of the dirty bitmap, default is 64K if the >> # image format doesn't have clusters, 4K if the clusters are >> # smaller than that, else the cluster size. Must be a power of 2 >> @@ -2548,6 +2578,10 @@ >> # disappear from the query list without user intervention. >> # Defaults to true. (Since 3.1) >> # >> +# Features: >> +# >> +# @unstable: Member @bitmap is experimental. >> +# >> # Since: 2.6 > > Y_MODE_BACKGROUND, >> &error_abort); > > [..] > > Generally looks good to me. > Thank you for the review!
Re: [PULL 5/5] virtio-gpu: fix scanout migration post-load
Am 12.03.24 um 15:02 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > The current post-loading code for scanout has a FIXME: it doesn't take > the resource region/rect into account. But there is more, when adding > blob migration support in commit f66767f75c9, I didn't realize that blob > resources could be used for scanouts. This situationn leads to a crash > during post-load, as they don't have an associated res->image. > > virtio_gpu_do_set_scanout() handle all cases, but requires the > associated virtio_gpu_framebuffer, which is currently not saved during > migration. > > Add a v2 of "virtio-gpu-one-scanout" with the framebuffer fields, so we > can restore blob scanouts, as well as fixing the existing FIXME. > > Signed-off-by: Marc-André Lureau > Reviewed-by: Sebastian Ott Hi, unfortunately, this broke migration from pre-9.0 to 9.0: > vmstate_load_state_field virtio-gpu:virtio-gpu > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.enable > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.conf.max_outputs > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.scanout > vmstate_load_state_field virtio-gpu-one-scanout:resource_id > vmstate_load_state_field virtio-gpu-one-scanout:width > vmstate_load_state_field virtio-gpu-one-scanout:height > vmstate_load_state_field virtio-gpu-one-scanout:x > vmstate_load_state_field virtio-gpu-one-scanout:y > vmstate_load_state_field virtio-gpu-one-scanout:cursor.resource_id > vmstate_load_state_field virtio-gpu-one-scanout:cursor.hot_x > vmstate_load_state_field virtio-gpu-one-scanout:cursor.hot_y > vmstate_load_state_field virtio-gpu-one-scanout:cursor.pos.x > vmstate_load_state_field virtio-gpu-one-scanout:cursor.pos.y > vmstate_load_state_field virtio-gpu-one-scanout:fb.format > vmstate_load_state_field virtio-gpu-one-scanout:fb.bytes_pp > vmstate_load_state_field virtio-gpu-one-scanout:fb.width > vmstate_load_state_field virtio-gpu-one-scanout:fb.height > vmstate_load_state_field virtio-gpu-one-scanout:fb.stride > vmstate_load_state_field virtio-gpu-one-scanout:fb.offset > qemu-system-x86_64: Missing section footer for :00:02.0/virtio-gpu > qemu-system-x86_64: Error -22 while loading VM state It wrongly tries to load the fb fields even though they should be guarded by version 2. Looking at it with GDB, in vmstate_load_state(), when we come to field->name == parent_obj.scanout, the > } else if (field->flags & VMS_STRUCT) { > ret = vmstate_load_state(f, field->vmsd, curr_elem, > field->vmsd->version_id); branch will be taken and suddenly we'll have a call to vmstate_load_state() for vmsd==vmstate_virtio_gpu_scanout with version_id==2 rather than version_id==1, because that is field->vmsd->version_id (i.e. the .version_id in VMStateDescription vmstate_virtio_gpu_scanout). Would it have been necessary to version the VMStateDescription vmstate_virtio_gpu_scanouts too using VMS_VSTRUCT (or am I misinterpreting the use case for that)? Best Regards, Fiona
Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
Am 29.04.24 um 16:36 schrieb Philippe Mathieu-Daudé: > Hi Fiona, > > On 29/4/24 16:19, Fiona Ebner wrote: > > Not everybody uses an email client that shows the patch content just > after the subject (your first lines wasn't making sense at first). > > Simply duplicating the subject helps to understand: > > Use uint64_t for timeout in nanoseconds ... > Oh, sorry. I'll try to remember that for the future. Should I re-send as a v2? Best Regards, Fiona
[PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
rather than the uint32_t for which the maximum is slightly more than 4 seconds and larger values would overflow. The QAPI interface allows specifying the number of seconds, so only values 0 to 4 are safe right now, other values lead to a much lower timeout than a user expects. The block_copy() call where this is used already takes a uint64_t for the timeout, so no change required there. Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option") Reported-by: Friedrich Weber Signed-off-by: Fiona Ebner --- block/copy-before-write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/copy-before-write.c b/block/copy-before-write.c index 8aba27a71d..026fa9840f 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -43,7 +43,7 @@ typedef struct BDRVCopyBeforeWriteState { BlockCopyState *bcs; BdrvChild *target; OnCbwError on_cbw_error; -uint32_t cbw_timeout_ns; +uint64_t cbw_timeout_ns; /* * @lock: protects access to @access_bitmap, @done_bitmap and -- 2.39.2
Re: [PATCH] Makefile: preserve --jobserver-auth argument when calling ninja
Am 02.04.24 um 10:17 schrieb Martin Hundebøll: > Qemu wraps its call to ninja in a Makefile. Since ninja, as opposed to > make, utilizes all CPU cores by default, the qemu Makefile translates > the absense of a `-jN` argument into `-j1`. This breaks jobserver > functionality, so update the -jN mangling to take the --jobserver-auth > argument into considerationa too. > > Signed-off-by: Martin Hundebøll > --- > Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Makefile b/Makefile > index 8f36990335..183756018f 100644 > --- a/Makefile > +++ b/Makefile > @@ -142,7 +142,7 @@ MAKE.k = $(findstring k,$(firstword $(filter-out > --%,$(MAKEFLAGS > MAKE.q = $(findstring q,$(firstword $(filter-out --%,$(MAKEFLAGS > MAKE.nq = $(if $(word 2, $(MAKE.n) $(MAKE.q)),nq) > NINJAFLAGS = $(if $V,-v) $(if $(MAKE.n), -n) $(if $(MAKE.k), -k0) \ > -$(filter-out -j, $(lastword -j1 $(filter -l% -j%, $(MAKEFLAGS \ > +$(or $(filter -l% -j%, $(MAKEFLAGS)), $(if $(filter > --jobserver-auth=%, $(MAKEFLAGS)),, -j1)) \ > -d keepdepfile > ninja-cmd-goals = $(or $(MAKECMDGOALS), all) > ninja-cmd-goals += $(foreach g, $(MAKECMDGOALS), $(.ninja-goals.$g)) Hi, unfortunately, this patch breaks build when specifying just '-j' as a make flag (i.e. without a number), because it will now end up being passed to ninja: > $ make -j > changing dir to build for make ""... > make[1]: Entering directory '/home/febner/repos/qemu/build' > ninja: fatal: invalid -j parameter Best Regards, Fiona
Re: [PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()
Am 27.03.24 um 20:27 schrieb Kevin Wolf: > Coverity complains that the check introduced in commit 3f934817 suggests > that qiov could be NULL and we dereference it before reaching the check. > In fact, all of the callers pass a non-NULL pointer, so just remove the > misleading check. > > Resolves: Coverity CID 1542668 > Signed-off-by: Kevin Wolf Reviewed-by: Fiona Ebner Thank you for the fix, Fiona
Question about block graph lock limitation with generated co-wrappers
Hi, we have a custom block driver downstream, which currently calls bdrv_get_info() (for its file child) in the bdrv_refresh_limits() callback. However, with graph locking, this doesn't work anymore. AFAICT, the reason is the following: The block driver has a backing file option. During initialization, in bdrv_set_backing_hd(), the graph lock is acquired exclusively. Then the bdrv_refresh_limits() callback is invoked. Now bdrv_get_info() is called, which is a generated co-wrapper. The bdrv_co_get_info_entry() function tries to acquire the graph lock for reading, sees that has_writer is true and so the coroutine will be put to wait, leading to a deadlock. For my specific case, I can move the bdrv_get_info() call to bdrv_open() as a workaround. But I wanted to ask if there is a way to make generated co-wrappers inside an exclusively locked section work? And if not, could we introduce/extend the annotations, so the compiler can catch this kind of issue, i.e. calling a generated co-wrapper while in an exclusively locked section? Best Regards, Fiona
[PATCH v3 0/4] fix two edge cases related to stream block jobs
Changes in v3: * Also deal with edge case in bdrv_next_cleanup(). Haven't run into an actual issue there, but at least the caller in migration/block.c uses bdrv_nb_sectors() which, while not a coroutine wrapper itself (it's written manually), may call bdrv_refresh_total_sectors(), which is a generated coroutine wrapper, so AFAIU, the block graph can change during that call. And even without that, it's just better to be more consistent with bdrv_next(). Changes in v2: * Ran into another issue while writing the IO test Stefan wanted to have (good call :)), so include a fix for that and add the test. I didn't notice during manual testing, because I hadn't used a scripted QMP 'quit', so there was no race. Fiona Ebner (3): block-backend: fix edge case in bdrv_next() where BDS associated to BB changes block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes iotests: add test for stream job with an unaligned prefetch read Stefan Reiter (1): block/io: accept NULL qiov in bdrv_pad_request block/block-backend.c | 18 ++-- block/io.c| 31 --- .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 4 files changed, 117 insertions(+), 23 deletions(-) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out -- 2.39.2
[PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
Same rationale as for commit "block-backend: fix edge case in bdrv_next() where BDS associated to BB changes". The block graph might change between the bdrv_next() call and the bdrv_next_cleanup() call, so it could be that the associated BDS is not the same that was referenced previously anymore. Instead, rely on bdrv_next() to set it->bs to the BDS it referenced and unreference that one in any case. Signed-off-by: Fiona Ebner --- New in v3. block/block-backend.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 28af1eb17a..db6f9b92a3 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); -if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { -if (it->blk) { -bdrv_unref(blk_bs(it->blk)); -blk_unref(it->blk); -} -} else { -bdrv_unref(it->bs); +bdrv_unref(it->bs); + +if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) { +blk_unref(it->blk); } bdrv_next_reset(it); -- 2.39.2
[PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
The old_bs variable in bdrv_next() is currently determined by looking at the old block backend. However, if the block graph changes before the next bdrv_next() call, it might be that the associated BDS is not the same that was referenced previously. In that case, the wrong BDS is unreferenced, leading to an assertion failure later: > bdrv_unref: Assertion `bs->refcnt > 0' failed. In particular, this can happen in the context of bdrv_flush_all(), when polling for bdrv_co_flush() in the generated co-wrapper leads to a graph change (for example with a stream block job [0]). A racy reproducer: > #!/bin/bash > rm -f /tmp/backing.qcow2 > rm -f /tmp/top.qcow2 > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node0" } } > {"execute": "quit"} > EOF [0]: > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...) > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., > errp=...) > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., > detach_subchain=..., errp=...) > #3 bdrv_drop_filter (bs=..., errp=...) > #4 bdrv_cor_filter_drop (cor_filter_bs=...) > #5 stream_prepare (job=...) > #6 job_prepare_locked (job=...) > #7 job_txn_apply_locked (fn=..., job=...) > #8 job_do_finalize_locked (job=...) > #9 job_exit (opaque=...) > #10 aio_bh_poll (ctx=...) > #11 aio_poll (ctx=..., blocking=...) > #12 bdrv_poll_co (s=...) > #13 bdrv_flush (bs=...) > #14 bdrv_flush_all () > #15 do_vm_stop (state=..., send_stop=...) > #16 vm_shutdown () Signed-off-by: Fiona Ebner --- No changes in v3. New in v2. block/block-backend.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 9c4de79e6b..28af1eb17a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); +old_bs = it->bs; + /* First, return all root nodes of BlockBackends. In order to avoid * returning a BDS twice when multiple BBs refer to it, we only return it * if the BB is the first one in the parent list of the BDS. */ if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { BlockBackend *old_blk = it->blk; -old_bs = old_blk ? blk_bs(old_blk) : NULL; - do { it->blk = blk_all_next(it->blk); bs = it->blk ? blk_bs(it->blk) : NULL; @@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) if (bs) { bdrv_ref(bs); bdrv_unref(old_bs); +it->bs = bs; return bs; } it->phase = BDRV_NEXT_MONITOR_OWNED; -} else { -old_bs = it->bs; } /* Then return the monitor-owned BDSes without a BB attached. Ignore all -- 2.39.2
[PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate(). If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure. In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example: > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768 > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": > "node0", "node-name": "node1" } } > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node1" } } > EOF Originally-by: Stefan Reiter Signed-off-by: Thomas Lamprecht [FE: do update bytes and offset in any case add reproducer to commit message] Signed-off-by: Fiona Ebner --- No changes in v3. No changes in v2. block/io.c | 31 +++ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/block/io.c b/block/io.c index 33150c0359..395bea3bac 100644 --- a/block/io.c +++ b/block/io.c @@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs, return 0; } -sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes, - &sliced_head, &sliced_tail, - &sliced_niov); +/* + * For prefetching in stream_populate(), no qiov is passed along, because + * only copy-on-read matters. + */ +if (qiov && *qiov) { +sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes, + &sliced_head, &sliced_tail, + &sliced_niov); -/* Guaranteed by bdrv_check_request32() */ -assert(*bytes <= SIZE_MAX); -ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov, - sliced_head, *bytes); -if (ret < 0) { -bdrv_padding_finalize(pad); -return ret; +/* Guaranteed by bdrv_check_request32() */ +assert(*bytes <= SIZE_MAX); +ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov, + sliced_head, *bytes); +if (ret < 0) { +bdrv_padding_finalize(pad); +return ret; +} +*qiov = &pad->local_qiov; +*qiov_offset = 0; } + *bytes += pad->head + pad->tail; *offset -= pad->head; -*qiov = &pad->local_qiov; -*qiov_offset = 0; if (padded) { *padded = true; } -- 2.39.2
[PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read
Previously, bdrv_pad_request() could not deal with a NULL qiov when a read needed to be aligned. During prefetch, a stream job will pass a NULL qiov. Add a test case to cover this scenario. By accident, also covers a previous race during shutdown, where block graph changes during iteration in bdrv_flush_all() could lead to unreferencing the wrong block driver state and an assertion failure later. Signed-off-by: Fiona Ebner --- No changes in v3. New in v2. .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 2 files changed, 91 insertions(+) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch new file mode 100755 index 00..546db1d369 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +# group: rw quick +# +# Test what happens when a stream job does an unaligned prefetch read +# which requires padding while having a NULL qiov. +# +# Copyright (C) Proxmox Server Solutions GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +import os +import iotests +from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase + +image_size = 1 * 1024 * 1024 +cluster_size = 64 * 1024 +base = os.path.join(iotests.test_dir, 'base.img') +top = os.path.join(iotests.test_dir, 'top.img') + +class TestStreamUnalignedPrefetch(QMPTestCase): +def setUp(self) -> None: +""" +Create two images: +- base image {base} with {cluster_size // 2} bytes allocated +- top image {top} without any data allocated and coarser + cluster size + +Attach a compress filter for the top image, because that +requires that the request alignment is the top image's cluster +size. +""" +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size // 2), +base, str(image_size)) +qemu_io('-c', f'write 0 {cluster_size // 2}', base) +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size), +top, str(image_size)) + +self.vm = iotests.VM() +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': imgfmt, +'node-name': 'base', +'file': { +'driver': 'file', +'filename': base +} +})) +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': 'compress', +'node-name': 'compress-top', +'file': { +'driver': imgfmt, +'node-name': 'top', +'file': { +'driver': 'file', +'filename': top +}, +'backing': 'base' +} +})) +self.vm.launch() + +def tearDown(self) -> None: +self.vm.shutdown() +os.remove(top) +os.remove(base) + +def test_stream_unaligned_prefetch(self) -> None: +self.vm.cmd('block-stream', job_id='stream', device='compress-top') + + +if __name__ == '__main__': +iotests.main(supported_fmts=['qcow2'], supported_protocols=['file']) diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out new file mode 100644 index 00..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out @@ -0,0 +1,5 @@ +. +-- +Ran 1 tests + +OK -- 2.39.2
[PATCH v2 3/3] iotests: add test for stream job with an unaligned prefetch read
Previously, bdrv_pad_request() could not deal with a NULL qiov when a read needed to be aligned. During prefetch, a stream job will pass a NULL qiov. Add a test case to cover this scenario. By accident, also covers a previous race during shutdown, where block graph changes during iteration in bdrv_flush_all() could lead to unreferencing the wrong block driver state and an assertion failure later. Signed-off-by: Fiona Ebner --- New in v2. .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 2 files changed, 91 insertions(+) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch new file mode 100755 index 00..546db1d369 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +# group: rw quick +# +# Test what happens when a stream job does an unaligned prefetch read +# which requires padding while having a NULL qiov. +# +# Copyright (C) Proxmox Server Solutions GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +import os +import iotests +from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase + +image_size = 1 * 1024 * 1024 +cluster_size = 64 * 1024 +base = os.path.join(iotests.test_dir, 'base.img') +top = os.path.join(iotests.test_dir, 'top.img') + +class TestStreamUnalignedPrefetch(QMPTestCase): +def setUp(self) -> None: +""" +Create two images: +- base image {base} with {cluster_size // 2} bytes allocated +- top image {top} without any data allocated and coarser + cluster size + +Attach a compress filter for the top image, because that +requires that the request alignment is the top image's cluster +size. +""" +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size // 2), +base, str(image_size)) +qemu_io('-c', f'write 0 {cluster_size // 2}', base) +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size), +top, str(image_size)) + +self.vm = iotests.VM() +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': imgfmt, +'node-name': 'base', +'file': { +'driver': 'file', +'filename': base +} +})) +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': 'compress', +'node-name': 'compress-top', +'file': { +'driver': imgfmt, +'node-name': 'top', +'file': { +'driver': 'file', +'filename': top +}, +'backing': 'base' +} +})) +self.vm.launch() + +def tearDown(self) -> None: +self.vm.shutdown() +os.remove(top) +os.remove(base) + +def test_stream_unaligned_prefetch(self) -> None: +self.vm.cmd('block-stream', job_id='stream', device='compress-top') + + +if __name__ == '__main__': +iotests.main(supported_fmts=['qcow2'], supported_protocols=['file']) diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out new file mode 100644 index 00..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out @@ -0,0 +1,5 @@ +. +-- +Ran 1 tests + +OK -- 2.39.2
[PATCH v2 0/3] fix two edge cases related to stream block jobs
Changes in v2: * Ran into another issue while writing the IO test Stefan wanted to have (good call :)), so include a fix for that and add the test. I didn't notice during manual testing, because I hadn't used a scripted QMP 'quit', so there was no race. Fiona Ebner (2): block-backend: fix edge case in bdrv_next() where BDS associated to BB changes iotests: add test for stream job with an unaligned prefetch read Stefan Reiter (1): block/io: accept NULL qiov in bdrv_pad_request block/block-backend.c | 7 +- block/io.c| 31 --- .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 4 files changed, 113 insertions(+), 16 deletions(-) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out -- 2.39.2
[PATCH v2 1/3] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate(). If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure. In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example: > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768 > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": > "node0", "node-name": "node1" } } > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node1" } } > EOF Originally-by: Stefan Reiter Signed-off-by: Thomas Lamprecht [FE: do update bytes and offset in any case add reproducer to commit message] Signed-off-by: Fiona Ebner --- No changes in v2. block/io.c | 31 +++ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/block/io.c b/block/io.c index 33150c0359..395bea3bac 100644 --- a/block/io.c +++ b/block/io.c @@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs, return 0; } -sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes, - &sliced_head, &sliced_tail, - &sliced_niov); +/* + * For prefetching in stream_populate(), no qiov is passed along, because + * only copy-on-read matters. + */ +if (qiov && *qiov) { +sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes, + &sliced_head, &sliced_tail, + &sliced_niov); -/* Guaranteed by bdrv_check_request32() */ -assert(*bytes <= SIZE_MAX); -ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov, - sliced_head, *bytes); -if (ret < 0) { -bdrv_padding_finalize(pad); -return ret; +/* Guaranteed by bdrv_check_request32() */ +assert(*bytes <= SIZE_MAX); +ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov, + sliced_head, *bytes); +if (ret < 0) { +bdrv_padding_finalize(pad); +return ret; +} +*qiov = &pad->local_qiov; +*qiov_offset = 0; } + *bytes += pad->head + pad->tail; *offset -= pad->head; -*qiov = &pad->local_qiov; -*qiov_offset = 0; if (padded) { *padded = true; } -- 2.39.2
[PATCH v2 2/3] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
The old_bs variable in bdrv_next() is currently determined by looking at the old block backend. However, if the block graph changes before the next bdrv_next() call, it might be that the associated BDS is not the same that was referenced previously. In that case, the wrong BDS is unreferenced, leading to an assertion failure later: > bdrv_unref: Assertion `bs->refcnt > 0' failed. In particular, this can happen in the context of bdrv_flush_all(), when polling for bdrv_co_flush() in the generated co-wrapper leads to a graph change (for example with a stream block job [0]). A racy reproducer: > #!/bin/bash > rm -f /tmp/backing.qcow2 > rm -f /tmp/top.qcow2 > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node0" } } > {"execute": "quit"} > EOF [0]: > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...) > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., > errp=...) > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., > detach_subchain=..., errp=...) > #3 bdrv_drop_filter (bs=..., errp=...) > #4 bdrv_cor_filter_drop (cor_filter_bs=...) > #5 stream_prepare (job=...) > #6 job_prepare_locked (job=...) > #7 job_txn_apply_locked (fn=..., job=...) > #8 job_do_finalize_locked (job=...) > #9 job_exit (opaque=...) > #10 aio_bh_poll (ctx=...) > #11 aio_poll (ctx=..., blocking=...) > #12 bdrv_poll_co (s=...) > #13 bdrv_flush (bs=...) > #14 bdrv_flush_all () > #15 do_vm_stop (state=..., send_stop=...) > #16 vm_shutdown () Signed-off-by: Fiona Ebner --- Not sure if this is the correct fix, or if the call site should rather be adapted somehow? New in v2. block/block-backend.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 9c4de79e6b..28af1eb17a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); +old_bs = it->bs; + /* First, return all root nodes of BlockBackends. In order to avoid * returning a BDS twice when multiple BBs refer to it, we only return it * if the BB is the first one in the parent list of the BDS. */ if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { BlockBackend *old_blk = it->blk; -old_bs = old_blk ? blk_bs(old_blk) : NULL; - do { it->blk = blk_all_next(it->blk); bs = it->blk ? blk_bs(it->blk) : NULL; @@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) if (bs) { bdrv_ref(bs); bdrv_unref(old_bs); +it->bs = bs; return bs; } it->phase = BDRV_NEXT_MONITOR_OWNED; -} else { -old_bs = it->bs; } /* Then return the monitor-owned BDSes without a BB attached. Ignore all -- 2.39.2
[PATCH] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate(). If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure. In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example: > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768 > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": > "node0", "node-name": "node1" } } > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node1" } } > EOF Originally-by: Stefan Reiter Signed-off-by: Thomas Lamprecht [FE: do update bytes and offset in any case add reproducer to commit message] Signed-off-by: Fiona Ebner --- block/io.c | 31 +++ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/block/io.c b/block/io.c index 33150c0359..395bea3bac 100644 --- a/block/io.c +++ b/block/io.c @@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs, return 0; } -sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes, - &sliced_head, &sliced_tail, - &sliced_niov); +/* + * For prefetching in stream_populate(), no qiov is passed along, because + * only copy-on-read matters. + */ +if (qiov && *qiov) { +sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes, + &sliced_head, &sliced_tail, + &sliced_niov); -/* Guaranteed by bdrv_check_request32() */ -assert(*bytes <= SIZE_MAX); -ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov, - sliced_head, *bytes); -if (ret < 0) { -bdrv_padding_finalize(pad); -return ret; +/* Guaranteed by bdrv_check_request32() */ +assert(*bytes <= SIZE_MAX); +ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov, + sliced_head, *bytes); +if (ret < 0) { +bdrv_padding_finalize(pad); +return ret; +} +*qiov = &pad->local_qiov; +*qiov_offset = 0; } + *bytes += pad->head + pad->tail; *offset -= pad->head; -*qiov = &pad->local_qiov; -*qiov_offset = 0; if (padded) { *padded = true; } -- 2.39.2
[PATCH 2/2] backup: add minimum cluster size to performance options
Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Backup/block-copy will use at least this granularity for copy operations and in particular, discard requests to the backup source will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- block/backup.c| 2 +- block/copy-before-write.c | 2 ++ block/copy-before-write.h | 1 + blockdev.c| 3 +++ qapi/block-core.json | 9 +++-- 5 files changed, 14 insertions(+), 3 deletions(-) diff --git a/block/backup.c b/block/backup.c index 3dd2e229d2..a1292c01ec 100644 --- a/block/backup.c +++ b/block/backup.c @@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, } cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source, - &bcs, errp); + perf->min_cluster_size, &bcs, errp); if (!cbw) { goto error; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index f9896c6c1e..55a9272485 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -545,6 +545,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + int64_t min_cluster_size, BlockCopyState **bcs, Error **errp) { @@ -563,6 +564,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, } qdict_put_str(opts, "file", bdrv_get_node_name(source)); qdict_put_str(opts, "target", bdrv_get_node_name(target)); +qdict_put_int(opts, "min-cluster-size", min_cluster_size); top = bdrv_insert_node(source, opts, flags, errp); if (!top) { diff --git a/block/copy-before-write.h b/block/copy-before-write.h index 01af0cd3c4..dc6cafe7fa 100644 --- a/block/copy-before-write.h +++ b/block/copy-before-write.h @@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + int64_t min_cluster_size, BlockCopyState **bcs, Error **errp); void bdrv_cbw_drop(BlockDriverState *bs); diff --git a/blockdev.c b/blockdev.c index daceb50460..8e6bdbc94a 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2653,6 +2653,9 @@ static BlockJob *do_backup_common(BackupCommon *backup, if (backup->x_perf->has_max_chunk) { perf.max_chunk = backup->x_perf->max_chunk; } +if (backup->x_perf->has_min_cluster_size) { +perf.min_cluster_size = backup->x_perf->min_cluster_size; +} } if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) || diff --git a/qapi/block-core.json b/qapi/block-core.json index 85c8f88f6e..ba0836892f 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1551,11 +1551,16 @@ # it should not be less than job cluster size which is calculated # as maximum of target image cluster size and 64k. Default 0. # +# @min-cluster-size: Minimum size of blocks used by copy-before-write +# and background copy operations. Has to be a power of 2. No +# effect if smaller than the maximum of the target's cluster size +# and 64 KiB. Default 0. (Since 9.0) +# # Since: 6.0 ## { 'struct': 'BackupPerf', - 'data': { '*use-copy-range': 'bool', -'*max-workers': 'int', '*max-chunk': 'int64' } } + 'data': { '*use-copy-range': 'bool', '*max-workers': 'int', +'*max-chunk': 'int64', '*min-cluster-size': 'uint32' } } ## # @BackupCommon: -- 2.39.2
[PATCH 0/2] backup: allow specifying minimum cluster size
Based-on: https://lore.kernel.org/qemu-devel/20240228141501.455989-1-vsement...@yandex-team.ru/ Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Backup/block-copy will use at least this granularity for copy operations and in particular, discard requests to the backup source will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. Fiona Ebner (2): copy-before-write: allow specifying minimum cluster size backup: add minimum cluster size to performance options block/backup.c | 2 +- block/block-copy.c | 17 + block/copy-before-write.c | 5 - block/copy-before-write.h | 1 + blockdev.c | 3 +++ include/block/block-copy.h | 1 + qapi/block-core.json | 17 ++--- 7 files changed, 37 insertions(+), 9 deletions(-) -- 2.39.2
[PATCH 1/2] copy-before-write: allow specifying minimum cluster size
Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Copy-before-write operations will use at least this granularity and in particular, discard requests to the source node will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. The QAPI uses uint32 so the value will be non-negative, but still fit into a uint64_t. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- block/block-copy.c | 17 + block/copy-before-write.c | 3 ++- include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 23 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..adb1cbb440 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, "used. If the actual block size of the target exceeds " "this default, the backup may be unusable", BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. */ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, + MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size)); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + int64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min_cluster_size && !is_power_of_2(min_cluster_size)) { +error_setg(errp, "min-cluster-size needs to be a power of 2"); +return NULL; +} + +cluster_size = block_copy_calculate_cluster_size(target->bs, + min_cluster_size, errp); if (cluster_size < 0) { return NULL; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index dac57481c5..f9896c6c1e 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -476,7 +476,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags, s->discard_source = flags & BDRV_O_CBW_DISCARD_SOURCE; s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap, - flags & BDRV_O_CBW_DISCARD_SOURCE, errp); + flags & BDRV_O_CBW_DISCARD_SOURCE, + opts->min_cluster_size, errp); if (!s->bcs) { error_prepend(errp, "Cannot create block-copy-state: "); return -EINVAL; diff --git a/include/block/block-copy.h b/include/block/block-copy.h index bdc703bacd..77857c6c68 100644 --- a/include/block/block-copy.h +++ b/include/block/block-copy.h @@ -28,6 +28,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + int64_t min_cluster_size, Error **errp); /* Function should be called prior any actual copy request */ diff --git a/qapi/block-core.json b/qapi/block-core.json index 0a72c590a8..85c8f88f6e 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4625,12 +4625,18 @@ # @on-cbw-err
Re: [PATCH v3 0/5] backup: discard-source parameter
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > Hi all! The main patch is 04, please look at it for description and > diagram. > > v3: > 02: new patch > 04: take WRITE permission only when discard_source is required > > Vladimir Sementsov-Ogievskiy (5): > block/copy-before-write: fix permission > block/copy-before-write: support unligned snapshot-discard > block/copy-before-write: create block_copy bitmap in filter node > qapi: blockdev-backup: add discard-source parameter > iotests: add backup-discard-source > > block/backup.c| 5 +- > block/block-copy.c| 12 +- > block/copy-before-write.c | 39 - > block/copy-before-write.h | 1 + > block/replication.c | 4 +- > blockdev.c| 2 +- > include/block/block-common.h | 2 + > include/block/block-copy.h| 2 + > include/block/block_int-global-state.h| 2 +- > qapi/block-core.json | 4 + > tests/qemu-iotests/257.out| 112 ++--- > .../qemu-iotests/tests/backup-discard-source | 151 ++ > .../tests/backup-discard-source.out | 5 + > 13 files changed, 271 insertions(+), 70 deletions(-) > create mode 100755 tests/qemu-iotests/tests/backup-discard-source > create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out > Tested-by: Fiona Ebner
Re: [PATCH v3 5/5] iotests: add backup-discard-source
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy: > Add test for a new backup option: discard-source. > > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > .../qemu-iotests/tests/backup-discard-source | 151 ++ > .../tests/backup-discard-source.out | 5 + > 2 files changed, 156 insertions(+) > create mode 100755 tests/qemu-iotests/tests/backup-discard-source > create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out > > diff --git a/tests/qemu-iotests/tests/backup-discard-source > b/tests/qemu-iotests/tests/backup-discard-source > new file mode 100755 > index 00..8a88b0f6c4 > --- /dev/null > +++ b/tests/qemu-iotests/tests/backup-discard-source > @@ -0,0 +1,151 @@ > +#!/usr/bin/env python3 > +# > +# Test removing persistent bitmap from backing > +# > +# Copyright (c) 2022 Virtuozzo International GmbH. > +# Title and copyright year are wrong. Apart from that: Reviewed-by: Fiona Ebner
Re: [PATCH v3 2/5] block/copy-before-write: support unligned snapshot-discard
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > First thing that crashes on unligned access here is > bdrv_reset_dirty_bitmap(). Correct way is to align-down the > snapshot-discard request. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v3 4/5] qapi: blockdev-backup: add discard-source parameter
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy: > Add a parameter that enables discard-after-copy. That is mostly useful > in "push backup with fleecing" scheme, when source is snapshot-access > format driver node, based on copy-before-write filter snapshot-access > API: > > [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >|| >| root | file >vv > [copy-before-write] >| | >| file| target >v v > [active disk] [temp.img] > > In this case discard-after-copy does two things: > > - discard data in temp.img to save disk space > - avoid further copy-before-write operation in discarded area > > Note that we have to declare WRITE permission on source in > copy-before-write filter, for discard to work. Still we can't take it > unconditionally, as it will break normal backup from RO source. So, we > have to add a parameter and pass it thorough bdrv_open flags. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v3 3/5] block/copy-before-write: create block_copy bitmap in filter node
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > Currently block_copy creates copy_bitmap in source node. But that is in > bad relation with .independent_close=true of copy-before-write filter: > source node may be detached and removed before .bdrv_close() handler > called, which should call block_copy_state_free(), which in turn should > remove copy_bitmap. > > That's all not ideal: it would be better if internal bitmap of > block-copy object is not attached to any node. But that is not possible > now. > > The simplest solution is just create copy_bitmap in filter node, where > anyway two other bitmaps are created. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v2 00/10] mirror: allow switching from background to active mode
Am 07.03.24 um 20:42 schrieb Vladimir Sementsov-Ogievskiy: > On 04.03.24 14:09, Peter Krempa wrote: >> On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote: >>> Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben: On 03.11.23 18:56, Markus Armbruster wrote: > Kevin Wolf writes: >> >> [...] >> > Is the job abstraction a failure? > > We have > > block-job- command since job- command since > - > block-job-set-speed 1.1 > block-job-cancel 1.1 job-cancel 3.0 > block-job-pause 1.3 job-pause 3.0 > block-job-resume 1.3 job-resume 3.0 > block-job-complete 1.3 job-complete 3.0 > block-job-dismiss 2.12 job-dismiss 3.0 > block-job-finalize 2.12 job-finalize 3.0 > block-job-change 8.2 > query-block-jobs 1.1 query-jobs >> >> [...] >> >>> I consider these strictly optional. We don't really have strong reasons >>> to deprecate these commands (they are just thin wrappers), and I think >>> libvirt still uses block-job-* in some places. >> >> Libvirt uses 'block-job-cancel' because it has different semantics from >> 'job-cancel' which libvirt documented as the behaviour of the API that >> uses it. (Semantics regarding the expectation of what is written to the >> destination node at the point when the job is cancelled). >> > > That's the following semantics: > > # Note that if you issue 'block-job-cancel' after 'drive-mirror' has > # indicated (via the event BLOCK_JOB_READY) that the source and > # destination are synchronized, then the event triggered by this > # command changes to BLOCK_JOB_COMPLETED, to indicate that the > # mirroring has ended and the destination now has a point-in-time copy > # tied to the time of the cancellation. > > Hmm. Looking at this, it looks for me, that should probably a > 'block-job-complete" command (as leading to BLOCK_JOB_COMPLETED). > > Actually, what is the difference between block-job-complete and > block-job-cancel(force=false) for mirror in ready state? > > I only see the following differencies: > > 1. block-job-complete documents that it completes the job > synchronously.. But looking at mirror code I see it just set > s->should_complete = true, which will be then handled asynchronously.. > So I doubt that documentation is correct. > > 2. block-job-complete will trigger final graph changes. block-job-cancel > will not. > > Is [2] really useful? Seems yes: in case of some failure before starting > migration target, we'd like to continue executing source. So, no reason > to break block-graph in source, better keep it unchanged. > FWIW, we also rely on these special semantics. We allow cloning the disk state of a running guest using drive-mirror (and before finishing, fsfreeze in the guest for consistency). We cannot use block-job-complete there, because we do not want to switch the source's drive. > But I think, such behavior better be setup by mirror-job start > parameter, rather then by special option for cancel (or even compelete) > command, useful only for mirror. > > So, what about the following substitution for block-job-cancel: > > block-job-cancel(force=true) --> use job-cancel > > block-job-cancel(force=false) for backup, stream, commit --> use > job-cancel > > block-job-cancel(force=false) for mirror in ready mode --> > > instead, use block-job-complete. If you don't need final graph > modification which mirror job normally does, use graph-change=false > parameter for blockdev-mirror command. > But yes, having a graph-change parameter would work for us too :) Best Regards, Fiona
[PATCH v2 4/4] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the jobs granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- blockdev.c| 19 +++ tests/qemu-iotests/tests/bitmap-sync-mirror | 6 ++ .../qemu-iotests/tests/bitmap-sync-mirror.out | 7 +++ 3 files changed, 32 insertions(+) diff --git a/blockdev.c b/blockdev.c index c76eb97a4c..968d44cd3b 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2847,6 +2847,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, } if (bitmap_name) { +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + if (sync != MIRROR_SYNC_MODE_FULL) { error_setg(errp, "Sync mode '%s' not supported with bitmap.", MirrorSyncMode_str(sync)); @@ -2863,6 +2866,22 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to small enough then. + */ +if (bdrv_get_info(target, &bdi) >= 0) { +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of " + "the target image's cluster size %u", + bitmap_granularity, bdi.cluster_size); +return; +} +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) { return; } diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror index 898f1f4ba4..cbd5cc99cc 100755 --- a/tests/qemu-iotests/tests/bitmap-sync-mirror +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -552,6 +552,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity --\n".format(sync_mode)) +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=(GRANULARITY // 2)) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror.out b/tests/qemu-iotests/tests/bitmap-sync-mirror.out index c05b4788c6..d40ea7d689 100644 --- a/tests/qemu-iotests/tests/bitmap-sync-mirror.out +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror.out @@ -2937,3 +2937,10 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fmirror3" ==> Identical, OK! {"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap0", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "none", "target": "mirror_target"}} {"error": {"class": "GenericError", "desc": "Sync mode 'none' not supported with bitmap."}} +-- Test bitmap with too small granularity -- + +{"execute": "block-dirty-bitmap-add", "arguments": {"granularity": 32768, "name": "bitmap-small", "node": "drive0"}} +{"return": {}} +{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap-small", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "full", "target": "mirror_target"}} +{"error": {"class": "GenericError", "desc": "Bitmap granularity 32768 is not a multiple of the target image's cluster size 65536"}} + -- 2.39.2
[PATCH v2 3/4] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 adapt to changes to mirror bitmap interface rename test from '384' to 'bitmap-sync-mirror'] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/bitmap-sync-mirror | 565 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2939 + 2 files changed, 3504 insertions(+) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror new file mode 100755 index 00..898f1f4ba4 --- /dev/null +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -0,0 +1,565 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import math +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. +Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits
[PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- I felt like keeping the "Since: X.Y" as before makes the most sense as to not lose history. Or is it necessary to change this for BackupSyncMode (and its members) since it got a new name? block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == &backup_job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 5145eb53e1..1609354db3 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2011,13 +2011,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index d954bec6f1..9633d000
[PATCH v2 2/4] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a fresh "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when mirror still got work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limiation in QAPI. This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner --- block/mirror.c | 95 -- blockdev.c | 39 +-- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 37 +- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 146 insertions(+), 32 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 1609354db3..5c9a00b574 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). + */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -687,7 +692,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -718,7 +727,8 @@ static int mirror_exit_common(Job *job) &error_abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -815,6 +825,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBloc
[PATCH v2 0/4] mirror: allow specifying working bitmap
Changes from RFC/v1 (discussion here [0]): * Add patch to split BackupSyncMode and MirrorSyncMode. * Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly). * Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data. Use cases: * Possibility to resume a failed mirror later. * Possibility to only mirror deltas to a previously mirrored volume. * Possibility to (efficiently) mirror an drive that was previously mirrored via some external mechanism (e.g. ZFS replication). We are using the last one in production without any issues since about 4 years now. In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/4 actually contains yet another use case, namely doing incremental mirrors to stand-alone qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. But this exposed an issue with the diff image approach. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. For the latter case, patch 4/4 adds a check. For the former, the limitation is documented (I'd expect this to be a niche use case in practice). [0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (2): qapi/block-core: avoid the re-use of MirrorSyncMode for backup blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c| 18 +- block/mirror.c| 102 +- block/monitor/block-hmp-cmds.c|2 +- block/replication.c |2 +- blockdev.c| 84 +- include/block/block_int-global-state.h|7 +- qapi/block-core.json | 64 +- tests/qemu-iotests/tests/bitmap-sync-mirror | 571 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2946 + tests/unit/test-block-iothread.c |2 +- 10 files changed, 3729 insertions(+), 69 deletions(-) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out -- 2.39.2
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 13:47 schrieb Fiona Ebner: > Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy: >> On 29.02.24 13:11, Fiona Ebner wrote: >>> >>> The iotest creates a new target image for each incremental sync which >>> only records the diff relative to the previous mirror and those diff >>> images are later rebased onto each other to get the full picture. >>> >>> Thus, it can be that a previous mirror job (not just background process >>> or previous write) already copied a cluster, and in particular, copied >>> it to a different target! >> >> Aha understand. >> >> For simplicity, let's consider case, when source "cluster size" = "job >> cluster size" = "bitmap granularity" = "target cluster size". >> >> Which types of clusters we should consider, when we want to handle guest >> write? >> >> 1. Clusters, that should be copied by background process >> >> These are dirty clusters from user-given bitmap, or if we do a full-disk >> mirror, all clusters, not yet copied by background process. >> >> For such clusters we simply ignore the unaligned write. We can even >> ignore the aligned write too: less disturbing the guest by delays. >> > > Since do_sync_target_write() currently doesn't ignore aligned writes, I > wouldn't change it. Of course they can count towards the "done_bitmap" > you propose below. > >> 2. Clusters, already copied by background process during this mirror job >> and not dirtied by guest since this time. >> >> For such clusters we are safe to do unaligned write, as target cluster >> must be allocated. >> > > Right. > >> 3. Clusters, not marked initially by dirty bitmap. >> >> What to do with them? We can't do unaligned write. I see two variants: >> >> - do additional read from source, to fill the whole cluster, which seems >> a bit too heavy >> > > Yes, I'd rather only do that as a last resort. > >> - just mark the cluster as dirty for background job. So we behave like >> in "background" mode. But why not? The maximum count of such "hacks" is >> limited to number of "clear" clusters at start of mirror job, which >> means that we don't seriously affect the convergence. Mirror is >> guaranteed to converge anyway. And the whole sense of "write-blocking" >> mode is to have a guaranteed convergence. What do you think? >> > > It could lead to a lot of flips between job->actively_synced == true and > == false. AFAIU, currently, we only switch back from true to false when > an error happens. While I don't see a concrete issue with it, at least > it might be unexpected to users, so it better be documented. > > I'll try going with this approach, thanks! > These flips are actually a problem. When using live-migration with disk mirroring, it's good that an actively synced image stays actively synced. Otherwise, migration could finish at an inconvenient time and try to inactivate the block device while mirror still got something to do which would lead to an assertion failure [0]. The IO test added by this series is what uses the possibility to sync to "diff images" which contain only the delta. In production, we are only syncing to a previously mirrored target image. Non-aligned writes are not an issue later like with a diff image. (Even if the initial mirroring happened via ZFS replication outside of QEMU). So copy-mode=write-blocking would work fine for our use case, but if I go with the "mark clusters for unaligned writes dirty"-approach, it would not work fine anymore. Should I rather just document the limitation for the combination "target is a diff image" and copy-mode=write-blocking? I'd still add the check for the granularity and target cluster size. While also only needed for diff images, it would allow using background mode safely for those. Best Regards, Fiona [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 16:46 schrieb Vladimir Sementsov-Ogievskiy: > On 01.03.24 18:14, Fiona Ebner wrote: >> Am 01.03.24 um 16:02 schrieb Vladimir Sementsov-Ogievskiy: >>>>> About documentation: actually, I never liked that we use for backup >>>>> job >>>>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two >>>>> values supported only by backup. >>>>> >>>>> I'm also unsure how mode=full&bitmap=some_bitmap differs from >>>>> mode=bitmap&bitmap=some_bitmap.. >>>>> >>>> >>>> With the current patches, it was an error to specify @bitmap for other >>>> modes than 'incremental' and 'bitmap'. >>> >>> Current documentation says: >>> # @bitmap: The name of a dirty bitmap to use. Must be present if >>> sync >>> # is "bitmap" or "incremental". Can be present if sync is "full" >>> # or "top". Must not be present otherwise. >>> # (Since 2.4 (drive-backup), 3.1 (blockdev-backup)) >>> >>> >> >> This is for backup. The documentation (and behavior) for @bitmap added >> by these patches for mirror is different ;) > > I meant backup in "I'm also unsure", just as one more point not consider > backup-bitmap-API as a prototype for mirror-bitmap-API. > Oh, I see. Sorry for the confusion! Best Regards, Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 16:02 schrieb Vladimir Sementsov-Ogievskiy: > On 01.03.24 17:52, Fiona Ebner wrote: >> Am 01.03.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: >>> >>> As we already understood, (block-)job-api needs some spring-cleaning. >>> Unfortunately I don't have much time on it, but still I decided to start >>> from finally depreacting block-job-* API and moving to job-*.. Probably >>> bitmap/bitmap-mode/sync APIs also need some optimization, keeping in >>> mind new block-dirty-bitmap-merge api. >>> >>> So, what I could advice in this situation for newc interfaces: >>> >>> 1. be minimalistic >>> 2. add `x-` prefix when unsure >>> >>> So, following these two rules, what about x-bitmap field, which may be >>> combined only with 'full' mode, and do what you need? >>> >> >> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it >> doesn't need to be renamed if it becomes stable later. > > Right, unstable feature is needed, using "x-" is optional. > > Recent discussion about it was in my "vhost-user-blk: live resize > additional APIs" series: > > https://patchew.org/QEMU/20231006202045.1161543-1-vsement...@yandex-team.ru/20231006202045.1161543-5-vsement...@yandex-team.ru/ > > Following it, I think it's OK to not care anymore with "x-" prefixes, > and rely on unstable feature. > Thanks for the confirmation! I'll go without the prefix in the name then. >> >>> About documentation: actually, I never liked that we use for backup job >>> "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two >>> values supported only by backup. >>> >>> I'm also unsure how mode=full&bitmap=some_bitmap differs from >>> mode=bitmap&bitmap=some_bitmap.. >>> >> >> With the current patches, it was an error to specify @bitmap for other >> modes than 'incremental' and 'bitmap'. > > Current documentation says: > # @bitmap: The name of a dirty bitmap to use. Must be present if sync > # is "bitmap" or "incremental". Can be present if sync is "full" > # or "top". Must not be present otherwise. > # (Since 2.4 (drive-backup), 3.1 (blockdev-backup)) > > This is for backup. The documentation (and behavior) for @bitmap added by these patches for mirror is different ;) Best Regards, Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > > As we already understood, (block-)job-api needs some spring-cleaning. > Unfortunately I don't have much time on it, but still I decided to start > from finally depreacting block-job-* API and moving to job-*.. Probably > bitmap/bitmap-mode/sync APIs also need some optimization, keeping in > mind new block-dirty-bitmap-merge api. > > So, what I could advice in this situation for newc interfaces: > > 1. be minimalistic > 2. add `x-` prefix when unsure > > So, following these two rules, what about x-bitmap field, which may be > combined only with 'full' mode, and do what you need? > AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it doesn't need to be renamed if it becomes stable later. > About documentation: actually, I never liked that we use for backup job > "MirrorSyncMode". Now it looks more like "BackupSyncMode", having two > values supported only by backup. > > I'm also unsure how mode=full&bitmap=some_bitmap differs from > mode=bitmap&bitmap=some_bitmap.. > With the current patches, it was an error to specify @bitmap for other modes than 'incremental' and 'bitmap'. > So, I'd suggest simply rename MirrorSyncMode to BackupSyncMode, and add > separate MirrorSyncMode with only "full", "top" and "none" values. > Sounds good to me! [0]: https://gitlab.com/qemu-project/qemu/-/commit/a3c45b3e62962f99338716b1347cfb0d427cea44 Best Regards, Fiona
Re: [PATCH 00/19] Workaround Windows failing to find 64bit SMBIOS entry point with SeaBIOS
Am 29.02.24 um 14:18 schrieb Fiona Ebner: > Am 27.02.24 um 16:47 schrieb Igor Mammedov: >> Windows (10) bootloader when running on top of SeaBIOS, fails to find >> >> SMBIOSv3 entry point. Tracing it shows that it looks for v2 anchor markers >> >> only and not v3. Tricking it into believing that entry point is found >> >> lets Windows successfully locate and parse SMBIOSv3 tables. Whether it >> >> will be fixed on Windows side is not clear so here goes a workaround. >> >> >> >> Idea is to try build v2 tables if QEMU configuration permits, >> >> and fallback to v3 tables otherwise. That will mask Windows issue >> >> form majority of users. >> >> However if VM configuration can't be described (typically large VMs) >> >> by v2 tables, QEMU will use SMBIOSv3 and Windows will hit the issue >> >> again. In this case complain to Microsoft and/or use UEFI instead of >> >> SeaBIOS (requires reinstall). >> >> >> >> Default compat setting of smbios-entry-point-type after series >> >> for pc/q35 machines: >> >> * 9.0-newer: 'auto' >> >> * 8.1-8.2: '64' >> >> * 8.0-older: '32' >> >> >> >> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/2008 >> > > Thank you! I'm happy to confirm that this series works around the issue :) > While I still didn't do any in-depth testing (don't have enough knowledge for that anyways), I played around a bit more now, check that nothing obvious breaks also with a Linux VM and also ran a successful 'make check'. If that is enough, feel free to add: Tested-by: Fiona Ebner Best Regards, Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 13:00 schrieb Vladimir Sementsov-Ogievskiy: > > But anyway, this all could be simply achieved with > bitmap-copying/merging API, if we allow to pass user-given bitmap to the > mirror as working bitmap. > >> >> I see, I'll drop the 'bitmap-mode' in the next version if nobody >> complains :) >> > > Good. It's a golden rule: never make public interfaces which you don't > actually need for production. I myself sometimes violate it and spend > extra time on developing features, which we later have to just drop as > "not needed downstream, no sense in upstreaming". > Just wondering which new mode I should allow for the @MirrorSyncMode then? The documentation states: > # @incremental: only copy data described by the dirty bitmap. > # (since: 2.4) > # > # @bitmap: only copy data described by the dirty bitmap. (since: 4.2) > # Behavior on completion is determined by the BitmapSyncMode. For backup, do_backup_common() just maps @incremental to @bitmap + @bitmap-mode == @on-success. Using @bitmap for mirror would lead to being at odds with the documentation, because it mentions the BitmapSyncMode, which mirror won't have. Using @incremental for mirror would be consistent with the documentation, but behave a bit differently from backup. Opinions? Best Regards, Fiona
Re: [PATCH 00/19] Workaround Windows failing to find 64bit SMBIOS entry point with SeaBIOS
Am 27.02.24 um 16:47 schrieb Igor Mammedov: > Windows (10) bootloader when running on top of SeaBIOS, fails to find > > SMBIOSv3 entry point. Tracing it shows that it looks for v2 anchor markers > > only and not v3. Tricking it into believing that entry point is found > > lets Windows successfully locate and parse SMBIOSv3 tables. Whether it > > will be fixed on Windows side is not clear so here goes a workaround. > > > > Idea is to try build v2 tables if QEMU configuration permits, > > and fallback to v3 tables otherwise. That will mask Windows issue > > form majority of users. > > However if VM configuration can't be described (typically large VMs) > > by v2 tables, QEMU will use SMBIOSv3 and Windows will hit the issue > > again. In this case complain to Microsoft and/or use UEFI instead of > > SeaBIOS (requires reinstall). > > > > Default compat setting of smbios-entry-point-type after series > > for pc/q35 machines: > > * 9.0-newer: 'auto' > > * 8.1-8.2: '64' > > * 8.0-older: '32' > > > > Fixes: https://gitlab.com/qemu-project/qemu/-/issues/2008 > Thank you! I'm happy to confirm that this series works around the issue :) Best Regards, Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy: > On 29.02.24 13:11, Fiona Ebner wrote: >> >> The iotest creates a new target image for each incremental sync which >> only records the diff relative to the previous mirror and those diff >> images are later rebased onto each other to get the full picture. >> >> Thus, it can be that a previous mirror job (not just background process >> or previous write) already copied a cluster, and in particular, copied >> it to a different target! > > Aha understand. > > For simplicity, let's consider case, when source "cluster size" = "job > cluster size" = "bitmap granularity" = "target cluster size". > > Which types of clusters we should consider, when we want to handle guest > write? > > 1. Clusters, that should be copied by background process > > These are dirty clusters from user-given bitmap, or if we do a full-disk > mirror, all clusters, not yet copied by background process. > > For such clusters we simply ignore the unaligned write. We can even > ignore the aligned write too: less disturbing the guest by delays. > Since do_sync_target_write() currently doesn't ignore aligned writes, I wouldn't change it. Of course they can count towards the "done_bitmap" you propose below. > 2. Clusters, already copied by background process during this mirror job > and not dirtied by guest since this time. > > For such clusters we are safe to do unaligned write, as target cluster > must be allocated. > Right. > 3. Clusters, not marked initially by dirty bitmap. > > What to do with them? We can't do unaligned write. I see two variants: > > - do additional read from source, to fill the whole cluster, which seems > a bit too heavy > Yes, I'd rather only do that as a last resort. > - just mark the cluster as dirty for background job. So we behave like > in "background" mode. But why not? The maximum count of such "hacks" is > limited to number of "clear" clusters at start of mirror job, which > means that we don't seriously affect the convergence. Mirror is > guaranteed to converge anyway. And the whole sense of "write-blocking" > mode is to have a guaranteed convergence. What do you think? > It could lead to a lot of flips between job->actively_synced == true and == false. AFAIU, currently, we only switch back from true to false when an error happens. While I don't see a concrete issue with it, at least it might be unexpected to users, so it better be documented. I'll try going with this approach, thanks! > > > > Of course, we can't distinguish 3 types by on dirty bitmap, so we need > the second one. For example "done_bitmap", where we can mark clusters > that were successfully copied. That would be a kind of block-status of > target image. But using bitmap is a lot better than querying > block-status from target. Best Regards, Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 28.02.24 um 17:24 schrieb Vladimir Sementsov-Ogievskiy: > On 16.02.24 13:55, Fiona Ebner wrote: >> Previous discussion from when this was sent upstream [0] (it's been a >> while). I rebased the patches and re-ordered and squashed like >> suggested back then [1]. >> >> This implements two new mirror modes: >> >> - bitmap mirror mode with always/on-success/never bitmap sync mode >> - incremental mirror mode as sugar for bitmap + on-success >> >> Use cases: >> * Possibility to resume a failed mirror later. >> * Possibility to only mirror deltas to a previously mirrored volume. >> * Possibility to (efficiently) mirror an drive that was previously >> mirrored via some external mechanism (e.g. ZFS replication). >> >> We are using the last one in production without any issues since about >> 4 years now. In particular, like mentioned in [2]: >> >>> - create bitmap(s) >>> - (incrementally) replicate storage volume(s) out of band (using ZFS) >>> - incrementally drive mirror as part of a live migration of VM >>> - drop bitmap(s) > > Actually which mode you use, "never", "always" or "conditional"? Or in > downstream you have different approach? > We are using "conditional", but I think we don't really require any specific mode, because we drop the bitmaps after mirroring (even in failure case). Fabian, please correct me if I'm wrong. > Why am I asking: > > These modes (for backup) were developed prior to > block-dirty-bitmap-merge command, which allowed to copy bitmaps as you > want. With that API, we actually don't need all these modes, instead > it's enough to pass a bitmap, which would be _actually_ used by mirror. > > So, if you need "never" mode, you just copy your bitmap by > block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to > mirror job. > > Or, you pass your bitmap to mirror-job, and have a "always" mode. > > And I don't see, why we need a "conditional" mode, which actually just > drops away the progress we actually made. (OK, we failed, but why to > drop the progress of successfully copied clusters?) > I'm not sure actually. Maybe John remembers? I see, I'll drop the 'bitmap-mode' in the next version if nobody complains :) > > Using user-given bitmap in the mirror job has also an additional > advantage of live progress: up to visualization of disk copying by > visualization of the dirty bitmap contents. > Best Regards, Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 28.02.24 um 17:06 schrieb Vladimir Sementsov-Ogievskiy: > On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote: >> On 16.02.24 13:55, Fiona Ebner wrote: >>> Now, the IO test added in patch 4/4 actually contains yet another use >>> case, namely doing incremental mirrors to stand-alone qcow2 "diff" >>> images, that only contain the delta and can be rebased later. I had to >>> adapt the IO test, because its output expected the mirror bitmap to >>> still be dirty, but nowadays the mirror is apparently already done >>> when the bitmaps are queried. So I thought, I'll just use >>> 'write-blocking' mode to avoid any potential timing issues. >>> >>> But this exposed an issue with the diff image approach. If a write is >>> not aligned to the granularity of the mirror target, then rebasing the >>> diff image onto a backing image will not yield the desired result, >>> because the full cluster is considered to be allocated and will "hide" >>> some part of the base/backing image. The failure can be seen by either >>> using 'write-blocking' mode in the IO test or setting the (bitmap) >>> granularity to 32 KiB rather than the current 64 KiB. >>> >>> The question is how to deal with these edge cases? Some possibilities >>> that would make sense to me: >>> >>> For 'background' mode: >>> * prohibit if target's cluster size is larger than the bitmap >>> granularity >>> * document the limitation >>> >>> For 'write-blocking' mode: >>> * disallow in combination with bitmap mode (would not be happy about >>> it, because I'd like to use this without diff images) >> >> why not just require the same: bitmap granularity must be >= target >> granularity >> For the iotest's use-case, that only works for background mode. I'll explain below. >>> * for writes that are not aligned to the target's cluster size, read >>> the relevant/missing parts from the source image to be able to write >>> whole target clusters (seems rather complex) >> >> There is another approach: consider and unaligned part of the request, >> fit in one cluster (we can always split any request to "aligned" >> middle part, and at most two small "unligned" parts, each fit into one >> cluster). >> >> We have two possibilities: >> >> 1. the cluster is dirty (marked dirty in the bitmap used by background >> process) >> >> We can simply ignore this part and rely on background process. This >> will not affect the convergence of the mirror job. >> Agreed. >> 2. the cluster is clear (i.e. background process, or some previous >> write already copied it) >> The iotest creates a new target image for each incremental sync which only records the diff relative to the previous mirror and those diff images are later rebased onto each other to get the full picture. Thus, it can be that a previous mirror job (not just background process or previous write) already copied a cluster, and in particular, copied it to a different target! >> In this case, we are safe to do unaligned write, as target cluster >> must be allocated. Because the diff image is new, the target's cluster is not necessarily allocated. When using write-blocking and a write of, e.g., 9 bytes to a clear source cluster comes in, only those 9 bytes are written to the target. Now the target's cluster is allocated but with only those 9 bytes of data. When rebasing, the previously copied cluster is "masked" and when reading the rebased image, we only see the cluster with those 9 bytes (and IIRC, zeroes for the rest of the cluster rather than the previously copied data). >> >> (for bitmap-mode, I don't consider here clusters that are clear from >> the start, which we shouldn't copy in any case) >> We do need to copy new writes to any cluster, and with a clear cluster and write-blocking, the issue can manifest. > > Hmm, right, and that's exactly the logic we already have in > do_sync_target_write(). So that's enough just to require that > bitmap_granularity >= target_granularity > Best Regards, Fiona
Re: [RFC 1/4] drive-mirror: add support for sync=bitmap mode=never
Am 21.02.24 um 07:55 schrieb Markus Armbruster: >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index ab5a93a966..ac05483958 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -2181,6 +2181,15 @@ >> # destination (all the disk, only the sectors allocated in the >> # topmost image, or only new I/O). >> # >> +# @bitmap: The name of a bitmap to use for sync=bitmap mode. This >> +# argument must be present for bitmap mode and absent otherwise. >> +# The bitmap's granularity is used instead of @granularity. >> +# (Since 9.0). > > What happens when the user specifies @granularity anyway? Error or > silently ignored? > It's an error: >> +if (bitmap) { >> +if (granularity) { >> +error_setg(errp, "granularity (%d)" >> + "cannot be specified when a bitmap is provided", >> + granularity); >> +return NULL; >> +} >> +# >> +# @bitmap-mode: Specifies the type of data the bitmap should contain >> +# after the operation concludes. Must be present if sync is >> +# "bitmap". Must NOT be present otherwise. (Since 9.0) > > Members that must be present when and only when some enum member has a > certain value should perhaps be in a union branch. Perhaps the block > maintainers have an opinion here. > Sounds sensible to me. Considering also the next patches, in the end it could be a union discriminated by the @sync which contains @bitmap and @bitmap-mode when it's the 'bitmap' sync mode, @bitmap when it's the 'incremental' sync mode (@bitmap-sync mode needs to be 'on-success' then, so there is no choice for the user) and which contains @granularity for the other sync modes. Best Regards, Fiona
[RFC 1/4] drive-mirror: add support for sync=bitmap mode=never
From: John Snow This patch adds support for the "BITMAP" sync mode to drive-mirror and blockdev-mirror. It adds support only for the BitmapSyncMode "never," because it's the simplest mode. This mode simply uses a user-provided bitmap as an initial copy manifest, and then does not clear any bits in the bitmap at the conclusion of the operation. Any new writes dirtied during the operation are copied out, in contrast to backup. Note that whether these writes are reflected in the bitmap at the conclusion of the operation depends on whether that bitmap is actually recording! This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily. Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 update version and formatting in QAPI] Signed-off-by: Fiona Ebner --- block/mirror.c | 96 -- blockdev.c | 38 +- include/block/block_int-global-state.h | 4 +- qapi/block-core.json | 25 ++- tests/unit/test-block-iothread.c | 4 +- 5 files changed, 139 insertions(+), 28 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 5145eb53e1..315dff11e2 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -73,6 +73,8 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +BdrvDirtyBitmap *sync_bitmap; +BitmapSyncMode bitmap_mode; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -718,7 +720,8 @@ static int mirror_exit_common(Job *job) &error_abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -815,6 +818,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (s->sync_bitmap) { +bdrv_dirty_bitmap_set_busy(s->sync_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1011,7 +1024,8 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if ((s->sync_mode == MIRROR_SYNC_MODE_TOP) || +(s->sync_mode == MIRROR_SYNC_MODE_FULL)) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(&s->common.job)) { goto immediate_exit; @@ -1302,6 +1316,7 @@ static const BlockJobDriver mirror_job_driver = { .run= mirror_run, .prepare= mirror_prepare, .abort = mirror_abort, +.clean = mirror_clean, .pause = mirror_pause, .complete = mirror_complete, .cancel = mirror_cancel, @@ -1320,6 +1335,7 @@ static const BlockJobDriver commit_active_job_driver = { .run= mirror_run, .prepare= mirror_prepare, .abort = mirror_abort, +.clean = mirror_clean, .pause = mirror_pause, .complete = mirror_complete, .cancel = commit_active_cancel, @@ -1712,7 +1728,10 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BdrvDirtyBitmap *bitmap, + BitmapSyncMode bitmap_mode, + BlockDriverState *base, bool auto_complete, const char *filter_n
[RFC 0/4] mirror: implement incremental and bitmap modes
Previous discussion from when this was sent upstream [0] (it's been a while). I rebased the patches and re-ordered and squashed like suggested back then [1]. This implements two new mirror modes: - bitmap mirror mode with always/on-success/never bitmap sync mode - incremental mirror mode as sugar for bitmap + on-success Use cases: * Possibility to resume a failed mirror later. * Possibility to only mirror deltas to a previously mirrored volume. * Possibility to (efficiently) mirror an drive that was previously mirrored via some external mechanism (e.g. ZFS replication). We are using the last one in production without any issues since about 4 years now. In particular, like mentioned in [2]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/4 actually contains yet another use case, namely doing incremental mirrors to stand-alone qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. But this exposed an issue with the diff image approach. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The question is how to deal with these edge cases? Some possibilities that would make sense to me: For 'background' mode: * prohibit if target's cluster size is larger than the bitmap granularity * document the limitation For 'write-blocking' mode: * disallow in combination with bitmap mode (would not be happy about it, because I'd like to use this without diff images) * for writes that are not aligned to the target's cluster size, read the relevant/missing parts from the source image to be able to write whole target clusters (seems rather complex) * document the limitation [0]: https://lore.kernel.org/qemu-devel/20200218100740.2228521-1-f.gruenbich...@proxmox.com/ [1]: https://lore.kernel.org/qemu-devel/d35a76de-78d5-af56-0b34-f7bd2bbd3...@redhat.com/ [2]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ Fabian Grünbichler (2): mirror: move some checks to qmp iotests: add test for bitmap mirror John Snow (2): drive-mirror: add support for sync=bitmap mode=never drive-mirror: add support for conditional and always bitmap sync modes block/mirror.c| 94 +- blockdev.c| 70 +- include/block/block_int-global-state.h|4 +- qapi/block-core.json | 25 +- tests/qemu-iotests/tests/bitmap-sync-mirror | 550 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 + tests/unit/test-block-iothread.c |4 +- 7 files changed, 3527 insertions(+), 30 deletions(-) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out -- 2.39.2
[RFC 2/4] drive-mirror: add support for conditional and always bitmap sync modes
From: John Snow Teach mirror two new tricks for using bitmaps: Always: no matter what, we synchronize the copy_bitmap back to the sync_bitmap. In effect, this allows us resume a failed mirror at a later date. Conditional: On success only, we sync the bitmap. This is akin to incremental backup modes; we can use this bitmap to later refresh a successfully created mirror. Originally-by: John Snow [FG: add check for bitmap-mode without bitmap switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0] Signed-off-by: Fiona Ebner --- The original patch this was based on came from a WIP git branch and thus has no Signed-off-by trailer from John, see [0]. I added an Originally-by trailer for now. Let me know if I should drop that and wait for John's Signed-off-by instead. [0] https://lore.kernel.org/qemu-devel/1599140071.n44h532eeu.astr...@nora.none/ block/mirror.c | 24 ++-- blockdev.c | 3 +++ 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 315dff11e2..84155b1f78 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -689,8 +689,6 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); - /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ bdrv_ref(src); @@ -796,6 +794,18 @@ static int mirror_exit_common(Job *job) bdrv_drained_end(target_bs); bdrv_unref(target_bs); +if (s->sync_bitmap) { +if (s->bitmap_mode == BITMAP_SYNC_MODE_ALWAYS || +(s->bitmap_mode == BITMAP_SYNC_MODE_ON_SUCCESS && + job->ret == 0 && ret == 0)) { +/* Success; synchronize copy back to sync. */ +bdrv_clear_dirty_bitmap(s->sync_bitmap, NULL); +bdrv_dirty_bitmap_merge_internal(s->sync_bitmap, s->dirty_bitmap, + NULL, true); +} +} +bdrv_release_dirty_bitmap(s->dirty_bitmap); + bs_opaque->job = NULL; bdrv_drained_end(src); @@ -1755,10 +1765,6 @@ static BlockJob *mirror_start_job( " sync mode", MirrorSyncMode_str(sync_mode)); return NULL; -} else if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { -error_setg(errp, - "Bitmap Sync Mode '%s' is not supported by Mirror", - BitmapSyncMode_str(bitmap_mode)); } } else if (bitmap) { error_setg(errp, @@ -1775,6 +1781,12 @@ static BlockJob *mirror_start_job( return NULL; } granularity = bdrv_dirty_bitmap_granularity(bitmap); + +if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { +if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { +return NULL; +} +} } else if (granularity == 0) { granularity = bdrv_get_default_bitmap_granularity(target); } diff --git a/blockdev.c b/blockdev.c index c65d9ded70..aeb9fde9f3 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2873,6 +2873,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) { return; } +} else if (has_bitmap_mode) { +error_setg(errp, "Cannot specify bitmap sync mode without a bitmap"); +return; } if (!replaces) { -- 2.39.2
[RFC 3/4] mirror: move some checks to qmp
From: Fabian Grünbichler and assert the passing conditions in block/mirror.c. while incremental mode was never available for drive-mirror, it makes the interface more uniform w.r.t. backup block jobs. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0] Signed-off-by: Fiona Ebner --- block/mirror.c | 28 +--- blockdev.c | 29 + 2 files changed, 34 insertions(+), 23 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 84155b1f78..15d1c060eb 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -1755,31 +1755,13 @@ static BlockJob *mirror_start_job( GLOBAL_STATE_CODE(); -if (sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(sync_mode)); -return NULL; -} else if (sync_mode == MIRROR_SYNC_MODE_BITMAP) { -if (!bitmap) { -error_setg(errp, "Must provide a valid bitmap name for '%s'" - " sync mode", - MirrorSyncMode_str(sync_mode)); -return NULL; -} -} else if (bitmap) { -error_setg(errp, - "sync mode '%s' is not compatible with bitmaps", - MirrorSyncMode_str(sync_mode)); -return NULL; -} +/* QMP interface protects us from these cases */ +assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); +assert((bitmap && sync_mode == MIRROR_SYNC_MODE_BITMAP) || + (!bitmap && sync_mode != MIRROR_SYNC_MODE_BITMAP)); +assert(!(bitmap && granularity)); if (bitmap) { -if (granularity) { -error_setg(errp, "granularity (%d)" - "cannot be specified when a bitmap is provided", - granularity); -return NULL; -} granularity = bdrv_dirty_bitmap_granularity(bitmap); if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { diff --git a/blockdev.c b/blockdev.c index aeb9fde9f3..519f408359 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2852,7 +2852,36 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, sync = MIRROR_SYNC_MODE_FULL; } +if ((sync == MIRROR_SYNC_MODE_BITMAP) || +(sync == MIRROR_SYNC_MODE_INCREMENTAL)) { +/* done before desugaring 'incremental' to print the right message */ +if (!bitmap_name) { +error_setg(errp, "Must provide a valid bitmap name for " + "'%s' sync mode", MirrorSyncMode_str(sync)); +return; +} +} + +if (sync == MIRROR_SYNC_MODE_INCREMENTAL) { +if (has_bitmap_mode && +bitmap_mode != BITMAP_SYNC_MODE_ON_SUCCESS) { +error_setg(errp, "Bitmap sync mode must be '%s' " + "when using sync mode '%s'", + BitmapSyncMode_str(BITMAP_SYNC_MODE_ON_SUCCESS), + MirrorSyncMode_str(sync)); +return; +} +has_bitmap_mode = true; +sync = MIRROR_SYNC_MODE_BITMAP; +bitmap_mode = BITMAP_SYNC_MODE_ON_SUCCESS; +} + if (bitmap_name) { +if (sync != MIRROR_SYNC_MODE_BITMAP) { +error_setg(errp, "Sync mode '%s' not supported with bitmap.", + MirrorSyncMode_str(sync)); +return; +} if (granularity) { error_setg(errp, "Granularity and bitmap cannot both be set"); return; -- 2.39.2
[RFC 4/4] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - no inclusion of not-yet-available full/top sync modes in combination with bitmaps - extra set of reference/test mirrors to verify that writes in parallel with active mirror work intentionally keeping copyright and ownership of original test case to honor provenance. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0, i.e. adapt to renames like vm.command -> vm.cmd, specifying explicit image format for rebase, adapt to new behavior of qemu_img(), dropping of 'status' field in output, etc. rename test from '384' to 'bitmap-sync-mirror'] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/bitmap-sync-mirror | 550 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 + 2 files changed, 3360 insertions(+) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror new file mode 100755 index 00..6cd9b74dac --- /dev/null +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -0,0 +1,550 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import math +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. +Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +