Re: [PULL 00/28] Block layer patches
On Tue, 19 Sept 2023 at 06:26, Kevin Wolf wrote:
> Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> If we could fully get rid of the AioContext lock (as we originally
> stated as a goal), that would automatically solve this kind of
> deadlock.

Grepping for "ctx locked", "context acquired", etc. does not bring up a lot
of comments describing variables that are protected by the AioContext lock.
However, there are at least hundreds of functions that assume they are
called with the AioContext lock held.

There are a few strategies:

Top-down: Shorten AioContext lock critical sections to cover only APIs that
need them. Then push the lock down into the API and repeat at the next lower
level until aio_context_acquire() + AIO_WAIT_WHILE() + aio_context_release()
can be replaced with AIO_WAIT_WHILE_UNLOCKED().

Bottom-up: Switch AIO_WAIT_WHILE() to aio_context_release() +
AIO_WAIT_WHILE_UNLOCKED() + aio_context_acquire(). Then move the lock up
into callers and repeat at the next higher level until aio_context_acquire()
+ aio_context_release() cancel each other out.

Big bang: Remove aio_context_acquire/release() and fix tests until they pass.

I think top-down is safer than bottom-up, because bottom-up is more likely
to cause issues with callers that do not tolerate temporarily dropping the
lock. The big bang approach is only reasonable if the AioContext lock is no
longer used to protect variables (which we don't know for sure because that
requires auditing every line of code).

My concern with the top-down approach is that so much code needs to be
audited and the conversions are temporary steps (it's almost a waste of time
for maintainers to review them). I'm tempted to go for the big bang approach
but also don't want to introduce a slew of new race conditions. :/

Stefan
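The two endpoints of the top-down conversion can be illustrated with a toy model. This is a Python sketch for brevity, not QEMU code: `Ctx` stands in for an AioContext and `pending` for state its lock protects; all names here are illustrative.

```python
import threading

class Ctx:
    def __init__(self):
        self.lock = threading.RLock()   # AioContext lock analogue
        self.pending = []

    def drain_with_caller_held_lock(self):
        # Before the conversion: the caller wraps the whole wait loop in
        # the lock, i.e. the aio_context_acquire() + AIO_WAIT_WHILE() +
        # aio_context_release() pattern.  Nothing else can make progress
        # under this lock while we spin.
        with self.lock:
            while self.pending:
                self.pending.pop()

    def drain_unlocked(self):
        # After the conversion: the critical section has been pushed down
        # so the lock covers only the state access, and the wait loop
        # itself runs unlocked -- the AIO_WAIT_WHILE_UNLOCKED() endpoint.
        while True:
            with self.lock:
                if not self.pending:
                    return
                self.pending.pop()
```

The point of the exercise is that the "after" form never sleeps or polls while holding the lock, which is what makes the deadlock described in this thread impossible.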
Re: [PULL 00/28] Block layer patches
On Tue, 19 Sept 2023 at 06:26, Kevin Wolf wrote:
> Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> > Hi Kevin,
> > I believe that my own commit "block-coroutine-wrapper: use
> > qemu_get_current_aio_context()" breaks this test. The failure is
> > non-deterministic (happens about 1 out of 4 runs).
> >
> > It seems the job hangs and the test times out in vm.run_job('job1',
> > wait=5.0).
> >
> > I haven't debugged it yet but wanted to share this information to save
> > some time. Tomorrow I'll investigate further.
>
> Yes, it's relatively easily reproducible if I run the test in a loop,
> and I can't seem to reproduce it without the last patch. Should I
> unstage the full series again, or do you think that the last patch is
> really optional this time?
>
> However, I'm unsure how the stack traces I'm seeing are related to your
> patch. Maybe it just made an existing bug more likely to be triggered?
>
> What I'm seeing is that the reader lock is held by an iothread that is
> waiting for its AioContext lock to make progress:
>
> Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
> #0  0x7f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
> #1  0x7f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at /lib64/libc.so.6
> #2  0x55b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
> #3  0x55b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
> #4  0x55b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at ../util/async.c:728
> #5  0x55b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at ../util/async.c:565
> #6  0x55b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at ../util/async.c:169
> #7  0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at ../util/async.c:216
> #8  0x55b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at ../util/aio-posix.c:722
> #9  0x55b7b1518604 in iothread_run (opaque=0x55b7b2904460) at ../iothread.c:63
> #10 0x55b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at ../util/qemu-thread-posix.c:541
> #11 0x7f81250ae15d in start_thread () at /lib64/libc.so.6
> #12 0x7f812512fc00 in clone3 () at /lib64/libc.so.6
>
> On the other hand, the main thread wants to acquire the writer lock,
> but it holds the AioContext lock of the iothread (it takes it in
> job_prepare_locked()):
>
> Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
> #0  0x7f8125122356 in ppoll () at /lib64/libc.so.6
> #1  0x55b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, timeout=-1) at ../util/qemu-timer.c:339
> #2  0x55b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x55b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at ../util/aio-posix.c:670
> #4  0x55b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
> #5  0x55b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at ../block/block-backend.c:916
> #6  0x55b7b1554779 in blk_delete (blk=0x55b7b4425800) at ../block/block-backend.c:497
> #7  0x55b7b1554133 in blk_unref (blk=0x55b7b4425800) at ../block/block-backend.c:557
> #8  0x55b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at ../block/mirror.c:696
> #9  0x55b7b1577015 in mirror_prepare (job=0x55b7b4419000) at ../block/mirror.c:807
> #10 0x55b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at ../job.c:988
> #11 0x55b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, fn=0x55b7b153a110 <job_prepare_locked>) at ../job.c:191
> #12 0x55b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at ../job.c:1011
> #13 0x55b7b153a886 in job_completed_txn_success_locked (job=0x55b7b4419000) at ../job.c:1068
> #14 0x55b7b1539372 in job_completed_locked (job=0x55b7b4419000) at ../job.c:1082
> #15 0x55b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
> #16 0x55b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at ../util/async.c:169
> #17 0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at ../util/async.c:216
> #18 0x55b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at ../util/aio-posix.c:423
> #19 0x55b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, callback=0x0, user_data=0x0) at ../util/async.c:358
> #20 0x7f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:3454
> #21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
> #22 0x55b7b1729c98 in glib_pollfds_poll () at ../util/main-loop.c:290
> #23 0x55b7b1729572 in os_host_main_loop_wait (timeout=27462700) at ../util/main-loop.c:313
> #24 0x55b7b1729452 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> #25 0x55b7b119a1eb in
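The situation in those backtraces is the classic AB-BA inversion: one thread holds the reader lock and waits for the AioContext lock, while the other holds the AioContext lock and waits for the writer lock, which cannot be granted until readers drain. A minimal standalone sketch (Python, not QEMU code; the lock names are only stand-ins) of the standard way out, taking the locks in one global order on every thread, so the pair always finishes instead of deadlocking:

```python
import threading

ctx_lock = threading.Lock()     # stands in for the AioContext lock
graph_lock = threading.Lock()   # stands in for the block graph lock

def worker(results, name):
    # Deadlock-free variant: every thread takes ctx_lock first, then
    # graph_lock.  The inversion in the backtraces corresponds to the
    # two threads taking them in opposite orders.
    with ctx_lock:
        with graph_lock:
            results.append(name)

def run_workers():
    results = []
    threads = [threading.Thread(target=worker, args=(results, i))
               for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In QEMU the ordering is harder to impose globally, which is why the thread instead discusses removing the AioContext lock altogether.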
Re: [PULL 00/28] Block layer patches
On Tue, 19 Sept 2023 at 06:26, Kevin Wolf wrote:
> Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> > Hi Kevin,
> > I believe that my own commit "block-coroutine-wrapper: use
> > qemu_get_current_aio_context()" breaks this test. The failure is
> > non-deterministic (happens about 1 out of 4 runs).
> >
> > It seems the job hangs and the test times out in vm.run_job('job1',
> > wait=5.0).
> >
> > I haven't debugged it yet but wanted to share this information to save
> > some time. Tomorrow I'll investigate further.
>
> Yes, it's relatively easily reproducible if I run the test in a loop,
> and I can't seem to reproduce it without the last patch. Should I
> unstage the full series again, or do you think that the last patch is
> really optional this time?

Please drop the last patch. I'm not aware of dependencies on the last
patch.

> However, I'm unsure how the stack traces I'm seeing are related to your
> patch. Maybe it just made an existing bug more likely to be triggered?

I'll share my thoughts once I've looked at the crash today.

Regarding AioContext lock removal: I'll work on that and see what still
depends on the lock.
Stefan
[PATCH v3 4/8] qemu-img: add chunk size parameter to compare_buffers()
Add @chsize param to the function which, if non-zero, would represent
the chunk size to be used for comparison. If it's zero, then
BDRV_SECTOR_SIZE is used as default chunk size, which is the previous
behaviour.

In particular, we're going to use this param in img_rebase() to make
the write requests aligned to a predefined alignment value.

Signed-off-by: Andrey Drobyshev
Reviewed-by: Eric Blake
Reviewed-by: Hanna Czenczek
---
 qemu-img.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 4dc91505bf..0f67b021f7 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1274,23 +1274,29 @@ static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
 }
 
 /*
- * Compares two buffers sector by sector. Returns 0 if the first
- * sector of each buffer matches, non-zero otherwise.
+ * Compares two buffers chunk by chunk, where @chsize is the chunk size.
+ * If @chsize is 0, default chunk size of BDRV_SECTOR_SIZE is used.
+ * Returns 0 if the first chunk of each buffer matches, non-zero otherwise.
  *
- * pnum is set to the sector-aligned size of the buffer prefix that
- * has the same matching status as the first sector.
+ * @pnum is set to the size of the buffer prefix aligned to @chsize that
+ * has the same matching status as the first chunk.
  */
 static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
-                           int64_t bytes, int64_t *pnum)
+                           int64_t bytes, uint64_t chsize, int64_t *pnum)
 {
     bool res;
-    int64_t i = MIN(bytes, BDRV_SECTOR_SIZE);
+    int64_t i;
 
     assert(bytes > 0);
 
+    if (!chsize) {
+        chsize = BDRV_SECTOR_SIZE;
+    }
+    i = MIN(bytes, chsize);
+
     res = !!memcmp(buf1, buf2, i);
     while (i < bytes) {
-        int64_t len = MIN(bytes - i, BDRV_SECTOR_SIZE);
+        int64_t len = MIN(bytes - i, chsize);
 
         if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
             break;
@@ -1559,7 +1565,7 @@ static int img_compare(int argc, char **argv)
             ret = 4;
             goto out;
         }
-        ret = compare_buffers(buf1, buf2, chunk, &pnum);
+        ret = compare_buffers(buf1, buf2, chunk, 0, &pnum);
         if (ret || pnum != chunk) {
             qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
                     offset + (ret ? 0 : pnum));
@@ -3878,7 +3884,7 @@ static int img_rebase(int argc, char **argv)
                 int64_t pnum;
 
                 if (compare_buffers(buf_old + written, buf_new + written,
-                                    n - written, &pnum))
+                                    n - written, 0, &pnum))
                 {
                     if (buf_old_is_zero) {
                         ret = blk_pwrite_zeroes(blk, offset + written, pnum, 0);
-- 
2.39.3
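For readers following along, the chunked comparison can be modelled outside QEMU. This is a hedged Python sketch of the patched `compare_buffers()` semantics, not the C function itself; the return convention is adapted to a tuple:

```python
BDRV_SECTOR_SIZE = 512  # QEMU's sector size, the default chunk size

def compare_buffers(buf1, buf2, chsize=0):
    """Compare two equal-sized buffers chunk by chunk.  Returns
    (mismatch, pnum): `mismatch` mirrors the C function's non-zero
    return when the first chunk differs, and `pnum` is the length of
    the prefix (in whole chunks) that shares the first chunk's
    matching status."""
    chsize = chsize or BDRV_SECTOR_SIZE
    nbytes = len(buf1)
    assert nbytes > 0 and nbytes == len(buf2)
    i = min(nbytes, chsize)
    mismatch = buf1[:i] != buf2[:i]
    while i < nbytes:
        length = min(nbytes - i, chsize)
        # Stop as soon as a chunk's matching status flips relative to
        # the first chunk's status.
        if (buf1[i:i + length] != buf2[i:i + length]) != mismatch:
            break
        i += length
    return mismatch, i
```

With a larger `chsize`, mismatch granularity coarsens: two buffers that differ only in their second 512-byte sector report `pnum == 512` with the default, but a single mismatching 1024-byte chunk when `chsize=1024`, which is exactly what img_rebase() wants for aligned writes.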
[PATCH v3 1/8] qemu-img: rebase: stop when reaching EOF of old backing file
In case when we're rebasing within one backing chain, and when target
image is larger than old backing file, bdrv_is_allocated_above() ends up
setting *pnum = 0. As a result, target offset isn't getting incremented,
and we get stuck in an infinite for loop. Let's detect this case and
proceed further down the loop body, as the offsets beyond the old backing
size need to be explicitly zeroed.

Signed-off-by: Andrey Drobyshev
Reviewed-by: Denis V. Lunev
Reviewed-by: Hanna Czenczek
---
 qemu-img.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index a48edb7101..50660ba920 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3805,6 +3805,8 @@ static int img_rebase(int argc, char **argv)
         }
 
         if (prefix_chain_bs) {
+            uint64_t bytes = n;
+
             /*
              * If cluster wasn't changed since prefix_chain, we don't need
              * to take action
@@ -3817,9 +3819,18 @@ static int img_rebase(int argc, char **argv)
                          strerror(-ret));
                 goto out;
             }
-            if (!ret) {
+            if (!ret && n) {
                 continue;
             }
+            if (!n) {
+                /*
+                 * If we've reached EOF of the old backing, it means that
+                 * offsets beyond the old backing size were read as zeroes.
+                 * Now we will need to explicitly zero the cluster in
+                 * order to preserve that state after the rebase.
+                 */
+                n = bytes;
+            }
         }
 
         /*
-- 
2.39.3
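The control flow being fixed can be sketched in a few lines of Python. This is an illustrative model, not QEMU code: the stub plays the role of bdrv_is_allocated_above() reporting "0 bytes" at the old backing's EOF, the condition that previously left the loop spinning with an unchanged offset.

```python
def is_allocated_above_stub(offset, n, old_backing_size):
    # Stand-in for bdrv_is_allocated_above(): pretend nothing differs
    # between the image and its prefix chain, and report back 0 bytes
    # once the old backing's EOF is reached.
    if offset >= old_backing_size:
        return 0, 0
    return 0, min(n, old_backing_size - offset)

def rebase_loop(size, old_backing_size, buf_size):
    """Model of the img_rebase() loop after the patch: returns the
    (offset, length) ranges that must be explicitly zeroed because
    they lie beyond the old backing's EOF."""
    zeroed = []
    offset = 0
    while offset < size:
        n = min(buf_size, size - offset)
        saved = n                  # the patch's "uint64_t bytes = n;"
        ret, n = is_allocated_above_stub(offset, n, old_backing_size)
        if not ret and n:          # cluster unchanged: skip it
            offset += n
            continue
        if n == 0:
            # EOF of the old backing: those offsets read as zeroes, so
            # restore n and record the range for explicit zeroing
            # instead of looping forever with offset unchanged.
            n = saved
            zeroed.append((offset, n))
        offset += n
    return zeroed
```

Without the `n = saved` restoration, any `size > old_backing_size` input would leave `offset` stuck at the old backing's EOF, which is the infinite loop the commit message describes.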
[PATCH v3 8/8] iotests: add tests for "qemu-img rebase" with compression
The test cases considered so far: 314 (new test suite): 1. Check that compression mode isn't compatible with "-f raw" (raw format doesn't support compression). 2. Check that rebasing an image onto no backing file preserves the data and writes the copied clusters actually compressed. 3. Same as 2, but with a raw backing file (i.e. the clusters copied from the backing are originally uncompressed -- we check they end up compressed after being merged). 4. Remove a single delta from a backing chain, perform the same checks as in 2. 5. Check that even when backing and overlay are initially uncompressed, copied clusters end up compressed when rebase with compression is performed. 271: 1. Check that when target image has subclusters, rebase with compression will make an entire cluster containing the written subcluster compressed. Signed-off-by: Andrey Drobyshev Reviewed-by: Hanna Czenczek --- tests/qemu-iotests/271 | 65 +++ tests/qemu-iotests/271.out | 40 + tests/qemu-iotests/314 | 165 + tests/qemu-iotests/314.out | 75 + 4 files changed, 345 insertions(+) create mode 100755 tests/qemu-iotests/314 create mode 100644 tests/qemu-iotests/314.out diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271 index e243f57ba7..59a6fafa2f 100755 --- a/tests/qemu-iotests/271 +++ b/tests/qemu-iotests/271 @@ -965,6 +965,71 @@ echo TEST_IMG="$TEST_IMG.top" alloc="1 30" zero="" _verify_l2_bitmap 0 +# Check that rebase with compression works correctly with images containing +# subclusters. When compression is enabled and we allocate a new +# subcluster within the target (overlay) image, we expect the entire cluster +# containing that subcluster to become compressed. +# +# Here we expect 1st and 3rd clusters of the top (overlay) image to become +# compressed after the rebase, while cluster 2 to remain unallocated and +# be read from the base (new backing) image. +# +# Base (new backing): |-- -- .. -- --|11 11 .. 11 11|-- -- .. -- --| +# Mid (old backing): |-- -- .. -- 22|-- -- .. 
-- --|33 -- .. -- --| +# Top:|-- -- .. -- --|-- -- -- -- --|-- -- .. -- --| + +echo +echo "### Rebase with compression for images with subclusters ###" +echo + +echo "# create backing chain" +echo + +TEST_IMG="$TEST_IMG.base" _make_test_img -o cluster_size=1M,extended_l2=on 3M +TEST_IMG="$TEST_IMG.mid" _make_test_img -o cluster_size=1M,extended_l2=on \ +-b "$TEST_IMG.base" -F qcow2 3M +TEST_IMG="$TEST_IMG.top" _make_test_img -o cluster_size=1M,extended_l2=on \ +-b "$TEST_IMG.mid" -F qcow2 3M + +echo +echo "# fill old and new backing with data" +echo + +$QEMU_IO -c "write -P 0x11 1M 1M" "$TEST_IMG.base" | _filter_qemu_io +$QEMU_IO -c "write -P 0x22 $(( 31 * 32 ))k 32k" \ + -c "write -P 0x33 $(( 64 * 32 ))k 32k" \ + "$TEST_IMG.mid" | _filter_qemu_io + +echo +echo "# rebase topmost image onto the new backing, with compression" +echo + +$QEMU_IMG rebase -c -b "$TEST_IMG.base" -F qcow2 "$TEST_IMG.top" + +echo "# verify that the 1st and 3rd clusters've become compressed" +echo + +$QEMU_IMG map --output=json "$TEST_IMG.top" | _filter_testdir + +echo +echo "# verify that data is read the same before and after rebase" +echo + +$QEMU_IO -c "read -P 0x22 $(( 31 * 32 ))k 32k" \ + -c "read -P 0x11 1M 1M" \ + -c "read -P 0x33 $(( 64 * 32 ))k 32k" \ + "$TEST_IMG.top" | _filter_qemu_io + +echo +echo "# verify image bitmap" +echo + +# For compressed clusters bitmap is always 0. For unallocated cluster +# there should be no entry at all, thus bitmap is also 0. 
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 0 +TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 1 +TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 2 + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out index c335a6c608..0b24d50159 100644 --- a/tests/qemu-iotests/271.out +++ b/tests/qemu-iotests/271.out @@ -765,4 +765,44 @@ Offset Length Mapped to File # verify image bitmap L2 entry #0: 0x8050 4002 + +### Rebase with compression for images with subclusters ### + +# create backing chain + +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=3145728 +Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=3145728 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT +Formatting 'TEST_DIR/t.IMGFMT.top', fmt=IMGFMT size=3145728 backing_file=TEST_DIR/t.IMGFMT.mid backing_fmt=IMGFMT + +# fill old and new backing with data + +wrote 1048576/1048576 bytes at offset 1048576 +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 32768/32768 bytes at offset 1015808 +32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 32768/32768 bytes at offset 2097152 +32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +# rebase topmost image onto the new backing,
[PATCH v3 5/8] qemu-img: rebase: avoid unnecessary COW operations
When rebasing an image from one backing file to another, we need to
compare data from old and new backings. If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2

qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing rebase, we'll write half of the cluster to inc2, and the block
layer will have to read the 2nd half of the same cluster from the backing
image (inc1) while doing this write operation, although the whole cluster
was already read earlier to perform the data comparison.

In order to avoid these unnecessary IO cycles, let's make sure every write
request is aligned to the overlay subcluster boundaries. Using the
subcluster size is universal, as for images which don't have subclusters
it equals the cluster size, so in any case we end up aligning to the
smallest unit of allocation.
Signed-off-by: Andrey Drobyshev --- qemu-img.c | 74 +++--- 1 file changed, 54 insertions(+), 20 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index 0f67b021f7..a2d6241648 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3523,6 +3523,7 @@ static int img_rebase(int argc, char **argv) uint8_t *buf_new = NULL; BlockDriverState *bs = NULL, *prefix_chain_bs = NULL; BlockDriverState *unfiltered_bs; +BlockDriverInfo bdi = {0}; char *filename; const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg; int c, flags, src_flags, ret; @@ -3533,6 +3534,7 @@ static int img_rebase(int argc, char **argv) bool quiet = false; Error *local_err = NULL; bool image_opts = false; +int64_t write_align; /* Parse commandline parameters */ fmt = NULL; @@ -3656,6 +3658,20 @@ static int img_rebase(int argc, char **argv) } } +/* + * We need overlay subcluster size to make sure write requests are + * aligned. + */ +ret = bdrv_get_info(unfiltered_bs, &bdi); +if (ret < 0) { +error_report("could not get block driver info"); +goto out; +} else if (bdi.subcluster_size == 0) { +bdi.subcluster_size = 1; +} + +write_align = bdi.subcluster_size; + /* For safe rebasing we need to compare old and new backing file */ if (!unsafe) { QDict *options = NULL; @@ -3753,7 +3769,7 @@ static int img_rebase(int argc, char **argv) int64_t old_backing_size = 0; int64_t new_backing_size = 0; uint64_t offset; -int64_t n; +int64_t n, n_old = 0, n_new = 0; float local_progress = 0; if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk_old_backing)) > @@ -3799,7 +3815,8 @@ static int img_rebase(int argc, char **argv) } for (offset = 0; offset < size; offset += n) { -bool buf_old_is_zero = false; +bool old_backing_eof = false; +int64_t n_alloc; /* How many bytes can we handle with the next read? */ n = MIN(IO_BUF_SIZE, size - offset); @@ -3844,33 +3861,46 @@ static int img_rebase(int argc, char **argv) } } +/* + * At this point we know that the region [offset; offset + n) + * is unallocated within the target image. 
This region might be + * unaligned to the target image's (sub)cluster boundaries, as + * old backing may have smaller clusters (or have subclusters). + * We extend it to the aligned boundaries to avoid CoW on + * partial writes in blk_pwrite(), + */ +n += offset - QEMU_ALIGN_DOWN(offset, write_align); +offset = QEMU_ALIGN_DOWN(offset, write_align); +n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n); +n = MIN(n, size - offset); +assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) && + n_alloc == n); + +/* + * Much like with the target image, we'll try to read as much + * of the old and new backings as we can. + */ +n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset)); +n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset)); + /* * Read old and new backing file and take into consideration that * backing files may be smaller than the COW image. */ -if (offset >= old_backing_size) { -memset(buf_old, 0, n); -buf_old_is_zero = true; +memset(buf_old + n_old, 0, n - n_old); +if (!n_old) { +
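The region-widening arithmetic from the hunk above can be checked in isolation. A Python sketch follows (illustrative, not QEMU code; `align_down`/`align_up` mirror what QEMU_ALIGN_DOWN()/QEMU_ALIGN_UP() compute for power-of-two and non-power-of-two alignments alike):

```python
def align_down(x, align):
    return x - x % align

def align_up(x, align):
    return align_down(x + align - 1, align)

def extend_to_alignment(offset, n, write_align, size):
    """Widen [offset, offset + n) to write_align boundaries (the
    overlay's subcluster size), clamped to the image size, so that
    the subsequent write never covers a partial subcluster and thus
    never triggers copy-on-write."""
    n += offset - align_down(offset, write_align)       # extend left
    offset = align_down(offset, write_align)
    n += align_up(offset + n, write_align) - (offset + n)  # extend right
    n = min(n, size - offset)                           # clamp to EOF
    return offset, n
```

For example, a 5-byte region at offset 3 with a 4-byte alignment becomes the aligned region [0, 8), while a region already ending at EOF is clamped rather than extended past the image.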
[PATCH v3 7/8] qemu-img: add compression option to rebase subcommand
If we rebase an image whose backing file has compressed clusters, we might end up wasting disk space since the copied clusters are now uncompressed. In order to have better control over this, let's add a "--compress" option to the "qemu-img rebase" command. Note that this option affects only the clusters which are actually being copied from the original backing file. The clusters which were uncompressed in the target image will remain so. Signed-off-by: Andrey Drobyshev Reviewed-by: Denis V. Lunev Reviewed-by: Hanna Czenczek --- docs/tools/qemu-img.rst | 6 -- qemu-img-cmds.hx| 4 ++-- qemu-img.c | 26 -- 3 files changed, 26 insertions(+), 10 deletions(-) diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst index ca5a2773cf..4459c065f1 100644 --- a/docs/tools/qemu-img.rst +++ b/docs/tools/qemu-img.rst @@ -667,7 +667,7 @@ Command description: List, apply, create or delete snapshots in image *FILENAME*. -.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME +.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME Changes the backing file of an image. Only the formats ``qcow2`` and ``qed`` support changing the backing file. @@ -694,7 +694,9 @@ Command description: In order to achieve this, any clusters that differ between *BACKING_FILE* and the old backing file of *FILENAME* are merged -into *FILENAME* before actually changing the backing file. +into *FILENAME* before actually changing the backing file. With the +``-c`` option specified, the clusters which are being merged (but not +the entire *FILENAME* image) are compressed when written. Note that the safe mode is an expensive operation, comparable to converting an image. 
It only works if the old backing file still diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx index 1b1dab5b17..068692d13e 100644 --- a/qemu-img-cmds.hx +++ b/qemu-img-cmds.hx @@ -88,9 +88,9 @@ SRST ERST DEF("rebase", img_rebase, -"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] [-T src_cache] [-p] [-u] -b backing_file [-F backing_fmt] filename") +"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] [-T src_cache] [-p] [-u] [-c] -b backing_file [-F backing_fmt] filename") SRST -.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME +.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME ERST DEF("resize", img_resize, diff --git a/qemu-img.c b/qemu-img.c index a2d6241648..9d7a4a3566 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3527,11 +3527,13 @@ static int img_rebase(int argc, char **argv) char *filename; const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg; int c, flags, src_flags, ret; +BdrvRequestFlags write_flags = 0; bool writethrough, src_writethrough; int unsafe = 0; bool force_share = false; int progress = 0; bool quiet = false; +bool compress = false; Error *local_err = NULL; bool image_opts = false; int64_t write_align; @@ -3548,9 +3550,10 @@ static int img_rebase(int argc, char **argv) {"object", required_argument, 0, OPTION_OBJECT}, {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS}, {"force-share", no_argument, 0, 'U'}, +{"compress", no_argument, 0, 'c'}, {0, 0, 0, 0} }; -c = getopt_long(argc, argv, ":hf:F:b:upt:T:qU", +c = getopt_long(argc, argv, ":hf:F:b:upt:T:qUc", long_options, NULL); if (c == -1) { break; @@ -3598,6 +3601,9 @@ static int img_rebase(int argc, char **argv) case 'U': force_share = true; break; +case 'c': +compress = true; +break; } } @@ -3650,6 +3656,14 @@ static int img_rebase(int 
argc, char **argv) unfiltered_bs = bdrv_skip_filters(bs); +if (compress && !block_driver_can_compress(unfiltered_bs->drv)) { +error_report("Compression not supported for this file format"); +ret = -1; +goto out; +} else if (compress) { +write_flags |= BDRV_REQ_WRITE_COMPRESSED; +} + if (out_basefmt != NULL) { if (bdrv_find_format(out_basefmt) == NULL) { error_report("Invalid format name: '%s'", out_basefmt); @@ -3659,18 +3673,18 @@ static int img_rebase(int argc, char **argv) } /* - * We need overlay subcluster size to make sure write requests are - * aligned. + * We need overlay subcluster size (or cluster size in
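The new command-line surface can be sketched independently of the C getopt_long() loop. Below is a hypothetical Python model: the option letters (`-c` / `--compress`) follow the patch, while the parser itself and the subset of other options shown are purely illustrative.

```python
import argparse

def parse_rebase_args(argv):
    """Toy parser for a subset of "qemu-img rebase" options, including
    the -c/--compress flag added by this patch."""
    parser = argparse.ArgumentParser(prog="qemu-img rebase", add_help=False)
    parser.add_argument("-b", dest="backing_file", required=True)
    parser.add_argument("-F", dest="backing_fmt")
    parser.add_argument("-u", dest="unsafe", action="store_true")
    parser.add_argument("-c", "--compress", action="store_true",
                        help="compress clusters copied from the old backing")
    parser.add_argument("filename")
    return parser.parse_args(argv)
```

Usage mirrors the documented invocation, e.g. `qemu-img rebase -c -b base.qcow2 -F qcow2 inc2.qcow2` compresses only the clusters merged from the old backing file, leaving previously uncompressed clusters in the target untouched.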
[PATCH v3 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
the data read from the old and new backing files are aligned using
BlockDriverState (or BlockBackend later on) referring to the target image.
However, this isn't quite right, because buf_new is only being used for
reading from the new backing, while buf_old is being used for both reading
from the old backing and writing to the target. Let's take that into
account and use more appropriate values as alignments.

Signed-off-by: Andrey Drobyshev
---
 qemu-img.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 50660ba920..4dc91505bf 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
         int64_t n;
         float local_progress = 0;
 
-        buf_old = blk_blockalign(blk, IO_BUF_SIZE);
-        buf_new = blk_blockalign(blk, IO_BUF_SIZE);
+        if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk_old_backing)) >
+            bdrv_opt_mem_align(blk_bs(blk))) {
+            buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
+        } else {
+            buf_old = blk_blockalign(blk, IO_BUF_SIZE);
+        }
+        buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
 
         size = blk_getlength(blk);
         if (size < 0) {
-- 
2.39.3
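The alignment choice itself reduces to "take the stricter of the two constraints". A trivial Python sketch of that decision (names are illustrative, not QEMU APIs):

```python
def pick_buf_old_alignment(target_align, old_backing_align=None):
    """buf_old is read into from the old backing *and* written out to
    the target, so its memory alignment must satisfy the stricter
    (larger) of the two requirements; buf_new only ever holds reads
    from the new backing and can simply use that image's alignment."""
    if old_backing_align is not None and old_backing_align > target_align:
        return old_backing_align
    return target_align
```

This matches the patch's condition: the old backing's alignment is used only when it exists and is stricter than the target's.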
[PATCH v3 6/8] iotests/{024, 271}: add testcases for qemu-img rebase
As the previous commit changes the logic of "qemu-img rebase" (it's using write alignment now), let's add a couple more test cases which would ensure it works correctly. In particular, the following scenarios: 024: add test case for rebase within one backing chain when the overlay cluster size > backings cluster size; 271: add test case for rebase images that contain subclusters. Check that no extra allocations are being made. Signed-off-by: Andrey Drobyshev Reviewed-by: Hanna Czenczek --- tests/qemu-iotests/024 | 60 ++ tests/qemu-iotests/024.out | 43 + tests/qemu-iotests/271 | 66 ++ tests/qemu-iotests/271.out | 42 4 files changed, 211 insertions(+) diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024 index 98a7c8fd65..285f17e79f 100755 --- a/tests/qemu-iotests/024 +++ b/tests/qemu-iotests/024 @@ -257,6 +257,66 @@ $QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \ echo +# Check that rebase within the chain is working when +# overlay cluster size > backings cluster size +# (here overlay cluster size == 2 * backings cluster size) +# +# base_new <-- base_old <-- overlay +# +# Backing (new): -- -- -- -- -- -- +# Backing (old): -- 11 -- -- 22 -- +# Overlay: |-- --|-- --|-- --| +# +# We should end up having 1st and 3rd cluster allocated, and their halves +# being read as zeroes. 
+ +echo +echo "=== Test rebase with different cluster sizes ===" +echo + +echo "Creating backing chain" +echo + +TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 6 )) +TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \ +$(( CLUSTER_SIZE * 6 )) +CLUSTER_SIZE=$(( CLUSTER_SIZE * 2 )) TEST_IMG=$OVERLAY \ +_make_test_img -b "$BASE_OLD" -F $IMGFMT $(( CLUSTER_SIZE * 6 )) + +TEST_IMG=$OVERLAY _img_info + +echo +echo "Fill backing files with data" +echo + +$QEMU_IO "$BASE_OLD" -c "write -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" \ +-c "write -P 0x22 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \ +| _filter_qemu_io + +echo +echo "Rebase onto another image in the same chain" +echo + +$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY" + +echo "Verify that data is read the same before and after rebase" +echo + +$QEMU_IO "$OVERLAY" -c "read -P 0x00 0 $CLUSTER_SIZE" \ +-c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" \ +-c "read -P 0x00 $(( CLUSTER_SIZE * 2 )) $(( CLUSTER_SIZE * 2 ))" \ +-c "read -P 0x22 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \ +-c "read -P 0x00 $(( CLUSTER_SIZE * 5 )) $CLUSTER_SIZE" \ +| _filter_qemu_io + +echo +echo "Verify that untouched cluster remains unallocated" +echo + +$QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map + +echo + # success, all done echo "*** done" rm -f $seq.full diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out index 245fe8b1d1..e1e8eea863 100644 --- a/tests/qemu-iotests/024.out +++ b/tests/qemu-iotests/024.out @@ -201,4 +201,47 @@ read 262144/262144 bytes at offset 0 read 65536/65536 bytes at offset 262144 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +=== Test rebase with different cluster sizes === + +Creating backing chain + +Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=393216 +Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=393216 backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT +Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=393216 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT +image: TEST_DIR/subdir/t.IMGFMT +file format: IMGFMT +virtual size: 384 KiB (393216 bytes) +cluster_size: 131072 +backing file: TEST_DIR/subdir/t.IMGFMT.base_old +backing file format: IMGFMT + +Fill backing files with data + +wrote 65536/65536 bytes at offset 65536 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 65536/65536 bytes at offset 262144 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +Rebase onto another image in the same chain + +Verify that data is read the same before and after rebase + +read 65536/65536 bytes at offset 0 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 65536/65536 bytes at offset 65536 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 131072/131072 bytes at offset 131072 +128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 65536/65536 bytes at offset 262144 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 65536/65536 bytes at offset 327680 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +Verify that untouched cluster remains unallocated + +Offset Length File +0 0x2 TEST_DIR/subdir/t.IMGFMT +0x4 0x2 TEST_DIR/subdir/t.IMGFMT + *** done diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271 index c7c2cadda0..e243f57ba7 100755 --- a/tests/qemu-iotests/271 +++ b/tests/qemu-iotests/271 @@ -899,6 +899
[PATCH v3 2/8] qemu-iotests: 024: add rebasing test case for overlay_size > backing_size
Before the previous commit, rebase was getting infinitely stuck when rebasing within the same backing chain with overlay_size > backing_size. Let's add this case to the rebasing test 024 to make sure it doesn't break again. Signed-off-by: Andrey Drobyshev Reviewed-by: Denis V. Lunev Reviewed-by: Hanna Czenczek --- tests/qemu-iotests/024 | 57 ++ tests/qemu-iotests/024.out | 30 2 files changed, 87 insertions(+) diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024 index 25a564a150..98a7c8fd65 100755 --- a/tests/qemu-iotests/024 +++ b/tests/qemu-iotests/024 @@ -199,6 +199,63 @@ echo # $BASE_OLD and $BASE_NEW) $QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map +# Check that rebase within the chain is working when +# overlay_size > old_backing_size +# +# base_new <-- base_old <-- overlay +# +# Backing (new): 11 11 11 11 11 +# Backing (old): 22 22 22 22 +# Overlay: -- -- -- -- -- +# +# As a result, overlay should contain data identical to base_old, with the +# last cluster remaining unallocated.
+ +echo +echo "=== Test rebase within one backing chain ===" +echo + +echo "Creating backing chain" +echo + +TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 5 )) +TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \ +$(( CLUSTER_SIZE * 4 )) +TEST_IMG=$OVERLAY _make_test_img -b "$BASE_OLD" -F $IMGFMT \ +$(( CLUSTER_SIZE * 5 )) + +echo +echo "Fill backing files with data" +echo + +$QEMU_IO "$BASE_NEW" -c "write -P 0x11 0 $(( CLUSTER_SIZE * 5 ))" \ +| _filter_qemu_io +$QEMU_IO "$BASE_OLD" -c "write -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \ +| _filter_qemu_io + +echo +echo "Check the last cluster is zeroed in overlay before the rebase" +echo +$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \ +| _filter_qemu_io + +echo +echo "Rebase onto another image in the same chain" +echo + +$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY" + +echo "Verify that data is read the same before and after rebase" +echo + +# Verify the first 4 clusters are still read the same as in the old base +$QEMU_IO "$OVERLAY" -c "read -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \ +| _filter_qemu_io +# Verify the last cluster still reads as zeroes +$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \ +| _filter_qemu_io + +echo # success, all done echo "*** done" diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out index 973a5a3711..245fe8b1d1 100644 --- a/tests/qemu-iotests/024.out +++ b/tests/qemu-iotests/024.out @@ -171,4 +171,34 @@ read 65536/65536 bytes at offset 196608 Offset Length File 0 0x3 TEST_DIR/subdir/t.IMGFMT 0x3 0x1 TEST_DIR/subdir/t.IMGFMT.base_new + +=== Test rebase within one backing chain === + +Creating backing chain + +Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=327680 +Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=262144 backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT +Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=327680 
backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT + +Fill backing files with data + +wrote 327680/327680 bytes at offset 0 +320 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 262144/262144 bytes at offset 0 +256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +Check the last cluster is zeroed in overlay before the rebase + +read 65536/65536 bytes at offset 262144 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + +Rebase onto another image in the same chain + +Verify that data is read the same before and after rebase + +read 262144/262144 bytes at offset 0 +256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 65536/65536 bytes at offset 262144 +64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) + *** done -- 2.39.3
[PATCH v3 0/8] qemu-img: rebase: add compression support
v2 --> v3: * Patch 3/8: fixed logic in the if statement, so that we align on blk when blk_old_backing == NULL; * Patch 4/8: comment fix; * Patch 5/8: comment fix; dropped redundant "if (blk_new_backing)" statements. v2: https://lists.nongnu.org/archive/html/qemu-block/2023-09/msg00448.html Andrey Drobyshev (8): qemu-img: rebase: stop when reaching EOF of old backing file qemu-iotests: 024: add rebasing test case for overlay_size > backing_size qemu-img: rebase: use backing files' BlockBackend for buffer alignment qemu-img: add chunk size parameter to compare_buffers() qemu-img: rebase: avoid unnecessary COW operations iotests/{024, 271}: add testcases for qemu-img rebase qemu-img: add compression option to rebase subcommand iotests: add tests for "qemu-img rebase" with compression docs/tools/qemu-img.rst| 6 +- qemu-img-cmds.hx | 4 +- qemu-img.c | 136 ++ tests/qemu-iotests/024 | 117 ++ tests/qemu-iotests/024.out | 73 tests/qemu-iotests/271 | 131 + tests/qemu-iotests/271.out | 82 ++ tests/qemu-iotests/314 | 165 + tests/qemu-iotests/314.out | 75 + 9 files changed, 752 insertions(+), 37 deletions(-) create mode 100755 tests/qemu-iotests/314 create mode 100644 tests/qemu-iotests/314.out -- 2.39.3
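The core operation this series optimizes — comparing old and new backing data and materializing in the overlay only the clusters that differ, so reads stay identical after the backing switch — can be sketched as follows. This is an illustrative model, not QEMU's actual code; the cluster size and names are made up, and it ignores clusters already allocated in the overlay.

```python
CLUSTER = 4  # illustrative cluster size in bytes (qcow2 default is 64 KiB)

def rebase_diff(old_backing: bytes, new_backing: bytes):
    """Return (offset, data) ranges the overlay must absorb so that
    reads through the chain stay identical after switching the
    backing file from old_backing to new_backing."""
    size = max(len(old_backing), len(new_backing))
    copies = []
    for off in range(0, size, CLUSTER):
        # Short backings read as zeroes past EOF, hence the padding.
        old = old_backing[off:off + CLUSTER].ljust(CLUSTER, b"\x00")
        new = new_backing[off:off + CLUSTER].ljust(CLUSTER, b"\x00")
        if old != new:
            copies.append((off, old))  # preserve the old chain's view
    return copies

old = b"\xaa" * 8                 # old backing: clusters 0 and 1 hold 0xaa
new = b"\xaa" * 4 + b"\xbb" * 4   # new backing differs only in cluster 1
print(rebase_diff(old, new))      # only cluster 1 needs copying
```

Clusters that compare equal are simply left unallocated in the overlay, which is exactly what the new test cases in patches 2 and 6 verify with `qemu-img map`.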
Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations
On 9/19/23 13:46, Hanna Czenczek wrote: > On 15.09.23 18:20, Andrey Drobyshev wrote: >> When rebasing an image from one backing file to another, we need to >> compare data from old and new backings. If the diff between that data >> happens to be unaligned to the target cluster size, we might end up >> doing partial writes, which would lead to copy-on-write and additional >> IO. >> >> Consider the following simple case (virtual_size == cluster_size == 64K): >> >> base <-- inc1 <-- inc2 >> >> qemu-io -c "write -P 0xaa 0 32K" base.qcow2 >> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2 >> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2 >> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2 >> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2 >> >> While doing rebase, we'll write a half of the cluster to inc2, and block >> layer will have to read the 2nd half of the same cluster from the base >> image >> inc1 while doing this write operation, although the whole cluster is >> already >> read earlier to perform data comparison. >> >> In order to avoid these unnecessary IO cycles, let's make sure every >> write request is aligned to the overlay subcluster boundaries. Using >> subcluster size is universal as for the images which don't have them >> this size equals to the cluster size, so in any case we end up aligning >> to the smallest unit of allocation. >> >> Signed-off-by: Andrey Drobyshev >> --- >> qemu-img.c | 76 -- >> 1 file changed, 56 insertions(+), 20 deletions(-) > > Looks good, I like the changes from v1! Two minor things: > >> diff --git a/qemu-img.c b/qemu-img.c >> index fcd31d7b5b..83950af42b 100644 >> --- a/qemu-img.c >> +++ b/qemu-img.c > > [...] > >> @@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv) >> } >> } >> + /* >> + * At this point we know that the region [offset; offset >> + n) >> + * is unallocated within the target image. 
This region >> might be >> + * unaligned to the target image's (sub)cluster >> boundaries, as >> + * old backing may have smaller clusters (or have >> subclusters). >> + * We extend it to the aligned boundaries to avoid CoW on >> + * partial writes in blk_pwrite(), >> + */ >> + n += offset - QEMU_ALIGN_DOWN(offset, write_align); >> + offset = QEMU_ALIGN_DOWN(offset, write_align); >> + n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n); >> + n = MIN(n, size - offset); >> + assert(!bdrv_is_allocated(unfiltered_bs, offset, n, >> &n_alloc) && >> + n_alloc == n); >> + >> + /* >> + * Much like the with the target image, we'll try to read >> as much > > s/the with the/with the/ > Noted. >> + * of the old and new backings as we can. >> + */ >> + n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset)); >> + if (blk_new_backing) { >> + n_new = MIN(n, MAX(0, new_backing_size - (int64_t) >> offset)); >> + } > > If we don’t have a check for blk_old_backing (old_backing_size is 0 if > blk_old_backing is NULL), why do we have a check for blk_new_backing > (new_backing_size is 0 if blk_new_backing is NULL)? > > (Perhaps because the previous check was `offset >= new_backing_size || > !blk_new_backing`, i.e. included exactly such a check – but I don’t > think it’s necessary, new_backing_size will be 0 if blk_new_backing is > NULL.) > >> + >> /* >> * Read old and new backing file and take into >> consideration that >> * backing files may be smaller than the COW image. 
>> */ >> - if (offset >= old_backing_size) { >> - memset(buf_old, 0, n); >> - buf_old_is_zero = true; >> + memset(buf_old + n_old, 0, n - n_old); >> + if (!n_old) { >> + old_backing_eof = true; >> } else { >> - if (offset + n > old_backing_size) { >> - n = old_backing_size - offset; >> - } >> - >> - ret = blk_pread(blk_old_backing, offset, n, buf_old, 0); >> + ret = blk_pread(blk_old_backing, offset, n_old, >> buf_old, 0); >> if (ret < 0) { >> error_report("error while reading from old >> backing file"); >> goto out; >> } >> } >> - if (offset >= new_backing_size || !blk_new_backing) { >> - memset(buf_new, 0, n); >> - } else { >> - if (offset + n > new_backing_size) { >> - n = new_backing_size - offset; >> - } >> - >> - ret = blk_pread(blk_new_backing, offset, n, buf_new, 0); >> + memset(buf_new + n_new, 0,
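The alignment arithmetic quoted in the hunk under review can be hard to parse in diff form. Here is the same computation as a small Python sketch (`align_down`/`align_up` stand in for `QEMU_ALIGN_DOWN`/`QEMU_ALIGN_UP`; the function name is made up):

```python
def align_down(x, a):
    return x - x % a

def align_up(x, a):
    return align_down(x + a - 1, a)

def extend_to_alignment(offset, n, write_align, size):
    """Extend the unallocated region [offset, offset + n) outward to
    the overlay's (sub)cluster boundaries before writing, so that
    blk_pwrite() never issues a partial-cluster write that would
    trigger copy-on-write."""
    n += offset - align_down(offset, write_align)
    offset = align_down(offset, write_align)
    n += align_up(offset + n, write_align) - (offset + n)
    n = min(n, size - offset)  # but never reach past the image's EOF
    return offset, n

# A 4 KiB request at offset 0x2800 grows to cover its whole 64 KiB cluster:
print(extend_to_alignment(0x2800, 0x1000, 0x10000, 0x100000))  # (0, 65536)
```

The final `min()` is why the `assert(!bdrv_is_allocated(...))` in the patch still holds: the extended region can shrink at EOF but never grows into an allocated cluster, since allocation in the overlay happens at (sub)cluster granularity.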
Re: [PATCH] block: mark aio_poll as non-coroutine
Am 08.09.2023 um 09:54 hat Paolo Bonzini geschrieben: > It is forbidden to block on the event loop during a coroutine, as that > can cause deadlocks due to recursive locking. > > Signed-off-by: Paolo Bonzini Thanks, applied to the block branch. Kevin
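The rule this patch encodes — a coroutine must never run a nested blocking event loop — exists in most async frameworks. As an illustration of the general hazard (not QEMU code), Python's asyncio detects the same pattern and refuses it outright instead of risking a deadlock:

```python
import asyncio

async def blocks_on_its_own_loop():
    loop = asyncio.get_running_loop()
    # Moral equivalent of calling aio_poll() from inside a QEMU
    # coroutine: blocking on the very loop that is currently running us.
    loop.run_until_complete(asyncio.sleep(0))

try:
    asyncio.run(blocks_on_its_own_loop())
except RuntimeError as exc:
    print("refused:", exc)  # "This event loop is already running"
```

QEMU's C event loop cannot detect this dynamically as cheaply, which is why marking `aio_poll()` as `coroutine_mixed_fn`/non-coroutine at the API level is the right tool: it turns the mistake into a static/assertion failure rather than a runtime deadlock.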
Re: Deadlock with SATA CD I/O and eject
Am 19.09.2023 um 14:57 hat John Levon geschrieben: > On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote: > > > > In the meantime, we start processing the blk_drain() code, so by the time > > > this > > > blk_pread() actually gets handled, quiesce is set, and we get stuck in the > > > blk_wait_while_drained(). > > > > > > I don't know the qemu block stack well enough to propose an actual fix. > > > > > > Experimentally, waiting for ->in_flight to drop to zero *before* we > > > quiesce in > > > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm > > > pretty sure > > > that's just a band-aid instead of fixing the deadlock. > > > > Related discussion: > > https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html > > > > Actually, it seems we never fixed that problem either? > > > > Back then I suggested that blk_drain*() should disable request queuing > > because its callers don't want to quiesce the BlockBackend, but rather > > get their own requests completed. I think this approach would solve this > > one as well. > > In this case though, it's not their own requests right? We have incoming I/O > from the guest and the eject is a separate operation. It's not the same code path, but we're operating in the context of the IDE device (or more specifically, its BlockBackend), so in that sense it is "its own requests". The main difference is anyway between just stopping activity (even if it's in the middle of a higher level operation of the device) and getting requests fully completed. We want the latter here. > So why it would be OK to disable queuing and have ongoing I/Os (i.e. > blk->in_flight > 0) racing against the block remove? With eject, the case is simple for IDE: We hold the BQL, so the guest won't be able to submit new requests anyway. In more complicated cases like virtio-blk, bdrv_drained_begin() and friends take care to stop new requests from coming in from the guest by not running notifiers while the device is drained. 
We just need to take care of the requests that have already started. > > Your experiment with moving the queuing later is basically what Paolo > > I think it's much more flaky than that, isn't it? It's just clearing out the > *current* pending queue, but nothing would stop new I/Os from being started > before we dropped into the poll for blk->in_flight to be zero again. In my > case, > this just happens to work because the prior tray open notification has stopped > new I/O being filed, right? I think it's the same as above, holding the BQL and calling drain would already take care of that. Kevin
Re: [PATCH v2] block-backend: Add new bds_io_in_flight counter
Am 31.03.2023 um 18:23 hat Hanna Czenczek geschrieben: > IDE TRIM is a BB user that wants to elevate its BB's in-flight counter > for a "macro" operation that consists of several actual I/O operations. > Each of those operations is individually started and awaited. It does > this so that blk_drain() will drain the whole TRIM, and not just a > single one of the many discard operations it may encompass. > > When request queuing is enabled, this leads to a deadlock: The currently > ongoing discard is drained, and the next one is queued, waiting for the > drain to stop. Meanwhile, TRIM still keeps the in-flight counter > elevated, waiting for all discards to stop -- which will never happen, > because with the in-flight counter elevated, the BB is never considered > drained, so the drained section does not begin and cannot end. Alright, let's have another look at this now that another similar deadlock was reported: https://lists.gnu.org/archive/html/qemu-block/2023-09/msg00536.html > There are two separate cases to look at here, namely bdrv_drain*() and > blk_drain*(). As said above, we do want blk_drain*() to settle the > whole operation: The only way to do so is to disable request queuing, > then. So, we do that: Have blk_drain() and blk_drain_all() temporarily > disable request queuing, which prevents the deadlock. Two separate cases with two separate fixes suggests that it could be two separate patches. I feel the blk_*() case is uncontroversial and it would fix John's case, so splitting wouldn't only make this easier to understand, but could mean that we can fix a useful subset earlier. > (The devil's in the details, though: blk_drain_all() runs > bdrv_drain_all_begin() first, so when we get to the individual BB, there > may already be queued requests. Therefore, we have to not only disable > request queuing then, but resume all already-queued requests, too.) Why can't we just disable request queuing before calling bdrv_drain_*()? 
Is it possible that the same problem occurs because someone else already called bdrv_drain_*()? That is, blk_drain_*() would be called from a callback in the nested event loop in bdrv_drain_*()? If so, we can't avoid that there are already queued requests. Restarting them seems correct anyway. > For bdrv_drain*(), we want request queuing -- and macro requests such as > IDE's TRIM request do not matter. bdrv_drain*() wants to keep I/O > requests from BDS nodes, and the TRIM does not issue such requests; it > instead does so through blk_*() functions, which themselves elevate the > BB's in-flight counter. So the idea is to drain (and potentially queue) > those blk_*() requests, but completely ignore the TRIM. > > We can do that by splitting a new counter off of the existing BB > counter: The new bds_io_in_flight counter counts all those blk_*() > requests that can issue I/O to a BDS (so must be drained by > bdrv_drain*()), but will never block waiting on another request on the > BB. You end up changing all of the existing blk_inc_in_flight() callers except those in IDE and virtio-blk. That makes me wonder if it shouldn't be approached the other way around: BlockBackend users that want to be included in drain should use a special function blk_inc_in_flight_external() or something that wouldn't increase blk->in_flight, but only a new separate counter. And then only blk_drain*() wait for it (by extending the AIO_WAIT_WHILE() condition), but not the child_root callbacks. This would give us more directly the semantics that we actually need: The root BDS doesn't care if the operation on the device level has completed as long as nothing new arrives, only external callers which use blk_drain*() do. I believe internal/external is easier to reason about than "requests that can issue I/O to a BDS [directly]", it keeps the external callers the special ones that need extra care while the normal I/O path is unaffected, and it would make the patch much smaller. 
> In blk_drain*(), we disable request queuing and settle all requests (the > full in_flight count). In bdrv_drain*() (i.e. blk_root_drained_poll()), > we only settle bds_io_in_flight_count, ignoring all requests that will > not directly issue I/O requests to BDS nodes. > > Reported-by: Fiona Ebner > Fixes: 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca >("ide: Increment BB in-flight counter for TRIM BH") > Signed-off-by: Hanna Czenczek Kevin
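The two-counter alternative suggested in this review — a separate counter for device-level "external" operations, with `blk_drain*()` waiting on both counters but the child_root drain callbacks waiting only on real I/O — can be modeled in a few lines. All names here are hypothetical; this is a sketch of the proposal, not of merged QEMU code:

```python
class ToyBlockBackend:
    """Toy model of the 'external in-flight counter' idea; the real
    BlockBackend is far richer than this."""
    def __init__(self):
        self.in_flight = 0           # requests that issue I/O to a BDS
        self.in_flight_external = 0  # macro operations such as IDE TRIM

    def root_drained_poll(self):
        # bdrv_drain*() via the child_root callback: keep polling only
        # while actual I/O can still reach the node.
        return self.in_flight > 0

    def blk_drain_done(self):
        # blk_drain*(): external callers want everything fully settled,
        # including the enclosing macro operation.
        return self.in_flight == 0 and self.in_flight_external == 0

blk = ToyBlockBackend()
blk.in_flight_external += 1       # a TRIM is in progress, no I/O pending
print(blk.root_drained_poll())    # False: node-level drain can complete
print(blk.blk_drain_done())       # False: blk_drain() keeps waiting
```

This captures why the internal/external split avoids the TRIM deadlock: the drained section can begin (and the queued discards can be restarted) even while the macro operation keeps its own counter elevated.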
Re: Deadlock with SATA CD I/O and eject
On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote: > > In the meantime, we start processing the blk_drain() code, so by the time > > this > > blk_pread() actually gets handled, quiesce is set, and we get stuck in the > > blk_wait_while_drained(). > > > > I don't know the qemu block stack well enough to propose an actual fix. > > > > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce > > in > > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty > > sure > > that's just a band-aid instead of fixing the deadlock. > > Related discussion: > https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html > > Actually, it seems we never fixed that problem either? > > Back then I suggested that blk_drain*() should disable request queuing > because its callers don't want to quiesce the BlockBackend, but rather > get their own requests completed. I think this approach would solve this > one as well. In this case though, it's not their own requests right? We have incoming I/O from the guest and the eject is a separate operation. So why it would be OK to disable queuing and have ongoing I/Os (i.e. blk->in_flight > 0) racing against the block remove? > Your experiment with moving the queuing later is basically what Paolo I think it's much more flaky than that, isn't it? It's just clearing out the *current* pending queue, but nothing would stop new I/Os from being started before we dropped into the poll for blk->in_flight to be zero again. In my case, this just happens to work because the prior tray open notification has stopped new I/O being filed, right? thanks john
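The ordering described in this thread reduces to a small state machine (a toy model with made-up names, not real QEMU code): the guest request raises `in_flight` before drain begins, but is handled only after quiesce is set, so each side waits on the other forever.

```python
class ToyBB:
    def __init__(self):
        self.in_flight = 0
        self.quiesce = False
        self.queued = []

    def submit(self):
        # blk_pread() from the guest: the counter goes up immediately...
        self.in_flight += 1
        return self.handle

    def handle(self):
        # ...but the request is actually processed later, from a BH.
        if self.quiesce:
            self.queued.append(self.handle)  # blk_wait_while_drained()
            return False
        self.in_flight -= 1
        return True

bb = ToyBB()
pending = bb.submit()   # guest read starts, in_flight == 1
bb.quiesce = True       # eject -> blk_remove_bs() -> drain quiesces the BB
pending()               # the read gets queued instead of completing
# Deadlock: drain polls for in_flight == 0 while the queued request
# waits for quiesce to be cleared; neither condition can become true.
print(bb.in_flight, len(bb.queued))  # 1 1
```

Disabling request queuing for the duration of `blk_drain()` breaks the cycle at the `if self.quiesce` branch: the already-submitted request completes and `in_flight` can reach zero.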
Re: [PATCH v2 8/8] iotests: add tests for "qemu-img rebase" with compression
On 15.09.23 18:20, Andrey Drobyshev wrote: The test cases considered so far: 314 (new test suite): 1. Check that compression mode isn't compatible with "-f raw" (raw format doesn't support compression). 2. Check that rebasing an image onto no backing file preserves the data and writes the copied clusters actually compressed. 3. Same as 2, but with a raw backing file (i.e. the clusters copied from the backing are originally uncompressed -- we check they end up compressed after being merged). 4. Remove a single delta from a backing chain, perform the same checks as in 2. 5. Check that even when backing and overlay are initially uncompressed, copied clusters end up compressed when rebase with compression is performed. 271: 1. Check that when target image has subclusters, rebase with compression will make an entire cluster containing the written subcluster compressed. Signed-off-by: Andrey Drobyshev --- tests/qemu-iotests/271 | 65 +++ tests/qemu-iotests/271.out | 40 + tests/qemu-iotests/314 | 165 + tests/qemu-iotests/314.out | 75 + 4 files changed, 345 insertions(+) create mode 100755 tests/qemu-iotests/314 create mode 100644 tests/qemu-iotests/314.out Reviewed-by: Hanna Czenczek
Re: [PATCH 1/1] block: improve alignment detection and fix 271 test
On 9/7/23 23:53, Denis V. Lunev wrote: Unfortunately, the 271 IO test is broken if started in non-cached mode. Commits a6b257a08e3d72219f03e461a52152672fec0612 ("file-posix: Handle undetectable alignment", Nir Soffer, Tue Aug 13 21:21:03 2019 +0300) and 9c60a5d1978e6dcf85c0e01b50e6f7f54ca09104 ("block: Require aligned image size to avoid assertion failure", Kevin Wolf, Thu Jul 16 16:26:00 2020 +0200) have an interesting side effect if used together. If the image size is not a multiple of 4k and the image falls under the original constraints of Nir's patch, the image cannot be opened due to the check in bdrv_check_perm(). The patch tries to satisfy the requirements of bdrv_check_perm() inside raw_probe_alignment(). This is, in my opinion, better than just disallowing that test to run in non-cached mode; the operation is legal by itself. Signed-off-by: Denis V. Lunev CC: Nir Soffer CC: Kevin Wolf CC: Hanna Reitz CC: Alberto Garcia --- block/file-posix.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index b16e9c21a1..988cfdc76c 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -447,8 +447,21 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) for (i = 0; i < ARRAY_SIZE(alignments); i++) { align = alignments[i]; if (raw_is_io_aligned(fd, buf, align)) { -/* Fallback to safe value. */ -bs->bl.request_alignment = (align != 1) ? align : max_align; +if (align != 1) { +bs->bl.request_alignment = align; +break; +} +/* + * Fallback to safe value. max_align is perfect, but the size of the device must be multiple of + * the virtual length of the device. In the other case we will get a error in + * bdrv_node_refresh_perm(). + */ +for (align = max_align; align > 1; align /= 2) { +if ((bs->total_sectors * BDRV_SECTOR_SIZE) % align == 0) { +break; +} +} +bs->bl.request_alignment = align; break; } } ping
[PATCH v5 5/5] vhost-user: fix lost reconnect
When vhost-user is reconnecting to the backend, and vhost-user fails at get_features in vhost_dev_init(), the reconnect will fail and will never be retriggered. The reason is: When the vhost-user fails at get_features, the vhost_dev_cleanup will be called immediately. vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'. The reconnect path is: vhost_user_blk_event vhost_user_async_close(.. vhost_user_blk_disconnect ..) qemu_chr_fe_set_handlers <- clear the notifier callback schedule vhost_user_async_close_bh The vhost->vdev is null, so the vhost_user_blk_disconnect will not be called, then the event fd callback will not be reinstalled. All vhost-user devices have this issue, including vhost-user-blk/scsi. With this patch, if vhost->vdev is NULL, the fd callback will still be reinstalled. Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling") Signed-off-by: Li Feng --- hw/block/vhost-user-blk.c | 2 +- hw/scsi/vhost-user-scsi.c | 3 ++- hw/virtio/vhost-user-gpio.c| 2 +- hw/virtio/vhost-user.c | 9 +++-- include/hw/virtio/vhost-user.h | 3 ++- 5 files changed, 13 insertions(+), 6 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index 3c69fa47d5..95c758200d 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -391,7 +391,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event) case CHR_EVENT_CLOSED: /* defer close until later to avoid circular close */ vhost_user_async_close(dev, &s->chardev, &s->dev, - vhost_user_blk_disconnect); + vhost_user_blk_disconnect, vhost_user_blk_event); break; case CHR_EVENT_BREAK: case CHR_EVENT_MUX_IN: diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c index 53a62c3170..0effbb4787 100644 --- a/hw/scsi/vhost-user-scsi.c +++ b/hw/scsi/vhost-user-scsi.c @@ -238,7 +238,8 @@ static void vhost_user_scsi_event(void *opaque, QEMUChrEvent event) case CHR_EVENT_CLOSED: /* defer close until later to avoid 
circular close */ vhost_user_async_close(dev, &vs->conf.chardev, &vsc->dev, - vhost_user_scsi_disconnect); + vhost_user_scsi_disconnect, + vhost_user_scsi_event); break; case CHR_EVENT_BREAK: case CHR_EVENT_MUX_IN: diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c index d9979aa5db..04c2cc79f4 100644 --- a/hw/virtio/vhost-user-gpio.c +++ b/hw/virtio/vhost-user-gpio.c @@ -283,7 +283,7 @@ static void vu_gpio_event(void *opaque, QEMUChrEvent event) case CHR_EVENT_CLOSED: /* defer close until later to avoid circular close */ vhost_user_async_close(dev, &gpio->chardev, &gpio->vhost_dev, - vu_gpio_disconnect); + vu_gpio_disconnect, vu_gpio_event); break; case CHR_EVENT_BREAK: case CHR_EVENT_MUX_IN: diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index 8dcf049d42..12c4a41f30 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -2643,6 +2643,7 @@ typedef struct { DeviceState *dev; CharBackend *cd; struct vhost_dev *vhost; +IOEventHandler *event_cb; } VhostAsyncCallback; static void vhost_user_async_close_bh(void *opaque) @@ -2657,7 +2658,10 @@ static void vhost_user_async_close_bh(void *opaque) */ if (vhost->vdev) { data->cb(data->dev); -} +} else if (data->event_cb) { +qemu_chr_fe_set_handlers(data->cd, NULL, NULL, data->event_cb, + NULL, data->dev, NULL, true); + } g_free(data); } @@ -2669,7 +2673,8 @@ static void vhost_user_async_close_bh(void *opaque) */ void vhost_user_async_close(DeviceState *d, CharBackend *chardev, struct vhost_dev *vhost, -vu_async_close_fn cb) +vu_async_close_fn cb, +IOEventHandler *event_cb) { if (!runstate_check(RUN_STATE_SHUTDOWN)) { /* diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h index 191216a74f..649e9dd54f 100644 --- a/include/hw/virtio/vhost-user.h +++ b/include/hw/virtio/vhost-user.h @@ -84,6 +84,7 @@ typedef void (*vu_async_close_fn)(DeviceState *cb); void vhost_user_async_close(DeviceState *d, CharBackend *chardev, struct vhost_dev *vhost, -vu_async_close_fn 
cb); +vu_async_close_fn cb, +IOEventHandler *event_cb); #endif -- 2.41.0
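The essence of the fix in `vhost_user_async_close_bh()` above can be modeled in a few lines (hypothetical names; this is a sketch, not the C code): when the device never finished initialization, the disconnect callback is skipped, so the chardev event handler must be reinstalled explicitly or later reconnect events are lost forever.

```python
def async_close_bh(vhost_vdev, disconnect_cb, event_cb, reinstall_handler):
    """Toy model of the fixed bottom half."""
    if vhost_vdev is not None:
        disconnect_cb()              # normal path: the device was set up
    elif event_cb is not None:
        # Without this branch the notifier stays cleared and the
        # backend can never trigger another CHR_EVENT_OPENED:
        reinstall_handler(event_cb)  # qemu_chr_fe_set_handlers(..., event_cb)

installed = []
# vhost->vdev is NULL because get_features failed in vhost_dev_init():
async_close_bh(None, lambda: None, "vhost_user_blk_event", installed.append)
print(installed)  # ['vhost_user_blk_event'] -- reconnect stays armed
```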
[PATCH v5 0/5] Implement reconnect for vhost-user-scsi
Ping ... Could anyone review this series and merge them? Changes for v5: - No logic has been changed, just move part of the code from patch 4 to patch 5. Changes for v4: - Merge https://lore.kernel.org/all/20230830045722.611224-1-fen...@smartx.com/ to this series. - Add ERRP_GUARD in vhost_user_scsi_realize; - Reword the commit messages. Changes for v3: - Split the vhost_user_scsi_handle_output to a separate patch; - Move the started_vu from vhost scsi common header to vhost-user-scsi header; - Fix a log print error; Changes for v2: - Split the v1 patch to small separate patchset; - New patch for fixing fd leak, which has sent to reviewers in another mail; - Implement the `vhost_user_scsi_handle_output`; - Add the started_vu safe check; - Fix error handler; - Check the inflight before set/get inflight fd. Li Feng (5): vhost-user-common: send get_inflight_fd once vhost: move and rename the conn retry times vhost-user-scsi: support reconnect to backend vhost-user-scsi: start vhost when guest kicks vhost-user: fix lost reconnect hw/block/vhost-user-blk.c | 6 +- hw/scsi/vhost-scsi-common.c | 47 ++--- hw/scsi/vhost-scsi.c | 5 +- hw/scsi/vhost-user-scsi.c | 253 +++--- hw/virtio/vhost-user-gpio.c | 5 +- hw/virtio/vhost-user.c| 9 +- include/hw/virtio/vhost-scsi-common.h | 2 +- include/hw/virtio/vhost-user-scsi.h | 4 + include/hw/virtio/vhost-user.h| 3 +- include/hw/virtio/vhost.h | 2 + 10 files changed, 276 insertions(+), 60 deletions(-) -- 2.41.0
[PATCH v5 2/5] vhost: move and rename the conn retry times
Multiple devices need this macro, move it to a common header. Signed-off-by: Li Feng Reviewed-by: Raphael Norwitz --- hw/block/vhost-user-blk.c | 4 +--- hw/virtio/vhost-user-gpio.c | 3 +-- include/hw/virtio/vhost.h | 2 ++ 3 files changed, 4 insertions(+), 5 deletions(-) diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c index eecf3f7a81..3c69fa47d5 100644 --- a/hw/block/vhost-user-blk.c +++ b/hw/block/vhost-user-blk.c @@ -32,8 +32,6 @@ #include "sysemu/sysemu.h" #include "sysemu/runstate.h" -#define REALIZE_CONNECTION_RETRIES 3 - static const int user_feature_bits[] = { VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_SEG_MAX, @@ -482,7 +480,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp) s->inflight = g_new0(struct vhost_inflight, 1); s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues); -retries = REALIZE_CONNECTION_RETRIES; +retries = VU_REALIZE_CONN_RETRIES; assert(!*errp); do { if (*errp) { diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c index 3b013f2d0f..d9979aa5db 100644 --- a/hw/virtio/vhost-user-gpio.c +++ b/hw/virtio/vhost-user-gpio.c @@ -15,7 +15,6 @@ #include "standard-headers/linux/virtio_ids.h" #include "trace.h" -#define REALIZE_CONNECTION_RETRIES 3 #define VHOST_NVQS 2 /* Features required from VirtIO */ @@ -359,7 +358,7 @@ static void vu_gpio_device_realize(DeviceState *dev, Error **errp) qemu_chr_fe_set_handlers(&gpio->chardev, NULL, NULL, vu_gpio_event, NULL, dev, NULL, true); -retries = REALIZE_CONNECTION_RETRIES; +retries = VU_REALIZE_CONN_RETRIES; g_assert(!*errp); do { if (*errp) { diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h index 6a173cb9fa..ca3131b1af 100644 --- a/include/hw/virtio/vhost.h +++ b/include/hw/virtio/vhost.h @@ -8,6 +8,8 @@ #define VHOST_F_DEVICE_IOTLB 63 #define VHOST_USER_F_PROTOCOL_FEATURES 30 +#define VU_REALIZE_CONN_RETRIES 3 + /* Generic structures common for any vhost based device. */ struct vhost_inflight { -- 2.41.0
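The realize loop that both vhost-user-blk and vhost-user-gpio build around this macro follows the same shape: retry the backend connection a fixed number of times before failing realize. A minimal sketch (illustrative names, not QEMU's API):

```python
VU_REALIZE_CONN_RETRIES = 3  # shared constant, as in the patch above

def realize_with_retries(connect):
    """Retry connect() up to VU_REALIZE_CONN_RETRIES times; connect()
    returns None on success or an error object on failure."""
    for attempt in range(1, VU_REALIZE_CONN_RETRIES + 1):
        err = connect()
        if err is None:
            return attempt  # connected; realize can proceed
    raise RuntimeError("failed to connect to vhost-user backend")

flaky = iter([OSError("refused"), OSError("refused"), None])
print(realize_with_retries(lambda: next(flaky)))  # 3: succeeded on last retry
```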
[PATCH v5 1/5] vhost-user-common: send get_inflight_fd once
Currently the get_inflight_fd will be sent every time the device is
started, and the backend will allocate shared memory to save the
inflight state. If the backend finds that it receives the second
get_inflight_fd, it will release the previous shared memory, which
breaks the inflight working logic.

This patch is a preparation for the following patches.

Signed-off-by: Li Feng
---
 hw/scsi/vhost-scsi-common.c | 37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/hw/scsi/vhost-scsi-common.c b/hw/scsi/vhost-scsi-common.c
index a06f01af26..a61cd0e907 100644
--- a/hw/scsi/vhost-scsi-common.c
+++ b/hw/scsi/vhost-scsi-common.c
@@ -52,20 +52,28 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 
     vsc->dev.acked_features = vdev->guest_features;
 
-    assert(vsc->inflight == NULL);
-    vsc->inflight = g_new0(struct vhost_inflight, 1);
-    ret = vhost_dev_get_inflight(&vsc->dev,
-                                 vs->conf.virtqueue_size,
-                                 vsc->inflight);
+    ret = vhost_dev_prepare_inflight(&vsc->dev, vdev);
     if (ret < 0) {
-        error_report("Error get inflight: %d", -ret);
+        error_report("Error setting inflight format: %d", -ret);
         goto err_guest_notifiers;
     }
 
-    ret = vhost_dev_set_inflight(&vsc->dev, vsc->inflight);
-    if (ret < 0) {
-        error_report("Error set inflight: %d", -ret);
-        goto err_guest_notifiers;
+    if (vsc->inflight) {
+        if (!vsc->inflight->addr) {
+            ret = vhost_dev_get_inflight(&vsc->dev,
+                                         vs->conf.virtqueue_size,
+                                         vsc->inflight);
+            if (ret < 0) {
+                error_report("Error getting inflight: %d", -ret);
+                goto err_guest_notifiers;
+            }
+        }
+
+        ret = vhost_dev_set_inflight(&vsc->dev, vsc->inflight);
+        if (ret < 0) {
+            error_report("Error setting inflight: %d", -ret);
+            goto err_guest_notifiers;
+        }
     }
 
     ret = vhost_dev_start(&vsc->dev, vdev, true);
@@ -85,9 +93,6 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
     return ret;
 
 err_guest_notifiers:
-    g_free(vsc->inflight);
-    vsc->inflight = NULL;
-
     k->set_guest_notifiers(qbus->parent, vsc->dev.nvqs, false);
 err_host_notifiers:
     vhost_dev_disable_notifiers(&vsc->dev, vdev);
@@ -111,12 +116,6 @@ void vhost_scsi_common_stop(VHostSCSICommon *vsc)
     }
     assert(ret >= 0);
 
-    if (vsc->inflight) {
-        vhost_dev_free_inflight(vsc->inflight);
-        g_free(vsc->inflight);
-        vsc->inflight = NULL;
-    }
-
     vhost_dev_disable_notifiers(&vsc->dev, vdev);
 }
-- 
2.41.0
[PATCH v5 3/5] vhost-user-scsi: support reconnect to backend
If the backend crashes and restarts, the device is broken.
This patch adds reconnect for vhost-user-scsi.

This patch also improves the error messages, and reports some silent
errors.

Tested with spdk backend.

Signed-off-by: Li Feng
---
 hw/scsi/vhost-scsi-common.c           |  16 +-
 hw/scsi/vhost-scsi.c                  |   5 +-
 hw/scsi/vhost-user-scsi.c             | 204 +++---
 include/hw/virtio/vhost-scsi-common.h |   2 +-
 include/hw/virtio/vhost-user-scsi.h   |   4 +
 5 files changed, 201 insertions(+), 30 deletions(-)

diff --git a/hw/scsi/vhost-scsi-common.c b/hw/scsi/vhost-scsi-common.c
index a61cd0e907..4c8637045d 100644
--- a/hw/scsi/vhost-scsi-common.c
+++ b/hw/scsi/vhost-scsi-common.c
@@ -16,6 +16,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "hw/virtio/vhost.h"
@@ -25,7 +26,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "hw/fw-path-provider.h"
 
-int vhost_scsi_common_start(VHostSCSICommon *vsc)
+int vhost_scsi_common_start(VHostSCSICommon *vsc, Error **errp)
 {
     int ret, i;
     VirtIODevice *vdev = VIRTIO_DEVICE(vsc);
@@ -35,18 +36,19 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
     VirtIOSCSICommon *vs = (VirtIOSCSICommon *)vsc;
 
     if (!k->set_guest_notifiers) {
-        error_report("binding does not support guest notifiers");
+        error_setg(errp, "binding does not support guest notifiers");
         return -ENOSYS;
     }
 
     ret = vhost_dev_enable_notifiers(&vsc->dev, vdev);
     if (ret < 0) {
+        error_setg_errno(errp, -ret, "Error enabling host notifiers");
         return ret;
     }
 
     ret = k->set_guest_notifiers(qbus->parent, vsc->dev.nvqs, true);
     if (ret < 0) {
-        error_report("Error binding guest notifier");
+        error_setg_errno(errp, -ret, "Error binding guest notifier");
         goto err_host_notifiers;
     }
 
@@ -54,7 +56,7 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 
     ret = vhost_dev_prepare_inflight(&vsc->dev, vdev);
     if (ret < 0) {
-        error_report("Error setting inflight format: %d", -ret);
+        error_setg_errno(errp, -ret, "Error setting inflight format");
         goto err_guest_notifiers;
     }
@@ -64,21 +66,21 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
                                          vs->conf.virtqueue_size,
                                          vsc->inflight);
             if (ret < 0) {
-                error_report("Error getting inflight: %d", -ret);
+                error_setg_errno(errp, -ret, "Error getting inflight");
                 goto err_guest_notifiers;
             }
         }
 
         ret = vhost_dev_set_inflight(&vsc->dev, vsc->inflight);
         if (ret < 0) {
-            error_report("Error setting inflight: %d", -ret);
+            error_setg_errno(errp, -ret, "Error setting inflight");
             goto err_guest_notifiers;
         }
     }
 
     ret = vhost_dev_start(&vsc->dev, vdev, true);
     if (ret < 0) {
-        error_report("Error start vhost dev");
+        error_setg_errno(errp, -ret, "Error starting vhost dev");
         goto err_guest_notifiers;
     }
 
diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 443f67daa4..01a3ab4277 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -75,6 +75,7 @@ static int vhost_scsi_start(VHostSCSI *s)
     int ret, abi_version;
     VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
     const VhostOps *vhost_ops = vsc->dev.vhost_ops;
+    Error *local_err = NULL;
 
     ret = vhost_ops->vhost_scsi_get_abi_version(&vsc->dev, &abi_version);
     if (ret < 0) {
@@ -88,14 +89,14 @@ static int vhost_scsi_start(VHostSCSI *s)
         return -ENOSYS;
     }
 
-    ret = vhost_scsi_common_start(vsc);
+    ret = vhost_scsi_common_start(vsc, &local_err);
     if (ret < 0) {
         return ret;
     }
 
     ret = vhost_scsi_set_endpoint(s);
     if (ret < 0) {
-        error_report("Error setting vhost-scsi endpoint");
+        error_reportf_err(local_err, "Error setting vhost-scsi endpoint");
         vhost_scsi_common_stop(vsc);
     }
 
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index ee99b19e7a..dc109154ad 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -43,26 +43,56 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_RESET_DEVICE = 13,
 };
 
+static int vhost_user_scsi_start(VHostUserSCSI *s, Error **errp)
+{
+    VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+    int ret;
+
+    ret = vhost_scsi_common_start(vsc, errp);
+    s->started_vu = (ret < 0 ? false : true);
+
+    return ret;
+}
+
+static void vhost_user_scsi_stop(VHostUserSCSI *s)
+{
+    VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+
+    if (!s->started_vu) {
+        return;
+    }
+    s->started_vu = false;
+
+    vhost_scsi_common_stop(vsc);
+}
+
 static void vhost_user_scsi_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUser
[PATCH v5 4/5] vhost-user-scsi: start vhost when guest kicks
Let's keep the same behavior as vhost-user-blk.

Some old guests kick virtqueue before setting VIRTIO_CONFIG_S_DRIVER_OK.

Signed-off-by: Li Feng
---
 hw/scsi/vhost-user-scsi.c | 48 +++
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index dc109154ad..53a62c3170 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -115,8 +115,48 @@ static void vhost_user_scsi_reset(VirtIODevice *vdev)
     }
 }
 
-static void vhost_dummy_handle_output(VirtIODevice *vdev, VirtQueue *vq)
+static void vhost_user_scsi_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
+    VHostUserSCSI *s = (VHostUserSCSI *)vdev;
+    DeviceState *dev = &s->parent_obj.parent_obj.parent_obj.parent_obj;
+    VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
+
+    Error *local_err = NULL;
+    int i, ret;
+
+    if (!vdev->start_on_kick) {
+        return;
+    }
+
+    if (!s->connected) {
+        return;
+    }
+
+    if (vhost_dev_is_started(&vsc->dev)) {
+        return;
+    }
+
+    /*
+     * Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
+     * vhost here instead of waiting for .set_status().
+     */
+    ret = vhost_user_scsi_start(s, &local_err);
+    if (ret < 0) {
+        error_reportf_err(local_err, "vhost-user-scsi: vhost start failed: ");
+        qemu_chr_fe_disconnect(&vs->conf.chardev);
+        return;
+    }
+
+    /* Kick right away to begin processing requests already in vring */
+    for (i = 0; i < vsc->dev.nvqs; i++) {
+        VirtQueue *kick_vq = virtio_get_queue(vdev, i);
+
+        if (!virtio_queue_get_desc_addr(vdev, i)) {
+            continue;
+        }
+        event_notifier_set(virtio_queue_get_host_notifier(kick_vq));
+    }
 }
 
 static int vhost_user_scsi_connect(DeviceState *dev, Error **errp)
@@ -246,9 +286,9 @@ static void vhost_user_scsi_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_scsi_common_realize(dev, vhost_dummy_handle_output,
-                               vhost_dummy_handle_output,
-                               vhost_dummy_handle_output, &err);
+    virtio_scsi_common_realize(dev, vhost_user_scsi_handle_output,
+                               vhost_user_scsi_handle_output,
+                               vhost_user_scsi_handle_output, &err);
     if (err != NULL) {
         error_propagate(errp, err);
         return;
-- 
2.41.0
Re: [PATCH v2 6/8] iotests/{024, 271}: add testcases for qemu-img rebase
On 15.09.23 18:20, Andrey Drobyshev wrote:
> As the previous commit changes the logic of "qemu-img rebase" (it's
> using write alignment now), let's add a couple more test cases which
> would ensure it works correctly. In particular, the following scenarios:
>
> 024: add test case for rebase within one backing chain when the overlay
>      cluster size > backing's cluster size;
>
> 271: add test case for rebase images that contain subclusters. Check
>      that no extra allocations are being made.
>
> Signed-off-by: Andrey Drobyshev
> ---
>  tests/qemu-iotests/024     | 60 ++
>  tests/qemu-iotests/024.out | 43 +
>  tests/qemu-iotests/271     | 66 ++
>  tests/qemu-iotests/271.out | 42
>  4 files changed, 211 insertions(+)

Interestingly, the new case in 024 hangs before this series. So it isn't
just an optimization, but a fix, actually!

Reviewed-by: Hanna Czenczek
Re: Deadlock with SATA CD I/O and eject
On 19.09.23 12:54, Kevin Wolf wrote:
> Am 18.09.2023 um 19:28 hat John Levon geschrieben:
>> Observed with base of qemu 6.2.0, but from code inspection it looks to
>> me like it's still current in upstream master. Apologies if I have
>> missed a fix in this area.
>>
>> Symptom: run a UEFI-booted SATA CD Windows installer. When it hits
>> "Loading files.." screen, run an eject e.g.
>>
>> virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae
>>   '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'
>>
>> qemu will get stuck like so:
>>
>> (gdb) bt
>> #0  0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6
>> #1  0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
>> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
>> #3  0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
>> #4  0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
>> #5  0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
>> #6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
>> #7  0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
>> #8  0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
>> #9  0x56181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
>> #10 0x56181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45
>>
>> We are stuck forever here:
>>
>> 351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
>> 352                                   bool poll)
>> ...
>> 380     if (poll) {
>> 381         BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
>> 382     }
>>
>> Because the blk root's ->in_flight is 1, as tested by the condition
>> blk_root_drained_poll().
>>
>> Our blk->in_flight user is stuck here:
>>
>> 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
>> ...
>> 1310     blk_dec_in_flight(blk);
>> 1311     qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
>> 1312     blk_inc_in_flight(blk);
>>
>> Note that before entering this stanza, blk->in_flight was 2. This turns
>> out to be due to the ide atapi code. In particular, the guest VM is
>> generating lots of read I/O. The "first IO" arrives into blk via:
>>
>> cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()
>>
>> This initial IO completes:
>>
>> blk_aio_read_entry()->blk_aio_complete()
>>
>> 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
>> 1561 {
>> 1562     if (acb->has_returned) {
>> 1563         acb->common.cb(acb->common.opaque, acb->rwco.ret);
>> 1564         blk_dec_in_flight(acb->rwco.blk);
>> 1565         qemu_aio_unref(acb);
>> 1566     }
>> 1567 }
>>
>> Line 1564 is what we need to move blk->in_flight down to zero, but that
>> is never reached! This is because of what happens at :1563
>>
>> acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()
>>
>> That is, the IO callback in the atapi code itself triggers more -
>> synchronous - IO.
>>
>> In the meantime, we start processing the blk_drain() code, so by the
>> time this blk_pread() actually gets handled, quiesce is set, and we get
>> stuck in the blk_wait_while_drained().
>>
>> I don't know the qemu block stack well enough to propose an actual fix.
>>
>> Experimentally, waiting for ->in_flight to drop to zero *before* we
>> quiesce in blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom,
>> but I'm pretty sure that's just a band-aid instead of fixing the
>> deadlock.
>>
>> Any suggestions/clues/thoughts?
>
> Related discussion:
> https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
>
> Actually, it seems we never fixed that problem either?

We didn't, from my POV mainly because everyone had different suggestions
and none of them looked nice enough that it was clear which one to
pursue.

I.e., there are also
https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00709.html and
https://lists.gnu.org/archive/html/qemu-block/2023-04/msg00118.html .

> Back then I suggested that blk_drain*() should disable request queuing
> because its callers don't want to quiesce the BlockBackend, but rather
> get their own requests completed. I think this approach would solve
> this one as well.
>
> Your experiment with moving the queuing later is basically what Paolo
> suggested, though I'd still argue it's not the right thing to do
> because while it seems to mostly work with both use cases of drain
> (give me exclusive access vs. wait for my requests to complete), it's
> not optimal for either one.
>
> Kevin
Re: Deadlock with SATA CD I/O and eject
Am 18.09.2023 um 19:28 hat John Levon geschrieben:
>
> Observed with base of qemu 6.2.0, but from code inspection it looks to
> me like it's still current in upstream master. Apologies if I have
> missed a fix in this area.
>
> Symptom: run a UEFI-booted SATA CD Windows installer. When it hits
> "Loading files.." screen, run an eject e.g.
>
> virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae
>   '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'
>
> qemu will get stuck like so:
>
> (gdb) bt
> #0  0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6
> #1  0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
> #3  0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
> #4  0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
> #5  0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
> #6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
> #7  0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
> #8  0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
> #9  0x56181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
> #10 0x56181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45
>
> We are stuck forever here:
>
> 351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
> 352                                   bool poll)
> ...
> 380     if (poll) {
> 381         BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
> 382     }
>
> Because the blk root's ->in_flight is 1, as tested by the condition
> blk_root_drained_poll().
>
> Our blk->in_flight user is stuck here:
>
> 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
> ...
> 1310     blk_dec_in_flight(blk);
> 1311     qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
> 1312     blk_inc_in_flight(blk);
>
> Note that before entering this stanza, blk->in_flight was 2. This turns
> out to be due to the ide atapi code. In particular, the guest VM is
> generating lots of read I/O. The "first IO" arrives into blk via:
>
> cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()
>
> This initial IO completes:
>
> blk_aio_read_entry()->blk_aio_complete()
>
> 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
> 1561 {
> 1562     if (acb->has_returned) {
> 1563         acb->common.cb(acb->common.opaque, acb->rwco.ret);
> 1564         blk_dec_in_flight(acb->rwco.blk);
> 1565         qemu_aio_unref(acb);
> 1566     }
> 1567 }
>
> Line 1564 is what we need to move blk->in_flight down to zero, but that
> is never reached! This is because of what happens at :1563
>
> acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()
>
> That is, the IO callback in the atapi code itself triggers more -
> synchronous - IO.
>
> In the meantime, we start processing the blk_drain() code, so by the
> time this blk_pread() actually gets handled, quiesce is set, and we get
> stuck in the blk_wait_while_drained().
>
> I don't know the qemu block stack well enough to propose an actual fix.
>
> Experimentally, waiting for ->in_flight to drop to zero *before* we
> quiesce in blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom,
> but I'm pretty sure that's just a band-aid instead of fixing the
> deadlock.
>
> Any suggestions/clues/thoughts?

Related discussion:
https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html

Actually, it seems we never fixed that problem either?

Back then I suggested that blk_drain*() should disable request queuing
because its callers don't want to quiesce the BlockBackend, but rather
get their own requests completed. I think this approach would solve this
one as well.

Your experiment with moving the queuing later is basically what Paolo
suggested, though I'd still argue it's not the right thing to do because
while it seems to mostly work with both use cases of drain (give me
exclusive access vs. wait for my requests to complete), it's not optimal
for either one.

Kevin
Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations
On 15.09.23 18:20, Andrey Drobyshev wrote:
> When rebasing an image from one backing file to another, we need to
> compare data from old and new backings. If the diff between that data
> happens to be unaligned to the target cluster size, we might end up
> doing partial writes, which would lead to copy-on-write and additional
> IO.
>
> Consider the following simple case (virtual_size == cluster_size == 64K):
>
> base <-- inc1 <-- inc2
>
> qemu-io -c "write -P 0xaa 0 32K" base.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
>
> While doing rebase, we'll write a half of the cluster to inc2, and
> block layer will have to read the 2nd half of the same cluster from the
> base image inc1 while doing this write operation, although the whole
> cluster is already read earlier to perform data comparison.
>
> In order to avoid these unnecessary IO cycles, let's make sure every
> write request is aligned to the overlay subcluster boundaries. Using
> subcluster size is universal as for the images which don't have them
> this size equals to the cluster size, so in any case we end up aligning
> to the smallest unit of allocation.
>
> Signed-off-by: Andrey Drobyshev
> ---
>  qemu-img.c | 76 --
>  1 file changed, 56 insertions(+), 20 deletions(-)

Looks good, I like the changes from v1! Two minor things:

> diff --git a/qemu-img.c b/qemu-img.c
> index fcd31d7b5b..83950af42b 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c

[...]

> @@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv)
>              }
>          }
> 
> +        /*
> +         * At this point we know that the region [offset; offset + n)
> +         * is unallocated within the target image. This region might be
> +         * unaligned to the target image's (sub)cluster boundaries, as
> +         * old backing may have smaller clusters (or have subclusters).
> +         * We extend it to the aligned boundaries to avoid CoW on
> +         * partial writes in blk_pwrite(),
> +         */
> +        n += offset - QEMU_ALIGN_DOWN(offset, write_align);
> +        offset = QEMU_ALIGN_DOWN(offset, write_align);
> +        n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
> +        n = MIN(n, size - offset);
> +        assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) &&
> +               n_alloc == n);
> +
> +        /*
> +         * Much like the with the target image, we'll try to read as much

s/the with the/with the/

> +         * of the old and new backings as we can.
> +         */
> +        n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
> +        if (blk_new_backing) {
> +            n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
> +        }

If we don't have a check for blk_old_backing (old_backing_size is 0 if
blk_old_backing is NULL), why do we have a check for blk_new_backing
(new_backing_size is 0 if blk_new_backing is NULL)?

(Perhaps because the previous check was `offset >= new_backing_size ||
!blk_new_backing`, i.e. included exactly such a check – but I don't
think it's necessary, new_backing_size will be 0 if blk_new_backing is
NULL.)

>          /*
>           * Read old and new backing file and take into consideration that
>           * backing files may be smaller than the COW image.
>           */
> -        if (offset >= old_backing_size) {
> -            memset(buf_old, 0, n);
> -            buf_old_is_zero = true;
> +        memset(buf_old + n_old, 0, n - n_old);
> +        if (!n_old) {
> +            old_backing_eof = true;
>          } else {
> -            if (offset + n > old_backing_size) {
> -                n = old_backing_size - offset;
> -            }
> -
> -            ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
> +            ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
>              if (ret < 0) {
>                  error_report("error while reading from old backing file");
>                  goto out;
>              }
>          }
> 
> -        if (offset >= new_backing_size || !blk_new_backing) {
> -            memset(buf_new, 0, n);
> -        } else {
> -            if (offset + n > new_backing_size) {
> -                n = new_backing_size - offset;
> -            }
> -
> -            ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
> +        memset(buf_new + n_new, 0, n - n_new);
> +        if (blk_new_backing && n_new) {

Same as above, I think we can drop the blk_new_backing check, just so
that both blocks (for old and new) look the same.

(Also, the memset() already has to trust that n_new is 0 if
blk_new_backing is NULL, so the check doesn't make much sense from that
perspective either, and makes the memset() look wrong.)
Re: [PULL 00/28] Block layer patches
Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> Hi Kevin,
> I believe that my own commit "block-coroutine-wrapper: use
> qemu_get_current_aio_context()" breaks this test. The failure is
> non-deterministic (happens about 1 out of 4 runs).
>
> It seems the job hangs and the test times out in vm.run_job('job1',
> wait=5.0).
>
> I haven't debugged it yet but wanted to share this information to save
> some time. Tomorrow I'll investigate further.

Yes, it's relatively easily reproducible if I run the test in a loop,
and I can't seem to reproduce it without the last patch. Should I
unstage the full series again, or do you think that the last patch is
really optional this time?

However, I'm unsure how the stack traces I'm seeing are related to your
patch. Maybe it just made an existing bug more likely to be triggered?

What I'm seeing is that the reader lock is held by an iothread that is
waiting for its AioContext lock to make progress:

Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
#0  0x7f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
#1  0x7f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at /lib64/libc.so.6
#2  0x55b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
#3  0x55b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, file=0x55b7b199e1f7 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
#4  0x55b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at ../util/async.c:728
#5  0x55b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at ../util/async.c:565
#6  0x55b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at ../util/async.c:169
#7  0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at ../util/async.c:216
#8  0x55b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at ../util/aio-posix.c:722
#9  0x55b7b1518604 in iothread_run (opaque=0x55b7b2904460) at ../iothread.c:63
#10 0x55b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at ../util/qemu-thread-posix.c:541
#11 0x7f81250ae15d in start_thread () at /lib64/libc.so.6
#12 0x7f812512fc00 in clone3 () at /lib64/libc.so.6

On the other hand, the main thread wants to acquire the writer lock,
but it holds the AioContext lock of the iothread (it takes it in
job_prepare_locked()):

Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
#0  0x7f8125122356 in ppoll () at /lib64/libc.so.6
#1  0x55b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, timeout=-1) at ../util/qemu-timer.c:339
#2  0x55b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
#3  0x55b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at ../util/aio-posix.c:670
#4  0x55b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
#5  0x55b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at ../block/block-backend.c:916
#6  0x55b7b1554779 in blk_delete (blk=0x55b7b4425800) at ../block/block-backend.c:497
#7  0x55b7b1554133 in blk_unref (blk=0x55b7b4425800) at ../block/block-backend.c:557
#8  0x55b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at ../block/mirror.c:696
#9  0x55b7b1577015 in mirror_prepare (job=0x55b7b4419000) at ../block/mirror.c:807
#10 0x55b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at ../job.c:988
#11 0x55b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, fn=0x55b7b153a110 <job_prepare_locked>) at ../job.c:191
#12 0x55b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at ../job.c:1011
#13 0x55b7b153a886 in job_completed_txn_success_locked (job=0x55b7b4419000) at ../job.c:1068
#14 0x55b7b1539372 in job_completed_locked (job=0x55b7b4419000) at ../job.c:1082
#15 0x55b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
#16 0x55b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at ../util/async.c:169
#17 0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at ../util/async.c:216
#18 0x55b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at ../util/aio-posix.c:423
#19 0x55b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, callback=0x0, user_data=0x0) at ../util/async.c:358
#20 0x7f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:3454
#21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
#22 0x55b7b1729c98 in glib_pollfds_poll () at ../util/main-loop.c:290
#23 0x55b7b1729572 in os_host_main_loop_wait (timeout=27462700) at ../util/main-loop.c:313
#24 0x55b7b1729452 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
#25 0x55b7b119a1eb in qemu_main_loop () at ../softmmu/runstate.c:772
#26 0x55b7b14c102d in qemu_default_main () at ../softmmu/main.c:37
#27 0x55b7b14c1068 in main (argc=44, argv=0x7ffc90b05d58) at ../softmmu/main.c:48

At first I thought we just need to look into the
Re: [PATCH 1/2] blockdev: qmp_transaction: harden transaction properties for bitmaps
Hi! Thanks for the review

On 9/12/23 21:29, Vladimir Sementsov-Ogievskiy wrote:
> On 04.09.23 11:31, Andrey Zhadchenko wrote:
>> Unlike other transaction commands, bitmap operations do not drain
>> target bds. If we have an IOThread, this may result in some
>> inconsistencies, as bitmap content may change during transaction
>> command.
>> Add bdrv_drained_begin()/end() to bitmap operations.
>>
>> Signed-off-by: Andrey Zhadchenko
>
> Hi!
>
> First, please always include cover letter when more than 1 patch.
>
> Next. Hmm. Good idea, but I'm afraid that's still not enough.
>
> Assume you have two BSs A and B in two different iothreads. So, the
> sequence may be like this:
>
> 1. drain_begin A
> 2. do operation with bitmap in A
> 3. guest writes to B, B is modified and bitmap in B is modified as well
> 4. drain_begin B
> 5. do operation with bitmap in B
> 6. drain_end B
> 7. drain_end A
>
> User may expect that all the operations are done atomically in relation
> to any guest IO operations. And if operations are dependent, the
> intermediate write [3] may break the result.

I see your point, but can the difference really be observed in this
case? It is probably only relevant if user can copy/merge bitmaps
between block nodes (as far as I know this is not the case for now)

> So, we should drain all participant drives during the whole
> transactions. The simplest solution is bdrv_drain_all_begin() /
> bdrv_drain_all_end() pair in qmp_transaction(), could we start with it?

That would definitely get rid of all problems, but I do not really like
the idea of draining unrelated nodes.

What do you think if I add a new function prototype to the
TransactionActionDrv, which will return transaction bds? Then we can
get a list of all bds and lock them before preparing and clean
afterwards.
Re: [PATCH 20/22] tests: extend test 131 to cover availability of the discard operation
On 9/18/23 20:00, Denis V. Lunev wrote:
> This patch contains test which minimally tests discard and new cluster
> allocation logic. The following checks are added:
> * write 2 clusters, discard the first allocated
> * write another cluster, check that the hole is filled
> * write 2 clusters, discard the first allocated, write 1 cluster at
>   non-aligned to cluster offset (2 new clusters should be allocated)
>
> Signed-off-by: Denis V. Lunev
> ---
>  tests/qemu-iotests/131     | 31 +++
>  tests/qemu-iotests/131.out | 38 ++
>  2 files changed, 69 insertions(+)
>
> diff --git a/tests/qemu-iotests/131 b/tests/qemu-iotests/131
> index 304bbb3f61..324008b3f6 100755
> --- a/tests/qemu-iotests/131
> +++ b/tests/qemu-iotests/131
> @@ -74,6 +74,37 @@ poke_file "$TEST_IMG" "$inuse_offset" "\x59\x6e\x6f\x74"
>  echo "== read corrupted image with repairing =="
>  { $QEMU_IO -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
>  
> +echo "== check discard =="
> +
> +# Clear image
> +_make_test_img $size
> +
> +{ $QEMU_IO -c "write -P 0x11 0 $CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
> +{ $QEMU_IO -c "discard 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
> +{ $QEMU_IO -c "read -P 0 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +
> +echo "== check simple allocation over the discarded hole =="
> +
> +{ $QEMU_IO -c "write -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
> +{ $QEMU_IO -c "read -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +
> +echo "== check more complex allocation over the discard hole =="
> +
> +# Clear image
> +_make_test_img $size
> +
> +{ $QEMU_IO -c "write -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IO -c "discard $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +# There is 1 cluster hole. Fill it fully and allocate 1 cluster at the end
> +{ $QEMU_IO -c "write -P 0x12 $CLUSTER_HALF_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
> +{ $QEMU_IO -c "read -P 0x12 $CLUSTER_HALF_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IO -c "read -P 0 0 $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +{ $QEMU_IO -c "read -P 0 $((CLUSTER_SIZE + CLUSTER_HALF_SIZE)) $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
> +
>  echo "== allocate with backing =="
>  # Verify that allocating clusters works fine even when there is a backing image.
>  # Regression test for a bug where we would pass a buffer read from the backing
> diff --git a/tests/qemu-iotests/131.out b/tests/qemu-iotests/131.out
> index d2904578df..27df91ca97 100644
> --- a/tests/qemu-iotests/131.out
> +++ b/tests/qemu-iotests/131.out
> @@ -26,6 +26,44 @@ read 524288/524288 bytes at offset 0
>  Repairing image was not closed correctly
>  read 1048576/1048576 bytes at offset 1048576
>  1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +== check discard ==
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> +wrote 2097152/2097152 bytes at offset 0
> +2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Offset          Length          File
> +0               0x200000        TEST_DIR/t.IMGFMT
> +discard 1048576/1048576 bytes at offset 0
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Offset          Length          File
> +0x100000        0x100000        TEST_DIR/t.IMGFMT
> +read 1048576/1048576 bytes at offset 0
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +== check simple allocation over the discarded hole ==
> +wrote 1048576/1048576 bytes at offset 2097152
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Offset          Length          File
> +0x100000        0x100000        TEST_DIR/t.IMGFMT
> +0x200000        0x100000        TEST_DIR/t.IMGFMT
> +read 1048576/1048576 bytes at offset 2097152
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +== check more complex allocation over the discard hole ==
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
> +wrote 2097152/2097152 bytes at offset 2097152
> +2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +discard 1048576/1048576 bytes at offset 2097152
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 1048576/1048576 bytes at offset 524288
> +1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Offset          Length          File
> +0               0x100000        TEST_DIR/t.IMGFMT
> +0x100000        0x100000        TEST_DIR/t.IMGFMT
> +0x300000        0x100000        TEST_DIR/t.IMGFMT
> +read
Re: [PATCH 4/7] block/dirty-bitmap: Clean up local variable shadowing
Am 19.09.2023 um 07:48 hat Markus Armbruster geschrieben: > Kevin Wolf writes: > > > Am 31.08.2023 um 15:25 hat Markus Armbruster geschrieben: > >> Local variables shadowing other local variables or parameters make the > >> code needlessly hard to understand. Tracked down with -Wshadow=local. > >> Clean up: delete inner declarations when they are actually redundant, > >> else rename variables. > >> > >> Signed-off-by: Markus Armbruster > >> --- > >> block/monitor/bitmap-qmp-cmds.c | 2 +- > >> block/qcow2-bitmap.c| 3 +-- > >> 2 files changed, 2 insertions(+), 3 deletions(-) > >> > >> diff --git a/block/monitor/bitmap-qmp-cmds.c > >> b/block/monitor/bitmap-qmp-cmds.c > >> index 55f778f5af..4d018423d8 100644 > >> --- a/block/monitor/bitmap-qmp-cmds.c > >> +++ b/block/monitor/bitmap-qmp-cmds.c > >> @@ -276,7 +276,7 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char > >> *node, const char *target, > >> > >> for (lst = bms; lst; lst = lst->next) { > >> switch (lst->value->type) { > >> -const char *name, *node; > >> +const char *name; > >> case QTYPE_QSTRING: > >> name = lst->value->u.local; > >> src = bdrv_find_dirty_bitmap(bs, name); > > > > The names in this function are all over the place... A more ambitious > > patch could rename the parameters to dst_node/dst_bitmap and these > > variables to src_node/src_bitmap to get some more consistency (both with > > each other and with the existing src/dst variables). > > What exactly would you like me to consider? Perhaps: > > * Rename parameter @node to @dst_node > > * Rename which parameter to @dst_bitmap? Parameter @target to @dst_bitmap (it's the name of a bitmap in @node/@dst_node) > * Rename nested local @node to @src_node > > * Rename which local variable to @src_bitmap? @name to @src_bitmap (it's the name of a bitmap in the local @node/@src_node) > * Move nested locals to function scope I don't really mind either way, but yes, maybe that would be more conventional. 
That you couldn't tell for two of the variables what they actually are probably supports the argument that they should be renamed. :-) Kevin
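[Editor's note] For readers unfamiliar with the warning under discussion, here is a minimal stand-alone sketch (illustrative code, not the QEMU function above) of the bug class that gcc's -Wshadow=local flags: an inner declaration silently hides an outer variable or parameter, so an assignment that looks like it updates the outer name actually doesn't.

```c
#include <assert.h>

/* BUG: the inner "node" shadows the parameter; the clamp has no effect.
 * gcc -Wshadow=local warns about exactly this declaration. */
static int advance_shadowed(int node)
{
    if (node < 10) {
        int node = 10;  /* shadows the parameter */
        (void)node;     /* intended to clamp it, but updates the inner copy */
    }
    return node;        /* still the original value */
}

/* FIX: no inner declaration, the assignment reaches the intended variable. */
static int advance_fixed(int node)
{
    if (node < 10) {
        node = 10;
    }
    return node;
}
```

With the shadow in place, `advance_shadowed(5)` still returns 5, while the fixed version returns 10 as intended, which is why renaming (or deleting) the inner declarations is worth the churn.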
Re: [PATCH 19/22] parallels: naive implementation of parallels_co_pdiscard
On 9/18/23 20:00, Denis V. Lunev wrote: * Discarding with backing stores is not supported by the format. * There is no buffering/queueing of the discard operation. * Only operations aligned to the cluster are supported. Signed-off-by: Denis V. Lunev --- block/parallels.c | 46 ++ 1 file changed, 46 insertions(+) diff --git a/block/parallels.c b/block/parallels.c index d9d36c514b..1ef23f6669 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -537,6 +537,51 @@ parallels_co_readv(BlockDriverState *bs, int64_t sector_num, int nb_sectors, return ret; } + +static int coroutine_fn +parallels_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes) +{ +int ret = 0; +uint32_t cluster, count; +BDRVParallelsState *s = bs->opaque; + +/* + * The image does not support ZERO mark inside the BAT, which means that + * stale data could be exposed from the backing file. + */ +if (bs->backing) { +return -ENOTSUP; +} + +if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) { +return -ENOTSUP; +} else if (!QEMU_IS_ALIGNED(bytes, s->cluster_size)) { +return -ENOTSUP; +} + +cluster = offset / s->cluster_size; +count = bytes / s->cluster_size; + +qemu_co_mutex_lock(&s->lock); +for (; count > 0; cluster++, count--) { +int64_t host_off = bat2sect(s, cluster) << BDRV_SECTOR_BITS; +if (host_off == 0) { +continue; +} + +ret = bdrv_co_pdiscard(bs->file, host_off, s->cluster_size); +if (ret < 0) { +goto done; +} + +parallels_set_bat_entry(s, cluster, 0); +bitmap_clear(s->used_bmap, host_cluster_index(s, host_off), 1); +} +done: +qemu_co_mutex_unlock(&s->lock); +return ret; +} + static void parallels_check_unclean(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix) @@ -1417,6 +1462,7 @@ static BlockDriver bdrv_parallels = { .bdrv_co_create = parallels_co_create, .bdrv_co_create_opts= parallels_co_create_opts, .bdrv_co_check = parallels_co_check, +.bdrv_co_pdiscard = parallels_co_pdiscard, }; static void bdrv_parallels_init(void) Reviewed-by: Alexander Ivanov -- Best regards, 
Alexander Ivanov
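[Editor's note] The request validation and cluster arithmetic in the patch above can be sketched in isolation (simplified, hypothetical helper names; the real driver additionally takes s->lock, walks the BAT, and issues bdrv_co_pdiscard() on the file child):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirrors the checks in parallels_co_pdiscard(): with a backing file the
 * format cannot mark a cluster as ZERO in the BAT, so discarding would
 * expose stale backing data; unaligned requests get -ENOTSUP as well.
 * cluster_size corresponds to s->tracks << BDRV_SECTOR_BITS.
 */
static bool discard_is_supported(int64_t offset, int64_t bytes,
                                 int64_t cluster_size, bool has_backing)
{
    if (has_backing) {
        return false;
    }
    return offset % cluster_size == 0 && bytes % cluster_size == 0;
}

/* Cluster arithmetic used by the discard loop. */
static uint32_t cluster_index(int64_t offset, int64_t cluster_size)
{
    return offset / cluster_size;
}
```

For a 1 MiB cluster size, a 1 MiB discard at offset 2 MiB maps to exactly cluster 2, matching the loop bounds in the patch.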
Re: [PATCH 04/22] parallels: invent parallels_opts_prealloc() helper to parse prealloc opts
On 9/18/23 20:00, Denis V. Lunev wrote: This patch creates the above-mentioned helper and moves its usage to the beginning of parallels_open(). This simplifies parallels_open() a bit. The patch also ensures that prealloc_size is always stored on the block driver state in sectors. This makes the code cleaner and avoids giving the false impression at the assignment that the value is in bytes. Signed-off-by: Denis V. Lunev --- block/parallels.c | 72 +-- 1 file changed, 44 insertions(+), 28 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index af7be427c9..ae006e7fc7 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -1025,6 +1025,44 @@ static int parallels_update_header(BlockDriverState *bs) return bdrv_pwrite_sync(bs->file, 0, size, s->header, 0); } + +static int parallels_opts_prealloc(BlockDriverState *bs, QDict *options, + Error **errp) +{ +int err; +char *buf; +int64_t bytes; +BDRVParallelsState *s = bs->opaque; +Error *local_err = NULL; +QemuOpts *opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp); +if (!opts) { +return -ENOMEM; +} + +err = -EINVAL; +if (!qemu_opts_absorb_qdict(opts, options, errp)) { +goto done; +} + +bytes = qemu_opt_get_size_del(opts, PARALLELS_OPT_PREALLOC_SIZE, 0); +s->prealloc_size = bytes >> BDRV_SECTOR_BITS; +buf = qemu_opt_get_del(opts, PARALLELS_OPT_PREALLOC_MODE); +/* prealloc_mode can be downgraded later during allocate_clusters */ +s->prealloc_mode = qapi_enum_parse(&prealloc_mode_lookup, buf, + PRL_PREALLOC_MODE_FALLOCATE, + &local_err); +g_free(buf); +if (local_err != NULL) { +error_propagate(errp, local_err); +goto done; +} +err = 0; + +done: +qemu_opts_del(opts); +return err; +} + static int parallels_open(BlockDriverState *bs, QDict *options, int flags, Error **errp) { @@ -1033,11 +1071,13 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, int ret, size, i; int64_t file_nb_sectors, sector; uint32_t data_start; -QemuOpts *opts = NULL; -Error *local_err = NULL; -char *buf; bool data_off_is_correct;
+ret = parallels_opts_prealloc(bs, options, errp); +if (ret < 0) { +return ret; +} + ret = bdrv_open_file_child(NULL, options, "file", bs, errp); if (ret < 0) { return ret; } @@ -1078,6 +1118,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, ret = -EFBIG; goto fail; } +s->prealloc_size = MAX(s->tracks, s->prealloc_size); s->cluster_size = s->tracks << BDRV_SECTOR_BITS; s->bat_size = le32_to_cpu(ph.bat_entries); @@ -1117,29 +1158,6 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, s->header_size = size; } -opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp); -if (!opts) { -goto fail_options; -} - -if (!qemu_opts_absorb_qdict(opts, options, errp)) { -goto fail_options; -} - -s->prealloc_size = -qemu_opt_get_size_del(opts, PARALLELS_OPT_PREALLOC_SIZE, 0); -s->prealloc_size = MAX(s->tracks, s->prealloc_size >> BDRV_SECTOR_BITS); -buf = qemu_opt_get_del(opts, PARALLELS_OPT_PREALLOC_MODE); -/* prealloc_mode can be downgraded later during allocate_clusters */ -s->prealloc_mode = qapi_enum_parse(&prealloc_mode_lookup, buf, - PRL_PREALLOC_MODE_FALLOCATE, - &local_err); -g_free(buf); -if (local_err != NULL) { -error_propagate(errp, local_err); -goto fail_options; -} - if (ph.ext_off) { if (flags & BDRV_O_RDWR) { /* @@ -1214,10 +1232,8 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, fail_format: error_setg(errp, "Image not in Parallels format"); -fail_options: ret = -EINVAL; fail: -qemu_opts_del(opts); /* * "s" object was allocated by g_malloc0 so we can safely * try to free its fields even they were not allocated. Reviewed-by: Alexander Ivanov -- Best regards, Alexander Ivanov
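[Editor's note] The unit handling the patch establishes can be sketched on its own (a simplification with illustrative names, not the QEMU code): the option is parsed in bytes, stored on the state in 512-byte sectors, and later clamped so preallocation covers at least one track.

```c
#include <assert.h>
#include <stdint.h>

#define BDRV_SECTOR_BITS 9
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Combines the two steps the patch separates: the byte->sector
 * conversion done in parallels_opts_prealloc() and the
 * MAX(s->tracks, s->prealloc_size) clamp done after the header
 * is read, when the track size is known.
 */
static int64_t prealloc_size_sectors(int64_t opt_bytes, int64_t tracks)
{
    int64_t sectors = opt_bytes >> BDRV_SECTOR_BITS; /* bytes -> sectors */
    return MAX(tracks, sectors);
}
```

Keeping the state in one unit from the start is what removes the easy-to-misread `s->prealloc_size >> BDRV_SECTOR_BITS` reassignment in the old code.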
Re: [PATCH 03/22] parallels: fix memory leak in parallels_open()
On 9/18/23 20:00, Denis V. Lunev wrote: We should free opts allocated through qemu_opts_create() at the end. Signed-off-by: Denis V. Lunev --- block/parallels.c | 1 + 1 file changed, 1 insertion(+) diff --git a/block/parallels.c b/block/parallels.c index 428f72de1c..af7be427c9 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -1217,6 +1217,7 @@ fail_format: fail_options: ret = -EINVAL; fail: +qemu_opts_del(opts); /* * "s" object was allocated by g_malloc0 so we can safely * try to free its fields even they were not allocated. Reviewed-by: Alexander Ivanov -- Best regards, Alexander Ivanov
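[Editor's note] The fix moves the driver to the single-exit cleanup idiom: release the resource on the shared `fail` label so every early exit after allocation frees it. A minimal stand-alone sketch (illustrative names, not QEMU APIs; `malloc` stands in for `qemu_opts_create()`):

```c
#include <stdlib.h>
#include <string.h>

static int open_with_cleanup(int trigger_error)
{
    int ret = 0;
    char *opts = malloc(16);   /* stands in for qemu_opts_create() */
    if (!opts) {
        return -1;             /* nothing to clean up yet */
    }
    if (trigger_error) {
        ret = -22;             /* -EINVAL-style failure path */
        goto fail;
    }
    strcpy(opts, "ok");        /* normal processing */
fail:
    free(opts);                /* released on success and failure alike */
    return ret;
}
```

Both the success path and the error path now pass through the single `free()`, which is exactly what adding `qemu_opts_del(opts)` under `fail:` achieves in the patch.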
Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
On 9/19/23 11:18, Hanna Czenczek wrote: > On 15.09.23 18:20, Andrey Drobyshev wrote: >> Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for >> the data read from the old and new backing files are aligned using >> BlockDriverState (or BlockBackend later on) referring to the target >> image. >> However, this isn't quite right, because buf_new is only being used for >> reading from the new backing, while buf_old is being used for both >> reading >> from the old backing and writing to the target. Let's take that into >> account >> and use more appropriate values as alignments. >> >> Signed-off-by: Andrey Drobyshev >> --- >> qemu-img.c | 9 +++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/qemu-img.c b/qemu-img.c >> index 50660ba920..d12e4a4753 100644 >> --- a/qemu-img.c >> +++ b/qemu-img.c >> @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv) >> int64_t n; >> float local_progress = 0; >> - buf_old = blk_blockalign(blk, IO_BUF_SIZE); >> - buf_new = blk_blockalign(blk, IO_BUF_SIZE); >> + if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) > >> + bdrv_opt_mem_align(blk_bs(blk_old_backing))) { >> + buf_old = blk_blockalign(blk, IO_BUF_SIZE); >> + } else { >> + buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE); >> + } > > As I read this, if blk_old_backing is NULL, we will go to the > blk_blockalign(blk_old_backing, IO_BUF_SIZE) path. I think if it is > NULL, we should align on blk instead. > > Hanna You're right, thanks for noticing. Will fix that. > >> + buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE); >> size = blk_getlength(blk); >> if (size < 0) { >
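[Editor's note] The selection logic being discussed reduces to: use the stricter (larger) of the two memory alignments, and fall back to the target image's alignment when there is no old backing file, which is the NULL case Hanna points out. A hedged sketch with illustrative names (the align values stand in for `bdrv_opt_mem_align()` results):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Pick the alignment for buf_old: it must satisfy both reads from the
 * old backing and writes to the target, so take the larger requirement;
 * with no old backing, only the target's requirement applies.
 */
static size_t pick_buf_alignment(size_t target_align, bool has_old_backing,
                                 size_t old_backing_align)
{
    if (!has_old_backing || target_align > old_backing_align) {
        return target_align;
    }
    return old_backing_align;
}
```

Writing the condition this way makes the NULL-backing case fall out naturally instead of dereferencing a missing BlockBackend.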
Re: [PATCH v2 4/8] qemu-img: add chunk size parameter to compare_buffers()
On 15.09.23 18:20, Andrey Drobyshev wrote: Add @chsize param to the function which, if non-zero, would represent the chunk size to be used for comparison. If it's zero, then BDRV_SECTOR_SIZE is used as default chunk size, which is the previous behaviour. In particular, we're going to use this param in img_rebase() to make the write requests aligned to a predefined alignment value. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) With the comment fix Eric has suggested: Reviewed-by: Hanna Czenczek
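[Editor's note] The @chsize semantics described in the commit message can be sketched as follows. This is a hypothetical simplification of qemu-img's compare_buffers(), not the actual patch: compare two buffers chunk by chunk, defaulting to BDRV_SECTOR_SIZE when chsize == 0, and report via *pnum the byte length of the leading run of chunks that share the first chunk's equal/unequal status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BDRV_SECTOR_SIZE 512

static bool chunked_compare(const uint8_t *a, const uint8_t *b,
                            int64_t bytes, int64_t chsize, int64_t *pnum)
{
    int64_t i = 0;
    bool equal;

    if (chsize == 0) {
        chsize = BDRV_SECTOR_SIZE;  /* previous behaviour as the default */
    }
    /* Status of the first chunk decides what run we are measuring. */
    equal = memcmp(a, b, chsize < bytes ? chsize : bytes) == 0;
    while (i < bytes) {
        int64_t len = chsize < bytes - i ? chsize : bytes - i;
        if ((memcmp(a + i, b + i, len) == 0) != equal) {
            break;  /* run of same-status chunks ends here */
        }
        i += len;
    }
    *pnum = i;
    return equal;
}
```

With a chunk size matching the target's write alignment, the run boundaries (and thus the write requests img_rebase() issues) land on aligned offsets, which is the point of the parameter.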
Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment
On 15.09.23 18:20, Andrey Drobyshev wrote: Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for the data read from the old and new backing files are aligned using BlockDriverState (or BlockBackend later on) referring to the target image. However, this isn't quite right, because buf_new is only being used for reading from the new backing, while buf_old is being used for both reading from the old backing and writing to the target. Let's take that into account and use more appropriate values as alignments. Signed-off-by: Andrey Drobyshev --- qemu-img.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index 50660ba920..d12e4a4753 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv) int64_t n; float local_progress = 0; -buf_old = blk_blockalign(blk, IO_BUF_SIZE); -buf_new = blk_blockalign(blk, IO_BUF_SIZE); +if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) > +bdrv_opt_mem_align(blk_bs(blk_old_backing))) { +buf_old = blk_blockalign(blk, IO_BUF_SIZE); +} else { +buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE); +} As I read this, if blk_old_backing is NULL, we will go to the blk_blockalign(blk_old_backing, IO_BUF_SIZE) path. I think if it is NULL, we should align on blk instead. Hanna +buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE); size = blk_getlength(blk); if (size < 0) {