Re: [PULL 00/28] Block layer patches

2023-09-19 Thread Stefan Hajnoczi
On Tue, 19 Sept 2023 at 06:26, Kevin Wolf  wrote:
> Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> If we could fully get rid of the AioContext lock (as we originally
> stated as a goal), that would automatically solve this kind of
> deadlocks.

Grepping for "ctx locked", "context acquired", etc. does not turn up many
comments describing variables that are protected by the AioContext lock.

However, there are at least hundreds of functions that assume they are
called with the AioContext lock held.

There are a few strategies:

Top-down
--------

Shorten AioContext lock critical sections to cover only APIs that need them.
Then push the lock down into the API and repeat at the next lower level until
aio_context_acquire() + AIO_WAIT_WHILE() + aio_context_release() can be
replaced with AIO_WAIT_WHILE_UNLOCKED().

Bottom-up
---------

Switch AIO_WAIT_WHILE() to aio_context_release() + AIO_WAIT_WHILE_UNLOCKED() +
aio_context_acquire(). Then move the lock up into callers and repeat at the
next higher level until aio_context_acquire() + aio_context_release() cancel
each other out.

Big bang
--------

Remove aio_context_acquire/release() and fix tests until they pass.

I think top-down is safer than bottom-up, because bottom-up is more
likely to cause issues with callers that do not tolerate temporarily
dropping the lock.

The big bang approach is only reasonable if the AioContext lock is no
longer used to protect variables (which we don't know for sure because
that requires auditing every line of code).

My concern with the top-down approach is that so much code needs to be
audited and the conversions are temporary steps (it's almost a waste
of time for maintainers to review them).

I'm tempted to go for the big bang approach but also don't want to
introduce a slew of new race conditions. :/

Stefan



Re: [PULL 00/28] Block layer patches

2023-09-19 Thread Stefan Hajnoczi
On Tue, 19 Sept 2023 at 06:26, Kevin Wolf  wrote:
>
> Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> > Hi Kevin,
> > I believe that my own commit "block-coroutine-wrapper: use
> > qemu_get_current_aio_context()" breaks this test. The failure is
> > non-deterministic (happens about 1 out of 4 runs).
> >
> > It seems the job hangs and the test times out in vm.run_job('job1', 
> > wait=5.0).
> >
> > I haven't debugged it yet but wanted to share this information to save
> > some time. Tomorrow I'll investigate further.
>
> Yes, it's relatively easily reproducible if I run the test in a loop,
> and I can't seem to reproduce it without the last patch. Should I
> unstage the full series again, or do you think that the last patch is
> really optional this time?

Please drop the last patch. I'm not aware of dependencies on the last patch.

> However, I'm unsure how the stack traces I'm seeing are related to your
> patch. Maybe it just made an existing bug more likely to be triggered?

I'll share my thoughts once I've looked at the crash today.

Regarding AioContext lock removal: I'll work on that and see what
still depends on the lock.

Stefan

> What I'm seeing is that the reader lock is held by an iothread that is
> waiting for its AioContext lock to make progress:
>
> Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
> #0  0x7f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
> #1  0x7f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at 
> /lib64/libc.so.6
> #2  0x55b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, 
> file=0x55b7b199e1f7 "../util/async.c", line=728) at 
> ../util/qemu-thread-posix.c:94
> #3  0x55b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, 
> file=0x55b7b199e1f7 "../util/async.c", line=728) at 
> ../util/qemu-thread-posix.c:149
> #4  0x55b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at 
> ../util/async.c:728
> #5  0x55b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at 
> ../util/async.c:565
> #6  0x55b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at 
> ../util/async.c:169
> #7  0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at 
> ../util/async.c:216
> #8  0x55b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at 
> ../util/aio-posix.c:722
> #9  0x55b7b1518604 in iothread_run (opaque=0x55b7b2904460) at 
> ../iothread.c:63
> #10 0x55b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at 
> ../util/qemu-thread-posix.c:541
> #11 0x7f81250ae15d in start_thread () at /lib64/libc.so.6
> #12 0x7f812512fc00 in clone3 () at /lib64/libc.so.6
>
> On the other hand, the main thread wants to acquire the writer lock,
> but it holds the AioContext lock of the iothread (it takes it in
> job_prepare_locked()):
>
> Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
> #0  0x7f8125122356 in ppoll () at /lib64/libc.so.6
> #1  0x55b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, 
> timeout=-1) at ../util/qemu-timer.c:339
> #2  0x55b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, 
> ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x55b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at 
> ../util/aio-posix.c:670
> #4  0x55b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at 
> ../block/graph-lock.c:145
> #5  0x55b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at 
> ../block/block-backend.c:916
> #6  0x55b7b1554779 in blk_delete (blk=0x55b7b4425800) at 
> ../block/block-backend.c:497
> #7  0x55b7b1554133 in blk_unref (blk=0x55b7b4425800) at 
> ../block/block-backend.c:557
> #8  0x55b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at 
> ../block/mirror.c:696
> #9  0x55b7b1577015 in mirror_prepare (job=0x55b7b4419000) at 
> ../block/mirror.c:807
> #10 0x55b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at 
> ../job.c:988
> #11 0x55b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, 
> fn=0x55b7b153a110 ) at ../job.c:191
> #12 0x55b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at 
> ../job.c:1011
> #13 0x55b7b153a886 in job_completed_txn_success_locked 
> (job=0x55b7b4419000) at ../job.c:1068
> #14 0x55b7b1539372 in job_completed_locked (job=0x55b7b4419000) at 
> ../job.c:1082
> #15 0x55b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
> #16 0x55b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at 
> ../util/async.c:169
> #17 0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at 
> ../util/async.c:216
> #18 0x55b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at 
> ../util/aio-posix.c:423
> #19 0x55b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, 
> callback=0x0, user_data=0x0) at ../util/async.c:358
> #20 0x7f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at 
> ../glib/gmain.c:3454
> #21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
> #22 0x55b7b1729c98 in glib_pollfds_poll () at ..

[PATCH v3 4/8] qemu-img: add chunk size parameter to compare_buffers()

2023-09-19 Thread Andrey Drobyshev via
Add a @chsize parameter to the function which, if non-zero, specifies the
chunk size to be used for comparison.  If it's zero, then BDRV_SECTOR_SIZE
is used as the default chunk size, which matches the previous behaviour.

In particular, we're going to use this parameter in img_rebase() to make the
write requests aligned to a predefined alignment value.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Eric Blake 
Reviewed-by: Hanna Czenczek 
---
 qemu-img.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 4dc91505bf..0f67b021f7 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1274,23 +1274,29 @@ static int is_allocated_sectors_min(const uint8_t *buf, 
int n, int *pnum,
 }
 
 /*
- * Compares two buffers sector by sector. Returns 0 if the first
- * sector of each buffer matches, non-zero otherwise.
+ * Compares two buffers chunk by chunk, where @chsize is the chunk size.
+ * If @chsize is 0, default chunk size of BDRV_SECTOR_SIZE is used.
+ * Returns 0 if the first chunk of each buffer matches, non-zero otherwise.
  *
- * pnum is set to the sector-aligned size of the buffer prefix that
- * has the same matching status as the first sector.
+ * @pnum is set to the size of the buffer prefix aligned to @chsize that
+ * has the same matching status as the first chunk.
  */
 static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
-   int64_t bytes, int64_t *pnum)
+   int64_t bytes, uint64_t chsize, int64_t *pnum)
 {
 bool res;
-int64_t i = MIN(bytes, BDRV_SECTOR_SIZE);
+int64_t i;
 
 assert(bytes > 0);
 
+if (!chsize) {
+chsize = BDRV_SECTOR_SIZE;
+}
+i = MIN(bytes, chsize);
+
 res = !!memcmp(buf1, buf2, i);
 while (i < bytes) {
-int64_t len = MIN(bytes - i, BDRV_SECTOR_SIZE);
+int64_t len = MIN(bytes - i, chsize);
 
 if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
 break;
@@ -1559,7 +1565,7 @@ static int img_compare(int argc, char **argv)
 ret = 4;
 goto out;
 }
-ret = compare_buffers(buf1, buf2, chunk, &pnum);
+ret = compare_buffers(buf1, buf2, chunk, 0, &pnum);
 if (ret || pnum != chunk) {
 qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
 offset + (ret ? 0 : pnum));
@@ -3878,7 +3884,7 @@ static int img_rebase(int argc, char **argv)
 int64_t pnum;
 
 if (compare_buffers(buf_old + written, buf_new + written,
-n - written, &pnum))
+n - written, 0, &pnum))
 {
 if (buf_old_is_zero) {
 ret = blk_pwrite_zeroes(blk, offset + written, pnum, 
0);
-- 
2.39.3




[PATCH v3 1/8] qemu-img: rebase: stop when reaching EOF of old backing file

2023-09-19 Thread Andrey Drobyshev via
In the case when we're rebasing within one backing chain and the target image
is larger than the old backing file, bdrv_is_allocated_above() ends up setting
*pnum = 0.  As a result, the target offset isn't incremented, and we get
stuck in an infinite for loop.  Let's detect this case and proceed further
down the loop body, as the offsets beyond the old backing size need to be
explicitly zeroed.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 qemu-img.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index a48edb7101..50660ba920 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3805,6 +3805,8 @@ static int img_rebase(int argc, char **argv)
 }
 
 if (prefix_chain_bs) {
+uint64_t bytes = n;
+
 /*
  * If cluster wasn't changed since prefix_chain, we don't need
  * to take action
@@ -3817,9 +3819,18 @@ static int img_rebase(int argc, char **argv)
  strerror(-ret));
 goto out;
 }
-if (!ret) {
+if (!ret && n) {
 continue;
 }
+if (!n) {
+/*
+ * If we've reached EOF of the old backing, it means that
+ * offsets beyond the old backing size were read as zeroes.
+ * Now we will need to explicitly zero the cluster in
+ * order to preserve that state after the rebase.
+ */
+n = bytes;
+}
 }
 
 /*
-- 
2.39.3




[PATCH v3 8/8] iotests: add tests for "qemu-img rebase" with compression

2023-09-19 Thread Andrey Drobyshev via
The test cases considered so far:

314 (new test suite):

1. Check that compression mode isn't compatible with "-f raw" (raw
   format doesn't support compression).
2. Check that rebasing an image onto no backing file preserves the data
   and writes the copied clusters actually compressed.
3. Same as 2, but with a raw backing file (i.e. the clusters copied from the
   backing are originally uncompressed -- we check they end up compressed
   after being merged).
4. Remove a single delta from a backing chain, perform the same checks
   as in 2.
5. Check that even when backing and overlay are initially uncompressed,
   copied clusters end up compressed when rebase with compression is
   performed.

271:

1. Check that when target image has subclusters, rebase with compression
   will make an entire cluster containing the written subcluster
   compressed.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/271 |  65 +++
 tests/qemu-iotests/271.out |  40 +
 tests/qemu-iotests/314 | 165 +
 tests/qemu-iotests/314.out |  75 +
 4 files changed, 345 insertions(+)
 create mode 100755 tests/qemu-iotests/314
 create mode 100644 tests/qemu-iotests/314.out

diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index e243f57ba7..59a6fafa2f 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -965,6 +965,71 @@ echo
 
 TEST_IMG="$TEST_IMG.top" alloc="1 30" zero="" _verify_l2_bitmap 0
 
+# Check that rebase with compression works correctly with images containing
+# subclusters.  When compression is enabled and we allocate a new
+# subcluster within the target (overlay) image, we expect the entire cluster
+# containing that subcluster to become compressed.
+#
+# Here we expect 1st and 3rd clusters of the top (overlay) image to become
+# compressed after the rebase, while cluster 2 to remain unallocated and
+# be read from the base (new backing) image.
+#
+# Base (new backing): |-- -- .. -- --|11 11 .. 11 11|-- -- .. -- --|
+# Mid (old backing):  |-- -- .. -- 22|-- -- .. -- --|33 -- .. -- --|
+# Top:|-- -- .. -- --|-- -- -- -- --|-- -- .. -- --|
+
+echo
+echo "### Rebase with compression for images with subclusters ###"
+echo
+
+echo "# create backing chain"
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img -o cluster_size=1M,extended_l2=on 3M
+TEST_IMG="$TEST_IMG.mid" _make_test_img -o cluster_size=1M,extended_l2=on \
+-b "$TEST_IMG.base" -F qcow2 3M
+TEST_IMG="$TEST_IMG.top" _make_test_img -o cluster_size=1M,extended_l2=on \
+-b "$TEST_IMG.mid" -F qcow2 3M
+
+echo
+echo "# fill old and new backing with data"
+echo
+
+$QEMU_IO -c "write -P 0x11 1M 1M" "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c "write -P 0x22 $(( 31 * 32 ))k 32k" \
+ -c "write -P 0x33 $(( 64 * 32 ))k 32k" \
+ "$TEST_IMG.mid" | _filter_qemu_io
+
+echo
+echo "# rebase topmost image onto the new backing, with compression"
+echo
+
+$QEMU_IMG rebase -c -b "$TEST_IMG.base" -F qcow2 "$TEST_IMG.top"
+
+echo "# verify that the 1st and 3rd clusters've become compressed"
+echo
+
+$QEMU_IMG map --output=json "$TEST_IMG.top" | _filter_testdir
+
+echo
+echo "# verify that data is read the same before and after rebase"
+echo
+
+$QEMU_IO -c "read -P 0x22 $(( 31 * 32 ))k 32k" \
+ -c "read -P 0x11 1M 1M" \
+ -c "read -P 0x33 $(( 64 * 32 ))k 32k" \
+ "$TEST_IMG.top" | _filter_qemu_io
+
+echo
+echo "# verify image bitmap"
+echo
+
+# For compressed clusters bitmap is always 0.  For unallocated cluster
+# there should be no entry at all, thus bitmap is also 0.
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 0
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 1
+TEST_IMG="$TEST_IMG.top" alloc="" zero="" _verify_l2_bitmap 2
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/271.out b/tests/qemu-iotests/271.out
index c335a6c608..0b24d50159 100644
--- a/tests/qemu-iotests/271.out
+++ b/tests/qemu-iotests/271.out
@@ -765,4 +765,44 @@ Offset  Length  Mapped to   File
 # verify image bitmap
 
 L2 entry #0: 0x8050 4002
+
+### Rebase with compression for images with subclusters ###
+
+# create backing chain
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=3145728
+Formatting 'TEST_DIR/t.IMGFMT.mid', fmt=IMGFMT size=3145728 
backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT
+Formatting 'TEST_DIR/t.IMGFMT.top', fmt=IMGFMT size=3145728 
backing_file=TEST_DIR/t.IMGFMT.mid backing_fmt=IMGFMT
+
+# fill old and new backing with data
+
+wrote 1048576/1048576 bytes at offset 1048576
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 1015808
+32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 2097152
+32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+# rebase topmost image onto the new backing,

[PATCH v3 5/8] qemu-img: rebase: avoid unnecessary COW operations

2023-09-19 Thread Andrey Drobyshev via
When rebasing an image from one backing file to another, we need to
compare data from old and new backings.  If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing the rebase, we'll write half of the cluster to inc2, and the
block layer will have to read the 2nd half of the same cluster from the
backing image inc1 while doing this write operation, although the whole
cluster was already read earlier to perform the data comparison.

In order to avoid these unnecessary IO cycles, let's make sure every
write request is aligned to the overlay subcluster boundaries.  Using the
subcluster size is universal, as for images which don't have subclusters
this size equals the cluster size, so in any case we end up aligning
to the smallest unit of allocation.

Signed-off-by: Andrey Drobyshev 
---
 qemu-img.c | 74 +++---
 1 file changed, 54 insertions(+), 20 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 0f67b021f7..a2d6241648 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3523,6 +3523,7 @@ static int img_rebase(int argc, char **argv)
 uint8_t *buf_new = NULL;
 BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
 BlockDriverState *unfiltered_bs;
+BlockDriverInfo bdi = {0};
 char *filename;
 const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
 int c, flags, src_flags, ret;
@@ -3533,6 +3534,7 @@ static int img_rebase(int argc, char **argv)
 bool quiet = false;
 Error *local_err = NULL;
 bool image_opts = false;
+int64_t write_align;
 
 /* Parse commandline parameters */
 fmt = NULL;
@@ -3656,6 +3658,20 @@ static int img_rebase(int argc, char **argv)
 }
 }
 
+/*
+ * We need overlay subcluster size to make sure write requests are
+ * aligned.
+ */
+ret = bdrv_get_info(unfiltered_bs, &bdi);
+if (ret < 0) {
+error_report("could not get block driver info");
+goto out;
+} else if (bdi.subcluster_size == 0) {
+bdi.subcluster_size = 1;
+}
+
+write_align = bdi.subcluster_size;
+
 /* For safe rebasing we need to compare old and new backing file */
 if (!unsafe) {
 QDict *options = NULL;
@@ -3753,7 +3769,7 @@ static int img_rebase(int argc, char **argv)
 int64_t old_backing_size = 0;
 int64_t new_backing_size = 0;
 uint64_t offset;
-int64_t n;
+int64_t n, n_old = 0, n_new = 0;
 float local_progress = 0;
 
 if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk_old_backing)) >
@@ -3799,7 +3815,8 @@ static int img_rebase(int argc, char **argv)
 }
 
 for (offset = 0; offset < size; offset += n) {
-bool buf_old_is_zero = false;
+bool old_backing_eof = false;
+int64_t n_alloc;
 
 /* How many bytes can we handle with the next read? */
 n = MIN(IO_BUF_SIZE, size - offset);
@@ -3844,33 +3861,46 @@ static int img_rebase(int argc, char **argv)
 }
 }
 
+/*
+ * At this point we know that the region [offset; offset + n)
+ * is unallocated within the target image.  This region might be
+ * unaligned to the target image's (sub)cluster boundaries, as
+ * old backing may have smaller clusters (or have subclusters).
+ * We extend it to the aligned boundaries to avoid CoW on
+ * partial writes in blk_pwrite(),
+ */
+n += offset - QEMU_ALIGN_DOWN(offset, write_align);
+offset = QEMU_ALIGN_DOWN(offset, write_align);
+n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
+n = MIN(n, size - offset);
+assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) &&
+   n_alloc == n);
+
+/*
+ * Much like with the target image, we'll try to read as much
+ * of the old and new backings as we can.
+ */
+n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+
 /*
  * Read old and new backing file and take into consideration that
  * backing files may be smaller than the COW image.
  */
-if (offset >= old_backing_size) {
-memset(buf_old, 0, n);
-buf_old_is_zero = true;
+memset(buf_old + n_old, 0, n - n_old);
+if (!n_old) {
+

[PATCH v3 7/8] qemu-img: add compression option to rebase subcommand

2023-09-19 Thread Andrey Drobyshev via
If we rebase an image whose backing file has compressed clusters, we
might end up wasting disk space since the copied clusters are now
uncompressed.  In order to have better control over this, let's add a
"--compress" option to the "qemu-img rebase" command.

Note that this option affects only the clusters which are actually being
copied from the original backing file.  The clusters which were
uncompressed in the target image will remain so.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 docs/tools/qemu-img.rst |  6 --
 qemu-img-cmds.hx|  4 ++--
 qemu-img.c  | 26 --
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index ca5a2773cf..4459c065f1 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -667,7 +667,7 @@ Command description:
 
   List, apply, create or delete snapshots in image *FILENAME*.
 
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
 
   Changes the backing file of an image. Only the formats ``qcow2`` and
   ``qed`` support changing the backing file.
@@ -694,7 +694,9 @@ Command description:
 
 In order to achieve this, any clusters that differ between
 *BACKING_FILE* and the old backing file of *FILENAME* are merged
-into *FILENAME* before actually changing the backing file.
+into *FILENAME* before actually changing the backing file. With the
+``-c`` option specified, the clusters which are being merged (but not
+the entire *FILENAME* image) are compressed when written.
 
 Note that the safe mode is an expensive operation, comparable to
 converting an image. It only works if the old backing file still
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 1b1dab5b17..068692d13e 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -88,9 +88,9 @@ SRST
 ERST
 
 DEF("rebase", img_rebase,
-"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] 
[-T src_cache] [-p] [-u] -b backing_file [-F backing_fmt] filename")
+"rebase [--object objectdef] [--image-opts] [-U] [-q] [-f fmt] [-t cache] 
[-T src_cache] [-p] [-u] [-c] -b backing_file [-F backing_fmt] filename")
 SRST
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
 ERST
 
 DEF("resize", img_resize,
diff --git a/qemu-img.c b/qemu-img.c
index a2d6241648..9d7a4a3566 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3527,11 +3527,13 @@ static int img_rebase(int argc, char **argv)
 char *filename;
 const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
 int c, flags, src_flags, ret;
+BdrvRequestFlags write_flags = 0;
 bool writethrough, src_writethrough;
 int unsafe = 0;
 bool force_share = false;
 int progress = 0;
 bool quiet = false;
+bool compress = false;
 Error *local_err = NULL;
 bool image_opts = false;
 int64_t write_align;
@@ -3548,9 +3550,10 @@ static int img_rebase(int argc, char **argv)
 {"object", required_argument, 0, OPTION_OBJECT},
 {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
 {"force-share", no_argument, 0, 'U'},
+{"compress", no_argument, 0, 'c'},
 {0, 0, 0, 0}
 };
-c = getopt_long(argc, argv, ":hf:F:b:upt:T:qU",
+c = getopt_long(argc, argv, ":hf:F:b:upt:T:qUc",
 long_options, NULL);
 if (c == -1) {
 break;
@@ -3598,6 +3601,9 @@ static int img_rebase(int argc, char **argv)
 case 'U':
 force_share = true;
 break;
+case 'c':
+compress = true;
+break;
 }
 }
 
@@ -3650,6 +3656,14 @@ static int img_rebase(int argc, char **argv)
 
 unfiltered_bs = bdrv_skip_filters(bs);
 
+if (compress && !block_driver_can_compress(unfiltered_bs->drv)) {
+error_report("Compression not supported for this file format");
+ret = -1;
+goto out;
+} else if (compress) {
+write_flags |= BDRV_REQ_WRITE_COMPRESSED;
+}
+
 if (out_basefmt != NULL) {
 if (bdrv_find_format(out_basefmt) == NULL) {
 error_report("Invalid format name: '%s'", out_basefmt);
@@ -3659,18 +3673,18 @@ static int img_rebase(int argc, char **argv)
 }
 
 /*
- * We need overlay subcluster size to make sure write requests are
- * aligned.
+ * We need overlay subcluster size (or cluster size in

[PATCH v3 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-09-19 Thread Andrey Drobyshev via
Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
the data read from the old and new backing files are aligned using
BlockDriverState (or BlockBackend later on) referring to the target image.
However, this isn't quite right, because buf_new is only being used for
reading from the new backing, while buf_old is being used for both reading
from the old backing and writing to the target.  Let's take that into account
and use more appropriate values as alignments.

Signed-off-by: Andrey Drobyshev 
---
 qemu-img.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 50660ba920..4dc91505bf 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
 int64_t n;
 float local_progress = 0;
 
-buf_old = blk_blockalign(blk, IO_BUF_SIZE);
-buf_new = blk_blockalign(blk, IO_BUF_SIZE);
+if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk_old_backing)) >
+bdrv_opt_mem_align(blk_bs(blk))) {
+buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
+} else {
+buf_old = blk_blockalign(blk, IO_BUF_SIZE);
+}
+buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
 
 size = blk_getlength(blk);
 if (size < 0) {
-- 
2.39.3




[PATCH v3 6/8] iotests/{024, 271}: add testcases for qemu-img rebase

2023-09-19 Thread Andrey Drobyshev via
As the previous commit changes the logic of "qemu-img rebase" (it now uses
write alignment), let's add a couple more test cases to ensure it works
correctly.  In particular, the following scenarios:

024: add a test case for rebase within one backing chain when the overlay
 cluster size > backing's cluster size;
271: add a test case for rebasing images that contain subclusters.  Check
 that no extra allocations are being made.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/024 | 60 ++
 tests/qemu-iotests/024.out | 43 +
 tests/qemu-iotests/271 | 66 ++
 tests/qemu-iotests/271.out | 42 
 4 files changed, 211 insertions(+)

diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 98a7c8fd65..285f17e79f 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -257,6 +257,66 @@ $QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
 
 echo
 
+# Check that rebase within the chain is working when
+# overlay cluster size > backings cluster size
+# (here overlay cluster size == 2 * backings cluster size)
+#
+# base_new <-- base_old <-- overlay
+#
+# Backing (new): -- -- -- -- -- --
+# Backing (old): -- 11 -- -- 22 --
+# Overlay:  |-- --|-- --|-- --|
+#
+# We should end up having 1st and 3rd cluster allocated, and their halves
+# being read as zeroes.
+
+echo
+echo "=== Test rebase with different cluster sizes ==="
+echo
+
+echo "Creating backing chain"
+echo
+
+TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 6 ))
+TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \
+$(( CLUSTER_SIZE * 6 ))
+CLUSTER_SIZE=$(( CLUSTER_SIZE * 2 )) TEST_IMG=$OVERLAY \
+_make_test_img -b "$BASE_OLD" -F $IMGFMT $(( CLUSTER_SIZE * 6 ))
+
+TEST_IMG=$OVERLAY _img_info
+
+echo
+echo "Fill backing files with data"
+echo
+
+$QEMU_IO "$BASE_OLD" -c "write -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" \
+-c "write -P 0x22 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Rebase onto another image in the same chain"
+echo
+
+$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY"
+
+echo "Verify that data is read the same before and after rebase"
+echo
+
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 0 $CLUSTER_SIZE" \
+-c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" \
+-c "read -P 0x00 $(( CLUSTER_SIZE * 2 )) $(( CLUSTER_SIZE * 2 ))" \
+-c "read -P 0x22 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+-c "read -P 0x00 $(( CLUSTER_SIZE * 5 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Verify that untouched cluster remains unallocated"
+echo
+
+$QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map
+
+echo
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out
index 245fe8b1d1..e1e8eea863 100644
--- a/tests/qemu-iotests/024.out
+++ b/tests/qemu-iotests/024.out
@@ -201,4 +201,47 @@ read 262144/262144 bytes at offset 0
 read 65536/65536 bytes at offset 262144
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+
+=== Test rebase with different cluster sizes ===
+
+Creating backing chain
+
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=393216
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=393216 backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT
+Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=393216 backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT
+image: TEST_DIR/subdir/t.IMGFMT
+file format: IMGFMT
+virtual size: 384 KiB (393216 bytes)
+cluster_size: 131072
+backing file: TEST_DIR/subdir/t.IMGFMT.base_old
+backing file format: IMGFMT
+
+Fill backing files with data
+
+wrote 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Rebase onto another image in the same chain
+
+Verify that data is read the same before and after rebase
+
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 131072/131072 bytes at offset 131072
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 327680
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Verify that untouched cluster remains unallocated
+
+Offset  Length  File
+0   0x2 TEST_DIR/subdir/t.IMGFMT
+0x4 0x2 TEST_DIR/subdir/t.IMGFMT
+
 *** done
diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index c7c2cadda0..e243f57ba7 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -899,6 +899

[PATCH v3 2/8] qemu-iotests: 024: add rebasing test case for overlay_size > backing_size

2023-09-19 Thread Andrey Drobyshev via
Before the previous commit, rebase was getting infinitely stuck in case of
rebasing within the same backing chain when overlay_size > backing_size.
Let's add this case to the rebasing test 024 to make sure it doesn't
break again.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Hanna Czenczek 
---
 tests/qemu-iotests/024 | 57 ++
 tests/qemu-iotests/024.out | 30 
 2 files changed, 87 insertions(+)

diff --git a/tests/qemu-iotests/024 b/tests/qemu-iotests/024
index 25a564a150..98a7c8fd65 100755
--- a/tests/qemu-iotests/024
+++ b/tests/qemu-iotests/024
@@ -199,6 +199,63 @@ echo
 # $BASE_OLD and $BASE_NEW)
 $QEMU_IMG map "$OVERLAY" | _filter_qemu_img_map
 
+# Check that rebase within the chain is working when
+# overlay_size > old_backing_size
+#
+# base_new <-- base_old <-- overlay
+#
+# Backing (new): 11 11 11 11 11
+# Backing (old): 22 22 22 22
+# Overlay:   -- -- -- -- --
+#
+# As a result, overlay should contain data identical to base_old, with the
+# last cluster remaining unallocated.
+
+echo
+echo "=== Test rebase within one backing chain ==="
+echo
+
+echo "Creating backing chain"
+echo
+
+TEST_IMG=$BASE_NEW _make_test_img $(( CLUSTER_SIZE * 5 ))
+TEST_IMG=$BASE_OLD _make_test_img -b "$BASE_NEW" -F $IMGFMT \
+$(( CLUSTER_SIZE * 4 ))
+TEST_IMG=$OVERLAY _make_test_img -b "$BASE_OLD" -F $IMGFMT \
+$(( CLUSTER_SIZE * 5 ))
+
+echo
+echo "Fill backing files with data"
+echo
+
+$QEMU_IO "$BASE_NEW" -c "write -P 0x11 0 $(( CLUSTER_SIZE * 5 ))" \
+| _filter_qemu_io
+$QEMU_IO "$BASE_OLD" -c "write -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \
+| _filter_qemu_io
+
+echo
+echo "Check the last cluster is zeroed in overlay before the rebase"
+echo
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
+echo "Rebase onto another image in the same chain"
+echo
+
+$QEMU_IMG rebase -b "$BASE_NEW" -F $IMGFMT "$OVERLAY"
+
+echo "Verify that data is read the same before and after rebase"
+echo
+
+# Verify the first 4 clusters are still read the same as in the old base
+$QEMU_IO "$OVERLAY" -c "read -P 0x22 0 $(( CLUSTER_SIZE * 4 ))" \
+| _filter_qemu_io
+# Verify the last cluster still reads as zeroes
+$QEMU_IO "$OVERLAY" -c "read -P 0x00 $(( CLUSTER_SIZE * 4 )) $CLUSTER_SIZE" \
+| _filter_qemu_io
+
+echo
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/024.out b/tests/qemu-iotests/024.out
index 973a5a3711..245fe8b1d1 100644
--- a/tests/qemu-iotests/024.out
+++ b/tests/qemu-iotests/024.out
@@ -171,4 +171,34 @@ read 65536/65536 bytes at offset 196608
 Offset  Length  File
 0   0x3 TEST_DIR/subdir/t.IMGFMT
 0x3 0x1 TEST_DIR/subdir/t.IMGFMT.base_new
+
+=== Test rebase within one backing chain ===
+
+Creating backing chain
+
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_new', fmt=IMGFMT size=327680
+Formatting 'TEST_DIR/subdir/t.IMGFMT.base_old', fmt=IMGFMT size=262144 backing_file=TEST_DIR/subdir/t.IMGFMT.base_new backing_fmt=IMGFMT
+Formatting 'TEST_DIR/subdir/t.IMGFMT', fmt=IMGFMT size=327680 backing_file=TEST_DIR/subdir/t.IMGFMT.base_old backing_fmt=IMGFMT
+
+Fill backing files with data
+
+wrote 327680/327680 bytes at offset 0
+320 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 262144/262144 bytes at offset 0
+256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Check the last cluster is zeroed in overlay before the rebase
+
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+Rebase onto another image in the same chain
+
+Verify that data is read the same before and after rebase
+
+read 262144/262144 bytes at offset 0
+256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 262144
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 *** done
-- 
2.39.3




[PATCH v3 0/8] qemu-img: rebase: add compression support

2023-09-19 Thread Andrey Drobyshev via
v2 --> v3:
 * Patch 3/8: fixed logic in the if statement, so that we align on blk
   when blk_old_backing == NULL;
 * Patch 4/8: comment fix;
 * Patch 5/8: comment fix; dropped redundant "if (blk_new_backing)"
   statements.

v2: https://lists.nongnu.org/archive/html/qemu-block/2023-09/msg00448.html

Andrey Drobyshev (8):
  qemu-img: rebase: stop when reaching EOF of old backing file
  qemu-iotests: 024: add rebasing test case for overlay_size >
backing_size
  qemu-img: rebase: use backing files' BlockBackend for buffer alignment
  qemu-img: add chunk size parameter to compare_buffers()
  qemu-img: rebase: avoid unnecessary COW operations
  iotests/{024, 271}: add testcases for qemu-img rebase
  qemu-img: add compression option to rebase subcommand
  iotests: add tests for "qemu-img rebase" with compression

 docs/tools/qemu-img.rst|   6 +-
 qemu-img-cmds.hx   |   4 +-
 qemu-img.c | 136 ++
 tests/qemu-iotests/024 | 117 ++
 tests/qemu-iotests/024.out |  73 
 tests/qemu-iotests/271 | 131 +
 tests/qemu-iotests/271.out |  82 ++
 tests/qemu-iotests/314 | 165 +
 tests/qemu-iotests/314.out |  75 +
 9 files changed, 752 insertions(+), 37 deletions(-)
 create mode 100755 tests/qemu-iotests/314
 create mode 100644 tests/qemu-iotests/314.out

-- 
2.39.3




Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations

2023-09-19 Thread Andrey Drobyshev
On 9/19/23 13:46, Hanna Czenczek wrote:
> On 15.09.23 18:20, Andrey Drobyshev wrote:
>> When rebasing an image from one backing file to another, we need to
>> compare data from old and new backings.  If the diff between that data
>> happens to be unaligned to the target cluster size, we might end up
>> doing partial writes, which would lead to copy-on-write and additional
>> IO.
>>
>> Consider the following simple case (virtual_size == cluster_size == 64K):
>>
>> base <-- inc1 <-- inc2
>>
>> qemu-io -c "write -P 0xaa 0 32K" base.qcow2
>> qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
>> qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
>> qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
>> qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
>>
>> While doing rebase, we'll write a half of the cluster to inc2, and block
>> layer will have to read the 2nd half of the same cluster from the base
>> image
>> inc1 while doing this write operation, although the whole cluster is
>> already
>> read earlier to perform data comparison.
>>
>> In order to avoid these unnecessary IO cycles, let's make sure every
>> write request is aligned to the overlay subcluster boundaries.  Using
>> subcluster size is universal as for the images which don't have them
>> this size equals to the cluster size, so in any case we end up aligning
>> to the smallest unit of allocation.
>>
>> Signed-off-by: Andrey Drobyshev 
>> ---
>>   qemu-img.c | 76 --
>>   1 file changed, 56 insertions(+), 20 deletions(-)
> 
> Looks good, I like the changes from v1!  Two minor things:
> 
>> diff --git a/qemu-img.c b/qemu-img.c
>> index fcd31d7b5b..83950af42b 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
> 
> [...]
> 
>> @@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv)
>>   }
>>   }
>>   +    /*
>> + * At this point we know that the region [offset; offset
>> + n)
>> + * is unallocated within the target image.  This region
>> might be
>> + * unaligned to the target image's (sub)cluster
>> boundaries, as
>> + * old backing may have smaller clusters (or have
>> subclusters).
>> + * We extend it to the aligned boundaries to avoid CoW on
>> + * partial writes in blk_pwrite(),
>> + */
>> +    n += offset - QEMU_ALIGN_DOWN(offset, write_align);
>> +    offset = QEMU_ALIGN_DOWN(offset, write_align);
>> +    n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
>> +    n = MIN(n, size - offset);
>> +    assert(!bdrv_is_allocated(unfiltered_bs, offset, n,
>> &n_alloc) &&
>> +   n_alloc == n);
>> +
>> +    /*
>> + * Much like the with the target image, we'll try to read
>> as much
> 
> s/the with the/with the/
>

Noted.

>> + * of the old and new backings as we can.
>> + */
>> +    n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
>> +    if (blk_new_backing) {
>> +    n_new = MIN(n, MAX(0, new_backing_size - (int64_t)
>> offset));
>> +    }
> 
> If we don’t have a check for blk_old_backing (old_backing_size is 0 if
> blk_old_backing is NULL), why do we have a check for blk_new_backing
> (new_backing_size is 0 if blk_new_backing is NULL)?
> 
> (Perhaps because the previous check was `offset >= new_backing_size ||
> !blk_new_backing`, i.e. included exactly such a check – but I don’t
> think it’s necessary, new_backing_size will be 0 if blk_new_backing is
> NULL.)
> 
>> +
>>   /*
>>    * Read old and new backing file and take into
>> consideration that
>>    * backing files may be smaller than the COW image.
>>    */
>> -    if (offset >= old_backing_size) {
>> -    memset(buf_old, 0, n);
>> -    buf_old_is_zero = true;
>> +    memset(buf_old + n_old, 0, n - n_old);
>> +    if (!n_old) {
>> +    old_backing_eof = true;
>>   } else {
>> -    if (offset + n > old_backing_size) {
>> -    n = old_backing_size - offset;
>> -    }
>> -
>> -    ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
>> +    ret = blk_pread(blk_old_backing, offset, n_old,
>> buf_old, 0);
>>   if (ret < 0) {
>>   error_report("error while reading from old
>> backing file");
>>   goto out;
>>   }
>>   }
>>   -    if (offset >= new_backing_size || !blk_new_backing) {
>> -    memset(buf_new, 0, n);
>> -    } else {
>> -    if (offset + n > new_backing_size) {
>> -    n = new_backing_size - offset;
>> -    }
>> -
>> -    ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
>> +    memset(buf_new + n_new, 0,

Re: [PATCH] block: mark aio_poll as non-coroutine

2023-09-19 Thread Kevin Wolf
Am 08.09.2023 um 09:54 hat Paolo Bonzini geschrieben:
> It is forbidden to block on the event loop during a coroutine, as that
> can cause deadlocks due to recursive locking.
> 
> Signed-off-by: Paolo Bonzini 

Thanks, applied to the block branch.

Kevin




Re: Deadlock with SATA CD I/O and eject

2023-09-19 Thread Kevin Wolf
Am 19.09.2023 um 14:57 hat John Levon geschrieben:
> On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote:
> 
> > > In the meantime, we start processing the blk_drain() code, so by the time this
> > > blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> > > blk_wait_while_drained().
> > > 
> > > I don't know the qemu block stack well enough to propose an actual fix.
> > > 
> > > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> > > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
> > > that's just a band-aid instead of fixing the deadlock.
> > 
> > Related discussion: 
> > https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
> > 
> > Actually, it seems we never fixed that problem either?
> > 
> > Back then I suggested that blk_drain*() should disable request queuing
> > because its callers don't want to quiesce the BlockBackend, but rather
> > get their own requests completed. I think this approach would solve this
> > one as well.
> 
> In this case though, it's not their own requests right? We have incoming I/O
> from the guest and the eject is a separate operation.

It's not the same code path, but we're operating in the context of the
IDE device (or more specifically, its BlockBackend), so in that sense it
is "its own requests".

The main difference is anyway between just stopping activity (even if
it's in the middle of a higher level operation of the device) and
getting requests fully completed. We want the latter here.

> So why it would be OK to disable queuing and have ongoing I/Os (i.e.
> blk->in_flight > 0) racing against the block remove?

With eject, the case is simple for IDE: We hold the BQL, so the guest
won't be able to submit new requests anyway.

In more complicated cases like virtio-blk, bdrv_drained_begin() and
friends take care to stop new requests from coming in from the guest by
not running notifiers while the device is drained.

We just need to take care of the requests that have already started.

> > Your experiment with moving the queuing later is basically what Paolo
> 
> I think it's much more flaky than that, isn't it? It's just clearing out the
> *current* pending queue, but nothing would stop new I/Os from being started
> before we dropped into the poll for blk->in_flight to be zero again. In my case,
> this just happens to work because the prior tray open notification has stopped
> new I/O being filed, right?

I think it's the same as above, holding the BQL and calling drain would
already take care of that.

Kevin




Re: [PATCH v2] block-backend: Add new bds_io_in_flight counter

2023-09-19 Thread Kevin Wolf
Am 31.03.2023 um 18:23 hat Hanna Czenczek geschrieben:
> IDE TRIM is a BB user that wants to elevate its BB's in-flight counter
> for a "macro" operation that consists of several actual I/O operations.
> Each of those operations is individually started and awaited.  It does
> this so that blk_drain() will drain the whole TRIM, and not just a
> single one of the many discard operations it may encompass.
> 
> When request queuing is enabled, this leads to a deadlock: The currently
> ongoing discard is drained, and the next one is queued, waiting for the
> drain to stop.  Meanwhile, TRIM still keeps the in-flight counter
> elevated, waiting for all discards to stop -- which will never happen,
> because with the in-flight counter elevated, the BB is never considered
> drained, so the drained section does not begin and cannot end.

Alright, let's have another look at this now that another similar
deadlock was reported:
https://lists.gnu.org/archive/html/qemu-block/2023-09/msg00536.html

> There are two separate cases to look at here, namely bdrv_drain*() and
> blk_drain*().  As said above, we do want blk_drain*() to settle the
> whole operation: The only way to do so is to disable request queuing,
> then.  So, we do that: Have blk_drain() and blk_drain_all() temporarily
> disable request queuing, which prevents the deadlock.

Two separate cases with two separate fixes suggests that it could be
two separate patches. I feel the blk_*() case is uncontroversial and it
would fix John's case, so splitting wouldn't only make this easier to
understand, but could mean that we can fix a useful subset earlier.

> (The devil's in the details, though: blk_drain_all() runs
> bdrv_drain_all_begin() first, so when we get to the individual BB, there
> may already be queued requests.  Therefore, we have to not only disable
> request queuing then, but resume all already-queued requests, too.)

Why can't we just disable request queuing before calling bdrv_drain_*()?

Is it possible that the same problem occurs because someone else already
called bdrv_drain_*()? That is, blk_drain_*() would be called from a
callback in the nested event loop in bdrv_drain_*()? If so, we can't
avoid that there are already queued requests.

Restarting them seems correct anyway.

> For bdrv_drain*(), we want request queuing -- and macro requests such as
> IDE's TRIM request do not matter.  bdrv_drain*() wants to keep I/O
> requests from BDS nodes, and the TRIM does not issue such requests; it
> instead does so through blk_*() functions, which themselves elevate the
> BB's in-flight counter.  So the idea is to drain (and potentially queue)
> those blk_*() requests, but completely ignore the TRIM.
> 
> We can do that by splitting a new counter off of the existing BB
> counter: The new bds_io_in_flight counter counts all those blk_*()
> requests that can issue I/O to a BDS (so must be drained by
> bdrv_drain*()), but will never block waiting on another request on the
> BB.

You end up changing all of the existing blk_inc_in_flight() callers
except those in IDE and virtio-blk.

That makes me wonder if it shouldn't be approached the other way around:
BlockBackend users that want to be included in drain should use a
special function blk_inc_in_flight_external() or something that wouldn't
increase blk->in_flight, but only a new separate counter. And then only
blk_drain*() wait for it (by extending the AIO_WAIT_WHILE() condition),
but not the child_root callbacks.

This would give us more directly the semantics that we actually need:
The root BDS doesn't care if the operation on the device level has
completed as long as nothing new arrives, only external callers which
use blk_drain*() do. I believe internal/external is easier to reason
about than "requests that can issue I/O to a BDS [directly]", it keeps
the external callers the special ones that need extra care while the
normal I/O path is unaffected, and it would make the patch much smaller.

> In blk_drain*(), we disable request queuing and settle all requests (the
> full in_flight count).  In bdrv_drain*() (i.e. blk_root_drained_poll()),
> we only settle bds_io_in_flight_count, ignoring all requests that will
> not directly issue I/O requests to BDS nodes.
> 
> Reported-by: Fiona Ebner 
> Fixes: 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca
>("ide: Increment BB in-flight counter for TRIM BH")
> Signed-off-by: Hanna Czenczek 

Kevin




Re: Deadlock with SATA CD I/O and eject

2023-09-19 Thread John Levon
On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote:

> > In the meantime, we start processing the blk_drain() code, so by the time this
> > blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> > blk_wait_while_drained().
> > 
> > I don't know the qemu block stack well enough to propose an actual fix.
> > 
> > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
> > that's just a band-aid instead of fixing the deadlock.
> 
> Related discussion: 
> https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
> 
> Actually, it seems we never fixed that problem either?
> 
> Back then I suggested that blk_drain*() should disable request queuing
> because its callers don't want to quiesce the BlockBackend, but rather
> get their own requests completed. I think this approach would solve this
> one as well.

In this case though, it's not their own requests right? We have incoming I/O
from the guest and the eject is a separate operation.

So why it would be OK to disable queuing and have ongoing I/Os (i.e.
blk->in_flight > 0) racing against the block remove?

> Your experiment with moving the queuing later is basically what Paolo

I think it's much more flaky than that, isn't it? It's just clearing out the
*current* pending queue, but nothing would stop new I/Os from being started
before we dropped into the poll for blk->in_flight to be zero again. In my case,
this just happens to work because the prior tray open notification has stopped
new I/O being filed, right?

thanks
john



Re: [PATCH v2 8/8] iotests: add tests for "qemu-img rebase" with compression

2023-09-19 Thread Hanna Czenczek

On 15.09.23 18:20, Andrey Drobyshev wrote:

The test cases considered so far:

314 (new test suite):

1. Check that compression mode isn't compatible with "-f raw" (raw
format doesn't support compression).
2. Check that rebasing an image onto no backing file preserves the data
and writes the copied clusters actually compressed.
3. Same as 2, but with a raw backing file (i.e. the clusters copied from the
backing are originally uncompressed -- we check they end up compressed
after being merged).
4. Remove a single delta from a backing chain, perform the same checks
as in 2.
5. Check that even when backing and overlay are initially uncompressed,
copied clusters end up compressed when rebase with compression is
performed.

271:

1. Check that when target image has subclusters, rebase with compression
will make an entire cluster containing the written subcluster
compressed.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/271 |  65 +++
  tests/qemu-iotests/271.out |  40 +
  tests/qemu-iotests/314 | 165 +
  tests/qemu-iotests/314.out |  75 +
  4 files changed, 345 insertions(+)
  create mode 100755 tests/qemu-iotests/314
  create mode 100644 tests/qemu-iotests/314.out


Reviewed-by: Hanna Czenczek 




Re: [PATCH 1/1] block: improve alignment detection and fix 271 test

2023-09-19 Thread Denis V. Lunev

On 9/7/23 23:53, Denis V. Lunev wrote:

Unfortunately 271 IO test is broken if started in non-cached mode.

Commits
 commit a6b257a08e3d72219f03e461a52152672fec0612
 Author: Nir Soffer 
 Date:   Tue Aug 13 21:21:03 2019 +0300
 file-posix: Handle undetectable alignment
and
 commit 9c60a5d1978e6dcf85c0e01b50e6f7f54ca09104
 Author: Kevin Wolf 
 Date:   Thu Jul 16 16:26:00 2020 +0200
 block: Require aligned image size to avoid assertion failure
have an interesting side effect if used together.

If the image size is not a multiple of 4k and the image falls under the
original constraints of Nir's patch, the image cannot be opened
due to the check in bdrv_check_perm().

The patch tries to satisfy the requirements of bdrv_check_perm()
inside raw_probe_alignment(). In my opinion this is better than just
disallowing that test to run in non-cached mode. The operation is legal
by itself.

Signed-off-by: Denis V. Lunev 
CC: Nir Soffer 
CC: Kevin Wolf 
CC: Hanna Reitz 
CC: Alberto Garcia 
---
  block/file-posix.c | 17 +++--
  1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index b16e9c21a1..988cfdc76c 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -447,8 +447,21 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
  for (i = 0; i < ARRAY_SIZE(alignments); i++) {
  align = alignments[i];
  if (raw_is_io_aligned(fd, buf, align)) {
-/* Fallback to safe value. */
-bs->bl.request_alignment = (align != 1) ? align : max_align;
+if (align != 1) {
+bs->bl.request_alignment = align;
+break;
+}
+/*
+ * Fall back to a safe value. max_align would be perfect, but the
+ * virtual length of the device must be a multiple of the request
+ * alignment, otherwise we will get an error in
+ * bdrv_node_refresh_perm().
+ */
+for (align = max_align; align > 1; align /= 2) {
+if ((bs->total_sectors * BDRV_SECTOR_SIZE) % align == 0) {
+break;
+}
+}
+bs->bl.request_alignment = align;
  break;
  }
  }

ping



[PATCH v5 5/5] vhost-user: fix lost reconnect

2023-09-19 Thread Li Feng
When the vhost-user is reconnecting to the backend, and if the vhost-user fails
at get_features in vhost_dev_init(), then the reconnect fails and will
never be retriggered.

The reason is:
When the vhost-user fails at get_features, the vhost_dev_cleanup will be called
immediately.

vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.

The reconnect path is:
vhost_user_blk_event
   vhost_user_async_close(.. vhost_user_blk_disconnect ..)
 qemu_chr_fe_set_handlers <- clear the notifier callback
   schedule vhost_user_async_close_bh

The vhost->vdev is null, so the vhost_user_blk_disconnect will not be
called, then the event fd callback will not be reinstalled.

All vhost-user devices have this issue, including vhost-user-blk/scsi.

With this patch, if vhost->vdev is null, the fd callback will still
be reinstalled.

Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")

Signed-off-by: Li Feng 
---
 hw/block/vhost-user-blk.c  | 2 +-
 hw/scsi/vhost-user-scsi.c  | 3 ++-
 hw/virtio/vhost-user-gpio.c| 2 +-
 hw/virtio/vhost-user.c | 9 +++--
 include/hw/virtio/vhost-user.h | 3 ++-
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 3c69fa47d5..95c758200d 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -391,7 +391,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
 case CHR_EVENT_CLOSED:
 /* defer close until later to avoid circular close */
 vhost_user_async_close(dev, &s->chardev, &s->dev,
-   vhost_user_blk_disconnect);
+   vhost_user_blk_disconnect, vhost_user_blk_event);
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 53a62c3170..0effbb4787 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -238,7 +238,8 @@ static void vhost_user_scsi_event(void *opaque, QEMUChrEvent event)
 case CHR_EVENT_CLOSED:
 /* defer close until later to avoid circular close */
 vhost_user_async_close(dev, &vs->conf.chardev, &vsc->dev,
-   vhost_user_scsi_disconnect);
+   vhost_user_scsi_disconnect,
+   vhost_user_scsi_event);
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
index d9979aa5db..04c2cc79f4 100644
--- a/hw/virtio/vhost-user-gpio.c
+++ b/hw/virtio/vhost-user-gpio.c
@@ -283,7 +283,7 @@ static void vu_gpio_event(void *opaque, QEMUChrEvent event)
 case CHR_EVENT_CLOSED:
 /* defer close until later to avoid circular close */
 vhost_user_async_close(dev, &gpio->chardev, &gpio->vhost_dev,
-   vu_gpio_disconnect);
+   vu_gpio_disconnect, vu_gpio_event);
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..12c4a41f30 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2643,6 +2643,7 @@ typedef struct {
 DeviceState *dev;
 CharBackend *cd;
 struct vhost_dev *vhost;
+IOEventHandler *event_cb;
 } VhostAsyncCallback;
 
 static void vhost_user_async_close_bh(void *opaque)
@@ -2657,7 +2658,10 @@ static void vhost_user_async_close_bh(void *opaque)
  */
 if (vhost->vdev) {
 data->cb(data->dev);
-}
+} else if (data->event_cb) {
+qemu_chr_fe_set_handlers(data->cd, NULL, NULL, data->event_cb,
+ NULL, data->dev, NULL, true);
+   }
 
 g_free(data);
 }
@@ -2669,7 +2673,8 @@ static void vhost_user_async_close_bh(void *opaque)
  */
 void vhost_user_async_close(DeviceState *d,
 CharBackend *chardev, struct vhost_dev *vhost,
-vu_async_close_fn cb)
+vu_async_close_fn cb,
+IOEventHandler *event_cb)
 {
 if (!runstate_check(RUN_STATE_SHUTDOWN)) {
 /*
diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
index 191216a74f..649e9dd54f 100644
--- a/include/hw/virtio/vhost-user.h
+++ b/include/hw/virtio/vhost-user.h
@@ -84,6 +84,7 @@ typedef void (*vu_async_close_fn)(DeviceState *cb);
 
 void vhost_user_async_close(DeviceState *d,
 CharBackend *chardev, struct vhost_dev *vhost,
-vu_async_close_fn cb);
+vu_async_close_fn cb,
+IOEventHandler *event_cb);
 
 #endif
-- 
2.41.0




[PATCH v5 0/5] Implement reconnect for vhost-user-scsi

2023-09-19 Thread Li Feng
Ping ...
Could anyone review this series and merge them?

Changes for v5:
- No logic has been changed, just move part of the code from patch 4 to patch 5.

Changes for v4:
- Merge
  https://lore.kernel.org/all/20230830045722.611224-1-fen...@smartx.com/ to
  this series.
- Add ERRP_GUARD in vhost_user_scsi_realize;
- Reword the commit messages.

Changes for v3:
- Split the vhost_user_scsi_handle_output to a separate patch;
- Move the started_vu from vhost scsi common header to vhost-user-scsi header;
- Fix a log print error;

Changes for v2:
- Split the v1 patch to small separate patchset;
- New patch for fixing fd leak, which has sent to reviewers in another
  mail;
- Implement the `vhost_user_scsi_handle_output`;
- Add the started_vu safe check;
- Fix error handler;
- Check the inflight before set/get inflight fd.

Li Feng (5):
  vhost-user-common: send get_inflight_fd once
  vhost: move and rename the conn retry times
  vhost-user-scsi: support reconnect to backend
  vhost-user-scsi: start vhost when guest kicks
  vhost-user: fix lost reconnect

 hw/block/vhost-user-blk.c |   6 +-
 hw/scsi/vhost-scsi-common.c   |  47 ++---
 hw/scsi/vhost-scsi.c  |   5 +-
 hw/scsi/vhost-user-scsi.c | 253 +++---
 hw/virtio/vhost-user-gpio.c   |   5 +-
 hw/virtio/vhost-user.c|   9 +-
 include/hw/virtio/vhost-scsi-common.h |   2 +-
 include/hw/virtio/vhost-user-scsi.h   |   4 +
 include/hw/virtio/vhost-user.h|   3 +-
 include/hw/virtio/vhost.h |   2 +
 10 files changed, 276 insertions(+), 60 deletions(-)

-- 
2.41.0




[PATCH v5 2/5] vhost: move and rename the conn retry times

2023-09-19 Thread Li Feng
Multiple devices need this macro; move it to a common header.

Signed-off-by: Li Feng 
Reviewed-by: Raphael Norwitz 
---
 hw/block/vhost-user-blk.c   | 4 +---
 hw/virtio/vhost-user-gpio.c | 3 +--
 include/hw/virtio/vhost.h   | 2 ++
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index eecf3f7a81..3c69fa47d5 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -32,8 +32,6 @@
 #include "sysemu/sysemu.h"
 #include "sysemu/runstate.h"
 
-#define REALIZE_CONNECTION_RETRIES 3
-
 static const int user_feature_bits[] = {
 VIRTIO_BLK_F_SIZE_MAX,
 VIRTIO_BLK_F_SEG_MAX,
@@ -482,7 +480,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 s->inflight = g_new0(struct vhost_inflight, 1);
 s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues);
 
-retries = REALIZE_CONNECTION_RETRIES;
+retries = VU_REALIZE_CONN_RETRIES;
 assert(!*errp);
 do {
 if (*errp) {
diff --git a/hw/virtio/vhost-user-gpio.c b/hw/virtio/vhost-user-gpio.c
index 3b013f2d0f..d9979aa5db 100644
--- a/hw/virtio/vhost-user-gpio.c
+++ b/hw/virtio/vhost-user-gpio.c
@@ -15,7 +15,6 @@
 #include "standard-headers/linux/virtio_ids.h"
 #include "trace.h"
 
-#define REALIZE_CONNECTION_RETRIES 3
 #define VHOST_NVQS 2
 
 /* Features required from VirtIO */
@@ -359,7 +358,7 @@ static void vu_gpio_device_realize(DeviceState *dev, Error 
**errp)
 qemu_chr_fe_set_handlers(&gpio->chardev, NULL, NULL, vu_gpio_event, NULL,
  dev, NULL, true);
 
-retries = REALIZE_CONNECTION_RETRIES;
+retries = VU_REALIZE_CONN_RETRIES;
 g_assert(!*errp);
 do {
 if (*errp) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 6a173cb9fa..ca3131b1af 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -8,6 +8,8 @@
 #define VHOST_F_DEVICE_IOTLB 63
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
 
+#define VU_REALIZE_CONN_RETRIES 3
+
 /* Generic structures common for any vhost based device. */
 
 struct vhost_inflight {
-- 
2.41.0




[PATCH v5 1/5] vhost-user-common: send get_inflight_fd once

2023-09-19 Thread Li Feng
Currently get_inflight_fd is sent every time the device is started, and
the backend allocates shared memory to save the inflight state. If the
backend receives a second get_inflight_fd, it releases the previous shared
memory, which breaks the inflight tracking logic.

This patch is a preparation for the following patches.

Signed-off-by: Li Feng 
---
 hw/scsi/vhost-scsi-common.c | 37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/hw/scsi/vhost-scsi-common.c b/hw/scsi/vhost-scsi-common.c
index a06f01af26..a61cd0e907 100644
--- a/hw/scsi/vhost-scsi-common.c
+++ b/hw/scsi/vhost-scsi-common.c
@@ -52,20 +52,28 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 
 vsc->dev.acked_features = vdev->guest_features;
 
-assert(vsc->inflight == NULL);
-vsc->inflight = g_new0(struct vhost_inflight, 1);
-ret = vhost_dev_get_inflight(&vsc->dev,
- vs->conf.virtqueue_size,
- vsc->inflight);
+ret = vhost_dev_prepare_inflight(&vsc->dev, vdev);
 if (ret < 0) {
-error_report("Error get inflight: %d", -ret);
+error_report("Error setting inflight format: %d", -ret);
 goto err_guest_notifiers;
 }
 
-ret = vhost_dev_set_inflight(&vsc->dev, vsc->inflight);
-if (ret < 0) {
-error_report("Error set inflight: %d", -ret);
-goto err_guest_notifiers;
+if (vsc->inflight) {
+if (!vsc->inflight->addr) {
+ret = vhost_dev_get_inflight(&vsc->dev,
+vs->conf.virtqueue_size,
+vsc->inflight);
+if (ret < 0) {
+error_report("Error getting inflight: %d", -ret);
+goto err_guest_notifiers;
+}
+}
+
+ret = vhost_dev_set_inflight(&vsc->dev, vsc->inflight);
+if (ret < 0) {
+error_report("Error setting inflight: %d", -ret);
+goto err_guest_notifiers;
+}
 }
 
 ret = vhost_dev_start(&vsc->dev, vdev, true);
@@ -85,9 +93,6 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 return ret;
 
 err_guest_notifiers:
-g_free(vsc->inflight);
-vsc->inflight = NULL;
-
 k->set_guest_notifiers(qbus->parent, vsc->dev.nvqs, false);
 err_host_notifiers:
 vhost_dev_disable_notifiers(&vsc->dev, vdev);
@@ -111,12 +116,6 @@ void vhost_scsi_common_stop(VHostSCSICommon *vsc)
 }
 assert(ret >= 0);
 
-if (vsc->inflight) {
-vhost_dev_free_inflight(vsc->inflight);
-g_free(vsc->inflight);
-vsc->inflight = NULL;
-}
-
 vhost_dev_disable_notifiers(&vsc->dev, vdev);
 }
 
-- 
2.41.0




[PATCH v5 3/5] vhost-user-scsi: support reconnect to backend

2023-09-19 Thread Li Feng
If the backend crashes and restarts, the device is broken.
This patch adds reconnect support for vhost-user-scsi.

This patch also improves the error messages, and reports some silent errors.

Tested with spdk backend.

Signed-off-by: Li Feng 
---
 hw/scsi/vhost-scsi-common.c   |  16 +-
 hw/scsi/vhost-scsi.c  |   5 +-
 hw/scsi/vhost-user-scsi.c | 204 +++---
 include/hw/virtio/vhost-scsi-common.h |   2 +-
 include/hw/virtio/vhost-user-scsi.h   |   4 +
 5 files changed, 201 insertions(+), 30 deletions(-)

diff --git a/hw/scsi/vhost-scsi-common.c b/hw/scsi/vhost-scsi-common.c
index a61cd0e907..4c8637045d 100644
--- a/hw/scsi/vhost-scsi-common.c
+++ b/hw/scsi/vhost-scsi-common.c
@@ -16,6 +16,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "hw/virtio/vhost.h"
@@ -25,7 +26,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "hw/fw-path-provider.h"
 
-int vhost_scsi_common_start(VHostSCSICommon *vsc)
+int vhost_scsi_common_start(VHostSCSICommon *vsc, Error **errp)
 {
 int ret, i;
 VirtIODevice *vdev = VIRTIO_DEVICE(vsc);
@@ -35,18 +36,19 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 VirtIOSCSICommon *vs = (VirtIOSCSICommon *)vsc;
 
 if (!k->set_guest_notifiers) {
-error_report("binding does not support guest notifiers");
+error_setg(errp, "binding does not support guest notifiers");
 return -ENOSYS;
 }
 
 ret = vhost_dev_enable_notifiers(&vsc->dev, vdev);
 if (ret < 0) {
+error_setg_errno(errp, -ret, "Error enabling host notifiers");
 return ret;
 }
 
 ret = k->set_guest_notifiers(qbus->parent, vsc->dev.nvqs, true);
 if (ret < 0) {
-error_report("Error binding guest notifier");
+error_setg_errno(errp, -ret, "Error binding guest notifier");
 goto err_host_notifiers;
 }
 
@@ -54,7 +56,7 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 
 ret = vhost_dev_prepare_inflight(&vsc->dev, vdev);
 if (ret < 0) {
-error_report("Error setting inflight format: %d", -ret);
+error_setg_errno(errp, -ret, "Error setting inflight format");
 goto err_guest_notifiers;
 }
 
@@ -64,21 +66,21 @@ int vhost_scsi_common_start(VHostSCSICommon *vsc)
 vs->conf.virtqueue_size,
 vsc->inflight);
 if (ret < 0) {
-error_report("Error getting inflight: %d", -ret);
+error_setg_errno(errp, -ret, "Error getting inflight");
 goto err_guest_notifiers;
 }
 }
 
 ret = vhost_dev_set_inflight(&vsc->dev, vsc->inflight);
 if (ret < 0) {
-error_report("Error setting inflight: %d", -ret);
+error_setg_errno(errp, -ret, "Error setting inflight");
 goto err_guest_notifiers;
 }
 }
 
 ret = vhost_dev_start(&vsc->dev, vdev, true);
 if (ret < 0) {
-error_report("Error start vhost dev");
+error_setg_errno(errp, -ret, "Error starting vhost dev");
 goto err_guest_notifiers;
 }
 
diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 443f67daa4..01a3ab4277 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -75,6 +75,7 @@ static int vhost_scsi_start(VHostSCSI *s)
 int ret, abi_version;
 VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
 const VhostOps *vhost_ops = vsc->dev.vhost_ops;
+Error *local_err = NULL;
 
 ret = vhost_ops->vhost_scsi_get_abi_version(&vsc->dev, &abi_version);
 if (ret < 0) {
@@ -88,14 +89,14 @@ static int vhost_scsi_start(VHostSCSI *s)
 return -ENOSYS;
 }
 
-ret = vhost_scsi_common_start(vsc);
+ret = vhost_scsi_common_start(vsc, &local_err);
 if (ret < 0) {
 return ret;
 }
 
 ret = vhost_scsi_set_endpoint(s);
 if (ret < 0) {
-error_report("Error setting vhost-scsi endpoint");
+error_reportf_err(local_err, "Error setting vhost-scsi endpoint");
 vhost_scsi_common_stop(vsc);
 }
 
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index ee99b19e7a..dc109154ad 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -43,26 +43,56 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_RESET_DEVICE = 13,
 };
 
+static int vhost_user_scsi_start(VHostUserSCSI *s, Error **errp)
+{
+VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+int ret;
+
+ret = vhost_scsi_common_start(vsc, errp);
+s->started_vu = (ret < 0 ? false : true);
+
+return ret;
+}
+
+static void vhost_user_scsi_stop(VHostUserSCSI *s)
+{
+VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+
+if (!s->started_vu) {
+return;
+}
+s->started_vu = false;
+
+vhost_scsi_common_stop(vsc);
+}
+
 static void vhost_user_scsi_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUser

[PATCH v5 4/5] vhost-user-scsi: start vhost when guest kicks

2023-09-19 Thread Li Feng
Let's keep the same behavior as vhost-user-blk.

Some old guests kick virtqueue before setting VIRTIO_CONFIG_S_DRIVER_OK.

Signed-off-by: Li Feng 
---
 hw/scsi/vhost-user-scsi.c | 48 +++
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index dc109154ad..53a62c3170 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -115,8 +115,48 @@ static void vhost_user_scsi_reset(VirtIODevice *vdev)
 }
 }
 
-static void vhost_dummy_handle_output(VirtIODevice *vdev, VirtQueue *vq)
+static void vhost_user_scsi_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
+VHostUserSCSI *s = (VHostUserSCSI *)vdev;
+DeviceState *dev = &s->parent_obj.parent_obj.parent_obj.parent_obj;
+VHostSCSICommon *vsc = VHOST_SCSI_COMMON(s);
+VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
+
+Error *local_err = NULL;
+int i, ret;
+
+if (!vdev->start_on_kick) {
+return;
+}
+
+if (!s->connected) {
+return;
+}
+
+if (vhost_dev_is_started(&vsc->dev)) {
+return;
+}
+
+/*
+ * Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
+ * vhost here instead of waiting for .set_status().
+ */
+ret = vhost_user_scsi_start(s, &local_err);
+if (ret < 0) {
+error_reportf_err(local_err, "vhost-user-scsi: vhost start failed: ");
+qemu_chr_fe_disconnect(&vs->conf.chardev);
+return;
+}
+
+/* Kick right away to begin processing requests already in vring */
+for (i = 0; i < vsc->dev.nvqs; i++) {
+VirtQueue *kick_vq = virtio_get_queue(vdev, i);
+
+if (!virtio_queue_get_desc_addr(vdev, i)) {
+continue;
+}
+event_notifier_set(virtio_queue_get_host_notifier(kick_vq));
+}
 }
 
 static int vhost_user_scsi_connect(DeviceState *dev, Error **errp)
@@ -246,9 +286,9 @@ static void vhost_user_scsi_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
-virtio_scsi_common_realize(dev, vhost_dummy_handle_output,
-   vhost_dummy_handle_output,
-   vhost_dummy_handle_output, &err);
+virtio_scsi_common_realize(dev, vhost_user_scsi_handle_output,
+   vhost_user_scsi_handle_output,
+   vhost_user_scsi_handle_output, &err);
 if (err != NULL) {
 error_propagate(errp, err);
 return;
-- 
2.41.0




Re: [PATCH v2 6/8] iotests/{024, 271}: add testcases for qemu-img rebase

2023-09-19 Thread Hanna Czenczek

On 15.09.23 18:20, Andrey Drobyshev wrote:

As the previous commit changes the logic of "qemu-img rebase" (it's using
write alignment now), let's add a couple more test cases which would
ensure it works correctly.  In particular, the following scenarios:

024: add test case for rebase within one backing chain when the overlay
  cluster size > backings cluster size;
271: add test case for rebase images that contain subclusters.  Check
  that no extra allocations are being made.

Signed-off-by: Andrey Drobyshev 
---
  tests/qemu-iotests/024 | 60 ++
  tests/qemu-iotests/024.out | 43 +
  tests/qemu-iotests/271 | 66 ++
  tests/qemu-iotests/271.out | 42 
  4 files changed, 211 insertions(+)


Interestingly, the new case in 024 hangs before this series.  So it 
isn’t just an optimization, but a fix, actually!


Reviewed-by: Hanna Czenczek 




Re: Deadlock with SATA CD I/O and eject

2023-09-19 Thread Hanna Czenczek

On 19.09.23 12:54, Kevin Wolf wrote:

Am 18.09.2023 um 19:28 hat John Levon geschrieben:

Observed with base of qemu 6.2.0, but from code inspection it looks to me like
it's still current in upstream master. Apologies if I have missed a fix in this
area.

Symptom: run a UEFI-booted SATA CD Windows installer. When it hits "Loading
files.." screen, run an eject e.g.

virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae '{"execute":"eject", "arguments": { 
"id": "sata0-0-0" } }'

qemu will get stuck like so:

(gdb) bt
#0  0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6
#1  0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
#3  0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
#4  0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
#5  0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
#6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
#7  0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
#8  0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
#9  0x56181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
#10 0x56181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45

We are stuck forever here:

  351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
  352   bool poll)
...
  380 if (poll) {
  381 BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
  382 }

Because the blk root's ->in_flight is 1, as tested by the condition
blk_root_drained_poll().


Our blk->in_flight user is stuck here:

1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
...
1310 blk_dec_in_flight(blk);
1311 qemu_co_queue_wait(&blk->queued_requests, 
&blk->queued_requests_lock);
1312 blk_inc_in_flight(blk);

Note that before entering this stanza, blk->in_flight was 2. This turns out to
be due to the ide atapi code. In particular, the guest VM is generating lots of
read I/O. The "first IO" arrives into blk via:

cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()

This initial IO completes:

blk_aio_read_entry()->blk_aio_complete()

1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
1561 {
1562 if (acb->has_returned) {
1563 acb->common.cb(acb->common.opaque, acb->rwco.ret);
1564 blk_dec_in_flight(acb->rwco.blk);
1565 qemu_aio_unref(acb);
1566 }
1567 }

Line 1564 is what we need to move blk->in_flight down to zero, but that is never
reached! This is because of what happens at :1563

acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()

That is, the IO callback in the atapi code itself triggers more - synchronous - 
IO.

In the meantime, we start processing the blk_drain() code, so by the time this
blk_pread() actually gets handled, quiesce is set, and we get stuck in the
blk_wait_while_drained().

I don't know the qemu block stack well enough to propose an actual fix.

Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
that's just a band-aid instead of fixing the deadlock.

Any suggestions/clues/thoughts?

Related discussion: 
https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html

Actually, it seems we never fixed that problem either?


We didn’t, from my POV mainly because everyone had different suggestions 
and none of them looked nice enough that it was clear which one to pursue.


I.e., there are also 
https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00709.html and 
https://lists.gnu.org/archive/html/qemu-block/2023-04/msg00118.html .



Back then I suggested that blk_drain*() should disable request queuing
because its callers don't want to quiesce the BlockBackend, but rather
get their own requests completed. I think this approach would solve this
one as well.

Your experiment with moving the queuing later is basically what Paolo
suggested, though I'd still argue it's not the right thing to do because
while it seems to mostly work with both use cases of drain (give me
exclusive access vs. wait for my requests to complete), it's not optimal
for either one.

Kevin






Re: Deadlock with SATA CD I/O and eject

2023-09-19 Thread Kevin Wolf
Am 18.09.2023 um 19:28 hat John Levon geschrieben:
> 
> Observed with base of qemu 6.2.0, but from code inspection it looks to me like
> it's still current in upstream master. Apologies if I have missed a fix in 
> this
> area.
> 
> Symptom: run a UEFI-booted SATA CD Windows installer. When it hits "Loading
> files.." screen, run an eject e.g.
> 
> virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae 
> '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'
> 
> qemu will get stuck like so:
> 
> (gdb) bt
> #0  0x7f8ba4b16036 in ppoll () from /lib64/libc.so.6
> #1  0x561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
> #3  0x561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
> #4  0x561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
> #5  0x561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
> #6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
> #7  0x561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
> #8  0x561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
> #9  0x56181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
> #10 0x56181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45
> 
> We are stuck forever here:
> 
>  351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild 
> *parent,
>  352   bool poll)
> ...
>  380 if (poll) {
>  381 BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
>  382 }
> 
> Because the blk root's ->in_flight is 1, as tested by the condition
> blk_root_drained_poll().
> 
> 
> Our blk->in_flight user is stuck here:
> 
> 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
> ...
> 1310 blk_dec_in_flight(blk);
> 1311 qemu_co_queue_wait(&blk->queued_requests, 
> &blk->queued_requests_lock);
> 1312 blk_inc_in_flight(blk);
> 
> Note that before entering this stanza, blk->in_flight was 2. This turns out to
> be due to the ide atapi code. In particular, the guest VM is generating lots 
> of
> read I/O. The "first IO" arrives into blk via:
> 
> cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()
> 
> This initial IO completes:
> 
> blk_aio_read_entry()->blk_aio_complete()
> 
> 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
> 1561 {
> 1562 if (acb->has_returned) {
> 1563 acb->common.cb(acb->common.opaque, acb->rwco.ret);
> 1564 blk_dec_in_flight(acb->rwco.blk);
> 1565 qemu_aio_unref(acb);
> 1566 }
> 1567 }
> 
> Line 1564 is what we need to move blk->in_flight down to zero, but that is 
> never
> reached! This is because of what happens at :1563
> 
> acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()
> 
> That is, the IO callback in the atapi code itself triggers more - synchronous 
> - IO.
> 
> In the meantime, we start processing the blk_drain() code, so by the time this
> blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> blk_wait_while_drained().
> 
> I don't know the qemu block stack well enough to propose an actual fix.
> 
> Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty 
> sure
> that's just a band-aid instead of fixing the deadlock.
> 
> Any suggestions/clues/thoughts?

Related discussion: 
https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html

Actually, it seems we never fixed that problem either?

Back then I suggested that blk_drain*() should disable request queuing
because its callers don't want to quiesce the BlockBackend, but rather
get their own requests completed. I think this approach would solve this
one as well.

Your experiment with moving the queuing later is basically what Paolo
suggested, though I'd still argue it's not the right thing to do because
while it seems to mostly work with both use cases of drain (give me
exclusive access vs. wait for my requests to complete), it's not optimal
for either one.

Kevin




Re: [PATCH v2 5/8] qemu-img: rebase: avoid unnecessary COW operations

2023-09-19 Thread Hanna Czenczek

On 15.09.23 18:20, Andrey Drobyshev wrote:

When rebasing an image from one backing file to another, we need to
compare data from old and new backings.  If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.

Consider the following simple case (virtual_size == cluster_size == 64K):

base <-- inc1 <-- inc2

qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2

While doing the rebase, we'll write half of the cluster to inc2, and the
block layer will have to read the 2nd half of the same cluster from the
backing image inc1 while doing this write operation, although the whole
cluster was already read earlier to perform the data comparison.

In order to avoid these unnecessary I/O cycles, let's make sure every
write request is aligned to the overlay subcluster boundaries.  Using the
subcluster size is universal: for images which don't have subclusters it
equals the cluster size, so in any case we end up aligning to the
smallest unit of allocation.

Signed-off-by: Andrey Drobyshev 
---
  qemu-img.c | 76 --
  1 file changed, 56 insertions(+), 20 deletions(-)


Looks good, I like the changes from v1!  Two minor things:


diff --git a/qemu-img.c b/qemu-img.c
index fcd31d7b5b..83950af42b 100644
--- a/qemu-img.c
+++ b/qemu-img.c


[...]


@@ -3844,33 +3861,48 @@ static int img_rebase(int argc, char **argv)
  }
  }
  
+/*

+ * At this point we know that the region [offset; offset + n)
+ * is unallocated within the target image.  This region might be
+ * unaligned to the target image's (sub)cluster boundaries, as
+ * old backing may have smaller clusters (or have subclusters).
+ * We extend it to the aligned boundaries to avoid CoW on
+ * partial writes in blk_pwrite(),
+ */
+n += offset - QEMU_ALIGN_DOWN(offset, write_align);
+offset = QEMU_ALIGN_DOWN(offset, write_align);
+n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
+n = MIN(n, size - offset);
+assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) &&
+   n_alloc == n);
+
+/*
+ * Much like the with the target image, we'll try to read as much


s/the with the/with the/


+ * of the old and new backings as we can.
+ */
+n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+if (blk_new_backing) {
+n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+}


If we don’t have a check for blk_old_backing (old_backing_size is 0 if 
blk_old_backing is NULL), why do we have a check for blk_new_backing 
(new_backing_size is 0 if blk_new_backing is NULL)?


(Perhaps because the previous check was `offset >= new_backing_size || 
!blk_new_backing`, i.e. included exactly such a check – but I don’t 
think it’s necessary, new_backing_size will be 0 if blk_new_backing is 
NULL.)



+
  /*
   * Read old and new backing file and take into consideration that
   * backing files may be smaller than the COW image.
   */
-if (offset >= old_backing_size) {
-memset(buf_old, 0, n);
-buf_old_is_zero = true;
+memset(buf_old + n_old, 0, n - n_old);
+if (!n_old) {
+old_backing_eof = true;
  } else {
-if (offset + n > old_backing_size) {
-n = old_backing_size - offset;
-}
-
-ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
+ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
  if (ret < 0) {
  error_report("error while reading from old backing file");
  goto out;
  }
  }
  
-if (offset >= new_backing_size || !blk_new_backing) {

-memset(buf_new, 0, n);
-} else {
-if (offset + n > new_backing_size) {
-n = new_backing_size - offset;
-}
-
-ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
+memset(buf_new + n_new, 0, n - n_new);
+if (blk_new_backing && n_new) {


Same as above, I think we can drop the blk_new_backing check, just so 
that both blocks (for old and new) look the same.


(Also, the memset() already has to trust that n_new is 0 if 
blk_new_backing is NULL, so the check doesn’t make much sense from that 
perspective either, and makes the memset() look wrong.)




Re: [PULL 00/28] Block layer patches

2023-09-19 Thread Kevin Wolf
Am 18.09.2023 um 20:56 hat Stefan Hajnoczi geschrieben:
> Hi Kevin,
> I believe that my own commit "block-coroutine-wrapper: use
> qemu_get_current_aio_context()" breaks this test. The failure is
> non-deterministic (happens about 1 out of 4 runs).
> 
> It seems the job hangs and the test times out in vm.run_job('job1', wait=5.0).
> 
> I haven't debugged it yet but wanted to share this information to save
> some time. Tomorrow I'll investigate further.

Yes, it's relatively easily reproducible if I run the test in a loop,
and I can't seem to reproduce it without the last patch. Should I
unstage the full series again, or do you think that the last patch is
really optional this time?

However, I'm unsure how the stack traces I'm seeing are related to your
patch. Maybe it just made an existing bug more likely to be triggered?

What I'm seeing is that the reader lock is held by an iothread that is
waiting for its AioContext lock to make progress:

Thread 3 (Thread 0x7f811e9346c0 (LWP 26390) "qemu-system-x86"):
#0  0x7f81250aaf80 in __lll_lock_wait () at /lib64/libc.so.6
#1  0x7f81250b149a in pthread_mutex_lock@@GLIBC_2.2.5 () at /lib64/libc.so.6
#2  0x55b7b170967e in qemu_mutex_lock_impl (mutex=0x55b7b34e3080, 
file=0x55b7b199e1f7 "../util/async.c", line=728) at 
../util/qemu-thread-posix.c:94
#3  0x55b7b1709953 in qemu_rec_mutex_lock_impl (mutex=0x55b7b34e3080, 
file=0x55b7b199e1f7 "../util/async.c", line=728) at 
../util/qemu-thread-posix.c:149
#4  0x55b7b1728318 in aio_context_acquire (ctx=0x55b7b34e3020) at 
../util/async.c:728
#5  0x55b7b1727c49 in co_schedule_bh_cb (opaque=0x55b7b34e3020) at 
../util/async.c:565
#6  0x55b7b1726f1c in aio_bh_call (bh=0x55b7b34e2e70) at ../util/async.c:169
#7  0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b34e3020) at 
../util/async.c:216
#8  0x55b7b170351d in aio_poll (ctx=0x55b7b34e3020, blocking=true) at 
../util/aio-posix.c:722
#9  0x55b7b1518604 in iothread_run (opaque=0x55b7b2904460) at 
../iothread.c:63
#10 0x55b7b170a955 in qemu_thread_start (args=0x55b7b34e36b0) at 
../util/qemu-thread-posix.c:541
#11 0x7f81250ae15d in start_thread () at /lib64/libc.so.6
#12 0x7f812512fc00 in clone3 () at /lib64/libc.so.6

On the other hand, the main thread wants to acquire the writer lock,
but it holds the AioContext lock of the iothread (it takes it in
job_prepare_locked()):

Thread 1 (Thread 0x7f811f4b7b00 (LWP 26388) "qemu-system-x86"):
#0  0x7f8125122356 in ppoll () at /lib64/libc.so.6
#1  0x55b7b172eae0 in qemu_poll_ns (fds=0x55b7b34ec910, nfds=1, timeout=-1) 
at ../util/qemu-timer.c:339
#2  0x55b7b1704ebd in fdmon_poll_wait (ctx=0x55b7b3269210, 
ready_list=0x7ffc90b05680, timeout=-1) at ../util/fdmon-poll.c:79
#3  0x55b7b1703284 in aio_poll (ctx=0x55b7b3269210, blocking=true) at 
../util/aio-posix.c:670
#4  0x55b7b1567c3b in bdrv_graph_wrlock (bs=0x0) at 
../block/graph-lock.c:145
#5  0x55b7b1554c1c in blk_remove_bs (blk=0x55b7b4425800) at 
../block/block-backend.c:916
#6  0x55b7b1554779 in blk_delete (blk=0x55b7b4425800) at 
../block/block-backend.c:497
#7  0x55b7b1554133 in blk_unref (blk=0x55b7b4425800) at 
../block/block-backend.c:557
#8  0x55b7b157a149 in mirror_exit_common (job=0x55b7b4419000) at 
../block/mirror.c:696
#9  0x55b7b1577015 in mirror_prepare (job=0x55b7b4419000) at 
../block/mirror.c:807
#10 0x55b7b153a1a7 in job_prepare_locked (job=0x55b7b4419000) at 
../job.c:988
#11 0x55b7b153a0d9 in job_txn_apply_locked (job=0x55b7b4419000, fn=0x55b7b153a110 <job_prepare_locked>) at ../job.c:191
#12 0x55b7b1538b6d in job_do_finalize_locked (job=0x55b7b4419000) at 
../job.c:1011
#13 0x55b7b153a886 in job_completed_txn_success_locked (job=0x55b7b4419000) 
at ../job.c:1068
#14 0x55b7b1539372 in job_completed_locked (job=0x55b7b4419000) at 
../job.c:1082
#15 0x55b7b153a71b in job_exit (opaque=0x55b7b4419000) at ../job.c:1103
#16 0x55b7b1726f1c in aio_bh_call (bh=0x7f8110005470) at ../util/async.c:169
#17 0x55b7b17270ee in aio_bh_poll (ctx=0x55b7b3269210) at 
../util/async.c:216
#18 0x55b7b1702c05 in aio_dispatch (ctx=0x55b7b3269210) at 
../util/aio-posix.c:423
#19 0x55b7b1728a14 in aio_ctx_dispatch (source=0x55b7b3269210, 
callback=0x0, user_data=0x0) at ../util/async.c:358
#20 0x7f8126c31c7f in g_main_dispatch (context=0x55b7b3269720) at 
../glib/gmain.c:3454
#21 g_main_context_dispatch (context=0x55b7b3269720) at ../glib/gmain.c:4172
#22 0x55b7b1729c98 in glib_pollfds_poll () at ../util/main-loop.c:290
#23 0x55b7b1729572 in os_host_main_loop_wait (timeout=27462700) at 
../util/main-loop.c:313
#24 0x55b7b1729452 in main_loop_wait (nonblocking=0) at 
../util/main-loop.c:592
#25 0x55b7b119a1eb in qemu_main_loop () at ../softmmu/runstate.c:772
#26 0x55b7b14c102d in qemu_default_main () at ../softmmu/main.c:37
#27 0x55b7b14c1068 in main (argc=44, argv=0x7ffc90b05d58) at 
../softmmu/main.c:48

At first I thought we just need to look into the

Re: [PATCH 1/2] blockdev: qmp_transaction: harden transaction properties for bitmaps

2023-09-19 Thread Andrey Zhadchenko

Hi!

Thanks for the review

On 9/12/23 21:29, Vladimir Sementsov-Ogievskiy wrote:

On 04.09.23 11:31, Andrey Zhadchenko wrote:

Unlike other transaction commands, bitmap operations do not drain target
bds. If we have an IOThread, this may result in some inconsistencies, as
bitmap content may change during transaction command.
Add bdrv_drained_begin()/end() to bitmap operations.

Signed-off-by: Andrey Zhadchenko


Hi!

First, please always include cover letter when more than 1 patch.

Next. Hmm. Good idea, but I'm afraid that's still not enough.

Assume you have two BSs A and B in two different iothreads. So, the 
sequence may be like this:


1. drain_begin A

2. do operation with bitmap in A

3. guest writes to B, B is modified and bitmap in B is modified as well

4. drain_begin B

5. do operation with bitmap in B

6. drain_end B

7. drain_end A

Users may expect that all the operations are done atomically with respect 
to any guest I/O. And if the operations are dependent, the 
intermediate write [3] may break the result.


I see your point, but can the difference really be observed in this 
case? It is probably only relevant if the user can copy/merge bitmaps 
between block nodes (as far as I know this is not the case for now)




So, we should drain all participant drives during the whole 
transaction. The simplest solution is a bdrv_drain_all_begin() / 
bdrv_drain_all_end() pair in qmp_transaction(); could we start with it?




That would definitely get rid of all the problems, but I do not really like 
the idea of draining unrelated nodes.


What do you think if I add a new function prototype to 
TransactionActionDrv which returns the transaction's bds? Then we can get 
a list of all bds and lock them before preparing and clean up afterwards.
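For what it's worth, the gathering step could look roughly like the toy model below: each action reports the nodes it touches, and the transaction deduplicates them so the whole union can be drained before any prepare callback runs. This is only a sketch; the `Action` struct, `collect_nodes()` and the two-node limit are made up for illustration and are not QEMU API.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 8

/* Hypothetical per-action report of the nodes it touches
 * (NULL-terminated, at most two for this toy model). */
typedef struct Action {
    const char *nodes[2];
} Action;

/* Collect the union of all nodes touched by the actions, without
 * duplicates, so the caller can drain every node exactly once before
 * running any action and undrain them after the last one. */
static int collect_nodes(const Action *acts, int n_acts,
                         const char *out[], int max_out)
{
    int n = 0;
    for (int i = 0; i < n_acts; i++) {
        for (int j = 0; j < 2 && acts[i].nodes[j]; j++) {
            int dup = 0;
            for (int k = 0; k < n; k++) {
                if (strcmp(out[k], acts[i].nodes[j]) == 0) {
                    dup = 1;
                    break;
                }
            }
            if (!dup && n < max_out) {
                out[n++] = acts[i].nodes[j];
            }
        }
    }
    return n;
}
```

With the deduplicated list in hand, the transaction would bracket the whole prepare/commit sequence with one drained section per node, which closes the window for the interleaved guest write in step [3] above.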




Re: [PATCH 20/22] tests: extend test 131 to cover availability of the discard operation

2023-09-19 Thread Alexander Ivanov

On 9/18/23 20:00, Denis V. Lunev wrote:

This patch contains a test which minimally covers discard and the new cluster
allocation logic.

The following checks are added:
* write 2 clusters, discard the first allocated
* write another cluster, check that the hole is filled
* write 2 clusters, discard the first allocated, write 1 cluster at
   non-aligned to cluster offset (2 new clusters should be allocated)

Signed-off-by: Denis V. Lunev 
---
  tests/qemu-iotests/131 | 31 +++
  tests/qemu-iotests/131.out | 38 ++
  2 files changed, 69 insertions(+)

diff --git a/tests/qemu-iotests/131 b/tests/qemu-iotests/131
index 304bbb3f61..324008b3f6 100755
--- a/tests/qemu-iotests/131
+++ b/tests/qemu-iotests/131
@@ -74,6 +74,37 @@ poke_file "$TEST_IMG" "$inuse_offset" "\x59\x6e\x6f\x74"
  echo "== read corrupted image with repairing =="
  { $QEMU_IO -c "read -P 0x11 $CLUSTER_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
  
+echo "== check discard =="
+
+# Clear image
+_make_test_img $size
+
+{ $QEMU_IO -c "write -P 0x11 0 $CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "discard 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0 0 $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+
+echo "== check simple allocation over the discarded hole =="
+
+{ $QEMU_IO -c "write -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+
+echo "== check more complex allocation over the discard hole =="
+
+# Clear image
+_make_test_img $size
+
+{ $QEMU_IO -c "write -P 0x11 $CLUSTER_DBL_SIZE $CLUSTER_DBL_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "discard $CLUSTER_DBL_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+# There is 1 cluster hole. Fill it fully and allocate 1 cluster at the end
+{ $QEMU_IO -c "write -P 0x12 $CLUSTER_HALF_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IMG map "$TEST_IMG"; } 2>&1 | _filter_qemu_img_map
+{ $QEMU_IO -c "read -P 0x12 $CLUSTER_HALF_SIZE $CLUSTER_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "read -P 0 0 $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+{ $QEMU_IO -c "read -P 0 $((CLUSTER_SIZE + CLUSTER_HALF_SIZE)) $CLUSTER_HALF_SIZE" "$TEST_IMG"; } 2>&1 | _filter_qemu_io | _filter_testdir
+
  echo "== allocate with backing =="
  # Verify that allocating clusters works fine even when there is a backing image.
  # Regression test for a bug where we would pass a buffer read from the backing
diff --git a/tests/qemu-iotests/131.out b/tests/qemu-iotests/131.out
index d2904578df..27df91ca97 100644
--- a/tests/qemu-iotests/131.out
+++ b/tests/qemu-iotests/131.out
@@ -26,6 +26,44 @@ read 524288/524288 bytes at offset 0
  Repairing image was not closed correctly
  read 1048576/1048576 bytes at offset 1048576
  1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check discard ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0   0x20TEST_DIR/t.IMGFMT
+discard 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0x100x10TEST_DIR/t.IMGFMT
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check simple allocation over the discarded hole ==
+wrote 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0x100x10TEST_DIR/t.IMGFMT
+0x200x10TEST_DIR/t.IMGFMT
+read 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== check more complex allocation over the discard hole ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 2097152/2097152 bytes at offset 2097152
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 1048576/1048576 bytes at offset 2097152
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 1048576/1048576 bytes at offset 524288
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset  Length  File
+0   0x10TEST_DIR/t.IMGFMT
+0x100x10TEST_DIR/t.IMGFMT
+0x300x10TEST_DIR/t.IMGFMT
+read

Re: [PATCH 4/7] block/dirty-bitmap: Clean up local variable shadowing

2023-09-19 Thread Kevin Wolf
Am 19.09.2023 um 07:48 hat Markus Armbruster geschrieben:
> Kevin Wolf  writes:
> 
> > Am 31.08.2023 um 15:25 hat Markus Armbruster geschrieben:
> >> Local variables shadowing other local variables or parameters make the
> >> code needlessly hard to understand.  Tracked down with -Wshadow=local.
> >> Clean up: delete inner declarations when they are actually redundant,
> >> else rename variables.
> >> 
> >> Signed-off-by: Markus Armbruster 
> >> ---
> >>  block/monitor/bitmap-qmp-cmds.c | 2 +-
> >>  block/qcow2-bitmap.c| 3 +--
> >>  2 files changed, 2 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/block/monitor/bitmap-qmp-cmds.c b/block/monitor/bitmap-qmp-cmds.c
> >> index 55f778f5af..4d018423d8 100644
> >> --- a/block/monitor/bitmap-qmp-cmds.c
> >> +++ b/block/monitor/bitmap-qmp-cmds.c
> >> @@ -276,7 +276,7 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, const char *target,
> >>  
> >>  for (lst = bms; lst; lst = lst->next) {
> >>  switch (lst->value->type) {
> >> -const char *name, *node;
> >> +const char *name;
> >>  case QTYPE_QSTRING:
> >>  name = lst->value->u.local;
> >>  src = bdrv_find_dirty_bitmap(bs, name);
> >
> > The names in this function are all over the place... A more ambitious
> > patch could rename the parameters to dst_node/dst_bitmap and these
> > variables to src_node/src_bitmap to get some more consistency (both with
> > each other and with the existing src/dst variables).
> 
> What exactly would you like me to consider?  Perhaps:
> 
> * Rename parameter @node to @dst_node
> 
> * Rename which parameter to @dst_bitmap?

Parameter @target to @dst_bitmap (it's the name of a bitmap in
@node/@dst_node)

> * Rename nested local @node to @src_node
> 
> * Rename which local variable to @src_bitmap?

@name to @src_bitmap (it's the name of a bitmap in the local
@node/@src_node)

> * Move nested locals to function scope

I don't really mind either way, but yes, maybe that would be more
conventional.

That you couldn't tell for two of the variables what they actually are
probably supports the argument that they should be renamed. :-)

Kevin




Re: [PATCH 19/22] parallels: naive implementation of parallels_co_pdiscard

2023-09-19 Thread Alexander Ivanov

On 9/18/23 20:00, Denis V. Lunev wrote:

* Discarding with backing stores is not supported by the format.
* There is no buffering/queueing of the discard operation.
* Only operations aligned to the cluster are supported.

Signed-off-by: Denis V. Lunev 
---
  block/parallels.c | 46 ++
  1 file changed, 46 insertions(+)

diff --git a/block/parallels.c b/block/parallels.c
index d9d36c514b..1ef23f6669 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -537,6 +537,51 @@ parallels_co_readv(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
  return ret;
  }
  
+
+static int coroutine_fn
+parallels_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+    int ret = 0;
+    uint32_t cluster, count;
+    BDRVParallelsState *s = bs->opaque;
+
+    /*
+     * The image does not support ZERO mark inside the BAT, which means that
+     * stale data could be exposed from the backing file.
+     */
+    if (bs->backing) {
+        return -ENOTSUP;
+    }
+
+    if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) {
+        return -ENOTSUP;
+    } else if (!QEMU_IS_ALIGNED(bytes, s->cluster_size)) {
+        return -ENOTSUP;
+    }
+
+    cluster = offset / s->cluster_size;
+    count = bytes / s->cluster_size;
+
+    qemu_co_mutex_lock(&s->lock);
+    for (; count > 0; cluster++, count--) {
+        int64_t host_off = bat2sect(s, cluster) << BDRV_SECTOR_BITS;
+        if (host_off == 0) {
+            continue;
+        }
+
+        ret = bdrv_co_pdiscard(bs->file, host_off, s->cluster_size);
+        if (ret < 0) {
+            goto done;
+        }
+
+        parallels_set_bat_entry(s, cluster, 0);
+        bitmap_clear(s->used_bmap, host_cluster_index(s, host_off), 1);
+    }
+done:
+    qemu_co_mutex_unlock(&s->lock);
+    return ret;
+}
+
  static void parallels_check_unclean(BlockDriverState *bs,
  BdrvCheckResult *res,
  BdrvCheckMode fix)
@@ -1417,6 +1462,7 @@ static BlockDriver bdrv_parallels = {
     .bdrv_co_create         = parallels_co_create,
     .bdrv_co_create_opts    = parallels_co_create_opts,
     .bdrv_co_check          = parallels_co_check,
+    .bdrv_co_pdiscard       = parallels_co_pdiscard,
  };
  
  static void bdrv_parallels_init(void)


Reviewed-by: Alexander Ivanov 

--
Best regards,
Alexander Ivanov




Re: [PATCH 04/22] parallels: invent parallels_opts_prealloc() helper to parse prealloc opts

2023-09-19 Thread Alexander Ivanov

On 9/18/23 20:00, Denis V. Lunev wrote:

This patch creates the above-mentioned helper and moves its usage to the
beginning of parallels_open(). This simplifies parallels_open() a bit.

The patch also ensures that we always store prealloc_size on the block driver
state in sectors. This makes the code cleaner and avoids the false impression
at the assignment site that the value is in bytes.

Signed-off-by: Denis V. Lunev 
---
  block/parallels.c | 72 +--
  1 file changed, 44 insertions(+), 28 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index af7be427c9..ae006e7fc7 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1025,6 +1025,44 @@ static int parallels_update_header(BlockDriverState *bs)
  return bdrv_pwrite_sync(bs->file, 0, size, s->header, 0);
  }
  
+
+static int parallels_opts_prealloc(BlockDriverState *bs, QDict *options,
+                                   Error **errp)
+{
+    int err;
+    char *buf;
+    int64_t bytes;
+    BDRVParallelsState *s = bs->opaque;
+    Error *local_err = NULL;
+    QemuOpts *opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp);
+    if (!opts) {
+        return -ENOMEM;
+    }
+
+    err = -EINVAL;
+    if (!qemu_opts_absorb_qdict(opts, options, errp)) {
+        goto done;
+    }
+
+    bytes = qemu_opt_get_size_del(opts, PARALLELS_OPT_PREALLOC_SIZE, 0);
+    s->prealloc_size = bytes >> BDRV_SECTOR_BITS;
+    buf = qemu_opt_get_del(opts, PARALLELS_OPT_PREALLOC_MODE);
+    /* prealloc_mode can be downgraded later during allocate_clusters */
+    s->prealloc_mode = qapi_enum_parse(&prealloc_mode_lookup, buf,
+                                       PRL_PREALLOC_MODE_FALLOCATE,
+                                       &local_err);
+    g_free(buf);
+    if (local_err != NULL) {
+        error_propagate(errp, local_err);
+        goto done;
+    }
+    err = 0;
+
+done:
+    qemu_opts_del(opts);
+    return err;
+}
+
+
  static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
Error **errp)
  {
@@ -1033,11 +1071,13 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
  int ret, size, i;
  int64_t file_nb_sectors, sector;
  uint32_t data_start;
-    QemuOpts *opts = NULL;
-    Error *local_err = NULL;
-    char *buf;
  bool data_off_is_correct;
  
+    ret = parallels_opts_prealloc(bs, options, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
  ret = bdrv_open_file_child(NULL, options, "file", bs, errp);
  if (ret < 0) {
  return ret;
@@ -1078,6 +1118,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
  ret = -EFBIG;
  goto fail;
  }
+    s->prealloc_size = MAX(s->tracks, s->prealloc_size);
  s->cluster_size = s->tracks << BDRV_SECTOR_BITS;
  
  s->bat_size = le32_to_cpu(ph.bat_entries);

@@ -1117,29 +1158,6 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
  s->header_size = size;
  }
  
-    opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp);
-    if (!opts) {
-        goto fail_options;
-    }
-
-    if (!qemu_opts_absorb_qdict(opts, options, errp)) {
-        goto fail_options;
-    }
-
-    s->prealloc_size =
-        qemu_opt_get_size_del(opts, PARALLELS_OPT_PREALLOC_SIZE, 0);
-    s->prealloc_size = MAX(s->tracks, s->prealloc_size >> BDRV_SECTOR_BITS);
-    buf = qemu_opt_get_del(opts, PARALLELS_OPT_PREALLOC_MODE);
-    /* prealloc_mode can be downgraded later during allocate_clusters */
-    s->prealloc_mode = qapi_enum_parse(&prealloc_mode_lookup, buf,
-                                       PRL_PREALLOC_MODE_FALLOCATE,
-                                       &local_err);
-    g_free(buf);
-    if (local_err != NULL) {
-        error_propagate(errp, local_err);
-        goto fail_options;
-    }
-
  if (ph.ext_off) {
  if (flags & BDRV_O_RDWR) {
  /*
@@ -1214,10 +1232,8 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
  
 fail_format:
     error_setg(errp, "Image not in Parallels format");
-fail_options:
     ret = -EINVAL;
 fail:
-    qemu_opts_del(opts);
  /*
   * "s" object was allocated by g_malloc0 so we can safely
   * try to free its fields even they were not allocated.


Reviewed-by: Alexander Ivanov 

--
Best regards,
Alexander Ivanov




Re: [PATCH 03/22] parallels: fix memory leak in parallels_open()

2023-09-19 Thread Alexander Ivanov

On 9/18/23 20:00, Denis V. Lunev wrote:

We should free opts allocated through qemu_opts_create() at the end.

Signed-off-by: Denis V. Lunev 
---
  block/parallels.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/block/parallels.c b/block/parallels.c
index 428f72de1c..af7be427c9 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -1217,6 +1217,7 @@ fail_format:
  fail_options:
  ret = -EINVAL;
  fail:
+    qemu_opts_del(opts);
  /*
   * "s" object was allocated by g_malloc0 so we can safely
   * try to free its fields even they were not allocated.


Reviewed-by: Alexander Ivanov 

--
Best regards,
Alexander Ivanov




Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-09-19 Thread Andrey Drobyshev
On 9/19/23 11:18, Hanna Czenczek wrote:
> On 15.09.23 18:20, Andrey Drobyshev wrote:
>> Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
>> the data read from the old and new backing files are aligned using
>> BlockDriverState (or BlockBackend later on) referring to the target
>> image.
>> However, this isn't quite right, because buf_new is only being used for
>> reading from the new backing, while buf_old is being used for both
>> reading
>> from the old backing and writing to the target.  Let's take that into
>> account
>> and use more appropriate values as alignments.
>>
>> Signed-off-by: Andrey Drobyshev 
>> ---
>>   qemu-img.c | 9 +++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 50660ba920..d12e4a4753 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
>>   int64_t n;
>>   float local_progress = 0;
>>   -    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
>> -    buf_new = blk_blockalign(blk, IO_BUF_SIZE);
>> +    if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
>> +    bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
>> +    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
>> +    } else {
>> +    buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
>> +    }
> 
> As I read this, if blk_old_backing is NULL, we will go to the
> blk_blockalign(blk_old_backing, IO_BUF_SIZE) path.  I think if it is
> NULL, we should align on blk instead.
> 
> Hanna

You're right, thanks for noticing.  Will fix that.

> 
>> +    buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
>>     size = blk_getlength(blk);
>>   if (size < 0) {
> 




Re: [PATCH v2 4/8] qemu-img: add chunk size parameter to compare_buffers()

2023-09-19 Thread Hanna Czenczek

On 15.09.23 18:20, Andrey Drobyshev wrote:

Add @chsize param to the function which, if non-zero, would represent
the chunk size to be used for comparison.  If it's zero, then
BDRV_SECTOR_SIZE is used as default chunk size, which is the previous
behaviour.

In particular, we're going to use this param in img_rebase() to make the
write requests aligned to a predefined alignment value.

Signed-off-by: Andrey Drobyshev 
---
  qemu-img.c | 22 ++
  1 file changed, 14 insertions(+), 8 deletions(-)


With the comment fix Eric has suggested:

Reviewed-by: Hanna Czenczek 




Re: [PATCH v2 3/8] qemu-img: rebase: use backing files' BlockBackend for buffer alignment

2023-09-19 Thread Hanna Czenczek

On 15.09.23 18:20, Andrey Drobyshev wrote:

Since commit bb1c05973cf ("qemu-img: Use qemu_blockalign"), buffers for
the data read from the old and new backing files are aligned using
BlockDriverState (or BlockBackend later on) referring to the target image.
However, this isn't quite right, because buf_new is only being used for
reading from the new backing, while buf_old is being used for both reading
from the old backing and writing to the target.  Let's take that into account
and use more appropriate values as alignments.

Signed-off-by: Andrey Drobyshev 
---
  qemu-img.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 50660ba920..d12e4a4753 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3750,8 +3750,13 @@ static int img_rebase(int argc, char **argv)
  int64_t n;
  float local_progress = 0;
  
-    buf_old = blk_blockalign(blk, IO_BUF_SIZE);
-    buf_new = blk_blockalign(blk, IO_BUF_SIZE);
+    if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk)) >
+        bdrv_opt_mem_align(blk_bs(blk_old_backing))) {
+        buf_old = blk_blockalign(blk, IO_BUF_SIZE);
+    } else {
+        buf_old = blk_blockalign(blk_old_backing, IO_BUF_SIZE);
+    }


As I read this, if blk_old_backing is NULL, we will go to the 
blk_blockalign(blk_old_backing, IO_BUF_SIZE) path.  I think if it is 
NULL, we should align on blk instead.


Hanna


+    buf_new = blk_blockalign(blk_new_backing, IO_BUF_SIZE);
 
     size = blk_getlength(blk);
     if (size < 0) {