On 12.11.2025 at 17:41, [email protected] wrote:
> From: Andrey Drobyshev <[email protected]>
>
> Hi all,
>
> There's a bug in the block layer which leads to an assertion failure and crash.
> The reproducer is flaky, which suggests there's a race involved. Here's how
> it's reproduced:
>
> 1. Run QEMU with 2 disks: one system (with actual guest) and one empty; attach
> them to an iothread;
> 2. In the guest, format the empty disk and start sequential IO on it;
> 3. At the host, start growing a chain of snapshots upon the empty image.
>
> Here're the scripts:
> [...]
> So draining doesn't work as expected: in-flight requests are supposed to be
> polled right before calling bdrv_drain_assert_idle(), but there's new IO
> arriving:
>
> void coroutine_mixed_fn bdrv_drain_all_begin(void)
> {
>     ...
>     bdrv_drain_all_begin_nopoll();
>
>     /* Now poll the in-flight requests */
>     AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll()); <---------
>
>     while ((bs = bdrv_next_all_states(bs))) {
>         bdrv_drain_assert_idle(bs);
>     }
> }
Took me a bit, but I can reproduce the problem. Yes, "new" I/O arrives
between AIO_WAIT_WHILE_UNLOCKED() and the assertion, which means that
drain itself doesn't work correctly.
After some debugging, I ended up adding some new assertions that would
trigger when the new I/O is added during this period, and what I caught
is this (same in multiple runs):
(gdb) qemu bt
#0 __GI_abort () at abort.c:95
#1 0x00007f99cb823639 in __assert_fail_base (fmt=<optimized out>,
assertion=<optimized out>, file=<optimized out>, line=<optimized out>,
function=<optimized out>) at assert.c:118
#2 0x00007f99cb8340af in __assert_fail (assertion=<optimized out>,
file=<optimized out>, line=<optimized out>, function=<optimized out>) at
assert.c:127
#3 0x0000557c2a0d64c6 in blk_wait_while_drained (blk=0x557c69d18020) at
../block/block-backend.c:1325
#4 0x0000557c2a0d5069 in blk_co_do_pwritev_part (blk=0x1,
offset=140292968848728, bytes=0, qiov=0x7f99b4004568,
qiov_offset=qiov_offset@entry=0, flags=3416308576)
at ../block/block-backend.c:1417
#5 0x0000557c2a0d55ae in blk_aio_write_entry (opaque=0x7f99b4002020) at
../block/block-backend.c:1635
#6 0x0000557c2a30a376 in coroutine_trampoline (i0=<optimized out>,
i1=<optimized out>) at ../util/coroutine-ucontext.c:175
#7 0x00007f99cb856f70 in ??? () at
../sysdeps/unix/sysv/linux/x86_64/__start_context.S:66
#8 0x00007f99c35f0260 in ??? ()
#9 0x0000000000000000 in ??? ()
[Thread 0x7f98a94fc6c0 (LWP 321021) exited]
[Thread 0x7f99c2dfd6c0 (LWP 320958) exited]
Coroutine at 0x7f99c35fe588:
#0 qemu_coroutine_switch (from_=from_@entry=0x7f99c35fe588,
to_=to_@entry=0x7f99b40051f0, action=action@entry=COROUTINE_ENTER) at
../util/coroutine-ucontext.c:322
#1 0x0000557c2a3089e8 in qemu_aio_coroutine_enter
(ctx=ctx@entry=0x557c68d5cae0, co=<optimized out>) at
../util/qemu-coroutine.c:293
#2 0x0000557c2a3071f7 in co_schedule_bh_cb (opaque=0x557c68d5cae0) at
../util/async.c:547
#3 0x0000557c2a306b79 in aio_bh_call (bh=0x557c68d822d0) at ../util/async.c:172
#4 aio_bh_poll (ctx=ctx@entry=0x557c68d5cae0) at ../util/async.c:219
#5 0x0000557c2a2ee9e5 in aio_poll (ctx=0x557c68d5cae0, blocking=<optimized
out>) at ../util/aio-posix.c:719
#6 0x0000557c2a0ac032 in iothread_run (opaque=opaque@entry=0x557c68d810a0) at
../iothread.c:63
#7 0x0000557c2a2f2aee in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:393
#8 0x00007f99cb893f54 in start_thread (arg=<optimized out>) at
pthread_create.c:448
#9 0x00007f99cb91732c in __GI___clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
This suggests that the following happens:
1. Drain section starts
2. A new request still arrives at the BlockBackend from the iothread and
is queued in blk_wait_while_drained(). This alone is already
interesting because we thought that the BlockBackend queuing was only
necessary for IDE and friends, not for virtio-blk. That is probably
only true in the main thread.
3. Drain section ends, blk_root_drained_end() calls qemu_co_enter_next()
and that is effectively aio_co_wake(). As the request coroutine is in
the iothread and blk_root_drained_end() in the main thread, the
coroutine is only scheduled in its AioContext, but doesn't run yet.
4. Another drain section starts
4a. bdrv_drain_all_begin_nopoll()
4b. AIO_WAIT_WHILE_UNLOCKED(NULL, bdrv_drain_all_poll());
5. The iothread processes its BH list and reenters the queued request,
which increases blk->in_flight and bs->in_flight and runs the actual
request even though we're drained again. Oops.
4c. bdrv_drain_assert_idle() fails
After a simple s/if/while/ in blk_wait_while_drained(), I can't
reproduce the bug any more. I don't think it's a fully correct fix
because blk->in_flight is still increased for a short time and that
could in theory still trigger the assertion. It's just that the window
is now much smaller because the request doesn't actually execute but is
requeued right away.
I need to think a bit more about the right synchronisation tomorrow.
> I've bisected the issue, and it seems to be introduced by Fiona's series [0]
> for fixing a deadlock. Namely, before the commit b13f546545 ("block: move
> drain outside of bdrv_root_unref_child()") my reproducer above produces a
> deadlock similar to what we had in [1]. And after commit b13f546545 is
> applied, we get the crash.
>
> Attached is a naive fix which simply eliminates global draining (i.e. draining
> the whole block graph) in the problematic codepaths. While global draining
> might indeed be redundant there, the real problem, i.e. the race, still remains.
Yes, we don't want to just paper over the bug.
Kevin