On 5/20/26 21:38, Denis V. Lunev wrote:
> Changes since v1
> ----------------
>
> v1 was a two-patch series tracking two independent missed-wakeup
> races on the qemu shutdown path.
>
>   * Patch 1, "block/graph-lock: fix missed wakeup in
>     bdrv_graph_co_rdunlock()", was applied as e3082ab3b3 by Kevin
>     and is now in tree. This v2 carries only the remaining race.
>
>   * Per Kevin's review of v1 patch 2 [1], the cache_clean_timer
>     hang is no longer worked around inside block/qcow2.c. Instead,
>     the underlying primitive -- qemu_co_sleep_wake() -- is fixed,
>     closing the lost-wakeup window for every caller
>     (cache_clean_timer, block_copy_kick, ...) rather than just
>     qcow2. Cancellation latency through qemu_co_sleep_wake() drops
>     from the "next 1 s tick" of the v1 workaround to the cost of a
>     single aio_co_wake() call.
>
> [1] https://lore.kernel.org/qemu-devel/[email protected]/
>
> Problem
> -------
>
> The qemu shutdown / blockdev-close path can deadlock permanently on
> upstream master. The main thread enters ppoll(timeout=-1) holding
> BQL, no other thread has a wake source that points back at it, and
> qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> deadlock, not a slow operation; RCU, the vCPUs and every iothread
> path that needs the BQL stall behind it.
>
> Two independent missed-wakeup races in the block layer contributed
> to the symptom on v1. Both shared the same shape: a waiter arms on
> one side, the waker reads stale state on its fast path and silently
> skips the kick, and nothing else on the AioContext fires to
> recover. The first (block/graph-lock) was fixed by e3082ab3b3 and
> is now in tree. This patch closes the second one, exposed in
> qcow2's cache_clean_timer cancellation path:
>
>   ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
>
> The race diagram and the exact stale-state read are in the patch's
> commit message.
>
> Reproducer
> ----------
>
> Environment: 4-vCPU VM guest, kernel 6.12.x, upstream master at
> e3082ab3b3 (with the graph-lock fix already applied). On modern
> bare-metal the window is narrow enough that the hang rarely
> reproduces without a VM -- a VM guest under full CPU saturation is
> what makes the timing reliable.
>
>     # reproducer
>     stress-ng --cpu "$(nproc)" --timeout 0 &
>     for r in $(seq 20); do
>         timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
>     done
>     kill %1
>
> With `stress-ng --cpu $(nproc)` the race surfaces. With
> `stress-ng --cpu $(($(nproc) - 1))` or without a stressor it does
> not reproduce reliably across 20 iterations.
>
> When the race fires, the Python QMP client times out on
> vm.run_job() after 5 s, the qemu process keeps running but never
> makes forward progress, and the outer `timeout 120` eventually
> kills it. Attach gdb before the timeout kills qemu to capture the
> stack.
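
For the gdb step, a minimal sketch of a helper that dumps every thread's backtrace from a hung qemu. The function name dump_stacks and the GDB override variable are hypothetical and exist only for this illustration; the PID has to be looked up by hand, since the name of the qemu binary depends on the build.

```shell
# dump_stacks PID: print a backtrace of every thread of a running process.
# GDB is a hypothetical override knob for this sketch; it defaults to "gdb".
dump_stacks() {
    pid=$1
    # -p attaches to the PID, -batch exits once the commands have run,
    # -ex runs the given gdb command.
    "${GDB:-gdb}" -p "$pid" -batch -ex 'thread apply all bt'
}
```

Run it as `dump_stacks <qemu-pid>` from a second shell while the iteration is still inside its 120 s window; attaching may need root or a relaxed ptrace_scope setting.
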
>
> Results
> -------
>
> Same guest, 20 iterations of the loop above, master at e3082ab3b3:
>
>   without this patch:  reproduces reliably (qcow2_close in ppoll)
>   with this patch:     20/20 PASS
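
Pass and hang counts like the ones above could be collected with a small wrapper around the loop. This is only a sketch: run_iterations and LIMIT are hypothetical names, and it assumes that a hung run is one that GNU timeout(1) has to kill (exit status 124) and that the test command exits 0 on success.

```shell
#!/bin/sh
# Hypothetical wrapper around the reproducer loop: count passed, hung and
# otherwise failed runs. LIMIT is the per-iteration timeout in seconds.
LIMIT=${LIMIT:-120}

# run_iterations N CMD [ARGS...]: run CMD N times under timeout(1).
run_iterations() {
    n=$1; shift
    pass=0; hung=0; failed=0
    i=0
    while [ "$i" -lt "$n" ]; do
        i=$((i + 1))
        timeout "$LIMIT" "$@" >/dev/null 2>&1
        rc=$?
        case $rc in
            0)   pass=$((pass + 1)) ;;
            124) hung=$((hung + 1)) ;;     # GNU timeout: command timed out
            *)   failed=$((failed + 1)) ;; # any other non-zero exit
        esac
    done
    echo "pass=$pass hung=$hung failed=$failed"
}
```

With the stressor running, the call would be `run_iterations 20 ./build/tests/qemu-iotests/check -qcow2 iothreads-create`.
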
>
> Signed-off-by: Denis V. Lunev <[email protected]>
> Cc: Kevin Wolf <[email protected]>
> Cc: Hanna Reitz <[email protected]>
>
> Denis V. Lunev (1):
>   coroutine: fix lost wakeup in qemu_co_sleep_wake()
>
>  include/qemu/coroutine.h    | 17 ++++++++---
>  util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
>  2 files changed, 58 insertions(+), 19 deletions(-)
>
ping
