Changes since v1
----------------

v1 was a two-patch series tracking two independent missed-wakeup
races on the qemu shutdown path.

  * Patch 1, "block/graph-lock: fix missed wakeup in
    bdrv_graph_co_rdunlock()", was applied as e3082ab3b3 by Kevin
    and is now in tree. This v2 carries only the remaining race.

  * Per Kevin's review of v1 patch 2 [1], the cache_clean_timer
    hang is no longer worked around inside block/qcow2.c. Instead,
    the underlying primitive -- qemu_co_sleep_wake() -- is fixed,
    closing the lost-wakeup window for every caller
    (cache_clean_timer, block_copy_kick, ...) rather than just
    qcow2. Cancellation latency through qemu_co_sleep_wake() drops
    from "next 1 s tick" (v1 workaround) to a direct aio_co_wake().

[1] https://lore.kernel.org/qemu-devel/[email protected]/

Problem
-------

The qemu shutdown / blockdev-close path can deadlock permanently on
upstream master. The main thread enters ppoll(timeout=-1) holding
BQL, no other thread has a wake source that points back at it, and
qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
deadlock, not a slow operation. RCU, the vCPUs and every iothread
path that needs BQL stall behind it.

Two independent missed-wakeup races in the block layer contributed
to the symptom on v1. Both shared the same shape: a waiter arms on
one side, the waker reads stale state on its fast path and silently
skips the kick, and nothing else on the AioContext fires to
recover. The first (block/graph-lock) was fixed by e3082ab3b3 and
is now in tree. This patch closes the second one, exposed in
qcow2's cache_clean_timer cancellation path:

  ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close

The race diagram and the exact stale-state read are in the patch's
commit message.

Reproducer
----------

Environment: 4-vCPU VM guest, kernel 6.12.x, upstream master at
e3082ab3b3 (with the graph-lock fix already applied). On modern
bare-metal the window is narrow enough that the hang rarely
reproduces without a VM -- a VM guest under full CPU saturation is
what makes the timing reliable.

    # reproducer
    stress-ng --cpu "$(nproc)" --timeout 0 &
    for r in $(seq 20); do
        timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
    done
    kill %1

With `stress-ng --cpu $(nproc)` the race surfaces. With
`stress-ng --cpu $(($(nproc) - 1))` or without a stressor it does
not reproduce reliably across 20 iterations.
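
To put a number on "reliably", the loop can tally the runs that the
outer timeout had to kill. The sketch below is not part of the
series: count_hangs is a hypothetical helper name, and it relies on
GNU timeout exiting with status 124 when the limit expires. A test
that fails without hanging is deliberately not counted.

```shell
# Hypothetical helper (not part of this series): run a command N times
# under `timeout` and count the runs that hit the time limit.
# usage: count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        rc=0
        # GNU timeout exits with 124 when it had to kill the command.
        timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs/$runs runs timed out"
}
```

For the reproducer above this would be invoked as
`count_hangs 20 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create`
while the stressor is running.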

When the race fires, the Python QMP client times out on
vm.run_job() after 5 s, the qemu process keeps running but never
makes forward progress, and the outer `timeout 120` eventually
kills it. Attach gdb before the timeout kills qemu to capture the
stack.
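
One non-interactive way to grab that stack is gdb's batch mode. This
is a minimal sketch, not part of the series: dump_stacks is a
hypothetical helper name, and the GDB variable exists only so the
debugger binary can be overridden.

```shell
# Hypothetical helper (not part of this series): print backtraces of
# all threads of a running process, then detach.
# usage: dump_stacks PID
dump_stacks() {
    pid=$1
    # -p attaches to the PID, -batch exits after the commands have run,
    # -ex runs the given gdb command.
    "${GDB:-gdb}" -p "$pid" -batch -ex 'thread apply all bt'
}
```

Assuming the hung process is the newest one whose command line
contains "qemu-system" (an assumption, adjust the pattern to the
binary under test), it could be called as
`dump_stacks "$(pgrep -n -f qemu-system)"`.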

Results
-------

Same guest, 20 iterations of the loop above, master at e3082ab3b3:

  without this patch:  reproduces reliably (qcow2_close in ppoll)
  with this patch:     20/20 PASS

Signed-off-by: Denis V. Lunev <[email protected]>
Cc: Kevin Wolf <[email protected]>
Cc: Hanna Reitz <[email protected]>

Denis V. Lunev (1):
  coroutine: fix lost wakeup in qemu_co_sleep_wake()

 include/qemu/coroutine.h    | 17 ++++++++---
 util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
 2 files changed, 58 insertions(+), 19 deletions(-)

-- 
2.51.0

