On 5/20/26 21:38, Denis V. Lunev wrote:
> Changes since v1
> ----------------
>
> v1 was a two-patch series tracking two independent missed-wakeup
> races on the qemu shutdown path.
>
>   * Patch 1, "block/graph-lock: fix missed wakeup in
>     bdrv_graph_co_rdunlock()", was applied as e3082ab3b3 by Kevin
>     and is now in tree. This v2 carries only the remaining race.
>
>   * Per Kevin's review of v1 patch 2 [1], the cache_clean_timer
>     hang is no longer worked around inside block/qcow2.c. Instead,
>     the underlying primitive -- qemu_co_sleep_wake() -- is fixed,
>     closing the lost-wakeup window for every caller
>     (cache_clean_timer, block_copy_kick, ...) rather than just
>     qcow2. Cancellation latency through qemu_co_sleep_wake() drops
>     from "next 1 s tick" (v1 workaround) to aio_co_wake().
>
> [1] https://lore.kernel.org/qemu-devel/[email protected]/
>
> Problem
> -------
>
> The qemu shutdown / blockdev-close path can deadlock permanently on
> upstream master. The main thread enters ppoll(timeout=-1) holding
> BQL, no other thread has a wake source that points back at it, and
> qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> iothread path that needs BQL stall with it.
>
> Two independent missed-wakeup races in the block layer contributed
> to the symptom on v1. Both shared the same shape: a waiter arms on
> one side, the waker reads stale state on its fast path and silently
> skips the kick, and nothing else on the AioContext fires to
> recover. The first (block/graph-lock) was fixed by e3082ab3b3 and
> is now in tree. This patch closes the second one, exposed in
> qcow2's cache_clean_timer cancellation path:
>
>     ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
>
> The race diagram and the exact stale-state read are in the patch's
> commit message.
>
> Reproducer
> ----------
>
> Environment: 4-vCPU VM guest, kernel 6.12.x, upstream master at
> e3082ab3b3 (with the graph-lock fix already applied). On modern
> bare-metal the window is narrow enough that the hang rarely
> reproduces without a VM -- a VM guest under full CPU saturation is
> what makes the timing reliable.
>
>     # reproducer
>     stress-ng --cpu "$(nproc)" --timeout 0 &
>     for r in $(seq 20); do
>         timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
>     done
>     kill %1
>
> With `stress-ng --cpu $(nproc)` the race surfaces. With
> `stress-ng --cpu $(($(nproc) - 1))` or without a stressor it does
> not reproduce reliably across 20 iterations.
>
> When the race fires, the Python QMP client times out on
> vm.run_job() after 5 s, the qemu process keeps running but never
> makes forward progress, and the outer `timeout 120` eventually
> kills it. Attach gdb before the timeout kills qemu to capture the
> stack.
>
> Results
> -------
>
> Same guest, 20 iterations of the loop above, master at e3082ab3b3:
>
>     without this patch: reproduces reliably (qcow2_close in ppoll)
>     with this patch:    20/20 PASS
>
> Signed-off-by: Denis V. Lunev <[email protected]>
> Cc: Kevin Wolf <[email protected]>
> Cc: Hanna Reitz <[email protected]>
>
> Denis V. Lunev (1):
>   coroutine: fix lost wakeup in qemu_co_sleep_wake()
>
>  include/qemu/coroutine.h    | 17 ++++++++---
>  util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
>  2 files changed, 58 insertions(+), 19 deletions(-)
>

ping
