On 4/24/26 12:39, Denis V. Lunev wrote:
> Problem
> -------
>
> The qemu shutdown / blockdev-close path can deadlock permanently on
> upstream master. The main thread enters ppoll(timeout=-1) holding
> BQL, no other thread has a wake source that points back at it, and
> qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> iothread path that needs BQL stall with it.
>
> Two independent missed-wakeup races in the block layer contribute.
> Both share the same shape: a waiter arms on one side, the waker
> reads stale state on its fast path and silently skips the kick, and
> nothing else on the AioContext will fire to recover. They are
> different bugs in different subsystems and each patch stands on its
> own; they are posted together because they surface through the same
> test and the same symptom and are easiest to diagnose side by side.
>
> Depending on which race fires, the main thread backtrace at the
> moment of hang is one of:
>
> ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
> (patch 1 -- block/graph-lock)
>
> ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
> (patch 2 -- block/qcow2 cache_clean_timer)
>
> Race diagrams and the exact stale-state read are in each patch's
> commit message.
>
> Reproducer
> ----------
>
> Environment used for the numbers below: 4-vCPU VM guest,
> kernel 6.12.x, upstream master at bb230769b4. On modern bare-metal
> the window is narrow enough that the hangs rarely reproduce without
> a VM -- a VM guest under full CPU saturation is what makes the
> timing reliable. Downstream trees that still use plain
> bdrv_graph_wrlock() in blk_remove_bs() hit the graph-lock race on
> the first iteration without any stress at all.
>
> # reproducer
> stress-ng --cpu "$(nproc)" --timeout 0 &
> for r in $(seq 20); do
> timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
> done
> kill %1
>
> With `stress-ng --cpu $(nproc)` both races surface. With
> `stress-ng --cpu $(($(nproc) - 1))` or without a stressor neither
> reproduces reliably across 20 iterations.
>
> When a race fires, the Python QMP client times out on vm.run_job()
> after 5 s, the qemu process keeps running but never makes forward
> progress, and the outer `timeout 120` eventually kills it. attach
> gdb before the timeout kills qemu to capture the stack and
> distinguish which of the two races fired.
>
> Results
> -------
>
> Same guest, 20 iterations of the loop above:
>
> upstream master: 10/20 FAIL (first fail at iter #2)
> master + both patches: 20/20 PASS
>
> Signed-off-by: Denis V. Lunev <[email protected]>
> Cc: Kevin Wolf <[email protected]>
> Cc: Hanna Reitz <[email protected]>
> Cc: Stefan Hajnoczi <[email protected]>
> Cc: Fiona Ebner <[email protected]>
> Cc: Hanna Czenczek <[email protected]>
>
> Denis V. Lunev (2):
> block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
> block/qcow2: fix hangup in cache_clean_timer cancellation
>
> block/graph-lock.c | 12 +++++-------
> block/qcow2.c | 28 +++++++++++++++++-----------
> 2 files changed, 22 insertions(+), 18 deletions(-)
>
> --
> 2.51.0
ping