On Mon, May 11, 2026 at 11:53:37PM +0200, Denis V. Lunev wrote:
> On 4/24/26 12:39, Denis V. Lunev wrote:
> > Problem
> > -------
> >
> > The qemu shutdown / blockdev-close path can deadlock permanently on
> > upstream master. The main thread enters ppoll(timeout=-1) holding
> > BQL, no other thread has a wake source that points back at it, and
> > qemu has to be SIGKILLed. The hang has no timeout -- it is a hard
> > deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> > iothread path that needs BQL stall with it.
> >
> > Two independent missed-wakeup races in the block layer contribute.
> > Both share the same shape: a waiter arms on one side, the waker
> > reads stale state on its fast path and silently skips the kick, and
> > nothing else on the AioContext will fire to recover. They are
> > different bugs in different subsystems and each patch stands on its
> > own; they are posted together because they surface through the same
> > test and the same symptom and are easiest to diagnose side by side.
> >
> > Depending on which race fires, the main thread backtrace at the
> > moment of hang is one of:
> >
> >   ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
> >     (patch 1 -- block/graph-lock)
> >
> >   ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
> >     (patch 2 -- block/qcow2 cache_clean_timer)
> >
> > Race diagrams and the exact stale-state read are in each patch's
> > commit message.
> >
> > Reproducer
> > ----------
> >
> > Environment used for the numbers below: 4-vCPU VM guest,
> > kernel 6.12.x, upstream master at bb230769b4. On modern bare-metal
> > the window is narrow enough that the hangs rarely reproduce without
> > a VM -- a VM guest under full CPU saturation is what makes the
> > timing reliable. Downstream trees that still use plain
> > bdrv_graph_wrlock() in blk_remove_bs() hit the graph-lock race on
> > the first iteration without any stress at all.
> >
> >   # reproducer
> >   stress-ng --cpu "$(nproc)" --timeout 0 &
> >   for r in $(seq 20); do
> >       timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
> >   done
> >   kill %1
> >
> > With `stress-ng --cpu $(nproc)` both races surface. With
> > `stress-ng --cpu $(($(nproc) - 1))` or without a stressor neither
> > reproduces reliably across 20 iterations.
> >
> > When a race fires, the Python QMP client times out on vm.run_job()
> > after 5 s, the qemu process keeps running but never makes forward
> > progress, and the outer `timeout 120` eventually kills it. attach
> > gdb before the timeout kills qemu to capture the stack and
> > distinguish which of the two races fired.
> >
> > Results
> > -------
> >
> > Same guest, 20 iterations of the loop above:
> >
> >   upstream master:        10/20 FAIL (first fail at iter #2)
> >   master + both patches:  20/20 PASS
> >
> > Signed-off-by: Denis V. Lunev <[email protected]>
> > Cc: Kevin Wolf <[email protected]>
> > Cc: Hanna Reitz <[email protected]>
> > Cc: Stefan Hajnoczi <[email protected]>
> > Cc: Fiona Ebner <[email protected]>
> > Cc: Hanna Czenczek <[email protected]>
> >
> > Denis V. Lunev (2):
> >   block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
> >   block/qcow2: fix hangup in cache_clean_timer cancellation
> >
> >  block/graph-lock.c | 12 +++++-------
> >  block/qcow2.c      | 28 +++++++++++++++++-----------
> >  2 files changed, 22 insertions(+), 18 deletions(-)
> >
> > --
> > 2.51.0
> ping

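For anyone re-running the quoted reproducer, the PASS/FAIL counts in the
Results section could be tallied with a small wrapper instead of by eye.
This is only a sketch: run_iterations is a hypothetical helper name, not
something in the series or in qemu-iotests, and it assumes the wrapped
command exits non-zero on failure (as `timeout` does, with status 124,
when it has to kill a hung run).

```shell
# run_iterations N CMD [ARGS...]
# Hypothetical helper (not part of the series): run CMD N times and print
# a one-line summary shaped like the quoted Results table.
run_iterations() {
    n=$1
    shift
    pass=0
    fail=0
    first_fail=""
    i=1
    while [ "$i" -le "$n" ]; do
        # Any non-zero exit status counts as a failed iteration.
        if "$@" >/dev/null 2>&1; then
            pass=$((pass + 1))
        else
            fail=$((fail + 1))
            # Remember only the first failing iteration.
            if [ -z "$first_fail" ]; then
                first_fail=$i
            fi
        fi
        i=$((i + 1))
    done
    if [ "$fail" -gt 0 ]; then
        echo "$fail/$n FAIL (first fail at iter #$first_fail)"
    else
        echo "$pass/$n PASS"
    fi
}
```

With the stressor from the reproducer already running in the background,
the loop body would become:

    run_iterations 20 timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create

The helper discards the test output, so a hang still has to be caught
with gdb while qemu is alive, as the cover letter describes.
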
Hi Kevin,
This looks like a series for your block tree. If I can help in some way,
please let me know.

Stefan
