On Mon, May 11, 2026 at 11:53:37PM +0200, Denis V. Lunev wrote:
> On 4/24/26 12:39, Denis V. Lunev wrote:
> > Problem
> > -------
> >
> > The qemu shutdown / blockdev-close path can deadlock permanently on
> > upstream master.  The main thread enters ppoll(timeout=-1) holding
> > BQL, no other thread has a wake source that points back at it, and
> > qemu has to be SIGKILLed.  The hang has no timeout -- it is a hard
> > deadlock, not a slow operation; behind BQL, RCU, VCPUs and every
> > iothread path that needs BQL stall with it.
> >
> > Two independent missed-wakeup races in the block layer contribute.
> > Both share the same shape: a waiter arms on one side, the waker
> > reads stale state on its fast path and silently skips the kick, and
> > nothing else on the AioContext will fire to recover.  They are
> > different bugs in different subsystems and each patch stands on its
> > own; they are posted together because they surface through the same
> > test and the same symptom and are easiest to diagnose side by side.
> >
> > Depending on which race fires, the main thread backtrace at the
> > moment of hang is one of:
> >
> >   ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
> >       (patch 1 -- block/graph-lock)
> >
> >   ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
> >       (patch 2 -- block/qcow2 cache_clean_timer)
> >
> > Race diagrams and the exact stale-state read are in each patch's
> > commit message.
> >
> > Reproducer
> > ----------
> >
> > Environment used for the numbers below: 4-vCPU VM guest,
> > kernel 6.12.x, upstream master at bb230769b4.  On modern bare-metal
> > the window is narrow enough that the hangs rarely reproduce without
> > a VM -- a VM guest under full CPU saturation is what makes the
> > timing reliable.  Downstream trees that still use plain
> > bdrv_graph_wrlock() in blk_remove_bs() hit the graph-lock race on
> > the first iteration without any stress at all.
> >
> >     # reproducer
> >     stress-ng --cpu "$(nproc)" --timeout 0 &
> >     for r in $(seq 20); do
> >         timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
> >     done
> >     kill %1
> >
> > With `stress-ng --cpu $(nproc)` both races surface.  With
> > `stress-ng --cpu $(($(nproc) - 1))` or without a stressor neither
> > reproduces reliably across 20 iterations.
> >
> > When a race fires, the Python QMP client times out on vm.run_job()
> > after 5 s, the qemu process keeps running but never makes forward
> > progress, and the outer `timeout 120` eventually kills it.  Attach
> > gdb before the timeout kills qemu to capture the stack and
> > distinguish which of the two races fired.
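
A minimal sketch of how the captured stack could be sorted into one of the
two backtraces listed above. The helper name classify_hang and the file
bt.txt are hypothetical and not part of the series; the sketch assumes the
main-thread backtrace has already been saved as text, and it matches only
on the two function names quoted in the cover letter.

```shell
# Hypothetical helper (not part of the series): classify a saved backtrace.
# $1 is a text file holding the main-thread backtrace of the hung qemu.
classify_hang() {
    if grep -q 'cache_clean_timer_del_and_wait' "$1"; then
        # ppoll -> aio_poll -> cache_clean_timer_del_and_wait -> qcow2_close
        echo "patch 2: qcow2 cache_clean_timer"
    elif grep -q 'bdrv_graph_wrlock' "$1"; then
        # ppoll -> aio_poll -> bdrv_graph_wrlock -> blk_remove_bs
        echo "patch 1: graph-lock"
    else
        echo "unknown"
    fi
}

# Possible capture step, assuming $pid holds the PID of the hung qemu
# and that thread 1 is the main thread:
#   gdb -p "$pid" -batch -ex 'thread 1' -ex 'bt' > bt.txt
#   classify_hang bt.txt
```

The capture has to finish before the outer `timeout 120` expires, so it may
help to raise that timeout while diagnosing.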
> >
> > Results
> > -------
> >
> > Same guest, 20 iterations of the loop above:
> >
> >   upstream master:            10/20 FAIL (first fail at iter #2)
> >   master + both patches:      20/20 PASS
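
A tally in the format above could be produced by a small wrapper like the
following sketch. The function run_tally is hypothetical, and it assumes the
wrapped command exits non-zero on a failed or timed-out iteration; whether
the iotests `check` script reports every hang that way is not stated in the
cover letter.

```shell
# Hypothetical helper: run a command N times and print a PASS/FAIL tally.
# Usage: run_tally N command [args...]
run_tally() {
    n=$1
    shift
    fail=0
    first=
    i=1
    while [ "$i" -le "$n" ]; do
        # Any non-zero exit status counts as a failed iteration.
        if ! "$@" >/dev/null 2>&1; then
            fail=$((fail + 1))
            if [ -z "$first" ]; then
                first=$i
            fi
        fi
        i=$((i + 1))
    done
    if [ "$fail" -eq 0 ]; then
        echo "$n/$n PASS"
    else
        echo "$fail/$n FAIL (first fail at iter #$first)"
    fi
}

# Example, mirroring the reproducer loop:
#   run_tally 20 timeout 120 ./build/tests/qemu-iotests/check -qcow2 iothreads-create
```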
> >
> > Signed-off-by: Denis V. Lunev <[email protected]>
> > Cc: Kevin Wolf <[email protected]>
> > Cc: Hanna Reitz <[email protected]>
> > Cc: Stefan Hajnoczi <[email protected]>
> > Cc: Fiona Ebner <[email protected]>
> > Cc: Hanna Czenczek <[email protected]>
> >
> > Denis V. Lunev (2):
> >   block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
> >   block/qcow2: fix hangup in cache_clean_timer cancellation
> >
> >  block/graph-lock.c | 12 +++++-------
> >  block/qcow2.c      | 28 +++++++++++++++++-----------
> >  2 files changed, 22 insertions(+), 18 deletions(-)
> >
> > --
> > 2.51.0
> ping

Hi Kevin,
This looks like a series for your block tree. If I can help in some way,
please let me know.

Stefan
