On 24.04.2026 at 12:39, Denis V. Lunev wrote:
> tests/qemu-iotests/tests/iothreads-create reproduces the hang on
> master under `stress-ng --cpu $(nproc) --timeout 0`.  The iotest's
> vm.run_job() times out and qemu stays permanently stuck in
> ppoll(timeout=-1) inside bdrv_graph_wrlock_drained -> blk_remove_bs
> during qemu_cleanup().  The timing window is narrow on modern
> bare-metal hardware and much wider in a VM guest; downstream trees
> that still use plain bdrv_graph_wrlock() in blk_remove_bs() hit it
> on the first iteration under the same stress.
> 
> bdrv_graph_wrlock() zeroes has_writer around its AIO_WAIT_WHILE loop
> so that callbacks dispatched by aio_poll() can still take the read
> lock on the fast path.  The rdunlock side, however, only kicks a
> waiting writer when has_writer is observed set; a reader that drops
> its lock inside the polling window silently returns and nothing ever
> wakes the writer:
> 
>   main thread                         iothread0 coroutine
>   -----------                         -------------------
>   bdrv_graph_wrlock:                  rdlock held, reader_count=1
>     bdrv_drain_all_begin_nopoll
>     has_writer = 0
>     AIO_WAIT_WHILE_UNLOCKED(
>         NULL, reader_count >= 1):
>       num_waiters++
>       smp_mb
>       aio_poll(main_ctx, true)   -->  bdrv_graph_co_rdunlock:
>         (ppoll, blocked)                reader_count-- -> 0
>                                         smp_mb
>                                         read has_writer = 0
>                                         skip aio_wait_kick()
>                                       return
> 
> reader_count is now 0 and num_waiters is still 1, but no BH, fd or
> timer on the main AioContext will fire -- the only entity that could
> kick just decided it did not have to.  The main thread stays in
> ppoll() holding the BQL, so RCU, vCPUs and any iothread path that
> needs the BQL stall behind it.  The hang is permanent: there is no
> timeout, no forward progress and no recovery, because qemu_cleanup()
> has no other source of wakeups.
> 
> bdrv_drain_all_begin() does not close the race on its own: it
> quiesces in-flight I/O, but graph readers also include non-I/O
> coroutines (block-job cleanup, virtio-scsi polling) that drain does
> not evict.  The bdrv_graph_wrlock_drained() wrapper narrows the
> window but does not eliminate it; every plain bdrv_graph_wrlock()
> site is exposed on the same basis.
> 
> Drop the has_writer check in bdrv_graph_co_rdunlock() and call
> aio_wait_kick() unconditionally.  The helper itself loads num_waiters
> atomically and only schedules a dummy BH when a waiter exists, so the
> change is a no-op on the no-writer path and closes the missed-wakeup
> on the writer path.
> 
> Signed-off-by: Denis V. Lunev <[email protected]>
> Cc: Kevin Wolf <[email protected]>
> Cc: Hanna Reitz <[email protected]>
> Cc: Stefan Hajnoczi <[email protected]>
> Cc: Fiona Ebner <[email protected]>
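
For readers of the archive, the missed wakeup described in the commit
message can be replayed with a small single-threaded model. This is
only a sketch under stated assumptions: the struct and the model_*
functions are hypothetical names made up for illustration, not QEMU
code, and the model replays one fixed interleaving (the writer is
already polling with has_writer cleared when the last reader unlocks)
rather than real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; field names follow the commit message only. */
struct graph_lock_model {
    int has_writer;    /* zeroed by the writer around its polling loop */
    int reader_count;  /* readers currently holding the lock */
    int num_waiters;   /* threads parked in the AIO_WAIT_WHILE-style loop */
    int kicks;         /* dummy BHs that a kick would have scheduled */
};

/* Models the kick helper: only does work when a waiter exists. */
static void model_aio_wait_kick(struct graph_lock_model *m)
{
    if (m->num_waiters > 0) {
        m->kicks++;
    }
}

/*
 * Models the reader unlock path. With unconditional_kick == false the
 * kick is gated on has_writer (old behaviour as described above); with
 * true it is always attempted (behaviour after the patch).
 */
static void model_rdunlock(struct graph_lock_model *m, bool unconditional_kick)
{
    assert(m->reader_count > 0);
    m->reader_count--;
    if (unconditional_kick || m->has_writer) {
        model_aio_wait_kick(m);
    }
}

/* Writer is polling (has_writer == 0, one waiter) when the reader unlocks. */
static bool model_writer_woken(bool unconditional_kick)
{
    struct graph_lock_model m = {
        .has_writer = 0, .reader_count = 1, .num_waiters = 1, .kicks = 0,
    };
    model_rdunlock(&m, unconditional_kick);
    return m.kicks > 0;
}

/* No writer is waiting at all: the unconditional kick must stay a no-op. */
static bool model_kick_without_waiter(void)
{
    struct graph_lock_model m = {
        .has_writer = 0, .reader_count = 1, .num_waiters = 0, .kicks = 0,
    };
    model_rdunlock(&m, true);
    return m.kicks > 0;
}
```

In this model, model_writer_woken(false) reports that the writer is
never kicked, model_writer_woken(true) reports that it is, and
model_kick_without_waiter() shows the unconditional call doing nothing
when no waiter exists, which matches the reasoning in the last
paragraph of the commit message.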

Reviewed-by: Kevin Wolf <[email protected]>

Thanks, applied to the block branch.

Kevin

