On 24.04.2026 at 12:39, Denis V. Lunev wrote:
> tests/qemu-iotests/tests/iothreads-create reproduces the hang on
> master under `stress-ng --cpu $(nproc) --timeout 0`. The iotest's
> vm.run_job() times out and qemu stays permanently stuck in
> ppoll(timeout=-1) inside bdrv_graph_wrlock_drained -> blk_remove_bs
> during qemu_cleanup(). The timing window is narrow on modern
> bare-metal hardware and much wider in a VM guest; downstream trees
> that still use plain bdrv_graph_wrlock() in blk_remove_bs() hit it
> on the first iteration under the same stress.
>
> bdrv_graph_wrlock() zeroes has_writer around its AIO_WAIT_WHILE loop
> so that callbacks dispatched by aio_poll() can still take the read
> lock on the fast path. The rdunlock side, however, only kicks a
> waiting writer when has_writer is observed set; a reader that drops
> its lock inside the polling window silently returns and nothing ever
> wakes the writer:
>
>   main thread                        iothread0 coroutine
>   -----------                        -------------------
>   bdrv_graph_wrlock:                 rdlock held, reader_count=1
>     bdrv_drain_all_begin_nopoll
>     has_writer = 0
>     AIO_WAIT_WHILE_UNLOCKED(
>         NULL, reader_count >= 1):
>       num_waiters++
>       smp_mb
>       aio_poll(main_ctx, true)  -->  bdrv_graph_co_rdunlock:
>         (ppoll, blocked)               reader_count-- -> 0
>                                        smp_mb
>                                        read has_writer = 0
>                                        skip aio_wait_kick()
>                                        return
>
> reader_count is now 0 and num_waiters is still 1, but no BH, fd or
> timer on the main AioContext will fire -- the only entity that could
> kick just decided it did not have to. Main stays in ppoll() holding
> BQL, so RCU, VCPUs and any iothread path that needs BQL stall behind
> it. The hang is final; no timeout, no forward progress, no recovery
> as there is no other source of wake up inside qemu_cleanup().
>
> bdrv_drain_all_begin() does not close the race on its own: it
> quiesces in-flight I/O, but graph readers also include non-I/O
> coroutines (block-job cleanup, virtio-scsi polling) that drain does
> not evict. The bdrv_graph_wrlock_drained() wrapper narrows the
> window but does not eliminate it; every plain bdrv_graph_wrlock()
> site is exposed on the same basis.
>
> Drop the has_writer check in bdrv_graph_co_rdunlock() and call
> aio_wait_kick() unconditionally. The helper itself loads num_waiters
> atomically and only schedules a dummy BH when a waiter exists, so the
> change is a no-op on the no-writer path and closes the missed-wakeup
> on the writer path.
>
> Signed-off-by: Denis V. Lunev <[email protected]>
> Cc: Kevin Wolf <[email protected]>
> Cc: Hanna Reitz <[email protected]>
> Cc: Stefan Hajnoczi <[email protected]>
> Cc: Fiona Ebner <[email protected]>
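
For anyone following along, the interleaving in the commit message can
be replayed with a small single-threaded toy model. This is only a
sketch and not QEMU code: every model_* name is hypothetical, plain
ints stand in for the atomics and barriers, and a boolean stands in for
the dummy BH that aio_wait_kick() would schedule. It compares the
conditional kick with the unconditional one, assuming the helper only
acts when num_waiters is non-zero, as described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the counters named in the commit message (hypothetical). */
struct model {
    int reader_count;
    int has_writer;
    int num_waiters;
    bool kick_scheduled; /* stands in for the dummy BH on the main context */
};

/* Models the described aio_wait_kick(): does nothing without a waiter. */
void model_aio_wait_kick(struct model *m)
{
    if (m->num_waiters > 0) {
        m->kick_scheduled = true;
    }
}

/* Reader drops its lock; the flag selects the old or the new behaviour. */
void model_rdunlock(struct model *m, bool unconditional_kick)
{
    assert(m->reader_count > 0);
    m->reader_count--;
    if (unconditional_kick || m->has_writer) {
        model_aio_wait_kick(m);
    }
}

/* Replays the diagram: writer clears has_writer, registers as a waiter
 * and blocks; then the reader unlocks. Returns true if the writer would
 * be woken. */
bool model_writer_is_woken(bool unconditional_kick)
{
    struct model m = { .reader_count = 1 };

    m.has_writer = 0;   /* zeroed around the polling loop */
    m.num_waiters++;    /* writer is now blocked in ppoll() */
    model_rdunlock(&m, unconditional_kick);
    return m.kick_scheduled;
}

/* No-writer path: an unconditional kick with no waiter schedules nothing. */
bool model_kick_without_waiter(void)
{
    struct model m = { .reader_count = 1 };

    model_rdunlock(&m, true);
    return m.kick_scheduled;
}
```

In this model, model_writer_is_woken(false) reports the missed wakeup,
model_writer_is_woken(true) reports that the writer is kicked, and
model_kick_without_waiter() shows the kick staying a no-op when nobody
is waiting.
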
Reviewed-by: Kevin Wolf <[email protected]>

Thanks, applied to the block branch.

Kevin
