From: "Denis V. Lunev" <[email protected]>

tests/qemu-iotests/tests/iothreads-create reproduces the hang on
master under `stress-ng --cpu $(nproc) --timeout 0`.  The iotest's
vm.run_job() times out and qemu stays permanently stuck in
ppoll(timeout=-1) inside bdrv_graph_wrlock_drained -> blk_remove_bs
during qemu_cleanup().  The timing window is narrow on modern
bare-metal hardware and much wider in a VM guest; downstream trees
that still use plain bdrv_graph_wrlock() in blk_remove_bs() hit it
on the first iteration under the same stress.

bdrv_graph_wrlock() zeroes has_writer around its AIO_WAIT_WHILE loop
so that callbacks dispatched by aio_poll() can still take the read
lock on the fast path.  The rdunlock side, however, only kicks a
waiting writer when has_writer is observed set; a reader that drops
its lock inside the polling window silently returns and nothing ever
wakes the writer:

  main thread                         iothread0 coroutine
  -----------                         -------------------
  bdrv_graph_wrlock:                  rdlock held, reader_count=1
    bdrv_drain_all_begin_nopoll
    has_writer = 0
    AIO_WAIT_WHILE_UNLOCKED(
        NULL, reader_count >= 1):
      num_waiters++
      smp_mb
      aio_poll(main_ctx, true)   -->  bdrv_graph_co_rdunlock:
        (ppoll, blocked)                reader_count-- -> 0
                                        smp_mb
                                        read has_writer = 0
                                        skip aio_wait_kick()
                                      return

reader_count is now 0 and num_waiters is still 1, but no BH, fd or
timer on the main AioContext will fire -- the only entity that could
kick just decided it did not have to.  Main stays in ppoll() holding
BQL, so RCU, VCPUs and any iothread path that needs BQL stall behind
it.  The hang is final; no timeout, no forward progress, no recovery
as there is no other source of wake up inside qemu_cleanup().

bdrv_drain_all_begin() does not close the race on its own: it
quiesces in-flight I/O, but graph readers also include non-I/O
coroutines (block-job cleanup, virtio-scsi polling) that drain does
not evict.  The bdrv_graph_wrlock_drained() wrapper narrows the
window but does not eliminate it; every plain bdrv_graph_wrlock()
site is exposed on the same basis.

Drop the has_writer check in bdrv_graph_co_rdunlock() and call
aio_wait_kick() unconditionally.  The helper itself loads num_waiters
atomically and only schedules a dummy BH when a waiter exists, so the
change is a no-op on the no-writer path and closes the missed-wakeup
on the writer path.

Signed-off-by: Denis V. Lunev <[email protected]>
Cc: Kevin Wolf <[email protected]>
Cc: Hanna Reitz <[email protected]>
Cc: Stefan Hajnoczi <[email protected]>
Cc: Fiona Ebner <[email protected]>
Message-ID: <[email protected]>
Reviewed-by: Kevin Wolf <[email protected]>
Signed-off-by: Kevin Wolf <[email protected]>
---
 block/graph-lock.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/graph-lock.c b/block/graph-lock.c
index b7319473a17..f2501d75fbe 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -278,14 +278,12 @@ void coroutine_fn bdrv_graph_co_rdunlock(void)
     smp_mb();
 
     /*
-     * has_writer == 0: this means reader will read reader_count decreased
-     * has_writer == 1: we don't know if writer read reader_count old or
-     *                  new. Therefore, kick again so on next iteration
-     *                  writer will for sure read the updated value.
+     * Always kick: bdrv_graph_wrlock() zeroes has_writer while polling (to
+     * let callbacks take the reader lock via the fast path), so we cannot
+     * rely on has_writer to detect a waiting writer. aio_wait_kick() is a
+     * no-op when no one is waiting, so it is cheap in the common case.
      */
-    if (qatomic_read(&has_writer)) {
-        aio_wait_kick();
-    }
+    aio_wait_kick();
 }
 
 void bdrv_graph_rdlock_main_loop(void)
-- 
2.54.0


Reply via email to