On 12.12.25 11:25, Hanna Czenczek wrote:
This reverts commit 0f142cbd919fcb6cea7aa176f7e4939925806dd9.

Lukáš Doktor reported a simple single-threaded nvme test case hanging
and bisected it to this commit.  While we are still investigating, it is
best to revert the commit for now.

(This breaks multiqueue for nvme, but better to have single-queue
working than neither.)

Cc: [email protected]
Reported-by: Lukáš Doktor <[email protected]>
Signed-off-by: Hanna Czenczek <[email protected]>
---
  block/nvme.c | 56 +++++++++++++++++++++++++---------------------------
  1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 919e14cef9..c3d3b99d1f 100644
--- a/block/nvme.c
+++ b/block/nvme.c

[...]

  /* Put into NVMeRequest.cb, so runs in the BDS's main AioContext */
  static void nvme_rw_cb(void *opaque, int ret)
  {

[...]

-        aio_co_wake(data->co);
[...]
+    replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
  }

From testing, this bit seems to be the important one: the hang seems to be caused by entering the coroutine directly instead of always going through a BH.  I haven’t yet found out why that is, only that s/aio_co_wake()/aio_co_schedule()/ seems to make it work.

I’ll spend more time trying to find out why.
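
To make the difference between the two wake-up styles concrete, here is a toy model in plain C.  It is not QEMU code, and every name in it is made up for illustration.  It assumes (as I understand the two functions) that aio_co_wake() may resume the coroutine right away when called from its home AioContext, while aio_co_schedule() always defers the resume to a BH.  The sketch only shows the resulting ordering difference; it does not explain the hang.

```c
#include <assert.h>
#include <string.h>

/* Toy model, NOT QEMU code: all names below are hypothetical. */
typedef void (*Callback)(void *opaque);

#define MAX_DEFERRED 8

/* Stand-in for a list of pending bottom halves */
static struct {
    Callback fn;
    void *opaque;
} deferred[MAX_DEFERRED];
static int n_deferred;

/* Records the order in which events happen */
static char trace[128];

static void log_event(const char *name)
{
    strcat(trace, name);
    strcat(trace, ";");
}

/* Stand-in for the rest of the coroutine after its yield point */
static void continuation(void *opaque)
{
    (void)opaque;
    log_event("resume");
}

/* Models a direct wake: the continuation runs nested in the caller */
static void wake_direct(Callback fn, void *opaque)
{
    fn(opaque);
}

/* Models a scheduled wake: the continuation is only queued */
static void wake_deferred(Callback fn, void *opaque)
{
    assert(n_deferred < MAX_DEFERRED);
    deferred[n_deferred].fn = fn;
    deferred[n_deferred].opaque = opaque;
    n_deferred++;
}

/* Models the event loop running the pending bottom halves */
static void run_deferred(void)
{
    for (int i = 0; i < n_deferred; i++) {
        deferred[i].fn(deferred[i].opaque);
    }
    n_deferred = 0;
}

/* Models a completion callback such as nvme_rw_cb() */
static void completion_handler(int direct)
{
    log_event("cb-enter");
    if (direct) {
        wake_direct(continuation, NULL);
    } else {
        wake_deferred(continuation, NULL);
    }
    log_event("cb-exit");
}

/* Runs one scenario and returns the recorded event order */
static const char *run_scenario(int direct)
{
    trace[0] = '\0';
    n_deferred = 0;
    completion_handler(direct);
    run_deferred();
    return trace;
}
```

Calling run_scenario(1) records the resume between cb-enter and cb-exit, i.e. the continuation runs while the completion callback is still on the stack.  Calling run_scenario(0) records the resume only after cb-exit, once the callback has fully returned.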

(The only thing I know so far is that iscsi similarly should not use aio_co_wake(), and for that we do have a documented reason: https://gitlab.com/qemu-project/qemu/-/commit/8b9dfe9098 – in light of that, it probably makes sense not to use aio_co_wake() for NFS either, which was the third case in the original series where I replaced a oneshot schedule by aio_co_wake().)

Hanna

