Re: [PATCH v2 for-10.2] Revert "nvme: Fix coroutine waking"

Hanna Czenczek Mon, 15 Dec 2025 06:20:03 -0800

On 15.12.25 15:15, Hanna Czenczek wrote:

This reverts commit 0f142cbd919fcb6cea7aa176f7e4939925806dd9.

Sorry, in my joy to have found the root cause (I’ve tested a fixspecifically for that on Lukáš’s testing machine), I forgot to add thev2 changelog:

The code didn’t change in v2, only the commit message did, specificallyto note what the root cause is and why reverting fixes the issue.


Hanna

Said commit changed the replay_bh_schedule_oneshot_event() in
nvme_rw_cb() to aio_co_wake(), allowing the request coroutine to be
entered directly (instead of only being scheduled for later execution).
This can cause the device to become stalled like so:

It is possible that after completion the request coroutine goes on to
submit another request without yielding, e.g. a flush after a write to
emulate FUA.  This will likely cause a nested nvme_process_completion()
call because nvme_rw_cb() itself is called from there.

(After submitting a request, we invoke nvme_process_completion() through
defer_call(); but the fact that nvme_process_completion() ran in the
first place indicates that we are not in a call-deferring section, so
defer_call() will call nvme_process_completion() immediately.)

If this inner nvme_process_completion() loop then processes any
completions, it will write the final completion queue (CQ) head index to
the CQ head doorbell, and subsequently execution will return to the
outer nvme_process_completion() loop.  Even if this loop now finds no
further completions, it still processed at least one completion before,
or it would not have called the nvme_rw_cb() which led to nesting.
Therefore, it will now write the exact same CQ head index value to the
doorbell, which effectively is an unrecoverable error[1].

Therefore, nesting of nvme_process_completion() does not work at this
point.  Reverting said commit removes the nesting (by scheduling the
request coroutine instead of entering it immediately), and so fixes the
stall.

On the downside, reverting said commit breaks multiqueue for nvme, but
better to have single-queue working than neither.  For 11.0, we will
have a solution that makes both work.

A side note: There is a comment in nvme_process_completion() above
qemu_bh_schedule() that claims nesting works, as long as it is done
through the completion_bh.  I am quite sure that is not true, for two
reasons:
- The problem described above, which is even worse when going through
   nvme_process_completion_bh() because that function unconditionally
   writes to the CQ head doorbell,
- nvme_process_completion_bh() never takes q->lock, so
   nvme_process_completion() unlocking it will likely abort.

Given the lack of reports of such aborts, I believe that completion_bh
simply is unused in practice.

[1] See the NVMe Base Specification revision 2.3, page 180, figure 152:
     “Invalid Doorbell Write Value: A host attempted to write an invalid
     doorbell value. Some possible causes of this error are: [...] the
     value written is the same as the previously written doorbell value.”

     To even be notified of this error, we would need to send an
     Asynchronous Event Request to the admin queue (p. 178ff), which we
     don’t do, and then to handle it, we would need to delete and
     recreate the queue (p. 88, section 3.3.1.2 Queue Usage).

Cc: [email protected]
Reported-by: Lukáš Doktor <[email protected]>
Tested-by: Lukáš Doktor <[email protected]>
Signed-off-by: Hanna Czenczek <[email protected]>
---
  block/nvme.c | 56 +++++++++++++++++++++++++---------------------------
  1 file changed, 27 insertions(+), 29 deletions(-)

Re: [PATCH v2 for-10.2] Revert "nvme: Fix coroutine waking"

Reply via email to