linux-aio: bound ioq_submit() recursion depth

Denis V. Lunev Wed, 20 May 2026 07:26:55 -0700

On 5/19/26 22:19, Stefan Hajnoczi wrote:
> On Mon, May 18, 2026 at 07:13:56PM +0200, Denis V. Lunev wrote:
>> On 4/28/26 12:58, Denis V. Lunev wrote:
>>> qemu_laio_process_completions() wraps its body in defer_call_begin /
>>> defer_call_end. Inside the section, completion callbacks wake coroutines
>>> that queue new aiocbs; laio_do_submit() defers laio_deferred_fn. At the
>>> bottom of qemu_laio_process_completions() the defer_call_end() fires
>>> laio_deferred_fn, which calls ioq_submit(), closing the cycle:
>>>
>>>   ioq_submit
>>>     -> io_submit(2)                           // some sync completions
>>>     -> qemu_laio_process_completions          // defer_call_begin
>>>          -> aio_co_wake                       // resumes coroutine
>>>               -> laio_do_submit
>>>                    -> defer_call(laio_deferred_fn, s)   // enqueued
>>>          -> defer_call_end                    // nesting drops to 0
>>>               -> laio_deferred_fn
>>>                    -> ioq_submit              // +1 stack frame, loop
>>>
>>> When io_submit(2) returns asynchronously (O_DIRECT) the cycle
>>> terminates in one extra frame: the fresh aiocb is still in flight, no
>>> completion is drained, no coroutine wakes, no new submission queues.
>>> When submissions complete synchronously (non-O_DIRECT, or per-descriptor
>>> drivers such as vmdk) each level enqueues more work for the next
>>> defer_call_end() to drain, so recursion grows without bound and QEMU
>>> crashes with SIGSEGV on the thread guard page.
>>>
>>> The cycle was closed by two performance commits, each correct in
>>> isolation:
>>>
>>>   076682885d ("block/linux-aio: convert to blk_io_plug_call() API")
>>>     -- introduced laio_deferred_fn and wired
>>>        laio_do_submit -> defer_call(laio_deferred_fn, s).
>>>
>>>   84d61e5f36 ("virtio: use defer_call() in virtio_irqfd_notify()")
>>>     -- added defer_call_begin/end around qemu_laio_process_completions
>>>        so virtio-irqfd notifications batch across a completion pass.
>>>
>>> The supported aio=native + cache=none pairing keeps submissions
>>> asynchronous, so the cycle stays bounded; nothing in the code enforces
>>> that contract. Observed in production as a SIGSEGV during a backup job
>>> configured with --cached + aio=native; reproducible on upstream with
>>> qemu-io against vmdk.
>>>
>>> Cap ioq_submit() recursion with a per-thread counter. On overflow,
>>> return without submitting. The pending work is drained by
>>> s->completion_bh, which qemu_laio_process_completions() has already
>>> scheduled on entry -- no work is lost; one event-loop round-trip of
>>> latency is paid only when the bound is hit, which cannot happen on a
>>> supported configuration.
>>>
>>> Signed-off-by: Denis V. Lunev <[email protected]>
>>> CC: Kevin Wolf <[email protected]>
>>> CC: Hanna Reitz <[email protected]>
>>> CC: Stefan Hajnoczi <[email protected]>
>>> CC: Paolo Bonzini <[email protected]>
>>> ---
>>>  block/linux-aio.c | 23 +++++++++++++++++++++++
>>>  1 file changed, 23 insertions(+)
>>>
>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>> index 0a7424fbb3..f98bb6e766 100644
>>> --- a/block/linux-aio.c
>>> +++ b/block/linux-aio.c
>>> @@ -36,6 +36,19 @@
>>>  /* Maximum number of requests in a batch. (default value) */
>>>  #define DEFAULT_MAX_BATCH 32
>>>  
>>> +/*
>>> + * Bound on how deep ioq_submit() may recurse on a single thread via the
>>> + * ioq_submit -> qemu_laio_process_completions -> defer_call_end ->
>>> + * laio_deferred_fn -> ioq_submit cycle. The cycle terminates naturally
>>> + * when io_submit(2) returns asynchronously (O_DIRECT), but can grow
>>> + * without bound when submissions complete synchronously. On overflow
>>> + * the caller returns without submitting; the outermost
>>> + * qemu_laio_process_completions() has already scheduled s->completion_bh
>>> + * (via qemu_bh_schedule() at the top of that function), which resumes
>>> + * submission from the next event-loop dispatch.
>>> + */
>>> +#define IOQ_SUBMIT_MAX_DEPTH 8
>>> +
>>>  struct qemu_laiocb {
>>>      Coroutine *co;
>>>      LinuxAioState *ctx;
>>> @@ -80,6 +93,9 @@ struct LinuxAioState {
>>>  static void ioq_submit(LinuxAioState *s);
>>>  static int laio_do_submit(struct qemu_laiocb *laiocb);
>>>  
>>> +/* Per-thread recursion counter for ioq_submit(). See IOQ_SUBMIT_MAX_DEPTH. */
>>> +static __thread unsigned ioq_submit_depth;
>>> +
>>>  static inline ssize_t io_event_ret(struct io_event *ev)
>>>  {
>>>      return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
>>> @@ -340,6 +356,11 @@ static void ioq_submit(LinuxAioState *s)
>>>      QEMU_UNINITIALIZED struct iocb *iocbs[MAX_EVENTS];
>>>      QSIMPLEQ_HEAD(, qemu_laiocb) completed;
>>>  
>>> +    if (ioq_submit_depth >= IOQ_SUBMIT_MAX_DEPTH) {
>>> +        return;
>>> +    }
>>> +    ioq_submit_depth++;
>>> +
>>>      do {
>>>          if (s->io_q.in_flight >= MAX_EVENTS) {
>>>              break;
>>> @@ -385,6 +406,8 @@ static void ioq_submit(LinuxAioState *s)
>>>           * pended requests will be submitted from there.
>>>           */
>>>      }
>>> +
>>> +    ioq_submit_depth--;
>>>  }
>>>  
>>>  static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)
>> ping
> Sorry, I missed this.
>
> Please add a field to LaioQueue instead of adding a __thread variable.
> LaioQueue is only accessed from a single thread and that's where the
> state lives.
>
> Stefan
>
Done, sent v3.
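
For readers of the archive, here is a minimal standalone sketch of the
per-queue counter idea, assuming a toy model rather than the real
block/linux-aio.c code. It is not the v3 patch. ToyQueue, submit_depth,
toy_ioq_submit() and toy_run() are hypothetical names made up for
illustration; the recursive call stands in for the defer_call_end ->
laio_deferred_fn -> ioq_submit path, and the plain loop in toy_run()
stands in for s->completion_bh resuming submission from the event loop.

```c
#define IOQ_SUBMIT_MAX_DEPTH 8   /* same bound as in the quoted patch */

/* Hypothetical, stripped-down stand-in for LaioQueue. */
typedef struct {
    unsigned pending;         /* requests queued but not yet submitted */
    unsigned remaining;       /* follow-up requests still to be issued */
    unsigned submitted;       /* requests submitted so far */
    unsigned submit_depth;    /* assumed per-queue nesting counter */
    unsigned max_depth_seen;  /* bookkeeping for this demo only */
} ToyQueue;

/*
 * One submission pass. Every request "completes synchronously": the
 * completion queues a follow-up request and re-enters submission, which
 * is the pattern that recursed without bound before the fix.
 */
static void toy_ioq_submit(ToyQueue *q)
{
    if (q->submit_depth >= IOQ_SUBMIT_MAX_DEPTH) {
        return;  /* leave q->pending for the next event-loop pass */
    }
    q->submit_depth++;
    if (q->submit_depth > q->max_depth_seen) {
        q->max_depth_seen = q->submit_depth;
    }

    if (q->pending > 0) {
        q->pending--;
        q->submitted++;
        if (q->remaining > 0) {
            q->remaining--;
            q->pending++;
            toy_ioq_submit(q);  /* models laio_deferred_fn -> ioq_submit */
        }
    }

    q->submit_depth--;
}

/* Models the event loop: keep dispatching until nothing is pending. */
static unsigned toy_run(ToyQueue *q)
{
    unsigned passes = 0;

    while (q->pending > 0) {
        toy_ioq_submit(q);
        passes++;
    }
    return passes;
}
```

Starting from one pending request with 19 follow-ups
(ToyQueue q = { .pending = 1, .remaining = 19 }), toy_run(&q) needs three
passes: the nesting never exceeds 8, all 20 requests are submitted, and
the counter is back at 0 afterwards. Since the counter sits in the queue
structure, it needs no thread-local storage as long as the queue is only
touched from one thread.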


Thanks for the review,
    Den
