On 5/29/26 16:40, Kevin Wolf wrote:
> Am 20.05.2026 um 21:38 hat Denis V. Lunev geschrieben:
>> cache_clean_timer_del_and_wait() cancels the cache-cleaner coroutine
>> by setting s->cache_clean_interval = 0 and calling qemu_co_sleep_wake()
>> to cut short its qemu_co_sleep_ns_wakeable(). qemu_co_sleep_wake() is
>> fire-and-forget: it reads w->to_wake and silently returns when it is
>> NULL. A sleeper that is between two iterations -- has just released
>> s->lock but has not yet set w->to_wake inside qemu_co_sleep() -- loses
>> the wake:
>>
>>   iothread0 timer coroutine           main thread (qcow2 close)
>>   -------------------------           -------------------------
>>   while-body (holding s->lock):
>>     read interval = 600
>>     wait_ns = 600 * NS
>>     release s->lock
>>                                       take s->lock
>>                                       interval = 0
>>                                       qemu_co_sleep_wake(w):
>>                                         w->to_wake == NULL -> skip
>>                                         return
>>                                       qemu_co_queue_wait(exit, s->lock):
>>                                         release s->lock
>>                                         yield
>>   qemu_co_sleep_ns_wakeable:
>>     aio_timer_init(+600 s)
>>     qemu_co_sleep:
>>       cas scheduled NULL -> "qsns"
>>       w->to_wake = co
>>       yield  [sleeps 600 s]
>>
>> cache_clean_timer_del_and_wait() then blocks on cache_clean_timer_exit
>> until the original 600 s expiry fires, and qcow2_close() holds BQL the
>> whole time so the VM stalls behind it.
>>
>> block_copy_kick() has the same shape. Fix the primitive once instead
>> of working around it in each caller.
>>
>> Use a tri-state for QemuCoSleep::to_wake:
>>
>>   NULL     - idle
>>   co       - sleeper parked
>>   PENDING  - wake delivered, no sleeper yet (sticky)
>>
>> qemu_co_sleep_wake() xchgs PENDING into to_wake: a real sleeper is
>> woken, NULL/PENDING is left untouched so the wake stays sticky.
>> qemu_co_sleep() consumes a sticky PENDING on entry; otherwise it
>> cmpxchg-publishes itself, and a wake racing in between is caught
>> because the cmpxchg observes PENDING and returns without yielding.
>> On normal resume qemu_co_sleep() clears the PENDING the waker left
>> behind so the next sleep starts clean.
>>
>> A double-fire (real wake plus timer callback) is harmless: the first
>> xchg returns the coroutine and wakes it; the second returns PENDING
>> and is a no-op. Cancellation latency through qemu_co_sleep_wake() is
>> now bounded by aio_co_wake() rather than by the sleep duration.
>>
>> Fixes: f86dde9a15 ("qcow2: Fix cache_clean_timer")
>> Signed-off-by: Denis V. Lunev <[email protected]>
>> Cc: Hanna Czenczek <[email protected]>
>> Cc: Kevin Wolf <[email protected]>
>> ---
>>  include/qemu/coroutine.h    | 17 ++++++++---
>>  util/qemu-coroutine-sleep.c | 60 +++++++++++++++++++++++++++----------
>>  2 files changed, 58 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
>> index e545bbf620..1c31de60f9 100644
>> --- a/include/qemu/coroutine.h
>> +++ b/include/qemu/coroutine.h
>> @@ -260,10 +260,19 @@ int coroutine_fn qemu_co_timeout(CoroutineEntry 
>> *entry, void *opaque,
>>                                   uint64_t timeout_ns, CleanupFunc clean);
>>  
>>  /**
>> - * Wake a coroutine if it is sleeping in qemu_co_sleep_ns. The timer will be
>> - * deleted. @sleep_state must be the variable whose address was given to
>> - * qemu_co_sleep_ns() and should be checked to be non-NULL before calling
>> - * qemu_co_sleep_wake().
>> + * Wake a coroutine sleeping in qemu_co_sleep() or 
>> qemu_co_sleep_ns_wakeable().
>> + * The timer set up by the latter is deleted on wakeup.
>> + *
>> + * The wake is sticky: if no sleeper is parked on @w at the time of the 
>> call,
>> + * the wake is recorded on @w and consumed by the next qemu_co_sleep() on 
>> the
>> + * same @w, which then returns without yielding. This closes the lost-wakeup
>> + * window between two sleeps and is the documented behavior callers should
>> + * rely on -- e.g. a cancellation signal raised between iterations of a
>> + * sleep/work loop will shorten the next sleep instead of being dropped.
>> + *
>> + * The state persists until consumed: if no further qemu_co_sleep() is ever
>> + * called on @w, the pending wake is harmlessly discarded when @w goes away.
>> + * Multiple wakes coalesce -- the next sleep consumes at most one.
>>   */
>>  void qemu_co_sleep_wake(QemuCoSleep *w);
>>  
>> diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
>> index edef117284..a5734a8eb7 100644
>> --- a/util/qemu-coroutine-sleep.c
>> +++ b/util/qemu-coroutine-sleep.c
>> @@ -18,20 +18,29 @@
>>  
>>  static const char *qemu_co_sleep_ns__scheduled = "qemu_co_sleep_ns";
>>  
>> +/*
>> + * Sentinel stored in QemuCoSleep::to_wake by qemu_co_sleep_wake() when no
>> + * sleeper has parked yet. The next qemu_co_sleep() consumes it and returns
>> + * without yielding, so a wake that races the arming of a sleep is never
>> + * lost.
>> + */
>> +#define QEMU_CO_SLEEP_PENDING ((Coroutine *)(uintptr_t)1)
>> +
>>  void qemu_co_sleep_wake(QemuCoSleep *w)
>>  {
>>      Coroutine *co;
>>  
>> -    co = w->to_wake;
>> -    w->to_wake = NULL;
>> -    if (co) {
>> -        /* Write of schedule protected by barrier write in aio_co_schedule 
>> */
>> -        const char *scheduled = qatomic_cmpxchg(&co->scheduled,
>> -                                                
>> qemu_co_sleep_ns__scheduled, NULL);
>> -
>> -        assert(scheduled == qemu_co_sleep_ns__scheduled);
>> -        aio_co_wake(co);
>> +    co = qatomic_xchg(&w->to_wake, QEMU_CO_SLEEP_PENDING);
>> +    if (co == NULL || co == QEMU_CO_SLEEP_PENDING) {
>> +        /* No sleeper, or a wake is already pending. */
>> +        return;
>>      }
>> +
>> +    /* Write of scheduled protected by barrier write in aio_co_schedule */
>> +    const char *scheduled = qatomic_cmpxchg(&co->scheduled,
>> +                                            qemu_co_sleep_ns__scheduled, 
>> NULL);
>> +    assert(scheduled == qemu_co_sleep_ns__scheduled);
>> +    aio_co_wake(co);
>>  }
>>  
>>  static void co_sleep_cb(void *opaque)
>> @@ -43,6 +52,14 @@ static void co_sleep_cb(void *opaque)
>>  void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
>>  {
>>      Coroutine *co = qemu_coroutine_self();
>> +    Coroutine *prev;
>> +
>> +    /* Consume an already-delivered wake without yielding. */
>> +    prev = qatomic_cmpxchg(&w->to_wake, QEMU_CO_SLEEP_PENDING, NULL);
>> +    if (prev == QEMU_CO_SLEEP_PENDING) {
>> +        return;
>> +    }
>> +    assert(prev == NULL);
> I'm wondering if this check serves a purpose when...
>
>>      const char *scheduled = qatomic_cmpxchg(&co->scheduled, NULL,
>>                                              qemu_co_sleep_ns__scheduled);
>> @@ -53,11 +70,23 @@ void coroutine_fn qemu_co_sleep(QemuCoSleep *w)
>>          abort();
>>      }
>>  
>> -    w->to_wake = co;
>> +    /*
>> +     * Publish ourselves as the sleeper. If a wake raced in between the
>> +     * fast path above and now, the cmpxchg observes QEMU_CO_SLEEP_PENDING
>> +     * and we consume it without yielding.
>> +     */
>> +    prev = qatomic_cmpxchg(&w->to_wake, NULL, co);
>> +    if (prev == QEMU_CO_SLEEP_PENDING) {
> ...we essentially have to repeat it here anyway. Couldn't we always just
> use this second check?
good point. Will do.


>> +        qatomic_set(&w->to_wake, NULL);
>> +        qatomic_set(&co->scheduled, NULL);
>> +        return;
>> +    }
>> +    assert(prev == NULL);
>> +
>>      qemu_coroutine_yield();
>>  
>> -    /* w->to_wake is cleared before resuming this coroutine.  */
>> -    assert(w->to_wake == NULL);
>> +    /* The waker left QEMU_CO_SLEEP_PENDING; clear it for the next sleep. */
>> +    qatomic_set(&w->to_wake, NULL);
>>  }
>>  
>>  void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep *w,
>> @@ -70,9 +99,10 @@ void coroutine_fn qemu_co_sleep_ns_wakeable(QemuCoSleep 
>> *w,
>>      timer_mod(&ts, qemu_clock_get_ns(type) + ns);
>>  
>>      /*
>> -     * The timer will fire in the current AiOContext, so the callback
>> -     * must happen after qemu_co_sleep yields and there is no race
>> -     * between timer_mod and qemu_co_sleep.
>> +     * A wake racing with the arming of the sleep -- including the timer
>> +     * we just armed firing in another AioContext before qemu_co_sleep()
>> +     * publishes itself -- is captured by the sticky PENDING state in
>> +     * qemu_co_sleep_wake() and consumed here without yielding.
>>       */
>>      qemu_co_sleep(w);
>>      timer_del(&ts);
> Overall this looks good to me. Some unit tests would be nice, though.
>
Invented, will be sent in v3.

Den

Reply via email to