On Thu, Nov 06, 2025 at 10:40:52AM +0900, Akihiko Odaki wrote:
> > > > > +    /*
> > > > > +     * Ensure that the forced variable has not been set after fetching
> > > > > +     * rcu_call_count; otherwise we may get confused by a force quiescent
> > > > > +     * state request for an element later than n.
> > > > > +     */
> > > > > +    while (qatomic_xchg(&forced, false)) {
> > > > > +        sleep = false;
> > > > > +        n = qatomic_read(&rcu_call_count);
> > > > > +    }
> > > >
> > > > This is pretty tricky, and I wonder if it would make the code easier
> > > > to read if we convert sync_event to a semaphore instead. As a sem, it
> > > > will take account of every kick to it, either from a call_rcu1() or
> > > > from an enforced rcu flush, so that we don't need to reset it.
> > > > Meanwhile, we don't need to worry about a slightly outdated "n" read,
> > > > because the 2nd round of sem_wait() will catch the new "n".
> > > >
> > > > Instead, the worst case is that the rcu thread runs one more round
> > > > without seeing callbacks on the queue.
> > > >
> > > > I'm not sure if that could help simplify the code, and maybe also make
> > > > it less error-prone.
> > >
> > > Semaphore is not applicable here because it will not de-duplicate
> > > concurrent kicks of RCU threads.
> >
> > Why are concurrent kicks of rcu threads a problem? QemuSemaphore is
> > thread-safe itself; meanwhile, IIUC it only causes call_rcu_thread() to
> > loop some more rounds reading "n", which looks all safe. No?
>
> It is safe, but it incurs overhead and is confusing. QemuEvent represents
> the boolean semantics better.
>
> I also have difficulty understanding how converting sync_event to a
> semaphore simplifies the code. Perhaps some (pseudo)code showing how the
> code would look may be useful.

I prepared a patch on top of your current patchset to show what I meant. I
also added comments and some test results showing why I think the sem
overhead should be small enough to be fine.

In short, I tested a VM with 8 vCPUs and 4G mem, booting Linux and then
powering it off properly, and I only saw <1000 call_rcu1() calls in total.
That should be the upper bound of the sem overhead from extra loops in the
rcu thread.

It's in patch format, but please still treat it as a comment for
discussion. Attaching it this way is just easier for me.
===8<===
From 71f15ed19050a973088352a8d71b6cc6b7b5f7cf Mon Sep 17 00:00:00 2001
From: Peter Xu <[email protected]>
Date: Thu, 6 Nov 2025 16:03:00 -0500
Subject: [PATCH] rcu: Make sync_event a semaphore
It can simplify all the reset logic; especially, after forced rcu is
introduced we would also need a tweak to re-read "n", which can be avoided
with a sem.

However, the sem can introduce an overhead with high-frequency rcu frees.
This patch is drafted with the assumption that rcu frees are at least very
rare in QEMU, hence that's not a problem.
When I tested with this command:
qemu-system-x86_64 -M q35,kernel-irqchip=split,suppress-vmdesc=on -smp 8 \
-m 4G -msg timestamp=on -name peter-vm,debug-threads=on -cpu Nehalem \
-accel kvm -qmp unix:/tmp/peter.sock,server,nowait -nographic \
-monitor telnet::6666,server,nowait -netdev user,id=net0,hostfwd=tcp::5555-:22 \
-device e1000,netdev=net0 -device virtio-balloon $DISK
I booted a pre-installed Linux guest, logged in and powered it off, and
waited until the VM completely shut down. I captured fewer than 1000
call_rcu1() calls in total. It means that for the whole lifetime of such a
VM, the max overhead of the call_rcu_thread() loop re-reading
rcu_call_count will be 1000 loops.
Signed-off-by: Peter Xu <[email protected]>
---
util/rcu.c | 36 ++++++++----------------------------
1 file changed, 8 insertions(+), 28 deletions(-)
diff --git a/util/rcu.c b/util/rcu.c
index 85f9333f5d..dfe031a5c9 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -54,7 +54,7 @@ static int rcu_call_count;
static QemuMutex rcu_registry_lock;
/* Set when the forced variable is set or rcu_call_count becomes non-zero. */
-static QemuEvent sync_event;
+static QemuSemaphore sync_event;
/*
* Check whether a quiescent state was crossed between the beginning of
@@ -80,7 +80,7 @@ static ThreadList registry = QLIST_HEAD_INITIALIZER(registry);
void force_rcu(void)
{
qatomic_set(&forced, true);
- qemu_event_set(&sync_event);
+ qemu_sem_post(&sync_event);
}
/* Wait for previous parity/grace period to be empty of readers. */
@@ -148,7 +148,7 @@ static void wait_for_readers(bool sleep)
*/
qemu_event_reset(&rcu_gp_event);
} else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
- !sleeps || qemu_event_timedwait(&sync_event, 10)) {
+ !sleeps || qemu_sem_timedwait(&sync_event, 10)) {
/*
* Now one of the following heuristical conditions is satisfied:
* - A decent number of callbacks piled up.
@@ -286,7 +286,6 @@ static void *call_rcu_thread(void *opaque)
rcu_register_thread();
for (;;) {
- bool sleep = true;
int n;
/*
@@ -294,7 +293,6 @@ static void *call_rcu_thread(void *opaque)
* added before enter_qs() starts.
*/
for (;;) {
- qemu_event_reset(&sync_event);
n = qatomic_read(&rcu_call_count);
if (n) {
break;
@@ -303,36 +301,19 @@ static void *call_rcu_thread(void *opaque)
#if defined(CONFIG_MALLOC_TRIM)
malloc_trim(4 * 1024 * 1024);
#endif
- qemu_event_wait(&sync_event);
+ qemu_sem_wait(&sync_event);
}
- /*
- * Ensure that an event for a rcu_call_count change will not interrupt
- * wait_for_readers().
- */
- qemu_event_reset(&sync_event);
-
- /*
- * Ensure that the forced variable has not been set after fetching
- * rcu_call_count; otherwise we may get confused by a force quiescent
- * state request for an element later than n.
- */
- while (qatomic_xchg(&forced, false)) {
- sleep = false;
- n = qatomic_read(&rcu_call_count);
- }
-
- enter_qs(sleep);
+ enter_qs(!qatomic_xchg(&forced, false));
qatomic_sub(&rcu_call_count, n);
bql_lock();
while (n > 0) {
node = try_dequeue();
while (!node) {
bql_unlock();
- qemu_event_reset(&sync_event);
node = try_dequeue();
if (!node) {
- qemu_event_wait(&sync_event);
+ qemu_sem_wait(&sync_event);
node = try_dequeue();
}
bql_lock();
@@ -352,7 +333,7 @@ void call_rcu1(struct rcu_head *node, void (*func)(struct rcu_head *node))
enqueue(node);
if (!qatomic_fetch_inc(&rcu_call_count)) {
- qemu_event_set(&sync_event);
+ qemu_sem_post(&sync_event);
}
}
@@ -456,8 +437,7 @@ static void rcu_init_complete(void)
qemu_mutex_init(&rcu_registry_lock);
qemu_event_init(&rcu_gp_event, true);
-
- qemu_event_init(&sync_event, false);
+ qemu_sem_init(&sync_event, 0);
/* The caller is assumed to have BQL, so the call_rcu thread
* must have been quiescent even after forking, just recreate it.
--
2.50.1
===8<===
When I had a closer look, I found some other issues; I'll list them all
here.
1. I found that rcu_gp_event is initialized as "true". I'm not sure whether
   it should be false. This has nothing to do with your series either; it's
   just an observation of mine:

      qemu_event_init(&rcu_gp_event, true);
2. It looks to me like your patch here checks the wrong retval of
   qemu_event_timedwait():
} else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
!sleeps || qemu_event_timedwait(&sync_event, 10)) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   Should it be "!qemu_event_timedwait(&sync_event, 10)"? Side note: maybe
   we should clean up all the _timedwait() variants for QEMU's eventfd, sem,
   cond, ... because they don't return the same retval. But if you think the
   sem is good, then we don't need eventfd's timedwait() in this series
   (your initial two patches).
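   I.e., just to illustrate what I mean (depending on what
   qemu_event_timedwait() actually returns on timeout), the check would
   become:

      } else if (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
                 !sleeps || !qemu_event_timedwait(&sync_event, 10)) {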
3. I wonder whether the malloc_trim() feature is broken with your current
   patchset, because now the loop looks like:
      for (;;) {
          qemu_event_reset(&sync_event);
          n = qatomic_read(&rcu_call_count);
          if (n) {
              break;
          }

      #if defined(CONFIG_MALLOC_TRIM)
          malloc_trim(4 * 1024 * 1024);
      #endif
          qemu_event_wait(&sync_event);
      }
   I don't know if n can be zero here at all. If you use the sem change this
   path might trigger, but it's not designed for it; when using a sem we may
   want to trim periodically instead (rough sketch below). But I'd like to
   know what you think about the sem idea in general first (e.g. do we need
   to be prepared for high-frequency rcu frees).
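   For example, an untested sketch of what periodic trimming could look
   like on top of the sem version (assuming qemu_sem_timedwait() returns 0
   when the sem was posted and non-zero on timeout):

      for (;;) {
          n = qatomic_read(&rcu_call_count);
          if (n) {
              break;
          }
          /* Wait up to 10s for a kick; only trim when nothing came in. */
          if (qemu_sem_timedwait(&sync_event, 10 * 1000) == 0) {
              continue;
          }
      #if defined(CONFIG_MALLOC_TRIM)
          malloc_trim(4 * 1024 * 1024);
      #endif
      }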
Thanks,
--
Peter Xu