On 17.11.25 15:50, Kevin Wolf wrote:
Am 10.11.2025 um 16:48 hat Hanna Czenczek geschrieben:
The cache-cleaner runs as a timer CB in the BDS AioContext.  With
multiqueue, it can run concurrently with I/O requests, and because it does
not take any lock, this can break concurrent cache accesses, corrupting
the image.  While the chances of this happening are low, it can be
reproduced e.g. by modifying the code to schedule the timer CB every
5 ms (instead of at most once per second) and modifying the last (inner)
while loop of qcow2_cache_clean_unused() like so:

     while (i < c->size && can_clean_entry(c, i)) {
         for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
             usleep(100);
         }
         c->entries[i].offset = 0;
         c->entries[i].lru_counter = 0;
         i++;
         to_clean++;
     }

i.e. deliberately making it wait for a point in time when the cache is
in use by something else.

I hope the solution chosen in this patch is not the best one possible,
but I admittedly can’t come up with anything strictly better.

We can protect against concurrent cache accesses either by taking the
existing s->lock or by introducing a new (non-coroutine) mutex
specifically for cache accesses.  I would prefer to avoid the latter so
as not to introduce additional (very slight) overhead.

Using s->lock, which is a coroutine mutex, however means that we need to
take it in a coroutine, so the timer must run in a coroutine.  We can
transform it from the current timer CB style into a coroutine that
sleeps for the set interval.  As a result, however, we can no longer
just deschedule the timer to instantly guarantee it won’t run anymore,
but have to await the coroutine’s exit.

(Note even before this patch there were places that may not have been so
guaranteed after all: Anything calling cache_clean_timer_del() from the
QEMU main AioContext could have been running concurrently to an existing
timer CB invocation.)

Polling to wait for the timer to actually settle seems very complicated
for something that’s rather a minor problem, but I can’t come up with any
better solution that doesn’t again just overlook potential problems.

(Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
I’m not too fond of this solution.)

Signed-off-by: Hanna Czenczek <[email protected]>
-static void cache_clean_timer_cb(void *opaque)
+static void coroutine_fn cache_clean_timer(void *opaque)
  {
-    BlockDriverState *bs = opaque;
-    BDRVQcow2State *s = bs->opaque;
-    qcow2_cache_clean_unused(s->l2_table_cache);
-    qcow2_cache_clean_unused(s->refcount_block_cache);
-    timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
-              (int64_t) s->cache_clean_interval * 1000);
+    BDRVQcow2State *s = opaque;
+    uint64_t wait_ns;
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
+    }
+
+    while (wait_ns > 0) {
+        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
+                                  QEMU_CLOCK_VIRTUAL, wait_ns);
+
+        WITH_QEMU_LOCK_GUARD(&s->lock) {
+            if (s->cache_clean_interval > 0) {
+                qcow2_cache_clean_unused(s->l2_table_cache);
+                qcow2_cache_clean_unused(s->refcount_block_cache);
+            }
+
+            wait_ns = s->cache_clean_interval * NANOSECONDS_PER_SECOND;
+        }
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        s->cache_clean_timer_co = NULL;
+        qemu_co_queue_restart_all(&s->cache_clean_timer_exit);
+    }
  }
+/**
+ * Delete the cache-clean timer and wait for any still-running instance.
+ * Called holding s->lock.
+ */
+static void coroutine_fn
+cache_clean_timer_co_locked_del_and_wait(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+
+    if (s->cache_clean_timer_co) {
+        s->cache_clean_interval = 0;
+        qemu_co_sleep_wake(&s->cache_clean_timer_wake);
+        qemu_co_queue_wait(&s->cache_clean_timer_exit, &s->lock);
+    }
+}
I don't want to count the number of context switches this dance between
cache_clean_timer_co_locked_del_and_wait() and cache_clean_timer()
takes! Good that it's not a hot path. :-)

+/* s_locked specifies whether s->lock is held or not */
  static void qcow2_update_options_commit(BlockDriverState *bs,
-                                        Qcow2ReopenState *r)
+                                        Qcow2ReopenState *r,
+                                        bool s_locked)
  {
      BDRVQcow2State *s = bs->opaque;
      int i;
+    /*
+     * We need to stop the cache-clean-timer before destroying the metadata
+     * table caches
+     */
+    if (s_locked) {
+        cache_clean_timer_co_locked_del_and_wait(bs);
+    } else {
+        cache_clean_timer_del_and_wait(bs);
+    }
+
      if (s->l2_table_cache) {
          qcow2_cache_destroy(s->l2_table_cache);
      }
@@ -1228,6 +1312,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
      }
      s->l2_table_cache = r->l2_table_cache;
      s->refcount_block_cache = r->refcount_block_cache;
+
+    s->cache_clean_interval = r->cache_clean_interval;
+    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
+
      s->l2_slice_size = r->l2_slice_size;
      s->overlap_check = r->overlap_check;

@@ -1239,12 +1327,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
      s->discard_no_unref = r->discard_no_unref;

-    if (s->cache_clean_interval != r->cache_clean_interval) {
-        cache_clean_timer_del(bs);
-        s->cache_clean_interval = r->cache_clean_interval;
-        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
-    }
-
      qapi_free_QCryptoBlockOpenOptions(s->crypto_opts);
      s->crypto_opts = r->crypto_opts;
  }
Is there any specific reason why you move cache_clean_timer_init()
earlier? I don't see an actual problem with the code as it is after this
change, but s->l2_slice_size is related to the cache in a way, so it
would feel safer if the cache cleaner were only started once all the
settings are updated and not only those that it actually happens to
access at the moment.

Oh.  I don’t think there’s a good reason.  I think it makes sense to keep the set-up in the old place.  Can you do that in your tree?

(I think it’s just because when I split it to stop the timer before deleting the caches, I decided it would conversely make sense to start it after they have been set up.)

Hanna

