On 29.10.25 21:23, Kevin Wolf wrote:
On 28.10.2025 at 17:33, Hanna Czenczek wrote:
The cache-cleaner runs as a timer CB in the BDS AioContext. With
multiqueue, it can run concurrently to I/O requests, and because it does
not take any lock, this can break concurrent cache accesses, corrupting
the image. While the chances of this happening are low, it can be
reproduced e.g. by modifying the code to schedule the timer CB every
5 ms (instead of at most once per second) and modifying the last (inner)
while loop of qcow2_cache_clean_unused() like so:
    while (i < c->size && can_clean_entry(c, i)) {
        for (int j = 0; j < 1000 && can_clean_entry(c, i); j++) {
            usleep(100);
        }
        c->entries[i].offset = 0;
        c->entries[i].lru_counter = 0;
        i++;
        to_clean++;
    }
i.e. making it deliberately wait for a point in time at which the cache is
in use by something else.
The solution chosen for this in this patch is hopefully not the best one,
but I admittedly can’t come up with anything strictly better.
We can protect against concurrent cache accesses either by taking the
existing s->lock or by introducing a new (non-coroutine) mutex
specifically for cache accesses.  I would prefer to avoid the latter so
as not to introduce additional (very slight) overhead.
In theory, the old plan was that eventually qcow2 would use fine-grained
locks instead of the single s->lock, and having a separate cache lock
would be a step towards that. But if we never actually make use of it to
hold s->lock for a shorter time, that's not really a good argument. I'm
not sure that's ever going to happen, except maybe as part of a rewrite
in Rust or something.
I never tried to measure specifically if lock contention is a problem
with high queue depth and random I/O on a huge disk. Intuitively,
holding s->lock while doing I/O to load entries into the cache can't
really be good.
Anyway, I went a bit on a tangent there...
However, using s->lock, which is a coroutine mutex, means that we need to
take it in a coroutine, so the timer CB must enter such a coroutine.  As
a result, descheduling the timer no longer guarantees that the
cache-cleaner will not run, because it may now be yielding in
qemu_co_mutex_lock().
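For illustration, entering a coroutine from the timer CB could look
roughly like this (just a sketch of the idea, not the actual patch code;
the entry point name cache_clean_co_entry is made up):

static void coroutine_fn cache_clean_co_entry(void *opaque)
{
    BlockDriverState *bs = opaque;
    BDRVQcow2State *s = bs->opaque;

    /* Take the coroutine mutex before touching the caches */
    qemu_co_mutex_lock(&s->lock);
    qcow2_cache_clean_unused(s->l2_table_cache);
    qcow2_cache_clean_unused(s->refcount_block_cache);
    qemu_co_mutex_unlock(&s->lock);
}

static void cache_clean_timer_cb(void *opaque)
{
    BlockDriverState *bs = opaque;
    BDRVQcow2State *s = bs->opaque;
    Coroutine *co = qemu_coroutine_create(cache_clean_co_entry, bs);

    /* The coroutine may yield in qemu_co_mutex_lock(), so it can still be
     * pending after the timer has been descheduled */
    aio_co_enter(bdrv_get_aio_context(bs), co);

    timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
              (int64_t) s->cache_clean_interval * 1000);
}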
I think creating a coroutine in cache_clean_timer_cb() is the wrong
approach. Instead, cache_clean_timer_init() could create a coroutine
whose implementation could be something like this:
    while (!s->cache_clean_timer_stopping) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_timer_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);
        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }
    s->cache_clean_timer_stopping = false;
Ah, that’s nicer. I think we can replace the flag by checking
s->cache_clean_interval > 0 and adding a CoQueue to wake up waiting
coroutines.
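Roughly like this, maybe (completely untested sketch; cache_clean_wake,
cache_clean_waiters and cache_clean_running are made-up field names, and
the coroutine would have to be (re)created whenever the interval goes
from 0 to non-zero):

static void coroutine_fn cache_clean_co(void *opaque)
{
    BlockDriverState *bs = opaque;
    BDRVQcow2State *s = bs->opaque;

    while (s->cache_clean_interval > 0) {
        qemu_co_sleep_ns_wakeable(&s->cache_clean_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  s->cache_clean_interval *
                                  NANOSECONDS_PER_SECOND);
        if (s->cache_clean_interval == 0) {
            break;
        }

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }

    /* Done (interval was set to 0): wake anyone waiting for the cleaner */
    qemu_co_mutex_lock(&s->lock);
    s->cache_clean_running = false;
    qemu_co_queue_restart_all(&s->cache_clean_waiters);
    qemu_co_mutex_unlock(&s->lock);
}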
(Note that even now, this was only guaranteed for cache_clean_timer_del()
callers that run in the BDS (i.e. the timer’s) AioContext.  For callers
running in the main context, the problem may have already existed,
though maybe the BQL prevents timers from running in other contexts; I’m
not sure.)
Polling to wait for the timer to actually settle seems very complicated
for something that is rather a minor problem, but I can’t come up with
any better solution that doesn’t again just overlook potential problems.
(One cleaner idea might be to have a generic way for timers to run
coroutines, and to await those coroutines when descheduling the timer.
But while cleaner, it would also be more complicated, and I don’t think
it’s worth it at this point.)
(Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
I’m not too fond of this solution.)
Signed-off-by: Hanna Czenczek <[email protected]>
---
block/qcow2.h | 1 +
block/qcow2.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 79 insertions(+), 12 deletions(-)
@@ -867,6 +893,39 @@ static void cache_clean_timer_del(BlockDriverState *bs)
    }
}
+/*
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext, holding s->lock.
+ */
+static void coroutine_fn
+cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    IO_OR_GS_CODE();
+    cache_clean_timer_del(bs);
+    if (qatomic_read(&s->cache_clean_running)) {
+        qemu_co_mutex_unlock(&s->lock);
+        qatomic_set(&s->cache_clean_polling, true);
+        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
Polling in a coroutine_fn is verboten.
If we do need this function, I think it would be a yield here and a wake
on the other side. I think we might be able to get around it if we move
the call from qcow2_do_open() into qcow2_open() (i.e. outside the
coroutine). A bit ugly, so your choice.
We can let a CoQueue do the waking, no?
+        qemu_co_mutex_lock(&s->lock);
+    }
+}
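I.e. roughly like this (untested sketch, again with the made-up
cache_clean_wake/cache_clean_waiters fields, assuming the cleaner is the
background coroutine sketched above and clears cache_clean_running under
s->lock before restarting the queue):

static void coroutine_fn
cache_clean_timer_locked_co_del_and_wait(BlockDriverState *bs)
{
    BDRVQcow2State *s = bs->opaque;

    /* Called with s->lock held: tell the cleaner to stop and wake it up */
    s->cache_clean_interval = 0;
    qemu_co_sleep_wake(&s->cache_clean_wake);

    /* qemu_co_queue_wait() drops s->lock while waiting and re-takes it
     * after the cleaner has woken us via qemu_co_queue_restart_all() */
    while (s->cache_clean_running) {
        qemu_co_queue_wait(&s->cache_clean_waiters, &s->lock);
    }
}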
+
+/*
+ * Delete the cache clean timer and await any yet running instance.
+ * Must be called from the main or BDS AioContext without s->lock held.
+ */
+static void cache_clean_timer_del_and_wait(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    IO_OR_GS_CODE();
+    cache_clean_timer_del(bs);
+    if (qatomic_read(&s->cache_clean_running)) {
+        qatomic_set(&s->cache_clean_polling, true);
+        BDRV_POLL_WHILE(bs, qatomic_read(&s->cache_clean_running));
+    }
+}
+
static void qcow2_detach_aio_context(BlockDriverState *bs)
{
    cache_clean_timer_del(bs);
@@ -1214,12 +1273,20 @@ fail:
    return ret;
}
+/* s_locked specifies whether s->lock is held or not */
static void qcow2_update_options_commit(BlockDriverState *bs,
-                                        Qcow2ReopenState *r)
+                                        Qcow2ReopenState *r,
+                                        bool s_locked)
{
    BDRVQcow2State *s = bs->opaque;
    int i;
+    if (s_locked) {
+        cache_clean_timer_locked_co_del_and_wait(bs);
+    } else {
+        cache_clean_timer_del_and_wait(bs);
+    }
+
    if (s->l2_table_cache) {
        qcow2_cache_destroy(s->l2_table_cache);
    }
@@ -1228,6 +1295,10 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
    }
    s->l2_table_cache = r->l2_table_cache;
    s->refcount_block_cache = r->refcount_block_cache;
+
+    s->cache_clean_interval = r->cache_clean_interval;
+    cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
+
    s->l2_slice_size = r->l2_slice_size;
    s->overlap_check = r->overlap_check;
@@ -1239,12 +1310,6 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
    s->discard_no_unref = r->discard_no_unref;
-    if (s->cache_clean_interval != r->cache_clean_interval) {
-        cache_clean_timer_del(bs);
-        s->cache_clean_interval = r->cache_clean_interval;
-        cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
-    }
-
I think the del/init pair here won't be necessary any more after
switching to the background coroutine. It will just start using the new
value of s->cache_clean_interval the next time it sleeps.
One problem is that if we don’t lock s->lock, the coroutine can read
s->l2_table_cache and s->refcount_block_cache while they’re invalid,
which is why I moved the deletion above. We also still need to delete
if the interval is set to 0 (or special-case that in the coroutine to
wait forever).
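(For the wait-forever alternative, the loop could instead special-case
interval == 0 roughly like this; again only a sketch of the interval
handling, not of how the coroutine would eventually be stopped:)

    for (;;) {
        uint64_t interval = s->cache_clean_interval;

        /* interval == 0 disables cleaning: sleep for a long time and rely
         * on being woken when the interval is changed */
        if (interval == 0) {
            interval = 3600;
        }
        qemu_co_sleep_ns_wakeable(&s->cache_clean_wake,
                                  QEMU_CLOCK_VIRTUAL,
                                  interval * NANOSECONDS_PER_SECOND);
        if (s->cache_clean_interval == 0) {
            continue;
        }

        qemu_co_mutex_lock(&s->lock);
        qcow2_cache_clean_unused(s->l2_table_cache);
        qcow2_cache_clean_unused(s->refcount_block_cache);
        qemu_co_mutex_unlock(&s->lock);
    }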
We could run all of this in a coroutine so that we can take s->lock, or
we would have to force-stop the timer/coroutine at the start.  Maybe
running it in a coroutine is better…
Hanna