On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
11.03.2021 22:58, Max Reitz wrote:
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
There is a bug in qcow2: a host cluster can be discarded (its refcount
becomes 0) and reused during a data write. In this case the data write may
pollute another (recently allocated) cluster or even metadata.
I was about to ask whether we couldn’t somehow let discard requests
check overlapping I/O with the tracked_request infrastructure from
block/io.c (or perhaps even just let them be serializing).
But I suppose there may be other reasons for clusters being discarded
other than explicit discard requests, so we have to have something in
qcow2...
For example, a host cluster which contains exactly one compressed cluster
may be discarded when the corresponding guest cluster is rewritten by a
non-compressed write.
Also (not a very good argument for fixing an existing bug, but still),
serializing will not help with the compressed cache, because with the
compressed cache we'll have data writes (cache flushes) that are not related
to the current in-flight request. And that leads us to something like
serializing operations on the .file child of the qcow2 node. But we can't do
that, as "discarding" a host cluster is an operation on metadata, which may
involve no operation on the underlying .file child at all.
Yes, makes sense. I think it is a good argument, because the point is
that qcow2 host cluster discards are not necessarily related to guest
discards. The compressed cache is just one new example for that.
To fix the issue, let's track in-flight writes to host clusters in a hash
table and consider the new counter when discarding and reusing host
clusters.
This didn’t really explain anything that would help me understand
what’s going on in this patch. The patch itself doesn’t really help
me with comments either. It’s possible that I’m just daft, but
honestly I have a hard time really understanding what’s supposed to be
going on.
Sorry for this. I trust your experience at understanding my patches, so
this shows that something is wrong with the patch itself :)
Coming back from jumping all over this patch, I also wonder why this
isn’t split in multiple patches, where the first introduces the
infrastructure and explains exactly what’s going on, and the next
patches make use of it so I could understand what each check etc. is
supposed to represent and do.
OK
To serve as an example, after reading through the patch, I still have
no idea how it prevents that discard problem you’re trying to fix.
Maybe I’m lazy, but I would have liked exactly that to be explained by
this commit message. Actually, I don’t even understand the problem
just from this commit message alone, but I could resort to HEAD^ and
some thinking. Not sure if that’s ideal, though.
The problem:
1. Start a write to qcow2. Assume the guest cluster is G and the
corresponding host cluster is H.
2. The write request reaches the point of writing data to .file. The write
to .file is started and the qcow2 mutex is unlocked.
3. At this time the refcount of H becomes 0. For example, this may be due to
a discard operation on the qcow2 node, rewriting compressed data with a
normal write, or some operation with snapshots...
4. Next, some operation occurs that leads to H being allocated for other
needs: it may be another write-to-qcow2-node operation, or allocation of an
L2 table, or some other data or metadata cluster allocation.
5. So, at this point H is used for something else. Assume an L2 table is
written into H.
6. And now our write from [2] finishes and pollutes the L2 table in H.
That's a bug.
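
In code terms, the vulnerable path looks roughly like this (a simplified
sketch, not the literal code; get_host_offset() is a hypothetical stand-in
for the actual cluster translation, and error handling is omitted):

    /* Simplified sketch of the data-write path before this fix */
    qemu_co_mutex_lock(&s->lock);
    host_offset = get_host_offset(bs, guest_offset);  /* G -> H */
    qemu_co_mutex_unlock(&s->lock);

    /*
     * The window of the bug: nothing pins H here.  A parallel discard
     * can drop H's refcount to 0 [3], an allocation can then reuse H,
     * e.g. for an L2 table [4-5], and the write below lands on top of
     * it [6].
     */
    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);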
Understood.
The problem becomes more likely with the compressed write cache, because
any flush operation may trigger deferred data writes, so the problem can be
reproduced like this:
1. Write to guest cluster G2, which triggers flushing of data to host
cluster H, which is unrelated to G2.
2. Discard cluster G in parallel
[...]
So, the problem is simply triggered by a parallel write and discard to
different guest clusters. That's why I should do something about this old
(and most probably never-triggered) problem in the qcow2 driver prior to
implementing the compressed cache.
OK! That makes sense.
Hmm, probably we can avoid all these difficulties by implementing the
compressed cache as a filter driver between qcow2 and its .file node...
As I see it, you’ve now opened the can of worms, and I’m not sure we can
just close it and say we haven’t seen any problem. :)
So, regardless of the compressed cache, I think this should be fixed.
I tried to do it and encountered some problems, but I don't remember what
exactly. Probably I should return to that attempt. Some obvious problems:
1. Metadata writes will go through the cache, and we should handle them
somehow so as not to break the sequence of unaligned chunks. The simplest
way is to add a BDRV_REQ_COLORED flag and do compressed data writes to
.file with this flag. The cache filter would then track this flag for
sequential caching.
2. We can't enable caching by default this way, so the user would have to
explicitly add the filter... And actually, it's a real problem of the qcow2
driver that in O_DIRECT mode it writes compressed data in small unaligned
chunks. So it would be good to fix it inside qcow2...
Agreed.
With a filter we avoid the problem of parallel reuse of a host cluster, as
all writes go through the cache and flushes will be serialized. So the
problem in the qcow2 driver is not made worse by the cache and we could
leave it unfixed...
Not sure I agree. ;)
So the commit message says that “host cluster can be discarded and
reused during data write”, which to me just seems like the exact
behavior we want. The second sentence describes a problem, but
doesn’t say how reclaiming discarded clusters leads there.
I suppose what leads to the problem is that there first needs to be a
background write even before the discard that is settled only after
the re-claiming write (which isn’t mentioned).
As far as I understand, this patch addresses that problem by
preventing the discarded clusters from being allocated while said
background write is not settled yet. This is done by keeping track of
all host clusters that are currently the target of some write
operation, and preventing them from being allocated. OK, makes sense.
(This latter part does roughly correspond to the second paragraph of
this commit message, but I couldn’t make sense of it without
understanding why we’d do it. What’s notably missing from my
explanation is the part that you hinted at with “when discarding”, but
my problem is that I just don't understand what that means at
all. Perhaps it has to do with how I don’t understand the change to
update_refcount(), and consequently the whole “slow path” in
update_inflight_write_cnt().)
Let me explain:
What is the problem? A user uses a dynamic object, and the object
disappears during use. A use-after-free follows. This means that we
probably want reference counting for the object, so that the user keeps a
reference during usage, and the object is freed when the refcount becomes
0. But for qcow2 host clusters we already have reference counting!
Makes sense.
Still, we don't want to simply reuse qcow2 refcounts as normal reference
counters, for two reasons:
1. qcow2 refcounts are part of the qcow2 metadata and strictly defined. A
qcow2 refcount shows only the usage of the host cluster by other objects in
the qcow2 metadata. But what we want to track is usage by the process of
writing data.
Indeed.
2. We don't want to do additional reads/writes of refcounts from/to the
physical drive.
Right.
So, what comes to my mind is an additional "dynamic" in-RAM reference
counter for each host cluster, so that the actual full reference count of a
host cluster is qcow2-refcount + inflight-write-cnt.
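
In other words (a conceptual sketch, not code from the patch):

    /* A host cluster may only be freed, or reused by a new allocation,
     * when its full reference count drops to zero: */
    bool fully_free = (qcow2_refcount == 0) && (inflight_write_cnt == 0);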
OK, yes, that makes a lot of sense.
So, what we need for this:
1. Maintain this new inflight-write-cnt: increase it before a data write
and decrease it after (see the sketch after this list).
2. Update the core part of reference counting: freeing the object.
Makes sense, but I don’t see how your change to update_refcount() does
that, though. But OTOH I do see how you prevent use-after-frees (by
preventing allocations of clusters with refcount == 0 but with active
writes on them), which works just as well.
And if freeing the object (clusters) were postponed until all writes are
settled, I don’t think we’d need the checks in the cluster allocation
functions.
Currently the object is "freed" when the qcow2-refcount becomes 0; that's
the code in update_refcount(). But we want to postpone freeing until the
full reference count becomes zero, i.e. both the qcow2-refcount is zero and
the inflight-write-cnt is zero. So, if the qcow2-refcount becomes zero but
the inflight-write-cnt is nonzero, we postpone discarding the cluster,
storing the needed information in Qcow2InFlightRefcount. And when the
inflight-write-cnt becomes zero (and we see this information) we call
update_refcount(addend=0) to trigger the actual discarding of the cluster.
3. Consider the full reference count when searching for free host
clusters.
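
Putting [1] together with the helpers this patch adds, the data-write path
becomes something like this (an illustrative sketch, not the literal hunk
from qcow2.c below):

    /* Pin the host cluster before dropping the qcow2 mutex */
    qemu_co_mutex_lock(&s->lock);
    /* ... translate the guest offset to host_offset ... */
    qcow2_inflight_writes_inc(bs, host_offset, bytes);
    qemu_co_mutex_unlock(&s->lock);

    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);

    /* Unpin; if the qcow2-refcount already dropped to zero meanwhile,
     * this triggers the postponed freeing from [2] */
    qcow2_inflight_writes_dec(bs, host_offset, bytes);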
Side note: Intuitively, this hash table feels like an unnecessarily
big hammer to me, but I can’t think of a better solution either, so.
(I mainly don’t like how every write request will result in one
allocation and hash table entry per cluster it touches, even when no
data cluster is ever discarded.)
We can have a list of inflight-data-writes instead.
Yes. I’m not sure which would actually be better in the end.
I feel such a list would be better for the normal case (just writes, no
allocations or discards), but I don’t know how much worse they’d be when
something needs to be looked up (i.e. on allocations or discards).
I'm afraid we can't reuse the tracked requests of the underlying .file
node, because we have to add the in-flight write to the list before
releasing the qcow2 mutex.
Yes.
Enable qcow2-discard-during-rewrite as it is fixed.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
---
block/qcow2.h | 9 ++
block/qcow2-refcount.c | 149 +++++++++++++++++-
block/qcow2.c | 26 ++-
.../tests/qcow2-discard-during-rewrite | 2 +-
4 files changed, 181 insertions(+), 5 deletions(-)
diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..e18d8dfbae 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
* is to convert the image with the desired compression type set.
*/
Qcow2CompressionType compression_type;
+
+ GHashTable *inflight_writes_counters;
} BDRVQcow2State;
typedef struct Qcow2COWRegion {
@@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+ int64_t length);
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+ int64_t length);
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length);
+
+
/* qcow2-cluster.c functions */
int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
bool exact_size);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..464d133368 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -799,6 +799,140 @@ found:
}
}
+/*
+ * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
+ * hasm map. And it's keys are cluster indices.
*hash, *its
I’d rather document this for s->inflight_writes_counters, though (like
“/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)
+ */
+typedef struct Qcow2InFlightRefcount {
+ /*
+ * Number of in-flight writes to the cluster, always > 0, as when becomes
s/when becomes/when it becomes/
+ * 0 the entry is removed from s->inflight_writes_counters.
+ */
+ uint64_t inflight_writes_cnt;
+
+ /* Cluster refcount is known to be zero */
+ bool refcount_zero;
+
+ /* Cluster refcount was made zero with this discard-type */
+ enum qcow2_discard_type type;
+} Qcow2InFlightRefcount;
+
+static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
+ int64_t cluster_index)
+{
+ Qcow2InFlightRefcount *infl;
+
+ if (!s->inflight_writes_counters) {
+ return NULL;
+ }
+
+ infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
+
+ if (infl) {
+ assert(infl->inflight_writes_cnt > 0);
+ }
+
+ return infl;
+}
+
+/*
+ * Returns true if there are any in-flight writes to the cluster blocking
+ * its reallocation.
+ */
+static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
+{
+ return !!find_infl_wr(s, cluster_index);
I wonder if g_hash_table_contains() would be quicker. *shrug*
+}
+
+static int update_inflight_write_cnt(BlockDriverState *bs,
+ int64_t offset, int64_t length,
+ bool decrease, bool locked)
+{
+ BDRVQcow2State *s = bs->opaque;
+ int64_t start, last, cluster_offset;
+
+ if (locked) {
+ qemu_co_mutex_assert_locked(&s->lock);
+ }
+
+ start = start_of_cluster(s, offset);
+ last = start_of_cluster(s, offset + length - 1);
+ for (cluster_offset = start; cluster_offset <= last;
+ cluster_offset += s->cluster_size)
+ {
+ int64_t cluster_index = cluster_offset >> s->cluster_bits;
+ Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+ if (!infl) {
+ infl = g_new0(Qcow2InFlightRefcount, 1);
+ g_hash_table_insert(s->inflight_writes_counters,
+                     g_memdup(&cluster_index, sizeof(cluster_index)),
+                     infl);
I suppose we could save one allocation by putting the cluster index
into Qcow2InFlightRefcount and then giving the key as
&infl->cluster_index. Not sure if that’s too nasty, though.
Allocations are a lot faster than I/O... So I don't think we'll see the
difference in normal cases.
Sure.
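
For illustration, the one-allocation variant could look like this (an
untested sketch; it assumes a cluster_index field is added to
Qcow2InFlightRefcount):

    /* The key lives inside the value, saving the g_memdup() allocation */
    infl = g_new0(Qcow2InFlightRefcount, 1);
    infl->cluster_index = cluster_index;
    g_hash_table_insert(s->inflight_writes_counters,
                        &infl->cluster_index, infl);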
+ }
+
+ if (decrease) {
+ assert(infl->inflight_writes_cnt >= 1);
+ infl->inflight_writes_cnt--;
+ } else {
+ infl->inflight_writes_cnt++;
+ }
+
+ if (infl->inflight_writes_cnt == 0) {
+ bool refcount_zero = infl->refcount_zero;
+ enum qcow2_discard_type type = infl->type;
+
+ g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+ if (refcount_zero) {
+ /*
+ * Slow path. We must reset normal refcount to actually release
I don’t understand what “slow path” means in this context. (Nor do I
really understand the rest, but more on that below.)
In most cases inc/dec of inflight-writes-cnt is fast: it only updates the
in-memory structure. But when the full reference count of the cluster
becomes zero, we call the real update_refcount() to trigger discarding of
the cluster... This may be slower. Probably the needed code should be moved
from update_refcount() to a separate function like host_cluster_free(), so
as not to cheat with addend=0.
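
For instance (a hypothetical refactoring sketch; as a first step it could
simply wrap the addend=0 call, with the freeing logic factored out of
update_refcount() later):

    /* Hypothetical dedicated entry point for releasing a host cluster
     * whose qcow2-refcount already reached zero earlier */
    static int host_cluster_free(BlockDriverState *bs, int64_t cluster_index,
                                 enum qcow2_discard_type type)
    {
        /* re-runs the refcount-zero path (cache cleanup + discard) */
        return qcow2_update_cluster_refcount(bs, cluster_index, 0, true, type);
    }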
OK.
+ * the cluster.
+ */
+ int ret;
+
+ if (!locked) {
+ qemu_co_mutex_lock(&s->lock);
+ }
+                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
+                                                    true, type);
I don’t understand this, but then again I’m writing this very early in
my review, and at this point I don’t understand anything, really.
(The comment doesn’t explain to me what’s happening here.)
Revisiting later, refcount_zero is set by update_refcount() when the
refcount drops to zero, so invoking this function here will finalize
the change. I don’t quite understand what’s going on in
update_refcount(), though.
(And even after finding out why, I don’t understand why the comment
doesn’t explain why we must invoke qcow2_update_cluster_refcount()
with addend=0.)
Sorry for the mess. Here we want to trigger the code in update_refcount()
that frees the host cluster... I.e. the code that runs when the refcount
becomes zero.
+ if (!locked) {
+ qemu_co_mutex_unlock(&s->lock);
+ }
+
+ if (ret < 0) {
+ return ret;
+ }
+ }
+ }
+
+ }
+
+ return 0;
+}
+
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+ int64_t length)
+{
+ return update_inflight_write_cnt(bs, offset, length, false, false);
+}
+
+/*
+ * Called with s->lock not locked by caller. Will take s->lock only if need to
+ * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
+ */
Why doesn’t qcow2_inflight_writes_inc() get the same comment?
...Oh, because @locked doesn’t have an influence there. Why isn’t it
worth a comment that *_inc() works both with a locked and an unlocked
mutex?
OK, it's worth a comment.
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+ int64_t length)
+{
+ return update_inflight_write_cnt(bs, offset, length, true, false);
+}
+
+/* Called with s->lock locked. */
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length)
+{
+ return update_inflight_write_cnt(bs, offset, length, true, true);
+}
+
/* XXX: cache several refcount block clusters ? */
/* @addend is the absolute value of the addend; if @decrease is set, @addend
 * will be subtracted from the current refcount, otherwise it will be added */
@@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
if (refcount == 0) {
void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+ if (infl) {
+ infl->refcount_zero = true;
+ infl->type = type;
+ continue;
+ }
I don’t understand what this is supposed to do exactly. It seems like
it wants to keep metadata structures in the cache that are still in
use (because dropping them from the caches is what happens next), but
users of metadata structures won’t set in-flight counters for those
metadata structures, will they?
I don't follow. We want the code in "if (refcount == 0)" to be triggered
only when the full reference count of the host cluster becomes 0, including
the inflight-write-cnt. So, if at this point the inflight-write-cnt is not
0, we postpone freeing the host cluster; it will be done later from the
"slow path" in update_inflight_write_cnt().
But the code under “if (refcount == 0)” doesn’t free anything, does it?
All I can see is code to remove metadata structures from the metadata
caches (if the discarded cluster was an L2 table or a refblock), and
finally the discard on the underlying file. I don’t see how that
protocol-level discard has anything to do with our problem, though.
As far as I understand, the freeing happens immediately above the “if
(refcount == 0)” block by s->set_refcount() setting the refcount to 0.
(including updating s->free_cluster_index if the refcount is 0).
(I also wondered why the continue wasn’t put before the
s->set_refcount() call, but I suppose the same effect is had by the
has_infl_wr() check in alloc_clusters_noref() and
qcow2_alloc_clusters_at().)
The only change to the update_refcount() logic is the postponing of freeing
the cluster. So, the handling of the qcow2-refcount is kept as is: the
qcow2-refcount becomes zero, and s->set_refcount() is called. The
inflight-write-cnt is a separate thing, not a cache for the qcow2-refcount.
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
@@ -983,7 +1124,7 @@ retry:
if (ret < 0) {
return ret;
- } else if (refcount != 0) {
+        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
goto retry;
}
}
@@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
if (ret < 0) {
return ret;
- } else if (refcount != 0) {
+        } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
break;
}
}
@@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
contiguous_free_clusters < cluster_count;
cluster++)
{
- if (!s->get_refcount(*refcount_table, cluster)) {
+ if (!s->get_refcount(*refcount_table, cluster) &&
+ !has_infl_wr(s, cluster))
Is this really necessary?
Yes. Everywhere we want a free cluster, we should check that the full
reference count is zero, which is qcow2-refcount + inflight-write-cnt...
Which is equivalent to checking that both the qcow2-refcount and the
inflight-write-cnt are zero. And for a zero inflight-write-cnt, the entry
must simply be absent from the hash.
I understand the sentiment, but I don’t think that reasoning applies
here. The in-memory refcount table (IMRT) used for qemu-img check will
not be updated on discards anyway. If there were any I/O to the image
concurrently to check, it would completely ignore the IMRT, so the check
feels hypocritical.
+ {
contiguous_free_clusters++;
if (first_gap) {
/* If this is the first free cluster found, update
diff --git a/block/qcow2.c b/block/qcow2.c
index d9f49a52e7..6ee94421e0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
[...]
@@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
}
     ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset,
                                         out_len, true);
- qemu_co_mutex_unlock(&s->lock);
if (ret < 0) {
+ qemu_co_mutex_unlock(&s->lock);
goto fail;
}
+ qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
It was my impression that this function could be called either with or
without a lock, that it wouldn't really matter. Well, at least that's what
I figured out for lack of a comment regarding the contract of how it is to
be invoked.
I'm sorry :(
Oh, geez, now I feel bad, too. Thanks a lot for your explanations, it
all makes much more sense to me now! :)
Max