On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
11.03.2021 22:58, Max Reitz wrote:
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case the data write may
pollute another cluster (recently allocated) or even metadata.

I was about to ask whether we couldn’t somehow let discard requests check overlapping I/O with the tracked_request infrastructure from block/io.c (or perhaps even just let them be serializing).

But I suppose there may be other reasons for clusters being discarded other than explicit discard requests, so we have to have something in qcow2...

For example, a host cluster which contains exactly one compressed cluster may be discarded when the corresponding guest cluster is rewritten by a non-compressed write.

Also (not a very good argument for fixing an existing bug, but still), serializing will not help with the compressed cache, because with the compressed cache we'll have data writes (cache flushes) that are not related to the current in-flight request. That leads us to something like serializing operations on the .file child of the qcow2 node. But we can't do that, as "discarding" a host cluster is an operation on metadata, which may involve no operation on the underlying .file child at all.

Yes, makes sense. I think it is a good argument, because the point is that qcow2 host cluster discards are not necessarily related to guest discards. The compressed cache is just one new example for that.

To fix the issue, let's track in-flight writes to host clusters in a
hash table and consider the new counter when discarding and reusing
host clusters.

This didn’t really explain anything that would help me understand what’s going on in this patch.  The patch itself doesn’t really help me with comments either.  It’s possible that I’m just daft, but honestly I have a hard time really understanding what’s supposed to be going on.

Sorry for this. I trust your experience in understanding my patches, so that shows that something is wrong with the patch itself :)


Coming back from jumping all over this patch, I also wonder why this isn’t split in multiple patches, where the first introduces the infrastructure and explains exactly what’s going on, and the next patches make use of it so I could understand what each check etc. is supposed to represent and do.

OK



To serve as an example, after reading through the patch, I still have no idea how it prevents that discard problem you’re trying to fix. Maybe I’m lazy, but I would have liked exactly that to be explained by this commit message.  Actually, I don’t even understand the problem just from this commit message alone, but I could resort to HEAD^ and some thinking.  Not sure if that’s ideal, though.

The problem:

1. Start a write to qcow2. Assume the guest cluster is G and the corresponding host cluster is H.

2. The write request comes to the point of writing data to .file. The write to .file is started and the qcow2 mutex is unlocked.

3. At this time the refcount of H becomes 0. For example, it may be due to a discard operation on the qcow2 node, rewriting compressed data by a normal write, or some operation with snapshots.

4. Next, some operation occurs that leads to allocation of H for other needs: it may be another write-to-qcow2-node operation, allocation of an L2 table, or some other data or metadata cluster allocation.

5. So, at this point H is used for something else. Assume an L2 table is written into H.

6. And now our write from [2] finishes and pollutes the L2 table in H. That's the bug.

Understood.

The problem becomes more likely with the compressed write cache, because any flush operation may trigger deferred data writes, so the problem can be reproduced like this:

1. Write to guest cluster G2, which triggers flushing of data to host cluster H, which is unrelated to G2.

2. Discard cluster G in parallel.

[...]

So, the problem is simply triggered by a parallel write and discard to different guest clusters. That's why I should do something about this old (and most probably never triggered) problem in the qcow2 driver prior to implementing the compressed cache.

OK!  That makes sense.

Hmm, probably we can avoid all these difficulties by implementing the compressed cache as a filter driver between qcow2 and its .file node.

As I see it, you’ve now opened the can of worms, and I’m not sure we can just close it and say we haven’t seen any problem. :)

So, regardless of the compressed cache, I think this should be fixed.

I tried to do it and encountered some problems, but I don't remember what exactly. Probably I should return to this attempt. Some obvious problems:

1. Metadata writes will go through the cache, and we should handle them somehow so as not to break the sequence of unaligned chunks. The simplest way is to add a BDRV_REQ_COLORED flag and do compressed data writes to .file with this flag. The cache filter would then track this flag for sequential caching (see the sketch after this list).

2. We can't enable caching by default this way, so the user will have to explicitly add a filter. And actually, it's a real problem of the qcow2 driver that in O_DIRECT mode it writes compressed data in small unaligned chunks. So it would be good to fix it inside qcow2.
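To make point 1 concrete, a minimal sketch of what the filter's write handler could look like. BDRV_REQ_COLORED and compress_cache_append() are hypothetical names; nothing like this exists yet:

/*
 * Sketch only: BDRV_REQ_COLORED is a hypothetical request flag set by
 * qcow2 on compressed data writes; compress_cache_append() is a
 * hypothetical helper that accumulates the small unaligned compressed
 * chunks sequentially in the cache.
 */
static coroutine_fn int
compress_cache_co_pwritev(BlockDriverState *bs, uint64_t offset,
                          uint64_t bytes, QEMUIOVector *qiov, int flags)
{
    if (flags & BDRV_REQ_COLORED) {
        /* Compressed data write tagged by qcow2: goes through the cache */
        return compress_cache_append(bs, offset, bytes, qiov);
    }

    /* Metadata and all other writes bypass the cache unchanged */
    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
}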

Agreed.

With the filter we avoid the problem of parallel reuse of a host cluster, as all writes go through the cache and flushes will be serialized. So the problem in the qcow2 driver is not enlarged by the cache, and we could leave it unfixed.

Not sure I agree. ;)

So the commit message says that “host cluster can be discarded and reused during data write”, which to me just seems like the exact behavior we want.  The second sentence describes a problem, but doesn’t say how reclaiming discarded clusters leads there.

I suppose what leads to the problem is that there first needs to be a background write even before the discard that is settled only after the re-claiming write (which isn’t mentioned).

As far as I understand, this patch addresses that problem by preventing the discarded clusters from being allocated while said background write is not settled yet.  This is done by keeping track of all host clusters that are currently the target of some write operation, and preventing them from being allocated.  OK, makes sense. (This latter part does roughly correspond to the second paragraph of this commit message, but I couldn’t make sense of it without understanding why we’d do it.  What’s notably missing from my explanation is the part that you hinted at with “when discarding”, but my problem is that I just don’t understand what that means at all.  Perhaps it has to do with how I don’t understand the change to update_refcount(), and consequently the whole “slow path” in update_inflight_write_cnt().)


Let me explain:

What is the problem? A user uses a dynamic object, and the object disappears during usage. A use-after-free follows. This means that we probably want reference counting for the object, so that the user keeps a reference during usage and the object is freed when the refcount becomes 0. But for qcow2 host clusters we already have reference counting!

Makes sense.

Still, we don't want to simply reuse qcow2 refcounts as normal reference counters for two reasons:

1. qcow2 refcounts are part of the qcow2 metadata and strictly defined. Qcow2 refcounts show only the usage of a host cluster by other objects in the qcow2 metadata. But what we want to track is usage by the process of writing data.

Indeed.

2. We don't want to do additional reads/writes of refcounts from/to the physical drive.

Right.

So, what comes to my mind is an additional "dynamic" in-RAM reference counter for a host cluster, so that the actual full reference count of a host cluster is qcow2-refcount + inflight-write-cnt.
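Just to make the invariant concrete, a sketch (not patch code) using the patch's find_infl_wr() helper; full_refcount() itself is a hypothetical function:

/* Sketch: a host cluster is really free only when this returns 0. */
static int64_t full_refcount(BlockDriverState *bs, int64_t cluster_index)
{
    BDRVQcow2State *s = bs->opaque;
    uint64_t refcount;
    Qcow2InFlightRefcount *infl;
    int ret;

    ret = qcow2_get_refcount(bs, cluster_index, &refcount);
    if (ret < 0) {
        return ret;
    }

    /* Absence from the hash table means inflight-write-cnt == 0 */
    infl = find_infl_wr(s, cluster_index);

    return refcount + (infl ? infl->inflight_writes_cnt : 0);
}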

OK, yes, that makes a lot of sense.

So, what we need for this:

1. Maintain the new inflight-write-cnt: increase it before a data write and decrease it after (see the sketch after this list).

2. Update the core part of reference counting: freeing the object.

Makes sense, but I don’t see how your change to update_refcount() does that, though. But OTOH I do see how you prevent use-after-frees (by preventing allocations of clusters with refcount == 0 but with active writes on them), which works just as well.

And if freeing the object (clusters) were postponed until all writes are settled, I don’t think we’d need the checks in the cluster allocation functions.

Currently the object is "freed" when qcow2-refcount becomes 0; that's the code in update_refcount(). But we want to postpone freeing until the full reference count becomes zero, i.e. both qcow2-refcount and inflight-write-cnt are zero. So, if qcow2-refcount becomes zero but inflight-write-cnt is nonzero, we postpone discarding the cluster, storing the needed information in Qcow2InFlightRefcount. And when inflight-write-cnt becomes zero (and we see this information) we call update_refcount(addend=0) to trigger the actual discarding of the cluster.

3. We should consider the full reference count when searching for free host clusters.
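In code, the intended usage pattern for a data write would look roughly like this (a simplified sketch, not the actual qcow2 write path):

    qemu_co_mutex_lock(&s->lock);
    /* ... map the guest offset to host_offset, handle COW, etc. ... */

    /* Point 1: take a dynamic reference before dropping the mutex */
    qcow2_inflight_writes_inc(bs, host_offset, bytes);
    qemu_co_mutex_unlock(&s->lock);

    ret = bdrv_co_pwritev(s->data_file, host_offset, bytes, qiov, 0);

    /*
     * Point 2: dropping the last reference may trigger the postponed
     * freeing of the cluster if its qcow2-refcount became 0 meanwhile.
     */
    qcow2_inflight_writes_dec(bs, host_offset, bytes);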



Side note: Intuitively, this hash table feels like an unnecessarily big hammer to me, but I can’t think of a better solution either, so. (I mainly don’t like how every write request will result in one allocation and hash table entry per cluster it touches, even when no data cluster is ever discarded.)

We can have a list of inflight-data-writes instead.

Yes.  I’m not sure which would actually be better in the end.

I feel such a list would be better for the normal case (just writes, no allocations or discards), but I don’t know how much worse they’d be when something needs to be looked up (i.e. on allocations or discards).

I'm afraid we can't reuse the tracked requests of the underlying .file node, because we have to add the in-flight write to the list before releasing the qcow2 mutex.

Yes.

Enable the qcow2-discard-during-rewrite test, as the bug it exercises is fixed.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
---
  block/qcow2.h                                 |   9 ++
  block/qcow2-refcount.c                        | 149 +++++++++++++++++-
  block/qcow2.c                                 |  26 ++-
  .../tests/qcow2-discard-during-rewrite        |   2 +-
  4 files changed, 181 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..e18d8dfbae 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
       * is to convert the image with the desired compression type set.
       */
      Qcow2CompressionType compression_type;
+
+    GHashTable *inflight_writes_counters;
  } BDRVQcow2State;
  typedef struct Qcow2COWRegion {
@@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
  int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
  int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                              int64_t length);
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                              int64_t length);
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length);
+
  /* qcow2-cluster.c functions */
  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                          bool exact_size);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..464d133368 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -799,6 +799,140 @@ found:
      }
  }
+/*
+ * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
+ * hasm map. And it's keys are cluster indices.

*hash, *its

I’d rather document this for s->inflight_writes_counters, though (like “/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)

+ */
+typedef struct Qcow2InFlightRefcount {
+    /*
+     * Number of in-flight writes to the cluster, always > 0, as when becomes

s/when becomes/when it becomes/

+     * 0 the entry is removed from s->inflight_writes_counters.
+     */
+    uint64_t inflight_writes_cnt;
+
+    /* Cluster refcount is known to be zero */
+    bool refcount_zero;
+
+    /* Cluster refcount was made zero with this discard-type */
+    enum qcow2_discard_type type;
+} Qcow2InFlightRefcount;
+
+static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
+                                           int64_t cluster_index)
+{
+    Qcow2InFlightRefcount *infl;
+
+    if (!s->inflight_writes_counters) {
+        return NULL;
+    }
+
+    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
+
+    if (infl) {
+        assert(infl->inflight_writes_cnt > 0);
+    }
+
+    return infl;
+}
+
+/*
+ * Returns true if there are any in-flight writes to the cluster blocking
+ * its reallocation.
+ */
+static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
+{
+    return !!find_infl_wr(s, cluster_index);

I wonder if g_hash_table_contains() would be quicker. *shrug*

+}
+
+static int update_inflight_write_cnt(BlockDriverState *bs,
+                                     int64_t offset, int64_t length,
+                                     bool decrease, bool locked)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t start, last, cluster_offset;
+
+    if (locked) {
+        qemu_co_mutex_assert_locked(&s->lock);
+    }
+
+    start = start_of_cluster(s, offset);
+    last = start_of_cluster(s, offset + length - 1);
+    for (cluster_offset = start; cluster_offset <= last;
+         cluster_offset += s->cluster_size)
+    {
+        int64_t cluster_index = cluster_offset >> s->cluster_bits;
+        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+        if (!infl) {
+            infl = g_new0(Qcow2InFlightRefcount, 1);
+            g_hash_table_insert(s->inflight_writes_counters,
+                                g_memdup(&cluster_index, sizeof(cluster_index)),
+                                infl);

I suppose we could save one allocation by putting the cluster index into Qcow2InFlightRefcount and then giving the key as &infl->cluster_index. Not sure if that’s too nasty, though.

Allocations are a lot faster than I/O. So, I don't think we'll see the difference in normal cases.

Sure.

+        }
+
+        if (decrease) {
+            assert(infl->inflight_writes_cnt >= 1);
+            infl->inflight_writes_cnt--;
+        } else {
+            infl->inflight_writes_cnt++;
+        }
+
+        if (infl->inflight_writes_cnt == 0) {
+            bool refcount_zero = infl->refcount_zero;
+            enum qcow2_discard_type type = infl->type;
+
+            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+            if (refcount_zero) {
+                /*
+                 * Slow path. We must reset normal refcount to actually release

I don’t understand what “slow path” means in this context.  (Nor do I really understand the rest, but more on that below.)


In most cases inc/dec of inflight-writes-cnt is fast: it only updates an in-memory structure. But when the full reference count of the cluster becomes zero, we call the real update_refcount() to trigger discarding of the cluster. This may be slower. Probably the needed code should be moved from update_refcount() to a separate function like host_cluster_free(), to avoid cheating with addend=0; roughly as sketched below.
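Something like this, say (a rough sketch of the hypothetical host_cluster_free(), i.e. the freeing branch factored out of update_refcount(); not part of this patch):

static int host_cluster_free(BlockDriverState *bs, int64_t cluster_index,
                             enum qcow2_discard_type type)
{
    BDRVQcow2State *s = bs->opaque;
    int64_t offset = cluster_index << s->cluster_bits;
    void *table;

    /* Drop the cluster from the metadata caches if it held a table */
    table = qcow2_cache_is_table_offset(s->refcount_block_cache, offset);
    if (table != NULL) {
        qcow2_cache_discard(s->refcount_block_cache, table);
    }

    table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
    if (table != NULL) {
        qcow2_cache_discard(s->l2_table_cache, table);
    }

    /* Finally pass the discard down to the image file if configured */
    if (s->discard_passthrough[type]) {
        update_refcount_discard(bs, offset, s->cluster_size);
    }

    return 0;
}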

OK.

+                 * the cluster.
+                 */
+                int ret;
+
+                if (!locked) {
+                    qemu_co_mutex_lock(&s->lock);
+                }
+                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
+                                                    true, type);

I don’t understand this, but then again I’m writing this very early in my review, and at this point I don’t understand anything, really. (The comment doesn’t explain to me what’s happening here.)

Revisiting later, refcount_zero is set by update_refcount() when the refcount drops to zero, so invoking this function here will finalize the change.  I don’t quite understand what’s going on in update_refcount(), though.

(And even after finding out why, I don’t understand why the comment doesn’t explain why we must invoke qcow2_update_cluster_refcount() with addend=0.)

Sorry for the mess. Here we want to trigger the code in update_refcount() that frees the host cluster, i.e. the code that runs when the refcount becomes zero.


+                if (!locked) {
+                    qemu_co_mutex_unlock(&s->lock);
+                }
+
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+        }
+
+    }
+
+    return 0;
+}
+
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                              int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, false, false);
+}
+
+/*
+ * Called with s->lock not locked by caller. Will take s->lock only if need to
+ * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
+ */

Why doesn’t qcow2_inflight_writes_inc() get the same comment?

...Oh, because @locked doesn’t have an influence there.  Why isn’t it worth a comment that *_inc() works both with a locked and an unlocked mutex?

OK, it's worth one.


+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                              int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, true, false);
+}
+
+/* Called with s->lock locked. */
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, true, true);
+}
+
  /* XXX: cache several refcount block clusters ? */
  /* @addend is the absolute value of the addend; if @decrease is set, @addend    * will be subtracted from the current refcount, otherwise it will be added */ @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
          if (refcount == 0) {
              void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }

I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?

Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when the full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster; it will be done later from the "slow path" in update_inflight_write_cnt().

But the code under “if (refcount == 0)” doesn’t free anything, does it? All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file. I don’t see how that protocol-level discard has anything to do with our problem, though.

As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0 (including updating s->free_cluster_index if the refcount is 0).

(I also wondered why the continue wasn’t put before the s->set_refcount() call, but I suppose the same effect is had by the has_infl_wr() check in alloc_clusters_noref() and qcow2_alloc_clusters_at().)

The only change to the update_refcount() logic is the postponing of freeing the cluster. So, handling of qcow2-refcount is kept as is: qcow2-refcount becomes zero, and s->set_refcount() is called. inflight-write-cnt is a separate thing, not a cache for qcow2-refcount.


              table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                  offset);
@@ -983,7 +1124,7 @@ retry:
          if (ret < 0) {
              return ret;
-        } else if (refcount != 0) {
+        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
              goto retry;
          }
      }
@@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
              ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
              if (ret < 0) {
                  return ret;
-            } else if (refcount != 0) {
+            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
                  break;
              }
          }
@@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
           contiguous_free_clusters < cluster_count;
           cluster++)
      {
-        if (!s->get_refcount(*refcount_table, cluster)) {
+        if (!s->get_refcount(*refcount_table, cluster) &&
+            !has_infl_wr(s, cluster))

Is this really necessary?

Yes. Everywhere we want a free cluster, we should check that the full reference count is zero, which is qcow2-refcount + inflight-write-cnt. That is equivalent to checking that both qcow2-refcount and inflight-write-cnt are zero. And for zero inflight-write-cnt, the cluster must simply be absent from the hash table.

I understand the sentiment, but I don’t think that reasoning applies here. The in-memory refcount table (IMRT) used for qemu-img check will not be updated on discards anyway. If there were any I/O to the image concurrent with the check, it would completely ignore the IMRT, so the check feels hypocritical.

+        {
              contiguous_free_clusters++;
              if (first_gap) {
                  /* If this is the first free cluster found, update
diff --git a/block/qcow2.c b/block/qcow2.c
index d9f49a52e7..6ee94421e0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c

[...]

@@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
      }
      ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
-    qemu_co_mutex_unlock(&s->lock);
      if (ret < 0) {
+        qemu_co_mutex_unlock(&s->lock);
          goto fail;
      }
+    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);

It was my impression that this function could be called either with or without a lock, and that it wouldn’t really matter.  Well, at least that’s what I figured out for lack of a comment regarding the contract of how it is to be invoked.


I'm sorry :(

Oh, geez, now I feel bad, too. Thanks a lot for your explanations, it all makes much more sense to me now! :)

Max

