We have a bug in qcow2: assume we've started data write into host cluster A. s->lock is unlocked. During the write the refcount of cluster A may become zero, cluster may be reallocated for other needs, and our in-flight write become a use-after-free.
To fix the bug let's do the following. Or better, let's start from what we have now: Now we consider cluster "free", when its refcount is 0. When cluster becomes "free" we also update s->free_cluster_index and optionally discard it on bs->file level. These two operations are done in same s->lock critical section where refcount becomes 0 (and this all is in update_refcount()). Calling update_refcount() is wrong. It's ofcourse done sometimes, as not everything is moved to coroutine for now.. Still, it's out of our topic. Later, we can reallocate "free" cluster in alloc_clusters_noref() and qcow2_alloc_clusters_at(), where is_cluster_free() is used. OK, to correctly handle in-flight writes, let's modify a concept of "free" cluster, so that cluster is "free" when its refcount is 0 and there no inflight writes. So, we discard the cluster at bs->file level, update s->free_cluster_index and allow reallocation only when both refcount and inflight-write-cnt becomes both zero. It may happen either in update_refcount() or in update_inflight_write_cnt(). Consider update_refcount() first. Here we update refcounts metadata we must be under s->lock. So, if we catch a situation when refcount becomes 0 but there are in-flight writes we change a behavior so that we don't update s->free_cluster_index, and do not discard the cluster. This will be done by update_inflight_write_cnt() later. So, we save needed information into Qcow2InFlightWriteCounter structure, so that further update_inflight_write_cnt() do not need to load the refcount. Now what about update_inflight_write_cnt()? Here things become more interesting, because we can avoid s->lock. If inflight-write-cnt + refcount is still positive, we don't have any yield point, just update inflight write counter and we are done. If inflight-write-cnt becomes 0 and refcount is already 0, we just need to keep inflight-write-counter=1 during the discard operation (if it is needed) and then drop the counter and update s->free_cluster_index. Note, that at this point we only implement the whole infrastructure for in-flight writes handling. Nobody now call qcow2_inflight_writes_inc() or qcow2_inflight_writes_dec(). It's a deal of the following patch. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> --- block/qcow2-refcount.c | 70 ++++++++++++++++++++++++++++++++++++++---- 1 file changed, 64 insertions(+), 6 deletions(-) diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index eedc83ea4a..9a0d6570a5 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -805,6 +805,15 @@ typedef struct Qcow2InFlightWriteCounter { * 0 the entry is removed from s->inflight_writes_counters. */ uint64_t inflight_writes_cnt; + + /* For convenience, keep cluster_index here */ + int64_t cluster_index; + + /* Cluster refcount is known to be zero */ + bool refcount_zero; + + /* Cluster refcount was made zero with this discard-type */ + enum qcow2_discard_type last_discard_type; } Qcow2InFlightWriteCounter; /* Find Qcow2InFlightWriteCounter corresponding to @cluster_index */ @@ -845,6 +854,7 @@ update_inflight_write_cnt(BlockDriverState *bs, int64_t offset, int64_t length, if (!decrease) { if (!infl) { infl = g_new0(Qcow2InFlightWriteCounter, 1); + infl->cluster_index = cluster_index; g_hash_table_insert(s->inflight_writes_counters, g_memdup(&cluster_index, sizeof(cluster_index)), infl); @@ -857,10 +867,40 @@ update_inflight_write_cnt(BlockDriverState *bs, int64_t offset, int64_t length, assert(infl); assert(infl->inflight_writes_cnt >= 1); - infl->inflight_writes_cnt--; + if (infl->inflight_writes_cnt > 1) { + infl->inflight_writes_cnt--; + continue; + } - if (infl->inflight_writes_cnt == 0) { + if (!infl->refcount_zero) { + /* + * All in-flight writes are finished, but cluster is still in use, + * nothing to do now. + */ g_hash_table_remove(s->inflight_writes_counters, &cluster_index); + continue; + } + + /* + * OK. At this point there no more refcounts and no more in-flight + * writes. Cluster is almost free. But we probably want to do one more + * in-flight operation: discard. So we keep inflight_writes_cnt = 1 + * during the discard. + */ + if (s->discard_passthrough[infl->last_discard_type]) { + int64_t cluster_offset = cluster_index << s->cluster_bits; + if (s->cache_discards) { + update_refcount_discard(bs, cluster_offset, s->cluster_size); + } else { + /* Discard is optional, ignore the return value */ + bdrv_pdiscard(bs->file, cluster_offset, s->cluster_size); + } + } + + g_hash_table_remove(s->inflight_writes_counters, &cluster_index); + + if (cluster_index < s->free_cluster_index) { + s->free_cluster_index = cluster_index; } } } @@ -970,6 +1010,7 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, s->set_refcount(refcount_block, block_index, refcount); if (refcount == 0) { + Qcow2InFlightWriteCounter *infl; void *table; table = qcow2_cache_is_table_offset(s->refcount_block_cache, @@ -986,8 +1027,24 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, qcow2_cache_discard(s->l2_table_cache, table); } - if (s->discard_passthrough[type]) { - update_refcount_discard(bs, cluster_offset, s->cluster_size); + infl = find_infl_wr(s, cluster_index); + if (infl) { + /* + * Refcount becomes zero, but there are still in-flight writes + * to the cluster. update_inflight_write_cnt() will take care + * of discarding the cluster and updating s->free_cluster_index. + */ + infl->refcount_zero = true; + infl->last_discard_type = type; + } else { + /* Refcount is zero and no in-fligth writes. Cluster is free. */ + if (cluster_index < s->free_cluster_index) { + s->free_cluster_index = cluster_index; + } + if (s->discard_passthrough[type]) { + update_refcount_discard(bs, cluster_offset, + s->cluster_size); + } } } } @@ -1049,7 +1106,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs, /* - * Cluster is free when its refcount is 0 + * Cluster is free when its refcount is 0 and there is no in-flight writes * * Return < 0 if failed to get refcount * 0 if cluster is not free @@ -1057,6 +1114,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs, */ static int is_cluster_free(BlockDriverState *bs, int64_t cluster_index) { + BDRVQcow2State *s = bs->opaque; int ret; uint64_t refcount; @@ -1065,7 +1123,7 @@ static int is_cluster_free(BlockDriverState *bs, int64_t cluster_index) return ret; } - return refcount == 0; + return refcount == 0 && !find_infl_wr(s, cluster_index); } /* return < 0 if error */ -- 2.29.2