On 04/29/2014 09:07 AM, Max Reitz wrote:
> bdrv_make_empty() is currently only called if the current image
> represents an external snapshot that has been committed to its base
> image; it is therefore unlikely to have internal snapshots. In this
> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
> table and dropping all data clusters at once by recreating the refcount
> structure accordingly instead of normally discarding all clusters.
> 
> If there are snapshots, fall back to the simple implementation (discard
> all clusters).
> 
> Signed-off-by: Max Reitz <mre...@redhat.com>
> ---
>  block/qcow2.c | 385 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 370 insertions(+), 15 deletions(-)

Wow!  This was a fun review, and took me quite some time.  I also had to
re-read docs/specs/qcow2.txt several times to make sure I understood all
your numbers, but everything in here made sense.

The pdf helped; I didn't spot any math errors there (though I did have
to figure out the mapping from the one-letter variables in the pdf to
the multi-char variables in your code, such as 'x' matching 'overhead').

Reviewed-by: Eric Blake <ebl...@redhat.com>

More comments below, describing some of my thoughts while reviewing.

This patch undoes anything accomplished earlier by preallocation.  I
suppose that if a user wants to empty an image but still guarantee that
it is preallocated large enough, they can do it in two steps; or they
can do the 'commit -d' to skip emptying the image, and then just
recreate a new qcow2 image with full preallocation and bypass the work
of this patch.  I agree that it doesn't make any sense to complicate
this patch even more to worry about restoring a level of preallocation.

> +/*
> + * Creates a reftable pointing to refblocks following right afterwards and an
> + * empty L1 table at the given @offset. @refblocks is the number of refblocks
> + * to create.
> + *
> + * This combination of structures (reftable+refblocks+L1) is here called a
> + * "blob".

Indeed, the bare minimum qcow2 image is one consisting of a reftable and
refblocks for all host clusters, and an L1 table large enough for all
guest clusters; making the three tables contiguous and acting on a blob
makes sense.

> + */
> +static int create_refcount_l1(BlockDriverState *bs, uint64_t offset,
> +                              uint64_t refblocks)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    uint64_t *reftable = NULL;
> +    uint16_t *refblock = NULL;
> +    uint64_t reftable_clusters;
> +    uint64_t rbi;
> +    uint64_t blob_start, blob_end;
> +    uint64_t l2_tables, l1_clusters, l1_size2;
> +    uint8_t l1_size_and_offset[12];
> +    uint64_t rt_offset;

Good; most math is in uint64_t, so I don't have to think too hard about
overflows.

> +    int ret, i;
> +
> +    reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);

Each reftable entry is 8 bytes, so this is correct for how many entries
are packed in a cluster.

> +    l2_tables = DIV_ROUND_UP(bs->total_sectors / s->cluster_sectors,
> +                             s->cluster_size / 8);

Each L2 entry is 8 bytes, so this correctly computes how many L2 tables
a fully populated image would need.  It took me a while to figure out
that you have to know that number in order to size the L1 table, even
though you are NOT allocating any L2 clusters.

> +    l1_clusters = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);

Each L1 entry is 8 bytes, so this is also correct.

> +    l1_size2 = l1_clusters << s->cluster_bits;
> +
> +    blob_start = offset;
> +    blob_end = offset + ((reftable_clusters + refblocks + l1_clusters) <<
> +                s->cluster_bits);
> +
> +    ret = qcow2_pre_write_overlap_check(bs, 0, blob_start,
> +                                        blob_end - blob_start);

The caller allocated all three parts of the blob but has not yet put
them to use, so it makes sense that we should have full access as the
first user of each cluster in the range.

> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /* Create the reftable with entries pointing consecutively to the
> +     * following clusters */
> +    reftable = g_malloc0_n(reftable_clusters, s->cluster_size);

The zero initialization is overkill for all but the last cluster, if the
guest image size is not an even multiple of cluster size.  But it
doesn't hurt to be safe.  You COULD have optimized further by leaving 0
entries for refcount blocks that would all contain 0 entries, but that
would make the code more complex; furthermore, you call this blob
initialization function twice in the normal case, but only the first
call will generate any refcount blocks containing all zeros.  In the
second call, when the refcount block is now hoisted to the front of the
file, all refcount blocks will be non-zero.

> +
> +    for (rbi = 0; rbi < refblocks; rbi++) {
> +        reftable[rbi] = cpu_to_be64(offset + ((reftable_clusters + rbi) <<
> +                        s->cluster_bits));
> +    }

Straightforward mapping.  All but the tail will now be mapped into
consecutive refblocks.

> +
> +    ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS, (uint8_t *)reftable,
> +                     reftable_clusters * s->cluster_sectors);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    offset += reftable_clusters << s->cluster_bits;
> +
> +    /* Keep the reftable, as we will need it for the BDRVQcowState anyway */
> +
> +    /* Now, create all the refblocks */
> +    refblock = g_malloc(s->cluster_size);
> +
> +    for (rbi = 0; rbi < refblocks; rbi++) {
> +        for (i = 0; i < s->cluster_size / 2; i++) {
> +            uint64_t cluster_offset = ((rbi << (s->cluster_bits - 1)) + i)
> +                                      << s->cluster_bits;
> +
> +            /* Only 0 and 1 are possible as refcounts here */
> +            refblock[i] = cpu_to_be16(!cluster_offset ||
> +                                      (cluster_offset >= blob_start &&
> +                                       cluster_offset <  blob_end));

Caller passed in refblocks, but you are correct that each refblock entry
is 2 bytes long.

The image header and all clusters within the blob will have a refcount
of 1; all other clusters will have a refcount of 0.  Of course, at this
point, this is just pre-populating the table; it is not actually in use yet.

> +        }
> +
> +        ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS,
> +                         (uint8_t *)refblock, s->cluster_sectors);
> +        if (ret < 0) {
> +            goto out;
> +        }
> +
> +        offset += s->cluster_size;
> +    }
> +
> +    g_free(refblock);
> +    refblock = NULL;
> +
> +    /* The L1 table is very simple */
> +    ret = bdrv_write_zeroes(bs->file, offset >> BDRV_SECTOR_BITS,
> +                            l1_clusters * s->cluster_sectors, 0);
> +    if (ret < 0) {
> +        goto out;
> +    }

Sets up the new L1 table to point to all L2 tables being unallocated.
Any reason you aren't passing BDRV_REQ_MAY_UNMAP, to make the allocated
size of the image even smaller when supported?

> +
> +    /* Now make sure all changes are stable and clear all caches */
> +    bdrv_flush(bs);
> +
> +    /* This is probably not really necessary, but it cannot hurt, either */
> +    ret = qcow2_mark_clean(bs);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    ret = qcow2_cache_empty(bs, s->l2_table_cache);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    /* Modify the image header to point to the new blank L1 table. This will
> +     * leak all currently existing data clusters, which is fine. */
> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
> +
> +    assert(l1_size2 / 8 <= UINT32_MAX);
> +    cpu_to_be32w((uint32_t *)&l1_size_and_offset[0], l1_size2 / 8);
> +    cpu_to_be64w((uint64_t *)&l1_size_and_offset[4], offset);
> +    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
> +                           l1_size_and_offset, sizeof(l1_size_and_offset));
> +    if (ret < 0) {
> +        goto out;
> +    }

Pivots the active L1 table; so now the guest sees the entire backing
file due to all L2 entries being unallocated; the refcount table is
still unpivoted.  This failure point is where the blob is leaked, but
the old image is still consistent.

> +
> +    /* Adapt the in-memory L1 table accordingly */
> +    s->l1_table_offset = offset;
> +    s->l1_size         = l1_size2 / 8;
> +
> +    s->l1_table = g_realloc(s->l1_table, l1_size2);
> +    memset(s->l1_table, 0, l1_size2);
> +
> +    /* Modify the image header to point to the refcount table. This will fix all
> +     * leaks (unless an error occurs). */
> +    BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
> +
> +    rt_offset = cpu_to_be64(blob_start);
> +    ret = bdrv_pwrite_sync(bs->file,
> +                           offsetof(QCowHeader, refcount_table_offset),
> +                           &rt_offset, sizeof(rt_offset));
> +    if (ret < 0) {
> +        goto out;
> +    }

Pivots the refcount table; the old table is now gone and the new table
is now (potentially) self-consistent (we still require a truncate of any
clusters already in the image but beyond the size of this new refcount
table before the refcount table once again correctly describes the
entire file).

> +
> +    /* The image is now clean. The only thing left is to adapt the in-memory
> +     * refcount table to match it. */
> +    s->refcount_table_offset = blob_start;
> +    s->refcount_table_size   = reftable_clusters << (s->cluster_bits - 3);
> +
> +    for (rbi = 0; rbi < refblocks; rbi++) {
> +        be64_to_cpus(&reftable[rbi]);
> +    }
> +
> +    g_free(s->refcount_table);
> +    s->refcount_table = reftable;
> +    reftable = NULL;
> +
> +out:
> +    g_free(refblock);
> +    g_free(reftable);
> +    return ret;
> +}

Looks correct.

> +
> +/*
> + * Calculates the number of clusters required for an L1 table for an image with
> + * the given parameters plus the reftable and the refblocks required to cover
> + * themselves, the L1 table and a given number of clusters @overhead.
> + *
> + * @ts:       Total number of guest sectors the image provides
> + * @cb:       1 << @cb is the cluster size in bytes
> + * @spcb:     1 << @spcb is the number of sectors per cluster
> + * @overhead: Number of clusters which shall additionally be covered by the
> + *            refcount structures
> + */
> +static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
> +                                  uint64_t overhead)
> +{
> +    uint64_t cs  = UINT64_C(1) << cb;
> +    uint64_t spc = UINT64_C(1) << spcb;
> +
> +    /* The following statement is a solution for this system of equations:
> +     *
> +     * Let req_cls be the number of clusters required for the reftable,
> +     * all refblocks and the L1 table.
> +     *
> +     * rtc be the clusters required for the reftable, rbc the clusters for all
> +     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
> +     * table and l2c the clusters for all L2 tables (i.e., the number of L2
> +     * tables).
> +     *
> +     * cs be the cluster size (in bytes), ts the total number of sectors, spc
> +     * the number of sectors per cluster and tc the total number of clusters.
> +     *
> +     * overhead is a number of clusters which should additionally be covered by
> +     * the refcount structures (i.e. all clusters before this blob of req_cls
> +     * clusters).
> +     *
> +     * req_cls >= rtc + rbc + l1c
> +     *   -- should be obvious

The blob must be large enough to hold all three tables.

> +     * rbc      = ceil((overhead + req_cls) / (cs / 2))
> +     *   -- as each refblock holds cs/2 entries

The refblock table must be large enough to describe all host clusters
(the overhead that occurs before the blob, as well as the blob itself),
assuming the table is fully utilized, and rounded up to cluster size.
We could have a smaller refblock table by omitting clusters with all-0
entries, but it's not worth the complication.

> +     * rtc      = ceil(rbc                  / (cs / 8))
> +     *   -- as each reftable cluster holds cs/8 entries

The reftable must be large enough to point to each refblock, and rounded
up to cluster size.

> +     * tc       = ts / spc
> +     *   -- ts should be a multiple of spc

Hmm.  What guarantees that ts is a multiple of spc?  'qemu-img create -f
qcow2 blah 512' shows an image with a virtual size of 512 but a cluster
size of 64k; isn't that a case where ts==1 but spc==128 (spcb==7), so
tc==0 even though the l2 table would need to describe 1 cluster?  But
your math was done on ceil(ts), not on tc, so even though this comment
is fishy, I don't think it invalidates your math.  I think all that you
need is to write tc = ceil(ts / spc)

> +     * l2c      = ceil(tc  / (cs / 8))
> +     *   -- as each L2 table holds cs/8 entries

Makes sense, for correct tc

> +     * l1c      = ceil(l2c / (cs / 8))
> +     *   -- as each L1 table cluster holds cs/8 entries

Makes sense, for correct l2c

> +     *
> +     * The following equation yields a result which satisfies the constraint.
> +     * The result may be too high, but is never too low.
> +     *
> +     * The original calculation (without bitshifting) was:
> +     *
> +     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
> +     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
> +     *              4 * (ts + spc * cs / 8 - 1) / spc,
> +     *              cs * cs / 16 - cs / 8 - 1)

Yes, this matches your pdf followup.

> +     *
> +     */
> +
> +    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
> +                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
> +                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
> +                        (1 << (2 * cb - 4)) - cs / 8 - 1);

And yes, this matches the comment written with division.  It could
definitely be made tighter for the case of overhead large enough where
we could omit an entire cluster of all-0 refblocks by writing 0 in the
reftable; but it's not worth the complication.

> +}
> +
>  static int qcow2_make_empty(BlockDriverState *bs)
>  {
> +    BDRVQcowState *s = bs->opaque;
>      int ret = 0;
> -    uint64_t start_sector;
> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>  
> -    for (start_sector = 0; start_sector < bs->total_sectors;
> -         start_sector += sector_step)
> -    {
> -        /* As this function is generally used after committing an external
> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> -         * default action for this kind of discard is to pass the discard,
> -         * which will ideally result in an actually smaller image file, as
> -         * is probably desired. */
> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> -                                     MIN(sector_step,
> -                                         bs->total_sectors - start_sector),
> -                                     QCOW2_DISCARD_SNAPSHOT, true);
> +    if (s->snapshots) {
> +        uint64_t start_sector;
> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> +
> +        /* If there are snapshots, every active cluster has to be discarded */
> +
> +        for (start_sector = 0; start_sector < bs->total_sectors;
> +             start_sector += sector_step)
> +        {
> +            /* As this function is generally used after committing an external
> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> +             * default action for this kind of discard is to pass the discard,
> +             * which will ideally result in an actually smaller image file, as
> +             * is probably desired. */
> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> +                                         MIN(sector_step,
> +                                             bs->total_sectors - start_sector),
> +                                         QCOW2_DISCARD_SNAPSHOT, true);
> +            if (ret < 0) {
> +                break;
> +            }
> +        }

The fallback is always safe, even if slower.

> +    } else {
> +        uint64_t min_size_1, min_size_2;
> +        int64_t blob_start;
> +        uint64_t blob_end, real_blob_size, clusters;
> +        uint64_t refblocks, reftable_clusters, l2_tables, l1_clusters;
> +
> +        /* If there are no snapshots, we basically want to create a new empty
> +         * image. This function is however not for creating a new image and
> +         * renaming it so it shadows the existing one, but rather for emptying an
> +         * image, so do exactly that.
> +         *
> +         * Therefore, the image should be valid at all points in time and may
> +         * at worst leak clusters.

I've convinced myself that you met that constraint - any early exit
prior to pivoting the reftable probably has leaks, but leaks are easy to
clean, and the image is consistent at all times.

> +         *
> +         * Any valid qcow2 image requires an L1 table which is large enough to
> +         * cover all of the guest cluster range, therefore it is impossible to
> +         * simply drop the L1 table (make its size 0) and create a minimal
> +         * refcount structure.
> +         *
> +         * Instead, allocate a blob of data which is large enough to hold a new
> +         * refcount structure (refcount table and all blocks) covering the whole
> +         * image until the end of that blob, and an empty L1 table covering the
> +         * whole guest cluster range. Then these structures are initialized and
> +         * the image header is modified to point to them.
> +         *
> +         * Generally, this blob will be allocated at the end of the image (that
> +         * is, its offset will be greater than its size). If that is indeed the
> +         * case, the same operation is repeated, but this time the new blob
> +         * starts right after the image header which will then indeed lead to a
> +         * minimal image. If this is not the case, the image will be nearly
> +         * minimal as well, as long as the underlying protocol supports discard.

Again, whether the all-0 L1 table actually occupies disk space
determines whether the image is truly minimal in that regard; but even
if it does occupy disk space, this description makes sense as
describing the bare minimum for a qcow2 file that is still valid and
points to the backing file for all clusters, regardless of how much of
the underlying file is allocated.

> +         *
> +         * Note that this implementation never frees allocated clusters. This is
> +         * because in case of success, the current refcounts are invalid anyway;

Or more precisely, on success, the current refcounts are thrown away as
we pivot to the new table of refcounts.

> +         * and in case of failure, it would be too cumbersome to keep track of
> +         * all allocated cluster ranges and free them then.
> +         *
> +         * min_size_1 will contain the number of clusters minimally required for
> +         * a blob that starts right after the image header; min_size_2 will
> +         * contain the number of clusters minimally required for the blob which
> +         * can be allocated based on the existing refcount structure.
> +         */
> +
> +        /* Repeat until a blob could be allocated which is large enough to
> +         * contain all structures necessary for describing itself. Allocated
> +         * clusters are not freed, even if they are not suitable, as this would
> +         * result in exactly the same cluster range being returned during the
> +         * retry (which is obviously not desirable). In case of success, the
> +         * current refcounts do not matter anyway; and in case of failure, the
> +         * clusters are only leaked (which can be easily repaired). */
> +        do {
> +            uint64_t fci = s->free_cluster_index;
> +
> +            /* TODO: Optimize, we do not need refblocks for other parts of the
> +             * image than the header and this blob */

Ah, you caught on to the optimization of fewer refblocks.  Rather than
wording this as a TODO, I'd just state that any overhead occupied by not
being minimal is reclaimed on a successful min_size_1 blob anyways, so
it wasn't worth the complications to the code.

> +            min_size_2 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
> +                                           s->cluster_bits - BDRV_SECTOR_BITS,
> +                                           fci);
> +
> +            blob_start = qcow2_alloc_clusters(bs,
> +                                              min_size_2 << s->cluster_bits);

This call has the side effect of increasing 'fci' for the next iteration
of the do loop, if this iteration didn't guess correctly.

> +            if (blob_start < 0) {
> +                return blob_start;
> +            }
> +
> +            clusters          = (blob_start >> s->cluster_bits) + min_size_2;
> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
> +                                             s->cluster_sectors,
> +                                             s->cluster_size / 8);
> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> +
> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
> +        } while (min_size_2 < real_blob_size);

But in reality, will the while loop ever be executed more than once?
Since min_size_2 was computed via a nice set of equations to be large
enough, it looks like you are just being paranoid here.

> +
> +        ret = create_refcount_l1(bs, blob_start, refblocks);
>          if (ret < 0) {
> -            break;
> +            return ret;
>          }
> +
> +        /* The only overhead is the image header */
> +        min_size_1 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
> +                                       s->cluster_bits - BDRV_SECTOR_BITS, 1);
> +
> +        /* If we cannot create a new blob before the current one, just discard
> +         * the sectors in between and return. Even if the discard does nothing,
> +         * only up to min_size_1 clusters plus the refcount blocks for those
> +         * are unused. The worst case is therefore an image of double the size
> +         * it needs to be, which is not too bad. */
> +        if ((blob_start >> s->cluster_bits) < 1 + min_size_1) {
> +            uint64_t sector, end;
> +            int step = INT_MAX / BDRV_SECTOR_SIZE;
> +
> +            end = blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE);
> +
> +            /* skip the image header */
> +            for (sector = s->cluster_sectors; sector < end; sector += step) {
> +                bdrv_discard(bs->file, sector, MIN(step, end - sector));

Why bdrv_discard and not qcow2_discard_clusters(...,
QCOW2_DISCARD_SNAPSHOT, true);?  Why can you ignore errors here?

> +            }
> +
> +            blob_end = (blob_start + real_blob_size) << s->cluster_bits;
> +        } else {
> +            clusters          = 1 + min_size_1;
> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
> +                                             s->cluster_sectors,
> +                                             s->cluster_size / 8);
> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> +
> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
> +
> +            assert(min_size_1 >= real_blob_size);
> +
> +            ret = create_refcount_l1(bs, s->cluster_size, refblocks);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +
> +            blob_end = (1 + real_blob_size) << s->cluster_bits;
> +        }
> +
> +        ret = bdrv_truncate(bs->file, blob_end);

The truncate is the key that finally makes the newly used refcount table
(whether from one call to create_refcount_l1 or two) accurate.  Note
that there is a case where the first attempt at allocating min_size_2
finds room for a blob in the middle of the file rather than at the end,
and the second create_refcount_l1() then fails (or you don't attempt a
second one but instead add error handling for a failed bdrv_discard);
the resulting qcow2 file is then inconsistent in that the reftable is
smaller than the overall size of the qcow2 image.  But as long as our
image sanity checking code can handle that and repair the image (if I
understand correctly, anything beyond the refcount table is basically
treated the same as any other leaked cluster), I don't think it is
fatal to this patch.

>      }
>  
>      return ret;
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
