On 04/06/2017 10:01 AM, Alberto Garcia wrote:
> Hi all,
> 
> over the past couple of months I discussed with some of you the
> possibility to extend the qcow2 format in order to improve its
> performance and reduce its memory requirements (particularly with very
> large images).
> 
> After some discussion in the mailing list and the #qemu IRC channel I
> decided to write a prototype of a new extension for qcow2 so I could
> understand better the scope of the changes and have some preliminary
> data about its effects.
> 
> This e-mail is the formal presentation of my proposal to extend the
> on-disk qcow2 format. As you can see this is still an RFC. Due to the
> nature of the changes I would like to get as much feedback as possible
> before going forward.

The idea in general makes sense; I can even remember chatting with Kevin
about similar ideas as far back as 2015. The biggest drawback is that
this is an incompatible image change, and therefore images created
with the flag cannot be read by older tools.

> === Test results ===
> 
> I have a basic working prototype of this. It's still incomplete -and
> buggy :)- but it gives an idea of what we can expect from it. In my
> implementation each data cluster has 8 subclusters, but that's not set
> in stone (see below).
> 
> I ran all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB.
> 
> I tried with the default and maximum cluster sizes (64KB and 2MB) and
> also with some other sizes. I also made sure to try with 32KB clusters
> so the subcluster size matches the 4KB block size used for the I/O.
> 
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
> 
> Here are the results (subcluster size in brackets):
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|

Those are some cool results!


> === Changes to the on-disk format ===
> 
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.

You also have to add a new incompatible feature bit, so that older tools
know they can't read the new image correctly, and therefore don't
accidentally corrupt it.
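
For concreteness, here is a minimal Python sketch of that refusal
logic. The incompatible_features field does sit at byte offset 72 of a
v3 header per the qcow2 spec, but the bit number used below for
subclusters (bit 4) is purely hypothetical, since no bit has been
assigned yet:

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"
KNOWN_INCOMPAT_V3 = 0x3   # dirty (bit 0) + corrupt (bit 1), per the v3 spec
SUBCLUSTER_BIT = 1 << 4   # hypothetical bit number for this feature

def unknown_incompatible_bits(header, known_mask):
    """Bits set in incompatible_features that this reader does not
    understand; a nonzero result means the image must be rejected."""
    magic, version = struct.unpack_from(">4sI", header, 0)
    if magic != QCOW2_MAGIC or version < 3:
        raise ValueError("not a qcow2 v3 header")
    # incompatible_features: big-endian u64 at byte offset 72
    (incompat,) = struct.unpack_from(">Q", header, 72)
    return incompat & ~known_mask

# Fake v3 header with only the hypothetical subcluster bit set
hdr = bytearray(104)
struct.pack_into(">4sI", hdr, 0, QCOW2_MAGIC, 3)
struct.pack_into(">Q", hdr, 72, SUBCLUSTER_BIT)

# An old tool must refuse the image; a tool that knows the bit may open it.
assert unknown_incompatible_bits(hdr, KNOWN_INCOMPAT_V3) == SUBCLUSTER_BIT
assert unknown_incompatible_bits(hdr, KNOWN_INCOMPAT_V3 | SUBCLUSTER_BIT) == 0
```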

> 
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (47 bits)                           (8 bits)
> 
>     bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>     bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>     bit 0: all zeros        (QCOW_OFLAG_ZERO)
> 
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't completely made up my mind about which one is the best, so
> I'd like to present all three for discussion. Here they are:
> 
> (1) Storing the bitmap inside the 64-bit entry
> 
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.

Note that because you are using exactly 8 subclusters, you can require
that the minimum cluster size when subclusters are enabled be 4k (since
we already have a lower limit of 512-byte sector operation, and don't
want subclusters to be smaller than that); at which point you are
guaranteed that the host cluster offset will be 4k aligned.  So in
reality, once you turn on subclusters, you have:

63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
**<----> <-----------------------------------------------><---------->*
  Rsrved              host cluster offset of data             Reserved
  (6 bits)                (44 bits)                           (11 bits)

where you have 17 bits plus the "all zeroes" bit to play with, thanks to
the three bits of host cluster offset that are now guaranteed to be zero
due to cluster size alignment (but you're also right that the "all
zeroes" bit is now redundant information with the 8 subcluster-is-zero
bits, so repurposing it does not hurt).
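
In code form, the layout in that diagram could be packed and unpacked
as below. This is only a rough Python sketch of the bit arithmetic;
the mask and helper names are my own invention:

```python
QCOW_OFLAG_COPIED     = 1 << 63   # refcount == 1
QCOW_OFLAG_COMPRESSED = 1 << 62
# bits 12-55: host cluster offset (44 bits of a 4k-aligned byte offset)
L2E_OFFSET_MASK = ((1 << 44) - 1) << 12

def l2_pack(host_offset, copied=False):
    # 4k alignment is exactly what frees the low three offset bits
    assert host_offset % 4096 == 0 and host_offset < (1 << 56)
    entry = host_offset
    if copied:
        entry |= QCOW_OFLAG_COPIED
    return entry

def l2_host_offset(entry):
    return entry & L2E_OFFSET_MASK

e = l2_pack(0x123000, copied=True)
assert l2_host_offset(e) == 0x123000
assert e & QCOW_OFLAG_COPIED
# bits 56-61 (6) plus bits 1-11 (11) = the 17 free bits mentioned above
```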

> 
>     * Pros:
>       + Simple. Few changes compared to the current qcow2 format.
> 
>     * Cons:
>       - Only 8 subclusters per cluster. We would not be making the
>         most of this feature.
> 
>       - No reserved bits left for the future.

I just argued that you have at least one, and probably two, bits left
over for future in-word expansion.
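
As an aside, the 16-bit bitmap from alternative (1) maps onto the
three subcluster states along these lines; the concrete two-bit
encoding below is an assumption on my part, not something the proposal
pins down:

```python
# Two bits per subcluster, 8 subclusters -> a 16-bit bitmap.
UNALLOCATED, ALLOCATED, ALL_ZEROES = 0, 1, 2   # assumed encoding

def get_state(bitmap, i):
    """State of subcluster i (0-7) in the 16-bit bitmap."""
    return (bitmap >> (2 * i)) & 0x3

def set_state(bitmap, i, state):
    """Return the bitmap with subcluster i set to the given state."""
    mask = 0x3 << (2 * i)
    return (bitmap & ~mask) | (state << (2 * i))

bm = 0
bm = set_state(bm, 3, ALLOCATED)
bm = set_state(bm, 5, ALL_ZEROES)
assert get_state(bm, 3) == ALLOCATED
assert get_state(bm, 5) == ALL_ZEROES
assert get_state(bm, 0) == UNALLOCATED
```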

> 
> (2) Making L2 entries 128-bit wide.
> 
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.

Although for smaller cluster sizes (such as 4k clusters), you'd still
want to restrict that subclusters are at least 512-byte sectors, so
you'd be using fewer than 32 of those subcluster positions until the
cluster size is large enough.

> 
>     * Pros:
>       + More subclusters per cluster. We could have images with
>         e.g. 128k clusters with 4k subclusters.

Could allow variable-sized subclusters (your choice of 32 subclusters of
4k each, or 16 subclusters of 8k each).

> 
>     * Cons:
>       - More space needed for L2 entries. The same cluster size would
>         require a cache twice as large, although having subcluster
>         allocation would compensate for this.
> 
>       - More changes to the code to handle 128-bit entries.

Dealing with variable-sized subclusters, or with unused subcluster
entries when the cluster size is too small (for example, a 4k cluster
should not be allowed any subclusters smaller than 512 bytes, which
uses at most 8 of the 32 slots available), can get tricky.
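
The restriction itself is easy to state in code; a one-line sketch of
how many of the 32 slots stay meaningful under the 512-byte floor:

```python
SECTOR_SIZE = 512
MAX_SUBCLUSTERS = 32    # what 128-bit L2 entries could index

def usable_subclusters(cluster_size):
    """Slots that are meaningful once subclusters may not be smaller
    than one 512-byte sector."""
    return min(MAX_SUBCLUSTERS, cluster_size // SECTOR_SIZE)

assert usable_subclusters(4 * 1024) == 8      # 4k cluster: 8 of 32 slots
assert usable_subclusters(64 * 1024) == 32    # 64k cluster: all 32 usable
assert usable_subclusters(128 * 1024) == 32   # subclusters grow to 4k instead
```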

> 
>       - We would still be wasting the 14 reserved bits that L2 entries
>         have.
> 
> (3) Storing the bitmap somewhere else
> 
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).
> 
>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)
>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).

It might still remain useful for optimization purposes, although then we
get into image consistency questions (if the all zeroes bit is set but
the subcluster map claims allocation, or if the all zeroes bit is clear
but all subclusters claim zero, which one wins).
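
One possible policy a consistency checker could apply; this is just a
sketch of one resolution rule (treat the per-subcluster map as
authoritative and rewrite the legacy bit), not something the proposal
defines:

```python
UNALLOCATED, ALLOCATED, ALL_ZEROES = 0, 1, 2   # assumed subcluster states

def all_zeroes_from_map(states):
    """Value the legacy 'all zeroes' bit should take if the subcluster
    map is treated as the authoritative source during a repair pass."""
    return all(s == ALL_ZEROES for s in states)

# Bit set but one subcluster holds real data -> the checker clears the bit
assert all_zeroes_from_map([ALL_ZEROES] * 7 + [ALLOCATED]) is False
# Bit clear but every subcluster claims zero -> the checker sets the bit
assert all_zeroes_from_map([ALL_ZEROES] * 8) is True
```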

> 
>     * Cons:
>       - As with alternative (2), more space needed for metadata.
> 
>       - The bitmap would also need to be cached for performance
>         reasons.
> 
>       - Possibly one more *_cache_size option.
> 
>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.

Having the subcluster table directly in the L2 means that updating the
L2 table is done with a single write. You are definitely right that
having the subcluster table as a bitmap in a separate cluster means two
writes instead of one, but as always, it's hard to predict how much of
an impact that is without benchmarks.

> 
> === Compressed clusters ===
> 
> My idea is that compressed clusters would remain the same. They are
> read-only anyway so they would not be affected by any of these
> changes.
> 
> ===========================
> 
> I think I managed to summarize all the ideas that I had in my mind,
> but I'm sure you probably have questions and comments, so I'd be happy
> to get as much feedback as possible.
> 
> So, looking forward to reading your opinions.
> 

The fact that you already have numbers proving the speedups that are
possible when first allocating the image makes this sound like a useful
project, even though it is an incompatible image change that old tools
won't be able to recognize. You'll want to make sure 'qemu-img amend'
can rewrite an image with subclusters into an older image.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
