> > FVD's novel use of the reference count table reduces the metadata update
> > overhead to literally zero during normal execution of a VM. This gets the
> > best of QCOW2's reference count table but without its overhead. In FVD,
> > the reference count table is only updated when creating a new snapshot or
> > deleting an existing snapshot. The reference count table is never updated
> > during normal execution of a VM.
>
> Do you want to send out a break-down of the steps (and cost) involved in doing:
>
> 1. Snapshot creation.
> 2. Snapshot deletion.
> 3. Opening an image with n snapshots.
Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD's one-level lookup table is very similar to QCOW2's two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. FVD's refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the "writeable image front" as "writeable-front."

QCOW2's refcount table counts clusters that are used by either the read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. By contrast, FVD's refcount table only counts chunks that are used by read-only snapshots, and does not count chunks used by writeable-front. This is the key that allows FVD to entirely avoid updating the refcount table on the fast path of normal VM execution. Below are the detailed steps for different operations.

Operation 1: Open an image with n snapshots.

Let me introduce some basic concepts first. The storage allocation unit in FVD is called a chunk (like a cluster in QCOW2). The default chunk size is 1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD image file is conceptually divided into chunks, where chunk 0 is the first 1MB of the image file, chunk 1 is the second 1MB, ..., chunk j, ..., and so forth. The size of an image file grows as needed, just like that of QCOW2.

The refcount table is a linear array "uint16_t refcount[]". If a chunk j is referenced by s different snapshots, then refcount[j] = s. If a new snapshot is created and this new snapshot also uses chunk j, then refcount[j] is incremented to s+1. If all snapshots together use 1TB of storage, there are 1TB/1MB = about one million chunks, and the size of the refcount table is 2MB. Loading the entire 2MB refcount table from disk into memory takes about 15 milliseconds.

If the virtual disk size perceived by the VM is also 1TB, FVD's one-level lookup table is 4MB. FVD's one-level lookup table serves the same purpose as QCOW2's two-level lookup table, but FVD's one-level table is much smaller and is preallocated and hence contiguous in the image. Loading the entire 4MB lookup table from disk into memory takes about 20 milliseconds. These numbers mean that it is quite affordable to scan the entire tables at VM boot time, although the scan can also be avoided in FVD. The optimizations will be described later.

When opening an image with n snapshots, an unoptimized version of FVD performs the following steps (a code sketch of steps O3 and O4 appears after this discussion):

O1: Load the entire 2MB refcount table from disk into memory. This step takes about 15ms.

O2: Load the entire 4MB lookup table from disk into memory. This step takes about 20ms.

O3: Use the two tables to build an in-memory data structure called the "free-chunk-bitmap." This step takes about 2ms. The free-chunk-bitmap identifies free chunks that are used by neither the snapshots nor writeable-front, and hence can be allocated for future writes. The size of the free-chunk-bitmap is only 125KB for a 1TB disk, so the memory overhead is negligible. The free-chunk-bitmap also supports trim operations. It does not have to be persisted on disk, as it can always be rebuilt easily, although as an optimization it can be persisted on disk at VM shutdown.

O4: Compare the refcount table and the lookup table to identify chunks that appear in both tables (i.e., are shared), so that the running VM's writes to those chunks in writeable-front trigger copy-on-write. This step takes about 2ms. One bit in each lookup table entry is stolen to mark whether a chunk in writeable-front is shared with snapshots and hence needs copy-on-write upon a write.

The whole process above, i.e., opening an image with n (e.g., n = 1000) snapshots, takes about 39ms, and it is a one-time cost at VM boot. Later, I will describe an optimization that can further reduce this 39ms by saving the 125KB free-chunk-bitmap to disk at VM shutdown, but that optimization is more than likely an over-engineering effort, given that 39ms at VM boot is perhaps not a big deal. Once the image is opened, the refcount table is discarded and never updated during normal execution of the VM. This is how FVD gets the best of QCOW2's refcount table but without its runtime overhead.

When the VM issues a write to a chunk in writeable-front and this chunk is shared with snapshots (which is known by looking at the special marker bit in the lookup table entries), FVD uses the free-chunk-bitmap to find a free chunk, performs copy-on-write, and updates the lookup table to point to that new chunk (see the second sketch below). This metadata change to the lookup table is persisted in FVD's journal and is not written directly to the on-disk lookup table. When the VM shuts down gracefully, the entire lookup table is written back to the on-disk lookup table (this step takes about 20ms), and FVD's image header is updated to indicate that there is no need to recover from the journal on the next VM boot.
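To make steps O3 and O4 concrete, here is a minimal sketch of the open-time scan, written in C. The names, the uint32_t lookup-entry format, and the choice of the top bit as the stolen copy-on-write marker are my illustrative assumptions, not FVD's actual on-disk layout:

#include <stdint.h>
#include <string.h>

#define CHUNK_SHARED_BIT   0x80000000u      /* assumed stolen bit: shared with snapshots */
#define CHUNK_INDEX_MASK   0x7FFFFFFFu      /* remaining bits: physical chunk index */
#define CHUNK_UNALLOCATED  CHUNK_INDEX_MASK /* assumed "not yet allocated" value */

/* Steps O3-O4: build the free-chunk-bitmap and mark shared chunks.
 * refcount[j] > 0 means physical chunk j is used by at least one snapshot;
 * lookup[v] maps virtual chunk v to a physical chunk in the image file. */
static void open_scan(const uint16_t *refcount, uint32_t *lookup,
                      uint64_t nb_chunks, uint64_t nb_virtual_chunks,
                      uint8_t *free_chunk_bitmap /* 1 bit per chunk, 1 = free */)
{
    /* Initially assume every chunk is free. */
    memset(free_chunk_bitmap, 0xFF, (nb_chunks + 7) / 8);

    /* O3a: chunks referenced by any snapshot are not free. */
    for (uint64_t j = 0; j < nb_chunks; j++) {
        if (refcount[j] > 0) {
            free_chunk_bitmap[j / 8] &= ~(1u << (j % 8));
        }
    }

    /* O3b and O4: chunks referenced by writeable-front are not free either;
     * if such a chunk is also counted by a snapshot, set the stolen bit in
     * the lookup table entry so a later write triggers copy-on-write. */
    for (uint64_t v = 0; v < nb_virtual_chunks; v++) {
        uint32_t j = lookup[v] & CHUNK_INDEX_MASK;
        if (j == CHUNK_UNALLOCATED) {
            continue;
        }
        free_chunk_bitmap[j / 8] &= ~(1u << (j % 8));
        if (refcount[j] > 0) {
            lookup[v] |= CHUNK_SHARED_BIT;
        }
    }
}

Both loops are linear scans over in-memory arrays, which is why steps O3 and O4 each cost only about 2ms for a 1TB disk.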
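And here is a similarly hedged sketch of the copy-on-write path just described, reusing the entry format from the previous sketch. The helpers alloc_free_chunk, read_chunk, write_chunk, and journal_lookup_update are hypothetical stand-ins for FVD's actual allocation, I/O, and journaling interfaces:

/* Hypothetical helper: find a set bit in the free-chunk-bitmap, clear it,
 * and return the physical index of the now-allocated chunk. */
extern uint32_t alloc_free_chunk(uint8_t *free_chunk_bitmap, uint64_t nb_chunks);

/* Hypothetical stand-ins for chunk I/O and for appending to FVD's journal. */
extern void read_chunk(uint32_t phys_chunk, void *buf);
extern void write_chunk(uint32_t phys_chunk, const void *buf);
extern void journal_lookup_update(uint64_t v, uint32_t new_entry);

/* Called before the VM's write to virtual chunk v reaches the image. */
static void cow_before_write(uint32_t *lookup, uint64_t v,
                             uint8_t *free_chunk_bitmap, uint64_t nb_chunks,
                             void *scratch /* one chunk-sized buffer */)
{
    if (!(lookup[v] & CHUNK_SHARED_BIT)) {
        return; /* not shared with any snapshot: write in place */
    }

    uint32_t old_chunk = lookup[v] & CHUNK_INDEX_MASK;
    uint32_t new_chunk = alloc_free_chunk(free_chunk_bitmap, nb_chunks);

    /* Copy-on-write: copy the shared chunk into the newly allocated chunk. */
    read_chunk(old_chunk, scratch);
    write_chunk(new_chunk, scratch);

    /* Point writeable-front at its private copy; the shared bit is cleared
     * because the new entry is written without CHUNK_SHARED_BIT. */
    lookup[v] = new_chunk;

    /* Persist the metadata change in the journal rather than in the on-disk
     * lookup table; the whole table is written back on clean shutdown. */
    journal_lookup_update(v, lookup[v]);
}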
Now let me describe an optimization that can further reduce the 39ms needed for opening an image with n snapshots. When the VM shuts down gracefully, FVD can save the in-memory 125KB free-chunk-bitmap to disk. When the VM reboots, FVD can load the 125KB free-chunk-bitmap from disk and skip steps O1, O2, and O3. Since the lookup table is saved to disk on a clean shutdown, and one bit in each table entry marks whether a chunk is shared with snapshots, the scanning step in O4 can also be avoided. In other words, all of steps O1-O4 can be avoided after a clean shutdown. They are needed only after a host crash. During the recovery process after a host crash, the journal is replayed to rebuild the lookup table, and the refcount table and the rebuilt lookup table are then used to rebuild the free-chunk-bitmap.

This is a description of FVD's image open operation. (…to be continued… I am running out of time now, and will write about FVD's other snapshot operations in separate emails.)

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang