> It seems that there is great interest in QCOW2's internal snapshot
> feature. If we really want to do that, the right solution is to follow
> VMDK's approach of storing each snapshot as a separate COW file (see
> http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the
> reference count table. VMDK's approach can be easily implemented for any
> COW format, or even as a function of the generic block layer, without
> complicating any COW format or hurting its performance.
After the heated debate, I thought more about the right approach to implementing snapshots, and it became clear to me that there are major limitations in both VMDK's external snapshot approach (which stores each snapshot as a separate CoW file) and QCOW2's internal snapshot approach (which stores all snapshots in one file and uses a reference count table to keep track of them). I just posted to the mailing list a patch that implements internal snapshots in FVD, but in a way that avoids the limitations of VMDK and QCOW2.

Let's first list the properties of an ideal virtual disk snapshot solution, and then discuss how to achieve them.

G1: Do no harm (i.e., avoid being a misfeature): the added snapshot code should not slow down the runtime performance of an image that has no snapshots. This implies that an image without snapshots should neither cache the reference count table in memory nor update the on-disk reference count table.

G2: Even better, an image with 1 snapshot runs as fast as an image without snapshots.

G3: Even better still, an image with 1,000 snapshots runs as fast as an image without snapshots. This basically means getting the snapshot feature for free.

G4: An image with 1,000 snapshots consumes no more memory than an image without snapshots. This again means getting the snapshot feature for free.

G5: Regardless of the number of existing snapshots, creating a new snapshot is fast, e.g., taking no more than one second.

G6: Regardless of the number of existing snapshots, deleting a snapshot is fast, e.g., taking no more than one second.

Now let's evaluate VMDK and QCOW2 against these ideal properties.
G1: VMDK good; QCOW2 poor
G2: VMDK ok; QCOW2 poor
G3: VMDK very poor; QCOW2 poor
G4: VMDK very poor; QCOW2 poor
G5: VMDK good; QCOW2 good
G6: VMDK poor; QCOW2 good

The evaluation above assumes a straightforward VMDK implementation that, when handling a long chain of snapshots, s0<-s1<-s2<- ... <-s1000, uses a chain of 1,000 VMDK driver instances to represent the chain of backing files. This is slow and consumes a lot of memory, but it is the behavior of QEMU's block device architecture today. Even if the QEMU architecture were revised and the VMDK implementation optimized to the extreme, a fundamental limitation of VMDK (by design rather than by implementation) is G6: deleting a snapshot X in the middle of a snapshot chain is slow (this is also what I observed with the VMware software). Because each snapshot is stored as a separate file, when snapshot X is deleted, those of X's data blocks that are still needed by its child Y must be physically copied from file X to file Y, which is slow, and the VM is halted during the copy operation.

QCOW2's internal snapshot approach avoids this problem. Since all snapshots are stored in one file, when a snapshot is deleted, QCOW2 only needs to update its reference count table, without physically moving data blocks. On the other hand, QCOW2's internal snapshots have two major limitations that hurt runtime performance: caching the reference count table in memory, and updating the on-disk reference count table. If we can eliminate both, we have an ideal solution. This is exactly what FVD's internal snapshot solution does.

Below is the key observation on why FVD can do this so efficiently. In an internal snapshot implementation, the reference count table is used to track used blocks and free blocks; it serves no other purpose. In FVD, the "static" reference count table tracks only blocks used by (static) snapshots; it does not track blocks (dynamically) allocated (on a write) or freed (on a trim) for the running VM.
This is a simple but fundamental difference with respect to QCOW2, whose reference count table tracks both the static content and the dynamic content. Because data blocks used by snapshots are static and do not change unless a snapshot is created or deleted, there is no need to update FVD's "static" reference count table while the VM runs; in fact, there is no need even to cache it in memory. Data blocks that are dynamically allocated or freed for a running VM were already tracked by FVD's one-level lookup table (which is similar to QCOW2's two-level table, but in FVD is much smaller and faster) before the snapshot feature was introduced, and hence this tracking comes for free. Updating FVD's one-level lookup table is efficient because of FVD's journal.

When the VM boots, FVD scans the reference count table once to build a so-called free-block bitmap in memory, which identifies blocks not used by static snapshots. The reference count table is then thrown away and never updated while the VM runs. For an image with 1TB of snapshot data, the free-block bitmap is only 125KB, i.e., the memory overhead is negligible. For the same image, FVD's reference count table is 2MB, and scanning it once at VM boot time takes no more than 20 milliseconds.

In short, FVD's internal snapshots achieve the ideal properties G1-G6 by 1) using the reference count table to track only "static" snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk "static" reference count table while the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang