> It seems that there is great interest in QCOW2's internal snapshot
> feature. If we really want to do that, the right solution is to follow
> VMDK's approach of storing each snapshot as a separate COW file (see
> http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the
> reference count table. VMDK's approach can be easily implemented for any
> COW format, or even as a function of the generic block layer, without
> complicating any COW format or hurting its performance.
After the heated debate, I thought more about the right approach to implementing snapshots, and it became clear to me that there are major limitations in both VMDK's external snapshot approach (which stores each snapshot as a separate CoW file) and QCOW2's internal snapshot approach (which stores all snapshots in one file and uses a reference count table to keep track of them). I just posted to the mailing list a patch that implements internal snapshots in FVD, but in a way that avoids the limitations of VMDK and QCOW2.

Let's first list the properties of an ideal virtual disk snapshot solution, and then discuss how to achieve them.

G1: Do no harm (i.e., avoid being a misfeature): the added snapshot code should not slow down the runtime performance of an image that has no snapshots. This implies that an image without snapshots should neither cache the reference count table in memory nor update the on-disk reference count table.

G2: Even better, an image with 1 snapshot runs as fast as an image without snapshots.

G3: Even better still, an image with 1,000 snapshots runs as fast as an image without snapshots. This basically means getting the snapshot feature for free.

G4: An image with 1,000 snapshots consumes no more memory than an image without snapshots. This again means getting the snapshot feature for free.

G5: Regardless of the number of existing snapshots, creating a new snapshot is fast, e.g., taking no more than one second.

G6: Regardless of the number of existing snapshots, deleting a snapshot is fast, e.g., taking no more than one second.

Now let's evaluate VMDK and QCOW2 against these ideal properties.
G1: VMDK good; QCOW2 poor
G2: VMDK ok; QCOW2 poor
G3: VMDK very poor; QCOW2 poor
G4: VMDK very poor; QCOW2 poor
G5: VMDK good; QCOW2 good
G6: VMDK poor; QCOW2 good

The evaluation above assumes a straightforward VMDK implementation that, when handling a long chain of snapshots, s0<-s1<-s2<- ... <-s1000, uses a chain of 1,000 VMDK driver instances to represent the chain of backing files. This is slow and consumes a lot of memory, but it is the behavior of QEMU's block device architecture today. Even if the QEMU architecture were revised and the VMDK implementation optimized to the extreme, a fundamental limitation of VMDK (by design rather than by implementation) is G6: deleting a snapshot X in the middle of a snapshot chain is slow (this is also what I observed with the VMware software). Because each snapshot is stored as a separate file, when snapshot X is deleted, those of X's data blocks that are still needed by its child Y must be physically copied from file X to file Y, which is slow, and the VM is halted during the copy operation.

QCOW2's internal snapshot approach avoids this problem. Since all snapshots are stored in one file, when a snapshot is deleted, QCOW2 only needs to update its reference count table, without physically moving data blocks. On the other hand, QCOW2's internal snapshots have two major limitations that hurt runtime performance: caching the reference count table in memory, and updating the on-disk reference count table. If we can eliminate both, we have an ideal solution. This is exactly what FVD's internal snapshot solution does.

Below is the key observation on why FVD can do this so efficiently. In an internal snapshot implementation, the reference count table is used to track used blocks and free blocks; it serves no other purpose. In FVD, the "static" reference count table tracks only blocks used by (static) snapshots; it does not track blocks (dynamically) allocated (on a write) or freed (on a trim) for the running VM.
This is a simple but fundamental difference with respect to QCOW2, whose reference count table tracks both the static content and the dynamic content. Because data blocks used by snapshots are static and do not change unless a snapshot is created or deleted, there is no need to update FVD's "static" reference count table while the VM runs; in fact, there is no need even to cache it in memory. Data blocks that are dynamically allocated or freed for a running VM were already tracked by FVD's one-level lookup table (which is similar to QCOW2's two-level table, but in FVD is much smaller and faster) before the snapshot feature was introduced, and hence this tracking comes for free. Updating FVD's one-level lookup table is efficient because of FVD's journal.

When the VM boots, FVD scans the reference count table once to build a so-called free-block bitmap in memory, which identifies blocks not used by static snapshots. The reference count table is then thrown away and never updated while the VM runs. For an image with 1TB of snapshot data, the free-block bitmap is only 125KB, i.e., the memory overhead is negligible. For the same image, FVD's reference count table is 2MB, and scanning it once at VM boot time takes no more than 20 milliseconds.

In short, FVD's internal snapshots achieve the ideal properties G1-G6 by 1) using the reference count table to track only "static" snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk "static" reference count table while the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang