Re: [Qemu-devel] Re: Strategic decision: COW format
Am 13.03.2011 06:51, schrieb Chunqiang Tang:

> After the heated debate, I thought more about the right approach of implementing snapshot, and it becomes clear to me that there are major limitations with both VMDK's external snapshot approach (which stores each snapshot as a separate CoW file) and QCOW2's internal snapshot approach (which stores all snapshots in one file and uses a reference count table to keep track of them). I just posted to the mailing list a patch that implements internal snapshot in FVD, but does it in a way without the limitations of VMDK and QCOW2.
>
> Let's first list the properties of an ideal virtual disk snapshot solution, and then discuss how to achieve them.
>
> G1: Do no harm (i.e., avoid being a misfeature): the added snapshot code should not slow down the runtime performance of an image that has no snapshots. This implies that an image without snapshots should not cache the reference count table in memory and should not update the on-disk reference count table.
> G2: Even better, an image with 1 snapshot runs as fast as an image without snapshots.
> G3: Even better still, an image with 1,000 snapshots runs as fast as an image without snapshots. This basically means getting the snapshot feature for free.
> G4: An image with 1,000 snapshots consumes no more memory than an image without snapshots. This again means getting the snapshot feature for free.
> G5: Regardless of the number of existing snapshots, creating a new snapshot is fast, e.g., taking no more than 1 second.
> G6: Regardless of the number of existing snapshots, deleting a snapshot is fast, e.g., taking no more than 1 second.
>
> Now let's evaluate VMDK and QCOW2 against these ideal properties:
>
> G1: VMDK good; QCOW2 poor
> G2: VMDK ok; QCOW2 poor
> G3: VMDK very poor; QCOW2 poor
> G4: VMDK very poor; QCOW2 poor
> G5: VMDK good; QCOW2 good
> G6: VMDK poor; QCOW2 good

Okay. I think I don't agree with all of these.
I'm not entirely sure how VMDK works, so I take this as a random image format that uses backing files (so it also applies to qcow2 with backing files, which I hope isn't too confusing).

G1: VMDK good; QCOW2 poor for cache=writethrough, ok otherwise; QCOW3 good
G2: VMDK ok; QCOW2 good
G3: VMDK poor; QCOW2 good
G4: VMDK very poor; QCOW2 ok
G5: VMDK good; QCOW2 good
G6: VMDK very poor; QCOW2 good

Also, let me add another feature which I believe is an important factor in the decision between internal and external snapshots:

G7: Loading/Reverting to a snapshot is fast
G7: VMDK good; QCOW2 ok

> On the other hand, QCOW2's internal snapshot has two major limitations that hurt runtime performance: caching the reference count table in memory and updating the on-disk reference count table. If we can eliminate both, then it is an ideal solution.

It's not even necessary to get rid of it completely. What hurts is writing the additional metadata. So if you can delay writing the metadata and only write out a refcount block once you need to load the next one into memory, the overhead is lost in the noise (remember, even with 64k clusters, a refcount block covers 2 GB of virtual disk space). We already do that for qcow2 in all writeback cache modes. We can't do it yet for cache=writethrough, but we were planning to allow using QED's dirty flag approach, which would get rid of the writes also in writethrough modes. I think this explains my estimation for G1.

For G2 and G3, I'm not sure why you think that having internal snapshots slows down operation. It's basically just data that sits in the image file and is unused. After startup or after deleting a snapshot you probably have to look at all of the refcount table again for cluster allocations; is this what you mean?

For G4, the size of snapshots in memory, the only overhead of internal snapshots that I could think of is the snapshot table. I would hardly rate this as poor.
For G5 and G6 I basically agree with your estimation, except that I think that the overhead of deleting a snapshot is _really_ bad. This is one of the major problems we have with external snapshots today.

> In an internal snapshot implementation, the reference count table is used to track used blocks and free blocks. It serves no other purpose. In FVD, its static reference count table only tracks blocks used by (static) snapshots, and it does not track blocks (dynamically) allocated (on a write) or freed (on a trim) for the running VM. This is a simple but fundamental difference w.r.t. QCOW2, whose reference count table tracks both the static content and the dynamic content. Because data blocks used by snapshots are static and do not change unless a snapshot is created or deleted, there is no need to update FVD's static reference count table while a VM runs, and actually there is even no need to cache it in memory. Data blocks that are dynamically allocated or freed for a running VM are already tracked by FVD's one-level lookup table.
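[Editor's note: the two write paths being debated above can be contrasted with a toy model. This is a hedged sketch, not QEMU or FVD code; all class and field names are invented for illustration. It counts the metadata updates an allocating guest write triggers when the refcount table tracks everything (QCOW2-style) versus when it tracks only snapshots (FVD-style):]

```python
# Toy model: metadata-update cost of an allocating write under the two schemes.
# Plain dicts stand in for on-disk tables; names are hypothetical.

class Qcow2Like:
    """Refcount table tracks static AND dynamic content: two updates per allocation."""
    def __init__(self):
        self.lookup = {}        # guest cluster -> file cluster
        self.refcount = {}      # file cluster -> reference count
        self.metadata_writes = 0
        self.next_free = 0

    def allocating_write(self, guest_cluster):
        file_cluster = self.next_free
        self.next_free += 1
        self.lookup[guest_cluster] = file_cluster
        self.metadata_writes += 1          # lookup table update
        self.refcount[file_cluster] = 1
        self.metadata_writes += 1          # refcount table update
        return file_cluster

class FvdLike:
    """Static refcount table tracks only snapshots: untouched while the VM runs."""
    def __init__(self):
        self.lookup = {}            # one-level lookup table tracks the current state
        self.static_refcount = {}   # only touched at snapshot create/delete time
        self.metadata_writes = 0
        self.next_free = 0

    def allocating_write(self, guest_cluster):
        file_cluster = self.next_free
        self.next_free += 1
        self.lookup[guest_cluster] = file_cluster
        self.metadata_writes += 1          # lookup table update only
        return file_cluster

q, f = Qcow2Like(), FvdLike()
for cluster in range(100):
    q.allocating_write(cluster)
    f.allocating_write(cluster)
print(q.metadata_writes, f.metadata_writes)   # 200 100
```

(Whether the second update actually costs anything in practice is exactly Kevin's point about writeback caching of refcount blocks.)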
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/13/2011 09:28 PM, Chunqiang Tang wrote:

> In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to track only static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table while the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table.

Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots, so I can understand the argument for it.

> By definition, a snapshot itself must be immutable (read-only), but a writeable image state can be derived from an immutable snapshot by using copy-on-write, which I guess is what you meant by writeable snapshot.

No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:

base0 <- base1 <- base2 <- image

And then if I wanted to move to base1 without destroying base2 and image, I could do:

qemu-img create -f qcow2 -b base1 base1-overlay.img

The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.

On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. From the use cases that I'm aware of (backup and RAS), I think these semantics are okay. I'm curious what other people think (Kevin/Stefan?).
Regards,
Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>
> base0 <- base1 <- base2 <- image
>
> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>
> qemu-img create -f qcow2 -b base1 base1-overlay.img
>
> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.
>
> On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.

No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from

Image: s1-s2-s3-s4-(current-state)

back to snapshot s2, and now the state is

Image: s1-s2-s3-s4
          |
          `-(current-state)

where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with

Image: s1-s2-s3-s4
          |     |
          |     `-(current-state)
          `-s5

FVD does have a reference count table like that in QCOW2, but it avoids the need to update the reference count table during normal execution of the VM. The reference count table is only updated at the time of creating a snapshot or deleting a snapshot. Therefore, during normal execution of a VM, images with snapshots are as fast as images without snapshots. FVD can do this because of the following: FVD's reference count table only tracks the snapshots (s1, s2, ...), but does not track the current-state. Instead, FVD's default mechanism (one-level lookup table, journal, etc.), which exists even before introducing snapshots, already tracks the current-state.

Working together, FVD's reference count table and its default mechanism track all the states. In QCOW2, when a new cluster is allocated while handling a running VM's write request, it updates both the lookup table and the reference count table, which is unnecessary because their information is redundant. By contrast, in FVD, when a new chunk is allocated while handling a running VM's write request, it only updates the lookup table without updating the reference count table, because by design the reference count table does not track the current-state and this chunk allocation belongs to the current-state. This is the key to why FVD gets all the functions of QCOW2's internal snapshot but without its memory overhead of caching the reference count table and its disk I/O overhead of reading and writing the reference count table during normal execution of the VM.

Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang
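[Editor's note: CQ's redundancy claim can be illustrated with a small sketch. The data structures here are invented (plain dicts standing in for on-disk tables, not FVD's actual layout): because every used cluster is referenced by some snapshot's lookup table or by the current state's lookup table, the reference counts can be recomputed from those tables alone, so they need not be maintained on the write path:]

```python
# Sketch: reference counts are derivable from the lookup tables, hence redundant.

def rebuild_refcounts(snapshot_tables, current_table):
    """Recompute file-cluster reference counts by scanning every lookup table."""
    refcount = {}
    for table in snapshot_tables + [current_table]:
        for file_cluster in table.values():
            refcount[file_cluster] = refcount.get(file_cluster, 0) + 1
    return refcount

# Two snapshots and the current state, as guest-cluster -> file-cluster maps:
snap1 = {0: 10, 1: 11}
snap2 = {0: 10, 1: 12}      # file cluster 10 is shared with snap1
current = {0: 13, 1: 12}    # file cluster 12 is shared with snap2

counts = rebuild_refcounts([snap1, snap2], current)
print(counts)   # {10: 2, 11: 1, 12: 2, 13: 1}
```

(The flip side, which Kevin raises later in the thread, is that recomputing instead of maintaining means some operations need a scan.)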
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 08:53 AM, Chunqiang Tang wrote:

>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>
>> base0 <- base1 <- base2 <- image
>>
>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>
>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>
>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>
> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>
> Image: s1-s2-s3-s4-(current-state)
>
> back to snapshot s2, and now the state is
>
> Image: s1-s2-s3-s4
>           |
>           `-(current-state)
>
> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>
> Image: s1-s2-s3-s4
>           |     |
>           |     `-(current-state)
>           `-s5

Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.

Regards,
Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 14:22, schrieb Anthony Liguori:
> On 03/13/2011 09:28 PM, Chunqiang Tang wrote:
>> In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to track only static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table while the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table.
>
> Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots, so I can understand the argument for it.
>
>> By definition, a snapshot itself must be immutable (read-only), but a writeable image state can be derived from an immutable snapshot by using copy-on-write, which I guess is what you meant by writeable snapshot.
>
> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>
> base0 <- base1 <- base2 <- image
>
> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>
> qemu-img create -f qcow2 -b base1 base1-overlay.img
>
> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.

IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.

> On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. From the use cases that I'm aware of (backup and RAS), I think these semantics are okay.

I don't think these semantics would be expected. And anyway, would this really allow simplification of the format? I'm afraid that you would go for complicated solutions with odd semantics just because of an arbitrary dislike of refcounts.

Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
> IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.

FVD's novel use of the reference count table reduces the metadata update overhead to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. It is never updated during normal execution of a VM.
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:02, schrieb Anthony Liguori:
> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>>
>>> base0 <- base1 <- base2 <- image
>>>
>>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>>
>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>
>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>>
>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>>
>> Image: s1-s2-s3-s4-(current-state)
>>
>> back to snapshot s2, and now the state is
>>
>> Image: s1-s2-s3-s4
>>           |
>>           `-(current-state)
>>
>> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>>
>> Image: s1-s2-s3-s4
>>           |     |
>>           |     `-(current-state)
>>           `-s5
>
> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.

No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).

Kevin
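[Editor's note: the model Kevin describes, a table of read-only snapshots plus one writable current state, can be sketched minimally. Class and method names are invented for illustration; this is not the qcow2 on-disk format, just its snapshot semantics:]

```python
# Minimal model of internal-snapshot semantics: reverting to a snapshot
# replaces only the writable current state and destroys no other snapshot.
import copy

class Image:
    def __init__(self):
        self.snapshots = {}   # name -> frozen lookup table (read-only)
        self.current = {}     # writable state, described by the image header

    def create_snapshot(self, name):
        self.snapshots[name] = copy.deepcopy(self.current)

    def revert(self, name):
        self.current = copy.deepcopy(self.snapshots[name])

img = Image()
img.current[0] = "A"
img.create_snapshot("s1")
img.current[0] = "B"
img.create_snapshot("s2")
img.revert("s1")                  # go back to s1 ...
assert img.current[0] == "A"
assert "s2" in img.snapshots      # ... without destroying s2
```

This is exactly the tree-of-snapshots behavior CQ's s1-s5 example walks through: any snapshot can be reverted to, and the others remain in the table.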
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 1:53 PM, Chunqiang Tang <ct...@us.ibm.com> wrote:
> Therefore, during normal execution of a VM, images with snapshots are as fast as images without snapshots.

Hang on, an image with a snapshot still needs to do copy-on-write, just like backing files. The cost of copy-on-write is reading data from the backing file, whereas a non-CoW write doesn't need to do that. So no, snapshots are not free during normal execution.

Stefan
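[Editor's note: Stefan's point in miniature. The helpers below are hypothetical, not the QEMU block layer; the sketch counts the extra read an allocating write must perform when the cluster's current contents still live in a snapshot or backing file:]

```python
# Sketch: copy-on-write costs an extra read that a plain overwrite avoids.

def guest_write(cluster, data, owned_by_current, read_old, write_new):
    """Write one cluster; do copy-on-write first if the current state doesn't own it."""
    extra_reads = 0
    if not owned_by_current:
        _old = read_old(cluster)   # the CoW cost: old contents must be read first
        extra_reads += 1           # (merge of partial writes elided for brevity)
    write_new(cluster, data)
    return extra_reads

store = {}
read_old = lambda c: b"old-data"
write_new = lambda c, d: store.__setitem__(c, d)

r_cow = guest_write(5, b"new-data", owned_by_current=False,
                    read_old=read_old, write_new=write_new)
r_plain = guest_write(5, b"new-data", owned_by_current=True,
                      read_old=read_old, write_new=write_new)
print(r_cow, r_plain)   # 1 0
```

This read is unavoidable for any snapshot design, internal or external, which is the concession CQ makes in his reply below.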
Re: [Qemu-devel] Re: Strategic decision: COW format
>> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.
>
> No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).

That's also the semantics of VMware's external snapshots, so there is no difference in semantics. It is just a difference in implementation and performance.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:25, schrieb Chunqiang Tang:
>> IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.
>
> FVD's novel use of the reference count table reduces the metadata update overhead to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. It is never updated during normal execution of a VM.

Yeah, I think that's basically an interesting property. However, I don't think that it makes a big difference compared to qcow2's refcount table when you use a writeback metadata cache. What about the question that I had in my other mail? (How do you determine if a cluster is free without scanning the whole lookup table?) I think this might be the missing piece for me to understand how your approach works.

Kevin
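[Editor's note: Kevin's parenthetical question, phrased as code. The data structures are assumed (plain dicts, not FVD's actual metadata): if no refcount table covers the current state, the naive way to find a free cluster is to scan both the static refcount table and the whole lookup table:]

```python
# Sketch: free-cluster search when refcounts cover only snapshots.

def find_free_cluster(total_clusters, static_refcount, current_table):
    """A cluster is free iff neither a snapshot nor the current state references it."""
    used = set(static_refcount) | set(current_table.values())
    for c in range(total_clusters):
        if c not in used:
            return c
    return None   # image full

static_refcount = {0: 1, 1: 2}    # file clusters held by snapshots
current = {"g0": 2, "g1": 4}      # file clusters held by the running VM's lookup table
print(find_free_cluster(8, static_refcount, current))   # 3
```

An implementation would presumably amortize this scan (e.g. build an in-memory free list once at open time), but the question of when and how that scan happens is what Kevin is asking about.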
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 09:15 AM, Kevin Wolf wrote:
>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.
>
> IIUC, he already uses a refcount table.

Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided. The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata, which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table.

> Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail).

Well, the trick here AFAICT is that you're basically storing external snapshots internally. So it's sort of like a bunch of FVD formats embedded into a single image.

> Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables.

I think it's a reasonable design goal to minimize any metadata updates in the fast path. If we can write 1 piece of metadata versus writing 2, then it's worth exploring IMHO.

> The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.

Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table?

>> On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. From the use cases that I'm aware of (backup and RAS), I think these semantics are okay.
>
> I don't think these semantics would be expected. And anyway, would this really allow simplification of the format?

I don't know, I'm really just trying to separate out the implementation of the format from the use cases we're trying to address. Even if we're talking about qcow3, then if we only really care about read-only snapshots, perhaps we can add a feature bit for this and take advantage of it to make the WCE=0 case much faster. But the fundamental question is, does this satisfy the use cases we care about?

Regards,
Anthony Liguori

> I'm afraid that you would go for complicated solutions with odd semantics just because of an arbitrary dislike of refcounts.
>
> Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
> On Mon, Mar 14, 2011 at 1:53 PM, Chunqiang Tang <ct...@us.ibm.com> wrote:
>> Therefore, during normal execution of a VM, images with snapshots are as fast as images without snapshots.
>
> Hang on, an image with a snapshot still needs to do copy-on-write, just like backing files. The cost of copy-on-write is reading data from the backing file, whereas a non-CoW write doesn't need to do that. So no, snapshots are not free during normal execution.

You are right. For any implementation of snapshots (internal or external), this CoW overhead is unavoidable. What I meant to say was that, other than this mandatory CoW overhead, FVD's internal snapshot does not incur any additional metadata update overhead (unlike QCOW2's).
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 09:21 AM, Kevin Wolf wrote:
> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>>>
>>>> base0 <- base1 <- base2 <- image
>>>>
>>>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>>>
>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>
>>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>>>
>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>>>
>>> Image: s1-s2-s3-s4-(current-state)
>>>
>>> back to snapshot s2, and now the state is
>>>
>>> Image: s1-s2-s3-s4
>>>           |
>>>           `-(current-state)
>>>
>>> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>>>
>>> Image: s1-s2-s3-s4
>>>           |     |
>>>           |     `-(current-state)
>>>           `-s5
>>
>> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.
>
> No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).
>
> Kevin

But is there any problem (in the format) with writing to the non-current state? I can't think of one.

Regards,
Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 2:25 PM, Chunqiang Tang <ct...@us.ibm.com> wrote:
>> IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.
>
> FVD's novel use of the reference count table reduces the metadata update overhead to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. It is never updated during normal execution of a VM.

Do you want to send out a break-down of the steps (and cost) involved in doing:

1. Snapshot creation.
2. Snapshot deletion.
3. Opening an image with n snapshots.

Stefan
Re: [Qemu-devel] Re: Strategic decision: COW format
>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.
>>
>> IIUC, he already uses a refcount table.
>
> Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided.

Kevin is right. FVD does have a refcount table. Sorry for causing confusion. I am going to send out a very detailed email which describes the operation steps in FVD, as Stefan requested.

> The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata, which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table.

No, the journal is not necessary. Actually, a very old version of FVD worked without a journal. The journal was later introduced as a performance enhancement.

> Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table?

Yes, this is indeed achieved in FVD, with zero writes to the refcount table on the fast path. See the details in the other email I am going to send out soon.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 2:49 PM, Anthony Liguori <anth...@codemonkey.ws> wrote:
> On 03/14/2011 09:21 AM, Kevin Wolf wrote:
>> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>>>>
>>>>> base0 <- base1 <- base2 <- image
>>>>>
>>>>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>>>>
>>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>>
>>>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>>>>
>>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>>>>
>>>> Image: s1-s2-s3-s4-(current-state)
>>>>
>>>> back to snapshot s2, and now the state is
>>>>
>>>> Image: s1-s2-s3-s4
>>>>           |
>>>>           `-(current-state)
>>>>
>>>> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>>>>
>>>> Image: s1-s2-s3-s4
>>>>           |     |
>>>>           |     `-(current-state)
>>>>           `-s5
>>>
>>> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.
>>
>> No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).
>
> But is there any problem (in the format) with writing to the non-current state? I can't think of one.

Here is a problem: there is a single global refcount table in QCOW2. You need to synchronize updates of the refcounts between multiple writers to avoid introducing incorrect refcounts.

Stefan
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:47, schrieb Anthony Liguori: On 03/14/2011 09:15 AM, Kevin Wolf wrote: The file system can keep a lot of these things around pretty easily but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. IIUC, he already uses a refcount table. Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided. The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table. Honestly I think that a journal is a good idea that we'll want to implement in the long run. There are people who aren't really happy about the dirty flag + fsck approach, and there are people who are concerned about cluster leaks without fsck. Both problems should be solved with a journal. Compared to other questions in the discussion, I think it's only a nice-to-have addition, though. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Well the trick here AFAICT is that you're basically storing external snapshots internally. So it's sort of like a bunch of FVD formats embedded into a single image. CQ, can you please clarify? From your description, Anthony seems to understand something completely different than I do. Are its characteristics more like qcow2's internal snapshots (which is what I understand) or more like external snapshots (which is what Anthony seems to understand)? Refcount tables aren't a very complex thing either. 
In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. I think it's a reasonable design goal to minimize any metadata updates in the fast path. If we can write 1 piece of metadata versus writing 2, then it's worth exploring IMHO. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost. Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table? Protected by a dirty flag (and/or a journal), sure. I mean, wasn't that the whole point of starting the qcow3 discussion? Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 3:04 PM, Chunqiang Tang ct...@us.ibm.com wrote: The file system can keep a lot of these things around pretty easily but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. IIUC, he already uses a refcount table. Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided. Kevin is right. FVD does have a refcount table. Sorry for causing confusion. I am going to send out a very detailed email which describes the operation steps in FVD, as Stefan requested. The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table. No, the journal is not necessary. Actually, a very old version of FVD worked without journal. Journal was later introduced as a performance enhancement. I like the journal because it allows us to isolate metadata updates into one specific area that can be scanned on image recovery. If we take the QED approach with the dirty bit then we have to scan all L1/L2 tables. The journal makes recovery more efficient than a full consistency check. Stefan
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:49, schrieb Anthony Liguori: On 03/14/2011 09:21 AM, Kevin Wolf wrote: Am 14.03.2011 15:02, schrieb Anthony Liguori: On 03/14/2011 08:53 AM, Chunqiang Tang wrote: No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be: base0- base1- base2- image And then if I wanted to move to base1 without destroying base2 and image, I could do: qemu-img create -f qcow2 -b base1 base1-overlay.img The file system can keep a lot of these things around pretty easily but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from Image: s1-s2-s3-s4-(current-state) back to snapshot s2, and now the state is Image: s1-s2-s3-s4 |-(current-state) where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with Image: s1-s2-s3-s4 |-s5 | |- (current-state) Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. 
While we don't do this today and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots. No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write). But is there any problem (in the format) with writing to the non-current state? I can't think of one. You would run into problems with the COW flag in the L2 tables. They are only an optimization, though, so you could probably avoid using them and directly look up the refcount table for each write, at the cost of performance. Anyway, I don't think there's a real use case for something like this. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 10:03 AM, Kevin Wolf wrote: The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost. Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table? Protected by a dirty flag (and/or a journal), sure. I mean, wasn't that the whole point of starting the qcow3 discussion? Okay, I thought you had something else in mind. Regards, Anthony Liguori Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
FVD's novel use of the reference count table reduces the metadata update overhead down to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. The reference count table is never updated during normal execution of a VM. Do you want to send out a break-down of the steps (and cost) involved in doing: 1. Snapshot creation. 2. Snapshot deletion. 3. Opening an image with n snapshots. Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. FVD’s refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the “writeable image front” as “writeable-front.” QCOW2’s refcount table counts clusters that are used by either read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. By contrast, FVD’s refcount table only counts chunks that are used by read-only snapshots, and does not count chunks used by writeable-front. This is the key that allows FVD to entirely avoid updating the refcount table on the fast path of normal VM execution. Below are the detailed steps for different operations. O1: Open an image with n snapshots. Let me introduce some basic concepts first. The storage allocation unit in FVD is called a chunk (like a cluster in QCOW2). 
The default chunk size is 1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD image file is conceptually divided into chunks, where chunk 0 is the first 1MB of the image file, chunk 1 is the second 1MB, … chunk j, … and so forth. The size of an image file grows as needed, just like that of QCOW2. The refcount table is a linear array “uint16_t refcount[]”. If a chunk j is referenced by s different snapshots, then refcount[j] = s. If a new snapshot is created and this new snapshot also uses chunk j, then refcount[j] is incremented to refcount[j] = s+1. If all snapshots together use 1TB of storage space, there are 1TB/1MB = 1,000,000 chunks, and the size of the refcount table is 2MB. Loading the entire 2MB refcount table from disk into memory takes about 15 milliseconds. If the virtual disk size perceived by the VM is also 1TB, FVD’s one-level lookup table is 4MB. FVD’s one-level lookup table serves the same purpose as QCOW2’s two-level lookup table, but FVD’s one-level table is much smaller and is preallocated and hence contiguous in the image. Loading the entire 4MB lookup table from disk into memory takes about 20 milliseconds. These numbers mean that it is quite affordable to scan both tables in their entirety at VM boot time, although the scan can also be avoided in FVD. The optimizations will be described later. When opening an image with n snapshots, an unoptimized version of FVD performs the following steps: Step 1: Load the entire 2MB reference count table from disk into memory. This step takes about 15ms. Step 2: Load the entire 4MB lookup table from disk into memory. This step takes about 20ms. Step 3: Use the two tables to build an in-memory data structure called the “free-chunk-bitmap.” This step takes about 2ms. The free-chunk-bitmap identifies free chunks that are not used by either the snapshots or writeable-front, and hence can be allocated for future writes. The size of the free-chunk-bitmap is only 125KB for a 1TB disk, and hence the memory overhead is negligible. 
The free-chunk-bitmap also supports trim operations. The free-chunk-bitmap does not have to be persisted on disk as it can always be rebuilt easily, although as an optimization it can be persisted on disk on VM shutdown. Step 4: Compare the refcount table and the lookup table to identify chunks that are in both tables (i.e., shared), so that the running VM’s writes to those chunks in writeable-front trigger copy-on-write. This step takes about 2ms. One bit in each lookup table entry is stolen to mark whether a chunk in writeable-front is shared with snapshots and hence needs copy-on-write upon a write. The whole process above, i.e., opening an image with n (e.g., n=1000) snapshots, takes about 39ms and is a one-time cost at VM boot. Later, I will describe optimizations that can further reduce this 39ms by saving the 125KB free-chunk-bitmap to disk on VM shutdown, but that optimization is more than likely an over-engineering effort, given that 39ms
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 17:32, schrieb Chunqiang Tang: FVD's novel use of the reference count table reduces the metadata update overhead down to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. The reference count table is never updated during normal execution of a VM. Do you want to send out a break-down of the steps (and cost) involved in doing: 1. Snapshot creation. 2. Snapshot deletion. 3. Opening an image with n snapshots. Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. Does this mean that FVD can't hold VM state of arbitrary size? FVD’s refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the “writeable image front” as “writeable-front.” QCOW2’s refcount table counts clusters that are used by either read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. Needs to update, but not necessarily on the fast path. Updates can be delayed and batched. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. Does this mean that FVD can't hold VM state of arbitrary size? No, FVD can hold VM state of an arbitrary size. Unlike QCOW2, FVD does not store the index of the VM state as part of the one-level lookup table. FVD could have done so, and then relocated the one-level lookup table in order to grow it in size (growing FVD's lookup table through relocation is supported, e.g., in order to resize an image to a larger size), but that's not an ideal solution. Instead, in FVD, each snapshot has two fields, vm_state_offset and vm_state_space_size, which directly point to where the VM state is stored, and vm_state_space_size can be arbitrary. BTW, I observe uint32_t QEMUSnapshotInfo.vm_state_size. Does this mean that a VM state cannot be larger than 4GB? This seems to be a limitation of QEMU. FVD instead uses uint64_t vm_state_space_size in the image format, in case the size of QEMUSnapshotInfo.vm_state_size is increased in the future. FVD’s refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the “writeable image front” as “writeable-front.” QCOW2’s refcount table counts clusters that are used by either read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. Needs to update, but not necessarily on the fast path. Updates can be delayed and batched. 
Probably this has been discussed extensively before (as you mentioned in some previous emails), but I missed the discussion and still have a naive question. Is delaying and batching possible for wce=0, i.e., cache=writethrough? Regards, ChunQiang (CQ) Tang Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 20:23, schrieb Chunqiang Tang: Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. Does this mean that FVD can't hold VM state of arbitrary size? No, FVD can hold VM state of an arbitrary size. Unlike QCOW2, FVD does not store the index of the VM state as part of the one-level lookup table. FVD could have done so, and then relocated the one-level lookup table in order to grow it in size (growing FVD's lookup table through relocation is supported, e.g., in order to resize an image to a larger size), but that's not an ideal solution. Instead, in FVD, each snapshot has two fields, vm_state_offset and vm_state_space_size, which directly point to where the VM state is stored, and vm_state_space_size can be arbitrary. Okay, makes sense. BTW, I observe uint32_t QEMUSnapshotInfo.vm_state_size. Does this mean that a VM state cannot be larger than 4GB? This seems to be a limitation of QEMU. FVD instead uses uint64_t vm_state_space_size in the image format, in case the size of QEMUSnapshotInfo.vm_state_size is increased in the future. Yeah, that was a stupid decision, it definitely should be 64 bit. Needs to update, but not necessarily on the fast path. Updates can be delayed and batched. Probably this has been discussed extensively before (as you mentioned in some previous emails), but I missed the discussion and still have a naive question. Is delaying and batching possible for wce=0, i.e., cache=writethrough? It's possible with QED's approach: You set a dirty flag in the image header, and while this flag is set you don't have to care about consistent refcount tables. Only when you clear the flag, you must flush the refcount cache to the image file. 
If qemu crashes, you see the dirty flag and you know that you have an image with stale refcounts. In this case you must do a metadata scan to rebuild the refcount table from the L2 tables (or just replay the journal if you have one). Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/12/2011 11:51 PM, Chunqiang Tang wrote: In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to only track static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table when the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table. Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots so I can understand the argument for it. Regards, Anthony Liguori Regards, ChunQiang (CQ) Tang Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to only track static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table when the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table. Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots so I can understand the argument for it. By definition, a snapshot itself must be immutable (read-only), but a writeable image state can be derived from an immutable snapshot by using copy-on-write, which I guess is what you meant by writeable snapshot. Perhaps the following concrete use cases will make things clear. These use cases are supported by QCOW2, VMware, and FVD, regardless of the difference in their internal implementation. Suppose an image's initial state is: Image: (current-disk-state-observed-by-the-running-VM) Below, I simply refer to current-disk-state-observed-by-the-running-VM as current-state. The VM issues writes and continuously modifies the current-state. At one point in time, a snapshot s1 is taken, and the image becomes: Image: s1-(current-state) The VM issues more writes and subsequently takes three snapshots, s2, s3, and s4. Now the image becomes: Image: s1-s2-s3-s4-(current-state) Suppose the action goto snapshot s2 is taken, which does not affect the immutable snapshots s1-s4, but the current-state is abandoned and lost. Now the image becomes: Image: s1-s2-s3-s4 |-(current-state) (Note: depending on your email client, the two lines in the diagram may not be properly aligned). 
The new current-state is writeable and is derived from the immutable snapshot s2. When the VM issues a write, it does copy-on-write and stores dirty data in the current-state without modifying the original snapshot s2. Perhaps this is what you meant by writeable snapshot? The diagram above is at the conceptual level. In implementation, both QCOW2 and FVD store all snapshots s1-s4 and the current-state in one image file, and the snapshots and current-state may share data chunks. Suppose the VM issues some writes and subsequently takes two snapshots, s5 and s6. Now the image becomes: Image: s1-s2-s3-s4 |-s5-s6-(current-state) Suppose the action goto snapshot s2 is taken again. Now the image becomes: Image: s1-s2-s3-s4 |-s5-s6 |-(current-state) The new current-state is writeable and is derived from the immutable snapshot s2. Right after the goto action, the running VM sees the state of s2, instead of the state of s5 created after the first goto snapshot s2 action. Again, this is because a snapshot itself is immutable. Again, all the use cases are supported by QCOW2, VMware, and FVD, regardless of the difference in their internal implementation. Now let's come back to the discussion of FVD. Perhaps my description in the previous email is not clear. In the diagrams above, FVD's reference count table only tracks the snapshots (s1, s2, ...), but does not track the current-state. Instead, FVD's default mechanism (one-level lookup table, journal, etc.), which exists even before introducing snapshot, already tracks the current-state. Working together, FVD's reference count table and its default mechanism track all the states. In QCOW2, when a new cluster is allocated during handling a running VM's write request, it updates both the lookup table and the reference count table, which is unnecessary because their information is redundant. 
By contrast, in FVD, when a new chunk is allocated while handling a running VM's write request, FVD only updates the lookup table without updating the reference count table, because by design the reference count table does not track the current-state and this chunk allocation operation belongs to the current-state. This is the key reason why FVD gets all the functions of QCOW2's internal snapshots but without QCOW2's memory overhead of caching the reference count table and its disk I/O overhead of reading or writing the reference count table during normal execution of the VM. Regards, ChunQiang (CQ) Tang Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
It seems that there is great interest in QCOW2's internal snapshot feature. If we really want to do that, the right solution is to follow VMDK's approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the reference count table. VMDK’s approach can be easily implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. After the heated debate, I thought more about the right approach to implementing snapshots, and it became clear to me that there are major limitations with both VMDK's external snapshot approach (which stores each snapshot as a separate CoW file) and QCOW2's internal snapshot approach (which stores all snapshots in one file and uses a reference count table to keep track of them). I just posted to the mailing list a patch that implements internal snapshots in FVD but does so in a way without the limitations of VMDK and QCOW2. Let's first list the properties of an ideal virtual disk snapshot solution, and then discuss how to achieve them. G1: Do no harm (or avoid being a misfeature), i.e., the added snapshot code should not slow down the runtime performance of an image that has no snapshots. This implies that an image without snapshots should not cache the reference count table in memory and should not update the on-disk reference count table. G2: Even better, an image with 1 snapshot runs as fast as an image without snapshots. G3: Even better still, an image with 1,000 snapshots runs as fast as an image without snapshots. This basically means getting the snapshot feature for free. G4: An image with 1,000 snapshots consumes no more memory than an image without snapshots. This again means getting the snapshot feature for free. G5: Regardless of the number of existing snapshots, creating a new snapshot is fast, e.g., taking no more than 1 second. 
G6: Regardless of the number of existing snapshots, deleting a snapshot is fast, e.g., taking no more than 1 second. Now let's evaluate VMDK and QCOW2 against these ideal properties. G1: VMDK good; QCOW2 poor G2: VMDK ok; QCOW2 poor G3: VMDK very poor; QCOW2 poor G4: VMDK very poor; QCOW2 poor G5: VMDK good; QCOW2 good G6: VMDK poor; QCOW2 good The evaluation above assumes a straightforward VMDK implementation that, when handling a long chain of snapshots, s0-s1-s2- … -s1000, uses a chain of 1,000 VMDK driver instances to represent the chain of backing files. This is slow and consumes a lot of memory, but it is the behavior of QEMU's block device architecture today. Even if the QEMU architecture can be revised and the VMDK implementation is optimized to the extreme, a fundamental limitation of VMDK (by design instead of by implementation) is G6, i.e., deleting a snapshot X in the middle of a snapshot chain is slow (this is also what I observed with the VMware software). Because each snapshot is stored as a separate file, when a snapshot X is deleted, the data blocks of X that are still needed by its child Y must be physically copied from file X to file Y, which is slow, and the VM is halted during the copy operation. QCOW2's internal snapshot approach avoids this problem. Since all snapshots are stored in one file, when a snapshot is deleted, QCOW2 only needs to update its reference count table without physically moving data blocks. On the other hand, QCOW2's internal snapshots have two major limitations that hurt runtime performance: caching the reference count table in memory and updating the on-disk reference count table. If we can eliminate both, then it is an ideal solution. This is exactly what FVD's internal snapshot solution does. Below is the key observation on why FVD can do it so efficiently. In an internal snapshot implementation, the reference count table is used to track used blocks and free blocks. It serves no other purposes. 
In FVD, the static reference count table tracks only blocks used by (static) snapshots; it does not track blocks (dynamically) allocated on a write or freed on a trim for the running VM. This is a simple but fundamental difference w.r.t. QCOW2, whose reference count table tracks both the static content and the dynamic content. Because data blocks used by snapshots are static and do not change unless a snapshot is created or deleted, there is no need to update FVD's static reference count table while a VM runs, and in fact there is no need even to cache it in memory. Data blocks that are dynamically allocated or freed for a running VM are already tracked by FVD's one-level lookup table (which is similar to QCOW2's two-level table, but in FVD it is much smaller and faster) even before introducing the snapshot feature, and hence this comes for free. Updating FVD's one-level lookup table is efficient because of FVD's journal. When the VM boots, FVD scans the reference count table
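The static/dynamic split Tang describes can be sketched roughly as follows. This is a hypothetical Python illustration of the idea only, not FVD's actual code or on-disk layout; all names and structures are invented for the sketch:

```python
class InternalSnapshotImage:
    """Toy model of an image that splits snapshot metadata (static)
    from live-VM allocation metadata (dynamic), as FVD is said to do."""

    def __init__(self):
        self.static_refcounts = {}  # physical block -> refcount; snapshot data only
        self.lookup_table = {}      # virtual block -> physical block, for the live VM
        self.journal = []           # journaled lookup-table updates, flushed cheaply

    def guest_write(self, vblock, pblock):
        # Fast path: only the journaled lookup table changes. The static
        # reference count table is neither read nor written here, so it
        # need not even be cached in memory while the VM runs.
        self.lookup_table[vblock] = pblock
        self.journal.append(("map", vblock, pblock))

    def create_snapshot(self):
        # Slow path, runs rarely: blocks referenced by the live image
        # join the static snapshot set and get their refcounts bumped.
        for pblock in self.lookup_table.values():
            self.static_refcounts[pblock] = self.static_refcounts.get(pblock, 0) + 1
```

In this model the refcount table is touched only by create_snapshot (and a corresponding delete), never by guest_write, which is the property behind goals G1 through G4 above.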
RE: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc. you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. I've created the technology for replaying the instruction stream and all of the inputs. This technology is similar to deterministic replay in VMware. Now I need something to save machine state at many checkpoints to implement reverse debugging. I think COW2 may be useful for it (or I should create something like this). Pavel Dovgaluk
Re: [Qemu-devel] Re: Strategic decision: COW format
On Fri, Feb 25, 2011 at 11:20 AM, Pavel Dovgaluk pavel.dovga...@ispras.ru wrote: On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc. you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. I've created the technology for replaying the instruction stream and all of the inputs. This technology is similar to deterministic replay in VMware. Now I need something to save machine state at many checkpoints to implement reverse debugging. I think COW2 may be useful for it (or I should create something like this). Or the BTRFS_IOC_CLONE ioctl on the btrfs filesystem. You can copy-on-write clone a file using it. Stefan
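Stefan's suggestion can be tried from user space; since Linux 4.5 the same ioctl value as BTRFS_IOC_CLONE is exposed filesystem-independently as FICLONE. A minimal sketch, which only succeeds when both files live on a reflink-capable filesystem such as btrfs:

```python
import fcntl

# FICLONE == BTRFS_IOC_CLONE == _IOW(0x94, 9, int)
FICLONE = 0x40049409

def reflink_clone(src_path, dst_path):
    """Make dst_path a copy-on-write clone of src_path's data blocks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```

On a filesystem without reflink support the ioctl fails (typically with EOPNOTSUPP or EINVAL), so real code needs a plain-copy fallback.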
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. These are five use cases that I can enumerate without thinking a lot about it, there might be more. 
You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. I don't think what you're proposing is a satisfactory solution. In my book, a single data structure that can provide all of the features is better than a bunch of independent hacks that allow only half of it. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list), which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? That you only access it in the slow path (discard) isn't true, because you certainly want to reallocate freed clusters. Otherwise you could just leak them without maintaining a list of leaked clusters... The only use of a refcount of more than 1-bit is internal snapshots AFAICT. Of the currently implemented features, internal snapshots and compression. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Chunqiang Tang ct...@us.ibm.com writes: [...] Now let’s talk about features. It seems that there is great interest in QCOW2's internal snapshot feature. If we really want to do that, the right Great interest? Its use cases are demo, debugging, testing and such. Kind of useful for developers, but I wouldn't want to use it in anger. Nice to have if we can get it cheaply, but I'm not prepared to pay much for it in performance or complexity, and I doubt I'm the only one. Users always say yes when you ask them whether they need some feature. Hence, the question is useless. A better question to ask is how much are you willing to pay for it? solution is to follow VMDK’s approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the reference count table. VMDK’s approach can be easily implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. I know the snapshots are not really “internal” as stored in a single file but instead more like external snapshots, but users don’t care about that so long as they support the same use cases. Probably many people who use VMware don't even know that the snapshots are stored as separate files. Do they care? I certainly wouldn't.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. Compression perhaps not, but if you choose compression, then performance is not your top consideration. That's the case with filesystems that support compression as well. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Regards, Anthony Liguori Compression perhaps not, but if you choose compression, then performance is not your top consideration. That's the case with filesystems that support compression as well.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 03:13 AM, Kevin Wolf wrote: Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. I'm interested in both. An extra write is always going to be an extra write. The flush just makes it very painful. 
A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. No, as designed today, qcow2 still needs a dirty flag to avoid leaking blocks. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. No, I only label compression and internal snapshots as misfeatures. Encryption is a completely reasonable feature. So even with qcow3, what's the expectation of snapshots? Are we going to scale to images with over 1000 snapshots? I believe snapshot support in qcow2 is not a feature that has been designed with any serious thought. If we truly want to support internal snapshots, let's design it correctly. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? Performance. One thing we haven't tested with qcow2 is O_SYNC performance in the guest but my suspicion is that an O_SYNC workload is going to perform poorly even with cache=none. Starting with a simple format that we don't have to jump through tremendous hoops to get reasonable performance out of has a lot of virtues. Regards, Anthony Liguori
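The disagreement about the fast path can be made concrete with a toy model. This is my own sketch, not qcow2 or QED code: in a naive refcount design every allocating write dirties refcount metadata, while a free-list design touches its metadata only on discard and relies on a dirty flag plus an fsck-style scan after a crash.

```python
class RefcountFormat:
    """Toy model: allocation state lives in a persistent refcount table."""

    def __init__(self):
        self.refcounts = {}
        self.metadata_updates = 0

    def allocating_write(self, cluster):
        self.refcounts[cluster] = self.refcounts.get(cluster, 0) + 1
        self.metadata_updates += 1   # refcount update sits on the fast path


class FreeListFormat:
    """Toy model: QED-style dirty flag; free list touched only on discard."""

    def __init__(self):
        self.free_list = []
        self.dirty = False
        self.metadata_updates = 0

    def allocating_write(self, cluster):
        if not self.dirty:
            self.dirty = True        # written once, not once per allocation
            self.metadata_updates += 1
        # allocation itself is implied by file size; fsck rebuilds it on crash

    def discard(self, cluster):
        self.free_list.append(cluster)
        self.metadata_updates += 1   # metadata only on the slow path
```

After 1,000 allocating writes the refcount model has performed 1,000 metadata updates and the free-list model one. Kevin's counterpoint, in effect, is that those 1,000 updates hit the same cached refcount block and can be delayed and batched just like the dirty flag, so the on-disk cost converges.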
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long ago for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. What are the points where you think that performance of internal snapshots suffers? The argument that I would understand is that internal snapshots are probably not as handy in all scenarios. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 15:21, schrieb Anthony Liguori: On 02/23/2011 03:13 AM, Kevin Wolf wrote: Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. I'm interested in both. An extra write is always going to be an extra write. The flush just makes it very painful. One extra write of 64k every 2 GB. Hardly relevant. 
A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. No, as designed today, qcow2 still needs a dirty flag to avoid leaking blocks. I know that this is your opinion and I do respect that, this is one of the reasons why there is the suggestion to add the dirty flag for you. On the other hand, it would be about time for you to accept that there are people who think differently about it and who don't want the same as you. This is why using the dirty flag should be optional. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. No, I only label compression and internal snapshots as misfeatures. Encryption is a completely reasonable feature. I didn't even mention encryption. It's obvious that it's a reasonable feature and not a misfeature, because it fits relatively easily in your QED design. :-) The three features you don't like because they don't fit are compression, internal snapshots and not having to fsck (thanks for proving the latter above). So even with qcow3, what's the expectation of snapshots? Are we going to scale to images with over 1000 snapshots? I believe snapshot support in qcow2 is not a feature that has been designed with any serious thought. If we truly want to support internal snapshots, let's design it correctly. So what would be the key differences between your design and qcow2's? We can always check if there's room to improve. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit.
There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? Performance. One thing we haven't tested with qcow2 is O_SYNC performance in the guest but my suspicion is that an O_SYNC workload is going to perform poorly even with cache=none. But wasn't it you who wanted to use the dirty flag in any case? The refcounts aren't even written then. Starting with a simple format that we don't have to jump through tremendous hoops to get reasonable performance out of has a lot of virtues. I know that you don't mean it like I read this, but it's entirely true: You're _starting_ with a simple
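Kevin's earlier figure of "one extra write of 64k every 2 GB" follows directly from the qcow2 constants, assuming 64k clusters and 16-bit refcount entries:

```python
cluster_size = 64 * 1024                 # 64k clusters
refcount_entry_size = 2                  # qcow2 refcount entries are 16 bits

# A refcount block is itself one cluster, so it holds this many entries:
entries_per_refcount_block = cluster_size // refcount_entry_size  # 32768

# Each entry covers one cluster of virtual disk, so one refcount block
# describes this much virtual disk space:
covered_bytes = entries_per_refcount_block * cluster_size

# One fully used 64k refcount block covers 2 GB, i.e. delayed writeback
# costs a single 64k metadata write per 2 GB of newly allocated data.
assert covered_bytes == 2 * 1024 ** 3
```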
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 08:38 AM, Kevin Wolf wrote: Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long ago for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). I was really referring to snapshots. I have absolutely no doubt that encryption can be implemented with a reasonable performance overhead. existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized.
I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount > 1. I think you quickly end up reinventing btrfs though in the process. Regards, Anthony Liguori What are the points where you think that performance of internal snapshots suffers? The argument that I would understand is that internal snapshots are probably not as handy in all scenarios. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. IMHO dm-crypt isn't a generally usable alternative to native built-in encryption in qcow2. It isn't usable at all by non-root. If you want to use it with plain files, then you need to turn the file into a loopback device and then layer in dm-crypt. It is generally just a PITA to manage. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:23 AM, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:29 PM, Anthony Liguori wrote: existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Migrate-to-file with block live migration, or even better, something based on Kemari would be a lot faster. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount > 1. I think you quickly end up reinventing btrfs though in the process. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? -- error compiling committee.c: too many arguments to function
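Avi's "thousands of snapshots = terabytes" estimate is easy to reproduce; the guest size below is illustrative, not a number from the thread:

```python
guest_ram = 4 * 1024 ** 3     # e.g. a 4 GB guest (illustrative)
snapshots = 1000              # "thousands of checkpoints"

# With VM state saved per internal snapshot, the memory image is
# duplicated each time, so total snapshot state grows linearly:
total_bytes = guest_ram * snapshots
total_tib = total_bytes / 1024 ** 4   # roughly 3.9 TiB for this example
```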
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:33 PM, Daniel P. Berrange wrote: On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. IMHO dm-crypt isn't a generally usable alternative to native built in encryption in qcow2. It isn't usable at all by non-root. If you want to use with plain files, then you need to turn the file into a loopback device and then layer in dm-crypt. It is generally just a PITA to manage. I wasn't suggesting dm-crypt is a replacement for qcow2 encryption, just that it shows that block-level encryption can be done with reasonable overhead. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:36 AM, Avi Kivity wrote: On 02/23/2011 05:29 PM, Anthony Liguori wrote: existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Fabrice wanted to use CoW as a mechanism to deduplicate the memory contents with the on-disk state specifically to address this problem. For the longest time, there was a comment in the savevm code along these lines. It might still be there. I think the lack of on-disk hashes was a critical missing bit to make this feature really work well. Migrate-to-file with block live migration, or even better, something based on Kemari would be a lot faster. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount > 1.
I think you quickly end up reinventing btrfs though in the process. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? Lots of things. The array will start to consume quite a bit of contiguous space as it gets larger, which means it needs to be relocated. Deleting a snapshot is a far more expensive operation than it needs to be. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:37 AM, Avi Kivity wrote: On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? Er, hit send too quickly. HPC is a big space where checkpointing is actually useful. An HPC workload may take weeks to run to completion. If something fails during the run, it's a huge waste of time. However, if you do regular checkpointing, a failure may only lose a few minutes of work instead of the entire weeks' worth of work. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:37 AM, Avi Kivity wrote: On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. For disaster recovery, there are some workloads where you can meaningfully revert to a snapshot provided that snapshots are stored at some frequency (like once a second). Think of something like a webserver where the only accumulated data is logs. Losing some of the logs is better than losing all of the logs. Regards, Anthony Liguori
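The degrading checkpoint schedule described above (every 10ms for recent history, every 100ms after 1 second, and so on) can be sketched as a thinning policy. This is an illustrative model only, not code from qcow2 or any other format; only the 10ms/100ms tiers come from the mail, and the 1-second tier beyond 10 seconds is an assumed extension of the pattern:

```python
def checkpoints_to_keep(now_ms, taken_ms):
    """Thin a list of checkpoint timestamps so density degrades with age:
    one per 10ms within the last second, one per 100ms within the last
    10 seconds, one per second beyond that."""
    kept, seen = [], set()
    for t in sorted(taken_ms, reverse=True):  # newest checkpoint wins its bucket
        age = now_ms - t
        gran = 10 if age <= 1_000 else 100 if age <= 10_000 else 1_000
        key = (gran, t // gran)               # bucket within the tier
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return sorted(kept)
```

Even taking a checkpoint every 10ms, the retained set stays modest: 2,000 checkpoints accumulated over 20 seconds thin down to about 200 under this policy.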
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 16:29, schrieb Anthony Liguori: On 02/23/2011 08:38 AM, Kevin Wolf wrote: Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they already are today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long ago for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). I was really referring to snapshots. I have absolutely no doubt that encryption can be implemented with a reasonable performance overhead. Alright. Last time you complained about things being too slow you were explicitly referring to encryption, so sometimes it's hard for me to follow you jumping from one topic to another. existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing, which would involve a large number of snapshots. 
So if there's no expectation that a chain of 1000 external snapshots works fine, why is it a requirement for internal snapshots? You might have a point if the external snapshots were actually not a chain, but a snapshot tree with lots of branches, but checkpointing means exactly creating a single chain. That said, while I haven't tried it out, I don't see any theoretical problems with using 1000 internal snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount of 1. I think you quickly end up reinventing btrfs though in the process. I share Avi's problem here, I don't really understand what the problem with a linear list of snapshots is. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:47 PM, Anthony Liguori wrote: I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Fabrice wanted to use CoW as a mechanism to deduplicate the memory contents with the on-disk state specifically to address this problem. For the longest time, there was a comment in the savevm code along these lines. It might still be there. I think the lack of on-disk hashes was a critical missing bit to make this feature really work well. So you have to use dirty logging to see which pages changed, otherwise you have to dedup all of them. Still I think migration/kemari is a better fit for this. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? Lots of things. The array will start to consume quite a bit of contiguous space as it gets larger, which means it needs to be relocated. If you double the space each time, it amortizes out. A snapshot seems to be around 40 bytes. So 10K snapshots = 400KB, hardly a huge amount (sans pointed-to data, which doesn't need to move). Deleting a snapshot is a far more expensive operation than it needs to be. Move the last snapshot into the deleted entry? -- error compiling committee.c: too many arguments to function
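Avi's two counterpoints — geometric growth amortizing relocation, and moving the last entry into the deleted slot — can be sketched with a toy model. The class and field names are illustrative, not qcow2 code; the ~40-byte entry size is the estimate from the mail:

```python
class SnapshotTable:
    """Toy model of a linear on-disk snapshot array (~40 bytes per entry).
    Doubling the capacity when full amortizes relocation cost to O(1) per
    append; deleting by moving the last entry into the freed slot avoids
    shifting the rest of the array."""
    ENTRY_BYTES = 40

    def __init__(self):
        self.entries = []      # each entry: (name, l1_table_offset)
        self.capacity = 4
        self.relocations = 0   # how often the on-disk array had to move

    def create(self, name, l1_offset):
        if len(self.entries) == self.capacity:
            self.capacity *= 2          # one relocation per doubling
            self.relocations += 1
        self.entries.append((name, l1_offset))

    def delete(self, index):
        self.entries[index] = self.entries[-1]  # swap-delete, O(1)
        self.entries.pop()

    def size_bytes(self):
        return len(self.entries) * self.ENTRY_BYTES
```

With these assumptions, 10,000 snapshots cost roughly 400 KB of table (matching Avi's figure) and only a dozen relocations over the image's whole history.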
Re: [Qemu-devel] Re: Strategic decision: COW format
On Wed, Feb 23, 2011 at 09:52:02AM -0600, Anthony Liguori wrote: On 02/23/2011 09:37 AM, Avi Kivity wrote: On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? Er, hit send too quickly. HPC is a big space where checkpointing is actually useful. An HPC workload may take weeks to run to completion. If something fails during the run, it's a huge waste of time. However, if you do regular checkpointing, a failure may only lose a few minutes of work instead of the entire weeks' worth of work. HPC workloads mostly run on clusters nowadays. Getting a consistent distributed snapshot without messages in flight is not as simple as snapshotting a bunch of VMs at a random time. Anyway, in the HPC scenario you need only one (the last) snapshot, not thousands of them. -- Gleb.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:52 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? Er, hit send too quickly. HPC is a big space where checkpointing is actually useful. An HPC workload may take weeks to run to completion. If something fails during the run, it's a huge waste of time. However, if you do regular checkpointing, a failure may only lose a few minutes of work instead of the entire weeks' worth of work. The trick is to delete snapshot N-M after taking snapshot N (for a small constant M). -- error compiling committee.c: too many arguments to function
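Avi's "delete snapshot N-M after taking snapshot N" is a sliding-window retention policy; a minimal sketch (function names are illustrative, the callbacks stand in for whatever snapshot commands the management layer uses):

```python
from collections import deque

def checkpoint_loop(take_snapshot, delete_snapshot, rounds, window=3):
    """Keep only the last `window` checkpoints: after taking snapshot N,
    delete snapshot N - window (Avi's 'small constant M')."""
    live = deque()
    for n in range(rounds):
        take_snapshot(n)
        live.append(n)
        if len(live) > window:
            delete_snapshot(live.popleft())  # drop the oldest survivor
    return list(live)
```

The image then never holds more than `window` snapshots at once, so none of the thousands-of-snapshots scaling questions arise for this use case.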
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 10:03 AM, Avi Kivity wrote: On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. Yes, I'm well aware of this. I don't think all the pieces were ever really there to do this. Regards, Anthony Liguori For disaster recovery, there are some workloads where you can meaningfully revert to a snapshot provided that snapshots are stored at some frequency (like once a second). Think of something like a webserver where the only accumulated data is logs. Losing some of the logs is better than losing all of the logs. Are static webservers that interesting? For disaster recovery? Anything else will need Kemari.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. For disaster recovery, there are some workloads where you can meaningfully revert to a snapshot provided that snapshots are stored at some frequency (like once a second). Think of something like a webserver where the only accumulated data is logs. Losing some of the logs is better than losing all of the logs. Are static webservers that interesting? For disaster recovery? Anything else will need Kemari. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 17:04, schrieb Anthony Liguori: On 02/23/2011 10:03 AM, Avi Kivity wrote: On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. Yes, I'm well aware of this. I don't think all the pieces were ever really there to do this. So why exactly was this a requirement for internal snapshots to be considered usable in a reasonable way? ;-) Anyway, I actually think with internal snapshots you're better suited to implement something like this than with external snapshots. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Anthony Liguori anth...@codemonkey.ws writes: On 02/18/2011 03:57 AM, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this: duplication is bad, and every additional format creates new maintenance burden, especially if we're taking it seriously. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). 
Not sure what project you're following, but we've had an awful lot of formats before qcow2 :-) And qcow2 was never all that special, it just was dropped in the code base one day. You've put a lot of work into qcow2, but there are other folks that are contributing additional formats and that means more developers. The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. Of course, you could invent another format that implements the same features, but I think just carefully extending qcow2 has some real advantages. The first is that conversion of existing images would be really easy. Basically increment the version number in the header file and you're done. Structures would be compatible. qemu-img convert is a reasonable path for conversion. If you compare it to file systems, I rarely ever change the file system on a non-empty partition. Even if I wanted, it's usually just too painful. Except when I was able to use tune2fs -j to make ext3 out of ext2, that was really easy. We can provide the same for qcow2 to qcow3 conversion, but not with a completely new format. Also, while obsoleting a file format means that we need not put much effort in its maintenance, we still need to keep the code around for reading old images. With an extension of qcow2, it would be the same code that is used for both versions. Third, qcow2 already exists, is used in practice and we have put quite some effort into QA. At least initially confidence would be higher than in a completely new, yet untested format. 
Remember that with qcow3 I'm not talking about rewriting everything, it's a careful evolution, mostly with optional additions here and there. My requirements for a new format are as follows:
1) a documented, thought-out specification that is covered under an open license with a clear process for extension.
2) the ability to add both compatible and incompatible features in a graceful way
3) the ability to achieve performance that's close to raw. I want our new format to be able to be used universally both for servers and desktops.
I'd like to add 4) minimize complexity and maximize maintainability of the code. I'd gladly sacrifice nice-to-have features for that. I think qcow2 has some misfeatures like compression and internal snapshots. I think preserving those misfeatures is a mistake because I don't think we can satisfy the above while trying to preserve those
Re: [Qemu-devel] Re: Strategic decision: COW format
Aurelien Jarno aurel...@aurel32.net writes: [...] I agree that the best would be to have a single format, and it's probably a goal to have. That said, what is most important in my view is having one or two formats which together have _all_ the features (and here I consider speed as a feature) of the existing qcow2 format. QED or FVD have been designed with virtualization in a datacenter in mind, and are very good for this use. OTOH they don't support compression or snapshotting, which are quite useful for demo, debugging, testing, or even for occasionally running a Windows VM, in other words in situations where speed is not the priority. Speed not being a priority means the requirements are pretty radically different. Satisfying two radically different sets of requirements with the same format could be difficult. Great to have, but possibly difficult. If we can't find a tradeoff for that, we should go for two image formats instead of one. Less bad than a jack-of-all-trades.
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 09:37, schrieb Markus Armbruster: Anthony Liguori anth...@codemonkey.ws writes: [...] I'd like to add 4) minimize complexity and maximize maintainability of the code. I'd gladly sacrifice nice-to-have features for that. Especially if they are features that only other users use, right? What's the Sankt-Florians-Prinzip called in English? I think qcow2 has some
Re: [Qemu-devel] Re: Strategic decision: COW format
Kevin Wolf kw...@redhat.com writes: Am 22.02.2011 09:37, schrieb Markus Armbruster: [...] Especially if they are features that only other users use, right? What's the Sankt-Florians-Prinzip called in
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Regards, Anthony Liguori And they are it today. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. Kevin
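The dirty-flag approach Kevin refers to can be modeled abstractly: persist the flag once before the first allocating write, keep subsequent table updates in memory, and rebuild metadata at open time if the flag was left set by a crash. This is a minimal sketch of the protocol, not QED's actual on-disk logic, and all names are illustrative:

```python
class DirtyFlagImage:
    """Minimal model of a dirty-flag consistency protocol: metadata
    writes are deferred while the persisted dirty flag is set, and a
    crash is repaired by a scan at open time instead of per-write
    metadata flushes."""
    def __init__(self):
        self.on_disk_dirty = False   # the persisted header flag
        self.in_memory_l2 = {}       # deferred table updates
        self.on_disk_l2 = {}         # what has actually been flushed

    def allocating_write(self, virt_cluster, phys_cluster):
        if not self.on_disk_dirty:
            self.on_disk_dirty = True  # flushed once, before any data write
        self.in_memory_l2[virt_cluster] = phys_cluster  # no per-write flush

    def clean_shutdown(self):
        self.on_disk_l2.update(self.in_memory_l2)  # flush tables once
        self.on_disk_dirty = False                 # image consistent again

    def open_image(self):
        if self.on_disk_dirty:
            # unflushed updates were lost; a real format would rescan the
            # data clusters here to reconstruct the tables
            self.in_memory_l2 = dict(self.on_disk_l2)
            self.on_disk_dirty = False
```

The point of the contrast above is that this trick is format-neutral: a refcount-based format can defer its refcount writes behind the same flag, just as QED defers its table writes.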
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. The only use of a refcount of more than 1-bit is internal snapshots AFAICT. Regards, Anthony Liguori Kevin
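The contrast Anthony draws can be made concrete with two toy allocators (illustrative only, not qcow2 or QED code): a free list encodes the 1-bit "allocated or not" state and is only touched on the slow path (discard), while a refcount table can express sharing (counts greater than 1, as internal snapshots need) but must be maintained on the allocating-write path:

```python
class FreeListAllocator:
    """1-bit state as a free list: alloc pops, discard pushes back.
    Only discard/trim (the slow path) grows the persisted free list;
    allocating writes need no refcount metadata update."""
    def __init__(self, n_clusters):
        self.free = list(range(n_clusters - 1, -1, -1))  # pop() yields 0 first
    def alloc(self):
        return self.free.pop()
    def discard(self, cluster):
        self.free.append(cluster)

class RefcountAllocator:
    """Refcount table: the same allocation, plus the ability to mark a
    cluster as shared by a snapshot (count > 1), which a free list
    cannot express."""
    def __init__(self, n_clusters):
        self.refcount = [0] * n_clusters
    def alloc(self):
        c = self.refcount.index(0)   # linear scan; fine for a toy model
        self.refcount[c] = 1         # metadata update on the allocation path
        return c
    def snapshot_ref(self, cluster):
        self.refcount[cluster] += 1  # cluster now shared with a snapshot
    def unref(self, cluster):
        self.refcount[cluster] -= 1
```

The design difference is exactly this `snapshot_ref` operation: if no cluster can ever be shared, the refcount column is redundant and the free list is the cheaper structure.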
Re: [Qemu-devel] Re: Strategic decision: COW format
In any case, the next step is to get down to specifics. Here is the page with the current QCOW3 roadmap: http://wiki.qemu.org/Qcow3_Roadmap Please raise concrete requirements or features so they can be discussed and captured.

Now it turns into a more productive discussion, but it seems to lose the big picture too quickly and has gone too narrowly into issues like the "dirty bit". Let's try to answer a bigger question: how to take a holistic approach to address all the factors that make a virtual disk slower than a physical disk? Even if issues like the "dirty bit" are addressed perfectly, they may still only be a small part of the total solution. The discussion of internal snapshots is at the end of this email.

Compared with a physical disk, a virtual disk (even RAW) incurs some or all of the following overheads. Obviously, the way to achieve high performance is to eliminate or reduce these overheads.

Overhead at the image level:
I1: Data fragmentation caused by an image format.
I2: Overhead in reading an image format's metadata from disk.
I3: Overhead in writing an image format's metadata to disk.
I4: Inefficiency and complexity in the block driver implementation, e.g., waiting synchronously for reading or writing metadata, submitting I/O requests sequentially when they should be done concurrently, performing a flush unnecessarily, etc.

Overhead at the host file system level:
H1: Data fragmentation caused by a host file system.
H2: Overhead in reading a host file system's metadata.
H3: Overhead in writing a host file system's metadata.

Existing image formats by design do not address many of these issues, which is the reason why FVD was invented ( http://wiki.qemu.org/Features/FVD ). Let's look at these issues one by one. 
Regarding I1: Data fragmentation caused by an image format: This problem exists in most image formats, as they insist on doing storage allocation for the second time at the image level (including QCOW2, QED, VMDK, VDI, VHD, etc.), even if the host file system already does storage allocation. These image formats unnecessarily mix the function of storage allocation with the function of copy-on-write, i.e., they determine whether a cluster is dirty by checking whether it has storage space allocated at the image level. This is wrong. Storage allocation and tracking dirty clusters are two separate functions. Data fragmentation at the image level can be totally avoided by using a RAW image plus a bitmap header to indicate whether clusters are dirty due to copy-on-write. FVD can be configured to take this approach, although it can also be configured to do storage allocation. Doing storage allocation at the image level can be optional, but should never be mandatory.

Regarding I2: Overhead in reading an image format’s metadata from disk: Obviously, the solution is to make the metadata small so that it can be cached entirely in memory. In this aspect, QCOW1/QCOW2/QED and VMDK-workstation-version are wrong, and VirtualBox VDI, Microsoft VHD, and VMDK-esx-server-version are right. With QCOW1/QCOW2/QED, for a 1TB virtual disk, the metadata size is at least 128MB. By contrast, with VDI, for a 1TB virtual disk, the metadata size is only 4MB. The “wrong formats” all use a two-level lookup table to do storage allocation at a small granularity (e.g., 64KB), whereas the “right formats” all use a one-level lookup table to do storage allocation at a large granularity (1MB or 2MB). The one-level table is easier to implement. Note that VMware VMDK started wrong in VMware’s workstation version, and then was corrected in the ESX server version, which is a good move.
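The 128MB-versus-4MB comparison can be reproduced with back-of-envelope arithmetic (assumed entry widths: 8-byte entries for the qcow2/QED-style two-level tables, 4-byte block pointers for a VDI-style one-level table):

```python
# Back-of-envelope metadata sizes for a 1 TB virtual disk, reproducing
# the figures quoted in the mail. The entry widths are assumptions:
# 8-byte entries for two-level qcow2/QED-style tables, 4-byte block
# pointers for a VDI-style one-level table.

TB = 1024 ** 4

def two_level_metadata(disk_size, cluster_size=64 * 1024, entry_size=8):
    # One table entry per cluster (ignoring the much smaller L1 table).
    return disk_size // cluster_size * entry_size

def one_level_metadata(disk_size, block_size=1024 * 1024, entry_size=4):
    # One entry per 1MB allocation block.
    return disk_size // block_size * entry_size

print(two_level_metadata(TB) // 2**20, "MB")  # 128 MB, qcow2/QED-style
print(one_level_metadata(TB) // 2**20, "MB")  # 4 MB, VDI-style
```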
As virtual disks grow bigger, it is likely that the storage allocation unit will be increased in the future, e.g., to 10MB or even larger. In existing image formats, one limitation of using a large storage allocation unit is that it forces copy-on-write to be performed on a large cluster (e.g., 10MB in the future), which is sort of wrong. FVD gets the best of both worlds. It uses a one-level table to perform storage allocation at a large granularity, but uses a bitmap to track copy-on-write at a smaller granularity. For a 1TB virtual disk, this approach needs only 6MB of metadata, slightly larger than VDI’s 4MB.

Regarding I3: Overhead in writing an image format’s metadata to disk: This is where the “dirty bit” discussion fits, but FVD goes way beyond that to reduce metadata updates. When an FVD image is fully optimized (e.g., the one-level lookup table is disabled and the base image is reduced to its minimum size), FVD has almost zero overhead in metadata updates and the data layout is just like a RAW image. More specifically, metadata updates are skipped, delayed, batched, or merged as much as possible without compromising data integrity. First, even with cache=writethrough (i.e., O_DSYNC), all metadata updates are sequential writes to FVD’s journal, which can be
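The 6MB figure quoted for FVD follows from the same kind of arithmetic, assuming a 1MB allocation chunk with 4-byte table entries plus a copy-on-write bitmap at 64KB granularity (these parameters are assumptions chosen to match the quoted numbers, not FVD's actual defaults):

```python
# Sketch of the FVD metadata claim: a one-level allocation table at
# coarse granularity plus a copy-on-write bitmap at fine granularity.
# The granularities and the 4-byte entry width are assumptions.

TB = 1024 ** 4

def fvd_metadata(disk_size, chunk=1024 * 1024, cow_unit=64 * 1024,
                 entry_size=4):
    table = disk_size // chunk * entry_size   # storage allocation table
    bitmap = disk_size // cow_unit // 8       # 1 bit per CoW block
    return table, bitmap

table, bitmap = fvd_metadata(TB)
print((table + bitmap) // 2**20, "MB")        # 6 MB for a 1 TB disk
```

So the bitmap buys 64KB copy-on-write tracking for only 2MB on top of the 4MB one-level table.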
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 20.02.2011 23:13, schrieb Aurelien Jarno: On Fri, Feb 18, 2011 at 10:57:05AM +0100, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it serious. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). 
The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2.

I agree that the best would be to have a single format, and it's probably a goal to have. That said, what is most important in my view is having one or two formats which together have _all_ the features (and here I consider speed as a feature) of the existing qcow2 format. QED or FVD have been designed with virtualization in a datacenter in mind, and are very good for this use. OTOH they don't support compression or snapshotting, which are quite useful for demo, debugging, testing, or even for occasionally running a Windows VM, in other words in situations where speed is not the priority. If we can't find a tradeoff for that, we should go for two instead of one image format.

I agree. Though that's purely theoretical because there's no reason why we shouldn't find a way to get both. ;-) In fact, the only area where qcow2 performs really badly in 0.14 is cache=writethrough (which unfortunately is the default...). With cache=none it's easy to find scenarios where it provides higher throughput than QED. Anyway, there's really only one crucial difference between QED and qcow2, which is that qcow2 ensures that metadata is consistent on disk at any time whereas QED relies on a dirty flag and rebuilds metadata after a crash (basically requiring an fsck). The obvious solution if you want to have this in qcow2, is adding a dirty flag there as well.
In my opinion, an additional flag certainly doesn't justify maintaining an additional format instead of extending the existing one. Likewise, I think FVD might provide some ideas that we can integrate as well, I just don't see a justification to include it as a separate format. Kevin
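The dirty-flag scheme that keeps coming up in this thread can be sketched as a small state machine (hypothetical Python; this illustrates the protocol under discussion, not QED's actual on-disk format):

```python
# Minimal sketch of the dirty-flag protocol: mark the image dirty
# before the first allocating write, clear it on a clean shutdown, and
# rebuild (fsck) metadata if the flag is still set when the image is
# reopened after a crash. All field names are hypothetical.

class Image:
    def __init__(self):
        self.dirty = False
        self.checked_on_open = False

    def open(self):
        if self.dirty:                 # previous session crashed
            self.fsck()                # rebuild e.g. the free list
        self.dirty = False

    def allocating_write(self):
        if not self.dirty:
            self.dirty = True          # one flag write + flush here,
                                       # then no metadata flushes until
                                       # shutdown or cache pressure
        # ... write data, update tables lazily ...

    def close(self):
        # flush all cached metadata, then clear the flag
        self.dirty = False

    def fsck(self):
        self.checked_on_open = True    # stands in for the rebuild scan
```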
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf kw...@redhat.com wrote: In fact, the only area where qcow2 in performs really bad in 0.14 is cache=writethrough (which unfortunately is the default...). With cache=none it's easy to find scenarios where it provides higher throughput than QED. Yeah, I'm tempted to implement parallel allocating writes now so I can pick on qcow2 in all benchmarks again ;). Anyway, there's really only one crucial difference between QED and qcow2, which is that qcow2 ensures that metadata is consistent on disk at any time whereas QED relies on a dirty flag and rebuilds metadata after a crash (basically requiring an fsck). The obvious solution if you want to have this in qcow2, is adding a dirty flag there as well. Likewise, I think FVD might provide some ideas that we can integrate as well, I just don't see a justification to include it as a separate format. You think that QED and FVD can be integrated into a QCOW2-based format. I agree it's possible and has some value. It isn't pretty and I would prefer to work on a clean new format because that, too, has value. In any case, the next step is to get down to specifics. Here is the page with the current QCOW3 roadmap: http://wiki.qemu.org/Qcow3_Roadmap Please raise concrete requirements or features so they can be discussed and captured. For example, journalling is an alternative to the dirty bit approach. If you feel that journalling is the best technique to address consistent updates, then make your case outside the context of today's qcow2, QED, and FVD implementations (although benchmark data will rely on current implementations). Explain how the technique would fit into QCOW3 and what format changes need to be made. I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD. Stefan
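The journalling alternative Stefan mentions can be sketched in a few lines (a toy model, not a proposal for any actual QCOW3 layout): metadata updates are appended sequentially, and the main tables can be rebuilt by replaying the journal after a crash:

```python
# Toy sketch of journalled metadata updates: every update goes to a
# sequential journal first, the main table is maintained in memory and
# written back lazily, and a crash is recovered by replaying the
# journal. Entirely hypothetical; key names are invented.

class Journal:
    def __init__(self):
        self.records = []                  # sequential appends only
        self.table = {}                    # in-memory metadata

    def update(self, key, value):
        self.records.append((key, value))  # one cheap sequential write
        self.table[key] = value            # main table updated lazily

    def replay(self):
        # Crash recovery: rebuild the table from the journal. Later
        # records for the same key win, as on a real replay.
        table = {}
        for key, value in self.records:
            table[key] = value
        return table
```

The appeal over a dirty flag is that recovery cost is bounded by the journal length rather than by a full-image metadata scan.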
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 21.02.2011 14:44, schrieb Stefan Hajnoczi: On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf kw...@redhat.com wrote: In fact, the only area where qcow2 in performs really bad in 0.14 is cache=writethrough (which unfortunately is the default...). With cache=none it's easy to find scenarios where it provides higher throughput than QED. Yeah, I'm tempted to implement parallel allocating writes now so I can pick on qcow2 in all benchmarks again ;). Heh. ;-) In the end it just shows that the differences are mainly in the implementation, not in the format. Anyway, there's really only one crucial difference between QED and qcow2, which is that qcow2 ensures that metadata is consistent on disk at any time whereas QED relies on a dirty flag and rebuilds metadata after a crash (basically requiring an fsck). The obvious solution if you want to have this in qcow2, is adding a dirty flag there as well. Likewise, I think FVD might provide some ideas that we can integrate as well, I just don't see a justification to include it as a separate format. You think that QED and FVD can be integrated into a QCOW2-based format. I agree it's possible and has some value. It isn't pretty and I would prefer to work on a clean new format because that, too, has value. In any case, the next step is to get down to specifics. Here is the page with the current QCOW3 roadmap: http://wiki.qemu.org/Qcow3_Roadmap Please raise concrete requirements or features so they can be discussed and captured. For example, journalling is an alternative to the dirty bit approach. If you feel that journalling is the best technique to address consistent updates, then make your case outside the context of today's qcow2, QED, and FVD implementations (although benchmark data will rely on current implementations). Explain how the technique would fit into QCOW3 and what format changes need to be made. I think journalling is an interesting option, but I'm not sure if we should target it for 0.15. 
As you know, there's already more than enough stuff to do until then, with coroutines etc. The dirty flag thing would be way easier to implement. We can always add a journal as a compatible feature in 0.16. To be honest, I'm not even sure any more that the dirty flag is that important. Originally we have been talking about cache=none and it definitely makes a big difference there because we save flushes. However, we're talking about cache=writethrough now and you flush on any write. It might be more important to make things parallel for writethrough. Maybe not writing out refcounts is something we should measure before we start implementing anything. (It's easy to disable all writes for a benchmark, even if the image will be broken afterwards) I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD. Definitely more productive, yes. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/21/2011 08:10 AM, Kevin Wolf wrote: [...] To be honest, I'm not even sure any more that the dirty flag is that important. Originally we have been talking about cache=none and it definitely makes a big difference there because we save flushes. However, we're talking about cache=writethrough now and you flush on any write. It might be more important to make things parallel for writethrough.

One thing I wonder about is whether we really need to have cache=X and wce=X. I never really minded the fact that cache=none advertised wce=on because we behaved effectively as if wce=on. But now that qcow2 triggers on wce=on, I'm a bit concerned that we're introducing a subtle degradation that most people won't realize. Ignoring some of the problems with O_DIRECT, semantically, I think there's a strong use-case for cache=none, wce=off.

Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 21.02.2011 16:16, schrieb Anthony Liguori: [...] One thing I wonder about is whether we really need to have cache=X and wce=X. I never really minded the fact that cache=none advertised wce=on because we behaved effectively as if wce=on. But now that qcow2 triggers on wce=on, I'm a bit concerned that we're introducing a subtle degradation that most people won't realize. Ignoring some of the problems with O_DIRECT, semantically, I think there's a strong use-case for cache=none, wce=off.

Fully agree, there's no real reason for having three writeback modes, but only one writethrough mode. It should be completely symmetrical. I think Christoph has mentioned several times that he has some patches for this. What's the status of them, Christoph? Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Fri, Feb 18, 2011 at 10:57:05AM +0100, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it serious. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. 
We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2.

I agree that the best would be to have a single format, and it's probably a goal to have. That said, what is most important in my view is having one or two formats which together have _all_ the features (and here I consider speed as a feature) of the existing qcow2 format. QED or FVD have been designed with virtualization in a datacenter in mind, and are very good for this use. OTOH they don't support compression or snapshotting, which are quite useful for demo, debugging, testing, or even for occasionally running a Windows VM, in other words in situations where speed is not the priority. If we can't find a tradeoff for that, we should go for two instead of one image format.

-- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] Re: Strategic decision: COW format
On Fri, Feb 18, 2011 at 7:11 PM, Kevin Wolf kw...@redhat.com wrote: Am 18.02.2011 18:43, schrieb Stefan Weil: Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent.

Well, there are different types of maintenance. It's not much work to just drop the code into qemu and let it bitrot. This is what happens to the funky formats like bochs or dmg. They are usually patched enough so that they still build, but nobody tries if they actually work. Then there are formats in which there is at least some interest, like vmdk or vdi. Occasionally they get some fixes, they are probably fine for image conversion, but I wouldn't really trust them for production use. And then there's raw and qcow2, which are used by a lot of people for running VMs, that are actively maintained, get a decent level of review and fixes etc. Getting a format into this group really takes a lot of work. Taking something like FVD would only make sense if we are willing to do that work - I mean, really nobody wants to convert from/to a file format that isn't implemented anywhere else.

This is a good thing to agree on so I want to reiterate: There are two types of image formats in QEMU today.
1. Native formats that are maintained and suitable for running VMs. This includes raw, qcow2, and qed.
2. Convert-only formats that may not be maintained and are not suitable for running VMs. All other formats in qemu.git.
The convert-only formats have synchronous implementations which makes it a bad idea to run VMs with them. They don't fit into QEMU's event-driven architecture and will cause poor performance and possible hangs. I hope folks agree on this. The next step is to consider that native support requires at least an order of magnitude more work and code.
It would be wise to focus on a flagship format in order to share that effort. So I think this thread is a useful discussion to have even if no one can be forced to collaborate on just one format. Kevin's position seems to be that an evolution of qcow2 is best for code maintenance and reuse. The position that QED and FVD have taken is to start from a clean slate in order to make incompatible changes and leave out problematic features. I think we can get there eventually with either approach but we'll be introducing incompatible changes either way. In terms of code reuse, it's initially nice to share code with qcow2 but in the long run the two formats might diverge far enough that it becomes a liability due to extra complexity. For reference, here is the QCOW3 roadmap wiki page: http://wiki.qemu.org/Qcow3_Roadmap Here is the QED outstanding work page: http://wiki.qemu.org/Features/QED/OutstandingWork Does FVD have a roadmap or future features? Stefan
[Qemu-devel] Re: Strategic decision: COW format
Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD.

Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3.

Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away.

Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it seriously. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2.
It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. Of course, you could invent another format that implements the same features, but I think just carefully extending qcow2 has some real advantages. The first is that conversion of existing images would be really easy. Basically increment the version number in the header file and you're done. Structures would be compatible. If you compare it to file systems, I rarely ever change the file system on a non-empty partition. Even if I wanted, it's usually just too painful. Except when I was able to use tune2fs -j to make ext3 out of ext2, that was really easy. We can provide the same for qcow2 to qcow3 conversion, but not with a completely new format. Also, while obsoleting a file format means that we need not put much effort in its maintenance, we still need to keep the code around for reading old images. With an extension of qcow2, it would be the same code that is used for both versions. Third, qcow2 already exists, is used in practice and we have put quite some effort into QA. At least initially confidence would be higher than in a completely new, yet untested format. Remember that with qcow3 I'm not talking about rewriting everything, it's a careful evolution, mostly with optional additions here and there. Kevin
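Kevin's "increment the version number and you're done" conversion really is that small, assuming the qcow2 header layout of a 4-byte magic ("QFI\xfb") followed by a big-endian 32-bit version field (a real converter would of course also validate and extend the remaining header fields):

```python
# Sketch of an in-place qcow2 -> qcow3 version bump, assuming the
# qcow2 header starts with a 4-byte magic and a big-endian 32-bit
# version. A real converter would do far more validation than this.

import struct

QCOW_MAGIC = b"QFI\xfb"

def bump_version(header: bytearray, new_version: int = 3) -> bytearray:
    magic, version = struct.unpack_from(">4sI", header, 0)
    if magic != QCOW_MAGIC:
        raise ValueError("not a qcow image")
    if version >= new_version:
        raise ValueError("image is already version %d" % version)
    struct.pack_into(">I", header, 4, new_version)  # rewrite in place
    return header

# A fake version-2 header: magic, version, padding for other fields.
hdr = bytearray(QCOW_MAGIC + struct.pack(">I", 2) + bytes(64))
assert struct.unpack_from(">I", bump_version(hdr), 4)[0] == 3
```

Contrast this with qemu-img convert, which rewrites every data cluster; the structural compatibility is exactly what makes the tune2fs -j analogy apt.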
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/18/2011 03:57 AM, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolfkw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it serious. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). 
Not sure what project you're following, but we've had an awful lot of formats before qcow2 :-) And qcow2 was never all that special, it just was dropped in the code base one day. You've put a lot of work into qcow2, but there are other folks that are contributing additional formats and that means more developers. The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. Of course, you could invent another format that implements the same features, but I think just carefully extending qcow2 has some real advantages. The first is that conversion of existing images would be really easy. Basically increment the version number in the header file and you're done. Structures would be compatible. qemu-img convert is a reasonable path for conversion. If you compare it to file systems, I rarely ever change the file system on a non-empty partition. Even if I wanted, it's usually just too painful. Except when I was able to use tune2fs -j to make ext3 out of ext2, that was really easy. We can provide the same for qcow2 to qcow3 conversion, but not with a completely new format. Also, while obsoleting a file format means that we need not put much effort in its maintenance, we still need to keep the code around for reading old images. With an extension of qcow2, it would be the same code that is used for both versions. Third, qcow2 already exists, is used in practice and we have put quite some effort into QA. At least initially confidence would be higher than in a completely new, yet untested format. 
Remember that with qcow3 I'm not talking about rewriting everything; it's a careful evolution, mostly with optional additions here and there. My requirements for a new format are as follows: 1) a documented, thought-out specification that is covered under an open license with a clear process for extension. 2) the ability to add both compatible and incompatible features in a graceful way. 3) the ability to achieve performance that's close to raw. I want our new format to be usable universally, both for servers and desktops. I think qcow2 has some misfeatures like compression and internal snapshots. I think preserving those misfeatures is a mistake because I don't think we can satisfy the above while trying to preserve those features. If the image format degrades when those features are enabled, then it decreases confidence in the format. I think QED satisfies all of these today. Regards, Anthony Liguori
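Requirement 2) is usually met with feature bitmaps, which is how QED handles it (separate incompatible, compatible, and autoclear feature fields). The sketch below uses hypothetical mask values to show the gating rule: unknown incompatible bits must make the open fail, while unknown compatible bits may simply be ignored:

```python
# Hypothetical feature masks for illustration (the values are made up,
# but the split mirrors QED's incompatible/compatible feature fields).
SUPPORTED_INCOMPATIBLE = 0x1  # e.g. a dirty-flag feature the reader must honor
SUPPORTED_COMPATIBLE = 0x3    # features that are safe for old readers to ignore


def check_features(incompatible, compatible):
    """Return True if an image with these feature bits can be opened.

    Unknown incompatible bits force a hard failure; unknown compatible
    bits are ignored, which is what makes adding them "graceful".
    """
    unknown = incompatible & ~SUPPORTED_INCOMPATIBLE
    if unknown:
        raise IOError("image uses unsupported features: %#x" % unknown)
    return True
```

An old reader thus keeps working with images that only add compatible features, and fails safely instead of corrupting images that use features it does not understand.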
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 18.02.2011 10:57, schrieb Kevin Wolf: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: [...] So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. The support of several different file formats is one of the strong points of QEMU, at least in my opinion. Reducing this to offline conversion would be a bad idea because it costs too much time and disk space for quick tests (for production environments, this might be totally different).
Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent. A new file format like FVD would be a challenge for the existing ones. Declare its support as unsupported or experimental, but let users decide which one is best suited to their needs! Maybe adding a staging tree (like for the Linux kernel) for experimental drivers, devices, file formats, tcg targets and so on would make it easier to add new code and reduce the need for QEMU forks. I'd very much appreciate this or any other solution that allows it! Regards, Stefan
Am 18.02.2011 18:43, schrieb Stefan Weil: Am 18.02.2011 10:57, schrieb Kevin Wolf: Am 18.02.2011 10:12, schrieb Markus Armbruster: [...] The support of several different file formats is one of the strong points of QEMU, at least in my opinion. I totally agree. qemu-img is known as a Swiss army knife for disk images and this is definitely a strength.
However, it's not useful just because it supports a high number of formats, but because these formats are in active use. Most of them are the native formats of some other software. I think things look a bit different when we're talking about qemu-specific formats. qcow1 isn't in use any more because nobody needs it for compatibility with other software, and for use with qemu, there is qcow2. Yet the qcow1 driver is still around and bitrots. Reducing this to offline conversion would be a bad idea because it costs too much time and disk space for quick tests (for production environments, this might be totally different). Either I'm misunderstanding what you're trying to say here, or you misunderstood what I said. I agree that we don't want to have to do qemu-img convert (i.e. a full copy) in order to upgrade. This is one of the reasons why I think we should have a qcow3 which can be upgraded basically by increasing the version number in the header (look at it as an incompatible feature flag, if you want) instead of starting something completely new. Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent. Well, there are different types of maintenance. It's not much work to just drop the code into qemu and let it bitrot. This is what happens to the funky formats like bochs or dmg. They are usually patched enough so that they still build, but nobody checks whether they actually work. Then there are formats in which there is at least some interest, like vmdk or vdi. Occasionally they get some fixes, they are probably fine for image conversion, but I wouldn't really trust them for production use. And then there's raw and qcow2, which are used by a lot of people for running VMs, that are actively maintained, get a decent level of review and fixes etc.
Getting a format into this group really takes a lot of work. Taking something like FVD would only make sense if we are willing to do that work - I mean, really nobody wants to convert from/to a file format that isn't implemented anywhere else. A new file format like FVD would be a challenge for the existing ones. Declare its support as unsupported or experimental, but let users decide which one is best suited to their needs! Basically this is what we did for QED. In hindsight I consider it a mistake because it set a bad precedent of inventing something new instead of fixing what's there. I really don't want to convert all my images each time to take advantage of a new qemu version. Kevin
On 02/18/2011 01:11 PM, Kevin Wolf wrote: [...] Basically this is what we did for QED. In hindsight I consider it a mistake because it set a bad precedent of inventing something new instead of fixing what's there. I don't see how qcow3 is fixing something that's there since it's still an incompatible format. It'd be a stronger argument if you were suggesting something that was still fully compatible with qcow2, but once compatibility is broken, it's broken. Regards, Anthony Liguori
On 02/18/2011 11:43 AM, Stefan Weil wrote: Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent. A new file format like FVD would be a challenge for the existing ones. FVD isn't merged because it's gotten almost no review. If it turns out that it is identical to an existing format and the existing format just has a crappy implementation, it wouldn't be merged; fixing the existing format would be preferred. But if it has a compelling advantage for a reasonable use-case, it will be merged. I don't know where this whole discussion of strategic formats for QEMU came from, but that's never been the way the project has operated. Regards, Anthony Liguori
Am 18.02.2011 20:47, schrieb Anthony Liguori: On 02/18/2011 01:11 PM, Kevin Wolf wrote: [...] Basically this is what we did for QED. In hindsight I consider it a mistake because it set a bad precedent of inventing something new instead of fixing what's there. I don't see how qcow3 is fixing something that's there since it's still an incompatible format. It'd be a stronger argument if you were suggesting something that was still fully compatible with qcow2, but once compatibility is broken, it's broken. It's really more like adding an incompatible feature flag in QED. You still have one implementation for old and new images instead of splitting up development efforts, you still have all of the features and so on. It's a completely different story than QED. Kevin
On 02/18/2011 02:49 PM, Kevin Wolf wrote: [...] It's really more like adding an incompatible feature flag in QED. You still have one implementation for old and new images instead of splitting up development efforts, you still have all of the features and so on. In theory. Since an implementation doesn't exist, we have no idea how much code is actually going to be shared at the end of the day. I suspect that, especially if you drop the ref table updates, there won't be an awful lot of common code in the two paths. Regards, Anthony Liguori
Am 18.02.2011 21:50, schrieb Anthony Liguori: On 02/18/2011 02:49 PM, Kevin Wolf wrote: [...] In theory. Since an implementation doesn't exist, we have no idea how much code is actually going to be shared at the end of the day. I suspect that, especially if you drop the ref table updates, there won't be an awful lot of common code in the two paths. Allowing refcounts to be inconsistent, protected by a dirty flag, is only an option, and you should only take it if you absolutely need it (i.e. your guest is broken and requires cache=writethrough, but you desperately need performance). My preferred way of implementing it is telling the refcount cache that it should ignore flushes and write its data back only when another refcount block must be loaded into the cache (which happens rarely enough that it doesn't really hurt performance). This makes the difference from the existing code more or less one if statement that returns early. Kevin
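Kevin's preferred policy can be sketched as a one-block cache whose flush is a no-op and whose dirty data is written back only when a different refcount block has to be loaded. The class and field names below are hypothetical (this is not qcow2's actual cache code), but the early-return flush is exactly the "one if statement" he describes:

```python
class RefcountBlockCache:
    """Sketch: delay refcount writeback until another block is needed."""

    def __init__(self, backend):
        self.backend = backend  # stand-in for the image file: index -> block
        self.index = None       # which refcount block is currently cached
        self.block = None
        self.dirty = False

    def flush(self):
        # The crucial difference from write-through behavior: return early
        # instead of writing the dirty block out on every guest flush.
        return

    def load(self, index):
        """Load a refcount block, writing back the old one if dirty."""
        if index == self.index:
            return self.block
        if self.dirty:
            # Writeback happens here, and only here - rarely, since one
            # refcount block covers gigabytes of virtual disk space.
            self.backend[self.index] = dict(self.block)
        self.index = index
        self.block = dict(self.backend.get(index, {}))
        self.dirty = False
        return self.block

    def set_refcount(self, index, offset, value):
        self.load(index)[offset] = value
        self.dirty = True
```

With this policy, repeated refcount updates within the same block cost no extra I/O at all; the metadata write is paid once per cache eviction instead of once per flush.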