Re: [Qemu-devel] Re: Strategic decision: COW format
Am 13.03.2011 06:51, schrieb Chunqiang Tang:

> After the heated debate, I thought more about the right approach of implementing snapshot, and it becomes clear to me that there are major limitations with both VMDK's external snapshot approach (which stores each snapshot as a separate CoW file) and QCOW2's internal snapshot approach (which stores all snapshots in one file and uses a reference count table to keep track of them). I just posted to the mailing list a patch that implements internal snapshot in FVD, but does it in a way without the limitations of VMDK and QCOW2.
>
> Let's first list the properties of an ideal virtual disk snapshot solution, and then discuss how to achieve them.
>
> G1: Do no harm (i.e., avoid being a misfeature): the added snapshot code should not slow down the runtime performance of an image that has no snapshots. This implies that an image without snapshots should not cache the reference count table in memory and should not update the on-disk reference count table.
> G2: Even better, an image with 1 snapshot runs as fast as an image without snapshots.
> G3: Even better still, an image with 1,000 snapshots runs as fast as an image without snapshots. This basically means getting the snapshot feature for free.
> G4: An image with 1,000 snapshots consumes no more memory than an image without snapshots. This again means getting the snapshot feature for free.
> G5: Regardless of the number of existing snapshots, creating a new snapshot is fast, e.g., taking no more than 1 second.
> G6: Regardless of the number of existing snapshots, deleting a snapshot is fast, e.g., taking no more than 1 second.
>
> Now let's evaluate VMDK and QCOW2 against these ideal properties:
>
> G1: VMDK good; QCOW2 poor
> G2: VMDK ok; QCOW2 poor
> G3: VMDK very poor; QCOW2 poor
> G4: VMDK very poor; QCOW2 poor
> G5: VMDK good; QCOW2 good
> G6: VMDK poor; QCOW2 good

Okay. I think I don't agree with all of these.
I'm not entirely sure how VMDK works, so I take this as a random image format that uses backing files (so it also applies to qcow2 with backing files, which I hope isn't too confusing).

G1: VMDK good; QCOW2 poor for cache=writethrough, ok otherwise; QCOW3 good
G2: VMDK ok; QCOW2 good
G3: VMDK poor; QCOW2 good
G4: VMDK very poor; QCOW2 ok
G5: VMDK good; QCOW2 good
G6: VMDK very poor; QCOW2 good

Also, let me add another feature which I believe is an important factor in the decision between internal and external snapshots:

G7: Loading/Reverting to a snapshot is fast
G7: VMDK good; QCOW2 ok

> On the other hand, QCOW2's internal snapshot has two major limitations that hurt runtime performance: caching the reference count table in memory and updating the on-disk reference count table. If we can eliminate both, then it is an ideal solution.

It's not even necessary to get rid of it completely. What hurts is writing the additional metadata. So if you can delay writing the metadata and only write out a refcount block once you need to load the next one into memory, the overhead is lost in the noise (remember, even with 64k clusters, a refcount block covers 2 GB of virtual disk space). We already do that for qcow2 in all writeback cache modes. We can't do it yet for cache=writethrough, but we were planning to allow using QED's dirty flag approach, which would get rid of the writes also in writethrough modes. I think this explains my estimation for G1.

For G2 and G3, I'm not sure why you think that having internal snapshots slows down operation. It's basically just data that sits in the image file and is unused. After startup or after deleting a snapshot you probably have to look at all of the refcount table again for cluster allocations; is this what you mean?

For G4, the size of snapshots in memory, the only overhead of internal snapshots that I could think of is the snapshot table. I would hardly rate this as poor.
For G5 and G6 I basically agree with your estimation, except that I think that the overhead of deleting a snapshot is _really_ bad. This is one of the major problems we have with external snapshots today.

> In an internal snapshot implementation, the reference count table is used to track used blocks and free blocks. It serves no other purpose. In FVD, its static reference count table only tracks blocks used by (static) snapshots, and it does not track blocks (dynamically) allocated (on a write) or freed (on a trim) for the running VM. This is a simple but fundamental difference w.r.t. QCOW2, whose reference count table tracks both the static content and the dynamic content. Because data blocks used by snapshots are static and do not change unless a snapshot is created or deleted, there is no need to update FVD's static reference count table while a VM runs, and actually there is even no need to cache it in memory. Data blocks that are dynamically allocated or freed for a running VM are already tracked by FVD's one-level lookup table.
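[Editor's note: the two write paths being debated above can be contrasted with a toy model. This is a hedged sketch, not QEMU or FVD code; all class and field names are invented for illustration. It counts the metadata updates an allocating guest write triggers when the refcount table tracks everything (QCOW2-style) versus when it tracks only snapshots (FVD-style):]

```python
# Toy model: metadata-update cost of an allocating write under the two schemes.
# Plain dicts stand in for on-disk tables; names are hypothetical.

class Qcow2Like:
    """Refcount table tracks static AND dynamic content: two updates per allocation."""
    def __init__(self):
        self.lookup = {}        # guest cluster -> file cluster
        self.refcount = {}      # file cluster -> reference count
        self.metadata_writes = 0
        self.next_free = 0

    def allocating_write(self, guest_cluster):
        file_cluster = self.next_free
        self.next_free += 1
        self.lookup[guest_cluster] = file_cluster
        self.metadata_writes += 1          # lookup table update
        self.refcount[file_cluster] = 1
        self.metadata_writes += 1          # refcount table update
        return file_cluster

class FvdLike:
    """Static refcount table tracks only snapshots: untouched while the VM runs."""
    def __init__(self):
        self.lookup = {}            # one-level lookup table tracks the current state
        self.static_refcount = {}   # only touched at snapshot create/delete time
        self.metadata_writes = 0
        self.next_free = 0

    def allocating_write(self, guest_cluster):
        file_cluster = self.next_free
        self.next_free += 1
        self.lookup[guest_cluster] = file_cluster
        self.metadata_writes += 1          # lookup table update only
        return file_cluster

q, f = Qcow2Like(), FvdLike()
for cluster in range(100):
    q.allocating_write(cluster)
    f.allocating_write(cluster)
print(q.metadata_writes, f.metadata_writes)   # 200 100
```

(Whether the second update actually costs anything in practice is exactly Kevin's point about writeback caching of refcount blocks.)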
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/13/2011 09:28 PM, Chunqiang Tang wrote:

> In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to track only static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table while the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table.

Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots, so I can understand the argument for it.

> By definition, a snapshot itself must be immutable (read-only), but a writeable image state can be derived from an immutable snapshot by using copy-on-write, which I guess is what you meant by writeable snapshot.

No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:

base0 <- base1 <- base2 <- image

And then if I wanted to move to base1 without destroying base2 and image, I could do:

qemu-img create -f qcow2 -b base1 base1-overlay.img

The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.

On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. From the use cases that I'm aware of (backup and RAS), I think these semantics are okay. I'm curious what other people think (Kevin/Stefan?).
Regards,
Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>
> base0 <- base1 <- base2 <- image
>
> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>
> qemu-img create -f qcow2 -b base1 base1-overlay.img
>
> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.
>
> On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.

No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from

Image: s1-s2-s3-s4-(current-state)

back to snapshot s2, and now the state is

Image: s1-s2-s3-s4
          |
          `-(current-state)

where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with

Image: s1-s2-s3-s4
          |     |
          |     `-(current-state)
          `-s5

FVD does have a reference count table like that in QCOW2, but it avoids the need to update the reference count table during normal execution of the VM. The reference count table is only updated at the time of creating a snapshot or deleting a snapshot. Therefore, during normal execution of a VM, images with snapshots are as fast as images without snapshots. FVD can do this because of the following: FVD's reference count table only tracks the snapshots (s1, s2, ...), but does not track the current-state. Instead, FVD's default mechanism (one-level lookup table, journal, etc.), which exists even before introducing snapshots, already tracks the current-state.

Working together, FVD's reference count table and its default mechanism track all the states. In QCOW2, when a new cluster is allocated while handling a running VM's write request, it updates both the lookup table and the reference count table, which is unnecessary because their information is redundant. By contrast, in FVD, when a new chunk is allocated while handling a running VM's write request, it only updates the lookup table without updating the reference count table, because by design the reference count table does not track the current-state and this chunk allocation belongs to the current-state. This is the key to why FVD gets all the functions of QCOW2's internal snapshot but without its memory overhead of caching the reference count table and its disk I/O overhead of reading and writing the reference count table during normal execution of the VM.

Regards,
ChunQiang (CQ) Tang, Ph.D.
Homepage: http://www.research.ibm.com/people/c/ctang
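[Editor's note: CQ's redundancy claim can be illustrated with a small sketch. The data structures here are invented (plain dicts standing in for on-disk tables, not FVD's actual layout): because every used cluster is referenced by some snapshot's lookup table or by the current state's lookup table, the reference counts can be recomputed from those tables alone, so they need not be maintained on the write path:]

```python
# Sketch: reference counts are derivable from the lookup tables, hence redundant.

def rebuild_refcounts(snapshot_tables, current_table):
    """Recompute file-cluster reference counts by scanning every lookup table."""
    refcount = {}
    for table in snapshot_tables + [current_table]:
        for file_cluster in table.values():
            refcount[file_cluster] = refcount.get(file_cluster, 0) + 1
    return refcount

# Two snapshots and the current state, as guest-cluster -> file-cluster maps:
snap1 = {0: 10, 1: 11}
snap2 = {0: 10, 1: 12}      # file cluster 10 is shared with snap1
current = {0: 13, 1: 12}    # file cluster 12 is shared with snap2

counts = rebuild_refcounts([snap1, snap2], current)
print(counts)   # {10: 2, 11: 1, 12: 2, 13: 1}
```

(The flip side, which Kevin raises later in the thread, is that recomputing instead of maintaining means some operations need a scan.)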
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 08:53 AM, Chunqiang Tang wrote:

>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>
>> base0 <- base1 <- base2 <- image
>>
>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>
>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>
>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>
> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>
> Image: s1-s2-s3-s4-(current-state)
>
> back to snapshot s2, and now the state is
>
> Image: s1-s2-s3-s4
>           |
>           `-(current-state)
>
> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>
> Image: s1-s2-s3-s4
>           |     |
>           |     `-(current-state)
>           `-s5

Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.

Regards,
Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 14:22, schrieb Anthony Liguori:
> On 03/13/2011 09:28 PM, Chunqiang Tang wrote:
>> In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to track only static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table while the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table.
>
> Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots, so I can understand the argument for it.
>
>> By definition, a snapshot itself must be immutable (read-only), but a writeable image state can be derived from an immutable snapshot by using copy-on-write, which I guess is what you meant by writeable snapshot.
>
> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>
> base0 <- base1 <- base2 <- image
>
> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>
> qemu-img create -f qcow2 -b base1 base1-overlay.img
>
> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.

IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.

> On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. From the use cases that I'm aware of (backup and RAS), I think these semantics are okay.

I don't think these semantics would be expected. And anyway, would this really allow simplification of the format? I'm afraid that you would go for complicated solutions with odd semantics just because of an arbitrary dislike of refcounts.

Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
> IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.

FVD's novel use of the reference count table reduces the metadata update overhead to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. It is never updated during normal execution of a VM.
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:02, schrieb Anthony Liguori:
> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>>
>>> base0 <- base1 <- base2 <- image
>>>
>>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>>
>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>
>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>>
>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>>
>> Image: s1-s2-s3-s4-(current-state)
>>
>> back to snapshot s2, and now the state is
>>
>> Image: s1-s2-s3-s4
>>           |
>>           `-(current-state)
>>
>> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>>
>> Image: s1-s2-s3-s4
>>           |     |
>>           |     `-(current-state)
>>           `-s5
>
> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.

No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).

Kevin
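[Editor's note: the model Kevin describes, a table of read-only snapshots plus one writable current state, can be sketched minimally. Class and method names are invented for illustration; this is not the qcow2 on-disk format, just its snapshot semantics:]

```python
# Minimal model of internal-snapshot semantics: reverting to a snapshot
# replaces only the writable current state and destroys no other snapshot.
import copy

class Image:
    def __init__(self):
        self.snapshots = {}   # name -> frozen lookup table (read-only)
        self.current = {}     # writable state, described by the image header

    def create_snapshot(self, name):
        self.snapshots[name] = copy.deepcopy(self.current)

    def revert(self, name):
        self.current = copy.deepcopy(self.snapshots[name])

img = Image()
img.current[0] = "A"
img.create_snapshot("s1")
img.current[0] = "B"
img.create_snapshot("s2")
img.revert("s1")                  # go back to s1 ...
assert img.current[0] == "A"
assert "s2" in img.snapshots      # ... without destroying s2
```

This is exactly the tree-of-snapshots behavior CQ's s1-s5 example walks through: any snapshot can be reverted to, and the others remain in the table.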
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 1:53 PM, Chunqiang Tang <ct...@us.ibm.com> wrote:
> Therefore, during normal execution of a VM, images with snapshots are as fast as images without snapshots.

Hang on, an image with a snapshot still needs to do copy-on-write, just like backing files. The cost of copy-on-write is reading data from the backing file, whereas a non-CoW write doesn't need to do that. So no, snapshots are not free during normal execution.

Stefan
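[Editor's note: Stefan's point in miniature. The helpers below are hypothetical, not the QEMU block layer; the sketch counts the extra read an allocating write must perform when the cluster's current contents still live in a snapshot or backing file:]

```python
# Sketch: copy-on-write costs an extra read that a plain overwrite avoids.

def guest_write(cluster, data, owned_by_current, read_old, write_new):
    """Write one cluster; do copy-on-write first if the current state doesn't own it."""
    extra_reads = 0
    if not owned_by_current:
        _old = read_old(cluster)   # the CoW cost: old contents must be read first
        extra_reads += 1           # (merge of partial writes elided for brevity)
    write_new(cluster, data)
    return extra_reads

store = {}
read_old = lambda c: b"old-data"
write_new = lambda c, d: store.__setitem__(c, d)

r_cow = guest_write(5, b"new-data", owned_by_current=False,
                    read_old=read_old, write_new=write_new)
r_plain = guest_write(5, b"new-data", owned_by_current=True,
                      read_old=read_old, write_new=write_new)
print(r_cow, r_plain)   # 1 0
```

This read is unavoidable for any snapshot design, internal or external, which is the concession CQ makes in his reply below.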
Re: [Qemu-devel] Re: Strategic decision: COW format
>> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.
>
> No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).

That's also the semantics of VMware's external snapshots, so there is no difference in semantics. It is just a difference in implementation and performance.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:25, schrieb Chunqiang Tang:
>> IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.
>
> FVD's novel use of the reference count table reduces the metadata update overhead to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. It is never updated during normal execution of a VM.

Yeah, I think that's basically an interesting property. However, I don't think that it makes a big difference compared to qcow2's refcount table when you use a writeback metadata cache. What about the question that I had in my other mail? (How do you determine if a cluster is free without scanning the whole lookup table?) I think this might be the missing piece for me to understand how your approach works.

Kevin
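[Editor's note: Kevin's parenthetical question, phrased as code. The data structures are assumed (plain dicts, not FVD's actual metadata): if no refcount table covers the current state, the naive way to find a free cluster is to scan both the static refcount table and the whole lookup table:]

```python
# Sketch: free-cluster search when refcounts cover only snapshots.

def find_free_cluster(total_clusters, static_refcount, current_table):
    """A cluster is free iff neither a snapshot nor the current state references it."""
    used = set(static_refcount) | set(current_table.values())
    for c in range(total_clusters):
        if c not in used:
            return c
    return None   # image full

static_refcount = {0: 1, 1: 2}    # file clusters held by snapshots
current = {"g0": 2, "g1": 4}      # file clusters held by the running VM's lookup table
print(find_free_cluster(8, static_refcount, current))   # 3
```

An implementation would presumably amortize this scan (e.g. build an in-memory free list once at open time), but the question of when and how that scan happens is what Kevin is asking about.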
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 09:15 AM, Kevin Wolf wrote:
>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.
>
> IIUC, he already uses a refcount table.

Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided. The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata, which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table.

> Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail).

Well, the trick here AFAICT is that you're basically storing external snapshots internally. So it's sort of like a bunch of FVD formats embedded into a single image.

> Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables.

I think it's a reasonable design goal to minimize any metadata updates in the fast path. If we can write 1 piece of metadata versus writing 2, then it's worth exploring IMHO.

> The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.

Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table?

>> On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. From the use cases that I'm aware of (backup and RAS), I think these semantics are okay.
>
> I don't think these semantics would be expected. And anyway, would this really allow simplification of the format?

I don't know, I'm really just trying to separate out the implementation of the format from the use cases we're trying to address. Even if we're talking about qcow3, then if we only really care about read-only snapshots, perhaps we can add a feature bit for this and take advantage of it to make the WCE=0 case much faster. But the fundamental question is, does this satisfy the use cases we care about?

Regards,
Anthony Liguori

> I'm afraid that you would go for complicated solutions with odd semantics just because of an arbitrary dislike of refcounts.
>
> Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
> On Mon, Mar 14, 2011 at 1:53 PM, Chunqiang Tang <ct...@us.ibm.com> wrote:
>> Therefore, during normal execution of a VM, images with snapshots are as fast as images without snapshots.
>
> Hang on, an image with a snapshot still needs to do copy-on-write, just like backing files. The cost of copy-on-write is reading data from the backing file, whereas a non-CoW write doesn't need to do that. So no, snapshots are not free during normal execution.

You are right. For any implementation of snapshots (internal or external), this CoW overhead is unavoidable. What I meant to say was that, other than this mandatory CoW overhead, FVD's internal snapshot does not incur any additional metadata update overhead (unlike QCOW2's).
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 09:21 AM, Kevin Wolf wrote:
> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>>>
>>>> base0 <- base1 <- base2 <- image
>>>>
>>>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>>>
>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>
>>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>>>
>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>>>
>>> Image: s1-s2-s3-s4-(current-state)
>>>
>>> back to snapshot s2, and now the state is
>>>
>>> Image: s1-s2-s3-s4
>>>           |
>>>           `-(current-state)
>>>
>>> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>>>
>>> Image: s1-s2-s3-s4
>>>           |     |
>>>           |     `-(current-state)
>>>           `-s5
>>
>> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.
>
> No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).
>
> Kevin

But is there any problem (in the format) with writing to the non-current state? I can't think of one.

Regards,
Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 2:25 PM, Chunqiang Tang <ct...@us.ibm.com> wrote:
>> IIUC, he already uses a refcount table. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Refcount tables aren't a very complex thing either. In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost.
>
> FVD's novel use of the reference count table reduces the metadata update overhead to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. It is never updated during normal execution of a VM.

Do you want to send out a break-down of the steps (and cost) involved in doing:

1. Snapshot creation.
2. Snapshot deletion.
3. Opening an image with n snapshots.

Stefan
Re: [Qemu-devel] Re: Strategic decision: COW format
>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table.
>>
>> IIUC, he already uses a refcount table.
>
> Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided.

Kevin is right. FVD does have a refcount table. Sorry for causing confusion. I am going to send out a very detailed email which describes the operation steps in FVD, as Stefan requested.

> The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata, which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table.

No, the journal is not necessary. Actually, a very old version of FVD worked without a journal. The journal was later introduced as a performance enhancement.

> Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table?

Yes, this is indeed achieved in FVD, with zero writes to the refcount table on the fast path. See the details in the other email I am going to send out soon.

Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 2:49 PM, Anthony Liguori <anth...@codemonkey.ws> wrote:
> On 03/14/2011 09:21 AM, Kevin Wolf wrote:
>> Am 14.03.2011 15:02, schrieb Anthony Liguori:
>>> On 03/14/2011 08:53 AM, Chunqiang Tang wrote:
>>>>> No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be:
>>>>>
>>>>> base0 <- base1 <- base2 <- image
>>>>>
>>>>> And then if I wanted to move to base1 without destroying base2 and image, I could do:
>>>>>
>>>>> qemu-img create -f qcow2 -b base1 base1-overlay.img
>>>>>
>>>>> The file system can keep a lot of these things around pretty easily, but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further.
>>>>
>>>> No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from
>>>>
>>>> Image: s1-s2-s3-s4-(current-state)
>>>>
>>>> back to snapshot s2, and now the state is
>>>>
>>>> Image: s1-s2-s3-s4
>>>>           |
>>>>           `-(current-state)
>>>>
>>>> where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with
>>>>
>>>> Image: s1-s2-s3-s4
>>>>           |     |
>>>>           |     `-(current-state)
>>>>           `-s5
>>>
>>> Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot, and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. While we don't do this today, and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots.
>>
>> No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write).
>
> But is there any problem (in the format) with writing to the non-current state? I can't think of one.

Here is a problem: there is a single global refcount table in QCOW2. You need to synchronize updates of the refcounts between multiple writers to avoid introducing incorrect refcounts.

Stefan
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:47, schrieb Anthony Liguori: On 03/14/2011 09:15 AM, Kevin Wolf wrote: The file system can keep a lot of these things around pretty easily but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. IIUC, he already uses a refcount table. Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided. The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table. Honestly I think that a journal is a good idea that we'll want to implement in the long run. There are people who aren't really happy about the dirty flag + fsck approach, and there are people who are concerned about cluster leaks without fsck. Both problems should be solved with a journal. Compared to other questions in the discussion, I think it's only a nice-to-have addition, though. Actually, I think that a refcount table is a requirement to provide the interesting properties that internal snapshots have (see my other mail). Well the trick here AFAICT is that you're basically storing external snapshots internally. So it's sort of like a bunch of FVD formats embedded into a single image. CQ, can you please clarify? From your description, Anthony seems to understand something completely different than I do. Are its characteristics more like qcow2's internal snapshots (which is what I understand) or more like external snapshots (which is what Anthony seems to understand)? Refcount tables aren't a very complex thing either. 
In fact, it makes a format much simpler to have one concept like refcount tables instead of adding another different mechanism for each new feature that would be natural with refcount tables. I think it's a reasonable design goal to minimize any metadata updates in the fast path. If we can write 1 piece of metadata versus writing 2, then it's worth exploring IMHO. The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost. Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table? Protected by a dirty flag (and/or a journal), sure. I mean, wasn't that the whole point of starting the qcow3 discussion? Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Mar 14, 2011 at 3:04 PM, Chunqiang Tang ct...@us.ibm.com wrote: The file system can keep a lot of these things around pretty easily but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. IIUC, he already uses a refcount table. Well, he needs a separate mechanism to make trim/discard work, but for the snapshot discussion, a reference count table is avoided. Kevin is right. FVD does have a refcount table. Sorry for causing confusion. I am going to send out a very detailed email which describes the operation steps in FVD, as Stefan requested. The bitmap only covers whether the guest has accessed a block or not. Then there is a separate table that maps guest offsets to offsets within the file. I haven't thought hard about it, but my guess is that there is an ordering constraint between these two pieces of metadata which is why the journal is necessary. I get worried about the complexity of a journal even more than a reference count table. No, the journal is not necessary. Actually, a very old version of FVD worked without journal. Journal was later introduced as a performance enhancement. I like the journal because it allows us to isolate metadata updates into one specific area that can be scanned on image recovery. If we take the QED approach with the dirty bit then we have to scan all L1/L2 tables. The journal makes recovery more efficient than a full consistency check. Stefan
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 15:49, schrieb Anthony Liguori: On 03/14/2011 09:21 AM, Kevin Wolf wrote: Am 14.03.2011 15:02, schrieb Anthony Liguori: On 03/14/2011 08:53 AM, Chunqiang Tang wrote: No, because the copy-on-write is another layer on top of the snapshot and AFAICT, they don't persist when moving between snapshots. The equivalent for external snapshots would be: base0- base1- base2- image And then if I wanted to move to base1 without destroying base2 and image, I could do: qemu-img create -f qcow2 -b base1 base1-overlay.img The file system can keep a lot of these things around pretty easily but with your proposal, it seems like there can only be one. If you support many of them, I think you'll degenerate to something as complex as a reference count table. On the other hand, I think it's reasonable to just avoid the CoW overlay entirely and say that moving to a previous snapshot destroys any of its children. I think this ends up being a simplifying assumption that is worth investigating further. No, both VMware and FVD have the same semantics as QCOW2. Moving to a previous snapshot does not destroy any of its children. In the example I gave (copied below), it goes from Image: s1-s2-s3-s4-(current-state) back to snapshot s2, and now the state is Image: s1-s2-s3-s4 |-(current-state) where all snapshots s1-s4 are kept. From there, it can take another snapshot s5, and then further go back to snapshot s4, ending up with Image: s1-s2-s3-s4 |-s5 | |- (current-state) Your use of current-state is confusing me because AFAICT, current-state is just semantically another snapshot. It's writable because it has no children. You only keep around one writable snapshot and to make another snapshot writable, you have to discard the former. This is not the semantics of qcow2. Every time you create a snapshot, it's essentially a new image. You can write directly to it. 
While we don't do this today and I don't think we ever should, it's entirely possible to have two disks served simultaneously out of the same qcow2 file using snapshots. No, CQ is describing the semantics of internal snapshots in qcow2 correctly. You have all the snapshots that are stored in the snapshot table (all read-only) plus one current state described by the image header (read-write). But is there any problem (in the format) with writing to the non-current state? I can't think of one. You would run into problems with the COW flag in the L2 tables. They are only an optimization, though, so you could probably avoid using them and directly look up the refcount table for each write, at the cost of performance. Anyway, I don't think there's a real use case for something like this. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/14/2011 10:03 AM, Kevin Wolf wrote: The only problem with them is that they are metadata that must be updated. However, I think we have discussed enough how to avoid the greatest part of that cost. Maybe I missed it, but in the WCE=0 mode, is it really possible to avoid the writes for the refcount table? Protected by a dirty flag (and/or a journal), sure. I mean, wasn't that the whole point of starting the qcow3 discussion? Okay, I thought you had something else in mind. Regards, Anthony Liguori Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
FVD's novel use of the reference count table reduces the metadata update overhead down to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. The reference count table is never updated during normal execution of a VM. Do you want to send out a break-down of the steps (and cost) involved in doing: 1. Snapshot creation. 2. Snapshot deletion. 3. Opening an image with n snapshots. Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. FVD’s refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the “writeable image front” as “writeable-front.” QCOW2’s refcount table counts clusters that are used by either read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. By contrast, FVD’s refcount table only counts chunks that are used by read-only snapshots, and does not count chunks used by writeable-front. This is the key that allows FVD to entirely avoid updating the refcount table on the fast path of normal VM execution. Below are the detailed steps for different operations. O1: Open an image with n snapshots. Let me introduce some basic concepts first. The storage allocation unit in FVD is called a chunk (like a cluster in QCOW2). 
The default chunk size is 1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks). An FVD image file is conceptually divided into chunks, where chunk 0 is the first 1MB of the image file, chunk 1 is the second 1MB, … chunk j, … and so forth. The size of an image file grows as needed, just like that of QCOW2. The refcount table is a linear array “uint16_t refcount[]”. If a chunk j is referenced by s different snapshots, then refcount[j] = s. If a new snapshot is created and this new snapshot also uses chunk j, then refcount[j] is incremented to refcount[j] = s+1. If all snapshots together use 1TB of storage space, there are 1TB/1MB = 1,000,000 chunks, and the size of the refcount table is 2MB. Loading the entire 2MB refcount table from disk into memory takes about 15 milliseconds. If the virtual disk size perceived by the VM is also 1TB, FVD’s one-level lookup table is 4MB. FVD’s one-level lookup table serves the same purpose as QCOW2’s two-level lookup table, but FVD’s one-level table is much smaller and is preallocated and hence contiguous in the image. Loading the entire 4MB lookup table from disk into memory takes about 20 milliseconds. These numbers mean that it is quite affordable to scan both tables in their entirety at VM boot time, although the scan can also be avoided in FVD. The optimizations will be described later. When opening an image with n snapshots, an unoptimized version of FVD performs the following steps: Step 1: Load the entire 2MB reference count table from disk into memory. This step takes about 15ms. Step 2: Load the entire 4MB lookup table from disk into memory. This step takes about 20ms. Step 3: Use the two tables to build an in-memory data structure called the “free-chunk-bitmap.” This step takes about 2ms. The free-chunk-bitmap identifies free chunks that are not used by either the snapshots or writeable-front, and hence can be allocated for future writes. The size of the free-chunk-bitmap is only 125KB for a 1TB disk, and hence the memory overhead is negligible. 
The free-chunk-bitmap also supports trim operations. The free-chunk-bitmap does not have to be persisted on disk as it can always be rebuilt easily, although as an optimization it can be persisted on disk on VM shutdown. Step 4: Compare the refcount table and the lookup table to identify chunks that are in both tables (i.e., shared), so that the running VM’s writes to those chunks in writeable-front trigger copy-on-write. This step takes about 2ms. One bit in each lookup table entry is stolen to mark whether a chunk in writeable-front is shared with snapshots and hence needs copy-on-write upon a write. The whole process above, i.e., opening an image with n (e.g., n=1000) snapshots, takes about 39ms and is a one-time cost at VM boot. Later, I will describe optimizations that can further reduce this 39ms by saving the 125KB free-chunk-bitmap to disk on VM shutdown, but that optimization is more than likely an over-engineering effort, given that 39ms
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 17:32, schrieb Chunqiang Tang: FVD's novel use of the reference count table reduces the metadata update overhead down to literally zero during normal execution of a VM. This gets the best of QCOW2's reference count table but without its overhead. In FVD, the reference count table is only updated when creating a new snapshot or deleting an existing snapshot. The reference count table is never updated during normal execution of a VM. Do you want to send out a break-down of the steps (and cost) involved in doing: 1. Snapshot creation. 2. Snapshot deletion. 3. Opening an image with n snapshots. Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. Does this mean that FVD can't hold VM state of arbitrary size? FVD’s refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the “writeable image front” as “writeable-front.” QCOW2’s refcount table counts clusters that are used by either read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. Needs to update, but not necessarily on the fast path. Updates can be delayed and batched. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. Does this mean that FVD can't hold VM state of arbitrary size? No, FVD can hold VM state of an arbitrary size. Unlike QCOW2, FVD does not store the index of the VM state as part of the one-level lookup table. FVD could have done so, and then relocated the one-level lookup table in order to grow it in size (growing FVD's lookup table through relocation is supported, e.g., in order to resize an image to a larger size), but that's not an ideal solution. Instead, in FVD, each snapshot has two fields, vm_state_offset and vm_state_space_size, which directly point to where the VM state is stored, and vm_state_space_size can be arbitrary. BTW, I observe uint32_t QEMUSnapshotInfo.vm_state_size. Does this mean that a VM state cannot be larger than 4GB? This seems to be a limitation of QEMU. FVD instead uses uint64_t vm_state_space_size in the image format, in case the size of QEMUSnapshotInfo.vm_state_size is increased in the future. FVD’s refcount table is almost identical to that of QCOW2, but with a key difference. An image consists of an arbitrary number of read-only snapshots, and a single writeable image front, which is the current image state perceived by the VM. Below, I will simply refer to the read-only snapshots as snapshots, and refer to the “writeable image front” as “writeable-front.” QCOW2’s refcount table counts clusters that are used by either read-only snapshots or writeable-front. Because writeable-front changes as the VM runs, QCOW2 needs to update the refcount table on the fast path of normal VM execution. Needs to update, but not necessarily on the fast path. Updates can be delayed and batched. 
Probably this has been discussed extensively before (as you mentioned in some previous emails), but I missed the discussion and still have a naive question. Is delaying and batching possible for wce=0, i.e., cache=writethrough? Regards, ChunQiang (CQ) Tang Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 14.03.2011 20:23, schrieb Chunqiang Tang: Here is a detailed description. Relevant to the discussion of snapshots, FVD uses a one-level lookup table and a refcount table. FVD’s one-level lookup table is very similar to QCOW2’s two-level lookup table, except that it is much smaller in FVD, and is preallocated and hence contiguous in the image. Does this mean that FVD can't hold VM state of arbitrary size? No, FVD can hold VM state of an arbitrary size. Unlike QCOW2, FVD does not store the index of the VM state as part of the one-level lookup table. FVD could have done so, and then relocated the one-level lookup table in order to grow it in size (growing FVD's lookup table through relocation is supported, e.g., in order to resize an image to a larger size), but that's not an ideal solution. Instead, in FVD, each snapshot has two fields, vm_state_offset and vm_state_space_size, which directly point to where the VM state is stored, and vm_state_space_size can be arbitrary. Okay, makes sense. BTW, I observe uint32_t QEMUSnapshotInfo.vm_state_size. Does this mean that a VM state cannot be larger than 4GB? This seems to be a limitation of QEMU. FVD instead uses uint64_t vm_state_space_size in the image format, in case the size of QEMUSnapshotInfo.vm_state_size is increased in the future. Yeah, that was a stupid decision, it definitely should be 64 bit. Needs to update, but not necessarily on the fast path. Updates can be delayed and batched. Probably this has been discussed extensively before (as you mentioned in some previous emails), but I missed the discussion and still have a naive question. Is delaying and batching possible for wce=0, i.e., cache=writethrough? It's possible with QED's approach: You set a dirty flag in the image header, and while this flag is set you don't have to care about consistent refcount tables. Only when you clear the flag, you must flush the refcount cache to the image file. 
If qemu crashes, you see the dirty flag and you know that you have an image with stale refcounts. In this case you must do a metadata scan to rebuild the refcount table from the L2 tables (or just replay the journal if you have one). Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 03/12/2011 11:51 PM, Chunqiang Tang wrote: In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to only track static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table when the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table. Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots so I can understand the argument for it. Regards, Anthony Liguori Regards, ChunQiang (CQ) Tang Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
In short, FVD's internal snapshot achieves the ideal properties of G1-G6, by 1) using the reference count table to only track static snapshots, 2) not keeping the reference count table in memory, 3) not updating the on-disk static reference count table when the VM runs, and 4) efficiently tracking dynamically allocated blocks by piggybacking on FVD's other features, i.e., its journal and small one-level lookup table. Are you assuming snapshots are read-only? It's not clear to me how this would work with writeable snapshots. It's not clear to me that writeable snapshots are really that important, but this is an advantage of having a refcount table. External snapshots are essentially read-only snapshots so I can understand the argument for it. By definition, a snapshot itself must be immutable (read-only), but a writeable image state can be derived from an immutable snapshot by using copy-on-write, which I guess is what you meant by writeable snapshot. Perhaps the following concrete use cases will make things clear. These use cases are supported by QCOW2, VMware, and FVD, regardless of the difference in their internal implementation. Suppose an image's initial state is: Image: (current-disk-state-observed-by-the-running-VM) Below, I simply refer to current-disk-state-observed-by-the-running-VM as current-state. The VM issues writes and continuously modifies the current-state. At one point in time, a snapshot s1 is taken, and the image becomes: Image: s1-(current-state) The VM issues more writes and subsequently takes three snapshots, s2, s3, and s4. Now the image becomes: Image: s1-s2-s3-s4-(current-state) Suppose the action goto snapshot s2 is taken, which does not affect the immutable snapshots s1-s4, but the current-state is abandoned and lost. Now the image becomes: Image: s1-s2-s3-s4 |-(current-state) (Note: depending on your email client, the two lines in the diagram may not be properly aligned). 
The new current-state is writeable and is derived from the immutable snapshot s2. When the VM issues a write, it does copy-on-write and stores dirty data in the current-state without modifying the original snapshot s2. Perhaps this is what you meant by writeable snapshot? The diagram above is at the conceptual level. In implementation, both QCOW2 and FVD store all snapshots s1-s4 and the current-state in one image file, and the snapshots and current-state may share data chunks. Suppose the VM issues some writes and subsequently takes two snapshots, s5 and s6. Now the image becomes: Image: s1-s2-s3-s4 |-s5-s6-(current-state) Suppose the action goto snapshot s2 is taken again. Now the image becomes: Image: s1-s2-s3-s4 |-s5-s6 |-(current-state) The new current-state is writeable and is derived from the immutable snapshot s2. Right after the goto action, the running VM sees the state of s2, instead of the state of s5 created after the first goto snapshot s2 action. Again, this is because a snapshot itself is immutable. Again, all the use cases are supported by QCOW2, VMware, and FVD, regardless of the difference in their internal implementation. Now let's come back to the discussion of FVD. Perhaps my description in the previous email is not clear. In the diagrams above, FVD's reference count table only tracks the snapshots (s1, s2, ...), but does not track the current-state. Instead, FVD's default mechanism (one-level lookup table, journal, etc.), which exists even before introducing snapshot, already tracks the current-state. Working together, FVD's reference count table and its default mechanism track all the states. In QCOW2, when a new cluster is allocated during handling a running VM's write request, it updates both the lookup table and the reference count table, which is unnecessary because their information is redundant. 
By contrast, in FVD, when a new chunk is allocated while handling a running VM's write request, FVD only updates the lookup table without updating the reference count table, because by design the reference count table does not track the current-state and this chunk allocation operation belongs to the current-state. This is the key reason why FVD gets all the functions of QCOW2's internal snapshots but without QCOW2's memory overhead of caching the reference count table and its disk I/O overhead of reading or writing the reference count table during normal execution of the VM. Regards, ChunQiang (CQ) Tang Homepage: http://www.research.ibm.com/people/c/ctang
Re: [Qemu-devel] Re: Strategic decision: COW format
It seems that there is great interest in QCOW2's internal snapshot feature. If we really want to do that, the right solution is to follow VMDK's approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the reference count table. VMDK’s approach can be easily implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. After the heated debate, I thought more about the right approach to implementing snapshots, and it became clear to me that there are major limitations with both VMDK's external snapshot approach (which stores each snapshot as a separate CoW file) and QCOW2's internal snapshot approach (which stores all snapshots in one file and uses a reference count table to keep track of them). I just posted to the mailing list a patch that implements internal snapshots in FVD but does so in a way without the limitations of VMDK and QCOW2. Let's first list the properties of an ideal virtual disk snapshot solution, and then discuss how to achieve them. G1: Do no harm (or avoid being a misfeature), i.e., the added snapshot code should not slow down the runtime performance of an image that has no snapshots. This implies that an image without snapshots should not cache the reference count table in memory and should not update the on-disk reference count table. G2: Even better, an image with 1 snapshot runs as fast as an image without snapshots. G3: Even better still, an image with 1,000 snapshots runs as fast as an image without snapshots. This basically means getting the snapshot feature for free. G4: An image with 1,000 snapshots consumes no more memory than an image without snapshots. This again means getting the snapshot feature for free. G5: Regardless of the number of existing snapshots, creating a new snapshot is fast, e.g., taking no more than 1 second. 
G6: Regardless of the number of existing snapshots, deleting a snapshot is fast, e.g., taking no more than 1 second. Now let's evaluate VMDK and QCOW2 against these ideal properties. G1: VMDK good; QCOW2 poor G2: VMDK ok; QCOW2 poor G3: VMDK very poor; QCOW2 poor G4: VMDK very poor; QCOW2 poor G5: VMDK good; QCOW2 good G6: VMDK poor; QCOW2 good The evaluation above assumes a straightforward VMDK implementation that, when handling a long chain of snapshots, s0-s1-s2- … -s1000, uses a chain of 1,000 VMDK driver instances to represent the chain of backing files. This is slow and consumes a lot of memory, but it is the behavior of QEMU's block device architecture today. Even if the QEMU architecture can be revised and the VMDK implementation is optimized to the extreme, a fundamental limitation of VMDK (by design instead of by implementation) is G6, i.e., deleting a snapshot X in the middle of a snapshot chain is slow (this is also what I observed with the VMware software). Because each snapshot is stored as a separate file, when a snapshot X is deleted, the data blocks of X that are still needed by its child Y must be physically copied from file X to file Y, which is slow, and the VM is halted during the copy operation. QCOW2's internal snapshot approach avoids this problem. Since all snapshots are stored in one file, when a snapshot is deleted, QCOW2 only needs to update its reference count table without physically moving data blocks. On the other hand, QCOW2's internal snapshots have two major limitations that hurt runtime performance: caching the reference count table in memory and updating the on-disk reference count table. If we can eliminate both, then it is an ideal solution. This is exactly what FVD's internal snapshot solution does. Below is the key observation on why FVD can do it so efficiently. In an internal snapshot implementation, the reference count table is used to track used blocks and free blocks. It serves no other purposes. 
In FVD, the static reference count table tracks only blocks used by (static) snapshots; it does not track blocks (dynamically) allocated on a write or freed on a trim for the running VM. This is a simple but fundamental difference w.r.t. QCOW2, whose reference count table tracks both the static content and the dynamic content. Because data blocks used by snapshots are static and do not change unless a snapshot is created or deleted, there is no need to update FVD's static reference count table while a VM runs, and in fact there is no need even to cache it in memory. Data blocks that are dynamically allocated or freed for a running VM are already tracked by FVD's one-level lookup table (which is similar to QCOW2's two-level table, but in FVD it is much smaller and faster) even before introducing the snapshot feature, and hence this comes for free. Updating FVD's one-level lookup table is efficient because of FVD's journal. When the VM boots, FVD scans the reference count table
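The static/dynamic split Tang describes can be sketched roughly as follows. This is a hypothetical Python illustration of the idea only, not FVD's actual code or on-disk layout; all names and structures are invented for the sketch:

```python
class InternalSnapshotImage:
    """Toy model of an image that splits snapshot metadata (static)
    from live-VM allocation metadata (dynamic), as FVD is said to do."""

    def __init__(self):
        self.static_refcounts = {}  # physical block -> refcount; snapshot data only
        self.lookup_table = {}      # virtual block -> physical block, for the live VM
        self.journal = []           # journaled lookup-table updates, flushed cheaply

    def guest_write(self, vblock, pblock):
        # Fast path: only the journaled lookup table changes. The static
        # reference count table is neither read nor written here, so it
        # need not even be cached in memory while the VM runs.
        self.lookup_table[vblock] = pblock
        self.journal.append(("map", vblock, pblock))

    def create_snapshot(self):
        # Slow path, runs rarely: blocks referenced by the live image
        # join the static snapshot set and get their refcounts bumped.
        for pblock in self.lookup_table.values():
            self.static_refcounts[pblock] = self.static_refcounts.get(pblock, 0) + 1
```

In this model the refcount table is touched only by create_snapshot (and a corresponding delete), never by guest_write, which is the property behind goals G1 through G4 above.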
RE: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc. you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. I've created the technology for replaying the instruction stream and all of the inputs. This technology is similar to deterministic replay in VMware. Now I need something to save machine state at many checkpoints to implement reverse debugging. I think COW2 may be useful for it (or I should create something like this). Pavel Dovgaluk
Re: [Qemu-devel] Re: Strategic decision: COW format
On Fri, Feb 25, 2011 at 11:20 AM, Pavel Dovgaluk pavel.dovga...@ispras.ru wrote: On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc. you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. I've created the technology for replaying the instruction stream and all of the inputs. This technology is similar to deterministic replay in VMware. Now I need something to save machine state at many checkpoints to implement reverse debugging. I think COW2 may be useful for it (or I should create something like this). Or the BTRFS_IOC_CLONE ioctl on the btrfs filesystem. You can copy-on-write clone a file using it. Stefan
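Stefan's suggestion can be tried from user space; since Linux 4.5 the same ioctl value as BTRFS_IOC_CLONE is exposed filesystem-independently as FICLONE. A minimal sketch, which only succeeds when both files live on a reflink-capable filesystem such as btrfs:

```python
import fcntl

# FICLONE == BTRFS_IOC_CLONE == _IOW(0x94, 9, int)
FICLONE = 0x40049409

def reflink_clone(src_path, dst_path):
    """Make dst_path a copy-on-write clone of src_path's data blocks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```

On a filesystem without reflink support the ioctl fails (typically with EOPNOTSUPP or EINVAL), so real code needs a plain-copy fallback.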
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. These are five use cases that I can enumerate without thinking a lot about it, there might be more. 
You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. I don't think what you're proposing is a satisfactory solution. In my book, a single data structure that can provide all of the features is better than a bunch of independent hacks that allow only half of it. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list), which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? That you only access it in the slow path (discard) isn't true, because you certainly want to reallocate freed clusters. Otherwise you could just leak them without maintaining a list of leaked clusters... The only use of a refcount of more than 1-bit is internal snapshots AFAICT. Of the currently implemented features, internal snapshots and compression. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Chunqiang Tang ct...@us.ibm.com writes: [...] Now let’s talk about features. It seems that there is great interest in QCOW2's internal snapshot feature. If we really want to do that, the right Great interest? Its use cases are demo, debugging, testing and such. Kind of useful for developers, but I wouldn't want to use it in anger. Nice to have if we can get it cheaply, but I'm not prepared to pay much for it in performance or complexity, and I doubt I'm the only one. Users always say yes when you ask them whether they need some feature. Hence, the question is useless. A better question to ask is how much are you willing to pay for it? solution is to follow VMDK’s approach of storing each snapshot as a separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather than using the reference count table. VMDK’s approach can be easily implemented for any COW format, or even as a function of the generic block layer, without complicating any COW format or hurting its performance. I know the snapshots are not really “internal” as stored in a single file but instead more like external snapshots, but users don’t care about that so long as they support the same use cases. Probably many people who use VMware don't even know that the snapshots are stored as separate files. Do they care? I certainly wouldn't.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. Compression perhaps not, but if you choose compression, then performance is not your top consideration. That's the case with filesystems that support compression as well. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Regards, Anthony Liguori Compression perhaps not, but if you choose compression, then performance is not your top consideration. That's the case with filesystems that support compression as well.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 03:13 AM, Kevin Wolf wrote: Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. I'm interested in both. An extra write is always going to be an extra write. The flush just makes it very painful. 
A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. No, as designed today, qcow2 still needs a dirty flag to avoid leaking blocks. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. No, I only label compression and internal snapshots as misfeatures. Encryption is a completely reasonable feature. So even with qcow3, what's the expectation of snapshots? Are we going to scale to images with over 1000 snapshots? I believe snapshot support in qcow2 is not a feature that has been designed with any serious thought. If we truly want to support internal snapshots, let's design it correctly. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? Performance. One thing we haven't tested with qcow2 is O_SYNC performance in the guest but my suspicion is that an O_SYNC workload is going to perform poorly even with cache=none. Starting with a simple format that we don't have to jump through tremendous hoops to get reasonable performance out of has a lot of virtues. Regards, Anthony Liguori
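The disagreement about the fast path can be made concrete with a toy model. This is my own sketch, not qcow2 or QED code: in a naive refcount design every allocating write dirties refcount metadata, while a free-list design touches its metadata only on discard and relies on a dirty flag plus an fsck-style scan after a crash.

```python
class RefcountFormat:
    """Toy model: allocation state lives in a persistent refcount table."""

    def __init__(self):
        self.refcounts = {}
        self.metadata_updates = 0

    def allocating_write(self, cluster):
        self.refcounts[cluster] = self.refcounts.get(cluster, 0) + 1
        self.metadata_updates += 1   # refcount update sits on the fast path


class FreeListFormat:
    """Toy model: QED-style dirty flag; free list touched only on discard."""

    def __init__(self):
        self.free_list = []
        self.dirty = False
        self.metadata_updates = 0

    def allocating_write(self, cluster):
        if not self.dirty:
            self.dirty = True        # written once, not once per allocation
            self.metadata_updates += 1
        # allocation itself is implied by file size; fsck rebuilds it on crash

    def discard(self, cluster):
        self.free_list.append(cluster)
        self.metadata_updates += 1   # metadata only on the slow path
```

After 1,000 allocating writes the refcount model has performed 1,000 metadata updates and the free-list model one. Kevin's counterpoint, in effect, is that those 1,000 updates hit the same cached refcount block and can be delayed and batched just like the dirty flag, so the on-disk cost converges.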
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long ago for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. What are the points where you think that performance of internal snapshots suffers? The argument that I would understand is that internal snapshots are probably not as handy in all scenarios. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 15:21, schrieb Anthony Liguori: On 02/23/2011 03:13 AM, Kevin Wolf wrote: Am 22.02.2011 19:18, schrieb Anthony Liguori: On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. I agree that they are separate... The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). ...but here you're arguing about writing metadata out in the fast path, so you're actually not interested in the amount of metadata but in the overhead of flushing it. Which is a problem that's solved. I'm interested in both. An extra write is always going to be an extra write. The flush just makes it very painful. One extra write of 64k every 2 GB. Hardly relevant. 
A refcount table is essential for internal snapshots and compression, it's useful for discard and for running on block devices, it's necessary for avoiding the dirty flag and fsck on startup. No, as designed today, qcow2 still needs a dirty flag to avoid leaking blocks. I know that this is your opinion and I do respect that, this is one of the reasons why there is the suggestion to add the dirty flag for you. On the other hand, it would be about time for you to accept that there are people who think differently about it and who don't want the same as you. This is why using the dirty flag should be optional. These are five use cases that I can enumerate without thinking a lot about it, there might be more. You propose using three different mechanisms for allowing normal allocations (use the file size), block devices (add a size field into the header) and discard (free list), and the other three features, for which you can't think of a hack, you declare misfeatures. No, I only label compression and internal snapshots as misfeatures. Encryption is a completely reasonable feature. I didn't even mention encryption. It's obvious that it's a reasonable feature and not a misfeature, because it fits relatively easily in your QED design. :-) The three features you don't like because they don't fit are compression, internal snapshots and not having to fsck (thanks for proving the latter above). So even with qcow3, what's the expectation of snapshots? Are we going to scale to images with over 1000 snapshots? I believe snapshot support in qcow2 is not a feature that has been designed with any serious thought. If we truly want to support internal snapshots, let's design it correctly. So what would be the key differences between your design and qcow2's? We can always check if there's room to improve. As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit.
There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. Okay, so even assuming that there's something like misfeatures that we can kick out (with which I strongly disagree), what's the crucial advantage of free lists that would make you switch the image format? Performance. One thing we haven't tested with qcow2 is O_SYNC performance in the guest but my suspicion is that an O_SYNC workload is going to perform poorly even with cache=none. But wasn't it you who wanted to use the dirty flag in any case? The refcounts aren't even written then. Starting with a simple format that we don't have to jump through tremendous hoops to get reasonable performance out of has a lot of virtues. I know that you don't mean it like I read this, but it's entirely true: You're _starting_ with a simple
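Kevin's earlier figure of "one extra write of 64k every 2 GB" follows directly from the qcow2 constants, assuming 64k clusters and 16-bit refcount entries:

```python
cluster_size = 64 * 1024                 # 64k clusters
refcount_entry_size = 2                  # qcow2 refcount entries are 16 bits

# A refcount block is itself one cluster, so it holds this many entries:
entries_per_refcount_block = cluster_size // refcount_entry_size  # 32768

# Each entry covers one cluster of virtual disk, so one refcount block
# describes this much virtual disk space:
covered_bytes = entries_per_refcount_block * cluster_size

# One fully used 64k refcount block covers 2 GB, i.e. delayed writeback
# costs a single 64k metadata write per 2 GB of newly allocated data.
assert covered_bytes == 2 * 1024 ** 3
```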
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 08:38 AM, Kevin Wolf wrote: Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long ago for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). I was really referring to snapshots. I have absolutely no doubt that encryption can be implemented with a reasonable performance overhead. existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized.
I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount > 1. I think you quickly end up reinventing btrfs though in the process. Regards, Anthony Liguori What are the points where you think that performance of internal snapshots suffers? The argument that I would understand is that internal snapshots are probably not as handy in all scenarios. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. IMHO dm-crypt isn't a generally usable alternative to native built-in encryption in qcow2. It isn't usable at all by non-root. If you want to use it with plain files, then you need to turn the file into a loopback device and then layer in dm-crypt. It is generally just a PITA to manage. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:23 AM, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:29 PM, Anthony Liguori wrote: existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Migrate-to-file with block live migration, or even better, something based on Kemari would be a lot faster. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount > 1. I think you quickly end up reinventing btrfs though in the process. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? -- error compiling committee.c: too many arguments to function
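Avi's "thousands of snapshots = terabytes" estimate is easy to reproduce; the guest size below is illustrative, not a number from the thread:

```python
guest_ram = 4 * 1024 ** 3     # e.g. a 4 GB guest (illustrative)
snapshots = 1000              # "thousands of checkpoints"

# With VM state saved per internal snapshot, the memory image is
# duplicated each time, so total snapshot state grows linearly:
total_bytes = guest_ram * snapshots
total_tib = total_bytes / 1024 ** 4   # roughly 3.9 TiB for this example
```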
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:33 PM, Daniel P. Berrange wrote: On Wed, Feb 23, 2011 at 05:23:33PM +0200, Avi Kivity wrote: On 02/23/2011 04:23 PM, Anthony Liguori wrote: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they are it today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it existed, dm-crypt isn't any more complicated, and it's used by default in most distributions these days. IMHO dm-crypt isn't a generally usable alternative to native built in encryption in qcow2. It isn't usable at all by non-root. If you want to use with plain files, then you need to turn the file into a loopback device and then layer in dm-crypt. It is generally just a PITA to manage. I wasn't suggesting dm-crypt is a replacement for qcow2 encryption, just that it shows that block-level encryption can be done with reasonable overhead. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:36 AM, Avi Kivity wrote: On 02/23/2011 05:29 PM, Anthony Liguori wrote: existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet user's expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing which would involve a large number of snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Fabrice wanted to use CoW as a mechanism to deduplicate the memory contents with the on-disk state specifically to address this problem. For the longest time, there was a comment in the savevm code along these lines. It might still be there. I think the lack of on-disk hashes was a critical missing bit to make this feature really work well. Migrate-to-file with block live migration, or even better, something based on Kemari would be a lot faster. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount > 1.
I think you quickly end up reinventing btrfs though in the process. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? Lots of things. The array will start to consume quite a bit of contiguous space as it gets larger, which means it needs to be relocated. Deleting a snapshot is a far more expensive operation than it needs to be. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:37 AM, Avi Kivity wrote: On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? Er, hit send too quickly. HPC is a big space where checkpointing is actually useful. An HPC workload may take weeks to run to completion. If something fails during the run, it's a huge waste of time. However, if you do regular checkpointing, a failure may only lose a few minutes of work instead of the entire weeks' worth of work. Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 09:37 AM, Avi Kivity wrote: On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. For disaster recovery, there are some workloads where you can meaningfully revert to a snapshot provided that snapshots are stored at some frequency (like once a second). Think of something like a webserver where the only accumulated data is logs. Losing some of the logs is better than losing all of the logs. Regards, Anthony Liguori
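The degrading checkpoint schedule described above (every 10ms for recent history, every 100ms after 1 second, and so on) can be sketched as a thinning policy. This is an illustrative model only, not code from qcow2 or any other format; only the 10ms/100ms tiers come from the mail, and the 1-second tier beyond 10 seconds is an assumed extension of the pattern:

```python
def checkpoints_to_keep(now_ms, taken_ms):
    """Thin a list of checkpoint timestamps so density degrades with age:
    one per 10ms within the last second, one per 100ms within the last
    10 seconds, one per second beyond that."""
    kept, seen = [], set()
    for t in sorted(taken_ms, reverse=True):  # newest checkpoint wins its bucket
        age = now_ms - t
        gran = 10 if age <= 1_000 else 100 if age <= 10_000 else 1_000
        key = (gran, t // gran)               # bucket within the tier
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return sorted(kept)
```

Even taking a checkpoint every 10ms, the retained set stays modest: 2,000 checkpoints accumulated over 20 seconds thin down to about 200 under this policy.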
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 16:29, schrieb Anthony Liguori: On 02/23/2011 08:38 AM, Kevin Wolf wrote: Am 23.02.2011 15:23, schrieb Anthony Liguori: On 02/23/2011 07:43 AM, Avi Kivity wrote: On 02/22/2011 10:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. And they already are today. Plus, encryption and snapshots can be implemented in a way that doesn't impact performance more than is reasonable. We're still missing the existence proof of this, but even assuming it Define reasonable. I sent you some numbers not too long ago for encryption, and I consider them reasonable (iirc, between 25% and 40% slower than without encryption). I was really referring to snapshots. I have absolutely no doubt that encryption can be implemented with a reasonable performance overhead. Alright. Last time you complained about things being too slow you were explicitly referring to encryption, so sometimes it's hard for me to follow you jumping from one topic to another. existed, what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Is there any hope for backing file chains of 1000 files or more? I haven't tried it out, but in theory I'd expect that internal snapshots could cope better with it than external ones because internal snapshots don't have to go through the whole chain all the time. I don't think there's a user expectation of backing file chains of 1000 files performing well. However, I've talked to a number of customers that have been interested in using internal snapshots for checkpointing, which would involve a large number of snapshots. 
So if there's no expectation that a chain of 1000 external snapshots works fine, why is it a requirement for internal snapshots? You might have a point if the external snapshots were actually not a chain, but a snapshot tree with lots of branches, but checkpointing means exactly creating a single chain. That said, while I haven't tried it out, I don't see any theoretical problems with using 1000 internal snapshots. In fact, Fabrice originally added qcow2 because he was interested in doing reverse debugging. The idea of internal snapshots was to store a high number of checkpoints to allow reverse debugging to be optimized. I think the way snapshot metadata is stored makes this not realistic since they're stored in more or less a linear array. I think to really support a high number of snapshots, you'd want to store a hash with each block that contained a refcount of 1. I think you quickly end up reinventing btrfs though in the process. I share Avi's problem here, I don't really understand what the problem with a linear list of snapshots is. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:47 PM, Anthony Liguori wrote: I don't see how that works, since the memory image is duplicated for each snapshot. So thousands of snapshots = terabytes of storage, and hours of creating the snapshots. Fabrice wanted to use CoW as a mechanism to deduplicate the memory contents with the on-disk state specifically to address this problem. For the longest time, there was a comment in the savevm code along these lines. It might still be there. I think the lack of on-disk hashes was a critical missing bit to make this feature really work well. So you have to use dirty logging to see which pages changed, otherwise you have to dedup all of them. Still I think migration/kemari is a better fit for this. Can you elaborate? What's the problem with a linear array of snapshots (say up to 10,000 snapshots)? Lots of things. The array will start to consume quite a bit of contiguous space as it gets larger, which means it needs to be relocated. If you double the space each time, it amortizes out. A snapshot seems to be around 40 bytes. So 10K snapshots = 400KB, hardly a huge amount (sans pointed-to data, which doesn't need to move). Deleting a snapshot is a far more expensive operation than it needs to be. Move the last snapshot into the deleted entry? -- error compiling committee.c: too many arguments to function
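Avi's two counterpoints — geometric growth amortizing relocation, and moving the last entry into the deleted slot — can be sketched with a toy model. The class and field names are illustrative, not qcow2 code; the ~40-byte entry size is the estimate from the mail:

```python
class SnapshotTable:
    """Toy model of a linear on-disk snapshot array (~40 bytes per entry).
    Doubling the capacity when full amortizes relocation cost to O(1) per
    append; deleting by moving the last entry into the freed slot avoids
    shifting the rest of the array."""
    ENTRY_BYTES = 40

    def __init__(self):
        self.entries = []      # each entry: (name, l1_table_offset)
        self.capacity = 4
        self.relocations = 0   # how often the on-disk array had to move

    def create(self, name, l1_offset):
        if len(self.entries) == self.capacity:
            self.capacity *= 2          # one relocation per doubling
            self.relocations += 1
        self.entries.append((name, l1_offset))

    def delete(self, index):
        self.entries[index] = self.entries[-1]  # swap-delete, O(1)
        self.entries.pop()

    def size_bytes(self):
        return len(self.entries) * self.ENTRY_BYTES
```

With these assumptions, 10,000 snapshots cost roughly 400 KB of table (matching Avi's figure) and only a dozen relocations over the image's whole history.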
Re: [Qemu-devel] Re: Strategic decision: COW format
On Wed, Feb 23, 2011 at 09:52:02AM -0600, Anthony Liguori wrote: On 02/23/2011 09:37 AM, Avi Kivity wrote: On 02/23/2011 05:31 PM, Anthony Liguori wrote: what about snapshots? Are we okay having a feature in a prominent format that isn't going to meet users' expectations? Is there any hope that an image with 1000, 1000, or 1 snapshots is going to have even reasonable performance in qcow2? Are thousands of snapshots for a single image a reasonable user expectation? What's the use case? Checkpointing. It was the original use-case that led to qcow2 being invented. I still don't see. What would you do with thousands of checkpoints? Er, hit send too quickly. HPC is a big space where checkpointing is actually useful. An HPC workload may take weeks to run to completion. If something fails during the run, it's a huge waste of time. However, if you do regular checkpointing, a failure may only lose a few minutes of work instead of the entire weeks' worth of work. HPC workloads mostly run on clusters nowadays. Getting a consistent distributed snapshot without messages in flight is not as simple as snapshotting a bunch of VMs at a random time. Anyway, in the HPC scenario you need only one (the last) snapshot, not thousands of them. -- Gleb.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:52 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? Er, hit send too quickly. HPC is a big space where checkpointing is actually useful. An HPC workload may take weeks to run to completion. If something fails during the run, it's a huge waste of time. However, if you do regular checkpointing, a failure may only lose a few minutes of work instead of the entire weeks' worth of work. The trick is to delete snapshot N-M after taking snapshot N (for a small constant M). -- error compiling committee.c: too many arguments to function
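Avi's "delete snapshot N-M after taking snapshot N" is a sliding-window retention policy; a minimal sketch (function names are illustrative, the callbacks stand in for whatever snapshot commands the management layer uses):

```python
from collections import deque

def checkpoint_loop(take_snapshot, delete_snapshot, rounds, window=3):
    """Keep only the last `window` checkpoints: after taking snapshot N,
    delete snapshot N - window (Avi's 'small constant M')."""
    live = deque()
    for n in range(rounds):
        take_snapshot(n)
        live.append(n)
        if len(live) > window:
            delete_snapshot(live.popleft())  # drop the oldest survivor
    return list(live)
```

The image then never holds more than `window` snapshots at once, so none of the thousands-of-snapshots scaling questions arise for this use case.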
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 10:03 AM, Avi Kivity wrote: On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. Yes, I'm well aware of this. I don't think all the pieces were ever really there to do this. Regards, Anthony Liguori For disaster recovery, there are some workloads where you can meaningfully revert to a snapshot provided that snapshots are stored at some frequency (like once a second). Think of something like a webserver where the only accumulated data is logs. Losing some of the logs is better than losing all of the logs. Are static webservers that interesting? For disaster recovery? Anything else will need Kemari.
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. For disaster recovery, there are some workloads where you can meaningfully revert to a snapshot provided that snapshots are stored at some frequency (like once a second). Think of something like a webserver where the only accumulated data is logs. Losing some of the logs is better than losing all of the logs. Are static webservers that interesting? For disaster recovery? Anything else will need Kemari. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 23.02.2011 17:04, schrieb Anthony Liguori: On 02/23/2011 10:03 AM, Avi Kivity wrote: On 02/23/2011 05:50 PM, Anthony Liguori wrote: I still don't see. What would you do with thousands of checkpoints? For reverse debugging, if you store checkpoints at a rate of, say, every 10ms, and then degrade to storing every 100ms after 1 second, etc., you'll have quite a large number of snapshots pretty quickly. The idea of snapshotting with reverse debugging is that instead of undoing every instruction, you can revert to the snapshot before, and then replay the instruction stream until you get to the desired point in time. You cannot replay the instruction stream since inputs (interrupts, rdtsc or other timers, I/O) will be different. You need Kemari for this. Yes, I'm well aware of this. I don't think all the pieces were ever really there to do this. So why exactly was this a requirement for internal snapshots to be considered usable in a reasonable way? ;-) Anyway, I actually think with internal snapshots you're better suited to implement something like this than with external snapshots. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Anthony Liguori anth...@codemonkey.ws writes: On 02/18/2011 03:57 AM, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this: duplication is bad, and every additional format creates new maintenance burden, especially if we're taking it seriously. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). 
Not sure what project you're following, but we've had an awful lot of formats before qcow2 :-) And qcow2 was never all that special, it just was dropped in the code base one day. You've put a lot of work into qcow2, but there are other folks that are contributing additional formats and that means more developers. The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. Of course, you could invent another format that implements the same features, but I think just carefully extending qcow2 has some real advantages. The first is that conversion of existing images would be really easy. Basically increment the version number in the header file and you're done. Structures would be compatible. qemu-img convert is a reasonable path for conversion. If you compare it to file systems, I rarely ever change the file system on a non-empty partition. Even if I wanted, it's usually just too painful. Except when I was able to use tune2fs -j to make ext3 out of ext2, that was really easy. We can provide the same for qcow2 to qcow3 conversion, but not with a completely new format. Also, while obsoleting a file format means that we need not put much effort in its maintenance, we still need to keep the code around for reading old images. With an extension of qcow2, it would be the same code that is used for both versions. Third, qcow2 already exists, is used in practice and we have put quite some effort into QA. At least initially confidence would be higher than in a completely new, yet untested format. 
Remember that with qcow3 I'm not talking about rewriting everything, it's a careful evolution, mostly with optional additions here and there. My requirements for a new format are as follows:
1) a documented, thought-out specification that is covered under an open license with a clear process for extension.
2) the ability to add both compatible and incompatible features in a graceful way
3) the ability to achieve performance that's close to raw. I want our new format to be able to be used universally both for servers and desktops.
I'd like to add 4) minimize complexity and maximize maintainability of the code. I'd gladly sacrifice nice-to-have features for that. I think qcow2 has some misfeatures like compression and internal snapshots. I think preserving those misfeatures is a mistake because I don't think we can satisfy the above while trying to preserve those
Re: [Qemu-devel] Re: Strategic decision: COW format
Aurelien Jarno aurel...@aurel32.net writes: [...] I agree that the best would be to have a single format, and it's probably a goal to have. That said, what is most important in my view is having one or two formats which together have _all_ the features (and here I consider speed as a feature) of the existing qcow2 format. QED or FVD have been designed with virtualization in a datacenter in mind, and are very good for this use. OTOH they don't support compression or snapshotting, which are quite useful for demo, debugging, testing, or even for occasionally running a Windows VM, in other words in situations where speed is not the priority. Speed not being a priority means the requirements are pretty radically different. Satisfying two radically different sets of requirements with the same format could be difficult. Great to have, but possibly difficult. If we can't find a tradeoff for that, we should go for two image formats instead of one. Less bad than a jack-of-all-trades.
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 09:37, schrieb Markus Armbruster: Anthony Liguori anth...@codemonkey.ws writes: [...] I'd like to add 4) minimize complexity and maximize maintainability of the code. I'd gladly sacrifice nice-to-have features for that. Especially if they are features that only other users use, right? What's the Sankt-Florians-Prinzip called in English? I think qcow2 has some
Re: [Qemu-devel] Re: Strategic decision: COW format
Kevin Wolf kw...@redhat.com writes: Am 22.02.2011 09:37, schrieb Markus Armbruster: [...] Especially if they are features that only other users use, right? What's the Sankt-Florians-Prinzip called in
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Regards, Anthony Liguori And they are it today. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. Kevin
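The dirty-flag approach Kevin refers to can be modeled abstractly: persist the flag once before the first allocating write, keep subsequent table updates in memory, and rebuild metadata at open time if the flag was left set by a crash. This is a minimal sketch of the protocol, not QED's actual on-disk logic, and all names are illustrative:

```python
class DirtyFlagImage:
    """Minimal model of a dirty-flag consistency protocol: metadata
    writes are deferred while the persisted dirty flag is set, and a
    crash is repaired by a scan at open time instead of per-write
    metadata flushes."""
    def __init__(self):
        self.on_disk_dirty = False   # the persisted header flag
        self.in_memory_l2 = {}       # deferred table updates
        self.on_disk_l2 = {}         # what has actually been flushed

    def allocating_write(self, virt_cluster, phys_cluster):
        if not self.on_disk_dirty:
            self.on_disk_dirty = True  # flushed once, before any data write
        self.in_memory_l2[virt_cluster] = phys_cluster  # no per-write flush

    def clean_shutdown(self):
        self.on_disk_l2.update(self.in_memory_l2)  # flush tables once
        self.on_disk_dirty = False                 # image consistent again

    def open_image(self):
        if self.on_disk_dirty:
            # unflushed updates were lost; a real format would rescan the
            # data clusters here to reconstruct the tables
            self.in_memory_l2 = dict(self.on_disk_l2)
            self.on_disk_dirty = False
```

The point of the contrast above is that this trick is format-neutral: a refcount-based format can defer its refcount writes behind the same flag, just as QED defers its table writes.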
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/22/2011 10:15 AM, Kevin Wolf wrote: Am 22.02.2011 16:57, schrieb Anthony Liguori: On 02/22/2011 02:56 AM, Kevin Wolf wrote: *sigh* It starts to get annoying, but if you really insist, I can repeat it once more: These features that you don't need (this is the correct description for what you call misfeatures) _are_ implemented in a way that they don't impact the normal case. Except that they require a refcount table that adds additional metadata that needs to be updated in the fast path. I consider that impacting the normal case. Like it or not, this requirement exists anyway, without any of your misfeatures. You chose to use the dirty flag in QED in order to avoid having to flush metadata too often, which is an approach that any other format, even one using refcounts, can take as well. It's a minor detail, but flushing and the amount of metadata are separate points. The dirty flag prevents metadata from being flushed to disk very often but the use of a refcount table adds additional metadata. A refcount table is definitely not required even if you claim the requirement exists for other features. I assume you mean to implement trim/discard support but instead of a refcount table, a free list would work just as well and would leave the metadata update out of the fast path (allocating writes) and instead only be in the slow path (trim/discard). As a format feature, a refcount table really only makes sense if the refcount is required to be greater than a single bit. There are more optimal data structures that can be used if the refcount of a block is fixed to 1-bit (like a free list) which is what the fundamental design difference between qcow2 and qed is. The only use of a refcount of more than 1-bit is internal snapshots AFAICT. Regards, Anthony Liguori Kevin
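The contrast Anthony draws can be made concrete with two toy allocators (illustrative only, not qcow2 or QED code): a free list encodes the 1-bit "allocated or not" state and is only touched on the slow path (discard), while a refcount table can express sharing (counts greater than 1, as internal snapshots need) but must be maintained on the allocating-write path:

```python
class FreeListAllocator:
    """1-bit state as a free list: alloc pops, discard pushes back.
    Only discard/trim (the slow path) grows the persisted free list;
    allocating writes need no refcount metadata update."""
    def __init__(self, n_clusters):
        self.free = list(range(n_clusters - 1, -1, -1))  # pop() yields 0 first
    def alloc(self):
        return self.free.pop()
    def discard(self, cluster):
        self.free.append(cluster)

class RefcountAllocator:
    """Refcount table: the same allocation, plus the ability to mark a
    cluster as shared by a snapshot (count > 1), which a free list
    cannot express."""
    def __init__(self, n_clusters):
        self.refcount = [0] * n_clusters
    def alloc(self):
        c = self.refcount.index(0)   # linear scan; fine for a toy model
        self.refcount[c] = 1         # metadata update on the allocation path
        return c
    def snapshot_ref(self, cluster):
        self.refcount[cluster] += 1  # cluster now shared with a snapshot
    def unref(self, cluster):
        self.refcount[cluster] -= 1
```

The design difference is exactly this `snapshot_ref` operation: if no cluster can ever be shared, the refcount column is redundant and the free list is the cheaper structure.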
Re: [Qemu-devel] Re: Strategic decision: COW format
In any case, the next step is to get down to specifics. Here is the page with the current QCOW3 roadmap: http://wiki.qemu.org/Qcow3_Roadmap Please raise concrete requirements or features so they can be discussed and captured.

Now it turns into a more productive discussion, but it seems to lose the big picture too quickly and has gone too narrowly into issues like the "dirty bit". Let's try to answer a bigger question: how to take a holistic approach to address all the factors that make a virtual disk slower than a physical disk? Even if issues like the "dirty bit" are addressed perfectly, they may still only be a small part of the total solution. The discussion of internal snapshots is at the end of this email.

Compared with a physical disk, a virtual disk (even RAW) incurs some or all of the following overheads. Obviously, the way to achieve high performance is to eliminate or reduce these overheads.

Overhead at the image level:
I1: Data fragmentation caused by an image format.
I2: Overhead in reading an image format's metadata from disk.
I3: Overhead in writing an image format's metadata to disk.
I4: Inefficiency and complexity in the block driver implementation, e.g., waiting synchronously for reading or writing metadata, submitting I/O requests sequentially when they should be done concurrently, performing a flush unnecessarily, etc.

Overhead at the host file system level:
H1: Data fragmentation caused by a host file system.
H2: Overhead in reading a host file system's metadata.
H3: Overhead in writing a host file system's metadata.

Existing image formats by design do not address many of these issues, which is the reason why FVD was invented ( http://wiki.qemu.org/Features/FVD ). Let's look at these issues one by one. 
Regarding I1: Data fragmentation caused by an image format: This problem exists in most image formats, as they insist on doing storage allocation for the second time at the image level (including QCOW2, QED, VMDK, VDI, VHD, etc.), even if the host file system already does storage allocation. These image formats unnecessarily mix the function of storage allocation with the function of copy-on-write, i.e., they determine whether a cluster is dirty by checking whether it has storage space allocated at the image level. This is wrong. Storage allocation and tracking dirty clusters are two separate functions. Data fragmentation at the image level can be totally avoided by using a RAW image plus a bitmap header to indicate whether clusters are dirty due to copy-on-write. FVD can be configured to take this approach, although it can also be configured to do storage allocation. Doing storage allocation at the image level can be optional, but should never be mandatory.

Regarding I2: Overhead in reading an image format’s metadata from disk: Obviously, the solution is to make the metadata small so that it can be cached entirely in memory. In this aspect, QCOW1/QCOW2/QED and VMDK-workstation-version are wrong, and VirtualBox VDI, Microsoft VHD, and VMDK-esx-server-version are right. With QCOW1/QCOW2/QED, for a 1TB virtual disk, the metadata size is at least 128MB. By contrast, with VDI, for a 1TB virtual disk, the metadata size is only 4MB. The “wrong formats” all use a two-level lookup table to do storage allocation at a small granularity (e.g., 64KB), whereas the “right formats” all use a one-level lookup table to do storage allocation at a large granularity (1MB or 2MB). The one-level table is easier to implement. Note that VMware VMDK started wrong in VMware’s workstation version, and then was corrected in the ESX server version, which is a good move.
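The 128MB-versus-4MB comparison can be reproduced with back-of-envelope arithmetic (assumed entry widths: 8-byte entries for the qcow2/QED-style two-level tables, 4-byte block pointers for a VDI-style one-level table):

```python
# Back-of-envelope metadata sizes for a 1 TB virtual disk, reproducing
# the figures quoted in the mail. The entry widths are assumptions:
# 8-byte entries for two-level qcow2/QED-style tables, 4-byte block
# pointers for a VDI-style one-level table.

TB = 1024 ** 4

def two_level_metadata(disk_size, cluster_size=64 * 1024, entry_size=8):
    # One table entry per cluster (ignoring the much smaller L1 table).
    return disk_size // cluster_size * entry_size

def one_level_metadata(disk_size, block_size=1024 * 1024, entry_size=4):
    # One entry per 1MB allocation block.
    return disk_size // block_size * entry_size

print(two_level_metadata(TB) // 2**20, "MB")  # 128 MB, qcow2/QED-style
print(one_level_metadata(TB) // 2**20, "MB")  # 4 MB, VDI-style
```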
As virtual disks grow bigger, it is likely that the storage allocation unit will be increased in the future, e.g., to 10MB or even larger. In existing image formats, one limitation of using a large storage allocation unit is that it forces copy-on-write to be performed on a large cluster (e.g., 10MB in the future), which is sort of wrong. FVD gets the best of both worlds. It uses a one-level table to perform storage allocation at a large granularity, but uses a bitmap to track copy-on-write at a smaller granularity. For a 1TB virtual disk, this approach needs only 6MB of metadata, slightly larger than VDI’s 4MB.

Regarding I3: Overhead in writing an image format’s metadata to disk: This is where the “dirty bit” discussion fits, but FVD goes way beyond that to reduce metadata updates. When an FVD image is fully optimized (e.g., the one-level lookup table is disabled and the base image is reduced to its minimum size), FVD has almost zero overhead in metadata updates and the data layout is just like a RAW image. More specifically, metadata updates are skipped, delayed, batched, or merged as much as possible without compromising data integrity. First, even with cache=writethrough (i.e., O_DSYNC), all metadata updates are sequential writes to FVD’s journal, which can be
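The 6MB figure quoted for FVD follows from the same kind of arithmetic, assuming a 1MB allocation chunk with 4-byte table entries plus a copy-on-write bitmap at 64KB granularity (these parameters are assumptions chosen to match the quoted numbers, not FVD's actual defaults):

```python
# Sketch of the FVD metadata claim: a one-level allocation table at
# coarse granularity plus a copy-on-write bitmap at fine granularity.
# The granularities and the 4-byte entry width are assumptions.

TB = 1024 ** 4

def fvd_metadata(disk_size, chunk=1024 * 1024, cow_unit=64 * 1024,
                 entry_size=4):
    table = disk_size // chunk * entry_size   # storage allocation table
    bitmap = disk_size // cow_unit // 8       # 1 bit per CoW block
    return table, bitmap

table, bitmap = fvd_metadata(TB)
print((table + bitmap) // 2**20, "MB")        # 6 MB for a 1 TB disk
```

So the bitmap buys 64KB copy-on-write tracking for only 2MB on top of the 4MB one-level table.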
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 20.02.2011 23:13, schrieb Aurelien Jarno: On Fri, Feb 18, 2011 at 10:57:05AM +0100, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it serious. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). 
The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2.

I agree that the best would be to have a single format, and it's probably a goal to have. That said, what is most important in my view is having one or two formats which together have _all_ the features (and here I consider speed as a feature) of the existing qcow2 format. QED or FVD have been designed with virtualization in a datacenter in mind, and are very good for this use. OTOH they don't support compression or snapshotting, which are quite useful for demo, debugging, testing, or even for occasionally running a Windows VM, in other words in situations where speed is not the priority. If we can't find a tradeoff for that, we should go for two instead of one image format.

I agree. Though that's purely theoretical because there's no reason why we shouldn't find a way to get both. ;-) In fact, the only area where qcow2 performs really badly in 0.14 is cache=writethrough (which unfortunately is the default...). With cache=none it's easy to find scenarios where it provides higher throughput than QED. Anyway, there's really only one crucial difference between QED and qcow2, which is that qcow2 ensures that metadata is consistent on disk at any time whereas QED relies on a dirty flag and rebuilds metadata after a crash (basically requiring an fsck). The obvious solution if you want to have this in qcow2, is adding a dirty flag there as well.
In my opinion, an additional flag certainly doesn't justify maintaining an additional format instead of extending the existing one. Likewise, I think FVD might provide some ideas that we can integrate as well, I just don't see a justification to include it as a separate format. Kevin
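The dirty-flag scheme that keeps coming up in this thread can be sketched as a small state machine (hypothetical Python; this illustrates the protocol under discussion, not QED's actual on-disk format):

```python
# Minimal sketch of the dirty-flag protocol: mark the image dirty
# before the first allocating write, clear it on a clean shutdown, and
# rebuild (fsck) metadata if the flag is still set when the image is
# reopened after a crash. All field names are hypothetical.

class Image:
    def __init__(self):
        self.dirty = False
        self.checked_on_open = False

    def open(self):
        if self.dirty:                 # previous session crashed
            self.fsck()                # rebuild e.g. the free list
        self.dirty = False

    def allocating_write(self):
        if not self.dirty:
            self.dirty = True          # one flag write + flush here,
                                       # then no metadata flushes until
                                       # shutdown or cache pressure
        # ... write data, update tables lazily ...

    def close(self):
        # flush all cached metadata, then clear the flag
        self.dirty = False

    def fsck(self):
        self.checked_on_open = True    # stands in for the rebuild scan
```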
Re: [Qemu-devel] Re: Strategic decision: COW format
On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf kw...@redhat.com wrote: In fact, the only area where qcow2 in performs really bad in 0.14 is cache=writethrough (which unfortunately is the default...). With cache=none it's easy to find scenarios where it provides higher throughput than QED. Yeah, I'm tempted to implement parallel allocating writes now so I can pick on qcow2 in all benchmarks again ;). Anyway, there's really only one crucial difference between QED and qcow2, which is that qcow2 ensures that metadata is consistent on disk at any time whereas QED relies on a dirty flag and rebuilds metadata after a crash (basically requiring an fsck). The obvious solution if you want to have this in qcow2, is adding a dirty flag there as well. Likewise, I think FVD might provide some ideas that we can integrate as well, I just don't see a justification to include it as a separate format. You think that QED and FVD can be integrated into a QCOW2-based format. I agree it's possible and has some value. It isn't pretty and I would prefer to work on a clean new format because that, too, has value. In any case, the next step is to get down to specifics. Here is the page with the current QCOW3 roadmap: http://wiki.qemu.org/Qcow3_Roadmap Please raise concrete requirements or features so they can be discussed and captured. For example, journalling is an alternative to the dirty bit approach. If you feel that journalling is the best technique to address consistent updates, then make your case outside the context of today's qcow2, QED, and FVD implementations (although benchmark data will rely on current implementations). Explain how the technique would fit into QCOW3 and what format changes need to be made. I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD. Stefan
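The journalling alternative Stefan mentions can be sketched in a few lines (a toy model, not a proposal for any actual QCOW3 layout): metadata updates are appended sequentially, and the main tables can be rebuilt by replaying the journal after a crash:

```python
# Toy sketch of journalled metadata updates: every update goes to a
# sequential journal first, the main table is maintained in memory and
# written back lazily, and a crash is recovered by replaying the
# journal. Entirely hypothetical; key names are invented.

class Journal:
    def __init__(self):
        self.records = []                  # sequential appends only
        self.table = {}                    # in-memory metadata

    def update(self, key, value):
        self.records.append((key, value))  # one cheap sequential write
        self.table[key] = value            # main table updated lazily

    def replay(self):
        # Crash recovery: rebuild the table from the journal. Later
        # records for the same key win, as on a real replay.
        table = {}
        for key, value in self.records:
            table[key] = value
        return table
```

The appeal over a dirty flag is that recovery cost is bounded by the journal length rather than by a full-image metadata scan.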
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 21.02.2011 14:44, schrieb Stefan Hajnoczi: On Mon, Feb 21, 2011 at 8:59 AM, Kevin Wolf kw...@redhat.com wrote: In fact, the only area where qcow2 in performs really bad in 0.14 is cache=writethrough (which unfortunately is the default...). With cache=none it's easy to find scenarios where it provides higher throughput than QED. Yeah, I'm tempted to implement parallel allocating writes now so I can pick on qcow2 in all benchmarks again ;). Heh. ;-) In the end it just shows that the differences are mainly in the implementation, not in the format. Anyway, there's really only one crucial difference between QED and qcow2, which is that qcow2 ensures that metadata is consistent on disk at any time whereas QED relies on a dirty flag and rebuilds metadata after a crash (basically requiring an fsck). The obvious solution if you want to have this in qcow2, is adding a dirty flag there as well. Likewise, I think FVD might provide some ideas that we can integrate as well, I just don't see a justification to include it as a separate format. You think that QED and FVD can be integrated into a QCOW2-based format. I agree it's possible and has some value. It isn't pretty and I would prefer to work on a clean new format because that, too, has value. In any case, the next step is to get down to specifics. Here is the page with the current QCOW3 roadmap: http://wiki.qemu.org/Qcow3_Roadmap Please raise concrete requirements or features so they can be discussed and captured. For example, journalling is an alternative to the dirty bit approach. If you feel that journalling is the best technique to address consistent updates, then make your case outside the context of today's qcow2, QED, and FVD implementations (although benchmark data will rely on current implementations). Explain how the technique would fit into QCOW3 and what format changes need to be made. I think journalling is an interesting option, but I'm not sure if we should target it for 0.15. 
As you know, there's already more than enough stuff to do until then, with coroutines etc. The dirty flag thing would be way easier to implement. We can always add a journal as a compatible feature in 0.16. To be honest, I'm not even sure any more that the dirty flag is that important. Originally we have been talking about cache=none and it definitely makes a big difference there because we save flushes. However, we're talking about cache=writethrough now and you flush on any write. It might be more important to make things parallel for writethrough. Maybe not writing out refcounts is something we should measure before we start implementing anything. (It's easy to disable all writes for a benchmark, even if the image will be broken afterwards) I think this is the level we need to discuss at rather than qcow2 vs QED vs FVD. Definitely more productive, yes. Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/21/2011 08:10 AM, Kevin Wolf wrote: [...] To be honest, I'm not even sure any more that the dirty flag is that important. Originally we have been talking about cache=none and it definitely makes a big difference there because we save flushes. However, we're talking about cache=writethrough now and you flush on any write. It might be more important to make things parallel for writethrough.

One thing I wonder about is whether we really need to have cache=X and wce=X. I never really minded the fact that cache=none advertised wce=on because we behaved effectively as if wce=on. But now that qcow2 triggers on wce=on, I'm a bit concerned that we're introducing a subtle degradation that most people won't realize. Ignoring some of the problems with O_DIRECT, semantically, I think there's a strong use-case for cache=none, wce=off.

Regards, Anthony Liguori
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 21.02.2011 16:16, schrieb Anthony Liguori: [...] One thing I wonder about is whether we really need to have cache=X and wce=X. I never really minded the fact that cache=none advertised wce=on because we behaved effectively as if wce=on. But now that qcow2 triggers on wce=on, I'm a bit concerned that we're introducing a subtle degradation that most people won't realize. Ignoring some of the problems with O_DIRECT, semantically, I think there's a strong use-case for cache=none, wce=off.

Fully agree, there's no real reason for having three writeback modes, but only one writethrough mode. It should be completely symmetrical. I think Christoph has mentioned several times that he has some patches for this. What's the status of them, Christoph? Kevin
Re: [Qemu-devel] Re: Strategic decision: COW format
On Fri, Feb 18, 2011 at 10:57:05AM +0100, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it serious. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. 
We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2.

I agree that the best would be to have a single format, and it's probably a goal to have. That said, what is most important in my view is having one or two formats which together have _all_ the features (and here I consider speed as a feature) of the existing qcow2 format. QED or FVD have been designed with virtualization in a datacenter in mind, and are very good for this use. OTOH they don't support compression or snapshotting, which are quite useful for demo, debugging, testing, or even for occasionally running a Windows VM, in other words in situations where speed is not the priority. If we can't find a tradeoff for that, we should go for two instead of one image format.

-- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] Re: Strategic decision: COW format
On Fri, Feb 18, 2011 at 7:11 PM, Kevin Wolf kw...@redhat.com wrote: Am 18.02.2011 18:43, schrieb Stefan Weil: Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent.

Well, there are different types of maintenance. It's not much work to just drop the code into qemu and let it bitrot. This is what happens to the funky formats like bochs or dmg. They are usually patched enough so that they still build, but nobody tries if they actually work. Then there are formats in which there is at least some interest, like vmdk or vdi. Occasionally they get some fixes, they are probably fine for image conversion, but I wouldn't really trust them for production use. And then there's raw and qcow2, which are used by a lot of people for running VMs, that are actively maintained, get a decent level of review and fixes etc. Getting a format into this group really takes a lot of work. Taking something like FVD would only make sense if we are willing to do that work - I mean, really nobody wants to convert from/to a file format that isn't implemented anywhere else.

This is a good thing to agree on so I want to reiterate: There are two types of image formats in QEMU today.
1. Native formats that are maintained and suitable for running VMs. This includes raw, qcow2, and qed.
2. Convert-only formats that may not be maintained and are not suitable for running VMs. All other formats in qemu.git.
The convert-only formats have synchronous implementations which makes it a bad idea to run VMs with them. They don't fit into QEMU's event-driven architecture and will cause poor performance and possible hangs. I hope folks agree on this. The next step is to consider that native support requires at least an order of magnitude more work and code.
It would be wise to focus on a flagship format in order to share that effort. So I think this thread is a useful discussion to have even if no one can be forced to collaborate on just one format. Kevin's position seems to be that an evolution of qcow2 is best for code maintenance and reuse. The position that QED and FVD have taken is to start from a clean slate in order to make incompatible changes and leave out problematic features. I think we can get there eventually with either approach but we'll be introducing incompatible changes either way. In terms of code reuse, it's initially nice to share code with qcow2 but in the long run the two formats might diverge far enough that it becomes a liability due to extra complexity. For reference, here is the QCOW3 roadmap wiki page: http://wiki.qemu.org/Qcow3_Roadmap Here is the QED outstanding work page: http://wiki.qemu.org/Features/QED/OutstandingWork Does FVD have a roadmap or future features? Stefan
[Qemu-devel] Re: Strategic decision: COW format
Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD.

Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3.

Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away.

Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it seriously. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2.
It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. Of course, you could invent another format that implements the same features, but I think just carefully extending qcow2 has some real advantages. The first is that conversion of existing images would be really easy. Basically increment the version number in the header file and you're done. Structures would be compatible. If you compare it to file systems, I rarely ever change the file system on a non-empty partition. Even if I wanted, it's usually just too painful. Except when I was able to use tune2fs -j to make ext3 out of ext2, that was really easy. We can provide the same for qcow2 to qcow3 conversion, but not with a completely new format. Also, while obsoleting a file format means that we need not put much effort in its maintenance, we still need to keep the code around for reading old images. With an extension of qcow2, it would be the same code that is used for both versions. Third, qcow2 already exists, is used in practice and we have put quite some effort into QA. At least initially confidence would be higher than in a completely new, yet untested format. Remember that with qcow3 I'm not talking about rewriting everything, it's a careful evolution, mostly with optional additions here and there. Kevin
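Kevin's "increment the version number and you're done" conversion really is that small, assuming the qcow2 header layout of a 4-byte magic ("QFI\xfb") followed by a big-endian 32-bit version field (a real converter would of course also validate and extend the remaining header fields):

```python
# Sketch of an in-place qcow2 -> qcow3 version bump, assuming the
# qcow2 header starts with a 4-byte magic and a big-endian 32-bit
# version. A real converter would do far more validation than this.

import struct

QCOW_MAGIC = b"QFI\xfb"

def bump_version(header: bytearray, new_version: int = 3) -> bytearray:
    magic, version = struct.unpack_from(">4sI", header, 0)
    if magic != QCOW_MAGIC:
        raise ValueError("not a qcow image")
    if version >= new_version:
        raise ValueError("image is already version %d" % version)
    struct.pack_into(">I", header, 4, new_version)  # rewrite in place
    return header

# A fake version-2 header: magic, version, padding for other fields.
hdr = bytearray(QCOW_MAGIC + struct.pack(">I", 2) + bytes(64))
assert struct.unpack_from(">I", bump_version(hdr), 4)[0] == 3
```

Contrast this with qemu-img convert, which rewrites every data cluster; the structural compatibility is exactly what makes the tune2fs -j analogy apt.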
Re: [Qemu-devel] Re: Strategic decision: COW format
On 02/18/2011 03:57 AM, Kevin Wolf wrote: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolfkw...@redhat.com writes: Am 15.02.2011 20:45, schrieb Chunqiang Tang: Chunqiang Tang/Watson/IBM wrote on 01/28/2011 05:13:27 PM: As you requested, I set up a wiki page for FVD at http://wiki.qemu.org/Features/FVD . It includes a summary of FVD, a detailed specification of FVD, and a comparison of the design and performance of FVD and QED. See the figure at http://wiki.qemu.org/Features/FVD/Compare . This figure shows that the file creation throughput of NetApp's PostMark benchmark under FVD is 74.9% to 215% higher than that under QED. Hi Anthony, Please let me know if more information is needed. I would appreciate your feedback and advice on the best way to proceed with FVD. Yet another file format with yet another implementation is definitely not what we need. We should probably take some of the ideas in FVD and consider them for qcow3. Got an assumption there: that the one COW format we need must be qcow3, i.e. an evolution of qcow2. Needs to be justified. If that discussion has happened on the list already, I missed it. If not, it's overdue, and then we better start it right away. Right. I probably wasn't very clear about what I mean with qcow3 either, so let me try to summarize my reasoning. The first point is an assumption that you made, too: That we want to have only one format. I hope it's easy to agree on this, duplication is bad and every additional format creates new maintenance burden, especially if we're taking it serious. Until now, there were exactly two formats for which we managed to do this, raw and qcow2. raw is more or less for free, so with the introduction of another format, we basically double the supported block driver code overnight (while not doubling the number of developers). 
Not sure what project you're following, but we've had an awful lot of formats before qcow2 :-) And qcow2 was never all that special, it just was dropped in the code base one day. You've put a lot of work into qcow2, but there are other folks that are contributing additional formats and that means more developers. The consequence of having only one file format is that it must be able to obsolete the existing ones, most notably qcow2. We can only neglect qcow1 today because we can tell users to use qcow2. It supports everything that qcow1 supports and more. We couldn't have done this if qcow2 lacked features compared to qcow1. So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. Of course, you could invent another format that implements the same features, but I think just carefully extending qcow2 has some real advantages. The first is that conversion of existing images would be really easy. Basically increment the version number in the header file and you're done. Structures would be compatible. qemu-img convert is a reasonable path for conversion. If you compare it to file systems, I rarely ever change the file system on a non-empty partition. Even if I wanted, it's usually just too painful. Except when I was able to use tune2fs -j to make ext3 out of ext2, that was really easy. We can provide the same for qcow2 to qcow3 conversion, but not with a completely new format. Also, while obsoleting a file format means that we need not put much effort in its maintenance, we still need to keep the code around for reading old images. With an extension of qcow2, it would be the same code that is used for both versions. Third, qcow2 already exists, is used in practice and we have put quite some effort into QA. At least initially confidence would be higher than in a completely new, yet untested format. 
Remember that with qcow3 I'm not talking about rewriting everything; it's a careful evolution, mostly with optional additions here and there. My requirements for a new format are as follows: 1) a documented, thought-out specification that is covered under an open license with a clear process for extension. 2) the ability to add both compatible and incompatible features in a graceful way. 3) the ability to achieve performance that's close to raw. I want our new format to be usable universally, both for servers and desktops. I think qcow2 has some misfeatures like compression and internal snapshots. I think preserving those misfeatures is a mistake because I don't think we can satisfy the above while trying to preserve those features. If the image format degrades when those features are enabled, then it decreases confidence in the format. I think QED satisfies all of these today. Regards, Anthony Liguori
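Requirement 2) is usually met with feature bitmaps, which is how QED handles it (separate incompatible, compatible, and autoclear feature fields). The sketch below uses hypothetical mask values to show the gating rule: unknown incompatible bits must make the open fail, while unknown compatible bits may simply be ignored:

```python
# Hypothetical feature masks for illustration (the values are made up,
# but the split mirrors QED's incompatible/compatible feature fields).
SUPPORTED_INCOMPATIBLE = 0x1  # e.g. a dirty-flag feature the reader must honor
SUPPORTED_COMPATIBLE = 0x3    # features that are safe for old readers to ignore


def check_features(incompatible, compatible):
    """Return True if an image with these feature bits can be opened.

    Unknown incompatible bits force a hard failure; unknown compatible
    bits are ignored, which is what makes adding them "graceful".
    """
    unknown = incompatible & ~SUPPORTED_INCOMPATIBLE
    if unknown:
        raise IOError("image uses unsupported features: %#x" % unknown)
    return True
```

An old reader thus keeps working with images that only add compatible features, and fails safely instead of corrupting images that use features it does not understand.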
Re: [Qemu-devel] Re: Strategic decision: COW format
Am 18.02.2011 10:57, schrieb Kevin Wolf: Am 18.02.2011 10:12, schrieb Markus Armbruster: Kevin Wolf kw...@redhat.com writes: [...] So the one really essential requirement that I see is that we provide a way forward for _all_ users by maintaining all of qcow2's features. This is the only way of getting people to not stay with qcow2. The support of several different file formats is one of the strong points of QEMU, at least in my opinion. Reducing this to offline conversion would be a bad idea because it costs too much time and disk space for quick tests (for production environments, this might be totally different).
Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent. A new file format like FVD would be a challenge for the existing ones. Declare its support as unsupported or experimental, but let users decide which one is best suited to their needs! Maybe adding a staging tree (like for the Linux kernel) for experimental drivers, devices, file formats, tcg targets and so on would make it easier to add new code and reduce the need for QEMU forks. I'd very much appreciate this or any other solution that allows it! Regards, Stefan
Am 18.02.2011 18:43, schrieb Stefan Weil: Am 18.02.2011 10:57, schrieb Kevin Wolf: Am 18.02.2011 10:12, schrieb Markus Armbruster: [...] The support of several different file formats is one of the strong points of QEMU, at least in my opinion. I totally agree. qemu-img is known as a Swiss army knife for disk images and this is definitely a strength.
However, it's not useful just because it supports a high number of formats, but because these formats are in active use. Most of them are the native formats of some other software. I think things look a bit different when we're talking about qemu-specific formats. qcow1 isn't in use any more because nobody needs it for compatibility with other software, and for use with qemu, there is qcow2. Yet the qcow1 driver is still around and bitrots. Reducing this to offline conversion would be a bad idea because it costs too much time and disk space for quick tests (for production environments, this might be totally different). Either I'm misunderstanding what you're trying to say here, or you misunderstood what I said. I agree that we don't want to have to do qemu-img convert (i.e. a full copy) in order to upgrade. This is one of the reasons why I think we should have a qcow3 which can be upgraded basically by increasing the version number in the header (look at it as an incompatible feature flag, if you want) instead of starting something completely new. Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent. Well, there are different types of maintenance. It's not much work to just drop the code into qemu and let it bitrot. This is what happens to the funky formats like bochs or dmg. They are usually patched enough so that they still build, but nobody checks whether they actually work. Then there are formats in which there is at least some interest, like vmdk or vdi. Occasionally they get some fixes, they are probably fine for image conversion, but I wouldn't really trust them for production use. And then there's raw and qcow2, which are used by a lot of people for running VMs, that are actively maintained, get a decent level of review and fixes etc.
Getting a format into this group really takes a lot of work. Taking something like FVD would only make sense if we are willing to do that work - I mean, really nobody wants to convert from/to a file format that isn't implemented anywhere else. A new file format like FVD would be a challenge for the existing ones. Declare its support as unsupported or experimental, but let users decide which one is best suited to their needs! Basically this is what we did for QED. In hindsight I consider it a mistake because it set a bad precedent of inventing something new instead of fixing what's there. I really don't want to convert all my images each time to take advantage of a new qemu version. Kevin
On 02/18/2011 01:11 PM, Kevin Wolf wrote: [...] Basically this is what we did for QED. In hindsight I consider it a mistake because it set a bad precedent of inventing something new instead of fixing what's there. I don't see how qcow3 is fixing something that's there since it's still an incompatible format. It'd be a stronger argument if you were suggesting something that was still fully compatible with qcow2, but once compatibility is broken, it's broken. Regards, Anthony Liguori
On 02/18/2011 11:43 AM, Stefan Weil wrote: Is maintaining an additional file format really so much work? I have only some personal experience with vdi.c, and there maintenance was largely caused by interface changes and done by Kevin. Hopefully interfaces will stabilize, so changes will become less frequent. A new file format like FVD would be a challenge for the existing ones. FVD isn't merged because it's gotten almost no review. If it turns out that it is identical to an existing format and the existing format just has a crappy implementation, it wouldn't be merged; fixing the existing format would be preferred. But if it has a compelling advantage for a reasonable use-case, it will be merged. I don't know where this whole discussion of strategic formats for QEMU came from, but that's never been the way the project has operated. Regards, Anthony Liguori
Am 18.02.2011 20:47, schrieb Anthony Liguori: On 02/18/2011 01:11 PM, Kevin Wolf wrote: [...] Basically this is what we did for QED. In hindsight I consider it a mistake because it set a bad precedent of inventing something new instead of fixing what's there. I don't see how qcow3 is fixing something that's there since it's still an incompatible format. It'd be a stronger argument if you were suggesting something that was still fully compatible with qcow2, but once compatibility is broken, it's broken. It's really more like adding an incompatible feature flag in QED. You still have one implementation for old and new images instead of splitting up development efforts, you still have all of the features and so on. It's a completely different story than QED. Kevin
On 02/18/2011 02:49 PM, Kevin Wolf wrote: [...] It's really more like adding an incompatible feature flag in QED. You still have one implementation for old and new images instead of splitting up development efforts, you still have all of the features and so on. In theory. Since an implementation doesn't exist, we have no idea how much code is actually going to be shared at the end of the day. I suspect that, especially if you drop the ref table updates, there won't be an awful lot of common code in the two paths. Regards, Anthony Liguori
Am 18.02.2011 21:50, schrieb Anthony Liguori: On 02/18/2011 02:49 PM, Kevin Wolf wrote: [...] In theory. Since an implementation doesn't exist, we have no idea how much code is actually going to be shared at the end of the day. I suspect that, especially if you drop the ref table updates, there won't be an awful lot of common code in the two paths. Allowing refcounts to be inconsistent, protected by a dirty flag, is only an option, and you should only take it if you absolutely need it (i.e. your guest is broken and requires cache=writethrough, but you desperately need performance). My preferred way of implementing it is telling the refcount cache that it should ignore flushes and write its data back only when another refcount block must be loaded into the cache (which happens rarely enough that it doesn't really hurt performance). This makes the difference from the existing code more or less one if statement that returns early. Kevin
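Kevin's preferred policy can be sketched as a one-block cache whose flush is a no-op and whose dirty data is written back only when a different refcount block has to be loaded. The class and field names below are hypothetical (this is not qcow2's actual cache code), but the early-return flush is exactly the "one if statement" he describes:

```python
class RefcountBlockCache:
    """Sketch: delay refcount writeback until another block is needed."""

    def __init__(self, backend):
        self.backend = backend  # stand-in for the image file: index -> block
        self.index = None       # which refcount block is currently cached
        self.block = None
        self.dirty = False

    def flush(self):
        # The crucial difference from write-through behavior: return early
        # instead of writing the dirty block out on every guest flush.
        return

    def load(self, index):
        """Load a refcount block, writing back the old one if dirty."""
        if index == self.index:
            return self.block
        if self.dirty:
            # Writeback happens here, and only here - rarely, since one
            # refcount block covers gigabytes of virtual disk space.
            self.backend[self.index] = dict(self.block)
        self.index = index
        self.block = dict(self.backend.get(index, {}))
        self.dirty = False
        return self.block

    def set_refcount(self, index, offset, value):
        self.load(index)[offset] = value
        self.dirty = True
```

With this policy, repeated refcount updates within the same block cost no extra I/O at all; the metadata write is paid once per cache eviction instead of once per flush.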