Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-15 Thread Benoît Canet
  Kevin, what do you think of this?
  I could strip down the dedupe journal code to specialize it.
 
 If you think it turns out easier than using the journalling
 infrastructure that we're going to implement anyway, then why not.

This question needs some thought.
The good thing about the current dedup log is that the code is already written
and matches the dedup usage very well.

Aside from my personal interest in resuming the work on deduplication quickly, I think
we should consider the overall cost of having a specialized log for dedupe.

Is the reviewing and maintenance of http://patchwork.ozlabs.org/patch/252955/
a reasonable extra cost?

Best regards

Benoît



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-06 Thread Kevin Wolf
On 05.09.2013 at 17:26, Benoît Canet wrote:
 On Thursday 05 Sep 2013 at 11:24:40 (+0200), Stefan Hajnoczi wrote:
  On Wed, Sep 04, 2013 at 11:55:23AM +0200, Benoît Canet wrote:
 I'm not sure if multiple journals will work in practice.  Doesn't this
 re-introduce the need to order update steps and flush between them?

This is a question for Benoît, who made this requirement. I asked him
the same a while ago and apparently his explanation made some sense to
me, or I would have remembered that I don't want it. ;-)
   
   The reason behind the multiple journal requirement is that if a block gets
   created and deleted in a cyclic way, it can generate cyclic insertion/deletion
   journal entries.
   The journal could easily be filled if this pathological corner case happens.
   When it happens, the dedup code repacks the journal by writing only the
   non-redundant information into a new journal and then uses the new one.
   It would not be easy to do so if non-dedup journal entries were present in the
   journal, hence the multiple journal requirement.

   The deduplication also needs two journals, because when the first one is
   frozen it takes some time to write the hash table to disk, and new entries
   must still be stored somewhere in the meantime. The code cannot block.
   
It might have something to do with the fact that deduplication uses the
journal more as a kind of cache for hash values that can be dropped and
rebuilt after a crash.
   
   For dedupe, the journal is more of a resume-after-exit tool.
  
  I'm not sure anymore if dedupe needs the same kind of journal as a
  metadata journal for qcow2.
  
  Since you have a dirty flag to discard the journal on crash, the
  journal is not used for data integrity.
  
  That makes me wonder if the metadata journal is the right structure for
  dedupe?  Maybe your original proposal was fine for dedupe and we just
  misinterpreted it because we thought this needs to be a safe journal.
 
 Kevin, what do you think of this?
 I could strip down the dedupe journal code to specialize it.

If you think it turns out easier than using the journalling
infrastructure that we're going to implement anyway, then why not.

Kevin



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-06 Thread Fam Zheng
On Wed, 09/04 11:39, Kevin Wolf wrote:
 First of all, excuse any inconsistencies in the following mail. I wrote
 it from top to bottom, and there was some thought process involved in
 almost every paragraph...
 
  On 04.09.2013 at 10:03, Stefan Hajnoczi wrote:
  On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
   @@ -103,7 +107,11 @@ in the description of a field.
write to an image with unknown auto-clear features if it
clears the respective bits from this field first.

   -Bits 0-63:  Reserved (set to 0)
   +Bit 0:  Journal valid bit. This bit indicates that the
   +image contains a valid main journal starting at
   +journal_offset.
  
  Whether the journal is used can be determined from the journal_offset
  value (header length must be large enough and journal offset must be
  valid).
  
  Why do we need this autoclear bit?
 
 Hm, I introduced this one first and the journal dirty incompatible bit
 later, perhaps it's unnecessary now. Let's check...
 
 The obvious thing we need to protect against is applying stale journal
 data to an image that has been changed by an older version. As long as
 the journal is clean, this can't happen, and the journal dirty bit will
 ensure that the old version can only open the image if it is clean.
 
 However, what if we run 'qemu-img check -r leaks' with an old qemu-img
 version? It will reclaim the clusters used by the journal, and if we
 continue using the journal we'll corrupt whatever new data is there
 now.
 
Why can an old qemu-img version open the image with a dirty journal in the first
place? It's an incompatible bit.

 Can we protect against this without using an autoclear bit?
 
   +Journals are used to allow safe updates of metadata without impacting
   +performance by requiring flushes to order updates to different parts of 
   the
   +metadata.
  
  This sentence is hard to parse.  Maybe something shorter like this:
  
  Journals allow safe metadata updates without the need for carefully
  ordering and flushing between update steps.
 
 Okay, I'll update the text with your proposal.
 
   +They consist of transactions, which in turn contain operations that
   +are effectively executed atomically. A qcow2 image can have a main image
   +journal that deals with cluster management operations, and additional specific
   +journals can be used by other features like data deduplication.
  
  I'm not sure if multiple journals will work in practice.  Doesn't this
  re-introduce the need to order update steps and flush between them?
 
 This is a question for Benoît, who made this requirement. I asked him
 the same a while ago and apparently his explanation made some sense to
 me, or I would have remembered that I don't want it. ;-)
 
 It might have something to do with the fact that deduplication uses the
 journal more as a kind of cache for hash values that can be dropped and
 rebuilt after a crash.
 
   +A journal is organised in journal blocks, all of which have a reference count
   +of exactly 1. It starts with a block containing the following journal header:
   +
   +Byte  0 -  7:   Magic ("qjournal" ASCII string)
   +
   +  8 - 11:   Journal size in bytes, including the header
   +
   + 12 - 15:   Journal block size order (block size in bytes = 1 << order)
   +The block size must be at least 512 bytes and must not
   +exceed the cluster size.
   +
   + 16 - 19:   Journal block index of the descriptor for the last
   +transaction that has been synced, starting with 1 for the
   +journal block after the header. 0 is used for empty
   +journals.
   +
   + 20 - 23:   Sequence number of the last transaction that has been
   +synced. 0 is recommended as the initial value.
   +
   + 24 - 27:   Sequence number of the last transaction that has been
   +committed. When replaying a journal, all transactions
   +after the last synced one up to the last commit one must be
   +synced. Note that this may include a wraparound of sequence
   +numbers.
   +
   + 28 -  31:  Checksum (one's complement of the sum of all bytes in the
   +header journal block except those of the checksum field)
   +
   + 32 - 511:  Reserved (set to 0)
  
  I'm not sure if these fields are necessary.  They require updates (and
  maybe flush) after every commit and sync.
  
  The fewer metadata updates, the better, not just for performance but
  also to reduce the risk of data loss.  If any metadata required to
  access the journal is corrupted, the image will be unavailable.
  
  It should be possible to determine 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-06 Thread Kevin Wolf
On 06.09.2013 at 11:20, Fam Zheng wrote:
 On Wed, 09/04 11:39, Kevin Wolf wrote:
  First of all, excuse any inconsistencies in the following mail. I wrote
  it from top to bottom, and there was some thought process involved in
  almost every paragraph...
  
   On 04.09.2013 at 10:03, Stefan Hajnoczi wrote:
   On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
@@ -103,7 +107,11 @@ in the description of a field.
 write to an image with unknown auto-clear features if it
 clears the respective bits from this field first.
 
-Bits 0-63:  Reserved (set to 0)
+Bit 0:  Journal valid bit. This bit indicates that the
+image contains a valid main journal starting at
+journal_offset.
   
   Whether the journal is used can be determined from the journal_offset
   value (header length must be large enough and journal offset must be
   valid).
   
   Why do we need this autoclear bit?
  
  Hm, I introduced this one first and the journal dirty incompatible bit
  later, perhaps it's unnecessary now. Let's check...
  
  The obvious thing we need to protect against is applying stale journal
  data to an image that has been changed by an older version. As long as
  the journal is clean, this can't happen, and the journal dirty bit will
  ensure that the old version can only open the image if it is clean.
  
  However, what if we run 'qemu-img check -r leaks' with an old qemu-img
  version? It will reclaim the clusters used by the journal, and if we
  continue using the journal we'll corrupt whatever new data is there
  now.
  
 Why can an old qemu-img version open the image with a dirty journal in the first
 place? It's an incompatible bit.

This is about a clean journal.

Kevin



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-06 Thread Fam Zheng
On Tue, 09/03 15:45, Kevin Wolf wrote:
 This contains an extension of the qcow2 spec that introduces journalling
 to the image format, plus some preliminary type definitions and
 function prototypes in the qcow2 code.
 
 Journalling functionality is a crucial feature for the design of data
 deduplication, and it will improve the core part of qcow2 by avoiding
 cluster leaks on crashes as well as provide an easier way to get a
 reliable implementation of performance features like Delayed COW.
 
 At this point of the RFC, it would be most important to review the
 on-disk structure. Once we're confident that it can do everything we
 want, we can start going into more detail on the qemu side of things.
 
 Signed-off-by: Kevin Wolf kw...@redhat.com
 ---
  block/Makefile.objs   |   2 +-
  block/qcow2-journal.c |  55 ++
  block/qcow2.h |  78 +++
  docs/specs/qcow2.txt  | 204 +-
  4 files changed, 337 insertions(+), 2 deletions(-)
  create mode 100644 block/qcow2-journal.c
 
 diff --git a/block/Makefile.objs b/block/Makefile.objs
 index 3bb85b5..59be314 100644
 --- a/block/Makefile.objs
 +++ b/block/Makefile.objs
 @@ -1,5 +1,5 @@
  block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 -block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
 +block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o
  block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
  block-obj-y += qed-check.o
  block-obj-y += vhdx.o
 diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c
 new file mode 100644
 index 000..5b20239
 --- /dev/null
 +++ b/block/qcow2-journal.c
 @@ -0,0 +1,55 @@
 +/*
 + * qcow2 journalling functions
 + *
 + * Copyright (c) 2013 Kevin Wolf kw...@redhat.com
 + *
 + * Permission is hereby granted, free of charge, to any person obtaining a copy
 + * of this software and associated documentation files (the "Software"), to deal
 + * in the Software without restriction, including without limitation the rights
 + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 + * copies of the Software, and to permit persons to whom the Software is
 + * furnished to do so, subject to the following conditions:
 + *
 + * The above copyright notice and this permission notice shall be included in
 + * all copies or substantial portions of the Software.
 + *
 + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
 + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 + * THE SOFTWARE.
 + */
 +
 +#include "qemu-common.h"
 +#include "block/block_int.h"
 +#include "qcow2.h"
 +
 +#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL  /* qjournal */
 +#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b  /* qjbk */
 +
 +typedef struct Qcow2JournalHeader {
 +uint64_tmagic;
 +uint32_tjournal_size;
 +uint32_tblock_size;
 +uint32_tsynced_index;
 +uint32_tsynced_seq;
 +uint32_tcommitted_seq;
 +uint32_tchecksum;
 +} QEMU_PACKED Qcow2JournalHeader;
 +
 +/*
 + * One big transaction per journal block. The transaction is committed either
 + * time based or when a microtransaction (single set of operations that must be
 + * performed atomically) doesn't fit in the same block any more.
 + */
 +typedef struct Qcow2JournalBlock {
 +uint32_tmagic;
 +uint32_tchecksum;
 +uint32_tseq;
 +uint32_tdesc_offset; /* Allow block header extensions */
 +uint32_tdesc_bytes;
 +uint32_tnb_data_blocks;
 +} QEMU_PACKED Qcow2JournalBlock;
 +
 diff --git a/block/qcow2.h b/block/qcow2.h
 index 1000239..2aee1fd 100644
 --- a/block/qcow2.h
 +++ b/block/qcow2.h
 @@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion {
  QTAILQ_ENTRY(Qcow2DiscardRegion) next;
  } Qcow2DiscardRegion;
  
 +typedef struct Qcow2Journal {
 +
 +} Qcow2Journal;
 +
  typedef struct BDRVQcowState {
  int cluster_bits;
  int cluster_size;
 @@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, 
 Qcow2Cache *c, uint64_t offset,
  void **table);
  int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
  
 +/* qcow2-journal.c functions */
 +
 +typedef struct Qcow2JournalTransaction Qcow2JournalTransaction;
 +
 +enum Qcow2JournalEntryTypeID {
 +QJ_DESC_NOOP= 0,
 +QJ_DESC_WRITE   = 1,
 +QJ_DESC_COPY= 2,
 +
 +/* required after a cluster is freed and used for other purposes, so that
 + * new (unjournalled) data won't be overwritten with 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-06 Thread Fam Zheng
On Fri, 09/06 11:57, Kevin Wolf wrote:
 On 06.09.2013 at 11:20, Fam Zheng wrote:
  On Wed, 09/04 11:39, Kevin Wolf wrote:
   First of all, excuse any inconsistencies in the following mail. I wrote
   it from top to bottom, and there was some thought process involved in
   almost every paragraph...
   
    On 04.09.2013 at 10:03, Stefan Hajnoczi wrote:
On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
 @@ -103,7 +107,11 @@ in the description of a field.
  write to an image with unknown auto-clear features if it
  clears the respective bits from this field first.
  
 -Bits 0-63:  Reserved (set to 0)
 +Bit 0:  Journal valid bit. This bit indicates that the
 +image contains a valid main journal starting at
 +journal_offset.

Whether the journal is used can be determined from the journal_offset
value (header length must be large enough and journal offset must be
valid).

Why do we need this autoclear bit?
   
   Hm, I introduced this one first and the journal dirty incompatible bit
   later, perhaps it's unnecessary now. Let's check...
   
   The obvious thing we need to protect against is applying stale journal
   data to an image that has been changed by an older version. As long as
   the journal is clean, this can't happen, and the journal dirty bit will
   ensure that the old version can only open the image if it is clean.
   
   However, what if we run 'qemu-img check -r leaks' with an old qemu-img
   version? It will reclaim the clusters used by the journal, and if we
   continue using the journal we'll corrupt whatever new data is there
   now.
   
  Why can an old qemu-img version open the image with a dirty journal in the first
  place? It's an incompatible bit.
 
 This is about a clean journal.
 
Ah yes, I get it, thanks.

Fam



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Stefan Hajnoczi
On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
 First of all, excuse any inconsistencies in the following mail. I wrote
 it from top to bottom, and there was some thought process involved in
 almost every paragraph...

I should add this disclaimer to all my emails ;-).

 On 04.09.2013 at 10:03, Stefan Hajnoczi wrote:
  On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
   @@ -103,7 +107,11 @@ in the description of a field.
write to an image with unknown auto-clear features if it
clears the respective bits from this field first.

   -Bits 0-63:  Reserved (set to 0)
   +Bit 0:  Journal valid bit. This bit indicates that the
   +image contains a valid main journal starting at
   +journal_offset.
  
  Whether the journal is used can be determined from the journal_offset
  value (header length must be large enough and journal offset must be
  valid).
  
  Why do we need this autoclear bit?
 
 Hm, I introduced this one first and the journal dirty incompatible bit
 later, perhaps it's unnecessary now. Let's check...
 
 The obvious thing we need to protect against is applying stale journal
 data to an image that has been changed by an older version. As long as
 the journal is clean, this can't happen, and the journal dirty bit will
 ensure that the old version can only open the image if it is clean.
 
 However, what if we run 'qemu-img check -r leaks' with an old qemu-img
 version? It will reclaim the clusters used by the journal, and if we
 continue using the journal we'll corrupt whatever new data is there
 now.
 
 Can we protect against this without using an autoclear bit?

You are right.  It's a weird case I didn't think of but it could happen.
An autoclear bit sounds like the simplest solution.

Please document this scenario.

   +A journal is organised in journal blocks, all of which have a reference count
   +of exactly 1. It starts with a block containing the following journal header:
   +
   +Byte  0 -  7:   Magic ("qjournal" ASCII string)
   +
   +  8 - 11:   Journal size in bytes, including the header
   +
   + 12 - 15:   Journal block size order (block size in bytes = 1 << order)
   +The block size must be at least 512 bytes and must not
   +exceed the cluster size.
   +
   + 16 - 19:   Journal block index of the descriptor for the last
   +transaction that has been synced, starting with 1 for the
   +journal block after the header. 0 is used for empty
   +journals.
   +
   + 20 - 23:   Sequence number of the last transaction that has been
   +synced. 0 is recommended as the initial value.
   +
   + 24 - 27:   Sequence number of the last transaction that has been
   +committed. When replaying a journal, all transactions
   +after the last synced one up to the last commit one must be
   +synced. Note that this may include a wraparound of sequence
   +numbers.
   +
   + 28 -  31:  Checksum (one's complement of the sum of all bytes in the
   +header journal block except those of the checksum field)
   +
   + 32 - 511:  Reserved (set to 0)
  
  I'm not sure if these fields are necessary.  They require updates (and
  maybe flush) after every commit and sync.
  
  The fewer metadata updates, the better, not just for performance but
  also to reduce the risk of data loss.  If any metadata required to
  access the journal is corrupted, the image will be unavailable.
  
  It should be possible to determine this information by scanning the
  journal transactions.
 
 This is rather handwavy. Can you elaborate how this would work in detail?
 
 
 For example, let's assume we get to read this journal (a journal can be
 rather large, too, so I'm not sure if we want to read it in completely):
 
  - Descriptor, seq 42, 2 data blocks
  - Data block
  - Data block
  - Data block starting with 'qjbk'
  - Data block
  - Descriptor, seq 7, 0 data blocks
  - Descriptor, seq 8, 1 data block
  - Data block
 
 Which of these have already been synced? Which have been committed?
 
 
 I guess we could introduce an is_committed flag in the descriptor, but
 wouldn't correct operation look like this then:
 
 1. Write out descriptor commit flag clear and any data blocks
 2. Flush
 3. Rewrite descriptor with commit flag set
 
 This ensures that the commit flag is only set if all the required data
 is indeed stable on disk. What has changed compared to this proposal is
 just the offset at which you write in step 3 (header vs. descriptor).

A commit flag cannot be relied upon.  A transaction can be corrupted
after being committed, or 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Stefan Hajnoczi
On Wed, Sep 04, 2013 at 11:55:23AM +0200, Benoît Canet wrote:
   I'm not sure if multiple journals will work in practice.  Doesn't this
   re-introduce the need to order update steps and flush between them?
  
  This is a question for Benoît, who made this requirement. I asked him
  the same a while ago and apparently his explanation made some sense to
  me, or I would have remembered that I don't want it. ;-)
 
  The reason behind the multiple journal requirement is that if a block gets
  created and deleted in a cyclic way, it can generate cyclic insertion/deletion
  journal entries.
  The journal could easily be filled if this pathological corner case happens.
  When it happens, the dedup code repacks the journal by writing only the
  non-redundant information into a new journal and then uses the new one.
  It would not be easy to do so if non-dedup journal entries were present in the
  journal, hence the multiple journal requirement.

  The deduplication also needs two journals, because when the first one is
  frozen it takes some time to write the hash table to disk, and new entries
  must still be stored somewhere in the meantime. The code cannot block.
 
  It might have something to do with the fact that deduplication uses the
  journal more as a kind of cache for hash values that can be dropped and
  rebuilt after a crash.
 
 For dedupe, the journal is more of a resume-after-exit tool.

I'm not sure anymore if dedupe needs the same kind of journal as a
metadata journal for qcow2.

Since you have a dirty flag to discard the journal on crash, the
journal is not used for data integrity.

That makes me wonder if the metadata journal is the right structure for
dedupe?  Maybe your original proposal was fine for dedupe and we just
misinterpreted it because we thought this needs to be a safe journal.

Stefan



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Stefan Hajnoczi
On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
 This contains an extension of the qcow2 spec that introduces journalling
 to the image format, plus some preliminary type definitions and
 function prototypes in the qcow2 code.
 
 Journalling functionality is a crucial feature for the design of data
 deduplication, and it will improve the core part of qcow2 by avoiding
 cluster leaks on crashes as well as provide an easier way to get a
 reliable implementation of performance features like Delayed COW.
 
 At this point of the RFC, it would be most important to review the
 on-disk structure. Once we're confident that it can do everything we
 want, we can start going into more detail on the qemu side of things.
 
 Signed-off-by: Kevin Wolf kw...@redhat.com
 ---
  block/Makefile.objs   |   2 +-
  block/qcow2-journal.c |  55 ++
  block/qcow2.h |  78 +++
  docs/specs/qcow2.txt  | 204 +-
  4 files changed, 337 insertions(+), 2 deletions(-)
  create mode 100644 block/qcow2-journal.c

Although we are still discussing details of the on-disk layout, the
general design is clear enough to discuss how the journal will be used.

Today qcow2 uses Qcow2Cache to do lazy, ordered metadata updates.  The
performance is pretty good with two exceptions that I can think of:

1. The delayed CoW problem that Kevin has been working on.  Guests
   perform sequential writes that are smaller than a qcow2 cluster.  The
   first write triggers a copy-on-write of the full cluster.  Later
   writes then overwrite the copied data.  It would be more efficient to
   anticipate sequential writes and hold off on CoW where possible.

2. Lazy metadata updates lead to bursty behavior and expensive flushes.
   We do not take advantage of disk bandwidth since metadata updates
   stay in the Qcow2Cache until the last possible second.  When the
   guest issues a flush we must write out dirty Qcow2Cache entries and
   possibly fsync between them if dependencies have been set (e.g.
   refcount before L2).

How will the journal change this situation?  Writes that go through the
journal are doubled - they must first be journalled, fsync, and then
they can be applied to the actual image.

How do we benefit by using the journal?

Stefan



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Kevin Wolf
On 05.09.2013 at 11:21, Stefan Hajnoczi wrote:
 On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
  However, what if we run 'qemu-img check -r leaks' with an old qemu-img
  version? It will reclaim the clusters used by the journal, and if we
  continue using the journal we'll corrupt whatever new data is there
  now.
  
  Can we protect against this without using an autoclear bit?
 
 You are right.  It's a weird case I didn't think of but it could happen.
 An autoclear bit sounds like the simplest solution.
 
 Please document this scenario.

Okay, I've updated the description as follows:

Bit 0:  Journal valid bit. This bit indicates that the
image contains a valid main journal starting at
journal_offset; it is used to mark journals
invalid if the image was opened by older
implementations that may have reclaimed the
journal clusters that would appear as leaked
clusters to them.

+A journal is organised in journal blocks, all of which have a reference count
+of exactly 1. It starts with a block containing the following journal header:
+
+Byte  0 -  7:   Magic ("qjournal" ASCII string)
+
+  8 - 11:   Journal size in bytes, including the header
+
+ 12 - 15:   Journal block size order (block size in bytes = 1 << order)
+The block size must be at least 512 bytes and must not
+exceed the cluster size.
+
+ 16 - 19:   Journal block index of the descriptor for the last
+transaction that has been synced, starting with 1 for the
+journal block after the header. 0 is used for empty
+journals.
+
+ 20 - 23:   Sequence number of the last transaction that has been
+synced. 0 is recommended as the initial value.
+
+ 24 - 27:   Sequence number of the last transaction that has been
+committed. When replaying a journal, all transactions
+after the last synced one up to the last commit one must be
+synced. Note that this may include a wraparound of sequence
+numbers.
+
+ 28 -  31:  Checksum (one's complement of the sum of all bytes in the
+header journal block except those of the checksum field)
+
+ 32 - 511:  Reserved (set to 0)
   
   I'm not sure if these fields are necessary.  They require updates (and
   maybe flush) after every commit and sync.
   
   The fewer metadata updates, the better, not just for performance but
   also to reduce the risk of data loss.  If any metadata required to
   access the journal is corrupted, the image will be unavailable.
   
   It should be possible to determine this information by scanning the
   journal transactions.
  
  This is rather handwavy. Can you elaborate how this would work in detail?
  
  
  For example, let's assume we get to read this journal (a journal can be
  rather large, too, so I'm not sure if we want to read it in completely):
  
   - Descriptor, seq 42, 2 data blocks
   - Data block
   - Data block
   - Data block starting with 'qjbk'
   - Data block
   - Descriptor, seq 7, 0 data blocks
   - Descriptor, seq 8, 1 data block
   - Data block
  
  Which of these have already been synced? Which have been committed?

So what's your algorithm for this?

  I guess we could introduce an is_committed flag in the descriptor, but
  wouldn't correct operation look like this then:
  
  1. Write out descriptor commit flag clear and any data blocks
  2. Flush
  3. Rewrite descriptor with commit flag set
  
  This ensures that the commit flag is only set if all the required data
  is indeed stable on disk. What has changed compared to this proposal is
  just the offset at which you write in step 3 (header vs. descriptor).
 
 A commit flag cannot be relied upon.  A transaction can be corrupted
 after being committed, or it can be corrupted due to power failure while
 writing the transaction.  In both cases we have an invalid transaction
 and we must discard it.

No, I believe it is vitally important to distinguish these two cases.

If a transaction was corrupted due to power failure while writing the
transaction, then we can simply discard it indeed.

If, however, a transaction was committed and gets corrupted after the
fact, then we have a problem because the data on the disk is laid out as
described by on-disk metadata (e.g. L2 tables) _with the journal fully
applied_. The replay and consequently bdrv_open() must fail in this case.

The first case is handled by any information that tells us whether the
transaction is already committed; the second should never happen, but
would be caught by a checksum.
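
For concreteness, the three-step commit described above could look roughly like
this (a sketch only; every type and helper here is hypothetical, not existing
qemu code):

    #include <stdint.h>

    typedef struct Journal Journal;
    typedef struct DataBlock DataBlock;
    typedef struct Descriptor { uint32_t flags; } Descriptor;
    #define DESC_COMMITTED (1u << 0)

    /* Hypothetical helpers */
    int journal_write_transaction(Journal *j, Descriptor *desc,
                                  const DataBlock *data, int nb_data);
    int journal_write_descriptor(Journal *j, const Descriptor *desc);
    int journal_flush(Journal *j);

    static int commit_transaction(Journal *j, Descriptor *desc,
                                  const DataBlock *data, int nb_data)
    {
        int ret;

        /* Step 1: descriptor with commit flag clear, plus the data blocks */
        desc->flags &= ~DESC_COMMITTED;
        ret = journal_write_transaction(j, desc, data, nb_data);
        if (ret < 0) {
            return ret;
        }

        /* Step 2: make everything stable on disk */
        ret = journal_flush(j);
        if (ret < 0) {
            return ret;
        }

        /* Step 3: rewrite the descriptor with the commit flag set */
        desc->flags |= DESC_COMMITTED;
        return journal_write_descriptor(j, desc);
    }

With that, a checksum over the descriptor and its data blocks distinguishes the
two cases: an uncommitted transaction is simply discarded, while a committed one
that fails its checksum makes the replay (and thus bdrv_open()) fail.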

 The checksum 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Kevin Wolf
On 05.09.2013 at 11:35, Stefan Hajnoczi wrote:
 Although we are still discussing details of the on-disk layout, the
 general design is clear enough to discuss how the journal will be used.
 
 Today qcow2 uses Qcow2Cache to do lazy, ordered metadata updates.  The
 performance is pretty good with two exceptions that I can think of:
 
 1. The delayed CoW problem that Kevin has been working on.  Guests
perform sequential writes that are smaller than a qcow2 cluster.  The
first write triggers a copy-on-write of the full cluster.  Later
writes then overwrite the copied data.  It would be more efficient to
anticipate sequential writes and hold off on CoW where possible.

To be clear, "more efficient" can mean a gain of 50% or more. COW
overhead is the only major overhead compared to raw when looking at
normal cluster allocations. So this is something that is really
important for cluster allocation performance.

The patches that I posted a while ago showed that it's possible to do
this without a journal, however the flush operation became very complex
(which we all found rather scary) and required that the COW be completed
before signalling flush completion.

With a journal, the only thing that you need to do on a flush is to
commit all transactions, i.e. write them out and bdrv_flush(bs->file).
The actual data copy of the COW (i.e. the sync) can be further delayed
and doesn't have to happen at commit time as it would have without a
journal.
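
As a minimal sketch of what that flush could look like (assuming a hypothetical
qcow2_journal_commit_all() helper and a journal field in BDRVQcowState, neither
of which exists yet):

    /* Commit the pending journal transactions and flush the underlying
     * file; the COW data copy (the sync) stays deferred. */
    static int qcow2_co_flush_with_journal(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;
        int ret;

        ret = qcow2_journal_commit_all(s->journal);
        if (ret < 0) {
            return ret;
        }
        return bdrv_flush(bs->file);
    }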

 2. Lazy metadata updates lead to bursty behavior and expensive flushes.
We do not take advantage of disk bandwidth since metadata updates
stay in the Qcow2Cache until the last possible second.  When the
guest issues a flush we must write out dirty Qcow2Cache entries and
possibly fsync between them if dependencies have been set (e.g.
refcount before L2).

Hm, have we ever measured the impact of this?

I don't think a journal can make a fundamental difference here - either
you write only at the last possible second (today flush, with a journal
commit), or you write out more data than strictly necessary.

 How will the journal change this situation?  Writes that go through the
 journal are doubled - they must first be journalled, fsync, and then
 they can be applied to the actual image.
 
 How do we benefit by using the journal?

I believe Delayed COW is a pretty strong one. But there are more cases
in which performance isn't that great.

I think you refer to the simple case with a normal empty image where new
clusters are allocated, which is pretty good indeed if we ignore COW.
Trouble starts when you also free clusters, which happens for example
with internal COW (internal snapshots, compressed images) or discard.
Deduplication as well in the future, I suppose.

Then you get very quickly alternating sequences of "L2 depends on
refcount update" (for allocation) and "refcount update depends on L2
update" (for freeing), which means that Qcow2Cache starts flushing all
the time without accumulating many requests. These are cases that would
benefit as well from the atomicity of journal transactions.

And then, of course, we still leak clusters on failed operations. With a
journal, this wouldn't happen any more and the image would always stay
consistent (instead of only corruption-free).

Kevin



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Benoît Canet
 Then you get very quickly alternating sequences of L2 depends on
 refcount update (for allocation) and refcount update depends on L2
 update (for freeing), which means that Qcow2Cache starts flushing all
 the time without accumulating many requests. These are cases that would
 benefit as well from the atomicity of journal transactions.

True, deduplication can hit this case on delete if I remember correctly,
and it slows down everything.

Best regards

Benoît



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Benoît Canet
On Thursday 05 Sep 2013 at 11:24:40 (+0200), Stefan Hajnoczi wrote:
 On Wed, Sep 04, 2013 at 11:55:23AM +0200, Benoît Canet wrote:
I'm not sure if multiple journals will work in practice.  Doesn't this
re-introduce the need to order update steps and flush between them?
   
   This is a question for Benoît, who made this requirement. I asked him
   the same a while ago and apparently his explanation made some sense to
   me, or I would have remembered that I don't want it. ;-)
  
  The reason behind the multiple journal requirement is that if a block gets
  created and deleted in a cyclic way, it can generate cyclic insertion/deletion
  journal entries.
  The journal could easily be filled if this pathological corner case happens.
  When it happens, the dedup code repacks the journal by writing only the
  non-redundant information into a new journal and then uses the new one.
  It would not be easy to do so if non-dedup journal entries were present in the
  journal, hence the multiple journal requirement.

  The deduplication also needs two journals, because when the first one is
  frozen it takes some time to write the hash table to disk, and new entries
  must still be stored somewhere in the meantime. The code cannot block.
  
   It might have something to do with the fact that deduplication uses the
   journal more as a kind of cache for hash values that can be dropped and
   rebuilt after a crash.
  
  For dedupe, the journal is more of a resume-after-exit tool.
 
 I'm not sure anymore if dedupe needs the same kind of journal as a
 metadata journal for qcow2.
 
 Since you have a dirty flag to discard the journal on crash, the
 journal is not used for data integrity.
 
 That makes me wonder if the metadata journal is the right structure for
 dedupe?  Maybe your original proposal was fine for dedupe and we just
 misinterpreted it because we thought this needs to be a safe journal.

Kevin, what do you think of this?
I could strip down the dedupe journal code to specialize it.

Best regards

Benoît



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Stefan Hajnoczi
On Thu, Sep 5, 2013 at 1:18 PM, Kevin Wolf kw...@redhat.com wrote:
 On 05.09.2013 at 11:21, Stefan Hajnoczi wrote:
 On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
  However, what if we run 'qemu-img check -r leaks' with an old qemu-img
  version? It will reclaim the clusters used by the journal, and if we
  continue using the journal we'll corrupt whatever new data is there
  now.
 
  Can we protect against this without using an autoclear bit?

 You are right.  It's a weird case I didn't think of but it could happen.
 An autoclear bit sounds like the simplest solution.

 Please document this scenario.

 Okay, I've updated the description as follows:

 Bit 0:  Journal valid bit. This bit indicates that the
 image contains a valid main journal starting at
 journal_offset; it is used to mark journals
 invalid if the image was opened by older
 implementations that may have reclaimed the
 journal clusters that would appear as leaked
 clusters to them.

Great, thanks.

+A journal is organised in journal blocks, all of which have a reference count
+of exactly 1. It starts with a block containing the following journal header:
+
+Byte  0 -  7:   Magic ("qjournal" ASCII string)
+
+  8 - 11:   Journal size in bytes, including the header
+
+ 12 - 15:   Journal block size order (block size in bytes = 1 << order)
+The block size must be at least 512 bytes and must not
+exceed the cluster size.
+
+ 16 - 19:   Journal block index of the descriptor for the last
+transaction that has been synced, starting with 1 for the
+journal block after the header. 0 is used for empty
+journals.
+
+ 20 - 23:   Sequence number of the last transaction that has been
+synced. 0 is recommended as the initial value.
+
+ 24 - 27:   Sequence number of the last transaction that has been
+committed. When replaying a journal, all transactions
+after the last synced one up to the last commit one must be
+synced. Note that this may include a wraparound of sequence
+numbers.
+
+ 28 -  31:  Checksum (one's complement of the sum of all bytes in the
+header journal block except those of the checksum field)
+
+ 32 - 511:  Reserved (set to 0)
  
   I'm not sure if these fields are necessary.  They require updates (and
   maybe flush) after every commit and sync.
  
   The fewer metadata updates, the better, not just for performance but
   also to reduce the risk of data loss.  If any metadata required to
   access the journal is corrupted, the image will be unavailable.
  
   It should be possible to determine this information by scanning the
   journal transactions.
 
  This is rather handwavy. Can you elaborate how this would work in detail?
 
 
  For example, let's assume we get to read this journal (a journal can be
  rather large, too, so I'm not sure if we want to read it in completely):
 
   - Descriptor, seq 42, 2 data blocks
   - Data block
   - Data block
 - Data block starting with 'qjbk'
   - Data block
   - Descriptor, seq 7, 0 data blocks
   - Descriptor, seq 8, 1 data block
   - Data block
 
  Which of these have already been synced? Which have been committed?

 So what's your algorithm for this?

Scan the journal to find unsynced transactions, if they exist:

last_sync_seq = 0
last_seqno = 0
while True:
    block = journal[(i++) % journal_nblocks]
    if i >= journal_nblocks * 2:
        break # avoid infinite loop
    if block.magic != 'qjbk':
        continue
    if block.seqno < last_seqno:
        # Wrapped around to oldest transaction
        break
    elif block.seqno == seqno:
        # Corrupt journal, sequence number should be
        # monotonically increasing
        raise InvalidJournalException
    if block.last_sync_seq != last_sync_seq:
        last_sync_seq = block.last_sync_seq
    last_seqno = block.seqno

print 'First unsynced block seq no:', last_sync_seq
print 'Last block seq no:', last_seqno

This is broken pseudocode, but hopefully the idea makes sense.

  I guess we could introduce an is_committed flag in the descriptor, but
  wouldn't correct operation look like this then:
 
  1. Write out descriptor commit flag clear and any data blocks
  2. Flush
  3. Rewrite descriptor with commit flag set
 
  This ensures that the commit flag is only set if all the required data
  is indeed stable on disk. What has changed compared to this proposal is
  just the offset at which you write in step 3 (header vs. descriptor).

 A commit flag cannot be relied 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Kevin Wolf
On 05.09.2013 at 16:55, Stefan Hajnoczi wrote:
 On Thu, Sep 5, 2013 at 1:18 PM, Kevin Wolf kw...@redhat.com wrote:
  On 05.09.2013 at 11:21, Stefan Hajnoczi wrote:
  On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote:
 +A journal is organised in journal blocks, all of which have a reference count
 +of exactly 1. It starts with a block containing the following journal header:
 +
 +Byte  0 -  7:   Magic ("qjournal" ASCII string)
 +
 +  8 - 11:   Journal size in bytes, including the header
 +
 + 12 - 15:   Journal block size order (block size in bytes = 1 << order)
 +The block size must be at least 512 bytes and must not
 +exceed the cluster size.
 +
 + 16 - 19:   Journal block index of the descriptor for the last
 +transaction that has been synced, starting with 1 for the
 +journal block after the header. 0 is used for empty
 +journals.
 +
 + 20 - 23:   Sequence number of the last transaction that has been
 +synced. 0 is recommended as the initial value.
 +
 + 24 - 27:   Sequence number of the last transaction that has been
 +committed. When replaying a journal, all transactions
 +after the last synced one up to the last commit one must be
 +synced. Note that this may include a wraparound of sequence
 +numbers.
 +
 + 28 -  31:  Checksum (one's complement of the sum of all bytes in the
 +header journal block except those of the checksum field)
 +
 + 32 - 511:  Reserved (set to 0)
   
I'm not sure if these fields are necessary.  They require updates (and
maybe flush) after every commit and sync.
   
The fewer metadata updates, the better, not just for performance but
also to reduce the risk of data loss.  If any metadata required to
access the journal is corrupted, the image will be unavailable.
   
It should be possible to determine this information by scanning the
journal transactions.
  
   This is rather handwavy. Can you elaborate how this would work in detail?
  
  
   For example, let's assume we get to read this journal (a journal can be
   rather large, too, so I'm not sure if we want to read it in completely):
  
- Descriptor, seq 42, 2 data blocks
- Data block
- Data block
- Data block starting with 'qjbk'
- Data block
- Descriptor, seq 7, 0 data blocks
- Descriptor, seq 8, 1 data block
- Data block
  
   Which of these have already been synced? Which have been committed?
 
  So what's your algorithm for this?
 
 Scan the journal to find unsynced transactions, if they exist:
 
 last_sync_seq = 0
 last_seqno = 0
 while True:
     block = journal[(i++) % journal_nblocks]
     if i >= journal_nblocks * 2:
         break # avoid infinite loop
     if block.magic != 'qjbk':
         continue

Important implication: This doesn't allow data blocks starting with
'qjbk'. Otherwise you're not even guaranteed to find a descriptor block
to start your search with.

The second time you make this assumption is when there are stale data
blocks in the unused area between the head and tail of the journal.

     if block.seqno < last_seqno:
         # Wrapped around to oldest transaction
         break

Why can you stop here? There might be transactions in the second half of
the journal that aren't synced yet.

     elif block.seqno == seqno:
         # Corrupt journal, sequence number should be
         # monotonically increasing
         raise InvalidJournalException
     if block.last_sync_seq != last_sync_seq:
         last_sync_seq = block.last_sync_seq

The 'if' doesn't add anything here, so you end up using the
last_sync_seq field of the last valid descriptor.

     last_seqno = block.seqno
 
 print 'First unsynced block seq no:', last_sync_seq
 print 'Last block seq no:', last_seqno
 
 This is broken pseudocode, but hopefully the idea makes sense.

One additional thought that might make the thing a bit more interesting:
Sequence numbers can wrap around as well.

Kevin



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-05 Thread Eric Blake
On 09/05/2013 09:20 AM, Kevin Wolf wrote:
 
 One additional thought that might make the thing a bit more interesting:
 Sequence numbers can wrap around as well.

On the other hand, if sequence numbers are 64-bit, the number of
operations required to cause a wrap far exceeds the expected lifetime of
any of us on this list, and we can safely assume it to be a non-issue.
(There's other places in qemu where we intentionally have an abort() if
a 64-bit number would wrap...)
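
(For scale: even at a sustained million transactions per second, a 64-bit
sequence number wraps only after 2^64 / 10^6 ≈ 1.8 × 10^13 seconds, which is
roughly 585,000 years.)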

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-04 Thread Benoît Canet
  +They consist of transactions, which in turn contain operations that
  +are effectively executed atomically. A qcow2 image can have a main image
  +journal that deals with cluster management operations, and additional specific
  +journals can be used by other features like data deduplication.
 
 I'm not sure if multiple journals will work in practice.  Doesn't this
 re-introduce the need to order update steps and flush between them?

The "flush" and "data has reached stable storage" requirements of the deduplication
journal are very weak.
The deduplication code maintains an incompatible dedup dirty flag, flushes
the journal on exit, then clears the flag.
If the flag is set at startup, all deduplication metadata and journal content
are dropped; this does not harm the image file in any way.
The code just starts over.
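
A sketch of that lifecycle, with hypothetical names (the real dedup patches may
differ):

    #include <stdint.h>

    #define INCOMPAT_DEDUP_DIRTY (1ULL << 1)   /* hypothetical feature bit */

    typedef struct State State;                /* stand-in for BDRVQcowState */
    uint64_t get_incompatible_features(State *s);
    void set_incompatible_features(State *s, uint64_t features);
    void drop_dedup_metadata_and_journal(State *s);
    void flush_dedup_journal(State *s);

    /* Startup: a set flag means the last shutdown was unclean, so the dedup
     * state is untrusted. Dropping it does not harm the image; the code
     * just starts over. */
    static void dedup_open(State *s)
    {
        uint64_t f = get_incompatible_features(s);

        if (f & INCOMPAT_DEDUP_DIRTY) {
            drop_dedup_metadata_and_journal(s);
        }
        set_incompatible_features(s, f | INCOMPAT_DEDUP_DIRTY);
    }

    /* Clean exit: flush the journal, then clear the flag. */
    static void dedup_close(State *s)
    {
        flush_dedup_journal(s);
        set_incompatible_features(s,
            get_incompatible_features(s) & ~INCOMPAT_DEDUP_DIRTY);
    }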



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-04 Thread Kevin Wolf
First of all, excuse any inconsistencies in the following mail. I wrote
it from top to bottom, and there was some thought process involved in
almost every paragraph...

On 04.09.2013 at 10:03, Stefan Hajnoczi wrote:
 On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
  @@ -103,7 +107,11 @@ in the description of a field.
  write to an image with unknown auto-clear features if it
  clears the respective bits from this field first.
  
 -Bits 0-63:  Reserved (set to 0)
 +Bit 0:  Journal valid bit. This bit indicates that the
 +image contains a valid main journal starting at
 +journal_offset.
 
 Whether the journal is used can be determined from the journal_offset
 value (header length must be large enough and journal offset must be
 valid).
 
 Why do we need this autoclear bit?

Hm, I introduced this one first and the journal dirty incompatible bit
later, perhaps it's unnecessary now. Let's check...

The obvious thing we need to protect against is applying stale journal
data to an image that has been changed by an older version. As long as
the journal is clean, this can't happen, and the journal dirty bit will
ensure that the old version can only open the image if it is clean.

However, what if we run 'qemu-img check -r leaks' with an old qemu-img
version? It will reclaim the clusters used by the journal, and if we
continue using the journal we'll corrupt whatever new data is there
now.

Can we protect against this without using an autoclear bit?
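
For illustration, the autoclear semantics that make this safe could be sketched
like this (made-up names; not the actual qcow2 open code):

    #include <stdint.h>
    #include <stdbool.h>

    #define AUTOCLEAR_JOURNAL_VALID (1ULL << 0)

    /* Any implementation clears every autoclear bit it does not know about
     * when opening the image read-write. An old qemu-img knows none of
     * them, so it clears the journal valid bit as a side effect. */
    static uint64_t open_rw_old_version(uint64_t autoclear_features)
    {
        const uint64_t known_by_old_version = 0;
        return autoclear_features & known_by_old_version;
    }

    /* A new version checks the bit afterwards: if it is gone, the journal
     * clusters may have been reclaimed and must not be trusted any more. */
    static bool journal_still_valid(uint64_t autoclear_features)
    {
        return (autoclear_features & AUTOCLEAR_JOURNAL_VALID) != 0;
    }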

  +Journals are used to allow safe updates of metadata without impacting
  +performance by requiring flushes to order updates to different parts of the
  +metadata.
 
 This sentence is hard to parse.  Maybe something shorter like this:
 
 Journals allow safe metadata updates without the need for carefully
 ordering and flushing between update steps.

Okay, I'll update the text with your proposal.

  +They consist of transactions, which in turn contain operations that
  +are effectively executed atomically. A qcow2 image can have a main image
  +journal that deals with cluster management operations, and additional specific
  +journals can be used by other features like data deduplication.
 
 I'm not sure if multiple journals will work in practice.  Doesn't this
 re-introduce the need to order update steps and flush between them?

This is a question for Benoît, who made this requirement. I asked him
the same a while ago and apparently his explanation made some sense to
me, or I would have remembered that I don't want it. ;-)

It might have something to do with the fact that deduplication uses the
journal more as a kind of cache for hash values that can be dropped and
rebuilt after a crash.

  +A journal is organised in journal blocks, all of which have a reference count
  +of exactly 1. It starts with a block containing the following journal header:
  +
  +Byte  0 -  7:   Magic ("qjournal" ASCII string)
  +
  +  8 - 11:   Journal size in bytes, including the header
  +
  + 12 - 15:   Journal block size order (block size in bytes = 1 << order)
  +The block size must be at least 512 bytes and must not
  +exceed the cluster size.
  +
  + 16 - 19:   Journal block index of the descriptor for the last
  +transaction that has been synced, starting with 1 for the
  +journal block after the header. 0 is used for empty
  +journals.
  +
  + 20 - 23:   Sequence number of the last transaction that has been
  +synced. 0 is recommended as the initial value.
  +
  + 24 - 27:   Sequence number of the last transaction that has been
  +committed. When replaying a journal, all transactions
  +after the last synced one up to the last commit one must be
  +synced. Note that this may include a wraparound of sequence
  +numbers.
  +
  + 28 -  31:  Checksum (one's complement of the sum of all bytes in the
  +header journal block except those of the checksum field)
  +
  + 32 - 511:  Reserved (set to 0)
 
 I'm not sure if these fields are necessary.  They require updates (and
 maybe flush) after every commit and sync.
 
 The fewer metadata updates, the better, not just for performance but
 also to reduce the risk of data loss.  If any metadata required to
 access the journal is corrupted, the image will be unavailable.
 
 It should be possible to determine this information by scanning the
 journal transactions.

This is rather handwavy. Can you elaborate how this would work in detail?


For example, let's assume we get to read this journal (a journal can be
rather large, too, so I'm not sure if we want to read it in 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-04 Thread Kevin Wolf
On 04.09.2013 at 10:32, Max Reitz wrote:
 On 2013-09-03 15:45, Kevin Wolf wrote:
 This contains an extension of the qcow2 spec that introduces journalling
 to the image format, plus some preliminary type definitions and
 function prototypes in the qcow2 code.
 
 Journalling functionality is a crucial feature for the design of data
 deduplication, and it will improve the core part of qcow2 by avoiding
 cluster leaks on crashes as well as provide an easier way to get a
 reliable implementation of performance features like Delayed COW.
 
 At this point of the RFC, it would be most important to review the
 on-disk structure. Once we're confident that it can do everything we
 want, we can start going into more detail on the qemu side of things.
 
 Signed-off-by: Kevin Wolf kw...@redhat.com
 ---
   block/Makefile.objs   |   2 +-
   block/qcow2-journal.c |  55 ++
   block/qcow2.h |  78 +++
   docs/specs/qcow2.txt  | 204 +-
   4 files changed, 337 insertions(+), 2 deletions(-)
   create mode 100644 block/qcow2-journal.c
 
 diff --git a/block/Makefile.objs b/block/Makefile.objs
 index 3bb85b5..59be314 100644
 --- a/block/Makefile.objs
 +++ b/block/Makefile.objs
 @@ -1,5 +1,5 @@
    block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
  -block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
  +block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o
   block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
   block-obj-y += qed-check.o
   block-obj-y += vhdx.o
 diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c
 new file mode 100644
 index 000..5b20239
 --- /dev/null
 +++ b/block/qcow2-journal.c
 @@ -0,0 +1,55 @@
 +/*
 + * qcow2 journalling functions
 + *
 + * Copyright (c) 2013 Kevin Wolf kw...@redhat.com
 + *
  + * Permission is hereby granted, free of charge, to any person obtaining a copy
  + * of this software and associated documentation files (the "Software"), to deal
  + * in the Software without restriction, including without limitation the rights
  + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
  + * copies of the Software, and to permit persons to whom the Software is
  + * furnished to do so, subject to the following conditions:
  + *
  + * The above copyright notice and this permission notice shall be included in
  + * all copies or substantial portions of the Software.
  + *
  + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
  + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  + * THE SOFTWARE.
  + */
 +
  +#include "qemu-common.h"
  +#include "block/block_int.h"
  +#include "qcow2.h"
 +
 +#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL  /* qjournal */
 +#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b  /* qjbk */
 +
 +typedef struct Qcow2JournalHeader {
 +uint64_tmagic;
 +uint32_tjournal_size;
 +uint32_tblock_size;
 +uint32_tsynced_index;
 +uint32_tsynced_seq;
 +uint32_tcommitted_seq;
 +uint32_tchecksum;
 +} QEMU_PACKED Qcow2JournalHeader;
 +
 +/*
 + * One big transaction per journal block. The transaction is committed 
 either
 + * time based or when a microtransaction (single set of operations that 
 must be
 + * performed atomically) doesn't fit in the same block any more.
 + */
 +typedef struct Qcow2JournalBlock {
 +uint32_tmagic;
 +uint32_tchecksum;
 +uint32_tseq;
 +uint32_tdesc_offset; /* Allow block header extensions */
 +uint32_tdesc_bytes;
 +uint32_tnb_data_blocks;
 +} QEMU_PACKED Qcow2JournalBlock;
 +
 Why is this in the C file...
 
 diff --git a/block/qcow2.h b/block/qcow2.h
 index 1000239..2aee1fd 100644
 --- a/block/qcow2.h
 +++ b/block/qcow2.h
 @@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion {
   QTAILQ_ENTRY(Qcow2DiscardRegion) next;
   } Qcow2DiscardRegion;
 +typedef struct Qcow2Journal {
 +
 +} Qcow2Journal;
 +
   typedef struct BDRVQcowState {
   int cluster_bits;
   int cluster_size;
 @@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, 
 Qcow2Cache *c, uint64_t offset,
   void **table);
   int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
 +/* qcow2-journal.c functions */
 +
 +typedef struct Qcow2JournalTransaction Qcow2JournalTransaction;
 +
 +enum Qcow2JournalEntryTypeID {
 +QJ_DESC_NOOP= 0,
 +QJ_DESC_WRITE   = 1,
 +QJ_DESC_COPY= 2,
 +
 +/* required after a cluster is 

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-04 Thread Benoît Canet
  I'm not sure if multiple journals will work in practice.  Doesn't this
  re-introduce the need to order update steps and flush between them?
 
 This is a question for Benoît, who made this requirement. I asked him
 the same a while ago and apparently his explanation made some sense to
 me, or I would have remembered that I don't want it. ;-)

The reason behind the multiple journal requirement is that if a block gets
created and deleted in a cyclic way, it can generate cyclic insertion/deletion
journal entries.
The journal could easily be filled if this pathological corner case happens.
When it happens, the dedup code repacks the journal by writing only the
non-redundant information into a new journal and then uses the new one.
It would not be easy to do so if non-dedup journal entries were present in the
journal, hence the multiple journal requirement.

The deduplication also needs two journals, because when the first one is frozen
it takes some time to write the hash table to disk, and new entries must still
be stored somewhere in the meantime. The code cannot block.
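
A sketch of that double-buffering, with hypothetical names:

    typedef struct Journal Journal;
    void start_hash_table_writeout(Journal *j);   /* asynchronous */

    typedef struct DedupJournals {
        Journal *active;   /* receives new hash insertion/deletion entries */
        Journal *frozen;   /* being written out as the on-disk hash table */
    } DedupJournals;

    /* Freeze the active journal and swap in the spare one, so new entries
     * are never blocked while the hash table is written to disk. */
    static void freeze_and_swap(DedupJournals *dj)
    {
        Journal *tmp = dj->frozen;

        dj->frozen = dj->active;
        dj->active = tmp;
        start_hash_table_writeout(dj->frozen);
    }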

 It might have something to do with the fact that deduplication uses the
 journal more as a kind of cache for hash values that can be dropped and
 rebuilt after a crash.

For dedupe, the journal is more of a resume-after-exit tool.

Best regards

Benoît



Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-04 Thread Stefan Hajnoczi
On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
 @@ -103,7 +107,11 @@ in the description of a field.
  write to an image with unknown auto-clear features if it
  clears the respective bits from this field first.
  
 -Bits 0-63:  Reserved (set to 0)
 +Bit 0:  Journal valid bit. This bit indicates that the
 +image contains a valid main journal starting at
 +journal_offset.

Whether the journal is used can be determined from the journal_offset
value (header length must be large enough and journal offset must be
valid).

Why do we need this autoclear bit?

 +Journals are used to allow safe updates of metadata without impacting
 +performance by requiring flushes to order updates to different parts of the
 +metadata.

This sentence is hard to parse.  Maybe something shorter like this:

Journals allow safe metadata updates without the need for carefully
ordering and flushing between update steps.

 +They consist of transactions, which in turn contain operations that
 +are effectively executed atomically. A qcow2 image can have a main image
 +journal that deals with cluster management operations, and additional
 +specific journals can be used by other features like data deduplication.

I'm not sure if multiple journals will work in practice.  Doesn't this
re-introduce the need to order update steps and flush between them?

 +A journal is organised in journal blocks, all of which have a reference count
 +of exactly 1. It starts with a block containing the following journal header:
 +
 +Byte  0 -  7:   Magic ("qjournal" ASCII string)
 +
 +  8 - 11:   Journal size in bytes, including the header
 +
 + 12 - 15:   Journal block size order (block size in bytes = 1 << order)
 +The block size must be at least 512 bytes and must not
 +exceed the cluster size.
 +
 + 16 - 19:   Journal block index of the descriptor for the last
 +transaction that has been synced, starting with 1 for the
 +journal block after the header. 0 is used for empty
 +journals.
 +
 + 20 - 23:   Sequence number of the last transaction that has been
 +synced. 0 is recommended as the initial value.
 +
 + 24 - 27:   Sequence number of the last transaction that has been
 +committed. When replaying a journal, all transactions
 +after the last synced one up to the last committed one
 +must be synced. Note that this may include a wraparound
 +of sequence numbers.
 +
 + 28 - 31:   Checksum (one's complement of the sum of all bytes in the
 +header journal block except those of the checksum field)
 +
 + 32 - 511:  Reserved (set to 0)

I'm not sure if these fields are necessary.  They require updates (and
maybe flush) after every commit and sync.

The fewer metadata updates, the better, not just for performance but
also to reduce the risk of data loss.  If any metadata required to
access the journal is corrupted, the image will be unavailable.

It should be possible to determine this information by scanning the
journal transactions.
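For what such a scan could look like, here is a rough sketch in C. It mirrors
the Qcow2JournalBlock layout from the patch, and it assumes (this part is not
in the draft) that every transaction block's checksum follows the same rule as
the header's; endianness conversion and error handling are omitted:

/* Sketch: recover the last committed sequence number by scanning blocks
 * instead of reading it from the journal header. Names are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_MAGIC 0x716a626b  /* qjbk */

typedef struct JournalBlock {
    uint32_t magic;
    uint32_t checksum;
    uint32_t seq;
    uint32_t desc_offset;
    uint32_t desc_bytes;
    uint32_t nb_data_blocks;
} JournalBlock;

static uint32_t block_checksum(const uint8_t *block, size_t size)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < size; i++) {
        /* bytes 4..7 hold the checksum field itself and are skipped */
        if (i < offsetof(JournalBlock, checksum) ||
            i >= offsetof(JournalBlock, checksum) + sizeof(uint32_t)) {
            sum += block[i];
        }
    }
    return ~sum;  /* one's complement */
}

static uint32_t last_committed_seq(const uint8_t *journal, size_t nb_blocks,
                                   size_t block_size)
{
    uint32_t best = 0;

    for (size_t i = 0; i < nb_blocks; i++) {
        const uint8_t *blk = journal + i * block_size;
        JournalBlock hdr;

        memcpy(&hdr, blk, sizeof(hdr));
        /* the journal header and data blocks fail the magic test; torn
         * writes fail the checksum test, so both are skipped */
        if (hdr.magic != BLOCK_MAGIC ||
            hdr.checksum != block_checksum(blk, block_size)) {
            continue;
        }
        if ((int32_t)(hdr.seq - best) > 0) {
            best = hdr.seq;
        }
    }
    return best;
}

The serial-number style comparison ((int32_t)(a - b) > 0) is what makes the
sequence number wraparound mentioned above harmless to such a scan.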

 +A wraparound may not occur in the middle of a single transaction, but only
 +between two transactions. For the necessary padding an empty descriptor with
 +any number of data blocks can be used as the last entry of the ring.

Why have this limitation?

 +All descriptors start with a common part:
 +
 +Byte  0 -  1:   Descriptor type
 +0 - No-op descriptor
 +1 - Write data block
 +2 - Copy data
 +3 - Revoke
 +4 - Deduplication hash insertion
 +5 - Deduplication hash deletion
 +
 +  2 -  3:   Size of the descriptor in bytes

Data blocks are not included in the descriptor size?  I just want to
make sure that we aren't limited to 64 KB for the actual data.

 +
 +  4 -  n:   Type-specific data
 +
 +The following section specifies the purpose (i.e. the action that is to be
 +performed when syncing) and type-specific data layout of each descriptor 
 type:
 +
 +  * No-op descriptor: No action is to be performed when syncing this
 +descriptor
 +
 +  4 -  n:   Ignored
 +
 +  * Write data block: Write literal data associated with this transaction
 +from the journal to a given offset.
 +
 +  4 -  7:   Length of the data to write in bytes
 +
 +  8 - 15:   Offset in the image file to write the data to
 +
 + 16 - 19:   Index of the journal block at which the data to write
 +starts. The data must be stored sequentially and be fully
 +contained in the data blocks associated 
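For illustration, the common descriptor head plus the write descriptor could
be expressed as packed structs. The field names are invented, and the comment
on the size field assumes, as the text above suggests, that the payload lives
in the journal's data blocks rather than in the descriptor itself:

/* Sketch of the descriptor layout specified above; names are invented. */
#include <stdint.h>

typedef struct __attribute__((packed)) QJDescCommon {
    uint16_t type;        /* bytes 0-1: descriptor type (1 = write data) */
    uint16_t size;        /* bytes 2-3: size of the descriptor in bytes */
} QJDescCommon;

typedef struct __attribute__((packed)) QJDescWrite {
    QJDescCommon common;
    uint32_t length;      /* bytes 4-7: length of the data to write */
    uint64_t offset;      /* bytes 8-15: target offset in the image file */
    uint32_t data_index;  /* bytes 16-19: journal block where the data starts;
                           * if the payload lives in data blocks rather than in
                           * the descriptor, the 16-bit size field does not cap
                           * it at 64 KB */
} QJDescWrite;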

Re: [Qemu-devel] [RFC] qcow2 journalling draft

2013-09-04 Thread Max Reitz

On 2013-09-03 15:45, Kevin Wolf wrote:

[...]

+#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL  /* "qjournal" */
+#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b  /* "qjbk" */
+
+typedef struct Qcow2JournalHeader {
+    uint64_t    magic;
+    uint32_t    journal_size;
+    uint32_t    block_size;
+    uint32_t    synced_index;
+    uint32_t    synced_seq;
+    uint32_t    committed_seq;
+    uint32_t    checksum;
+} QEMU_PACKED Qcow2JournalHeader;
+
+/*
+ * One big transaction per journal block. The transaction is committed either
+ * time based or when a microtransaction (single set of operations that must be
+ * performed atomically) doesn't fit in the same block any more.
+ */
+typedef struct Qcow2JournalBlock {
+    uint32_t    magic;
+    uint32_t    checksum;
+    uint32_t    seq;
+    uint32_t    desc_offset; /* Allow block header extensions */
+    uint32_t    desc_bytes;
+    uint32_t    nb_data_blocks;
+} QEMU_PACKED Qcow2JournalBlock;
+

Why is this in the C file...


diff --git a/block/qcow2.h b/block/qcow2.h
index 1000239..2aee1fd 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion {
  QTAILQ_ENTRY(Qcow2DiscardRegion) next;
  } Qcow2DiscardRegion;
  
+typedef struct Qcow2Journal {

+
+} Qcow2Journal;
+
  typedef struct BDRVQcowState {
  int cluster_bits;
  int cluster_size;
@@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache 
*c, uint64_t offset,
  void **table);
  int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
  
+/* qcow2-journal.c functions */

+
+typedef struct Qcow2JournalTransaction Qcow2JournalTransaction;
+
+enum Qcow2JournalEntryTypeID {
+QJ_DESC_NOOP= 0,
+QJ_DESC_WRITE   = 1,
+QJ_DESC_COPY= 2,
+
+/* required after a cluster is freed and used for other purposes, so that
+ * new (unjournalled) data won't be overwritten with stale metadata */
+QJ_DESC_REVOKE  = 3,
+};
+
+typedef struct 

[Qemu-devel] [RFC] qcow2 journalling draft

2013-09-03 Thread Kevin Wolf
This contains an extension of the qcow2 spec that introduces journalling
to the image format, plus some preliminary type definitions and
function prototypes in the qcow2 code.

Journalling functionality is a crucial feature for the design of data
deduplication, and it will improve the core part of qcow2 by avoiding
cluster leaks on crashes as well as provide an easier way to get a
reliable implementation of performance features like Delayed COW.

At this point of the RFC, it would be most important to review the
on-disk structure. Once we're confident that it can do everything we
want, we can start going into more detail on the qemu side of things.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 block/Makefile.objs   |   2 +-
 block/qcow2-journal.c |  55 ++
 block/qcow2.h         |  78 +++
 docs/specs/qcow2.txt  | 204 +-
 4 files changed, 337 insertions(+), 2 deletions(-)
 create mode 100644 block/qcow2-journal.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 3bb85b5..59be314 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,5 +1,5 @@
 block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
-block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
+block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c
new file mode 100644
index 000..5b20239
--- /dev/null
+++ b/block/qcow2-journal.c
@@ -0,0 +1,55 @@
+/*
+ * qcow2 journalling functions
+ *
+ * Copyright (c) 2013 Kevin Wolf kw...@redhat.com
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "qcow2.h"
+
+#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL  /* "qjournal" */
+#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b  /* "qjbk" */
+
+typedef struct Qcow2JournalHeader {
+    uint64_t    magic;
+    uint32_t    journal_size;
+    uint32_t    block_size;
+    uint32_t    synced_index;
+    uint32_t    synced_seq;
+    uint32_t    committed_seq;
+    uint32_t    checksum;
+} QEMU_PACKED Qcow2JournalHeader;
+
+/*
+ * One big transaction per journal block. The transaction is committed either
+ * time based or when a microtransaction (single set of operations that must be
+ * performed atomically) doesn't fit in the same block any more.
+ */
+typedef struct Qcow2JournalBlock {
+    uint32_t    magic;
+    uint32_t    checksum;
+    uint32_t    seq;
+    uint32_t    desc_offset; /* Allow block header extensions */
+    uint32_t    desc_bytes;
+    uint32_t    nb_data_blocks;
+} QEMU_PACKED Qcow2JournalBlock;
+
diff --git a/block/qcow2.h b/block/qcow2.h
index 1000239..2aee1fd 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion {
 QTAILQ_ENTRY(Qcow2DiscardRegion) next;
 } Qcow2DiscardRegion;
 
+typedef struct Qcow2Journal {
+
+} Qcow2Journal;
+
 typedef struct BDRVQcowState {
 int cluster_bits;
 int cluster_size;
@@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 void **table);
 int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
 
+/* qcow2-journal.c functions */
+
+typedef struct Qcow2JournalTransaction Qcow2JournalTransaction;
+
+enum Qcow2JournalEntryTypeID {
+    QJ_DESC_NOOP    = 0,
+    QJ_DESC_WRITE   = 1,
+    QJ_DESC_COPY    = 2,
+
+    /* required after a cluster is freed and used for other purposes, so that
+     * new (unjournalled) data won't be overwritten with stale metadata */
+    QJ_DESC_REVOKE  = 3,
+};
+
+typedef struct Qcow2JournalEntryType {
+    enum Qcow2JournalEntryTypeID id;
+    int (*sync)(void *buf, size_t size);
+}
