Re: [Qemu-devel] [RFC] qcow2 journalling draft
Kevin, what do you think of this? I could strip down the dedupe journal code to specialize it. If you think it turns out easier than using the journalling infrastructure that we're going to implement anyway, then why not. This question needs some thought. The good thing about the current dedup log is that the code is already written and matches the dedup usage very well. Aside from my personal interest in resuming the work on deduplication quickly, I think we should consider the overall cost of having a specialized log for dedupe. Is the reviewing and maintenance of http://patchwork.ozlabs.org/patch/252955/ a reasonable extra cost? Best regards Benoît
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Am 05.09.2013 um 17:26 hat Benoît Canet geschrieben: Le Thursday 05 Sep 2013 à 11:24:40 (+0200), Stefan Hajnoczi a écrit : On Wed, Sep 04, 2013 at 11:55:23AM +0200, Benoît Canet wrote: I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? This is a question for Benoît, who made this requirement. I asked him the same a while ago and apparently his explanation made some sense to me, or I would have remembered that I don't want it. ;-) The reason behind the multiple journal requirement is that if a block gets created and deleted in a cyclic way, it can generate cyclic insertion/deletion journal entries. The journal could easily be filled if this pathological corner case happens. When it happens, the dedup code repacks the journal by writing only the non-redundant information into a new journal and then uses the new one. It would not be easy to do so if non-dedup journal entries were present in the journal, hence the multiple journal requirement. The deduplication also needs two journals because when the first one is frozen, it takes some time to write the hash table to disk, and new entries must be stored somewhere at the same time. The code cannot block. It might have something to do with the fact that deduplication uses the journal more as a kind of cache for hash values that can be dropped and rebuilt after a crash. For dedupe, the journal is more of a resume-after-exit tool. I'm not sure anymore if dedupe needs the same kind of journal as a metadata journal for qcow2. Since you have a dirty flag to discard the journal on crash, the journal is not used for data integrity. That makes me wonder if the metadata journal is the right structure for dedupe? Maybe your original proposal was fine for dedupe and we just misinterpreted it because we thought this needs to be a safe journal. Kevin, what do you think of this? I could strip down the dedupe journal code to specialize it. 
If you think it turns out easier than using the journalling infrastructure that we're going to implement anyway, then why not. Kevin
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Wed, 09/04 11:39, Kevin Wolf wrote: First of all, excuse any inconsistencies in the following mail. I wrote it from top to bottom, and there was some thought process involved in almost every paragraph... Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben: On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: @@ -103,7 +107,11 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. -Bits 0-63: Reserved (set to 0) +Bit 0: Journal valid bit. This bit indicates that the +image contains a valid main journal starting at +journal_offset. Whether the journal is used can be determined from the journal_offset value (header length must be large enough and journal offset must be valid). Why do we need this autoclear bit? Hm, I introduced this one first and the journal dirty incompatible bit later, perhaps it's unnecessary now. Let's check... The obvious thing we need to protect against is applying stale journal data to an image that has been changed by an older version. As long as the journal is clean, this can't happen, and the journal dirty bit will ensure that the old version can only open the image if it is clean. However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Why can old version qemu-img open the image with dirty journal in the first place? It's incompatible bit. Can we protect against this without using an autoclear bit? +Journals are used to allow safe updates of metadata without impacting +performance by requiring flushes to order updates to different parts of the +metadata. This sentence is hard to parse. Maybe something shorter like this: Journals allow safe metadata updates without the need for carefully ordering and flushing between update steps. Okay, I'll update the text with your proposal. 
+They consist of transactions, which in turn contain operations that +are effectively executed atomically. A qcow2 image can have a main image +journal that deals with cluster management operations, and additional specific +journals can be used by other features like data deduplication. I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? This is a question for Benoît, who made this requirement. I asked him the same a while ago and apparently his explanation made some sense to me, or I would have remembered that I don't want it. ;-) It might have something to do with the fact that deduplication uses the journal more as a kind of cache for hash values that can be dropped and rebuilt after a crash. +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. It starts with a block containing the following journal header: + +Byte 0 - 7: Magic ("qjournal" ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 << order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. + + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. 
They require updates (and maybe flush) after every commit and sync. The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine
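As an aside, the one's-complement header checksum described in the spec quote above can be computed mechanically. The following is a sketch, assuming a 512-byte header block with the checksum stored big-endian at bytes 28-31 (function names are illustrative, not from the patch):

```python
def journal_header_checksum(block: bytes) -> int:
    """One's complement of the sum of all bytes in the header block,
    excluding the four checksum bytes at offsets 28-31."""
    assert len(block) >= 32
    total = sum(block[:28]) + sum(block[32:])
    return ~total & 0xFFFFFFFF

def journal_header_valid(block: bytes) -> bool:
    """Recompute the checksum and compare it to the stored field."""
    stored = int.from_bytes(block[28:32], "big")
    return stored == journal_header_checksum(block)
```

For an all-zero 512-byte block the sum is 0, so the checksum is 0xFFFFFFFF; any single flipped byte outside the checksum field changes the recomputed value and fails validation.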
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Am 06.09.2013 um 11:20 hat Fam Zheng geschrieben: On Wed, 09/04 11:39, Kevin Wolf wrote: First of all, excuse any inconsistencies in the following mail. I wrote it from top to bottom, and there was some thought process involved in almost every paragraph... Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben: On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: @@ -103,7 +107,11 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. -Bits 0-63: Reserved (set to 0) +Bit 0: Journal valid bit. This bit indicates that the +image contains a valid main journal starting at +journal_offset. Whether the journal is used can be determined from the journal_offset value (header length must be large enough and journal offset must be valid). Why do we need this autoclear bit? Hm, I introduced this one first and the journal dirty incompatible bit later, perhaps it's unnecessary now. Let's check... The obvious thing we need to protect against is applying stale journal data to an image that has been changed by an older version. As long as the journal is clean, this can't happen, and the journal dirty bit will ensure that the old version can only open the image if it is clean. However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Why can old version qemu-img open the image with dirty journal in the first place? It's incompatible bit. This is about a clean journal. Kevin
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Tue, 09/03 15:45, Kevin Wolf wrote: This contains an extension of the qcow2 spec that introduces journalling to the image format, plus some preliminary type definitions and function prototypes in the qcow2 code. Journalling functionality is a crucial feature for the design of data deduplication, and it will improve the core part of qcow2 by avoiding cluster leaks on crashes as well as provide an easier way to get a reliable implementation of performance features like Delayed COW. At this point of the RFC, it would be most important to review the on-disk structure. Once we're confident that it can do everything we want, we can start going into more detail on the qemu side of things. Signed-off-by: Kevin Wolf kw...@redhat.com --- block/Makefile.objs | 2 +- block/qcow2-journal.c | 55 ++ block/qcow2.h | 78 +++ docs/specs/qcow2.txt | 204 +- 4 files changed, 337 insertions(+), 2 deletions(-) create mode 100644 block/qcow2-journal.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 3bb85b5..59be314 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,5 +1,5 @@ block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o -block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o +block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c new file mode 100644 index 000..5b20239 --- /dev/null +++ b/block/qcow2-journal.c @@ -0,0 +1,55 @@ +/* + * qcow2 journalling functions + * + * Copyright (c) 2013 Kevin Wolf kw...@redhat.com + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to 
use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include qcow2.h + +#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL /* qjournal */ +#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b /* qjbk */ + +typedef struct Qcow2JournalHeader { +uint64_tmagic; +uint32_tjournal_size; +uint32_tblock_size; +uint32_tsynced_index; +uint32_tsynced_seq; +uint32_tcommitted_seq; +uint32_tchecksum; +} QEMU_PACKED Qcow2JournalHeader; + +/* + * One big transaction per journal block. The transaction is committed either + * time based or when a microtransaction (single set of operations that must be + * performed atomically) doesn't fit in the same block any more. 
+ */ +typedef struct Qcow2JournalBlock { +uint32_tmagic; +uint32_tchecksum; +uint32_tseq; +uint32_tdesc_offset; /* Allow block header extensions */ +uint32_tdesc_bytes; +uint32_tnb_data_blocks; +} QEMU_PACKED Qcow2JournalBlock; + diff --git a/block/qcow2.h b/block/qcow2.h index 1000239..2aee1fd 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion { QTAILQ_ENTRY(Qcow2DiscardRegion) next; } Qcow2DiscardRegion; +typedef struct Qcow2Journal { + +} Qcow2Journal; + typedef struct BDRVQcowState { int cluster_bits; int cluster_size; @@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table); +/* qcow2-journal.c functions */ + +typedef struct Qcow2JournalTransaction Qcow2JournalTransaction; + +enum Qcow2JournalEntryTypeID { +QJ_DESC_NOOP= 0, +QJ_DESC_WRITE = 1, +QJ_DESC_COPY= 2, + +/* required after a cluster is freed and used for other purposes, so that + * new (unjournalled) data won't be overwritten with
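To make the Qcow2JournalHeader layout above concrete, here is how the 32-byte on-disk header could be serialized and parsed. This is an illustrative Python rendering, not part of the patch; it assumes big-endian fields, as elsewhere in the qcow2 format:

```python
import struct

# Mirrors the C struct: magic u64, journal_size u32, block_size u32,
# synced_index u32, synced_seq u32, committed_seq u32, checksum u32.
HEADER_FMT = ">QIIIIII"  # 8 + 6*4 = 32 bytes, packed
QCOW2_JOURNAL_MAGIC = 0x716A6F75726E616C  # "qjournal"

def pack_journal_header(journal_size, block_size_order,
                        synced_index=0, synced_seq=0,
                        committed_seq=0, checksum=0):
    return struct.pack(HEADER_FMT, QCOW2_JOURNAL_MAGIC, journal_size,
                       block_size_order, synced_index, synced_seq,
                       committed_seq, checksum)

def unpack_journal_header(buf):
    fields = struct.unpack_from(HEADER_FMT, buf)
    if fields[0] != QCOW2_JOURNAL_MAGIC:
        raise ValueError("not a qcow2 journal header")
    return fields
```

Note that the C struct names the third field `block_size` while the spec text stores a block size *order*; the sketch follows the spec text.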
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Fri, 09/06 11:57, Kevin Wolf wrote: Am 06.09.2013 um 11:20 hat Fam Zheng geschrieben: On Wed, 09/04 11:39, Kevin Wolf wrote: First of all, excuse any inconsistencies in the following mail. I wrote it from top to bottom, and there was some thought process involved in almost every paragraph... Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben: On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: @@ -103,7 +107,11 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. -Bits 0-63: Reserved (set to 0) +Bit 0: Journal valid bit. This bit indicates that the +image contains a valid main journal starting at +journal_offset. Whether the journal is used can be determined from the journal_offset value (header length must be large enough and journal offset must be valid). Why do we need this autoclear bit? Hm, I introduced this one first and the journal dirty incompatible bit later, perhaps it's unnecessary now. Let's check... The obvious thing we need to protect against is applying stale journal data to an image that has been changed by an older version. As long as the journal is clean, this can't happen, and the journal dirty bit will ensure that the old version can only open the image if it is clean. However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Why can old version qemu-img open the image with dirty journal in the first place? It's incompatible bit. This is about a clean journal. Ah yes, I get it, thanks. Fam
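The interplay between the two feature classes discussed above can be sketched with a toy model: an implementation refuses to open an image with unknown incompatible bits, but clears unknown autoclear bits on write access, which is exactly what protects the journal from stale replay after an old qemu-img touched the image. Bit positions and names here are illustrative, not from the spec:

```python
JOURNAL_DIRTY_INCOMPAT = 1 << 0   # illustrative: journal dirty, incompatible
JOURNAL_VALID_AUTOCLEAR = 1 << 0  # illustrative: journal valid, autoclear

def open_image(header, known_incompat=0, known_autoclear=0):
    """Model of open-for-write semantics: fail on unknown incompatible
    bits, clear unknown autoclear bits. Returns the resulting header."""
    if header["incompatible"] & ~known_incompat:
        raise IOError("unknown incompatible features, refusing to open")
    header = dict(header)
    header["autoclear"] &= known_autoclear
    return header
```

An old qemu-img knows neither bit: it refuses an image whose journal is dirty (incompatible bit set), while a clean-journal image opens and loses the journal-valid autoclear bit, so a newer version will not replay journal clusters that the old version may have reclaimed as leaks.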
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote: First of all, excuse any inconsistencies in the following mail. I wrote it from top to bottom, and there was some thought process involved in almost every paragraph... I should add this disclaimer to all my emails ;-). Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben: On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: @@ -103,7 +107,11 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. -Bits 0-63: Reserved (set to 0) +Bit 0: Journal valid bit. This bit indicates that the +image contains a valid main journal starting at +journal_offset. Whether the journal is used can be determined from the journal_offset value (header length must be large enough and journal offset must be valid). Why do we need this autoclear bit? Hm, I introduced this one first and the journal dirty incompatible bit later, perhaps it's unnecessary now. Let's check... The obvious thing we need to protect against is applying stale journal data to an image that has been changed by an older version. As long as the journal is clean, this can't happen, and the journal dirty bit will ensure that the old version can only open the image if it is clean. However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Can we protect against this without using an autoclear bit? You are right. It's a weird case I didn't think of but it could happen. An autoclear bit sounds like the simplest solution. Please document this scenario. +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. 
It starts with a block containing the following journal header: + +Byte 0 - 7: Magic ("qjournal" ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 << order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. + + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. They require updates (and maybe flush) after every commit and sync. The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine this information by scanning the journal transactions. This is rather handwavy. Can you elaborate how this would work in detail? For example, let's assume we get to read this journal (a journal can be rather large, too, so I'm not sure if we want to read it in completely):

- Descriptor, seq 42, 2 data blocks
- Data block
- Data block
- Data block starting with qjbk
- Data block
- Descriptor, seq 7, 0 data blocks
- Descriptor, seq 8, 1 data block
- Data block

Which of these have already been synced? Which have been committed? 
I guess we could introduce an is_committed flag in the descriptor, but wouldn't correct operation look like this then: 1. Write out the descriptor with the commit flag clear, plus any data blocks. 2. Flush. 3. Rewrite the descriptor with the commit flag set. This ensures that the commit flag is only set if all the required data is indeed stable on disk. What has changed compared to this proposal is just the offset at which you write in step 3 (header vs. descriptor). A commit flag cannot be relied upon. A transaction can be corrupted after being committed, or
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Wed, Sep 04, 2013 at 11:55:23AM +0200, Benoît Canet wrote: I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? This is a question for Benoît, who made this requirement. I asked him the same a while ago and apparently his explanation made some sense to me, or I would have remembered that I don't want it. ;-) The reason behind the multiple journal requirement is that if a block gets created and deleted in a cyclic way, it can generate cyclic insertion/deletion journal entries. The journal could easily be filled if this pathological corner case happens. When it happens, the dedup code repacks the journal by writing only the non-redundant information into a new journal and then uses the new one. It would not be easy to do so if non-dedup journal entries were present in the journal, hence the multiple journal requirement. The deduplication also needs two journals because when the first one is frozen, it takes some time to write the hash table to disk, and new entries must be stored somewhere at the same time. The code cannot block. It might have something to do with the fact that deduplication uses the journal more as a kind of cache for hash values that can be dropped and rebuilt after a crash. For dedupe, the journal is more of a resume-after-exit tool. I'm not sure anymore if dedupe needs the same kind of journal as a metadata journal for qcow2. Since you have a dirty flag to discard the journal on crash, the journal is not used for data integrity. That makes me wonder if the metadata journal is the right structure for dedupe? Maybe your original proposal was fine for dedupe and we just misinterpreted it because we thought this needs to be a safe journal. Stefan
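The two-journal scheme Benoît describes (freeze one journal while its hash table is written out, keep appending to the other so the code never blocks) could look roughly like this. A hedged sketch only: class and method names are invented, and `flush_hash_table` stands in for the slow hash-table write-out:

```python
class DedupJournalPair:
    """Toy model of the dedup double-journal: appends never block on the
    hash-table write-out, because a fresh journal takes over first."""
    def __init__(self):
        self.active = []   # journal currently receiving entries
        self.frozen = None # journal being flushed to the hash table

    def append(self, entry):
        self.active.append(entry)  # never blocks

    def freeze_and_swap(self):
        # Freeze the current journal and start a fresh one.
        self.frozen, self.active = self.active, []

    def sync(self, flush_hash_table):
        # Slow operation; meanwhile new entries keep going to self.active.
        flush_hash_table(self.frozen)
        self.frozen = None
```

The repacking case Benoît mentions would work the same way: write the non-redundant entries into a fresh journal and swap, which is only straightforward because no non-dedup entries are mixed in.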
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: This contains an extension of the qcow2 spec that introduces journalling to the image format, plus some preliminary type definitions and function prototypes in the qcow2 code. Journalling functionality is a crucial feature for the design of data deduplication, and it will improve the core part of qcow2 by avoiding cluster leaks on crashes as well as provide an easier way to get a reliable implementation of performance features like Delayed COW. At this point of the RFC, it would be most important to review the on-disk structure. Once we're confident that it can do everything we want, we can start going into more detail on the qemu side of things. Signed-off-by: Kevin Wolf kw...@redhat.com --- block/Makefile.objs | 2 +- block/qcow2-journal.c | 55 ++ block/qcow2.h | 78 +++ docs/specs/qcow2.txt | 204 +- 4 files changed, 337 insertions(+), 2 deletions(-) create mode 100644 block/qcow2-journal.c Although we are still discussing details of the on-disk layout, the general design is clear enough to discuss how the journal will be used. Today qcow2 uses Qcow2Cache to do lazy, ordered metadata updates. The performance is pretty good with two exceptions that I can think of: 1. The delayed CoW problem that Kevin has been working on. Guests perform sequential writes that are smaller than a qcow2 cluster. The first write triggers a copy-on-write of the full cluster. Later writes then overwrite the copied data. It would be more efficient to anticipate sequential writes and hold off on CoW where possible. 2. Lazy metadata updates lead to bursty behavior and expensive flushes. We do not take advantage of disk bandwidth since metadata updates stay in the Qcow2Cache until the last possible second. When the guest issues a flush we must write out dirty Qcow2Cache entries and possibly fsync between them if dependencies have been set (e.g. refcount before L2). How will the journal change this situation? 
Writes that go through the journal are doubled - they must first be journalled, fsync, and then they can be applied to the actual image. How do we benefit by using the journal? Stefan
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Am 05.09.2013 um 11:21 hat Stefan Hajnoczi geschrieben: On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote: However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Can we protect against this without using an autoclear bit? You are right. It's a weird case I didn't think of but it could happen. An autoclear bit sounds like the simplest solution. Please document this scenario. Okay, I've updated the description as follows: Bit 0: Journal valid bit. This bit indicates that the image contains a valid main journal starting at journal_offset; it is used to mark journals invalid if the image was opened by older implementations that may have reclaimed the journal clusters that would appear as leaked clusters to them. +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. It starts with a block containing the following journal header: + +Byte 0 - 7: Magic (qjournal ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. 
+ + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. They require updates (and maybe flush) after every commit and sync. The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine this information by scanning the journal transactions. This is rather handwavy. Can you elaborate how this would work in detail? For example, let's assume we get to read this journal (a journal can be rather large, too, so I'm not sure if we want to read it in completely): - Descriptor, seq 42, 2 data blocks - Data block - Data block - Data block starting with qjbk - Data block - Descriptor, seq 7, 0 data blocks - Descriptor, seq 8, 1 data block - Data block Which of these have already been synced? Which have been committed? So what's your algorithm for this? I guess we could introduce an is_commited flag in the descriptor, but wouldn't correct operation look like this then: 1. Write out descriptor commit flag clear and any data blocks 2. Flush 3. Rewrite descriptor with commit flag set This ensures that the commit flag is only set if all the required data is indeed stable on disk. What has changed compared to this proposal is just the offset at which you write in step 3 (header vs. descriptor). A commit flag cannot be relied upon. A transaction can be corrupted after being committed, or it can be corrupted due to power failure while writing the transaction. In both cases we have an invalid transaction and we must discard it. No, I believe it is vitally important to distinguish these two cases. If a transaction was corrupted due to power failure while writing the transaction, then we can simply discard it indeed. 
If, however, a transaction was committed and gets corrupted after the fact, then we have a problem because the data on the disk is laid out as described by on-disk metadata (e.g. L2 tables) _with the journal fully applied_. The replay and consequently bdrv_open() must fail in this case. The first case is handled by any information that tells us whether the transaction is already committed; the second should never happen, but would be caught by a checksum. The checksum
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Am 05.09.2013 um 11:35 hat Stefan Hajnoczi geschrieben: Although we are still discussing details of the on-disk layout, the general design is clear enough to discuss how the journal will be used. Today qcow2 uses Qcow2Cache to do lazy, ordered metadata updates. The performance is pretty good with two exceptions that I can think of: 1. The delayed CoW problem that Kevin has been working on. Guests perform sequential writes that are smaller than a qcow2 cluster. The first write triggers a copy-on-write of the full cluster. Later writes then overwrite the copied data. It would be more efficient to anticipate sequential writes and hold off on CoW where possible. To be clear, more efficient can mean a plus of 50% and more. COW overhead is the only major overhead compared to raw when looking at normal cluster allocations. So this is something that is really important for cluster allocation performance. The patches that I posted a while ago showed that it's possible to do this without a journal, however the flush operation became very complex (which we all found rather scary) and required that the COW be completed before signalling flush completion. With a journal, the only thing that you need to do on a flush is to commit all transactions, i.e. write them out and bdrv_flush(bs->file). The actual data copy of the COW (i.e. the sync) can be further delayed and doesn't have to happen at commit time as it would have without a journal. 2. Lazy metadata updates lead to bursty behavior and expensive flushes. We do not take advantage of disk bandwidth since metadata updates stay in the Qcow2Cache until the last possible second. When the guest issues a flush we must write out dirty Qcow2Cache entries and possibly fsync between them if dependencies have been set (e.g. refcount before L2). Hm, have we ever measured the impact of this? 
I don't think a journal can make a fundamental difference here - either you write only at the last possible second (today flush, with a journal commit), or you write out more data than strictly necessary. How will the journal change this situation? Writes that go through the journal are doubled - they must first be journalled, fsync, and then they can be applied to the actual image. How do we benefit by using the journal? I believe Delayed COW is a pretty strong one. But there are more cases in which performance isn't that great. I think you refer to the simple case with a normal empty image where new clusters are allocated, which is pretty good indeed if we ignore COW. Trouble starts when you also free clusters, which happens for example with internal COW (internal snapshots, compressed images) or discard. Deduplication as well in the future, I suppose. Then you get very quickly alternating sequences of L2 depends on refcount update (for allocation) and refcount update depends on L2 update (for freeing), which means that Qcow2Cache starts flushing all the time without accumulating many requests. These are cases that would benefit as well from the atomicity of journal transactions. And then, of course, we still leak clusters on failed operations. With a journal, this wouldn't happen any more and the image would always stay consistent (instead of only corruption-free). Kevin
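Kevin's point about alternating dependencies can be made concrete with a toy model: whenever the pending dependency direction between the refcount table and an L2 table flips, Qcow2Cache must flush before proceeding, so an alternating alloc/free workload flushes on almost every operation, while a journal could batch the same operations into a single transaction. The model below is illustrative, not qcow2 code:

```python
def flushes_without_journal(ops):
    """Count forced flushes when table-update dependencies alternate.
    'alloc' => L2 update depends on a prior refcount update;
    'free'  => refcount update depends on a prior L2 update.
    A flush is forced whenever the pending direction changes."""
    flushes = 0
    pending = None
    for op in ops:
        direction = "ref->l2" if op == "alloc" else "l2->ref"
        if pending is not None and pending != direction:
            flushes += 1  # dependency flip: write out the cache first
        pending = direction
    return flushes
```

For 16 alternating alloc/free operations this forces 15 flushes, versus none for 16 pure allocations; with a journal, all 16 could land in one committed transaction.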
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Then you get very quickly alternating sequences of L2 depends on refcount update (for allocation) and refcount update depends on L2 update (for freeing), which means that Qcow2Cache starts flushing all the time without accumulating many requests. These are cases that would benefit as well from the atomicity of journal transactions. True, deduplication can hit this case on delete if I remember correctly, and it slows down everything. Best regards Benoît
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Le Thursday 05 Sep 2013 à 11:24:40 (+0200), Stefan Hajnoczi a écrit : On Wed, Sep 04, 2013 at 11:55:23AM +0200, Benoît Canet wrote: I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? This is a question for Benoît, who made this requirement. I asked him the same a while ago and apparently his explanation made some sense to me, or I would have remembered that I don't want it. ;-) The reason behind the multiple journal requirement is that if a block gets created and deleted in a cyclic way, it can generate cyclic insertion/deletion journal entries. The journal could easily be filled if this pathological corner case happens. When it happens, the dedup code repacks the journal by writing only the non-redundant information into a new journal and then uses the new one. It would not be easy to do so if non-dedup journal entries were present in the journal, hence the multiple journal requirement. The deduplication also needs two journals because when the first one is frozen, it takes some time to write the hash table to disk, and new entries must be stored somewhere at the same time. The code cannot block. It might have something to do with the fact that deduplication uses the journal more as a kind of cache for hash values that can be dropped and rebuilt after a crash. For dedupe, the journal is more of a resume-after-exit tool. I'm not sure anymore if dedupe needs the same kind of journal as a metadata journal for qcow2. Since you have a dirty flag to discard the journal on crash, the journal is not used for data integrity. That makes me wonder if the metadata journal is the right structure for dedupe? Maybe your original proposal was fine for dedupe and we just misinterpreted it because we thought this needs to be a safe journal. Kevin, what do you think of this? I could strip down the dedupe journal code to specialize it. Best regards Benoît
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Thu, Sep 5, 2013 at 1:18 PM, Kevin Wolf kw...@redhat.com wrote: Am 05.09.2013 um 11:21 hat Stefan Hajnoczi geschrieben: On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote: However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Can we protect against this without using an autoclear bit? You are right. It's a weird case I didn't think of but it could happen. An autoclear bit sounds like the simplest solution. Please document this scenario. Okay, I've updated the description as follows: Bit 0: Journal valid bit. This bit indicates that the image contains a valid main journal starting at journal_offset; it is used to mark journals invalid if the image was opened by older implementations that may have reclaimed the journal clusters that would appear as leaked clusters to them. Great, thanks. +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. It starts with a block containing the following journal header: + +Byte 0 - 7: Magic (qjournal ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 << order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. 
+ + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. They require updates (and maybe flush) after every commit and sync. The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine this information by scanning the journal transactions. This is rather handwavy. Can you elaborate how this would work in detail? For example, let's assume we get to read this journal (a journal can be rather large, too, so I'm not sure if we want to read it in completely): - Descriptor, seq 42, 2 data blocks - Data block - Data block - Data block starting with qjbk - Data block - Descriptor, seq 7, 0 data blocks - Descriptor, seq 8, 1 data block - Data block Which of these have already been synced? Which have been committed? So what's your algorithm for this? Scan the journal to find unsynced transactions, if they exist: last_sync_seq = 0 last_seqno = 0 while True: block = journal[(i++) % journal_nblocks] if i >= journal_nblocks * 2: break # avoid infinite loop if block.magic != 'qjbk': continue if block.seqno < last_seqno: # Wrapped around to oldest transaction break elif block.seqno == last_seqno: # Corrupt journal, sequence number should be # monotonically increasing raise InvalidJournalException if block.last_sync_seq != last_sync_seq: last_sync_seq = block.last_sync_seq last_seqno = block.seqno print 'First unsynced block seq no:', last_sync_seq print 'Last block seq no:', last_seqno This is broken pseudocode, but hopefully the idea makes sense. I guess we could introduce an is_committed flag in the descriptor, but wouldn't correct operation look like this then: 1. Write out the descriptor with the commit flag clear, and any data blocks 2. Flush 3. 
Rewrite descriptor with commit flag set This ensures that the commit flag is only set if all the required data is indeed stable on disk. What has changed compared to this proposal is just the offset at which you write in step 3 (header vs. descriptor). A commit flag cannot be relied
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Am 05.09.2013 um 16:55 hat Stefan Hajnoczi geschrieben: On Thu, Sep 5, 2013 at 1:18 PM, Kevin Wolf kw...@redhat.com wrote: Am 05.09.2013 um 11:21 hat Stefan Hajnoczi geschrieben: On Wed, Sep 04, 2013 at 11:39:51AM +0200, Kevin Wolf wrote: +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. It starts with a block containing the following journal header: + +Byte 0 - 7: Magic (qjournal ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 << order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. + + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. They require updates (and maybe flush) after every commit and sync. The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine this information by scanning the journal transactions. This is rather handwavy. Can you elaborate how this would work in detail? 
For example, let's assume we get to read this journal (a journal can be rather large, too, so I'm not sure if we want to read it in completely): - Descriptor, seq 42, 2 data blocks - Data block - Data block - Data block starting with qjbk - Data block - Descriptor, seq 7, 0 data blocks - Descriptor, seq 8, 1 data block - Data block Which of these have already been synced? Which have been committed? So what's your algorithm for this? Scan the journal to find unsynced transactions, if they exist: last_sync_seq = 0 last_seqno = 0 while True: block = journal[(i++) % journal_nblocks] if i >= journal_nblocks * 2: break # avoid infinite loop if block.magic != 'qjbk': continue Important implication: This doesn't allow data blocks starting with 'qjbk'. Otherwise you're not even guaranteed to find a descriptor block to start your search with. The second time you make this assumption is when there are stale data blocks in the unused area between the head and tail of the journal. if block.seqno < last_seqno: # Wrapped around to oldest transaction break Why can you stop here? There might be transactions in the second half of the journal that aren't synced yet. elif block.seqno == last_seqno: # Corrupt journal, sequence number should be # monotonically increasing raise InvalidJournalException if block.last_sync_seq != last_sync_seq: last_sync_seq = block.last_sync_seq The 'if' doesn't add anything here, so you end up using the last_sync_seq field of the last valid descriptor. last_seqno = block.seqno print 'First unsynced block seq no:', last_sync_seq print 'Last block seq no:', last_seqno This is broken pseudocode, but hopefully the idea makes sense. One additional thought that might make the thing a bit more interesting: Sequence numbers can wrap around as well. Kevin
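A corrected version of the sketch, folding in Kevin's objections (don't stop at the first sequence-number decrease, since newer transactions may still follow later in the ring, and take last_sync_seq from the newest valid descriptor), might look roughly like this. This is purely illustrative: the `Desc` record is a made-up stand-in for a parsed descriptor block, `None` marks blocks whose magic is not 'qjbk', and 32-bit seqno wraparound is deliberately not handled.

```python
from collections import namedtuple

# Hypothetical parsed descriptor block: its sequence number and the
# last_sync_seq value it recorded when it was written.
Desc = namedtuple('Desc', ['seqno', 'last_sync_seq'])

def scan_journal(blocks):
    """Recover (last_sync_seq, last_seqno) by scanning all descriptors.

    Unlike the thread's pseudocode, this makes a single full pass and
    simply keeps the descriptor with the highest sequence number instead
    of breaking on the first decrease (Kevin's objection). Duplicate
    sequence numbers indicate corruption.
    """
    last_sync_seq = 0
    last_seqno = 0
    seen = set()
    for block in blocks:
        if block is None:          # magic != 'qjbk': plain data block, skip
            continue
        if block.seqno in seen:    # seqnos must be unique in the ring
            raise ValueError('corrupt journal: duplicate sequence number')
        seen.add(block.seqno)
        if block.seqno > last_seqno:   # newest valid descriptor wins
            last_seqno = block.seqno
            last_sync_seq = block.last_sync_seq
    return last_sync_seq, last_seqno
```

On the example journal above (descriptors with seq 42, 7 and 8), the scan would report seq 42 as the newest transaction and its recorded last_sync_seq as the sync point, regardless of the order in which the descriptors appear in the ring.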
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On 09/05/2013 09:20 AM, Kevin Wolf wrote: One additional thought that might make the thing a bit more interesting: Sequence numbers can wrap around as well. On the other hand, if sequence numbers are 64-bit, the number of operations required to cause a wrap far exceeds the expected lifetime of any of us on this list, and we can safely assume it to be a non-issue. (There's other places in qemu where we intentionally have an abort() if a 64-bit number would wrap...) -- Eric Blake, eblake redhat com, +1-919-301-3266, Libvirt virtualization library http://libvirt.org
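If the sequence-number fields do stay 32-bit as in the draft header (bytes 20 - 27), wraparound can instead be tolerated with serial-number arithmetic in the style of RFC 1982; a sketch, not taken from any actual implementation:

```python
def seq_after(a, b, bits=32):
    """True if sequence number `a` is logically newer than `b`, treating
    the fixed-width number space as circular (RFC 1982 style).

    The comparison is only meaningful while the two numbers are less
    than half the number space apart, which holds as long as the journal
    cannot contain that many live transactions at once.
    """
    mask = (1 << bits) - 1
    half = 1 << (bits - 1)
    return a != b and ((a - b) & mask) < half
```

With this helper, "break on seqno decrease" style checks become `if not seq_after(block.seqno, last_seqno)`, and a journal whose sequence numbers wrap from 0xFFFFFFFF back to small values still compares correctly.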
Re: [Qemu-devel] [RFC] qcow2 journalling draft
+They consist of transactions, which in turn contain operations that +are effectively executed atomically. A qcow2 image can have a main image +journal that deals with cluster management operations, and additional specific +journals can be used by other features like data deduplication. I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? The flush and "data has reached stable storage" requirements of the deduplication journal are very weak. The deduplication code maintains an incompatible dedup dirty flag, flushes the journal on exit, then clears the flag. If the flag is set at startup, all deduplication metadata and journal content are dropped; this does not harm the image file in any way. The code just starts over.
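The lifecycle described here — dirty flag set while the image is open, everything dropped wholesale after an unclean shutdown — can be modelled roughly as follows. This is a toy Python model with hypothetical names; in reality the flag would be an incompatible feature bit in the qcow2 header, not an object attribute.

```python
class Image:
    """Toy stand-in for qcow2 image state relevant to dedup."""
    def __init__(self, dedup_dirty=False):
        self.dedup_dirty = dedup_dirty
        self.dedup_metadata = {'somehash': 'stale'}
        self.journal = ['stale-entry']

class DedupJournal:
    """Weak-consistency journal: safe to discard, never replayed for
    integrity -- it only lets a clean shutdown resume where it left off."""
    def __init__(self, image):
        self.image = image
        if image.dedup_dirty:
            # Unclean shutdown: the journal is only a resume-after-exit
            # cache, so dropping everything is safe. Just start over.
            image.dedup_metadata = {}
            image.journal = []
        image.dedup_dirty = True      # mark dirty for the whole session

    def close(self):
        # A real implementation would flush pending journal entries to
        # disk here, and only then clear the flag.
        self.image.dedup_dirty = False
```

Because a crash merely costs the accumulated hash cache, no flush ordering between the dedup journal and the main metadata journal is ever required — which is the point Stefan and Kevin converge on above.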
Re: [Qemu-devel] [RFC] qcow2 journalling draft
First of all, excuse any inconsistencies in the following mail. I wrote it from top to bottom, and there was some thought process involved in almost every paragraph... Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben: On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: @@ -103,7 +107,11 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. -Bits 0-63: Reserved (set to 0) +Bit 0: Journal valid bit. This bit indicates that the +image contains a valid main journal starting at +journal_offset. Whether the journal is used can be determined from the journal_offset value (header length must be large enough and journal offset must be valid). Why do we need this autoclear bit? Hm, I introduced this one first and the journal dirty incompatible bit later, perhaps it's unnecessary now. Let's check... The obvious thing we need to protect against is applying stale journal data to an image that has been changed by an older version. As long as the journal is clean, this can't happen, and the journal dirty bit will ensure that the old version can only open the image if it is clean. However, what if we run 'qemu-img check -r leaks' with an old qemu-img version? It will reclaim the clusters used by the journal, and if we continue using the journal we'll corrupt whatever new data is there now. Can we protect against this without using an autoclear bit? +Journals are used to allow safe updates of metadata without impacting +performance by requiring flushes to order updates to different parts of the +metadata. This sentence is hard to parse. Maybe something shorter like this: Journals allow safe metadata updates without the need for carefully ordering and flushing between update steps. Okay, I'll update the text with your proposal. +They consist of transactions, which in turn contain operations that +are effectively executed atomically. 
A qcow2 image can have a main image +journal that deals with cluster management operations, and additional specific +journals can be used by other features like data deduplication. I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? This is a question for Benoît, who made this requirement. I asked him the same a while ago and apparently his explanation made some sense to me, or I would have remembered that I don't want it. ;-) It might have something to do with the fact that deduplication uses the journal more as a kind of cache for hash values that can be dropped and rebuilt after a crash. +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. It starts with a block containing the following journal header: + +Byte 0 - 7: Magic (qjournal ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 << order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. + + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. They require updates (and maybe flush) after every commit and sync. 
The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine this information by scanning the journal transactions. This is rather handwavy. Can you elaborate how this would work in detail? For example, let's assume we get to read this journal (a journal can be rather large, too, so I'm not sure if we want to read it in
Re: [Qemu-devel] [RFC] qcow2 journalling draft
Am 04.09.2013 um 10:32 hat Max Reitz geschrieben: On 2013-09-03 15:45, Kevin Wolf wrote: This contains an extension of the qcow2 spec that introduces journalling to the image format, plus some preliminary type definitions and function prototypes in the qcow2 code. Journalling functionality is a crucial feature for the design of data deduplication, and it will improve the core part of qcow2 by avoiding cluster leaks on crashes as well as provide an easier way to get a reliable implementation of performance features like Delayed COW. At this point of the RFC, it would be most important to review the on-disk structure. Once we're confident that it can do everything we want, we can start going into more detail on the qemu side of things. Signed-off-by: Kevin Wolf kw...@redhat.com --- block/Makefile.objs | 2 +- block/qcow2-journal.c | 55 ++ block/qcow2.h | 78 +++ docs/specs/qcow2.txt | 204 +- 4 files changed, 337 insertions(+), 2 deletions(-) create mode 100644 block/qcow2-journal.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 3bb85b5..59be314 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,5 +1,5 @@ block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o -block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o +block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c new file mode 100644 index 000..5b20239 --- /dev/null +++ b/block/qcow2-journal.c @@ -0,0 +1,55 @@ +/* + * qcow2 journalling functions + * + * Copyright (c) 2013 Kevin Wolf kw...@redhat.com + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without 
restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include qcow2.h + +#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL /* qjournal */ +#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b /* qjbk */ + +typedef struct Qcow2JournalHeader { +uint64_tmagic; +uint32_tjournal_size; +uint32_tblock_size; +uint32_tsynced_index; +uint32_tsynced_seq; +uint32_tcommitted_seq; +uint32_tchecksum; +} QEMU_PACKED Qcow2JournalHeader; + +/* + * One big transaction per journal block. The transaction is committed either + * time based or when a microtransaction (single set of operations that must be + * performed atomically) doesn't fit in the same block any more. + */ +typedef struct Qcow2JournalBlock { +uint32_tmagic; +uint32_tchecksum; +uint32_tseq; +uint32_tdesc_offset; /* Allow block header extensions */ +uint32_tdesc_bytes; +uint32_tnb_data_blocks; +} QEMU_PACKED Qcow2JournalBlock; + Why is this in the C file... 
diff --git a/block/qcow2.h b/block/qcow2.h index 1000239..2aee1fd 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion { QTAILQ_ENTRY(Qcow2DiscardRegion) next; } Qcow2DiscardRegion; +typedef struct Qcow2Journal { + +} Qcow2Journal; + typedef struct BDRVQcowState { int cluster_bits; int cluster_size; @@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table); +/* qcow2-journal.c functions */ + +typedef struct Qcow2JournalTransaction Qcow2JournalTransaction; + +enum Qcow2JournalEntryTypeID { +QJ_DESC_NOOP= 0, +QJ_DESC_WRITE = 1, +QJ_DESC_COPY= 2, + +/* required after a cluster is
Re: [Qemu-devel] [RFC] qcow2 journalling draft
I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? This is a question for Benoît, who made this requirement. I asked him the same a while ago and apparently his explanation made some sense to me, or I would have remembered that I don't want it. ;-) The reason behind the multiple journal requirement is that if a block gets created and deleted in a cyclic way, it can generate cyclic insertion/deletion journal entries. The journal could easily be filled if this pathological corner case happens. When it does, the dedup code repacks the journal by writing only the non-redundant information into a new journal and then uses the new one. It would not be easy to do so if non-dedup journal entries were present in the journal, hence the multiple journal requirement. The deduplication also needs two journals because when the first one is frozen, it takes some time to write the hash table to disk, and new entries must be stored somewhere in the meantime. The code cannot block. It might have something to do with the fact that deduplication uses the journal more as a kind of cache for hash values that can be dropped and rebuilt after a crash. For dedupe the journal is more a resume-after-exit tool. Best regards Benoît
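The repacking step described here — writing only the non-redundant information into a new journal — amounts to cancelling matched insert/delete pairs per hash, so that cyclic churn on the same block collapses to its net effect. A hypothetical sketch (the real entry format is the dedup hash insertion/deletion descriptors from the draft spec, modelled here as simple `('insert'|'delete', hash)` tuples):

```python
def repack(entries):
    """Collapse cyclic insert/delete churn: only the net effect per hash
    survives into the new journal, in first-touch order."""
    net = {}                            # hash -> surviving entry
    for op, h in entries:
        if op == 'insert':
            net[h] = ('insert', h)      # latest state for this hash: present
        elif op == 'delete':
            if h in net and net[h][0] == 'insert':
                del net[h]              # insert followed by delete cancels out
            else:
                net[h] = ('delete', h)  # delete of a pre-existing entry stays
    return list(net.values())
```

A journal filled by a block being allocated and freed in a loop thus repacks to almost nothing, which is exactly why mixing non-dedup entries into the same journal would get in the way: they cannot be collapsed by this rule.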
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote: @@ -103,7 +107,11 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. -Bits 0-63: Reserved (set to 0) +Bit 0: Journal valid bit. This bit indicates that the +image contains a valid main journal starting at +journal_offset. Whether the journal is used can be determined from the journal_offset value (header length must be large enough and journal offset must be valid). Why do we need this autoclear bit? +Journals are used to allow safe updates of metadata without impacting +performance by requiring flushes to order updates to different parts of the +metadata. This sentence is hard to parse. Maybe something shorter like this: Journals allow safe metadata updates without the need for carefully ordering and flushing between update steps. +They consist of transactions, which in turn contain operations that +are effectively executed atomically. A qcow2 image can have a main image +journal that deals with cluster management operations, and additional specific +journals can be used by other features like data deduplication. I'm not sure if multiple journals will work in practice. Doesn't this re-introduce the need to order update steps and flush between them? +A journal is organised in journal blocks, all of which have a reference count +of exactly 1. It starts with a block containing the following journal header: + +Byte 0 - 7: Magic (qjournal ASCII string) + + 8 - 11: Journal size in bytes, including the header + + 12 - 15: Journal block size order (block size in bytes = 1 << order) +The block size must be at least 512 bytes and must not +exceed the cluster size. + + 16 - 19: Journal block index of the descriptor for the last +transaction that has been synced, starting with 1 for the +journal block after the header. 0 is used for empty +journals. + + 20 - 23: Sequence number of the last transaction that has been +synced. 
0 is recommended as the initial value. + + 24 - 27: Sequence number of the last transaction that has been +committed. When replaying a journal, all transactions +after the last synced one up to the last commit one must be +synced. Note that this may include a wraparound of sequence +numbers. + + 28 - 31: Checksum (one's complement of the sum of all bytes in the +header journal block except those of the checksum field) + + 32 - 511: Reserved (set to 0) I'm not sure if these fields are necessary. They require updates (and maybe flush) after every commit and sync. The fewer metadata updates, the better, not just for performance but also to reduce the risk of data loss. If any metadata required to access the journal is corrupted, the image will be unavailable. It should be possible to determine this information by scanning the journal transactions. +A wraparound may not occur in the middle of a single transaction, but only +between two transactions. For the necessary padding an empty descriptor with +any number of data blocks can be used as the last entry of the ring. Why have this limitation? +All descriptors start with a common part: + +Byte 0 - 1: Descriptor type +0 - No-op descriptor +1 - Write data block +2 - Copy data +3 - Revoke +4 - Deduplication hash insertion +5 - Deduplication hash deletion + + 2 - 3: Size of the descriptor in bytes Data blocks are not included in the descriptor size? I just want to make sure that we aren't limited to 64 KB for the actual data. + + 4 - n: Type-specific data + +The following section specifies the purpose (i.e. the action that is to be +performed when syncing) and type-specific data layout of each descriptor type: + + * No-op descriptor: No action is to be performed when syncing this descriptor + + 4 - n: Ignored + + * Write data block: Write literal data associated with this transaction from +the journal to a given offset. 
+ + 4 - 7: Length of the data to write in bytes + + 8 - 15: Offset in the image file to write the data to + + 16 - 19: Index of the journal block at which the data to write +starts. The data must be stored sequentially and be fully +contained in the data blocks associated
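For reference, the header checksum defined above (bytes 28 - 31: one's complement of the sum of all bytes in the header journal block, excluding the checksum field itself) could be computed like this — a sketch against the draft's 512-byte header layout, not code from any implementation:

```python
def header_checksum(block):
    """One's complement (truncated to 32 bits) of the byte sum over the
    header journal block, skipping the checksum field at bytes 28-31."""
    assert len(block) >= 512, 'header block is at least 512 bytes'
    total = sum(block[:28]) + sum(block[32:])
    return ~total & 0xFFFFFFFF
```

A verifier would recompute this over a header read from disk and compare it with the stored bytes 28 - 31 before trusting any of the other header fields.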
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On 2013-09-03 15:45, Kevin Wolf wrote: This contains an extension of the qcow2 spec that introduces journalling to the image format, plus some preliminary type definitions and function prototypes in the qcow2 code. Journalling functionality is a crucial feature for the design of data deduplication, and it will improve the core part of qcow2 by avoiding cluster leaks on crashes as well as provide an easier way to get a reliable implementation of performance features like Delayed COW. At this point of the RFC, it would be most important to review the on-disk structure. Once we're confident that it can do everything we want, we can start going into more detail on the qemu side of things. Signed-off-by: Kevin Wolf kw...@redhat.com --- block/Makefile.objs | 2 +- block/qcow2-journal.c | 55 ++ block/qcow2.h | 78 +++ docs/specs/qcow2.txt | 204 +- 4 files changed, 337 insertions(+), 2 deletions(-) create mode 100644 block/qcow2-journal.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 3bb85b5..59be314 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,5 +1,5 @@ block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o -block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o +block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c new file mode 100644 index 000..5b20239 --- /dev/null +++ b/block/qcow2-journal.c @@ -0,0 +1,55 @@ +/* + * qcow2 journalling functions + * + * Copyright (c) 2013 Kevin Wolf kw...@redhat.com + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to 
use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include qcow2.h + +#define QCOW2_JOURNAL_MAGIC 0x716a6f75726e616cULL /* qjournal */ +#define QCOW2_JOURNAL_BLOCK_MAGIC 0x716a626b /* qjbk */ + +typedef struct Qcow2JournalHeader { +uint64_tmagic; +uint32_tjournal_size; +uint32_tblock_size; +uint32_tsynced_index; +uint32_tsynced_seq; +uint32_tcommitted_seq; +uint32_tchecksum; +} QEMU_PACKED Qcow2JournalHeader; + +/* + * One big transaction per journal block. The transaction is committed either + * time based or when a microtransaction (single set of operations that must be + * performed atomically) doesn't fit in the same block any more. + */ +typedef struct Qcow2JournalBlock { +uint32_tmagic; +uint32_tchecksum; +uint32_tseq; +uint32_tdesc_offset; /* Allow block header extensions */ +uint32_tdesc_bytes; +uint32_tnb_data_blocks; +} QEMU_PACKED Qcow2JournalBlock; + Why is this in the C file... 
diff --git a/block/qcow2.h b/block/qcow2.h index 1000239..2aee1fd 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion { QTAILQ_ENTRY(Qcow2DiscardRegion) next; } Qcow2DiscardRegion; +typedef struct Qcow2Journal { + +} Qcow2Journal; + typedef struct BDRVQcowState { int cluster_bits; int cluster_size; @@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table); +/* qcow2-journal.c functions */ + +typedef struct Qcow2JournalTransaction Qcow2JournalTransaction; + +enum Qcow2JournalEntryTypeID { +QJ_DESC_NOOP= 0, +QJ_DESC_WRITE = 1, +QJ_DESC_COPY= 2, + +/* required after a cluster is freed and used for other purposes, so that + * new (unjournalled) data won't be overwritten with stale metadata */ +QJ_DESC_REVOKE = 3, +}; + +typedef struct
[Qemu-devel] [RFC] qcow2 journalling draft
This contains an extension of the qcow2 spec that introduces journalling
to the image format, plus some preliminary type definitions and function
prototypes in the qcow2 code.

Journalling functionality is a crucial feature for the design of data
deduplication, and it will improve the core part of qcow2 by avoiding
cluster leaks on crashes as well as providing an easier way to get a
reliable implementation of performance features like Delayed COW.

At this point of the RFC, it would be most important to review the
on-disk structure. Once we're confident that it can do everything we
want, we can start going into more detail on the qemu side of things.

Signed-off-by: Kevin Wolf <kw...@redhat.com>
---
 block/Makefile.objs   |   2 +-
 block/qcow2-journal.c |  55 ++
 block/qcow2.h         |  78 +++
 docs/specs/qcow2.txt  | 204 +-
 4 files changed, 337 insertions(+), 2 deletions(-)
 create mode 100644 block/qcow2-journal.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 3bb85b5..59be314 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,5 +1,5 @@
 block-obj-y += raw_bsd.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
-block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
+block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o qcow2-journal.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c
new file mode 100644
index 000..5b20239
--- /dev/null
+++ b/block/qcow2-journal.c
@@ -0,0 +1,55 @@
+/*
+ * qcow2 journalling functions
+ *
+ * Copyright (c) 2013 Kevin Wolf <kw...@redhat.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "qcow2.h"
+
+#define QCOW2_JOURNAL_MAGIC         0x716a6f75726e616cULL /* "qjournal" */
+#define QCOW2_JOURNAL_BLOCK_MAGIC   0x716a626b /* "qjbk" */
+
+typedef struct Qcow2JournalHeader {
+    uint64_t    magic;
+    uint32_t    journal_size;
+    uint32_t    block_size;
+    uint32_t    synced_index;
+    uint32_t    synced_seq;
+    uint32_t    committed_seq;
+    uint32_t    checksum;
+} QEMU_PACKED Qcow2JournalHeader;
+
+/*
+ * One big transaction per journal block. The transaction is committed either
+ * time based or when a microtransaction (a single set of operations that must
+ * be performed atomically) doesn't fit in the same block any more.
+ */
+typedef struct Qcow2JournalBlock {
+    uint32_t    magic;
+    uint32_t    checksum;
+    uint32_t    seq;
+    uint32_t    desc_offset; /* Allow block header extensions */
+    uint32_t    desc_bytes;
+    uint32_t    nb_data_blocks;
+} QEMU_PACKED Qcow2JournalBlock;
diff --git a/block/qcow2.h b/block/qcow2.h
index 1000239..2aee1fd 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -157,6 +157,10 @@ typedef struct Qcow2DiscardRegion {
     QTAILQ_ENTRY(Qcow2DiscardRegion) next;
 } Qcow2DiscardRegion;
 
+typedef struct Qcow2Journal {
+
+} Qcow2Journal;
+
 typedef struct BDRVQcowState {
     int cluster_bits;
     int cluster_size;
@@ -479,4 +483,78 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c,
                           uint64_t offset, void **table);
 int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
 
+/* qcow2-journal.c functions */
+
+typedef struct Qcow2JournalTransaction Qcow2JournalTransaction;
+
+enum Qcow2JournalEntryTypeID {
+    QJ_DESC_NOOP    = 0,
+    QJ_DESC_WRITE   = 1,
+    QJ_DESC_COPY    = 2,
+
+    /* required after a cluster is freed and used for other purposes, so that
+     * new (unjournalled) data won't be overwritten with stale metadata */
+    QJ_DESC_REVOKE  = 3,
+};
+
+typedef struct Qcow2JournalEntryType {
+    enum Qcow2JournalEntryTypeID id;
+    int (*sync)(void *buf, size_t size);
+} Qcow2JournalEntryType;
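Since the RFC asks first for review of the on-disk structure, here is a quick sketch of the Qcow2JournalHeader layout in Python. Big-endian field order is an assumption (the rest of qcow2's on-disk metadata is big-endian; the draft doesn't state it for the journal yet), and pack_journal_header is a hypothetical helper used only for illustration, not part of the patch:

```python
import struct

QCOW2_JOURNAL_MAGIC = 0x716A6F75726E616C        # spells "qjournal" in ASCII
QCOW2_JOURNAL_BLOCK_MAGIC = 0x716A626B          # spells "qjbk" in ASCII

def pack_journal_header(journal_size, block_size, synced_index,
                        synced_seq, committed_seq, checksum=0):
    """Pack a Qcow2JournalHeader following the draft struct: one
    uint64_t magic followed by six uint32_t fields (32 bytes total).
    Big-endian ('>') is an assumption, matching the rest of qcow2."""
    return struct.pack('>QIIIIII', QCOW2_JOURNAL_MAGIC, journal_size,
                       block_size, synced_index, synced_seq,
                       committed_seq, checksum)

# Sanity check: the magic constants really are the ASCII strings the
# comments in the patch claim.
assert QCOW2_JOURNAL_MAGIC.to_bytes(8, 'big') == b'qjournal'
assert QCOW2_JOURNAL_BLOCK_MAGIC.to_bytes(4, 'big') == b'qjbk'
```

A reviewer can use a sketch like this to eyeball a hexdump of a prototype image and confirm the field order before the qemu-side code exists.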
Re: [Qemu-devel] [RFC] qcow2 journalling draft
On Tuesday, 03 Sep 2013 at 15:45:52 (+0200), Kevin Wolf wrote:

[...]
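To make the QJ_DESC_REVOKE semantics from the patch concrete: the comment says a revoke is required after a cluster is freed and reused, so that replay won't overwrite new unjournalled data with stale journalled metadata. The sketch below is a hypothetical two-pass replay; the pass structure is borrowed from Linux's jbd2, not from this patch, and replay_journal plus the (type, offset, data) tuple format are invented for illustration:

```python
# Descriptor type IDs as defined in the draft enum Qcow2JournalEntryTypeID.
QJ_DESC_NOOP, QJ_DESC_WRITE, QJ_DESC_COPY, QJ_DESC_REVOKE = range(4)

def replay_journal(entries, image):
    """Replay journal descriptors into 'image', a dict mapping
    offset -> bytes that stands in for the qcow2 file.

    Pass 1 collects every revoked offset; pass 2 applies only the
    writes that were not revoked. A real implementation would also
    honour sequence numbers so a revoke only cancels older entries;
    that refinement is omitted here for brevity.
    """
    revoked = {off for t, off, _ in entries if t == QJ_DESC_REVOKE}
    for t, off, data in entries:
        if t == QJ_DESC_WRITE and off not in revoked:
            image[off] = data
    return image
```

Under this model, a journalled write to a cluster that was later freed and revoked is simply skipped at replay time, which is exactly the stale-metadata hazard the enum comment describes.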