On Wed, 09/04 11:39, Kevin Wolf wrote:
> First of all, excuse any inconsistencies in the following mail. I wrote
> it from top to bottom, and there was some thought process involved in
> almost every paragraph...
>
> Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > > @@ -103,7 +107,11 @@ in the description of a field.
> > >                     write to an image with unknown auto-clear features if it
> > >                     clears the respective bits from this field first.
> > >
> > > -                   Bits 0-63:  Reserved (set to 0)
> > > +                   Bit 0:      Journal valid bit. This bit indicates that the
> > > +                               image contains a valid main journal starting at
> > > +                               journal_offset.
> >
> > Whether the journal is used can be determined from the journal_offset
> > value (header length must be large enough and journal offset must be
> > valid).
> >
> > Why do we need this autoclear bit?
>
> Hm, I introduced this one first and the journal dirty incompatible bit
> later, perhaps it's unnecessary now. Let's check...
>
> The obvious thing we need to protect against is applying stale journal
> data to an image that has been changed by an older version. As long as
> the journal is clean, this can't happen, and the journal dirty bit will
> ensure that the old version can only open the image if it is clean.
>
> However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> version? It will reclaim the clusters used by the journal, and if we
> continue using the journal we'll corrupt whatever new data is there
> now.

Why can an old qemu-img version open an image with a dirty journal in the
first place? The journal dirty bit is an incompatible bit.
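For reference, here is a rough sketch of how the two kinds of feature bits
behave on open. Everything here is hypothetical illustration except the
journal-valid autoclear bit 0, which comes from the quoted patch; the dirty
bit's position in incompatible_features is not fixed by this draft.

```python
# Hypothetical bit positions: bit 0 (journal valid) is from the patch above,
# the journal dirty bit's incompatible-feature position is made up here.
JOURNAL_VALID_AUTOCLEAR = 1 << 0
JOURNAL_DIRTY_INCOMPAT = 1 << 0

# An old qemu version knows neither of the journal feature bits.
OLD_KNOWN_INCOMPAT = 0
OLD_KNOWN_AUTOCLEAR = 0

def old_qemu_open(incompat, autoclear):
    """Open logic of an older qemu: refuse any unknown incompatible bits,
    silently clear unknown autoclear bits on a read/write open."""
    if incompat & ~OLD_KNOWN_INCOMPAT:
        raise OSError("unknown incompatible features, refusing to open")
    return autoclear & OLD_KNOWN_AUTOCLEAR

# Dirty journal: the incompatible bit alone keeps the old version out.
# Clean journal: the old version opens the image (and 'check -r leaks' may
# reclaim the journal clusters), but it clears the autoclear bit, so a new
# version later knows it must not trust the journal contents any more.
```

This is what makes the autoclear bit useful even with the dirty bit present:
it catches the clean-journal case, where no incompatible bit is set.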
> Can we protect against this without using an autoclear bit?
>
> > > +Journals are used to allow safe updates of metadata without impacting
> > > +performance by requiring flushes to order updates to different parts of the
> > > +metadata.
> >
> > This sentence is hard to parse. Maybe something shorter like this:
> >
> >   Journals allow safe metadata updates without the need for carefully
> >   ordering and flushing between update steps.
>
> Okay, I'll update the text with your proposal.
>
> > > +They consist of transactions, which in turn contain operations that
> > > +are effectively executed atomically. A qcow2 image can have a main image
> > > +journal that deals with cluster management operations, and additional specific
> > > +journals can be used by other features like data deduplication.
> >
> > I'm not sure if multiple journals will work in practice. Doesn't this
> > re-introduce the need to order update steps and flush between them?
>
> This is a question for Benoît, who made this requirement. I asked him
> the same a while ago and apparently his explanation made some sense to
> me, or I would have remembered that I don't want it. ;-)
>
> It might have something to do with the fact that deduplication uses the
> journal more as a kind of cache for hash values that can be dropped and
> rebuilt after a crash.
>
> > > +A journal is organised in journal blocks, all of which have a reference count
> > > +of exactly 1. It starts with a block containing the following journal header:
> > > +
> > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > > +
> > > +          8 - 11:   Journal size in bytes, including the header
> > > +
> > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > > +                    The block size must be at least 512 bytes and must not
> > > +                    exceed the cluster size.
> > > +
> > > +         16 - 19:   Journal block index of the descriptor for the last
> > > +                    transaction that has been synced, starting with 1 for the
> > > +                    journal block after the header. 0 is used for empty
> > > +                    journals.
> > > +
> > > +         20 - 23:   Sequence number of the last transaction that has been
> > > +                    synced. 0 is recommended as the initial value.
> > > +
> > > +         24 - 27:   Sequence number of the last transaction that has been
> > > +                    committed. When replaying a journal, all transactions
> > > +                    after the last synced one up to the last committed one
> > > +                    must be synced. Note that this may include a wraparound
> > > +                    of sequence numbers.
> > > +
> > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
> > > +                    header journal block except those of the checksum field)
> > > +
> > > +         32 - 511:  Reserved (set to 0)
> >
> > I'm not sure if these fields are necessary. They require updates (and
> > maybe a flush) after every commit and sync.
> >
> > The fewer metadata updates, the better, not just for performance but
> > also to reduce the risk of data loss. If any metadata required to
> > access the journal is corrupted, the image will be unavailable.
> >
> > It should be possible to determine this information by scanning the
> > journal transactions.
>
> This is rather handwavy. Can you elaborate how this would work in detail?
>
> For example, let's assume we get to read this journal (a journal can be
> rather large, too, so I'm not sure if we want to read it in completely):
>
>  - Descriptor, seq 42, 2 data blocks
>  - Data block
>  - Data block
>  - Data block starting with "qjbk"
>  - Data block
>  - Descriptor, seq 7, 0 data blocks
>  - Descriptor, seq 8, 1 data block
>  - Data block
>
> Which of these have already been synced? Which have been committed?
>
> I guess we could introduce an is_committed flag in the descriptor, but
> wouldn't correct operation look like this then:
>
> 1. Write out the descriptor with the commit flag clear, and any data blocks
> 2. Flush
> 3. Rewrite the descriptor with the commit flag set
>
> This ensures that the commit flag is only set if all the required data
> is indeed stable on disk. What has changed compared to this proposal is
> just the offset at which you write in step 3 (header vs. descriptor).
>
> For sync I suppose the same option exists.
>
> One unrelated thing we need to take care of is this 'data block starting
> with "qjbk"' thing I mentioned in the example. I can see two solutions:
> The first is never creating such a journal, by always blanking the next
> block after a transaction that we write, so that the descriptor chain is
> terminated. The second one is never creating such a journal, by zeroing
> an initial "qjbk" in data blocks and setting a flag in the descriptor
> instead.
>
> I was thinking of the first, but I guess it would be worth at least
> mentioning the problem in the spec.
>
> > > +A wraparound may not occur in the middle of a single transaction, but only
> > > +between two transactions. For the necessary padding an empty descriptor with
> > > +any number of data blocks can be used as the last entry of the ring.
> >
> > Why have this limitation?
>
> I thought it would make implementations easier if they didn't have to
> read/write data blocks in two parts in some cases.
>
> > > +All descriptors start with a common part:
> > > +
> > > +    Byte  0 -  1:   Descriptor type
> > > +                       0 - No-op descriptor
> > > +                       1 - Write data block
> > > +                       2 - Copy data
> > > +                       3 - Revoke
> > > +                       4 - Deduplication hash insertion
> > > +                       5 - Deduplication hash deletion
> > > +
> > > +          2 -  3:   Size of the descriptor in bytes
> >
> > Data blocks are not included in the descriptor size? I just want to
> > make sure that we won't be limited to 64 KB for the actual data.
>
> Right, this is only for the descriptors, without data. It implies a maximum
> journal block size of 64k, which I think is okay.
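Putting the journal header layout quoted above into code makes the checksum
rule concrete. A minimal sketch, assuming big-endian fields (the excerpt does
not state the byte order; big-endian matches the rest of qcow2) and the
512-byte minimum block size:

```python
import struct

def parse_journal_header(block: bytes) -> dict:
    """Parse the draft journal header from its 512-byte header block.
    Field offsets follow the quoted spec draft; big-endian is assumed."""
    if block[0:8] != b"qjournal":
        raise ValueError("bad journal magic")
    size, order, synced_idx, synced_seq, committed_seq, checksum = \
        struct.unpack(">IIIIII", block[8:32])
    if order < 9:  # 1 << 9 == 512, the minimum journal block size
        raise ValueError("journal block size below 512 bytes")
    # Checksum: one's complement of the sum of all bytes in the header
    # journal block, excluding the checksum field itself (bytes 28-31).
    s = sum(block[:28]) + sum(block[32:])
    if (~s) & 0xffffffff != checksum:
        raise ValueError("journal header checksum mismatch")
    return {
        "size": size,
        "block_size": 1 << order,
        "last_synced_index": synced_idx,
        "last_synced_seq": synced_seq,
        "last_committed_seq": committed_seq,
    }
```

The replay rule then follows directly from the returned fields: sync every
transaction with a sequence number after last_synced_seq up to
last_committed_seq, minding wraparound.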
>
> > > +
> > > +          4 -  n:   Type-specific data
> > > +
> > > +The following section specifies the purpose (i.e. the action that is to be
> > > +performed when syncing) and type-specific data layout of each descriptor type:
> > > +
> > > + * No-op descriptor: No action is to be performed when syncing this descriptor
> > > +
> > > +          4 -  n:   Ignored
> > > +
> > > + * Write data block: Write literal data associated with this transaction from
> > > +   the journal to a given offset.
> > > +
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Offset in the image file to write the data to
> > > +
> > > +         16 - 19:   Index of the journal block at which the data to write
> > > +                    starts. The data must be stored sequentially and be fully
> > > +                    contained in the data blocks associated with the
> > > +                    transaction.
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be written in one operation. This means the size of the descriptor must
> > > +   be 4 + 16 * n.
> >
> > Why is this necessary? Multiple data descriptors could be used; is it
> > worth the additional logic and testing?
>
> What does a typical journal transaction look like?
>
> My assumption is that initially it has lots of L2 table and refcount
> block updates and little else. All of these are data writes. Once we
> implement Delayed COW using the journal, we get some data copy
> operations as well. Assuming a stupid guest, we get two copies and two
> writes for each cluster allocation.
>
> The space required for these is 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n. In
> one 4k descriptor block you can fit 56 cluster allocations.
>
> If you used separate descriptors, you get 2*20n + 2*24n = 88n. In one 4k
> descriptor block you could fit 46 cluster allocations.
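The common part plus the repeated 16-byte write entries are straightforward
to parse. A sketch under the same assumptions as before (big-endian fields,
layout as drafted; the function name is made up for illustration):

```python
import struct

DESC_NOOP, DESC_WRITE, DESC_COPY, DESC_REVOKE = 0, 1, 2, 3

def parse_write_descriptor(desc: bytes) -> list:
    """Parse a 'write data block' descriptor as drafted above.  The 2-byte
    size field covers only the descriptor itself (not the data blocks), so
    a size of 4 + 16*n encodes n (length, offset, block_index) entries."""
    dtype, size = struct.unpack_from(">HH", desc, 0)
    if dtype != DESC_WRITE:
        raise ValueError("not a write descriptor")
    if size < 4 or (size - 4) % 16:
        raise ValueError("descriptor size must be 4 + 16*n")
    entries = []
    for off in range(4, size, 16):
        # 4 bytes length, 8 bytes image offset, 4 bytes journal block index
        length, target, block_index = struct.unpack_from(">IQI", desc, off)
        entries.append((length, target, block_index))
    return entries
```

Note how the 2-byte size field is also what implies the 64k ceiling on the
journal block size mentioned above.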
> Considering that in both cases the space used by descriptors is dwarfed
> by the updated tables to be written, I guess it's not worth it for the
> main journal. Things may look different for the deduplication journal,
> where no data blocks are used and the space taken is determined only by
> the descriptors.
>
> And in fact... All of the above sounded great, but is in fact wrong.
> Typically you'd get _one_ L2 update for all (sequential) allocations, so
> the COW entries without data should dominate in the main journal as well.
>
> > > +
> > > + * Copy data: Copy data from one offset in the image to another one. This can
> > > +   be used for journalling copy-on-write operations.
> >
> > This reminds me to ask what the plan is for journal scope: metadata only
> > or also data? For some operations like dedupe it seems that full data
> > journalling may be necessary. But for an image without dedupe it would
> > not be necessary to journal the rewrites to an already allocated
> > cluster, for example.
>
> Generally only metadata.
>
> The COW case is a bit tricky. Here we really store only metadata in the
> journal as well ("this COW is still pending"), but it directly affects
> data, and the qemu read/write paths for data must take such pending COWs
> into consideration. This will require some ordering between the journal
> and the data (e.g. only free the COWed cluster after the journal is
> committed), but I think that's okay.
>
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Target offset in the image file
> > > +
> > > +         16 - 23:   Source offset in the image file
> >
> > Source and target cannot overlap?
>
> I don't think there's a use case for it, so let's forbid it.
>
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be copied in one operation. This means the size of the descriptor must
> > > +   be 4 + 20 * n.
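The space estimate above is easy to double check numerically, now that both
the write (4 + 16n) and copy (4 + 20n) descriptor sizes are defined. A
throwaway calculation:

```python
# How many cluster allocations fit into one 4k descriptor block?
# Per allocation (stupid-guest case): 2 data writes (16-byte entries) and
# 2 data copies (20-byte entries), plus the 4-byte common descriptor header.
BLOCK = 4096

def combined(n):
    """Two write and two copy descriptors, each carrying n repeated entries."""
    return 2 * (4 + 16 * n) + 2 * (4 + 20 * n)   # = 16 + 72*n

def separate(n):
    """One single-entry descriptor per operation instead."""
    return n * (2 * (4 + 16) + 2 * (4 + 20))     # = 88*n

max_combined = max(n for n in range(1, 200) if combined(n) <= BLOCK)
max_separate = max(n for n in range(1, 200) if separate(n) <= BLOCK)
# 56 allocations with repeated entries vs. 46 with separate descriptors,
# matching the numbers quoted above.
```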
> > > +
> > > + * Revoke: Marks operations on a given range in the imag file invalid for all
> >
> > s/imag/image/
> >
> > > +   earlier transactions (this does not include the transaction containing the
> > > +   revoke). They must not be executed on a sync operation (e.g. because the
> > > +   range in question has been freed and may have been reused for other, not
> > > +   journalled data structures that must not be overwritten with stale data).
> > > +   Note that this may mean that operations are to be executed partially.
> >
> > Example scenario?
>
> Once I had one... Well, let's try this:
>
> 1. Cluster allocation that allocates a new L2 table, i.e. the L1 update gets
>    journalled
>
> 2. Another cluster allocation, this time the L1 has to grow. The write
>    to the new L1 position is journalled, the old table is freed.
>
> 3. The next allocation reuses the clusters where the old L1 table was stored
>
> 4. Journal sync. The L1 update from step 1 is written out to the
>    clusters that are now used for data. Oops.
>
> Kevin
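The revoke semantics can be sketched as a replay routine: collect revoked
ranges first, then execute older writes only on the bytes not covered by a
later revoke, which is exactly what makes partially executed operations
possible. The transaction and operation representation here is entirely made
up for illustration:

```python
def apply_sync(transactions):
    """Replay sketch.  transactions is a list of (seq, ops); a write op is
    ("write", offset, data) and a revoke op is ("revoke", start, end) over
    a half-open byte range in the image file.  All names are hypothetical."""
    # Pass 1: collect revoked ranges with the seq of the revoking transaction.
    revoked = []
    for seq, ops in transactions:
        for op in ops:
            if op[0] == "revoke":
                _, start, end = op
                revoked.append((start, end, seq))
    # Pass 2: execute writes, skipping bytes revoked by a *later* transaction
    # (a revoke does not apply to its own transaction, per the draft above).
    image = {}
    for seq, ops in transactions:
        for op in ops:
            if op[0] != "write":
                continue
            _, offset, data = op
            for i, byte in enumerate(data):
                pos = offset + i
                if any(s <= pos < e and rseq > seq for s, e, rseq in revoked):
                    continue   # stale byte, must not reach the image
                image[pos] = byte
    return image
```

In Kevin's scenario, the free of the old L1 table in step 2 would emit a
revoke for its clusters, so the journalled L1 update from step 1 is dropped
at sync time instead of clobbering the reused data clusters.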