On Wed, 09/04 11:39, Kevin Wolf wrote:
> First of all, excuse any inconsistencies in the following mail. I wrote
> it from top to bottom, and there was some thought process involved in
> almost every paragraph...
>
> Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > > @@ -103,7 +107,11 @@ in the description of a field.
> > >                     write to an image with unknown auto-clear features if it
> > >                     clears the respective bits from this field first.
> > >
> > > -                   Bits 0-63:  Reserved (set to 0)
> > > +                   Bit 0:      Journal valid bit. This bit indicates that the
> > > +                               image contains a valid main journal starting at
> > > +                               journal_offset.
> >
> > Whether the journal is used can be determined from the journal_offset
> > value (header length must be large enough and journal offset must be
> > valid).
> >
> > Why do we need this autoclear bit?
>
> Hm, I introduced this one first and the journal dirty incompatible bit
> later, perhaps it's unnecessary now. Let's check...
>
> The obvious thing we need to protect against is applying stale journal
> data to an image that has been changed by an older version. As long as
> the journal is clean, this can't happen, and the journal dirty bit will
> ensure that the old version can only open the image if it is clean.
>
> However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> version? It will reclaim the clusters used by the journal, and if we
> continue using the journal we'll corrupt whatever new data is there
> now.

Why can an old qemu-img version open an image with a dirty journal in the
first place? The journal dirty bit is an incompatible bit.
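For reference, here is a rough sketch of how the two kinds of feature bits
behave on open. Everything here is hypothetical illustration except the
journal-valid autoclear bit 0, which comes from the quoted patch; the dirty
bit's position in incompatible_features is not fixed by this draft.

```python
# Hypothetical bit positions: bit 0 (journal valid) is from the patch above,
# the journal dirty bit's incompatible-feature position is made up here.
JOURNAL_VALID_AUTOCLEAR = 1 << 0
JOURNAL_DIRTY_INCOMPAT = 1 << 0

# An old qemu version knows neither of the journal feature bits.
OLD_KNOWN_INCOMPAT = 0
OLD_KNOWN_AUTOCLEAR = 0

def old_qemu_open(incompat, autoclear):
    """Open logic of an older qemu: refuse any unknown incompatible bits,
    silently clear unknown autoclear bits on a read/write open."""
    if incompat & ~OLD_KNOWN_INCOMPAT:
        raise OSError("unknown incompatible features, refusing to open")
    return autoclear & OLD_KNOWN_AUTOCLEAR

# Dirty journal: the incompatible bit alone keeps the old version out.
# Clean journal: the old version opens the image (and 'check -r leaks' may
# reclaim the journal clusters), but it clears the autoclear bit, so a new
# version later knows it must not trust the journal contents any more.
```

This is what makes the autoclear bit useful even with the dirty bit present:
it catches the clean-journal case, where no incompatible bit is set.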
> Can we protect against this without using an autoclear bit?
>
> > > +Journals are used to allow safe updates of metadata without impacting
> > > +performance by requiring flushes to order updates to different parts of the
> > > +metadata.
> >
> > This sentence is hard to parse. Maybe something shorter like this:
> >
> >   Journals allow safe metadata updates without the need for carefully
> >   ordering and flushing between update steps.
>
> Okay, I'll update the text with your proposal.
>
> > > +They consist of transactions, which in turn contain operations that
> > > +are effectively executed atomically. A qcow2 image can have a main image
> > > +journal that deals with cluster management operations, and additional specific
> > > +journals can be used by other features like data deduplication.
> >
> > I'm not sure if multiple journals will work in practice. Doesn't this
> > re-introduce the need to order update steps and flush between them?
>
> This is a question for Benoît, who made this requirement. I asked him
> the same a while ago and apparently his explanation made some sense to
> me, or I would have remembered that I don't want it. ;-)
>
> It might have something to do with the fact that deduplication uses the
> journal more as a kind of cache for hash values that can be dropped and
> rebuilt after a crash.
>
> > > +A journal is organised in journal blocks, all of which have a reference count
> > > +of exactly 1. It starts with a block containing the following journal header:
> > > +
> > > +    Byte  0 -  7:   Magic ("qjournal" ASCII string)
> > > +
> > > +          8 - 11:   Journal size in bytes, including the header
> > > +
> > > +         12 - 15:   Journal block size order (block size in bytes = 1 << order)
> > > +                    The block size must be at least 512 bytes and must not
> > > +                    exceed the cluster size.
> > > +
> > > +         16 - 19:   Journal block index of the descriptor for the last
> > > +                    transaction that has been synced, starting with 1 for the
> > > +                    journal block after the header. 0 is used for empty
> > > +                    journals.
> > > +
> > > +         20 - 23:   Sequence number of the last transaction that has been
> > > +                    synced. 0 is recommended as the initial value.
> > > +
> > > +         24 - 27:   Sequence number of the last transaction that has been
> > > +                    committed. When replaying a journal, all transactions
> > > +                    after the last synced one up to the last committed one
> > > +                    must be synced. Note that this may include a wraparound
> > > +                    of sequence numbers.
> > > +
> > > +         28 - 31:   Checksum (one's complement of the sum of all bytes in the
> > > +                    header journal block except those of the checksum field)
> > > +
> > > +         32 - 511:  Reserved (set to 0)
> >
> > I'm not sure if these fields are necessary. They require updates (and
> > maybe a flush) after every commit and sync.
> >
> > The fewer metadata updates, the better, not just for performance but
> > also to reduce the risk of data loss. If any metadata required to
> > access the journal is corrupted, the image will be unavailable.
> >
> > It should be possible to determine this information by scanning the
> > journal transactions.
>
> This is rather handwavy. Can you elaborate how this would work in detail?
>
> For example, let's assume we get to read this journal (a journal can be
> rather large, too, so I'm not sure if we want to read it in completely):
>
>  - Descriptor, seq 42, 2 data blocks
>  - Data block
>  - Data block
>  - Data block starting with "qjbk"
>  - Data block
>  - Descriptor, seq 7, 0 data blocks
>  - Descriptor, seq 8, 1 data block
>  - Data block
>
> Which of these have already been synced? Which have been committed?
>
> I guess we could introduce an is_committed flag in the descriptor, but
> wouldn't correct operation look like this then:
>
> 1. Write out the descriptor with the commit flag clear, and any data blocks
> 2. Flush
> 3. Rewrite the descriptor with the commit flag set
>
> This ensures that the commit flag is only set if all the required data
> is indeed stable on disk. What has changed compared to this proposal is
> just the offset at which you write in step 3 (header vs. descriptor).
>
> For sync I suppose the same option exists.
>
> One unrelated thing we need to take care of is this 'data block starting
> with "qjbk"' thing I mentioned in the example. I can see two solutions:
> The first is never creating such a journal, by always blanking the next
> block after a transaction that we write, so that the descriptor chain is
> terminated. The second one is never creating such a journal, by zeroing
> an initial "qjbk" in data blocks and setting a flag in the descriptor
> instead.
>
> I was thinking of the first, but I guess it would be worth at least
> mentioning the problem in the spec.
>
> > > +A wraparound may not occur in the middle of a single transaction, but only
> > > +between two transactions. For the necessary padding an empty descriptor with
> > > +any number of data blocks can be used as the last entry of the ring.
> >
> > Why have this limitation?
>
> I thought it would make implementations easier if they didn't have to
> read/write data blocks in two parts in some cases.
>
> > > +All descriptors start with a common part:
> > > +
> > > +    Byte  0 -  1:   Descriptor type
> > > +                       0 - No-op descriptor
> > > +                       1 - Write data block
> > > +                       2 - Copy data
> > > +                       3 - Revoke
> > > +                       4 - Deduplication hash insertion
> > > +                       5 - Deduplication hash deletion
> > > +
> > > +          2 -  3:   Size of the descriptor in bytes
> >
> > Data blocks are not included in the descriptor size? I just want to
> > make sure that we won't be limited to 64 KB for the actual data.
>
> Right, this is only for the descriptors, without data. It implies a maximum
> journal block size of 64k, which I think is okay.
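Putting the journal header layout quoted above into code makes the checksum
rule concrete. A minimal sketch, assuming big-endian fields (the excerpt does
not state the byte order; big-endian matches the rest of qcow2) and the
512-byte minimum block size:

```python
import struct

def parse_journal_header(block: bytes) -> dict:
    """Parse the draft journal header from its 512-byte header block.
    Field offsets follow the quoted spec draft; big-endian is assumed."""
    if block[0:8] != b"qjournal":
        raise ValueError("bad journal magic")
    size, order, synced_idx, synced_seq, committed_seq, checksum = \
        struct.unpack(">IIIIII", block[8:32])
    if order < 9:  # 1 << 9 == 512, the minimum journal block size
        raise ValueError("journal block size below 512 bytes")
    # Checksum: one's complement of the sum of all bytes in the header
    # journal block, excluding the checksum field itself (bytes 28-31).
    s = sum(block[:28]) + sum(block[32:])
    if (~s) & 0xffffffff != checksum:
        raise ValueError("journal header checksum mismatch")
    return {
        "size": size,
        "block_size": 1 << order,
        "last_synced_index": synced_idx,
        "last_synced_seq": synced_seq,
        "last_committed_seq": committed_seq,
    }
```

The replay rule then follows directly from the returned fields: sync every
transaction with a sequence number after last_synced_seq up to
last_committed_seq, minding wraparound.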
>
> > > +
> > > +          4 -  n:   Type-specific data
> > > +
> > > +The following section specifies the purpose (i.e. the action that is to be
> > > +performed when syncing) and type-specific data layout of each descriptor type:
> > > +
> > > + * No-op descriptor: No action is to be performed when syncing this descriptor
> > > +
> > > +          4 -  n:   Ignored
> > > +
> > > + * Write data block: Write literal data associated with this transaction from
> > > +   the journal to a given offset.
> > > +
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Offset in the image file to write the data to
> > > +
> > > +         16 - 19:   Index of the journal block at which the data to write
> > > +                    starts. The data must be stored sequentially and be fully
> > > +                    contained in the data blocks associated with the
> > > +                    transaction.
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be written in one operation. This means the size of the descriptor must
> > > +   be 4 + 16 * n.
> >
> > Why is this necessary? Multiple data descriptors could be used; is it
> > worth the additional logic and testing?
>
> What does a typical journal transaction look like?
>
> My assumption is that initially it has lots of L2 table and refcount
> block updates and little else. All of these are data writes. Once we
> implement Delayed COW using the journal, we get some data copy
> operations as well. Assuming a stupid guest, we get two copies and two
> writes for each cluster allocation.
>
> The space required for these is 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n. In
> one 4k descriptor block you can fit 56 cluster allocations.
>
> If you used separate descriptors, you get 2*20n + 2*24n = 88n. In one 4k
> descriptor block you could fit 46 cluster allocations.
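The common part plus the repeated 16-byte write entries are straightforward
to parse. A sketch under the same assumptions as before (big-endian fields,
layout as drafted; the function name is made up for illustration):

```python
import struct

DESC_NOOP, DESC_WRITE, DESC_COPY, DESC_REVOKE = 0, 1, 2, 3

def parse_write_descriptor(desc: bytes) -> list:
    """Parse a 'write data block' descriptor as drafted above.  The 2-byte
    size field covers only the descriptor itself (not the data blocks), so
    a size of 4 + 16*n encodes n (length, offset, block_index) entries."""
    dtype, size = struct.unpack_from(">HH", desc, 0)
    if dtype != DESC_WRITE:
        raise ValueError("not a write descriptor")
    if size < 4 or (size - 4) % 16:
        raise ValueError("descriptor size must be 4 + 16*n")
    entries = []
    for off in range(4, size, 16):
        # 4 bytes length, 8 bytes image offset, 4 bytes journal block index
        length, target, block_index = struct.unpack_from(">IQI", desc, off)
        entries.append((length, target, block_index))
    return entries
```

Note how the 2-byte size field is also what implies the 64k ceiling on the
journal block size mentioned above.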
> Considering that in both cases the space used by descriptors is dwarfed
> by the updated tables to be written, I guess it's not worth it for the
> main journal. Things may look different for the deduplication journal,
> where no data blocks are used and the space taken is determined only by
> the descriptors.
>
> And in fact... All of the above sounded great, but is in fact wrong.
> Typically you'd get _one_ L2 update for all (sequential) allocations, so
> the COW entries without data should dominate in the main journal as well.
>
> > > +
> > > + * Copy data: Copy data from one offset in the image to another one. This can
> > > +   be used for journalling copy-on-write operations.
> >
> > This reminds me to ask what the plan is for journal scope: metadata only
> > or also data? For some operations like dedupe it seems that full data
> > journalling may be necessary. But for an image without dedupe it would
> > not be necessary to journal the rewrites to an already allocated
> > cluster, for example.
>
> Generally only metadata.
>
> The COW case is a bit tricky. Here we really store only metadata in the
> journal as well ("this COW is still pending"), but it directly affects
> data, and the qemu read/write paths for data must take such pending COWs
> into consideration. This will require some ordering between the journal
> and the data (e.g. only free the COWed cluster after the journal is
> committed), but I think that's okay.
>
> > > +          4 -  7:   Length of the data to write in bytes
> > > +
> > > +          8 - 15:   Target offset in the image file
> > > +
> > > +         16 - 23:   Source offset in the image file
> >
> > Source and target cannot overlap?
>
> I don't think there's a use case for it, so let's forbid it.
>
> > > +
> > > +   The type-specific data can be repeated, specifying multiple chunks of data
> > > +   to be copied in one operation. This means the size of the descriptor must
> > > +   be 4 + 20 * n.
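The space estimate above is easy to double check numerically, now that both
the write (4 + 16n) and copy (4 + 20n) descriptor sizes are defined. A
throwaway calculation:

```python
# How many cluster allocations fit into one 4k descriptor block?
# Per allocation (stupid-guest case): 2 data writes (16-byte entries) and
# 2 data copies (20-byte entries), plus the 4-byte common descriptor header.
BLOCK = 4096

def combined(n):
    """Two write and two copy descriptors, each carrying n repeated entries."""
    return 2 * (4 + 16 * n) + 2 * (4 + 20 * n)   # = 16 + 72*n

def separate(n):
    """One single-entry descriptor per operation instead."""
    return n * (2 * (4 + 16) + 2 * (4 + 20))     # = 88*n

max_combined = max(n for n in range(1, 200) if combined(n) <= BLOCK)
max_separate = max(n for n in range(1, 200) if separate(n) <= BLOCK)
# 56 allocations with repeated entries vs. 46 with separate descriptors,
# matching the numbers quoted above.
```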
> > > +
> > > + * Revoke: Marks operations on a given range in the imag file invalid for all
> >
> > s/imag/image/
> >
> > > +   earlier transactions (this does not include the transaction containing the
> > > +   revoke). They must not be executed on a sync operation (e.g. because the
> > > +   range in question has been freed and may have been reused for other, not
> > > +   journalled data structures that must not be overwritten with stale data).
> > > +   Note that this may mean that operations are to be executed partially.
> >
> > Example scenario?
>
> Once I had one... Well, let's try this:
>
> 1. Cluster allocation that allocates a new L2 table, i.e. the L1 update gets
>    journalled
>
> 2. Another cluster allocation, this time the L1 has to grow. The write
>    to the new L1 position is journalled, the old table is freed.
>
> 3. The next allocation reuses the clusters where the old L1 table was stored
>
> 4. Journal sync. The L1 update from step 1 is written out to the
>    clusters that are now used for data. Oops.
>
> Kevin
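The revoke semantics can be sketched as a replay routine: collect revoked
ranges first, then execute older writes only on the bytes not covered by a
later revoke, which is exactly what makes partially executed operations
possible. The transaction and operation representation here is entirely made
up for illustration:

```python
def apply_sync(transactions):
    """Replay sketch.  transactions is a list of (seq, ops); a write op is
    ("write", offset, data) and a revoke op is ("revoke", start, end) over
    a half-open byte range in the image file.  All names are hypothetical."""
    # Pass 1: collect revoked ranges with the seq of the revoking transaction.
    revoked = []
    for seq, ops in transactions:
        for op in ops:
            if op[0] == "revoke":
                _, start, end = op
                revoked.append((start, end, seq))
    # Pass 2: execute writes, skipping bytes revoked by a *later* transaction
    # (a revoke does not apply to its own transaction, per the draft above).
    image = {}
    for seq, ops in transactions:
        for op in ops:
            if op[0] != "write":
                continue
            _, offset, data = op
            for i, byte in enumerate(data):
                pos = offset + i
                if any(s <= pos < e and rseq > seq for s, e, rseq in revoked):
                    continue   # stale byte, must not reach the image
                image[pos] = byte
    return image
```

In Kevin's scenario, the free of the old L1 table in step 2 would emit a
revoke for its clusters, so the journalled L1 update from step 1 is dropped
at sync time instead of clobbering the reused data clusters.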