On Thu, Jun 20, 2013 at 04:26:09PM +0200, Benoît Canet wrote:
> ---
>  docs/specs/qcow2.txt | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
> index 36a559d..a4ffc85 100644
> --- a/docs/specs/qcow2.txt
> +++ b/docs/specs/qcow2.txt
> @@ -350,3 +350,45 @@ Snapshot table entry:
>          variable:   Unique ID string for the snapshot (not null terminated)
>
>          variable:   Name of the snapshot (not null terminated)
> +
> +== Journal ==
> +
> +QCOW2 can use one or more instance of a metadata journal.

s/instance/instances/

Is there a reason to use multiple journals rather than a single journal
for all entry types?  A single journal area avoids seeks.

> +
> +A journal is a sequential log of journal entries appended on a previously
> +allocated and reseted area.

I think the usual wording is "previously reset area" rather than
"reseted".  Another option is "initialized area".

> +A journal is designed like a linked list with each entry pointing to the next
> +so it's easy to iterate over entries.
> +
> +A journal uses the following constants to denote the type of each entry
> +
> +TYPE_NONE = 0xFF      default value of any bytes in a reseted journal
> +TYPE_END  = 1         the entry ends a journal cluster and point to the next
> +                      cluster
> +TYPE_HASH = 2         the entry contains a deduplication hash
> +
> +QCOW2 journal entry:
> +
> +    Byte 0         :    Size of the entry: size = 2 + n with size <= 254

This is not clear.  I'm wondering whether the +2 is included in the byte
value or not.  I'm also wondering what a byte value of zero means and
what a byte value of 255 means.

Please include an example to illustrate how this field works.

> +
> +         1         :    Type of the entry
> +
> +         2 - size  :    The optional n bytes structure carried by entry
> +
> +A journal is divided into clusters and no journal entry can be spilled on two
> +clusters. This avoid having to read more than one cluster to get a single entry.
> +
> +For this purpose an entry with the end type is added at the end of a journal
> +cluster before starting to write in the next cluster.
> +The size of such an entry is set so the entry points to the next cluster.
> +
> +As any journal cluster must be ended with an end entry the size of regular
> +journal entries is limited to 254 bytes in order to always left room for an end
> +entry which mimimal size is two bytes.
> +
> +The only cases where size > 254 are none entries where size = 255.
> +
> +The replay of a journal stop when the first end none entry is reached.

s/stop/stops/

> +The journal cluster size is 4096 bytes.

Questions about this layout:

1. Journal entries have no integrity mechanism, which is especially
   important if they span physical sectors where cheap disks may perform
   a partial write.  This would leave a corrupt journal.  If the last
   bytes are a checksum then you can get some confidence that the entry
   was fully written and is valid.

   Did I miss something?

2. Byte-granularity means that read-modify-write is necessary to append
   entries to the journal.  Therefore a failure could destroy previously
   committed entries.

   Any ideas how existing journals handle this?
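
To make question 1 concrete, here is a minimal sketch of what a trailing
checksum could look like.  It is only an illustration, not something the
patch defines: the 4-byte CRC32 trailer, the choice that the size byte
counts the whole entry (header, payload and trailer), and the helper names
encode_entry() and replay() are all assumptions of mine.  Only TYPE_NONE
and TYPE_HASH come from the patch.

```python
import struct
import zlib

TYPE_NONE = 0xFF  # from the patch: value of every byte in a reset journal
TYPE_HASH = 2     # from the patch: entry carries a deduplication hash
CRC_LEN = 4       # assumption: 4-byte CRC32 trailer, not part of the patch


def encode_entry(entry_type: int, payload: bytes) -> bytes:
    """Build size | type | payload | crc32 (hypothetical layout)."""
    size = 2 + len(payload) + CRC_LEN  # assumption: size covers everything
    if size > 254:
        raise ValueError("entry too large")
    body = bytes([size, entry_type]) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def replay(journal: bytes):
    """Yield (type, payload) per valid entry; stop at the first bad one."""
    pos = 0
    while pos + 2 <= len(journal):
        size, entry_type = journal[pos], journal[pos + 1]
        if entry_type == TYPE_NONE:
            return  # reached the unwritten (reset) part of the journal
        if size < 2 + CRC_LEN or pos + size > len(journal):
            return  # size field cannot describe a complete entry
        end = pos + size
        body = journal[pos:end - CRC_LEN]
        (stored,) = struct.unpack("<I", journal[end - CRC_LEN:end])
        if zlib.crc32(body) != stored:
            return  # torn or corrupt entry: treat it as the end of the log
        yield entry_type, body[2:]
        pos = end
```

With something like this, an entry whose tail never reached the disk fails
the CRC check and replay stops just before it instead of consuming garbage.
It does not address question 2: a read-modify-write of a sector can still
damage entries that were already committed.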