Nguyễn Thái Ngọc Duy <[email protected]> writes:
> Signed-off-by: Nguyễn Thái Ngọc Duy <[email protected]>
> ---
> For my education but may help people who are interested in the
> format. Most is gathered from commit messages, except the delta tree
> entries.
Thanks.
> diff --git a/Documentation/technical/pack-format-v4.txt
> b/Documentation/technical/pack-format-v4.txt
In the final version it may be a good idea to either have this
together with the documentation for the existing pack-formats, or
add a reference from the documentation for the existing formats to
point at this new file saying "for v4 see ...".
> new file mode 100644
> index 0000000..9123a53
> --- /dev/null
> +++ b/Documentation/technical/pack-format-v4.txt
> @@ -0,0 +1,110 @@
> +Git pack v4 format
> +==================
> +
> +== pack-*.pack files have the following format:
> +
> + - A header appears at the beginning and consists of the following:
> +
> + 4-byte signature:
> + The signature is: {'P', 'A', 'C', 'K'}
> +
> + 4-byte version number (network byte order): must be version
> + number 4
> +
> + 4-byte number of objects contained in the pack (network byte
> + order)
> +
> + - (20 * nr_objects)-byte SHA-1 table: sorted in memcmp() order.
> +
> + - Commit name dictionary: the uncompressed length in variable
> + encoding, followed by zlib-compressed dictionary. Each entry
> + consists of two prefix bytes storing timezone followed by a
> + NUL-terminated string.
The log and code use different names for this thing. "commit name"
is misleading: it is not a "commit object name" but "names recorded
in commit objects"; it is not only for "committer" names, but also
applies to authors; and it is not just names but also emails and the
TZ used. Perhaps a better name would be "ident" table, as we use the
word "ident" only to refer to the data that identifies the people
recorded on the author/committer/tagger lines of the objects?
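
To check my reading of the entry layout, here is a minimal sketch that
splits an already-inflated dictionary into entries. The patch text does
not say how the two prefix bytes encode the timezone; treating them as
a big-endian signed 16-bit number (700 for +0700) is my assumption, and
parse_ident_dict is a made-up name, not something from the series. The
variable length prefix is skipped here because its encoding is not
described in the quoted text.

```python
import struct
import zlib


def parse_ident_dict(raw):
    """Split an uncompressed dictionary into (tz, string) entries."""
    entries = []
    pos = 0
    while pos < len(raw):
        # Two prefix bytes. ASSUMPTION: big-endian signed 16-bit value,
        # e.g. 700 for +0700; the patch text does not specify this.
        (tz,) = struct.unpack_from(">h", raw, pos)
        # The string runs up to the next NUL; ValueError if unterminated.
        end = raw.index(b"\0", pos + 2)
        entries.append((tz, raw[pos + 2:end].decode("utf-8")))
        pos = end + 1
    return entries


# Build a tiny two-entry dictionary by hand and round-trip it via zlib.
raw = (struct.pack(">h", 700) + b"A U Thor <author@example.com>\0" +
       struct.pack(">h", -800) + b"C O Mitter <committer@example.com>\0")
packed = zlib.compress(raw)
idents = parse_ident_dict(zlib.decompress(packed))
```

Note that both an author and a committer land in the same table, which
is why "commit name" undersells what is stored.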
> + (undeltified representation)
> + n-byte type and length (4-bit type, (n-1)*7+4-bit length)
> + [uncompressed data]
> + [compressed data]
These two lines are not useful; it is better spelled as [data
specific to object type], as you have to enumerate what is stored
and how for each type separately anyway.
> +=== Tree representation
> +
> + - n-byte type and length (4-bit type, (n-1)*7+4-bit length)
> +
> + - Number of trees in variable length encoding
> +
> + - A number of trees, each consists of
Both uses of "number of trees" above sound wrong; aren't they the
number of "tree entries" (each of which can be a blob or a subtree)
that this tree object records?
> + Path component reference: an index, in variable length encoding,
> + into tree path dictionary, which also covers entry mode.
> +
> + SHA-1 in SHA-1 reference encoding.
> +
> +Path component reference zero is an indicator of deltified portion and
> +has the following format:
> +
> + - path component reference: zero
> +
> + - index of the entry to copy from, in variable length encoding
> +
> + - number of entries in variable length encoding
> +
> + - base tree in SHA-1 reference encoding
> +
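If I am reading the deltified portion correctly, expanding a tree
amounts to splicing runs of entries from base trees in between literal
entries. A rough sketch follows, working on records that have already
been decoded from the byte stream so that it does not depend on the
variable length encoding. The record tuples, expand_tree and load_tree
are all hypothetical names of mine, and I am assuming that the copy
index counts entries of the fully expanded base tree; the text does not
say so explicitly.

```python
def expand_tree(records, load_tree):
    """Expand decoded tree records into a flat list of (path_ref, sha1).

    records   -- ("entry", path_ref, sha1) for a literal entry, or
                 ("copy", start, count, base_sha1) for a record whose
                 path component reference was zero.
    load_tree -- callable returning the expanded entries of a base tree.
    """
    out = []
    for rec in records:
        if rec[0] == "entry":
            _, path_ref, sha1 = rec
            out.append((path_ref, sha1))
        elif rec[0] == "copy":
            _, start, count, base_sha1 = rec
            base = load_tree(base_sha1)
            # ASSUMPTION: start/count index the expanded base tree.
            if start + count > len(base):
                raise ValueError("copy range exceeds base tree")
            out.extend(base[start:start + count])
        else:
            raise ValueError("unknown record kind: %r" % (rec[0],))
    return out


# Short strings stand in for 20-byte SHA-1s to keep the example readable.
trees = {"base": [(1, "aa"), (2, "bb"), (3, "cc")]}
records = [("copy", 0, 2, "base"), ("entry", 4, "dd")]
expanded = expand_tree(records, trees.__getitem__)
```

Here the new tree reuses the first two entries of the base and adds one
entry of its own, so only the last entry needs to be stored literally.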
> +=== SHA-1 reference encoding
> +
> +This encoding is used to encode SHA-1 efficiently if it's already in
> +the SHA-1 table. It starts with an index number in variable length
> +encoding. If it's not zero, its value minus one is the index in the
> +SHA-1 table. If it's zero, 20 bytes of SHA-1 is followed.
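
The zero/non-zero rule above is easy to get off by one, so here is a
small sketch of a reader for it. The quoted text does not define
"variable length encoding", so read_varint below is only a stand-in I
made up (7 data bits per byte, high bit meaning "more bytes follow");
the real encoding may well differ. read_sha1_ref is likewise a
hypothetical name.

```python
def read_varint(buf, pos):
    """Stand-in decoder. ASSUMPTION, not taken from the patch: 7 data
    bits per byte, most significant group first, high bit set on every
    byte except the last."""
    val = 0
    while True:
        c = buf[pos]
        pos += 1
        val = (val << 7) | (c & 0x7F)
        if not c & 0x80:
            return val, pos


def read_sha1_ref(buf, pos, sha1_table):
    """Return (sha1, new_pos) for a SHA-1 reference starting at pos."""
    idx, pos = read_varint(buf, pos)
    if idx:
        # Non-zero: value minus one is the index into the SHA-1 table.
        return sha1_table[idx - 1], pos
    # Zero: the 20 raw bytes of the SHA-1 follow.
    sha1 = bytes(buf[pos:pos + 20])
    if len(sha1) != 20:
        raise ValueError("truncated literal SHA-1")
    return sha1, pos + 20


# A three-entry table, then a stream holding one table reference
# (value 2, i.e. table[1]) followed by one literal SHA-1.
table = [bytes([i]) * 20 for i in range(3)]
literal = bytes(range(20))
stream = bytes([2]) + bytes([0]) + literal
first, pos = read_sha1_ref(stream, 0, table)
second, pos = read_sha1_ref(stream, pos, table)
```

A reference into the table costs one byte here, while a SHA-1 that is
not in the table costs 21.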