I think the pack v4 format and compatibility code are pretty stable now,
and probably as optimal as they can reasonably be.
@Junio: might be a good idea to refresh your pu branch again.
Now the biggest problem to solve is the actual tree
"deltification". I don't have the time to spend on this otherwise very
interesting problem (IMHO at least) so I'm sending this request for help
in the hope that more people would be keen to contribute their computing
skills to solve this challenge.
I'll give a quick description of the pv4 tree encoding first and explain
the problem to solve afterwards.
A pv4 tree is composed of two types of records: immediate tree entries
and tree sequence copies.
1) Immediate Tree Entry
This is made of the following tuple:
<path component index>,<SHA1 reference index>
The path component index refers to the path dictionary table where path
strings and file modes are uniquely stored. The SHA1 index refers to
the sorted SHA1 table which used to live in the pack index but is now
part of the pack file itself. So on average a single tree entry may
take 1 to 2 bytes for the path component index and 1 to 3 bytes for the
SHA1 index.
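To make this concrete, here is a minimal sketch of reading one such
entry. It assumes a generic base-128 varint encoding for both indices
and uses made-up struct and function names; the actual pv4 byte-level
encoding may differ in its details.

  #include <stdint.h>

  /* hypothetical helper: decode a little-endian base-128 varint */
  static uint64_t decode_varint(const unsigned char **bufp)
  {
      const unsigned char *buf = *bufp;
      uint64_t val = 0;
      int shift = 0;
      unsigned char c;

      do {
          c = *buf++;
          val |= (uint64_t)(c & 0x7f) << shift;
          shift += 7;
      } while (c & 0x80);

      *bufp = buf;
      return val;
  }

  struct pv4_tree_entry {
      uint64_t path_index;   /* index into the path/mode dictionary */
      uint64_t sha1_index;   /* index into the sorted SHA1 table */
  };

  /* <path component index>,<SHA1 reference index> */
  static void read_immediate_entry(const unsigned char **bufp,
                                   struct pv4_tree_entry *e)
  {
      e->path_index = decode_varint(bufp);
      e->sha1_index = decode_varint(bufp);
  }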
2) Tree Sequence Copy
This is used to literally copy a contiguous list of tree entries from
another tree object. This goes as follows:
<tree entry where to start>,<number of entries to copy>
[,<SHA1 index of the object to copy from>]
So instead of having arbitrary copying of data like in delta objects, we
refer directly to tree entries in another object. The big advantage
here is that the tree walker may directly jump to the copied object
without having to do any delta patching and caching. The SHA1 index is
optional if it refers to the same copied object as the previous tree
sequence copy record in this tree object. And another possible
optimization for the tree walker when enumerating objects is to skip the
copy entry entirely if the object being copied from has already been
enumerated.
The size of this record is more variable: typically 1 to 2 bytes for
the start index, 1 to 2 bytes for the copy count, and 0 to 3 bytes for
the SHA1 index.
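Decoding such a record could look like this, building on the
decode_varint() helper from the previous sketch. The way the optional
SHA1 index is signalled here (a flag in the low bit of the first
varint) is purely an assumption made for the illustration, not
necessarily what pv4 does on disk.

  struct pv4_tree_copy {
      uint64_t start;        /* first entry to copy in the base tree */
      uint64_t count;        /* number of entries to copy */
      uint64_t sha1_index;   /* tree object being copied from */
  };

  static void read_copy_record(const unsigned char **bufp,
                               struct pv4_tree_copy *c,
                               uint64_t prev_sha1_index)
  {
      uint64_t v = decode_varint(bufp);

      c->start = v >> 1;
      c->count = decode_varint(bufp);
      if (v & 1)
          c->sha1_index = decode_varint(bufp);
      else
          /* same base object as the previous copy record */
          c->sha1_index = prev_sha1_index;
  }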
So... what to do with this?
Currently the "deltification" is achieved using a pretty dumb heuristic
which is to simply take each tree object in a pack v2 with their
corresponding base delta object and perform a straight conversion into
pack v4 tree format i.e. use copy records whenever possible as long as
they represent a space saving over the corresponding immediate tree
entries (which is normally the case).
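In other words, the heuristic comes down to a simple size test, roughly
like the sketch below. The helper names are made up and the varint
length is only an estimate of whatever the real encoding costs.

  #include <stdint.h>

  /* bytes needed for a base-128 varint (an estimate) */
  static unsigned varint_len(uint64_t v)
  {
      unsigned n = 1;
      while (v >>= 7)
          n++;
      return n;
  }

  /*
   * 'immediate_bytes' is what the same entries would cost if emitted
   * as immediate tree entries instead of one copy record.
   */
  static int copy_record_saves_space(uint64_t start, uint64_t count,
                                     uint64_t sha1_index,
                                     int need_sha1_index,
                                     uint64_t immediate_bytes)
  {
      uint64_t copy_bytes = varint_len(start) + varint_len(count);

      if (need_sha1_index)
          copy_bytes += varint_len(sha1_index);
      return copy_bytes < immediate_bytes;
  }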
However this is rather sub-optimal for two reasons:
1) This doesn't benefit from the ability to copy from multiple base
objects and potentially misses additional opportunities for copy
sequences which are simply not available in the base object chosen by
the pack v2 delta base selection. Pack v4 is already quite a bit
smaller than pack v2, yet it could probably be smaller still.
2) This makes deep delta chains very inefficient at runtime when the
pack is subsequently accessed.
Let's consider this example to fully illustrate #2.
Tree A:
entry 0: "foo.txt"
entry 1-3: copy from tree B: start=2 count=3
Tree B:
entry 0-4: copy from tree C: start=2 count=5
entry 5: "bar.txt"
Tree C:
entry 0: "file_a"
entry 1: "file_b"
entry 2: "file_c"
entry 3: "file_D"
entry 4: "file_E"
entry 5: "file_F"
entry 6: "file_G"
This is a good example of what typically happens when the above heuristic
is applied. And it is not uncommon to see a long indirection chain of
"copy those 2 entries from that other object" sometimes reaching 50
levels deep where the same 2 (or more) entries require up to 50 object
hops before they can be obtained.
Obviously here the encoding should be optimized to reduce the chaining
effect simply by referring to the final object directly. Hence tree A
could be encoded as follows:
Tree A:
entry 0: "foo.txt"
entry 1-3: copy from tree C: start=4 count=3
The on-disk encoding is likely to be the same size but the runtime
access is greatly optimized.
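Such reference simplification could be done with something along the
lines of the sketch below: as long as the copied range falls entirely
inside a single copy record of the base tree, redirect the copy to the
deeper base. The in-memory representation (struct tree_obj and struct
tree_rec) is made up for illustration, and no cycle detection is done
here.

  struct tree_obj;

  struct tree_rec {
      int is_copy;            /* 0: immediate entry, 1: copy record */
      /* the following fields are only used for copy records */
      struct tree_obj *base;  /* tree being copied from */
      unsigned start;         /* first copied entry in the base */
      unsigned count;         /* number of copied entries */
  };

  struct tree_obj {
      struct tree_rec *recs;
      unsigned nr_recs;
  };

  /*
   * Given a copy of 'count' entries starting at entry 'start' of
   * '*basep', follow nested copy records as long as the whole range
   * comes from a single deeper tree, and adjust (base, start) to
   * point at the final object directly.
   */
  static void flatten_copy(struct tree_obj **basep, unsigned *startp,
                           unsigned count)
  {
      for (;;) {
          struct tree_obj *base = *basep;
          unsigned pos = 0;      /* entry index within 'base' */
          unsigned i;

          for (i = 0; i < base->nr_recs; i++) {
              struct tree_rec *r = &base->recs[i];
              unsigned n = r->is_copy ? r->count : 1;

              if (*startp >= pos && *startp + count <= pos + n) {
                  if (!r->is_copy)
                      return;  /* range is immediate data */
                  /* whole range is inside this nested copy:
                   * point directly at the deeper base */
                  *startp = r->start + (*startp - pos);
                  *basep = r->base;
                  goto deeper;
              }
              pos += n;
          }
          return;  /* range spans several records: stop here */
  deeper:
          ;
      }
  }

With tree A from the example above, this turns "copy from B: start=2
count=3" into "copy from C: start=4 count=3" in one pass.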
Now... instead of trying to do reference simplification by factoring out
the chain effect, I think a whole new approach to tree delta compression
for pack v4 should be developed, one which would also address issue #1
above.
For example, we could turn each canonical tree object into a sequence
of hashed tree entries in memory, where each tree entry would be
reduced down to a single CRC32 value (or even Adler32 for speed).
Each tree object would then be represented by some kind of 32-bit
character "string". Then it is just a matter of applying substring
matching algorithms to find the best copy sequences without creating
reference cycles, weighted by the number of indirections involved when
a referred object already has a copy sequence covering part or all of
the matched string. Etc.
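As an illustration of that first step, here is a sketch that reduces
each canonical tree entry (mode, path, SHA1) to one CRC32 value using
zlib's crc32(), so a whole tree becomes an array of uint32_t that any
substring matching algorithm can chew on. The struct and function
names are made up.

  #include <stdint.h>
  #include <string.h>
  #include <zlib.h>

  struct canon_entry {
      unsigned mode;
      const char *path;
      unsigned char sha1[20];
  };

  /* fill 'out' with one CRC32 per entry; 'out' must hold 'nr' values */
  static void fingerprint_tree(const struct canon_entry *ent,
                               unsigned nr, uint32_t *out)
  {
      unsigned i;

      for (i = 0; i < nr; i++) {
          uLong crc = crc32(0L, Z_NULL, 0);
          unsigned char modebuf[4] = {
              ent[i].mode >> 24, ent[i].mode >> 16,
              ent[i].mode >> 8,  ent[i].mode
          };

          crc = crc32(crc, modebuf, sizeof(modebuf));
          crc = crc32(crc, (const unsigned char *)ent[i].path,
                      strlen(ent[i].path));
          crc = crc32(crc, ent[i].sha1, 20);
          out[i] = (uint32_t)crc;
      }
  }

From there, standard substring matching (suffix arrays, rolling hashes,
etc.) can be applied to those arrays to find candidate copy sequences.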
Or it could be something else entirely. Certainly similar problems have
already been solved in the literature somewhere.
And it has to be _fast_.
So if someone is on the lookout for a nice algorithmic and coding
challenge then here's your opportunity!
Nicolas