On Thu, 6 Dec 2007, Jon Loeliger wrote:
>
> On Thu, 2007-12-06 at 00:09, Linus Torvalds wrote:
> > Git also does delta-chains, but it does them a lot more "loosely". There
> > is no fixed entity. Delta's are generated against any random other version
> > that git deems to be a good delta candidate (with various fairly
> > successful heursitics), and there are absolutely no hard grouping rules.
>
> I'd like to learn more about that. Can someone point me to
> either more documentation on it? In the absence of that,
> perhaps a pointer to the source code that implements it?
Well, in a very real sense, what the delta code does is:
- just list every single object in the whole repository
- walk over each object, trying to find another object that it can be
written as a delta against
- write out the result as a pack-file
That's simplified: we may not walk _all_ objects, for example: only a
global repack does that (and most pack creations are actually for pushign
and pulling between two repositories, so we only walk the objects that are
in the source but not the destination repository).
The interesting phase is the "walk each object, try to find a delta" part.
In particular, you don't want to try to find a delta by comparing each
object to every other object out there (that would be O(n^2) in objects,
and with a fairly high constant cost too!). So what it does is to sort the
objects by a few heuristics (type of object, base name that object was
found as when traversing a tree and size, and how recently it was found in
the history).
And then over that sorted list, it tries to find deltas between entries
that are "close" to each other (and that's where the "--window=xyz" thing
comes in - it says how big the window is for objects being close. A
smaller window generates somewhat less good deltas, but takes a lot less
effort to generate).
The source is in git/builtin-pack-objects.c, with the core of it being
- try_delta() - try to generate a *single* delta when given an object
pair.
- find_deltas() - do the actual list traversal
- prepare_pack() and type_size_sort() - create the delta sort list from
the list of objects.
but that whole file is probably some of the more opaque parts of git.
> I guess one question I posit is, would it be more accurate
> to think of this as a "delta net" in a weighted graph rather
> than a "delta chain"?
It's certainly not a simple chain, it's more of a set of acyclic directed
graphs in the object list. And yes, it's weigted by the size of the delta
between objects, and the optimization problem is kind of akin to finding
the smallest spanning tree (well, forest - since you do *not* want to
create one large graph, you also want to make the individual trees shallow
enough that you don't have excessive delta depth).
There are good algorithms for finding minimum spanning trees, but this one
is complicated by the fact that the biggest cost (by far!) is the
calculation of the weights itself. So rather than really worry about
finding the minimal tree/forest, the code needs to worry about not having
to even calculate all the weights!
(That, btw, is a common theme. A lot of git is about traversing graphs,
like the revision graph. And most of the trivial graph problems all assume
that you have the whole graph, but since the "whole graph" is the whole
history of the repository, those algorithms are totally worthless, since
they are fundamentally much too expensive - if we have to generate the
whole history, we're already screwed for a big project. So things like
revision graph calculation, the main performance issue is to avoid having
to even *look* at parts of the graph that we don't need to see!)
Linus