Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph

Derrick Stolee Sat, 07 Apr 2018 18:18:06 -0700

On 4/7/2018 2:40 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


[...]

On the Linux repository, performance tests were run for the following
command:

     git log --graph --oneline -1000

     Before: 0.92s
     After:  0.66s
     Rel %: -28.3%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measurable performance
change as a result of this patch.

In the "Git Merge contributor summit notes" [1] one can read that:

- VSTS adds bloom filters to know which paths have changed on the commit
- tree-same check in the bloom filter is fast; speeds up file history checks
- if the file history is _very_ sparse, then bloom filter is useful

Could this method speed up also the second case mentioned here?  Can
anyone explain how this "path-changed bloom filter" works in VSTS?

The idea is simple: for every commit, store a Bloom filter containing the list of paths that are not TREESAME against the first parent. (A slight detail: have a max cap on the number of paths, and store simply "TOO_BIG" for commits with too many diffs.)

When performing 'git log -- path' queries, the most important detail for considering how to advance the walk is whether the commit is TREESAME to its first parent. For a deep path in a large repo, this is almost always true. When a Bloom filter says "TREESAME" (i.e. "this path is not in my set") it is always correct, so we can set the treesame bit and continue without walking any trees. When a Bloom filter says "MAYBE NOT TREESAME" (i.e. "this path is probably in my set") you only need to do the same work as before: walk the trees to compare against your first parent.
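As a rough illustration of the check described above, here is a minimal Python sketch. It is not the VSTS or Git implementation: the class name, the `MAX_PATHS` cap, the filter size, the number of hashes, and the use of SHA-1 to derive bit positions are all assumptions chosen for the example, and `compare_trees` is a hypothetical callback standing in for the real tree walk.

```python
import hashlib

MAX_PATHS = 512  # hypothetical cap; above this the commit is stored as "TOO_BIG"


class ChangedPathFilter:
    """Toy Bloom filter over the paths a commit changed vs. its first parent."""

    def __init__(self, paths, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.too_big = len(paths) > MAX_PATHS
        self.bits = 0
        if not self.too_big:
            # The caller decides what counts as a "path"; to answer directory
            # queries it would also need to pass leading directories.
            for path in paths:
                for pos in self._positions(path):
                    self.bits |= 1 << pos

    def _positions(self, path):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{path}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def maybe_changed(self, path):
        """False means "definitely TREESAME"; True means "maybe not"."""
        if self.too_big:
            return True  # no information stored, fall back to the slow path
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


def is_treesame(commit_filter, path, compare_trees):
    """Consult the filter first; only walk trees when it cannot rule out a change."""
    if not commit_filter.maybe_changed(path):
        return True  # a negative answer from a Bloom filter is always correct
    return compare_trees(path)  # same work as before the filter existed
```

A path that was added to the filter always reports "maybe changed", so the shortcut never hides a real change; the only cost of a false positive is one tree comparison that the walk would have done anyway.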

If a Bloom filter has a false-positive rate of X%, then you can possibly drop your number of tree comparisons by (100-X)%. This is very important for large repos where some paths were changed only ten times or so: the full graph still needs to be walked, and it is helpful to avoid parsing too many trees.
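To make that estimate concrete, here is a small Python sketch of the arithmetic. The function name and the numbers in the usage note are made up for illustration and are not measurements from any repository.

```python
def expected_tree_comparisons(commits_walked, commits_changing_path,
                              false_positive_rate):
    """Estimate tree comparisons during 'git log -- path' with a filter.

    Commits that really changed the path always need a comparison.
    Commits that did not change it need one only on a false positive.
    """
    unchanged = commits_walked - commits_changing_path
    return commits_changing_path + unchanged * false_positive_rate
```

For example, with a hypothetical walk of 1,000,000 commits, 10 of which touch the path, and a 2% false-positive rate, `expected_tree_comparisons(1_000_000, 10, 0.02)` comes to roughly 20,000 comparisons instead of 1,000,000.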

Could we add something like this to the commit-graph file?

I'm not sure if it is necessary for client-side operations, but it is one of the reasons the commit-graph file has the idea of an "optional chunk". It could be added to the file format (without changing version numbers) and be ignored by clients that don't understand it. It could also be gated by a config setting for computing them. My guess is that only server-side operations will need the improved response time, and can bear the cost of computing them when writing the commit-graph file. Clients are less likely to be patient waiting for a lot of diff calculations.
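The "optional chunk" idea can be sketched as a reader that skips chunk IDs it does not recognize. The Python below is a toy, not the actual commit-graph parser: the 12-byte entry layout (4-byte ID plus 8-byte big-endian offset) is assumed for the example, and the chunk IDs `DATA` and `BLMF` are hypothetical names.

```python
import struct

KNOWN_CHUNKS = {b"DATA"}  # hypothetical IDs this client understands
ENTRY = struct.Struct(">4sQ")  # assumed layout: 4-byte ID, 8-byte offset


def read_chunk_table(table, known=KNOWN_CHUNKS):
    """Split a chunk table into chunks to use and chunks to ignore."""
    used, ignored = {}, []
    for start in range(0, len(table), ENTRY.size):
        chunk_id, offset = ENTRY.unpack_from(table, start)
        if chunk_id in known:
            used[chunk_id] = offset
        else:
            # Unknown chunk: an older client just skips it, so adding a new
            # optional chunk does not require a new file version.
            ignored.append(chunk_id)
    return used, ignored
```

A client built before the filter chunk existed would report `BLMF` as ignored and keep working from the chunks it knows about.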

If we add commit-graph file downloads to the protocol, then the server could do this computation and send the data to all clients. But that would be "secondary" information that maybe clients want to verify, which is as difficult as computing it themselves.


Thanks,

-Stolee
