Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

Derrick Stolee Mon, 07 May 2018 07:26:31 -0700

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

Hello,


With the early parts of the commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/[email protected]/
I think it would be a good idea to think about what other data could be
added there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think we first need to finish taking advantage of the data that is already there. This means landing the generation number patch [1] (I think this is close, so I'll send a v6 this week if there is no new feedback) and the auto-compute patch [2] (this could use more feedback, but I'll send a v1 based on the RFC feedback if no one chimes in).

[1] https://public-inbox.org/git/[email protected]/

    [PATCH v5 00/11] Compute and consume generation numbers

[2] https://public-inbox.org/git/[email protected]/

    [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git log --graph`. The `tag` scenario is probably easier: it can be done by replacing the revision walk underlying the call with paint_down_to_common(). This requires adding an external method to commit.c, but not too much code.
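As a rough illustration of why generation numbers help in the `tag` case, the sketch below answers "is this tag merged into HEAD?" with a walk that never descends below the tag's generation number. It is not paint_down_to_common() itself. The names `can_reach` and `merged_tags` and the dictionary-based graph are hypothetical, made up for this example, and generation numbers are assumed to be precomputed (1 for a root commit, otherwise one more than the largest parent generation).

```python
def can_reach(parents, generation, start, target):
    """Return True if `target` is reachable from `start` via parent edges."""
    # A commit can only reach commits with a strictly smaller generation
    # number (or itself), so this case needs no walk at all.
    if generation[target] > generation[start]:
        return False
    stack = [start]
    seen = {start}
    while stack:
        commit = stack.pop()
        if commit == target:
            return True
        for p in parents[commit]:
            # Prune: anything below the target's generation cannot lead to it.
            if p not in seen and generation[p] >= generation[target]:
                seen.add(p)
                stack.append(p)
    return False


def merged_tags(parents, generation, head, tags):
    """List the tag names whose commit is reachable from `head`."""
    return [name for name, commit in tags.items()
            if can_reach(parents, generation, head, commit)]


# Toy history (hypothetical): D merges B and C; E and F are newer commits.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"],
           "E": ["D"], "F": ["C"]}
generation = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 4, "F": 3}
tags = {"v0": "A", "v1": "B", "v2": "F", "v3": "E"}

print(merged_tags(parents, generation, "D", tags))  # ['v0', 'v1']
```

Here "v3" is rejected without any walk because its generation number is larger than HEAD's, and "v2" is rejected after looking only at HEAD's direct parents.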

The tougher challenge is `git log --graph`. The revision walk machinery currently uses two precompute phases before iterating results to the pager: limit_list() and sort_in_topological_order(); these correspond to two phases of Kahn's algorithm for topo-sort (compute in-degrees, then walk by peeling commits with in-degree zero). This requires O(N) time, where N is the number of reachable commits. Instead, we could make this O(W) time to output one page of results, where W is (roughly) the number of reachable commits with generation number above the last reported result.

In order to take advantage of this approach, the two phases of Kahn's algorithm need to be done in-line with reporting results to the pager. This means keeping two queues: one is a priority queue by generation number that computes in-degrees, the other is a priority queue (by commit-date or a visit-order value to do the --topo-order priority) that peels the in-degree-zero commits (and decrements the in-degree of their parents). I have not begun this refactoring effort because it appears complicated to me, and it will be hard to tease out the logic without affecting other consumers of the revision-walk machinery.
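To make the two-queue idea concrete, here is a minimal sketch in Python rather than in terms of the revision-walk machinery. Everything in it is an assumption for illustration: the function name `topo_walk`, the dictionary-based graph, and the use of commit date as the priority of the second queue. It is written as a generator so that in-degrees are only counted as deep as the next result requires.

```python
import heapq
from itertools import islice


def topo_walk(parents, generation, date, tips):
    """Lazily yield commits reachable from `tips`, children before parents."""
    tips = set(tips)
    if not tips:
        return
    indegree = {}
    seen = set(tips)
    # Queue 1: max-heap by generation number, used to count in-degrees.
    explore = [(-generation[c], c) for c in tips]
    heapq.heapify(explore)

    def count_indegrees_down_to(cutoff):
        # Visit every queued commit with generation >= cutoff and credit
        # its parents. Afterwards the in-degree of any commit with
        # generation >= cutoff is final, because all of its children have
        # a strictly larger generation and were already visited.
        while explore and -explore[0][0] >= cutoff:
            _, commit = heapq.heappop(explore)
            for p in parents[commit]:
                indegree[p] = indegree.get(p, 0) + 1
                if p not in seen:
                    seen.add(p)
                    heapq.heappush(explore, (-generation[p], p))

    count_indegrees_down_to(min(generation[c] for c in tips))
    # Queue 2: max-heap by commit date holding in-degree-zero commits.
    ready = [(-date[c], c) for c in tips if indegree.get(c, 0) == 0]
    heapq.heapify(ready)

    while ready:
        _, commit = heapq.heappop(ready)
        yield commit
        for p in parents[commit]:
            # Make sure the parent's in-degree is final before peeling.
            count_indegrees_down_to(generation[p])
            indegree[p] -= 1
            if indegree[p] == 0:
                heapq.heappush(ready, (-date[p], p))


# Toy history (hypothetical): D merges B and C, E is the tip.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D"]}
generation = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 4}
date = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

print(list(topo_walk(parents, generation, date, ["E"])))
# ['E', 'D', 'C', 'B', 'A']
print(list(islice(topo_walk(parents, generation, date, ["E"]), 2)))
# ['E', 'D']
```

In the second call only E and D have had their parents counted by the time the two results are produced, which is the O(W) behavior described above.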

I would love it if someone picked up the `git log --graph` task, since it will be a few weeks before I have the time to focus on it.

Without completing the benefits we get from generation numbers, these investigations into other reachability indexes will be incomplete as they are comparing benefits without all consumers taking advantage of a reachability index.


[...]

Bloom filter for changed paths
------------------------------

The goal of this chunk is to speed up checking whether a file or directory
was changed in a given commit, for queries such as "git log -- <file>" or
"git blame <file>".  This is something that according to "Git Merge
contributor summit notes" [2] is already present in VSTS (Visual Studio
Team Services - the server counterpart of GVFS: Git Virtual File System)
at Microsoft:

AV> - VSTS adds bloom filters to know which paths have changed on the commit
AV> - tree-same check in the bloom filter is fast; speeds up file history checks
AV> - might be useful in the client as well, since limited-traversal is common
AV> - if the file history is _very_ sparse, then bloom filter is useful
AV> - but needs pre-compute, so useful to do once
AV> - first make the client do it, then think about how to serve it centrally

[2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

I think it was what Derrick Stolee was talking about at the end of his
part of "Making Git for Windows" presentation at Git Merge 2018:
https://youtu.be/oOMzi983Qmw?t=1835

This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
trees when reading commit-graph", starting from [3]
[3]: https://public-inbox.org/git/[email protected]/

Again, the benefits of Bloom filters should only be measured after already taking advantage of a reachability index during `git log`. However, you could get performance benefits from Bloom filters in a normal `git log` (no topo-order).

The tricky part about this feature is that the decisions we made in our C# implementation for the VSTS Git server may be very different from the needs of the C implementation of the Git client. Questions like "how do we handle merge commits?" may have different answers, which can only be discovered by implementing the feature.

(The answer for VSTS is that we only store Bloom filters containing the list of changed paths against the first parent. The second parent frequently has too many different paths, and if we are computing file-history simplification we have already determined the first parent is _not_ TREESAME, which requires verifying the difference by parsing trees against the first parent.)
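For anyone who wants to experiment, the general shape of a per-commit changed-path filter might look like the sketch below. This is not the VSTS format: the filter size, the number of hashes, the use of SHA-1 to derive bit positions, and the choice to also insert leading directories are all assumptions made for the example, as are the names `ChangedPathFilter`, `file_history`, and `diff_first_parent`.

```python
import hashlib


def path_and_parents(path):
    """'src/a.c' -> ['src', 'src/a.c'], so directory queries also match."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


class ChangedPathFilter:
    """Bloom filter over the paths changed against the first parent."""

    def __init__(self, changed_paths, num_bits=256, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0
        for path in changed_paths:
            for entry in path_and_parents(path):
                for pos in self._positions(entry):
                    self.bits |= 1 << pos

    def _positions(self, path):
        # Double hashing: derive all bit positions from two base hashes.
        digest = hashlib.sha1(path.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def maybe_changed(self, path):
        # False means "definitely not changed"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


def file_history(path, commits, filters, diff_first_parent):
    """Return the commits that changed `path`, diffing only when needed."""
    hits = []
    for commit in commits:
        if not filters[commit].maybe_changed(path):
            continue  # TREESAME to the first parent for this path
        changed = diff_first_parent(commit)  # the expensive tree diff
        if any(p == path or p.startswith(path + "/") for p in changed):
            hits.append(commit)
    return hits


# Toy data (hypothetical): changed paths per commit against its first parent.
changes = {"c3": {"doc/readme"}, "c2": set(), "c1": {"src/a.c"}}
filters = {c: ChangedPathFilter(paths) for c, paths in changes.items()}
print(file_history("src/a.c", ["c3", "c2", "c1"], filters, changes.get))
# ['c1']
```

A filter can return a false positive but never a false negative, so the exact tree diff is still needed to confirm a hit; the saving comes from the commits where the filter says "no".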

I'm happy to provide more information on how we built this feature if someone is writing a patch. Otherwise, I plan to implement it after finishing the parts I think are higher priority.


Thanks,
-Stolee
