Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-15 Thread Johannes Schindelin
Hi Kuba,

On Mon, 14 May 2018, Jakub Narebski wrote:

> [... lots and lots of discussions...]
>
> All right.
> 
> Here is my [allegedly] improved version, which assumes that we always
> want to start from commit with maximum generation number (there may be
> more than one such commit).
> 
> Let's add one more data structure:
> 
>   RRQ : A queue, for storing remaining commits from R (starting point).
>   It is a priority queue ordered by generation number.
> 
> 
> InitializeTopoOrder(R):
>     Initialize IDQ, IDV, TOQ, RRQ
> 
>     max_generation = 0
>     visit_order = 0
> 
>     for each commit c in R:
>         max_generation = max(max_generation, generation(c))
>         unless IsRedundantFilter(R / c, c): # optional
>             RRQ.Enqueue(c, generation(c))
> 
>     while RRQ.Max == max_generation:
>         c = RRQ.Dequeue()
>         IDV[c] = 0  # commits with max gen have in-degree of 0
>         IDQ.Enqueue(c, generation(c))
> 
>     # this would be almost no-op
>     AdvanceInDegreeWalk(IDQ, IDV, max_generation)
> 
>     for each commit c in reversed R:
>         if generation(c) == max_generation: # then IDV[c] == 0
>             TOQ.Enqueue(c, visit_order++)
> 
>     return (IDQ, IDV, TOQ, RRQ, visit_order)
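
The priority queues the pseudocode above assumes (RRQ, IDQ, TOQ) only need enqueue, dequeue-max, and a peek at the maximum key. A minimal sketch with Python's heapq (class name and interface are mine, chosen to mirror the pseudocode, not anything in Git):

```python
import heapq

class MaxGenQueue:
    """Priority queue that dequeues the highest generation number first,
    as RRQ/IDQ assume above.  A sequence counter breaks ties in insertion
    order (handy for visit_order-style bookkeeping)."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def enqueue(self, commit, gen):
        # heapq is a min-heap, so negate the key to pop the maximum first
        heapq.heappush(self._heap, (-gen, self._seq, commit))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def max_key(self):
        # generation number of the current maximum, or None if empty
        return -self._heap[0][0] if self._heap else None

    def __len__(self):
        return len(self._heap)
```

The `while RRQ.Max == max_generation` loop above maps onto `max_key()` here.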

Isn't it premature to throw out code at this stage? My impression from the
side lines is that Stolee is trying to get concrete code implemented, code
that can be tested against concrete repositories, so that the question of
singular overall importance can be asked: is it worth it? Is it really
faster? And by how much?

It may not be the best idea to distract Stolee from said implementation
(i.e. step 1) by getting ahead of ourselves with steps 17, 18 or 19. The next
steps would seem to be step 2 and 3, and I am sure that Stolee would
appreciate your help in implementing them.

So at this point it would maybe make a lot more sense to help Stolee e.g.
by refactoring the rev-list code that is a *mess* around the topo order,
so that all kinds of cute algorithms can be implemented *directly* in Git,
and put to the ultimate test.

Ciao,
Dscho

Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-14 Thread Jakub Narebski
Derrick Stolee  writes:
> On 5/12/2018 10:00 AM, Jakub Narebski wrote:
>> Derrick Stolee  writes:
>>> On 5/4/2018 3:40 PM, Jakub Narebski wrote:

 With early parts of commit-graph feature (ds/commit-graph and
 ds/lazy-load-trees) close to being merged into "master", see
 https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
 I think it would be good idea to think what other data could be added
 there to make Git even faster.
>>> Before thinking about adding more data to the commit-graph, I think
>>> instead we need to finish taking advantage of the data that is already
>>> there. This means landing the generation number patch [1] (I think
>>> this is close, so I'll send a v6 this week if there is no new
>>> feedback.) and the auto-compute patch [2] (this could use more
>>> feedback, but I'll send a v1 based on the RFC feedback if no one
>>> chimes in).
>>>
>>> [1]
>>> https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/
>>>      [PATCH v5 00/11] Compute and consume generation numbers
>>>
>>> [2]
>>> https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/
>>>      [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
>>
>> Right, so the RFC might be a bit premature; I wanted the discussion to
>> be out there to think about when adding new uses of existing features.
>>
>>
>> DIGRESSION: it is commendable that you are sending patches in small,
>> easily digestible chunks / patch series.  It is much easier to review a
>> 10+-patch series than an 80+-patch behemoth (though I understand it is not always possible
>> to subdivide patch series into smaller self-contained sub-series).
>>
>> On the other hand, it would be nice to have some roadmap about series
>> and features to be sent in the future, if possible.  Something like what
>> was done when 'git rebase --interactive' was getting builtinified: moved
>> (in parts) to C.  It would be great to have such roadmap with milestones
>> achieved and milestones to be done in the cover letter for series.
>
> I suppose that is what I intended in the "Future Work" section of
> Documentation/technical/commit-graph.txt. It gives a set of things
> that need to be done in order to make this a default feature, not just
> an expert-level feature. When I wrote the section, I was focused
> entirely on "commit-graph version 1.0" and I didn't know how long that
> would take. The series have been getting interest and review (in great
> part to your interest, Jakub) so they have been moving to 'next'
> faster than I anticipated.
>
> I'll plan on writing a more detailed list of future directions, but
> for now I'll list the things I know about and how I see them in
> priority order:
>
> Commit-graph v1.0:
> * ds/generation-numbers
> * 'verify' and fsck/gc integration
> * correct behavior with shallow clones, replace-objects, and grafts

So the goal of the current v1.0 phase is to introduce generation numbers,
use them for better performance ("low hanging fruit"), and ensure that
the feature is automatic and safe -- thus usable for an ordinary user.
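
For reference, the generation number that ds/generation-numbers introduces is 1 for a parentless commit and 1 + the maximum of its parents' generation numbers otherwise. A small sketch of that definition (the data structures are illustrative, not Git's):

```python
def compute_generations(tips, parents):
    """Generation numbers as in ds/generation-numbers: gen(c) = 1 for a
    parentless commit, else 1 + max(gen(p)) over its parents.  `parents`
    maps a commit id to a list of parent ids (illustrative structure)."""
    gen = {}
    for tip in tips:
        stack = [tip]
        while stack:
            c = stack[-1]
            if c in gen:
                stack.pop()
                continue
            todo = [p for p in parents.get(c, ()) if p not in gen]
            if todo:
                stack.extend(todo)   # resolve parents first
            else:
                gen[c] = 1 + max((gen[p] for p in parents.get(c, ())),
                                 default=0)
                stack.pop()
    return gen
```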

>
> Commit-graph v1.1:
> * Place commit-graph storage in the_repository
> * 'git tag --merged' use generation numbers
> * 'git log --graph' use generation numbers
>
> Commit-graph v1.X:
> * Protocol v2 verb for sending commit-graph
> * Bloom filters for changed paths

Thanks, that is what I was missing.  The "Future Work" section, while
very nice to have (because it does not require following git
development; it is right there in the technical documentation), lacked
prioritization and a rough milestone map.

[...]
>>> The tougher challenge is `git log --graph`. The revision walk
>>> machinery currently uses two precompute phases before iterating
>>> results to the pager: limit_list() and sort_in_topological_order();
>>> these correspond to two phases of Kahn's algorithm for topo-sort
>>> (compute in-degrees, then walk by peeling commits with in-degree
>>> zero). This requires O(N) time, where N is the number of reachable
>>> commits. Instead, we could make this be O(W) time to output one page
>>> of results, where W is (roughly) the number of reachable commits with
>>> generation number above the last reported result.
>>
>> A reminder: Kahn's algorithm (described for example in [1] and [2])
>> looks like this:
>>
>>    L ← Empty list that will contain the sorted elements
>>    S ← Collection of all nodes with no incoming edge
>>    while S is non-empty do
>>        remove a node 'n' from S
>>        add 'n' to tail of L
>>        for each parent 'm' of 'n' do
>>            decrease the in-degree of 'm'
>>            if 'm' has in-degree of 0 then
>>                insert 'm' into S
>>
>> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
>> [2]: 
>> https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
[...]

> I'm not following your pseudocode very well,

Not surprising, I've lost myself in the middle of writing 

Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-14 Thread Derrick Stolee

On 5/12/2018 10:00 AM, Jakub Narebski wrote:

Derrick Stolee  writes:

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think
instead we need to finish taking advantage of the data that is already
there. This means landing the generation number patch [1] (I think
this is close, so I'll send a v6 this week if there is no new
feedback.) and the auto-compute patch [2] (this could use more
feedback, but I'll send a v1 based on the RFC feedback if no one
chimes in).

[1]
https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/
     [PATCH v5 00/11] Compute and consume generation numbers

[2]
https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/
     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

Right, so the RFC might be a bit premature; I wanted the discussion to
be out there to think about when adding new uses of existing features.


DIGRESSION: it is commendable that you are sending patches in small,
easily digestible chunks / patch series.  It is much easier to review a
10+-patch series than an 80+-patch behemoth (though I understand it is not always possible
to subdivide patch series into smaller self-contained sub-series).

On the other hand, it would be nice to have some roadmap about series
and features to be sent in the future, if possible.  Something like what
was done when 'git rebase --interactive' was getting builtinified: moved
(in parts) to C.  It would be great to have such roadmap with milestones
achieved and milestones to be done in the cover letter for series.


I suppose that is what I intended in the "Future Work" section of 
Documentation/technical/commit-graph.txt. It gives a set of things that 
need to be done in order to make this a default feature, not just an 
expert-level feature. When I wrote the section, I was focused entirely 
on "commit-graph version 1.0" and I didn't know how long that would 
take. The series have been getting interest and review (in great part to 
your interest, Jakub) so they have been moving to 'next' faster than I 
anticipated.


I'll plan on writing a more detailed list of future directions, but for 
now I'll list the things I know about and how I see them in priority order:


Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths




The big wins remaining from this data are `git tag --merged` and `git
log --graph`. The `tag` scenario is probably easier: this can be done
by replacing the revision-walk underlying the call to use
paint_down_to_common() instead. Requires adding an external method to
commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged`
using its own codepath, besides historical reasons.  Maybe performance is
better with the current code?

Utilizing paint_down_to_common() there, besides reducing the amount of
code you would have to modify, would also unify the code paths (and
possibly reduce the total amount of code).  That's very nice.


The tougher challenge is `git log --graph`. The revision walk
machinery currently uses two precompute phases before iterating
results to the pager: limit_list() and sort_in_topological_order();
these correspond to two phases of Kahn's algorithm for topo-sort
(compute in-degrees, then walk by peeling commits with in-degree
zero). This requires O(N) time, where N is the number of reachable
commits. Instead, we could make this be O(W) time to output one page
of results, where W is (roughly) the number of reachable commits with
generation number above the last reported result.

A reminder: Kahn's algorithm (described for example in [1] and [2])
looks like this:

   L ← Empty list that will contain the sorted elements
   S ← Collection of all nodes with no incoming edge
   while S is non-empty do
       remove a node 'n' from S
       add 'n' to tail of L
       for each parent 'm' of 'n' do
           decrease the in-degree of 'm'
           if 'm' has in-degree of 0 then
               insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
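
Adapted to commit graphs, where the edges run from a commit to its parents (so a commit's in-degree counts its reachable children, and tips start at zero), the pseudocode above might look like this in Python (a sketch, not Git's implementation):

```python
from collections import deque

def kahn_topo_sort(commits, parents):
    """Kahn's algorithm as in the pseudocode above.  `commits` lists the
    reachable commits; `parents` maps a commit to its parent list."""
    # Phase 1: compute in-degrees (this is what limit_list() amounts to)
    indegree = {c: 0 for c in commits}
    for c in commits:
        for p in parents.get(c, ()):
            indegree[p] += 1

    # Phase 2: peel commits with in-degree zero
    S = deque(c for c in commits if indegree[c] == 0)  # no incoming edge
    L = []
    while S:
        n = S.popleft()
        L.append(n)
        for m in parents.get(n, ()):       # "for each parent 'm' of 'n'"
            indegree[m] -= 1
            if indegree[m] == 0:
                S.append(m)
    return L
```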


In order to take advantage of this approach, the two phases of Kahn's
algorithm need to be done in-line with reporting results to the
pager. This means keeping two queues: one is a priority queue by
generation number that computes in-degrees, the other is a priority
queue (by commit-date or a visit-order value to do the --topo-order
priority) that peels the in-degree-zero commits (and decrements the
in-degree of their parents).

Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-12 Thread Jakub Narebski
Derrick Stolee  writes:
> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>
>> With early parts of commit-graph feature (ds/commit-graph and
>> ds/lazy-load-trees) close to being merged into "master", see
>> https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
>> I think it would be good idea to think what other data could be added
>> there to make Git even faster.
>
> Before thinking about adding more data to the commit-graph, I think
> instead we need to finish taking advantage of the data that is already
> there. This means landing the generation number patch [1] (I think
> this is close, so I'll send a v6 this week if there is no new
> feedback.) and the auto-compute patch [2] (this could use more
> feedback, but I'll send a v1 based on the RFC feedback if no one
> chimes in).
>
> [1]
> https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/
>     [PATCH v5 00/11] Compute and consume generation numbers
>
> [2]
> https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/
>     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

Right, so the RFC might be a bit premature; I wanted the discussion to
be out there to think about when adding new uses of existing features.


DIGRESSION: it is commendable that you are sending patches in small,
easily digestible chunks / patch series.  It is much easier to review a
10+-patch series than an 80+-patch behemoth (though I understand it is not always possible
to subdivide patch series into smaller self-contained sub-series).

On the other hand, it would be nice to have some roadmap about series
and features to be sent in the future, if possible.  Something like what
was done when 'git rebase --interactive' was getting builtinified: moved
(in parts) to C.  It would be great to have such roadmap with milestones
achieved and milestones to be done in the cover letter for series.

> The big wins remaining from this data are `git tag --merged` and `git
> log --graph`. The `tag` scenario is probably easier: this can be done
> by replacing the revision-walk underlying the call to use
> paint_down_to_common() instead. Requires adding an external method to
> commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged`
using its own codepath, besides historical reasons.  Maybe performance is
better with the current code?

Utilizing paint_down_to_common() there, besides reducing the amount of
code you would have to modify, would also unify the code paths (and
possibly reduce the total amount of code).  That's very nice.

>
> The tougher challenge is `git log --graph`. The revision walk
> machinery currently uses two precompute phases before iterating
> results to the pager: limit_list() and sort_in_topological_order();
> these correspond to two phases of Kahn's algorithm for topo-sort
> (compute in-degrees, then walk by peeling commits with in-degree
> zero). This requires O(N) time, where N is the number of reachable
> commits. Instead, we could make this be O(W) time to output one page
> of results, where W is (roughly) the number of reachable commits with
> generation number above the last reported result.

A reminder: Kahn's algorithm (described for example in [1] and [2])
looks like this:

  L ← Empty list that will contain the sorted elements
  S ← Collection of all nodes with no incoming edge
  while S is non-empty do
      remove a node 'n' from S
      add 'n' to tail of L
      for each parent 'm' of 'n' do
          decrease the in-degree of 'm'
          if 'm' has in-degree of 0 then
              insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/

> In order to take advantage of this approach, the two phases of Kahn's
> algorithm need to be done in-line with reporting results to the
> pager. This means keeping two queues: one is a priority queue by
> generation number that computes in-degrees,

This simple solution of using a priority queue by generation number won't
work, I think.  In-degree is computed from the heads down, while
generation number is computed from the roots up.

For example in the graph below

   *<---*<---*<---*<---*<---*<---*<---*<---*<---A
    \
     \--*<---B

both A and B have in-degree of 0, but gen(B) << gen(A).

But I might be wrong.

> the other is a priority
> queue (by commit-date or a visit-order value to do the --topo-order
> priority) that peels the in-degree-zero commits (and decrements the
> in-degree of their parents). I have not begun this refactoring effort
> because it appears complicated to me, and it will be hard to tease out
> the logic without affecting other consumers of the revision-walk
> machinery.
>
> I would love it if someone picked up the `git log --graph` task, since
> it will be a few weeks before I have the time to focus on it.

Let's assume that we have extra data (indexes such as 

Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-07 Thread Derrick Stolee

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

Hello,

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.


Before thinking about adding more data to the commit-graph, I think 
instead we need to finish taking advantage of the data that is already 
there. This means landing the generation number patch [1] (I think this 
is close, so I'll send a v6 this week if there is no new feedback.) and 
the auto-compute patch [2] (this could use more feedback, but I'll send 
a v1 based on the RFC feedback if no one chimes in).


[1] 
https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/

    [PATCH v5 00/11] Compute and consume generation numbers

[2] 
https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/

    [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git 
log --graph`. The `tag` scenario is probably easier: this can be done by 
replacing the revision-walk underlying the call to use 
paint_down_to_common() instead. Requires adding an external method to 
commit.c, but not too much code.


The tougher challenge is `git log --graph`. The revision walk machinery 
currently uses two precompute phases before iterating results to the 
pager: limit_list() and sort_in_topological_order(); these correspond to 
two phases of Kahn's algorithm for topo-sort (compute in-degrees, then 
walk by peeling commits with in-degree zero). This requires O(N) time, 
where N is the number of reachable commits. Instead, we could make this 
be O(W) time to output one page of results, where W is (roughly) the 
number of reachable commits with generation number above the last 
reported result.


In order to take advantage of this approach, the two phases of Kahn's 
algorithm need to be done in-line with reporting results to the pager. 
This means keeping two queues: one is a priority queue by generation 
number that computes in-degrees, the other is a priority queue (by 
commit-date or a visit-order value to do the --topo-order priority) that 
peels the in-degree-zero commits (and decrements the in-degree of their 
parents). I have not begun this refactoring effort because it appears 
complicated to me, and it will be hard to tease out the logic without 
affecting other consumers of the revision-walk machinery.
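
One way to sketch the two-queue idea: advance the in-degree-counting walk only as far down (by generation number) as needed to make the next candidate's in-degree final, then peel it. Since every edge into a commit comes from a strictly higher generation, in-degrees at generation >= g are final once all commits above g have been walked. This is an illustrative Python sketch under that reading, not the eventual Git implementation:

```python
import heapq

def lazy_topo_order(heads, parents, generation):
    """Yield reachable commits in topological order, running the two
    phases of Kahn's algorithm incrementally by generation number."""
    indegree = {}
    indeg_q = []       # in-degree walk frontier, highest generation first
    seen = set()

    def discover(c):
        if c not in seen:
            seen.add(c)
            heapq.heappush(indeg_q, (-generation[c], c))

    def advance(min_gen):
        # Count child->parent edges for all commits at generation >= min_gen;
        # afterwards in-degrees at those generations are final.
        while indeg_q and -indeg_q[0][0] >= min_gen:
            _, c = heapq.heappop(indeg_q)
            for p in parents.get(c, ()):
                indegree[p] = indegree.get(p, 0) + 1
                discover(p)

    ready = []         # candidates whose in-degree may be zero
    for h in heads:
        discover(h)
        heapq.heappush(ready, (-generation[h], h))

    emitted = set()
    while ready:
        neg_gen, c = heapq.heappop(ready)
        advance(-neg_gen)                  # finalize in-degree of c
        if c not in emitted and indegree.get(c, 0) == 0:
            emitted.add(c)
            yield c
            for p in parents.get(c, ()):   # peel: decrement parents
                indegree[p] -= 1
                if indegree[p] == 0:
                    heapq.heappush(ready, (-generation[p], p))
```

A candidate popped with a still-positive in-degree is simply dropped; it is re-queued when the decrement from its last emitted child brings it to zero.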


I would love it if someone picked up the `git log --graph` task, since 
it will be a few weeks before I have the time to focus on it.


Without completing the benefits we get from generation numbers, these 
investigations into other reachability indexes will be incomplete as 
they are comparing benefits without all consumers taking advantage of a 
reachability index.


[...]

Bloom filter for changed paths
------------------------------

The goal of this chunk is to speed up checking if a file or directory
was changed in a given commit, for queries such as "git log -- <path>" or
"git blame <path>".  This is something that according to "Git Merge
contributor summit notes" [2] is already present in VSTS (Visual Studio
Team Services - the server counterpart of GVFS: Git Virtual File System)
at Microsoft:

AV> - VSTS adds bloom filters to know which paths have changed on the commit
AV> - tree-same check in the bloom filter is fast; speeds up file history checks
AV> - might be useful in the client as well, since limited-traversal is common
AV> - if the file history is _very_ sparse, then bloom filter is useful
AV> - but needs pre-compute, so useful to do once
AV> - first make the client do it, then think about how to serve it centrally

[2]: 
https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
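
For readers unfamiliar with the idea: a changed-path Bloom filter answers "did this commit possibly touch this path?" with no false negatives, so a negative answer lets the walk skip the tree diff entirely. A toy sketch (the bit count, hash count and salted SHA-256 scheme are my own illustrative choices, not VSTS's or any on-disk format):

```python
import hashlib

class ChangedPathFilter:
    """Toy Bloom filter for one commit's changed paths.  `maybe_changed`
    returning False means the path definitely did not change; True means
    it *may* have (false positives are possible)."""

    def __init__(self, m_bits=256, k=4):
        self.m = m_bits
        self.k = k
        self.bits = 0

    def _positions(self, path):
        # k bit positions derived from salted SHA-256 (illustrative only)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{path}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, path):
        for pos in self._positions(path):
            self.bits |= 1 << pos

    def maybe_changed(self, path):
        return all(self.bits >> pos & 1 for pos in self._positions(path))
```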

I think it was what Derrick Stolee was talking about at the end of his
part of "Making Git for Windows" presentation at Git Merge 2018:
https://youtu.be/oOMzi983Qmw?t=1835

This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
trees when reading commit-graph", starting from [3]
[3]: https://public-inbox.org/git/86y3hyeu6c@gmail.com/


Again, the benefits of Bloom filters should only be measured after 
already taking advantage of a reachability index during `git log`. 
However, you could get performance benefits from Bloom filters in a 
normal `git log` (no topo-order).


The tricky part about this feature is that the decisions we made in our 
C# implementation for the VSTS Git server may be very different than the 
needs for the C implementation of the Git client. Questions like "how do 
we handle merge commits?" may have different answers, which can only be 
discovered by implementing the feature.


(The answer for VSTS is that we only store Bloom filters containing the 
list of changed paths against the first parent. The second parent 
frequently has 

Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-05 Thread Jakub Narebski
Ævar Arnfjörð Bjarmason  writes:
> On Fri, May 04 2018, Jakub Narebski wrote:
>
> (Just off-the cuff here and I'm surely about to be corrected by
> Derrick...)
>
>> * What to do about merge commits, and octopus merges in particular?
>>   Should Bloom filter be stored for each of the parents?  How to ensure
>>   fast access then (fixed-width records) - use large edge list?
>
> You could still store it fixed width, you'd just say that if you
> encounter a merge with N parents the filter wouldn't store files changed
> in that commit, but rather whether any of the N (including the merge)
> had changes to files as of their common merge-base.

Well, that is one solution: to store the union of changes (sum of
changes) from all parents of a merge commit in the Bloom filter for the
merge.

But this wouldn't help us skip the merge entirely, because which parent
would we walk then?  But perhaps I am mistaken, and that does not
matter.

> Then if they did you'd need to walk all sides of the merge where each
> commit would also have the filter to figure out where the change(s)
> was/were, but if they didn't you could skip straight to the merge base
> and keep walking.

Another solution that I thought of is to use the same mechanism that the
commit-graph file uses for storing merges: store Bloom filters for the
first two parents, and if there are more parents (octopus merge), store
Bloom filters for the remaining parents in a large edge extension for
Bloom filters.

For the CDAT chunk this meant accepting some padding waste to get faster
access.  We could do the same for Bloom filters, but it may mean quite a
bit of waste, depending on how many bits the Bloom filter would use...
but there is another solution: for merge commits store Bloom filters for
the first two parents that are half the size - this of course means more
false positives, but it may be an acceptable solution.
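
The cost of halving the filter size can be estimated with the standard Bloom filter approximation p = (1 - e^(-kn/m))^k, for n inserted paths, m bits and k hash functions. The numbers below are purely illustrative:

```python
from math import exp

def false_positive_rate(n, m, k):
    """Classic Bloom filter approximation: p = (1 - e^(-k*n/m))^k for
    n inserted elements, m bits and k hash functions."""
    return (1.0 - exp(-k * n / m)) ** k

# Illustrative: 10 changed paths, 7 hash functions,
# a 256-bit filter versus the proposed half-size 128-bit one.
p_full = false_positive_rate(10, 256, 7)
p_half = false_positive_rate(10, 128, 7)
```

Halving m raises the false-positive rate considerably, which is exactly the trade-off being weighed above.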

> Which, on the topic of what else a commit graph could store: A mapping
> from merge commits of N parents to the merge-base of those commits.

The problem is that those N parents may have more than one merge-base,
and if so then those merge-bases may themselves have multiple
merge-bases, recursively (which is what the 'recursive' merge strategy
handles).  Though this could be solved with a 'large merge-base list'
extension, just like the existing EDGE chunk - I think we can assume
that most merge parents have only a single merge-base (but I have not
checked this).

> You could also store nothing for merges (or only files the merge itself
> changed v.s. its parents). Derrick talked about how the bloom filter
> implementation has a value that's "Didn't compute (for whatever reason),
> look at it manually".

Right, another solution could be to store nothing for merges, or to
store a Bloom filter for changes against only the first parent.  The
goal of the Bloom filter is to avoid calculating a diff if we don't
need to.

Derrick, could you tell us what solution VSTS uses for Bloom filters on
merge commits?  Thanks in advance.

>> * Then there is problem of rename and copying detection - I think we can
>>   simply ignore it: unless someone has an idea about how to handle it?
>>
>>   Though this means that "git log --follow <path>" wouldn't get any
>>   speedup, and neither the half of "git gui blame" that runs "git blame
>>   --incremental -C -C -w" -- the one that allows code copying and
>>   movement detection.
>
> Couldn't the bloom filter also speed up --follow if you did two passes
> through the history? The first to figure out all files that ever changed
> names, and then say you did `--follow sha1-name.c` on git.git. The
> filter would have had all the bits for both sha1_name.c and sha1-name.c
> set on all commits that touched either for all of the history.
>
> Of course this would only work with a given default value of -M, but
> on the assumption that most users left it at the default, and
> furthermore that renames weren't so common as to make the filter useless
> with too many false-positives as a result, it might be worth it. If you

I think it would be much simpler to just ensure that we also store pure
renames in the Bloom filter as changed files, and leave rename detection
to the walk.  This way we do not set the old rename detection algorithm
in stone.

The walk would simply change the name of the file it asks the Bloom
filters about.
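
That walk could be sketched like this, with per-commit changed-path sets standing in for the Bloom filter queries and a rename map recorded alongside them (all names and structures here are illustrative assumptions):

```python
def follow_history(head, parents_of, changed_paths, renames, path):
    """First-parent walk for `git log --follow <path>`: skip commits whose
    changed-path oracle (standing in for a Bloom filter) says the path
    definitely did not change, and switch to the old name when a commit
    recorded a pure rename."""
    results = []
    c = head
    while c is not None:
        if path in changed_paths.get(c, set()):   # filter says "maybe changed"
            results.append(c)
            # if this commit renamed the file, keep following the old name
            path = renames.get((c, path), path)
        ps = parents_of.get(c, [])
        c = ps[0] if ps else None                 # first-parent only
    return results, path
```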

Thank you for your comments,
-- 
Jakub Narębski


Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-04 Thread Ævar Arnfjörð Bjarmason

On Fri, May 04 2018, Jakub Narebski wrote:

(Just off-the cuff here and I'm surely about to be corrected by
Derrick...)

> * What to do about merge commits, and octopus merges in particular?
>   Should Bloom filter be stored for each of the parents?  How to ensure
>   fast access then (fixed-width records) - use large edge list?

You could still store it fixed width, you'd just say that if you
encounter a merge with N parents the filter wouldn't store files changed
in that commit, but rather whether any of the N (including the merge)
had changes to files as of their common merge-base.

Then if they did you'd need to walk all sides of the merge where each
commit would also have the filter to figure out where the change(s)
was/were, but if they didn't you could skip straight to the merge base
and keep walking.

Which, on the topic of what else a commit graph could store: A mapping
from merge commits of N parents to the merge-base of those commits.

You could also store nothing for merges (or only files the merge itself
changed v.s. its parents). Derrick talked about how the bloom filter
implementation has a value that's "Didn't compute (for whatever reason),
look at it manually".

> * Then there is problem of rename and copying detection - I think we can
>   simply ignore it: unless someone has an idea about how to handle it?
>
>   Though this means that "git log --follow <path>" wouldn't get any
>   speedup, and neither the half of "git gui blame" that runs "git blame
>   --incremental -C -C -w" -- the one that allows code copying and
>   movement detection.

Couldn't the bloom filter also speed up --follow if you did two passes
through the history? The first to figure out all files that ever changed
names, and then say you did `--follow sha1-name.c` on git.git. The
filter would have had all the bits for both sha1_name.c and sha1-name.c
set on all commits that touched either for all of the history.

Of course this would only work with a given default value of -M, but
on the assumption that most users left it at the default, and
furthermore that renames weren't so common as to make the filter useless
with too many false-positives as a result, it might be worth it. If you


Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-04 Thread Ævar Arnfjörð Bjarmason

On Fri, May 04 2018, Jakub Narebski wrote:

> With early parts of commit-graph feature (ds/commit-graph and
> ds/lazy-load-trees) close to being merged into "master", see
> https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
> I think it would be good idea to think what other data could be added
> there to make Git even faster.

Thanks for the write-up.

> 3. Third, it needs to be reasonably fast to create, and fast to update.
> This means the time to create the chunk should be proportional to the
> number of commits, or to the sum of the number of commits and edges
> (which for the commit graph and other sparse graphs is proportional to
> the number of commits anyway).  In my opinion time proportional to
> n*log(n), where 'n' is the number of commits, is also acceptable.
> Times proportional to n^2 or n^3 are not acceptable.

I don't think this requirement makes sense...

>   DS> If we add commit-graph file downloads to the protocol, then the
>   DS> server could do this computation and send the data to all
>   DS> clients. But that would be "secondary" information that maybe
>   DS> clients want to verify, which is as difficult as computing it
>   DS> themselves.

... when combined with this. If hypothetically there was some data that
significantly sped up Git and the algorithm to generate it was
ridiculously expensive, that would be fine when combined with the
ability to fetch it pre-generated from the server. There could always be
an option where you trust the server and optionally don't verify the
data, or where it's much cheaper to verify than compute.