Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-07 Thread Jakub Narebski
Derrick Stolee  writes:

> On 4/2/2018 10:46 AM, Jakub Narebski wrote:
>> Derrick Stolee  writes:
> [...]
>> I see the FELINE-index as a stronger form of generation numbers (called
>> also level of the vertex / node), in that it allows to negative-cut even
>> more, pruning paths that are known to be unreachable (or marking nodes
>> known to be unreachable in the "calculate difference" scenario).
>>
>> Also, FELINE-index uses two integer numbers (coordinates in 2d space);
>> one of those indices can be the topological numbering (topological
>> sorting order) of nodes in the commit graph.  That would help to answer
>> even more Git questions.
>
> This two-dimensional generation number is helpful for non-reachability
> queries, but is something that needs the "full" commit graph in order
> to define the value for a single commit (hence the O(N lg N)
> performance mentioned below). Generation numbers are effective while
> being easy to compute and immutable.

O(N log N) is the cost of sort algorithm, namely peforming topological
sorting of commits.  Which Git can currently do.

We can use the main idea behind FELINE index, namely weak dominance
drawing, while perhaps changing the detail.  The main idea is that if
you draw edges of DAG always to be in selected quadrant -- the default
is drawing them up and to the right, then paths would also always be in
the same quadrant; if target node is not in given quadrant with respect
to source node, then target node cannot be reached from source node.

For fast answering of reachability queries it is important to have
minimal number of false positives (falsely implied paths), that is
situation where FELINE index does imply that there could be a path, but
in fact target is not reachable from the source.  The authors propose a
heuristics: use positions in [some] topological order for X coordinates
of FELINE index, and generate new optimal locally topological sorting to
use for Y coordinates.


Generation numbers (which can be used together with FELINE index, and
what paper does use -- calling them Level Filter) have the advantage of
being able to be calculated incrementally. I think this is what you
wanted to say.

I think it would be possible to calculate FELINE index incrementally,
too.  If Git's topological order is stable, at least for X coordinates
of FELINE index it would be easy.


I have started experimenting with this approach in Jupyter Notebook,
available on Google Colaboratory (you can examine it, run it and edit it
from web browser -- the default is to use remote runtime on Google
cloud).  Currently I am at the stage of reproducing the theoretical
parts of the FELINE paper.

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
  
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

> I wonder if FELINE was compared directly to a one-dimensional index (I
> apologize that I have not read the paper in detail, so I don't
> understand the indexes they compare against).

They have compared FELINE using real graphs (like ArXiv, Pubmed,
CiteseerX, Uniprot150m) and synthetic DAG against:
 - GRAIL (Graph Reachability Indexing via RAndomized Interval Labeling)
 - FERRARI (Flexible and Efficient Reachability Range Assignment for
   gRapth Indexing), with interval set compression to approximate ones
 - Nuutila's INTERVAL, which extracts complete transitive closure of
   the graph and stores it using PWAH bit-vector compression
 - TF-Label (Topological Folding LABELling)

>It also appears the
> graphs they use for their tests are not commit graphs, which have a
> different shape than many of the digraphs studies by that work.

I plan to check how FELINE index works for commit graphs in
above-mentioned Jupyter Notebook.

> This is all to say: I would love to see an interesting study in this
> direction, specifically comparing this series' definition of
> generation numbers to the 2-dimensional system in FELINE, and on a
> large sample of commit graphs available in open-source data sets
> (Linux kernel, Git, etc.) and possibly on interesting closed-source
> data sets.

I wonder if there are more recent works, with even faster solutions.

>>> The case for "Can X reach Y?" is mostly for commands like 'git branch
>>> --contains', when 'git fetch' checks for forced-updates of branches,
>>> or when the server decides enough negotiation has occurred during a
>>> 'git fetch'. While these may be worth investigating, they also benefit
>>> greatly from the accelerated graph walk introduced in the current
>>> format.
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2] 
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf

I wonder if next low-hanging branch would be to store 

Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Stefan Beller
> Currently, the format includes 8 bytes to share between the generation
> number and commit date. Due to alignment concerns, we will want to keep this
> as 8 bytes or truncate it to 4-bytes. Either we would be wasting at least 3
> bytes or truncating dates too much (presenting the 2038 problem [1] since
> dates are signed).

Good point. I forgot about them while writing the previous email.
That is reason enough to keep the generation numbers, sorry
for the noise.

>
>> I only glanced at the paper, but it looks like a "more advanced 2d
>> generation number" that seems to be able to answer questions
>> that gen numbers can answer, but that paper also refers
>> to SCARAB as well as GRAIL as the state of the art, so maybe
>> there are even more papers to explore?
>
>
> The biggest reason I can say to advance this series (and the small follow-up
> series that computes and consumes generation numbers) is that generation
> numbers are _extremely simple_. You only need to know your parents and their
> generation numbers to compute your own. These other reachability indexes
> require examining the entire graph to create "good" index values.

Yes, that is a good point, too. Generation numbers can be computed
"commit locally" and do not need expensive setups, which the others
presumably need.

> The hard part about using generation numbers (or any other reachability
> index) in Git is refactoring the revision-walk machinery to take advantage
> of them; current code requires O(reachable commits) to topo-order instead of
> O(commits that will be output). I think we should table any discussion of
> these advanced indexes until that work is done and a valuable comparison can
> be done. "Premature optimization is the root of all evil" and all that.

agreed,

Stefan


Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee

On 4/2/2018 1:35 PM, Stefan Beller wrote:

On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee  wrote:

I would be happy to review any effort to extend the commit-graph
format to include such indexes, as long as the performance benefits
outweigh the complexity to create them.

[2]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf

The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.


This would be very easy to add as an optional chunk, since it can use one
row per commit.

Given this discussion, I wonder if we want to include generation numbers
as a first class citizen in the current format. They could also go as
an optional
chunk and we may want to discuss further if we want generation numbers or
FELINE numbers or GRAIL or SCARAB, which are all graph related speedup
mechanism AFAICT.
In case we decide against generation numbers in the long run,
the row of mandatory generation numbers would be dead weight
that we'd need to carry.


Currently, the format includes 8 bytes to share between the generation 
number and commit date. Due to alignment concerns, we will want to keep 
this as 8 bytes or truncate it to 4-bytes. Either we would be wasting at 
least 3 bytes or truncating dates too much (presenting the 2038 problem 
[1] since dates are signed).



I only glanced at the paper, but it looks like a "more advanced 2d
generation number" that seems to be able to answer questions
that gen numbers can answer, but that paper also refers
to SCARAB as well as GRAIL as the state of the art, so maybe
there are even more papers to explore?


The biggest reason I can say to advance this series (and the small 
follow-up series that computes and consumes generation numbers) is that 
generation numbers are _extremely simple_. You only need to know your 
parents and their generation numbers to compute your own. These other 
reachability indexes require examining the entire graph to create "good" 
index values.


The hard part about using generation numbers (or any other reachability 
index) in Git is refactoring the revision-walk machinery to take 
advantage of them; current code requires O(reachable commits) to 
topo-order instead of O(commits that will be output). I think we should 
table any discussion of these advanced indexes until that work is done 
and a valuable comparison can be done. "Premature optimization is the 
root of all evil" and all that.


Thanks,
-Stolee

[1] https://en.wikipedia.org/wiki/Year_2038_problem


Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Stefan Beller
On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee  wrote:
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2]
>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf
>>
>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>> storage complexity is 2*|V|.
>>
>
> This would be very easy to add as an optional chunk, since it can use one
> row per commit.

Given this discussion, I wonder if we want to include generation numbers
as a first class citizen in the current format. They could also go as
an optional
chunk and we may want to discuss further if we want generation numbers or
FELINE numbers or GRAIL or SCARAB, which are all graph related speedup
mechanism AFAICT.
In case we decide against generation numbers in the long run,
the row of mandatory generation numbers would be dead weight
that we'd need to carry.

I only glanced at the paper, but it looks like a "more advanced 2d
generation number" that seems to be able to answer questions
that gen numbers can answer, but that paper also refers
to SCARAB as well as GRAIL as the state of the art, so maybe
there are even more papers to explore?

Stefan


Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee

On 4/2/2018 10:46 AM, Jakub Narebski wrote:

Derrick Stolee  writes:

[...]

At one point, I was investigating these reachability indexes (I read
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
Ruan, Dey, and Xu [2]) but find the question that these indexes target
to be lacking for most of the Git uses. That is, they ask the boolean
question "Can X reach Y?". More often, Git needs to answer "What is
the set of commits reachable from X but not from Y" or "Topologically
sort commits reachable from X" or "How many commits are in each part
of the symmetric difference between reachable from X or reachable from
Y?"

In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
Meira and Zaki FELINE-index work, authors mention SCARAB as something
that can be used in addition to FELINE-index, as a complementary data
(FELINE-SCARAB in the work, section 4.4).

I see the FELINE-index as a stronger form of generation numbers (called
also level of the vertex / node), in that it allows to negative-cut even
more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario).

Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph.  That would help to answer
even more Git questions.


This two-dimensional generation number is helpful for non-reachability 
queries, but is something that needs the "full" commit graph in order to 
define the value for a single commit (hence the O(N lg N) performance 
mentioned below). Generation numbers are effective while being easy to 
compute and immutable.


I wonder if FELINE was compared directly to a one-dimensional index (I 
apologize that I have not read the paper in detail, so I don't 
understand the indexes they compare against). It also appears the graphs 
they use for their tests are not commit graphs, which have a different 
shape than many of the digraphs studies by that work.


This is all to say: I would love to see an interesting study in this 
direction, specifically comparing this series' definition of generation 
numbers to the 2-dimensional system in FELINE, and on a large sample of 
commit graphs available in open-source data sets (Linux kernel, Git, 
etc.) and possibly on interesting closed-source data sets.





The case for "Can X reach Y?" is mostly for commands like 'git branch
--contains', when 'git fetch' checks for forced-updates of branches,
or when the server decides enough negotiation has occurred during a
'git fetch'. While these may be worth investigating, they also benefit
greatly from the accelerated graph walk introduced in the current
format.

I would be happy to review any effort to extend the commit-graph
format to include such indexes, as long as the performance benefits
outweigh the complexity to create them.

[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf

The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.



This would be very easy to add as an optional chunk, since it can use 
one row per commit.


Thanks,
-Stolee


Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Jakub Narebski
Derrick Stolee  writes:

> On 3/30/2018 7:10 AM, Jakub Narebski wrote:
>> I hope that I am addressing the most recent version of this series.
>
> Hi Jakub. Thanks for the interest in this patch series.
>
> The most-recent version is v6 [1], but I will re-roll to v7 soon
> (after v2.17.0 is marked).
>
> [1] 
> https://public-inbox.org/git/20180314192736.70602-1-dsto...@microsoft.com/T/#u

Ooops.  Sorry about that.

>> Derrick Stolee  writes:
[...]

>> What are the assumptions about the serialized commit graph format? Does
>> it need to be:
>>   - extensible without rewriting (e.g. append-only)?
>>   - like the above, but may need rewriting for optimal performance?
>>   - extending it needs to rewrite whole file?
>>
>> Excuse me if it waas already asked and answered.
>
> It is not extensible without rewriting. Reducing write time was not a
> main goal, since the graph will be written only occasionally during
> data management phases (like 'gc' or 'repack'; this integration is not
> implemented yet).

Ah.  I thought that it could be something easily extensible in-place,
and thus easy to keep up to date on each commit.

Recalculating it on 'gc' or 'repack' is still good, especially that it
works even when there are come commits outside commit-graph, without
this information.

>>
>>> The file format has room to store generation numbers, which will be
>>> provided as a patch after this framework is merged. Generation numbers
>>> are referenced by the design document but not implemented in order to
>>> make the current patch focus on the graph construction process. Once
>>> that is stable, it will be easier to add generation numbers and make
>>> graph walks aware of generation numbers one-by-one.
>>>
>> As the serialized commit graph format is versioned, I wonder if it would
>> be possible to speed up graph walks even more by adding to it FELINE
>> index (pair of numbers) from "Reachability Queries in Very Large Graphs:
>> A Fast Refined Olnine Search Approach" (2014) - available at
>> http://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The implementation would probably need adjustments to make it
>> unambiguous and unambiguously extensible; unless there is place for
>> indices that are local-only and need to be recalculated from scratch
>> when graph changes (to cover all graph).
>
> The chunk-based format is intended to allow extra indexes like the one
> you recommend, without needing to increase the version number. Using
> an optional chunk allows older versions of Git to read the file
> without error, since the data is "extra", and newer versions can take
> advantage of the acceleration.

That's good.

> At one point, I was investigating these reachability indexes (I read
> "SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
> Ruan, Dey, and Xu [2]) but find the question that these indexes target
> to be lacking for most of the Git uses. That is, they ask the boolean
> question "Can X reach Y?". More often, Git needs to answer "What is
> the set of commits reachable from X but not from Y" or "Topologically
> sort commits reachable from X" or "How many commits are in each part
> of the symmetric difference between reachable from X or reachable from
> Y?"

In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
Meira and Zaki FELINE-index work, authors mention SCARAB as something
that can be used in addition to FELINE-index, as a complementary data
(FELINE-SCARAB in the work, section 4.4).

I see the FELINE-index as a stronger form of generation numbers (called
also level of the vertex / node), in that it allows to negative-cut even
more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario). 

Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph.  That would help to answer
even more Git questions.

> The case for "Can X reach Y?" is mostly for commands like 'git branch
> --contains', when 'git fetch' checks for forced-updates of branches,
> or when the server decides enough negotiation has occurred during a
> 'git fetch'. While these may be worth investigating, they also benefit
> greatly from the accelerated graph walk introduced in the current
> format.
>
> I would be happy to review any effort to extend the commit-graph
> format to include such indexes, as long as the performance benefits
> outweigh the complexity to create them.
>
> [2] 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf

The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.

>>
>>> Here are some performance results for a copy of the Linux repository
>>> where 'master' has 704,766 reachable commits and is behind 'origin/master'
>>> by 19,610 commits.
>>>
>>> | Command  | 

Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee

On 3/30/2018 7:10 AM, Jakub Narebski wrote:

I hope that I am addressing the most recent version of this series.


Hi Jakub. Thanks for the interest in this patch series.

The most-recent version is v6 [1], but I will re-roll to v7 soon (after 
v2.17.0 is marked).


[1] 
https://public-inbox.org/git/20180314192736.70602-1-dsto...@microsoft.com/T/#u



Derrick Stolee  writes:


As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
   incorrect topological orders, but only needs the commit message for
   the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
   boundary between the commits reachable from A and those reachable
   from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
   compared to their upstream remote branches. This is essentially as
   hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

That's nice.

What are the assumptions about the serialized commit graph format? Does
it need to be:
  - extensible without rewriting (e.g. append-only)?
  - like the above, but may need rewriting for optimal performance?
  - extending it needs to rewrite whole file?

Excuse me if it waas already asked and answered.


It is not extensible without rewriting. Reducing write time was not a 
main goal, since the graph will be written only occasionally during data 
management phases (like 'gc' or 'repack'; this integration is not 
implemented yet).





The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding to it FELINE
index (pair of numbers) from "Reachability Queries in Very Large Graphs:
A Fast Refined Olnine Search Approach" (2014) - available at
http://openproceedings.org/EDBT/2014/paper_166.pdf

The implementation would probably need adjustments to make it
unambiguous and unambiguously extensible; unless there is place for
indices that are local-only and need to be recalculated from scratch
when graph changes (to cover all graph).


The chunk-based format is intended to allow extra indexes like the one 
you recommend, without needing to increase the version number. Using an 
optional chunk allows older versions of Git to read the file without 
error, since the data is "extra", and newer versions can take advantage 
of the acceleration.


At one point, I was investigating these reachability indexes (I read 
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn, 
Ruan, Dey, and Xu [2]) but find the question that these indexes target 
to be lacking for most of the Git uses. That is, they ask the boolean 
question "Can X reach Y?". More often, Git needs to answer "What is the 
set of commits reachable from X but not from Y" or "Topologically sort 
commits reachable from X" or "How many commits are in each part of the 
symmetric difference between reachable from X or reachable from Y?"


The case for "Can X reach Y?" is mostly for commands like 'git branch 
--contains', when 'git fetch' checks for forced-updates of branches, or 
when the server decides enough negotiation has occurred during a 'git 
fetch'. While these may be worth investigating, they also benefit 
greatly from the accelerated graph walk introduced in the current format.


I would be happy to review any effort to extend the commit-graph format 
to include such indexes, as long as the performance benefits outweigh 
the complexity to create them.


[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf





Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv   |  0.42s |  0.27s | -35%  |
| rev-list --all   |  6.4s  

Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-03-30 Thread Jakub Narebski
I hope that I am addressing the most recent version of this series.

Derrick Stolee  writes:

> As promised [1], this patch contains a way to serialize the commit graph.
> The current implementation defines a new file format to store the graph
> structure (parent relationships) and basic commit metadata (commit date,
> root tree OID) in order to prevent parsing raw commits while performing
> basic graph walks. For example, we do not need to parse the full commit
> when performing these walks:
>
> * 'git log --topo-order -1000' walks all reachable commits to avoid
>   incorrect topological orders, but only needs the commit message for
>   the top 1000 commits.
>
> * 'git merge-base  ' may walk many commits to find the correct
>   boundary between the commits reachable from A and those reachable
>   from B. No commit messages are needed.
>
> * 'git branch -vv' checks ahead/behind status for all local branches
>   compared to their upstream remote branches. This is essentially as
>   hard as computing merge bases for each.
>
> The current patch speeds up these calculations by injecting a check in
> parse_commit_gently() to check if there is a graph file and using that
> to provide the required metadata to the struct commit.

That's nice.

What are the assumptions about the serialized commit graph format? Does
it need to be:
 - extensible without rewriting (e.g. append-only)?
 - like the above, but may need rewriting for optimal performance?
 - extending it needs to rewrite whole file?

Excuse me if it waas already asked and answered.

>
> The file format has room to store generation numbers, which will be
> provided as a patch after this framework is merged. Generation numbers
> are referenced by the design document but not implemented in order to
> make the current patch focus on the graph construction process. Once
> that is stable, it will be easier to add generation numbers and make
> graph walks aware of generation numbers one-by-one.

As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding to it FELINE
index (pair of numbers) from "Reachability Queries in Very Large Graphs:
A Fast Refined Olnine Search Approach" (2014) - available at
http://openproceedings.org/EDBT/2014/paper_166.pdf

The implementation would probably need adjustments to make it
unambiguous and unambiguously extensible; unless there is place for
indices that are local-only and need to be recalculated from scratch
when graph changes (to cover all graph).

>
> Here are some performance results for a copy of the Linux repository
> where 'master' has 704,766 reachable commits and is behind 'origin/master'
> by 19,610 commits.
>
> | Command  | Before | After  | Rel % |
> |--|||---|
> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
> | branch -vv   |  0.42s |  0.27s | -35%  |
> | rev-list --all   |  6.4s  |  1.0s  | -84%  |
> | rev-list --all --objects | 32.6s  | 27.6s  | -15%  |

That's the "Rel %" of "Before", that is delta/before, isn't it?

> To test this yourself, run the following on your repo:
>
>   git config core.commitGraph true
>   git show-ref -s | git commit-graph write --set-latest --stdin-commits
>
> The second command writes a commit graph file containing every commit
> reachable from your refs. Now, all git commands that walk commits will
> check your graph first before consulting the ODB. You can run your own
> performance comparisions by toggling the 'core.commitgraph' setting.

Good.  It is nicely similar to how bitmap indices (of reachability) are
handled.

I just wonder what happens in the (rare) presence of grafts (old
mechanism), or "git replace"-d commits...

Regards,
-- 
Jakub Narębski


[PATCH v4 00/13] Serialized Git Commit Graph

2018-02-19 Thread Derrick Stolee
Thanks for all of the feedback. I've learned a lot working on this patch.

As discussed [0], this version changes several fundamental structures
and operations, including:

* Graph files are stored in .git/objects/info

* The "graph-head" file is now called "graph-latest" to avoid confusion
  with HEAD. We use "--set-latest" argument instead of "--update-head".

* The graph format no longer stores the hash width, but expects the hash
  version to imply a length. This frees up one byte for future use while
  preserving 4-byte alignment.

* The graph format is more explicit about optional chunks and no longer
  dies when seeing an unknown chunk. The large edges chunk is now optional.

* There is no longer a "clear" subcommand to git-commit-graph.

* Fixed the bug related to "--stdin-commits" and check that the command
  only includes commits reachable from the input OIDs.

* The struct packed_oid_list type is now an array of struct object_id
  instead of an array of pointers. In my testing, I saw no performance
  difference between these two options, so I switched to the simpler
  pattern.

* My patch is based on jt/binsearch-with-fanout (b4e00f730), with the
  newer method prototype since v3.

Thanks,
-Stolee

[0] 
https://public-inbox.org/git/1517348383-112294-1-git-send-email-dsto...@microsoft.com/T/#m22bfdb7cf7b3d6e5f380b8bf0eec957e2cfd2dd7

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv   |  0.42s |  0.27s | -35%  |
| rev-list --all   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --set-latest --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisions by toggling the 'core.commitgraph' setting.

[1] 
https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

Derrick Stolee (13):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement git commit-graph read
  commit-graph: implement --set-latest
  commit-graph: implement --delete-expired
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits

 .gitignore  |   1 +
 Documentation/config.txt|   3 +
 Documentation/git-commit-graph.txt  | 105 
 Documentation/technical/commit-graph-format.txt |  90 +++