Re: [PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-23 Thread Derrick Stolee

On 4/18/2018 8:02 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

All right, that is nice explanation of the why behind this change.


Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Couldn't this 'check_graph' parameter be a global variable similar to
the 'commit_graph' variable?  Maybe I am not understanding it.


See the two callers at the bottom of the patch. They have different 
purposes: one needs to fill in a valid commit struct, the other needs to 
check the commit buffer is valid (then throws away the struct). They 
have different values for 'check_graph'. Also, in parse_commit_gently() 
we check parse_commit_in_graph() before we call parse_commit_buffer, so 
we do not want to repeat work; in the case of a valid commit-graph file, 
but the commit is not in the commit-graph, we would repeat our binary 
search for the same commit.





Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 51 --
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
  {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;

I'm probably wrong, but isn't it unrelated change?


You're right. I saw this while I was in here, and there was a similar 
comment on this change in a different patch. Probably best to keep these 
cleanup things in a separate commit.



item->object.parsed = 1;
item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
  }
  
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)

+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   }
+}

All right (after the fix).


+
  int parse_commit_in_graph(struct commit *item)
  {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;

Hmmm... previously the function returned 1 if item->object.parsed, now
it returns 0 for this situation.  I don't understand this change.


The good news is that this change is unimportant (the only caller is 
parse_commit_gently() which checks item->object.parsed before calling 
parse_commit_in_graph()). I wonder why I reordered those things, anyway. 
I'll revert to simplify the patch.





-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, &(item->object.oid), 
);
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, po

Re: [PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-23 Thread Derrick Stolee

On 4/18/2018 5:02 PM, Jakub Narebski wrote:

Here I can offer only the cursory examination, as I don't know this area
of code in question.

Derrick Stolee <dsto...@microsoft.com> writes:


A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

A question: what is the performance after if the "commit-graph" feature
is disabled, or there is no commit-graph file?  Is there performance
regression in this case, or is the difference negligible?


Negligible, since we are adding a small number of integer comparisons 
and the main cost is in commit parsing. More on commit parsing in 
response to your comments below.





Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  ref-filter.c | 23 +++
  1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  /*
   * Test whether the candidate or one of its parents is contained in the list.

  ^

Sidenote: when examining the code after the change, I have noticed that
the above part of commit header for the comtains_test() function is no
longer entirely correct, as the function only checks the candidate
commit, and in no place it access its parents.

But that is not your problem.


I'll add a commit in the next version that fixes this comment before I 
make any changes to the method.





   * Do not recurse to find out, though, but return -1 if inconclusive.
   */
  static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
  {
enum contains_result *cached = contains_cache_at(cache, candidate);
  
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
  
  	/* Otherwise, we don't know; prepare to recurse */

parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+

Looks good to me.

The only [minor] question may be whether to define separate type for
generation numbers, and whether to future proof the tests - though the
latter would be almost certainly overengineering, and the former
probablt too.


If we have multiple notions of generation, then we can refactor all 
references to the "generation" member.





return CONTAINS_UNKNOWN;
  }
  
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,

  struct contains_cache *cache)
  {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }

Sholdn't the above be made conditional on the ability to get generation
numbers from the commit-graph file (feature is turned on and file
exists)?  Otherwise here after the change contains_tag_algo() now parses
each commit in 'want', which I think was not done previously.

With commit-graph file parsing is [probably] cheap.  Without it, not
necessary.

But I might be worrying about nothing.


Not nothing. This parses the "wants" when we previously did not parse 
the wants. Further: this parsing happens before we do the simple check 
of comparing the OID of the candidate against the wants.


The question is: are these parsed commits significant compared to the 
walk that will parse many more commits? It is certainly possible.


One way to fix this is to call 'prepare_commit_graph()' directly and 
then test that 'commit_graph' is non-null before performing any parses. 
I'm not thrilled with how that couples the commit-graph implementation 
to this feature, but that may be necessary to avoid regressions in the 
non-commit-graph case.




  
+	result = contains_test(candidate, want, cache, cutoff);

O

Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee

On 4/21/2018 4:44 PM, Jakub Narebski wrote:

Jakub Narebski <jna...@gmail.com> writes:

Derrick Stolee <sto...@gmail.com> writes:

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.

The two main questions to consider are:

1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.

[...]


And the thing to measure is a commit count. If possible, it would be
good to count commits walked (commits whose parent list is enumerated)
and commits inspected (commits that were listed as a parent of some
walked commit). Walked commits require a commit parse -- albeit from
the commit-graph instead of the ODB now -- while inspected commits
only check the in-memory cache.

[...]

For git.git and Linux, I like to use the release tags as tests. They
provide a realistic view of the linear history, and maintenance
releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...


Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
  * src = 468165c1d = v2.17.0
  * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

  * 468165c1d has level 18418 which it shares with 6 commits
  * 66d2e04ec has level 14776 which it shares with 93 commits
  * gen(468165c1d) - gen(66d2e04ec) = 3642

  algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
  ---+-++--+-+--+---+
  naive  | 48865   | 39599  | 244  | 9200|  |   |
  level  |  3086   |  2492  | 113  |  528| 285  |   |
  FELINE |   283   |   216  |  68  |0|  |  25   |
  lev+FELINE |   282   |   215  |  68  |0|   5  |  24   |
  ---+-++--+-+--+---+
  lev+FEL+mpi|79   |59  |  21  |0|   0  |   0   |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
   note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
   were used for negative-cut; note that those are not disjoint; node can
   be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.


Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.



P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, 

Re: [PATCH v3 0/9] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee

On 4/18/2018 8:04 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains', providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is build on ds/lazy-load-trees.

Derrick Stolee (9):
   commit: add generation number to struct commmit

Nice and short patch. Looks good to me.


   commit-graph: compute generation numbers

Another quite easy to understand patch. LGTM.


   commit: use generations in paint_down_to_common()

Nice and short patch; minor typo in comment in code.
Otherwise it looks good to me.


   commit-graph.txt: update design document

I see that diagram got removed in this version; maybe it could be
replaced with relationship table?

Anyway, it looks good to me.


The diagrams and tables seemed to cause more confusion than clarity. I 
think the reader should create their own mental model from the 
definitions and description and we should avoid trying to make a summary.





   ref-filter: use generation number for --contains

A question: how performance looks like after the change if commit-graph
is not available?


The performance issue is minor, but will be fixed in v4.




   commit: use generation numbers for in_merge_bases()

Possible typo in the commit message, and stylistic inconsistence in
in_merge_bases() - though actually more clear than existing code.

Short, simple, and gives good performance improvenebts.


   commit: add short-circuit to paint_down_to_common()

Looks good to me; ignore [mostly] what I have written in response to the
patch in question.


   commit-graph: always load commit-graph information

Looks all right; question: parameter or one more global variable.


I responded to say that the global variable approach is incorrect. 
Parameter is important to functionality and performance.





   merge: check config before loading commits

This looks good to me.


  Documentation/technical/commit-graph.txt | 30 +--
  alloc.c  |  1 +
  builtin/merge.c  |  5 +-
  commit-graph.c   | 99 +++-
  commit-graph.h   |  8 ++
  commit.c | 54 +++--
  commit.h |  7 +-
  object.c |  2 +-
  ref-filter.c | 23 +-
  sha1_file.c  |  2 +-
  t/t5318-commit-graph.sh  |  9 +++
  11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707




Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-24 Thread Derrick Stolee

On 4/23/2018 5:38 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:


On 4/18/2018 7:19 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


[...]

[...], and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.


If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because max-order priority queue is used: if the commit with
maximum generation number is below generation number of target commit,
then target commit is not reachable from any commit in the priority
queue (all of which has generation number less or equal than the commit
at head of queue, i.e. all are same level or deeper); compare what I
have written in [1]

[1]: https://public-inbox.org/git/866052dkju@gmail.com/

Do I have that right?  If so, it looks all right to me.

Yes, the priority queue needs to compare via generation number first
or there will be errors. This is why we could not use commit time
before.

I was more concerned about getting right the order in the priority queue
(does it return minimal or maximal generation number).

I understand that the cutoff could not be used without generation
numbers because of the possibility of clock skew - using cutoff on dates
could lead to wrong results.


Maximal generation number is important so we do not visit commits 
multiple times (say, once with PARENT1 set, and a second time when 
PARENT2 is set). A minimal generation number order would create a DFS 
order and walk until the cutoff every time.


In cases without clock skew, maximal generation number order will walk 
the same set of commits as maximal commit time; the order may differ, 
but only between incomparable commits.



For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

[...]

flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
   -list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
while (list) {
struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
   -bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.

We need to change the signature some way, but maybe the way I chose is
not the best.

No, after taking longer I think the new signature is a good choice.


To elaborate: paint_down_to_common() is used for multiple
purposes. The caller here that supplies 'commit->generation' is used
only to compute reachability (by testing if the flag PARENT2 exists on
the commit, then clears all flags). The other callers expect the full
walk down to the common commits, and keeps those PARENT1, PARENT2, and
STALE flags for future use (such as reporting merge bases). Usually
the call to paint_down_to_common() is followed by a revision walk that
only halts when reaching root commits or commits with both PARENT1 and
PARENT2 flags on, so always short-circuiting on generations would
break the functionality; this is confirmed by the
t5318-commit-graph.sh.

Right.

I have realized that just after sending the email.  I'm sorry about this.


An alternative to the signature change is to add a boolean parameter
"use_cutoff" or something, that specifies "don't walk beyond the
commit". This may give a more of a clear description of what it will
do with the generation value, but since we are 

[RFC PATCH 02/12] commit-graph: add 'check' subcommand

2018-04-17 Thread Derrick Stolee
If the commit-graph file becomes corrupt, we need a way to verify
its contents match the object database. In the manner of 'git fsck'
we will implement a 'git commit-graph check' subcommand to report
all issues with the file.

Add the 'check' subcommand to the 'commit-graph' builtin and its
documentation. Add a simple test that ensures the command returns
a zero error code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  7 +-
 builtin/commit-graph.c | 38 ++
 commit-graph.c |  5 
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh|  5 
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..93c7841ba2 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 
 [verse]
+'git commit-graph check' [--object-dir ]
 'git commit-graph read' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
-
 DESCRIPTION
 ---
 
@@ -52,6 +52,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'check'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..77c1a04932 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,11 +7,17 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph check [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_check_usage[] = {
+   N_("git commit-graph check [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_check(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_check_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_check_options,
+builtin_commit_graph_check_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_name);
+   FREE_AND_NULL(graph_name);
+
+   return check_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "check"))
+   return graph_check(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 3f0c142603..cd0634bba0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir,
oids.alloc = 0;
oids.nr = 0;
 }
+
+int check_commit_graph(struct commit_graph *g)
+{
+   return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..e8c8d99dff 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
int nr_commits,
int append);
 
+int check_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..e91053271a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
 '
 
+test_expect_success 'git commit-graph check' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph check >output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 08/12] commit-graph: verify commit contents against odb

2018-04-17 Thread Derrick Stolee
When running 'git commit-graph check', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 82 ++
 1 file changed, 82 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 80a2ac2a6a..b5bce2bac4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g)
 graph_commit->graph_pos, i);
}
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
+   int num_parents = 0;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   graph_commit = lookup_commit(_oid);
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
+   if (parse_commit_internal(odb_commit, 0, 0))
+   graph_report(_("failed to parse %s from object 
database"), oid_to_hex(_oid));
+
+   if (oidcmp(_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report(_("root tree object ID for commit %s in 
commit-graph is %s != %s"),
+oid_to_hex(_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+   if (graph_commit->date != odb_commit->date)
+   graph_report(_("commit date for commit %s in 
commit-graph is %"PRItime" != %"PRItime""),
+oid_to_hex(_oid),
+graph_commit->date,
+odb_commit->date);
+
+
+   graph_parents = graph_commit->parents;
+   odb_parents = odb_commit->parents;
+
+   while (graph_parents) {
+   num_parents++;
+
+   if (odb_parents == NULL)
+   graph_report(_("commit-graph parent list for 
commit %s is too long (%d)"),
+oid_to_hex(_oid),
+num_parents);
+
+   if (oidcmp(_parents->item->object.oid, 
_parents->item->object.oid))
+   graph_report(_("commit-graph parent for %s is 
%s != %s"),
+oid_to_hex(_oid),
+
oid_to_hex(_parents->item->object.oid),
+
oid_to_hex(_parents->item->object.oid));
+
+   graph_parents = graph_parents->next;
+   odb_parents = odb_parents->next;
+   }
+
+   if (odb_parents != NULL)
+   graph_report(_("commit-graph parent list for commit %s 
terminates early"),
+oid_to_hex(_oid));
+
+   if (graph_commit->generation) {
+   uint32_t max_generation = 0;
+   graph_parents = graph_commit->parents;
+
+   while (graph_parents) {
+   if (graph_parents->item->generation == 
GENERATION_NUMBER_ZERO ||
+   graph_parents->item->generation == 
GENERATION_NUMBER_INFINITY)
+   graph_report(_("commit-graph has valid 
generation for %s but not its parent, %s"),
+oid_to_hex(_oid),
+
oid_to_hex(_parents->item->object.oid));
+   if (graph_parents->item->generation > 
max_generation)
+   max_generation = 
graph_parents->item->generation;
+   graph_parents = graph_parents->next;
+   }
+
+   if (graph_commit->generation != max_generation + 1)
+   graph_report(_("commit-graph has incorrect 
generation for %s"),
+oid_to_hex(_oid));
+   } else {
+   graph_parents = graph_commit->parents;
+
+   

[RFC PATCH 09/12] fsck: check commit-graph

2018-04-17 Thread Derrick Stolee
If a commit-graph file exists, check its contents during 'git fsck'.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/fsck.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..9712f230ba 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
 
check_connectivity();
+
+   if (core_commit_graph) {
+   struct child_process commit_graph_check = CHILD_PROCESS_INIT;
+   const char *check_argv[] = { "commit-graph", "check", NULL, 
NULL };
+   commit_graph_check.argv = check_argv;
+   commit_graph_check.git_cmd = 1;
+
+   if (run_command(_graph_check))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+
return errors_found;
 }
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

2018-04-17 Thread Derrick Stolee
The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
  but may not be required to start. The goal is to have this default to
  true eventually, but we may want to delay that until the feature is
  stable.

* I implement a "--reachable" option to 'git commit-graph write' that
  iterates over all refs. This does the same as 

git show-ref -s | git commit-graph write --stdin-commits

  but I don't know how to pipe two child processes together inside of Git.
  Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
  fixup! commit-graph: always load commit-graph information
  commit-graph: add 'check' subcommand
  commit-graph: check file header information
  commit-graph: parse commit from chosen graph
  commit-graph: check fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: check commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/git-commit-graph.txt   |  15 +-
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |   9 --
 builtin/commit-graph.c   |  79 +-
 builtin/fsck.c   |  13 ++
 builtin/gc.c |   8 +
 commit-graph.c   | 178 ++-
 commit-graph.h   |   2 +
 commit.c |  14 +-
 commit.h |   1 +
 t/t5318-commit-graph.sh  |  15 ++
 11 files changed, 311 insertions(+), 27 deletions(-)

-- 
2.17.0.39.g685157f7fb



[RFC PATCH 03/12] commit-graph: check file header information

2018-04-17 Thread Derrick Stolee
During a run of 'git commit-graph check', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index cd0634bba0..c5e5a0f860 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
 }
 
+static int check_commit_graph_error;
+#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); 
}
+
 int check_commit_graph(struct commit_graph *g)
 {
-   return !g;
+   if (!g) {
+   graph_report(_("no commit-graph file loaded"));
+   return 1;
+   }
+
+   check_commit_graph_error = 0;
+
+   if (get_be32(g->data) != GRAPH_SIGNATURE)
+   graph_report(_("commit-graph file has incorrect header"));
+
+   if (*(g->data + 4) != 1)
+   graph_report(_("commit-graph file version is not 1"));
+   if (*(g->data + 5) != GRAPH_OID_VERSION)
+   graph_report(_("commit-graph OID version is not 1 (SHA1)"));
+
+   if (!g->chunk_oid_fanout)
+   graph_report(_("commit-graph is missing the OID Fanout chunk"));
+   if (!g->chunk_oid_lookup)
+   graph_report(_("commit-graph is missing the OID Lookup chunk"));
+   if (!g->chunk_commit_data)
+   graph_report(_("commit-graph is missing the Commit Data 
chunk"));
+   if (g->hash_len != GRAPH_OID_LEN)
+   graph_report(_("commit-graph has incorrect hash length: %d"), 
g->hash_len);
+
+   return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 11/12] gc: automatically write commit-graph files

2018-04-17 Thread Derrick Stolee
The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-gc.txt | 4 
 builtin/gc.c | 8 
 2 files changed, 12 insertions(+)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..070f2a7a3d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", _depth);
git_config_get_int("gc.auto", _auto_threshold);
git_config_get_int("gc.autopacklimit", _auto_pack_limit);
+   git_config_get_bool("gc.commitgraph", _commit_graph);
git_config_get_bool("gc.autodetach", _auto);
git_config_get_expiry("gc.pruneexpire", _expire);
git_config_get_expiry("gc.worktreepruneexpire", 
_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_pushl(, "prune", "--expire", NULL);
argv_array_pushl(_worktrees, "worktree", "prune", "--expire", 
NULL);
argv_array_pushl(, "rerere", "gc", NULL);
+   argv_array_pushl(_graph, "commit-graph", "write", "--reachable", 
NULL);
 
/* default expiry time, overwritten in gc_config */
gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_commit_graph)
+   if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+   return error(FAILED_RUN, commit_graph.argv[0]);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..3f0c142603 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
*pos = item->graph_pos;
return 1;
} else {
-   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   return bsearch_graph(g, &(item->object.oid), pos);
}
 }
 
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 06/12] commit: force commit to parse from object database

2018-04-17 Thread Derrick Stolee
In anticipation of checking commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 14 ++
 commit.h |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 9ef6f699bd..07752d8503 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
buffer = read_sha1_file(item->object.oid.hash, , );
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-oid_to_hex(>object.oid));
+   oid_to_hex(>object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-oid_to_hex(>object.oid));
+   oid_to_hex(>object.oid));
}
ret = parse_commit_buffer(item, buffer, size, 0);
if (save_commit_buffer && !ret) {
@@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char 
*name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char 
*ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 05/12] commit-graph: check fanout and lookup table

2018-04-17 Thread Derrick Stolee
While running 'git commit-graph check', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

In anticipation of checking the commits in the commit-graph file
against the object database, parse the commits from that file in
advance. We perform this parse now to ensure the object cache contains
only commits from this commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 6d0d303a7a..6e3c08cd5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -835,6 +835,9 @@ static int check_commit_graph_error;
 
 int check_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report(_("no commit-graph file loaded"));
return 1;
@@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g)
if (g->hash_len != GRAPH_OID_LEN)
graph_report(_("commit-graph has incorrect hash length: %d"), 
g->hash_len);
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   if (i > 0 && oidcmp(_oid, _oid) >= 0)
+   graph_report(_("commit-graph has incorrect oid order: 
%s then %s"),
+
+   oid_to_hex(_oid),
+   oid_to_hex(_oid));
+   oidcpy(_oid, _oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report(_("commit-graph has incorrect 
fanout value: fanout[%d] = %u != %u"),
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+
+   graph_commit = lookup_commit(_oid);
+
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report(_("failed to parse %s from commit-graph"), 
oid_to_hex(_oid));
+
+   if (graph_commit->graph_pos != i)
+   graph_report(_("graph_pos for commit %s is %u != %u"), 
oid_to_hex(_oid),
+graph_commit->graph_pos, i);
+   }
+
return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 12/12] commit-graph: update design document

2018-04-17 Thread Derrick Stolee
The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 9 -
 1 file changed, 9 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index d9f2713efa..d04657b781 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -142,12 +139,6 @@ Future Work
   such as "ensure_tree_loaded(commit)" that fully loads a tree before
   using commit->tree.
 
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 10/12] commit-graph: add '--reachable' option

2018-04-17 Thread Derrick Stolee
When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 --
 builtin/commit-graph.c | 41 +++---
 t/t5318-commit-graph.sh| 10 
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 93c7841ba2..1b14d40590 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in 
packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 77c1a04932..a89285ada8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph check [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int reachable;
int stdin_packs;
int stdin_commits;
int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+struct hex_list {
+   char **hex_strs;
+   int hex_nr;
+   int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+  const struct object_id *oid,
+  int flags, void *cb_data)
+{
+   struct hex_list *list = (struct hex_list*)cb_data;
+
+   ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+   list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+   strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+   list->hex_nr++;
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "reachable", ,
+   N_("start walk at all refs")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
-   if (opts.stdin_packs && opts.stdin_commits)
-   die(_("cannot use both --stdin-commits and --stdin-packs"));
+   if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+   die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));
if (!opts.obj_dir)
   

[RFC PATCH 07/12] commit-graph: load a root tree from specific graph

2018-04-17 Thread Derrick Stolee
When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6e3c08cd5c..80a2ac2a6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -354,14 +354,20 @@ static struct tree *load_tree_for_commit(struct 
commit_graph *g, struct commit *
return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+const struct commit *c)
 {
if (c->maybe_tree)
return c->maybe_tree;
if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
 
-   return load_tree_for_commit(commit_graph, (struct commit *)c);
+   return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 04/12] commit-graph: parse commit from chosen graph

2018-04-17 Thread Derrick Stolee
Before checking a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c5e5a0f860..6d0d303a7a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
uint32_t pos;
 
if (item->object.parsed)
-   return 0;
+   return 1;
+
+   if (find_commit_in_graph(item, g, ))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
if (!core_commit_graph)
return 0;
+
prepare_commit_graph();
-   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.17.0.39.g685157f7fb



Re: [PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee

On 4/17/2018 1:00 PM, Derrick Stolee wrote:

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 51 --
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
  {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
  
  	item->object.parsed = 1;

item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
  }
  
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)

+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);


The reference to 'commit_graph' in the above line should be 'g'. Sorry!


+   }
+}
+
  int parse_commit_in_graph(struct commit *item)
  {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, &(item->object.oid), 
);
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
  }
  
+void load_commit_graph_info(struct commit *item)

+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
  static struct tree *load_tree_for_commit(struct commit_graph *g, struct 
commit *c)
  {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);
  
+/*

+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
  struct tree *get_commit_tree_in_graph(const struct commit *c);
  
  struct commit_graph {

diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_comm

Re: [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

2018-04-17 Thread Derrick Stolee

On 4/17/2018 2:10 PM, Derrick Stolee wrote:

The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
   but may not be required to start. The goal is to have this default to
   true eventually, but we may want to delay that until the feature is
   stable.

* I implement a "--reachable" option to 'git commit-graph write' that
   iterates over all refs. This does the same as

git show-ref -s | git commit-graph write --stdin-commits

   but I don't know how to pipe two child processes together inside of Git.
   Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
   fixup! commit-graph: always load commit-graph information
   commit-graph: add 'check' subcommand
   commit-graph: check file header information
   commit-graph: parse commit from chosen graph
   commit-graph: check fanout and lookup table
   commit: force commit to parse from object database
   commit-graph: load a root tree from specific graph
   commit-graph: verify commit contents against odb
   fsck: check commit-graph
   commit-graph: add '--reachable' option
   gc: automatically write commit-graph files
   commit-graph: update design document

  Documentation/git-commit-graph.txt   |  15 +-
  Documentation/git-gc.txt |   4 +
  Documentation/technical/commit-graph.txt |   9 --
  builtin/commit-graph.c   |  79 +-
  builtin/fsck.c   |  13 ++
  builtin/gc.c |   8 +
  commit-graph.c   | 178 ++-
  commit-graph.h   |   2 +
  commit.c |  14 +-
  commit.h |   1 +
  t/t5318-commit-graph.sh  |  15 ++
  11 files changed, 311 insertions(+), 27 deletions(-)


This RFC is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/6



Re: [PATCH v10 00/36] Add directory rename detection to git

2018-04-19 Thread Derrick Stolee

On 4/19/2018 2:41 PM, Stefan Beller wrote:

On Thu, Apr 19, 2018 at 11:35 AM, Elijah Newren  wrote:

On Thu, Apr 19, 2018 at 10:57 AM, Elijah Newren  wrote:

This series is a reboot of the directory rename detection series that was
merged to master and then reverted due to the final patch having a buggy
can-skip-update check, as noted at
   https://public-inbox.org/git/xmqqmuya43cs@gitster-ct.c.googlers.com/
This series based on top of master.

...and merges cleanly to next but apparently has some minor conflicts
with both ds/lazy-load-trees and ps/test-chmtime-get from pu.

What's the preferred way to resolve this?  Rebase and resubmit my
series on pu, or something else?

If you were to base it off of pu, this series would depend on all other
series that pu contains. This is bad for the progress of this series.
(If it were to be merged to next, all other series would automatically
merge to next as well)

If the conflicts are minor, then Junio resolves them; if you want to be
nice, pick your merge point as

 git checkout origin/master
 git merge ds/lazy-load-trees
 git merge ps/test-chmtime-get
 git tag my-anchor

and put the series on top of that anchor.

If you do this, you'd want to be reasonably sure that
those two series are not in too much flux.


I believe ds/lazy-load-trees is queued for 'next'. I'm not surprised 
that there are some conflicts here. Any reference to the 'tree' member 
of a commit should be replaced with 'get_commit_tree(c)', or 
'get_commit_tree_oid(c)' if you only care about the tree's object id.


I think Stefan's suggestion is the best approach to get the right 
conflicts out of the way.


Thanks,
-Stolee


Re: [RFC 0/1] Tolerate broken headers in `packed-refs` files

2018-03-26 Thread Derrick Stolee

On 3/26/2018 8:42 AM, Michael Haggerty wrote:

[...]

But there might be some tools out in the wild that have been writing
broken headers. In that case, users who upgrade Git might suddenly
find that they can't read repositories that they could read before. In
fact, a tool that we wrote and use internally at GitHub was doing
exactly that, which is how we discovered this "problem".

This patch shows what it would look like to relax the parsing again,
albeit *only* for the first line of the file, and *only* for lines
that start with '#'.

The problem with this patch is that it would make it harder for people
who implement broken tools in the future to discover their mistakes.
The only result of the error would be that it is slower to work with
the `packed-refs` files that they wrote. Such an error could go
undiscovered for a long time.


My opinion is that we shouldn't maintain back-compat with formats that 
may have been written by another tool because Git wasn't strict about 
it. As long as Git never wrote files with these formats, then they 
shouldn't be supported.


You are absolutely right that staying strict will help discover the 
tools that are writing an incorrect format.


Since most heavily-used tools that didn't spawn Git processes use 
LibGit2 to interact with Git repos, I added Ed Thomson to CC to see if 
libgit2 could ever write these bad header comments.


Thanks for writing this RFC so we can have the discussion and more 
quickly identify this issue if/when users are broken.


Thanks,
-Stolee


Re: [PATCH 4/3] sha1_name: use bsearch_pack() in unique_in_pack()

2018-03-25 Thread Derrick Stolee

On 3/24/2018 12:41 PM, René Scharfe wrote:

Replace the custom binary search in unique_in_pack() with a call to
bsearch_pack().  This reduces code duplication and makes use of the
fan-out table of packs.

Signed-off-by: Rene Scharfe <l@web.de>
---
This is basically the same replacement as done by patch 3.  Speed is
less of a concern here -- at least I don't know a commonly used
command that needs to resolve lots of short hashes.


Thanks, René! Good teamwork on this patch series.

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>


Re: [PATCH] unpack-trees: release oid_array after use in check_updates()

2018-03-25 Thread Derrick Stolee

On 3/25/2018 12:31 PM, René Scharfe wrote:

Signed-off-by: Rene Scharfe 
---
That leak was introduced by c0c578b33c (unpack-trees: batch fetching of
missing blobs).

  unpack-trees.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index d5685891a5..e73745051e 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -379,30 +379,31 @@ static int check_updates(struct unpack_trees_options *o)
struct oid_array to_fetch = OID_ARRAY_INIT;
int fetch_if_missing_store = fetch_if_missing;
fetch_if_missing = 0;
for (i = 0; i < index->cache_nr; i++) {
struct cache_entry *ce = index->cache[i];
if ((ce->ce_flags & CE_UPDATE) &&
!S_ISGITLINK(ce->ce_mode)) {
if (!has_object_file(>oid))
oid_array_append(_fetch, >oid);
}
}
if (to_fetch.nr)
fetch_objects(repository_format_partial_clone,
  _fetch);
fetch_if_missing = fetch_if_missing_store;
+   oid_array_clear(_fetch);
}
for (i = 0; i < index->cache_nr; i++) {
struct cache_entry *ce = index->cache[i];
  
  		if (ce->ce_flags & CE_UPDATE) {

if (ce->ce_flags & CE_WT_REMOVE)
die("BUG: both update and delete flags are set on 
%s",
ce->name);
display_progress(progress, ++cnt);
ce->ce_flags &= ~CE_UPDATE;
if (o->update && !o->dry_run) {
errs |= checkout_entry(ce, , NULL);
}
}
}


Ack. Looks correct.

-Stolee


Re: [ANNOUNCE] Git v2.17.0-rc1

2018-03-25 Thread Derrick Stolee

On 3/25/2018 2:42 PM, Ævar Arnfjörð Bjarmason wrote:

On Sun, Mar 25 2018, Derrick Stolee wrote:


On 3/23/2018 1:59 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Mar 21 2018, Junio C. Hamano wrote:


A release candidate Git v2.17.0-rc1 is now available for testing
at the usual places.  It is comprised of 493 non-merge commits
since v2.16.0, contributed by 62 people, 19 of which are new faces.

I have this deployed on some tens of K machines who all use git in one
way or another (from automated pulls, to users interactively), and rc0
before that, with a few patches on top from me + Takato + Duy + Derrick
since rc0 was released (and since today based on top of rc1). No issues
so far.

The specific in-house version I have is at:
https://github.com/git/git/compare/v2.17.0-rc1...bookingcom:booking-git-v2018-03-23-1

Thanks for testing the commit-graph feature, Ævar! I'm guessing you
have some mechanisms to ensure the 'git commit-graph write' command is
run on these machines and 'core.commitGraph' is set to true in the
config? I would love to hear how this benefits your org.

I haven't deployed any actual use of it at a wider scale, but I've done
some ad-hoc benchmarking with our internal version which has your
patches, and the results are very promising so far on the isolated test
cases where it helps (that you know about, e.g. rev-list --all).

So sorry, I don't have any meaningful testing of this, I just wanted an
easy way to ad-hoc test it & make sure it doesn't break other stuff for
now.

I also threw out most of the manual git maintenance stuff we had and
just rely on gc --auto now, so as soon as you have something to
integrate with that, along with those perf changes Peff suggested I'm
much more likely to play with it in some real way.


Thanks. Integration with 'gc --auto' is a high priority for me after the 
patch lands.


The version on GitHub [1] is slightly ahead of v6 as I wait to reroll on 
v2.17.0. It includes Peff's improvements to inspecting pack-indexes [2].


[1] https://github.com/derrickstolee/git/pull/2
[2] 
https://github.com/derrickstolee/git/pull/2/commits/cb86817ee5c5127b32c93a22ef130f0db6207970


Re: [ANNOUNCE] Git v2.17.0-rc1

2018-03-25 Thread Derrick Stolee

On 3/23/2018 1:59 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Mar 21 2018, Junio C. Hamano wrote:


A release candidate Git v2.17.0-rc1 is now available for testing
at the usual places.  It is comprised of 493 non-merge commits
since v2.16.0, contributed by 62 people, 19 of which are new faces.

I have this deployed on some tens of K machines who all use git in one
way or another (from automated pulls, to users interactively), and rc0
before that, with a few patches on top from me + Takato + Duy + Derrick
since rc0 was released (and since today based on top of rc1). No issues
so far.

The specific in-house version I have is at:
https://github.com/git/git/compare/v2.17.0-rc1...bookingcom:booking-git-v2018-03-23-1


Thanks for testing the commit-graph feature, Ævar! I'm guessing you have 
some mechanisms to ensure the 'git commit-graph write' command is run on 
these machines and 'core.commitGraph' is set to true in the config? I 
would love to hear how this benefits your org.


Thanks,
-Stolee


Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee

On 3/30/2018 7:10 AM, Jakub Narebski wrote:

I hope that I am addressing the most recent version of this series.


Hi Jakub. Thanks for the interest in this patch series.

The most-recent version is v6 [1], but I will re-roll to v7 soon (after 
v2.17.0 is marked).


[1] 
https://public-inbox.org/git/20180314192736.70602-1-dsto...@microsoft.com/T/#u



Derrick Stolee <sto...@gmail.com> writes:


As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
   incorrect topological orders, but only needs the commit message for
   the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
   boundary between the commits reachable from A and those reachable
   from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
   compared to their upstream remote branches. This is essentially as
   hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

That's nice.

What are the assumptions about the serialized commit graph format? Does
it need to be:
  - extensible without rewriting (e.g. append-only)?
  - like the above, but may need rewriting for optimal performance?
  - extending it needs to rewrite whole file?

Excuse me if it waas already asked and answered.


It is not extensible without rewriting. Reducing write time was not a 
main goal, since the graph will be written only occasionally during data 
management phases (like 'gc' or 'repack'; this integration is not 
implemented yet).





The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding to it FELINE
index (pair of numbers) from "Reachability Queries in Very Large Graphs:
A Fast Refined Olnine Search Approach" (2014) - available at
http://openproceedings.org/EDBT/2014/paper_166.pdf

The implementation would probably need adjustments to make it
unambiguous and unambiguously extensible; unless there is place for
indices that are local-only and need to be recalculated from scratch
when graph changes (to cover all graph).


The chunk-based format is intended to allow extra indexes like the one 
you recommend, without needing to increase the version number. Using an 
optional chunk allows older versions of Git to read the file without 
error, since the data is "extra", and newer versions can take advantage 
of the acceleration.


At one point, I was investigating these reachability indexes (I read 
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn, 
Ruan, Dey, and Xu [2]) but find the question that these indexes target 
to be lacking for most of the Git uses. That is, they ask the boolean 
question "Can X reach Y?". More often, Git needs to answer "What is the 
set of commits reachable from X but not from Y" or "Topologically sort 
commits reachable from X" or "How many commits are in each part of the 
symmetric difference between reachable from X or reachable from Y?"


The case for "Can X reach Y?" is mostly for commands like 'git branch 
--contains', when 'git fetch' checks for forced-updates of branches, or 
when the server decides enough negotiation has occurred during a 'git 
fetch'. While these may be worth investigating, they also benefit 
greatly from the accelerated graph walk introduced in the current format.


I would be happy to review any effort to extend the commit-graph format 
to include such indexes, as long as the performance benefits outweigh 
the complexity to create them.


[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf





Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv   

Re: [PATCH v4 01/13] commit-graph: add format document

2018-04-02 Thread Derrick Stolee

On 3/30/2018 9:25 AM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:


+== graph-*.graph files have the following format:

What is this '*' here?


No longer necessary. It used to be a placeholder for a hash value, but 
now the graph is stored in objects/info/commit-graph.




[...]

+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.

Does Git need to understand all chunks, or could there be optional
chunks that can be safely ignored (like in PNG format)?  Though this may
be overkill, and could be left for later revision of the format if
deemed necessary.


In v6, the format and design documents are edited to make clear the use 
of optional chunks, specifically for future extension without increasing 
the version number.





+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).

All right, it is small enough that can be required even for a very small
number of commits.


+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)

Do commits need to be put here in the ascending order of OIDs?


Yes.


If so, this would mean that it is not possible to add information about
new commits by only appending data and maybe overwriting some fields, I
think.  You would need to do full rewrite to insert new commit in
appropriate place.


That is the idea. This file is not updated with every new commit, but 
instead will be updated on some scheduled cleanup events. The 
commit-graph file is designed in a way to be non-critical, and not tied 
to the packfile layout. This allows flexibility for when to do the write.


For example, in GVFS, we will write a new commit-graph when there are 
new daily prefetch packs.


This could also integrate with 'gc' and 'repack' so whenever they are 
triggered the commit-graph is written as well.


Commits that do not exist in the commit-graph file will load from the 
object database as normal (after a failed lookup in the commit-graph file).



+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through this list of int-ids
+  for the parents until reaching a value with the most-significant bit on.
+  The other bits correspond to the int-id of the last parent.

All right, that is one chunk that cannot use fixed-length records; this
shouldn't matter much, as we iterate only up to the number of parents
less two.


Less one: the second "parent" column of the commit data chunk is used to 
point into this list, so (P-1) parents are in this chunk for a commit 
with P parents.



A question: what happens to the last list of parents?  Is there a
guardian value of 0x at last place?


The termination condition is in the position of the last parent, since 
the most-significant bit is on. The other 31 bits contain the int-id of 
the parent.


Thanks,
-Stolee



Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee

On 4/2/2018 10:46 AM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:

[...]

At one point, I was investigating these reachability indexes (I read
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
Ruan, Dey, and Xu [2]) but find the question that these indexes target
to be lacking for most of the Git uses. That is, they ask the boolean
question "Can X reach Y?". More often, Git needs to answer "What is
the set of commits reachable from X but not from Y" or "Topologically
sort commits reachable from X" or "How many commits are in each part
of the symmetric difference between reachable from X or reachable from
Y?"

In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
Meira and Zaki FELINE-index work, authors mention SCARAB as something
that can be used in addition to FELINE-index, as a complementary data
(FELINE-SCARAB in the work, section 4.4).

I see the FELINE-index as a stronger form of generation numbers (called
also level of the vertex / node), in that it allows to negative-cut even
more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario).

Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph.  That would help to answer
even more Git questions.


This two-dimensional generation number is helpful for non-reachability 
queries, but is something that needs the "full" commit graph in order to 
define the value for a single commit (hence the O(N lg N) performance 
mentioned below). Generation numbers are effective while being easy to 
compute and immutable.


I wonder if FELINE was compared directly to a one-dimensional index (I 
apologize that I have not read the paper in detail, so I don't 
understand the indexes they compare against). It also appears the graphs 
they use for their tests are not commit graphs, which have a different 
shape than many of the digraphs studies by that work.


This is all to say: I would love to see an interesting study in this 
direction, specifically comparing this series' definition of generation 
numbers to the 2-dimensional system in FELINE, and on a large sample of 
commit graphs available in open-source data sets (Linux kernel, Git, 
etc.) and possibly on interesting closed-source data sets.





The case for "Can X reach Y?" is mostly for commands like 'git branch
--contains', when 'git fetch' checks for forced-updates of branches,
or when the server decides enough negotiation has occurred during a
'git fetch'. While these may be worth investigating, they also benefit
greatly from the accelerated graph walk introduced in the current
format.

I would be happy to review any effort to extend the commit-graph
format to include such indexes, as long as the performance benefits
outweigh the complexity to create them.

[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf

The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.



This would be very easy to add as an optional chunk, since it can use 
one row per commit.


Thanks,
-Stolee


Re: [PATCH v4 00/13] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee

On 4/2/2018 1:35 PM, Stefan Beller wrote:

On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee <sto...@gmail.com> wrote:

I would be happy to review any effort to extend the commit-graph
format to include such indexes, as long as the performance benefits
outweigh the complexity to create them.

[2]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf

The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.


This would be very easy to add as an optional chunk, since it can use one
row per commit.

Given this discussion, I wonder if we want to include generation numbers
as a first class citizen in the current format. They could also go as
an optional
chunk and we may want to discuss further if we want generation numbers or
FELINE numbers or GRAIL or SCARAB, which are all graph related speedup
mechanism AFAICT.
In case we decide against generation numbers in the long run,
the row of mandatory generation numbers would be dead weight
that we'd need to carry.


Currently, the format includes 8 bytes to share between the generation 
number and commit date. Due to alignment concerns, we will want to keep 
this as 8 bytes or truncate it to 4-bytes. Either we would be wasting at 
least 3 bytes or truncating dates too much (presenting the 2038 problem 
[1] since dates are signed).



I only glanced at the paper, but it looks like a "more advanced 2d
generation number" that seems to be able to answer questions
that gen numbers can answer, but that paper also refers
to SCARAB as well as GRAIL as the state of the art, so maybe
there are even more papers to explore?


The biggest reason I can say to advance this series (and the small 
follow-up series that computes and consumes generation numbers) is that 
generation numbers are _extremely simple_. You only need to know your 
parents and their generation numbers to compute your own. These other 
reachability indexes require examining the entire graph to create "good" 
index values.


The hard part about using generation numbers (or any other reachability 
index) in Git is refactoring the revision-walk machinery to take 
advantage of them; current code requires O(reachable commits) to 
topo-order instead of O(commits that will be output). I think we should 
table any discussion of these advanced indexes until that work is done 
and a valuable comparison can be done. "Premature optimization is the 
root of all evil" and all that.


Thanks,
-Stolee

[1] https://en.wikipedia.org/wiki/Year_2038_problem


[PATCH v7 13/14] commit-graph: build graph from starting commits

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 14 +-
 builtin/commit-graph.c | 27 +--
 commit-graph.c | 27 +--
 commit-graph.h |  4 +++-
 t/t5318-commit-graph.sh| 13 +
 5 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 8143cc3f07..442ac243e6 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -36,7 +36,13 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
-walking objects only in the specified pack-indexes.
+walking objects only in the specified pack-indexes. (Cannot be combined
+with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
 
 'read'::
 
@@ -60,6 +66,12 @@ $ git commit-graph write
 $ echo  | git commit-graph write --stdin-packs
 
 
+* Write a graph file containing all reachable commits.
++
+
+$ git show-ref -s | git commit-graph write --stdin-commits
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 9f83c872e9..f5fc717b8f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
+   int stdin_commits;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
int packs_nr = 0;
+   const char **commit_hex = NULL;
+   int commits_nr = 0;
const char **lines = NULL;
int lines_nr = 0;
int lines_alloc = 0;
@@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv)
N_("The object directory to store the graph")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan pack-indexes listed by stdin for commits")),
+   OPT_BOOL(0, "stdin-commits", _commits,
+   N_("start walk at commits listed by stdin")),
OPT_END(),
};
 
@@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
+   if (opts.stdin_packs && opts.stdin_commits)
+   die(_("cannot use both --stdin-commits and --stdin-packs"));
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
lines_nr = 0;
lines_alloc = 128;
@@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv)
lines[lines_nr++] = strbuf_detach(, NULL);
}
 
-   pack_indexes = lines;
-   packs_nr = lines_nr;
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+  

[PATCH v7 11/14] commit: integrate commit graph with commit parsing

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 678,653
reachable commits and is behind 'origin/master' by 59,929 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv   |  1.02s |  0.14s | -86%  |
| rev-list --all   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects | 66.15s | 58.45s | -11%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c |   1 +
 commit-graph.c  | 141 +++-
 commit-graph.h  |  12 
 commit.c|   3 +
 commit.h|   3 +
 t/t5318-commit-graph.sh |  47 +-
 6 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadfacd..cf4f8b61e1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
struct commit *c = alloc_node(_state, sizeof(struct commit));
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
+   c->graph_pos = COMMIT_NOT_FROM_GRAPH;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index ea29c5c2d8..983454785e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,7 +38,6 @@
 #define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
GRAPH_OID_LEN + 8)
 
-
 char *get_commit_graph_filename(const char *obj_dir)
 {
return xstrfmt("%s/info/commit-graph", obj_dir);
@@ -179,6 +178,145 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
exit(1);
 }
 
+/* global storage */
+struct commit_graph *commit_graph = NULL;
+
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+   char *graph_name;
+
+   if (commit_graph)
+   return;
+
+   graph_name = get_commit_graph_filename(obj_dir);
+   commit_graph = load_commit_graph_one(graph_name);
+
+   FREE_AND_NULL(graph_name);
+}
+
+static int prepare_commit_graph_run_once = 0;
+static void prepare_commit_graph(void)
+{
+   struct alternate_object_database *alt;
+   char *obj_dir;
+
+   if (prepare_commit_graph_run_once)
+   return;
+   prepare_commit_graph_run_once = 1;
+
+   obj_dir = get_object_directory();
+   prepare_commit_graph_one(obj_dir);
+   prepare_alt_odb();
+   for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+   prepare_commit_graph_one(alt->path);
+}
+
+static void close_commit_graph(void)
+{
+   if (!commit_graph)
+   return;
+
+   if (commit_graph->graph_fd >= 0) {
+   munmap((void *)commit_graph->data, commit_graph->data_len);
+   commit_graph->data = NULL;
+   close(commit_graph->graph_fd);
+   }
+
+   FREE_AND_NULL(commit_graph);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, 
uint32_t *pos)
+{
+   return bsearch_hash(oid->hash, g->chunk_oid_fanout,
+   g->chunk_oid_lookup, g->hash_len, pos);
+}
+
+static struct commit_list **insert_parent_or_die(struct commit_graph *g,
+uint64_t pos,
+struct commit_list **pptr)
+{
+   struct commit *c;
+   struct object_id oid;
+   hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+   c = lookup_commit();
+   if (!c)
+   die("could not find commit %s", oid_to_hex());
+   c->graph_pos = pos;
+   return _list_insert(c, pptr)->next;
+}
+
+static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
+{
+   struct object_id oid;
+   uint32_t edge_value;
+   uint32_t *parent_data_ptr;
+   uint64_t date_low, date_high;
+   struct commit_list **pptr;
+   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+
+   item->object.parsed = 1;
+   item->graph_pos = pos;
+
+   hashcpy(oid.hash, commit_data);
+   item->tree = lookup_tree();
+
+   date_high = get_be32(commit_data + g->

[PATCH v7 01/14] csum-file: rename hashclose() to finalize_hashfile()

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

The hashclose() method behaves very differently depending on the flags
parameter. In particular, the file descriptor is not always closed.

Perform a simple rename of "hashclose()" to "finalize_hashfile()" in
preparation for functional changes.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c   | 2 +-
 builtin/pack-objects.c | 6 +++---
 bulk-checkin.c | 4 ++--
 csum-file.c| 2 +-
 csum-file.h| 4 ++--
 fast-import.c  | 2 +-
 pack-bitmap-write.c| 2 +-
 pack-write.c   | 4 ++--
 8 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index bda84a92ef..8bcf280e0b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1270,7 +1270,7 @@ static void conclude_pack(int fix_thin_pack, const char 
*curr_pack, unsigned cha
nr_objects - nr_objects_initial);
stop_progress_msg(, msg.buf);
strbuf_release();
-   hashclose(f, tail_hash, 0);
+   finalize_hashfile(f, tail_hash, 0);
hashcpy(read_hash, pack_hash);
fixup_pack_header_footer(output_fd, pack_hash,
 curr_pack, nr_objects,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e9d3cfb9e3..ab3e80ee49 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,11 +837,11 @@ static void write_pack_file(void)
 * If so, rewrite it like in fast-import
 */
if (pack_to_stdout) {
-   hashclose(f, oid.hash, CSUM_CLOSE);
+   finalize_hashfile(f, oid.hash, CSUM_CLOSE);
} else if (nr_written == nr_remaining) {
-   hashclose(f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(f, oid.hash, CSUM_FSYNC);
} else {
-   int fd = hashclose(f, oid.hash, 0);
+   int fd = finalize_hashfile(f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, pack_tmp_name,
 nr_written, oid.hash, offset);
close(fd);
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 9d87eac07b..227cc9f3b1 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -35,9 +35,9 @@ static void finish_bulk_checkin(struct bulk_checkin_state 
*state)
unlink(state->pack_tmp_name);
goto clear_exit;
} else if (state->nr_written == 1) {
-   hashclose(state->f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(state->f, oid.hash, CSUM_FSYNC);
} else {
-   int fd = hashclose(state->f, oid.hash, 0);
+   int fd = finalize_hashfile(state->f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name,
 state->nr_written, oid.hash,
 state->offset);
diff --git a/csum-file.c b/csum-file.c
index 5eda7fb6af..e6c95a6915 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -53,7 +53,7 @@ void hashflush(struct hashfile *f)
}
 }
 
-int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags)
+int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int 
flags)
 {
int fd;
 
diff --git a/csum-file.h b/csum-file.h
index 992e5c0141..9ba87f0a6c 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -26,14 +26,14 @@ struct hashfile_checkpoint {
 extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint 
*);
 extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
-/* hashclose flags */
+/* finalize_hashfile flags */
 #define CSUM_CLOSE 1
 #define CSUM_FSYNC 2
 
 extern struct hashfile *hashfd(int fd, const char *name);
 extern struct hashfile *hashfd_check(const char *name);
 extern struct hashfile *hashfd_throughput(int fd, const char *name, struct 
progress *tp);
-extern int hashclose(struct hashfile *, unsigned char *, unsigned int);
+extern int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int);
 extern void hashwrite(struct hashfile *, const void *, unsigned int);
 extern void hashflush(struct hashfile *f);
 extern void crc32_begin(struct hashfile *);
diff --git a/fast-import.c b/fast-import.c
index b5db5d20b1..6d96f55d9d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1016,7 +1016,7 @@ static void end_packfile(void)
struct tag *t;
 
close_pack_windows(pack_data);
-   hashclose(pack_file, cur_pack_oid.hash, 0);
+   finalize_hashfile(pack_file, cur_pack_oid.hash, 0);
fixup_pack_header_footer(pack_data->pack_fd, pack_data->sha1,
 

[PATCH v7 03/14] commit-graph: add format document

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .../technical/commit-graph-format.txt | 97 +++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 00..ad6af8105c
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,97 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+These positional references are stored as unsigned 32-bit integers
+corresponding to the array position withing the list of commit OIDs. We
+use the most-significant bit for special purposes, so we can store at most
+(1 << 31) - 1 (around 2 billion) commits.
+
+== Commit graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks
+and hash type.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+  The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+  Currently, the only valid version is 1.
+
+  1-byte Hash Version (1 = SHA-1)
+  We infer the hash length (H) from this value.
+
+  1-byte number (C) of "chunks"
+
+  1-byte (reserved for later use)
+ Current clients should ignore this value.
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe the chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide the byte-offset in current file for chunk to
+  start. (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.) Each chunk
+  ID appears at most once.
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the positions of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through

[PATCH v7 12/14] commit-graph: read only from specific pack-indexes

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 +-
 builtin/commit-graph.c | 33 +++---
 commit-graph.c | 26 +--
 commit-graph.h |  4 +++-
 packfile.c |  4 ++--
 packfile.h |  2 ++
 t/t5318-commit-graph.sh| 10 +
 7 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 8aad8303f5..8143cc3f07 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -34,7 +34,9 @@ COMMANDS
 'write'::
 
 Write a commit graph file based on the commits found in packfiles.
-Includes all commits from the existing commit graph file.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified pack-indexes.
 
 'read'::
 
@@ -51,6 +53,13 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Write a graph file, extending the current graph file using commits
+* in .
++
+
+$ echo  | git commit-graph write --stdin-packs
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index e3f67401fb..9f83c872e9 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
@@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int stdin_packs;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv)
 
 static int graph_write(int argc, const char **argv)
 {
+   const char **pack_indexes = NULL;
+   int packs_nr = 0;
+   const char **lines = NULL;
+   int lines_nr = 0;
+   int lines_alloc = 0;
+
static struct option builtin_commit_graph_write_options[] = {
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "stdin-packs", _packs,
+   N_("scan pack-indexes listed by stdin for commits")),
OPT_END(),
};
 
@@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv)
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   write_commit_graph(opts.obj_dir);
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   lines_nr = 0;
+   lines_alloc = 128;
+   ALLOC_ARRAY(lines, lines_alloc);
+
+   while (strbuf_getline(, stdin) != EOF) {
+   ALLOC_GROW(lines, lines_nr + 1, lines_alloc);
+   lines[lines_nr++] = strbuf_detach(, NULL);
+   }
+
+   pack_indexes = lines;
+   packs_nr = lines_nr;
+   }
+
+   write_commit_graph(opts.obj_dir,
+  pack_indexes,
+  packs_nr);
+
return 0;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 983454785e..fa19b83a8e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -549,7 +549,9 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
-void write_commit_graph(const char *obj_dir)
+void write_commit_graph(const char *obj_dir,
+   const char **pack_indexes,
+   int nr_packs)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -571,7 +573,27 @@ void write_commit_graph(const char *obj_dir)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
-   for_each_packed_object(add_packed_commits, , 0);
+   if (pack_indexes) {
+   struct strbuf packname = STRBUF_I

[PATCH v7 02/14] csum-file: refactor finalize_hashfile() method

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

If we want to use a hashfile on the temporary file for a lockfile, then
we need finalize_hashfile() to fully write the trailing hash but also keep
the file descriptor open.

Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional
change that checks this flag before writing the checksum to the stream.
This differs from previous behavior since it would be written if either
CSUM_CLOSE or CSUM_FSYNC is provided.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/pack-objects.c | 4 ++--
 bulk-checkin.c | 2 +-
 csum-file.c| 8 
 csum-file.h| 5 +++--
 pack-bitmap-write.c| 2 +-
 pack-write.c   | 5 +++--
 6 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ab3e80ee49..b09bbf4f4c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,9 +837,9 @@ static void write_pack_file(void)
 * If so, rewrite it like in fast-import
 */
if (pack_to_stdout) {
-   finalize_hashfile(f, oid.hash, CSUM_CLOSE);
+   finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | 
CSUM_CLOSE);
} else if (nr_written == nr_remaining) {
-   finalize_hashfile(f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | 
CSUM_FSYNC | CSUM_CLOSE);
} else {
int fd = finalize_hashfile(f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, pack_tmp_name,
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 227cc9f3b1..70b14fdf41 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -35,7 +35,7 @@ static void finish_bulk_checkin(struct bulk_checkin_state 
*state)
unlink(state->pack_tmp_name);
goto clear_exit;
} else if (state->nr_written == 1) {
-   finalize_hashfile(state->f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(state->f, oid.hash, CSUM_HASH_IN_STREAM | 
CSUM_FSYNC | CSUM_CLOSE);
} else {
int fd = finalize_hashfile(state->f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name,
diff --git a/csum-file.c b/csum-file.c
index e6c95a6915..53ce37f7ca 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -61,11 +61,11 @@ int finalize_hashfile(struct hashfile *f, unsigned char 
*result, unsigned int fl
the_hash_algo->final_fn(f->buffer, >ctx);
if (result)
hashcpy(result, f->buffer);
-   if (flags & (CSUM_CLOSE | CSUM_FSYNC)) {
-   /* write checksum and close fd */
+   if (flags & CSUM_HASH_IN_STREAM)
flush(f, f->buffer, the_hash_algo->rawsz);
-   if (flags & CSUM_FSYNC)
-   fsync_or_die(f->fd, f->name);
+   if (flags & CSUM_FSYNC)
+   fsync_or_die(f->fd, f->name);
+   if (flags & CSUM_CLOSE) {
if (close(f->fd))
die_errno("%s: sha1 file error on close", f->name);
fd = 0;
diff --git a/csum-file.h b/csum-file.h
index 9ba87f0a6c..c5a2e335e7 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -27,8 +27,9 @@ extern void hashfile_checkpoint(struct hashfile *, struct 
hashfile_checkpoint *)
 extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
 /* finalize_hashfile flags */
-#define CSUM_CLOSE 1
-#define CSUM_FSYNC 2
+#define CSUM_CLOSE 1
+#define CSUM_FSYNC 2
+#define CSUM_HASH_IN_STREAM4
 
 extern struct hashfile *hashfd(int fd, const char *name);
 extern struct hashfile *hashfd_check(const char *name);
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 662b44f97d..db4c832428 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -535,7 +535,7 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
if (options & BITMAP_OPT_HASH_CACHE)
write_hash_cache(f, index, index_nr);
 
-   finalize_hashfile(f, NULL, CSUM_FSYNC);
+   finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC | 
CSUM_CLOSE);
 
if (adjust_shared_perm(tmp_file.buf))
die_errno("unable to make temporary bitmap file readable");
diff --git a/pack-write.c b/pack-write.c
index 044f427392..a9d46bc03f 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -170,8 +170,9 @@ const char *write_idx_file(const char *index_name, struct 
pack_idx_entry **objec
}
 
hashwrite(f, sha1, the_hash_algo->rawsz);
-   finalize_hashfile(f, NULL, ((opts->flags & WRITE_IDX_VERIFY)
-   ? CSUM_CLOSE : CSUM_FSYNC));
+   finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
+   

[PATCH v7 14/14] commit-graph: implement "--additive" option

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to add all commits from the existing
commit-graph file to the file about to be written. This should be
used when adding new commits without performing garbage collection.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 10 ++
 builtin/commit-graph.c | 10 +++---
 commit-graph.c | 17 -
 commit-graph.h |  3 ++-
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 442ac243e6..4c97b555cc 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -43,6 +43,9 @@ With the `--stdin-commits` option, generate the new commit 
graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
 --stdin-packs.)
++
+With the `--append` option, include all commits that are present in the
+existing commit-graph file.
 
 'read'::
 
@@ -72,6 +75,13 @@ $ echo  | git commit-graph write --stdin-packs
 $ git show-ref -s | git commit-graph write --stdin-commits
 
 
+* Write a graph file containing all commits in the current
+* commit-graph file along with those reachable from HEAD.
++
+
+$ git rev-parse HEAD | git commit-graph write --stdin-commits --append
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f5fc717b8f..41c4f76caf 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = 
{
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -26,6 +26,7 @@ static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
int stdin_commits;
+   int append;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv)
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
N_("start walk at commits listed by stdin")),
+   OPT_BOOL(0, "append", ,
+   N_("include all commits already in the commit-graph 
file")),
OPT_END(),
};
 
@@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv)
   pack_indexes,
   packs_nr,
   commit_hex,
-  commits_nr);
+  commits_nr,
+  opts.append);
 
return 0;
 }
diff --git a/commit-graph.c b/commit-graph.c
index 253bc2213a..1fc63d541b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -553,7 +553,8 @@ void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
const char **commit_hex,
-   int nr_commits)
+   int nr_commits,
+   int append)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -571,10 +572,24 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
 
+   if (append) {
+   prepare_commit_graph_one(obj_dir);
+   if (commit_graph)
+   oids.alloc += commit_graph->num_commits;
+   }
+
if (oids.alloc < 1024)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
+   if (append && commit_graph) {
+   for (i = 0; i < commit_graph->num_commits; i++) {
+   const unsigned char *hash = 
commit_graph->chunk_oid_lookup +
+   commit_graph->hash_len * i;
+

[PATCH v7 09/14] commit-graph: add core.commitGraph setting

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

The commit graph feature is controlled by the new core.commitGraph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.commitGraph is that a user can always stop checking
for or parsing commit graph files if core.commitGraph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 4 
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 11 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 4e0cff87f6..e5c7013fb0 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,10 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitGraph::
+   Enable git commit graph feature. Allows reading from the
+   commit-graph file.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index a61b2d3f0d..8bdbcbbbf7 100644
--- a/cache.h
+++ b/cache.h
@@ -805,6 +805,7 @@ extern char *git_replace_ref_base;
 
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index b0c20e6cb8..25ee4a676c 100644
--- a/config.c
+++ b/config.c
@@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.commitgraph")) {
+   core_commit_graph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index d6dd64662c..8853e2f0dd 100644
--- a/environment.c
+++ b/environment.c
@@ -62,6 +62,7 @@ enum push_default_type push_default = 
PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
+int core_commit_graph;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.17.0.14.gba1221a8ce



[PATCH v7 00/14] Serialized Git Commit Graph

2018-04-02 Thread Derrick Stolee
This patch has only a few changes since v6:

* Fixed whitespace issues using 'git rebase --whitespace=fix'

* The --stdin-packs docs now refer to "pack-indexes" insead of "packs"

* Modified description of --object-dir option to warn use is rare

* Replaced '--additive' with '--append'

* In "commit-graph: close under reachability" I greatly simplified
  the check that every reachable commit is included. While running
  tests I noticed that the revision walk machinery could not keep up
  with a very large queue created when combined with the '--append'
  option that added all commits from the existing file as starting
  points for the walk. The new algorithm simply appends missing commits
  to the end of the list, which are then iterated to ensure their
  parents are in the list.

I have a few patch series prepared that provide further performance
improvments following this patch.

-- >8 --

This patch contains a way to serialize the commit graph.

The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 678,653
reachable commits and is behind 'origin/master' by 59,929 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv   |  1.02s |  0.14s | -86%  |
| rev-list --all   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects | 66.15s | 58.45s | -11%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitGraph' setting.

[1] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  csum-file: rename hashclose() to finalize_hashfile()
  csum-file: refactor finalize_hashfile() method
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement git-commit-graph write
  commit-graph: implement git commit-graph read
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits
  commit-graph: implement "--additive" option

 .gitignore|   1 +
 Documentation/config.txt  |   4 +
 Documentation/git-commit-graph.txt|  94 +++
 .../technical/commit-graph-format.txt |  97 +++
 Documentation/technical/commit-graph.txt  | 163 
 Makefile  |   2 +
 alloc.c   |   1 +
 builtin.h |   1 +
 builtin/commit-graph.c| 171 
 builtin/index-pack.c  |   2 +-
 builtin/pack-objects.c|   6 +-
 bulk-checkin.c  

[PATCH v7 06/14] commit-graph: implement write_commit_graph()

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given object
directory.

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |   1 +
 commit-graph.c | 359 +
 commit-graph.h |   6 +
 3 files changed, 366 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index a59b62bed1..26a23257e9 100644
--- a/Makefile
+++ b/Makefile
@@ -777,6 +777,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 00..f3f7c4f189
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,359 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "revision.h"
+#include "sha1-lookup.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_OCTOPUS_EDGES_NEEDED 0x8000
+#define GRAPH_PARENT_MISSING 0x7fff
+#define GRAPH_EDGE_LAST_MASK 0x7fff
+#define GRAPH_PARENT_NONE 0x7000
+
+#define GRAPH_LAST_EDGE 0x8000
+
+#define GRAPH_FANOUT_SIZE (4 * 256)
+#define GRAPH_CHUNKLOOKUP_WIDTH 12
+#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
+   GRAPH_OID_LEN + 8)
+
+
+static char *get_commit_graph_filename(const char *obj_dir)
+{
+   return xstrfmt("%s/info/commit-graph", obj_dir);
+}
+
+static void write_graph_chunk_fanout(struct hashfile *f,
+struct commit **commits,
+int nr_commits)
+{
+   int i, count = 0;
+   struct commit **list = commits;
+
+   /*
+* Write the first-level table (the list is sorted,
+* but we use a 256-entry lookup to be able to avoid
+* having to do eight extra binary search iterations).
+*/
+   for (i = 0; i < 256; i++) {
+   while (count < nr_commits) {
+   if ((*list)->object.oid.hash[0] != i)
+   break;
+   count++;
+   list++;
+   }
+
+   hashwrite_be32(f, count);
+   }
+}
+
+static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   int count;
+   for (count = 0; count < nr_commits; count++, list++)
+   hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static const unsigned char *commit_to_sha1(size_t index, void *table)
+{
+   struct commit **commits = table;
+   return commits[index]->object.oid.hash;
+}
+
+static void write_graph_chunk_data(struct hashfile *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+   uint32_t num_extra_edges = 0;
+
+   while (list < last) {
+   struct commit_list *parent;
+   int edge_value;
+   uint32_t packedDate[2];
+
+   parse_commit(*list);
+   hashwrite(f, (*list)->tree->object.oid.hash, hash_len);
+
+   parent = (*list)->parents;
+
+   if (!parent)
+   edge_value = GRAPH_PARENT_NONE;
+   else {
+   edge_value = sha1_pos(parent->item->object.oid.hash,
+ commits,
+ nr_commits,
+ commit_to_sha1);
+
+   if (edge_value < 0)
+   edge_value = GRAPH_PARENT_MISSING;
+   }
+
+   hashwrite_be32(f, edge_value);
+
+   

[PATCH v7 04/14] graph: add commit graph design document

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 163 +++
 1 file changed, 163 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
new file mode 100644
index 00..0550c6d0dc
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,163 @@
+Git Commit Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--
+
+- The commit graph file is stored in a file named 'commit-graph' in the
+  .git/objects/info directory. This could be stored in the info directory
+  of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+  so a future change of hash algorithm does not require a change in format.
+
+Future Work
+---
+
+- The commit graph feature currently does not honor commit grafts. This can
+  be remedied by duplicating or refactoring the current graft logic.
+
+- The 'commit-graph' subcommand does not have a "verify" mode that is
+  necessary for integration with fsck.
+
+- The file format includes ro

[PATCH v7 10/14] commit-graph: close under reachability

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index b1bd3a892d..ea29c5c2d8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -367,6 +367,50 @@ static int add_packed_commits(const struct object_id *oid,
return 0;
 }
 
+static void add_missing_parents(struct packed_oid_list *oids, struct commit 
*commit)
+{
+   struct commit_list *parent;
+   for (parent = commit->parents; parent; parent = parent->next) {
+   if (!(parent->item->object.flags & UNINTERESTING)) {
+   ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+   oidcpy(>list[oids->nr], 
&(parent->item->object.oid));
+   oids->nr++;
+   parent->item->object.flags |= UNINTERESTING;
+   }
+   }
+}
+
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct commit *commit;
+
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+   if (commit)
+   commit->object.flags |= UNINTERESTING;
+   }
+
+   /*
+* As this loop runs, oids->nr may grow, but not more
+* than the number of missing commits in the reachable
+* closure.
+*/
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+
+   if (commit && !parse_commit(commit))
+   add_missing_parents(oids, commit);
+   }
+
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+
+   if (commit)
+   commit->object.flags &= ~UNINTERESTING;
+   }
+}
+
 void write_commit_graph(const char *obj_dir)
 {
struct packed_oid_list oids;
@@ -390,6 +434,7 @@ void write_commit_graph(const char *obj_dir)
ALLOC_ARRAY(oids.list, oids.alloc);
 
for_each_packed_object(add_packed_commits, , 0);
+   close_reachable();
 
QSORT(oids.list, oids.nr, commit_compare);
 
-- 
2.17.0.14.gba1221a8ce



[PATCH v7 08/14] commit-graph: implement git commit-graph read

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to read commit graph files and summarize their contents.

Use the read subcommand to verify the contents of a commit graph file in the
tests.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  12 +++
 builtin/commit-graph.c |  56 
 commit-graph.c | 137 -
 commit-graph.h |  23 +
 t/t5318-commit-graph.sh|  32 +--
 5 files changed, 254 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 47996e8f89..8aad8303f5 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 
 [verse]
+'git commit-graph read' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
 
@@ -35,6 +36,11 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file.
 
+'read'::
+
+Read a graph file given by the commit-graph file and output basic
+details about the graph file. Used for debugging purposes.
+
 
 EXAMPLES
 
@@ -45,6 +51,12 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Read basic information from the commit-graph file.
++
+
+$ git commit-graph read
+
+
 
 GIT
 ---
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 26b6360289..e3f67401fb 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,10 +7,16 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
+static const char * const builtin_commit_graph_read_usage[] = {
+   N_("git commit-graph read [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_write_usage[] = {
N_("git commit-graph write [--object-dir ]"),
NULL
@@ -20,6 +26,54 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_read(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_read_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_read_options,
+builtin_commit_graph_read_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_name);
+   FREE_AND_NULL(graph_name);
+
+   printf("header: %08x %d %d %d %d\n",
+   ntohl(*(uint32_t*)graph->data),
+   *(unsigned char*)(graph->data + 4),
+   *(unsigned char*)(graph->data + 5),
+   *(unsigned char*)(graph->data + 6),
+   *(unsigned char*)(graph->data + 7));
+   printf("num_commits: %u\n", graph->num_commits);
+   printf("chunks:");
+
+   if (graph->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (graph->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (graph->chunk_commit_data)
+   printf(" commit_metadata");
+   if (graph->chunk_large_edges)
+   printf(" large_edges");
+   printf("\n");
+
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
static struct option builtin_commit_graph_write_options[] = {
@@ -60,6 +114,8 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "read"))
+   return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
return graph_write(argc, argv);
}
diff --git a/commit-graph.c b/commit-graph.c
index f3f7c4f189..b1bd3a892d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -39,11 +39,146 @@
GRAPH_OID_LEN + 8)
 
 
-static char *get_commit_graph_f

[PATCH v7 05/14] commit-graph: create git-commit-graph builtin

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for an '--object-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore |  1 +
 Documentation/git-commit-graph.txt | 10 +++
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/commit-graph.c | 36 ++
 command-list.txt   |  1 +
 contrib/completion/git-completion.bash |  2 ++
 git.c  |  1 +
 8 files changed, 53 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
new file mode 100644
index 00..f3b34622a8
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,10 @@
+git-commit-graph(1)
+===
+
+NAME
+
+git-commit-graph - Write and verify Git commit graph files
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index a1d8775adb..a59b62bed1 100644
--- a/Makefile
+++ b/Makefile
@@ -952,6 +952,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const 
char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 00..b466ecd781
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,36 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+   N_("git commit-graph [--object-dir ]"),
+   NULL
+};
+
+static struct opts_commit_graph {
+   const char *obj_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_commit_graph_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
+   git_config(git_default_config, NULL);
+   argc = parse_options(argc, argv, prefix,
+builtin_commit_graph_options,
+builtin_commit_graph_usage,
+PARSE_OPT_STOP_AT_NON_OPTION);
+
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+}
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean   mainporcelain
 git-clone   mainporcelain   init
 git-column  purehelpers
 git-commit  mainporcelain   history
+git-commit-graphplumbingmanipulators
 git-commit-tree plumbingmanipulators
 git-config  ancillarymanipulators
 git-count-objects   ancillaryinterrogators
diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index b09c8a2362..6726daaf69 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -878,6 +878,7 @@ __git_list_porcelain_commands ()
check-ref-format) : plumbing;

[PATCH v7 07/14] commit-graph: implement git-commit-graph write

2018-04-02 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to write graph files. Create new test script to verify
this command succeeds without failure.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  41 ++
 builtin/commit-graph.c |  33 
 t/t5318-commit-graph.sh| 124 +
 3 files changed, 198 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index f3b34622a8..47996e8f89 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,6 +5,47 @@ NAME
 
 git-commit-graph - Write and verify Git commit graph files
 
+
+SYNOPSIS
+
+[verse]
+'git commit-graph write'  [--object-dir ]
+
+
+DESCRIPTION
+---
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+---
+--object-dir::
+   Use given directory for the location of packfiles and commit graph
+   file. This parameter exists to specify the location of an alternate
+   that only has the objects directory, not a full .git directory. The
+   commit graph file is expected to be at /info/commit-graph and
+   the packfiles are expected to be in /pack.
+
+
+COMMANDS
+
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
+Includes all commits from the existing commit graph file.
+
+
+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph write
+
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index b466ecd781..26b6360289 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,9 +1,18 @@
 #include "builtin.h"
 #include "config.h"
+#include "dir.h"
+#include "lockfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ]"),
+   NULL
+};
+
+static const char * const builtin_commit_graph_write_usage[] = {
+   N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
@@ -11,6 +20,25 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_write(int argc, const char **argv)
+{
+   static struct option builtin_commit_graph_write_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_write_options,
+builtin_commit_graph_write_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   write_commit_graph(opts.obj_dir);
+   return 0;
+}
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
@@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 builtin_commit_graph_usage,
 PARSE_OPT_STOP_AT_NON_OPTION);
 
+   if (argc > 0) {
+   if (!strcmp(argv[0], "write"))
+   return graph_write(argc, argv);
+   }
+
usage_with_options(builtin_commit_graph_usage,
   builtin_commit_graph_options);
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 00..d7b635bd68
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,124 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' '
+   mkdir full &&
+   cd "$TRASH_DIRECTORY/full" &&
+   git init &&
+   objdir=".git/objects"
+'
+
+test_expect_success 'write graph with no packs' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph write --object-dir . &&
+   test_path_is_file info/commit-graph
+'
+
+test_expect_success 'create commits and repack' '
+   cd "$TRASH_DIRECTORY/full" &&
+   for i in $(test_seq 3)
+   do
+   test_commit $i &&
+   git branch commits/$i
+   done &&
+   git repack
+'
+
+test_expect_success 'write graph' '
+   cd "$TRASH_DIRECTORY/full" &&
+   graph1=$(git commit-graph write) &&
+   test_path_is_file $objdir/info/commit-graph
+'
+
+test_expect_success 'Add more commits' '
+ 

Re: [PATCH 0/3] Lazy-load trees when reading commit-graph

2018-04-03 Thread Derrick Stolee

On 4/3/2018 9:06 AM, Jeff King wrote:

On Tue, Apr 03, 2018 at 08:00:54AM -0400, Derrick Stolee wrote:


There are several commit-graph walks that require loading many commits
but never walk the trees reachable from those commits. However, the
current logic in parse_commit() requires the root tree to be loaded.
This only uses lookup_tree(), but when reading commits from the commit-
graph file, the hashcpy() to load the root tree hash and the time spent
checking the object cache take more time than parsing the rest of the
commit.

In this patch series, all direct references to accessing the 'tree'
member of struct commit are replaced instead by one of the following
methods:

struct tree *get_commit_tree(struct commit *)
struct object_id *get_commit_tree_oid(struct commit *)

This seems like a pretty sane thing to do. There may still be a few
parts of the code, notably fsck, that are reliant on a "struct object"
having been instantiated to determine whether an object is "used".  I
tried to clean that up around the time of c2d17b3b6e (fsck: check
HAS_OBJ more consistently, 2017-01-16), but I won't be surprised if
there's still some hidden assumptions.

I peeked at the fsck.c parts of patch 2, and it looks like we may be OK,
since you use get_commit_tree() during the walk.


This replacement was assisted by a Coccinelle script, but the 'tree'
member is overloaded in other types, so the script gave false-positives
that were removed from the diff.

That catches any existing in-tree callers, but not any topics in flight.
We could add the Coccinelle script and re-run it to catch any future
ones. But perhaps simpler still: can we rename the "tree" member to
"maybe_tree" or something, since nobody should be accessing it directly?
That will give us a compile error if an older topic is merged.


That's a good idea. I can reorg in v2 to rename 'tree' to 'maybe_tree' 
(and add an explicit comment that 'maybe_tree' could be NULL, so don't 
reference it directly). The check that the rename patch is correct is 
simply "does it compile?" Then Coccinelle could do the transfer of 
"c->maybe_tree" to "get_commit_tree(c)" safely, and can be added to the 
scripts.


I guess one caveat is to not include the mutators (c->maybe_tree = ...), 
but that's probably something Coccinelle can do.





On the Linux repository, performance tests were run for the following
command:

 git log --graph --oneline -1000

Before: 0.83s
After:  0.65s
Rel %: -21.6%

Neat.

Not strictly related, but I happened across another thing today that
might be worth investigating here. We allocate "struct commit" in these
nice big allocation blocks. But then for every commit we allocate at
least one "struct commit_list" to store the parent pointer.

I was looking at this from a memory consumption angle (I found a
pathological repository full of single-parent commits, and this wastes
an extra 16 bytes plus malloc overhead for every 64-byte "struct
commit").

But I wonder if it could also have non-negligible overhead in calling
malloc() for your case, since you've optimized out so much of the rest
of the work.


That is a pattern I'm seeing: we strip out one bit and something else 
shows up as a hot spot. This game of whack-a-mole is quite productive.



I'm not sure what the exact solution would be, but I imagine something
like variable-sized "struct commit"s with the parent pointers embedded,
with some kind of flag to indicate the number of parents (and probably
some fallback to break out to a linked list for extreme cases of more
than 2 parents).  It may end up pretty invasive, though, as there's a
lot of open-coded traversals of that parent list.

Anyway, not anything to do with this patch, but food for thought as you
micro-optimize these traversals.


One other thing that I've been thinking about is that 'struct commit' is 
so much bigger than the other structs in 'union any_object'. This means 
that the object cache, which I think creates blocks of 'union 
any_object' for memory-alignment reasons, is overly bloated. This would 
be especially true when walking many more trees than commits.


Perhaps there are other memory concerns in that case (such as cached 
buffers) that the 'union any_object' is not a concern, but it is worth 
thinking about as we brainstorm how to reduce the parent-list memory.


Thanks,
-Stolee


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Derrick Stolee

On 4/3/2018 12:51 PM, Derrick Stolee wrote:

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
   object.c: parse commit in graph first
   commit: add generation number to struct commmit
   commit-graph: compute generation numbers
   commit: sort by generation number in paint_down_to_common()
   commit.c: use generation number to stop merge-base walks
   commit-graph.txt: update design doc with generation numbers


This patch is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/5


[PATCH 1/6] object.c: parse commit in graph first

2018-04-03 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id 
*oid, enum object_type
} else if (type == OBJ_COMMIT) {
struct commit *commit = lookup_commit(oid);
if (commit) {
-   if (parse_commit_buffer(commit, buffer, size))
+   if (!parse_commit_in_graph(commit) &&
+   parse_commit_buffer(commit, buffer, size))
return NULL;
if (!get_cached_commit_buffer(commit, NULL)) {
set_commit_buffer(commit, buffer, size);
-- 
2.17.0.rc0



[PATCH 4/6] commit: use generations in paint_down_to_common()

2018-04-03 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_UNDEF.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 19 ++-
 commit.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 3e39c86abf..95ae7e13a3 100644
--- a/commit.c
+++ b/commit.c
@@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* newer commits with larger date first */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index bc7a3186c5..cb97b7636a 100644
--- a/commit.h
+++ b/commit.h
@@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.rc0



[PATCH 5/6] commit.c: use generation to halt paint walk

2018-04-03 Thread Derrick Stolee
In paint_down_to_common(), the walk is halted when the queue contains
only stale commits. The queue_has_nonstale() method iterates over the
entire queue looking for a nonstale commit. In a wide commit graph where
the two sides share many commits in common, but have deep sets of
different commits, this method may inspect many elements before finding
a nonstale commit. In the worst case, this can give quadratic
performance in paint_down_to_common().

Convert queue_has_nonstale() to use generation numbers for an O(1)
termination condition. To properly take advantage of this condition,
track the minimum generation number of a commit that enters the queue
with nonstale status.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 95ae7e13a3..858f4fdbc9 100644
--- a/commit.c
+++ b/commit.c
@@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, 
enum rev_sort_order so
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
-static int queue_has_nonstale(struct prio_queue *queue)
+static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 {
-   int i;
-   for (i = 0; i < queue->nr; i++) {
-   struct commit *commit = queue->array[i].data;
-   if (!(commit->object.flags & STALE))
-   return 1;
+   if (min_gen != GENERATION_NUMBER_UNDEF) {
+   if (queue->nr > 0) {
+   struct commit *commit = queue->array[0].data;
+   return commit->generation >= min_gen;
+   }
+   } else {
+   int i;
+   for (i = 0; i < queue->nr; i++) {
+   struct commit *commit = queue->array[i].data;
+   if (!(commit->object.flags & STALE))
+   return 1;
+   }
}
+
return 0;
 }
 
@@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_UNDEF;
+   uint32_t min_nonstale_gen = GENERATION_NUMBER_UNDEF;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return result;
}
prio_queue_put(, one);
+   if (one->generation < min_nonstale_gen)
+   min_nonstale_gen = one->generation;
 
for (i = 0; i < n; i++) {
twos[i]->object.flags |= PARENT2;
prio_queue_put(, twos[i]);
+   if (twos[i]->generation < min_nonstale_gen)
+   min_nonstale_gen = twos[i]->generation;
}
 
-   while (queue_has_nonstale()) {
+   while (queue_has_nonstale(, min_nonstale_gen)) {
struct commit *commit = prio_queue_get();
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip");
+
+   last_gen = commit->generation;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return NULL;
p->object.flags |= flags;
prio_queue_put(, p);
+
+   if (!(flags & STALE) &&
+   p->generation < min_nonstale_gen)
+   min_nonstale_gen = p->generation;
}
}
 
-- 
2.17.0.rc0



[PATCH 6/6] commit-graph.txt: update future work

2018-04-03 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..be68bee43d 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -98,17 +98,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-- paint_down_to_common()
 - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.rc0



[PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Derrick Stolee
This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
  object.c: parse commit in graph first
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: sort by generation number in paint_down_to_common()
  commit.c: use generation number to stop merge-base walks
  commit-graph.txt: update design doc with generation numbers

 Documentation/technical/commit-graph.txt |  7 +---
 alloc.c  |  1 +
 commit-graph.c   | 48 +
 commit.c | 53 
 commit.h |  7 +++-
 object.c |  4 +-
 6 files changed, 104 insertions(+), 16 deletions(-)

-- 
2.17.0.20.g9f30ba16e1



[PATCH 3/6] commit-graph: compute generation numbers

2018-04-03 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 46 ++
 commit.h   |  1 +
 2 files changed, 47 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..b80c8ad80e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
+   if ((*list)->generation > GENERATION_NUMBER_MAX)
+   die("generation number %u is too large to store 
in commit-graph",
+   (*list)->generation);
+   packedDate[0] |= htonl((*list)->generation << 2);
+   }
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_UNDEF &&
+   commits[i]->generation != GENERATION_NUMBER_NONE)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_UNDEF ||
+   parent->item->generation == 
GENERATION_NUMBER_NONE) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers(commits.list, commits.nr);
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
diff --git a/commit.h b/commit.h
index 3cadd386f3..bc7a3186c5 100644
--- a/commit.h
+++ b/commit.h
@@ -11,6 +11,7 @@
 
 #define COMMIT_NOT_FROM_GRAPH 0x
 #define GENERATION_NUMBER_UNDEF 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
 #define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
-- 
2.17.0.rc0



[PATCH 2/6] commit: add generation number to struct commmit

2018-04-03 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_UNDEF 0x
GENERATION_NUMBER_NONE 0

The first (_UNDEF) means the generation number has not been loaded or
computed. The second (_NONE) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..1a62e85ac3 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_UNDEF;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..3cadd386f3 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,8 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_UNDEF 0x
+#define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
struct commit *item;
@@ -24,6 +26,7 @@ struct commit {
struct commit_list *parents;
struct tree *tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.rc0



Re: [PATCH 3/3] commit-graph: lazy-load trees

2018-04-03 Thread Derrick Stolee

On 4/3/2018 2:00 PM, Stefan Beller wrote:

On Tue, Apr 3, 2018 at 5:00 AM, Derrick Stolee <dsto...@microsoft.com> wrote:

The commit-graph file provides quick access to commit data, including
the OID of the root tree for each commit in the graph. When performing
a deep commit-graph walk, we may not need to load most of the trees
for these commits.

Delay loading the tree object for a commit loaded from the graph
until requested via get_commit_tree(). Do not lazy-load trees for
commits not in the graph, since that requires duplicate parsing
and the relative peformance improvement when trees are not needed
is small.

On the Linux repository, performance tests were run for the following
command:

 git log --graph --oneline -1000

Before: 0.83s
After:  0.65s
Rel %: -21.6%

This is an awesome speedup.


Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked.

and as the walk prunes those commits that do not touch kernel/
which from my quick glance is the real core thing. Linus' announcements
claim that > 50% is drivers, networking and documentation[1].
So the "-- kernel/" walk needs to walk twice as many commits to find
a thousand commits that actually touch kernel/ ?

[1] http://lkml.iu.edu/hypermail/linux/kernel/1801.3/02794.html
http://lkml.iu.edu/hypermail/linux/kernel/1803.3/00580.html


There was no measureable performance
change as a result of this patch.

... which means that the walking itself is really fast now and the
dominating effects are setup and checking the tree?


Yeah. I was concerned that since we take two accesses into the 
commit-graph file that we could measurably slow down cases where we need 
to load the trees. That is not an issue since we will likely parse the 
tree after loading, and parsing is much slower than these commit-graph 
accesses.




Is git smart enough to not load the root tree for "log -- ./" or
would we get the desired performance numbers from that?


I wonder, since it only really needs the OID of the root tree to 
determine TREESAME. If it cares about following TREESAME relationships 
on ./, then it should do that.





@@ -317,6 +315,27 @@ int parse_commit_in_graph(struct commit *item)
 return 0;
  }

+static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
+{
+   struct object_id oid;
+   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 
16) * (c->graph_pos);

What is 16? (I imagine it is the "length of the row" - g->hash_len ?)
Would it make sense to have a constant/define for an entire row instead?
(By any chance what is the meaning of GRAPH_DATA_WIDTH, which is 36?
That is defined but never used.)


Yeah, I should use GRAPH_DATA_WIDTH here instead.




+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   if (c->tree)
+   return c->tree;

This double checking is defensive programming, in case someone
doesn't check themselves (as get_commit_tree does below).

ok.


@@ -17,6 +17,13 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);

+/*
+ * For performance reasons, a commit loaded from the graph does not
+ * have a tree loaded until trying to consume it for the first time.

That is the theme of this series/patch, but do we need to write it down
into the codebase? I'd be inclined to omit this part and only go with:

   Load the root tree of a commit and return the tree.


OK.




  struct tree *get_commit_tree(const struct commit *commit)
  {
-   return commit->tree;
+   if (commit->tree || !commit->object.parsed)

I understand to return the tree from the commit
when we have the tree in the commit object (the first
part).

But 'when we have not (yet) parsed the commit object',
we also just return its tree? Could you explain the
second part of the condition?
Is that for commits that are not part of the commit graph?
(But then why does it need to be negated?)


Some callers check the value of 'commit->tree' without a guarantee that 
the commit was parsed. In this case, the way to preserve the existing 
behavior is to continue returning NULL. If I remove the "|| 
!commit->object.parsed" then the BUG("commit has NULL tree, but was not 
loaded from commit-graph") is hit in these two tests:


t6012-rev-list-simplify.sh
t6110-rev-list-sparse.sh

I prefer to keep the BUG() statement and instead use this if statement. 
If someone has more clarity on why this is a good existing behavior, 
then please chime in.


Thanks,
-Stolee


Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-03 Thread Derrick Stolee

On 4/3/2018 2:03 PM, Brandon Williams wrote:

On 04/03, Derrick Stolee wrote:

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

Thanks for ensuring that this is defined and documented somewhere :)


This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

log --graph should benefit a lot more from this correct?  I know we've
talked a bit about negotiation and I wonder if these generation numbers
should be able to help out a little bit with that some day.


'log --graph' should be a HUGE speedup, when it is refactored. Since the 
topo-order can "stream" commits to the pager, it can be very responsive 
to return the graph in almost all conditions. (The case where generation 
numbers are not enough is when filters reduce the set of displayed 
commits to be very sparse, so many commits are walked anyway.)


If we have generic "can X reach Y?" queries, then we can also use 
generation numbers there to great effect (by not walking commits Z with 
gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" 
thread for ideas.


For negotiation, there are some things we can do here. VSTS uses 
generation numbers as a heuristic for determining "all wants connected 
to haves" which is a condition for halting negotiation. The idea is very 
simple, and I'd be happy to discuss it on a separate thread.


Thanks,
-Stolee


Re: [PATCH 2/6] commit: add generation number to struct commmit

2018-04-03 Thread Derrick Stolee

On 4/3/2018 2:28 PM, Jeff King wrote:

On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:


On 04/03, Derrick Stolee wrote:

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
   more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this

Is there any reason to believe this would be too small of a value in the
future?  Or is a 32 bit unsigned good enough?

The linux kernel took ~10 years to produce 500k commits. Even assuming
those were all linear (and they're not), that gives us ~80,000 years of
leeway. So even if the pace of development speeds up or we have a
quicker project, it still seems we have a pretty reasonable safety
margin.


That, and larger projects do not have linear histories. Despite having 
almost 2 million reachable commits, the Windows repository has maximum 
generation number ~100,000.


Thanks,
-Stolee


Re: [PATCH 1/6] object.c: parse commit in graph first

2018-04-03 Thread Derrick Stolee



On 4/3/2018 2:28 PM, Jeff King wrote:

On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:


On Tue,  3 Apr 2018 12:51:38 -0400
Derrick Stolee <dsto...@microsoft.com> wrote:


Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Modifying parse_object_buffer() is the most pragmatic way to accomplish
this, but this also means that parse_object_buffer() now potentially
reads from the local object store (instead of only relying on what's in
memory and what's in the provided buffer). parse_object_buffer() is
called by several callers including in builtin/fsck.c. I would feel more
comfortable if the relevant [1] caller to parse_object_buffer() was
modified instead of parse_object_buffer(), but I'll let others give
their opinions too.

It's not just you. This seems like a really odd place to put it.
Especially because if we have the buffer to pass to this function, then
we'd already have incurred the cost to inflate the object.



OK. Thanks. I'll try to find the better place to put this check.

-Stolee


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Derrick Stolee

On 3/16/2018 12:28 PM, Lars Schneider wrote:



On 14 Mar 2018, at 21:43, Junio C Hamano <gits...@pobox.com> wrote:

Derrick Stolee <sto...@gmail.com> writes:



Hopefully this version is ready to merge. I have several follow-up topics
in mind to submit soon after, including:

A few patches add trailing blank lines and other whitespace
breakages, which will stop my "git merge" later to 'next' and down,
as I have a pre-commit hook to catch them.

@stolee:

I run "git --no-pager diff --check $BASE_HASH...$HEAD_HASH" to detect
these kinds of things. I run this as part of my "prepare patch" [1] script
which is inspired by a similar script originally written by Dscho.

Do you think it would make sense to mention (or even
recommend) such a script in your awesome GfW CONTRIBUTING.md?


- Lars


[1] 
https://github.com/larsxschneider/git-list-helper/blob/master/prepare-patch.sh#L71



Thanks for the suggestions. Somehow I got extra whitespace doing 
copy/paste in vim and I never re-opened that file in my normal editor 
(VS Code with an extension that shows trailing whitespace).


On 3/15/2018 1:23 PM, Johannes Schindelin wrote:

git log --check`
was introduced to show you whitespace problems). If all of those
whitespace issues are unintentional, you can fix them using `git rebase
--whitespace=fix` in the most efficient way.


Thanks for both of the suggestions. The `rebase` check was already in 
the document, so I put the checks immediately above that line. PR is 
available now [1].


Thanks,
-Stolee

[1] https://github.com/git-for-windows/git/pull/1567


Re: [PATCH v6 12/14] commit-graph: read only from specific pack-indexes

2018-03-19 Thread Derrick Stolee

On 3/15/2018 6:50 PM, SZEDER Gábor wrote:

On Wed, Mar 14, 2018 at 8:27 PM, Derrick Stolee <sto...@gmail.com> wrote:

From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively.

This commit message, and indeed the code itself talk about pack
indexes ...


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/git-commit-graph.txt | 11 ++-
  builtin/commit-graph.c | 33 ++---
  commit-graph.c | 26 --
  commit-graph.h |  4 +++-
  packfile.c |  4 ++--
  packfile.h |  2 ++
  t/t5318-commit-graph.sh| 10 ++
  7 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 51cb038f3d..b945510f0f 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -32,7 +32,9 @@ COMMANDS
  'write'::

  Write a commit graph file based on the commits found in packfiles.
-Includes all commits from the existing commit graph file.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified packfiles.

... but this piece of documentation ...


+   OPT_BOOL(0, "stdin-packs", _packs,
+   N_("scan packfiles listed by stdin for commits")),

... and this help text, and even the name of the option talk about
packfiles.


Thanks! I'll fix that.

-Stolee


Re: [PATCH v6 07/14] commit-graph: implement 'git-commit-graph write'

2018-03-19 Thread Derrick Stolee

On 3/18/2018 9:25 AM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Mar 14 2018, Derrick Stolee jotted:


+'git commit-graph write'  [--object-dir ]
+
+
+DESCRIPTION
+---
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+---
+--object-dir::
+   Use given directory for the location of packfiles and commit graph
+   file. The commit graph file is expected to be at /info/commit-graph
+   and the packfiles are expected to be in /pack.

Maybe this was covered in a previous round, this series is a little hard
to follow since each version isn't In-Reply-To the version before it,
but why is this option needed, i.e. why would you do:

 git commit-graph write --object-dir=/some/path/.git/objects

As opposed to just pigging-backing on what we already have with both of:

 git --git-dir=/some/path/.git commit-graph write
 git -C /some/path commit-graph write

Is there some use-case where you have *just* the objects dir and not the
rest of the .git folder?


Yes, such as an alternate. If I remember correctly, alternates only need 
the objects directory.


In the GVFS case, we place prefetch packfiles in an alternate so there 
is only one copy of the "remote objects" per drive. The commit graph 
will be stored in that alternate.


Thanks,
-Stolee


[PATCH] sha1_name: use bsearch_hash() for abbreviations

2018-03-20 Thread Derrick Stolee
This patch updates the abbreviation code to use bsearch_hash() as defined
in [1]. It gets a nice speedup since the old implementation did not use
the fanout table at all.

One caveat about the patch: there is a place where I cast a sha1 hash
into a struct object_id pointer. This is because the abbreviation code
still uses 'const unsigned char *' instead of structs. I wanted to avoid
a hashcpy() in these calls, but perhaps that is not too heavy a cost.

I look forward to feedback on this.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/007f3a4c84cb1c07255404ad1ea9f797129c5cf0.1517609773.git.jonathanta...@google.com/
[PATCH 2/2] packfile: refactor hash search with fanout table

-- >8 --

When computing abbreviation lengths for an object ID against a single
packfile, the method find_abbrev_len_for_pack() currently implements
binary search. This is one of several implementations. One issue with
this implementation is that it ignores the fanout table in the pack-
index.

Translate this binary search to use the existing bsearch_hash() method
that correctly uses a fanout table. To keep the details about pack-
index version 1 or 2 out of sha1_name.c, create a bsearch_pack() method
in packfile.c.

Due to the use of the fanout table, the abbreviation computation is
slightly faster than before. For a fully-repacked copy of the Linux
repo, the following 'git log' commands improved:

* git log --oneline --parents --raw
  Before: 66.83s
  After:  64.85s
  Rel %:  -2.9%

* git log --oneline --parents
  Before: 7.85s
  After:  7.29s
  Rel %: -7.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 packfile.c  | 23 +++
 packfile.h  |  8 
 sha1_name.c | 24 
 3 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/packfile.c b/packfile.c
index 7c1a2519fc..ea0388f6dd 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1654,6 +1654,29 @@ void *unpack_entry(struct packed_git *p, off_t 
obj_offset,
return data;
 }
 
+int bsearch_pack(const struct object_id *oid, const struct packed_git *p, 
uint32_t *result)
+{
+   const unsigned char *index_fanout = p->index_data;
+   const unsigned char *index_lookup;
+   int index_lookup_width;
+
+   if (!index_fanout)
+   BUG("bsearch_pack called without a valid pack-index");
+
+   index_lookup = index_fanout + 4 * 256;
+   if (p->index_version == 1) {
+   index_lookup_width = 24;
+   index_lookup += 4;
+   } else {
+   index_lookup_width = 20;
+   index_fanout += 8;
+   index_lookup += 8;
+   }
+
+   return bsearch_hash(oid->hash, (const uint32_t*)index_fanout,
+   index_lookup, index_lookup_width, result);
+}
+
 const unsigned char *nth_packed_object_sha1(struct packed_git *p,
uint32_t n)
 {
diff --git a/packfile.h b/packfile.h
index a7fca598d6..ec08cb2bb0 100644
--- a/packfile.h
+++ b/packfile.h
@@ -78,6 +78,14 @@ extern struct packed_git *add_packed_git(const char *path, 
size_t path_len, int
  */
 extern void check_pack_index_ptr(const struct packed_git *p, const void *ptr);
 
+/*
+ * Perform binary search on a pack-index for a given oid. Packfile is expected 
to
+ * have a valid pack-index.
+ *
+ * See 'bsearch_hash' for more information.
+ */
+int bsearch_pack(const struct object_id *oid, const struct packed_git *p, 
uint32_t *result);
+
 /*
  * Return the SHA-1 of the nth object within the specified packfile.
  * Open the index if it is not already open.  The return value points
diff --git a/sha1_name.c b/sha1_name.c
index 735c1c0b8e..be3627b915 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -512,32 +512,16 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 struct min_abbrev_data *mad)
 {
int match = 0;
-   uint32_t num, last, first = 0;
+   uint32_t num, first = 0;
struct object_id oid;
+   const struct object_id *mad_oid;
 
if (open_pack_index(p) || !p->num_objects)
return;
 
num = p->num_objects;
-   last = num;
-   while (first < last) {
-   uint32_t mid = first + (last - first) / 2;
-   const unsigned char *current;
-   int cmp;
-
-   current = nth_packed_object_sha1(p, mid);
-   cmp = hashcmp(mad->hash, current);
-   if (!cmp) {
-   match = 1;
-   first = mid;
-   break;
-   }
-   if (cmp > 0) {
-   first = mid + 1;
-   continue;
-   }
-   last = mid;
-   }
+   mad_oid = (const struct object_id *)mad->hash;
+   match = bsearch_pack(mad_oid, p, );
 
/*
 * first is now the position in the packfile where we would insert
-- 
2.17.0.rc0



[PATCH v2 2/3] packfile: define and use bsearch_pack()

2018-03-22 Thread Derrick Stolee
The method bsearch_hash() generalizes binary searches using a
fanout table. The only consumer is currently find_pack_entry_one().
It requires a bit of pointer arithmetic to align the fanout table
and the lookup table depending on the pack-index version.

Extract the pack-index pointer arithmetic to a new method,
bsearch_pack(), so this can be re-used in other code paths.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 packfile.c | 42 ++
 packfile.h |  8 
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/packfile.c b/packfile.c
index f26395ecab..69d3afda6c 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1654,6 +1654,29 @@ void *unpack_entry(struct packed_git *p, off_t 
obj_offset,
return data;
 }
 
+int bsearch_pack(const struct object_id *oid, const struct packed_git *p, 
uint32_t *result)
+{
+   const unsigned char *index_fanout = p->index_data;
+   const unsigned char *index_lookup;
+   int index_lookup_width;
+
+   if (!index_fanout)
+   BUG("bsearch_pack called without a valid pack-index");
+
+   index_lookup = index_fanout + 4 * 256;
+   if (p->index_version == 1) {
+   index_lookup_width = 24;
+   index_lookup += 4;
+   } else {
+   index_lookup_width = 20;
+   index_fanout += 8;
+   index_lookup += 8;
+   }
+
+   return bsearch_hash(oid->hash, (const uint32_t*)index_fanout,
+   index_lookup, index_lookup_width, result);
+}
+
 const unsigned char *nth_packed_object_sha1(struct packed_git *p,
uint32_t n)
 {
@@ -1720,30 +1743,17 @@ off_t nth_packed_object_offset(const struct packed_git 
*p, uint32_t n)
 off_t find_pack_entry_one(const unsigned char *sha1,
  struct packed_git *p)
 {
-   const uint32_t *level1_ofs = p->index_data;
const unsigned char *index = p->index_data;
-   unsigned stride;
+   struct object_id oid;
uint32_t result;
 
if (!index) {
if (open_pack_index(p))
return 0;
-   level1_ofs = p->index_data;
-   index = p->index_data;
-   }
-   if (p->index_version > 1) {
-   level1_ofs += 2;
-   index += 8;
-   }
-   index += 4 * 256;
-   if (p->index_version > 1) {
-   stride = 20;
-   } else {
-   stride = 24;
-   index += 4;
}
 
-   if (bsearch_hash(sha1, level1_ofs, index, stride, ))
+   hashcpy(oid.hash, sha1);
+   if (bsearch_pack(, p, ))
return nth_packed_object_offset(p, result);
return 0;
 }
diff --git a/packfile.h b/packfile.h
index a7fca598d6..ec08cb2bb0 100644
--- a/packfile.h
+++ b/packfile.h
@@ -78,6 +78,14 @@ extern struct packed_git *add_packed_git(const char *path, 
size_t path_len, int
  */
 extern void check_pack_index_ptr(const struct packed_git *p, const void *ptr);
 
+/*
+ * Perform binary search on a pack-index for a given oid. Packfile is expected 
to
+ * have a valid pack-index.
+ *
+ * See 'bsearch_hash' for more information.
+ */
+int bsearch_pack(const struct object_id *oid, const struct packed_git *p, 
uint32_t *result);
+
 /*
  * Return the SHA-1 of the nth object within the specified packfile.
  * Open the index if it is not already open.  The return value points
-- 
2.17.0.rc0



[PATCH v2 1/3] sha1_name: convert struct min_abbrev_data to object_id

2018-03-22 Thread Derrick Stolee
From: "brian m. carlson" 

This structure is only written to in one place, where we already have a
struct object_id.  Convert the struct to use a struct object_id instead.

Signed-off-by: brian m. carlson 
---
 sha1_name.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 39e911c8ba..16e0003396 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,7 +480,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
-   const unsigned char *hash;
+   const struct object_id *oid;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -526,7 +526,7 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
int cmp;
 
current = nth_packed_object_sha1(p, mid);
-   cmp = hashcmp(mad->hash, current);
+   cmp = hashcmp(mad->oid->hash, current);
if (!cmp) {
match = 1;
first = mid;
@@ -603,7 +603,7 @@ int find_unique_abbrev_r(char *hex, const struct object_id 
*oid, int len)
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;
-   mad.hash = oid->hash;
+   mad.oid = oid;
 
find_abbrev_len_packed();
 

base-commit: 1a750441a7360b29fff7a414649ece1d35acaca6
-- 
2.17.0.rc0



[PATCH v2 3/3] sha1_name: use bsearch_pack() for abbreviations

2018-03-22 Thread Derrick Stolee
When computing abbreviation lengths for an object ID against a single
packfile, the method find_abbrev_len_for_pack() currently implements
binary search. This is one of several implementations. One issue with
this implementation is that it ignores the fanout table in the pack-
index.

Translate this binary search to use the existing bsearch_pack() method
that correctly uses a fanout table.

Due to the use of the fanout table, the abbreviation computation is
slightly faster than before. For a fully-repacked copy of the Linux
repo, the following 'git log' commands improved:

* git log --oneline --parents --raw
  Before: 59.2s
  After:  56.9s
  Rel %:  -3.8%

* git log --oneline --parents
  Before: 6.48s
  After:  5.91s
  Rel %: -8.9%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 24 
 1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 16e0003396..24894b3dbe 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -512,32 +512,16 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 struct min_abbrev_data *mad)
 {
int match = 0;
-   uint32_t num, last, first = 0;
+   uint32_t num, first = 0;
struct object_id oid;
+   const struct object_id *mad_oid;
 
if (open_pack_index(p) || !p->num_objects)
return;
 
num = p->num_objects;
-   last = num;
-   while (first < last) {
-   uint32_t mid = first + (last - first) / 2;
-   const unsigned char *current;
-   int cmp;
-
-   current = nth_packed_object_sha1(p, mid);
-   cmp = hashcmp(mad->oid->hash, current);
-   if (!cmp) {
-   match = 1;
-   first = mid;
-   break;
-   }
-   if (cmp > 0) {
-   first = mid + 1;
-   continue;
-   }
-   last = mid;
-   }
+   mad_oid = mad->oid;
+   match = bsearch_pack(mad_oid, p, );
 
/*
 * first is now the position in the packfile where we would insert
-- 
2.17.0.rc0



[PATCH v2 0/3] Use bsearch_hash() for abbreviations

2018-03-22 Thread Derrick Stolee
Thanks to Jonathan and Brian for the help with the proper way to handle
OIDs and existing callers to bsearch_hash(). This patch includes one
commit that Brian sent in the previous discussion (included again here
for completeness).

Derrick Stolee (2):
  packfile: define and use bsearch_pack()
  sha1_name: use bsearch_pack() for abbreviations

brian m. carlson (1):
  sha1_name: use bsearch_hash() for abbreviations

 packfile.c  | 42 ++
 packfile.h  |  8 
 sha1_name.c | 28 ++--
 3 files changed, 40 insertions(+), 38 deletions(-)


base-commit: 1a750441a7360b29fff7a414649ece1d35acaca6
-- 
2.17.0.rc0



Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Derrick Stolee



On 3/16/2018 4:19 PM, Jeff King wrote:

On Fri, Mar 16, 2018 at 04:06:39PM -0400, Jeff King wrote:


Furthermore, in order to look at an object it has to be zlib inflated
first, and since commit objects tend to be much smaller than trees and
especially blobs, there are a lot less bytes to inflate:

   $ grep ^commit type-size |cut -d' ' -f2 |avg
   34395730 / 53754 = 639
   $ cat type-size |cut -d' ' -f2 |avg
   3866685744 / 244723 = 15800

So a simple revision walk inflates less than 1% of the bytes that the
"enumerate objects packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each one
to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or can walk the delta chain (without actually
looking at the data bytes!) until we hit the base.

Hmm, so that's a big part of the problem with this patch series. It
actually _does_ unpack every object with --stdin-packs to get the type,
which is just silly. With the patch below, my time for "commit-graph
write --stdin-packs" on linux.git goes from over 5 minutes (I got bored
and killed it) to 17 seconds.

diff --git a/commit-graph.c b/commit-graph.c
index 6348bab82b..cf1da2e8c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -491,11 +491,12 @@ static int add_packed_commits(const struct object_id *oid,
  {
struct packed_oid_list *list = (struct packed_oid_list*)data;
enum object_type type;
-   unsigned long size;
-   void *inner_data;
off_t offset = nth_packed_object_offset(pack, pos);
-   inner_data = unpack_entry(pack, offset, , );
-   FREE_AND_NULL(inner_data);
+   struct object_info oi = OBJECT_INFO_INIT;
+
+   oi.typep = 
+   if (packed_object_info(pack, offset, ) < 0)
+   die("unable to get type of object %s", oid_to_hex(oid));
  
  	if (type != OBJ_COMMIT)

return 0;

-Peff


Thanks for this! Fixing this performance problem is very important to 
me, as we will use the "--stdin-packs" mechanism in the GVFS scenario 
(we will walk all commits in the prefetch packs full of commits and 
trees instead of relying on refs). This speedup is very valuable!


Thanks,
-Stolee


Re: [PATCH v6 07/14] commit-graph: implement 'git-commit-graph write'

2018-03-19 Thread Derrick Stolee

On 3/19/2018 10:36 AM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Mar 19 2018, Derrick Stolee jotted:


On 3/18/2018 9:25 AM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Mar 14 2018, Derrick Stolee jotted:


+'git commit-graph write'  [--object-dir ]
+
+
+DESCRIPTION
+---
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+---
+--object-dir::
+   Use given directory for the location of packfiles and commit graph
+   file. The commit graph file is expected to be at /info/commit-graph
+   and the packfiles are expected to be in /pack.

Maybe this was covered in a previous round, this series is a little hard
to follow since each version isn't In-Reply-To the version before it,
but why is this option needed, i.e. why would you do:

  git commit-graph write --object-dir=/some/path/.git/objects

As opposed to just pigging-backing on what we already have with both of:

  git --git-dir=/some/path/.git commit-graph write
  git -C /some/path commit-graph write

Is there some use-case where you have *just* the objects dir and not the
rest of the .git folder?

Yes, such as an alternate. If I remember correctly, alternates only
need the objects directory.

In the GVFS case, we place prefetch packfiles in an alternate so there
is only one copy of the "remote objects" per drive. The commit graph
will be stored in that alternate.

Makes sense, but we should really document this as being such an unusual
option, i.e. instead say something like.

 Use given directory for the location of packfiles and commit graph
 file. Usually you'd use the `--git-dir` or `-C` arguments to `git`
 itself. This option is here to support obscure use-cases where we
 have a stand-alone object directory. The commit graph file is
 expected to be at /info/commit-graph and the packfiles are
 expected to be in /pack.


A slight change to your recommendation:


OPTIONS
---
--object-dir::
    Use given directory for the location of packfiles and commit graph
    file. This parameter exists to specify the location of an alternate
    that only has the objects directory, not a full .git directory. The
    commit graph file is expected to be at /info/commit-graph and
    the packfiles are expected to be in /pack.



Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-19 Thread Derrick Stolee

On 3/15/2018 4:36 AM, Ævar Arnfjörð Bjarmason wrote:

On Thu, Mar 15 2018, Junio C. Hamano jotted:


* nd/repack-keep-pack (2018-03-07) 6 commits
  - SQUASH???
  - pack-objects: display progress in get_object_details()
  - pack-objects: show some progress when counting kept objects
  - gc --auto: exclude base pack if not enough mem to "repack -ad"
  - repack: add --keep-pack option
  - t7700: have closing quote of a test at the beginning of line

  "git gc" in a large repository takes a lot of time as it considers
  to repack all objects into one pack by default.  The command has
  been taught to pretend as if the largest existing packfile is
  marked with ".keep" so that it is left untouched while objects in
  other packs and loose ones are repacked.

  Expecting a reroll.
  cf. 

Re: [PATCH] sha1_name: use bsearch_hash() for abbreviations

2018-03-21 Thread Derrick Stolee

On 3/20/2018 6:25 PM, Jonathan Tan wrote:

On Tue, 20 Mar 2018 16:03:25 -0400
Derrick Stolee <dsto...@microsoft.com> wrote:


This patch updates the abbreviation code to use bsearch_hash() as defined
in [1]. It gets a nice speedup since the old implementation did not use
the fanout table at all.

You can refer to the patch as:

   b4e00f7306a1 ("packfile: refactor hash search with fanout table",
   2018-02-15)

Also, might be worth noting that this patch builds on
jt/binsearch-with-fanout.


Thanks. That commit is in master and v2.17.0-rc0, so hopefully it isn't 
difficult to apply the patch.





One caveat about the patch: there is a place where I cast a sha1 hash
into a struct object_id pointer. This is because the abbreviation code
still uses 'const unsigned char *' instead of structs. I wanted to avoid
a hashcpy() in these calls, but perhaps that is not too heavy a cost.

I recall a discussion that there were alignment issues with doing this,
but I might have be remembering wrongly - in my limited knowledge of C
alignment, both "unsigned char *" and "struct object_id *" have the same
constraints, but I'm not sure.


Adding Brian M. Carlson in the CC line for advice on how to do this 
translation between old sha1's and new object_ids. If it isn't safe, 
then we could do a hashcpy() until the translation makes it unnecessary.


I should have compared the two methods before sending the patch, but 
running the "git log --oneline --parents" test with a hashcpy() versus a 
cast has no measurable difference in performance for Linux. Probably 
best to do the safest thing here if there is no cost to perf.





+   const unsigned char *index_fanout = p->index_data;

[snip]

+   return bsearch_hash(oid->hash, (const uint32_t*)index_fanout,
+   index_lookup, index_lookup_width, result);

This cast to "const uint32_t *" is safe, because p->index_data points to
a mmap-ed region (which has very good alignment, as far as I know). I
wonder if we should document alignment guarantees on p->index_data, and
if yes, what guarantees to declare.


The existing application in find_pack_entry_one() [1] does similar 
pack-index arithmetic before calling. A similar amount of arithmetic and 
alignment concerns appear in the commit-graph patch. Where is the right 
place to declare these alignment requirements?


In v2, I'll have find_pack_entry_one() call bsearch_pack() so this logic 
is not duplicated.


Thanks,
-Stolee

[1] 
https://github.com/git/git/blob/c6284da4ff4afbde8211efe5d03f3604b1c6b9d6/packfile.c#L1721-L1749

    find_pack_entry_one() in packfile.c


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Derrick Stolee

On 3/19/2018 8:55 AM, Derrick Stolee wrote:


Thanks for this! Fixing this performance problem is very important to 
me, as we will use the "--stdin-packs" mechanism in the GVFS scenario 
(we will walk all commits in the prefetch packs full of commits and 
trees instead of relying on refs). This speedup is very valuable!


Thanks,
-Stolee


Also, for those interested in this series, I plan to do a rebase onto 
2.17.0, when available, as my re-roll. I pushed my responses to the 
current feedback at the GitHub PR for the series [1].


If you are planning to provide more feedback to the series, then please 
let me know and I'll delay my re-roll so you have a chance to review.


Thanks,
-Stolee

[1] https://github.com/derrickstolee/git/pull/2


Re: How to debug a "git merge"?

2018-03-19 Thread Derrick Stolee

On 3/14/2018 4:53 PM, Lars Schneider wrote:

On 14 Mar 2018, at 18:02, Derrick Stolee <sto...@gmail.com> wrote:

On 3/14/2018 12:56 PM, Lars Schneider wrote:

Hi,

I am investigating a Git merge (a86dd40fe) in which an older version of
a file won over the newer version. I try to understand why this is the
case. I can reproduce the merge with the following commands:
$ git checkout -b test a02fa3303
$ GIT_MERGE_VERBOSITY=5 git merge --verbose c1b82995c

The merge actually generates a merge conflict but not for my
problematic file. The common ancestor of the two parents (merge base)
is b91161554.

The merge graph is not pretty (the committers don't have a clean
branching scheme) but I cannot spot a problem between the merge commit
and the common ancestor:
$ git log --graph --oneline a86dd40fe

Have you tried `git log --graph --oneline --simplify-merges -- path` to see 
what changes and merges involved the file? I find that view to be very helpful 
(while the default history simplification can hide things). In particular, if 
there was a change that was reverted in one side and not another, we could find 
out.

Thanks for this tip! Unfortunately, this only confirms my current view:

### First parent
$ git log --graph --oneline --simplify-merges a02fa3303 -- path/to/problem
* 4e47a10c7 <-- old version
* 01f01f61c

### Second parent
$ git log --graph --oneline --simplify-merges c1b82995c -- path/to/problem
* 590e52ed1 <-- new version
* 8e598828d
* ad4e9034b
* 4e47a10c7
* 01f01f61c

### Merge
$ git log --graph --oneline --simplify-merges a86dd40fe -- path/to/problem
*   a86dd40fe <-- old version ?!?! That's the problem!
|\
| * 590e52ed1 <-- new version
| * 8e598828d
| * ad4e9034b
|/
* 4e47a10c7 <-- old version
* 01f01f61c



You could also use the "A...B" to check your two commits for merging, and maybe add 
"--boundary".

$ git diff --boundary a02fa3303...c1b82995c -- path/to/problem

This looks like the correct diff. The "new version" is mark as +/add/green in 
the diff.

Does this make any sense to you?

Thank you,
Lars


I'm sorry for the delay on this, but in my experience in helping 
customers saying "why doesn't my commit/change appear in the history of 
a file?" is because someone resolved merge conflicts by just taking one 
side instead of doing a real merge (such as using -Xours). The only 
solution is to re-apply the original change again and talk to the user 
who incorrectly merged, to prevent future occurrences. These things 
rarely happen to teams who use pull requests that require review, since 
the diff would be problematic. There are still a lot of teams that push 
directly to master, though.


Thanks,
-Stolee


Re: Contributor Summit planning

2018-03-05 Thread Derrick Stolee

On 3/3/2018 5:39 AM, Jeff King wrote:

On Sat, Mar 03, 2018 at 05:30:10AM -0500, Jeff King wrote:


As in past years, I plan to run it like an unconference. Attendees are
expected to bring topics for group discussion. Short presentations are
also welcome. We'll put the topics on a whiteboard in the morning, and
pick whichever ones people are interested in.

Feel free to reply to this thread if you want to make plans or discuss
any proposed topics before the summit. Input or questions from
non-attendees is welcome here.

I'll plan to offer two topics:

  - a round-up of the current state and past year's activities of Git as
a member project of Software Freedom Conservancy

  - some updates on the state of the git-scm.com since my report last
year

As with last year, I'll try to send a written report to the list for
those who aren't at the summit in person.

-Peff


Thanks for putting this together, Peff.

I'll be ready to talk about the serialized commit graph [1], generation 
numbers, and other commit-walk optimizations.


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/1519698787-190494-1-git-send-email-dsto...@microsoft.com/T/#u


Re: [RFC] Contributing to Git (on Windows)

2018-03-05 Thread Derrick Stolee

I really appreciate the feedback on this document, Jonathan.

On 3/3/2018 1:27 PM, Jonathan Nieder wrote:

Hi Dscho,

Johannes Schindelin wrote:

Jonathan Nieder  writes:

Dereck Stolee wrote:


nit: s/Dereck/Derrick/ Is my outgoing email name misspelled, or do you 
have a misspelled contact info for me?



+Test Your Changes on Linux
+--
+
+It can be important to work directly on the [core Git
codebase](https://github.com/git/git), +such as a recent commit into
the `master` or `next` branch that has not been incorporated +into
Git for Windows. Also, it can help to run functional and performance
tests on your +code in Linux before submitting patches to the
Linux-focused mailing list.

I'm surprised at this advice.  Does it actually come up?

Yes.

I personally set up the automated builds on Windows, Linux and macOS for
our team, and as a rule we always open an internal Pull Request on our
topic branches as we develop them, and you would probably not believe the
number of issues we caught before sending the patches to this list. Issues
including

[nice list snipped]

Thanks for explaining.  I still am going to push back on the wording
here, and here is why:

  1. Approximately 1/2 of the differences you describe apply to Mac as
 well as Windows.  The advice certainly does not apply on Mac.

 You might object: Mac readers would not be reading this text!  But
 that is not how humans work: if project documentation (e.g. the
 CONTRIBUTING.md on GitHub!) says that the project is Linux-focused
 and if you don't test on Linux then you might as well not bother,
 then people are going to believe it.

  2. It is not unusual for Linux users to make portability mistakes that
 are quickly pointed out on list.  If anything, the advice to test on
 other platforms should apply to contributors on Linux even more.
 This happens especially often to new contributors, who sometimes use
 bashisms, etc that get quickly pointed out.

  3. I do not *want* Git to be a Linux-focused project; I want the code
 to perform well on all popular platforms and for people not to be
 afraid to make it so.

 If the docs say that all we care about is Linux, then people are
 likely to be too scared to do the necessary and valuable work of
 making it work well on Mac, Windows, etc.  The actual mix of
 contributors doesn't bear it out anyway: a large number of
 contributors are already on Mac or Windows.

Fortunately this is pretty straightforward to fix in the doc: it could
say something like "to the multi-platform focused mailing list", for
example.


I like the "multi-platform" wording, and I'll try to use it as often as 
possible. I'll keep the Linux instructions because it is free to set up 
a Linux environment and testing in Windows and Linux will catch most of 
the cross-platform issues.




[...]

To my chagrin, this idea of making most of the boring stuff (and I include
formatting in that category, but I am probably special in that respect) as
automatable, and as little of an issue for review as possible, leaving
most brain cycles to work on the correctness of the patches instead, did
not really receive a lot of traction on this mailing list.

Huh?  I'm confident that this is a pretty widely supported idea within
the Git project.

I get the feeling you must have misread something or misinterpreted a
different response.

[...]

No, this advice comes straight from my personal observation that the
reviewers on the Git mailing list are Linux-centric.

Hopefully the clarifications and suggestions higher in this message
help.  If they don't, then I'm nervous about our ability to understand
each other.

[...]

Now, how reasonable do I think it is to ask those contributors to purchase
an Apple machine to test their patches on macOS (you cannot just download
an .iso nor would it be legal to run a macOS VM on anything but Apple
hardware)? You probably guessed my answer: not at all.

Agreed, this is something that needs to be automated (and not via a
CONTRIBUTING.md file).  As a stopgap, having a section in the
contribution instructions about testing using Windows's Linux
subsystem is a valuable thing, and thanks for that; I never meant to
imply otherwise.

[...]

On Fri, 2 Mar 2018, Junio C Hamano wrote:

In fact, portability issues in a patch originally written for a platform
is rather quickly discovered if the original platform is more minor than
the others, so while it is good advice to test your ware before you make
it public, singling out the portability issues may not add much value.
The fewer number of obvious bugs remaining, the fewer number of
iterations it would take for a series to get in a good shape.

[...]

For you, Junio, however, the task *now* is to put yourself into the shoes
of a Computer Science student in their 2nd year who wants to become an
Open Source contributor and is a little afraid to talk directly to "the

[RFC] Contributing to Git (on Windows)

2018-03-01 Thread Derrick Stolee
We (Git devs at Microsoft) have had several people start contributing to 
Git over the past few years (I'm the most-recent addition). As we 
on-boarded to Git development on our Windows machines, we collected our 
setup steps on an internal wiki page.


Now, we'd like to make that document publicly available. These steps are 
focused on a Windows user, so we propose putting them in the 
git-for-windows/git repo under CONTRIBUTING.md. I have a pull request 
open for feedback [1]. I'll read comments on that PR or in this thread.


If you've ever done Git development on a Windows machine, I would love 
to hear your feedback on this document. Any other advice you have is 
greatly appreciated.


For anyone interested, there are also a discussion about submitting 
patches upstream. The document links to Documentation/CodingGuidelines 
and Documentation/SubmittingPatches, but tries to elaborate with 
additional advice.


Thanks,
-Stolee

[1] https://github.com/git-for-windows/git/pull/1529


[PATCH v5 13/13] commit-graph: implement "--additive" option

2018-02-26 Thread Derrick Stolee
Teach git-commit-graph to add all commits from the existing
commit-graph file to the file about to be written. This should be
used when adding new commits without performing garbage collection.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 10 ++
 builtin/commit-graph.c | 10 +++---
 commit-graph.c | 17 -
 commit-graph.h |  3 ++-
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 0710a68..ccf5e20 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -41,6 +41,9 @@ With the `--stdin-commits` option, generate the new commit 
graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
 --stdin-packs.)
++
+With the `--additive` option, include all commits that are present
+in the existing commit-graph file.
 
 'read'::
 
@@ -70,6 +73,13 @@ $ echo  | git commit-graph write --stdin-packs
 $ git show-ref -s | git commit-graph write --stdin-commits
 
 
+* Write a graph file containing all commits in the current
+* commit-graph file along with those reachable from HEAD.
++
+
+$ git rev-parse HEAD | git commit-graph write --stdin-commits --additive
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 1c7b7e7..d26a6d6 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--additive] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = 
{
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--additive] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -26,6 +26,7 @@ static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
int stdin_commits;
+   int additive;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv)
N_("scan packfiles listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
N_("start walk at commits listed by stdin")),
+   OPT_BOOL(0, "additive", ,
+   N_("include all commits already in the commit-graph 
file")),
OPT_END(),
};
 
@@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv)
   pack_indexes,
   packs_nr,
   commit_hex,
-  commits_nr);
+  commits_nr,
+  opts.additive);
 
return 0;
 }
diff --git a/commit-graph.c b/commit-graph.c
index dbb9801..c111717 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -533,7 +533,8 @@ void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
const char **commit_hex,
-   int nr_commits)
+   int nr_commits,
+   int additive)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -552,10 +553,24 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
 
+   if (additive) {
+   prepare_commit_graph_one(obj_dir);
+   if (commit_graph)
+   oids.alloc += commit_graph->num_commits;
+   }
+
if (oids.alloc < 1024)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
+   if (additive && commit_graph) {
+   for (i = 0; i < commit_graph->num_commits; i++) {
+   const unsigned char *hash = 
commit_graph->chunk_oid_lookup +
+   commit_graph->hash_len * i;
+   hashcpy(oids.list[oids.nr++].ha

[PATCH v5 12/13] commit-graph: build graph from starting commits

2018-02-26 Thread Derrick Stolee
Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 14 +-
 builtin/commit-graph.c | 27 +--
 commit-graph.c | 27 +--
 commit-graph.h |  4 +++-
 t/t5318-commit-graph.sh| 13 +
 5 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index b945510..0710a68 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -34,7 +34,13 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
-walking objects only in the specified packfiles.
+walking objects only in the specified packfiles. (Cannot be combined
+with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
 
 'read'::
 
@@ -58,6 +64,12 @@ $ git commit-graph write
 $ echo  | git commit-graph write --stdin-packs
 
 
+* Write a graph file containing all reachable commits.
++
+
+$ git show-ref -s | git commit-graph write --stdin-commits
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index eebca57..1c7b7e7 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
+   int stdin_commits;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
int packs_nr = 0;
+   const char **commit_hex = NULL;
+   int commits_nr = 0;
const char **lines = NULL;
int lines_nr = 0;
int lines_alloc = 0;
@@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv)
N_("The object directory to store the graph")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan packfiles listed by stdin for commits")),
+   OPT_BOOL(0, "stdin-commits", _commits,
+   N_("start walk at commits listed by stdin")),
OPT_END(),
};
 
@@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
+   if (opts.stdin_packs && opts.stdin_commits)
+   die(_("cannot use both --stdin-commits and --stdin-packs"));
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
lines_nr = 0;
lines_alloc = 128;
@@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv)
lines[lines_nr++] = strbuf_detach(, NULL);
}
 
-   pack_indexes = lines;
-   packs_nr = lines_nr;
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+   packs_nr = lines_nr;
+  

[PATCH v5 03/13] commit-graph: create git-commit-graph builtin

2018-02-26 Thread Derrick Stolee
Thanks for the help in getting all the details right in setting up a
builtin.

-- >8 --

Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for an '--object-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore |  1 +
 Documentation/git-commit-graph.txt | 11 ++
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/commit-graph.c | 37 ++
 command-list.txt   |  1 +
 contrib/completion/git-completion.bash |  2 ++
 git.c  |  1 +
 8 files changed, 55 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b..e82f901 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
new file mode 100644
index 000..5913340
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,11 @@
+git-commit-graph(1)
+===
+
+NAME
+
+git-commit-graph - Write and verify Git commit graph files
+
+GIT
+---
+Part of the linkgit:git[1] suite
+
diff --git a/Makefile b/Makefile
index c56fdc1..2e4956f 100644
--- a/Makefile
+++ b/Makefile
@@ -946,6 +946,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3..079855b 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const 
char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 000..8ff7336
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,37 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+   N_("git commit-graph [--object-dir ]"),
+   NULL
+};
+
+static struct opts_commit_graph {
+   const char *obj_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_commit_graph_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
+   git_config(git_default_config, NULL);
+   argc = parse_options(argc, argv, prefix,
+builtin_commit_graph_options,
+builtin_commit_graph_usage,
+PARSE_OPT_STOP_AT_NON_OPTION);
+
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28..835c589 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean   mainporcelain
 git-clone   mainporcelain   init
 git-column  purehelpers
 git-commit  mainporcelain   history
+git-commit-graphplumbingmanipulators
 git-commit-tree plumbingmanipulators
 git-config  ancillarymanipulators
 git-count-objects   ancillaryinterrogators
diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index 88813e9..b060b6f 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -841,6 +841,7 @@ __git_list_porcelain_commands ()
check-ref-format) : plumb

[PATCH v5 02/13] graph: add commit graph design document

2018-02-26 Thread Derrick Stolee
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 164 +++
 1 file changed, 164 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
new file mode 100644
index 000..d11753a
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,164 @@
+Git Commit Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--
+
+- The commit graph file is stored in a file named 'commit-graph' in the
+  .git/objects/info directory. This could be stored in the info directory
+  of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+  so a future change of hash algorithm does not require a change in format.
+
+Future Work
+---
+
+- The commit graph feature currently does not honor commit grafts. This can
+  be remedied by duplicating or refactoring the current graft logic.
+
+- The 'commit-graph' subcommand does not have a "verify" mode that is
+  necessary for integration with fsck.
+
+- The file format includes room for precomputed generation numbers. These

[PATCH v5 10/13] commit: integrate commit graph with commit parsing

2018-02-26 Thread Derrick Stolee
Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 664,185
reachable commits and is behind 'origin/master' by 60,191 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  6.56s |  0.66s | -89%  |
| branch -vv   |  1.35s |  0.32s | -76%  |
| rev-list --all   |  6.7s  |  0.83s | -87%  |
| rev-list --all --objects | 33.0s  | 27.5s  | -16%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c |   1 +
 commit-graph.c  | 141 +++-
 commit-graph.h  |  12 +
 commit.c|   3 ++
 commit.h|   3 ++
 t/t5318-commit-graph.sh |  47 +++-
 6 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadf..cf4f8b6 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
struct commit *c = alloc_node(_state, sizeof(struct commit));
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
+   c->graph_pos = COMMIT_NOT_FROM_GRAPH;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 01aa23d..184b8da 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,7 +38,6 @@
 #define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
GRAPH_OID_LEN + 8)
 
-
 char *get_commit_graph_filename(const char *obj_dir)
 {
return xstrfmt("%s/info/commit-graph", obj_dir);
@@ -182,6 +181,145 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
exit(1);
 }
 
+/* global storage */
+struct commit_graph *commit_graph = NULL;
+
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+   char *graph_name;
+
+   if (commit_graph)
+   return;
+
+   graph_name = get_commit_graph_filename(obj_dir);
+   commit_graph = load_commit_graph_one(graph_name);
+
+   FREE_AND_NULL(graph_name);
+}
+
+static int prepare_commit_graph_run_once = 0;
+static void prepare_commit_graph(void)
+{
+   struct alternate_object_database *alt;
+   char *obj_dir;
+
+   if (prepare_commit_graph_run_once)
+   return;
+   prepare_commit_graph_run_once = 1;
+
+   obj_dir = get_object_directory();
+   prepare_commit_graph_one(obj_dir);
+   prepare_alt_odb();
+   for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+   prepare_commit_graph_one(alt->path);
+}
+
+static void close_commit_graph(void)
+{
+   if (!commit_graph)
+   return;
+
+   if (commit_graph->graph_fd >= 0) {
+   munmap((void *)commit_graph->data, commit_graph->data_len);
+   commit_graph->data = NULL;
+   close(commit_graph->graph_fd);
+   }
+
+   FREE_AND_NULL(commit_graph);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, 
uint32_t *pos)
+{
+   return bsearch_hash(oid->hash, g->chunk_oid_fanout,
+   g->chunk_oid_lookup, g->hash_len, pos);
+}
+
+static struct commit_list **insert_parent_or_die(struct commit_graph *g,
+uint64_t pos,
+struct commit_list **pptr)
+{
+   struct commit *c;
+   struct object_id oid;
+   hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+   c = lookup_commit();
+   if (!c)
+   die("could not find commit %s", oid_to_hex());
+   c->graph_pos = pos;
+   return _list_insert(c, pptr)->next;
+}
+
+static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
+{
+   struct object_id oid;
+   uint32_t edge_value;
+   uint32_t *parent_data_ptr;
+   uint64_t date_low, date_high;
+   struct commit_list **pptr;
+   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+
+   item->object.parsed = 1;
+   item->graph_pos = pos;
+
+   hashcpy(oid.hash, commit_data);
+   item->tree = lookup_tree();
+
+   date_high = ntohl(*(uint32_t*)(commit_data + g->hash_len + 8)) & 0x3;
+   date_low = nto

[PATCH v5 06/13] commit-graph: implement 'git-commit-graph write'

2018-02-26 Thread Derrick Stolee
Teach git-commit-graph to write graph files. Create new test script to verify
this command succeeds without failure.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  39 
 builtin/commit-graph.c |  33 ++
 t/t5318-commit-graph.sh| 125 +
 3 files changed, 197 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 5913340..e688843 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,6 +5,45 @@ NAME
 
 git-commit-graph - Write and verify Git commit graph files
 
+
+SYNOPSIS
+
+[verse]
+'git commit-graph write'  [--object-dir ]
+
+
+DESCRIPTION
+---
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+---
+--object-dir::
+   Use given directory for the location of packfiles and commit graph
+   file. The commit graph file is expected to be at /info/commit-graph
+   and the packfiles are expected to be in /pack.
+
+
+COMMANDS
+
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
+Includes all commits from the existing commit graph file.
+
+
+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph write
+
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 8ff7336..a9d61f6 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,9 +1,18 @@
 #include "builtin.h"
 #include "config.h"
+#include "dir.h"
+#include "lockfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ]"),
+   NULL
+};
+
+static const char * const builtin_commit_graph_write_usage[] = {
+   N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
@@ -11,6 +20,25 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_write(int argc, const char **argv)
+{
+   static struct option builtin_commit_graph_write_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_write_options,
+builtin_commit_graph_write_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   write_commit_graph(opts.obj_dir);
+   return 0;
+}
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
@@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 builtin_commit_graph_usage,
 PARSE_OPT_STOP_AT_NON_OPTION);
 
+   if (argc > 0) {
+   if (!strcmp(argv[0], "write"))
+   return graph_write(argc, argv);
+   }
+
usage_with_options(builtin_commit_graph_usage,
   builtin_commit_graph_options);
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 000..7524d2a
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,125 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' '
+   mkdir full &&
+   cd "$TRASH_DIRECTORY/full" &&
+   git init &&
+   objdir=".git/objects"
+'
+
+test_expect_success 'write graph with no packs' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph write --object-dir . &&
+   test_path_is_file info/commit-graph
+'
+
+test_expect_success 'create commits and repack' '
+   cd "$TRASH_DIRECTORY/full" &&
+   for i in $(test_seq 3)
+   do
+   test_commit $i &&
+   git branch commits/$i
+   done &&
+   git repack
+'
+
+test_expect_success 'write graph' '
+   cd "$TRASH_DIRECTORY/full" &&
+   graph1=$(git commit-graph write) &&
+   test_path_is_file $objdir/info/commit-graph
+'
+
+test_expect_success 'Add more commits' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git reset --hard commits/1 &&
+   for i in $(test_seq 4 5)
+   do
+   test_commit $i &&
+   

[PATCH v5 00/13] Serialized Git Commit Graph

2018-02-26 Thread Derrick Stolee
This patch series is another big difference from version 4, but I do 
think we are converging on a stable design.

This series depends on a few things in flight:

* jt/binsearch-with-fanout for bsearch_graph()

* 'master' includes the sha1file -> hashfile rename in (98a3beab).

* [PATCH] commit: drop uses of get_cached_commit_buffer(). [1] I
  couldn't find a ds/* branch for this one, but it is necessary or
  else the commit graph test script should fail.

Here are some of the inter-patch changes:

* The single commit graph file is stored in the fixed filename
  .git/objects/info/commit-graph

* Because of this change, I struggled with the right way to pair the
  lockfile API with the hashfile API. Perhaps they were not meant to
  interact like this. I include a new patch step that adds a flag for
  hashclose() to keep the file descriptor open so commit_lock_file()
  can succeed. Please let me know if this is the wrong approach.

* A side-benefit of this change is that the "--set-latest" and
  "--delete-expired" arguments are no longer useful.

* I re-ran the performance tests since I rebased onto master. I had
  moved my "master" branch on my copy of Linux from another perf test,
  which changed the data shape a bit.

* There was some confusion between v3 and v4 about whether commits in
  an existing commit-graph file are automatically added to the new
  file during a write. I think I cleared up all of the documentation
  that referenced this to the new behavior: we only include commits
  reachable from the starting commits (depending on --stdin-commits,
  --stdin-packs, or neither) unless the new "--additive" argument
  is specified.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/1519240631-221761-1-git-send-email-dsto...@microsoft.com/

-- >8 --

This patch contains a way to serialize the commit graph.

The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 664,185 reachable commits and is behind 'origin/master'
by 60,191 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  6.56s |  0.66s | -89%  |
| branch -vv   |  1.35s |  0.32s | -76%  |
| rev-list --all   |  6.7s  |  0.83s | -87%  |
| rev-list --all --objects | 33.0s  | 27.5s  | -16%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitGraph' setting.

[1] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

Derrick Stolee (13):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  csum-file: add CSUM_KEEP_OPEN flag
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement git commit-graph read
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits

[PATCH v5 05/13] commit-graph: implement write_commit_graph()

2018-02-26 Thread Derrick Stolee
Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given object
directory.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |   1 +
 commit-graph.c | 360 +
 commit-graph.h |   7 ++
 3 files changed, 368 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index 2e4956f..bf91b2d 100644
--- a/Makefile
+++ b/Makefile
@@ -771,6 +771,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 000..2251397
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,360 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "revision.h"
+#include "sha1-lookup.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_OCTOPUS_EDGES_NEEDED 0x8000
+#define GRAPH_PARENT_MISSING 0x7fff
+#define GRAPH_EDGE_LAST_MASK 0x7fff
+#define GRAPH_PARENT_NONE 0x7000
+
+#define GRAPH_LAST_EDGE 0x8000
+
+#define GRAPH_FANOUT_SIZE (4 * 256)
+#define GRAPH_CHUNKLOOKUP_WIDTH 12
+#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
+   GRAPH_OID_LEN + 8)
+
+
+static char *get_commit_graph_filename(const char *obj_dir)
+{
+   return xstrfmt("%s/info/commit-graph", obj_dir);
+}
+
+static void write_graph_chunk_fanout(struct hashfile *f,
+struct commit **commits,
+int nr_commits)
+{
+   int i, count = 0;
+   struct commit **list = commits;
+
+   /*
+* Write the first-level table (the list is sorted,
+* but we use a 256-entry lookup to be able to avoid
+* having to do eight extra binary search iterations).
+*/
+   for (i = 0; i < 256; i++) {
+   while (count < nr_commits) {
+   if ((*list)->object.oid.hash[0] != i)
+   break;
+   count++;
+   list++;
+   }
+
+   hashwrite_be32(f, count);
+   }
+}
+
+static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   int count;
+   for (count = 0; count < nr_commits; count++, list++)
+   hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static const unsigned char *commit_to_sha1(size_t index, void *table)
+{
+   struct commit **commits = table;
+   return commits[index]->object.oid.hash;
+}
+
+static void write_graph_chunk_data(struct hashfile *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+   uint32_t num_extra_edges = 0;
+
+   while (list < last) {
+   struct commit_list *parent;
+   int edge_value;
+   uint32_t packedDate[2];
+
+   parse_commit(*list);
+   hashwrite(f, (*list)->tree->object.oid.hash, hash_len);
+
+   parent = (*list)->parents;
+
+   if (!parent)
+   edge_value = GRAPH_PARENT_NONE;
+   else {
+   edge_value = sha1_pos(parent->item->object.oid.hash,
+ commits,
+ nr_commits,
+ commit_to_sha1);
+
+   if (edge_value < 0)
+   edge_value = GRAPH_PARENT_MISSING;
+   }
+
+   hashwrite_be32(f, edge_value);
+
+   if (parent)
+   parent = parent->next;
+
+   if (!parent)
+ 

[PATCH v5 04/13] csum-file: add CSUM_KEEP_OPEN flag

2018-02-26 Thread Derrick Stolee
This patch is new to the series due to the interactions with the lockfile API
and the hashfile API. I need to ensure the hashfile writes the hash value at
the end of the file, but keep the file descriptor open so the lock is valid.

I welcome any susggestions to this patch or to the way I use it in the commit
that follows.

-- >8 --

If we want to use a hashfile on the temporary file for a lockfile, then
we need hashclose() to fully write the trailing hash but also keep the
file descriptor open.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 csum-file.c | 10 +++---
 csum-file.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 5eda7fb..302e6ae 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -66,9 +66,13 @@ int hashclose(struct hashfile *f, unsigned char *result, 
unsigned int flags)
flush(f, f->buffer, the_hash_algo->rawsz);
if (flags & CSUM_FSYNC)
fsync_or_die(f->fd, f->name);
-   if (close(f->fd))
-   die_errno("%s: sha1 file error on close", f->name);
-   fd = 0;
+   if (flags & CSUM_KEEP_OPEN)
+   fd = f->fd;
+   else {
+   if (close(f->fd))
+   die_errno("%s: sha1 file error on close", 
f->name);
+   fd = 0;
+   }
} else
fd = f->fd;
if (0 <= f->check_fd) {
diff --git a/csum-file.h b/csum-file.h
index 992e5c0..b7c0e48 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -29,6 +29,7 @@ extern int hashfile_truncate(struct hashfile *, struct 
hashfile_checkpoint *);
 /* hashclose flags */
 #define CSUM_CLOSE 1
 #define CSUM_FSYNC 2
+#define CSUM_KEEP_OPEN 4
 
 extern struct hashfile *hashfd(int fd, const char *name);
 extern struct hashfile *hashfd_check(const char *name);
-- 
2.7.4



[PATCH v5 01/13] commit-graph: add format document

2018-02-26 Thread Derrick Stolee
Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 98 +
 1 file changed, 98 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 000..4402baa
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,98 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+These positional references are stored as 32-bit integers corresponding to
+the array position withing the list of commit OIDs. We use the most-significant
+bit for special purposes, so we can store at most (1 << 31) - 1 (around 2
+billion) commits.
+
+== Commit graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks
+and hash type.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+  The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+  Currently, the only valid version is 1.
+
+  1-byte Hash Version (1 = SHA-1)
+  We infer the hash length (H) from this value.
+
+  1-byte number (C) of "chunks"
+
+  1-byte (reserved for later use)
+ Current clients should ignore this value.
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe the chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide the byte-offset in current file for chunk to
+  start. (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.) Each chunk
+  type appears at most once.
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the positions of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through this list of commit
+  positions for the parent

[PATCH v5 11/13] commit-graph: read only from specific pack-indexes

2018-02-26 Thread Derrick Stolee
Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 ++-
 builtin/commit-graph.c | 33 ++---
 commit-graph.c | 26 --
 commit-graph.h |  4 +++-
 packfile.c |  4 ++--
 packfile.h |  2 ++
 t/t5318-commit-graph.sh| 10 ++
 7 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 51cb038..b945510 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -32,7 +32,9 @@ COMMANDS
 'write'::
 
 Write a commit graph file based on the commits found in packfiles.
-Includes all commits from the existing commit graph file.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified packfiles.
 
 'read'::
 
@@ -49,6 +51,13 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Write a graph file, extending the current graph file using commits
+* in .
++
+
+$ echo  | git commit-graph write --stdin-packs
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0e164be..eebca57 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
@@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int stdin_packs;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv)
 
 static int graph_write(int argc, const char **argv)
 {
+   const char **pack_indexes = NULL;
+   int packs_nr = 0;
+   const char **lines = NULL;
+   int lines_nr = 0;
+   int lines_alloc = 0;
+
static struct option builtin_commit_graph_write_options[] = {
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "stdin-packs", _packs,
+   N_("scan packfiles listed by stdin for commits")),
OPT_END(),
};
 
@@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv)
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   write_commit_graph(opts.obj_dir);
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   lines_nr = 0;
+   lines_alloc = 128;
+   ALLOC_ARRAY(lines, lines_alloc);
+
+   while (strbuf_getline(, stdin) != EOF) {
+   ALLOC_GROW(lines, lines_nr + 1, lines_alloc);
+   lines[lines_nr++] = strbuf_detach(, NULL);
+   }
+
+   pack_indexes = lines;
+   packs_nr = lines_nr;
+   }
+
+   write_commit_graph(opts.obj_dir,
+  pack_indexes,
+  packs_nr);
+
return 0;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 184b8da..4e9f1d5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -529,7 +529,9 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
-void write_commit_graph(const char *obj_dir)
+void write_commit_graph(const char *obj_dir,
+   const char **pack_indexes,
+   int nr_packs)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -552,7 +554,27 @@ void write_commit_graph(const char *obj_dir)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
-   for_each_packed_object(add_packed_commits, , 0);
+   if (pack_indexes) {
+   struct strbuf packname = STRBUF_INIT;
+   int dirlen;
+   st

[PATCH v5 07/13] commit-graph: implement git commit-graph read

2018-02-26 Thread Derrick Stolee
Teach git-commit-graph to read commit graph files and summarize their contents.

Use the read subcommand to verify the contents of a commit graph file in the
tests.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  12 
 builtin/commit-graph.c |  56 +++
 commit-graph.c | 140 -
 commit-graph.h |  23 ++
 t/t5318-commit-graph.sh|  32 +++--
 5 files changed, 257 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index e688843..51cb038 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 
 [verse]
+'git commit-graph read'  [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
 
@@ -33,6 +34,11 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file.
 
+'read'::
+
+Read a graph file given by the commit-graph file and output basic
+details about the graph file. Used for debugging purposes.
+
 
 EXAMPLES
 
@@ -43,6 +49,12 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Read basic information from the commit-graph file.
++
+
+$ git commit-graph read
+
+
 
 GIT
 ---
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index a9d61f6..0e164be 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,10 +7,16 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
+static const char * const builtin_commit_graph_read_usage[] = {
+   N_("git commit-graph read [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_write_usage[] = {
N_("git commit-graph write [--object-dir ]"),
NULL
@@ -20,6 +26,54 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_read(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_read_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_read_options,
+builtin_commit_graph_read_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_name);
+   FREE_AND_NULL(graph_name);
+
+   printf("header: %08x %d %d %d %d\n",
+   ntohl(*(uint32_t*)graph->data),
+   *(unsigned char*)(graph->data + 4),
+   *(unsigned char*)(graph->data + 5),
+   *(unsigned char*)(graph->data + 6),
+   *(unsigned char*)(graph->data + 7));
+   printf("num_commits: %u\n", graph->num_commits);
+   printf("chunks:");
+
+   if (graph->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (graph->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (graph->chunk_commit_data)
+   printf(" commit_metadata");
+   if (graph->chunk_large_edges)
+   printf(" large_edges");
+   printf("\n");
+
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
static struct option builtin_commit_graph_write_options[] = {
@@ -60,6 +114,8 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "read"))
+   return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
return graph_write(argc, argv);
}
diff --git a/commit-graph.c b/commit-graph.c
index 2251397..7b0cfb4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -39,11 +39,149 @@
GRAPH_OID_LEN + 8)
 
 
-static char *get_commit_graph_filename(const char *obj_dir)
+char *get_commit_graph_file

[PATCH v5 08/13] commit-graph: add core.commitGraph setting

2018-02-26 Thread Derrick Stolee
The commit graph feature is controlled by the new core.commitGraph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.commitGraph is that a user can always stop checking
for or parsing commit graph files if core.commitGraph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f57e9cf..77fcd53 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitGraph::
+   Enable git commit graph feature. Allows reading from .graph files.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index 21fbcc2..3a18a02 100644
--- a/cache.h
+++ b/cache.h
@@ -801,6 +801,7 @@ extern char *git_replace_ref_base;
 
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index b0c20e6..25ee4a6 100644
--- a/config.c
+++ b/config.c
@@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.commitgraph")) {
+   core_commit_graph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index de8431e..0e96be3 100644
--- a/environment.c
+++ b/environment.c
@@ -62,6 +62,7 @@ enum push_default_type push_default = 
PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
+int core_commit_graph;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.7.4



[PATCH v5 09/13] commit-graph: close under reachability

2018-02-26 Thread Derrick Stolee
Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 7b0cfb4..01aa23d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -369,6 +369,28 @@ static int add_packed_commits(const struct object_id *oid,
return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct rev_info revs;
+   struct commit *commit;
+   init_revisions(, NULL);
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+   if (commit && !parse_commit(commit))
+   revs.commits = commit_list_insert(commit, 
);
+   }
+
+   if (prepare_revision_walk())
+   die(_("revision walk setup failed"));
+
+   while ((commit = get_revision()) != NULL) {
+   ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+   oidcpy(>list[oids->nr], &(commit->object.oid));
+   (oids->nr)++;
+   }
+}
+
 void write_commit_graph(const char *obj_dir)
 {
struct packed_oid_list oids;
@@ -393,6 +415,7 @@ void write_commit_graph(const char *obj_dir)
ALLOC_ARRAY(oids.list, oids.alloc);
 
for_each_packed_object(add_packed_commits, , 0);
+   close_reachable();
 
QSORT(oids.list, oids.nr, commit_compare);
 
-- 
2.7.4



[PATCH] sha1_name: fix uninitialized memory errors

2018-02-26 Thread Derrick Stolee
During abbreviation checks, we navigate to the position within a
pack-index that an OID would be inserted and check surrounding OIDs
for the maximum matching prefix. This position may be beyond the
last position, because the given OID is lexicographically larger
than every OID in the pack. Then nth_packed_object_oid() does not
initialize "oid".

Use the return value of nth_packed_object_oid() to prevent these
errors.

Reported-by: Christian Couder <christian.cou...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d2..44dd595 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -546,17 +546,12 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 * nearby for the abbreviation length.
 */
mad->init_len = 0;
-   if (!match) {
-   nth_packed_object_oid(, p, first);
+   if (!match && nth_packed_object_oid(, p, first))
extend_abbrev_len(, mad);
-   } else if (first < num - 1) {
-   nth_packed_object_oid(, p, first + 1);
+   else if (first < num - 1 && nth_packed_object_oid(, p, first + 1))
extend_abbrev_len(, mad);
-   }
-   if (first > 0) {
-   nth_packed_object_oid(, p, first - 1);
+   if (first > 0 && nth_packed_object_oid(, p, first - 1))
extend_abbrev_len(, mad);
-   }
mad->init_len = mad->cur_len;
 }
 
-- 
2.7.4



[PATCH v2] sha1_name: fix uninitialized memory errors

2018-02-27 Thread Derrick Stolee
Peff made an excellent point about the nested if statements. This
goes back to Christian's original recommendation.

-- >8 --

During abbreviation checks, we navigate to the position within a
pack-index that an OID would be inserted and check surrounding OIDs
for the maximum matching prefix. This position may be beyond the
last position, because the given OID is lexicographically larger
than every OID in the pack. Then nth_packed_object_oid() does not
initialize "oid".

Use the return value of nth_packed_object_oid() to prevent these
errors.

Reported-by: Christian Couder <christian.cou...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..a041d8d24f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -547,15 +547,15 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 */
mad->init_len = 0;
if (!match) {
-   nth_packed_object_oid(, p, first);
-   extend_abbrev_len(, mad);
+   if (nth_packed_object_oid(, p, first))
+   extend_abbrev_len(, mad);
} else if (first < num - 1) {
-   nth_packed_object_oid(, p, first + 1);
-   extend_abbrev_len(, mad);
+   if (nth_packed_object_oid(, p, first + 1))
+   extend_abbrev_len(, mad);
}
if (first > 0) {
-   nth_packed_object_oid(, p, first - 1);
-   extend_abbrev_len(, mad);
+   if (nth_packed_object_oid(, p, first - 1))
+   extend_abbrev_len(, mad);
}
mad->init_len = mad->cur_len;
 }
-- 
2.16.2.265.g3d5930c0b9.dirty



Re: Use of uninitialised value of size 8 in sha1_name.c

2018-02-26 Thread Derrick Stolee

On 2/26/2018 5:23 AM, Christian Couder wrote:

On Mon, Feb 26, 2018 at 10:53 AM, Jeff King  wrote:

On Mon, Feb 26, 2018 at 10:04:22AM +0100, Christian Couder wrote:


==21455== Use of uninitialised value of size 8
==21455==at 0x2D2A73: get_hex_char_from_oid (sha1_name.c:492)
==21455==by 0x2D2AFE: extend_abbrev_len (sha1_name.c:502)
==21455==by 0x2D2C3D: find_abbrev_len_for_pack (sha1_name.c:551)
==21455==by 0x2D2CFF: find_abbrev_len_packed (sha1_name.c:569)
==21455==by 0x2D2E12: find_unique_abbrev_r (sha1_name.c:608)
==21455==by 0x2DCB66: strbuf_add_unique_abbrev (strbuf.c:877)
==21455==by 0x14F7CE: update_local_ref (fetch.c:700)
==21455==by 0x1500CF: store_updated_refs (fetch.c:871)
==21455==by 0x15035B: fetch_refs (fetch.c:932)
==21455==by 0x150CF8: do_fetch (fetch.c:1146)
==21455==by 0x1515AB: fetch_one (fetch.c:1370)
==21455==by 0x151A1D: cmd_fetch (fetch.c:1457)
==21455==  Uninitialised value was created by a stack allocation
==21455==at 0x2D2B2E: find_abbrev_len_for_pack (sha1_name.c:513)
==21455==

A quick git blame seems to point to 0e87b85683 (sha1_name: minimize
OID comparisons during disambiguation, 2017-10-12).

It is difficult to tell for sure though as t5616-partial-clone.sh was
added after that commit.

I think that commit is to blame, though the error isn't exactly where
that stack trace puts it. Try this:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..6f7f36436f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -547,7 +547,10 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
  */
 mad->init_len = 0;
 if (!match) {
-   nth_packed_object_oid(, p, first);
+   warning("p->num_objects = %u, first = %u",
+   p->num_objects, first);
+   if (!nth_packed_object_oid(, p, first))
+   die("oops!");
 extend_abbrev_len(, mad);
 } else if (first < num - 1) {
 nth_packed_object_oid(, p, first + 1);

I get failures all over the test suite, like this:

   warning: p->num_objects = 4, first = 3
   warning: p->num_objects = 8, first = 6
   warning: p->num_objects = 10, first = 0
   warning: p->num_objects = 4, first = 0
   warning: p->num_objects = 8, first = 0
   warning: p->num_objects = 10, first = 10
   fatal: oops!

Yeah, I tried to t5601-clone.sh with --valgrind and I also get one
error (the same "Uninitialised value" error actually).


Thanks for finding this. Sorry for the bug.


Any time the abbreviated hex would go after the last object in the pack,
then first==p->num_objects, and nth_packed_object_oid() will fail. That
leaves uninitialized data in "oid", which is what valgrind complains
about when we examine it in extend_abbrev_len().

Most of the time the code behaves correctly anyway, because the
uninitialized bytes are unlikely to match up with our hex and extend the
length. Probably we just need to detect that case and skip the call to
extend_abbrev_len() altogether.

Yeah, something like:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..a041d8d24f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -547,15 +547,15 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
  */
 mad->init_len = 0;
 if (!match) {
-   nth_packed_object_oid(, p, first);
-   extend_abbrev_len(, mad);
+   if (nth_packed_object_oid(, p, first))
+   extend_abbrev_len(, mad);
 } else if (first < num - 1) {
-   nth_packed_object_oid(, p, first + 1);
-   extend_abbrev_len(, mad);
+   if (nth_packed_object_oid(, p, first + 1))
+   extend_abbrev_len(, mad);
 }
 if (first > 0) {
-   nth_packed_object_oid(, p, first - 1);
-   extend_abbrev_len(, mad);
+   if (nth_packed_object_oid(, p, first - 1))
+   extend_abbrev_len(, mad);
 }
 mad->init_len = mad->cur_len;
  }

seems to prevent valgrind from complaining when running t5616-partial-clone.sh.


This seems like the safest fix, but also we could use our values of 
"match", "first" and "num" to safely call nth_packed_object_oid(). 
However, since nth_packed_object_oid() checks the bounds, don't 
duplicate work and just use the return value.


I would reformat your patch slightly, but that's just preference:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d2..97b632c 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -546,17 +546,14 @@ static void find_abbrev_len_for_pack(struct 
packed_git *p,

 * nearby for the abbreviation length.
 */
    mad->init_len = 0;
-   if (!match) {
-   nth_packed_object_oid(, p, first);
+   if (!match && nth_packed_object_oid(, p, first))
    extend_abbrev_len(, mad);
-   } else if (first < num - 1) {
-   

Re: [PATCH] commit-graph: fix some "plain integer as NULL pointer" warnings

2018-02-26 Thread Derrick Stolee

On 2/24/2018 12:42 AM, René Scharfe wrote:

Am 24.02.2018 um 03:24 schrieb Ramsay Jones:

diff --git a/commit-graph.c b/commit-graph.c
index fc5ee7e99..c2f443436 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,7 @@ char *get_graph_latest_filename(const char *obj_dir)
   {
struct strbuf fname = STRBUF_INIT;
strbuf_addf(, "%s/info/graph-latest", obj_dir);
-   return strbuf_detach(, 0);
+   return strbuf_detach(, NULL);
   }

You could also replace that function body with:

return xstrfmt("%s/info/graph-latest", obj_dir);

René


Thanks for the feedback! I will apply these locally as I am re-rolling 
today.


-Stolee


Re: [PATCH] revision.c: reduce object database queries

2018-02-28 Thread Derrick Stolee

On 2/28/2018 1:37 AM, Jeff King wrote:

On Tue, Feb 27, 2018 at 03:16:58PM -0800, Junio C Hamano wrote:


This code comes originally form 454fbbcde3 (git-rev-list: allow missing
objects when the parent is marked UNINTERESTING, 2005-07-10). But later,
in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be
missing, 2009-01-27), we marked dealt with calling parse_object() on the
parents more directly.

So what I wonder is whether this code is simply redundant and can go
away entirely. That would save the has_object_file() call in all cases.

Hmm, interesting. I forgot all what I did around this area, but you
are right.

I'll leave it to Stolee whether he wants to dig into removing the
has_object_file() call. I think it would do the right thing, but the
most interesting bit would be how it impacts the timings.


This patch was so small that I could understand the full implication of 
my change.


I'm still very unfamiliar with the object walking machinery in 
revision.c. There are a lot of inter-dependencies that are taking time 
for me to understand and to feel confident that I won't cause a side 
effect in a special case. I do appreciate that Junio added a test in 
aeeae1b771.


I'll make a note to revisit this area after I have a better grasp of the 
object walk code, but I will not try removing the has_object_file() call 
now.





There's a similar case for trees. ...
though technically the existing code allows _missing_ trees, but
not on corrupt ones.

True, but the intention of these "do not care too much about missing
stuff while marking uninteresting" effort is aligned better with
ignoring corrupt ones, too, I would think, as "missing" in that
sentence is in fact about "not availble", and stuff that exists in
corrupt form is still not available anyway.  So I do not think it
makes a bad change to start allowing corrupt ones.

Agreed. Here it is in patch form, though as we both said, it probably
doesn't matter that much in practice. So I'd be OK dropping it out of
a sense of conservatism.

-- >8 --
Subject: [PATCH] mark_tree_contents_uninteresting: drop has_object check

It's generally acceptable for UNINTERESTING objects in a
traversal to be unavailable (e.g., see aeeae1b771). When
marking trees UNINTERESTING, we access the object database
twice: once to check if the object is missing (and return
quietly if it is), and then again to actually parse it.

We can instead just try to parse; if that fails, we can then
return quietly. That halves the effort we spend on locating
the object.

Note that this isn't _exactly_ the same as the original
behavior, as the parse failure could be due to other
problems than a missing object: it could be corrupted, in
which case the original code would have died. But the new
behavior is arguably better, as it covers the object being
unavailable for any reason. We'll also still issue a warning
to stderr in such a case.

Signed-off-by: Jeff King 
---
  revision.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index 5ce9b93baa..221d62c52b 100644
--- a/revision.c
+++ b/revision.c
@@ -51,12 +51,9 @@ static void mark_tree_contents_uninteresting(struct tree 
*tree)
  {
struct tree_desc desc;
struct name_entry entry;
-   struct object *obj = >object;
  
-	if (!has_object_file(>oid))

+   if (parse_tree_gently(tree, 1) < 0)
return;
-   if (parse_tree(tree) < 0)
-   die("bad tree %s", oid_to_hex(>oid));
  
  	init_tree_desc(, tree->buffer, tree->size);

while (tree_entry(, )) {




Re: [PATCH 03/11] packfile: allow install_packed_git to handle arbitrary repositories

2018-02-28 Thread Derrick Stolee

On 2/27/2018 8:06 PM, Stefan Beller wrote:
  
-void install_packed_git(struct packed_git *pack)

+void install_packed_git(struct repository *r, struct packed_git *pack)


This is a good thing to do. I'm just making note that this will collide 
with the new instances of install_packed_git() that I added in the 
serialized git commit graph series, specifically when adding the 
--stdin-packs argument [1].


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/cagz79kym0fhiyq2+k5__a2hy1pecyigyf3n9zbjskh8yjzo...@mail.gmail.com/T/#u


Re: [PATCH 00/11] Moving global state into the repository object (part 2)

2018-02-28 Thread Derrick Stolee



On 2/27/2018 9:15 PM, Duy Nguyen wrote:

On Tue, Feb 27, 2018 at 05:05:57PM -0800, Stefan Beller wrote:

This applies on top of origin/sb/object-store and is the continuation of
that series, adding the repository as a context argument to functions.

This series focusses on packfile handling, exposing (re)prepare_packed_git
and find_pack_entry to a repository argument.

It looks good.


I agree.


Looking at the full-series diff though, it makes me wonder if we
should keep prepare_packed_git() private (i.e. how we initialize the
object store, packfile included, is a private matter). How about
something like this on top?


I think the get_packed_git() approach is cleaner than navigating 
directly to the_repository->objects.packed_git that you expect to be 
initialized by an earlier method call.


Thanks,
-Stolee


-- 8< --
Subject: [PATCH] packfile: keep raw_object_store.packed_git{,_mru} fields 
private

These fields are initialized lazily via prepare_packed_git(). All
access to these must call that function first but unless you know the
implementation details, you may not see the connection.

Keep that connection internal. These fields should only be accessed
via get_packed_git() and get_packed_git_mru() outside packfile.c.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
  builtin/count-objects.c  |  4 +---
  builtin/fsck.c   |  6 ++
  builtin/gc.c |  3 +--
  builtin/pack-objects.c   | 21 +++--
  builtin/pack-redundant.c |  6 ++
  fast-import.c|  7 ++-
  http-backend.c   |  5 ++---
  object-store.h   |  4 ++--
  pack-bitmap.c|  3 +--
  packfile.c   | 15 ++-
  packfile.h   |  6 +-
  server-info.c|  5 ++---
  sha1_name.c  |  6 ++
  13 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index d480301763..088309945b 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -121,9 +121,7 @@ int cmd_count_objects(int argc, const char **argv, const 
char *prefix)
struct strbuf loose_buf = STRBUF_INIT;
struct strbuf pack_buf = STRBUF_INIT;
struct strbuf garbage_buf = STRBUF_INIT;
-   if (!the_repository->objects.packed_git)
-   prepare_packed_git(the_repository);
-   for (p = the_repository->objects.packed_git; p; p = p->next) {
+   for (p = get_packed_git(the_repository); p; p = p->next) {
if (!p->pack_local)
continue;
if (open_pack_index(p))
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 0a043a43c2..6d86f2581a 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -732,10 +732,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
uint32_t total = 0, count = 0;
struct progress *progress = NULL;
  
-			prepare_packed_git(the_repository);

-
if (show_progress) {
-   for (p = the_repository->objects.packed_git; p;
+   for (p = get_packed_git(the_repository); p;
 p = p->next) {
if (open_pack_index(p))
continue;
@@ -744,7 +742,7 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
  
  progress = start_progress(_("Checking objects"), total);

}
-   for (p = the_repository->objects.packed_git; p;
+   for (p = get_packed_git(the_repository); p;
 p = p->next) {
/* verify gives error messages itself */
if (verify_pack(p, fsck_obj_buffer,
diff --git a/builtin/gc.c b/builtin/gc.c
index 80d19c54d5..be63bec09c 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -173,8 +173,7 @@ static int too_many_packs(void)
if (gc_auto_pack_limit <= 0)
return 0;
  
-	prepare_packed_git(the_repository);

-   for (cnt = 0, p = the_repository->objects.packed_git; p; p = p->next) {
+   for (cnt = 0, p = get_packed_git(the_repository); p; p = p->next) {
if (!p->pack_local)
continue;
if (p->pack_keep)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5e2590f882..a305f50100 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1026,7 +1026,7 @@ static int want_object_in_pack(const struct object_id 
*oid,
if (want != -1)
return want;
}
-   list_for_each(pos, _repository->objects.packed_git_mru) {
+   list_for_each(pos, get_packed_git_mru(the_repository)) {
struct packed_git *p = list_entry(pos, struct 

Re: What's cooking in git.git (Mar 2018, #01; Thu, 1)

2018-03-02 Thread Derrick Stolee

On 3/1/2018 5:20 PM, Junio C Hamano wrote:

--
[Graduated to "master"]

* jt/binsearch-with-fanout (2018-02-15) 2 commits
   (merged to 'next' on 2018-02-15 at 7648891022)
  + packfile: refactor hash search with fanout table
  + packfile: remove GIT_DEBUG_LOOKUP log statements
  (this branch is used by ds/commit-graph.)

  Refactor the code to binary search starting from a fan-out table
  (which is how the packfile is indexed with object names) into a
  reusable helper.

[...]
--
[New Topics]

* jk/cached-commit-buffer (2018-02-22) 2 commits
   (merged to 'next' on 2018-02-27 at af791d9a1e)
  + revision: drop --show-all option
  + commit: drop uses of get_cached_commit_buffer()

  Code clean-up.

  Will merge to 'master'.


Thanks! These resolve the dependencies for ds/commit-graph


* ds/commit-graph (2018-02-20) 13 commits
  - commit-graph: build graph from starting commits
  - commit-graph: read only from specific pack-indexes
  - commit: integrate commit graph with commit parsing
  - commit-graph: close under reachability
  - commit-graph: add core.commitGraph setting
  - commit-graph: implement --delete-expired
  - commit-graph: implement --set-latest
  - commit-graph: implement git commit-graph read
  - commit-graph: implement 'git-commit-graph write'
  - commit-graph: implement write_commit_graph()
  - commit-graph: create git-commit-graph builtin
  - graph: add commit graph design document
  - commit-graph: add format document

  Precompute and store information necessary for ancestry traversal
  in a separate file to optimize graph walking.

  Reroll exists, but it appears that there will be a further reroll.
  cf. <1519698787-190494-1-git-send-email-dsto...@microsoft.com>


v5 reroll (mentioned above) has limited review, so I won't be rerolling 
soon. I'm particularly interested in review on [PATCH 04/13] csum-file: 
add CSUM_KEEP_OPEN flag [1].


Thanks,

-Stolee

[1] 
https://public-inbox.org/git/1519698787-190494-5-git-send-email-dsto...@microsoft.com/


Re: [RFC] Contributing to Git (on Windows)

2018-03-02 Thread Derrick Stolee

On 3/1/2018 11:44 PM, Jonathan Nieder wrote:

Hi,

Derrick Stolee wrote:


Now, we'd like to make that document publicly available. These steps are
focused on a Windows user, so we propose putting them in the
git-for-windows/git repo under CONTRIBUTING.md. I have a pull request open
for feedback [1]. I'll read comments on that PR or in this thread.

Thanks!  I wonder if we can put this in the standard Documentation/
directory as well.  E.g. the Windows CONTRIBUTING.md could say say
"See Documentation/Contributing-On-Windows.md" so that the bulk of the
text would not have to be git-for-windows specific.


That's a good idea. After this review stabilizes, I'll send a patch to 
add the windows-specific instructions as you recommend.


What precedent do we have for referencing forks like git-for-windows/git?



[...]

+++ b/CONTRIBUTING.md
@@ -0,0 +1,401 @@
+How to Contribute to Git for Windows
+

Would it make sense for this to be checked in with LF instead of CRLF
line endings, so that clones use the user's chosen / platform native
line ending?  The .gitattributes file could include

/CONTRIBUTING.md text

so that line ending conversion takes place even if the user hasn't
enabled core.autocrlf.


Oops! I turned off the CRLF munging in my repo because I had an issue 
with a script somewhere, but forgot to turn it back on again. Thanks for 
checking this.




[...]

+The SDK uses a different credential manager, so you may still want to use 
Visual Studio
+or normal Git Bash for interacting with your remotes.  Alternatively, use 
SSH rather
+than HTTPS and avoid credential manager problems.

Where can I read more about the problems in question?


I'll try to see if we can elaborate here. The Git for Windows client 
ships with Git Credential Manager for Windows [1] but the SDK does not. 
At the very least, credential interactions are not the same and you do 
not have access to the credentials stored in Windows Credential Manager.


[1] https://github.com/Microsoft/Git-Credential-Manager-for-Windows



[...]

+Many new contributors ask: What should I start working on?
+
+One way to win big with the maintainer of an open-source project is to look at 
the
+[issues page](https://github.com/git-for-windows/git/issues) and see if there 
are any issues that
+you can fix quickly, or if anything catches your eye.

You can also look at https://crbug.com/git for non
Windows-specific issues.  And you can look at recent user questions
on the mailing list: https://public-inbox.org/git

[...]

+If you are adding new functionality, you may need to create low-level tests by 
creating
+helper commands that test a very limited action. These commands are stored in 
`t/helpers`.
+When adding a helper, be sure to add a line to `t/Makefile` and to the 
`.gitignore` for the
+binary file you add. The Git community prefers functional tests using the full 
`git`
+executable, so be sure the test helper is required.

Maybe s/low-level/unit/?

[...]

+Read [`t/README`](https://github.com/git/git/blob/master/t/README) for more 
details.

Forgive my ignorance: does github flavored markdown allow relative
links?  (I.e., could this say [`t/README`](t/README)?)

[...]

+You can also set certain environment variables to help test the performance on 
different
+repositories or with more repetitions. The full list is available in
+[the `t/perf/README` 
file](https://github.com/git/git/blob/master/t/perf/README),

Likewise.

[...]

+Test Your Changes on Linux
+--
+
+It can be important to work directly on the [core Git 
codebase](https://github.com/git/git),
+such as a recent commit into the `master` or `next` branch that has not been 
incorporated
+into Git for Windows. Also, it can help to run functional and performance 
tests on your
+code in Linux before submitting patches to the Linux-focused mailing list.

I'm surprised at this advice.  Does it actually come up?  When I was
on Mac I never worried about this, nor when I was on Windows.

I'm personally happy to review patches that haven't been tested on
Linux, though it's of course even nicer if the patch author mentions
what platforms they've tested on.

Maybe this can be reframed to refer specifically to cases where you've
modified some Linux platform-specific code (e.g. to add a new feature
to run-command.c)?


I recently had a bug in my local copy of the commit-graph patch that was 
only caught by our Mac OSX automated builds: I was passing the 
edge-value for a parent into a lookup to get an octopus edge from the 
graph, but I forgot to drop the most-significant bit. This cast the 
"uint32_t" silently into an "int" but since we multiplied by 4 somehow 
the Windows and Linux compilers turned this into a left-shift while Mac 
did not and failed during my test.


Now this is an example of something that probably would have been caught 
in review, but stuff gets through.


The Win

Re: What's cooking in git.git (Apr 2018, #03; Wed, 25)

2018-04-26 Thread Derrick Stolee

On 4/25/2018 1:43 PM, Brandon Williams wrote:

On 04/25, Ævar Arnfjörð Bjarmason wrote:

* bw/protocol-v2 (2018-03-15) 35 commits
   (merged to 'next' on 2018-04-11 at 23ee234a2c)
  + remote-curl: don't request v2 when pushing
  + remote-curl: implement stateless-connect command
  + http: eliminate "# service" line when using protocol v2
  + http: don't always add Git-Protocol header
  + http: allow providing extra headers for http requests
  + remote-curl: store the protocol version the server responded with
  + remote-curl: create copy of the service name
  + pkt-line: add packet_buf_write_len function
  + transport-helper: introduce stateless-connect
  + transport-helper: refactor process_connect_service
  + transport-helper: remove name parameter
  + connect: don't request v2 when pushing
  + connect: refactor git_connect to only get the protocol version once
  + fetch-pack: support shallow requests
  + fetch-pack: perform a fetch using v2
  + upload-pack: introduce fetch server command
  + push: pass ref prefixes when pushing
  + fetch: pass ref prefixes when fetching
  + ls-remote: pass ref prefixes when requesting a remote's refs
  + transport: convert transport_get_remote_refs to take a list of ref prefixes
  + transport: convert get_refs_list to take a list of ref prefixes
  + connect: request remote refs using v2
  + ls-refs: introduce ls-refs server command
  + serve: introduce git-serve
  + test-pkt-line: introduce a packet-line test helper
  + protocol: introduce enum protocol_version value protocol_v2
  + transport: store protocol version
  + connect: discover protocol version outside of get_remote_heads
  + connect: convert get_remote_heads to use struct packet_reader
  + transport: use get_refs_via_connect to get refs
  + upload-pack: factor out processing lines
  + upload-pack: convert to a builtin
  + pkt-line: add delim packet support
  + pkt-line: allow peeking a packet line without consuming it
  + pkt-line: introduce packet_read_with_status
  (this branch is used by bw/server-options.)

  The beginning of the next-gen transfer protocol.

  Will cook in 'next'.

With a month & 10 days of no updates & this looking stable it would be
great to have it in master sooner than later to build on top of it in
the 2.18 window.

I personally think that this series is ready to graduate to master but I
mentioned to Junio off-list that it might be a good idea to wait until
it has undergone a little more stress testing.  We've been in the
process of trying to get this rolled out to our internal server but due
to a few roadblocks and people being out of the office its taken a bit
longer than I would have liked to get it up and running.  I'm hoping in
another few days/a week I'll have some more data on live traffic.  At
that point I think I'll be more convinced that it will be safe to merge it.

I may be overly cautions but I'm hoping that we can get this right
without needing to do another protocol version bump for a very long
time.  Technically using v2 is hidden behind an "experimental" config
flag so we should still be able to tweak it after the fact if we
absolutely needed to, but I'd like to avoid that if necessary.


There's no testing better than production. I think if we have an 
opportunity to test with realistic traffic, we should take advantage.


But I also agree that this series looks stable.

I realize it's hard to communicate both "this series is ready to merge" 
and "I appreciate your caution."


Thanks,

-Stolee



Re: [PATCH v4 03/10] commit-graph: compute generation numbers

2018-04-26 Thread Derrick Stolee

n 4/25/2018 10:35 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
  
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY)

+   packedDate[0] |= htonl((*list)->generation << 2);
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);

The ones that have infinity are written as zero here.  The code that
reads the generation field off of a file in fill_commit_graph_info()
and fill_commit_in_graph() both leave such a record in file as-is,
so the reader of what we write out will think it is _ZERO, not _INF.

Not that it matters, as it seems that most of the code being added
by this series treat _ZERO and _INF more or less interchangeably.
But it does raise another question, i.e. do we need both _ZERO and
_INF, or is it sufficient to have just a single _UNKNOWN?


This code is confusing. The 'if' condition is useless, since at this 
point every commit should be finite (since we computed generation 
numbers for everyone). We should just write the value always.


For the sake of discussion, the value _INFINITY means not in the graph 
and _ZERO means in the graph without a computed generation number. It's 
a small distinction, but it gives a single boundary to use for 
reachability queries that test generation number.





@@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
}
  }
  
+static void compute_generation_numbers(struct commit** commits,

+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }

If we haven't computed all parents' generations yet,
current->generation is undefined (or at least "left as
initialized"), so it does not make much sense to attempt to clip it
at _MAX at this point.  At leat not yet.

IOW, shouldn't the following two lines be inside the "we now know
genno of all parents, so we can compute genno for commit" block
above?


You're right! Good catch. This code sets every merge commit to _MAX. It 
should be in the block above.





+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = GENERATION_NUMBER_MAX;
+   }
+   }
+}
+
  void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
  
+	compute_generation_numbers(commits.list, commits.nr);

+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);


<    5   6   7   8   9   10   11   12   13   14   >