from:"Derrick Stolee"

Re: [PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-23 Thread Derrick Stolee


On 4/18/2018 8:02 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

All right, that is nice explanation of the why behind this change.


Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Couldn't this 'check_graph' parameter be a global variable similar to
the 'commit_graph' variable?  Maybe I am not understanding it.


See the two callers at the bottom of the patch. They have different 
purposes: one needs to fill in a valid commit struct, the other needs to 
check the commit buffer is valid (then throws away the struct). They 
have different values for 'check_graph'. Also, in parse_commit_gently() 
we check parse_commit_in_graph() before we call parse_commit_buffer, so 
we do not want to repeat work; in the case of a valid commit-graph file, 
but the commit is not in the commit-graph, we would repeat our binary 
search for the same commit.





Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 51 --
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
  {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;

I'm probably wrong, but isn't it unrelated change?


You're right. I saw this while I was in here, and there was a similar 
comment on this change in a different patch. Probably best to keep these 
cleanup things in a separate commit.



item->object.parsed = 1;
item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
  }
  
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)

+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   }
+}

All right (after the fix).


+
  int parse_commit_in_graph(struct commit *item)
  {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;

Hmmm... previously the function returned 1 if item->object.parsed, now
it returns 0 for this situation.  I don't understand this change.


The good news is that this change is unimportant (the only caller is 
parse_commit_gently() which checks item->object.parsed before calling 
parse_commit_in_graph()). I wonder why I reordered those things, anyway. 
I'll revert to simplify the patch.





-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, &(item->object.oid), 
);
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, po

Re: [PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-23 Thread Derrick Stolee


On 4/18/2018 5:02 PM, Jakub Narebski wrote:

Here I can offer only the cursory examination, as I don't know this area
of code in question.

Derrick Stolee <dsto...@microsoft.com> writes:


A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

A question: what is the performance after if the "commit-graph" feature
is disabled, or there is no commit-graph file?  Is there performance
regression in this case, or is the difference negligible?


Negligible, since we are adding a small number of integer comparisons 
and the main cost is in commit parsing. More on commit parsing in 
response to your comments below.





Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  ref-filter.c | 23 +++
  1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  /*
   * Test whether the candidate or one of its parents is contained in the list.

  ^

Sidenote: when examining the code after the change, I have noticed that
the above part of commit header for the comtains_test() function is no
longer entirely correct, as the function only checks the candidate
commit, and in no place it access its parents.

But that is not your problem.


I'll add a commit in the next version that fixes this comment before I 
make any changes to the method.





   * Do not recurse to find out, though, but return -1 if inconclusive.
   */
  static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
  {
enum contains_result *cached = contains_cache_at(cache, candidate);
  
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
  
  	/* Otherwise, we don't know; prepare to recurse */

parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+

Looks good to me.

The only [minor] question may be whether to define separate type for
generation numbers, and whether to future proof the tests - though the
latter would be almost certainly overengineering, and the former
probablt too.


If we have multiple notions of generation, then we can refactor all 
references to the "generation" member.





return CONTAINS_UNKNOWN;
  }
  
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,

  struct contains_cache *cache)
  {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }

Sholdn't the above be made conditional on the ability to get generation
numbers from the commit-graph file (feature is turned on and file
exists)?  Otherwise here after the change contains_tag_algo() now parses
each commit in 'want', which I think was not done previously.

With commit-graph file parsing is [probably] cheap.  Without it, not
necessary.

But I might be worrying about nothing.


Not nothing. This parses the "wants" when we previously did not parse 
the wants. Further: this parsing happens before we do the simple check 
of comparing the OID of the candidate against the wants.


The question is: are these parsed commits significant compared to the 
walk that will parse many more commits? It is certainly possible.


One way to fix this is to call 'prepare_commit_graph()' directly and 
then test that 'commit_graph' is non-null before performing any parses. 
I'm not thrilled with how that couples the commit-graph implementation 
to this feature, but that may be necessary to avoid regressions in the 
non-commit-graph case.




  
+	result = contains_test(candidate, want, cache, cutoff);

O

Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee


On 4/21/2018 4:44 PM, Jakub Narebski wrote:

Jakub Narebski <jna...@gmail.com> writes:

Derrick Stolee <sto...@gmail.com> writes:

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.

The two main questions to consider are:

1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.

[...]


And the thing to measure is a commit count. If possible, it would be
good to count commits walked (commits whose parent list is enumerated)
and commits inspected (commits that were listed as a parent of some
walked commit). Walked commits require a commit parse -- albeit from
the commit-graph instead of the ODB now -- while inspected commits
only check the in-memory cache.

[...]

For git.git and Linux, I like to use the release tags as tests. They
provide a realistic view of the linear history, and maintenance
releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...


Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
  * src = 468165c1d = v2.17.0
  * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

  * 468165c1d has level 18418 which it shares with 6 commits
  * 66d2e04ec has level 14776 which it shares with 93 commits
  * gen(468165c1d) - gen(66d2e04ec) = 3642

  algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
  ---+-++--+-+--+---+
  naive  | 48865   | 39599  | 244  | 9200|  |   |
  level  |  3086   |  2492  | 113  |  528| 285  |   |
  FELINE |   283   |   216  |  68  |0|  |  25   |
  lev+FELINE |   282   |   215  |  68  |0|   5  |  24   |
  ---+-++--+-+--+---+
  lev+FEL+mpi|79   |59  |  21  |0|   0  |   0   |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
   note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
   were used for negative-cut; note that those are not disjoint; node can
   be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.


Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.



P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order,

Re: [PATCH v3 0/9] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee


On 4/18/2018 8:04 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains', providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is build on ds/lazy-load-trees.

Derrick Stolee (9):
   commit: add generation number to struct commmit

Nice and short patch. Looks good to me.


   commit-graph: compute generation numbers

Another quite easy to understand patch. LGTM.


   commit: use generations in paint_down_to_common()

Nice and short patch; minor typo in comment in code.
Otherwise it looks good to me.


   commit-graph.txt: update design document

I see that diagram got removed in this version; maybe it could be
replaced with relationship table?

Anyway, it looks good to me.


The diagrams and tables seemed to cause more confusion than clarity. I 
think the reader should create their own mental model from the 
definitions and description and we should avoid trying to make a summary.





   ref-filter: use generation number for --contains

A question: how performance looks like after the change if commit-graph
is not available?


The performance issue is minor, but will be fixed in v4.




   commit: use generation numbers for in_merge_bases()

Possible typo in the commit message, and stylistic inconsistence in
in_merge_bases() - though actually more clear than existing code.

Short, simple, and gives good performance improvenebts.


   commit: add short-circuit to paint_down_to_common()

Looks good to me; ignore [mostly] what I have written in response to the
patch in question.


   commit-graph: always load commit-graph information

Looks all right; question: parameter or one more global variable.


I responded to say that the global variable approach is incorrect. 
Parameter is important to functionality and performance.





   merge: check config before loading commits

This looks good to me.


  Documentation/technical/commit-graph.txt | 30 +--
  alloc.c  |  1 +
  builtin/merge.c  |  5 +-
  commit-graph.c   | 99 +++-
  commit-graph.h   |  8 ++
  commit.c | 54 +++--
  commit.h |  7 +-
  object.c |  2 +-
  ref-filter.c | 23 +-
  sha1_file.c  |  2 +-
  t/t5318-commit-graph.sh  |  9 +++
  11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707

Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-24 Thread Derrick Stolee


On 4/23/2018 5:38 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:


On 4/18/2018 7:19 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


[...]

[...], and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.


If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because max-order priority queue is used: if the commit with
maximum generation number is below generation number of target commit,
then target commit is not reachable from any commit in the priority
queue (all of which has generation number less or equal than the commit
at head of queue, i.e. all are same level or deeper); compare what I
have written in [1]

[1]: https://public-inbox.org/git/866052dkju@gmail.com/

Do I have that right?  If so, it looks all right to me.

Yes, the priority queue needs to compare via generation number first
or there will be errors. This is why we could not use commit time
before.

I was more concerned about getting right the order in the priority queue
(does it return minimal or maximal generation number).

I understand that the cutoff could not be used without generation
numbers because of the possibility of clock skew - using cutoff on dates
could lead to wrong results.


Maximal generation number is important so we do not visit commits 
multiple times (say, once with PARENT1 set, and a second time when 
PARENT2 is set). A minimal generation number order would create a DFS 
order and walk until the cutoff every time.


In cases without clock skew, maximal generation number order will walk 
the same set of commits as maximal commit time; the order may differ, 
but only between incomparable commits.



For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

[...]

flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
   -list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
while (list) {
struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
   -bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.

We need to change the signature some way, but maybe the way I chose is
not the best.

No, after taking longer I think the new signature is a good choice.


To elaborate: paint_down_to_common() is used for multiple
purposes. The caller here that supplies 'commit->generation' is used
only to compute reachability (by testing if the flag PARENT2 exists on
the commit, then clears all flags). The other callers expect the full
walk down to the common commits, and keeps those PARENT1, PARENT2, and
STALE flags for future use (such as reporting merge bases). Usually
the call to paint_down_to_common() is followed by a revision walk that
only halts when reaching root commits or commits with both PARENT1 and
PARENT2 flags on, so always short-circuiting on generations would
break the functionality; this is confirmed by the
t5318-commit-graph.sh.

Right.

I have realized that just after sending the email.  I'm sorry about this.


An alternative to the signature change is to add a boolean parameter
"use_cutoff" or something, that specifies "don't walk beyond the
commit". This may give a more of a clear description of what it will
do with the generation value, but since we are

[RFC PATCH 02/12] commit-graph: add 'check' subcommand

2018-04-17 Thread Derrick Stolee

If the commit-graph file becomes corrupt, we need a way to verify
its contents match the object database. In the manner of 'git fsck'
we will implement a 'git commit-graph check' subcommand to report
all issues with the file.

Add the 'check' subcommand to the 'commit-graph' builtin and its
documentation. Add a simple test that ensures the command returns
a zero error code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  7 +-
 builtin/commit-graph.c | 38 ++
 commit-graph.c |  5 
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh|  5 
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..93c7841ba2 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 
 [verse]
+'git commit-graph check' [--object-dir ]
 'git commit-graph read' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
-
 DESCRIPTION
 ---
 
@@ -52,6 +52,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'check'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..77c1a04932 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,11 +7,17 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph check [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_check_usage[] = {
+   N_("git commit-graph check [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_check(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_check_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_check_options,
+builtin_commit_graph_check_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_name);
+   FREE_AND_NULL(graph_name);
+
+   return check_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "check"))
+   return graph_check(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 3f0c142603..cd0634bba0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir,
oids.alloc = 0;
oids.nr = 0;
 }
+
+int check_commit_graph(struct commit_graph *g)
+{
+   return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..e8c8d99dff 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
int nr_commits,
int append);
 
+int check_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..e91053271a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
 '
 
+test_expect_success 'git commit-graph check' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph check >output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 08/12] commit-graph: verify commit contents against odb

2018-04-17 Thread Derrick Stolee

When running 'git commit-graph check', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 82 ++
 1 file changed, 82 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 80a2ac2a6a..b5bce2bac4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g)
 graph_commit->graph_pos, i);
}
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
+   int num_parents = 0;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   graph_commit = lookup_commit(_oid);
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
+   if (parse_commit_internal(odb_commit, 0, 0))
+   graph_report(_("failed to parse %s from object 
database"), oid_to_hex(_oid));
+
+   if (oidcmp(_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report(_("root tree object ID for commit %s in 
commit-graph is %s != %s"),
+oid_to_hex(_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+   if (graph_commit->date != odb_commit->date)
+   graph_report(_("commit date for commit %s in 
commit-graph is %"PRItime" != %"PRItime""),
+oid_to_hex(_oid),
+graph_commit->date,
+odb_commit->date);
+
+
+   graph_parents = graph_commit->parents;
+   odb_parents = odb_commit->parents;
+
+   while (graph_parents) {
+   num_parents++;
+
+   if (odb_parents == NULL)
+   graph_report(_("commit-graph parent list for 
commit %s is too long (%d)"),
+oid_to_hex(_oid),
+num_parents);
+
+   if (oidcmp(_parents->item->object.oid, 
_parents->item->object.oid))
+   graph_report(_("commit-graph parent for %s is 
%s != %s"),
+oid_to_hex(_oid),
+
oid_to_hex(_parents->item->object.oid),
+
oid_to_hex(_parents->item->object.oid));
+
+   graph_parents = graph_parents->next;
+   odb_parents = odb_parents->next;
+   }
+
+   if (odb_parents != NULL)
+   graph_report(_("commit-graph parent list for commit %s 
terminates early"),
+oid_to_hex(_oid));
+
+   if (graph_commit->generation) {
+   uint32_t max_generation = 0;
+   graph_parents = graph_commit->parents;
+
+   while (graph_parents) {
+   if (graph_parents->item->generation == 
GENERATION_NUMBER_ZERO ||
+   graph_parents->item->generation == 
GENERATION_NUMBER_INFINITY)
+   graph_report(_("commit-graph has valid 
generation for %s but not its parent, %s"),
+oid_to_hex(_oid),
+
oid_to_hex(_parents->item->object.oid));
+   if (graph_parents->item->generation > 
max_generation)
+   max_generation = 
graph_parents->item->generation;
+   graph_parents = graph_parents->next;
+   }
+
+   if (graph_commit->generation != max_generation + 1)
+   graph_report(_("commit-graph has incorrect 
generation for %s"),
+oid_to_hex(_oid));
+   } else {
+   graph_parents = graph_commit->parents;
+
+

[RFC PATCH 09/12] fsck: check commit-graph

2018-04-17 Thread Derrick Stolee

If a commit-graph file exists, check its contents during 'git fsck'.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/fsck.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..9712f230ba 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
 
check_connectivity();
+
+   if (core_commit_graph) {
+   struct child_process commit_graph_check = CHILD_PROCESS_INIT;
+   const char *check_argv[] = { "commit-graph", "check", NULL, 
NULL };
+   commit_graph_check.argv = check_argv;
+   commit_graph_check.git_cmd = 1;
+
+   if (run_command(_graph_check))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+
return errors_found;
 }
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

2018-04-17 Thread Derrick Stolee

The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
  but may not be required to start. The goal is to have this default to
  true eventually, but we may want to delay that until the feature is
  stable.

* I implement a "--reachable" option to 'git commit-graph write' that
  iterates over all refs. This does the same as 

git show-ref -s | git commit-graph write --stdin-commits

  but I don't know how to pipe two child processes together inside of Git.
  Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
  fixup! commit-graph: always load commit-graph information
  commit-graph: add 'check' subcommand
  commit-graph: check file header information
  commit-graph: parse commit from chosen graph
  commit-graph: check fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: check commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/git-commit-graph.txt   |  15 +-
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |   9 --
 builtin/commit-graph.c   |  79 +-
 builtin/fsck.c   |  13 ++
 builtin/gc.c |   8 +
 commit-graph.c   | 178 ++-
 commit-graph.h   |   2 +
 commit.c |  14 +-
 commit.h |   1 +
 t/t5318-commit-graph.sh  |  15 ++
 11 files changed, 311 insertions(+), 27 deletions(-)

-- 
2.17.0.39.g685157f7fb

[RFC PATCH 03/12] commit-graph: check file header information

2018-04-17 Thread Derrick Stolee

During a run of 'git commit-graph check', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index cd0634bba0..c5e5a0f860 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
 }
 
+static int check_commit_graph_error;
+#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); 
}
+
 int check_commit_graph(struct commit_graph *g)
 {
-   return !g;
+   if (!g) {
+   graph_report(_("no commit-graph file loaded"));
+   return 1;
+   }
+
+   check_commit_graph_error = 0;
+
+   if (get_be32(g->data) != GRAPH_SIGNATURE)
+   graph_report(_("commit-graph file has incorrect header"));
+
+   if (*(g->data + 4) != 1)
+   graph_report(_("commit-graph file version is not 1"));
+   if (*(g->data + 5) != GRAPH_OID_VERSION)
+   graph_report(_("commit-graph OID version is not 1 (SHA1)"));
+
+   if (!g->chunk_oid_fanout)
+   graph_report(_("commit-graph is missing the OID Fanout chunk"));
+   if (!g->chunk_oid_lookup)
+   graph_report(_("commit-graph is missing the OID Lookup chunk"));
+   if (!g->chunk_commit_data)
+   graph_report(_("commit-graph is missing the Commit Data 
chunk"));
+   if (g->hash_len != GRAPH_OID_LEN)
+   graph_report(_("commit-graph has incorrect hash length: %d"), 
g->hash_len);
+
+   return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 11/12] gc: automatically write commit-graph files

2018-04-17 Thread Derrick Stolee

The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-gc.txt | 4 
 builtin/gc.c | 8 
 2 files changed, 12 insertions(+)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..070f2a7a3d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", _depth);
git_config_get_int("gc.auto", _auto_threshold);
git_config_get_int("gc.autopacklimit", _auto_pack_limit);
+   git_config_get_bool("gc.commitgraph", _commit_graph);
git_config_get_bool("gc.autodetach", _auto);
git_config_get_expiry("gc.pruneexpire", _expire);
git_config_get_expiry("gc.worktreepruneexpire", 
_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_pushl(, "prune", "--expire", NULL);
argv_array_pushl(_worktrees, "worktree", "prune", "--expire", 
NULL);
argv_array_pushl(, "rerere", "gc", NULL);
+   argv_array_pushl(_graph, "commit-graph", "write", "--reachable", 
NULL);
 
/* default expiry time, overwritten in gc_config */
gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_commit_graph)
+   if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+   return error(FAILED_RUN, commit_graph.argv[0]);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee

---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..3f0c142603 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
*pos = item->graph_pos;
return 1;
} else {
-   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   return bsearch_graph(g, &(item->object.oid), pos);
}
 }
 
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 06/12] commit: force commit to parse from object database

2018-04-17 Thread Derrick Stolee

In anticipation of checking commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 14 ++
 commit.h |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 9ef6f699bd..07752d8503 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
buffer = read_sha1_file(item->object.oid.hash, , );
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-oid_to_hex(>object.oid));
+   oid_to_hex(>object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-oid_to_hex(>object.oid));
+   oid_to_hex(>object.oid));
}
ret = parse_commit_buffer(item, buffer, size, 0);
if (save_commit_buffer && !ret) {
@@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char 
*name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char 
*ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 05/12] commit-graph: check fanout and lookup table

2018-04-17 Thread Derrick Stolee

While running 'git commit-graph check', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

In anticipation of checking the commits in the commit-graph file
against the object database, parse the commits from that file in
advance. We perform this parse now to ensure the object cache contains
only commits from this commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 6d0d303a7a..6e3c08cd5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -835,6 +835,9 @@ static int check_commit_graph_error;
 
 int check_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report(_("no commit-graph file loaded"));
return 1;
@@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g)
if (g->hash_len != GRAPH_OID_LEN)
graph_report(_("commit-graph has incorrect hash length: %d"), 
g->hash_len);
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   if (i > 0 && oidcmp(_oid, _oid) >= 0)
+   graph_report(_("commit-graph has incorrect oid order: 
%s then %s"),
+
+   oid_to_hex(_oid),
+   oid_to_hex(_oid));
+   oidcpy(_oid, _oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report(_("commit-graph has incorrect 
fanout value: fanout[%d] = %u != %u"),
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+
+   graph_commit = lookup_commit(_oid);
+
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report(_("failed to parse %s from commit-graph"), 
oid_to_hex(_oid));
+
+   if (graph_commit->graph_pos != i)
+   graph_report(_("graph_pos for commit %s is %u != %u"), 
oid_to_hex(_oid),
+graph_commit->graph_pos, i);
+   }
+
return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 12/12] commit-graph: update design document

2018-04-17 Thread Derrick Stolee

The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 9 -
 1 file changed, 9 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index d9f2713efa..d04657b781 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -142,12 +139,6 @@ Future Work
   such as "ensure_tree_loaded(commit)" that fully loads a tree before
   using commit->tree.
 
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 10/12] commit-graph: add '--reachable' option

2018-04-17 Thread Derrick Stolee

When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 --
 builtin/commit-graph.c | 41 +++---
 t/t5318-commit-graph.sh| 10 
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 93c7841ba2..1b14d40590 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in 
packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 77c1a04932..a89285ada8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph check [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int reachable;
int stdin_packs;
int stdin_commits;
int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+struct hex_list {
+   char **hex_strs;
+   int hex_nr;
+   int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+  const struct object_id *oid,
+  int flags, void *cb_data)
+{
+   struct hex_list *list = (struct hex_list*)cb_data;
+
+   ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+   list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+   strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+   list->hex_nr++;
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "reachable", ,
+   N_("start walk at all refs")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
-   if (opts.stdin_packs && opts.stdin_commits)
-   die(_("cannot use both --stdin-commits and --stdin-packs"));
+   if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+   die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));
if (!opts.obj_dir)

[RFC PATCH 07/12] commit-graph: load a root tree from specific graph

2018-04-17 Thread Derrick Stolee

When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6e3c08cd5c..80a2ac2a6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -354,14 +354,20 @@ static struct tree *load_tree_for_commit(struct 
commit_graph *g, struct commit *
return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+const struct commit *c)
 {
if (c->maybe_tree)
return c->maybe_tree;
if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
 
-   return load_tree_for_commit(commit_graph, (struct commit *)c);
+   return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.17.0.39.g685157f7fb

[RFC PATCH 04/12] commit-graph: parse commit from chosen graph

2018-04-17 Thread Derrick Stolee

Before checking a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c5e5a0f860..6d0d303a7a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
uint32_t pos;
 
if (item->object.parsed)
-   return 0;
+   return 1;
+
+   if (find_commit_in_graph(item, g, ))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
if (!core_commit_graph)
return 0;
+
prepare_commit_graph();
-   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.17.0.39.g685157f7fb

Re: [PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee


On 4/17/2018 1:00 PM, Derrick Stolee wrote:

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 51 --
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
  {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
  
  	item->object.parsed = 1;

item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
  }
  
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)

+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);


The reference to 'commit_graph' in the above line should be 'g'. Sorry!


+   }
+}
+
  int parse_commit_in_graph(struct commit *item)
  {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, &(item->object.oid), 
);
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
  }
  
+void load_commit_graph_info(struct commit *item)

+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
  static struct tree *load_tree_for_commit(struct commit_graph *g, struct 
commit *c)
  {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);
  
+/*

+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
  struct tree *get_commit_tree_in_graph(const struct commit *c);
  
  struct commit_graph {

diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_comm

Re: [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

2018-04-17 Thread Derrick Stolee


On 4/17/2018 2:10 PM, Derrick Stolee wrote:

The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
   but may not be required to start. The goal is to have this default to
   true eventually, but we may want to delay that until the feature is
   stable.

* I implement a "--reachable" option to 'git commit-graph write' that
   iterates over all refs. This does the same as

git show-ref -s | git commit-graph write --stdin-commits

   but I don't know how to pipe two child processes together inside of Git.
   Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
   fixup! commit-graph: always load commit-graph information
   commit-graph: add 'check' subcommand
   commit-graph: check file header information
   commit-graph: parse commit from chosen graph
   commit-graph: check fanout and lookup table
   commit: force commit to parse from object database
   commit-graph: load a root tree from specific graph
   commit-graph: verify commit contents against odb
   fsck: check commit-graph
   commit-graph: add '--reachable' option
   gc: automatically write commit-graph files
   commit-graph: update design document

  Documentation/git-commit-graph.txt   |  15 +-
  Documentation/git-gc.txt |   4 +
  Documentation/technical/commit-graph.txt |   9 --
  builtin/commit-graph.c   |  79 +-
  builtin/fsck.c   |  13 ++
  builtin/gc.c |   8 +
  commit-graph.c   | 178 ++-
  commit-graph.h   |   2 +
  commit.c |  14 +-
  commit.h |   1 +
  t/t5318-commit-graph.sh  |  15 ++
  11 files changed, 311 insertions(+), 27 deletions(-)


This RFC is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/6

Re: [PATCH v10 00/36] Add directory rename detection to git

2018-04-19 Thread Derrick Stolee


On 4/19/2018 2:41 PM, Stefan Beller wrote:

On Thu, Apr 19, 2018 at 11:35 AM, Elijah Newren  wrote:

On Thu, Apr 19, 2018 at 10:57 AM, Elijah Newren  wrote:

This series is a reboot of the directory rename detection series that was
merged to master and then reverted due to the final patch having a buggy
can-skip-update check, as noted at
   https://public-inbox.org/git/xmqqmuya43cs@gitster-ct.c.googlers.com/
This series based on top of master.

...and merges cleanly to next but apparently has some minor conflicts
with both ds/lazy-load-trees and ps/test-chmtime-get from pu.

What's the preferred way to resolve this?  Rebase and resubmit my
series on pu, or something else?

If you were to base it off of pu, this series would depend on all other
series that pu contains. This is bad for the progress of this series.
(If it were to be merged to next, all other series would automatically
merge to next as well)

If the conflicts are minor, then Junio resolves them; if you want to be
nice, pick your merge point as

 git checkout origin/master
 git merge ds/lazy-load-trees
 git merge ps/test-chmtime-get
 git tag my-anchor

and put the series on top of that anchor.

If you do this, you'd want to be reasonably sure that
those two series are not in too much flux.


I believe ds/lazy-load-trees is queued for 'next'. I'm not surprised 
that there are some conflicts here. Any reference to the 'tree' member 
of a commit should be replaced with 'get_commit_tree(c)', or 
'get_commit_tree_oid(c)' if you only care about the tree's object id.


I think Stefan's suggestion is the best approach to get the right 
conflicts out of the way.


Thanks,
-Stolee

Re: [RFC 0/1] Tolerate broken headers in `packed-refs` files

2018-03-26 Thread Derrick Stolee


On 3/26/2018 8:42 AM, Michael Haggerty wrote:

[...]

But there might be some tools out in the wild that have been writing
broken headers. In that case, users who upgrade Git might suddenly
find that they can't read repositories that they could read before. In
fact, a tool that we wrote and use internally at GitHub was doing
exactly that, which is how we discovered this "problem".

This patch shows what it would look like to relax the parsing again,
albeit *only* for the first line of the file, and *only* for lines
that start with '#'.

The problem with this patch is that it would make it harder for people
who implement broken tools in the future to discover their mistakes.
The only result of the error would be that it is slower to work with
the `packed-refs` files that they wrote. Such an error could go
undiscovered for a long time.


My opinion is that we shouldn't maintain back-compat with formats that 
may have been written by another tool because Git wasn't strict about 
it. As long as Git never wrote files with these formats, then they 
shouldn't be supported.


You are absolutely right that staying strict will help discover the 
tools that are writing an incorrect format.


Since most heavily-used tools that didn't spawn Git processes use 
LibGit2 to interact with Git repos, I added Ed Thomson to CC to see if 
libgit2 could ever write these bad header comments.


Thanks for writing this RFC so we can have the discussion and more 
quickly identify this issue if/when users are broken.


Thanks,
-Stolee

Re: [PATCH 4/3] sha1_name: use bsearch_pack() in unique_in_pack()

2018-03-25 Thread Derrick Stolee


On 3/24/2018 12:41 PM, René Scharfe wrote:

Replace the custom binary search in unique_in_pack() with a call to
bsearch_pack().  This reduces code duplication and makes use of the
fan-out table of packs.

Signed-off-by: Rene Scharfe <l@web.de>
---
This is basically the same replacement as done by patch 3.  Speed is
less of a concern here -- at least I don't know a commonly used
command that needs to resolve lots of short hashes.


Thanks, René! Good teamwork on this patch series.

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>

Re: [PATCH] unpack-trees: release oid_array after use in check_updates()

2018-03-25 Thread Derrick Stolee


On 3/25/2018 12:31 PM, René Scharfe wrote:

Signed-off-by: Rene Scharfe 
---
That leak was introduced by c0c578b33c (unpack-trees: batch fetching of
missing blobs).

  unpack-trees.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index d5685891a5..e73745051e 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -379,30 +379,31 @@ static int check_updates(struct unpack_trees_options *o)
struct oid_array to_fetch = OID_ARRAY_INIT;
int fetch_if_missing_store = fetch_if_missing;
fetch_if_missing = 0;
for (i = 0; i < index->cache_nr; i++) {
struct cache_entry *ce = index->cache[i];
if ((ce->ce_flags & CE_UPDATE) &&
!S_ISGITLINK(ce->ce_mode)) {
if (!has_object_file(>oid))
oid_array_append(_fetch, >oid);
}
}
if (to_fetch.nr)
fetch_objects(repository_format_partial_clone,
  _fetch);
fetch_if_missing = fetch_if_missing_store;
+   oid_array_clear(_fetch);
}
for (i = 0; i < index->cache_nr; i++) {
struct cache_entry *ce = index->cache[i];
  
  		if (ce->ce_flags & CE_UPDATE) {

if (ce->ce_flags & CE_WT_REMOVE)
die("BUG: both update and delete flags are set on 
%s",
ce->name);
display_progress(progress, ++cnt);
ce->ce_flags &= ~CE_UPDATE;
if (o->update && !o->dry_run) {
errs |= checkout_entry(ce, , NULL);
}
}
}


Ack. Looks correct.

-Stolee

Re: [ANNOUNCE] Git v2.17.0-rc1

2018-03-25 Thread Derrick Stolee

On 3/25/2018 2:42 PM, Ævar Arnfjörð Bjarmason wrote:

On Sun, Mar 25 2018, Derrick Stolee wrote:

On 3/23/2018 1:59 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Mar 21 2018, Junio C. Hamano wrote:

A release candidate Git v2.17.0-rc1 is now available for testing
at the usual places. It is comprised of 493 non-merge commits
since v2.16.0, contributed by 62 people, 19 of which are new faces.

I have this deployed on some tens of K machines who all use git in one
way or another (from automated pulls, to users interactively), and rc0
before that, with a few patches on top from me + Takato + Duy + Derrick
since rc0 was released (and since today based on top of rc1). No issues
so far.

The specific in-house version I have is at:
https://github.com/git/git/compare/v2.17.0-rc1...bookingcom:booking-git-v2018-03-23-1

Thanks for testing the commit-graph feature, Ævar! I'm guessing you
have some mechanisms to ensure the 'git commit-graph write' command is
run on these machines and 'core.commitGraph' is set to true in the
config? I would love to hear how this benefits your org.

I haven't deployed any actual use of it at a wider scale, but I've done
some ad-hoc benchmarking with our internal version which has your
patches, and the results are very promising so far on the isolated test
cases where it helps (that you know about, e.g. rev-list --all).

So sorry, I don't have any meaningful testing of this, I just wanted an
easy way to ad-hoc test it & make sure it doesn't break other stuff for
now.

I also threw out most of the manual git maintenance stuff we had and
just rely on gc --auto now, so as soon as you have something to
integrate with that, along with those perf changes Peff suggested I'm
much more likely to play with it in some real way.

Thanks. Integration with 'gc --auto' is a high priority for me after the
patch lands.

The version on GitHub [1] is slightly ahead of v6 as I wait to reroll on
v2.17.0. It includes Peff's improvements to inspecting pack-indexes [2].

[1] https://github.com/derrickstolee/git/pull/2
[2]
https://github.com/derrickstolee/git/pull/2/commits/cb86817ee5c5127b32c93a22ef130f0db6207970

Re: [ANNOUNCE] Git v2.17.0-rc1

2018-03-25 Thread Derrick Stolee


On 3/23/2018 1:59 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Mar 21 2018, Junio C. Hamano wrote:


A release candidate Git v2.17.0-rc1 is now available for testing
at the usual places.  It is comprised of 493 non-merge commits
since v2.16.0, contributed by 62 people, 19 of which are new faces.

I have this deployed on some tens of K machines who all use git in one
way or another (from automated pulls, to users interactively), and rc0
before that, with a few patches on top from me + Takato + Duy + Derrick
since rc0 was released (and since today based on top of rc1). No issues
so far.

The specific in-house version I have is at:
https://github.com/git/git/compare/v2.17.0-rc1...bookingcom:booking-git-v2018-03-23-1


Thanks for testing the commit-graph feature, Ævar! I'm guessing you have 
some mechanisms to ensure the 'git commit-graph write' command is run on 
these machines and 'core.commitGraph' is set to true in the config? I 
would love to hear how this benefits your org.


Thanks,
-Stolee

< 5 6 7 8 9 10 11 12 13 14 >

901 - 1000 of 1362 matches

Mail list logo