Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
On 4/18/2018 8:02 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. All right, that is nice explanation of the why behind this change. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). Couldn't this 'check_graph' parameter be a global variable similar to the 'commit_graph' variable? Maybe I am not understanding it. See the two callers at the bottom of the patch. They have different purposes: one needs to fill in a valid commit struct, the other needs to check the commit buffer is valid (then throws away the struct). They have different values for 'check_graph'. Also, in parse_commit_gently() we check parse_commit_in_graph() before we call parse_commit_buffer, so we do not want to repeat work; in the case of a valid commit-graph file, but the commit is not in the commit-graph, we would repeat our binary search for the same commit. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 51 -- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; I'm probably wrong, but isn't it unrelated change? You're right. I saw this while I was in here, and there was a similar comment on this change in a different patch. Probably best to keep these cleanup things in a separate commit. item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); + } +} All right (after the fix). + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; Hmmm... previously the function returned 1 if item->object.parsed, now it returns 0 for this situation. I don't understand this change. The good news is that this change is unimportant (the only caller is parse_commit_gently() which checks item->object.parsed before calling parse_commit_in_graph()). I wonder why I reordered those things, anyway. I'll revert to simplify the patch. - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, po
Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
On 4/18/2018 5:02 PM, Jakub Narebski wrote: Here I can offer only the cursory examination, as I don't know this area of code in question. Derrick Stolee <dsto...@microsoft.com> writes: A commit A can reach a commit B only if the generation number of A is larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for 'git tag --contains' queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% A question: what is the performance after if the "commit-graph" feature is disabled, or there is no commit-graph file? Is there performance regression in this case, or is the difference negligible? Negligible, since we are adding a small number of integer comparisons and the main cost is in commit parsing. More on commit parsing in response to your comments below. Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..e2fea6d635 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) /* * Test whether the candidate or one of its parents is contained in the list. ^ Sidenote: when examining the code after the change, I have noticed that the above part of commit header for the comtains_test() function is no longer entirely correct, as the function only checks the candidate commit, and in no place it access its parents. But that is not your problem. I'll add a commit in the next version that fixes this comment before I make any changes to the method. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + Looks good to me. The only [minor] question may be whether to define separate type for generation numbers, and whether to future proof the tests - though the latter would be almost certainly overengineering, and the former probablt too. If we have multiple notions of generation, then we can refactor all references to the "generation" member. return CONTAINS_UNKNOWN; } @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } Sholdn't the above be made conditional on the ability to get generation numbers from the commit-graph file (feature is turned on and file exists)? Otherwise here after the change contains_tag_algo() now parses each commit in 'want', which I think was not done previously. With commit-graph file parsing is [probably] cheap. Without it, not necessary. But I might be worrying about nothing. Not nothing. This parses the "wants" when we previously did not parse the wants. Further: this parsing happens before we do the simple check of comparing the OID of the candidate against the wants. The question is: are these parsed commits significant compared to the walk that will parse many more commits? It is certainly possible. One way to fix this is to call 'prepare_commit_graph()' directly and then test that 'commit_graph' is non-null before performing any parses. I'm not thrilled with how that couples the commit-graph implementation to this feature, but that may be necessary to avoid regressions in the non-commit-graph case. + result = contains_test(candidate, want, cache, cutoff); O
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/21/2018 4:44 PM, Jakub Narebski wrote: Jakub Narebski <jna...@gmail.com> writes: Derrick Stolee <sto...@gmail.com> writes: On 4/11/2018 3:32 PM, Jakub Narebski wrote: What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? That is easy to do. The function generic_is_reachable() does that... though using direct translation of the pseudocode for "Algorithm 3: Reachable" from FELINE paper, which is recursive and doesn't check if vertex was already visited was not good idea for large graphs such as Linux kernel commit graph, oops. That is why generic_is_reachable_large() was created. [...] And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. [...] For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the FELINE index does not bring any improvements over using just level (generation number) filter. But that may be caused by narrowing od commit DAG around releases. I try do do the same between commits in wide part, with many commits with the same level (same generation number) both for source and for target commit. Though this may be unfair to level filter, though... Note however that FELINE index is not unabiguous, like generation numbers are (modulo decision whether to start at 0 or at 1); it depends on the topological ordering chosen for the X elements. One can now test reachability on git.git repository; there is a form where one can plug source and destination revisions at https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1 I have tried the case that is quite unfair to the generation numbers filter, namely the check between one of recent tags, and one commit that shares generation number among largest number of other commits. Here level = generation number-1 (as it starts at 0 for root commit, not 1). The results are: * src = 468165c1d = v2.17.0 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec * 468165c1d has level 18418 which it shares with 6 commits * 66d2e04ec has level 14776 which it shares with 93 commits * gen(468165c1d) - gen(66d2e04ec) = 3642 algorithm | access | walk | maxdepth | visited | level-f | FELINE-f | ---+-++--+-+--+---+ naive | 48865 | 39599 | 244 | 9200| | | level | 3086 | 2492 | 113 | 528| 285 | | FELINE | 283 | 216 | 68 |0| | 25 | lev+FELINE | 282 | 215 | 68 |0| 5 | 24 | ---+-++--+-+--+---+ lev+FEL+mpi|79 |59 | 21 |0| 0 | 0 | Here we have: * 'naive' implementation means simple DFS walk, without any filters (cut-offs) * 'level' means using levels / generation numbers based negative-cut filter * 'FELINE' means using FELINE index based negative-cut filter * 'lev+FELINE' means combining generation numbers filter with FELINE filter * 'mpi' means min-post [smanning-tree] intervals for positive-cut filter; note that the code does not walk the path after cut, but it is easy to do The stats have the following meaning: * 'access' means accessing the node * 'walk' is actual walking the node * 'maxdepth' is maximum depth of the stack used for DFS * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter were used for negative-cut; note that those are not disjoint; node can be rejected by both level filter and FELINE filter For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor: the results are the same, with 5 commits accessed and 6 walked compared to 61574 accessed in naive algorithm. The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads (different child-less commits) and 9 roots, and has average clustering coefficient 0.000409217. Thanks for these results. Now, write a patch. I'm sticking to generation numbers for my patch because of the simplified computation, but you can contribute a FELINE implementation. P.S. Would it be better to move the discussion about possible extensions to the commit-graph in the form of new chunks (topological order,
Re: [PATCH v3 0/9] Compute and consume generation numbers
On 4/18/2018 8:04 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains', providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is build on ds/lazy-load-trees. Derrick Stolee (9): commit: add generation number to struct commmit Nice and short patch. Looks good to me. commit-graph: compute generation numbers Another quite easy to understand patch. LGTM. commit: use generations in paint_down_to_common() Nice and short patch; minor typo in comment in code. Otherwise it looks good to me. commit-graph.txt: update design document I see that diagram got removed in this version; maybe it could be replaced with relationship table? Anyway, it looks good to me. The diagrams and tables seemed to cause more confusion than clarity. I think the reader should create their own mental model from the definitions and description and we should avoid trying to make a summary. ref-filter: use generation number for --contains A question: how performance looks like after the change if commit-graph is not available? The performance issue is minor, but will be fixed in v4. commit: use generation numbers for in_merge_bases() Possible typo in the commit message, and stylistic inconsistence in in_merge_bases() - though actually more clear than existing code. Short, simple, and gives good performance improvenebts. commit: add short-circuit to paint_down_to_common() Looks good to me; ignore [mostly] what I have written in response to the patch in question. commit-graph: always load commit-graph information Looks all right; question: parameter or one more global variable. I responded to say that the global variable approach is incorrect. Parameter is important to functionality and performance. merge: check config before loading commits This looks good to me. Documentation/technical/commit-graph.txt | 30 +-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 99 +++- commit-graph.h | 8 ++ commit.c | 54 +++-- commit.h | 7 +- object.c | 2 +- ref-filter.c | 23 +- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 199 insertions(+), 41 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
On 4/23/2018 5:38 PM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: On 4/18/2018 7:19 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: [...] [...], and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. If I understand the code properly, what happens is that we can now short-circuit if all commits that are left are lower than the target commit. This is because max-order priority queue is used: if the commit with maximum generation number is below generation number of target commit, then target commit is not reachable from any commit in the priority queue (all of which has generation number less or equal than the commit at head of queue, i.e. all are same level or deeper); compare what I have written in [1] [1]: https://public-inbox.org/git/866052dkju@gmail.com/ Do I have that right? If so, it looks all right to me. Yes, the priority queue needs to compare via generation number first or there will be errors. This is why we could not use commit time before. I was more concerned about getting right the order in the priority queue (does it return minimal or maximal generation number). I understand that the cutoff could not be used without generation numbers because of the possibility of clock skew - using cutoff on dates could lead to wrong results. Maximal generation number is important so we do not visit commits multiple times (say, once with PARENT1 set, and a second time when PARENT2 is set). A minimal generation number order would create a DFS order and walk until the cutoff every time. In cases without clock skew, maximal generation number order will walk the same set of commits as maximal commit time; the order may differ, but only between incomparable commits. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% [...] flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } -list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; -bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); Is it the only case where we would call paint_down_to_common() with non-zero last parameter? Would we always use commit->generation where commit is the first parameter of paint_down_to_common()? If both are true and will remain true, then in my humble opinion it is not necessary to change the signature of this function. We need to change the signature some way, but maybe the way I chose is not the best. No, after taking longer I think the new signature is a good choice. To elaborate: paint_down_to_common() is used for multiple purposes. The caller here that supplies 'commit->generation' is used only to compute reachability (by testing if the flag PARENT2 exists on the commit, then clears all flags). The other callers expect the full walk down to the common commits, and keeps those PARENT1, PARENT2, and STALE flags for future use (such as reporting merge bases). Usually the call to paint_down_to_common() is followed by a revision walk that only halts when reaching root commits or commits with both PARENT1 and PARENT2 flags on, so always short-circuiting on generations would break the functionality; this is confirmed by the t5318-commit-graph.sh. Right. I have realized that just after sending the email. I'm sorry about this. An alternative to the signature change is to add a boolean parameter "use_cutoff" or something, that specifies "don't walk beyond the commit". This may give a more of a clear description of what it will do with the generation value, but since we are
[RFC PATCH 02/12] commit-graph: add 'check' subcommand
If the commit-graph file becomes corrupt, we need a way to verify its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph check' subcommand to report all issues with the file. Add the 'check' subcommand to the 'commit-graph' builtin and its documentation. Add a simple test that ensures the command returns a zero error code. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 7 +- builtin/commit-graph.c | 38 ++ commit-graph.c | 5 commit-graph.h | 2 ++ t/t5318-commit-graph.sh| 5 5 files changed, 56 insertions(+), 1 deletion(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 4c97b555cc..93c7841ba2 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files SYNOPSIS [verse] +'git commit-graph check' [--object-dir ] 'git commit-graph read' [--object-dir ] 'git commit-graph write' [--object-dir ] - DESCRIPTION --- @@ -52,6 +52,11 @@ existing commit-graph file. Read a graph file given by the commit-graph file and output basic details about the graph file. Used for debugging purposes. +'check':: + +Read the commit-graph file and verify its contents against the object +database. Used to check for corrupted data. + EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 37420ae0fd..77c1a04932 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -7,11 +7,17 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph check [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; +static const char * const builtin_commit_graph_check_usage[] = { + N_("git commit-graph check [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_read_usage[] = { N_("git commit-graph read [--object-dir ]"), NULL @@ -29,6 +35,36 @@ static struct opts_commit_graph { int append; } opts; + +static int graph_check(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_check_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_check_options, +builtin_commit_graph_check_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + + if (!graph) + die("graph file %s does not exist", graph_name); + FREE_AND_NULL(graph_name); + + return check_commit_graph(graph); +} + static int graph_read(int argc, const char **argv) { struct commit_graph *graph = NULL; @@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "check")) + return graph_check(argc, argv); if (!strcmp(argv[0], "read")) return graph_read(argc, argv); if (!strcmp(argv[0], "write")) diff --git a/commit-graph.c b/commit-graph.c index 3f0c142603..cd0634bba0 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir, oids.alloc = 0; oids.nr = 0; } + +int check_commit_graph(struct commit_graph *g) +{ + return !g; +} diff --git a/commit-graph.h b/commit-graph.h index 96cccb10f3..e8c8d99dff 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir, int nr_commits, int append); +int check_commit_graph(struct commit_graph *g); + #endif diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 77d85aefe7..e91053271a 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full repo' ' test_cmp expect output ' +test_expect_success 'git commit-graph check' ' + cd "$TRASH_DIRECTORY/full" && + git commit-graph check >output +' + test_done -- 2.17.0.39.g685157f7fb
[RFC PATCH 08/12] commit-graph: verify commit contents against odb
When running 'git commit-graph check', compare the contents of the commits that are loaded from the commit-graph file with commits that are loaded directly from the object database. This includes checking the root tree object ID, commit date, and parents. In addition, verify the generation number calculation is correct for all commits in the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 82 ++ 1 file changed, 82 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 80a2ac2a6a..b5bce2bac4 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g) graph_commit->graph_pos, i); } + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit, *odb_commit; + struct commit_list *graph_parents, *odb_parents; + int num_parents = 0; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + graph_commit = lookup_commit(_oid); + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); + if (parse_commit_internal(odb_commit, 0, 0)) + graph_report(_("failed to parse %s from object database"), oid_to_hex(_oid)); + + if (oidcmp(_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit))) + graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"), +oid_to_hex(_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); + + if (graph_commit->date != odb_commit->date) + graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""), +oid_to_hex(_oid), +graph_commit->date, +odb_commit->date); + + + graph_parents = graph_commit->parents; + odb_parents = odb_commit->parents; + + while (graph_parents) { + num_parents++; + + if (odb_parents == NULL) + graph_report(_("commit-graph parent list for commit %s is too long (%d)"), +oid_to_hex(_oid), +num_parents); + + if (oidcmp(_parents->item->object.oid, _parents->item->object.oid)) + graph_report(_("commit-graph parent for %s is %s != %s"), +oid_to_hex(_oid), + oid_to_hex(_parents->item->object.oid), + oid_to_hex(_parents->item->object.oid)); + + graph_parents = graph_parents->next; + odb_parents = odb_parents->next; + } + + if (odb_parents != NULL) + graph_report(_("commit-graph parent list for commit %s terminates early"), +oid_to_hex(_oid)); + + if (graph_commit->generation) { + uint32_t max_generation = 0; + graph_parents = graph_commit->parents; + + while (graph_parents) { + if (graph_parents->item->generation == GENERATION_NUMBER_ZERO || + graph_parents->item->generation == GENERATION_NUMBER_INFINITY) + graph_report(_("commit-graph has valid generation for %s but not its parent, %s"), +oid_to_hex(_oid), + oid_to_hex(_parents->item->object.oid)); + if (graph_parents->item->generation > max_generation) + max_generation = graph_parents->item->generation; + graph_parents = graph_parents->next; + } + + if (graph_commit->generation != max_generation + 1) + graph_report(_("commit-graph has incorrect generation for %s"), +oid_to_hex(_oid)); + } else { + graph_parents = graph_commit->parents; + +
[RFC PATCH 09/12] fsck: check commit-graph
If a commit-graph file exists, check its contents during 'git fsck'. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/fsck.c | 13 + 1 file changed, 13 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index ef78c6c00c..9712f230ba 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -16,6 +16,7 @@ #include "streaming.h" #include "decorate.h" #include "packfile.h" +#include "run-command.h" #define REACHABLE 0x0001 #define SEEN 0x0002 @@ -45,6 +46,7 @@ static int name_objects; #define ERROR_REACHABLE 02 #define ERROR_PACK 04 #define ERROR_REFS 010 +#define ERROR_COMMIT_GRAPH 020 static const char *describe_object(struct object *obj) { @@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } check_connectivity(); + + if (core_commit_graph) { + struct child_process commit_graph_check = CHILD_PROCESS_INIT; + const char *check_argv[] = { "commit-graph", "check", NULL, NULL }; + commit_graph_check.argv = check_argv; + commit_graph_check.git_cmd = 1; + + if (run_command(_graph_check)) + errors_found |= ERROR_COMMIT_GRAPH; + } + return errors_found; } -- 2.17.0.39.g685157f7fb
[RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
The commit-graph feature is not useful to end users until the commit-graph file is maintained automatically by Git during normal upkeep operations. One natural place to trigger this write is during 'git gc'. Before automatically generating a commit-graph, we need to be able to verify the contents of a commit-graph file. Integrate commit-graph checks into 'fsck' that check the commit-graph contents against commits in the object database. Things to think about: * Are these the right integration points? * gc.commitGraph defaults to true right now for the purpose of testing, but may not be required to start. The goal is to have this default to true eventually, but we may want to delay that until the feature is stable. * I implement a "--reachable" option to 'git commit-graph write' that iterates over all refs. This does the same as git show-ref -s | git commit-graph write --stdin-commits but I don't know how to pipe two child processes together inside of Git. Perhaps this is a better solution, anyway. What other things should I be considering in this case? I'm unfamiliar with the inner-workings of 'fsck' and 'gc', so this is a new space for me. This RFC is based on v3 of ds/generation-numbers, and the first commit is a fixup! based on a bug in that version that I caught while prepping this series. Thanks, -Stolee Derrick Stolee (12): fixup! commit-graph: always load commit-graph information commit-graph: add 'check' subcommand commit-graph: check file header information commit-graph: parse commit from chosen graph commit-graph: check fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb fsck: check commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/git-commit-graph.txt | 15 +- Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 9 -- builtin/commit-graph.c | 79 +- builtin/fsck.c | 13 ++ builtin/gc.c | 8 + commit-graph.c | 178 ++- commit-graph.h | 2 + commit.c | 14 +- commit.h | 1 + t/t5318-commit-graph.sh | 15 ++ 11 files changed, 311 insertions(+), 27 deletions(-) -- 2.17.0.39.g685157f7fb
[RFC PATCH 03/12] commit-graph: check file header information
During a run of 'git commit-graph check', list the issues with the header information in the commit-graph file. Some of this information is inferred from the loaded 'struct commit_graph'. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 29 - 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index cd0634bba0..c5e5a0f860 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; } +static int check_commit_graph_error; +#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); } + int check_commit_graph(struct commit_graph *g) { - return !g; + if (!g) { + graph_report(_("no commit-graph file loaded")); + return 1; + } + + check_commit_graph_error = 0; + + if (get_be32(g->data) != GRAPH_SIGNATURE) + graph_report(_("commit-graph file has incorrect header")); + + if (*(g->data + 4) != 1) + graph_report(_("commit-graph file version is not 1")); + if (*(g->data + 5) != GRAPH_OID_VERSION) + graph_report(_("commit-graph OID version is not 1 (SHA1)")); + + if (!g->chunk_oid_fanout) + graph_report(_("commit-graph is missing the OID Fanout chunk")); + if (!g->chunk_oid_lookup) + graph_report(_("commit-graph is missing the OID Lookup chunk")); + if (!g->chunk_commit_data) + graph_report(_("commit-graph is missing the Commit Data chunk")); + if (g->hash_len != GRAPH_OID_LEN) + graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len); + + return check_commit_graph_error; } -- 2.17.0.39.g685157f7fb
[RFC PATCH 11/12] gc: automatically write commit-graph files
The commit-graph file is a very helpful feature for speeding up git operations. In order to make it more useful, write the commit-graph file by default during standard garbage collection operations. Add a 'gc.commitGraph' config setting that triggers writing a commit-graph file after any non-trivial 'git gc' command. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-gc.txt | 4 builtin/gc.c | 8 2 files changed, 12 insertions(+) diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt index 571b5a7e3c..17dd654a59 100644 --- a/Documentation/git-gc.txt +++ b/Documentation/git-gc.txt @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if it within all non-bare repos or it can be set to a boolean value. This defaults to true. +The optional configuration variable 'gc.commitGraph' determines if +'git gc' runs 'git commit-graph write'. This can be set to a boolean +value. This defaults to false. + The optional configuration variable `gc.aggressiveWindow` controls how much time is spent optimizing the delta compression of the objects in the repository when the --aggressive option is specified. The larger diff --git a/builtin/gc.c b/builtin/gc.c index 77fa720bd0..070f2a7a3d 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -34,6 +34,7 @@ static int aggressive_depth = 50; static int aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; +static int gc_commit_graph = 1; static int detach_auto = 1; static timestamp_t gc_log_expire_time; static const char *gc_log_expire = "1.day.ago"; @@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT; static struct argv_array prune = ARGV_ARRAY_INIT; static struct argv_array prune_worktrees = ARGV_ARRAY_INIT; static struct argv_array rerere = ARGV_ARRAY_INIT; +static struct argv_array commit_graph = ARGV_ARRAY_INIT; static struct tempfile *pidfile; static struct lock_file log_lock; @@ -121,6 +123,7 @@ static void gc_config(void) git_config_get_int("gc.aggressivedepth", _depth); git_config_get_int("gc.auto", _auto_threshold); git_config_get_int("gc.autopacklimit", _auto_pack_limit); + git_config_get_bool("gc.commitgraph", _commit_graph); git_config_get_bool("gc.autodetach", _auto); git_config_get_expiry("gc.pruneexpire", _expire); git_config_get_expiry("gc.worktreepruneexpire", _worktrees_expire); @@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix) argv_array_pushl(, "prune", "--expire", NULL); argv_array_pushl(_worktrees, "worktree", "prune", "--expire", NULL); argv_array_pushl(, "rerere", "gc", NULL); + argv_array_pushl(_graph, "commit-graph", "write", "--reachable", NULL); /* default expiry time, overwritten in gc_config */ gc_config(); @@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (pack_garbage.nr > 0) clean_pack_garbage(); + if (gc_commit_graph) + if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD)) + return error(FAILED_RUN, commit_graph.argv[0]); + if (auto_gc && too_many_loose_objects()) warning(_("There are too many unreachable loose objects; " "run 'git prune' to remove them.")); -- 2.17.0.39.g685157f7fb
[RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information
--- commit-graph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 21e853c21a..3f0c142603 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin *pos = item->graph_pos; return 1; } else { - return bsearch_graph(commit_graph, &(item->object.oid), pos); + return bsearch_graph(g, &(item->object.oid), pos); } } -- 2.17.0.39.g685157f7fb
[RFC PATCH 06/12] commit: force commit to parse from object database
In anticipation of checking commit-graph file contents against the object database, create parse_commit_internal() to allow side-stepping the commit-graph file and parse directly from the object database. Due to the use of generation numbers, this method should not be called unless the intention is explicit in avoiding commits from the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 14 ++ commit.h | 1 + 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 9ef6f699bd..07752d8503 100644 --- a/commit.c +++ b/commit.c @@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) + +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph) { enum object_type type; void *buffer; @@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return -1; if (item->object.parsed) return 0; - if (parse_commit_in_graph(item)) + if (use_commit_graph && parse_commit_in_graph(item)) return 0; buffer = read_sha1_file(item->object.oid.hash, , ); if (!buffer) return quiet_on_missing ? -1 : error("Could not read %s", -oid_to_hex(>object.oid)); + oid_to_hex(>object.oid)); if (type != OBJ_COMMIT) { free(buffer); return error("Object %s not a commit", -oid_to_hex(>object.oid)); + oid_to_hex(>object.oid)); } ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { @@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index b5afde1ae9..5fde74fcd7 100644 --- a/commit.h +++ b/commit.h @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { -- 2.17.0.39.g685157f7fb
[RFC PATCH 05/12] commit-graph: check fanout and lookup table
While running 'git commit-graph check', verify that the object IDs are listed in lexicographic order and that the fanout table correctly navigates into that list of object IDs. In anticipation of checking the commits in the commit-graph file against the object database, parse the commits from that file in advance. We perform this parse now to ensure the object cache contains only commits from this commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 34 ++ 1 file changed, 34 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 6d0d303a7a..6e3c08cd5c 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -835,6 +835,9 @@ static int check_commit_graph_error; int check_commit_graph(struct commit_graph *g) { + uint32_t i, cur_fanout_pos = 0; + struct object_id prev_oid, cur_oid; + if (!g) { graph_report(_("no commit-graph file loaded")); return 1; @@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g) if (g->hash_len != GRAPH_OID_LEN) graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len); + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + if (i > 0 && oidcmp(_oid, _oid) >= 0) + graph_report(_("commit-graph has incorrect oid order: %s then %s"), + + oid_to_hex(_oid), + oid_to_hex(_oid)); + oidcpy(_oid, _oid); + + while (cur_oid.hash[0] > cur_fanout_pos) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + if (i != fanout_value) + graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"), +cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + + graph_commit = lookup_commit(_oid); + + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report(_("failed to parse %s from commit-graph"), oid_to_hex(_oid)); + + if (graph_commit->graph_pos != i) + graph_report(_("graph_pos for commit %s is %u != %u"), oid_to_hex(_oid), +graph_commit->graph_pos, i); + } + return check_commit_graph_error; } -- 2.17.0.39.g685157f7fb
[RFC PATCH 12/12] commit-graph: update design document
The commit-graph feature is now integrated with 'fsck' and 'gc', so remove those items from the "Future Work" section of the commit-graph design document. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 9 - 1 file changed, 9 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index d9f2713efa..d04657b781 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -118,9 +118,6 @@ Future Work - The commit graph feature currently does not honor commit grafts. This can be remedied by duplicating or refactoring the current graft logic. -- The 'commit-graph' subcommand does not have a "verify" mode that is - necessary for integration with fsck. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered @@ -142,12 +139,6 @@ Future Work such as "ensure_tree_loaded(commit)" that fully loads a tree before using commit->tree. -- The current design uses the 'commit-graph' subcommand to generate the graph. - When this feature stabilizes enough to recommend to most users, we should - add automatic graph writes to common operations that create many commits. - For example, one could compute a graph on 'clone', 'fetch', or 'repack' - commands. - - A server could provide a commit graph file as part of the network protocol to avoid extra calculations by clients. This feature is only of benefit if the user is willing to trust the file, because verifying the file is correct -- 2.17.0.39.g685157f7fb
[RFC PATCH 10/12] commit-graph: add '--reachable' option
When writing commit-graph files, it can be convenient to ask for all reachable commits (starting at the ref set) in the resulting file. This is particularly helpful when writing to stdin is complicated, such as a future integration with 'git gc' which will call 'git commit-graph write --reachable' after performing cleanup of the object database. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 8 -- builtin/commit-graph.c | 41 +++--- t/t5318-commit-graph.sh| 10 3 files changed, 53 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 93c7841ba2..1b14d40590 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by walking objects only in the specified pack-indexes. (Cannot be combined -with --stdin-commits.) +with --stdin-commits or --reachable.) + With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with ---stdin-packs.) +--stdin-packs or --reachable.) ++ +With the `--reachable` option, generate the new commit graph by walking +commits starting at all refs. (Cannot be combined with --stdin-commits +or --stind-packs.) + With the `--append` option, include all commits that are present in the existing commit-graph file. diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 77c1a04932..a89285ada8 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -3,13 +3,14 @@ #include "dir.h" #include "lockfile.h" #include "parse-options.h" +#include "refs.h" #include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph check [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; @@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int reachable; int stdin_packs; int stdin_commits; int append; @@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv) return 0; } +struct hex_list { + char **hex_strs; + int hex_nr; + int hex_alloc; +}; + +static int add_ref_to_list(const char *refname, + const struct object_id *oid, + int flags, void *cb_data) +{ + struct hex_list *list = (struct hex_list*)cb_data; + + ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc); + list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1); + strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid)); + list->hex_nr++; + return 0; +} + static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; @@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv) OPT_STRING(0, "object-dir", _dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "reachable", , + N_("start walk at all refs")), OPT_BOOL(0, "stdin-packs", _packs, N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", _commits, @@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); - if (opts.stdin_packs && opts.stdin_commits) - die(_("cannot use both --stdin-commits and --stdin-packs")); + if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1) + die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs")); if (!opts.obj_dir)
[RFC PATCH 07/12] commit-graph: load a root tree from specific graph
When lazy-loading a tree for a commit, it will be important to select the tree from a specific struct commit_graph. Create a new method that specifies the commit-graph file and use that in get_commit_tree_in_graph(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 6e3c08cd5c..80a2ac2a6a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -354,14 +354,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit * return c->maybe_tree; } -struct tree *get_commit_tree_in_graph(const struct commit *c) +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g, +const struct commit *c) { if (c->maybe_tree) return c->maybe_tree; if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) BUG("get_commit_tree_in_graph called from non-commit-graph commit"); - return load_tree_for_commit(commit_graph, (struct commit *)c); + return load_tree_for_commit(g, (struct commit *)c); +} + +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + return get_commit_tree_in_graph_one(commit_graph, c); } static void write_graph_chunk_fanout(struct hashfile *f, -- 2.17.0.39.g685157f7fb
[RFC PATCH 04/12] commit-graph: parse commit from chosen graph
Before checking a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index c5e5a0f860..6d0d303a7a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin } } -int parse_commit_in_graph(struct commit *item) +int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) { uint32_t pos; if (item->object.parsed) - return 0; + return 1; + + if (find_commit_in_graph(item, g, )) + return fill_commit_in_graph(item, g, pos); + + return 0; +} + +int parse_commit_in_graph(struct commit *item) +{ if (!core_commit_graph) return 0; + prepare_commit_graph(); - if (commit_graph && find_commit_in_graph(item, commit_graph, )) - return fill_commit_in_graph(item, commit_graph, pos); + if (commit_graph) + return parse_commit_in_graph_one(commit_graph, item); return 0; } -- 2.17.0.39.g685157f7fb
Re: [PATCH v3 8/9] commit-graph: always load commit-graph information
On 4/17/2018 1:00 PM, Derrick Stolee wrote: Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 51 -- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); The reference to 'commit_graph' in the above line should be 'g'. Sorry! + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index a70f120878..9ef6f699bd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_comm
Re: [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
On 4/17/2018 2:10 PM, Derrick Stolee wrote: The commit-graph feature is not useful to end users until the commit-graph file is maintained automatically by Git during normal upkeep operations. One natural place to trigger this write is during 'git gc'. Before automatically generating a commit-graph, we need to be able to verify the contents of a commit-graph file. Integrate commit-graph checks into 'fsck' that check the commit-graph contents against commits in the object database. Things to think about: * Are these the right integration points? * gc.commitGraph defaults to true right now for the purpose of testing, but may not be required to start. The goal is to have this default to true eventually, but we may want to delay that until the feature is stable. * I implement a "--reachable" option to 'git commit-graph write' that iterates over all refs. This does the same as git show-ref -s | git commit-graph write --stdin-commits but I don't know how to pipe two child processes together inside of Git. Perhaps this is a better solution, anyway. What other things should I be considering in this case? I'm unfamiliar with the inner-workings of 'fsck' and 'gc', so this is a new space for me. This RFC is based on v3 of ds/generation-numbers, and the first commit is a fixup! based on a bug in that version that I caught while prepping this series. Thanks, -Stolee Derrick Stolee (12): fixup! commit-graph: always load commit-graph information commit-graph: add 'check' subcommand commit-graph: check file header information commit-graph: parse commit from chosen graph commit-graph: check fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb fsck: check commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/git-commit-graph.txt | 15 +- Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 9 -- builtin/commit-graph.c | 79 +- builtin/fsck.c | 13 ++ builtin/gc.c | 8 + commit-graph.c | 178 ++- commit-graph.h | 2 + commit.c | 14 +- commit.h | 1 + t/t5318-commit-graph.sh | 15 ++ 11 files changed, 311 insertions(+), 27 deletions(-) This RFC is also available as a GitHub pull request [1] [1] https://github.com/derrickstolee/git/pull/6
Re: [PATCH v10 00/36] Add directory rename detection to git
On 4/19/2018 2:41 PM, Stefan Beller wrote: On Thu, Apr 19, 2018 at 11:35 AM, Elijah Newrenwrote: On Thu, Apr 19, 2018 at 10:57 AM, Elijah Newren wrote: This series is a reboot of the directory rename detection series that was merged to master and then reverted due to the final patch having a buggy can-skip-update check, as noted at https://public-inbox.org/git/xmqqmuya43cs@gitster-ct.c.googlers.com/ This series based on top of master. ...and merges cleanly to next but apparently has some minor conflicts with both ds/lazy-load-trees and ps/test-chmtime-get from pu. What's the preferred way to resolve this? Rebase and resubmit my series on pu, or something else? If you were to base it off of pu, this series would depend on all other series that pu contains. This is bad for the progress of this series. (If it were to be merged to next, all other series would automatically merge to next as well) If the conflicts are minor, then Junio resolves them; if you want to be nice, pick your merge point as git checkout origin/master git merge ds/lazy-load-trees git merge ps/test-chmtime-get git tag my-anchor and put the series on top of that anchor. If you do this, you'd want to be reasonably sure that those two series are not in too much flux. I believe ds/lazy-load-trees is queued for 'next'. I'm not surprised that there are some conflicts here. Any reference to the 'tree' member of a commit should be replaced with 'get_commit_tree(c)', or 'get_commit_tree_oid(c)' if you only care about the tree's object id. I think Stefan's suggestion is the best approach to get the right conflicts out of the way. Thanks, -Stolee
Re: [RFC 0/1] Tolerate broken headers in `packed-refs` files
On 3/26/2018 8:42 AM, Michael Haggerty wrote: [...] But there might be some tools out in the wild that have been writing broken headers. In that case, users who upgrade Git might suddenly find that they can't read repositories that they could read before. In fact, a tool that we wrote and use internally at GitHub was doing exactly that, which is how we discovered this "problem". This patch shows what it would look like to relax the parsing again, albeit *only* for the first line of the file, and *only* for lines that start with '#'. The problem with this patch is that it would make it harder for people who implement broken tools in the future to discover their mistakes. The only result of the error would be that it is slower to work with the `packed-refs` files that they wrote. Such an error could go undiscovered for a long time. My opinion is that we shouldn't maintain back-compat with formats that may have been written by another tool because Git wasn't strict about it. As long as Git never wrote files with these formats, then they shouldn't be supported. You are absolutely right that staying strict will help discover the tools that are writing an incorrect format. Since most heavily-used tools that didn't spawn Git processes use LibGit2 to interact with Git repos, I added Ed Thomson to CC to see if libgit2 could ever write these bad header comments. Thanks for writing this RFC so we can have the discussion and more quickly identify this issue if/when users are broken. Thanks, -Stolee
Re: [PATCH 4/3] sha1_name: use bsearch_pack() in unique_in_pack()
On 3/24/2018 12:41 PM, René Scharfe wrote: Replace the custom binary search in unique_in_pack() with a call to bsearch_pack(). This reduces code duplication and makes use of the fan-out table of packs. Signed-off-by: Rene Scharfe <l@web.de> --- This is basically the same replacement as done by patch 3. Speed is less of a concern here -- at least I don't know a commonly used command that needs to resolve lots of short hashes. Thanks, René! Good teamwork on this patch series. Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [PATCH] unpack-trees: release oid_array after use in check_updates()
On 3/25/2018 12:31 PM, René Scharfe wrote: Signed-off-by: Rene Scharfe--- That leak was introduced by c0c578b33c (unpack-trees: batch fetching of missing blobs). unpack-trees.c | 1 + 1 file changed, 1 insertion(+) diff --git a/unpack-trees.c b/unpack-trees.c index d5685891a5..e73745051e 100644 --- a/unpack-trees.c +++ b/unpack-trees.c @@ -379,30 +379,31 @@ static int check_updates(struct unpack_trees_options *o) struct oid_array to_fetch = OID_ARRAY_INIT; int fetch_if_missing_store = fetch_if_missing; fetch_if_missing = 0; for (i = 0; i < index->cache_nr; i++) { struct cache_entry *ce = index->cache[i]; if ((ce->ce_flags & CE_UPDATE) && !S_ISGITLINK(ce->ce_mode)) { if (!has_object_file(>oid)) oid_array_append(_fetch, >oid); } } if (to_fetch.nr) fetch_objects(repository_format_partial_clone, _fetch); fetch_if_missing = fetch_if_missing_store; + oid_array_clear(_fetch); } for (i = 0; i < index->cache_nr; i++) { struct cache_entry *ce = index->cache[i]; if (ce->ce_flags & CE_UPDATE) { if (ce->ce_flags & CE_WT_REMOVE) die("BUG: both update and delete flags are set on %s", ce->name); display_progress(progress, ++cnt); ce->ce_flags &= ~CE_UPDATE; if (o->update && !o->dry_run) { errs |= checkout_entry(ce, , NULL); } } } Ack. Looks correct. -Stolee
Re: [ANNOUNCE] Git v2.17.0-rc1
On 3/25/2018 2:42 PM, Ævar Arnfjörð Bjarmason wrote: On Sun, Mar 25 2018, Derrick Stolee wrote: On 3/23/2018 1:59 PM, Ævar Arnfjörð Bjarmason wrote: On Wed, Mar 21 2018, Junio C. Hamano wrote: A release candidate Git v2.17.0-rc1 is now available for testing at the usual places. It is comprised of 493 non-merge commits since v2.16.0, contributed by 62 people, 19 of which are new faces. I have this deployed on some tens of K machines who all use git in one way or another (from automated pulls, to users interactively), and rc0 before that, with a few patches on top from me + Takato + Duy + Derrick since rc0 was released (and since today based on top of rc1). No issues so far. The specific in-house version I have is at: https://github.com/git/git/compare/v2.17.0-rc1...bookingcom:booking-git-v2018-03-23-1 Thanks for testing the commit-graph feature, Ævar! I'm guessing you have some mechanisms to ensure the 'git commit-graph write' command is run on these machines and 'core.commitGraph' is set to true in the config? I would love to hear how this benefits your org. I haven't deployed any actual use of it at a wider scale, but I've done some ad-hoc benchmarking with our internal version which has your patches, and the results are very promising so far on the isolated test cases where it helps (that you know about, e.g. rev-list --all). So sorry, I don't have any meaningful testing of this, I just wanted an easy way to ad-hoc test it & make sure it doesn't break other stuff for now. I also threw out most of the manual git maintenance stuff we had and just rely on gc --auto now, so as soon as you have something to integrate with that, along with those perf changes Peff suggested I'm much more likely to play with it in some real way. Thanks. Integration with 'gc --auto' is a high priority for me after the patch lands. The version on GitHub [1] is slightly ahead of v6 as I wait to reroll on v2.17.0. It includes Peff's improvements to inspecting pack-indexes [2]. [1] https://github.com/derrickstolee/git/pull/2 [2] https://github.com/derrickstolee/git/pull/2/commits/cb86817ee5c5127b32c93a22ef130f0db6207970
Re: [ANNOUNCE] Git v2.17.0-rc1
On 3/23/2018 1:59 PM, Ævar Arnfjörð Bjarmason wrote: On Wed, Mar 21 2018, Junio C. Hamano wrote: A release candidate Git v2.17.0-rc1 is now available for testing at the usual places. It is comprised of 493 non-merge commits since v2.16.0, contributed by 62 people, 19 of which are new faces. I have this deployed on some tens of K machines who all use git in one way or another (from automated pulls, to users interactively), and rc0 before that, with a few patches on top from me + Takato + Duy + Derrick since rc0 was released (and since today based on top of rc1). No issues so far. The specific in-house version I have is at: https://github.com/git/git/compare/v2.17.0-rc1...bookingcom:booking-git-v2018-03-23-1 Thanks for testing the commit-graph feature, Ævar! I'm guessing you have some mechanisms to ensure the 'git commit-graph write' command is run on these machines and 'core.commitGraph' is set to true in the config? I would love to hear how this benefits your org. Thanks, -Stolee
Re: [PATCH v4 00/13] Serialized Git Commit Graph
On 3/30/2018 7:10 AM, Jakub Narebski wrote: I hope that I am addressing the most recent version of this series. Hi Jakub. Thanks for the interest in this patch series. The most-recent version is v6 [1], but I will re-roll to v7 soon (after v2.17.0 is marked). [1] https://public-inbox.org/git/20180314192736.70602-1-dsto...@microsoft.com/T/#u Derrick Stolee <sto...@gmail.com> writes: As promised [1], this patch contains a way to serialize the commit graph. The current implementation defines a new file format to store the graph structure (parent relationships) and basic commit metadata (commit date, root tree OID) in order to prevent parsing raw commits while performing basic graph walks. For example, we do not need to parse the full commit when performing these walks: * 'git log --topo-order -1000' walks all reachable commits to avoid incorrect topological orders, but only needs the commit message for the top 1000 commits. * 'git merge-base ' may walk many commits to find the correct boundary between the commits reachable from A and those reachable from B. No commit messages are needed. * 'git branch -vv' checks ahead/behind status for all local branches compared to their upstream remote branches. This is essentially as hard as computing merge bases for each. The current patch speeds up these calculations by injecting a check in parse_commit_gently() to check if there is a graph file and using that to provide the required metadata to the struct commit. That's nice. What are the assumptions about the serialized commit graph format? Does it need to be: - extensible without rewriting (e.g. append-only)? - like the above, but may need rewriting for optimal performance? - extending it needs to rewrite whole file? Excuse me if it waas already asked and answered. It is not extensible without rewriting. Reducing write time was not a main goal, since the graph will be written only occasionally during data management phases (like 'gc' or 'repack'; this integration is not implemented yet). The file format has room to store generation numbers, which will be provided as a patch after this framework is merged. Generation numbers are referenced by the design document but not implemented in order to make the current patch focus on the graph construction process. Once that is stable, it will be easier to add generation numbers and make graph walks aware of generation numbers one-by-one. As the serialized commit graph format is versioned, I wonder if it would be possible to speed up graph walks even more by adding to it FELINE index (pair of numbers) from "Reachability Queries in Very Large Graphs: A Fast Refined Olnine Search Approach" (2014) - available at http://openproceedings.org/EDBT/2014/paper_166.pdf The implementation would probably need adjustments to make it unambiguous and unambiguously extensible; unless there is place for indices that are local-only and need to be recalculated from scratch when graph changes (to cover all graph). The chunk-based format is intended to allow extra indexes like the one you recommend, without needing to increase the version number. Using an optional chunk allows older versions of Git to read the file without error, since the data is "extra", and newer versions can take advantage of the acceleration. At one point, I was investigating these reachability indexes (I read "SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn, Ruan, Dey, and Xu [2]) but find the question that these indexes target to be lacking for most of the Git uses. That is, they ask the boolean question "Can X reach Y?". More often, Git needs to answer "What is the set of commits reachable from X but not from Y" or "Topologically sort commits reachable from X" or "How many commits are in each part of the symmetric difference between reachable from X or reachable from Y?" The case for "Can X reach Y?" is mostly for commands like 'git branch --contains', when 'git fetch' checks for forced-updates of branches, or when the server decides enough negotiation has occurred during a 'git fetch'. While these may be worth investigating, they also benefit greatly from the accelerated graph walk introduced in the current format. I would be happy to review any effort to extend the commit-graph format to include such indexes, as long as the performance benefits outweigh the complexity to create them. [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf Here are some performance results for a copy of the Linux repository where 'master' has 704,766 reachable commits and is behind 'origin/master' by 19,610 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 5.9s | 0.7s | -88% | | branch -vv
Re: [PATCH v4 01/13] commit-graph: add format document
On 3/30/2018 9:25 AM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: +== graph-*.graph files have the following format: What is this '*' here? No longer necessary. It used to be a placeholder for a hash value, but now the graph is stored in objects/info/commit-graph. [...] + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. Does Git need to understand all chunks, or could there be optional chunks that can be safely ignored (like in PNG format)? Though this may be overkill, and could be left for later revision of the format if deemed necessary. In v6, the format and design documents are edited to make clear the use of optional chunks, specifically for future extension without increasing the version number. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). All right, it is small enough that can be required even for a very small number of commits. + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) Do commits need to be put here in the ascending order of OIDs? Yes. If so, this would mean that it is not possible to add information about new commits by only appending data and maybe overwriting some fields, I think. You would need to do full rewrite to insert new commit in appropriate place. That is the idea. This file is not updated with every new commit, but instead will be updated on some scheduled cleanup events. The commit-graph file is designed in a way to be non-critical, and not tied to the packfile layout. This allows flexibility for when to do the write. For example, in GVFS, we will write a new commit-graph when there are new daily prefetch packs. This could also integrate with 'gc' and 'repack' so whenever they are triggered the commit-graph is written as well. Commits that do not exist in the commit-graph file will load from the object database as normal (after a failed lookup in the commit-graph file). +* The first H bytes are for the OID of the root tree. +* The next 8 bytes are for the int-ids of the first two parents + of the ith commit. Stores value 0x if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. +* The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of int-ids + for the parents until reaching a value with the most-significant bit on. + The other bits correspond to the int-id of the last parent. All right, that is one chunk that cannot use fixed-length records; this shouldn't matter much, as we iterate only up to the number of parents less two. Less one: the second "parent" column of the commit data chunk is used to point into this list, so (P-1) parents are in this chunk for a commit with P parents. A question: what happens to the last list of parents? Is there a guardian value of 0x at last place? The termination condition is in the position of the last parent, since the most-significant bit is on. The other 31 bits contain the int-id of the parent. Thanks, -Stolee
Re: [PATCH v4 00/13] Serialized Git Commit Graph
On 4/2/2018 10:46 AM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: [...] At one point, I was investigating these reachability indexes (I read "SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn, Ruan, Dey, and Xu [2]) but find the question that these indexes target to be lacking for most of the Git uses. That is, they ask the boolean question "Can X reach Y?". More often, Git needs to answer "What is the set of commits reachable from X but not from Y" or "Topologically sort commits reachable from X" or "How many commits are in each part of the symmetric difference between reachable from X or reachable from Y?" In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf, Meira and Zaki FELINE-index work, authors mention SCARAB as something that can be used in addition to FELINE-index, as a complementary data (FELINE-SCARAB in the work, section 4.4). I see the FELINE-index as a stronger form of generation numbers (called also level of the vertex / node), in that it allows to negative-cut even more, pruning paths that are known to be unreachable (or marking nodes known to be unreachable in the "calculate difference" scenario). Also, FELINE-index uses two integer numbers (coordinates in 2d space); one of those indices can be the topological numbering (topological sorting order) of nodes in the commit graph. That would help to answer even more Git questions. This two-dimensional generation number is helpful for non-reachability queries, but is something that needs the "full" commit graph in order to define the value for a single commit (hence the O(N lg N) performance mentioned below). Generation numbers are effective while being easy to compute and immutable. I wonder if FELINE was compared directly to a one-dimensional index (I apologize that I have not read the paper in detail, so I don't understand the indexes they compare against). It also appears the graphs they use for their tests are not commit graphs, which have a different shape than many of the digraphs studies by that work. This is all to say: I would love to see an interesting study in this direction, specifically comparing this series' definition of generation numbers to the 2-dimensional system in FELINE, and on a large sample of commit graphs available in open-source data sets (Linux kernel, Git, etc.) and possibly on interesting closed-source data sets. The case for "Can X reach Y?" is mostly for commands like 'git branch --contains', when 'git fetch' checks for forced-updates of branches, or when the server decides enough negotiation has occurred during a 'git fetch'. While these may be worth investigating, they also benefit greatly from the accelerated graph walk introduced in the current format. I would be happy to review any effort to extend the commit-graph format to include such indexes, as long as the performance benefits outweigh the complexity to create them. [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the storage complexity is 2*|V|. This would be very easy to add as an optional chunk, since it can use one row per commit. Thanks, -Stolee
Re: [PATCH v4 00/13] Serialized Git Commit Graph
On 4/2/2018 1:35 PM, Stefan Beller wrote: On Mon, Apr 2, 2018 at 8:02 AM, Derrick Stolee <sto...@gmail.com> wrote: I would be happy to review any effort to extend the commit-graph format to include such indexes, as long as the performance benefits outweigh the complexity to create them. [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396=rep1=pdf The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the storage complexity is 2*|V|. This would be very easy to add as an optional chunk, since it can use one row per commit. Given this discussion, I wonder if we want to include generation numbers as a first class citizen in the current format. They could also go as an optional chunk and we may want to discuss further if we want generation numbers or FELINE numbers or GRAIL or SCARAB, which are all graph related speedup mechanism AFAICT. In case we decide against generation numbers in the long run, the row of mandatory generation numbers would be dead weight that we'd need to carry. Currently, the format includes 8 bytes to share between the generation number and commit date. Due to alignment concerns, we will want to keep this as 8 bytes or truncate it to 4-bytes. Either we would be wasting at least 3 bytes or truncating dates too much (presenting the 2038 problem [1] since dates are signed). I only glanced at the paper, but it looks like a "more advanced 2d generation number" that seems to be able to answer questions that gen numbers can answer, but that paper also refers to SCARAB as well as GRAIL as the state of the art, so maybe there are even more papers to explore? The biggest reason I can say to advance this series (and the small follow-up series that computes and consumes generation numbers) is that generation numbers are _extremely simple_. You only need to know your parents and their generation numbers to compute your own. These other reachability indexes require examining the entire graph to create "good" index values. The hard part about using generation numbers (or any other reachability index) in Git is refactoring the revision-walk machinery to take advantage of them; current code requires O(reachable commits) to topo-order instead of O(commits that will be output). I think we should table any discussion of these advanced indexes until that work is done and a valuable comparison can be done. "Premature optimization is the root of all evil" and all that. Thanks, -Stolee [1] https://en.wikipedia.org/wiki/Year_2038_problem
[PATCH v7 13/14] commit-graph: build graph from starting commits
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 14 +- builtin/commit-graph.c | 27 +-- commit-graph.c | 27 +-- commit-graph.h | 4 +++- t/t5318-commit-graph.sh| 13 + 5 files changed, 75 insertions(+), 10 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 8143cc3f07..442ac243e6 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -36,7 +36,13 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by -walking objects only in the specified pack-indexes. +walking objects only in the specified pack-indexes. (Cannot be combined +with --stdin-commits.) ++ +With the `--stdin-commits` option, generate the new commit graph by +walking commits starting at the commits specified in stdin as a list +of OIDs in hex, one OID per line. (Cannot be combined with +--stdin-packs.) 'read':: @@ -60,6 +66,12 @@ $ git commit-graph write $ echo | git commit-graph write --stdin-packs +* Write a graph file containing all reachable commits. ++ + +$ git show-ref -s | git commit-graph write --stdin-commits + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 9f83c872e9..f5fc717b8f 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; int stdin_packs; + int stdin_commits; } opts; static int graph_read(int argc, const char **argv) @@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; int packs_nr = 0; + const char **commit_hex = NULL; + int commits_nr = 0; const char **lines = NULL; int lines_nr = 0; int lines_alloc = 0; @@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv) N_("The object directory to store the graph")), OPT_BOOL(0, "stdin-packs", _packs, N_("scan pack-indexes listed by stdin for commits")), + OPT_BOOL(0, "stdin-commits", _commits, + N_("start walk at commits listed by stdin")), OPT_END(), }; @@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); + if (opts.stdin_packs && opts.stdin_commits) + die(_("cannot use both --stdin-commits and --stdin-packs")); if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; lines_nr = 0; lines_alloc = 128; @@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv) lines[lines_nr++] = strbuf_detach(, NULL); } - pack_indexes = lines; - packs_nr = lines_nr; + if (opts.stdin_packs) { + pack_indexes = lines; +
[PATCH v7 11/14] commit: integrate commit graph with commit parsing
From: Derrick Stolee <dsto...@microsoft.com> Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. If core.commitGraph is false, then do not check graph files. In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 8.31s | 0.94s | -88% | | branch -vv | 1.02s | 0.14s | -86% | | rev-list --all | 5.89s | 1.07s | -81% | | rev-list --all --objects | 66.15s | 58.45s | -11% | Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c | 1 + commit-graph.c | 141 +++- commit-graph.h | 12 commit.c| 3 + commit.h| 3 + t/t5318-commit-graph.sh | 47 +- 6 files changed, 205 insertions(+), 2 deletions(-) diff --git a/alloc.c b/alloc.c index 12afadfacd..cf4f8b61e1 100644 --- a/alloc.c +++ b/alloc.c @@ -93,6 +93,7 @@ void *alloc_commit_node(void) struct commit *c = alloc_node(_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); + c->graph_pos = COMMIT_NOT_FROM_GRAPH; return c; } diff --git a/commit-graph.c b/commit-graph.c index ea29c5c2d8..983454785e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -38,7 +38,6 @@ #define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ GRAPH_OID_LEN + 8) - char *get_commit_graph_filename(const char *obj_dir) { return xstrfmt("%s/info/commit-graph", obj_dir); @@ -179,6 +178,145 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) exit(1); } +/* global storage */ +struct commit_graph *commit_graph = NULL; + +static void prepare_commit_graph_one(const char *obj_dir) +{ + char *graph_name; + + if (commit_graph) + return; + + graph_name = get_commit_graph_filename(obj_dir); + commit_graph = load_commit_graph_one(graph_name); + + FREE_AND_NULL(graph_name); +} + +static int prepare_commit_graph_run_once = 0; +static void prepare_commit_graph(void) +{ + struct alternate_object_database *alt; + char *obj_dir; + + if (prepare_commit_graph_run_once) + return; + prepare_commit_graph_run_once = 1; + + obj_dir = get_object_directory(); + prepare_commit_graph_one(obj_dir); + prepare_alt_odb(); + for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next) + prepare_commit_graph_one(alt->path); +} + +static void close_commit_graph(void) +{ + if (!commit_graph) + return; + + if (commit_graph->graph_fd >= 0) { + munmap((void *)commit_graph->data, commit_graph->data_len); + commit_graph->data = NULL; + close(commit_graph->graph_fd); + } + + FREE_AND_NULL(commit_graph); +} + +static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos) +{ + return bsearch_hash(oid->hash, g->chunk_oid_fanout, + g->chunk_oid_lookup, g->hash_len, pos); +} + +static struct commit_list **insert_parent_or_die(struct commit_graph *g, +uint64_t pos, +struct commit_list **pptr) +{ + struct commit *c; + struct object_id oid; + hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); + c = lookup_commit(); + if (!c) + die("could not find commit %s", oid_to_hex()); + c->graph_pos = pos; + return _list_insert(c, pptr)->next; +} + +static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + struct object_id oid; + uint32_t edge_value; + uint32_t *parent_data_ptr; + uint64_t date_low, date_high; + struct commit_list **pptr; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + + item->object.parsed = 1; + item->graph_pos = pos; + + hashcpy(oid.hash, commit_data); + item->tree = lookup_tree(); + + date_high = get_be32(commit_data + g->
[PATCH v7 01/14] csum-file: rename hashclose() to finalize_hashfile()
From: Derrick Stolee <dsto...@microsoft.com> The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/index-pack.c | 2 +- builtin/pack-objects.c | 6 +++--- bulk-checkin.c | 4 ++-- csum-file.c| 2 +- csum-file.h| 4 ++-- fast-import.c | 2 +- pack-bitmap-write.c| 2 +- pack-write.c | 4 ++-- 8 files changed, 13 insertions(+), 13 deletions(-) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index bda84a92ef..8bcf280e0b 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1270,7 +1270,7 @@ static void conclude_pack(int fix_thin_pack, const char *curr_pack, unsigned cha nr_objects - nr_objects_initial); stop_progress_msg(, msg.buf); strbuf_release(); - hashclose(f, tail_hash, 0); + finalize_hashfile(f, tail_hash, 0); hashcpy(read_hash, pack_hash); fixup_pack_header_footer(output_fd, pack_hash, curr_pack, nr_objects, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index e9d3cfb9e3..ab3e80ee49 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -837,11 +837,11 @@ static void write_pack_file(void) * If so, rewrite it like in fast-import */ if (pack_to_stdout) { - hashclose(f, oid.hash, CSUM_CLOSE); + finalize_hashfile(f, oid.hash, CSUM_CLOSE); } else if (nr_written == nr_remaining) { - hashclose(f, oid.hash, CSUM_FSYNC); + finalize_hashfile(f, oid.hash, CSUM_FSYNC); } else { - int fd = hashclose(f, oid.hash, 0); + int fd = finalize_hashfile(f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, pack_tmp_name, nr_written, oid.hash, offset); close(fd); diff --git a/bulk-checkin.c b/bulk-checkin.c index 9d87eac07b..227cc9f3b1 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -35,9 +35,9 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) unlink(state->pack_tmp_name); goto clear_exit; } else if (state->nr_written == 1) { - hashclose(state->f, oid.hash, CSUM_FSYNC); + finalize_hashfile(state->f, oid.hash, CSUM_FSYNC); } else { - int fd = hashclose(state->f, oid.hash, 0); + int fd = finalize_hashfile(state->f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name, state->nr_written, oid.hash, state->offset); diff --git a/csum-file.c b/csum-file.c index 5eda7fb6af..e6c95a6915 100644 --- a/csum-file.c +++ b/csum-file.c @@ -53,7 +53,7 @@ void hashflush(struct hashfile *f) } } -int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags) +int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int flags) { int fd; diff --git a/csum-file.h b/csum-file.h index 992e5c0141..9ba87f0a6c 100644 --- a/csum-file.h +++ b/csum-file.h @@ -26,14 +26,14 @@ struct hashfile_checkpoint { extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *); extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *); -/* hashclose flags */ +/* finalize_hashfile flags */ #define CSUM_CLOSE 1 #define CSUM_FSYNC 2 extern struct hashfile *hashfd(int fd, const char *name); extern struct hashfile *hashfd_check(const char *name); extern struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp); -extern int hashclose(struct hashfile *, unsigned char *, unsigned int); +extern int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int); extern void hashwrite(struct hashfile *, const void *, unsigned int); extern void hashflush(struct hashfile *f); extern void crc32_begin(struct hashfile *); diff --git a/fast-import.c b/fast-import.c index b5db5d20b1..6d96f55d9d 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1016,7 +1016,7 @@ static void end_packfile(void) struct tag *t; close_pack_windows(pack_data); - hashclose(pack_file, cur_pack_oid.hash, 0); + finalize_hashfile(pack_file, cur_pack_oid.hash, 0); fixup_pack_header_footer(pack_data->pack_fd, pack_data->sha1,
[PATCH v7 03/14] commit-graph: add format document
From: Derrick Stolee <dsto...@microsoft.com> Add document specifying the binary format for commit graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. The format automatically includes two parent positions for every commit. This favors speed over space, since using only one position per commit would cause an extra level of indirection for every merge commit. (Octopus merges suffer from this indirection, but they are very rare.) Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .../technical/commit-graph-format.txt | 97 +++ 1 file changed, 97 insertions(+) create mode 100644 Documentation/technical/commit-graph-format.txt diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 00..ad6af8105c --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,97 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as unsigned 32-bit integers +corresponding to the array position withing the list of commit OIDs. We +use the most-significant bit for special purposes, so we can store at most +(1 << 31) - 1 (around 2 billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte (reserved for later use) + Current clients should ignore this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + ID appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) +* The first H bytes are for the OID of the root tree. +* The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0x if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. +* The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through
[PATCH v7 12/14] commit-graph: read only from specific pack-indexes
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 11 +- builtin/commit-graph.c | 33 +++--- commit-graph.c | 26 +-- commit-graph.h | 4 +++- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5318-commit-graph.sh| 10 + 7 files changed, 81 insertions(+), 9 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 8aad8303f5..8143cc3f07 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -34,7 +34,9 @@ COMMANDS 'write':: Write a commit graph file based on the commits found in packfiles. -Includes all commits from the existing commit graph file. ++ +With the `--stdin-packs` option, generate the new commit graph by +walking objects only in the specified pack-indexes. 'read':: @@ -51,6 +53,13 @@ EXAMPLES $ git commit-graph write +* Write a graph file, extending the current graph file using commits +* in . ++ + +$ echo | git commit-graph write --stdin-packs + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index e3f67401fb..9f83c872e9 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; @@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int stdin_packs; } opts; static int graph_read(int argc, const char **argv) @@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv) static int graph_write(int argc, const char **argv) { + const char **pack_indexes = NULL; + int packs_nr = 0; + const char **lines = NULL; + int lines_nr = 0; + int lines_alloc = 0; + static struct option builtin_commit_graph_write_options[] = { OPT_STRING(0, "object-dir", _dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "stdin-packs", _packs, + N_("scan pack-indexes listed by stdin for commits")), OPT_END(), }; @@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv) if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - write_commit_graph(opts.obj_dir); + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + lines_nr = 0; + lines_alloc = 128; + ALLOC_ARRAY(lines, lines_alloc); + + while (strbuf_getline(, stdin) != EOF) { + ALLOC_GROW(lines, lines_nr + 1, lines_alloc); + lines[lines_nr++] = strbuf_detach(, NULL); + } + + pack_indexes = lines; + packs_nr = lines_nr; + } + + write_commit_graph(opts.obj_dir, + pack_indexes, + packs_nr); + return 0; } diff --git a/commit-graph.c b/commit-graph.c index 983454785e..fa19b83a8e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -549,7 +549,9 @@ static void close_reachable(struct packed_oid_list *oids) } } -void write_commit_graph(const char *obj_dir) +void write_commit_graph(const char *obj_dir, + const char **pack_indexes, + int nr_packs) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -571,7 +573,27 @@ void write_commit_graph(const char *obj_dir) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); - for_each_packed_object(add_packed_commits, , 0); + if (pack_indexes) { + struct strbuf packname = STRBUF_I
[PATCH v7 02/14] csum-file: refactor finalize_hashfile() method
From: Derrick Stolee <dsto...@microsoft.com> If we want to use a hashfile on the temporary file for a lockfile, then we need finalize_hashfile() to fully write the trailing hash but also keep the file descriptor open. Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional change that checks this flag before writing the checksum to the stream. This differs from previous behavior since it would be written if either CSUM_CLOSE or CSUM_FSYNC is provided. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/pack-objects.c | 4 ++-- bulk-checkin.c | 2 +- csum-file.c| 8 csum-file.h| 5 +++-- pack-bitmap-write.c| 2 +- pack-write.c | 5 +++-- 6 files changed, 14 insertions(+), 12 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index ab3e80ee49..b09bbf4f4c 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -837,9 +837,9 @@ static void write_pack_file(void) * If so, rewrite it like in fast-import */ if (pack_to_stdout) { - finalize_hashfile(f, oid.hash, CSUM_CLOSE); + finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE); } else if (nr_written == nr_remaining) { - finalize_hashfile(f, oid.hash, CSUM_FSYNC); + finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); } else { int fd = finalize_hashfile(f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, pack_tmp_name, diff --git a/bulk-checkin.c b/bulk-checkin.c index 227cc9f3b1..70b14fdf41 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -35,7 +35,7 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) unlink(state->pack_tmp_name); goto clear_exit; } else if (state->nr_written == 1) { - finalize_hashfile(state->f, oid.hash, CSUM_FSYNC); + finalize_hashfile(state->f, oid.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); } else { int fd = finalize_hashfile(state->f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name, diff --git a/csum-file.c b/csum-file.c index e6c95a6915..53ce37f7ca 100644 --- a/csum-file.c +++ b/csum-file.c @@ -61,11 +61,11 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl the_hash_algo->final_fn(f->buffer, >ctx); if (result) hashcpy(result, f->buffer); - if (flags & (CSUM_CLOSE | CSUM_FSYNC)) { - /* write checksum and close fd */ + if (flags & CSUM_HASH_IN_STREAM) flush(f, f->buffer, the_hash_algo->rawsz); - if (flags & CSUM_FSYNC) - fsync_or_die(f->fd, f->name); + if (flags & CSUM_FSYNC) + fsync_or_die(f->fd, f->name); + if (flags & CSUM_CLOSE) { if (close(f->fd)) die_errno("%s: sha1 file error on close", f->name); fd = 0; diff --git a/csum-file.h b/csum-file.h index 9ba87f0a6c..c5a2e335e7 100644 --- a/csum-file.h +++ b/csum-file.h @@ -27,8 +27,9 @@ extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *) extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *); /* finalize_hashfile flags */ -#define CSUM_CLOSE 1 -#define CSUM_FSYNC 2 +#define CSUM_CLOSE 1 +#define CSUM_FSYNC 2 +#define CSUM_HASH_IN_STREAM4 extern struct hashfile *hashfd(int fd, const char *name); extern struct hashfile *hashfd_check(const char *name); diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 662b44f97d..db4c832428 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -535,7 +535,7 @@ void bitmap_writer_finish(struct pack_idx_entry **index, if (options & BITMAP_OPT_HASH_CACHE) write_hash_cache(f, index, index_nr); - finalize_hashfile(f, NULL, CSUM_FSYNC); + finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); if (adjust_shared_perm(tmp_file.buf)) die_errno("unable to make temporary bitmap file readable"); diff --git a/pack-write.c b/pack-write.c index 044f427392..a9d46bc03f 100644 --- a/pack-write.c +++ b/pack-write.c @@ -170,8 +170,9 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec } hashwrite(f, sha1, the_hash_algo->rawsz); - finalize_hashfile(f, NULL, ((opts->flags & WRITE_IDX_VERIFY) - ? CSUM_CLOSE : CSUM_FSYNC)); + finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE | +
[PATCH v7 14/14] commit-graph: implement "--additive" option
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to add all commits from the existing commit-graph file to the file about to be written. This should be used when adding new commits without performing garbage collection. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 10 ++ builtin/commit-graph.c | 10 +++--- commit-graph.c | 17 - commit-graph.h | 3 ++- t/t5318-commit-graph.sh| 10 ++ 5 files changed, 45 insertions(+), 5 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 442ac243e6..4c97b555cc 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -43,6 +43,9 @@ With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with --stdin-packs.) ++ +With the `--append` option, include all commits that are present in the +existing commit-graph file. 'read':: @@ -72,6 +75,13 @@ $ echo | git commit-graph write --stdin-packs $ git show-ref -s | git commit-graph write --stdin-commits +* Write a graph file containing all commits in the current +* commit-graph file along with those reachable from HEAD. ++ + +$ git rev-parse HEAD | git commit-graph write --stdin-commits --append + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index f5fc717b8f..41c4f76caf 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; @@ -26,6 +26,7 @@ static struct opts_commit_graph { const char *obj_dir; int stdin_packs; int stdin_commits; + int append; } opts; static int graph_read(int argc, const char **argv) @@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv) N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", _commits, N_("start walk at commits listed by stdin")), + OPT_BOOL(0, "append", , + N_("include all commits already in the commit-graph file")), OPT_END(), }; @@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv) pack_indexes, packs_nr, commit_hex, - commits_nr); + commits_nr, + opts.append); return 0; } diff --git a/commit-graph.c b/commit-graph.c index 253bc2213a..1fc63d541b 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -553,7 +553,8 @@ void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, const char **commit_hex, - int nr_commits) + int nr_commits, + int append) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -571,10 +572,24 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; oids.alloc = approximate_object_count() / 4; + if (append) { + prepare_commit_graph_one(obj_dir); + if (commit_graph) + oids.alloc += commit_graph->num_commits; + } + if (oids.alloc < 1024) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); + if (append && commit_graph) { + for (i = 0; i < commit_graph->num_commits; i++) { + const unsigned char *hash = commit_graph->chunk_oid_lookup + + commit_graph->hash_len * i; +
[PATCH v7 09/14] commit-graph: add core.commitGraph setting
From: Derrick Stolee <dsto...@microsoft.com> The commit graph feature is controlled by the new core.commitGraph config setting. This defaults to 0, so the feature is opt-in. The intention of core.commitGraph is that a user can always stop checking for or parsing commit graph files if core.commitGraph=0. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/config.txt | 4 cache.h | 1 + config.c | 5 + environment.c| 1 + 4 files changed, 11 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 4e0cff87f6..e5c7013fb0 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -898,6 +898,10 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.commitGraph:: + Enable git commit graph feature. Allows reading from the + commit-graph file. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. diff --git a/cache.h b/cache.h index a61b2d3f0d..8bdbcbbbf7 100644 --- a/cache.h +++ b/cache.h @@ -805,6 +805,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; +extern int core_commit_graph; extern int core_apply_sparse_checkout; extern int precomposed_unicode; extern int protect_hfs; diff --git a/config.c b/config.c index b0c20e6cb8..25ee4a676c 100644 --- a/config.c +++ b/config.c @@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.commitgraph")) { + core_commit_graph = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index d6dd64662c..8853e2f0dd 100644 --- a/environment.c +++ b/environment.c @@ -62,6 +62,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED; enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; +int core_commit_graph; int core_apply_sparse_checkout; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ -- 2.17.0.14.gba1221a8ce
[PATCH v7 00/14] Serialized Git Commit Graph
This patch has only a few changes since v6: * Fixed whitespace issues using 'git rebase --whitespace=fix' * The --stdin-packs docs now refer to "pack-indexes" insead of "packs" * Modified description of --object-dir option to warn use is rare * Replaced '--additive' with '--append' * In "commit-graph: close under reachability" I greatly simplified the check that every reachable commit is included. While running tests I noticed that the revision walk machinery could not keep up with a very large queue created when combined with the '--append' option that added all commits from the existing file as starting points for the walk. The new algorithm simply appends missing commits to the end of the list, which are then iterated to ensure their parents are in the list. I have a few patch series prepared that provide further performance improvments following this patch. -- >8 -- This patch contains a way to serialize the commit graph. The current implementation defines a new file format to store the graph structure (parent relationships) and basic commit metadata (commit date, root tree OID) in order to prevent parsing raw commits while performing basic graph walks. For example, we do not need to parse the full commit when performing these walks: * 'git log --topo-order -1000' walks all reachable commits to avoid incorrect topological orders, but only needs the commit message for the top 1000 commits. * 'git merge-base ' may walk many commits to find the correct boundary between the commits reachable from A and those reachable from B. No commit messages are needed. * 'git branch -vv' checks ahead/behind status for all local branches compared to their upstream remote branches. This is essentially as hard as computing merge bases for each. The current patch speeds up these calculations by injecting a check in parse_commit_gently() to check if there is a graph file and using that to provide the required metadata to the struct commit. The file format has room to store generation numbers, which will be provided as a patch after this framework is merged. Generation numbers are referenced by the design document but not implemented in order to make the current patch focus on the graph construction process. Once that is stable, it will be easier to add generation numbers and make graph walks aware of generation numbers one-by-one. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 8.31s | 0.94s | -88% | | branch -vv | 1.02s | 0.14s | -86% | | rev-list --all | 5.89s | 1.07s | -81% | | rev-list --all --objects | 66.15s | 58.45s | -11% | To test this yourself, run the following on your repo: git config core.commitGraph true git show-ref -s | git commit-graph write --stdin-commits The second command writes a commit graph file containing every commit reachable from your refs. Now, all git commands that walk commits will check your graph first before consulting the ODB. You can run your own performance comparisons by toggling the 'core.commitGraph' setting. [1] https://github.com/derrickstolee/git/pull/2 A GitHub pull request containing the latest version of this patch. Derrick Stolee (14): csum-file: rename hashclose() to finalize_hashfile() csum-file: refactor finalize_hashfile() method commit-graph: add format document graph: add commit graph design document commit-graph: create git-commit-graph builtin commit-graph: implement write_commit_graph() commit-graph: implement git-commit-graph write commit-graph: implement git commit-graph read commit-graph: add core.commitGraph setting commit-graph: close under reachability commit: integrate commit graph with commit parsing commit-graph: read only from specific pack-indexes commit-graph: build graph from starting commits commit-graph: implement "--additive" option .gitignore| 1 + Documentation/config.txt | 4 + Documentation/git-commit-graph.txt| 94 +++ .../technical/commit-graph-format.txt | 97 +++ Documentation/technical/commit-graph.txt | 163 Makefile | 2 + alloc.c | 1 + builtin.h | 1 + builtin/commit-graph.c| 171 builtin/index-pack.c | 2 +- builtin/pack-objects.c| 6 +- bulk-checkin.c
[PATCH v7 06/14] commit-graph: implement write_commit_graph()
From: Derrick Stolee <dsto...@microsoft.com> Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given object directory. Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Makefile | 1 + commit-graph.c | 359 + commit-graph.h | 6 + 3 files changed, 366 insertions(+) create mode 100644 commit-graph.c create mode 100644 commit-graph.h diff --git a/Makefile b/Makefile index a59b62bed1..26a23257e9 100644 --- a/Makefile +++ b/Makefile @@ -777,6 +777,7 @@ LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o LIB_OBJS += commit.o +LIB_OBJS += commit-graph.o LIB_OBJS += compat/obstack.o LIB_OBJS += compat/terminal.o LIB_OBJS += config.o diff --git a/commit-graph.c b/commit-graph.c new file mode 100644 index 00..f3f7c4f189 --- /dev/null +++ b/commit-graph.c @@ -0,0 +1,359 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "revision.h" +#include "sha1-lookup.h" +#include "commit-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_OCTOPUS_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +#define GRAPH_FANOUT_SIZE (4 * 256) +#define GRAPH_CHUNKLOOKUP_WIDTH 12 +#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ + GRAPH_OID_LEN + 8) + + +static char *get_commit_graph_filename(const char *obj_dir) +{ + return xstrfmt("%s/info/commit-graph", obj_dir); +} + +static void write_graph_chunk_fanout(struct hashfile *f, +struct commit **commits, +int nr_commits) +{ + int i, count = 0; + struct commit **list = commits; + + /* +* Write the first-level table (the list is sorted, +* but we use a 256-entry lookup to be able to avoid +* having to do eight extra binary search iterations). +*/ + for (i = 0; i < 256; i++) { + while (count < nr_commits) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + hashwrite_be32(f, count); + } +} + +static void write_graph_chunk_oids(struct hashfile *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + int count; + for (count = 0; count < nr_commits; count++, list++) + hashwrite(f, (*list)->object.oid.hash, (int)hash_len); +} + +static const unsigned char *commit_to_sha1(size_t index, void *table) +{ + struct commit **commits = table; + return commits[index]->object.oid.hash; +} + +static void write_graph_chunk_data(struct hashfile *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_extra_edges = 0; + + while (list < last) { + struct commit_list *parent; + int edge_value; + uint32_t packedDate[2]; + + parse_commit(*list); + hashwrite(f, (*list)->tree->object.oid.hash, hash_len); + + parent = (*list)->parents; + + if (!parent) + edge_value = GRAPH_PARENT_NONE; + else { + edge_value = sha1_pos(parent->item->object.oid.hash, + commits, + nr_commits, + commit_to_sha1); + + if (edge_value < 0) + edge_value = GRAPH_PARENT_MISSING; + } + + hashwrite_be32(f, edge_value); + +
[PATCH v7 04/14] graph: add commit graph design document
From: Derrick Stolee <dsto...@microsoft.com> Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 163 +++ 1 file changed, 163 insertions(+) create mode 100644 Documentation/technical/commit-graph.txt diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 00..0550c6d0dc --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,163 @@ +Git Commit Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. + +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +Design Details +-- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +--- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- The 'commit-graph' subcommand does not have a "verify" mode that is + necessary for integration with fsck. + +- The file format includes ro
[PATCH v7 10/14] commit-graph: close under reachability
From: Derrick Stolee <dsto...@microsoft.com> Teach write_commit_graph() to walk all parents from the commits discovered in packfiles. This prevents gaps given by loose objects or previously-missed packfiles. Also automatically add commits from the existing graph file, if it exists. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 + 1 file changed, 45 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index b1bd3a892d..ea29c5c2d8 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -367,6 +367,50 @@ static int add_packed_commits(const struct object_id *oid, return 0; } +static void add_missing_parents(struct packed_oid_list *oids, struct commit *commit) +{ + struct commit_list *parent; + for (parent = commit->parents; parent; parent = parent->next) { + if (!(parent->item->object.flags & UNINTERESTING)) { + ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc); + oidcpy(>list[oids->nr], &(parent->item->object.oid)); + oids->nr++; + parent->item->object.flags |= UNINTERESTING; + } + } +} + +static void close_reachable(struct packed_oid_list *oids) +{ + int i; + struct commit *commit; + + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + if (commit) + commit->object.flags |= UNINTERESTING; + } + + /* +* As this loop runs, oids->nr may grow, but not more +* than the number of missing commits in the reachable +* closure. +*/ + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + + if (commit && !parse_commit(commit)) + add_missing_parents(oids, commit); + } + + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + + if (commit) + commit->object.flags &= ~UNINTERESTING; + } +} + void write_commit_graph(const char *obj_dir) { struct packed_oid_list oids; @@ -390,6 +434,7 @@ void write_commit_graph(const char *obj_dir) ALLOC_ARRAY(oids.list, oids.alloc); for_each_packed_object(add_packed_commits, , 0); + close_reachable(); QSORT(oids.list, oids.nr, commit_compare); -- 2.17.0.14.gba1221a8ce
[PATCH v7 08/14] commit-graph: implement git commit-graph read
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to read commit graph files and summarize their contents. Use the read subcommand to verify the contents of a commit graph file in the tests. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 12 +++ builtin/commit-graph.c | 56 commit-graph.c | 137 - commit-graph.h | 23 + t/t5318-commit-graph.sh| 32 +-- 5 files changed, 254 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 47996e8f89..8aad8303f5 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files SYNOPSIS [verse] +'git commit-graph read' [--object-dir ] 'git commit-graph write' [--object-dir ] @@ -35,6 +36,11 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. Includes all commits from the existing commit graph file. +'read':: + +Read a graph file given by the commit-graph file and output basic +details about the graph file. Used for debugging purposes. + EXAMPLES @@ -45,6 +51,12 @@ EXAMPLES $ git commit-graph write +* Read basic information from the commit-graph file. ++ + +$ git commit-graph read + + GIT --- diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 26b6360289..e3f67401fb 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -7,10 +7,16 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph read [--object-dir ]"), N_("git commit-graph write [--object-dir ]"), NULL }; +static const char * const builtin_commit_graph_read_usage[] = { + N_("git commit-graph read [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_write_usage[] = { N_("git commit-graph write [--object-dir ]"), NULL @@ -20,6 +26,54 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_read(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_read_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_read_options, +builtin_commit_graph_read_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + + if (!graph) + die("graph file %s does not exist", graph_name); + FREE_AND_NULL(graph_name); + + printf("header: %08x %d %d %d %d\n", + ntohl(*(uint32_t*)graph->data), + *(unsigned char*)(graph->data + 4), + *(unsigned char*)(graph->data + 5), + *(unsigned char*)(graph->data + 6), + *(unsigned char*)(graph->data + 7)); + printf("num_commits: %u\n", graph->num_commits); + printf("chunks:"); + + if (graph->chunk_oid_fanout) + printf(" oid_fanout"); + if (graph->chunk_oid_lookup) + printf(" oid_lookup"); + if (graph->chunk_commit_data) + printf(" commit_metadata"); + if (graph->chunk_large_edges) + printf(" large_edges"); + printf("\n"); + + return 0; +} + static int graph_write(int argc, const char **argv) { static struct option builtin_commit_graph_write_options[] = { @@ -60,6 +114,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "read")) + return graph_read(argc, argv); if (!strcmp(argv[0], "write")) return graph_write(argc, argv); } diff --git a/commit-graph.c b/commit-graph.c index f3f7c4f189..b1bd3a892d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -39,11 +39,146 @@ GRAPH_OID_LEN + 8) -static char *get_commit_graph_f
[PATCH v7 05/14] commit-graph: create git-commit-graph builtin
From: Derrick Stolee <dsto...@microsoft.com> Teach git the 'commit-graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for an '--object-dir' option. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .gitignore | 1 + Documentation/git-commit-graph.txt | 10 +++ Makefile | 1 + builtin.h | 1 + builtin/commit-graph.c | 36 ++ command-list.txt | 1 + contrib/completion/git-completion.bash | 2 ++ git.c | 1 + 8 files changed, 53 insertions(+) create mode 100644 Documentation/git-commit-graph.txt create mode 100644 builtin/commit-graph.c diff --git a/.gitignore b/.gitignore index 833ef3b0b7..e82f90184d 100644 --- a/.gitignore +++ b/.gitignore @@ -34,6 +34,7 @@ /git-clone /git-column /git-commit +/git-commit-graph /git-commit-tree /git-config /git-count-objects diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt new file mode 100644 index 00..f3b34622a8 --- /dev/null +++ b/Documentation/git-commit-graph.txt @@ -0,0 +1,10 @@ +git-commit-graph(1) +=== + +NAME + +git-commit-graph - Write and verify Git commit graph files + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index a1d8775adb..a59b62bed1 100644 --- a/Makefile +++ b/Makefile @@ -952,6 +952,7 @@ BUILTIN_OBJS += builtin/clone.o BUILTIN_OBJS += builtin/column.o BUILTIN_OBJS += builtin/commit-tree.o BUILTIN_OBJS += builtin/commit.o +BUILTIN_OBJS += builtin/commit-graph.o BUILTIN_OBJS += builtin/config.o BUILTIN_OBJS += builtin/count-objects.o BUILTIN_OBJS += builtin/credential.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..079855b6d4 100644 --- a/builtin.h +++ b/builtin.h @@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix); extern int cmd_clean(int argc, const char **argv, const char *prefix); extern int cmd_column(int argc, const char **argv, const char *prefix); extern int cmd_commit(int argc, const char **argv, const char *prefix); +extern int cmd_commit_graph(int argc, const char **argv, const char *prefix); extern int cmd_commit_tree(int argc, const char **argv, const char *prefix); extern int cmd_config(int argc, const char **argv, const char *prefix); extern int cmd_count_objects(int argc, const char **argv, const char *prefix); diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c new file mode 100644 index 00..b466ecd781 --- /dev/null +++ b/builtin/commit-graph.c @@ -0,0 +1,36 @@ +#include "builtin.h" +#include "config.h" +#include "parse-options.h" + +static char const * const builtin_commit_graph_usage[] = { + N_("git commit-graph [--object-dir ]"), + NULL +}; + +static struct opts_commit_graph { + const char *obj_dir; +} opts; + + +int cmd_commit_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_commit_graph_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); + + git_config(git_default_config, NULL); + argc = parse_options(argc, argv, prefix, +builtin_commit_graph_options, +builtin_commit_graph_usage, +PARSE_OPT_STOP_AT_NON_OPTION); + + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); +} diff --git a/command-list.txt b/command-list.txt index a1fad28fd8..835c5890be 100644 --- a/command-list.txt +++ b/command-list.txt @@ -34,6 +34,7 @@ git-clean mainporcelain git-clone mainporcelain init git-column purehelpers git-commit mainporcelain history +git-commit-graphplumbingmanipulators git-commit-tree plumbingmanipulators git-config ancillarymanipulators git-count-objects ancillaryinterrogators diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index b09c8a2362..6726daaf69 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -878,6 +878,7 @@ __git_list_porcelain_commands () check-ref-format) : plumbing;
[PATCH v7 07/14] commit-graph: implement git-commit-graph write
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 41 ++ builtin/commit-graph.c | 33 t/t5318-commit-graph.sh| 124 + 3 files changed, 198 insertions(+) create mode 100755 t/t5318-commit-graph.sh diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index f3b34622a8..47996e8f89 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -5,6 +5,47 @@ NAME git-commit-graph - Write and verify Git commit graph files + +SYNOPSIS + +[verse] +'git commit-graph write' [--object-dir ] + + +DESCRIPTION +--- + +Manage the serialized commit graph file. + + +OPTIONS +--- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. This parameter exists to specify the location of an alternate + that only has the objects directory, not a full .git directory. The + commit graph file is expected to be at /info/commit-graph and + the packfiles are expected to be in /pack. + + +COMMANDS + +'write':: + +Write a commit graph file based on the commits found in packfiles. +Includes all commits from the existing commit graph file. + + +EXAMPLES + + +* Write a commit graph file for the packed commits in your local .git folder. ++ + +$ git commit-graph write + + + GIT --- Part of the linkgit:git[1] suite diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index b466ecd781..26b6360289 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -1,9 +1,18 @@ #include "builtin.h" #include "config.h" +#include "dir.h" +#include "lockfile.h" #include "parse-options.h" +#include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph write [--object-dir ]"), + NULL +}; + +static const char * const builtin_commit_graph_write_usage[] = { + N_("git commit-graph write [--object-dir ]"), NULL }; @@ -11,6 +20,25 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_write(int argc, const char **argv) +{ + static struct option builtin_commit_graph_write_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_write_options, +builtin_commit_graph_write_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + write_commit_graph(opts.obj_dir); + return 0; +} int cmd_commit_graph(int argc, const char **argv, const char *prefix) { @@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) builtin_commit_graph_usage, PARSE_OPT_STOP_AT_NON_OPTION); + if (argc > 0) { + if (!strcmp(argv[0], "write")) + return graph_write(argc, argv); + } + usage_with_options(builtin_commit_graph_usage, builtin_commit_graph_options); } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh new file mode 100755 index 00..d7b635bd68 --- /dev/null +++ b/t/t5318-commit-graph.sh @@ -0,0 +1,124 @@ +#!/bin/sh + +test_description='commit graph' +. ./test-lib.sh + +test_expect_success 'setup full repo' ' + mkdir full && + cd "$TRASH_DIRECTORY/full" && + git init && + objdir=".git/objects" +' + +test_expect_success 'write graph with no packs' ' + cd "$TRASH_DIRECTORY/full" && + git commit-graph write --object-dir . && + test_path_is_file info/commit-graph +' + +test_expect_success 'create commits and repack' ' + cd "$TRASH_DIRECTORY/full" && + for i in $(test_seq 3) + do + test_commit $i && + git branch commits/$i + done && + git repack +' + +test_expect_success 'write graph' ' + cd "$TRASH_DIRECTORY/full" && + graph1=$(git commit-graph write) && + test_path_is_file $objdir/info/commit-graph +' + +test_expect_success 'Add more commits' ' +
Re: [PATCH 0/3] Lazy-load trees when reading commit-graph
On 4/3/2018 9:06 AM, Jeff King wrote: On Tue, Apr 03, 2018 at 08:00:54AM -0400, Derrick Stolee wrote: There are several commit-graph walks that require loading many commits but never walk the trees reachable from those commits. However, the current logic in parse_commit() requires the root tree to be loaded. This only uses lookup_tree(), but when reading commits from the commit- graph file, the hashcpy() to load the root tree hash and the time spent checking the object cache take more time than parsing the rest of the commit. In this patch series, all direct references to accessing the 'tree' member of struct commit are replaced instead by one of the following methods: struct tree *get_commit_tree(struct commit *) struct object_id *get_commit_tree_oid(struct commit *) This seems like a pretty sane thing to do. There may still be a few parts of the code, notably fsck, that are reliant on a "struct object" having been instantiated to determine whether an object is "used". I tried to clean that up around the time of c2d17b3b6e (fsck: check HAS_OBJ more consistently, 2017-01-16), but I won't be surprised if there's still some hidden assumptions. I peeked at the fsck.c parts of patch 2, and it looks like we may be OK, since you use get_commit_tree() during the walk. This replacement was assisted by a Coccinelle script, but the 'tree' member is overloaded in other types, so the script gave false-positives that were removed from the diff. That catches any existing in-tree callers, but not any topics in flight. We could add the Coccinelle script and re-run it to catch any future ones. But perhaps simpler still: can we rename the "tree" member to "maybe_tree" or something, since nobody should be accessing it directly? That will give us a compile error if an older topic is merged. That's a good idea. I can reorg in v2 to rename 'tree' to 'maybe_tree' (and add an explicit comment that 'maybe_tree' could be NULL, so don't reference it directly). The check that the rename patch is correct is simply "does it compile?" Then Coccinelle could do the transfer of "c->maybe_tree" to "get_commit_tree(c)" safely, and can be added to the scripts. I guess one caveat is to not include the mutators (c->maybe_tree = ...), but that's probably something Coccinelle can do. On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.83s After: 0.65s Rel %: -21.6% Neat. Not strictly related, but I happened across another thing today that might be worth investigating here. We allocate "struct commit" in these nice big allocation blocks. But then for every commit we allocate at least one "struct commit_list" to store the parent pointer. I was looking at this from a memory consumption angle (I found a pathological repository full of single-parent commits, and this wastes an extra 16 bytes plus malloc overhead for every 64-byte "struct commit"). But I wonder if it could also have non-negligible overhead in calling malloc() for your case, since you've optimized out so much of the rest of the work. That is a pattern I'm seeing: we strip out one bit and something else shows up as a hot spot. This game of whack-a-mole is quite productive. I'm not sure what the exact solution would be, but I imagine something like variable-sized "struct commit"s with the parent pointers embedded, with some kind of flag to indicate the number of parents (and probably some fallback to break out to a linked list for extreme cases of more than 2 parents). It may end up pretty invasive, though, as there's a lot of open-coded traversals of that parent list. Anyway, not anything to do with this patch, but food for thought as you micro-optimize these traversals. One other thing that I've been thinking about is that 'struct commit' is so much bigger than the other structs in 'union any_object'. This means that the object cache, which I think creates blocks of 'union any_object' for memory-alignment reasons, is overly bloated. This would be especially true when walking many more trees than commits. Perhaps there are other memory concerns in that case (such as cached buffers) that the 'union any_object' is not a concern, but it is worth thinking about as we brainstorm how to reduce the parent-list memory. Thanks, -Stolee
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/3/2018 12:51 PM, Derrick Stolee wrote: This is the first of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. This does not have a significant performance benefit in repositories of normal size, but in the Windows repository, some merge-base calculations improve from 3.1s to 2.9s. A modest speedup, but provides an actual consumer of generation numbers as a starting point. A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (6): object.c: parse commit in graph first commit: add generation number to struct commmit commit-graph: compute generation numbers commit: sort by generation number in paint_down_to_common() commit.c: use generation number to stop merge-base walks commit-graph.txt: update design doc with generation numbers This patch is also available as a GitHub pull request [1] [1] https://github.com/derrickstolee/git/pull/5
[PATCH 1/6] object.c: parse commit in graph first
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). Before adding generation numbers to the commit-graph, we need to ensure that any commit that exists in the graph is loaded from the graph, so check parse_commit_in_graph() before calling parse_commit_buffer(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- object.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/object.c b/object.c index e6ad3f61f0..4cd3e98e04 100644 --- a/object.c +++ b/object.c @@ -3,6 +3,7 @@ #include "blob.h" #include "tree.h" #include "commit.h" +#include "commit-graph.h" #include "tag.h" static struct object **obj_hash; @@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type } else if (type == OBJ_COMMIT) { struct commit *commit = lookup_commit(oid); if (commit) { - if (parse_commit_buffer(commit, buffer, size)) + if (!parse_commit_in_graph(commit) && + parse_commit_buffer(commit, buffer, size)) return NULL; if (!get_cached_commit_buffer(commit, NULL)) { set_commit_buffer(commit, buffer, size); -- 2.17.0.rc0
[PATCH 4/6] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_UNDEF. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 19 ++- commit.h | 1 + 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 3e39c86abf..95ae7e13a3 100644 --- a/commit.c +++ b/commit.c @@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* newer commits with larger date first */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index bc7a3186c5..cb97b7636a 100644 --- a/commit.h +++ b/commit.h @@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.rc0
[PATCH 5/6] commit.c: use generation to halt paint walk
In paint_down_to_common(), the walk is halted when the queue contains only stale commits. The queue_has_nonstale() method iterates over the entire queue looking for a nonstale commit. In a wide commit graph where the two sides share many commits in common, but have deep sets of different commits, this method may inspect many elements before finding a nonstale commit. In the worst case, this can give quadratic performance in paint_down_to_common(). Convert queue_has_nonstale() to use generation numbers for an O(1) termination condition. To properly take advantage of this condition, track the minimum generation number of a commit that enters the queue with nonstale status. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 37 ++--- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/commit.c b/commit.c index 95ae7e13a3..858f4fdbc9 100644 --- a/commit.c +++ b/commit.c @@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT); -static int queue_has_nonstale(struct prio_queue *queue) +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) { - int i; - for (i = 0; i < queue->nr; i++) { - struct commit *commit = queue->array[i].data; - if (!(commit->object.flags & STALE)) - return 1; + if (min_gen != GENERATION_NUMBER_UNDEF) { + if (queue->nr > 0) { + struct commit *commit = queue->array[0].data; + return commit->generation >= min_gen; + } + } else { + int i; + for (i = 0; i < queue->nr; i++) { + struct commit *commit = queue->array[i].data; + if (!(commit->object.flags & STALE)) + return 1; + } } + return 0; } @@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_UNDEF; + uint32_t min_nonstale_gen = GENERATION_NUMBER_UNDEF; one->object.flags |= PARENT1; if (!n) { @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return result; } prio_queue_put(, one); + if (one->generation < min_nonstale_gen) + min_nonstale_gen = one->generation; for (i = 0; i < n; i++) { twos[i]->object.flags |= PARENT2; prio_queue_put(, twos[i]); + if (twos[i]->generation < min_nonstale_gen) + min_nonstale_gen = twos[i]->generation; } - while (queue_has_nonstale()) { + while (queue_has_nonstale(, min_nonstale_gen)) { struct commit *commit = prio_queue_get(); struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + + last_gen = commit->generation; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return NULL; p->object.flags |= flags; prio_queue_put(, p); + + if (!(flags & STALE) && + p->generation < min_nonstale_gen) + min_nonstale_gen = p->generation; } } -- 2.17.0.rc0
[PATCH 6/6] commit-graph.txt: update future work
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..be68bee43d 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -98,17 +98,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.rc0
[PATCH 0/6] Compute and consume generation numbers
This is the first of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. This does not have a significant performance benefit in repositories of normal size, but in the Windows repository, some merge-base calculations improve from 3.1s to 2.9s. A modest speedup, but provides an actual consumer of generation numbers as a starting point. A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (6): object.c: parse commit in graph first commit: add generation number to struct commmit commit-graph: compute generation numbers commit: sort by generation number in paint_down_to_common() commit.c: use generation number to stop merge-base walks commit-graph.txt: update design doc with generation numbers Documentation/technical/commit-graph.txt | 7 +--- alloc.c | 1 + commit-graph.c | 48 + commit.c | 53 commit.h | 7 +++- object.c | 4 +- 6 files changed, 104 insertions(+), 16 deletions(-) -- 2.17.0.20.g9f30ba16e1
[PATCH 3/6] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 46 ++ commit.h | 1 + 2 files changed, 47 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index d24b947525..b80c8ad80e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_UNDEF) { + if ((*list)->generation > GENERATION_NUMBER_MAX) + die("generation number %u is too large to store in commit-graph", + (*list)->generation); + packedDate[0] |= htonl((*list)->generation << 2); + } + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_UNDEF && + commits[i]->generation != GENERATION_NUMBER_NONE) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_UNDEF || + parent->item->generation == GENERATION_NUMBER_NONE) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); diff --git a/commit.h b/commit.h index 3cadd386f3..bc7a3186c5 100644 --- a/commit.h +++ b/commit.h @@ -11,6 +11,7 @@ #define COMMIT_NOT_FROM_GRAPH 0x #define GENERATION_NUMBER_UNDEF 0x +#define GENERATION_NUMBER_MAX 0x3FFF #define GENERATION_NUMBER_NONE 0 struct commit_list { -- 2.17.0.rc0
[PATCH 2/6] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use two special values to signal the generation number is invalid: GENERATION_NUMBER_UNDEF 0x GENERATION_NUMBER_NONE 0 The first (_UNDEF) means the generation number has not been loaded or computed. The second (_NONE) means the generation number was loaded from a commit graph file that was stored before generation numbers were computed. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 3 +++ 3 files changed, 6 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..1a62e85ac3 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_UNDEF; return c; } diff --git a/commit-graph.c b/commit-graph.c index 1fc63d541b..d24b947525 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index e57ae4b583..3cadd386f3 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,8 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_UNDEF 0x +#define GENERATION_NUMBER_NONE 0 struct commit_list { struct commit *item; @@ -24,6 +26,7 @@ struct commit { struct commit_list *parents; struct tree *tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.rc0
Re: [PATCH 3/3] commit-graph: lazy-load trees
On 4/3/2018 2:00 PM, Stefan Beller wrote: On Tue, Apr 3, 2018 at 5:00 AM, Derrick Stolee <dsto...@microsoft.com> wrote: The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits. Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree(). Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative peformance improvement when trees are not needed is small. On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.83s After: 0.65s Rel %: -21.6% This is an awesome speedup. Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. and as the walk prunes those commits that do not touch kernel/ which from my quick glance is the real core thing. Linus' announcements claim that > 50% is drivers, networking and documentation[1]. So the "-- kernel/" walk needs to walk twice as many commits to find a thousand commits that actually touch kernel/ ? [1] http://lkml.iu.edu/hypermail/linux/kernel/1801.3/02794.html http://lkml.iu.edu/hypermail/linux/kernel/1803.3/00580.html There was no measureable performance change as a result of this patch. ... which means that the walking itself is really fast now and the dominating effects are setup and checking the tree? Yeah. I was concerned that since we take two accesses into the commit-graph file that we could measurably slow down cases where we need to load the trees. That is not an issue since we will likely parse the tree after loading, and parsing is much slower than these commit-graph accesses. Is git smart enough to not load the root tree for "log -- ./" or would we get the desired performance numbers from that? I wonder, since it only really needs the OID of the root tree to determine TREESAME. If it cares about following TREESAME relationships on ./, then it should do that. @@ -317,6 +315,27 @@ int parse_commit_in_graph(struct commit *item) return 0; } +static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) +{ + struct object_id oid; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * (c->graph_pos); What is 16? (I imagine it is the "length of the row" - g->hash_len ?) Would it make sense to have a constant/define for an entire row instead? (By any chance what is the meaning of GRAPH_DATA_WIDTH, which is 36? That is defined but never used.) Yeah, I should use GRAPH_DATA_WIDTH here instead. +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + if (c->tree) + return c->tree; This double checking is defensive programming, in case someone doesn't check themselves (as get_commit_tree does below). ok. @@ -17,6 +17,13 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * For performance reasons, a commit loaded from the graph does not + * have a tree loaded until trying to consume it for the first time. That is the theme of this series/patch, but do we need to write it down into the codebase? I'd be inclined to omit this part and only go with: Load the root tree of a commit and return the tree. OK. struct tree *get_commit_tree(const struct commit *commit) { - return commit->tree; + if (commit->tree || !commit->object.parsed) I understand to return the tree from the commit when we have the tree in the commit object (the first part). But 'when we have not (yet) parsed the commit object', we also just return its tree? Could you explain the second part of the condition? Is that for commits that are not part of the commit graph? (But then why does it need to be negated?) Some callers check the value of 'commit->tree' without a guarantee that the commit was parsed. In this case, the way to preserve the existing behavior is to continue returning NULL. If I remove the "|| !commit->object.parsed" then the BUG("commit has NULL tree, but was not loaded from commit-graph") is hit in these two tests: t6012-rev-list-simplify.sh t6110-rev-list-sparse.sh I prefer to keep the BUG() statement and instead use this if statement. If someone has more clarity on why this is a good existing behavior, then please chime in. Thanks, -Stolee
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/3/2018 2:03 PM, Brandon Williams wrote: On 04/03, Derrick Stolee wrote: This is the first of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). Thanks for ensuring that this is defined and documented somewhere :) This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. This does not have a significant performance benefit in repositories of normal size, but in the Windows repository, some merge-base calculations improve from 3.1s to 2.9s. A modest speedup, but provides an actual consumer of generation numbers as a starting point. A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. log --graph should benefit a lot more from this correct? I know we've talked a bit about negotiation and I wonder if these generation numbers should be able to help out a little bit with that some day. 'log --graph' should be a HUGE speedup, when it is refactored. Since the topo-order can "stream" commits to the pager, it can be very responsive to return the graph in almost all conditions. (The case where generation numbers are not enough is when filters reduce the set of displayed commits to be very sparse, so many commits are walked anyway.) If we have generic "can X reach Y?" queries, then we can also use generation numbers there to great effect (by not walking commits Z with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" thread for ideas. For negotiation, there are some things we can do here. VSTS uses generation numbers as a heuristic for determining "all wants connected to haves" which is a condition for halting negotiation. The idea is very simple, and I'd be happy to discuss it on a separate thread. Thanks, -Stolee
Re: [PATCH 2/6] commit: add generation number to struct commmit
On 4/3/2018 2:28 PM, Jeff King wrote: On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote: On 04/03, Derrick Stolee wrote: The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this Is there any reason to believe this would be too small of a value in the future? Or is a 32 bit unsigned good enough? The linux kernel took ~10 years to produce 500k commits. Even assuming those were all linear (and they're not), that gives us ~80,000 years of leeway. So even if the pace of development speeds up or we have a quicker project, it still seems we have a pretty reasonable safety margin. That, and larger projects do not have linear histories. Despite having almost 2 million reachable commits, the Windows repository has maximum generation number ~100,000. Thanks, -Stolee
Re: [PATCH 1/6] object.c: parse commit in graph first
On 4/3/2018 2:28 PM, Jeff King wrote: On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote: On Tue, 3 Apr 2018 12:51:38 -0400 Derrick Stolee <dsto...@microsoft.com> wrote: Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). Before adding generation numbers to the commit-graph, we need to ensure that any commit that exists in the graph is loaded from the graph, so check parse_commit_in_graph() before calling parse_commit_buffer(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> Modifying parse_object_buffer() is the most pragmatic way to accomplish this, but this also means that parse_object_buffer() now potentially reads from the local object store (instead of only relying on what's in memory and what's in the provided buffer). parse_object_buffer() is called by several callers including in builtin/fsck.c. I would feel more comfortable if the relevant [1] caller to parse_object_buffer() was modified instead of parse_object_buffer(), but I'll let others give their opinions too. It's not just you. This seems like a really odd place to put it. Especially because if we have the buffer to pass to this function, then we'd already have incurred the cost to inflate the object. OK. Thanks. I'll try to find the better place to put this check. -Stolee
Re: [PATCH v6 00/14] Serialized Git Commit Graph
On 3/16/2018 12:28 PM, Lars Schneider wrote: On 14 Mar 2018, at 21:43, Junio C Hamano <gits...@pobox.com> wrote: Derrick Stolee <sto...@gmail.com> writes: Hopefully this version is ready to merge. I have several follow-up topics in mind to submit soon after, including: A few patches add trailing blank lines and other whitespace breakages, which will stop my "git merge" later to 'next' and down, as I have a pre-commit hook to catch them. @stolee: I run "git --no-pager diff --check $BASE_HASH...$HEAD_HASH" to detect these kinds of things. I run this as part of my "prepare patch" [1] script which is inspired by a similar script originally written by Dscho. Do you think it would make sense to mention (or even recommend) such a script in your awesome GfW CONTRIBUTING.md? - Lars [1] https://github.com/larsxschneider/git-list-helper/blob/master/prepare-patch.sh#L71 Thanks for the suggestions. Somehow I got extra whitespace doing copy/paste in vim and I never re-opened that file in my normal editor (VS Code with an extension that shows trailing whitespace). On 3/15/2018 1:23 PM, Johannes Schindelin wrote: git log --check` was introduced to show you whitespace problems). If all of those whitespace issues are unintentional, you can fix them using `git rebase --whitespace=fix` in the most efficient way. Thanks for both of the suggestions. The `rebase` check was already in the document, so I put the checks immediately above that line. PR is available now [1]. Thanks, -Stolee [1] https://github.com/git-for-windows/git/pull/1567
Re: [PATCH v6 12/14] commit-graph: read only from specific pack-indexes
On 3/15/2018 6:50 PM, SZEDER Gábor wrote: On Wed, Mar 14, 2018 at 8:27 PM, Derrick Stolee <sto...@gmail.com> wrote: From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively. This commit message, and indeed the code itself talk about pack indexes ... Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 11 ++- builtin/commit-graph.c | 33 ++--- commit-graph.c | 26 -- commit-graph.h | 4 +++- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5318-commit-graph.sh| 10 ++ 7 files changed, 81 insertions(+), 9 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 51cb038f3d..b945510f0f 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -32,7 +32,9 @@ COMMANDS 'write':: Write a commit graph file based on the commits found in packfiles. -Includes all commits from the existing commit graph file. ++ +With the `--stdin-packs` option, generate the new commit graph by +walking objects only in the specified packfiles. ... but this piece of documentation ... + OPT_BOOL(0, "stdin-packs", _packs, + N_("scan packfiles listed by stdin for commits")), ... and this help text, and even the name of the option talk about packfiles. Thanks! I'll fix that. -Stolee
Re: [PATCH v6 07/14] commit-graph: implement 'git-commit-graph write'
On 3/18/2018 9:25 AM, Ævar Arnfjörð Bjarmason wrote: On Wed, Mar 14 2018, Derrick Stolee jotted: +'git commit-graph write' [--object-dir ] + + +DESCRIPTION +--- + +Manage the serialized commit graph file. + + +OPTIONS +--- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. The commit graph file is expected to be at /info/commit-graph + and the packfiles are expected to be in /pack. Maybe this was covered in a previous round, this series is a little hard to follow since each version isn't In-Reply-To the version before it, but why is this option needed, i.e. why would you do: git commit-graph write --object-dir=/some/path/.git/objects As opposed to just pigging-backing on what we already have with both of: git --git-dir=/some/path/.git commit-graph write git -C /some/path commit-graph write Is there some use-case where you have *just* the objects dir and not the rest of the .git folder? Yes, such as an alternate. If I remember correctly, alternates only need the objects directory. In the GVFS case, we place prefetch packfiles in an alternate so there is only one copy of the "remote objects" per drive. The commit graph will be stored in that alternate. Thanks, -Stolee
[PATCH] sha1_name: use bsearch_hash() for abbreviations
This patch updates the abbreviation code to use bsearch_hash() as defined in [1]. It gets a nice speedup since the old implementation did not use the fanout table at all. One caveat about the patch: there is a place where I cast a sha1 hash into a struct object_id pointer. This is because the abbreviation code still uses 'const unsigned char *' instead of structs. I wanted to avoid a hashcpy() in these calls, but perhaps that is not too heavy a cost. I look forward to feedback on this. Thanks, -Stolee [1] https://public-inbox.org/git/007f3a4c84cb1c07255404ad1ea9f797129c5cf0.1517609773.git.jonathanta...@google.com/ [PATCH 2/2] packfile: refactor hash search with fanout table -- >8 -- When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search. This is one of several implementations. One issue with this implementation is that it ignores the fanout table in the pack- index. Translate this binary search to use the existing bsearch_hash() method that correctly uses a fanout table. To keep the details about pack- index version 1 or 2 out of sha1_name.c, create a bsearch_pack() method in packfile.c. Due to the use of the fanout table, the abbreviation computation is slightly faster than before. For a fully-repacked copy of the Linux repo, the following 'git log' commands improved: * git log --oneline --parents --raw Before: 66.83s After: 64.85s Rel %: -2.9% * git log --oneline --parents Before: 7.85s After: 7.29s Rel %: -7.1% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- packfile.c | 23 +++ packfile.h | 8 sha1_name.c | 24 3 files changed, 35 insertions(+), 20 deletions(-) diff --git a/packfile.c b/packfile.c index 7c1a2519fc..ea0388f6dd 100644 --- a/packfile.c +++ b/packfile.c @@ -1654,6 +1654,29 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset, return data; } +int bsearch_pack(const struct object_id *oid, const struct packed_git *p, uint32_t *result) +{ + const unsigned char *index_fanout = p->index_data; + const unsigned char *index_lookup; + int index_lookup_width; + + if (!index_fanout) + BUG("bsearch_pack called without a valid pack-index"); + + index_lookup = index_fanout + 4 * 256; + if (p->index_version == 1) { + index_lookup_width = 24; + index_lookup += 4; + } else { + index_lookup_width = 20; + index_fanout += 8; + index_lookup += 8; + } + + return bsearch_hash(oid->hash, (const uint32_t*)index_fanout, + index_lookup, index_lookup_width, result); +} + const unsigned char *nth_packed_object_sha1(struct packed_git *p, uint32_t n) { diff --git a/packfile.h b/packfile.h index a7fca598d6..ec08cb2bb0 100644 --- a/packfile.h +++ b/packfile.h @@ -78,6 +78,14 @@ extern struct packed_git *add_packed_git(const char *path, size_t path_len, int */ extern void check_pack_index_ptr(const struct packed_git *p, const void *ptr); +/* + * Perform binary search on a pack-index for a given oid. Packfile is expected to + * have a valid pack-index. + * + * See 'bsearch_hash' for more information. + */ +int bsearch_pack(const struct object_id *oid, const struct packed_git *p, uint32_t *result); + /* * Return the SHA-1 of the nth object within the specified packfile. * Open the index if it is not already open. The return value points diff --git a/sha1_name.c b/sha1_name.c index 735c1c0b8e..be3627b915 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -512,32 +512,16 @@ static void find_abbrev_len_for_pack(struct packed_git *p, struct min_abbrev_data *mad) { int match = 0; - uint32_t num, last, first = 0; + uint32_t num, first = 0; struct object_id oid; + const struct object_id *mad_oid; if (open_pack_index(p) || !p->num_objects) return; num = p->num_objects; - last = num; - while (first < last) { - uint32_t mid = first + (last - first) / 2; - const unsigned char *current; - int cmp; - - current = nth_packed_object_sha1(p, mid); - cmp = hashcmp(mad->hash, current); - if (!cmp) { - match = 1; - first = mid; - break; - } - if (cmp > 0) { - first = mid + 1; - continue; - } - last = mid; - } + mad_oid = (const struct object_id *)mad->hash; + match = bsearch_pack(mad_oid, p, ); /* * first is now the position in the packfile where we would insert -- 2.17.0.rc0
[PATCH v2 2/3] packfile: define and use bsearch_pack()
The method bsearch_hash() generalizes binary searches using a fanout table. The only consumer is currently find_pack_entry_one(). It requires a bit of pointer arithmetic to align the fanout table and the lookup table depending on the pack-index version. Extract the pack-index pointer arithmetic to a new method, bsearch_pack(), so this can be re-used in other code paths. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- packfile.c | 42 ++ packfile.h | 8 2 files changed, 34 insertions(+), 16 deletions(-) diff --git a/packfile.c b/packfile.c index f26395ecab..69d3afda6c 100644 --- a/packfile.c +++ b/packfile.c @@ -1654,6 +1654,29 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset, return data; } +int bsearch_pack(const struct object_id *oid, const struct packed_git *p, uint32_t *result) +{ + const unsigned char *index_fanout = p->index_data; + const unsigned char *index_lookup; + int index_lookup_width; + + if (!index_fanout) + BUG("bsearch_pack called without a valid pack-index"); + + index_lookup = index_fanout + 4 * 256; + if (p->index_version == 1) { + index_lookup_width = 24; + index_lookup += 4; + } else { + index_lookup_width = 20; + index_fanout += 8; + index_lookup += 8; + } + + return bsearch_hash(oid->hash, (const uint32_t*)index_fanout, + index_lookup, index_lookup_width, result); +} + const unsigned char *nth_packed_object_sha1(struct packed_git *p, uint32_t n) { @@ -1720,30 +1743,17 @@ off_t nth_packed_object_offset(const struct packed_git *p, uint32_t n) off_t find_pack_entry_one(const unsigned char *sha1, struct packed_git *p) { - const uint32_t *level1_ofs = p->index_data; const unsigned char *index = p->index_data; - unsigned stride; + struct object_id oid; uint32_t result; if (!index) { if (open_pack_index(p)) return 0; - level1_ofs = p->index_data; - index = p->index_data; - } - if (p->index_version > 1) { - level1_ofs += 2; - index += 8; - } - index += 4 * 256; - if (p->index_version > 1) { - stride = 20; - } else { - stride = 24; - index += 4; } - if (bsearch_hash(sha1, level1_ofs, index, stride, )) + hashcpy(oid.hash, sha1); + if (bsearch_pack(, p, )) return nth_packed_object_offset(p, result); return 0; } diff --git a/packfile.h b/packfile.h index a7fca598d6..ec08cb2bb0 100644 --- a/packfile.h +++ b/packfile.h @@ -78,6 +78,14 @@ extern struct packed_git *add_packed_git(const char *path, size_t path_len, int */ extern void check_pack_index_ptr(const struct packed_git *p, const void *ptr); +/* + * Perform binary search on a pack-index for a given oid. Packfile is expected to + * have a valid pack-index. + * + * See 'bsearch_hash' for more information. + */ +int bsearch_pack(const struct object_id *oid, const struct packed_git *p, uint32_t *result); + /* * Return the SHA-1 of the nth object within the specified packfile. * Open the index if it is not already open. The return value points -- 2.17.0.rc0
[PATCH v2 1/3] sha1_name: convert struct min_abbrev_data to object_id
From: "brian m. carlson"This structure is only written to in one place, where we already have a struct object_id. Convert the struct to use a struct object_id instead. Signed-off-by: brian m. carlson --- sha1_name.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index 39e911c8ba..16e0003396 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -480,7 +480,7 @@ struct min_abbrev_data { unsigned int init_len; unsigned int cur_len; char *hex; - const unsigned char *hash; + const struct object_id *oid; }; static inline char get_hex_char_from_oid(const struct object_id *oid, @@ -526,7 +526,7 @@ static void find_abbrev_len_for_pack(struct packed_git *p, int cmp; current = nth_packed_object_sha1(p, mid); - cmp = hashcmp(mad->hash, current); + cmp = hashcmp(mad->oid->hash, current); if (!cmp) { match = 1; first = mid; @@ -603,7 +603,7 @@ int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len) mad.init_len = len; mad.cur_len = len; mad.hex = hex; - mad.hash = oid->hash; + mad.oid = oid; find_abbrev_len_packed(); base-commit: 1a750441a7360b29fff7a414649ece1d35acaca6 -- 2.17.0.rc0
[PATCH v2 3/3] sha1_name: use bsearch_pack() for abbreviations
When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search. This is one of several implementations. One issue with this implementation is that it ignores the fanout table in the pack- index. Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table. Due to the use of the fanout table, the abbreviation computation is slightly faster than before. For a fully-repacked copy of the Linux repo, the following 'git log' commands improved: * git log --oneline --parents --raw Before: 59.2s After: 56.9s Rel %: -3.8% * git log --oneline --parents Before: 6.48s After: 5.91s Rel %: -8.9% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- sha1_name.c | 24 1 file changed, 4 insertions(+), 20 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index 16e0003396..24894b3dbe 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -512,32 +512,16 @@ static void find_abbrev_len_for_pack(struct packed_git *p, struct min_abbrev_data *mad) { int match = 0; - uint32_t num, last, first = 0; + uint32_t num, first = 0; struct object_id oid; + const struct object_id *mad_oid; if (open_pack_index(p) || !p->num_objects) return; num = p->num_objects; - last = num; - while (first < last) { - uint32_t mid = first + (last - first) / 2; - const unsigned char *current; - int cmp; - - current = nth_packed_object_sha1(p, mid); - cmp = hashcmp(mad->oid->hash, current); - if (!cmp) { - match = 1; - first = mid; - break; - } - if (cmp > 0) { - first = mid + 1; - continue; - } - last = mid; - } + mad_oid = mad->oid; + match = bsearch_pack(mad_oid, p, ); /* * first is now the position in the packfile where we would insert -- 2.17.0.rc0
[PATCH v2 0/3] Use bsearch_hash() for abbreviations
Thanks to Jonathan and Brian for the help with the proper way to handle OIDs and existing callers to bsearch_hash(). This patch includes one commit that Brian sent in the previous discussion (included again here for completeness). Derrick Stolee (2): packfile: define and use bsearch_pack() sha1_name: use bsearch_pack() for abbreviations brian m. carlson (1): sha1_name: use bsearch_hash() for abbreviations packfile.c | 42 ++ packfile.h | 8 sha1_name.c | 28 ++-- 3 files changed, 40 insertions(+), 38 deletions(-) base-commit: 1a750441a7360b29fff7a414649ece1d35acaca6 -- 2.17.0.rc0
Re: [PATCH v6 00/14] Serialized Git Commit Graph
On 3/16/2018 4:19 PM, Jeff King wrote: On Fri, Mar 16, 2018 at 04:06:39PM -0400, Jeff King wrote: Furthermore, in order to look at an object it has to be zlib inflated first, and since commit objects tend to be much smaller than trees and especially blobs, there are a lot less bytes to inflate: $ grep ^commit type-size |cut -d' ' -f2 |avg 34395730 / 53754 = 639 $ cat type-size |cut -d' ' -f2 |avg 3866685744 / 244723 = 15800 So a simple revision walk inflates less than 1% of the bytes that the "enumerate objects packfiles" approach has to inflate. I don't think this is quite accurate. It's true that we have to _consider_ every object, but Git is smart enough not to inflate each one to find its type. For loose objects we just inflate the header. For packed objects, we either pick the type directly out of the packfile header (for a non-delta) or can walk the delta chain (without actually looking at the data bytes!) until we hit the base. Hmm, so that's a big part of the problem with this patch series. It actually _does_ unpack every object with --stdin-packs to get the type, which is just silly. With the patch below, my time for "commit-graph write --stdin-packs" on linux.git goes from over 5 minutes (I got bored and killed it) to 17 seconds. diff --git a/commit-graph.c b/commit-graph.c index 6348bab82b..cf1da2e8c1 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -491,11 +491,12 @@ static int add_packed_commits(const struct object_id *oid, { struct packed_oid_list *list = (struct packed_oid_list*)data; enum object_type type; - unsigned long size; - void *inner_data; off_t offset = nth_packed_object_offset(pack, pos); - inner_data = unpack_entry(pack, offset, , ); - FREE_AND_NULL(inner_data); + struct object_info oi = OBJECT_INFO_INIT; + + oi.typep = + if (packed_object_info(pack, offset, ) < 0) + die("unable to get type of object %s", oid_to_hex(oid)); if (type != OBJ_COMMIT) return 0; -Peff Thanks for this! Fixing this performance problem is very important to me, as we will use the "--stdin-packs" mechanism in the GVFS scenario (we will walk all commits in the prefetch packs full of commits and trees instead of relying on refs). This speedup is very valuable! Thanks, -Stolee
Re: [PATCH v6 07/14] commit-graph: implement 'git-commit-graph write'
On 3/19/2018 10:36 AM, Ævar Arnfjörð Bjarmason wrote: On Mon, Mar 19 2018, Derrick Stolee jotted: On 3/18/2018 9:25 AM, Ævar Arnfjörð Bjarmason wrote: On Wed, Mar 14 2018, Derrick Stolee jotted: +'git commit-graph write' [--object-dir ] + + +DESCRIPTION +--- + +Manage the serialized commit graph file. + + +OPTIONS +--- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. The commit graph file is expected to be at /info/commit-graph + and the packfiles are expected to be in /pack. Maybe this was covered in a previous round, this series is a little hard to follow since each version isn't In-Reply-To the version before it, but why is this option needed, i.e. why would you do: git commit-graph write --object-dir=/some/path/.git/objects As opposed to just pigging-backing on what we already have with both of: git --git-dir=/some/path/.git commit-graph write git -C /some/path commit-graph write Is there some use-case where you have *just* the objects dir and not the rest of the .git folder? Yes, such as an alternate. If I remember correctly, alternates only need the objects directory. In the GVFS case, we place prefetch packfiles in an alternate so there is only one copy of the "remote objects" per drive. The commit graph will be stored in that alternate. Makes sense, but we should really document this as being such an unusual option, i.e. instead say something like. Use given directory for the location of packfiles and commit graph file. Usually you'd use the `--git-dir` or `-C` arguments to `git` itself. This option is here to support obscure use-cases where we have a stand-alone object directory. The commit graph file is expected to be at /info/commit-graph and the packfiles are expected to be in /pack. A slight change to your recommendation: OPTIONS --- --object-dir:: Use given directory for the location of packfiles and commit graph file. This parameter exists to specify the location of an alternate that only has the objects directory, not a full .git directory. The commit graph file is expected to be at /info/commit-graph and the packfiles are expected to be in /pack.
Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)
On 3/15/2018 4:36 AM, Ævar Arnfjörð Bjarmason wrote: On Thu, Mar 15 2018, Junio C. Hamano jotted: * nd/repack-keep-pack (2018-03-07) 6 commits - SQUASH??? - pack-objects: display progress in get_object_details() - pack-objects: show some progress when counting kept objects - gc --auto: exclude base pack if not enough mem to "repack -ad" - repack: add --keep-pack option - t7700: have closing quote of a test at the beginning of line "git gc" in a large repository takes a lot of time as it considers to repack all objects into one pack by default. The command has been taught to pretend as if the largest existing packfile is marked with ".keep" so that it is left untouched while objects in other packs and loose ones are repacked. Expecting a reroll. cf.
Re: [PATCH] sha1_name: use bsearch_hash() for abbreviations
On 3/20/2018 6:25 PM, Jonathan Tan wrote: On Tue, 20 Mar 2018 16:03:25 -0400 Derrick Stolee <dsto...@microsoft.com> wrote: This patch updates the abbreviation code to use bsearch_hash() as defined in [1]. It gets a nice speedup since the old implementation did not use the fanout table at all. You can refer to the patch as: b4e00f7306a1 ("packfile: refactor hash search with fanout table", 2018-02-15) Also, might be worth noting that this patch builds on jt/binsearch-with-fanout. Thanks. That commit is in master and v2.17.0-rc0, so hopefully it isn't difficult to apply the patch. One caveat about the patch: there is a place where I cast a sha1 hash into a struct object_id pointer. This is because the abbreviation code still uses 'const unsigned char *' instead of structs. I wanted to avoid a hashcpy() in these calls, but perhaps that is not too heavy a cost. I recall a discussion that there were alignment issues with doing this, but I might have be remembering wrongly - in my limited knowledge of C alignment, both "unsigned char *" and "struct object_id *" have the same constraints, but I'm not sure. Adding Brian M. Carlson in the CC line for advice on how to do this translation between old sha1's and new object_ids. If it isn't safe, then we could do a hashcpy() until the translation makes it unnecessary. I should have compared the two methods before sending the patch, but running the "git log --oneline --parents" test with a hashcpy() versus a cast has no measurable difference in performance for Linux. Probably best to do the safest thing here if there is no cost to perf. + const unsigned char *index_fanout = p->index_data; [snip] + return bsearch_hash(oid->hash, (const uint32_t*)index_fanout, + index_lookup, index_lookup_width, result); This cast to "const uint32_t *" is safe, because p->index_data points to a mmap-ed region (which has very good alignment, as far as I know). I wonder if we should document alignment guarantees on p->index_data, and if yes, what guarantees to declare. The existing application in find_pack_entry_one() [1] does similar pack-index arithmetic before calling. A similar amount of arithmetic and alignment concerns appear in the commit-graph patch. Where is the right place to declare these alignment requirements? In v2, I'll have find_pack_entry_one() call bsearch_pack() so this logic is not duplicated. Thanks, -Stolee [1] https://github.com/git/git/blob/c6284da4ff4afbde8211efe5d03f3604b1c6b9d6/packfile.c#L1721-L1749 find_pack_entry_one() in packfile.c
Re: [PATCH v6 00/14] Serialized Git Commit Graph
On 3/19/2018 8:55 AM, Derrick Stolee wrote: Thanks for this! Fixing this performance problem is very important to me, as we will use the "--stdin-packs" mechanism in the GVFS scenario (we will walk all commits in the prefetch packs full of commits and trees instead of relying on refs). This speedup is very valuable! Thanks, -Stolee Also, for those interested in this series, I plan to do a rebase onto 2.17.0, when available, as my re-roll. I pushed my responses to the current feedback at the GitHub PR for the series [1]. If you are planning to provide more feedback to the series, then please let me know and I'll delay my re-roll so you have a chance to review. Thanks, -Stolee [1] https://github.com/derrickstolee/git/pull/2
Re: How to debug a "git merge"?
On 3/14/2018 4:53 PM, Lars Schneider wrote: On 14 Mar 2018, at 18:02, Derrick Stolee <sto...@gmail.com> wrote: On 3/14/2018 12:56 PM, Lars Schneider wrote: Hi, I am investigating a Git merge (a86dd40fe) in which an older version of a file won over the newer version. I try to understand why this is the case. I can reproduce the merge with the following commands: $ git checkout -b test a02fa3303 $ GIT_MERGE_VERBOSITY=5 git merge --verbose c1b82995c The merge actually generates a merge conflict but not for my problematic file. The common ancestor of the two parents (merge base) is b91161554. The merge graph is not pretty (the committers don't have a clean branching scheme) but I cannot spot a problem between the merge commit and the common ancestor: $ git log --graph --oneline a86dd40fe Have you tried `git log --graph --oneline --simplify-merges -- path` to see what changes and merges involved the file? I find that view to be very helpful (while the default history simplification can hide things). In particular, if there was a change that was reverted in one side and not another, we could find out. Thanks for this tip! Unfortunately, this only confirms my current view: ### First parent $ git log --graph --oneline --simplify-merges a02fa3303 -- path/to/problem * 4e47a10c7 <-- old version * 01f01f61c ### Second parent $ git log --graph --oneline --simplify-merges c1b82995c -- path/to/problem * 590e52ed1 <-- new version * 8e598828d * ad4e9034b * 4e47a10c7 * 01f01f61c ### Merge $ git log --graph --oneline --simplify-merges a86dd40fe -- path/to/problem * a86dd40fe <-- old version ?!?! That's the problem! |\ | * 590e52ed1 <-- new version | * 8e598828d | * ad4e9034b |/ * 4e47a10c7 <-- old version * 01f01f61c You could also use the "A...B" to check your two commits for merging, and maybe add "--boundary". $ git diff --boundary a02fa3303...c1b82995c -- path/to/problem This looks like the correct diff. The "new version" is mark as +/add/green in the diff. Does this make any sense to you? Thank you, Lars I'm sorry for the delay on this, but in my experience in helping customers saying "why doesn't my commit/change appear in the history of a file?" is because someone resolved merge conflicts by just taking one side instead of doing a real merge (such as using -Xours). The only solution is to re-apply the original change again and talk to the user who incorrectly merged, to prevent future occurrences. These things rarely happen to teams who use pull requests that require review, since the diff would be problematic. There are still a lot of teams that push directly to master, though. Thanks, -Stolee
Re: Contributor Summit planning
On 3/3/2018 5:39 AM, Jeff King wrote: On Sat, Mar 03, 2018 at 05:30:10AM -0500, Jeff King wrote: As in past years, I plan to run it like an unconference. Attendees are expected to bring topics for group discussion. Short presentations are also welcome. We'll put the topics on a whiteboard in the morning, and pick whichever ones people are interested in. Feel free to reply to this thread if you want to make plans or discuss any proposed topics before the summit. Input or questions from non-attendees is welcome here. I'll plan to offer two topics: - a round-up of the current state and past year's activities of Git as a member project of Software Freedom Conservancy - some updates on the state of the git-scm.com since my report last year As with last year, I'll try to send a written report to the list for those who aren't at the summit in person. -Peff Thanks for putting this together, Peff. I'll be ready to talk about the serialized commit graph [1], generation numbers, and other commit-walk optimizations. Thanks, -Stolee [1] https://public-inbox.org/git/1519698787-190494-1-git-send-email-dsto...@microsoft.com/T/#u
Re: [RFC] Contributing to Git (on Windows)
I really appreciate the feedback on this document, Jonathan. On 3/3/2018 1:27 PM, Jonathan Nieder wrote: Hi Dscho, Johannes Schindelin wrote: Jonathan Niederwrites: Dereck Stolee wrote: nit: s/Dereck/Derrick/ Is my outgoing email name misspelled, or do you have a misspelled contact info for me? +Test Your Changes on Linux +-- + +It can be important to work directly on the [core Git codebase](https://github.com/git/git), +such as a recent commit into the `master` or `next` branch that has not been incorporated +into Git for Windows. Also, it can help to run functional and performance tests on your +code in Linux before submitting patches to the Linux-focused mailing list. I'm surprised at this advice. Does it actually come up? Yes. I personally set up the automated builds on Windows, Linux and macOS for our team, and as a rule we always open an internal Pull Request on our topic branches as we develop them, and you would probably not believe the number of issues we caught before sending the patches to this list. Issues including [nice list snipped] Thanks for explaining. I still am going to push back on the wording here, and here is why: 1. Approximately 1/2 of the differences you describe apply to Mac as well as Windows. The advice certainly does not apply on Mac. You might object: Mac readers would not be reading this text! But that is not how humans work: if project documentation (e.g. the CONTRIBUTING.md on GitHub!) says that the project is Linux-focused and if you don't test on Linux then you might as well not bother, then people are going to believe it. 2. It is not unusual for Linux users to make portability mistakes that are quickly pointed out on list. If anything, the advice to test on other platforms should apply to contributors on Linux even more. This happens especially often to new contributors, who sometimes use bashisms, etc that get quickly pointed out. 3. I do not *want* Git to be a Linux-focused project; I want the code to perform well on all popular platforms and for people not to be afraid to make it so. If the docs say that all we care about is Linux, then people are likely to be too scared to do the necessary and valuable work of making it work well on Mac, Windows, etc. The actual mix of contributors doesn't bear it out anyway: a large number of contributors are already on Mac or Windows. Fortunately this is pretty straightforward to fix in the doc: it could say something like "to the multi-platform focused mailing list", for example. I like the "multi-platform" wording, and I'll try to use it as often as possible. I'll keep the Linux instructions because it is free to set up a Linux environment and testing in Windows and Linux will catch most of the cross-platform issues. [...] To my chagrin, this idea of making most of the boring stuff (and I include formatting in that category, but I am probably special in that respect) as automatable, and as little of an issue for review as possible, leaving most brain cycles to work on the correctness of the patches instead, did not really receive a lot of traction on this mailing list. Huh? I'm confident that this is a pretty widely supported idea within the Git project. I get the feeling you must have misread something or misinterpreted a different response. [...] No, this advice comes straight from my personal observation that the reviewers on the Git mailing list are Linux-centric. Hopefully the clarifications and suggestions higher in this message help. If they don't, then I'm nervous about our ability to understand each other. [...] Now, how reasonable do I think it is to ask those contributors to purchase an Apple machine to test their patches on macOS (you cannot just download an .iso nor would it be legal to run a macOS VM on anything but Apple hardware)? You probably guessed my answer: not at all. Agreed, this is something that needs to be automated (and not via a CONTRIBUTING.md file). As a stopgap, having a section in the contribution instructions about testing using Windows's Linux subsystem is a valuable thing, and thanks for that; I never meant to imply otherwise. [...] On Fri, 2 Mar 2018, Junio C Hamano wrote: In fact, portability issues in a patch originally written for a platform is rather quickly discovered if the original platform is more minor than the others, so while it is good advice to test your ware before you make it public, singling out the portability issues may not add much value. The fewer number of obvious bugs remaining, the fewer number of iterations it would take for a series to get in a good shape. [...] For you, Junio, however, the task *now* is to put yourself into the shoes of a Computer Science student in their 2nd year who wants to become an Open Source contributor and is a little afraid to talk directly to "the
[RFC] Contributing to Git (on Windows)
We (Git devs at Microsoft) have had several people start contributing to Git over the past few years (I'm the most-recent addition). As we on-boarded to Git development on our Windows machines, we collected our setup steps on an internal wiki page. Now, we'd like to make that document publicly available. These steps are focused on a Windows user, so we propose putting them in the git-for-windows/git repo under CONTRIBUTING.md. I have a pull request open for feedback [1]. I'll read comments on that PR or in this thread. If you've ever done Git development on a Windows machine, I would love to hear your feedback on this document. Any other advice you have is greatly appreciated. For anyone interested, there are also a discussion about submitting patches upstream. The document links to Documentation/CodingGuidelines and Documentation/SubmittingPatches, but tries to elaborate with additional advice. Thanks, -Stolee [1] https://github.com/git-for-windows/git/pull/1529
[PATCH v5 13/13] commit-graph: implement "--additive" option
Teach git-commit-graph to add all commits from the existing commit-graph file to the file about to be written. This should be used when adding new commits without performing garbage collection. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 10 ++ builtin/commit-graph.c | 10 +++--- commit-graph.c | 17 - commit-graph.h | 3 ++- t/t5318-commit-graph.sh| 10 ++ 5 files changed, 45 insertions(+), 5 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 0710a68..ccf5e20 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -41,6 +41,9 @@ With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with --stdin-packs.) ++ +With the `--additive` option, include all commits that are present +in the existing commit-graph file. 'read':: @@ -70,6 +73,13 @@ $ echo | git commit-graph write --stdin-packs $ git show-ref -s | git commit-graph write --stdin-commits +* Write a graph file containing all commits in the current +* commit-graph file along with those reachable from HEAD. ++ + +$ git rev-parse HEAD | git commit-graph write --stdin-commits --additive + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 1c7b7e7..d26a6d6 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--additive] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--additive] [--stdin-packs|--stdin-commits]"), NULL }; @@ -26,6 +26,7 @@ static struct opts_commit_graph { const char *obj_dir; int stdin_packs; int stdin_commits; + int additive; } opts; static int graph_read(int argc, const char **argv) @@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv) N_("scan packfiles listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", _commits, N_("start walk at commits listed by stdin")), + OPT_BOOL(0, "additive", , + N_("include all commits already in the commit-graph file")), OPT_END(), }; @@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv) pack_indexes, packs_nr, commit_hex, - commits_nr); + commits_nr, + opts.additive); return 0; } diff --git a/commit-graph.c b/commit-graph.c index dbb9801..c111717 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -533,7 +533,8 @@ void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, const char **commit_hex, - int nr_commits) + int nr_commits, + int additive) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -552,10 +553,24 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; oids.alloc = approximate_object_count() / 4; + if (additive) { + prepare_commit_graph_one(obj_dir); + if (commit_graph) + oids.alloc += commit_graph->num_commits; + } + if (oids.alloc < 1024) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); + if (additive && commit_graph) { + for (i = 0; i < commit_graph->num_commits; i++) { + const unsigned char *hash = commit_graph->chunk_oid_lookup + + commit_graph->hash_len * i; + hashcpy(oids.list[oids.nr++].ha
[PATCH v5 12/13] commit-graph: build graph from starting commits
Teach git-commit-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 14 +- builtin/commit-graph.c | 27 +-- commit-graph.c | 27 +-- commit-graph.h | 4 +++- t/t5318-commit-graph.sh| 13 + 5 files changed, 75 insertions(+), 10 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index b945510..0710a68 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -34,7 +34,13 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by -walking objects only in the specified packfiles. +walking objects only in the specified packfiles. (Cannot be combined +with --stdin-commits.) ++ +With the `--stdin-commits` option, generate the new commit graph by +walking commits starting at the commits specified in stdin as a list +of OIDs in hex, one OID per line. (Cannot be combined with +--stdin-packs.) 'read':: @@ -58,6 +64,12 @@ $ git commit-graph write $ echo | git commit-graph write --stdin-packs +* Write a graph file containing all reachable commits. ++ + +$ git show-ref -s | git commit-graph write --stdin-commits + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index eebca57..1c7b7e7 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; int stdin_packs; + int stdin_commits; } opts; static int graph_read(int argc, const char **argv) @@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; int packs_nr = 0; + const char **commit_hex = NULL; + int commits_nr = 0; const char **lines = NULL; int lines_nr = 0; int lines_alloc = 0; @@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv) N_("The object directory to store the graph")), OPT_BOOL(0, "stdin-packs", _packs, N_("scan packfiles listed by stdin for commits")), + OPT_BOOL(0, "stdin-commits", _commits, + N_("start walk at commits listed by stdin")), OPT_END(), }; @@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); + if (opts.stdin_packs && opts.stdin_commits) + die(_("cannot use both --stdin-commits and --stdin-packs")); if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; lines_nr = 0; lines_alloc = 128; @@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv) lines[lines_nr++] = strbuf_detach(, NULL); } - pack_indexes = lines; - packs_nr = lines_nr; + if (opts.stdin_packs) { + pack_indexes = lines; + packs_nr = lines_nr; +
[PATCH v5 03/13] commit-graph: create git-commit-graph builtin
Thanks for the help in getting all the details right in setting up a builtin. -- >8 -- Teach git the 'commit-graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for an '--object-dir' option. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .gitignore | 1 + Documentation/git-commit-graph.txt | 11 ++ Makefile | 1 + builtin.h | 1 + builtin/commit-graph.c | 37 ++ command-list.txt | 1 + contrib/completion/git-completion.bash | 2 ++ git.c | 1 + 8 files changed, 55 insertions(+) create mode 100644 Documentation/git-commit-graph.txt create mode 100644 builtin/commit-graph.c diff --git a/.gitignore b/.gitignore index 833ef3b..e82f901 100644 --- a/.gitignore +++ b/.gitignore @@ -34,6 +34,7 @@ /git-clone /git-column /git-commit +/git-commit-graph /git-commit-tree /git-config /git-count-objects diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt new file mode 100644 index 000..5913340 --- /dev/null +++ b/Documentation/git-commit-graph.txt @@ -0,0 +1,11 @@ +git-commit-graph(1) +=== + +NAME + +git-commit-graph - Write and verify Git commit graph files + +GIT +--- +Part of the linkgit:git[1] suite + diff --git a/Makefile b/Makefile index c56fdc1..2e4956f 100644 --- a/Makefile +++ b/Makefile @@ -946,6 +946,7 @@ BUILTIN_OBJS += builtin/clone.o BUILTIN_OBJS += builtin/column.o BUILTIN_OBJS += builtin/commit-tree.o BUILTIN_OBJS += builtin/commit.o +BUILTIN_OBJS += builtin/commit-graph.o BUILTIN_OBJS += builtin/config.o BUILTIN_OBJS += builtin/count-objects.o BUILTIN_OBJS += builtin/credential.o diff --git a/builtin.h b/builtin.h index 42378f3..079855b 100644 --- a/builtin.h +++ b/builtin.h @@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix); extern int cmd_clean(int argc, const char **argv, const char *prefix); extern int cmd_column(int argc, const char **argv, const char *prefix); extern int cmd_commit(int argc, const char **argv, const char *prefix); +extern int cmd_commit_graph(int argc, const char **argv, const char *prefix); extern int cmd_commit_tree(int argc, const char **argv, const char *prefix); extern int cmd_config(int argc, const char **argv, const char *prefix); extern int cmd_count_objects(int argc, const char **argv, const char *prefix); diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c new file mode 100644 index 000..8ff7336 --- /dev/null +++ b/builtin/commit-graph.c @@ -0,0 +1,37 @@ +#include "builtin.h" +#include "config.h" +#include "parse-options.h" + +static char const * const builtin_commit_graph_usage[] = { + N_("git commit-graph [--object-dir ]"), + NULL +}; + +static struct opts_commit_graph { + const char *obj_dir; +} opts; + + +int cmd_commit_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_commit_graph_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); + + git_config(git_default_config, NULL); + argc = parse_options(argc, argv, prefix, +builtin_commit_graph_options, +builtin_commit_graph_usage, +PARSE_OPT_STOP_AT_NON_OPTION); + + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); +} + diff --git a/command-list.txt b/command-list.txt index a1fad28..835c589 100644 --- a/command-list.txt +++ b/command-list.txt @@ -34,6 +34,7 @@ git-clean mainporcelain git-clone mainporcelain init git-column purehelpers git-commit mainporcelain history +git-commit-graphplumbingmanipulators git-commit-tree plumbingmanipulators git-config ancillarymanipulators git-count-objects ancillaryinterrogators diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index 88813e9..b060b6f 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -841,6 +841,7 @@ __git_list_porcelain_commands () check-ref-format) : plumb
[PATCH v5 02/13] graph: add commit graph design document
Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 164 +++ 1 file changed, 164 insertions(+) create mode 100644 Documentation/technical/commit-graph.txt diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 000..d11753a --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,164 @@ +Git Commit Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. + +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +Design Details +-- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +--- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- The 'commit-graph' subcommand does not have a "verify" mode that is + necessary for integration with fsck. + +- The file format includes room for precomputed generation numbers. These
[PATCH v5 10/13] commit: integrate commit graph with commit parsing
Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. If core.commitGraph is false, then do not check graph files. In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 664,185 reachable commits and is behind 'origin/master' by 60,191 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 6.56s | 0.66s | -89% | | branch -vv | 1.35s | 0.32s | -76% | | rev-list --all | 6.7s | 0.83s | -87% | | rev-list --all --objects | 33.0s | 27.5s | -16% | Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c | 1 + commit-graph.c | 141 +++- commit-graph.h | 12 + commit.c| 3 ++ commit.h| 3 ++ t/t5318-commit-graph.sh | 47 +++- 6 files changed, 205 insertions(+), 2 deletions(-) diff --git a/alloc.c b/alloc.c index 12afadf..cf4f8b6 100644 --- a/alloc.c +++ b/alloc.c @@ -93,6 +93,7 @@ void *alloc_commit_node(void) struct commit *c = alloc_node(_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); + c->graph_pos = COMMIT_NOT_FROM_GRAPH; return c; } diff --git a/commit-graph.c b/commit-graph.c index 01aa23d..184b8da 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -38,7 +38,6 @@ #define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ GRAPH_OID_LEN + 8) - char *get_commit_graph_filename(const char *obj_dir) { return xstrfmt("%s/info/commit-graph", obj_dir); @@ -182,6 +181,145 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) exit(1); } +/* global storage */ +struct commit_graph *commit_graph = NULL; + +static void prepare_commit_graph_one(const char *obj_dir) +{ + char *graph_name; + + if (commit_graph) + return; + + graph_name = get_commit_graph_filename(obj_dir); + commit_graph = load_commit_graph_one(graph_name); + + FREE_AND_NULL(graph_name); +} + +static int prepare_commit_graph_run_once = 0; +static void prepare_commit_graph(void) +{ + struct alternate_object_database *alt; + char *obj_dir; + + if (prepare_commit_graph_run_once) + return; + prepare_commit_graph_run_once = 1; + + obj_dir = get_object_directory(); + prepare_commit_graph_one(obj_dir); + prepare_alt_odb(); + for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next) + prepare_commit_graph_one(alt->path); +} + +static void close_commit_graph(void) +{ + if (!commit_graph) + return; + + if (commit_graph->graph_fd >= 0) { + munmap((void *)commit_graph->data, commit_graph->data_len); + commit_graph->data = NULL; + close(commit_graph->graph_fd); + } + + FREE_AND_NULL(commit_graph); +} + +static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos) +{ + return bsearch_hash(oid->hash, g->chunk_oid_fanout, + g->chunk_oid_lookup, g->hash_len, pos); +} + +static struct commit_list **insert_parent_or_die(struct commit_graph *g, +uint64_t pos, +struct commit_list **pptr) +{ + struct commit *c; + struct object_id oid; + hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); + c = lookup_commit(); + if (!c) + die("could not find commit %s", oid_to_hex()); + c->graph_pos = pos; + return _list_insert(c, pptr)->next; +} + +static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + struct object_id oid; + uint32_t edge_value; + uint32_t *parent_data_ptr; + uint64_t date_low, date_high; + struct commit_list **pptr; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + + item->object.parsed = 1; + item->graph_pos = pos; + + hashcpy(oid.hash, commit_data); + item->tree = lookup_tree(); + + date_high = ntohl(*(uint32_t*)(commit_data + g->hash_len + 8)) & 0x3; + date_low = nto
[PATCH v5 06/13] commit-graph: implement 'git-commit-graph write'
Teach git-commit-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 39 builtin/commit-graph.c | 33 ++ t/t5318-commit-graph.sh| 125 + 3 files changed, 197 insertions(+) create mode 100755 t/t5318-commit-graph.sh diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 5913340..e688843 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -5,6 +5,45 @@ NAME git-commit-graph - Write and verify Git commit graph files + +SYNOPSIS + +[verse] +'git commit-graph write' [--object-dir ] + + +DESCRIPTION +--- + +Manage the serialized commit graph file. + + +OPTIONS +--- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. The commit graph file is expected to be at /info/commit-graph + and the packfiles are expected to be in /pack. + + +COMMANDS + +'write':: + +Write a commit graph file based on the commits found in packfiles. +Includes all commits from the existing commit graph file. + + +EXAMPLES + + +* Write a commit graph file for the packed commits in your local .git folder. ++ + +$ git commit-graph write + + + GIT --- Part of the linkgit:git[1] suite diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 8ff7336..a9d61f6 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -1,9 +1,18 @@ #include "builtin.h" #include "config.h" +#include "dir.h" +#include "lockfile.h" #include "parse-options.h" +#include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph write [--object-dir ]"), + NULL +}; + +static const char * const builtin_commit_graph_write_usage[] = { + N_("git commit-graph write [--object-dir ]"), NULL }; @@ -11,6 +20,25 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_write(int argc, const char **argv) +{ + static struct option builtin_commit_graph_write_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_write_options, +builtin_commit_graph_write_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + write_commit_graph(opts.obj_dir); + return 0; +} int cmd_commit_graph(int argc, const char **argv, const char *prefix) { @@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) builtin_commit_graph_usage, PARSE_OPT_STOP_AT_NON_OPTION); + if (argc > 0) { + if (!strcmp(argv[0], "write")) + return graph_write(argc, argv); + } + usage_with_options(builtin_commit_graph_usage, builtin_commit_graph_options); } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh new file mode 100755 index 000..7524d2a --- /dev/null +++ b/t/t5318-commit-graph.sh @@ -0,0 +1,125 @@ +#!/bin/sh + +test_description='commit graph' +. ./test-lib.sh + +test_expect_success 'setup full repo' ' + mkdir full && + cd "$TRASH_DIRECTORY/full" && + git init && + objdir=".git/objects" +' + +test_expect_success 'write graph with no packs' ' + cd "$TRASH_DIRECTORY/full" && + git commit-graph write --object-dir . && + test_path_is_file info/commit-graph +' + +test_expect_success 'create commits and repack' ' + cd "$TRASH_DIRECTORY/full" && + for i in $(test_seq 3) + do + test_commit $i && + git branch commits/$i + done && + git repack +' + +test_expect_success 'write graph' ' + cd "$TRASH_DIRECTORY/full" && + graph1=$(git commit-graph write) && + test_path_is_file $objdir/info/commit-graph +' + +test_expect_success 'Add more commits' ' + cd "$TRASH_DIRECTORY/full" && + git reset --hard commits/1 && + for i in $(test_seq 4 5) + do + test_commit $i && +
[PATCH v5 00/13] Serialized Git Commit Graph
This patch series is another big difference from version 4, but I do think we are converging on a stable design. This series depends on a few things in flight: * jt/binsearch-with-fanout for bsearch_graph() * 'master' includes the sha1file -> hashfile rename in (98a3beab). * [PATCH] commit: drop uses of get_cached_commit_buffer(). [1] I couldn't find a ds/* branch for this one, but it is necessary or else the commit graph test script should fail. Here are some of the inter-patch changes: * The single commit graph file is stored in the fixed filename .git/objects/info/commit-graph * Because of this change, I struggled with the right way to pair the lockfile API with the hashfile API. Perhaps they were not meant to interact like this. I include a new patch step that adds a flag for hashclose() to keep the file descriptor open so commit_lock_file() can succeed. Please let me know if this is the wrong approach. * A side-benefit of this change is that the "--set-latest" and "--delete-expired" arguments are no longer useful. * I re-ran the performance tests since I rebased onto master. I had moved my "master" branch on my copy of Linux from another perf test, which changed the data shape a bit. * There was some confusion between v3 and v4 about whether commits in an existing commit-graph file are automatically added to the new file during a write. I think I cleared up all of the documentation that referenced this to the new behavior: we only include commits reachable from the starting commits (depending on --stdin-commits, --stdin-packs, or neither) unless the new "--additive" argument is specified. Thanks, -Stolee [1] https://public-inbox.org/git/1519240631-221761-1-git-send-email-dsto...@microsoft.com/ -- >8 -- This patch contains a way to serialize the commit graph. The current implementation defines a new file format to store the graph structure (parent relationships) and basic commit metadata (commit date, root tree OID) in order to prevent parsing raw commits while performing basic graph walks. For example, we do not need to parse the full commit when performing these walks: * 'git log --topo-order -1000' walks all reachable commits to avoid incorrect topological orders, but only needs the commit message for the top 1000 commits. * 'git merge-base ' may walk many commits to find the correct boundary between the commits reachable from A and those reachable from B. No commit messages are needed. * 'git branch -vv' checks ahead/behind status for all local branches compared to their upstream remote branches. This is essentially as hard as computing merge bases for each. The current patch speeds up these calculations by injecting a check in parse_commit_gently() to check if there is a graph file and using that to provide the required metadata to the struct commit. The file format has room to store generation numbers, which will be provided as a patch after this framework is merged. Generation numbers are referenced by the design document but not implemented in order to make the current patch focus on the graph construction process. Once that is stable, it will be easier to add generation numbers and make graph walks aware of generation numbers one-by-one. Here are some performance results for a copy of the Linux repository where 'master' has 664,185 reachable commits and is behind 'origin/master' by 60,191 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 6.56s | 0.66s | -89% | | branch -vv | 1.35s | 0.32s | -76% | | rev-list --all | 6.7s | 0.83s | -87% | | rev-list --all --objects | 33.0s | 27.5s | -16% | To test this yourself, run the following on your repo: git config core.commitGraph true git show-ref -s | git commit-graph write --stdin-commits The second command writes a commit graph file containing every commit reachable from your refs. Now, all git commands that walk commits will check your graph first before consulting the ODB. You can run your own performance comparisons by toggling the 'core.commitGraph' setting. [1] https://github.com/derrickstolee/git/pull/2 A GitHub pull request containing the latest version of this patch. Derrick Stolee (13): commit-graph: add format document graph: add commit graph design document commit-graph: create git-commit-graph builtin csum-file: add CSUM_KEEP_OPEN flag commit-graph: implement write_commit_graph() commit-graph: implement 'git-commit-graph write' commit-graph: implement git commit-graph read commit-graph: add core.commitGraph setting commit-graph: close under reachability commit: integrate commit graph with commit parsing commit-graph: read only from specific pack-indexes commit-graph: build graph from starting commits
[PATCH v5 05/13] commit-graph: implement write_commit_graph()
Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given object directory. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Makefile | 1 + commit-graph.c | 360 + commit-graph.h | 7 ++ 3 files changed, 368 insertions(+) create mode 100644 commit-graph.c create mode 100644 commit-graph.h diff --git a/Makefile b/Makefile index 2e4956f..bf91b2d 100644 --- a/Makefile +++ b/Makefile @@ -771,6 +771,7 @@ LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o LIB_OBJS += commit.o +LIB_OBJS += commit-graph.o LIB_OBJS += compat/obstack.o LIB_OBJS += compat/terminal.o LIB_OBJS += config.o diff --git a/commit-graph.c b/commit-graph.c new file mode 100644 index 000..2251397 --- /dev/null +++ b/commit-graph.c @@ -0,0 +1,360 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "revision.h" +#include "sha1-lookup.h" +#include "commit-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_OCTOPUS_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +#define GRAPH_FANOUT_SIZE (4 * 256) +#define GRAPH_CHUNKLOOKUP_WIDTH 12 +#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ + GRAPH_OID_LEN + 8) + + +static char *get_commit_graph_filename(const char *obj_dir) +{ + return xstrfmt("%s/info/commit-graph", obj_dir); +} + +static void write_graph_chunk_fanout(struct hashfile *f, +struct commit **commits, +int nr_commits) +{ + int i, count = 0; + struct commit **list = commits; + + /* +* Write the first-level table (the list is sorted, +* but we use a 256-entry lookup to be able to avoid +* having to do eight extra binary search iterations). +*/ + for (i = 0; i < 256; i++) { + while (count < nr_commits) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + hashwrite_be32(f, count); + } +} + +static void write_graph_chunk_oids(struct hashfile *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + int count; + for (count = 0; count < nr_commits; count++, list++) + hashwrite(f, (*list)->object.oid.hash, (int)hash_len); +} + +static const unsigned char *commit_to_sha1(size_t index, void *table) +{ + struct commit **commits = table; + return commits[index]->object.oid.hash; +} + +static void write_graph_chunk_data(struct hashfile *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_extra_edges = 0; + + while (list < last) { + struct commit_list *parent; + int edge_value; + uint32_t packedDate[2]; + + parse_commit(*list); + hashwrite(f, (*list)->tree->object.oid.hash, hash_len); + + parent = (*list)->parents; + + if (!parent) + edge_value = GRAPH_PARENT_NONE; + else { + edge_value = sha1_pos(parent->item->object.oid.hash, + commits, + nr_commits, + commit_to_sha1); + + if (edge_value < 0) + edge_value = GRAPH_PARENT_MISSING; + } + + hashwrite_be32(f, edge_value); + + if (parent) + parent = parent->next; + + if (!parent) +
[PATCH v5 04/13] csum-file: add CSUM_KEEP_OPEN flag
This patch is new to the series due to the interactions with the lockfile API and the hashfile API. I need to ensure the hashfile writes the hash value at the end of the file, but keep the file descriptor open so the lock is valid. I welcome any susggestions to this patch or to the way I use it in the commit that follows. -- >8 -- If we want to use a hashfile on the temporary file for a lockfile, then we need hashclose() to fully write the trailing hash but also keep the file descriptor open. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- csum-file.c | 10 +++--- csum-file.h | 1 + 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/csum-file.c b/csum-file.c index 5eda7fb..302e6ae 100644 --- a/csum-file.c +++ b/csum-file.c @@ -66,9 +66,13 @@ int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags) flush(f, f->buffer, the_hash_algo->rawsz); if (flags & CSUM_FSYNC) fsync_or_die(f->fd, f->name); - if (close(f->fd)) - die_errno("%s: sha1 file error on close", f->name); - fd = 0; + if (flags & CSUM_KEEP_OPEN) + fd = f->fd; + else { + if (close(f->fd)) + die_errno("%s: sha1 file error on close", f->name); + fd = 0; + } } else fd = f->fd; if (0 <= f->check_fd) { diff --git a/csum-file.h b/csum-file.h index 992e5c0..b7c0e48 100644 --- a/csum-file.h +++ b/csum-file.h @@ -29,6 +29,7 @@ extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *); /* hashclose flags */ #define CSUM_CLOSE 1 #define CSUM_FSYNC 2 +#define CSUM_KEEP_OPEN 4 extern struct hashfile *hashfd(int fd, const char *name); extern struct hashfile *hashfd_check(const char *name); -- 2.7.4
[PATCH v5 01/13] commit-graph: add format document
Add document specifying the binary format for commit graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. The format automatically includes two parent positions for every commit. This favors speed over space, since using only one position per commit would cause an extra level of indirection for every merge commit. (Octopus merges suffer from this indirection, but they are very rare.) Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph-format.txt | 98 + 1 file changed, 98 insertions(+) create mode 100644 Documentation/technical/commit-graph-format.txt diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 000..4402baa --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,98 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as 32-bit integers corresponding to +the array position withing the list of commit OIDs. We use the most-significant +bit for special purposes, so we can store at most (1 << 31) - 1 (around 2 +billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte (reserved for later use) + Current clients should ignore this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + type appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) +* The first H bytes are for the OID of the root tree. +* The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0x if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. +* The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of commit + positions for the parent
[PATCH v5 11/13] commit-graph: read only from specific pack-indexes
Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 11 ++- builtin/commit-graph.c | 33 ++--- commit-graph.c | 26 -- commit-graph.h | 4 +++- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5318-commit-graph.sh| 10 ++ 7 files changed, 81 insertions(+), 9 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 51cb038..b945510 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -32,7 +32,9 @@ COMMANDS 'write':: Write a commit graph file based on the commits found in packfiles. -Includes all commits from the existing commit graph file. ++ +With the `--stdin-packs` option, generate the new commit graph by +walking objects only in the specified packfiles. 'read':: @@ -49,6 +51,13 @@ EXAMPLES $ git commit-graph write +* Write a graph file, extending the current graph file using commits +* in . ++ + +$ echo | git commit-graph write --stdin-packs + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 0e164be..eebca57 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; @@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int stdin_packs; } opts; static int graph_read(int argc, const char **argv) @@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv) static int graph_write(int argc, const char **argv) { + const char **pack_indexes = NULL; + int packs_nr = 0; + const char **lines = NULL; + int lines_nr = 0; + int lines_alloc = 0; + static struct option builtin_commit_graph_write_options[] = { OPT_STRING(0, "object-dir", _dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "stdin-packs", _packs, + N_("scan packfiles listed by stdin for commits")), OPT_END(), }; @@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv) if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - write_commit_graph(opts.obj_dir); + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + lines_nr = 0; + lines_alloc = 128; + ALLOC_ARRAY(lines, lines_alloc); + + while (strbuf_getline(, stdin) != EOF) { + ALLOC_GROW(lines, lines_nr + 1, lines_alloc); + lines[lines_nr++] = strbuf_detach(, NULL); + } + + pack_indexes = lines; + packs_nr = lines_nr; + } + + write_commit_graph(opts.obj_dir, + pack_indexes, + packs_nr); + return 0; } diff --git a/commit-graph.c b/commit-graph.c index 184b8da..4e9f1d5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -529,7 +529,9 @@ static void close_reachable(struct packed_oid_list *oids) } } -void write_commit_graph(const char *obj_dir) +void write_commit_graph(const char *obj_dir, + const char **pack_indexes, + int nr_packs) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -552,7 +554,27 @@ void write_commit_graph(const char *obj_dir) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); - for_each_packed_object(add_packed_commits, , 0); + if (pack_indexes) { + struct strbuf packname = STRBUF_INIT; + int dirlen; + st
[PATCH v5 07/13] commit-graph: implement git commit-graph read
Teach git-commit-graph to read commit graph files and summarize their contents. Use the read subcommand to verify the contents of a commit graph file in the tests. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 12 builtin/commit-graph.c | 56 +++ commit-graph.c | 140 - commit-graph.h | 23 ++ t/t5318-commit-graph.sh| 32 +++-- 5 files changed, 257 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index e688843..51cb038 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files SYNOPSIS [verse] +'git commit-graph read' [--object-dir ] 'git commit-graph write' [--object-dir ] @@ -33,6 +34,11 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. Includes all commits from the existing commit graph file. +'read':: + +Read a graph file given by the commit-graph file and output basic +details about the graph file. Used for debugging purposes. + EXAMPLES @@ -43,6 +49,12 @@ EXAMPLES $ git commit-graph write +* Read basic information from the commit-graph file. ++ + +$ git commit-graph read + + GIT --- diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index a9d61f6..0e164be 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -7,10 +7,16 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph read [--object-dir ]"), N_("git commit-graph write [--object-dir ]"), NULL }; +static const char * const builtin_commit_graph_read_usage[] = { + N_("git commit-graph read [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_write_usage[] = { N_("git commit-graph write [--object-dir ]"), NULL @@ -20,6 +26,54 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_read(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_read_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_read_options, +builtin_commit_graph_read_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + + if (!graph) + die("graph file %s does not exist", graph_name); + FREE_AND_NULL(graph_name); + + printf("header: %08x %d %d %d %d\n", + ntohl(*(uint32_t*)graph->data), + *(unsigned char*)(graph->data + 4), + *(unsigned char*)(graph->data + 5), + *(unsigned char*)(graph->data + 6), + *(unsigned char*)(graph->data + 7)); + printf("num_commits: %u\n", graph->num_commits); + printf("chunks:"); + + if (graph->chunk_oid_fanout) + printf(" oid_fanout"); + if (graph->chunk_oid_lookup) + printf(" oid_lookup"); + if (graph->chunk_commit_data) + printf(" commit_metadata"); + if (graph->chunk_large_edges) + printf(" large_edges"); + printf("\n"); + + return 0; +} + static int graph_write(int argc, const char **argv) { static struct option builtin_commit_graph_write_options[] = { @@ -60,6 +114,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "read")) + return graph_read(argc, argv); if (!strcmp(argv[0], "write")) return graph_write(argc, argv); } diff --git a/commit-graph.c b/commit-graph.c index 2251397..7b0cfb4 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -39,11 +39,149 @@ GRAPH_OID_LEN + 8) -static char *get_commit_graph_filename(const char *obj_dir) +char *get_commit_graph_file
[PATCH v5 08/13] commit-graph: add core.commitGraph setting
The commit graph feature is controlled by the new core.commitGraph config setting. This defaults to 0, so the feature is opt-in. The intention of core.commitGraph is that a user can always stop checking for or parsing commit graph files if core.commitGraph=0. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/config.txt | 3 +++ cache.h | 1 + config.c | 5 + environment.c| 1 + 4 files changed, 10 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index f57e9cf..77fcd53 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -898,6 +898,9 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.commitGraph:: + Enable git commit graph feature. Allows reading from .graph files. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. diff --git a/cache.h b/cache.h index 21fbcc2..3a18a02 100644 --- a/cache.h +++ b/cache.h @@ -801,6 +801,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; +extern int core_commit_graph; extern int core_apply_sparse_checkout; extern int precomposed_unicode; extern int protect_hfs; diff --git a/config.c b/config.c index b0c20e6..25ee4a6 100644 --- a/config.c +++ b/config.c @@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.commitgraph")) { + core_commit_graph = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index de8431e..0e96be3 100644 --- a/environment.c +++ b/environment.c @@ -62,6 +62,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED; enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; +int core_commit_graph; int core_apply_sparse_checkout; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ -- 2.7.4
[PATCH v5 09/13] commit-graph: close under reachability
Teach write_commit_graph() to walk all parents from the commits discovered in packfiles. This prevents gaps given by loose objects or previously-missed packfiles. Also automatically add commits from the existing graph file, if it exists. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 7b0cfb4..01aa23d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -369,6 +369,28 @@ static int add_packed_commits(const struct object_id *oid, return 0; } +static void close_reachable(struct packed_oid_list *oids) +{ + int i; + struct rev_info revs; + struct commit *commit; + init_revisions(, NULL); + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + if (commit && !parse_commit(commit)) + revs.commits = commit_list_insert(commit, ); + } + + if (prepare_revision_walk()) + die(_("revision walk setup failed")); + + while ((commit = get_revision()) != NULL) { + ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc); + oidcpy(>list[oids->nr], &(commit->object.oid)); + (oids->nr)++; + } +} + void write_commit_graph(const char *obj_dir) { struct packed_oid_list oids; @@ -393,6 +415,7 @@ void write_commit_graph(const char *obj_dir) ALLOC_ARRAY(oids.list, oids.alloc); for_each_packed_object(add_packed_commits, , 0); + close_reachable(); QSORT(oids.list, oids.nr, commit_compare); -- 2.7.4
[PATCH] sha1_name: fix uninitialized memory errors
During abbreviation checks, we navigate to the position within a pack-index that an OID would be inserted and check surrounding OIDs for the maximum matching prefix. This position may be beyond the last position, because the given OID is lexicographically larger than every OID in the pack. Then nth_packed_object_oid() does not initialize "oid". Use the return value of nth_packed_object_oid() to prevent these errors. Reported-by: Christian Couder <christian.cou...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- sha1_name.c | 11 +++ 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index 611c7d2..44dd595 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -546,17 +546,12 @@ static void find_abbrev_len_for_pack(struct packed_git *p, * nearby for the abbreviation length. */ mad->init_len = 0; - if (!match) { - nth_packed_object_oid(, p, first); + if (!match && nth_packed_object_oid(, p, first)) extend_abbrev_len(, mad); - } else if (first < num - 1) { - nth_packed_object_oid(, p, first + 1); + else if (first < num - 1 && nth_packed_object_oid(, p, first + 1)) extend_abbrev_len(, mad); - } - if (first > 0) { - nth_packed_object_oid(, p, first - 1); + if (first > 0 && nth_packed_object_oid(, p, first - 1)) extend_abbrev_len(, mad); - } mad->init_len = mad->cur_len; } -- 2.7.4
[PATCH v2] sha1_name: fix uninitialized memory errors
Peff made an excellent point about the nested if statements. This goes back to Christian's original recommendation. -- >8 -- During abbreviation checks, we navigate to the position within a pack-index that an OID would be inserted and check surrounding OIDs for the maximum matching prefix. This position may be beyond the last position, because the given OID is lexicographically larger than every OID in the pack. Then nth_packed_object_oid() does not initialize "oid". Use the return value of nth_packed_object_oid() to prevent these errors. Reported-by: Christian Couder <christian.cou...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- sha1_name.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index 611c7d24dd..a041d8d24f 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -547,15 +547,15 @@ static void find_abbrev_len_for_pack(struct packed_git *p, */ mad->init_len = 0; if (!match) { - nth_packed_object_oid(, p, first); - extend_abbrev_len(, mad); + if (nth_packed_object_oid(, p, first)) + extend_abbrev_len(, mad); } else if (first < num - 1) { - nth_packed_object_oid(, p, first + 1); - extend_abbrev_len(, mad); + if (nth_packed_object_oid(, p, first + 1)) + extend_abbrev_len(, mad); } if (first > 0) { - nth_packed_object_oid(, p, first - 1); - extend_abbrev_len(, mad); + if (nth_packed_object_oid(, p, first - 1)) + extend_abbrev_len(, mad); } mad->init_len = mad->cur_len; } -- 2.16.2.265.g3d5930c0b9.dirty
Re: Use of uninitialised value of size 8 in sha1_name.c
On 2/26/2018 5:23 AM, Christian Couder wrote: On Mon, Feb 26, 2018 at 10:53 AM, Jeff Kingwrote: On Mon, Feb 26, 2018 at 10:04:22AM +0100, Christian Couder wrote: ==21455== Use of uninitialised value of size 8 ==21455==at 0x2D2A73: get_hex_char_from_oid (sha1_name.c:492) ==21455==by 0x2D2AFE: extend_abbrev_len (sha1_name.c:502) ==21455==by 0x2D2C3D: find_abbrev_len_for_pack (sha1_name.c:551) ==21455==by 0x2D2CFF: find_abbrev_len_packed (sha1_name.c:569) ==21455==by 0x2D2E12: find_unique_abbrev_r (sha1_name.c:608) ==21455==by 0x2DCB66: strbuf_add_unique_abbrev (strbuf.c:877) ==21455==by 0x14F7CE: update_local_ref (fetch.c:700) ==21455==by 0x1500CF: store_updated_refs (fetch.c:871) ==21455==by 0x15035B: fetch_refs (fetch.c:932) ==21455==by 0x150CF8: do_fetch (fetch.c:1146) ==21455==by 0x1515AB: fetch_one (fetch.c:1370) ==21455==by 0x151A1D: cmd_fetch (fetch.c:1457) ==21455== Uninitialised value was created by a stack allocation ==21455==at 0x2D2B2E: find_abbrev_len_for_pack (sha1_name.c:513) ==21455== A quick git blame seems to point to 0e87b85683 (sha1_name: minimize OID comparisons during disambiguation, 2017-10-12). It is difficult to tell for sure though as t5616-partial-clone.sh was added after that commit. I think that commit is to blame, though the error isn't exactly where that stack trace puts it. Try this: diff --git a/sha1_name.c b/sha1_name.c index 611c7d24dd..6f7f36436f 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -547,7 +547,10 @@ static void find_abbrev_len_for_pack(struct packed_git *p, */ mad->init_len = 0; if (!match) { - nth_packed_object_oid(, p, first); + warning("p->num_objects = %u, first = %u", + p->num_objects, first); + if (!nth_packed_object_oid(, p, first)) + die("oops!"); extend_abbrev_len(, mad); } else if (first < num - 1) { nth_packed_object_oid(, p, first + 1); I get failures all over the test suite, like this: warning: p->num_objects = 4, first = 3 warning: p->num_objects = 8, first = 6 warning: p->num_objects = 10, first = 0 warning: p->num_objects = 4, first = 0 warning: p->num_objects = 8, first = 0 warning: p->num_objects = 10, first = 10 fatal: oops! Yeah, I tried to t5601-clone.sh with --valgrind and I also get one error (the same "Uninitialised value" error actually). Thanks for finding this. Sorry for the bug. Any time the abbreviated hex would go after the last object in the pack, then first==p->num_objects, and nth_packed_object_oid() will fail. That leaves uninitialized data in "oid", which is what valgrind complains about when we examine it in extend_abbrev_len(). Most of the time the code behaves correctly anyway, because the uninitialized bytes are unlikely to match up with our hex and extend the length. Probably we just need to detect that case and skip the call to extend_abbrev_len() altogether. Yeah, something like: diff --git a/sha1_name.c b/sha1_name.c index 611c7d24dd..a041d8d24f 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -547,15 +547,15 @@ static void find_abbrev_len_for_pack(struct packed_git *p, */ mad->init_len = 0; if (!match) { - nth_packed_object_oid(, p, first); - extend_abbrev_len(, mad); + if (nth_packed_object_oid(, p, first)) + extend_abbrev_len(, mad); } else if (first < num - 1) { - nth_packed_object_oid(, p, first + 1); - extend_abbrev_len(, mad); + if (nth_packed_object_oid(, p, first + 1)) + extend_abbrev_len(, mad); } if (first > 0) { - nth_packed_object_oid(, p, first - 1); - extend_abbrev_len(, mad); + if (nth_packed_object_oid(, p, first - 1)) + extend_abbrev_len(, mad); } mad->init_len = mad->cur_len; } seems to prevent valgrind from complaining when running t5616-partial-clone.sh. This seems like the safest fix, but also we could use our values of "match", "first" and "num" to safely call nth_packed_object_oid(). However, since nth_packed_object_oid() checks the bounds, don't duplicate work and just use the return value. I would reformat your patch slightly, but that's just preference: diff --git a/sha1_name.c b/sha1_name.c index 611c7d2..97b632c 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -546,17 +546,14 @@ static void find_abbrev_len_for_pack(struct packed_git *p, * nearby for the abbreviation length. */ mad->init_len = 0; - if (!match) { - nth_packed_object_oid(, p, first); + if (!match && nth_packed_object_oid(, p, first)) extend_abbrev_len(, mad); - } else if (first < num - 1) { -
Re: [PATCH] commit-graph: fix some "plain integer as NULL pointer" warnings
On 2/24/2018 12:42 AM, René Scharfe wrote: Am 24.02.2018 um 03:24 schrieb Ramsay Jones: diff --git a/commit-graph.c b/commit-graph.c index fc5ee7e99..c2f443436 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -45,7 +45,7 @@ char *get_graph_latest_filename(const char *obj_dir) { struct strbuf fname = STRBUF_INIT; strbuf_addf(, "%s/info/graph-latest", obj_dir); - return strbuf_detach(, 0); + return strbuf_detach(, NULL); } You could also replace that function body with: return xstrfmt("%s/info/graph-latest", obj_dir); René Thanks for the feedback! I will apply these locally as I am re-rolling today. -Stolee
Re: [PATCH] revision.c: reduce object database queries
On 2/28/2018 1:37 AM, Jeff King wrote: On Tue, Feb 27, 2018 at 03:16:58PM -0800, Junio C Hamano wrote: This code comes originally form 454fbbcde3 (git-rev-list: allow missing objects when the parent is marked UNINTERESTING, 2005-07-10). But later, in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be missing, 2009-01-27), we marked dealt with calling parse_object() on the parents more directly. So what I wonder is whether this code is simply redundant and can go away entirely. That would save the has_object_file() call in all cases. Hmm, interesting. I forgot all what I did around this area, but you are right. I'll leave it to Stolee whether he wants to dig into removing the has_object_file() call. I think it would do the right thing, but the most interesting bit would be how it impacts the timings. This patch was so small that I could understand the full implication of my change. I'm still very unfamiliar with the object walking machinery in revision.c. There are a lot of inter-dependencies that are taking time for me to understand and to feel confident that I won't cause a side effect in a special case. I do appreciate that Junio added a test in aeeae1b771. I'll make a note to revisit this area after I have a better grasp of the object walk code, but I will not try removing the has_object_file() call now. There's a similar case for trees. ... though technically the existing code allows _missing_ trees, but not on corrupt ones. True, but the intention of these "do not care too much about missing stuff while marking uninteresting" effort is aligned better with ignoring corrupt ones, too, I would think, as "missing" in that sentence is in fact about "not availble", and stuff that exists in corrupt form is still not available anyway. So I do not think it makes a bad change to start allowing corrupt ones. Agreed. Here it is in patch form, though as we both said, it probably doesn't matter that much in practice. So I'd be OK dropping it out of a sense of conservatism. -- >8 -- Subject: [PATCH] mark_tree_contents_uninteresting: drop has_object check It's generally acceptable for UNINTERESTING objects in a traversal to be unavailable (e.g., see aeeae1b771). When marking trees UNINTERESTING, we access the object database twice: once to check if the object is missing (and return quietly if it is), and then again to actually parse it. We can instead just try to parse; if that fails, we can then return quietly. That halves the effort we spend on locating the object. Note that this isn't _exactly_ the same as the original behavior, as the parse failure could be due to other problems than a missing object: it could be corrupted, in which case the original code would have died. But the new behavior is arguably better, as it covers the object being unavailable for any reason. We'll also still issue a warning to stderr in such a case. Signed-off-by: Jeff King--- revision.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/revision.c b/revision.c index 5ce9b93baa..221d62c52b 100644 --- a/revision.c +++ b/revision.c @@ -51,12 +51,9 @@ static void mark_tree_contents_uninteresting(struct tree *tree) { struct tree_desc desc; struct name_entry entry; - struct object *obj = >object; - if (!has_object_file(>oid)) + if (parse_tree_gently(tree, 1) < 0) return; - if (parse_tree(tree) < 0) - die("bad tree %s", oid_to_hex(>oid)); init_tree_desc(, tree->buffer, tree->size); while (tree_entry(, )) {
Re: [PATCH 03/11] packfile: allow install_packed_git to handle arbitrary repositories
On 2/27/2018 8:06 PM, Stefan Beller wrote: -void install_packed_git(struct packed_git *pack) +void install_packed_git(struct repository *r, struct packed_git *pack) This is a good thing to do. I'm just making note that this will collide with the new instances of install_packed_git() that I added in the serialized git commit graph series, specifically when adding the --stdin-packs argument [1]. Thanks, -Stolee [1] https://public-inbox.org/git/cagz79kym0fhiyq2+k5__a2hy1pecyigyf3n9zbjskh8yjzo...@mail.gmail.com/T/#u
Re: [PATCH 00/11] Moving global state into the repository object (part 2)
On 2/27/2018 9:15 PM, Duy Nguyen wrote: On Tue, Feb 27, 2018 at 05:05:57PM -0800, Stefan Beller wrote: This applies on top of origin/sb/object-store and is the continuation of that series, adding the repository as a context argument to functions. This series focusses on packfile handling, exposing (re)prepare_packed_git and find_pack_entry to a repository argument. It looks good. I agree. Looking at the full-series diff though, it makes me wonder if we should keep prepare_packed_git() private (i.e. how we initialize the object store, packfile included, is a private matter). How about something like this on top? I think the get_packed_git() approach is cleaner than navigating directly to the_repository->objects.packed_git that you expect to be initialized by an earlier method call. Thanks, -Stolee -- 8< -- Subject: [PATCH] packfile: keep raw_object_store.packed_git{,_mru} fields private These fields are initialized lazily via prepare_packed_git(). All access to these must call that function first but unless you know the implementation details, you may not see the connection. Keep that connection internal. These fields should only be accessed via get_packed_git() and get_packed_git_mru() outside packfile.c. Signed-off-by: Nguyễn Thái Ngọc Duy--- builtin/count-objects.c | 4 +--- builtin/fsck.c | 6 ++ builtin/gc.c | 3 +-- builtin/pack-objects.c | 21 +++-- builtin/pack-redundant.c | 6 ++ fast-import.c| 7 ++- http-backend.c | 5 ++--- object-store.h | 4 ++-- pack-bitmap.c| 3 +-- packfile.c | 15 ++- packfile.h | 6 +- server-info.c| 5 ++--- sha1_name.c | 6 ++ 13 files changed, 47 insertions(+), 44 deletions(-) diff --git a/builtin/count-objects.c b/builtin/count-objects.c index d480301763..088309945b 100644 --- a/builtin/count-objects.c +++ b/builtin/count-objects.c @@ -121,9 +121,7 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix) struct strbuf loose_buf = STRBUF_INIT; struct strbuf pack_buf = STRBUF_INIT; struct strbuf garbage_buf = STRBUF_INIT; - if (!the_repository->objects.packed_git) - prepare_packed_git(the_repository); - for (p = the_repository->objects.packed_git; p; p = p->next) { + for (p = get_packed_git(the_repository); p; p = p->next) { if (!p->pack_local) continue; if (open_pack_index(p)) diff --git a/builtin/fsck.c b/builtin/fsck.c index 0a043a43c2..6d86f2581a 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -732,10 +732,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) uint32_t total = 0, count = 0; struct progress *progress = NULL; - prepare_packed_git(the_repository); - if (show_progress) { - for (p = the_repository->objects.packed_git; p; + for (p = get_packed_git(the_repository); p; p = p->next) { if (open_pack_index(p)) continue; @@ -744,7 +742,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) progress = start_progress(_("Checking objects"), total); } - for (p = the_repository->objects.packed_git; p; + for (p = get_packed_git(the_repository); p; p = p->next) { /* verify gives error messages itself */ if (verify_pack(p, fsck_obj_buffer, diff --git a/builtin/gc.c b/builtin/gc.c index 80d19c54d5..be63bec09c 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -173,8 +173,7 @@ static int too_many_packs(void) if (gc_auto_pack_limit <= 0) return 0; - prepare_packed_git(the_repository); - for (cnt = 0, p = the_repository->objects.packed_git; p; p = p->next) { + for (cnt = 0, p = get_packed_git(the_repository); p; p = p->next) { if (!p->pack_local) continue; if (p->pack_keep) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5e2590f882..a305f50100 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1026,7 +1026,7 @@ static int want_object_in_pack(const struct object_id *oid, if (want != -1) return want; } - list_for_each(pos, _repository->objects.packed_git_mru) { + list_for_each(pos, get_packed_git_mru(the_repository)) { struct packed_git *p = list_entry(pos, struct
Re: What's cooking in git.git (Mar 2018, #01; Thu, 1)
On 3/1/2018 5:20 PM, Junio C Hamano wrote: -- [Graduated to "master"] * jt/binsearch-with-fanout (2018-02-15) 2 commits (merged to 'next' on 2018-02-15 at 7648891022) + packfile: refactor hash search with fanout table + packfile: remove GIT_DEBUG_LOOKUP log statements (this branch is used by ds/commit-graph.) Refactor the code to binary search starting from a fan-out table (which is how the packfile is indexed with object names) into a reusable helper. [...] -- [New Topics] * jk/cached-commit-buffer (2018-02-22) 2 commits (merged to 'next' on 2018-02-27 at af791d9a1e) + revision: drop --show-all option + commit: drop uses of get_cached_commit_buffer() Code clean-up. Will merge to 'master'. Thanks! These resolve the dependencies for ds/commit-graph * ds/commit-graph (2018-02-20) 13 commits - commit-graph: build graph from starting commits - commit-graph: read only from specific pack-indexes - commit: integrate commit graph with commit parsing - commit-graph: close under reachability - commit-graph: add core.commitGraph setting - commit-graph: implement --delete-expired - commit-graph: implement --set-latest - commit-graph: implement git commit-graph read - commit-graph: implement 'git-commit-graph write' - commit-graph: implement write_commit_graph() - commit-graph: create git-commit-graph builtin - graph: add commit graph design document - commit-graph: add format document Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. Reroll exists, but it appears that there will be a further reroll. cf. <1519698787-190494-1-git-send-email-dsto...@microsoft.com> v5 reroll (mentioned above) has limited review, so I won't be rerolling soon. I'm particularly interested in review on [PATCH 04/13] csum-file: add CSUM_KEEP_OPEN flag [1]. Thanks, -Stolee [1] https://public-inbox.org/git/1519698787-190494-5-git-send-email-dsto...@microsoft.com/
Re: [RFC] Contributing to Git (on Windows)
On 3/1/2018 11:44 PM, Jonathan Nieder wrote: Hi, Derrick Stolee wrote: Now, we'd like to make that document publicly available. These steps are focused on a Windows user, so we propose putting them in the git-for-windows/git repo under CONTRIBUTING.md. I have a pull request open for feedback [1]. I'll read comments on that PR or in this thread. Thanks! I wonder if we can put this in the standard Documentation/ directory as well. E.g. the Windows CONTRIBUTING.md could say say "See Documentation/Contributing-On-Windows.md" so that the bulk of the text would not have to be git-for-windows specific. That's a good idea. After this review stabilizes, I'll send a patch to add the windows-specific instructions as you recommend. What precedent do we have for referencing forks like git-for-windows/git? [...] +++ b/CONTRIBUTING.md @@ -0,0 +1,401 @@ +How to Contribute to Git for Windows + Would it make sense for this to be checked in with LF instead of CRLF line endings, so that clones use the user's chosen / platform native line ending? The .gitattributes file could include /CONTRIBUTING.md text so that line ending conversion takes place even if the user hasn't enabled core.autocrlf. Oops! I turned off the CRLF munging in my repo because I had an issue with a script somewhere, but forgot to turn it back on again. Thanks for checking this. [...] +The SDK uses a different credential manager, so you may still want to use Visual Studio +or normal Git Bash for interacting with your remotes. Alternatively, use SSH rather +than HTTPS and avoid credential manager problems. Where can I read more about the problems in question? I'll try to see if we can elaborate here. The Git for Windows client ships with Git Credential Manager for Windows [1] but the SDK does not. At the very least, credential interactions are not the same and you do not have access to the credentials stored in Windows Credential Manager. [1] https://github.com/Microsoft/Git-Credential-Manager-for-Windows [...] +Many new contributors ask: What should I start working on? + +One way to win big with the maintainer of an open-source project is to look at the +[issues page](https://github.com/git-for-windows/git/issues) and see if there are any issues that +you can fix quickly, or if anything catches your eye. You can also look at https://crbug.com/git for non Windows-specific issues. And you can look at recent user questions on the mailing list: https://public-inbox.org/git [...] +If you are adding new functionality, you may need to create low-level tests by creating +helper commands that test a very limited action. These commands are stored in `t/helpers`. +When adding a helper, be sure to add a line to `t/Makefile` and to the `.gitignore` for the +binary file you add. The Git community prefers functional tests using the full `git` +executable, so be sure the test helper is required. Maybe s/low-level/unit/? [...] +Read [`t/README`](https://github.com/git/git/blob/master/t/README) for more details. Forgive my ignorance: does github flavored markdown allow relative links? (I.e., could this say [`t/README`](t/README)?) [...] +You can also set certain environment variables to help test the performance on different +repositories or with more repetitions. The full list is available in +[the `t/perf/README` file](https://github.com/git/git/blob/master/t/perf/README), Likewise. [...] +Test Your Changes on Linux +-- + +It can be important to work directly on the [core Git codebase](https://github.com/git/git), +such as a recent commit into the `master` or `next` branch that has not been incorporated +into Git for Windows. Also, it can help to run functional and performance tests on your +code in Linux before submitting patches to the Linux-focused mailing list. I'm surprised at this advice. Does it actually come up? When I was on Mac I never worried about this, nor when I was on Windows. I'm personally happy to review patches that haven't been tested on Linux, though it's of course even nicer if the patch author mentions what platforms they've tested on. Maybe this can be reframed to refer specifically to cases where you've modified some Linux platform-specific code (e.g. to add a new feature to run-command.c)? I recently had a bug in my local copy of the commit-graph patch that was only caught by our Mac OSX automated builds: I was passing the edge-value for a parent into a lookup to get an octopus edge from the graph, but I forgot to drop the most-significant bit. This cast the "uint32_t" silently into an "int" but since we multiplied by 4 somehow the Windows and Linux compilers turned this into a left-shift while Mac did not and failed during my test. Now this is an example of something that probably would have been caught in review, but stuff gets through. The Win
Re: What's cooking in git.git (Apr 2018, #03; Wed, 25)
On 4/25/2018 1:43 PM, Brandon Williams wrote: On 04/25, Ævar Arnfjörð Bjarmason wrote: * bw/protocol-v2 (2018-03-15) 35 commits (merged to 'next' on 2018-04-11 at 23ee234a2c) + remote-curl: don't request v2 when pushing + remote-curl: implement stateless-connect command + http: eliminate "# service" line when using protocol v2 + http: don't always add Git-Protocol header + http: allow providing extra headers for http requests + remote-curl: store the protocol version the server responded with + remote-curl: create copy of the service name + pkt-line: add packet_buf_write_len function + transport-helper: introduce stateless-connect + transport-helper: refactor process_connect_service + transport-helper: remove name parameter + connect: don't request v2 when pushing + connect: refactor git_connect to only get the protocol version once + fetch-pack: support shallow requests + fetch-pack: perform a fetch using v2 + upload-pack: introduce fetch server command + push: pass ref prefixes when pushing + fetch: pass ref prefixes when fetching + ls-remote: pass ref prefixes when requesting a remote's refs + transport: convert transport_get_remote_refs to take a list of ref prefixes + transport: convert get_refs_list to take a list of ref prefixes + connect: request remote refs using v2 + ls-refs: introduce ls-refs server command + serve: introduce git-serve + test-pkt-line: introduce a packet-line test helper + protocol: introduce enum protocol_version value protocol_v2 + transport: store protocol version + connect: discover protocol version outside of get_remote_heads + connect: convert get_remote_heads to use struct packet_reader + transport: use get_refs_via_connect to get refs + upload-pack: factor out processing lines + upload-pack: convert to a builtin + pkt-line: add delim packet support + pkt-line: allow peeking a packet line without consuming it + pkt-line: introduce packet_read_with_status (this branch is used by bw/server-options.) The beginning of the next-gen transfer protocol. Will cook in 'next'. With a month & 10 days of no updates & this looking stable it would be great to have it in master sooner than later to build on top of it in the 2.18 window. I personally think that this series is ready to graduate to master but I mentioned to Junio off-list that it might be a good idea to wait until it has undergone a little more stress testing. We've been in the process of trying to get this rolled out to our internal server but due to a few roadblocks and people being out of the office its taken a bit longer than I would have liked to get it up and running. I'm hoping in another few days/a week I'll have some more data on live traffic. At that point I think I'll be more convinced that it will be safe to merge it. I may be overly cautions but I'm hoping that we can get this right without needing to do another protocol version bump for a very long time. Technically using v2 is hidden behind an "experimental" config flag so we should still be able to tweak it after the fact if we absolutely needed to, but I'd like to avoid that if necessary. There's no testing better than production. I think if we have an opportunity to test with realistic traffic, we should take advantage. But I also agree that this series looks stable. I realize it's hard to communicate both "this series is ready to merge" and "I appreciate your caution." Thanks, -Stolee
Re: [PATCH v4 03/10] commit-graph: compute generation numbers
n 4/25/2018 10:35 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); The ones that have infinity are written as zero here. The code that reads the generation field off of a file in fill_commit_graph_info() and fill_commit_in_graph() both leave such a record in file as-is, so the reader of what we write out will think it is _ZERO, not _INF. Not that it matters, as it seems that most of the code being added by this series treat _ZERO and _INF more or less interchangeably. But it does raise another question, i.e. do we need both _ZERO and _INF, or is it sufficient to have just a single _UNKNOWN? This code is confusing. The 'if' condition is useless, since at this point every commit should be finite (since we computed generation numbers for everyone). We should just write the value always. For the sake of discussion, the value _INFINITY means not in the graph and _ZERO means in the graph without a computed generation number. It's a small distinction, but it gives a single boundary to use for reachability queries that test generation number. @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } If we haven't computed all parents' generations yet, current->generation is undefined (or at least "left as initialized"), so it does not make much sense to attempt to clip it at _MAX at this point. At leat not yet. IOW, shouldn't the following two lines be inside the "we now know genno of all parents, so we can compute genno for commit" block above? You're right! Good catch. This code sets every merge commit to _MAX. It should be in the block above. + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0);