Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
On 12/6/2018 6:36 PM, Jonathan Tan wrote: AFAICT oid_object_info doesn't take advantage of the commit graph, but just looks up the object header, which is still less than completely parsing it. Then lookup_commit is overly strict, as it may return NULL as when there still is a type mismatch (I don't think a mismatch could happen here, as both rely on just the object store, and not the commit graph.), so this would be just defensive programming for the sake of it. I dunno. struct commit *c; if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT && (c = lookup_commit(revs->repo, oid)) && !repo_parse_commit(revs->repo, c)) object = (struct object *) c; else object = parse_object(revs->repo, oid); I like this way better - I'll do it in the next version. If we do _not_ have a commit-graph or if the commit-graph does not have that commit, this will have the same performance problem, right? Should we instead create a direct dependence on the commit-graph, and try to parse the oid from the graph directly? If it succeeds, then we learn that the object is a commit, in addition to all of the parsing work. This means we could avoid oid_object_info() loading data if we succeed. We would fall back to parse_object() if it fails. I was thinking this should be a simple API call to parse_commit_in_graph(), but that requires a struct commit filled with an oid, which is not the best idea if we don't actually know it is a commit yet. The approach I recommend would then be more detailed: 1. Modify find_commit_in_graph() to take a struct object_id instead of a struct commit. This helps find the integer position in the graph. That position can be used in fill_commit_in_graph() to load the commit contents. Keep find_commit_in_graph() static as it should not be a public function. 2. Create a public function with prototype struct commit *try_parse_commit_from_graph(struct repository *r, struct object_id *oid) that returns a commit struct fully parsed if and only if the repository has that oid. It can call find_commit_in_graph(), then lookup_commit() and fill_commit_in_graph() to create the commit and parse the data. 3. In replace of the snippet above, do: struct commit *c; if ((c = try_parse_commit_from_graph(revs->repo, oid)) object = (struct object *)c; else object = parse_object(revs->repo, oid); A similar pattern _could_ be used in parse_object(), but I don't recommend doing this pattern unless we have a reasonable suspicion that we are going to parse commits more often than other objects. (It adds an O(log(# commits)) binary search to each object.) A final thought: consider making this "try the commit graph first, but fall back to parse_object()" a library function with a name like struct object *parse_probably_commit(struct repository *r, struct object_id *oid) so other paths that are parsing a lot of commits (but also maybe tags) could use the logic. Thanks! -Stolee
Re: [PATCH v2 2/3] commit-graph: fix buffer read-overflow
On 12/6/2018 3:20 PM, Josh Steadmon wrote: + +# usage: corrupt_and_zero_graph_then_verify +# Manipulates the commit-graph file at by inserting the data, +# then zeros the file starting at . Finally, runs +# 'git commit-graph verify' and places the output in the file 'err'. Tests 'err' +# for the given string. +corrupt_and_zero_graph_then_verify() { This method is very similar to to 'corrupt_graph_and_verify()', the only difference being the zero_pos, which zeroes the graph. Could it instead be a modification of corrupt_graph_and_verify() where $4 is interpreted as zero_pos, and if it is blank we don't do the truncation? +test_expect_success 'detect truncated graph' ' + corrupt_and_zero_graph_then_verify $GRAPH_BYTE_CHUNK_COUNT "\xff" \ + $GRAPH_CHUNK_LOOKUP_OFFSET "chunk lookup table entry missing" +' + Thanks for this! I think it's valuable to keep explicit tests around that were discovered from your fuzz tests. Specifically, I can repeat the test when I get around to the next file format. Thanks, -Stolee
Re: [PATCH 2/2] commit-graph: fix buffer read-overflow
On 12/5/2018 5:32 PM, Josh Steadmon wrote: + if (chunk_lookup + GRAPH_CHUNKLOOKUP_WIDTH > data + graph_size) { + error(_("chunk lookup table entry missing; graph file may be incomplete")); + free(graph); + return NULL; + } Something I forgot earlier: there are several tests in t5318-commit-graph.sh that use 'git commit-graph verify' to ensure we hit these error conditions on a corrupted commit-graph file. Could you try adding a test there that looks for this error message? Thanks, -Stolee
Re: git, monorepos, and access control
On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote: On Wed, Dec 05 2018, Coiner, John wrote: I'm an engineer with AMD. I'm looking at whether we could switch our internal version control to a monorepo, possibly one based on git and VFSForGit. Has anyone looked at adding access control to git, at a per-directory granularity? Is this a feature that the git community would possibly welcome? All of what you've described is possible to implement in git, but as far as I know there's no existing implementation of it. Microsoft's GVFS probably comes closest, and they're actively upstreaming bits of that, but as far as I know that doesn't in any way try to solve this "users XYZ can't even list such-and-such a tree" problem. (Avar has a lot of good ideas in his message, so I'm just going to add on a few here.) This directory-level security is not a goal for VFS for Git, and I don't see itbecoming a priority as it breaks a number of design decisions we made in our object storage and communication models. The best I can think about when considering Git as an approach would be to use submodules for your security-related content, and then have server- side security for access to those repos. Of course, submodules are not supported in VFS for Git, either. The Gerrit service has _branch_ level security, which is related to the reachability questions that a directory security would need. However, the problem is quite different. Gerrit does have a lot of experience in dealing with submodules, though, so that's probably a good place to start. Thanks, -Stolee
Git Test Coverage Report (Friday Nov 30)
Here is today's test coverage report. Thanks, -Stolee [1] https://dev.azure.com/git/git/_build/results?buildId=277 --- pu: 5a1a9a96d55fbb80426189a921d7b6cc66564c78 jch: 71c29cabb7379fe9abaacbbbd1350268d0c18b4f next: a9faaff8c120bf4783cb892c157871fe524b3608 master: 7068cbc4abac53d9c3675dfba81c1e97d25e8eeb master@{1}: 7f4e64169352e03476b0ea64e7e2973669e491a2 Uncovered code in 'pu' not in 'jch' -- builtin/blame.c 74e8221b52 builtin/blame.c 928) blame_date_width = sizeof("Thu Oct 19 16:00"); 74e8221b52 builtin/blame.c 929) break; builtin/remote.c b7f4e371e7 builtin/remote.c 1551) die(_("--save-to-push cannot be used with other options")); b7f4e371e7 builtin/remote.c 1575) die(_("--save-to-push can only be used when only one url is defined")); date.c 74e8221b52 113) die("Timestamp too large for this system: %"PRItime, time); 74e8221b52 216) if (tm->tm_mon == human_tm->tm_mon) { 74e8221b52 217) if (tm->tm_mday > human_tm->tm_mday) { 74e8221b52 219) } else if (tm->tm_mday == human_tm->tm_mday) { 74e8221b52 220) hide.date = hide.wday = 1; 74e8221b52 221) } else if (tm->tm_mday + 5 > human_tm->tm_mday) { 74e8221b52 223) hide.date = 1; 74e8221b52 231) gettimeofday(, NULL); 74e8221b52 232) show_date_relative(time, tz, , buf); 74e8221b52 233) return; 74e8221b52 246) hide.seconds = 1; 74e8221b52 247) hide.tz |= !hide.date; 74e8221b52 248) hide.wday = hide.time = !hide.year; 74e8221b52 262) strbuf_rtrim(buf); 74e8221b52 287) gettimeofday(, NULL); 74e8221b52 290) human_tz = local_time_tzoffset(now.tv_sec, _tm); 74e8221b52 886) static int auto_date_style(void) 74e8221b52 888) return (isatty(1) || pager_in_use()) ? DATE_HUMAN : DATE_NORMAL; 74e8221b52 909) return DATE_HUMAN; 74e8221b52 911) return auto_date_style(); protocol.c 24c10f7473 37) die(_("Unrecognized protocol version")); 24c10f7473 39) die(_("Unrecognized protocol_version")); remote-curl.c 24c10f7473 403) return 0; sha1-array.c bba406749a 91) oidcpy([dst], [src]); strbuf.c 10a40f5700 397) return 0; submodule.c e2419f7e30 1376) strbuf_release(); 7454fe5cb6 1499) struct get_next_submodule_task *task = task_cb; 7454fe5cb6 1503) get_next_submodule_task_release(task); 7454fe5cb6 1530) return 0; 7454fe5cb6 1534) goto out; 7454fe5cb6 1549) return 0; wrapper.c 5efde212fc 70) die("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", 5efde212fc 73) error("Out of memory, malloc failed (tried to allocate %" PRIuMAX " bytes)", Commits introducing uncovered code: Anders Waldenborg 10a40f570: strbuf: separate callback for strbuf_expand:ing literals Denton Liu b7f4e371e: remote: add --save-to-push option to git remote set-url Josh Steadmon 24c10f747: protocol: advertise multiple supported versions Linus Torvalds 74e8221b5: Add 'human' date format Martin Koegler 5efde212f: zlib.c: use size_t for size Stefan Beller 7454fe5cb: fetch: try fetching submodules if needed objects were not fetched Stefan Beller bba406749: sha1-array: provide oid_array_filter Stefan Beller e2419f7e3: submodule: migrate get_next_submodule to use repository structs Uncovered code in 'jch' not in 'next' apply.c 0f086e6dca 3355) if (checkout_entry(ce, , NULL, NULL) || 0f086e6dca 3356) lstat(ce->name, st)) builtin/branch.c 0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD")); 0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of refs/heads/"), refname); hex.c 47edb64997 93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1) 47edb64997 95) return hash_to_hex_algop_r(buffer, sha1, _algos[GIT_HASH_SHA1]); 47edb64997 116) char *hash_to_hex(const unsigned char *hash) 47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo); pathspec.c 22af33bece 671) name = to_free = xmemdupz(name, namelen); read-cache.c ee70c12820 1730) if (advice_unknown_index_extension) { ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext); ee70c12820 1732) advise(_("This is likely due to the file having been written by a newer\n" sequencer.c 18e711a162 2387) opts->quiet = 1; sha1-file.c 2f90b9d9b4 sha1-file.c 172) int hash_algo_by_name(const char *name) 2f90b9d9b4 sha1-file.c 175) if (!name) 2f90b9d9b4 sha1-file.c 176) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 177) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 178) if (!strcmp(name, hash_algos[i].name)) 2f90b9d9b4 sha1-file.c 179) return i; 2f90b9d9b4 sha1-file.c 180) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 183) int hash_algo_by_id(uint32_t format_id) 2f90b9d9b4 sha1-file.c 186) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 187) if (format_id == hash_algos[i].format_id) 2f90b9d9b4 sha1-file.c 188) return i; 2f90b9d9b4 sha1-file.c 189) return GIT_HASH_UNKNOWN; tree.c e092073d64 104) commit = lookup_commit(r, entry.oid); Commits introducing uncovered code:
Re: [PATCH 3/5] pack-objects: add --sparse option
On 11/29/2018 9:39 PM, Junio C Hamano wrote: Derrick Stolee writes: While _eventually_ we should make this opt-out, we shouldn't do that until it has cooked a while. I actually do not agree. If the knob gives enough benefit, the users will learn about it viva voce, and in a few more releases we'll hear "enough users complain that they have to turn it on, let's make it the default". If that does not happen, the knob does not deserve to be turned on in the first place. That's a good philosophy. I'll keep it in mind. Having said that, I won't be commenting on this shiny new toy before the final. I want to see more people help tying the loose ends and give it final varnish to the upcoming release, as it seems to have become rockier and larger release than we originally anticipated. Yeah, no rush on this one. I just wanted to get some initial feedback about the idea. Thanks, -Stolee
[PATCH v2 5/6] pack-objects: create pack.useSparse setting
From: Derrick Stolee The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 4 t/t5322-pack-objects-sparse.sh | 15 +++ 2 files changed, 19 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget
[PATCH v2 0/6] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Update in V2: * Added GIT_TEST_PACK_SPARSE test option. * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null checks. Derrick Stolee (6): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting pack-objects: create GIT_TEST_PACK_SPARSE Documentation/git-pack-objects.txt | 9 +- bisect.c | 2 +- builtin/pack-objects.c | 10 ++- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 55 +++- list-objects.h | 4 +- revision.c | 121 + revision.h | 2 + t/README | 4 + t/t5322-pack-objects-sparse.sh | 139 + 11 files changed, 340 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/89 Range-diff vs v1: 1: b73b8de98c = 1: 60617681f7 revision: add mark_tree_uninteresting_sparse 2: 9bf04c748b ! 2: 4527addacb list-objects: consume sparse tree walk @@ -116,6 +116,10 @@ + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent = parents->item; + struct tree *tree = get_commit_tree(parent); ++ ++ if (!tree) ++ continue; ++ + oidset_insert(set, >object.oid); + + if (!(parent->object.flags & UNINTERESTING)) @@ -142,14 +146,14 @@ + for (list = revs->commits; list; list = list->next) { struct commit *commit = list->item; -+ + +- if (commit->object.flags & UNINTERESTING) { + if (sparse) { + struct tree *tree = get_commit_tree(commit); -+ ++ + if (commit->object.flags & UNINTERESTING) + tree->object.flags |= UNINTERESTING; - -- if (commit->object.flags & UNINTERESTING) { ++ + oidset_insert(, >object.oid); + add_edge_parents(commit, revs, show_edge, ); + } else if (commit->object.flags & UNINTERESTING) { @@ -189,3 +193,17 @@ struct oidset; struct list_objects_filter_options; + +diff --git a/revision.c b/revision.c +--- a/revision.c b/revision.c +@@ + while ((oid = oidset_iter_next())) { + struct tree *tree = lookup_tree(r, oid); + ++ if (!tree) ++ continue; ++ +
[PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee --- revision.c | 22 ++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, +struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, ); + while ((oid = oidset_iter_next())) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_unintersting(). +*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget
[PATCH v2 2/6] list-objects: consume sparse tree walk
From: Derrick Stolee When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 55 +++--- list-objects.h | 4 ++- revision.c | 3 +++ 7 files changed, 61 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk()) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk()) die("revision walk setup failed"); - mark_edges_uninteresting(, NULL); + mark_edges_uninteresting(, NULL, 0); objects_to_send = get_delta(, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..4fbdeca0a4 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, +struct rev_info *revs, +show_edge_fn show_edge, +struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { +
[PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE
From: Derrick Stolee Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse object walk algorithm by default during the test suite. Enabling this variable ensures coverage in many interesting cases, such as shallow clones, partial clones, and missing objects. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 1 + t/README | 4 t/t5322-pack-objects-sparse.sh | 6 +++--- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..507d381153 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) read_replace_refs = 0; + sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0); reset_pack_idx_option(_idx_opts); git_config(git_pack_config, NULL); diff --git a/t/README b/t/README index 28711cc508..8b6dfe1864 100644 --- a/t/README +++ b/t/README @@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write code path for the index version specified. Can be set to any valid version (currently 2, 3, or 4). +GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects +builtin to use the sparse object walk. This can still be overridden by +the --no-sparse command-line argument. + GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path by overriding the minimum number of cache entries required per thread. diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 8f5699bd91..e8cf41d1c6 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -36,7 +36,7 @@ test_expect_success 'setup repo' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' ' ' test_expect_success 'non-sparse pack-objects' ' - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt @@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' ' topic1 \ topic1^{tree} \ topic1:f3 | sort >expect_objects.txt && - git pack-objects --stdout --revs nonsparse.pack && + git pack-objects --stdout --revs --no-sparse nonsparse.pack && git index-pack -o nonsparse.idx nonsparse.pack && git show-index nonsparse_objects.txt && test_cmp expect_objects.txt nonsparse_objects.txt -- gitgitgadget
[PATCH v2 4/6] revision: implement sparse algorithm
From: Derrick Stolee When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack:220 Walked (old alg): 22,804 Walked (new alg):129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee --- revision.c | 116 ++--- t/t5322-pack-objects-sparse.sh | 21 -- 2 files changed, 121 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index f9eb6400f1..971f1bb095 100644 --- a/revision.c +++ b/revision.c @@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ +
[PATCH v2 3/6] pack-objects: add --sparse option
From: Derrick Stolee Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee --- Documentation/git-pack-objects.txt | 9 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 + 3 files changed, 127 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..ced2630eb3 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=] [--depth=] [--revs [--unpacked | --all]] [--keep-pack=] [--stdout [--filter=] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,13 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge, 0); + mark_edges_uninteresting(, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than "), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", , +N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", , N_("create thin packs")), OPT_BOOL(0, "shallow", , diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1\ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs nonsparse.pack && + git index-pack -o nonspar
Re: [PATCH 3/5] pack-objects: add --sparse option
On 11/28/2018 5:11 PM, Stefan Beller wrote: +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. As a user, where do I find a discussion of these walking algorithms? (i.e. how can I tell if I can really expect to gain performance benefits, what are the tradeoffs? "If it were strictly better than the original, it would be on by default" might be a thought of a user) You're right that having this hidden as an opt-in config variable makes it hard to discover as a typical user. I would argue that we should actually make the config setting true by default, and recommend that servers opt-out. Here are my reasons: 1. The vast majority of users are clients. 2. Client users are not likely to know about and tweak these settings. 3. Server users are more likely to keep an eye on the different knobs they can tweak. 4. Servers should use the reachability bitmaps, which don't use this logic anyway. While _eventually_ we should make this opt-out, we shouldn't do that until it has cooked a while. Thanks, -Stolee
Re: [PATCH 0/5] Add a new "sparse" tree walk algorithm
On 11/28/2018 5:18 PM, Ævar Arnfjörð Bjarmason wrote: This is really interesting. I tested this with: diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 124b1bafc4..5c7615f06c 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3143 +3143 @@ static void get_object_list(int ac, const char **av) - mark_edges_uninteresting(, show_edge, sparse); + mark_edges_uninteresting(, show_edge, 1); To emulate having a GIT_TEST_* mode for this, which seems like a good idea since it turned up a lot of segfaults in pack-objects. I wasn't able to get a backtrace for that since it always happens indirectly, and I didn't dig enough to see how to manually invoke it the right way. Thanks for double-checking this. I had run a similar test in my prototype implementation, but over-simplified some code when rewriting it for submission (and then forgot to re-run that test). Specifically, these null checks are important: diff --git a/list-objects.c b/list-objects.c index 9bb93d1640..7e864b4db8 100644 --- a/list-objects.c +++ b/list-objects.c @@ -232,6 +232,10 @@ static void add_edge_parents(struct commit *commit, for (parents = commit->parents; parents; parents = parents->next) { struct commit *parent = parents->item; struct tree *tree = get_commit_tree(parent); + + if (!tree) + continue; + oidset_insert(set, >object.oid); if (!(parent->object.flags & UNINTERESTING)) @@ -261,6 +265,8 @@ void mark_edges_uninteresting(struct rev_info *revs, if (sparse) { struct tree *tree = get_commit_tree(commit); + if (!tree) + continue; if (commit->object.flags & UNINTERESTING) tree->object.flags |= UNINTERESTING; I will definitely include a GIT_TEST_* variable in v2. Thanks, -Stolee
[PATCH 4/5] revision: implement sparse algorithm
From: Derrick Stolee When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commit we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees to find the interesting trees. This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees. The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path. We already had taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING. To actually improve the peformance, we need to terminate our recursion unless the oidset contains some intersting trees and some uninteresting trees. Technically, we only need one interesting tree for this to speed up in most cases, but we also will not mark anything UNINTERESTING if there are no uninteresting trees, so that would be wasted effort. There are a few ways that this is not a universally better option. First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working. Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries. Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos. For git.git, I used the following input: v2.19.0 ^v2.19.0~10 Objects to pack: 550 Walked (old alg): 282 Walked (new alg): 130 For the Linux repo, I used the following input: v4.18 ^v4.18~10 Objects to pack: 518 Walked (old alg): 4,836 Walked (new alg): 188 The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo. Objects to pack:220 Walked (old alg): 22,804 Walked (new alg):129 I used the number of walked trees the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on the file system being warm. However, by repeating the same test on repeat I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing. These improvements will have even larger benefits in the super- large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack. Signed-off-by: Derrick Stolee --- revision.c | 111 ++--- t/t5322-pack-objects-sparse.sh | 21 +-- 2 files changed, 116 insertions(+), 16 deletions(-) diff --git a/revision.c b/revision.c index 3a62c7c187..7e4bfe621a 100644 --- a/revision.c +++ b/revision.c @@ -99,26 +99,117 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +struct paths_and_oids { + struct string_list list; +}; + +static void paths_and_oids_init(struct paths_and_oids *po) +{ +
[PATCH 0/5] Add a new "sparse" tree walk algorithm
One of the biggest remaining pain points for users of very large repositories is the time it takes to run 'git push'. We inspected some slow pushes by our developers and found that the "Enumerating Objects" phase of a push was very slow. This is unsurprising, because this is why reachability bitmaps exist. However, reachability bitmaps are not available to us because of the single pack-file requirement. The bitmap approach is intended for servers anyway, and clients have a much different behavior pattern. Specifically, clients are normally pushing a very small number of objects compared to the entire working directory. A typical user changes only a small cone of the working directory, so let's use that to our benefit. Create a new "sparse" mode for 'git pack-objects' that uses the paths that introduce new objects to direct our search into the reachable trees. By collecting trees at each path, we can then recurse into a path only when there are uninteresting and interesting trees at that path. This gains a significant performance boost for small topics while presenting a possibility of packing extra objects. The main algorithm change is in patch 4, but is set up a little bit in patches 1 and 2. As demonstrated in the included test script, we see that the existing algorithm can send extra objects due to the way we specify the "frontier". But we can send even more objects if a user copies objects from one folder to another. I say "copy" because a rename would (usually) change the original folder and trigger a walk into that path, discovering the objects. In order to benefit from this approach, the user can opt-in using the pack.useSparse config setting. This setting can be overridden using the '--no-sparse' option. Derrick Stolee (5): revision: add mark_tree_uninteresting_sparse list-objects: consume sparse tree walk pack-objects: add --sparse option revision: implement sparse algorithm pack-objects: create pack.useSparse setting Documentation/git-pack-objects.txt | 9 +- bisect.c | 2 +- builtin/pack-objects.c | 9 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 51 ++- list-objects.h | 4 +- revision.c | 113 +++ revision.h | 2 + t/t5322-pack-objects-sparse.sh | 139 + 10 files changed, 323 insertions(+), 10 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/89 -- gitgitgadget
[PATCH 3/5] pack-objects: add --sparse option
From: Derrick Stolee Add a '--sparse' option flag to the pack-objects builtin. This allows the user to specify that they want to use the new logic for walking trees. This logic currently does not differ from the existing output, but will in a later change. Create a new test script, t5322-pack-objects-sparse.sh, to ensure the object list that is selected matches what we expect. When we update the logic to walk in a sparse fashion, the final test will be updated to show the extra objects that are added. Signed-off-by: Derrick Stolee --- Documentation/git-pack-objects.txt | 9 ++- builtin/pack-objects.c | 5 +- t/t5322-pack-objects-sparse.sh | 115 + 3 files changed, 127 insertions(+), 2 deletions(-) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 40c825c381..ced2630eb3 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -14,7 +14,7 @@ SYNOPSIS [--local] [--incremental] [--window=] [--depth=] [--revs [--unpacked | --all]] [--keep-pack=] [--stdout [--filter=] | base-name] - [--shallow] [--keep-true-parents] < object-list + [--shallow] [--keep-true-parents] [--sparse] < object-list DESCRIPTION @@ -196,6 +196,13 @@ depth is 4095. Add --no-reuse-object if you want to force a uniform compression level on all data no matter the source. +--sparse:: + Use the "sparse" algorithm to determine which objects to include in + the pack. This can have significant performance benefits when computing + a pack to send a small change. However, it is possible that extra + objects are added to the pack-file if the included commits contain + certain types of direct renames. + --thin:: Create a "thin" pack by omitting the common objects between a sender and a receiver in order to reduce network transfer. This diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 5f70d840a7..7d5b0735e3 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -84,6 +84,7 @@ static unsigned long pack_size_limit; static int depth = 50; static int delta_search_threads; static int pack_to_stdout; +static int sparse; static int thin; static int num_preferred_base; static struct progress *progress_state; @@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge, 0); + mark_edges_uninteresting(, show_edge, sparse); if (!fn_show_object) fn_show_object = show_object; @@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) { OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"), N_("unpack unreachable objects newer than "), PARSE_OPT_OPTARG, option_parse_unpack_unreachable }, + OPT_BOOL(0, "sparse", , +N_("use the sparse reachability algorithm")), OPT_BOOL(0, "thin", , N_("create thin packs")), OPT_BOOL(0, "shallow", , diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..81f6805bc3 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,115 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1\ + topic1:f1/f1/data.txt | sort >expect_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --revs nonsparse.pack && + git index-pack -o nonspar
[PATCH 5/5] pack-objects: create pack.useSparse setting
From: Derrick Stolee The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory. The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory. Create a 'pack.useSparse' setting that enables this new algorithm. This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls. If the '--no-sparse' flag is set, then this config setting is overridden. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 4 t/t5322-pack-objects-sparse.sh | 15 +++ 2 files changed, 19 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 7d5b0735e3..124b1bafc4 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb) use_bitmap_index_default = git_config_bool(k, v); return 0; } + if (!strcmp(k, "pack.usesparse")) { + sparse = git_config_bool(k, v); + return 0; + } if (!strcmp(k, "pack.threads")) { delta_search_threads = git_config_int(k, v); if (delta_search_threads < 0) diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh index 45dba6e014..8f5699bd91 100755 --- a/t/t5322-pack-objects-sparse.sh +++ b/t/t5322-pack-objects-sparse.sh @@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' ' test_cmp expect_sparse_objects.txt sparse_objects.txt ' +test_expect_success 'pack.useSparse enables algorithm' ' + git config pack.useSparse true && + git pack-objects --stdout --revs sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_sparse_objects.txt sparse_objects.txt +' + +test_expect_success 'pack.useSparse overridden' ' + git pack-objects --stdout --revs --no-sparse sparse.pack && + git index-pack -o sparse.idx sparse.pack && + git show-index sparse_objects.txt && + test_cmp expect_objects.txt sparse_objects.txt +' + test_done -- gitgitgadget
[PATCH 2/5] list-objects: consume sparse tree walk
From: Derrick Stolee When creating a pack-file using 'git pack-objects --revs' we provide a list of interesting and uninteresting commits. For example, a push operation would make the local topic branch be interesting and the known remote refs as uninteresting. We want to discover the set of new objects to send to the server as a thin pack. We walk these commits until we discover a frontier of commits such that every commit walk starting at interesting commits ends in a root commit or unintersting commit. We then need to discover which non-commit objects are reachable from uninteresting commits. The mark_edges_unintersting() method in list-objects.c iterates on the commit list and does the following: * If the commit is UNINTERSTING, then mark its root tree and every object it can reach as UNINTERESTING. * If the commit is interesting, then mark the root tree of every UNINTERSTING parent (and all objects that tree can reach) as UNINTERSTING. At the very end, we repeat the process on every commit directly given to the revision walk from stdin. This helps ensure we properly cover shallow commits that otherwise were not included in the frontier. The logic to recursively follow trees is in the mark_tree_uninteresting() method in revision.c. The algorithm avoids duplicate work by not recursing into trees that are already marked UNINTERSTING. Add a new 'sparse' option to the mark_edges_uninteresting() method that performs this logic in a slightly new way. As we iterate over the commits, we add all of the root trees to an oidset. Then, call mark_trees_uninteresting_sparse() on that oidset. Note that we include interesting trees in this process. The current implementation of mark_trees_unintersting_sparse() will walk the same trees as the old logic, but this will be replaced in a later change. The sparse option is not used by any callers at the moment, but will be wired to 'git pack-objects' in the next change. Signed-off-by: Derrick Stolee --- bisect.c | 2 +- builtin/pack-objects.c | 2 +- builtin/rev-list.c | 2 +- http-push.c| 2 +- list-objects.c | 51 ++ list-objects.h | 4 +++- 6 files changed, 54 insertions(+), 9 deletions(-) diff --git a/bisect.c b/bisect.c index 487675c672..842f8b4b8f 100644 --- a/bisect.c +++ b/bisect.c @@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs) if (prepare_revision_walk(revs)) die("revision walk setup failed"); if (revs->tree_objects) - mark_edges_uninteresting(revs, NULL); + mark_edges_uninteresting(revs, NULL, 0); } static void exit_if_skipped_commits(struct commit_list *tried, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 411aefd687..5f70d840a7 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av) if (prepare_revision_walk()) die(_("revision walk setup failed")); - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (!fn_show_object) fn_show_object = show_object; diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 2880ed37e3..9663cbfae0 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix) if (prepare_revision_walk()) die("revision walk setup failed"); if (revs.tree_objects) - mark_edges_uninteresting(, show_edge); + mark_edges_uninteresting(, show_edge, 0); if (bisect_list) { int reaches, all; diff --git a/http-push.c b/http-push.c index cd48590912..ea52d6f9f6 100644 --- a/http-push.c +++ b/http-push.c @@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv) pushing = 0; if (prepare_revision_walk()) die("revision walk setup failed"); - mark_edges_uninteresting(, NULL); + mark_edges_uninteresting(, NULL, 0); objects_to_send = get_delta(, ref_lock); finish_all_active_slots(); diff --git a/list-objects.c b/list-objects.c index c41cc80db5..9bb93d1640 100644 --- a/list-objects.c +++ b/list-objects.c @@ -222,25 +222,68 @@ static void mark_edge_parents_uninteresting(struct commit *commit, } } -void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge) +static void add_edge_parents(struct commit *commit, +struct rev_info *revs, +show_edge_fn show_edge, +struct oidset *set) +{ + struct commit_list *parents; + + for (parents = commit->parents; parents; parents = parents->next) { + struct commit *parent =
[PATCH 1/5] revision: add mark_tree_uninteresting_sparse
From: Derrick Stolee In preparation for a new algorithm that walks fewer trees when creating a pack from a set of revisions, create a method that takes an oidset of tree oids and marks reachable objects as UNINTERESTING. The current implementation uses the existing mark_tree_uninteresting to recursively walk the trees and blobs. This will walk the same number of trees as the old mechanism. There is one new assumption in this approach: we are also given the oids of the interesting trees. This implementation does not use those trees at the moment, but we will use them in a later rewrite of this method. Signed-off-by: Derrick Stolee --- revision.c | 22 ++ revision.h | 2 ++ 2 files changed, 24 insertions(+) diff --git a/revision.c b/revision.c index 13e0519c02..3a62c7c187 100644 --- a/revision.c +++ b/revision.c @@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree) mark_tree_contents_uninteresting(r, tree); } +void mark_trees_uninteresting_sparse(struct repository *r, +struct oidset *set) +{ + struct object_id *oid; + struct oidset_iter iter; + + oidset_iter_init(set, ); + while ((oid = oidset_iter_next())) { + struct tree *tree = lookup_tree(r, oid); + + if (tree->object.flags & UNINTERESTING) { + /* +* Remove the flag so the next call +* is not a no-op. The flag is added +* in mark_tree_unintersting(). +*/ + tree->object.flags ^= UNINTERESTING; + mark_tree_uninteresting(r, tree); + } + } +} + struct commit_stack { struct commit **items; size_t nr, alloc; diff --git a/revision.h b/revision.h index 7987bfcd2e..f828e91ae9 100644 --- a/revision.h +++ b/revision.h @@ -67,6 +67,7 @@ struct rev_cmdline_info { #define REVISION_WALK_NO_WALK_SORTED 1 #define REVISION_WALK_NO_WALK_UNSORTED 2 +struct oidset; struct topo_walk_info; struct rev_info { @@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs, void mark_parents_uninteresting(struct commit *commit); void mark_tree_uninteresting(struct repository *r, struct tree *tree); +void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set); void show_object_with_name(FILE *, struct object *, const char *); -- gitgitgadget
Re: [BUG REPORT] t5322: demonstrate a pack-objects bug
On 11/28/2018 2:45 PM, Derrick Stolee wrote: I was preparing a new "sparse" algorithm for calculating the interesting objects to send on push. The important steps happen during 'git pack-objects', so I was creating test cases to see how the behavior changes in narrow cases. Specifically, when copying a directory across sibling directories (see test case), the new logic would accidentally send that object as an extra. However, I found a bug in the existing logic. The included test demonstrates this during the final 'git index-pack' call. It fails with the message 'fatal: pack has 1 unresolved delta' I realize now that I've gone through this effort that the problem is me (of course it is). + git pack-objects --stdout --thin --revs nonsparse.pack && Since I'm packing thin packs, the index operation is complaining about deltas. So, I need to use a different mechanism to list the objects in the pack (say, 'git verify-pack -v'). Sorry for the noise! Thanks, -Stolee
[BUG REPORT] t5322: demonstrate a pack-objects bug
I was preparing a new "sparse" algorithm for calculating the interesting objects to send on push. The important steps happen during 'git pack-objects', so I was creating test cases to see how the behavior changes in narrow cases. Specifically, when copying a directory across sibling directories (see test case), the new logic would accidentally send that object as an extra. However, I found a bug in the existing logic. The included test demonstrates this during the final 'git index-pack' call. It fails with the message 'fatal: pack has 1 unresolved delta' It is probable that this is not a minimal test case, but happens to be the test I had created before discovering the problem. I compiled v2.17.0 and v2.12.0 as checks to see if I could find a "good" commit with which to start a bisect, but both failed. This is an old bug! Signed-off-by: Derrick Stolee --- t/t5322-pack-objects-sparse.sh | 94 ++ 1 file changed, 94 insertions(+) create mode 100755 t/t5322-pack-objects-sparse.sh diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh new file mode 100755 index 00..36faa70fe9 --- /dev/null +++ b/t/t5322-pack-objects-sparse.sh @@ -0,0 +1,94 @@ +#!/bin/sh + +test_description='pack-objects object selection using sparse algorithm' +. ./test-lib.sh + +test_expect_success 'setup repo' ' + test_commit initial && + for i in $(test_seq 1 3) + do + mkdir f$i && + for j in $(test_seq 1 3) + do + mkdir f$i/f$j && + echo $j >f$i/f$j/data.txt + done + done && + git add . && + git commit -m "Initialized trees" && + for i in $(test_seq 1 3) + do + git checkout -b topic$i master && + echo change-$i >f$i/f$i/data.txt && + git commit -a -m "Changed f$i/f$i/data.txt" + done && + cat >packinput.txt <<-EOF && + topic1 + ^topic2 + ^topic3 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f1 \ + topic1:f1/f1\ + topic1:f1/f1/data.txt | sort >actual_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --thin --revs nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index nonsparse_objects.txt && + test_cmp actual_objects.txt nonsparse_objects.txt +' + +# Demonstrate that both algorithms send "extra" objects because +# they are not in the frontier. + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + git checkout topic1 && + echo change-3 >f3/f3/data.txt && + git commit -a -m "Changed f3/f3/data.txt" && + git rev-parse \ + topic1~1\ + topic1~1^{tree} \ + topic1^{tree} \ + topic1 \ + topic1:f1 \ + topic1:f1/f1\ + topic1:f1/f1/data.txt \ + topic1:f3 \ + topic1:f3/f3\ + topic1:f3/f3/data.txt | sort >actual_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --thin --revs nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index nonsparse_objects.txt && + test_cmp actual_objects.txt nonsparse_objects.txt +' + +test_expect_success 'duplicate a folder from f3 and commit to topic1' ' + mkdir f3/f4 && + cp -r f1/f1/* f3/f4 && + git add f3/f4 && + git commit -m "Copied f1/f1 to f3/f4" && + cat >packinput.txt <<-EOF && + topic1 + ^topic1~1 + EOF + git rev-parse \ + topic1 \ + topic1^{tree} \ + topic1:f3 | sort >actual_objects.txt +' + +test_expect_success 'non-sparse pack-objects' ' + git pack-objects --stdout --thin --revs nonsparse.pack && + git index-pack -o nonsparse.idx nonsparse.pack && + git show-index nonsparse_objects.txt && + test_cmp actual_objects.txt nonsparse_objects.txt +' + +test_done -- 2.20.0.rc1
Git Test Coverage Report (Wednesday Nov 21)
to leave_reset_head; bac2a1e36f 585) ret = error(_("failed to find tree of %s"), oid_to_hex(oid)); bac2a1e36f 586) goto leave_reset_head; 3249c1251e 590) ret = error(_("failed to find tree of %s"), oid_to_hex(oid)); 3249c1251e 591) goto leave_reset_head; 3249c1251e 604) goto leave_reset_head; builtin/show-branch.c 517fe807d6 builtin/show-branch.c 607) BUG_ON_OPT_NEG(unset); builtin/show-ref.c 517fe807d6 builtin/show-ref.c 154) BUG_ON_OPT_NEG(unset); bundle.c 2c8ee1f53c 267) error_errno(_("unable to dup bundle descriptor")); 2c8ee1f53c 268) child_process_clear(_objects); 2c8ee1f53c 269) return -1; 2c8ee1f53c 478) rollback_lock_file(); config.c fast-import.c ca473cef91 2958) strbuf_addf(, "%s %s %"PRIuMAX"\n", oid_to_hex(oid), midx.c name-hash.c 2179045fd0 532) die(_("unable to create lazy_dir thread: %s"), strerror(err)); 2179045fd0 554) die(_("unable to create lazy_name thread: %s"), strerror(err)); 2179045fd0 560) die(_("unable to join lazy_name thread: %s"), strerror(err)); preload-index.c 2179045fd0 137) die(_("unable to create threaded lstat: %s"), strerror(err)); revision.c b45424181e 2915) define_commit_slab(author_date_slab, timestamp_t); b45424181e 2942) return; b45424181e 2945) return; b45424181e 2948) record_author_date(>author_date, c); b45424181e 2951) c->object.flags |= UNINTERESTING; b45424181e 2954) return; b45424181e 2957) mark_parents_uninteresting(c); b45424181e 2980) return; b45424181e 2983) return; b45424181e 3031) info->topo_queue.compare = compare_commits_by_commit_date; b45424181e 3032) break; b45424181e 3034) init_author_date_slab(>author_date); b45424181e 3035) info->topo_queue.compare = compare_commits_by_author_date; b45424181e 3036) info->topo_queue.cb_data = >author_date; b45424181e 3037) break; b45424181e 3048) continue; b45424181e 3059) record_author_date(>author_date, c); f0d9cc4196 3097) if (!revs->ignore_missing_links) f0d9cc4196 3098) die("Failed to traverse parents of commit %s", f0d9cc4196 3099) oid_to_hex(>object.oid)); b45424181e 3107) continue; run-command.c 2179045fd0 1229) error(_("cannot create async thread: %s"), strerror(err)); send-pack.c c0e40a2d66 207) close(fd[1]); Commits introducing uncovered code: Derrick Stolee b45424181: revision.c: generation-based topo-order algorithm Derrick Stolee f0d9cc419: revision.c: begin refactoring --topo-order logic Jeff King 01a31f3bc: pull: handle --verify-signatures for unborn branch Jeff King 0eb8d3767: cat-file: report an error on multiple --batch options Jeff King 2c8ee1f53: bundle: dup() output descriptor closer to point-of-use Jeff King 517fe807d: assert NOARG/NONEG behavior of parse-options callbacks Jeff King 735ca208c: apply: return -1 from option callback instead of calling exit(1) Jeff King fce566480: am: handle --no-patch-format option Johannes Schindelin 3249c1251: rebase: consolidate clean-up code before leaving reset_head() Johannes Schindelin bac2a1e36: built-in rebase: reinstate `checkout -q` behavior where appropriate Nguyễn Thái Ngọc Duy 2179045fd: Clean up pthread_create() error handling Nguyễn Thái Ngọc Duy c0e40a2d6: send-pack.c: move async's #ifdef NO_PTHREADS back to run-command.c Nguyễn Thái Ngọc Duy fd6263fb7: grep: clean up num_threads handling Torsten Bögershausen ca473cef9: Upcast size_t variables to uintmax_t when printing
Re: [PATCH v2 1/1] Use size_t instead of 'unsigned long' for data in memory
On 11/20/2018 12:04 AM, tbo...@web.de wrote: From: Torsten Bögershausen Currently the length of data which is stored in memory is stored in "unsigned long" at many places in the code base. This is OK when both "unsigned long" and size_t are 32 bits, (and is OK when both are 64 bits). On a 64 bit Windows system am "unsigned long" is 32 bit, and that may be too short to measure the size of objects in memory, a size_t is the natural choice. Is this change enough to store 4GB files on Windows? Or is there more to do? Thanks for all the comments on V1. Changes since V1: - Make the motivation somewhat clearer in the commit message - Rebase to the November 19 pu What we really need for this patch to fly are this branches: mk/use-size-t-in-zlib tb/print-size-t-with-uintmax-format I believe communicating these direct dependencies is the correct way to go, and the rebase you mentioned is unnecessary (instead, test with a merge). Hopefully the patch applies on top of a merge of those branches. @@ -529,7 +530,7 @@ static void *unpack_raw_entry(struct object_entry *obj, } static void *unpack_data(struct object_entry *obj, -int (*consume)(const unsigned char *, unsigned long, void *), +int (*consume)(const unsigned char *, size_t, void *), void *cb_data) This is the only instance that is not paired directly with a variable name like "size", "sz", or "len". However, it appears to be correct from context. Thanks for this! Reviewed-by: Derrick Stolee
Re: [PATCH 2/2] commit-graph: don't call write_graph_chunk_large_edges() unnecessarily
On 11/20/2018 8:26 PM, SZEDER Gábor wrote: write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.nr); - write_graph_chunk_large_edges(f, commits.list, commits.nr); + if (num_large_edges) + write_graph_chunk_large_edges(f, commits.list, commits.nr); This is clearly correct, and the tests in t5318-commit-graph.sh would catch a dropped (or additional) large/extra edge chunk. Thanks, -Stolee
Re: [PATCH 1/2] commit-graph: rename 'num_extra_edges' variable to 'num_large_edges'
On 11/20/2018 10:29 PM, Junio C Hamano wrote: SZEDER Gábor writes: I rename these variables to 'num_large_edges', because the commit graph file format speaks about the 'Large Edge List' chunk. However, I do find that the term 'extra' makes much more sense Would it make sense to do the rename in the other direction? I tend to agree that "large edge" is a misnomer. I agree with you both. "Extra" is better. Thanks, -Stolee
Re: [PATCH 1/1] revision.c: use new topo-order logic in tests
On 11/20/2018 1:13 AM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: @@ -3143,6 +3144,9 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; + if (revs->limited && + git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) + revs->limited = 0; if (revs->limited) { if (limit_list(revs) < 0) return -1; That is equivalent to say if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) revs->limited = 0; Not exactly equivalent, because we can use short-circuiting to avoid the git_env_bool check, but I see what you mean. Wouldn't that make the codepath that involves limit_list() completely unreachable while testing, though? Testing with GIT_TEST_COMMIT_GRAPH=0 would still hit limit_list(). Both modes are important to test (for instance, to ensure we still have correct behavior without a commit-graph file). The title only mentions "topo-order" logic, but the topo-order is not the only reason why limited bit can be set, is it? When showing merges, simplifying merges, or post-processing to show ancestry path, or showing a bottom-limited revision range, the limited bit is automatically set because all of these depend on first calling limit_list() and then postprocessing its result. Doesn't it hurt these cases to unconditionally drop limited bit? You're right that we only want to do this in the topo-order case, so perhaps the diff should instead be: if (revs->no_walk) return 0; + if (revs->topo_order && + git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) + revs->limited = 0; if (revs->limited) { if (limit_list(revs) < 0) return -1; Thanks, -Stolee
Re: [PATCH 0/3] delta-island fixes
On 11/20/2018 4:44 AM, Jeff King wrote: In cases like this I think it's often a good idea to have a perf test. Those are expensive anyway, and we'd have the double benefit of exercising the code and showing off the performance improvement. But the delta-island code only makes sense on a very specialized repo: one where you have multiple related but diverging histories. You could simulate that by picking two branches in a repository, but the effect is going to be miniscule. The changes in this series look correct. Too bad it is difficult to test. Perhaps we should add a performance test for the --delta-islands check that would trigger the failure (and maybe a clone afterwards). There are lots of freely available forks of git.git that present interesting fork structure. Here are three that would suffice to make this interesting: https://github.com/git/git https://github.com/git-for-windows/git https://github.com/microsoft/git Of course, running it on a specific repo is up to the person running the perf suite. Thanks, -Stolee
Re: Git Test Coverage Report (v2.20.0-rc0)
On 11/20/2018 6:34 AM, Jeff King wrote: On Mon, Nov 19, 2018 at 10:40:53AM -0500, Derrick Stolee wrote: Since depth is never incremented, we are not covering this block. Is it possible to test? This should be covered by the fix in: https://public-inbox.org/git/20181120095053.gc22...@sigill.intra.peff.net/ because now entries at the top-level are depth "1". The "depth++" line is still never executed in our test suite. I'm not sure how much that matters. Thanks! I'll go take a look at your patch. delta-islands.c c8d521faf7 53) memcpy(b, old, size); This memcpy happens when the 'old' island_bitmap is passed to island_bitmap_new(), but... c8d521faf7 187) b->refcount--; c8d521faf7 188) b = kh_value(island_marks, pos) = island_bitmap_new(b); This block has the only non-NULL caller to island_bitmap_new(). This is another case where it triggers a lot for a reasonably-sized repo, but it's hard to construct a small case. This code implements a copy-on-write of the bitmap, which means the same objects have to be accessible from two different paths through the reachability graph, each with different island marks. And then a test would I guess make sure that the correct subsets of objects never become deltas, which gets complicated. And I think that's a pattern with the delta-island code. What we really care about most is that if we throw a real fork-network repository at it, it produces faster clones with fewer un-reusable deltas. So I think a much more interesting approach here would be perf tests. But: - we'd want to count those as coverage, and that likely makes your coverage tests prohibitively expensive - it requires a specialized repo to demonstrate, which most people aren't going to have handy Do you have regularly-running tests that check this in your infrastructure? As long as someone would notice if this code starts failing, that would be enough. c8d521faf7 212) obj = ((struct tag *)obj)->tagged; c8d521faf7 213) if (obj) { c8d521faf7 214) parse_object(the_repository, >oid); c8d521faf7 215) marks = create_or_get_island_marks(obj); c8d521faf7 216) island_bitmap_set(marks, island_counter); It appears that this block would happen if we placed a tag in the delta island. Yep. Again, exercised by real repositories. I'm not sure how far we want to go in the blind pursuit of coverage. Certainly we could add a tag to the repo in t5320, and this code would get executed. But verifying that it's doing the right thing is much harder (and is more easily done with a perf test). Blind coverage goals are definitely not worth the effort. My goal here was to re-check all of the new code (since last release) that is not covered, because it's easier to hide a bug there. c8d521faf7 397) strbuf_addch(_name, '-'); This block is inside the following patch: [...] Yeah, this triggers if your regex has more than one capture group (and likewise, we almost certainly don't run the "you have too many groups" warning). Did you know that regexes are notoriously under-tested [1]? When looking at this code, I didn't even know regexes were involved (but I didn't look enough at the context). [1] https://dl.acm.org/citation.cfm?id=3236072 c8d521faf7 433) continue; c8d521faf7 436) list[dst] = list[src]; These blocks are inside the following nested loop in deduplicate_islands(): + for (ref = 0; ref + 1 < island_count; ref++) { + for (src = ref + 1, dst = src; src < island_count; src++) { + if (list[ref]->hash == list[src]->hash) + continue; + + if (src != dst) + list[dst] = list[src]; + + dst++; + } + island_count = dst; + } This means that our "deduplication" logic is never actually doing anything meaningful. Sorry, I don't even remember what this code is trying to do. The island code is 5+ years old, and just recently ported to upstream Git by Christian. And that's perhaps part of my laziness in the above tests; it would be a significant effort to re-figure out all these corner cases. It's a big part of why I hadn't been sending the patches upstream myself. Sure. Hopefully pointing out these blocks gives us more motivation to manually inspect them and avoid silly bugs. Thanks, -Stolee
Re: Git Test Coverage Report (v2.20.0-rc0)
On 11/19/2018 1:33 PM, Ævar Arnfjörð Bjarmason wrote: On Mon, Nov 19 2018, Derrick Stolee wrote: [...] builtin/rebase.c 62c23938fa 55) return env; [...] Ævar Arnfjörð Bjarmason 62c23938f: tests: add a special setup where rebase.useBuiltin is off This one would be covered with GIT_TEST_REBASE_USE_BUILTIN=false. Obviously trivial, but I wonder if the rest of the coverage would look different when passed through the various GIT_TEST_* options. Thanks for pointing out this GIT_TEST_* variable to me. I had been running builds with some of them enabled, but didn't know about this one. Unfortunately, t3406-rebase-message.sh fails with GIT_TEST_REBASE_USE_BUILTIN=false and it bisects to 4520c2337: Merge branch 'ab/rebase-in-c-escape-hatch'. The issue is that the commit 04519d72 "rebase: validate -C and --whitespace= parameters early" introduced the following test that cares about error messages: +test_expect_success 'error out early upon -C or --whitespace=' ' + test_must_fail git rebase -Cnot-a-number HEAD 2>err && + test_i18ngrep "numerical value" err && + test_must_fail git rebase --whitespace=bad HEAD 2>err && + test_i18ngrep "Invalid whitespace option" err +' The merge commit then was the first place where this test could run with that variable. What's the correct fix here? Force the builtin rebase in this test? Unify the error message in the non-builtin case? Thanks, -Stolee
Re: Git Test Coverage Report (v2.20.0-rc0)
On 11/19/2018 2:00 PM, Ben Peart wrote: On 11/19/2018 10:40 AM, Derrick Stolee wrote: There are a lot of lines introduced by the IEOT extension in these commits: > Ben Peart 3255089ad: ieot: add Index Entry Offset Table (IEOT) extension > Ben Peart 3b1d9e045: eoie: add End of Index Entry (EOIE) extension > Ben Peart 77ff1127a: read-cache: load cache entries on worker threads > Ben Peart abb4bb838: read-cache: load cache extensions on a worker thread > Ben Peart c780b9cfe: config: add new index.threads config setting > Ben Peart d1664e73a: add: speed up cmd_add() by utilizing read_cache_preload() > Ben Peart fa655d841: checkout: optimize "git checkout -b " These should be hit if you run the test suite with GIT_TEST_INDEX_THREADS=2. Without that, the indexes for the various tests are too small to trigger multi-threaded index reads/writes. From t/README: GIT_TEST_INDEX_THREADS= enables exercising the multi-threaded loading of the index for the whole test suite by bypassing the default number of cache entries and thread minimums. Setting this to 1 will make the index loading single threaded. I updated my build to add GIT_TEST_INDEX_THREADS=2 and I still see lots of uncovered stuff, including that load_cache_entries_threaded() is never run. I added the following diff to my repo and ran the test suite manually with GIT_TEST_INDEX_THREADS=2 and it didn't fail: diff --git a/read-cache.c b/read-cache.c index 4ca81286c0..36502586a2 100644 --- a/read-cache.c +++ b/read-cache.c @@ -2057,6 +2057,9 @@ static unsigned long load_cache_entries_threaded(struct index_state *istate, con struct load_cache_entries_thread_data *data; unsigned long consumed = 0; + fprintf(stderr, "load_cache_entries_threaded\n"); + exit(1); + /* a little sanity checking */ if (istate->name_hash_initialized) BUG("the name hash isn't thread safe"); Am I missing something? Is there another variable I should add? When I look for where the GIT_TEST_INDEX_THREADS variable is checked, I see that the calls to git_config_get_index_threads() are followed by a check for NO_PTHREADS (setting the number of threads to 1 again). Is it possible that my compiler environment is not allowing me to even compile with threads? Thanks, -Stolee
Re: [PATCH] commit-graph: split up close_reachable() progress output
On 11/19/2018 3:23 PM, Ævar Arnfjörð Bjarmason wrote: + if (report_progress) + progress = start_delayed_progress( + _("Expanding reachable commits in commit graph"), j = 0); This should be the only one that shows up in all but the very largest of repositories. LGTM. Thanks, -Stolee
Re: Git Test Coverage Report (v2.20.0-rc0)
On 11/19/2018 2:39 PM, Ævar Arnfjörð Bjarmason wrote: On Mon, Nov 19 2018, Derrick Stolee wrote: The coverage report has been using the following: export GIT_TEST_MULTI_PACK_INDEX=1 export GIT_TEST_COMMIT_GRAPH=1 export GIT_TEST_INDEX_VERION=4 export GIT_TEST_SPLIT_INDEX=yes export GIT_TEST_OE_SIZE=10 export GIT_TEST_OE_DELTA_SIZE=5 I need to add GIT_TEST_INDEX_THREADS=2 and GIT_TEST_REBASE_USE_BUILTIN=false ...although note you'll need to also test without GIT_TEST_REBASE_USE_BUILTIN=false, otherwise a lot of the new C code won't have coverage. Sorry for lack of clarity: I first run 'make coverage-test' with no GIT_TEST_* variables, then run the test suite again with the optional variables. Thanks, -Stolee
Re: Git Test Coverage Report (v2.20.0-rc0)
On 11/19/2018 1:33 PM, Ævar Arnfjörð Bjarmason wrote: On Mon, Nov 19 2018, Derrick Stolee wrote: Here is a test coverage report for the uncovered lines introduced in v2.20.0-rc0 compared to v2.19.1. Thanks a lot for this. [...] builtin/rebase.c 62c23938fa 55) return env; [...] Ævar Arnfjörð Bjarmason 62c23938f: tests: add a special setup where rebase.useBuiltin is off This one would be covered with GIT_TEST_REBASE_USE_BUILTIN=false. Obviously trivial, but I wonder if the rest of the coverage would look different when passed through the various GIT_TEST_* options. The coverage report has been using the following: export GIT_TEST_MULTI_PACK_INDEX=1 export GIT_TEST_COMMIT_GRAPH=1 export GIT_TEST_INDEX_VERION=4 export GIT_TEST_SPLIT_INDEX=yes export GIT_TEST_OE_SIZE=10 export GIT_TEST_OE_DELTA_SIZE=5 I need to add GIT_TEST_INDEX_THREADS=2 and GIT_TEST_REBASE_USE_BUILTIN=false Thanks! -Stolee
[PATCH 1/1] revision.c: use new topo-order logic in tests
From: Derrick Stolee The revision-walk machinery is being rewritten to use generation numbers in the commit-graph when availble. Due to some problematic commit histories, the new logic can be slower than the previous method due to how commit dates and generation numbers interact. Thus, the new logic is not used in comparison queries, such as git log --topo-order A..B The logic for these queries was implemented during the refactor, but is unreachable due to the potential performance penalty. The code came along with a larger block of code that was copied from the old code. When generation numbers are updated to v2 (corrected commit dates), then we will no longer have a performance penalty and this logic is what we will want to use. In the meantime, use the new logic when GIT_TEST_COMMIT_GRAPH is enabled. This will demonstrate that the new logic works for all comparison queries in the test suite, including these variants: git log --topo-order --ancestry-path A..B git log --topo-order A...B Signed-off-by: Derrick Stolee --- revision.c | 4 1 file changed, 4 insertions(+) diff --git a/revision.c b/revision.c index 4ef47d2fb4..d52da6e24f 100644 --- a/revision.c +++ b/revision.c @@ -27,6 +27,7 @@ #include "commit-reach.h" #include "commit-graph.h" #include "prio-queue.h" +#include "config.h" volatile show_early_output_fn_t show_early_output; @@ -3143,6 +3144,9 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; + if (revs->limited && + git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) + revs->limited = 0; if (revs->limited) { if (limit_list(revs) < 0) return -1; -- gitgitgadget
[PATCH 0/1] Use new topo-order logic with GIT_TEST_COMMIT_GRAPH
The recent Git test report for v2.20.0-rc0 shows that the logic around UNINTERESTING commits is not covered by the test suite. This is because the code is currently unreachable! See the commit message for details. An alternate approach would be to delete the code around UNINTERESTING commits, but that doesn't seem necessary. Thanks, -Stolee Derrick Stolee (1): revision.c: use new topo-order logic in tests revision.c | 4 1 file changed, 4 insertions(+) base-commit: 561b583749b7428f1790f03164d0d0e75be71d7b Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-83%2Fderrickstolee%2Ftopo-order-test-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-83/derrickstolee/topo-order-test-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/83 -- gitgitgadget
Re: Git Test Coverage Report (v2.20.0-rc0)
The test coverage reports started mid-way through this release cycle, so I thought it would be good to do a full review of the new uncovered code since the last release. I eliminated most of the uncovered code due to the following cases: 1. Code was only moved or refactored. 2. Code was related to unusual error conditions (e.g. open_pack_index() fails) The comments below are intended only to point out potential directions to improve test coverage. Some of it is for me to do! Thanks, -Stolee On 11/18/2018 9:54 PM, Derrick Stolee wrote: 66ec0390e7 builtin/fsck.c 888) midx_argv[2] = "--object-dir"; 66ec0390e7 builtin/fsck.c 889) midx_argv[3] = alt->path; 66ec0390e7 builtin/fsck.c 890) if (run_command(_verify)) 66ec0390e7 builtin/fsck.c 891) errors_found |= ERROR_COMMIT_GRAPH; There are two things wrong here: 1. Not properly covering multi-pack-index fsck with alternates. 2. the ERROR_COMMIT_GRAPH flag when the multi-pack-index is being checked. I'll submit a patch to fix this. 2fa233a554 builtin/pack-objects.c 1512) hashcpy(base_oid.hash, base_sha1); 2fa233a554 builtin/pack-objects.c 1513) if (!in_same_island(>idx.oid, _oid)) 2fa233a554 builtin/pack-objects.c 1514) return 0; These lines are inside a block for the following if statement: + /* + * Otherwise, reachability bitmaps may tell us if the receiver has it, + * even if it was buried too deep in history to make it into the + * packing list. + */ + if (thin && bitmap_has_sha1_in_uninteresting(bitmap_git, base_sha1)) { Peff: is this difficult to test? 28b8a73080 builtin/pack-objects.c 2793) depth++; 108f530385 builtin/pack-objects.c 2797) oe_set_tree_depth(_pack, ent, depth); This 'depth' variable is incremented as part of a for loop in this patch: static void show_object(struct object *obj, const char *name, void *data) @@ -2686,6 +2706,19 @@ static void show_object(struct object *obj, const char *name, void *data) add_preferred_base_object(name); add_object_entry(>oid, obj->type, name, 0); obj->flags |= OBJECT_ADDED; + + if (use_delta_islands) { + const char *p; + unsigned depth = 0; + struct object_entry *ent; + + for (p = strchr(name, '/'); p; p = strchr(p + 1, '/')) + depth++; + + ent = packlist_find(_pack, obj->oid.hash, NULL); + if (ent && depth > ent->tree_depth) + ent->tree_depth = depth; + } } And that 'ent->tree_depth = depth;' line is replaced with the oe_set_tree_depth() call in the report. Since depth is never incremented, we are not covering this block. Is it possible to test? builtin/repack.c 16d75fa48d 48) use_delta_islands = git_config_bool(var, value); 16d75fa48d 49) return 0; This is a simple config option check for "repack.useDeltaIslands". The logic it enables is tested, so this is an OK gap, in my opinion. > builtin/submodule--helper.c ee69b2a90c 1476) out->type = sub->update_strategy.type; ee69b2a90c 1477) out->command = sub->update_strategy.command; This block was introduced by this part of the patch: + } else if (sub->update_strategy.type != SM_UPDATE_UNSPECIFIED) { + trace_printf("loaded thing"); + out->type = sub->update_strategy.type; + out->command = sub->update_strategy.command; Which seems to be an important case, as the other SM_UPDATE_* types seem like interesting cases. Stefan: what actions would trigger this block? Is it easy to test? delta-islands.c c8d521faf7 53) memcpy(b, old, size); This memcpy happens when the 'old' island_bitmap is passed to island_bitmap_new(), but... c8d521faf7 187) b->refcount--; c8d521faf7 188) b = kh_value(island_marks, pos) = island_bitmap_new(b); This block has the only non-NULL caller to island_bitmap_new(). c8d521faf7 212) obj = ((struct tag *)obj)->tagged; c8d521faf7 213) if (obj) { c8d521faf7 214) parse_object(the_repository, >oid); c8d521faf7 215) marks = create_or_get_island_marks(obj); c8d521faf7 216) island_bitmap_set(marks, island_counter); It appears that this block would happen if we placed a tag in the delta island. c8d521faf7 397) strbuf_addch(_name, '-'); This block is inside the following patch: + if (matches[ARRAY_SIZE(matches) - 1].rm_so != -1) + warning(_("island regex from config has " + "too many capture groups (max=%d)"), + (int)ARRAY_SIZE(matches) - 2); + + for (m = 1; m < ARRAY_SIZE(matches); m++) { + regmatch_t *match = [m]; + + if (match->rm_so == -1) + continue; + + if (island_name.len) + strbuf_addch(_name, '-'); + + strbuf_ad
Git Test Coverage Report (v2.20.0-rc0)
(, ":%.*s ", perf_indent, space); c46c406ae1 317) void trace_performance_leave_fl(const char *file, int line, c46c406ae1 323) if (perf_indent) c46c406ae1 324) perf_indent--; c46c406ae1 326) if (!format) /* Allow callers to leave without tracing anything */ c46c406ae1 327) return; c46c406ae1 329) since = perf_start_times[perf_indent]; c46c406ae1 330) va_start(ap, format); c46c406ae1 331) trace_performance_vprintf_fl(file, line, nanos - since, format, ap); c46c406ae1 332) va_end(ap); c46c406ae1 477) trace_performance_leave("git command:%s", command_line.buf); c46c406ae1 485) if (!command_line.len) c46c406ae1 490) trace_performance_enter(); transport.c unpack-trees.c b878579ae7 360) string_list_append(, ce->name); b878579ae7 361) ce->ce_flags &= ~CE_MATCHED; b878579ae7 368) warning(_("the following paths have collided (e.g. case-sensitive paths\n" b878579ae7 372) for (i = 0; i < list.nr; i++) b878579ae7 373) fprintf(stderr, " '%s'\n", list.items[i].string); f1e11c6510 777) free(tree_ce); b4da37380b 778) return rc; b4da37380b 785) printf("Unpacked %d entries from %s to %s using cache-tree\n", b4da37380b 787) o->src_index->cache[pos]->name, b4da37380b 788) o->src_index->cache[pos + nr_entries - 1]->name); upload-pack.c 1d1243fe63 1403) deepen(INFINITE_DEPTH, data->deepen_relative, >shallows, worktree.c 3a3b9d8cde 495) return -1; 3a3b9d8cde 508) return -1; 3a3b9d8cde 517) return -1; ab3e1f78ae 537) break; wt-status.c f3bd35fa0d 671) s->committable = 1; 73ba5d78b4 1958) if (s->state.rebase_in_progress || 73ba5d78b4 1959) s->state.rebase_interactive_in_progress) 73ba5d78b4 1960) branch_name = s->state.onto; 73ba5d78b4 1961) else if (s->state.detached_from) 73ba5d78b4 1962) branch_name = s->state.detached_from; xdiff-interface.c xdiff/xutils.c 611e42a598 405) return -1; Commits introducing uncovered code: Ævar Arnfjörð Bjarmason 62c23938f: tests: add a special setup where rebase.useBuiltin is off Alban Gruin 0af129b2e: rebase--interactive2: rewrite the submodes of interactive rebase in C Alban Gruin 2c58483a5: rebase -i: rewrite setup_reflog_action() in C Alban Gruin 34b47315d: rebase -i: move rebase--helper modes to rebase--interactive Alban Gruin 4df66c40b: rebase -i: rewrite checkout_onto() in C Alban Gruin 53bbcfbde: rebase -i: implement the main part of interactive rebase as a builtin Alban Gruin 64a43cbd5: rebase -i: rewrite the edit-todo functionality in C Alban Gruin 65850686c: rebase -i: rewrite write_basic_state() in C Alban Gruin a9f5476fb: sequencer: refactor append_todo_help() to write its message to a buffer Alban Gruin b97e18736: rebase -i: rewrite complete_action() in C Antonio Ospite 45f5ef3d7: submodule: factor out a config_set_in_gitmodules_file_gently function Antonio Ospite 76e9bdc43: submodule: support reading .gitmodules when it's not in the working tree Antonio Ospite bcbc780d1: submodule: add a print_config_from_gitmodules() helper Ben Peart 3255089ad: ieot: add Index Entry Offset Table (IEOT) extension Ben Peart 3b1d9e045: eoie: add End of Index Entry (EOIE) extension Ben Peart 77ff1127a: read-cache: load cache entries on worker threads Ben Peart abb4bb838: read-cache: load cache extensions on a worker thread Ben Peart c780b9cfe: config: add new index.threads config setting Ben Peart d1664e73a: add: speed up cmd_add() by utilizing read_cache_preload() Ben Peart fa655d841: checkout: optimize "git checkout -b " Brendan Forster 93aef7c79: http: add support for disabling SSL revocation checks in cURL brian m. carlson 2f0c9e9a9: builtin/repack: replace hard-coded constants brian m. carlson eccb5a5f3: apply: rename new_sha1_prefix and old_sha1_prefix Christian Couder 108f53038: pack-objects: move tree_depth into 'struct packing_data' Christian Couder fe0ac2fb7: pack-objects: move 'layer' into 'struct packing_data' Derrick Stolee 0d5b3a5ef: midx: write object ids in a chunk Derrick Stolee 17c35c896: packfile: skip loading index if in multi-pack-index Derrick Stolee 1d614d41e: commit-reach: move ref_newer from remote.c Derrick Stolee 1dcd9f204: midx: close multi-pack-index on repack Derrick Stolee 20fd6d579: commit-graph: not compatible with grafts Derrick Stolee 3715a6335: midx: read objects from multi-pack-index Derrick Stolee 454ea2e4d: treewide: use get_all_packs Derrick Stolee 4d80560c5: multi-pack-index: load into memory Derrick Stolee 4fbcca4ef: commit-reach: make can_all_from_reach... linear Derrick Stolee 5227c3856: commit-reach: move walk methods from commit.c Derrick Stolee 525e18c04: midx: clear midx on repack Derrick Stolee 56ee7ff15: multi-pack-index: add 'verify' verb Derrick Stolee 662148c43: midx: write object offs
Re: [PATCH/RFC v1 1/1] Use size_t instead of unsigned long
On 11/17/2018 10:11 AM, tbo...@web.de wrote: From: Torsten Bögershausen Currently Git users can not commit files >4Gib under 64 bit Windows, where "long" is 32 bit but size_t is 64 bit. Improve the code base in small steps, as small as possible. What started with a small patch to replace "unsigned long" with size_t in one file (convert.c) ended up with a change in many files. Signed-off-by: Torsten Bögershausen --- This needs to go on top of pu, to cover all the good stuff cooking here. Better to work on top of 'master', as the work in 'pu' will be rewritten several times, probably. I have started this series on November 1st, since that 2 or 3 rebases had been done to catch up, and now it is on pu from November 15. I couldn't find a reason why changing "unsigned ling" into "size_t" may break anything, any thoughts, please ? IIRC, the blocker for why we haven't done this already is that "size_t", "timestamp_t" and "off_t" are all 64-bit types that give more implied meaning, so we should swap types specifically to these as we want. One example series does a specific, small change [1]. If we wanted to do a single swap that removes 'unsigned long' with an unambiguously 64-bit type, I would recommend "uint64_t". Later replacements to size_t, off_t, and timestamp_t could happen on a case-by-case basis for type clarity. It may benefit reviewers to split this change into multiple patches, starting at the lowest levels of the call stack (so higher 'unsigned long's can up-cast to the 64-bit types when calling a function) and focusing the changes to one or two files at a time (X.c and X.h, preferrably). Since you are talking about the benefits for Git for Windows to accept 4GB files, I wonder if we can add a test that verifies such a file will work. If you have such a test, then I could help verify that the test fails before the change and succeeds afterward. Finally, it may be good to add a coccinelle script that replaces 'unsigned long' with 'uint64_t' so we can easily fix any new introductions that happen in the future. Thanks! I do think we should make this change, but we must be careful. It may be disruptive to topics in flight. -Stolee [1] https://public-inbox.org/git/20181112084031.11769-1-care...@gmail.com/
Git Test Coverage Report (Sunday, Nov 18th)
385cb64ff: delta-islands.c: remove the_repository references Nguyễn Thái Ngọc Duy 674ba3403: fsck: mark strings for translation Nguyễn Thái Ngọc Duy 74ae4b638: bundle.c: remove the_repository references Nguyễn Thái Ngọc Duy 890034262: parse-options.c: mark more strings for translation Nguyễn Thái Ngọc Duy 8aa8c1409: git.c: mark more strings for translation Nguyễn Thái Ngọc Duy 9440b831a: parse-options: replace opterror() with optname() Nguyễn Thái Ngọc Duy 9d0a9e908: read-cache.c: mark more strings for translation Nguyễn Thái Ngọc Duy ad8f8f4ae: attr.c: mark more string for translation Nguyễn Thái Ngọc Duy c6e7965dd: archive.c: mark more strings for translation Nguyễn Thái Ngọc Duy c83d950e5: repack: mark more strings for translation Nguyễn Thái Ngọc Duy dd509db34: reflog: mark strings for translation Nguyễn Thái Ngọc Duy f11c95805: sequencer.c: remove implicit dependency on the_index Nguyễn Thái Ngọc Duy fb998eae6: blame.c: remove implicit dependency the_repository Olga Telezhnaya ab0e36715: ref-filter: add deltabase option Olga Telezhnaya bbfc042ef: ref-filter: add objectsize:disk option Torsten Bögershausen ca473cef9: Upcast size_t variables to uintmax_t when printing Uncovered code in 'master' not in 'master@{1}' apply.c 517fe807d6 4776) BUG_ON_OPT_NEG(unset); 735ca208c5 4830) return -1; builtin/am.c fce5664805 2117) *opt_value = PATCH_FORMAT_UNKNOWN; builtin/blame.c 517fe807d6 builtin/blame.c 759) BUG_ON_OPT_NEG(unset); builtin/cat-file.c 0eb8d3767c builtin/cat-file.c 609) return error(_("only one batch option may be specified")); builtin/grep.c fd6263fb73 builtin/grep.c 1051) warning(_("invalid option combination, ignoring --threads")); fd6263fb73 builtin/grep.c 1057) die(_("invalid number of threads specified (%d)"), num_threads); builtin/log.c 517fe807d6 builtin/log.c 1196) BUG_ON_OPT_NEG(unset); builtin/pull.c 01a31f3bca 565) die(_("unable to access commit %s"), builtin/rebase.c 62c23938fa 55) return env; 3249c1251e 556) ret = -1; 3249c1251e 557) goto leave_reset_head; bac2a1e36f 561) ret = error(_("could not determine HEAD revision")); bac2a1e36f 562) goto leave_reset_head; 3249c1251e 580) ret = error(_("could not read index")); 3249c1251e 581) goto leave_reset_head; bac2a1e36f 585) ret = error(_("failed to find tree of %s"), oid_to_hex(oid)); bac2a1e36f 586) goto leave_reset_head; 3249c1251e 590) ret = error(_("failed to find tree of %s"), oid_to_hex(oid)); 3249c1251e 591) goto leave_reset_head; 3249c1251e 604) goto leave_reset_head; builtin/show-branch.c 517fe807d6 builtin/show-branch.c 607) BUG_ON_OPT_NEG(unset); builtin/show-ref.c 517fe807d6 builtin/show-ref.c 154) BUG_ON_OPT_NEG(unset); bundle.c 2c8ee1f53c 267) error_errno(_("unable to dup bundle descriptor")); 2c8ee1f53c 268) child_process_clear(_objects); 2c8ee1f53c 269) return -1; 2c8ee1f53c 478) rollback_lock_file(); config.c midx.c name-hash.c 2179045fd0 532) die(_("unable to create lazy_dir thread: %s"), strerror(err)); 2179045fd0 554) die(_("unable to create lazy_name thread: %s"), strerror(err)); 2179045fd0 560) die(_("unable to join lazy_name thread: %s"), strerror(err)); preload-index.c 2179045fd0 137) die(_("unable to create threaded lstat: %s"), strerror(err)); revision.c b45424181e 2942) return; b45424181e 2945) return; b45424181e 2951) c->object.flags |= UNINTERESTING; b45424181e 2954) return; b45424181e 2957) mark_parents_uninteresting(c); b45424181e 2980) return; b45424181e 2983) return; b45424181e 3048) continue; f0d9cc4196 3097) if (!revs->ignore_missing_links) f0d9cc4196 3098) die("Failed to traverse parents of commit %s", f0d9cc4196 3099) oid_to_hex(>object.oid)); b45424181e 3107) continue; run-command.c 2179045fd0 1229) error(_("cannot create async thread: %s"), strerror(err)); send-pack.c c0e40a2d66 207) close(fd[1]); Commits introducing uncovered code: Ævar Arnfjörð Bjarmason 62c23938f: tests: add a special setup where rebase.useBuiltin is off Derrick Stolee b45424181: revision.c: generation-based topo-order algorithm Derrick Stolee f0d9cc419: revision.c: begin refactoring --topo-order logic Jeff King 01a31f3bc: pull: handle --verify-signatures for unborn branch Jeff King 0eb8d3767: cat-file: report an error on multiple --batch options Jeff King 2c8ee1f53: bundle: dup() output descriptor closer to point-of-use Jeff King 517fe807d: assert NOARG/NONEG behavior of parse-options callbacks Jeff King 735ca208c: apply: return -1 from option callback instead of calling exit(1) Jeff King fce566480: am: handle --no-patch-format option Johannes Schindelin 3249c1251: rebase: consolidate clean-up code before leaving reset_head() Johannes Schindelin bac2a1e36: built-in rebase: reinstate `checkout -q` behavior where appropriate Nguyễn Thái Ngọc Duy 2179045fd: Clean up pthread_create() error handling Nguyễn Thái Ngọc Duy c0e40a2d6: send-pack.c: move async's #ifdef NO_PTHREADS back to run-command.c Nguyễn Thái Ngọc Duy fd6263fb7: grep: clean up num_threads handling
Re: [PATCH] technical doc: add a design doc for the evolve command
On 11/14/2018 7:55 PM, sxe...@google.com wrote: From: Stefan Xenos This document describes what an obsolescence graph for git would look like, the behavior of the evolve command, and the changes planned for other commands. Thanks for putting this together! diff --git a/Documentation/technical/evolve.txt b/Documentation/technical/evolve.txt ... +Git Obsolescence Graph +== + +Objective +- +Track the edits to a commit over time in an obsolescence graph. The file name and the title are in a mismatch. I'd prefer if the title was "Git Evolve Design Document" and this opening paragraph was about the reasons we want a 'git evolve' command. Here is my attempt: The proposed 'git evolve' command will help users craft a high-quality commit history in their topic branches. By working to improve commits one at a time, then running 'git evolve', users can rewrite recent history with more options than interactive rebase. The core benefit is that users can pause their progress and move to other branches before returning to where they left off. Users can also share progress with others using standard 'push', 'fetch', and 'format-patch' commands. +Background +-- Perhaps you can call this "Example"? +Imagine you have three dependent changes up for review and you receive feedback +that requires editing all three changes. While you're editing one, more feedback +arrives on one of the others. What do you do? "three dependent changes" sounds a bit vague enough to me to possibly confuse readers. Perhaps "three sequential patches"? +- Users can view the history of a commit directly (the sequence of amends and + rebases it has undergone, orthogonal to the history of the branch it is on). "the history of a commit" doesn't semantically work, as a commit is an immutable Git object. Instead, I would try to use the term "patch" to describe a change to the codebase, and that takes the form as a list of commits that are improving on each other (but don't actually have each other in their commit history). This means that the lifetime of a patch is described by the commits that are amended or rebased. +- By pushing and pulling the obsolescence graph, users can collaborate more + easily on changes-in-progress. This is better than pushing and pulling the + changes themselves since the obsolescence graph can be used to locate a more + specific merge base, allowing for better merges between different versions of + the same change. (Making a note so I come back to this. I hope to learn what you mean by this "more specific merge base".) + +Similar technologies + ... It can't handle the case where you have +multiple changes sharing the same parent when that parent needs to be rebased Perhaps this could be made more concrete by describing commit history and a specific workflow change using 'git evolve'. Suppose we have two topic branches, topic1 and topic2, that point to commits A and B, respectively.Suppose further that A and B have a common parent C with parent D. If we rebase topic1 relativeto D, then we create new commits C' and A' that are newer versions of commits C and A. It would benice to easily update topic2 to be on a new commit B' with parent C'. Currently, a user needs to knowthat C updated to C', and use 'git rebase --onto C' C topic2'. Instead, if we have a marker showing thatC' is an updated version of C, 'git log topic2' would show that topic2 can be updated, and the 'gitevolve' command would perform the correct action to make B' with parent C'. (This paragraph above is an example of "what can happen now is complicated and demands that the user keep some information in their memory" and "the new workflow is simpler and helps users make the right decision". I think we could use more of these at the start to sell the idea.) +and won't let you collaborate with others on resolving a complicated interactive +rebase. In the same sentence, we have an even more complicated workflow mentioned as an aside. This could be fleshed out more concretely. It could help describing that the current model is for usersto share "!fixup" commits and then one performs an interactive rebase to apply those fixups inthe correct order. If a user instead shares an amended commit, then we are in a difficult state toapply those changes. The new workflow would be to share amended commits and 'git evolve'inserts the correct amended commits in the right order. I'm a big proponent of the teaching philosophy of "examples first". It's easier to talk abstractlyafter going through some concrete examples. You can think of rebase -i as a top-down approach and the evolve command +as the bottom-up approach to the same problem. This comparison is important. Perhaps it is more specific to say "interactive rebase splits a plan torewrite history into independent units of work, while evolve collects independent
Re: [PATCHv3 00/23] Bring more repository handles into our code base
On 11/13/2018 7:12 PM, Stefan Beller wrote: Please have a look at the last 4 patches specifically as they were new in the last iteration (but did not receive any comment), as they demonstrate and fix a problem that is only exposed when using GIT_TEST_COMMIT_GRAPH=1 for the test suite. I took a look at these last four and didn't find anything wrong. Rest of the series looks good. While all of the TODOs in the last patch are an imperfect solution, I think it's probably the best we can do for now. Thanks, -Stolee
Re: [PATCH v5 02/12] sha1-file: provide functions to look up hash algorithms
On 11/4/2018 6:44 PM, brian m. carlson wrote: +int hash_algo_by_name(const char *name) +{ + int i; + if (!name) + return GIT_HASH_UNKNOWN; + for (i = 1; i < GIT_HASH_NALGOS; i++) + if (!strcmp(name, hash_algos[i].name)) + return i; + return GIT_HASH_UNKNOWN; +} + Today's test coverage report [1] shows this method is not covered in the test suite. Looking at 'pu', it doesn't have any callers. Do you have a work in progress series that will use this? Could we add a test-tool to exercise this somehow? Thanks, -Stolee [1] https://public-inbox.org/git/97be3e21-6a8c-9718-5161-37318f6d6...@gmail.com/
Git Test Coverage Report (Tuesday, Nov 13)
7f8671656: merge-recursive: fix rename/add conflict handling Force Charlie d73019feb: http: add support selecting http version Jeff King b69fb867b: sha1_file_name(): overwrite buffer instead of appending Jeff King f0eaf6381: sha1-file: use an object_directory for the main object dir Joel Teichroeb 3d5ec65ce: stash: convert apply to builtin Joel Teichroeb 5bf62a19c: stash: convert pop to builtin Joel Teichroeb 700577117: stash: convert drop and clear to builtin Johannes Schindelin 3249c1251: rebase: consolidate clean-up code before leaving reset_head() Johannes Schindelin bac2a1e36: built-in rebase: reinstate `checkout -q` behavior where appropriate Josh Steadmon c95f25280: protocol: advertise multiple supported versions Junio C Hamano 287d68a44: Merge branch 'nd/i18n' into jch Nguyễn Thái Ngọc Duy 005af339c: sequencer.c: remove implicit dependency on the_repository Nguyễn Thái Ngọc Duy 0b9c3afdb: remote.c: mark messages for translation Nguyễn Thái Ngọc Duy 385cb64ff: delta-islands.c: remove the_repository references Nguyễn Thái Ngọc Duy 674ba3403: fsck: mark strings for translation Nguyễn Thái Ngọc Duy 74ae4b638: bundle.c: remove the_repository references Nguyễn Thái Ngọc Duy 890034262: parse-options.c: mark more strings for translation Nguyễn Thái Ngọc Duy 8aa8c1409: git.c: mark more strings for translation Nguyễn Thái Ngọc Duy 9440b831a: parse-options: replace opterror() with optname() Nguyễn Thái Ngọc Duy 9d0a9e908: read-cache.c: mark more strings for translation Nguyễn Thái Ngọc Duy ad8f8f4ae: attr.c: mark more string for translation Nguyễn Thái Ngọc Duy c6e7965dd: archive.c: mark more strings for translation Nguyễn Thái Ngọc Duy c83d950e5: repack: mark more strings for translation Nguyễn Thái Ngọc Duy dd509db34: reflog: mark strings for translation Nguyễn Thái Ngọc Duy f11c95805: sequencer.c: remove implicit dependency on the_index Nguyễn Thái Ngọc Duy fb998eae6: blame.c: remove implicit dependency the_repository Olga Telezhnaya ab0e36715: ref-filter: add deltabase option Olga Telezhnaya bbfc042ef: ref-filter: add objectsize:disk option Paul-Sebastian Ungureanu 104eb50d1: stash: convert show to builtin Paul-Sebastian Ungureanu 193c3e351: stash: convert `stash--helper.c` into `stash.c` Paul-Sebastian Ungureanu 1a0f0409a: stash: convert push to builtin Paul-Sebastian Ungureanu 813904a0c: stash: convert store to builtin Paul-Sebastian Ungureanu 9f630e748: stash: convert create to builtin Paul-Sebastian Ungureanu c2cc69f19: stash: make push -q quiet Torsten Bögershausen ca473cef9: Upcast size_t variables to uintmax_t when printing Uncovered code in 'next' not in 'master' apply.c 517fe807d6 4776) BUG_ON_OPT_NEG(unset); 735ca208c5 4830) return -1; builtin/am.c fce5664805 2117) *opt_value = PATCH_FORMAT_UNKNOWN; builtin/archive.c e001fd3a50 builtin/archive.c 78) die(_("git archive: expected ACK/NAK, got a flush packet")); e001fd3a50 builtin/archive.c 80) if (starts_with(reader.line, "NACK ")) e001fd3a50 builtin/archive.c 81) die(_("git archive: NACK %s"), reader.line + 5); e001fd3a50 builtin/archive.c 82) if (starts_with(reader.line, "ERR ")) e001fd3a50 builtin/archive.c 83) die(_("remote error: %s"), reader.line + 4); e001fd3a50 builtin/archive.c 84) die(_("git archive: protocol error")); e001fd3a50 builtin/archive.c 89) die(_("git archive: expected a flush")); fb19d32f05 builtin/archive.c 99) if (version != discover_version()) fb19d32f05 builtin/archive.c 100) die(_("git archive: received different protocol versions in subsequent requests")); builtin/blame.c 517fe807d6 builtin/blame.c 759) BUG_ON_OPT_NEG(unset); builtin/cat-file.c 0eb8d3767c builtin/cat-file.c 609) return error(_("only one batch option may be specified")); builtin/grep.c fd6263fb73 builtin/grep.c 1051) warning(_("invalid option combination, ignoring --threads")); fd6263fb73 builtin/grep.c 1057) die(_("invalid number of threads specified (%d)"), num_threads); builtin/log.c 517fe807d6 builtin/log.c 1196) BUG_ON_OPT_NEG(unset); builtin/pull.c 01a31f3bca 565) die(_("unable to access commit %s"), builtin/show-branch.c 517fe807d6 builtin/show-branch.c 607) BUG_ON_OPT_NEG(unset); builtin/show-ref.c 517fe807d6 builtin/show-ref.c 154) BUG_ON_OPT_NEG(unset); builtin/upload-archive.c e001fd3a50 builtin/upload-archive.c 113) if (version == protocol_v0 || version == protocol_v1) e001fd3a50 builtin/upload-archive.c 114) packet_write_fmt(1, "NACK unable to spawn subprocess\n"); e001fd3a50 builtin/upload-archive.c 115) else if (version == protocol_v2) e001fd3a50 builtin/upload-archive.c 116) error_clnt("unable to spawn subprocess\n"
Re: [PATCH 0/9] caching loose objects
On 11/12/2018 9:46 AM, Jeff King wrote: Here's the series I mentioned earlier in the thread to cache loose objects when answering has_object_file(..., OBJECT_INFO_QUICK). For those just joining us, this makes operations that look up a lot of missing objects (like "index-pack" looking for collisions) faster. This is mostly targeted at systems where stat() is slow, like over NFS, but it seems to give a 2% speedup indexing a full git.git packfile into an empty repository (i.e., what you'd see on a clone). I'm adding René Scharfe and Takuto Ikuta to the cc for their previous work in loose-object caching. The interesting bit is patch 8. The rest of it is cleanup to let us treat alternates and the main object directory similarly. This cleanup is actually really valuable, and affects much more than this application. I really think it is a good idea, and hope it doesn't cause too much trouble as the topic is cooking. Thanks, -Stolee
Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
On 11/12/2018 9:54 AM, Jeff King wrote: In cases where we expect to ask has_sha1_file() about a lot of objects that we are not likely to have (e.g., during fetch negotiation), we already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with a simultaneous write or repack) for speed (we avoid re-scanning the pack directory). However, even checking for loose objects can be expensive, as we will stat() each one. On many systems this cost isn't too noticeable, but stat() can be particularly slow on some operating systems, or due to network filesystems. Since the QUICK flag already tells us that we're OK with a slightly stale answer, we can use that as a cue to look in our in-memory cache of each object directory. That basically trades an in-memory binary search for a stat() call. Note that it is possible for this to actually be _slower_. We'll do a full readdir() to fill the cache, so if you have a very large number of loose objects and a very small number of lookups, that readdir() may end up more expensive. This shouldn't be a big deal in practice. If you have a large number of reachable loose objects, you'll already run into performance problems (which you should remedy by repacking). You may have unreachable objects which wouldn't otherwise impact performance. Usually these would go away with the prune step of "git gc", but they may be held for up to 2 weeks in the default configuration. So it comes down to how many such objects you might reasonably expect to have, how much slower is readdir() on N entries versus M stat() calls (and here we really care about the syscall backing readdir(), like getdents() on Linux, but I'll just call this readdir() below). If N is much smaller than M (a typical packed repo), we know this is a big win (few readdirs() followed by many uses of the resulting cache). When N and M are similar in size, it's also a win. We care about the latency of making a syscall, and readdir() should be giving us many values in a single call. How many? On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512 entries per call (which is 64 bytes per entry; the name itself is 38 bytes, plus there are some other fields). So we can imagine that this is always a win as long as the number of loose objects in the repository is a factor of 500 less than the number of lookups you make. It's hard to auto-tune this because we don't generally know up front how many lookups we're going to do. But it's unlikely for this to perform significantly worse. Signed-off-by: Jeff King --- There's some obvious hand-waving in the paragraphs above. I would love it if somebody with an NFS system could do some before/after timings with various numbers of loose objects, to get a sense of where the breakeven point is. My gut is that we do not need the complexity of a cache-size limit, nor of a config option to disable this. But it would be nice to have a real number where "reasonable" ends and "pathological" begins. :) I'm interested in such numbers, but do not have the appropriate setup to test. I think the tradeoffs you mention above are reasonable. There's also some chance that this isn't "extra" work but is just "earlier" work, as the abbreviation code would load these loose object directories. object-store.h | 1 + sha1-file.c| 20 2 files changed, 21 insertions(+) diff --git a/object-store.h b/object-store.h index bf1e0cb761..60758efad8 100644 --- a/object-store.h +++ b/object-store.h @@ -13,6 +13,7 @@ struct object_directory { /* * Used to store the results of readdir(3) calls when we are OK * sacrificing accuracy due to races for speed. That includes +* object existence with OBJECT_INFO_QUICK, as well as * our search for unique abbreviated hashes. Don't use it for tasks * requiring greater accuracy! * diff --git a/sha1-file.c b/sha1-file.c index 4aae716a37..e53da0b701 100644 --- a/sha1-file.c +++ b/sha1-file.c @@ -921,6 +921,24 @@ static int open_sha1_file(struct repository *r, return -1; } +static int quick_has_loose(struct repository *r, + const unsigned char *sha1) +{ + int subdir_nr = sha1[0]; + struct object_id oid; + struct object_directory *odb; + + hashcpy(oid.hash, sha1); + + prepare_alt_odb(r); + for (odb = r->objects->odb; odb; odb = odb->next) { + odb_load_loose_cache(odb, subdir_nr); + if (oid_array_lookup(>loose_objects_cache, ) >= 0) + return 1; + } + return 0; +} + /* * Map the loose object at "path" if it is not NULL, or the path found by * searching for a loose object named "sha1". @@ -1171,6 +1189,8 @@ static int sha1_loose_object_info(struct repository *r, if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) { const char *path; struct stat st; + if
Re: [PATCH 5/9] handle alternates paths the same as the main object dir
On 11/12/2018 10:46 AM, Jeff King wrote: On Mon, Nov 12, 2018 at 10:38:28AM -0500, Derrick Stolee wrote: We could go either direction here, but this patch moves the alternates struct over to the main directory style (rather than vice-versa). Technically the alternates style is more efficient, as it avoids rewriting the object directory name on each call. But this is unlikely to matter in practice, as we avoid reallocations either way (and nobody has ever noticed or complained that the main object directory is copying a few extra bytes before making a much more expensive system call). Hm. I've complained in the past [1] about a simple method like strbuf_addf() over loose objects, but that was during abbreviation checks so we were adding the string for every loose object but not actually reading the objects. [1] https://public-inbox.org/git/20171201174956.143245-1-dsto...@microsoft.com/ I suspect that had more to do with the cost of snprintf() than the extra bytes being copied. And here we'd still be using addstr and addch exclusively. I'm open to numeric arguments to the contrary, though. :) I agree. I don't think it is worth investigating now, as the performance difference should be moot. I am making a mental note to take a look here if I notice a performance regression later. ;) There's actually a lot of low-hanging fruit there for pre-sizing, too. E.g., fill_sha1_path() calls strbuf_addch() in a loop, but it could quite easily grow the 41 bytes it needs ahead of time. I wouldn't want to change that without finding a measurable improvement, though. It might not be a big deal due to fec501dae8 (strbuf_addch: avoid calling strbuf_grow, 2015-04-16). -Peff
Re: [PATCH 6/9] sha1-file: use an object_directory for the main object dir
On 11/12/2018 9:50 AM, Jeff King wrote: Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King --- If the "the first one is the main store, the rest are alternates" bit is too subtle, we could mark each "struct object_directory" with a bit for "is_local". This is probably a good thing to do proactively. We have the equivalent in the packed_git struct, but that's also because they get out of order. At the moment, I can't think of a read-only action that needs to treat the local object directory more carefully. The closest I know about is 'git pack-objects --local', but that also writes a pack-file. I assume that when we write a pack-file to the "default location" we use get_object_directory() instead of referring to the default object_directory? builtin/fsck.c | 21 ++--- builtin/grep.c | 2 +- commit-graph.c | 5 +- environment.c | 4 +- object-store.h | 27 ++- object.c | 19 packfile.c | 10 ++-- path.c | 2 +- repository.c | 8 +++- sha1-file.c| 122 ++--- sha1-name.c| 17 ++- 11 files changed, 90 insertions(+), 147 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 55153cf92a..15338bd178 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -725,13 +725,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) for_each_loose_object(mark_loose_for_connectivity, NULL, 0); for_each_packed_object(mark_packed_for_connectivity, NULL, 0); } else { - struct object_directory *alt_odb_list; - - fsck_object_dir(get_object_directory()); - prepare_alt_odb(the_repository); - alt_odb_list = the_repository->objects->alt_odb_list; - for (odb = alt_odb_list; odb; odb = odb->next) + for (odb = the_repository->objects->odb; odb; odb = odb->next) fsck_object_dir(odb->path); if (check_full) { @@ -834,13 +829,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) struct child_process commit_graph_verify = CHILD_PROCESS_INIT; const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL }; - commit_graph_verify.argv = verify_argv; - commit_graph_verify.git_cmd = 1; - if (run_command(_graph_verify)) - errors_found |= ERROR_COMMIT_GRAPH; - prepare_alt_odb(the_repository); - for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) { + for (odb = the_repository->objects->odb; odb; odb = odb->next) { child_process_init(_graph_verify); commit_graph_verify.argv = verify_argv;
Re: [PATCH 5/9] handle alternates paths the same as the main object dir
On 11/12/2018 9:49 AM, Jeff King wrote: When we generate loose file paths for the main object directory, the caller provides a buffer to loose_object_path (formerly sha1_file_name). The callers generally keep their own static buffer to avoid excessive reallocations. But for alternate directories, each struct carries its own scratch buffer. This is needlessly different; let's unify them. We could go either direction here, but this patch moves the alternates struct over to the main directory style (rather than vice-versa). Technically the alternates style is more efficient, as it avoids rewriting the object directory name on each call. But this is unlikely to matter in practice, as we avoid reallocations either way (and nobody has ever noticed or complained that the main object directory is copying a few extra bytes before making a much more expensive system call). Hm. I've complained in the past [1] about a simple method like strbuf_addf() over loose objects, but that was during abbreviation checks so we were adding the string for every loose object but not actually reading the objects. [1] https://public-inbox.org/git/20171201174956.143245-1-dsto...@microsoft.com/ The other concern I have is for alternates that may have long-ish paths to their object directories. So, this is worth keeping an eye on, but is likely to be fine. And this has the advantage that the reusable buffers are tied to particular calls, which makes the invalidation rules simpler (for example, the return value from stat_sha1_file() used to be invalidated by basically any other object call, but now it is affected only by other calls to stat_sha1_file()). We do steal the trick from alt_sha1_path() of returning a pointer to the filled buffer, which makes a few conversions more convenient. Signed-off-by: Jeff King --- object-store.h | 14 +- object.c | 1 - sha1-file.c| 44 sha1-name.c| 8 ++-- 4 files changed, 23 insertions(+), 44 deletions(-) diff --git a/object-store.h b/object-store.h index fefa17e380..b2fa0d0df0 100644 --- a/object-store.h +++ b/object-store.h @@ -10,10 +10,6 @@ struct object_directory { struct object_directory *next; - /* see alt_scratch_buf() */ - struct strbuf scratch; - size_t base_len; - /* * Used to store the results of readdir(3) calls when searching * for unique abbreviated hashes. This cache is never @@ -54,14 +50,6 @@ void add_to_alternates_file(const char *dir); */ void add_to_alternates_memory(const char *dir); -/* - * Returns a scratch strbuf pre-filled with the alternate object directory, - * including a trailing slash, which can be used to access paths in the - * alternate. Always use this over direct access to alt->scratch, as it - * cleans up any previous use of the scratch buffer. - */ -struct strbuf *alt_scratch_buf(struct object_directory *odb); - struct packed_git { struct packed_git *next; struct list_head mru; @@ -157,7 +145,7 @@ void raw_object_store_clear(struct raw_object_store *o); * Put in `buf` the name of the file in the local object database that * would be used to store a loose object with the specified sha1. */ -void loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1); +const char *loose_object_path(struct repository *r, struct strbuf *buf, const unsigned char *sha1); void *map_sha1_file(struct repository *r, const unsigned char *sha1, unsigned long *size); diff --git a/object.c b/object.c index 6af8e908bb..dd485ac629 100644 --- a/object.c +++ b/object.c @@ -484,7 +484,6 @@ struct raw_object_store *raw_object_store_new(void) static void free_alt_odb(struct object_directory *odb) { - strbuf_release(>scratch); oid_array_clear(>loose_objects_cache); free(odb); } diff --git a/sha1-file.c b/sha1-file.c index 478eac326b..15db6b61a9 100644 --- a/sha1-file.c +++ b/sha1-file.c @@ -346,27 +346,20 @@ static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1) } } -void loose_object_path(struct repository *r, struct strbuf *buf, - const unsigned char *sha1) +static const char *odb_loose_path(const char *path, struct strbuf *buf, + const unsigned char *sha1) { strbuf_reset(buf); - strbuf_addstr(buf, r->objects->objectdir); + strbuf_addstr(buf, path); strbuf_addch(buf, '/'); fill_sha1_path(buf, sha1); + return buf->buf; } -struct strbuf *alt_scratch_buf(struct object_directory *odb) +const char *loose_object_path(struct repository *r, struct strbuf *buf, + const unsigned char *sha1) { - strbuf_setlen(>scratch, odb->base_len); - return >scratch; -} - -static const char *alt_sha1_path(struct object_directory *odb, -const unsigned
Re: [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending
On 11/12/2018 9:48 AM, Jeff King wrote: Since we're changing the semantics, let's take the opportunity to give it a more hash-neutral name (which will also catch any callers from topics in flight). THANK YOU! This method name confused me so much when I was first looking at the code, but the new name is so much better. -Stolee
Re: [PATCH 3/9] rename "alternate_object_database" to "object_directory"
On 11/12/2018 9:48 AM, Jeff King wrote: In preparation for unifying the handling of alt odb's and the normal repo object directory, let's use a more neutral name. This patch is purely mechanical, swapping the type name, and converting any variables named "alt" to "odb". There should be no functional change, but it will reduce the noise in subsequent diffs. Signed-off-by: Jeff King --- I waffled on calling this object_database instead of object_directory. But really, it is very specifically about the directory (packed storage, including packs from alternates, is handled elsewhere). That makes sense. Each alternate makes its own object directory, but is part of a larger object database. It also helps clarify a difference from the object_store. My only complaint is that you have a lot of variable names with "odb" which are now object_directory pointers. Perhaps "odb" -> "objdir"? Or is that just too much change? builtin/count-objects.c | 4 ++-- builtin/fsck.c | 16 ++--- builtin/submodule--helper.c | 6 ++--- commit-graph.c | 10 object-store.h | 14 +-- object.c| 10 packfile.c | 8 +++ sha1-file.c | 48 ++--- sha1-name.c | 20 transport.c | 2 +- 10 files changed, 69 insertions(+), 69 deletions(-) diff --git a/builtin/count-objects.c b/builtin/count-objects.c index a7cad052c6..3fae474f6f 100644 --- a/builtin/count-objects.c +++ b/builtin/count-objects.c @@ -78,10 +78,10 @@ static int count_cruft(const char *basename, const char *path, void *data) return 0; } -static int print_alternate(struct alternate_object_database *alt, void *data) +static int print_alternate(struct object_directory *odb, void *data) { printf("alternate: "); - quote_c_style(alt->path, NULL, stdout, 0); + quote_c_style(odb->path, NULL, stdout, 0); putchar('\n'); return 0; } diff --git a/builtin/fsck.c b/builtin/fsck.c index b10f2b154c..55153cf92a 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -688,7 +688,7 @@ static struct option fsck_opts[] = { int cmd_fsck(int argc, const char **argv, const char *prefix) { int i; - struct alternate_object_database *alt; + struct object_directory *odb; /* fsck knows how to handle missing promisor objects */ fetch_if_missing = 0; @@ -725,14 +725,14 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) for_each_loose_object(mark_loose_for_connectivity, NULL, 0); for_each_packed_object(mark_packed_for_connectivity, NULL, 0); } else { - struct alternate_object_database *alt_odb_list; + struct object_directory *alt_odb_list; fsck_object_dir(get_object_directory()); prepare_alt_odb(the_repository); alt_odb_list = the_repository->objects->alt_odb_list; - for (alt = alt_odb_list; alt; alt = alt->next) - fsck_object_dir(alt->path); + for (odb = alt_odb_list; odb; odb = odb->next) + fsck_object_dir(odb->path); if (check_full) { struct packed_git *p; @@ -840,12 +840,12 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) errors_found |= ERROR_COMMIT_GRAPH; prepare_alt_odb(the_repository); - for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) { + for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) { child_process_init(_graph_verify); commit_graph_verify.argv = verify_argv; commit_graph_verify.git_cmd = 1; verify_argv[2] = "--object-dir"; - verify_argv[3] = alt->path; + verify_argv[3] = odb->path; if (run_command(_graph_verify)) errors_found |= ERROR_COMMIT_GRAPH; } @@ -861,12 +861,12 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) errors_found |= ERROR_COMMIT_GRAPH; prepare_alt_odb(the_repository); - for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) { + for (odb = the_repository->objects->alt_odb_list; odb; odb = odb->next) { child_process_init(_verify); midx_verify.argv = midx_argv; midx_verify.git_cmd = 1; midx_argv[2] = "--object-dir"; - midx_argv[3] = alt->path; + midx_argv[3] = odb->path; if (run_command(_verify)) errors_found |=
Re: [PATCH 1/9] fsck: do not reuse child_process structs
On 11/12/2018 9:46 AM, Jeff King wrote: The run-command API makes no promises about what is left in a struct child_process after a command finishes, and it's not safe to simply reuse it again for a similar command. In particular: - if you use child->args or child->env_array, they are cleared after finish_command() - likewise, start_command() may point child->argv at child->args->argv; reusing that would lead to accessing freed memory - the in/out/err may hold pipe descriptors from the previous run Thanks! This is helpful information. These two calls are _probably_ OK because they do not use any of those features. But it's only by chance, and may break in the future; let's reinitialize our struct for each program we run. Signed-off-by: Jeff King --- builtin/fsck.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 06eb421720..b10f2b154c 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -841,6 +841,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) prepare_alt_odb(the_repository); for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) { + child_process_init(_graph_verify); + commit_graph_verify.argv = verify_argv; + commit_graph_verify.git_cmd = 1; verify_argv[2] = "--object-dir"; verify_argv[3] = alt->path; if (run_command(_graph_verify)) @@ -859,6 +862,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) prepare_alt_odb(the_repository); for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) { + child_process_init(_verify); + midx_verify.argv = midx_argv; + midx_verify.git_cmd = 1; midx_argv[2] = "--object-dir"; midx_argv[3] = alt->path; if (run_command(_verify)) Looks good to me. -Stolee
Git Test Coverage Report (Wednesday, Nov 7)
rrno(_("%s: cannot stat the open index"), path); 567e20ae23 2156) die(_("%s: index file smaller than expected"), path); 567e20ae23 2160) die_errno(_("%s: unable to map index file"), path); 567e20ae23 2250) warning(_("could not freshen shared index '%s'"), shared_index); 567e20ae23 2285) die(_("broken index, expect %s in %s, got %s"), 567e20ae23 3071) error(_("cannot fix permission bits on '%s'"), get_tempfile_path(*temp)); 567e20ae23 3217) return error(_("%s: cannot drop to stage #0"), ref-filter.c 23c0ab7312 2326) return error(_("option `%s' is incompatible with --no-merged"), remote.c eafdc91bbc 362) warning(_("config remote shorthand cannot begin with '/': %s"), eafdc91bbc 417) error(_("more than one uploadpack given, using the first")); eafdc91bbc 683) die(_("key '%s' of pattern had no '*'"), key); eafdc91bbc 693) die(_("value '%s' of pattern has no '*'"), value); eafdc91bbc 1044) error(_("unable to delete '%s': remote ref does not exist"), eafdc91bbc 1066) return error(_("dst ref %s receives from more than one src"), eafdc91bbc 1785) die(_("couldn't find remote ref %s"), name); eafdc91bbc 1798) error(_("* Ignoring funny ref '%s' locally"), eafdc91bbc 1893) die(_("revision walk setup failed")); eafdc91bbc 2166) return error(_("cannot parse expected object name '%s'"), revision.c b45424181e 2942) return; b45424181e 2945) return; b45424181e 2951) c->object.flags |= UNINTERESTING; b45424181e 2954) return; b45424181e 2957) mark_parents_uninteresting(c); b45424181e 2980) return; b45424181e 2983) return; b45424181e 3048) continue; f0d9cc4196 3097) if (!revs->ignore_missing_links) f0d9cc4196 3098) die("Failed to traverse parents of commit %s", f0d9cc4196 3099) oid_to_hex(>object.oid)); b45424181e 3107) continue; run-command.c 2179045fd0 1229) error(_("cannot create async thread: %s"), strerror(err)); send-pack.c c0e40a2d66 207) close(fd[1]); sha1-file.c 2f90b9d9b4 sha1-file.c 172) int hash_algo_by_name(const char *name) 2f90b9d9b4 sha1-file.c 175) if (!name) 2f90b9d9b4 sha1-file.c 176) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 177) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 178) if (!strcmp(name, hash_algos[i].name)) 2f90b9d9b4 sha1-file.c 179) return i; 2f90b9d9b4 sha1-file.c 180) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 183) int hash_algo_by_id(uint32_t format_id) 2f90b9d9b4 sha1-file.c 186) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 187) if (format_id == hash_algos[i].format_id) 2f90b9d9b4 sha1-file.c 188) return i; 2f90b9d9b4 sha1-file.c 189) return GIT_HASH_UNKNOWN; Commits introducing uncovered code: brian m. carlson 2f90b9d9b: sha1-file: provide functions to look up hash algorithms brian m. carlson b3a41547c: hex: introduce functions to print arbitrary hashes Daniels Umanovskis 0ecb1fc72: branch: introduce --show-current display option Derrick Stolee b45424181: revision.c: generation-based topo-order algorithm Derrick Stolee f0d9cc419: revision.c: begin refactoring --topo-order logic Elijah Newren 143dc7900: merge-recursive: fix rename/add conflict handling Elijah Newren 55ea0bf4c: merge-recursive: new function for better colliding conflict resolutions Elijah Newren f10ffbb2f: merge-recursive: improve rename/rename(1to2)/add[/add] handling Jeff King 01a31f3bc: pull: handle --verify-signatures for unborn branch Jeff King 0eb8d3767: cat-file: report an error on multiple --batch options Jeff King 517fe807d: assert NOARG/NONEG behavior of parse-options callbacks Jeff King 735ca208c: apply: return -1 from option callback instead of calling exit(1) Jeff King fce566480: am: handle --no-patch-format option Joel Teichroeb 3d5ec65ce: stash: convert apply to builtin Joel Teichroeb 5bf62a19c: stash: convert pop to builtin Joel Teichroeb 700577117: stash: convert drop and clear to builtin Junio C Hamano e4d034baf: Merge branch 'nd/i18n' into jch Nguyễn Thái Ngọc Duy 0a995c256: fsck: mark strings for translation Nguyễn Thái Ngọc Duy 0e1a23cec: repack: mark more strings for translation Nguyễn Thái Ngọc Duy 2179045fd: Clean up pthread_create() error handling Nguyễn Thái Ngọc Duy 23c0ab731: parse-options: replace opterror() with optname() Nguyễn Thái Ngọc Duy 4a4f8ae76: git.c: mark more strings for translation Nguyễn Thái Ngọc Duy 567e20ae2: read-cache.c: mark more strings for translation Nguyễn Thái Ngọc Duy 8a705c463: archive.c: mark more strings for translation Nguyễn Thái Ngọc Duy 977e72ca9: reflog: mark strings for translation Nguyễn Thái Ngọc Duy aa4fa3fa7: attr.c: mark more string for translation Nguyễn Thái Ngọc Duy c0e40a2d6: send-pack.c: move async's #ifdef
[PATCH v2 1/1] pack-objects: ignore ambiguous object warnings
From: Derrick Stolee A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! I observed a simple push spending three seconds checking these paths. The fix here is similar to 4c30d50 "rev-list: disable object/refname ambiguity check with --stdin". Save the value of the global warn_on_object_refname_ambiguity variable (which is usually set to the boolean config variable core.warnAmbiguousRefs) and change the state to false. Do this only during the get_object_list() method which reads the objects from stdin. Helped-by: Jeff King Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index d1144a8f7e..f703e6df9b 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2988,6 +2988,7 @@ static void get_object_list(int ac, const char **av) struct rev_info revs; char line[1000]; int flags = 0; + int save_warning; init_revisions(, NULL); save_commit_buffer = 0; @@ -2996,6 +2997,9 @@ static void get_object_list(int ac, const char **av) /* make sure shallows are read */ is_repository_shallow(the_repository); + save_warning = warn_on_object_refname_ambiguity; + warn_on_object_refname_ambiguity = 0; + while (fgets(line, sizeof(line), stdin) != NULL) { int len = strlen(line); if (len && line[len - 1] == '\n') @@ -3022,6 +3026,8 @@ static void get_object_list(int ac, const char **av) die(_("bad revision '%s'"), line); } + warn_on_object_refname_ambiguity = save_warning; + if (use_bitmap_index && !get_object_list_from_bitmap()) return; -- gitgitgadget
[PATCH v2 0/1] send-pack: set core.warnAmbiguousRefs=false
I've been looking into the performance of git push for very large repos. Our users are reporting that 60-80% of git push time is spent during the "Enumerating objects" phase of git pack-objects. A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! My PerfView trace for a simple push measured 3 seconds spent checking these paths. The fix is to set the global warn_on_object_refname_ambiguity to 0 for the section that is performing these object reads. In addition to this patch submission, we are looking into merging it into our fork sooner [1]. [1] https://github.com/Microsoft/git/pull/67 Changes in V2: Instead of using the "-c" flag from send-pack, just set the global. I left the name of the cover letter the same to not confuse anyone viewing the message without threading. Derrick Stolee (1): pack-objects: ignore ambiguous object warnings builtin/pack-objects.c | 6 ++ 1 file changed, 6 insertions(+) base-commit: cae598d9980661a978e2df4fb338518f7bf09572 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-68/derrickstolee/send-pack-config-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/68 Range-diff vs v1: 1: 1ef2c51550 < -: -- send-pack: set core.warnAmbiguousRefs=false -: -- > 1: 002868ee6b pack-objects: ignore ambiguous object warnings -- gitgitgadget
Re: [PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false
On 11/6/2018 2:51 PM, Jeff King wrote: On Tue, Nov 06, 2018 at 02:44:42PM -0500, Jeff King wrote: The fix for this is simple: set core.warnAmbiguousRefs to false for this specific call of git pack-objects coming from git send-pack. We don't want to default it to false for all calls to git pack-objects, as it is valid to pass ref names instead of object ids. This helps regain these seconds during a push. I don't think you actually care about the ambiguity check between refs here; you just care about avoiding the ref check when we've seen (and are mostly expecting) a 40-hex sha1. We have a more specific flag for that: warn_on_object_refname_ambiguity. And I think it would be OK to enable that all the time for pack-objects, which is plumbing that does typically expect object names. See prior art in 25fba78d36 (cat-file: disable object/refname ambiguity check for batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname ambiguity check with --stdin, 2014-03-12). I'd probably do it here: diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index e50c6cd1ff..d370638a5d 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -3104,6 +3104,7 @@ static void get_object_list(int ac, const char **av) Scoping the change into get_object_list does make sense. I was doing it a level higher, which is not worth it. I'll reproduce your change here. struct rev_info revs; char line[1000]; int flags = 0; + int save_warning; repo_init_revisions(the_repository, , NULL); save_commit_buffer = 0; @@ -3112,6 +3113,9 @@ static void get_object_list(int ac, const char **av) /* make sure shallows are read */ is_repository_shallow(the_repository); + save_warning = warn_on_object_refname_ambiguity; + warn_on_object_refname_ambiguity = 0; + while (fgets(line, sizeof(line), stdin) != NULL) { int len = strlen(line); if (len && line[len - 1] == '\n') @@ -3138,6 +3142,8 @@ static void get_object_list(int ac, const char **av) die(_("bad revision '%s'"), line); } + warn_on_object_refname_ambiguity = save_warning; + if (use_bitmap_index && !get_object_list_from_bitmap()) return; But I'll leave it to you to wrap that up in a patch, since you probably should re-check your timings (which it would be interesting to include in the commit message, if you have reproducible timings). The timings change a lot depending on the disk cache and the remote refs, which is unfortunate, but I have measured a three-second improvement. Thanks, -Stolee
Re: [PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false
On 11/6/2018 2:44 PM, Jeff King wrote: On Tue, Nov 06, 2018 at 11:13:47AM -0800, Derrick Stolee via GitGitGadget wrote: I've been looking into the performance of git push for very large repos. Our users are reporting that 60-80% of git push time is spent during the "Enumerating objects" phase of git pack-objects. A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! My PerfView trace for a simple push measured 3 seconds spent checking these paths. Some of this might be useful in the commit message. :) The fix for this is simple: set core.warnAmbiguousRefs to false for this specific call of git pack-objects coming from git send-pack. We don't want to default it to false for all calls to git pack-objects, as it is valid to pass ref names instead of object ids. This helps regain these seconds during a push. I don't think you actually care about the ambiguity check between refs here; you just care about avoiding the ref check when we've seen (and are mostly expecting) a 40-hex sha1. We have a more specific flag for that: warn_on_object_refname_ambiguity. And I think it would be OK to enable that all the time for pack-objects, which is plumbing that does typically expect object names. See prior art in 25fba78d36 (cat-file: disable object/refname ambiguity check for batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname ambiguity check with --stdin, 2014-03-12). Thanks for these pointers. Helps to know there is precedent for shutting down the behavior without relying on "-c" flags. Whenever I see a change like this to the pack-objects invocation for send-pack, it makes me wonder if upload-pack would want the same thing. It's a moot point if we just set the flag directly in inside pack-objects, though. I'll send a v2 that does just that. Thanks, -Stolee
[PATCH 1/1] send-pack: set core.warnAmbiguousRefs=false
From: Derrick Stolee During a 'git push' command, we run 'git send-pack' inside of our transport helper. This creates a 'git pack-objects' process and passes a list of object ids. If this list is large, then the pack-objects process can spend a lot of time checking the possible refs these strings could represent. Remove this extra check by setting core.warnAmbiguousRefs to false as we call 'git pack-objects'. Signed-off-by: Derrick Stolee --- send-pack.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/send-pack.c b/send-pack.c index e920ca57df..5055150fe1 100644 --- a/send-pack.c +++ b/send-pack.c @@ -64,6 +64,8 @@ static int pack_objects(int fd, struct ref *refs, struct oid_array *extra, struc int i; int rc; + argv_array_push(, "-c"); + argv_array_push(, "core.warnAmbiguousRefs=false"); argv_array_push(, "pack-objects"); argv_array_push(, "--all-progress-implied"); argv_array_push(, "--revs"); -- gitgitgadget
[PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false
I've been looking into the performance of git push for very large repos. Our users are reporting that 60-80% of git push time is spent during the "Enumerating objects" phase of git pack-objects. A git push process runs several processes during its run, but one includes git send-pack which calls git pack-objects and passes the known have/wants into stdin using object ids. However, the default setting for core.warnAmbiguousRefs requires git pack-objects to check for ref names matching the ref_rev_parse_rules array in refs.c. This means that every object is triggering at least six "file exists?" queries. When there are a lot of refs, this can add up significantly! My PerfView trace for a simple push measured 3 seconds spent checking these paths. The fix for this is simple: set core.warnAmbiguousRefs to false for this specific call of git pack-objects coming from git send-pack. We don't want to default it to false for all calls to git pack-objects, as it is valid to pass ref names instead of object ids. This helps regain these seconds during a push. In addition to this patch submission, we are looking into merging it into our fork sooner [1]. [1] https://github.com/Microsoft/git/pull/67 Derrick Stolee (1): send-pack: set core.warnAmbiguousRefs=false send-pack.c | 2 ++ 1 file changed, 2 insertions(+) base-commit: cae598d9980661a978e2df4fb338518f7bf09572 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-68/derrickstolee/send-pack-config-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/68 -- gitgitgadget
Re: [PATCH] multi-pack-index: make code -Wunused-parameter clean
On 11/3/2018 10:27 PM, Jeff King wrote: On Sat, Nov 03, 2018 at 05:49:57PM -0700, Carlo Marcelo Arenas Belón wrote: introduced in 662148c435 ("midx: write object offsets", 2018-07-12) but included on all previous versions as well. midx.c:713:54: warning: unused parameter 'nr_objects' [-Wunused-parameter] likely an oversight as the information needed to iterate over is embedded in nr_large_offset I've been preparing a series to make the whole code base compile with -Wunused-parameter, and I handled this case a bit differently. -- >8 -- Subject: [PATCH] midx: double-check large object write loop The write_midx_large_offsets() function takes an array of object entries, the number of entries in the array (nr_objects), and the number of entries with large offsets (nr_large_offset). But we never actually use nr_objects; instead we keep walking down the array and counting down nr_large_offset until we've seen all of the large entries. This is correct, but we can be a bit more defensive. If there were ever a mismatch between nr_large_offset and the actual set of large-offset objects, we'd walk off the end of the array. Since we know the size of the array, we can use nr_objects to make sure we don't walk too far. Signed-off-by: Jeff King Thanks, both, for catching this. I prefer the approach that adds defenses. Reviewed-by: Derrick Stolee --- midx.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/midx.c b/midx.c index 4fac0cd08a..ecd583666a 100644 --- a/midx.c +++ b/midx.c @@ -712,12 +712,18 @@ static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_nee static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset, struct pack_midx_entry *objects, uint32_t nr_objects) { - struct pack_midx_entry *list = objects; + struct pack_midx_entry *list = objects, *end = objects + nr_objects; size_t written = 0; while (nr_large_offset) { - struct pack_midx_entry *obj = list++; - uint64_t offset = obj->offset; + struct pack_midx_entry *obj; + uint64_t offset; + + if (list >= end) + BUG("too many large-offset objects"); + + obj = list++; + offset = obj->offset; if (!(offset >> 31)) continue;
Git Test Coverage Report (Friday, Nov 2)
)); read-cache.c 5257b0625a 675) die(_("will not add file alias '%s' ('%s' already exists in index)"), 5257b0625a 676) ce->name, alias->name); 5257b0625a 691) die(_("cannot create an empty blob in the object database")); 5257b0625a 712) return error(_("%s: can only add regular files, symbolic links or git-directories"), path); 5257b0625a 786) return error(_("unable to add '%s' to index"), path); 5257b0625a 822) error(_("invalid path '%s'"), path); 5257b0625a 848) error(_("invalid path '%s'"), path); 5257b0625a 1686) return error(_("bad signature 0x%08x"), hdr->hdr_signature); 5257b0625a 1689) return error(_("bad index version %d"), hdr_version); 5257b0625a 1728) return error(_("index uses %.4s extension, which we do not understand"), 5257b0625a 1730) fprintf_ln(stderr, _("ignoring %.4s extension"), ext); 5257b0625a 1777) die(_("unknown index entry format 0x%08x"), extended_flags); 5257b0625a 1848) die(_("unordered stage entries in index")); 5257b0625a 1851) die(_("multiple stage entries for merged file '%s'"), 5257b0625a 1854) die(_("unordered stage entries for '%s'"), 5257b0625a 2148) die_errno(_("%s: index file open failed"), path); 5257b0625a 2152) die_errno(_("%s: cannot stat the open index"), path); 5257b0625a 2156) die(_("%s: index file smaller than expected"), path); 5257b0625a 2160) die_errno(_("%s: unable to map index file"), path); 5257b0625a 2252) warning(_("could not freshen shared index '%s'"), shared_index); 5257b0625a 2287) die(_("broken index, expect %s in %s, got %s"), 5257b0625a 3073) error(_("cannot fix permission bits on '%s'"), get_tempfile_path(*temp)); 5257b0625a 3219) return error(_("%s: cannot drop to stage #0"), ref-filter.c 35408df41e 2324) return error(_("option `%s' is incompatible with --no-merged"), remote.c a454d6a26b 362) warning(_("config remote shorthand cannot begin with '/': %s"), a454d6a26b 417) error(_("more than one uploadpack given, using the first")); a454d6a26b 683) die(_("key '%s' of pattern had no '*'"), key); a454d6a26b 693) die(_("value '%s' of pattern has no '*'"), value); a454d6a26b 1044) error(_("unable to delete '%s': remote ref does not exist"), a454d6a26b 1066) return error(_("dst ref %s receives from more than one src"), a454d6a26b 1753) die(_("couldn't find remote ref %s"), name); a454d6a26b 1766) error(_("* Ignoring funny ref '%s' locally"), a454d6a26b 1861) die(_("revision walk setup failed")); a454d6a26b 2134) return error(_("cannot parse expected object name '%s'"), revision.c b45424181e 2936) return; b45424181e 2939) return; b45424181e 2945) c->object.flags |= UNINTERESTING; b45424181e 2948) return; b45424181e 2951) mark_parents_uninteresting(c); b45424181e 2974) return; b45424181e 2977) return; b45424181e 3042) continue; f0d9cc4196 3091) if (!revs->ignore_missing_links) f0d9cc4196 3092) die("Failed to traverse parents of commit %s", f0d9cc4196 3093) oid_to_hex(>object.oid)); b45424181e 3101) continue; run-command.c 31bfd155d8 1229) error(_("cannot create async thread: %s"), strerror(err)); sequencer.c bcd33ec25f 683) np = strchrnul(buf, '\n'); bcd33ec25f 684) return error(_("no key present in '%.*s'"), bcd33ec25f 695) return error(_("unable to dequote value of '%s'"), bcd33ec25f 737) goto finish; bcd33ec25f 742) name_i = error(_("'GIT_AUTHOR_NAME' already given")); bcd33ec25f 747) email_i = error(_("'GIT_AUTHOR_EMAIL' already given")); bcd33ec25f 752) date_i = error(_("'GIT_AUTHOR_DATE' already given")); bcd33ec25f 756) err = error(_("unknown variable '%s'"), bcd33ec25f 761) error(_("missing 'GIT_AUTHOR_NAME'")); bcd33ec25f 763) error(_("missing 'GIT_AUTHOR_EMAIL'")); bcd33ec25f 765) error(_("missing 'GIT_AUTHOR_DATE'")); sha1-file.c 2f90b9d9b4 sha1-file.c 172) int hash_algo_by_name(const char *name) 2f90b9d9b4 sha1-file.c 175) if (!name) 2f90b9d9b4 sha1-file.c 176) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 177) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 178) if (!strcmp(name, hash_algos[i].name)) 2f90b9d9b4 sha1-file.c 179) return i; 2f90b9d9b4 sha1-file.c 180) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 183) int hash_algo_by_id(uint32_t format_id) 2f90b9d9b4 sha1-file.c 186) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 187) if (format_id == hash_algos[i].format_id) 2f90b9d9b4 sha1-file.c 188) return i; 2f90b9d9b4 sha1-file.c 189) return GIT_HASH_UNKNOWN; xdiff-interface.c xdiff/xutils.c 611e42a598 405) return -1; Commits introducing uncovered code: brian m. carlson 2f90b9d9b: s
[PATCH] merge-recursive: combine error handling
In handle_rename_rename_1to2(), we have duplicated error handling around colliding paths. Specifically, when we want to write out the file and there is a directory or untracked file in the way, we need to create a temporary file to hold the contents. This has some special output to alert the user, and this output is duplicated for each side of the conflict. Simplify the call by generating this new path in a helper function. Signed-off-by: Derrick Stolee --- Elijah, Here is a patch that combines the logic to avoid code clones, and also more easily covers code blocks. Of course, your additional test helps the branch coverage. Thanks, -Stolee merge-recursive.c | 53 --- 1 file changed, 27 insertions(+), 26 deletions(-) diff --git a/merge-recursive.c b/merge-recursive.c index 5986b6..5e36bef162 100644 --- a/merge-recursive.c +++ b/merge-recursive.c @@ -1709,6 +1709,27 @@ static int handle_rename_add(struct merge_options *o, ci->dst_entry1->stages[other_stage].mode); } +static char *find_path_for_conflict(struct merge_options *o, + const char *path, + const char *branch1, + const char *branch2) +{ + char *new_path = NULL; + if (dir_in_way(path, !o->call_depth, 0)) { + new_path = unique_path(o, path, branch1); + output(o, 1, _("%s is a directory in %s adding " + "as %s instead"), + path, branch2, new_path); + } else if (would_lose_untracked(path)) { + new_path = unique_path(o, path, branch1); + output(o, 1, _("Refusing to lose untracked file" + " at %s; adding as %s instead"), + path, new_path); + } + + return new_path; +} + static int handle_rename_rename_1to2(struct merge_options *o, struct rename_conflict_info *ci) { @@ -1783,19 +1804,9 @@ static int handle_rename_rename_1to2(struct merge_options *o, >oid, add->mode) < 0) return -1; } else { - char *new_path = NULL; - if (dir_in_way(a->path, !o->call_depth, 0)) { - new_path = unique_path(o, a->path, ci->branch1); - output(o, 1, _("%s is a directory in %s adding " - "as %s instead"), - a->path, ci->branch2, new_path); - } else if (would_lose_untracked(a->path)) { - new_path = unique_path(o, a->path, ci->branch1); - output(o, 1, _("Refusing to lose untracked file" - " at %s; adding as %s instead"), - a->path, new_path); - } - + char *new_path = find_path_for_conflict(o, a->path, + ci->branch1, + ci->branch2); if (update_file(o, 0, , mfi.mode, new_path ? new_path : a->path)) return -1; free(new_path); @@ -1812,19 +1823,9 @@ static int handle_rename_rename_1to2(struct merge_options *o, , mfi.mode) < 0) return -1; } else { - char *new_path = NULL; - if (dir_in_way(b->path, !o->call_depth, 0)) { - new_path = unique_path(o, b->path, ci->branch2); - output(o, 1, _("%s is a directory in %s adding " - "as %s instead"), - b->path, ci->branch1, new_path); - } else if (would_lose_untracked(b->path)) { - new_path = unique_path(o, b->path, ci->branch2); - output(o, 1, _("Refusing to lose untracked file" - " at %s; adding as %s instead"), - b->path, new_path); - } - + char *new_path = find_path_for_conflict(o, b->path, + ci->branch2, + ci->branch1); if (update_file(o, 0, , mfi.mode, new_path ? new_path : b->path)) return -1; free(new_path); -- 2.19.1.1235.gbda564b1a2
Re: [PATCH v4 00/10] Improve path collision conflict resolutions
On 11/2/2018 2:53 PM, Elijah Newren wrote: Major question: * You'll note that I edited the last two patches to mark them as RFC. To be honest, I'm not sure what to do with these. They improve code coverage of new code, but the same gaps existed in the old code; they only show up in the coverage-diff because I essentially moved code around to a new improved function. Since the new code doesn't really add new abilities but rather just shifts the handling of these conflicts to a common function, they shouldn't need any more testcases than previously and modifying the existing patches thus feels slightly misleading. That line of thought leads me to believe that perhaps putting them in a separate combined patch of their own with a decent commit message is the right way to go. On the other hand, squashing them to the commits they're marked as fixups for shows which logical part of the code the tests are related to, which seems like a good thing. So what's the right way to handle these? I appreciate the effort you made to improve test coverage! It's unfortunate that this portion wasn't covered earlier, because we could have broken it and not noticed until a release. I think making them separate commits is fine, and the comment on the test case is helpful. The fact that you only had to change the commit timestamps in order to get the coverage makes me reexamine the code and realize that maybe the "right" thing to do is to reduce our code clones. (This is also how I was looking at the wrong block of the patch when talking about it not being covered.) I'll look at the patch and see if I can contribute a concrete code suggestion there. Aside: I hope that I am not being annoying by poking around with the test coverage reports. It does give me a way to target my review efforts, especially into changes that touch areas outside my expertise (like this one). Thanks, -Stolee
Re: [PATCH v3 8/8] merge-recursive: improve rename/rename(1to2)/add[/add] handling
On 11/2/2018 1:27 PM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 12:01 AM Elijah Newren wrote: On Wed, Oct 31, 2018 at 8:08 AM Derrick Stolee wrote: On 10/19/2018 3:31 PM, Elijah Newren wrote: [snip] + char *new_path = NULL; + if (dir_in_way(b->path, !o->call_depth, 0)) { + new_path = unique_path(o, b->path, ci->branch2); + output(o, 1, _("%s is a directory in %s adding " +"as %s instead"), +b->path, ci->branch1, new_path); I tried really hard, but failed to get a test to cover the block below. I was able to find that the "check handling of differently renamed file with D/F conflicts" test in t6022-merge-rename.sh covers the block above. Trying to tweak the example using untracked files seems to hit an error message from unpack-trees.c instead. + } else if (would_lose_untracked(b->path)) { + new_path = unique_path(o, b->path, ci->branch2); + output(o, 1, _("Refusing to lose untracked file" +" at %s; adding as %s instead"), +b->path, new_path); So now I'm confused. This block was not listed in your coverage report[1]. And, in fact, I think this block IS covered by testcase 10c of t6043. However, there is a very similar looking block about 30 lines up that is uncovered (and which was mentioned in your report): } else if (would_lose_untracked(a->path)) { new_path = unique_path(o, a->path, ci->branch1); output(o, 1, _("Refusing to lose untracked file" " at %s; adding as %s instead"), a->path, new_path); covering it, I think, is just a matter of repeating the 10c test with the merge repeated in the other direction (checkout B and merge A instead of checking out A and merging B) -- and touching up the checks accordingly. However, now I'm wondering if I'm crazy. Was it really the block you had highlighted that you were seeing uncovered? Trust the report (generated by computer) over me (generated by squinting at an email, trying to match line numbers). Thanks, -Stolee
Re: [PATCH 19/24] submodule: use submodule repos for object lookup
On 11/2/2018 1:23 PM, Stefan Beller wrote: On Fri, Nov 2, 2018 at 6:03 AM Derrick Stolee wrote: On 10/30/2018 6:08 PM, Stefan Beller wrote: This converts the 'show_submodule_header' function to use the repository API properly, such that the submodule objects are not added to the main object store. Signed-off-by: Stefan Beller A couple tests are broken in 'pu' when run with GIT_TEST_COMMIT_GRAPH=1, including t4041-diff-submodule-option.sh. The failure bisects to this patch. Here is a verbose output of the first failure in that script:; expecting success: git diff-index -p --submodule=log HEAD >actual && cat >expected <<-EOF && Submodule sm1 $head2..$head3 (rewind): < Add foo3 ($added foo3) < Add foo2 ($added foo2) EOF test_cmp expected actual + git diff-index -p --submodule=log HEAD + cat + test_cmp expected actual + diff -u expected actual --- expected2018-11-02 12:58:43.429262380 + +++ actual 2018-11-02 12:58:43.429262380 + @@ -1,3 +1,5 @@ -Submodule sm1 30b9670..dafb207 (rewind): +Submodule sm1 30b9670...dafb207: < Add foo3 (hinzugefügt foo3) < Add foo2 (hinzugefügt foo2) + > Add foo1 (hinzugefügt foo1) + < Add foo1 (hinzugefügt foo1) error: last command exited with $?=1 not ok 9 - modified submodule(backward) I've been looking into the patch below to see if there is an obvious problem, but the best I can think is that open_submodule() creates an alternate 'struct repository' and somehow the commit-graph feature is interacting poorly with that struct. Stefan, do you know what's going on? Sure, see the last four patches of this series https://public-inbox.org/git/20181030220817.61691-1-sbel...@google.com/ (to which you also reply to? Junio did not queue this one, yet). Sorry! Got a bit mixed up looking at everything. I didn't realize that the current 'pu' didn't have your latest. Thanks!
Re: [PATCH 0/3] Make add_missing_tags() linear
On 11/2/2018 10:58 AM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee wrote: On 11/1/2018 2:57 PM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee wrote: No rush. I'd just like to understand how removing the commit-graph file can make the new algorithm faster. Putting a similar count in the old algorithm would involve giving a count for every call to in_merge_bases_many(), which would be very noisy. $ time git push --dry-run --follow-tags /home/newren/repo-mirror count: 92912 To /home/newren/repo-mirror * [new branch] test5 -> test5 real0m3.024s user0m2.752s sys0m0.320s Is the above test with or without the commit-graph file? Can you run it in the other mode, too? I'd like to see if the "count" value changes when the only difference is the presence of a commit-graph file. I apologize for providing misleading information earlier; this was an apples to oranges comparison. Here's what I did: git clone coworker.bundle coworker-repo cd coworker-repo time git push --dry-run --follow-tags /home/newren/repo-mirror git config core.commitgraph true git config gc.writecommitgraph true git gc time git push --dry-run --follow-tags /home/newren/nucleus-mirror I figured I had just done a fresh clone, so surely the gc wouldn't do anything other than write the .git/objects/info/commit-graph file. However, the original bundle contained many references outside of refs/heads/ and refs/tags/: $ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e refs/tags/ -e HEAD | wc -l 2396 These other refs apparently referred to objects not otherwise referenced in refs/heads/ and refs/tags/, and caused the gc to explode lots of loose objects: $ git count-objects -v count: 147604 size: 856416 in-pack: 1180692 packs: 1 size-pack: 796143 prune-packable: 0 garbage: 0 size-garbage: 0 The slowdown with commit-graph was entirely due to there being lots of loose objects (147K of them). If I add a git-prune before doing the timing with commit-graph, then the timing with commit-graph is faster than the run without a commit-graph. Sorry for the wild goose chase. And thanks for the fixes; get_reachable_subset() makes things much faster even without a commit-graph, and the commit-graph just improves it more. :-) Thanks for double-checking! It's good to have confidence that this is a good algorithm to use. Thanks, -Stolee
[PATCH v2 0/3] Make add_missing_tags() linear
As reported earlier [1], the add_missing_tags() method in remote.c has quadratic performance. Some of that performance is curbed due to the generation-number cutoff in in_merge_bases_many(). However, that fix doesn't help users without a commit-graph, and it can still be painful if that cutoff is sufficiently low compared to the tags we are using for reachability testing. Add a new method in commit-reach.c called get_reachable_subset() which does a many-to-many reachability test. Starting at the 'from' commits, walk until the generation is below the smallest generation in the 'to' commits, or all 'to' commits have been discovered. This performs only one commit walk for the entire add_missing_tags() method, giving linear performance in the worst case. Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() works independently of its application in add_missing_tags(). Thanks, -Stolee [1] https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/ Derrick Stolee (3): commit-reach: implement get_reachable_subset test-reach: test get_reachable_subset remote: make add_missing_tags() linear commit-reach.c| 70 +++ commit-reach.h| 13 remote.c | 34 - t/helper/test-reach.c | 34 ++--- t/t6600-test-reach.sh | 52 5 files changed, 198 insertions(+), 5 deletions(-) base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/60 Range-diff vs v1: 1: 4c0c5c9143 ! 1: 9e570603bd commit-reach: implement get_reachable_subset @@ -49,7 +49,7 @@ + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, + struct commit **to, int nr_to, -+ int reachable_flag) ++ unsigned int reachable_flag) +{ + struct commit **item; + struct commit *current; @@ -129,9 +129,12 @@ + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag' to each of the commits in the returned list. ++ * ++ * This method uses the PARENT1 and PARENT2 flags during its operation, ++ * so be sure these flags are not set before calling the method. + */ +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, + struct commit **to, int nr_to, -+ int reachable_flag); ++ unsigned int reachable_flag); + #endif 2: 382f4f4a5b = 2: 52e847b928 test-reach: test get_reachable_subset 3: ecbed3de5c = 3: dfaceb162f remote: make add_missing_tags() linear -- gitgitgadget
[PATCH v2 1/3] commit-reach: implement get_reachable_subset
From: Derrick Stolee The existing reachability algorithms in commit-reach.c focus on finding merge-bases or determining if all commits in a set X can reach at least one commit in a set Y. However, for two commits sets X and Y, we may also care about which commits in Y are reachable from at least one commit in X. Implement get_reachable_subset() which answers this question. Given two arrays of commits, 'from' and 'to', return a commit_list with every commit from the 'to' array that is reachable from at least one commit in the 'from' array. The algorithm is a simple walk starting at the 'from' commits, using the PARENT2 flag to indicate "this commit has already been added to the walk queue". By marking the 'to' commits with the PARENT1 flag, we can determine when we see a commit from the 'to' array. We remove the PARENT1 flag as we add that commit to the result list to avoid duplicates. The order of the resulting list is a reverse of the order that the commits are discovered in the walk. There are a couple shortcuts to avoid walking more than we need: 1. We determine the minimum generation number of commits in the 'to' array. We do not walk commits with generation number below this minimum. 2. We count how many distinct commits are in the 'to' array, and decrement this count when we discover a 'to' commit during the walk. If this number reaches zero, then we can terminate the walk. Tests will be added using the 'test-tool reach' helper in a subsequent commit. Signed-off-by: Derrick Stolee --- commit-reach.c | 70 ++ commit-reach.h | 13 ++ 2 files changed, 83 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a22..8ad5352752 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +unsigned int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(, *item); + } + } + + while (num_to_find && (current = prio_queue_get()) != NULL) { + struct commit_list *parents; + + if (current->object.flags & PARENT1) { + current->object.flags &= ~PARENT1; + current->object.flags |= reachable_flag; + commit_list_insert(current, _commits); + num_to_find--; + } + + for (parents = current->parents; parents; parents = parents->next) { + struct commit *p = parents->item; + + parse_commit(p); + + if (p->generation < min_generation) + continue; + + if (p->object.flags & PARENT2) + continue; + + p->object.flags |= PARENT2; + prio_queue_put(, p); + } + } + + clear_commit_marks_many(nr_to, to, PARENT1); + clear_commit_marks_many(nr_from, from, PARENT2); + + return found_commits; +} + diff --git a/commit-reach.h b/commit-reach.h index 7d313e2975..bb34af0269 100644 --- a/commit-reach.h +++ b/commit-reach.h @@ -74,4 +74,17 @@ int can_all_from_reach_with_flag(struct object_array *from, int can_all_from_reach(struct commit_list *from, struct commit_list *to, int commit_date_cutoff); + +/* + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag
[PATCH v2 3/3] remote: make add_missing_tags() linear
From: Derrick Stolee The add_missing_tags() method currently has quadratic behavior. This is due to a linear number (based on number of tags T) of calls to in_merge_bases_many, which has linear performance (based on number of commits C in the repository). Replace this O(T * C) algorithm with an O(T + C) algorithm by using get_reachable_subset(). We ignore the return list and focus instead on the reachable_flag assigned to the commits we care about, because we need to interact with the tag ref and not just the commit object. Signed-off-by: Derrick Stolee --- remote.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/remote.c b/remote.c index 81f4f01b00..b850f2feb3 100644 --- a/remote.c +++ b/remote.c @@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * sent to the other side. */ if (sent_tips.nr) { + const int reachable_flag = 1; + struct commit_list *found_commits; + struct commit **src_commits; + int nr_src_commits = 0, alloc_src_commits = 16; + ALLOC_ARRAY(src_commits, alloc_src_commits); + for_each_string_list_item(item, _tag) { struct ref *ref = item->util; + struct commit *commit; + + if (is_null_oid(>new_oid)) + continue; + commit = lookup_commit_reference_gently(the_repository, + >new_oid, + 1); + if (!commit) + /* not pushing a commit, which is not an error */ + continue; + + ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits); + src_commits[nr_src_commits++] = commit; + } + + found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr, +src_commits, nr_src_commits, +reachable_flag); + + for_each_string_list_item(item, _tag) { struct ref *dst_ref; + struct ref *ref = item->util; struct commit *commit; if (is_null_oid(>new_oid)) @@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * Is this tag, which they do not have, reachable from * any of the commits we are sending? */ - if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip)) + if (!(commit->object.flags & reachable_flag)) continue; /* Add it in */ @@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds oidcpy(_ref->new_oid, >new_oid); dst_ref->peer_ref = copy_ref(ref); } + + clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag); + free(src_commits); + free_commit_list(found_commits); } + string_list_clear(_tag, 0); free(sent_tips.tip); } -- gitgitgadget
[PATCH v2 2/3] test-reach: test get_reachable_subset
From: Derrick Stolee The get_reachable_subset() method returns the list of commits in the 'to' array that are reachable from at least one commit in the 'from' array. Add tests that check this method works in a few cases: 1. All commits in the 'to' list are reachable. This exercises the early-termination condition. 2. Some commits in the 'to' list are reachable. This exercises the loop-termination condition. 3. No commits in the 'to' list are reachable. This exercises the NULL return condition. Signed-off-by: Derrick Stolee --- t/helper/test-reach.c | 34 t/t6600-test-reach.sh | 52 +++ 2 files changed, 82 insertions(+), 4 deletions(-) diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index 08d2ea68e8..a0272178b7 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av) struct commit *A, *B; struct commit_list *X, *Y; struct object_array X_obj = OBJECT_ARRAY_INIT; - struct commit **X_array; - int X_nr, X_alloc; + struct commit **X_array, **Y_array; + int X_nr, X_alloc, Y_nr, Y_alloc; struct strbuf buf = STRBUF_INIT; struct repository *r = the_repository; @@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av) A = B = NULL; X = Y = NULL; - X_nr = 0; - X_alloc = 16; + X_nr = Y_nr = 0; + X_alloc = Y_alloc = 16; ALLOC_ARRAY(X_array, X_alloc); + ALLOC_ARRAY(Y_array, Y_alloc); while (strbuf_getline(, stdin) != EOF) { struct object_id oid; @@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av) case 'Y': commit_list_insert(c, ); + ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc); + Y_array[Y_nr++] = c; break; default: @@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av) filter.with_commit_tag_algo = 0; printf("%s(_,A,X,_):%d\n", av[1], commit_contains(, A, X, )); + } else if (!strcmp(av[1], "get_reachable_subset")) { + const int reachable_flag = 1; + int i, count = 0; + struct commit_list *current; + struct commit_list *list = get_reachable_subset(X_array, X_nr, + Y_array, Y_nr, + reachable_flag); + printf("get_reachable_subset(X,Y)\n"); + for (current = list; current; current = current->next) { + if (!(list->item->object.flags & reachable_flag)) + die(_("commit %s is not marked reachable"), + oid_to_hex(>item->object.oid)); + count++; + } + for (i = 0; i < Y_nr; i++) { + if (Y_array[i]->object.flags & reachable_flag) + count--; + } + + if (count < 0) + die(_("too many commits marked reachable")); + + print_sorted_commit_ids(list); } exit(0); diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index ae94b27f70..a0c64e617a 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'get_reachable_subset:all' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-6-6 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 \ + commit-5-6 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:some' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:none' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 +
Re: [PATCH 19/24] submodule: use submodule repos for object lookup
On 10/30/2018 6:08 PM, Stefan Beller wrote: This converts the 'show_submodule_header' function to use the repository API properly, such that the submodule objects are not added to the main object store. Signed-off-by: Stefan Beller A couple tests are broken in 'pu' when run with GIT_TEST_COMMIT_GRAPH=1, including t4041-diff-submodule-option.sh. The failure bisects to this patch. Here is a verbose output of the first failure in that script:; expecting success: git diff-index -p --submodule=log HEAD >actual && cat >expected <<-EOF && Submodule sm1 $head2..$head3 (rewind): < Add foo3 ($added foo3) < Add foo2 ($added foo2) EOF test_cmp expected actual + git diff-index -p --submodule=log HEAD + cat + test_cmp expected actual + diff -u expected actual --- expected 2018-11-02 12:58:43.429262380 + +++ actual 2018-11-02 12:58:43.429262380 + @@ -1,3 +1,5 @@ -Submodule sm1 30b9670..dafb207 (rewind): +Submodule sm1 30b9670...dafb207: < Add foo3 (hinzugefügt foo3) < Add foo2 (hinzugefügt foo2) + > Add foo1 (hinzugefügt foo1) + < Add foo1 (hinzugefügt foo1) error: last command exited with $?=1 not ok 9 - modified submodule(backward) I've been looking into the patch below to see if there is an obvious problem, but the best I can think is that open_submodule() creates an alternate 'struct repository' and somehow the commit-graph feature is interacting poorly with that struct. Stefan, do you know what's going on? Thanks, -Stolee --- submodule.c | 76 ++--- 1 file changed, 61 insertions(+), 15 deletions(-) diff --git a/submodule.c b/submodule.c index d9d3046689..0fdda45252 100644 --- a/submodule.c +++ b/submodule.c @@ -443,7 +443,7 @@ static int prepare_submodule_summary(struct rev_info *rev, const char *path, return prepare_revision_walk(rev); } -static void print_submodule_summary(struct rev_info *rev, struct diff_options *o) +static void print_submodule_summary(struct repository *r, struct rev_info *rev, struct diff_options *o) { static const char format[] = " %m %s"; struct strbuf sb = STRBUF_INIT; @@ -454,7 +454,8 @@ static void print_submodule_summary(struct rev_info *rev, struct diff_options *o ctx.date_mode = rev->date_mode; ctx.output_encoding = get_log_output_encoding(); strbuf_setlen(, 0); - format_commit_message(commit, format, , ); + repo_format_commit_message(r, commit, format, , + ); strbuf_addch(, '\n'); if (commit->object.flags & SYMMETRIC_LEFT) diff_emit_submodule_del(o, sb.buf); @@ -481,14 +482,47 @@ void prepare_submodule_repo_env(struct argv_array *out) DEFAULT_GIT_DIR_ENVIRONMENT); } -/* Helper function to display the submodule header line prior to the full - * summary output. If it can locate the submodule objects directory it will - * attempt to lookup both the left and right commits and put them into the - * left and right pointers. +/* + * Initialize 'out' based on the provided submodule path. + * + * Unlike repo_submodule_init, this tolerates submodules not present + * in .gitmodules. This function exists only to preserve historical behavior, + * + * Returns 0 on success, -1 when the submodule is not present. */ -static void show_submodule_header(struct diff_options *o, const char *path, +static struct repository *open_submodule(const char *path) +{ + struct strbuf sb = STRBUF_INIT; + struct repository *out = xmalloc(sizeof(*out)); + + if (submodule_to_gitdir(, path) || repo_init(out, sb.buf, NULL)) { + strbuf_release(); + free(out); + return NULL; + } + + out->submodule_prefix = xstrfmt("%s%s/", + the_repository->submodule_prefix ? + the_repository->submodule_prefix : + "", path); + + strbuf_release(); + return out; +} + +/* + * Helper function to display the submodule header line prior to the full + * summary output. + * + * If it can locate the submodule git directory it will create a repository + * handle for the submodule and lookup both the left and right commits and + * put them into the left and right pointers. + */ +static void show_submodule_header(struct diff_options *o, + const char *path, struct object_id *one, struct object_id *two, unsigned dirty_submodule, + struct repository *sub, struct commit **left, struct commit **right, struct commit_list **merge_bases) { @@ -507,7 +541,7 @@ static void show_submodule_header(struct diff_options *o, const char *path, else if (is_null_oid(two))
Re: [PATCH 0/3] Make add_missing_tags() linear
On 11/1/2018 2:57 PM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee wrote: No rush. I'd just like to understand how removing the commit-graph file can make the new algorithm faster. Putting a similar count in the old algorithm would involve giving a count for every call to in_merge_bases_many(), which would be very noisy. $ time git push --dry-run --follow-tags /home/newren/repo-mirror count: 92912 To /home/newren/repo-mirror * [new branch] test5 -> test5 real0m3.024s user0m2.752s sys0m0.320s Is the above test with or without the commit-graph file? Can you run it in the other mode, too? I'd like to see if the "count" value changes when the only difference is the presence of a commit-graph file. Thanks, -Stolee
Git Test Coverage Report (Thursday, Nov 1) Part A
ath '%s'"), path); 5257b0625a 1686) return error(_("bad signature 0x%08x"), hdr->hdr_signature); 5257b0625a 1689) return error(_("bad index version %d"), hdr_version); 5257b0625a 1728) return error(_("index uses %.4s extension, which we do not understand"), 5257b0625a 1730) fprintf_ln(stderr, _("ignoring %.4s extension"), ext); 5257b0625a 1777) die(_("unknown index entry format 0x%08x"), extended_flags); 5257b0625a 1848) die(_("unordered stage entries in index")); 5257b0625a 1851) die(_("multiple stage entries for merged file '%s'"), 5257b0625a 1854) die(_("unordered stage entries for '%s'"), 5257b0625a 2148) die_errno(_("%s: index file open failed"), path); 5257b0625a 2152) die_errno(_("%s: cannot stat the open index"), path); 5257b0625a 2156) die(_("%s: index file smaller than expected"), path); 5257b0625a 2160) die_errno(_("%s: unable to map index file"), path); 5257b0625a 2252) warning(_("could not freshen shared index '%s'"), shared_index); 5257b0625a 2287) die(_("broken index, expect %s in %s, got %s"), 5257b0625a 3073) error(_("cannot fix permission bits on '%s'"), get_tempfile_path(*temp)); 5257b0625a 3219) return error(_("%s: cannot drop to stage #0"), ref-filter.c 35408df41e 2324) return error(_("option `%s' is incompatible with --no-merged"), remote.c a454d6a26b 362) warning(_("config remote shorthand cannot begin with '/': %s"), a454d6a26b 417) error(_("more than one uploadpack given, using the first")); a454d6a26b 683) die(_("key '%s' of pattern had no '*'"), key); a454d6a26b 693) die(_("value '%s' of pattern has no '*'"), value); a454d6a26b 1044) error(_("unable to delete '%s': remote ref does not exist"), a454d6a26b 1066) return error(_("dst ref %s receives from more than one src"), a454d6a26b 1753) die(_("couldn't find remote ref %s"), name); a454d6a26b 1766) error(_("* Ignoring funny ref '%s' locally"), a454d6a26b 1861) die(_("revision walk setup failed")); a454d6a26b 2134) return error(_("cannot parse expected object name '%s'"), revision.c a63d88e595 2932) return; a63d88e595 2935) return; a63d88e595 2941) c->object.flags |= UNINTERESTING; a63d88e595 2944) return; a63d88e595 2947) mark_parents_uninteresting(c); a63d88e595 2970) return; a63d88e595 2973) return; a63d88e595 2978) return; a63d88e595 3042) continue; f33f8de6af 3090) if (!revs->ignore_missing_links) f33f8de6af 3091) die("Failed to traverse parents of commit %s", a63d88e595 3092) oid_to_hex(>object.oid)); a63d88e595 3100) continue; run-command.c 31bfd155d8 1229) error(_("cannot create async thread: %s"), strerror(err)); sequencer.c bcd33ec25f 683) np = strchrnul(buf, '\n'); bcd33ec25f 684) return error(_("no key present in '%.*s'"), bcd33ec25f 695) return error(_("unable to dequote value of '%s'"), bcd33ec25f 737) goto finish; bcd33ec25f 742) name_i = error(_("'GIT_AUTHOR_NAME' already given")); bcd33ec25f 747) email_i = error(_("'GIT_AUTHOR_EMAIL' already given")); bcd33ec25f 752) date_i = error(_("'GIT_AUTHOR_DATE' already given")); bcd33ec25f 756) err = error(_("unknown variable '%s'"), bcd33ec25f 761) error(_("missing 'GIT_AUTHOR_NAME'")); bcd33ec25f 763) error(_("missing 'GIT_AUTHOR_EMAIL'")); bcd33ec25f 765) error(_("missing 'GIT_AUTHOR_DATE'")); sha1-file.c 2f90b9d9b4 sha1-file.c 172) int hash_algo_by_name(const char *name) 2f90b9d9b4 sha1-file.c 175) if (!name) 2f90b9d9b4 sha1-file.c 176) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 177) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 178) if (!strcmp(name, hash_algos[i].name)) 2f90b9d9b4 sha1-file.c 179) return i; 2f90b9d9b4 sha1-file.c 180) return GIT_HASH_UNKNOWN; 2f90b9d9b4 sha1-file.c 183) int hash_algo_by_id(uint32_t format_id) 2f90b9d9b4 sha1-file.c 186) for (i = 1; i < GIT_HASH_NALGOS; i++) 2f90b9d9b4 sha1-file.c 187) if (format_id == hash_algos[i].format_id) 2f90b9d9b4 sha1-file.c 188) return i; 2f90b9d9b4 sha1-file.c 189) return GIT_HASH_UNKNOWN; Commits introducing uncovered code: brian m. carlson 2f90b9d9b: sha1-file: provide functions to look up hash algorithms brian m. carlson b3a41547c: hex: introduce functions to print arbitrary hashes Daniels Umanovskis 0ecb1fc72: branch: introduce --show-current display option Derrick Stolee 1dcd9f204: midx: close multi-pack-index on repack Derrick Stolee a63d88e59: revision.c: generation-based topo-order algorithm Derrick Stolee f33f8de6a: revision.c: begin refactoring --topo-order logic Jeff King 98f425b45: cat-file: handle streaming failures consistently Joel Teichroeb 3d5ec65ce: stash: convert apply to builtin Joel T
Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
On 11/1/2018 11:48 AM, SZEDER Gábor wrote: On Thu, Nov 01, 2018 at 01:46:22PM +, Derrick Stolee wrote: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number) 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number) Nit: I've been pondering for a while what exactly does "order by maximizing ..." mean. Highest to lowest or lowest to highest? If I understand the rest of the descriptions (that I snipped) correctly, then it's the former, but I find that phrase in itself too ambiguous. It means that our priority-queue "get" operation selects the item in the queue that is largest by our comparison function (first generation number, thencommit-date for ties).This means we walk commits that have high generation number before those with lower generation number, guaranteeing that we walk all children of a commit before walking that commit. Thanks, -Stolee
master updated? (was Re: What's cooking in git.git (Nov 2018, #01; Thu, 1))
On 11/1/2018 5:59 AM, Junio C Hamano wrote: -- [Graduated to "master"] I see that several topics graduated, but it appears the master branch was not updated at https://github.com/gister/git. Was this intentional? I noticed because the test-coverage build [1] using the previous master as master@{1} reported no uncovered code into master (because no new commits exist on master). Thanks, -Stolee [1] https://git.visualstudio.com/git/_build/results?buildId=237=log
Re: [PATCH v4 0/7] Use generation numbers for --topo-order
On 11/1/2018 1:21 AM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). Review discussions seem to have petered out. Would we merge this to 'next' and start cooking, perhaps for the remainder of this cycle? Thanks, but I've just sent a v5 responding to Jakub's feedback on v4. [1] I'd be happy to let it sit in next until you feel it has cooked long enough. I'm available to respond to feedback in the form of new topics. Thanks, -Stolee [1] https://public-inbox.org/git/20181101134623.84055-1-dsto...@microsoft.com/T/#u
[PATCH v5 6/7] revision.c: generation-based topo-order algorithm
n 'git rev-list --topo-order A..B'. This can be updated in the future. In my local testing, I used the following Git commands on the Linux repository in three modes: HEAD~1 with no commit-graph, HEAD~1 with a commit-graph, and HEAD with a commit-graph. This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk. Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s This speedup is due to a few things. First, the new generation- number-enabled algorithm walks commits on order of the number of results output (subject to some branching structure expectations). Since we limit to 100 results, we are running a query similar to filling a single page of results. Second, when specifying a path, we must parse the root tree object for each commit we walk. The previous benefits from the commit-graph are entirely from reading the commit-graph instead of parsing commits. Since we need to parse trees for the same number of commits as before, we slow down significantly from the non-path-based query. For the test above, I specifically selected a path that is changed frequently, including by merge commits. A less-frequently-changed path (such as 'README') has similar end-to-end time since we need to walk the same number of commits (before determining we do not have 100 hits). However, get the benefit that the output is presented to the user as it is discovered, much the same as a normal 'git log' command (no '--topo-order'). This is an improved user experience, even if the command has the same runtime. Helped-by: Jeff King Signed-off-by: Derrick Stolee --- object.h | 4 +- revision.c | 195 +++-- revision.h | 2 + 3 files changed, 194 insertions(+), 7 deletions(-) diff --git a/object.h b/object.h index 0feb90ae61..796792cb32 100644 --- a/object.h +++ b/object.h @@ -59,7 +59,7 @@ struct object_array { /* * object flag allocation: - * revision.h: 0-10 2526 + * revision.h: 0-10 2528 * fetch-pack.c: 01 * negotiator/default.c: 2--5 * walker.c: 0-2 @@ -78,7 +78,7 @@ struct object_array { * builtin/show-branch.c:0---26 * builtin/unpack-objects.c: 2021 */ -#define FLAG_BITS 27 +#define FLAG_BITS 29 /* * The object type is stored in 3 bits. diff --git a/revision.c b/revision.c index 36458265a0..4ef47d2fb4 100644 --- a/revision.c +++ b/revision.c @@ -26,6 +26,7 @@ #include "argv-array.h" #include "commit-reach.h" #include "commit-graph.h" +#include "prio-queue.h" volatile show_early_output_fn_t show_early_output; @@ -2895,31 +2896,215 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } -struct topo_walk_info {}; +define_commit_slab(indegree_slab, int); +define_commit_slab(author_date_slab, timestamp_t); + +struct topo_walk_info { + uint32_t min_generation; + struct prio_queue explore_queue; + struct prio_queue indegree_queue; + struct prio_queue topo_queue; + struct indegree_slab indegree; + struct author_date_slab author_date; +}; + +static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag) +{ + if (c->object.flags & flag) + return; + + c->object.flags |= flag; + prio_queue_put(q, c); +} + +static void explore_walk_step(struct rev_info *revs) +{ + struct topo_walk_info *info = revs->topo_walk_info; + struct commit_list *p; + struct commit *c = prio_queue_get(>explore_queue); + + if (!c) + return; + + if (parse_commit_gently(c, 1) < 0) + return; + + if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE) + record_author_date(>author_date, c); + + if (revs->max_age != -1 && (c->date < revs->max_age)) + c->object.flags |= UNINTERESTING; + + if (process_parents(revs, c, NULL, NULL) < 0) + return; + + if (c->object.flags & UNINTERESTING) + mark_parents_uninteresting(c); + + for (p = c->parents; p; p = p->next) + test_flag_and_insert(>explore_queue, p->item, TOPO_WALK_EXPLORED); +} + +static void explore_to_depth(struct rev_info *revs, +uint32_t gen_cutoff) +{ + struct topo_walk_info *info = revs->topo_walk_info; + struct commit *c; + while ((c = prio_queue_peek(>explore
[PATCH v5 2/7] test-reach: add run_three_modes method
The 'test_three_modes' method assumes we are using the 'test-tool reach' command for our test. However, we may want to use the data shape of our commit graph and the three modes (no commit-graph, full commit-graph, partial commit-graph) for other git commands. Split test_three_modes to be a simple translation on a more general run_three_modes method that executes the given command and tests the actual output to the expected output. Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index d139a00d1d..9d65b8b946 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -53,18 +53,22 @@ test_expect_success 'setup' ' git config core.commitGraph true ' -test_three_modes () { +run_three_modes () { test_when_finished rm -rf .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual && cp commit-graph-full .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual && cp commit-graph-half .git/objects/info/commit-graph && - test-tool reach $1 actual && + "$@" actual && test_cmp expect actual } +test_three_modes () { + run_three_modes test-tool reach "$@" +} + test_expect_success 'ref_newer:miss' ' cat >input <<-\EOF && A:commit-5-7 -- 2.19.1.542.gc4df23f792
[PATCH v5 3/7] test-reach: add rev-list tests
The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations: * Basic: git rev-list --topo-order HEAD * Range: git rev-list --topo-order compare..HEAD * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD * Symmetric Difference: git rev-list --topo-order compare...HEAD Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 84 +++ 1 file changed, 84 insertions(+) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index 9d65b8b946..288f703b7b 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'rev-list: basic topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \ + commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-6-6 +' + +test_expect_success 'rev-list: first-parent topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: first-parent range topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6 +' + +test_expect_success 'rev-list: ancestry-path topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + >expect && + run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6 +' + +test_expect_success 'rev-list: symmetric difference topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + commit-3-8 commit-2-8 commit-1-8 \ + commit-3-7 commit-2-7 commit-1-7 \ + >expect && + run_three_modes git rev-list --topo-order commit-3-8...commit-6-6 +' + test_done -- 2.19.1.542.gc4df23f792
[PATCH v5 5/7] commit/revisions: bookkeeping before refactoring
There are a few things that need to move around a little before making a big refactoring in the topo-order logic: 1. We need access to record_author_date() and compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. 2. Moving these methods to commit.h requires adding an author_date_slab declaration to commit.h. Consumers will need their own implementation. 3. The add_parents_to_list() method in revision.c performs logic around the UNINTERESTING flag and other special cases depending on the struct rev_info. Allow this method to ignore a NULL 'list' parameter, as we will not be populating the list for our walk. Also rename the method to the slightly more generic name process_parents() to make clear that this method does more than add to a list (and no list is required anymore). Helped-by: Jeff King Signed-off-by: Derrick Stolee --- commit.c | 9 - commit.h | 7 +++ revision.c | 18 ++ 3 files changed, 21 insertions(+), 13 deletions(-) diff --git a/commit.c b/commit.c index d0f199e122..a025a0db60 100644 --- a/commit.c +++ b/commit.c @@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack) /* count number of children that have not been emitted */ define_commit_slab(indegree_slab, int); -/* record author-date for each commit object */ define_commit_slab(author_date_slab, timestamp_t); -static void record_author_date(struct author_date_slab *author_date, - struct commit *commit) +void record_author_date(struct author_date_slab *author_date, + struct commit *commit) { const char *buffer = get_commit_buffer(commit, NULL); struct ident_split ident; @@ -684,8 +683,8 @@ static void record_author_date(struct author_date_slab *author_date, unuse_commit_buffer(commit, buffer); } -static int compare_commits_by_author_date(const void *a_, const void *b_, - void *cb_data) +int compare_commits_by_author_date(const void *a_, const void *b_, + void *cb_data) { const struct commit *a = a_, *b = b_; struct author_date_slab *author_date = cb_data; diff --git a/commit.h b/commit.h index 2b1a734388..ec5b9093ad 100644 --- a/commit.h +++ b/commit.h @@ -8,6 +8,7 @@ #include "gpg-interface.h" #include "string-list.h" #include "pretty.h" +#include "commit-slab.h" #define COMMIT_NOT_FROM_GRAPH 0x #define GENERATION_NUMBER_INFINITY 0x @@ -328,6 +329,12 @@ extern int remove_signature(struct strbuf *buf); */ extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); +/* record author-date for each commit object */ +struct author_date_slab; +void record_author_date(struct author_date_slab *author_date, + struct commit *commit); + +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); diff --git a/revision.c b/revision.c index 2dcde8a8ac..36458265a0 100644 --- a/revision.c +++ b/revision.c @@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit *p, struct commit_li *cache = new_entry; } -static int add_parents_to_list(struct rev_info *revs, struct commit *commit, - struct commit_list **list, struct commit_list **cache_ptr) +static int process_parents(struct rev_info *revs, struct commit *commit, + struct commit_list **list, struct commit_list **cache_ptr) { struct commit_list *parent = commit->parents; unsigned left_flag; @@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, if (p->object.flags & SEEN) continue; p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } return 0; } @@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, p->object.flags |= left_flag; if (!(p->object.flags & SEEN)) { p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } if (revs->first_parent_
[PATCH v5 4/7] revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order setting in struct rev_info implies the limited setting. This means that the following things happen during prepare_revision_walk(): * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. If we have a commit-graph file with generation numbers computed, then there is a better way. This patch introduces some necessary logic redirection when we are in this situation. In v2.18.0, the commit-graph file contains zero-valued bytes in the positions where the generation number is stored in v2.19.0 and later. Thus, we use generation_numbers_enabled() to check if the commit-graph is available and has non-zero generation numbers. When setting revs->limited only because revs->topo_order is true, only do so if generation numbers are not available. There is no reason to use the new logic as it will behave similarly when all generation numbers are INFINITY or ZERO. In prepare_revision_walk(), if we have revs->topo_order but not revs->limited, then we trigger the new logic. It breaks the logic into three pieces, to fit with the existing framework: 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info struct. We use the presence of this struct as a signal to use the new methods during our walk. In this patch, this method simply calls limit_list() and sort_in_topological_order(). In the future, this method will set up a new data structure to perform that logic in-line. 2. next_topo_commit() provides get_revision_1() with the next topo- ordered commit in the list. Currently, this simply pops the commit from revs->commits. 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. While this commit presents method redirection for performing the exact same logic as before, it allows the next commit to focus only on the new logic. Signed-off-by: Derrick Stolee --- revision.c | 42 ++ revision.h | 4 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index e18bd530e4..2dcde8a8ac 100644 --- a/revision.c +++ b/revision.c @@ -25,6 +25,7 @@ #include "worktree.h" #include "argv-array.h" #include "commit-reach.h" +#include "commit-graph.h" volatile show_early_output_fn_t show_early_output; @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; if (revs->prune_data.nr) { @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); + + limit_list(revs); + sort_in_topological_order(>commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(>commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, >commits, NULL) < 0) { + if (!revs->ignore_missing_links) + die("Failed to traverse parents of commit %s", + oid_to_hex(>object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(>commits); if (revs->no_walk) return 0; - if (revs->limited) + if (revs->limited) { if (limit_list(revs) < 0) return -1; - if (revs->topo_order) - sort_in_topological_order(>commits, revs->sort_order); + if (revs->topo_order) + sort_in_topologic
[PATCH v5 7/7] t6012: make rev-list tests more interesting
As we are working to rewrite some of the revision-walk machinery, there could easily be some interesting interactions between the options that force topological constraints (--topo-order, --date-order, and --author-date-order) along with specifying a path. Add extra tests to t6012-rev-list-simplify.sh to add coverage of these interactions. To ensure interesting things occur, alter the repo data shape to have different orders depending on topo-, date-, or author-date-order. When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering the new logic for topo-order walks using generation numbers. The extra tests can be added indepently. Signed-off-by: Derrick Stolee --- t/t6012-rev-list-simplify.sh | 45 1 file changed, 36 insertions(+), 9 deletions(-) diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh index b5a1190ffe..a10f0df02b 100755 --- a/t/t6012-rev-list-simplify.sh +++ b/t/t6012-rev-list-simplify.sh @@ -12,6 +12,22 @@ unnote () { git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 |g" } +# +# Create a test repo with interesting commit graph: +# +# A--B--G--H--I--K--L +# \ \ / / +# \ \ / / +#C--E---F J +#\_/ +# +# The commits are laid out from left-to-right starting with +# the root commit A and terminating at the tip commit L. +# +# There are a few places where we adjust the commit date or +# author date to make the --topo-order, --date-order, and +# --author-date-order flags produce different output. + test_expect_success setup ' echo "Hi there" >file && echo "initial" >lost && @@ -21,10 +37,18 @@ test_expect_success setup ' git branch other-branch && + git symbolic-ref HEAD refs/heads/unrelated && + git rm -f "*" && + echo "Unrelated branch" >side && + git add side && + test_tick && git commit -m "Side root" && + note J && + git checkout master && + echo "Hello" >file && echo "second" >lost && git add file lost && - test_tick && git commit -m "Modified file and lost" && + test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m "Modified file and lost" && note B && git checkout other-branch && @@ -63,13 +87,6 @@ test_expect_success setup ' test_tick && git commit -a -m "Final change" && note I && - git symbolic-ref HEAD refs/heads/unrelated && - git rm -f "*" && - echo "Unrelated branch" >side && - git add side && - test_tick && git commit -m "Side root" && - note J && - git checkout master && test_tick && git merge --allow-unrelated-histories -m "Coolest" unrelated && note K && @@ -103,14 +120,24 @@ check_result () { check_outcome success "$@" } -check_result 'L K J I H G F E D C B A' --full-history +check_result 'L K J I H F E D C G B A' --full-history --topo-order +check_result 'L K I H G F E D C B J A' --full-history +check_result 'L K I H G F E D C B J A' --full-history --date-order +check_result 'L K I H G F E D B C J A' --full-history --author-date-order check_result 'K I H E C B A' --full-history -- file check_result 'K I H E C B A' --full-history --topo-order -- file check_result 'K I H E C B A' --full-history --date-order -- file +check_result 'K I H E B C A' --full-history --author-date-order -- file check_result 'I E C B A' --simplify-merges -- file +check_result 'I E C B A' --simplify-merges --topo-order -- file +check_result 'I E C B A' --simplify-merges --date-order -- file +check_result 'I E B C A' --simplify-merges --author-date-order -- file check_result 'I B A' -- file check_result 'I B A' --topo-order -- file +check_result 'I B A' --date-order -- file +check_result 'I B A' --author-date-order -- file check_result 'H' --first-parent -- another-file +check_result 'H' --first-parent --topo-order -- another-file check_result 'E C B A' --full-history E -- lost test_expect_success 'full history simplification without parent' ' -- 2.19.1.542.gc4df23f792
[PATCH v5 0/7] Use generation numbers for --topo-order
This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). On the Linux repository, I got the following performance results when comparing to the previous version with or without a commit-graph: Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s If you want to read this series but are unfamiliar with the commit-graph and generation numbers, then I recommend reading `Documentation/technical/commit-graph.txt` or a blog post [1] I wrote on the subject. In particular, the three-part walk described in "revision.c: refactor basic topo-order logic" is present (but underexplained) as an animated PNG [2]. **UPDATED** Now that we have had some review and some dogfooding, I'm removing the paragraph I had here about "RFC quality". I think this is ready to merge! One notable case that is not included in this series is the case of a history comparison such as 'git rev-list --topo-order A..B'. The existing code in limit_list() has ways to cut the walk short when all pending commits are UNINTERESTING. Since this code depends on commit_list instead of the prio_queue we are using here, I chose to leave it untouched for now. We can revisit it in a separate series later. Since handle_commit() turns on revs->limited when a commit is UNINTERESTING, we do not hit the new code in this case. Removing this 'revs->limited = 1;' line yields correct results, but the performance can be worse. **UPDATED** See the discussion about Generation Number V2 [4] for more on this topic. Changes in V5: Thanks Jakub for feedback on the huge commit! I think I've responded to all the code feedback. See the range-diff at the end of this cover-page. Thanks, -Stolee [1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/ Supercharging the Git Commit Graph III: Generations and Graph Algorithms [2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png Animation showing three-part walk [3] https://github.com/derrickstolee/git/tree/topo-order/test A branch containing this series along with commits to compute commit-graph in entire test suite. [4] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/ [RFC] Generation Number v2 Note: I'm not submitting this version via GitGitGadget because it's currently struggling with how to handle a PR in a conflict state. The new flags in revision.h have a conflict with recent changes in master. Derrick Stolee (7): prio-queue: add 'peek' operation test-reach: add run_three_modes method test-reach: add rev-list tests revision.c: begin refactoring --topo-order logic commit/revisions: bookkeeping before refactoring revision.c: generation-based topo-order algorithm t6012: make rev-list tests more interesting commit.c | 9 +- commit.h | 7 + object.h | 4 +- prio-queue.c | 9 ++ prio-queue.h | 6 + revision.c | 243 +-- revision.h | 6 + t/helper/test-prio-queue.c | 26 ++-- t/t0009-prio-queue.sh| 14 ++ t/t6012-rev-list-simplify.sh | 45 +-- t/t6600-test-reach.sh| 96 +- 11 files changed, 426 insertions(+), 39 deletions(-) base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303 -- 2.19.1.542.gc4df23f792 -->8-- 1: 2358cfd5ed = 1: 7c75a56505 prio-queue: add 'peek' operation 2: 3a4b68e479 = 2: 686c4370de test-reach: add run_three_modes method 3: 12a3f6d367 = 3: 7410c00982 test-reach: add rev-list tests 4: cd9eef9688 = 4: 5439e11e37 revision.c: begin refactoring --topo-order logic 5: f3e291665d ! 5: 71554deb9b commit/revisions: bookkeeping before refactoring @@ -9,8 +9,8 @@ compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. -2. Moving these methods to commit.h requires adding the author_slab - definition to commit.h. +2. Moving these methods to commit.h requires adding an author_date_slab + declaration to commit.h. Consumers will need their own implementation. 3. The add_parents_
[PATCH v5 1/7] prio-queue: add 'peek' operation
When consuming a priority queue, it can be convenient to inspect the next object that will be dequeued without actually dequeueing it. Our existing library did not have such a 'peek' operation, so add it as prio_queue_peek(). Add a reference-level comparison in t/helper/test-prio-queue.c so this method is exercised by t0009-prio-queue.sh. Further, add a test that checks the behavior when the compare function is NULL (i.e. the queue becomes a stack). Signed-off-by: Derrick Stolee --- prio-queue.c | 9 + prio-queue.h | 6 ++ t/helper/test-prio-queue.c | 26 ++ t/t0009-prio-queue.sh | 14 ++ 4 files changed, 47 insertions(+), 8 deletions(-) diff --git a/prio-queue.c b/prio-queue.c index a078451872..d3f488cb05 100644 --- a/prio-queue.c +++ b/prio-queue.c @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue) } return result; } + +void *prio_queue_peek(struct prio_queue *queue) +{ + if (!queue->nr) + return NULL; + if (!queue->compare) + return queue->array[queue->nr - 1].data; + return queue->array[0].data; +} diff --git a/prio-queue.h b/prio-queue.h index d030ec9dd6..682e51867a 100644 --- a/prio-queue.h +++ b/prio-queue.h @@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing); */ extern void *prio_queue_get(struct prio_queue *); +/* + * Gain access to the "thing" that would be returned by + * prio_queue_get, but do not remove it from the queue. + */ +extern void *prio_queue_peek(struct prio_queue *); + extern void clear_prio_queue(struct prio_queue *); /* Reverse the LIFO elements */ diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c index 9807b649b1..5bc9c46ea5 100644 --- a/t/helper/test-prio-queue.c +++ b/t/helper/test-prio-queue.c @@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv) struct prio_queue pq = { intcmp }; while (*++argv) { - if (!strcmp(*argv, "get")) - show(prio_queue_get()); - else if (!strcmp(*argv, "dump")) { - int *v; - while ((v = prio_queue_get())) - show(v); - } - else { + if (!strcmp(*argv, "get")) { + void *peek = prio_queue_peek(); + void *get = prio_queue_get(); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } else if (!strcmp(*argv, "dump")) { + void *peek; + void *get; + while ((peek = prio_queue_peek())) { + get = prio_queue_get(); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } + } else if (!strcmp(*argv, "stack")) { + pq.compare = NULL; + } else { int *v = malloc(sizeof(*v)); *v = atoi(*argv); prio_queue_put(, v); diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh index e56dfce668..3941ad2528 100755 --- a/t/t0009-prio-queue.sh +++ b/t/t0009-prio-queue.sh @@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' ' test_cmp expect actual ' +cat >expect <<'EOF' +3 +2 +6 +4 +5 +1 +8 +EOF +test_expect_success 'stack order' ' + test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual && + test_cmp expect actual +' + test_done base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303 -- 2.19.1.542.gc4df23f792
Re: [RFC] Generation Number v2
On 11/1/2018 8:27 AM, Jakub Narebski wrote: [I have noticed that in some places I wrote A..B instead of B..A. Sorry about that] Derrick Stolee writes: Please also let me know about any additional tests that I could run. Now that I've got a lot of test scripts built up, I can re-run the test suite pretty quickly. I would add straighforward 1-to-N and N-to-1 reachability tests in the form of `git branch / tag --contains` and `git branch / tag --merged`, and perhaps also ahead-behind calculations used by `git status`, and N-to-M reachability tests used by tag following code in push and fetch. Possibly also A...B walk, if it is not implemented via calculating merge-base. I believe this uses paint_down_to_common(), but looks at the PARENT1 and PARENT2 flags afterward to determine the left/right/both relationships. Maybe even --ancestry-path walk, though I am not sure how important best performance for rhis case is (we would want good performance, but perhaps best is not needed for rarely used command). Currently, the implementation of --ancestry-path does not use a reachability index. See explanation below. Generation Number Performance Test == Git uses the commit history to answer many basic questions, such as computing a merge-base. Some of these algorithms can benefit from a _reachability index_, which is a condition that allows us to answer "commit A cannot reach commit B" in some cases. These require pre- computing some values that we can load and use at run-time. Git already has a notion of _generation number_, stores it in the commit- graph file, and uses it in several reachability algorithms. Note that there are other kinds of reachability indices. First, there are reachability indices that can answer the full reachability query (if commit A can reach commit B, or if commit A cannot reach commit B) directly, without walking the commit graph at all: so called label-only approach. For example one could store for each commit the compressed list of all commits reachable from it (transitive closure compression). Those, I think (but I have not checked), would be of not much use for Git, as the size of the index grows stronger than linear with the number of commits, as grows the time to compute such index. So probably of no use to Git, at least not directly (Git uses so called "bitmap index", see e.g. [1], which stores reachability bit-vector as compressed bitmap... but not for all commits, only for a small subset). Second, beside negative-cut reachability indexes, that can answer with certainity that "commit A cannot reach commit B", like the generation numbers (also known as level, or topological level), there also positive-cut indexes (usually if not always based on the spanning tree or trees for the DAG), that can answer when "commit A can reach commit B". I think that those can lead to significant speedups for at least some types of commit traversal and reachability queries that Git needs to answer. But currently no algorithms has a provision for using positive-cut filter index. Anyway, such index would be auxiliary thing, see the next point. Third, we require more than having reachability index in the sense of having some kind of _labelling_, often composite, that can answer either "commit A cannot reach commit B" or "commit A can reach commit B", depending on the type. Git for most operations needs more, namely an actual _ordering_, that is a weak order (or to be more exact a total preorder, i.e. there can be two different commits with the same "generation number" or index, but always either idx(A) ≲ idx(B) or idx(B) ≲ idx(A)) and not only partial ordering (where there can be incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). This is because it needs to be used in priority queue to decide which commit to travel next; more on that later. This is also why there can be only one such "main" reachability _index_. [1]: https://githubengineering.com/counting-objects/ Thanks for the details. At the moment, I'm only interested in improving our negative-cut reachability index, as we have algorithms that can take advantage of them (and can compare their performance directly). You can read more about generation numbers and how to use them in algorithms in [this blog post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/). However, [generation numbers do not always improve our algorithms](https://public-inbox.org/git/pull.28.git.gitgitgad...@gmail.com/T/#u). Specifically, some algorithms in Git already use commit date as a heuristic reachability index. This has some problems, though, since commit date can be incorrect for several reasons (clock skew between machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.). That's what we use the "slop" mechan
Re: [PATCH 0/3] Make add_missing_tags() linear
On 11/1/2018 2:52 AM, Elijah Newren wrote: On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee wrote: On 10/31/2018 2:04 AM, Elijah Newren wrote: On the original repo where the topic was brought up, with commit-graph NOT turned on and using origin/master, I see: $ time git push --dry-run --follow-tags /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 1m20.081s user 1m19.688s sys 0m0.292s Merging this series in, I now get: $ time git push --dry-run --follow-tags /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 0m2.857s user 0m2.580s sys 0m0.328s which provides a very nice speedup. Oddly enough, if I _also_ do the following: $ git config core.commitgraph true $ git config gc.writecommitgraph true $ git gc then my timing actually slows down just slightly: $ time git push --follow-tags --dry-run /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 0m3.027s user 0m2.696s sys 0m0.400s So you say that the commit-graph is off in the 2.8s case, but not here in the 3.1s case? I would expect _at minimum_ that the cost of parsing commits would have a speedup in the commit-graph case. There may be something else going on here, since you are timing a `push` event that is doing more than the current walk. (run-to-run variation seems pretty consistent, < .1s variation, so this difference is just enough to notice.) I wouldn't be that surprised if that means there's some really old tags with very small generation numbers, meaning it's not gaining anything in this special case from the commit-graph, but it does pay the cost of loading the commit-graph. While you have this test environment, do you mind applying the diff below and re-running the tests? It will output a count for how many commits are walked by the algorithm. This should help us determine if this is another case where generation numbers are worse than commit-date, or if there is something else going on. Thanks! I can do that, but wouldn't you want a similar patch for the old get_merge_bases_many() in order to compare? Does an absolute number help by itself? It's going to have to be tomorrow, though; not enough time tonight. No rush. I'd just like to understand how removing the commit-graph file can make the new algorithm faster. Putting a similar count in the old algorithm would involve giving a count for every call to in_merge_bases_many(), which would be very noisy. Thanks! -Stolee
Re: [PATCH v3 8/8] merge-recursive: improve rename/rename(1to2)/add[/add] handling
On 10/19/2018 3:31 PM, Elijah Newren wrote: [snip] + char *new_path = NULL; + if (dir_in_way(b->path, !o->call_depth, 0)) { + new_path = unique_path(o, b->path, ci->branch2); + output(o, 1, _("%s is a directory in %s adding " + "as %s instead"), + b->path, ci->branch1, new_path); I tried really hard, but failed to get a test to cover the block below. I was able to find that the "check handling of differently renamed file with D/F conflicts" test in t6022-merge-rename.sh covers the block above. Trying to tweak the example using untracked files seems to hit an error message from unpack-trees.c instead. + } else if (would_lose_untracked(b->path)) { + new_path = unique_path(o, b->path, ci->branch2); + output(o, 1, _("Refusing to lose untracked file" + " at %s; adding as %s instead"), + b->path, new_path); It could also be that I failed because I'm less familiar with this part of the codebase. Elijah, do you think it is possible to hit this block? Thanks, -Stolee
Re: [PATCH v3 2/8] t6036, t6042: testcases for rename collision of already conflicting files
On 10/19/2018 3:31 PM, Elijah Newren wrote: +test_expect_success "setup nested conflicts" ' nit: should these test names be single-quoted? I see you using double-quotes in PATCH 1/8 as well, but that seems to be because there are variables in the test names. ... +test_expect_failure "check nested conflicts" ' Same here. +test_expect_success "setup nested conflicts from rename/rename(2to1)" ' +test_expect_failure "check nested conflicts from rename/rename(2to1)" ' Thanks, -Stolee
Re: [PATCH v3 4/8] merge-recursive: new function for better colliding conflict resolutions
On 10/31/2018 9:53 AM, Derrick Stolee wrote: On 10/19/2018 3:31 PM, Elijah Newren wrote: +#if 0 // #if-0-ing avoids unused function warning; will make live in next commit +static int handle_file_collision(struct merge_options *o, + const char *collide_path, + const char *prev_path1, + const char *prev_path2, + const char *branch1, const char *branch2, + const struct object_id *a_oid, + unsigned int a_mode, + const struct object_id *b_oid, + unsigned int b_mode) +{ + struct merge_file_info mfi; + struct diff_filespec null, a, b; + char *alt_path = NULL; + const char *update_path = collide_path; + + /* + * In the recursive case, we just opt to undo renames + */ + if (o->call_depth && (prev_path1 || prev_path2)) { + /* Put first file (a_oid, a_mode) in its original spot */ + if (prev_path1) { + if (update_file(o, 1, a_oid, a_mode, prev_path1)) + return -1; + } else { + if (update_file(o, 1, a_oid, a_mode, collide_path)) The latest test coverage report [1] shows this if statement is never run, so it appears that every call to this method in the test suite has either o->call_depth positive, prev_path1 non-NULL, or both prev_path1 and prev_path2 NULL. Is there a way we can add a test case that calls this method with o->call_depth positive, prev_path1 NULL, and prev_path2 non-NULL? + return -1; + } + + /* Put second file (b_oid, b_mode) in its original spot */ + if (prev_path2) { + if (update_file(o, 1, b_oid, b_mode, prev_path2)) Since this line is covered, we _do_ call the method with prev_path2 non-NULL, but prev_path1 must be non-NULL in all cases. I may have found a reason why this doesn't happen in one of the callers you introduced. I'm going to comment on PATCH 8/8 to see if that is the case. Nevermind on the PATCH 8/8 situation. I thought I saw you pass (a->path, NULL) and (b->path, NULL) into the (prev_path1, prev_path2) pairs, but in each case the non-NULL parameter is actually for 'collide_path'. It is still interesting if we can hit this case. Perhaps we need a different kind of conflict, like (rename, delete) [but I struggle to make sense of how to do that]. Thanks, -Stolee
Re: [PATCH v3 4/8] merge-recursive: new function for better colliding conflict resolutions
On 10/19/2018 3:31 PM, Elijah Newren wrote: +#if 0 // #if-0-ing avoids unused function warning; will make live in next commit +static int handle_file_collision(struct merge_options *o, +const char *collide_path, +const char *prev_path1, +const char *prev_path2, +const char *branch1, const char *branch2, +const struct object_id *a_oid, +unsigned int a_mode, +const struct object_id *b_oid, +unsigned int b_mode) +{ + struct merge_file_info mfi; + struct diff_filespec null, a, b; + char *alt_path = NULL; + const char *update_path = collide_path; + + /* +* In the recursive case, we just opt to undo renames +*/ + if (o->call_depth && (prev_path1 || prev_path2)) { + /* Put first file (a_oid, a_mode) in its original spot */ + if (prev_path1) { + if (update_file(o, 1, a_oid, a_mode, prev_path1)) + return -1; + } else { + if (update_file(o, 1, a_oid, a_mode, collide_path)) The latest test coverage report [1] shows this if statement is never run, so it appears that every call to this method in the test suite has either o->call_depth positive, prev_path1 non-NULL, or both prev_path1 and prev_path2 NULL. Is there a way we can add a test case that calls this method with o->call_depth positive, prev_path1 NULL, and prev_path2 non-NULL? + return -1; + } + + /* Put second file (b_oid, b_mode) in its original spot */ + if (prev_path2) { + if (update_file(o, 1, b_oid, b_mode, prev_path2)) Since this line is covered, we _do_ call the method with prev_path2 non-NULL, but prev_path1 must be non-NULL in all cases. I may have found a reason why this doesn't happen in one of the callers you introduced. I'm going to comment on PATCH 8/8 to see if that is the case. Thanks, -Stolee [1] https://public-inbox.org/git/62f0bcf6-aa73-c192-d804-e6d69cac1...@gmail.com/
Re: [PATCH 18/19] submodule: use submodule repos for object lookup
On 10/16/2018 7:35 PM, Stefan Beller wrote: @@ -482,14 +483,46 @@ void prepare_submodule_repo_env(struct argv_array *out) DEFAULT_GIT_DIR_ENVIRONMENT); } -/* Helper function to display the submodule header line prior to the full - * summary output. If it can locate the submodule objects directory it will - * attempt to lookup both the left and right commits and put them into the - * left and right pointers. +/* + * Initialize 'out' based on the provided submodule path. + * + * Unlike repo_submodule_init, this tolerates submodules not present + * in .gitmodules. This function exists only to preserve historical behavior, + * + * Returns 0 on success, -1 when the submodule is not present. + */ +static int open_submodule(struct repository *out, const char *path) +{ + struct strbuf sb = STRBUF_INIT; + + if (submodule_to_gitdir(, path) || repo_init(out, sb.buf, NULL)) { + strbuf_release(); + return -1; + } + + out->submodule_prefix = xstrdup(path); + out->submodule_prefix = xstrfmt("%s%s/", + the_repository->submodule_prefix ? + the_repository->submodule_prefix : + "", path); + + strbuf_release(); + return 0; +} Based on the recent test coverage report [1], this xstrfmt() call is never run witha non-null the_repository->submodule_prefix. Is there a way we can exercise that branch? Thanks, -Stolee [1] https://public-inbox.org/git/62f0bcf6-aa73-c192-d804-e6d69cac1...@gmail.com/
Re: [RFC] Generation Number v2
On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Oct 30 2018, Junio C Hamano wrote: Derrick Stolee writes: In contrast, maximum generation numbers and corrected commit dates both performed quite well. They are frequently the top two performing indexes, and rarely significantly different. The trade-off here now seems to be: which _property_ is more important, locally-computable or backwards-compatible? Nice summary. As I already said, I personally do not think being compatible with currently deployed clients is important at all (primarily because I still consider the whole thing experimental), and there is a clear way forward once we correct the mistake of not having a version number in the file format that tells the updated clients to ignore the generation numbers. For longer term viability, we should pick something that is immutable, reproducible, computable with minimum input---all of which would lead to being incrementally computable, I would think. I think it depends on what we mean by backwards compatibility. None of our docs are saying this is experimental right now, just that it's opt-in like so many other git-config(1) options. So if we mean breaking backwards compatibility in that we'll write a new file or clobber the existing one with a version older clients can't use as an optimization, fine. But it would be bad to produce a hard error on older clients, but avoiding that seems as easy as just creating a "commit-graph2" file in .git/objects/info/. Well, we have a 1-byte version number following the "CGPH" header in the commit-graph file, and clients will ignore the commit-graph file if that number is not "1". My hope for backwards-compatibility was to avoid incrementing this value and instead use the unused 8th byte. However, it appears that we are destined to increment that version number, anyway. Here is my list for what needs to be in the next version of the commit-graph file format: 1. A four-byte hash version. 2. File incrementality (split commit-graph). 3. Reachability Index versioning Most of these changes will happen in the file header. The chunks themselves don't need to change, but some chunks may be added that only make sense in v2 commit-graphs. Thanks, -Stolee
Re: [RFC] Generation Number v2
On 10/29/2018 11:59 PM, Junio C Hamano wrote: Derrick Stolee writes: **V3: Corrected Commit Date.** For a commit C, let its _corrected commit date_ (denoted by cdate(C)) be the maximum of the commit date of C and the commit dates of its parents. "maximum of the commit date of C and the corrected commit dates of its parents"? That's what I mean. Thanks. We've talked about exactly this one in the past (long before any of Microsoft folks other than Dscho came to the scene) but without an implementation, or detailed design and analysis. I am very happy to see the idea among other possibilities to be considered again. This time around, we may finally come up with something better than the "commit dates with SLOP" thing ;-). Essentially, the felineY order is selected with the goal of swapping positions of topologically-independent commits relative to the felinX ordering. The resulting reachability index is as follows: If felineX(A) < felineY(B), then A cannot reach B. If felineY(A) < felineY(B), then A cannot reach B. I presume that the first line is a typo and you compare the same X index? Yes, sorry for the typos. I fixed them in the report on GitHub. * **Compatible?** In our test implementation, we use a previously unused byte of data in the commit-graph format to indicate which reachability index version we are using. Existing clients ignore this value, so we will want to consider if these new indexes are _backwards compatible_. That is, will they still report correct values if they ignore this byte and use the generation number column from the commit-graph file assuming the values are minimum generation numbers? I personally consider that the current commit-graph with generation numbers experimental, so I am not sure how much we care about this. Having said that. By the above definition, any new index that is wider than the current generation number cannot be compatible (can we even tell the existing clients how wide each elements in the ix array is?) In any case, perhaps the first thing to do is to update the clients so that they stop ignoring the version number field, and instead work without generation number when there is no version of reach.ix available in the file? That way, a better reachablility index can be chosen freely without having to worry about the compatibility. I can work on that. It should be as simple as setting commit->generation to GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph has a different version. * **Immutable?** Git objects are _immutable_. If you change an object you actually create a new object with a new object ID. Are the values we store for these reachability indexes also immutable? Even if we do not embed the reachability ix in commit objects, having an immutable value is probably a must if we want to make them incrementally computable, so this is a very good property to have. Unless there is a clever idea to incrementally compute a mutable reach.ix, my gut instinct says that this property is a must. Another thing, perhaps related to "local" below, is if exactly the same reach.ix is computed by anybody, given an identical commit history graph (perhaps "reproducibility"?). I think most of the candidates you listed are reproducible without a fixed tie-breaker, but I am not sure about felineY() thing. * **Local?** Are these values **locally computable**? That is, do we only need to look at the parents of a commit (assuming those parents have computed values) in order to determine the value at that commit? A subset of non-local reachability ix, for example, the ones that need to know what each commit's children are, cannot be immutable, as adding new objects to the graph (either with locally committing, or transferring objects from other repositories) would affect the ix; is this true for all non-local reachability ix, I wonder? As a thought experiment, we could define a function size(C) to be the numberof commits reachable from C. This is not locally-computable from the size values at C's parents due to the inclusion-exclusion principle. We would need to compute it by walking the reachable set and counting (resulting in quadratic performance overall) but is immutable. Since the performance cost is so expensive (unlike the linear costs in the other non-local versions) I didn't include it in my comparison. We focused on three types of performance tests that test the indexes in different ways. Each test lists the `git` command that is used, and the table lists which repository is used and which inputs. ### Test 1: `git log --topo-order -N` This test focuses on the number of commits that are parsed during a `git log --topo-order` before writing `N` commits to output. A devil's advocate comment. Your patches seem to be very focused on this "unlimited" case for the past few weeks, but I am not so sure if that is a cas
Re: [PATCH 0/3] Make add_missing_tags() linear
On 10/31/2018 2:04 AM, Elijah Newren wrote: > On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget > wrote: >> >> As reported earlier [1], the add_missing_tags() method in remote.c has >> quadratic performance. Some of that performance is curbed due to the >> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't >> help users without a commit-graph, and it can still be painful if that >> cutoff is sufficiently low compared to the tags we are using for >> reachability testing. >> >> Add a new method in commit-reach.c called get_reachable_subset() which does >> a many-to-many reachability test. Starting at the 'from' commits, walk until >> the generation is below the smallest generation in the 'to' commits, or all >> 'to' commits have been discovered. This performs only one commit walk for >> the entire add_missing_tags() method, giving linear performance in the worst >> case. >> >> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() >> works independently of its application in add_missing_tags(). > > On the original repo where the topic was brought up, with commit-graph > NOT turned on and using origin/master, I see: > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > To /home/newren/repo-mirror > * [new branch] test5 -> test5 > > real 1m20.081s > user 1m19.688s > sys 0m0.292s > > Merging this series in, I now get: > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > To /home/newren/repo-mirror > * [new branch] test5 -> test5 > > real 0m2.857s > user 0m2.580s > sys 0m0.328s > > which provides a very nice speedup. > > Oddly enough, if I _also_ do the following: > $ git config core.commitgraph true > $ git config gc.writecommitgraph true > $ git gc > > then my timing actually slows down just slightly: > $ time git push --follow-tags --dry-run /home/newren/repo-mirror > To /home/newren/repo-mirror > * [new branch] test5 -> test5 > > real 0m3.027s > user 0m2.696s > sys 0m0.400s So you say that the commit-graph is off in the 2.8s case, but not here in the 3.1s case? I would expect _at minimum_ that the cost of parsing commits would have a speedup in the commit-graph case. There may be something else going on here, since you are timing a `push` event that is doing more than the current walk. > (run-to-run variation seems pretty consistent, < .1s variation, so > this difference is just enough to notice.) I wouldn't be that > surprised if that means there's some really old tags with very small > generation numbers, meaning it's not gaining anything in this special > case from the commit-graph, but it does pay the cost of loading the > commit-graph. While you have this test environment, do you mind applying the diff below and re-running the tests? It will output a count for how many commits are walked by the algorithm. This should help us determine if this is another case where generation numbers are worse than commit-date, or if there is something else going on. Thanks! -->8-- >From 2115e7dcafb2770455b7b4793f90edc2254bad97 Mon Sep 17 00:00:00 2001 From: Derrick Stolee Date: Wed, 31 Oct 2018 11:40:50 + Subject: [PATCH] DO-NOT-MERGE: count commits in get_reachable_subset Signed-off-by: Derrick Stolee --- commit-reach.c | 5 + 1 file changed, 5 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index a98532ecc8..b512461cf7 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -700,6 +700,7 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from, struct commit **from_last = from + nr_from; uint32_t min_generation = GENERATION_NUMBER_INFINITY; int num_to_find = 0; + int count = 0; struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; @@ -729,6 +730,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from, while (num_to_find && (current = prio_queue_get()) != NULL) { struct commit_list *parents; + count++; + if (current->object.flags & PARENT1) { current->object.flags &= ~PARENT1; current->object.flags |= reachable_flag; @@ -755,6 +758,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from, clear_commit_marks_many(nr_to, to, PARENT1); clear_commit_marks_many(nr_from, from, PARENT2); + fprintf(stderr, "count: %d\n", count); + return found_commits; } -- 2.19.1.542.gc4df23f792
Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
On 10/30/2018 11:35 PM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +int reachable_flag) This is OR'ed into object.flags, and I somoehow expected to see it as 'unsigned int', not a signed one. Will do. Thanks. +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(, *item); + } + } OK, we marked "to" with PARENT1 and counted them in num_to_find without dups. We also marked "from" with PARENT2 and threw them in the "queue" without dups. Mental note: the caller must guarantee that everybody reachable from "to" and "from" have PARENT1 and PARENT2 clear. This might deserve to be in the comment before the function. I'll put that in the header file. [snip] OK, this all makes sense. Unlike merge-base traversals, this does not have to traverse from the "to" side at all, which makes it quite simpler and straight-forward. I do wonder if we can now reimplement in_merge_bases_many() in terms of this helper, and if that gives us a better performance. It asks "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable from, one of the 'references', i.e. 'from'"? We could do this, but it does come with a performance hit when the following are all true: 1. 'to' is not reachable from any 'from' commits. 2. The 'to' and 'from' commits are close in commit-date. 3. Generation numbers are not available, or the topology is skewed to have commits with high commit date and low generation number. Since in_merge_bases_many() calls paint_down_to_common(), it has the same issues with the current generation numbers. This can be fixed when we have the next version of generation numbers available. I'll make a note to have in_merge_bases_many() call get_reachable_subset() conditionally (like the generation_numbers_available() trick in the --topo-order series) after the generation numbers are settled and implemented. Thanks, -Stolee
Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
On 10/31/2018 2:07 AM, Elijah Newren wrote: On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget wrote: --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; So, when we don't have a commit-graph, is c->generation just going to be 0, making min_generation also be 0? (meaning we get no possible speed benefit from the commit-graph, since we just don't have that information available)? If we don't have a commit-graph, then we can only terminate the loop early if we discover all of the commits (num_to_find == 0).Otherwise, we need to walk the entire graph in order to determine non-reachability. This relates to Junio's point about in_merge_bases_many(), which I'll respond to his message in more detail about that. Thanks, -Stolee
Re: commit-graph is cool (overcoming add_missing_tags() perf issues)
On 10/17/2018 2:00 PM, Elijah Newren wrote: Hi, Just wanted to give a shout-out for the commit-graph work and how impressive it is. I had an internal report from a user that git pushes containing only one new tiny commit were taking over a minute (in a moderate size repo with good network connectivity). After digging for a while, I noticed three unusual things about the repo[1]: * he had push.followTags set to true * upstream repo had about 20k tags (despite only 55k commits) * his repo had an additional 2.5k tags, but none of these were in the history of the branches he was pushing and thus would not be included in any pushes. Digging in, almost all the time was CPU-bound and spent in add_missing_tags()[2]. If I'm reading the code correctly, it appears that function loops over each tag, calling in_merge_bases_many() once per tag. Thus, for his case, we were potentially walking all of history of the main branch 2.5k times. That seemed rather suboptimal. Elijah, Do you still have this repo around? Could you by chance test the performance with the new algorithm for add_missing_tags() in [1]? Specifically, please test it without a commit-graph file, since your data shape already makes use of generation numbers pretty well. Thanks, -Stolee [1] https://public-inbox.org/git/pull.60.git.gitgitgad...@gmail.com/T/#t
[PATCH 2/3] test-reach: test get_reachable_subset
From: Derrick Stolee The get_reachable_subset() method returns the list of commits in the 'to' array that are reachable from at least one commit in the 'from' array. Add tests that check this method works in a few cases: 1. All commits in the 'to' list are reachable. This exercises the early-termination condition. 2. Some commits in the 'to' list are reachable. This exercises the loop-termination condition. 3. No commits in the 'to' list are reachable. This exercises the NULL return condition. Signed-off-by: Derrick Stolee --- t/helper/test-reach.c | 34 t/t6600-test-reach.sh | 52 +++ 2 files changed, 82 insertions(+), 4 deletions(-) diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c index 08d2ea68e..a0272178b 100644 --- a/t/helper/test-reach.c +++ b/t/helper/test-reach.c @@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av) struct commit *A, *B; struct commit_list *X, *Y; struct object_array X_obj = OBJECT_ARRAY_INIT; - struct commit **X_array; - int X_nr, X_alloc; + struct commit **X_array, **Y_array; + int X_nr, X_alloc, Y_nr, Y_alloc; struct strbuf buf = STRBUF_INIT; struct repository *r = the_repository; @@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av) A = B = NULL; X = Y = NULL; - X_nr = 0; - X_alloc = 16; + X_nr = Y_nr = 0; + X_alloc = Y_alloc = 16; ALLOC_ARRAY(X_array, X_alloc); + ALLOC_ARRAY(Y_array, Y_alloc); while (strbuf_getline(, stdin) != EOF) { struct object_id oid; @@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av) case 'Y': commit_list_insert(c, ); + ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc); + Y_array[Y_nr++] = c; break; default: @@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av) filter.with_commit_tag_algo = 0; printf("%s(_,A,X,_):%d\n", av[1], commit_contains(, A, X, )); + } else if (!strcmp(av[1], "get_reachable_subset")) { + const int reachable_flag = 1; + int i, count = 0; + struct commit_list *current; + struct commit_list *list = get_reachable_subset(X_array, X_nr, + Y_array, Y_nr, + reachable_flag); + printf("get_reachable_subset(X,Y)\n"); + for (current = list; current; current = current->next) { + if (!(list->item->object.flags & reachable_flag)) + die(_("commit %s is not marked reachable"), + oid_to_hex(>item->object.oid)); + count++; + } + for (i = 0; i < Y_nr; i++) { + if (Y_array[i]->object.flags & reachable_flag) + count--; + } + + if (count < 0) + die(_("too many commits marked reachable")); + + print_sorted_commit_ids(list); } exit(0); diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index ae94b27f7..a0c64e617 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'get_reachable_subset:all' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-6-6 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 \ + commit-5-6 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:some' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 + X:commit-1-7 + Y:commit-3-3 + Y:commit-1-7 + Y:commit-5-6 + EOF + ( + echo "get_reachable_subset(X,Y)" && + git rev-parse commit-3-3 \ + commit-1-7 | sort + ) >expect && + test_three_modes get_reachable_subset +' + +test_expect_success 'get_reachable_subset:none' ' + cat >input <<-\EOF && + X:commit-9-1 + X:commit-8-3 + X:commit-7-5 +
[PATCH 3/3] remote: make add_missing_tags() linear
From: Derrick Stolee The add_missing_tags() method currently has quadratic behavior. This is due to a linear number (based on number of tags T) of calls to in_merge_bases_many, which has linear performance (based on number of commits C in the repository). Replace this O(T * C) algorithm with an O(T + C) algorithm by using get_reachable_subset(). We ignore the return list and focus instead on the reachable_flag assigned to the commits we care about, because we need to interact with the tag ref and not just the commit object. Signed-off-by: Derrick Stolee --- remote.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/remote.c b/remote.c index 81f4f01b0..b850f2feb 100644 --- a/remote.c +++ b/remote.c @@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * sent to the other side. */ if (sent_tips.nr) { + const int reachable_flag = 1; + struct commit_list *found_commits; + struct commit **src_commits; + int nr_src_commits = 0, alloc_src_commits = 16; + ALLOC_ARRAY(src_commits, alloc_src_commits); + for_each_string_list_item(item, _tag) { struct ref *ref = item->util; + struct commit *commit; + + if (is_null_oid(>new_oid)) + continue; + commit = lookup_commit_reference_gently(the_repository, + >new_oid, + 1); + if (!commit) + /* not pushing a commit, which is not an error */ + continue; + + ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits); + src_commits[nr_src_commits++] = commit; + } + + found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr, +src_commits, nr_src_commits, +reachable_flag); + + for_each_string_list_item(item, _tag) { struct ref *dst_ref; + struct ref *ref = item->util; struct commit *commit; if (is_null_oid(>new_oid)) @@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds * Is this tag, which they do not have, reachable from * any of the commits we are sending? */ - if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip)) + if (!(commit->object.flags & reachable_flag)) continue; /* Add it in */ @@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds oidcpy(_ref->new_oid, >new_oid); dst_ref->peer_ref = copy_ref(ref); } + + clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag); + free(src_commits); + free_commit_list(found_commits); } + string_list_clear(_tag, 0); free(sent_tips.tip); } -- gitgitgadget