[PATCH v6 10/14] commit-graph: close under reachability
From: Derrick Stolee <dsto...@microsoft.com> Teach write_commit_graph() to walk all parents from the commits discovered in packfiles. This prevents gaps given by loose objects or previously-missed packfiles. Also automatically add commits from the existing graph file, if it exists. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 2f2e2c7083..fc7b4fa622 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -369,6 +369,28 @@ static int add_packed_commits(const struct object_id *oid, return 0; } +static void close_reachable(struct packed_oid_list *oids) +{ + int i; + struct rev_info revs; + struct commit *commit; + init_revisions(, NULL); + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + if (commit && !parse_commit(commit)) + revs.commits = commit_list_insert(commit, ); + } + + if (prepare_revision_walk()) + die(_("revision walk setup failed")); + + while ((commit = get_revision()) != NULL) { + ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc); + oidcpy(>list[oids->nr], &(commit->object.oid)); + (oids->nr)++; + } +} + void write_commit_graph(const char *obj_dir) { struct packed_oid_list oids; @@ -392,6 +414,7 @@ void write_commit_graph(const char *obj_dir) ALLOC_ARRAY(oids.list, oids.alloc); for_each_packed_object(add_packed_commits, , 0); + close_reachable(); QSORT(oids.list, oids.nr, commit_compare); -- 2.14.1
[PATCH v6 07/14] commit-graph: implement 'git-commit-graph write'
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 39 builtin/commit-graph.c | 33 ++ t/t5318-commit-graph.sh| 125 + 3 files changed, 197 insertions(+) create mode 100755 t/t5318-commit-graph.sh diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 5913340fad..e688843808 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -5,6 +5,45 @@ NAME git-commit-graph - Write and verify Git commit graph files + +SYNOPSIS + +[verse] +'git commit-graph write' [--object-dir ] + + +DESCRIPTION +--- + +Manage the serialized commit graph file. + + +OPTIONS +--- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. The commit graph file is expected to be at /info/commit-graph + and the packfiles are expected to be in /pack. + + +COMMANDS + +'write':: + +Write a commit graph file based on the commits found in packfiles. +Includes all commits from the existing commit graph file. + + +EXAMPLES + + +* Write a commit graph file for the packed commits in your local .git folder. ++ + +$ git commit-graph write + + + GIT --- Part of the linkgit:git[1] suite diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 8ff7336527..a9d61f649a 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -1,9 +1,18 @@ #include "builtin.h" #include "config.h" +#include "dir.h" +#include "lockfile.h" #include "parse-options.h" +#include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph write [--object-dir ]"), + NULL +}; + +static const char * const builtin_commit_graph_write_usage[] = { + N_("git commit-graph write [--object-dir ]"), NULL }; @@ -11,6 +20,25 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_write(int argc, const char **argv) +{ + static struct option builtin_commit_graph_write_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_write_options, +builtin_commit_graph_write_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + write_commit_graph(opts.obj_dir); + return 0; +} int cmd_commit_graph(int argc, const char **argv, const char *prefix) { @@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) builtin_commit_graph_usage, PARSE_OPT_STOP_AT_NON_OPTION); + if (argc > 0) { + if (!strcmp(argv[0], "write")) + return graph_write(argc, argv); + } + usage_with_options(builtin_commit_graph_usage, builtin_commit_graph_options); } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh new file mode 100755 index 00..43707ce5bb --- /dev/null +++ b/t/t5318-commit-graph.sh @@ -0,0 +1,125 @@ +#!/bin/sh + +test_description='commit graph' +. ./test-lib.sh + +test_expect_success 'setup full repo' ' + mkdir full && + cd "$TRASH_DIRECTORY/full" && + git init && + objdir=".git/objects" +' + +test_expect_success 'write graph with no packs' ' +cd "$TRASH_DIRECTORY/full" && + git commit-graph write --object-dir . && + test_path_is_file info/commit-graph +' + +test_expect_success 'create commits and repack' ' +cd "$TRASH_DIRECTORY/full" && + for i in $(test_seq 3) + do + test_commit $i && + git branch commits/$i + done && + git repack +' + +test_expect_success 'write graph' ' +cd "$TRASH_DIRECTORY/full" && + graph1=$(git commit-graph write) && + test_path_is_file $objdir/info/commit-graph +' + +test_expect_success 'Add more commits' ' +cd "$TRASH_DIRECTORY/full" && + git reset --hard commits/1 && + for i in $(test_seq 4 5)
[PATCH v6 05/14] commit-graph: create git-commit-graph builtin
From: Derrick Stolee <dsto...@microsoft.com> Teach git the 'commit-graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for an '--object-dir' option. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .gitignore | 1 + Documentation/git-commit-graph.txt | 11 ++ Makefile | 1 + builtin.h | 1 + builtin/commit-graph.c | 37 ++ command-list.txt | 1 + contrib/completion/git-completion.bash | 2 ++ git.c | 1 + 8 files changed, 55 insertions(+) create mode 100644 Documentation/git-commit-graph.txt create mode 100644 builtin/commit-graph.c diff --git a/.gitignore b/.gitignore index 833ef3b0b7..e82f90184d 100644 --- a/.gitignore +++ b/.gitignore @@ -34,6 +34,7 @@ /git-clone /git-column /git-commit +/git-commit-graph /git-commit-tree /git-config /git-count-objects diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt new file mode 100644 index 00..5913340fad --- /dev/null +++ b/Documentation/git-commit-graph.txt @@ -0,0 +1,11 @@ +git-commit-graph(1) +=== + +NAME + +git-commit-graph - Write and verify Git commit graph files + +GIT +--- +Part of the linkgit:git[1] suite + diff --git a/Makefile b/Makefile index de4b8f0c02..a928d4de66 100644 --- a/Makefile +++ b/Makefile @@ -946,6 +946,7 @@ BUILTIN_OBJS += builtin/clone.o BUILTIN_OBJS += builtin/column.o BUILTIN_OBJS += builtin/commit-tree.o BUILTIN_OBJS += builtin/commit.o +BUILTIN_OBJS += builtin/commit-graph.o BUILTIN_OBJS += builtin/config.o BUILTIN_OBJS += builtin/count-objects.o BUILTIN_OBJS += builtin/credential.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..079855b6d4 100644 --- a/builtin.h +++ b/builtin.h @@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix); extern int cmd_clean(int argc, const char **argv, const char *prefix); extern int cmd_column(int argc, const char **argv, const char *prefix); extern int cmd_commit(int argc, const char **argv, const char *prefix); +extern int cmd_commit_graph(int argc, const char **argv, const char *prefix); extern int cmd_commit_tree(int argc, const char **argv, const char *prefix); extern int cmd_config(int argc, const char **argv, const char *prefix); extern int cmd_count_objects(int argc, const char **argv, const char *prefix); diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c new file mode 100644 index 00..8ff7336527 --- /dev/null +++ b/builtin/commit-graph.c @@ -0,0 +1,37 @@ +#include "builtin.h" +#include "config.h" +#include "parse-options.h" + +static char const * const builtin_commit_graph_usage[] = { + N_("git commit-graph [--object-dir ]"), + NULL +}; + +static struct opts_commit_graph { + const char *obj_dir; +} opts; + + +int cmd_commit_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_commit_graph_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); + + git_config(git_default_config, NULL); + argc = parse_options(argc, argv, prefix, +builtin_commit_graph_options, +builtin_commit_graph_usage, +PARSE_OPT_STOP_AT_NON_OPTION); + + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); +} + diff --git a/command-list.txt b/command-list.txt index a1fad28fd8..835c5890be 100644 --- a/command-list.txt +++ b/command-list.txt @@ -34,6 +34,7 @@ git-clean mainporcelain git-clone mainporcelain init git-column purehelpers git-commit mainporcelain history +git-commit-graphplumbingmanipulators git-commit-tree plumbingmanipulators git-config ancillarymanipulators git-count-objects ancillaryinterrogators diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index 91536d831c..a24af902d8 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -841,6 +841,7 @@ __git_list_porcelain_commands () check-ref-format) : pl
[PATCH v6 01/14] csum-file: rename hashclose() to finalize_hashfile()
From: Derrick Stolee <dsto...@microsoft.com> The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/index-pack.c | 2 +- builtin/pack-objects.c | 6 +++--- bulk-checkin.c | 4 ++-- csum-file.c| 2 +- csum-file.h| 4 ++-- fast-import.c | 2 +- pack-bitmap-write.c| 2 +- pack-write.c | 4 ++-- 8 files changed, 13 insertions(+), 13 deletions(-) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 59878e70b8..157bceb264 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1269,7 +1269,7 @@ static void conclude_pack(int fix_thin_pack, const char *curr_pack, unsigned cha nr_objects - nr_objects_initial); stop_progress_msg(, msg.buf); strbuf_release(); - hashclose(f, tail_hash, 0); + finalize_hashfile(f, tail_hash, 0); hashcpy(read_hash, pack_hash); fixup_pack_header_footer(output_fd, pack_hash, curr_pack, nr_objects, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a197926eaa..84e9f57b7f 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -837,11 +837,11 @@ static void write_pack_file(void) * If so, rewrite it like in fast-import */ if (pack_to_stdout) { - hashclose(f, oid.hash, CSUM_CLOSE); + finalize_hashfile(f, oid.hash, CSUM_CLOSE); } else if (nr_written == nr_remaining) { - hashclose(f, oid.hash, CSUM_FSYNC); + finalize_hashfile(f, oid.hash, CSUM_FSYNC); } else { - int fd = hashclose(f, oid.hash, 0); + int fd = finalize_hashfile(f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, pack_tmp_name, nr_written, oid.hash, offset); close(fd); diff --git a/bulk-checkin.c b/bulk-checkin.c index 9d87eac07b..227cc9f3b1 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -35,9 +35,9 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) unlink(state->pack_tmp_name); goto clear_exit; } else if (state->nr_written == 1) { - hashclose(state->f, oid.hash, CSUM_FSYNC); + finalize_hashfile(state->f, oid.hash, CSUM_FSYNC); } else { - int fd = hashclose(state->f, oid.hash, 0); + int fd = finalize_hashfile(state->f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name, state->nr_written, oid.hash, state->offset); diff --git a/csum-file.c b/csum-file.c index 5eda7fb6af..e6c95a6915 100644 --- a/csum-file.c +++ b/csum-file.c @@ -53,7 +53,7 @@ void hashflush(struct hashfile *f) } } -int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags) +int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int flags) { int fd; diff --git a/csum-file.h b/csum-file.h index 992e5c0141..9ba87f0a6c 100644 --- a/csum-file.h +++ b/csum-file.h @@ -26,14 +26,14 @@ struct hashfile_checkpoint { extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *); extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *); -/* hashclose flags */ +/* finalize_hashfile flags */ #define CSUM_CLOSE 1 #define CSUM_FSYNC 2 extern struct hashfile *hashfd(int fd, const char *name); extern struct hashfile *hashfd_check(const char *name); extern struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp); -extern int hashclose(struct hashfile *, unsigned char *, unsigned int); +extern int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int); extern void hashwrite(struct hashfile *, const void *, unsigned int); extern void hashflush(struct hashfile *f); extern void crc32_begin(struct hashfile *); diff --git a/fast-import.c b/fast-import.c index 58ef360da4..2e5d17318d 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1016,7 +1016,7 @@ static void end_packfile(void) struct tag *t; close_pack_windows(pack_data); - hashclose(pack_file, cur_pack_oid.hash, 0); + finalize_hashfile(pack_file, cur_pack_oid.hash, 0); fixup_pack_header_footer(pack_data->pack_fd, pack_data->sha1,
[PATCH v6 12/14] commit-graph: read only from specific pack-indexes
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 11 ++- builtin/commit-graph.c | 33 ++--- commit-graph.c | 26 -- commit-graph.h | 4 +++- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5318-commit-graph.sh| 10 ++ 7 files changed, 81 insertions(+), 9 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 51cb038f3d..b945510f0f 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -32,7 +32,9 @@ COMMANDS 'write':: Write a commit graph file based on the commits found in packfiles. -Includes all commits from the existing commit graph file. ++ +With the `--stdin-packs` option, generate the new commit graph by +walking objects only in the specified packfiles. 'read':: @@ -49,6 +51,13 @@ EXAMPLES $ git commit-graph write +* Write a graph file, extending the current graph file using commits +* in . ++ + +$ echo | git commit-graph write --stdin-packs + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 0e164becff..eebca57e6f 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; @@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int stdin_packs; } opts; static int graph_read(int argc, const char **argv) @@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv) static int graph_write(int argc, const char **argv) { + const char **pack_indexes = NULL; + int packs_nr = 0; + const char **lines = NULL; + int lines_nr = 0; + int lines_alloc = 0; + static struct option builtin_commit_graph_write_options[] = { OPT_STRING(0, "object-dir", _dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "stdin-packs", _packs, + N_("scan packfiles listed by stdin for commits")), OPT_END(), }; @@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv) if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - write_commit_graph(opts.obj_dir); + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + lines_nr = 0; + lines_alloc = 128; + ALLOC_ARRAY(lines, lines_alloc); + + while (strbuf_getline(, stdin) != EOF) { + ALLOC_GROW(lines, lines_nr + 1, lines_alloc); + lines[lines_nr++] = strbuf_detach(, NULL); + } + + pack_indexes = lines; + packs_nr = lines_nr; + } + + write_commit_graph(opts.obj_dir, + pack_indexes, + packs_nr); + return 0; } diff --git a/commit-graph.c b/commit-graph.c index 98e2b89b94..f0d7585ddb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -529,7 +529,9 @@ static void close_reachable(struct packed_oid_list *oids) } } -void write_commit_graph(const char *obj_dir) +void write_commit_graph(const char *obj_dir, + const char **pack_indexes, + int nr_packs) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -551,7 +553,27 @@ void write_commit_graph(const char *obj_dir) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); - for_each_packed_object(add_packed_commits, , 0); + if (pack_indexes) { + struct strbuf packname = STRBUF_I
[PATCH v6 13/14] commit-graph: build graph from starting commits
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 14 +- builtin/commit-graph.c | 27 +-- commit-graph.c | 27 +-- commit-graph.h | 4 +++- t/t5318-commit-graph.sh| 13 + 5 files changed, 75 insertions(+), 10 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index b945510f0f..0710a68f2d 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -34,7 +34,13 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by -walking objects only in the specified packfiles. +walking objects only in the specified packfiles. (Cannot be combined +with --stdin-commits.) ++ +With the `--stdin-commits` option, generate the new commit graph by +walking commits starting at the commits specified in stdin as a list +of OIDs in hex, one OID per line. (Cannot be combined with +--stdin-packs.) 'read':: @@ -58,6 +64,12 @@ $ git commit-graph write $ echo | git commit-graph write --stdin-packs +* Write a graph file containing all reachable commits. ++ + +$ git show-ref -s | git commit-graph write --stdin-commits + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index eebca57e6f..1c7b7e72b0 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; int stdin_packs; + int stdin_commits; } opts; static int graph_read(int argc, const char **argv) @@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; int packs_nr = 0; + const char **commit_hex = NULL; + int commits_nr = 0; const char **lines = NULL; int lines_nr = 0; int lines_alloc = 0; @@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv) N_("The object directory to store the graph")), OPT_BOOL(0, "stdin-packs", _packs, N_("scan packfiles listed by stdin for commits")), + OPT_BOOL(0, "stdin-commits", _commits, + N_("start walk at commits listed by stdin")), OPT_END(), }; @@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); + if (opts.stdin_packs && opts.stdin_commits) + die(_("cannot use both --stdin-commits and --stdin-packs")); if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; lines_nr = 0; lines_alloc = 128; @@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv) lines[lines_nr++] = strbuf_detach(, NULL); } - pack_indexes = lines; - packs_nr = lines_nr; + if (opts.stdin_packs) { + pack_indexes = lines; +
[PATCH v6 09/14] commit-graph: add core.commitGraph setting
From: Derrick Stolee <dsto...@microsoft.com> The commit graph feature is controlled by the new core.commitGraph config setting. This defaults to 0, so the feature is opt-in. The intention of core.commitGraph is that a user can always stop checking for or parsing commit graph files if core.commitGraph=0. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/config.txt | 3 +++ cache.h | 1 + config.c | 5 + environment.c| 1 + 4 files changed, 10 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index ce9102cea8..9e3da629b8 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -898,6 +898,9 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.commitGraph:: + Enable git commit graph feature. Allows reading from .graph files. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. diff --git a/cache.h b/cache.h index d06932ed0b..e62569fbb1 100644 --- a/cache.h +++ b/cache.h @@ -801,6 +801,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; +extern int core_commit_graph; extern int core_apply_sparse_checkout; extern int precomposed_unicode; extern int protect_hfs; diff --git a/config.c b/config.c index b0c20e6cb8..25ee4a676c 100644 --- a/config.c +++ b/config.c @@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.commitgraph")) { + core_commit_graph = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index d6dd64662c..8853e2f0dd 100644 --- a/environment.c +++ b/environment.c @@ -62,6 +62,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED; enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; +int core_commit_graph; int core_apply_sparse_checkout; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ -- 2.14.1
[PATCH v6 14/14] commit-graph: implement "--additive" option
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to add all commits from the existing commit-graph file to the file about to be written. This should be used when adding new commits without performing garbage collection. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 10 ++ builtin/commit-graph.c | 10 +++--- commit-graph.c | 17 - commit-graph.h | 3 ++- t/t5318-commit-graph.sh| 10 ++ 5 files changed, 45 insertions(+), 5 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 0710a68f2d..ccf5e203ce 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -41,6 +41,9 @@ With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with --stdin-packs.) ++ +With the `--additive` option, include all commits that are present +in the existing commit-graph file. 'read':: @@ -70,6 +73,13 @@ $ echo | git commit-graph write --stdin-packs $ git show-ref -s | git commit-graph write --stdin-commits +* Write a graph file containing all commits in the current +* commit-graph file along with those reachable from HEAD. ++ + +$ git rev-parse HEAD | git commit-graph write --stdin-commits --additive + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 1c7b7e72b0..d26a6d6de3 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--additive] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--additive] [--stdin-packs|--stdin-commits]"), NULL }; @@ -26,6 +26,7 @@ static struct opts_commit_graph { const char *obj_dir; int stdin_packs; int stdin_commits; + int additive; } opts; static int graph_read(int argc, const char **argv) @@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv) N_("scan packfiles listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", _commits, N_("start walk at commits listed by stdin")), + OPT_BOOL(0, "additive", , + N_("include all commits already in the commit-graph file")), OPT_END(), }; @@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv) pack_indexes, packs_nr, commit_hex, - commits_nr); + commits_nr, + opts.additive); return 0; } diff --git a/commit-graph.c b/commit-graph.c index 9f1ba9bff6..6348bab82b 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -533,7 +533,8 @@ void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, const char **commit_hex, - int nr_commits) + int nr_commits, + int additive) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -551,10 +552,24 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; oids.alloc = approximate_object_count() / 4; + if (additive) { + prepare_commit_graph_one(obj_dir); + if (commit_graph) + oids.alloc += commit_graph->num_commits; + } + if (oids.alloc < 1024) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); + if (additive && commit_graph) { + for (i = 0; i < commit_graph->num_commits; i++) { + const unsigned char *hash = commit_graph->chunk_oid_lookup + + com
[PATCH 1/3] commit: create get_commit_tree() method
While walking the commit graph, we load struct commit objects into the object cache. During this process, we also load struct tree objects for the root tree of each of these commits. We load these objects even if we are only computing commit reachability information, such as a merge base or ahead/behind information. Create get_commit_tree() as a first step to removing direct references to the 'tree' member of struct commit. Create get_commit_tree_oid() as a shortcut for several references to ">tree->object.oid" in the codebase. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 10 ++ commit.h | 3 +++ 2 files changed, 13 insertions(+) diff --git a/commit.c b/commit.c index 3e39c86abf..d65c7b3b47 100644 --- a/commit.c +++ b/commit.c @@ -296,6 +296,16 @@ void free_commit_buffer(struct commit *commit) } } +struct tree *get_commit_tree(const struct commit *commit) +{ + return commit->tree; +} + +struct object_id *get_commit_tree_oid(const struct commit *commit) +{ + return >tree->object.oid; +} + const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) { struct commit_buffer *v = buffer_slab_peek(_slab, commit); diff --git a/commit.h b/commit.h index e57ae4b583..fa79cc4d1f 100644 --- a/commit.h +++ b/commit.h @@ -102,6 +102,9 @@ void unuse_commit_buffer(const struct commit *, const void *buffer); */ void free_commit_buffer(struct commit *); +struct tree *get_commit_tree(const struct commit *); +struct object_id *get_commit_tree_oid(const struct commit *); + /* * Disassociate any cached object buffer from the commit, but do not free it. * The buffer (or NULL, if none) is returned. -- 2.17.0.20.g9f30ba16e1
[PATCH 3/3] commit-graph: lazy-load trees
The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits. Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree(). Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative peformance improvement when trees are not needed is small. On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.83s After: 0.65s Rel %: -21.6% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 25 ++--- commit-graph.h | 7 +++ commit.c | 10 -- 3 files changed, 37 insertions(+), 5 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 3080a87940..a3eeb25f22 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -247,7 +247,6 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { - struct object_id oid; uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; @@ -257,8 +256,7 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin item->object.parsed = 1; item->graph_pos = pos; - hashcpy(oid.hash, commit_data); - item->tree = lookup_tree(); + item->tree = NULL; date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; date_low = get_be32(commit_data + g->hash_len + 12); @@ -317,6 +315,27 @@ int parse_commit_in_graph(struct commit *item) return 0; } +static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) +{ + struct object_id oid; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * (c->graph_pos); + + hashcpy(oid.hash, commit_data); + c->tree = lookup_tree(); + + return c->tree; +} + +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + if (c->tree) + return c->tree; + if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) + BUG("get_commit_tree_in_graph called from non-commit-graph commit"); + + return load_tree_for_commit(commit_graph, (struct commit *)c); +} + static void write_graph_chunk_fanout(struct hashfile *f, struct commit **commits, int nr_commits) diff --git a/commit-graph.h b/commit-graph.h index e1d8580c98..3ab45818e2 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,13 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * For performance reasons, a commit loaded from the graph does not + * have a tree loaded until trying to consume it for the first time. + * Load that tree into the commit and return the object. + */ +struct tree *get_commit_tree_in_graph(const struct commit *c); + struct commit_graph { int graph_fd; diff --git a/commit.c b/commit.c index d65c7b3b47..d4293ae8f6 100644 --- a/commit.c +++ b/commit.c @@ -298,12 +298,18 @@ void free_commit_buffer(struct commit *commit) struct tree *get_commit_tree(const struct commit *commit) { - return commit->tree; + if (commit->tree || !commit->object.parsed) + return commit->tree; + + if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH) + BUG("commit has NULL tree, but was not loaded from commit-graph"); + + return get_commit_tree_in_graph(commit); } struct object_id *get_commit_tree_oid(const struct commit *commit) { - return >tree->object.oid; + return _commit_tree(commit)->object.oid; } const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) -- 2.17.0.20.g9f30ba16e1
Re: [PATCH v7 08/14] commit-graph: implement git commit-graph read
On 4/2/2018 5:33 PM, Junio C Hamano wrote: Derrick Stolee <sto...@gmail.com> writes: From: Derrick Stolee <dsto...@microsoft.com> ... +static int graph_read(int argc, const char **argv) +{ + struct commit_graph *graph = 0; The previous round said NULL above, not 0, and NULL is the better way to spell it, I would think. Sorry about that. Hopefully it is easy to squash.
[PATCH 2/3] treewide: use get_commit_tree() for tree access
Replace all direct accesses of the 'tree' member in 'struct commit' with calls to get_commit_tree() or get_commit_tree_oid(). This patch was constructed starting with the following Coccinelle script, then removing false-positives: @@ expression c; @@ - >tree->object.oid + get_commit_tree_oid(c) @@ expression c; symbol m; @@ - c->tree->object.oid.m + get_commit_tree_oid(c)->m @@ expression c; @@ - c->tree + get_commit_tree(c) To ensure all references were removed, the 'tree' member was renamed to 'tree_renamed' along with the few allowed accessors. A successful compilation demonstrated a correct transformation. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- blame.c | 18 +- builtin/checkout.c| 17 + builtin/diff.c| 2 +- builtin/fast-export.c | 6 +++--- builtin/log.c | 4 ++-- builtin/reflog.c | 2 +- commit-graph.c| 2 +- fsck.c| 8 +--- http-push.c | 2 +- line-log.c| 4 ++-- list-objects.c| 10 +- log-tree.c| 6 +++--- merge-recursive.c | 3 ++- notes-merge.c | 8 packfile.c| 2 +- pretty.c | 5 +++-- ref-filter.c | 2 +- revision.c| 8 sequencer.c | 12 ++-- sha1_name.c | 2 +- tree.c| 4 ++-- walker.c | 2 +- 22 files changed, 67 insertions(+), 62 deletions(-) diff --git a/blame.c b/blame.c index 200e0ad9a2..7f5700b324 100644 --- a/blame.c +++ b/blame.c @@ -553,10 +553,10 @@ static struct blame_origin *find_origin(struct commit *parent, diff_setup_done(_opts); if (is_null_oid(>commit->object.oid)) - do_diff_cache(>tree->object.oid, _opts); + do_diff_cache(get_commit_tree_oid(parent), _opts); else - diff_tree_oid(>tree->object.oid, - >commit->tree->object.oid, + diff_tree_oid(get_commit_tree_oid(parent), + get_commit_tree_oid(origin->commit), "", _opts); diffcore_std(_opts); @@ -622,10 +622,10 @@ static struct blame_origin *find_rename(struct commit *parent, diff_setup_done(_opts); if (is_null_oid(>commit->object.oid)) - do_diff_cache(>tree->object.oid, _opts); + do_diff_cache(get_commit_tree_oid(parent), _opts); else - diff_tree_oid(>tree->object.oid, - >commit->tree->object.oid, + diff_tree_oid(get_commit_tree_oid(parent), + get_commit_tree_oid(origin->commit), "", _opts); diffcore_std(_opts); @@ -1257,10 +1257,10 @@ static void find_copy_in_parent(struct blame_scoreboard *sb, diff_opts.flags.find_copies_harder = 1; if (is_null_oid(>commit->object.oid)) - do_diff_cache(>tree->object.oid, _opts); + do_diff_cache(get_commit_tree_oid(parent), _opts); else - diff_tree_oid(>tree->object.oid, - >commit->tree->object.oid, + diff_tree_oid(get_commit_tree_oid(parent), + get_commit_tree_oid(target->commit), "", _opts); if (!diff_opts.flags.find_copies_harder) diff --git a/builtin/checkout.c b/builtin/checkout.c index d76e13c852..0b448fd179 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -484,7 +484,8 @@ static int merge_working_tree(const struct checkout_opts *opts, resolve_undo_clear(); if (opts->force) { - ret = reset_tree(new_branch_info->commit->tree, opts, 1, writeout_error); + ret = reset_tree(get_commit_tree(new_branch_info->commit), +opts, 1, writeout_error); if (ret) return ret; } else { @@ -570,19 +571,19 @@ static int merge_working_tree(const struct checkout_opts *opts, o.verbosity = 0; work = write_tree_from_memory(); - ret = reset_tree(new_branch_info->commit->tree, opts, 1, -writeout_error); + ret = reset_tree(get_commit_tree(new_branch_info->commit), +opts, 1, writeout_error); if (ret) return ret; o.ancestor = old_branch_info->name; o.branch1 = new_branch_info->name; o.branch2 = "local"; - ret = merge_trees(,
[PATCH 0/3] Lazy-load trees when reading commit-graph
There are several commit-graph walks that require loading many commits but never walk the trees reachable from those commits. However, the current logic in parse_commit() requires the root tree to be loaded. This only uses lookup_tree(), but when reading commits from the commit- graph file, the hashcpy() to load the root tree hash and the time spent checking the object cache take more time than parsing the rest of the commit. In this patch series, all direct references to accessing the 'tree' member of struct commit are replaced instead by one of the following methods: struct tree *get_commit_tree(struct commit *) struct object_id *get_commit_tree_oid(struct commit *) This replacement was assisted by a Coccinelle script, but the 'tree' member is overloaded in other types, so the script gave false-positives that were removed from the diff. After all access is restricted to use these methods, we can then change the postcondition of parse_commit_in_graph() to allow 'tree' to be NULL. If the tree is accessed later, we can load the tree's OID from the commit-graph in constant time and perform the lookup_tree(). On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.83s After: 0.65s Rel %: -21.6% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (3): commit: create get_commit_tree() method treewide: use get_commit_tree() for tree access commit-graph: lazy-load trees blame.c | 18 +- builtin/checkout.c| 17 + builtin/diff.c| 2 +- builtin/fast-export.c | 6 +++--- builtin/log.c | 4 ++-- builtin/reflog.c | 2 +- commit-graph.c| 27 +++ commit-graph.h| 7 +++ commit.c | 16 commit.h | 3 +++ fsck.c| 8 +--- http-push.c | 2 +- line-log.c| 4 ++-- list-objects.c| 10 +- log-tree.c| 6 +++--- merge-recursive.c | 3 ++- notes-merge.c | 8 packfile.c| 2 +- pretty.c | 5 +++-- ref-filter.c | 2 +- revision.c| 8 sequencer.c | 12 ++-- sha1_name.c | 2 +- tree.c| 4 ++-- walker.c | 2 +- 25 files changed, 115 insertions(+), 65 deletions(-) -- 2.17.0.20.g9f30ba16e1
Re: [PATCH v2 1/5] core.aheadbehind: add new config setting
On 4/3/2018 6:18 AM, Ævar Arnfjörð Bjarmason wrote: On Tue, Apr 03 2018, Lars Schneider wrote: What is the state of this series? I can't find it in git/git nor in git-for-windows/git. I think Stolee mentioned the config in his Git Merge talk [1] and I was about to test it/roll it out :-) It's in the gvfs branch of g...@github.com:Microsoft/git.git, i.e. it's not in Git for Windows, but used in Microsoft's own in-house version used for Windows.git. Thanks for adding me to CC. I mentioned it in my talk because that was one thing we shipped internally as a "quick fix" until we could do the right thing. If I remember correctly, Jeff abandoned shipping this upstream because it did have the feel of a hack and we wanted to see if users used the config setting or really cared about the output values. We saw fast adoption of the feature and even turned the config setting on automatically in the following version of GVFS. I may be misunderstanding this feature, but my impression was that it was a kludge as a workaround until the commit graph code landed, because once we have that then surely we can just cheaply report the actual (or approximate?) number in the common case, but of course it may still be slow if your commit graph file is out of date. You are correct that the commit-graph file may be out of date, causing slower performance. Even worse: the current graph patch only provides a constant-multiple speedup (still walking the same number of commits, but each commit is parsed much faster). Speaking of our GVFS-specific fork [0], the 'gvfs' branch was updated just yesterday with a couple of changes that I am prepping for submission upstream: * Lazy-load trees when parsing commits from commit-graph [1] * Compute and consume generation numbers [2] Each of these will speed up this ahead/behind calculation in different ways. [1] makes the cost of loading each commit a bit faster, saving up to 20% overall. [2] uses generation numbers in paint_down_to_common() to make the while() condition O(1) instead of O(Q) where Q is the size of the priority queue. The Windows repo is particularly "wide" with many parallel branches being merged in complicated ways, so the queue becomes quite large. This use of generation numbers saves about 4% on some ahead/behind calculations. This speedup is modest, but the existing code already made good use of limiting the commit walk to be mostly the "important" commits. The real benefit of generation numbers will manifest in a way to make --topo-order much faster when rendering a small number of commits. The generation numbers _could_ be used to approximate the ahead/behind calculation in the following way: When comparing A and B, and gen(A) < gen(B), then A is at least (gen(B) - gen(A)) behind. That's the only information that can be gathered directly from those values, but may be enough to short circuit an exact count. To truly accelerate these ahead/behind calculations to be sub-linear* in the ahead/behind counts, we would need a bitmap-based approach. The object-reachability bitmap is a non-starter for client machines in the Windows repo, but perhaps a commit-reachability bitmap could be interesting. Performing set operations on the bitmaps could more quickly answer these questions. Just thinking about it makes me want to go down a deep rabbit hole, investigating ways to compute, store, and use these bitmaps. However: let's wait and see how necessary it is as the commit-graph feature stabilizes. (*These bitmap approaches are not guaranteed to be sub-linear, because it may include iterating through a list of O(N) bits, but good run-length encodings will likely make the count operation very fast, even with a set-difference operation included.) There are too many fun things to work on, not enough time! Thanks, -Stolee [0] https://github.com/microsoft/git Fork of GitForWindows that ships to Windows developers [1] https://github.com/Microsoft/git/commit/29114bf86f591f5c87075f779a1faa2d0f17b92f Lazy-load trees when parsing commits from commit-graph (accidentally squashed to one commit) [2] https://github.com/microsoft/git/compare/879b7d3b1bddea2587b28cdd656c9c655018683a...a0731ca93a35fd042560c4b30e8e0edbdfa4bf9f Compute and consume generation numbers
Re: [PATCH 0/3] Lazy-load trees when reading commit-graph
On 4/3/2018 8:00 AM, Derrick Stolee wrote: There are several commit-graph walks that require loading many commits but never walk the trees reachable from those commits. However, the current logic in parse_commit() requires the root tree to be loaded. This only uses lookup_tree(), but when reading commits from the commit- graph file, the hashcpy() to load the root tree hash and the time spent checking the object cache take more time than parsing the rest of the commit. In this patch series, all direct references to accessing the 'tree' member of struct commit are replaced instead by one of the following methods: struct tree *get_commit_tree(struct commit *) struct object_id *get_commit_tree_oid(struct commit *) This replacement was assisted by a Coccinelle script, but the 'tree' member is overloaded in other types, so the script gave false-positives that were removed from the diff. After all access is restricted to use these methods, we can then change the postcondition of parse_commit_in_graph() to allow 'tree' to be NULL. If the tree is accessed later, we can load the tree's OID from the commit-graph in constant time and perform the lookup_tree(). On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.83s After: 0.65s Rel %: -21.6% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (3): commit: create get_commit_tree() method treewide: use get_commit_tree() for tree access commit-graph: lazy-load trees This patch series is also available as a GitHub pull request [1] [1] https://github.com/derrickstolee/git/pull/4
[PATCH v2 2/4] commit: create get_commit_tree() method
While walking the commit graph, we load struct commit objects into the object cache. During this process, we also load struct tree objects for the root tree of each of these commits. We load these objects even if we are only computing commit reachability information, such as a merge base or ahead/behind information. Create get_commit_tree() as a first step to removing direct references to the 'maybe_tree' member of struct commit. Create get_commit_tree_oid() as a shortcut for several references to ">maybe_tree->object.oid" in the codebase. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 10 ++ commit.h | 3 +++ 2 files changed, 13 insertions(+) diff --git a/commit.c b/commit.c index fbc092808c..aea2ca1f8b 100644 --- a/commit.c +++ b/commit.c @@ -296,6 +296,16 @@ void free_commit_buffer(struct commit *commit) } } +struct tree *get_commit_tree(const struct commit *commit) +{ + return commit->maybe_tree; +} + +struct object_id *get_commit_tree_oid(const struct commit *commit) +{ + return _commit_tree(commit)->object.oid; +} + const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) { struct commit_buffer *v = buffer_slab_peek(_slab, commit); diff --git a/commit.h b/commit.h index c4d6e6e064..dc4bf97d9f 100644 --- a/commit.h +++ b/commit.h @@ -102,6 +102,9 @@ void unuse_commit_buffer(const struct commit *, const void *buffer); */ void free_commit_buffer(struct commit *); +struct tree *get_commit_tree(const struct commit *); +struct object_id *get_commit_tree_oid(const struct commit *); + /* * Disassociate any cached object buffer from the commit, but do not free it. * The buffer (or NULL, if none) is returned. -- 2.17.0
[PATCH v2 1/4] treewide: rename tree to maybe_tree
Using the commit-graph file to walk commit history removes the large cost of parsing commits during the walk. This exposes a performance issue: lookup_tree() takes a large portion of the computation time, even when Git never uses those trees. In anticipation of lazy-loading these trees, rename the 'tree' member of struct commit to 'maybe_tree'. This serves two purposes: it hints at the future role of possibly being NULL even if the commit has a valid tree, and it allows for unambiguous transformation from simple member access (i.e. commit->maybe_tree) to method access. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- blame.c | 18 +- builtin/checkout.c| 12 ++-- builtin/diff.c| 2 +- builtin/fast-export.c | 6 +++--- builtin/log.c | 4 ++-- builtin/reflog.c | 2 +- commit-graph.c| 4 ++-- commit.c | 2 +- commit.h | 2 +- fsck.c| 6 +++--- http-push.c | 2 +- line-log.c| 4 ++-- list-objects.c| 10 +- log-tree.c| 6 +++--- merge-recursive.c | 5 +++-- notes-merge.c | 8 packfile.c| 2 +- pretty.c | 4 ++-- ref-filter.c | 2 +- revision.c| 8 sequencer.c | 12 ++-- sha1_name.c | 2 +- tree.c| 4 ++-- walker.c | 2 +- 24 files changed, 65 insertions(+), 64 deletions(-) diff --git a/blame.c b/blame.c index 200e0ad9a2..b78e649cac 100644 --- a/blame.c +++ b/blame.c @@ -553,10 +553,10 @@ static struct blame_origin *find_origin(struct commit *parent, diff_setup_done(_opts); if (is_null_oid(>commit->object.oid)) - do_diff_cache(>tree->object.oid, _opts); + do_diff_cache(>maybe_tree->object.oid, _opts); else - diff_tree_oid(>tree->object.oid, - >commit->tree->object.oid, + diff_tree_oid(>maybe_tree->object.oid, + >commit->maybe_tree->object.oid, "", _opts); diffcore_std(_opts); @@ -622,10 +622,10 @@ static struct blame_origin *find_rename(struct commit *parent, diff_setup_done(_opts); if (is_null_oid(>commit->object.oid)) - do_diff_cache(>tree->object.oid, _opts); + do_diff_cache(>maybe_tree->object.oid, _opts); else - diff_tree_oid(>tree->object.oid, - >commit->tree->object.oid, + diff_tree_oid(>maybe_tree->object.oid, + >commit->maybe_tree->object.oid, "", _opts); diffcore_std(_opts); @@ -1257,10 +1257,10 @@ static void find_copy_in_parent(struct blame_scoreboard *sb, diff_opts.flags.find_copies_harder = 1; if (is_null_oid(>commit->object.oid)) - do_diff_cache(>tree->object.oid, _opts); + do_diff_cache(>maybe_tree->object.oid, _opts); else - diff_tree_oid(>tree->object.oid, - >commit->tree->object.oid, + diff_tree_oid(>maybe_tree->object.oid, + >commit->maybe_tree->object.oid, "", _opts); if (!diff_opts.flags.find_copies_harder) diff --git a/builtin/checkout.c b/builtin/checkout.c index d76e13c852..b15fed5d85 100644 --- a/builtin/checkout.c +++ b/builtin/checkout.c @@ -484,7 +484,7 @@ static int merge_working_tree(const struct checkout_opts *opts, resolve_undo_clear(); if (opts->force) { - ret = reset_tree(new_branch_info->commit->tree, opts, 1, writeout_error); + ret = reset_tree(new_branch_info->commit->maybe_tree, opts, 1, writeout_error); if (ret) return ret; } else { @@ -570,18 +570,18 @@ static int merge_working_tree(const struct checkout_opts *opts, o.verbosity = 0; work = write_tree_from_memory(); - ret = reset_tree(new_branch_info->commit->tree, opts, 1, + ret = reset_tree(new_branch_info->commit->maybe_tree, opts, 1, writeout_error); if (ret) return ret; o.ancestor = old_branch_info->name; o.branch1 = new_branch_info->name; o.branch2 = "local"; - ret = merge_trees(, new_branch_info->commit->tree, work, -
[PATCH v2 0/4] Lazy-load trees when reading commit-graph
There are several commit-graph walks that require loading many commits but never walk the trees reachable from those commits. However, the current logic in parse_commit() requires the root tree to be loaded. This only uses lookup_tree(), but when reading commits from the commit- graph file, the hashcpy() to load the root tree hash and the time spent checking the object cache take more time than parsing the rest of the commit. In this patch series, all direct references to accessing the 'tree' member of struct commit are replaced instead by one of the following methods: struct tree *get_commit_tree(struct commit *) struct object_id *get_commit_tree_oid(struct commit *) This replacement was assisted by a Coccinelle script, but the 'tree' member is overloaded in other types, so we first rename the 'tree' member to 'maybe_tree' and use the compiler to ensure we caught all examples. Then, contrib/coccinelle/commit.cocci generates the patch to replace all accessors of 'maybe_tree' to the methods above. After all access is restricted to use these methods, we can then change the postcondition of parse_commit_in_graph() to allow 'maybe_tree' to be NULL. If the tree is accessed later, we can load the tree's OID from the commit-graph in constant time and perform the lookup_tree(). On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.92s After: 0.66s Rel %: -28.3% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (4): treewide: rename tree to maybe_tree commit: create get_commit_tree() method treewide: replace maybe_tree with accessor methods commit-graph: lazy-load trees for commits blame.c | 18 +- builtin/checkout.c | 18 -- builtin/diff.c | 2 +- builtin/fast-export.c | 6 +++--- builtin/log.c | 4 ++-- builtin/reflog.c| 2 +- commit-graph.c | 27 +++ commit-graph.h | 7 +++ commit.c| 18 +- commit.h| 5 - contrib/coccinelle/commit.cocci | 30 ++ fsck.c | 8 +--- http-push.c | 2 +- line-log.c | 4 ++-- list-objects.c | 10 +- log-tree.c | 6 +++--- merge-recursive.c | 5 +++-- notes-merge.c | 9 + packfile.c | 2 +- pretty.c| 5 +++-- ref-filter.c| 2 +- revision.c | 8 sequencer.c | 12 ++-- sha1_name.c | 2 +- tree.c | 4 ++-- walker.c| 2 +- 26 files changed, 152 insertions(+), 66 deletions(-) create mode 100644 contrib/coccinelle/commit.cocci -- 2.17.0
Re: [PATCH 3/3] ref-filter: factor ref_array pushing into its own function
On 4/6/2018 2:59 PM, Jeff King wrote: In preparation for callers constructing their own ref_array structs, let's move our own internal push operation into its own function. While we're at it, we can replace REALLOC_ARRAY() with ALLOC_GROW(), which should give the growth operation amortized linear complexity (as opposed to growing by one, which is potentially quadratic, though in-place realloc growth often makes this faster in practice). Signed-off-by: Jeff King <p...@peff.net> --- ref-filter.c | 16 +--- ref-filter.h | 8 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index c1c3cc9480..6e9328b274 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1840,6 +1840,18 @@ static struct ref_array_item *new_ref_array_item(const char *refname, return ref; } +struct ref_array_item *ref_array_push(struct ref_array *array, + const char *refname, + const struct object_id *oid) +{ + struct ref_array_item *ref = new_ref_array_item(refname, oid); + + ALLOC_GROW(array->items, array->nr + 1, array->alloc); + array->items[array->nr++] = ref; + + return ref; +} + static int ref_kind_from_refname(const char *refname) { unsigned int i; @@ -1930,13 +1942,11 @@ static int ref_filter_handler(const char *refname, const struct object_id *oid, * to do its job and the resulting list may yet to be pruned * by maxcount logic. */ - ref = new_ref_array_item(refname, oid); + ref = ref_array_push(ref_cbdata->array, refname, oid); ref->commit = commit; ref->flag = flag; ref->kind = kind; - REALLOC_ARRAY(ref_cbdata->array->items, ref_cbdata->array->nr + 1); - ref_cbdata->array->items[ref_cbdata->array->nr++] = ref; return 0; } diff --git a/ref-filter.h b/ref-filter.h index 68268f9ebc..76cf87cb6c 100644 --- a/ref-filter.h +++ b/ref-filter.h @@ -135,4 +135,12 @@ void setup_ref_filter_porcelain_msg(void); void pretty_print_ref(const char *name, const struct object_id *oid, const struct ref_format *format); +/* + * Push a single ref onto the array; this can be used to construct your own + * ref_array without using filter_refs(). + */ +struct ref_array_item *ref_array_push(struct ref_array *array, + const char *refname, + const struct object_id *oid); + #endif /* REF_FILTER_H */ The three patches in this series look good to me. Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph
On 4/6/2018 3:21 PM, Jeff King wrote: On Fri, Apr 06, 2018 at 07:09:30PM +, Derrick Stolee wrote: Derrick Stolee (4): treewide: rename tree to maybe_tree commit: create get_commit_tree() method treewide: replace maybe_tree with accessor methods commit-graph: lazy-load trees for commits I gave this only a cursory read, but it addresses my concern from the previous round. If I were doing it myself, I probably would have folded patches 1 and 3 together. They are touching all the same spots, and it would be an error for any case converted in patch 1 to not get converted in patch 3. I'm assuming you caught them all due to Coccinelle, though IMHO it is somewhat overkill here. By folding them together the compiler could tell you which spots you missed. And going forward, I doubt it is going to be a common error for people to use maybe_tree directly. Between the name and the warning comment, you'd have to really try to shoot yourself in the foot with it. The primary concern was catching people using the existing "tree" name, whose semantics changed. All that said, I'm fine with having it done this way, too. Thanks. As a double-check that I caught all of the 'maybe_tree' accesses, I ran the following: $ git grep maybe_tree | grep -v get_commit_tree commit-graph.c: item->maybe_tree = NULL; commit-graph.c: c->maybe_tree = lookup_tree(); commit-graph.c: return c->maybe_tree; commit-graph.c: if (c->maybe_tree) commit-graph.c: return c->maybe_tree; commit.c: if (commit->maybe_tree || !commit->object.parsed) commit.c: return commit->maybe_tree; commit.c: item->maybe_tree = lookup_tree(); commit.h: struct tree *maybe_tree; contrib/coccinelle/commit.cocci:- >maybe_tree->object.oid contrib/coccinelle/commit.cocci:- c->maybe_tree->object.oid.hash contrib/coccinelle/commit.cocci:- c->maybe_tree contrib/coccinelle/commit.cocci:+ c->maybe_tree = s contrib/coccinelle/commit.cocci:+ return c->maybe_tree; merge-recursive.c: commit->maybe_tree = tree; Thanks, -Stolee
[PATCH v2 4/4] commit-graph: lazy-load trees for commits
The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits. Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree(). Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative peformance improvement when trees are not needed is small. On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.92s After: 0.66s Rel %: -28.3% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 26 +++--- commit-graph.h | 2 ++ commit.c | 8 +++- commit.h | 6 ++ 4 files changed, 38 insertions(+), 4 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 9f37d84209..a5de6f3102 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -247,7 +247,6 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { - struct object_id oid; uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; @@ -257,8 +256,7 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin item->object.parsed = 1; item->graph_pos = pos; - hashcpy(oid.hash, commit_data); - item->maybe_tree = lookup_tree(); + item->maybe_tree = NULL; date_high = get_be32(commit_data + g->hash_len + 8) & 0x3; date_low = get_be32(commit_data + g->hash_len + 12); @@ -317,6 +315,28 @@ int parse_commit_in_graph(struct commit *item) return 0; } +static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) +{ + struct object_id oid; + const unsigned char *commit_data = g->chunk_commit_data + + GRAPH_DATA_WIDTH * (c->graph_pos); + + hashcpy(oid.hash, commit_data); + c->maybe_tree = lookup_tree(); + + return c->maybe_tree; +} + +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + if (c->maybe_tree) + return c->maybe_tree; + if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) + BUG("get_commit_tree_in_graph called from non-commit-graph commit"); + + return load_tree_for_commit(commit_graph, (struct commit *)c); +} + static void write_graph_chunk_fanout(struct hashfile *f, struct commit **commits, int nr_commits) diff --git a/commit-graph.h b/commit-graph.h index e1d8580c98..260a468e73 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,8 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +struct tree *get_commit_tree_in_graph(const struct commit *c); + struct commit_graph { int graph_fd; diff --git a/commit.c b/commit.c index aea2ca1f8b..711f674c18 100644 --- a/commit.c +++ b/commit.c @@ -298,7 +298,13 @@ void free_commit_buffer(struct commit *commit) struct tree *get_commit_tree(const struct commit *commit) { - return commit->maybe_tree; + if (commit->maybe_tree || !commit->object.parsed) + return commit->maybe_tree; + + if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH) + BUG("commit has NULL tree, but was not loaded from commit-graph"); + + return get_commit_tree_in_graph(commit); } struct object_id *get_commit_tree_oid(const struct commit *commit) diff --git a/commit.h b/commit.h index dc4bf97d9f..23a3f364ed 100644 --- a/commit.h +++ b/commit.h @@ -22,6 +22,12 @@ struct commit { unsigned int index; timestamp_t date; struct commit_list *parents; + + /* +* If the commit is loaded from the commit-graph file, then this +* member may be NULL. Only access it through get_commit_tree() +* or get_commit_tree_oid(). +*/ struct tree *maybe_tree; uint32_t graph_pos; }; -- 2.17.0
[PATCH v2 09/10] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 00bdc2ab21..0b155dece8 100644 --- a/commit.c +++ b/commit.c @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0
[PATCH v2 02/10] merge: check config before loading commits
In anticipation of using generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Also add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 5 +++-- t/t5318-commit-graph.sh | 9 + 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index ee050a47f3..20897f8223 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0
[PATCH v2 10/10] commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 0b155dece8..7348075e38 100644 --- a/commit.c +++ b/commit.c @@ -796,7 +796,9 @@ static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; @@ -830,6 +832,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc last_gen = commit->generation; + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -882,7 +887,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -949,7 +954,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1073,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0
[PATCH v2 08/10] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..2f5e79b5de 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1648,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1662,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0
[PATCH v8 14/14] commit-graph: implement "--append" option
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to add all commits from the existing commit-graph file to the file about to be written. This should be used when adding new commits without performing garbage collection. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 10 ++ builtin/commit-graph.c | 10 +++--- commit-graph.c | 17 - commit-graph.h | 3 ++- t/t5318-commit-graph.sh| 10 ++ 5 files changed, 45 insertions(+), 5 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 442ac243e6..4c97b555cc 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -43,6 +43,9 @@ With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with --stdin-packs.) ++ +With the `--append` option, include all commits that are present in the +existing commit-graph file. 'read':: @@ -72,6 +75,13 @@ $ echo | git commit-graph write --stdin-packs $ git show-ref -s | git commit-graph write --stdin-commits +* Write a graph file containing all commits in the current +* commit-graph file along with those reachable from HEAD. ++ + +$ git rev-parse HEAD | git commit-graph write --stdin-commits --append + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index b5c0b08905..37420ae0fd 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir ] [--append] [--stdin-packs|--stdin-commits]"), NULL }; @@ -26,6 +26,7 @@ static struct opts_commit_graph { const char *obj_dir; int stdin_packs; int stdin_commits; + int append; } opts; static int graph_read(int argc, const char **argv) @@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv) N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", _commits, N_("start walk at commits listed by stdin")), + OPT_BOOL(0, "append", , + N_("include all commits already in the commit-graph file")), OPT_END(), }; @@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv) pack_indexes, packs_nr, commit_hex, - commits_nr); + commits_nr, + opts.append); return 0; } diff --git a/commit-graph.c b/commit-graph.c index a59d1e387b..3ff8c84c0e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -553,7 +553,8 @@ void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, const char **commit_hex, - int nr_commits) + int nr_commits, + int append) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -571,10 +572,24 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; oids.alloc = approximate_object_count() / 4; + if (append) { + prepare_commit_graph_one(obj_dir); + if (commit_graph) + oids.alloc += commit_graph->num_commits; + } + if (oids.alloc < 1024) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); + if (append && commit_graph) { + for (i = 0; i < commit_graph->num_commits; i++) { + const unsigned char *hash = commit_graph->chunk_oid_lookup + + commit_graph->hash_len * i; +
[PATCH v8 13/14] commit-graph: build graph from starting commits
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to read commits from stdin when the --stdin-commits flag is specified. Commits reachable from these commits are added to the graph. This is a much faster way to construct the graph than inspecting all packed objects, but is restricted to known tips. For the Linux repository, 700,000+ commits were added to the graph file starting from 'master' in 7-9 seconds, depending on the number of packfiles in the repo (1, 24, or 120). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 14 +- builtin/commit-graph.c | 27 +-- commit-graph.c | 27 +-- commit-graph.h | 4 +++- t/t5318-commit-graph.sh| 13 + 5 files changed, 75 insertions(+), 10 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 8143cc3f07..442ac243e6 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -36,7 +36,13 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by -walking objects only in the specified pack-indexes. +walking objects only in the specified pack-indexes. (Cannot be combined +with --stdin-commits.) ++ +With the `--stdin-commits` option, generate the new commit graph by +walking commits starting at the commits specified in stdin as a list +of OIDs in hex, one OID per line. (Cannot be combined with +--stdin-packs.) 'read':: @@ -60,6 +66,12 @@ $ git commit-graph write $ echo | git commit-graph write --stdin-packs +* Write a graph file containing all reachable commits. ++ + +$ git show-ref -s | git commit-graph write --stdin-commits + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 5c70199003..b5c0b08905 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; @@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ] [--stdin-packs]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; int stdin_packs; + int stdin_commits; } opts; static int graph_read(int argc, const char **argv) @@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; int packs_nr = 0; + const char **commit_hex = NULL; + int commits_nr = 0; const char **lines = NULL; int lines_nr = 0; int lines_alloc = 0; @@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv) N_("The object directory to store the graph")), OPT_BOOL(0, "stdin-packs", _packs, N_("scan pack-indexes listed by stdin for commits")), + OPT_BOOL(0, "stdin-commits", _commits, + N_("start walk at commits listed by stdin")), OPT_END(), }; @@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); + if (opts.stdin_packs && opts.stdin_commits) + die(_("cannot use both --stdin-commits and --stdin-packs")); if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - if (opts.stdin_packs) { + if (opts.stdin_packs || opts.stdin_commits) { struct strbuf buf = STRBUF_INIT; lines_nr = 0; lines_alloc = 128; @@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv) lines[lines_nr++] = strbuf_detach(, NULL); } - pack_indexes = lines; - packs_nr = lines_nr; + if (opts.stdin_packs) { + pack_indexes = lines; +
[PATCH v8 11/14] commit: integrate commit graph with commit parsing
From: Derrick Stolee <dsto...@microsoft.com> Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently(). This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date. If core.commitGraph is false, then do not check graph files. In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 8.31s | 0.94s | -88% | | branch -vv | 1.02s | 0.14s | -86% | | rev-list --all | 5.89s | 1.07s | -81% | | rev-list --all --objects | 66.15s | 58.45s | -11% | Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c | 1 + commit-graph.c | 141 +++- commit-graph.h | 12 commit.c| 3 + commit.h| 3 + t/t5318-commit-graph.sh | 47 +- 6 files changed, 205 insertions(+), 2 deletions(-) diff --git a/alloc.c b/alloc.c index 12afadfacd..cf4f8b61e1 100644 --- a/alloc.c +++ b/alloc.c @@ -93,6 +93,7 @@ void *alloc_commit_node(void) struct commit *c = alloc_node(_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); + c->graph_pos = COMMIT_NOT_FROM_GRAPH; return c; } diff --git a/commit-graph.c b/commit-graph.c index ea29c5c2d8..f745186e7f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -38,7 +38,6 @@ #define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ GRAPH_OID_LEN + 8) - char *get_commit_graph_filename(const char *obj_dir) { return xstrfmt("%s/info/commit-graph", obj_dir); @@ -179,6 +178,145 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) exit(1); } +/* global storage */ +static struct commit_graph *commit_graph = NULL; + +static void prepare_commit_graph_one(const char *obj_dir) +{ + char *graph_name; + + if (commit_graph) + return; + + graph_name = get_commit_graph_filename(obj_dir); + commit_graph = load_commit_graph_one(graph_name); + + FREE_AND_NULL(graph_name); +} + +static int prepare_commit_graph_run_once = 0; +static void prepare_commit_graph(void) +{ + struct alternate_object_database *alt; + char *obj_dir; + + if (prepare_commit_graph_run_once) + return; + prepare_commit_graph_run_once = 1; + + obj_dir = get_object_directory(); + prepare_commit_graph_one(obj_dir); + prepare_alt_odb(); + for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next) + prepare_commit_graph_one(alt->path); +} + +static void close_commit_graph(void) +{ + if (!commit_graph) + return; + + if (commit_graph->graph_fd >= 0) { + munmap((void *)commit_graph->data, commit_graph->data_len); + commit_graph->data = NULL; + close(commit_graph->graph_fd); + } + + FREE_AND_NULL(commit_graph); +} + +static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos) +{ + return bsearch_hash(oid->hash, g->chunk_oid_fanout, + g->chunk_oid_lookup, g->hash_len, pos); +} + +static struct commit_list **insert_parent_or_die(struct commit_graph *g, +uint64_t pos, +struct commit_list **pptr) +{ + struct commit *c; + struct object_id oid; + hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos); + c = lookup_commit(); + if (!c) + die("could not find commit %s", oid_to_hex()); + c->graph_pos = pos; + return _list_insert(c, pptr)->next; +} + +static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + struct object_id oid; + uint32_t edge_value; + uint32_t *parent_data_ptr; + uint64_t date_low, date_high; + struct commit_list **pptr; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + + item->object.parsed = 1; + item->graph_pos = pos; + + hashcpy(oid.hash, commit_data); + item->tree = lookup_tree(); + + date_high = get_be32(commit_data + g->
[PATCH v8 09/14] commit-graph: add core.commitGraph setting
From: Derrick Stolee <dsto...@microsoft.com> The commit graph feature is controlled by the new core.commitGraph config setting. This defaults to 0, so the feature is opt-in. The intention of core.commitGraph is that a user can always stop checking for or parsing commit graph files if core.commitGraph=0. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/config.txt | 4 cache.h | 1 + config.c | 5 + environment.c| 1 + 4 files changed, 11 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 4e0cff87f6..e5c7013fb0 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -898,6 +898,10 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.commitGraph:: + Enable git commit graph feature. Allows reading from the + commit-graph file. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. diff --git a/cache.h b/cache.h index a61b2d3f0d..8bdbcbbbf7 100644 --- a/cache.h +++ b/cache.h @@ -805,6 +805,7 @@ extern char *git_replace_ref_base; extern int fsync_object_files; extern int core_preload_index; +extern int core_commit_graph; extern int core_apply_sparse_checkout; extern int precomposed_unicode; extern int protect_hfs; diff --git a/config.c b/config.c index b0c20e6cb8..25ee4a676c 100644 --- a/config.c +++ b/config.c @@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.commitgraph")) { + core_commit_graph = git_config_bool(var, value); + return 0; + } + if (!strcmp(var, "core.sparsecheckout")) { core_apply_sparse_checkout = git_config_bool(var, value); return 0; diff --git a/environment.c b/environment.c index d6dd64662c..8853e2f0dd 100644 --- a/environment.c +++ b/environment.c @@ -62,6 +62,7 @@ enum push_default_type push_default = PUSH_DEFAULT_UNSPECIFIED; enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE; char *notes_ref_name; int grafts_replace_parents = 1; +int core_commit_graph; int core_apply_sparse_checkout; int merge_log_config = -1; int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */ -- 2.17.0
[PATCH v8 12/14] commit-graph: read only from specific pack-indexes
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to inspect the objects only in a certain list of pack-indexes within the given pack directory. This allows updating the commit graph iteratively. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 11 +- builtin/commit-graph.c | 33 +++--- commit-graph.c | 26 +-- commit-graph.h | 4 +++- packfile.c | 4 ++-- packfile.h | 2 ++ t/t5318-commit-graph.sh| 10 + 7 files changed, 81 insertions(+), 9 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 8aad8303f5..8143cc3f07 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -34,7 +34,9 @@ COMMANDS 'write':: Write a commit graph file based on the commits found in packfiles. -Includes all commits from the existing commit graph file. ++ +With the `--stdin-packs` option, generate the new commit graph by +walking objects only in the specified pack-indexes. 'read':: @@ -51,6 +53,13 @@ EXAMPLES $ git commit-graph write +* Write a graph file, extending the current graph file using commits +* in . ++ + +$ echo | git commit-graph write --stdin-packs + + * Read basic information from the commit-graph file. + diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index efd39331d7..5c70199003 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -8,7 +8,7 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), N_("git commit-graph read [--object-dir ]"), - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; @@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir ]"), + N_("git commit-graph write [--object-dir ] [--stdin-packs]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int stdin_packs; } opts; static int graph_read(int argc, const char **argv) @@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv) static int graph_write(int argc, const char **argv) { + const char **pack_indexes = NULL; + int packs_nr = 0; + const char **lines = NULL; + int lines_nr = 0; + int lines_alloc = 0; + static struct option builtin_commit_graph_write_options[] = { OPT_STRING(0, "object-dir", _dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "stdin-packs", _packs, + N_("scan pack-indexes listed by stdin for commits")), OPT_END(), }; @@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv) if (!opts.obj_dir) opts.obj_dir = get_object_directory(); - write_commit_graph(opts.obj_dir); + if (opts.stdin_packs) { + struct strbuf buf = STRBUF_INIT; + lines_nr = 0; + lines_alloc = 128; + ALLOC_ARRAY(lines, lines_alloc); + + while (strbuf_getline(, stdin) != EOF) { + ALLOC_GROW(lines, lines_nr + 1, lines_alloc); + lines[lines_nr++] = strbuf_detach(, NULL); + } + + pack_indexes = lines; + packs_nr = lines_nr; + } + + write_commit_graph(opts.obj_dir, + pack_indexes, + packs_nr); + return 0; } diff --git a/commit-graph.c b/commit-graph.c index f745186e7f..70472840a3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -549,7 +549,9 @@ static void close_reachable(struct packed_oid_list *oids) } } -void write_commit_graph(const char *obj_dir) +void write_commit_graph(const char *obj_dir, + const char **pack_indexes, + int nr_packs) { struct packed_oid_list oids; struct packed_commit_list commits; @@ -571,7 +573,27 @@ void write_commit_graph(const char *obj_dir) oids.alloc = 1024; ALLOC_ARRAY(oids.list, oids.alloc); - for_each_packed_object(add_packed_commits, , 0); + if (pack_indexes) { + struct strbuf packname = STRBUF_I
Re: What's cooking in git.git (Apr 2018, #01; Mon, 9)
On 4/9/2018 6:08 PM, Junio C Hamano wrote: I guess we'd want a final cleaned-up round after all ;-) Thanks. v8 sent [1]. Thanks. -Stolee [1] https://public-inbox.org/git/20180410125608.39443-1-dsto...@microsoft.com/T/#u
[PATCH v8 10/14] commit-graph: close under reachability
From: Derrick Stolee <dsto...@microsoft.com> Teach write_commit_graph() to walk all parents from the commits discovered in packfiles. This prevents gaps given by loose objects or previously-missed packfiles. Also automatically add commits from the existing graph file, if it exists. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 + 1 file changed, 45 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index b1bd3a892d..ea29c5c2d8 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -367,6 +367,50 @@ static int add_packed_commits(const struct object_id *oid, return 0; } +static void add_missing_parents(struct packed_oid_list *oids, struct commit *commit) +{ + struct commit_list *parent; + for (parent = commit->parents; parent; parent = parent->next) { + if (!(parent->item->object.flags & UNINTERESTING)) { + ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc); + oidcpy(>list[oids->nr], &(parent->item->object.oid)); + oids->nr++; + parent->item->object.flags |= UNINTERESTING; + } + } +} + +static void close_reachable(struct packed_oid_list *oids) +{ + int i; + struct commit *commit; + + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + if (commit) + commit->object.flags |= UNINTERESTING; + } + + /* +* As this loop runs, oids->nr may grow, but not more +* than the number of missing commits in the reachable +* closure. +*/ + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + + if (commit && !parse_commit(commit)) + add_missing_parents(oids, commit); + } + + for (i = 0; i < oids->nr; i++) { + commit = lookup_commit(>list[i]); + + if (commit) + commit->object.flags &= ~UNINTERESTING; + } +} + void write_commit_graph(const char *obj_dir) { struct packed_oid_list oids; @@ -390,6 +434,7 @@ void write_commit_graph(const char *obj_dir) ALLOC_ARRAY(oids.list, oids.alloc); for_each_packed_object(add_packed_commits, , 0); + close_reachable(); QSORT(oids.list, oids.nr, commit_compare); -- 2.17.0
[PATCH v8 07/14] commit-graph: implement git-commit-graph write
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to write graph files. Create new test script to verify this command succeeds without failure. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 41 ++ builtin/commit-graph.c | 33 t/t5318-commit-graph.sh| 124 + 3 files changed, 198 insertions(+) create mode 100755 t/t5318-commit-graph.sh diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index f3b34622a8..47996e8f89 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -5,6 +5,47 @@ NAME git-commit-graph - Write and verify Git commit graph files + +SYNOPSIS + +[verse] +'git commit-graph write' [--object-dir ] + + +DESCRIPTION +--- + +Manage the serialized commit graph file. + + +OPTIONS +--- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. This parameter exists to specify the location of an alternate + that only has the objects directory, not a full .git directory. The + commit graph file is expected to be at /info/commit-graph and + the packfiles are expected to be in /pack. + + +COMMANDS + +'write':: + +Write a commit graph file based on the commits found in packfiles. +Includes all commits from the existing commit graph file. + + +EXAMPLES + + +* Write a commit graph file for the packed commits in your local .git folder. ++ + +$ git commit-graph write + + + GIT --- Part of the linkgit:git[1] suite diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index b466ecd781..26b6360289 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -1,9 +1,18 @@ #include "builtin.h" #include "config.h" +#include "dir.h" +#include "lockfile.h" #include "parse-options.h" +#include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph write [--object-dir ]"), + NULL +}; + +static const char * const builtin_commit_graph_write_usage[] = { + N_("git commit-graph write [--object-dir ]"), NULL }; @@ -11,6 +20,25 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_write(int argc, const char **argv) +{ + static struct option builtin_commit_graph_write_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_write_options, +builtin_commit_graph_write_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + write_commit_graph(opts.obj_dir); + return 0; +} int cmd_commit_graph(int argc, const char **argv, const char *prefix) { @@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) builtin_commit_graph_usage, PARSE_OPT_STOP_AT_NON_OPTION); + if (argc > 0) { + if (!strcmp(argv[0], "write")) + return graph_write(argc, argv); + } + usage_with_options(builtin_commit_graph_usage, builtin_commit_graph_options); } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh new file mode 100755 index 00..d7b635bd68 --- /dev/null +++ b/t/t5318-commit-graph.sh @@ -0,0 +1,124 @@ +#!/bin/sh + +test_description='commit graph' +. ./test-lib.sh + +test_expect_success 'setup full repo' ' + mkdir full && + cd "$TRASH_DIRECTORY/full" && + git init && + objdir=".git/objects" +' + +test_expect_success 'write graph with no packs' ' + cd "$TRASH_DIRECTORY/full" && + git commit-graph write --object-dir . && + test_path_is_file info/commit-graph +' + +test_expect_success 'create commits and repack' ' + cd "$TRASH_DIRECTORY/full" && + for i in $(test_seq 3) + do + test_commit $i && + git branch commits/$i + done && + git repack +' + +test_expect_success 'write graph' ' + cd "$TRASH_DIRECTORY/full" && + graph1=$(git commit-graph write) && + test_path_is_file $objdir/info/commit-graph +' + +test_expect_success 'Add more commits' ' +
[PATCH v8 04/14] graph: add commit graph design document
From: Derrick Stolee <dsto...@microsoft.com> Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 163 +++ 1 file changed, 163 insertions(+) create mode 100644 Documentation/technical/commit-graph.txt diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 00..0550c6d0dc --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,163 @@ +Git Commit Graph Design Notes += + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. + +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +Design Details +-- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +--- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- The 'commit-graph' subcommand does not have a "verify" mode that is + necessary for integration with fsck. + +- The file format includes ro
[PATCH v8 08/14] commit-graph: implement git commit-graph read
From: Derrick Stolee <dsto...@microsoft.com> Teach git-commit-graph to read commit graph files and summarize their contents. Use the read subcommand to verify the contents of a commit graph file in the tests. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 12 +++ builtin/commit-graph.c | 56 commit-graph.c | 137 - commit-graph.h | 23 + t/t5318-commit-graph.sh| 32 +-- 5 files changed, 254 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 47996e8f89..8aad8303f5 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files SYNOPSIS [verse] +'git commit-graph read' [--object-dir ] 'git commit-graph write' [--object-dir ] @@ -35,6 +36,11 @@ COMMANDS Write a commit graph file based on the commits found in packfiles. Includes all commits from the existing commit graph file. +'read':: + +Read a graph file given by the commit-graph file and output basic +details about the graph file. Used for debugging purposes. + EXAMPLES @@ -45,6 +51,12 @@ EXAMPLES $ git commit-graph write +* Read basic information from the commit-graph file. ++ + +$ git commit-graph read + + GIT --- diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 26b6360289..efd39331d7 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -7,10 +7,16 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir ]"), + N_("git commit-graph read [--object-dir ]"), N_("git commit-graph write [--object-dir ]"), NULL }; +static const char * const builtin_commit_graph_read_usage[] = { + N_("git commit-graph read [--object-dir ]"), + NULL +}; + static const char * const builtin_commit_graph_write_usage[] = { N_("git commit-graph write [--object-dir ]"), NULL @@ -20,6 +26,54 @@ static struct opts_commit_graph { const char *obj_dir; } opts; +static int graph_read(int argc, const char **argv) +{ + struct commit_graph *graph = NULL; + char *graph_name; + + static struct option builtin_commit_graph_read_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_read_options, +builtin_commit_graph_read_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + + if (!graph) + die("graph file %s does not exist", graph_name); + FREE_AND_NULL(graph_name); + + printf("header: %08x %d %d %d %d\n", + ntohl(*(uint32_t*)graph->data), + *(unsigned char*)(graph->data + 4), + *(unsigned char*)(graph->data + 5), + *(unsigned char*)(graph->data + 6), + *(unsigned char*)(graph->data + 7)); + printf("num_commits: %u\n", graph->num_commits); + printf("chunks:"); + + if (graph->chunk_oid_fanout) + printf(" oid_fanout"); + if (graph->chunk_oid_lookup) + printf(" oid_lookup"); + if (graph->chunk_commit_data) + printf(" commit_metadata"); + if (graph->chunk_large_edges) + printf(" large_edges"); + printf("\n"); + + return 0; +} + static int graph_write(int argc, const char **argv) { static struct option builtin_commit_graph_write_options[] = { @@ -60,6 +114,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "read")) + return graph_read(argc, argv); if (!strcmp(argv[0], "write")) return graph_write(argc, argv); } diff --git a/commit-graph.c b/commit-graph.c index f3f7c4f189..b1bd3a892d 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -39,11 +39,146 @@ GRAPH_OID_LEN + 8) -static char *get_commit_graph_f
[PATCH v8 06/14] commit-graph: implement write_commit_graph()
From: Derrick Stolee <dsto...@microsoft.com> Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given object directory. Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Makefile | 1 + commit-graph.c | 359 + commit-graph.h | 6 + 3 files changed, 366 insertions(+) create mode 100644 commit-graph.c create mode 100644 commit-graph.h diff --git a/Makefile b/Makefile index a59b62bed1..26a23257e9 100644 --- a/Makefile +++ b/Makefile @@ -777,6 +777,7 @@ LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o LIB_OBJS += commit.o +LIB_OBJS += commit-graph.o LIB_OBJS += compat/obstack.o LIB_OBJS += compat/terminal.o LIB_OBJS += config.o diff --git a/commit-graph.c b/commit-graph.c new file mode 100644 index 00..f3f7c4f189 --- /dev/null +++ b/commit-graph.c @@ -0,0 +1,359 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "lockfile.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "revision.h" +#include "sha1-lookup.h" +#include "commit-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_OCTOPUS_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +#define GRAPH_FANOUT_SIZE (4 * 256) +#define GRAPH_CHUNKLOOKUP_WIDTH 12 +#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \ + GRAPH_OID_LEN + 8) + + +static char *get_commit_graph_filename(const char *obj_dir) +{ + return xstrfmt("%s/info/commit-graph", obj_dir); +} + +static void write_graph_chunk_fanout(struct hashfile *f, +struct commit **commits, +int nr_commits) +{ + int i, count = 0; + struct commit **list = commits; + + /* +* Write the first-level table (the list is sorted, +* but we use a 256-entry lookup to be able to avoid +* having to do eight extra binary search iterations). +*/ + for (i = 0; i < 256; i++) { + while (count < nr_commits) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + hashwrite_be32(f, count); + } +} + +static void write_graph_chunk_oids(struct hashfile *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + int count; + for (count = 0; count < nr_commits; count++, list++) + hashwrite(f, (*list)->object.oid.hash, (int)hash_len); +} + +static const unsigned char *commit_to_sha1(size_t index, void *table) +{ + struct commit **commits = table; + return commits[index]->object.oid.hash; +} + +static void write_graph_chunk_data(struct hashfile *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_extra_edges = 0; + + while (list < last) { + struct commit_list *parent; + int edge_value; + uint32_t packedDate[2]; + + parse_commit(*list); + hashwrite(f, (*list)->tree->object.oid.hash, hash_len); + + parent = (*list)->parents; + + if (!parent) + edge_value = GRAPH_PARENT_NONE; + else { + edge_value = sha1_pos(parent->item->object.oid.hash, + commits, + nr_commits, + commit_to_sha1); + + if (edge_value < 0) + edge_value = GRAPH_PARENT_MISSING; + } + + hashwrite_be32(f, edge_value); + +
[PATCH v8 02/14] csum-file: refactor finalize_hashfile() method
From: Derrick Stolee <dsto...@microsoft.com> If we want to use a hashfile on the temporary file for a lockfile, then we need finalize_hashfile() to fully write the trailing hash but also keep the file descriptor open. Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional change that checks this flag before writing the checksum to the stream. This differs from previous behavior since it would be written if either CSUM_CLOSE or CSUM_FSYNC is provided. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/pack-objects.c | 4 ++-- bulk-checkin.c | 2 +- csum-file.c| 8 csum-file.h| 5 +++-- pack-bitmap-write.c| 2 +- pack-write.c | 5 +++-- 6 files changed, 14 insertions(+), 12 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index ab3e80ee49..b09bbf4f4c 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -837,9 +837,9 @@ static void write_pack_file(void) * If so, rewrite it like in fast-import */ if (pack_to_stdout) { - finalize_hashfile(f, oid.hash, CSUM_CLOSE); + finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE); } else if (nr_written == nr_remaining) { - finalize_hashfile(f, oid.hash, CSUM_FSYNC); + finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); } else { int fd = finalize_hashfile(f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, pack_tmp_name, diff --git a/bulk-checkin.c b/bulk-checkin.c index 227cc9f3b1..70b14fdf41 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -35,7 +35,7 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) unlink(state->pack_tmp_name); goto clear_exit; } else if (state->nr_written == 1) { - finalize_hashfile(state->f, oid.hash, CSUM_FSYNC); + finalize_hashfile(state->f, oid.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); } else { int fd = finalize_hashfile(state->f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name, diff --git a/csum-file.c b/csum-file.c index e6c95a6915..53ce37f7ca 100644 --- a/csum-file.c +++ b/csum-file.c @@ -61,11 +61,11 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl the_hash_algo->final_fn(f->buffer, >ctx); if (result) hashcpy(result, f->buffer); - if (flags & (CSUM_CLOSE | CSUM_FSYNC)) { - /* write checksum and close fd */ + if (flags & CSUM_HASH_IN_STREAM) flush(f, f->buffer, the_hash_algo->rawsz); - if (flags & CSUM_FSYNC) - fsync_or_die(f->fd, f->name); + if (flags & CSUM_FSYNC) + fsync_or_die(f->fd, f->name); + if (flags & CSUM_CLOSE) { if (close(f->fd)) die_errno("%s: sha1 file error on close", f->name); fd = 0; diff --git a/csum-file.h b/csum-file.h index 9ba87f0a6c..c5a2e335e7 100644 --- a/csum-file.h +++ b/csum-file.h @@ -27,8 +27,9 @@ extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *) extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *); /* finalize_hashfile flags */ -#define CSUM_CLOSE 1 -#define CSUM_FSYNC 2 +#define CSUM_CLOSE 1 +#define CSUM_FSYNC 2 +#define CSUM_HASH_IN_STREAM4 extern struct hashfile *hashfd(int fd, const char *name); extern struct hashfile *hashfd_check(const char *name); diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 662b44f97d..db4c832428 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -535,7 +535,7 @@ void bitmap_writer_finish(struct pack_idx_entry **index, if (options & BITMAP_OPT_HASH_CACHE) write_hash_cache(f, index, index_nr); - finalize_hashfile(f, NULL, CSUM_FSYNC); + finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); if (adjust_shared_perm(tmp_file.buf)) die_errno("unable to make temporary bitmap file readable"); diff --git a/pack-write.c b/pack-write.c index 044f427392..a9d46bc03f 100644 --- a/pack-write.c +++ b/pack-write.c @@ -170,8 +170,9 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec } hashwrite(f, sha1, the_hash_algo->rawsz); - finalize_hashfile(f, NULL, ((opts->flags & WRITE_IDX_VERIFY) - ? CSUM_CLOSE : CSUM_FSYNC)); + finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE | +
[PATCH v8 03/14] commit-graph: add format document
From: Derrick Stolee <dsto...@microsoft.com> Add document specifying the binary format for commit graphs. This format allows for: * New versions. * New hash functions and hash lengths. * Optional extensions. Basic header information is followed by a binary table of contents into "chunks" that include: * An ordered list of commit object IDs. * A 256-entry fanout into that list of OIDs. * A list of metadata for the commits. * A list of "large edges" to enable octopus merges. The format automatically includes two parent positions for every commit. This favors speed over space, since using only one position per commit would cause an extra level of indirection for every merge commit. (Octopus merges suffer from this indirection, but they are very rare.) Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .../technical/commit-graph-format.txt | 97 +++ 1 file changed, 97 insertions(+) create mode 100644 Documentation/technical/commit-graph-format.txt diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 00..ad6af8105c --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,97 @@ +Git commit graph format +=== + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as unsigned 32-bit integers +corresponding to the array position withing the list of commit OIDs. We +use the most-significant bit for special purposes, so we can store at most +(1 << 31) - 1 (around 2 billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte (reserved for later use) + Current clients should ignore this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + ID appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) +* The first H bytes are for the OID of the root tree. +* The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0x if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. +* The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through
[PATCH v8 05/14] commit-graph: create git-commit-graph builtin
From: Derrick Stolee <dsto...@microsoft.com> Teach git the 'commit-graph' builtin that will be used for writing and reading packed graph files. The current implementation is mostly empty, except for an '--object-dir' option. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- .gitignore | 1 + Documentation/git-commit-graph.txt | 10 +++ Makefile | 1 + builtin.h | 1 + builtin/commit-graph.c | 36 ++ command-list.txt | 1 + contrib/completion/git-completion.bash | 2 ++ git.c | 1 + 8 files changed, 53 insertions(+) create mode 100644 Documentation/git-commit-graph.txt create mode 100644 builtin/commit-graph.c diff --git a/.gitignore b/.gitignore index 833ef3b0b7..e82f90184d 100644 --- a/.gitignore +++ b/.gitignore @@ -34,6 +34,7 @@ /git-clone /git-column /git-commit +/git-commit-graph /git-commit-tree /git-config /git-count-objects diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt new file mode 100644 index 00..f3b34622a8 --- /dev/null +++ b/Documentation/git-commit-graph.txt @@ -0,0 +1,10 @@ +git-commit-graph(1) +=== + +NAME + +git-commit-graph - Write and verify Git commit graph files + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index a1d8775adb..a59b62bed1 100644 --- a/Makefile +++ b/Makefile @@ -952,6 +952,7 @@ BUILTIN_OBJS += builtin/clone.o BUILTIN_OBJS += builtin/column.o BUILTIN_OBJS += builtin/commit-tree.o BUILTIN_OBJS += builtin/commit.o +BUILTIN_OBJS += builtin/commit-graph.o BUILTIN_OBJS += builtin/config.o BUILTIN_OBJS += builtin/count-objects.o BUILTIN_OBJS += builtin/credential.o diff --git a/builtin.h b/builtin.h index 42378f3aa4..079855b6d4 100644 --- a/builtin.h +++ b/builtin.h @@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const char *prefix); extern int cmd_clean(int argc, const char **argv, const char *prefix); extern int cmd_column(int argc, const char **argv, const char *prefix); extern int cmd_commit(int argc, const char **argv, const char *prefix); +extern int cmd_commit_graph(int argc, const char **argv, const char *prefix); extern int cmd_commit_tree(int argc, const char **argv, const char *prefix); extern int cmd_config(int argc, const char **argv, const char *prefix); extern int cmd_count_objects(int argc, const char **argv, const char *prefix); diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c new file mode 100644 index 00..b466ecd781 --- /dev/null +++ b/builtin/commit-graph.c @@ -0,0 +1,36 @@ +#include "builtin.h" +#include "config.h" +#include "parse-options.h" + +static char const * const builtin_commit_graph_usage[] = { + N_("git commit-graph [--object-dir ]"), + NULL +}; + +static struct opts_commit_graph { + const char *obj_dir; +} opts; + + +int cmd_commit_graph(int argc, const char **argv, const char *prefix) +{ + static struct option builtin_commit_graph_options[] = { + OPT_STRING(0, "object-dir", _dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + if (argc == 2 && !strcmp(argv[1], "-h")) + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); + + git_config(git_default_config, NULL); + argc = parse_options(argc, argv, prefix, +builtin_commit_graph_options, +builtin_commit_graph_usage, +PARSE_OPT_STOP_AT_NON_OPTION); + + usage_with_options(builtin_commit_graph_usage, + builtin_commit_graph_options); +} diff --git a/command-list.txt b/command-list.txt index a1fad28fd8..835c5890be 100644 --- a/command-list.txt +++ b/command-list.txt @@ -34,6 +34,7 @@ git-clean mainporcelain git-clone mainporcelain init git-column purehelpers git-commit mainporcelain history +git-commit-graphplumbingmanipulators git-commit-tree plumbingmanipulators git-config ancillarymanipulators git-count-objects ancillaryinterrogators diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash index b09c8a2362..6726daaf69 100644 --- a/contrib/completion/git-completion.bash +++ b/contrib/completion/git-completion.bash @@ -878,6 +878,7 @@ __git_list_porcelain_commands () check-ref-format) : plumbing;
[PATCH v8 01/14] csum-file: rename hashclose() to finalize_hashfile()
From: Derrick Stolee <dsto...@microsoft.com> The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/index-pack.c | 2 +- builtin/pack-objects.c | 6 +++--- bulk-checkin.c | 4 ++-- csum-file.c| 2 +- csum-file.h| 4 ++-- fast-import.c | 2 +- pack-bitmap-write.c| 2 +- pack-write.c | 4 ++-- 8 files changed, 13 insertions(+), 13 deletions(-) diff --git a/builtin/index-pack.c b/builtin/index-pack.c index bda84a92ef..8bcf280e0b 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1270,7 +1270,7 @@ static void conclude_pack(int fix_thin_pack, const char *curr_pack, unsigned cha nr_objects - nr_objects_initial); stop_progress_msg(, msg.buf); strbuf_release(); - hashclose(f, tail_hash, 0); + finalize_hashfile(f, tail_hash, 0); hashcpy(read_hash, pack_hash); fixup_pack_header_footer(output_fd, pack_hash, curr_pack, nr_objects, diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index e9d3cfb9e3..ab3e80ee49 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -837,11 +837,11 @@ static void write_pack_file(void) * If so, rewrite it like in fast-import */ if (pack_to_stdout) { - hashclose(f, oid.hash, CSUM_CLOSE); + finalize_hashfile(f, oid.hash, CSUM_CLOSE); } else if (nr_written == nr_remaining) { - hashclose(f, oid.hash, CSUM_FSYNC); + finalize_hashfile(f, oid.hash, CSUM_FSYNC); } else { - int fd = hashclose(f, oid.hash, 0); + int fd = finalize_hashfile(f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, pack_tmp_name, nr_written, oid.hash, offset); close(fd); diff --git a/bulk-checkin.c b/bulk-checkin.c index 9d87eac07b..227cc9f3b1 100644 --- a/bulk-checkin.c +++ b/bulk-checkin.c @@ -35,9 +35,9 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state) unlink(state->pack_tmp_name); goto clear_exit; } else if (state->nr_written == 1) { - hashclose(state->f, oid.hash, CSUM_FSYNC); + finalize_hashfile(state->f, oid.hash, CSUM_FSYNC); } else { - int fd = hashclose(state->f, oid.hash, 0); + int fd = finalize_hashfile(state->f, oid.hash, 0); fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name, state->nr_written, oid.hash, state->offset); diff --git a/csum-file.c b/csum-file.c index 5eda7fb6af..e6c95a6915 100644 --- a/csum-file.c +++ b/csum-file.c @@ -53,7 +53,7 @@ void hashflush(struct hashfile *f) } } -int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags) +int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int flags) { int fd; diff --git a/csum-file.h b/csum-file.h index 992e5c0141..9ba87f0a6c 100644 --- a/csum-file.h +++ b/csum-file.h @@ -26,14 +26,14 @@ struct hashfile_checkpoint { extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *); extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *); -/* hashclose flags */ +/* finalize_hashfile flags */ #define CSUM_CLOSE 1 #define CSUM_FSYNC 2 extern struct hashfile *hashfd(int fd, const char *name); extern struct hashfile *hashfd_check(const char *name); extern struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp); -extern int hashclose(struct hashfile *, unsigned char *, unsigned int); +extern int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int); extern void hashwrite(struct hashfile *, const void *, unsigned int); extern void hashflush(struct hashfile *f); extern void crc32_begin(struct hashfile *); diff --git a/fast-import.c b/fast-import.c index b5db5d20b1..6d96f55d9d 100644 --- a/fast-import.c +++ b/fast-import.c @@ -1016,7 +1016,7 @@ static void end_packfile(void) struct tag *t; close_pack_windows(pack_data); - hashclose(pack_file, cur_pack_oid.hash, 0); + finalize_hashfile(pack_file, cur_pack_oid.hash, 0); fixup_pack_header_footer(pack_data->pack_fd, pack_data->sha1,
[PATCH v8 00/14] Serialized Git Commit Graph
This version covers the review that I missed when rerolling v7. The file diff is below from previous version, and also PATCH 14/14 was reworded to use "--append" properly. diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 41c4f76caf..37420ae0fd 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -31,7 +31,7 @@ static struct opts_commit_graph { static int graph_read(int argc, const char **argv) { - struct commit_graph *graph = 0; + struct commit_graph *graph = NULL; char *graph_name; static struct option builtin_commit_graph_read_options[] = { diff --git a/commit-graph.c b/commit-graph.c index 1fc63d541b..3ff8c84c0e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -179,7 +179,7 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) } /* global storage */ -struct commit_graph *commit_graph = NULL; +static struct commit_graph *commit_graph = NULL; static void prepare_commit_graph_one(const char *obj_dir) { -- >8 -- This patch contains a way to serialize the commit graph. The current implementation defines a new file format to store the graph structure (parent relationships) and basic commit metadata (commit date, root tree OID) in order to prevent parsing raw commits while performing basic graph walks. For example, we do not need to parse the full commit when performing these walks: * 'git log --topo-order -1000' walks all reachable commits to avoid incorrect topological orders, but only needs the commit message for the top 1000 commits. * 'git merge-base ' may walk many commits to find the correct boundary between the commits reachable from A and those reachable from B. No commit messages are needed. * 'git branch -vv' checks ahead/behind status for all local branches compared to their upstream remote branches. This is essentially as hard as computing merge bases for each. The current patch speeds up these calculations by injecting a check in parse_commit_gently() to check if there is a graph file and using that to provide the required metadata to the struct commit. The file format has room to store generation numbers, which will be provided as a patch after this framework is merged. Generation numbers are referenced by the design document but not implemented in order to make the current patch focus on the graph construction process. Once that is stable, it will be easier to add generation numbers and make graph walks aware of generation numbers one-by-one. By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks. Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits. | Command | Before | After | Rel % | |--|||---| | log --oneline --topo-order -1000 | 8.31s | 0.94s | -88% | | branch -vv | 1.02s | 0.14s | -86% | | rev-list --all | 5.89s | 1.07s | -81% | | rev-list --all --objects | 66.15s | 58.45s | -11% | To test this yourself, run the following on your repo: git config core.commitGraph true git show-ref -s | git commit-graph write --stdin-commits The second command writes a commit graph file containing every commit reachable from your refs. Now, all git commands that walk commits will check your graph first before consulting the ODB. You can run your own performance comparisons by toggling the 'core.commitGraph' setting. [1] https://github.com/derrickstolee/git/pull/2 A GitHub pull request containing the latest version of this patch. Derrick Stolee (14): csum-file: rename hashclose() to finalize_hashfile() csum-file: refactor finalize_hashfile() method commit-graph: add format document graph: add commit graph design document commit-graph: create git-commit-graph builtin commit-graph: implement write_commit_graph() commit-graph: implement git-commit-graph write commit-graph: implement git commit-graph read commit-graph: add core.commitGraph setting commit-graph: close under reachability commit: integrate commit graph with commit parsing commit-graph: read only from specific pack-indexes commit-graph: build graph from starting commits commit-graph: implement "--append" option .gitignore| 1 + Documentation/config.txt | 4 + Documentation/git-commit-graph.txt| 94 +++ .../technical/commit-graph-format.txt | 97 +++ Documentation/technical/commit-graph.txt | 163 Makefile | 2 + alloc.c | 1 + builtin.h | 1 + builtin/commit-graph.c| 171 builtin/index-pack.c | 2 +-
Re: What's cooking in git.git (Apr 2018, #01; Mon, 9)
On 4/10/2018 3:21 PM, Ramsay Jones wrote: On 10/04/18 13:57, Derrick Stolee wrote: On 4/9/2018 6:08 PM, Junio C Hamano wrote: I guess we'd want a final cleaned-up round after all ;-) Thanks. v8 sent [1]. Thanks. I just tried to 'git am' this series and I could not get it to apply cleanly to current 'master', 'next' or 'pu'. I fixed up a few patches, but just got bored ... ;-) In my cover letter, I did mention that my patch started here (using 'format-patch --base'): base-commit: 468165c1d8a442994a825f3684528361727cd8c0 This corresponds to v2.17.0. Thanks, -Stolee
Re: [PATCH v8 03/14] commit-graph: add format document
On 4/10/2018 3:10 PM, Stefan Beller wrote: Hi Derrick, On Tue, Apr 10, 2018 at 5:55 AM, Derrick Stolee <sto...@gmail.com> wrote: + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). I was about to give this series one last read not expecting any questions to come up (this series has had a lot of feedback already!) Although I just did. What were your design considerations for the fanout table? Did you include it as the pack index has one or did you come up with them from first principles? Have you measured the performance impact of the fanout table (maybe even depending on the size of the fanout) ? context: https://public-inbox.org/git/CAJo=hJsto1ik=gtc8c3+2_jbuuqcapl0uwp-1uoyympgblb...@mail.gmail.com/ (side note: searching the web for fanout makes it seem as if it is git-lingo, apparently the term is not widely used) I don't think we want to restart the design discussion, I am just curious. I knew that I wanted some amount of a fanout table, and the 256-entry one was used for IDX files (and in my MIDX RFC). With the recent addition of "packfile: refactor hash search with fanout table" [1] it is probably best to keep the 256-entry table to reduce code clones. As for speed, we have the notion of 'graph_pos' which gives random access into the commit-graph after a commit is loaded as a parent of a commit from the commit-graph file. Thus, we are spending time in the binary search only for commits that do not exist in the commit-graph file and those that are first found in the file. Thus, running profilers on long commit-graph walks do not show any measurable time spent in 'bsearch_graph()'. Thanks, -Stolee [1] https://github.com/gitster/git/commit/b4e00f7306a160639f047b3421985e8f3d0c6fb1
Re: [PATCH 0/3] Lazy-load trees when reading commit-graph
On 4/3/2018 4:20 PM, Jeff King wrote: On Tue, Apr 03, 2018 at 09:14:50AM -0400, Derrick Stolee wrote: I'm not sure what the exact solution would be, but I imagine something like variable-sized "struct commit"s with the parent pointers embedded, with some kind of flag to indicate the number of parents (and probably some fallback to break out to a linked list for extreme cases of more than 2 parents). It may end up pretty invasive, though, as there's a lot of open-coded traversals of that parent list. Anyway, not anything to do with this patch, but food for thought as you micro-optimize these traversals. One other thing that I've been thinking about is that 'struct commit' is so much bigger than the other structs in 'union any_object'. This means that the object cache, which I think creates blocks of 'union any_object' for memory-alignment reasons, is overly bloated. This would be especially true when walking many more trees than commits. Perhaps there are other memory concerns in that case (such as cached buffers) that the 'union any_object' is not a concern, but it is worth thinking about as we brainstorm how to reduce the parent-list memory. It definitely bloats any_object, but I don't think we typically allocate too many of those. Those should only come from lookup_unknown_object(), but typically we'll come across objects by traversing the graph, in which case we have an expectation of the type (and use the appropriate lookup_foo() function, which uses the type-specific block allocators). Thanks for the clarification here. I'm glad I was wrong about how often any_object is used. Thanks, -Stolee
Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
On 4/4/2018 11:45 AM, Derrick Stolee wrote: The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 858f4fdbc9..2566cba79f 100644 --- a/commit.c +++ b/commit.c @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_UNDEF; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) This patch may suffice to speed up 'git branch --contains' instead of needing to always use the 'git tag --contains' algorithm as considered in [1]. Thanks, -Stolee [1] https://public-inbox.org/git/20180303051516.ge27...@sigill.intra.peff.net/ Re: [PATCH 0/4] Speed up git tag --contains
[PATCH 7/6] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 26 +- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..b147b1d0ee 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_UNDEF; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + if (cutoff == GENERATION_NUMBER_UNDEF) + cutoff = GENERATION_NUMBER_NONE; + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.rc0
[PATCH 8/6] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 858f4fdbc9..2566cba79f 100644 --- a/commit.c +++ b/commit.c @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_UNDEF; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.rc0
Re: [PATCH 7/6] ref-filter: use generation number for --contains
On 4/4/2018 3:16 PM, Jeff King wrote: On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote: @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_UNDEF; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } Now that you mention it, let me split out the portion you are probably talking about as incorrect: + if (cutoff == GENERATION_NUMBER_UNDEF) + cutoff = GENERATION_NUMBER_NONE; You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk will be limited to commits NOT in the commit-graph (which we hope is small if proper hygiene is followed). I think it's more than that. If we leave it at UNDEF, that's wrong, because contains_test() compares: candidate->generation < cutoff which would _always_ be true. In other words, we're saying that our "want" has an insanely high generation number, and traversing can never find it. Which is clearly wrong. That condition is not always true (which is why we use strict comparison instead of <=). If a commit is not in the commit-graph file, then its generation is equal to GENERATION_NUMBER_UNDEF, as shown in alloc.c: void *alloc_commit_node(void) { struct commit *c = alloc_node(_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; c->generation = GENERATION_NUMBER_UNDEF; return c; } So we have to put it at "0", to say "you should always traverse, we can't tell you that this is a dead end". So that part of the logic is currently correct. But what I was getting at is that the loop behavior can't just pick the min cutoff. The min is effectively "0" if there's even a single ref for which we don't have a generation number, because we cannot ever stop traversing (we might get to that commit if we kept going). (It's also possible I'm confused about how UNDEF and NONE are used; I'm assuming commits for which we don't have a generation number available would get UNDEF in their commit->generation field). I think it is this case. If you could make the assumption that when we have a generation for commit X, then we have a generation for all of its ancestors, things get easier. Because then if you hit commit X with a generation number and want to compare it to a cutoff, you know that either: 1. The cutoff is defined, in which case you can stop traversing if we've gone past the cutoff. 2. The cutoff is undefined, in which case we cannot possibly reach our "want" by traversing. Even if it has a smaller generation number than us, it's on an unrelated line of development. I don't know that the reachability property is explicitly promised by your work, but it seems like it would be a natural fallout (after all, you have to know the generation of each ancestor in order to compute the later ones, so you're really just promising that you've actually stored all the ones you've computed). The commit-graph is closed under reachability, so if a commit has a generation number then it is in the graph and so are all its ancestors. The reason for GENERATION_NUMBER_NONE is that the commit-graph file stores "0" for generation number until this patch. It still satisfies the condition that gen(A) < gen(B) if B can reach A, but also gives us a condition for "this commit still needs its generation number computed". I wonder to what degree it's worth traversing to come up with a generation number for the "want" commits. If we walked, say, 50 commits to do it, you'd probably save a lot of work (since the alternative is walking thousands of commits until you realize that some ancient "v1.0" tag is not useful). I'd actually go so far as to say that any amount of traversal is generally going to be worth it to come up with the correct generation cutoff here. You can come up with pathological cases where you only have one really recent tag or something, but in practice every repository where performance is a concern is going to end up with refs much further back than it would take to reach the cutoff condition. Perhaps there is some value in walking to find the correct cutoff value, but it is difficult to determine how fa
Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()
On 4/4/2018 2:24 PM, Jeff King wrote: On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote: diff --git a/commit.c b/commit.c index 858f4fdbc9..2566cba79f 100644 --- a/commit.c +++ b/commit.c @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_UNDEF; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) This patch may suffice to speed up 'git branch --contains' instead of needing to always use the 'git tag --contains' algorithm as considered in [1]. I guess I want to specify: the only reason to NOT switch to the tags algorithm is because it _may_ hurt existing cases in certain data shapes... I'd have to do some timings, but I suspect we may want to switch to the "tag --contains" algorithm anyway. This still does N independent merge-base operations, one per ref. So with enough refs, you're still better off throwing it all into one big traversal. ...and I suppose your timings are to find out if there are data shapes where the branch algorithm is faster. Perhaps that is impossible now that we have the generation number cutoff for the tag algorithm. Since the branch algorithm checks generation numbers before triggering pain_down_to_common(), we will do N independent merge-base calculations, where N is the number of branches with large enough generation numbers (which is why my test does so well: most are below the target generation number). This doesn't help at all if none of the refs are in the graph. The other thing to do is add a minimum generation for the walk in paint_down_to_common() so even if commit->generation <= min_generation we still only walk down to commit->generation instead of all merge bases. This is something we could change in a later patch. Patches 7 and 8 seem to me like simple changes with no downside UNLESS we are deciding instead to delete the code I'm changing. Thanks, -Stolee
Re: [PATCH 7/6] ref-filter: use generation number for --contains
On 4/4/2018 2:22 PM, Jeff King wrote: On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote: @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_UNDEF; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } Now that you mention it, let me split out the portion you are probably talking about as incorrect: + if (cutoff == GENERATION_NUMBER_UNDEF) + cutoff = GENERATION_NUMBER_NONE; You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk will be limited to commits NOT in the commit-graph (which we hope is small if proper hygiene is followed). Hmm, on reflection, I'm not sure if this is right in the face of multiple "want" commits, only some of which have generation numbers. We probably want to disable the cutoff if _any_ "want" commit doesn't have a number. There's also an obvious corner case where this won't kick in, and you'd really like it to: recently added commits. E.g,. if I do this: git gc ;# imagine this writes generation numbers git pull git tag --contains HEAD then HEAD isn't going to have a generation number. But this is the case where we have the most to gain, since we could throw away all of the ancient tags immediately upon seeing that their generation numbers are way less than that of HEAD. I wonder to what degree it's worth traversing to come up with a generation number for the "want" commits. If we walked, say, 50 commits to do it, you'd probably save a lot of work (since the alternative is walking thousands of commits until you realize that some ancient "v1.0" tag is not useful). I'd actually go so far as to say that any amount of traversal is generally going to be worth it to come up with the correct generation cutoff here. You can come up with pathological cases where you only have one really recent tag or something, but in practice every repository where performance is a concern is going to end up with refs much further back than it would take to reach the cutoff condition. Perhaps there is some value in walking to find the correct cutoff value, but it is difficult to determine how far we are from commits with correct generation numbers _a priori_. I'd rather rely on the commit-graph being in a good state, not too far behind the refs. An added complexity of computing generation numbers dynamically is that we would need to add a dependence on the commit-graph file's existence at all. Thanks, -Stolee
Re: [PATCH v2 04/10] commit-graph: compute generation numbers
On 4/10/2018 10:51 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + if ((*list)->generation > GENERATION_NUMBER_MAX) + die("generation number %u is too large to store in commit-graph", + (*list)->generation); + packedDate[0] |= htonl((*list)->generation << 2); + } How serious do we want this feature to be? On one extreme, we could be irresponsible and say it will be a problem for our descendants in the future if their repositories have more than billion pearls on a single strand, and the above certainly is a reasonable way to punt. Those who actually encounter the problem will notice by Git dying somewhere rather deep in the callchain. Or we could say Git actually does support a history that is arbitrarily long, even though such a deep portion of history will not benefit from having generation numbers in commit-graph. I've been assuming that our stance is the latter and that is why I made noises about overflowing 30-bit generation field in my review of the previous step. In case we want to do the "we know this is very large, but we do not know the exact value", we may actually want a mode where we can pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and make sure that the code to handle overflow behaves sensibly. I agree. I wonder how we can effectively expose this value into a test. It's probably not sufficient to manually test using compiler flags ("-D GENERATION_NUMBER_MAX=8"). + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { +... + } + } So we go over the list of commits just _once_ and make sure each of them gets the generation assigned correctly by (conceptually recursively but iteratively in implementation by using a commit list) making sure that all its parents have generation assigned and compute the generation for the commit, before moving to the next one. Which sounds correct. Yes, we compute the generation number of a commit exactly once. We use the list as a stack so we do not have recursion limits during our depth-first search (DFS). We rely on the object cache to ensure we store the computed generation numbers, and computed generation numbers provide termination conditions to the DFS. Thanks, -Stolee
Re: [PATCH v2 02/10] merge: check config before loading commits
On 4/10/2018 10:12 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: diff --git a/builtin/merge.c b/builtin/merge.c index ee050a47f3..20897f8223 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); Wow, that's tricky. git_merge_config() wants to know which "branch" we are on, and this place is as early as we can move the call to without breaking things. Is this to allow parse_object() called in lookup_commit_reference_gently() to know if we can rely on the data cached in the commit-graph data? When I saw the bug on my machine, I tracked the issue down to a call to parse_commit_in_graph() that skipped the graph check since core_commit_graph was not set. The call stack from this call is as follows: * lookup_commit_or_die() * lookup_commit_reference() * lookup_commit_reference_gently() * parse_object() * parse_object_buffer() * parse_commit_in_graph() [as introduced in PATCH 01/10] Move the config load to be between the initialization of 'branch' and the commit lookup. Also add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. It is not clear to me how a successful merge of commits/8 demonstrates that reading the config earlier than before is regression free. I didn't want to introduce commits in an order that led to a commit failing tests, but if you drop the change to builtin/merge.c from this series, the tip commit will fail this test with "BUG: bad generation skip". The reason for this failure is that commits/5 is loaded from HEAD from the object database, so its generation is marked as GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, the commit at merges/3 is loaded from the graph with generation 4. This triggers the BUG statement in paint_down_to_common(). That is why it is important to check a fast-forward merge. In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were already testing 'git merge-base' to check the commit walk logic. Thanks, -Stolee
Re: [PATCH v2 03/10] commit: add generation number to struct commmit
On 4/10/2018 10:31 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use two special values to signal the generation number is invalid: GENERATION_NUMBER_ININITY 0x GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_ZERO) means the generation number was loaded from a commit graph file that was stored before generation numbers were computed. Should it also be possible for a caller to tell if a given commit has too deep a history, i.e. we do not know its generation number exactly, but we know it is larger than 1<<30? It seems that we only have a 30-bit field in the file, so wouldn't we need a special value defined in (e.g. "0") so that we can tell that the commit has such a large generation number? E.g. + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; if (!item->generation) item->generation = GENERATION_NUMBER_OVERFLOW; when we read it from the file? We obviously need to do something similar when assigning a generation number to a child commit, perhaps like #define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1) commit->generation = 1; /* assume no parent */ for (p = commit->parents; p; p++) { uint32_t gen = p->item->generation + 1; if (gen >= GENERATION_NUMBER_OVERFLOW) { commit->generation = GENERATION_NUMBER_OVERFLOW; break; } else if (commit->generation < gen) commit->generation = gen; } or something? And then on the writing side you'd encode too large a generation as '0'. You raise a very good point. How about we do a slightly different arrangement for these overflow commits? Instead of storing the commits in the commit-graph file as "0" (which currently means "written by a version of git that did not compute generation numbers") we could let GENERATION_NUMBER_MAX be the maximum generation of a commit in the commit-graph, and if a commit would have larger generation, we collapse it down to that value. It slightly complicates the diagram I made in Documentation/technical/commit-graph.txt, but it was already a bit of a simplification. Here is an updated diagram, but likely we will want to limit discussion of the special-case GENERATION_NUMBER_MAX to the prose, since it is not a practical situation at the moment. +-+ | GENERATION_NUMBER_INFINITY = 0x | +-+ | | | ^ | | | | | | +--+ | | [gen(A) = gen(B)] | V | ++ | | GENERATION_NUMBER_MAX = 0x3FFF | | ++ | | | ^ | | | | | | +--+ | | [gen(A) = gen(B)] V V +-+ | 0 < commit->generation < 0x3FFF | +-+ | | ^ | | | | +--+ | [gen(A) > gen(B)] V +-+ | GENERATION_NUMBER_ZERO = 0 | +-+ | ^ | | +--+ [gen(A) = gen(B)] Thanks, -Stolee
Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk
On 4/10/2018 11:02 PM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return result; } prio_queue_put(, one); + if (one->generation < min_nonstale_gen) + min_nonstale_gen = one->generation; for (i = 0; i < n; i++) { twos[i]->object.flags |= PARENT2; prio_queue_put(, twos[i]); + if (twos[i]->generation < min_nonstale_gen) + min_nonstale_gen = twos[i]->generation; } - while (queue_has_nonstale()) { + while (queue_has_nonstale(, min_nonstale_gen)) { struct commit *commit = prio_queue_get(); struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + + last_gen = commit->generation; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return NULL; p->object.flags |= flags; Hmph. Can a commit that used to be not stale (and contributed to the current value of min_nonstale_gen) become stale here by getting visited twice, invalidating the value in min_nonstale_gen? min_nonstale_gen can be "wrong" in the way you say, but fits the definition from the commit message: "To properly take advantage of this condition, track the minimum generation number of a commit that **enters the queue** with nonstale status." (Emphasis added) You make an excellent point about how this can be problematic. I was confused by the lack of clear performance benefits here, but I think that whatever benefits making queue_has_nonstale() be O(1) were removed by walking more commits than necessary. Consider the following commit graph, where M is a parent of both A and B, S is a parent of M and B, and there is a large set of commits reachable from M with generation number larger than gen(S). A B | __/| |/ | M | |\ | . | | . | | . |_/ |/ S Between A and B, the true merge base is M. Anything reachable from M is marked as stale. When S is added to the queue, it is only reachable from B, so it is non-stale. However, it is marked stale after M is walked. The old code would detect this as a termination condition, but the new code would not. I think this data shape is actually common (not exactly, as it may be that some ancestor of M provides a second path to S) especially in the world of pull requests and users merging master into their topic branches. I'll remove this commit in the next version, but use the new prototype for queue_has_nonstale() in "commit: add short-circuit to paint_down_to_common()" using the given 'min_generation' instead of 'min_nonstale_gen'. Thanks, -Stolee
Re: [PATCH resend] SubmittingPatches: mention the git contacts command
On 4/11/2018 4:20 PM, Thomas Gummerer wrote: Instead of just mentioning 'git blame' and 'git shortlog', which make it quite hard for new contributors to pick out the appropriate list of people to cc on their patch series, mention the 'git contacts' utility, which makes it much easier to get a reasonable list of contacts for a change. This should help new contributors pick out a reasonable cc list by simply using a single command. Signed-off-by: Thomas Gummerer--- I've originally sent this at <20180316213323.GC2224@hank>, during an the rc period. Eric had some comments, which I interpreted as being okay with the change (hope I'm not mistaken there :)). As I still think this would be an improvement for new contributors, I'm resending it here. I didn't know about this tool, and it seems helpful. I plan to use it now. Thanks! Documentation/SubmittingPatches | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index a1d0feca36..945f8edb46 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -260,8 +260,8 @@ that starts with `-BEGIN PGP SIGNED MESSAGE-`. That is not a text/plain, it's something else. Send your patch with "To:" set to the mailing list, with "cc:" listing -people who are involved in the area you are touching (the output from -`git blame $path` and `git shortlog --no-merges $path` would help to +people who are involved in the area you are touching (the `git +contacts` command in `contrib/contacts/` can help to identify them), to solicit comments and reviews. :1: footnote:[The current maintainer: gits...@pobox.com]
Re: [PATCHv3 00/15] replace_object.c: rename to use dash in file name
On 4/11/2018 8:21 PM, Stefan Beller wrote: v3: * interdiff below, the only changes are renaming the variable - struct ref_store *main_ref_store; + struct ref_store *refs; in struct repository. as well as dropping the file rename patch. * improved commit messages from discussion on the single patches. Looks good to me. Thanks! -Stolee
Re: [PATCH v8 03/14] commit-graph: add format document
On 4/11/2018 4:58 PM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) I think it is a typo, and it should be CDAT, not CGET (CDAT seem to me to stand for Commit DATa): + Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) This is what you use in actual implementation, in PATCH v8 06/14 DS> +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ DS> +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ DS> +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ DS> +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ DS> +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ Documentation bugs are hard to diagnose. Thanks for finding this. I double checked that the hex int "0x43444154" matches "CDAT". Here is a diff to make it match. diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt index ad6af8105c..af03501834 100644 --- a/Documentation/technical/commit-graph-format.txt +++ b/Documentation/technical/commit-graph-format.txt @@ -70,7 +70,7 @@ CHUNK DATA: OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) The OIDs for all commits in the graph, sorted in ascending order. - Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) * The first H bytes are for the OID of the root tree. * The next 8 bytes are for the positions of the first two parents of the ith commit. Stores value 0x if no parent in that
Re: [PATCH v2 07/10] commit-graph.txt: update future work
On 4/12/2018 5:12 AM, Junio C Hamano wrote: Derrick Stolee <dsto...@microsoft.com> writes: +Here is a diagram to visualize the shape of the full commit graph, and +how different generation numbers relate: + ++-+ +| GENERATION_NUMBER_INFINITY = 0x | ++-+ + || ^ + || | + |+--+ + | [gen(A) = gen(B)] + V ++-+ +| 0 < commit->generation < 0x4000 | ++-+ + || ^ + || | + |+--+ + |[gen(A) > gen(B)] + V ++-+ +| GENERATION_NUMBER_ZERO = 0 | ++-+ +| ^ +| | ++--+ +[gen(A) = gen(B)] It may be just me but all I can read out of the above is that commit->generation may store 0x, a value between 0 and 0x4000, or 0. I cannot quite tell what the notation [gen(A) gen(B)] is trying to say. I am guessing "Two generation numbers within the 'valid' range can be compared" is what the second one is trying to say, but it is much less interesting to know that two infinities compare equal than how generation numbers from different classes compare, which cannot be depicted in the above notation, I am afraid. For example, don't we want to say that a commit with INF can never be reached by a commit with a valid generation number, or something like that? My intention with the arrows was to demonstrate where parent relationships can go, and the generation-number relation between a commit A with parent B. Clearly, this diagram is less than helpful. Design Details -- @@ -98,17 +141,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: Good.
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/11/2018 3:32 PM, Jakub Narebski wrote: What would you suggest as a good test that could imply performance? The Google Colab notebook linked to above includes a function to count number of commits (nodes / vertices in the commit graph) walked, currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? 2. What is the set of merge-bases between X and Y? And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. I have tried finding number of false positives for level (generation number) filter and for FELINE index, and number of false negatives for min-post intervals in the spanning tree (for DFS tree) for 1 randomly selected pairs of commits... but I don't think this is a good benchmark. What is a false-positive? A case where gen(X) < gen(Y) but Y cannot reach X? I do not think that is a great benchmark, but I guess it is something to measure. I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) that has 750832 nodes and 811733 edges, and 563747941392 possible directed pairs, we have for 1 randomly selected pairs of commits: level-filter has91 = 0.91% [all] false positives FELINE index has78 = 0.78% [all] false positives FELINE index has 1.16667 less false positives than level filter min-post spanning-tree intervals has 3641 = 36.41% [all] false negatives Perhaps something you can do instead of sampling from N^2 commits in total is to select a pair of generations (say, G = 2, G' = 20100) or regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) and see how many false positives you see by testing all pairs (one from each level). The delta between the generations may need to be smaller to actually have a large proportion of unreachable pairs. Try different levels, since major version releases tend to "pinch" the commit graph to a common history. For git.git repository (https://github.com/git/git.git) that has 52950 nodes and 65887 edges the numbers are slighly more in FELINE index favor (also out of 1 random pairs): level-filter has 504 = 9.11% false positives FELINE index has 125 = 2.26% false positives FELINE index has 4.032 less false positives than level filter This is for FELINE which does not use level / generatio-numbers filter. Thanks, -Stolee
[PATCH v2 04/10] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 46 ++ 1 file changed, 46 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index d24b947525..5fd63acc31 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + if ((*list)->generation > GENERATION_NUMBER_MAX) + die("generation number %u is too large to store in commit-graph", + (*list)->generation); + packedDate[0] |= htonl((*list)->generation << 2); + } + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); -- 2.17.0
[PATCH v2 07/10] commit-graph.txt: update future work
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the two "special" generation numbers GENERATION_NUMBER_INFINITY and *_ZERO interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 50 +--- 1 file changed, 44 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..a8df0ae9db 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,49 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +Here is a diagram to visualize the shape of the full commit graph, and +how different generation numbers relate: + ++-+ +| GENERATION_NUMBER_INFINITY = 0x | ++-+ + || ^ + || | + |+--+ + | [gen(A) = gen(B)] + V ++-+ +| 0 < commit->generation < 0x4000 | ++-+ + || ^ + || | + |+--+ + |[gen(A) > gen(B)] + V ++-+ +| GENERATION_NUMBER_ZERO = 0 | ++-+ +| ^ +| | ++--+ +[gen(A) = gen(B)] + Design Details -- @@ -98,17 +141,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0
[PATCH v2 06/10] commit.c: use generation to halt paint walk
In paint_down_to_common(), the walk is halted when the queue contains only stale commits. The queue_has_nonstale() method iterates over the entire queue looking for a nonstale commit. In a wide commit graph where the two sides share many commits in common, but have deep sets of different commits, this method may inspect many elements before finding a nonstale commit. In the worst case, this can give quadratic performance in paint_down_to_common(). Convert queue_has_nonstale() to use generation numbers for an O(1) termination condition. To properly take advantage of this condition, track the minimum generation number of a commit that enters the queue with nonstale status. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 37 ++--- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/commit.c b/commit.c index 95ae7e13a3..00bdc2ab21 100644 --- a/commit.c +++ b/commit.c @@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT); -static int queue_has_nonstale(struct prio_queue *queue) +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) { - int i; - for (i = 0; i < queue->nr; i++) { - struct commit *commit = queue->array[i].data; - if (!(commit->object.flags & STALE)) - return 1; + if (min_gen != GENERATION_NUMBER_INFINITY) { + if (queue->nr > 0) { + struct commit *commit = queue->array[0].data; + return commit->generation >= min_gen; + } + } else { + int i; + for (i = 0; i < queue->nr; i++) { + struct commit *commit = queue->array[i].data; + if (!(commit->object.flags & STALE)) + return 1; + } } + return 0; } @@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; + uint32_t min_nonstale_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return result; } prio_queue_put(, one); + if (one->generation < min_nonstale_gen) + min_nonstale_gen = one->generation; for (i = 0; i < n; i++) { twos[i]->object.flags |= PARENT2; prio_queue_put(, twos[i]); + if (twos[i]->generation < min_nonstale_gen) + min_nonstale_gen = twos[i]->generation; } - while (queue_has_nonstale()) { + while (queue_has_nonstale(, min_nonstale_gen)) { struct commit *commit = prio_queue_get(); struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + + last_gen = commit->generation; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return NULL; p->object.flags |= flags; prio_queue_put(, p); + + if (!(flags & STALE) && + p->generation < min_nonstale_gen) + min_nonstale_gen = p->generation; } } -- 2.17.0
[PATCH v2 05/10] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 19 ++- commit.h | 1 + 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 3e39c86abf..95ae7e13a3 100644 --- a/commit.c +++ b/commit.c @@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* newer commits with larger date first */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index b91df315c5..c440f56bf9 100644 --- a/commit.h +++ b/commit.h @@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0
[PATCH v2 00/10] Compute and consume generation numbers
Thanks for the lively discussion of this patch series in v1! I've incorporated the feedback from the previous round, added patches [7/6] and [8/6], expanded the discussion of generation numbers in the design document, and added another speedup for 'git branch --contains'. One major difference: I renamed the macros from _UNDEF to _INFINITY and _NONE to _ZERO. This communicates their value more clearly, since the previous names were unclear about which was larger than the "real" generation numbers. Patch 2 includes a change to builtin/merge.c and a new test in t5318-commit-graph.sh that exposes a problem I found when testing the previous patch series on my box. The "BUG: bad generation skip" message from "commit.c: use generation to halt paint walk" would halt a fast- forward merge since the HEAD commit was loaded before the core.commitGraph config setting was loaded. It is crucial that all commits that exist in the commit-graph file are loaded from that file or else we will lose our expected inequalities of generation numbers. Thanks, -Stolee -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. Further, use generation numbers for '--contains' queries in 'git tag' and 'git branch', providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (10): object.c: parse commit in graph first merge: check config before loading commits commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit.c: use generation to halt paint walk commit-graph.txt: update future work ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() Documentation/technical/commit-graph.txt | 50 +-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 48 +++ commit.c | 78 commit.h | 5 ++ object.c | 4 +- ref-filter.c | 24 ++-- t/t5318-commit-graph.sh | 9 +++ 9 files changed, 197 insertions(+), 27 deletions(-) -- 2.17.0
[PATCH v2 03/10] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use two special values to signal the generation number is invalid: GENERATION_NUMBER_ININITY 0x GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_ZERO) means the generation number was loaded from a commit graph file that was stored before generation numbers were computed. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 1fc63d541b..d24b947525 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index e57ae4b583..b91df315c5 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -24,6 +27,7 @@ struct commit { struct commit_list *parents; struct tree *tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0
[PATCH v2 01/10] object.c: parse commit in graph first
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). Before adding generation numbers to the commit-graph, we need to ensure that any commit that exists in the graph is loaded from the graph, so check parse_commit_in_graph() before calling parse_commit_buffer(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- object.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/object.c b/object.c index e6ad3f61f0..4cd3e98e04 100644 --- a/object.c +++ b/object.c @@ -3,6 +3,7 @@ #include "blob.h" #include "tree.h" #include "commit.h" +#include "commit-graph.h" #include "tag.h" static struct object **obj_hash; @@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type } else if (type == OBJ_COMMIT) { struct commit *commit = lookup_commit(oid); if (commit) { - if (parse_commit_buffer(commit, buffer, size)) + if (!parse_commit_in_graph(commit) && + parse_commit_buffer(commit, buffer, size)) return NULL; if (!get_cached_commit_buffer(commit, NULL)) { set_commit_buffer(commit, buffer, size); -- 2.17.0
Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph
On 4/8/2018 7:18 PM, Junio C Hamano wrote: Jeff Kingwrites: If I were doing it myself, I probably would have folded patches 1 and 3 together. They are touching all the same spots, and it would be an error for any case converted in patch 1 to not get converted in patch 3. I'm assuming you caught them all due to Coccinelle, though IMHO it is somewhat overkill here. By folding them together the compiler could tell you which spots you missed. Yeah, that approach would probably be a more sensible way to assure the safety/correctness of the result to readers better. I don't understand how folding the patches makes the correctness clearer, since the rename (1/4) is checked by the compiler and the Coccinelle script (3/4) only works after that rename is complete. The only thing I can imagine is that it makes smaller patch emails, since there is only one large patch instead of two. In this case, I prefer to make changes that are easier to check by automation (compiler and coccinelle). However, I will defer to the experts in this regard. If a v3 is necessary, then I will fold the commits together. Thanks, -Stolee
Re: What's cooking in git.git (Apr 2018, #01; Mon, 9)
On 4/9/2018 6:21 AM, Junio C Hamano wrote: * ds/commit-graph (2018-04-02) 16 commits - commit-graph: implement "--additive" option - commit-graph: build graph from starting commits - commit-graph: read only from specific pack-indexes - commit: integrate commit graph with commit parsing - commit-graph: close under reachability - commit-graph: add core.commitGraph setting - commit-graph: implement git commit-graph read - commit-graph: implement git-commit-graph write - commit-graph: implement write_commit_graph() - commit-graph: create git-commit-graph builtin - graph: add commit graph design document - commit-graph: add format document - csum-file: refactor finalize_hashfile() method - csum-file: rename hashclose() to finalize_hashfile() - Merge branch 'jk/cached-commit-buffer' into HEAD - Merge branch 'jt/binsearch-with-fanout' into HEAD (this branch is used by ds/lazy-load-trees.) Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. Ready??? It seems that this topic is getting there. I think this patch is ready to go, barring the edit of "--additive" to "--append" in the final commit message and squashing following diff into "commit-graph: implement git commit-graph read": @@ -31,7 +31,7 @@ static struct opts_commit_graph { static int graph_read(int argc, const char **argv) { - struct commit_graph *graph = 0; + struct commit_graph *graph = NULL; char *graph_name; static struct option builtin_commit_graph_read_options[] = { If you prefer that I re-roll with those changes, I can send a v8. I'm currently working on new series based on this feature: * [1] Lazy-load trees when reading commit-graph (ds/lazy-load-trees) * [2] Compute and consume generation numbers * Move commit-graph.c globals to the_repository * Implement 'fsck' functionality for the commit-graph file * Integrate 'commit-graph write' into 'gc --auto' I would also like to open the feature to other contributors, especially for others who can contribute performance improvements using generation numbers. We had a very valuable discussion on the list [2], and I look forward to more collaborations like that. Thanks, -Stolee [1] https://public-inbox.org/git/20180403120057.173849-1-dsto...@microsoft.com/T/#u [2] https://public-inbox.org/git/20180403165143.80661-1-dsto...@microsoft.com/T/#u
Re: [PATCH 02/19] replace-object: move replace_object to object store
On 4/6/2018 7:21 PM, Stefan Beller wrote: Refs belong to particular repositories, so the replacements defined by them should belong to a particular repository as well. Move the definition of a single object replacement to a new header "replace-object.h". While at it replace the hardcoded 20 by GIT_MAX_RAWSZ. Signed-off-by: Jonathan NiederSigned-off-by: Stefan Beller --- object-store.h | 14 ++ replace-object.c | 40 ++-- replace-object.h | 9 + 3 files changed, 41 insertions(+), 22 deletions(-) create mode 100644 replace-object.h Throughout this commit, there appears to be an extra space inserted before 'the_repository'. Some are more obvious than others (such as a 'free( the_repository->...)' but others are after the indentation. diff --git a/object-store.h b/object-store.h index fef33f345f..da639b3184 100644 --- a/object-store.h +++ b/object-store.h @@ -93,6 +93,20 @@ struct raw_object_store { struct alternate_object_database *alt_odb_list; struct alternate_object_database **alt_odb_tail; + /* +* Objects that should be substituted by other objects +* (see git-replace(1)). +*/ + struct replace_objects { + /* +* An array of replacements. The array is kept sorted by the original +* sha1. +*/ + struct replace_object **items; + + int alloc, nr; + } replacements; + /* * private data * diff --git a/replace-object.c b/replace-object.c index 3e49965d05..a7eb31026e 100644 --- a/replace-object.c +++ b/replace-object.c @@ -1,19 +1,11 @@ #include "cache.h" +#include "replace-object.h" +#include "object-store.h" #include "sha1-lookup.h" #include "refs.h" +#include "repository.h" #include "commit.h" -/* - * An array of replacements. The array is kept sorted by the original - * sha1. - */ -static struct replace_object { - unsigned char original[20]; - unsigned char replacement[20]; -} **replace_object; - -static int replace_object_alloc, replace_object_nr; - static const unsigned char *replace_sha1_access(size_t index, void *table) { struct replace_object **replace = table; @@ -22,7 +14,8 @@ static const unsigned char *replace_sha1_access(size_t index, void *table) static int replace_object_pos(const unsigned char *sha1) { - return sha1_pos(sha1, replace_object, replace_object_nr, + return sha1_pos(sha1, the_repository->objects->replacements.items, +the_repository->objects->replacements.nr, replace_sha1_access); } @@ -35,18 +28,21 @@ static int register_replace_object(struct replace_object *replace, if (ignore_dups) free(replace); else { - free(replace_object[pos]); - replace_object[pos] = replace; + free( the_repository->objects->replacements.items[pos]); +the_repository->objects->replacements.items[pos] = replace; } return 1; } pos = -pos - 1; - ALLOC_GROW(replace_object, replace_object_nr + 1, replace_object_alloc); - replace_object_nr++; - if (pos < replace_object_nr) - MOVE_ARRAY(replace_object + pos + 1, replace_object + pos, - replace_object_nr - pos - 1); - replace_object[pos] = replace; + ALLOC_GROW( the_repository->objects->replacements.items, + the_repository->objects->replacements.nr + 1, + the_repository->objects->replacements.alloc); +the_repository->objects->replacements.nr++; + if (pos < the_repository->objects->replacements.nr) + MOVE_ARRAY( the_repository->objects->replacements.items + pos + 1, + the_repository->objects->replacements.items + pos, + the_repository->objects->replacements.nr - pos - 1); +the_repository->objects->replacements.items[pos] = replace; return 0; } @@ -84,7 +80,7 @@ static void prepare_replace_object(void) for_each_replace_ref(register_replace_ref, NULL); replace_object_prepared = 1; - if (!replace_object_nr) + if (!the_repository->objects->replacements.nr) check_replace_refs = 0; } @@ -113,7 +109,7 @@ const unsigned char *do_lookup_replace_object(const unsigned char *sha1) pos = replace_object_pos(cur); if (0 <= pos) - cur = replace_object[pos]->replacement; + cur = the_repository->objects->replacements.items[pos]->replacement; } while (0 <= pos); return cur; diff --git a/replace-object.h b/replace-object.h new file mode 100644 index
Re: [RFC PATCH 00/19] object-store refactoring 3 (replace objects, main ref store)
On 4/6/2018 7:21 PM, Stefan Beller wrote: This applies on top of 464416a2eaadf84d2bfdf795007863d03b222b7c (sb/packfiles-in-repository). It is also available at https://github.com/stefanbeller/git/tree/object-store-3 This series will bring the replacement mechanism (git replace) into the object store. The first patches are cleaning up a bit, and patches 6-19 are converting one function at a time using the tick-tock pattern with the #define trick. See cfc62fc98c (sha1_file: add repository argument to link_alt_odb_entry, 2018-03-23) for explanation. Thanks, Stefan I looked through these patches and only found one set of whitespace errors. Compiles and tests fine on my machine. Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph
On 4/7/2018 2:40 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: [...] On the Linux repository, performance tests were run for the following command: git log --graph --oneline -1000 Before: 0.92s After: 0.66s Rel %: -28.3% Adding '-- kernel/' to the command requires loading the root tree for every commit that is walked. There was no measureable performance change as a result of this patch. In the "Git Merge contributor summit notes" [1] one can read that: - VSTS adds bloom filters to know which paths have changed on the commit - tree-same check in the bloom filter is fast; speeds up file history checks - if the file history is _very_ sparse, then bloom filter is useful Could this method speed up also the second case mentioned here? Can anyone explain how this "path-changed bloom filter" works in VSTS? The idea is simple: for every commit, store a Bloom filter containing the list of paths that are not TREESAME against the first parent. (A slight detail: have a max cap on the number of paths, and store simply "TOO_BIG" for commits with too many diffs.) When performing 'git log -- path' queries, the most important detail for considering how to advance the walk is whether the commit is TREESAME to its first parent. For a deep path in a large repo, this is almost always true. When a Bloom filter says "TREESAME" (i.e. "this path is not in my set") it is always correct, so we can set the treesame bit and continue without walking any trees. When a Bloom filter says "MAYBE NOT TREESAME" (i.e. "this path is probably in my set") you only need to do the same work as before: walk the trees to compare against your first parent. If a Bloom filter has a false-positive rate of X%, then you can possibly drop your number of tree comparisons by (100-X)%. This is very important for large repos where some paths were changed only ten times or so, the full graph needs to be walked and it is helpful to avoid parsing too many trees. Could we add something like this to the commit-graph file? I'm not sure if it is necessary for client-side operations, but it is one of the reasons the commit-graph file has the idea of an "optional chunk". It could be added to the file format (without changing version numbers) and be ignored by clients that don't understand it. I could also be gated by a config setting for computing them. My guess is that only server-side operations will need the added response time, and can bear the cost of computing them when writing the commit-graph file. Clients are less likely to be patient waiting for a lot of diff calculations. If we add commit-graph file downloads to the protocol, then the server could do this computation and send the data to all clients. But that would be "secondary" information that maybe clients want to verify, which is as difficult as computing it themselves. Thanks, -Stolee
Re: [PATCH 0/6] Compute and consume generation numbers
On 4/7/2018 12:55 PM, Jakub Narebski wrote: Currently I am at the stage of reproducing results in FELINE paper: "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed J. Zaki (2014). This paper is available in the PDF form at https://openproceedings.org/EDBT/2014/paper_166.pdf The Jupyter Notebook (which runs on Google cloud, but can be also run locally) uses Python kernel, NetworkX librabry for graph manipulation, and matplotlib (via NetworkX) for display. Available at: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing I hope that could be of help, or at least interesting Let me know when you can give numbers (either raw performance or # of commits walked) for real-world Git commit graphs. The Linux repo is a good example to use for benchmarking, but I also use the Kotlin repo sometimes as it has over a million objects and over 250K commits. Of course, the only important statistic at the end of the day is the end-to-end time of a 'git ...' command. Your investigations should inform whether it is worth prototyping the feature in the git codebase. Thanks, -Stolee
Re: [PATCH 7/6] ref-filter: use generation number for --contains
On 4/4/2018 3:42 PM, Jeff King wrote: On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote: That is the idea. I should make this clearer in all of my commit messages. Yes, please. :) And maybe in the documentation of the file format, if it's not there (I didn't check). It's a very useful property, and we want to make sure people making use of the graph know they can depend on it. For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion of generation numbers in Documentation/technical/commit-graph.txt (the design doc instead of the file format). Thanks, -Stolee
Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()
On 4/18/2018 10:31 AM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Does it mean that it gives no measureable performance improvements for typical test cases? Not in this commit. When we add the `min_generation` parameter in a later commit, we do get a significant performance boost (when we can supply a non-zero value to `min_generation`). This step of using generation numbers for the priority is important for that commit, but on its own has limited value outside of the clock-skew case mentioned above. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..a44899c733 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generataions are equal */ Very minor typo in above comment: s/generataions/generations/ Good catch! + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...);
Re: What's cooking in git.git (Apr 2018, #02; Tue, 17)
On 4/17/2018 9:04 PM, Junio C Hamano wrote: Stefan Beller <sbel...@google.com> writes: What's the doneness of this thing? I didn't recall seeing any response, especially ones that demonstrated the reviewer carefully read and thought about the issues surrounding the code. Not that I spotted any problems in these patches myself, though. Stolee and Brandon provided a "quick LGTM" type of review https://public-inbox.org/git/20180409232536.gb102...@google.com/ https://public-inbox.org/git/9ddfee7e-025a-79c9-8d6b-700c65a14...@gmail.com/ Yup. Giving positive reviews is harder than giving constructive criticism. Much harder. As readers cannot tell from a "quick LGTM" between "I didn't read it but it did not smell foul" and "I read it thoroughly, understood how the solution works, it was presented well, and agree with the design and implementation---there is nothing to add", the reviewers need to come up with some way to express that it is the latter case rather than the former. I would not claim that I've perfected my technique to do so, but when responding to such a "good" series, I rephrase the main idea in the series in my own words to show that I as a reviewer read the series well enough to be able to do so, perhaps with comparison with possible alternatives I could think of and dicussion to argue that the solution presented in the series is better, in an attempt to demonstrate that I am qualified to say "this one is good" with good enough understanding of both the issue the series addresses and the solution in the series. I'm sorry that my second message was terse. My response to v1 [1] was > I looked through these patches and only found one set of whitespace > errors. Compiles and tests fine on my machine. > > Reviewed-by: Derrick Stolee <dsto...@microsoft.com> So, I pulled the code, went through it patch-by-patch, and saw that the transformations were made using the established pattern. The second review was to chime in that my v1 comments had been addressed. Thanks, -Stolee [1] https://public-inbox.org/git/6c319100-df47-3b8d-8661-24e4643ad...@gmail.com/
[PATCH v3 8/9] commit-graph: always load commit-graph information
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 51 -- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index a70f120878..9ef6f699bd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer
[PATCH v3 4/9] commit-graph.txt: update design document
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 30 +++- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.39.g685157f7fb
[PATCH v3 9/9] merge: check config before loading commits
Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 5 +++-- t/t5318-commit-graph.sh | 9 + 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..7e1da6c6ea 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb
[PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index bceb79c419..a70f120878 100644 --- a/commit.c +++ b/commit.c @@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb
[PATCH v3 2/9] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 46 ++ 1 file changed, 46 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..688d5b1801 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + packedDate[0] |= htonl((*list)->generation << 2); + } + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); -- 2.17.0.39.g685157f7fb
[PATCH v3 1/9] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0x GENERATION_NUMBER_MAX 0x3FFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb
[PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index a44899c733..bceb79c419 100644 --- a/commit.c +++ b/commit.c @@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb
[PATCH v3 0/9] Compute and consume generation numbers
Thanks for all the help on v2. Here are a few changes between versions: * Removed the constant-time check in queue_has_nonstale() due to the possibility of a performance hit and no evidence of a performance benefit in typical cases. * Reordered the commits about loading commits from the commit-graph. This way it is easier to demonstrate the incorrect checks. On my machine, every commit compiles and the test suite passes, but patches 6-8 have the bug that is fixed in patch 9 "merge: check config before loading commits". * The interaction with parse_commit_in_graph() from parse_object() is replaced with a new 'check_graph' parameter in parse_commit_buffer(). This allows us to fill in the graph_pos and generation values for commits that are parsed directly from a buffer. This keeps the existing behavior that a commit parsed this way should match its buffer. * There was discussion about making GENERATION_NUMBER_MAX assignable by an environment variable so we could add tests that exercise the behavior of capping a generation at that value. Perhaps the code around this is simple enough that we do not need to add that complexity. Thanks, -Stolee -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains), providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is build on ds/lazy-load-trees. Derrick Stolee (9): commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph.txt: update design document ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() commit-graph: always load commit-graph information merge: check config before loading commits Documentation/technical/commit-graph.txt | 30 +-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 99 +++- commit-graph.h | 8 ++ commit.c | 54 +++-- commit.h | 7 +- object.c | 2 +- ref-filter.c | 23 +- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 199 insertions(+), 41 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb
[PATCH v3 3/9] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..a44899c733 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generataions are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb
[PATCH v3 5/9] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for 'git tag --contains' queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..e2fea6d635 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb
Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
On 4/24/2018 2:56 PM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: One way to fix this is to call 'prepare_commit_graph()' directly and then test that 'commit_graph' is non-null before performing any parses. I'm not thrilled with how that couples the commit-graph implementation to this feature, but that may be necessary to avoid regressions in the non-commit-graph case. Another possible solution (not sure if better or worse) would be to change the signature of contains_tag_algo() function to take parameter or flag that would decide whether to parse "wants". If I reorder commits so "commit-graph:always load commit-graph information" is before this one, then we can call load_commit_graph_info() which just fills the generation and graph_pos information. This will keep the coupling very light, instead of needing to call prepare_commit_graph() or checking the config setting. Thanks, -Stolee
[PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list
The in_commit_list() method does not check the parents of the candidate for containment in the list. Fix the comment that incorrectly states that it does. Reported-by: Jakub Narebski <jna...@gmail.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..aff24d93be 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, -- 2.17.0.39.g685157f7fb
[PATCH v4 00/10] Compute and consume generation numbers
Thanks for the feedback on the previous version. I think this series is stabilizing nicely. I'll reply to this message with an inter-diff as it is not too large to share but would clutter this cover letter. Thanks, -Stolee -- >8 -- This is the one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains), providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is built on ds/lazy-load-trees. Derrick Stolee (10): ref-filter: fix outdated comment on in_commit_list commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph: always load commit-graph information ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() merge: check config before loading commits commit-graph.txt: update design document Documentation/technical/commit-graph.txt | 30 ++-- alloc.c | 1 + builtin/merge.c | 7 +- commit-graph.c | 92 commit-graph.h | 8 +++ commit.c | 54 +++--- commit.h | 7 +- object.c | 2 +- ref-filter.c | 26 +-- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 198 insertions(+), 40 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb
[PATCH v4 03/10] commit-graph: compute generation numbers
While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 + 1 file changed, 45 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..047fa9fca5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], ); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, ); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(); + } + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(, graph_name, 0); -- 2.17.0.39.g685157f7fb
[PATCH v4 02/10] commit: add generation number to struct commmit
The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0x GENERATION_NUMBER_MAX 0x3FFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- alloc.c| 1 + commit-graph.c | 2 ++ commit.h | 4 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = >parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0x +#define GENERATION_NUMBER_INFINITY 0x +#define GENERATION_NUMBER_MAX 0x3FFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb
[PATCH v4 07/10] commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the initial commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 39a3749abd..7bb007f56a 100644 --- a/commit.c +++ b/commit.c @@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return ret; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb
[PATCH v4 04/10] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..4d00b0a1d6 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb
[PATCH v4 05/10] commit-graph: always load commit-graph information
Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 45 ++--- commit-graph.h | 8 commit.c | 7 +-- commit.h | 2 +- object.c | 2 +- sha1_file.c| 2 +- 6 files changed, 46 insertions(+), 20 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 047fa9fca5..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return _list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(g, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + if (!core_commit_graph) return 0; if (item->object.parsed) return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), ); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, )) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index 4d00b0a1d6..39a3749abd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s } item->date = parse_commit_date(bufptr, tail); + if (check_graph) + load_commit_graph_info(item); + return 0; } @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return error("
[PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 7bb007f56a..e2e16ea1a7 100644 --- a/commit.c +++ b/commit.c @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return ret; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb
[PATCH v4 06/10] ref-filter: use generation number for --contains
A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <p...@peff.net> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- ref-filter.c | 24 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index aff24d93be..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + load_commit_graph_info(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb
[PATCH v4 10/10] commit-graph.txt: update design document
We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 30 +++- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + +If A and B are commits with generation numbers N amd M, respectively, +and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: -- paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.39.g685157f7fb
[PATCH v4 09/10] merge: check config before loading commits
Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- builtin/merge.c | 7 --- t/t5318-commit-graph.sh | 9 + 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb
Re: [PATCH v4 00/10] Compute and consume generation numbers
As promised, here is the diff from v3. Thanks, -Stolee -- >8 -- diff --git a/builtin/merge.c b/builtin/merge.c index 7e1da6c6ea..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", ); + init_diff_ui_defaults(); git_config(git_merge_config, NULL); @@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const char *prefix) else head_commit = lookup_commit_or_die(_oid, "HEAD"); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/commit-graph.c b/commit-graph.c index 21e853c21a..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin *pos = item->graph_pos; return 1; } else { - return bsearch_graph(commit_graph, &(item->object.oid), pos); + return bsearch_graph(g, &(item->object.oid), pos); } } @@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item) { uint32_t pos; - if (item->object.parsed) - return 0; if (!core_commit_graph) return 0; + if (item->object.parsed) + return 1; prepare_commit_graph(); if (commit_graph && find_commit_in_graph(item, commit_graph, )) return fill_commit_in_graph(item, commit_graph, pos); @@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; - if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + if ((*list)->generation != GENERATION_NUMBER_INFINITY) packedDate[0] |= htonl((*list)->generation << 2); - } packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); diff --git a/commit.c b/commit.c index 9ef6f699bd..e2e16ea1a7 100644 --- a/commit.c +++ b/commit.c @@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void else if (a->generation > b->generation) return -1; - /* use date as a heuristic when generataions are equal */ + /* use date as a heuristic when generations are equal */ if (a->date < b->date) return 1; else if (a->date > b->date) @@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * } if (commit->generation > min_generation) - return 0; + return ret; bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) diff --git a/ref-filter.c b/ref-filter.c index e2fea6d635..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1582,7 +1583,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, @@ -1629,7 +1630,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, for (p = want; p; p = p->next) { struct commit *c = p->item; - parse_commit_or_die(c); + load_commit_graph_info(c); if (c->generation < cutoff) cutoff = c->generation; }
Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
On 4/18/2018 6:15 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. You have "target" twice in above paragraph; one of those should probably be something else. Thanks. Second "target" should be "initial". [...] + + if (commit->generation > min_generation) + return 0; Why not use "return ret;" instead of "return 0;", like the rest of the code [cryptically] does, that is: +if (commit->generation > min_generation) +return ret; Sure. Thanks, -Stolee
Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()
On 4/18/2018 7:19 PM, Jakub Narebski wrote: Derrick Stolee <dsto...@microsoft.com> writes: [...] [...], and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. If I understand the code properly, what happens is that we can now short-circuit if all commits that are left are lower than the target commit. This is because max-order priority queue is used: if the commit with maximum generation number is below generation number of target commit, then target commit is not reachable from any commit in the priority queue (all of which has generation number less or equal than the commit at head of queue, i.e. all are same level or deeper); compare what I have written in [1] [1]: https://public-inbox.org/git/866052dkju@gmail.com/ Do I have that right? If so, it looks all right to me. Yes, the priority queue needs to compare via generation number first or there will be errors. This is why we could not use commit time before. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% [...] flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); Is it the only case where we would call paint_down_to_common() with non-zero last parameter? Would we always use commit->generation where commit is the first parameter of paint_down_to_common()? If both are true and will remain true, then in my humble opinion it is not necessary to change the signature of this function. We need to change the signature some way, but maybe the way I chose is not the best. To elaborate: paint_down_to_common() is used for multiple purposes. The caller here that supplies 'commit->generation' is used only to compute reachability (by testing if the flag PARENT2 exists on the commit, then clears all flags). The other callers expect the full walk down to the common commits, and keeps those PARENT1, PARENT2, and STALE flags for future use (such as reporting merge bases). Usually the call to paint_down_to_common() is followed by a revision walk that only halts when reaching root commits or commits with both PARENT1 and PARENT2 flags on, so always short-circuiting on generations would break the functionality; this is confirmed by the t5318-commit-graph.sh. An alternative to the signature change is to add a boolean parameter "use_cutoff" or something, that specifies "don't walk beyond the commit". This may give a more of a clear description of what it will do with the generation value, but since we are already performing generation comparisons before calling paint_down_to_common() I find this simple enough. Thanks, -Stolee