[PATCH v6 10/14] commit-graph: close under reachability

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 2f2e2c7083..fc7b4fa622 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -369,6 +369,28 @@ static int add_packed_commits(const struct object_id *oid,
return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct rev_info revs;
+   struct commit *commit;
+   init_revisions(, NULL);
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+   if (commit && !parse_commit(commit))
+   revs.commits = commit_list_insert(commit, 
);
+   }
+
+   if (prepare_revision_walk())
+   die(_("revision walk setup failed"));
+
+   while ((commit = get_revision()) != NULL) {
+   ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+   oidcpy(>list[oids->nr], &(commit->object.oid));
+   (oids->nr)++;
+   }
+}
+
 void write_commit_graph(const char *obj_dir)
 {
struct packed_oid_list oids;
@@ -392,6 +414,7 @@ void write_commit_graph(const char *obj_dir)
ALLOC_ARRAY(oids.list, oids.alloc);
 
for_each_packed_object(add_packed_commits, , 0);
+   close_reachable();
 
QSORT(oids.list, oids.nr, commit_compare);
 
-- 
2.14.1



[PATCH v6 07/14] commit-graph: implement 'git-commit-graph write'

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to write graph files. Create new test script to verify
this command succeeds without failure.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  39 
 builtin/commit-graph.c |  33 ++
 t/t5318-commit-graph.sh| 125 +
 3 files changed, 197 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 5913340fad..e688843808 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,6 +5,45 @@ NAME
 
 git-commit-graph - Write and verify Git commit graph files
 
+
+SYNOPSIS
+
+[verse]
+'git commit-graph write'  [--object-dir ]
+
+
+DESCRIPTION
+---
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+---
+--object-dir::
+   Use given directory for the location of packfiles and commit graph
+   file. The commit graph file is expected to be at /info/commit-graph
+   and the packfiles are expected to be in /pack.
+
+
+COMMANDS
+
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
+Includes all commits from the existing commit graph file.
+
+
+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph write
+
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 8ff7336527..a9d61f649a 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,9 +1,18 @@
 #include "builtin.h"
 #include "config.h"
+#include "dir.h"
+#include "lockfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ]"),
+   NULL
+};
+
+static const char * const builtin_commit_graph_write_usage[] = {
+   N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
@@ -11,6 +20,25 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_write(int argc, const char **argv)
+{
+   static struct option builtin_commit_graph_write_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_write_options,
+builtin_commit_graph_write_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   write_commit_graph(opts.obj_dir);
+   return 0;
+}
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
@@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 builtin_commit_graph_usage,
 PARSE_OPT_STOP_AT_NON_OPTION);
 
+   if (argc > 0) {
+   if (!strcmp(argv[0], "write"))
+   return graph_write(argc, argv);
+   }
+
usage_with_options(builtin_commit_graph_usage,
   builtin_commit_graph_options);
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 00..43707ce5bb
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,125 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' '
+   mkdir full &&
+   cd "$TRASH_DIRECTORY/full" &&
+   git init &&
+   objdir=".git/objects"
+'
+
+test_expect_success 'write graph with no packs' '
+cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph write --object-dir . &&
+   test_path_is_file info/commit-graph
+'
+
+test_expect_success 'create commits and repack' '
+cd "$TRASH_DIRECTORY/full" &&
+   for i in $(test_seq 3)
+   do
+   test_commit $i &&
+   git branch commits/$i
+   done &&
+   git repack
+'
+
+test_expect_success 'write graph' '
+cd "$TRASH_DIRECTORY/full" &&
+   graph1=$(git commit-graph write) &&
+   test_path_is_file $objdir/info/commit-graph
+'
+
+test_expect_success 'Add more commits' '
+cd "$TRASH_DIRECTORY/full" &&
+   git reset --hard commits/1 &&
+   for i in $(test_seq 4 5)

[PATCH v6 05/14] commit-graph: create git-commit-graph builtin

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for an '--object-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore |  1 +
 Documentation/git-commit-graph.txt | 11 ++
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/commit-graph.c | 37 ++
 command-list.txt   |  1 +
 contrib/completion/git-completion.bash |  2 ++
 git.c  |  1 +
 8 files changed, 55 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
new file mode 100644
index 00..5913340fad
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,11 @@
+git-commit-graph(1)
+===
+
+NAME
+
+git-commit-graph - Write and verify Git commit graph files
+
+GIT
+---
+Part of the linkgit:git[1] suite
+
diff --git a/Makefile b/Makefile
index de4b8f0c02..a928d4de66 100644
--- a/Makefile
+++ b/Makefile
@@ -946,6 +946,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const 
char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 00..8ff7336527
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,37 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+   N_("git commit-graph [--object-dir ]"),
+   NULL
+};
+
+static struct opts_commit_graph {
+   const char *obj_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_commit_graph_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
+   git_config(git_default_config, NULL);
+   argc = parse_options(argc, argv, prefix,
+builtin_commit_graph_options,
+builtin_commit_graph_usage,
+PARSE_OPT_STOP_AT_NON_OPTION);
+
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean   mainporcelain
 git-clone   mainporcelain   init
 git-column  purehelpers
 git-commit  mainporcelain   history
+git-commit-graphplumbingmanipulators
 git-commit-tree plumbingmanipulators
 git-config  ancillarymanipulators
 git-count-objects   ancillaryinterrogators
diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index 91536d831c..a24af902d8 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -841,6 +841,7 @@ __git_list_porcelain_commands ()
check-ref-format) : pl

[PATCH v6 01/14] csum-file: rename hashclose() to finalize_hashfile()

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

The hashclose() method behaves very differently depending on the flags
parameter. In particular, the file descriptor is not always closed.

Perform a simple rename of "hashclose()" to "finalize_hashfile()" in
preparation for functional changes.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c   | 2 +-
 builtin/pack-objects.c | 6 +++---
 bulk-checkin.c | 4 ++--
 csum-file.c| 2 +-
 csum-file.h| 4 ++--
 fast-import.c  | 2 +-
 pack-bitmap-write.c| 2 +-
 pack-write.c   | 4 ++--
 8 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 59878e70b8..157bceb264 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1269,7 +1269,7 @@ static void conclude_pack(int fix_thin_pack, const char 
*curr_pack, unsigned cha
nr_objects - nr_objects_initial);
stop_progress_msg(, msg.buf);
strbuf_release();
-   hashclose(f, tail_hash, 0);
+   finalize_hashfile(f, tail_hash, 0);
hashcpy(read_hash, pack_hash);
fixup_pack_header_footer(output_fd, pack_hash,
 curr_pack, nr_objects,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a197926eaa..84e9f57b7f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,11 +837,11 @@ static void write_pack_file(void)
 * If so, rewrite it like in fast-import
 */
if (pack_to_stdout) {
-   hashclose(f, oid.hash, CSUM_CLOSE);
+   finalize_hashfile(f, oid.hash, CSUM_CLOSE);
} else if (nr_written == nr_remaining) {
-   hashclose(f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(f, oid.hash, CSUM_FSYNC);
} else {
-   int fd = hashclose(f, oid.hash, 0);
+   int fd = finalize_hashfile(f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, pack_tmp_name,
 nr_written, oid.hash, offset);
close(fd);
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 9d87eac07b..227cc9f3b1 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -35,9 +35,9 @@ static void finish_bulk_checkin(struct bulk_checkin_state 
*state)
unlink(state->pack_tmp_name);
goto clear_exit;
} else if (state->nr_written == 1) {
-   hashclose(state->f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(state->f, oid.hash, CSUM_FSYNC);
} else {
-   int fd = hashclose(state->f, oid.hash, 0);
+   int fd = finalize_hashfile(state->f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name,
 state->nr_written, oid.hash,
 state->offset);
diff --git a/csum-file.c b/csum-file.c
index 5eda7fb6af..e6c95a6915 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -53,7 +53,7 @@ void hashflush(struct hashfile *f)
}
 }
 
-int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags)
+int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int 
flags)
 {
int fd;
 
diff --git a/csum-file.h b/csum-file.h
index 992e5c0141..9ba87f0a6c 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -26,14 +26,14 @@ struct hashfile_checkpoint {
 extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint 
*);
 extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
-/* hashclose flags */
+/* finalize_hashfile flags */
 #define CSUM_CLOSE 1
 #define CSUM_FSYNC 2
 
 extern struct hashfile *hashfd(int fd, const char *name);
 extern struct hashfile *hashfd_check(const char *name);
 extern struct hashfile *hashfd_throughput(int fd, const char *name, struct 
progress *tp);
-extern int hashclose(struct hashfile *, unsigned char *, unsigned int);
+extern int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int);
 extern void hashwrite(struct hashfile *, const void *, unsigned int);
 extern void hashflush(struct hashfile *f);
 extern void crc32_begin(struct hashfile *);
diff --git a/fast-import.c b/fast-import.c
index 58ef360da4..2e5d17318d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1016,7 +1016,7 @@ static void end_packfile(void)
struct tag *t;
 
close_pack_windows(pack_data);
-   hashclose(pack_file, cur_pack_oid.hash, 0);
+   finalize_hashfile(pack_file, cur_pack_oid.hash, 0);
fixup_pack_header_footer(pack_data->pack_fd, pack_data->sha1,
 

[PATCH v6 12/14] commit-graph: read only from specific pack-indexes

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 ++-
 builtin/commit-graph.c | 33 ++---
 commit-graph.c | 26 --
 commit-graph.h |  4 +++-
 packfile.c |  4 ++--
 packfile.h |  2 ++
 t/t5318-commit-graph.sh| 10 ++
 7 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 51cb038f3d..b945510f0f 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -32,7 +32,9 @@ COMMANDS
 'write'::
 
 Write a commit graph file based on the commits found in packfiles.
-Includes all commits from the existing commit graph file.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified packfiles.
 
 'read'::
 
@@ -49,6 +51,13 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Write a graph file, extending the current graph file using commits
+* in .
++
+
+$ echo  | git commit-graph write --stdin-packs
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0e164becff..eebca57e6f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
@@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int stdin_packs;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv)
 
 static int graph_write(int argc, const char **argv)
 {
+   const char **pack_indexes = NULL;
+   int packs_nr = 0;
+   const char **lines = NULL;
+   int lines_nr = 0;
+   int lines_alloc = 0;
+
static struct option builtin_commit_graph_write_options[] = {
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "stdin-packs", _packs,
+   N_("scan packfiles listed by stdin for commits")),
OPT_END(),
};
 
@@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv)
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   write_commit_graph(opts.obj_dir);
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   lines_nr = 0;
+   lines_alloc = 128;
+   ALLOC_ARRAY(lines, lines_alloc);
+
+   while (strbuf_getline(, stdin) != EOF) {
+   ALLOC_GROW(lines, lines_nr + 1, lines_alloc);
+   lines[lines_nr++] = strbuf_detach(, NULL);
+   }
+
+   pack_indexes = lines;
+   packs_nr = lines_nr;
+   }
+
+   write_commit_graph(opts.obj_dir,
+  pack_indexes,
+  packs_nr);
+
return 0;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 98e2b89b94..f0d7585ddb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -529,7 +529,9 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
-void write_commit_graph(const char *obj_dir)
+void write_commit_graph(const char *obj_dir,
+   const char **pack_indexes,
+   int nr_packs)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -551,7 +553,27 @@ void write_commit_graph(const char *obj_dir)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
-   for_each_packed_object(add_packed_commits, , 0);
+   if (pack_indexes) {
+   struct strbuf packname = STRBUF_I

[PATCH v6 13/14] commit-graph: build graph from starting commits

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 14 +-
 builtin/commit-graph.c | 27 +--
 commit-graph.c | 27 +--
 commit-graph.h |  4 +++-
 t/t5318-commit-graph.sh| 13 +
 5 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index b945510f0f..0710a68f2d 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -34,7 +34,13 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
-walking objects only in the specified packfiles.
+walking objects only in the specified packfiles. (Cannot be combined
+with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
 
 'read'::
 
@@ -58,6 +64,12 @@ $ git commit-graph write
 $ echo  | git commit-graph write --stdin-packs
 
 
+* Write a graph file containing all reachable commits.
++
+
+$ git show-ref -s | git commit-graph write --stdin-commits
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index eebca57e6f..1c7b7e72b0 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
+   int stdin_commits;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
int packs_nr = 0;
+   const char **commit_hex = NULL;
+   int commits_nr = 0;
const char **lines = NULL;
int lines_nr = 0;
int lines_alloc = 0;
@@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv)
N_("The object directory to store the graph")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan packfiles listed by stdin for commits")),
+   OPT_BOOL(0, "stdin-commits", _commits,
+   N_("start walk at commits listed by stdin")),
OPT_END(),
};
 
@@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
+   if (opts.stdin_packs && opts.stdin_commits)
+   die(_("cannot use both --stdin-commits and --stdin-packs"));
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
lines_nr = 0;
lines_alloc = 128;
@@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv)
lines[lines_nr++] = strbuf_detach(, NULL);
}
 
-   pack_indexes = lines;
-   packs_nr = lines_nr;
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+  

[PATCH v6 09/14] commit-graph: add core.commitGraph setting

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

The commit graph feature is controlled by the new core.commitGraph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.commitGraph is that a user can always stop checking
for or parsing commit graph files if core.commitGraph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ce9102cea8..9e3da629b8 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitGraph::
+   Enable git commit graph feature. Allows reading from .graph files.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d06932ed0b..e62569fbb1 100644
--- a/cache.h
+++ b/cache.h
@@ -801,6 +801,7 @@ extern char *git_replace_ref_base;
 
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index b0c20e6cb8..25ee4a676c 100644
--- a/config.c
+++ b/config.c
@@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.commitgraph")) {
+   core_commit_graph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index d6dd64662c..8853e2f0dd 100644
--- a/environment.c
+++ b/environment.c
@@ -62,6 +62,7 @@ enum push_default_type push_default = 
PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
+int core_commit_graph;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.14.1



[PATCH v6 14/14] commit-graph: implement "--additive" option

2018-03-14 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to add all commits from the existing
commit-graph file to the file about to be written. This should be
used when adding new commits without performing garbage collection.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 10 ++
 builtin/commit-graph.c | 10 +++---
 commit-graph.c | 17 -
 commit-graph.h |  3 ++-
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 0710a68f2d..ccf5e203ce 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -41,6 +41,9 @@ With the `--stdin-commits` option, generate the new commit 
graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
 --stdin-packs.)
++
+With the `--additive` option, include all commits that are present
+in the existing commit-graph file.
 
 'read'::
 
@@ -70,6 +73,13 @@ $ echo  | git commit-graph write --stdin-packs
 $ git show-ref -s | git commit-graph write --stdin-commits
 
 
+* Write a graph file containing all commits in the current
+* commit-graph file along with those reachable from HEAD.
++
+
+$ git rev-parse HEAD | git commit-graph write --stdin-commits --additive
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 1c7b7e72b0..d26a6d6de3 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--additive] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = 
{
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--additive] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -26,6 +26,7 @@ static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
int stdin_commits;
+   int additive;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv)
N_("scan packfiles listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
N_("start walk at commits listed by stdin")),
+   OPT_BOOL(0, "additive", ,
+   N_("include all commits already in the commit-graph 
file")),
OPT_END(),
};
 
@@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv)
   pack_indexes,
   packs_nr,
   commit_hex,
-  commits_nr);
+  commits_nr,
+  opts.additive);
 
return 0;
 }
diff --git a/commit-graph.c b/commit-graph.c
index 9f1ba9bff6..6348bab82b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -533,7 +533,8 @@ void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
const char **commit_hex,
-   int nr_commits)
+   int nr_commits,
+   int additive)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -551,10 +552,24 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
 
+   if (additive) {
+   prepare_commit_graph_one(obj_dir);
+   if (commit_graph)
+   oids.alloc += commit_graph->num_commits;
+   }
+
if (oids.alloc < 1024)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
+   if (additive && commit_graph) {
+   for (i = 0; i < commit_graph->num_commits; i++) {
+   const unsigned char *hash = 
commit_graph->chunk_oid_lookup +
+   com

[PATCH 1/3] commit: create get_commit_tree() method

2018-04-03 Thread Derrick Stolee
While walking the commit graph, we load struct commit objects into
the object cache. During this process, we also load struct tree
objects for the root tree of each of these commits. We load these
objects even if we are only computing commit reachability information,
such as a merge base or ahead/behind information.

Create get_commit_tree() as a first step to removing direct
references to the 'tree' member of struct commit.

Create get_commit_tree_oid() as a shortcut for several references
to ">tree->object.oid" in the codebase.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 10 ++
 commit.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/commit.c b/commit.c
index 3e39c86abf..d65c7b3b47 100644
--- a/commit.c
+++ b/commit.c
@@ -296,6 +296,16 @@ void free_commit_buffer(struct commit *commit)
}
 }
 
+struct tree *get_commit_tree(const struct commit *commit)
+{
+   return commit->tree;
+}
+
+struct object_id *get_commit_tree_oid(const struct commit *commit)
+{
+   return >tree->object.oid;
+}
+
 const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 {
struct commit_buffer *v = buffer_slab_peek(_slab, commit);
diff --git a/commit.h b/commit.h
index e57ae4b583..fa79cc4d1f 100644
--- a/commit.h
+++ b/commit.h
@@ -102,6 +102,9 @@ void unuse_commit_buffer(const struct commit *, const void 
*buffer);
  */
 void free_commit_buffer(struct commit *);
 
+struct tree *get_commit_tree(const struct commit *);
+struct object_id *get_commit_tree_oid(const struct commit *);
+
 /*
  * Disassociate any cached object buffer from the commit, but do not free it.
  * The buffer (or NULL, if none) is returned.
-- 
2.17.0.20.g9f30ba16e1



[PATCH 3/3] commit-graph: lazy-load trees

2018-04-03 Thread Derrick Stolee
The commit-graph file provides quick access to commit data, including
the OID of the root tree for each commit in the graph. When performing
a deep commit-graph walk, we may not need to load most of the trees
for these commits.

Delay loading the tree object for a commit loaded from the graph
until requested via get_commit_tree(). Do not lazy-load trees for
commits not in the graph, since that requires duplicate parsing
and the relative peformance improvement when trees are not needed
is small.

On the Linux repository, performance tests were run for the following
command:

git log --graph --oneline -1000

Before: 0.83s
After:  0.65s
Rel %: -21.6%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 25 ++---
 commit-graph.h |  7 +++
 commit.c   | 10 --
 3 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3080a87940..a3eeb25f22 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -247,7 +247,6 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
 
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
-   struct object_id oid;
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
@@ -257,8 +256,7 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
item->object.parsed = 1;
item->graph_pos = pos;
 
-   hashcpy(oid.hash, commit_data);
-   item->tree = lookup_tree();
+   item->tree = NULL;
 
date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
date_low = get_be32(commit_data + g->hash_len + 12);
@@ -317,6 +315,27 @@ int parse_commit_in_graph(struct commit *item)
return 0;
 }
 
+static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
+{
+   struct object_id oid;
+   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * (c->graph_pos);
+
+   hashcpy(oid.hash, commit_data);
+   c->tree = lookup_tree();
+
+   return c->tree;
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   if (c->tree)
+   return c->tree;
+   if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
+   BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
+
+   return load_tree_for_commit(commit_graph, (struct commit *)c);
+}
+
 static void write_graph_chunk_fanout(struct hashfile *f,
 struct commit **commits,
 int nr_commits)
diff --git a/commit-graph.h b/commit-graph.h
index e1d8580c98..3ab45818e2 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,13 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * For performance reasons, a commit loaded from the graph does not
+ * have a tree loaded until trying to consume it for the first time.
+ * Load that tree into the commit and return the object.
+ */
+struct tree *get_commit_tree_in_graph(const struct commit *c);
+
 struct commit_graph {
int graph_fd;
 
diff --git a/commit.c b/commit.c
index d65c7b3b47..d4293ae8f6 100644
--- a/commit.c
+++ b/commit.c
@@ -298,12 +298,18 @@ void free_commit_buffer(struct commit *commit)
 
 struct tree *get_commit_tree(const struct commit *commit)
 {
-   return commit->tree;
+   if (commit->tree || !commit->object.parsed)
+   return commit->tree;
+
+   if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+   BUG("commit has NULL tree, but was not loaded from 
commit-graph");
+
+   return get_commit_tree_in_graph(commit);
 }
 
 struct object_id *get_commit_tree_oid(const struct commit *commit)
 {
-   return >tree->object.oid;
+   return _commit_tree(commit)->object.oid;
 }
 
 const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
-- 
2.17.0.20.g9f30ba16e1



Re: [PATCH v7 08/14] commit-graph: implement git commit-graph read

2018-04-03 Thread Derrick Stolee

On 4/2/2018 5:33 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


From: Derrick Stolee <dsto...@microsoft.com>
...
+static int graph_read(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;

The previous round said NULL above, not 0, and NULL is the better
way to spell it, I would think.


Sorry about that. Hopefully it is easy to squash.


[PATCH 2/3] treewide: use get_commit_tree() for tree access

2018-04-03 Thread Derrick Stolee
Replace all direct accesses of the 'tree' member in 'struct commit'
with calls to get_commit_tree() or get_commit_tree_oid().

This patch was constructed starting with the following Coccinelle
script, then removing false-positives:

@@
expression c;
@@
- >tree->object.oid
+ get_commit_tree_oid(c)

@@
expression c;
symbol m;
@@
- c->tree->object.oid.m
+ get_commit_tree_oid(c)->m

@@
expression c;
@@
- c->tree
+ get_commit_tree(c)

To ensure all references were removed, the 'tree' member was renamed
to 'tree_renamed' along with the few allowed accessors. A successful
compilation demonstrated a correct transformation.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 blame.c   | 18 +-
 builtin/checkout.c| 17 +
 builtin/diff.c|  2 +-
 builtin/fast-export.c |  6 +++---
 builtin/log.c |  4 ++--
 builtin/reflog.c  |  2 +-
 commit-graph.c|  2 +-
 fsck.c|  8 +---
 http-push.c   |  2 +-
 line-log.c|  4 ++--
 list-objects.c| 10 +-
 log-tree.c|  6 +++---
 merge-recursive.c |  3 ++-
 notes-merge.c |  8 
 packfile.c|  2 +-
 pretty.c  |  5 +++--
 ref-filter.c  |  2 +-
 revision.c|  8 
 sequencer.c   | 12 ++--
 sha1_name.c   |  2 +-
 tree.c|  4 ++--
 walker.c  |  2 +-
 22 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/blame.c b/blame.c
index 200e0ad9a2..7f5700b324 100644
--- a/blame.c
+++ b/blame.c
@@ -553,10 +553,10 @@ static struct blame_origin *find_origin(struct commit 
*parent,
diff_setup_done(_opts);
 
if (is_null_oid(>commit->object.oid))
-   do_diff_cache(>tree->object.oid, _opts);
+   do_diff_cache(get_commit_tree_oid(parent), _opts);
else
-   diff_tree_oid(>tree->object.oid,
- >commit->tree->object.oid,
+   diff_tree_oid(get_commit_tree_oid(parent),
+ get_commit_tree_oid(origin->commit),
  "", _opts);
diffcore_std(_opts);
 
@@ -622,10 +622,10 @@ static struct blame_origin *find_rename(struct commit 
*parent,
diff_setup_done(_opts);
 
if (is_null_oid(>commit->object.oid))
-   do_diff_cache(>tree->object.oid, _opts);
+   do_diff_cache(get_commit_tree_oid(parent), _opts);
else
-   diff_tree_oid(>tree->object.oid,
- >commit->tree->object.oid,
+   diff_tree_oid(get_commit_tree_oid(parent),
+ get_commit_tree_oid(origin->commit),
  "", _opts);
diffcore_std(_opts);
 
@@ -1257,10 +1257,10 @@ static void find_copy_in_parent(struct blame_scoreboard 
*sb,
diff_opts.flags.find_copies_harder = 1;
 
if (is_null_oid(>commit->object.oid))
-   do_diff_cache(>tree->object.oid, _opts);
+   do_diff_cache(get_commit_tree_oid(parent), _opts);
else
-   diff_tree_oid(>tree->object.oid,
- >commit->tree->object.oid,
+   diff_tree_oid(get_commit_tree_oid(parent),
+ get_commit_tree_oid(target->commit),
  "", _opts);
 
if (!diff_opts.flags.find_copies_harder)
diff --git a/builtin/checkout.c b/builtin/checkout.c
index d76e13c852..0b448fd179 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -484,7 +484,8 @@ static int merge_working_tree(const struct checkout_opts 
*opts,
 
resolve_undo_clear();
if (opts->force) {
-   ret = reset_tree(new_branch_info->commit->tree, opts, 1, 
writeout_error);
+   ret = reset_tree(get_commit_tree(new_branch_info->commit),
+opts, 1, writeout_error);
if (ret)
return ret;
} else {
@@ -570,19 +571,19 @@ static int merge_working_tree(const struct checkout_opts 
*opts,
o.verbosity = 0;
work = write_tree_from_memory();
 
-   ret = reset_tree(new_branch_info->commit->tree, opts, 1,
-writeout_error);
+   ret = 
reset_tree(get_commit_tree(new_branch_info->commit),
+opts, 1, writeout_error);
if (ret)
return ret;
o.ancestor = old_branch_info->name;
o.branch1 = new_branch_info->name;
o.branch2 = "local";
-   ret = merge_trees(,

[PATCH 0/3] Lazy-load trees when reading commit-graph

2018-04-03 Thread Derrick Stolee
There are several commit-graph walks that require loading many commits
but never walk the trees reachable from those commits. However, the
current logic in parse_commit() requires the root tree to be loaded.
This only uses lookup_tree(), but when reading commits from the commit-
graph file, the hashcpy() to load the root tree hash and the time spent
checking the object cache take more time than parsing the rest of the
commit.

In this patch series, all direct references to accessing the 'tree'
member of struct commit are replaced instead by one of the following
methods:

struct tree *get_commit_tree(struct commit *)
struct object_id *get_commit_tree_oid(struct commit *)

This replacement was assisted by a Coccinelle script, but the 'tree'
member is overloaded in other types, so the script gave false-positives
that were removed from the diff.

After all access is restricted to use these methods, we can then
change the postcondition of parse_commit_in_graph() to allow 'tree'
to be NULL. If the tree is accessed later, we can load the tree's
OID from the commit-graph in constant time and perform the lookup_tree().

On the Linux repository, performance tests were run for the following
command:

git log --graph --oneline -1000

Before: 0.83s
After:  0.65s
Rel %: -21.6%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (3):
  commit: create get_commit_tree() method
  treewide: use get_commit_tree() for tree access
  commit-graph: lazy-load trees

 blame.c   | 18 +-
 builtin/checkout.c| 17 +
 builtin/diff.c|  2 +-
 builtin/fast-export.c |  6 +++---
 builtin/log.c |  4 ++--
 builtin/reflog.c  |  2 +-
 commit-graph.c| 27 +++
 commit-graph.h|  7 +++
 commit.c  | 16 
 commit.h  |  3 +++
 fsck.c|  8 +---
 http-push.c   |  2 +-
 line-log.c|  4 ++--
 list-objects.c| 10 +-
 log-tree.c|  6 +++---
 merge-recursive.c |  3 ++-
 notes-merge.c |  8 
 packfile.c|  2 +-
 pretty.c  |  5 +++--
 ref-filter.c  |  2 +-
 revision.c|  8 
 sequencer.c   | 12 ++--
 sha1_name.c   |  2 +-
 tree.c|  4 ++--
 walker.c  |  2 +-
 25 files changed, 115 insertions(+), 65 deletions(-)

-- 
2.17.0.20.g9f30ba16e1



Re: [PATCH v2 1/5] core.aheadbehind: add new config setting

2018-04-03 Thread Derrick Stolee

On 4/3/2018 6:18 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Apr 03 2018, Lars Schneider wrote:

What is the state of this series? I can't find it in git/git nor in
git-for-windows/git. I think Stolee mentioned the config in
his Git Merge talk [1] and I was about to test it/roll it out :-)

It's in the gvfs branch of g...@github.com:Microsoft/git.git, i.e. it's
not in Git for Windows, but used in Microsoft's own in-house version
used for Windows.git.


Thanks for adding me to CC. I mentioned it in my talk because that was 
one thing we shipped internally as a "quick fix" until we could do the 
right thing.


If I remember correctly, Jeff abandoned shipping this upstream because 
it did have the feel of a hack and we wanted to see if users used the 
config setting or really cared about the output values. We saw fast 
adoption of the feature and even turned the config setting on 
automatically in the following version of GVFS.



I may be misunderstanding this feature, but my impression was that it
was a kludge as a workaround until the commit graph code landed, because
once we have that then surely we can just cheaply report the actual (or
approximate?) number in the common case, but of course it may still be
slow if your commit graph file is out of date.


You are correct that the commit-graph file may be out of date, causing 
slower performance. Even worse: the current graph patch only provides a 
constant-multiple speedup (still walking the same number of commits, but 
each commit is parsed much faster).


Speaking of our GVFS-specific fork [0], the 'gvfs' branch was updated 
just yesterday with a couple of changes that I am prepping for 
submission upstream:


* Lazy-load trees when parsing commits from commit-graph [1]
* Compute and consume generation numbers [2]

Each of these will speed up this ahead/behind calculation in different 
ways. [1] makes the cost of loading each commit a bit faster, saving up 
to 20% overall. [2] uses generation numbers in paint_down_to_common() to 
make the while() condition O(1) instead of O(Q) where Q is the size of 
the priority queue. The Windows repo is particularly "wide" with many 
parallel branches being merged in complicated ways, so the queue becomes 
quite large. This use of generation numbers saves about 4% on some 
ahead/behind calculations. This speedup is modest, but the existing code 
already made good use of limiting the commit walk to be mostly the 
"important" commits.


The real benefit of generation numbers will manifest in a way to make 
--topo-order much faster when rendering a small number of commits.


The generation numbers _could_ be used to approximate the ahead/behind 
calculation in the following way: When comparing A and B, and gen(A) < 
gen(B), then A is at least (gen(B) - gen(A)) behind. That's the only 
information that can be gathered directly from those values, but may be 
enough to short circuit an exact count.


To truly accelerate these ahead/behind calculations to be sub-linear* in 
the ahead/behind counts, we would need a bitmap-based approach. The 
object-reachability bitmap is a non-starter for client machines in the 
Windows repo, but perhaps a commit-reachability bitmap could be 
interesting. Performing set operations on the bitmaps could more quickly 
answer these questions. Just thinking about it makes me want to go down 
a deep rabbit hole, investigating ways to compute, store, and use these 
bitmaps. However: let's wait and see how necessary it is as the 
commit-graph feature stabilizes. (*These bitmap approaches are not 
guaranteed to be sub-linear, because it may include iterating through a 
list of O(N) bits, but good run-length encodings will likely make the 
count operation very fast, even with a set-difference operation included.)


There are too many fun things to work on, not enough time!

Thanks,
-Stolee

[0] https://github.com/microsoft/git
    Fork of GitForWindows that ships to Windows developers

[1] 
https://github.com/Microsoft/git/commit/29114bf86f591f5c87075f779a1faa2d0f17b92f
    Lazy-load trees when parsing commits from commit-graph 
(accidentally squashed to one commit)


[2] 
https://github.com/microsoft/git/compare/879b7d3b1bddea2587b28cdd656c9c655018683a...a0731ca93a35fd042560c4b30e8e0edbdfa4bf9f

    Compute and consume generation numbers


Re: [PATCH 0/3] Lazy-load trees when reading commit-graph

2018-04-03 Thread Derrick Stolee

On 4/3/2018 8:00 AM, Derrick Stolee wrote:

There are several commit-graph walks that require loading many commits
but never walk the trees reachable from those commits. However, the
current logic in parse_commit() requires the root tree to be loaded.
This only uses lookup_tree(), but when reading commits from the commit-
graph file, the hashcpy() to load the root tree hash and the time spent
checking the object cache take more time than parsing the rest of the
commit.

In this patch series, all direct references to accessing the 'tree'
member of struct commit are replaced instead by one of the following
methods:

struct tree *get_commit_tree(struct commit *)
struct object_id *get_commit_tree_oid(struct commit *)

This replacement was assisted by a Coccinelle script, but the 'tree'
member is overloaded in other types, so the script gave false-positives
that were removed from the diff.

After all access is restricted to use these methods, we can then
change the postcondition of parse_commit_in_graph() to allow 'tree'
to be NULL. If the tree is accessed later, we can load the tree's
OID from the commit-graph in constant time and perform the lookup_tree().

On the Linux repository, performance tests were run for the following
command:

 git log --graph --oneline -1000

Before: 0.83s
After:  0.65s
Rel %: -21.6%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (3):
   commit: create get_commit_tree() method
   treewide: use get_commit_tree() for tree access
   commit-graph: lazy-load trees



This patch series is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/4


[PATCH v2 2/4] commit: create get_commit_tree() method

2018-04-06 Thread Derrick Stolee
While walking the commit graph, we load struct commit objects into
the object cache. During this process, we also load struct tree
objects for the root tree of each of these commits. We load these
objects even if we are only computing commit reachability information,
such as a merge base or ahead/behind information.

Create get_commit_tree() as a first step to removing direct
references to the 'maybe_tree' member of struct commit.

Create get_commit_tree_oid() as a shortcut for several references
to ">maybe_tree->object.oid" in the codebase.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 10 ++
 commit.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/commit.c b/commit.c
index fbc092808c..aea2ca1f8b 100644
--- a/commit.c
+++ b/commit.c
@@ -296,6 +296,16 @@ void free_commit_buffer(struct commit *commit)
}
 }
 
+struct tree *get_commit_tree(const struct commit *commit)
+{
+   return commit->maybe_tree;
+}
+
+struct object_id *get_commit_tree_oid(const struct commit *commit)
+{
+   return _commit_tree(commit)->object.oid;
+}
+
 const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 {
struct commit_buffer *v = buffer_slab_peek(_slab, commit);
diff --git a/commit.h b/commit.h
index c4d6e6e064..dc4bf97d9f 100644
--- a/commit.h
+++ b/commit.h
@@ -102,6 +102,9 @@ void unuse_commit_buffer(const struct commit *, const void 
*buffer);
  */
 void free_commit_buffer(struct commit *);
 
+struct tree *get_commit_tree(const struct commit *);
+struct object_id *get_commit_tree_oid(const struct commit *);
+
 /*
  * Disassociate any cached object buffer from the commit, but do not free it.
  * The buffer (or NULL, if none) is returned.
-- 
2.17.0



[PATCH v2 1/4] treewide: rename tree to maybe_tree

2018-04-06 Thread Derrick Stolee
Using the commit-graph file to walk commit history removes the large
cost of parsing commits during the walk. This exposes a performance
issue: lookup_tree() takes a large portion of the computation time,
even when Git never uses those trees.

In anticipation of lazy-loading these trees, rename the 'tree' member
of struct commit to 'maybe_tree'. This serves two purposes: it hints
at the future role of possibly being NULL even if the commit has a
valid tree, and it allows for unambiguous transformation from simple
member access (i.e. commit->maybe_tree) to method access.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 blame.c   | 18 +-
 builtin/checkout.c| 12 ++--
 builtin/diff.c|  2 +-
 builtin/fast-export.c |  6 +++---
 builtin/log.c |  4 ++--
 builtin/reflog.c  |  2 +-
 commit-graph.c|  4 ++--
 commit.c  |  2 +-
 commit.h  |  2 +-
 fsck.c|  6 +++---
 http-push.c   |  2 +-
 line-log.c|  4 ++--
 list-objects.c| 10 +-
 log-tree.c|  6 +++---
 merge-recursive.c |  5 +++--
 notes-merge.c |  8 
 packfile.c|  2 +-
 pretty.c  |  4 ++--
 ref-filter.c  |  2 +-
 revision.c|  8 
 sequencer.c   | 12 ++--
 sha1_name.c   |  2 +-
 tree.c|  4 ++--
 walker.c  |  2 +-
 24 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/blame.c b/blame.c
index 200e0ad9a2..b78e649cac 100644
--- a/blame.c
+++ b/blame.c
@@ -553,10 +553,10 @@ static struct blame_origin *find_origin(struct commit 
*parent,
diff_setup_done(_opts);
 
if (is_null_oid(>commit->object.oid))
-   do_diff_cache(>tree->object.oid, _opts);
+   do_diff_cache(>maybe_tree->object.oid, _opts);
else
-   diff_tree_oid(>tree->object.oid,
- >commit->tree->object.oid,
+   diff_tree_oid(>maybe_tree->object.oid,
+ >commit->maybe_tree->object.oid,
  "", _opts);
diffcore_std(_opts);
 
@@ -622,10 +622,10 @@ static struct blame_origin *find_rename(struct commit 
*parent,
diff_setup_done(_opts);
 
if (is_null_oid(>commit->object.oid))
-   do_diff_cache(>tree->object.oid, _opts);
+   do_diff_cache(>maybe_tree->object.oid, _opts);
else
-   diff_tree_oid(>tree->object.oid,
- >commit->tree->object.oid,
+   diff_tree_oid(>maybe_tree->object.oid,
+ >commit->maybe_tree->object.oid,
  "", _opts);
diffcore_std(_opts);
 
@@ -1257,10 +1257,10 @@ static void find_copy_in_parent(struct blame_scoreboard 
*sb,
diff_opts.flags.find_copies_harder = 1;
 
if (is_null_oid(>commit->object.oid))
-   do_diff_cache(>tree->object.oid, _opts);
+   do_diff_cache(>maybe_tree->object.oid, _opts);
else
-   diff_tree_oid(>tree->object.oid,
- >commit->tree->object.oid,
+   diff_tree_oid(>maybe_tree->object.oid,
+ >commit->maybe_tree->object.oid,
  "", _opts);
 
if (!diff_opts.flags.find_copies_harder)
diff --git a/builtin/checkout.c b/builtin/checkout.c
index d76e13c852..b15fed5d85 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -484,7 +484,7 @@ static int merge_working_tree(const struct checkout_opts 
*opts,
 
resolve_undo_clear();
if (opts->force) {
-   ret = reset_tree(new_branch_info->commit->tree, opts, 1, 
writeout_error);
+   ret = reset_tree(new_branch_info->commit->maybe_tree, opts, 1, 
writeout_error);
if (ret)
return ret;
} else {
@@ -570,18 +570,18 @@ static int merge_working_tree(const struct checkout_opts 
*opts,
o.verbosity = 0;
work = write_tree_from_memory();
 
-   ret = reset_tree(new_branch_info->commit->tree, opts, 1,
+   ret = reset_tree(new_branch_info->commit->maybe_tree, 
opts, 1,
 writeout_error);
if (ret)
return ret;
o.ancestor = old_branch_info->name;
o.branch1 = new_branch_info->name;
o.branch2 = "local";
-   ret = merge_trees(, new_branch_info->commit->tree, 
work,
-

[PATCH v2 0/4] Lazy-load trees when reading commit-graph

2018-04-06 Thread Derrick Stolee
There are several commit-graph walks that require loading many commits
but never walk the trees reachable from those commits. However, the
current logic in parse_commit() requires the root tree to be loaded.
This only uses lookup_tree(), but when reading commits from the commit-
graph file, the hashcpy() to load the root tree hash and the time spent
checking the object cache take more time than parsing the rest of the
commit.

In this patch series, all direct references to accessing the 'tree'
member of struct commit are replaced instead by one of the following
methods:

struct tree *get_commit_tree(struct commit *)
struct object_id *get_commit_tree_oid(struct commit *)

This replacement was assisted by a Coccinelle script, but the 'tree'
member is overloaded in other types, so we first rename the 'tree'
member to 'maybe_tree' and use the compiler to ensure we caught all
examples. Then, contrib/coccinelle/commit.cocci generates the patch
to replace all accessors of 'maybe_tree' to the methods above.

After all access is restricted to use these methods, we can then
change the postcondition of parse_commit_in_graph() to allow 'maybe_tree'
to be NULL. If the tree is accessed later, we can load the tree's
OID from the commit-graph in constant time and perform the lookup_tree().

On the Linux repository, performance tests were run for the following
command:

git log --graph --oneline -1000

Before: 0.92s
After:  0.66s
Rel %: -28.3%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (4):
  treewide: rename tree to maybe_tree
  commit: create get_commit_tree() method
  treewide: replace maybe_tree with accessor methods
  commit-graph: lazy-load trees for commits

 blame.c | 18 +-
 builtin/checkout.c  | 18 --
 builtin/diff.c  |  2 +-
 builtin/fast-export.c   |  6 +++---
 builtin/log.c   |  4 ++--
 builtin/reflog.c|  2 +-
 commit-graph.c  | 27 +++
 commit-graph.h  |  7 +++
 commit.c| 18 +-
 commit.h|  5 -
 contrib/coccinelle/commit.cocci | 30 ++
 fsck.c  |  8 +---
 http-push.c |  2 +-
 line-log.c  |  4 ++--
 list-objects.c  | 10 +-
 log-tree.c  |  6 +++---
 merge-recursive.c   |  5 +++--
 notes-merge.c   |  9 +
 packfile.c  |  2 +-
 pretty.c|  5 +++--
 ref-filter.c|  2 +-
 revision.c  |  8 
 sequencer.c | 12 ++--
 sha1_name.c |  2 +-
 tree.c  |  4 ++--
 walker.c|  2 +-
 26 files changed, 152 insertions(+), 66 deletions(-)
 create mode 100644 contrib/coccinelle/commit.cocci

-- 
2.17.0



Re: [PATCH 3/3] ref-filter: factor ref_array pushing into its own function

2018-04-06 Thread Derrick Stolee

On 4/6/2018 2:59 PM, Jeff King wrote:

In preparation for callers constructing their own ref_array
structs, let's move our own internal push operation into its
own function.

While we're at it, we can replace REALLOC_ARRAY() with
ALLOC_GROW(), which should give the growth operation
amortized linear complexity (as opposed to growing by one,
which is potentially quadratic, though in-place realloc
growth often makes this faster in practice).

Signed-off-by: Jeff King <p...@peff.net>
---
  ref-filter.c | 16 +---
  ref-filter.h |  8 
  2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index c1c3cc9480..6e9328b274 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1840,6 +1840,18 @@ static struct ref_array_item *new_ref_array_item(const 
char *refname,
return ref;
  }
  
+struct ref_array_item *ref_array_push(struct ref_array *array,

+ const char *refname,
+ const struct object_id *oid)
+{
+   struct ref_array_item *ref = new_ref_array_item(refname, oid);
+
+   ALLOC_GROW(array->items, array->nr + 1, array->alloc);
+   array->items[array->nr++] = ref;
+
+   return ref;
+}
+
  static int ref_kind_from_refname(const char *refname)
  {
unsigned int i;
@@ -1930,13 +1942,11 @@ static int ref_filter_handler(const char *refname, 
const struct object_id *oid,
 * to do its job and the resulting list may yet to be pruned
 * by maxcount logic.
 */
-   ref = new_ref_array_item(refname, oid);
+   ref = ref_array_push(ref_cbdata->array, refname, oid);
ref->commit = commit;
ref->flag = flag;
ref->kind = kind;
  
-	REALLOC_ARRAY(ref_cbdata->array->items, ref_cbdata->array->nr + 1);

-   ref_cbdata->array->items[ref_cbdata->array->nr++] = ref;
return 0;
  }
  
diff --git a/ref-filter.h b/ref-filter.h

index 68268f9ebc..76cf87cb6c 100644
--- a/ref-filter.h
+++ b/ref-filter.h
@@ -135,4 +135,12 @@ void setup_ref_filter_porcelain_msg(void);
  void pretty_print_ref(const char *name, const struct object_id *oid,
  const struct ref_format *format);
  
+/*

+ * Push a single ref onto the array; this can be used to construct your own
+ * ref_array without using filter_refs().
+ */
+struct ref_array_item *ref_array_push(struct ref_array *array,
+ const char *refname,
+ const struct object_id *oid);
+
  #endif /*  REF_FILTER_H  */


The three patches in this series look good to me.

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>


Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph

2018-04-06 Thread Derrick Stolee

On 4/6/2018 3:21 PM, Jeff King wrote:

On Fri, Apr 06, 2018 at 07:09:30PM +, Derrick Stolee wrote:


Derrick Stolee (4):
   treewide: rename tree to maybe_tree
   commit: create get_commit_tree() method
   treewide: replace maybe_tree with accessor methods
   commit-graph: lazy-load trees for commits

I gave this only a cursory read, but it addresses my concern from the
previous round.

If I were doing it myself, I probably would have folded patches 1 and 3
together. They are touching all the same spots, and it would be an error
for any case converted in patch 1 to not get converted in patch 3. I'm
assuming you caught them all due to Coccinelle, though IMHO it is
somewhat overkill here. By folding them together the compiler could tell
you which spots you missed.

And going forward, I doubt it is going to be a common error for people
to use maybe_tree directly. Between the name and the warning comment,
you'd have to really try to shoot yourself in the foot with it. The
primary concern was catching people using the existing "tree" name,
whose semantics changed.

All that said, I'm fine with having it done this way, too.


Thanks. As a double-check that I caught all of the 'maybe_tree' 
accesses, I ran the following:


$ git grep maybe_tree | grep -v get_commit_tree
commit-graph.c: item->maybe_tree = NULL;
commit-graph.c: c->maybe_tree = lookup_tree();
commit-graph.c: return c->maybe_tree;
commit-graph.c: if (c->maybe_tree)
commit-graph.c: return c->maybe_tree;
commit.c:   if (commit->maybe_tree || !commit->object.parsed)
commit.c:   return commit->maybe_tree;
commit.c:   item->maybe_tree = lookup_tree();
commit.h:   struct tree *maybe_tree;
contrib/coccinelle/commit.cocci:- >maybe_tree->object.oid
contrib/coccinelle/commit.cocci:- c->maybe_tree->object.oid.hash
contrib/coccinelle/commit.cocci:- c->maybe_tree
contrib/coccinelle/commit.cocci:+ c->maybe_tree = s
contrib/coccinelle/commit.cocci:+ return c->maybe_tree;
merge-recursive.c:  commit->maybe_tree = tree;

Thanks,
-Stolee


[PATCH v2 4/4] commit-graph: lazy-load trees for commits

2018-04-06 Thread Derrick Stolee
The commit-graph file provides quick access to commit data, including
the OID of the root tree for each commit in the graph. When performing
a deep commit-graph walk, we may not need to load most of the trees
for these commits.

Delay loading the tree object for a commit loaded from the graph
until requested via get_commit_tree(). Do not lazy-load trees for
commits not in the graph, since that requires duplicate parsing
and the relative peformance improvement when trees are not needed
is small.

On the Linux repository, performance tests were run for the following
command:

git log --graph --oneline -1000

Before: 0.92s
After:  0.66s
Rel %: -28.3%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 26 +++---
 commit-graph.h |  2 ++
 commit.c   |  8 +++-
 commit.h   |  6 ++
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 9f37d84209..a5de6f3102 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -247,7 +247,6 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
 
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
-   struct object_id oid;
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
@@ -257,8 +256,7 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
item->object.parsed = 1;
item->graph_pos = pos;
 
-   hashcpy(oid.hash, commit_data);
-   item->maybe_tree = lookup_tree();
+   item->maybe_tree = NULL;
 
date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
date_low = get_be32(commit_data + g->hash_len + 12);
@@ -317,6 +315,28 @@ int parse_commit_in_graph(struct commit *item)
return 0;
 }
 
+static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
+{
+   struct object_id oid;
+   const unsigned char *commit_data = g->chunk_commit_data +
+  GRAPH_DATA_WIDTH * (c->graph_pos);
+
+   hashcpy(oid.hash, commit_data);
+   c->maybe_tree = lookup_tree();
+
+   return c->maybe_tree;
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   if (c->maybe_tree)
+   return c->maybe_tree;
+   if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
+   BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
+
+   return load_tree_for_commit(commit_graph, (struct commit *)c);
+}
+
 static void write_graph_chunk_fanout(struct hashfile *f,
 struct commit **commits,
 int nr_commits)
diff --git a/commit-graph.h b/commit-graph.h
index e1d8580c98..260a468e73 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,8 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+struct tree *get_commit_tree_in_graph(const struct commit *c);
+
 struct commit_graph {
int graph_fd;
 
diff --git a/commit.c b/commit.c
index aea2ca1f8b..711f674c18 100644
--- a/commit.c
+++ b/commit.c
@@ -298,7 +298,13 @@ void free_commit_buffer(struct commit *commit)
 
 struct tree *get_commit_tree(const struct commit *commit)
 {
-   return commit->maybe_tree;
+   if (commit->maybe_tree || !commit->object.parsed)
+   return commit->maybe_tree;
+
+   if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
+   BUG("commit has NULL tree, but was not loaded from 
commit-graph");
+
+   return get_commit_tree_in_graph(commit);
 }
 
 struct object_id *get_commit_tree_oid(const struct commit *commit)
diff --git a/commit.h b/commit.h
index dc4bf97d9f..23a3f364ed 100644
--- a/commit.h
+++ b/commit.h
@@ -22,6 +22,12 @@ struct commit {
unsigned int index;
timestamp_t date;
struct commit_list *parents;
+
+   /*
+* If the commit is loaded from the commit-graph file, then this
+* member may be NULL. Only access it through get_commit_tree()
+* or get_commit_tree_oid().
+*/
struct tree *maybe_tree;
uint32_t graph_pos;
 };
-- 
2.17.0



[PATCH v2 09/10] commit: use generation numbers for in_merge_bases()

2018-04-09 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 00bdc2ab21..0b155dece8 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return 0;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0



[PATCH v2 02/10] merge: check config before loading commits

2018-04-09 Thread Derrick Stolee
In anticipation of using generation numbers from the commit-graph,
we must ensure that all commits that exist in the commit-graph are
loaded from that file instead of from the object database. Since
the commit-graph file is only checked if core.commitGraph is true,
we must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch'
and the commit lookup. Also add a test to t5318-commit-graph.sh
that exercises this code path to prevent a regression.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/merge.c | 5 +++--
 t/t5318-commit-graph.sh | 9 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index ee050a47f3..20897f8223 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
 
-   init_diff_ui_defaults();
-   git_config(git_merge_config, NULL);
 
if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'
+
 test_done
-- 
2.17.0



[PATCH v2 10/10] commit: add short-circuit to paint_down_to_common()

2018-04-09 Thread Derrick Stolee
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 0b155dece8..7348075e38 100644
--- a/commit.c
+++ b/commit.c
@@ -796,7 +796,9 @@ static int queue_has_nonstale(struct prio_queue *queue, 
uint32_t min_gen)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
 {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
@@ -830,6 +832,9 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
 
last_gen = commit->generation;
 
+   if (commit->generation < min_generation)
+   break;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -882,7 +887,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
 
-   list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
 
while (list) {
struct commit *commit = pop_commit();
@@ -949,7 +954,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1073,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
 
-   bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
-- 
2.17.0



[PATCH v2 08/10] ref-filter: use generation number for --contains

2018-04-09 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..2f5e79b5de 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit 
*candidate,
return CONTAINS_YES;
}
 
-   /* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,18 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1634,7 +1648,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1648,7 +1662,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0



[PATCH v8 14/14] commit-graph: implement "--append" option

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to add all commits from the existing
commit-graph file to the file about to be written. This should be
used when adding new commits without performing garbage collection.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 10 ++
 builtin/commit-graph.c | 10 +++---
 commit-graph.c | 17 -
 commit-graph.h |  3 ++-
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 442ac243e6..4c97b555cc 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -43,6 +43,9 @@ With the `--stdin-commits` option, generate the new commit 
graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
 --stdin-packs.)
++
+With the `--append` option, include all commits that are present in the
+existing commit-graph file.
 
 'read'::
 
@@ -72,6 +75,13 @@ $ echo  | git commit-graph write --stdin-packs
 $ git show-ref -s | git commit-graph write --stdin-commits
 
 
+* Write a graph file containing all commits in the current
+* commit-graph file along with those reachable from HEAD.
++
+
+$ git rev-parse HEAD | git commit-graph write --stdin-commits --append
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index b5c0b08905..37420ae0fd 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,7 +18,7 @@ static const char * const builtin_commit_graph_read_usage[] = 
{
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -26,6 +26,7 @@ static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
int stdin_commits;
+   int append;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -94,6 +95,8 @@ static int graph_write(int argc, const char **argv)
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
N_("start walk at commits listed by stdin")),
+   OPT_BOOL(0, "append", ,
+   N_("include all commits already in the commit-graph 
file")),
OPT_END(),
};
 
@@ -131,7 +134,8 @@ static int graph_write(int argc, const char **argv)
   pack_indexes,
   packs_nr,
   commit_hex,
-  commits_nr);
+  commits_nr,
+  opts.append);
 
return 0;
 }
diff --git a/commit-graph.c b/commit-graph.c
index a59d1e387b..3ff8c84c0e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -553,7 +553,8 @@ void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
const char **commit_hex,
-   int nr_commits)
+   int nr_commits,
+   int append)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -571,10 +572,24 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
 
+   if (append) {
+   prepare_commit_graph_one(obj_dir);
+   if (commit_graph)
+   oids.alloc += commit_graph->num_commits;
+   }
+
if (oids.alloc < 1024)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
+   if (append && commit_graph) {
+   for (i = 0; i < commit_graph->num_commits; i++) {
+   const unsigned char *hash = 
commit_graph->chunk_oid_lookup +
+   commit_graph->hash_len * i;
+

[PATCH v8 13/14] commit-graph: build graph from starting commits

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 14 +-
 builtin/commit-graph.c | 27 +--
 commit-graph.c | 27 +--
 commit-graph.h |  4 +++-
 t/t5318-commit-graph.sh| 13 +
 5 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 8143cc3f07..442ac243e6 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -36,7 +36,13 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
-walking objects only in the specified pack-indexes.
+walking objects only in the specified pack-indexes. (Cannot be combined
+with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
 
 'read'::
 
@@ -60,6 +66,12 @@ $ git commit-graph write
 $ echo  | git commit-graph write --stdin-packs
 
 
+* Write a graph file containing all reachable commits.
++
+
+$ git show-ref -s | git commit-graph write --stdin-commits
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 5c70199003..b5c0b08905 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -18,13 +18,14 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
+   N_("git commit-graph write [--object-dir ] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
int stdin_packs;
+   int stdin_commits;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -79,6 +80,8 @@ static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
int packs_nr = 0;
+   const char **commit_hex = NULL;
+   int commits_nr = 0;
const char **lines = NULL;
int lines_nr = 0;
int lines_alloc = 0;
@@ -89,6 +92,8 @@ static int graph_write(int argc, const char **argv)
N_("The object directory to store the graph")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan pack-indexes listed by stdin for commits")),
+   OPT_BOOL(0, "stdin-commits", _commits,
+   N_("start walk at commits listed by stdin")),
OPT_END(),
};
 
@@ -96,10 +101,12 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
+   if (opts.stdin_packs && opts.stdin_commits)
+   die(_("cannot use both --stdin-commits and --stdin-packs"));
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
lines_nr = 0;
lines_alloc = 128;
@@ -110,13 +117,21 @@ static int graph_write(int argc, const char **argv)
lines[lines_nr++] = strbuf_detach(, NULL);
}
 
-   pack_indexes = lines;
-   packs_nr = lines_nr;
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+  

[PATCH v8 11/14] commit: integrate commit graph with commit parsing

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 678,653
reachable commits and is behind 'origin/master' by 59,929 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv   |  1.02s |  0.14s | -86%  |
| rev-list --all   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects | 66.15s | 58.45s | -11%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c |   1 +
 commit-graph.c  | 141 +++-
 commit-graph.h  |  12 
 commit.c|   3 +
 commit.h|   3 +
 t/t5318-commit-graph.sh |  47 +-
 6 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadfacd..cf4f8b61e1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
struct commit *c = alloc_node(_state, sizeof(struct commit));
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
+   c->graph_pos = COMMIT_NOT_FROM_GRAPH;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index ea29c5c2d8..f745186e7f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,7 +38,6 @@
 #define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
GRAPH_OID_LEN + 8)
 
-
 char *get_commit_graph_filename(const char *obj_dir)
 {
return xstrfmt("%s/info/commit-graph", obj_dir);
@@ -179,6 +178,145 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
exit(1);
 }
 
+/* global storage */
+static struct commit_graph *commit_graph = NULL;
+
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+   char *graph_name;
+
+   if (commit_graph)
+   return;
+
+   graph_name = get_commit_graph_filename(obj_dir);
+   commit_graph = load_commit_graph_one(graph_name);
+
+   FREE_AND_NULL(graph_name);
+}
+
+static int prepare_commit_graph_run_once = 0;
+static void prepare_commit_graph(void)
+{
+   struct alternate_object_database *alt;
+   char *obj_dir;
+
+   if (prepare_commit_graph_run_once)
+   return;
+   prepare_commit_graph_run_once = 1;
+
+   obj_dir = get_object_directory();
+   prepare_commit_graph_one(obj_dir);
+   prepare_alt_odb();
+   for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+   prepare_commit_graph_one(alt->path);
+}
+
+static void close_commit_graph(void)
+{
+   if (!commit_graph)
+   return;
+
+   if (commit_graph->graph_fd >= 0) {
+   munmap((void *)commit_graph->data, commit_graph->data_len);
+   commit_graph->data = NULL;
+   close(commit_graph->graph_fd);
+   }
+
+   FREE_AND_NULL(commit_graph);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, 
uint32_t *pos)
+{
+   return bsearch_hash(oid->hash, g->chunk_oid_fanout,
+   g->chunk_oid_lookup, g->hash_len, pos);
+}
+
+static struct commit_list **insert_parent_or_die(struct commit_graph *g,
+uint64_t pos,
+struct commit_list **pptr)
+{
+   struct commit *c;
+   struct object_id oid;
+   hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
+   c = lookup_commit();
+   if (!c)
+   die("could not find commit %s", oid_to_hex());
+   c->graph_pos = pos;
+   return _list_insert(c, pptr)->next;
+}
+
+static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
+{
+   struct object_id oid;
+   uint32_t edge_value;
+   uint32_t *parent_data_ptr;
+   uint64_t date_low, date_high;
+   struct commit_list **pptr;
+   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+
+   item->object.parsed = 1;
+   item->graph_pos = pos;
+
+   hashcpy(oid.hash, commit_data);
+   item->tree = lookup_tree();
+
+   date_high = get_be32(commit_data + g->

[PATCH v8 09/14] commit-graph: add core.commitGraph setting

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

The commit graph feature is controlled by the new core.commitGraph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.commitGraph is that a user can always stop checking
for or parsing commit graph files if core.commitGraph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 4 
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 11 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 4e0cff87f6..e5c7013fb0 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,10 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitGraph::
+   Enable git commit graph feature. Allows reading from the
+   commit-graph file.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index a61b2d3f0d..8bdbcbbbf7 100644
--- a/cache.h
+++ b/cache.h
@@ -805,6 +805,7 @@ extern char *git_replace_ref_base;
 
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index b0c20e6cb8..25ee4a676c 100644
--- a/config.c
+++ b/config.c
@@ -1226,6 +1226,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.commitgraph")) {
+   core_commit_graph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index d6dd64662c..8853e2f0dd 100644
--- a/environment.c
+++ b/environment.c
@@ -62,6 +62,7 @@ enum push_default_type push_default = 
PUSH_DEFAULT_UNSPECIFIED;
 enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
+int core_commit_graph;
 int core_apply_sparse_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
-- 
2.17.0



[PATCH v8 12/14] commit-graph: read only from specific pack-indexes

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 +-
 builtin/commit-graph.c | 33 +++---
 commit-graph.c | 26 +--
 commit-graph.h |  4 +++-
 packfile.c |  4 ++--
 packfile.h |  2 ++
 t/t5318-commit-graph.sh| 10 +
 7 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 8aad8303f5..8143cc3f07 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -34,7 +34,9 @@ COMMANDS
 'write'::
 
 Write a commit graph file based on the commits found in packfiles.
-Includes all commits from the existing commit graph file.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified pack-indexes.
 
 'read'::
 
@@ -51,6 +53,13 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Write a graph file, extending the current graph file using commits
+* in .
++
+
+$ echo  | git commit-graph write --stdin-packs
+
+
 * Read basic information from the commit-graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index efd39331d7..5c70199003 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,7 +8,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
@@ -18,12 +18,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ] [--stdin-packs]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int stdin_packs;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -76,10 +77,18 @@ static int graph_read(int argc, const char **argv)
 
 static int graph_write(int argc, const char **argv)
 {
+   const char **pack_indexes = NULL;
+   int packs_nr = 0;
+   const char **lines = NULL;
+   int lines_nr = 0;
+   int lines_alloc = 0;
+
static struct option builtin_commit_graph_write_options[] = {
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "stdin-packs", _packs,
+   N_("scan pack-indexes listed by stdin for commits")),
OPT_END(),
};
 
@@ -90,7 +99,25 @@ static int graph_write(int argc, const char **argv)
if (!opts.obj_dir)
opts.obj_dir = get_object_directory();
 
-   write_commit_graph(opts.obj_dir);
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   lines_nr = 0;
+   lines_alloc = 128;
+   ALLOC_ARRAY(lines, lines_alloc);
+
+   while (strbuf_getline(, stdin) != EOF) {
+   ALLOC_GROW(lines, lines_nr + 1, lines_alloc);
+   lines[lines_nr++] = strbuf_detach(, NULL);
+   }
+
+   pack_indexes = lines;
+   packs_nr = lines_nr;
+   }
+
+   write_commit_graph(opts.obj_dir,
+  pack_indexes,
+  packs_nr);
+
return 0;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index f745186e7f..70472840a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -549,7 +549,9 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
-void write_commit_graph(const char *obj_dir)
+void write_commit_graph(const char *obj_dir,
+   const char **pack_indexes,
+   int nr_packs)
 {
struct packed_oid_list oids;
struct packed_commit_list commits;
@@ -571,7 +573,27 @@ void write_commit_graph(const char *obj_dir)
oids.alloc = 1024;
ALLOC_ARRAY(oids.list, oids.alloc);
 
-   for_each_packed_object(add_packed_commits, , 0);
+   if (pack_indexes) {
+   struct strbuf packname = STRBUF_I

Re: What's cooking in git.git (Apr 2018, #01; Mon, 9)

2018-04-10 Thread Derrick Stolee

On 4/9/2018 6:08 PM, Junio C Hamano wrote:


I guess we'd want a final cleaned-up round after all ;-)  Thanks.


v8 sent [1]. Thanks.

-Stolee

[1] 
https://public-inbox.org/git/20180410125608.39443-1-dsto...@microsoft.com/T/#u


[PATCH v8 10/14] commit-graph: close under reachability

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index b1bd3a892d..ea29c5c2d8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -367,6 +367,50 @@ static int add_packed_commits(const struct object_id *oid,
return 0;
 }
 
+static void add_missing_parents(struct packed_oid_list *oids, struct commit 
*commit)
+{
+   struct commit_list *parent;
+   for (parent = commit->parents; parent; parent = parent->next) {
+   if (!(parent->item->object.flags & UNINTERESTING)) {
+   ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+   oidcpy(>list[oids->nr], 
&(parent->item->object.oid));
+   oids->nr++;
+   parent->item->object.flags |= UNINTERESTING;
+   }
+   }
+}
+
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct commit *commit;
+
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+   if (commit)
+   commit->object.flags |= UNINTERESTING;
+   }
+
+   /*
+* As this loop runs, oids->nr may grow, but not more
+* than the number of missing commits in the reachable
+* closure.
+*/
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+
+   if (commit && !parse_commit(commit))
+   add_missing_parents(oids, commit);
+   }
+
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(>list[i]);
+
+   if (commit)
+   commit->object.flags &= ~UNINTERESTING;
+   }
+}
+
 void write_commit_graph(const char *obj_dir)
 {
struct packed_oid_list oids;
@@ -390,6 +434,7 @@ void write_commit_graph(const char *obj_dir)
ALLOC_ARRAY(oids.list, oids.alloc);
 
for_each_packed_object(add_packed_commits, , 0);
+   close_reachable();
 
QSORT(oids.list, oids.nr, commit_compare);
 
-- 
2.17.0



[PATCH v8 07/14] commit-graph: implement git-commit-graph write

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to write graph files. Create new test script to verify
this command succeeds without failure.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  41 ++
 builtin/commit-graph.c |  33 
 t/t5318-commit-graph.sh| 124 +
 3 files changed, 198 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index f3b34622a8..47996e8f89 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,6 +5,47 @@ NAME
 
 git-commit-graph - Write and verify Git commit graph files
 
+
+SYNOPSIS
+
+[verse]
+'git commit-graph write'  [--object-dir ]
+
+
+DESCRIPTION
+---
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+---
+--object-dir::
+   Use given directory for the location of packfiles and commit graph
+   file. This parameter exists to specify the location of an alternate
+   that only has the objects directory, not a full .git directory. The
+   commit graph file is expected to be at /info/commit-graph and
+   the packfiles are expected to be in /pack.
+
+
+COMMANDS
+
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
+Includes all commits from the existing commit graph file.
+
+
+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph write
+
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index b466ecd781..26b6360289 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,9 +1,18 @@
 #include "builtin.h"
 #include "config.h"
+#include "dir.h"
+#include "lockfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph write [--object-dir ]"),
+   NULL
+};
+
+static const char * const builtin_commit_graph_write_usage[] = {
+   N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
@@ -11,6 +20,25 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_write(int argc, const char **argv)
+{
+   static struct option builtin_commit_graph_write_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_write_options,
+builtin_commit_graph_write_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   write_commit_graph(opts.obj_dir);
+   return 0;
+}
 
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
@@ -31,6 +59,11 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 builtin_commit_graph_usage,
 PARSE_OPT_STOP_AT_NON_OPTION);
 
+   if (argc > 0) {
+   if (!strcmp(argv[0], "write"))
+   return graph_write(argc, argv);
+   }
+
usage_with_options(builtin_commit_graph_usage,
   builtin_commit_graph_options);
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 00..d7b635bd68
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,124 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' '
+   mkdir full &&
+   cd "$TRASH_DIRECTORY/full" &&
+   git init &&
+   objdir=".git/objects"
+'
+
+test_expect_success 'write graph with no packs' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph write --object-dir . &&
+   test_path_is_file info/commit-graph
+'
+
+test_expect_success 'create commits and repack' '
+   cd "$TRASH_DIRECTORY/full" &&
+   for i in $(test_seq 3)
+   do
+   test_commit $i &&
+   git branch commits/$i
+   done &&
+   git repack
+'
+
+test_expect_success 'write graph' '
+   cd "$TRASH_DIRECTORY/full" &&
+   graph1=$(git commit-graph write) &&
+   test_path_is_file $objdir/info/commit-graph
+'
+
+test_expect_success 'Add more commits' '
+ 

[PATCH v8 04/14] graph: add commit graph design document

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 163 +++
 1 file changed, 163 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
new file mode 100644
index 00..0550c6d0dc
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,163 @@
+Git Commit Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--
+
+- The commit graph file is stored in a file named 'commit-graph' in the
+  .git/objects/info directory. This could be stored in the info directory
+  of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+  so a future change of hash algorithm does not require a change in format.
+
+Future Work
+---
+
+- The commit graph feature currently does not honor commit grafts. This can
+  be remedied by duplicating or refactoring the current graft logic.
+
+- The 'commit-graph' subcommand does not have a "verify" mode that is
+  necessary for integration with fsck.
+
+- The file format includes ro

[PATCH v8 08/14] commit-graph: implement git commit-graph read

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git-commit-graph to read commit graph files and summarize their contents.

Use the read subcommand to verify the contents of a commit graph file in the
tests.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  12 +++
 builtin/commit-graph.c |  56 
 commit-graph.c | 137 -
 commit-graph.h |  23 +
 t/t5318-commit-graph.sh|  32 +--
 5 files changed, 254 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 47996e8f89..8aad8303f5 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 
 [verse]
+'git commit-graph read' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
 
@@ -35,6 +36,11 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file.
 
+'read'::
+
+Read a graph file given by the commit-graph file and output basic
+details about the graph file. Used for debugging purposes.
+
 
 EXAMPLES
 
@@ -45,6 +51,12 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Read basic information from the commit-graph file.
++
+
+$ git commit-graph read
+
+
 
 GIT
 ---
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 26b6360289..efd39331d7 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,10 +7,16 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ]"),
NULL
 };
 
+static const char * const builtin_commit_graph_read_usage[] = {
+   N_("git commit-graph read [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_write_usage[] = {
N_("git commit-graph write [--object-dir ]"),
NULL
@@ -20,6 +26,54 @@ static struct opts_commit_graph {
const char *obj_dir;
 } opts;
 
+static int graph_read(int argc, const char **argv)
+{
+   struct commit_graph *graph = NULL;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_read_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_read_options,
+builtin_commit_graph_read_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_name);
+   FREE_AND_NULL(graph_name);
+
+   printf("header: %08x %d %d %d %d\n",
+   ntohl(*(uint32_t*)graph->data),
+   *(unsigned char*)(graph->data + 4),
+   *(unsigned char*)(graph->data + 5),
+   *(unsigned char*)(graph->data + 6),
+   *(unsigned char*)(graph->data + 7));
+   printf("num_commits: %u\n", graph->num_commits);
+   printf("chunks:");
+
+   if (graph->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (graph->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (graph->chunk_commit_data)
+   printf(" commit_metadata");
+   if (graph->chunk_large_edges)
+   printf(" large_edges");
+   printf("\n");
+
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
static struct option builtin_commit_graph_write_options[] = {
@@ -60,6 +114,8 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "read"))
+   return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
return graph_write(argc, argv);
}
diff --git a/commit-graph.c b/commit-graph.c
index f3f7c4f189..b1bd3a892d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -39,11 +39,146 @@
GRAPH_OID_LEN + 8)
 
 
-static char *get_commit_graph_f

[PATCH v8 06/14] commit-graph: implement write_commit_graph()

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given object
directory.

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |   1 +
 commit-graph.c | 359 +
 commit-graph.h |   6 +
 3 files changed, 366 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index a59b62bed1..26a23257e9 100644
--- a/Makefile
+++ b/Makefile
@@ -777,6 +777,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 00..f3f7c4f189
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,359 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "revision.h"
+#include "sha1-lookup.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_OCTOPUS_EDGES_NEEDED 0x8000
+#define GRAPH_PARENT_MISSING 0x7fff
+#define GRAPH_EDGE_LAST_MASK 0x7fff
+#define GRAPH_PARENT_NONE 0x7000
+
+#define GRAPH_LAST_EDGE 0x8000
+
+#define GRAPH_FANOUT_SIZE (4 * 256)
+#define GRAPH_CHUNKLOOKUP_WIDTH 12
+#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
+   GRAPH_OID_LEN + 8)
+
+
+static char *get_commit_graph_filename(const char *obj_dir)
+{
+   return xstrfmt("%s/info/commit-graph", obj_dir);
+}
+
+static void write_graph_chunk_fanout(struct hashfile *f,
+struct commit **commits,
+int nr_commits)
+{
+   int i, count = 0;
+   struct commit **list = commits;
+
+   /*
+* Write the first-level table (the list is sorted,
+* but we use a 256-entry lookup to be able to avoid
+* having to do eight extra binary search iterations).
+*/
+   for (i = 0; i < 256; i++) {
+   while (count < nr_commits) {
+   if ((*list)->object.oid.hash[0] != i)
+   break;
+   count++;
+   list++;
+   }
+
+   hashwrite_be32(f, count);
+   }
+}
+
+static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   int count;
+   for (count = 0; count < nr_commits; count++, list++)
+   hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static const unsigned char *commit_to_sha1(size_t index, void *table)
+{
+   struct commit **commits = table;
+   return commits[index]->object.oid.hash;
+}
+
+static void write_graph_chunk_data(struct hashfile *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+   uint32_t num_extra_edges = 0;
+
+   while (list < last) {
+   struct commit_list *parent;
+   int edge_value;
+   uint32_t packedDate[2];
+
+   parse_commit(*list);
+   hashwrite(f, (*list)->tree->object.oid.hash, hash_len);
+
+   parent = (*list)->parents;
+
+   if (!parent)
+   edge_value = GRAPH_PARENT_NONE;
+   else {
+   edge_value = sha1_pos(parent->item->object.oid.hash,
+ commits,
+ nr_commits,
+ commit_to_sha1);
+
+   if (edge_value < 0)
+   edge_value = GRAPH_PARENT_MISSING;
+   }
+
+   hashwrite_be32(f, edge_value);
+
+   

[PATCH v8 02/14] csum-file: refactor finalize_hashfile() method

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

If we want to use a hashfile on the temporary file for a lockfile, then
we need finalize_hashfile() to fully write the trailing hash but also keep
the file descriptor open.

Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional
change that checks this flag before writing the checksum to the stream.
This differs from previous behavior since it would be written if either
CSUM_CLOSE or CSUM_FSYNC is provided.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/pack-objects.c | 4 ++--
 bulk-checkin.c | 2 +-
 csum-file.c| 8 
 csum-file.h| 5 +++--
 pack-bitmap-write.c| 2 +-
 pack-write.c   | 5 +++--
 6 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ab3e80ee49..b09bbf4f4c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,9 +837,9 @@ static void write_pack_file(void)
 * If so, rewrite it like in fast-import
 */
if (pack_to_stdout) {
-   finalize_hashfile(f, oid.hash, CSUM_CLOSE);
+   finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | 
CSUM_CLOSE);
} else if (nr_written == nr_remaining) {
-   finalize_hashfile(f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(f, oid.hash, CSUM_HASH_IN_STREAM | 
CSUM_FSYNC | CSUM_CLOSE);
} else {
int fd = finalize_hashfile(f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, pack_tmp_name,
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 227cc9f3b1..70b14fdf41 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -35,7 +35,7 @@ static void finish_bulk_checkin(struct bulk_checkin_state 
*state)
unlink(state->pack_tmp_name);
goto clear_exit;
} else if (state->nr_written == 1) {
-   finalize_hashfile(state->f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(state->f, oid.hash, CSUM_HASH_IN_STREAM | 
CSUM_FSYNC | CSUM_CLOSE);
} else {
int fd = finalize_hashfile(state->f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name,
diff --git a/csum-file.c b/csum-file.c
index e6c95a6915..53ce37f7ca 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -61,11 +61,11 @@ int finalize_hashfile(struct hashfile *f, unsigned char 
*result, unsigned int fl
the_hash_algo->final_fn(f->buffer, >ctx);
if (result)
hashcpy(result, f->buffer);
-   if (flags & (CSUM_CLOSE | CSUM_FSYNC)) {
-   /* write checksum and close fd */
+   if (flags & CSUM_HASH_IN_STREAM)
flush(f, f->buffer, the_hash_algo->rawsz);
-   if (flags & CSUM_FSYNC)
-   fsync_or_die(f->fd, f->name);
+   if (flags & CSUM_FSYNC)
+   fsync_or_die(f->fd, f->name);
+   if (flags & CSUM_CLOSE) {
if (close(f->fd))
die_errno("%s: sha1 file error on close", f->name);
fd = 0;
diff --git a/csum-file.h b/csum-file.h
index 9ba87f0a6c..c5a2e335e7 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -27,8 +27,9 @@ extern void hashfile_checkpoint(struct hashfile *, struct 
hashfile_checkpoint *)
 extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
 /* finalize_hashfile flags */
-#define CSUM_CLOSE 1
-#define CSUM_FSYNC 2
+#define CSUM_CLOSE 1
+#define CSUM_FSYNC 2
+#define CSUM_HASH_IN_STREAM4
 
 extern struct hashfile *hashfd(int fd, const char *name);
 extern struct hashfile *hashfd_check(const char *name);
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index 662b44f97d..db4c832428 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -535,7 +535,7 @@ void bitmap_writer_finish(struct pack_idx_entry **index,
if (options & BITMAP_OPT_HASH_CACHE)
write_hash_cache(f, index, index_nr);
 
-   finalize_hashfile(f, NULL, CSUM_FSYNC);
+   finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC | 
CSUM_CLOSE);
 
if (adjust_shared_perm(tmp_file.buf))
die_errno("unable to make temporary bitmap file readable");
diff --git a/pack-write.c b/pack-write.c
index 044f427392..a9d46bc03f 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -170,8 +170,9 @@ const char *write_idx_file(const char *index_name, struct 
pack_idx_entry **objec
}
 
hashwrite(f, sha1, the_hash_algo->rawsz);
-   finalize_hashfile(f, NULL, ((opts->flags & WRITE_IDX_VERIFY)
-   ? CSUM_CLOSE : CSUM_FSYNC));
+   finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
+   

[PATCH v8 03/14] commit-graph: add format document

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .../technical/commit-graph-format.txt | 97 +++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 00..ad6af8105c
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,97 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+These positional references are stored as unsigned 32-bit integers
+corresponding to the array position withing the list of commit OIDs. We
+use the most-significant bit for special purposes, so we can store at most
+(1 << 31) - 1 (around 2 billion) commits.
+
+== Commit graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks
+and hash type.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+  The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+  Currently, the only valid version is 1.
+
+  1-byte Hash Version (1 = SHA-1)
+  We infer the hash length (H) from this value.
+
+  1-byte number (C) of "chunks"
+
+  1-byte (reserved for later use)
+ Current clients should ignore this value.
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe the chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide the byte-offset in current file for chunk to
+  start. (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.) Each chunk
+  ID appears at most once.
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the positions of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through

[PATCH v8 05/14] commit-graph: create git-commit-graph builtin

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for an '--object-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore |  1 +
 Documentation/git-commit-graph.txt | 10 +++
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/commit-graph.c | 36 ++
 command-list.txt   |  1 +
 contrib/completion/git-completion.bash |  2 ++
 git.c  |  1 +
 8 files changed, 53 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
new file mode 100644
index 00..f3b34622a8
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,10 @@
+git-commit-graph(1)
+===
+
+NAME
+
+git-commit-graph - Write and verify Git commit graph files
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index a1d8775adb..a59b62bed1 100644
--- a/Makefile
+++ b/Makefile
@@ -952,6 +952,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const 
char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 00..b466ecd781
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,36 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+   N_("git commit-graph [--object-dir ]"),
+   NULL
+};
+
+static struct opts_commit_graph {
+   const char *obj_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_commit_graph_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+   N_("dir"),
+   N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
+   git_config(git_default_config, NULL);
+   argc = parse_options(argc, argv, prefix,
+builtin_commit_graph_options,
+builtin_commit_graph_usage,
+PARSE_OPT_STOP_AT_NON_OPTION);
+
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+}
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean   mainporcelain
 git-clone   mainporcelain   init
 git-column  purehelpers
 git-commit  mainporcelain   history
+git-commit-graphplumbingmanipulators
 git-commit-tree plumbingmanipulators
 git-config  ancillarymanipulators
 git-count-objects   ancillaryinterrogators
diff --git a/contrib/completion/git-completion.bash 
b/contrib/completion/git-completion.bash
index b09c8a2362..6726daaf69 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -878,6 +878,7 @@ __git_list_porcelain_commands ()
check-ref-format) : plumbing;

[PATCH v8 01/14] csum-file: rename hashclose() to finalize_hashfile()

2018-04-10 Thread Derrick Stolee
From: Derrick Stolee <dsto...@microsoft.com>

The hashclose() method behaves very differently depending on the flags
parameter. In particular, the file descriptor is not always closed.

Perform a simple rename of "hashclose()" to "finalize_hashfile()" in
preparation for functional changes.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c   | 2 +-
 builtin/pack-objects.c | 6 +++---
 bulk-checkin.c | 4 ++--
 csum-file.c| 2 +-
 csum-file.h| 4 ++--
 fast-import.c  | 2 +-
 pack-bitmap-write.c| 2 +-
 pack-write.c   | 4 ++--
 8 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index bda84a92ef..8bcf280e0b 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1270,7 +1270,7 @@ static void conclude_pack(int fix_thin_pack, const char 
*curr_pack, unsigned cha
nr_objects - nr_objects_initial);
stop_progress_msg(, msg.buf);
strbuf_release();
-   hashclose(f, tail_hash, 0);
+   finalize_hashfile(f, tail_hash, 0);
hashcpy(read_hash, pack_hash);
fixup_pack_header_footer(output_fd, pack_hash,
 curr_pack, nr_objects,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e9d3cfb9e3..ab3e80ee49 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -837,11 +837,11 @@ static void write_pack_file(void)
 * If so, rewrite it like in fast-import
 */
if (pack_to_stdout) {
-   hashclose(f, oid.hash, CSUM_CLOSE);
+   finalize_hashfile(f, oid.hash, CSUM_CLOSE);
} else if (nr_written == nr_remaining) {
-   hashclose(f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(f, oid.hash, CSUM_FSYNC);
} else {
-   int fd = hashclose(f, oid.hash, 0);
+   int fd = finalize_hashfile(f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, pack_tmp_name,
 nr_written, oid.hash, offset);
close(fd);
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 9d87eac07b..227cc9f3b1 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -35,9 +35,9 @@ static void finish_bulk_checkin(struct bulk_checkin_state 
*state)
unlink(state->pack_tmp_name);
goto clear_exit;
} else if (state->nr_written == 1) {
-   hashclose(state->f, oid.hash, CSUM_FSYNC);
+   finalize_hashfile(state->f, oid.hash, CSUM_FSYNC);
} else {
-   int fd = hashclose(state->f, oid.hash, 0);
+   int fd = finalize_hashfile(state->f, oid.hash, 0);
fixup_pack_header_footer(fd, oid.hash, state->pack_tmp_name,
 state->nr_written, oid.hash,
 state->offset);
diff --git a/csum-file.c b/csum-file.c
index 5eda7fb6af..e6c95a6915 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -53,7 +53,7 @@ void hashflush(struct hashfile *f)
}
 }
 
-int hashclose(struct hashfile *f, unsigned char *result, unsigned int flags)
+int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int 
flags)
 {
int fd;
 
diff --git a/csum-file.h b/csum-file.h
index 992e5c0141..9ba87f0a6c 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -26,14 +26,14 @@ struct hashfile_checkpoint {
 extern void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint 
*);
 extern int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
-/* hashclose flags */
+/* finalize_hashfile flags */
 #define CSUM_CLOSE 1
 #define CSUM_FSYNC 2
 
 extern struct hashfile *hashfd(int fd, const char *name);
 extern struct hashfile *hashfd_check(const char *name);
 extern struct hashfile *hashfd_throughput(int fd, const char *name, struct 
progress *tp);
-extern int hashclose(struct hashfile *, unsigned char *, unsigned int);
+extern int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int);
 extern void hashwrite(struct hashfile *, const void *, unsigned int);
 extern void hashflush(struct hashfile *f);
 extern void crc32_begin(struct hashfile *);
diff --git a/fast-import.c b/fast-import.c
index b5db5d20b1..6d96f55d9d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1016,7 +1016,7 @@ static void end_packfile(void)
struct tag *t;
 
close_pack_windows(pack_data);
-   hashclose(pack_file, cur_pack_oid.hash, 0);
+   finalize_hashfile(pack_file, cur_pack_oid.hash, 0);
fixup_pack_header_footer(pack_data->pack_fd, pack_data->sha1,
 

[PATCH v8 00/14] Serialized Git Commit Graph

2018-04-10 Thread Derrick Stolee
This version covers the review that I missed when rerolling v7. The
file diff is below from previous version, and also PATCH 14/14 was
reworded to use "--append" properly.

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 41c4f76caf..37420ae0fd 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -31,7 +31,7 @@ static struct opts_commit_graph {

 static int graph_read(int argc, const char **argv)
 {
-   struct commit_graph *graph = 0;
+   struct commit_graph *graph = NULL;
char *graph_name;

static struct option builtin_commit_graph_read_options[] = {
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..3ff8c84c0e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -179,7 +179,7 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
 }

 /* global storage */
-struct commit_graph *commit_graph = NULL;
+static struct commit_graph *commit_graph = NULL;

 static void prepare_commit_graph_one(const char *obj_dir)
 {

-- >8 -- 

This patch contains a way to serialize the commit graph.

The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base  ' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 678,653
reachable commits and is behind 'origin/master' by 59,929 commits.

| Command  | Before | After  | Rel % |
|--|||---|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv   |  1.02s |  0.14s | -86%  |
| rev-list --all   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects | 66.15s | 58.45s | -11%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitGraph' setting.

[1] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  csum-file: rename hashclose() to finalize_hashfile()
  csum-file: refactor finalize_hashfile() method
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement git-commit-graph write
  commit-graph: implement git commit-graph read
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits
  commit-graph: implement "--append" option

 .gitignore|   1 +
 Documentation/config.txt  |   4 +
 Documentation/git-commit-graph.txt|  94 +++
 .../technical/commit-graph-format.txt |  97 +++
 Documentation/technical/commit-graph.txt  | 163 
 Makefile  |   2 +
 alloc.c   |   1 +
 builtin.h |   1 +
 builtin/commit-graph.c| 171 
 builtin/index-pack.c  |   2 +-

Re: What's cooking in git.git (Apr 2018, #01; Mon, 9)

2018-04-10 Thread Derrick Stolee

On 4/10/2018 3:21 PM, Ramsay Jones wrote:


On 10/04/18 13:57, Derrick Stolee wrote:

On 4/9/2018 6:08 PM, Junio C Hamano wrote:

I guess we'd want a final cleaned-up round after all ;-)  Thanks.

v8 sent [1]. Thanks.

I just tried to 'git am' this series and I could not get it
to apply cleanly to current 'master', 'next' or 'pu'. I fixed
up a few patches, but just got bored ... ;-)



In my cover letter, I did mention that my patch started here (using 
'format-patch --base'):


base-commit: 468165c1d8a442994a825f3684528361727cd8c0


This corresponds to v2.17.0.

Thanks,
-Stolee


Re: [PATCH v8 03/14] commit-graph: add format document

2018-04-10 Thread Derrick Stolee

On 4/10/2018 3:10 PM, Stefan Beller wrote:

Hi Derrick,

On Tue, Apr 10, 2018 at 5:55 AM, Derrick Stolee <sto...@gmail.com> wrote:


+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).

I was about to give this series one last read not expecting any questions
to come up (this series has had a lot of feedback already!)
Although I just did.

What were your design considerations for the fanout table?
Did you include it as the pack index has one or did you come up with
them from first principles?
Have you measured the performance impact of the fanout table
(maybe even depending on the size of the fanout) ?

context:
https://public-inbox.org/git/CAJo=hJsto1ik=gtc8c3+2_jbuuqcapl0uwp-1uoyympgblb...@mail.gmail.com/
(side note: searching the web for fanout makes it seem
as if it is git-lingo, apparently the term is not widely used)

I don't think we want to restart the design discussion,
I am just curious.


I knew that I wanted some amount of a fanout table, and the 256-entry 
one was used for IDX files (and in my MIDX RFC). With the recent 
addition of "packfile: refactor hash search with fanout table" [1] it is 
probably best to keep the 256-entry table to reduce code clones.


As for speed, we have the notion of 'graph_pos' which gives random 
access into the commit-graph after a commit is loaded as a parent of a 
commit from the commit-graph file. Thus, we are spending time in the 
binary search only for commits that do not exist in the commit-graph 
file and those that are first found in the file. Thus, running profilers 
on long commit-graph walks do not show any measurable time spent in 
'bsearch_graph()'.


Thanks,
-Stolee

[1] 
https://github.com/gitster/git/commit/b4e00f7306a160639f047b3421985e8f3d0c6fb1


Re: [PATCH 0/3] Lazy-load trees when reading commit-graph

2018-04-04 Thread Derrick Stolee

On 4/3/2018 4:20 PM, Jeff King wrote:

On Tue, Apr 03, 2018 at 09:14:50AM -0400, Derrick Stolee wrote:


I'm not sure what the exact solution would be, but I imagine something
like variable-sized "struct commit"s with the parent pointers embedded,
with some kind of flag to indicate the number of parents (and probably
some fallback to break out to a linked list for extreme cases of more
than 2 parents).  It may end up pretty invasive, though, as there's a
lot of open-coded traversals of that parent list.

Anyway, not anything to do with this patch, but food for thought as you
micro-optimize these traversals.

One other thing that I've been thinking about is that 'struct commit' is so
much bigger than the other structs in 'union any_object'. This means that
the object cache, which I think creates blocks of 'union any_object' for
memory-alignment reasons, is overly bloated. This would be especially true
when walking many more trees than commits.

Perhaps there are other memory concerns in that case (such as cached
buffers) that the 'union any_object' is not a concern, but it is worth
thinking about as we brainstorm how to reduce the parent-list memory.

It definitely bloats any_object, but I don't think we typically allocate
too many of those. Those should only come from lookup_unknown_object(),
but typically we'll come across objects by traversing the graph, in
which case we have an expectation of the type (and use the appropriate
lookup_foo() function, which uses the type-specific block allocators).


Thanks for the clarification here. I'm glad I was wrong about how often 
any_object is used.


Thanks,
-Stolee


Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()

2018-04-04 Thread Derrick Stolee

On 4/4/2018 11:45 AM, Derrick Stolee wrote:

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 858f4fdbc9..2566cba79f 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
  {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_UNDEF;
  
  	if (parse_commit(commit))

return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return 0;
  
  	bases = paint_down_to_common(commit, nr_reference, reference);

if (commit->object.flags & PARENT2)


This patch may suffice to speed up 'git branch --contains' instead of 
needing to always use the 'git tag --contains' algorithm as considered 
in [1].


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180303051516.ge27...@sigill.intra.peff.net/

    Re: [PATCH 0/4] Speed up git tag --contains


[PATCH 7/6] ref-filter: use generation number for --contains

2018-04-04 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index 45fc56216a..b147b1d0ee 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit 
*candidate,
return CONTAINS_YES;
}
 
-   /* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
+   if (cutoff == GENERATION_NUMBER_UNDEF)
+   cutoff = GENERATION_NUMBER_NONE;
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.rc0



[PATCH 8/6] commit: use generation numbers for in_merge_bases()

2018-04-04 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 858f4fdbc9..2566cba79f 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_UNDEF;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return 0;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0.rc0



Re: [PATCH 7/6] ref-filter: use generation number for --contains

2018-04-04 Thread Derrick Stolee

On 4/4/2018 3:16 PM, Jeff King wrote:

On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote:


@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
   {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }

Now that you mention it, let me split out the portion you are probably
talking about as incorrect:


+   if (cutoff == GENERATION_NUMBER_UNDEF)
+   cutoff = GENERATION_NUMBER_NONE;

You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get
no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk
will be limited to commits NOT in the commit-graph (which we hope is small
if proper hygiene is followed).

I think it's more than that. If we leave it at UNDEF, that's wrong,
because contains_test() compares:

   candidate->generation < cutoff

which would _always_ be true. In other words, we're saying that our
"want" has an insanely high generation number, and traversing can never
find it. Which is clearly wrong.


That condition is not always true (which is why we use strict comparison 
instead of <=). If a commit is not in the commit-graph file, then its 
generation is equal to GENERATION_NUMBER_UNDEF, as shown in alloc.c:


void *alloc_commit_node(void)
{
    struct commit *c = alloc_node(_state, sizeof(struct 
commit));

    c->object.type = OBJ_COMMIT;
    c->index = alloc_commit_index();
    c->graph_pos = COMMIT_NOT_FROM_GRAPH;
    c->generation = GENERATION_NUMBER_UNDEF;
    return c;
}



So we have to put it at "0", to say "you should always traverse, we
can't tell you that this is a dead end". So that part of the logic is
currently correct.

But what I was getting at is that the loop behavior can't just pick the
min cutoff. The min is effectively "0" if there's even a single ref for
which we don't have a generation number, because we cannot ever stop
traversing (we might get to that commit if we kept going).

(It's also possible I'm confused about how UNDEF and NONE are used; I'm
assuming commits for which we don't have a generation number available
would get UNDEF in their commit->generation field).


I think it is this case.


If you could make the assumption that when we have a generation for
commit X, then we have a generation for all of its ancestors, things get
easier. Because then if you hit commit X with a generation number and
want to compare it to a cutoff, you know that either:

   1. The cutoff is defined, in which case you can stop traversing if
  we've gone past the cutoff.

   2. The cutoff is undefined, in which case we cannot possibly reach
  our "want" by traversing. Even if it has a smaller generation
  number than us, it's on an unrelated line of development.

I don't know that the reachability property is explicitly promised by
your work, but it seems like it would be a natural fallout (after all,
you have to know the generation of each ancestor in order to compute the
later ones, so you're really just promising that you've actually stored
all the ones you've computed).


The commit-graph is closed under reachability, so if a commit has a 
generation number then it is in the graph and so are all its ancestors.


The reason for GENERATION_NUMBER_NONE is that the commit-graph file 
stores "0" for generation number until this patch. It still satisfies 
the condition that gen(A) < gen(B) if B can reach A, but also gives us a 
condition for "this commit still needs its generation number computed".





I wonder to what degree it's worth traversing to come up with a
generation number for the "want" commits. If we walked, say, 50 commits
to do it, you'd probably save a lot of work (since the alternative is
walking thousands of commits until you realize that some ancient "v1.0"
tag is not useful).

I'd actually go so far as to say that any amount of traversal is
generally going to be worth it to come up with the correct generation
cutoff here. You can come up with pathological cases where you only have
one really recent tag or something, but in practice every repository
where performance is a concern is going to end up with refs much further
back than it would take to reach the cutoff condition.

Perhaps there is some value in walking to find the correct cutoff value, but
it is difficult to determine how fa

Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases()

2018-04-04 Thread Derrick Stolee

On 4/4/2018 2:24 PM, Jeff King wrote:

On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote:


diff --git a/commit.c b/commit.c
index 858f4fdbc9..2566cba79f 100644
--- a/commit.c
+++ b/commit.c
@@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
   {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_UNDEF;
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return 0;
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)

This patch may suffice to speed up 'git branch --contains' instead of
needing to always use the 'git tag --contains' algorithm as considered in
[1].


I guess I want to specify: the only reason to NOT switch to the tags 
algorithm is because it _may_ hurt existing cases in certain data shapes...



I'd have to do some timings, but I suspect we may want to switch to the
"tag --contains" algorithm anyway. This still does N independent
merge-base operations, one per ref. So with enough refs, you're still
better off throwing it all into one big traversal.


...and I suppose your timings are to find out if there are data shapes 
where the branch algorithm is faster. Perhaps that is impossible now 
that we have the generation number cutoff for the tag algorithm.


Since the branch algorithm checks generation numbers before triggering 
pain_down_to_common(), we will do N independent merge-base calculations, 
where N is the number of branches with large enough generation numbers 
(which is why my test does so well: most are below the target generation 
number). This doesn't help at all if none of the refs are in the graph.


The other thing to do is add a minimum generation for the walk in 
paint_down_to_common() so even if commit->generation <= min_generation 
we still only walk down to commit->generation instead of all merge 
bases. This is something we could change in a later patch.


Patches 7 and 8 seem to me like simple changes with no downside UNLESS 
we are deciding instead to delete the code I'm changing.


Thanks,
-Stolee


Re: [PATCH 7/6] ref-filter: use generation number for --contains

2018-04-04 Thread Derrick Stolee

On 4/4/2018 2:22 PM, Jeff King wrote:

On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote:


@@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
  {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_UNDEF;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }


Now that you mention it, let me split out the portion you are probably 
talking about as incorrect:



+   if (cutoff == GENERATION_NUMBER_UNDEF)
+   cutoff = GENERATION_NUMBER_NONE;


You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we 
get no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then 
our walk will be limited to commits NOT in the commit-graph (which we 
hope is small if proper hygiene is followed).



Hmm, on reflection, I'm not sure if this is right in the face of
multiple "want" commits, only some of which have generation numbers.  We
probably want to disable the cutoff if _any_ "want" commit doesn't have
a number.

There's also an obvious corner case where this won't kick in, and you'd
really like it to: recently added commits. E.g,. if I do this:

   git gc ;# imagine this writes generation numbers
   git pull
   git tag --contains HEAD

then HEAD isn't going to have a generation number. But this is the case
where we have the most to gain, since we could throw away all of the
ancient tags immediately upon seeing that their generation numbers are
way less than that of HEAD.

I wonder to what degree it's worth traversing to come up with a
generation number for the "want" commits. If we walked, say, 50 commits
to do it, you'd probably save a lot of work (since the alternative is
walking thousands of commits until you realize that some ancient "v1.0"
tag is not useful).

I'd actually go so far as to say that any amount of traversal is
generally going to be worth it to come up with the correct generation
cutoff here. You can come up with pathological cases where you only have
one really recent tag or something, but in practice every repository
where performance is a concern is going to end up with refs much further
back than it would take to reach the cutoff condition.


Perhaps there is some value in walking to find the correct cutoff value, 
but it is difficult to determine how far we are from commits with 
correct generation numbers _a priori_. I'd rather rely on the 
commit-graph being in a good state, not too far behind the refs. An 
added complexity of computing generation numbers dynamically is that we 
would need to add a dependence on the commit-graph file's existence at all.


Thanks,
-Stolee


Re: [PATCH v2 04/10] commit-graph: compute generation numbers

2018-04-11 Thread Derrick Stolee

On 4/10/2018 10:51 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


+   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   if ((*list)->generation > GENERATION_NUMBER_MAX)
+   die("generation number %u is too large to store in 
commit-graph",
+   (*list)->generation);
+   packedDate[0] |= htonl((*list)->generation << 2);
+   }


How serious do we want this feature to be?  On one extreme, we could
be irresponsible and say it will be a problem for our descendants in
the future if their repositories have more than billion pearls on a
single strand, and the above certainly is a reasonable way to punt.
Those who actually encounter the problem will notice by Git dying
somewhere rather deep in the callchain.

Or we could say Git actually does support a history that is
arbitrarily long, even though such a deep portion of history will
not benefit from having generation numbers in commit-graph.

I've been assuming that our stance is the latter and that is why I
made noises about overflowing 30-bit generation field in my review
of the previous step.

In case we want to do the "we know this is very large, but we do not
know the exact value", we may actually want a mode where we can
pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
make sure that the code to handle overflow behaves sensibly.


I agree. I wonder how we can effectively expose this value into a test. 
It's probably not sufficient to manually test using compiler flags ("-D 
GENERATION_NUMBER_MAX=8").





+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+...
+   }
+   }

So we go over the list of commits just _once_ and make sure each of
them gets the generation assigned correctly by (conceptually
recursively but iteratively in implementation by using a commit
list) making sure that all its parents have generation assigned and
compute the generation for the commit, before moving to the next
one.  Which sounds correct.


Yes, we compute the generation number of a commit exactly once. We use 
the list as a stack so we do not have recursion limits during our 
depth-first search (DFS). We rely on the object cache to ensure we store 
the computed generation numbers, and computed generation numbers provide 
termination conditions to the DFS.


Thanks,
-Stolee


Re: [PATCH v2 02/10] merge: check config before loading commits

2018-04-11 Thread Derrick Stolee

On 4/10/2018 10:12 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


diff --git a/builtin/merge.c b/builtin/merge.c
index ee050a47f3..20897f8223 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
  
-	init_diff_ui_defaults();

-   git_config(git_merge_config, NULL);

Wow, that's tricky.  git_merge_config() wants to know which "branch"
we are on, and this place is as early as we can move the call to
without breaking things.  Is this to allow parse_object() called
in lookup_commit_reference_gently() to know if we can rely on the
data cached in the commit-graph data?


When I saw the bug on my machine, I tracked the issue down to a call to 
parse_commit_in_graph() that skipped the graph check since 
core_commit_graph was not set. The call stack from this call is as follows:


* lookup_commit_or_die()
* lookup_commit_reference()
* lookup_commit_reference_gently()
* parse_object()
* parse_object_buffer()
* parse_commit_in_graph() [as introduced in PATCH 01/10]




Move the config load to be between the initialization of 'branch'
and the commit lookup. Also add a test to t5318-commit-graph.sh
that exercises this code path to prevent a regression.

It is not clear to me how a successful merge of commits/8
demonstrates that reading the config earlier than before is
regression free.


I didn't want to introduce commits in an order that led to a commit 
failing tests, but if you drop the change to builtin/merge.c from this 
series, the tip commit will fail this test with "BUG: bad generation skip".


The reason for this failure is that commits/5 is loaded from HEAD from 
the object database, so its generation is marked as 
GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, 
the commit at merges/3 is loaded from the graph with generation 4. This 
triggers the BUG statement in paint_down_to_common(). That is why it is 
important to check a fast-forward merge.


In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were 
already testing 'git merge-base' to check the commit walk logic.


Thanks,
-Stolee


Re: [PATCH v2 03/10] commit: add generation number to struct commmit

2018-04-11 Thread Derrick Stolee

On 4/10/2018 10:31 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
   more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_ININITY 0x
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_ZERO) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Should it also be possible for a caller to tell if a given commit
has too deep a history, i.e. we do not know its generation number
exactly, but we know it is larger than 1<<30?

It seems that we only have a 30-bit field in the file, so wouldn't
we need a special value defined in (e.g. "0") so that we can tell
that the commit has such a large generation number?  E.g.


+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

if (!item->generation)
item->generation = GENERATION_NUMBER_OVERFLOW;

when we read it from the file?

We obviously need to do something similar when assigning a
generation number to a child commit, perhaps like

#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)

commit->generation = 1; /* assume no parent */
for (p = commit->parents; p; p++) {
uint32_t gen = p->item->generation + 1;

if (gen >= GENERATION_NUMBER_OVERFLOW) {
commit->generation = GENERATION_NUMBER_OVERFLOW;
break;
} else if (commit->generation < gen)
commit->generation = gen;
}
 
or something?  And then on the writing side you'd encode too large a

generation as '0'.


You raise a very good point. How about we do a slightly different 
arrangement for these overflow commits?


Instead of storing the commits in the commit-graph file as "0" (which 
currently means "written by a version of git that did not compute 
generation numbers") we could let GENERATION_NUMBER_MAX be the maximum 
generation of a commit in the commit-graph, and if a commit would have 
larger generation, we collapse it down to that value.


It slightly complicates the diagram I made in 
Documentation/technical/commit-graph.txt, but it was already a bit of a 
simplification. Here is an updated diagram, but likely we will want to 
limit discussion of the special-case GENERATION_NUMBER_MAX to the prose, 
since it is not a practical situation at the moment.


    +-+
    | GENERATION_NUMBER_INFINITY = 0x |
    +-+
  |    |    |  ^
  |    |    |  |
  |    |    +--+
  |    | [gen(A) = gen(B)]
  |    V
  |  ++
      |  | GENERATION_NUMBER_MAX = 0x3FFF |
  |  ++
  |    |    |  ^
  |    |    |  |
  |    |    +--+
  |    | [gen(A) = gen(B)]
  V    V
    +-+
    | 0 < commit->generation < 0x3FFF |
    +-+
        |    |  ^
        |    |  |
        |    +--+
        |    [gen(A) > gen(B)]
        V
    +-+
    | GENERATION_NUMBER_ZERO = 0  |
    +-+
             |  ^
             |  |
             +--+
         [gen(A) = gen(B)]

Thanks,
-Stolee


Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk

2018-04-11 Thread Derrick Stolee

On 4/10/2018 11:02 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return result;
}
prio_queue_put(, one);
+   if (one->generation < min_nonstale_gen)
+   min_nonstale_gen = one->generation;
  
  	for (i = 0; i < n; i++) {

twos[i]->object.flags |= PARENT2;
prio_queue_put(, twos[i]);
+   if (twos[i]->generation < min_nonstale_gen)
+   min_nonstale_gen = twos[i]->generation;
}
  
-	while (queue_has_nonstale()) {

+   while (queue_has_nonstale(, min_nonstale_gen)) {
struct commit *commit = prio_queue_get();
struct commit_list *parents;
int flags;
  
+		if (commit->generation > last_gen)

+   BUG("bad generation skip");
+
+   last_gen = commit->generation;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return NULL;
p->object.flags |= flags;

Hmph.  Can a commit that used to be not stale (and contributed to
the current value of min_nonstale_gen) become stale here by getting
visited twice, invalidating the value in min_nonstale_gen?


min_nonstale_gen can be "wrong" in the way you say, but fits the 
definition from the commit message:


"To properly take advantage of this condition, track the minimum 
generation number of a commit that **enters the queue** with nonstale 
status." (Emphasis added)


You make an excellent point about how this can be problematic. I was 
confused by the lack of clear performance benefits here, but I think 
that whatever benefits making queue_has_nonstale() be O(1) were removed 
by walking more commits than necessary.


Consider the following commit graph, where M is a parent of both A and 
B, S is a parent of M and B, and there is a large set of commits 
reachable from M with generation number larger than gen(S).


A    B
| __/|
|/   |
M    |
|\   |
. |  |
. |  |
. |_/
|/
S

Between A and B, the true merge base is M. Anything reachable from M is 
marked as stale. When S is added to the queue, it is only reachable from 
B, so it is non-stale. However, it is marked stale after M is walked. 
The old code would detect this as a termination condition, but the new 
code would not.


I think this data shape is actually common (not exactly, as it may be 
that some ancestor of M provides a second path to S) especially in the 
world of pull requests and users merging master into their topic branches.


I'll remove this commit in the next version, but use the new prototype 
for queue_has_nonstale() in "commit: add short-circuit to 
paint_down_to_common()" using the given 'min_generation' instead of 
'min_nonstale_gen'.


Thanks,
-Stolee


Re: [PATCH resend] SubmittingPatches: mention the git contacts command

2018-04-11 Thread Derrick Stolee

On 4/11/2018 4:20 PM, Thomas Gummerer wrote:

Instead of just mentioning 'git blame' and 'git shortlog', which make it
quite hard for new contributors to pick out the appropriate list of
people to cc on their patch series, mention the 'git contacts' utility,
which makes it much easier to get a reasonable list of contacts for a
change.

This should help new contributors pick out a reasonable cc list by
simply using a single command.

Signed-off-by: Thomas Gummerer 
---

I've originally sent this at <20180316213323.GC2224@hank>, during an
the rc period.  Eric had some comments, which I interpreted as being
okay with the change (hope I'm not mistaken there :)).  As I still
think this would be an improvement for new contributors, I'm resending
it here.


I didn't know about this tool, and it seems helpful. I plan to use it 
now. Thanks!



  Documentation/SubmittingPatches | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index a1d0feca36..945f8edb46 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -260,8 +260,8 @@ that starts with `-BEGIN PGP SIGNED MESSAGE-`.  
That is
  not a text/plain, it's something else.
  
  Send your patch with "To:" set to the mailing list, with "cc:" listing

-people who are involved in the area you are touching (the output from
-`git blame $path` and `git shortlog --no-merges $path` would help to
+people who are involved in the area you are touching (the `git
+contacts` command in `contrib/contacts/` can help to
  identify them), to solicit comments and reviews.
  
  :1: footnote:[The current maintainer: gits...@pobox.com]




Re: [PATCHv3 00/15] replace_object.c: rename to use dash in file name

2018-04-12 Thread Derrick Stolee

On 4/11/2018 8:21 PM, Stefan Beller wrote:

v3:
* interdiff below,
   the only changes are renaming the variable
   -   struct ref_store *main_ref_store;
   +   struct ref_store *refs;
   in struct repository.
   as well as dropping the file rename patch.
* improved commit messages from discussion on the single patches.

Looks good to me. Thanks!
-Stolee


Re: [PATCH v8 03/14] commit-graph: add format document

2018-04-12 Thread Derrick Stolee

On 4/11/2018 4:58 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:


+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)

I think it is a typo, and it should be CDAT, not CGET
(CDAT seem to me to stand for Commit DATa):

   +  Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)

This is what you use in actual implementation, in PATCH v8 06/14

DS> +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
DS> +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
DS> +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
DS> +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
DS> +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */



Documentation bugs are hard to diagnose. Thanks for finding this. I 
double checked that the hex int "0x43444154" matches "CDAT".


Here is a diff to make it match.

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt

index ad6af8105c..af03501834 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -70,7 +70,7 @@ CHUNK DATA:
   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
   The OIDs for all commits in the graph, sorted in ascending order.

-  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+  Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)
 * The first H bytes are for the OID of the root tree.
 * The next 8 bytes are for the positions of the first two parents
   of the ith commit. Stores value 0x if no parent in that



Re: [PATCH v2 07/10] commit-graph.txt: update future work

2018-04-12 Thread Derrick Stolee

On 4/12/2018 5:12 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


+Here is a diagram to visualize the shape of the full commit graph, and
+how different generation numbers relate:
+
++-+
+| GENERATION_NUMBER_INFINITY = 0x |
++-+
+   ||  ^
+   ||  |
+   |+--+
+   | [gen(A) = gen(B)]
+   V
++-+
+| 0 < commit->generation < 0x4000 |
++-+
+   ||  ^
+   ||  |
+   |+--+
+   |[gen(A) > gen(B)]
+   V
++-+
+| GENERATION_NUMBER_ZERO = 0  |
++-+
+|  ^
+|  |
++--+
+[gen(A) = gen(B)]

It may be just me but all I can read out of the above is that
commit->generation may store 0x, a value between 0 and
0x4000, or 0.  I cannot quite tell what the notation [gen(A)
 gen(B)] is trying to say.  I am guessing "Two generation
numbers within the 'valid' range can be compared" is what the second
one is trying to say, but it is much less interesting to know that
two infinities compare equal than how generation numbers from
different classes compare, which cannot be depicted in the above
notation, I am afraid.  For example, don't we want to say that a
commit with INF can never be reached by a commit with a valid
generation number, or something like that?


My intention with the arrows was to demonstrate where parent 
relationships can go, and the generation-number relation between a 
commit A with parent B. Clearly, this diagram is less than helpful.





  Design Details
  --
  
@@ -98,17 +141,12 @@ Future Work

  - The 'commit-graph' subcommand does not have a "verify" mode that is
necessary for integration with fsck.
  
-- The file format includes room for precomputed generation numbers. These

-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
  - After computing and storing generation numbers, we must make graph
walks aware of generation numbers to gain the performance benefits they
enable. This will mostly be accomplished by swapping a commit-date-ordered
priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:

Good.




Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-11 Thread Derrick Stolee

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.


The two main questions to consider are:

1. Can X reach Y?
2. What is the set of merge-bases between X and Y?

And the thing to measure is a commit count. If possible, it would be 
good to count commits walked (commits whose parent list is enumerated) 
and commits inspected (commits that were listed as a parent of some 
walked commit). Walked commits require a commit parse -- albeit from the 
commit-graph instead of the ODB now -- while inspected commits only 
check the in-memory cache.


For git.git and Linux, I like to use the release tags as tests. They 
provide a realistic view of the linear history, and maintenance releases 
have their own history from the major releases.



I have tried finding number of false positives for level (generation
number) filter and for FELINE index, and number of false negatives for
min-post intervals in the spanning tree (for DFS tree) for 1
randomly selected pairs of commits... but I don't think this is a good
benchmark.


What is a false-positive? A case where gen(X) < gen(Y) but Y cannot 
reach X? I do not think that is a great benchmark, but I guess it is 
something to measure.



I Linux kernel sources 
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
that has 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 1 randomly selected pairs of commits:

   level-filter has91 =  0.91% [all] false positives
   FELINE index has78 =  0.78% [all] false positives
   FELINE index has 1.16667 less false positives than level filter

   min-post spanning-tree intervals has  3641 = 36.41% [all] false
   negatives


Perhaps something you can do instead of sampling from N^2 commits in 
total is to select a pair of generations (say, G = 2, G' = 20100) or 
regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) and 
see how many false positives you see by testing all pairs (one from each 
level). The delta between the generations may need to be smaller to 
actually have a large proportion of unreachable pairs. Try different 
levels, since major version releases tend to "pinch" the commit graph to 
a common history.



For git.git repository (https://github.com/git/git.git) that has 52950
nodes and 65887 edges the numbers are slighly more in FELINE index
favor (also out of 1 random pairs):

   level-filter has   504 =  9.11% false positives
   FELINE index has   125 =  2.26% false positives
   FELINE index has 4.032 less false positives than level filter

This is for FELINE which does not use level / generatio-numbers filter.


Thanks,
-Stolee



[PATCH v2 04/10] commit-graph: compute generation numbers

2018-04-09 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..5fd63acc31 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   if ((*list)->generation > GENERATION_NUMBER_MAX)
+   die("generation number %u is too large to store 
in commit-graph",
+   (*list)->generation);
+   packedDate[0] |= htonl((*list)->generation << 2);
+   }
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers(commits.list, commits.nr);
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
-- 
2.17.0



[PATCH v2 07/10] commit-graph.txt: update future work

2018-04-09 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the two
"special" generation numbers GENERATION_NUMBER_INFINITY and *_ZERO
interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 50 +---
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..a8df0ae9db 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,49 @@ in the commit graph. We can treat these commits as having 
"infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+Here is a diagram to visualize the shape of the full commit graph, and
+how different generation numbers relate:
+
++-+
+| GENERATION_NUMBER_INFINITY = 0x |
++-+
+   ||  ^
+   ||  |
+   |+--+
+   | [gen(A) = gen(B)]
+   V
++-+
+| 0 < commit->generation < 0x4000 |
++-+
+   ||  ^
+   ||  |
+   |+--+
+   |[gen(A) > gen(B)]
+   V
++-+
+| GENERATION_NUMBER_ZERO = 0  |
++-+
+|  ^
+|  |
++--+
+[gen(A) = gen(B)]
+
 Design Details
 --
 
@@ -98,17 +141,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-- paint_down_to_common()
 - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0



[PATCH v2 06/10] commit.c: use generation to halt paint walk

2018-04-09 Thread Derrick Stolee
In paint_down_to_common(), the walk is halted when the queue contains
only stale commits. The queue_has_nonstale() method iterates over the
entire queue looking for a nonstale commit. In a wide commit graph where
the two sides share many commits in common, but have deep sets of
different commits, this method may inspect many elements before finding
a nonstale commit. In the worst case, this can give quadratic
performance in paint_down_to_common().

Convert queue_has_nonstale() to use generation numbers for an O(1)
termination condition. To properly take advantage of this condition,
track the minimum generation number of a commit that enters the queue
with nonstale status.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 37 ++---
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 95ae7e13a3..00bdc2ab21 100644
--- a/commit.c
+++ b/commit.c
@@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, 
enum rev_sort_order so
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
-static int queue_has_nonstale(struct prio_queue *queue)
+static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen)
 {
-   int i;
-   for (i = 0; i < queue->nr; i++) {
-   struct commit *commit = queue->array[i].data;
-   if (!(commit->object.flags & STALE))
-   return 1;
+   if (min_gen != GENERATION_NUMBER_INFINITY) {
+   if (queue->nr > 0) {
+   struct commit *commit = queue->array[0].data;
+   return commit->generation >= min_gen;
+   }
+   } else {
+   int i;
+   for (i = 0; i < queue->nr; i++) {
+   struct commit *commit = queue->array[i].data;
+   if (!(commit->object.flags & STALE))
+   return 1;
+   }
}
+
return 0;
 }
 
@@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+   uint32_t min_nonstale_gen = GENERATION_NUMBER_INFINITY;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return result;
}
prio_queue_put(, one);
+   if (one->generation < min_nonstale_gen)
+   min_nonstale_gen = one->generation;
 
for (i = 0; i < n; i++) {
twos[i]->object.flags |= PARENT2;
prio_queue_put(, twos[i]);
+   if (twos[i]->generation < min_nonstale_gen)
+   min_nonstale_gen = twos[i]->generation;
}
 
-   while (queue_has_nonstale()) {
+   while (queue_has_nonstale(, min_nonstale_gen)) {
struct commit *commit = prio_queue_get();
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip");
+
+   last_gen = commit->generation;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return NULL;
p->object.flags |= flags;
prio_queue_put(, p);
+
+   if (!(flags & STALE) &&
+   p->generation < min_nonstale_gen)
+   min_nonstale_gen = p->generation;
}
}
 
-- 
2.17.0



[PATCH v2 05/10] commit: use generations in paint_down_to_common()

2018-04-09 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 19 ++-
 commit.h |  1 +
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 3e39c86abf..95ae7e13a3 100644
--- a/commit.c
+++ b/commit.c
@@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* newer commits with larger date first */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index b91df315c5..c440f56bf9 100644
--- a/commit.h
+++ b/commit.h
@@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0



[PATCH v2 00/10] Compute and consume generation numbers

2018-04-09 Thread Derrick Stolee
Thanks for the lively discussion of this patch series in v1!

I've incorporated the feedback from the previous round, added patches
[7/6] and [8/6], expanded the discussion of generation numbers in the
design document, and added another speedup for 'git branch --contains'.

One major difference: I renamed the macros from _UNDEF to _INFINITY and
_NONE to _ZERO. This communicates their value more clearly, since the
previous names were unclear about which was larger than the "real"
generation numbers.

Patch 2 includes a change to builtin/merge.c and a new test in
t5318-commit-graph.sh that exposes a problem I found when testing the
previous patch series on my box. The "BUG: bad generation skip" message
from "commit.c: use generation to halt paint walk" would halt a fast-
forward merge since the HEAD commit was loaded before the core.commitGraph
config setting was loaded. It is crucial that all commits that exist
in the commit-graph file are loaded from that file or else we will
lose our expected inequalities of generation numbers.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

Further, use generation numbers for '--contains' queries in 'git tag'
and 'git branch', providing a significant speedup (at least 95% for
some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (10):
  object.c: parse commit in graph first
  merge: check config before loading commits
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit.c: use generation to halt paint walk
  commit-graph.txt: update future work
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()

 Documentation/technical/commit-graph.txt | 50 +--
 alloc.c  |  1 +
 builtin/merge.c  |  5 +-
 commit-graph.c   | 48 +++
 commit.c | 78 
 commit.h |  5 ++
 object.c |  4 +-
 ref-filter.c | 24 ++--
 t/t5318-commit-graph.sh  |  9 +++
 9 files changed, 197 insertions(+), 27 deletions(-)

-- 
2.17.0



[PATCH v2 03/10] commit: add generation number to struct commmit

2018-04-09 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_ININITY 0x
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_ZERO) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 4 
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..b91df315c5 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
struct commit *item;
@@ -24,6 +27,7 @@ struct commit {
struct commit_list *parents;
struct tree *tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0



[PATCH v2 01/10] object.c: parse commit in graph first

2018-04-09 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id 
*oid, enum object_type
} else if (type == OBJ_COMMIT) {
struct commit *commit = lookup_commit(oid);
if (commit) {
-   if (parse_commit_buffer(commit, buffer, size))
+   if (!parse_commit_in_graph(commit) &&
+   parse_commit_buffer(commit, buffer, size))
return NULL;
if (!get_cached_commit_buffer(commit, NULL)) {
set_commit_buffer(commit, buffer, size);
-- 
2.17.0



Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph

2018-04-09 Thread Derrick Stolee

On 4/8/2018 7:18 PM, Junio C Hamano wrote:

Jeff King  writes:


If I were doing it myself, I probably would have folded patches 1 and 3
together. They are touching all the same spots, and it would be an error
for any case converted in patch 1 to not get converted in patch 3. I'm
assuming you caught them all due to Coccinelle, though IMHO it is
somewhat overkill here. By folding them together the compiler could tell
you which spots you missed.

Yeah, that approach would probably be a more sensible way to assure
the safety/correctness of the result to readers better.


I don't understand how folding the patches makes the correctness 
clearer, since the rename (1/4) is checked by the compiler and the 
Coccinelle script (3/4) only works after that rename is complete.


The only thing I can imagine is that it makes smaller patch emails, 
since there is only one large patch instead of two. In this case, I 
prefer to make changes that are easier to check by automation (compiler 
and coccinelle).


However, I will defer to the experts in this regard. If a v3 is 
necessary, then I will fold the commits together.


Thanks,
-Stolee


Re: What's cooking in git.git (Apr 2018, #01; Mon, 9)

2018-04-09 Thread Derrick Stolee

On 4/9/2018 6:21 AM, Junio C Hamano wrote:

* ds/commit-graph (2018-04-02) 16 commits
  - commit-graph: implement "--additive" option
  - commit-graph: build graph from starting commits
  - commit-graph: read only from specific pack-indexes
  - commit: integrate commit graph with commit parsing
  - commit-graph: close under reachability
  - commit-graph: add core.commitGraph setting
  - commit-graph: implement git commit-graph read
  - commit-graph: implement git-commit-graph write
  - commit-graph: implement write_commit_graph()
  - commit-graph: create git-commit-graph builtin
  - graph: add commit graph design document
  - commit-graph: add format document
  - csum-file: refactor finalize_hashfile() method
  - csum-file: rename hashclose() to finalize_hashfile()
  - Merge branch 'jk/cached-commit-buffer' into HEAD
  - Merge branch 'jt/binsearch-with-fanout' into HEAD
  (this branch is used by ds/lazy-load-trees.)

  Precompute and store information necessary for ancestry traversal
  in a separate file to optimize graph walking.

  Ready???
  It seems that this topic is getting there.


I think this patch is ready to go, barring the edit of "--additive" to 
"--append" in the final commit message and squashing following diff into 
"commit-graph: implement git commit-graph read":


@@ -31,7 +31,7 @@ static struct opts_commit_graph {

 static int graph_read(int argc, const char **argv)
 {
-   struct commit_graph *graph = 0;
+   struct commit_graph *graph = NULL;
    char *graph_name;

    static struct option builtin_commit_graph_read_options[] = {

If you prefer that I re-roll with those changes, I can send a v8.

I'm currently working on new series based on this feature:

* [1] Lazy-load trees when reading commit-graph (ds/lazy-load-trees)

* [2] Compute and consume generation numbers

* Move commit-graph.c globals to the_repository

* Implement 'fsck' functionality for the commit-graph file

* Integrate 'commit-graph write' into 'gc --auto'

I would also like to open the feature to other contributors, especially 
for others who can contribute performance improvements using generation 
numbers. We had a very valuable discussion on the list [2], and I look 
forward to more collaborations like that.


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180403120057.173849-1-dsto...@microsoft.com/T/#u


[2] 
https://public-inbox.org/git/20180403165143.80661-1-dsto...@microsoft.com/T/#u


Re: [PATCH 02/19] replace-object: move replace_object to object store

2018-04-09 Thread Derrick Stolee

On 4/6/2018 7:21 PM, Stefan Beller wrote:

Refs belong to particular repositories, so the replacements defined by
them should belong to a particular repository as well.

Move the definition of a single object replacement to a new header
"replace-object.h". While at it replace the hardcoded 20 by GIT_MAX_RAWSZ.

Signed-off-by: Jonathan Nieder 
Signed-off-by: Stefan Beller 
---
  object-store.h   | 14 ++
  replace-object.c | 40 ++--
  replace-object.h |  9 +
  3 files changed, 41 insertions(+), 22 deletions(-)
  create mode 100644 replace-object.h


Throughout this commit, there appears to be an extra space inserted 
before 'the_repository'. Some are more obvious than others (such as a 
'free( the_repository->...)' but others are after the indentation.



diff --git a/object-store.h b/object-store.h
index fef33f345f..da639b3184 100644
--- a/object-store.h
+++ b/object-store.h
@@ -93,6 +93,20 @@ struct raw_object_store {
struct alternate_object_database *alt_odb_list;
struct alternate_object_database **alt_odb_tail;
  
+	/*

+* Objects that should be substituted by other objects
+* (see git-replace(1)).
+*/
+   struct replace_objects {
+   /*
+* An array of replacements.  The array is kept sorted by the 
original
+* sha1.
+*/
+   struct replace_object **items;
+
+   int alloc, nr;
+   } replacements;
+
/*
 * private data
 *
diff --git a/replace-object.c b/replace-object.c
index 3e49965d05..a7eb31026e 100644
--- a/replace-object.c
+++ b/replace-object.c
@@ -1,19 +1,11 @@
  #include "cache.h"
+#include "replace-object.h"
+#include "object-store.h"
  #include "sha1-lookup.h"
  #include "refs.h"
+#include "repository.h"
  #include "commit.h"
  
-/*

- * An array of replacements.  The array is kept sorted by the original
- * sha1.
- */
-static struct replace_object {
-   unsigned char original[20];
-   unsigned char replacement[20];
-} **replace_object;
-
-static int replace_object_alloc, replace_object_nr;
-
  static const unsigned char *replace_sha1_access(size_t index, void *table)
  {
struct replace_object **replace = table;
@@ -22,7 +14,8 @@ static const unsigned char *replace_sha1_access(size_t index, 
void *table)
  
  static int replace_object_pos(const unsigned char *sha1)

  {
-   return sha1_pos(sha1, replace_object, replace_object_nr,
+   return sha1_pos(sha1,  the_repository->objects->replacements.items,
+the_repository->objects->replacements.nr,
replace_sha1_access);
  }
  
@@ -35,18 +28,21 @@ static int register_replace_object(struct replace_object *replace,

if (ignore_dups)
free(replace);
else {
-   free(replace_object[pos]);
-   replace_object[pos] = replace;
+   free( the_repository->objects->replacements.items[pos]);
+the_repository->objects->replacements.items[pos] = 
replace;
}
return 1;
}
pos = -pos - 1;
-   ALLOC_GROW(replace_object, replace_object_nr + 1, replace_object_alloc);
-   replace_object_nr++;
-   if (pos < replace_object_nr)
-   MOVE_ARRAY(replace_object + pos + 1, replace_object + pos,
-  replace_object_nr - pos - 1);
-   replace_object[pos] = replace;
+   ALLOC_GROW( the_repository->objects->replacements.items,
+   the_repository->objects->replacements.nr + 1,
+   the_repository->objects->replacements.alloc);
+the_repository->objects->replacements.nr++;
+   if (pos <  the_repository->objects->replacements.nr)
+   MOVE_ARRAY( the_repository->objects->replacements.items + pos + 
1,
+   the_repository->objects->replacements.items + pos,
+   the_repository->objects->replacements.nr - pos - 1);
+the_repository->objects->replacements.items[pos] = replace;
return 0;
  }
  
@@ -84,7 +80,7 @@ static void prepare_replace_object(void)
  
  	for_each_replace_ref(register_replace_ref, NULL);

replace_object_prepared = 1;
-   if (!replace_object_nr)
+   if (!the_repository->objects->replacements.nr)
check_replace_refs = 0;
  }
  
@@ -113,7 +109,7 @@ const unsigned char *do_lookup_replace_object(const unsigned char *sha1)
  
  		pos = replace_object_pos(cur);

if (0 <= pos)
-   cur = replace_object[pos]->replacement;
+   cur = 
the_repository->objects->replacements.items[pos]->replacement;
} while (0 <= pos);
  
  	return cur;

diff --git a/replace-object.h b/replace-object.h
new file mode 100644
index 

Re: [RFC PATCH 00/19] object-store refactoring 3 (replace objects, main ref store)

2018-04-09 Thread Derrick Stolee

On 4/6/2018 7:21 PM, Stefan Beller wrote:

This applies on top of 464416a2eaadf84d2bfdf795007863d03b222b7c
(sb/packfiles-in-repository).
It is also available at https://github.com/stefanbeller/git/tree/object-store-3

This series will bring the replacement mechanism (git replace)
into the object store.

The first patches are cleaning up a bit, and patches 6-19 are converting
one function at a time using the tick-tock pattern with the #define trick.
See cfc62fc98c (sha1_file: add repository argument to link_alt_odb_entry,
2018-03-23) for explanation.

Thanks,
Stefan


I looked through these patches and only found one set of whitespace 
errors. Compiles and tests fine on my machine.


Reviewed-by: Derrick Stolee <dsto...@microsoft.com>


Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph

2018-04-07 Thread Derrick Stolee

On 4/7/2018 2:40 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:

[...]

On the Linux repository, performance tests were run for the following
command:

 git log --graph --oneline -1000

 Before: 0.92s
 After:  0.66s
 Rel %: -28.3%

Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.

In the "Git Merge contributor summit notes" [1] one can read that:


- VSTS adds bloom filters to know which paths have changed on the commit
- tree-same check in the bloom filter is fast; speeds up file history checks
- if the file history is _very_ sparse, then bloom filter is useful

Could this method speed up also the second case mentioned here?  Can
anyone explain how this "path-changed bloom filter" works in VSTS?
   


The idea is simple: for every commit, store a Bloom filter containing 
the list of paths that are not TREESAME against the first parent. (A 
slight detail: have a max cap on the number of paths, and store simply 
"TOO_BIG" for commits with too many diffs.)


When performing 'git log -- path' queries, the most important detail for 
considering how to advance the walk is whether the commit is TREESAME to 
its first parent. For a deep path in a large repo, this is almost always 
true. When a Bloom filter says "TREESAME" (i.e. "this path is not in my 
set") it is always correct, so we can set the treesame bit and continue 
without walking any trees. When a Bloom filter says "MAYBE NOT TREESAME" 
(i.e. "this path is probably in my set") you only need to do the same 
work as before: walk the trees to compare against your first parent.


If a Bloom filter has a false-positive rate of X%, then you can possibly 
drop your number of tree comparisons by (100-X)%. This is very important 
for large repos where some paths were changed only ten times or so, the 
full graph needs to be walked and it is helpful to avoid parsing too 
many trees.



Could we add something like this to the commit-graph file? 


I'm not sure if it is necessary for client-side operations, but it is 
one of the reasons the commit-graph file has the idea of an "optional 
chunk". It could be added to the file format (without changing version 
numbers) and be ignored by clients that don't understand it. I could 
also be gated by a config setting for computing them. My guess is that 
only server-side operations will need the added response time, and can 
bear the cost of computing them when writing the commit-graph file. 
Clients are less likely to be patient waiting for a lot of diff 
calculations.


If we add commit-graph file downloads to the protocol, then the server 
could do this computation and send the data to all clients. But that 
would be "secondary" information that maybe clients want to verify, 
which is as difficult as computing it themselves.


Thanks,

-Stolee



Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-07 Thread Derrick Stolee

On 4/7/2018 12:55 PM, Jakub Narebski wrote:

Currently I am at the stage of reproducing results in FELINE paper:
"Reachability Queries in Very Large Graphs: A Fast Refined Online Search
Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
J. Zaki (2014).  This paper is available in the PDF form at
https://openproceedings.org/EDBT/2014/paper_166.pdf

The Jupyter Notebook (which runs on Google cloud, but can be also run
locally) uses Python kernel, NetworkX librabry for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

I hope that could be of help, or at least interesting


Let me know when you can give numbers (either raw performance or # of 
commits walked) for real-world Git commit graphs. The Linux repo is a 
good example to use for benchmarking, but I also use the Kotlin repo 
sometimes as it has over a million objects and over 250K commits.


Of course, the only important statistic at the end of the day is the 
end-to-end time of a 'git ...' command. Your investigations should 
inform whether it is worth prototyping the feature in the git codebase.


Thanks,

-Stolee



Re: [PATCH 7/6] ref-filter: use generation number for --contains

2018-04-04 Thread Derrick Stolee

On 4/4/2018 3:42 PM, Jeff King wrote:

On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote:


That is the idea. I should make this clearer in all of my commit messages.

Yes, please. :) And maybe in the documentation of the file format, if
it's not there (I didn't check). It's a very useful property, and we
want to make sure people making use of the graph know they can depend on
it.


For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion 
of generation numbers in Documentation/technical/commit-graph.txt (the 
design doc instead of the file format).


Thanks,
-Stolee


Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()

2018-04-18 Thread Derrick Stolee

On 4/18/2018 10:31 AM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Does it mean that it gives no measureable performance improvements for
typical test cases?


Not in this commit. When we add the `min_generation` parameter in a 
later commit, we do get a significant performance boost (when we can 
supply a non-zero value to `min_generation`).


This step of using generation numbers for the priority is important for 
that commit, but on its own has limited value outside of the clock-skew 
case mentioned above.





Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit.c | 20 +++-
  commit.h |  1 +
  2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..a44899c733 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
  }
  
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)

+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generataions are equal */

Very minor typo in above comment:

s/generataions/generations/


Good catch!




+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
  int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
  {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
  /* all input commits in one and twos[] must have been parsed! */
  static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
  {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
  
diff --git a/commit.h b/commit.h

index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
  extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
  
  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);

+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
  
  LAST_ARG_MUST_BE_NULL

  extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);




Re: What's cooking in git.git (Apr 2018, #02; Tue, 17)

2018-04-18 Thread Derrick Stolee

On 4/17/2018 9:04 PM, Junio C Hamano wrote:

Stefan Beller <sbel...@google.com> writes:


  What's the doneness of this thing?  I didn't recall seeing any
  response, especially ones that demonstrated the reviewer carefully
  read and thought about the issues surrounding the code.  Not that I
  spotted any problems in these patches myself, though.

Stolee and Brandon provided a "quick LGTM" type of review
https://public-inbox.org/git/20180409232536.gb102...@google.com/
https://public-inbox.org/git/9ddfee7e-025a-79c9-8d6b-700c65a14...@gmail.com/

Yup.  Giving positive reviews is harder than giving constructive
criticism.  Much harder.

As readers cannot tell from a "quick LGTM" between "I didn't read it
but it did not smell foul" and "I read it thoroughly, understood how
the solution works, it was presented well, and agree with the design
and implementation---there is nothing to add", the reviewers need to
come up with some way to express that it is the latter case rather
than the former.

I would not claim that I've perfected my technique to do so, but
when responding to such a "good" series, I rephrase the main idea in
the series in my own words to show that I as a reviewer read the
series well enough to be able to do so, perhaps with comparison with
possible alternatives I could think of and dicussion to argue that
the solution presented in the series is better, in an attempt to
demonstrate that I am qualified to say "this one is good" with good
enough understanding of both the issue the series addresses and the
solution in the series.


I'm sorry that my second message was terse. My response to v1 [1] was

> I looked through these patches and only found one set of whitespace > 
errors. Compiles and tests fine on my machine. > > Reviewed-by: Derrick 
Stolee <dsto...@microsoft.com> So, I pulled the code, went through it 
patch-by-patch, and saw that the transformations were made using the 
established pattern. The second review was to chime in that my v1 
comments had been addressed. Thanks, -Stolee
[1] 
https://public-inbox.org/git/6c319100-df47-3b8d-8661-24e4643ad...@gmail.com/


[PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 51 --
 commit-graph.h |  8 
 commit.c   |  7 +--
 commit.h   |  2 +-
 object.c   |  2 +-
 sha1_file.c|  2 +-
 6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph 
*g, uint32_t pos)
+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
 
item->object.parsed = 1;
item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t *pos)
+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   }
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, 
&(item->object.oid), );
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
 {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, 
unsigned long *sizep)
return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer

[PATCH v3 4/9] commit-graph.txt: update design document

2018-04-17 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++-
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-- paint_down_to_common()
 - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb



[PATCH v3 9/9] merge: check config before loading commits

2018-04-17 Thread Derrick Stolee
Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/merge.c | 5 +++--
 t/t5318-commit-graph.sh | 9 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..7e1da6c6ea 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
 
-   init_diff_ui_defaults();
-   git_config(git_merge_config, NULL);
 
if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



[PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-17 Thread Derrick Stolee
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index bceb79c419..a70f120878 100644
--- a/commit.c
+++ b/commit.c
@@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
 {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip");
+   last_gen = commit->generation;
+
+   if (commit->generation < min_generation)
+   break;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
 
-   list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
 
while (list) {
struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
 
-   bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



[PATCH v3 2/9] commit-graph: compute generation numbers

2018-04-17 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..688d5b1801 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   packedDate[0] |= htonl((*list)->generation << 2);
+   }
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }
+
+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = GENERATION_NUMBER_MAX;
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers(commits.list, commits.nr);
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb



[PATCH v3 1/9] commit: add generation number to struct commmit

2018-04-17 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0x
GENERATION_NUMBER_MAX 0x3FFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 4 
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb



[PATCH v3 6/9] commit: use generation numbers for in_merge_bases()

2018-04-17 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index a44899c733..bceb79c419 100644
--- a/commit.c
+++ b/commit.c
@@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return 0;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



[PATCH v3 0/9] Compute and consume generation numbers

2018-04-17 Thread Derrick Stolee
Thanks for all the help on v2. Here are a few changes between versions:

* Removed the constant-time check in queue_has_nonstale() due to the
  possibility of a performance hit and no evidence of a performance
  benefit in typical cases.

* Reordered the commits about loading commits from the commit-graph.
  This way it is easier to demonstrate the incorrect checks. On my
  machine, every commit compiles and the test suite passes, but patches
  6-8 have the bug that is fixed in patch 9 "merge: check config before
  loading commits".

* The interaction with parse_commit_in_graph() from parse_object() is
  replaced with a new 'check_graph' parameter in parse_commit_buffer().
  This allows us to fill in the graph_pos and generation values for
  commits that are parsed directly from a buffer. This keeps the existing
  behavior that a commit parsed this way should match its buffer.

* There was discussion about making GENERATION_NUMBER_MAX assignable by
  an environment variable so we could add tests that exercise the behavior
  of capping a generation at that value. Perhaps the code around this is
  simple enough that we do not need to add that complexity.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains), providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is build on ds/lazy-load-trees.

Derrick Stolee (9):
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph.txt: update design document
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  commit-graph: always load commit-graph information
  merge: check config before loading commits

 Documentation/technical/commit-graph.txt | 30 +--
 alloc.c  |  1 +
 builtin/merge.c  |  5 +-
 commit-graph.c   | 99 +++-
 commit-graph.h   |  8 ++
 commit.c | 54 +++--
 commit.h |  7 +-
 object.c |  2 +-
 ref-filter.c | 23 +-
 sha1_file.c  |  2 +-
 t/t5318-commit-graph.sh  |  9 +++
 11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



[PATCH v3 3/9] commit: use generations in paint_down_to_common()

2018-04-17 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 20 +++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..a44899c733 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generataions are equal */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.39.g685157f7fb



[PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-17 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit 
*candidate,
 
/* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



Re: [PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-25 Thread Derrick Stolee

On 4/24/2018 2:56 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:

One way to fix this is to call 'prepare_commit_graph()' directly and
then test that 'commit_graph' is non-null before performing any
parses. I'm not thrilled with how that couples the commit-graph
implementation to this feature, but that may be necessary to avoid
regressions in the non-commit-graph case.

Another possible solution (not sure if better or worse) would be to
change the signature of contains_tag_algo() function to take parameter
or flag that would decide whether to parse "wants".


If I reorder commits so "commit-graph:always load commit-graph 
information" is before this one, then we can call 
load_commit_graph_info() which just fills the generation and graph_pos 
information. This will keep the coupling very light, instead of needing 
to call prepare_commit_graph() or checking the config setting.


Thanks,
-Stolee


[PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list

2018-04-25 Thread Derrick Stolee
The in_commit_list() method does not check the parents of
the candidate for containment in the list. Fix the comment
that incorrectly states that it does.

Reported-by: Jakub Narebski <jna...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..aff24d93be 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
 }
 
 /*
- * Test whether the candidate or one of its parents is contained in the list.
+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
-- 
2.17.0.39.g685157f7fb



[PATCH v4 00/10] Compute and consume generation numbers

2018-04-25 Thread Derrick Stolee
Thanks for the feedback on the previous version. I think this series is
stabilizing nicely. I'll reply to this message with an inter-diff as it
is not too large to share but would clutter this cover letter.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains), providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is built on ds/lazy-load-trees.

Derrick Stolee (10):
  ref-filter: fix outdated comment on in_commit_list
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph: always load commit-graph information
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  merge: check config before loading commits
  commit-graph.txt: update design document

 Documentation/technical/commit-graph.txt | 30 ++--
 alloc.c  |  1 +
 builtin/merge.c  |  7 +-
 commit-graph.c   | 92 
 commit-graph.h   |  8 +++
 commit.c | 54 +++---
 commit.h |  7 +-
 object.c |  2 +-
 ref-filter.c | 26 +--
 sha1_file.c  |  2 +-
 t/t5318-commit-graph.sh  |  9 +++
 11 files changed, 198 insertions(+), 40 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



[PATCH v4 03/10] commit-graph: compute generation numbers

2018-04-25 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..047fa9fca5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY)
+   packedDate[0] |= htonl((*list)->generation << 2);
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }
+
+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = GENERATION_NUMBER_MAX;
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers(commits.list, commits.nr);
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb



[PATCH v4 02/10] commit: add generation number to struct commmit

2018-04-25 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0x
GENERATION_NUMBER_MAX 0x3FFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 4 
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb



[PATCH v4 07/10] commit: use generation numbers for in_merge_bases()

2018-04-25 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 39a3749abd..7bb007f56a 100644
--- a/commit.c
+++ b/commit.c
@@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return ret;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



[PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-25 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 20 +++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..4d00b0a1d6 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generations are equal */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.39.g685157f7fb



[PATCH v4 05/10] commit-graph: always load commit-graph information

2018-04-25 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 45 ++---
 commit-graph.h |  8 
 commit.c   |  7 +--
 commit.h   |  2 +-
 object.c   |  2 +-
 sha1_file.c|  2 +-
 6 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 047fa9fca5..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph 
*g, uint32_t pos)
+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
uint32_t edge_value;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t *pos)
+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(g, &(item->object.oid), pos);
+   }
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+   uint32_t pos;
+
if (!core_commit_graph)
return 0;
if (item->object.parsed)
return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, 
&(item->object.oid), );
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
 {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index 4d00b0a1d6..39a3749abd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, 
unsigned long *sizep)
return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph)
 {
const char *tail = buffer;
const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
}
item->date = parse_commit_date(bufptr, tail);
 
+   if (check_graph)
+   load_commit_graph_info(item);
+
return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return error("

[PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()

2018-04-25 Thread Derrick Stolee
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 7bb007f56a..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
 {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip");
+   last_gen = commit->generation;
+
+   if (commit->generation < min_generation)
+   break;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
 
-   list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
 
while (list) {
struct commit *commit = pop_commit();
@@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return ret;
 
-   bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



[PATCH v4 06/10] ref-filter: use generation number for --contains

2018-04-25 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index aff24d93be..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"
 
 static struct ref_msg {
const char *gone;
@@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit 
*candidate,
 
/* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   load_commit_graph_info(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



[PATCH v4 10/10] commit-graph.txt: update design document

2018-04-25 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++-
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-- paint_down_to_common()
 - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb



[PATCH v4 09/10] merge: check config before loading commits

2018-04-25 Thread Derrick Stolee
Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/merge.c | 7 ---
 t/t5318-commit-graph.sh | 9 +
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
 
-   init_diff_ui_defaults();
-   git_config(git_merge_config, NULL);
-
if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



Re: [PATCH v4 00/10] Compute and consume generation numbers

2018-04-25 Thread Derrick Stolee

As promised, here is the diff from v3.

Thanks,
-Stolee

-- >8 --

diff --git a/builtin/merge.c b/builtin/merge.c
index 7e1da6c6ea..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const 
char *prefix)
    branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, 
NULL);

    if (branch)
    skip_prefix(branch, "refs/heads/", );
+
    init_diff_ui_defaults();
    git_config(git_merge_config, NULL);

@@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const 
char *prefix)

    else
    head_commit = lookup_commit_or_die(_oid, "HEAD");

-
    if (branch_mergeoptions)
    parse_branch_merge_options(branch_mergeoptions);
    argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin

    uint32_t *parent_data_ptr;
    uint64_t date_low, date_high;
    struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
(g->hash_len + 16) * pos;


    item->object.parsed = 1;
    item->graph_pos = pos;
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin

    *pos = item->graph_pos;
    return 1;
    } else {
-   return bsearch_graph(commit_graph, &(item->object.oid), 
pos);

+   return bsearch_graph(g, &(item->object.oid), pos);
    }
 }

@@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item)
 {
    uint32_t pos;

-   if (item->object.parsed)
-   return 0;
    if (!core_commit_graph)
    return 0;
+   if (item->object.parsed)
+   return 1;
    prepare_commit_graph();
    if (commit_graph && find_commit_in_graph(item, commit_graph, ))
    return fill_commit_in_graph(item, commit_graph, pos);
@@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile 
*f, int hash_len,

    else
    packedDate[0] = 0;

-   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY)
    packedDate[0] |= htonl((*list)->generation << 2);
-   }

    packedDate[1] = htonl((*list)->date);
    hashwrite(f, packedDate, 8);
diff --git a/commit.c b/commit.c
index 9ef6f699bd..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const 
void *a_, const void *b_, void

    else if (a->generation > b->generation)
    return -1;

-   /* use date as a heuristic when generataions are equal */
+   /* use date as a heuristic when generations are equal */
    if (a->date < b->date)
    return 1;
    else if (a->date > b->date)
@@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *

    }

    if (commit->generation > min_generation)
-   return 0;
+   return ret;

    bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

    if (commit->object.flags & PARENT2)
diff --git a/ref-filter.c b/ref-filter.c
index e2fea6d635..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"

 static struct ref_msg {
    const char *gone;
@@ -1582,7 +1583,7 @@ static int in_commit_list(const struct commit_list 
*want, struct commit *c)

 }

 /*
- * Test whether the candidate or one of its parents is contained in the 
list.

+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
@@ -1629,7 +1630,7 @@ static enum contains_result 
contains_tag_algo(struct commit *candidate,


    for (p = want; p; p = p->next) {
    struct commit *c = p->item;
-   parse_commit_or_die(c);
+   load_commit_graph_info(c);
    if (c->generation < cutoff)
    cutoff = c->generation;
    }




Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()

2018-04-23 Thread Derrick Stolee

On 4/18/2018 6:15 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

You have "target" twice in above paragraph; one of those should probably
be something else.


Thanks. Second "target" should be "initial".


[...]

+
+   if (commit->generation > min_generation)
+   return 0;

Why not use "return ret;" instead of "return 0;", like the rest of the
code [cryptically] does, that is:

   +if (commit->generation > min_generation)
   +return ret;


Sure.

Thanks,
-Stolee


Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-23 Thread Derrick Stolee

On 4/18/2018 7:19 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


[...]

[...], and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because max-order priority queue is used: if the commit with
maximum generation number is below generation number of target commit,
then target commit is not reachable from any commit in the priority
queue (all of which has generation number less or equal than the commit
at head of queue, i.e. all are same level or deeper); compare what I
have written in [1]

[1]: https://public-inbox.org/git/866052dkju@gmail.com/

Do I have that right?  If so, it looks all right to me.


Yes, the priority queue needs to compare via generation number first or 
there will be errors. This is why we could not use commit time before.





For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

[...]

flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
  
-	list = paint_down_to_common(one, n, twos);

+   list = paint_down_to_common(one, n, twos, 0);
  
  	while (list) {

struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
  
-	bases = paint_down_to_common(commit, nr_reference, reference);

+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.


We need to change the signature some way, but maybe the way I chose is 
not the best.


To elaborate: paint_down_to_common() is used for multiple purposes. The 
caller here that supplies 'commit->generation' is used only to compute 
reachability (by testing if the flag PARENT2 exists on the commit, then 
clears all flags). The other callers expect the full walk down to the 
common commits, and keeps those PARENT1, PARENT2, and STALE flags for 
future use (such as reporting merge bases). Usually the call to 
paint_down_to_common() is followed by a revision walk that only halts 
when reaching root commits or commits with both PARENT1 and PARENT2 
flags on, so always short-circuiting on generations would break the 
functionality; this is confirmed by the t5318-commit-graph.sh.


An alternative to the signature change is to add a boolean parameter 
"use_cutoff" or something, that specifies "don't walk beyond the 
commit". This may give a more of a clear description of what it will do 
with the generation value, but since we are already performing 
generation comparisons before calling paint_down_to_common() I find this 
simple enough.


Thanks,
-Stolee



<    4   5   6   7   8   9   10   11   12   13   >