[PATCH v2 5/6] pack-objects: create pack.useSparse setting

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c         |  4 ++++
 t/t5322-pack-objects-sparse.sh | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		use_bitmap_index_default = git_config_bool(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.usesparse")) {
+		sparse = git_config_bool(k, v);
+		return 0;
+	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
 		if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+	git config pack.useSparse true &&
+	git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+	git index-pack -o sparse.idx sparse.pack &&
+	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+	test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget



[PATCH v2 0/6] Add a new "sparse" tree walk algorithm

2018-11-29 Thread Derrick Stolee via GitGitGadget
One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/git-pack-objects.txt |   9 +-
 bisect.c   |   2 +-
 builtin/pack-objects.c |  10 ++-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  55 +++-
 list-objects.h |   4 +-
 revision.c | 121 +
 revision.h |   2 +
 t/README   |   4 +
 t/t5322-pack-objects-sparse.sh | 139 +
 11 files changed, 340 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-89/derrickstolee/push/sparse-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v1:

 1:  b73b8de98c = 1:  60617681f7 revision: add mark_tree_uninteresting_sparse
 2:  9bf04c748b ! 2:  4527addacb list-objects: consume sparse tree walk
 @@ -116,6 +116,10 @@
  + for (parents = commit->parents; parents; parents = parents->next) {
  + struct commit *parent = parents->item;
  + struct tree *tree = get_commit_tree(parent);
 ++
 ++ if (!tree)
 ++ continue;
 ++
  + oidset_insert(set, &tree->object.oid);
  +
  + if (!(parent->object.flags & UNINTERESTING))
 @@ -142,14 +146,14 @@
  +
for (list = revs->commits; list; list = list->next) {
struct commit *commit = list->item;
 -+ 
 + 
 +- if (commit->object.flags & UNINTERESTING) {
  + if (sparse) {
  + struct tree *tree = get_commit_tree(commit);
 -+ 
 ++
  + if (commit->object.flags & UNINTERESTING)
  + tree->object.flags |= UNINTERESTING;
 - 
 -- if (commit->object.flags & UNINTERESTING) {
 ++
  + oidset_insert(&set, &tree->object.oid);
  + add_edge_parents(commit, revs, show_edge, &set);
  + } else if (commit->object.flags & UNINTERESTING) {
 @@ -189,3 +193,17 @@
   
   struct oidset;
   struct list_objects_filter_options;
 +
 +diff --git a/revision.c b/revision.c
 +--- a/revision.c
 ++++ b/revision.c
 +@@
 +  while ((oid = oidset_iter_next(&iter))) {
 +  struct tree *tree = lookup_tree(r, oid);
 + 
 ++ if (!tree)
 ++ continue;
 ++
 +  if (tree->object.flags & UNINTERESTING) {
 +  /*
 +   * Remove 

[PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 22 ++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+				     struct oidset *set)
+{
+	struct object_id *oid;
+	struct oidset_iter iter;
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+
+		if (tree->object.flags & UNINTERESTING) {
+			/*
+			 * Remove the flag so the next call
+			 * is not a no-op. The flag is added
+			 * in mark_tree_uninteresting().
+			 */
+			tree->object.flags ^= UNINTERESTING;
+			mark_tree_uninteresting(r, tree);
+		}
+	}
+}
+
 struct commit_stack {
 	struct commit **items;
 	size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget



[PATCH v2 2/6] list-objects: consume sparse tree walk

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would mark the local topic branch as interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or uninteresting commit. We then need to discover which
non-commit objects are reachable from uninteresting commits.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERESTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERESTING parent (and all objects that tree can reach) as
  UNINTERESTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERESTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_uninteresting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.
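
For readers following along, the collection step described above can be
sketched roughly as follows. This is a toy model under stated assumptions,
not the code in list-objects.c: toy_commit, toy_root, set_add(),
collect_root_trees() and example_root_trees() are hypothetical names, and a
small pointer array stands in for the oidset. It only illustrates that the
root tree of every listed commit and of every parent goes into the set,
while only the roots of UNINTERESTING commits get flagged.

```c
#include <stddef.h>

#define UNINTERESTING 1u
#define MAX_PARENTS 2
#define MAX_TREES 16

/* Toy stand-ins (hypothetical); real commits and trees carry more state. */
struct toy_root {
	unsigned flags;
};

struct toy_commit {
	unsigned flags;
	struct toy_root *tree;
	int nr_parents;
	struct toy_commit *parent[MAX_PARENTS];
};

/* Add 't' to the set unless it is already present; returns the new size. */
static int set_add(struct toy_root **set, int n, struct toy_root *t)
{
	int i;

	for (i = 0; i < n; i++)
		if (set[i] == t)
			return n;
	if (n < MAX_TREES)
		set[n++] = t;
	return n;
}

/*
 * Collect the root tree of every listed commit and of every parent,
 * interesting or not, flagging the roots of UNINTERESTING commits.
 */
static int collect_root_trees(struct toy_commit **list, int nr,
			      struct toy_root **set)
{
	int i, j, n = 0;

	for (i = 0; i < nr; i++) {
		struct toy_commit *c = list[i];

		if (c->flags & UNINTERESTING)
			c->tree->flags |= UNINTERESTING;
		n = set_add(set, n, c->tree);

		for (j = 0; j < c->nr_parents; j++) {
			struct toy_commit *p = c->parent[j];

			n = set_add(set, n, p->tree);
			if (p->flags & UNINTERESTING)
				p->tree->flags |= UNINTERESTING;
		}
	}
	return n;
}

/* One interesting topic commit on top of one uninteresting base commit. */
int example_root_trees(void)
{
	struct toy_root base_tree = { 0 }, topic_tree = { 0 };
	struct toy_commit base = { UNINTERESTING, &base_tree, 0, { NULL } };
	struct toy_commit topic = { 0, &topic_tree, 1, { &base } };
	struct toy_commit *list[] = { &topic, &base };
	struct toy_root *set[MAX_TREES];
	int n = collect_root_trees(list, 2, set);

	/* Both roots are in the set; only the base root is flagged. */
	return n == 2 && (base_tree.flags & UNINTERESTING) &&
	       !(topic_tree.flags & UNINTERESTING);
}
```

The resulting set is what a function like mark_trees_uninteresting_sparse()
would receive; keeping the interesting roots in it is what later lets the
walk compare both kinds of tree at each path.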

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee 
---
 bisect.c   |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c |  2 +-
 http-push.c|  2 +-
 list-objects.c | 55 +++---
 list-objects.h |  4 ++-
 revision.c |  3 +++
 7 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	if (revs->tree_objects)
-		mark_edges_uninteresting(revs, NULL);
+		mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge);
+	mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	if (revs.tree_objects)
-		mark_edges_uninteresting(&revs, show_edge);
+		mark_edges_uninteresting(&revs, show_edge, 0);
 
 	if (bisect_list) {
 		int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
 		pushing = 0;
 		if (prepare_revision_walk(&revs))
 			die("revision walk setup failed");
-		mark_edges_uninteresting(&revs, NULL);
+		mark_edges_uninteresting(&revs, NULL, 0);
 		objects_to_send = get_delta(&revs, ref_lock);
 		finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..4fbdeca0a4 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit *commit,
 	}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+			     struct rev_info *revs,
+			     show_edge_fn show_edge,
+			     struct oidset *set)
+{
+	struct commit_list *parents;
+
+	for (parents = commit->parents; parents; parents = parents->next) {
+		struct commit *parent = parents->item;
+  

[PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.
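
Read together with the previous patch, the value of 'sparse' appears to be
settled in three layers: the test variable supplies the default, the
pack.useSparse config can replace it, and an explicit command-line flag has
the final word. The sketch below is only an illustration of that ordering,
inferred from the order of the calls in cmd_pack_objects(); resolve_sparse()
and its "-1 means unset" convention are hypothetical and not part of git.

```c
/*
 * Hypothetical helper: each argument is -1 when that layer did not
 * specify a value, otherwise 0 or 1. Later layers override earlier ones.
 */
static int resolve_sparse(int test_env, int config, int cli)
{
	int sparse = 0;			/* built-in default: existing walk */

	if (test_env >= 0)
		sparse = test_env;	/* GIT_TEST_PACK_SPARSE */
	if (config >= 0)
		sparse = config;	/* pack.useSparse */
	if (cli >= 0)
		sparse = cli;		/* --sparse or --no-sparse */
	return sparse;
}
```

For example, resolve_sparse(1, -1, 0) models a test run with
GIT_TEST_PACK_SPARSE enabled where a test passes '--no-sparse': the result
is 0, which is why the "non-sparse" tests in this patch add that flag.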

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 1 +
 t/README   | 4 
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	read_replace_refs = 0;
 
+	sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION=<n> exercises the index read/write code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE=<boolean> if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
 		topic1			\
 		topic1^{tree}		\
 		topic1:f3 | sort >expect_objects.txt &&
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget


[PATCH v2 4/6] revision: implement sparse algorithm

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commits we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We had already taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our
recursion unless the oidset contains some interesting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.
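
As a rough illustration of the walk described above, here is a toy model.
It is a sketch under stated assumptions and not the code added to
revision.c: toy_tree, sparse_walk(), trees_parsed and example_sparse_walk()
are hypothetical names, trees contain only subtrees (blobs are left out),
the entry name stands in for the path at each level, and fixed-size arrays
replace the path-to-oidset dictionary.

```c
#include <string.h>

#define UNINTERESTING 1u
#define MAX_ENTRIES 4
#define MAX_SET 4
#define MAX_PATHS (MAX_SET * MAX_ENTRIES)

/* Toy tree (hypothetical): every entry is a subtree. */
struct toy_tree {
	unsigned flags;
	int nr;
	const char *name[MAX_ENTRIES];
	struct toy_tree *child[MAX_ENTRIES];
};

static int trees_parsed;

/* 'set' holds the trees found at one path; n must not exceed MAX_SET. */
static void sparse_walk(struct toy_tree **set, int n)
{
	const char *paths[MAX_PATHS];
	struct toy_tree *group[MAX_PATHS][MAX_SET];
	int count[MAX_PATHS];
	int nr_paths = 0, has_interesting = 0, has_uninteresting = 0;
	int i, j, k, p;

	for (i = 0; i < n; i++) {
		if (set[i]->flags & UNINTERESTING)
			has_uninteresting = 1;
		else
			has_interesting = 1;
	}
	/* Terminate unless both kinds of tree appear at this path. */
	if (!has_interesting || !has_uninteresting)
		return;

	for (i = 0; i < n; i++) {
		struct toy_tree *t = set[i];

		trees_parsed++;
		for (j = 0; j < t->nr; j++) {
			struct toy_tree *c = t->child[j];

			/* Children of an UNINTERESTING tree inherit the flag. */
			if (t->flags & UNINTERESTING)
				c->flags |= UNINTERESTING;

			/* Find or create the group for this entry name. */
			for (p = 0; p < nr_paths; p++)
				if (!strcmp(paths[p], t->name[j]))
					break;
			if (p == nr_paths) {
				paths[nr_paths] = t->name[j];
				count[nr_paths++] = 0;
			}
			/* Like a set, keep each tree at most once. */
			for (k = 0; k < count[p]; k++)
				if (group[p][k] == c)
					break;
			if (k == count[p])
				group[p][count[p]++] = c;
		}
	}
	for (p = 0; p < nr_paths; p++)
		sparse_walk(group[p], count[p]);
}

/* The topic changes f1/g; f2 (tree 'y') is shared with the base. */
int example_sparse_walk(void)
{
	struct toy_tree w = { 0 }, y = { 0 }, z_old = { 0 }, z_new = { 0 };
	struct toy_tree x_old = { 0 }, x_new = { 0 };
	struct toy_tree base = { UNINTERESTING }, topic = { 0 };
	struct toy_tree *roots[] = { &base, &topic };

	y.nr = 1; y.name[0] = "deep"; y.child[0] = &w;
	x_old.nr = 1; x_old.name[0] = "g"; x_old.child[0] = &z_old;
	x_new.nr = 1; x_new.name[0] = "g"; x_new.child[0] = &z_new;
	base.nr = 2;
	base.name[0] = "f1"; base.child[0] = &x_old;
	base.name[1] = "f2"; base.child[1] = &y;
	topic.nr = 2;
	topic.name[0] = "f1"; topic.child[0] = &x_new;
	topic.name[1] = "f2"; topic.child[1] = &y;

	trees_parsed = 0;
	sparse_walk(roots, 2);
	return trees_parsed;
}
```

In this example the walk parses six of the eight trees: both roots, both
versions of f1, and both versions of f1/g. The shared tree at f2 is the
only tree at its path and is UNINTERESTING, so the recursion stops there
and neither it nor its subtree is opened.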

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees as the main metric above because
it is consistent across multiple runs. When I ran my tests, the
end-to-end time of the pack-objects command with the same options
could vary by 10x depending on whether the file system was warm.
However, by running the same test repeatedly I could get more
consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the
super-large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 116 ++---
 t/t5322-pack-objects-sparse.sh |  21 --
 2 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..971f1bb095 100644
--- a/revision.c
+++ b/revision.c
@@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+struct paths_and_oids {
+	struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+	string_list_init(&po->list, 1);
+}
+
+static void 

[PATCH v2 3/6] pack-objects: add --sparse option

2018-11-29 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |   9 ++-
 builtin/pack-objects.c |   5 +-
 t/t5322-pack-objects-sparse.sh | 115 +
 3 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 40c825c381..ced2630eb3 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
 	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
 	[--stdout [--filter=<filter-spec>] | base-name]
-	[--shallow] [--keep-true-parents] < object-list
+	[--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,13 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack. This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));
-	mark_edges_uninteresting(&revs, show_edge, 0);
+	mark_edges_uninteresting(&revs, show_edge, sparse);
 
 	if (!fn_show_object)
 		fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
 		  N_("unpack unreachable objects newer than <time>"),
 		  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+		OPT_BOOL(0, "sparse", &sparse,
+			 N_("use the sparse reachability algorithm")),
 		OPT_BOOL(0, "thin", &thin,
 			 N_("create thin packs")),
 		OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	test_commit initial &&
+	for i in $(test_seq 1 3)
+	do
+		mkdir f$i &&
+		for j in $(test_seq 1 3)
+		do
+			mkdir f$i/f$j &&
+			echo $j >f$i/f$j/data.txt
+		done
+	done &&
+	git add . &&
+	git commit -m "Initialized trees" &&
+	for i in $(test_seq 1 3)
+	do
+		git checkout -b topic$i master &&
+		echo change-$i >f$i/f$i/data.txt &&
+		git commit -a -m "Changed f$i/f$i/data.txt"
+	done &&
+	cat >packinput.txt <<-EOF &&
+	topic1
+	^topic2
+	^topic3
+	EOF
+	git rev-parse			\
+		topic1			\
+		topic1^{tree}		\
+		topic1:f1		\
+		topic1:f1/f1		\
+		topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git index-pack -o nonsparse.idx nonsparse.pack &&
+	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+	test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+ 

[PATCH 4/5] revision: implement sparse algorithm

2018-11-28 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commits we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We had already taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our
recursion unless the oidset contains some interesting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:    220
Walked (old alg): 22,804
Walked (new alg):    129

I used the number of walked trees as the main metric above because
it is consistent across multiple runs. When I ran my tests, the
end-to-end time of the pack-objects command with the same options
could vary by 10x depending on whether the file system was warm.
However, by running the same test repeatedly I could get more
consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the
super-large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 111 ++---
 t/t5322-pack-objects-sparse.sh |  21 +--
 2 files changed, 116 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index 3a62c7c187..7e4bfe621a 100644
--- a/revision.c
+++ b/revision.c
@@ -99,26 +99,117 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+struct paths_and_oids {
+	struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+	string_list_init(&po->list, 1);
+}
+
+static void 

[PATCH 0/5] Add a new "sparse" tree walk algorithm

2018-11-28 Thread Derrick Stolee via GitGitGadget
One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.
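
To make the recursion rule above concrete, here is a minimal C sketch. It is
not the code from this series: 'struct tree_at_path' and 'should_recurse()'
are hypothetical names, and the sketch assumes each tree collected at a path
carries one interesting/uninteresting marking.

```c
#include <stddef.h>

/* Hypothetical record: one tree collected at a given path. */
struct tree_at_path {
	int uninteresting; /* nonzero if it came from an uninteresting commit */
};

/*
 * Decide whether to descend into a path, given every tree collected
 * there. Descend only when the set mixes interesting and uninteresting
 * trees; a uniform set cannot separate new objects from known ones.
 */
static int should_recurse(const struct tree_at_path *trees, size_t nr)
{
	int has_interesting = 0, has_uninteresting = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (trees[i].uninteresting)
			has_uninteresting = 1;
		else
			has_interesting = 1;
	}
	return has_interesting && has_uninteresting;
}
```

A path touched only by uninteresting trees is skipped, which is where the
savings come from when a topic changes a small cone of the working directory.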

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.
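
The resulting precedence can be sketched as follows. In the series, config
parsing and option parsing both write the same 'sparse' variable, with the
command line applied last; the sketch below only models that outcome. The
enum and 'resolve_sparse()' are hypothetical names used for illustration.

```c
/* Hypothetical tri-state for the command-line flag; not a name from git. */
enum cli_sparse {
	CLI_UNSET = -1,    /* neither --sparse nor --no-sparse given */
	CLI_NO_SPARSE = 0, /* --no-sparse */
	CLI_SPARSE = 1     /* --sparse */
};

/*
 * The config value (pack.useSparse) acts as the default; an explicit
 * command-line choice overrides it in either direction.
 */
static int resolve_sparse(int config_use_sparse, enum cli_sparse cli)
{
	if (cli != CLI_UNSET)
		return cli == CLI_SPARSE;
	return config_use_sparse != 0;
}
```

With pack.useSparse=true and '--no-sparse', this returns 0, matching the
override behavior described above.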

Derrick Stolee (5):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting

 Documentation/git-pack-objects.txt |   9 +-
 bisect.c   |   2 +-
 builtin/pack-objects.c |   9 +-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  51 ++-
 list-objects.h |   4 +-
 revision.c | 113 +++
 revision.h |   2 +
 t/t5322-pack-objects-sparse.sh | 139 +
 10 files changed, 323 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-89/derrickstolee/push/sparse-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/89
-- 
gitgitgadget


[PATCH 3/5] pack-objects: add --sparse option

2018-11-28 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |   9 ++-
 builtin/pack-objects.c |   5 +-
 t/t5322-pack-objects-sparse.sh | 115 +
 3 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 40c825c381..ced2630eb3 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
[--local] [--incremental] [--window=<n>] [--depth=<n>]
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
[--stdout [--filter=<filter-spec>] | base-name]
-   [--shallow] [--keep-true-parents] < object-list
+   [--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,13 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack. This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk())
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(&revs, show_edge, 0);
+   mark_edges_uninteresting(&revs, show_edge, sparse);
 
if (!fn_show_object)
fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
  N_("unpack unreachable objects newer than <time>"),
  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+   OPT_BOOL(0, "sparse", &sparse,
+N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", &thin,
 N_("create thin packs")),
OPT_BOOL(0, "shallow", &shallow,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+   test_commit initial &&
+   for i in $(test_seq 1 3)
+   do
+   mkdir f$i &&
+   for j in $(test_seq 1 3)
+   do
+   mkdir f$i/f$j &&
+   echo $j >f$i/f$j/data.txt
+   done
+   done &&
+   git add . &&
+   git commit -m "Initialized trees" &&
+   for i in $(test_seq 1 3)
+   do
+   git checkout -b topic$i master &&
+   echo change-$i >f$i/f$i/data.txt &&
+   git commit -a -m "Changed f$i/f$i/data.txt"
+   done &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic2
+   ^topic3
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f1   \
+   topic1:f1/f1\
+   topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+   git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+   git index-pack -o nonsparse.idx nonsparse.pack &&
+   git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
+   test_cmp expect_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'sparse pack-objects' '
+   git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
+ 

[PATCH 5/5] pack-objects: create pack.useSparse setting

2018-11-28 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, 
void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse <packinput.txt >sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
+   test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget


[PATCH 2/5] list-objects: consume sparse tree walk

2018-11-28 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or uninteresting commit. We then need to discover which
non-commit objects are reachable from uninteresting commits.

The mark_edges_uninteresting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERESTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERESTING parent (and all objects that tree can reach) as
  UNINTERESTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERESTING.
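
As an illustration of that duplicate-avoidance, here is a toy C sketch. It is
not git's implementation: 'struct toy_tree', its 'visits' counter, and the
flag value are assumptions made for the example.

```c
#include <stddef.h>

#define UNINTERESTING 1u /* illustrative flag bit, not git's actual value */

/* Hypothetical in-memory tree; real trees are parsed from the object store. */
struct toy_tree {
	unsigned flags;
	struct toy_tree **children;
	size_t nr_children;
	int visits; /* how many times the walk actually entered this tree */
};

static void mark_toy_tree_uninteresting(struct toy_tree *t)
{
	size_t i;

	if (!t || (t->flags & UNINTERESTING))
		return; /* already marked, so its subtree was handled before */
	t->flags |= UNINTERESTING;
	t->visits++;
	for (i = 0; i < t->nr_children; i++)
		mark_toy_tree_uninteresting(t->children[i]);
}
```

When two root trees share a subtree, the second walk stops at the shared
tree instead of descending into it again.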

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_uninteresting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee 
---
 bisect.c   |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c |  2 +-
 http-push.c|  2 +-
 list-objects.c | 51 ++
 list-objects.h |  4 +++-
 6 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
if (prepare_revision_walk(revs))
die("revision walk setup failed");
if (revs->tree_objects)
-   mark_edges_uninteresting(revs, NULL);
+   mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk())
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(&revs, show_edge);
+   mark_edges_uninteresting(&revs, show_edge, 0);
 
if (!fn_show_object)
fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (prepare_revision_walk())
die("revision walk setup failed");
if (revs.tree_objects)
-   mark_edges_uninteresting(&revs, show_edge);
+   mark_edges_uninteresting(&revs, show_edge, 0);
 
if (bisect_list) {
int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
pushing = 0;
if (prepare_revision_walk())
die("revision walk setup failed");
-   mark_edges_uninteresting(&revs, NULL);
+   mark_edges_uninteresting(&revs, NULL, 0);
objects_to_send = get_delta(&revs, ref_lock);
finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..9bb93d1640 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,68 @@ static void mark_edge_parents_uninteresting(struct commit 
*commit,
}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+struct rev_info *revs,
+show_edge_fn show_edge,
+struct oidset *set)
+{
+   struct commit_list *parents;
+
+   for (parents = commit->parents; parents; parents = parents->next) {
+   struct commit *parent = parents->item;
+   struct tree *tree 

[PATCH 1/5] revision: add mark_tree_uninteresting_sparse

2018-11-28 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 22 ++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+struct oidset *set)
+{
+   struct object_id *oid;
+   struct oidset_iter iter;
+
+   oidset_iter_init(set, &iter);
+   while ((oid = oidset_iter_next(&iter))) {
+   struct tree *tree = lookup_tree(r, oid);
+
+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_uninteresting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }
+   }
+}
+
 struct commit_stack {
struct commit **items;
size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget



[PATCH 1/1] revision.c: use new topo-order logic in tests

2018-11-19 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The revision-walk machinery is being rewritten to use generation
numbers in the commit-graph when available. Due to some problematic
commit histories, the new logic can be slower than the previous
method due to how commit dates and generation numbers interact.
Thus, the new logic is not used in comparison queries, such as

git log --topo-order A..B

The logic for these queries was implemented during the refactor,
but is unreachable due to the potential performance penalty. The
code came along with a larger block of code that was copied from
the old code. When generation numbers are updated to v2 (corrected
commit dates), then we will no longer have a performance penalty
and this logic is what we will want to use.

In the meantime, use the new logic when GIT_TEST_COMMIT_GRAPH is
enabled. This will demonstrate that the new logic works for all
comparison queries in the test suite, including these variants:

git log --topo-order --ancestry-path A..B
git log --topo-order A...B

Signed-off-by: Derrick Stolee 
---
 revision.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/revision.c b/revision.c
index 4ef47d2fb4..d52da6e24f 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "config.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -3143,6 +3144,9 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
return 0;
+   if (revs->limited &&
+   git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+   revs->limited = 0;
if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-- 
gitgitgadget


[PATCH 0/1] Use new topo-order logic with GIT_TEST_COMMIT_GRAPH

2018-11-19 Thread Derrick Stolee via GitGitGadget
The recent Git test report for v2.20.0-rc0 shows that the logic around
UNINTERESTING commits is not covered by the test suite. This is because the
code is currently unreachable! See the commit message for details.

An alternate approach would be to delete the code around UNINTERESTING
commits, but that doesn't seem necessary.

Thanks, -Stolee

Derrick Stolee (1):
  revision.c: use new topo-order logic in tests

 revision.c | 4 
 1 file changed, 4 insertions(+)


base-commit: 561b583749b7428f1790f03164d0d0e75be71d7b
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-83%2Fderrickstolee%2Ftopo-order-test-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-83/derrickstolee/topo-order-test-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/83
-- 
gitgitgadget


[PATCH v2 1/1] pack-objects: ignore ambiguous object warnings

2018-11-06 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

A git push runs several processes, one of which is git send-pack.
It calls git pack-objects and passes the known have/wants on
stdin as object ids. However, the
default setting for core.warnAmbiguousRefs requires git pack-objects
to check for ref names matching the ref_rev_parse_rules array in
refs.c. This means that every object is triggering at least six
"file exists?" queries.  When there are a lot of refs, this can
add up significantly! I observed a simple push spending three
seconds checking these paths.

The fix here is similar to 4c30d50 "rev-list: disable object/refname
ambiguity check with --stdin". Save the value of the global
warn_on_object_refname_ambiguity variable (which is usually set to
the boolean config variable core.warnAmbiguousRefs) and change the
state to false. Do this only during the get_object_list() method
which reads the objects from stdin.

Helped-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d1144a8f7e..f703e6df9b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2988,6 +2988,7 @@ static void get_object_list(int ac, const char **av)
struct rev_info revs;
char line[1000];
int flags = 0;
+   int save_warning;
 
init_revisions(&revs, NULL);
save_commit_buffer = 0;
@@ -2996,6 +2997,9 @@ static void get_object_list(int ac, const char **av)
/* make sure shallows are read */
is_repository_shallow(the_repository);
 
+   save_warning = warn_on_object_refname_ambiguity;
+   warn_on_object_refname_ambiguity = 0;
+
while (fgets(line, sizeof(line), stdin) != NULL) {
int len = strlen(line);
if (len && line[len - 1] == '\n')
@@ -3022,6 +3026,8 @@ static void get_object_list(int ac, const char **av)
die(_("bad revision '%s'"), line);
}
 
+   warn_on_object_refname_ambiguity = save_warning;
+
if (use_bitmap_index && !get_object_list_from_bitmap())
return;
 
-- 
gitgitgadget


[PATCH v2 0/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee via GitGitGadget
I've been looking into the performance of git push for very large repos. Our
users are reporting that 60-80% of git push time is spent during the
"Enumerating objects" phase of git pack-objects.

A git push runs several processes, one of which is git send-pack. It calls
git pack-objects and passes the known have/wants on stdin as object ids.
However, the default setting for
core.warnAmbiguousRefs requires git pack-objects to check for ref names
matching the ref_rev_parse_rules array in refs.c. This means that every
object is triggering at least six "file exists?" queries.

When there are a lot of refs, this can add up significantly! My PerfView
trace for a simple push measured 3 seconds spent checking these paths.

The fix is to set the global warn_on_object_refname_ambiguity to 0 for the
section that is performing these object reads.
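
The save-and-restore pattern can be sketched in C as below. Only the name of
the global comes from git; 'resolve_one()', 'read_object_list()', and the
'ambiguity_checks' counter are hypothetical stand-ins so the effect is
observable in a standalone example.

```c
#include <stddef.h>

/* Stand-in for git's global of the same name. */
static int warn_on_object_refname_ambiguity = 1;

/* Hypothetical counter standing in for the costly "file exists?" probes. */
static int ambiguity_checks;

/* Hypothetical stand-in for resolving one input line to an object. */
static void resolve_one(const char *line)
{
	(void)line;
	if (warn_on_object_refname_ambiguity)
		ambiguity_checks++;
}

/* Disable the check only while the object list is being read. */
static void read_object_list(const char **lines, size_t nr)
{
	size_t i;
	int save_warning = warn_on_object_refname_ambiguity;

	warn_on_object_refname_ambiguity = 0;
	for (i = 0; i < nr; i++)
		resolve_one(lines[i]);
	warn_on_object_refname_ambiguity = save_warning;
}
```

Restoring the saved value, rather than forcing it back to 1, keeps the user's
configured setting intact for any later lookups in the same process.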

In addition to this patch submission, we are looking into merging it into
our fork sooner [1].

[1] https://github.com/Microsoft/git/pull/67

Changes in V2: Instead of using the "-c" flag from send-pack, just set the
global. I left the name of the cover letter the same to not confuse anyone
viewing the message without threading.

Derrick Stolee (1):
  pack-objects: ignore ambiguous object warnings

 builtin/pack-objects.c | 6 ++
 1 file changed, 6 insertions(+)


base-commit: cae598d9980661a978e2df4fb338518f7bf09572
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-68/derrickstolee/send-pack-config-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/68

Range-diff vs v1:

 1:  1ef2c51550 < -:  -- send-pack: set core.warnAmbiguousRefs=false
 -:  -- > 1:  002868ee6b pack-objects: ignore ambiguous object warnings

-- 
gitgitgadget


[PATCH 1/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

During a 'git push' command, we run 'git send-pack' inside of our
transport helper. This creates a 'git pack-objects' process and
passes a list of object ids. If this list is large, then the
pack-objects process can spend a lot of time checking the possible
refs these strings could represent.

Remove this extra check by setting core.warnAmbiguousRefs to false
as we call 'git pack-objects'.

Signed-off-by: Derrick Stolee 
---
 send-pack.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/send-pack.c b/send-pack.c
index e920ca57df..5055150fe1 100644
--- a/send-pack.c
+++ b/send-pack.c
@@ -64,6 +64,8 @@ static int pack_objects(int fd, struct ref *refs, struct 
oid_array *extra, struc
int i;
int rc;
 
+   argv_array_push(&po.args, "-c");
+   argv_array_push(&po.args, "core.warnAmbiguousRefs=false");
argv_array_push(&po.args, "pack-objects");
argv_array_push(&po.args, "--all-progress-implied");
argv_array_push(&po.args, "--revs");
-- 
gitgitgadget


[PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee via GitGitGadget
I've been looking into the performance of git push for very large repos. Our
users are reporting that 60-80% of git push time is spent during the
"Enumerating objects" phase of git pack-objects.

A git push runs several processes, one of which is git send-pack. It calls
git pack-objects and passes the known have/wants on stdin as object ids.
However, the default setting for
core.warnAmbiguousRefs requires git pack-objects to check for ref names
matching the ref_rev_parse_rules array in refs.c. This means that every
object is triggering at least six "file exists?" queries.

When there are a lot of refs, this can add up significantly! My PerfView
trace for a simple push measured 3 seconds spent checking these paths.

The fix for this is simple: set core.warnAmbiguousRefs to false for this
specific call of git pack-objects coming from git send-pack. We don't want
to default it to false for all calls to git pack-objects, as it is valid to
pass ref names instead of object ids. This helps regain these seconds during
a push.

In addition to this patch submission, we are looking into merging it into
our fork sooner [1].

[1] https://github.com/Microsoft/git/pull/67

Derrick Stolee (1):
  send-pack: set core.warnAmbiguousRefs=false

 send-pack.c | 2 ++
 1 file changed, 2 insertions(+)


base-commit: cae598d9980661a978e2df4fb338518f7bf09572
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-68/derrickstolee/send-pack-config-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/68
-- 
gitgitgadget


[PATCH v2 0/3] Make add_missing_tags() linear

2018-11-02 Thread Derrick Stolee via GitGitGadget
As reported earlier [1], the add_missing_tags() method in remote.c has
quadratic performance. Some of that performance is curbed due to the
generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
help users without a commit-graph, and it can still be painful if that
cutoff is sufficiently low compared to the tags we are using for
reachability testing.

Add a new method in commit-reach.c called get_reachable_subset() which does
a many-to-many reachability test. Starting at the 'from' commits, walk until
the generation is below the smallest generation in the 'to' commits, or all
'to' commits have been discovered. This performs only one commit walk for
the entire add_missing_tags() method, giving linear performance in the worst
case.

Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
works independently of its application in add_missing_tags().

Thanks, -Stolee

[1] 
https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/

Derrick Stolee (3):
  commit-reach: implement get_reachable_subset
  test-reach: test get_reachable_subset
  remote: make add_missing_tags() linear

 commit-reach.c| 70 +++
 commit-reach.h| 13 
 remote.c  | 34 -
 t/helper/test-reach.c | 34 ++---
 t/t6600-test-reach.sh | 52 
 5 files changed, 198 insertions(+), 5 deletions(-)


base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-60/derrickstolee/add-missing-tags-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/60

Range-diff vs v1:

 1:  4c0c5c9143 ! 1:  9e570603bd commit-reach: implement get_reachable_subset
 @@ -49,7 +49,7 @@
  +
  +struct commit_list *get_reachable_subset(struct commit **from, int 
nr_from,
  +  struct commit **to, int nr_to,
 -+  int reachable_flag)
 ++  unsigned int reachable_flag)
  +{
  + struct commit **item;
  + struct commit *current;
 @@ -129,9 +129,12 @@
  + * Return a list of commits containing the commits in the 'to' array
  + * that are reachable from at least one commit in the 'from' array.
  + * Also add the given 'flag' to each of the commits in the returned list.
 ++ *
 ++ * This method uses the PARENT1 and PARENT2 flags during its operation,
 ++ * so be sure these flags are not set before calling the method.
  + */
  +struct commit_list *get_reachable_subset(struct commit **from, int 
nr_from,
  +  struct commit **to, int nr_to,
 -+  int reachable_flag);
 ++  unsigned int reachable_flag);
  +
   #endif
 2:  382f4f4a5b = 2:  52e847b928 test-reach: test get_reachable_subset
 3:  ecbed3de5c = 3:  dfaceb162f remote: make add_missing_tags() linear

-- 
gitgitgadget


[PATCH v2 1/3] commit-reach: implement get_reachable_subset

2018-11-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The existing reachability algorithms in commit-reach.c focus on
finding merge-bases or determining if all commits in a set X can
reach at least one commit in a set Y. However, for two commits sets
X and Y, we may also care about which commits in Y are reachable
from at least one commit in X.

Implement get_reachable_subset() which answers this question. Given
two arrays of commits, 'from' and 'to', return a commit_list with
every commit from the 'to' array that is reachable from at least
one commit in the 'from' array.

The algorithm is a simple walk starting at the 'from' commits, using
the PARENT2 flag to indicate "this commit has already been added to
the walk queue". By marking the 'to' commits with the PARENT1 flag,
we can determine when we see a commit from the 'to' array. We remove
the PARENT1 flag as we add that commit to the result list to avoid
duplicates.

The order of the resulting list is a reverse of the order that the
commits are discovered in the walk.

There are a couple shortcuts to avoid walking more than we need:

1. We determine the minimum generation number of commits in the
   'to' array. We do not walk commits with generation number
   below this minimum.

2. We count how many distinct commits are in the 'to' array, and
   decrement this count when we discover a 'to' commit during the
   walk. If this number reaches zero, then we can terminate the
   walk.
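
The walk and its two shortcuts can be sketched with a toy commit graph in C.
This is an illustration under stated assumptions, not the patch's code:
'struct toy_commit', the WANTED/QUEUED flags (standing in for PARENT1 and
PARENT2), and the caller-provided scratch stack are hypothetical, and a plain
stack replaces the priority queue, which does not change which commits are
found.

```c
#include <limits.h>
#include <stddef.h>

#define WANTED 1u /* plays the role of PARENT1: commit is in 'to' */
#define QUEUED 2u /* plays the role of PARENT2: commit was queued */

struct toy_commit {
	unsigned generation;
	unsigned flags;
	struct toy_commit *parents[2]; /* NULL-padded */
	int reachable;                 /* set when found by the walk */
};

/*
 * 'scratch' needs room for every commit in the graph; each commit is
 * pushed at most once. Returns how many 'to' commits were found.
 * Flag cleanup after the walk is omitted in this sketch.
 */
static size_t toy_reachable_subset(struct toy_commit **from, size_t nr_from,
				   struct toy_commit **to, size_t nr_to,
				   struct toy_commit **scratch)
{
	unsigned min_generation = UINT_MAX;
	size_t i, top = 0, num_to_find = 0, found = 0;

	for (i = 0; i < nr_to; i++) {
		if (to[i]->generation < min_generation)
			min_generation = to[i]->generation;
		if (!(to[i]->flags & WANTED)) { /* count distinct commits */
			to[i]->flags |= WANTED;
			num_to_find++;
		}
	}
	for (i = 0; i < nr_from; i++) {
		if (!(from[i]->flags & QUEUED)) {
			from[i]->flags |= QUEUED;
			scratch[top++] = from[i];
		}
	}
	while (num_to_find && top) { /* shortcut 2: stop when all found */
		struct toy_commit *cur = scratch[--top];

		if (cur->flags & WANTED) {
			cur->flags &= ~WANTED;
			cur->reachable = 1;
			num_to_find--;
			found++;
		}
		for (i = 0; i < 2; i++) {
			struct toy_commit *p = cur->parents[i];

			if (!p || p->generation < min_generation)
				continue; /* shortcut 1: below the cutoff */
			if (p->flags & QUEUED)
				continue;
			p->flags |= QUEUED;
			scratch[top++] = p;
		}
	}
	return found;
}
```

Because a parent always has a smaller generation number than its child, every
commit on a path from a 'from' commit to a 'to' commit sits at or above the
cutoff, so the cutoff never hides a reachable 'to' commit.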

Tests will be added using the 'test-tool reach' helper in a
subsequent commit.

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 70 ++
 commit-reach.h | 13 ++
 2 files changed, 83 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a22..8ad5352752 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct 
commit_list *to,
object_array_clear(&from_objs);
return result;
 }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+struct commit **to, int nr_to,
+unsigned int reachable_flag)
+{
+   struct commit **item;
+   struct commit *current;
+   struct commit_list *found_commits = NULL;
+   struct commit **to_last = to + nr_to;
+   struct commit **from_last = from + nr_from;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+   int num_to_find = 0;
+
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+   for (item = to; item < to_last; item++) {
+   struct commit *c = *item;
+   
+   parse_commit(c);
+   if (c->generation < min_generation)
+   min_generation = c->generation;
+
+   if (!(c->object.flags & PARENT1)) {
+   c->object.flags |= PARENT1;
+   num_to_find++;
+   }
+   }
+
+   for (item = from; item < from_last; item++) {
+   struct commit *c = *item;
+   if (!(c->object.flags & PARENT2)) {
+   c->object.flags |= PARENT2;
+   parse_commit(c);
+
+   prio_queue_put(&queue, *item);
+   }
+   }
+
+   while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
+   struct commit_list *parents;
+
+   if (current->object.flags & PARENT1) {
+   current->object.flags &= ~PARENT1;
+   current->object.flags |= reachable_flag;
+   commit_list_insert(current, &found_commits);
+   num_to_find--;
+   }
+
+   for (parents = current->parents; parents; parents = 
parents->next) {
+   struct commit *p = parents->item;
+
+   parse_commit(p);
+
+   if (p->generation < min_generation)
+   continue;
+
+   if (p->object.flags & PARENT2)
+   continue;
+
+   p->object.flags |= PARENT2;
+   prio_queue_put(&queue, p);
+   }
+   }
+
+   clear_commit_marks_many(nr_to, to, PARENT1);
+   clear_commit_marks_many(nr_from, from, PARENT2);
+
+   return found_commits;
+}
+
diff --git a/commit-reach.h b/commit-reach.h
index 7d313e2975..bb34af0269 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -74,4 +74,17 @@ int can_all_from_reach_with_flag(struct object_array *from,
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
   int commit_date_cutoff);
 
+
+/*
+ * Return a list of commits containing the commits in the 'to' array
+ * that are reachable from at least one commit in the 'from' array.
+ * Also add the given 'flag' to each of the commits in the returned list.
+ *
+ * This method uses the PARENT1 and PARENT2 

[PATCH v2 3/3] remote: make add_missing_tags() linear

2018-11-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The add_missing_tags() method currently has quadratic behavior.
This is due to a linear number (based on number of tags T) of
calls to in_merge_bases_many, which has linear performance (based
on number of commits C in the repository).

Replace this O(T * C) algorithm with an O(T + C) algorithm by
using get_reachable_subset(). We ignore the return list and focus
instead on the reachable_flag assigned to the commits we care
about, because we need to interact with the tag ref and not just
the commit object.
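
To make the O(T * C) versus O(T + C) claim concrete, here is a small cost
model in C. It is only a sketch under loud assumptions: history is a single
linear chain (commit i has parent i - 1), tags are given as commit indices
between 0 and the tip, and both function names are hypothetical. It does not
reproduce how in_merge_bases_many() or get_reachable_subset() actually walk;
it only counts visited commits for "one walk per tag" against "one shared
walk that leaves a flag on each tag commit".

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy linear history: commit i has the single parent i - 1, commit 0 is
 * the root. Tags are commit indices in the range [0, tip].
 */

/* One walk per tag: every walk restarts at 'tip' and stops at its tag. */
static long visits_one_walk_per_tag(int tip, const int *tags, int nr_tags)
{
	long visits = 0;
	int t, c;

	for (t = 0; t < nr_tags; t++) {
		for (c = tip; c >= 0; c--) {
			visits++;
			if (c == tags[t])
				break;
		}
	}
	return visits;
}

/*
 * One shared walk: mark the wanted commits first, walk down from 'tip'
 * once, and record reachability in 'reachable' (tip + 1 zeroed entries).
 * Afterwards each tag is answered by a constant-time flag test.
 */
static long visits_single_walk(int tip, const int *tags, int nr_tags,
			       unsigned char *reachable)
{
	unsigned char *wanted;
	long visits = 0;
	int t, c, remaining = 0;

	assert(tip >= 0);
	wanted = calloc((size_t)tip + 1, 1);
	assert(wanted);

	for (t = 0; t < nr_tags; t++) {
		if (tags[t] >= 0 && tags[t] <= tip && !wanted[tags[t]]) {
			wanted[tags[t]] = 1;
			remaining++;
		}
	}

	/* Stop as soon as every wanted commit has been seen. */
	for (c = tip; c >= 0 && remaining; c--) {
		visits++;
		if (wanted[c]) {
			reachable[c] = 1;
			remaining--;
		}
	}

	free(wanted);
	return visits;
}
```

With a tip at index 9 and tags at 2, 5 and 7, the per-tag version visits
8 + 5 + 3 commits while the shared walk visits 8, and the gap grows with
every additional tag.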

Signed-off-by: Derrick Stolee 
---
 remote.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/remote.c b/remote.c
index 81f4f01b00..b850f2feb3 100644
--- a/remote.c
+++ b/remote.c
@@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * sent to the other side.
 */
if (sent_tips.nr) {
+   const int reachable_flag = 1;
+   struct commit_list *found_commits;
+   struct commit **src_commits;
+   int nr_src_commits = 0, alloc_src_commits = 16;
+   ALLOC_ARRAY(src_commits, alloc_src_commits);
+
for_each_string_list_item(item, &src_tag) {
struct ref *ref = item->util;
+   struct commit *commit;
+
+   if (is_null_oid(&ref->new_oid))
+   continue;
+   commit = lookup_commit_reference_gently(the_repository,
+   &ref->new_oid,
+   1);
+   if (!commit)
+   /* not pushing a commit, which is not an error */
+   continue;
+
+   ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits);
+   src_commits[nr_src_commits++] = commit;
+   }
+
+   found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr,
+src_commits, nr_src_commits,
+reachable_flag);
+
+   for_each_string_list_item(item, &src_tag) {
struct ref *dst_ref;
+   struct ref *ref = item->util;
struct commit *commit;
 
if (is_null_oid(&ref->new_oid))
@@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * Is this tag, which they do not have, reachable from
 * any of the commits we are sending?
 */
-   if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip))
+   if (!(commit->object.flags & reachable_flag))
continue;
 
/* Add it in */
@@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
oidcpy(&dst_ref->new_oid, &ref->new_oid);
dst_ref->peer_ref = copy_ref(ref);
}
+
+   clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag);
+   free(src_commits);
+   free_commit_list(found_commits);
}
+
string_list_clear(&src_tag, 0);
free(sent_tips.tip);
 }
-- 
gitgitgadget


[PATCH v2 2/3] test-reach: test get_reachable_subset

2018-11-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The get_reachable_subset() method returns the list of commits in
the 'to' array that are reachable from at least one commit in the
'from' array. Add tests that check this method works in a few
cases:

1. All commits in the 'to' list are reachable. This exercises the
   early-termination condition.

2. Some commits in the 'to' list are reachable. This exercises the
   loop-termination condition.

3. No commits in the 'to' list are reachable. This exercises the
   NULL return condition.

Signed-off-by: Derrick Stolee 
---
 t/helper/test-reach.c | 34 
 t/t6600-test-reach.sh | 52 +++
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index 08d2ea68e8..a0272178b7 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av)
struct commit *A, *B;
struct commit_list *X, *Y;
struct object_array X_obj = OBJECT_ARRAY_INIT;
-   struct commit **X_array;
-   int X_nr, X_alloc;
+   struct commit **X_array, **Y_array;
+   int X_nr, X_alloc, Y_nr, Y_alloc;
struct strbuf buf = STRBUF_INIT;
struct repository *r = the_repository;
 
@@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av)
 
A = B = NULL;
X = Y = NULL;
-   X_nr = 0;
-   X_alloc = 16;
+   X_nr = Y_nr = 0;
+   X_alloc = Y_alloc = 16;
ALLOC_ARRAY(X_array, X_alloc);
+   ALLOC_ARRAY(Y_array, Y_alloc);
 
while (strbuf_getline(&buf, stdin) != EOF) {
struct object_id oid;
@@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av)
 
case 'Y':
commit_list_insert(c, &Y);
+   ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc);
+   Y_array[Y_nr++] = c;
break;
 
default:
@@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av)
filter.with_commit_tag_algo = 0;
 
printf("%s(_,A,X,_):%d\n", av[1], commit_contains(&filter, A, X, &cache));
+   } else if (!strcmp(av[1], "get_reachable_subset")) {
+   const int reachable_flag = 1;
+   int i, count = 0;
+   struct commit_list *current;
+   struct commit_list *list = get_reachable_subset(X_array, X_nr,
+   Y_array, Y_nr,
+   reachable_flag);
+   printf("get_reachable_subset(X,Y)\n");
+   for (current = list; current; current = current->next) {
+   if (!(list->item->object.flags & reachable_flag))
+   die(_("commit %s is not marked reachable"),
+   oid_to_hex(&list->item->object.oid));
+   count++;
+   }
+   for (i = 0; i < Y_nr; i++) {
+   if (Y_array[i]->object.flags & reachable_flag)
+   count--;
+   }
+
+   if (count < 0)
+   die(_("too many commits marked reachable"));
+
+   print_sorted_commit_ids(list);
}
 
exit(0);
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index ae94b27f70..a0c64e617a 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'get_reachable_subset:all' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-6-6
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 \
+ commit-5-6 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:some' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:none' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-1-7
+   Y:commit-9-3
+   Y:commit-7-6
+   Y:commit-2-8
+   EOF
+   echo "get_reachable_subset(X,Y)" >expect &&
+   test_three_modes get_reachable_subset
+'
+
 

[PATCH 2/3] test-reach: test get_reachable_subset

2018-10-30 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The get_reachable_subset() method returns the list of commits in
the 'to' array that are reachable from at least one commit in the
'from' array. Add tests that check this method works in a few
cases:

1. All commits in the 'to' list are reachable. This exercises the
   early-termination condition.

2. Some commits in the 'to' list are reachable. This exercises the
   loop-termination condition.

3. No commits in the 'to' list are reachable. This exercises the
   NULL return condition.

Signed-off-by: Derrick Stolee 
---
 t/helper/test-reach.c | 34 
 t/t6600-test-reach.sh | 52 +++
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index 08d2ea68e..a0272178b 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av)
struct commit *A, *B;
struct commit_list *X, *Y;
struct object_array X_obj = OBJECT_ARRAY_INIT;
-   struct commit **X_array;
-   int X_nr, X_alloc;
+   struct commit **X_array, **Y_array;
+   int X_nr, X_alloc, Y_nr, Y_alloc;
struct strbuf buf = STRBUF_INIT;
struct repository *r = the_repository;
 
@@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av)
 
A = B = NULL;
X = Y = NULL;
-   X_nr = 0;
-   X_alloc = 16;
+   X_nr = Y_nr = 0;
+   X_alloc = Y_alloc = 16;
ALLOC_ARRAY(X_array, X_alloc);
+   ALLOC_ARRAY(Y_array, Y_alloc);
 
while (strbuf_getline(&buf, stdin) != EOF) {
struct object_id oid;
@@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av)
 
case 'Y':
commit_list_insert(c, &Y);
+   ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc);
+   Y_array[Y_nr++] = c;
break;
 
default:
@@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av)
filter.with_commit_tag_algo = 0;
 
printf("%s(_,A,X,_):%d\n", av[1], commit_contains(&filter, A, X, &cache));
+   } else if (!strcmp(av[1], "get_reachable_subset")) {
+   const int reachable_flag = 1;
+   int i, count = 0;
+   struct commit_list *current;
+   struct commit_list *list = get_reachable_subset(X_array, X_nr,
+   Y_array, Y_nr,
+   reachable_flag);
+   printf("get_reachable_subset(X,Y)\n");
+   for (current = list; current; current = current->next) {
+   if (!(list->item->object.flags & reachable_flag))
+   die(_("commit %s is not marked reachable"),
+   oid_to_hex(&list->item->object.oid));
+   count++;
+   }
+   for (i = 0; i < Y_nr; i++) {
+   if (Y_array[i]->object.flags & reachable_flag)
+   count--;
+   }
+
+   if (count < 0)
+   die(_("too many commits marked reachable"));
+
+   print_sorted_commit_ids(list);
}
 
exit(0);
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index ae94b27f7..a0c64e617 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'get_reachable_subset:all' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-6-6
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 \
+ commit-5-6 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:some' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:none' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-1-7
+   Y:commit-9-3
+   Y:commit-7-6
+   Y:commit-2-8
+   EOF
+   echo "get_reachable_subset(X,Y)" >expect &&
+   test_three_modes get_reachable_subset
+'
+
 

[PATCH 3/3] remote: make add_missing_tags() linear

2018-10-30 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The add_missing_tags() method currently has quadratic behavior.
This is due to a linear number (based on number of tags T) of
calls to in_merge_bases_many, which has linear performance (based
on number of commits C in the repository).

Replace this O(T * C) algorithm with an O(T + C) algorithm by
using get_reachable_subset(). We ignore the return list and focus
instead on the reachable_flag assigned to the commits we care
about, because we need to interact with the tag ref and not just
the commit object.

Signed-off-by: Derrick Stolee 
---
 remote.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/remote.c b/remote.c
index 81f4f01b0..b850f2feb 100644
--- a/remote.c
+++ b/remote.c
@@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * sent to the other side.
 */
if (sent_tips.nr) {
+   const int reachable_flag = 1;
+   struct commit_list *found_commits;
+   struct commit **src_commits;
+   int nr_src_commits = 0, alloc_src_commits = 16;
+   ALLOC_ARRAY(src_commits, alloc_src_commits);
+
for_each_string_list_item(item, &src_tag) {
struct ref *ref = item->util;
+   struct commit *commit;
+
+   if (is_null_oid(&ref->new_oid))
+   continue;
+   commit = lookup_commit_reference_gently(the_repository,
+   &ref->new_oid,
+   1);
+   if (!commit)
+   /* not pushing a commit, which is not an error */
+   continue;
+
+   ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits);
+   src_commits[nr_src_commits++] = commit;
+   }
+
+   found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr,
+src_commits, nr_src_commits,
+reachable_flag);
+
+   for_each_string_list_item(item, &src_tag) {
struct ref *dst_ref;
+   struct ref *ref = item->util;
struct commit *commit;
 
if (is_null_oid(&ref->new_oid))
@@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * Is this tag, which they do not have, reachable from
 * any of the commits we are sending?
 */
-   if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip))
+   if (!(commit->object.flags & reachable_flag))
continue;
 
/* Add it in */
@@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
oidcpy(&dst_ref->new_oid, &ref->new_oid);
dst_ref->peer_ref = copy_ref(ref);
}
+
+   clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag);
+   free(src_commits);
+   free_commit_list(found_commits);
}
+
string_list_clear(&src_tag, 0);
free(sent_tips.tip);
 }
-- 
gitgitgadget


[PATCH 0/3] Make add_missing_tags() linear

2018-10-30 Thread Derrick Stolee via GitGitGadget
As reported earlier [1], the add_missing_tags() method in remote.c has
quadratic performance. Some of that cost is curbed by the
generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
help users without a commit-graph, and it can still be painful if that
cutoff is sufficiently low compared to the tags we are using for
reachability testing.

Add a new method in commit-reach.c called get_reachable_subset() which does
a many-to-many reachability test. Starting at the 'from' commits, walk until
the generation is below the smallest generation in the 'to' commits, or all
'to' commits have been discovered. This performs only one commit walk for
the entire add_missing_tags() method, giving linear performance in the worst
case.

Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
works independently of its application in add_missing_tags().

Thanks, -Stolee

[1] 
https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/

Derrick Stolee (3):
  commit-reach: implement get_reachable_subset
  test-reach: test get_reachable_subset
  remote: make add_missing_tags() linear

 commit-reach.c| 70 +++
 commit-reach.h| 10 +++
 remote.c  | 34 -
 t/helper/test-reach.c | 34 ++---
 t/t6600-test-reach.sh | 52 
 5 files changed, 195 insertions(+), 5 deletions(-)


base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-60/derrickstolee/add-missing-tags-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/60
-- 
gitgitgadget


[PATCH 1/3] commit-reach: implement get_reachable_subset

2018-10-30 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The existing reachability algorithms in commit-reach.c focus on
finding merge-bases or determining if all commits in a set X can
reach at least one commit in a set Y. However, for two commit sets
X and Y, we may also care about which commits in Y are reachable
from at least one commit in X.

Implement get_reachable_subset() which answers this question. Given
two arrays of commits, 'from' and 'to', return a commit_list with
every commit from the 'to' array that is reachable from at least
one commit in the 'from' array.

The algorithm is a simple walk starting at the 'from' commits, using
the PARENT2 flag to indicate "this commit has already been added to
the walk queue". By marking the 'to' commits with the PARENT1 flag,
we can determine when we see a commit from the 'to' array. We remove
the PARENT1 flag as we add that commit to the result list to avoid
duplicates.

The order of the resulting list is a reverse of the order that the
commits are discovered in the walk.

There are a couple of shortcuts to avoid walking more than we need:

1. We determine the minimum generation number of commits in the
   'to' array. We do not walk commits with generation number
   below this minimum.

2. We count how many distinct commits are in the 'to' array, and
   decrement this count when we discover a 'to' commit during the
   walk. If this number reaches zero, then we can terminate the
   walk.

Tests will be added using the 'test-tool reach' helper in a
subsequent commit.
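
As a rough illustration of the walk and the two shortcuts described above,
here is a self-contained toy version in C. It is a sketch under simplifying
assumptions, not the patch itself: 'struct node', the SEEN/WANTED/FOUND
flags and mark_reachable_subset() are hypothetical stand-ins for commits,
PARENT2/PARENT1/reachable_flag and get_reachable_subset(). A linear scan
replaces the priority queue, the function returns a count instead of a
commit_list, and the helper flags are not cleared afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 2
#define MAX_NODES 64

#define SEEN   (1u << 0) /* like PARENT2: already added to the walk queue */
#define WANTED (1u << 1) /* like PARENT1: member of 'to', not yet found */
#define FOUND  (1u << 2) /* like reachable_flag: reachable from 'from' */

struct node {
	uint32_t generation; /* larger than the generation of every parent */
	unsigned flags;
	int nr_parents;
	struct node *parents[MAX_PARENTS];
};

/*
 * Set FOUND on every node in 'to' that is reachable from at least one
 * node in 'from', and return how many distinct nodes were marked.
 * Assumes the walk touches at most MAX_NODES nodes.
 */
static int mark_reachable_subset(struct node **from, int nr_from,
				 struct node **to, int nr_to)
{
	struct node *queue[MAX_NODES];
	int nr_queue = 0, nr_found = 0, num_to_find = 0, i;
	uint32_t min_generation = UINT32_MAX;

	for (i = 0; i < nr_to; i++) {
		if (to[i]->generation < min_generation)
			min_generation = to[i]->generation;
		if (!(to[i]->flags & WANTED)) {
			to[i]->flags |= WANTED; /* count distinct targets */
			num_to_find++;
		}
	}

	for (i = 0; i < nr_from; i++) {
		if (!(from[i]->flags & SEEN)) {
			from[i]->flags |= SEEN;
			assert(nr_queue < MAX_NODES);
			queue[nr_queue++] = from[i];
		}
	}

	/* Shortcut 2: stop once every distinct target has been found. */
	while (num_to_find && nr_queue) {
		struct node *cur;
		int best = 0;

		/* Pop the queued node with the highest generation. */
		for (i = 1; i < nr_queue; i++)
			if (queue[i]->generation > queue[best]->generation)
				best = i;
		cur = queue[best];
		queue[best] = queue[--nr_queue];

		if (cur->flags & WANTED) {
			cur->flags &= ~WANTED; /* avoid counting it twice */
			cur->flags |= FOUND;
			nr_found++;
			num_to_find--;
		}

		for (i = 0; i < cur->nr_parents; i++) {
			struct node *p = cur->parents[i];

			/* Shortcut 1: never walk below the lowest target. */
			if (p->generation < min_generation)
				continue;
			if (p->flags & SEEN)
				continue;
			p->flags |= SEEN;
			assert(nr_queue < MAX_NODES);
			queue[nr_queue++] = p;
		}
	}
	return nr_found;
}
```

Because SEEN guarantees each node is queued at most once, the work is
bounded by the number of nodes at or above the minimum generation, no
matter how many 'to' nodes are supplied.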

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 70 ++
 commit-reach.h | 10 
 2 files changed, 80 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a2..a98532ecc 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
object_array_clear(&from_objs);
return result;
 }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+struct commit **to, int nr_to,
+int reachable_flag)
+{
+   struct commit **item;
+   struct commit *current;
+   struct commit_list *found_commits = NULL;
+   struct commit **to_last = to + nr_to;
+   struct commit **from_last = from + nr_from;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+   int num_to_find = 0;
+
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+   for (item = to; item < to_last; item++) {
+   struct commit *c = *item;
+   
+   parse_commit(c);
+   if (c->generation < min_generation)
+   min_generation = c->generation;
+
+   if (!(c->object.flags & PARENT1)) {
+   c->object.flags |= PARENT1;
+   num_to_find++;
+   }
+   }
+
+   for (item = from; item < from_last; item++) {
+   struct commit *c = *item;
+   if (!(c->object.flags & PARENT2)) {
+   c->object.flags |= PARENT2;
+   parse_commit(c);
+
+   prio_queue_put(&queue, *item);
+   }
+   }
+
+   while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
+   struct commit_list *parents;
+
+   if (current->object.flags & PARENT1) {
+   current->object.flags &= ~PARENT1;
+   current->object.flags |= reachable_flag;
+   commit_list_insert(current, &found_commits);
+   num_to_find--;
+   }
+
+   for (parents = current->parents; parents; parents = parents->next) {
+   struct commit *p = parents->item;
+
+   parse_commit(p);
+
+   if (p->generation < min_generation)
+   continue;
+
+   if (p->object.flags & PARENT2)
+   continue;
+
+   p->object.flags |= PARENT2;
+   prio_queue_put(&queue, p);
+   }
+   }
+
+   clear_commit_marks_many(nr_to, to, PARENT1);
+   clear_commit_marks_many(nr_from, from, PARENT2);
+
+   return found_commits;
+}
+
diff --git a/commit-reach.h b/commit-reach.h
index 7d313e297..43bd50a70 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from,
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
   int commit_date_cutoff);
 
+
+/*
+ * Return a list of commits containing the commits in the 'to' array
+ * that are reachable from at least one commit in the 'from' array.
+ * Also add the given 'flag' to each of the commits in the returned list.
+ */
+struct commit_list *get_reachable_subset(struct commit 

[PATCH 1/1] commit-reach: fix first-parent heuristic

2018-10-18 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The algorithm in can_all_from_reach_with_flag() performs a
depth-first search, terminated by generation number, intending to use
a heuristic that "important" commits are found in the first-parent
history. This heuristic is valuable in scenarios like fetch
negotiation.

However, there is a problem! After the search finds a target commit,
it should pop all commits off the stack and mark them as "can reach".
This logic is incorrect, so the algorithm instead walks all reachable
commits above the generation-number cutoff.

The existing algorithm is still an improvement over the previous
algorithm, as the worst-case complexity went from quadratic to linear.
The performance measurement at the time was good, but not dramatic.
By fixing this heuristic, we reduce the number of walked commits.
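
The intended stack behavior can be sketched with a small standalone C
program. This is a toy model under stated assumptions rather than the code
in commit-reach.c: 'struct node', the TARGET/VISITED/RESULT flags and
can_reach_target() are hypothetical names, there is no generation or date
cutoff, and the stack has a fixed depth. The point it shows is the pop
step: once the top of the stack is known to reach a target, it is popped
and the new top inherits RESULT, so the whole first-parent path unwinds
without visiting side branches.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 2
#define MAX_DEPTH 64

#define TARGET  (1u << 0) /* stands in for with_flag */
#define VISITED (1u << 1) /* stands in for assign_flag */
#define RESULT  (1u << 2) /* "this node can reach a target" */

struct node {
	unsigned flags;
	int nr_parents;
	struct node *parents[MAX_PARENTS]; /* parents[0] is the first parent */
};

/*
 * Depth-first search from 'start', preferring first parents. Returns 1 if
 * a TARGET node is reachable and stores the number of nodes pushed onto
 * the stack in '*num_walked'. Assumes paths are shorter than MAX_DEPTH.
 */
static int can_reach_target(struct node *start, int *num_walked)
{
	struct node *stack[MAX_DEPTH];
	int depth = 0, i;

	start->flags |= VISITED;
	stack[depth++] = start;
	*num_walked = 1;

	while (depth) {
		struct node *top = stack[depth - 1];

		if (top->flags & (TARGET | RESULT)) {
			/* Pop, and tell the new top it can reach a target. */
			depth--;
			if (depth)
				stack[depth - 1]->flags |= RESULT;
			continue;
		}

		for (i = 0; i < top->nr_parents; i++) {
			struct node *p = top->parents[i];

			if (p->flags & (TARGET | RESULT))
				top->flags |= RESULT;
			if (!(p->flags & VISITED)) {
				p->flags |= VISITED;
				assert(depth < MAX_DEPTH);
				stack[depth++] = p;
				(*num_walked)++;
				break;
			}
		}
		if (i == top->nr_parents)
			depth--; /* every parent has already been visited */
	}

	return (start->flags & (TARGET | RESULT)) != 0;
}
```

On a history where the target sits three first-parent steps below the
start and a merge brings in a side branch, only the four nodes on the
first-parent path are pushed. Without the RESULT propagation in the pop
step, the search would return to the merge and walk the side branch too.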

We can also re-run the performance tests from commit 4fbcca4e
"commit-reach: make can_all_from_reach... linear".

Performance was measured on the Linux repository using
'test-tool reach can_all_from_reach'. The input included rows seeded by
tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as
v3.[0-9]*. This mimics a (very large) fetch that says "I have all major
v3 releases and want all major v4 releases." The "large" case included
X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate
tags to the set, which does not greatly increase the number of objects
that are considered, but does increase the number of 'from' commits,
demonstrating the quadratic nature of the previous code.

Small Case:

4fbcca4e~1: 0.85 s
  4fbcca4e: 0.26 s (num_walked: 1,011,035)
  HEAD: 0.14 s (num_walked: 8,601)

Large Case:

4fbcca4e~1: 24.0  s
  4fbcca4e:  0.12 s (num_walked:  503,925)
  HEAD:  0.06 s (num_walked:  217,243)

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a22..79419be8af 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -593,8 +593,10 @@ int can_all_from_reach_with_flag(struct object_array *from,
while (stack) {
struct commit_list *parent;
 
-   if (stack->item->object.flags & with_flag) {
+   if (stack->item->object.flags & (with_flag | RESULT)) {
pop_commit(&stack);
+   if (stack)
+   stack->item->object.flags |= RESULT;
continue;
}
 
-- 
gitgitgadget


[PATCH 0/1] commit-reach: fix first-parent heuristic

2018-10-18 Thread Derrick Stolee via GitGitGadget
I originally reported this fix [1] after playing around with the trace2
series for measuring performance. Since trace2 isn't merging quickly, I
pulled the performance fix patch out and am sending it on its own. The only
difference here is that we don't have the tracing to verify the performance
fix in the test script.

See the patch message for details about the fix.

Thanks, -Stolee

[1] 
https://public-inbox.org/git/20180906151309.66712-7-dsto...@microsoft.com/

[RFC PATCH 6/6] commit-reach: fix first-parent heuristic

Derrick Stolee (1):
  commit-reach: fix first-parent heuristic

 commit-reach.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)


base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-51%2Fderrickstolee%2Ffirst-parent-heuristic-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-51/derrickstolee/first-parent-heuristic-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/51
-- 
gitgitgadget


[PATCH 3/3] commit-graph: Use commit-graph by default

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The config setting "core.commitGraph" enables using the commit-graph
file to accelerate commit walks through parsing speed and generation
numbers. The setting "gc.writeCommitGraph" enables writing the
commit-graph file on every non-trivial 'git gc' operation. Together,
they help users automatically improve their performance.

By setting these config variables to true by default, we make the
commit-graph feature an "opt-out" feature instead of "opt-in".
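
The difference between "opt-in" and "opt-out" comes down to what happens
when the key is missing from the config. The following standalone C sketch
illustrates that pattern; it assumes a getter that returns nonzero and
leaves its output untouched when the key is absent, which is how the diff
below treats repo_config_get_bool(). The names 'struct config_entry',
toy_config_get_bool(), enabled_opt_in() and enabled_opt_out() are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct config_entry {
	const char *key;
	int value;
};

/*
 * Hypothetical stand-in for a boolean config getter: returns 0 and writes
 * *dest when 'key' is present; returns 1 and leaves *dest alone otherwise.
 */
static int toy_config_get_bool(const struct config_entry *entries, size_t nr,
			       const char *key, int *dest)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!strcmp(entries[i].key, key)) {
			*dest = entries[i].value;
			return 0;
		}
	}
	return 1;
}

/* Opt-in: a missing key counts as "off". */
static int enabled_opt_in(const struct config_entry *entries, size_t nr,
			  const char *key)
{
	int value;

	if (toy_config_get_bool(entries, nr, key, &value))
		return 0;
	return value;
}

/* Opt-out: seed the default with 1 so a missing key counts as "on". */
static int enabled_opt_out(const struct config_entry *entries, size_t nr,
			   const char *key)
{
	int value = 1;

	toy_config_get_bool(entries, nr, key, &value);
	return value;
}
```

In both variants an explicit "false" still turns the feature off; only the
behavior for users who never set the key changes.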

Signed-off-by: Derrick Stolee 
---
 Documentation/config.txt | 4 ++--
 builtin/gc.c | 2 +-
 commit-graph.c   | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index f6f4c21a54..dc5ee7c145 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -923,7 +923,7 @@ the `GIT_NOTES_REF` environment variable.  See 
linkgit:git-notes[1].
 
 core.commitGraph::
If true, then git will read the commit-graph file (if it exists)
-   to parse the graph structure of commits. Defaults to false. See
+   to parse the graph structure of commits. Defaults to true. See
linkgit:git-commit-graph[1] for more information.
 
 core.useReplaceRefs::
@@ -1639,7 +1639,7 @@ gc.writeCommitGraph::
If true, then gc will rewrite the commit-graph file when
linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
'--auto' the commit-graph will be updated if housekeeping is
-   required. Default is false. See linkgit:git-commit-graph[1]
+   required. Default is true. See linkgit:git-commit-graph[1]
for details.
 
 gc.logExpiry::
diff --git a/builtin/gc.c b/builtin/gc.c
index 871a56f1c5..77e7413f94 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -41,7 +41,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
-static int gc_write_commit_graph;
+static int gc_write_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
diff --git a/commit-graph.c b/commit-graph.c
index a686758603..a459272466 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -232,15 +232,15 @@ static int prepare_commit_graph(struct repository *r)
 {
struct alternate_object_database *alt;
char *obj_dir;
-   int config_value;
+   int config_value = 1;
 
if (r->objects->commit_graph_attempted)
return !!r->objects->commit_graph;
r->objects->commit_graph_attempted = 1;
 
+   repo_config_get_bool(r, "core.commitgraph", &config_value);
if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-   (repo_config_get_bool(r, "core.commitgraph", &config_value) ||
-   !config_value))
+   !config_value)
/*
 * This repository is not configured to use commit graphs, so
 * do not load one. (But report commit_graph_attempted anyway
-- 
gitgitgadget


[PATCH 1/3] t6501: use --quiet when testing gc stderr

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The test script t6501-freshen-objects.sh has some tests that care
if 'git gc' has any output to stderr. This is intended to say that
no warnings occurred related to broken links. However, when we
have operations that output progress (like writing the commit-graph)
this causes the test to fail.

Use 'git gc --quiet' to keep these progress indicators from causing
a test failure.

Signed-off-by: Derrick Stolee 
---
 t/t6501-freshen-objects.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
index 033871ee5f..0973130f06 100755
--- a/t/t6501-freshen-objects.sh
+++ b/t/t6501-freshen-objects.sh
@@ -137,7 +137,7 @@ test_expect_success 'do not complain about existing broken 
links (commit)' '
some message
EOF
commit=$(git hash-object -t commit -w broken-commit) &&
-   git gc 2>stderr &&
+   git gc --quiet 2>stderr &&
verbose git cat-file -e $commit &&
test_must_be_empty stderr
 '
@@ -147,7 +147,7 @@ test_expect_success 'do not complain about existing broken 
links (tree)' '
100644 blob 0003foo
EOF
tree=$(git mktree --missing <broken-tree) &&
-   git gc 2>stderr &&
+   git gc --quiet 2>stderr &&
git cat-file -e $tree &&
test_must_be_empty stderr
 '
@@ -162,7 +162,7 @@ test_expect_success 'do not complain about existing broken 
links (tag)' '
this is a broken tag
EOF
tag=$(git hash-object -t tag -w broken-tag) &&
-   git gc 2>stderr &&
+   git gc --quiet 2>stderr &&
git cat-file -e $tag &&
test_must_be_empty stderr
 '
-- 
gitgitgadget



[PATCH 0/3] Use commit-graph by default

2018-10-17 Thread Derrick Stolee via GitGitGadget
The commit-graph feature is starting to stabilize. Based on what is in
master right now, we have:

Git 2.18:

 * Ability to write commit-graph (requires user interaction).
 * Commit parsing is faster when commit-graph exists.
 * Must have core.commitGraph true to use.

Git 2.19:

 * Ability to write commit-graph on GC with gc.writeCommitGraph.
 * Generation numbers written in commit-graph.
 * A few reachability algorithms make use of generation numbers.

(queued for) master:

 * The test suite passes with GIT_TEST_COMMIT_GRAPH=1.
 * 'git commit-graph write' has progress indicators.
 * The commit-graph is automatically disabled when grafts or replace-objects
   exist.

There are some other things coming that are in review (like 'git log
--graph' speedups), but it is probably time to consider enabling the
commit-graph by default. This series does that.

For timing, I'm happy to leave this queued for a merge after the Git 2.20
release. There are enough things in master to justify not enabling this by
default until that goes out and more people use it.

PATCH 3/3 is rather simple, and is the obvious thing to do to achieve
enabling these config values by default.

PATCH 1/3 is a required change to make the test suite work with this change.
This change isn't needed with GIT_TEST_COMMIT_GRAPH=1 because the
commit-graph is up-to-date for these 'git gc' calls, so no progress is
output.

PATCH 2/3 is also a necessary evil: since we already had to disable
GIT_TEST_COMMIT_GRAPH for some tests, we now also need to turn off
core.commitGraph for them.

Thanks, -Stolee

Derrick Stolee (3):
  t6501: use --quiet when testing gc stderr
  t: explicitly turn off core.commitGraph as needed
  commit-graph: Use commit-graph by default

 Documentation/config.txt| 4 ++--
 builtin/gc.c| 2 +-
 commit-graph.c  | 6 +++---
 t/t0410-partial-clone.sh| 3 ++-
 t/t5307-pack-missing-commit.sh  | 3 ++-
 t/t6011-rev-list-with-bad-commit.sh | 3 ++-
 t/t6024-recursive-merge.sh  | 3 ++-
 t/t6501-freshen-objects.sh  | 6 +++---
 8 files changed, 17 insertions(+), 13 deletions(-)


base-commit: a4b8ab5363a32f283a61ef3a962853556d136c0e
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-50%2Fderrickstolee%2Fcommit-graph-default-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-50/derrickstolee/commit-graph-default-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/50
-- 
gitgitgadget


[PATCH 2/3] t: explicitly turn off core.commitGraph as needed

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few tests that already require GIT_TEST_COMMIT_GRAPH=0
as they rely on an interaction with the commits in the object store
that is circumvented by parsing commit information from the
commit-graph instead. Before enabling core.commitGraph as true by
default, explicitly turn the setting off for these tests.

Signed-off-by: Derrick Stolee 
---
 t/t0410-partial-clone.sh| 3 ++-
 t/t5307-pack-missing-commit.sh  | 3 ++-
 t/t6011-rev-list-with-bad-commit.sh | 3 ++-
 t/t6024-recursive-merge.sh  | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index cfd0655ea1..f5277fafb1 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -193,7 +193,8 @@ test_expect_success 'rev-list stops traversal at missing 
and promised commit' '
 
git -C repo config core.repositoryformatversion 1 &&
git -C repo config extensions.partialclone "arbitrary string" &&
-   GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects --objects bar >out &&
+   GIT_TEST_COMMIT_GRAPH=0 git -c core.commitGraph=false \
+   -C repo rev-list --exclude-promisor-objects --objects bar >out &&
grep $(git -C repo rev-parse bar) out &&
! grep $FOO out
 '
diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh
index dacb440b27..dc4c19d0aa 100755
--- a/t/t5307-pack-missing-commit.sh
+++ b/t/t5307-pack-missing-commit.sh
@@ -16,7 +16,8 @@ test_expect_success setup '
obj=$(git rev-parse --verify tag3) &&
fanout=$(expr "$obj" : "\(..\)") &&
remainder=$(expr "$obj" : "..\(.*\)") &&
-   rm -f ".git/objects/$fanout/$remainder"
+   rm -f ".git/objects/$fanout/$remainder" &&
+   git config core.commitGraph false
 '
 
 test_expect_success 'check corruption' '
diff --git a/t/t6011-rev-list-with-bad-commit.sh 
b/t/t6011-rev-list-with-bad-commit.sh
index 545b461e51..da6949743d 100755
--- a/t/t6011-rev-list-with-bad-commit.sh
+++ b/t/t6011-rev-list-with-bad-commit.sh
@@ -42,7 +42,8 @@ test_expect_success 'corrupt second commit object' \
'
 
 test_expect_success 'rev-list should fail' '
-   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --all > 
/dev/null
+   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \
+   git -c core.commitGraph=false rev-list --all > /dev/null
 '
 
 test_expect_success 'git repack _MUST_ fail' \
diff --git a/t/t6024-recursive-merge.sh b/t/t6024-recursive-merge.sh
index 27c7de90ce..fccdf96f13 100755
--- a/t/t6024-recursive-merge.sh
+++ b/t/t6024-recursive-merge.sh
@@ -61,7 +61,8 @@ GIT_AUTHOR_DATE="2006-12-12 23:00:08" git commit -m F
 '
 
 test_expect_success 'combined merge conflicts' '
-   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+   test_must_fail env GIT_TEST_COMMIT_GRAPH=0 \
+   git -c core.commitGraph=false merge -m final G
 '
 
 cat > expect << EOF
-- 
gitgitgadget



[PATCH 0/1] Run GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX during CI

2018-10-17 Thread Derrick Stolee via GitGitGadget
Our CI scripts include a step to run the test suite with certain optional
variables enabled. Now that all branches should build and have tests succeed
with GIT_TEST_COMMIT_GRAPH and GIT_TEST_MULTI_PACK_INDEX enabled, add those
variables to that stage.

Note: the GIT_TEST_MULTI_PACK_INDEX variable has not merged all the way
down, so will be ignored if this series is merged faster than that one.
However, it is safe to make these changes orthogonally as all (known) test
breaks with GIT_TEST_MULTI_PACK_INDEX=1 are fixed in the topic that
introduces the variable.

I also created a build definition on Azure Pipelines that runs the test
suite with different subsets of the test variables, split by the following
types:

1) Index options
2) Commit-graph
3) Multi-pack-index

These builds are found at [1]. There are benefits to testing the variables
all together but also separately. I didn't want to create new stages in the
CI scripts to avoid consuming extra resources.

This series is based on js/vsts-ci to avoid conflicts with that series, but
is not necessarily a hard dependence.

Thanks, -Stolee

[1] https://git.visualstudio.com/git/_build?definitionId=4
Build definition that tests Git with different arrangements of
GIT_TEST_* variables.

Derrick Stolee (1):
  ci: add optional test variables

 ci/run-build-and-tests.sh | 2 ++
 1 file changed, 2 insertions(+)


base-commit: d82963f34cf6921ed29d1fc2d96b16acf9005159
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-49%2Fderrickstolee%2Fci-vars-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-49/derrickstolee/ci-vars-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/49
-- 
gitgitgadget


[PATCH 1/1] ci: add optional test variables

2018-10-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The commit-graph and multi-pack-index features introduce optional
data structures that are not required for normal Git operations.
It is important to run the normal test suite without them enabled,
but it is helpful to also run the test suite using them.

Our continuous integration scripts include a second test stage that
runs with optional GIT_TEST_* variables enabled. Add the following
two variables to that stage:

  GIT_TEST_COMMIT_GRAPH
  GIT_TEST_MULTI_PACK_INDEX

This will slow down the operation, as we build a commit-graph file
after every 'git commit' operation and build a multi-pack-index
during every 'git repack' operation. However, it is important that
future changes are compatible with these features.

Signed-off-by: Derrick Stolee 
---
 ci/run-build-and-tests.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index e28ac2fb9a..db342bb6a8 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -15,6 +15,8 @@ then
export GIT_TEST_FULL_IN_PACK_ARRAY=true
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5
+   export GIT_TEST_COMMIT_GRAPH=1
+   export GIT_TEST_MULTI_PACK_INDEX=1
make --quiet test
 fi
 
-- 
gitgitgadget


[PATCH v4 4/7] revision.c: begin refactoring --topo-order logic

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here: for example, when we
  perform a range query like 'git rev-list COMPARE..HEAD', we can
  stop limit_list() once all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee 
---
 revision.c | 42 ++
 revision.h |  4 
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct 
rev_info *revs, struct s
if (revs->diffopt.objfind)
revs->simplify_history = 0;
 
-   if (revs->topo_order)
+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;
 
if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));
+
+   limit_list(revs);
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(&commit->object.oid));
+   }
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
return 0;
-   if (revs->limited)
+   if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-   if (revs->topo_order)
-   sort_in_topological_order(&revs->commits, revs->sort_order);
+   if (revs->topo_order)
+   sort_in_topological_order(&revs->commits, 
revs->sort_order);
+   } else if (revs->topo_order)
+   init_topo_walk(revs);
if 

[PATCH v4 6/7] revision.c: generation-based topo-order algorithm

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
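
Step 2 can be sketched as follows. This is only an illustration on a
toy history stored as arrays of parent indices; toy_topo_sort(),
MAX_COMMITS and MAX_PARENTS are made-up names, not Git's, and a plain
LIFO stack stands in for the priority queue.

```c
#include <assert.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/*
 * Topologically sort `nr` toy commits (children before parents).
 * parents[i][j] is the index of the j-th parent of commit i, or -1.
 * Writes the order to out[] and returns the number of commits written.
 */
static int toy_topo_sort(const int parents[][MAX_PARENTS], int nr, int *out)
{
	int indegree[MAX_COMMITS];
	int stack[MAX_COMMITS];
	int top = 0, written = 0;
	int i, j;

	assert(nr <= MAX_COMMITS);

	/* Store in-degree plus one, so 1 means "no unemitted children". */
	for (i = 0; i < nr; i++)
		indegree[i] = 1;
	for (i = 0; i < nr; i++)
		for (j = 0; j < MAX_PARENTS; j++)
			if (parents[i][j] >= 0)
				indegree[parents[i][j]]++;

	/* Seed the stack with the tips: commits whose value is still 1. */
	for (i = 0; i < nr; i++)
		if (indegree[i] == 1)
			stack[top++] = i;

	while (top) {
		int c = stack[--top];

		out[written++] = c;
		for (j = 0; j < MAX_PARENTS; j++) {
			int p = parents[c][j];

			if (p < 0)
				continue;
			/* A parent is ready once all its children are emitted. */
			if (--indegree[p] == 1)
				stack[top++] = p;
		}
	}
	return written;
}
```

Each commit is pushed at most once, so a stack of MAX_COMMITS entries
is enough for this sketch.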

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
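
As an illustration of this definition, here is a minimal sketch on a
toy commit array. The struct toy_commit and compute_generation() names
are hypothetical; the recursion is chosen for brevity and would not
suit very deep histories.

```c
#include <assert.h>

#define MAX_PARENTS 2

/* Hypothetical toy commit: parents are array indices, -1 if absent. */
struct toy_commit {
	int parents[MAX_PARENTS];
	unsigned int generation; /* 0 means "not computed yet" */
};

/*
 * Compute the generation number of commits[idx]: one for a root
 * commit, otherwise one more than the maximum generation number
 * among its parents. Results are cached in the array.
 */
static unsigned int compute_generation(struct toy_commit *commits, int idx)
{
	unsigned int max_parent = 0;
	int i;

	if (commits[idx].generation)
		return commits[idx].generation;

	for (i = 0; i < MAX_PARENTS; i++) {
		int p = commits[idx].parents[i];
		unsigned int g;

		if (p < 0)
			continue;
		g = compute_generation(commits, p);
		if (g > max_parent)
			max_parent = g;
	}
	commits[idx].generation = max_parent + 1;
	return commits[idx].generation;
}
```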

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xFFFFFFFF and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following statement, which is weaker than what we usually expect
from generation numbers:

If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
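
A small sketch of how this statement can be used as a negative filter.
TOY_GENERATION_INFINITY and may_reach() are illustrative names, not
Git's. Note that equal generation numbers no longer rule out
reachability, since two distinct commits outside the commit-graph both
carry the infinite value.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the "not in the commit-graph" generation number. */
#define TOY_GENERATION_INFINITY UINT32_MAX

/*
 * Returns 0 when generation numbers prove that A cannot reach B,
 * and 1 when reachability is still possible and a walk is needed.
 */
static int may_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a >= gen_b;
}
```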

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.
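
The interplay of the INDEGREE and TOPO walks can be sketched on a toy
history as below. This is a simplified illustration, not the code in
this patch: it assumes a single starting commit, finite generation
numbers, and no UNINTERESTING commits, so the EXPLORE walk is omitted
entirely. All names (struct toy_walk, indegrees_to_depth(), and so on)
are made up, and a linear scan stands in for the priority queue.

```c
#include <assert.h>
#include <string.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

enum { TOY_NEW = 0, TOY_QUEUED, TOY_WALKED };

struct toy_walk {
	const int (*parents)[MAX_PARENTS]; /* -1 marks an unused slot */
	const unsigned int *generation;    /* precomputed, all finite */
	int nr;
	int indegree[MAX_COMMITS];         /* 0 = not yet initialized */
	int state[MAX_COMMITS];            /* INDEGREE walk bookkeeping */
	int topo_stack[MAX_COMMITS];       /* LIFO, as in plain topo-order */
	int topo_nr;
	unsigned int min_generation;
};

/* Queued commit with the highest generation number, or -1 if none. */
static int indegree_peek(const struct toy_walk *w)
{
	int i, best = -1;

	for (i = 0; i < w->nr; i++)
		if (w->state[i] == TOY_QUEUED &&
		    (best < 0 || w->generation[i] > w->generation[best]))
			best = i;
	return best;
}

/* INDEGREE walk: advance until every queued commit is below the cutoff. */
static void indegrees_to_depth(struct toy_walk *w, unsigned int cutoff)
{
	int c, j;

	while ((c = indegree_peek(w)) >= 0 && w->generation[c] >= cutoff) {
		w->state[c] = TOY_WALKED;
		for (j = 0; j < MAX_PARENTS; j++) {
			int p = w->parents[c][j];

			if (p < 0)
				continue;
			/* 0 means "first visit": start at two, not one. */
			w->indegree[p] = w->indegree[p] ? w->indegree[p] + 1 : 2;
			if (w->state[p] == TOY_NEW)
				w->state[p] = TOY_QUEUED;
		}
	}
}

static void toy_walk_init(struct toy_walk *w, const int (*parents)[MAX_PARENTS],
			  const unsigned int *generation, int nr, int tip)
{
	assert(nr <= MAX_COMMITS);
	memset(w, 0, sizeof(*w));
	w->parents = parents;
	w->generation = generation;
	w->nr = nr;
	w->min_generation = generation[tip];
	w->indegree[tip] = 1;
	w->state[tip] = TOY_QUEUED;
	indegrees_to_depth(w, w->min_generation);
	w->topo_stack[w->topo_nr++] = tip;
}

/* TOPO walk: return the next commit in topo-order, or -1 when done. */
static int toy_walk_next(struct toy_walk *w)
{
	int c, j;

	if (!w->topo_nr)
		return -1;
	c = w->topo_stack[--w->topo_nr];
	for (j = 0; j < MAX_PARENTS; j++) {
		int p = w->parents[c][j];

		if (p < 0)
			continue;
		/* Make sure all children of p have been counted first. */
		if (w->generation[p] < w->min_generation) {
			w->min_generation = w->generation[p];
			indegrees_to_depth(w, w->min_generation);
		}
		if (--w->indegree[p] == 1)
			w->topo_stack[w->topo_nr++] = p;
	}
	return c;
}
```

On a linear history of five commits, the first call to toy_walk_next()
returns the tip while the in-degree of the root is still untouched,
which is the point of advancing each walk only as far as needed.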

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisons, such as in
'git rev-list 

[PATCH v4 3/7] test-reach: add rev-list tests

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 
commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 
commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order 
commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   >expect &&
+   run_three_modes git rev-list --topo-order --ancestry-path 
commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   commit-3-8 commit-2-8 commit-1-8 \
+   commit-3-7 commit-2-7 commit-1-7 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
gitgitgadget



[PATCH v4 1/7] prio-queue: add 'peek' operation

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh. Further, add
a test that checks the behavior when the compare function is NULL
(i.e. the queue becomes a stack).
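
To illustrate the intended contract (peek returns what the next get
would return, and a NULL compare function means LIFO order), here is a
self-contained toy queue. It is a sketch only: struct toy_queue and its
helpers are hypothetical, and it uses a linear scan over a fixed-size
array rather than the heap in prio-queue.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 32

typedef int (*toy_compare_fn)(const void *a, const void *b);

struct toy_queue {
	toy_compare_fn compare; /* NULL means "behave like a stack" */
	size_t nr;
	void *items[TOY_QUEUE_MAX];
};

static void toy_queue_put(struct toy_queue *q, void *thing)
{
	assert(q->nr < TOY_QUEUE_MAX);
	q->items[q->nr++] = thing;
}

/* Index of the element get/peek would return; requires q->nr > 0. */
static size_t toy_queue_best(const struct toy_queue *q)
{
	size_t i, best = 0;

	if (!q->compare)
		return q->nr - 1; /* most recently added element */
	for (i = 1; i < q->nr; i++)
		if (q->compare(q->items[i], q->items[best]) < 0)
			best = i;
	return best;
}

/* Look at the next element without removing it; NULL if empty. */
static void *toy_queue_peek(const struct toy_queue *q)
{
	return q->nr ? q->items[toy_queue_best(q)] : NULL;
}

/* Remove and return the next element; NULL if empty. */
static void *toy_queue_get(struct toy_queue *q)
{
	size_t best, i;
	void *result;

	if (!q->nr)
		return NULL;
	best = toy_queue_best(q);
	result = q->items[best];
	for (i = best + 1; i < q->nr; i++)
		q->items[i - 1] = q->items[i];
	q->nr--;
	return result;
}

/* Orders ints ascending, so the smallest value is dequeued first. */
static int toy_intcmp(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}
```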

Signed-off-by: Derrick Stolee 
---
 prio-queue.c   |  9 +
 prio-queue.h   |  6 ++
 t/helper/test-prio-queue.c | 26 ++
 t/t0009-prio-queue.sh  | 14 ++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
}
return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+   if (!queue->nr)
+   return NULL;
+   if (!queue->compare)
+   return queue->array[queue->nr - 1].data;
+   return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..5bc9c46ea5 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv)
struct prio_queue pq = { intcmp };
 
while (*++argv) {
-   if (!strcmp(*argv, "get"))
-   show(prio_queue_get(&pq));
-   else if (!strcmp(*argv, "dump")) {
-   int *v;
-   while ((v = prio_queue_get(&pq)))
-  show(v);
-   }
-   else {
+   if (!strcmp(*argv, "get")) {
+   void *peek = prio_queue_peek(&pq);
+   void *get = prio_queue_get(&pq);
+   if (peek != get)
+   BUG("peek and get results do not match");
+   show(get);
+   } else if (!strcmp(*argv, "dump")) {
+   void *peek;
+   void *get;
+   while ((peek = prio_queue_peek(&pq))) {
+   get = prio_queue_get(&pq);
+   if (peek != get)
+   BUG("peek and get results do not 
match");
+   show(get);
+   }
+   } else if (!strcmp(*argv, "stack")) {
+   pq.compare = NULL;
+   } else {
int *v = malloc(sizeof(*v));
*v = atoi(*argv);
prio_queue_put(&pq, v);
diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh
index e56dfce668..3941ad2528 100755
--- a/t/t0009-prio-queue.sh
+++ b/t/t0009-prio-queue.sh
@@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' '
test_cmp expect actual
 '
 
+cat >expect <<'EOF'
+3
+2
+6
+4
+5
+1
+8
+EOF
+test_expect_success 'stack order' '
+   test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual &&
+   test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget



[PATCH v4 7/7] t6012: make rev-list tests more interesting

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

As we are working to rewrite some of the revision-walk machinery,
there could easily be some interesting interactions between the
options that force topological constraints (--topo-order,
--date-order, and --author-date-order) along with specifying a
path.

Add extra tests to t6012-rev-list-simplify.sh to add coverage of
these interactions. To ensure interesting things occur, alter the
repo data shape to have different orders depending on topo-, date-,
or author-date-order.

When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering
the new logic for topo-order walks using generation numbers. The
extra tests can be added independently.

Signed-off-by: Derrick Stolee 
---
 t/t6012-rev-list-simplify.sh | 45 
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
index b5a1190ffe..a10f0df02b 100755
--- a/t/t6012-rev-list-simplify.sh
+++ b/t/t6012-rev-list-simplify.sh
@@ -12,6 +12,22 @@ unnote () {
git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 
|g"
 }
 
+#
+# Create a test repo with interesting commit graph:
+#
+# A--B--G--H--I--K--L
+#  \  \   / /
+#   \  \ / /
+#C--E---F J
+#\_/
+#
+# The commits are laid out from left-to-right starting with
+# the root commit A and terminating at the tip commit L.
+#
+# There are a few places where we adjust the commit date or
+# author date to make the --topo-order, --date-order, and
+# --author-date-order flags produce different output.
+
 test_expect_success setup '
echo "Hi there" >file &&
echo "initial" >lost &&
@@ -21,10 +37,18 @@ test_expect_success setup '
 
git branch other-branch &&
 
+   git symbolic-ref HEAD refs/heads/unrelated &&
+   git rm -f "*" &&
+   echo "Unrelated branch" >side &&
+   git add side &&
+   test_tick && git commit -m "Side root" &&
+   note J &&
+   git checkout master &&
+
echo "Hello" >file &&
echo "second" >lost &&
git add file lost &&
-   test_tick && git commit -m "Modified file and lost" &&
+   test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m 
"Modified file and lost" &&
note B &&
 
git checkout other-branch &&
@@ -63,13 +87,6 @@ test_expect_success setup '
test_tick && git commit -a -m "Final change" &&
note I &&
 
-   git symbolic-ref HEAD refs/heads/unrelated &&
-   git rm -f "*" &&
-   echo "Unrelated branch" >side &&
-   git add side &&
-   test_tick && git commit -m "Side root" &&
-   note J &&
-
git checkout master &&
test_tick && git merge --allow-unrelated-histories -m "Coolest" 
unrelated &&
note K &&
@@ -103,14 +120,24 @@ check_result () {
check_outcome success "$@"
 }
 
-check_result 'L K J I H G F E D C B A' --full-history
+check_result 'L K J I H F E D C G B A' --full-history --topo-order
+check_result 'L K I H G F E D C B J A' --full-history
+check_result 'L K I H G F E D C B J A' --full-history --date-order
+check_result 'L K I H G F E D B C J A' --full-history --author-date-order
 check_result 'K I H E C B A' --full-history -- file
 check_result 'K I H E C B A' --full-history --topo-order -- file
 check_result 'K I H E C B A' --full-history --date-order -- file
+check_result 'K I H E B C A' --full-history --author-date-order -- file
 check_result 'I E C B A' --simplify-merges -- file
+check_result 'I E C B A' --simplify-merges --topo-order -- file
+check_result 'I E C B A' --simplify-merges --date-order -- file
+check_result 'I E B C A' --simplify-merges --author-date-order -- file
 check_result 'I B A' -- file
 check_result 'I B A' --topo-order -- file
+check_result 'I B A' --date-order -- file
+check_result 'I B A' --author-date-order -- file
 check_result 'H' --first-parent -- another-file
+check_result 'H' --first-parent --topo-order -- another-file
 
 check_result 'E C B A' --full-history E -- lost
 test_expect_success 'full history simplification without parent' '
-- 
gitgitgadget


[PATCH v4 5/7] commit/revisions: bookkeeping before refactoring

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.
   Also rename the method to the slightly more generic name
   process_parents() to make clear that this method does more than
   add to a list (and no list is required anymore).
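
The idea in item 3 (do the flag bookkeeping always, but touch the list
only when the caller supplies one) can be sketched as below. This is a
toy illustration with hypothetical names (toy_process_parents(),
struct toy_list, TOY_SEEN); it does not reproduce the special cases
handled in revision.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SEEN 1u
#define MAX_PARENTS 2
#define TOY_LIST_MAX 8

struct toy_commit {
	unsigned int flags;
	struct toy_commit *parents[MAX_PARENTS]; /* NULL if absent */
};

struct toy_list {
	struct toy_commit *items[TOY_LIST_MAX];
	size_t nr;
};

/*
 * Mark each unseen parent of `commit` as SEEN. If `list` is non-NULL,
 * also append those parents to it; a NULL list is simply skipped.
 * Returns the number of parents that were newly marked.
 */
static int toy_process_parents(struct toy_commit *commit,
			       struct toy_list *list)
{
	int newly_seen = 0;
	size_t i;

	for (i = 0; i < MAX_PARENTS; i++) {
		struct toy_commit *p = commit->parents[i];

		if (!p || (p->flags & TOY_SEEN))
			continue;
		p->flags |= TOY_SEEN;
		newly_seen++;
		if (list) {
			assert(list->nr < TOY_LIST_MAX);
			list->items[list->nr++] = p;
		}
	}
	return newly_seen;
}
```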

Helped-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 commit.c   | 11 +--
 commit.h   |  8 
 revision.c | 18 ++
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..861a485e93 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
+implement_shared_commit_slab(author_date_slab, timestamp_t);
 
-static void record_author_date(struct author_date_slab *author_date,
-  struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit)
 {
const char *buffer = get_commit_buffer(commit, NULL);
struct ident_split ident;
@@ -684,8 +683,8 @@ fail_exit:
unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
- void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+  void *cb_data)
 {
const struct commit *a = a_, *b = b_;
struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..977d397356 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
+/* record author-date for each commit object */
+define_shared_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..36458265a0 100644
--- a/revision.c
+++ b/revision.c
@@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit 
*p, struct commit_li
*cache = new_entry;
 }
 
-static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
-   struct commit_list **list, struct commit_list **cache_ptr)
+static int process_parents(struct rev_info *revs, struct commit *commit,
+  struct commit_list **list, struct commit_list 
**cache_ptr)
 {
struct commit_list *parent = commit->parents;
unsigned left_flag;
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
if (p->object.flags & SEEN)
continue;
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
return 0;
}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
p->object.flags |= left_flag;
if (!(p->object.flags & SEEN)) {
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
if (revs->first_parent_only)
break;
@@ -1091,7 +1093,7 @@ static 

[PATCH v4 0/7] Use generation numbers for --topo-order

2018-10-16 Thread Derrick Stolee via GitGitGadget
This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revisions.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Changes in V3: I added a new patch that updates the tab-alignment for flags
in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the
recommended changes to run_three_modes and test_three_modes from Szeder and
Junio. Thanks!

Changes in V4: I'm sending a V4 to respond to the feedback so far. Still
looking forward to more on the really big commit!

 * Removed the whitespace changes to the flags in revision.c that caused
   merge pain.

 * The prio-queue peek function is now covered by tests when in "stack"
   mode.

 * The "add_parents_to_list()" function is now renamed to
   "process_parents()".

 * Added a new commit that expands test coverage with alternate orders and
   file history (use GIT_TEST_COMMIT_GRAPH to have
   t6012-rev-list-simplify.sh cover the new logic). These tests found a
   problem with author dates (I forgot to record them during the explore
   walk).

 * Commit message edits.

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph
in entire test suite.

Cc: avarab@gmail.com
Cc: szeder@gmail.com

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: generation-based topo-order algorithm
  t6012: make rev-list tests more interesting

 commit.c |  11 +-
 commit.h |   8 ++
 object.h

[PATCH v4 2/7] test-reach: add run_three_modes method

2018-10-16 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a simple translation on a more general
run_three_modes method that executes the given command and compares
the actual output with the expected output.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   "$@" <input >actual &&
test_cmp expect actual &&
cp commit-graph-full .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   "$@" <input >actual &&
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   "$@" <input >actual &&
test_cmp expect actual
 }
 
+test_three_modes () {
+   run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
cat >input <<-\EOF &&
A:commit-5-7
-- 
gitgitgadget



[PATCH v2 2/3] midx: close multi-pack-index on repack

2018-10-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When repacking, we may remove pack-files. This invalidates the
multi-pack-index (if it exists). Previously, we removed the
multi-pack-index file before removing any pack-file. In some cases,
the repack command may load the multi-pack-index into memory. This
may lead to later in-memory references to the non-existent pack-
files.
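
The pattern used by the diff below (close the cached structure, clear
the pointer, then delete the file) can be sketched in isolation. This is
a minimal illustration under stated assumptions: struct store,
struct index_handle, load_index(), close_index() and clear_index_file()
are hypothetical stand-ins, not Git's repository or multi-pack-index
API, and a plain FILE handle takes the place of the mmap'd data.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical in-memory view of an on-disk index file. */
struct index_handle {
	FILE *fp;
};

/* Hypothetical object store that caches one loaded index. */
struct store {
	const char *index_path;
	struct index_handle *index;	/* NULL when nothing is loaded */
};

/* Load the index into memory; returns 0 on success, -1 on failure. */
static int load_index(struct store *s)
{
	struct index_handle *h;

	assert(s && !s->index);
	h = malloc(sizeof(*h));
	if (!h)
		return -1;
	h->fp = fopen(s->index_path, "rb");
	if (!h->fp) {
		free(h);
		return -1;
	}
	s->index = h;
	return 0;
}

/* Release an in-memory index; a NULL handle is a no-op. */
static void close_index(struct index_handle *h)
{
	if (!h)
		return;
	fclose(h->fp);
	free(h);
}

/*
 * Drop the cached handle first, then delete the file, so that no later
 * caller can follow a stale pointer to data that is gone from disk.
 * Returns the result of remove(): 0 on success, nonzero on failure.
 */
static int clear_index_file(struct store *s)
{
	assert(s);
	if (s->index) {
		close_index(s->index);
		s->index = NULL;
	}
	return remove(s->index_path);
}
```

Keeping both steps inside one function means a caller cannot delete the
file while forgetting about the cached copy.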

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c |  3 +--
 midx.c   | 15 ---
 midx.h   |  4 +++-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..44965cbaa3 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -431,8 +431,7 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
char *fname, *fname_old;
 
if (!midx_cleared) {
-   /* if we move a packfile, it will invalidated 
the midx */
-   clear_midx_file(get_object_directory());
+   clear_midx_file(the_repository);
midx_cleared = 1;
}
 
diff --git a/midx.c b/midx.c
index bf1f511862..22247a30ab 100644
--- a/midx.c
+++ b/midx.c
@@ -176,9 +176,13 @@ cleanup_fail:
return NULL;
 }
 
-static void close_midx(struct multi_pack_index *m)
+void close_midx(struct multi_pack_index *m)
 {
uint32_t i;
+
+   if (!m)
+   return;
+
munmap((unsigned char *)m->data, m->data_len);
close(m->fd);
m->fd = -1;
@@ -914,9 +918,14 @@ cleanup:
return 0;
 }
 
-void clear_midx_file(const char *object_dir)
+void clear_midx_file(struct repository *r)
 {
-   char *midx = get_midx_filename(object_dir);
+   char *midx = get_midx_filename(r->objects->objectdir);
+
+   if (r->objects && r->objects->multi_pack_index) {
+   close_midx(r->objects->multi_pack_index);
+   r->objects->multi_pack_index = NULL;
+   }
 
if (remove_path(midx)) {
UNLEAK(midx);
diff --git a/midx.h b/midx.h
index ce80b91c68..0f68bccdd5 100644
--- a/midx.h
+++ b/midx.h
@@ -42,7 +42,9 @@ int midx_contains_pack(struct multi_pack_index *m, const char 
*idx_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, 
int local);
 
 int write_midx_file(const char *object_dir);
-void clear_midx_file(const char *object_dir);
+void clear_midx_file(struct repository *r);
 int verify_midx_file(const char *object_dir);
 
+void close_midx(struct multi_pack_index *m);
+
 #endif
-- 
gitgitgadget



[PATCH v2 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

2018-10-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index feature is tested in isolation by
t5319-multi-pack-index.sh, but there are many more interesting
scenarios in the test suite surrounding pack-file data shapes
and interactions. Since the multi-pack-index is an optional
data structure, it does not make sense to include it by default
in those tests.

Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable
that enables core.multiPackIndex and writes a multi-pack-index
after each 'git repack' command. This adds extra test coverage
when needed.
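
As a rough sketch of how such a test override combines with a config
value: the feature is treated as enabled when either the environment
variable or the config says so, and the environment is read only once
per process. The names parse_bool(), resolve_enabled(),
feature_enabled() and the variable SKETCH_TEST_FEATURE are hypothetical,
and the boolean parser is a simplification that accepts fewer spellings
than Git's own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simplified boolean parser (an assumption of this sketch).
 * NULL means "unset" and yields the default.
 */
static int parse_bool(const char *value, int def)
{
	if (!value)
		return def;
	return !strcmp(value, "1") || !strcmp(value, "true") ||
	       !strcmp(value, "yes") || !strcmp(value, "on");
}

/* The feature is on if either the test override or the config is on. */
static int resolve_enabled(const char *env_value, int config_value)
{
	return parse_bool(env_value, 0) || config_value;
}

/*
 * Cached variant: the environment is consulted once and the answer is
 * kept in a static, in the same spirit as the static env_value in the
 * patch. SKETCH_TEST_FEATURE is a made-up variable name.
 */
static int feature_enabled(int config_value)
{
	static int env_value = -1;	/* -1 means "not read yet" */

	assert(config_value == 0 || config_value == 1);
	if (env_value < 0)
		env_value = parse_bool(getenv("SKETCH_TEST_FEATURE"), 0);
	return env_value || config_value;
}
```

Note that in this arrangement a false override does not switch off a
feature that the config enables; it only stops forcing it on.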

There are a few spots in the test suite that need to react to this
change:

* t5319-multi-pack-index.sh: there is a test that checks that
  'git repack' deletes the multi-pack-index. Disable the environment
  variable to ensure this still happens.

* t5310-pack-bitmaps.sh: One test moves a pack-file from the object
  directory to an alternate. This breaks the multi-pack-index, so
  delete the multi-pack-index at this point, if it exists.

* t9300-fast-import.sh: One test verifies the number of files in
  the .git/objects/pack directory is exactly 8. Exclude the
  multi-pack-index from this count so it is still 8 in all cases.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c| 4 
 midx.c  | 9 +++--
 midx.h  | 2 ++
 t/README| 4 
 t/t5310-pack-bitmaps.sh | 1 +
 t/t5319-multi-pack-index.sh | 2 +-
 t/t9300-fast-import.sh  | 2 +-
 7 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 44965cbaa3..26dcccdafc 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -553,6 +553,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
if (!no_update_server_info)
update_server_info(0);
remove_temporary_files();
+
+   if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
+   write_midx_file(get_object_directory());
+
string_list_clear(&names, 0);
string_list_clear(&rollback, 0);
string_list_clear(&existing_packs, 0);
diff --git a/midx.c b/midx.c
index 22247a30ab..02b2211e31 100644
--- a/midx.c
+++ b/midx.c
@@ -335,9 +335,14 @@ int prepare_multi_pack_index_one(struct repository *r, 
const char *object_dir, i
struct multi_pack_index *m;
struct multi_pack_index *m_search;
int config_value;
+   static int env_value = -1;
 
-   if (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
-   !config_value)
+   if (env_value < 0)
+   env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0);
+
+   if (!env_value &&
+   (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
+   !config_value))
return 0;
 
for (m_search = r->objects->multi_pack_index; m_search; m_search = 
m_search->next)
diff --git a/midx.h b/midx.h
index 0f68bccdd5..f2bb7e681c 100644
--- a/midx.h
+++ b/midx.h
@@ -3,6 +3,8 @@
 
 #include "repository.h"
 
+#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+
 struct multi_pack_index {
struct multi_pack_index *next;
 
diff --git a/t/README b/t/README
index 5e48a043ce..9bfdd3004c 100644
--- a/t/README
+++ b/t/README
@@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the 
commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack-
+index to be written after every 'git repack' command, and overrides the
+'core.multiPackIndex' setting to true.
+
 Naming Tests
 
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 1be3459c5b..82d7f7f6a5 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects 
--honor-pack-keep (local bitmapped pa
 
 test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' 
'
mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+   rm -f .git/objects/pack/multi-pack-index &&
test_when_finished "mv alt.git/objects/pack/$packbitmap.* 
.git/objects/pack/" &&
echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
git index-pack 3b.pack &&
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bd8e841b81..70926b5bc0 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -271,7 +271,7 @@ test_expect_success 'git-fsck incorrect offset' '
 
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
-   git repack -adf &&
+   GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf &&
test_path_is_missing $objdir/pack/multi-pack-index
 '
 
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 40fe7e4976..59a13b6a77 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1558,7 +1558,7 @@ test_expect_success 'O: 

[PATCH v2 1/3] midx: fix broken free() in close_midx()

2018-10-12 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When closing a multi-pack-index, we intend to close each pack-file
and free the struct packed_git that represents it. However, this
line was previously freeing the array of pointers, not the
pointer itself. This leads to a double-free issue.
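
To make the ownership rule concrete, here is a minimal standalone
sketch: each entry is freed inside the loop, and the array that holds
the entries is freed exactly once afterwards. struct pack, alloc_packs()
and close_all() are hypothetical names used only for this illustration;
they are not Git's struct packed_git or close_midx().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an owned, heap-allocated pack entry. */
struct pack {
	int fd;
};

/* Allocate an array of nr entries; returns NULL if allocation fails. */
static struct pack **alloc_packs(size_t nr)
{
	struct pack **packs = calloc(nr, sizeof(*packs));
	size_t i;

	if (!packs)
		return NULL;
	for (i = 0; i < nr; i++) {
		packs[i] = malloc(sizeof(**packs));
		if (!packs[i]) {
			while (i--)
				free(packs[i]);
			free(packs);
			return NULL;
		}
		packs[i]->fd = -1;
	}
	return packs;
}

/*
 * Free every entry, then the array, and report how many entries were
 * freed. Calling free(packs) inside the loop instead would release the
 * array on the first iteration, leak the entries, and free the array
 * again on the next iteration.
 */
static size_t close_all(struct pack ***packs_p, size_t nr)
{
	struct pack **packs;
	size_t i, freed = 0;

	assert(packs_p);
	packs = *packs_p;
	if (!packs)
		return 0;

	for (i = 0; i < nr; i++) {
		if (packs[i]) {
			free(packs[i]);	/* the entry, not the array */
			freed++;
		}
	}
	free(packs);
	*packs_p = NULL;	/* leave no dangling pointer behind */
	return freed;
}
```

Clearing the caller's pointer at the end makes a second call harmless,
which is the same effect FREE_AND_NULL() has in the patched function.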

Signed-off-by: Derrick Stolee 
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 713d6f9dde..bf1f511862 100644
--- a/midx.c
+++ b/midx.c
@@ -186,7 +186,7 @@ static void close_midx(struct multi_pack_index *m)
for (i = 0; i < m->num_packs; i++) {
if (m->packs[i]) {
close_pack(m->packs[i]);
-   free(m->packs);
+   free(m->packs[i]);
}
}
FREE_AND_NULL(m->packs);
-- 
gitgitgadget



[PATCH v2 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable

2018-10-12 Thread Derrick Stolee via GitGitGadget
To increase coverage of the multi-pack-index feature, add a
GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_*
variables.

After creating the environment variable and running the test suite with it
enabled, I found a few bugs in the multi-pack-index implementation. These
are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite
with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX
and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure
they work well with the rest of the ongoing work. Eventually, we can add
these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
  midx: fix broken free() in close_midx()
  midx: close multi-pack-index on repack
  multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

 builtin/repack.c|  7 +--
 midx.c  | 26 --
 midx.h  |  6 +-
 t/README|  4 
 t/t5310-pack-bitmaps.sh |  1 +
 t/t5319-multi-pack-index.sh |  2 +-
 t/t9300-fast-import.sh  |  2 +-
 7 files changed, 37 insertions(+), 11 deletions(-)


base-commit: 5a0cc8aca797dbd7d2be3b67458ff880ed45cddf
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-27/derrickstolee/midx-test/upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/27

Range-diff vs v1:

 1:  9fcbbe336d = 1:  8bd672fe26 midx: fix broken free() in close_midx()
 2:  725ebadc92 ! 2:  2d8f26679d midx: close multi-pack-index on repack
 @@ -15,16 +15,15 @@
  --- a/builtin/repack.c
  +++ b/builtin/repack.c
  @@
 +  char *fname, *fname_old;
   
if (!midx_cleared) {
 -  /* if we move a packfile, it will invalidated 
the midx */
 -+ if (the_repository->objects) {
 -+ 
close_midx(the_repository->objects->multi_pack_index);
 -+ 
the_repository->objects->multi_pack_index = NULL;
 -+ }
 -  clear_midx_file(get_object_directory());
 +- /* if we move a packfile, it will invalidated 
the midx */
 +- clear_midx_file(get_object_directory());
 ++ clear_midx_file(the_repository);
midx_cleared = 1;
}
 + 
  
  diff --git a/midx.c b/midx.c
  --- a/midx.c
 @@ -44,13 +43,34 @@
munmap((unsigned char *)m->data, m->data_len);
close(m->fd);
m->fd = -1;
 +@@
 +  return 0;
 + }
 + 
 +-void clear_midx_file(const char *object_dir)
 ++void clear_midx_file(struct repository *r)
 + {
 +- char *midx = get_midx_filename(object_dir);
 ++ char *midx = get_midx_filename(r->objects->objectdir);
 ++
 ++ if (r->objects && r->objects->multi_pack_index) {
 ++ close_midx(r->objects->multi_pack_index);
 ++ r->objects->multi_pack_index = NULL;
 ++ }
 + 
 +  if (remove_path(midx)) {
 +  UNLEAK(midx);
  
  diff --git a/midx.h b/midx.h
  --- a/midx.h
  +++ b/midx.h
  @@
 + int prepare_multi_pack_index_one(struct repository *r, const char 
*object_dir, int local);
 + 
   int write_midx_file(const char *object_dir);
 - void clear_midx_file(const char *object_dir);
 +-void clear_midx_file(const char *object_dir);
 ++void clear_midx_file(struct repository *r);
 + int verify_midx_file(const char *object_dir);
   
  +void close_midx(struct multi_pack_index *m);
  +
 3:  04e3e91082 = 3:  57c64e814c multi-pack-index: define 
GIT_TEST_MULTI_PACK_INDEX

-- 
gitgitgadget


[PATCH 1/3] midx: fix broken free() in close_midx()

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When closing a multi-pack-index, we intend to close each pack-file
and free the struct packed_git that represents it. However, this
line was previously freeing the array of pointers, not the
pointer itself. This leads to a double-free issue.

Signed-off-by: Derrick Stolee 
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index f3e8dbc108..999717b96f 100644
--- a/midx.c
+++ b/midx.c
@@ -190,7 +190,7 @@ static void close_midx(struct multi_pack_index *m)
for (i = 0; i < m->num_packs; i++) {
if (m->packs[i]) {
close_pack(m->packs[i]);
-   free(m->packs);
+   free(m->packs[i]);
}
}
FREE_AND_NULL(m->packs);
-- 
gitgitgadget



[PATCH 0/3] Add GIT_TEST_MULTI_PACK_INDEX environment variable

2018-10-08 Thread Derrick Stolee via GitGitGadget
To increase coverage of the multi-pack-index feature, add a
GIT_TEST_MULTI_PACK_INDEX environment variable similar to other GIT_TEST_*
variables.

After creating the environment variable and running the test suite with it
enabled, I found a few bugs in the multi-pack-index implementation. These
are handled by the first two patches.

I have set up a CI build on Azure Pipelines [1] that runs the test suite
with a few optional features enabled, including GIT_TEST_MULTI_PACK_INDEX
and GIT_TEST_COMMIT_GRAPH. I'll use this to watch the features and ensure
they work well with the rest of the ongoing work. Eventually, we can add
these variables to the Travis CI scripts.

[1] https://git.visualstudio.com/git/_build?definitionId=4

Derrick Stolee (3):
  midx: fix broken free() in close_midx()
  midx: close multi-pack-index on repack
  multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

 builtin/repack.c|  8 
 midx.c  | 17 +
 midx.h  |  4 
 t/README|  4 
 t/t5310-pack-bitmaps.sh |  1 +
 t/t5319-multi-pack-index.sh |  2 +-
 t/t9300-fast-import.sh  |  2 +-
 7 files changed, 32 insertions(+), 6 deletions(-)


base-commit: f84b9b09d40408cf91bbc500d9f190a7866c3e0f
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-27%2Fderrickstolee%2Fmidx-test%2Fupstream-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-27/derrickstolee/midx-test/upstream-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/27
-- 
gitgitgadget


[PATCH 3/3] multi-pack-index: define GIT_TEST_MULTI_PACK_INDEX

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index feature is tested in isolation by
t5319-multi-pack-index.sh, but there are many more interesting
scenarios in the test suite surrounding pack-file data shapes
and interactions. Since the multi-pack-index is an optional
data structure, it does not make sense to include it by default
in those tests.

Instead, add a new GIT_TEST_MULTI_PACK_INDEX environment variable
that enables core.multiPackIndex and writes a multi-pack-index
after each 'git repack' command. This adds extra test coverage
when needed.

There are a few spots in the test suite that need to react to this
change:

* t5319-multi-pack-index.sh: there is a test that checks that
  'git repack' deletes the multi-pack-index. Disable the environment
  variable to ensure this still happens.

* t5310-pack-bitmaps.sh: One test moves a pack-file from the object
  directory to an alternate. This breaks the multi-pack-index, so
  delete the multi-pack-index at this point, if it exists.

* t9300-fast-import.sh: One test verifies the number of files in
  the .git/objects/pack directory is exactly 8. Exclude the
  multi-pack-index from this count so it is still 8 in all cases.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c| 4 
 midx.c  | 9 +++--
 midx.h  | 2 ++
 t/README| 4 
 t/t5310-pack-bitmaps.sh | 1 +
 t/t5319-multi-pack-index.sh | 2 +-
 t/t9300-fast-import.sh  | 2 +-
 7 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 7925bb976e..418442bfe2 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -558,6 +558,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
if (!no_update_server_info)
update_server_info(0);
remove_temporary_files();
+
+   if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
+   write_midx_file(get_object_directory());
+
string_list_clear(&names, 0);
string_list_clear(&rollback, 0);
string_list_clear(&existing_packs, 0);
diff --git a/midx.c b/midx.c
index fe8532a9d1..aeafb58fa3 100644
--- a/midx.c
+++ b/midx.c
@@ -338,9 +338,14 @@ int prepare_multi_pack_index_one(struct repository *r, 
const char *object_dir, i
struct multi_pack_index *m;
struct multi_pack_index *m_search;
int config_value;
+   static int env_value = -1;
 
-   if (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
-   !config_value)
+   if (env_value < 0)
+   env_value = git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0);
+
+   if (!env_value &&
+   (repo_config_get_bool(r, "core.multipackindex", &config_value) ||
+   !config_value))
return 0;
 
for (m_search = r->objects->multi_pack_index; m_search; m_search = 
m_search->next)
diff --git a/midx.h b/midx.h
index af6b5cb58f..bec8f73d28 100644
--- a/midx.h
+++ b/midx.h
@@ -3,6 +3,8 @@
 
 #include "repository.h"
 
+#define GIT_TEST_MULTI_PACK_INDEX "GIT_TEST_MULTI_PACK_INDEX"
+
 struct multi_pack_index {
struct multi_pack_index *next;
 
diff --git a/t/README b/t/README
index 3ea6c85460..9d0277c338 100644
--- a/t/README
+++ b/t/README
@@ -327,6 +327,10 @@ GIT_TEST_COMMIT_GRAPH=, when true, forces the 
commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_MULTI_PACK_INDEX=, when true, forces the multi-pack-
+index to be written after every 'git repack' command, and overrides the
+'core.multiPackIndex' setting to true.
+
 Naming Tests
 
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 1be3459c5b..82d7f7f6a5 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -191,6 +191,7 @@ test_expect_success 'pack-objects respects 
--honor-pack-keep (local bitmapped pa
 
 test_expect_success 'pack-objects respects --local (non-local bitmapped pack)' 
'
mv .git/objects/pack/$packbitmap.* alt.git/objects/pack/ &&
+   rm -f .git/objects/pack/multi-pack-index &&
test_when_finished "mv alt.git/objects/pack/$packbitmap.* 
.git/objects/pack/" &&
echo HEAD | git pack-objects --local --stdout --revs >3b.pack &&
git index-pack 3b.pack &&
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 6f56b38674..4024ff9a39 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -152,7 +152,7 @@ compare_results_with_midx "twelve packs"
 
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
-   git repack -adf &&
+   GIT_TEST_MULTI_PACK_INDEX=0 git repack -adf &&
test_path_is_missing $objdir/pack/multi-pack-index
 '
 
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 40fe7e4976..59a13b6a77 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -1558,7 +1558,7 @@ test_expect_success 'O: blank lines 

[PATCH 2/3] midx: close multi-pack-index on repack

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When repacking, we may remove pack-files. This invalidates the
multi-pack-index (if it exists). Previously, we removed the
multi-pack-index file before removing any pack-file. In some cases,
the repack command may load the multi-pack-index into memory. This
may lead to later in-memory references to the non-existent pack-
files.

Signed-off-by: Derrick Stolee 
---
 builtin/repack.c | 4 
 midx.c   | 6 +-
 midx.h   | 2 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index c6a7943d5c..7925bb976e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -432,6 +432,10 @@ int cmd_repack(int argc, const char **argv, const char 
*prefix)
 
if (!midx_cleared) {
/* if we move a packfile, it will invalidated 
the midx */
+   if (the_repository->objects) {
+   
close_midx(the_repository->objects->multi_pack_index);
+   
the_repository->objects->multi_pack_index = NULL;
+   }
clear_midx_file(get_object_directory());
midx_cleared = 1;
}
diff --git a/midx.c b/midx.c
index 999717b96f..fe8532a9d1 100644
--- a/midx.c
+++ b/midx.c
@@ -180,9 +180,13 @@ cleanup_fail:
return NULL;
 }
 
-static void close_midx(struct multi_pack_index *m)
+void close_midx(struct multi_pack_index *m)
 {
uint32_t i;
+
+   if (!m)
+   return;
+
munmap((unsigned char *)m->data, m->data_len);
close(m->fd);
m->fd = -1;
diff --git a/midx.h b/midx.h
index a210f1af2a..af6b5cb58f 100644
--- a/midx.h
+++ b/midx.h
@@ -44,4 +44,6 @@ int prepare_multi_pack_index_one(struct repository *r, const 
char *object_dir, i
 int write_midx_file(const char *object_dir);
 void clear_midx_file(const char *object_dir);
 
+void close_midx(struct multi_pack_index *m);
+
 #endif
-- 
gitgitgadget



[PATCH v4 1/1] contrib: add coverage-diff script

2018-10-08 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has
an X.c.gcov file containing the coverage counts for every line, and "#"
at every uncovered line.

There have been a few bugs in recent patches that would have been caught
if the test suite covered those blocks (including a few of mine). I want
to work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks covering
very unlikely (or near-impossible) situations may not warrant coverage.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by the
test suite. The output uses 'git blame -s' format so you can find the commits
responsible and view the line numbers for quick access to the context, but
trims leading tabs in the file contents to reduce output width.
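
The core of the script is mapping a unified diff to the post-image line
numbers of its added lines, which are then intersected with the
uncovered lines from the *.gcov files. The script does this with a small
Perl helper; the following is only a sketch of the same bookkeeping in
C, for illustration. hunk_postimage_start() and added_lines() are
hypothetical names, and the diff is assumed to be already split into
lines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * If 'line' is a hunk header such as "@@ -10,3 +12,4 @@", return the
 * starting line number in the post-image (12 here); otherwise -1.
 */
static long hunk_postimage_start(const char *line)
{
	const char *plus;
	char *end;
	long n;

	assert(line);
	if (strncmp(line, "@@ -", 4) != 0)
		return -1;
	plus = strstr(line, " +");
	if (!plus)
		return -1;
	n = strtol(plus + 2, &end, 10);
	if (end == plus + 2)
		return -1;	/* no digits after the '+' */
	return n;
}

/*
 * Store the post-image line numbers of added lines in out[] (at most
 * out_cap of them) and return how many were stored.
 */
static size_t added_lines(const char *const *diff, size_t nr,
			  long *out, size_t out_cap)
{
	long line_num = -1;	/* -1 until the first hunk header */
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		long start = hunk_postimage_start(diff[i]);

		if (start >= 0) {
			line_num = start;
			continue;
		}
		if (line_num < 0)
			continue;	/* "diff --git", "---", "+++" */
		if (diff[i][0] == '-')
			continue;	/* deleted: not in the post-image */
		if (diff[i][0] == '+' && n < out_cap)
			out[n++] = line_num;
		line_num++;	/* context or added line */
	}
	return n;
}
```

Like the Perl helper, this counts every non-deleted line inside a hunk,
so a "\ No newline at end of file" marker would also advance the
counter; that is harmless here because such a marker only appears at
the end of a file.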

Signed-off-by: Derrick Stolee 
---
 contrib/coverage-diff.sh | 108 +++
 1 file changed, 108 insertions(+)
 create mode 100755 contrib/coverage-diff.sh

diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
new file mode 100755
index 00..4ec419f900
--- /dev/null
+++ b/contrib/coverage-diff.sh
@@ -0,0 +1,108 @@
+#!/bin/sh
+
+# Usage: Run 'contrib/coverage-diff.sh <version1> <version2>' from source-root
+# after running
+#
+# make coverage-test
+# make coverage-report
+#
+# while checked out at <version2>. This script combines the *.gcov files
+# generated by the 'make' commands above with 'git diff <version1> <version2>'
+# to report new lines that are not covered by the test suite.
+
+V1=$1
+V2=$2
+
+diff_lines () {
+   perl -e '
+   my $line_num;
+   while (<>) {
+   # Hunk header?  Grab the beginning in postimage.
+   if (/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/) {
+   $line_num = $1;
+   next;
+   }
+
+   # Have we seen a hunk?  Ignore "diff --git" etc.
+   next unless defined $line_num;
+
+   # Deleted line? Ignore.
+   if (/^-/) {
+   next;
+   }
+
+   # Show only the line number of added lines.
+   if (/^\+/) {
+   print "$line_num\n";
+   }
+   # Either common context or added line appear in
+   # the postimage.  Count it.
+   $line_num++;
+   }
+   '
+}
+
+files=$(git diff --name-only "$V1" "$V2" -- \*.c)
+
+# create empty file
+>coverage-data.txt
+
+for file in $files
+do
+   git diff "$V1" "$V2" -- "$file" |
+   diff_lines |
+   sort >new_lines.txt
+
+   if ! test -s new_lines.txt
+   then
+   continue
+   fi
+
+   hash_file=$(echo $file | sed "s/\//\#/")
+
+   if ! test -s "$hash_file.gcov"
+   then
+   continue
+   fi
+
+   sed -ne '/#:/{
+   s/#://
+   s/:.*//
+   s/ //g
+   p
+   }' "$hash_file.gcov" |
+   sort >uncovered_lines.txt
+
+   comm -12 uncovered_lines.txt new_lines.txt |
+   sed -e 's/$/\)/' |
+   sed -e 's/^/ /' >uncovered_new_lines.txt
+
+   grep -q '[^[:space:]]' <uncovered_new_lines.txt &&
+   echo $file >>coverage-data.txt &&
+   git blame -s "$V2" -- "$file" |
+   sed 's/\t//g' |
+   grep -f uncovered_new_lines.txt >>coverage-data.txt &&
+   echo >>coverage-data.txt
+
+   rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt
+done
+
+cat coverage-data.txt
+
+echo "Commits introducing uncovered code:"
+
+commit_list=$(cat coverage-data.txt |
+   grep -E '^[0-9a-f]{7,} ' |
+   awk '{print $1;}' |
+   sort |
+   uniq)
+
+(
+   for commit in $commit_list
+   do
+   git log --no-decorate --pretty=format:'%an  %h: %s' -1 
$commit
+   echo
+   done
+) | sort
+
+rm coverage-data.txt
-- 
gitgitgadget


[PATCH v4 0/1] contrib: Add script to show uncovered "new" lines

2018-10-08 Thread Derrick Stolee via GitGitGadget
We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an
X.c.gcov file containing the coverage counts for every line, and "#" at
every uncovered line.

There have been a few bugs in recent patches that would have been caught if
the test suite covered those blocks (including a few of mine). I want to
work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks in error
cases do not need to be covered.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by
the test suite. For example, I ran this against the 'next' branch (e82ca0)
versus 'master' (f84b9b) and got the following output:

builtin/commit.c
76f2f5c1e3 builtin/commit.c 1657) write_commit_graph_reachable(get_object_directory(), 0, 0);

builtin/fsck.c
66ec0390e7 builtin/fsck.c 862) midx_argv[2] = "--object-dir";
66ec0390e7 builtin/fsck.c 863) midx_argv[3] = alt->path;
66ec0390e7 builtin/fsck.c 864) if (run_command(&midx_verify))
66ec0390e7 builtin/fsck.c 865) errors_found |= ERROR_COMMIT_GRAPH;

fsck.c
fb8952077d  214) die_errno("Could not read '%s'", path);

midx.c
56ee7ff156  949) return 0;
cc6af73c02  990) midx_report(_("failed to load pack-index for packfile %s"),
cc6af73c02  991) e.p->pack_name);
cc6af73c02  992) break;

Commits introducing uncovered code:
Derrick Stolee  56ee7ff15: multi-pack-index: add 'verify' verb
Derrick Stolee  66ec0390e: fsck: verify multi-pack-index
Derrick Stolee  cc6af73c0: multi-pack-index: verify object offsets
Junio C Hamano  76f2f5c1e: Merge branch 'ab/commit-graph-progress' into next
René Scharfe  fb8952077: fsck: use strbuf_getline() to read skiplist file

Thanks, -Stolee

CHANGES IN V3: I took Junio's perl script verbatim, which speeds up the
performance greatly. Some of the other sed commands needed some massaging,
but also added extra cleanup. Thanks for the help!

CHANGES IN V4: I reduced the blame output using -s which decreases the
width. I include a summary of the commit authors at the end to help people
see the lines they wrote. This version is also copied into a build
definition in the public Git project on Azure Pipelines [1]. I'll use this
build definition to generate the coverage report after each "What's Cooking"
email.

[1] https://git.visualstudio.com/git/_build?definitionId=5

Derrick Stolee (1):
  contrib: add coverage-diff script

 contrib/coverage-diff.sh | 108 +++
 1 file changed, 108 insertions(+)
 create mode 100755 contrib/coverage-diff.sh


base-commit: 1d4361b0f344188ab5eec6dcea01f61a3a3a1670
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-40%2Fderrickstolee%2Fcoverage-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-40/derrickstolee/coverage-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/40

Range-diff vs v3:

 1:  21214cc321 ! 1:  6daf310a43 contrib: add coverage-diff script
 @@ -26,10 +26,10 @@
  contrib/coverage-diff.sh base topic
  
  to see the lines added between 'base' and 'topic' that are not 
covered by the
 -test suite. The output uses 'git blame -c' format so you can find the 
commits
 -responsible and view the line numbers for quick access to the context.
 +test suite. The output uses 'git blame -s' format so you can find the 
commits
 +responsible and view the line numbers for quick access to the 
context, but
 +trims leading tabs in the file contents to reduce output width.
  
 -Helped-by: Junio C Hamano 
  Signed-off-by: Derrick Stolee 
  
  diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
 @@ -81,13 +81,16 @@
  + '
  +}
  +
 -+files=$(git diff --name-only $V1 $V2 -- *.c)
 ++files=$(git diff --name-only "$V1" "$V2" -- \*.c)
 ++
 ++# create empty file
 ++>coverage-data.txt
  +
  +for file in $files
  +do
 -+ git diff $V1 $V2 -- $file \
 -+ | diff_lines \
 -+ | sort >new_lines.txt
 ++ git diff "$V1" "$V2" -- "$file" |
 ++ diff_lines |
 ++ sort >new_lines.txt
  +
  + if ! test -s new_lines.txt
  + then
 @@ -95,24 +98,50 @@
  + fi
  +
  + hash_file=$(echo $file | sed "s/\//\#/")
 ++
 ++ if ! test -s "$hash_file.gcov"
 ++ then
 ++ continue
 ++ fi
 ++
  + sed -ne '/#:/{
  + s/#://
  + s/:.*//
  

[PATCH v2 1/3] commit-graph: clean up leaked memory during write

2018-10-03 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The write_commit_graph() method in commit-graph.c leaks some lists
and strings during execution. In addition, a list of strings is
leaked in write_commit_graph_reachable(). Clean these up so our
memory checking is cleaner.

Further, if we use a list of pack-files to find the commits, we
can leak the packed_git structs after scanning them for commits.

Running the following commands demonstrates the leak before and
the fix after:

* valgrind --leak-check=full ./git commit-graph write --reachable
* valgrind --leak-check=full ./git commit-graph write --stdin-packs
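
The ownership rule behind this cleanup can be sketched outside of Git. This is a minimal illustration, assuming a toy duplicating list; 'struct str_list', 'str_list_append()' and 'str_list_clear()' are hypothetical names and are not Git's string_list API. The point is only that a list which copies its strings must be cleared by its owner, or every copy shows up in the valgrind report.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a duplicating string list; not Git's string_list. */
struct str_list {
	char **items;
	size_t nr, alloc;
};

#define STR_LIST_INIT { NULL, 0, 0 }

/* Append a private copy of 's'; returns 0 on success, -1 on allocation failure. */
int str_list_append(struct str_list *list, const char *s)
{
	char *copy;

	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? 2 * list->alloc : 4;
		char **grown = realloc(list->items, new_alloc * sizeof(*grown));

		if (!grown)
			return -1;
		list->items = grown;
		list->alloc = new_alloc;
	}
	copy = malloc(strlen(s) + 1);
	if (!copy)
		return -1;
	strcpy(copy, s);
	list->items[list->nr++] = copy;
	return 0;
}

/* Free every copied string and the array itself, leaving an empty list. */
void str_list_clear(struct str_list *list)
{
	size_t i;

	for (i = 0; i < list->nr; i++)
		free(list->items[i]);
	free(list->items);
	list->items = NULL;
	list->nr = list->alloc = 0;
}
```

A caller would declare 'struct str_list list = STR_LIST_INIT;', append while iterating, and call str_list_clear() before returning, which mirrors the init/clear pairing this patch adds.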

Signed-off-by: Martin Ågren 
Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 2a24eb8b5a..ceca6026b0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -693,11 +693,12 @@ static int add_ref_to_list(const char *refname,
 void write_commit_graph_reachable(const char *obj_dir, int append,
  int report_progress)
 {
-   struct string_list list;
+   struct string_list list = STRING_LIST_INIT_DUP;
 
-   string_list_init(, 1);
for_each_ref(add_ref_to_list, );
write_commit_graph(obj_dir, NULL, , append, report_progress);
+
+   string_list_clear(, 0);
 }
 
 void write_commit_graph(const char *obj_dir,
@@ -764,6 +765,7 @@ void write_commit_graph(const char *obj_dir,
die(_("error opening index for %s"), 
packname.buf);
for_each_object_in_pack(p, add_packed_commits, , 
0);
close_pack(p);
+   free(p);
}
stop_progress();
strbuf_release();
@@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir,
compute_generation_numbers(, report_progress);
 
graph_name = get_commit_graph_filename(obj_dir);
-   if (safe_create_leading_directories(graph_name))
+   if (safe_create_leading_directories(graph_name)) {
+   UNLEAK(graph_name);
die_errno(_("unable to create leading directories of %s"),
  graph_name);
+   }
 
hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
@@ -893,9 +897,9 @@ void write_commit_graph(const char *obj_dir,
finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
commit_lock_file();
 
+   free(graph_name);
+   free(commits.list);
free(oids.list);
-   oids.alloc = 0;
-   oids.nr = 0;
 }
 
 #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
-- 
gitgitgadget



[PATCH v2 3/3] commit-graph: reduce initial oid allocation

2018-10-03 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

While writing a commit-graph file, we store the full list of
commits in a flat list. We use this list for sorting and ensuring
we are closed under reachability.

The initial allocation assumed that (at most) one in four objects
is a commit. This is a dramatic over-count for many repos,
especially large ones. Since we grow the list dynamically, reduce
this count by a factor of eight. We still set it to a minimum of
1024 before allocating.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index ceca6026b0..e773703e1d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -720,7 +720,7 @@ void write_commit_graph(const char *obj_dir,
struct progress *progress = NULL;
 
oids.nr = 0;
-   oids.alloc = approximate_object_count() / 4;
+   oids.alloc = approximate_object_count() / 32;
oids.progress = NULL;
oids.progress_done = 0;
 
-- 
gitgitgadget


[PATCH v2 0/3] Clean up leaks in commit-graph.c

2018-10-03 Thread Derrick Stolee via GitGitGadget
While looking at the commit-graph code, I noticed some memory leaks. These
can be found by running

valgrind --leak-check=full ./git commit-graph write --reachable

The impact of these leaks is small, as we never call
write_commit_graph_reachable() in a loop, but it is best to be diligent here.

While looking at memory consumption within write_commit_graph(), I noticed
that we initialize our oid list with "object count / 4", which seems to be a
huge over-count. I reduce this by a factor of eight.
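
To make the arithmetic concrete, here is a minimal sketch of the sizing rule. It assumes a hypothetical helper 'initial_oid_alloc()' that is not part of Git; the divisor values (4 before, 32 after) and the floor of 1024 come from the patch in this series.

```c
#include <stddef.h>

/* Hypothetical helper, not a Git function: initial capacity for the oid list. */
size_t initial_oid_alloc(size_t approx_object_count, size_t divisor)
{
	size_t alloc = approx_object_count / divisor;

	/* The patch keeps a minimum of 1024 entries before allocating. */
	if (alloc < 1024)
		alloc = 1024;
	return alloc;
}
```

For a repo with roughly 8,000,000 objects, a divisor of 4 reserves 2,000,000 slots up front, while a divisor of 32 reserves 250,000. The list can still grow if the estimate turns out to be too low.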

I built off of ab/commit-graph-progress, because my patch involves lines
close to those changes.

V2 includes feedback from V1 along with Martin's additional patches.

Thanks, -Stolee

Derrick Stolee (2):
  commit-graph: clean up leaked memory during write
  commit-graph: reduce initial oid allocation

Martin Ågren (1):
  builtin/commit-graph.c: UNLEAK variables

 builtin/commit-graph.c | 11 ++-
 commit-graph.c | 16 ++--
 2 files changed, 16 insertions(+), 11 deletions(-)


base-commit: 6b89a34c89fc763292f06012318b852b74825619
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-42/derrickstolee/commit-graph-leak-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/42

Range-diff vs v1:

 1:  6906c25415 ! 1:  ba65680b3d commit-graph: clean up leaked memory during 
write
 @@ -7,17 +7,29 @@
  leaked in write_commit_graph_reachable(). Clean these up so our
  memory checking is cleaner.
  
 -Running 'valgrind --leak-check=full git commit-graph write
 ---reachable' demonstrates these leaks and how they are fixed after
 -this change.
 +Further, if we use a list of pack-files to find the commits, we
 +can leak the packed_git structs after scanning them for commits.
  
 +Running the following commands demonstrates the leak before and
 +the fix after:
 +
 +* valgrind --leak-check=full ./git commit-graph write --reachable
 +* valgrind --leak-check=full ./git commit-graph write --stdin-packs
 +
 +Signed-off-by: Martin Ågren 
  Signed-off-by: Derrick Stolee 
  
  diff --git a/commit-graph.c b/commit-graph.c
  --- a/commit-graph.c
  +++ b/commit-graph.c
  @@
 -  string_list_init(, 1);
 + void write_commit_graph_reachable(const char *obj_dir, int append,
 +int report_progress)
 + {
 +- struct string_list list;
 ++ struct string_list list = STRING_LIST_INIT_DUP;
 + 
 +- string_list_init(, 1);
for_each_ref(add_ref_to_list, );
write_commit_graph(obj_dir, NULL, , append, report_progress);
  +
 @@ -25,6 +37,14 @@
   }
   
   void write_commit_graph(const char *obj_dir,
 +@@
 +  die(_("error opening index for %s"), 
packname.buf);
 +  for_each_object_in_pack(p, add_packed_commits, , 
0);
 +  close_pack(p);
 ++ free(p);
 +  }
 +  stop_progress();
 +  strbuf_release();
  @@
compute_generation_numbers(, report_progress);
   
 @@ -45,5 +65,8 @@
  + free(graph_name);
  + free(commits.list);
free(oids.list);
 -  oids.alloc = 0;
 -  oids.nr = 0;
 +- oids.alloc = 0;
 +- oids.nr = 0;
 + }
 + 
 + #define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
 -:  -- > 2:  13032d8475 builtin/commit-graph.c: UNLEAK variables
 2:  e29a0eaf03 = 3:  1002fd34fc commit-graph: reduce initial oid allocation

-- 
gitgitgadget


[PATCH 2/2] commit-graph: reduce initial oid allocation

2018-10-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

While writing a commit-graph file, we store the full list of
commits in a flat list. We use this list for sorting and ensuring
we are closed under reachability.

The initial allocation assumed that (at most) one in four objects
is a commit. This is a dramatic over-count for many repos,
especially large ones. Since we grow the list dynamically, reduce
this count by a factor of eight. We still set it to a minimum of
1024 before allocating.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7226bd6b58..a24cceb55f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -721,7 +721,7 @@ void write_commit_graph(const char *obj_dir,
struct progress *progress = NULL;
 
oids.nr = 0;
-   oids.alloc = approximate_object_count() / 4;
+   oids.alloc = approximate_object_count() / 32;
oids.progress = NULL;
oids.progress_done = 0;
 
-- 
gitgitgadget


[PATCH 1/2] commit-graph: clean up leaked memory during write

2018-10-02 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The write_commit_graph() method in commit-graph.c leaks some lists
and strings during execution. In addition, a list of strings is
leaked in write_commit_graph_reachable(). Clean these up so our
memory checking is cleaner.

Running 'valgrind --leak-check=full git commit-graph write
--reachable' demonstrates these leaks and how they are fixed after
this change.

Signed-off-by: Derrick Stolee 
---
 commit-graph.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 2a24eb8b5a..7226bd6b58 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -698,6 +698,8 @@ void write_commit_graph_reachable(const char *obj_dir, int 
append,
string_list_init(, 1);
for_each_ref(add_ref_to_list, );
write_commit_graph(obj_dir, NULL, , append, report_progress);
+
+   string_list_clear(, 0);
 }
 
 void write_commit_graph(const char *obj_dir,
@@ -846,9 +848,11 @@ void write_commit_graph(const char *obj_dir,
compute_generation_numbers(, report_progress);
 
graph_name = get_commit_graph_filename(obj_dir);
-   if (safe_create_leading_directories(graph_name))
+   if (safe_create_leading_directories(graph_name)) {
+   UNLEAK(graph_name);
die_errno(_("unable to create leading directories of %s"),
  graph_name);
+   }
 
hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
@@ -893,6 +897,8 @@ void write_commit_graph(const char *obj_dir,
finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
commit_lock_file();
 
+   free(graph_name);
+   free(commits.list);
free(oids.list);
oids.alloc = 0;
oids.nr = 0;
-- 
gitgitgadget



[PATCH 0/2] Clean up leaks in commit-graph.c

2018-10-02 Thread Derrick Stolee via GitGitGadget
While looking at the commit-graph code, I noticed some memory leaks. These
can be found by running

valgrind --leak-check=full ./git commit-graph write --reachable

The impact of these leaks is small, as we never call
write_commit_graph_reachable() in a loop, but it is best to be diligent here.

While looking at memory consumption within write_commit_graph(), I noticed
that we initialize our oid list with "object count / 4", which seems to be a
huge over-count. I reduce this by a factor of eight.

I built off of ab/commit-graph-progress, because my patch involves lines
close to those changes.

Thanks, -Stolee

Derrick Stolee (2):
  commit-graph: clean up leaked memory during write
  commit-graph: reduce initial oid allocation

 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)


base-commit: 6b89a34c89fc763292f06012318b852b74825619
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-42%2Fderrickstolee%2Fcommit-graph-leak-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-42/derrickstolee/commit-graph-leak-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/42
-- 
gitgitgadget


[PATCH 1/1] read-cache: update index format default to v4

2018-09-24 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The index v4 format has been available since 2012 with 9d22778
"read-cache.c: write prefix-compressed names in the index". Since
the format has been stable for so long, almost all versions of Git
in use today understand version 4, removing one barrier to upgrade
-- that someone may want to downgrade and needs a working repo.

Despite being stable for a long time, this index version was never
adopted as the default. This prefix-compressed version of the format
can get significant space savings on repos with large working
directories (which naturally tend to have deep nesting). This version
is set as the default for some external tools, such as VFS for Git.
Because of this external use, the format has had a lot of "testing in
production" and also is subject to continuous integration in these
environments.

Previously, to test version 4 indexes, we needed to run the test
suite with GIT_TEST_INDEX_VERSION=4 (or TEST_GIT_INDEX_VERSION=4).

One potential, but short-term, downside is that we lose coverage of
the version 3 indexes. To compensate, we may want to cover that
version by running the test suite with GIT_TEST_INDEX_VERSION=3.

Signed-off-by: Derrick Stolee 
---
 read-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/read-cache.c b/read-cache.c
index 372588260e..af6c8f2a67 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1484,7 +1484,7 @@ struct cache_entry *refresh_cache_entry(struct 
cache_entry *ce,
  * Index File I/O
  */
 
-#define INDEX_FORMAT_DEFAULT 3
+#define INDEX_FORMAT_DEFAULT 4
 
 static unsigned int get_index_format_default(void)
 {
-- 
gitgitgadget


[PATCH 0/1] read-cache: update index format default to v4

2018-09-24 Thread Derrick Stolee via GitGitGadget
After discussing this with several people off-list, I thought I would open
the question up to the list:

Should we update the default index version to v4?

The .git/index file stores the list of every path tracked by Git in the
working directory, including merge information, staged paths, and
information about the file system contents (based on modified time). The
only major update in v4 is that the paths are prefix-compressed. This
compression works best in repos with a lot of paths, especially deep paths.
For this reason, we set the index to v4 in VFS for Git.
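
As a rough illustration of why deep paths compress well, the sketch below computes, for two consecutive sorted paths, how many bytes to drop from the end of the previous path and which suffix to append. It is a simplification under my own assumptions: 'prefix_compress()' is a hypothetical helper, and the actual on-disk encoding of index v4 is not shown.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not Git's code: given the previous path and the
 * current path, report how many bytes to strip from the end of 'prev'
 * and return the suffix of 'cur' that must be appended afterwards.
 */
const char *prefix_compress(const char *prev, const char *cur, size_t *strip)
{
	size_t common = 0;

	/* Count the leading bytes shared by both paths. */
	while (prev[common] && prev[common] == cur[common])
		common++;
	*strip = strlen(prev) - common;
	return cur + common;
}
```

With prev = "src/lib/alpha.c" and cur = "src/lib/beta.c", the helper reports 7 bytes to strip and the suffix "beta.c", so the shared "src/lib/" is not stored again. The deeper the directory nesting, the longer the shared prefix tends to be.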

Among VFS for Git contributors, we were talking about how the v4 format is
not covered by the test suite by default. We are working to increase the
number of CI builds that set extra GIT_TEST_* variables that we need.
However, I thought it worth having a discussion of whether this is a good
thing to recommend for all users of Git.

Personally, I'm not an expert here, but I am happy to start the
conversation.

Thanks, -Stolee

Derrick Stolee (1):
  read-cache: update index format default to v4

 read-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-41%2Fderrickstolee%2Findex-v4-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-41/derrickstolee/index-v4-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/41
-- 
gitgitgadget


[PATCH v4 0/1] Properly peel tags in can_all_from_reach_with_flags()

2018-09-24 Thread Derrick Stolee via GitGitGadget
As Peff reported [1], the refactored can_all_from_reach_with_flags() method
does not properly peel tags. Since the helper method can_all_from_reach()
and code in t/helper/test-reach.c all peel tags before getting to this
method, it is not super-simple to create a test that demonstrates this.

I modified t/helper/test-reach.c to allow calling
can_all_from_reach_with_flags() directly, and added a test in
t6600-test-reach.sh that demonstrates the segfault without the fix.

For V2, I compared the loop that inspects the 'from' commits in commit
ba3ca1edce "commit-reach: move can_all_from_reach_with_flags" to the version
here and got the following diff:

3c3
< if (from_one->flags & assign_flag)
---
> if (!from_one || from_one->flags & assign_flag)
5c5,7
< from_one = deref_tag(the_repository, from_one, "a from 
object", 0);
---
>
> from_one = deref_tag(the_repository, from_one,
>  "a from object", 0);
14a17,22
>
> list[nr_commits] = (struct commit *)from_one;
> if (parse_commit(list[nr_commits]) ||
> list[nr_commits]->generation < min_generation)
> return 0; /* is this a leak? */
> nr_commits++;

This diff includes the early termination we had before 'deref_tag' and the
comment for why we can ignore non-commit objects.

[1] 
https://public-inbox.org/git/0bf9103c-9377-506b-7ad7-e5273d8e9...@gmail.com/T/#u

Derrick Stolee (1):
  commit-reach: properly peel tags and clear flags

 commit-reach.c| 44 +--
 t/helper/test-reach.c | 22 +-
 t/t6600-test-reach.sh | 30 +++--
 3 files changed, 79 insertions(+), 17 deletions(-)


base-commit: 6621c838743812aaba96e55cfec8524ea1144c2d
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-39%2Fderrickstolee%2Ftag-fix-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-39/derrickstolee/tag-fix-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/39

Range-diff vs v3:

 1:  0a1e661271 ! 1:  a0a3cf0134 commit-reach: properly peel tags
 @@ -1,6 +1,6 @@
  Author: Derrick Stolee 
  
 -commit-reach: properly peel tags
 +commit-reach: properly peel tags and clear flags
  
  The can_all_from_reach_with_flag() algorithm was refactored in 
4fbcca4e
  "commit-reach: make can_all_from_reach... linear" but incorrectly
 @@ -14,6 +14,19 @@
  Correct the issue by peeling tags when investigating the initial list
  of objects in the 'from' array.
  
 +The can_all_from_reach_with_flag() method uses 'assign_flag' as a
 +value we can use to mark objects temporarily during our commit walk.
 +The intent is that these flags are removed from all objects before
 +returning. However, this is not the case.
 +
 +The 'from' array could also contain objects that are not commits, and
 +we mark those objects with 'assign_flag'. Add a loop to the 'cleanup'
 +section that removes these markers.
 +
 +Also, we forgot to free() the memory for 'list', so add that to the
 +'cleanup' section. Also, use a cleaner mechanism for clearing those
 +flags.
 +
  Signed-off-by: Jeff King 
  Signed-off-by: Derrick Stolee 
  
 @@ -74,10 +87,18 @@
   
   cleanup:
  - for (i = 0; i < from->nr; i++) {
 -+ for (i = 0; i < nr_commits; i++) {
 -  clear_commit_marks(list[i], RESULT);
 -  clear_commit_marks(list[i], assign_flag);
 -  }
 +- clear_commit_marks(list[i], RESULT);
 +- clear_commit_marks(list[i], assign_flag);
 +- }
 ++ clear_commit_marks_many(nr_commits, list, RESULT | assign_flag);
 ++ free(list);
 ++
 ++ for (i = 0; i < from->nr; i++)
 ++ from->objects[i].item->flags &= ~assign_flag;
 ++
 +  return result;
 + }
 + 
  
  diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
  --- a/t/helper/test-reach.c
 2:  b2e0ee4978 < -:  -- commit-reach: fix memory and flag leaks

-- 
gitgitgadget


[PATCH v4 1/1] commit-reach: properly peel tags and clear flags

2018-09-24 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e
"commit-reach: make can_all_from_reach... linear" but incorrectly
assumed that all objects provided were commits. During a fetch
negotiation, ok_to_give_up() in upload-pack.c may provide unpeeled tags
to the 'from' array. The current code creates a segfault.

Add a direct call to can_all_from_reach_with_flag() in 'test-tool reach'
and add a test in t6600-test-reach.sh that demonstrates this segfault.

Correct the issue by peeling tags when investigating the initial list
of objects in the 'from' array.

The can_all_from_reach_with_flag() method uses 'assign_flag' as a
value we can use to mark objects temporarily during our commit walk.
The intent is that these flags are removed from all objects before
returning. However, this is not the case.

The 'from' array could also contain objects that are not commits, and
we mark those objects with 'assign_flag'. Add a loop to the 'cleanup'
section that removes these markers.

We also forgot to free() the memory for 'list', so add that to the
'cleanup' section, and use a cleaner mechanism for clearing those
flags.
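
The shape of the fix can be sketched with toy objects. Everything named 'toy_*' or 'TOY_*' below is hypothetical and only stands in for Git's object, tag and flag machinery: inputs are peeled before being treated as commits, inputs that do not peel to a commit get a temporary mark, and that mark is cleared from every input before returning.

```c
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TAG, TOY_BLOB };

/* Hypothetical toy object, not Git's struct object. */
struct toy_object {
	enum toy_type type;
	unsigned int flags;
	struct toy_object *tagged; /* target when type == TOY_TAG, else NULL */
};

#define TOY_TMP_FLAG (1u << 0) /* stands in for 'assign_flag' */

/* Follow tag links until a non-tag object (or NULL) is found. */
struct toy_object *toy_peel(struct toy_object *obj)
{
	while (obj && obj->type == TOY_TAG)
		obj = obj->tagged;
	return obj;
}

/*
 * Collect the commits named by 'from' into 'list' (capacity >= nr),
 * marking entries that do not peel to a commit with TOY_TMP_FLAG, then
 * clear that temporary flag from every input before returning.
 * Returns the number of commits collected.
 */
size_t toy_collect_commits(struct toy_object **from, size_t nr,
			   struct toy_object **list)
{
	size_t i, nr_commits = 0;

	for (i = 0; i < nr; i++) {
		struct toy_object *peeled;

		if (!from[i] || (from[i]->flags & TOY_TMP_FLAG))
			continue;
		peeled = toy_peel(from[i]);
		if (!peeled || peeled->type != TOY_COMMIT) {
			/* Not a commit: remember to skip it from now on. */
			from[i]->flags |= TOY_TMP_FLAG;
			continue;
		}
		list[nr_commits++] = peeled;
	}

	/* Cleanup: no temporary marks may survive the call. */
	for (i = 0; i < nr; i++)
		if (from[i])
			from[i]->flags &= ~TOY_TMP_FLAG;

	return nr_commits;
}
```

Keeping a separate count of collected commits matters here: once non-commits are skipped, the number of inputs and the number of commits in 'list' no longer agree, which is why the real patch sorts and walks 'nr_commits' entries rather than 'from->nr'.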

Signed-off-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 commit-reach.c| 44 +--
 t/helper/test-reach.c | 22 +-
 t/t6600-test-reach.sh | 30 +++--
 3 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 86715c103c..db84f85986 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -544,20 +544,42 @@ int can_all_from_reach_with_flag(struct object_array 
*from,
 {
struct commit **list = NULL;
int i;
+   int nr_commits;
int result = 1;
 
ALLOC_ARRAY(list, from->nr);
+   nr_commits = 0;
for (i = 0; i < from->nr; i++) {
-   list[i] = (struct commit *)from->objects[i].item;
+   struct object *from_one = from->objects[i].item;
 
-   if (parse_commit(list[i]) ||
-   list[i]->generation < min_generation)
-   return 0;
+   if (!from_one || from_one->flags & assign_flag)
+   continue;
+
+   from_one = deref_tag(the_repository, from_one,
+"a from object", 0);
+   if (!from_one || from_one->type != OBJ_COMMIT) {
+   /* no way to tell if this is reachable by
+* looking at the ancestry chain alone, so
+* leave a note to ourselves not to worry about
+* this object anymore.
+*/
+   from->objects[i].item->flags |= assign_flag;
+   continue;
+   }
+
+   list[nr_commits] = (struct commit *)from_one;
+   if (parse_commit(list[nr_commits]) ||
+   list[nr_commits]->generation < min_generation) {
+   result = 0;
+   goto cleanup;
+   }
+
+   nr_commits++;
}
 
-   QSORT(list, from->nr, compare_commits_by_gen);
+   QSORT(list, nr_commits, compare_commits_by_gen);
 
-   for (i = 0; i < from->nr; i++) {
+   for (i = 0; i < nr_commits; i++) {
/* DFS from list[i] */
struct commit_list *stack = NULL;
 
@@ -600,10 +622,12 @@ int can_all_from_reach_with_flag(struct object_array 
*from,
}
 
 cleanup:
-   for (i = 0; i < from->nr; i++) {
-   clear_commit_marks(list[i], RESULT);
-   clear_commit_marks(list[i], assign_flag);
-   }
+   clear_commit_marks_many(nr_commits, list, RESULT | assign_flag);
+   free(list);
+
+   for (i = 0; i < from->nr; i++)
+   from->objects[i].item->flags &= ~assign_flag;
+
return result;
 }
 
diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index eb21103998..08d2ea68e8 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -31,6 +31,7 @@ int cmd__reach(int ac, const char **av)
struct object_id oid_A, oid_B;
struct commit *A, *B;
struct commit_list *X, *Y;
+   struct object_array X_obj = OBJECT_ARRAY_INIT;
struct commit **X_array;
int X_nr, X_alloc;
struct strbuf buf = STRBUF_INIT;
@@ -49,7 +50,8 @@ int cmd__reach(int ac, const char **av)
 
while (strbuf_getline(, stdin) != EOF) {
struct object_id oid;
-   struct object *o;
+   struct object *orig;
+   struct object *peeled;
struct commit *c;
if (buf.len < 3)
continue;
@@ -57,14 +59,14 @@ int cmd__reach(int ac, const char **av)
if (get_oid_committish(buf.buf + 2, ))
die("failed to resolve %s", buf.buf + 2);
 
-   o = parse_object(r, );
- 

[PATCH v3 5/7] commit/revisions: bookkeeping before refactoring

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.

Signed-off-by: Derrick Stolee 
---
 commit.c   | 11 ---
 commit.h   |  8 
 revision.c |  6 --
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..f68e04b2f1 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
-
-static void record_author_date(struct author_date_slab *author_date,
-  struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit)
 {
const char *buffer = get_commit_buffer(commit, NULL);
struct ident_split ident;
@@ -684,8 +681,8 @@ fail_exit:
unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
- void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+  void *cb_data)
 {
const struct commit *a = a_, *b = b_;
struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..ff0eb5f8ef 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
 #define GENERATION_NUMBER_INFINITY 0x
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
+/* record author-date for each commit object */
+define_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..92012d5f45 100644
--- a/revision.c
+++ b/revision.c
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
if (p->object.flags & SEEN)
continue;
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
return 0;
}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
p->object.flags |= left_flag;
if (!(p->object.flags & SEEN)) {
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
if (revs->first_parent_only)
break;
-- 
gitgitgadget



[PATCH v3 7/7] revision.c: refactor basic topo-order logic

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
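
Step 2 above can be sketched on a toy graph. This is only an illustration under stated assumptions: 'struct topo_commit' and 'toy_topo_sort()' are hypothetical, commits are array indices, and a plain LIFO stack stands in for the priority queue. It does keep the "in-degree plus one" convention, so a stored value of 1 means the commit has no unemitted children.

```c
#include <stddef.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Hypothetical toy commit, not Git's struct commit. */
struct topo_commit {
	int parents[MAX_PARENTS]; /* indices into the array, -1 if unused */
};

/*
 * Kahn's algorithm with the "in-degree plus one" convention. Writes
 * commit indices to 'out' so that every child precedes its parents and
 * returns the number of commits written. Assumes nr <= MAX_COMMITS.
 */
size_t toy_topo_sort(const struct topo_commit *commits, size_t nr, int *out)
{
	int indegree[MAX_COMMITS];
	int stack[MAX_COMMITS];
	size_t i, top = 0, emitted = 0;
	int p;

	/* Start at 1 so that 0 can keep meaning "not initialized". */
	for (i = 0; i < nr; i++)
		indegree[i] = 1;
	for (i = 0; i < nr; i++)
		for (p = 0; p < MAX_PARENTS; p++)
			if (commits[i].parents[p] >= 0)
				indegree[commits[i].parents[p]]++;

	/* Seed with commits that have no children (the tips). */
	for (i = 0; i < nr; i++)
		if (indegree[i] == 1)
			stack[top++] = (int)i;

	while (top) {
		int c = stack[--top];

		out[emitted++] = c;
		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = commits[c].parents[p];

			/* A parent is ready once all its children are out. */
			if (parent >= 0 && --indegree[parent] == 1)
				stack[top++] = parent;
		}
	}
	return emitted;
}
```

The first two loops touch every commit before anything is emitted. That up-front pass over the whole set is the cost the new algorithm tries to avoid.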

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
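
The two rules above translate directly into code. A minimal sketch, assuming a toy array in which every parent is stored at a lower index than its children; 'struct gen_commit' and 'compute_generations()' are hypothetical and are not the commit-graph implementation.

```c
#include <stddef.h>

#define MAX_PARENTS 2

/* Hypothetical toy commit, not Git's struct commit. */
struct gen_commit {
	int parents[MAX_PARENTS];   /* indices into the array, -1 if unused */
	unsigned int generation;    /* filled in by compute_generations() */
};

/*
 * Assumes parents always appear at lower indices than their children,
 * so a single forward pass sees every parent before its child.
 */
void compute_generations(struct gen_commit *commits, size_t nr)
{
	size_t i;
	int p;

	for (i = 0; i < nr; i++) {
		unsigned int max_parent = 0;

		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = commits[i].parents[p];

			if (parent >= 0 &&
			    commits[parent].generation > max_parent)
				max_parent = commits[parent].generation;
		}
		/* Root commits get 1; others get max parent generation + 1. */
		commits[i].generation = max_parent + 1;
	}
}
```

On a diamond (root, two children, one merge) this yields generations 1, 2, 2 and 3.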

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0x and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following statement, which is weaker than what we usually expect
from generation numbers:

If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
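
The statement above is what lets a walk stop early. As a minimal sketch, the toy reachability check below refuses to expand any commit whose generation number is strictly lower than the target's. 'struct gen_node' and 'toy_can_reach()' are hypothetical, and this is not the three-walk algorithm itself, only the pruning rule it relies on.

```c
#include <stddef.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Hypothetical toy commit with a precomputed generation number. */
struct gen_node {
	int parents[MAX_PARENTS]; /* indices into the array, -1 if unused */
	unsigned int generation;
};

/*
 * Returns 1 if 'from' can reach 'to' by following parent links, else 0.
 * Assumes at most MAX_COMMITS nodes and valid generation numbers.
 */
int toy_can_reach(const struct gen_node *nodes, int from, int to)
{
	unsigned char seen[MAX_COMMITS] = { 0 };
	int stack[MAX_COMMITS];
	size_t top = 0;
	int p;

	stack[top++] = from;
	seen[from] = 1;
	while (top) {
		int c = stack[--top];

		if (c == to)
			return 1;
		/* gen(c) < gen(to): c cannot reach 'to', so do not expand it. */
		if (nodes[c].generation < nodes[to].generation)
			continue;
		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = nodes[c].parents[p];

			if (parent >= 0 && !seen[parent]) {
				seen[parent] = 1;
				stack[top++] = parent;
			}
		}
	}
	return 0;
}
```

The comparison is deliberately strict, matching the weaker statement: two commits sharing GENERATION_NUMBER_INFINITY could still reach one another, so equal values must not be pruned.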

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled for comparisons, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD

[PATCH v3 3/7] test-reach: add rev-list tests

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   >expect &&
+   run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   commit-3-8 commit-2-8 commit-1-8 \
+   commit-3-7 commit-2-7 commit-1-7 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
gitgitgadget



[PATCH v3 6/7] revision.h: add whitespace in flag definitions

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

In anticipation of adding longer flag names in the next change, add
an extra tab to each flag definition in revision.h.

Signed-off-by: Derrick Stolee 
---
 revision.h | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/revision.h b/revision.h
index fd4154ff75..e7bd059d80 100644
--- a/revision.h
+++ b/revision.h
@@ -10,20 +10,20 @@
 #include "commit-slab-decl.h"
 
 /* Remember to update object flag allocation in object.h */
-#define SEEN   (1u<<0)
-#define UNINTERESTING   (1u<<1)
-#define TREESAME   (1u<<2)
-#define SHOWN  (1u<<3)
-#define TMP_MARK   (1u<<4) /* for isolated cases; clean after use */
-#define BOUNDARY   (1u<<5)
-#define CHILD_SHOWN (1u<<6)
-#define ADDED  (1u<<7) /* Parents already parsed and added? */
-#define SYMMETRIC_LEFT (1u<<8)
-#define PATCHSAME  (1u<<9)
-#define BOTTOM (1u<<10)
-#define USER_GIVEN (1u<<25) /* given directly by the user */
-#define TRACK_LINEAR   (1u<<26)
-#define ALL_REV_FLAGS  (((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
+#define SEEN   (1u<<0)
+#define UNINTERESTING  (1u<<1)
+#define TREESAME   (1u<<2)
+#define SHOWN  (1u<<3)
+#define TMP_MARK   (1u<<4) /* for isolated cases; clean after use */
+#define BOUNDARY   (1u<<5)
+#define CHILD_SHOWN (1u<<6)
+#define ADDED  (1u<<7) /* Parents already parsed and added? */
+#define SYMMETRIC_LEFT (1u<<8)
+#define PATCHSAME  (1u<<9)
+#define BOTTOM (1u<<10)
+#define USER_GIVEN (1u<<25) /* given directly by the user */
+#define TRACK_LINEAR   (1u<<26)
+#define ALL_REV_FLAGS  (((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
 
 #define DECORATE_SHORT_REFS 1
 #define DECORATE_FULL_REFS 2
-- 
gitgitgadget



[PATCH v3 4/7] revision.c: begin refactoring --topo-order logic

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee 
---
 revision.c | 42 ++
 revision.h |  4 
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
if (revs->diffopt.objfind)
revs->simplify_history = 0;
 
-   if (revs->topo_order)
+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;
 
if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));
+
+   limit_list(revs);
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(&commit->object.oid));
+   }
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
return 0;
-   if (revs->limited)
+   if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-   if (revs->topo_order)
-   sort_in_topological_order(&revs->commits, revs->sort_order);
+   if (revs->topo_order)
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+   } else if (revs->topo_order)
+   init_topo_walk(revs);
if 

[PATCH v3 1/7] prio-queue: add 'peek' operation

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh.
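
The invariant the updated test helper checks (a peek followed by a get yields the same item, and peek alone leaves the queue unchanged) can be sketched on a small stand-alone heap. This is a minimal sketch, not Git's struct prio_queue: the `toy_queue` type and its functions are hypothetical, it stores plain ints, and it covers only the heap-ordered mode, not the LIFO mode that prio_queue_peek() also handles when no compare function is set.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 32

/* Hypothetical binary min-heap of ints. */
struct toy_queue {
	int array[TOY_QUEUE_MAX];
	size_t nr;
};

static void toy_swap(int *a, int *b)
{
	int t = *a;
	*a = *b;
	*b = t;
}

static void toy_queue_put(struct toy_queue *q, int value)
{
	size_t i;

	assert(q->nr < TOY_QUEUE_MAX);
	i = q->nr++;
	q->array[i] = value;
	/* Sift the new element up until its parent is not larger. */
	while (i > 0) {
		size_t parent = (i - 1) / 2;
		if (q->array[parent] <= q->array[i])
			break;
		toy_swap(&q->array[parent], &q->array[i]);
		i = parent;
	}
}

/* Return the smallest element without removing it, or NULL if empty. */
static const int *toy_queue_peek(const struct toy_queue *q)
{
	return q->nr ? &q->array[0] : NULL;
}

/* Remove the smallest element into *out; return 0 if the queue is empty. */
static int toy_queue_get(struct toy_queue *q, int *out)
{
	size_t i = 0;

	if (!q->nr)
		return 0;
	*out = q->array[0];
	q->array[0] = q->array[--q->nr];
	/* Sift the moved element down to restore heap order. */
	for (;;) {
		size_t child = 2 * i + 1;
		if (child >= q->nr)
			break;
		if (child + 1 < q->nr && q->array[child + 1] < q->array[child])
			child++;
		if (q->array[i] <= q->array[child])
			break;
		toy_swap(&q->array[i], &q->array[child]);
		i = child;
	}
	return 1;
}
```

Because the heap keeps its minimum at index 0, peeking costs O(1) and needs no rebalancing.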

Signed-off-by: Derrick Stolee 
---
 prio-queue.c   |  9 +
 prio-queue.h   |  6 ++
 t/helper/test-prio-queue.c | 10 +++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
}
return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+   if (!queue->nr)
+   return NULL;
+   if (!queue->compare)
+   return queue->array[queue->nr - 1].data;
+   return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..e817bbf464 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
struct prio_queue pq = { intcmp };
 
while (*++argv) {
-   if (!strcmp(*argv, "get"))
-   show(prio_queue_get(&pq));
-   else if (!strcmp(*argv, "dump")) {
+   if (!strcmp(*argv, "get")) {
+   void *peek = prio_queue_peek(&pq);
+   void *get = prio_queue_get(&pq);
+   if (peek != get)
+   BUG("peek and get results do not match");
+   show(get);
+   } else if (!strcmp(*argv, "dump")) {
int *v;
while ((v = prio_queue_get(&pq)))
   show(v);
-- 
gitgitgadget



[PATCH v3 2/7] test-reach: add run_three_modes method

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a simple translation on a more general
run_three_modes method that executes the given command and tests
the actual output to the expected output.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   "$@" <input >actual &&
test_cmp expect actual &&
cp commit-graph-full .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   "$@" <input >actual &&
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   "$@" <input >actual &&
test_cmp expect actual
 }
 
+test_three_modes () {
+   run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
cat >input <<-\EOF &&
A:commit-5-7
-- 
gitgitgadget



[PATCH v3 0/7] Use generation numbers for --topo-order

2018-09-21 Thread Derrick Stolee via GitGitGadget
This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revisions.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Changes in V3: I added a new patch that updates the tab-alignment for flags
in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the
recommended changes to run_three_modes and test_three_modes from Szeder and
Junio. Thanks!

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph in
entire test suite.

Cc: avarab@gmail.com
Cc: szeder@gmail.com

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.h: add whitespace in flag definitions
  revision.c: refactor basic topo-order logic

 commit.c   |  11 +-
 commit.h   |   8 ++
 object.h   |   4 +-
 prio-queue.c   |   9 ++
 prio-queue.h   |   6 +
 revision.c | 232 -
 revision.h |  34 +++---
 t/helper/test-prio-queue.c |  10 +-
 t/t6600-test-reach.sh  |  96 ++-
 9 files changed, 374 insertions(+), 36 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/25

Range-diff vs v2:

 1:  cc1ec4c270 = 1:  cc1ec4c270 

[PATCH v3 1/1] contrib: add coverage-diff script

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has
an X.c.gcov file containing the coverage counts for every line, and "#"
at every uncovered line.

There have been a few bugs in recent patches that would have been caught
if the test suite covered those blocks (including a few of mine). I want
to work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks covering
very unlikely (or near-impossible) situations may not warrant coverage.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by the
test suite. The output uses 'git blame -c' format so you can find the commits
responsible and view the line numbers for quick access to the context.
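
The line-number bookkeeping that the script applies to 'git diff' output can be sketched in C. This is a rough illustration under stated assumptions, not part of the patch: `toy_added_lines` is a hypothetical name, the input is assumed to be the unified diff of a single file held in one NUL-terminated string, and results are written to a caller-supplied array.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Collect the postimage line numbers of added lines in a single-file
 * unified diff. Writes at most max entries to out and returns how
 * many were written.
 */
static size_t toy_added_lines(const char *diff, int *out, size_t max)
{
	size_t nr = 0;
	int line_num = 0;
	int in_hunk = 0;
	const char *line = diff;

	assert(diff && out);
	while (*line) {
		const char *eol = strchr(line, '\n');
		int start;

		/* Hunk header: take the starting line of the postimage. */
		if (!strncmp(line, "@@ ", 3) &&
		    (sscanf(line, "@@ -%*d,%*d +%d", &start) == 1 ||
		     sscanf(line, "@@ -%*d +%d", &start) == 1)) {
			line_num = start;
			in_hunk = 1;
		} else if (in_hunk && line[0] != '-') {
			/*
			 * Added and context lines both exist in the
			 * postimage, so both advance the counter; only
			 * added lines are recorded. Deleted lines are
			 * skipped entirely.
			 */
			if (line[0] == '+' && nr < max)
				out[nr++] = line_num;
			line_num++;
		}

		if (!eol)
			break;
		line = eol + 1;
	}
	return nr;
}
```

Everything before the first hunk header ("diff --git", "---", "+++") is ignored, which is why the "+++ b/..." line is not mistaken for an added line.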

Helped-by: Junio C Hamano 
Signed-off-by: Derrick Stolee 
---
 contrib/coverage-diff.sh | 79 
 1 file changed, 79 insertions(+)
 create mode 100755 contrib/coverage-diff.sh

diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
new file mode 100755
index 00..48b9a3ae96
--- /dev/null
+++ b/contrib/coverage-diff.sh
@@ -0,0 +1,79 @@
+#!/bin/sh
+
+# Usage: Run 'contrib/coverage-diff.sh <version1> <version2>' from source-root
+# after running
+#
+# make coverage-test
+# make coverage-report
+#
+# while checked out at <version2>. This script combines the *.gcov files
+# generated by the 'make' commands above with 'git diff <version1> <version2>'
+# to report new lines that are not covered by the test suite.
+
+V1=$1
+V2=$2
+
+diff_lines () {
+   perl -e '
+   my $line_num;
+   while (<>) {
+   # Hunk header?  Grab the beginning in postimage.
+   if (/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/) {
+   $line_num = $1;
+   next;
+   }
+
+   # Have we seen a hunk?  Ignore "diff --git" etc.
+   next unless defined $line_num;
+
+   # Deleted line? Ignore.
+   if (/^-/) {
+   next;
+   }
+
+   # Show only the line number of added lines.
+   if (/^\+/) {
+   print "$line_num\n";
+   }
+   # Either common context or added line appear in
+   # the postimage.  Count it.
+   $line_num++;
+   }
+   '
+}
+
+files=$(git diff --name-only $V1 $V2 -- *.c)
+
+for file in $files
+do
+   git diff $V1 $V2 -- $file \
+   | diff_lines \
+   | sort >new_lines.txt
+
+   if ! test -s new_lines.txt
+   then
+   continue
+   fi
+
+   hash_file=$(echo $file | sed "s/\//\#/")
+   sed -ne '/#:/{
+   s/#://
+   s/:.*//
+   s/ //g
+   p
+   }' "$hash_file.gcov" \
+   | sort >uncovered_lines.txt
+
+   comm -12 uncovered_lines.txt new_lines.txt \
+   | sed -e 's/$/\)/' \
+   | sed -e 's/^/\t/' \
+   >uncovered_new_lines.txt
+
+   grep -q '[^[:space:]]' < uncovered_new_lines.txt && \
+   echo $file && \
+   git blame -c $file \
+   | grep -f uncovered_new_lines.txt
+
+   rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt
+done
+
-- 
gitgitgadget


[PATCH v3 0/1] contrib: Add script to show uncovered "new" lines

2018-09-21 Thread Derrick Stolee via GitGitGadget
We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an
X.c.gcov file containing the coverage counts for every line, and "#" at
every uncovered line.

There have been a few bugs in recent patches that would have been caught if
the test suite covered those blocks (including a few of mine). I want to
work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks in error
cases do not need to be covered.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by
the test suite. For example, I ran this against the 'next' branch (22e244b)
versus 'master' (150f307) and got the following output:

fsck.c
fb8952077df (René Scharfe   2018-09-03 14:49:26 +   212) die_errno("Could not read '%s'", path);
list-objects-filter-options.c
f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700   56) if (errbuf) {
f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700   57) strbuf_init(errbuf, 0);
f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700   58) strbuf_addstr(
f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700   62) return 1;
list-objects-filter.c
77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700   47) BUG("unknown filter_situation: %d", filter_situation);
f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700   100) default:
f12b8fc6d3b (Matthew DeVore 2018-09-13 17:55:27 -0700   101) BUG("unknown filter_situation: %d", filter_situation);
77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700   152) BUG("unknown filter_situation: %d", filter_situation);
77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700   257) BUG("unknown filter_situation: %d", filter_situation);
77d7a65d502 (Matthew DeVore 2018-09-13 17:55:26 -0700   438) BUG("invalid list-objects filter choice: %d",
list-objects.c
f447a499dbb (Matthew DeVore 2018-08-13 11:14:28 -0700   197) ctx->show_object(obj, base->buf, ctx->show_data);
ll-merge.c
d64324cb60e (Torsten Bögershausen   2018-09-12 21:32:02 +0200   379) marker_size = DEFAULT_CONFLICT_MARKER_SIZE;
midx.c
56ee7ff1565 (Derrick Stolee 2018-09-13 11:02:13 -0700   949) return 0;
cc6af73c029 (Derrick Stolee 2018-09-13 11:02:25 -0700   990) midx_report(_("failed to load pack-index for packfile %s"),
cc6af73c029 (Derrick Stolee 2018-09-13 11:02:25 -0700   991) e.p->pack_name);
cc6af73c029 (Derrick Stolee 2018-09-13 11:02:25 -0700   992) break;
remote-curl.c
c3b9bc94b9b (Elijah Newren  2018-09-05 10:03:07 -0700   181) options.filter = xstrdup(value);
submodule.c
df255b8cac7 (Brandon Williams   2018-08-08 15:33:22 -0700   1738) die(_("could not create directory '%s'"), new_gitdir.buf);

Using this 'git blame' output, we can quickly inspect whether the uncovered
lines are appropriate. For instance:

 1. The line in builtin/commit.c is due to writing the commit-graph file
    when GIT_TEST_COMMIT_GRAPH is enabled, which is not on by default in the
    test suite. Being uncovered is expected here.

 2. The lines in builtin/worktree.c are all related to error conditions.
    This is acceptable.

 3. The line in builtin/rev-list.c is a flag replacement in a block that is
    otherwise unchanged. It must not be covered by the test suite normally.
    This could be worth adding a test to ensure the new logic maintains old
    behavior.

 4. The lines in read-cache.c are part of a new block for the condition "if
    (expand_name_field)" as part of an optimization. These lines should
    probably be covered before that series is merged to 'next'. I understand
    that Ben and Duy are continuing work in this direction [1].

I used this approach for 'next' over 'master' and got a larger list, some of
which I have already submitted tests to increase coverage [2] or will be
covered by topics not in 'next' [3].

Thanks, -Stolee

CHANGES IN V3: I took Junio's perl script verbatim, which speeds up the
performance greatly. Some of the other sed commands needed some massaging,
but also added extra cleanup. Thanks for the help!

[1] 

[PATCH v3 2/2] commit-reach: fix memory and flag leaks

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The can_all_from_reach_with_flag() method uses 'assign_flag' as a
value we can use to mark objects temporarily during our commit walk.
The intent is that these flags are removed from all objects before
returning. However, this is not the case.

The 'from' array could also contain objects that are not commits, and
we mark those objects with 'assign_flag'. Add a loop to the 'cleanup'
section that removes these markers.

Also, we forgot to free() the memory for 'list', so add that to the
'cleanup' section.
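
The shape of the fix, a single 'cleanup' label that frees the scratch list and clears the temporary flag on every input object regardless of which path reached it, can be sketched as follows. This is a simplified illustration, not the code in commit-reach.c: `toy_object`, `toy_check_all`, and the flag values are hypothetical, and an integer `value` stands in for the generation number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SEEN        (1u << 0) /* hypothetical long-lived flag */
#define TOY_ASSIGN_FLAG (1u << 4) /* hypothetical scratch flag */

struct toy_object {
	int value;
	unsigned flags;
};

/*
 * Return 1 if every object has value >= min_value, 0 if not, and -1
 * on allocation failure. The scratch flag is set while working and
 * is always removed before returning.
 */
static int toy_check_all(struct toy_object **from, size_t nr, int min_value)
{
	struct toy_object **list;
	size_t i, nr_list = 0;
	int result = 1;

	assert(from || !nr);
	list = malloc((nr ? nr : 1) * sizeof(*list));
	if (!list)
		return -1;

	for (i = 0; i < nr; i++) {
		from[i]->flags |= TOY_ASSIGN_FLAG;
		if (from[i]->value < min_value) {
			/* Early exit still goes through cleanup. */
			result = 0;
			goto cleanup;
		}
		list[nr_list++] = from[i];
	}

cleanup:
	free(list);
	/* Clear the scratch flag on all inputs, not only those in list. */
	for (i = 0; i < nr; i++)
		from[i]->flags &= ~TOY_ASSIGN_FLAG;
	return result;
}
```

Looping over the full input array during cleanup matters because an object may have been marked without ever being added to the scratch list.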

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index e748414d04..5a845440a9 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -626,6 +626,11 @@ cleanup:
clear_commit_marks(list[i], RESULT);
clear_commit_marks(list[i], assign_flag);
}
+   free(list);
+
+   for (i = 0; i < from->nr; i++)
+   from->objects[i].item->flags &= ~assign_flag;
+
return result;
 }
 
-- 
gitgitgadget


[PATCH v3 0/2] Properly peel tags in can_all_from_reach_with_flags()

2018-09-21 Thread Derrick Stolee via GitGitGadget
As Peff reported [1], the refactored can_all_from_reach_with_flags() method
does not properly peel tags. Since the helper method can_all_from_reach()
and code in t/helper/test-reach.c all peel tags before getting to this
method, it is not super-simple to create a test that demonstrates this.

I modified t/helper/test-reach.c to allow calling
can_all_from_reach_with_flags() directly, and added a test in
t6600-test-reach.sh that demonstrates the segfault without the fix.

For V2, I compared the loop that inspects the 'from' commits in commit
ba3ca1edce "commit-reach: move can_all_from_reach_with_flags" to the version
here and got the following diff:

3c3
< if (from_one->flags & assign_flag)
---
> if (!from_one || from_one->flags & assign_flag)
5c5,7
< from_one = deref_tag(the_repository, from_one, "a from object", 0);
---
>
> from_one = deref_tag(the_repository, from_one,
>  "a from object", 0);
14a17,22
>
> list[nr_commits] = (struct commit *)from_one;
> if (parse_commit(list[nr_commits]) ||
> list[nr_commits]->generation < min_generation)
> return 0; /* is this a leak? */
> nr_commits++;

This diff includes the early termination we had before 'deref_tag' and the
comment for why we can ignore non-commit objects.

[1] 
https://public-inbox.org/git/0bf9103c-9377-506b-7ad7-e5273d8e9...@gmail.com/T/#u

Derrick Stolee (2):
  commit-reach: properly peel tags
  commit-reach: fix memory and flag leaks

 commit-reach.c| 41 ++---
 t/helper/test-reach.c | 22 +-
 t/t6600-test-reach.sh | 30 --
 3 files changed, 79 insertions(+), 14 deletions(-)


base-commit: 6621c838743812aaba96e55cfec8524ea1144c2d
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-39%2Fderrickstolee%2Ftag-fix-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-39/derrickstolee/tag-fix-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/39

Range-diff vs v2:

 1:  4bf21204dd ! 1:  0a1e661271 commit-reach: properly peel tags
 @@ -53,8 +53,11 @@
  +
  + list[nr_commits] = (struct commit *)from_one;
  + if (parse_commit(list[nr_commits]) ||
 -+ list[nr_commits]->generation < min_generation)
 -+ return 0; /* is this a leak? */
 ++ list[nr_commits]->generation < min_generation) {
 ++ result = 0;
 ++ goto cleanup;
 ++ }
 ++
  + nr_commits++;
}
   
 -:  -- > 2:  b2e0ee4978 commit-reach: fix memory and flag leaks

-- 
gitgitgadget


[PATCH v3 1/2] commit-reach: properly peel tags

2018-09-21 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e
"commit-reach: make can_all_from_reach... linear" but incorrectly
assumed that all objects provided were commits. During a fetch
negotiation, ok_to_give_up() in upload-pack.c may provide unpeeled tags
to the 'from' array. The current code creates a segfault.

Add a direct call to can_all_from_reach_with_flag() in 'test-tool reach'
and add a test in t6600-test-reach.sh that demonstrates this segfault.

Correct the issue by peeling tags when investigating the initial list
of objects in the 'from' array.
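
The peel-then-compact loop can be illustrated on a toy object model. This is a sketch under stated assumptions rather than Git's implementation: `toy_object`, `toy_deref_tag`, and `toy_collect_commits` are hypothetical names, a tag simply points at the object it tags, and parsing and generation checks are left out.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TAG, TOY_BLOB };

/* Hypothetical stand-in for struct object. */
struct toy_object {
	enum toy_type type;
	unsigned flags;
	struct toy_object *tagged; /* only meaningful for TOY_TAG */
};

/* Follow a chain of tags until a non-tag object (or NULL) is reached. */
static struct toy_object *toy_deref_tag(struct toy_object *o)
{
	while (o && o->type == TOY_TAG)
		o = o->tagged;
	return o;
}

/*
 * Copy the commits reachable by peeling each entry of from into list
 * and return how many were stored. Entries that do not peel to a
 * commit are marked with skip_flag on the original object and left
 * out, so list may be shorter than from.
 */
static size_t toy_collect_commits(struct toy_object **from, size_t nr,
				  struct toy_object **list,
				  unsigned skip_flag)
{
	size_t nr_commits = 0;

	assert(list || !nr);
	for (size_t i = 0; i < nr; i++) {
		struct toy_object *one = from[i];

		if (!one || (one->flags & skip_flag))
			continue;

		one = toy_deref_tag(one);
		if (!one || one->type != TOY_COMMIT) {
			from[i]->flags |= skip_flag;
			continue;
		}
		list[nr_commits++] = one;
	}
	return nr_commits;
}
```

Later loops must iterate up to the returned count rather than the size of the input array, which mirrors the switch from from->nr to nr_commits in the patch.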

Signed-off-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 commit-reach.c| 36 +---
 t/helper/test-reach.c | 22 +-
 t/t6600-test-reach.sh | 30 --
 3 files changed, 74 insertions(+), 14 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 86715c103c..e748414d04 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -544,20 +544,42 @@ int can_all_from_reach_with_flag(struct object_array 
*from,
 {
struct commit **list = NULL;
int i;
+   int nr_commits;
int result = 1;
 
ALLOC_ARRAY(list, from->nr);
+   nr_commits = 0;
for (i = 0; i < from->nr; i++) {
-   list[i] = (struct commit *)from->objects[i].item;
+   struct object *from_one = from->objects[i].item;
 
-   if (parse_commit(list[i]) ||
-   list[i]->generation < min_generation)
-   return 0;
+   if (!from_one || from_one->flags & assign_flag)
+   continue;
+
+   from_one = deref_tag(the_repository, from_one,
+"a from object", 0);
+   if (!from_one || from_one->type != OBJ_COMMIT) {
+   /* no way to tell if this is reachable by
+* looking at the ancestry chain alone, so
+* leave a note to ourselves not to worry about
+* this object anymore.
+*/
+   from->objects[i].item->flags |= assign_flag;
+   continue;
+   }
+
+   list[nr_commits] = (struct commit *)from_one;
+   if (parse_commit(list[nr_commits]) ||
+   list[nr_commits]->generation < min_generation) {
+   result = 0;
+   goto cleanup;
+   }
+
+   nr_commits++;
}
 
-   QSORT(list, from->nr, compare_commits_by_gen);
+   QSORT(list, nr_commits, compare_commits_by_gen);
 
-   for (i = 0; i < from->nr; i++) {
+   for (i = 0; i < nr_commits; i++) {
/* DFS from list[i] */
struct commit_list *stack = NULL;
 
@@ -600,7 +622,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
}
 
 cleanup:
-   for (i = 0; i < from->nr; i++) {
+   for (i = 0; i < nr_commits; i++) {
clear_commit_marks(list[i], RESULT);
clear_commit_marks(list[i], assign_flag);
}
diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index eb21103998..08d2ea68e8 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -31,6 +31,7 @@ int cmd__reach(int ac, const char **av)
struct object_id oid_A, oid_B;
struct commit *A, *B;
struct commit_list *X, *Y;
+   struct object_array X_obj = OBJECT_ARRAY_INIT;
struct commit **X_array;
int X_nr, X_alloc;
struct strbuf buf = STRBUF_INIT;
@@ -49,7 +50,8 @@ int cmd__reach(int ac, const char **av)
 
while (strbuf_getline(&buf, stdin) != EOF) {
struct object_id oid;
-   struct object *o;
+   struct object *orig;
+   struct object *peeled;
struct commit *c;
if (buf.len < 3)
continue;
@@ -57,14 +59,14 @@ int cmd__reach(int ac, const char **av)
if (get_oid_committish(buf.buf + 2, &oid))
die("failed to resolve %s", buf.buf + 2);
 
-   o = parse_object(r, &oid);
-   o = deref_tag_noverify(o);
+   orig = parse_object(r, &oid);
+   peeled = deref_tag_noverify(orig);
 
-   if (!o)
+   if (!peeled)
die("failed to load commit for input %s resulting in oid %s\n",
buf.buf, oid_to_hex(&oid));
 
-   c = object_as_type(r, o, OBJ_COMMIT, 0);
+   c = object_as_type(r, peeled, OBJ_COMMIT, 0);
 
if (!c)
die("failed to load commit for input %s resulting in 
oid %s\n",
@@ -85,6 +87,7 @@ int cmd__reach(int ac, const char **av)
commit_list_insert(c, &X);
ALLOC_GROW(X_array, X_nr + 1, X_alloc);
 

[PATCH v2 5/6] commit/revisions: bookkeeping before refactoring

2018-09-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.
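
A minimal sketch of the third point, using hypothetical toy types rather than Git's commit_list and rev_info: the flag bookkeeping always runs, and the insertion into the list happens only when the caller supplied one.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LIST_CAP 16

/* Hypothetical fixed-size list; Git uses a linked commit_list instead. */
struct toy_list {
	int items[TOY_LIST_CAP];
	size_t nr;
};

/*
 * Mark each parent as seen. Newly seen parents are also appended to
 * 'list', unless the caller passed NULL because it tracks pending
 * commits elsewhere. Returns how many parents were newly marked.
 */
static size_t toy_add_parents(const int *parents, size_t nr,
			      unsigned char *seen, struct toy_list *list)
{
	size_t i, newly_seen = 0;

	assert(parents && seen);
	for (i = 0; i < nr; i++) {
		int p = parents[i];

		if (seen[p])
			continue;
		seen[p] = 1;
		newly_seen++;
		if (list) {
			assert(list->nr < TOY_LIST_CAP);
			list->items[list->nr++] = p;
		}
	}
	return newly_seen;
}
```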

Signed-off-by: Derrick Stolee 
---
 commit.c   | 11 ---
 commit.h   |  8 
 revision.c |  6 --
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..f68e04b2f1 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
-
-static void record_author_date(struct author_date_slab *author_date,
-  struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit)
 {
const char *buffer = get_commit_buffer(commit, NULL);
struct ident_split ident;
@@ -684,8 +681,8 @@ fail_exit:
unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
- void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+  void *cb_data)
 {
const struct commit *a = a_, *b = b_;
struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..ff0eb5f8ef 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
 #define GENERATION_NUMBER_INFINITY 0x
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
+/* record author-date for each commit object */
+define_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..92012d5f45 100644
--- a/revision.c
+++ b/revision.c
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
if (p->object.flags & SEEN)
continue;
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
return 0;
}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
p->object.flags |= left_flag;
if (!(p->object.flags & SEEN)) {
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
if (revs->first_parent_only)
break;
-- 
gitgitgadget



[PATCH v2 0/6] Use generation numbers for --topo-order

2018-09-17 Thread Derrick Stolee via GitGitGadget
This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revisions.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute the
commit-graph in the entire test suite.

Derrick Stolee (6):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: refactor basic topo-order logic

 commit.c   |  11 +-
 commit.h   |   8 ++
 object.h   |   4 +-
 prio-queue.c   |   9 ++
 prio-queue.h   |   6 +
 revision.c | 232 -
 revision.h |   6 +
 t/helper/test-prio-queue.c |  10 +-
 t/t6600-test-reach.sh  |  98 +++-
 9 files changed, 361 insertions(+), 23 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/25

Range-diff vs v1:

 1:  5e55669f4d = 1:  cc1ec4c270 prio-queue: add 'peek' operation
 2:  9628396af1 = 2:  404c918608 test-reach: add run_three_modes method
 3:  708b4550a1 = 3:  30dee58c61 test-reach: add rev-list tests
 4:  908442417d ! 4:  a74ae13d4e revision.c: begin refactoring --topo-order 
logic
 @@ -168,4 +168,4 @@
  + struct topo_walk_info *topo_walk_info;
   };
   
  

[PATCH v2 3/6] test-reach: add rev-list tests

2018-09-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 1b18e12a4e..2fcaa39077 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+   >expect &&
+   run_three_modes "git rev-list --first-parent --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   commit-3-8 commit-2-8 commit-1-8 \
+   commit-3-7 commit-2-7 commit-1-7 \
+   >expect &&
+   run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6"
+'
+
 test_done
-- 
gitgitgadget



[PATCH v2 1/6] prio-queue: add 'peek' operation

2018-09-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh.
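
To illustrate the contract that the test helper checks (peek returns exactly what the next get would return), here is a minimal sketch on a toy min-heap of ints. The toy_heap type and its functions are hypothetical and much simpler than Git's struct prio_queue, which stores void pointers and also supports a LIFO mode when no comparison function is set.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP 64

/* A toy min-heap of ints; a simplified stand-in for a priority queue. */
struct toy_heap {
	int data[TOY_CAP];
	size_t nr;
};

static void toy_swap(int *a, int *b)
{
	int t = *a;
	*a = *b;
	*b = t;
}

static void toy_put(struct toy_heap *h, int v)
{
	size_t i;

	assert(h->nr < TOY_CAP);
	h->data[h->nr] = v;
	/* Sift the new element up until its parent is not larger. */
	for (i = h->nr++; i > 0; ) {
		size_t parent = (i - 1) / 2;

		if (h->data[parent] <= h->data[i])
			break;
		toy_swap(&h->data[parent], &h->data[i]);
		i = parent;
	}
}

/* Return a pointer to the smallest element without removing it, or NULL. */
static const int *toy_peek(const struct toy_heap *h)
{
	return h->nr ? &h->data[0] : NULL;
}

/* Remove the smallest element into *out; returns 0 when the heap is empty. */
static int toy_get(struct toy_heap *h, int *out)
{
	size_t i = 0;

	if (!h->nr)
		return 0;
	*out = h->data[0];
	h->data[0] = h->data[--h->nr];
	/* Sift the moved element down to restore the heap property. */
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < h->nr && h->data[l] < h->data[m])
			m = l;
		if (r < h->nr && h->data[r] < h->data[m])
			m = r;
		if (m == i)
			break;
		toy_swap(&h->data[m], &h->data[i]);
		i = m;
	}
	return 1;
}
```

Because the heap keeps its smallest element at index 0, peeking costs O(1) and never reorders the queue, which is what makes it cheap to call before every get.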

Signed-off-by: Derrick Stolee 
---
 prio-queue.c   |  9 +
 prio-queue.h   |  6 ++
 t/helper/test-prio-queue.c | 10 +++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
}
return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+   if (!queue->nr)
+   return NULL;
+   if (!queue->compare)
+   return queue->array[queue->nr - 1].data;
+   return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..e817bbf464 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
struct prio_queue pq = { intcmp };
 
while (*++argv) {
-   if (!strcmp(*argv, "get"))
-   show(prio_queue_get(&pq));
-   else if (!strcmp(*argv, "dump")) {
+   if (!strcmp(*argv, "get")) {
+   void *peek = prio_queue_peek(&pq);
+   void *get = prio_queue_get(&pq);
+   if (peek != get)
+   BUG("peek and get results do not match");
+   show(get);
+   } else if (!strcmp(*argv, "dump")) {
int *v;
while ((v = prio_queue_get(&pq)))
   show(v);
-- 
gitgitgadget



[PATCH v2 6/6] revision.c: refactor basic topo-order logic

2018-09-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
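
For readers unfamiliar with step 2, the following is a minimal sketch of Kahn's algorithm with the "in-degree plus one" convention on a toy graph. The toy_graph layout and toy_topo_sort() are hypothetical illustrations; they use a plain LIFO stack and are not the code in commit.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 16
#define TOY_MAX_PARENTS 2

/* parents[i] lists the parents of node i; -1 terminates a short list. */
struct toy_graph {
	size_t nr;
	int parents[TOY_MAX][TOY_MAX_PARENTS];
};

/*
 * Kahn's algorithm with in-degrees stored "plus one": the value 1 means
 * "no unemitted children, ready for output". Every node of this toy
 * graph is part of the walk, so the special value 0 never appears here.
 * Writes children before parents into 'out' and returns the node count.
 */
static size_t toy_topo_sort(const struct toy_graph *g, int *out)
{
	int indegree[TOY_MAX];
	int stack[TOY_MAX];
	size_t i, top = 0, n = 0;
	int p;

	assert(g->nr <= TOY_MAX);
	for (i = 0; i < g->nr; i++)
		indegree[i] = 1;
	for (i = 0; i < g->nr; i++)
		for (p = 0; p < TOY_MAX_PARENTS && g->parents[i][p] >= 0; p++)
			indegree[g->parents[i][p]]++;

	/* Tips (no children) are ready immediately. */
	for (i = 0; i < g->nr; i++)
		if (indegree[i] == 1)
			stack[top++] = (int)i;

	while (top) {
		int c = stack[--top];

		out[n++] = c;
		for (p = 0; p < TOY_MAX_PARENTS && g->parents[c][p] >= 0; p++) {
			int parent = g->parents[c][p];

			/* A parent becomes ready once all children are emitted. */
			if (--indegree[parent] == 1)
				stack[top++] = parent;
		}
	}
	return n;
}
```

Note that the first two loops touch every node before anything is emitted; that up-front cost is what the incremental walks described next try to avoid.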

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
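
As a worked illustration of this definition, here is a minimal sketch that computes generation numbers on a toy graph. The toy_graph type and toy_generation() are hypothetical; recursion with memoization is used only for brevity, and a real implementation on deep histories would want an explicit stack.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 16
#define TOY_MAX_PARENTS 2

/* parents[i] lists the parents of node i; -1 terminates a short list. */
struct toy_graph {
	size_t nr;
	int parents[TOY_MAX][TOY_MAX_PARENTS];
};

/*
 * Return the generation number of node 'c', caching results in 'gen'.
 * A cached value of 0 means "not computed yet", which is safe because
 * computed generation numbers start at one.
 */
static unsigned toy_generation(const struct toy_graph *g, int c, unsigned *gen)
{
	unsigned max_parent = 0;
	int p;

	assert(c >= 0 && (size_t)c < g->nr);
	if (gen[c])
		return gen[c];
	for (p = 0; p < TOY_MAX_PARENTS && g->parents[c][p] >= 0; p++) {
		unsigned pg = toy_generation(g, g->parents[c][p], gen);

		if (pg > max_parent)
			max_parent = pg;
	}
	/* A root has no parents, so max_parent stays 0 and the result is 1. */
	gen[c] = max_parent + 1;
	return gen[c];
}
```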

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0x and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following statement, which is weaker than what we usually expect
from generation numbers:

If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
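
A minimal sketch of how this statement prunes a reachability search, using a hypothetical toy graph with precomputed generation numbers (toy_can_reach() is not a Git function). Only the strict comparison is used, matching the weaker statement above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 16
#define TOY_MAX_PARENTS 2

/* parents[i] lists the parents of node i; -1 terminates a short list. */
struct toy_graph {
	size_t nr;
	int parents[TOY_MAX][TOY_MAX_PARENTS];
};

/*
 * Depth-first search from 'from' towards its ancestors, looking for 'to'.
 * 'seen' must be zero-initialized by the caller and have g->nr entries.
 */
static int toy_can_reach(const struct toy_graph *g, const unsigned *gen,
			 int from, int to, unsigned char *seen)
{
	int p;

	assert(from >= 0 && (size_t)from < g->nr);
	if (from == to)
		return 1;
	/* gen(from) < gen(to): 'from' cannot reach 'to', so stop here. */
	if (gen[from] < gen[to])
		return 0;
	if (seen[from])
		return 0;
	seen[from] = 1;
	for (p = 0; p < TOY_MAX_PARENTS && g->parents[from][p] >= 0; p++)
		if (toy_can_reach(g, gen, g->parents[from][p], to, seen))
			return 1;
	return 0;
}
```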

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.
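
The stopping rule shared by these walks can be sketched in isolation. The code below is a hypothetical illustration of the EXPLORE rule only: it keeps a queue of pending commits, always expands the one with the largest generation number, and pauses as soon as every queued commit is strictly below the requested cutoff. A linear scan stands in for the priority queue, and none of the names come from revision.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 16
#define TOY_MAX_PARENTS 2

/* parents[i] lists the parents of node i; -1 terminates a short list. */
struct toy_graph {
	size_t nr;
	int parents[TOY_MAX][TOY_MAX_PARENTS];
};

struct toy_walk {
	const struct toy_graph *g;
	const unsigned *gen;          /* precomputed generation numbers */
	int queue[TOY_MAX];           /* pending commits, unordered */
	size_t nr;
	unsigned char queued[TOY_MAX];
	unsigned char explored[TOY_MAX];
};

/* Queue a commit once; each commit enters the queue at most one time. */
static void toy_walk_add(struct toy_walk *w, int c)
{
	assert(c >= 0 && (size_t)c < w->g->nr);
	if (w->queued[c])
		return;
	w->queued[c] = 1;
	w->queue[w->nr++] = c;
}

/* Index of the queued commit with the largest generation (queue non-empty). */
static size_t toy_walk_max(const struct toy_walk *w)
{
	size_t i, best = 0;

	for (i = 1; i < w->nr; i++)
		if (w->gen[w->queue[i]] > w->gen[w->queue[best]])
			best = i;
	return best;
}

/*
 * Explore until every queued commit has a generation number strictly
 * below 'cutoff'. Returns how many commits this call explored.
 */
static size_t toy_explore_to_depth(struct toy_walk *w, unsigned cutoff)
{
	size_t explored = 0;

	while (w->nr) {
		size_t i = toy_walk_max(w);
		int c = w->queue[i];
		int p;

		if (w->gen[c] < cutoff)
			break;
		w->queue[i] = w->queue[--w->nr];
		w->explored[c] = 1;
		explored++;
		for (p = 0; p < TOY_MAX_PARENTS && w->g->parents[c][p] >= 0; p++)
			toy_walk_add(w, w->g->parents[c][p]);
	}
	return explored;
}
```

On a linear history of five commits, asking for cutoff 4 explores only the two newest commits; the rest stay queued until a later call lowers the cutoff. The INDEGREE and TOPO walks would each call into the walk beneath them in the same demand-driven way.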

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled for comparisons, such as
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD

[PATCH v2 4/6] revision.c: begin refactoring --topo-order logic

2018-09-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
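
The resulting decision can be summarized in a small sketch. The toy_revs struct and toy_pick_mode() below are hypothetical and model only the two flags discussed here; the real struct rev_info has many more options that can also set 'limited'.

```c
#include <assert.h>
#include <stddef.h>

enum toy_walk_mode {
	TOY_WALK_PLAIN,          /* neither limited nor topo-ordered */
	TOY_WALK_LIMITED,        /* limit_list() only */
	TOY_WALK_LIMITED_SORTED, /* limit_list(), then a full topological sort */
	TOY_WALK_TOPO            /* new path: topo_walk_info drives the walk */
};

/* Hypothetical two-flag model of the relevant rev_info settings. */
struct toy_revs {
	int topo_order;
	int limited;
};

static enum toy_walk_mode toy_pick_mode(const struct toy_revs *revs,
					int have_generation_numbers)
{
	int limited;

	assert(revs != NULL);
	limited = revs->limited;
	/* topo_order forces 'limited' only when generation numbers are absent. */
	if (revs->topo_order && !have_generation_numbers)
		limited = 1;
	if (limited)
		return revs->topo_order ? TOY_WALK_LIMITED_SORTED
					: TOY_WALK_LIMITED;
	return revs->topo_order ? TOY_WALK_TOPO : TOY_WALK_PLAIN;
}
```

A request that is limited for another reason, such as a range query, keeps the old path even when generation numbers are available.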

Signed-off-by: Derrick Stolee 
---
 revision.c | 42 ++
 revision.h |  4 
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct 
rev_info *revs, struct s
if (revs->diffopt.objfind)
revs->simplify_history = 0;
 
-   if (revs->topo_order)
+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;
 
if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));
+
+   limit_list(revs);
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(&commit->object.oid));
+   }
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(&revs->commits);
if (revs->no_walk)
return 0;
-   if (revs->limited)
+   if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-   if (revs->topo_order)
-   sort_in_topological_order(&revs->commits, revs->sort_order);
+   if (revs->topo_order)
+   sort_in_topological_order(&revs->commits, revs->sort_order);
+   } else if (revs->topo_order)
+   init_topo_walk(revs);
if 

[PATCH v2 2/6] test-reach: add run_three_modes method

2018-09-17 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a simple translation on a more general
run_three_modes method that executes the given command and compares
the actual output against the expected output.

While inspecting this code, I realized that the final test for
'commit_contains --tag' is silently dropping the '--tag' argument.
It should be quoted to include both.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..1b18e12a4e 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   $1 <input >actual &&
test_cmp expect actual &&
cp commit-graph-full .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   $1 <input >actual &&
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
-   test-tool reach $1 <input >actual &&
+   $1 <input >actual &&
test_cmp expect actual
 }
 
+test_three_modes () {
+   run_three_modes "test-tool reach $1"
+}
+
 test_expect_success 'ref_newer:miss' '
cat >input <<-\EOF &&
A:commit-5-7
@@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' '
EOF
echo "commit_contains(_,A,X,_):1" >expect &&
test_three_modes commit_contains &&
-   test_three_modes commit_contains --tag
+   test_three_modes "commit_contains --tag"
 '
 
 test_expect_success 'commit_contains:miss' '
-- 
gitgitgadget



[PATCH v2 01/11] multi-pack-index: add 'verify' verb

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The multi-pack-index builtin writes multi-pack-index files, and
uses a 'write' verb to do so. Add a 'verify' verb that checks that
this file matches the contents of the pack-indexes it replaces.

The current implementation is a no-op, but will be extended in
small increments in later commits.
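
One way to structure a verifier that grows in small increments is to let each check record failures in a shared flag and keep going, so a single run reports every problem. The sketch below is a hypothetical illustration of that pattern; toy_report() and the sortedness check are assumptions made for the example, not the checks added later in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Set by any failing check; the verifier's return value. */
static int toy_verify_error;

/* Record a failure and describe it, without aborting the run. */
static void toy_report(const char *msg, size_t pos)
{
	toy_verify_error = 1;
	fprintf(stderr, "error: %s at position %lu\n", msg, (unsigned long)pos);
}

/*
 * Example check: ids must be strictly increasing. Every violation is
 * reported, and the function returns 0 on success or 1 on any failure.
 */
static int toy_verify_sorted(const unsigned *ids, size_t nr)
{
	size_t i;

	assert(ids || !nr);
	toy_verify_error = 0;
	for (i = 1; i < nr; i++)
		if (ids[i - 1] >= ids[i])
			toy_report("ids out of order", i);
	return toy_verify_error;
}
```

Later checks could be added as further loops before the final return, each calling the same reporting helper.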

Signed-off-by: Derrick Stolee 
---
 Documentation/git-multi-pack-index.txt | 10 ++
 builtin/multi-pack-index.c |  4 +++-
 midx.c | 13 +
 midx.h |  1 +
 t/t5319-multi-pack-index.sh|  8 
 5 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-multi-pack-index.txt 
b/Documentation/git-multi-pack-index.txt
index 1f97e79912..f7778a2c85 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -27,6 +27,10 @@ write::
When given as the verb, write a new MIDX file to
`<dir>/packs/multi-pack-index`.
 
+verify::
+   When given as the verb, verify the contents of the MIDX file
+   at `<dir>/packs/multi-pack-index`.
+
 
 EXAMPLES
 
@@ -43,6 +47,12 @@ $ git multi-pack-index write
 $ git multi-pack-index --object-dir  write
 ---
 
+* Verify the MIDX file for the packfiles in the current .git folder.
++
+---
+$ git multi-pack-index verify
+---
+
 
 SEE ALSO
 
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 2633efd95d..fca70f8e4f 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -5,7 +5,7 @@
 #include "midx.h"
 
 static char const * const builtin_multi_pack_index_usage[] = {
-   N_("git multi-pack-index [--object-dir=<dir>] write"),
+   N_("git multi-pack-index [--object-dir=<dir>] (write|verify)"),
NULL
 };
 
@@ -42,6 +42,8 @@ int cmd_multi_pack_index(int argc, const char **argv,
 
if (!strcmp(argv[0], "write"))
return write_midx_file(opts.object_dir);
+   if (!strcmp(argv[0], "verify"))
+   return verify_midx_file(opts.object_dir);
 
die(_("unrecognized verb: %s"), argv[0]);
 }
diff --git a/midx.c b/midx.c
index f3e8dbc108..b253bed517 100644
--- a/midx.c
+++ b/midx.c
@@ -928,3 +928,16 @@ void clear_midx_file(const char *object_dir)
 
free(midx);
 }
+
+int verify_midx_error;
+
+int verify_midx_file(const char *object_dir)
+{
+   struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
+   verify_midx_error = 0;
+
+   if (!m)
+   return 0;
+
+   return verify_midx_error;
+}
diff --git a/midx.h b/midx.h
index a210f1af2a..ce80b91c68 100644
--- a/midx.h
+++ b/midx.h
@@ -43,5 +43,6 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
 
 int write_midx_file(const char *object_dir);
 void clear_midx_file(const char *object_dir);
+int verify_midx_file(const char *object_dir);
 
 #endif
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 6f56b38674..1c4e0e6d31 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -150,6 +150,10 @@ test_expect_success 'write midx with twelve packs' '
 
 compare_results_with_midx "twelve packs"
 
+test_expect_success 'verify multi-pack-index success' '
+   git multi-pack-index verify --object-dir=$objdir
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
@@ -214,4 +218,8 @@ test_expect_success 'force some 64-bit offsets with pack-objects' '
midx_read_expect 1 63 5 objects64 " large-offsets"
 '
 
+test_expect_success 'verify multi-pack-index with 64-bit offsets' '
+   git multi-pack-index verify --object-dir=objects64
+'
+
 test_done
-- 
gitgitgadget



[PATCH v2 10/11] multi-pack-index: report progress during 'verify'

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When verifying a multi-pack-index, the only action that takes
significant time is checking the object offsets. For example,
to verify a multi-pack-index containing 6.2 million objects in
the Linux kernel repository takes 1.3 seconds on my machine.
99% of that time is spent looking up object offsets in each of
the packfiles and comparing them to the multi-pack-index offset.

Add a progress indicator for that section of the 'verify' verb.

Signed-off-by: Derrick Stolee 
---
 midx.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/midx.c b/midx.c
index 47e7e6113a..4d4c930522 100644
--- a/midx.c
+++ b/midx.c
@@ -7,6 +7,7 @@
 #include "object-store.h"
 #include "sha1-lookup.h"
 #include "midx.h"
+#include "progress.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -940,6 +941,7 @@ static void midx_report(const char *fmt, ...)
 int verify_midx_file(const char *object_dir)
 {
uint32_t i;
+   struct progress *progress = NULL;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
verify_midx_error = 0;
 
@@ -971,6 +973,7 @@ int verify_midx_file(const char *object_dir)
i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
}
 
+   progress = start_progress(_("Verifying object offsets"), m->num_objects);
for (i = 0; i < m->num_objects; i++) {
struct object_id oid;
struct pack_entry e;
@@ -995,7 +998,10 @@ int verify_midx_file(const char *object_dir)
if (m_offset != p_offset)
midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
i, oid_to_hex(&oid), m_offset, p_offset);
+
+   display_progress(progress, i + 1);
}
+   stop_progress(&progress);
 
return verify_midx_error;
 }
-- 
gitgitgadget



[PATCH v2 04/11] multi-pack-index: verify packname order

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The final check we make while loading a multi-pack-index is that
the packfile names are in lexicographical order. Make this error
be a die() instead.

In order to test this condition, we need multiple packfiles.
Earlier in t5319-multi-pack-index.sh, we tested the interaction with
'git repack' but this limits us to one packfile in our object dir.
Move these repack tests until after the 'verify' tests.
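
The ordering rule enforced above can be sketched outside of git as a
small standalone check. This is only an illustration, not the code in
midx.c: the helper name first_out_of_order() is hypothetical, and it
returns an index instead of calling die().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): return the index of the first
 * name that is not strictly greater than its predecessor, or 0 if the
 * whole list is in strict lexicographic order. Index 0 can never be out
 * of order, so it doubles as the "no error" value.
 */
static size_t first_out_of_order(const char **names, size_t nr)
{
	size_t i;

	for (i = 1; i < nr; i++) {
		/* "<= 0" rejects duplicate names as well as descending pairs */
		if (strcmp(names[i], names[i - 1]) <= 0)
			return i;
	}
	return 0;
}
```

Overwriting one byte of a pack name with "z", as the new test does, is
enough to make a later name compare lower than its predecessor and
trip this kind of check.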

Signed-off-by: Derrick Stolee 
---
 midx.c  |  6 ++
 t/t5319-multi-pack-index.sh | 10 ++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/midx.c b/midx.c
index 8b054b39ab..e655a15aed 100644
--- a/midx.c
+++ b/midx.c
@@ -157,12 +157,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
cur_pack_name += strlen(cur_pack_name) + 1;
 
-   if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) {
-   error(_("multi-pack-index pack names out of order: '%s' before '%s'"),
+   if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0)
+   die(_("multi-pack-index pack names out of order: '%s' before '%s'"),
  m->pack_names[i - 1],
  m->pack_names[i]);
-   goto cleanup_fail;
-   }
}
 
return m;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index c54b6e7188..01a3cd6b00 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -181,6 +181,11 @@ MIDX_BYTE_CHUNK_COUNT=6
 MIDX_HEADER_SIZE=12
 MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE
 MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4))
+MIDX_NUM_CHUNKS=5
+MIDX_CHUNK_LOOKUP_WIDTH=12
+MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \
+$MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH))
+MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -212,6 +217,11 @@ test_expect_success 'verify invalid chunk offset' '
"invalid chunk offset (too large)"
 '
 
+test_expect_success 'verify packnames out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \
+   "pack names out of order"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH v2 09/11] multi-pack-index: verify object offsets

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The 'git multi-pack-index verify' command must verify the object
offsets stored in the multi-pack-index are correct. There are two
ways the offset chunk can be incorrect: the pack-int-id and the
object offset.

Replace the BUG() statement with a die() statement, now that we
may hit a bad pack-int-id during a 'verify' command on a corrupt
multi-pack-index, and it is covered by a test.
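
As a rough sketch of the two failure modes described above: each entry
in the offset chunk is 8 bytes wide (MIDX_OFFSET_WIDTH in the tests),
a big-endian 32-bit pack-int-id followed by a big-endian 32-bit offset.
The names below (sketch_entry, sketch_check_entry, and so on) are
hypothetical and this is not the code in midx.c; offsets that need the
large-offset chunk are only flagged, not resolved.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ENTRY_WIDTH 8              /* bytes per offset-chunk entry */
#define SKETCH_LARGE_OFFSET 0x80000000u   /* high bit: look in large-offset chunk */

enum sketch_entry_status {
	SKETCH_ENTRY_OK,
	SKETCH_ENTRY_BAD_PACK_ID,
	SKETCH_ENTRY_BAD_OFFSET,
	SKETCH_ENTRY_NEEDS_LARGE_OFFSET
};

struct sketch_entry {
	uint32_t pack_int_id;
	uint32_t offset32;
};

/* Read a 32-bit big-endian value from p. */
static uint32_t sketch_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode entry number pos of the offset chunk. */
static struct sketch_entry sketch_read_entry(const unsigned char *chunk,
					     uint32_t pos)
{
	const unsigned char *p = chunk + (size_t)pos * SKETCH_ENTRY_WIDTH;
	struct sketch_entry e;

	e.pack_int_id = sketch_be32(p);
	e.offset32 = sketch_be32(p + 4);
	return e;
}

/* Compare one entry against the pack count and the pack-index offset. */
static enum sketch_entry_status sketch_check_entry(const struct sketch_entry *e,
						   uint32_t num_packs,
						   uint32_t pack_index_offset)
{
	if (e->pack_int_id >= num_packs)
		return SKETCH_ENTRY_BAD_PACK_ID;
	if (e->offset32 & SKETCH_LARGE_OFFSET)
		return SKETCH_ENTRY_NEEDS_LARGE_OFFSET;
	if (e->offset32 != pack_index_offset)
		return SKETCH_ENTRY_BAD_OFFSET;
	return SKETCH_ENTRY_OK;
}
```

The tests in this patch corrupt byte 2 of an entry to hit the first
case and byte 6 to hit the offset mismatch.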

Signed-off-by: Derrick Stolee 
---
 midx.c  | 29 -
 t/t5319-multi-pack-index.sh | 27 +++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 80094c02a7..47e7e6113a 100644
--- a/midx.c
+++ b/midx.c
@@ -197,7 +197,8 @@ int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id)
struct strbuf pack_name = STRBUF_INIT;
 
if (pack_int_id >= m->num_packs)
-   BUG("bad pack-int-id");
+   die(_("bad pack-int-id: %u (%u total packs"),
+   pack_int_id, m->num_packs);
 
if (m->packs[pack_int_id])
return 0;
@@ -970,5 +971,31 @@ int verify_midx_file(const char *object_dir)
i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
}
 
+   for (i = 0; i < m->num_objects; i++) {
+   struct object_id oid;
+   struct pack_entry e;
+   off_t m_offset, p_offset;
+
+   nth_midxed_object_oid(&oid, m, i);
+   if (!fill_midx_entry(&oid, &e, m)) {
+   midx_report(_("failed to load pack entry for oid[%d] = %s"),
+   i, oid_to_hex(&oid));
+   continue;
+   }
+
+   if (open_pack_index(e.p)) {
+   midx_report(_("failed to load pack-index for packfile %s"),
+   e.p->pack_name);
+   break;
+   }
+
+   m_offset = e.offset;
+   p_offset = find_pack_entry_one(oid.hash, e.p);
+
+   if (m_offset != p_offset)
+   midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64),
+   i, oid_to_hex(&oid), m_offset, p_offset);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index a968b9a959..828c240389 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -176,6 +176,7 @@ test_expect_success 'verify bad signature' '
 '
 
 HASH_LEN=20
+NUM_OBJECTS=74
 MIDX_BYTE_VERSION=4
 MIDX_BYTE_OID_VERSION=5
 MIDX_BYTE_CHUNK_COUNT=6
@@ -192,6 +193,10 @@ MIDX_OID_FANOUT_WIDTH=4
 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1))
 MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH))
 MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN))
+MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN))
+MIDX_OFFSET_WIDTH=8
+MIDX_BYTE_PACK_INT_ID=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 2))
+MIDX_BYTE_OFFSET=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 6))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -243,6 +248,16 @@ test_expect_success 'verify oid lookup out of order' '
"oid lookup out of order"
 '
 
+test_expect_success 'verify incorrect pack-int-id' '
+   corrupt_midx_and_verify $MIDX_BYTE_PACK_INT_ID "\07" $objdir \
+   "bad pack-int-id"
+'
+
+test_expect_success 'verify incorrect offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \
+   "incorrect object offset"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
@@ -310,4 +325,16 @@ test_expect_success 'verify multi-pack-index with 64-bit offsets' '
git multi-pack-index verify --object-dir=objects64
 '
 
+NUM_OBJECTS=63
+MIDX_OFFSET_OID_FANOUT=$((MIDX_OFFSET_PACKNAMES + 54))
+MIDX_OFFSET_OID_LOOKUP=$((MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH))
+MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN))
+MIDX_OFFSET_LARGE_OFFSETS=$(($MIDX_OFFSET_OBJECT_OFFSETS + $NUM_OBJECTS * $MIDX_OFFSET_WIDTH))
+MIDX_BYTE_LARGE_OFFSET=$(($MIDX_OFFSET_LARGE_OFFSETS + 3))
+
+test_expect_success 'verify incorrect 64-bit offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_LARGE_OFFSET "\07" objects64 \
+   "incorrect object offset"
+'
+
 test_done
-- 
gitgitgadget



[PATCH v2 08/11] multi-pack-index: fix 32-bit vs 64-bit size check

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When loading a 64-bit offset, we intend to check that off_t can store
the resulting offset. However, the condition accidentally checks the
32-bit offset to see if it is smaller than a 64-bit value. Fix it,
and this will be covered by a test in the 'git multi-pack-index verify'
command in a later commit.
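
To make the mistake concrete, here is a small standalone sketch of
the two conditions. The function names are hypothetical, and off_t is
taken from the POSIX header <sys/types.h>. The old condition compares
the size of the 32-bit field, so it is true on every platform; the
fixed one depends on how wide off_t actually is.

```c
#include <stdint.h>
#include <sys/types.h>	/* off_t (POSIX) */

/* Old condition: sizeof(uint32_t) < sizeof(uint64_t) is always true. */
static int old_check_fires(void)
{
	uint32_t offset32 = 0;

	return sizeof(offset32) < sizeof(uint64_t);
}

/* Fixed condition: true only when off_t cannot hold a 64-bit offset. */
static int new_check_fires(void)
{
	return sizeof(off_t) < sizeof(uint64_t);
}
```

With the old condition, any entry that needed the large-offset chunk
would have reached the die() call, even where off_t is 64 bits wide.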

Signed-off-by: Derrick Stolee 
---
 midx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 06d5cfc826..80094c02a7 100644
--- a/midx.c
+++ b/midx.c
@@ -236,7 +236,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
offset32 = get_be32(offset_data + sizeof(uint32_t));
 
if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
-   if (sizeof(offset32) < sizeof(uint64_t))
+   if (sizeof(off_t) < sizeof(uint64_t))
die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
 
offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
-- 
gitgitgadget



[PATCH v2 05/11] multi-pack-index: verify missing pack

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  | 16 
 t/t5319-multi-pack-index.sh |  5 +
 2 files changed, 21 insertions(+)

diff --git a/midx.c b/midx.c
index e655a15aed..a02b19efc1 100644
--- a/midx.c
+++ b/midx.c
@@ -926,13 +926,29 @@ void clear_midx_file(const char *object_dir)
 
 int verify_midx_error;
 
+static void midx_report(const char *fmt, ...)
+{
+   va_list ap;
+   verify_midx_error = 1;
+   va_start(ap, fmt);
+   vfprintf(stderr, fmt, ap);
+   fprintf(stderr, "\n");
+   va_end(ap);
+}
+
 int verify_midx_file(const char *object_dir)
 {
+   uint32_t i;
struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
verify_midx_error = 0;
 
if (!m)
return 0;
 
+   for (i = 0; i < m->num_packs; i++) {
+   if (prepare_midx_pack(m, i))
+   midx_report("failed to load pack in position %d", i);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 01a3cd6b00..0a566afb05 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -222,6 +222,11 @@ test_expect_success 'verify packnames out of order' '
"pack names out of order"
 '
 
+test_expect_success 'verify packnames out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "a" $objdir \
+   "failed to load pack"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH v2 11/11] fsck: verify multi-pack-index

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When core.multiPackIndex is true, we may have a multi-pack-index
in our object directory. Add calls to 'git multi-pack-index verify'
at the end of 'git fsck' if so.

Signed-off-by: Derrick Stolee 
---
 builtin/fsck.c  | 18 ++
 t/t5319-multi-pack-index.sh | 13 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 63c8578cc1..06eb421720 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -848,5 +848,23 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
}
}
 
+   if (!git_config_get_bool("core.multipackindex", &i) && i) {
+   struct child_process midx_verify = CHILD_PROCESS_INIT;
+   const char *midx_argv[] = { "multi-pack-index", "verify", NULL, NULL, NULL };
+
+   midx_verify.argv = midx_argv;
+   midx_verify.git_cmd = 1;
+   if (run_command(&midx_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+
+   prepare_alt_odb(the_repository);
+   for (alt =  the_repository->objects->alt_odb_list; alt; alt = alt->next) {
+   midx_argv[2] = "--object-dir";
+   midx_argv[3] = alt->path;
+   if (run_command(&midx_verify))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+   }
+
return errors_found;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 828c240389..bd8e841b81 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -160,12 +160,17 @@ corrupt_midx_and_verify() {
DATA="${2:-\0}" &&
OBJDIR=$3 &&
GREPSTR="$4" &&
+   COMMAND="$5" &&
+   if test -z "$COMMAND"
+   then
+   COMMAND="git multi-pack-index verify --object-dir=$OBJDIR"
+   fi &&
FILE=$OBJDIR/pack/multi-pack-index &&
chmod a+w $FILE &&
test_when_finished mv midx-backup $FILE &&
cp $FILE midx-backup &&
printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc &&
-   test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 2>test_err &&
+   test_must_fail $COMMAND 2>test_err &&
grep -v "^+" test_err >err &&
test_i18ngrep "$GREPSTR" err
 }
@@ -258,6 +263,12 @@ test_expect_success 'verify incorrect offset' '
"incorrect object offset"
 '
 
+test_expect_success 'git-fsck incorrect offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \
+   "incorrect object offset" \
+   "git -c core.multipackindex=true fsck"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget


[PATCH v2 07/11] multi-pack-index: verify oid lookup order

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  | 11 +++
 t/t5319-multi-pack-index.sh |  8 
 2 files changed, 19 insertions(+)

diff --git a/midx.c b/midx.c
index dfd26b4d74..06d5cfc826 100644
--- a/midx.c
+++ b/midx.c
@@ -959,5 +959,16 @@ int verify_midx_file(const char *object_dir)
i, oid_fanout1, oid_fanout2, i + 1);
}
 
+   for (i = 0; i < m->num_objects - 1; i++) {
+   struct object_id oid1, oid2;
+
+   nth_midxed_object_oid(&oid1, m, i);
+   nth_midxed_object_oid(&oid2, m, i + 1);
+
+   if (oidcmp(&oid1, &oid2) >= 0)
+   midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"),
+   i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 47a54e138d..a968b9a959 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -175,6 +175,7 @@ test_expect_success 'verify bad signature' '
"multi-pack-index signature"
 '
 
+HASH_LEN=20
 MIDX_BYTE_VERSION=4
 MIDX_BYTE_OID_VERSION=5
 MIDX_BYTE_CHUNK_COUNT=6
@@ -189,6 +190,8 @@ MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
 MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652))
 MIDX_OID_FANOUT_WIDTH=4
 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1))
+MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH))
+MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -235,6 +238,11 @@ test_expect_success 'verify oid fanout out of order' '
"oid fanout out of order"
 '
 
+test_expect_success 'verify oid lookup out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_OID_LOOKUP "\00" $objdir \
+   "oid lookup out of order"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH v2 03/11] multi-pack-index: verify corrupt chunk lookup table

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  |  3 +++
 t/t5319-multi-pack-index.sh | 13 +
 2 files changed, 16 insertions(+)

diff --git a/midx.c b/midx.c
index ec78254bb6..8b054b39ab 100644
--- a/midx.c
+++ b/midx.c
@@ -100,6 +100,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
 MIDX_CHUNKLOOKUP_WIDTH * i);
 
+   if (chunk_offset >= m->data_len)
+   die(_("invalid chunk offset (too large)"));
+
switch (chunk_id) {
case MIDX_CHUNKID_PACKNAMES:
m->chunk_pack_names = m->data + chunk_offset;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index e04b5f43a2..c54b6e7188 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -178,6 +178,9 @@ test_expect_success 'verify bad signature' '
 MIDX_BYTE_VERSION=4
 MIDX_BYTE_OID_VERSION=5
 MIDX_BYTE_CHUNK_COUNT=6
+MIDX_HEADER_SIZE=12
+MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE
+MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -199,6 +202,16 @@ test_expect_success 'verify extended chunk count' '
"terminating multi-pack-index chunk id appears earlier than expected"
 '
 
+test_expect_success 'verify missing required chunk' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_ID "\01" $objdir \
+   "missing required"
+'
+
+test_expect_success 'verify invalid chunk offset' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
+   "invalid chunk offset (too large)"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH v2 02/11] multi-pack-index: verify bad header

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

When verifying if a multi-pack-index file is valid, we want the
command to fail to signal an invalid file. Previously, we wrote
an error to stderr and continued as if we had no multi-pack-index.
Now, die() instead of error().

Add tests that check corrupted headers in a few ways:

* Bad signature
* Bad file version
* Bad hash version
* Truncated chunk count
* Extended chunk count
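
For orientation, the header checks exercised by these tests can be
sketched as a standalone function. The byte positions (version at
byte 4, hash version at byte 5, a 12-byte header) mirror the constants
used in the test script, and the signature value is the one defined in
midx.c. The expected hash version of 1 is an assumption, the names are
hypothetical, and the sketch returns a status code where the patch
calls die().

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SIGNATURE 0x4d494458u	/* "MIDX", as in midx.c */
#define SKETCH_VERSION 1		/* MIDX_VERSION in midx.c */
#define SKETCH_HASH_VERSION 1		/* assumed value of MIDX_HASH_VERSION */
#define SKETCH_HEADER_SIZE 12

enum sketch_header_status {
	SKETCH_HEADER_OK,
	SKETCH_HEADER_TOO_SHORT,
	SKETCH_HEADER_BAD_SIGNATURE,
	SKETCH_HEADER_BAD_VERSION,
	SKETCH_HEADER_BAD_HASH_VERSION
};

/* Validate the fixed-size header at the start of the file contents. */
static enum sketch_header_status sketch_check_header(const unsigned char *data,
						     size_t len)
{
	uint32_t signature;

	if (len < SKETCH_HEADER_SIZE)
		return SKETCH_HEADER_TOO_SHORT;

	/* bytes 0..3: big-endian signature */
	signature = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16) |
		    ((uint32_t)data[2] << 8) | (uint32_t)data[3];
	if (signature != SKETCH_SIGNATURE)
		return SKETCH_HEADER_BAD_SIGNATURE;
	if (data[4] != SKETCH_VERSION)		/* byte 4: file version */
		return SKETCH_HEADER_BAD_VERSION;
	if (data[5] != SKETCH_HASH_VERSION)	/* byte 5: hash version */
		return SKETCH_HEADER_BAD_HASH_VERSION;
	return SKETCH_HEADER_OK;
}
```

The chunk count at byte 6 is not validated here; in the tests, a
corrupted count is detected later, while the chunk lookup table is
read.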

Signed-off-by: Derrick Stolee 
---
 midx.c  | 18 +--
 t/t5319-multi-pack-index.sh | 46 -
 2 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/midx.c b/midx.c
index b253bed517..ec78254bb6 100644
--- a/midx.c
+++ b/midx.c
@@ -76,24 +76,18 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
m->local = local;
 
m->signature = get_be32(m->data);
-   if (m->signature != MIDX_SIGNATURE) {
-   error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
+   if (m->signature != MIDX_SIGNATURE)
+   die(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"),
  m->signature, MIDX_SIGNATURE);
-   goto cleanup_fail;
-   }
 
m->version = m->data[MIDX_BYTE_FILE_VERSION];
-   if (m->version != MIDX_VERSION) {
-   error(_("multi-pack-index version %d not recognized"),
+   if (m->version != MIDX_VERSION)
+   die(_("multi-pack-index version %d not recognized"),
  m->version);
-   goto cleanup_fail;
-   }
 
hash_version = m->data[MIDX_BYTE_HASH_VERSION];
-   if (hash_version != MIDX_HASH_VERSION) {
-   error(_("hash version %u does not match"), hash_version);
-   goto cleanup_fail;
-   }
+   if (hash_version != MIDX_HASH_VERSION)
+   die(_("hash version %u does not match"), hash_version);
m->hash_len = MIDX_HASH_LEN;
 
m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS];
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 1c4e0e6d31..e04b5f43a2 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -154,6 +154,51 @@ test_expect_success 'verify multi-pack-index success' '
git multi-pack-index verify --object-dir=$objdir
 '
 
+# usage: corrupt_midx_and_verify <pos> <data> <objdir> <string>
+corrupt_midx_and_verify() {
+   POS=$1 &&
+   DATA="${2:-\0}" &&
+   OBJDIR=$3 &&
+   GREPSTR="$4" &&
+   FILE=$OBJDIR/pack/multi-pack-index &&
+   chmod a+w $FILE &&
+   test_when_finished mv midx-backup $FILE &&
+   cp $FILE midx-backup &&
+   printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc &&
+   test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 2>test_err &&
+   grep -v "^+" test_err >err &&
+   test_i18ngrep "$GREPSTR" err
+}
+
+test_expect_success 'verify bad signature' '
+   corrupt_midx_and_verify 0 "\00" $objdir \
+   "multi-pack-index signature"
+'
+
+MIDX_BYTE_VERSION=4
+MIDX_BYTE_OID_VERSION=5
+MIDX_BYTE_CHUNK_COUNT=6
+
+test_expect_success 'verify bad version' '
+   corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
+   "multi-pack-index version"
+'
+
+test_expect_success 'verify bad OID version' '
+   corrupt_midx_and_verify $MIDX_BYTE_OID_VERSION "\02" $objdir \
+   "hash version"
+'
+
+test_expect_success 'verify truncated chunk count' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
+   "missing required"
+'
+
+test_expect_success 'verify extended chunk count' '
+   corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
+   "terminating multi-pack-index chunk id appears earlier than expected"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
@@ -191,7 +236,6 @@ test_expect_success 'multi-pack-index in an alternate' '
 
 compare_results_with_midx "with alternate (remote midx)"
 
-
 # usage: corrupt_data <file> <pos> [<data>]
 corrupt_data () {
file=$1
-- 
gitgitgadget



[PATCH v2 06/11] multi-pack-index: verify oid fanout order

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

Signed-off-by: Derrick Stolee 
---
 midx.c  | 9 +
 t/t5319-multi-pack-index.sh | 8 
 2 files changed, 17 insertions(+)

diff --git a/midx.c b/midx.c
index a02b19efc1..dfd26b4d74 100644
--- a/midx.c
+++ b/midx.c
@@ -950,5 +950,14 @@ int verify_midx_file(const char *object_dir)
midx_report("failed to load pack in position %d", i);
}
 
+   for (i = 0; i < 255; i++) {
+   uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]);
+   uint32_t oid_fanout2 = ntohl(m->chunk_oid_fanout[i + 1]);
+
+   if (oid_fanout1 > oid_fanout2)
+   midx_report(_("oid fanout out of order: fanout[%d] = %"PRIx32" > %"PRIx32" = fanout[%d]"),
+   i, oid_fanout1, oid_fanout2, i + 1);
+   }
+
return verify_midx_error;
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 0a566afb05..47a54e138d 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -186,6 +186,9 @@ MIDX_CHUNK_LOOKUP_WIDTH=12
 MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \
 $MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH))
 MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
+MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652))
+MIDX_OID_FANOUT_WIDTH=4
+MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1))
 
 test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
@@ -227,6 +230,11 @@ test_expect_success 'verify packnames out of order' '
"failed to load pack"
 '
 
+test_expect_success 'verify oid fanout out of order' '
+   corrupt_midx_and_verify $MIDX_BYTE_OID_FANOUT_ORDER "\01" $objdir \
+   "oid fanout out of order"
+'
+
 test_expect_success 'repack removes multi-pack-index' '
test_path_is_file $objdir/pack/multi-pack-index &&
git repack -adf &&
-- 
gitgitgadget



[PATCH v2 00/11] Add 'git multi-pack-index verify' command

2018-09-13 Thread Derrick Stolee via GitGitGadget
The multi-pack-index file provides faster lookups in repos with many
packfiles by duplicating the information from multiple pack-indexes into a
single file. This series allows us to verify a multi-pack-index using 'git
multi-pack-index verify' and 'git fsck' (when core.multiPackIndex is true).

The pattern for the tests is similar to that found in t5318-commit-graph.sh.

During testing, I found a bug in how we check for the size of off_t (we are
not actually checking off_t, but instead uint32_t). See "multi-pack-index:
fix 32-bit vs 64-bit size check".

Thanks to Ævar [1], I added a commit that provides progress updates when
checking object offsets.

Based on ds/multi-pack-index

[1] https://public-inbox.org/git/20180904202729.13900-1-ava...@gmail.com/T/#u

Derrick Stolee (11):
  multi-pack-index: add 'verify' verb
  multi-pack-index: verify bad header
  multi-pack-index: verify corrupt chunk lookup table
  multi-pack-index: verify packname order
  multi-pack-index: verify missing pack
  multi-pack-index: verify oid fanout order
  multi-pack-index: verify oid lookup order
  multi-pack-index: fix 32-bit vs 64-bit size check
  multi-pack-index: verify object offsets
  multi-pack-index: report progress during 'verify'
  fsck: verify multi-pack-index

 Documentation/git-multi-pack-index.txt |  10 ++
 builtin/fsck.c |  18 
 builtin/multi-pack-index.c |   4 +-
 midx.c | 113 
 midx.h |   1 +
 t/t5319-multi-pack-index.sh| 136 -
 6 files changed, 262 insertions(+), 20 deletions(-)


base-commit: 6a22d521260f86dff8fe6f23ab329cebb62ba4f0
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-34%2Fderrickstolee%2Fmidx%2Fverify-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-34/derrickstolee/midx/verify-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/34

Range-diff vs v1:

  1:  8dc38afe2b !  1:  d8ffd84d67 multi-pack-index: add 'verify' verb
 @@ -47,7 +47,7 @@
   
   static char const * const builtin_multi_pack_index_usage[] = {
  - N_("git multi-pack-index [--object-dir=] write"),
 -+ N_("git multi-pack-index [--object-dir=] [write|verify]"),
 ++ N_("git multi-pack-index [--object-dir=] (write|verify)"),
NULL
   };
   
  2:  787e1fb616 !  2:  9590895830 multi-pack-index: verify bad header
 @@ -61,10 +61,10 @@
   
  +# usage: corrupt_midx_and_verify
  +corrupt_midx_and_verify() {
 -+ POS=$1
 -+ DATA="${2:-\0}"
 -+ OBJDIR=$3
 -+ GREPSTR="$4"
 ++ POS=$1 &&
 ++ DATA="${2:-\0}" &&
 ++ OBJDIR=$3 &&
 ++ GREPSTR="$4" &&
  + FILE=$OBJDIR/pack/multi-pack-index &&
  + chmod a+w $FILE &&
  + test_when_finished mv midx-backup $FILE &&
  3:  b385aa2abf =  3:  2448173844 multi-pack-index: verify corrupt chunk 
lookup table
  4:  37ee24c82b =  4:  947241bfdc multi-pack-index: verify packname order
  5:  b747da415c =  5:  4058867380 multi-pack-index: verify missing pack
  6:  58e5c09468 =  6:  ea1c522702 multi-pack-index: verify oid fanout order
  7:  b21772d054 =  7:  511791de91 multi-pack-index: verify oid lookup order
  8:  b08d3f0055 =  8:  210649bf83 multi-pack-index: fix 32-bit vs 64-bit size 
check
  9:  e1498aea45 !  9:  ef20193d59 multi-pack-index: verify object offsets
 @@ -21,7 +21,8 @@
   
if (pack_int_id >= m->num_packs)
  - BUG("bad pack-int-id");
 -+ die(_("bad pack-int-id"));
 ++ die(_("bad pack-int-id: %u (%u total packs"),
 ++ pack_int_id, m->num_packs);
   
if (m->packs[pack_int_id])
return 0;
 10:  acf8cfd632 = 10:  29ebc17161 multi-pack-index: report progress during 
'verify'
 11:  09d16aff20 ! 11:  406c88b456 fsck: verify multi-pack-index
 @@ -40,14 +40,14 @@
  --- a/t/t5319-multi-pack-index.sh
  +++ b/t/t5319-multi-pack-index.sh
  @@
 -  DATA="${2:-\0}"
 -  OBJDIR=$3
 -  GREPSTR="$4"
 -+ COMMAND="$5"
 +  DATA="${2:-\0}" &&
 +  OBJDIR=$3 &&
 +  GREPSTR="$4" &&
 ++ COMMAND="$5" &&
  + if test -z "$COMMAND"
  + then
  + COMMAND="git multi-pack-index verify --object-dir=$OBJDIR"
 -+ fi
 ++ fi &&
FILE=$OBJDIR/pack/multi-pack-index &&
chmod a+w $FILE &&
test_when_finished mv midx-backup $FILE &&

-- 
gitgitgadget


[PATCH v2 1/1] commit-reach: properly peel tags

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

The can_all_from_reach_with_flag() algorithm was refactored in 4fbcca4e
"commit-reach: make can_all_from_reach... linear" but incorrectly
assumed that all objects provided were commits. During a fetch
negotiation, ok_to_give_up() in upload-pack.c may provide unpeeled tags
to the 'from' array. The current code creates a segfault.

Add a direct call to can_all_from_reach_with_flag() in 'test-tool reach'
and add a test in t6600-test-reach.sh that demonstrates this segfault.

Correct the issue by peeling tags when investigating the initial list
of objects in the 'from' array.
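
The idea behind the fix can be sketched with toy types. Nothing below
is git's object API: toy_object, toy_peel() and collect_commits() are
hypothetical stand-ins. The point is that each entry is peeled through
any chain of tags first, anything that does not end in a commit is
skipped, and a separate counter tracks how many commits were actually
collected.

```c
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TAG, TOY_BLOB };

struct toy_object {
	enum toy_type type;
	struct toy_object *tagged;	/* target; only meaningful for TOY_TAG */
};

/* Follow tag -> target links until a non-tag object (or NULL) is reached. */
static struct toy_object *toy_peel(struct toy_object *o)
{
	while (o && o->type == TOY_TAG)
		o = o->tagged;
	return o;
}

/*
 * Copy the peeled commits from 'from' into 'out' (which must have room
 * for nr entries) and return how many were stored. Entries that are
 * NULL or do not peel to a commit are skipped rather than cast.
 */
static size_t collect_commits(struct toy_object **from, size_t nr,
			      struct toy_object **out)
{
	size_t i, nr_commits = 0;

	for (i = 0; i < nr; i++) {
		struct toy_object *peeled = toy_peel(from[i]);

		if (!peeled || peeled->type != TOY_COMMIT)
			continue;
		out[nr_commits++] = peeled;
	}
	return nr_commits;
}
```

Because some entries may be skipped, later loops have to run over the
collected count rather than the length of the input array, which is
why the patch replaces from->nr with nr_commits in the sort and
cleanup steps.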

Signed-off-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 commit-reach.c| 33 ++---
 t/helper/test-reach.c | 22 +-
 t/t6600-test-reach.sh | 30 --
 3 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 86715c103c..4048a2132a 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -544,20 +544,39 @@ int can_all_from_reach_with_flag(struct object_array *from,
 {
struct commit **list = NULL;
int i;
+   int nr_commits;
int result = 1;
 
ALLOC_ARRAY(list, from->nr);
+   nr_commits = 0;
for (i = 0; i < from->nr; i++) {
-   list[i] = (struct commit *)from->objects[i].item;
+   struct object *from_one = from->objects[i].item;
 
-   if (parse_commit(list[i]) ||
-   list[i]->generation < min_generation)
-   return 0;
+   if (!from_one || from_one->flags & assign_flag)
+   continue;
+
+   from_one = deref_tag(the_repository, from_one,
+"a from object", 0);
+   if (!from_one || from_one->type != OBJ_COMMIT) {
+   /* no way to tell if this is reachable by
+* looking at the ancestry chain alone, so
+* leave a note to ourselves not to worry about
+* this object anymore.
+*/
+   from->objects[i].item->flags |= assign_flag;
+   continue;
+   }
+
+   list[nr_commits] = (struct commit *)from_one;
+   if (parse_commit(list[nr_commits]) ||
+   list[nr_commits]->generation < min_generation)
+   return 0; /* is this a leak? */
+   nr_commits++;
}
 
-   QSORT(list, from->nr, compare_commits_by_gen);
+   QSORT(list, nr_commits, compare_commits_by_gen);
 
-   for (i = 0; i < from->nr; i++) {
+   for (i = 0; i < nr_commits; i++) {
/* DFS from list[i] */
struct commit_list *stack = NULL;
 
@@ -600,7 +619,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
}
 
 cleanup:
-   for (i = 0; i < from->nr; i++) {
+   for (i = 0; i < nr_commits; i++) {
clear_commit_marks(list[i], RESULT);
clear_commit_marks(list[i], assign_flag);
}
diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index eb21103998..08d2ea68e8 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -31,6 +31,7 @@ int cmd__reach(int ac, const char **av)
struct object_id oid_A, oid_B;
struct commit *A, *B;
struct commit_list *X, *Y;
+   struct object_array X_obj = OBJECT_ARRAY_INIT;
struct commit **X_array;
int X_nr, X_alloc;
struct strbuf buf = STRBUF_INIT;
@@ -49,7 +50,8 @@ int cmd__reach(int ac, const char **av)
 
while (strbuf_getline(&buf, stdin) != EOF) {
struct object_id oid;
-   struct object *o;
+   struct object *orig;
+   struct object *peeled;
struct commit *c;
if (buf.len < 3)
continue;
@@ -57,14 +59,14 @@ int cmd__reach(int ac, const char **av)
if (get_oid_committish(buf.buf + 2, &oid))
die("failed to resolve %s", buf.buf + 2);
 
-   o = parse_object(r, &oid);
-   o = deref_tag_noverify(o);
+   orig = parse_object(r, &oid);
+   peeled = deref_tag_noverify(orig);
 
-   if (!o)
+   if (!peeled)
die("failed to load commit for input %s resulting in oid %s\n",
buf.buf, oid_to_hex(&oid));
 
-   c = object_as_type(r, o, OBJ_COMMIT, 0);
+   c = object_as_type(r, peeled, OBJ_COMMIT, 0);
 
if (!c)
die("failed to load commit for input %s resulting in oid %s\n",
@@ -85,6 +87,7 @@ int cmd__reach(int ac, const char **av)
commit_list_insert(c, &X);
ALLOC_GROW(X_array, X_nr + 1, X_alloc);
X_array[X_nr++] = c;
+  

[PATCH v2 0/1] Properly peel tags in can_all_from_reach_with_flags()

2018-09-13 Thread Derrick Stolee via GitGitGadget
As Peff reported [1], the refactored can_all_from_reach_with_flags() method
does not properly peel tags. Since the helper method can_all_from_reach()
and code in t/helper/test-reach.c all peel tags before getting to this
method, it is not super-simple to create a test that demonstrates this.

I modified t/helper/test-reach.c to allow calling
can_all_from_reach_with_flags() directly, and added a test in
t6600-test-reach.sh that demonstrates the segfault without the fix.
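
To make "peeling" concrete, here is a minimal, self-contained sketch. It uses a toy object model rather than Git's real structures: `toy_object`, `toy_peel()` and `toy_as_commit()` are hypothetical names invented for this illustration and do not exist in commit-reach.c. The point is only that a tag may point at another tag, so the chain has to be followed before the result is treated as a commit, and that non-commit results have to be rejected rather than cast.

```c
#include <stddef.h>

/* Toy object kinds; Git's real object types carry much more state. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
	struct toy_object *tagged; /* target of a tag, NULL otherwise */
};

/* Follow a chain of tags until a non-tag object (or NULL) is found. */
struct toy_object *toy_peel(struct toy_object *o)
{
	while (o && o->type == TOY_TAG)
		o = o->tagged;
	return o;
}

/*
 * Return the commit an input resolves to, or NULL if it does not
 * resolve to a commit. Treating an unpeeled tag as if it were a
 * commit is the kind of mistake this check guards against.
 */
struct toy_object *toy_as_commit(struct toy_object *o)
{
	o = toy_peel(o);
	if (!o || o->type != TOY_COMMIT)
		return NULL;
	return o;
}
```

In this sketch a tag of a tag of a commit resolves to the commit, while a tag of a blob resolves to NULL and can be skipped by the caller.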

For V2, I compared the loop that inspects the 'from' commits in commit
ba3ca1edce "commit-reach: move can_all_from_reach_with_flags" to the version
here and got the following diff:

3c3
< if (from_one->flags & assign_flag)
---
> if (!from_one || from_one->flags & assign_flag)
5c5,7
< from_one = deref_tag(the_repository, from_one, "a from object", 0);
---
>
> from_one = deref_tag(the_repository, from_one,
>  "a from object", 0);
14a17,22
>
> list[nr_commits] = (struct commit *)from_one;
> if (parse_commit(list[nr_commits]) ||
> list[nr_commits]->generation < min_generation)
> return 0; /* is this a leak? */
> nr_commits++;

This diff includes the early termination we had before 'deref_tag' and the
comment for why we can ignore non-commit objects.

[1] https://public-inbox.org/git/0bf9103c-9377-506b-7ad7-e5273d8e9...@gmail.com/T/#u

Derrick Stolee (1):
  commit-reach: properly peel tags

 commit-reach.c| 33 ++---
 t/helper/test-reach.c | 22 +-
 t/t6600-test-reach.sh | 30 --
 3 files changed, 71 insertions(+), 14 deletions(-)


base-commit: 6621c838743812aaba96e55cfec8524ea1144c2d
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-39%2Fderrickstolee%2Ftag-fix-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-39/derrickstolee/tag-fix-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/39

Range-diff vs v1:

 1:  948e28 ! 1:  4bf21204dd commit-reach: properly peel tags
 @@ -36,9 +36,17 @@
  - if (parse_commit(list[i]) ||
  - list[i]->generation < min_generation)
  - return 0;
 ++ if (!from_one || from_one->flags & assign_flag)
 ++ continue;
 ++
  + from_one = deref_tag(the_repository, from_one,
  +  "a from object", 0);
  + if (!from_one || from_one->type != OBJ_COMMIT) {
 ++ /* no way to tell if this is reachable by
 ++  * looking at the ancestry chain alone, so
 ++  * leave a note to ourselves not to worry about
 ++  * this object anymore.
 ++  */
  + from->objects[i].item->flags |= assign_flag;
  + continue;
  + }
 @@ -187,7 +195,7 @@
  + echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
  + test_three_modes can_all_from_reach_with_flag
  +'
 -+ 
 ++
   test_expect_success 'commit_contains:hit' '
cat >input <<-\EOF &&
A:commit-7-7

-- 
gitgitgadget


[PATCH v2 1/1] contrib: add coverage-diff script

2018-09-13 Thread Derrick Stolee via GitGitGadget
From: Derrick Stolee 

We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has
an X.c.gcov file containing the coverage counts for every line, and "#####"
at every uncovered line.

There have been a few bugs in recent patches that would have been caught
if the test suite covered those blocks (including a few of mine). I want
to work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks in error
cases do not need to be covered.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by the
test suite. The output uses 'git blame -c' format so you can find the commits
responsible and view the line numbers for quick access to the context.
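
The key bookkeeping step is mapping each added line of a unified diff to its line number in the new version of the file. The script does this with grep and awk; the following is only a rough sketch of the same idea in C, under the assumption that the input is a single-file unified diff fed one line at a time. The names `hunk_tracker` and `track_diff_line()` are hypothetical and exist only for this illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Tracks the current line number in the new version of a file while
 * walking a unified diff one line at a time. Zero means that no hunk
 * header has been seen yet.
 */
struct hunk_tracker {
	long next_new_line;
};

/*
 * Feed one diff line. Returns the new-file line number if the line is
 * an added ('+') line inside a hunk, or 0 for anything else.
 */
long track_diff_line(struct hunk_tracker *t, const char *line)
{
	if (!strncmp(line, "@@ ", 3)) {
		/* Hunk header "@@ -a[,b] +c[,d] @@": only c is needed. */
		const char *plus = strstr(line, " +");
		long new_start;

		if (plus && sscanf(plus, " +%ld", &new_start) == 1)
			t->next_new_line = new_start;
		return 0;
	}
	if (t->next_new_line <= 0)
		return 0; /* still in the file header, before any hunk */

	switch (line[0]) {
	case '+':
		return t->next_new_line++;
	case ' ':
		t->next_new_line++; /* context line: exists in both files */
		return 0;
	default:
		return 0; /* '-' and '\' lines do not exist in the new file */
	}
}
```

Intersecting the numbers returned here with the line numbers that gcov reports as unexecuted gives the "new and uncovered" set that the script then looks up in the blame output.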

Signed-off-by: Derrick Stolee 
---
 contrib/coverage-diff.sh | 63 
 1 file changed, 63 insertions(+)
 create mode 100755 contrib/coverage-diff.sh

diff --git a/contrib/coverage-diff.sh b/contrib/coverage-diff.sh
new file mode 100755
index 00..0f226f038c
--- /dev/null
+++ b/contrib/coverage-diff.sh
@@ -0,0 +1,63 @@
+#!/bin/sh
+
+# Usage: 'contrib/coverage-diff.sh <version1> <version2>'
+# Outputs a list of new lines in version2 compared to version1 that are
+# not covered by the test suite. Assumes you ran
+# 'make coverage-test coverage-report' from root first, so .gcov files exist.
+
+V1=$1
+V2=$2
+
+diff_lines () {
+   while read line
+   do
+   if echo $line | grep -q -e "^@@ -([0-9]+)(,[0-9]+)? \\+([0-9]+)(,[0-9]+)? @@.*"
+   then
+   line_num=$(echo $line \
+   | awk 'match($0, "@@ -([0-9]+)(,[0-9]+)? \\+([0-9]+)(,[0-9]+)? @@.*", m) { print m[3] }')
+   else
+   echo "$line_num:$line"
+   if ! echo $line | grep -q -e "^-"
+   then
+   line_num=$(($line_num + 1))
+   fi
+   fi
+   done
+}
+
+files=$(git diff --raw $V1 $V2 \
+   | grep \.c$ \
+   | awk 'NF>1{print $NF}')
+
+for file in $files
+do
+   git diff $V1 $V2 -- $file \
+   | diff_lines \
+   | grep ":+" \
+   | sed 's/:/ /g' \
+   | awk '{print $1}' \
+   | sort \
+   >new_lines.txt
+
+   hash_file=$(echo $file | sed "s/\//\#/")
+   cat "$hash_file.gcov" \
+   | grep \#\#\#\#\# \
+   | sed 's/#: //g' \
+   | sed 's/\:/ /g' \
+   | awk '{print $1}' \
+   | sort \
+   >uncovered_lines.txt
+
+   comm -12 uncovered_lines.txt new_lines.txt \
+   | sed -e 's/$/\)/' \
+   | sed -e 's/^/\t/' \
+   >uncovered_new_lines.txt
+
+   grep -q '[^[:space:]]' < uncovered_new_lines.txt && \
+   echo $file && \
+   git blame -c $file \
+   | grep -f uncovered_new_lines.txt
+
+   rm -f new_lines.txt uncovered_lines.txt uncovered_new_lines.txt
+done
+
-- 
gitgitgadget


[PATCH v2 0/1] contrib: Add script to show uncovered "new" lines

2018-09-13 Thread Derrick Stolee via GitGitGadget
We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an
X.c.gcov file containing the coverage counts for every line, and "#####" at
every uncovered line.
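
As an illustration of what those files look like to a tool, here is a small sketch in C that recognizes an uncovered line in a .gcov file. It assumes the usual gcov text layout of `count:lineno:source`, where a count of `#####` means the line was never executed. The function name `uncovered_line_number()` is made up for this example and is not part of any script in this series.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Given one line of a .gcov file ("<count>:<lineno>:<source>"),
 * return the source line number if the count field is "#####"
 * (never executed), or 0 otherwise.
 */
long uncovered_line_number(const char *gcov_line)
{
	const char *p = gcov_line;

	/* The count field is right-aligned, so skip its padding. */
	while (*p == ' ' || *p == '\t')
		p++;
	if (strncmp(p, "#####:", 6) != 0)
		return 0;
	p += 6;
	/* A well-formed line has a second colon after the line number. */
	if (!strchr(p, ':'))
		return 0;
	/* strtol() skips the padding in front of the number. */
	return strtol(p, NULL, 10);
}
```

Lines whose count field is a number or `-` (no executable code) yield 0, so only the never-executed lines are reported.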

There have been a few bugs in recent patches that would have been caught if
the test suite covered those blocks (including a few of mine). I want to
work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks in error
cases do not need to be covered.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by
the test suite. For example, I ran this against the 'jch' branch (d3c0046)
versus 'next' (dd90340) and got the following output:

builtin/commit.c
859fdc0c3cf (Derrick Stolee 2018-08-29 05:49:04 -0700   1657)   
write_commit_graph_reachable(get_object_directory(), 0);
builtin/rev-list.c
250edfa8c87 (Harald Nordgren 2018-04-18 23:05:35 +0200   431)
bisect_flags |= BISECT_FIND_ALL;
builtin/worktree.c
e5353bef550 (Eric Sunshine  2018-08-28 17:20:19 -0400   60) 
error_errno(_("failed to delete '%s'"), sb.buf);
e19831c94f9 (Eric Sunshine  2018-08-28 17:20:23 -0400   251)
die(_("unable to re-add worktree '%s'"), path);
68a6b3a1bd4 (Eric Sunshine  2018-08-28 17:20:24 -0400   793)
die(_("cannot move a locked working tree, lock reason: %s\nuse 'move -f 
-f' to override or unlock first"),
f4143101cbb (Eric Sunshine  2018-08-28 17:20:25 -0400   906)
die(_("cannot remove a locked working tree, lock reason: %s\nuse 
'remove -f -f' to override or unlock first"),
read-cache.c
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1754)   
const unsigned char *cp = (const unsigned char *)name;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1757)   
previous_len = previous_ce ? previous_ce->ce_namelen : 0;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1758)   
strip_len = decode_varint(&cp);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1759)   
if (previous_len < strip_len) {
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1760)   
if (previous_ce)
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1761)   
die(_("malformed name field in the index, near path 
'%s'"),
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1762)   
previous_ce->name);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1764)   
die(_("malformed name field in the index in the first 
path"));
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1766)   
copy_len = previous_len - strip_len;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1767)   
name = (const char *)cp;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1773)   
len += copy_len;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1794)   
if (copy_len)
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1795)   
memcpy(ce->name, previous_ce->name, copy_len);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1796)   
memcpy(ce->name + copy_len, name, len + 1 - copy_len);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1797)   
*ent_size = (name - ((char *)ondisk)) + len + 1 - copy_len;
remote-curl.c
c3b9bc94b9b (Elijah Newren  2018-09-05 10:03:07 -0700   181)
options.filter = xstrdup(value);

Using this 'git blame' output, we can quickly inspect whether the uncovered
lines are appropriate. For instance:

 1. The line in builtin/commit.c is due to writing the commit-graph file
when GIT_TEST_COMMIT_GRAPH is enabled, which is not on by default in the
test suite. Being uncovered is expected here.


 2. The lines in builtin/worktree.c are all related to error conditions.
This is acceptable.


 3. The line in builtin/rev-list.c is a flag replacement in a block that is
otherwise unchanged. It must not be covered by the test suite normally.
This could be worth adding a test to ensure the new logic maintains old
behavior.


 4. The lines in 

[PATCH 0/1] contrib: Add script to show uncovered "new" lines

2018-09-12 Thread Derrick Stolee via GitGitGadget
We have coverage targets in our Makefile for using gcov to display line
coverage based on our test suite. The way I like to do it is to run:

make coverage-test
make coverage-report

This leaves the repo in a state where every X.c file that was covered has an
X.c.gcov file containing the coverage counts for every line, and "#####" at
every uncovered line.

There have been a few bugs in recent patches that would have been caught if
the test suite covered those blocks (including a few of mine). I want to
work towards a "sensible" amount of coverage on new topics. In my opinion,
this means that any logic should be covered, but the 'die()' blocks in error
cases do not need to be covered.

It is important to not measure the coverage of the codebase by what old code
is not covered. To help, I created the 'contrib/coverage-diff.sh' script.
After creating the coverage statistics at a version (say, 'topic') you can
then run

contrib/coverage-diff.sh base topic

to see the lines added between 'base' and 'topic' that are not covered by
the test suite. For example, I ran this against the 'jch' branch (d3c0046)
versus 'next' (dd90340) and got the following output:

builtin/commit.c
859fdc0c3cf (Derrick Stolee 2018-08-29 05:49:04 -0700   1657)   
write_commit_graph_reachable(get_object_directory(), 0);
builtin/rev-list.c
250edfa8c87 (Harald Nordgren 2018-04-18 23:05:35 +0200   431)
bisect_flags |= BISECT_FIND_ALL;
builtin/worktree.c
e5353bef550 (Eric Sunshine  2018-08-28 17:20:19 -0400   60) 
error_errno(_("failed to delete '%s'"), sb.buf);
e19831c94f9 (Eric Sunshine  2018-08-28 17:20:23 -0400   251)
die(_("unable to re-add worktree '%s'"), path);
68a6b3a1bd4 (Eric Sunshine  2018-08-28 17:20:24 -0400   793)
die(_("cannot move a locked working tree, lock reason: %s\nuse 'move -f 
-f' to override or unlock first"),
f4143101cbb (Eric Sunshine  2018-08-28 17:20:25 -0400   906)
die(_("cannot remove a locked working tree, lock reason: %s\nuse 
'remove -f -f' to override or unlock first"),
read-cache.c
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1754)   
const unsigned char *cp = (const unsigned char *)name;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1757)   
previous_len = previous_ce ? previous_ce->ce_namelen : 0;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1758)   
strip_len = decode_varint(&cp);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1759)   
if (previous_len < strip_len) {
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1760)   
if (previous_ce)
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1761)   
die(_("malformed name field in the index, near path 
'%s'"),
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1762)   
previous_ce->name);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1764)   
die(_("malformed name field in the index in the first 
path"));
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1766)   
copy_len = previous_len - strip_len;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1767)   
name = (const char *)cp;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1773)   
len += copy_len;
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1794)   
if (copy_len)
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1795)   
memcpy(ce->name, previous_ce->name, copy_len);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1796)   
memcpy(ce->name + copy_len, name, len + 1 - copy_len);
67922a3 (Nguyễn Thái Ngọc Duy   2018-09-02 15:19:33 +0200   1797)   
*ent_size = (name - ((char *)ondisk)) + len + 1 - copy_len;
remote-curl.c
c3b9bc94b9b (Elijah Newren  2018-09-05 10:03:07 -0700   181)
options.filter = xstrdup(value);

Using this 'git blame' output, we can quickly inspect whether the uncovered
lines are appropriate. For instance:

 1. The line in builtin/commit.c is due to writing the commit-graph file
when GIT_TEST_COMMIT_GRAPH is enabled, which is not on by default in the
test suite. Being uncovered is expected here.


 2. The lines in builtin/worktree.c are all related to error conditions.
This is acceptable.


 3. The line in builtin/rev-list.c is a flag replacement in a block that is
otherwise unchanged. It must not be covered by the test suite normally.
This could be worth adding a test to ensure the new logic maintains old
behavior.


 4. The lines in 
