from:"Derrick Stolee"

Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

2018-12-07 Thread Derrick Stolee


On 12/6/2018 6:36 PM, Jonathan Tan wrote:

AFAICT oid_object_info doesn't take advantage of the commit graph,
but just looks up the object header, which is still less than completely
parsing it. Then lookup_commit is overly strict, as it may return
NULL as when there still is a type mismatch (I don't think a mismatch
could happen here, as both rely on just the object store, and not the
commit graph.), so this would be just defensive programming for
the sake of it. I dunno.

 struct commit *c;

 if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
 (c = lookup_commit(revs->repo, oid)) &&
 !repo_parse_commit(revs->repo, c))
 object = (struct object *) c;
 else
 object = parse_object(revs->repo, oid);

I like this way better - I'll do it in the next version.


If we do _not_ have a commit-graph or if the commit-graph does not have
that commit, this will have the same performance problem, right?

Should we instead create a direct dependence on the commit-graph, and try
to parse the oid from the graph directly? If it succeeds, then we learn
that the object is a commit, in addition to all of the parsing work. This
means we could avoid oid_object_info() loading data if we succeed. We
would fall back to parse_object() if it fails.

I was thinking this should be a simple API call to parse_commit_in_graph(),
but that requires a struct commit filled with an oid, which is not the
best idea if we don't actually know it is a commit yet.

The approach I recommend would then be more detailed:

1. Modify find_commit_in_graph() to take a struct object_id instead of a
   struct commit. This helps find the integer position in the graph. That
   position can be used in fill_commit_in_graph() to load the commit
   contents. Keep find_commit_in_graph() static as it should not be a
   public function.

2. Create a public function with prototype

struct commit *try_parse_commit_from_graph(struct repository *r, struct 
object_id *oid)


   that returns a commit struct fully parsed if and only if the repository
   has that oid. It can call find_commit_in_graph(), then 
lookup_commit() and

   fill_commit_in_graph() to create the commit and parse the data.

3. In replace of the snippet above, do:

    struct commit *c;

    if ((c = try_parse_commit_from_graph(revs->repo, oid))
        object = (struct object *)c;
    else
        object = parse_object(revs->repo, oid);

A similar pattern _could_ be used in parse_object(), but I don't recommend
doing this pattern unless we have a reasonable suspicion that we are going
to parse commits more often than other objects. (It adds an O(log(# 
commits))

binary search to each object.)

A final thought: consider making this "try the commit graph first, but fall
back to parse_object()" a library function with a name like

    struct object *parse_probably_commit(struct repository *r, struct 
object_id *oid)


so other paths that are parsing a lot of commits (but also maybe tags) could
use the logic.

Thanks!
-Stolee

Re: [PATCH v2 2/3] commit-graph: fix buffer read-overflow

2018-12-07 Thread Derrick Stolee


On 12/6/2018 3:20 PM, Josh Steadmon wrote:

+
+# usage: corrupt_and_zero_graph_then_verify   
 
+# Manipulates the commit-graph file at  by inserting the 
data,
+# then zeros the file starting at . Finally, runs
+# 'git commit-graph verify' and places the output in the file 'err'. Tests 
'err'
+# for the given string.
+corrupt_and_zero_graph_then_verify() {


This method is very similar to to 'corrupt_graph_and_verify()', the only 
difference being the zero_pos, which zeroes the graph.


Could it instead be a modification of corrupt_graph_and_verify() where 
$4 is interpreted as zero_pos, and if it is blank we don't do the 
truncation?



+test_expect_success 'detect truncated graph' '
+   corrupt_and_zero_graph_then_verify $GRAPH_BYTE_CHUNK_COUNT "\xff" \
+   $GRAPH_CHUNK_LOOKUP_OFFSET "chunk lookup table entry missing"
+'
+


Thanks for this! I think it's valuable to keep explicit tests around 
that were discovered from your fuzz tests. Specifically, I can repeat 
the test when I get around to the next file format.


Thanks,
-Stolee

Re: [PATCH 2/2] commit-graph: fix buffer read-overflow

2018-12-06 Thread Derrick Stolee


On 12/5/2018 5:32 PM, Josh Steadmon wrote:
  
+		if (chunk_lookup + GRAPH_CHUNKLOOKUP_WIDTH > data + graph_size) {

+   error(_("chunk lookup table entry missing; graph file may be 
incomplete"));
+   free(graph);
+   return NULL;
+   }


Something I forgot earlier: there are several tests in 
t5318-commit-graph.sh that use 'git commit-graph verify' to ensure we 
hit these error conditions on a corrupted commit-graph file. Could you 
try adding a test there that looks for this error message?


Thanks,
-Stolee

Re: git, monorepos, and access control

2018-12-05 Thread Derrick Stolee


On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote:

On Wed, Dec 05 2018, Coiner, John wrote:


I'm an engineer with AMD. I'm looking at whether we could switch our
internal version control to a monorepo, possibly one based on git and
VFSForGit.

Has anyone looked at adding access control to git, at a per-directory
granularity? Is this a feature that the git community would possibly
welcome?

All of what you've described is possible to implement in git, but as far
as I know there's no existing implementation of it.

Microsoft's GVFS probably comes closest, and they're actively
upstreaming bits of that, but as far as I know that doesn't in any way
try to solve this "users XYZ can't even list such-and-such a tree"
problem.

(Avar has a lot of good ideas in his message, so I'm just going to add
on a few here.)

This directory-level security is not a goal for VFS for Git, and I don't
see itbecoming a priority as it breaks a number of design decisions we
made in our object storage and communication models.

The best I can think about when considering Git as an approach would be
to use submodules for your security-related content, and then have server-
side security for access to those repos. Of course, submodules are not
supported in VFS for Git, either.

The Gerrit service has _branch_ level security, which is related to the
reachability questions that a directory security would need. However,
the problem is quite different. Gerrit does have a lot of experience in
dealing with submodules, though, so that's probably a good place to
start.

Thanks,
-Stolee

Git Test Coverage Report (Friday Nov 30)

2018-11-30 Thread Derrick Stolee


Here is today's test coverage report.

Thanks,
-Stolee

[1] https://dev.azure.com/git/git/_build/results?buildId=277

---

pu: 5a1a9a96d55fbb80426189a921d7b6cc66564c78
jch: 71c29cabb7379fe9abaacbbbd1350268d0c18b4f
next: a9faaff8c120bf4783cb892c157871fe524b3608
master: 7068cbc4abac53d9c3675dfba81c1e97d25e8eeb
master@{1}: 7f4e64169352e03476b0ea64e7e2973669e491a2

Uncovered code in 'pu' not in 'jch'
--

builtin/blame.c
74e8221b52 builtin/blame.c    928) blame_date_width = sizeof("Thu Oct 19 
16:00");

74e8221b52 builtin/blame.c    929) break;

builtin/remote.c
b7f4e371e7 builtin/remote.c 1551) die(_("--save-to-push cannot be used 
with other options"));
b7f4e371e7 builtin/remote.c 1575) die(_("--save-to-push can only be used 
when only one url is defined"));


date.c
74e8221b52  113) die("Timestamp too large for this system: %"PRItime, time);
74e8221b52  216) if (tm->tm_mon == human_tm->tm_mon) {
74e8221b52  217) if (tm->tm_mday > human_tm->tm_mday) {
74e8221b52  219) } else if (tm->tm_mday == human_tm->tm_mday) {
74e8221b52  220) hide.date = hide.wday = 1;
74e8221b52  221) } else if (tm->tm_mday + 5 > human_tm->tm_mday) {
74e8221b52  223) hide.date = 1;
74e8221b52  231) gettimeofday(, NULL);
74e8221b52  232) show_date_relative(time, tz, , buf);
74e8221b52  233) return;
74e8221b52  246) hide.seconds = 1;
74e8221b52  247) hide.tz |= !hide.date;
74e8221b52  248) hide.wday = hide.time = !hide.year;
74e8221b52  262) strbuf_rtrim(buf);
74e8221b52  287) gettimeofday(, NULL);
74e8221b52  290) human_tz = local_time_tzoffset(now.tv_sec, _tm);
74e8221b52  886) static int auto_date_style(void)
74e8221b52  888) return (isatty(1) || pager_in_use()) ? DATE_HUMAN : 
DATE_NORMAL;

74e8221b52  909) return DATE_HUMAN;
74e8221b52  911) return auto_date_style();

protocol.c
24c10f7473  37) die(_("Unrecognized protocol version"));
24c10f7473  39) die(_("Unrecognized protocol_version"));

remote-curl.c
24c10f7473  403) return 0;

sha1-array.c
bba406749a 91) oidcpy([dst], [src]);

strbuf.c
10a40f5700  397) return 0;

submodule.c
e2419f7e30 1376) strbuf_release();
7454fe5cb6 1499) struct get_next_submodule_task *task = task_cb;
7454fe5cb6 1503) get_next_submodule_task_release(task);
7454fe5cb6 1530) return 0;
7454fe5cb6 1534) goto out;
7454fe5cb6 1549) return 0;

wrapper.c
5efde212fc  70) die("Out of memory, malloc failed (tried to allocate %" 
PRIuMAX " bytes)",
5efde212fc  73) error("Out of memory, malloc failed (tried to allocate 
%" PRIuMAX " bytes)",


Commits introducing uncovered code:
Anders Waldenborg  10a40f570: strbuf: separate callback for 
strbuf_expand:ing literals
Denton Liu  b7f4e371e: remote: add --save-to-push option to git 
remote set-url
Josh Steadmon  24c10f747: protocol: advertise multiple supported 
versions

Linus Torvalds  74e8221b5: Add 'human' date format
Martin Koegler  5efde212f: zlib.c: use size_t for size
Stefan Beller  7454fe5cb: fetch: try fetching submodules if needed 
objects were not fetched

Stefan Beller  bba406749: sha1-array: provide oid_array_filter
Stefan Beller  e2419f7e3: submodule: migrate get_next_submodule to 
use repository structs




Uncovered code in 'jch' not in 'next'


apply.c
0f086e6dca 3355) if (checkout_entry(ce, , NULL, NULL) ||
0f086e6dca 3356) lstat(ce->name, st))

builtin/branch.c
0ecb1fc726 builtin/branch.c 456) die(_("could not resolve HEAD"));
0ecb1fc726 builtin/branch.c 462) die(_("HEAD (%s) points outside of 
refs/heads/"), refname);


hex.c
47edb64997  93) char *sha1_to_hex_r(char *buffer, const unsigned char *sha1)
47edb64997  95) return hash_to_hex_algop_r(buffer, sha1, 
_algos[GIT_HASH_SHA1]);

47edb64997 116) char *hash_to_hex(const unsigned char *hash)
47edb64997 118) return hash_to_hex_algop(hash, the_hash_algo);

pathspec.c
22af33bece 671) name = to_free = xmemdupz(name, namelen);

read-cache.c
ee70c12820 1730) if (advice_unknown_index_extension) {
ee70c12820 1731) warning(_("ignoring optional %.4s index extension"), ext);
ee70c12820 1732) advise(_("This is likely due to the file having been 
written by a newer\n"


sequencer.c
18e711a162 2387) opts->quiet = 1;

sha1-file.c
2f90b9d9b4 sha1-file.c  172) int hash_algo_by_name(const char *name)
2f90b9d9b4 sha1-file.c  175) if (!name)
2f90b9d9b4 sha1-file.c  176) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  177) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  178) if (!strcmp(name, hash_algos[i].name))
2f90b9d9b4 sha1-file.c  179) return i;
2f90b9d9b4 sha1-file.c  180) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  183) int hash_algo_by_id(uint32_t format_id)
2f90b9d9b4 sha1-file.c  186) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  187) if (format_id == hash_algos[i].format_id)
2f90b9d9b4 sha1-file.c  188) return i;
2f90b9d9b4 sha1-file.c  189) return GIT_HASH_UNKNOWN;

tree.c
e092073d64 104) commit = lookup_commit(r, entry.oid);

Commits introducing uncovered code:

Re: [PATCH 3/5] pack-objects: add --sparse option

2018-11-30 Thread Derrick Stolee


On 11/29/2018 9:39 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


While _eventually_ we should make this opt-out, we shouldn't do that
until it has cooked a while.

I actually do not agree.  If the knob gives enough benefit, the
users will learn about it viva voce, and in a few more releases
we'll hear "enough users complain that they have to turn it on,
let's make it the default".  If that does not happen, the knob
does not deserve to be turned on in the first place.


That's a good philosophy. I'll keep it in mind.

Having said that, I won't be commenting on this shiny new toy before
the final.  I want to see more people help tying the loose ends and
give it final varnish to the upcoming release, as it seems to have
become rockier and larger release than we originally anticipated.


Yeah, no rush on this one. I just wanted to get some initial feedback 
about the idea.


Thanks,
-Stolee

[PATCH v2 5/6] pack-objects: create pack.useSparse setting

2018-11-29 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, 
void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget

[PATCH v2 0/6] Add a new "sparse" tree walk algorithm

2018-11-29 Thread Derrick Stolee via GitGitGadget

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Update in V2: 

 * Added GIT_TEST_PACK_SPARSE test option.
 * Fixed test breakages when GIT_TEST_PACK_SPARSE is enabled by adding null
   checks.

Derrick Stolee (6):
  revision: add mark_tree_uninteresting_sparse
  list-objects: consume sparse tree walk
  pack-objects: add --sparse option
  revision: implement sparse algorithm
  pack-objects: create pack.useSparse setting
  pack-objects: create GIT_TEST_PACK_SPARSE

 Documentation/git-pack-objects.txt |   9 +-
 bisect.c   |   2 +-
 builtin/pack-objects.c |  10 ++-
 builtin/rev-list.c |   2 +-
 http-push.c|   2 +-
 list-objects.c |  55 +++-
 list-objects.h |   4 +-
 revision.c | 121 +
 revision.h |   2 +
 t/README   |   4 +
 t/t5322-pack-objects-sparse.sh | 139 +
 11 files changed, 340 insertions(+), 10 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh


base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-89/derrickstolee/push/sparse-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/89

Range-diff vs v1:

 1:  b73b8de98c = 1:  60617681f7 revision: add mark_tree_uninteresting_sparse
 2:  9bf04c748b ! 2:  4527addacb list-objects: consume sparse tree walk
 @@ -116,6 +116,10 @@
  + for (parents = commit->parents; parents; parents = parents->next) {
  + struct commit *parent = parents->item;
  + struct tree *tree = get_commit_tree(parent);
 ++
 ++ if (!tree)
 ++ continue;
 ++
  + oidset_insert(set, >object.oid);
  +
  + if (!(parent->object.flags & UNINTERESTING))
 @@ -142,14 +146,14 @@
  +
for (list = revs->commits; list; list = list->next) {
struct commit *commit = list->item;
 -+ 
 + 
 +- if (commit->object.flags & UNINTERESTING) {
  + if (sparse) {
  + struct tree *tree = get_commit_tree(commit);
 -+ 
 ++
  + if (commit->object.flags & UNINTERESTING)
  + tree->object.flags |= UNINTERESTING;
 - 
 -- if (commit->object.flags & UNINTERESTING) {
 ++
  + oidset_insert(, >object.oid);
  + add_edge_parents(commit, revs, show_edge, );
  + } else if (commit->object.flags & UNINTERESTING) {
 @@ -189,3 +193,17 @@
   
   struct oidset;
   struct list_objects_filter_options;
 +
 +diff --git a/revision.c b/revision.c
 +--- a/revision.c
  b/revision.c
 +@@
 +  while ((oid = oidset_iter_next())) {
 +  struct tree *tree = lookup_tree(r, oid);
 + 
 ++ if (!tree)
 ++ continue;
 ++
 +

[PATCH v2 1/6] revision: add mark_tree_uninteresting_sparse

2018-11-29 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 22 ++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+struct oidset *set)
+{
+   struct object_id *oid;
+   struct oidset_iter iter;
+
+   oidset_iter_init(set, );
+   while ((oid = oidset_iter_next())) {
+   struct tree *tree = lookup_tree(r, oid);
+
+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_unintersting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }
+   }
+}
+
 struct commit_stack {
struct commit **items;
size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget

[PATCH v2 2/6] list-objects: consume sparse tree walk

2018-11-29 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits.

The mark_edges_unintersting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee 
---
 bisect.c   |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c |  2 +-
 http-push.c|  2 +-
 list-objects.c | 55 +++---
 list-objects.h |  4 ++-
 revision.c |  3 +++
 7 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
if (prepare_revision_walk(revs))
die("revision walk setup failed");
if (revs->tree_objects)
-   mark_edges_uninteresting(revs, NULL);
+   mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk())
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(, show_edge);
+   mark_edges_uninteresting(, show_edge, 0);
 
if (!fn_show_object)
fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (prepare_revision_walk())
die("revision walk setup failed");
if (revs.tree_objects)
-   mark_edges_uninteresting(, show_edge);
+   mark_edges_uninteresting(, show_edge, 0);
 
if (bisect_list) {
int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
pushing = 0;
if (prepare_revision_walk())
die("revision walk setup failed");
-   mark_edges_uninteresting(, NULL);
+   mark_edges_uninteresting(, NULL, 0);
objects_to_send = get_delta(, ref_lock);
finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..4fbdeca0a4 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,72 @@ static void mark_edge_parents_uninteresting(struct commit 
*commit,
}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+struct rev_info *revs,
+show_edge_fn show_edge,
+struct oidset *set)
+{
+   struct commit_list *parents;
+
+   for (parents = commit->parents; parents; parents = parents->next) {
+

[PATCH v2 6/6] pack-objects: create GIT_TEST_PACK_SPARSE

2018-11-29 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

Create a test variable GIT_TEST_PACK_SPARSE to enable the sparse
object walk algorithm by default during the test suite. Enabling
this variable ensures coverage in many interesting cases, such as
shallow clones, partial clones, and missing objects.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 1 +
 t/README   | 4 
 t/t5322-pack-objects-sparse.sh | 6 +++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 124b1bafc4..507d381153 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3331,6 +3331,7 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
 
read_replace_refs = 0;
 
+   sparse = git_env_bool("GIT_TEST_PACK_SPARSE", 0);
reset_pack_idx_option(_idx_opts);
git_config(git_pack_config, NULL);
 
diff --git a/t/README b/t/README
index 28711cc508..8b6dfe1864 100644
--- a/t/README
+++ b/t/README
@@ -342,6 +342,10 @@ GIT_TEST_INDEX_VERSION= exercises the index read/write 
code path
 for the index version specified.  Can be set to any valid version
 (currently 2, 3, or 4).
 
+GIT_TEST_PACK_SPARSE= if enabled will default the pack-objects
+builtin to use the sparse object walk. This can still be overridden by
+the --no-sparse command-line argument.
+
 GIT_TEST_PRELOAD_INDEX= exercises the preload-index code path
 by overriding the minimum number of cache entries required per thread.
 
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 8f5699bd91..e8cf41d1c6 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -36,7 +36,7 @@ test_expect_success 'setup repo' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -70,7 +70,7 @@ test_expect_success 'duplicate a folder from f3 and commit to 
topic1' '
 '
 
 test_expect_success 'non-sparse pack-objects' '
-   git pack-objects --stdout --revs nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
@@ -102,7 +102,7 @@ test_expect_success 'non-sparse pack-objects' '
topic1  \
topic1^{tree}   \
topic1:f3 | sort >expect_objects.txt &&
-   git pack-objects --stdout --revs nonsparse.pack &&
+   git pack-objects --stdout --revs --no-sparse nonsparse.pack &&
git index-pack -o nonsparse.idx nonsparse.pack &&
git show-index nonsparse_objects.txt &&
test_cmp expect_objects.txt nonsparse_objects.txt
-- 
gitgitgadget

[PATCH v2 4/6] revision: implement sparse algorithm

2018-11-29 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the peformance, we need to terminate our
recursion unless the oidset contains some intersting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:220
Walked (old alg): 22,804
Walked (new alg):129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 116 ++---
 t/t5322-pack-objects-sparse.sh |  21 --
 2 files changed, 121 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index f9eb6400f1..971f1bb095 100644
--- a/revision.c
+++ b/revision.c
@@ -99,29 +99,125 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+struct paths_and_oids {
+   struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+

[PATCH v2 3/6] pack-objects: add --sparse option

2018-11-29 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |   9 ++-
 builtin/pack-objects.c |   5 +-
 t/t5322-pack-objects-sparse.sh | 115 +
 3 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 40c825c381..ced2630eb3 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
[--local] [--incremental] [--window=] [--depth=]
[--revs [--unpacked | --all]] [--keep-pack=]
[--stdout [--filter=] | base-name]
-   [--shallow] [--keep-true-parents] < object-list
+   [--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,13 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack. This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk())
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(, show_edge, 0);
+   mark_edges_uninteresting(, show_edge, sparse);
 
if (!fn_show_object)
fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
  N_("unpack unreachable objects newer than "),
  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+   OPT_BOOL(0, "sparse", ,
+N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", ,
 N_("create thin packs")),
OPT_BOOL(0, "shallow", ,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+   test_commit initial &&
+   for i in $(test_seq 1 3)
+   do
+   mkdir f$i &&
+   for j in $(test_seq 1 3)
+   do
+   mkdir f$i/f$j &&
+   echo $j >f$i/f$j/data.txt
+   done
+   done &&
+   git add . &&
+   git commit -m "Initialized trees" &&
+   for i in $(test_seq 1 3)
+   do
+   git checkout -b topic$i master &&
+   echo change-$i >f$i/f$i/data.txt &&
+   git commit -a -m "Changed f$i/f$i/data.txt"
+   done &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic2
+   ^topic3
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f1   \
+   topic1:f1/f1\
+   topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+   git pack-objects --stdout --revs nonsparse.pack &&
+   git index-pack -o nonspar

Re: [PATCH 3/5] pack-objects: add --sparse option

2018-11-29 Thread Derrick Stolee


On 11/28/2018 5:11 PM, Stefan Beller wrote:

+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack. This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.

As a user, where do I find a discussion of these walking algorithms?
(i.e. how can I tell if I can really expect to gain performance benefits,
what are the tradeoffs? "If it were strictly better than the original,
it would be on by default" might be a thought of a user)


You're right that having this hidden as an opt-in config variable makes 
it hard to discover as a typical user.


I would argue that we should actually make the config setting true by 
default, and recommend that servers opt-out. Here are my reasons:


1. The vast majority of users are clients.

2. Client users are not likely to know about and tweak these settings.

3. Server users are more likely to keep an eye on the different knobs 
they can tweak.


4. Servers should use the reachability bitmaps, which don't use this 
logic anyway.


While _eventually_ we should make this opt-out, we shouldn't do that 
until it has cooked a while.


Thanks,
-Stolee

Re: [PATCH 0/5] Add a new "sparse" tree walk algorithm

2018-11-28 Thread Derrick Stolee


On 11/28/2018 5:18 PM, Ævar Arnfjörð Bjarmason wrote:

This is really interesting. I tested this with:

 diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
 index 124b1bafc4..5c7615f06c 100644
 --- a/builtin/pack-objects.c
 +++ b/builtin/pack-objects.c
 @@ -3143 +3143 @@ static void get_object_list(int ac, const char **av)
 -   mark_edges_uninteresting(, show_edge, sparse);
 +   mark_edges_uninteresting(, show_edge, 1);

To emulate having a GIT_TEST_* mode for this, which seems like a good
idea since it turned up a lot of segfaults in pack-objects. I wasn't
able to get a backtrace for that since it always happens indirectly, and
I didn't dig enough to see how to manually invoke it the right way.


Thanks for double-checking this. I had run a similar test in my 
prototype implementation, but over-simplified some code when rewriting 
it for submission (and then forgot to re-run that test).


Specifically, these null checks are important:

diff --git a/list-objects.c b/list-objects.c
index 9bb93d1640..7e864b4db8 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -232,6 +232,10 @@ static void add_edge_parents(struct commit *commit,
    for (parents = commit->parents; parents; parents = parents->next) {
    struct commit *parent = parents->item;
    struct tree *tree = get_commit_tree(parent);
+
+   if (!tree)
+   continue;
+
    oidset_insert(set, >object.oid);

    if (!(parent->object.flags & UNINTERESTING))
@@ -261,6 +265,8 @@ void mark_edges_uninteresting(struct rev_info *revs,

    if (sparse) {
    struct tree *tree = get_commit_tree(commit);
+   if (!tree)
+   continue;

    if (commit->object.flags & UNINTERESTING)
    tree->object.flags |= UNINTERESTING;

I will definitely include a GIT_TEST_* variable in v2.

Thanks,

-Stolee

[PATCH 4/5] revision: implement sparse algorithm

2018-11-28 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees to find
the interesting trees.

This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.

We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.

To actually improve the peformance, we need to terminate our
recursion unless the oidset contains some intersting trees and
some uninteresting trees. Technically, we only need one interesting
tree for this to speed up in most cases, but we also will not mark
anything UNINTERESTING if there are no uninteresting trees, so
that would be wasted effort.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. We add a test case that
demonstrates this as a way to prove that the --sparse option is
actually working.

Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

v2.19.0
^v2.19.0~10

 Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130

For the Linux repo, I used the following input:

v4.18
^v4.18~10

 Objects to pack:   518
Walked (old alg): 4,836
Walked (new alg):   188

The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.

 Objects to pack:220
Walked (old alg): 22,804
Walked (new alg):129

I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.

Signed-off-by: Derrick Stolee 
---
 revision.c | 111 ++---
 t/t5322-pack-objects-sparse.sh |  21 +--
 2 files changed, 116 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index 3a62c7c187..7e4bfe621a 100644
--- a/revision.c
+++ b/revision.c
@@ -99,26 +99,117 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+struct paths_and_oids {
+   struct string_list list;
+};
+
+static void paths_and_oids_init(struct paths_and_oids *po)
+{
+

[PATCH 0/5] Add a new "sparse" tree walk algorithm

2018-11-28 Thread Derrick Stolee via GitGitGadget

One of the biggest remaining pain points for users of very large
repositories is the time it takes to run 'git push'. We inspected some slow
pushes by our developers and found that the "Enumerating Objects" phase of a
push was very slow. This is unsurprising, because this is why reachability
bitmaps exist. However, reachability bitmaps are not available to us because
of the single pack-file requirement. The bitmap approach is intended for
servers anyway, and clients have a much different behavior pattern.

Specifically, clients are normally pushing a very small number of objects
compared to the entire working directory. A typical user changes only a
small cone of the working directory, so let's use that to our benefit.

Create a new "sparse" mode for 'git pack-objects' that uses the paths that
introduce new objects to direct our search into the reachable trees. By
collecting trees at each path, we can then recurse into a path only when
there are uninteresting and interesting trees at that path. This gains a
significant performance boost for small topics while presenting a
possibility of packing extra objects.

The main algorithm change is in patch 4, but is set up a little bit in
patches 1 and 2.

As demonstrated in the included test script, we see that the existing
algorithm can send extra objects due to the way we specify the "frontier".
But we can send even more objects if a user copies objects from one folder
to another. I say "copy" because a rename would (usually) change the
original folder and trigger a walk into that path, discovering the objects.

In order to benefit from this approach, the user can opt-in using the
pack.useSparse config setting. This setting can be overridden using the
'--no-sparse' option.

Derrick Stolee (5):
revision: add mark_tree_uninteresting_sparse
list-objects: consume sparse tree walk
pack-objects: add --sparse option
revision: implement sparse algorithm
pack-objects: create pack.useSparse setting

base-commit: a1598010f775d82b5adf12c29d0f5bc9b41434c6
Published-As:
https://github.com/gitgitgadget/git/releases/tags/pr-89%2Fderrickstolee%2Fpush%2Fsparse-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
pr-89/derrickstolee/push/sparse-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/89
--
gitgitgadget

[PATCH 3/5] pack-objects: add --sparse option

2018-11-28 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

Add a '--sparse' option flag to the pack-objects builtin. This
allows the user to specify that they want to use the new logic
for walking trees. This logic currently does not differ from the
existing output, but will in a later change.

Create a new test script, t5322-pack-objects-sparse.sh, to ensure
the object list that is selected matches what we expect. When we
update the logic to walk in a sparse fashion, the final test will
be updated to show the extra objects that are added.

Signed-off-by: Derrick Stolee 
---
 Documentation/git-pack-objects.txt |   9 ++-
 builtin/pack-objects.c |   5 +-
 t/t5322-pack-objects-sparse.sh | 115 +
 3 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index 40c825c381..ced2630eb3 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -14,7 +14,7 @@ SYNOPSIS
[--local] [--incremental] [--window=] [--depth=]
[--revs [--unpacked | --all]] [--keep-pack=]
[--stdout [--filter=] | base-name]
-   [--shallow] [--keep-true-parents] < object-list
+   [--shallow] [--keep-true-parents] [--sparse] < object-list
 
 
 DESCRIPTION
@@ -196,6 +196,13 @@ depth is 4095.
Add --no-reuse-object if you want to force a uniform compression
level on all data no matter the source.
 
+--sparse::
+   Use the "sparse" algorithm to determine which objects to include in
+   the pack. This can have significant performance benefits when computing
+   a pack to send a small change. However, it is possible that extra
+   objects are added to the pack-file if the included commits contain
+   certain types of direct renames.
+
 --thin::
Create a "thin" pack by omitting the common objects between a
sender and a receiver in order to reduce network transfer. This
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5f70d840a7..7d5b0735e3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -84,6 +84,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int sparse;
 static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
@@ -3135,7 +3136,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk())
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(, show_edge, 0);
+   mark_edges_uninteresting(, show_edge, sparse);
 
if (!fn_show_object)
fn_show_object = show_object;
@@ -3292,6 +3293,8 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
{ OPTION_CALLBACK, 0, "unpack-unreachable", NULL, N_("time"),
  N_("unpack unreachable objects newer than "),
  PARSE_OPT_OPTARG, option_parse_unpack_unreachable },
+   OPT_BOOL(0, "sparse", ,
+N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", ,
 N_("create thin packs")),
OPT_BOOL(0, "shallow", ,
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..81f6805bc3
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,115 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+   test_commit initial &&
+   for i in $(test_seq 1 3)
+   do
+   mkdir f$i &&
+   for j in $(test_seq 1 3)
+   do
+   mkdir f$i/f$j &&
+   echo $j >f$i/f$j/data.txt
+   done
+   done &&
+   git add . &&
+   git commit -m "Initialized trees" &&
+   for i in $(test_seq 1 3)
+   do
+   git checkout -b topic$i master &&
+   echo change-$i >f$i/f$i/data.txt &&
+   git commit -a -m "Changed f$i/f$i/data.txt"
+   done &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic2
+   ^topic3
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f1   \
+   topic1:f1/f1\
+   topic1:f1/f1/data.txt | sort >expect_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+   git pack-objects --stdout --revs nonsparse.pack &&
+   git index-pack -o nonspar

[PATCH 5/5] pack-objects: create pack.useSparse setting

2018-11-28 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The '--sparse' flag in 'git pack-objects' changes the algorithm
used to enumerate objects to one that is faster for individual
users pushing new objects that change only a small cone of the
working directory. The sparse algorithm is not recommended for a
server, which likely sends new objects that appear across the
entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a
'--sparse' flag all the way through four levels of run_command()
calls.

If the '--no-sparse' flag is set, then this config setting is
overridden.

Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c |  4 
 t/t5322-pack-objects-sparse.sh | 15 +++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7d5b0735e3..124b1bafc4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2711,6 +2711,10 @@ static int git_pack_config(const char *k, const char *v, 
void *cb)
use_bitmap_index_default = git_config_bool(k, v);
return 0;
}
+   if (!strcmp(k, "pack.usesparse")) {
+   sparse = git_config_bool(k, v);
+   return 0;
+   }
if (!strcmp(k, "pack.threads")) {
delta_search_threads = git_config_int(k, v);
if (delta_search_threads < 0)
diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
index 45dba6e014..8f5699bd91 100755
--- a/t/t5322-pack-objects-sparse.sh
+++ b/t/t5322-pack-objects-sparse.sh
@@ -121,4 +121,19 @@ test_expect_success 'sparse pack-objects' '
test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
+test_expect_success 'pack.useSparse enables algorithm' '
+   git config pack.useSparse true &&
+   git pack-objects --stdout --revs sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_sparse_objects.txt sparse_objects.txt
+'
+
+test_expect_success 'pack.useSparse overridden' '
+   git pack-objects --stdout --revs --no-sparse sparse.pack &&
+   git index-pack -o sparse.idx sparse.pack &&
+   git show-index sparse_objects.txt &&
+   test_cmp expect_objects.txt sparse_objects.txt
+'
+
 test_done
-- 
gitgitgadget

[PATCH 2/5] list-objects: consume sparse tree walk

2018-11-28 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

When creating a pack-file using 'git pack-objects --revs' we provide
a list of interesting and uninteresting commits. For example, a push
operation would make the local topic branch be interesting and the
known remote refs as uninteresting. We want to discover the set of
new objects to send to the server as a thin pack.

We walk these commits until we discover a frontier of commits such
that every commit walk starting at interesting commits ends in a root
commit or unintersting commit. We then need to discover which
non-commit objects are reachable from  uninteresting commits.

The mark_edges_unintersting() method in list-objects.c iterates on
the commit list and does the following:

* If the commit is UNINTERSTING, then mark its root tree and every
  object it can reach as UNINTERESTING.

* If the commit is interesting, then mark the root tree of every
  UNINTERSTING parent (and all objects that tree can reach) as
  UNINTERSTING.

At the very end, we repeat the process on every commit directly
given to the revision walk from stdin. This helps ensure we properly
cover shallow commits that otherwise were not included in the
frontier.

The logic to recursively follow trees is in the
mark_tree_uninteresting() method in revision.c. The algorithm avoids
duplicate work by not recursing into trees that are already marked
UNINTERSTING.

Add a new 'sparse' option to the mark_edges_uninteresting() method
that performs this logic in a slightly new way. As we iterate over
the commits, we add all of the root trees to an oidset. Then, call
mark_trees_uninteresting_sparse() on that oidset. Note that we
include interesting trees in this process. The current implementation
of mark_trees_unintersting_sparse() will walk the same trees as
the old logic, but this will be replaced in a later change.

The sparse option is not used by any callers at the moment, but
will be wired to 'git pack-objects' in the next change.

Signed-off-by: Derrick Stolee 
---
 bisect.c   |  2 +-
 builtin/pack-objects.c |  2 +-
 builtin/rev-list.c |  2 +-
 http-push.c|  2 +-
 list-objects.c | 51 ++
 list-objects.h |  4 +++-
 6 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/bisect.c b/bisect.c
index 487675c672..842f8b4b8f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -656,7 +656,7 @@ static void bisect_common(struct rev_info *revs)
if (prepare_revision_walk(revs))
die("revision walk setup failed");
if (revs->tree_objects)
-   mark_edges_uninteresting(revs, NULL);
+   mark_edges_uninteresting(revs, NULL, 0);
 }
 
 static void exit_if_skipped_commits(struct commit_list *tried,
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 411aefd687..5f70d840a7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3135,7 +3135,7 @@ static void get_object_list(int ac, const char **av)
 
if (prepare_revision_walk())
die(_("revision walk setup failed"));
-   mark_edges_uninteresting(, show_edge);
+   mark_edges_uninteresting(, show_edge, 0);
 
if (!fn_show_object)
fn_show_object = show_object;
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 2880ed37e3..9663cbfae0 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -543,7 +543,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (prepare_revision_walk())
die("revision walk setup failed");
if (revs.tree_objects)
-   mark_edges_uninteresting(, show_edge);
+   mark_edges_uninteresting(, show_edge, 0);
 
if (bisect_list) {
int reaches, all;
diff --git a/http-push.c b/http-push.c
index cd48590912..ea52d6f9f6 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1933,7 +1933,7 @@ int cmd_main(int argc, const char **argv)
pushing = 0;
if (prepare_revision_walk())
die("revision walk setup failed");
-   mark_edges_uninteresting(, NULL);
+   mark_edges_uninteresting(, NULL, 0);
objects_to_send = get_delta(, ref_lock);
finish_all_active_slots();
 
diff --git a/list-objects.c b/list-objects.c
index c41cc80db5..9bb93d1640 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -222,25 +222,68 @@ static void mark_edge_parents_uninteresting(struct commit 
*commit,
}
 }
 
-void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
+static void add_edge_parents(struct commit *commit,
+struct rev_info *revs,
+show_edge_fn show_edge,
+struct oidset *set)
+{
+   struct commit_list *parents;
+
+   for (parents = commit->parents; parents; parents = parents->next) {
+   struct commit *parent =

[PATCH 1/5] revision: add mark_tree_uninteresting_sparse

2018-11-28 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

In preparation for a new algorithm that walks fewer trees when
creating a pack from a set of revisions, create a method that
takes an oidset of tree oids and marks reachable objects as
UNINTERESTING.

The current implementation uses the existing
mark_tree_uninteresting to recursively walk the trees and blobs.
This will walk the same number of trees as the old mechanism.

There is one new assumption in this approach: we are also given
the oids of the interesting trees. This implementation does not
use those trees at the moment, but we will use them in a later
rewrite of this method.

Signed-off-by: Derrick Stolee 
---
 revision.c | 22 ++
 revision.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/revision.c b/revision.c
index 13e0519c02..3a62c7c187 100644
--- a/revision.c
+++ b/revision.c
@@ -99,6 +99,28 @@ void mark_tree_uninteresting(struct repository *r, struct 
tree *tree)
mark_tree_contents_uninteresting(r, tree);
 }
 
+void mark_trees_uninteresting_sparse(struct repository *r,
+struct oidset *set)
+{
+   struct object_id *oid;
+   struct oidset_iter iter;
+
+   oidset_iter_init(set, );
+   while ((oid = oidset_iter_next())) {
+   struct tree *tree = lookup_tree(r, oid);
+
+   if (tree->object.flags & UNINTERESTING) {
+   /*
+* Remove the flag so the next call
+* is not a no-op. The flag is added
+* in mark_tree_unintersting().
+*/
+   tree->object.flags ^= UNINTERESTING;
+   mark_tree_uninteresting(r, tree);
+   }
+   }
+}
+
 struct commit_stack {
struct commit **items;
size_t nr, alloc;
diff --git a/revision.h b/revision.h
index 7987bfcd2e..f828e91ae9 100644
--- a/revision.h
+++ b/revision.h
@@ -67,6 +67,7 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct oidset;
 struct topo_walk_info;
 
 struct rev_info {
@@ -327,6 +328,7 @@ void put_revision_mark(const struct rev_info *revs,
 
 void mark_parents_uninteresting(struct commit *commit);
 void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *set);
 
 void show_object_with_name(FILE *, struct object *, const char *);
 
-- 
gitgitgadget

Re: [BUG REPORT] t5322: demonstrate a pack-objects bug

2018-11-28 Thread Derrick Stolee


On 11/28/2018 2:45 PM, Derrick Stolee wrote:

I was preparing a new "sparse" algorithm for calculating the
interesting objects to send on push. The important steps happen
during 'git pack-objects', so I was creating test cases to see
how the behavior changes in narrow cases. Specifically, when
copying a directory across sibling directories (see test case),
the new logic would accidentally send that object as an extra.

However, I found a bug in the existing logic. The included test
demonstrates this during the final 'git index-pack' call. It
fails with the message

'fatal: pack has 1 unresolved delta'


I realize now that I've gone through this effort that the problem is me 
(of course it is).



+   git pack-objects --stdout --thin --revs nonsparse.pack 
&&


Since I'm packing thin packs, the index operation is complaining about 
deltas. So, I need to use a different mechanism to list the objects in 
the pack (say, 'git verify-pack -v').


Sorry for the noise!

Thanks,

-Stolee

[BUG REPORT] t5322: demonstrate a pack-objects bug

2018-11-28 Thread Derrick Stolee

I was preparing a new "sparse" algorithm for calculating the
interesting objects to send on push. The important steps happen
during 'git pack-objects', so I was creating test cases to see
how the behavior changes in narrow cases. Specifically, when
copying a directory across sibling directories (see test case),
the new logic would accidentally send that object as an extra.

However, I found a bug in the existing logic. The included test
demonstrates this during the final 'git index-pack' call. It
fails with the message

'fatal: pack has 1 unresolved delta'

It is probable that this is not a minimal test case, but happens
to be the test I had created before discovering the problem.

I compiled v2.17.0 and v2.12.0 as checks to see if I could find
a "good" commit with which to start a bisect, but both failed.
This is an old bug!

Signed-off-by: Derrick Stolee 
---
 t/t5322-pack-objects-sparse.sh | 94 ++
 1 file changed, 94 insertions(+)
 create mode 100755 t/t5322-pack-objects-sparse.sh

diff --git a/t/t5322-pack-objects-sparse.sh b/t/t5322-pack-objects-sparse.sh
new file mode 100755
index 00..36faa70fe9
--- /dev/null
+++ b/t/t5322-pack-objects-sparse.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='pack-objects object selection using sparse algorithm'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+   test_commit initial &&
+   for i in $(test_seq 1 3)
+   do
+   mkdir f$i &&
+   for j in $(test_seq 1 3)
+   do
+   mkdir f$i/f$j &&
+   echo $j >f$i/f$j/data.txt
+   done
+   done &&
+   git add . &&
+   git commit -m "Initialized trees" &&
+   for i in $(test_seq 1 3)
+   do
+   git checkout -b topic$i master &&
+   echo change-$i >f$i/f$i/data.txt &&
+   git commit -a -m "Changed f$i/f$i/data.txt"
+   done &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic2
+   ^topic3
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f1   \
+   topic1:f1/f1\
+   topic1:f1/f1/data.txt | sort >actual_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+   git pack-objects --stdout --thin --revs nonsparse.pack 
&&
+   git index-pack -o nonsparse.idx nonsparse.pack &&
+   git show-index nonsparse_objects.txt &&
+   test_cmp actual_objects.txt nonsparse_objects.txt
+'
+
+# Demonstrate that both algorithms send "extra" objects because
+# they are not in the frontier.
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+   git checkout topic1 &&
+   echo change-3 >f3/f3/data.txt &&
+   git commit -a -m "Changed f3/f3/data.txt" &&
+   git rev-parse   \
+   topic1~1\
+   topic1~1^{tree} \
+   topic1^{tree}   \
+   topic1  \
+   topic1:f1   \
+   topic1:f1/f1\
+   topic1:f1/f1/data.txt   \
+   topic1:f3   \
+   topic1:f3/f3\
+   topic1:f3/f3/data.txt | sort >actual_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+   git pack-objects --stdout --thin --revs nonsparse.pack 
&&
+   git index-pack -o nonsparse.idx nonsparse.pack &&
+   git show-index nonsparse_objects.txt &&
+   test_cmp actual_objects.txt nonsparse_objects.txt
+'
+
+test_expect_success 'duplicate a folder from f3 and commit to topic1' '
+   mkdir f3/f4 &&
+   cp -r f1/f1/* f3/f4 &&
+   git add f3/f4 &&
+   git commit -m "Copied f1/f1 to f3/f4" &&
+   cat >packinput.txt <<-EOF &&
+   topic1
+   ^topic1~1
+   EOF
+   git rev-parse   \
+   topic1  \
+   topic1^{tree}   \
+   topic1:f3 | sort >actual_objects.txt
+'
+
+test_expect_success 'non-sparse pack-objects' '
+   git pack-objects --stdout --thin --revs nonsparse.pack 
&&
+   git index-pack -o nonsparse.idx nonsparse.pack &&
+   git show-index nonsparse_objects.txt &&
+   test_cmp actual_objects.txt nonsparse_objects.txt
+'
+
+test_done
-- 
2.20.0.rc1

Git Test Coverage Report (Wednesday Nov 21)

2018-11-21 Thread Derrick Stolee

to leave_reset_head;
bac2a1e36f  585) ret = error(_("failed to find tree of %s"), 
oid_to_hex(oid));

bac2a1e36f  586) goto leave_reset_head;
3249c1251e  590) ret = error(_("failed to find tree of %s"), 
oid_to_hex(oid));

3249c1251e  591) goto leave_reset_head;
3249c1251e  604) goto leave_reset_head;

builtin/show-branch.c
517fe807d6 builtin/show-branch.c 607) BUG_ON_OPT_NEG(unset);

builtin/show-ref.c
517fe807d6 builtin/show-ref.c 154) BUG_ON_OPT_NEG(unset);

bundle.c
2c8ee1f53c 267) error_errno(_("unable to dup bundle descriptor"));
2c8ee1f53c 268) child_process_clear(_objects);
2c8ee1f53c 269) return -1;
2c8ee1f53c 478) rollback_lock_file();

config.c
fast-import.c
ca473cef91 2958) strbuf_addf(, "%s %s %"PRIuMAX"\n", oid_to_hex(oid),

midx.c
name-hash.c
2179045fd0 532) die(_("unable to create lazy_dir thread: %s"), 
strerror(err));
2179045fd0 554) die(_("unable to create lazy_name thread: %s"), 
strerror(err));
2179045fd0 560) die(_("unable to join lazy_name thread: %s"), 
strerror(err));


preload-index.c
2179045fd0 137) die(_("unable to create threaded lstat: %s"), 
strerror(err));


revision.c
b45424181e 2915) define_commit_slab(author_date_slab, timestamp_t);
b45424181e 2942) return;
b45424181e 2945) return;
b45424181e 2948) record_author_date(>author_date, c);
b45424181e 2951) c->object.flags |= UNINTERESTING;
b45424181e 2954) return;
b45424181e 2957) mark_parents_uninteresting(c);
b45424181e 2980) return;
b45424181e 2983) return;
b45424181e 3031) info->topo_queue.compare = compare_commits_by_commit_date;
b45424181e 3032) break;
b45424181e 3034) init_author_date_slab(>author_date);
b45424181e 3035) info->topo_queue.compare = compare_commits_by_author_date;
b45424181e 3036) info->topo_queue.cb_data = >author_date;
b45424181e 3037) break;
b45424181e 3048) continue;
b45424181e 3059) record_author_date(>author_date, c);
f0d9cc4196 3097) if (!revs->ignore_missing_links)
f0d9cc4196 3098) die("Failed to traverse parents of commit %s",
f0d9cc4196 3099) oid_to_hex(>object.oid));
b45424181e 3107) continue;

run-command.c
2179045fd0 1229) error(_("cannot create async thread: %s"), strerror(err));

send-pack.c
c0e40a2d66 207) close(fd[1]);

Commits introducing uncovered code:
Derrick Stolee  b45424181: revision.c: generation-based topo-order 
algorithm
Derrick Stolee  f0d9cc419: revision.c: begin refactoring 
--topo-order logic

Jeff King  01a31f3bc: pull: handle --verify-signatures for unborn branch
Jeff King  0eb8d3767: cat-file: report an error on multiple --batch 
options
Jeff King  2c8ee1f53: bundle: dup() output descriptor closer to 
point-of-use
Jeff King  517fe807d: assert NOARG/NONEG behavior of parse-options 
callbacks
Jeff King  735ca208c: apply: return -1 from option callback instead 
of calling exit(1)

Jeff King  fce566480: am: handle --no-patch-format option
Johannes Schindelin  3249c1251: rebase: consolidate clean-up code 
before leaving reset_head()
Johannes Schindelin  bac2a1e36: built-in rebase: reinstate `checkout 
-q` behavior where appropriate
Nguyễn Thái Ngọc Duy  2179045fd: Clean up pthread_create() error 
handling
Nguyễn Thái Ngọc Duy  c0e40a2d6: send-pack.c: move async's #ifdef 
NO_PTHREADS back to run-command.c

Nguyễn Thái Ngọc Duy  fd6263fb7: grep: clean up num_threads handling
Torsten Bögershausen  ca473cef9: Upcast size_t variables to 
uintmax_t when printing

Re: [PATCH v2 1/1] Use size_t instead of 'unsigned long' for data in memory

2018-11-21 Thread Derrick Stolee


On 11/20/2018 12:04 AM, tbo...@web.de wrote:

From: Torsten Bögershausen 

Currently the length of data which is stored in memory is stored
in "unsigned long" at many places in the code base.
This is OK when both "unsigned long" and size_t are 32 bits,
(and is OK when both are 64 bits).
On a 64 bit Windows system am "unsigned long" is 32 bit, and
that may be too short to measure the size of objects in memory,
a size_t is the natural choice.


Is this change enough to store 4GB files on Windows? Or is there more to 
do?



Thanks for all the comments on V1.
Changes since V1:
- Make the motivation somewhat clearer in the commit message
- Rebase to the November 19 pu

What we really need for this patch to fly are this branches:
mk/use-size-t-in-zlib
tb/print-size-t-with-uintmax-format


I believe communicating these direct dependencies is the correct way to 
go, and the rebase you mentioned is unnecessary (instead, test with a 
merge).


Hopefully the patch applies on top of a merge of those branches.


@@ -529,7 +530,7 @@ static void *unpack_raw_entry(struct object_entry *obj,
  }
  
  static void *unpack_data(struct object_entry *obj,

-int (*consume)(const unsigned char *, unsigned long, 
void *),
+int (*consume)(const unsigned char *, size_t, void *),
 void *cb_data)


This is the only instance that is not paired directly with a variable 
name like "size", "sz", or "len". However, it appears to be correct from 
context.


Thanks for this!

Reviewed-by: Derrick Stolee

Re: [PATCH 2/2] commit-graph: don't call write_graph_chunk_large_edges() unnecessarily

2018-11-21 Thread Derrick Stolee


On 11/20/2018 8:26 PM, SZEDER Gábor wrote:

write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.nr);
-   write_graph_chunk_large_edges(f, commits.list, commits.nr);
+   if (num_large_edges)
+   write_graph_chunk_large_edges(f, commits.list, commits.nr);


This is clearly correct, and the tests in t5318-commit-graph.sh would 
catch a dropped (or additional) large/extra edge chunk.


Thanks,

-Stolee

Re: [PATCH 1/2] commit-graph: rename 'num_extra_edges' variable to 'num_large_edges'

2018-11-21 Thread Derrick Stolee


On 11/20/2018 10:29 PM, Junio C Hamano wrote:

SZEDER Gábor  writes:


I rename these variables to 'num_large_edges', because the commit
graph file format speaks about the 'Large Edge List' chunk.

However, I do find that the term 'extra' makes much more sense

Would it make sense to do the rename in the other direction?

I tend to agree that "large edge" is a misnomer.


I agree with you both. "Extra" is better.

Thanks,

-Stolee

Re: [PATCH 1/1] revision.c: use new topo-order logic in tests

2018-11-20 Thread Derrick Stolee


On 11/20/2018 1:13 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


@@ -3143,6 +3144,9 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(>commits);
if (revs->no_walk)
return 0;
+   if (revs->limited &&
+   git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+   revs->limited = 0;
if (revs->limited) {
if (limit_list(revs) < 0)
return -1;

That is equivalent to say

if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
revs->limited = 0;


Not exactly equivalent, because we can use short-circuiting to avoid the 
git_env_bool check, but I see what you mean.



Wouldn't that make the codepath that involves limit_list()
completely unreachable while testing, though?


Testing with GIT_TEST_COMMIT_GRAPH=0 would still hit limit_list(). Both 
modes are important to test (for instance, to ensure we still have 
correct behavior without a commit-graph file).



The title only mentions "topo-order" logic, but the topo-order is
not the only reason why limited bit can be set, is it?  When showing
merges, simplifying merges, or post-processing to show ancestry
path, or showing a bottom-limited revision range, the limited bit is
automatically set because all of these depend on first calling
limit_list() and then postprocessing its result.  Doesn't it hurt
these cases to unconditionally drop limited bit?


You're right that we only want to do this in the topo-order case, so 
perhaps the diff should instead be:


if (revs->no_walk)
return 0;
+   if (revs->topo_order &&
+   git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+   revs->limited = 0;
if (revs->limited) {
if (limit_list(revs) < 0)
return -1;

Thanks,
-Stolee

Re: [PATCH 0/3] delta-island fixes

2018-11-20 Thread Derrick Stolee


On 11/20/2018 4:44 AM, Jeff King wrote:

In cases like this I think it's often a good idea to have a perf test.
Those are expensive anyway, and we'd have the double benefit of
exercising the code and showing off the performance improvement. But the
delta-island code only makes sense on a very specialized repo: one where
you have multiple related but diverging histories.  You could simulate
that by picking two branches in a repository, but the effect is going to
be miniscule.


The changes in this series look correct. Too bad it is difficult to test.

Perhaps we should add a performance test for the --delta-islands check 
that would trigger the failure (and maybe a clone afterwards). There are 
lots of freely available forks of git.git that present interesting fork 
structure. Here are three that would suffice to make this interesting:


    https://github.com/git/git
    https://github.com/git-for-windows/git
    https://github.com/microsoft/git

Of course, running it on a specific repo is up to the person running the 
perf suite.


Thanks,
-Stolee

Re: Git Test Coverage Report (v2.20.0-rc0)

2018-11-20 Thread Derrick Stolee


On 11/20/2018 6:34 AM, Jeff King wrote:

On Mon, Nov 19, 2018 at 10:40:53AM -0500, Derrick Stolee wrote:


Since depth is never incremented, we are not covering this block. Is it
possible to test?
This should be covered by the fix in:

   https://public-inbox.org/git/20181120095053.gc22...@sigill.intra.peff.net/

because now entries at the top-level are depth "1". The "depth++" line
is still never executed in our test suite. I'm not sure how much that
matters.


Thanks! I'll go take a look at your patch.


delta-islands.c
c8d521faf7  53) memcpy(b, old, size);

This memcpy happens when the 'old' island_bitmap is passed to
island_bitmap_new(), but...


c8d521faf7 187) b->refcount--;
c8d521faf7 188) b = kh_value(island_marks, pos) = island_bitmap_new(b);

This block has the only non-NULL caller to island_bitmap_new().

This is another case where it triggers a lot for a reasonably-sized
repo, but it's hard to construct a small case. This code implements a
copy-on-write of the bitmap, which means the same objects have to be
accessible from two different paths through the reachability graph, each
with different island marks. And then a test would I guess make sure
that the correct subsets of objects never become deltas, which gets
complicated.

And I think that's a pattern with the delta-island code. What we really
care about most is that if we throw a real fork-network repository at
it, it produces faster clones with fewer un-reusable deltas. So I think
a much more interesting approach here would be perf tests. But:

   - we'd want to count those as coverage, and that likely makes your
 coverage tests prohibitively expensive

   - it requires a specialized repo to demonstrate, which most people
 aren't going to have handy


Do you have regularly-running tests that check this in your 
infrastructure? As long as someone would notice if this code starts 
failing, that would be enough.



c8d521faf7 212) obj = ((struct tag *)obj)->tagged;
c8d521faf7 213) if (obj) {
c8d521faf7 214) parse_object(the_repository, >oid);
c8d521faf7 215) marks = create_or_get_island_marks(obj);
c8d521faf7 216) island_bitmap_set(marks, island_counter);

It appears that this block would happen if we placed a tag in the delta
island.

Yep. Again, exercised by real repositories.

I'm not sure how far we want to go in the blind pursuit of coverage.
Certainly we could add a tag to the repo in t5320, and this code would
get executed. But verifying that it's doing the right thing is much
harder (and is more easily done with a perf test).


Blind coverage goals are definitely not worth the effort. My goal here 
was to re-check all of the new code (since last release) that is not 
covered, because it's easier to hide a bug there.





c8d521faf7 397) strbuf_addch(_name, '-');

This block is inside the following patch:
[...]

Yeah, this triggers if your regex has more than one capture group (and
likewise, we almost certainly don't run the "you have too many groups"
warning).


Did you know that regexes are notoriously under-tested [1]? When looking 
at this code, I didn't even know regexes were involved (but I didn't 
look enough at the context).


[1] https://dl.acm.org/citation.cfm?id=3236072




c8d521faf7 433) continue;
c8d521faf7 436) list[dst] = list[src];

These blocks are inside the following nested loop in deduplicate_islands():

+   for (ref = 0; ref + 1 < island_count; ref++) {
+   for (src = ref + 1, dst = src; src < island_count; src++) {
+   if (list[ref]->hash == list[src]->hash)
+   continue;
+
+   if (src != dst)
+   list[dst] = list[src];
+
+   dst++;
+   }
+   island_count = dst;
+   }

This means that our "deduplication" logic is never actually doing anything
meaningful.

Sorry, I don't even remember what this code is trying to do. The island
code is 5+ years old, and just recently ported to upstream Git by
Christian. And that's perhaps part of my laziness in the above tests; it
would be a significant effort to re-figure out all these corner cases.
It's a big part of why I hadn't been sending the patches upstream
myself.


Sure. Hopefully pointing out these blocks gives us more motivation to 
manually inspect them and avoid silly bugs.


Thanks,
-Stolee

Re: Git Test Coverage Report (v2.20.0-rc0)

2018-11-19 Thread Derrick Stolee


On 11/19/2018 1:33 PM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Nov 19 2018, Derrick Stolee wrote:


[...]
builtin/rebase.c
62c23938fa 55) return env;
[...]
Ævar Arnfjörð Bjarmason 62c23938f: tests: add a special setup
where rebase.useBuiltin is off

This one would be covered with
GIT_TEST_REBASE_USE_BUILTIN=false. Obviously trivial, but I wonder if
the rest of the coverage would look different when passed through the various 
GIT_TEST_* options.



Thanks for pointing out this GIT_TEST_* variable to me. I had been 
running builds with some of them enabled, but didn't know about this one.


Unfortunately, t3406-rebase-message.sh fails with 
GIT_TEST_REBASE_USE_BUILTIN=false and it bisects to 4520c2337: Merge 
branch 'ab/rebase-in-c-escape-hatch'.


The issue is that the commit 04519d72 "rebase: validate -C and 
--whitespace= parameters early" introduced the following test that 
cares about error messages:


+test_expect_success 'error out early upon -C or --whitespace=' '
+   test_must_fail git rebase -Cnot-a-number HEAD 2>err &&
+   test_i18ngrep "numerical value" err &&
+   test_must_fail git rebase --whitespace=bad HEAD 2>err &&
+   test_i18ngrep "Invalid whitespace option" err
+'

The merge commit then was the first place where this test could run with 
that variable.


What's the correct fix here? Force the builtin rebase in this test? 
Unify the error message in the non-builtin case?


Thanks,
-Stolee

Re: Git Test Coverage Report (v2.20.0-rc0)

2018-11-19 Thread Derrick Stolee

On 11/19/2018 2:00 PM, Ben Peart wrote:

On 11/19/2018 10:40 AM, Derrick Stolee wrote:

There are a lot of lines introduced by the IEOT extension in these 
commits:

 > Ben Peart  3255089ad: ieot: add Index Entry Offset Table 
(IEOT) extension
 > Ben Peart  3b1d9e045: eoie: add End of Index Entry (EOIE) 
extension
 > Ben Peart  77ff1127a: read-cache: load cache entries on worker 
threads
 > Ben Peart  abb4bb838: read-cache: load cache extensions on a 
worker thread
 > Ben Peart  c780b9cfe: config: add new index.threads config 
setting
 > Ben Peart  d1664e73a: add: speed up cmd_add() by utilizing 
read_cache_preload()
 > Ben Peart  fa655d841: checkout: optimize "git checkout -b 
"

These should be hit if you run the test suite with 
GIT_TEST_INDEX_THREADS=2.  Without that, the indexes for the various 
tests are too small to trigger multi-threaded index reads/writes.

From t/README:

GIT_TEST_INDEX_THREADS= enables exercising the multi-threaded loading
of the index for the whole test suite by bypassing the default number of
cache entries and thread minimums. Setting this to 1 will make the
index loading single threaded.

I updated my build to add GIT_TEST_INDEX_THREADS=2 and I still see lots 
of uncovered stuff, including that load_cache_entries_threaded() is 
never run.

I added the following diff to my repo and ran the test suite manually 
with GIT_TEST_INDEX_THREADS=2 and it didn't fail:

diff --git a/read-cache.c b/read-cache.c
index 4ca81286c0..36502586a2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2057,6 +2057,9 @@ static unsigned long 
load_cache_entries_threaded(struct index_state *istate, con

    struct load_cache_entries_thread_data *data;
    unsigned long consumed = 0;

+   fprintf(stderr, "load_cache_entries_threaded\n");
+   exit(1);
+
    /* a little sanity checking */
    if (istate->name_hash_initialized)
    BUG("the name hash isn't thread safe");

Am I missing something? Is there another variable I should add?

When I look for where the GIT_TEST_INDEX_THREADS variable is checked, I 
see that the calls to git_config_get_index_threads() are followed by a 
check for NO_PTHREADS (setting the number of threads to 1 again). Is it 
possible that my compiler environment is not allowing me to even compile 
with threads?

Thanks,
-Stolee

Re: [PATCH] commit-graph: split up close_reachable() progress output

2018-11-19 Thread Derrick Stolee


On 11/19/2018 3:23 PM, Ævar Arnfjörð Bjarmason wrote:

+   if (report_progress)
+   progress = start_delayed_progress(
+   _("Expanding reachable commits in commit graph"), j = 
0);


This should be the only one that shows up in all but the very largest of 
repositories.


LGTM.

Thanks,
-Stolee

Re: Git Test Coverage Report (v2.20.0-rc0)

2018-11-19 Thread Derrick Stolee


On 11/19/2018 2:39 PM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Nov 19 2018, Derrick Stolee wrote:


The coverage report has been using the following:

export GIT_TEST_MULTI_PACK_INDEX=1
export GIT_TEST_COMMIT_GRAPH=1
export GIT_TEST_INDEX_VERION=4
export GIT_TEST_SPLIT_INDEX=yes
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5

I need to add GIT_TEST_INDEX_THREADS=2 and GIT_TEST_REBASE_USE_BUILTIN=false

...although note you'll need to also test without
GIT_TEST_REBASE_USE_BUILTIN=false, otherwise a lot of the new C code
won't have coverage.


Sorry for lack of clarity: I first run 'make coverage-test' with no 
GIT_TEST_* variables, then run the test suite again with the optional 
variables.


Thanks,
-Stolee

Re: Git Test Coverage Report (v2.20.0-rc0)

2018-11-19 Thread Derrick Stolee


On 11/19/2018 1:33 PM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Nov 19 2018, Derrick Stolee wrote:


Here is a test coverage report for the uncovered lines introduced in
v2.20.0-rc0 compared to v2.19.1.

Thanks a lot for this.


[...]
builtin/rebase.c
62c23938fa 55) return env;
[...]
Ævar Arnfjörð Bjarmason 62c23938f: tests: add a special setup
where rebase.useBuiltin is off

This one would be covered with
GIT_TEST_REBASE_USE_BUILTIN=false. Obviously trivial, but I wonder if
the rest of the coverage would look different when passed through the various 
GIT_TEST_* options.


The coverage report has been using the following:

export GIT_TEST_MULTI_PACK_INDEX=1
export GIT_TEST_COMMIT_GRAPH=1
export GIT_TEST_INDEX_VERION=4
export GIT_TEST_SPLIT_INDEX=yes
export GIT_TEST_OE_SIZE=10
export GIT_TEST_OE_DELTA_SIZE=5

I need to add GIT_TEST_INDEX_THREADS=2 and GIT_TEST_REBASE_USE_BUILTIN=false

Thanks!
-Stolee

[PATCH 1/1] revision.c: use new topo-order logic in tests

2018-11-19 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The revision-walk machinery is being rewritten to use generation
numbers in the commit-graph when availble. Due to some problematic
commit histories, the new logic can be slower than the previous
method due to how commit dates and generation numbers interact.
Thus, the new logic is not used in comparison queries, such as

git log --topo-order A..B

The logic for these queries was implemented during the refactor,
but is unreachable due to the potential performance penalty. The
code came along with a larger block of code that was copied from
the old code. When generation numbers are updated to v2 (corrected
commit dates), then we will no longer have a performance penalty
and this logic is what we will want to use.

In the meantime, use the new logic when GIT_TEST_COMMIT_GRAPH is
enabled. This will demonstrate that the new logic works for all
comparison queries in the test suite, including these variants:

git log --topo-order --ancestry-path A..B
git log --topo-order A...B

Signed-off-by: Derrick Stolee 
---
 revision.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/revision.c b/revision.c
index 4ef47d2fb4..d52da6e24f 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "config.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -3143,6 +3144,9 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(>commits);
if (revs->no_walk)
return 0;
+   if (revs->limited &&
+   git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
+   revs->limited = 0;
if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-- 
gitgitgadget

[PATCH 0/1] Use new topo-order logic with GIT_TEST_COMMIT_GRAPH

2018-11-19 Thread Derrick Stolee via GitGitGadget

The recent Git test report for v2.20.0-rc0 shows that the logic around
UNINTERESTING commits is not covered by the test suite. This is because the
code is currently unreachable! See the commit message for details.

An alternate approach would be to delete the code around UNINTERESTING
commits, but that doesn't seem necessary.

Thanks, -Stolee

Derrick Stolee (1):
  revision.c: use new topo-order logic in tests

 revision.c | 4 
 1 file changed, 4 insertions(+)


base-commit: 561b583749b7428f1790f03164d0d0e75be71d7b
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-83%2Fderrickstolee%2Ftopo-order-test-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-83/derrickstolee/topo-order-test-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/83
-- 
gitgitgadget

Re: Git Test Coverage Report (v2.20.0-rc0)

2018-11-19 Thread Derrick Stolee

The test coverage reports started mid-way through this release cycle, so 
I thought it would be good to do a full review of the new uncovered code 
since the last release.


I eliminated most of the uncovered code due to the following cases:

1. Code was only moved or refactored.
2. Code was related to unusual error conditions (e.g. open_pack_index() 
fails)


The comments below are intended only to point out potential directions 
to improve test coverage. Some of it is for me to do!


Thanks,
-Stolee

On 11/18/2018 9:54 PM, Derrick Stolee wrote:

66ec0390e7 builtin/fsck.c 888) midx_argv[2] = "--object-dir";
66ec0390e7 builtin/fsck.c 889) midx_argv[3] = alt->path;
66ec0390e7 builtin/fsck.c 890) if (run_command(_verify))
66ec0390e7 builtin/fsck.c 891) errors_found |= ERROR_COMMIT_GRAPH;



There are two things wrong here:

1. Not properly covering multi-pack-index fsck with alternates.
2. the ERROR_COMMIT_GRAPH flag when the multi-pack-index is being checked.

I'll submit a patch to fix this.

2fa233a554 builtin/pack-objects.c 1512) hashcpy(base_oid.hash, 
base_sha1);
2fa233a554 builtin/pack-objects.c 1513) if 
(!in_same_island(>idx.oid, _oid))

2fa233a554 builtin/pack-objects.c 1514) return 0;


These lines are inside a block for the following if statement:

+   /*
+    * Otherwise, reachability bitmaps may tell us if the receiver 
has it,

+    * even if it was buried too deep in history to make it into the
+    * packing list.
+    */
+   if (thin && bitmap_has_sha1_in_uninteresting(bitmap_git, 
base_sha1)) {


Peff: is this difficult to test?


28b8a73080 builtin/pack-objects.c 2793) depth++;
108f530385 builtin/pack-objects.c 2797) oe_set_tree_depth(_pack, 
ent, depth); 


This 'depth' variable is incremented as part of a for loop in this patch:

 static void show_object(struct object *obj, const char *name, void *data)
@@ -2686,6 +2706,19 @@ static void show_object(struct object *obj, const 
char *name, void *data)

    add_preferred_base_object(name);
    add_object_entry(>oid, obj->type, name, 0);
    obj->flags |= OBJECT_ADDED;
+
+   if (use_delta_islands) {
+   const char *p;
+   unsigned depth = 0;
+   struct object_entry *ent;
+
+   for (p = strchr(name, '/'); p; p = strchr(p + 1, '/'))
+   depth++;
+
+   ent = packlist_find(_pack, obj->oid.hash, NULL);
+   if (ent && depth > ent->tree_depth)
+   ent->tree_depth = depth;
+   }
 }

And that 'ent->tree_depth = depth;' line is replaced with the 
oe_set_tree_depth() call in the report.


Since depth is never incremented, we are not covering this block. Is it 
possible to test?



builtin/repack.c
16d75fa48d  48) use_delta_islands = git_config_bool(var, value);
16d75fa48d  49) return 0;


This is a simple config option check for "repack.useDeltaIslands". The 
logic it enables is tested, so this is an OK gap, in my opinion.


> builtin/submodule--helper.c

ee69b2a90c 1476) out->type = sub->update_strategy.type;
ee69b2a90c 1477) out->command = sub->update_strategy.command;


This block was introduced by this part of the patch:

+   } else if (sub->update_strategy.type != SM_UPDATE_UNSPECIFIED) {
+   trace_printf("loaded thing");
+   out->type = sub->update_strategy.type;
+   out->command = sub->update_strategy.command;

Which seems to be an important case, as the other SM_UPDATE_* types seem 
like interesting cases.


Stefan: what actions would trigger this block? Is it easy to test?


delta-islands.c
c8d521faf7  53) memcpy(b, old, size);


This memcpy happens when the 'old' island_bitmap is passed to 
island_bitmap_new(), but...



c8d521faf7 187) b->refcount--;
c8d521faf7 188) b = kh_value(island_marks, pos) = island_bitmap_new(b);


This block has the only non-NULL caller to island_bitmap_new().


c8d521faf7 212) obj = ((struct tag *)obj)->tagged;
c8d521faf7 213) if (obj) {
c8d521faf7 214) parse_object(the_repository, >oid);
c8d521faf7 215) marks = create_or_get_island_marks(obj);
c8d521faf7 216) island_bitmap_set(marks, island_counter);


It appears that this block would happen if we placed a tag in the delta 
island.



c8d521faf7 397) strbuf_addch(_name, '-');


This block is inside the following patch:

+   if (matches[ARRAY_SIZE(matches) - 1].rm_so != -1)
+   warning(_("island regex from config has "
+ "too many capture groups (max=%d)"),
+   (int)ARRAY_SIZE(matches) - 2);
+
+   for (m = 1; m < ARRAY_SIZE(matches); m++) {
+   regmatch_t *match = [m];
+
+   if (match->rm_so == -1)
+   continue;
+
+   if (island_name.len)
+   strbuf_addch(_name, '-');
+
+   strbuf_ad

Git Test Coverage Report (v2.20.0-rc0)

2018-11-18 Thread Derrick Stolee

(, ":%.*s ", perf_indent, space);
c46c406ae1 317) void trace_performance_leave_fl(const char *file, int line,
c46c406ae1 323) if (perf_indent)
c46c406ae1 324) perf_indent--;
c46c406ae1 326) if (!format) /* Allow callers to leave without tracing 
anything */

c46c406ae1 327) return;
c46c406ae1 329) since = perf_start_times[perf_indent];
c46c406ae1 330) va_start(ap, format);
c46c406ae1 331) trace_performance_vprintf_fl(file, line, nanos - since, 
format, ap);

c46c406ae1 332) va_end(ap);
c46c406ae1 477) trace_performance_leave("git command:%s", command_line.buf);
c46c406ae1 485) if (!command_line.len)
c46c406ae1 490) trace_performance_enter();

transport.c
unpack-trees.c
b878579ae7  360) string_list_append(, ce->name);
b878579ae7  361) ce->ce_flags &= ~CE_MATCHED;
b878579ae7  368) warning(_("the following paths have collided (e.g. 
case-sensitive paths\n"

b878579ae7  372) for (i = 0; i < list.nr; i++)
b878579ae7  373) fprintf(stderr, "  '%s'\n", list.items[i].string);
f1e11c6510  777) free(tree_ce);
b4da37380b  778) return rc;
b4da37380b  785) printf("Unpacked %d entries from %s to %s using 
cache-tree\n",

b4da37380b  787)    o->src_index->cache[pos]->name,
b4da37380b  788)    o->src_index->cache[pos + nr_entries - 1]->name);

upload-pack.c
1d1243fe63 1403) deepen(INFINITE_DEPTH, data->deepen_relative, 
>shallows,


worktree.c
3a3b9d8cde 495) return -1;
3a3b9d8cde 508) return -1;
3a3b9d8cde 517) return -1;
ab3e1f78ae 537) break;

wt-status.c
f3bd35fa0d  671) s->committable = 1;
73ba5d78b4 1958) if (s->state.rebase_in_progress ||
73ba5d78b4 1959) s->state.rebase_interactive_in_progress)
73ba5d78b4 1960) branch_name = s->state.onto;
73ba5d78b4 1961) else if (s->state.detached_from)
73ba5d78b4 1962) branch_name = s->state.detached_from;

xdiff-interface.c
xdiff/xutils.c
611e42a598 405) return -1;

Commits introducing uncovered code:
Ævar Arnfjörð Bjarmason      62c23938f: tests: add a special setup where 
rebase.useBuiltin is off
Alban Gruin  0af129b2e: rebase--interactive2: rewrite the submodes 
of interactive rebase in C

Alban Gruin  2c58483a5: rebase -i: rewrite setup_reflog_action() in C
Alban Gruin  34b47315d: rebase -i: move rebase--helper modes to 
rebase--interactive

Alban Gruin  4df66c40b: rebase -i: rewrite checkout_onto() in C
Alban Gruin  53bbcfbde: rebase -i: implement the main part of 
interactive rebase as a builtin
Alban Gruin  64a43cbd5: rebase -i: rewrite the edit-todo 
functionality in C

Alban Gruin  65850686c: rebase -i: rewrite write_basic_state() in C
Alban Gruin  a9f5476fb: sequencer: refactor append_todo_help() to 
write its message to a buffer

Alban Gruin  b97e18736: rebase -i: rewrite complete_action() in C
Antonio Ospite      45f5ef3d7: submodule: factor out a 
config_set_in_gitmodules_file_gently function
Antonio Ospite  76e9bdc43: submodule: support reading .gitmodules 
when it's not in the working tree
Antonio Ospite  bcbc780d1: submodule: add a 
print_config_from_gitmodules() helper
Ben Peart  3255089ad: ieot: add Index Entry Offset Table (IEOT) 
extension

Ben Peart      3b1d9e045: eoie: add End of Index Entry (EOIE) extension
Ben Peart  77ff1127a: read-cache: load cache entries on worker threads
Ben Peart  abb4bb838: read-cache: load cache extensions on a worker 
thread

Ben Peart  c780b9cfe: config: add new index.threads config setting
Ben Peart  d1664e73a: add: speed up cmd_add() by utilizing 
read_cache_preload()

Ben Peart  fa655d841: checkout: optimize "git checkout -b "
Brendan Forster  93aef7c79: http: add support for disabling SSL 
revocation checks in cURL
brian m. carlson  2f0c9e9a9: builtin/repack: replace hard-coded 
constants
brian m. carlson  eccb5a5f3: apply: rename new_sha1_prefix and 
old_sha1_prefix
Christian Couder  108f53038: pack-objects: move tree_depth into 
'struct packing_data'
Christian Couder  fe0ac2fb7: pack-objects: move 'layer' into 'struct 
packing_data'

Derrick Stolee  0d5b3a5ef: midx: write object ids in a chunk
Derrick Stolee  17c35c896: packfile: skip loading index if in 
multi-pack-index

Derrick Stolee  1d614d41e: commit-reach: move ref_newer from remote.c
Derrick Stolee  1dcd9f204: midx: close multi-pack-index on repack
Derrick Stolee  20fd6d579: commit-graph: not compatible with grafts
Derrick Stolee  3715a6335: midx: read objects from multi-pack-index
Derrick Stolee  454ea2e4d: treewide: use get_all_packs
Derrick Stolee  4d80560c5: multi-pack-index: load into memory
Derrick Stolee  4fbcca4ef: commit-reach: make can_all_from_reach... 
linear

Derrick Stolee  5227c3856: commit-reach: move walk methods from commit.c
Derrick Stolee  525e18c04: midx: clear midx on repack
Derrick Stolee  56ee7ff15: multi-pack-index: add 'verify' verb
Derrick Stolee  662148c43: midx: write object offs

Re: [PATCH/RFC v1 1/1] Use size_t instead of unsigned long

2018-11-18 Thread Derrick Stolee


On 11/17/2018 10:11 AM, tbo...@web.de wrote:

From: Torsten Bögershausen 

Currently Git users can not commit files >4Gib under 64 bit Windows,
where "long" is 32 bit but size_t is 64 bit.
Improve the code base in small steps, as small as possible.
What started with a small patch to replace "unsigned long" with size_t
in one file (convert.c) ended up with a change in many files.

Signed-off-by: Torsten Bögershausen 
---

This needs to go on top of pu, to cover all the good stuff
   cooking here.


Better to work on top of 'master', as the work in 'pu' will be rewritten 
several times, probably.



I have started this series on November 1st, since that 2 or 3 rebases
   had been done to catch up, and now it is on pu from November 15.

I couldn't find a reason why changing "unsigned ling"
   into "size_t" may break anything, any thoughts, please ?


IIRC, the blocker for why we haven't done this already is that "size_t", 
"timestamp_t" and "off_t" are all 64-bit types that give more implied 
meaning, so we should swap types specifically to these as we want. One 
example series does a specific, small change [1].


If we wanted to do a single swap that removes 'unsigned long' with an 
unambiguously 64-bit type, I would recommend "uint64_t". Later 
replacements to size_t, off_t, and timestamp_t could happen on a 
case-by-case basis for type clarity.


It may benefit reviewers to split this change into multiple patches, 
starting at the lowest levels of the call stack (so higher 'unsigned 
long's can up-cast to the 64-bit types when calling a function) and 
focusing the changes to one or two files at a time (X.c and X.h, 
preferrably).


Since you are talking about the benefits for Git for Windows to accept 
4GB files, I wonder if we can add a test that verifies such a file will 
work. If you have such a test, then I could help verify that the test 
fails before the change and succeeds afterward.


Finally, it may be good to add a coccinelle script that replaces 
'unsigned long' with 'uint64_t' so we can easily fix any new 
introductions that happen in the future.


Thanks! I do think we should make this change, but we must be careful. 
It may be disruptive to topics in flight.


-Stolee

[1] https://public-inbox.org/git/20181112084031.11769-1-care...@gmail.com/

Git Test Coverage Report (Sunday, Nov 18th)

2018-11-18 Thread Derrick Stolee

 385cb64ff: delta-islands.c: remove 
the_repository references

Nguyễn Thái Ngọc Duy  674ba3403: fsck: mark strings for translation
Nguyễn Thái Ngọc Duy  74ae4b638: bundle.c: remove the_repository 
references
Nguyễn Thái Ngọc Duy  890034262: parse-options.c: mark more strings 
for translation
Nguyễn Thái Ngọc Duy  8aa8c1409: git.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  9440b831a: parse-options: replace opterror() 
with optname()
Nguyễn Thái Ngọc Duy  9d0a9e908: read-cache.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  ad8f8f4ae: attr.c: mark more string for 
translation
Nguyễn Thái Ngọc Duy  c6e7965dd: archive.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  c83d950e5: repack: mark more strings for 
translation

Nguyễn Thái Ngọc Duy  dd509db34: reflog: mark strings for translation
Nguyễn Thái Ngọc Duy  f11c95805: sequencer.c: remove implicit 
dependency on the_index
Nguyễn Thái Ngọc Duy  fb998eae6: blame.c: remove implicit dependency 
the_repository

Olga Telezhnaya  ab0e36715: ref-filter: add deltabase option
Olga Telezhnaya  bbfc042ef: ref-filter: add objectsize:disk option
Torsten Bögershausen  ca473cef9: Upcast size_t variables to 
uintmax_t when printing




Uncovered code in 'master' not in 'master@{1}'


apply.c
517fe807d6 4776) BUG_ON_OPT_NEG(unset);
735ca208c5 4830) return -1;

builtin/am.c
fce5664805 2117) *opt_value = PATCH_FORMAT_UNKNOWN;

builtin/blame.c
517fe807d6 builtin/blame.c    759) BUG_ON_OPT_NEG(unset);

builtin/cat-file.c
0eb8d3767c builtin/cat-file.c 609) return error(_("only one batch option 
may be specified"));


builtin/grep.c
fd6263fb73 builtin/grep.c 1051) warning(_("invalid option combination, 
ignoring --threads"));
fd6263fb73 builtin/grep.c 1057) die(_("invalid number of threads 
specified (%d)"), num_threads);


builtin/log.c
517fe807d6 builtin/log.c 1196) BUG_ON_OPT_NEG(unset);

builtin/pull.c
01a31f3bca 565) die(_("unable to access commit %s"),

builtin/rebase.c
62c23938fa   55) return env;
3249c1251e  556) ret = -1;
3249c1251e  557) goto leave_reset_head;
bac2a1e36f  561) ret = error(_("could not determine HEAD revision"));
bac2a1e36f  562) goto leave_reset_head;
3249c1251e  580) ret = error(_("could not read index"));
3249c1251e  581) goto leave_reset_head;
bac2a1e36f  585) ret = error(_("failed to find tree of %s"), 
oid_to_hex(oid));

bac2a1e36f  586) goto leave_reset_head;
3249c1251e  590) ret = error(_("failed to find tree of %s"), 
oid_to_hex(oid));

3249c1251e  591) goto leave_reset_head;
3249c1251e  604) goto leave_reset_head;

builtin/show-branch.c
517fe807d6 builtin/show-branch.c 607) BUG_ON_OPT_NEG(unset);

builtin/show-ref.c
517fe807d6 builtin/show-ref.c 154) BUG_ON_OPT_NEG(unset);

bundle.c
2c8ee1f53c 267) error_errno(_("unable to dup bundle descriptor"));
2c8ee1f53c 268) child_process_clear(_objects);
2c8ee1f53c 269) return -1;
2c8ee1f53c 478) rollback_lock_file();

config.c
midx.c
name-hash.c
2179045fd0 532) die(_("unable to create lazy_dir thread: %s"), 
strerror(err));
2179045fd0 554) die(_("unable to create lazy_name thread: %s"), 
strerror(err));
2179045fd0 560) die(_("unable to join lazy_name thread: %s"), 
strerror(err));


preload-index.c
2179045fd0 137) die(_("unable to create threaded lstat: %s"), 
strerror(err));


revision.c
b45424181e 2942) return;
b45424181e 2945) return;
b45424181e 2951) c->object.flags |= UNINTERESTING;
b45424181e 2954) return;
b45424181e 2957) mark_parents_uninteresting(c);
b45424181e 2980) return;
b45424181e 2983) return;
b45424181e 3048) continue;
f0d9cc4196 3097) if (!revs->ignore_missing_links)
f0d9cc4196 3098) die("Failed to traverse parents of commit %s",
f0d9cc4196 3099) oid_to_hex(>object.oid));
b45424181e 3107) continue;

run-command.c
2179045fd0 1229) error(_("cannot create async thread: %s"), strerror(err));

send-pack.c
c0e40a2d66 207) close(fd[1]);

Commits introducing uncovered code:
Ævar Arnfjörð Bjarmason  62c23938f: tests: add a special setup where 
rebase.useBuiltin is off
Derrick Stolee  b45424181: revision.c: generation-based topo-order 
algorithm
Derrick Stolee  f0d9cc419: revision.c: begin refactoring 
--topo-order logic

Jeff King  01a31f3bc: pull: handle --verify-signatures for unborn branch
Jeff King  0eb8d3767: cat-file: report an error on multiple --batch 
options
Jeff King  2c8ee1f53: bundle: dup() output descriptor closer to 
point-of-use
Jeff King  517fe807d: assert NOARG/NONEG behavior of parse-options 
callbacks
Jeff King  735ca208c: apply: return -1 from option callback instead 
of calling exit(1)

Jeff King  fce566480: am: handle --no-patch-format option
Johannes Schindelin  3249c1251: rebase: consolidate clean-up code 
before leaving reset_head()
Johannes Schindelin  bac2a1e36: built-in rebase: reinstate `checkout 
-q` behavior where appropriate
Nguyễn Thái Ngọc Duy  2179045fd: Clean up pthread_create() error 
handling
Nguyễn Thái Ngọc Duy  c0e40a2d6: send-pack.c: move async's #ifdef 
NO_PTHREADS back to run-command.c

Nguyễn Thái Ngọc Duy  fd6263fb7: grep: clean up num_threads handling

Re: [PATCH] technical doc: add a design doc for the evolve command

2018-11-16 Thread Derrick Stolee


On 11/14/2018 7:55 PM, sxe...@google.com wrote:

From: Stefan Xenos 

This document describes what an obsolescence graph for
git would look like, the behavior of the evolve command,
and the changes planned for other commands.


Thanks for putting this together!


diff --git a/Documentation/technical/evolve.txt 
b/Documentation/technical/evolve.txt

...

+Git Obsolescence Graph
+==
+
+Objective
+-
+Track the edits to a commit over time in an obsolescence graph.


The file name and the title are in a mismatch.

I'd prefer if the title was "Git Evolve Design Document" and this 
opening paragraph

was about the reasons we want a 'git evolve' command. Here is my attempt:

  The proposed 'git evolve' command will help users craft a 
high-quality commit
  history in their topic branches. By working to improve commits one at 
a time,
  then running 'git evolve', users can rewrite recent history with more 
options
  than interactive rebase. The core benefit is that users can pause 
their progress
  and move to other branches before returning to where they left off. 
Users can
  also share progress with others using standard 'push', 'fetch', and 
'format-patch'

  commands.


+Background
+--


Perhaps you can call this "Example"?


+Imagine you have three dependent changes up for review and you receive feedback
+that requires editing all three changes. While you're editing one, more 
feedback
+arrives on one of the others. What do you do?


"three dependent changes" sounds a bit vague enough to me to possibly 
confuse readers. Perhaps

"three sequential patches"?


+- Users can view the history of a commit directly (the sequence of amends and
+  rebases it has undergone, orthogonal to the history of the branch it is on).


"the history of a commit" doesn't semantically work, as a commit is an 
immutable Git object.


Instead, I would try to use the term "patch" to describe a change to the 
codebase, and that
takes the form as a list of commits that are improving on each other 
(but don't actually
have each other in their commit history). This means that the lifetime 
of a patch is described

by the commits that are amended or rebased.


+- By pushing and pulling the obsolescence graph, users can collaborate more
+  easily on changes-in-progress. This is better than pushing and pulling the
+  changes themselves since the obsolescence graph can be used to locate a more
+  specific merge base, allowing for better merges between different versions of
+  the same change.


(Making a note so I come back to this. I hope to learn what you mean by 
this "more specific

merge base".)


+
+Similar technologies
+
... It can't handle the case where you have
+multiple changes sharing the same parent when that parent needs to be rebased


Perhaps this could be made more concrete by describing commit history 
and a specific workflow

change using 'git evolve'.

Suppose we have two topic branches, topic1 and topic2, that point to 
commits A and B,
respectively.Suppose further that A and B have a common parent C with 
parent D. If we rebase
topic1 relativeto D, then we create new commits C' and A' that are newer 
versions of commits
C and A. It would benice to easily update topic2 to be on a new commit 
B' with parent C'.
Currently, a user needs to knowthat C updated to C', and use 'git rebase 
--onto C' C topic2'.
Instead, if we have a marker showing thatC' is an updated version of C, 
'git log topic2'
would show that topic2 can be updated, and the 'gitevolve' command would 
perform the correct

action to make B' with parent C'.

(This paragraph above is an example of "what can happen now is 
complicated and demands that
the user keep some information in their memory" and "the new workflow is 
simpler and helps
users make the right decision". I think we could use more of these at 
the start to sell the

idea.)



+and won't let you collaborate with others on resolving a complicated 
interactive
+rebase.


In the same sentence, we have an even more complicated workflow 
mentioned as an aside. This
could be fleshed out more concretely. It could help describing that the 
current model is for
usersto share "!fixup" commits and then one performs an interactive 
rebase to apply those
fixups inthe correct order. If a user instead shares an amended commit, 
then we are in a
difficult state toapply those changes. The new workflow would be to 
share amended commits

and 'git evolve'inserts the correct amended commits in the right order.

I'm a big proponent of the teaching philosophy of "examples first". It's 
easier to talk

abstractlyafter going through some concrete examples.


  You can think of rebase -i as a top-down approach and the evolve command
+as the bottom-up approach to the same problem.


This comparison is important. Perhaps it is more specific to say 
"interactive rebase splits
a plan torewrite history into independent units of work, while evolve 
collects independent

Re: [PATCHv3 00/23] Bring more repository handles into our code base

2018-11-16 Thread Derrick Stolee


On 11/13/2018 7:12 PM, Stefan Beller wrote:

Please have a look at the last 4 patches specifically as they were new in
the last iteration (but did not receive any comment), as they demonstrate
and fix a problem that is only exposed when using GIT_TEST_COMMIT_GRAPH=1
for the test suite.


I took a look at these last four and didn't find anything wrong. Rest of 
the series looks good.


While all of the TODOs in the last patch are an imperfect solution, I 
think it's probably the best we can do for now.


Thanks,
-Stolee

Re: [PATCH v5 02/12] sha1-file: provide functions to look up hash algorithms

2018-11-13 Thread Derrick Stolee


On 11/4/2018 6:44 PM, brian m. carlson wrote:

+int hash_algo_by_name(const char *name)
+{
+   int i;
+   if (!name)
+   return GIT_HASH_UNKNOWN;
+   for (i = 1; i < GIT_HASH_NALGOS; i++)
+   if (!strcmp(name, hash_algos[i].name))
+   return i;
+   return GIT_HASH_UNKNOWN;
+}
+


Today's test coverage report [1] shows this method is not covered in the 
test suite. Looking at 'pu', it doesn't have any callers.


Do you have a work in progress series that will use this? Could we add a 
test-tool to exercise this somehow?


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/97be3e21-6a8c-9718-5161-37318f6d6...@gmail.com/

Git Test Coverage Report (Tuesday, Nov 13)

2018-11-13 Thread Derrick Stolee

  7f8671656: merge-recursive: fix rename/add conflict 
handling

Force Charlie  d73019feb: http: add support selecting http version
Jeff King  b69fb867b: sha1_file_name(): overwrite buffer instead of 
appending
Jeff King  f0eaf6381: sha1-file: use an object_directory for the 
main object dir

Joel Teichroeb  3d5ec65ce: stash: convert apply to builtin
Joel Teichroeb  5bf62a19c: stash: convert pop to builtin
Joel Teichroeb  700577117: stash: convert drop and clear to builtin
Johannes Schindelin  3249c1251: rebase: consolidate clean-up code 
before leaving reset_head()
Johannes Schindelin  bac2a1e36: built-in rebase: reinstate `checkout 
-q` behavior where appropriate
Josh Steadmon  c95f25280: protocol: advertise multiple supported 
versions

Junio C Hamano  287d68a44: Merge branch 'nd/i18n' into jch
Nguyễn Thái Ngọc Duy  005af339c: sequencer.c: remove implicit 
dependency on the_repository

Nguyễn Thái Ngọc Duy  0b9c3afdb: remote.c: mark messages for translation
Nguyễn Thái Ngọc Duy  385cb64ff: delta-islands.c: remove 
the_repository references

Nguyễn Thái Ngọc Duy  674ba3403: fsck: mark strings for translation
Nguyễn Thái Ngọc Duy  74ae4b638: bundle.c: remove the_repository 
references
Nguyễn Thái Ngọc Duy  890034262: parse-options.c: mark more strings 
for translation
Nguyễn Thái Ngọc Duy  8aa8c1409: git.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  9440b831a: parse-options: replace opterror() 
with optname()
Nguyễn Thái Ngọc Duy  9d0a9e908: read-cache.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  ad8f8f4ae: attr.c: mark more string for 
translation
Nguyễn Thái Ngọc Duy  c6e7965dd: archive.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  c83d950e5: repack: mark more strings for 
translation

Nguyễn Thái Ngọc Duy  dd509db34: reflog: mark strings for translation
Nguyễn Thái Ngọc Duy  f11c95805: sequencer.c: remove implicit 
dependency on the_index
Nguyễn Thái Ngọc Duy  fb998eae6: blame.c: remove implicit dependency 
the_repository

Olga Telezhnaya  ab0e36715: ref-filter: add deltabase option
Olga Telezhnaya  bbfc042ef: ref-filter: add objectsize:disk option
Paul-Sebastian Ungureanu  104eb50d1: stash: convert show to builtin
Paul-Sebastian Ungureanu  193c3e351: stash: convert 
`stash--helper.c` into `stash.c`

Paul-Sebastian Ungureanu  1a0f0409a: stash: convert push to builtin
Paul-Sebastian Ungureanu  813904a0c: stash: convert store to builtin
Paul-Sebastian Ungureanu  9f630e748: stash: convert create to builtin
Paul-Sebastian Ungureanu  c2cc69f19: stash: make push -q quiet
Torsten Bögershausen  ca473cef9: Upcast size_t variables to 
uintmax_t when printing




Uncovered code in 'next' not in 'master'


apply.c
517fe807d6 4776) BUG_ON_OPT_NEG(unset);
735ca208c5 4830) return -1;

builtin/am.c
fce5664805 2117) *opt_value = PATCH_FORMAT_UNKNOWN;

builtin/archive.c
e001fd3a50 builtin/archive.c  78) die(_("git archive: expected ACK/NAK, 
got a flush packet"));

e001fd3a50 builtin/archive.c  80) if (starts_with(reader.line, "NACK "))
e001fd3a50 builtin/archive.c  81) die(_("git archive: NACK %s"), 
reader.line + 5);

e001fd3a50 builtin/archive.c  82) if (starts_with(reader.line, "ERR "))
e001fd3a50 builtin/archive.c  83) die(_("remote error: %s"), reader.line 
+ 4);

e001fd3a50 builtin/archive.c  84) die(_("git archive: protocol error"));
e001fd3a50 builtin/archive.c  89) die(_("git archive: expected a flush"));
fb19d32f05 builtin/archive.c  99) if (version != discover_version())
fb19d32f05 builtin/archive.c 100) die(_("git archive: received different 
protocol versions in subsequent requests"));


builtin/blame.c
517fe807d6 builtin/blame.c    759) BUG_ON_OPT_NEG(unset);

builtin/cat-file.c
0eb8d3767c builtin/cat-file.c 609) return error(_("only one batch option 
may be specified"));


builtin/grep.c
fd6263fb73 builtin/grep.c 1051) warning(_("invalid option combination, 
ignoring --threads"));
fd6263fb73 builtin/grep.c 1057) die(_("invalid number of threads 
specified (%d)"), num_threads);


builtin/log.c
517fe807d6 builtin/log.c 1196) BUG_ON_OPT_NEG(unset);

builtin/pull.c
01a31f3bca 565) die(_("unable to access commit %s"),

builtin/show-branch.c
517fe807d6 builtin/show-branch.c 607) BUG_ON_OPT_NEG(unset);

builtin/show-ref.c
517fe807d6 builtin/show-ref.c 154) BUG_ON_OPT_NEG(unset);

builtin/upload-archive.c
e001fd3a50 builtin/upload-archive.c 113) if (version == protocol_v0 || 
version == protocol_v1)
e001fd3a50 builtin/upload-archive.c 114) packet_write_fmt(1, "NACK 
unable to spawn subprocess\n");

e001fd3a50 builtin/upload-archive.c 115) else if (version == protocol_v2)
e001fd3a50 builtin/upload-archive.c 116) error_clnt("unable to spawn 
subprocess\n"

Re: [PATCH 0/9] caching loose objects

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:46 AM, Jeff King wrote:

Here's the series I mentioned earlier in the thread to cache loose
objects when answering has_object_file(..., OBJECT_INFO_QUICK). For
those just joining us, this makes operations that look up a lot of
missing objects (like "index-pack" looking for collisions) faster. This
is mostly targeted at systems where stat() is slow, like over NFS, but
it seems to give a 2% speedup indexing a full git.git packfile into an
empty repository (i.e., what you'd see on a clone).

I'm adding René Scharfe and Takuto Ikuta to the cc for their previous
work in loose-object caching.

The interesting bit is patch 8. The rest of it is cleanup to let us
treat alternates and the main object directory similarly.


This cleanup is actually really valuable, and affects much more than 
this application.


I really think it is a good idea, and hope it doesn't cause too much 
trouble as the topic is cooking.


Thanks,
-Stolee

Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:54 AM, Jeff King wrote:

In cases where we expect to ask has_sha1_file() about a lot of objects
that we are not likely to have (e.g., during fetch negotiation), we
already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
a simultaneous write or repack) for speed (we avoid re-scanning the pack
directory).

However, even checking for loose objects can be expensive, as we will
stat() each one. On many systems this cost isn't too noticeable, but
stat() can be particularly slow on some operating systems, or due to
network filesystems.

Since the QUICK flag already tells us that we're OK with a slightly
stale answer, we can use that as a cue to look in our in-memory cache of
each object directory. That basically trades an in-memory binary search
for a stat() call.

Note that it is possible for this to actually be _slower_. We'll do a
full readdir() to fill the cache, so if you have a very large number of
loose objects and a very small number of lookups, that readdir() may end
up more expensive.

This shouldn't be a big deal in practice. If you have a large number of
reachable loose objects, you'll already run into performance problems
(which you should remedy by repacking). You may have unreachable objects
which wouldn't otherwise impact performance. Usually these would go away
with the prune step of "git gc", but they may be held for up to 2 weeks
in the default configuration.

So it comes down to how many such objects you might reasonably expect to
have, how much slower is readdir() on N entries versus M stat() calls
(and here we really care about the syscall backing readdir(), like
getdents() on Linux, but I'll just call this readdir() below).

If N is much smaller than M (a typical packed repo), we know this is a
big win (few readdirs() followed by many uses of the resulting cache).
When N and M are similar in size, it's also a win. We care about the
latency of making a syscall, and readdir() should be giving us many
values in a single call. How many?

On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
entries per call (which is 64 bytes per entry; the name itself is 38
bytes, plus there are some other fields). So we can imagine that this is
always a win as long as the number of loose objects in the repository is
a factor of 500 less than the number of lookups you make. It's hard to
auto-tune this because we don't generally know up front how many lookups
we're going to do. But it's unlikely for this to perform significantly
worse.

Signed-off-by: Jeff King 
---
There's some obvious hand-waving in the paragraphs above. I would love
it if somebody with an NFS system could do some before/after timings
with various numbers of loose objects, to get a sense of where the
breakeven point is.

My gut is that we do not need the complexity of a cache-size limit, nor
of a config option to disable this. But it would be nice to have a real
number where "reasonable" ends and "pathological" begins. :)


I'm interested in such numbers, but do not have the appropriate setup to 
test.


I think the tradeoffs you mention above are reasonable. There's also 
some chance that this isn't "extra" work but is just "earlier" work, as 
the abbreviation code would load these loose object directories.




  object-store.h |  1 +
  sha1-file.c| 20 
  2 files changed, 21 insertions(+)

diff --git a/object-store.h b/object-store.h
index bf1e0cb761..60758efad8 100644
--- a/object-store.h
+++ b/object-store.h
@@ -13,6 +13,7 @@ struct object_directory {
/*
 * Used to store the results of readdir(3) calls when we are OK
 * sacrificing accuracy due to races for speed. That includes
+* object existence with OBJECT_INFO_QUICK, as well as
 * our search for unique abbreviated hashes. Don't use it for tasks
 * requiring greater accuracy!
 *
diff --git a/sha1-file.c b/sha1-file.c
index 4aae716a37..e53da0b701 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -921,6 +921,24 @@ static int open_sha1_file(struct repository *r,
return -1;
  }
  
+static int quick_has_loose(struct repository *r,

+  const unsigned char *sha1)
+{
+   int subdir_nr = sha1[0];
+   struct object_id oid;
+   struct object_directory *odb;
+
+   hashcpy(oid.hash, sha1);
+
+   prepare_alt_odb(r);
+   for (odb = r->objects->odb; odb; odb = odb->next) {
+   odb_load_loose_cache(odb, subdir_nr);
+   if (oid_array_lookup(>loose_objects_cache, ) >= 0)
+   return 1;
+   }
+   return 0;
+}
+
  /*
   * Map the loose object at "path" if it is not NULL, or the path found by
   * searching for a loose object named "sha1".
@@ -1171,6 +1189,8 @@ static int sha1_loose_object_info(struct repository *r,
if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
const char *path;
struct stat st;
+   if

Re: [PATCH 5/9] handle alternates paths the same as the main object dir

2018-11-12 Thread Derrick Stolee





On 11/12/2018 10:46 AM, Jeff King wrote:

On Mon, Nov 12, 2018 at 10:38:28AM -0500, Derrick Stolee wrote:


We could go either direction here, but this patch moves the alternates
struct over to the main directory style (rather than vice-versa).
Technically the alternates style is more efficient, as it avoids
rewriting the object directory name on each call. But this is unlikely
to matter in practice, as we avoid reallocations either way (and nobody
has ever noticed or complained that the main object directory is copying
a few extra bytes before making a much more expensive system call).

Hm. I've complained in the past [1] about a simple method like strbuf_addf()
over loose objects, but that was during abbreviation checks so we were
adding the string for every loose object but not actually reading the
objects.

[1]
https://public-inbox.org/git/20171201174956.143245-1-dsto...@microsoft.com/

I suspect that had more to do with the cost of snprintf() than the extra
bytes being copied. And here we'd still be using addstr and addch
exclusively. I'm open to numeric arguments to the contrary, though. :)


I agree. I don't think it is worth investigating now, as the performance 
difference should be moot. I am making a mental note to take a look here 
if I notice a performance regression later. ;)



There's actually a lot of low-hanging fruit there for pre-sizing, too.
E.g., fill_sha1_path() calls strbuf_addch() in a loop, but it could
quite easily grow the 41 bytes it needs ahead of time. I wouldn't want
to change that without finding a measurable improvement, though. It
might not be a big deal due to fec501dae8 (strbuf_addch: avoid calling
strbuf_grow, 2015-04-16).

-Peff

Re: [PATCH 6/9] sha1-file: use an object_directory for the main object dir

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:50 AM, Jeff King wrote:

Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:

   do_something(r->objects->objdir);

   for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
 do_something(odb->path);

That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).

Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).

A few observations:

   - we don't need r->objects->objectdir anymore, and can just
 mechanically convert that to r->objects->odb->path

   - object_directory's path field needs to become a real pointer rather
 than a FLEX_ARRAY, in order to fill it with expand_base_dir()

   - we'll call prepare_alt_odb() earlier in many functions (i.e.,
 outside of the loop). This may result in us calling it even when our
 function would be satisfied looking only at the main odb.

 But this doesn't matter in practice. It's not a very expensive
 operation in the first place, and in the majority of cases it will
 be a noop. We call it already (and cache its results) in
 prepare_packed_git(), and we'll generally check packs before loose
 objects. So essentially every program is going to call it
 immediately once per program.

 Arguably we should just prepare_alt_odb() immediately upon setting
 up the repository's object directory, which would save us sprinkling
 calls throughout the code base (and forgetting to do so has been a
 source of subtle bugs in the past). But I've stopped short of that
 here, since there are already a lot of other moving parts in this
 patch.

   - Most call sites just get shorter. The check_and_freshen() functions
 are an exception, because they have entry points to handle local and
 nonlocal directories separately.

Signed-off-by: Jeff King 
---
If the "the first one is the main store, the rest are alternates" bit is
too subtle, we could mark each "struct object_directory" with a bit for
"is_local".


This is probably a good thing to do proactively. We have the equivalent 
in the packed_git struct, but that's also because they get out of order. 
At the moment, I can't think of a read-only action that needs to treat 
the local object directory more carefully. The closest I know about is 
'git pack-objects --local', but that also writes a pack-file.


I assume that when we write a pack-file to the "default location" we use 
get_object_directory() instead of referring to the default object_directory?




  builtin/fsck.c |  21 ++---
  builtin/grep.c |   2 +-
  commit-graph.c |   5 +-
  environment.c  |   4 +-
  object-store.h |  27 ++-
  object.c   |  19 
  packfile.c |  10 ++--
  path.c |   2 +-
  repository.c   |   8 +++-
  sha1-file.c| 122 ++---
  sha1-name.c|  17 ++-
  11 files changed, 90 insertions(+), 147 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 55153cf92a..15338bd178 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -725,13 +725,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
} else {
-   struct object_directory *alt_odb_list;
-
-   fsck_object_dir(get_object_directory());
-
prepare_alt_odb(the_repository);
-   alt_odb_list = the_repository->objects->alt_odb_list;
-   for (odb = alt_odb_list; odb; odb = odb->next)
+   for (odb = the_repository->objects->odb; odb; odb = odb->next)
fsck_object_dir(odb->path);
  
  		if (check_full) {

@@ -834,13 +829,8 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
const char *verify_argv[] = { "commit-graph", "verify", NULL, 
NULL, NULL };
  
-		commit_graph_verify.argv = verify_argv;

-   commit_graph_verify.git_cmd = 1;
-   if (run_command(_graph_verify))
-   errors_found |= ERROR_COMMIT_GRAPH;
-
prepare_alt_odb(the_repository);
-   for (odb = the_repository->objects->alt_odb_list; odb; odb = 
odb->next) {
+   for (odb = the_repository->objects->odb; odb; odb = odb->next) {
child_process_init(_graph_verify);
commit_graph_verify.argv = verify_argv;

Re: [PATCH 5/9] handle alternates paths the same as the main object dir

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:49 AM, Jeff King wrote:

When we generate loose file paths for the main object directory, the
caller provides a buffer to loose_object_path (formerly sha1_file_name).
The callers generally keep their own static buffer to avoid excessive
reallocations.

But for alternate directories, each struct carries its own scratch
buffer. This is needlessly different; let's unify them.

We could go either direction here, but this patch moves the alternates
struct over to the main directory style (rather than vice-versa).
Technically the alternates style is more efficient, as it avoids
rewriting the object directory name on each call. But this is unlikely
to matter in practice, as we avoid reallocations either way (and nobody
has ever noticed or complained that the main object directory is copying
a few extra bytes before making a much more expensive system call).


Hm. I've complained in the past [1] about a simple method like 
strbuf_addf() over loose objects, but that was during abbreviation 
checks so we were adding the string for every loose object but not 
actually reading the objects.


[1] 
https://public-inbox.org/git/20171201174956.143245-1-dsto...@microsoft.com/


The other concern I have is for alternates that may have long-ish paths 
to their object directories.


So, this is worth keeping an eye on, but is likely to be fine.


And this has the advantage that the reusable buffers are tied to
particular calls, which makes the invalidation rules simpler (for
example, the return value from stat_sha1_file() used to be invalidated
by basically any other object call, but now it is affected only by other
calls to stat_sha1_file()).

We do steal the trick from alt_sha1_path() of returning a pointer to the
filled buffer, which makes a few conversions more convenient.

Signed-off-by: Jeff King 
---
  object-store.h | 14 +-
  object.c   |  1 -
  sha1-file.c| 44 
  sha1-name.c|  8 ++--
  4 files changed, 23 insertions(+), 44 deletions(-)

diff --git a/object-store.h b/object-store.h
index fefa17e380..b2fa0d0df0 100644
--- a/object-store.h
+++ b/object-store.h
@@ -10,10 +10,6 @@
  struct object_directory {
struct object_directory *next;
  
-	/* see alt_scratch_buf() */

-   struct strbuf scratch;
-   size_t base_len;
-
/*
 * Used to store the results of readdir(3) calls when searching
 * for unique abbreviated hashes.  This cache is never
@@ -54,14 +50,6 @@ void add_to_alternates_file(const char *dir);
   */
  void add_to_alternates_memory(const char *dir);
  
-/*

- * Returns a scratch strbuf pre-filled with the alternate object directory,
- * including a trailing slash, which can be used to access paths in the
- * alternate. Always use this over direct access to alt->scratch, as it
- * cleans up any previous use of the scratch buffer.
- */
-struct strbuf *alt_scratch_buf(struct object_directory *odb);
-
  struct packed_git {
struct packed_git *next;
struct list_head mru;
@@ -157,7 +145,7 @@ void raw_object_store_clear(struct raw_object_store *o);
   * Put in `buf` the name of the file in the local object database that
   * would be used to store a loose object with the specified sha1.
   */
-void loose_object_path(struct repository *r, struct strbuf *buf, const 
unsigned char *sha1);
+const char *loose_object_path(struct repository *r, struct strbuf *buf, const 
unsigned char *sha1);
  
  void *map_sha1_file(struct repository *r, const unsigned char *sha1, unsigned long *size);
  
diff --git a/object.c b/object.c

index 6af8e908bb..dd485ac629 100644
--- a/object.c
+++ b/object.c
@@ -484,7 +484,6 @@ struct raw_object_store *raw_object_store_new(void)
  
  static void free_alt_odb(struct object_directory *odb)

  {
-   strbuf_release(>scratch);
oid_array_clear(>loose_objects_cache);
free(odb);
  }
diff --git a/sha1-file.c b/sha1-file.c
index 478eac326b..15db6b61a9 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -346,27 +346,20 @@ static void fill_sha1_path(struct strbuf *buf, const 
unsigned char *sha1)
}
  }
  
-void loose_object_path(struct repository *r, struct strbuf *buf,

-  const unsigned char *sha1)
+static const char *odb_loose_path(const char *path, struct strbuf *buf,
+ const unsigned char *sha1)
  {
strbuf_reset(buf);
-   strbuf_addstr(buf, r->objects->objectdir);
+   strbuf_addstr(buf, path);
strbuf_addch(buf, '/');
fill_sha1_path(buf, sha1);
+   return buf->buf;
  }
  
-struct strbuf *alt_scratch_buf(struct object_directory *odb)

+const char *loose_object_path(struct repository *r, struct strbuf *buf,
+ const unsigned char *sha1)
  {
-   strbuf_setlen(>scratch, odb->base_len);
-   return >scratch;
-}
-
-static const char *alt_sha1_path(struct object_directory *odb,
-const unsigned

Re: [PATCH 4/9] sha1_file_name(): overwrite buffer instead of appending

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:48 AM, Jeff King wrote:

Since we're changing the semantics, let's take the opportunity to give
it a more hash-neutral name (which will also catch any callers from
topics in flight).


THANK YOU! This method name confused me so much when I was first looking 
at the code, but the new name is so much better.


-Stolee

Re: [PATCH 3/9] rename "alternate_object_database" to "object_directory"

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:48 AM, Jeff King wrote:

In preparation for unifying the handling of alt odb's and the normal
repo object directory, let's use a more neutral name. This patch is
purely mechanical, swapping the type name, and converting any variables
named "alt" to "odb". There should be no functional change, but it will
reduce the noise in subsequent diffs.

Signed-off-by: Jeff King 
---
I waffled on calling this object_database instead of object_directory.
But really, it is very specifically about the directory (packed
storage, including packs from alternates, is handled elsewhere).


That makes sense. Each alternate makes its own object directory, but is 
part of a larger object database. It also helps clarify a difference 
from the object_store.


My only complaint is that you have a lot of variable names with "odb" 
which are now object_directory pointers. Perhaps "odb" -> "objdir"? Or 
is that just too much change?




  builtin/count-objects.c |  4 ++--
  builtin/fsck.c  | 16 ++---
  builtin/submodule--helper.c |  6 ++---
  commit-graph.c  | 10 
  object-store.h  | 14 +--
  object.c| 10 
  packfile.c  |  8 +++
  sha1-file.c | 48 ++---
  sha1-name.c | 20 
  transport.c |  2 +-
  10 files changed, 69 insertions(+), 69 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index a7cad052c6..3fae474f6f 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -78,10 +78,10 @@ static int count_cruft(const char *basename, const char 
*path, void *data)
return 0;
  }
  
-static int print_alternate(struct alternate_object_database *alt, void *data)

+static int print_alternate(struct object_directory *odb, void *data)
  {
printf("alternate: ");
-   quote_c_style(alt->path, NULL, stdout, 0);
+   quote_c_style(odb->path, NULL, stdout, 0);
putchar('\n');
return 0;
  }
diff --git a/builtin/fsck.c b/builtin/fsck.c
index b10f2b154c..55153cf92a 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -688,7 +688,7 @@ static struct option fsck_opts[] = {
  int cmd_fsck(int argc, const char **argv, const char *prefix)
  {
int i;
-   struct alternate_object_database *alt;
+   struct object_directory *odb;
  
  	/* fsck knows how to handle missing promisor objects */

fetch_if_missing = 0;
@@ -725,14 +725,14 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
} else {
-   struct alternate_object_database *alt_odb_list;
+   struct object_directory *alt_odb_list;
  
  		fsck_object_dir(get_object_directory());
  
  		prepare_alt_odb(the_repository);

alt_odb_list = the_repository->objects->alt_odb_list;
-   for (alt = alt_odb_list; alt; alt = alt->next)
-   fsck_object_dir(alt->path);
+   for (odb = alt_odb_list; odb; odb = odb->next)
+   fsck_object_dir(odb->path);
  
  		if (check_full) {

struct packed_git *p;
@@ -840,12 +840,12 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
errors_found |= ERROR_COMMIT_GRAPH;
  
  		prepare_alt_odb(the_repository);

-   for (alt =  the_repository->objects->alt_odb_list; alt; alt = 
alt->next) {
+   for (odb = the_repository->objects->alt_odb_list; odb; odb = 
odb->next) {
child_process_init(_graph_verify);
commit_graph_verify.argv = verify_argv;
commit_graph_verify.git_cmd = 1;
verify_argv[2] = "--object-dir";
-   verify_argv[3] = alt->path;
+   verify_argv[3] = odb->path;
if (run_command(_graph_verify))
errors_found |= ERROR_COMMIT_GRAPH;
}
@@ -861,12 +861,12 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
errors_found |= ERROR_COMMIT_GRAPH;
  
  		prepare_alt_odb(the_repository);

-   for (alt =  the_repository->objects->alt_odb_list; alt; alt = 
alt->next) {
+   for (odb = the_repository->objects->alt_odb_list; odb; odb = 
odb->next) {
child_process_init(_verify);
midx_verify.argv = midx_argv;
midx_verify.git_cmd = 1;
midx_argv[2] = "--object-dir";
-   midx_argv[3] = alt->path;
+   midx_argv[3] = odb->path;
if (run_command(_verify))
errors_found |=

Re: [PATCH 1/9] fsck: do not reuse child_process structs

2018-11-12 Thread Derrick Stolee


On 11/12/2018 9:46 AM, Jeff King wrote:

The run-command API makes no promises about what is left in a struct
child_process after a command finishes, and it's not safe to simply
reuse it again for a similar command. In particular:

  - if you use child->args or child->env_array, they are cleared after
finish_command()

  - likewise, start_command() may point child->argv at child->args->argv;
reusing that would lead to accessing freed memory

  - the in/out/err may hold pipe descriptors from the previous run


Thanks! This is helpful information.


These two calls are _probably_ OK because they do not use any of those
features. But it's only by chance, and may break in the future; let's
reinitialize our struct for each program we run.

Signed-off-by: Jeff King 
---
  builtin/fsck.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 06eb421720..b10f2b154c 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -841,6 +841,9 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
  
  		prepare_alt_odb(the_repository);

for (alt =  the_repository->objects->alt_odb_list; alt; alt = 
alt->next) {
+   child_process_init(_graph_verify);
+   commit_graph_verify.argv = verify_argv;
+   commit_graph_verify.git_cmd = 1;
verify_argv[2] = "--object-dir";
verify_argv[3] = alt->path;
if (run_command(_graph_verify))
@@ -859,6 +862,9 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
  
  		prepare_alt_odb(the_repository);

for (alt =  the_repository->objects->alt_odb_list; alt; alt = 
alt->next) {
+   child_process_init(_verify);
+   midx_verify.argv = midx_argv;
+   midx_verify.git_cmd = 1;
midx_argv[2] = "--object-dir";
midx_argv[3] = alt->path;
if (run_command(_verify))


Looks good to me.

-Stolee

Git Test Coverage Report (Wednesday, Nov 7)

2018-11-07 Thread Derrick Stolee

rrno(_("%s: cannot stat the open index"), path);
567e20ae23 2156) die(_("%s: index file smaller than expected"), path);
567e20ae23 2160) die_errno(_("%s: unable to map index file"), path);
567e20ae23 2250) warning(_("could not freshen shared index '%s'"), 
shared_index);

567e20ae23 2285) die(_("broken index, expect %s in %s, got %s"),
567e20ae23 3071) error(_("cannot fix permission bits on '%s'"), 
get_tempfile_path(*temp));

567e20ae23 3217) return error(_("%s: cannot drop to stage #0"),

ref-filter.c
23c0ab7312 2326) return error(_("option `%s' is incompatible with 
--no-merged"),


remote.c
eafdc91bbc  362) warning(_("config remote shorthand cannot begin with 
'/': %s"),
eafdc91bbc  417) error(_("more than one uploadpack given, using the 
first"));

eafdc91bbc  683) die(_("key '%s' of pattern had no '*'"), key);
eafdc91bbc  693) die(_("value '%s' of pattern has no '*'"), value);
eafdc91bbc 1044) error(_("unable to delete '%s': remote ref does not 
exist"),
eafdc91bbc 1066) return error(_("dst ref %s receives from more than one 
src"),

eafdc91bbc 1785) die(_("couldn't find remote ref %s"), name);
eafdc91bbc 1798) error(_("* Ignoring funny ref '%s' locally"),
eafdc91bbc 1893) die(_("revision walk setup failed"));
eafdc91bbc 2166) return error(_("cannot parse expected object name '%s'"),

revision.c
b45424181e 2942) return;
b45424181e 2945) return;
b45424181e 2951) c->object.flags |= UNINTERESTING;
b45424181e 2954) return;
b45424181e 2957) mark_parents_uninteresting(c);
b45424181e 2980) return;
b45424181e 2983) return;
b45424181e 3048) continue;
f0d9cc4196 3097) if (!revs->ignore_missing_links)
f0d9cc4196 3098) die("Failed to traverse parents of commit %s",
f0d9cc4196 3099) oid_to_hex(>object.oid));
b45424181e 3107) continue;

run-command.c
2179045fd0 1229) error(_("cannot create async thread: %s"), strerror(err));

send-pack.c
c0e40a2d66 207) close(fd[1]);

sha1-file.c
2f90b9d9b4 sha1-file.c  172) int hash_algo_by_name(const char *name)
2f90b9d9b4 sha1-file.c  175) if (!name)
2f90b9d9b4 sha1-file.c  176) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  177) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  178) if (!strcmp(name, hash_algos[i].name))
2f90b9d9b4 sha1-file.c  179) return i;
2f90b9d9b4 sha1-file.c  180) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  183) int hash_algo_by_id(uint32_t format_id)
2f90b9d9b4 sha1-file.c  186) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  187) if (format_id == hash_algos[i].format_id)
2f90b9d9b4 sha1-file.c  188) return i;
2f90b9d9b4 sha1-file.c  189) return GIT_HASH_UNKNOWN;

Commits introducing uncovered code:
brian m. carlson  2f90b9d9b: sha1-file: provide functions to look up 
hash algorithms
brian m. carlson  b3a41547c: hex: introduce functions to print 
arbitrary hashes
Daniels Umanovskis  0ecb1fc72: branch: introduce --show-current 
display option
Derrick Stolee  b45424181: revision.c: generation-based topo-order 
algorithm
Derrick Stolee  f0d9cc419: revision.c: begin refactoring 
--topo-order logic
Elijah Newren  143dc7900: merge-recursive: fix rename/add conflict 
handling
Elijah Newren  55ea0bf4c: merge-recursive: new function for better 
colliding conflict resolutions
Elijah Newren  f10ffbb2f: merge-recursive: improve 
rename/rename(1to2)/add[/add] handling

Jeff King  01a31f3bc: pull: handle --verify-signatures for unborn branch
Jeff King  0eb8d3767: cat-file: report an error on multiple --batch 
options
Jeff King  517fe807d: assert NOARG/NONEG behavior of parse-options 
callbacks
Jeff King  735ca208c: apply: return -1 from option callback instead 
of calling exit(1)

Jeff King  fce566480: am: handle --no-patch-format option
Joel Teichroeb  3d5ec65ce: stash: convert apply to builtin
Joel Teichroeb  5bf62a19c: stash: convert pop to builtin
Joel Teichroeb  700577117: stash: convert drop and clear to builtin
Junio C Hamano  e4d034baf: Merge branch 'nd/i18n' into jch
Nguyễn Thái Ngọc Duy  0a995c256: fsck: mark strings for translation
Nguyễn Thái Ngọc Duy  0e1a23cec: repack: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  2179045fd: Clean up pthread_create() error 
handling
Nguyễn Thái Ngọc Duy  23c0ab731: parse-options: replace opterror() 
with optname()
Nguyễn Thái Ngọc Duy  4a4f8ae76: git.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  567e20ae2: read-cache.c: mark more strings for 
translation
Nguyễn Thái Ngọc Duy  8a705c463: archive.c: mark more strings for 
translation

Nguyễn Thái Ngọc Duy  977e72ca9: reflog: mark strings for translation
Nguyễn Thái Ngọc Duy  aa4fa3fa7: attr.c: mark more string for 
translation
Nguyễn Thái Ngọc Duy  c0e40a2d6: send-pack.c: move async's #ifdef

[PATCH v2 1/1] pack-objects: ignore ambiguous object warnings

2018-11-06 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

A git push process runs several processes during its run, but one
includes git send-pack which calls git pack-objects and passes
the known have/wants into stdin using object ids. However, the
default setting for core.warnAmbiguousRefs requires git pack-objects
to check for ref names matching the ref_rev_parse_rules array in
refs.c. This means that every object is triggering at least six
"file exists?" queries.  When there are a lot of refs, this can
add up significantly! I observed a simple push spending three
seconds checking these paths.

The fix here is similar to 4c30d50 "rev-list: disable object/refname
ambiguity check with --stdin". Save the value of the global
warn_on_object_refname_ambiguity variable (which is usually set to
the boolean config variable core.warnAmbiguousRefs) and change the
state to false. Do this only during the get_object_list() method
which reads the objects from stdin.

Helped-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 builtin/pack-objects.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d1144a8f7e..f703e6df9b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2988,6 +2988,7 @@ static void get_object_list(int ac, const char **av)
struct rev_info revs;
char line[1000];
int flags = 0;
+   int save_warning;
 
init_revisions(, NULL);
save_commit_buffer = 0;
@@ -2996,6 +2997,9 @@ static void get_object_list(int ac, const char **av)
/* make sure shallows are read */
is_repository_shallow(the_repository);
 
+   save_warning = warn_on_object_refname_ambiguity;
+   warn_on_object_refname_ambiguity = 0;
+
while (fgets(line, sizeof(line), stdin) != NULL) {
int len = strlen(line);
if (len && line[len - 1] == '\n')
@@ -3022,6 +3026,8 @@ static void get_object_list(int ac, const char **av)
die(_("bad revision '%s'"), line);
}
 
+   warn_on_object_refname_ambiguity = save_warning;
+
if (use_bitmap_index && !get_object_list_from_bitmap())
return;
 
-- 
gitgitgadget

[PATCH v2 0/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee via GitGitGadget

I've been looking into the performance of git push for very large repos. Our
users are reporting that 60-80% of git push time is spent during the
"Enumerating objects" phase of git pack-objects.

A git push process runs several processes during its run, but one includes
git send-pack which calls git pack-objects and passes the known have/wants
into stdin using object ids. However, the default setting for
core.warnAmbiguousRefs requires git pack-objects to check for ref names
matching the ref_rev_parse_rules array in refs.c. This means that every
object is triggering at least six "file exists?" queries.

When there are a lot of refs, this can add up significantly! My PerfView
trace for a simple push measured 3 seconds spent checking these paths.

The fix is to set the global warn_on_object_refname_ambiguity to 0 for the
section that is performing these object reads.

In addition to this patch submission, we are looking into merging it into
our fork sooner [1].

[1] https://github.com/Microsoft/git/pull/67

Changes in V2: Instead of using the "-c" flag from send-pack, just set the
global. I left the name of the cover letter the same to not confuse anyone
viewing the message without threading.

Derrick Stolee (1):
pack-objects: ignore ambiguous object warnings

builtin/pack-objects.c | 6 ++
1 file changed, 6 insertions(+)

base-commit: cae598d9980661a978e2df4fb338518f7bf09572
Published-As:
https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
pr-68/derrickstolee/send-pack-config-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/68

Range-diff vs v1:

1: 1ef2c51550 < -: -- send-pack: set core.warnAmbiguousRefs=false
-: -- > 1: 002868ee6b pack-objects: ignore ambiguous object warnings

--
gitgitgadget

Re: [PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee


On 11/6/2018 2:51 PM, Jeff King wrote:

On Tue, Nov 06, 2018 at 02:44:42PM -0500, Jeff King wrote:


The fix for this is simple: set core.warnAmbiguousRefs to false for this
specific call of git pack-objects coming from git send-pack. We don't want
to default it to false for all calls to git pack-objects, as it is valid to
pass ref names instead of object ids. This helps regain these seconds during
a push.

I don't think you actually care about the ambiguity check between refs
here; you just care about avoiding the ref check when we've seen (and
are mostly expecting) a 40-hex sha1. We have a more specific flag for
that: warn_on_object_refname_ambiguity.

And I think it would be OK to enable that all the time for pack-objects,
which is plumbing that does typically expect object names. See prior art
in 25fba78d36 (cat-file: disable object/refname ambiguity check for
batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname
ambiguity check with --stdin, 2014-03-12).

I'd probably do it here:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e50c6cd1ff..d370638a5d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3104,6 +3104,7 @@ static void get_object_list(int ac, const char **av)


Scoping the change into get_object_list does make sense. I was doing it 
a level higher, which is not worth it. I'll reproduce your change here.



struct rev_info revs;
char line[1000];
int flags = 0;
+   int save_warning;
  
  	repo_init_revisions(the_repository, , NULL);

save_commit_buffer = 0;
@@ -3112,6 +3113,9 @@ static void get_object_list(int ac, const char **av)
/* make sure shallows are read */
is_repository_shallow(the_repository);
  
+	save_warning = warn_on_object_refname_ambiguity;

+   warn_on_object_refname_ambiguity = 0;
+
while (fgets(line, sizeof(line), stdin) != NULL) {
int len = strlen(line);
if (len && line[len - 1] == '\n')
@@ -3138,6 +3142,8 @@ static void get_object_list(int ac, const char **av)
die(_("bad revision '%s'"), line);
}
  
+	warn_on_object_refname_ambiguity = save_warning;

+
if (use_bitmap_index && !get_object_list_from_bitmap())
return;
  


But I'll leave it to you to wrap that up in a patch, since you probably
should re-check your timings (which it would be interesting to include
in the commit message, if you have reproducible timings).


The timings change a lot depending on the disk cache and the remote 
refs, which is unfortunate, but I have measured a three-second improvement.


Thanks,
-Stolee

Re: [PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee


On 11/6/2018 2:44 PM, Jeff King wrote:

On Tue, Nov 06, 2018 at 11:13:47AM -0800, Derrick Stolee via GitGitGadget wrote:


I've been looking into the performance of git push for very large repos. Our
users are reporting that 60-80% of git push time is spent during the
"Enumerating objects" phase of git pack-objects.

A git push process runs several processes during its run, but one includes
git send-pack which calls git pack-objects and passes the known have/wants
into stdin using object ids. However, the default setting for
core.warnAmbiguousRefs requires git pack-objects to check for ref names
matching the ref_rev_parse_rules array in refs.c. This means that every
object is triggering at least six "file exists?" queries.

When there are a lot of refs, this can add up significantly! My PerfView
trace for a simple push measured 3 seconds spent checking these paths.

Some of this might be useful in the commit message. :)


The fix for this is simple: set core.warnAmbiguousRefs to false for this
specific call of git pack-objects coming from git send-pack. We don't want
to default it to false for all calls to git pack-objects, as it is valid to
pass ref names instead of object ids. This helps regain these seconds during
a push.

I don't think you actually care about the ambiguity check between refs
here; you just care about avoiding the ref check when we've seen (and
are mostly expecting) a 40-hex sha1. We have a more specific flag for
that: warn_on_object_refname_ambiguity.

And I think it would be OK to enable that all the time for pack-objects,
which is plumbing that does typically expect object names. See prior art
in 25fba78d36 (cat-file: disable object/refname ambiguity check for
batch mode, 2013-07-12) and 4c30d50402 (rev-list: disable object/refname
ambiguity check with --stdin, 2014-03-12).
Thanks for these pointers. Helps to know there is precedent for shutting 
down the

behavior without relying on "-c" flags.


Whenever I see a change like this to the pack-objects invocation for
send-pack, it makes me wonder if upload-pack would want the same thing.

It's a moot point if we just set the flag directly in inside
pack-objects, though.

I'll send a v2 that does just that.

Thanks,
-Stolee

[PATCH 1/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

During a 'git push' command, we run 'git send-pack' inside of our
transport helper. This creates a 'git pack-objects' process and
passes a list of object ids. If this list is large, then the
pack-objects process can spend a lot of time checking the possible
refs these strings could represent.

Remove this extra check by setting core.warnAmbiguousRefs to false
as we call 'git pack-objects'.

Signed-off-by: Derrick Stolee 
---
 send-pack.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/send-pack.c b/send-pack.c
index e920ca57df..5055150fe1 100644
--- a/send-pack.c
+++ b/send-pack.c
@@ -64,6 +64,8 @@ static int pack_objects(int fd, struct ref *refs, struct 
oid_array *extra, struc
int i;
int rc;
 
+   argv_array_push(, "-c");
+   argv_array_push(, "core.warnAmbiguousRefs=false");
argv_array_push(, "pack-objects");
argv_array_push(, "--all-progress-implied");
argv_array_push(, "--revs");
-- 
gitgitgadget

[PATCH 0/1] send-pack: set core.warnAmbiguousRefs=false

2018-11-06 Thread Derrick Stolee via GitGitGadget

I've been looking into the performance of git push for very large repos. Our
users are reporting that 60-80% of git push time is spent during the
"Enumerating objects" phase of git pack-objects.

When there are a lot of refs, this can add up significantly! My PerfView
trace for a simple push measured 3 seconds spent checking these paths.

The fix for this is simple: set core.warnAmbiguousRefs to false for this
specific call of git pack-objects coming from git send-pack. We don't want
to default it to false for all calls to git pack-objects, as it is valid to
pass ref names instead of object ids. This helps regain these seconds during
a push.

In addition to this patch submission, we are looking into merging it into
our fork sooner [1].

[1] https://github.com/Microsoft/git/pull/67

Derrick Stolee (1):
send-pack: set core.warnAmbiguousRefs=false

send-pack.c | 2 ++
1 file changed, 2 insertions(+)

base-commit: cae598d9980661a978e2df4fb338518f7bf09572
Published-As:
https://github.com/gitgitgadget/git/releases/tags/pr-68%2Fderrickstolee%2Fsend-pack-config-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
pr-68/derrickstolee/send-pack-config-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/68
--
gitgitgadget

Re: [PATCH] multi-pack-index: make code -Wunused-parameter clean

2018-11-05 Thread Derrick Stolee


On 11/3/2018 10:27 PM, Jeff King wrote:

On Sat, Nov 03, 2018 at 05:49:57PM -0700, Carlo Marcelo Arenas Belón wrote:


introduced in 662148c435 ("midx: write object offsets", 2018-07-12)
but included on all previous versions as well.

midx.c:713:54: warning: unused parameter 'nr_objects' [-Wunused-parameter]

likely an oversight as the information needed to iterate over is
embedded in nr_large_offset

I've been preparing a series to make the whole code base compile with
-Wunused-parameter, and I handled this case a bit differently.

-- >8 --
Subject: [PATCH] midx: double-check large object write loop

The write_midx_large_offsets() function takes an array of object
entries, the number of entries in the array (nr_objects), and the number
of entries with large offsets (nr_large_offset). But we never actually
use nr_objects; instead we keep walking down the array and counting down
nr_large_offset until we've seen all of the large entries.

This is correct, but we can be a bit more defensive. If there were ever
a mismatch between nr_large_offset and the actual set of large-offset
objects, we'd walk off the end of the array.

Since we know the size of the array, we can use nr_objects to make sure
we don't walk too far.

Signed-off-by: Jeff King 


Thanks, both, for catching this. I prefer the approach that adds defenses.

Reviewed-by: Derrick Stolee 


---
  midx.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/midx.c b/midx.c
index 4fac0cd08a..ecd583666a 100644
--- a/midx.c
+++ b/midx.c
@@ -712,12 +712,18 @@ static size_t write_midx_object_offsets(struct hashfile 
*f, int large_offset_nee
  static size_t write_midx_large_offsets(struct hashfile *f, uint32_t 
nr_large_offset,
   struct pack_midx_entry *objects, 
uint32_t nr_objects)
  {
-   struct pack_midx_entry *list = objects;
+   struct pack_midx_entry *list = objects, *end = objects + nr_objects;
size_t written = 0;
  
  	while (nr_large_offset) {

-   struct pack_midx_entry *obj = list++;
-   uint64_t offset = obj->offset;
+   struct pack_midx_entry *obj;
+   uint64_t offset;
+
+   if (list >= end)
+   BUG("too many large-offset objects");
+
+   obj = list++;
+   offset = obj->offset;
  
  		if (!(offset >> 31))

continue;

Git Test Coverage Report (Friday, Nov 2)

2018-11-02 Thread Derrick Stolee

));


read-cache.c
5257b0625a  675) die(_("will not add file alias '%s' ('%s' already 
exists in index)"),

5257b0625a  676) ce->name, alias->name);
5257b0625a  691) die(_("cannot create an empty blob in the object 
database"));
5257b0625a  712) return error(_("%s: can only add regular files, 
symbolic links or git-directories"), path);

5257b0625a  786) return error(_("unable to add '%s' to index"), path);
5257b0625a  822) error(_("invalid path '%s'"), path);
5257b0625a  848) error(_("invalid path '%s'"), path);
5257b0625a 1686) return error(_("bad signature 0x%08x"), 
hdr->hdr_signature);

5257b0625a 1689) return error(_("bad index version %d"), hdr_version);
5257b0625a 1728) return error(_("index uses %.4s extension, which we do 
not understand"),

5257b0625a 1730) fprintf_ln(stderr, _("ignoring %.4s extension"), ext);
5257b0625a 1777) die(_("unknown index entry format 0x%08x"), 
extended_flags);

5257b0625a 1848) die(_("unordered stage entries in index"));
5257b0625a 1851) die(_("multiple stage entries for merged file '%s'"),
5257b0625a 1854) die(_("unordered stage entries for '%s'"),
5257b0625a 2148) die_errno(_("%s: index file open failed"), path);
5257b0625a 2152) die_errno(_("%s: cannot stat the open index"), path);
5257b0625a 2156) die(_("%s: index file smaller than expected"), path);
5257b0625a 2160) die_errno(_("%s: unable to map index file"), path);
5257b0625a 2252) warning(_("could not freshen shared index '%s'"), 
shared_index);

5257b0625a 2287) die(_("broken index, expect %s in %s, got %s"),
5257b0625a 3073) error(_("cannot fix permission bits on '%s'"), 
get_tempfile_path(*temp));

5257b0625a 3219) return error(_("%s: cannot drop to stage #0"),

ref-filter.c
35408df41e 2324) return error(_("option `%s' is incompatible with 
--no-merged"),


remote.c
a454d6a26b  362) warning(_("config remote shorthand cannot begin with 
'/': %s"),
a454d6a26b  417) error(_("more than one uploadpack given, using the 
first"));

a454d6a26b  683) die(_("key '%s' of pattern had no '*'"), key);
a454d6a26b  693) die(_("value '%s' of pattern has no '*'"), value);
a454d6a26b 1044) error(_("unable to delete '%s': remote ref does not 
exist"),
a454d6a26b 1066) return error(_("dst ref %s receives from more than one 
src"),

a454d6a26b 1753) die(_("couldn't find remote ref %s"), name);
a454d6a26b 1766) error(_("* Ignoring funny ref '%s' locally"),
a454d6a26b 1861) die(_("revision walk setup failed"));
a454d6a26b 2134) return error(_("cannot parse expected object name '%s'"),

revision.c
b45424181e 2936) return;
b45424181e 2939) return;
b45424181e 2945) c->object.flags |= UNINTERESTING;
b45424181e 2948) return;
b45424181e 2951) mark_parents_uninteresting(c);
b45424181e 2974) return;
b45424181e 2977) return;
b45424181e 3042) continue;
f0d9cc4196 3091) if (!revs->ignore_missing_links)
f0d9cc4196 3092) die("Failed to traverse parents of commit %s",
f0d9cc4196 3093) oid_to_hex(>object.oid));
b45424181e 3101) continue;

run-command.c
31bfd155d8 1229) error(_("cannot create async thread: %s"), strerror(err));

sequencer.c
bcd33ec25f  683) np = strchrnul(buf, '\n');
bcd33ec25f  684) return error(_("no key present in '%.*s'"),
bcd33ec25f  695) return error(_("unable to dequote value of '%s'"),
bcd33ec25f  737) goto finish;
bcd33ec25f  742) name_i = error(_("'GIT_AUTHOR_NAME' already given"));
bcd33ec25f  747) email_i = error(_("'GIT_AUTHOR_EMAIL' already given"));
bcd33ec25f  752) date_i = error(_("'GIT_AUTHOR_DATE' already given"));
bcd33ec25f  756) err = error(_("unknown variable '%s'"),
bcd33ec25f  761) error(_("missing 'GIT_AUTHOR_NAME'"));
bcd33ec25f  763) error(_("missing 'GIT_AUTHOR_EMAIL'"));
bcd33ec25f  765) error(_("missing 'GIT_AUTHOR_DATE'"));

sha1-file.c
2f90b9d9b4 sha1-file.c  172) int hash_algo_by_name(const char *name)
2f90b9d9b4 sha1-file.c  175) if (!name)
2f90b9d9b4 sha1-file.c  176) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  177) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  178) if (!strcmp(name, hash_algos[i].name))
2f90b9d9b4 sha1-file.c  179) return i;
2f90b9d9b4 sha1-file.c  180) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  183) int hash_algo_by_id(uint32_t format_id)
2f90b9d9b4 sha1-file.c  186) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  187) if (format_id == hash_algos[i].format_id)
2f90b9d9b4 sha1-file.c  188) return i;
2f90b9d9b4 sha1-file.c  189) return GIT_HASH_UNKNOWN;

xdiff-interface.c
xdiff/xutils.c
611e42a598 405) return -1;

Commits introducing uncovered code:
brian m. carlson  2f90b9d9b: s

[PATCH] merge-recursive: combine error handling

2018-11-02 Thread Derrick Stolee

In handle_rename_rename_1to2(), we have duplicated error handling
around colliding paths. Specifically, when we want to write out
the file and there is a directory or untracked file in the way,
we need to create a temporary file to hold the contents. This has
some special output to alert the user, and this output is
duplicated for each side of the conflict.

Simplify the call by generating this new path in a helper
function.

Signed-off-by: Derrick Stolee 
---

Elijah,

Here is a patch that combines the logic to avoid code clones, and also
more easily covers code blocks. Of course, your additional test helps
the branch coverage.

Thanks,
-Stolee

 merge-recursive.c | 53 ---
 1 file changed, 27 insertions(+), 26 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 5986b6..5e36bef162 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1709,6 +1709,27 @@ static int handle_rename_add(struct merge_options *o,
 ci->dst_entry1->stages[other_stage].mode);
 }
 
+static char *find_path_for_conflict(struct merge_options *o,
+   const char *path,
+   const char *branch1,
+   const char *branch2)
+{
+   char *new_path = NULL;
+   if (dir_in_way(path, !o->call_depth, 0)) {
+   new_path = unique_path(o, path, branch1);
+   output(o, 1, _("%s is a directory in %s adding "
+  "as %s instead"),
+  path, branch2, new_path);
+   } else if (would_lose_untracked(path)) {
+   new_path = unique_path(o, path, branch1);
+   output(o, 1, _("Refusing to lose untracked file"
+  " at %s; adding as %s instead"),
+  path, new_path);
+   }
+
+   return new_path;
+}
+
 static int handle_rename_rename_1to2(struct merge_options *o,
 struct rename_conflict_info *ci)
 {
@@ -1783,19 +1804,9 @@ static int handle_rename_rename_1to2(struct 
merge_options *o,
  >oid, add->mode) < 0)
return -1;
} else {
-   char *new_path = NULL;
-   if (dir_in_way(a->path, !o->call_depth, 0)) {
-   new_path = unique_path(o, a->path, ci->branch1);
-   output(o, 1, _("%s is a directory in %s adding "
-  "as %s instead"),
-  a->path, ci->branch2, new_path);
-   } else if (would_lose_untracked(a->path)) {
-   new_path = unique_path(o, a->path, ci->branch1);
-   output(o, 1, _("Refusing to lose untracked file"
-  " at %s; adding as %s instead"),
-  a->path, new_path);
-   }
-
+   char *new_path = find_path_for_conflict(o, a->path,
+   ci->branch1,
+   ci->branch2);
if (update_file(o, 0, , mfi.mode, new_path ? 
new_path : a->path))
return -1;
free(new_path);
@@ -1812,19 +1823,9 @@ static int handle_rename_rename_1to2(struct 
merge_options *o,
  , mfi.mode) < 0)
return -1;
} else {
-   char *new_path = NULL;
-   if (dir_in_way(b->path, !o->call_depth, 0)) {
-   new_path = unique_path(o, b->path, ci->branch2);
-   output(o, 1, _("%s is a directory in %s adding "
-  "as %s instead"),
-  b->path, ci->branch1, new_path);
-   } else if (would_lose_untracked(b->path)) {
-   new_path = unique_path(o, b->path, ci->branch2);
-   output(o, 1, _("Refusing to lose untracked file"
-  " at %s; adding as %s instead"),
-  b->path, new_path);
-   }
-
+   char *new_path = find_path_for_conflict(o, b->path,
+   ci->branch2,
+   ci->branch1);
if (update_file(o, 0, , mfi.mode, new_path ? 
new_path : b->path))
return -1;
free(new_path);
-- 
2.19.1.1235.gbda564b1a2

Re: [PATCH v4 00/10] Improve path collision conflict resolutions

2018-11-02 Thread Derrick Stolee


On 11/2/2018 2:53 PM, Elijah Newren wrote:

Major question:
   * You'll note that I edited the last two patches to mark them as RFC.
 To be honest, I'm not sure what to do with these.  They improve code
 coverage of new code, but the same gaps existed in the old code;
 they only show up in the coverage-diff because I essentially moved
 code around to a new improved function.  Since the new code doesn't
 really add new abilities but rather just shifts the handling of
 these conflicts to a common function, they shouldn't need any more
 testcases than previously and modifying the existing patches thus
 feels slightly misleading.  That line of thought leads me to believe
 that perhaps putting them in a separate combined patch of their own
 with a decent commit message is the right way to go.  On the other
 hand, squashing them to the commits they're marked as fixups for
 shows which logical part of the code the tests are related to, which
 seems like a good thing.  So what's the right way to handle these?


I appreciate the effort you made to improve test coverage! It's 
unfortunate that this portion wasn't covered earlier, because we could 
have broken it and not noticed until a release.


I think making them separate commits is fine, and the comment on the 
test case is helpful. The fact that you only had to change the commit 
timestamps in order to get the coverage makes me reexamine the code and 
realize that maybe the "right" thing to do is to reduce our code clones. 
(This is also how I was looking at the wrong block of the patch when 
talking about it not being covered.) I'll look at the patch and see if I 
can contribute a concrete code suggestion there.


Aside: I hope that I am not being annoying by poking around with the 
test coverage reports. It does give me a way to target my review 
efforts, especially into changes that touch areas outside my expertise 
(like this one).


Thanks,

-Stolee

Re: [PATCH v3 8/8] merge-recursive: improve rename/rename(1to2)/add[/add] handling

2018-11-02 Thread Derrick Stolee


On 11/2/2018 1:27 PM, Elijah Newren wrote:

On Thu, Nov 1, 2018 at 12:01 AM Elijah Newren  wrote:

On Wed, Oct 31, 2018 at 8:08 AM Derrick Stolee  wrote:

On 10/19/2018 3:31 PM, Elijah Newren wrote:

[snip]

+ char *new_path = NULL;
+ if (dir_in_way(b->path, !o->call_depth, 0)) {
+ new_path = unique_path(o, b->path, ci->branch2);
+ output(o, 1, _("%s is a directory in %s adding "
+"as %s instead"),
+b->path, ci->branch1, new_path);

I tried really hard, but failed to get a test to cover the block below.
I was able to
find that the "check handling of differently renamed file with D/F
conflicts" test
in t6022-merge-rename.sh covers the block above. Trying to tweak the
example using
untracked files seems to hit an error message from unpack-trees.c instead.


+ } else if (would_lose_untracked(b->path)) {
+ new_path = unique_path(o, b->path, ci->branch2);
+ output(o, 1, _("Refusing to lose untracked file"
+" at %s; adding as %s instead"),
+b->path, new_path);

So now I'm confused.  This block was not listed in your coverage
report[1].  And, in fact, I think this block IS covered by testcase
10c of t6043.  However, there is a very similar looking block about 30
lines up that is uncovered (and which was mentioned in your report):

 } else if (would_lose_untracked(a->path)) {
 new_path = unique_path(o, a->path, ci->branch1);
 output(o, 1, _("Refusing to lose untracked file"
" at %s; adding as %s instead"),
a->path, new_path);

covering it, I think, is just a matter of repeating the 10c test with
the merge repeated in the other direction (checkout B and merge A
instead of checking out A and merging B) -- and touching up the checks
accordingly.

However, now I'm wondering if I'm crazy.  Was it really the block you
had highlighted that you were seeing uncovered?


Trust the report (generated by computer) over me (generated by squinting 
at an email, trying to match line numbers).


Thanks,

-Stolee

Re: [PATCH 19/24] submodule: use submodule repos for object lookup

2018-11-02 Thread Derrick Stolee


On 11/2/2018 1:23 PM, Stefan Beller wrote:

On Fri, Nov 2, 2018 at 6:03 AM Derrick Stolee  wrote:

On 10/30/2018 6:08 PM, Stefan Beller wrote:

This converts the 'show_submodule_header' function to use
the repository API properly, such that the submodule objects
are not added to the main object store.

Signed-off-by: Stefan Beller 

A couple tests are broken in 'pu' when run with GIT_TEST_COMMIT_GRAPH=1,
including t4041-diff-submodule-option.sh. The failure bisects to this patch.

Here is a verbose output of the first failure in that script:;


expecting success:
  git diff-index -p --submodule=log HEAD >actual &&
  cat >expected <<-EOF &&
  Submodule sm1 $head2..$head3 (rewind):
< Add foo3 ($added foo3)
< Add foo2 ($added foo2)
  EOF
  test_cmp expected actual

+ git diff-index -p --submodule=log HEAD
+ cat
+ test_cmp expected actual
+ diff -u expected actual
--- expected2018-11-02 12:58:43.429262380 +
+++ actual  2018-11-02 12:58:43.429262380 +
@@ -1,3 +1,5 @@
-Submodule sm1 30b9670..dafb207 (rewind):
+Submodule sm1 30b9670...dafb207:
 < Add foo3 (hinzugefügt foo3)
 < Add foo2 (hinzugefügt foo2)
+  > Add foo1 (hinzugefügt foo1)
+  < Add foo1 (hinzugefügt foo1)
error: last command exited with $?=1
not ok 9 - modified submodule(backward)

I've been looking into the patch below to see if there is an obvious
problem, but the best I can think is that open_submodule() creates an
alternate 'struct repository' and somehow the commit-graph feature is
interacting poorly with that struct.

Stefan, do you know what's going on?

Sure, see the last four patches of this series
https://public-inbox.org/git/20181030220817.61691-1-sbel...@google.com/
(to which you also reply to? Junio did not queue this one, yet).


Sorry! Got a bit mixed up looking at everything. I didn't realize that 
the current 'pu' didn't have your latest. Thanks!

Re: [PATCH 0/3] Make add_missing_tags() linear

2018-11-02 Thread Derrick Stolee


On 11/2/2018 10:58 AM, Elijah Newren wrote:

On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee  wrote:

On 11/1/2018 2:57 PM, Elijah Newren wrote:

On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee  wrote:

No rush. I'd just like to understand how removing the commit-graph file
can make the new algorithm faster. Putting a similar count in the old
algorithm would involve giving a count for every call to
in_merge_bases_many(), which would be very noisy.

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
count: 92912
To /home/newren/repo-mirror
   * [new branch]  test5 -> test5

real0m3.024s
user0m2.752s
sys0m0.320s

Is the above test with or without the commit-graph file? Can you run it
in the other mode, too? I'd like to see if the "count" value changes
when the only difference is the presence of a commit-graph file.

I apologize for providing misleading information earlier; this was an
apples to oranges comparison.  Here's what I did:


git clone coworker.bundle coworker-repo
cd coworker-repo
time git push --dry-run --follow-tags /home/newren/repo-mirror
git config core.commitgraph true
git config gc.writecommitgraph true
git gc
time git push --dry-run --follow-tags /home/newren/nucleus-mirror


I figured I had just done a fresh clone, so surely the gc wouldn't do
anything other than write the .git/objects/info/commit-graph file.
However, the original bundle contained many references outside of
refs/heads/ and refs/tags/:

$ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e
refs/tags/ -e HEAD | wc -l
2396

These other refs apparently referred to objects not otherwise
referenced in refs/heads/ and refs/tags/, and caused the gc to explode
lots of loose objects:
$ git count-objects -v
count: 147604
size: 856416
in-pack: 1180692
packs: 1
size-pack: 796143
prune-packable: 0
garbage: 0
size-garbage: 0


The slowdown with commit-graph was entirely due to there being lots of
loose objects (147K of them).  If I add a git-prune before doing the
timing with commit-graph, then the timing with commit-graph is faster
than the run without a commit-graph.

Sorry for the wild goose chase.

And thanks for the fixes; get_reachable_subset() makes things much
faster even without a commit-graph, and the commit-graph just improves
it more.  :-)



Thanks for double-checking! It's good to have confidence that this is a good

algorithm to use.


Thanks,

-Stolee

[PATCH v2 0/3] Make add_missing_tags() linear

2018-11-02 Thread Derrick Stolee via GitGitGadget

As reported earlier [1], the add_missing_tags() method in remote.c has
quadratic performance. Some of that performance is curbed due to the
generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
help users without a commit-graph, and it can still be painful if that
cutoff is sufficiently low compared to the tags we are using for
reachability testing.

Add a new method in commit-reach.c called get_reachable_subset() which does
a many-to-many reachability test. Starting at the 'from' commits, walk until
the generation is below the smallest generation in the 'to' commits, or all
'to' commits have been discovered. This performs only one commit walk for
the entire add_missing_tags() method, giving linear performance in the worst
case.

Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
works independently of its application in add_missing_tags().

Thanks, -Stolee

[1] 
https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/

Derrick Stolee (3):
  commit-reach: implement get_reachable_subset
  test-reach: test get_reachable_subset
  remote: make add_missing_tags() linear

 commit-reach.c| 70 +++
 commit-reach.h| 13 
 remote.c  | 34 -
 t/helper/test-reach.c | 34 ++---
 t/t6600-test-reach.sh | 52 
 5 files changed, 198 insertions(+), 5 deletions(-)


base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-60/derrickstolee/add-missing-tags-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/60

Range-diff vs v1:

 1:  4c0c5c9143 ! 1:  9e570603bd commit-reach: implement get_reachable_subset
 @@ -49,7 +49,7 @@
  +
  +struct commit_list *get_reachable_subset(struct commit **from, int 
nr_from,
  +  struct commit **to, int nr_to,
 -+  int reachable_flag)
 ++  unsigned int reachable_flag)
  +{
  + struct commit **item;
  + struct commit *current;
 @@ -129,9 +129,12 @@
  + * Return a list of commits containing the commits in the 'to' array
  + * that are reachable from at least one commit in the 'from' array.
  + * Also add the given 'flag' to each of the commits in the returned list.
 ++ *
 ++ * This method uses the PARENT1 and PARENT2 flags during its operation,
 ++ * so be sure these flags are not set before calling the method.
  + */
  +struct commit_list *get_reachable_subset(struct commit **from, int 
nr_from,
  +  struct commit **to, int nr_to,
 -+  int reachable_flag);
 ++  unsigned int reachable_flag);
  +
   #endif
 2:  382f4f4a5b = 2:  52e847b928 test-reach: test get_reachable_subset
 3:  ecbed3de5c = 3:  dfaceb162f remote: make add_missing_tags() linear

-- 
gitgitgadget

[PATCH v2 1/3] commit-reach: implement get_reachable_subset

2018-11-02 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The existing reachability algorithms in commit-reach.c focus on
finding merge-bases or determining if all commits in a set X can
reach at least one commit in a set Y. However, for two commits sets
X and Y, we may also care about which commits in Y are reachable
from at least one commit in X.

Implement get_reachable_subset() which answers this question. Given
two arrays of commits, 'from' and 'to', return a commit_list with
every commit from the 'to' array that is reachable from at least
one commit in the 'from' array.

The algorithm is a simple walk starting at the 'from' commits, using
the PARENT2 flag to indicate "this commit has already been added to
the walk queue". By marking the 'to' commits with the PARENT1 flag,
we can determine when we see a commit from the 'to' array. We remove
the PARENT1 flag as we add that commit to the result list to avoid
duplicates.

The order of the resulting list is a reverse of the order that the
commits are discovered in the walk.

There are a couple shortcuts to avoid walking more than we need:

1. We determine the minimum generation number of commits in the
   'to' array. We do not walk commits with generation number
   below this minimum.

2. We count how many distinct commits are in the 'to' array, and
   decrement this count when we discover a 'to' commit during the
   walk. If this number reaches zero, then we can terminate the
   walk.

Tests will be added using the 'test-tool reach' helper in a
subsequent commit.

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 70 ++
 commit-reach.h | 13 ++
 2 files changed, 83 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a22..8ad5352752 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct 
commit_list *to,
object_array_clear(_objs);
return result;
 }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+struct commit **to, int nr_to,
+unsigned int reachable_flag)
+{
+   struct commit **item;
+   struct commit *current;
+   struct commit_list *found_commits = NULL;
+   struct commit **to_last = to + nr_to;
+   struct commit **from_last = from + nr_from;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+   int num_to_find = 0;
+
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+   for (item = to; item < to_last; item++) {
+   struct commit *c = *item;
+   
+   parse_commit(c);
+   if (c->generation < min_generation)
+   min_generation = c->generation;
+
+   if (!(c->object.flags & PARENT1)) {
+   c->object.flags |= PARENT1;
+   num_to_find++;
+   }
+   }
+
+   for (item = from; item < from_last; item++) {
+   struct commit *c = *item;
+   if (!(c->object.flags & PARENT2)) {
+   c->object.flags |= PARENT2;
+   parse_commit(c);
+
+   prio_queue_put(, *item);
+   }
+   }
+
+   while (num_to_find && (current = prio_queue_get()) != NULL) {
+   struct commit_list *parents;
+
+   if (current->object.flags & PARENT1) {
+   current->object.flags &= ~PARENT1;
+   current->object.flags |= reachable_flag;
+   commit_list_insert(current, _commits);
+   num_to_find--;
+   }
+
+   for (parents = current->parents; parents; parents = 
parents->next) {
+   struct commit *p = parents->item;
+
+   parse_commit(p);
+
+   if (p->generation < min_generation)
+   continue;
+
+   if (p->object.flags & PARENT2)
+   continue;
+
+   p->object.flags |= PARENT2;
+   prio_queue_put(, p);
+   }
+   }
+
+   clear_commit_marks_many(nr_to, to, PARENT1);
+   clear_commit_marks_many(nr_from, from, PARENT2);
+
+   return found_commits;
+}
+
diff --git a/commit-reach.h b/commit-reach.h
index 7d313e2975..bb34af0269 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -74,4 +74,17 @@ int can_all_from_reach_with_flag(struct object_array *from,
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
   int commit_date_cutoff);
 
+
+/*
+ * Return a list of commits containing the commits in the 'to' array
+ * that are reachable from at least one commit in the 'from' array.
+ * Also add the given 'flag

[PATCH v2 3/3] remote: make add_missing_tags() linear

2018-11-02 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The add_missing_tags() method currently has quadratic behavior.
This is due to a linear number (based on number of tags T) of
calls to in_merge_bases_many, which has linear performance (based
on number of commits C in the repository).

Replace this O(T * C) algorithm with an O(T + C) algorithm by
using get_reachable_subset(). We ignore the return list and focus
instead on the reachable_flag assigned to the commits we care
about, because we need to interact with the tag ref and not just
the commit object.

Signed-off-by: Derrick Stolee 
---
 remote.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/remote.c b/remote.c
index 81f4f01b00..b850f2feb3 100644
--- a/remote.c
+++ b/remote.c
@@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * sent to the other side.
 */
if (sent_tips.nr) {
+   const int reachable_flag = 1;
+   struct commit_list *found_commits;
+   struct commit **src_commits;
+   int nr_src_commits = 0, alloc_src_commits = 16;
+   ALLOC_ARRAY(src_commits, alloc_src_commits);
+
for_each_string_list_item(item, _tag) {
struct ref *ref = item->util;
+   struct commit *commit;
+
+   if (is_null_oid(>new_oid))
+   continue;
+   commit = lookup_commit_reference_gently(the_repository,
+   >new_oid,
+   1);
+   if (!commit)
+   /* not pushing a commit, which is not an error 
*/
+   continue;
+
+   ALLOC_GROW(src_commits, nr_src_commits + 1, 
alloc_src_commits);
+   src_commits[nr_src_commits++] = commit;
+   }
+
+   found_commits = get_reachable_subset(sent_tips.tip, 
sent_tips.nr,
+src_commits, 
nr_src_commits,
+reachable_flag);
+
+   for_each_string_list_item(item, _tag) {
struct ref *dst_ref;
+   struct ref *ref = item->util;
struct commit *commit;
 
if (is_null_oid(>new_oid))
@@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * Is this tag, which they do not have, reachable from
 * any of the commits we are sending?
 */
-   if (!in_merge_bases_many(commit, sent_tips.nr, 
sent_tips.tip))
+   if (!(commit->object.flags & reachable_flag))
continue;
 
/* Add it in */
@@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
oidcpy(_ref->new_oid, >new_oid);
dst_ref->peer_ref = copy_ref(ref);
}
+
+   clear_commit_marks_many(nr_src_commits, src_commits, 
reachable_flag);
+   free(src_commits);
+   free_commit_list(found_commits);
}
+
string_list_clear(_tag, 0);
free(sent_tips.tip);
 }
-- 
gitgitgadget

[PATCH v2 2/3] test-reach: test get_reachable_subset

2018-11-02 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The get_reachable_subset() method returns the list of commits in
the 'to' array that are reachable from at least one commit in the
'from' array. Add tests that check this method works in a few
cases:

1. All commits in the 'to' list are reachable. This exercises the
   early-termination condition.

2. Some commits in the 'to' list are reachable. This exercises the
   loop-termination condition.

3. No commits in the 'to' list are reachable. This exercises the
   NULL return condition.

Signed-off-by: Derrick Stolee 
---
 t/helper/test-reach.c | 34 
 t/t6600-test-reach.sh | 52 +++
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index 08d2ea68e8..a0272178b7 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av)
struct commit *A, *B;
struct commit_list *X, *Y;
struct object_array X_obj = OBJECT_ARRAY_INIT;
-   struct commit **X_array;
-   int X_nr, X_alloc;
+   struct commit **X_array, **Y_array;
+   int X_nr, X_alloc, Y_nr, Y_alloc;
struct strbuf buf = STRBUF_INIT;
struct repository *r = the_repository;
 
@@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av)
 
A = B = NULL;
X = Y = NULL;
-   X_nr = 0;
-   X_alloc = 16;
+   X_nr = Y_nr = 0;
+   X_alloc = Y_alloc = 16;
ALLOC_ARRAY(X_array, X_alloc);
+   ALLOC_ARRAY(Y_array, Y_alloc);
 
while (strbuf_getline(, stdin) != EOF) {
struct object_id oid;
@@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av)
 
case 'Y':
commit_list_insert(c, );
+   ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc);
+   Y_array[Y_nr++] = c;
break;
 
default:
@@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av)
filter.with_commit_tag_algo = 0;
 
printf("%s(_,A,X,_):%d\n", av[1], commit_contains(, A, 
X, ));
+   } else if (!strcmp(av[1], "get_reachable_subset")) {
+   const int reachable_flag = 1;
+   int i, count = 0;
+   struct commit_list *current;
+   struct commit_list *list = get_reachable_subset(X_array, X_nr,
+   Y_array, Y_nr,
+   reachable_flag);
+   printf("get_reachable_subset(X,Y)\n");
+   for (current = list; current; current = current->next) {
+   if (!(list->item->object.flags & reachable_flag))
+   die(_("commit %s is not marked reachable"),
+   oid_to_hex(>item->object.oid));
+   count++;
+   }
+   for (i = 0; i < Y_nr; i++) {
+   if (Y_array[i]->object.flags & reachable_flag)
+   count--;
+   }
+
+   if (count < 0)
+   die(_("too many commits marked reachable"));
+
+   print_sorted_commit_ids(list);
}
 
exit(0);
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index ae94b27f70..a0c64e617a 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'get_reachable_subset:all' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-6-6
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 \
+ commit-5-6 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:some' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:none' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+

Re: [PATCH 19/24] submodule: use submodule repos for object lookup

2018-11-02 Thread Derrick Stolee


On 10/30/2018 6:08 PM, Stefan Beller wrote:

This converts the 'show_submodule_header' function to use
the repository API properly, such that the submodule objects
are not added to the main object store.

Signed-off-by: Stefan Beller 


A couple tests are broken in 'pu' when run with GIT_TEST_COMMIT_GRAPH=1, 
including t4041-diff-submodule-option.sh. The failure bisects to this patch.


Here is a verbose output of the first failure in that script:;


expecting success:
    git diff-index -p --submodule=log HEAD >actual &&
    cat >expected <<-EOF &&
    Submodule sm1 $head2..$head3 (rewind):
  < Add foo3 ($added foo3)
  < Add foo2 ($added foo2)
    EOF
    test_cmp expected actual

+ git diff-index -p --submodule=log HEAD
+ cat
+ test_cmp expected actual
+ diff -u expected actual
--- expected    2018-11-02 12:58:43.429262380 +
+++ actual  2018-11-02 12:58:43.429262380 +
@@ -1,3 +1,5 @@
-Submodule sm1 30b9670..dafb207 (rewind):
+Submodule sm1 30b9670...dafb207:
   < Add foo3 (hinzugefügt foo3)
   < Add foo2 (hinzugefügt foo2)
+  > Add foo1 (hinzugefügt foo1)
+  < Add foo1 (hinzugefügt foo1)
error: last command exited with $?=1
not ok 9 - modified submodule(backward)

I've been looking into the patch below to see if there is an obvious 
problem, but the best I can think is that open_submodule() creates an 
alternate 'struct repository' and somehow the commit-graph feature is 
interacting poorly with that struct.


Stefan, do you know what's going on?

Thanks,

-Stolee


---
  submodule.c | 76 ++---
  1 file changed, 61 insertions(+), 15 deletions(-)

diff --git a/submodule.c b/submodule.c
index d9d3046689..0fdda45252 100644
--- a/submodule.c
+++ b/submodule.c
@@ -443,7 +443,7 @@ static int prepare_submodule_summary(struct rev_info *rev, 
const char *path,
return prepare_revision_walk(rev);
  }
  
-static void print_submodule_summary(struct rev_info *rev, struct diff_options *o)

+static void print_submodule_summary(struct repository *r, struct rev_info 
*rev, struct diff_options *o)
  {
static const char format[] = "  %m %s";
struct strbuf sb = STRBUF_INIT;
@@ -454,7 +454,8 @@ static void print_submodule_summary(struct rev_info *rev, 
struct diff_options *o
ctx.date_mode = rev->date_mode;
ctx.output_encoding = get_log_output_encoding();
strbuf_setlen(, 0);
-   format_commit_message(commit, format, , );
+   repo_format_commit_message(r, commit, format, ,
+ );
strbuf_addch(, '\n');
if (commit->object.flags & SYMMETRIC_LEFT)
diff_emit_submodule_del(o, sb.buf);
@@ -481,14 +482,47 @@ void prepare_submodule_repo_env(struct argv_array *out)
 DEFAULT_GIT_DIR_ENVIRONMENT);
  }
  
-/* Helper function to display the submodule header line prior to the full

- * summary output. If it can locate the submodule objects directory it will
- * attempt to lookup both the left and right commits and put them into the
- * left and right pointers.
+/*
+ * Initialize 'out' based on the provided submodule path.
+ *
+ * Unlike repo_submodule_init, this tolerates submodules not present
+ * in .gitmodules. This function exists only to preserve historical behavior,
+ *
+ * Returns 0 on success, -1 when the submodule is not present.
   */
-static void show_submodule_header(struct diff_options *o, const char *path,
+static struct repository *open_submodule(const char *path)
+{
+   struct strbuf sb = STRBUF_INIT;
+   struct repository *out = xmalloc(sizeof(*out));
+
+   if (submodule_to_gitdir(, path) || repo_init(out, sb.buf, NULL)) {
+   strbuf_release();
+   free(out);
+   return NULL;
+   }
+
+   out->submodule_prefix = xstrfmt("%s%s/",
+   the_repository->submodule_prefix ?
+   the_repository->submodule_prefix :
+   "", path);
+
+   strbuf_release();
+   return out;
+}
+
+/*
+ * Helper function to display the submodule header line prior to the full
+ * summary output.
+ *
+ * If it can locate the submodule git directory it will create a repository
+ * handle for the submodule and lookup both the left and right commits and
+ * put them into the left and right pointers.
+ */
+static void show_submodule_header(struct diff_options *o,
+   const char *path,
struct object_id *one, struct object_id *two,
unsigned dirty_submodule,
+   struct repository *sub,
struct commit **left, struct commit **right,
struct commit_list **merge_bases)
  {
@@ -507,7 +541,7 @@ static void show_submodule_header(struct diff_options *o, 
const char *path,
else if (is_null_oid(two))

Re: [PATCH 0/3] Make add_missing_tags() linear

2018-11-01 Thread Derrick Stolee


On 11/1/2018 2:57 PM, Elijah Newren wrote:

On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee  wrote:

No rush. I'd just like to understand how removing the commit-graph file
can make the new algorithm faster. Putting a similar count in the old
algorithm would involve giving a count for every call to
in_merge_bases_many(), which would be very noisy.

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
count: 92912
To /home/newren/repo-mirror
  * [new branch]  test5 -> test5

real0m3.024s
user0m2.752s
sys0m0.320s


Is the above test with or without the commit-graph file? Can you run it 
in the other mode, too? I'd like to see if the "count" value changes 
when the only difference is the presence of a commit-graph file.


Thanks,
-Stolee

Git Test Coverage Report (Thursday, Nov 1) Part A

2018-11-01 Thread Derrick Stolee

ath '%s'"), path);
5257b0625a 1686) return error(_("bad signature 0x%08x"), 
hdr->hdr_signature);

5257b0625a 1689) return error(_("bad index version %d"), hdr_version);
5257b0625a 1728) return error(_("index uses %.4s extension, which we do 
not understand"),

5257b0625a 1730) fprintf_ln(stderr, _("ignoring %.4s extension"), ext);
5257b0625a 1777) die(_("unknown index entry format 0x%08x"), 
extended_flags);

5257b0625a 1848) die(_("unordered stage entries in index"));
5257b0625a 1851) die(_("multiple stage entries for merged file '%s'"),
5257b0625a 1854) die(_("unordered stage entries for '%s'"),
5257b0625a 2148) die_errno(_("%s: index file open failed"), path);
5257b0625a 2152) die_errno(_("%s: cannot stat the open index"), path);
5257b0625a 2156) die(_("%s: index file smaller than expected"), path);
5257b0625a 2160) die_errno(_("%s: unable to map index file"), path);
5257b0625a 2252) warning(_("could not freshen shared index '%s'"), 
shared_index);

5257b0625a 2287) die(_("broken index, expect %s in %s, got %s"),
5257b0625a 3073) error(_("cannot fix permission bits on '%s'"), 
get_tempfile_path(*temp));

5257b0625a 3219) return error(_("%s: cannot drop to stage #0"),

ref-filter.c
35408df41e 2324) return error(_("option `%s' is incompatible with 
--no-merged"),


remote.c
a454d6a26b  362) warning(_("config remote shorthand cannot begin with 
'/': %s"),
a454d6a26b  417) error(_("more than one uploadpack given, using the 
first"));

a454d6a26b  683) die(_("key '%s' of pattern had no '*'"), key);
a454d6a26b  693) die(_("value '%s' of pattern has no '*'"), value);
a454d6a26b 1044) error(_("unable to delete '%s': remote ref does not 
exist"),
a454d6a26b 1066) return error(_("dst ref %s receives from more than one 
src"),

a454d6a26b 1753) die(_("couldn't find remote ref %s"), name);
a454d6a26b 1766) error(_("* Ignoring funny ref '%s' locally"),
a454d6a26b 1861) die(_("revision walk setup failed"));
a454d6a26b 2134) return error(_("cannot parse expected object name '%s'"),

revision.c
a63d88e595 2932) return;
a63d88e595 2935) return;
a63d88e595 2941) c->object.flags |= UNINTERESTING;
a63d88e595 2944) return;
a63d88e595 2947) mark_parents_uninteresting(c);
a63d88e595 2970) return;
a63d88e595 2973) return;
a63d88e595 2978) return;
a63d88e595 3042) continue;
f33f8de6af 3090) if (!revs->ignore_missing_links)
f33f8de6af 3091) die("Failed to traverse parents of commit %s",
a63d88e595 3092) oid_to_hex(>object.oid));
a63d88e595 3100) continue;

run-command.c
31bfd155d8 1229) error(_("cannot create async thread: %s"), strerror(err));

sequencer.c
bcd33ec25f  683) np = strchrnul(buf, '\n');
bcd33ec25f  684) return error(_("no key present in '%.*s'"),
bcd33ec25f  695) return error(_("unable to dequote value of '%s'"),
bcd33ec25f  737) goto finish;
bcd33ec25f  742) name_i = error(_("'GIT_AUTHOR_NAME' already given"));
bcd33ec25f  747) email_i = error(_("'GIT_AUTHOR_EMAIL' already given"));
bcd33ec25f  752) date_i = error(_("'GIT_AUTHOR_DATE' already given"));
bcd33ec25f  756) err = error(_("unknown variable '%s'"),
bcd33ec25f  761) error(_("missing 'GIT_AUTHOR_NAME'"));
bcd33ec25f  763) error(_("missing 'GIT_AUTHOR_EMAIL'"));
bcd33ec25f  765) error(_("missing 'GIT_AUTHOR_DATE'"));

sha1-file.c
2f90b9d9b4 sha1-file.c  172) int hash_algo_by_name(const char *name)
2f90b9d9b4 sha1-file.c  175) if (!name)
2f90b9d9b4 sha1-file.c  176) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  177) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  178) if (!strcmp(name, hash_algos[i].name))
2f90b9d9b4 sha1-file.c  179) return i;
2f90b9d9b4 sha1-file.c  180) return GIT_HASH_UNKNOWN;
2f90b9d9b4 sha1-file.c  183) int hash_algo_by_id(uint32_t format_id)
2f90b9d9b4 sha1-file.c  186) for (i = 1; i < GIT_HASH_NALGOS; i++)
2f90b9d9b4 sha1-file.c  187) if (format_id == hash_algos[i].format_id)
2f90b9d9b4 sha1-file.c  188) return i;
2f90b9d9b4 sha1-file.c  189) return GIT_HASH_UNKNOWN;

Commits introducing uncovered code:
brian m. carlson  2f90b9d9b: sha1-file: provide functions to look up 
hash algorithms
brian m. carlson  b3a41547c: hex: introduce functions to print 
arbitrary hashes
Daniels Umanovskis  0ecb1fc72: branch: introduce --show-current 
display option

Derrick Stolee  1dcd9f204: midx: close multi-pack-index on repack
Derrick Stolee  a63d88e59: revision.c: generation-based topo-order 
algorithm
Derrick Stolee  f33f8de6a: revision.c: begin refactoring 
--topo-order logic

Jeff King  98f425b45: cat-file: handle streaming failures consistently
Joel Teichroeb  3d5ec65ce: stash: convert apply to builtin
Joel T

Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm

2018-11-01 Thread Derrick Stolee


On 11/1/2018 11:48 AM, SZEDER Gábor wrote:

On Thu, Nov 01, 2018 at 01:46:22PM +, Derrick Stolee wrote:

1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number)
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number)

Nit: I've been pondering for a while what exactly does "order by
maximizing ..." mean.  Highest to lowest or lowest to highest?  If I
understand the rest of the descriptions (that I snipped) correctly,
then it's the former, but I find that phrase in itself too ambiguous.


It means that our priority-queue "get" operation selects the item in the
queue that is largest by our comparison function (first generation number,
thencommit-date for ties).This means we walk commits that have high
generation number before those with lower generation number, guaranteeing
that we walk all children of a commit before walking that commit.

Thanks,
-Stolee

master updated? (was Re: What's cooking in git.git (Nov 2018, #01; Thu, 1))

2018-11-01 Thread Derrick Stolee


On 11/1/2018 5:59 AM, Junio C Hamano wrote:

--
[Graduated to "master"]


I see that several topics graduated, but it appears the master branch 
was not updated at https://github.com/gister/git. Was this intentional?


I noticed because the test-coverage build [1] using the previous master 
as master@{1} reported no uncovered code into master (because no new 
commits exist on master).


Thanks,
-Stolee

[1] https://git.visualstudio.com/git/_build/results?buildId=237=log

Re: [PATCH v4 0/7] Use generation numbers for --topo-order

2018-11-01 Thread Derrick Stolee


On 11/1/2018 1:21 AM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

Review discussions seem to have petered out.  Would we merge this to
'next' and start cooking, perhaps for the remainder of this cycle?


Thanks, but I've just sent a v5 responding to Jakub's feedback on v4. [1]

I'd be happy to let it sit in next until you feel it has cooked long 
enough. I'm available to respond to feedback in the form of new topics.


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20181101134623.84055-1-dsto...@microsoft.com/T/#u

[PATCH v5 6/7] revision.c: generation-based topo-order algorithm

2018-11-01 Thread Derrick Stolee

n
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Helped-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 object.h   |   4 +-
 revision.c | 195 +++--
 revision.h |   2 +
 3 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/object.h b/object.h
index 0feb90ae61..796792cb32 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:   0-10  2526
+ * revision.h:   0-10  2528
  * fetch-pack.c: 01
  * negotiator/default.c:   2--5
  * walker.c: 0-2
@@ -78,7 +78,7 @@ struct object_array {
  * builtin/show-branch.c:0---26
  * builtin/unpack-objects.c: 2021
  */
-#define FLAG_BITS  27
+#define FLAG_BITS  29
 
 /*
  * The object type is stored in 3 bits.
diff --git a/revision.c b/revision.c
index 36458265a0..4ef47d2fb4 100644
--- a/revision.c
+++ b/revision.c
@@ -26,6 +26,7 @@
 #include "argv-array.h"
 #include "commit-reach.h"
 #include "commit-graph.h"
+#include "prio-queue.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2895,31 +2896,215 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
 }
 
-struct topo_walk_info {};
+define_commit_slab(indegree_slab, int);
+define_commit_slab(author_date_slab, timestamp_t);
+
+struct topo_walk_info {
+   uint32_t min_generation;
+   struct prio_queue explore_queue;
+   struct prio_queue indegree_queue;
+   struct prio_queue topo_queue;
+   struct indegree_slab indegree;
+   struct author_date_slab author_date;
+};
+
+static inline void test_flag_and_insert(struct prio_queue *q, struct commit 
*c, int flag)
+{
+   if (c->object.flags & flag)
+   return;
+
+   c->object.flags |= flag;
+   prio_queue_put(q, c);
+}
+
+static void explore_walk_step(struct rev_info *revs)
+{
+   struct topo_walk_info *info = revs->topo_walk_info;
+   struct commit_list *p;
+   struct commit *c = prio_queue_get(>explore_queue);
+
+   if (!c)
+   return;
+
+   if (parse_commit_gently(c, 1) < 0)
+   return;
+
+   if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+   record_author_date(>author_date, c);
+
+   if (revs->max_age != -1 && (c->date < revs->max_age))
+   c->object.flags |= UNINTERESTING;
+
+   if (process_parents(revs, c, NULL, NULL) < 0)
+   return;
+
+   if (c->object.flags & UNINTERESTING)
+   mark_parents_uninteresting(c);
+
+   for (p = c->parents; p; p = p->next)
+   test_flag_and_insert(>explore_queue, p->item, 
TOPO_WALK_EXPLORED);
+}
+
+static void explore_to_depth(struct rev_info *revs,
+uint32_t gen_cutoff)
+{
+   struct topo_walk_info *info = revs->topo_walk_info;
+   struct commit *c;
+   while ((c = prio_queue_peek(>explore

[PATCH v5 2/7] test-reach: add run_three_modes method

2018-11-01 Thread Derrick Stolee

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a simple translation on a more general
run_three_modes method that executes the given command and tests
the actual output to the expected output.

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   "$@" actual &&
test_cmp expect actual &&
cp commit-graph-full .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   "$@" actual &&
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
-   test-tool reach $1 actual &&
+   "$@" actual &&
test_cmp expect actual
 }
 
+test_three_modes () {
+   run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
cat >input <<-\EOF &&
A:commit-5-7
-- 
2.19.1.542.gc4df23f792

[PATCH v5 3/7] test-reach: add rev-list tests

2018-11-01 Thread Derrick Stolee

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee 
---
 t/t6600-test-reach.sh | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 
commit-1-3 \
+   commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 
commit-1-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 
commit-1-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 
commit-1-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 
commit-1-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 
commit-1-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+   git rev-parse \
+   commit-6-6 \
+   commit-6-5 \
+   commit-6-4 \
+   commit-6-3 \
+   commit-6-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   >expect &&
+   run_three_modes git rev-list --first-parent --topo-order 
commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+   commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+   commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   >expect &&
+   run_three_modes git rev-list --topo-order --ancestry-path 
commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+   git rev-parse \
+   commit-6-6 commit-5-6 commit-4-6 \
+   commit-6-5 commit-5-5 commit-4-5 \
+   commit-6-4 commit-5-4 commit-4-4 \
+   commit-6-3 commit-5-3 commit-4-3 \
+   commit-6-2 commit-5-2 commit-4-2 \
+   commit-6-1 commit-5-1 commit-4-1 \
+   commit-3-8 commit-2-8 commit-1-8 \
+   commit-3-7 commit-2-7 commit-1-7 \
+   >expect &&
+   run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
2.19.1.542.gc4df23f792

[PATCH v5 5/7] commit/revisions: bookkeeping before refactoring

2018-11-01 Thread Derrick Stolee

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding an author_date_slab
   declaration to commit.h. Consumers will need their own implementation.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.
   Also rename the method to the slightly more generic name
   process_parents() to make clear that this method does more than
   add to a list (and no list is required anymore).

Helped-by: Jeff King 
Signed-off-by: Derrick Stolee 
---
 commit.c   |  9 -
 commit.h   |  7 +++
 revision.c | 18 ++
 3 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..a025a0db60 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
 define_commit_slab(author_date_slab, timestamp_t);
 
-static void record_author_date(struct author_date_slab *author_date,
-  struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit)
 {
const char *buffer = get_commit_buffer(commit, NULL);
struct ident_split ident;
@@ -684,8 +683,8 @@ static void record_author_date(struct author_date_slab 
*author_date,
unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
- void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+  void *cb_data)
 {
const struct commit *a = a_, *b = b_;
struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..ec5b9093ad 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
 #define GENERATION_NUMBER_INFINITY 0x
@@ -328,6 +329,12 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
+/* record author-date for each commit object */
+struct author_date_slab;
+void record_author_date(struct author_date_slab *author_date,
+   struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..36458265a0 100644
--- a/revision.c
+++ b/revision.c
@@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit 
*p, struct commit_li
*cache = new_entry;
 }
 
-static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
-   struct commit_list **list, struct commit_list **cache_ptr)
+static int process_parents(struct rev_info *revs, struct commit *commit,
+  struct commit_list **list, struct commit_list 
**cache_ptr)
 {
struct commit_list *parent = commit->parents;
unsigned left_flag;
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
if (p->object.flags & SEEN)
continue;
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
return 0;
}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, 
struct commit *commit,
p->object.flags |= left_flag;
if (!(p->object.flags & SEEN)) {
p->object.flags |= SEEN;
-   commit_list_insert_by_date_cached(p, list, cached_base, 
cache_ptr);
+   if (list)
+   commit_list_insert_by_date_cached(p, list, 
cached_base, cache_ptr);
}
if (revs->first_parent_

[PATCH v5 4/7] revision.c: begin refactoring --topo-order logic

2018-11-01 Thread Derrick Stolee

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee 
---
 revision.c | 42 ++
 revision.h |  4 
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct 
rev_info *revs, struct s
if (revs->diffopt.objfind)
revs->simplify_history = 0;
 
-   if (revs->topo_order)
+   if (revs->topo_order && !generation_numbers_enabled(the_repository))
revs->limited = 1;
 
if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id 
*oid,
return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+   struct topo_walk_info *info;
+   revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+   info = revs->topo_walk_info;
+   memset(info, 0, sizeof(struct topo_walk_info));
+
+   limit_list(revs);
+   sort_in_topological_order(>commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+   return pop_commit(>commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+   if (add_parents_to_list(revs, commit, >commits, NULL) < 0) {
+   if (!revs->ignore_missing_links)
+   die("Failed to traverse parents of commit %s",
+   oid_to_hex(>object.oid));
+   }
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
commit_list_sort_by_date(>commits);
if (revs->no_walk)
return 0;
-   if (revs->limited)
+   if (revs->limited) {
if (limit_list(revs) < 0)
return -1;
-   if (revs->topo_order)
-   sort_in_topological_order(>commits, revs->sort_order);
+   if (revs->topo_order)
+   sort_in_topologic

[PATCH v5 7/7] t6012: make rev-list tests more interesting

2018-11-01 Thread Derrick Stolee

As we are working to rewrite some of the revision-walk machinery,
there could easily be some interesting interactions between the
options that force topological constraints (--topo-order,
--date-order, and --author-date-order) along with specifying a
path.

Add extra tests to t6012-rev-list-simplify.sh to add coverage of
these interactions. To ensure interesting things occur, alter the
repo data shape to have different orders depending on topo-, date-,
or author-date-order.

When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering
the new logic for topo-order walks using generation numbers. The
extra tests can be added indepently.

Signed-off-by: Derrick Stolee 
---
 t/t6012-rev-list-simplify.sh | 45 
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
index b5a1190ffe..a10f0df02b 100755
--- a/t/t6012-rev-list-simplify.sh
+++ b/t/t6012-rev-list-simplify.sh
@@ -12,6 +12,22 @@ unnote () {
git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 
|g"
 }
 
+#
+# Create a test repo with interesting commit graph:
+#
+# A--B--G--H--I--K--L
+#  \  \   / /
+#   \  \ / /
+#C--E---F J
+#\_/
+#
+# The commits are laid out from left-to-right starting with
+# the root commit A and terminating at the tip commit L.
+#
+# There are a few places where we adjust the commit date or
+# author date to make the --topo-order, --date-order, and
+# --author-date-order flags produce different output.
+
 test_expect_success setup '
echo "Hi there" >file &&
echo "initial" >lost &&
@@ -21,10 +37,18 @@ test_expect_success setup '
 
git branch other-branch &&
 
+   git symbolic-ref HEAD refs/heads/unrelated &&
+   git rm -f "*" &&
+   echo "Unrelated branch" >side &&
+   git add side &&
+   test_tick && git commit -m "Side root" &&
+   note J &&
+   git checkout master &&
+
echo "Hello" >file &&
echo "second" >lost &&
git add file lost &&
-   test_tick && git commit -m "Modified file and lost" &&
+   test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m 
"Modified file and lost" &&
note B &&
 
git checkout other-branch &&
@@ -63,13 +87,6 @@ test_expect_success setup '
test_tick && git commit -a -m "Final change" &&
note I &&
 
-   git symbolic-ref HEAD refs/heads/unrelated &&
-   git rm -f "*" &&
-   echo "Unrelated branch" >side &&
-   git add side &&
-   test_tick && git commit -m "Side root" &&
-   note J &&
-
git checkout master &&
test_tick && git merge --allow-unrelated-histories -m "Coolest" 
unrelated &&
note K &&
@@ -103,14 +120,24 @@ check_result () {
check_outcome success "$@"
 }
 
-check_result 'L K J I H G F E D C B A' --full-history
+check_result 'L K J I H F E D C G B A' --full-history --topo-order
+check_result 'L K I H G F E D C B J A' --full-history
+check_result 'L K I H G F E D C B J A' --full-history --date-order
+check_result 'L K I H G F E D B C J A' --full-history --author-date-order
 check_result 'K I H E C B A' --full-history -- file
 check_result 'K I H E C B A' --full-history --topo-order -- file
 check_result 'K I H E C B A' --full-history --date-order -- file
+check_result 'K I H E B C A' --full-history --author-date-order -- file
 check_result 'I E C B A' --simplify-merges -- file
+check_result 'I E C B A' --simplify-merges --topo-order -- file
+check_result 'I E C B A' --simplify-merges --date-order -- file
+check_result 'I E B C A' --simplify-merges --author-date-order -- file
 check_result 'I B A' -- file
 check_result 'I B A' --topo-order -- file
+check_result 'I B A' --date-order -- file
+check_result 'I B A' --author-date-order -- file
 check_result 'H' --first-parent -- another-file
+check_result 'H' --first-parent --topo-order -- another-file
 
 check_result 'E C B A' --full-history E -- lost
 test_expect_success 'full history simplification without parent' '
-- 
2.19.1.542.gc4df23f792

[PATCH v5 0/7] Use generation numbers for --topo-order

2018-11-01 Thread Derrick Stolee

This patch series performs a decently-sized refactoring of the
revision-walk machinery. Well, "refactoring" is probably the wrong word,
as I don't actually remove the old code. Instead, when we see certain
options in the 'rev_info' struct, we redirect the commit-walk logic to
a new set of methods that distribute the workload differently. By using
generation numbers in the commit-graph, we can significantly improve
'git log --graph' commands (and the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph
and generation numbers, then I recommend reading
`Documentation/technical/commit-graph.txt` or a blog post [1] I wrote on
the subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

**UPDATED** Now that we have had some review and some dogfooding, I'm
removing the paragraph I had here about "RFC quality". I think this is
ready to merge!

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending
commits are UNINTERESTING. Since this code depends on commit_list instead
of the prio_queue we are using here, I chose to leave it untouched for now.
We can revisit it in a separate series later. Since handle_commit() turns
on revs->limited when a commit is UNINTERESTING, we do not hit the new
code in this case. Removing this 'revs->limited = 1;' line yields correct
results, but the performance can be worse.

**UPDATED** See the discussion about Generation Number V2 [4] for more
on this topic.

Changes in V5: Thanks Jakub for feedback on the huge commit! I think
I've responded to all the code feedback. See the range-diff at the
end of this cover-page.

Thanks,
-Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
   Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph 
in entire test suite.

[4] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/
[RFC] Generation Number v2

Note: I'm not submitting this version via GitGitGadget because it's
currently struggling with how to handle a PR in a conflict state.
The new flags in revision.h have a conflict with recent changes in
master.

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: generation-based topo-order algorithm
  t6012: make rev-list tests more interesting

 commit.c |   9 +-
 commit.h |   7 +
 object.h |   4 +-
 prio-queue.c |   9 ++
 prio-queue.h |   6 +
 revision.c   | 243 +--
 revision.h   |   6 +
 t/helper/test-prio-queue.c   |  26 ++--
 t/t0009-prio-queue.sh|  14 ++
 t/t6012-rev-list-simplify.sh |  45 +--
 t/t6600-test-reach.sh|  96 +-
 11 files changed, 426 insertions(+), 39 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
-- 
2.19.1.542.gc4df23f792

-->8--

1:  2358cfd5ed = 1:  7c75a56505 prio-queue: add 'peek' operation
2:  3a4b68e479 = 2:  686c4370de test-reach: add run_three_modes method
3:  12a3f6d367 = 3:  7410c00982 test-reach: add rev-list tests
4:  cd9eef9688 = 4:  5439e11e37 revision.c: begin refactoring --topo-order logic
5:  f3e291665d ! 5:  71554deb9b commit/revisions: bookkeeping before refactoring
@@ -9,8 +9,8 @@
compare_commits_by_author_date() in revision.c. These are used
currently by sort_in_topological_order() in commit.c.
 
-2. Moving these methods to commit.h requires adding the author_slab
-   definition to commit.h.
+2. Moving these methods to commit.h requires adding an author_date_slab
+   declaration to commit.h. Consumers will need their own 
implementation.
 
 3. The add_parents_

[PATCH v5 1/7] prio-queue: add 'peek' operation

2018-11-01 Thread Derrick Stolee

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh. Further, add
a test that checks the behavior when the compare function is NULL
(i.e. the queue becomes a stack).

Signed-off-by: Derrick Stolee 
---
 prio-queue.c   |  9 +
 prio-queue.h   |  6 ++
 t/helper/test-prio-queue.c | 26 ++
 t/t0009-prio-queue.sh  | 14 ++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
}
return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+   if (!queue->nr)
+   return NULL;
+   if (!queue->compare)
+   return queue->array[queue->nr - 1].data;
+   return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..5bc9c46ea5 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv)
struct prio_queue pq = { intcmp };
 
while (*++argv) {
-   if (!strcmp(*argv, "get"))
-   show(prio_queue_get());
-   else if (!strcmp(*argv, "dump")) {
-   int *v;
-   while ((v = prio_queue_get()))
-  show(v);
-   }
-   else {
+   if (!strcmp(*argv, "get")) {
+   void *peek = prio_queue_peek();
+   void *get = prio_queue_get();
+   if (peek != get)
+   BUG("peek and get results do not match");
+   show(get);
+   } else if (!strcmp(*argv, "dump")) {
+   void *peek;
+   void *get;
+   while ((peek = prio_queue_peek())) {
+   get = prio_queue_get();
+   if (peek != get)
+   BUG("peek and get results do not 
match");
+   show(get);
+   }
+   } else if (!strcmp(*argv, "stack")) {
+   pq.compare = NULL;
+   } else {
int *v = malloc(sizeof(*v));
*v = atoi(*argv);
prio_queue_put(, v);
diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh
index e56dfce668..3941ad2528 100755
--- a/t/t0009-prio-queue.sh
+++ b/t/t0009-prio-queue.sh
@@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' '
test_cmp expect actual
 '
 
+cat >expect <<'EOF'
+3
+2
+6
+4
+5
+1
+8
+EOF
+test_expect_success 'stack order' '
+   test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual &&
+   test_cmp expect actual
+'
+
 test_done

base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
-- 
2.19.1.542.gc4df23f792

Re: [RFC] Generation Number v2

2018-11-01 Thread Derrick Stolee

On 11/1/2018 8:27 AM, Jakub Narebski wrote:

[I have noticed that in some places I wrote A..B instead of B..A. Sorry
about that]

Derrick Stolee writes:

Please also let me know about any additional tests that I could
run. Now that I've got a lot of test scripts built up, I can re-run
the test suite pretty quickly.

I would add straighforward 1-to-N and N-to-1 reachability tests in the
form of `git branch / tag --contains` and `git branch / tag --merged`,
and perhaps also ahead-behind calculations used by `git status`, and
N-to-M reachability tests used by tag following code in push and fetch.

Possibly also A...B walk, if it is not implemented via calculating
merge-base.

I believe this uses paint_down_to_common(), but looks at the PARENT1 and
PARENT2 flags afterward to determine the left/right/both relationships.

Maybe even --ancestry-path walk, though I am not sure how important best
performance for rhis case is (we would want good performance, but
perhaps best is not needed for rarely used command).

Currently, the implementation of --ancestry-path does not use a
reachability index.

See explanation below.

Generation Number Performance Test
==

Git uses the commit history to answer many basic questions, such as
computing a merge-base. Some of these algorithms can benefit from a
_reachability index_, which is a condition that allows us to answer
"commit A cannot reach commit B" in some cases. These require pre-
computing some values that we can load and use at run-time. Git
already has a notion of _generation number_, stores it in the commit-
graph file, and uses it in several reachability algorithms.

Note that there are other kinds of reachability indices.

First, there are reachability indices that can answer the full
reachability query (if commit A can reach commit B, or if commit A
cannot reach commit B) directly, without walking the commit graph at
all: so called label-only approach. For example one could store for
each commit the compressed list of all commits reachable from it
(transitive closure compression).

Those, I think (but I have not checked), would be of not much use for
Git, as the size of the index grows stronger than linear with the number
of commits, as grows the time to compute such index. So probably of no
use to Git, at least not directly (Git uses so called "bitmap index",
see e.g. [1], which stores reachability bit-vector as compressed
bitmap... but not for all commits, only for a small subset).

Second, beside negative-cut reachability indexes, that can answer with
certainity that "commit A cannot reach commit B", like the generation
numbers (also known as level, or topological level), there also
positive-cut indexes (usually if not always based on the spanning tree
or trees for the DAG), that can answer when "commit A can reach commit
B".

I think that those can lead to significant speedups for at least some
types of commit traversal and reachability queries that Git needs to
answer. But currently no algorithms has a provision for using
positive-cut filter index. Anyway, such index would be auxiliary thing,
see the next point.

Third, we require more than having reachability index in the sense of
having some kind of _labelling_, often composite, that can answer either
"commit A cannot reach commit B" or "commit A can reach commit B",
depending on the type. Git for most operations needs more, namely an
actual _ordering_, that is a weak order (or to be more exact a total
preorder, i.e. there can be two different commits with the same
"generation number" or index, but always either idx(A) ≲ idx(B) or
idx(B) ≲ idx(A)) and not only partial ordering (where there can be
incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
This is because it needs to be used in priority queue to decide which
commit to travel next; more on that later. This is also why there can
be only one such "main" reachability _index_.

[1]: https://githubengineering.com/counting-objects/

Thanks for the details. At the moment, I'm only interested in improving our
negative-cut reachability index, as we have algorithms that can take
advantage

of them (and can compare their performance directly).

You can read more about generation numbers and how to use them in
algorithms in [this blog
post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).

However, [generation numbers do not always improve our
algorithms](https://public-inbox.org/git/pull.28.git.gitgitgad...@gmail.com/T/#u).
Specifically, some algorithms in Git already use commit date as a
heuristic reachability index. This has some problems, though, since
commit date can be incorrect for several reasons (clock skew between
machines, purposefully setting GIT_COMMIT_DATE to the author date,
etc.).

That's what we use the "slop" mechan

Re: [PATCH 0/3] Make add_missing_tags() linear

2018-11-01 Thread Derrick Stolee


On 11/1/2018 2:52 AM, Elijah Newren wrote:

On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee  wrote:

On 10/31/2018 2:04 AM, Elijah Newren wrote:


On the original repo where the topic was brought up, with commit-graph
NOT turned on and using origin/master, I see:

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
To /home/newren/repo-mirror
  * [new branch]   test5 -> test5

real 1m20.081s
user 1m19.688s
sys 0m0.292s

Merging this series in, I now get:

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
To /home/newren/repo-mirror
  * [new branch]   test5 -> test5

real 0m2.857s
user 0m2.580s
sys 0m0.328s

which provides a very nice speedup.

Oddly enough, if I _also_ do the following:
$ git config core.commitgraph true
$ git config gc.writecommitgraph true
$ git gc

then my timing actually slows down just slightly:
$ time git push --follow-tags --dry-run /home/newren/repo-mirror
To /home/newren/repo-mirror
  * [new branch]  test5 -> test5

real 0m3.027s
user 0m2.696s
sys 0m0.400s

So you say that the commit-graph is off in the 2.8s case, but not here
in the 3.1s case? I would expect _at minimum_ that the cost of parsing
commits would have a speedup in the commit-graph case.  There may be
something else going on here, since you are timing a `push` event that
is doing more than the current walk.


(run-to-run variation seems pretty consistent, < .1s variation, so
this difference is just enough to notice.)  I wouldn't be that
surprised if that means there's some really old tags with very small
generation numbers, meaning it's not gaining anything in this special
case from the commit-graph, but it does pay the cost of loading the
commit-graph.

While you have this test environment, do you mind applying the diff
below and re-running the tests? It will output a count for how many
commits are walked by the algorithm. This should help us determine if
this is another case where generation numbers are worse than commit-date,
or if there is something else going on. Thanks!

I can do that, but wouldn't you want a similar patch for the old
get_merge_bases_many() in order to compare?  Does an absolute number
help by itself?
It's going to have to be tomorrow, though; not enough time tonight.


No rush. I'd just like to understand how removing the commit-graph file
can make the new algorithm faster. Putting a similar count in the old
algorithm would involve giving a count for every call to
in_merge_bases_many(), which would be very noisy.

Thanks!
-Stolee

Re: [PATCH v3 8/8] merge-recursive: improve rename/rename(1to2)/add[/add] handling

2018-10-31 Thread Derrick Stolee


On 10/19/2018 3:31 PM, Elijah Newren wrote:

[snip]

+   char *new_path = NULL;
+   if (dir_in_way(b->path, !o->call_depth, 0)) {
+   new_path = unique_path(o, b->path, ci->branch2);
+   output(o, 1, _("%s is a directory in %s adding "
+  "as %s instead"),
+  b->path, ci->branch1, new_path);


I tried really hard, but failed to get a test to cover the block below. 
I was able to
find that the "check handling of differently renamed file with D/F 
conflicts" test
in t6022-merge-rename.sh covers the block above. Trying to tweak the 
example using

untracked files seems to hit an error message from unpack-trees.c instead.


+   } else if (would_lose_untracked(b->path)) {
+   new_path = unique_path(o, b->path, ci->branch2);
+   output(o, 1, _("Refusing to lose untracked file"
+  " at %s; adding as %s instead"),
+  b->path, new_path);


It could also be that I failed because I'm less familiar with this part 
of the

codebase. Elijah, do you think it is possible to hit this block?

Thanks,
-Stolee

Re: [PATCH v3 2/8] t6036, t6042: testcases for rename collision of already conflicting files

2018-10-31 Thread Derrick Stolee


On 10/19/2018 3:31 PM, Elijah Newren wrote:

+test_expect_success "setup nested conflicts" '


nit: should these test names be single-quoted? I see you using double-quotes
in PATCH 1/8 as well, but that seems to be because there are variables in
the test names.


...

+test_expect_failure "check nested conflicts" '


Same here.


+test_expect_success "setup nested conflicts from rename/rename(2to1)" '



+test_expect_failure "check nested conflicts from rename/rename(2to1)" '


Thanks,
-Stolee

Re: [PATCH v3 4/8] merge-recursive: new function for better colliding conflict resolutions

2018-10-31 Thread Derrick Stolee


On 10/31/2018 9:53 AM, Derrick Stolee wrote:

On 10/19/2018 3:31 PM, Elijah Newren wrote:
+#if 0 // #if-0-ing avoids unused function warning; will make live in 
next commit

+static int handle_file_collision(struct merge_options *o,
+ const char *collide_path,
+ const char *prev_path1,
+ const char *prev_path2,
+ const char *branch1, const char *branch2,
+ const struct object_id *a_oid,
+ unsigned int a_mode,
+ const struct object_id *b_oid,
+ unsigned int b_mode)
+{
+    struct merge_file_info mfi;
+    struct diff_filespec null, a, b;
+    char *alt_path = NULL;
+    const char *update_path = collide_path;
+
+    /*
+ * In the recursive case, we just opt to undo renames
+ */
+    if (o->call_depth && (prev_path1 || prev_path2)) {
+    /* Put first file (a_oid, a_mode) in its original spot */
+    if (prev_path1) {
+    if (update_file(o, 1, a_oid, a_mode, prev_path1))
+    return -1;
+    } else {
+    if (update_file(o, 1, a_oid, a_mode, collide_path))


The latest test coverage report [1] shows this if statement is never 
run, so

it appears that every call to this method in the test suite has either
o->call_depth positive, prev_path1 non-NULL, or both prev_path1 and 
prev_path2

NULL.

Is there a way we can add a test case that calls this method with 
o->call_depth

positive, prev_path1 NULL, and prev_path2 non-NULL?


+    return -1;
+    }
+
+    /* Put second file (b_oid, b_mode) in its original spot */
+    if (prev_path2) {
+    if (update_file(o, 1, b_oid, b_mode, prev_path2))


Since this line is covered, we _do_ call the method with prev_path2 
non-NULL, but

prev_path1 must be non-NULL in all cases.

I may have found a reason why this doesn't happen in one of the 
callers you introduced.

I'm going to comment on PATCH 8/8 to see if that is the case.


Nevermind on the PATCH 8/8 situation. I thought I saw you pass (a->path, 
NULL) and
(b->path, NULL) into the (prev_path1, prev_path2) pairs, but in each 
case the non-NULL

parameter is actually for 'collide_path'.

It is still interesting if we can hit this case. Perhaps we need a 
different kind of
conflict, like (rename, delete) [but I struggle to make sense of how to 
do that].


Thanks,
-Stolee

Re: [PATCH v3 4/8] merge-recursive: new function for better colliding conflict resolutions

2018-10-31 Thread Derrick Stolee


On 10/19/2018 3:31 PM, Elijah Newren wrote:

+#if 0 // #if-0-ing avoids unused function warning; will make live in next 
commit
+static int handle_file_collision(struct merge_options *o,
+const char *collide_path,
+const char *prev_path1,
+const char *prev_path2,
+const char *branch1, const char *branch2,
+const struct object_id *a_oid,
+unsigned int a_mode,
+const struct object_id *b_oid,
+unsigned int b_mode)
+{
+   struct merge_file_info mfi;
+   struct diff_filespec null, a, b;
+   char *alt_path = NULL;
+   const char *update_path = collide_path;
+
+   /*
+* In the recursive case, we just opt to undo renames
+*/
+   if (o->call_depth && (prev_path1 || prev_path2)) {
+   /* Put first file (a_oid, a_mode) in its original spot */
+   if (prev_path1) {
+   if (update_file(o, 1, a_oid, a_mode, prev_path1))
+   return -1;
+   } else {
+   if (update_file(o, 1, a_oid, a_mode, collide_path))


The latest test coverage report [1] shows this if statement is never run, so
it appears that every call to this method in the test suite has either
o->call_depth positive, prev_path1 non-NULL, or both prev_path1 and 
prev_path2

NULL.

Is there a way we can add a test case that calls this method with 
o->call_depth

positive, prev_path1 NULL, and prev_path2 non-NULL?


+   return -1;
+   }
+
+   /* Put second file (b_oid, b_mode) in its original spot */
+   if (prev_path2) {
+   if (update_file(o, 1, b_oid, b_mode, prev_path2))


Since this line is covered, we _do_ call the method with prev_path2 
non-NULL, but

prev_path1 must be non-NULL in all cases.

I may have found a reason why this doesn't happen in one of the callers 
you introduced.

I'm going to comment on PATCH 8/8 to see if that is the case.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/62f0bcf6-aa73-c192-d804-e6d69cac1...@gmail.com/

Re: [PATCH 18/19] submodule: use submodule repos for object lookup

2018-10-31 Thread Derrick Stolee


On 10/16/2018 7:35 PM, Stefan Beller wrote:

@@ -482,14 +483,46 @@ void prepare_submodule_repo_env(struct argv_array *out)
 DEFAULT_GIT_DIR_ENVIRONMENT);
  }
  
-/* Helper function to display the submodule header line prior to the full

- * summary output. If it can locate the submodule objects directory it will
- * attempt to lookup both the left and right commits and put them into the
- * left and right pointers.
+/*
+ * Initialize 'out' based on the provided submodule path.
+ *
+ * Unlike repo_submodule_init, this tolerates submodules not present
+ * in .gitmodules. This function exists only to preserve historical behavior,
+ *
+ * Returns 0 on success, -1 when the submodule is not present.
+ */
+static int open_submodule(struct repository *out, const char *path)
+{
+   struct strbuf sb = STRBUF_INIT;
+
+   if (submodule_to_gitdir(, path) || repo_init(out, sb.buf, NULL)) {
+   strbuf_release();
+   return -1;
+   }
+
+   out->submodule_prefix = xstrdup(path);
+   out->submodule_prefix = xstrfmt("%s%s/",
+   the_repository->submodule_prefix ?
+   the_repository->submodule_prefix :
+   "", path);
+
+   strbuf_release();
+   return 0;
+}


Based on the recent test coverage report [1], this xstrfmt() call is never
run witha non-null the_repository->submodule_prefix. Is there a way we can
exercise that branch?

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/62f0bcf6-aa73-c192-d804-e6d69cac1...@gmail.com/

Re: [RFC] Generation Number v2

2018-10-31 Thread Derrick Stolee


On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Oct 30 2018, Junio C Hamano wrote:


Derrick Stolee  writes:

In contrast, maximum generation numbers and corrected commit
dates both performed quite well. They are frequently the top
two performing indexes, and rarely significantly different.

The trade-off here now seems to be: which _property_ is more important,
locally-computable or backwards-compatible?

Nice summary.

As I already said, I personally do not think being compatible with
currently deployed clients is important at all (primarily because I
still consider the whole thing experimental), and there is a clear
way forward once we correct the mistake of not having a version
number in the file format that tells the updated clients to ignore
the generation numbers.  For longer term viability, we should pick
something that is immutable, reproducible, computable with minimum
input---all of which would lead to being incrementally computable, I
would think.

I think it depends on what we mean by backwards compatibility. None of
our docs are saying this is experimental right now, just that it's
opt-in like so many other git-config(1) options.

So if we mean breaking backwards compatibility in that we'll write a new
file or clobber the existing one with a version older clients can't use
as an optimization, fine.

But it would be bad to produce a hard error on older clients, but
avoiding that seems as easy as just creating a "commit-graph2" file in
.git/objects/info/.


Well, we have a 1-byte version number following the "CGPH" header in
the commit-graph file, and clients will ignore the commit-graph file
if that number is not "1". My hope for backwards-compatibility was
to avoid incrementing this value and instead use the unused 8th byte.

However, it appears that we are destined to increment that version
number, anyway. Here is my list for what needs to be in the next
version of the commit-graph file format:

1. A four-byte hash version.

2. File incrementality (split commit-graph).

3. Reachability Index versioning

Most of these changes will happen in the file header. The chunks
themselves don't need to change, but some chunks may be added that
only make sense in v2 commit-graphs.

Thanks,
-Stolee

Re: [RFC] Generation Number v2

2018-10-31 Thread Derrick Stolee


On 10/29/2018 11:59 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


**V3: Corrected Commit Date.**
For a commit C, let its _corrected commit date_ (denoted by cdate(C))
be the maximum of the commit date of C and the commit dates of its
parents.

"maximum of the commit date of C and the corrected commit dates of
its parents"?


That's what I mean. Thanks.



We've talked about exactly this one in the past (long before any of
Microsoft folks other than Dscho came to the scene) but without an
implementation, or detailed design and analysis.  I am very happy to
see the idea among other possibilities to be considered again.  This
time around, we may finally come up with something better than the
"commit dates with SLOP" thing ;-).


Essentially, the felineY order is selected with the goal of swapping
positions of topologically-independent commits relative to the felinX
ordering. The resulting reachability index is as follows:

If felineX(A) < felineY(B), then A cannot reach B.
If felineY(A) < felineY(B), then A cannot reach B.

I presume that the first line is a typo and you compare the same X index?


Yes, sorry for the typos. I fixed them in the report on GitHub.




* **Compatible?** In our test implementation, we use a previously unused
   byte of data in the commit-graph format to indicate which reachability
   index version we are using. Existing clients ignore this value, so we
   will want to consider if these new indexes are _backwards compatible_.
   That is, will they still report correct values if they ignore this byte
   and use the generation number column from the commit-graph file assuming
   the values are minimum generation numbers?

I personally consider that the current commit-graph with generation
numbers experimental, so I am not sure how much we care about this.

Having said that.

By the above definition, any new index that is wider than the
current generation number cannot be compatible (can we even tell the
existing clients how wide each elements in the ix array is?)

In any case, perhaps the first thing to do is to update the clients
so that they stop ignoring the version number field, and instead
work without generation number when there is no version of reach.ix
available in the file?  That way, a better reachablility index can
be chosen freely without having to worry about the compatibility.


I can work on that. It should be as simple as setting commit->generation to
GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
has a different version.




* **Immutable?** Git objects are _immutable_. If you change an object you
   actually create a new object with a new object ID. Are the values we
store
   for these reachability indexes also immutable?

Even if we do not embed the reachability ix in commit objects,
having an immutable value is probably a must if we want to make them
incrementally computable, so this is a very good property to have.
Unless there is a clever idea to incrementally compute a mutable
reach.ix, my gut instinct says that this property is a must.

Another thing, perhaps related to "local" below, is if exactly the
same reach.ix is computed by anybody, given an identical commit
history graph (perhaps "reproducibility"?).  I think most of the
candidates you listed are reproducible without a fixed tie-breaker,
but I am not sure about felineY() thing.


* **Local?** Are these values **locally computable**? That is, do we only
   need to look at the parents of a commit (assuming those parents have
   computed values) in order to determine the value at that commit?

A subset of non-local reachability ix, for example, the ones that
need to know what each commit's children are, cannot be immutable,
as adding new objects to the graph (either with locally committing,
or transferring objects from other repositories) would affect the
ix; is this true for all non-local reachability ix, I wonder?


As a thought experiment, we could define a function size(C) to be the
numberof commits reachable from C. This is not locally-computable
from the size values at C's parents due to the inclusion-exclusion
principle. We would need to compute it by walking the reachable set
and counting (resulting in quadratic performance overall) but is
immutable. Since the performance cost is so expensive (unlike the
linear costs in the other non-local versions) I didn't include it
in my comparison.




We focused on three types of performance tests that test the indexes
in different ways. Each test lists the `git` command that is used,
and the table lists which repository is used and which inputs.

### Test 1: `git log --topo-order -N`

This test focuses on the number of commits that are parsed during
a `git log --topo-order` before writing `N` commits to output.

A devil's advocate comment.  Your patches seem to be very focused on
this "unlimited" case for the past few weeks, but I am not so sure
if that is a cas

Re: [PATCH 0/3] Make add_missing_tags() linear

2018-10-31 Thread Derrick Stolee



On 10/31/2018 2:04 AM, Elijah Newren wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
>  wrote:
>>
>> As reported earlier [1], the add_missing_tags() method in remote.c has
>> quadratic performance. Some of that performance is curbed due to the
>> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
>> help users without a commit-graph, and it can still be painful if that
>> cutoff is sufficiently low compared to the tags we are using for
>> reachability testing.
>>
>> Add a new method in commit-reach.c called get_reachable_subset() which does
>> a many-to-many reachability test. Starting at the 'from' commits, walk until
>> the generation is below the smallest generation in the 'to' commits, or all
>> 'to' commits have been discovered. This performs only one commit walk for
>> the entire add_missing_tags() method, giving linear performance in the worst
>> case.
>>
>> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
>> works independently of its application in add_missing_tags().
>
> On the original repo where the topic was brought up, with commit-graph
> NOT turned on and using origin/master, I see:
>
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]   test5 -> test5
>
> real 1m20.081s
> user 1m19.688s
> sys 0m0.292s
>
> Merging this series in, I now get:
>
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]   test5 -> test5
>
> real 0m2.857s
> user 0m2.580s
> sys 0m0.328s
>
> which provides a very nice speedup.
>
> Oddly enough, if I _also_ do the following:
> $ git config core.commitgraph true
> $ git config gc.writecommitgraph true
> $ git gc
>
> then my timing actually slows down just slightly:
> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]  test5 -> test5
>
> real 0m3.027s
> user 0m2.696s
> sys 0m0.400s

So you say that the commit-graph is off in the 2.8s case, but not here
in the 3.1s case? I would expect _at minimum_ that the cost of parsing
commits would have a speedup in the commit-graph case.  There may be
something else going on here, since you are timing a `push` event that
is doing more than the current walk.

> (run-to-run variation seems pretty consistent, < .1s variation, so
> this difference is just enough to notice.)  I wouldn't be that
> surprised if that means there's some really old tags with very small
> generation numbers, meaning it's not gaining anything in this special
> case from the commit-graph, but it does pay the cost of loading the
> commit-graph.

While you have this test environment, do you mind applying the diff
below and re-running the tests? It will output a count for how many
commits are walked by the algorithm. This should help us determine if
this is another case where generation numbers are worse than commit-date,
or if there is something else going on. Thanks!

-->8--

>From 2115e7dcafb2770455b7b4793f90edc2254bad97 Mon Sep 17 00:00:00 2001
From: Derrick Stolee 
Date: Wed, 31 Oct 2018 11:40:50 +
Subject: [PATCH] DO-NOT-MERGE: count commits in get_reachable_subset

Signed-off-by: Derrick Stolee 
---
 commit-reach.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index a98532ecc8..b512461cf7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -700,6 +700,7 @@ struct commit_list *get_reachable_subset(struct commit 
**from, int nr_from,
struct commit **from_last = from + nr_from;
uint32_t min_generation = GENERATION_NUMBER_INFINITY;
int num_to_find = 0;
+   int count = 0;
 
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
@@ -729,6 +730,8 @@ struct commit_list *get_reachable_subset(struct commit 
**from, int nr_from,
while (num_to_find && (current = prio_queue_get()) != NULL) {
struct commit_list *parents;
 
+   count++;
+
if (current->object.flags & PARENT1) {
current->object.flags &= ~PARENT1;
current->object.flags |= reachable_flag;
@@ -755,6 +758,8 @@ struct commit_list *get_reachable_subset(struct commit 
**from, int nr_from,
clear_commit_marks_many(nr_to, to, PARENT1);
clear_commit_marks_many(nr_from, from, PARENT2);
 
+   fprintf(stderr, "count: %d\n", count);
+
return found_commits;
 }
 
-- 
2.19.1.542.gc4df23f792

Re: [PATCH 1/3] commit-reach: implement get_reachable_subset

2018-10-31 Thread Derrick Stolee


On 10/30/2018 11:35 PM, Junio C Hamano wrote:

"Derrick Stolee via GitGitGadget"  writes:


+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+struct commit **to, int nr_to,
+int reachable_flag)

This is OR'ed into object.flags, and I somoehow expected to see it
as 'unsigned int', not a signed one.


Will do. Thanks.




+{
+   struct commit **item;
+   struct commit *current;
+   struct commit_list *found_commits = NULL;
+   struct commit **to_last = to + nr_to;
+   struct commit **from_last = from + nr_from;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+   int num_to_find = 0;
+
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+   for (item = to; item < to_last; item++) {
+   struct commit *c = *item;
+   
+   parse_commit(c);
+   if (c->generation < min_generation)
+   min_generation = c->generation;
+
+   if (!(c->object.flags & PARENT1)) {
+   c->object.flags |= PARENT1;
+   num_to_find++;
+   }
+   }
+
+   for (item = from; item < from_last; item++) {
+   struct commit *c = *item;
+   if (!(c->object.flags & PARENT2)) {
+   c->object.flags |= PARENT2;
+   parse_commit(c);
+
+   prio_queue_put(, *item);
+   }
+   }

OK, we marked "to" with PARENT1 and counted them in num_to_find
without dups.  We also marked "from" with PARENT2 and threw them in
the "queue" without dups.

Mental note: the caller must guarantee that everybody reachable from
"to" and "from" have PARENT1 and PARENT2 clear.  This might deserve
to be in the comment before the function.


I'll put that in the header file.

[snip]

OK, this all makes sense.  Unlike merge-base traversals, this does
not have to traverse from the "to" side at all, which makes it quite
simpler and straight-forward.

I do wonder if we can now reimplement in_merge_bases_many() in terms
of this helper, and if that gives us a better performance.  It asks
"is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable
from, one of the 'references', i.e.  'from'"?


We could do this, but it does come with a performance hit when the following
are all true:

1. 'to' is not reachable from any 'from' commits.

2. The 'to' and 'from' commits are close in commit-date.

3. Generation numbers are not available, or the topology is skewed to have
   commits with high commit date and low generation number.

Since in_merge_bases_many() calls paint_down_to_common(), it has the same
issues with the current generation numbers. This can be fixed when we have
the next version of generation numbers available.

I'll make a note to have in_merge_bases_many() call get_reachable_subset()
conditionally (like the generation_numbers_available() trick in the 
--topo-order

series) after the generation numbers are settled and implemented.

Thanks,
-Stolee

Re: [PATCH 1/3] commit-reach: implement get_reachable_subset

2018-10-31 Thread Derrick Stolee


On 10/31/2018 2:07 AM, Elijah Newren wrote:

On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
 wrote:

--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct 
commit_list *to,
 object_array_clear(_objs);
 return result;
  }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+struct commit **to, int nr_to,
+int reachable_flag)
+{
+   struct commit **item;
+   struct commit *current;
+   struct commit_list *found_commits = NULL;
+   struct commit **to_last = to + nr_to;
+   struct commit **from_last = from + nr_from;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+   int num_to_find = 0;
+
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+   for (item = to; item < to_last; item++) {
+   struct commit *c = *item;
+
+   parse_commit(c);
+   if (c->generation < min_generation)
+   min_generation = c->generation;

So, when we don't have a commit-graph, is c->generation just going to
be 0, making min_generation also be 0? (meaning we get no possible
speed benefit from the commit-graph, since we just don't have that
information available)?


If we don't have a commit-graph, then we can only terminate the loop early
if we discover all of the commits (num_to_find == 0).Otherwise, we need to
walk the entire graph in order to determine non-reachability. This relates
to Junio's point about in_merge_bases_many(), which I'll respond to his
message in more detail about that.

Thanks,
-Stolee

Re: commit-graph is cool (overcoming add_missing_tags() perf issues)

2018-10-30 Thread Derrick Stolee


On 10/17/2018 2:00 PM, Elijah Newren wrote:

Hi,

Just wanted to give a shout-out for the commit-graph work and how
impressive it is.  I had an internal report from a user that git
pushes containing only one new tiny commit were taking over a minute
(in a moderate size repo with good network connectivity). After
digging for a while, I noticed three unusual things about the repo[1]:
   * he had push.followTags set to true
   * upstream repo had about 20k tags (despite only 55k commits)
   * his repo had an additional 2.5k tags, but none of these were in
 the history of the branches he was pushing and thus would not be
 included in any pushes.

Digging in, almost all the time was CPU-bound and spent in
add_missing_tags()[2].  If I'm reading the code correctly, it appears
that function loops over each tag, calling in_merge_bases_many() once
per tag.  Thus, for his case, we were potentially walking all of
history of the main branch 2.5k times.  That seemed rather suboptimal.


Elijah,

Do you still have this repo around? Could you by chance test the 
performance with the new algorithm for add_missing_tags() in [1]? 
Specifically, please test it without a commit-graph file, since your 
data shape already makes use of generation numbers pretty well.


Thanks,
-Stolee

[1] https://public-inbox.org/git/pull.60.git.gitgitgad...@gmail.com/T/#t

[PATCH 2/3] test-reach: test get_reachable_subset

2018-10-30 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The get_reachable_subset() method returns the list of commits in
the 'to' array that are reachable from at least one commit in the
'from' array. Add tests that check this method works in a few
cases:

1. All commits in the 'to' list are reachable. This exercises the
   early-termination condition.

2. Some commits in the 'to' list are reachable. This exercises the
   loop-termination condition.

3. No commits in the 'to' list are reachable. This exercises the
   NULL return condition.

Signed-off-by: Derrick Stolee 
---
 t/helper/test-reach.c | 34 
 t/t6600-test-reach.sh | 52 +++
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index 08d2ea68e..a0272178b 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av)
struct commit *A, *B;
struct commit_list *X, *Y;
struct object_array X_obj = OBJECT_ARRAY_INIT;
-   struct commit **X_array;
-   int X_nr, X_alloc;
+   struct commit **X_array, **Y_array;
+   int X_nr, X_alloc, Y_nr, Y_alloc;
struct strbuf buf = STRBUF_INIT;
struct repository *r = the_repository;
 
@@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av)
 
A = B = NULL;
X = Y = NULL;
-   X_nr = 0;
-   X_alloc = 16;
+   X_nr = Y_nr = 0;
+   X_alloc = Y_alloc = 16;
ALLOC_ARRAY(X_array, X_alloc);
+   ALLOC_ARRAY(Y_array, Y_alloc);
 
while (strbuf_getline(, stdin) != EOF) {
struct object_id oid;
@@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av)
 
case 'Y':
commit_list_insert(c, );
+   ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc);
+   Y_array[Y_nr++] = c;
break;
 
default:
@@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av)
filter.with_commit_tag_algo = 0;
 
printf("%s(_,A,X,_):%d\n", av[1], commit_contains(, A, 
X, ));
+   } else if (!strcmp(av[1], "get_reachable_subset")) {
+   const int reachable_flag = 1;
+   int i, count = 0;
+   struct commit_list *current;
+   struct commit_list *list = get_reachable_subset(X_array, X_nr,
+   Y_array, Y_nr,
+   reachable_flag);
+   printf("get_reachable_subset(X,Y)\n");
+   for (current = list; current; current = current->next) {
+   if (!(list->item->object.flags & reachable_flag))
+   die(_("commit %s is not marked reachable"),
+   oid_to_hex(>item->object.oid));
+   count++;
+   }
+   for (i = 0; i < Y_nr; i++) {
+   if (Y_array[i]->object.flags & reachable_flag)
+   count--;
+   }
+
+   if (count < 0)
+   die(_("too many commits marked reachable"));
+
+   print_sorted_commit_ids(list);
}
 
exit(0);
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index ae94b27f7..a0c64e617 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' '
test_three_modes commit_contains --tag
 '
 
+test_expect_success 'get_reachable_subset:all' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-6-6
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 \
+ commit-5-6 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:some' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+   X:commit-1-7
+   Y:commit-3-3
+   Y:commit-1-7
+   Y:commit-5-6
+   EOF
+   (
+   echo "get_reachable_subset(X,Y)" &&
+   git rev-parse commit-3-3 \
+ commit-1-7 | sort
+   ) >expect &&
+   test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:none' '
+   cat >input <<-\EOF &&
+   X:commit-9-1
+   X:commit-8-3
+   X:commit-7-5
+

[PATCH 3/3] remote: make add_missing_tags() linear

2018-10-30 Thread Derrick Stolee via GitGitGadget

From: Derrick Stolee 

The add_missing_tags() method currently has quadratic behavior.
This is due to a linear number (based on number of tags T) of
calls to in_merge_bases_many, which has linear performance (based
on number of commits C in the repository).

Replace this O(T * C) algorithm with an O(T + C) algorithm by
using get_reachable_subset(). We ignore the return list and focus
instead on the reachable_flag assigned to the commits we care
about, because we need to interact with the tag ref and not just
the commit object.

Signed-off-by: Derrick Stolee 
---
 remote.c | 34 +-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/remote.c b/remote.c
index 81f4f01b0..b850f2feb 100644
--- a/remote.c
+++ b/remote.c
@@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * sent to the other side.
 */
if (sent_tips.nr) {
+   const int reachable_flag = 1;
+   struct commit_list *found_commits;
+   struct commit **src_commits;
+   int nr_src_commits = 0, alloc_src_commits = 16;
+   ALLOC_ARRAY(src_commits, alloc_src_commits);
+
for_each_string_list_item(item, _tag) {
struct ref *ref = item->util;
+   struct commit *commit;
+
+   if (is_null_oid(>new_oid))
+   continue;
+   commit = lookup_commit_reference_gently(the_repository,
+   >new_oid,
+   1);
+   if (!commit)
+   /* not pushing a commit, which is not an error 
*/
+   continue;
+
+   ALLOC_GROW(src_commits, nr_src_commits + 1, 
alloc_src_commits);
+   src_commits[nr_src_commits++] = commit;
+   }
+
+   found_commits = get_reachable_subset(sent_tips.tip, 
sent_tips.nr,
+src_commits, 
nr_src_commits,
+reachable_flag);
+
+   for_each_string_list_item(item, _tag) {
struct ref *dst_ref;
+   struct ref *ref = item->util;
struct commit *commit;
 
if (is_null_oid(>new_oid))
@@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
 * Is this tag, which they do not have, reachable from
 * any of the commits we are sending?
 */
-   if (!in_merge_bases_many(commit, sent_tips.nr, 
sent_tips.tip))
+   if (!(commit->object.flags & reachable_flag))
continue;
 
/* Add it in */
@@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref 
**dst, struct ref ***ds
oidcpy(_ref->new_oid, >new_oid);
dst_ref->peer_ref = copy_ref(ref);
}
+
+   clear_commit_marks_many(nr_src_commits, src_commits, 
reachable_flag);
+   free(src_commits);
+   free_commit_list(found_commits);
}
+
string_list_clear(_tag, 0);
free(sent_tips.tip);
 }
-- 
gitgitgadget

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1362 matches

Mail list logo