[PATCH 0/3] Improve abbreviation disambiguation

2017-09-15 Thread Derrick Stolee
Hello,

My name is Derrick Stolee and I just switched teams at Microsoft from
the VSTS Git Server to work on performance improvements in core Git.

This is my first patch submission, and I look forward to your feedback.

Thanks,
 Stolee


When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.
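
To make the rule concrete, here is a small standalone sketch
(illustrative only, not code from the series): the shortest
unambiguous abbreviation of an object id is one character longer than
its longest common prefix with any colliding object id.

    #include <stdio.h>

    /*
     * Sketch: minimal unambiguous abbreviation length of `target`
     * among `n` colliding hex object ids. This mirrors the logic of
     * the extend_abbrev_len() callback introduced in patch 2.
     */
    static size_t min_abbrev(const char *target, const char **others,
                             size_t n)
    {
            size_t len = 1, i, j;
            for (i = 0; i < n; i++) {
                    for (j = 0; target[j] && target[j] == others[i][j]; j++)
                            ; /* advance past the shared prefix */
                    if (j + 1 > len)
                            len = j + 1;
            }
            return len;
    }

    int main(void)
    {
            const char *others[] = { "1234abc0", "12345670" };
            printf("%zu\n", min_abbrev("12345ef0", others, 2)); /* prints 6 */
            return 0;
    }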

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate this performance improvement. Here are some
performance results for the three included commits, using
GIT_PERF_REPEAT_COUNT=10 since the first test is frequently an outlier
due to the file cache being cold.

Running git on a Linux VM, we see the following gains.

| Repo    | Pack-Files | Loose Objs | Baseline | Patch 2 | Patch 3 |
|---------|------------|------------|----------|---------|---------|
| Git.git |          1 |          0 | 0.46 s   | -87.0%  | -87.0%  |
| Git.git |          5 |          0 | 1.04 s   | -84.6%  | -85.6%  |
| Git.git |          4 |      75852 | 0.88 s   | -86.4%  | -86.4%  |
| Linux   |          1 |          0 | 0.63 s   | -38.1%  | -69.8%  |
| Linux   |         24 |          0 | 5.41 s   | -69.3%  | -71.5%  |
| Linux   |         23 |     323441 | 5.41 s   | -70.6%  | -73.4%  |

Running a similar patch on Git for Windows, we see the following gains.

| Repo          | Pack-Files | Loose | Baseline | Patch 2 | Patch 3 |
|---------------|------------|-------|----------|---------|---------|
| GitForWindows |          6 |   319 | 7.19 s   | -91.1%  | -91.5%  |
| VSTS          |          3 |    38 | 7.83 s   | -88.9%  | -90.9%  |
| Linux         |          3 |     0 | 7.92 s   | -87.9%  | -90.2%  |
| Windows       |         50 |   219 | 17.8 s   | -98.6%  | -98.6%  |

Note that performance improves in all cases, but the gain is largest
when there are multiple large pack-files. Because inspected index
files are not cached in memory, the old code paid the full inspection
cost once per abbreviation length it tried; the new code pays it only
once per object id.


Derrick Stolee (3):
  sha1_name: Create perf test for find_unique_abbrev()
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix

 Makefile   |  1 +
 sha1_name.c| 66 ++
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 22 +
 t/perf/p0008-abbrev.sh | 12 +
 5 files changed, 87 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH 2/3] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-09-15 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()

2017-09-15 Thread Derrick Stolee
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for 100,000 object ids. The ids are created
by iterating an unsigned int hash_base by a constant hash_delta and
copying hash_base five times across the sha1. Since hash_delta is odd,
the additive sequence has full period mod 2^32, so iterating by
hash_delta does not create a duplicate value for over 10,000,000
iterations.

test-abbrev demonstrates the performance improvements that will be
shown by later changes to find_unique_abbrev(). The value of
100,000 is large enough to show the significance of the later
improvements while only taking a few seconds on large repos.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |  1 +
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 23 +++
 t/perf/p0008-abbrev.sh | 12 
 4 files changed, 37 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index f2bb7f2f6..081ca05e8 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..80ce7d836 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..cb3551df9
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,23 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+   setup_git_directory();
+
+   unsigned int hash_delt = 0x13579BDF;
+   unsigned int hash_base = 0x01020304;
+   struct object_id oid;
+
+   int i, count = 0;
+   int n = sizeof(struct object_id) / sizeof(int);
+   while (count++ < 100000) {
+   for (i = 0; i < n; i++)
+   ((unsigned int*)oid.hash)[i] = hash_base;
+
+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+
+   hash_base += hash_delt;
+   }
+
+   exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..7c3fad807
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'find_unique_abbrev()' '
+   test-abbrev
+'
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH 3/3] sha1_name: Parse less while finding common prefix

2017-09-15 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

This change decreases the time to run test-abbrev by up to 40% on
large repos.
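
To illustrate the nibble indexing (a standalone sketch, not part of
the patch): hex position i selects byte i >> 1, reading the high
nibble when i is even and the low nibble when i is odd, so the bytes
0xab, 0xcd render as "abcd".

    #include <assert.h>
    #include <string.h>

    int main(void)
    {
            static const char hex[] = "0123456789abcdef";
            unsigned char hash[] = { 0xab, 0xcd };
            char out[5] = "";
            int i;

            /* even position: high nibble; odd position: low nibble */
            for (i = 0; i < 4; i++)
                    out[i] = (i & 1) == 0 ? hex[hash[i >> 1] >> 4]
                                          : hex[hash[i >> 1] & 0xf];

            assert(!strcmp(out, "abcd"));
            return 0;
    }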

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((i & 1) == 0)
+   return hex[oid->hash[i >> 1] >> 4];
+   else
+   return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



Re: [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()

2017-09-18 Thread Derrick Stolee

On 9/17/2017 5:51 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


+int cmd_main(int ac, const char **av)
+{
+   setup_git_directory();


As far as I recall, we do not (yet) allow declaration after
statement in our codebase.  Move this down to make it after all
decls.


Will fix.


+
+   unsigned int hash_delt = 0x13579BDF;
+   unsigned int hash_base = 0x01020304;
+   struct object_id oid;
+
+   int i, count = 0;
+   int n = sizeof(struct object_id) / sizeof(int);


It probably is technically OK to assume sizeof(int) always equals to
sizeof(unsigned), but because you use 'n' _only_ to work with uint
and never with int, it would make more sense to match.


Will fix. I also notice that "n" should be const.


But I do not think we want this "clever" optimization that involves
'n' in the first place.

+   while (count++ < 100000) {

+   for (i = 0; i < n; i++)
+   ((unsigned int*)oid.hash)[i] = hash_base;


Does it make sense to assume that uint is always 4-byte (so this
code won't work if it is 8-byte on your platform) and doing this is
faster than using platform-optimized memcpy()?


I'm not sure what you mean by using memcpy to improve this, because
it would require calling memcpy in the inner loop, such as

for (i = 0; i < n; i++)
memcpy(oid.hash + i * sizeof(unsigned), &hash_base,
   sizeof(unsigned));

I'm probably misunderstanding your intended use of memcpy().


+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+
+   hash_base += hash_delt;
+   }


I also wonder if this is measuring the right thing.  I am guessing
that by making many queries for a unique abbreviation of "random"
(well, it is deterministic, but my point is these queries are not
based on the names of objects that exist in the repository) hashes,
this test measures how much time it takes for us to decide that such
a random hash does not exist.  In the real life use, we make the
opposite query far more frequently: we have an object that we _know_
exists in the repository and we try to find a sufficient length to
disambiguate it from others, and I suspect that these two use
different logic.  Don't you need to be measuring the time it takes
to compute the shortest abbreviation of an object that exists
instead?


First, this doesn't just measure the time it takes to determine non-
existence, because it finds the len required to disambiguate an
"incoming" hash from all known hashes. When testing, I put in a
simple printf to report the result abbreviation so I could see how
often it needed to be extended. In this sense, the test exposes the
while loop that is removed by PATCH 2/3.

Second, your criticism about extant hashes is valid, and one I
struggled to reconcile. I see two issues with testing known hashes:

1. By determining the hash exists, we have inspected the file that
   contains it (either a .idx or the loose-object directory). This
   has side-effects that warm up the file cache so the looped method
   is artificially faster to find matching objects. The effect is
   particularly significant on a repo with exactly one packfile.

2. If we iterate over the entire set of objects, this test takes
   O(N*t(N)) time, where t(N) is the average time to compute a
   minimum abbreviation. For large repos, this can take several
   minutes. Instead, even with the constant 100,000 test hashes, we
   have an O(t(N)) test. We could avoid the asymptotic growth by
   limiting the number of existing hashes we use, but how do we
   find a sufficiently uniform sample from them?

By looking at some other perf tests, I see that we can add a
prerequisite action. I will investigate making another helper that
uniformly selects a set of hashes from the repo and writes them
to stdout in a random order. p0008-abbrev.sh will run the helper as
a prerequisite, saving the data to a file. test-abbrev will read
the hashes from stdin to test find_unique_abbrev. This should avoid
the side-effects I mentioned (test-abbrev will not warm up indexes)
while also testing abbreviation lengths for existing objects.

I'll get started on these changes and send a new patch with new perf
data in a couple days. Please let me know if there is a better way
to measure performance for this change.

Thanks,
-Stolee



Re: [PATCH] cleanup: fix possible overflow errors in binary search

2017-10-06 Thread Derrick Stolee

On 10/6/2017 10:18 AM, Jeff King wrote:

On Fri, Oct 06, 2017 at 09:52:31AM -0400, Derrick Stolee wrote:


A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

mid = (min + max) / 2;

Instead, use the overflow-safe version:

mid = min + (max - min) / 2;

Great, thank you for picking this up!


The included changes were found using the following greps:

grep "/ 2;" *.c
grep "/ 2;" */*.c
grep "/2;" */*.c

You can use[1]:

   git grep '/ 2;' '*.c'

to have Git expand the wildcard. That catches a few extra cases in
compat/regex/*.c.  Even though it's imported code, it might be
nice to cover those, too (since it's a possible bug, and also as a good
example).

[1] I'd actually write:

   git grep '/ *2;' '*.c'

 to do it all in one grep. :)


Thanks for the grep lesson! I knew there would be a simpler way to do 
what I wanted.



---
  builtin/index-pack.c | 4 ++--
  builtin/pack-objects.c   | 2 +-
  builtin/unpack-objects.c | 2 +-
  cache-tree.c | 2 +-
  packfile.c   | 2 +-
  sha1-lookup.c| 2 +-
  sha1_name.c  | 2 +-
  string-list.c| 2 +-
  utf8.c   | 2 +-
  xdiff/xpatience.c| 2 +-
  10 files changed, 11 insertions(+), 11 deletions(-)

These all look good to me (really the only way the conversion could be
bad is if "min" was higher than "max", and each case is just inside a
loop condition which makes sure that is not the case).

-Peff
I thought this should be simple enough. When we are all happy with the 
idea of this cleanup, I'll re-roll with the missed examples.


Thanks,
-Stolee


[PATCH] cleanup: fix possible overflow errors in binary search

2017-10-06 Thread Derrick Stolee
A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

mid = (min + max) / 2;

Instead, use the overflow-safe version:

mid = min + (max - min) / 2;
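
To see why this matters, a minimal sketch assuming a 32-bit int: both
endpoints can be representable while their sum is not, so the naive
average triggers signed overflow (undefined behavior) where the safe
form stays in range.

    #include <stdio.h>

    int main(void)
    {
            int min = 1500000000, max = 2000000000; /* each fits in int */

            /* int mid = (min + max) / 2;  min + max = 3.5e9: overflow */
            int mid = min + (max - min) / 2;        /* 1750000000, safe */

            printf("%d\n", mid);
            return 0;
    }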

The included changes were found using the following greps:

grep "/ 2;" *.c
grep "/ 2;" */*.c
grep "/2;" */*.c

Making this cleanup will prevent future review friction when a new
binary search is constructed based on existing code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c | 4 ++--
 builtin/pack-objects.c   | 2 +-
 builtin/unpack-objects.c | 2 +-
 cache-tree.c | 2 +-
 packfile.c   | 2 +-
 sha1-lookup.c| 2 +-
 sha1_name.c  | 2 +-
 string-list.c| 2 +-
 utf8.c   | 2 +-
 xdiff/xpatience.c| 2 +-
 10 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..8ec459f52 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -633,7 +633,7 @@ static int find_ofs_delta(const off_t offset, enum 
object_type type)
int first = 0, last = nr_ofs_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ofs_delta_entry *delta = &ofs_deltas[next];
int cmp;
 
@@ -687,7 +687,7 @@ static int find_ref_delta(const unsigned char *sha1, enum 
object_type type)
int first = 0, last = nr_ref_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ref_delta_entry *delta = &ref_deltas[next];
int cmp;
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee2c48ff..6e77dfd44 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1277,7 +1277,7 @@ static int done_pbase_path_pos(unsigned hash)
int lo = 0;
int hi = done_pbase_paths_num;
while (lo < hi) {
-   int mi = (hi + lo) / 2;
+   int mi = lo + (hi - lo) / 2;
if (done_pbase_paths[mi] == hash)
return mi;
if (done_pbase_paths[mi] < hash)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 689a29fac..62ea264c4 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -394,7 +394,7 @@ static void unpack_delta_entry(enum object_type type, 
unsigned long delta_size,
lo = 0;
hi = nr;
while (lo < hi) {
-   mid = (lo + hi)/2;
+   mid = lo + (hi - lo) / 2;
if (base_offset < obj_list[mid].offset) {
hi = mid;
} else if (base_offset > obj_list[mid].offset) {
diff --git a/cache-tree.c b/cache-tree.c
index 71d092ed5..d3f740127 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -49,7 +49,7 @@ static int subtree_pos(struct cache_tree *it, const char 
*path, int pathlen)
lo = 0;
hi = it->subtree_nr;
while (lo < hi) {
-   int mi = (lo + hi) / 2;
+   int mi = lo + (hi - lo) / 2;
struct cache_tree_sub *mdl = down[mi];
int cmp = subtree_name_cmp(path, pathlen,
   mdl->name, mdl->namelen);
diff --git a/packfile.c b/packfile.c
index eab754248..4a5fe7ab1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1743,7 +1743,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
   sha1[0], sha1[1], sha1[2], lo, hi, p->num_objects);
 
while (lo < hi) {
-   unsigned mi = (lo + hi) / 2;
+   unsigned mi = lo + (hi - lo) / 2;
int cmp = hashcmp(index + mi * stride, sha1);
 
if (debug_lookup)
diff --git a/sha1-lookup.c b/sha1-lookup.c
index 2552b7902..df08f8355 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -95,7 +95,7 @@ int sha1_pos(const unsigned char *sha1, void *table, size_t 
nr,
hi = mi;
else
lo = mi + 1;
-   mi = (hi + lo) / 2;
+   mi = lo + (hi - lo) / 2;
} while (lo < hi);
return -lo-1;
 }
diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..c7c5ab376 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -157,7 +157,7 @@ static void unique_in_pack(struct packed_git *p,
num = p->num_objects;
last = num;
while (first < last) {
-   uint32_t mid = (first + last) / 2;
+   uint32_t mid = first + (last - first) / 2;
const unsigned char *current;
int cmp;
 
diff --git a/string-list.c b/string-list.c
index 806b4c872..a0cf0cfe8 100644
--- a/string-list.c

[PATCH v4 1/4] p4211-line-log.sh: add log --oneline --raw --parents perf test

2017-10-08 Thread Derrick Stolee
Add a new perf test for testing the performance of log while computing
OID abbreviations. Using --oneline --raw and --parents options maximizes
the number of OIDs to abbreviate while still spending some time computing
diffs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/perf/p4211-line-log.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index b7ff68d4f..e0ed05907 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -31,4 +31,8 @@ test_perf 'git log -L (renames on)' '
git log -M -L 1:"$file" >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents' '
+   git log --oneline --raw --parents >/dev/null
+'
+
 test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 3/4] sha1_name: parse less while finding common prefix

2017-10-08 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..5081aeb71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+int pos)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((pos & 1) == 0)
+   return hex[oid->hash[pos >> 1] >> 4];
+   else
+   return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-08 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care in choosing which two to inspect, in case the object id
itself exists in the packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.
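
As a small worked example (illustrative values): if a pack index
contains 3081ab..., 5592cd..., 55ab01..., and 9cdeff... in sorted
order, the hash 559abc... would be inserted at position 2. The entry
before it shares the prefix "559" (3 characters) and the entry at the
insertion position shares "55" (2 characters); every other entry is
lexicographically farther away and cannot share a longer prefix, so
"559a" already disambiguates within this pack.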

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test 4211.6
from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD     | Rel%  |
|-------|--------|----------|----------|-------|
|  1    |      0 |  41.27 s |  38.93 s | -4.8% |
| 24    |      0 |  98.04 s |  91.35 s | -5.7% |
| 23    | 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..49ba67955 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   extend_abbrev_len(&oid, mad);
+   } else if (first < num - 1) {
+   nth_packed_object_oid(&oid, p, first + 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   if (first > 0) {
+   nth_packed_object_oid(&oid, p, first - 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+   struct packed_git *p;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next)
+   find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
struct disambiguate_state ds;
@@ -536,19 +596,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
 
-   if (init_object_disambiguation(hex, len, ) < 0)
-   return -1;
-
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;
+   mad.hash = sha1;
+
+   find_abbrev_len_packed(&mad);
+
+   if (init_object_disambiguation(hex, mad.cur_len, &ds) < 0)
+   return -1;
 
ds.fn = extend_abbrev_len;
ds.always_call_fn = 1;
ds.cb_data = (void*)&mad;
 
find_short_object_filename(&ds);
-   find_short_packed_object(&ds);
(void)finish_object_disambiguation(&ds, &oid_ret);
 
hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 0/4] Improve abbreviation disambiguation

2017-10-08 Thread Derrick Stolee
Changes since previous version:

* Fixed an overflow error in the binary search. I sent a separate patch
  to fix this error in existing searches; that patch should be applied
  before this one.

* Removed test-list-objects and test-abbrev in favor of a new git log
  test in p4211-line-log.sh. Limited perf numbers for Linux repo are
  given in cover letter and commit 4/4.

* Silently skip packfiles that fail to open with open_pack_index()

Thanks for all the comments from Jeff, Junio, Ramsay, and Stefan!

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

Add a new perf test for testing the performance of log while computing
OID abbreviations. Using --oneline --raw and --parents options maximizes
the number of OIDs to abbreviate while still spending some time
computing diffs. Below we report performance statistics for perf test
4211.6 from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | Base Time | New Time | Rel%  |
|-------|--------|-----------|----------|-------|
|  1    |      0 |   41.27 s |  38.93 s | -4.8% |
| 24    |      0 |   98.04 s |  91.35 s | -5.7% |
| 23    | 323952 |  117.78 s | 112.18 s | -4.8% |

Derrick Stolee (4):
  p4211-line-log.sh: add log --oneline --raw --parents perf test
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation

 sha1_name.c  | 129 +--
 t/perf/p4211-line-log.sh |   4 ++
 2 files changed, 118 insertions(+), 15 deletions(-)

-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v4 2/4] sha1_name: unroll len loop in find_unique_abbrev_r

2017-10-08 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

The focus of this change is to refactor the existing method in a way
that clearly does not change the current behavior. In some cases, the
new method is slower than the previous method. Later changes will
correct all performance loss.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2] cleanup: fix possible overflow errors in binary search

2017-10-08 Thread Derrick Stolee
A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

mid = (min + max) / 2;

Instead, use the overflow-safe version:

mid = min + (max - min) / 2;

This translation is safe since the operation occurs inside a loop
conditioned on "min < max". The included changes were found using
the following git grep:

git grep '/ *2;' '*.c'

Making this cleanup will prevent future review friction when a new
binary search is constructed based on existing code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c  | 4 ++--
 builtin/pack-objects.c| 2 +-
 builtin/unpack-objects.c  | 2 +-
 cache-tree.c  | 2 +-
 compat/regex/regex_internal.c | 4 ++--
 compat/regex/regexec.c| 2 +-
 packfile.c| 2 +-
 sha1-lookup.c | 4 ++--
 sha1_name.c   | 2 +-
 string-list.c | 2 +-
 utf8.c| 2 +-
 xdiff/xpatience.c | 2 +-
 12 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..8ec459f52 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -633,7 +633,7 @@ static int find_ofs_delta(const off_t offset, enum 
object_type type)
int first = 0, last = nr_ofs_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ofs_delta_entry *delta = &ofs_deltas[next];
int cmp;
 
@@ -687,7 +687,7 @@ static int find_ref_delta(const unsigned char *sha1, enum 
object_type type)
int first = 0, last = nr_ref_deltas;
 
while (first < last) {
-   int next = (first + last) / 2;
+   int next = first + (last - first) / 2;
struct ref_delta_entry *delta = &ref_deltas[next];
int cmp;
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee2c48ff..6e77dfd44 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1277,7 +1277,7 @@ static int done_pbase_path_pos(unsigned hash)
int lo = 0;
int hi = done_pbase_paths_num;
while (lo < hi) {
-   int mi = (hi + lo) / 2;
+   int mi = lo + (hi - lo) / 2;
if (done_pbase_paths[mi] == hash)
return mi;
if (done_pbase_paths[mi] < hash)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 689a29fac..62ea264c4 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -394,7 +394,7 @@ static void unpack_delta_entry(enum object_type type, 
unsigned long delta_size,
lo = 0;
hi = nr;
while (lo < hi) {
-   mid = (lo + hi)/2;
+   mid = lo + (hi - lo) / 2;
if (base_offset < obj_list[mid].offset) {
hi = mid;
} else if (base_offset > obj_list[mid].offset) {
diff --git a/cache-tree.c b/cache-tree.c
index 71d092ed5..d3f740127 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -49,7 +49,7 @@ static int subtree_pos(struct cache_tree *it, const char 
*path, int pathlen)
lo = 0;
hi = it->subtree_nr;
while (lo < hi) {
-   int mi = (lo + hi) / 2;
+   int mi = lo + (hi - lo) / 2;
struct cache_tree_sub *mdl = down[mi];
int cmp = subtree_name_cmp(path, pathlen,
   mdl->name, mdl->namelen);
diff --git a/compat/regex/regex_internal.c b/compat/regex/regex_internal.c
index d4121f2f4..98342b831 100644
--- a/compat/regex/regex_internal.c
+++ b/compat/regex/regex_internal.c
@@ -613,7 +613,7 @@ re_string_reconstruct (re_string_t *pstr, int idx, int 
eflags)
  int low = 0, high = pstr->valid_len, mid;
  do
{
- mid = (high + low) / 2;
+ mid = low + (high - low) / 2;
  if (pstr->offsets[mid] > offset)
high = mid;
  else if (pstr->offsets[mid] < offset)
@@ -1394,7 +1394,7 @@ re_node_set_contains (const re_node_set *set, int elem)
   right = set->nelem - 1;
   while (idx < right)
 {
-  mid = (idx + right) / 2;
+  mid = idx + (right - idx) / 2;
   if (set->elems[mid] < elem)
idx = mid + 1;
   else
diff --git a/compat/regex/regexec.c b/compat/regex/regexec.c
index 0a745d9c3..6f2b48a78 100644
--- a/compat/regex/regexec.c
+++ b/compat/regex/regexec.c
@@ -4284,7 +4284,7 @@ search_cur_bkref_entry (const re_match_context_t *mctx, 
int str_idx)
   last = right = mctx->nbkref_ents;
   for (left = 0; left < right;)
 {
-  mid = (left + right) / 2;
+  mid = left + (right - left) / 2;

Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-04 Thread Derrick Stolee

On 10/4/2017 2:07 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;

The "always_call_fn" thing is a big sledgehammer that overrides
everything else in update_candidates().  It bypasses the careful
machinery set up to avoid having to open ambiguous object to learn
their types as much as possible.  One narrow exception when it is OK
to use is if we never limit our candidates with type.


I do not modify get_short_oid(), which uses these advanced options 
depending on the flags given. find_unique_abbrev_r() does not use 
them.



And it might appear that the conversion is safe (if only because we
do not see any type limitation in the get_short_oid() call above),
but I think there is one case where this patch changes the
behaviour: what happens if core.disambiguate was set to anything
other than "none"?  The new code does not know anything about type
based filtering, so it can end up reporting longer abbreviation than
it was asked to produce.  It may not be a problem in practice, though.

I am not sure if setting core.disambiguate is generally a good idea
in the first place, and if it is OK to break find_unique_abbrev()
with respect to the configuration variable like this patch does.


I do not think that type-aware disambiguation goes through this code 
path, since it requires giving different parameters to get_short_oid(). 
Test t1512-rev-parse-disambiguation.sh has a test 'core.disambiguate 
config can prefer types' that verifies this behavior.



I'd feel safe if we get extra input from Peff, who introduced the
feature in 5b33cb1f ("get_short_sha1: make default disambiguation
configurable", 2016-09-27).


I look forward to more feedback. Thanks for taking the time to look at 
my patch series.


Thanks,
-Stolee


Re: [PATCH] test-list-objects: mark file-local symbols as static

2017-10-04 Thread Derrick Stolee

On 10/3/2017 5:51 PM, Ramsay Jones wrote:

Signed-off-by: Ramsay Jones 
---

Hi Derrick,

If you need to re-roll your 'ds/find-unique-abbrev-optim' branch,
could you please squash this into the relevant patch (commit 3792c78ba0,
"test-list-objects: list a subset of object ids", 01-10-2017).

Thanks!

ATB,
Ramsay Jones

  t/helper/test-list-objects.c | 32 
  1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
index 22bc9b4e6..5c5d3c03f 100644
--- a/t/helper/test-list-objects.c
+++ b/t/helper/test-list-objects.c
@@ -6,43 +6,43 @@ struct count {
int select_mod;
  };
  
-int count_loose(const struct object_id *oid,
-   const char *path,
-   void *data)
+static int count_loose(const struct object_id *oid,
+  const char *path,
+  void *data)
  {
((struct count*)data)->total++;
return 0;
  }
  
-int count_packed(const struct object_id *oid,
-struct packed_git *pack,
-uint32_t pos,
-void* data)
+static int count_packed(const struct object_id *oid,
+   struct packed_git *pack,
+   uint32_t pos,
+   void* data)
  {
((struct count*)data)->total++;
return 0;
  }
  
-void output(const struct object_id *oid,
-   struct count *c)
+static void output(const struct object_id *oid,
+  struct count *c)
  {
if (!(c->total % c->select_mod))
printf("%s\n", oid_to_hex(oid));
c->total--;
  }
  
-int output_loose(const struct object_id *oid,
-const char *path,
-void *data)
+static int output_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
  {
output(oid, (struct count*)data);
return 0;
  }
  
-int output_packed(const struct object_id *oid,
-struct packed_git *pack,
-uint32_t pos,
-void* data)
+static int output_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
  {
output(oid, (struct count*)data);
return 0;
  }

Thanks, Ramsay. I applied these changes locally. I'll remember "static"
in the future.


Thanks,
-Stolee


Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-04 Thread Derrick Stolee

On 10/4/2017 2:10 AM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:
...

I understand that this patch on its own does not have good numbers. I
split the
patches 3 and 4 specifically to highlight two distinct changes:

Patch 3: Unroll the len loop that may inspect all files multiple times.
Patch 4: Parse less while disambiguating.

Patch 4 more than makes up for the performance hits in this patch.

Now you confused me even more.  When we read the similar table that
appears in [Patch 4/5], what does the "Base Time" column mean?
Vanilla Git with [Patch 3/5] applied?  Vanillay Git with [Patch 4/5]
alone applied?  Something else?
In PATCH 3, 4, and 5, I used the commit-by-commit diff for the perf 
numbers, so the "Base Time" for PATCH 4 is the time calculated when 
PATCH 3 is applied. The table in the [PATCH 0/5] message includes the 
relative change for all commits.


I recalculated the relative change for each patch related to the 
baseline (PATCH 2). Looking again, it appears I misspoke and PATCH 4 
does include a +8% change for a fully-repacked Linux repo relative to 
PATCH 2. Since PATCH 5 includes an optimization targeted directly at 
large packfiles, the final performance gain is significant in the 
fully-packed cases.


It is also worth looking at the absolute times for these cases, since 
the fully-packed case is significantly faster than the multiple-packfile 
case, so the relative change impacts users less.


One final note: the improvement was clearer in test p0008.1 when the 
test included "sort -R" to shuffle the known OIDs. Providing OIDs in 
lexicographic order has had a significant effect on the performance, 
which does not reflect real-world usage. I removed the "sort -R" because 
it is a GNU-ism, but if there is a good cross-platform alternative I 
would be happy to replace it.


p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

| Repo  | Baseline | Patch 3 | Rel % | Patch 4 | Rel % | Patch 5 | Rel % |
|-------|----------|---------|-------|---------|-------|---------|-------|
| Git   | 0.09     | 0.06    | -33%  | 0.05    | -44%  | 0.05    | -44%  |
| Git   | 0.11     | 0.08    | -27%  | 0.08    | -27%  | 0.08    | -27%  |
| Git   | 0.09     | 0.07    | -22%  | 0.06    | -33%  | 0.06    | -33%  |
| Linux | 0.13     | 0.32    | +146% | 0.14    | + 8%  | 0.05    | -62%  |
| Linux | 1.13     | 1.12    | - 1%  | 0.94    | -17%  | 0.88    | -22%  |
| Linux | 1.08     | 1.05    | - 3%  | 0.86    | -20%  | 0.80    | -26%  |
| VSTS  | 0.12     | 0.23    | +92%  | 0.11    | - 8%  | 0.05    | -58%  |
| VSTS  | 1.02     | 1.08    | + 6%  | 0.95    | - 7%  | 0.95    | - 7%  |
| VSTS  | 2.25     | 2.08    | - 8%  | 1.82    | -19%  | 1.93    | -14%  |

(Each repo has three versions, in order: 1 packfile, multiple packfiles, 
and multiple packfiles and loose objects.)


Thanks,
-Stolee



Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Derrick Stolee

On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King  writes:


OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

   git pack-objects .git/objects/pack/pack </dev/null

then the resulting pack has zero objects, and we would want:

	if (!p->num_objects)
		return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?


I prefer to fix the problem in all code clones when they cause review 
friction, so I'll send a fifth commit showing just the diff for these 
packfile issues in sha1_name.c. See patch below.


Should open_pack_index() return a non-zero status if the packfile is 
empty? Or, is there a meaningful reason to have empty packfiles?


Thanks,
-Stolee


diff --git a/sha1_name.c b/sha1_name.c
index 49ba67955..9f8a33e82 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
    uint32_t num, last, i, first = 0;
    const struct object_id *current = NULL;

-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
    num = p->num_objects;
    last = num;
    while (first < last) {
@@ -513,7 +515,9 @@ static void find_abbrev_len_for_pack(struct 
packed_git *p,

    uint32_t num, last, first = 0;
    struct object_id oid;

-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
    num = p->num_objects;
    last = num;
    while (first < last) {



Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-10 Thread Derrick Stolee

On 10/9/2017 9:49 AM, Jeff King wrote:

On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote:


@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
  }
  
+static void find_abbrev_len_for_pack(struct packed_git *p,

+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
[...]

Your cover letter lists:

   * Silently skip packfiles that fail to open with open_pack_index()

as a change from the previous version. But this looks the same as the
last round. I think this _does_ end up skipping such packfiles because
p->num_objects will be zero. Is it worth having a comment to that
effect (or even just an early return) to make it clear that the
situation is intentional?

Although...


+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   extend_abbrev_len(&oid, mad);

If we have zero objects in the pack, what would nth_packed_object_oid()
be returning here?

So I actually think we do want an early return, not just when
open_packed_index() fails, but also when p->num_objects is zero.

-Peff


Sorry about this. I caught this while I was writing my cover letter and 
amended my last commit to include the following:


    if (open_pack_index(p))
        return;

After I amended the commit, I forgot to 'format-patch' again. I can send 
a diff between the commits after review has calmed.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:


On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder  writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?


Anything that is possible with a software should be considered as a possible
usecase. It's in fact a DoS attack. Imagine there's a server that using git
to process something, and now there's a way to knock down this server. It's
also bad from a promotional stand point.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by VSTS. 
In our algorithm, we only walk the list of objects from commit to the 
tree containing the path item. For example, in the path d0/d0/d0, we 
would only walk:


    commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 
entry]


From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.


The reason we have exponential behavior in `git rev-list` is because we 
are calling diff_tree_oid() in tree-diff.c recursively without 
short-circuiting on equal OIDs.
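
The check I have in mind is roughly the following (a sketch with
illustrative names, not the final patch; the real change would live
inside the recursive walk in tree-diff.c):

    #include <string.h>

    /*
     * Sketch only: two identical tree object ids imply identical
     * subtrees, so a recursive tree diff can return before descending.
     * (20 bytes = SHA-1 raw size.)
     */
    static int tree_diff_can_skip(const unsigned char *old_hash,
                                  const unsigned char *new_hash)
    {
            return old_hash && new_hash &&
                   !memcmp(old_hash, new_hash, 20);
    }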


I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:


    commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: avoid 
duplicate ODB lookups during checkout


It will also significantly speed up rev-list calls for short paths in 
deep repositories. It will not be very measurable in the git or Linux 
repos because of their shallow folder structure.


Thanks,
-Stolee


Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-05 Thread Derrick Stolee



On 10/5/2017 6:00 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 06:48:10PM +0900, Junio C Hamano wrote:


Jeff King  writes:


This is weirdly specific. Can we accomplish the same thing with existing
tools?

E.g., could:

   git cat-file --batch-all-objects --batch-check='%(objectname)' |
   shuffle |
   head -n 100

do the same thing?

I know that "shuffle" isn't available everywhere, but I'd much rather
see us fill in portability gaps in a general way, rather than
introducing one-shot C code that needs to be maintained (and you
wouldn't _think_ that t/helper programs need much maintenance, but try
perusing "git log t/helper" output; they have to adapt to the same
tree-wide changes as the rest of the code).

I was thinking about this a bit more, and came to the conclusion
that "sort -R" and "shuf" are wrong tools to use.  We would want to
measure with something close to real world workload.  for example,
letting

git rev-list --all --objects

produce the list of objects in traversal order (i.e. this is very
similar to the order in which "git log -p" needs to access the
objects) and chomping at the number of sample objects you need in
your test would give you such a list.

Actually, I'd just as soon see timings for "git log --format=%h" or "git
log --raw", as opposed to patches 1 and 2.

You won't see a 90% speedup there, but you will see the actual
improvement that real-world users are going to experience, which is way
more important, IMHO.

-Peff

Thanks for thinking hard about this.

For some real-user context: Some engineers using Git for the Windows 
repo were seeing extremely slow commands, such as 'fetch' or 'commit', 
and when we took a trace we saw most of the time spinning in this 
abbreviation code. Our workaround so far has been to set core.abbrev=40.


I'll run some perf numbers for these commands you recommend, and also 
see if I can replicate some of the pain points that triggered this 
change using the Linux repo.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 9:15 AM, Derrick Stolee wrote:

On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:


On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder <christian.cou...@gmail.com> writes:


Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?

Anything that is possible with a software should be considered as a 
possible
usecase. It's in fact a DoS attack. Imagine there's a server that 
using git
to process something, and now there's a way to knock down this 
server. It's

also bad from a promotional stand point.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects.

As soon as servers start trying to checkout or diff, though, the attack
surface gets quite large. And you really need to start thinking about
having resource limits and quotas for CPU and memory use of each process
(and group by requesting user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with the "--patch"?).

-Peff


I'm particularly interested in why `git rev-list HEAD -- [path]` gets 
slower in this case, because I wrote the history algorithm used by 
VSTS. In our algorithm, we only walk the list of objects from commit 
to the tree containing the path item. For example, in the path 
d0/d0/d0, we would only walk:


    commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 
entry]


From this, we can determine the TREESAME relationship by parsing four 
objects without parsing all contents below d0/d0/d0.


The reason we have exponential behavior in `git rev-list` is because 
we are calling diff_tree_oid() in tree-diff.c recursively without 
short-circuiting on equal OIDs.


I will prepare a patch that adds this OID-equal short-circuit to avoid 
this exponential behavior. I'll model my patch against a similar patch 
in master:


    commit d12a8cf0af18804c2000efc7a0224da631e04cd1 unpack-trees: 
avoid duplicate ODB lookups during checkout


It will also significantly speed up rev-list calls for short paths in 
deep repositories. It will not be very measurable in the git or Linux 
repos because of their shallow folder structure.


Thanks,
-Stolee


Since I don't understand enough about the consumers of diff_tree_oid() 
(and the fact that the recursive behavior may be wanted in some cases), 
I think we can fix this in builtin/rev-list.c with this simple diff:


---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const 
char *prefix)


    git_config(git_default_config, NULL);
    init_revisions(&revs, prefix);
+
+   revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
    revs.abbrev = DEFAULT_ABBREV;
    revs.commit_format = CMIT_FMT_UNSPECIFIED;
    argc = setup_revisions(argc, argv, &revs, NULL);



Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 10:26 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case of the
bug it seems it will not be a huge performance improvement in most cases.
Still worth fixing and I'm looking at your suggestions to try and learn this
area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising.

I agree it's worth fixing, though.

-Peff


I just ran a first-level folder history in the Windows repository and 
`git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. So, 
a good percentage improvement but even on this enormous repo the bug 
doesn't present an issue to human users.


Thanks,
-Stolee


Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 9:50 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:


Since I don't understand enough about the consumers to diff_tree_oid() (and
the fact that the recursive behavior may be wanted in some cases), I think
we can fix this in builtin/rev-list.c with this simple diff:

---

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char
*prefix)

     git_config(git_default_config, NULL);
     init_revisions(&revs, prefix);
+
+   revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+


(Note: I'm running tests now and see that this breaks behavior. 
Definitely not the solution we want.)



Hmm, this feels wrong, because we _do_ want to recurse down and follow
the pathspec to see if there are real changes.

We should be comparing an empty tree and d0/d0/d0/d0 (or however deep
your pathspec goes). We should be able to see immediately that the entry
is not present between the two and not bother descending. After all,
we've set the QUICK flag in init_revisions(). So the real question is
why QUICK is not kicking in.

-Peff


I'm struggling to understand your meaning. We want to walk from root to 
d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe 
that's what the QUICK flag is supposed to do.


Thanks,
-Stolee


Re: [PATCH] revision: quit pruning diff more quickly when possible

2017-10-13 Thread Derrick Stolee

On 10/13/2017 11:27 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote:


On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case
nature of the bug, it seems it will not be a huge performance improvement
in most cases. Still worth fixing, and I'm looking at your suggestions to
try to learn this area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising.

I agree it's worth fixing, though.

Here it is cleaned up and with a commit message. There's another case
that can be optimized, too: --remove-empty with an all-deletions commit.
That's probably even more obscure and pathological, but it was easy to
cover in the same breath.

I didn't bother making a perf script, since this really isn't indicative
of real-world performance. If we wanted to do perf regression tests
here, I think the best path forward would be:

   1. Make sure there the perf tests cover pathspecs (maybe in p0001?).

   2. Make it easy to run the whole perf suite against a "bomb" repo.
  This surely isn't the only slow thing of interest.

-- >8 --
Subject: revision: quit pruning diff more quickly when possible

When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.

But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.

But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:

   perl -e '
 chomp(my $blob = `git hash-object -w --stdin </dev/null`);
 [...]
   '

[...] remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).

Signed-off-by: Jeff King <p...@peff.net>
---
  diff.h |  1 +
  revision.c | 16 +---
  2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/diff.h b/diff.h
index 7dcfcfbef7..4a34d256f1 100644
--- a/diff.h
+++ b/diff.h
@@ -180,6 +180,7 @@ struct diff_options {
pathchange_fn_t pathchange;
change_fn_t change;
add_remove_fn_t add_remove;
+   void *change_fn_data;
diff_format_fn_t format_callback;
void *format_callback_data;
diff_prefix_fn_t output_prefix;
diff --git a/revision.c b/revision.c
index 8fd222f3bf..a3f245e2cc 100644
--- a/revision.c
+++ b/revision.c
@@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs,
   * if the whole diff is removal of old data, and otherwise
   * REV_TREE_DIFFERENT (of course if the trees are the same we
   * want REV_TREE_SAME).
- * That means that once we get to REV_TREE_DIFFERENT, we do not
- * have to look any further.
+ *
+ * The only time we care about the distinction is when
+ * remove_empty_trees is in effect, in which case we care only about
+ * whether the whole change is REV_TREE_NEW, or if there's another type
+ * of change. Which means we can stop the diff early in either of these
+ * cases:
+ *
+ *   1. We're not using remove_empty_trees at all.
+ *
+ *   2. We saw anything except REV_TREE_NEW.
   */
  static int tree_difference = REV_TREE_SAME;
  
@@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
			  const char *fullpath, unsigned dirty_submodule)
  {
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
+   struct rev_info *revs = options->change_fn_data;
  
  	tree_difference |= diff;

-   if (tree_difference == REV_TREE_DIFFERENT)
+   if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
DIFF_OPT_SET(options, HAS_CHANGES);
  }
  
@@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)

DIFF_OPT_SET(&revs->pruning, QUICK);
revs->pruning.add_remove = file_add_remove;
revs->pruning.change = file_change;
+   revs->pruning.change_fn_data = revs;

Re: git-clone causes out of memory

2017-10-13 Thread Derrick Stolee

On 10/13/2017 10:20 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:


Hmm. So this patch makes it go fast:

diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
  
  	tree_difference |= diff;

-   if (tree_difference == REV_TREE_DIFFERENT)
+   if (tree_difference & REV_TREE_DIFFERENT)
DIFF_OPT_SET(options, HAS_CHANGES);
  }
  


But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags).
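
For reference, revision.h defines the flags so that DIFFERENT really is
the bitwise union of NEW and OLD, which is why the "&" test fires as
soon as either bit is set:

    #define REV_TREE_SAME       0
    #define REV_TREE_NEW        1   /* Only new files */
    #define REV_TREE_OLD        2   /* Only files removed */
    #define REV_TREE_DIFFERENT  3   /* Mixed changes */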

I'm not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always does).
This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14).

Maybe I am missing something, but AFAICT this was always buggy. But
since it only affects adds and deletes, maybe nobody noticed? I'm also
not sure if it only causes a slowdown, or if this could cause us to
erroneously mark something as TREESAME which isn't (I _do_ think people
would have noticed that).

Answering my own question a little, there is a hint in the comment
a few lines above:

   /*
* The goal is to get REV_TREE_NEW as the result only if the
* diff consists of all '+' (and no other changes), REV_TREE_OLD
* if the whole diff is removal of old data, and otherwise
* REV_TREE_DIFFERENT (of course if the trees are the same we
* want REV_TREE_SAME).
* That means that once we get to REV_TREE_DIFFERENT, we do not
* have to look any further.
*/

So my patch above is breaking that. But it's not clear from that comment
why we care about knowing the different between NEW, OLD, and DIFFERENT.

Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer is
in try_to_simplify_commit():

  case REV_TREE_NEW:
   if (revs->remove_empty_trees &&
   rev_same_tree_as_empty(revs, p)) {
   /* We are adding all the specified
* paths from this parent, so the
* history beyond this parent is not
* interesting.  Remove its parents
* (they are grandparents for us).
* IOW, we pretend this parent is a
* "root" commit.
*/
   if (parse_commit(p) < 0)
   die("cannot simplify commit %s (invalid %s)",
   oid_to_hex(&commit->object.oid),
   oid_to_hex(&p->object.oid));
   p->parents = NULL;
   }
 /* fallthrough */
 case REV_TREE_OLD:
 case REV_TREE_DIFFERENT:

So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.

-Peff


This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner-case
nature of the bug, it seems it will not be a huge performance improvement
in most cases. Still worth fixing, and I'm looking at your suggestions to
try to learn this area better.
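
For context, the early-exit check in diff.h reads roughly as below
(quoted as a sketch, not verbatim); without the HAS_CHANGES bit the
conjunction can never become true, so QUICK alone never stops the diff:

    static inline int diff_can_quit_early(struct diff_options *opt)
    {
        return (DIFF_OPT_TST(opt, QUICK) &&
                !opt->filter &&
                DIFF_OPT_TST(opt, HAS_CHANGES));
    }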


It will speed up natural cases of someone adding or renaming a folder 
with a lot of contents, including someone initializing a repository from 
an existing codebase. This git bomb case happens to add all of the 
folders at once, which is why the performance is so noticeable. I use a 
version of this "exponential file growth" for testing that increases the 
folder-depth one commit at a time, so the contents are rather small in 
the first "add", hence not noticing this issue.


Thanks,
-Stolee


[PATCH v5 3/4] sha1_name: parse less while finding common prefix

2017-10-12 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.
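
As a worked illustration (the byte values are made up), pos >> 1 selects
the hash byte and the low bit of pos selects the nibble, so for an oid
whose hash begins {0xab, 0xcd, ...}:

    get_hex_char_from_oid(oid, 0); /* hex[0xab >> 4]  == 'a' */
    get_hex_char_from_oid(oid, 1); /* hex[0xab & 0xf] == 'b' */
    get_hex_char_from_oid(oid, 2); /* hex[0xcd >> 4]  == 'c' */
    get_hex_char_from_oid(oid, 3); /* hex[0xcd & 0xf] == 'd' */

No 40-character hex string is ever materialized; the comparison in
extend_abbrev_len() stops at the first differing nibble.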

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 19603713f..fdd2711a6 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+unsigned int pos)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((pos & 1) == 0)
+   return hex[oid->hash[pos >> 1] >> 4];
+   else
+   return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 1/4] p4211-line-log.sh: add log --oneline --raw --parents perf test

2017-10-12 Thread Derrick Stolee
Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline, --raw, and --parents options maximizes
the number of OIDs to abbreviate while still spending some time computing
diffs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/perf/p4211-line-log.sh | 4 
 1 file changed, 4 insertions(+)

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index b7ff68d4f..e0ed05907 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -31,4 +31,8 @@ test_perf 'git log -L (renames on)' '
git log -M -L 1:"$file" >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents' '
+   git log --oneline --raw --parents >/dev/null
+'
+
 test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 2/4] sha1_name: unroll len loop in find_unique_abbrev_r

2017-10-12 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

The focus of this change is to refactor the existing method in a way
that clearly does not change the current behavior. In some cases, the
new method is slower than the previous method. Later changes will
correct all performance loss.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index c7c5ab376..19603713f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);
+   find_short_packed_object(&ds);
+   (void)finish_object_disambiguation(&ds, &oid_ret);
+
+   hex[mad.cur_len] = 0;
+   return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-12 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.
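
To see why the neighbors suffice, consider a toy illustration with
made-up, shortened values around the insertion position 'first':

    /*
     * idx[first-1] = 1234abcf...   <- shares "1234ab" with the query
     * query        = 1234abd0...
     * idx[first]   = 1234ac01...   <- shares "1234a"  with the query
     *
     * Sorted order guarantees that any other entry shares a prefix no
     * longer than the nearer neighbor does, so after the binary search
     * at most two entries (three when the query itself is present in
     * the pack) need to be inspected per packfile.
     */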

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test 4211.6
from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD | Rel%  |
|---||--|--|---|
|  1|  0 |  41.27 s |  38.93 s | -4.8% |
| 24|  0 |  98.04 s |  91.35 s | -5.7% |
| 23| 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 76 +
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index fdd2711a6..7fd5f5f71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
uint32_t num, last, i, first = 0;
const struct object_id *current = NULL;
 
-   open_pack_index(p);
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
num = p->num_objects;
last = num;
while (first < last) {
@@ -478,6 +480,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +508,67 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   if (open_pack_index(p) || !p->num_objects)
+   return;
+
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   extend_abbrev_len(&oid, mad);
+   } else if (first < num - 1) {
+   nth_packed_object_oid(&oid, p, first + 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   if (first > 0) {
+   nth_packed_object_oid(&oid, p, first - 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+   struct packed_git *p;
+
+   prepare_packed_git();
+   for (p = packed_git; p; p = p->next)
+   find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
struct disambiguate_state ds;
@@ -536,19 +600,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
 
-   if (init_object_disambiguation(hex, len, &ds) < 0)
-   return -1;
-
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;
+   mad.hash = sha1;
+
+   find_abbrev_len_packed(&mad);
+
+   if (init_object_disambiguation(hex, mad.cur_len, &ds) < 0)
+   return -1;
 
ds.fn = extend_abbrev_len;
ds.always_call_fn = 1;
ds.cb_data = (void*)&mad;
 
find_short_object_filename(&ds);
-   find_short_packed_object(&ds);
(void)finish_object_disambiguation(&ds, &oid_ret);
 
hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v5 0/4] Improve abbreviation disambiguation

2017-10-12 Thread Derrick Stolee
Changes since previous version:

* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages 

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline, --raw, and --parents options maximizes
the number of OIDs to abbreviate while still spending some time
computing diffs. Below we report performance statistics for perf test
4211.6 from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | Base Time | New Time | Rel%  |
|---||---|--|---|
|  1|  0 |   41.27 s |  38.93 s | -4.8% |
| 24|  0 |   98.04 s |  91.35 s | -5.7% |
| 23| 323952 |  117.78 s | 112.18 s | -4.8% |

Derrick Stolee (4):
  p4211-line-log.sh: add log --oneline --raw --parents perf test
  sha1_name: unroll len loop in find_unique_abbrev_r
  sha1_name: parse less while finding common prefix
  sha1_name: minimize OID comparisons during disambiguation
 sha1_name.c  | 135 +--
 t/perf/p4211-line-log.sh |   4 ++
 2 files changed, 123 insertions(+), 16 deletions(-)

-- 
2.14.1.538.g56ec8fc98.dirty



Re: [PATCH v5 0/4] Improve abbreviation disambiguation

2017-10-12 Thread Derrick Stolee

On 10/12/2017 8:02 AM, Derrick Stolee wrote:

Changes since previous version:

* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages

Thanks,
  Stolee

I forgot to mention that I rebased on master this morning to be sure 
this doesn't conflict with the binary-search patch that was queued this 
week.


Thanks,
 Stolee


[PATCH v3 2/5] p0008-abbrev.sh: Test find_unique_abbrev() perf

2017-10-02 Thread Derrick Stolee
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for an input list of object ids.

Perf test p0008-abbrev.sh runs test-abbrev for 100,000 object ids. For
test 0008.1, these objects exist. For test 0008.2, these objects do not
exist in the repo (with high probability).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |  1 +
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 18 ++
 t/perf/p0008-abbrev.sh | 22 ++
 4 files changed, 42 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index 50a2eab80..63438a44e 100644
--- a/Makefile
+++ b/Makefile
@@ -638,6 +638,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 9696f54bb..2190781ff 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..3ad88611a
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,18 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+   struct object_id oid;
+   char hex[GIT_MAX_HEXSZ + 2];
+   const char *end;
+
+   setup_git_directory();
+
+   while (fgets(hex, GIT_MAX_HEXSZ + 2, stdin)) {
+   hex[GIT_MAX_HEXSZ] = 0;
+   if (!parse_oid_hex(hex, &oid, &end))
+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+   }
+
+   exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..5cbc8a888
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test-list-objects 100000 > objs.txt
+
+test_perf 'find_unique_abbrev() for existing objects' '
+   test-abbrev < objs.txt
+'
+
+test-list-objects --missing 100000 > objs.txt
+
+test_perf 'find_unique_abbrev() for missing objects' '
+   test-abbrev < objs.txt
+'
+
+rm objs.txt
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v3 4/5] sha1_name: Parse less while finding common prefix

2017-10-02 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.06 s | 0.05 s | -16.7% |
| Git   | 5 |  230162 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.07 s | 0.06 s | -14.3% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.14 s | -56.3% |
| Linux |24 | 5606645 |  0 | 1.12 s | 0.94 s | -16.1% |
| Linux |23 | 5283204 | 323441 | 1.05 s | 0.86 s | -18.1% |
| VSTS  | 1 | 4355923 |  0 | 0.23 s | 0.11 s | -52.2% |
| VSTS  |32 | 4355923 |  0 | 1.08 s | 0.95 s | -12.0% |
| VSTS  |31 | 4276829 |  79094 | 2.08 s | 1.82 s | -12.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.61 s
  New Time: 4.61 s
 Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.07 s | -12.5% |
| Git   | 5 |  230162 |  0 | 0.13 s | 0.12 s | - 7.7% |
| Git   | 4 |  154310 |  75852 | 0.10 s | 0.09 s | -10.0% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.13 s | -59.4% |
| Linux |24 | 5606645 |  0 | 1.09 s | 0.89 s | -18.3% |
| Linux |23 | 5283204 | 323441 | 0.99 s | 0.83 s | -16.2% |
| VSTS  | 1 | 4355923 |  0 | 0.25 s | 0.11 s | -56.0% |
| VSTS  |32 | 4355923 |  0 | 1.15 s | 0.99 s | -13.9% |
| VSTS  |31 | 4276829 |  79094 | 1.18 s | 1.02 s | -13.6% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.91 s
  New Time: 3.08 s
 Rel %: -21.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..5081aeb71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+int pos)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((pos & 1) == 0)
+   return hex[oid->hash[pos >> 1] >> 4];
+   else
+   return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-02 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.05 s | 0.05 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.06 s | 0.06 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.14 s | 0.05 s | -64.3% |
| Linux |24 | 5606645 |  0 | 0.94 s | 0.88 s | - 6.4% |
| Linux |23 | 5283204 | 323441 | 0.86 s | 0.80 s | - 7.0% |
| VSTS  | 1 | 4355923 |  0 | 0.11 s | 0.05 s | -54.5% |
| VSTS  |32 | 4355923 |  0 | 0.95 s | 0.95 s |   0.0% |
| VSTS  |31 | 4276829 |  79094 | 1.82 s | 1.93 s | + 6.0% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.62 s
  New Time: 4.09 s
 Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.07 s | 0.07 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.12 s | 0.12 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.09 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.13 s | 0.04 s | -69.2% |
| Linux |24 | 5606645 |  0 | 0.89 s | 0.85 s | - 4.5% |
| Linux |23 | 5283204 | 323441 | 0.83 s | 0.78 s | - 6.0% |
| VSTS  | 1 | 4355923 |  0 | 0.11 s | 0.04 s | -63.6% |
| VSTS  |32 | 4355923 |  0 | 0.99 s | 0.98 s | - 1.0% |
| VSTS  |31 | 4276829 |  79094 | 1.02 s | 1.04 s | + 2.0% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.08 s
  New Time: 2.72 s
 Rel %: -11.8

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..54b3a37da 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = (first + last) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+   

[PATCH v3 1/5] test-list-objects: List a subset of object ids

2017-10-02 Thread Derrick Stolee
Create test-list-objects helper program to output a random sample of
OIDs present in the repo.

If no command line arguments are provided, all OIDs are output.

The last command line argument specifies how many samples to output.
Samples are collected across the entire set of available OIDs to avoid
clustering, and are therefore quite uniformly distributed.
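
A quick arithmetic illustration, with made-up counts:

    /*
     * With 1,000,000 objects available and n = 100,000 requested,
     * select_mod = 1 + 1000000 / 100000 = 11, so every 11th OID
     * encountered is printed (about 90,909 samples), spread across
     * the whole set rather than clustered at the front.
     */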

If a command line argument "--missing" is given before the sample count,
then a list of OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile |  1 +
 t/helper/.gitignore  |  1 +
 t/helper/test-list-objects.c | 87 
 3 files changed, 89 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index ed4ca438b..50a2eab80 100644
--- a/Makefile
+++ b/Makefile
@@ -652,6 +652,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 7c9d28a83..9696f54bb 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp
diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0..22bc9b4e6
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,87 @@
+#include "cache.h"
+#include "packfile.h"
+
+struct count {
+   int total;
+   int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+int count_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+void output(const struct object_id *oid,
+   struct count *c)
+{
+   if (!(c->total % c->select_mod))
+   printf("%s\n", oid_to_hex(oid));
+   c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+const char *path,
+void *data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int output_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void* data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+   uint32_t hash_delt = 0xFDB97531;
+   uint32_t hash_base = 0x01020304;
+   int i, n = -1;
+   struct count c;
+   const int u_per_oid = sizeof(struct object_id) / sizeof(uint32_t);
+
+   c.total = 0;
+   if (ac > 1)
+   n = atoi(av[ac - 1]);
+
+   if (ac > 2 && !strcmp(av[1], "--missing")) {
+   while (c.total++ < n) {
+   for (i = 0; i < u_per_oid; i++) {
+   printf("%08x", hash_base);
+   hash_base += hash_delt;
+   }
+   printf("\n");
+   }
+   } else {
+   setup_git_directory();
+
+   if (n > 0) {
+   for_each_loose_object(count_loose, &c, 0);
+   for_each_packed_object(count_packed, &c, 0);
+   c.select_mod = 1 + c.total / n;
+   } else {
+   c.select_mod = 1;
+   }
+
+   for_each_loose_object(output_loose, &c, 0);
+   for_each_packed_object(output_packed, &c, 0);
+   }
+
+   exit(0);
+}
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v3 0/5] Improve abbreviation disambiguation

2017-10-02 Thread Derrick Stolee
Thanks for the feedback on my previous versions, and for patience
with my inexperience on the mailing list. I tried to respond to all
feedback, but decided to not apply one suggestion:

* Removed the "sort -R" from p0008-abbrev.sh, since it is a GNU-
  specific flag. The perf test now considers objects in "discovery"
  order, which improved performance in all versions of the test, as
  expected. New perf numbers are provided.

* get_hex_char_from_oid(): renamed `i` to `pos`. I did not unify the
  code in this method with that in hex.c:sha1_to_hex_r due to the
  extra branch in get_hex_char_from_oid and wanting to inline the
  method. If we want to make this method available for other callers,
  then I would be happy to submit a separate patch for this change
  after the current patch is accepted.

* Removed unnecessary includes.

* Use uint32_t instead of unsigned int in test-list-objects.c

* Rearranged arguments in test-list-objects and fixed the n == 0 bug.

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A helper program `test-list-objects` outputs a sampling of object ids,
which we use as input to a performance test.

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate performance improvements in this area.

I include performance test numbers in the commit messages for each
change, but I also include the difference between the baseline and the
final change here:


p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.09 s | 0.05 s | -44.4% |
| Git   | 5 |  230162 |  0 | 0.11 s | 0.08 s | -27.3% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.06 s | -33.3% |
| Linux | 1 | 5606645 |  0 | 0.13 s | 0.05 s | -61.5% |
| Linux |24 | 5606645 |  0 | 1.13 s | 0.88 s | -22.1% |
| Linux |23 | 5283204 | 323441 | 1.08 s | 0.80 s | -25.9% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.05 s | -58.3% |
| VSTS  |32 | 4355923 |  0 | 1.02 s | 0.95 s | - 6.9% |
| VSTS  |31 | 4276829 |  79094 | 2.25 s | 1.93 s | -14.2% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.09 s
 Rel %: -28.1%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.66 s | 0.07 s | -89.4% |
| Git   | 5 |  230162 |  0 | 0.90 s | 0.12 s | -86.7% |
| Git   | 4 |  154310 |  75852 | 0.79 s | 0.09 s | -88.6% |
| Linux | 1 | 5606645 |  0 | 0.48 s | 0.04 s | -91.7% |
| Linux |24 | 5606645 |  0 | 4.41 s | 0.85 s | -80.7% |
| Linux |23 | 5283204 | 323441 | 4.11 s | 0.78 s | -81.0% |
| VSTS  | 1 | 4355923 |  0 | 0.46 s | 0.04 s | -91.3% |
| VSTS  |32 | 4355923 |  0 | 5.40 s | 0.98 s | -81.9% |
| VSTS  |31 | 4276829 |  79094 | 5.88 s | 1.04 s | -82.3% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 38.9 s
  New Time:  2.7 s
 Rel %: -93.1%

Derrick Stolee (5):
  test-list-objects: List a subset of object ids
  p0008-abbrev.sh: Test find_unique_abbrev() perf
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation
 Makefile 

[PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-02 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New| |
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%|
|---|---|-||||-|
| Git   | 1 |  230078 |  0 | 0.09 s | 0.06 s | - 33.3% |
| Git   | 5 |  230162 |  0 | 0.11 s | 0.08 s | - 27.3% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.07 s | - 22.2% |
| Linux | 1 | 5606645 |  0 | 0.12 s | 0.32 s | +146.2% |
| Linux |24 | 5606645 |  0 | 1.12 s | 1.12 s | -  0.9% |
| Linux |23 | 5283204 | 323441 | 1.08 s | 1.05 s | -  2.8% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.23 s | + 91.7% |
| VSTS  |32 | 4355923 |  0 | 1.02 s | 1.08 s | +  5.9% |
| VSTS  |31 | 4276829 |  79094 | 2.25 s | 2.08 s | -  7.6% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.62 s
 Rel %: -18.9%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.66 s | 0.08 s | -87.9% |
| Git   | 5 |  230162 |  0 | 0.90 s | 0.13 s | -85.6% |
| Git   | 4 |  154310 |  75852 | 0.79 s | 0.10 s | -87.3% |
| Linux | 1 | 5606645 |  0 | 0.48 s | 0.32 s | -33.3% |
| Linux |24 | 5606645 |  0 | 4.41 s | 1.09 s | -75.3% |
| Linux |23 | 5283204 | 323441 | 4.11 s | 0.99 s | -75.9% |
| VSTS  | 1 | 4355923 |  0 | 0.46 s | 0.25 s | -45.7% |
| VSTS  |32 | 4355923 |  0 | 5.40 s | 1.15 s | -78.7% |
| VSTS  |31 | 4276829 |  79094 | 5.88 s | 1.18 s | -79.9% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 39.0 s
  New Time:  3.9 s
 Rel %: -90.0%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, &ds) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+   ds.cb_data = (void*)&mad;
+
+   find_short_object_filename(&ds);

Re: [PATCH v2 4/5] sha1_name: Parse less while finding common prefix

2017-10-02 Thread Derrick Stolee

My v3 patch is incoming, but I wanted to respond directly to this message.

On 9/25/2017 7:42 PM, Stefan Beller wrote:

On Mon, Sep 25, 2017 at 2:54 AM, Derrick Stolee <dsto...@microsoft.com> wrote:

Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.17 s | 0.16 s | - 5.9% |
| Git   | 4 |  154310 |  75852 | 0.14 s | 0.12 s | -14.3% |
| Linux | 1 | 5606645 |  0 | 0.50 s | 0.25 s | -50.0% |
| Linux |24 | 5606645 |  0 | 2.41 s | 2.08 s | -13.7% |
| Linux |23 | 5283204 | 323441 | 1.99 s | 1.69 s | -15.1% |
| VSTS  | 1 | 4355923 |  0 | 0.40 s | 0.22 s | -45.0% |
| VSTS  |32 | 4355923 |  0 | 2.09 s | 1.99 s | - 4.8% |
| VSTS  |31 | 4276829 |  79094 | 3.60 s | 3.20 s | -11.1% |

For the Windows repo running in Windows Subsystem for Linux:

 Pack Files: 50
Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 4.61 s
   New Time: 4.61 s
  Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.06 s | 0.05 s | -16.7% |
| Git   | 5 |  230162 |  0 | 0.14 s | 0.15 s | + 7.1% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.12 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.40 s | 0.17 s | -57.5% |
| Linux |24 | 5606645 |  0 | 1.59 s | 1.30 s | -18.2% |
| Linux |23 | 5283204 | 323441 | 1.23 s | 1.10 s | -10.6% |
| VSTS  | 1 | 4355923 |  0 | 0.25 s | 0.12 s | -52.0% |
| VSTS  |32 | 4355923 |  0 | 1.45 s | 1.34 s | - 7.6% |
| VSTS  |31 | 4276829 |  79094 | 1.59 s | 1.34 s | -15.7% |

For the Windows repo running in Windows Subsystem for Linux:

 Pack Files: 50
Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 3.91 s
   New Time: 3.08 s
  Rel %: -21.1%


These numbers look pretty cool!


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

double signoff?


Oops! I'll be more careful with my format-patch in the future.




---
  sha1_name.c | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
 char *hex;
  };

+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)

'i' is not very descriptive, maybe add a comment?
(I realize it is just walking through the char*s one by one)

I renamed 'i' to 'pos' in my v3.



Maybe this function (together with the change in the while below)
could go into hex.c as "int progressively_cmp_oids" that returns
the position at which the given oids differ?


+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((i & 1) == 0)
+   return hex[oid->hash[i >> 1] >> 4];
+   else
+   return hex[oid->hash[i >> 1] & 0xf];
+}

sha1_to_hex_r has very similar code, though it iterates less
and covers both cases in the loop body.

That is the actual reason I propose moving this function
(or a variant thereof) to hex.c as there we can share code.


You're right that sha1_to_hex_r is similar, in fact I based my work on 
it. There are a few reasons I didn't combine the two implementations:


* I wanted to be sure my patch was only touching the code for 
disambiguating short-shas. Modifying code in hex.c would touch many more 
code paths.


* I realize that the extra branch in my version is slower than the 
branchless loop body in sha1_to_hex_r (see the sketch quoted after this 
list), so either I would slow that method or make the method call more 
complicated by returning two chars at a time.


* I wanted to strongly hint that the method should be inlined, but I'm 
not sure how to guarantee that happens across a linker boundary without 
doing strange things in header files.
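
For reference, the branchless loop in hex.c reads roughly as follows
(quoted as a sketch of the code of this era, not verbatim):

    char *sha1_to_hex_r(char *buffer, const unsigned char *sha1)
    {
        static const char hex[] = "0123456789abcdef";
        char *buf = buffer;
        int i;

        for (i = 0; i < GIT_SHA1_RAWSZ; i++) {
            unsigned int val = *sha1++;
            *buf++ = hex[val >> 4];  /* high nibble */
            *buf++ = hex[val & 0xf]; /* low nibble */
        }
        *buf = '\0';
        return buffer;
    }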


I'm happy to revisit this after my patch is complete, since I think 
there are interesting trade-offs to consider.

Re: What's cooking in git.git (Sep 2017, #06; Fri, 29)

2017-09-29 Thread Derrick Stolee

Hi Junio,

On 9/29/2017 12:34 AM, Junio C Hamano wrote:

* ds/find-unique-abbrev-optim (2017-09-19) 4 commits
  - SQUASH???
  - sha1_name: parse less while finding common prefix
  - sha1_name: unroll len loop in find_unique_abbrev_r()
  - sha1_name: create perf test for find_unique_abbrev()


I'll re-roll my patch on Monday if reviews have stabilized. I think I 
understand your comments this time (especially around 32-bit ints).

Since I'm new to the list, I'm not sure what the change in messages
means here.

What does "SQUASH???" mean? Is that why there are three meaningful 
commits in this note, despite my five-commit patch? Would you like me to 
squash the commits in v3?


Thanks,
-Stolee


Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-10-03 Thread Derrick Stolee

On 10/3/2017 6:49 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New| |
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%|
|---|---|-||||-|
| Git   | 1 |  230078 |  0 | 0.09 s | 0.06 s | - 33.3% |
| Git   | 5 |  230162 |  0 | 0.11 s | 0.08 s | - 27.3% |
| Git   | 4 |  154310 |  75852 | 0.09 s | 0.07 s | - 22.2% |
| Linux | 1 | 5606645 |  0 | 0.12 s | 0.32 s | +146.2% |
| Linux |24 | 5606645 |  0 | 1.12 s | 1.12 s | -  0.9% |
| Linux |23 | 5283204 | 323441 | 1.08 s | 1.05 s | -  2.8% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.23 s | + 91.7% |
| VSTS  |32 | 4355923 |  0 | 1.02 s | 1.08 s | +  5.9% |
| VSTS  |31 | 4276829 |  79094 | 2.25 s | 2.08 s | -  7.6% |

The above does not look so good, especially in cases where a
repository is well maintained by packing into smaller number of
packs, we get much worse result?

I understand that this patch on its own does not have good numbers. I
split patches 3 and 4 specifically to highlight two distinct changes:

Patch 3: Unroll the len loop that may inspect all files multiple times.
Patch 4: Parse less while disambiguating.

Patch 4 more than makes up for the performance hits in this patch.



p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.66 s | 0.08 s | -87.9% |
| Git   | 5 |  230162 |  0 | 0.90 s | 0.13 s | -85.6% |
| Git   | 4 |  154310 |  75852 | 0.79 s | 0.10 s | -87.3% |
| Linux | 1 | 5606645 |  0 | 0.48 s | 0.32 s | -33.3% |
| Linux |24 | 5606645 |  0 | 4.41 s | 1.09 s | -75.3% |
| Linux |23 | 5283204 | 323441 | 4.11 s | 0.99 s | -75.9% |
| VSTS  | 1 | 4355923 |  0 | 0.46 s | 0.25 s | -45.7% |
| VSTS  |32 | 4355923 |  0 | 5.40 s | 1.15 s | -78.7% |
| VSTS  |31 | 4276829 |  79094 | 5.88 s | 1.18 s | -79.9% |

The question is if this is even measuring a relevant workload.  How
often would we have a full 40-hex object name for which we actually
do not have the object, and ask its name to be abbreviated?

Compared to it, the result from p0008.1 feels a lot more important.
We know we make tons of "abbreviate the object name for this object
we have" and we see them every day in our "git log -p" output.

Seeing a lot more impressive result from p0008.2 than p0008.1 makes
me unsure if this patch is optimizing for the right case.

I'll have to see the code a bit deeper before I can comment on it.

Thanks.

I agree that p0008.1 is more important. p0008.2 covers an important case
of the previous implementation. The line

    exists = has_sha1_file(sha1);

will inspect all packfiles and scan the full loose-object directory that
would contain the object. The only reason for this call is to determine
how to inspect the result of get_short_oid(), but it is a significant
portion of the time that is gained here.


Thanks,
-Stolee


Re: [PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-10-03 Thread Derrick Stolee

On 10/3/2017 11:55 AM, Stefan Beller wrote:

@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 return 0;
  }

+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);

coverity complained here with
 Calling "open_pack_index" without checking return value
 (as is done elsewhere 13 out of 15 times).
Good catch! This same line is repeated in unique_in_pack() in this same 
file, so if this is worth fixing then we should probably fix it there, too.

I think the easiest way out is just a

 if (open_pack_index(p))
 die(_("Cannot open existing pack idx file for '%s'"), p);

or is there another good approach?

You probably intended to have p->pack_name in the die();

Using `cat *.c | grep -A 2 "if (open_pack_index("` and `cat */*.c | grep 
-A 2 "if (open_pack_index("` I see a few places that return error codes 
or quietly fail. The cases that use die() are inside builtin/ so I don't 
think die() is the right choice here.


Since find_abbrev_len_for_pack() is intended to extend the abbreviation 
length when necessary, I think a silent return is best here:


    if (open_pack_index(p))
        return;

Thanks,
-Stolee


[PATCH v2 2/5] p0008-abbrev.sh: Test find_unique_abbrev() perf

2017-09-25 Thread Derrick Stolee
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for an input list of object ids.

Perf test p0008-abbrev.sh runs test-abbrev for 100,000 object ids. For
test 0008.1, these objects exist. For test 0008.2, these objects do not
exist in the repo (with high probability). For each test, use `sort -R`
to (deterministically) shuffle the sample of object ids to not check
abbreviations in lexicographic order.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |  1 +
 t/helper/.gitignore|  1 +
 t/helper/test-abbrev.c | 19 +++
 t/perf/p0008-abbrev.sh | 22 ++
 4 files changed, 43 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index af94b655a..75835c7c0 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 9dd1c9f3c..ab9df8369 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..6866896eb
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,19 @@
+#include "cache.h"
+#include <stdio.h>
+
+int cmd_main(int ac, const char **av)
+{
+   struct object_id oid;
+   char hex[GIT_MAX_HEXSZ + 2];
+   const char *end;
+
+   setup_git_directory();
+
+   while (fgets(hex, GIT_MAX_HEXSZ + 2, stdin)) {
+   hex[GIT_MAX_HEXSZ] = 0;
+   if (!parse_oid_hex(hex, &oid, &end))
+   find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+   }
+
+   exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..ba25e7824
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test-list-objects 100000 | sort -R > objs.txt
+
+test_perf 'find_unique_abbrev() for existing objects' '
+   test-abbrev < objs.txt
+'
+
+test-list-objects 100000 --missing | sort -R > objs.txt
+
+test_perf 'find_unique_abbrev() for missing objects' '
+   test-abbrev < objs.txt
+'
+
+rm objs.txt
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2 0/5] Improve abbreviation disambiguation

2017-09-25 Thread Derrick Stolee
Thanks for the feedback on my v1 patch. Thanks also to Jeff Hostetler
for helping me with this v2 patch, which includes an extra performance
improvement in commit 5.

Changes since last version:

* Added helper program test-list-objects to construct a list of
  existing object ids.

* test-abbrev now disambiguates a list of OIDs from stdin.

* p0008-abbrev.sh now has two tests:
* 0008.1 tests 100,000 known OIDs
* 0008.2 tests 100,000 missing OIDs

* Added a third performance improvement that uses binary search within
  packfiles to inspect at most two object ids per packfile.

Thanks,
 Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A helper program `test-list-objects` outputs a sampling of object ids,
which we reorder using `sort -R` before using them as input to a
performance test. 

A performance helper `test-abbrev` and performance test `p0008-abbrev.sh`
are added to demonstrate performance improvements in this area.

I include performance test numbers in the commit messages for each
change, but I also include the difference between the baseline and the
final change here:


p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.12 s | 0.05 s | -58.3% |
| Git   | 5 |  230162 |  0 | 0.25 s | 0.15 s | -40.0% |
| Git   | 4 |  154310 |  75852 | 0.18 s | 0.11 s | -38.9% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.10 s | -68.8% |
| Linux |24 | 5606645 |  0 | 2.77 s | 2.00 s | -27.8% |
| Linux |23 | 5283204 | 323441 | 2.86 s | 1.62 s | -43.4% |
| VSTS  | 1 | 4355923 |  0 | 0.27 s | 0.09 s | -66.7% |
| VSTS  |32 | 4355923 |  0 | 2.41 s | 2.01 s | -16.6% |
| VSTS  |31 | 4276829 |  79094 | 4.22 s | 3.02 s | -28.4% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.09 s
 Rel %: -28.1%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.61 s | 0.04 s | -93.4% |
| Git   | 5 |  230162 |  0 | 1.30 s | 0.15 s | -88.5% |
| Git   | 4 |  154310 |  75852 | 1.07 s | 0.12 s | -88.8% |
| Linux | 1 | 5606645 |  0 | 0.65 s | 0.05 s | -92.3% |
| Linux |24 | 5606645 |  0 | 7.12 s | 1.28 s | -82.0% |
| Linux |23 | 5283204 | 323441 | 6.33 s | 0.96 s | -84.8% |
| VSTS  | 1 | 4355923 |  0 | 0.64 s | 0.05 s | -92.2% |
| VSTS  |32 | 4355923 |  0 | 7.77 s | 1.36 s | -82.5% |
| VSTS  |31 | 4276829 |  79094 | 7.76 s | 1.36 s | -82.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 38.9 s
  New Time:  2.7 s
 Rel %: -93.1%

Derrick Stolee (5):
  test-list-objects: List a subset of object ids
  p0008-abbrev.sh: Test find_unique_abbrev() perf
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation

 Makefile |   2 +
 sha1_name.c  | 128 ++-
 t/helper/.gitignore  |   2 +
 t/helper/test-abbrev.c   |  19 +++
 t/helper/test-list-objects.c |  85 
 t/perf/p0008-abbrev.sh   |  22 
 6 files changed, 243 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode

[PATCH v2 4/5] sha1_name: Parse less while finding common prefix

2017-09-25 Thread Derrick Stolee
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.08 s |   0.0% |
| Git   | 5 |  230162 |  0 | 0.17 s | 0.16 s | - 5.9% |
| Git   | 4 |  154310 |  75852 | 0.14 s | 0.12 s | -14.3% |
| Linux | 1 | 5606645 |  0 | 0.50 s | 0.25 s | -50.0% |
| Linux |24 | 5606645 |  0 | 2.41 s | 2.08 s | -13.7% |
| Linux |23 | 5283204 | 323441 | 1.99 s | 1.69 s | -15.1% |
| VSTS  | 1 | 4355923 |  0 | 0.40 s | 0.22 s | -45.0% |
| VSTS  |32 | 4355923 |  0 | 2.09 s | 1.99 s | - 4.8% |
| VSTS  |31 | 4276829 |  79094 | 3.60 s | 3.20 s | -11.1% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.61 s
  New Time: 4.61 s
 Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.06 s | 0.05 s | -16.7% |
| Git   | 5 |  230162 |  0 | 0.14 s | 0.15 s | + 7.1% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.12 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.40 s | 0.17 s | -57.5% |
| Linux |24 | 5606645 |  0 | 1.59 s | 1.30 s | -18.2% |
| Linux |23 | 5283204 | 323441 | 1.23 s | 1.10 s | -10.6% |
| VSTS  | 1 | 4355923 |  0 | 0.25 s | 0.12 s | -52.0% |
| VSTS  |32 | 4355923 |  0 | 1.45 s | 1.34 s | - 7.6% |
| VSTS  |31 | 4276829 |  79094 | 1.59 s | 1.34 s | -15.7% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.91 s
  New Time: 3.08 s
 Rel %: -21.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+   static const char hex[] = "0123456789abcdef";
+
+   if ((i & 1) == 0)
+   return hex[oid->hash[i >> 1] >> 4];
+   else
+   return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
struct min_abbrev_data *mad = cb_data;
 
-   char *hex = oid_to_hex(oid);
unsigned int i = mad->init_len;
-   while (mad->hex[i] && mad->hex[i] == hex[i])
+   while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
i++;
 
if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty



[PATCH v2 5/5] sha1_name: Minimize OID comparisons during disambiguation

2017-09-25 Thread Derrick Stolee
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The objects immediately before and after (or the one at the insertion
position) give the maximum common prefix.  No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object will
not exist in the repo (with high probability) and the same logic
applies.
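
As an illustration of that padding step (a sketch only; the helper
name and layout are hypothetical, not the patch code), zero-extending
a prefix of 'len' hex characters to a full raw hash could look like:

    static void pad_hash_for_search(unsigned char *hash, int len)
    {
        int i;

        if (len & 1)
            hash[len / 2] &= 0xf0; /* keep only the high nibble */
        for (i = (len + 1) / 2; i < GIT_SHA1_RAWSZ; i++)
            hash[i] = 0;
    }

With the tail zeroed, the padded value sorts at the start of the run
of objects sharing the prefix, so the same neighbor logic applies.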

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.08 s | 0.05 s | -37.5% |
| Git   | 5 |  230162 |  0 | 0.16 s | 0.15 s | - 6.3% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.11 s | - 8.3% |
| Linux | 1 | 5606645 |  0 | 0.25 s | 0.10 s | -60.0% |
| Linux |24 | 5606645 |  0 | 2.08 s | 2.00 s | - 3.8% |
| Linux |23 | 5283204 | 323441 | 1.69 s | 1.62 s | - 4.1% |
| VSTS  | 1 | 4355923 |  0 | 0.22 s | 0.09 s | -59.1% |
| VSTS  |32 | 4355923 |  0 | 1.99 s | 2.01 s | + 1.0% |
| VSTS  |31 | 4276829 |  79094 | 3.20 s | 3.02 s | - 5.6% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 4.62 s
  New Time: 4.09 s
 Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.05 s | 0.04 s | -20.0% |
| Git   | 5 |  230162 |  0 | 0.15 s | 0.15 s |   0.0% |
| Git   | 4 |  154310 |  75852 | 0.12 s | 0.12 s |   0.0% |
| Linux | 1 | 5606645 |  0 | 0.17 s | 0.05 s | -70.6% |
| Linux |24 | 5606645 |  0 | 1.30 s | 1.28 s | - 1.5% |
| Linux |23 | 5283204 | 323441 | 1.10 s | 0.96 s | -12.7% |
| VSTS  | 1 | 4355923 |  0 | 0.12 s | 0.05 s | -58.3% |
| VSTS  |32 | 4355923 |  0 | 1.34 s | 1.36 s | + 1.5% |
| VSTS  |31 | 4276829 |  79094 | 1.34 s | 1.36 s | + 1.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 3.08 s
  New Time: 2.72 s
 Rel %: -11.8%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index bb47b6702..1566cd4fc 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
unsigned int init_len;
unsigned int cur_len;
char *hex;
+   const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
@@ -504,6 +505,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t num, last, first = 0;
+   struct object_id oid;
+
+   open_pack_index(p);
+   num = p->num_objects;
+   last = num;
+   while (first < last) {
+   uint32_t mid = (first + last) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = nth_packed_object_sha1(p, mid);
+   cmp = hashcmp(mad->hash, current);
+   if (!cmp) {
+   match = 1;
+   first = mid;
+   break;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_packed_object_oid(&oid, p, first);
+ 

[PATCH v2 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r

2017-09-25 Thread Derrick Stolee
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length.

Instead, inspect each object that collides with the estimated
abbreviation to find the longest common prefix.

p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.12 s | 0.08 s | -33.3% |
| Git   | 5 |  230162 |  0 | 0.25 s | 0.17 s | -32.0% |
| Git   | 4 |  154310 |  75852 | 0.18 s | 0.14 s | -22.2% |
| Linux | 1 | 5606645 |  0 | 0.32 s | 0.50 s | +56.2% |
| Linux |24 | 5606645 |  0 | 2.77 s | 2.41 s | -13.0% |
| Linux |23 | 5283204 | 323441 | 2.86 s | 1.99 s | -30.4% |
| VSTS  | 1 | 4355923 |  0 | 0.27 s | 0.40 s | +48.1% |
| VSTS  |32 | 4355923 |  0 | 2.41 s | 2.09 s | -13.3% |
| VSTS  |31 | 4276829 |  79094 | 4.22 s | 3.60 s | -14.7% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 5.69 s
  New Time: 4.62 s
 Rel %: -18.9%

p0008.2: find_unique_abbrev() for missing objects
-

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|   | Pack  | Packed  | Loose  | Base   | New||
| Repo  | Files | Objects | Objects| Time   | Time   | Rel%   |
|---|---|-|||||
| Git   | 1 |  230078 |  0 | 0.61 s | 0.06 s | -90.2% |
| Git   | 5 |  230162 |  0 | 1.30 s | 0.14 s | -89.2% |
| Git   | 4 |  154310 |  75852 | 1.07 s | 0.12 s | -88.8% |
| Linux | 1 | 5606645 |  0 | 0.65 s | 0.40 s | -38.5% |
| Linux |24 | 5606645 |  0 | 7.12 s | 1.59 s | -77.7% |
| Linux |23 | 5283204 | 323441 | 6.33 s | 1.23 s | -80.6% |
| VSTS  | 1 | 4355923 |  0 | 0.64 s | 0.25 s | -60.9% |
| VSTS  |32 | 4355923 |  0 | 7.77 s | 1.45 s | -81.3% |
| VSTS  |31 | 4276829 |  79094 | 7.76 s | 1.59 s | -79.5% |

For the Windows repo running in Windows Subsystem for Linux:

Pack Files: 50
Packed Objects: 22,385,898
 Loose Objects: 492
 Base Time: 39.0 s
  New Time:  3.9 s
 Rel %: -90.0%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++---
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+   unsigned int init_len;
+   unsigned int cur_len;
+   char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-   int status, exists;
+   struct min_abbrev_data *mad = cb_data;
+
+   char *hex = oid_to_hex(oid);
+   unsigned int i = mad->init_len;
+   while (mad->hex[i] && mad->hex[i] == hex[i])
+   i++;
+
+   if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+   mad->cur_len = i + 1;
+
+   return 0;
+}
 
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+   struct disambiguate_state ds;
+   struct min_abbrev_data mad;
+   struct object_id oid_ret;
if (len < 0) {
unsigned long count = approximate_object_count();
/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
sha1_to_hex_r(hex, sha1);
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
-   exists = has_sha1_file(sha1);
-   while (len < GIT_SHA1_HEXSZ) {
-   struct object_id oid_ret;
-   status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-   if (exists
-   ? !status
-   : status == SHORT_NAME_NOT_FOUND) {
-   hex[len] = 0;
-   return len;
-   }
-   len++;
-   }
-   return len;
+
+   if (init_object_disambiguation(hex, len, ) < 0)
+   return -1;
+
+   mad.init_len = len;
+   mad.cur_len = len;
+   mad.hex = hex;
+
+   ds.fn = extend_abbrev_len;
+   ds.always_call_fn = 1;
+  

[PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-09-25 Thread Derrick Stolee
Create test-list-objects helper program to output a random sample of
OIDs present in the repo.

If no command line arguments are provided, all OIDs are output.

The first command line argument specifies how many samples to output.
Samples are collected across the entire set of available OIDs to avoid
clustering, and are therefore quite uniformly distributed.

If a second command line argument "--missing" is given, then a list of
OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile |  1 +
 t/helper/.gitignore  |  1 +
 t/helper/test-list-objects.c | 85 
 3 files changed, 87 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index f2bb7f2f6..af94b655a 100644
--- a/Makefile
+++ b/Makefile
@@ -647,6 +647,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..9dd1c9f3c 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp
diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0..83b1250fe
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,85 @@
+#include "cache.h"
+#include "packfile.h"
+#include <stdio.h>
+
+struct count {
+   int total;
+   int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+   const char *path,
+   void *data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+int count_packed(const struct object_id *oid,
+struct packed_git *pack,
+uint32_t pos,
+void* data)
+{
+   ((struct count*)data)->total++;
+   return 0;
+}
+
+void output(const struct object_id *oid,
+   struct count *c)
+{
+   if (!(c->total % c->select_mod))
+   printf("%s\n", oid_to_hex(oid));
+   c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+const char *path,
+void *data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int output_packed(const struct object_id *oid,
+ struct packed_git *pack,
+ uint32_t pos,
+ void* data)
+{
+   output(oid, (struct count*)data);
+   return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+   unsigned int hash_delt = 0xFDB97531;
+   unsigned int hash_base = 0x01020304;
+   int i, n = 0;
+   struct count c;
+   const int u_per_oid = sizeof(struct object_id) / sizeof(unsigned int);
+
+   c.total = 0;
+   if (ac > 1)
+   n = atoi(av[1]);
+
+   if (ac > 2 && !strcmp(av[2], "--missing")) {
+   while (c.total++ < n) {
+   for (i = 0; i < u_per_oid; i++) {
+   printf("%08x", hash_base);
+   hash_base += hash_delt;
+   }
+   printf("\n");
+   }
+   } else {
+   setup_git_directory();
+
+   for_each_loose_object(count_loose, &c, 0);
+   for_each_packed_object(count_packed, &c, 0);
+
+   /* n == 0 (no sample count given) means "output all OIDs" */
+   c.select_mod = n ? 1 + c.total / n : 1;
+
+   for_each_loose_object(output_loose, &c, 0);
+   for_each_packed_object(output_packed, &c, 0);
+   }
+
+   exit(0);
+}
-- 
2.14.1.538.g56ec8fc98.dirty



Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids

2017-10-07 Thread Derrick Stolee

On 10/6/2017 10:11 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 08:39:42AM -0400, Derrick Stolee wrote:

I'll run some perf numbers for these commands you recommend, and also see if
I can replicate some of the pain points that triggered this change using the
Linux repo.


Thanks!

-Peff



In my local copy, I added a test to p4211-line-log.sh that runs "git log 
--raw -r" and tested it on three copies of the Linux repo. In order, 
they have 1 packfile (0 loose), 24 packfiles (0 loose), and 23 packfiles 
(~324,000 loose).


4211.6: git log --raw -r  43.34(42.62+0.65)   40.47(40.16+0.27)  -6.6%
4211.6: git log --raw -r  88.77(86.54+2.12)   82.44(81.87+0.52)  -7.1%
4211.6: git log --raw -r 108.86(103.97+4.81) 103.92(100.63+3.19) -4.5%

We have moderate performance gains for this command, despite the command 
doing many more things than just checking abbreviations.


I plan to re-roll my patch on Monday including the following feedback items:

* Remove test-list-objects and test-abbrev in favor of a new "git log"
  performance test.

* Fix binary search overflow error.

* Check response from open_pack_index(p) in find_abbrev_len_for_pack().
  I plan to return without failure on a non-zero result, so a bad pack
  does not cause a failure and the abbreviation length will be the
  minimum required among all valid packs. (Thanks Stefan!)
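
  As a sketch, that early return might look like this (the final form
  may differ in the re-roll):

    if (open_pack_index(p))
        return;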

I made note of a few things, but will not include them in my re-roll. 
I'll revisit them later if they are valuable:


- nth_packed_object_sha1() could be simplified in
  find_abbrev_len_for_pack().

- Teach 'cat-file' to --batch-check='%(objectsize:short)'. (Peff already
  included a patch, perhaps that could be reviewed separately.)

- Ramsay caught my lack of "static" in test-list-objects.c, but that
  file will be removed in the next patch. I'll make sure to use "static"
  in the future.

I'm not re-rolling immediately to allow for some extra review over the 
weekend, if anyone is so inclined.


Thanks,
-Stolee


Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation

2017-10-11 Thread Derrick Stolee

On 10/10/2017 9:30 AM, Jeff King wrote:

On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote:


On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King <p...@peff.net> writes:


OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

git pack-objects .git/objects/pack/pack </dev/null

then the resulting pack contains zero objects, so I think you need
something like:

	if (!p->num_objects)
		return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True.  I notice that the early part of the two functions look almost
identical.  Do we need error condition handling for the other one,
too?

I prefer to fix the problem in all code clones when they cause review
friction, so I'll send a fifth commit showing just the diff for these
packfile issues in sha1_name.c. See patch below.

Ah, that answers my earlier question. Junio mean unique_in_pack(). And
yeah, I think it suffers from the same problem.


Should open_pack_index() return a non-zero status if the packfile is empty?
Or, is there a meaningful reason to have empty packfiles?

I can't think of a compelling reason to have an empty packfile. But nor
do I think we should consider them an error when we can handle them
quietly (and returning non-zero status would cause Git to complain on
many operations in a repository that has such a file).

-Peff


Thanks for the comments. I found some typos in my commit messages, too.

I plan to send a (hopefully) final version tomorrow (Thursday). It will 
include:


* Make 'pos' unsigned in get_hex_char_from_oid()

* Check response from open_pack_index()

* Small typos in commit messages

If there are other issues, then please let me know.

Thanks,
-Stolee


Re: cherry-pick very slow on big repository

2017-11-10 Thread Derrick Stolee

On 11/10/2017 7:37 AM, Peter Krefting wrote:

Jeff King:


Can you get a backtrace? I'd do something like:


Seems that it spends most time in diffcore_count_changes(), that is 
where it hits whenever I hit Ctrl+C (various line numbers 199-207 in 
diffcore-delta.c; this is on the v2.15.0 tag).


(gdb) bt
#0  diffcore_count_changes (src=src@entry=0x5db99970,
    dst=dst@entry=0x5d6a4810,
    src_count_p=src_count_p@entry=0x5db8,
    dst_count_p=dst_count_p@entry=0x5d6a4838,
    src_copied=src_copied@entry=0x7fffd3e0,
    literal_added=literal_added@entry=0x7fffd3f0)
    at diffcore-delta.c:203
#1  0x556dee1a in estimate_similarity (minimum_score=3,
    dst=0x5d6a4810, src=0x5db99970) at diffcore-rename.c:193
#2  diffcore_rename (options=options@entry=0x7fffd4f0)
    at diffcore-rename.c:560
#3  0x55623d83 in diffcore_std (
    options=options@entry=0x7fffd4f0) at diff.c:5846
...


Git is spending time detecting renames, which implies you probably 
renamed a folder or added and deleted a large number of files. This 
rename detection is quadratic (# adds times # deletes).


You can disable this rename detection by running your cherry-pick with 
`git -c diff.renameLimit=1 cherry-pick ...`


See https://git-scm.com/docs/diff-config#diff-config-diffrenameLimit

Thanks,
-Stolee


Re: [PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-04 Thread Derrick Stolee

On 12/4/2017 11:56 AM, Jeff King wrote:

When you put your cover letter first, you need to use "scissors" like:
-- >8 --

to separate it from the commit message. Using three-dashes means "git
am" will include your cover letter as the commit message, and omit your
real commit message entirely.


Thanks. I updated our internal best-practices accordingly.

-Stolee


Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-01 Thread Derrick Stolee

On 12/1/2017 1:22 PM, Jeff King wrote:

On Fri, Dec 01, 2017 at 12:49:56PM -0500, Derrick Stolee wrote:

[snip]

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644

This overall looks good, but I noticed one bug and a few cosmetic
improvements.


Thanks for finding quality improvements to this patch. I'll let it sit 
over the weekend for more feedback before sending a v2.





--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
  
  	oid.hash[0] = subdir_nr;

+   strbuf_add(path, "/", 1);

Maybe:

   strbuf_addch(path, '/');

would be a bit more readable (it's also a tiny bit faster, but this
isn't part of the tight loop).


+   baselen = path->len;

We set this here so that the '/' is included as part of the base. Makes
sense, but can we now drop the earlier setting of baselen before the
opendir() call?


Yeah, probably. I had briefly considered just adding the '/' before the 
first assignment of "baselen", but didn't want to change the error 
output. I also don't know if there are side effects for other platforms 
by calling opendir() with a '/'-terminated path.



while ((de = readdir(dir))) {
if (is_dot_or_dotdot(de->d_name))
continue;
  
-		strbuf_setlen(path, baselen);

-   strbuf_addf(path, "/%s", de->d_name);
-
if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
!hex_to_bytes(oid.hash + 1, de->d_name,
  GIT_SHA1_RAWSZ - 1)) {
+   strbuf_setlen(path, baselen);
+   strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);

Moving this code into the conditional makes sense here, since that's
where we know the actual size. But what about the case when we _don't_
hit this conditional. We still do:

   if (cruft_cb)
 cruft_cb(path->buf);

which is now broken (it will just get the directory name without the
entry appended).


Good catch! A big reason to pull it inside and use strbuf_add over 
strbuf_addstr is to avoid a duplicate strlen() calculation. However, I 
can store the length before the conditional.



I think the optimized versions probably needs to be something more like:

   while (de = readdir(...)) {
   strbuf_setlen(path, baselen);
   if (size is HEXSZ - 2) {
   strbuf_add(path, de->d_name, HEXSZ - 2);
  obj_cb(path->buf);
   } else if (cruft_cb) {
   strbuf_addstr(path, de->d_name);
  cruft_cb(path->buf);
   }
   }


Small change by storing the length in advance of the conditional:

while (de = readdir(...)) {
    int namelen = strlen(de->d_name);
    strbuf_setlen(path, baselen);
    strbuf_add(path, de->d_name, namelen);

    if (namelen == HEXSZ - 2)
        obj_cb(path->buf)
    else
        cruft_cb(path->buf);
}


Two other thoughts:

   - from past discussions, I suspect most of your performance
 improvement actually comes from avoiding addf(), and the difference
 between addstr() and add() may be negligible here. It might be worth
 timing strbuf_addstr(). If that's similarly fast, that means we
 could keep the logic out of the conditional entirely.


addstr() duplicates the strlen(), which isn't much but we might as well 
save cycles where we can if it isn't too hard.



   - there's an extra micro-optimization there, which is that if there's
 no obj_cb, we have no need to assemble the full path at all. I doubt
 it makes much of a difference, as most callers would pass an object
 callback (I'd be surprised if we even have one that doesn't).


After doing a few 'git grep' commands, I found several that include a 
NULL cruft_cb but none that have a NULL obj_cb.


Thanks,
-Stolee


Re: Question about the ahead-behind computation and status

2017-12-15 Thread Derrick Stolee

On 12/15/2017 10:08 AM, Jeff Hostetler wrote:

On 12/15/2017 5:08 AM, Jeff King wrote:

On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote:

[*] Sadly, the local repo was only about 20 days out of
 date (including the Thanksgiving holidays)


Taking 20 seconds to traverse 20 days worth of history sounds pretty
awful. How many commits is it?


150,000 commits, give or take.  The graph is many thousands of lanes
wide because of the merges and that isn't helping.

(But I should give you folks lots of credit -- it *only* took 20
seconds to find, unzip and decode 150,000 commit objects and walk
the commit graph.)


To give a few more data points, I created a similar situation by checking 
out a big repo I hadn't updated in three months and it was 16,000 
commits behind. That took 1.5s to calculate the ahead/behind. Moving it 
100,000 commits behind it took 5s. This repo has about 300,000 total 
commits versus the 1.5 million commits in the Windows repo.


The biggest reason for the 20 seconds is not just the number of commits 
in the ahead/behind but how many commits are walked (including common to 
both branches) before paint_down_to_common() breaks its while loop due 
to queue_has_nonstale().


Thanks,
-Stolee



Re: Question about the ahead-behind computation and status

2017-12-15 Thread Derrick Stolee

On 12/15/2017 1:30 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


The biggest reason for the 20 seconds is not just the number of
commits in the ahead/behind but how many commits are walked (including
common to both branches) before paint_down_to_common() breaks its
while loop due to queue_has_nonstale().

Hmm, queue_has_nonstale() looks to see if any element is not STALE
(where the definition of STALE is "known to be a common ancestor")
by potentially checking all elements in the queue.  I wonder if we
have an opportunity for a trivial optimization?  When the caller
knows that it dug one level and added the parents that are not
stale, it does not have to ask queue_has_nonstale() if there is any
non stale element, for example.


I thought this, too, but my tracing did not show significant time spent 
in this method. 99% of the time is spent unzipping and parsing commits.


If this was taking too long, then we could track a minimum timestamp for 
a commit that entered the queue in a non-stale state, but this will 
delay the termination condition a bit since commits can be marked stale 
after they enter the queue.



What do you exactly mean by "not just the number of commits in the
ahead/behind"?  Aren't the number of these commits pretty much
proportional to the number of commits we need to paint down to
common ancestors?  Is the latter a lot larger than the former
(i.e. are we somehow not stopping when we _could_ notice that we
can with better information)?




With the wide history, there is actually a large set of commits that are 
in the common history but have newer commit times than the oldest commit 
in only one history. Consider the following ASCII art:


  A
  |
  1
  |
  2
  |
  3
  |\
  4 B
  |\|
  5 C
  |\|
  6 D
  |\|
   .
   .
   .
  |\|
  N Z
  |/
  0

Between A and B, A is ahead by commits {A,1,2,3,4,5,6,...,N}. Meanwhile, 
commits B,C,D,...,Z are in the common history, but still must be walked.


Now imagine these two sets are actually much MUCH wider (thousands of 
commits that are pairwise non-ancestors). This causes the number of 
walked commits to be larger than just the number of commits in the 
symmetric difference of the histories.


Unfortunately, generation numbers are not the only optimization needed 
to make this call be sub-100ms. A graph data structure that lists the 
edges between commits would prevent the time spent in zlib decompressing 
and parsing commits. I'm working on investigating how such a data 
structure (and corresponding file format) could integrate in the 
commit-walking code.
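
As a purely invented illustration of the idea (not a proposed file
format), one fixed-size record per commit, with parents stored as
positions in the same table, would let the walk avoid zlib entirely:

    #include <stdint.h>

    struct graph_entry {
        unsigned char root_tree[20]; /* OID of the root tree */
        uint32_t parent1, parent2;   /* table positions, or ~0u if absent */
        uint64_t commit_date;
    };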


Thanks,
-Stolee


[PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-04 Thread Derrick Stolee
Thanks for the feedback on v1. This version fixes the cruft_cb
bug and streamlines the strlen(). I would include an inter-diff
but it was the same size as the patch.

Thanks,
-Stolee

---

Replace use of strbuf_addf() with strbuf_add() when enumerating
loose objects in for_each_file_in_obj_subdir(). Since we already
check the length and hex-values of the string before consuming
the path, we can prevent extra computation by using the lower-
level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation
code. OID abbreviations use a cached list of loose objects (per
object subdirectory) to make repeated queries fast, but there is
significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking,
but in the GVFS case the repos can grow to have millions of loose
objects. Profiling 'git log' performance in GitForWindows on a
GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more
sensitive to this cache-loading. By limiting to 1000 commits, we
more closely resemble user wait time when reading history into a
pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K
loose objects, running 'git log --oneline --parents --raw -1000'
had the following performance:

 HEAD~1HEAD

 7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_file.c  | 12 +++-
 t/perf/p4211-line-log.sh |  4 
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2fc8fa93b4 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1903,7 +1903,6 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
origlen = path->len;
strbuf_complete(path, '/');
strbuf_addf(path, "%02x", subdir_nr);
-   baselen = path->len;
 
dir = opendir(path->buf);
if (!dir) {
@@ -1914,15 +1913,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
 
oid.hash[0] = subdir_nr;
+   strbuf_addch(path, '/');
+   baselen = path->len;
 
while ((de = readdir(dir))) {
+   size_t namelen;
if (is_dot_or_dotdot(de->d_name))
continue;
 
+   namelen = strlen(de->d_name);
strbuf_setlen(path, baselen);
-   strbuf_addf(path, "/%s", de->d_name);
-
-   if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
+   strbuf_add(path, de->d_name, namelen);
+   if (namelen == GIT_SHA1_HEXSZ - 2 &&
!hex_to_bytes(oid.hash + 1, de->d_name,
  GIT_SHA1_RAWSZ - 1)) {
if (obj_cb) {
@@ -1941,7 +1943,7 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
closedir(dir);
 
-   strbuf_setlen(path, baselen);
+   strbuf_setlen(path, baselen - 1);
if (!r && subdir_cb)
r = subdir_cb(subdir_nr, path->buf, data);
 
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
git log --oneline --raw --parents >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents -1000' '
+   git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0



Re: [PATCH] builtin/tag.c: return appropriate value when --points-at finds an empty list

2017-12-11 Thread Derrick Stolee

On 12/11/2017 8:44 AM, George Papanikolaou wrote:

`git tag --points-at` can simply return if the given rev does not have
any tags pointing to it. It's not a failure but it shouldn't return
with 0 value.


I disagree. I think the 0 return means "I completed successfully" and 
the empty output means "I didn't find any tags pointing to this object."


Changing the return value here could break a lot of scripts out in the 
wild, and I consider this to be an "API" compatibility that needs to 
stay as-is.


Where are you using "--points-at" such that you need a nonzero exit code 
instead of a different indicator?


Thanks,
-Stolee



[PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()

2017-12-01 Thread Derrick Stolee
Replace use of strbuf_addf() with strbuf_add() when enumerating
loose objects in for_each_file_in_obj_subdir(). Since we already
check the length and hex-values of the string before consuming
the path, we can prevent extra computation by using the lower-
level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation
code. OID abbreviations use a cached list of loose objects (per
object subdirectory) to make repeated queries fast, but there is
significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking,
but in the GVFS case the repos can grow to have millions of loose
objects. Profiling 'git log' performance in GitForWindows on a
GVFS-enabled repo with ~2.5 million loose objects revealed 12% of
the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more
sensitive to this cache-loading. By limiting to 1000 commits, we
more closely resemble user wait time when reading history into a
pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K
loose objects, running 'git log --oneline --raw --parents -1000'
had the following performance:

 HEAD~1HEAD

 7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_file.c  | 7 ---
 t/perf/p4211-line-log.sh | 4 
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
}
 
oid.hash[0] = subdir_nr;
+   strbuf_add(path, "/", 1);
+   baselen = path->len;
 
while ((de = readdir(dir))) {
if (is_dot_or_dotdot(de->d_name))
continue;
 
-   strbuf_setlen(path, baselen);
-   strbuf_addf(path, "/%s", de->d_name);
-
if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
!hex_to_bytes(oid.hash + 1, de->d_name,
  GIT_SHA1_RAWSZ - 1)) {
+   strbuf_setlen(path, baselen);
+   strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
if (obj_cb) {
r = obj_cb(&oid, path->buf, data);
if (r)
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
git log --oneline --raw --parents >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents -1000' '
+   git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0



[RFC] 'unsigned long' to 'size_t' conversion

2017-12-06 Thread Derrick Stolee
There are several places in Git where we refer to the size of an object 
by an 'unsigned long' instead of a 'size_t'. In 64-bit Linux, 'unsigned 
long' is 8 bytes, but in 64-bit Windows it is 4 bytes.
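
A minimal illustration of the difference (hypothetical snippet, not
Git code):

    /*
     * On LLP64 platforms such as 64-bit Windows, unsigned long is
     * 32 bits, so the first value wraps to 0; size_t is 64 bits
     * there, so the second holds the full 8 GB.
     */
    unsigned long bad = 8UL * 1024 * 1024 * 1024;
    size_t good = (size_t)8 * 1024 * 1024 * 1024; /* 8589934592 */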


The main issue caused by this difference is that large objects fail to load 
(they seem to hash and store just fine). For example, the following 
'blob8gb' is an 8 GB file where the ith byte is equal to i % 256:


$ git hash-object -w --no-filters blob8gb
5391939346b98600acc0283dda24649450cec51f

$ git cat-file -s 5391939346b98600acc0283dda24649450cec51f
error: bad object header
fatal: git cat-file: could not get object info

An existing discussion can be found here: 
https://github.com/git-for-windows/git/issues/1063


The error message results from unpack_object_header_buffer() which had 
its most-recent meaningful change in 'ea4f9685: 
unpack_object_header_buffer(): clear the size field upon error' (in 2011).


In my opinion, the correct thing to do would be to replace all 'unsigned 
long's that refer to an object size with 'size_t'. 
However, a simple "git grep 'unsigned long size'" reveals 194 results, 
and there are other permutations of names and pointer types all over.


This conversion would be a significant patch, so I wanted to get the 
community's thoughts on this conversion.


If there are small, isolated chunks that can be done safely, then this 
may be a good target for a first patch.


Thanks,

-Stolee



Re: [RFC] protocol version 2

2017-10-25 Thread Derrick Stolee

On 10/20/2017 1:18 PM, Brandon Williams wrote:

  Overview
==

This document presents a specification for a version 2 of Git's wire
protocol.  Protocol v2 will improve upon v1 in the following ways:

   * Instead of multiple service names, multiple commands will be
 supported by a single service
   * Easily extendable as capabilities are moved into their own section
 of the protocol, no longer being hidden behind a NUL byte and
 limited by the size of a pkt-line (as there will be a single
 capability per pkt-line).
   * Separate out other information hidden behind NUL bytes (e.g. agent
 string as a capability and symrefs can be requested using 'ls-ref')
   * Ref advertisement will be omitted unless explicitly requested
   * ls-ref command to explicitly request some refs


Hi Brandon,

I'm very interested in your protocol as a former server-side dev for the 
VSTS Git server, and understand some of these headaches. We built 
limited refs specifically to target the problem you are solving with 
ls-ref, but it requires knowledge about the authenticated user in order 
to work. I believe your suggestion is a better solution for the Git 
protocol.


The "easily extendable" part has specifically caught my interest, as we 
(Microsoft) would like to move most of the GVFS protocol into core Git, 
and this is a great way to do it. Even if not all features are accepted 
by upstream, we could use our GVFS-specific fork of Git to communicate 
to our servers without breaking normal users' interactions.


Please CC me in future versions of this proposal. Let me know if you 
want to chat directly about the "TODO" items below.


Speaking of TODOs, how much of this concept do you have working in a 
prototype? Do you have code that performs this version 2 handshake and 
communicates the ls-refs result?



  Ls-refs
-

Ls-refs can be looked at as the equivalent of the current ls-remote as
it is a way to query a remote for the references that it has.  Unlike
the current ls-remote, the filtering of the output is done on the server
side by passing a number of parameters to the server-side command
instead of the filtering occurring on the client.

Ls-ref takes in the following parameters:

   --head, --tags: Limit to only refs/heads or refs/tags


Nit: It would be better to use "--heads" to match refs/heads and your 
use of "--tags" for refs/tags.



   --refs: Do not show peeled tags or pseudorefs like HEAD


Assuming we are in the case where the server has a HEAD ref, why would 
that ever be advertised? Also, does this imply that without the --refs 
option we would peel annotated tags until we find non-tag OIDs? Neither 
of these functions seem useful as default behavior.



   --symref: In addition to the object pointed by it, show the underlying
 ref pointed by it when showing a symbolic ref
   <pattern>: When specified, only references matching the given patterns
  are displayed.


Can you be specific about the patterns? For instance, it is not a good 
idea to allow the client to submit a regex for the server to compute. 
Instead, can we limit this pattern-matching to a prefix-set, such as the 
following list of prefixes:


    refs/heads/master
    refs/releases/*
    refs/heads/user/me/*

  Fetch
---

Fetch will need to be a modified version of the v1 fetch protocol.  Some
potential areas for improvement are: Ref-in-want, CDN offloading,
Fetch-options.

Since we'll have an 'ls-ref' service we can eliminate the need of fetch
to perform a ref-advertisement, instead a client can run the 'ls-refs'
service first, in order to find out what refs the server has, and then
request those refs directly using the fetch service.

//TODO Flush out the design

  Fetch-object
--

This service could be used by partial clones in order to request missing
objects.

//TODO Flush out the design


As you flesh out these "fetch" and "fetch-object" commands, keep in mind 
that partial clones could mean any of the following:


 * fetch all reachable objects except for blobs.

 * fetch all reachable objects except for blobs above a
   certain size.

 * fetch all commits, trees, (and blobs?) within a certain
   "cone" of the file system.


  Push
--

Push will need to be a modified version of the v1 push protocol.  Some
potential areas for improvement are: Fix push-options, Negotiation for
force push.


Negotiation is something to keep in mind for all pushes, especially in 
an ecosystem full of fork-based workflows. If you are working across 
forks and someone else syncs data between your remotes, you may re-push 
a large chunk of objects that are already present in a fork. Adding an 
ls-refs step before push would be a step in the right direction.

  Other Considerations
==

   * Move away from pkt-line framing?
   * Have responses structured in well known formats (e.g. JSON)
   * Eliminate initial round-trip using 'GIT_PROTOCOL' side-channel
   * Additional 

Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-07 Thread Derrick Stolee

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

Hello,

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.


Before thinking about adding more data to the commit-graph, I think 
instead we need to finish taking advantage of the data that is already 
there. This means landing the generation number patch [1] (I think this 
is close, so I'll send a v6 this week if there is no new feedback.) and 
the auto-compute patch [2] (this could use more feedback, but I'll send 
a v1 based on the RFC feedback if no one chimes in).


[1] 
https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/

    [PATCH v5 00/11] Compute and consume generation numbers

[2] 
https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/

    [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git 
log --graph`. The `tag` scenario is probably easier: this can be done by 
replacing the revision-walk underlying the call to use 
paint_down_to_common() instead. Requires adding an external method to 
commit.c, but not too much code.
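
As a sketch, the external method could look something like this (the
name and signature are hypothetical):

    /* does 'commit' have any commit in 'with_commit' as an ancestor? */
    int is_descendant_of(struct commit *commit,
                         struct commit_list *with_commit);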


The tougher challenge is `git log --graph`. The revision walk machinery 
currently uses two precompute phases before iterating results to the 
pager: limit_list() and sort_in_topological_order(); these correspond to 
two phases of Kahn's algorithm for topo-sort (compute in-degrees, then 
walk by peeling commits with in-degree zero). This requires O(N) time, 
where N is the number of reachable commits. Instead, we could make this 
be O(W) time to output one page of results, where W is (roughly) the 
number of reachable commits with generation number above the last 
reported result.
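
To make the two phases concrete, here is a minimal, self-contained toy
version of Kahn's algorithm on a made-up six-commit graph (illustration
only; the real revision.c machinery interleaves these phases with
commit parsing and limiting):

    #include <stdio.h>

    #define N 6

    /* parents[i] lists up to two parents of commit i; -1 means none */
    static const int parents[N][2] = {
        {1, -1}, {2, -1}, {3, -1}, {4, 5}, {-1, -1}, {-1, -1}
    };

    int main(void)
    {
        int indegree[N] = {0};
        int queue[N], head = 0, tail = 0;
        int i, j;

        /* phase 1 (limit_list analogue): count children per commit */
        for (i = 0; i < N; i++)
            for (j = 0; j < 2; j++)
                if (parents[i][j] >= 0)
                    indegree[parents[i][j]]++;

        /* phase 2 (sort_in_topological_order analogue): peel
         * in-degree-zero commits, children before parents */
        for (i = 0; i < N; i++)
            if (!indegree[i])
                queue[tail++] = i;
        while (head < tail) {
            int c = queue[head++];
            printf("commit %d\n", c);
            for (j = 0; j < 2; j++)
                if (parents[c][j] >= 0 && !--indegree[parents[c][j]])
                    queue[tail++] = parents[c][j];
        }
        return 0;
    }

The O(W) idea replaces these two global passes with two priority
queues advancing in lock step, bounded by generation number.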


In order to take advantage of this approach, the two phases of Kahn's 
algorithm need to be done in-line with reporting results to the pager. 
This means keeping two queues: one is a priority queue by generation 
number that computes in-degrees, the other is a priority queue (by 
commit-date or a visit-order value to do the --topo-order priority) that 
peels the in-degree-zero commits (and decrements the in-degree of their 
parents). I have not begun this refactoring effort because appears 
complicated to me, and it will be hard to tease out the logic without 
affecting other consumers of the revision-walk machinery.


I would love it if someone picked up the `git log --graph` task, since 
it will be a few weeks before I have the time to focus on it.


Without completing the benefits we get from generation numbers, these 
investigations into other reachability indexes will be incomplete as 
they are comparing benefits without all consumers taking advantage of a 
reachability index.


[...]

Bloom filter for changed paths
--

The goal of this chunk is to speed up checking if the file or directory
was changed in a given commit, for queries such as "git log -- <path>" or
"git blame <path>".  This is something that according to "Git Merge
contributor summit notes" [2] is already present in VSTS (Visual Studio
Team Services - the server counterpart of GVFS: Git Virtual File System)
at Microsoft:

AV> - VSTS adds bloom filters to know which paths have changed on the commit
AV> - tree-same check in the bloom filter is fast; speeds up file history checks
AV> - might be useful in the client as well, since limited-traversal is common
AV> - if the file history is _very_ sparse, then bloom filter is useful
AV> - but needs pre-compute, so useful to do once
AV> - first make the client do it, then think about how to serve it centrally

[2]: 
https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

I think it was what Derrick Stolee was talking about at the end of his
part of "Making Git for Windows" presentation at Git Merge 2018:
https://youtu.be/oOMzi983Qmw?t=1835

This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
trees when reading commit-graph", starting from [3]
[3]: https://public-inbox.org/git/86y3hyeu6c@gmail.com/


Again, the benefits of Bloom filters should only be measured after 
already taking advantage of a reachability index during `git log`. 
However, you could get performance benefits from Bloom filters in a 
normal `git log` (no topo-order).
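
For readers unfamiliar with the structure, here is a minimal,
self-contained Bloom-filter sketch (the parameters, hash choice, and
names are invented for illustration and match neither the VSTS design
nor any proposed format):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOOM_BITS 512
    #define BLOOM_HASHES 3

    struct bloom {
        uint8_t bits[BLOOM_BITS / 8];
    };

    /* seeded FNV-1a; a stand-in for whatever hash a real design picks */
    static uint32_t hash_path(const char *path, uint32_t seed)
    {
        uint32_t h = 2166136261u ^ seed;
        for (; *path; path++) {
            h ^= (uint8_t)*path;
            h *= 16777619u;
        }
        return h % BLOOM_BITS;
    }

    static void bloom_add(struct bloom *b, const char *path)
    {
        uint32_t i, bit;
        for (i = 0; i < BLOOM_HASHES; i++) {
            bit = hash_path(path, i);
            b->bits[bit / 8] |= 1u << (bit % 8);
        }
    }

    /* 0: path definitely unchanged; 1: maybe changed, check the trees */
    static int bloom_maybe_changed(const struct bloom *b, const char *path)
    {
        uint32_t i, bit;
        for (i = 0; i < BLOOM_HASHES; i++) {
            bit = hash_path(path, i);
            if (!(b->bits[bit / 8] & (1u << (bit % 8))))
                return 0;
        }
        return 1;
    }

    int main(void)
    {
        struct bloom b = { { 0 } };
        bloom_add(&b, "Makefile");
        printf("%d %d\n", bloom_maybe_changed(&b, "Makefile"),
               bloom_maybe_changed(&b, "sha1_name.c"));
        return 0;
    }

The one-sided error is the point: a negative answer lets a path-limited
walk skip a commit without opening any trees.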


The tricky part about this feature is that the decisions we made in our 
C# implementation for the VSTS Git server may be very different from the 
needs of the C implementation in the Git client. Questions like "how do 
we handle merge commits?" may have different answers, which can only be 
discovered by implementing the feature.


(The answer for VSTS is that we only store Bloom filters containin

Re: What's cooking in git.git (May 2018, #01; Mon, 7)

2018-05-07 Thread Derrick Stolee

On 5/7/2018 10:58 AM, Junio C Hamano wrote:

* ds/generation-numbers (2018-05-02) 11 commits
  - commit-graph.txt: update design document
  - merge: check config before loading commits
  - commit: use generation number in remove_redundant()
  - commit: add short-circuit to paint_down_to_common()
  - commit: use generation numbers for in_merge_bases()
  - ref-filter: use generation number for --contains
  - commit-graph: always load commit-graph information
  - commit: use generations in paint_down_to_common()
  - commit-graph: compute generation numbers
  - commit: add generation number to struct commmit
  - ref-filter: fix outdated comment on in_commit_list
  (this branch uses ds/commit-graph and ds/lazy-load-trees.)

  A recently added "commit-graph" datafile has learned to store
  pre-computed generation numbers to speed up the decisions to stop
  history traversal.

  Is this ready for 'next'?


I see that you squashed the fix from [1], so I think this is ready to 
go. Thanks!


[1] 
https://public-inbox.org/git/1cfe38f6-925b-d36b-53ae-6b586eed1...@gmail.com/



* ds/lazy-load-trees (2018-05-02) 6 commits
   (merged to 'next' on 2018-05-02 at d54016d9e3)
  + coccinelle: avoid wrong transformation suggestions from commit.cocci
   (merged to 'next' on 2018-04-25 at b90813f421)
  + commit-graph: lazy-load trees for commits
  + treewide: replace maybe_tree with accessor methods
  + commit: create get_commit_tree() method
  + treewide: rename tree to maybe_tree
  + Merge branch 'bw/c-plus-plus' into ds/lazy-load-trees
  (this branch is used by ds/generation-numbers; uses ds/commit-graph.)

  The code has been taught to use the duplicated information stored
  in the commit-graph file to learn the tree object name for a commit
  to avoid opening and parsing the commit object when it makes sense
  to do so.

  Will merge to 'master'.


* ds/commit-graph (2018-04-11) 16 commits
   (merged to 'next' on 2018-04-25 at 18af3d28d9)
  + commit-graph: implement "--append" option
  + commit-graph: build graph from starting commits
  + commit-graph: read only from specific pack-indexes
  + commit: integrate commit graph with commit parsing
  + commit-graph: close under reachability
  + commit-graph: add core.commitGraph setting
  + commit-graph: implement git commit-graph read
  + commit-graph: implement git-commit-graph write
  + commit-graph: implement write_commit_graph()
  + commit-graph: create git-commit-graph builtin
  + graph: add commit graph design document
  + commit-graph: add format document
  + csum-file: refactor finalize_hashfile() method
  + csum-file: rename hashclose() to finalize_hashfile()
  + Merge branch 'jk/cached-commit-buffer' into HEAD
  + Merge branch 'jt/binsearch-with-fanout' into HEAD
  (this branch is used by ds/generation-numbers and ds/lazy-load-trees.)

  Precompute and store information necessary for ancestry traversal
  in a separate file to optimize graph walking.

  Will merge to 'master'.


These have been queued for master for a few weeks. Is anything delaying 
them? I'd love to see the community dogfood this feature by running the 
following on their local repos:


    git config core.commitGraph true
    git show-ref -s | git commit-graph write --stdin-commits

Thanks,
-Stolee


Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation

2018-05-14 Thread Derrick Stolee

On 5/14/2018 9:09 AM, Jeff King wrote:

On Mon, May 14, 2018 at 08:47:33AM -0400, Derrick Stolee wrote:


On 5/11/2018 2:03 PM, Jeff King wrote:

Commit 941ba8db57 (Eliminate recursion in setting/clearing
marks in commit list, 2012-01-14) used a clever double-loop
to avoid allocations for single-parent chains of history.
However, it did so only when following parents of parents
(which was an uncommon case), and _always_ incurred at least
one allocation to populate the list of pending parents in
the first place.

We can turn this into zero-allocation in the common case by
iterating directly over the initial parent list, and then
following up on any pending items we might have discovered.

This change appears to improve performance, but I was unable to measure any
difference between this commit and the one ahead, even when merging
ds/generation-numbers (which significantly reduces the other costs). I was
testing 'git status' and 'git rev-list --boundary master...origin/master' in
the Linux repo with my copy of master 70,000+ commits behind origin/master.

I think you'd want to go the other way: this is marking uninteresting
commits, so you'd want origin/master..master, which would make those 70k
commits uninteresting.


Thanks for the tip. Running 'git rev-list origin/master..master' 100 
times gave the following times, on average (with ds/generation-numbers):


PATCH 3/4: 0.19 s
PATCH 4/4: 0.16 s
    Rel %: -16%



I still doubt it will create that much difference, though. It's
more or less saving a single malloc per commit we traverse. Certainly
without the .commits file that's a drop in the bucket compared to
inflating the object data, but I wouldn't be surprised if we end up with
a few mallocs even after your patches.

Anyway, thanks for poking at it. :)

-Peff




Re: [PATCH 0/4] a few mark_parents_uninteresting cleanups

2018-05-14 Thread Derrick Stolee

On 5/11/2018 2:00 PM, Jeff King wrote:

This is a follow-up to the discussion from February:

   
https://public-inbox.org/git/1519522496-73090-1-git-send-email-dsto...@microsoft.com/

There I theorized that some of these extra has_object_file() checks were
unnecessary. After poking around a bit, I've convinced myself that this
is the case, so here are some patches.

After Stolee's fix in ebbed3ba04 (revision.c: reduce object database
queries, 2018-02-24), I doubt this will provide any noticeable speedup.
IMHO the value is just in simplifying the logic.

The first two patches are the actual has_object_file simplifications.
The second two are an attempt to fix some head-scratching I had while
reading mark_parents_uninteresting(). I hope the result is easier to
follow, but I may have just shuffled one confusion for another. :)

   [1/4]: mark_tree_contents_uninteresting(): drop missing object check
   [2/4]: mark_parents_uninteresting(): drop missing object check
   [3/4]: mark_parents_uninteresting(): replace list with stack
   [4/4]: mark_parents_uninteresting(): avoid most allocation

  revision.c | 90 ++
  1 file changed, 50 insertions(+), 40 deletions(-)

-Peff


This series looks good to me. I found Patch 3 hard to read in the diff, 
so I just looked at the final result and the new arrangement is very 
clear about how it should behave.


Reviewed-by: Derrick Stolee <dsto...@microsoft.com>


Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-14 Thread Derrick Stolee

On 5/12/2018 10:00 AM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think
instead we need to finish taking advantage of the data that is already
there. This means landing the generation number patch [1] (I think
this is close, so I'll send a v6 this week if there is no new
feedback.) and the auto-compute patch [2] (this could use more
feedback, but I'll send a v1 based on the RFC feedback if no one
chimes in).

[1]
https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/
     [PATCH v5 00/11] Compute and consume generation numbers

[2]
https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/
     [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

Right, so the RFC might be a bit premature; I wanted the discussion to
be out there to think about when adding new uses of existing features.


DIGRESSION: it is commendable that you are sending patches in small,
easy digestible chunks / patch series.  It is much easier to review 10+
series than 80+ behemoth (though I understand it is not always possible
to subdivide patch series into smaller self-contained sub-series).

On the other hand, it would be nice to have some roadmap about series
and features to be sent in the future, if possible.  Something like what
was done when 'git rebase --interactive' was getting builtinified: moved
(in parts) to C.  It would be great to have such roadmap with milestones
achieved and milestones to be done in the cover letter for series.


I suppose that is what I intended in the "Future Work" section of 
Documentation/technical/commit-graph.txt. It gives a set of things that 
need to be done in order to make this a default feature, not just an 
expert-level feature. When I wrote the section, I was focused entirely 
on "commit-graph version 1.0" and I didn't know how long that would 
take. The series have been getting interest and review (in great part to 
your interest, Jakub) so they have been moving to 'next' faster than I 
anticipated.


I'll plan on writing a more detailed list of future directions, but for 
now I'll list the things I know about and how I see them in priority order:


Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths




The big wins remaining from this data are `git tag --merged` and `git
log --graph`. The `tag` scenario is probably easier: this can be done
by replacing the revision-walk underlying the call to use
paint_down_to_common() instead. Requires adding an external method to
commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged`
using its own codepath, besides historical reasons.  Maybe performance is
better with the current code?

Utilizing paint_down_to_common() there, besides reducing the amount of
code you would have to modify, would also unify the two codepaths.
That's very nice.
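For context, the per-tag test that such a unified walk would batch up is just
an ancestry check. A rough sketch using the existing in_merge_bases() helper
from commit.h (the wrapper name here is hypothetical, not from any patch):

    /*
     * Hypothetical helper: the per-tag ancestry test that a batched
     * paint_down_to_common() walk would fold into a single traversal.
     * in_merge_bases(A, B) answers "is A an ancestor of (merged into) B?".
     */
    static int tag_is_merged(struct commit *tag_commit, struct commit *target)
    {
    	return in_merge_bases(tag_commit, target);
    }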


The tougher challenge is `git log --graph`. The revision walk
machinery currently uses two precompute phases before iterating
results to the pager: limit_list() and sort_in_topological_order();
these correspond to two phases of Kahn's algorithm for topo-sort
(compute in-degrees, then walk by peeling commits with in-degree
zero). This requires O(N) time, where N is the number of reachable
commits. Instead, we could make this be O(W) time to output one page
of results, where W is (roughly) the number of reachable commits with
generation number above the last reported result.

A reminder: Kahn's algorithm (described for example in [1] and [2])
looks like this:

   L ← Empty list that will contain the sorted elements
   S ← Collection of all nodes with no incoming edge
   while S is non-empty do
   remove a node 'n' from S
   add 'n' to tail of L
   for each parent 'm' of 'n' do
   decrease the in-degree of 'm'
   if 'm' has in-degree of 0 then
   insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
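For concreteness, here is a tiny standalone C sketch of those two phases on a
toy graph (the adjacency matrix is invented; none of this is git code):

    #include <stdio.h>

    #define N 4

    int main(void)
    {
    	/* edge[i][j] != 0 means j is a parent of i (an incoming edge at j) */
    	int edge[N][N] = {
    		{ 0, 1, 0, 0 },	/* commit 0 has parent 1 */
    		{ 0, 0, 1, 1 },	/* commit 1 has parents 2 and 3 */
    		{ 0, 0, 0, 1 },	/* commit 2 has parent 3 */
    		{ 0, 0, 0, 0 },	/* commit 3 is the root */
    	};
    	int indegree[N] = { 0 };
    	int stack[N], top = 0, i, j;

    	/* phase 1: compute in-degrees (the limit_list() analogue) */
    	for (i = 0; i < N; i++)
    		for (j = 0; j < N; j++)
    			if (edge[i][j])
    				indegree[j]++;

    	/* phase 2: walk by peeling nodes with in-degree zero
    	 * (the sort_in_topological_order() analogue) */
    	for (i = 0; i < N; i++)
    		if (!indegree[i])
    			stack[top++] = i;
    	while (top) {
    		int n = stack[--top];
    		printf("commit %d\n", n);	/* "add 'n' to tail of L" */
    		for (j = 0; j < N; j++)
    			if (edge[n][j] && !--indegree[j])
    				stack[top++] = j;
    	}
    	return 0;
    }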


In order to take advantage of this approach, the two phases of Kahn's
algorithm need to be done in-line with reporting results to the
pager. This means keeping

Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand

2018-05-14 Thread Derrick Stolee

On 5/12/2018 9:31 AM, Martin Ågren wrote:

On 11 May 2018 at 23:15, Derrick Stolee <dsto...@microsoft.com> wrote:


 graph_name = get_commit_graph_filename(opts.obj_dir);
 graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);

 if (!graph)
 die("graph file %s does not exist", graph_name);
-   FREE_AND_NULL(graph_name);

This is probably because of something I said, but this does not look
correct. The `die()` would typically print "(null)" or segfault. If the
`die()` means we don't free `graph_name`, that should be fine.

You're still leaking `graph` here (possibly nothing this patch should
worry about) and in `graph_verify()`. UNLEAK-ing it immediately before
calling `verify_commit_graph()` should be ok. I also think punting on
this UNLEAK-business entirely would be ok; I was just a bit surprised to
see one variable getting freed and the other one ignored.


Thanks, Martin. I was just blindly searching for FREE_AND_NULL() and 
shouldn't have been so careless.


-Stolee


Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation

2018-05-14 Thread Derrick Stolee

On 5/11/2018 2:03 PM, Jeff King wrote:

Commit 941ba8db57 (Eliminate recursion in setting/clearing
marks in commit list, 2012-01-14) used a clever double-loop
to avoid allocations for single-parent chains of history.
However, it did so only when following parents of parents
(which was an uncommon case), and _always_ incurred at least
one allocation to populate the list of pending parents in
the first place.

We can turn this into zero-allocation in the common case by
iterating directly over the initial parent list, and then
following up on any pending items we might have discovered.


This change appears to improve performance, but I was unable to measure 
any difference between this commit and the one ahead, even when merging 
ds/generation-numbers (which significantly reduces the other costs). I 
was testing 'git status' and 'git rev-list --boundary 
master...origin/master' in the Linux repo with my copy of master 70,000+ 
commits behind origin/master.


It's still a good change, but I was hoping to find a measurable benefit :(



Signed-off-by: Jeff King 
---
Again, try "-w" for more readability.

  revision.c | 44 +---
  1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 89ff9a99ce..cbe041128e 100644
--- a/revision.c
+++ b/revision.c
@@ -115,32 +115,38 @@ static void commit_stack_clear(struct commit_stack *stack)
 	stack->nr = stack->alloc = 0;
 }
 
-void mark_parents_uninteresting(struct commit *commit)
+static void mark_one_parent_uninteresting(struct commit *commit,
+					  struct commit_stack *pending)
 {
-	struct commit_stack pending = COMMIT_STACK_INIT;
 	struct commit_list *l;
 
+	if (commit->object.flags & UNINTERESTING)
+		return;
+	commit->object.flags |= UNINTERESTING;
+
+	/*
+	 * Normally we haven't parsed the parent
+	 * yet, so we won't have a parent of a parent
+	 * here. However, it may turn out that we've
+	 * reached this commit some other way (where it
+	 * wasn't uninteresting), in which case we need
+	 * to mark its parents recursively too..
+	 */
 	for (l = commit->parents; l; l = l->next)
-		commit_stack_push(&pending, l->item);
+		commit_stack_push(pending, l->item);
+}
 
-	while (pending.nr > 0) {
-		struct commit *commit = commit_stack_pop(&pending);
+void mark_parents_uninteresting(struct commit *commit)
+{
+	struct commit_stack pending = COMMIT_STACK_INIT;
+	struct commit_list *l;
 
-		if (commit->object.flags & UNINTERESTING)
-			return;
-		commit->object.flags |= UNINTERESTING;
+	for (l = commit->parents; l; l = l->next)
+		mark_one_parent_uninteresting(l->item, &pending);
 
-		/*
-		 * Normally we haven't parsed the parent
-		 * yet, so we won't have a parent of a parent
-		 * here. However, it may turn out that we've
-		 * reached this commit some other way (where it
-		 * wasn't uninteresting), in which case we need
-		 * to mark its parents recursively too..
-		 */
-		for (l = commit->parents; l; l = l->next)
-			commit_stack_push(&pending, l->item);
-	}
+	while (pending.nr > 0)
+		mark_one_parent_uninteresting(commit_stack_pop(&pending),
+					      &pending);
 
 	commit_stack_clear(&pending);
 }




Re: [PATCH v2 02/12] commit-graph: verify file header information

2018-05-14 Thread Derrick Stolee

On 5/12/2018 9:35 AM, Martin Ågren wrote:

+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+   va_list ap;
+   struct strbuf sb = STRBUF_INIT;
+   verify_commit_graph_error = 1;
+
+   va_start(ap, fmt);
+   strbuf_vaddf(, fmt, ap);
+
+   fprintf(stderr, "%s\n", sb.buf);
+   strbuf_release();
+   va_end(ap);
+}


That's a good idea. Makes that patch a bit less trivial and this one a 
bit less difficult.


Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb

2018-05-14 Thread Derrick Stolee

On 5/12/2018 5:17 PM, Martin Ågren wrote:

On 11 May 2018 at 23:15, Derrick Stolee <dsto...@microsoft.com> wrote:

When running 'git commit-graph verify', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the
object IDs to guarantee we parse from the commit-graph file.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

While testing, we discovered that mutating the integer value for a
parent to be outside the accepted range causes a segmentation fault. Add
a new check in insert_parent_or_die() that prevents this fault. Check
for that error during the test, both in the typical parents and in the
list of parents for octopus merges.

This paragraph and the corresponding fix and test feel like a separate
patch to me. (The commit message of it could be "To test the next patch,
we threw invalid data at `git commit-graph verify`, and it crashed in
pre-existing code, so let's fix that first" -- there is definitely a
connection.) Is this important enough to fast-track to master in time
for 2.18? My guess would be no.


+
+   if (pos >= g->num_commits)
+   die("invalide parent position %"PRIu64, pos);

s/invalide/invalid/


@@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
 return 1;

 for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
 hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);

 if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g)

 cur_fanout_pos++;
 }
+
+   graph_commit = lookup_commit(&cur_oid);
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid));
 }

Could this end up giving ridiculous amounts of output? It would depend
on the input, I guess.


 while (cur_fanout_pos < 256) {
@@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g)
 cur_fanout_pos++;
 }

+   if (verify_commit_graph_error)
+   return 1;

Well, here we give up before running into *too* much problem.


+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit, *odb_commit;
+		struct commit_list *graph_parents, *odb_parents;
+		int num_parents = 0;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		graph_commit = lookup_commit(&cur_oid);
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0)) {
+			graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid));
+			continue;
+		}
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report("root tree object ID for commit %s in commit-graph is %s != %s",
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"",
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			num_parents++;
+
+			if (odb_parents == NULL)
+				graph_report("commit-graph parent list for commit %s is too long (%d)",
+					     oid_to_hex(&cur_oid),
+					     num_parents);
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report("commit-graph parent for %s is %s != %s",
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+					     oid_to_hex(&odb_parents->item->object.oid));

Re: [PATCH v2 14/14] commit.h: delete 'util' field in struct commit

2018-05-14 Thread Derrick Stolee

On 5/14/2018 1:30 PM, Duy Nguyen wrote:

On Mon, May 14, 2018 at 6:07 PM, Duy Nguyen  wrote:

On Mon, May 14, 2018 at 04:52:29PM +0900, Junio C Hamano wrote:

Nguyễn Thái Ngọc Duy   writes:


diff --git a/commit.h b/commit.h
index 838f6a6b26..70371e111e 100644
--- a/commit.h
+++ b/commit.h
@@ -18,12 +18,16 @@ struct commit_list {

  struct commit {
 struct object object;
-   void *util;
 unsigned int index;
 timestamp_t date;
 struct commit_list *parents;
 struct tree *tree;
 uint32_t graph_pos;
+   /*
+* Do not add more fields here unless it's _very_ often
+* used. Use commit-slab to associate more data with a commit
+* instead.
+*/
  };

That's a logical consequence and a natural endgame of this
pleasant-to-read series.  Thanks.

Unfortunately we are gaining more and more stuff in "struct commit"
with recent topics, and we may want to see if we can evict some of
them out to shrink it again.

Sigh.. ds/lazy-load-trees already enters 'next' so a fixup patch is
something like this.

Sorry I take this patch back. I didn't realize it was a rename and the
old field named 'tree' was already there (I vaguely recalled some
"tree" in this struct but didn't stop to think about it). Moving
graph_pos out is an option, but only 32-bit arch gains from it (64-bit
arch will have 4 bytes padding anyway) so probably not worth the
effort. "generation" field should probably be moved out though.


I'm happy to take a look at this patch and figure out the best way to 
integrate it with the changes I've been doing.


I disagree with the removal of "generation". My intention is to make the 
commit-graph feature a standard feature that most repos (of reasonable 
size) want to have. In that case, 'graph_pos' and 'generation' will be 
set during every parse and used by every commit walk. This is just my 
gut reaction.


As I investigate these changes, I'll try to see what performance hit is 
caused by converting the graph_pos and/or generation to commit slabs. 
(Again, I'm assuming the slabs will make this slower. I may be wrong here.)
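For reference, the commit-slab pattern being weighed here looks roughly like
this (the slab name and payload type are illustrative, not from any patch):

    #include "commit.h"
    #include "commit-slab.h"

    /*
     * define_commit_slab() generates struct generation_slab along with
     * init_generation_slab(), generation_slab_at() and clear_generation_slab().
     */
    define_commit_slab(generation_slab, uint32_t);
    static struct generation_slab generations;

    static void record_generation(struct commit *c, uint32_t gen)
    {
    	/* the _at() accessor allocates the slot on first access;
    	 * init_generation_slab(&generations) must have run once before */
    	*generation_slab_at(&generations, c) = gen;
    }

The open question is exactly the cost of that extra indirection per lookup
compared with a field that travels with the struct.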


Thanks,
-Stolee


Re: Implementing reftable in Git

2018-05-09 Thread Derrick Stolee

On 5/9/2018 10:33 AM, Christian Couder wrote:

Hi,

I might start working on implementing reftable in Git soon.

During the last Git Merge conference last March Stefan talked about
reftable. In Alex Vandiver's notes [1] it is asked that people
announce it on the list when they start working on it, and it appears
that there is a reference implementation in JGit.


Thanks for starting on this! In addition to the performance gains, this 
will keep a lot of users on case-insensitive file systems from getting 
case errors on refnames.



Looking it up, there is indeed some documentation [2], code [3], tests
[4] and other related stuff [5] in the JGit repo. It looks like the
JGit repo and the reftable code there are licensed under the Eclipse
Distribution License - v 1.0 [7] which is very similar to the 3-Clause
BSD License also called Modified BSD License which is GPL compatible
according to gnu.org [9]. So from a quick look it appears that I
should be able to port the JGit to Git if I just keep the copyright
and license header comments in all the related files.

So I think the most straightforward and compatible way to do it would
be to port the JGit implementation.

Thanks in advance for any suggestion or comment about this.

Reftable was first described by Shawn and then discussed last July on
the list [6].


The hope is that such a direct port should be possible, but someone else 
should comment on the porting process.


This is also something that could be created independently based on the 
documentation you mention. I was planning to attempt that during a 
hackathon in July, but I'm happy you are able to start earlier (and that 
you are announcing your intentions). I would be happy to review your 
patch series, so please keep me posted.


Thanks,
-Stolee


Re: [PATCH 1/1] commit-graph: fix UX issue when .lock file exists

2018-05-09 Thread Derrick Stolee

On 5/9/2018 10:42 AM, Jeff King wrote:

On Wed, May 09, 2018 at 02:15:38PM +, Derrick Stolee wrote:


The commit-graph file lives in the .git/objects/info directory.
Previously, a failure to acquire the commit-graph.lock file was
assumed to be due to the lack of the info directory, so a mkdir()
was called. This gave incorrect messaging if instead the lockfile
was open by another process:

   "fatal: cannot mkdir .git/objects/info: File exists"

Rearrange the expectations of this directory existing to avoid
this error, and instead show the typical message when a lockfile
already exists.

Sounds like a reasonable bug fix.

Your cover letter is way longer than this description. Should some of
that background perhaps go in the commit message?


I did want a place to include the full die() message in the new 
behavior, but that seemed like overkill for the commit message.



(I would go so far as to say that sending a cover letter for a single
patch is an anti-pattern, because the commit message should be able to
stand on its own).


@@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir,
 	compute_generation_numbers(&commits);
 
 	graph_name = get_commit_graph_filename(obj_dir);
-	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-	if (fd < 0) {
-		struct strbuf folder = STRBUF_INIT;
-		strbuf_addstr(&folder, graph_name);
-		strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
-
-		if (mkdir(folder.buf, 0777) < 0)
-			die_errno(_("cannot mkdir %s"), folder.buf);
-		strbuf_release(&folder);
-
-		fd = hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
-
-		if (fd < 0)
-			die_errno("unable to create '%s'", graph_name);
-	}
+	strbuf_addstr(&folder, graph_name);
+	strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
+	if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0)
+		die_errno(_("cannot mkdir %s"), folder.buf);
+	strbuf_release(&folder);

The result is racy if somebody else is trying to create the directory at
the same time. For that you'd want to notice EEXIST coming from mkdir
and consider that a success.

I think you probably ought to be calling adjust_shared_perm() on the
result, too, in case core.sharedrepository is configured.

If you use safe_create_leading_directories(), it should handle both.
Something like:

   if (safe_create_leading_directories(graph_name))
	die_errno(_("unable to create leading directories of %s"),
		  graph_name);

I think I'm holding it right; that function is a little short on
documentation, but it's the standard way to do this in Git's codebase,
and you can find lots of example callers.


Thanks for this method. I was unfamiliar with it. Saves the effort of 
creating the strbuf, too.


Thanks,
-Stolee


[PATCH 0/1] Fix UX issue with commit-graph.lock

2018-05-09 Thread Derrick Stolee
We use the lockfile API to prevent multiple Git processes from writing to
the commit-graph file in the .git/objects/info directory. In some cases,
this directory may not exist, so we check for its existence.

The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/objects/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving
an error that doesn't help the user:

  "fatal: cannot mkdir .git/objects/info: File exists"

While technically this honors the lockfile, it does not help the user.

Instead, do the following:

1. Check for existence of .git/objects/info; create if necessary.
2. Try to acquire the lock, failing if necessary.

The new output looks like:

  fatal: Unable to create '/home/stolee/git/.git/objects/info/commit-graph.lock': File exists.

  Another git process seems to be running in this repository, e.g.
  an editor opened by 'git commit'. Please make sure all processes
  are terminated then try again. If it still fails, a git process
  may have crashed in this repository earlier:
  remove the file manually to continue.

This patch is based on ds/generation-numbers

Derrick Stolee (1):
  commit-graph: fix UX issue when .lock file exists

 commit-graph.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)


base-commit: 7547b95b4fbb8591726b1d9381c176cc27fc6aea
-- 
2.17.0.39.g685157f7fb



[PATCH 1/1] commit-graph: fix UX issue when .lock file exists

2018-05-09 Thread Derrick Stolee
The commit-graph file lives in the .git/objects/info directory.
Previously, a failure to acquire the commit-graph.lock file was
assumed to be due to the lack of the info directory, so a mkdir()
was called. This gave incorrect messaging if instead the lockfile
was open by another process:

  "fatal: cannot mkdir .git/objects/info: File exists"

Rearrange the expectations of this directory existing to avoid
this error, and instead show the typical message when a lockfile
already exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..8399194da1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "config.h"
+#include "dir.h"
 #include "git-compat-util.h"
 #include "lockfile.h"
 #include "pack.h"
@@ -640,13 +641,13 @@ void write_commit_graph(const char *obj_dir,
struct hashfile *f;
uint32_t i, count_distinct = 0;
char *graph_name;
-   int fd;
struct lock_file lk = LOCK_INIT;
uint32_t chunk_ids[5];
uint64_t chunk_offsets[5];
int num_chunks;
int num_extra_edges;
struct commit_list *parent;
+   struct strbuf folder = STRBUF_INIT;
 
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
@@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir,
 	compute_generation_numbers(&commits);
 
 	graph_name = get_commit_graph_filename(obj_dir);
-	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-	if (fd < 0) {
-		struct strbuf folder = STRBUF_INIT;
-		strbuf_addstr(&folder, graph_name);
-		strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
-
-		if (mkdir(folder.buf, 0777) < 0)
-			die_errno(_("cannot mkdir %s"), folder.buf);
-		strbuf_release(&folder);
-
-		fd = hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
-
-		if (fd < 0)
-			die_errno("unable to create '%s'", graph_name);
-	}
+	strbuf_addstr(&folder, graph_name);
+	strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
+	if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0)
+		die_errno(_("cannot mkdir %s"), folder.buf);
+	strbuf_release(&folder);
 
+	hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
hashwrite_be32(f, GRAPH_SIGNATURE);
-- 
2.17.0.39.g685157f7fb



Re: [PATCH 0/1] Fix UX issue with commit-graph.lock

2018-05-10 Thread Derrick Stolee

On 5/10/2018 3:00 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


We use the lockfile API to prevent multiple Git processes from writing to
the commit-graph file in the .git/objects/info directory. In some cases,
this directory may not exist, so we check for its existence.

The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/objects/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving
an error that doesn't help the user:

   "fatal: cannot mkdir .git/objects/info: File exists"

Isn't a better immediate fix to make the second step pay attention
to errno?  If mkdir() failed due to EEXIST, then we know we tried to
acquire the lock already, so we can die with an appropriate message.

That way, we can keep the "optimize for the normal case" approach that
assumes the object/info/ directory is already there, instead of always
checking beforehand for its existence, which is almost always true.


This "optimize for the normal case" is why the existing code is 
organized the way it is.


Since this code is only for writing a commit-graph file, this "check the 
directory first" option is a very small portion of the full time to 
write the file, so the "optimization" has very little effect, 
relatively. My personal opinion is to make the code cleaner when the 
performance difference is negligible.


I'm willing to concede this point and use the steps you suggest below, 
if we think this is the best way forward.



Also, can't we tell why we failed to acquire the lock at step #1?
Do we only get a NULL that says "I am not telling you why, but we
failed to lock"?


To tell why we failed to acquire the lock, we could inspect "errno". 
However, this requires whitebox knowledge of both the lockfile API and 
the tempfile API to know that the last system call to set errno was an 
open() or adjust_shared_perm(). To cleanly make decisions based on the 
reason the lock failed to acquire, I think we would need to modify the 
lockfile and tempfile APIs to return a failure reason. This could be 
done by passing an `int *reason`, but the extra noise in these APIs is 
likely not worth the change.




What I am getting at is that the ideal sequence
would be more like:
 1. Try to acquire the lock.
 2-a. if #1 succeeds, we are happy. ignore the rest and return
  the lock.
 2-b. if #1 failed because object/info/ did not exist,
  mkdir() it, and die if we cannot, saying "cannot mkdir".
 if mkdir() succeeds, jump to 3.
 2-c. if #1 failed but that is not due to missing object/info/,
 die saying "cannot lock".
 3. Try to acquire the lock.
 4-a. if #3 succeeds, we are happy. ignore the rest and return
  the lock.
 4-b. die saying "cannot lock".
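A sketch of that sequence against the lockfile API could look like this
(assuming the failed open() leaves ENOENT in errno when objects/info is
missing, which is exactly the whitebox detail discussed above; `lk`,
`graph_name` and `folder` as in the patch):

    	fd = hold_lock_file_for_update(&lk, graph_name, 0);
    	if (fd < 0) {
    		/* 2-c: the failure was not a missing directory */
    		if (errno != ENOENT)
    			die_errno(_("cannot lock %s"), graph_name);
    		/* 2-b: create objects/info, tolerating a racing mkdir */
    		if (mkdir(folder.buf, 0777) < 0 && errno != EEXIST)
    			die_errno(_("cannot mkdir %s"), folder.buf);
    		/* 3 and 4-b: retry, dying with the usual lockfile message */
    		hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
    	}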



Thanks,
-Stolee


Re: [PATCH 2/2] replace-object.c: remove the_repository from prepare_replace_object

2018-05-10 Thread Derrick Stolee

On 5/10/2018 7:56 AM, Jeff King wrote:

On Thu, May 10, 2018 at 07:23:13PM +0900, Junio C Hamano wrote:


This one was doing

ptr = xmalloc(sizeof(*another_ptr))

and it was OK because ptr and another_ptr happened to be of the same
type.  I wonder if we are making it safer, or making it more obscure
to seasoned C programmers, if we introduced a pair of helper macros,
perhaps like these:

#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr)))
#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr)))

I've often wondered that, too. It's the natural endgame of the
ALLOC_ARRAY() road we've been going down.


I'll chime in that I like this idea.

Because I'm trying to learn more about Coccinelle, I added the diff 
below and ran 'make coccicheck'. It resulted in a 1000-line patch on 
'next'. I'll refrain from sending that patch series, but it's nice to 
know a "simple" transformation is possible.


Using `git grep` I see 230 instances of 'xmalloc' and 261 instances of 
'xcalloc'. After the Coccinelle transformation, these are down to 194 
and 190, respectively, because the rest allocate on the same line as the 
definition. It's worth thinking about the macro pattern for those cases.


diff --git a/contrib/coccinelle/alloc.cocci b/contrib/coccinelle/alloc.cocci
new file mode 100644
index 00..95f426c4dc
--- /dev/null
+++ b/contrib/coccinelle/alloc.cocci
@@ -0,0 +1,13 @@
+@@
+expression p;
+@@
+- p = xmalloc(sizeof(*p));
++ ALLOCATE(p);
+
+@@
+expression c;
+expression p;
+@@
+- p = xcalloc(c, sizeof(*p));
++ CALLOCATE(p,c);
+
diff --git a/git-compat-util.h b/git-compat-util.h
index f9e4c5f9bc..5398f217d7 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -843,6 +843,9 @@ extern FILE *fopen_or_warn(const char *path, const char *mode);

  */
 #define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

+#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr)))
+#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr)))
+
 #define ALLOC_ARRAY(x, alloc) (x) = xmalloc(st_mult(sizeof(*(x)), (alloc)))
 #define REALLOC_ARRAY(x, alloc) (x) = xrealloc((x), st_mult(sizeof(*(x)), (alloc)))
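For example, the rule rewrites a typical call site like this (illustrative
variable, not from the patch):

    	struct commit_list *item;

    	/* before */
    	item = xmalloc(sizeof(*item));

    	/* after 'make coccicheck' applies the rule above */
    	ALLOCATE(item);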




[PATCH 10/12] gc: automatically write commit-graph files

2018-05-10 Thread Derrick Stolee
The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to turn this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 6 ++
 Documentation/git-gc.txt | 4 
 builtin/gc.c | 8 
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
Make `git gc --auto` return immediately and run in background
if the system supports it. Default is true.
 
+gc.commitGraph::
+   If true, then gc will rewrite the commit-graph file after any
+   change to the object database. If '--auto' is used, then the
+   commit-graph will not be updated unless the threshold is met.
+   See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
If the file gc.log exists, then `git gc --auto` won't run
unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..8403445738 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", _depth);
git_config_get_int("gc.auto", _auto_threshold);
git_config_get_int("gc.autopacklimit", _auto_pack_limit);
+   git_config_get_bool("gc.commitgraph", _commit_graph);
git_config_get_bool("gc.autodetach", _auto);
git_config_get_expiry("gc.pruneexpire", _expire);
git_config_get_expiry("gc.worktreepruneexpire", 
_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_pushl(, "prune", "--expire", NULL);
argv_array_pushl(_worktrees, "worktree", "prune", "--expire", 
NULL);
argv_array_pushl(, "rerere", "gc", NULL);
+   argv_array_pushl(_graph, "commit-graph", "write", "--reachable", 
NULL);
 
/* default expiry time, overwritten in gc_config */
gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_commit_graph)
+   if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+   return error(FAILED_RUN, commit_graph.argv[0]);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
-- 
2.16.2.329.gfb62395de6



[PATCH 09/12] commit-graph: add '--reachable' option

2018-05-10 Thread Derrick Stolee
When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 ++--
 builtin/commit-graph.c | 41 ++
 t/t5318-commit-graph.sh| 10 ++
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 1daefa7fb1..fbab3feba1 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stdin-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f5d891f2b8..4c9328cdf2 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph verify [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int reachable;
int stdin_packs;
int stdin_commits;
int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+struct hex_list {
+   char **hex_strs;
+   int hex_nr;
+   int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+  const struct object_id *oid,
+  int flags, void *cb_data)
+{
+   struct hex_list *list = (struct hex_list*)cb_data;
+
+   ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+   list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+   strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+   list->hex_nr++;
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "reachable", ,
+   N_("start walk at all refs")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
-   if (opts.stdin_packs && opts.stdin_commits)
-   die(_("cannot use both --stdin-commits and --stdin-packs"));
+   if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+   die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));
if (!opts.o

[PATCH 05/12] commit: force commit to parse from object database

2018-05-10 Thread Derrick Stolee
In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Most consumers expect that if a commit exists in the commit-graph, then
the commit was loaded from the commit-graph so values such as generation
numbers are loaded. Hence, this method should not be called unless the
intention is explicit in avoiding commits from the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 13 +
 commit.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..7c92350373 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
 			error("Could not read %s",
-			      oid_to_hex(&item->object.oid));
+			oid_to_hex(&item->object.oid));
 	if (type != OBJ_COMMIT) {
 		free(buffer);
 		return error("Object %s not a commit",
-			     oid_to_hex(&item->object.oid));
+			oid_to_hex(&item->object.oid));
}
ret = parse_commit_buffer(item, buffer, size, 0);
if (save_commit_buffer && !ret) {
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6



[PATCH 02/12] commit-graph: verify file header information

2018-05-10 Thread Derrick Stolee
During a run of 'git commit-graph verify', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'. Some header
information is checked as part of load_commit_graph_one().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b25aaed128..c3b8716c14 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -818,7 +818,28 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
 }
 
+static int verify_commit_graph_error;
+#define graph_report(...) \
+   do {\
+   verify_commit_graph_error = 1;\
+   printf(__VA_ARGS__);\
+   } while (0);
+
 int verify_commit_graph(struct commit_graph *g)
 {
-   return !g;
+   if (!g) {
+   graph_report(_("no commit-graph file loaded"));
+   return 1;
+   }
+
+   verify_commit_graph_error = 0;
+
+   if (!g->chunk_oid_fanout)
+   graph_report(_("commit-graph is missing the OID Fanout chunk"));
+   if (!g->chunk_oid_lookup)
+   graph_report(_("commit-graph is missing the OID Lookup chunk"));
+   if (!g->chunk_commit_data)
+   graph_report(_("commit-graph is missing the Commit Data 
chunk"));
+
+   return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6



[PATCH 01/12] commit-graph: add 'verify' subcommand

2018-05-10 Thread Derrick Stolee
In case the commit-graph file becomes corrupt, we need a way to
verify its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  6 ++
 builtin/commit-graph.c | 38 ++
 commit-graph.c |  5 +
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 61 insertions(+)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 4c97b555cc..1daefa7fb1 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git commit-graph read' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to verify for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..f5d891f2b8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,11 +7,17 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph verify [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+   N_("git commit-graph verify [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_verify_options[] = {
+   OPT_STRING(0, "object-dir", &opts.obj_dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_verify_options,
+builtin_commit_graph_verify_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   return 0;
+   FREE_AND_NULL(graph_name);
+
+   return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "verify"))
+   return graph_verify(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..b25aaed128 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -817,3 +817,8 @@ void write_commit_graph(const char *obj_dir,
oids.alloc = 0;
oids.nr = 0;
 }
+
+int verify_commit_graph(struct commit_graph *g)
+{
+   return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..71a39c5a57 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
int nr_commits,
int append);
 
+int verify_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..6ca451dfd2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -11,6 +11,11 @@ test_expect_success 'setup full repo

[PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch

2018-05-10 Thread Derrick Stolee
The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Thanks, Jakub, for the feedback on the RFC. I think there are still some
things to decide at a high-level before we dig too far into the review.
Specifically, the integration points in 'fsck', 'gc', and 'fetch' are
worth considering our alternatives.

For 'fsck', the current integration is minimal: if core.commitGraph is
true, then 'git fsck' calls 'git commit-graph verify --object-dir=X'
for the objects directory and every alternate. There are a few options
to consider here:

1. Keep this behavior: we should always check the commit-graph if it
   exists.

2. Add a --[no-]commit-graph argument to 'fsck' that toggles the
   commit-graph verification.

3. Remove all direct integration between 'fsck' and 'commit-graph' and
   instead rely on users checking 'git commit-graph verify' manually
   when they suspect a problem with the commit-graph file. While this
   option is worth considering, it is my least favorite since it requires
   more from users.

For 'gc' and 'fetch', these seemed like natural places to update the
commit-graph file. Relative to the other maintenance that occurs during
these commands, the 'git commit-graph write' command is fast, especially
for incremental updates; only the "new" commits are walked when computing
generation numbers and other metadata for the commit-graph file.

The behavior in this patch series does the following:

1. Near the end of 'git gc', run 'git commit-graph write'. The location
   of this code assumes that a 'git gc --auto' has not terminated early
   due to not meeting the auto threshold.

2. At the end of 'git fetch', run 'git commit-graph write'. This means
   that every reachable commit will be in the commit-graph after a
   successful fetch, which seems a reasonable frequency. Then, the
   only times we would be missing a reachable commit is after creating
   one locally. There is a problem with the current patch, though: every
   'git fetch' call runs 'git commit-graph write', even if there were no
   ref updates or objects downloaded. Is there a simple way to detect if
   the fetch was non-trivial?

One obvious problem with this approach: if we compute this during 'gc'
AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers
two commit-graph writes. If I were to abandon one of these patches, it
would be the 'fetch' integration. A 'git gc' really wants to delete all
references to unreachable commits, and without updating the commit-graph
we may still have commit data in the commit-graph file that is not in
the object database. In fact, deleting commits from the object database
but not from the commit-graph will cause 'git commit-graph verify' to
fail!

I welcome discussion on these ideas, as we are venturing out of the
"pure data structure" world and into the "user experience" world. I am
less confident in my skills in this world, but the feature is worthless
if it does not improve the user experience.

Thanks,
-Stolee

Derrick Stolee (12):

Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
are ready for full, rigorous review.

  commit-graph: add 'verify' subcommand
  commit-graph: verify file header information
  commit-graph: parse commit from chosen graph
  commit-graph: verify fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb

Commit 08 integrates 'git commit-graph verify' into fsck. The work here
is the minimum integration possible. (See above discussion of options.)

  fsck: verify commit-graph

Commit 09 introduces a new '--reachable' option only to make the calls
from 'gc' and 'fetch' simpler. Commits 10-11 integrate writing the
commit-graph into 'gc' and 'fetch', respectively. (See above disucssion.)

  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  fetch: compute commit-graph by default

Commit 12 simply deletes sections from the "Future Work" section
of the commit-graph design document.

  commit-graph: update design document

 Documentation/config.txt |  10 ++
 Documentation/git-commit-graph.txt   |  14 ++-
 Documentation/git-fsck.txt   |   3 +
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |  22 
 builtin/commit-graph.c   |  79 +-
 builtin/fetch.c  |  13 +++
 builtin/fsck.c   |  21 
 builtin/gc.c 

[PATCH 11/12] fetch: compute commit-graph by default

2018-05-10 Thread Derrick Stolee
During a call to 'git fetch', we expect new commits and updated refs.
Use these updated refs to add the new commits to the commit-graph file,
automatically providing performance benefits in other calls.

Use 'fetch.commitGraph' config setting to enable or disable this
behavior. Defaults to false while the commit-graph feature matures.
Specifically, we do not want this on by default until the commit-graph
feature integrates with history-modifying features such as shallow
clones.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt |  4 
 builtin/fetch.c  | 13 +
 2 files changed, 17 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9a3abd87e7..3d8225600a 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1409,6 +1409,10 @@ fetch.output::
`full` and `compact`. Default value is `full`. See section
OUTPUT in linkgit:git-fetch[1] for detail.
 
+fetch.commitGraph::
+   If true, fetch will automatically update the commit-graph file.
+   See linkgit:git-commit-graph[1].
+
 format.attach::
Enable multipart/mixed attachments as the default for
'format-patch'.  The value can also be a double quoted string
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 8ee998ea2e..254f6ecfb6 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -38,6 +38,7 @@ enum {
 static int fetch_prune_config = -1; /* unspecified */
 static int prune = -1; /* unspecified */
 #define PRUNE_BY_DEFAULT 0 /* do we prune by default? */
+static int fetch_commit_graph = 0;
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok, verbosity, deepen_relative;
 static int progress = -1;
@@ -66,6 +67,11 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
return 0;
}
 
+   if (!strcmp(k, "fetch.commitGraph")) {
+   fetch_commit_graph = git_config_bool(k, v);
+   return 0;
+   }
+
if (!strcmp(k, "submodule.recurse")) {
int r = git_config_bool(k, v) ?
RECURSE_SUBMODULES_ON : RECURSE_SUBMODULES_OFF;
@@ -1462,6 +1468,13 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
result = fetch_multiple();
}
 
+   if (!result && fetch_commit_graph) {
+   struct argv_array commit_graph = ARGV_ARRAY_INIT;
+   argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL);
+   if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+   result = 1;
+   }
+
if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
struct argv_array options = ARGV_ARRAY_INIT;
 
-- 
2.16.2.329.gfb62395de6



[PATCH 08/12] fsck: verify commit-graph

2018-05-10 Thread Derrick Stolee
If core.commitGraph is true, verify the contents of the commit-graph
during 'git fsck' using the 'git commit-graph verify' subcommand. Run
this check on all alternates, as well.

We use a new process for two reasons:

1. The subcommand decouples the details of loading and verifying a
   commit-graph file from the other fsck details.

2. The commit-graph verification requires the commits to be loaded
   in a specific order to guarantee we parse from the commit-graph
   file for some objects and from the object database for others.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-fsck.txt |  3 +++
 builtin/fsck.c | 21 +
 t/t5318-commit-graph.sh|  5 +
 3 files changed, 29 insertions(+)

diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
index b9f060e3b2..ab9a93fb9b 100644
--- a/Documentation/git-fsck.txt
+++ b/Documentation/git-fsck.txt
@@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives
 (i.e., you can just remove them and do an 'rsync' with some other site in
 the hopes that somebody else has the object you have corrupted).
 
+If core.commitGraph is true, the commit-graph file will also be inspected
+using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
+
 Extracted Diagnostics
 -
 
diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..a6d5045b77 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
 
check_connectivity();
+
+	if (core_commit_graph) {
+		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
+		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL };
+		commit_graph_verify.argv = verify_argv;
+		commit_graph_verify.git_cmd = 1;
+
+		if (run_command(&commit_graph_verify))
+			errors_found |= ERROR_COMMIT_GRAPH;
+
+		prepare_alt_odb();
+		for (alt = alt_odb_list; alt; alt = alt->next) {
+			verify_argv[2] = "--object-dir";
+			verify_argv[3] = alt->path;
+			if (run_command(&commit_graph_verify))
+				errors_found |= ERROR_COMMIT_GRAPH;
+		}
+	}
+
return errors_found;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..9680e6e6e0 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -240,4 +240,9 @@ test_expect_success 'git commit-graph verify' '
git commit-graph verify >output
 '
 
+test_expect_success 'git fsck (checks commit-graph)' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git fsck
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH 07/12] commit-graph: verify commit contents against odb

2018-05-10 Thread Derrick Stolee
When running 'git commit-graph verify', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the
object IDs to guarantee we parse from the commit-graph file.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.
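
For reference, the invariant being checked is the generation number
definition from the design document, ignoring the special zero/infinity
values the format reserves. A minimal sketch (not the patch's code;
'expected_generation' is a hypothetical helper):

    /* gen(C) = 1 + max(gen(P)) over parents P; 1 if C has no parents */
    static uint32_t expected_generation(const struct commit *c)
    {
            const struct commit_list *p;
            uint32_t max_gen = 0;

            for (p = c->parents; p; p = p->next)
                    if (p->item->generation > max_gen)
                            max_gen = p->item->generation;
            return max_gen + 1;
    }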

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 88 ++
 1 file changed, 88 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 8c636abba9..24f5031f3e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -863,6 +863,8 @@ int verify_commit_graph(struct commit_graph *g)
 		graph_report(_("commit-graph is missing the Commit Data chunk"));
 
for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
 		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -880,6 +882,10 @@ int verify_commit_graph(struct commit_graph *g)
 
cur_fanout_pos++;
}
+
+		graph_commit = lookup_commit(&cur_oid);
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report(_("failed to parse %s from commit-graph"), 
oid_to_hex(_oid));
}
 
while (cur_fanout_pos < 256) {
@@ -892,5 +898,87 @@ int verify_commit_graph(struct commit_graph *g)
cur_fanout_pos++;
}
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
+   int num_parents = 0;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		graph_commit = lookup_commit(&cur_oid);
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0))
+			graph_report(_("failed to parse %s from object database"),
+				     oid_to_hex(&cur_oid));
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"),
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""),
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
+
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			num_parents++;
+
+			if (odb_parents == NULL)
+				graph_report(_("commit-graph parent list for commit %s is too long (%d)"),
+					     oid_to_hex(&cur_oid),
+					     num_parents);
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report(_("commit-graph parent for %s is %s != %s"),
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+					     oid_to_hex(&odb_parents->item->object.oid));
+
+			graph_parents = graph_parents->next;
+			odb_parents = odb_parents->next;
+		}
+
+		if (odb_parents != NULL)
+			graph_report(_("commit-graph parent list for commit %s terminates early"),
+				     oid_to_hex(&cur_oid));
+
+   if (graph_commit->generation) {
+   uint32_t max_generation = 0;
+   graph_parents = graph_commit->parents;
+
+   while (graph_parents) {
+			if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
+			    graph_parents->item->generation == GENE

[PATCH 04/12] commit-graph: verify fanout and lookup table

2018-05-10 Thread Derrick Stolee
While running 'git commit-graph verify', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.
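
For context: the OID Fanout chunk stores 256 big-endian counts, where
fanout[b] is the number of OIDs whose first byte is at most b, so
fanout[255] must equal the commit count. A sketch of the relation being
verified (not the patch's code; 'fanout_range' is a hypothetical helper):

    /* OIDs with first byte b occupy lookup positions
     * fanout[b-1] .. fanout[b]-1 */
    static void fanout_range(struct commit_graph *g, unsigned char b,
                             uint32_t *start, uint32_t *end)
    {
            *start = b ? get_be32(g->chunk_oid_fanout + b - 1) : 0;
            *end = get_be32(g->chunk_oid_fanout + b); /* exclusive */
    }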

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index ce11af1d20..b4c146c423 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -839,6 +839,9 @@ static int verify_commit_graph_error;
 
 int verify_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report(_("no commit-graph file loaded"));
return 1;
@@ -853,5 +856,35 @@ int verify_commit_graph(struct commit_graph *g)
if (!g->chunk_commit_data)
graph_report(_("commit-graph is missing the Commit Data 
chunk"));
 
+   for (i = 0; i < g->num_commits; i++) {
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report(_("commit-graph has incorrect oid order: %s then %s"),
+
+				     oid_to_hex(&prev_oid),
+				     oid_to_hex(&cur_oid));
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+	}
+
+   while (cur_fanout_pos < 256) {
+		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+		if (g->num_commits != fanout_value)
+			graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
+				     cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+
return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6



[PATCH 12/12] commit-graph: update design document

2018-05-10 Thread Derrick Stolee
The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Also remove the section on lazy-loading trees, as that was completed
in an earlier patch series.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 22 --
 1 file changed, 22 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index e1a883eb46..c664acbd76 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -130,25 +127,6 @@ Future Work
 - 'log --topo-order'
 - 'tag --merged'
 
-- Currently, parse_commit_gently() requires filling in the root tree
-  object for a commit. This passes through lookup_tree() and consequently
-  lookup_object(). Also, it calls lookup_commit() when loading the parents.
-  These method calls check the ODB for object existence, even if the
-  consumer does not need the content. For example, we do not need the
-  tree contents when computing merge bases. Now that commit parsing is
-  removed from the computation time, these lookup operations are the
-  slowest operations keeping graph walks from being fast. Consider
-  loading these objects without verifying their existence in the ODB and
-  only loading them fully when consumers need them. Consider a method
-  such as "ensure_tree_loaded(commit)" that fully loads a tree before
-  using commit->tree.
-
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.16.2.329.gfb62395de6



[PATCH 03/12] commit-graph: parse commit from chosen graph

2018-05-10 Thread Derrick Stolee
Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.
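
A sketch of the intended caller (hypothetical here, but the 'verify'
patches later in the series use exactly this shape):

    struct commit *commit = lookup_commit(&cur_oid);
    if (!parse_commit_in_graph_one(g, commit))
            graph_report(_("failed to parse %s from commit-graph"),
                         oid_to_hex(&cur_oid));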

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c3b8716c14..ce11af1d20 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
uint32_t pos;
 
@@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item)
return 0;
if (item->object.parsed)
return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+   if (!core_commit_graph)
+   return 0;
+
prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.16.2.329.gfb62395de6



[PATCH 06/12] commit-graph: load a root tree from specific graph

2018-05-10 Thread Derrick Stolee
When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b4c146c423..8c636abba9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -357,14 +357,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+const struct commit *c)
 {
if (c->maybe_tree)
return c->maybe_tree;
if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
 
-   return load_tree_for_commit(commit_graph, (struct commit *)c);
+   return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.16.2.329.gfb62395de6



[PATCH v2] commit-graph: fix UX issue when .lock file exists

2018-05-10 Thread Derrick Stolee
We use the lockfile API to prevent multiple Git processes from writing to
the commit-graph file in the .git/objects/info directory. In some cases,
this directory may not exist, so we check for its existence.

The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/objects/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving
an error that doesn't help the user:

  "fatal: cannot mkdir .git/objects/info: File exists"

While technically this honors the lockfile, it does not help the user.

Instead, do the following:

1. Check for existence of .git/objects/info; create if necessary.
2. Try to acquire the lock, failing if necessary.

The new output looks like:

  fatal: Unable to create
  '/.git/objects/info/commit-graph.lock': File exists.

  Another git process seems to be running in this repository, e.g.
  an editor opened by 'git commit'. Please make sure all processes
  are terminated then try again. If it still fails, a git process
  may have crashed in this repository earlier:
  remove the file manually to continue.
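
A minimal way to see the difference (a sketch; it assumes a stale lock
left behind by a crashed process):

    touch .git/objects/info/commit-graph.lock
    git commit-graph write
    # before: "fatal: cannot mkdir .git/objects/info: File exists"
    # after:  the lockfile advice quoted above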

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..bb54c1214c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "config.h"
+#include "dir.h"
 #include "git-compat-util.h"
 #include "lockfile.h"
 #include "pack.h"
@@ -640,7 +641,6 @@ void write_commit_graph(const char *obj_dir,
struct hashfile *f;
uint32_t i, count_distinct = 0;
char *graph_name;
-   int fd;
struct lock_file lk = LOCK_INIT;
uint32_t chunk_ids[5];
uint64_t chunk_offsets[5];
@@ -754,23 +754,11 @@ void write_commit_graph(const char *obj_dir,
 	compute_generation_numbers(&commits);
 
graph_name = get_commit_graph_filename(obj_dir);
-	fd = hold_lock_file_for_update(&lk, graph_name, 0);
-
-   if (fd < 0) {
-   struct strbuf folder = STRBUF_INIT;
-		strbuf_addstr(&folder, graph_name);
-		strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
-
-   if (mkdir(folder.buf, 0777) < 0)
-   die_errno(_("cannot mkdir %s"), folder.buf);
-		strbuf_release(&folder);
-
-		fd = hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
-
-   if (fd < 0)
-   die_errno("unable to create '%s'", graph_name);
-   }
+   if (safe_create_leading_directories(graph_name))
+   die_errno(_("unable to create leading directories of %s"),
+ graph_name);
 
+	hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
hashwrite_be32(f, GRAPH_SIGNATURE);

base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6



Re: [PATCH 04/12] commit-graph: verify fanout and lookup table

2018-05-11 Thread Derrick Stolee

On 5/10/2018 2:29 PM, Martin Ågren wrote:

On 10 May 2018 at 19:34, Derrick Stolee <dsto...@microsoft.com> wrote:

While running 'git commit-graph verify', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 33 +
  1 file changed, 33 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index ce11af1d20..b4c146c423 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -839,6 +839,9 @@ static int verify_commit_graph_error;

  int verify_commit_graph(struct commit_graph *g)
  {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
 if (!g) {
 graph_report(_("no commit-graph file loaded"));
 return 1;
@@ -853,5 +856,35 @@ int verify_commit_graph(struct commit_graph *g)
 if (!g->chunk_commit_data)
 graph_report(_("commit-graph is missing the Commit Data 
chunk"));

+   for (i = 0; i < g->num_commits; i++) {
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
+   graph_report(_("commit-graph has incorrect oid order: %s then %s"),

Minor: I think our style would prefer s/i > 0/i/.

I suppose the second check should be s/>=/>/, but it's not like it
should matter. ;-)


It shouldn't, but a duplicate commit is still an incorrect oid order.


I wonder if this is a message that would virtually never make sense to a
user, but more to a developer. Leave it untranslated to make sure any
bug reports to the list are readable to us?


Yeah, I'll make all of the errors untranslated since they are for 
developer debugging, not end-user information.



+
+   oid_to_hex(&prev_oid),
+   oid_to_hex(&cur_oid));

Hmm, these two lines do not actually achieve anything?


It's a formatting bug: They complete the statement starting with 
'graph_report(' above.
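
For reference, the statement was presumably intended to be formatted as
a single call, without the stray blank continuation line:

    if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
            graph_report(_("commit-graph has incorrect oid order: %s then %s"),
                         oid_to_hex(&prev_oid),
                         oid_to_hex(&cur_oid));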





+   oidcpy(&prev_oid, &cur_oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report(_("commit-graph has incorrect fanout 
value: fanout[%d] = %u != %u"),
+cur_fanout_pos, fanout_value, i);

Same though re `_()`, even more so because of the more technical
notation.


+
+   cur_fanout_pos++;
+   }
+   }
+
+   while (cur_fanout_pos < 256) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+   if (g->num_commits != fanout_value)
+   graph_report(_("commit-graph has incorrect fanout value: 
fanout[%d] = %u != %u"),
+cur_fanout_pos, fanout_value, i);

Same here. Or maybe these should just give a translated user-readable
basic idea of what is wrong and skip the details?


As someone who is responsible for probably inserting these problems, I 
think the details are important.


Thanks,
-Stolee


Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch

2018-05-11 Thread Derrick Stolee

On 5/10/2018 4:45 PM, Martin Ågren wrote:

On 10 May 2018 at 21:22, Stefan Beller  wrote:

On Thu, May 10, 2018 at 12:05 PM, Martin Ågren  wrote:

I hope to find time to do some more hands-on testing of this to see that
errors actually do get caught.

Packfiles and loose objects are primary data, which means that those
need a more advanced way to diagnose and repair them, so I would imagine
the commit graph fsck is closer to bitmaps fsck, which I would have suspected
to be found in t5310, but a quick read doesn't reveal many tests that are
checking for integrity. So I guess the test coverage here is ok (although we
should always ask for more)

Since I'm wrapping up for today, I'm posting some simple tests that I
assembled. The last of these showcases one or two problems with the
current error-reporting. Depending on the error, there can be *lots* of
errors reported and there are no new-lines, so the result on stdout can
be a wall of not-very-legible text.

Some of these might not make sense. I just started going through the
documentation on the format, causing some sort of corruption in each
field. Maybe this can be helpful somehow.

Martin

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 82f95eb11f..a7e48db2de 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -255,4 +255,49 @@ test_expect_success 'git fsck (checks commit-graph)' '
git fsck
  '
  
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+   file=$1
+   pos=$2
+   data="${3:-\0}"
+   printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'detect bad signature' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 0 "\0" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep "graph signature" err
+'
+
+test_expect_success 'detect bad version number' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 4 "\02" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep "graph version" err
+'
+
+test_expect_success 'detect bad hash version' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 5 "\02" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep "hash version" err
+'
+
+test_expect_success 'detect too small chunk-count' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 6 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   cat err
+'
+
+
  test_done



Martin: thank you so much for these test examples, and for running them 
to find out about the newline issue. This is really helpful.


Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch

2018-05-11 Thread Derrick Stolee

On 5/10/2018 3:22 PM, Stefan Beller wrote:

On Thu, May 10, 2018 at 12:05 PM, Martin Ågren <martin.ag...@gmail.com> wrote:

On 10 May 2018 at 19:34, Derrick Stolee <dsto...@microsoft.com> wrote:


Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
are ready for full, rigorous review.

I don't know about "full" and "rigorous", but I tried to offer my thoughts.

A lingering feeling I have is that users could possibly benefit from
seeing "the commit-graph has a bad foo" a bit more than just "the
commit-graph is bad". But adding "the bar is baz, should have been
frotz" might not bring that much. Maybe you could keep the translatable
string somewhat simple, then, if the more technical data could be useful
to Git developers, dump it in a non-translated format. (I guess it could
be hidden behind a debug switch, but let's take one step at a time..)
This is nothing I feel strongly about.


  t/t5318-commit-graph.sh  |  25 +

I wonder about tests. Some tests seem to use `dd` to corrupt files and
check that it gets caught. Doing this in a hash-agnostic way could be
tricky, but maybe not impossible. I guess we could do something
probabilistic, like "set the first two bytes of the very last OID to
zero -- surely all OIDs can't start with 16 zero bits". Hmm, that might
still require knowing the size of the OIDs...

I hope to find time to do some more hands-on testing of this to see that
errors actually do get caught.

Given that the commit graph is secondary data, the user's workaround
to quickly get back to a well-working repository is most likely to remove
the file and regenerate it.
For a developer who wants to fix the bug, a stacktrace/data dump and the
history of git commands might be most valuable, but I agree we should
hide that behind a debug flag.

Packfiles and loose objects are primary data, which means that those
need a more advanced way to diagnose and repair them, so I would imagine
the commit graph fsck is closer to bitmaps fsck, which I would have suspected
to be found in t5310, but a quick read doesn't reveal many tests that are
checking for integrity. So I guess the test coverage here is ok, (although we
should always ask for more)


My main goal is to help developers figure out what is wrong with a file, 
and then we can use other diagnostic tools to discover how it got into 
that state.


Martin's initial test cases are wonderful. I've adapted them to test the 
other conditions in the verify_commit_graph() method and caught some 
interesting behavior in the process. I'm preparing v2 so we can 
investigate the direction of the tests.


Thanks,
-Stolee


Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch

2018-05-11 Thread Derrick Stolee

On 5/10/2018 3:17 PM, Ævar Arnfjörð Bjarmason wrote:

On Thu, May 10 2018, Derrick Stolee wrote:


The behavior in this patch series does the following:

1. Near the end of 'git gc', run 'git commit-graph write'. The location
of this code assumes that a 'git gc --auto' has not terminated early
due to not meeting the auto threshold.

2. At the end of 'git fetch', run 'git commit-graph write'. This means
that every reachable commit will be in the commit-graph after a
successful fetch, which seems a reasonable frequency. Then, the
only times we would be missing a reachable commit is after creating
one locally. There is a problem with the current patch, though: every
'git fetch' call runs 'git commit-graph write', even if there were no
ref updates or objects downloaded. Is there a simple way to detect if
the fetch was non-trivial?

One obvious problem with this approach: if we compute this during 'gc'
AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers
two commit-graph writes. If I were to abandon one of these patches, it
would be the 'fetch' integration. A 'git gc' really wants to delete all
references to unreachable commits, and without updating the commit-graph
we may still have commit data in the commit-graph file that is not in
the object database. In fact, deleting commits from the object database
but not from the commit-graph will cause 'git commit-graph verify' to
fail!

I welcome discussion on these ideas, as we are venturing out of the
"pure data structure" world and into the "user experience" world. I am
less confident in my skills in this world, but the feature is worthless
if it does not improve the user experience.

I really like #1 here, but I wonder why #2 is necessary.

I.e. is it critical for the performance of the commit graph feature that
it be kept really up-to-date, moreso than other things that rely on gc
--auto (e.g. the optional bitmap index)?


It is not critical. The feature is designed to tolerate recent commits
missing from the file. For simplicity, it is probably best to limit ourselves
to writing after a non-trivial 'gc'.



Even if that's the case, I think something that does this via gc --auto
is a much better option. I.e. now we have gc.auto & gc.autoPackLimit, if
the answer to my question above is "yes" this could also be accomplished
by introducing a new graph-specific gc.* setting, and --auto would just
update the graph more often, but leave the rest.


This is an excellent idea for a follow-up series, if we find we want the 
commit-graph written more frequently. For now, I'm satisfied with one 
place where it is automatically computed.


I'll drop the fetch integration in my v2 series.

Thanks,
-Stolee


[PATCH v2 02/12] commit-graph: verify file header information

2018-05-11 Thread Derrick Stolee
During a run of 'git commit-graph verify', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'. Some header
information is checked as part of load_commit_graph_one().
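
For illustration, a hypothetical run against a file with a damaged table
of contents might report (the exact lines depend on which chunks fail to
load):

    $ git commit-graph verify
    commit-graph is missing the OID Fanout chunk
    commit-graph is missing the OID Lookup chunk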

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b25aaed128..d2db20e49a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -818,7 +818,37 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
 }
 
+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+   va_list ap;
+   struct strbuf sb = STRBUF_INIT;
+   verify_commit_graph_error = 1;
+
+   va_start(ap, fmt);
+	strbuf_vaddf(&sb, fmt, ap);
+
+   fprintf(stderr, "%s\n", sb.buf);
+	strbuf_release(&sb);
+   va_end(ap);
+}
+
 int verify_commit_graph(struct commit_graph *g)
 {
-   return !g;
+   if (!g) {
+   graph_report("no commit-graph file loaded");
+   return 1;
+   }
+
+   verify_commit_graph_error = 0;
+
+   if (!g->chunk_oid_fanout)
+   graph_report("commit-graph is missing the OID Fanout chunk");
+   if (!g->chunk_oid_lookup)
+   graph_report("commit-graph is missing the OID Lookup chunk");
+   if (!g->chunk_commit_data)
+   graph_report("commit-graph is missing the Commit Data chunk");
+
+   return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6



[PATCH v2 01/12] commit-graph: add 'verify' subcommand

2018-05-11 Thread Derrick Stolee
If the commit-graph file becomes corrupt, we need a way to verify
that its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.
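
Expected behavior in a nutshell (a sketch, not actual test output):

    $ git commit-graph verify    # no commit-graph file: exit 0, silent
    $ git commit-graph write
    $ git commit-graph verify    # healthy file: exit 0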

During review, we noticed that a FREE_AND_NULL(graph_name) was
placed after a possible 'return', and this pattern was also in
graph_read(). Fix that case, too.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  6 ++
 builtin/commit-graph.c | 40 +-
 commit-graph.c |  5 +
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh| 10 ++
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git commit-graph read' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..af3101291f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
+   N_("git commit-graph verify [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+   N_("git commit-graph verify [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+	struct commit_graph *graph = NULL;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_verify_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_verify_options,
+builtin_commit_graph_verify_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);
+
+   if (!graph)
+   return 0;
+
+   return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -50,10 +86,10 @@ static int graph_read(int argc, const char **argv)
 
graph_name = get_commit_graph_filename(opts.obj_dir);
graph = load_commit_graph_one(graph_name);
+   FREE_AND_NULL(graph_name);
 
if (!graph)
die("graph file %s does not exist", graph_name);
-   FREE_AND_NULL(graph_name);
 
printf("header: %08x %d %d %d %d\n",
ntohl(*(uint32_t*)graph->data),
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "verify"))
+   return graph_verify(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..b25aaed128 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -817,3 +817,8 @@ void write_commit_graph(const char *obj_dir,
oids.alloc = 0;
 	oids.nr = 0;
 }
+
+int verify_commit_graph(struct commit_graph *g)
+{
+	return !g;
+}

[PATCH v2 03/12] commit-graph: test that 'verify' finds corruption

2018-05-11 Thread Derrick Stolee
Add test cases to t5318-commit-graph.sh that corrupt the commit-graph
file and check that the 'git commit-graph verify' command fails. These
tests verify the header and chunk information is checked carefully.
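
For reviewers, a rough map of the offsets used below, following the
header layout in the commit-graph format document:

    # bytes 0-3: signature "CGPH"  -> 'bad signature' test
    # byte 4:    file version (1)  -> 'bad version number' test
    # byte 5:    hash version (1)  -> 'bad hash version' test
    # byte 6:    number of chunks  -> 'too small chunk-count' test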

Helped-by: Martin Ågren <martin.ag...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/t5318-commit-graph.sh | 53 +
 1 file changed, 53 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..0cb88232fa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -240,4 +240,57 @@ test_expect_success 'git commit-graph verify' '
git commit-graph verify >output
 '
 
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+   file=$1
+   pos=$2
+   data="${3:-\0}"
+   printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'detect bad signature' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 0 "\0" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "graph signature" verify-errors
+'
+
+test_expect_success 'detect bad version number' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 4 "\02" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "graph version" verify-errors
+'
+
+test_expect_success 'detect bad hash version' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 5 "\02" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "hash version" verify-errors
+'
+
+test_expect_success 'detect too small chunk-count' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 6 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 2 verify-errors &&
+   grep "missing the OID Lookup chunk" verify-errors &&
+   grep "missing the Commit Data chunk" verify-errors
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v2 00/12] Integrate commit-graph into fsck and gc

2018-05-11 Thread Derrick Stolee
I'm sending this v2 re-roll rather quickly after the previous version
because Martin provided a framework to add tests to the 'verify'
subcommand. I took that framework and added tests for the other checks
of the commit-graph data. This also found some interesting things about
the command:

1. There were some segfaults because we were not checking for bad data
   carefully enough.

2. To avoid segfaults, we will now terminate the check early if we find
   problems earlier in the file, such as in the header or the OID lookup.

3. We were not writing newlines between reports. This now happens by
   default in graph_report().

The integration into 'fetch' is dropped (thanks Ævar!).

Derrick Stolee (12):
  commit-graph: add 'verify' subcommand
  commit-graph: verify file header information
  commit-graph: test that 'verify' finds corruption
  commit-graph: parse commit from chosen graph
  commit-graph: verify fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: verify commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/config.txt |   6 +
 Documentation/git-commit-graph.txt   |  14 ++-
 Documentation/git-fsck.txt   |   3 +
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |  22 
 builtin/commit-graph.c   |  81 -
 builtin/fsck.c   |  21 
 builtin/gc.c |   8 ++
 commit-graph.c   | 199 ++-
 commit-graph.h   |   2 +
 commit.c |  13 +-
 commit.h |   1 +
 t/t5318-commit-graph.sh  | 173 +++
 13 files changed, 509 insertions(+), 38 deletions(-)


base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6



[PATCH v2 06/12] commit: force commit to parse from object database

2018-05-11 Thread Derrick Stolee
In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.
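
A sketch of the intended use, mirroring the later 'verify commit
contents against odb' patch:

    /* bypass the commit-graph so we read the object database copy */
    if (parse_commit_internal(odb_commit, 0, 0))
            graph_report("failed to parse %s from object database",
                         oid_to_hex(&cur_oid));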

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 13 +
 commit.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..7c92350373 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-			      oid_to_hex(&item->object.oid));
+			oid_to_hex(&item->object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-			     oid_to_hex(&item->object.oid));
+			oid_to_hex(&item->object.oid));
}
ret = parse_commit_buffer(item, buffer, size, 0);
if (save_commit_buffer && !ret) {
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
 struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6



[PATCH v2 05/12] commit-graph: verify fanout and lookup table

2018-05-11 Thread Derrick Stolee
While running 'git commit-graph verify', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

Add tests that check these corruptions are caught by the verify
subcommand. Most of the tests check the full output matches the exact
error we inserted, but since our OID order test triggers incorrect
fanout values (with possibly different numbers of output lines), we
only check that the correct error is written in that case.
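
A rough map of where the corrupted offsets land, assuming the v1 format
with SHA-1 and the three standard chunks (8-byte header, four 12-byte
table-of-contents entries, a 256-entry fanout of 4-byte values, then
20-byte OIDs); exact boundaries depend on the chunk count:

    # byte 25   -> inside a table-of-contents chunk offset
    # byte 128  -> inside the OID Fanout chunk
    # byte 1272 -> inside the OID Lookup chunk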

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c  | 36 
 t/t5318-commit-graph.sh | 31 +++
 2 files changed, 67 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 4dfff7e752..b0fd1d5320 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -848,6 +848,9 @@ static void graph_report(const char *fmt, ...)
 
 int verify_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report("no commit-graph file loaded");
return 1;
@@ -862,5 +865,38 @@ int verify_commit_graph(struct commit_graph *g)
if (!g->chunk_commit_data)
graph_report("commit-graph is missing the Commit Data chunk");
 
+   if (verify_commit_graph_error)
+   return 1;
+
+   for (i = 0; i < g->num_commits; i++) {
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report("commit-graph has incorrect oid order: %s then %s",
+				     oid_to_hex(&prev_oid),
+				     oid_to_hex(&cur_oid));
+
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+	}
+
+	while (cur_fanout_pos < 256) {
+		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+		if (g->num_commits != fanout_value)
+			graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+				     cur_fanout_pos, fanout_value, i);
+
+		cur_fanout_pos++;
+	}
+
return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 0cb88232fa..6fb306b0da 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -293,4 +293,35 @@ test_expect_success 'detect too small chunk-count' '
grep "missing the Commit Data chunk" verify-errors
 '
 
+test_expect_success 'detect incorrect chunk lookup value' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 25 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "improper chunk offset" verify-errors
+'
+
+test_expect_success 'detect incorrect fanout' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 128 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep -v "^\+" err > verify-errors &&
+   test_line_count = 1 verify-errors &&
+   grep "fanout value" verify-errors
+'
+
+test_expect_success 'detect incorrect OID order' '
+   cd "$TRASH_DIRECTORY/full" &&
+   cp $objdir/info/commit-graph commit-graph-backup &&
+   test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+   corrupt_data $objdir/info/commit-graph 1272 "\01" &&
+   test_must_fail git commit-graph verify 2>err &&
+   grep "incorrect oid order" err
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6



[PATCH v2 04/12] commit-graph: parse commit from chosen graph

2018-05-11 Thread Derrick Stolee
Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2db20e49a..4dfff7e752 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
uint32_t pos;
 
@@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item)
return 0;
if (item->object.parsed)
return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+   if (!core_commit_graph)
+   return 0;
+
prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.16.2.329.gfb62395de6


