[PATCH 0/3] Improve abbreviation disambiguation
Hello,

My name is Derrick Stolee and I just switched teams at Microsoft from
the VSTS Git Server to work on performance improvements in core Git.
This is my first patch submission, and I look forward to your feedback.

Thanks,
Stolee

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A performance helper `test-abbrev` and performance test
`p0008-abbrev.sh` are added to demonstrate this performance
improvement. Here are some performance results for the three included
commits, using GIT_PERF_REPEAT_COUNT=10 since the first test is
frequently an outlier due to the file cache being cold.

Running git on a Linux VM, we see the following gains.

| Repo    | Pack-Files | Loose Objs | Baseline | Patch 2 | Patch 3 |
|---------|------------|------------|----------|---------|---------|
| Git.git |          1 |          0 |   0.46 s |  -87.0% |  -87.0% |
| Git.git |          5 |          0 |   1.04 s |  -84.6% |  -85.6% |
| Git.git |          4 |      75852 |   0.88 s |  -86.4% |  -86.4% |
| Linux   |          1 |          0 |   0.63 s |  -38.1% |  -69.8% |
| Linux   |         24 |          0 |   5.41 s |  -69.3% |  -71.5% |
| Linux   |         23 |     323441 |   5.41 s |  -70.6% |  -73.4% |

Running a similar patch on Git for Windows, we see the following gains.
| Repo          | Pack-Files | Loose | Baseline | Patch 2 | Patch 3 |
|---------------|------------|-------|----------|---------|---------|
| GitForWindows |          6 |   319 |   7.19 s |  -91.1% |  -91.5% |
| VSTS          |          3 |    38 |   7.83 s |  -88.9% |  -90.9% |
| Linux         |          3 |     0 |   7.92 s |  -87.9% |  -90.2% |
| Windows       |         50 |   219 |  17.8 s  |  -98.6% |  -98.6% |

Note that performance improves in all cases, but the performance gain
is larger when there are multiple, large pack-files. This gain comes
from the lack of in-memory caching of index files that have already
been inspected.

Derrick Stolee (3):
  sha1_name: Create perf test for find_unique_abbrev()
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix

 Makefile               |  1 +
 sha1_name.c            | 66 ++++++++++++++++++++++++++++++-----------
 t/helper/.gitignore    |  1 +
 t/helper/test-abbrev.c | 22 ++++++++++++++
 t/perf/p0008-abbrev.sh | 12 ++++++++
 5 files changed, 87 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH 2/3] sha1_name: Unroll len loop in find_unique_abbrev_r
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length. Instead, inspect each object
that collides with the estimated abbreviation to find the longest
common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
         return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+        unsigned int init_len;
+        unsigned int cur_len;
+        char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-        int status, exists;
+        struct min_abbrev_data *mad = cb_data;
+
+        char *hex = oid_to_hex(oid);
+        unsigned int i = mad->init_len;
+        while (mad->hex[i] && mad->hex[i] == hex[i])
+                i++;
+
+        if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+                mad->cur_len = i + 1;
+
+        return 0;
+}
+
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+        struct disambiguate_state ds;
+        struct min_abbrev_data mad;
+        struct object_id oid_ret;
         if (len < 0) {
                 unsigned long count = approximate_object_count();
                 /*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
         sha1_to_hex_r(hex, sha1);
         if (len == GIT_SHA1_HEXSZ || !len)
                 return GIT_SHA1_HEXSZ;
-        exists = has_sha1_file(sha1);
-        while (len < GIT_SHA1_HEXSZ) {
-                struct object_id oid_ret;
-                status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-                if (exists
-                    ? !status
-                    : status == SHORT_NAME_NOT_FOUND) {
-                        hex[len] = 0;
-                        return len;
-                }
-                len++;
-        }
-        return len;
+
+        if (init_object_disambiguation(hex, len, &ds) < 0)
+                return -1;
+
+        mad.init_len = len;
+        mad.cur_len = len;
+        mad.hex = hex;
+
+        ds.fn = extend_abbrev_len;
+        ds.always_call_fn = 1;
+        ds.cb_data = (void*)&mad;
+
+        find_short_object_filename(&ds);
+        find_short_packed_object(&ds);
+        (void)finish_object_disambiguation(&ds, &oid_ret);
+
+        hex[mad.cur_len] = 0;
+        return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty
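The callback logic above can be distilled into a standalone sketch. This is not git's API: the candidate list, function name, and parameters are illustrative stand-ins for the disambiguate_state machinery, showing how a single pass over colliding objects yields the minimum unambiguous length.

```c
#include <assert.h>
#include <string.h>

/*
 * Sketch of the extend_abbrev_len() idea: for each candidate hex id
 * that collides with the target's prefix, extend the required length
 * to one past the longest common prefix with the target.
 */
static size_t min_abbrev_len(const char *target,
                             const char *const *candidates,
                             size_t n, size_t start_len)
{
        size_t cur_len = start_len;
        size_t c;

        for (c = 0; c < n; c++) {
                size_t i = start_len;

                /* only candidates sharing the initial prefix collide */
                if (strncmp(target, candidates[c], start_len))
                        continue;
                /* the object itself is not a collision */
                if (!strcmp(candidates[c], target))
                        continue;
                while (target[i] && target[i] == candidates[c][i])
                        i++;
                if (i >= cur_len)
                        cur_len = i + 1;
        }
        return cur_len;
}

/* Example candidate set (hypothetical short hex ids). */
static const char *const example_cands[] = {
        "1234abcd", "1234abce", "1299"
};
```

With the two colliding candidates above, a target of "1234abcf" needs eight characters; a target colliding with nothing keeps the starting length.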
[PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for 100,000 object ids. The ids are created
by iterating an unsigned int hash_base by a constant hash_delta and
copying hash_base five times across the sha1. Iterating by hash_delta
does not create a duplicate value for over 10,000,000 iterations.

test-abbrev demonstrates the performance improvements that will be
shown by later improvements to find_unique_abbrev(). The value of
100,000 is large enough to show the significance of the later
improvements while only taking a few seconds on large repos.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile               |  1 +
 t/helper/.gitignore    |  1 +
 t/helper/test-abbrev.c | 23 +++++++++++++++++++++++
 t/perf/p0008-abbrev.sh | 12 ++++++++++++
 4 files changed, 37 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index f2bb7f2f6..081ca05e8 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..80ce7d836 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 000000000..cb3551df9
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,23 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+        setup_git_directory();
+
+        unsigned int hash_delt = 0x13579BDF;
+        unsigned int hash_base = 0x01020304;
+        struct object_id oid;
+
+        int i, count = 0;
+        int n = sizeof(struct object_id) / sizeof(int);
+        while (count++ < 100000) {
+                for (i = 0; i < n; i++)
+                        ((unsigned int*)oid.hash)[i] = hash_base;
+
+                find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+
+                hash_base += hash_delt;
+        }
+
+        exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 000000000..7c3fad807
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_perf 'find_unique_abbrev()' '
+        test-abbrev
+'
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty
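The id-generation scheme can be checked in isolation. This sketch uses a plain 20-byte array standing in for struct object_id and a hypothetical helper name; like the patch, it assumes a 4-byte unsigned int (a point raised later in the review), but uses the memcpy form rather than the pointer cast.

```c
#include <assert.h>
#include <string.h>

#define RAWSZ 20 /* SHA-1 raw size */

/*
 * Illustrative stand-in for the helper's loop body: copy the 4-byte
 * hash_base across the 20-byte id, five times. Assumes
 * sizeof(unsigned int) == 4, as the patch under review does.
 */
static void fill_test_oid(unsigned char hash[RAWSZ], unsigned int base)
{
        int i;

        for (i = 0; i < RAWSZ / (int)sizeof(unsigned int); i++)
                memcpy(hash + i * sizeof(unsigned int), &base, sizeof(base));
}

/* The filled id should be the same 4-byte pattern, repeated. */
static int repeats_ok(unsigned int base)
{
        unsigned char h[RAWSZ];

        fill_test_oid(h, base);
        return memcmp(h, h + 4, RAWSZ - 4) == 0;
}
```

Advancing `base` by the constant 0x13579BDF between calls then reproduces the sequence of "random" ids the helper feeds to find_unique_abbrev().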
[PATCH 3/3] sha1_name: Parse less while finding common prefix
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

This change decreases the time to run test-abbrev by up to 40% on
large repos.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
         char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+        static const char hex[] = "0123456789abcdef";
+
+        if ((i & 1) == 0)
+                return hex[oid->hash[i >> 1] >> 4];
+        else
+                return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
         struct min_abbrev_data *mad = cb_data;
 
-        char *hex = oid_to_hex(oid);
         unsigned int i = mad->init_len;
-        while (mad->hex[i] && mad->hex[i] == hex[i])
+        while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
                 i++;
 
         if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty
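The nibble arithmetic can be verified with a self-contained copy of the function; only the struct and the function name here are local stand-ins for git's types.

```c
#include <assert.h>

/* Local stand-in for git's struct object_id (20-byte SHA-1). */
struct toy_oid {
        unsigned char hash[20];
};

/*
 * Same arithmetic as the patch: even positions take the high nibble,
 * odd positions the low nibble, of byte i/2.
 */
static char hex_char_at(const struct toy_oid *oid, int i)
{
        static const char hex[] = "0123456789abcdef";

        if ((i & 1) == 0)
                return hex[oid->hash[i >> 1] >> 4];
        else
                return hex[oid->hash[i >> 1] & 0xf];
}

/* Bytes 0x1a 0x2b 0xc3 should read back as "1a2bc3". */
static const struct toy_oid example_oid = { { 0x1a, 0x2b, 0xc3 } };
```

Reading characters on demand this way avoids formatting all 40 hex digits via oid_to_hex() when the comparison usually stops after a handful of them.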
Re: [PATCH 1/3] sha1_name: Create perf test for find_unique_abbrev()
On 9/17/2017 5:51 PM, Junio C Hamano wrote:
> Derrick Stolee <dsto...@microsoft.com> writes:
>
>> +int cmd_main(int ac, const char **av)
>> +{
>> +        setup_git_directory();
>
> As far as I recall, we do not (yet) allow declaration after statement
> in our codebase. Move this down to make it after all decls.

Will fix.

>> +        unsigned int hash_delt = 0x13579BDF;
>> +        unsigned int hash_base = 0x01020304;
>> +        struct object_id oid;
>> +
>> +        int i, count = 0;
>> +        int n = sizeof(struct object_id) / sizeof(int);
>
> It probably is technically OK to assume sizeof(int) always equals
> sizeof(unsigned), but because you use 'n' _only_ to work with uint
> and never with int, it would make more sense to match.

Will fix. I also notice that "n" should be const.

> But I do not think we want this "clever" optimization that involves
> 'n' in the first place.
>
>> +        while (count++ < 100000) {
>> +                for (i = 0; i < n; i++)
>> +                        ((unsigned int*)oid.hash)[i] = hash_base;
>
> Does it make sense to assume that uint is always 4-byte (so this code
> won't work if it is 8-byte on your platform) and doing this is faster
> than using platform-optimized memcpy()?

I'm not sure what you mean by using memcpy to improve this, because it
would require calling memcpy in the inner loop, such as:

        for (i = 0; i < n; i++)
                memcpy(oid.hash + i * sizeof(unsigned), &hash_base,
                       sizeof(unsigned));

I'm probably misunderstanding your intended use of memcpy().

>> +                find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
>> +
>> +                hash_base += hash_delt;
>> +        }
>
> I also wonder if this is measuring the right thing. I am guessing
> that by making many queries for a unique abbreviation of "random"
> (well, it is deterministic, but my point is these queries are not
> based on the names of objects that exist in the repository) hashes,
> this test measures how much time it takes for us to decide that such
> a random hash does not exist.
> In the real life use, we make the opposite query far more frequently:
> we have an object that we _know_ exists in the repository and we try
> to find a sufficient length to disambiguate it from others, and I
> suspect that these two use different logic. Don't you need to be
> measuring the time it takes to compute the shortest abbreviation of
> an object that exists instead?

First, this doesn't just measure the time it takes to determine
non-existence, because it finds the len required to disambiguate an
"incoming" hash from all known hashes. When testing, I put in a simple
printf to report the resulting abbreviation so I could see how often it
needed to be extended. In this sense, the test exposes the while loop
that is removed by PATCH 2/3.

Second, your criticism about extant hashes is valid, and one I
struggled to reconcile. I see two issues with testing known hashes:

1. By determining the hash exists, we have inspected the file that
   contains it (either a .idx or the loose-object directory). This has
   side-effects that warm up the file cache, so the looped method is
   artificially faster at finding matching objects. The effect is
   particularly significant on a repo with exactly one packfile.

2. If we iterate over the entire set of objects, this test takes
   O(N*t(N)) time, where t(N) is the average time to compute a minimum
   abbreviation. For large repos, this can take several minutes.
   Instead, even with the constant 100,000 test hashes, we have an
   O(t(N)) test. We could avoid the asymptotic growth by limiting the
   number of existing hashes we use, but how do we find a sufficiently
   uniform sample from them?

By looking at some other perf tests, I see that we can add a
prerequisite action. I will investigate making another helper that
uniformly selects a set of hashes from the repo and writes them to
stdout in a random order. p0008-abbrev.sh will run the helper as a
prerequisite, saving the data to a file. test-abbrev will read the
hashes from stdin to test find_unique_abbrev.
This should avoid the side-effects I mentioned (test-abbrev will not
warm up indexes) while also testing abbreviation lengths for existing
objects.

I'll get started on these changes and send a new patch with new perf
data in a couple of days. Please let me know if there is a better way
to measure performance for this change.

Thanks,
-Stolee
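As a sketch of the uniform-sampling idea discussed above (not the eventual helper — the names, the int "items", and the fixed-seed RNG are placeholders), reservoir sampling gives a uniform k-subset in a single pass without knowing the object count up front:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Reservoir sampling: after processing i+1 items, each item remains in
 * the k-slot reservoir with probability k/(i+1). The real helper would
 * stream object ids; here the "items" are just ints.
 */
static size_t sample_uniform(const int *items, size_t n, int *out, size_t k)
{
        size_t i, filled = 0;

        for (i = 0; i < n; i++) {
                if (filled < k) {
                        out[filled++] = items[i];
                } else {
                        size_t j = (size_t)rand() % (i + 1);
                        if (j < k)
                                out[j] = items[i];
                }
        }
        return filled; /* min(n, k) */
}

/* Sanity check: sampling 10 of 100 items yields 10 in-range values. */
static int sample_count_ok(void)
{
        int items[100], out[10];
        size_t i, got;

        for (i = 0; i < 100; i++)
                items[i] = (int)i;
        srand(42); /* deterministic seed for the sketch */
        got = sample_uniform(items, 100, out, 10);
        if (got != 10)
                return 0;
        for (i = 0; i < 10; i++)
                if (out[i] < 0 || out[i] >= 100)
                        return 0;
        return 1;
}
```

Writing the sample out in random order, as proposed, then avoids any correlation between test order and pack-index order.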
Re: [PATCH] cleanup: fix possible overflow errors in binary search
On 10/6/2017 10:18 AM, Jeff King wrote:
> On Fri, Oct 06, 2017 at 09:52:31AM -0400, Derrick Stolee wrote:
>
>> A common mistake when writing binary search is to allow possible
>> integer overflow by using the simple average:
>>
>>         mid = (min + max) / 2;
>>
>> Instead, use the overflow-safe version:
>>
>>         mid = min + (max - min) / 2;
>
> Great, thank you for picking this up!
>
>> The included changes were found using the following two greps:
>>
>>         grep "/ 2;" *.c
>>         grep "/ 2;" */*.c
>>         grep "/2;" */*.c
>
> You can use [1]:
>
>         git grep '/ 2;' '*.c'
>
> to have Git expand the wildcard. That catches a few extra cases in
> compat/regex/*.c. Even though it's imported code, it might be nice to
> cover those, too (since it's a possible bug, and also as a good
> example).
>
> [1] I'd actually write:
>
>         git grep '/ *2;' '*.c'
>
> to do it all in one grep. :)

Thanks for the grep lesson! I knew there would be a simpler way to do
what I wanted.

>> ---
>>  builtin/index-pack.c     | 4 ++--
>>  builtin/pack-objects.c   | 2 +-
>>  builtin/unpack-objects.c | 2 +-
>>  cache-tree.c             | 2 +-
>>  packfile.c               | 2 +-
>>  sha1-lookup.c            | 2 +-
>>  sha1_name.c              | 2 +-
>>  string-list.c            | 2 +-
>>  utf8.c                   | 2 +-
>>  xdiff/xpatience.c        | 2 +-
>>  10 files changed, 11 insertions(+), 11 deletions(-)
>
> These all look good to me (really the only way the conversion could be
> bad is if "min" was higher than "max", and each case is just inside a
> loop condition which makes sure that is not the case).
>
> -Peff

I thought this should be simple enough. When we are all happy with the
idea of this cleanup, I'll re-roll with the missed examples.

Thanks,
-Stolee
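The failure mode is easy to demonstrate with unsigned arithmetic, where wraparound is well-defined (signed overflow is undefined behavior, so this sketch uses uint32_t): with large enough bounds, the naive midpoint wraps while the subtraction form stays in range.

```c
#include <assert.h>
#include <stdint.h>

/* Naive midpoint: lo + hi may wrap around for large indices. */
static uint32_t mid_naive(uint32_t lo, uint32_t hi)
{
        return (lo + hi) / 2;
}

/* Overflow-safe midpoint: hi - lo never wraps when lo <= hi. */
static uint32_t mid_safe(uint32_t lo, uint32_t hi)
{
        return lo + (hi - lo) / 2;
}
```

For small bounds the two agree; near the top of the type's range only the safe form gives the true midpoint, which is exactly the situation a binary search over a huge pack-index can hit.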
[PATCH] cleanup: fix possible overflow errors in binary search
A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

        mid = (min + max) / 2;

Instead, use the overflow-safe version:

        mid = min + (max - min) / 2;

The included changes were found using the following two greps:

        grep "/ 2;" *.c
        grep "/ 2;" */*.c
        grep "/2;" */*.c

Making this cleanup will prevent future review friction when a new
binary search is constructed based on existing code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c     | 4 ++--
 builtin/pack-objects.c   | 2 +-
 builtin/unpack-objects.c | 2 +-
 cache-tree.c             | 2 +-
 packfile.c               | 2 +-
 sha1-lookup.c            | 2 +-
 sha1_name.c              | 2 +-
 string-list.c            | 2 +-
 utf8.c                   | 2 +-
 xdiff/xpatience.c        | 2 +-
 10 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..8ec459f52 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -633,7 +633,7 @@ static int find_ofs_delta(const off_t offset, enum object_type type)
         int first = 0, last = nr_ofs_deltas;
 
         while (first < last) {
-                int next = (first + last) / 2;
+                int next = first + (last - first) / 2;
                 struct ofs_delta_entry *delta = &ofs_deltas[next];
                 int cmp;
 
@@ -687,7 +687,7 @@ static int find_ref_delta(const unsigned char *sha1, enum object_type type)
         int first = 0, last = nr_ref_deltas;
 
         while (first < last) {
-                int next = (first + last) / 2;
+                int next = first + (last - first) / 2;
                 struct ref_delta_entry *delta = &ref_deltas[next];
                 int cmp;
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee2c48ff..6e77dfd44 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1277,7 +1277,7 @@ static int done_pbase_path_pos(unsigned hash)
         int lo = 0;
         int hi = done_pbase_paths_num;
         while (lo < hi) {
-                int mi = (hi + lo) / 2;
+                int mi = lo + (hi - lo) / 2;
                 if (done_pbase_paths[mi] == hash)
                         return mi;
                 if (done_pbase_paths[mi] < hash)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 689a29fac..62ea264c4 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -394,7 +394,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
         lo = 0;
         hi = nr;
         while (lo < hi) {
-                mid = (lo + hi)/2;
+                mid = lo + (hi - lo) / 2;
                 if (base_offset < obj_list[mid].offset) {
                         hi = mid;
                 } else if (base_offset > obj_list[mid].offset) {
diff --git a/cache-tree.c b/cache-tree.c
index 71d092ed5..d3f740127 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -49,7 +49,7 @@ static int subtree_pos(struct cache_tree *it, const char *path, int pathlen)
         lo = 0;
         hi = it->subtree_nr;
         while (lo < hi) {
-                int mi = (lo + hi) / 2;
+                int mi = lo + (hi - lo) / 2;
                 struct cache_tree_sub *mdl = it->down[mi];
                 int cmp = subtree_name_cmp(path, pathlen,
                                            mdl->name, mdl->namelen);
diff --git a/packfile.c b/packfile.c
index eab754248..4a5fe7ab1 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1743,7 +1743,7 @@ off_t find_pack_entry_one(const unsigned char *sha1,
                         sha1[0], sha1[1], sha1[2], lo, hi, p->num_objects);
 
         while (lo < hi) {
-                unsigned mi = (lo + hi) / 2;
+                unsigned mi = lo + (hi - lo) / 2;
                 int cmp = hashcmp(index + mi * stride, sha1);
 
                 if (debug_lookup)
diff --git a/sha1-lookup.c b/sha1-lookup.c
index 2552b7902..df08f8355 100644
--- a/sha1-lookup.c
+++ b/sha1-lookup.c
@@ -95,7 +95,7 @@ int sha1_pos(const unsigned char *sha1, void *table, size_t nr,
                         hi = mi;
                 else
                         lo = mi + 1;
-                mi = (hi + lo) / 2;
+                mi = lo + (hi - lo) / 2;
         } while (lo < hi);
         return -lo-1;
 }
diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..c7c5ab376 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -157,7 +157,7 @@ static void unique_in_pack(struct packed_git *p,
         num = p->num_objects;
         last = num;
         while (first < last) {
-                uint32_t mid = (first + last) / 2;
+                uint32_t mid = first + (last - first) / 2;
                 const unsigned char *current;
                 int cmp;
 
diff --git a/string-list.c b/string-list.c
index 806b4c872..a0cf0cfe8 100644
--- a/
[PATCH v4 1/4] p4211-line-log.sh: add log --oneline --raw --parents perf test
Add a new perf test for testing the performance of log while computing
OID abbreviations. Using --oneline --raw and --parents options
maximizes the number of OIDs to abbreviate while still spending some
time computing diffs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/perf/p4211-line-log.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index b7ff68d4f..e0ed05907 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -31,4 +31,8 @@ test_perf 'git log -L (renames on)' '
         git log -M -L 1:"$file" >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents' '
+        git log --oneline --raw --parents >/dev/null
+'
+
 test_done
-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v4 3/4] sha1_name: parse less while finding common prefix
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..5081aeb71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
         char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+                                         int pos)
+{
+        static const char hex[] = "0123456789abcdef";
+
+        if ((pos & 1) == 0)
+                return hex[oid->hash[pos >> 1] >> 4];
+        else
+                return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
         struct min_abbrev_data *mad = cb_data;
 
-        char *hex = oid_to_hex(oid);
         unsigned int i = mad->init_len;
-        while (mad->hex[i] && mad->hex[i] == hex[i])
+        while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
                 i++;
 
         if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index.
The object before and immediately after (or the one at the insertion
position) give the maximum common prefix. No subsequent linear search
is required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the
OID used for the binary search is padded with zeroes so the object
will not exist in the repo (with high probability) and the same logic
applies.

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test
4211.6 from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD     | Rel%  |
|-------|--------|----------|----------|-------|
|     1 |      0 |  41.27 s |  38.93 s | -4.8% |
|    24 |      0 |  98.04 s |  91.35 s | -5.7% |
|    23 | 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..49ba67955 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
         unsigned int init_len;
         unsigned int cur_len;
         char *hex;
+        const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
         return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+                                     struct min_abbrev_data *mad)
+{
+        int match = 0;
+        uint32_t num, last, first = 0;
+        struct object_id oid;
+
+        open_pack_index(p);
+        num = p->num_objects;
+        last = num;
+        while (first < last) {
+                uint32_t mid = first + (last - first) / 2;
+                const unsigned char *current;
+                int cmp;
+
+                current = nth_packed_object_sha1(p, mid);
+                cmp = hashcmp(mad->hash, current);
+                if (!cmp) {
+                        match = 1;
+                        first = mid;
+                        break;
+                }
+                if (cmp > 0) {
+                        first = mid + 1;
+                        continue;
+                }
+                last = mid;
+        }
+
+        /*
+         * first is now the position in the packfile where we would insert
+         * mad->hash if it does not exist (or the position of mad->hash if
+         * it does exist). Hence, we consider a maximum of three objects
+         * nearby for the abbreviation length.
+         */
+        mad->init_len = 0;
+        if (!match) {
+                nth_packed_object_oid(&oid, p, first);
+                extend_abbrev_len(&oid, mad);
+        } else if (first < num - 1) {
+                nth_packed_object_oid(&oid, p, first + 1);
+                extend_abbrev_len(&oid, mad);
+        }
+        if (first > 0) {
+                nth_packed_object_oid(&oid, p, first - 1);
+                extend_abbrev_len(&oid, mad);
+        }
+        mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+        struct packed_git *p;
+
+        prepare_packed_git();
+        for (p = packed_git; p; p = p->next)
+                find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
         struct disambiguate_state ds;
@@ -536,19 +596,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
         if (len == GIT_SHA1_HEXSZ || !len)
                 return GIT_SHA1_HEXSZ;
 
-        if (init_object_disambiguation(hex, len, &ds) < 0)
-                return -1;
-
         mad.init_len = len;
         mad.cur_len = len;
         mad.hex = hex;
+        mad.hash = sha1;
+
+        find_abbrev_len_packed(&mad);
+
+        if (init_object_disambiguation(hex, mad.cur_len, &ds) < 0)
+                return -1;
 
         ds.fn = extend_abbrev_len;
         ds.always_call_fn = 1;
         ds.cb_data = (void*)&mad;
 
         find_short_object_filename(&ds);
-        find_short_packed_object(&ds);
         (void)finish_object_disambiguation(&ds, &oid_ret);
 
         hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty
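The pack-index strategy above — binary search for the insertion position, then inspect only the neighbors — can be illustrated over a sorted array of hex strings. The names here are illustrative, and like the zero-padded probe described in the message, the sketch assumes the target is not itself in the list.

```c
#include <assert.h>
#include <string.h>

/* Longest common prefix length of two hex strings. */
static size_t common_prefix(const char *a, const char *b)
{
        size_t i = 0;

        while (a[i] && a[i] == b[i])
                i++;
        return i;
}

/*
 * Minimum abbreviation length for an absent target against a sorted
 * list: binary-search the insertion position; only the entries
 * immediately before and at that position can share the longest prefix.
 */
static size_t abbrev_len_sorted(const char *target,
                                const char *const *sorted, size_t n)
{
        size_t lo = 0, hi = n, best = 0;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2; /* overflow-safe midpoint */

                if (strcmp(sorted[mid], target) < 0)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        if (lo < n) {
                size_t l = common_prefix(target, sorted[lo]);
                if (l > best)
                        best = l;
        }
        if (lo > 0) {
                size_t l = common_prefix(target, sorted[lo - 1]);
                if (l > best)
                        best = l;
        }
        return best + 1; /* one past the longest shared prefix */
}

/* Example "pack-index": sorted, hypothetical short hex ids. */
static const char *const example_index[] = {
        "11112222", "1234abcd", "1234abce", "deadbeef"
};
```

A target of "1234abcf" shares seven characters with its neighbor "1234abce" and so needs eight; a target sharing nothing needs only one. This is why no linear scan of the colliding range is required.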
[PATCH v4 0/4] Improve abbreviation disambiguation
Changes since previous version:

* Fixed an overflow error in the binary search. I sent a separate
  patch to fix this error in existing searches; that patch should be
  applied before this one.

* Removed test-list-objects and test-abbrev in favor of a new git log
  test in p4211-line-log.sh. Limited perf numbers for the Linux repo
  are given in the cover letter and commit 4/4.

* Silently skip packfiles that fail to open with open_pack_index().

Thanks for all the comments from Jeff, Junio, Ramsay, and Stefan!

Thanks,
Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

Add a new perf test for testing the performance of log while computing
OID abbreviations. Using --oneline --raw and --parents options
maximizes the number of OIDs to abbreviate while still spending some
time computing diffs.
Below we report performance statistics for perf test 4211.6 from
p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | Base Time | New Time | Rel%  |
|-------|--------|-----------|----------|-------|
|     1 |      0 |   41.27 s |  38.93 s | -4.8% |
|    24 |      0 |   98.04 s |  91.35 s | -5.7% |
|    23 | 323952 |  117.78 s | 112.18 s | -4.8% |

Derrick Stolee (4):
  p4211-line-log.sh: add log --oneline --raw --parents perf test
  sha1_name: unroll len loop in find_unique_abbrev_r
  sha1_name: parse less while finding common prefix
  sha1_name: minimize OID comparisons during disambiguation

 sha1_name.c              | 129 ++++++++++++++++++++++++++++++++------
 t/perf/p4211-line-log.sh |   4 ++
 2 files changed, 118 insertions(+), 15 deletions(-)

-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v4 2/4] sha1_name: unroll len loop in find_unique_abbrev_r
Unroll the while loop inside find_unique_abbrev_r to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length. Instead, inspect each object
that collides with the estimated abbreviation to find the longest
common prefix.

The focus of this change is to refactor the existing method in a way
that clearly does not change the current behavior. In some cases, the
new method is slower than the previous method. Later changes will
correct all performance loss.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
         return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+        unsigned int init_len;
+        unsigned int cur_len;
+        char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-        int status, exists;
+        struct min_abbrev_data *mad = cb_data;
+
+        char *hex = oid_to_hex(oid);
+        unsigned int i = mad->init_len;
+        while (mad->hex[i] && mad->hex[i] == hex[i])
+                i++;
+
+        if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+                mad->cur_len = i + 1;
+
+        return 0;
+}
+
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+        struct disambiguate_state ds;
+        struct min_abbrev_data mad;
+        struct object_id oid_ret;
         if (len < 0) {
                 unsigned long count = approximate_object_count();
                 /*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
         sha1_to_hex_r(hex, sha1);
         if (len == GIT_SHA1_HEXSZ || !len)
                 return GIT_SHA1_HEXSZ;
-        exists = has_sha1_file(sha1);
-        while (len < GIT_SHA1_HEXSZ) {
-                struct object_id oid_ret;
-                status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-                if (exists
-                    ? !status
-                    : status == SHORT_NAME_NOT_FOUND) {
-                        hex[len] = 0;
-                        return len;
-                }
-                len++;
-        }
-        return len;
+
+        if (init_object_disambiguation(hex, len, &ds) < 0)
+                return -1;
+
+        mad.init_len = len;
+        mad.cur_len = len;
+        mad.hex = hex;
+
+        ds.fn = extend_abbrev_len;
+        ds.always_call_fn = 1;
+        ds.cb_data = (void*)&mad;
+
+        find_short_object_filename(&ds);
+        find_short_packed_object(&ds);
+        (void)finish_object_disambiguation(&ds, &oid_ret);
+
+        hex[mad.cur_len] = 0;
+        return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v2] cleanup: fix possible overflow errors in binary search
A common mistake when writing binary search is to allow possible
integer overflow by using the simple average:

        mid = (min + max) / 2;

Instead, use the overflow-safe version:

        mid = min + (max - min) / 2;

This translation is safe since the operation occurs inside a loop
conditioned on "min < max". The included changes were found using the
following git grep:

        git grep '/ *2;' '*.c'

Making this cleanup will prevent future review friction when a new
binary search is constructed based on existing code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/index-pack.c          | 4 ++--
 builtin/pack-objects.c        | 2 +-
 builtin/unpack-objects.c      | 2 +-
 cache-tree.c                  | 2 +-
 compat/regex/regex_internal.c | 4 ++--
 compat/regex/regexec.c        | 2 +-
 packfile.c                    | 2 +-
 sha1-lookup.c                 | 4 ++--
 sha1_name.c                   | 2 +-
 string-list.c                 | 2 +-
 utf8.c                        | 2 +-
 xdiff/xpatience.c             | 2 +-
 12 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..8ec459f52 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -633,7 +633,7 @@ static int find_ofs_delta(const off_t offset, enum object_type type)
         int first = 0, last = nr_ofs_deltas;
 
         while (first < last) {
-                int next = (first + last) / 2;
+                int next = first + (last - first) / 2;
                 struct ofs_delta_entry *delta = &ofs_deltas[next];
                 int cmp;
 
@@ -687,7 +687,7 @@ static int find_ref_delta(const unsigned char *sha1, enum object_type type)
         int first = 0, last = nr_ref_deltas;
 
         while (first < last) {
-                int next = (first + last) / 2;
+                int next = first + (last - first) / 2;
                 struct ref_delta_entry *delta = &ref_deltas[next];
                 int cmp;
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5ee2c48ff..6e77dfd44 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1277,7 +1277,7 @@ static int done_pbase_path_pos(unsigned hash)
         int lo = 0;
         int hi = done_pbase_paths_num;
         while (lo < hi) {
-                int mi = (hi + lo) / 2;
+                int mi = lo + (hi - lo) / 2;
                 if (done_pbase_paths[mi] == hash)
                         return mi;
                 if (done_pbase_paths[mi] < hash)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 689a29fac..62ea264c4 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -394,7 +394,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
         lo = 0;
         hi = nr;
         while (lo < hi) {
-                mid = (lo + hi)/2;
+                mid = lo + (hi - lo) / 2;
                 if (base_offset < obj_list[mid].offset) {
                         hi = mid;
                 } else if (base_offset > obj_list[mid].offset) {
diff --git a/cache-tree.c b/cache-tree.c
index 71d092ed5..d3f740127 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -49,7 +49,7 @@ static int subtree_pos(struct cache_tree *it, const char *path, int pathlen)
         lo = 0;
         hi = it->subtree_nr;
         while (lo < hi) {
-                int mi = (lo + hi) / 2;
+                int mi = lo + (hi - lo) / 2;
                 struct cache_tree_sub *mdl = it->down[mi];
                 int cmp = subtree_name_cmp(path, pathlen,
                                            mdl->name, mdl->namelen);
diff --git a/compat/regex/regex_internal.c b/compat/regex/regex_internal.c
index d4121f2f4..98342b831 100644
--- a/compat/regex/regex_internal.c
+++ b/compat/regex/regex_internal.c
@@ -613,7 +613,7 @@ re_string_reconstruct (re_string_t *pstr, int idx, int eflags)
               int low = 0, high = pstr->valid_len, mid;
               do
                 {
-                  mid = (high + low) / 2;
+                  mid = low + (high - low) / 2;
                   if (pstr->offsets[mid] > offset)
                     high = mid;
                   else if (pstr->offsets[mid] < offset)
@@ -1394,7 +1394,7 @@ re_node_set_contains (const re_node_set *set, int elem)
   right = set->nelem - 1;
   while (idx < right)
     {
-      mid = (idx + right) / 2;
+      mid = idx + (right - idx) / 2;
       if (set->elems[mid] < elem)
         idx = mid + 1;
       else
diff --git a/compat/regex/regexec.c b/compat/regex/regexec.c
index 0a745d9c3..6f2b48a78 100644
--- a/compat/regex/regexec.c
+++ b/compat/regex/regexec.c
@@ -4284,7 +4284,7 @@ search_cur_bkref_entry (const re_match_context_t *mctx, int str_idx)
   last = right = mctx->nbkref_ents;
   for (left = 0; left < right;)
     {
-      mid = (left + right) / 2;
+      mid = left + (right -
Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r
On 10/4/2017 2:07 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:

-	exists = has_sha1_file(sha1);
-	while (len < GIT_SHA1_HEXSZ) {
-		struct object_id oid_ret;
-		status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-		if (exists
-		    ? !status
-		    : status == SHORT_NAME_NOT_FOUND) {
-			hex[len] = 0;
-			return len;
-		}
-		len++;
-	}
-	return len;

The "always_call_fn" thing is a big sledgehammer that overrides
everything else in update_candidates(). It bypasses the careful
machinery set up to avoid having to open ambiguous objects to learn
their types as much as possible. One narrow exception when it is OK to
use is if we never limit our candidates with type.

I do not modify get_short_oid, which uses these advanced options
depending on the flags given. find_unique_abbrev_r() does not use these
advanced options.

And it might appear that the conversion is safe (if only because we do
not see any type limitation in the get_short_oid() call above), but I
think there is one case where this patch changes the behaviour: what
happens if core.disambiguate was set to anything other than "none"? The
new code does not know anything about type-based filtering, so it can
end up reporting a longer abbreviation than it was asked to produce. It
may not be a problem in practice, though. I am not sure if setting
core.disambiguate is generally a good idea in the first place, and if
it is OK to break find_unique_abbrev() with respect to the
configuration variable like this patch does.

I do not think that type-aware disambiguation goes through this code
path, since it requires giving different parameters to get_short_oid().
Test t1512-rev-parse-disambiguation.sh has a test 'core.disambiguate
config can prefer types' that verifies this behavior.

I'd feel safe if we get extra input from Peff, who introduced the
feature in 5b33cb1f ("get_short_sha1: make default disambiguation
configurable", 2016-09-27).

I look forward to more feedback.
Thanks for taking the time to look at my patch series. Thanks, -Stolee
Re: [PATCH] test-list-objects: mark file-local symbols as static
On 10/3/2017 5:51 PM, Ramsay Jones wrote:

Signed-off-by: Ramsay Jones
---

Hi Derrick,

If you need to re-roll your 'ds/find-unique-abbrev-optim' branch, could
you please squash this into the relevant patch (commit 3792c78ba0,
"test-list-objects: list a subset of object ids", 01-10-2017).

Thanks!

ATB,
Ramsay Jones

 t/helper/test-list-objects.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
index 22bc9b4e6..5c5d3c03f 100644
--- a/t/helper/test-list-objects.c
+++ b/t/helper/test-list-objects.c
@@ -6,43 +6,43 @@ struct count {
 	int select_mod;
 };
 
-int count_loose(const struct object_id *oid,
-		const char *path,
-		void *data)
+static int count_loose(const struct object_id *oid,
+		       const char *path,
+		       void *data)
 {
 	((struct count*)data)->total++;
 	return 0;
 }
 
-int count_packed(const struct object_id *oid,
-		 struct packed_git *pack,
-		 uint32_t pos,
-		 void* data)
+static int count_packed(const struct object_id *oid,
+			struct packed_git *pack,
+			uint32_t pos,
+			void* data)
 {
 	((struct count*)data)->total++;
 	return 0;
 }
 
-void output(const struct object_id *oid,
-	    struct count *c)
+static void output(const struct object_id *oid,
+		   struct count *c)
 {
 	if (!(c->total % c->select_mod))
 		printf("%s\n", oid_to_hex(oid));
 	c->total--;
 }
 
-int output_loose(const struct object_id *oid,
-		 const char *path,
-		 void *data)
+static int output_loose(const struct object_id *oid,
+			const char *path,
+			void *data)
 {
 	output(oid, (struct count*)data);
 	return 0;
 }
 
-int output_packed(const struct object_id *oid,
-		  struct packed_git *pack,
-		  uint32_t pos,
-		  void* data)
+static int output_packed(const struct object_id *oid,
+			 struct packed_git *pack,
+			 uint32_t pos,
+			 void* data)
 {
 	output(oid, (struct count*)data);
 	return 0;

Thanks, Ramsay. I applied these changes locally. I'll remember "static"
in the future.

Thanks,
-Stolee
Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r
On 10/4/2017 2:10 AM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:

... I understand that this patch on its own does not have good numbers.
I split patches 3 and 4 specifically to highlight two distinct changes:

Patch 3: Unroll the len loop that may inspect all files multiple times.
Patch 4: Parse less while disambiguating.

Patch 4 more than makes up for the performance hits in this patch.

Now you confused me even more. When we read the similar table that
appears in [PATCH 4/5], what does the "Base Time" column mean? Vanilla
Git with [PATCH 3/5] applied? Vanilla Git with [PATCH 4/5] alone
applied? Something else?

In PATCH 3, 4, and 5, I used the commit-by-commit diff for the perf
numbers, so the "Base Time" for PATCH 4 is the time calculated when
PATCH 3 is applied. The table in the [PATCH 0/5] message includes the
relative change for all commits. I recalculated the relative change for
each patch relative to the baseline (PATCH 2).

Looking again, it appears I misspoke: PATCH 4 does include a +8% change
for a fully-repacked Linux repo relative to PATCH 2. Since PATCH 5
includes an optimization targeted directly at large packfiles, the
final performance gain is significant in the fully-packed cases. It is
also worth looking at the absolute times for these cases: the
fully-packed case is significantly faster than the multiple-packfile
case, so the relative change impacts users less.

One final note: the improvement was clearer in test p0008.1 when the
test included "sort -R" to shuffle the known OIDs. Providing OIDs in
lexicographic order has a significant effect on the performance, which
does not reflect real-world usage. I removed the "sort -R" because it
is a GNU-ism, but if there is a good cross-platform alternative I would
be happy to replace it.
p0008.1: find_unique_abbrev() for existing objects
--

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

| Repo  | Baseline | Patch 3 | Rel % | Patch 4 | Rel % | Patch 5 | Rel % |
|-------|----------|---------|-------|---------|-------|---------|-------|
| Git   | 0.09     | 0.06    | -33%  | 0.05    | -44%  | 0.05    | -44%  |
| Git   | 0.11     | 0.08    | -27%  | 0.08    | -27%  | 0.08    | -27%  |
| Git   | 0.09     | 0.07    | -22%  | 0.06    | -33%  | 0.06    | -33%  |
| Linux | 0.13     | 0.32    | +146% | 0.14    | + 8%  | 0.05    | -62%  |
| Linux | 1.13     | 1.12    | - 1%  | 0.94    | -17%  | 0.88    | -22%  |
| Linux | 1.08     | 1.05    | - 3%  | 0.86    | -20%  | 0.80    | -26%  |
| VSTS  | 0.12     | 0.23    | +92%  | 0.11    | - 8%  | 0.05    | -58%  |
| VSTS  | 1.02     | 1.08    | + 6%  | 0.95    | - 7%  | 0.95    | - 7%  |
| VSTS  | 2.25     | 2.08    | - 8%  | 1.82    | -19%  | 1.93    | -14%  |

(Each repo has three versions, in order: 1 packfile, multiple packfiles,
and multiple packfiles and loose objects.)

Thanks,
-Stolee
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On 10/10/2017 8:56 AM, Junio C Hamano wrote:

Jeff King writes:

OK, I think that makes more sense. But note the p->num_objects thing I
mentioned. If I do:

	git pack-objects .git/objects/pack/pack num_objects)
		return;

Technically that also covers open_pack_index() failure, too, but that's
a subtlety I don't think we should rely on.

True. I notice that the early part of the two functions looks almost
identical. Do we need error condition handling for the other one, too?

I prefer to fix the problem in all code clones when they cause review
friction, so I'll send a fifth commit showing just the diff for these
packfile issues in sha1_name.c. See patch below.

Should open_pack_index() return a non-zero status if the packfile is
empty? Or, is there a meaningful reason to have empty packfiles?

Thanks,
-Stolee

diff --git a/sha1_name.c b/sha1_name.c
index 49ba67955..9f8a33e82 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
 	uint32_t num, last, i, first = 0;
 	const struct object_id *current = NULL;
 
-	open_pack_index(p);
+	if (open_pack_index(p) || !p->num_objects)
+		return;
+
 	num = p->num_objects;
 	last = num;
 	while (first < last) {
@@ -513,7 +515,9 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 	uint32_t num, last, first = 0;
 	struct object_id oid;
 
-	open_pack_index(p);
+	if (open_pack_index(p) || !p->num_objects)
+		return;
+
 	num = p->num_objects;
 	last = num;
 	while (first < last) {
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On 10/9/2017 9:49 AM, Jeff King wrote:

On Sun, Oct 08, 2017 at 02:49:42PM -0400, Derrick Stolee wrote:

@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, last, first = 0;
+	struct object_id oid;
+
+	open_pack_index(p);
+	num = p->num_objects;
+	last = num;
+	while (first < last) {
[...]

Your cover letter lists:

  * Silently skip packfiles that fail to open with open_pack_index()

as a change from the previous version. But this looks the same as the
last round. I think this _does_ end up skipping such packfiles because
p->num_objects will be zero. Is it worth having a comment to that
effect (or even just an early return) to make it clear that the
situation is intentional?

Although...

+	/*
+	 * first is now the position in the packfile where we would insert
+	 * mad->hash if it does not exist (or the position of mad->hash if
+	 * it does exist). Hence, we consider a maximum of three objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		nth_packed_object_oid(&oid, p, first);
+		extend_abbrev_len(&oid, mad);

If we have zero objects in the pack, what would nth_packed_object_oid()
be returning here?

So I actually think we do want an early return, not just when
open_pack_index() fails, but also when p->num_objects is zero.

-Peff

Sorry about this. I caught this while I was writing my cover letter and
amended my last commit to include the following:

	if (open_pack_index(p))
		return;

After I amended the commit, I forgot to 'format-patch' again. I can
send a diff between the commits after review has calmed.

Thanks,
-Stolee
Re: git-clone causes out of memory
On 10/13/2017 8:44 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote:

On 13.10.2017 15:04, Junio C Hamano wrote:

Christian Couder writes:

Yeah, but perhaps Git could be smarter when rev-listing too and avoid
processing files or directories it has already seen?

Aren't you suggesting to optimize for a wrong case?

Anything that is possible with a software should be considered as a
possible usecase. It's in fact a DoS attack. Imagine there's a server
that is using git to process something, and now there's a way to knock
down this server. It's also bad from a promotional standpoint.

But the point is that you'd have the same problem with any repository
that had 10^7 files in it. Yes, it's convenient for the attacker that
there are only 9 objects, but fundamentally it's pretty easy for an
attacker to construct repositories that have large trees (or very deep
trees -- that's what causes stack exhaustion in some cases).

Note too that this attack almost always comes down to the diff code
(which is why it kicks in for pathspec limiting) which has to actually
expand the tree. Most "normal" server-side operations (like accepting
pushes or serving fetches) operate only on the object graph and _do_
avoid processing already-seen objects. As soon as servers start trying
to checkout or diff, though, the attack surface gets quite large. And
you really need to start thinking about having resource limits and
quotas for CPU and memory use of each process (and group by requesting
user, IP, repo, etc).

I think the main thing Git could be doing here is to limit the size of
the tree (both width and depth). But arbitrary limits like that have a
way of being annoying, and I think it just pushes resource-exhaustion
attacks off a little (e.g., can you construct a blob that behaves badly
with "--patch"?).

-Peff

I'm particularly interested in why `git rev-list HEAD -- [path]` gets
slower in this case, because I wrote the history algorithm used by VSTS.
In our algorithm, we only walk the list of objects from commit to the
tree containing the path item. For example, for the path d0/d0/d0, we
would only walk:

	commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry]

From this, we can determine the TREESAME relationship by parsing four
objects, without parsing all contents below d0/d0/d0.

The reason we have exponential behavior in `git rev-list` is because we
are calling diff_tree_oid() in tree-diff.c recursively without
short-circuiting on equal OIDs. I will prepare a patch that adds this
OID-equal short-circuit to avoid this exponential behavior. I'll model
my patch against a similar patch in master:

	commit d12a8cf0af18804c2000efc7a0224da631e04cd1
	unpack-trees: avoid duplicate ODB lookups during checkout

It will also significantly speed up rev-list calls for short paths in
deep repositories. It will not be very measurable in the git or Linux
repos because of their shallow folder structure.

Thanks,
-Stolee
Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids
On 10/5/2017 6:00 AM, Jeff King wrote:

On Thu, Oct 05, 2017 at 06:48:10PM +0900, Junio C Hamano wrote:

Jeff King writes:

This is weirdly specific. Can we accomplish the same thing with
existing tools? E.g., could:

	git cat-file --batch-all-objects --batch-check='%(objectname)' |
	shuffle |
	head -n 100

do the same thing? I know that "shuffle" isn't available everywhere,
but I'd much rather see us fill in portability gaps in a general way,
rather than introducing one-shot C code that needs to be maintained
(and you wouldn't _think_ that t/helper programs need much maintenance,
but try perusing "git log t/helper" output; they have to adapt to the
same tree-wide changes as the rest of the code).

I was thinking about this a bit more, and came to the conclusion that
"sort -R" and "shuf" are the wrong tools to use. We would want to
measure with something close to a real-world workload. For example,
letting

	git rev-list --all --objects

produce the list of objects in traversal order (i.e. this is very
similar to the order in which "git log -p" needs to access the objects)
and chomping at the number of sample objects you need in your test
would give you such a list.

Actually, I'd just as soon see timings for "git log --format=%h" or
"git log --raw", as opposed to patches 1 and 2. You won't see a 90%
speedup there, but you will see the actual improvement that real-world
users are going to experience, which is way more important, IMHO.

-Peff

Thanks for thinking hard about this.

For some real-user context: Some engineers using Git for the Windows
repo were seeing extremely slow commands, such as 'fetch' or 'commit',
and when we took a trace we saw most of the time spinning in this
abbreviation code. Our workaround so far has been to set core.abbrev=40.

I'll run some perf numbers for the commands you recommend, and also see
if I can replicate some of the pain points that triggered this change
using the Linux repo.

Thanks,
-Stolee
Re: git-clone causes out of memory
On 10/13/2017 9:15 AM, Derrick Stolee wrote: On 10/13/2017 8:44 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 03:12:43PM +0300, Constantine wrote: On 13.10.2017 15:04, Junio C Hamano wrote: Christian Couder <christian.cou...@gmail.com> writes: Yeah, but perhaps Git could be smarter when rev-listing too and avoid processing files or directories it has already seen? Aren't you suggesting to optimize for a wrong case? Anything that is possible with a software should be considered as a possible usecase. It's in fact a DoS attack. Imagine there's a server that using git to process something, and now there's a way to knock down this server. It's also bad from a promotional stand point. But the point is that you'd have the same problem with any repository that had 10^7 files in it. Yes, it's convenient for the attacker that there are only 9 objects, but fundamentally it's pretty easy for an attacker to construct repositories that have large trees (or very deep trees -- that's what causes stack exhaustion in some cases). Note too that this attack almost always comes down to the diff code (which is why it kicks in for pathspec limiting) which has to actually expand the tree. Most "normal" server-side operations (like accepting pushes or serving fetches) operate only on the object graph and _do_ avoid processing already-seen objects. As soon as servers start trying to checkout or diff, though, the attack surface gets quite large. And you really need to start thinking about having resource limits and quotas for CPU and memory use of each process (and group by requesting user, IP, repo, etc). I think the main thing Git could be doing here is to limit the size of the tree (both width and depth). But arbitrary limits like that have a way of being annoying, and I think it just pushes resource-exhaustion attacks off a little (e.g., can you construct a blob that behaves badly with the "--patch"?). 
-Peff

I'm particularly interested in why `git rev-list HEAD -- [path]` gets
slower in this case, because I wrote the history algorithm used by VSTS.
In our algorithm, we only walk the list of objects from commit to the
tree containing the path item. For example, for the path d0/d0/d0, we
would only walk:

	commit --root--> tree --d0--> tree --d0--> tree [parse oid for d0 entry]

From this, we can determine the TREESAME relationship by parsing four
objects, without parsing all contents below d0/d0/d0.

The reason we have exponential behavior in `git rev-list` is because we
are calling diff_tree_oid() in tree-diff.c recursively without
short-circuiting on equal OIDs. I will prepare a patch that adds this
OID-equal short-circuit to avoid this exponential behavior. I'll model
my patch against a similar patch in master:

	commit d12a8cf0af18804c2000efc7a0224da631e04cd1
	unpack-trees: avoid duplicate ODB lookups during checkout

It will also significantly speed up rev-list calls for short paths in
deep repositories. It will not be very measurable in the git or Linux
repos because of their shallow folder structure.

Thanks,
-Stolee

Since I don't understand enough about the consumers of diff_tree_oid()
(and the fact that the recursive behavior may be wanted in some cases),
I think we can fix this in builtin/rev-list.c with this simple diff:

---
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 
 	git_config(git_default_config, NULL);
 	init_revisions(&revs, prefix);
+
+	revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+
 	revs.abbrev = DEFAULT_ABBREV;
 	revs.commit_format = CMIT_FMT_UNSPECIFIED;
 	argc = setup_revisions(argc, argv, &revs, NULL);
Re: git-clone causes out of memory
On 10/13/2017 10:26 AM, Jeff King wrote: On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote: This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is causing diff_can_quit_early() to return false. Due to the corner-case of the bug it seems it will not be a huge performance improvement in most cases. Still worth fixing and I'm looking at your suggestions to try and learn this area better. Yeah, I just timed some pathspec limits on linux.git, and it makes at best a fraction of a percent improvement (but any improvement is well within run-to-run noise). Which is not surprising. I agree it's worth fixing, though. -Peff I just ran a first-level folder history in the Windows repository and `git rev-list HEAD -- windows >/dev/null` went from 0.43s to 0.04s. So, a good percentage improvement but even on this enormous repo the bug doesn't present an issue to human users. Thanks, -Stolee
Re: git-clone causes out of memory
On 10/13/2017 9:50 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 09:39:14AM -0400, Derrick Stolee wrote:

Since I don't understand enough about the consumers of diff_tree_oid()
(and the fact that the recursive behavior may be wanted in some cases),
I think we can fix this in builtin/rev-list.c with this simple diff:

---
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ded1577424..b2e8e02cc8 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -285,6 +285,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 
 	git_config(git_default_config, NULL);
 	init_revisions(&revs, prefix);
+
+	revs.pruning.flags = revs.pruning.flags & ~DIFF_OPT_RECURSIVE;
+

(Note: I'm running tests now and see that this breaks behavior.
Definitely not the solution we want.)

Hmm, this feels wrong, because we _do_ want to recurse down and follow
the pathspec to see if there are real changes. We should be comparing
an empty tree and d0/d0/d0/d0 (or however deep your pathspec goes). We
should be able to see immediately that the entry is not present between
the two and not bother descending. After all, we've set the QUICK flag
in init_revisions(). So the real question is why QUICK is not kicking
in.

-Peff

I'm struggling to understand your meaning. We want to walk from root to
d0/d0/d0/d0, but there is no reason to walk beyond that tree. But maybe
that's what the QUICK flag is supposed to do.

Thanks,
-Stolee
Re: [PATCH] revision: quit pruning diff more quickly when possible
On 10/13/2017 11:27 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:26:46AM -0400, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:25:10AM -0400, Derrick Stolee wrote:

This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false. Due to the corner case
of the bug, it seems it will not be a huge performance improvement in
most cases. Still worth fixing, and I'm looking at your suggestions to
try and learn this area better.

Yeah, I just timed some pathspec limits on linux.git, and it makes at
best a fraction of a percent improvement (but any improvement is well
within run-to-run noise). Which is not surprising. I agree it's worth
fixing, though.

Here it is cleaned up and with a commit message. There's another case
that can be optimized, too: --remove-empty with an all-deletions
commit. That's probably even more obscure and pathological, but it was
easy to cover in the same breath.

I didn't bother making a perf script, since this really isn't
indicative of real-world performance. If we wanted to do perf
regression tests here, I think the best path forward would be:

  1. Make sure the perf tests cover pathspecs (maybe in p0001?).

  2. Make it easy to run the whole perf suite against a "bomb" repo.
     This surely isn't the only slow thing of interest.

-- >8 --
Subject: revision: quit pruning diff more quickly when possible

When the revision traversal machinery is given a pathspec, we must
compute the parent-diff for each commit to determine which ones are
TREESAME. We set the QUICK diff flag to avoid looking at more entries
than we need; we really just care whether there are any changes at all.

But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the change
consists only of added entries (in which case we may prune the parent
in try_to_simplify_commit()).
To cover that case, our file_add_remove() callback does not quit the diff upon seeing an added entry; it keeps looking for other types of entries. But this means when --remove-empty is not set (and it is not by default), we compute more of the diff than is necessary. You can see this in a pathological case where a commit adds a very large number of entries, and we limit based on a broad pathspec. E.g.: perl -e ' chomp(my $blob = `git hash-object -w --stdin remove_empty_trees. This callback parameter could be passed to the "add_remove" and "change" callbacks, but there's not much point. They already receive the diff_options struct, and doing it this way avoids having to update the function signature of the other callbacks (arguably the format_callback and output_prefix functions could benefit from the same simplification). Signed-off-by: Jeff King <p...@peff.net> --- diff.h | 1 + revision.c | 16 +--- 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/diff.h b/diff.h index 7dcfcfbef7..4a34d256f1 100644 --- a/diff.h +++ b/diff.h @@ -180,6 +180,7 @@ struct diff_options { pathchange_fn_t pathchange; change_fn_t change; add_remove_fn_t add_remove; + void *change_fn_data; diff_format_fn_t format_callback; void *format_callback_data; diff_prefix_fn_t output_prefix; diff --git a/revision.c b/revision.c index 8fd222f3bf..a3f245e2cc 100644 --- a/revision.c +++ b/revision.c @@ -399,8 +399,16 @@ static struct commit *one_relevant_parent(const struct rev_info *revs, * if the whole diff is removal of old data, and otherwise * REV_TREE_DIFFERENT (of course if the trees are the same we * want REV_TREE_SAME). - * That means that once we get to REV_TREE_DIFFERENT, we do not - * have to look any further. + * + * The only time we care about the distinction is when + * remove_empty_trees is in effect, in which case we care only about + * whether the whole change is REV_TREE_NEW, or if there's another type + * of change. 
Which means we can stop the diff early in either of these
+ * cases:
+ *
+ *   1. We're not using remove_empty_trees at all.
+ *
+ *   2. We saw anything except REV_TREE_NEW.
  */
 static int tree_difference = REV_TREE_SAME;
 
@@ -411,9 +419,10 @@ static void file_add_remove(struct diff_options *options,
 			    const char *fullpath, unsigned dirty_submodule)
 {
 	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
+	struct rev_info *revs = options->change_fn_data;
 
 	tree_difference |= diff;
-	if (tree_difference == REV_TREE_DIFFERENT)
+	if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
 		DIFF_OPT_SET(options, HAS_CHANGES);
 }
 
@@ -1351,6 +1360,7 @@ void init_revisions(struct rev_info *revs, const char *prefix)
 	DIFF_OPT_SET(&revs->pruning, QUICK);
 	revs->pruning.add_remove = file_add_remove;
 	revs->pruning.change = file_change;
+	revs->pruning.change_fn_data = revs;
Re: git-clone causes out of memory
On 10/13/2017 10:20 AM, Jeff King wrote:

On Fri, Oct 13, 2017 at 10:10:18AM -0400, Jeff King wrote:

Hmm. So this patch makes it go fast:

diff --git a/revision.c b/revision.c
index d167223e69..b52ea4e9d8 100644
--- a/revision.c
+++ b/revision.c
@@ -409,7 +409,7 @@ static void file_add_remove(struct diff_options *options,
 	int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
 
 	tree_difference |= diff;
-	if (tree_difference == REV_TREE_DIFFERENT)
+	if (tree_difference & REV_TREE_DIFFERENT)
 		DIFF_OPT_SET(options, HAS_CHANGES);
 }

But that essentially makes the conditional a noop (since we know we set
either NEW or OLD above and DIFFERENT is the union of those flags). I'm
not sure I understand why file_add_remove() would ever want to avoid
setting HAS_CHANGES (certainly its companion file_change() always
does).

This goes back to Junio's dd47aa3133 (try-to-simplify-commit: use
diff-tree --quiet machinery., 2007-03-14). Maybe I am missing
something, but AFAICT this was always buggy. But since it only affects
adds and deletes, maybe nobody noticed? I'm also not sure if it only
causes a slowdown, or if this could cause us to erroneously mark
something as TREESAME which isn't (I _do_ think people would have
noticed that).

Answering my own question a little, there is a hint in the comment a
few lines above:

	/*
	 * The goal is to get REV_TREE_NEW as the result only if the
	 * diff consists of all '+' (and no other changes), REV_TREE_OLD
	 * if the whole diff is removal of old data, and otherwise
	 * REV_TREE_DIFFERENT (of course if the trees are the same we
	 * want REV_TREE_SAME).
	 * That means that once we get to REV_TREE_DIFFERENT, we do not
	 * have to look any further.
	 */

So my patch above is breaking that. But it's not clear from that
comment why we care about the difference between NEW, OLD, and
DIFFERENT.
Grepping around for REV_TREE_NEW and REV_TREE_OLD, I think the answer
is in try_to_simplify_commit():

	case REV_TREE_NEW:
		if (revs->remove_empty_trees &&
		    rev_same_tree_as_empty(revs, p)) {
			/* We are adding all the specified
			 * paths from this parent, so the
			 * history beyond this parent is not
			 * interesting.  Remove its parents
			 * (they are grandparents for us).
			 * IOW, we pretend this parent is a
			 * "root" commit.
			 */
			if (parse_commit(p) < 0)
				die("cannot simplify commit %s (invalid %s)",
				    oid_to_hex(&commit->object.oid),
				    oid_to_hex(&p->object.oid));
			p->parents = NULL;
		}
	/* fallthrough */
	case REV_TREE_OLD:
	case REV_TREE_DIFFERENT:

So when --remove-empty is not in effect (and it's not by default), we
don't care about OLD vs NEW, and we should be able to optimize further.

-Peff

This does appear to be the problem. The missing DIFF_OPT_HAS_CHANGES is
causing diff_can_quit_early() to return false.

Due to the corner case of the bug, it seems it will not be a huge
performance improvement in most cases. Still worth fixing, and I'm
looking at your suggestions to try and learn this area better.

It will speed up natural cases of someone adding or renaming a folder
with a lot of contents, including someone initializing a repository
from an existing codebase. This git bomb case happens to add all of the
folders at once, which is why the performance is so noticeable.

I use a version of this "exponential file growth" for testing that
increases the folder depth one commit at a time, so the contents are
rather small in the first "add", hence not noticing this issue.

Thanks,
-Stolee
[PATCH v5 3/4] sha1_name: parse less while finding common prefix
Create get_hex_char_from_oid() to parse OIDs one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 19603713f..fdd2711a6 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
 	char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+					 unsigned int pos)
+{
+	static const char hex[] = "0123456789abcdef";
+
+	if ((pos & 1) == 0)
+		return hex[oid->hash[pos >> 1] >> 4];
+	else
+		return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
 	struct min_abbrev_data *mad = cb_data;
 
-	char *hex = oid_to_hex(oid);
 	unsigned int i = mad->init_len;
-	while (mad->hex[i] && mad->hex[i] == hex[i])
+	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;
 
 	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v5 1/4] p4211-line-log.sh: add log --oneline --raw --parents perf test
Add a new perf test for testing the performance of log while computing
OID abbreviations. Using the --oneline --raw and --parents options
maximizes the number of OIDs to abbreviate while still spending some
time computing diffs.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/perf/p4211-line-log.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index b7ff68d4f..e0ed05907 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -31,4 +31,8 @@ test_perf 'git log -L (renames on)' '
 	git log -M -L 1:"$file" >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents' '
+	git log --oneline --raw --parents >/dev/null
+'
+
 test_done
-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v5 2/4] sha1_name: unroll len loop in find_unique_abbrev_r
Unroll the while loop inside find_unique_abbrev_r() to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length. Instead, inspect each object
that collides with the estimated abbreviation to find the longest common
prefix.

The focus of this change is to refactor the existing method in a way
that clearly does not change the current behavior. In some cases, the
new method is slower than the previous method. Later changes will
correct all performance loss.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index c7c5ab376..19603713f 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
 	return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+	unsigned int init_len;
+	unsigned int cur_len;
+	char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-	int status, exists;
+	struct min_abbrev_data *mad = cb_data;
+
+	char *hex = oid_to_hex(oid);
+	unsigned int i = mad->init_len;
+	while (mad->hex[i] && mad->hex[i] == hex[i])
+		i++;
+
+	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+		mad->cur_len = i + 1;
+
+	return 0;
+}
+
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+	struct disambiguate_state ds;
+	struct min_abbrev_data mad;
+	struct object_id oid_ret;
 	if (len < 0) {
 		unsigned long count = approximate_object_count();
 		/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	sha1_to_hex_r(hex, sha1);
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
-	exists = has_sha1_file(sha1);
-	while (len < GIT_SHA1_HEXSZ) {
-		struct object_id oid_ret;
-		status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-		if (exists
-		    ? !status
-		    : status == SHORT_NAME_NOT_FOUND) {
-			hex[len] = 0;
-			return len;
-		}
-		len++;
-	}
-	return len;
+
+	if (init_object_disambiguation(hex, len, &ds) < 0)
+		return -1;
+
+	mad.init_len = len;
+	mad.cur_len = len;
+	mad.hex = hex;
+
+	ds.fn = extend_abbrev_len;
+	ds.always_call_fn = 1;
+	ds.cb_data = (void *)&mad;
+
+	find_short_object_filename(&ds);
+	find_short_packed_object(&ds);
+	(void)finish_object_disambiguation(&ds, &oid_ret);
+
+	hex[mad.cur_len] = 0;
+	return mad.cur_len;
 }
 
 const char *find_unique_abbrev(const unsigned char *sha1, int len)
-- 
2.14.1.538.g56ec8fc98.dirty
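[Editorial note: the single-pass idea in extend_abbrev_len() can be sketched outside of C. The following Python snippet is an illustration only, not the git implementation; all names are invented. For every object that collides with the initial prefix, the candidate length is extended to one more than the longest common prefix with that object.]

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two hex strings."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def min_abbrev_len(target, colliding, init_len):
    """Shortest unambiguous prefix length for `target` among the
    colliding object ids, computed in a single pass over them."""
    cur_len = init_len
    for other in colliding:
        if other == target:
            continue  # the object itself never forces a longer prefix
        cur_len = max(cur_len, common_prefix_len(target, other) + 1)
    return cur_len
```

One pass over the colliding objects replaces the old scheme of re-running a full disambiguation query for each candidate length.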
[PATCH v5 4/4] sha1_name: minimize OID comparisons during disambiguation
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index. The
object immediately before and after (or the one at the insertion
position) give the maximum common prefix. No subsequent linear search is
required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the OID
used for the binary search is padded with zeroes so the object will not
exist in the repo (with high probability) and the same logic applies.

This commit completes a series of three changes to OID abbreviation
code, and the overall change can be seen using standard commands for
large repos. Below we report performance statistics for perf test 4211.6
from p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | HEAD~3   | HEAD     | Rel%  |
|-------|--------|----------|----------|-------|
|     1 |      0 |  41.27 s |  38.93 s | -4.8% |
|    24 |      0 |  98.04 s |  91.35 s | -5.7% |
|    23 | 323952 | 117.78 s | 112.18 s | -4.8% |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index fdd2711a6..7fd5f5f71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -153,7 +153,9 @@ static void unique_in_pack(struct packed_git *p,
 	uint32_t num, last, i, first = 0;
 	const struct object_id *current = NULL;
 
-	open_pack_index(p);
+	if (open_pack_index(p) || !p->num_objects)
+		return;
+
 	num = p->num_objects;
 	last = num;
 	while (first < last) {
@@ -478,6 +480,7 @@ struct min_abbrev_data {
 	unsigned int init_len;
 	unsigned int cur_len;
 	char *hex;
+	const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +508,67 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, last, first = 0;
+	struct object_id oid;
+
+	if (open_pack_index(p) || !p->num_objects)
+		return;
+
+	num = p->num_objects;
+	last = num;
+	while (first < last) {
+		uint32_t mid = first + (last - first) / 2;
+		const unsigned char *current;
+		int cmp;
+
+		current = nth_packed_object_sha1(p, mid);
+		cmp = hashcmp(mad->hash, current);
+		if (!cmp) {
+			match = 1;
+			first = mid;
+			break;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	/*
+	 * first is now the position in the packfile where we would insert
+	 * mad->hash if it does not exist (or the position of mad->hash if
+	 * it does exist). Hence, we consider a maximum of three objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		nth_packed_object_oid(&oid, p, first);
+		extend_abbrev_len(&oid, mad);
+	} else if (first < num - 1) {
+		nth_packed_object_oid(&oid, p, first + 1);
+		extend_abbrev_len(&oid, mad);
+	}
+	if (first > 0) {
+		nth_packed_object_oid(&oid, p, first - 1);
+		extend_abbrev_len(&oid, mad);
+	}
+	mad->init_len = mad->cur_len;
+}
+
+static void find_abbrev_len_packed(struct min_abbrev_data *mad)
+{
+	struct packed_git *p;
+
+	prepare_packed_git();
+	for (p = packed_git; p; p = p->next)
+		find_abbrev_len_for_pack(p, mad);
+}
+
 int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 {
 	struct disambiguate_state ds;
@@ -536,19 +600,21 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
-	if (init_object_disambiguation(hex, len, &ds) < 0)
-		return -1;
-
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;
+	mad.hash = sha1;
+
+	find_abbrev_len_packed(&mad);
+
+	if (init_object_disambiguation(hex, mad.cur_len, &ds) < 0)
+		return -1;
 
 	ds.fn = extend_abbrev_len;
 	ds.always_call_fn = 1;
 	ds.cb_data = (void *)&mad;
 
 	find_short_object_filename(&ds);
-	find_short_packed_object(&ds);
 	(void)finish_object_disambiguation(&ds, &oid_ret);
 
 	hex[mad.cur_len] = 0;
-- 
2.14.1.538.g56ec8fc98.dirty
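[Editorial note: the key observation of this patch can be sketched in a few lines of Python. This is a simplified illustration, not the pack-index code itself: in a sorted list of object ids, only the entries adjacent to the target's insertion position can share its longest prefix, so a binary search plus a couple of neighbor comparisons suffices.]

```python
import bisect

def common_prefix_len(a, b):
    """Length of the longest common prefix of two hex strings."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def abbrev_len_in_sorted_pack(sorted_ids, target, cur_len):
    """Extend cur_len using only the neighbors of target's insertion
    position; entries further away share a prefix no longer than
    these neighbors do, so they never need to be inspected."""
    pos = bisect.bisect_left(sorted_ids, target)
    for i in (pos - 1, pos, pos + 1):
        if 0 <= i < len(sorted_ids) and sorted_ids[i] != target:
            cur_len = max(cur_len,
                          common_prefix_len(target, sorted_ids[i]) + 1)
    return cur_len
```

The C version distinguishes the match and no-match cases to inspect exactly two entries; the sketch above conservatively checks all three neighbors, which gives the same answer.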
[PATCH v5 0/4] Improve abbreviation disambiguation
Changes since previous version:

* Make 'pos' unsigned in get_hex_char_from_oid()
* Check response from open_pack_index()
* Fix small typos in commit messages

Thanks,
Stolee

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

Add a new perf test for testing the performance of log while computing
OID abbreviations. Using --oneline --raw and --parents options maximizes
the number of OIDs to abbreviate while still spending some time
computing diffs.

Below we report performance statistics for perf test 4211.6 from
p4211-line-log.sh using three copies of the Linux repo:

| Packs | Loose  | Base Time | New Time | Rel%  |
|-------|--------|-----------|----------|-------|
|     1 |      0 |   41.27 s |  38.93 s | -4.8% |
|    24 |      0 |   98.04 s |  91.35 s | -5.7% |
|    23 | 323952 |  117.78 s | 112.18 s | -4.8% |

Derrick Stolee (4):
  p4211-line-log.sh: add log --oneline --raw --parents perf test
  sha1_name: unroll len loop in find_unique_abbrev_r
  sha1_name: parse less while finding common prefix
  sha1_name: minimize OID comparisons during disambiguation

 sha1_name.c              | 135 +++++++++++++++++++++++++++++++++++------
 t/perf/p4211-line-log.sh |   4 ++
 2 files changed, 123 insertions(+), 16 deletions(-)

-- 
2.14.1.538.g56ec8fc98.dirty
Re: [PATCH v5 0/4] Improve abbreviation disambiguation
On 10/12/2017 8:02 AM, Derrick Stolee wrote: Changes since previous version: * Make 'pos' unsigned in get_hex_char_from_oid() * Check response from open_pack_index() * Small typos in commit messages Thanks, Stolee I forgot to mention that I rebased on master this morning to be sure this doesn't conflict with the binary-search patch that was queued this week. Thanks, Stolee
[PATCH v3 2/5] p0008-abbrev.sh: Test find_unique_abbrev() perf
Create helper program test-abbrev to compute the minimum length of a
disambiguating short-sha for an input list of object ids.

Perf test p0008-abbrev.sh runs test-abbrev for 100,000 object ids. For
test 0008.1, these objects exist. For test 0008.2, these objects do not
exist in the repo (with high probability).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile               |  1 +
 t/helper/.gitignore    |  1 +
 t/helper/test-abbrev.c | 18 ++++++++++++++++++
 t/perf/p0008-abbrev.sh | 22 ++++++++++++++++++++++
 4 files changed, 42 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index 50a2eab80..63438a44e 100644
--- a/Makefile
+++ b/Makefile
@@ -638,6 +638,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 9696f54bb..2190781ff 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config
diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0..3ad88611a
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,18 @@
+#include "cache.h"
+
+int cmd_main(int ac, const char **av)
+{
+	struct object_id oid;
+	char hex[GIT_MAX_HEXSZ + 2];
+	const char *end;
+
+	setup_git_directory();
+
+	while (fgets(hex, GIT_MAX_HEXSZ + 2, stdin)) {
+		hex[GIT_MAX_HEXSZ] = 0;
+		if (!parse_oid_hex(hex, &oid, &end))
+			find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+	}
+
+	exit(0);
+}
diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0..5cbc8a888
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test-list-objects 100000 > objs.txt
+
+test_perf 'find_unique_abbrev() for existing objects' '
+	test-abbrev < objs.txt
+'
+
+test-list-objects --missing 100000 > objs.txt
+
+test_perf 'find_unique_abbrev() for missing objects' '
+	test-abbrev < objs.txt
+'
+
+rm objs.txt
+
+test_done
-- 
2.14.1.538.g56ec8fc98.dirty
[PATCH v3 4/5] sha1_name: Parse less while finding common prefix
Create get_hex_char_from_oid() to parse oids one hex character at a
time. This prevents unnecessary copying of hex characters in
extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel%   |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.06 s | 0.05 s | -16.7% |
| Git   |     5 |  230162 |       0 | 0.08 s | 0.08 s |   0.0% |
| Git   |     4 |  154310 |   75852 | 0.07 s | 0.06 s | -14.3% |
| Linux |     1 | 5606645 |       0 | 0.32 s | 0.14 s | -56.3% |
| Linux |    24 | 5606645 |       0 | 1.12 s | 0.94 s | -16.1% |
| Linux |    23 | 5283204 |  323441 | 1.05 s | 0.86 s | -18.1% |
| VSTS  |     1 | 4355923 |       0 | 0.23 s | 0.11 s | -52.2% |
| VSTS  |    32 | 4355923 |       0 | 1.08 s | 0.95 s | -12.0% |
| VSTS  |    31 | 4276829 |   79094 | 2.08 s | 1.82 s | -12.5% |

For the Windows repo running in Windows Subsystem for Linux:

  Pack Files: 50
  Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 4.61 s
  New Time: 4.61 s
  Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel%   |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.08 s | 0.07 s | -12.5% |
| Git   |     5 |  230162 |       0 | 0.13 s | 0.12 s | - 7.7% |
| Git   |     4 |  154310 |   75852 | 0.10 s | 0.09 s | -10.0% |
| Linux |     1 | 5606645 |       0 | 0.32 s | 0.13 s | -59.4% |
| Linux |    24 | 5606645 |       0 | 1.09 s | 0.89 s | -18.3% |
| Linux |    23 | 5283204 |  323441 | 0.99 s | 0.83 s | -16.2% |
| VSTS  |     1 | 4355923 |       0 | 0.25 s | 0.11 s | -56.0% |
| VSTS  |    32 | 4355923 |       0 | 1.15 s | 0.99 s | -13.9% |
| VSTS  |    31 | 4276829 |   79094 | 1.18 s | 1.02 s | -13.6% |

For the Windows repo running in Windows Subsystem for Linux:

  Pack Files: 50
  Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 3.91 s
  New Time: 3.08 s
  Rel %: -21.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..5081aeb71 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,23 @@ struct min_abbrev_data {
 	char *hex;
 };
 
+static inline char get_hex_char_from_oid(const struct object_id *oid,
+					  int pos)
+{
+	static const char hex[] = "0123456789abcdef";
+
+	if ((pos & 1) == 0)
+		return hex[oid->hash[pos >> 1] >> 4];
+	else
+		return hex[oid->hash[pos >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
 	struct min_abbrev_data *mad = cb_data;
 
-	char *hex = oid_to_hex(oid);
 	unsigned int i = mad->init_len;
-	while (mad->hex[i] && mad->hex[i] == hex[i])
+	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;
 
 	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
-- 
2.14.1.538.g56ec8fc98.dirty
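[Editorial note: the nibble arithmetic in get_hex_char_from_oid() is easy to check in isolation. The Python rendition below is illustrative only (the function name is invented); it performs the same indexing as the C code: even positions take the high nibble of byte pos/2, odd positions the low nibble.]

```python
HEX = "0123456789abcdef"

def hex_char_from_hash(hash_bytes, pos):
    """pos-th hex digit of a binary hash, computed without formatting
    the whole object id (mirrors get_hex_char_from_oid())."""
    byte = hash_bytes[pos >> 1]
    nibble = byte >> 4 if (pos & 1) == 0 else byte & 0xF
    return HEX[nibble]
```

Extracting one digit at a time lets the prefix comparison stop at the first mismatch instead of hex-formatting all 20 bytes of every colliding object id.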
[PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's
position (or insertion position, if not present) in the pack-index. The
object immediately before and after (or the one at the insertion
position) give the maximum common prefix. No subsequent linear search is
required.

Take care of which two to inspect, in case the object id exists in the
packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the OID
used for the binary search is padded with zeroes so the object will not
exist in the repo (with high probability) and the same logic applies.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel%   |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.05 s | 0.05 s |   0.0% |
| Git   |     5 |  230162 |       0 | 0.08 s | 0.08 s |   0.0% |
| Git   |     4 |  154310 |   75852 | 0.06 s | 0.06 s |   0.0% |
| Linux |     1 | 5606645 |       0 | 0.14 s | 0.05 s | -64.3% |
| Linux |    24 | 5606645 |       0 | 0.94 s | 0.88 s | - 6.4% |
| Linux |    23 | 5283204 |  323441 | 0.86 s | 0.80 s | - 7.0% |
| VSTS  |     1 | 4355923 |       0 | 0.11 s | 0.05 s | -54.5% |
| VSTS  |    32 | 4355923 |       0 | 0.95 s | 0.95 s |   0.0% |
| VSTS  |    31 | 4276829 |   79094 | 1.82 s | 1.93 s | + 6.0% |

For the Windows repo running in Windows Subsystem for Linux:

  Pack Files: 50
  Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 4.62 s
  New Time: 4.09 s
  Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel%   |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.07 s | 0.07 s |   0.0% |
| Git   |     5 |  230162 |       0 | 0.12 s | 0.12 s |   0.0% |
| Git   |     4 |  154310 |   75852 | 0.09 s | 0.09 s |   0.0% |
| Linux |     1 | 5606645 |       0 | 0.13 s | 0.04 s | -69.2% |
| Linux |    24 | 5606645 |       0 | 0.89 s | 0.85 s | - 4.5% |
| Linux |    23 | 5283204 |  323441 | 0.83 s | 0.78 s | - 6.0% |
| VSTS  |     1 | 4355923 |       0 | 0.11 s | 0.04 s | -63.6% |
| VSTS  |    32 | 4355923 |       0 | 0.99 s | 0.98 s | - 1.0% |
| VSTS  |    31 | 4276829 |   79094 | 1.02 s | 1.04 s | + 2.0% |

For the Windows repo running in Windows Subsystem for Linux:

  Pack Files: 50
  Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 3.08 s
  New Time: 2.72 s
  Rel %: -11.8%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 5081aeb71..54b3a37da 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
 	unsigned int init_len;
 	unsigned int cur_len;
 	char *hex;
+	const unsigned char *hash;
 };
 
 static inline char get_hex_char_from_oid(const struct object_id *oid,
@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }
 
+static void find_abbrev_len_for_pack(struct packed_git *p,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, last, first = 0;
+	struct object_id oid;
+
+	open_pack_index(p);
+	num = p->num_objects;
+	last = num;
+	while (first < last) {
+		uint32_t mid = (first + last) / 2;
+		const unsigned char *current;
+		int cmp;
+
+		current = nth_packed_object_sha1(p, mid);
+		cmp = hashcmp(mad->hash, current);
+		if (!cmp) {
+			match = 1;
+			first = mid;
+			break;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	/*
+	 * first is now the position in the packfile where we would insert
+	 * mad->hash if it does not exist (or the position of mad->hash if
+	 * it does exist). Hence, we consider a maximum of three objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		nth_packed_object_oid(&oid, p, first);
+
[PATCH v3 1/5] test-list-objects: List a subset of object ids
Create test-list-objects helper program to output a random sample of
OIDs present in the repo. If no command line arguments are provided, all
OIDs are output. The last command line argument specifies how many
samples to output. Samples are collected across the entire set of
available OIDs to avoid clustering and are therefore quite uniformly
distributed.

If a command line argument "--missing" is given before the sample count,
then a list of OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile                     |  1 +
 t/helper/.gitignore          |  1 +
 t/helper/test-list-objects.c | 87 ++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index ed4ca438b..50a2eab80 100644
--- a/Makefile
+++ b/Makefile
@@ -652,6 +652,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 7c9d28a83..9696f54bb 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp
diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0..22bc9b4e6
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,87 @@
+#include "cache.h"
+#include "packfile.h"
+
+struct count {
+	int total;
+	int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+		const char *path,
+		void *data)
+{
+	((struct count*)data)->total++;
+	return 0;
+}
+
+int count_packed(const struct object_id *oid,
+		 struct packed_git *pack,
+		 uint32_t pos,
+		 void* data)
+{
+	((struct count*)data)->total++;
+	return 0;
+}
+
+void output(const struct object_id *oid,
+	    struct count *c)
+{
+	if (!(c->total % c->select_mod))
+		printf("%s\n", oid_to_hex(oid));
+	c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+		 const char *path,
+		 void *data)
+{
+	output(oid, (struct count*)data);
+	return 0;
+}
+
+int output_packed(const struct object_id *oid,
+		  struct packed_git *pack,
+		  uint32_t pos,
+		  void* data)
+{
+	output(oid, (struct count*)data);
+	return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+	uint32_t hash_delt = 0xFDB97531;
+	uint32_t hash_base = 0x01020304;
+	int i, n = -1;
+	struct count c;
+	const int u_per_oid = sizeof(struct object_id) / sizeof(uint32_t);
+
+	c.total = 0;
+	if (ac > 1)
+		n = atoi(av[ac - 1]);
+
+	if (ac > 2 && !strcmp(av[1], "--missing")) {
+		while (c.total++ < n) {
+			for (i = 0; i < u_per_oid; i++) {
+				printf("%08x", hash_base);
+				hash_base += hash_delt;
+			}
+			printf("\n");
+		}
+	} else {
+		setup_git_directory();
+
+		if (n > 0) {
+			for_each_loose_object(count_loose, &c, 0);
+			for_each_packed_object(count_packed, &c, 0);
+			c.select_mod = 1 + c.total / n;
+		} else {
+			c.select_mod = 1;
+		}
+
+		for_each_loose_object(output_loose, &c, 0);
+		for_each_packed_object(output_packed, &c, 0);
+	}
+
+	exit(0);
+}
-- 
2.14.1.538.g56ec8fc98.dirty
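[Editorial note: the sampling strategy above is simple modular selection. The Python sketch below is an illustration with an invented name, mirroring the select_mod logic in test-list-objects.c: emit every (1 + total/n)-th object so the sample is spread across the whole set.]

```python
def sample_uniform(objects, n):
    """Emit roughly n items spread evenly across `objects`; with
    n <= 0, emit everything (select_mod of 1 in the C helper)."""
    if n <= 0:
        return list(objects)
    select_mod = 1 + len(objects) // n
    return [o for i, o in enumerate(objects) if i % select_mod == 0]
```

Selecting by position rather than randomly keeps the sample uniform over the ordered object list and avoids clustering, matching the commit message's claim.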
[PATCH v3 0/5] Improve abbreviation disambiguation
Thanks for the feedback on my previous versions, and for patience with
my inexperience on the mailing list. I tried to respond to all feedback,
but decided to not apply one suggestion:

* Removed the "sort -R" from p0008-abbrev.sh, since it is a
  GNU-specific flag. The perf test now considers objects in "discovery"
  order, which improved performance in all versions of the test as
  expected. New perf numbers are provided.

* get_hex_char_from_oid(): renamed `i` to `pos`. I did not unify the
  code in this method with that in hex.c:sha1_to_hex_r due to the extra
  branch in get_hex_char_from_oid and wanting to inline the method. If
  we want to make this method available for other callers, then I would
  be happy to submit a separate patch for this change after the current
  patch is accepted.

* Removed unnecessary includes.

* Use uint32_t instead of unsigned int in test-list-objects.c

* Rearranged arguments in test-list-objects and fixed the n == 0 bug.

---

When displaying object ids, we frequently want to see an abbreviation
for easier typing. That abbreviation must be unambiguous among all
object ids.

The current implementation of find_unique_abbrev() performs a loop
checking if each abbreviation length is unambiguous until finding one
that works. This causes multiple round-trips to the disk when starting
with the default abbreviation length (usually 7) but needing up to 12
characters for an unambiguous short-sha. For very large repos, this
effect is pronounced and causes issues with several commands, from
obvious consumers `status` and `log` to less obvious commands such as
`fetch` and `push`.

This patch improves performance by iterating over objects matching the
short abbreviation only once, inspecting each object id, and reporting
the minimum length of an unambiguous abbreviation.

A helper program `test-list-objects` outputs a sampling of object ids,
which we use as input to a performance test.
A performance helper `test-abbrev` and performance test `p0008-abbrev.sh` are added to demonstrate performance improvements in this area. I include performance test numbers in the commit messages for each change, but I also include the difference between the baseline and the final change here: p0008.1: find_unique_abbrev() for existing objects -- For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM: | | Pack | Packed | Loose | Base | New|| | Repo | Files | Objects | Objects| Time | Time | Rel% | |---|---|-||||| | Git | 1 | 230078 | 0 | 0.09 s | 0.05 s | -44.4% | | Git | 5 | 230162 | 0 | 0.11 s | 0.08 s | -27.3% | | Git | 4 | 154310 | 75852 | 0.09 s | 0.06 s | -33.3% | | Linux | 1 | 5606645 | 0 | 0.13 s | 0.05 s | -61.5% | | Linux |24 | 5606645 | 0 | 1.13 s | 0.88 s | -22.1% | | Linux |23 | 5283204 | 323441 | 1.08 s | 0.80 s | -25.9% | | VSTS | 1 | 4355923 | 0 | 0.12 s | 0.05 s | -58.3% | | VSTS |32 | 4355923 | 0 | 1.02 s | 0.95 s | - 6.9% | | VSTS |31 | 4276829 | 79094 | 2.25 s | 1.93 s | -14.2% | For the Windows repo running in Windows Subsystem for Linux: Pack Files: 50 Packed Objects: 22,385,898 Loose Objects: 492 Base Time: 5.69 s New Time: 4.09 s Rel %: -28.1% p0008.2: find_unique_abbrev() for missing objects - For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM: | | Pack | Packed | Loose | Base | New|| | Repo | Files | Objects | Objects| Time | Time | Rel% | |---|---|-||||| | Git | 1 | 230078 | 0 | 0.66 s | 0.07 s | -89.4% | | Git | 5 | 230162 | 0 | 0.90 s | 0.12 s | -86.7% | | Git | 4 | 154310 | 75852 | 0.79 s | 0.09 s | -88.6% | | Linux | 1 | 5606645 | 0 | 0.48 s | 0.04 s | -91.7% | | Linux |24 | 5606645 | 0 | 4.41 s | 0.85 s | -80.7% | | Linux |23 | 5283204 | 323441 | 4.11 s | 0.78 s | -81.0% | | VSTS | 1 | 4355923 | 0 | 0.46 s | 0.04 s | -91.3% | | VSTS |32 | 4355923 | 0 | 5.40 s | 0.98 s | -81.9% | | VSTS |31 | 4276829 | 
79094 | 5.88 s | 1.04 s | -82.3% | For the Windows repo running in Windows Subsystem for Linux: Pack Files: 50 Packed Objects: 22,385,898 Loose Objects: 492 Base Time: 38.9 s New Time: 2.7 s Rel %: -93.1% Derrick Stolee (5): test-list-objects: List a subset of object ids p0008-abbrev.sh: Test find_unique_abbrev() perf sha1_name: Unroll len loop in find_unique_abbrev_r sha1_name: Parse less while finding common prefix sha1_name: Minimize OID comparisons during disambiguation Makefile
[PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r
Unroll the while loop inside find_unique_abbrev_r() to avoid iterating
through all loose objects and packfiles multiple times when the short
name is longer than the predicted length. Instead, inspect each object
that collides with the estimated abbreviation to find the longest common
prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the
following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |         |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel%    |
|-------|-------|---------|---------|--------|--------|---------|
| Git   |     1 |  230078 |       0 | 0.09 s | 0.06 s | - 33.3% |
| Git   |     5 |  230162 |       0 | 0.11 s | 0.08 s | - 27.3% |
| Git   |     4 |  154310 |   75852 | 0.09 s | 0.07 s | - 22.2% |
| Linux |     1 | 5606645 |       0 | 0.12 s | 0.32 s | +146.2% |
| Linux |    24 | 5606645 |       0 | 1.12 s | 1.12 s | -  0.9% |
| Linux |    23 | 5283204 |  323441 | 1.08 s | 1.05 s | -  2.8% |
| VSTS  |     1 | 4355923 |       0 | 0.12 s | 0.23 s | + 91.7% |
| VSTS  |    32 | 4355923 |       0 | 1.02 s | 1.08 s | +  5.9% |
| VSTS  |    31 | 4276829 |   79094 | 2.25 s | 2.08 s | -  7.6% |

For the Windows repo running in Windows Subsystem for Linux:

  Pack Files: 50
  Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 5.69 s
  New Time: 4.62 s
  Rel %: -18.9%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find
the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel%   |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.66 s | 0.08 s | -87.9% |
| Git   |     5 |  230162 |       0 | 0.90 s | 0.13 s | -85.6% |
| Git   |     4 |  154310 |   75852 | 0.79 s | 0.10 s | -87.3% |
| Linux |     1 | 5606645 |       0 | 0.48 s | 0.32 s | -33.3% |
| Linux |    24 | 5606645 |       0 | 4.41 s | 1.09 s | -75.3% |
| Linux |    23 | 5283204 |  323441 | 4.11 s | 0.99 s | -75.9% |
| VSTS  |     1 | 4355923 |       0 | 0.46 s | 0.25 s | -45.7% |
| VSTS  |    32 | 4355923 |       0 | 5.40 s | 1.15 s | -78.7% |
| VSTS  |    31 | 4276829 |   79094 | 5.88 s | 1.18 s | -79.9% |

For the Windows repo running in Windows Subsystem for Linux:

  Pack Files: 50
  Packed Objects: 22,385,898
  Loose Objects: 492
  Base Time: 39.0 s
  New Time: 3.9 s
  Rel %: -90.0%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
 	return r;
 }
 
-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+	unsigned int init_len;
+	unsigned int cur_len;
+	char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-	int status, exists;
+	struct min_abbrev_data *mad = cb_data;
+
+	char *hex = oid_to_hex(oid);
+	unsigned int i = mad->init_len;
+	while (mad->hex[i] && mad->hex[i] == hex[i])
+		i++;
+
+	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+		mad->cur_len = i + 1;
+
+	return 0;
+}
+
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+	struct disambiguate_state ds;
+	struct min_abbrev_data mad;
+	struct object_id oid_ret;
 	if (len < 0) {
 		unsigned long count = approximate_object_count();
 		/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	sha1_to_hex_r(hex, sha1);
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
-	exists = has_sha1_file(sha1);
-	while (len < GIT_SHA1_HEXSZ) {
-		struct object_id oid_ret;
-		status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-		if (exists
-		    ? !status
-		    : status == SHORT_NAME_NOT_FOUND) {
-			hex[len] = 0;
-			return len;
-		}
-		len++;
-	}
-	return len;
+
+	if (init_object_disambiguation(hex, len, &ds) < 0)
+		return -1;
+
+	mad.init_len = len;
+	mad.cur_len = len;
+	mad.hex = hex;
+
+	ds.fn = extend_abbrev_len;
+	ds.always_call_fn = 1;
+	ds.cb_data = (void *)&mad;
+
+	find_short_object_filename(
Re: [PATCH v2 4/5] sha1_name: Parse less while finding common prefix
My v3 patch is incoming, but I wanted to respond directly to this message. On 9/25/2017 7:42 PM, Stefan Beller wrote: On Mon, Sep 25, 2017 at 2:54 AM, Derrick Stolee <dsto...@microsoft.com> wrote: Create get_hex_char_from_oid() to parse oids one hex character at a time. This prevents unnecessary copying of hex characters in extend_abbrev_len() when finding the length of a common prefix. p0008.1: find_unique_abbrev() for existing objects -- For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM: | | Pack | Packed | Loose | Base | New|| | Repo | Files | Objects | Objects| Time | Time | Rel% | |---|---|-||||| | Git | 1 | 230078 | 0 | 0.08 s | 0.08 s | 0.0% | | Git | 5 | 230162 | 0 | 0.17 s | 0.16 s | - 5.9% | | Git | 4 | 154310 | 75852 | 0.14 s | 0.12 s | -14.3% | | Linux | 1 | 5606645 | 0 | 0.50 s | 0.25 s | -50.0% | | Linux |24 | 5606645 | 0 | 2.41 s | 2.08 s | -13.7% | | Linux |23 | 5283204 | 323441 | 1.99 s | 1.69 s | -15.1% | | VSTS | 1 | 4355923 | 0 | 0.40 s | 0.22 s | -45.0% | | VSTS |32 | 4355923 | 0 | 2.09 s | 1.99 s | - 4.8% | | VSTS |31 | 4276829 | 79094 | 3.60 s | 3.20 s | -11.1% | For the Windows repo running in Windows Subsystem for Linux: Pack Files: 50 Packed Objects: 22,385,898 Loose Objects: 492 Base Time: 4.61 s New Time: 4.61 s Rel %: 0.0% p0008.2: find_unique_abbrev() for missing objects - For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM: | | Pack | Packed | Loose | Base | New|| | Repo | Files | Objects | Objects| Time | Time | Rel% | |---|---|-||||| | Git | 1 | 230078 | 0 | 0.06 s | 0.05 s | -16.7% | | Git | 5 | 230162 | 0 | 0.14 s | 0.15 s | + 7.1% | | Git | 4 | 154310 | 75852 | 0.12 s | 0.12 s | 0.0% | | Linux | 1 | 5606645 | 0 | 0.40 s | 0.17 s | -57.5% | | Linux |24 | 5606645 | 0 | 1.59 s | 1.30 s | -18.2% | | Linux |23 | 5283204 | 323441 | 1.23 s | 1.10 s | -10.6% | | VSTS | 1 | 4355923 | 0 | 0.25 s | 0.12 s | 
-52.0% | | VSTS |32 | 4355923 | 0 | 1.45 s | 1.34 s | - 7.6% | | VSTS |31 | 4276829 | 79094 | 1.59 s | 1.34 s | -15.7% | For the Windows repo running in Windows Subsystem for Linux: Pack Files: 50 Packed Objects: 22,385,898 Loose Objects: 492 Base Time: 3.91 s New Time: 3.08 s Rel %: -21.1% These number look pretty cool! Signed-off-by: Derrick Stolee <dsto...@microsoft.com> Signed-off-by: Derrick Stolee <dsto...@microsoft.com> double signoff? Oops! I'll be more careful with my format-patch in the future. --- sha1_name.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/sha1_name.c b/sha1_name.c index f2a1ebe49..bb47b6702 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -480,13 +480,22 @@ struct min_abbrev_data { char *hex; }; +static inline char get_hex_char_from_oid(const struct object_id *oid, int i) 'i' is not very descriptive, maybe add a comment? (I realize it is just walking through the char*s one by one) I renamed 'i' to 'pos' in my v3. Maybe this function (together with the change in the while below) could go into hex.c as "int progressively_cmp_oids" that returns the position at which the given oids differ? +{ + static const char hex[] = "0123456789abcdef"; + + if ((i & 1) == 0) + return hex[oid->hash[i >> 1] >> 4]; + else + return hex[oid->hash[i >> 1] & 0xf]; +} sha1_to_hex_r has very similar code, though it iterates less and covers both cases in the loop body. That is the actual reason I propose moving this function (or a variant thereof) to hex.c as there we can share code. You're right that sha1_to_hex_r is similar, in fact I based my work on it. There are a few reasons I didn't combine the two implementations: * I wanted to be sure my patch was only touching the code for disambiguating short-shas. Modifying code in hex.c would touch many more code paths. 
* I realize that the extra branch in my version is slower than the branchless loop body in sha1_to_hex_r, so either I would slow that method down or make the method call more complicated by returning two chars at a time.

* I wanted to strongly hint that the method should be inlined, but I'm not sure how to guarantee that happens across a linker boundary without doing strange things in header files.

I'm happy to revisit this after my patch is complete, since I think there are interesting trade-offs to consider.
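For reference, the branchless two-characters-per-byte shape being compared against can be sketched outside git like this (a standalone illustration with generic names; it mirrors what the discussion attributes to sha1_to_hex_r, but is not git's actual hex.c code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Emit two hex characters per byte, so the loop body needs no parity
 * branch; buf must have room for 2 * len + 1 characters. */
static char *bytes_to_hex(char *buf, const unsigned char *hash, size_t len)
{
	static const char hex[] = "0123456789abcdef";
	char *out = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char b = hash[i];
		*out++ = hex[b >> 4];   /* high nibble */
		*out++ = hex[b & 0xf];  /* low nibble */
	}
	*out = '\0';
	return buf;
}
```

The trade-off discussed above is visible here: this version always produces both characters of a byte, while the per-character variant pays a branch to pick one nibble at a time.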
Re: What's cooking in git.git (Sep 2017, #06; Fri, 29)
Hi Junio,

On 9/29/2017 12:34 AM, Junio C Hamano wrote:

* ds/find-unique-abbrev-optim (2017-09-19) 4 commits
 - SQUASH???
 - sha1_name: parse less while finding common prefix
 - sha1_name: unroll len loop in find_unique_abbrev_r()
 - sha1_name: create perf test for find_unique_abbrev()

I'll re-roll my patch on Monday if reviews have stabilized. I think I understand your comments this time (especially around 32-bit ints).

Since I'm new to the list, I'm not sure what the change in messages means here. What does "SQUASH???" mean? Is that why there are three meaningful commits in this note, despite my five-commit patch? Would you like me to squash the commits in v3?

Thanks,
-Stolee
Re: [PATCH v3 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r
On 10/3/2017 6:49 AM, Junio C Hamano wrote:
Derrick Stolee <dsto...@microsoft.com> writes:

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |         |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %   |
|-------|-------|---------|---------|--------|--------|---------|
| Git   |     1 |  230078 |       0 | 0.09 s | 0.06 s | - 33.3% |
| Git   |     5 |  230162 |       0 | 0.11 s | 0.08 s | - 27.3% |
| Git   |     4 |  154310 |   75852 | 0.09 s | 0.07 s | - 22.2% |
| Linux |     1 | 5606645 |       0 | 0.12 s | 0.32 s | +146.2% |
| Linux |    24 | 5606645 |       0 | 1.12 s | 1.12 s | -  0.9% |
| Linux |    23 | 5283204 |  323441 | 1.08 s | 1.05 s | -  2.8% |
| VSTS  |     1 | 4355923 |       0 | 0.12 s | 0.23 s | + 91.7% |
| VSTS  |    32 | 4355923 |       0 | 1.02 s | 1.08 s | +  5.9% |
| VSTS  |    31 | 4276829 |   79094 | 2.25 s | 2.08 s | -  7.6% |

The above does not look so good, especially in cases where a repository is well maintained by packing into a smaller number of packs, we get a much worse result?

I understand that this patch on its own does not have good numbers. I split patches 3 and 4 specifically to highlight two distinct changes:

Patch 3: Unroll the len loop that may inspect all files multiple times.
Patch 4: Parse less while disambiguating.

Patch 4 more than makes up for the performance hits in this patch.
p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.66 s | 0.08 s | -87.9% |
| Git   |     5 |  230162 |       0 | 0.90 s | 0.13 s | -85.6% |
| Git   |     4 |  154310 |   75852 | 0.79 s | 0.10 s | -87.3% |
| Linux |     1 | 5606645 |       0 | 0.48 s | 0.32 s | -33.3% |
| Linux |    24 | 5606645 |       0 | 4.41 s | 1.09 s | -75.3% |
| Linux |    23 | 5283204 |  323441 | 4.11 s | 0.99 s | -75.9% |
| VSTS  |     1 | 4355923 |       0 | 0.46 s | 0.25 s | -45.7% |
| VSTS  |    32 | 4355923 |       0 | 5.40 s | 1.15 s | -78.7% |
| VSTS  |    31 | 4276829 |   79094 | 5.88 s | 1.18 s | -79.9% |

The question is if this is even measuring a relevant workload. How often would we have a full 40-hex object name for which we actually do not have the object, and ask its name to be abbreviated? Compared to it, the result from p0008.1 feels a lot more important. We know we make tons of "abbreviate the object name for this object we have" calls, and we see them every day in our "git log -p" output. Seeing a lot more impressive result from p0008.2 than p0008.1 makes me unsure if this patch is optimizing for the right case. I'll have to see the code a bit deeper before I can comment on it. Thanks.

I agree that p0008.1 is more important, but p0008.2 covers an important cost of the previous implementation. The line

    exists = has_sha1_file(sha1);

will inspect all packfiles and scan the full loose-object directory that would contain the object. The only reason for this call is to determine how to interpret the result of get_short_oid(), but it is a significant portion of the time that is gained here.

Thanks,
-Stolee
Re: [PATCH v3 5/5] sha1_name: Minimize OID comparisons during disambiguation
On 10/3/2017 11:55 AM, Stefan Beller wrote:

@@ -505,6 +506,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }

+static void find_abbrev_len_for_pack(struct packed_git *p,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, last, first = 0;
+	struct object_id oid;
+
+	open_pack_index(p);

coverity complained here with: Calling "open_pack_index" without checking return value (as is done elsewhere 13 out of 15 times).

Good catch! This same line is repeated in unique_in_pack() in this same file, so if this is worth fixing then we should probably fix it there, too.

I think the easiest way out is just a

    if (open_pack_index(p))
        die(_("Cannot open existing pack idx file for '%s'"), p);

or is there another good approach?

You probably intended to have p->pack_name in the die().

Using `cat *.c | grep -A 2 "if (open_pack_index("` and `cat */*.c | grep -A 2 "if (open_pack_index("` I see a few places that return error codes or quietly fail. The cases that use die() are inside builtin/, so I don't think die() is the right choice here. Since find_abbrev_len_for_pack() is intended to extend the abbreviation length only when necessary, I think a silent return is best here:

    if (open_pack_index(p))
        return;

Thanks,
-Stolee
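The "silent return" choice can be modeled in isolation (hypothetical names, not git code): a pack whose index fails to open simply contributes nothing to the required abbreviation length, which stays correct for every pack that did open.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the quiet-skip behavior discussed above: fold the
 * per-pack abbreviation lengths, ignoring packs whose index failed to
 * open (open_ok[i] == 0). All names here are illustrative. */
static unsigned int max_abbrev_over_packs(const int *open_ok,
					  const unsigned int *len, size_t n)
{
	unsigned int best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!open_ok[i])
			continue; /* the silent return */
		if (len[i] > best)
			best = len[i];
	}
	return best;
}
```

A die() at this point would instead abort every command that merely wants to print an abbreviated id, which is why the quiet skip is preferred in the thread.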
[PATCH v2 2/5] p0008-abbrev.sh: Test find_unique_abbrev() perf
Create helper program test-abbrev to compute the minimum length of a disambiguating short-sha for an input list of object ids.

Perf test p0008-abbrev.sh runs test-abbrev for 100,000 object ids. For test 0008.1, these objects exist. For test 0008.2, these objects do not exist in the repo (with high probability). For each test, use `sort -R` to (deterministically) shuffle the sample of object ids so abbreviations are not checked in lexicographic order.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile               |  1 +
 t/helper/.gitignore    |  1 +
 t/helper/test-abbrev.c | 19 +++++++++++++++++++
 t/perf/p0008-abbrev.sh | 22 ++++++++++++++++++++++
 4 files changed, 43 insertions(+)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100755 t/perf/p0008-abbrev.sh

diff --git a/Makefile b/Makefile
index af94b655a..75835c7c0 100644
--- a/Makefile
+++ b/Makefile
@@ -633,6 +633,7 @@ X =

 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))

+TEST_PROGRAMS_NEED_X += test-abbrev
 TEST_PROGRAMS_NEED_X += test-chmtime
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-config

diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 9dd1c9f3c..ab9df8369 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -1,3 +1,4 @@
+/test-abbrev
 /test-chmtime
 /test-ctype
 /test-config

diff --git a/t/helper/test-abbrev.c b/t/helper/test-abbrev.c
new file mode 100644
index 0000000..6866896eb
--- /dev/null
+++ b/t/helper/test-abbrev.c
@@ -0,0 +1,19 @@
+#include "cache.h"
+#include <stdio.h>
+
+int cmd_main(int ac, const char **av)
+{
+	struct object_id oid;
+	char hex[GIT_MAX_HEXSZ + 2];
+	const char *end;
+
+	setup_git_directory();
+
+	while (fgets(hex, GIT_MAX_HEXSZ + 2, stdin)) {
+		hex[GIT_MAX_HEXSZ] = 0;
+		if (!parse_oid_hex(hex, &oid, &end))
+			find_unique_abbrev(oid.hash, MINIMUM_ABBREV);
+	}
+
+	exit(0);
+}

diff --git a/t/perf/p0008-abbrev.sh b/t/perf/p0008-abbrev.sh
new file mode 100755
index 0000000..ba25e7824
--- /dev/null
+++ b/t/perf/p0008-abbrev.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+test_description='Test object disambiguation through abbreviations'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test-list-objects 100000 | sort -R > objs.txt
+
+test_perf 'find_unique_abbrev() for existing objects' '
+	test-abbrev < objs.txt
+'
+
+test-list-objects 100000 --missing | sort -R > objs.txt
+
+test_perf 'find_unique_abbrev() for missing objects' '
+	test-abbrev < objs.txt
+'
+
+rm objs.txt
+
+test_done
--
2.14.1.538.g56ec8fc98.dirty
[PATCH v2 0/5] Improve abbreviation disambiguation
Thanks for the feedback on my v1 patch. Thanks also to Jeff Hostetler for helping me with this v2 patch, which includes an extra performance improvement in commit 5.

Changes since last version:

* Added helper program test-list-objects to construct a list of existing object ids.
* test-abbrev now disambiguates a list of OIDs from stdin.
* p0008-abbrev.sh now has two tests:
  * 0008.1 tests 100,000 known OIDs
  * 0008.2 tests 100,000 missing OIDs
* Added a third performance improvement that uses binary search within packfiles to inspect at most two object ids per packfile.

Thanks,
Stolee

---

When displaying object ids, we frequently want to see an abbreviation for easier typing. That abbreviation must be unambiguous among all object ids.

The current implementation of find_unique_abbrev() performs a loop checking if each abbreviation length is unambiguous until finding one that works. This causes multiple round-trips to the disk when starting with the default abbreviation length (usually 7) but needing up to 12 characters for an unambiguous short-sha. For very large repos, this effect is pronounced and causes issues with several commands, from obvious consumers `status` and `log` to less obvious commands such as `fetch` and `push`.

This patch improves performance by iterating over objects matching the short abbreviation only once, inspecting each object id, and reporting the minimum length of an unambiguous abbreviation.

A helper program `test-list-objects` outputs a sampling of object ids, which we reorder using `sort -R` before using them as input to a performance test. A performance helper `test-abbrev` and performance test `p0008-abbrev.sh` are added to demonstrate performance improvements in this area.
I include performance test numbers in the commit messages for each change, but I also include the difference between the baseline and the final change here:

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.12 s | 0.05 s | -58.3% |
| Git   |     5 |  230162 |       0 | 0.25 s | 0.15 s | -40.0% |
| Git   |     4 |  154310 |   75852 | 0.18 s | 0.11 s | -38.9% |
| Linux |     1 | 5606645 |       0 | 0.32 s | 0.10 s | -68.8% |
| Linux |    24 | 5606645 |       0 | 2.77 s | 2.00 s | -27.8% |
| Linux |    23 | 5283204 |  323441 | 2.86 s | 1.62 s | -43.4% |
| VSTS  |     1 | 4355923 |       0 | 0.27 s | 0.09 s | -66.7% |
| VSTS  |    32 | 4355923 |       0 | 2.41 s | 2.01 s | -16.6% |
| VSTS  |    31 | 4276829 |   79094 | 4.22 s | 3.02 s | -28.4% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 5.69 s
    New Time: 4.09 s
    Rel %: -28.1%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.61 s | 0.04 s | -93.4% |
| Git   |     5 |  230162 |       0 | 1.30 s | 0.15 s | -88.5% |
| Git   |     4 |  154310 |   75852 | 1.07 s | 0.12 s | -88.8% |
| Linux |     1 | 5606645 |       0 | 0.65 s | 0.05 s | -92.3% |
| Linux |    24 | 5606645 |       0 | 7.12 s | 1.28 s | -82.0% |
| Linux |    23 | 5283204 |  323441 | 6.33 s | 0.96 s | -84.8% |
| VSTS  |     1 | 4355923 |       0 | 0.64 s | 0.05 s | -92.2% |
| VSTS  |    32 | 4355923 |       0 | 7.77 s | 1.36 s | -82.5% |
| VSTS  |    31 | 4276829 |   79094 | 7.76 s | 1.36 s | -82.5% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 38.9 s
    New Time: 2.7 s
    Rel %: -93.1%

Derrick Stolee (5):
  test-list-objects: List a subset of object ids
  p0008-abbrev.sh: Test find_unique_abbrev() perf
  sha1_name: Unroll len loop in find_unique_abbrev_r
  sha1_name: Parse less while finding common prefix
  sha1_name: Minimize OID comparisons during disambiguation

 Makefile                     |   2 +
 sha1_name.c                  | 128 ++++++++++++++++++++++++++++-----
 t/helper/.gitignore          |   2 +
 t/helper/test-abbrev.c       |  19 +++++
 t/helper/test-list-objects.c |  85 +++++++++++++++++++++
 t/perf/p0008-abbrev.sh       |  22 ++++++
 6 files changed, 243 insertions(+), 15 deletions(-)
 create mode 100644 t/helper/test-abbrev.c
 create mode 100644 t/helper/test-list-objects.c
 create mode 100755 t/perf/p0008-abbrev.sh
[PATCH v2 4/5] sha1_name: Parse less while finding common prefix
Create get_hex_char_from_oid() to parse oids one hex character at a time. This prevents unnecessary copying of hex characters in extend_abbrev_len() when finding the length of a common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.08 s | 0.08 s |   0.0% |
| Git   |     5 |  230162 |       0 | 0.17 s | 0.16 s | - 5.9% |
| Git   |     4 |  154310 |   75852 | 0.14 s | 0.12 s | -14.3% |
| Linux |     1 | 5606645 |       0 | 0.50 s | 0.25 s | -50.0% |
| Linux |    24 | 5606645 |       0 | 2.41 s | 2.08 s | -13.7% |
| Linux |    23 | 5283204 |  323441 | 1.99 s | 1.69 s | -15.1% |
| VSTS  |     1 | 4355923 |       0 | 0.40 s | 0.22 s | -45.0% |
| VSTS  |    32 | 4355923 |       0 | 2.09 s | 1.99 s | - 4.8% |
| VSTS  |    31 | 4276829 |   79094 | 3.60 s | 3.20 s | -11.1% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 4.61 s
    New Time: 4.61 s
    Rel %: 0.0%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.06 s | 0.05 s | -16.7% |
| Git   |     5 |  230162 |       0 | 0.14 s | 0.15 s | + 7.1% |
| Git   |     4 |  154310 |   75852 | 0.12 s | 0.12 s |   0.0% |
| Linux |     1 | 5606645 |       0 | 0.40 s | 0.17 s | -57.5% |
| Linux |    24 | 5606645 |       0 | 1.59 s | 1.30 s | -18.2% |
| Linux |    23 | 5283204 |  323441 | 1.23 s | 1.10 s | -10.6% |
| VSTS  |     1 | 4355923 |       0 | 0.25 s | 0.12 s | -52.0% |
| VSTS  |    32 | 4355923 |       0 | 1.45 s | 1.34 s | - 7.6% |
| VSTS  |    31 | 4276829 |   79094 | 1.59 s | 1.34 s | -15.7% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 3.91 s
    New Time: 3.08 s
    Rel %: -21.1%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index f2a1ebe49..bb47b6702 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -480,13 +480,22 @@ struct min_abbrev_data {
 	char *hex;
 };

+static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
+{
+	static const char hex[] = "0123456789abcdef";
+
+	if ((i & 1) == 0)
+		return hex[oid->hash[i >> 1] >> 4];
+	else
+		return hex[oid->hash[i >> 1] & 0xf];
+}
+
 static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
 	struct min_abbrev_data *mad = cb_data;

-	char *hex = oid_to_hex(oid);
 	unsigned int i = mad->init_len;
-	while (mad->hex[i] && mad->hex[i] == hex[i])
+	while (mad->hex[i] && mad->hex[i] == get_hex_char_from_oid(oid, i))
 		i++;

 	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
--
2.14.1.538.g56ec8fc98.dirty
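The nibble lookup at the core of this patch can be exercised on its own; the sketch below uses a plain byte array in place of oid->hash (the surrounding git types are omitted), and its output can be checked against any full hex rendering of the same bytes:

```c
#include <assert.h>

/* Return the pos-th hex character of a raw hash without formatting the
 * whole id: even positions take the high nibble of a byte, odd
 * positions the low nibble. Mirrors get_hex_char_from_oid() above,
 * but detached from git's object_id type. */
static char hex_char_at(const unsigned char *hash, unsigned int pos)
{
	static const char hex[] = "0123456789abcdef";

	if ((pos & 1) == 0)
		return hex[hash[pos >> 1] >> 4];   /* high nibble */
	else
		return hex[hash[pos >> 1] & 0xf];  /* low nibble */
}
```

The point of the patch is that a prefix comparison usually stops after a handful of characters, so converting only the characters actually compared beats rendering all 40 up front.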
[PATCH v2 5/5] sha1_name: Minimize OID comparisons during disambiguation
Minimize OID comparisons during disambiguation of packfile OIDs.

Teach git to use binary search with the full OID to find the object's position (or insertion position, if not present) in the pack-index. The object before and immediately after (or the one at the insertion position) give the maximum common prefix. No subsequent linear search is required.

Take care over which two objects to inspect, in case the object id exists in the packfile.

If the input to find_unique_abbrev_r() is a partial prefix, then the OID used for the binary search is padded with zeroes so the object will not exist in the repo (with high probability) and the same logic applies.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.08 s | 0.05 s | -37.5% |
| Git   |     5 |  230162 |       0 | 0.16 s | 0.15 s | - 6.3% |
| Git   |     4 |  154310 |   75852 | 0.12 s | 0.11 s | - 8.3% |
| Linux |     1 | 5606645 |       0 | 0.25 s | 0.10 s | -60.0% |
| Linux |    24 | 5606645 |       0 | 2.08 s | 2.00 s | - 3.8% |
| Linux |    23 | 5283204 |  323441 | 1.69 s | 1.62 s | - 4.1% |
| VSTS  |     1 | 4355923 |       0 | 0.22 s | 0.09 s | -59.1% |
| VSTS  |    32 | 4355923 |       0 | 1.99 s | 2.01 s | + 1.0% |
| VSTS  |    31 | 4276829 |   79094 | 3.20 s | 3.02 s | - 5.6% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 4.62 s
    New Time: 4.09 s
    Rel %: -11.5%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.05 s | 0.04 s | -20.0% |
| Git   |     5 |  230162 |       0 | 0.15 s | 0.15 s |   0.0% |
| Git   |     4 |  154310 |   75852 | 0.12 s | 0.12 s |   0.0% |
| Linux |     1 | 5606645 |       0 | 0.17 s | 0.05 s | -70.6% |
| Linux |    24 | 5606645 |       0 | 1.30 s | 1.28 s | - 1.5% |
| Linux |    23 | 5283204 |  323441 | 1.10 s | 0.96 s | -12.7% |
| VSTS  |     1 | 4355923 |       0 | 0.12 s | 0.05 s | -58.3% |
| VSTS  |    32 | 4355923 |       0 | 1.34 s | 1.36 s | + 1.5% |
| VSTS  |    31 | 4276829 |   79094 | 1.34 s | 1.36 s | + 1.5% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 3.08 s
    New Time: 2.72 s
    Rel %: -11.8%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index bb47b6702..1566cd4fc 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -478,6 +478,7 @@ struct min_abbrev_data {
 	unsigned int init_len;
 	unsigned int cur_len;
 	char *hex;
+	const unsigned char *hash;
 };

 static inline char get_hex_char_from_oid(const struct object_id *oid, int i)
@@ -504,6 +505,65 @@ static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 	return 0;
 }

+static void find_abbrev_len_for_pack(struct packed_git *p,
+				     struct min_abbrev_data *mad)
+{
+	int match = 0;
+	uint32_t num, last, first = 0;
+	struct object_id oid;
+
+	open_pack_index(p);
+	num = p->num_objects;
+	last = num;
+	while (first < last) {
+		uint32_t mid = (first + last) / 2;
+		const unsigned char *current;
+		int cmp;
+
+		current = nth_packed_object_sha1(p, mid);
+		cmp = hashcmp(mad->hash, current);
+		if (!cmp) {
+			match = 1;
+			first = mid;
+			break;
+		}
+		if (cmp > 0) {
+			first = mid + 1;
+			continue;
+		}
+		last = mid;
+	}
+
+	/*
+	 * first is now the position in the packfile where we would insert
+	 * mad->hash if it does not exist (or the position of mad->hash if
+	 * it does exist). Hence, we consider a maximum of three objects
+	 * nearby for the abbreviation length.
+	 */
+	mad->init_len = 0;
+	if (!match) {
+		nth_packed_object_oid(&oid, p, first);
+
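The neighbor-inspection idea behind this patch can be demonstrated on a sorted array of hex strings (a standalone model, not git's pack-index code): in sorted order, only the entries adjacent to the query's insertion position can share its longest prefix, so one binary search replaces a scan of every matching object.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static size_t common_prefix(const char *a, const char *b)
{
	size_t i = 0;
	while (a[i] && b[i] && a[i] == b[i])
		i++;
	return i;
}

/* Minimum unambiguous prefix length for query among sorted names:
 * binary-search for the insertion position, then inspect only the
 * predecessor and successor (skipping the query itself if present). */
static size_t min_unique_len(const char **sorted, size_t n, const char *query)
{
	size_t lo = 0, hi = n, max = 0, c;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (strcmp(sorted[mid], query) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo > 0) {                       /* predecessor */
		c = common_prefix(sorted[lo - 1], query);
		if (c > max)
			max = c;
	}
	if (lo < n && !strcmp(sorted[lo], query))
		lo++;                       /* query exists; look past it */
	if (lo < n) {                       /* successor */
		c = common_prefix(sorted[lo], query);
		if (c > max)
			max = c;
	}
	return max + 1;                     /* one char past any collision */
}
```

This is why the commit message says "a maximum of three objects nearby": the hit itself plus its two neighbors bound the answer for that pack.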
[PATCH v2 3/5] sha1_name: Unroll len loop in find_unique_abbrev_r
Unroll the while loop inside find_unique_abbrev_r to avoid iterating through all loose objects and packfiles multiple times when the short name is longer than the predicted length. Instead, inspect each object that collides with the estimated abbreviation to find the longest common prefix.

p0008.1: find_unique_abbrev() for existing objects
--------------------------------------------------

For 10 repeated tests, each checking 100,000 known objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.12 s | 0.08 s | -33.3% |
| Git   |     5 |  230162 |       0 | 0.25 s | 0.17 s | -32.0% |
| Git   |     4 |  154310 |   75852 | 0.18 s | 0.14 s | -22.2% |
| Linux |     1 | 5606645 |       0 | 0.32 s | 0.50 s | +56.2% |
| Linux |    24 | 5606645 |       0 | 2.77 s | 2.41 s | -13.0% |
| Linux |    23 | 5283204 |  323441 | 2.86 s | 1.99 s | -30.4% |
| VSTS  |     1 | 4355923 |       0 | 0.27 s | 0.40 s | +48.1% |
| VSTS  |    32 | 4355923 |       0 | 2.41 s | 2.09 s | -13.3% |
| VSTS  |    31 | 4276829 |   79094 | 4.22 s | 3.60 s | -14.7% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 5.69 s
    New Time: 4.62 s
    Rel %: -18.9%

p0008.2: find_unique_abbrev() for missing objects
-------------------------------------------------

For 10 repeated tests, each checking 100,000 missing objects, we find the following results when running in a Linux VM:

|       | Pack  | Packed  | Loose   | Base   | New    |        |
| Repo  | Files | Objects | Objects | Time   | Time   | Rel %  |
|-------|-------|---------|---------|--------|--------|--------|
| Git   |     1 |  230078 |       0 | 0.61 s | 0.06 s | -90.2% |
| Git   |     5 |  230162 |       0 | 1.30 s | 0.14 s | -89.2% |
| Git   |     4 |  154310 |   75852 | 1.07 s | 0.12 s | -88.8% |
| Linux |     1 | 5606645 |       0 | 0.65 s | 0.40 s | -38.5% |
| Linux |    24 | 5606645 |       0 | 7.12 s | 1.59 s | -77.7% |
| Linux |    23 | 5283204 |  323441 | 6.33 s | 1.23 s | -80.6% |
| VSTS  |     1 | 4355923 |       0 | 0.64 s | 0.25 s | -60.9% |
| VSTS  |    32 | 4355923 |       0 | 7.77 s | 1.45 s | -81.3% |
| VSTS  |    31 | 4276829 |   79094 | 7.76 s | 1.59 s | -79.5% |

For the Windows repo running in Windows Subsystem for Linux:

    Pack Files: 50
    Packed Objects: 22,385,898
    Loose Objects: 492
    Base Time: 39.0 s
    New Time: 3.9 s
    Rel %: -90.0%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_name.c | 57 ++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 134ac9742..f2a1ebe49 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -474,10 +474,32 @@ static unsigned msb(unsigned long val)
 	return r;
 }

-int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+struct min_abbrev_data {
+	unsigned int init_len;
+	unsigned int cur_len;
+	char *hex;
+};
+
+static int extend_abbrev_len(const struct object_id *oid, void *cb_data)
 {
-	int status, exists;
+	struct min_abbrev_data *mad = cb_data;
+
+	char *hex = oid_to_hex(oid);
+	unsigned int i = mad->init_len;
+	while (mad->hex[i] && mad->hex[i] == hex[i])
+		i++;
+
+	if (i < GIT_MAX_RAWSZ && i >= mad->cur_len)
+		mad->cur_len = i + 1;
+
+	return 0;
+}
+
+int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
+{
+	struct disambiguate_state ds;
+	struct min_abbrev_data mad;
+	struct object_id oid_ret;
 	if (len < 0) {
 		unsigned long count = approximate_object_count();
 		/*
@@ -503,19 +525,24 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	sha1_to_hex_r(hex, sha1);
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
-	exists = has_sha1_file(sha1);
-	while (len < GIT_SHA1_HEXSZ) {
-		struct object_id oid_ret;
-		status = get_short_oid(hex, len, &oid_ret, GET_OID_QUIETLY);
-		if (exists
-		    ? !status
-		    : status == SHORT_NAME_NOT_FOUND) {
-			hex[len] = 0;
-			return len;
-		}
-		len++;
-	}
-	return len;
+
+	if (init_object_disambiguation(hex, len, &ds) < 0)
+		return -1;
+
+	mad.init_len = len;
+	mad.cur_len = len;
+	mad.hex = hex;
+
+	ds.fn = extend_abbrev_len;
+	ds.always_call_fn = 1;
+
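The per-collision update in extend_abbrev_len() can be isolated as a pure function (a standalone restatement of the callback above, with the query and candidate as hex strings and the struct fields as parameters):

```c
#include <assert.h>

/* Given one colliding object name, grow the required abbreviation
 * length: cur_len must reach one character past the longest prefix the
 * candidate shares with the query. init_len lets the caller skip
 * characters already known to match. */
static unsigned int extend_len(const char *query, const char *candidate,
			       unsigned int init_len, unsigned int cur_len)
{
	unsigned int i = init_len;

	while (query[i] && query[i] == candidate[i])
		i++;
	if (i >= cur_len)
		cur_len = i + 1;	/* one past the collision */
	return cur_len;
}
```

Folding this over every object that matches the initial guess yields the final unambiguous length in a single pass, which is exactly what replaces the old retry-with-longer-len loop.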
[PATCH v2 1/5] test-list-objects: List a subset of object ids
Create test-list-objects helper program to output a random sample of OIDs present in the repo. If no command line arguments are provided, all OIDs are output. The first command line argument specifies how many samples to output. Samples are collected across the entire set of available OIDs to avoid clustering, and are therefore quite uniformly distributed.

If a second command line argument "--missing" is given, then a list of OIDs is generated without examining the repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile                     |  1 +
 t/helper/.gitignore          |  1 +
 t/helper/test-list-objects.c | 85 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 87 insertions(+)
 create mode 100644 t/helper/test-list-objects.c

diff --git a/Makefile b/Makefile
index f2bb7f2f6..af94b655a 100644
--- a/Makefile
+++ b/Makefile
@@ -647,6 +647,7 @@ TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-lazy-init-name-hash
 TEST_PROGRAMS_NEED_X += test-line-buffer
+TEST_PROGRAMS_NEED_X += test-list-objects
 TEST_PROGRAMS_NEED_X += test-match-trees
 TEST_PROGRAMS_NEED_X += test-mergesort
 TEST_PROGRAMS_NEED_X += test-mktemp

diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index 721650256..9dd1c9f3c 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -13,6 +13,7 @@
 /test-index-version
 /test-lazy-init-name-hash
 /test-line-buffer
+/test-list-objects
 /test-match-trees
 /test-mergesort
 /test-mktemp

diff --git a/t/helper/test-list-objects.c b/t/helper/test-list-objects.c
new file mode 100644
index 0000000..83b1250fe
--- /dev/null
+++ b/t/helper/test-list-objects.c
@@ -0,0 +1,85 @@
+#include "cache.h"
+#include "packfile.h"
+#include <stdio.h>
+
+struct count {
+	int total;
+	int select_mod;
+};
+
+int count_loose(const struct object_id *oid,
+		const char *path,
+		void *data)
+{
+	((struct count*)data)->total++;
+	return 0;
+}
+
+int count_packed(const struct object_id *oid,
+		 struct packed_git *pack,
+		 uint32_t pos,
+		 void* data)
+{
+	((struct count*)data)->total++;
+	return 0;
+}
+
+void output(const struct object_id *oid,
+	    struct count *c)
+{
+	if (!(c->total % c->select_mod))
+		printf("%s\n", oid_to_hex(oid));
+	c->total--;
+}
+
+int output_loose(const struct object_id *oid,
+		 const char *path,
+		 void *data)
+{
+	output(oid, (struct count*)data);
+	return 0;
+}
+
+int output_packed(const struct object_id *oid,
+		  struct packed_git *pack,
+		  uint32_t pos,
+		  void* data)
+{
+	output(oid, (struct count*)data);
+	return 0;
+}
+
+int cmd_main(int ac, const char **av)
+{
+	unsigned int hash_delt = 0xFDB97531;
+	unsigned int hash_base = 0x01020304;
+	int i, n = 0;
+	struct count c;
+	const int u_per_oid = sizeof(struct object_id) / sizeof(unsigned int);
+
+	c.total = 0;
+	if (ac > 1)
+		n = atoi(av[1]);
+
+	if (ac > 2 && !strcmp(av[2], "--missing")) {
+		while (c.total++ < n) {
+			for (i = 0; i < u_per_oid; i++) {
+				printf("%08x", hash_base);
+				hash_base += hash_delt;
+			}
+			printf("\n");
+		}
+	} else {
+		setup_git_directory();
+
+		for_each_loose_object(count_loose, &c, 0);
+		for_each_packed_object(count_packed, &c, 0);
+
+		c.select_mod = 1 + c.total / n;
+
+		for_each_loose_object(output_loose, &c, 0);
+		for_each_packed_object(output_packed, &c, 0);
+	}
+
+	exit(0);
+}
--
2.14.1.538.g56ec8fc98.dirty
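The two-pass sampling scheme in this helper (count everything, then emit every select_mod-th object) can be modeled without the git callbacks; the sketch below records which object indices the helper's countdown would select:

```c
#include <assert.h>

/* Toy model of test-list-objects sampling: given `total` objects and a
 * request for roughly `want` samples, pick every select_mod-th one by
 * counting down, as the helper's output() does. out[] receives the
 * selected 0-based object indices; returns how many were picked. */
static int sample_indices(int total, int want, int *out)
{
	int select_mod = 1 + total / want;
	int remaining, picked = 0;

	for (remaining = total; remaining > 0; remaining--)
		if (remaining % select_mod == 0)
			out[picked++] = total - remaining;
	return picked;
}
```

Because select_mod rounds up, the sample size comes out at or slightly below the request, with picks spread evenly across the whole range rather than clustered at the front.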
Re: [PATCH v2 1/5] test-list-objects: List a subset of object ids
On 10/6/2017 10:11 AM, Jeff King wrote:
On Thu, Oct 05, 2017 at 08:39:42AM -0400, Derrick Stolee wrote:

I'll run some perf numbers for these commands you recommend, and also see if I can replicate some of the pain points that triggered this change using the Linux repo.

Thanks!
-Peff

In my local copy, I added a test to p4211-line-log.sh that runs "git log --raw -r" and tested it on three copies of the Linux repo. In order, they have 1 packfile (0 loose), 24 packfiles (0 loose), and 23 packfiles (~324,000 loose).

    4211.6: git log --raw -r    43.34(42.62+0.65)    40.47(40.16+0.27)   -6.6%
    4211.6: git log --raw -r    88.77(86.54+2.12)    82.44(81.87+0.52)   -7.1%
    4211.6: git log --raw -r   108.86(103.97+4.81)  103.92(100.63+3.19)  -4.5%

We have moderate performance gains for this command, despite the command doing many more things than just checking abbreviations.

I plan to re-roll my patch on Monday including the following feedback items:

* Remove test-list-objects and test-abbrev in favor of a new "git log" performance test.
* Fix binary search overflow error.
* Check the response from open_pack_index(p) in find_abbrev_len_for_pack(). I plan to return without failure on a non-zero result, so a bad pack does not cause a failure and the abbreviation length will be the minimum required among all valid packs. (Thanks Stefan!)

I made note of a few things, but will not include them in my re-roll. I'll revisit them later if they are valuable:

- nth_packed_object_sha1() could be simplified in find_abbrev_len_for_pack().
- Teach 'cat-file' to --batch-check='%(objectsize:short)'. (Peff already included a patch; perhaps that could be reviewed separately.)
- Ramsay caught my lack of "static" in test-list-objects.c, but that file will be removed in the next patch. I'll make sure to use "static" in the future.

I'm not re-rolling immediately to allow for some extra review over the weekend, if anyone is so inclined.

Thanks,
-Stolee
Re: [PATCH v4 4/4] sha1_name: minimize OID comparisons during disambiguation
On 10/10/2017 9:30 AM, Jeff King wrote:
On Tue, Oct 10, 2017 at 09:11:15AM -0400, Derrick Stolee wrote:
On 10/10/2017 8:56 AM, Junio C Hamano wrote:
Jeff King <p...@peff.net> writes:

OK, I think that makes more sense. But note the p->num_objects thing I mentioned. If I do:

    git pack-objects .git/objects/pack/pack </dev/null

then the resulting pack-file contains zero objects, and the binary search needs to bail out early with:

    if (!p->num_objects)
        return;

Technically that also covers open_pack_index() failure, too, but that's a subtlety I don't think we should rely on.

True. I notice that the early part of the two functions look almost identical. Do we need error condition handling for the other one, too?

I prefer to fix the problem in all code clones when they cause review friction, so I'll send a fifth commit showing just the diff for these packfile issues in sha1_name.c. See patch below.

Ah, that answers my earlier question. Junio meant unique_in_pack(). And yeah, I think it suffers from the same problem.

Should open_pack_index() return a non-zero status if the packfile is empty? Or is there a meaningful reason to have empty packfiles?

I can't think of a compelling reason to have an empty packfile. But nor do I think we should consider them an error when we can handle them quietly (and returning non-zero status would cause Git to complain on many operations in a repository that has such a file).

-Peff

Thanks for the comments. I found some typos in my commit messages, too. I plan to send a (hopefully) final version tomorrow (Thursday). It will include:

* Make 'pos' unsigned in get_hex_char_from_oid()
* Check the response from open_pack_index()
* Small typos in commit messages

If there are other issues, then please let me know.

Thanks,
-Stolee
Re: cherry-pick very slow on big repository
On 11/10/2017 7:37 AM, Peter Krefting wrote:
Jeff King:

Can you get a backtrace? I'd do something like:

Seems that it spends most time in diffcore_count_changes(); that is where it hits whenever I hit Ctrl+C (various line numbers 199-207 in diffcore-delta.c; this is on the v2.15.0 tag).

(gdb) bt
#0  diffcore_count_changes (src=src@entry=0x5db99970, dst=dst@entry=0x5d6a4810,
    src_count_p=src_count_p@entry=0x5db8,
    dst_count_p=dst_count_p@entry=0x5d6a4838,
    src_copied=src_copied@entry=0x7fffd3e0,
    literal_added=literal_added@entry=0x7fffd3f0) at diffcore-delta.c:203
#1  0x556dee1a in estimate_similarity (minimum_score=3, dst=0x5d6a4810,
    src=0x5db99970) at diffcore-rename.c:193
#2  diffcore_rename (options=options@entry=0x7fffd4f0) at diffcore-rename.c:560
#3  0x55623d83 in diffcore_std (options=options@entry=0x7fffd4f0) at diff.c:5846
...

Git is spending time detecting renames, which implies you probably renamed a folder or added and deleted a large number of files. This rename detection is quadratic (# adds times # deletes). You can disable this rename detection by running your cherry-pick with:

    git -c diff.renameLimit=1 cherry-pick ...

See https://git-scm.com/docs/diff-config#diff-config-diffrenameLimit

Thanks,
-Stolee
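The quadratic cost named above can be made concrete with a small model (an illustration of the adds-times-deletes blowup, not git's actual rename-limit logic, which has more cases):

```c
#include <assert.h>

/* Inexact rename detection compares every deleted path against every
 * added path, so the work grows as deletes * adds. A limit-style gate
 * caps that product; names here are illustrative. */
static unsigned long rename_comparisons(unsigned deletes, unsigned adds)
{
	return (unsigned long)deletes * adds;
}

static int within_rename_limit(unsigned deletes, unsigned adds,
			       unsigned limit)
{
	return rename_comparisons(deletes, adds) <=
	       (unsigned long)limit * limit;
}
```

A renamed top-level folder in a large repo can easily turn into tens of thousands of adds and deletes, which is why the comparison count, and the cherry-pick, explodes.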
Re: [PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()
On 12/4/2017 11:56 AM, Jeff King wrote: When you put your cover letter first, you need to use "scissors" like: -- >8 -- to separate it from the commit message. Using three-dashes means "git am" will include your cover letter as the commit message, and omit your real commit message entirely. Thanks. I updated our internal best-practices accordingly. -Stolee
Re: [PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()
On 12/1/2017 1:22 PM, Jeff King wrote: On Fri, Dec 01, 2017 at 12:49:56PM -0500, Derrick Stolee wrote: [snip]

    diff --git a/sha1_file.c b/sha1_file.c
    index 8ae6cb6285..2160323c4a 100644

This overall looks good, but I noticed one bug and a few cosmetic improvements.

Thanks for finding quality improvements to this patch. I'll let it sit over the weekend for more feedback before sending a v2.

    --- a/sha1_file.c
    +++ b/sha1_file.c
    @@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
     	}
     	oid.hash[0] = subdir_nr;
    +	strbuf_add(path, "/", 1);

Maybe:

    strbuf_addch(path, '/');

would be a bit more readable (it's also a tiny bit faster, but this isn't part of the tight loop).

    +	baselen = path->len;

We set this here so that the '/' is included as part of the base.

Makes sense, but can we now drop the earlier setting of baselen before the opendir() call?

Yeah, probably. I had briefly considered just adding the '/' before the first assignment of "baselen", but didn't want to change the error output. I also don't know if there are side effects for other platforms by calling opendir() with a '/'-terminated path.

    	while ((de = readdir(dir))) {
    		if (is_dot_or_dotdot(de->d_name))
    			continue;
    -		strbuf_setlen(path, baselen);
    -		strbuf_addf(path, "/%s", de->d_name);
    -
    		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
    		    !hex_to_bytes(oid.hash + 1, de->d_name, GIT_SHA1_RAWSZ - 1)) {
    +			strbuf_setlen(path, baselen);
    +			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);

Moving this code into the conditional makes sense here, since that's where we know the actual size. But what about the case when we _don't_ hit this conditional? We still do:

    if (cruft_cb)
    	cruft_cb(path->buf);

which is now broken (it will just get the directory name without the entry appended).

Good catch! A big reason to pull it inside and use strbuf_add over strbuf_addstr is to avoid a duplicate strlen() calculation. However, I can store the length before the conditional.
I think the optimized version probably needs to be something more like:

    while (de = readdir(...)) {
    	strbuf_setlen(path, baselen);
    	if (size is HEXSZ - 2) {
    		strbuf_add(path, de->d_name, HEXSZ - 2);
    		obj_cb(path->buf);
    	} else if (cruft_cb) {
    		strbuf_addstr(path, de->d_name);
    		cruft_cb(path->buf);
    	}
    }

Small change by storing the length in advance of the conditional:

    while (de = readdir(...)) {
    	int namelen = strlen(de->d_name);
    	strbuf_setlen(path, baselen);
    	strbuf_add(path, de->d_name, namelen);
    	if (namelen == HEXSZ - 2)
    		obj_cb(path->buf)
    	else
    		cruft_cb(path->buf);
    }

Two other thoughts:

- from past discussions, I suspect most of your performance improvement actually comes from avoiding addf(), and the difference between addstr() and add() may be negligible here. It might be worth timing strbuf_addstr(). If that's similarly fast, that means we could keep the logic out of the conditional entirely.

addstr() duplicates the strlen(), which isn't much, but we might as well save cycles where we can if it isn't too hard.

- there's an extra micro-optimization there, which is that if there's no obj_cb, we have no need to assemble the full path at all. I doubt it makes much of a difference, as most callers would pass an object callback (I'd be surprised if we even have one that doesn't).

After doing a few 'git grep' commands, I found several that include a NULL cruft_cb but none that have a NULL obj_cb.

Thanks,
-Stolee
Re: Question about the ahead-behind computation and status
On 12/15/2017 10:08 AM, Jeff Hostetler wrote: On 12/15/2017 5:08 AM, Jeff King wrote: On Thu, Dec 14, 2017 at 04:49:31PM -0500, Jeff Hostetler wrote:

[*] Sadly, the local repo was only about 20 days out of date (including the Thanksgiving holidays)

Taking 20 seconds to traverse 20 days' worth of history sounds pretty awful. How many commits is it?

150,000 commits, give or take. The graph is many thousands of lanes wide because of the merges, and that isn't helping. (But I should give you folks lots of credit -- it *only* took 20 seconds to find, unzip, and decode 150,000 commit objects and walk the commit graph.)

To give a few more data points, I created a similar situation by checking out a big repo I hadn't updated in three months; it was 16,000 commits behind. That took 1.5 s to calculate the ahead/behind. Moving it to 100,000 commits behind, it took 5 s. This repo has about 300,000 total commits versus the 1.5 million commits in the Windows repo.

The biggest reason for the 20 seconds is not just the number of commits in the ahead/behind but how many commits are walked (including those common to both branches) before paint_down_to_common() breaks its while loop due to queue_has_nonstale().

Thanks,
-Stolee
Re: Question about the ahead-behind computation and status
On 12/15/2017 1:30 PM, Junio C Hamano wrote: Derrick Stolee <sto...@gmail.com> writes:

The biggest reason for the 20 seconds is not just the number of commits in the ahead/behind but how many commits are walked (including those common to both branches) before paint_down_to_common() breaks its while loop due to queue_has_nonstale().

Hmm, queue_has_nonstale() looks to see if any element is not STALE (where the definition of STALE is "known to be a common ancestor") by potentially checking all elements in the queue. I wonder if we have an opportunity for a trivial optimization? When the caller knows that it dug one level and added the parents that are not stale, it does not have to ask queue_has_nonstale() if there is any non-stale element, for example.

I thought this, too, but my tracing did not show significant time spent in this method. 99% of the time is spent unzipping and parsing commits. If this was taking too long, then we could track a minimum timestamp for a commit that entered the queue in a non-stale state, but this would delay the termination condition a bit since commits can be marked stale after they enter the queue.

What exactly do you mean by "not just the number of commits in the ahead/behind"? Isn't the number of these commits pretty much proportional to the number of commits we need to paint down to common ancestors? Is the latter a lot larger than the former (i.e. are we somehow not stopping when we _could_ notice that we can with better information)?

With the wide history, there is actually a large set of commits that are in the common history but have newer commit times than the oldest commit in only one history. Consider the following ASCII art:

    A
    |
    1
    |
    2
    |
    3
    |\
    4 B
    |\|
    5 C
    |\|
    6 D
    |\|
    . .
    . .
    |\|
    N Z
    |/
    0

Between A and B, A is ahead by commits {A,1,2,3,4,5,6,...,N}. Meanwhile, commits B,C,D,...,Z are in the common history, but still must be walked.
Now imagine these two sets are actually much MUCH wider (thousands of commits that are pairwise non-ancestors). This causes the number of walked commits to be larger than just the number of commits in the symmetric difference of the histories. Unfortunately, generation numbers are not the only optimization needed to make this call be sub-100ms. A graph data structure that lists the edges between commits would prevent the time spent in zlib decompressing and parsing commits. I'm working on investigating how such a data structure (and corresponding file format) could integrate in the commit-walking code. Thanks, -Stolee
[PATCH v2] sha1_file: use strbuf_add() instead of strbuf_addf()
Thanks for the feedback on v1. This version fixes the cruft_cb bug and streamlines the strlen(). I would include an inter-diff but it was the same size as the patch. Thanks, -Stolee --- Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir(). Since we already check the length and hex-values of the string before consuming the path, we can prevent extra computation by using the lower- level method. One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects. Most repositories do not have many loose objects before repacking, but in the GVFS case the repos can grow to have millions of loose objects. Profiling 'git log' performance in GitForWindows on a GVFS-enabled repo with ~2.5 million loose objects revealed 12% of the CPU time was spent in strbuf_addf(). Add a new performance test to p4211-line-log.sh that is more sensitive to this cache-loading. By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager. 
For a copy of the Linux repo with two ~512 MB packfiles and ~572K loose objects, running 'git log --oneline --parents --raw -1000' had the following performance:

    HEAD~1            HEAD
    7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_file.c              | 12 +++++++-----
 t/perf/p4211-line-log.sh |  4 ++++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2fc8fa93b4 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1903,7 +1903,6 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	origlen = path->len;
 	strbuf_complete(path, '/');
 	strbuf_addf(path, "%02x", subdir_nr);
-	baselen = path->len;
 
 	dir = opendir(path->buf);
 	if (!dir) {
@@ -1914,15 +1913,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	}
 
 	oid.hash[0] = subdir_nr;
+	strbuf_addch(path, '/');
+	baselen = path->len;
 
 	while ((de = readdir(dir))) {
+		size_t namelen;
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
+		namelen = strlen(de->d_name);
 		strbuf_setlen(path, baselen);
-		strbuf_addf(path, "/%s", de->d_name);
-
-		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
+		strbuf_add(path, de->d_name, namelen);
+		if (namelen == GIT_SHA1_HEXSZ - 2 &&
 		    !hex_to_bytes(oid.hash + 1, de->d_name, GIT_SHA1_RAWSZ - 1)) {
 			if (obj_cb) {
@@ -1941,7 +1943,7 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	}
 	closedir(dir);
 
-	strbuf_setlen(path, baselen);
+	strbuf_setlen(path, baselen - 1);
 	if (!r && subdir_cb)
 		r = subdir_cb(subdir_nr, path->buf, data);
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
 	git log --oneline --raw --parents >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents -1000' '
+	git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0
Re: [PATCH] builtin/tag.c: return appropriate value when --points-at finds an empty list
On 12/11/2017 8:44 AM, George Papanikolaou wrote:

`git tag --points-at` can simply return if the given rev does not have any tags pointing to it. It's not a failure but it shouldn't return with 0 value.

I disagree. I think the 0 return means "I completed successfully" and the empty output means "I didn't find any tags pointing to this object." Changing the return value here could break a lot of scripts out in the wild, and I consider this to be an "API" compatibility that needs to stay as-is.

Where are you using "--points-at" such that you need a nonzero exit code instead of a different indicator?

Thanks,
-Stolee
[PATCH] sha1_file: use strbuf_add() instead of strbuf_addf()
Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir(). Since we already check the length and hex-values of the string before consuming the path, we can prevent extra computation by using the lower- level method. One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects. Most repositories do not have many loose objects before repacking, but in the GVFS case the repos can grow to have millions of loose objects. Profiling 'git log' performance in GitForWindows on a GVFS-enabled repo with ~2.5 million loose objects revealed 12% of the CPU time was spent in strbuf_addf(). Add a new performance test to p4211-line-log.sh that is more sensitive to this cache-loading. By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager. 
For a copy of the Linux repo with two ~512 MB packfiles and ~572K loose objects, running 'git log --oneline --raw --parents -1000' had the following performance:

    HEAD~1            HEAD
    7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1_file.c              | 7 ++++---
 t/perf/p4211-line-log.sh | 4 ++++
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 8ae6cb6285..2160323c4a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1914,17 +1914,18 @@ int for_each_file_in_obj_subdir(unsigned int subdir_nr,
 	}
 
 	oid.hash[0] = subdir_nr;
+	strbuf_add(path, "/", 1);
+	baselen = path->len;
 
 	while ((de = readdir(dir))) {
 		if (is_dot_or_dotdot(de->d_name))
 			continue;
 
-		strbuf_setlen(path, baselen);
-		strbuf_addf(path, "/%s", de->d_name);
-
 		if (strlen(de->d_name) == GIT_SHA1_HEXSZ - 2 &&
 		    !hex_to_bytes(oid.hash + 1, de->d_name, GIT_SHA1_RAWSZ - 1)) {
+			strbuf_setlen(path, baselen);
+			strbuf_add(path, de->d_name, GIT_SHA1_HEXSZ - 2);
 			if (obj_cb) {
 				r = obj_cb(&oid, path->buf, data);
 				if (r)
diff --git a/t/perf/p4211-line-log.sh b/t/perf/p4211-line-log.sh
index e0ed05907c..392bcc0e51 100755
--- a/t/perf/p4211-line-log.sh
+++ b/t/perf/p4211-line-log.sh
@@ -35,4 +35,8 @@ test_perf 'git log --oneline --raw --parents' '
 	git log --oneline --raw --parents >/dev/null
 '
 
+test_perf 'git log --oneline --raw --parents -1000' '
+	git log --oneline --raw --parents -1000 >/dev/null
+'
+
 test_done
-- 
2.15.0
[RFC] 'unsigned long' to 'size_t' conversion
There are several places in Git where we refer to the size of an object by an 'unsigned long' instead of a 'size_t'. On 64-bit Linux, 'unsigned long' is 8 bytes, but on 64-bit Windows it is 4 bytes.

The main issue with this conversion is that large objects fail to load (they seem to hash and store just fine). For example, the following 'blob8gb' is an 8 GB file where the ith byte is equal to i % 256:

    $ git hash-object -w --no-filters blob8gb
    5391939346b98600acc0283dda24649450cec51f

    $ git cat-file -s 5391939346b98600acc0283dda24649450cec51f
    error: bad object header
    fatal: git cat-file: could not get object info

An existing discussion can be found here: https://github.com/git-for-windows/git/issues/1063

The error message results from unpack_object_header_buffer(), which had its most recent meaningful change in 'ea4f9685: unpack_object_header_buffer(): clear the size field upon error' (in 2011).

In my opinion, the correct thing to do would be to find all 'unsigned long's that refer to an object size and replace them with 'size_t'. However, a simple "git grep 'unsigned long size'" reveals 194 results, and there are other permutations of names and pointer types all over. This conversion would be a significant patch, so I wanted to get the community's thoughts on this conversion. If there are small, isolated chunks that can be done safely, then this may be a good target for a first patch.

Thanks,
-Stolee
Re: [RFC] protocol version 2
On 10/20/2017 1:18 PM, Brandon Williams wrote:

Overview
========

This document presents a specification for a version 2 of Git's wire protocol. Protocol v2 will improve upon v1 in the following ways:

* Instead of multiple service names, multiple commands will be supported by a single service
* Easily extendable as capabilities are moved into their own section of the protocol, no longer being hidden behind a NUL byte and limited by the size of a pkt-line (as there will be a single capability per pkt-line).
* Separate out other information hidden behind NUL bytes (e.g. agent string as a capability and symrefs can be requested using 'ls-ref')
* Ref advertisement will be omitted unless explicitly requested
* ls-ref command to explicitly request some refs

Hi Brandon,

I'm very interested in your protocol as a former server-side dev for the VSTS Git server, and I understand some of these headaches. We built limited refs specifically to target the problem you are solving with ls-ref, but it requires knowledge about the authenticated user in order to work. I believe your suggestion is a better solution for the Git protocol.

The "easily extendable" part has specifically caught my interest, as we (Microsoft) would like to move most of the GVFS protocol into core Git, and this is a great way to do it. Even if not all features are accepted upstream, we could use our GVFS-specific fork of Git to communicate with our servers without breaking normal users' interactions.

Please CC me in future versions of this proposal. Let me know if you want to chat directly about the "TODO" items below. Speaking of TODOs, how much of this concept do you have working in a prototype? Do you have code that performs this version 2 handshake and communicates the ls-refs result?

Ls-refs
-------

Ls-refs can be looked at as the equivalent of the current ls-remote, as it is a way to query a remote for the references that it has.
Unlike the current ls-remote, the filtering of the output is done on the server side by passing a number of parameters to the server-side command instead of the filtering occurring on the client. Ls-ref takes in the following parameters:

--head, --tags: Limit to only refs/heads or refs/tags

Nit: It would be better to use "--heads" to match refs/heads and your use of "--tags" for refs/tags.

--refs: Do not show peeled tags or pseudorefs like HEAD

Assuming we are in the case where the server has a HEAD ref, why would that ever be advertised? Also, does this imply that without the --refs option we would peel annotated tags until we find non-tag OIDs? Neither of these functions seems useful as default behavior.

--symref: In addition to the object pointed to by it, show the underlying ref pointed to by it when showing a symbolic ref

<pattern>: When specified, only references matching the given patterns are displayed.

Can you be specific about the patterns? For instance, it is not a good idea to allow the client to submit a regex for the server to compute. Instead, can we limit this pattern-matching to a prefix-set, such as the following list of prefixes?

    refs/heads/master
    refs/releases/*
    refs/heads/user/me/*

Fetch
-----

Fetch will need to be a modified version of the v1 fetch protocol. Some potential areas for improvement are: Ref-in-want, CDN offloading, Fetch-options. Since we'll have an 'ls-refs' service, we can eliminate the need for fetch to perform a ref-advertisement; instead a client can run the 'ls-refs' service first, in order to find out what refs the server has, and then request those refs directly using the fetch service.

//TODO Flesh out the design

Fetch-object
------------

This service could be used by partial clones in order to request missing objects.

//TODO Flesh out the design

As you flesh out these "fetch" and "fetch-object" commands, keep in mind that partial clones could mean any of the following:

* fetch all reachable objects except for blobs.
* fetch all reachable objects except for blobs above a certain size. * fetch all commits, trees, (and blobs?) within a certain "cone" of the file system. Push -- Push will need to be a modified version of the v1 push protocol. Some potential areas for improvement are: Fix push-options, Negotiation for force push. Negotiation is something to keep in mind for all pushes, especially in an ecosystem full of fork-based workflows. If you are working across forks and someone else syncs data between your remotes, you may re-push a large chunk of objects that are already present in a fork. Adding an ls-refs step before push would be a step in the right direction. Other Considerations == * Move away from pkt-line framing? * Have responses structured in well known formats (e.g. JSON) * Eliminate initial round-trip using 'GIT_PROTOCOL' side-channel * Additional
Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
On 5/4/2018 3:40 PM, Jakub Narebski wrote:

Hello,

With early parts of the commit-graph feature (ds/commit-graph and ds/lazy-load-trees) close to being merged into "master", see https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/ I think it would be a good idea to think about what other data could be added there to make Git even faster.

Before thinking about adding more data to the commit-graph, I think instead we need to finish taking advantage of the data that is already there. This means landing the generation number patch [1] (I think this is close, so I'll send a v6 this week if there is no new feedback.) and the auto-compute patch [2] (this could use more feedback, but I'll send a v1 based on the RFC feedback if no one chimes in).

[1] https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/
    [PATCH v5 00/11] Compute and consume generation numbers
[2] https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/
    [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git log --graph`. The `tag` scenario is probably easier: this can be done by replacing the revision walk underlying the call to use paint_down_to_common() instead. It requires adding an external method to commit.c, but not too much code.

The tougher challenge is `git log --graph`. The revision walk machinery currently uses two precompute phases before iterating results to the pager: limit_list() and sort_in_topological_order(); these correspond to the two phases of Kahn's algorithm for topo-sort (compute in-degrees, then walk by peeling commits with in-degree zero). This requires O(N) time, where N is the number of reachable commits. Instead, we could make this be O(W) time to output one page of results, where W is (roughly) the number of reachable commits with generation number above the last reported result.
In order to take advantage of this approach, the two phases of Kahn's algorithm need to be done in-line with reporting results to the pager. This means keeping two queues: one is a priority queue by generation number that computes in-degrees, and the other is a priority queue (keyed by commit-date or a visit-order value to implement the --topo-order priority) that peels the in-degree-zero commits (and decrements the in-degree of their parents).

I have not begun this refactoring effort because it appears complicated to me, and it will be hard to tease out the logic without affecting other consumers of the revision-walk machinery. I would love it if someone picked up the `git log --graph` task, since it will be a few weeks before I have the time to focus on it.

Without completing the benefits we get from generation numbers, these investigations into other reachability indexes will be incomplete, as they compare benefits without all consumers taking advantage of a reachability index.

[...]

Bloom filter for changed paths
------------------------------

The goal of this chunk is to speed up checking if a file or directory was changed in a given commit, for queries such as "git log -- <path>" or "git blame <path>".
This is something that according to "Git Merge contributor summit notes" [2] is already present in VSTS (Visual Studio Team Services - the server counterpart of GVFS: Git Virtual File System) at Microsoft: AV> - VSTS adds bloom filters to know which paths have changed on the commit AV> - tree-same check in the bloom filter is fast; speeds up file history checks AV> - might be useful in the client as well, since limited-traversal is common AV> - if the file history is _very_ sparse, then bloom filter is useful AV> - but needs pre-compute, so useful to do once AV> - first make the client do it, then think about how to serve it centrally [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ I think it was what Derrick Stolee was talking about at the end of his part of "Making Git for Windows" presentation at Git Merge 2018: https://youtu.be/oOMzi983Qmw?t=1835 This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load trees when reading commit-graph", starting from [3] [3]: https://public-inbox.org/git/86y3hyeu6c@gmail.com/ Again, the benefits of Bloom filters should only be measured after already taking advantage of a reachability index during `git log`. However, you could get performance benefits from Bloom filters in a normal `git log` (no topo-order). The tricky part about this feature is that the decisions we made in our C# implementation for the VSTS Git server may be very different than the needs for the C implementation of the Git client. Questions like "how do we handle merge commits?" may have different answers, which can only be discovered by implementing the feature. (The answer for VSTS is that we only store Bloom filters containin
Re: What's cooking in git.git (May 2018, #01; Mon, 7)
On 5/7/2018 10:58 AM, Junio C Hamano wrote:

* ds/generation-numbers (2018-05-02) 11 commits
 - commit-graph.txt: update design document
 - merge: check config before loading commits
 - commit: use generation number in remove_redundant()
 - commit: add short-circuit to paint_down_to_common()
 - commit: use generation numbers for in_merge_bases()
 - ref-filter: use generation number for --contains
 - commit-graph: always load commit-graph information
 - commit: use generations in paint_down_to_common()
 - commit-graph: compute generation numbers
 - commit: add generation number to struct commit
 - ref-filter: fix outdated comment on in_commit_list
 (this branch uses ds/commit-graph and ds/lazy-load-trees.)

 A recently added "commit-graph" datafile has learned to store pre-computed generation numbers to speed up the decisions to stop history traversal.

Is this ready for 'next'? I see that you squashed the fix from [1], so I think this is ready to go. Thanks!

[1] https://public-inbox.org/git/1cfe38f6-925b-d36b-53ae-6b586eed1...@gmail.com/

* ds/lazy-load-trees (2018-05-02) 6 commits
 (merged to 'next' on 2018-05-02 at d54016d9e3)
 + coccinelle: avoid wrong transformation suggestions from commit.cocci
 (merged to 'next' on 2018-04-25 at b90813f421)
 + commit-graph: lazy-load trees for commits
 + treewide: replace maybe_tree with accessor methods
 + commit: create get_commit_tree() method
 + treewide: rename tree to maybe_tree
 + Merge branch 'bw/c-plus-plus' into ds/lazy-load-trees
 (this branch is used by ds/generation-numbers; uses ds/commit-graph.)

 The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit, to avoid opening and parsing the commit object when it makes sense to do so.

 Will merge to 'master'.
* ds/commit-graph (2018-04-11) 16 commits
 (merged to 'next' on 2018-04-25 at 18af3d28d9)
 + commit-graph: implement "--append" option
 + commit-graph: build graph from starting commits
 + commit-graph: read only from specific pack-indexes
 + commit: integrate commit graph with commit parsing
 + commit-graph: close under reachability
 + commit-graph: add core.commitGraph setting
 + commit-graph: implement git commit-graph read
 + commit-graph: implement git-commit-graph write
 + commit-graph: implement write_commit_graph()
 + commit-graph: create git-commit-graph builtin
 + graph: add commit graph design document
 + commit-graph: add format document
 + csum-file: refactor finalize_hashfile() method
 + csum-file: rename hashclose() to finalize_hashfile()
 + Merge branch 'jk/cached-commit-buffer' into HEAD
 + Merge branch 'jt/binsearch-with-fanout' into HEAD
 (this branch is used by ds/generation-numbers and ds/lazy-load-trees.)

 Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking.

 Will merge to 'master'.

These have been queued for 'master' for a few weeks. Is anything delaying them? I'd love to see the community dogfood this feature by running the following on their local repos:

    git config core.commitGraph true
    git show-ref -s | git commit-graph write --stdin-commits

Thanks,
-Stolee
Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation
On 5/14/2018 9:09 AM, Jeff King wrote: On Mon, May 14, 2018 at 08:47:33AM -0400, Derrick Stolee wrote: On 5/11/2018 2:03 PM, Jeff King wrote:

Commit 941ba8db57 (Eliminate recursion in setting/clearing marks in commit list, 2012-01-14) used a clever double-loop to avoid allocations for single-parent chains of history. However, it did so only when following parents of parents (which was an uncommon case), and _always_ incurred at least one allocation to populate the list of pending parents in the first place. We can turn this into zero-allocation in the common case by iterating directly over the initial parent list, and then following up on any pending items we might have discovered.

This change appears to improve performance, but I was unable to measure any difference between this commit and the one ahead, even when merging ds/generation-numbers (which significantly reduces the other costs). I was testing 'git status' and 'git rev-list --boundary master...origin/master' in the Linux repo with my copy of master 70,000+ commits behind origin/master.

I think you'd want to go the other way: this is marking uninteresting commits, so you'd want origin/master..master, which would make those 70k commits uninteresting.

Thanks for the tip. Running 'git rev-list origin/master..master' 100 times gave the following times, on average (with ds/generation-numbers):

    PATCH 3/4: 0.19 s
    PATCH 4/4: 0.16 s
    Rel %:     -16%

I still doubt it will create that much difference, though. It's more or less saving a single malloc per commit we traverse. Certainly without the .commits file that's a drop in the bucket compared to inflating the object data, but I wouldn't be surprised if we end up with a few mallocs even after your patches. Anyway, thanks for poking at it. :)

-Peff
Re: [PATCH 0/4] a few mark_parents_uninteresting cleanups
On 5/11/2018 2:00 PM, Jeff King wrote:

This is a follow-up to the discussion from February:

  https://public-inbox.org/git/1519522496-73090-1-git-send-email-dsto...@microsoft.com/

There I theorized that some of these extra has_object_file() checks were unnecessary. After poking around a bit, I've convinced myself that this is the case, so here are some patches. After Stolee's fix in ebbed3ba04 (revision.c: reduce object database queries, 2018-02-24), I doubt this will provide any noticeable speedup. IMHO the value is just in simplifying the logic.

The first two patches are the actual has_object_file simplifications. The second two are an attempt to fix some head-scratching I had while reading mark_parents_uninteresting(). I hope the result is easier to follow, but I may have just shuffled one confusion for another. :)

  [1/4]: mark_tree_contents_uninteresting(): drop missing object check
  [2/4]: mark_parents_uninteresting(): drop missing object check
  [3/4]: mark_parents_uninteresting(): replace list with stack
  [4/4]: mark_parents_uninteresting(): avoid most allocation

 revision.c | 90 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 50 insertions(+), 40 deletions(-)

-Peff

This series looks good to me. I found Patch 3 hard to read in the diff, so I just looked at the final result, and the new arrangement is very clear about how it should behave.

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>
Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
On 5/12/2018 10:00 AM, Jakub Narebski wrote: Derrick Stolee <sto...@gmail.com> writes: On 5/4/2018 3:40 PM, Jakub Narebski wrote: With early parts of commit-graph feature (ds/commit-graph and ds/lazy-load-trees) close to being merged into "master", see https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/ I think it would be good idea to think what other data could be added there to make Git even faster. Before thinking about adding more data to the commit-graph, I think instead we need to finish taking advantage of the data that is already there. This means landing the generation number patch [1] (I think this is close, so I'll send a v6 this week if there is no new feedback.) and the auto-compute patch [2] (this could use more feedback, but I'll send a v1 based on the RFC feedback if no one chimes in). [1] https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/ [PATCH v5 00/11] Compute and consume generation numbers [2] https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/ [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Right, so the RFC might be a bit premature; I wanted the discussion to be out there to think about when adding new uses of existing features. DIGRESSION: it is commendable that you are sending patches in small, easy digestible chunks / patch series. It is much easier to review 10+ series than 80+ behemoth (though I understand it is not always possible to subdivide patch series into smaller self-contained sub-series). On the other hand, it would be nice to have some roadmap about series and features to be sent in the future, if possible. Something like what was done when 'git rebase --interactive' was getting builtinified: moved (in parts) to C. It would be great to have such roadmap with milestones achieved and milestones to be done in the cover letter for series. I suppose that is what I intended in the "Future Work" section of Documentation/technical/commit-graph.txt. 
It gives a set of things that need to be done in order to make this a default feature, not just an expert-level feature. When I wrote the section, I was focused entirely on "commit-graph version 1.0" and I didn't know how long that would take. The series have been getting interest and review (in great part due to your interest, Jakub) so they have been moving to 'next' faster than I anticipated.

I'll plan on writing a more detailed list of future directions, but for now I'll list the things I know about and how I see them in priority order:

Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths

The big wins remaining from this data are `git tag --merged` and `git log --graph`. The `tag` scenario is probably easier: this can be done by replacing the revision-walk underlying the call to use paint_down_to_common() instead. Requires adding an external method to commit.c, but not too much code.

I wonder if there is some significant reason behind `git tag --merged` using its own codepath, beside historical reasons. Maybe performance is better with the current code? Utilizing paint_down_to_common() there, beside reducing the amount of code you would have to modify, would also unify code (and possibly reduce the amount of code).

That's very nice.

The tougher challenge is `git log --graph`. The revision walk machinery currently uses two precompute phases before iterating results to the pager: limit_list() and sort_in_topological_order(); these correspond to the two phases of Kahn's algorithm for topo-sort (compute in-degrees, then walk by peeling commits with in-degree zero). This requires O(N) time, where N is the number of reachable commits.
Instead, we could make this be O(W) time to output one page of results, where W is (roughly) the number of reachable commits with generation number above the last reported result.

A reminder: Kahn's algorithm (described for example in [1] and [2]) looks like this:

    L ← Empty list that will contain the sorted elements
    S ← Collection of all nodes with no incoming edge

    while S is non-empty do
        remove a node 'n' from S
        add 'n' to tail of L
        for each parent 'm' of 'n' do
            decrease the in-degree of 'm'
            if 'm' has in-degree of 0 then
                insert 'm' into S

[1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
[2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/

In order to take advantage of this approach, the two phases of Kahn's algorithm need to be done in-line with reporting results to the pager. This means keeping
Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand
On 5/12/2018 9:31 AM, Martin Ågren wrote:

On 11 May 2018 at 23:15, Derrick Stolee <dsto...@microsoft.com> wrote:

 	graph_name = get_commit_graph_filename(opts.obj_dir);
 	graph = load_commit_graph_one(graph_name);
+	FREE_AND_NULL(graph_name);

 	if (!graph)
 		die("graph file %s does not exist", graph_name);
-	FREE_AND_NULL(graph_name);

This is probably because of something I said, but this does not look correct. The `die()` would typically print "(null)" or segfault. If the `die()` means we don't free `graph_name`, that should be fine.

You're still leaking `graph` here (possibly nothing this patch should worry about) and in `graph_verify()`. UNLEAK-ing it immediately before calling `verify_commit_graph()` should be ok. I also think punting on this UNLEAK-business entirely would be ok; I was just a bit surprised to see one variable getting freed and the other one ignored.

Thanks, Martin. I was just blindly searching for FREE_AND_NULL() and shouldn't have been so careless.

-Stolee
Re: [PATCH 4/4] mark_parents_uninteresting(): avoid most allocation
On 5/11/2018 2:03 PM, Jeff King wrote:

Commit 941ba8db57 (Eliminate recursion in setting/clearing marks in commit list, 2012-01-14) used a clever double-loop to avoid allocations for single-parent chains of history. However, it did so only when following parents of parents (which was an uncommon case), and _always_ incurred at least one allocation to populate the list of pending parents in the first place.

We can turn this into zero-allocation in the common case by iterating directly over the initial parent list, and then following up on any pending items we might have discovered.

This change appears to improve performance, but I was unable to measure any difference between this commit and the one ahead, even when merging ds/generation-numbers (which significantly reduces the other costs). I was testing 'git status' and 'git rev-list --boundary master...origin/master' in the Linux repo with my copy of master 70,000+ commits behind origin/master. It's still a good change, but I was hoping to find a measurable benefit :(

Signed-off-by: Jeff King
---

Again, try "-w" for more readability.

 revision.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 89ff9a99ce..cbe041128e 100644
--- a/revision.c
+++ b/revision.c
@@ -115,32 +115,38 @@ static void commit_stack_clear(struct commit_stack *stack)
 	stack->nr = stack->alloc = 0;
 }
 
-void mark_parents_uninteresting(struct commit *commit)
+static void mark_one_parent_uninteresting(struct commit *commit,
+					  struct commit_stack *pending)
 {
-	struct commit_stack pending = COMMIT_STACK_INIT;
 	struct commit_list *l;
 
+	if (commit->object.flags & UNINTERESTING)
+		return;
+	commit->object.flags |= UNINTERESTING;
+
+	/*
+	 * Normally we haven't parsed the parent
+	 * yet, so we won't have a parent of a parent
+	 * here. However, it may turn out that we've
+	 * reached this commit some other way (where it
+	 * wasn't uninteresting), in which case we need
+	 * to mark its parents recursively too..
+	 */
 	for (l = commit->parents; l; l = l->next)
-		commit_stack_push(&pending, l->item);
+		commit_stack_push(pending, l->item);
+}
 
-	while (pending.nr > 0) {
-		struct commit *commit = commit_stack_pop(&pending);
+void mark_parents_uninteresting(struct commit *commit)
+{
+	struct commit_stack pending = COMMIT_STACK_INIT;
+	struct commit_list *l;
 
-		if (commit->object.flags & UNINTERESTING)
-			return;
-		commit->object.flags |= UNINTERESTING;
+	for (l = commit->parents; l; l = l->next)
+		mark_one_parent_uninteresting(l->item, &pending);
 
-		/*
-		 * Normally we haven't parsed the parent
-		 * yet, so we won't have a parent of a parent
-		 * here. However, it may turn out that we've
-		 * reached this commit some other way (where it
-		 * wasn't uninteresting), in which case we need
-		 * to mark its parents recursively too..
-		 */
-		for (l = commit->parents; l; l = l->next)
-			commit_stack_push(&pending, l->item);
-	}
+	while (pending.nr > 0)
+		mark_one_parent_uninteresting(commit_stack_pop(&pending),
+					      &pending);
 
 	commit_stack_clear(&pending);
 }
Re: [PATCH v2 02/12] commit-graph: verify file header information
On 5/12/2018 9:35 AM, Martin Ågren wrote:

+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+	va_list ap;
+	struct strbuf sb = STRBUF_INIT;
+	verify_commit_graph_error = 1;
+
+	va_start(ap, fmt);
+	strbuf_vaddf(&sb, fmt, ap);
+
+	fprintf(stderr, "%s\n", sb.buf);
+	strbuf_release(&sb);
+	va_end(ap);
+}

That's a good idea. Makes that patch a bit less trivial and this one a bit less difficult.
Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb
On 5/12/2018 5:17 PM, Martin Ågren wrote:

On 11 May 2018 at 23:15, Derrick Stolee <dsto...@microsoft.com> wrote:

When running 'git commit-graph verify', compare the contents of the commits that are loaded from the commit-graph file with commits that are loaded directly from the object database. This includes checking the root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the object IDs to guarantee we parse from the commit-graph file. In addition, verify the generation number calculation is correct for all commits in the commit-graph file.

While testing, we discovered that mutating the integer value for a parent to be outside the accepted range causes a segmentation fault. Add a new check in insert_parent_or_die() that prevents this fault. Check for that error during the test, both in the typical parents and in the list of parents for octopus merges.

This paragraph and the corresponding fix and test feel like a separate patch to me. (The commit message of it could be "To test the next patch, we threw invalid data at `git commit-graph verify`, and it crashed in pre-existing code, so let's fix that first" -- there is definitely a connection.) Is this important enough to fast-track to master in time for 2.18? My guess would be no.

+
+	if (pos >= g->num_commits)
+		die("invalide parent position %"PRIu64, pos);

s/invalide/invalid/

@@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
 		return 1;
 
 	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit;
+
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
 		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -892,6 +898,10 @@
 			cur_fanout_pos++;
 		}
+
+		graph_commit = lookup_commit(&cur_oid);
+		if (!parse_commit_in_graph_one(g, graph_commit))
+			graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid));
 	}

Could this end up giving ridiculous amounts of output? It would depend on the input, I guess.

 	while (cur_fanout_pos < 256) {
@@ -904,5 +914,95 @@
 		cur_fanout_pos++;
 	}
 
+	if (verify_commit_graph_error)
+		return 1;

Well, here we give up before running into *too* much problem.

+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit, *odb_commit;
+		struct commit_list *graph_parents, *odb_parents;
+		int num_parents = 0;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		graph_commit = lookup_commit(&cur_oid);
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0)) {
+			graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid));
+			continue;
+		}
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report("root tree object ID for commit %s in commit-graph is %s != %s",
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"",
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			num_parents++;
+
+			if (odb_parents == NULL)
+				graph_report("commit-graph parent list for commit %s is too long (%d)",
+					     oid_to_hex(&cur_oid),
+					     num_parents);
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report("commit-graph parent for %s is %s != %s",
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+
Re: [PATCH v2 14/14] commit.h: delete 'util' field in struct commit
On 5/14/2018 1:30 PM, Duy Nguyen wrote:

On Mon, May 14, 2018 at 6:07 PM, Duy Nguyen wrote:

On Mon, May 14, 2018 at 04:52:29PM +0900, Junio C Hamano wrote:

Nguyễn Thái Ngọc Duy writes:

diff --git a/commit.h b/commit.h
index 838f6a6b26..70371e111e 100644
--- a/commit.h
+++ b/commit.h
@@ -18,12 +18,16 @@ struct commit_list {
 struct commit {
 	struct object object;
-	void *util;
 	unsigned int index;
 	timestamp_t date;
 	struct commit_list *parents;
 	struct tree *tree;
 	uint32_t graph_pos;
+	/*
+	 * Do not add more fields here unless it's _very_ often
+	 * used. Use commit-slab to associate more data with a commit
+	 * instead.
+	 */
 };

That's a logical consequence and a natural endgame of this pleasant-to-read series. Thanks. Unfortunately we are gaining more and more stuff in "struct commit" with recent topics, and we may want to see if we can evict some of them out to shrink it again.

Sigh.. ds/lazy-load-trees already enters 'next' so a fixup patch is something like this.

Sorry, I take this patch back. I didn't realize it was a rename and the old field named 'tree' was already there (I vaguely recalled some "tree" in this struct but didn't stop to think about it). Moving graph_pos out is an option, but only 32-bit arches gain from it (64-bit arches will have 4 bytes of padding anyway) so it is probably not worth the effort. The "generation" field should probably be moved out, though.

I'm happy to take a look at this patch and figure out the best way to integrate it with the changes I've been doing. I disagree with the removal of "generation". My intention is to make the commit-graph feature a standard feature that most repos (of reasonable size) want to have. In that case, 'graph_pos' and 'generation' will be set during every parse and used by every commit walk.

This is just my gut reaction. As I investigate these changes, I'll try to see what performance hit is caused by converting the graph_pos and/or generation to commit slabs. (Again, I'm assuming the slabs will make this slower. I may be wrong here.)

Thanks,
-Stolee
Re: Implementing reftable in Git
On 5/9/2018 10:33 AM, Christian Couder wrote:

Hi,

I might start working on implementing reftable in Git soon. During the last Git Merge conference last March, Stefan talked about reftable. In Alex Vandiver's notes [1] it is asked that people announce it on the list when they start working on it, and it appears that there is a reference implementation in JGit.

Thanks for starting on this! In addition to the performance gains, this will keep a lot of users on case-insensitive file systems from getting case errors on refnames.

Looking it up, there is indeed some documentation [2], code [3], tests [4] and other related stuff [5] in the JGit repo. It looks like the JGit repo and the reftable code there are licensed under the Eclipse Distribution License - v 1.0 [7], which is very similar to the 3-Clause BSD License (also called the Modified BSD License), which is GPL compatible according to gnu.org [9]. So from a quick look it appears that I should be able to port the JGit code to Git if I just keep the copyright and license header comments in all the related files. So I think the most straightforward and compatible way to do it would be to port the JGit implementation. Thanks in advance for any suggestion or comment about this. Reftable was first described by Shawn and then discussed last July on the list [6].

The hope is that such a direct port should be possible, but someone else should comment on the porting process. This is also something that could be created independently based on the documentation you mention. I was planning to attempt that during a hackathon in July, but I'm happy you are able to start earlier (and that you are announcing your intentions). I would be happy to review your patch series, so please keep me posted.

Thanks,
-Stolee
Re: [PATCH 1/1] commit-graph: fix UX issue when .lock file exists
On 5/9/2018 10:42 AM, Jeff King wrote:

On Wed, May 09, 2018 at 02:15:38PM +, Derrick Stolee wrote:

The commit-graph file lives in the .git/objects/info directory. Previously, a failure to acquire the commit-graph.lock file was assumed to be due to the lack of the info directory, so a mkdir() was called. This gave incorrect messaging if instead the lockfile was open by another process:

	"fatal: cannot mkdir .git/objects/info: File exists"

Rearrange the expectations of this directory existing to avoid this error, and instead show the typical message when a lockfile already exists.

Sounds like a reasonable bug fix. Your cover letter is way longer than this description. Should some of that background perhaps go in the commit message?

I did want a place to include the full die() message in the new behavior, but that seemed like overkill for the commit message.

(I would go so far as to say that sending a cover letter for a single patch is an anti-pattern, because the commit message should be able to stand on its own.)

@@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir,
 	compute_generation_numbers(&commits);
 
 	graph_name = get_commit_graph_filename(obj_dir);
-	fd = hold_lock_file_for_update(&lk, graph_name, 0);
-	if (fd < 0) {
-		struct strbuf folder = STRBUF_INIT;
-		strbuf_addstr(&folder, graph_name);
-		strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
-
-		if (mkdir(folder.buf, 0777) < 0)
-			die_errno(_("cannot mkdir %s"), folder.buf);
-		strbuf_release(&folder);
-
-		fd = hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
-
-		if (fd < 0)
-			die_errno("unable to create '%s'", graph_name);
-	}
+	strbuf_addstr(&folder, graph_name);
+	strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
+	if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0)
+		die_errno(_("cannot mkdir %s"), folder.buf);
+	strbuf_release(&folder);

The result is racy if somebody else is trying to create the directory at the same time. For that you'd want to notice EEXIST coming from mkdir and consider that a success.

I think you probably ought to be calling adjust_shared_perm() on the result, too, in case core.sharedrepository is configured. If you use safe_create_leading_directories(), it should handle both. Something like:

	if (safe_create_leading_directories(graph_name))
		die_errno(_("unable to create leading directories of %s"),
			  graph_name);

I think I'm holding it right; that function is a little short on documentation, but it's the standard way to do this in Git's codebase, and you can find lots of example callers.

Thanks for this method. I was unfamiliar with it. Saves the effort of creating the strbuf, too.

Thanks,
-Stolee
[PATCH 0/1] Fix UX issue with commit-graph.lock
We use the lockfile API to prevent multiple Git processes from writing to the commit-graph file in the .git/objects/info directory. In some cases, this directory may not exist, so we check for its existence.

The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/objects/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user:

	"fatal: cannot mkdir .git/objects/info: File exists"

While technically this honors the lockfile, it does not help the user. Instead, do the following:

1. Check for existence of .git/objects/info; create if necessary.
2. Try to acquire the lock, failing if necessary.

The new output looks like:

	fatal: Unable to create '/home/stolee/git/.git/objects/info/commit-graph.lock': File exists.

	Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue.

This patch is based on ds/generation-numbers.

Derrick Stolee (1):
  commit-graph: fix UX issue when .lock file exists

 commit-graph.c | 24 ++++++++--------------
 1 file changed, 8 insertions(+), 16 deletions(-)

base-commit: 7547b95b4fbb8591726b1d9381c176cc27fc6aea
-- 
2.17.0.39.g685157f7fb
[PATCH 1/1] commit-graph: fix UX issue when .lock file exists
The commit-graph file lives in the .git/objects/info directory. Previously, a failure to acquire the commit-graph.lock file was assumed to be due to the lack of the info directory, so a mkdir() was called. This gave incorrect messaging if instead the lockfile was open by another process:

	"fatal: cannot mkdir .git/objects/info: File exists"

Rearrange the expectations of this directory existing to avoid this error, and instead show the typical message when a lockfile already exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 24 ++++++++--------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..8399194da1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "config.h"
+#include "dir.h"
 #include "git-compat-util.h"
 #include "lockfile.h"
 #include "pack.h"
@@ -640,13 +641,13 @@ void write_commit_graph(const char *obj_dir,
 	struct hashfile *f;
 	uint32_t i, count_distinct = 0;
 	char *graph_name;
-	int fd;
 	struct lock_file lk = LOCK_INIT;
 	uint32_t chunk_ids[5];
 	uint64_t chunk_offsets[5];
 	int num_chunks;
 	int num_extra_edges;
 	struct commit_list *parent;
+	struct strbuf folder = STRBUF_INIT;
 
 	oids.nr = 0;
 	oids.alloc = approximate_object_count() / 4;
@@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir,
 	compute_generation_numbers(&commits);
 
 	graph_name = get_commit_graph_filename(obj_dir);
-	fd = hold_lock_file_for_update(&lk, graph_name, 0);
-	if (fd < 0) {
-		struct strbuf folder = STRBUF_INIT;
-		strbuf_addstr(&folder, graph_name);
-		strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
-
-		if (mkdir(folder.buf, 0777) < 0)
-			die_errno(_("cannot mkdir %s"), folder.buf);
-		strbuf_release(&folder);
-
-		fd = hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
-
-		if (fd < 0)
-			die_errno("unable to create '%s'", graph_name);
-	}
+	strbuf_addstr(&folder, graph_name);
+	strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
+	if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0)
+		die_errno(_("cannot mkdir %s"), folder.buf);
+	strbuf_release(&folder);
 
+	hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
-- 
2.17.0.39.g685157f7fb
Re: [PATCH 0/1] Fix UX issue with commit-graph.lock
On 5/10/2018 3:00 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:

We use the lockfile API to prevent multiple Git processes from writing to the commit-graph file in the .git/objects/info directory. In some cases, this directory may not exist, so we check for its existence. The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/objects/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user: "fatal: cannot mkdir .git/objects/info: File exists"

Isn't a better immediate fix to make the second step pay attention to errno? If mkdir() failed due to EEXIST, then we know we tried to acquire the lock already, so we can die with an appropriate message. That way, we can keep the "optimize for the normal case" approach that assumes the objects/info/ directory is already there, instead of always checking its existence, which is almost always true beforehand. This "optimize for the normal case" is why the existing code is organized the way it is.

Since this code is only for writing a commit-graph file, this "check the directory first" option is a very small portion of the full time to write the file, so the "optimization" has very little effect, relatively. My personal opinion is to make the code cleaner when the performance difference is negligible. I'm willing to concede this point and use the steps you suggest below, if we think this is the best way forward.

Also, can't we tell why we failed to acquire the lock at step #1? Do we only get a NULL that says "I am not telling you why, but we failed to lock"?

To tell why we failed to acquire the lock, we could inspect "errno". However, this requires whitebox knowledge of both the lockfile API and the tempfile API to know that the last system call to set errno was an open() or adjust_shared_perm(). To cleanly make decisions based on the reason the lock failed to acquire, I think we would need to modify the lockfile and tempfile APIs to return a failure reason. This could be done by passing an `int *reason`, but the extra noise in these APIs is likely not worth the change.

What I am getting at is that the ideal sequence would be more like:

1. Try to acquire the lock.
2-a. If #1 succeeds, we are happy. Ignore the rest and return the lock.
2-b. If #1 failed because objects/info/ did not exist, mkdir() it, and die if we cannot, saying "cannot mkdir". If mkdir() succeeds, jump to 3.
2-c. If #1 failed but that is not due to missing objects/info/, die saying "cannot lock".
3. Try to acquire the lock.
4-a. If #3 succeeds, we are happy. Ignore the rest and return the lock.
4-b. Die saying "cannot lock".

Thanks,
-Stolee
Re: [PATCH 2/2] replace-object.c: remove the_repository from prepare_replace_object
On 5/10/2018 7:56 AM, Jeff King wrote:

On Thu, May 10, 2018 at 07:23:13PM +0900, Junio C Hamano wrote:

This one was doing

	ptr = xmalloc(sizeof(*another_ptr))

and it was OK because ptr and another_ptr happened to be of the same type. I wonder if we are making it safer, or making it more obscure to seasoned C programmers, if we introduced a pair of helper macros, perhaps like these:

	#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr)))
	#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr)))

I've often wondered that, too. It's the natural endgame of the ALLOC_ARRAY() road we've been going down.

I'll chime in that I like this idea. Because I'm trying to learn more about Coccinelle, I added the diff below and ran 'make coccicheck'. It resulted in a 1000-line patch on 'next'. I'll refrain from sending that patch series, but it's nice to know a "simple" transformation is possible.

Using `git grep` I see 230 instances of 'xmalloc' and 261 instances of 'xcalloc'. After the Coccinelle transformation, these are down to 194 and 190, respectively, because the rest allocate in the same line as the definition. It's worth thinking about the macro pattern for those cases.

diff --git a/contrib/coccinelle/alloc.cocci b/contrib/coccinelle/alloc.cocci
new file mode 100644
index 0000000000..95f426c4dc
--- /dev/null
+++ b/contrib/coccinelle/alloc.cocci
@@ -0,0 +1,13 @@
+@@
+expression p;
+@@
+- p = xmalloc(sizeof(*p));
++ ALLOCATE(p);
+
+@@
+expression c;
+expression p;
+@@
+- p = xcalloc(c, sizeof(*p));
++ CALLOCATE(p,c);
+
diff --git a/git-compat-util.h b/git-compat-util.h
index f9e4c5f9bc..5398f217d7 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -843,6 +843,9 @@ extern FILE *fopen_or_warn(const char *path, const char *mode);
  */
 #define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)
 
+#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr)))
+#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr)))
+
 #define ALLOC_ARRAY(x, alloc) (x) = xmalloc(st_mult(sizeof(*(x)), (alloc)))
 #define REALLOC_ARRAY(x, alloc) (x) = xrealloc((x), st_mult(sizeof(*(x)), (alloc)))
[PATCH 10/12] gc: automatically write commit-graph files
The commit-graph file is a very helpful feature for speeding up git operations. In order to make it more useful, write the commit-graph file by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a commit-graph file after any non-trivial 'git gc' command. Defaults to false while the commit-graph feature matures. We specifically do not want to turn this on by default until the commit-graph feature is fully integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 6 ++++++
 Documentation/git-gc.txt | 4 ++++
 builtin/gc.c             | 8 ++++++++
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
 	if the system supports it. Default is true.
 
+gc.commitGraph::
+	If true, then gc will rewrite the commit-graph file after any
+	change to the object database. If '--auto' is used, then the
+	commit-graph will not be updated unless the threshold is met.
+	See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
 	If the file gc.log exists, then `git gc --auto` won't run
 	unless that file is more than 'gc.logExpiry' old. Default is

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified. The larger

diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..8403445738 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
 	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	argv_array_pushl(&prune, "prune", "--expire", NULL);
 	argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", NULL);
 	argv_array_pushl(&rerere, "rerere", "gc", NULL);
+	argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL);
 
 	/* default expiry time, overwritten in gc_config */
 	gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();
 
+	if (gc_commit_graph)
+		if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, commit_graph.argv[0]);
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
			"run 'git prune' to remove them."));
-- 
2.16.2.329.gfb62395de6
[PATCH 09/12] commit-graph: add '--reachable' option
When writing commit-graph files, it can be convenient to ask for all reachable commits (starting at the ref set) in the resulting file. This is particularly helpful when writing to stdin is complicated, such as a future integration with 'git gc' which will call 'git commit-graph write --reachable' after performing cleanup of the object database. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 8 ++-- builtin/commit-graph.c | 41 ++ t/t5318-commit-graph.sh| 10 ++ 3 files changed, 53 insertions(+), 6 deletions(-) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 1daefa7fb1..fbab3feba1 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles. + With the `--stdin-packs` option, generate the new commit graph by walking objects only in the specified pack-indexes. (Cannot be combined -with --stdin-commits.) +with --stdin-commits or --reachable.) + With the `--stdin-commits` option, generate the new commit graph by walking commits starting at the commits specified in stdin as a list of OIDs in hex, one OID per line. (Cannot be combined with ---stdin-packs.) +--stdin-packs or --reachable.) ++ +With the `--reachable` option, generate the new commit graph by walking +commits starting at all refs. (Cannot be combined with --stdin-commits +or --stind-packs.) + With the `--append` option, include all commits that are present in the existing commit-graph file. 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index f5d891f2b8..4c9328cdf2 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -3,13 +3,14 @@ #include "dir.h" #include "lockfile.h" #include "parse-options.h" +#include "refs.h" #include "commit-graph.h" static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir <dir>]"), N_("git commit-graph verify [--object-dir <dir>]"), N_("git commit-graph read [--object-dir <dir>]"), - N_("git commit-graph write [--object-dir <dir>] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir <dir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; @@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = { }; static const char * const builtin_commit_graph_write_usage[] = { - N_("git commit-graph write [--object-dir <dir>] [--append] [--stdin-packs|--stdin-commits]"), + N_("git commit-graph write [--object-dir <dir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"), NULL }; static struct opts_commit_graph { const char *obj_dir; + int reachable; int stdin_packs; int stdin_commits; int append; @@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv) return 0; } +struct hex_list { + char **hex_strs; + int hex_nr; + int hex_alloc; +}; + +static int add_ref_to_list(const char *refname, + const struct object_id *oid, + int flags, void *cb_data) +{ + struct hex_list *list = (struct hex_list *)cb_data; + + ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc); + list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1); + strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid)); + list->hex_nr++; + return 0; +} + static int graph_write(int argc, const char **argv) { const char **pack_indexes = NULL; @@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv) OPT_STRING(0, "object-dir", &opts.obj_dir, N_("dir"), N_("The object directory to store the graph")), + OPT_BOOL(0, "reachable", &opts.reachable, + N_("start walk at all refs")), OPT_BOOL(0, "stdin-packs", &opts.stdin_packs, N_("scan pack-indexes listed by stdin for commits")), OPT_BOOL(0, "stdin-commits", &opts.stdin_commits, @@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv) builtin_commit_graph_write_options, builtin_commit_graph_write_usage, 0); - if (opts.stdin_packs && opts.stdin_commits) - die(_("cannot use both --stdin-commits and --stdin-packs")); + if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1) + die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs")); if (!opts.o
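The option check at the end is a neat generalization: summing the three booleans replaces the old pairwise test and scales if more modes are added. A minimal model of the rule (Python sketch, not git source):

```python
def check_write_modes(reachable, stdin_packs, stdin_commits):
    """Model of the graph_write() check: the three input modes are
    mutually exclusive, so at most one flag may be set. Booleans sum
    as 0/1, which is exactly the trick the C code uses with ints."""
    if reachable + stdin_packs + stdin_commits > 1:
        raise ValueError(
            "use at most one of --reachable, --stdin-commits, or --stdin-packs")
```

Any single mode (or none) passes; any combination of two or more dies with one message instead of needing three pairwise checks.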
[PATCH 05/12] commit: force commit to parse from object database
In anticipation of verifying commit-graph file contents against the object database, create parse_commit_internal() to allow side-stepping the commit-graph file and parse directly from the object database. Most consumers expect that if a commit exists in the commit-graph, then the commit was loaded from the commit-graph so values such as generation numbers are loaded. Hence, this method should not be called unless the intention is explicit in avoiding commits from the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit.c | 13 + commit.h | 1 + 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 1d28677dfb..7c92350373 100644 --- a/commit.c +++ b/commit.c @@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; } -int parse_commit_gently(struct commit *item, int quiet_on_missing) +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph) { enum object_type type; void *buffer; @@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return -1; if (item->object.parsed) return 0; - if (parse_commit_in_graph(item)) + if (use_commit_graph && parse_commit_in_graph(item)) return 0; buffer = read_sha1_file(item->object.oid.hash, &type, &size); if (!buffer) return quiet_on_missing ?
-1 : error("Could not read %s", - oid_to_hex(&item->object.oid)); + oid_to_hex(&item->object.oid)); if (type != OBJ_COMMIT) { free(buffer); return error("Object %s not a commit", - oid_to_hex(&item->object.oid)); + oid_to_hex(&item->object.oid)); } ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { @@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return ret; } +int parse_commit_gently(struct commit *item, int quiet_on_missing) +{ + return parse_commit_internal(item, quiet_on_missing, 1); +} + void parse_commit_or_die(struct commit *item) { if (parse_commit(item)) diff --git a/commit.h b/commit.h index b5afde1ae9..5fde74fcd7 100644 --- a/commit.h +++ b/commit.h @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { -- 2.16.2.329.gfb62395de6
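The essence of the refactoring is a two-source lookup with an opt-out. As a rough model (a Python sketch, not git source; the dict-backed "graph" and "odb" stand in for the real stores):

```python
def parse_commit_model(oid, graph, odb, use_commit_graph=True):
    """Model of parse_commit_internal(): prefer the commit-graph copy,
    fall back to the object database. Passing use_commit_graph=False
    side-steps the graph entirely, which is what 'commit-graph verify'
    needs to obtain an independent copy of the commit to compare
    against the graph's copy."""
    if use_commit_graph and oid in graph:
        return graph[oid]
    if oid not in odb:
        # Mirrors the "Could not read %s" error path.
        raise KeyError("could not read %s" % oid)
    return odb[oid]
```

Calling it twice, once per source, is exactly how the later verify patch gets two copies of the same commit to diff.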
[PATCH 02/12] commit-graph: verify file header information
During a run of 'git commit-graph verify', list the issues with the header information in the commit-graph file. Some of this information is inferred from the loaded 'struct commit_graph'. Some header information is checked as part of load_commit_graph_one(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index b25aaed128..c3b8716c14 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -818,7 +818,28 @@ void write_commit_graph(const char *obj_dir, oids.nr = 0; } +static int verify_commit_graph_error; +#define graph_report(...) \ + do {\ + verify_commit_graph_error = 1;\ + printf(__VA_ARGS__);\ + } while (0); + int verify_commit_graph(struct commit_graph *g) { - return !g; + if (!g) { + graph_report(_("no commit-graph file loaded")); + return 1; + } + + verify_commit_graph_error = 0; + + if (!g->chunk_oid_fanout) + graph_report(_("commit-graph is missing the OID Fanout chunk")); + if (!g->chunk_oid_lookup) + graph_report(_("commit-graph is missing the OID Lookup chunk")); + if (!g->chunk_commit_data) + graph_report(_("commit-graph is missing the Commit Data chunk")); + + return verify_commit_graph_error; } -- 2.16.2.329.gfb62395de6
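The graph_report() macro sets a sticky error flag but deliberately keeps going, so a single 'verify' run reports every problem it finds rather than stopping at the first. A model of that accumulate-then-fail pattern (Python sketch, not git source; chunk names taken from the messages above):

```python
def verify_header_model(chunks):
    """Model of the graph_report() pattern: note every missing chunk
    instead of returning on the first problem. A non-empty result
    corresponds to verify_commit_graph_error being set."""
    errors = []
    for name in ("OID Fanout", "OID Lookup", "Commit Data"):
        if name not in chunks:
            errors.append("commit-graph is missing the %s chunk" % name)
    return errors
```

A file missing two chunks thus yields two messages in one run, which is much friendlier for debugging a corrupt file than a fix-rerun-fix loop.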
[PATCH 01/12] commit-graph: add 'verify' subcommand
In case the commit-graph file becomes corrupt, we need a way to verify its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph verify' subcommand to report all issues with the file. Add the 'verify' subcommand to the 'commit-graph' builtin and its documentation. The subcommand is currently a no-op except for loading the commit-graph into memory, which may trigger run-time errors that would be caught by normal use. Add a simple test that ensures the command returns a zero error code. If no commit-graph file exists, this is an acceptable state. Do not report any errors. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-commit-graph.txt | 6 ++ builtin/commit-graph.c | 38 ++ commit-graph.c | 5 + commit-graph.h | 2 ++ t/t5318-commit-graph.sh | 10 ++ 5 files changed, 61 insertions(+) diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt index 4c97b555cc..1daefa7fb1 100644 --- a/Documentation/git-commit-graph.txt +++ b/Documentation/git-commit-graph.txt @@ -10,6 +10,7 @@ SYNOPSIS [verse] 'git commit-graph read' [--object-dir <dir>] +'git commit-graph verify' [--object-dir <dir>] 'git commit-graph write' [--object-dir <dir>] @@ -52,6 +53,11 @@ existing commit-graph file. Read a graph file given by the commit-graph file and output basic details about the graph file. Used for debugging purposes. +'verify':: +Read the commit-graph file and verify its contents against the object +database. Used to check for corrupted data.
+ EXAMPLES diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 37420ae0fd..f5d891f2b8 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -7,11 +7,17 @@ static char const * const builtin_commit_graph_usage[] = { N_("git commit-graph [--object-dir <dir>]"), + N_("git commit-graph verify [--object-dir <dir>]"), N_("git commit-graph read [--object-dir <dir>]"), N_("git commit-graph write [--object-dir <dir>] [--append] [--stdin-packs|--stdin-commits]"), NULL }; +static const char * const builtin_commit_graph_verify_usage[] = { + N_("git commit-graph verify [--object-dir <dir>]"), + NULL +}; + static const char * const builtin_commit_graph_read_usage[] = { N_("git commit-graph read [--object-dir <dir>]"), NULL @@ -29,6 +35,36 @@ static struct opts_commit_graph { int append; } opts; + +static int graph_verify(int argc, const char **argv) +{ + struct commit_graph *graph = 0; + char *graph_name; + + static struct option builtin_commit_graph_verify_options[] = { + OPT_STRING(0, "object-dir", &opts.obj_dir, + N_("dir"), + N_("The object directory to store the graph")), + OPT_END(), + }; + + argc = parse_options(argc, argv, NULL, +builtin_commit_graph_verify_options, +builtin_commit_graph_verify_usage, 0); + + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + + graph_name = get_commit_graph_filename(opts.obj_dir); + graph = load_commit_graph_one(graph_name); + + if (!graph) + return 0; + FREE_AND_NULL(graph_name); + + return verify_commit_graph(graph); +} + static int graph_read(int argc, const char **argv) { struct commit_graph *graph = NULL; @@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix) PARSE_OPT_STOP_AT_NON_OPTION); if (argc > 0) { + if (!strcmp(argv[0], "verify")) + return graph_verify(argc, argv); if (!strcmp(argv[0], "read")) return graph_read(argc, argv); if (!strcmp(argv[0], "write")) diff --git a/commit-graph.c b/commit-graph.c index a8c337dd77..b25aaed128 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -817,3
+817,8 @@ void write_commit_graph(const char *obj_dir, oids.alloc = 0; oids.nr = 0; } + +int verify_commit_graph(struct commit_graph *g) +{ + return !g; +} diff --git a/commit-graph.h b/commit-graph.h index 96cccb10f3..71a39c5a57 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir, int nr_commits, int append); +int verify_commit_graph(struct commit_graph *g); + #endif diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 77d85aefe7..6ca451dfd2 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -11,6 +11,11 @@ test_expect_success 'setup full repo
[PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
The commit-graph feature is not useful to end users until the commit-graph file is maintained automatically by Git during normal upkeep operations. One natural place to trigger this write is during 'git gc'. Before automatically generating a commit-graph, we need to be able to verify the contents of a commit-graph file. Integrate commit-graph checks into 'fsck' that check the commit-graph contents against commits in the object database. Thanks, Jakub, for the feedback on the RFC. I think there are still some things to decide at a high level before we dig too far into the review. Specifically, we should weigh our alternatives for the integration points in 'fsck', 'gc', and 'fetch'. For 'fsck', the current integration is minimal: if core.commitGraph is true, then 'git fsck' calls 'git commit-graph verify --object-dir=X' for the objects directory and every alternate. There are a few options to consider here: 1. Keep this behavior: we should always check the commit-graph if it exists. 2. Add a --[no-]commit-graph argument to 'fsck' that toggles the commit-graph verification. 3. Remove all direct integration between 'fsck' and 'commit-graph' and instead rely on users running 'git commit-graph verify' manually when they suspect a problem with the commit-graph file. While this option is worth considering, it is my least favorite since it requires more from users. For 'gc' and 'fetch', these seemed like natural places to update the commit-graph file. Relative to the other maintenance that occurs during these commands, the 'git commit-graph write' command is fast, especially for incremental updates; only the "new" commits are walked when computing generation numbers and other metadata for the commit-graph file. The behavior in this patch series does the following: 1. Near the end of 'git gc', run 'git commit-graph write'. The location of this code assumes that a 'git gc --auto' has not terminated early due to not meeting the auto threshold. 2.
At the end of 'git fetch', run 'git commit-graph write'. This means that every reachable commit will be in the commit-graph after a successful fetch, which seems a reasonable frequency. Then, the only times we would be missing a reachable commit is after creating one locally. There is a problem with the current patch, though: every 'git fetch' call runs 'git commit-graph write', even if there were no ref updates or objects downloaded. Is there a simple way to detect if the fetch was non-trivial? One obvious problem with this approach: if we compute this during 'gc' AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers two commit-graph writes. If I were to abandon one of these patches, it would be the 'fetch' integration. A 'git gc' really wants to delete all references to unreachable commits, and without updating the commit-graph we may still have commit data in the commit-graph file that is not in the object database. In fact, deleting commits from the object database but not from the commit-graph will cause 'git commit-graph verify' to fail! I welcome discussion on these ideas, as we are venturing out of the "pure data structure" world and into the "user experience" world. I am less confident in my skills in this world, but the feature is worthless if it does not improve the user experience. Thanks, -Stolee Derrick Stolee (12): Commits 01-07 focus on the 'git commit-graph verify' subcommand. These are ready for full, rigorous review. commit-graph: add 'verify' subcommand commit-graph: verify file header information commit-graph: parse commit from chosen graph commit-graph: verify fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb Commit 08 integrates 'git commit-graph verify' into fsck. The work here is the minimum integration possible. (See above discussion of options.)
fsck: verify commit-graph Commit 09 introduces a new '--reachable' option only to make the calls from 'gc' and 'fetch' simpler. Commits 10-11 integrate writing the commit-graph into 'gc' and 'fetch', respectively. (See above discussion.) commit-graph: add '--reachable' option gc: automatically write commit-graph files fetch: compute commit-graph by default Commit 12 simply deletes sections from the "Future Work" section of the commit-graph design document. commit-graph: update design document Documentation/config.txt | 10 ++ Documentation/git-commit-graph.txt | 14 ++- Documentation/git-fsck.txt | 3 + Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 22 builtin/commit-graph.c | 79 +- builtin/fetch.c | 13 +++ builtin/fsck.c | 21 builtin/gc.c
[PATCH 11/12] fetch: compute commit-graph by default
During a call to 'git fetch', we expect new commits and updated refs. Use these updated refs to add the new commits to the commit-graph file, automatically providing performance benefits in other calls. Use 'fetch.commitGraph' config setting to enable or disable this behavior. Defaults to false while the commit-graph feature matures. Specifically, we do not want this on by default until the commit-graph feature integrates with history-modifying features such as shallow clones. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/config.txt | 4 builtin/fetch.c | 13 + 2 files changed, 17 insertions(+) diff --git a/Documentation/config.txt b/Documentation/config.txt index 9a3abd87e7..3d8225600a 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1409,6 +1409,10 @@ fetch.output:: `full` and `compact`. Default value is `full`. See section OUTPUT in linkgit:git-fetch[1] for detail. +fetch.commitGraph:: + If true, fetch will automatically update the commit-graph file. + See linkgit:git-commit-graph[1]. + format.attach:: Enable multipart/mixed attachments as the default for 'format-patch'. The value can also be a double quoted string diff --git a/builtin/fetch.c b/builtin/fetch.c index 8ee998ea2e..254f6ecfb6 100644 --- a/builtin/fetch.c +++ b/builtin/fetch.c @@ -38,6 +38,7 @@ enum { static int fetch_prune_config = -1; /* unspecified */ static int prune = -1; /* unspecified */ #define PRUNE_BY_DEFAULT 0 /* do we prune by default? */ +static int fetch_commit_graph = 0; static int all, append, dry_run, force, keep, multiple, update_head_ok, verbosity, deepen_relative; static int progress = -1; @@ -66,6 +67,11 @@ static int git_fetch_config(const char *k, const char *v, void *cb) return 0; } + if (!strcmp(k, "fetch.commitGraph")) { + fetch_commit_graph = git_config_bool(k, v); + return 0; + } + if (!strcmp(k, "submodule.recurse")) { int r = git_config_bool(k, v) ? 
RECURSE_SUBMODULES_ON : RECURSE_SUBMODULES_OFF; @@ -1462,6 +1468,13 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) result = fetch_multiple(); } + if (!result && fetch_commit_graph) { + struct argv_array commit_graph = ARGV_ARRAY_INIT; + argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL); + if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD)) + result = 1; + } + if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) { struct argv_array options = ARGV_ARRAY_INIT; -- 2.16.2.329.gfb62395de6
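The hook is doubly gated: it fires only when the fetch itself succeeded (`!result`) and when the config opts in. A compact model (Python sketch, not git source):

```python
def post_fetch_commands(fetch_failed, fetch_commit_graph):
    """Model of the cmd_fetch() hook: the commit-graph write runs only
    when the fetch succeeded AND fetch.commitGraph is enabled. Note the
    cover letter's open question: this still runs even when the fetch
    succeeded but transferred nothing new."""
    if not fetch_failed and fetch_commit_graph:
        return [["commit-graph", "write", "--reachable"]]
    return []
```

A failed fetch, or the default config, leaves fetch behavior unchanged.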
[PATCH 08/12] fsck: verify commit-graph
If core.commitGraph is true, verify the contents of the commit-graph during 'git fsck' using the 'git commit-graph verify' subcommand. Run this check on all alternates, as well. We use a new process for two reasons: 1. The subcommand decouples the details of loading and verifying a commit-graph file from the other fsck details. 2. The commit-graph verification requires the commits to be loaded in a specific order to guarantee we parse from the commit-graph file for some objects and from the object database for others. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/git-fsck.txt | 3 +++ builtin/fsck.c | 21 + t/t5318-commit-graph.sh| 5 + 3 files changed, 29 insertions(+) diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt index b9f060e3b2..ab9a93fb9b 100644 --- a/Documentation/git-fsck.txt +++ b/Documentation/git-fsck.txt @@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives (i.e., you can just remove them and do an 'rsync' with some other site in the hopes that somebody else has the object you have corrupted). +If core.commitGraph is true, the commit-graph file will also be inspected +using 'git commit-graph verify'. See linkgit:git-commit-graph[1]. 
+ Extracted Diagnostics - diff --git a/builtin/fsck.c b/builtin/fsck.c index ef78c6c00c..a6d5045b77 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -16,6 +16,7 @@ #include "streaming.h" #include "decorate.h" #include "packfile.h" +#include "run-command.h" #define REACHABLE 0x0001 #define SEEN 0x0002 @@ -45,6 +46,7 @@ static int name_objects; #define ERROR_REACHABLE 02 #define ERROR_PACK 04 #define ERROR_REFS 010 +#define ERROR_COMMIT_GRAPH 020 static const char *describe_object(struct object *obj) { @@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } check_connectivity(); + + if (core_commit_graph) { + struct child_process commit_graph_verify = CHILD_PROCESS_INIT; + const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL }; + commit_graph_verify.argv = verify_argv; + commit_graph_verify.git_cmd = 1; + + if (run_command(&commit_graph_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + + prepare_alt_odb(); + for (alt = alt_odb_list; alt; alt = alt->next) { + verify_argv[2] = "--object-dir"; + verify_argv[3] = alt->path; + if (run_command(&commit_graph_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + } + } + return errors_found; } diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 6ca451dfd2..9680e6e6e0 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -240,4 +240,9 @@ test_expect_success 'git commit-graph verify' ' git commit-graph verify >output ' +test_expect_success 'git fsck (checks commit-graph)' ' + cd "$TRASH_DIRECTORY/full" && + git fsck +' + test_done -- 2.16.2.329.gfb62395de6
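The alternates loop above builds one child process per object store, reusing the same argv with `--object-dir` appended. A model of the commands it spawns (Python sketch, not git source):

```python
def fsck_graph_verify_commands(core_commit_graph, alternates):
    """Model of the fsck integration: if core.commitGraph is enabled,
    verify the local object store's graph, then run once more per
    alternate via --object-dir."""
    if not core_commit_graph:
        return []
    cmds = [["commit-graph", "verify"]]
    for path in alternates:
        cmds.append(["commit-graph", "verify", "--object-dir", path])
    return cmds
```

Each run exits non-zero on corruption, which fsck folds into `errors_found` via the new ERROR_COMMIT_GRAPH bit.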
[PATCH 07/12] commit-graph: verify commit contents against odb
When running 'git commit-graph verify', compare the contents of the commits that are loaded from the commit-graph file with commits that are loaded directly from the object database. This includes checking the root tree object ID, commit date, and parents. Parse the commit from the graph during the initial loop through the object IDs to guarantee we parse from the commit-graph file. In addition, verify the generation number calculation is correct for all commits in the commit-graph file. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 88 ++ 1 file changed, 88 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 8c636abba9..24f5031f3e 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -863,6 +863,8 @@ int verify_commit_graph(struct commit_graph *g) graph_report(_("commit-graph is missing the Commit Data chunk")); for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit; + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0) @@ -880,6 +882,10 @@ cur_fanout_pos++; } + + graph_commit = lookup_commit(&cur_oid); + if (!parse_commit_in_graph_one(g, graph_commit)) + graph_report(_("failed to parse %s from commit-graph"), oid_to_hex(&cur_oid)); } while (cur_fanout_pos < 256) { @@ -892,5 +898,87 @@ cur_fanout_pos++; } + for (i = 0; i < g->num_commits; i++) { + struct commit *graph_commit, *odb_commit; + struct commit_list *graph_parents, *odb_parents; + int num_parents = 0; + + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + graph_commit = lookup_commit(&cur_oid); + odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node()); + if (parse_commit_internal(odb_commit, 0, 0)) + graph_report(_("failed to parse %s from object database"), oid_to_hex(&cur_oid)); + + if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid, + get_commit_tree_oid(odb_commit)))
graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"), + oid_to_hex(&cur_oid), + oid_to_hex(get_commit_tree_oid(graph_commit)), + oid_to_hex(get_commit_tree_oid(odb_commit))); + + if (graph_commit->date != odb_commit->date) + graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""), + oid_to_hex(&cur_oid), + graph_commit->date, + odb_commit->date); + + graph_parents = graph_commit->parents; + odb_parents = odb_commit->parents; + + while (graph_parents) { + num_parents++; + + if (odb_parents == NULL) + graph_report(_("commit-graph parent list for commit %s is too long (%d)"), + oid_to_hex(&cur_oid), + num_parents); + + if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid)) + graph_report(_("commit-graph parent for %s is %s != %s"), + oid_to_hex(&cur_oid), + oid_to_hex(&graph_parents->item->object.oid), + oid_to_hex(&odb_parents->item->object.oid)); + + graph_parents = graph_parents->next; + odb_parents = odb_parents->next; + } + + if (odb_parents != NULL) + graph_report(_("commit-graph parent list for commit %s terminates early"), + oid_to_hex(&cur_oid)); + + if (graph_commit->generation) { + uint32_t max_generation = 0; + graph_parents = graph_commit->parents; + + while (graph_parents) { + if (graph_parents->item->generation == GENERATION_NUMBER_ZERO || + graph_parents->item->generation == GENE
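The quoted hunk is truncated just as the generation-number check begins. The invariant being verified is the one from the commit-graph design: a commit's generation number is one more than the maximum among its parents, with parentless commits at 1, and a zero sentinel for graphs written before generation numbers existed. A sketch of the expected-value computation only (Python, not git source; the skip-on-zero behavior is my reading of the truncated check, so treat it as an assumption):

```python
GENERATION_NUMBER_ZERO = 0  # written by graph files that predate generation numbers

def expected_generation(parent_generations):
    """Expected generation number for a commit: one more than the
    largest parent generation; a commit with no parents gets 1. If any
    parent carries the zero sentinel, the value cannot be checked."""
    if GENERATION_NUMBER_ZERO in parent_generations:
        return None  # unverifiable; assumed to match the skip in the patch
    return 1 + max(parent_generations, default=0)
```

'verify' can then compare this expected value against the generation stored in the Commit Data chunk for each commit.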
[PATCH 04/12] commit-graph: verify fanout and lookup table
While running 'git commit-graph verify', verify that the object IDs are listed in lexicographic order and that the fanout table correctly navigates into that list of object IDs. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 33 + 1 file changed, 33 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index ce11af1d20..b4c146c423 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -839,6 +839,9 @@ static int verify_commit_graph_error; int verify_commit_graph(struct commit_graph *g) { + uint32_t i, cur_fanout_pos = 0; + struct object_id prev_oid, cur_oid; + if (!g) { graph_report(_("no commit-graph file loaded")); return 1; @@ -853,5 +856,35 @@ if (!g->chunk_commit_data) graph_report(_("commit-graph is missing the Commit Data chunk")); + for (i = 0; i < g->num_commits; i++) { + hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i); + + if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0) + graph_report(_("commit-graph has incorrect oid order: %s then %s"), + oid_to_hex(&prev_oid), + oid_to_hex(&cur_oid)); + oidcpy(&prev_oid, &cur_oid); + + while (cur_oid.hash[0] > cur_fanout_pos) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + if (i != fanout_value) + graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"), + cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + } + + while (cur_fanout_pos < 256) { + uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos); + + if (g->num_commits != fanout_value) + graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"), + cur_fanout_pos, fanout_value, i); + + cur_fanout_pos++; + } + return verify_commit_graph_error; } -- 2.16.2.329.gfb62395de6
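The property the two loops above enforce is compact: the OID list is sorted, and fanout[b] equals the number of object IDs whose first byte is at most b (so fanout[255] is the commit count). A direct model of that check (Python sketch, not git source; quadratic for clarity where the C walks both structures in one pass):

```python
def check_fanout_model(oids, fanout):
    """Model of the fanout/lookup verification: the oid list (each oid
    given as bytes) must be strictly increasing, and fanout[b] must
    count the oids whose first byte is <= b."""
    errors = []
    for i in range(1, len(oids)):
        if oids[i - 1] >= oids[i]:
            errors.append("incorrect oid order: %s then %s"
                          % (oids[i - 1].hex(), oids[i].hex()))
    for b in range(256):
        expect = sum(1 for oid in oids if oid[0] <= b)
        if fanout[b] != expect:
            errors.append("incorrect fanout value: fanout[%d] = %d != %d"
                          % (b, fanout[b], expect))
    return errors
```

This is why the C code can verify the table with a single cursor (`cur_fanout_pos`): because the OIDs are sorted, each fanout slot's boundary is crossed exactly once.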
[PATCH 12/12] commit-graph: update design document
The commit-graph feature is now integrated with 'fsck' and 'gc', so remove those items from the "Future Work" section of the commit-graph design document. Also remove the section on lazy-loading trees, as that was completed in an earlier patch series. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- Documentation/technical/commit-graph.txt | 22 -- 1 file changed, 22 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index e1a883eb46..c664acbd76 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -118,9 +118,6 @@ Future Work - The commit graph feature currently does not honor commit grafts. This can be remedied by duplicating or refactoring the current graft logic. -- The 'commit-graph' subcommand does not have a "verify" mode that is - necessary for integration with fsck. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered @@ -130,25 +127,6 @@ Future Work - 'log --topo-order' - 'tag --merged' -- Currently, parse_commit_gently() requires filling in the root tree - object for a commit. This passes through lookup_tree() and consequently - lookup_object(). Also, it calls lookup_commit() when loading the parents. - These method calls check the ODB for object existence, even if the - consumer does not need the content. For example, we do not need the - tree contents when computing merge bases. Now that commit parsing is - removed from the computation time, these lookup operations are the - slowest operations keeping graph walks from being fast. Consider - loading these objects without verifying their existence in the ODB and - only loading them fully when consumers need them. Consider a method - such as "ensure_tree_loaded(commit)" that fully loads a tree before - using commit->tree. 
- -- The current design uses the 'commit-graph' subcommand to generate the graph. - When this feature stabilizes enough to recommend to most users, we should - add automatic graph writes to common operations that create many commits. - For example, one could compute a graph on 'clone', 'fetch', or 'repack' - commands. - - A server could provide a commit graph file as part of the network protocol to avoid extra calculations by clients. This feature is only of benefit if the user is willing to trust the file, because verifying the file is correct -- 2.16.2.329.gfb62395de6
[PATCH 03/12] commit-graph: parse commit from chosen graph
Before verifying a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph. Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index c3b8716c14..ce11af1d20 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin } } -int parse_commit_in_graph(struct commit *item) +int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item) { uint32_t pos; @@ -317,9 +317,21 @@ return 0; if (item->object.parsed) return 1; + + if (find_commit_in_graph(item, g, &pos)) + return fill_commit_in_graph(item, g, pos); + + return 0; +} + +int parse_commit_in_graph(struct commit *item) +{ + if (!core_commit_graph) + return 0; + prepare_commit_graph(); - if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) - return fill_commit_in_graph(item, commit_graph, pos); + if (commit_graph) + return parse_commit_in_graph_one(commit_graph, item); return 0; } -- 2.16.2.329.gfb62395de6
[PATCH 06/12] commit-graph: load a root tree from specific graph
When lazy-loading a tree for a commit, it will be important to select the tree from a specific struct commit_graph. Create a new method that specifies the commit-graph file and use that in get_commit_tree_in_graph(). Signed-off-by: Derrick Stolee <dsto...@microsoft.com> --- commit-graph.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index b4c146c423..8c636abba9 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -357,14 +357,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit * return c->maybe_tree; } -struct tree *get_commit_tree_in_graph(const struct commit *c) +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g, +const struct commit *c) { if (c->maybe_tree) return c->maybe_tree; if (c->graph_pos == COMMIT_NOT_FROM_GRAPH) BUG("get_commit_tree_in_graph called from non-commit-graph commit"); - return load_tree_for_commit(commit_graph, (struct commit *)c); + return load_tree_for_commit(g, (struct commit *)c); +} + +struct tree *get_commit_tree_in_graph(const struct commit *c) +{ + return get_commit_tree_in_graph_one(commit_graph, c); } static void write_graph_chunk_fanout(struct hashfile *f, -- 2.16.2.329.gfb62395de6
[PATCH v2] commit-graph: fix UX issue when .lock file exists
We use the lockfile API to avoid multiple Git processes from writing to the commit-graph file in the .git/objects/info directory. In some cases, this directory may not exist, so we check for its existence. The existing code does the following when acquiring the lock: 1. Try to acquire the lock. 2. If it fails, try to create the .git/objects/info directory. 3. Try to acquire the lock, failing if necessary. The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user: "fatal: cannot mkdir .git/objects/info: File exists" While technically this honors the lockfile, it does not help the user. Instead, do the following: 1. Check for existence of .git/objects/info; create if necessary. 2. Try to acquire the lock, failing if necessary. The new output looks like: fatal: Unable to create '/.git/objects/info/commit-graph.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue.
Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..bb54c1214c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "config.h"
+#include "dir.h"
 #include "git-compat-util.h"
 #include "lockfile.h"
 #include "pack.h"
@@ -640,7 +641,6 @@ void write_commit_graph(const char *obj_dir,
 	struct hashfile *f;
 	uint32_t i, count_distinct = 0;
 	char *graph_name;
-	int fd;
 	struct lock_file lk = LOCK_INIT;
 	uint32_t chunk_ids[5];
 	uint64_t chunk_offsets[5];
@@ -754,23 +754,11 @@ void write_commit_graph(const char *obj_dir,
 	compute_generation_numbers(&commits);
 
 	graph_name = get_commit_graph_filename(obj_dir);
-	fd = hold_lock_file_for_update(&lk, graph_name, 0);
-
-	if (fd < 0) {
-		struct strbuf folder = STRBUF_INIT;
-		strbuf_addstr(&folder, graph_name);
-		strbuf_setlen(&folder, strrchr(folder.buf, '/') - folder.buf);
-
-		if (mkdir(folder.buf, 0777) < 0)
-			die_errno(_("cannot mkdir %s"), folder.buf);
-		strbuf_release(&folder);
-
-		fd = hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
-
-		if (fd < 0)
-			die_errno("unable to create '%s'", graph_name);
-	}
+	if (safe_create_leading_directories(graph_name))
+		die_errno(_("unable to create leading directories of %s"),
+			  graph_name);
+
+	hold_lock_file_for_update(&lk, graph_name, LOCK_DIE_ON_ERROR);
 
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	hashwrite_be32(f, GRAPH_SIGNATURE);

base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6
Re: [PATCH 04/12] commit-graph: verify fanout and lookup table
On 5/10/2018 2:29 PM, Martin Ågren wrote:
> On 10 May 2018 at 19:34, Derrick Stolee <dsto...@microsoft.com> wrote:
>> While running 'git commit-graph verify', verify that the object IDs
>> are listed in lexicographic order and that the fanout table correctly
>> navigates into that list of object IDs.
>>
>> Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
>> ---
>>  commit-graph.c | 33 +++++++++++++++++++++++++++++++++
>>  1 file changed, 33 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index ce11af1d20..b4c146c423 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -839,6 +839,9 @@ static int verify_commit_graph_error;
>>
>>  int verify_commit_graph(struct commit_graph *g)
>>  {
>> +	uint32_t i, cur_fanout_pos = 0;
>> +	struct object_id prev_oid, cur_oid;
>> +
>>  	if (!g) {
>>  		graph_report(_("no commit-graph file loaded"));
>>  		return 1;
>> @@ -853,5 +856,35 @@ int verify_commit_graph(struct commit_graph *g)
>>  	if (!g->chunk_commit_data)
>>  		graph_report(_("commit-graph is missing the Commit Data chunk"));
>>
>> +	for (i = 0; i < g->num_commits; i++) {
>> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>> +
>> +		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
>> +			graph_report(_("commit-graph has incorrect oid order: %s then %s"),
>
> Minor: I think our style would prefer s/i > 0/i/. I suppose the second
> check should be s/>=/>/, but it's not like it should matter. ;-)

It shouldn't, but a duplicate commit is still an incorrect oid order.

> I wonder if this is a message that would virtually never make sense to
> a user, but more to a developer. Leave it untranslated to make sure any
> bug reports to the list are readable to us?

Yeah, I'll make all of the errors untranslated since they are for developer debugging, not end-user information.

>> +				     oid_to_hex(&prev_oid),
>> +				     oid_to_hex(&cur_oid));
>
> Hmm, these two lines do not actually achieve anything?

It's a formatting bug: They complete the statement starting with 'graph_report(' above.

>> +		oidcpy(&prev_oid, &cur_oid);
>> +
>> +		while (cur_oid.hash[0] > cur_fanout_pos) {
>> +			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
>> +			if (i != fanout_value)
>> +				graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
>> +					     cur_fanout_pos, fanout_value, i);
>
> Same thought re `_()`, even more so because of the more technical
> notation.
>
>> +
>> +			cur_fanout_pos++;
>> +		}
>> +	}
>> +
>> +	while (cur_fanout_pos < 256) {
>> +		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
>> +
>> +		if (g->num_commits != fanout_value)
>> +			graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
>> +				     cur_fanout_pos, fanout_value, i);
>
> Same here. Or maybe these should just give a translated user-readable
> basic idea of what is wrong and skip the details?

As someone who is responsible for probably inserting these problems, I think the details are important.

Thanks,
-Stolee
Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
On 5/10/2018 4:45 PM, Martin Ågren wrote:
> On 10 May 2018 at 21:22, Stefan Beller wrote:
>> On Thu, May 10, 2018 at 12:05 PM, Martin Ågren wrote:
>>> I hope to find time to do some more hands-on testing of this to see
>>> that errors actually do get caught.
>>
>> Packfiles and loose objects are primary data, which means that those
>> need a more advanced way to diagnose and repair them, so I would
>> imagine the commit graph fsck is closer to bitmaps fsck, which I would
>> have suspected to be found in t5310, but a quick read doesn't reveal
>> many tests that are checking for integrity. So I guess the test
>> coverage here is ok, (although we should always ask for more)
>
> Since I'm wrapping up for today, I'm posting some simple tests that I
> assembled. The last of these showcases one or two problems with the
> current error-reporting. Depending on the error, there can be *lots* of
> errors reported and there are no new-lines, so the result on stdout can
> be a wall of not-very-legible text.
>
> Some of these might not make sense. I just started going through the
> documentation on the format, causing some sort of corruption in each
> field. Maybe this can be helpful somehow.
>
> Martin
>
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 82f95eb11f..a7e48db2de 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -255,4 +255,49 @@ test_expect_success 'git fsck (checks commit-graph)' '
>  	git fsck
>  '
>
> +# usage: corrupt_data <file> <pos> [<data>]
> +corrupt_data() {
> +	file=$1
> +	pos=$2
> +	data="${3:-\0}"
> +	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
> +}
> +
> +test_expect_success 'detect bad signature' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 0 "\0" &&
> +	test_must_fail git commit-graph verify 2>err &&
> +	grep "graph signature" err
> +'
> +
> +test_expect_success 'detect bad version number' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 4 "\02" &&
> +	test_must_fail git commit-graph verify 2>err &&
> +	grep "graph version" err
> +'
> +
> +test_expect_success 'detect bad hash version' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 5 "\02" &&
> +	test_must_fail git commit-graph verify 2>err &&
> +	grep "hash version" err
> +'
> +
> +test_expect_success 'detect too small chunk-count' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 6 "\01" &&
> +	test_must_fail git commit-graph verify 2>err &&
> +	cat err
> +'
> +
>  test_done

Martin: thank you so much for these test examples, and for running them to find out about the newline issue. This is really helpful.
Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
On 5/10/2018 3:22 PM, Stefan Beller wrote:
> On Thu, May 10, 2018 at 12:05 PM, Martin Ågren <martin.ag...@gmail.com> wrote:
>> On 10 May 2018 at 19:34, Derrick Stolee <dsto...@microsoft.com> wrote:
>>> Commits 01-07 focus on the 'git commit-graph verify' subcommand.
>>> These are ready for full, rigorous review.
>>
>> I don't know about "full" and "rigorous", but I tried to offer my
>> thoughts. A lingering feeling I have is that users could possibly
>> benefit from seeing "the commit-graph has a bad foo" a bit more than
>> just "the commit-graph is bad". But adding "the bar is baz, should
>> have been frotz" might not bring that much. Maybe you could keep the
>> translatable string somewhat simple, then, if the more technical data
>> could be useful to Git developers, dump it in a non-translated format.
>> (I guess it could be hidden behind a debug switch, but let's take one
>> step at a time..) This is nothing I feel strongly about.
>>
>>>  t/t5318-commit-graph.sh | 25 +
>>
>> I wonder about tests. Some tests seem to use `dd` to corrupt files and
>> check that it gets caught. Doing this in a hash-agnostic way could be
>> tricky, but maybe not impossible. I guess we could do something
>> probabilistic, like "set the first two bytes of the very last OID to
>> zero -- surely all OIDs can't start with 16 zero bits". Hmm, that
>> might still require knowing the size of the OIDs... I hope to find
>> time to do some more hands-on testing of this to see that errors
>> actually do get caught.
>
> Given that the commit graph is secondary data, the user's workaround to
> quickly get back to a well-working repository is most likely to remove
> the file and regenerate it. As a developer who wants to fix the bug, a
> stacktrace/datadump and the history of git commands might be most
> valuable, but I agree we should hide that behind a debug flag.
>
> Packfiles and loose objects are primary data, which means that those
> need a more advanced way to diagnose and repair them, so I would
> imagine the commit graph fsck is closer to bitmaps fsck, which I would
> have suspected to be found in t5310, but a quick read doesn't reveal
> many tests that are checking for integrity. So I guess the test
> coverage here is ok, (although we should always ask for more)

My main goal is to help developers figure out what is wrong with a file, and then we can use other diagnostic tools to discover how it got into that state.

Martin's initial test cases are wonderful. I've adapted them to test the other conditions in the verify_commit_graph() method and caught some interesting behavior in the process. I'm preparing v2 so we can investigate the direction of the tests.

Thanks,
-Stolee
Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
On 5/10/2018 3:17 PM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, May 10 2018, Derrick Stolee wrote:
>> The behavior in this patch series does the following:
>>
>> 1. Near the end of 'git gc', run 'git commit-graph write'. The
>>    location of this code assumes that a 'git gc --auto' has not
>>    terminated early due to not meeting the auto threshold.
>>
>> 2. At the end of 'git fetch', run 'git commit-graph write'. This means
>>    that every reachable commit will be in the commit-graph after a
>>    successful fetch, which seems a reasonable frequency. Then, the
>>    only times we would be missing a reachable commit is after creating
>>    one locally. There is a problem with the current patch, though:
>>    every 'git fetch' call runs 'git commit-graph write', even if there
>>    were no ref updates or objects downloaded. Is there a simple way to
>>    detect if the fetch was non-trivial?
>>
>> One obvious problem with this approach: if we compute this during 'gc'
>> AND 'fetch', there will be times where a 'fetch' calls 'gc' and
>> triggers two commit-graph writes. If I were to abandon one of these
>> patches, it would be the 'fetch' integration. A 'git gc' really wants
>> to delete all references to unreachable commits, and without updating
>> the commit-graph we may still have commit data in the commit-graph
>> file that is not in the object database. In fact, deleting commits
>> from the object database but not from the commit-graph will cause
>> 'git commit-graph verify' to fail!
>>
>> I welcome discussion on these ideas, as we are venturing out of the
>> "pure data structure" world and into the "user experience" world. I am
>> less confident in my skills in this world, but the feature is
>> worthless if it does not improve the user experience.
>
> I really like #1 here, but I wonder why #2 is necessary. I.e. is it
> critical for the performance of the commit graph feature that it be
> kept really up-to-date, moreso than other things that rely on gc --auto
> (e.g. the optional bitmap index)?

It is not critical. The feature has been designed to have recent commits not in the file. For simplicity, it is probably best to limit ourselves to writing after a non-trivial 'gc'.

> Even if that's the case, I think something that does this via gc --auto
> is a much better option. I.e. now we have gc.auto & gc.autoPackLimit,
> if the answer to my question above is "yes" this could also be
> accomplished by introducing a new graph-specific gc.* setting, and
> --auto would just update the graph more often, but leave the rest.

This is an excellent idea for a follow-up series, if we find we want the commit-graph written more frequently. For now, I'm satisfied with one place where it is automatically computed. I'll drop the fetch integration in my v2 series.

Thanks,
-Stolee
[PATCH v2 02/12] commit-graph: verify file header information
During a run of 'git commit-graph verify', list the issues with the header information in the commit-graph file. Some of this information is inferred from the loaded 'struct commit_graph'. Some header information is checked as part of load_commit_graph_one().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b25aaed128..d2db20e49a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -818,7 +818,37 @@ void write_commit_graph(const char *obj_dir,
 	oids.nr = 0;
 }
 
+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+	va_list ap;
+	struct strbuf sb = STRBUF_INIT;
+	verify_commit_graph_error = 1;
+
+	va_start(ap, fmt);
+	strbuf_vaddf(&sb, fmt, ap);
+
+	fprintf(stderr, "%s\n", sb.buf);
+	strbuf_release(&sb);
+	va_end(ap);
+}
+
 int verify_commit_graph(struct commit_graph *g)
 {
-	return !g;
+	if (!g) {
+		graph_report("no commit-graph file loaded");
+		return 1;
+	}
+
+	verify_commit_graph_error = 0;
+
+	if (!g->chunk_oid_fanout)
+		graph_report("commit-graph is missing the OID Fanout chunk");
+	if (!g->chunk_oid_lookup)
+		graph_report("commit-graph is missing the OID Lookup chunk");
+	if (!g->chunk_commit_data)
+		graph_report("commit-graph is missing the Commit Data chunk");
+
+	return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6
[PATCH v2 01/12] commit-graph: add 'verify' subcommand
If the commit-graph file becomes corrupt, we need a way to verify that its contents match the object database. In the manner of 'git fsck' we will implement a 'git commit-graph verify' subcommand to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its documentation. The subcommand is currently a no-op except for loading the commit-graph into memory, which may trigger run-time errors that would be caught by normal use. Add a simple test that ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do not report any errors.

During review, we noticed that a FREE_AND_NULL(graph_name) was placed after a possible 'return', and this pattern was also in graph_read(). Fix that case, too.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  6 ++++++
 builtin/commit-graph.c             | 40 +++++++++++++++++++++++++++++-
 commit-graph.c                     |  5 +++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            | 10 ++++++++
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 [verse]
 'git commit-graph read' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>]
 'git commit-graph write' [--object-dir <dir>]
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..af3101291f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <dir>]"),
 	N_("git commit-graph read [--object-dir <dir>]"),
+	N_("git commit-graph verify [--object-dir <dir>]"),
 	N_("git commit-graph write [--object-dir <dir>] [--append] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+	N_("git commit-graph verify [--object-dir <dir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
 	N_("git commit-graph read [--object-dir <dir>]"),
 	NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
 	int append;
 } opts;
 
+static int graph_verify(int argc, const char **argv)
+{
+	struct commit_graph *graph = 0;
+	char *graph_name;
+
+	static struct option builtin_commit_graph_verify_options[] = {
+		OPT_STRING(0, "object-dir", &opts.obj_dir,
+			   N_("dir"),
+			   N_("The object directory to store the graph")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_verify_options,
+			     builtin_commit_graph_verify_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	graph_name = get_commit_graph_filename(opts.obj_dir);
+	graph = load_commit_graph_one(graph_name);
+	FREE_AND_NULL(graph_name);
+
+	if (!graph)
+		return 0;
+
+	return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -50,10 +86,10 @@ static int graph_read(int argc, const char **argv)
 
 	graph_name = get_commit_graph_filename(opts.obj_dir);
 	graph = load_commit_graph_one(graph_name);
+	FREE_AND_NULL(graph_name);
 
 	if (!graph)
 		die("graph file %s does not exist", graph_name);
-	FREE_AND_NULL(graph_name);
 
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "verify"))
+			return graph_verify(argc, argv);
 		if (!strcmp(argv[0], "read"))
 			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..b25aaed128 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -817,3 +817,8 @@ void write_commit_graph(const char *obj_dir,
 	oids.alloc = 0;
 	o
[PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
Add test cases to t5318-commit-graph.sh that corrupt the commit-graph file and check that the 'git commit-graph verify' command fails. These tests verify the header and chunk information is checked carefully.

Helped-by: Martin Ågren <martin.ag...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/t5318-commit-graph.sh | 53 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..0cb88232fa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -240,4 +240,57 @@ test_expect_success 'git commit-graph verify' '
 	git commit-graph verify >output
 '
 
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'detect bad signature' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 0 "\0" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "graph signature" verify-errors
+'
+
+test_expect_success 'detect bad version number' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 4 "\02" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "graph version" verify-errors
+'
+
+test_expect_success 'detect bad hash version' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 5 "\02" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "hash version" verify-errors
+'
+
+test_expect_success 'detect too small chunk-count' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 6 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 2 verify-errors &&
+	grep "missing the OID Lookup chunk" verify-errors &&
+	grep "missing the Commit Data chunk" verify-errors
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6
[PATCH v2 00/12] Integrate commit-graph into fsck and gc
I'm sending this v2 re-roll rather quickly after the previous version because Martin provided a framework to add tests to the 'verify' subcommand. I took that framework and added tests for the other checks of the commit-graph data. This also found some interesting things about the command: 1. There were some segfaults because we were not checking for bad data carefully enough. 2. To avoid segfaults, we will now terminate the check early if we find problems earlier in the file, such as in the header, or OID lookup. 3. We were not writing newlines between reports. This now happens by default in graph_report(). The integration into 'fetch' is dropped (thanks Ævar!). Derrick Stolee (12): commit-graph: add 'verify' subcommand commit-graph: verify file header information commit-graph: test that 'verify' finds corruption commit-graph: parse commit from chosen graph commit-graph: verify fanout and lookup table commit: force commit to parse from object database commit-graph: load a root tree from specific graph commit-graph: verify commit contents against odb fsck: verify commit-graph commit-graph: add '--reachable' option gc: automatically write commit-graph files commit-graph: update design document Documentation/config.txt | 6 + Documentation/git-commit-graph.txt | 14 ++- Documentation/git-fsck.txt | 3 + Documentation/git-gc.txt | 4 + Documentation/technical/commit-graph.txt | 22 builtin/commit-graph.c | 81 - builtin/fsck.c | 21 builtin/gc.c | 8 ++ commit-graph.c | 199 ++- commit-graph.h | 2 + commit.c | 13 +- commit.h | 1 + t/t5318-commit-graph.sh | 173 +++ 13 files changed, 509 insertions(+), 38 deletions(-) base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e -- 2.16.2.329.gfb62395de6
[PATCH v2 06/12] commit: force commit to parse from object database
In anticipation of verifying commit-graph file contents against the object database, create parse_commit_internal() to allow side-stepping the commit-graph file and parsing directly from the object database.

Due to the use of generation numbers, this method should not be called unless the intention is explicit in avoiding commits from the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 13 +++++++++----
 commit.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..7c92350373 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
 	enum object_type type;
 	void *buffer;
@@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (parse_commit_in_graph(item))
+	if (use_commit_graph && parse_commit_in_graph(item))
 		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
 			error("Could not read %s",
-			oid_to_hex(&item->object.oid));
+			      oid_to_hex(&item->object.oid));
 	if (type != OBJ_COMMIT) {
 		free(buffer);
 		return error("Object %s not a commit",
-			oid_to_hex(&item->object.oid));
+			     oid_to_hex(&item->object.oid));
 	}
 	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 	return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+	return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
 	if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6
[PATCH v2 05/12] commit-graph: verify fanout and lookup table
While running 'git commit-graph verify', verify that the object IDs are listed in lexicographic order and that the fanout table correctly navigates into that list of object IDs.

Add tests that check these corruptions are caught by the verify subcommand. Most of the tests check that the full output matches the exact error we inserted, but since our OID-order test triggers incorrect fanout values (with possibly different numbers of output lines) we check only that the correct error is written in that case.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c          | 36 ++++++++++++++++++++++++++++++++++++
 t/t5318-commit-graph.sh | 31 +++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 4dfff7e752..b0fd1d5320 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -848,6 +848,9 @@ static void graph_report(const char *fmt, ...)
 
 int verify_commit_graph(struct commit_graph *g)
 {
+	uint32_t i, cur_fanout_pos = 0;
+	struct object_id prev_oid, cur_oid;
+
 	if (!g) {
 		graph_report("no commit-graph file loaded");
 		return 1;
@@ -862,5 +865,38 @@ int verify_commit_graph(struct commit_graph *g)
 	if (!g->chunk_commit_data)
 		graph_report("commit-graph is missing the Commit Data chunk");
 
+	if (verify_commit_graph_error)
+		return 1;
+
+	for (i = 0; i < g->num_commits; i++) {
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report("commit-graph has incorrect oid order: %s then %s",
+				     oid_to_hex(&prev_oid),
+				     oid_to_hex(&cur_oid));
+
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+	}
+
+	while (cur_fanout_pos < 256) {
+		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+		if (g->num_commits != fanout_value)
+			graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+				     cur_fanout_pos, fanout_value, i);
+
+		cur_fanout_pos++;
+	}
+
 	return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 0cb88232fa..6fb306b0da 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -293,4 +293,35 @@ test_expect_success 'detect too small chunk-count' '
 	grep "missing the Commit Data chunk" verify-errors
 '
 
+test_expect_success 'detect incorrect chunk lookup value' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 25 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "improper chunk offset" verify-errors
+'
+
+test_expect_success 'detect incorrect fanout' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 128 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "fanout value" verify-errors
+'
+
+test_expect_success 'detect incorrect OID order' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1272 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep "incorrect oid order" err
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6
[PATCH v2 04/12] commit-graph: parse commit from chosen graph
Before verifying a commit-graph file against the object database, we need to parse all commits from the given commit-graph file. Create parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2db20e49a..4dfff7e752 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
 	uint32_t pos;
 
@@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item)
 		return 0;
 	if (item->object.parsed)
 		return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+		return fill_commit_in_graph(item, g, pos);
+
+	return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commit_graph)
+		return 0;
+
 	prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-		return fill_commit_in_graph(item, commit_graph, pos);
+	if (commit_graph)
+		return parse_commit_in_graph_one(commit_graph, item);
 	return 0;
 }
-- 
2.16.2.329.gfb62395de6