[RFC PATCH 1/6] commit-reach: add trace2 telemetry and walk count
paint_down_to_common() is used by many Git commands, and sometimes multiple times in a single call. It is important to measure performance of this method, but the actual time it takes can vary due to interactions outside Git's control (file system, CPU contention, etc.). Instead, count how many times we execute the while loop, which is consistent between runs. Signed-off-by: Derrick Stolee --- commit-reach.c | 8 1 file changed, 8 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index 86715c103c..ee374dce20 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -38,6 +38,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit_list *result = NULL; int i; uint32_t last_gen = GENERATION_NUMBER_INFINITY; + uint32_t num_walked = 0; + + trace2_region_enter("paint_down_to_common"); one->object.flags |= PARENT1; if (!n) { @@ -55,6 +58,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit *commit = prio_queue_get(&queue); struct commit_list *parents; int flags; + num_walked++; if (commit->generation > last_gen) BUG("bad generation skip %8x > %8x at %s", @@ -88,6 +92,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, } clear_prio_queue(&queue); + + trace2_data_intmax("paint_down_to_common", "num_walked", num_walked); + trace2_region_leave("paint_down_to_common"); + return result; } -- 2.19.0.rc2
[RFC PATCH 0/6] Use trace2 in commit-reach
As promised, here is the direction I took when applying the trace2 feature to the walking code in commit-reach.c. Hence, this depends on Jeff's trace2 patches and ds/reachable. There are multiple benefits to the approach I take here: 1. If a user has performance problems, we can rerun the command with tracing on and get the region enter/leave notifications and the extra data like "num_walked". 2. While I was testing this series against real-world examples, I found a number I didn't expect. The numbers for can_all_from_reach_with_flags were much higher than I expected. Turns out the heuristic I wrote was not working correctly. With the trace2 library, I was able to add a "run_and_check_trace2" function in test-lib.sh so I could make the number of walked commits be a condition we check in the test. Then, the benefit we expect is demonstrated by the test suite when I fix the bug. Thanks, -Stolee P.S. I'm sending this RFC from gmail because I'm having SMTP issues with my work email. Derrick Stolee (6): commit-reach: add trace2 telemetry and walk count comit-reach: use trace2 for commit_contains_tag_algo commit-reach: use trace2 in can_all_from_reach test-tool: start trace2 environment test-lib: add run_and_check_trace2 commit-reach: fix first-parent heuristic commit-reach.c| 32 ++-- t/helper/test-tool.c | 3 +++ t/t6600-test-reach.sh | 6 ++ t/test-lib.sh | 14 ++ 4 files changed, 53 insertions(+), 2 deletions(-) -- 2.19.0-rc2
Re: [PATCH 1/2] commit-graph write: add progress output
On 9/5/2018 5:46 PM, Junio C Hamano wrote: Derrick Stolee writes: for (i = 0; i < commits->nr; i++) { + display_progress(progress, i); if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && commits->list[i]->generation != GENERATION_NUMBER_ZERO) continue; I am wondering if the progress call should be moved after this conditional continue; would we want to count the entry whose generation is already known here? Of course, as we give commits->nr as the 100% ceiling, we cannot avoid doing so, but it somehow smells wrong. If we wanted to be completely right, we would count the commits in the list that do not have a generation number and report that as the 100% ceiling. Yeah, but I realize that the definition of "right" really depends on what we consider a task being accomplished is in this loop. If we define the task to "we have some number of commits that lack generation numbers and our task is to assign numbers to them.", then yes counting the ones without generation number and culling the ones that already have generation number is outside the work and we need another loop to count them. But the position the posted patch takes is also a valid one: we have some commits and we are making sure each and every one of them has assigned a generation number. So I do not think it is necessary to introduce another loop just for counting. Thanks. Makes sense to me. Thanks!
Re: [PATCH 01/11] multi-pack-index: add 'verify' verb
On 9/5/2018 2:59 PM, Eric Sunshine wrote: On Wed, Sep 5, 2018 at 10:46 AM Derrick Stolee via GitGitGadget wrote: The multi-pack-index builtin writes multi-pack-index files, and uses a 'write' verb to do so. Add a 'verify' verb that checks this file matches the contents of the pack-indexes it replaces. [...] Signed-off-by: Derrick Stolee --- diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c @@ -5,7 +5,7 @@ static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] write"), + N_("git multi-pack-index [--object-dir=] [write|verify]"), Nit: The square brackets make the verb optional. You probably want parenthesis to indicate that one or the other is required: git multi-pack-index [--object-dir=] (write|verify) Thanks for catching this mistake! This would be easy to miss and keep around forever.
Re: [PATCH 04/11] multi-pack-index: verify packname order
On 9/5/2018 3:14 PM, Stefan Beller wrote: On Wed, Sep 5, 2018 at 12:11 PM Derrick Stolee wrote: On 9/5/2018 2:15 PM, Stefan Beller wrote: On Wed, Sep 5, 2018 at 7:46 AM Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee The final check we make while loading a multi-pack-index is that the packfile names are in lexicographical order. Make this error be a die() instead. What is the advantage of having a die() here? Would the midx work under normal operation when not sorted correctly? This sounds like a hack for easy testability in this context, so could you clarify why this helps the regular user? The multi-pack-index will not work as expected, since midx_contains_pack() may provide incorrect results, and thus we will add the "missing" packfiles to the packed_git linked list. This _should_ be a die(), as something unexpected occurred (the file doesn't match the format specification). Thanks for the clarification. So this patch actually hits 2 birds with one stone, as it fixes a bug as well as adds the check for such corruption to the verify step? This is a common occurrence in this series (replacing error() with die()) as we are now exercising those conditions and clarifying what should happen.
Re: [PATCH 04/11] multi-pack-index: verify packname order
On 9/5/2018 2:15 PM, Stefan Beller wrote: On Wed, Sep 5, 2018 at 7:46 AM Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee The final check we make while loading a multi-pack-index is that the packfile names are in lexicographical order. Make this error be a die() instead. What is the advantage of having a die() here? Would the midx work under normal operation when not sorted correctly? This sounds like a hack for easy testability in this context, so could you clarify why this helps the regular user? The multi-pack-index will not work as expected, since midx_contains_pack() may provide incorrect results, and thus we will add the "missing" packfiles to the packed_git linked list. This _should_ be a die(), as something unexpected occurred (the file doesn't match the format specification).
[PATCH 11/11] fsck: verify multi-pack-index
From: Derrick Stolee When core.multiPackIndex is true, we may have a multi-pack-index in our object directory. Add calls to 'git multi-pack-index verify' at the end of 'git fsck' if so. Signed-off-by: Derrick Stolee --- builtin/fsck.c | 18 ++ t/t5319-multi-pack-index.sh | 13 - 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 63c8578cc1..06eb421720 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -848,5 +848,23 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) } } + if (!git_config_get_bool("core.multipackindex", &i) && i) { + struct child_process midx_verify = CHILD_PROCESS_INIT; + const char *midx_argv[] = { "multi-pack-index", "verify", NULL, NULL, NULL }; + + midx_verify.argv = midx_argv; + midx_verify.git_cmd = 1; + if (run_command(&midx_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + + prepare_alt_odb(the_repository); + for (alt = the_repository->objects->alt_odb_list; alt; alt = alt->next) { + midx_argv[2] = "--object-dir"; + midx_argv[3] = alt->path; + if (run_command(&midx_verify)) + errors_found |= ERROR_COMMIT_GRAPH; + } + } + return errors_found; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index b6c34631d3..63f8718fd7 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -160,12 +160,17 @@ corrupt_midx_and_verify() { DATA="${2:-\0}" OBJDIR=$3 GREPSTR="$4" + COMMAND="$5" + if test -z "$COMMAND" + then + COMMAND="git multi-pack-index verify --object-dir=$OBJDIR" + fi FILE=$OBJDIR/pack/multi-pack-index && chmod a+w $FILE && test_when_finished mv midx-backup $FILE && cp $FILE midx-backup && printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc && - test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 2>test_err && + test_must_fail $COMMAND 2>test_err && grep -v "^+" test_err >err && test_i18ngrep "$GREPSTR" err } @@ -258,6 +263,12 @@ test_expect_success 'verify incorrect offset' ' "incorrect object offset" ' +test_expect_success 'git-fsck incorrect offset' ' + corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \ + "incorrect object offset" \ + "git -c core.multipackindex=true fsck" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH 10/11] multi-pack-index: report progress during 'verify'
From: Derrick Stolee When verifying a multi-pack-index, the only action that takes significant time is checking the object offsets. For example, to verify a multi-pack-index containing 6.2 million objects in the Linux kernel repository takes 1.3 seconds on my machine. 99% of that time is spent looking up object offsets in each of the packfiles and comparing them to the multi-pack-index offset. Add a progress indicator for that section of the 'verify' verb. Signed-off-by: Derrick Stolee --- midx.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/midx.c b/midx.c index 87708dc367..489180a599 100644 --- a/midx.c +++ b/midx.c @@ -7,6 +7,7 @@ #include "object-store.h" #include "sha1-lookup.h" #include "midx.h" +#include "progress.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 @@ -939,6 +940,7 @@ static void midx_report(const char *fmt, ...) int verify_midx_file(const char *object_dir) { uint32_t i; + struct progress *progress = NULL; struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); verify_midx_error = 0; @@ -970,6 +972,7 @@ int verify_midx_file(const char *object_dir) i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1); } + progress = start_progress(_("Verifying object offsets"), m->num_objects); for (i = 0; i < m->num_objects; i++) { struct object_id oid; struct pack_entry e; @@ -994,7 +997,10 @@ int verify_midx_file(const char *object_dir) if (m_offset != p_offset) midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64), i, oid_to_hex(&oid), m_offset, p_offset); + + display_progress(progress, i + 1); } + stop_progress(&progress); return verify_midx_error; } -- gitgitgadget
[PATCH 03/11] multi-pack-index: verify corrupt chunk lookup table
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 3 +++ t/t5319-multi-pack-index.sh | 13 + 2 files changed, 16 insertions(+) diff --git a/midx.c b/midx.c index ec78254bb6..8b054b39ab 100644 --- a/midx.c +++ b/midx.c @@ -100,6 +100,9 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 + MIDX_CHUNKLOOKUP_WIDTH * i); + if (chunk_offset >= m->data_len) + die(_("invalid chunk offset (too large)")); + switch (chunk_id) { case MIDX_CHUNKID_PACKNAMES: m->chunk_pack_names = m->data + chunk_offset; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 836d2bdb53..5c9c499aa6 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -178,6 +178,9 @@ test_expect_success 'verify bad signature' ' MIDX_BYTE_VERSION=4 MIDX_BYTE_OID_VERSION=5 MIDX_BYTE_CHUNK_COUNT=6 +MIDX_HEADER_SIZE=12 +MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE +MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -199,6 +202,16 @@ test_expect_success 'verify extended chunk count' ' "terminating multi-pack-index chunk id appears earlier than expected" ' +test_expect_success 'verify missing required chunk' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_ID "\01" $objdir \ + "missing required" +' + +test_expect_success 'verify invalid chunk offset' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \ + "invalid chunk offset (too large)" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH 07/11] multi-pack-index: verify oid lookup order
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 11 +++ t/t5319-multi-pack-index.sh | 8 2 files changed, 19 insertions(+) diff --git a/midx.c b/midx.c index dfd26b4d74..06d5cfc826 100644 --- a/midx.c +++ b/midx.c @@ -959,5 +959,16 @@ int verify_midx_file(const char *object_dir) i, oid_fanout1, oid_fanout2, i + 1); } + for (i = 0; i < m->num_objects - 1; i++) { + struct object_id oid1, oid2; + + nth_midxed_object_oid(&oid1, m, i); + nth_midxed_object_oid(&oid2, m, i + 1); + + if (oidcmp(&oid1, &oid2) >= 0) + midx_report(_("oid lookup out of order: oid[%d] = %s >= %s = oid[%d]"), + i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 95ac7a6edd..072b1d1a74 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -175,6 +175,7 @@ test_expect_success 'verify bad signature' ' "multi-pack-index signature" ' +HASH_LEN=20 MIDX_BYTE_VERSION=4 MIDX_BYTE_OID_VERSION=5 MIDX_BYTE_CHUNK_COUNT=6 @@ -189,6 +190,8 @@ MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2)) MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652)) MIDX_OID_FANOUT_WIDTH=4 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1)) +MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH)) +MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -235,6 +238,11 @@ test_expect_success 'verify oid fanout out of order' ' "oid fanout out of order" ' +test_expect_success 'verify oid lookup out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_OID_LOOKUP "\00" $objdir \ + "oid lookup out of order" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH 09/11] multi-pack-index: verify object offsets
From: Derrick Stolee The 'git multi-pack-index verify' command must verify the object offsets stored in the multi-pack-index are correct. There are two ways the offset chunk can be incorrect: the pack-int-id and the object offset. Replace the BUG() statement with a die() statement, now that we may hit a bad pack-int-id during a 'verify' command on a corrupt multi-pack-index, and it is covered by a test. Signed-off-by: Derrick Stolee --- midx.c | 28 +++- t/t5319-multi-pack-index.sh | 27 +++ 2 files changed, 54 insertions(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 80094c02a7..87708dc367 100644 --- a/midx.c +++ b/midx.c @@ -197,7 +197,7 @@ int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id) struct strbuf pack_name = STRBUF_INIT; if (pack_int_id >= m->num_packs) - BUG("bad pack-int-id"); + die(_("bad pack-int-id")); if (m->packs[pack_int_id]) return 0; @@ -970,5 +970,31 @@ int verify_midx_file(const char *object_dir) i, oid_to_hex(&oid1), oid_to_hex(&oid2), i + 1); } + for (i = 0; i < m->num_objects; i++) { + struct object_id oid; + struct pack_entry e; + off_t m_offset, p_offset; + + nth_midxed_object_oid(&oid, m, i); + if (!fill_midx_entry(&oid, &e, m)) { + midx_report(_("failed to load pack entry for oid[%d] = %s"), + i, oid_to_hex(&oid)); + continue; + } + + if (open_pack_index(e.p)) { + midx_report(_("failed to load pack-index for packfile %s"), + e.p->pack_name); + break; + } + + m_offset = e.offset; + p_offset = find_pack_entry_one(oid.hash, e.p); + + if (m_offset != p_offset) + midx_report(_("incorrect object offset for oid[%d] = %s: %"PRIx64" != %"PRIx64), + i, oid_to_hex(&oid), m_offset, p_offset); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 072b1d1a74..b6c34631d3 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -176,6 +176,7 @@ test_expect_success 'verify bad signature' ' ' HASH_LEN=20 +NUM_OBJECTS=74 MIDX_BYTE_VERSION=4 MIDX_BYTE_OID_VERSION=5 MIDX_BYTE_CHUNK_COUNT=6 @@ -192,6 +193,10 @@ MIDX_OID_FANOUT_WIDTH=4 MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1)) MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH)) MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN)) +MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN)) +MIDX_OFFSET_WIDTH=8 +MIDX_BYTE_PACK_INT_ID=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 2)) +MIDX_BYTE_OFFSET=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 6)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -243,6 +248,16 @@ test_expect_success 'verify oid lookup out of order' ' "oid lookup out of order" ' +test_expect_success 'verify incorrect pack-int-id' ' + corrupt_midx_and_verify $MIDX_BYTE_PACK_INT_ID "\07" $objdir \ + "bad pack-int-id" +' + +test_expect_success 'verify incorrect offset' ' + corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\07" $objdir \ + "incorrect object offset" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && @@ -310,4 +325,16 @@ test_expect_success 'verify multi-pack-index with 64-bit offsets' ' git multi-pack-index verify --object-dir=objects64 ' +NUM_OBJECTS=63 +MIDX_OFFSET_OID_FANOUT=$((MIDX_OFFSET_PACKNAMES + 54)) +MIDX_OFFSET_OID_LOOKUP=$((MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH)) +MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN)) +MIDX_OFFSET_LARGE_OFFSETS=$(($MIDX_OFFSET_OBJECT_OFFSETS + $NUM_OBJECTS * $MIDX_OFFSET_WIDTH)) +MIDX_BYTE_LARGE_OFFSET=$(($MIDX_OFFSET_LARGE_OFFSETS + 3)) + +test_expect_success 'verify incorrect 64-bit offset' ' + corrupt_midx_and_verify $MIDX_BYTE_LARGE_OFFSET "\07" objects64 \ + "incorrect object offset" +' + test_done -- gitgitgadget
[PATCH 08/11] multi-pack-index: fix 32-bit vs 64-bit size check
From: Derrick Stolee When loading a 64-bit offset, we intend to check that off_t can store the resulting offset. However, the condition accidentally checks the 32-bit offset to see if it is smaller than a 64-bit value. Fix it, and this will be covered by a test in the 'git multi-pack-index verify' command in a later commit. Signed-off-by: Derrick Stolee --- midx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/midx.c b/midx.c index 06d5cfc826..80094c02a7 100644 --- a/midx.c +++ b/midx.c @@ -236,7 +236,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos) offset32 = get_be32(offset_data + sizeof(uint32_t)); if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) { - if (sizeof(offset32) < sizeof(uint64_t)) + if (sizeof(off_t) < sizeof(uint64_t)) die(_("multi-pack-index stores a 64-bit offset, but off_t is too small")); offset32 ^= MIDX_LARGE_OFFSET_NEEDED; -- gitgitgadget
[PATCH 06/11] multi-pack-index: verify oid fanout order
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 9 + t/t5319-multi-pack-index.sh | 8 2 files changed, 17 insertions(+) diff --git a/midx.c b/midx.c index a02b19efc1..dfd26b4d74 100644 --- a/midx.c +++ b/midx.c @@ -950,5 +950,14 @@ int verify_midx_file(const char *object_dir) midx_report("failed to load pack in position %d", i); } + for (i = 0; i < 255; i++) { + uint32_t oid_fanout1 = ntohl(m->chunk_oid_fanout[i]); + uint32_t oid_fanout2 = ntohl(m->chunk_oid_fanout[i + 1]); + + if (oid_fanout1 > oid_fanout2) + midx_report(_("oid fanout out of order: fanout[%d] = %"PRIx32" > %"PRIx32" = fanout[%d]"), + i, oid_fanout1, oid_fanout2, i + 1); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 52b7f7905b..95ac7a6edd 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -186,6 +186,9 @@ MIDX_CHUNK_LOOKUP_WIDTH=12 MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \ $MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH)) MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2)) +MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + 652)) +MIDX_OID_FANOUT_WIDTH=4 +MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + 1)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -227,6 +230,11 @@ test_expect_success 'verify packnames out of order' ' "failed to load pack" ' +test_expect_success 'verify oid fanout out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_OID_FANOUT_ORDER "\01" $objdir \ + "oid fanout out of order" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH 05/11] multi-pack-index: verify missing pack
From: Derrick Stolee Signed-off-by: Derrick Stolee --- midx.c | 16 t/t5319-multi-pack-index.sh | 5 + 2 files changed, 21 insertions(+) diff --git a/midx.c b/midx.c index e655a15aed..a02b19efc1 100644 --- a/midx.c +++ b/midx.c @@ -926,13 +926,29 @@ void clear_midx_file(const char *object_dir) int verify_midx_error; +static void midx_report(const char *fmt, ...) +{ + va_list ap; + verify_midx_error = 1; + va_start(ap, fmt); + vfprintf(stderr, fmt, ap); + fprintf(stderr, "\n"); + va_end(ap); +} + int verify_midx_file(const char *object_dir) { + uint32_t i; struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); verify_midx_error = 0; if (!m) return 0; + for (i = 0; i < m->num_packs; i++) { + if (prepare_midx_pack(m, i)) + midx_report("failed to load pack in position %d", i); + } + return verify_midx_error; } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 4a8f231d93..52b7f7905b 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -222,6 +222,11 @@ test_expect_success 'verify packnames out of order' ' "pack names out of order" ' +test_expect_success 'verify packnames out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "a" $objdir \ + "failed to load pack" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH 04/11] multi-pack-index: verify packname order
From: Derrick Stolee The final check we make while loading a multi-pack-index is that the packfile names are in lexicographical order. Make this error be a die() instead. In order to test this condition, we need multiple packfiles. Earlier in t5319-multi-pack-index.sh, we tested the interaction with 'git repack' but this limits us to one packfile in our object dir. Move these repack tests until after the 'verify' tests. Signed-off-by: Derrick Stolee --- midx.c | 6 ++ t/t5319-multi-pack-index.sh | 10 ++ 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/midx.c b/midx.c index 8b054b39ab..e655a15aed 100644 --- a/midx.c +++ b/midx.c @@ -157,12 +157,10 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local cur_pack_name += strlen(cur_pack_name) + 1; - if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) { - error(_("multi-pack-index pack names out of order: '%s' before '%s'"), + if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) + die(_("multi-pack-index pack names out of order: '%s' before '%s'"), m->pack_names[i - 1], m->pack_names[i]); - goto cleanup_fail; - } } return m; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 5c9c499aa6..4a8f231d93 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -181,6 +181,11 @@ MIDX_BYTE_CHUNK_COUNT=6 MIDX_HEADER_SIZE=12 MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4)) +MIDX_NUM_CHUNKS=5 +MIDX_CHUNK_LOOKUP_WIDTH=12 +MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \ +$MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH)) +MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2)) test_expect_success 'verify bad version' ' corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ @@ -212,6 +217,11 @@ test_expect_success 'verify invalid chunk offset' ' "invalid chunk offset (too large)" ' +test_expect_success 'verify packnames out of order' ' + corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \ + "pack names out of order" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && -- gitgitgadget
[PATCH 02/11] multi-pack-index: verify bad header
From: Derrick Stolee When verifying if a multi-pack-index file is valid, we want the command to fail to signal an invalid file. Previously, we wrote an error to stderr and continued as if we had no multi-pack-index. Now, die() instead of error(). Add tests that check corrupted headers in a few ways: * Bad signature * Bad file version * Bad hash version * Truncated hash count * Extended hash count Signed-off-by: Derrick Stolee --- midx.c | 18 +-- t/t5319-multi-pack-index.sh | 46 - 2 files changed, 51 insertions(+), 13 deletions(-) diff --git a/midx.c b/midx.c index b253bed517..ec78254bb6 100644 --- a/midx.c +++ b/midx.c @@ -76,24 +76,18 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local m->local = local; m->signature = get_be32(m->data); - if (m->signature != MIDX_SIGNATURE) { - error(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"), + if (m->signature != MIDX_SIGNATURE) + die(_("multi-pack-index signature 0x%08x does not match signature 0x%08x"), m->signature, MIDX_SIGNATURE); - goto cleanup_fail; - } m->version = m->data[MIDX_BYTE_FILE_VERSION]; - if (m->version != MIDX_VERSION) { - error(_("multi-pack-index version %d not recognized"), + if (m->version != MIDX_VERSION) + die(_("multi-pack-index version %d not recognized"), m->version); - goto cleanup_fail; - } hash_version = m->data[MIDX_BYTE_HASH_VERSION]; - if (hash_version != MIDX_HASH_VERSION) { - error(_("hash version %u does not match"), hash_version); - goto cleanup_fail; - } + if (hash_version != MIDX_HASH_VERSION) + die(_("hash version %u does not match"), hash_version); m->hash_len = MIDX_HASH_LEN; m->num_chunks = m->data[MIDX_BYTE_NUM_CHUNKS]; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 1c4e0e6d31..836d2bdb53 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -154,6 +154,51 @@ test_expect_success 'verify multi-pack-index success' ' git multi-pack-index verify --object-dir=$objdir ' +# usage: corrupt_midx_and_verify +corrupt_midx_and_verify() { + POS=$1 + DATA="${2:-\0}" + OBJDIR=$3 + GREPSTR="$4" + FILE=$OBJDIR/pack/multi-pack-index && + chmod a+w $FILE && + test_when_finished mv midx-backup $FILE && + cp $FILE midx-backup && + printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc && + test_must_fail git multi-pack-index verify --object-dir=$OBJDIR 2>test_err && + grep -v "^+" test_err >err && + test_i18ngrep "$GREPSTR" err +} + +test_expect_success 'verify bad signature' ' + corrupt_midx_and_verify 0 "\00" $objdir \ + "multi-pack-index signature" +' + +MIDX_BYTE_VERSION=4 +MIDX_BYTE_OID_VERSION=5 +MIDX_BYTE_CHUNK_COUNT=6 + +test_expect_success 'verify bad version' ' + corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \ + "multi-pack-index version" +' + +test_expect_success 'verify bad OID version' ' + corrupt_midx_and_verify $MIDX_BYTE_OID_VERSION "\02" $objdir \ + "hash version" +' + +test_expect_success 'verify truncated chunk count' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \ + "missing required" +' + +test_expect_success 'verify extended chunk count' ' + corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \ + "terminating multi-pack-index chunk id appears earlier than expected" +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && @@ -191,7 +236,6 @@ test_expect_success 'multi-pack-index in an alternate' ' compare_results_with_midx "with alternate (remote midx)" - # usage: corrupt_data [] corrupt_data () { file=$1 -- gitgitgadget
[PATCH 01/11] multi-pack-index: add 'verify' verb
From: Derrick Stolee The multi-pack-index builtin writes multi-pack-index files, and uses a 'write' verb to do so. Add a 'verify' verb that checks this file matches the contents of the pack-indexes it replaces. The current implementation is a no-op, but will be extended in small increments in later commits. Signed-off-by: Derrick Stolee --- Documentation/git-multi-pack-index.txt | 10 ++ builtin/multi-pack-index.c | 4 +++- midx.c | 13 + midx.h | 1 + t/t5319-multi-pack-index.sh| 8 5 files changed, 35 insertions(+), 1 deletion(-) diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 1f97e79912..f7778a2c85 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -27,6 +27,10 @@ write:: When given as the verb, write a new MIDX file to `/packs/multi-pack-index`. +verify:: + When given as the verb, verify the contents of the MIDX file + at `/packs/multi-pack-index`. + EXAMPLES @@ -43,6 +47,12 @@ $ git multi-pack-index write $ git multi-pack-index --object-dir write --- +* Verify the MIDX file for the packfiles in the current .git folder. ++ +--- +$ git multi-pack-index verify +--- + SEE ALSO diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 2633efd95d..7d567dafbc 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -5,7 +5,7 @@ #include "midx.h" static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [--object-dir=] write"), + N_("git multi-pack-index [--object-dir=] [write|verify]"), NULL }; @@ -42,6 +42,8 @@ int cmd_multi_pack_index(int argc, const char **argv, if (!strcmp(argv[0], "write")) return write_midx_file(opts.object_dir); + if (!strcmp(argv[0], "verify")) + return verify_midx_file(opts.object_dir); die(_("unrecognized verb: %s"), argv[0]); } diff --git a/midx.c b/midx.c index f3e8dbc108..b253bed517 100644 --- a/midx.c +++ b/midx.c @@ -928,3 +928,16 @@ void clear_midx_file(const char *object_dir) free(midx); } + +int verify_midx_error; + +int verify_midx_file(const char *object_dir) +{ + struct multi_pack_index *m = load_multi_pack_index(object_dir, 1); + verify_midx_error = 0; + + if (!m) + return 0; + + return verify_midx_error; +} diff --git a/midx.h b/midx.h index a210f1af2a..ce80b91c68 100644 --- a/midx.h +++ b/midx.h @@ -43,5 +43,6 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i int write_midx_file(const char *object_dir); void clear_midx_file(const char *object_dir); +int verify_midx_file(const char *object_dir); #endif diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 6f56b38674..1c4e0e6d31 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -150,6 +150,10 @@ test_expect_success 'write midx with twelve packs' ' compare_results_with_midx "twelve packs" +test_expect_success 'verify multi-pack-index success' ' + git multi-pack-index verify --object-dir=$objdir +' + test_expect_success 'repack removes multi-pack-index' ' test_path_is_file $objdir/pack/multi-pack-index && git repack -adf && @@ -214,4 +218,8 @@ test_expect_success 'force some 64-bit offsets with pack-objects' ' midx_read_expect 1 63 5 objects64 " large-offsets" ' +test_expect_success 'verify multi-pack-index with 64-bit offsets' ' + git multi-pack-index verify --object-dir=objects64 +' + test_done -- gitgitgadget
[PATCH 00/11] Add 'git multi-pack-index verify' command
The multi-pack-index file provides faster lookups in repos with many packfiles by duplicating the information from multiple pack-indexes into a single file. This series allows us to verify a multi-pack-index using 'git multi-pack-index verify' and 'git fsck' (when core.multiPackIndex is true). The pattern for the tests is similar to that found in t5318-commit-graph.sh. During testing, I found a bug in how we check for the size of off_t (we are not actually checking off_t, but instead uint32_t). See "multi-pack-index: fix 32-bit vs 64-bit size check". Thanks to Ævar [1], I added a commit that provides progress updates when checking object offsets. Based on ds/multi-pack-index [1] https://public-inbox.org/git/20180904202729.13900-1-ava...@gmail.com/T/#u Derrick Stolee (11): multi-pack-index: add 'verify' verb multi-pack-index: verify bad header multi-pack-index: verify corrupt chunk lookup table multi-pack-index: verify packname order multi-pack-index: verify missing pack multi-pack-index: verify oid fanout order multi-pack-index: verify oid lookup order multi-pack-index: fix 32-bit vs 64-bit size check multi-pack-index: verify object offsets multi-pack-index: report progress during 'verify' fsck: verify multi-pack-index Documentation/git-multi-pack-index.txt | 10 ++ builtin/fsck.c | 18 builtin/multi-pack-index.c | 4 +- midx.c | 112 midx.h | 1 + t/t5319-multi-pack-index.sh| 136 - 6 files changed, 261 insertions(+), 20 deletions(-) base-commit: 6a22d521260f86dff8fe6f23ab329cebb62ba4f0 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-34%2Fderrickstolee%2Fmidx%2Fverify-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-34/derrickstolee/midx/verify-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/34 -- gitgitgadget
Re: [PATCH 0/2] commit-graph: add progress output
On 9/4/2018 4:27 PM, Ævar Arnfjörð Bjarmason wrote: This series adds progress output to the commit-graph command, so that when it's called by "git gc" or "git fsck" we can see what's going on with it. Ævar Arnfjörð Bjarmason (2): commit-graph write: add progress output commit-graph verify: add progress output commit-graph.c | 44 +++- 1 file changed, 43 insertions(+), 1 deletion(-) Thanks for writing this, Ævar. I appreciate that you took the time to fill in an important UX gap. I had a couple of comments, but generally this is a good change. -Stolee
Re: [PATCH 1/2] commit-graph write: add progress output
On 9/4/2018 4:27 PM, Ævar Arnfjörð Bjarmason wrote: @@ -591,8 +597,13 @@ static void close_reachable(struct packed_oid_list *oids) { int i; struct commit *commit; + struct progress *progress = NULL; + int j = 0; The change below over-counts the number of commits we are processing (by at least double, possibly triple). + progress = start_delayed_progress( + _("Annotating commits in commit graph"), 0); for (i = 0; i < oids->nr; i++) { + display_progress(progress, ++j); commit = lookup_commit(the_repository, &oids->list[i]); if (commit) commit->object.flags |= UNINTERESTING; This count is the number of oids given to the method. For 'git commit-graph write --reachable', this will be the number of refs. @@ -604,6 +615,7 @@ static void close_reachable(struct packed_oid_list *oids) * closure. */ for (i = 0; i < oids->nr; i++) { + display_progress(progress, ++j); commit = lookup_commit(the_repository, &oids->list[i]); if (commit && !parse_commit(commit)) This is the important count, since we will be parsing commits and adding their parents to the list. The bulk of the work happens here. @@ -611,19 +623,25 @@ static void close_reachable(struct packed_oid_list *oids) } for (i = 0; i < oids->nr; i++) { + display_progress(progress, ++j); commit = lookup_commit(the_repository, &oids->list[i]); This iterates through the commits a second time and removes the UNINTERESTING flag. if (commit) commit->object.flags &= ~UNINTERESTING; } + stop_progress(&progress); } I think it is good to have the progress start before the first loop and end after the third loop, but the middle loop has the important count. I tried deleting the first and third display_progress() methods and re-ran the process on the Linux repo and did not notice a delay at the 0% and 100% progress spots. The count matches the number of commits. Thanks, -Stolee
Re: [PATCH 1/2] commit-graph write: add progress output
On 9/4/2018 6:07 PM, Junio C Hamano wrote: Ævar Arnfjörð Bjarmason writes: With --stdin-packs we don't show any estimation of how much is left to do. This is because we might be processing more than one pack. We could be less lazy here and show progress, either detect by detecting that we're only processing one pack, or by first looping over the packs to discover how many commits they have. I don't see the point in I do not know if there is no point, but if we were to do it, I think slurping the list of packs and computing the number of objects is not all that bad. If you want to do that, I have nothing against it. However, I don't expect users to use that option directly. That option is used by VFS for Git to compute the commit-graph in the background after receiving a pack of commits and trees, but not by 'git gc' which I expect is how most users will compute commit-graphs. static void compute_generation_numbers(struct packed_commit_list* commits) { int i; struct commit_list *list = NULL; + struct progress *progress = NULL; + progress = start_progress( + _("Computing commit graph generation numbers"), commits->nr); for (i = 0; i < commits->nr; i++) { + display_progress(progress, i); if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && commits->list[i]->generation != GENERATION_NUMBER_ZERO) continue; I am wondering if the progress call should be moved after this conditional continue; would we want to count the entry whose generation is already known here? Of course, as we give commits->nr as the 100% ceiling, we cannot avoid doing so, but it somehow smells wrong. If we wanted to be completely right, we would count the commits in the list that do not have a generation number and report that as the 100% ceiling. Something like the diff below would work. I tested it in Linux by first deleting my commit-graph and running the following: stolee@stolee-linux:~/linux$ rm .git/objects/info/commit-graph stolee@stolee-linux:~/linux$ git rev-parse v4.6 | ~/git/git commit-graph write --stdin-commits Annotating commits in commit graph: 1180333, done. Computing commit graph generation numbers: 100% (590166/590166), done. stolee@stolee-linux:~/linux$ ~/git/git commit-graph write --reachable Annotating commits in commit graph: 1564087, done. Computing commit graph generation numbers: 100% (191590/191590), done. -->8-- From: Derrick Stolee Date: Wed, 5 Sep 2018 11:55:42 + Subject: [PATCH] fixup! commit-graph write: add progress output Signed-off-by: Derrick Stolee --- commit-graph.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 1a02fe019a..b933bc9f00 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -634,14 +634,20 @@ static void close_reachable(struct packed_oid_list *oids) static void compute_generation_numbers(struct packed_commit_list* commits) { - int i; + int i, count_uncomputed = 0; struct commit_list *list = NULL; struct progress *progress = NULL; + for (i = 0; i < commits->nr; i++) + if (commits->list[i]->generation == GENERATION_NUMBER_INFINITY || + commits->list[i]->generation == GENERATION_NUMBER_ZERO) + count_uncomputed++; + progress = start_progress( - _("Computing commit graph generation numbers"), commits->nr); + _("Computing commit graph generation numbers"), count_uncomputed); + count_uncomputed = 0; + for (i = 0; i < commits->nr; i++) { - display_progress(progress, i); if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && commits->list[i]->generation != GENERATION_NUMBER_ZERO) continue; @@ -670,10 +676,11 @@ static void compute_generation_numbers(struct packed_commit_list* commits) if (current->generation > GENERATION_NUMBER_MAX) current->generation = GENERATION_NUMBER_MAX; + + display_progress(progress, ++count_uncomputed); } } } - display_progress(progress, i); stop_progress(&progress); } -- 2.19.0.rc2
Re: [PATCH v2 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage
On 9/4/2018 12:49 PM, Duy Nguyen wrote: On Wed, Aug 29, 2018 at 2:49 PM Derrick Stolee via GitGitGadget wrote: The commit-graph (and multi-pack-index) features are optional data structures that can make Git operations faster. Since they are optional, we do not enable them in most Git tests. The commit-graph is tested in t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that one script cannot cover the data shapes present in the rest of the test suite. This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH . Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it with every git commit command. Thanks, Duy, for pointing out this direction [1]. Any reason to not add this new flag in ci/run-build-and-tests.sh (which is used by Travis)? I see your note about VSTS but I don't see why it has to be exclusive to VSTS. I only wanted to volunteer resources that I know are available. I am looking into an additional test run in Travis CI builds, but didn't want to presume the extra resources would be available.
Re: [PATCH v4 12/12] t5318: use test_oid for HASH_LEN
On 9/3/2018 7:25 PM, brian m. carlson wrote: From: Derrick Stolee Signed-off-by: Derrick Stolee Signed-off-by: brian m. carlson --- t/t5318-commit-graph.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 3c1ffad491..d286516f0e 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -8,7 +8,8 @@ test_expect_success 'setup full repo' ' cd "$TRASH_DIRECTORY/full" && git init && git config core.commitGraph true && - objdir=".git/objects" + objdir=".git/objects" && + test_oid_init ' Ah yes, thanks for adding this. test_expect_success 'verify graph with no graph file' ' @@ -273,7 +274,7 @@ test_expect_success 'git commit-graph verify' ' NUM_COMMITS=9 NUM_OCTOPUS_EDGES=2 -HASH_LEN=20 +HASH_LEN="$(test_oid rawsz)" GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 GRAPH_BYTE_CHUNK_COUNT=6
Re: [PATCH 1/8] trace2: create new combined trace facility
On 8/31/2018 12:49 PM, Jeff Hostetler via GitGitGadget wrote: + if (tr2key_trace_want(&tr2key_event)) { + struct tr2tls_thread_ctx *ctx = tr2tls_get_self(); + if (ctx->nr_open_regions <= tr2env_event_depth_wanted) { This should also compare to TR2_MAX_REGION_NESTING to avoid memory bounds issues. It may even be worth sending a message that the depth went beyond the limits. Thanks, -Stolee
Re: [PATCH 0/8] WIP: trace2: a new trace facility
On 8/31/2018 12:49 PM, Jeff Hostetler via GitGitGadget wrote: This patch series contains a new trace2 facility that hopefully addresses the recent trace- and structured-logging-related discussions. The intent is to eventually replace the existing trace_ routines (or to route them to the new trace2_ routines) as time permits. I haven't been part of the recent discussions on these logging efforts, but I wanted to mention that I'm excited about the potential here. I want to use this new tracing model to trace how many commits are walked by graph algorithms like paint_down_to_common(). I'm playing with some efforts here, and found one issue when the API is used incorrectly (I'll respond to the patch with the issue). Thanks, -Stolee
[PATCH 1/1] commit: don't use generation numbers if not needed
From: Derrick Stolee In 3afc679b "commit: use generations in paint_down_to_common()", the queue in paint_down_to_common() was changed to use a priority order based on generation number before commit date. This served two purposes: 1. When generation numbers are present, the walk guarantees correct topological relationships, regardless of clock skew in commit dates. 2. It enables short-circuiting the walk when the min_generation parameter is added in d7c1ec3e "commit: add short-circuit to paint_down_to_common()". This short-circuit helps commands like 'git branch --contains' from needing to walk to a merge base when we know the result is false. The commit message for 3afc679b includes the following sentence: This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. This statement is incorrect. Because it changes the order in which the commits are inspected, it changes the order they are added to the queue, and hence can change the number of loops before the queue_has_nonstale() method returns true. This change makes a concrete difference depending on the topology of the commit graph. For instance, computing the merge-base between consecutive versions of the Linux kernel has no effect for versions after v4.9, but 'git merge-base v4.8 v4.9' presents a performance regression: v2.18.0: 0.122s v2.19.0-rc1: 0.547s HEAD: 0.127s To determine that this was simply an ordering issue, I inserted a counter within the while loop of paint_down_to_common() and found that the loop runs 167,468 times in v2.18.0 and 635,579 times in v2.19.0-rc1. The topology of this case can be described in a simplified way here: v4.9 | \ | \ v4.8 \ | \ \ | \ | ... A B | / / | / / |/__/ C Here, the "..." means "a very long line of commits". By generation number, A and B have generation one more than C. However, A and B have commit date higher than most of the commits reachable from v4.8. When the walk reaches v4.8, we realize that it has PARENT1 and PARENT2 flags, so everything it can reach is marked as STALE, including A. B has only the PARENT1 flag, so is not STALE. When paint_down_to_common() is run using compare_commits_by_commit_date, A and B are removed from the queue early and C is inserted into the queue. At this point, C and the rest of the queue entries are marked as STALE. The loop then terminates. When paint_down_to_common() is run using compare_commits_by_gen_then_commit_date, B is removed from the queue only after the many commits reachable from v4.8 are explored. This causes the loop to run longer. The reason for this regression is simple: the queue order is intended to not explore a commit until everything that _could_ reach that commit is explored. From the information gathered by the original ordering, we have no guarantee that there is not a commit D reachable from v4.8 that can also reach B. We gained absolute correctness in exchange for a performance regression. The performance regression is probably the worse option, since these incorrect results in paint_down_to_common() are rare. The topology required for the performance regression are less rare, but still require multiple merge commits where the parents differ greatly in generation number. In our example above, the commit A is as important as the commit B to demonstrate the problem, since otherwise the commit C will sit in the queue as non-stale just as long in both orders. The solution provided uses the min_generation parameter to decide if we should use generation numbers in our ordering. When min_generation is equal to zero, it means that the caller has no known cutoff for the walk, so we should rely on our commit-date heuristic as before; this is the case with merge_bases_many(). When min_generation is non-zero, then the caller knows a valuable cutoff for the short-circuit mechanism; this is the case with remove_redundant() and in_merge_bases_many(). Signed-off-by: Derrick Stolee --- commit.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 1a6e632185..449c1f4920 100644 --- a/commit.c +++ b/commit.c @@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, int i; uint32_t last_gen = GENERATION_NUMBER_INFINITY; + if (!min_generation) + queue.compare = compare_commits_by_commit_date; + one->object.flags |= PARENT1; if (!n) { commit_list_append(one, &result); @@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit_list *parents; int flags; - if (commit->generation > last_gen) + if (min_generation && co
[PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base'
As I was testing the release candidate, I stumbled across a regression in 'git merge-base' as a result of the switch to generation numbers. The commit message in [PATCH 1/1] describes the topology involved, but you can test it yourself by comparing 'git merge-base v4.8 v4.9' in the Linux kernel. The regression does not show up when running merge-base for tags at least v4.9, which is why I didn't see it when I was testing earlier. The solution is simple, but also will conflict with ds/reachable in next. I can send a similar patch that applies the same diff into commit-reach.c. With the integration of generation numbers into most commit walks coming to a close [1], it will be time to re-investigate other options for reachability indexes [2]. As I was digging into the issue with this regression, I discovered a way we can modify our generation numbers and pair them with commit dates to give us a simple-to-compute, immutable two-dimensional reachability index that would be immune to this regression. I will investigate that more and report back, but it is more important to fix this regression now. Thanks, -Stolee [1] https://public-inbox.org/git/pull.25.git.gitgitgad...@gmail.com/[PATCH 0/6] Use generation numbers for --topo-order [2] https://public-inbox.org/git/86muxcuyod@gmail.com/[RFC] Other chunks for commit-graph, part 2 - reachability indexes Cc: gitster@pobox.comCc: p...@peff.net Derrick Stolee (1): commit: don't use generation numbers if not needed commit.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) base-commit: 2f743933341f27603550fbf383a34dfcfd38 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-28%2Fderrickstolee%2Fmerge-base-regression-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-28/derrickstolee/merge-base-regression-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/28 -- gitgitgadget
Re: Questions about the hash function transition
On 8/29/2018 9:27 AM, Derrick Stolee wrote: On 8/29/2018 9:09 AM, Johannes Schindelin wrote: What I meant was to leverage the midx code, not the .midx files. My comment was motivated by my realizing that both the SHA-1 <-> SHA-256 mapping and the MIDX code have to look up (in a *fast* way) information with hash values as keys. *And* this information is immutable. *And* the amount of information should grow with new objects being added to the database. I'm unsure what this means, as the multi-pack-index simply uses bsearch_hash() to find hashes in the list. The same method is used for IDX lookups. I talked with Johannes privately, and we found differences in our understanding of the current multi-pack-index feature. Johannes thought the feature was farther along than it is, specifically related to how much we value the data in the multi-pack-index when adding objects to pack-files or repacking. Some of this misunderstanding is due to how the equivalent feature works in VSTS (where there is no IDX-file equivalent, every object in the repo is tracked by a multi-pack-index). I'd like to point out a few things about how the multi-pack-index works now, and how we hope to extend it in the future. Currently: 1. Objects are added to the multi-pack-index by adding a new set of .idx/.pack file pairs. We scan the .idx file for the objects and offsets to add. 2. We re-use the information in the multi-pack-index only to write the new one without re-reading the .pack files that are already covered. 3. If a 'git repack' command deletes a pack-file, then we delete the multi-pack-index. It must be regenerated by 'git multi-pack-index write' later. In the current world, the multi-pack-index is completely secondary to the .idx files. In the future, I hope these features exist in the multi-pack-index: 1. A stable object order. As objects are added to the multi-pack-index, we assign a distinct integer value to each. As we add objects, those integers values do not change. We can then pair the reachability bitmap to the multi-pack-index instead of a specific pack-file (allowing repack and bitmap computations to happen asynchronously). The data required to store this object order is very similar to storing the bijection between SHA-1 and SHA-256 hashes. 2. Incremental multi-pack-index: Currently, we have only one multi-pack-index file per object directory. We can use a mechanism similar to the split-index to keep a small number of multi-pack-index files (at most 3, probably) such that the '.git/objects/pack/multi-pack-index' file is small and easy to rewrite, while it refers to larger '.git/objects/pack/*.midx' files that change infrequently. 3. Multi-pack-index-aware repack: The repacker only knows about the multi-pack-index enough to delete it. We could instead directly manipulate the multi-pack-index during repack, and we could decide to do more incremental repacks based on data stored in the multi-pack-index. In conclusion: please keep the multi-pack-index in mind as we implement the transition plan. I'll continue building the feature as planned (the next thing to do after the current series of cleanups is 'git multi-pack-index verify') but am happy to look into other applications as we need it. Thanks, -Stolee
Re: Questions about the hash function transition
On 8/29/2018 9:09 AM, Johannes Schindelin wrote: Hi Jonathan, On Tue, 28 Aug 2018, Jonathan Nieder wrote: Johannes Schindelin wrote: On Thu, 23 Aug 2018, Jonathan Nieder wrote: Ævar Arnfjörð Bjarmason wrote: Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about... That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others. The midx format already has a field for hash function (thanks, Derrick!). Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges. Interesting: tell me more. My first instinct is to prefer the idx-based design that is already described in the design doc. If we want to change that, we should have a motivating reason. Midx is designed to be optional and to not necessarily cover all objects, so it doesn't seem like a good fit. It is optional, but shouldn't this mode where a Git repo that needs to know about two different versions of all files be optional? Or at least temporary? The multi-pack-index is intended to cover all packed objects, so covers the same number of objects as an IDX-based strategy. If we are rebuilding the repo from scratch by translating the hashes, then "being too big to repack" is probably not a problem, so we would expect a single IDX file anyway. In my opinion, whatever we do for the IDX-based approach will need to be duplicated in the multi-pack-index. The multi-pack-index does have a natural mechanism (optional chunks) for inserting this data without incrementing the version number. Right. What I meant was to leverage the midx code, not the .midx files. My comment was motivated by my realizing that both the SHA-1 <-> SHA-256 mapping and the MIDX code have to look up (in a *fast* way) information with hash values as keys. *And* this information is immutable. *And* the amount of information should grow with new objects being added to the database. I'm unsure what this means, as the multi-pack-index simply uses bsearch_hash() to find hashes in the list. The same method is used for IDX lookups. I know that Stolee performed a bit of performance testing regarding different data structures to use in MIDX. We could benefit from that testing by using not only the results from those tests, but also the code. I did test ways to use something other than bsearch_hash(), such as using a 65,536-entry fanout table for lookups using the first two bytes of a hash (tl;dr: it speeds things up a bit, but the super-small improvement is probably not worth the space and complexity). I've also toyed with the idea of using interpolation search inside bsearch_hash(), but I haven't had time to do that. IIRC one of the insights was that packs are a natural structure that can be used for the MIDX mapping, too (you could, for example, store the SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate them on the fly for loose objects all the time). Stolee can speak with much more competence and confidence about this, though, whereas all of what I said above is me waving my hands quite frantically. I understand the hesitation to pair such an important feature (hash transition) to a feature that hasn't even shipped. We will need to see how things progress on both fronts to see how mature the multi-pack-index is when we need this transition table. Thanks, -Stolee
Re: [RFC PATCH 09/12] Add a base implementation of SHA-256 support
On 8/28/2018 8:58 PM, brian m. carlson wrote: SHA-256 is somewhat slower to compute than SHA-1 in software. However, since our default SHA-1 implementation is collision-detecting, a reasonable cryptographic library implementation of SHA-256 will actually be faster than SHA-256. Nit: do you mean "faster than SHA-1DC"?
Re: [RFC PATCH 08/12] commit-graph: convert to using the_hash_algo
On 8/28/2018 8:58 PM, brian m. carlson wrote: Instead of using hard-coded constants for object sizes, use the_hash_algo to look them up. In addition, use a function call to look up the object ID version and produce the correct value. The C code in this patch looks good to me. The only issue is that I predict failure in the 'git commit-graph verify' tests in t5318-commit-graph.sh. Squashing in this commit should help (assuming that test_oid works, it doesn't at my current branch): -->8-- Subject: [PATCH] t5318-commit-graph.sh: use test_oid for HASH_LEN Signed-off-by: Derrick Stolee --- t/t5318-commit-graph.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 6aee861f78..676c1a9ae0 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -333,7 +333,7 @@ test_expect_success 'git commit-graph verify' ' NUM_COMMITS=9 NUM_OCTOPUS_EDGES=2 -HASH_LEN=20 +HASH_LEN="$(test_oid rawsz)" GRAPH_BYTE_VERSION=4 GRAPH_BYTE_HASH=5 GRAPH_BYTE_CHUNK_COUNT=6 -- 2.19.0.rc1
[PATCH v2 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH
From: Derrick Stolee The commit-graph feature is tested in isolation by t5318-commit-graph.sh and t6600-test-reach.sh, but there are many more interesting scenarios involving commit walks. Many of these scenarios are covered by the existing test suite, but we need to maintain coverage when the optional commit-graph structure is not present. To allow running the full test suite with the commit-graph present, add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try to load the commit-graph when parsing commits, and writes the commit-graph file after every 'git commit' command. There are a few tests that rely on commits not existing in pack-files to trigger important events, so manually set GIT_TEST_COMMIT_GRAPH to false for the necessary commands. There is one test in t6024-recursive-merge.sh that relies on the merge-base algorithm picking one of two ambiguous merge-bases, and the commit-graph feature changes which merge-base is picked. Helped-by: Eric Sunshine Signed-off-by: Derrick Stolee --- builtin/commit.c| 4 commit-graph.c | 5 +++-- commit-graph.h | 2 ++ t/README| 4 t/t0410-partial-clone.sh| 2 +- t/t5307-pack-missing-commit.sh | 4 ++-- t/t6011-rev-list-with-bad-commit.sh | 7 +++ t/t6024-recursive-merge.sh | 6 +++--- 8 files changed, 22 insertions(+), 12 deletions(-) diff --git a/builtin/commit.c b/builtin/commit.c index 158e3f843a..a25e652102 100644 --- a/builtin/commit.c +++ b/builtin/commit.c @@ -33,6 +33,7 @@ #include "sequencer.h" #include "mailmap.h" #include "help.h" +#include "commit-graph.h" static const char * const builtin_commit_usage[] = { N_("git commit [] [--] ..."), @@ -1651,6 +1652,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix) "new_index file. Check that disk is not full and quota is\n" "not exceeded, and then \"git reset HEAD\" to recover.")); + if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) + write_commit_graph_reachable(get_object_directory(), 0); + rerere(0); run_command_v_opt(argv_gc_auto, RUN_GIT_CMD); run_commit_hook(use_editor, get_index_file(), "post-commit", NULL); diff --git a/commit-graph.c b/commit-graph.c index 4bd1a4abbf..5cae83fe03 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -237,8 +237,9 @@ static int prepare_commit_graph(struct repository *r) return !!r->objects->commit_graph; r->objects->commit_graph_attempted = 1; - if (repo_config_get_bool(r, "core.commitgraph", &config_value) || - !config_value) + if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && + (repo_config_get_bool(r, "core.commitgraph", &config_value) || + !config_value)) /* * This repository is not configured to use commit graphs, so * do not load one. (But report commit_graph_attempted anyway diff --git a/commit-graph.h b/commit-graph.h index 13d736cdde..cf9141f356 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -5,6 +5,8 @@ #include "repository.h" #include "string-list.h" +#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH" + struct commit; char *get_commit_graph_filename(const char *obj_dir); diff --git a/t/README b/t/README index 8373a27fea..2b90c433f5 100644 --- a/t/README +++ b/t/README @@ -315,6 +315,10 @@ packs on demand. This normally only happens when the object size is over 2GB. This variable forces the code path on any object larger than bytes. +GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to +be written after every 'git commit' command, and overrides the +'core.commitGraph' setting to true. + Naming Tests diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index 4984ca583d..73d5284a91 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -181,7 +181,7 @@ test_expect_success 'rev-list stops traversal at missing and promised commit' ' git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.partialclone "arbitrary string" && - git -C repo rev-list --exclude-promisor-objects --objects bar >out && + GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects --objects bar >out && grep $(git -C repo rev-parse bar) out && ! grep $FOO out ' diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh index ae52a1882d..dacb440b27 100755 --- a/t/t5307-pack-missing-commi
[PATCH v2 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage
The commit-graph (and multi-pack-index) features are optional data structures that can make Git operations faster. Since they are optional, we do not enable them in most Git tests. The commit-graph is tested in t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that one script cannot cover the data shapes present in the rest of the test suite. This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH . Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it with every git commit command. Thanks, Duy, for pointing out this direction [1]. A few tests needed to be modified. These are the same tests that were mentioned in my previous example patch [2]. Thanks, Eric, for providing the correct way to override the settings [3]. When this merges down, I'll create a CI build in VSTS that runs the test suite with this option enabled. Thanks, -Stolee [1] https://public-inbox.org/git/CACsJy8CKnXVJYkKM_W=N=Vq-TVXf+YCqZP_uP7B-dN_6xddB=g...@mail.gmail.com/ Re: [PATCH 0/9] multi-pack-index cleanups (Discussing test environment variables) [2] https://public-inbox.org/git/20180718152244.45513-1-dsto...@microsoft.com/ [PATCH] DO-NOT-MERGE: write and read commit-graph always [3] https://public-inbox.org/git/CAPig+cSjanDi=jv75pdzypajwvgd4suh3uyvy+vy7yehauy...@mail.gmail.com/ Based-On: ds/commit-graph-with-grafts Cc: jnareb@gmail.comCc: sbeller@google.comCc: sunsh...@sunshineco.com Derrick Stolee (1): commit-graph: define GIT_TEST_COMMIT_GRAPH builtin/commit.c| 4 commit-graph.c | 5 +++-- commit-graph.h | 2 ++ t/README| 4 t/t0410-partial-clone.sh| 2 +- t/t5307-pack-missing-commit.sh | 4 ++-- t/t6011-rev-list-with-bad-commit.sh | 7 +++ t/t6024-recursive-merge.sh | 6 +++--- 8 files changed, 22 insertions(+), 12 deletions(-) base-commit: 829a321569d8e8f2c582aef9f0c990df976ab842 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-26%2Fderrickstolee%2Fshallow%2Ftest-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-26/derrickstolee/shallow/test-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/26 Range-diff vs v1: 1: 85d02ac8d8 ! 1: 4ff6695c7e commit-graph: define GIT_TEST_COMMIT_GRAPH @@ -23,6 +23,7 @@ merge-base algorithm picking one of two ambiguous merge-bases, and the commit-graph feature changes which merge-base is picked. +Helped-by: Eric Sunshine Signed-off-by: Derrick Stolee diff --git a/builtin/commit.c b/builtin/commit.c @@ -112,18 +113,12 @@ test_expect_success 'rev-list notices corruption (1)' ' - test_must_fail git rev-list HEAD -+ ( -+ GIT_TEST_COMMIT_GRAPH=0 && -+ test_must_fail git rev-list HEAD -+ ) ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list HEAD ' test_expect_success 'rev-list notices corruption (2)' ' - test_must_fail git rev-list --objects HEAD -+ ( -+ GIT_TEST_COMMIT_GRAPH=0 && -+ test_must_fail git rev-list --objects HEAD -+ ) ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --objects HEAD ' test_expect_success 'pack-objects notices corruption' ' @@ -140,10 +135,7 @@ - test_must_fail git rev-list --all > /dev/null - ' +test_expect_success 'rev-list should fail' ' -+ ( -+ GIT_TEST_COMMIT_GRAPH=0 && -+ test_must_fail git rev-list --all > /dev/null -+ ) ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git rev-list --all > /dev/null +' test_expect_success 'git repack _MUST_ fail' \ @@ -160,10 +152,7 @@ - test_must_fail git merge -m final G -" +test_expect_success 'combined merge conflicts' ' -+ ( -+ GIT_TEST_COMMIT_GRAPH=0 && -+ test_must_fail git merge -m final G -+ ) ++ test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G +' cat > expect << EOF -- gitgitgadget
Re: [PATCH v3 01/11] t: add tool to translate hash-related values
On 8/28/2018 8:56 PM, brian m. carlson wrote: + rawsz="$(test_oid rawsz)" && + hexsz="$(test_oid hexsz)" && These are neat helpers! The 'git commit-graph verify' tests in t5318-commit-graph.sh should automatically work if we use these for HASH_LEN instead of 20. I'll use a similar pattern when writing 'git multi-pack-index verify'. Thanks, -Stolee
Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH
On 8/28/2018 5:59 PM, Eric Sunshine wrote: On Tue, Aug 28, 2018 at 5:31 PM Derrick Stolee wrote: On 8/28/2018 4:41 PM, Stefan Beller wrote: On Tue, Aug 28, 2018 at 1:33 PM Derrick Stolee via GitGitGadget wrote: + GIT_TEST_COMMIT_GRAPH=0 && + test_must_fail git merge -m final G This could go on the same line without the && in between, setting the variable as a prefix. It cannot! The Linux build I ran complained that you can't put environment variables through test_must_fail. Is GIT_TEST_COMMIT_GRAPH exported? If not, it won't have an impact on git-merge anyhow. In my testing this changed the behavior from fail to pass when passing GIT_TEST_COMMIT_GRAPH=1 from the command. As for the special case of one-shot environment variable and test_must_fail(), you'll find "env" used as a workaround in a number of tests: test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge ... && Thanks for this! This is clearly the better solution. -Stolee
Re: [PATCH] commit-reach: correct accidental #include of C file
On 8/28/2018 5:36 PM, Jonathan Nieder wrote: Without this change, the build breaks with clang: For some reason, it didn't fail with GCC for me, but this is an obviously correct change to make. Thanks! libgit/ref-filter.pic.o: multiple definition of 'filter_refs' libgit/commit-reach.pic.o: previous definition here Signed-off-by: Jonathan Nieder --- Jonathan Nieder wrote: Derrick Stolee wrote: --- a/commit-reach.c +++ b/commit-reach.c @@ -1,8 +1,10 @@ #include "cache.h" #include "commit.h" +#include "commit-graph.h" #include "decorate.h" #include "prio-queue.h" #include "tree.h" +#include "ref-filter.c" Did you mean "ref-filter.h"? This broke the build here. Is there some check that we can use to prevent it happening again? I don't think we ever intentionally #include a .c file. Here's what I'm applying locally. Thanks, Jonathan commit-reach.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit-reach.c b/commit-reach.c index c996524032..86715c103c 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -4,7 +4,7 @@ #include "decorate.h" #include "prio-queue.h" #include "tree.h" -#include "ref-filter.c" +#include "ref-filter.h" #include "revision.h" #include "tag.h" #include "commit-reach.h"
Re: [PATCH 0/9] introducing oideq()
On 8/28/2018 5:21 PM, Jeff King wrote: On Sun, Aug 26, 2018 at 08:56:21PM +, brian m. carlson wrote: Due to the simplicity of the current code and our inlining, the compiler can usually figure this out for now. So I wouldn't expect this patch to actually improve performance right away. But as that discussion shows, we are likely to take a performance hit as we move to more runtime determination of the_hash_algo parameters. Having these callers use the more strict form will potentially help us recover that. So in that sense we _could_ simply punt on this series until then (and it's certainly post-v2.19 material). But I think it's worth doing now, simply from a readability/annotation standpoint. IMHO the resulting code is more clear (though I've long since taught myself to read !foocmp() as equality). I would quite like to see this series picked up for v2.20. If we want to minimize performance regressions with the SHA-256 work, I think it's important. Thanks. One of the things I was worried about was causing unnecessary conflicts with existing topics, including your work. But if everybody is on board, I'd be happy to see this go in early in the next release cycle (the longer we wait, the more annoying conflicts Junio has to resolve). I'm happy to take this change whenever. In my opinion, right after v2.19.0 is cut would be a great time to merge it into master. This v2 is good. Reviewed-by: Derrick Stolee
Re: [PATCH v2 04/18] commit-reach: move commit_contains from ref-filter
On 8/28/2018 5:24 PM, Jonathan Nieder wrote: Hi, Derrick Stolee wrote: There are several commit walks in the codebase. Group them together into a new commit-reach.c file and corresponding header. After we group these walks into one place, we can reduce duplicate logic by calling equivalent methods. All methods are direct moves, except we also make the commit_contains() method public so its consumers in ref-filter.c can still call it. We can also test this method in a test-tool in a later commit. Signed-off-by: Derrick Stolee --- commit-reach.c | 121 + commit-reach.h | 20 ++- ref-filter.c | 145 +++-- 3 files changed, 147 insertions(+), 139 deletions(-) diff --git a/commit-reach.c b/commit-reach.c index a6bc4781a6..01d796f011 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -1,8 +1,10 @@ #include "cache.h" #include "commit.h" +#include "commit-graph.h" #include "decorate.h" #include "prio-queue.h" #include "tree.h" +#include "ref-filter.c" Did you mean "ref-filter.h"? This broke the build here. Is there some check that we can use to prevent it happening again? I don't think we ever intentionally #include a .c file. Woah! How did that ever work? I definitely built this locally.
Re: [PATCH 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage
On 8/28/2018 4:37 PM, Stefan Beller wrote: On Tue, Aug 28, 2018 at 1:33 PM Derrick Stolee via GitGitGadget wrote: The commit-graph (and multi-pack-index) features are optional data structures that can make Git operations faster. Since they are optional, we do not enable them in most Git tests. The commit-graph is tested in t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that one script cannot cover the data shapes present in the rest of the test suite. This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH . Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it with every git commit command. Thanks, Duy, for pointing out this direction Did you mean to cc Duy (instead of me)? (I'll happily review the patch, too... just asking) I just added you because you've been on a lot of commit-graph things lately. But yes, I forgot to add Duy to the CC list. Added to this message. Thanks, -Stolee
Re: [PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH
On 8/28/2018 4:41 PM, Stefan Beller wrote: On Tue, Aug 28, 2018 at 1:33 PM Derrick Stolee via GitGitGadget wrote: From: Derrick Stolee The commit-graph feature is tested in isolation by t5318-commit-graph.sh and t6600-test-reach.sh, but there are many more interesting scenarios involving commit walks. Many of these scenarios are covered by the existing test suite, but we need to maintain coverage when the optional commit-graph structure is not present. To allow running the full test suite with the commit-graph present, add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try to load the commit-graph when parsing commits, and writes the commit-graph file after every 'git commit' command. There are a few tests that rely on commits not existing in pack-files to trigger important events, so manually set GIT_TEST_COMMIT_GRAPH to false for the necessary commands. So the plan is to turn on the commit graph for the whole test suite excluding these selected tests? Excluding these specific _steps_, but yes. + GIT_TEST_COMMIT_GRAPH=0 && + test_must_fail git merge -m final G This could go on the same line without the && in between, setting the variable as a prefix. It cannot! The Linux build I ran complained that you can't put environment variables through test_must_fail. -Stolee
[PATCH 0/1] Define GIT_TEST_COMMIT_GRAPH for commit-graph test coverage
The commit-graph (and multi-pack-index) features are optional data structures that can make Git operations faster. Since they are optional, we do not enable them in most Git tests. The commit-graph is tested in t5318-commit-graph.sh (and t6600-test-reach.sh in ds/reachable), but that one script cannot cover the data shapes present in the rest of the test suite. This patch introduces a new test environment variable, GIT_TEST_COMMIT_GRAPH . Similar to GIT_TEST_SPLIT_INDEX, it enables the commit-graph and writes it with every git commit command. Thanks, Duy, for pointing out this direction [1]. A few tests needed to be modified. These are the same tests that were mentioned in my previous example patch [2]. When this merges down, I'll create a CI build in VSTS that runs the test suite with this option enabled. Thanks, -Stolee [1] https://public-inbox.org/git/CACsJy8CKnXVJYkKM_W=N=Vq-TVXf+YCqZP_uP7B-dN_6xddB=g...@mail.gmail.com/ Re: [PATCH 0/9] multi-pack-index cleanups (Discussing test environment variables) [2] https://public-inbox.org/git/20180718152244.45513-1-dsto...@microsoft.com/ [PATCH] DO-NOT-MERGE: write and read commit-graph always Based-On: ds/commit-graph-with-grafts Cc: jna...@gmail.com Derrick Stolee (1): commit-graph: define GIT_TEST_COMMIT_GRAPH builtin/commit.c| 4 commit-graph.c | 5 +++-- commit-graph.h | 2 ++ t/README| 4 t/t0410-partial-clone.sh| 2 +- t/t5307-pack-missing-commit.sh | 10 -- t/t6011-rev-list-with-bad-commit.sh | 10 ++ t/t6024-recursive-merge.sh | 9 ++--- 8 files changed, 34 insertions(+), 12 deletions(-) base-commit: 829a321569d8e8f2c582aef9f0c990df976ab842 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-26%2Fderrickstolee%2Fshallow%2Ftest-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-26/derrickstolee/shallow/test-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/26 -- gitgitgadget
[PATCH 1/1] commit-graph: define GIT_TEST_COMMIT_GRAPH
From: Derrick Stolee The commit-graph feature is tested in isolation by t5318-commit-graph.sh and t6600-test-reach.sh, but there are many more interesting scenarios involving commit walks. Many of these scenarios are covered by the existing test suite, but we need to maintain coverage when the optional commit-graph structure is not present. To allow running the full test suite with the commit-graph present, add a new test environment variable, GIT_TEST_COMMIT_GRAPH. Similar to GIT_TEST_SPLIT_INDEX, this variable makes every Git command try to load the commit-graph when parsing commits, and writes the commit-graph file after every 'git commit' command. There are a few tests that rely on commits not existing in pack-files to trigger important events, so manually set GIT_TEST_COMMIT_GRAPH to false for the necessary commands. There is one test in t6024-recursive-merge.sh that relies on the merge-base algorithm picking one of two ambiguous merge-bases, and the commit-graph feature changes which merge-base is picked. Signed-off-by: Derrick Stolee --- builtin/commit.c| 4 commit-graph.c | 5 +++-- commit-graph.h | 2 ++ t/README| 4 t/t0410-partial-clone.sh| 2 +- t/t5307-pack-missing-commit.sh | 10 -- t/t6011-rev-list-with-bad-commit.sh | 10 ++ t/t6024-recursive-merge.sh | 9 ++--- 8 files changed, 34 insertions(+), 12 deletions(-) diff --git a/builtin/commit.c b/builtin/commit.c index 158e3f843a..a25e652102 100644 --- a/builtin/commit.c +++ b/builtin/commit.c @@ -33,6 +33,7 @@ #include "sequencer.h" #include "mailmap.h" #include "help.h" +#include "commit-graph.h" static const char * const builtin_commit_usage[] = { N_("git commit [] [--] ..."), @@ -1651,6 +1652,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix) "new_index file. Check that disk is not full and quota is\n" "not exceeded, and then \"git reset HEAD\" to recover.")); + if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) + write_commit_graph_reachable(get_object_directory(), 0); + rerere(0); run_command_v_opt(argv_gc_auto, RUN_GIT_CMD); run_commit_hook(use_editor, get_index_file(), "post-commit", NULL); diff --git a/commit-graph.c b/commit-graph.c index 4bd1a4abbf..5cae83fe03 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -237,8 +237,9 @@ static int prepare_commit_graph(struct repository *r) return !!r->objects->commit_graph; r->objects->commit_graph_attempted = 1; - if (repo_config_get_bool(r, "core.commitgraph", &config_value) || - !config_value) + if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) && + (repo_config_get_bool(r, "core.commitgraph", &config_value) || + !config_value)) /* * This repository is not configured to use commit graphs, so * do not load one. (But report commit_graph_attempted anyway diff --git a/commit-graph.h b/commit-graph.h index 13d736cdde..cf9141f356 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -5,6 +5,8 @@ #include "repository.h" #include "string-list.h" +#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH" + struct commit; char *get_commit_graph_filename(const char *obj_dir); diff --git a/t/README b/t/README index 8373a27fea..2b90c433f5 100644 --- a/t/README +++ b/t/README @@ -315,6 +315,10 @@ packs on demand. This normally only happens when the object size is over 2GB. This variable forces the code path on any object larger than bytes. +GIT_TEST_COMMIT_GRAPH=, when true, forces the commit-graph to +be written after every 'git commit' command, and overrides the +'core.commitGraph' setting to true. + Naming Tests diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh index 4984ca583d..73d5284a91 100755 --- a/t/t0410-partial-clone.sh +++ b/t/t0410-partial-clone.sh @@ -181,7 +181,7 @@ test_expect_success 'rev-list stops traversal at missing and promised commit' ' git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.partialclone "arbitrary string" && - git -C repo rev-list --exclude-promisor-objects --objects bar >out && + GIT_TEST_COMMIT_GRAPH=0 git -C repo rev-list --exclude-promisor-objects --objects bar >out && grep $(git -C repo rev-parse bar) out && ! grep $FOO out ' diff --git a/t/t5307-pack-missing-commit.sh b/t/t5307-pack-missing-commit.sh index ae52a1882d..15df19a040 100755 --- a/t/t5307-pack-missing-commit.sh +++ b/t/t5307-pack
Re: Git clean removing tracked files semi-regularly
On 8/28/2018 2:29 PM, Brennan Conroy wrote: I'm using windows. Have had it happen on 8.1 and 10. "git status" shows "nothing to commit, working tree clean". I am using git version 2.17.1.windows.2 I tested on my machine and could not reproduce this issue. Could you find a full repro (first command 'git clone https://github.com/aspnet/SignalR') and create an Issue at https://github.com/git-for-windows/git/issues ? Thanks, -Stolee
Re: Questions about the hash function transition
On 8/28/2018 8:04 AM, Johannes Schindelin wrote: Hi, On Thu, 23 Aug 2018, Jonathan Nieder wrote: Ævar Arnfjörð Bjarmason wrote: [...] Since all operations that make new objects (e.g., "git commit") add the new objects to the corresponding index, this mapping is possible for all objects in the object store. Are we going to need a midx version of these mapping files? How does midx fit into this picture? Perhaps it's too obscure to worry about... That's a great question! I think the simplest answer is to have a midx only for the primary object format and fall back to using ordinary idx files for the others. The midx format already has a field for hash function (thanks, Derrick!). Related: I wondered whether we could simply leverage the midx code for the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very similar in concept and challenges. If we would like such a mapping, then I would propose the following: 1. The object store has everything in SHA-256, so the HASH_LEN parameter of the multi-pack-index is 32. 2. We create an optional chunk to add to the multi-pack-index that stores the SHA-1 for each object. This list would be in lex order. 3. We create two optional chunks that store the bijection between SHA-256 and SHA-1: the first is a list of integers i_0, i_1, ..., i_{N-1} such that i_k is the position in the SHA-1 list corresponding to the kth SHA-256. The second is a list of integers j_0, j_1, ..., j_{N-1} such that j_k is the position in the SHA-256 list of the kth SHA-1. I'm not super-familiar with how the transition plan specifically needs this mapping, but it seems like a good place to put it. Thanks, -Stolee
[PATCH 1/6] prio-queue: add 'peek' operation
From: Derrick Stolee When consuming a priority queue, it can be convenient to inspect the next object that will be dequeued without actually dequeueing it. Our existing library did not have such a 'peek' operation, so add it as prio_queue_peek(). Add a reference-level comparison in t/helper/test-prio-queue.c so this method is exercised by t0009-prio-queue.sh. Signed-off-by: Derrick Stolee --- prio-queue.c | 9 + prio-queue.h | 6 ++ t/helper/test-prio-queue.c | 10 +++--- 3 files changed, 22 insertions(+), 3 deletions(-) diff --git a/prio-queue.c b/prio-queue.c index a078451872..d3f488cb05 100644 --- a/prio-queue.c +++ b/prio-queue.c @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue) } return result; } + +void *prio_queue_peek(struct prio_queue *queue) +{ + if (!queue->nr) + return NULL; + if (!queue->compare) + return queue->array[queue->nr - 1].data; + return queue->array[0].data; +} diff --git a/prio-queue.h b/prio-queue.h index d030ec9dd6..682e51867a 100644 --- a/prio-queue.h +++ b/prio-queue.h @@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing); */ extern void *prio_queue_get(struct prio_queue *); +/* + * Gain access to the "thing" that would be returned by + * prio_queue_get, but do not remove it from the queue. + */ +extern void *prio_queue_peek(struct prio_queue *); + extern void clear_prio_queue(struct prio_queue *); /* Reverse the LIFO elements */ diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c index 9807b649b1..e817bbf464 100644 --- a/t/helper/test-prio-queue.c +++ b/t/helper/test-prio-queue.c @@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv) struct prio_queue pq = { intcmp }; while (*++argv) { - if (!strcmp(*argv, "get")) - show(prio_queue_get(&pq)); - else if (!strcmp(*argv, "dump")) { + if (!strcmp(*argv, "get")) { + void *peek = prio_queue_peek(&pq); + void *get = prio_queue_get(&pq); + if (peek != get) + BUG("peek and get results do not match"); + show(get); + } else if (!strcmp(*argv, "dump")) { int *v; while ((v = prio_queue_get(&pq))) show(v); -- gitgitgadget
[PATCH 2/6] test-reach: add run_three_modes method
From: Derrick Stolee The 'test_three_modes' method assumes we are using the 'test-tool reach' command for our test. However, we may want to use the data shape of our commit graph and the three modes (no commit-graph, full commit-graph, partial commit-graph) for other git commands. Split test_three_modes to be a simple translation on a more general run_three_modes method that executes the given command and tests the actual output to the expected output. While inspecting this code, I realized that the final test for 'commit_contains --tag' is silently dropping the '--tag' argument. It should be quoted to include both. Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index d139a00d1d..1b18e12a4e 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -53,18 +53,22 @@ test_expect_success 'setup' ' git config core.commitGraph true ' -test_three_modes () { +run_three_modes () { test_when_finished rm -rf .git/objects/info/commit-graph && - test-tool reach $1 actual && + $1 actual && test_cmp expect actual && cp commit-graph-full .git/objects/info/commit-graph && - test-tool reach $1 actual && + $1 actual && test_cmp expect actual && cp commit-graph-half .git/objects/info/commit-graph && - test-tool reach $1 actual && + $1 actual && test_cmp expect actual } +test_three_modes () { + run_three_modes "test-tool reach $1" +} + test_expect_success 'ref_newer:miss' ' cat >input <<-\EOF && A:commit-5-7 @@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' ' EOF echo "commit_contains(_,A,X,_):1" >expect && test_three_modes commit_contains && - test_three_modes commit_contains --tag + test_three_modes "commit_contains --tag" ' test_expect_success 'commit_contains:miss' ' -- gitgitgadget
[PATCH 4/6] revision.c: begin refactoring --topo-order logic
From: Derrick Stolee When running 'git rev-list --topo-order' and its kin, the topo_order setting in struct rev_info implies the limited setting. This means that the following things happen during prepare_revision_walk(): * revs->limited implies we run limit_list() to walk the entire reachable set. There are some short-cuts here, such as if we perform a range query like 'git rev-list COMPARE..HEAD' and we can stop limit_list() when all queued commits are uninteresting. * revs->topo_order implies we run sort_in_topological_order(). See the implementation of that method in commit.c. It implies that the full set of commits to order is in the given commit_list. These two methods imply that a 'git rev-list --topo-order HEAD' command must walk the entire reachable set of commits _twice_ before returning a single result. If we have a commit-graph file with generation numbers computed, then there is a better way. This patch introduces some necessary logic redirection when we are in this situation. In v2.18.0, the commit-graph file contains zero-valued bytes in the positions where the generation number is stored in v2.19.0 and later. Thus, we use generation_numbers_enabled() to check if the commit-graph is available and has non-zero generation numbers. When setting revs->limited only because revs->topo_order is true, only do so if generation numbers are not available. There is no reason to use the new logic as it will behave similarly when all generation numbers are INFINITY or ZERO. In prepare_revision_walk(), if we have revs->topo_order but not revs->limited, then we trigger the new logic. It breaks the logic into three pieces, to fit with the existing framework: 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info struct. We use the presence of this struct as a signal to use the new methods during our walk. In this patch, this method simply calls limit_list() and sort_in_topological_order(). In the future, this method will set up a new data structure to perform that logic in-line. 2. next_topo_commit() provides get_revision_1() with the next topo- ordered commit in the list. Currently, this simply pops the commit from revs->commits. 3. expand_topo_walk() provides get_revision_1() with a way to signal walking beyond the latest commit. Currently, this calls add_parents_to_list() exactly like the old logic. While this commit presents method redirection for performing the exact same logic as before, it allows the next commit to focus only on the new logic. Signed-off-by: Derrick Stolee --- revision.c | 42 ++ revision.h | 4 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/revision.c b/revision.c index 3205a3947a..1db70dc951 100644 --- a/revision.c +++ b/revision.c @@ -25,6 +25,7 @@ #include "worktree.h" #include "argv-array.h" #include "commit-reach.h" +#include "commit-graph.h" volatile show_early_output_fn_t show_early_output; @@ -2451,7 +2452,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s if (revs->diffopt.objfind) revs->simplify_history = 0; - if (revs->topo_order) + if (revs->topo_order && !generation_numbers_enabled(the_repository)) revs->limited = 1; if (revs->prune_data.nr) { @@ -2889,6 +2890,33 @@ static int mark_uninteresting(const struct object_id *oid, return 0; } +struct topo_walk_info {}; + +static void init_topo_walk(struct rev_info *revs) +{ + struct topo_walk_info *info; + revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info)); + info = revs->topo_walk_info; + memset(info, 0, sizeof(struct topo_walk_info)); + + limit_list(revs); + sort_in_topological_order(&revs->commits, revs->sort_order); +} + +static struct commit *next_topo_commit(struct rev_info *revs) +{ + return pop_commit(&revs->commits); +} + +static void expand_topo_walk(struct rev_info *revs, struct commit *commit) +{ + if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) { + if (!revs->ignore_missing_links) + die("Failed to traverse parents of commit %s", + oid_to_hex(&commit->object.oid)); + } +} + int prepare_revision_walk(struct rev_info *revs) { int i; @@ -2925,11 +2953,13 @@ int prepare_revision_walk(struct rev_info *revs) commit_list_sort_by_date(&revs->commits); if (revs->no_walk) return 0; - if (revs->limited) + if (revs->limited) { if (limit_list(revs) < 0) return -1; - if (revs->topo_order) - sort_in_topological_order(&a
[PATCH 3/6] test-reach: add rev-list tests
From: Derrick Stolee The rev-list command is critical to Git's functionality. Ensure it works in the three commit-graph environments constructed in t6600-test-reach.sh. Here are a few important types of rev-list operations: * Basic: git rev-list --topo-order HEAD * Range: git rev-list --topo-order compare..HEAD * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD * Symmetric Difference: git rev-list --topo-order compare...HEAD Signed-off-by: Derrick Stolee --- t/t6600-test-reach.sh | 84 +++ 1 file changed, 84 insertions(+) diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh index 1b18e12a4e..2fcaa39077 100755 --- a/t/t6600-test-reach.sh +++ b/t/t6600-test-reach.sh @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' ' test_three_modes commit_contains --tag ' +test_expect_success 'rev-list: basic topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \ + commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes "git rev-list --topo-order commit-6-6" +' + +test_expect_success 'rev-list: first-parent topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \ + >expect && + run_three_modes "git rev-list --first-parent --topo-order commit-6-6" +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6" +' + +test_expect_success 'rev-list: range topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6" +' + +test_expect_success 'rev-list: first-parent range topo-order' ' + git rev-parse \ + commit-6-6 \ + commit-6-5 \ + commit-6-4 \ + commit-6-3 \ + commit-6-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + >expect && + run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6" +' + +test_expect_success 'rev-list: ancestry-path topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 commit-3-6 \ + commit-6-5 commit-5-5 commit-4-5 commit-3-5 \ + commit-6-4 commit-5-4 commit-4-4 commit-3-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + >expect && + run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6" +' + +test_expect_success 'rev-list: symmetric difference topo-order' ' + git rev-parse \ + commit-6-6 commit-5-6 commit-4-6 \ + commit-6-5 commit-5-5 commit-4-5 \ + commit-6-4 commit-5-4 commit-4-4 \ + commit-6-3 commit-5-3 commit-4-3 \ + commit-6-2 commit-5-2 commit-4-2 \ + commit-6-1 commit-5-1 commit-4-1 \ + commit-3-8 commit-2-8 commit-1-8 \ + commit-3-7 commit-2-7 commit-1-7 \ + >expect && + run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6" +' + test_done -- gitgitgadget
[PATCH 5/6] commit/revisions: bookkeeping before refactoring
From: Derrick Stolee There are a few things that need to move around a little before making a big refactoring in the topo-order logic: 1. We need access to record_author_date() and compare_commits_by_author_date() in revision.c. These are used currently by sort_in_topological_order() in commit.c. 2. Moving these methods to commit.h requires adding the author_slab definition to commit.h. 3. The add_parents_to_list() method in revision.c performs logic around the UNINTERESTING flag and other special cases depending on the struct rev_info. Allow this method to ignore a NULL 'list' parameter, as we will not be populating the list for our walk. Signed-off-by: Derrick Stolee --- commit.c | 11 --- commit.h | 8 revision.c | 6 -- 3 files changed, 16 insertions(+), 9 deletions(-) diff --git a/commit.c b/commit.c index 32d1234bd7..2dbe187b8c 100644 --- a/commit.c +++ b/commit.c @@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack) /* count number of children that have not been emitted */ define_commit_slab(indegree_slab, int); -/* record author-date for each commit object */ -define_commit_slab(author_date_slab, unsigned long); - -static void record_author_date(struct author_date_slab *author_date, - struct commit *commit) +void record_author_date(struct author_date_slab *author_date, + struct commit *commit) { const char *buffer = get_commit_buffer(commit, NULL); struct ident_split ident; @@ -684,8 +681,8 @@ fail_exit: unuse_commit_buffer(commit, buffer); } -static int compare_commits_by_author_date(const void *a_, const void *b_, - void *cb_data) +int compare_commits_by_author_date(const void *a_, const void *b_, + void *cb_data) { const struct commit *a = a_, *b = b_; struct author_date_slab *author_date = cb_data; diff --git a/commit.h b/commit.h index e2c99d9b04..51de10e698 100644 --- a/commit.h +++ b/commit.h @@ -8,6 +8,7 @@ #include "gpg-interface.h" #include "string-list.h" #include "pretty.h" +#include "commit-slab.h" #define COMMIT_NOT_FROM_GRAPH 0x #define GENERATION_NUMBER_INFINITY 0x @@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf); */ extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); +/* record author-date for each commit object */ +define_commit_slab(author_date_slab, timestamp_t); + +void record_author_date(struct author_date_slab *author_date, + struct commit *commit); + +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); diff --git a/revision.c b/revision.c index 1db70dc951..565f903e46 100644 --- a/revision.c +++ b/revision.c @@ -804,7 +804,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, if (p->object.flags & SEEN) continue; p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } return 0; } @@ -843,7 +844,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit, p->object.flags |= left_flag; if (!(p->object.flags & SEEN)) { p->object.flags |= SEEN; - commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); + if (list) + commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr); } if (revs->first_parent_only) break; -- gitgitgadget
[PATCH 0/6] Use generation numbers for --topo-order
This patch series performs a decently-sized refactoring of the revision-walk machinery. Well, "refactoring" is probably the wrong word, as I don't actually remove the old code. Instead, when we see certain options in the 'rev_info' struct, we redirect the commit-walk logic to a new set of methods that distribute the workload differently. By using generation numbers in the commit-graph, we can significantly improve 'git log --graph' commands (and the underlying 'git rev-list --topo-order'). On the Linux repository, I got the following performance results when comparing to the previous version with or without a commit-graph: Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s If you want to read this series but are unfamiliar with the commit-graph and generation numbers, then I recommend reading Documentation/technical/commit-graph.txt or a blob post [1] I wrote on the subject. In particular, the three-part walk described in "revision.c: refactor basic topo-order logic" is present (but underexplained) as an animated PNG [2]. Since revision.c is an incredibly important (and old) portion of the codebase -- and because there are so many orthogonal options in 'struct rev_info' -- I consider this submission to be "RFC quality". That is, I am not confident that I am not missing anything, or that my solution is the best it can be. I did merge this branch with ds/commit-graph-with-grafts and the "DO-NOT-MERGE: write and read commit-graph always" commit that computes a commit-graph with every 'git commit' command. The test suite passed with that change, available on GitHub [3]. To ensure that I cover at least the case I think are interesting, I added tests to t6600-test-reach.sh to verify the walks report the correct results for the three cases there (no commit-graph, full commit-graph, and a partial commit-graph so the walk starts at GENERATION_NUMBER_INFINITY). One notable case that is not included in this series is the case of a history comparison such as 'git rev-list --topo-order A..B'. The existing code in limit_list() has ways to cut the walk short when all pending commits are UNINTERESTING. Since this code depends on commit_list instead of the prio_queue we are using here, I chose to leave it untouched for now. We can revisit it in a separate series later. Since handle_commit() turns on revs->limited when a commit is UNINTERESTING, we do not hit the new code in this case. Removing this 'revs->limited = 1;' line yields correct results, but the performance is worse. This series is based on ds/reachable. Thanks, -Stolee [1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/ Supercharging the Git Commit Graph III: Generations and Graph Algorithms [2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png Animation showing three-part walk [3] https://github.com/derrickstolee/git/tree/topo-order/testA branch containing this series along with commits to compute commit-graph in entire test suite. Derrick Stolee (6): prio-queue: add 'peek' operation test-reach: add run_three_modes method test-reach: add rev-list tests revision.c: begin refactoring --topo-order logic commit/revisions: bookkeeping before refactoring revision.c: refactor basic topo-order logic commit.c | 11 +- commit.h | 8 ++ object.h | 4 +- prio-queue.c | 9 ++ prio-queue.h | 6 + revision.c | 232 - revision.h | 6 + t/helper/test-prio-queue.c | 10 +- t/t6600-test-reach.sh | 98 +++- 9 files changed, 361 insertions(+), 23 deletions(-) base-commit: 6cc017431c1c48f80d1c6512fdcc9866cf4b7f55 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/25 -- gitgitgadget
[PATCH 6/6] revision.c: refactor basic topo-order logic
From: Derrick Stolee When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: 1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits. 2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents. 3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order(). In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit- graph feature. Recall that the generation number of a commit satisfies: * If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents. * If the commit has no parent, then the generation number is one. There are two special generation numbers: * GENERATION_NUMBER_INFINITY: this value is 0x and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated. * GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0). Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement than the usual we expect from generation numbers: If A and B are commits with generation numbers gen(A) and gen(B) and gen(A) < gen(B), then A cannot reach B. Thus, we will walk in each of our stages until the "maximum unexpanded generation number" is strictly lower than the generation number of a commit we are about to use. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in- degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. 3. TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author- date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. The implementations of these walks are in the following methods: * explore_walk_step and explore_to_depth * indegree_walk_step and compute_indegrees_to_depth * next_topo_commit and expand_topo_walk These methods have some patterns that may seem strange at first, but they are probably carry-overs from their equivalents in limit_list and sort_in_topological_order. One thing that is missing from this implementation is a proper way to stop walking when the entire queue is UNINTERESTING, so this implementation is not enabled by comparisions, such as in 'git rev-list --topo-order A..B'. This can be updated in the future. In my local testing, I used the following Git commands on the Linux repository in three modes: HEAD~1 with no commit-graph, HEAD~1 with a commit-graph, and HEAD with a commit-graph. This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting th
Re: Contributor Summit planning
On 8/27/2018 9:22 AM, Johannes Schindelin wrote: Point in favor of the pure-online meeting: the informal standup on IRC every second Friday. I really try to attend it (it is a bit awkward because it is on a Friday evening in my timezone, right at the time when I want to unwind from the work week), as it does have a similar effect to in-person standups: surprising collaborations spring up, unexpected help, and a general sense of belonging. Of course, the value of these standups comes from the makeup of the participants: Stefan, Brandon, Stolee, JeffH, Jonathan and other *very* active core contributors hang out for roughly half an hour, sharing what they are working on, exchanging ideas, etc. A focused aside, since you brought up the online "standup": it seems the IRC channel has been less than ideal, with people trying to participate but having nickname issues or being muted. You also describe another issue: the timing. Having a real-time discussion has its benefits, but also it leaves many people out. One idea to try next time is to create a mailing list thread asking for statuses, and each person can chime in asynchronously and spawn a new discussion based on that status. Perhaps we can try that next time. Thanks, -Stolee
Re: [PATCH 0/9] introducing oideq()
On 8/26/2018 4:56 PM, brian m. carlson wrote: On Sat, Aug 25, 2018 at 04:00:31AM -0400, Jeff King wrote: This is a follow-up to the discussion in: https://public-inbox.org/git/20180822030344.ga14...@sigill.intra.peff.net/ The general idea is that the majority of callers don't care about actual plus/minus ordering from oidcmp() and hashcmp(); they just want to know if two oids are equal. The stricter equality check can be optimized better by the compiler. Due to the simplicity of the current code and our inlining, the compiler can usually figure this out for now. So I wouldn't expect this patch to actually improve performance right away. But as that discussion shows, we are likely to take a performance hit as we move to more runtime determination of the_hash_algo parameters. Having these callers use the more strict form will potentially help us recover that. So in that sense we _could_ simply punt on this series until then (and it's certainly post-v2.19 material). But I think it's worth doing now, simply from a readability/annotation standpoint. IMHO the resulting code is more clear (though I've long since taught myself to read !foocmp() as equality). I would quite like to see this series picked up for v2.20. If we want to minimize performance regressions with the SHA-256 work, I think it's important. Seconded. Applying the following patch on top of this series causes gcc to inline both branches, which is pretty much the best we can expect. I haven't benchmarked it, though, so I can't say what the actual performance consequence is. We should definitely measure the cost here, but when we have the option to move to newhash, that cost should be acceptable. Thanks, -Stolee
Re: [PATCH 3/9] convert "oidcmp() == 0" to oideq()
On 8/27/2018 8:31 AM, Derrick Stolee wrote: On 8/25/2018 4:36 AM, Jeff King wrote: On Sat, Aug 25, 2018 at 04:07:15AM -0400, Jeff King wrote: diff --git a/contrib/coccinelle/object_id.cocci b/contrib/coccinelle/object_id.cocci index 5869979be7..548c02336d 100644 --- a/contrib/coccinelle/object_id.cocci +++ b/contrib/coccinelle/object_id.cocci @@ -108,3 +108,9 @@ expression E1, E2; @@ - hashcpy(E1.hash, E2->hash) + oidcpy(&E1, E2) Is this change intended? It doesn't seem to match the intention of the rest of the patch. Ignore me. It's just confusing to read the +/- notation from a cocci script alongside the file diff.
Re: [PATCH 3/9] convert "oidcmp() == 0" to oideq()
On 8/25/2018 4:36 AM, Jeff King wrote: On Sat, Aug 25, 2018 at 04:07:15AM -0400, Jeff King wrote: diff --git a/contrib/coccinelle/object_id.cocci b/contrib/coccinelle/object_id.cocci index 5869979be7..548c02336d 100644 --- a/contrib/coccinelle/object_id.cocci +++ b/contrib/coccinelle/object_id.cocci @@ -108,3 +108,9 @@ expression E1, E2; @@ - hashcpy(E1.hash, E2->hash) + oidcpy(&E1, E2) Is this change intended? It doesn't seem to match the intention of the rest of the patch. + +@@ +expression E1, E2; +@@ +- oidcmp(E1, E2) == 0 ++ oideq(E1, E2) [...] diff --git a/cache.h b/cache.h index f6d227fac7..d090f71706 100644 --- a/cache.h +++ b/cache.h @@ -1100,7 +1100,7 @@ static inline int is_empty_blob_sha1(const unsigned char *sha1) static inline int is_empty_blob_oid(const struct object_id *oid) { - return !oidcmp(oid, the_hash_algo->empty_blob); + return oideq(oid, the_hash_algo->empty_blob); } By the way, one interesting thing I noticed is that coccinelle generates the hunk for cache.h for _every_ file that includes it, and the resulting patch is annoying to apply. I think this is related to the discussion we were having in: https://public-inbox.org/git/20180802115522.16107-1-szeder@gmail.com/ Namely that we might do better to invoke one big spatch (and let it parallelize internally) instead of one perfile. Presumably that would also avoid looking at the headers repeatedly (which would be both faster and produce better output). I'm not planning to pursue that immediately, so just food for thought at this point. The more we use Coccinelle, the more we will learn how to improve the workflow. Thanks, -Stolee
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/23/2018 4:59 PM, Derrick Stolee wrote: When I choose my own metrics for performance tests, I like to run at least 10 runs, remove the largest AND smallest runs from the samples, and then take the average. I did this manually for 'git rev-list --all --objects' on git/git and got the following results: v2.18.0 v2.19.0-rc0 HEAD 3.126 s 3.308 s 3.170 s For full disclosure, here is a full table including all samples: | | v2.18.0 | v2.19.0-rc0 | HEAD | |--|-|-|-| | | 4.58 | 3.302 | 3.239 | | | 3.13 | 3.337 | 3.133 | | | 3.213 | 3.291 | 3.159 | | | 3.219 | 3.318 | 3.131 | | | 3.077 | 3.302 | 3.163 | | | 3.074 | 3.328 | 3.119 | | | 3.022 | 3.277 | 3.125 | | | 3.083 | 3.259 | 3.203 | | | 3.057 | 3.311 | 3.223 | | | 3.155 | 3.413 | 3.225 | | Max | 4.58 | 3.413 | 3.239 | | Min | 3.022 | 3.259 | 3.119 | | Avg* | 3.126 | 3.30825 | 3.17025 | (Note that the largest one was the first run, on v2.18.0, which is due to a cold disk.) I just kicked off a script that will run this test on the Linux repo while I drive home. I'll be able to report a similar table of data easily. Here are the numbers for Linux: | | v2.18.0 | v2.19.0-rc0 | HEAD | |--|--|-|| | | 86.5 | 70.739 | 57.266 | | | 60.582 | 101.928 | 56.641 | | | 58.964 | 60.139 | 60.258 | | | 59.47 | 61.141 | 58.213 | | | 62.554 | 60.73 | 84.54 | | | 59.139 | 85.424 | 57.745 | | | 58.487 | 59.31 | 59.979 | | | 58.653 | 69.845 | 60.181 | | | 58.085 | 102.777 | 61.455 | | | 58.304 | 60.459 | 62.551 | | Max | 86.5 | 102.777 | 84.54 | | Min | 58.085 | 59.31 | 56.641 | | Avg* | 59.51913 | 71.30063 | 59.706 | | Med | 59.0515 | 65.493 | 60.08 | Thanks, -Stolee
Re: [PATCH v3 5/5] tests: fix and add lint for non-portable grep --file
On 8/23/2018 4:44 PM, Junio C Hamano wrote: Ævar Arnfjörð Bjarmason writes: The --file option to grep isn't in POSIX[1], but -f is[1]. Let's check for that in the future, and fix the portability regression in f237c8b6fe ("commit-graph: implement git-commit-graph write", 2018-04-02) that broke e.g. AIX. 1. http://pubs.opengroup.org/onlinepubs/009695399/utilities/grep.html Signed-off-by: Ævar Arnfjörð Bjarmason --- Thanks. Yes, thanks! I'll keep this in mind for the future. t/check-non-portable-shell.pl | 1 + t/t5318-commit-graph.sh | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl index 75f38298d7..b45bdac688 100755 --- a/t/check-non-portable-shell.pl +++ b/t/check-non-portable-shell.pl @@ -43,6 +43,7 @@ sub err { /\bwc -l.*"\s*=/ and err '`"$(wc -l)"` is not portable (use test_line_count)'; /\bhead\s+-c\b/ and err 'head -c is not portable (use test_copy_bytes BYTES out)'; /(?:\$\(seq|^\s*seq\b)/ and err 'seq is not portable (use test_seq)'; + /\bgrep\b.*--file\b/ and err 'grep --file FILE is not portable (use grep -f FILE)'; /\bexport\s+[A-Za-z0-9_]*=/ and err '"export FOO=bar" is not portable (use FOO=bar && export FOO)'; /^\s*([A-Z0-9_]+=(\w+|(["']).*?\3)\s+)+(\w+)/ and exists($func{$4}) and err '"FOO=bar shell_func" assignment extends beyond "shell_func"'; diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 3c1ffad491..0c500f7ca2 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -134,7 +134,7 @@ test_expect_success 'Add one more commit' ' git branch commits/8 && ls $objdir/pack | grep idx >existing-idx && git repack && - ls $objdir/pack| grep idx | grep -v --file=existing-idx >new-idx + ls $objdir/pack| grep idx | grep -v -f existing-idx >new-idx ' # Current graph structure:
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/24/2018 2:45 AM, Jeff King wrote: On Thu, Aug 23, 2018 at 10:59:55PM -0400, Jeff King wrote: So I think we have a winner. I'll polish that up into patches and send it out later tonight. Oof. This rabbit hole keeps going deeper and deeper. I wrote up my coccinelle findings separately in: https://public-inbox.org/git/20180824064228.ga3...@sigill.intra.peff.net/ which is possibly a coccinelle bug (there I talked about oidcmp, since it can be demonstrated with the existing transformations, but the same thing happens with my hasheq patches). I'll wait to see how that discussion plays out, but I do otherwise have hasheq() patches ready to go, so it's probably not worth anybody else digging in further. I was trying this myself yesterday, and found a doc _somewhere_ that said what we should do is this: @@ expression e1, e2; @@ if ( - oidcmp(e1, e2) + !oideq(e1, e2) ) (Don't forget the space before the last end-paren!) -Stolee
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/23/2018 2:53 PM, Jeff King wrote: On Thu, Aug 23, 2018 at 06:26:58AM -0400, Derrick Stolee wrote: I think you can safely ignore the rest of it if you are otherwise occupied. Even if v2.19 ships without some mitigation, I don't know that it's all that big a deal, given the numbers I generated (which for some reason are less dramatic than Stolee's). My numbers may be more dramatic because my Linux environment is a virtual machine. If you have a chance, can you run p0001 on my patch (compared to 2.19-rc0, or to both v2.18 and v2.19-rc0)? It would be nice to double check that it really is fixing the problem you saw. Sure. Note: I had to create a new Linux VM on a different machine between Tuesday and today, so the absolute numbers are different. Using git/git: Test v2.18.0 v2.19.0-rc0 HEAD - 0001.2: 3.10(3.02+0.08) 3.27(3.17+0.09) +5.5% 3.14(3.02+0.11) +1.3% Using torvalds/linux: Test v2.18.0 v2.19.0-rc0 HEAD -- 0001.2: 56.08(45.91+1.50) 56.60(46.62+1.50) +0.9% 54.61(45.47+1.46) -2.6% Now here is where I get on my soapbox (and create a TODO for myself later). I ran the above with GIT_PERF_REPEAT_COUNT=10, which intuitively suggests that the results should be _more_ accurate than the default of 3. However, I then remember that we only report the *minimum* time from all the runs, which is likely to select an outlier from the distribution. To test this, I ran a few tests manually and found the variation between runs to be larger than 3%. When I choose my own metrics for performance tests, I like to run at least 10 runs, remove the largest AND smallest runs from the samples, and then take the average. I did this manually for 'git rev-list --all --objects' on git/git and got the following results: v2.18.0 v2.19.0-rc0 HEAD 3.126 s 3.308 s 3.170 s For full disclosure, here is a full table including all samples: | | v2.18.0 | v2.19.0-rc0 | HEAD | |--|-|-|-| | | 4.58 | 3.302 | 3.239 | | | 3.13 | 3.337 | 3.133 | | | 3.213 | 3.291 | 3.159 | | | 3.219 | 3.318 | 3.131 | | | 3.077 | 3.302 | 3.163 | | | 3.074 | 3.328 | 3.119 | | | 3.022 | 3.277 | 3.125 | | | 3.083 | 3.259 | 3.203 | | | 3.057 | 3.311 | 3.223 | | | 3.155 | 3.413 | 3.225 | | Max | 4.58 | 3.413 | 3.239 | | Min | 3.022 | 3.259 | 3.119 | | Avg* | 3.126 | 3.30825 | 3.17025 | (Note that the largest one was the first run, on v2.18.0, which is due to a cold disk.) I just kicked off a script that will run this test on the Linux repo while I drive home. I'll be able to report a similar table of data easily. My TODO is to consider aggregating the data this way (or with a median) instead of reporting the minimum. Thanks, -Stolee
[PATCH v2] config: fix commit-graph related config docs
The core.commitGraph config setting was accidentally removed from the config documentation. In that same patch, the config setting that writes a commit-graph during garbage collection was incorrectly written to the doc as "gc.commitGraph" instead of "gc.writeCommitGraph". Reported-by: Szeder Gábor Signed-off-by: Derrick Stolee --- Documentation/config.txt | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 1c42364988..6ee1890984 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -917,12 +917,10 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. -gc.commitGraph:: - If true, then gc will rewrite the commit-graph file when - linkgit:git-gc[1] is run. When using linkgit:git-gc[1] - '--auto' the commit-graph will be updated if housekeeping is - required. Default is false. See linkgit:git-commit-graph[1] - for details. +core.commitGraph:: + If true, then git will read the commit-graph file (if it exists) + to parse the graph structure of commits. Defaults to false. See + linkgit:git-commit-graph[1] for more information. core.useReplaceRefs:: If set to `false`, behave as if the `--no-replace-objects` @@ -1763,6 +1761,13 @@ this configuration variable is ignored, all packs except the base pack will be repacked. After this the number of packs should go below gc.autoPackLimit and gc.bigPackThreshold should be respected again. +gc.writeCommitGraph:: + If true, then gc will rewrite the commit-graph file when + linkgit:git-gc[1] is run. When using linkgit:git-gc[1] + '--auto' the commit-graph will be updated if housekeeping is + required. Default is false. See linkgit:git-commit-graph[1] + for details. + gc.logExpiry:: If the file gc.log exists, then `git gc --auto` won't run unless that file is more than 'gc.logExpiry' old. Default is -- 2.19.0.rc0
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/23/2018 1:04 AM, Jeff King wrote: On Thu, Aug 23, 2018 at 03:47:07AM +, brian m. carlson wrote: I expect that's going to be the case as well. I have patches that wire up actual SHA-256 support in my hash-impl branch. However, having said that, I'm happy to defer to whatever everyone else thinks is best for 2.19. The assert solution would be fine with me in this situation, and if we need to pull it out in the future, that's okay with me. I don't really have a strong opinion on this either way, so if someone else does, please say so. I have somewhat more limited availability over the next couple days, as I'm travelling on business, but I'm happy to review a patch (and it seems like Peff has one minus the actual commit message). I just posted the patch elsewhere in the thread. Thank you for that! I think you can safely ignore the rest of it if you are otherwise occupied. Even if v2.19 ships without some mitigation, I don't know that it's all that big a deal, given the numbers I generated (which for some reason are less dramatic than Stolee's). My numbers may be more dramatic because my Linux environment is a virtual machine. I was thinking that having a mitigation for 2.19 is best, and then we can focus as part of the 2.20 cycle how we can properly avoid this cost, especially when 32 is a valid option. Around the time that my proposed approaches were getting vetoed for alignment issues, I figured I was out of my depth here. I reached out to Daniel Lemire (of EWAH bitmap fame) on Twitter [1]. His blog is full of posts of word-based approaches to different problems, so I thought he might know something off the top of his head that would be applicable. His conclusion (after looking only a short time) was to take a 'hasheq' approach [2] like Peff suggested [3]. Since that requires auditing all callers of hashcmp to see if hasheq is appropriate, it is not a good solution for 2.19 but (in my opinion) should be evaluated as part of the 2.20 cycle. Of course, if someone with knowledge of word-alignment issues across the platforms we support knows how to enforce an alignment for object_id, then something word-based like [4] could be reconsidered. Thanks, everyone! -Stolee [1] https://twitter.com/stolee/status/1032312965754748930 [2] https://lemire.me/blog/2018/08/22/avoid-lexicographical-comparisons-when-testing-for-string-equality/ [3] https://public-inbox.org/git/20180822030344.ga14...@sigill.intra.peff.net/ [4] https://public-inbox.org/git/7ea416cf-b043-1274-e161-85a8780b8...@gmail.com/
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/22/2018 12:58 PM, Duy Nguyen wrote: On Wed, Aug 22, 2018 at 6:49 PM Derrick Stolee wrote: On 8/22/2018 12:26 PM, Jeff King wrote: On Wed, Aug 22, 2018 at 06:14:24PM +0200, Duy Nguyen wrote: On Wed, Aug 22, 2018 at 6:08 PM Duy Nguyen wrote: On Wed, Aug 22, 2018 at 6:03 PM Jeff King wrote: On Wed, Aug 22, 2018 at 07:14:42AM -0400, Derrick Stolee wrote: The other thing I was going to recommend (and I'll try to test this out myself later) is to see if 'the_hash_algo->rawsz' is being treated as a volatile variable, since it is being referenced through a pointer. Perhaps storing the value locally and then casing on it would help? I tried various sprinkling of "const" around the declarations to make it clear that the values wouldn't change once we saw them. But I couldn't detect any difference. At most I think that would let us hoist the "if" out of the loop, but gcc still seems unwilling to expand the memcmp when there are other branches. I think if that's the thing we want to have happen, we really do need to just write it out on that branch rather than saying "memcmp". This reminds me of an old discussion about memcpy() vs doing explicit compare loop with lots of performance measurements.. Ah found it. Not sure if it is still relevant in light of multiple hash support https://public-inbox.org/git/20110427225114.ga16...@elte.hu/ Yes, that was what I meant. We actually did switch to that hand-rolled loop, but later we went back to memcmp in 0b006014c8 (hashcmp: use memcmp instead of open-coded loop, 2017-08-09). Looking at that commit, I'm surprised the old logic was just a for loop, instead of a word-based approach, such as the following: Might work on x86 but it breaks on cpu architectures with stricter alignment. I don't think we have a guarantee that object_id is always 8 byte aligned. You (and Peff) are probably correct here, which is unfortunate. I'm not familiar with alignment constraints, but assume that such a word-based approach is best. Thanks, -Stolee
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/22/2018 12:26 PM, Jeff King wrote: On Wed, Aug 22, 2018 at 06:14:24PM +0200, Duy Nguyen wrote: On Wed, Aug 22, 2018 at 6:08 PM Duy Nguyen wrote: On Wed, Aug 22, 2018 at 6:03 PM Jeff King wrote: On Wed, Aug 22, 2018 at 07:14:42AM -0400, Derrick Stolee wrote: The other thing I was going to recommend (and I'll try to test this out myself later) is to see if 'the_hash_algo->rawsz' is being treated as a volatile variable, since it is being referenced through a pointer. Perhaps storing the value locally and then casing on it would help? I tried various sprinkling of "const" around the declarations to make it clear that the values wouldn't change once we saw them. But I couldn't detect any difference. At most I think that would let us hoist the "if" out of the loop, but gcc still seems unwilling to expand the memcmp when there are other branches. I think if that's the thing we want to have happen, we really do need to just write it out on that branch rather than saying "memcmp". This reminds me of an old discussion about memcpy() vs doing explicit compare loop with lots of performance measurements.. Ah found it. Not sure if it is still relevant in light of multiple hash support https://public-inbox.org/git/20110427225114.ga16...@elte.hu/ Yes, that was what I meant. We actually did switch to that hand-rolled loop, but later we went back to memcmp in 0b006014c8 (hashcmp: use memcmp instead of open-coded loop, 2017-08-09). Looking at that commit, I'm surprised the old logic was just a for loop, instead of a word-based approach, such as the following: diff --git a/cache.h b/cache.h index b1fd3d58ab..5e5819ad49 100644 --- a/cache.h +++ b/cache.h @@ -1021,9 +1021,41 @@ extern int find_unique_abbrev_r(char *hex, const struct object_id *oid, int len) extern const unsigned char null_sha1[GIT_MAX_RAWSZ]; extern const struct object_id null_oid; +static inline int word_cmp_32(uint32_t a, uint32_t b) +{ + return memcmp(&a, &b, sizeof(uint32_t)); +} + +static inline int word_cmp_64(uint64_t a, uint64_t b) +{ + return memcmp(&a, &b, sizeof(uint64_t)); +} + +struct object_id_20 { + uint64_t data0; + uint64_t data1; + uint32_t data2; +}; + static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) { - return memcmp(sha1, sha2, the_hash_algo->rawsz); + if (the_hash_algo->rawsz == 20) { + struct object_id_20 *obj1 = (struct object_id_20 *)sha1; + struct object_id_20 *obj2 = (struct object_id_20 *)sha2; + + if (obj1->data0 == obj2->data0) { + if (obj1->data1 == obj2->data1) { + if (obj1->data2 == obj2->data2) { + return 0; + } + return word_cmp_32(obj1->data2, obj2->data2); + } + return word_cmp_64(obj1->data1, obj2->data1); + } + return word_cmp_64(obj1->data0, obj2->data0); + } + + assert(0); } static inline int oidcmp(const struct object_id *oid1, const struct object_id *oid2)
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/22/2018 1:36 AM, brian m. carlson wrote: On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote: So I wonder if there's some other way to tell the compiler that we'll only have a few values. An enum comes to mind, though I don't think the enum rules are strict enough to make this guarantee (after all, it's OK to bitwise-OR enums, so they clearly don't specify all possible values). I was thinking about this: diff --git a/cache.h b/cache.h index 1398b2a4e4..1f5c6e9319 100644 --- a/cache.h +++ b/cache.h @@ -1033,7 +1033,14 @@ extern const struct object_id null_oid; static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) { - return memcmp(sha1, sha2, the_hash_algo->rawsz); + switch (the_hash_algo->rawsz) { + case 20: + return memcmp(sha1, sha2, 20); + case 32: + return memcmp(sha1, sha2, 32); + default: + assert(0); + } } static inline int oidcmp(const struct object_id *oid1, const struct object_id *oid2) That would make it obvious that there are at most two options. Unfortunately, gcc for me determines that the buffer in walker.c is 20 bytes in size and steadfastly refuses to compile because it doesn't know that the value will never be 32 in our codebase currently. I'd need to send in more patches before it would compile. I don't know if something like this is an improvement or now, but this seems to at least compile: diff --git a/cache.h b/cache.h index 1398b2a4e4..3207f74771 100644 --- a/cache.h +++ b/cache.h @@ -1033,7 +1033,13 @@ extern const struct object_id null_oid; static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) { - return memcmp(sha1, sha2, the_hash_algo->rawsz); + switch (the_hash_algo->rawsz) { + case 20: + case 32: + return memcmp(sha1, sha2, the_hash_algo->rawsz); + default: + assert(0); + } } In my testing, I've had the best luck with this change: diff --git a/cache.h b/cache.h index b1fd3d58ab..6c8b51c390 100644 --- a/cache.h +++ b/cache.h @@ -1023,7 +1023,14 @@ extern const struct object_id null_oid; static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) { - return memcmp(sha1, sha2, the_hash_algo->rawsz); + switch (the_hash_algo->rawsz) { + case 20: + return memcmp(sha1, sha2, 20); + case 32: + return memcmp(sha1, sha2, 32); + default: + assert(0); + } } The fact that '20' and '32' are constants here may be helpful to the compiler. Can someone else test the perf? Thanks, -Stolee
Re: [PATCH 3/6] t/perf: add infrastructure for measuring sizes
On 8/21/2018 3:06 PM, Jeff King wrote: The main objective of scripts in the perf framework is to run "test_perf", which measures the time it takes to run some operation. However, it can also be interesting to see the change in the output size of certain operations. This patch introduces test_size, which records a single numeric output from the test and shows it in the aggregated output (with pretty printing and relative size comparison). I'm interested in exploring this test_size mechanism. The other area that could benefit from size testing is 'git repack', but I don't have any plans to change our compression or delta strategies. If we _did_ look into that, then using test_size would be a natural fit. Overall, I really like this series! Thanks -Stolee
[PATCH] config: fix commit-graph related config docs
The core.commitGraph config setting was accidentally removed from the config documentation. In that same patch, the config setting that writes a commit-graph during garbage collection was incorrectly written to the doc as "gc.commitGraph" instead of "gc.writeCommitGraph". Reported-by: Szeder Gábor Signed-off-by: Derrick Stolee --- Szeder reported this in [1] and I forgot until now to come back to it. [1] https://public-inbox.org/git/20180803093909.2853-1-szeder@gmail.com/ Documentation/config.txt | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 1c42364988..f846543414 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -917,7 +917,11 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. -gc.commitGraph:: +core.commitGraph:: + Enable git commit graph feature. Allows reading from the + commit-graph file. + +gc.writeCommitGraph:: If true, then gc will rewrite the commit-graph file when linkgit:git-gc[1] is run. When using linkgit:git-gc[1] '--auto' the commit-graph will be updated if housekeeping is base-commit: 7e8bfb0412581daf8f3c89909f1d37844e8610dd -- 2.19.0.rc0
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/22/2018 3:39 AM, Ævar Arnfjörð Bjarmason wrote: On Wed, Aug 22, 2018 at 8:20 AM Jeff King wrote: On Wed, Aug 22, 2018 at 05:36:26AM +, brian m. carlson wrote: On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote: I don't know if something like this is an improvement or now, but this seems to at least compile: diff --git a/cache.h b/cache.h index 1398b2a4e4..3207f74771 100644 --- a/cache.h +++ b/cache.h @@ -1033,7 +1033,13 @@ extern const struct object_id null_oid; static inline int hashcmp(const unsigned char *sha1, const unsigned char *sha2) { - return memcmp(sha1, sha2, the_hash_algo->rawsz); + switch (the_hash_algo->rawsz) { + case 20: + case 32: + return memcmp(sha1, sha2, the_hash_algo->rawsz); + default: + assert(0); + } I think that would end up with the same slow code, as gcc would rather call memcmp than expand out the two sets of asm. I won't have time to sit down and test this out until tomorrow afternoon at the earliest. If you want to send in something in the mean time, even if that limits things to just 20 for now, that's fine. I don't have a good option. The assert() thing works until I add in the "32" branch, but that's just punting the issue off until you add support for the new hash. Hand-rolling our own asm or C is a portability headache, and we need to change all of the callsites to use a new hasheq(). Hiding it behind a per-hash function is conceptually cleanest, but not quite as fast. And it also requires hasheq(). So all of the solutions seem non-trivial. Again, I'm starting to wonder if it's worth chasing this few percent. Did you try __builtin_expect? It's a GCC builtin for these sorts of situations, and sometimes helps: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html I.e. you'd tell GCC we expect to have the 20 there with: if (__builtin_expect(the_hash_algo->rawsz == 20, 1)) { ... } The perl codebase has LIKELY() and UNLIKELY() macros for this which if the feature isn't available fall back on just plain C code: https://github.com/Perl/perl5/blob/v5.27.7/perl.h#L3335-L3344 The other thing I was going to recommend (and I'll try to test this out myself later) is to see if 'the_hash_algo->rawsz' is being treated as a volatile variable, since it is being referenced through a pointer. Perhaps storing the value locally and then casing on it would help?
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/21/2018 11:36 PM, Jeff King wrote: On Tue, Aug 21, 2018 at 11:03:44PM -0400, Jeff King wrote: with the obvious "oideq()" implementation added, that seems to get me to 2-3%. Not _quite_ as good as the original branching version I showed. And we had to touch all the callsites (although arguably that kind of "eq" function is a better interface anyway, since it obviously allows for more optimization. So maybe the branching thing is actually not so insane. It makes new hash_algo's Just Work; they just won't be optimized. And the change is very localized. Hmph. So I went back to double-check my measurements on that branching version, and I couldn't replicate it! I'm actually relieved to see this, as I couldn't either. It turns out what I showed (and measured) before has a bug. Can you see it? I had rewritten the section from scratch instead of applying your diff, so I didn't get the sha1-sha1 error. I decided to sleep on it instead of sending my email. So the assert() version really is the fastest. I didn't test, but I suspect we could "trick" the compiler by having the fallback call an opaque wrapper around memcmp(). That would prevent it from combining the two paths, and presumably it would still optimize the constant-20 side. Or maybe it would eventually decide our inline function is getting too big and scrap it. Which probably crosses a line of craziness (if I didn't already cross it two emails ago). I appreciate your effort here. Thanks -Stolee
Re: [PATCH v2 2/2] commit-graph.txt: improve formatting for asciidoc
On 8/21/2018 5:29 PM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: From: Derrick Stolee When viewing commit-graph.txt as a plain-text document, it makes sense to keep paragraphs left-padded between bullet points. However, asciidoc converts these left-padded paragraphs as monospace fonts, creating an unpleasant document. Remove the padding. That's completely backwards. These indented two paragraphs that follow "... the following property:" do not fall into the same classes of paragraphs as others. The way the text version makes them stand out by indenting clearly show these two are what "... the following ..." refers to. Perhaps these two would want to become a bulleted list or something. The "Future Work" section includes a bulleted list of items, and one item has sub-items. These do not render properly in asciidoc, so remove the sub-list and incorporate them into the paragraph. Signed-off-by: Derrick Stolee --- Documentation/technical/commit-graph.txt | 43 +++- I said I didn't check if commit-graph-format.txt doc format correctly, but that does not mean you do not have to ;-). I suspect that most of the contents would become monospaced wall of text, which is no better than the original text and because only the headings are typeset in different font, the result actually would be harder to read than the original. I compared the results of commit-graph-format.txt to pack-format.txt. Since I had based the format of commit-graph-format.txt on that file, the output was remarkably similar (and hence I couldn't find anything to change). We are not in a hurry to do this during the pre-release period, are we? My understanding is that the original text documentation files will be shipped and installed, and they are adequately readable (correct me if it is not the case). Unless we are making the result a lot more readable as the original text, let's not distract ourselves by rerolling this in too quick cycles without giving us sufficiently big improvements. No rush! I'll wait until the release calms down for a v3, and read the helpful docs on asciidoc format that Eric shared earlier. Thanks, -Stolee
[PATCH] commit: use timestamp_t for author_date_slab
The author_date_slab is used to store the author date of a commit when walking with the --author-date flag in rev-list or log. This was added as an 'unsigned long' in 81c6b38b "log: --author-date-order" Since 'unsigned long' is ambiguous in its bit-ness across platforms (64-bit in Linux, 32-bit in Windows, for example), most references to the author dates in commit.c were converted to timestamp_t in dddbad72 "timestamp_t: a new data type for timestamps" However, the slab definition was missed, leading to a mismatch in the data types in Windows. This would not reveal itself as a bug unless someone authors a commit after February 2106, but commits can store anything as their author date. Signed-off-by: Derrick Stolee --- I found this while reading up on the revision-walk machinery. This code hasn't been touched in years, so could apply to 'maint'. commit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 0030e79940..2ce5052b3e 100644 --- a/commit.c +++ b/commit.c @@ -609,7 +609,7 @@ struct commit *pop_commit(struct commit_list **stack) define_commit_slab(indegree_slab, int); /* record author-date for each commit object */ -define_commit_slab(author_date_slab, unsigned long); +define_commit_slab(author_date_slab, timestamp_t); static void record_author_date(struct author_date_slab *author_date, struct commit *commit) -- 2.19.0.rc0
Re: [ANNOUNCE] Git v2.19.0-rc0
On 8/20/2018 6:13 PM, Junio C Hamano wrote: An early preview release Git v2.19.0-rc0 is now available for testing at the usual places. As part of testing the release candidate, I ran the performance suite against a fresh clone of the Linux repository using v2.18.0 and v2.19.0-rc0 (also: GIT_PERF_REPEAT_COUNT=10). I found a few nice improvements, but I also found a possible regression in tree walking. I say "tree walking" because it was revealed using p0001-rev-list.sh, but only with the "--objects" flag. I also saw some similar numbers on 'git log --raw'. Test v2.18.0 v2.19.0-rc0 0001.1: rev-list --all 6.69(6.33+0.35) 6.52(6.20+0.31) -2.5% 0001.2: rev-list --all --objects 52.14(47.43+1.02) 57.15(51.09+1.18) +9.6% To me, 9.6% seems out of the range of just noise for this length of a command, but I could be wrong. Could anyone else try to repro these results? (This may also not just be tree-walking, but general pack-file loading and decompression, since I computed and stored a commit-graph file. Hence, commits are not being parsed from the pack-file by either command.) Aside: the perf results were not all bad. Here was an interesting improvement: Test v2.18.0 v2.19.0-rc0 0002.1: read_cache/discard_cache 1000 times 5.63(5.30+0.32) 3.34(3.03+0.30) -40.7% Thanks, -Stolee
Re: [PATCH 5/6] pack-bitmap: save "have" bitmap from walk
On 8/21/2018 3:07 PM, Jeff King wrote: When we do a bitmap walk, we save the result, which represents (WANTs & ~HAVEs); i.e., every object we care about visiting in our walk. However, we throw away the haves bitmap, which can sometimes be useful, too. Save it and provide an access function so code which has performed a walk can query it. This makes a lot of sense. Based on the amount of time the "Counting Objects" blog post [1] spent talking about delta bases, I would have assumed this "haves" bitmap was already part of it. But, I can also see how it could be dropped if you are focusing on performance for 'git clone'. A few notes on the accessor interface: - the bitmap code calls these "haves" because it grew out of the want/have negotiation for fetches. But really, these are simply the objects that would be flagged UNINTERESTING in a regular traversal. Let's use that more universal nomenclature for the external module interface. We may want to change the internal naming inside the bitmap code, but that's outside the scope of this patch. I saw the uninteresting-vs-haves name confusion in the patch below, but I agree with your logic here. Sorry that I'm late to the party, but I was interested in the topic. Thanks, -Stolee [1] https://githubengineering.com/counting-objects/ "Counting Objects" by Vincent Martí
Re: [PATCH] SubmittingPatches: mention doc-diff
On 8/21/2018 3:23 PM, Jeff King wrote: We already advise people to make sure their documentation formats correctly. Let's point them at the doc-diff script, which can help with that. Let's also put a brief note in the script about its purpose, since that otherwise can only be found in the original commit message. Along with the existing -h/usage text, that's hopefully enough for developers to make use of it. This is helpful, thanks! Signed-off-by: Jeff King --- Just a finishing touch on the jk/diff-rendered-docs topic. Documentation/SubmittingPatches | 4 +++- Documentation/doc-diff | 8 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index b44fd51f27..ec8b205145 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -80,7 +80,9 @@ GitHub-Travis CI hints section for details. Do not forget to update the documentation to describe the updated behavior and make sure that the resulting documentation set formats -well. It is currently a liberal mixture of US and UK English norms for +well (try the Documentation/doc-diff script). + +We currently have a liberal mixture of US and UK English norms for spelling and grammar, which is somewhat unfortunate. A huge patch that touches the files all over the place only to correct the inconsistency is not welcome, though. Potential clashes with other changes that can diff --git a/Documentation/doc-diff b/Documentation/doc-diff index f483fe427c..6e285e648c 100755 --- a/Documentation/doc-diff +++ b/Documentation/doc-diff @@ -1,4 +1,12 @@ #!/bin/sh +# +# Build two documentation trees and diff the resulting formatted output. +# Compared to a source diff, this can reveal mistakes in the formatting. +# For example: +# +# ./doc-diff origin/master HEAD +# +# would show the differences introduced by a branch based on master. OPTIONS_SPEC="\ doc-diff [options] [-- ]
[PATCH v2 2/2] commit-graph.txt: improve formatting for asciidoc
From: Derrick Stolee When viewing commit-graph.txt as a plain-text document, it makes sense to keep paragraphs left-padded between bullet points. However, asciidoc converts these left-padded paragraphs as monospace fonts, creating an unpleasant document. Remove the padding. The "Future Work" section includes a bulleted list of items, and one item has sub-items. These do not render properly in asciidoc, so remove the sub-list and incorporate them into the paragraph. Signed-off-by: Derrick Stolee --- Documentation/technical/commit-graph.txt | 43 +++- 1 file changed, 20 insertions(+), 23 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index c664acbd7..bdc3eab9e 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -49,23 +49,23 @@ Equivalently, the generation number of a commit A is one more than the length of a longest path from A to a root commit. The recursive definition is easier to use for computation and observing the following property: -If A and B are commits with generation numbers N and M, respectively, -and N <= M, then A cannot reach B. That is, we know without searching -that B is not an ancestor of A because it is further from a root commit -than A. +If A and B are commits with generation numbers N and M, respectively, +and N <= M, then A cannot reach B. That is, we know without searching +that B is not an ancestor of A because it is further from a root commit +than A. -Conversely, when checking if A is an ancestor of B, then we only need -to walk commits until all commits on the walk boundary have generation -number at most N. If we walk commits using a priority queue seeded by -generation numbers, then we always expand the boundary commit with highest -generation number and can easily detect the stopping condition. +Conversely, when checking if A is an ancestor of B, then we only need +to walk commits until all commits on the walk boundary have generation +number at most N. If we walk commits using a priority queue seeded by +generation numbers, then we always expand the boundary commit with highest +generation number and can easily detect the stopping condition. This property can be used to significantly reduce the time it takes to walk commits and determine topological relationships. Without generation numbers, the general heuristic is the following: -If A and B are commits with commit time X and Y, respectively, and -X < Y, then A _probably_ cannot reach B. +If A and B are commits with commit time X and Y, respectively, and +X < Y, then A _probably_ cannot reach B. This heuristic is currently used whenever the computation is allowed to violate topological relationships due to clock skew (such as "git log" @@ -121,11 +121,8 @@ Future Work - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered - priority queue with one ordered by generation number. The following - operations are important candidates: - -- 'log --topo-order' -- 'tag --merged' + priority queue with one ordered by generation number. Commands that could + improve include 'git log --topo-order' and 'git tag --merged'. - A server could provide a commit graph file as part of the network protocol to avoid extra calculations by clients. This feature is only of benefit if @@ -148,13 +145,13 @@ Related Links More discussion about generation numbers and not storing them inside commit objects. A valuable quote: -"I think we should be moving more in the direction of keeping - repo-local caches for optimizations. Reachability bitmaps have been - a big performance win. I think we should be doing the same with our - properties of commits. Not just generation numbers, but making it - cheap to access the graph structure without zlib-inflating whole - commit objects (i.e., packv4 or something like the "metapacks" I - proposed a few years ago)." +"I think we should be moving more in the direction of keeping + repo-local caches for optimizations. Reachability bitmaps have been + a big performance win. I think we should be doing the same with our + properties of commits. Not just generation numbers, but making it + cheap to access the graph structure without zlib-inflating whole + commit objects (i.e., packv4 or something like the "metapacks" I + proposed a few years ago)." [4] https://public-inbox.org/git/20180108154822.54829-1-...@jeffhostetler.com/T/#u A patch to remove the ahead-behind calculation from 'status'. -- gitgitgadget
[PATCH v2 1/2] Docs: Add commit-graph tech docs to Makefile
From: Derrick Stolee Ensure that the commit-graph.txt and commit-graph-format.txt files are compiled to HTML using ASCIIDOC. Signed-off-by: Derrick Stolee --- Documentation/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/Makefile b/Documentation/Makefile index d079d7c73..841e4f705 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -69,6 +69,8 @@ API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technica SP_ARTICLES += $(API_DOCS) TECH_DOCS += SubmittingPatches +TECH_DOCS += technical/commit-graph +TECH_DOCS += technical/commit-graph-format TECH_DOCS += technical/hash-function-transition TECH_DOCS += technical/http-protocol TECH_DOCS += technical/index-format -- gitgitgadget
[PATCH v2 0/2] Docs: Add commit-graph tech docs to Makefile
Similar to [1], add the commit-graph and commit-graph-format technical docs to Documentation/Makefile so they are automatically converted to HTML when needed. I compiled the docs and inspected the HTML manually in the browser. As suggested, I modified the documents to format a bit better. See the commit messages for details. Since the files had been modified since 'maint', this version is based on 'master'. [1] https://public-inbox.org/git/20180814222846.gg142...@aiede.svl.corp.google.com/ [PATCH] partial-clone: render design doc using asciidoc Derrick Stolee (2): Docs: Add commit-graph tech docs to Makefile commit-graph.txt: improve formatting for asciidoc Documentation/Makefile | 2 ++ Documentation/technical/commit-graph.txt | 43 +++- 2 files changed, 22 insertions(+), 23 deletions(-) base-commit: fa03cdc39b951d1cfbfd690fe6f3ac6c57ab6a44 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-22%2Fderrickstolee%2Fmake-docs-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-22/derrickstolee/make-docs-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/22 Range-diff vs v1: 1: ef5af2ccc = 1: 4c66af626 Docs: Add commit-graph tech docs to Makefile -: - > 2: 6cf253c2a commit-graph.txt: improve formatting for asciidoc -- gitgitgadget
Re: [PATCH v2 5/8] commit-graph: not compatible with replace objects
On 8/21/2018 1:45 PM, Junio C Hamano wrote: Derrick Stolee writes: Create new method commit_graph_compatible(r) to check if a given repository r is compatible with the commit-graph feature. Fill the method with a check to see if replace-objects exist. Test this interaction succeeds, including ignoring an existing commit-graph and failing to write a new commit-graph. However, we do ensure that we write a new commit-graph by setting read_replace_refs to 0, thereby ignoring the replace refs. Signed-off-by: Derrick Stolee --- builtin/commit-graph.c | 4 commit-graph.c | 21 + replace-object.c| 2 +- replace-object.h| 2 ++ t/t5318-commit-graph.sh | 22 ++ 5 files changed, 50 insertions(+), 1 deletion(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 0bf0c48657..da737df321 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -120,6 +120,8 @@ static int graph_read(int argc, const char **argv) return 0; } +extern int read_replace_refs; + Why do you need this (and also in commit-graph.c)? I thought cache.h includes it, which you can just make use of it. You're right. I don't know how I missed that. +static int commit_graph_compatible(struct repository *r) +{ + if (read_replace_refs) { + prepare_replace_object(r); + if (hashmap_get_size(&r->objects->replace_map->map)) + return 0; + } + + return 1; +} diff --git a/replace-object.c b/replace-object.c index 3c17864eb7..9821f1477e 100644 --- a/replace-object.c +++ b/replace-object.c @@ -32,7 +32,7 @@ static int register_replace_ref(struct repository *r, return 0; } -static void prepare_replace_object(struct repository *r) +void prepare_replace_object(struct repository *r) { if (r->objects->replace_map) return; The way the new caller is written, I am wondering if this function should return "we are (or, are not) using the replace object feature", making it unnecessary for callers on the reading side to know about "read-replace-refs" external variable, for example. /* * To be called on-demand from codepaths that want to make * sure that replacement objects can be found as necessary. * * Return number of replacement defined for the repository, or * -1 when !read_replace_refs tells us not to use replacement * mechanism at all. */ int prepare_replace_object(struct repository *r) { if (!read_replace_refs) return -1; if (!r->objects->replace_map) { r->objects->replace_map = xmalloc(...); oidmap_init(r->objects->replace_map, 0); for_each_refplace_ref(r, register_...); } return hashmap_get_size(&r->objects->replace_map->map); } This is a good idea. I can incorporate it into v3. Then, the caller side can simply become something like: #define cgraph_compat(r) (prepare_replace_object(r) <= 0) With the next two patches that add more conditions to commit_graph_compatible(), I'd prefer keeping it a method instead of a macro. There are various writers to read_replace_refs variable, but I think they should first be replaced with calls to something like: void use_replace_refs(struct repository *r, int enable); which allows us to hide the global variable and later make it per repository if we wanted to. I will incorporate this into v3 as well. Thanks, -Stolee
Re: [PATCH v2 0/8] Clarify commit-graph and grafts/replace/shallow incompatibilities
On 8/21/2018 1:51 PM, Junio C Hamano wrote: Stefan Beller writes: On Mon, Aug 20, 2018 at 11:24 AM Derrick Stolee wrote: Because of the change of base, it is hard to provide a side-by-side diff from v1. I thought that was the whole point of range-diff. ;-) I thought so, too. Was there any change that confused range-diff machinery? I've just been out of the loop and didn't realize what range-diff did. Here is the range-diff: 1: 43ddcc9ef9 = 1: c8edae4179 refs.c: migrate internal ref iteration to pass thru repository argument 2: 22dc9ce836 = 2: f89451e884 refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback 3: 5e90d36f84 ! 3: d95aa3472c commit-graph: update design document @@ -19,7 +19,7 @@ so a future change of hash algorithm does not require a change in format. +- Commit grafts and replace objects can change the shape of the commit -+ history. These can also be enabled/disabled on the fly using ++ history. The latter can also be enabled/disabled on the fly using + `--no-replace-objects`. This leads to difficultly storing both possible + interpretations of a commit id, especially when computing generation + numbers. The commit-graph will not be read or written when 4: 88711a3cf4 = 4: ec89c88e14 test-repository: properly init repo 5: 7f596c1718 ! 5: 0321a3cf10 commit-graph: not compatible with replace objects @@ -6,10 +6,34 @@ repository r is compatible with the commit-graph feature. Fill the method with a check to see if replace-objects exist. Test this interaction succeeds, including ignoring an existing commit-graph and - failing to write a new commit-graph. + failing to write a new commit-graph. However, we do ensure that + we write a new commit-graph by setting read_replace_refs to 0, thereby + ignoring the replace refs. Signed-off-by: Derrick Stolee +diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c +--- a/builtin/commit-graph.c b/builtin/commit-graph.c +@@ + return 0; + } + ++extern int read_replace_refs; ++ + static int graph_write(int argc, const char **argv) + { + struct string_list *pack_indexes = NULL; +@@ + if (!opts.obj_dir) + opts.obj_dir = get_object_directory(); + ++ read_replace_refs = 0; ++ + if (opts.reachable) { + write_commit_graph_reachable(opts.obj_dir, opts.append); + return 0; + diff --git a/commit-graph.c b/commit-graph.c --- a/commit-graph.c +++ b/commit-graph.c @@ -26,11 +50,15 @@ return g; } ++extern int read_replace_refs; ++ +static int commit_graph_compatible(struct repository *r) +{ -+ prepare_replace_object(r); -+ if (hashmap_get_size(&r->objects->replace_map->map)) -+ return 0; ++ if (read_replace_refs) { ++ prepare_replace_object(r); ++ if (hashmap_get_size(&r->objects->replace_map->map)) ++ return 0; ++ } + + return 1; +} @@ -110,7 +138,7 @@ + test_cmp expect actual && + rm -rf .git/objects/info/commit-graph && + git commit-graph write --reachable && -+ test_path_is_missing .git/objects/info/commit-graph ++ test_path_is_file .git/objects/info/commit-graph + ) +' + 6: 94dd91ac35 ! 6: d18ecc0124 commit-graph: not compatible with grafts @@ -7,24 +7,25 @@ situations we ignore existing commit-graph files and we do not write new commit-graph files. + Helped-by: Jakub Narebski Signed-off-by: Derrick Stolee diff --git a/commit-graph.c b/commit-graph.c --- a/commit-graph.c +++ b/commit-graph.c @@ + return 0; + } - static int commit_graph_compatible(struct repository *r) - { + prepare_commit_graft(r); + if (r->parsed_objects && r->parsed_objects->grafts_nr) + return 0; + if (is_repository_shallow(r)) + return 0; + - prepare_replace_object(r); - if (hashmap_get_size(&r->objects->replace_map->map)) - return 0; + return 1; + } + diff --git a/commit.c b/commit.c --- a/commit.c @@ -66,7 +67,9 @@ + cd graft && + git commit-graph write --reachable && + test_path_is_file .git/objects/info/commit-graph && -+ git replace --graft HEAD~1 HEAD~3 && ++ H1=$(git rev-parse --verify HEAD~1) && ++ H3=$(git rev-parse --verify HEAD~3) && ++ echo "$H1 $H3" >.git/info/grafts && + g
Re: [PATCH 0/9] multi-pack-index cleanups
On 8/21/2018 10:34 AM, Duy Nguyen wrote: On Mon, Aug 20, 2018 at 6:53 PM Derrick Stolee wrote: To better understand the implications of the multi-pack-index with other scenarios, I ran the test suite after adding a step to 'git repack' to write a multi-pack-index, and to default core.multiPackIndex to 'true'. This commit is available as [1]. ... [1] https://github.com/derrickstolee/git/commit/098dd1d515b592fb165a276241d7d68d1cde0036 DO-NOT-MERGE: compute multi-pack-index on repack. It should be able to _merge_ this patch, but only activate it when some test environment variable is set. On Travis x86 we run the test suite twice, once normal, and once with special features activated (see "if test "$jobname" = "linux-gcc"" in ci/run-build-and-tests.sh). I don't think midx is incompatible with any of those features, so we could activate and run the whole test suite with it too. Interesting. I'll look into this. A similar approach may work for the commit-graph. Thanks, -Stolee
Re: [PATCH 7/9] treewide: use get_all_packs
On 8/20/2018 6:01 PM, Stefan Beller wrote: On Mon, Aug 20, 2018 at 9:52 AM Derrick Stolee wrote: There are many places in the codebase that want to iterate over all packfiles known to Git. The purposes are wide-ranging, and those that can take advantage of the multi-pack-index already do. So, use get_all_packs() instead of get_packed_git() to be sure we are iterating over all packfiles. So get_packed_git shouold not be used any further? Do we want to annotate it as deprecated, to be deleted in the future? Or even remove it as part of this series (that might be an issue with other series in flight). Perhaps the right thing to do would be to put a big warning over the definition to say "only use this if you really don't want get_all_packs()" or something. The reason I didn't remove it from packfile.h is that it is used in sha1-name.c alongside get_multi_pack_index() (so making it 'static' would be incorrect).
Re: [PATCH 3/9] midx: mark bad packed objects
On 8/20/2018 5:23 PM, Stefan Beller wrote: On Mon, Aug 20, 2018 at 9:52 AM Derrick Stolee wrote: When an object fails to decompress from a pack-file, we mark the object as 'bad' so we can retry with a different copy of the object (if such a copy exists). Before now, the multi-pack-index did not update the bad objects list for the pack-files it contains, and we did not check the bad objects list when reading an object. Now, do both. This sounds like a bug fix unlike patches 1 & 2 that sound like feature work(2) or making code more readable(1). (After studying the code, this doesn't sound like a bug fix any more, but a safety thing) It is a safety thing. One codepath that needs this includes this comment: /* * We're probably in deep shit, but let's try to fetch * the required base anyway from another pack or loose. * This is costly but should happen only in the presence * of a corrupted pack, and is better than failing outright. */ This goes back to 8eca0b47: implement some resilience against pack corruptions. The logic is tested in t5303-pack-corruption-resilience.sh, with the test title '... and a redundant pack allows for full recovery too'. Is it worth having this on a separate track coming in faster than the rest of this series? ds/multi-pack-index is cooking in 'next' until after 2.19.0, so I'm not sure it's worth splitting things up at this point. Signed-off-by: Derrick Stolee --- midx.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/midx.c b/midx.c index 6824acf5f8..7fa75a37a3 100644 --- a/midx.c +++ b/midx.c @@ -280,6 +280,16 @@ static int nth_midxed_pack_entry(struct multi_pack_index *m, struct pack_entry * if (!is_pack_valid(p)) return 0; + if (p->num_bad_objects) { + uint32_t i; Is there a reason that i needs to be if 32 bits? Would size_t (for iterating) or 'int' (as a default like in many for loops) be ok, too? i is bounded by p->num_bad_objects, which is also uint32_t. Thanks, -Stolee
[PATCH 1/1] Docs: Add commit-graph tech docs to Makefile
From: Derrick Stolee Ensure that the commit-graph.txt and commit-graph-format.txt files are compiled to HTML using ASCIIDOC. Signed-off-by: Derrick Stolee --- Documentation/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/Makefile b/Documentation/Makefile index d079d7c73..841e4f705 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -69,6 +69,8 @@ API_DOCS = $(patsubst %.txt,%,$(filter-out technical/api-index-skel.txt technica SP_ARTICLES += $(API_DOCS) TECH_DOCS += SubmittingPatches +TECH_DOCS += technical/commit-graph +TECH_DOCS += technical/commit-graph-format TECH_DOCS += technical/hash-function-transition TECH_DOCS += technical/http-protocol TECH_DOCS += technical/index-format -- gitgitgadget
[PATCH 0/1] Docs: Add commit-graph tech docs to Makefile
Similar to [1], add the commit-graph and commit-graph-format technical docs to Documentation/Makefile so they are automatically converted to HTML when needed. I compiled the docs and inspected the HTML manually in the browser. Nothing looked strange, so I don't think the docs themselves need any editing for format. [1] https://public-inbox.org/git/20180814222846.gg142...@aiede.svl.corp.google.com/ [PATCH] partial-clone: render design doc using asciidoc Derrick Stolee (1): Docs: Add commit-graph tech docs to Makefile Documentation/Makefile | 2 ++ 1 file changed, 2 insertions(+) base-commit: 53f9a3e157dbbc901a02ac2c73346d375e24978c Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-22%2Fderrickstolee%2Fmake-docs-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-22/derrickstolee/make-docs-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/22 -- gitgitgadget
Re: [PATCH v2 0/8] Clarify commit-graph and grafts/replace/shallow incompatibilities
On 8/20/2018 3:37 PM, Stefan Beller wrote: On Mon, Aug 20, 2018 at 11:24 AM Derrick Stolee wrote: One unresolved issue with the commit-graph feature is that it can cause issues when combined with replace objects, commit grafts, or shallow clones. These are not 100% incompatible, as one could be reasonably successful writing a commit-graph after replacing some objects and not have issues. The problems happen when commits that are already in the commit-graph file are replaced, or when git is run with the `--no-replace-objects` option; this can cause incorrect parents or incorrect generation numbers. Similar things occur with commit grafts and shallow clones, especially when running `git fetch --unshallow` in a shallow repo. Instead of trying (and probably failing) to make these features work together, default to making the commit-graph feature unavailable in these situations. Create a new method 'commit_graph_compatible(r)' that checks if the repository 'r' has any of these features enabled. CHANGES IN V2: * The first two commits regarding the ref iterators are unchanged, despite a lot of discussion on the subject [1]. * I included Peff's changes in jk/core-use-replace-refs, changing the base commit for the series to 1689c22c1c328e9135ed51458e9f9a5d224c5057 (the merge that brought that topic into 'msater'). * I fixed the tests for the interactions with the graft feature. Because of the change of base, it is hard to provide a side-by-side diff from v1. Thanks, -Stolee [1] https://public-inbox.org/git/CAGZ79kZ3PzqpGzXWcmxjzi98gA+LT2MBOf8KaA89hOa-Qig=o...@mail.gmail.com/ Stefan's response recommending we keep the first two commits. After reviewing my own patches, I flipped again (Sorry!) and would rather not see my patches be merged, but the very original solution by you, that you proposed at [1]. That said, I will not insist on it, and if this is merged as is, we can fix it up later. With that said, I just read through the remaining patches, I think they are a valuable addition to Git and could be merged as-is. [1] https://github.com/gitgitgadget/git/pull/11/commits/300db80140dacc927db0d46c804ca0ef4dcc1be1 Thanks, Stefan Just to keep things on-list as possible, here is the patch for the commit linked above. It would replace patches 1 & 2 from this series. Junio: I can send a v3 that includes this commit if you need it, or if there are other edits to be made in this series. commit 300db80140dacc927db0d46c804ca0ef4dcc1be1 Author: Derrick Stolee Date: Fri Jul 20 15:39:06 2018 -0400 replace-objects: use arbitrary repositories This is the smallest possible change that makes prepare_replace_objects work properly with arbitrary repositories. By supplying the repository as the cb_data, we do not need to modify any code in the ref iterator logic. We will likely want to do a full replacement of the ref iterator logic to provide a repository struct as a concrete parameter. Signed-off-by: Derrick Stolee diff --git a/replace-object.c b/replace-object.c index 801b5c1678..e99fcd1ff6 100644 --- a/replace-object.c +++ b/replace-object.c @@ -14,6 +14,7 @@ static int register_replace_ref(const char *refname, const char *slash = strrchr(refname, '/'); const char *hash = slash ? slash + 1 : refname; struct replace_object *repl_obj = xmalloc(sizeof(*repl_obj)); + struct repository *r = (struct repository *)cb_data; if (get_oid_hex(hash, &repl_obj->original.oid)) { free(repl_obj); @@ -25,7 +26,7 @@ static int register_replace_ref(const char *refname, oidcpy(&repl_obj->replacement, oid); /* Register new object */ - if (oidmap_put(the_repository->objects->replace_map, repl_obj)) + if (oidmap_put(r->objects->replace_map, repl_obj)) die("duplicate replace ref: %s", refname); return 0; @@ -40,7 +41,7 @@ static void prepare_replace_object(struct repository *r) xmalloc(sizeof(*r->objects->replace_map)); oidmap_init(r->objects->replace_map, 0); - for_each_replace_ref(r, register_replace_ref, NULL); + for_each_replace_ref(r, register_replace_ref, r); } /* We allow "recursive" replacement. Only within reason, though */
[PATCH v2 1/8] refs.c: migrate internal ref iteration to pass thru repository argument
From: Stefan Beller In 60ce76d3581 (refs: add repository argument to for_each_replace_ref, 2018-04-11) and 0d296c57aec (refs: allow for_each_replace_ref to handle arbitrary repositories, 2018-04-11), for_each_replace_ref learned how to iterate over refs by a given arbitrary repository. New attempts in the object store conversion have shown that it is useful to have the repository handle available that the refs iteration is currently iterating over. To achieve this goal we will need to add a repository argument to each_ref_fn in refs.h. However as many callers rely on the signature such a patch would be too large. So convert the internals of the ref subsystem first to pass through a repository argument without exposing the change to the user. Assume the_repository for the passed through repository, although it is not used anywhere yet. Signed-off-by: Stefan Beller Signed-off-by: Derrick Stolee --- refs.c | 39 +-- refs.h | 10 ++ refs/iterator.c | 6 +++--- refs/refs-internal.h | 5 +++-- 4 files changed, 53 insertions(+), 7 deletions(-) diff --git a/refs.c b/refs.c index 457fb78057..7cd76f72d2 100644 --- a/refs.c +++ b/refs.c @@ -1386,17 +1386,50 @@ struct ref_iterator *refs_ref_iterator_begin( * non-zero value, stop the iteration and return that value; * otherwise, return 0. */ +static int do_for_each_repo_ref(struct repository *r, const char *prefix, + each_repo_ref_fn fn, int trim, int flags, + void *cb_data) +{ + struct ref_iterator *iter; + struct ref_store *refs = get_main_ref_store(r); + + if (!refs) + return 0; + + iter = refs_ref_iterator_begin(refs, prefix, trim, flags); + + return do_for_each_repo_ref_iterator(r, iter, fn, cb_data); +} + +struct do_for_each_ref_help { + each_ref_fn *fn; + void *cb_data; +}; + +static int do_for_each_ref_helper(struct repository *r, + const char *refname, + const struct object_id *oid, + int flags, + void *cb_data) +{ + struct do_for_each_ref_help *hp = cb_data; + + return hp->fn(refname, oid, flags, hp->cb_data); +} + static int do_for_each_ref(struct ref_store *refs, const char *prefix, each_ref_fn fn, int trim, int flags, void *cb_data) { struct ref_iterator *iter; + struct do_for_each_ref_help hp = { fn, cb_data }; if (!refs) return 0; iter = refs_ref_iterator_begin(refs, prefix, trim, flags); - return do_for_each_ref_iterator(iter, fn, cb_data); + return do_for_each_repo_ref_iterator(the_repository, iter, + do_for_each_ref_helper, &hp); } int refs_for_each_ref(struct ref_store *refs, each_ref_fn fn, void *cb_data) @@ -2025,10 +2058,12 @@ int refs_verify_refname_available(struct ref_store *refs, int refs_for_each_reflog(struct ref_store *refs, each_ref_fn fn, void *cb_data) { struct ref_iterator *iter; + struct do_for_each_ref_help hp = { fn, cb_data }; iter = refs->be->reflog_iterator_begin(refs); - return do_for_each_ref_iterator(iter, fn, cb_data); + return do_for_each_repo_ref_iterator(the_repository, iter, +do_for_each_ref_helper, &hp); } int for_each_reflog(each_ref_fn fn, void *cb_data) diff --git a/refs.h b/refs.h index cc2fb4c68c..80eec8bbc6 100644 --- a/refs.h +++ b/refs.h @@ -274,6 +274,16 @@ struct ref_transaction; typedef int each_ref_fn(const char *refname, const struct object_id *oid, int flags, void *cb_data); +/* + * The same as each_ref_fn, but also with a repository argument that + * contains the repository associated with the callback. + */ +typedef int each_repo_ref_fn(struct repository *r, +const char *refname, +const struct object_id *oid, +int flags, +void *cb_data); + /* * The following functions invoke the specified callback function for * each reference indicated. If the function ever returns a nonzero diff --git a/refs/iterator.c b/refs/iterator.c index 2ac91ac340..629e00a122 100644 --- a/refs/iterator.c +++ b/refs/iterator.c @@ -407,15 +407,15 @@ struct ref_iterator *prefix_ref_iterator_begin(struct ref_iterator *iter0, struct ref_iterator *current_ref_iter = NULL; -int do_for_each_ref_iterator(struct ref_iterator *iter, -each_ref_fn fn, void *cb_data) +int do_for_each_repo_ref_iterator(struct repository *r, struct ref_iterator *iter, + each_repo_ref_fn fn, void *cb_data) { int retval = 0, ok; struct ref_it
[PATCH v2 8/8] commit-graph: close_commit_graph before shallow walk
Call close_commit_graph() when about to start a rev-list walk that includes shallow commits. This is necessary in code paths that "fake" shallow commits for the sake of fetch. Specifically, test 351 in t5500-fetch-pack.sh runs git fetch --shallow-exclude one origin with a file-based transfer. When the "remote" has a commit-graph, we do not prevent the commit-graph from being loaded, but then the commits are intended to be dynamically transferred into shallow commits during get_shallow_commits_by_rev_list(). By closing the commit-graph before this call, we prevent this interaction. Signed-off-by: Derrick Stolee --- commit-graph.c | 8 commit-graph.h | 1 + upload-pack.c | 2 ++ 3 files changed, 7 insertions(+), 4 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index cee2caab5c..4bd1a4abbf 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -260,10 +260,10 @@ static int prepare_commit_graph(struct repository *r) return !!r->objects->commit_graph; } -static void close_commit_graph(void) +void close_commit_graph(struct repository *r) { - free_commit_graph(the_repository->objects->commit_graph); - the_repository->objects->commit_graph = NULL; + free_commit_graph(r->objects->commit_graph); + r->objects->commit_graph = NULL; } static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos) @@ -875,7 +875,7 @@ void write_commit_graph(const char *obj_dir, write_graph_chunk_data(f, GRAPH_OID_LEN, commits.list, commits.nr); write_graph_chunk_large_edges(f, commits.list, commits.nr); - close_commit_graph(); + close_commit_graph(the_repository); finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC); commit_lock_file(&lk); diff --git a/commit-graph.h b/commit-graph.h index 76e098934a..13d736cdde 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -59,6 +59,7 @@ void write_commit_graph(const char *obj_dir, int verify_commit_graph(struct repository *r, struct commit_graph *g); +void close_commit_graph(struct repository *); void free_commit_graph(struct commit_graph *); #endif diff --git a/upload-pack.c b/upload-pack.c index 82b393ec31..2ae9d9bb47 100644 --- a/upload-pack.c +++ b/upload-pack.c @@ -24,6 +24,7 @@ #include "quote.h" #include "upload-pack.h" #include "serve.h" +#include "commit-graph.h" /* Remember to update object flag allocation in object.h */ #define THEY_HAVE (1u << 11) @@ -740,6 +741,7 @@ static void deepen_by_rev_list(int ac, const char **av, { struct commit_list *result; + close_commit_graph(the_repository); result = get_shallow_commits_by_rev_list(ac, av, SHALLOW, NOT_SHALLOW); send_shallow(result); free_commit_list(result); -- 2.18.0.118.gd4f65b8d14
[PATCH v2 2/8] refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback
From: Stefan Beller Signed-off-by: Stefan Beller Signed-off-by: Derrick Stolee --- builtin/replace.c | 8 refs.c| 9 - refs.h| 2 +- replace-object.c | 5 +++-- 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/builtin/replace.c b/builtin/replace.c index e997a562f0..b5861a0ee9 100644 --- a/builtin/replace.c +++ b/builtin/replace.c @@ -39,7 +39,8 @@ struct show_data { enum replace_format format; }; -static int show_reference(const char *refname, const struct object_id *oid, +static int show_reference(struct repository *r, const char *refname, + const struct object_id *oid, int flag, void *cb_data) { struct show_data *data = cb_data; @@ -56,9 +57,8 @@ static int show_reference(const char *refname, const struct object_id *oid, if (get_oid(refname, &object)) return error("Failed to resolve '%s' as a valid ref.", refname); - obj_type = oid_object_info(the_repository, &object, - NULL); - repl_type = oid_object_info(the_repository, oid, NULL); + obj_type = oid_object_info(r, &object, NULL); + repl_type = oid_object_info(r, oid, NULL); printf("%s (%s) -> %s (%s)\n", refname, type_name(obj_type), oid_to_hex(oid), type_name(repl_type)); diff --git a/refs.c b/refs.c index 7cd76f72d2..c5a5f727e8 100644 --- a/refs.c +++ b/refs.c @@ -1474,12 +1474,11 @@ int refs_for_each_fullref_in(struct ref_store *refs, const char *prefix, return do_for_each_ref(refs, prefix, fn, 0, flag, cb_data); } -int for_each_replace_ref(struct repository *r, each_ref_fn fn, void *cb_data) +int for_each_replace_ref(struct repository *r, each_repo_ref_fn fn, void *cb_data) { - return do_for_each_ref(get_main_ref_store(r), - git_replace_ref_base, fn, - strlen(git_replace_ref_base), - DO_FOR_EACH_INCLUDE_BROKEN, cb_data); + return do_for_each_repo_ref(r, git_replace_ref_base, fn, + strlen(git_replace_ref_base), + DO_FOR_EACH_INCLUDE_BROKEN, cb_data); } int for_each_namespaced_ref(each_ref_fn fn, void *cb_data) diff --git a/refs.h b/refs.h index 80eec8bbc6..a0a18223a1 100644 --- a/refs.h +++ b/refs.h @@ -317,7 +317,7 @@ int for_each_fullref_in(const char *prefix, each_ref_fn fn, void *cb_data, int for_each_tag_ref(each_ref_fn fn, void *cb_data); int for_each_branch_ref(each_ref_fn fn, void *cb_data); int for_each_remote_ref(each_ref_fn fn, void *cb_data); -int for_each_replace_ref(struct repository *r, each_ref_fn fn, void *cb_data); +int for_each_replace_ref(struct repository *r, each_repo_ref_fn fn, void *cb_data); int for_each_glob_ref(each_ref_fn fn, const char *pattern, void *cb_data); int for_each_glob_ref_in(each_ref_fn fn, const char *pattern, const char *prefix, void *cb_data); diff --git a/replace-object.c b/replace-object.c index 4162df6324..3c17864eb7 100644 --- a/replace-object.c +++ b/replace-object.c @@ -6,7 +6,8 @@ #include "repository.h" #include "commit.h" -static int register_replace_ref(const char *refname, +static int register_replace_ref(struct repository *r, + const char *refname, const struct object_id *oid, int flag, void *cb_data) { @@ -25,7 +26,7 @@ static int register_replace_ref(const char *refname, oidcpy(&repl_obj->replacement, oid); /* Register new object */ - if (oidmap_put(the_repository->objects->replace_map, repl_obj)) + if (oidmap_put(r->objects->replace_map, repl_obj)) die("duplicate replace ref: %s", refname); return 0; -- 2.18.0.118.gd4f65b8d14
[PATCH v2 5/8] commit-graph: not compatible with replace objects
Create new method commit_graph_compatible(r) to check if a given repository r is compatible with the commit-graph feature. Fill the method with a check to see if replace-objects exist. Test this interaction succeeds, including ignoring an existing commit-graph and failing to write a new commit-graph. However, we do ensure that we write a new commit-graph by setting read_replace_refs to 0, thereby ignoring the replace refs. Signed-off-by: Derrick Stolee --- builtin/commit-graph.c | 4 commit-graph.c | 21 + replace-object.c| 2 +- replace-object.h| 2 ++ t/t5318-commit-graph.sh | 22 ++ 5 files changed, 50 insertions(+), 1 deletion(-) diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c index 0bf0c48657..da737df321 100644 --- a/builtin/commit-graph.c +++ b/builtin/commit-graph.c @@ -120,6 +120,8 @@ static int graph_read(int argc, const char **argv) return 0; } +extern int read_replace_refs; + static int graph_write(int argc, const char **argv) { struct string_list *pack_indexes = NULL; @@ -150,6 +152,8 @@ static int graph_write(int argc, const char **argv) if (!opts.obj_dir) opts.obj_dir = get_object_directory(); + read_replace_refs = 0; + if (opts.reachable) { write_commit_graph_reachable(opts.obj_dir, opts.append); return 0; diff --git a/commit-graph.c b/commit-graph.c index b0a55ad128..2c01fa433f 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -13,6 +13,8 @@ #include "commit-graph.h" #include "object-store.h" #include "alloc.h" +#include "hashmap.h" +#include "replace-object.h" #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ @@ -56,6 +58,19 @@ static struct commit_graph *alloc_commit_graph(void) return g; } +extern int read_replace_refs; + +static int commit_graph_compatible(struct repository *r) +{ + if (read_replace_refs) { + prepare_replace_object(r); + if (hashmap_get_size(&r->objects->replace_map->map)) + return 0; + } + + return 1; +} + struct commit_graph *load_commit_graph_one(const char *graph_file) { void *graph_map; @@ -223,6 +238,9 @@ static int prepare_commit_graph(struct repository *r) */ return 0; + if (!commit_graph_compatible(r)) + return 0; + obj_dir = r->objects->objectdir; prepare_commit_graph_one(r, obj_dir); prepare_alt_odb(r); @@ -693,6 +711,9 @@ void write_commit_graph(const char *obj_dir, int num_extra_edges; struct commit_list *parent; + if (!commit_graph_compatible(the_repository)) + return; + oids.nr = 0; oids.alloc = approximate_object_count() / 4; diff --git a/replace-object.c b/replace-object.c index 3c17864eb7..9821f1477e 100644 --- a/replace-object.c +++ b/replace-object.c @@ -32,7 +32,7 @@ static int register_replace_ref(struct repository *r, return 0; } -static void prepare_replace_object(struct repository *r) +void prepare_replace_object(struct repository *r) { if (r->objects->replace_map) return; diff --git a/replace-object.h b/replace-object.h index 9345e105dd..16528df942 100644 --- a/replace-object.h +++ b/replace-object.h @@ -10,6 +10,8 @@ struct replace_object { struct object_id replacement; }; +void prepare_replace_object(struct repository *r); + /* * This internal function is only declared here for the benefit of * lookup_replace_object(). Please do not call it directly. diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index 4f17d7701e..e0c3c60f66 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -259,6 +259,28 @@ test_expect_success 'check that gc computes commit-graph' ' test_cmp commit-graph-after-gc $objdir/info/commit-graph ' +test_expect_success 'replace-objects invalidates commit-graph' ' + cd "$TRASH_DIRECTORY" && + test_when_finished rm -rf replace && + git clone full replace && + ( + cd replace && + git commit-graph write --reachable && + test_path_is_file .git/objects/info/commit-graph && + git replace HEAD~1 HEAD~2 && + git -c core.commitGraph=false log >expect && + git -c core.commitGraph=true log >actual && + test_cmp expect actual && + git commit-graph write --reachable && + git -c core.commitGraph=false --no-replace-objects log >expect && + git -c core.commitGraph=true --no-rep
[PATCH v2 3/8] commit-graph: update design document
As it exists right now, the commit-graph feature may provide inconsistent results when combined with commit grafts, replace objects, and shallow clones. Update the design document to discuss why these interactions are difficult to reconcile and how we will avoid errors by preventing updates to and reads from the commit-graph file when these other features exist. Signed-off-by: Derrick Stolee --- Documentation/technical/commit-graph.txt | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index c664acbd76..001395e950 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -112,12 +112,24 @@ Design Details - The file format includes parameters for the object ID hash function, so a future change of hash algorithm does not require a change in format. +- Commit grafts and replace objects can change the shape of the commit + history. The latter can also be enabled/disabled on the fly using + `--no-replace-objects`. This leads to difficultly storing both possible + interpretations of a commit id, especially when computing generation + numbers. The commit-graph will not be read or written when + replace-objects or grafts are present. + +- Shallow clones create grafts of commits by dropping their parents. This + leads the commit-graph to think those commits have generation number 1. + If and when those commits are made unshallow, those generation numbers + become invalid. Since shallow clones are intended to restrict the commit + history to a very small set of commits, the commit-graph feature is less + helpful for these clones, anyway. The commit-graph will not be read or + written when shallow commits are present. + Future Work --- -- The commit graph feature currently does not honor commit grafts. This can - be remedied by duplicating or refactoring the current graft logic. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered -- 2.18.0.118.gd4f65b8d14
[PATCH v2 4/8] test-repository: properly init repo
Signed-off-by: Derrick Stolee --- t/helper/test-repository.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/t/helper/test-repository.c b/t/helper/test-repository.c index 2762ca6562..6a84a53efb 100644 --- a/t/helper/test-repository.c +++ b/t/helper/test-repository.c @@ -15,7 +15,10 @@ static void test_parse_commit_in_graph(const char *gitdir, const char *worktree, struct commit *c; struct commit_list *parent; - repo_init(&r, gitdir, worktree); + setup_git_env(gitdir); + + if (repo_init(&r, gitdir, worktree)) + die("Couldn't init repo"); c = lookup_commit(&r, commit_oid); @@ -38,7 +41,10 @@ static void test_get_commit_tree_in_graph(const char *gitdir, struct commit *c; struct tree *tree; - repo_init(&r, gitdir, worktree); + setup_git_env(gitdir); + + if (repo_init(&r, gitdir, worktree)) + die("Couldn't init repo"); c = lookup_commit(&r, commit_oid); -- 2.18.0.118.gd4f65b8d14
[PATCH v2 7/8] commit-graph: not compatible with uninitialized repo
Signed-off-by: Derrick Stolee --- commit-graph.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index c4eaedf4e5..cee2caab5c 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -62,6 +62,9 @@ extern int read_replace_refs; static int commit_graph_compatible(struct repository *r) { + if (!r->gitdir) + return 0; + if (read_replace_refs) { prepare_replace_object(r); if (hashmap_get_size(&r->objects->replace_map->map)) -- 2.18.0.118.gd4f65b8d14
[PATCH v2 6/8] commit-graph: not compatible with grafts
Augment commit_graph_compatible(r) to return false when the given repository r has commit grafts or is a shallow clone. Test that in these situations we ignore existing commit-graph files and we do not write new commit-graph files. Helped-by: Jakub Narebski Signed-off-by: Derrick Stolee --- commit-graph.c | 6 ++ commit.c| 2 +- commit.h| 1 + t/t5318-commit-graph.sh | 38 ++ 4 files changed, 46 insertions(+), 1 deletion(-) diff --git a/commit-graph.c b/commit-graph.c index 2c01fa433f..c4eaedf4e5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -68,6 +68,12 @@ static int commit_graph_compatible(struct repository *r) return 0; } + prepare_commit_graft(r); + if (r->parsed_objects && r->parsed_objects->grafts_nr) + return 0; + if (is_repository_shallow(r)) + return 0; + return 1; } diff --git a/commit.c b/commit.c index 30d1af2b20..ef9a2cbb23 100644 --- a/commit.c +++ b/commit.c @@ -209,7 +209,7 @@ static int read_graft_file(struct repository *r, const char *graft_file) return 0; } -static void prepare_commit_graft(struct repository *r) +void prepare_commit_graft(struct repository *r) { char *graft_file; diff --git a/commit.h b/commit.h index da0db36eba..5459e279fe 100644 --- a/commit.h +++ b/commit.h @@ -202,6 +202,7 @@ typedef int (*each_commit_graft_fn)(const struct commit_graft *, void *); struct commit_graft *read_graft_line(struct strbuf *line); int register_commit_graft(struct repository *r, struct commit_graft *, int); +void prepare_commit_graft(struct repository *r); struct commit_graft *lookup_commit_graft(struct repository *r, const struct object_id *oid); extern struct commit_list *get_merge_bases(struct commit *rev1, struct commit *rev2); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index e0c3c60f66..6aee861f78 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -281,6 +281,44 @@ test_expect_success 'replace-objects invalidates commit-graph' ' ) ' +test_expect_success 'commit grafts invalidate commit-graph' ' + cd "$TRASH_DIRECTORY" && + test_when_finished rm -rf graft && + git clone full graft && + ( + cd graft && + git commit-graph write --reachable && + test_path_is_file .git/objects/info/commit-graph && + H1=$(git rev-parse --verify HEAD~1) && + H3=$(git rev-parse --verify HEAD~3) && + echo "$H1 $H3" >.git/info/grafts && + git -c core.commitGraph=false log >expect && + git -c core.commitGraph=true log >actual && + test_cmp expect actual && + git commit-graph write --reachable && + git -c core.commitGraph=false --no-replace-objects log >expect && + git -c core.commitGraph=true --no-replace-objects log >actual && + test_cmp expect actual && + rm -rf .git/objects/info/commit-graph && + git commit-graph write --reachable && + test_path_is_missing .git/objects/info/commit-graph + ) +' + +test_expect_success 'replace-objects invalidates commit-graph' ' + cd "$TRASH_DIRECTORY" && + test_when_finished rm -rf shallow && + git clone --depth 2 "file://$TRASH_DIRECTORY/full" shallow && + ( + cd shallow && + git commit-graph write --reachable && + test_path_is_missing .git/objects/info/commit-graph && + git fetch origin --unshallow && + git commit-graph write --reachable && + test_path_is_file .git/objects/info/commit-graph + ) +' + # the verify tests below expect the commit-graph to contain # exactly the commits reachable from the commits/8 branch. # If the file changes the set of commits in the list, then the -- 2.18.0.118.gd4f65b8d14
[PATCH v2 0/8] Clarify commit-graph and grafts/replace/shallow incompatibilities
One unresolved issue with the commit-graph feature is that it can cause issues when combined with replace objects, commit grafts, or shallow clones. These are not 100% incompatible, as one could be reasonably successful writing a commit-graph after replacing some objects and not have issues. The problems happen when commits that are already in the commit-graph file are replaced, or when git is run with the `--no-replace-objects` option; this can cause incorrect parents or incorrect generation numbers. Similar things occur with commit grafts and shallow clones, especially when running `git fetch --unshallow` in a shallow repo. Instead of trying (and probably failing) to make these features work together, default to making the commit-graph feature unavailable in these situations. Create a new method 'commit_graph_compatible(r)' that checks if the repository 'r' has any of these features enabled. CHANGES IN V2: * The first two commits regarding the ref iterators are unchanged, despite a lot of discussion on the subject [1]. * I included Peff's changes in jk/core-use-replace-refs, changing the base commit for the series to 1689c22c1c328e9135ed51458e9f9a5d224c5057 (the merge that brought that topic into 'msater'). * I fixed the tests for the interactions with the graft feature. Because of the change of base, it is hard to provide a side-by-side diff from v1. Thanks, -Stolee [1] https://public-inbox.org/git/CAGZ79kZ3PzqpGzXWcmxjzi98gA+LT2MBOf8KaA89hOa-Qig=o...@mail.gmail.com/ Stefan's response recommending we keep the first two commits. Derrick Stolee (6): commit-graph: update design document test-repository: properly init repo commit-graph: not compatible with replace objects commit-graph: not compatible with grafts commit-graph: not compatible with uninitialized repo commit-graph: close_commit_graph before shallow walk Stefan Beller (2): refs.c: migrate internal ref iteration to pass thru repository argument refs.c: upgrade for_each_replace_ref to be a each_repo_ref_fn callback Documentation/technical/commit-graph.txt | 18 +-- builtin/commit-graph.c | 4 ++ builtin/replace.c| 8 ++-- commit-graph.c | 38 +-- commit-graph.h | 1 + commit.c | 2 +- commit.h | 1 + refs.c | 48 --- refs.h | 12 - refs/iterator.c | 6 +-- refs/refs-internal.h | 5 +- replace-object.c | 7 +-- replace-object.h | 2 + t/helper/test-repository.c | 10 +++- t/t5318-commit-graph.sh | 60 upload-pack.c| 2 + 16 files changed, 194 insertions(+), 30 deletions(-) base-commit: 1689c22c1c328e9135ed51458e9f9a5d224c5057 -- 2.18.0.118.gd4f65b8d14
Re: What's cooking in git.git (Aug 2018, #04; Fri, 17)
On 8/20/2018 1:26 PM, Junio C Hamano wrote: Jonathan Nieder writes: Usually, I refrain from merging larger topics in 'next' down to 'master' when we get close to -rc0, but I am wondering if it is better to merge all of them to 'master', even the ones on the larger and possibly undercooked side, expecting that we collectively spend effort on hunting and fixing bugs in them during the pre-release freeze period. If we were to go that route, I'd want everybody's buy-in and I'll promise to ignore any shiny new toys that appear on list that are not regression fixes to topics merged to 'master' since the end of the previous cycle to make sure people are not distracted. Based on what I see in 'next' (midx, range-diff, etc), I quite like this idea. The comment above was about the ones that was marked as "Will merge to 'master'" that are in 'next', not the ones marked as "Will cook in 'next'", like the midx ones. I am not worried about range-diff, as I do not think it came close enough to the core-ish code to break other things; the potential damage is fairly isolated and the worst would be that we'd realize that we shipped with a new command that was buggy after the release, which is not the end of the world. 'midx' and 'reachable' are quite different stories, as bugs in them would in the worst case lead to an immediate and unrecoverable repository corruption. So I am still on the fence, even though I am leaning toward merging them down to 'master'. I'm happy to wait until 2.19 is cut for ds/multi-pack-index to merge down. I sent a new series today [1] that improves the series and corrects some bugs I found while making the full Git test suite work when writing a multi-pack-index on every repack. Nothing was "unrecoverable repository corruption," but I understand the concern there. Making the safest decision is probably the best decision. I'm happy to contribute to a "pre-2.19 bug bash" after you merge things into master. Thanks, -Stolee [1] https://public-inbox.org/git/20180820165124.152146-1-dsto...@microsoft.com/T/#u [PATCH 0/9] multi-pack-index cleanups
[PATCH 8/9] midx: test a few commands that use get_all_packs
The new get_all_packs() method exposed the packfiles coverede by a multi-pack-index. Before, the 'git cat-file --batch' and 'git count-objects' commands would skip objects in an environment with a multi-pack-index. Further, a reachability bitmap would be ignored if its pack-file was covered by a multi-pack-index. Signed-off-by: Derrick Stolee --- t/t5319-multi-pack-index.sh | 21 ++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 4b6e2825a6..424d0c640f 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -86,8 +86,14 @@ test_expect_success 'write midx with one v1 pack' ' ' midx_git_two_modes () { - git -c core.multiPackIndex=false $1 >expect && - git -c core.multiPackIndex=true $1 >actual && + if [ "$2" = "sorted" ] + then + git -c core.multiPackIndex=false $1 | sort >expect && + git -c core.multiPackIndex=true $1 | sort >actual + else + git -c core.multiPackIndex=false $1 >expect && + git -c core.multiPackIndex=true $1 >actual + fi && test_cmp expect actual } @@ -95,7 +101,10 @@ compare_results_with_midx () { MSG=$1 test_expect_success "check normal git operations: $MSG" ' midx_git_two_modes "rev-list --objects --all" && - midx_git_two_modes "log --raw" + midx_git_two_modes "log --raw" && + midx_git_two_modes "count-objects --verbose" && + midx_git_two_modes "cat-file --batch-all-objects --buffer --batch-check" && + midx_git_two_modes "cat-file --batch-all-objects --buffer --batch-check --unsorted" sorted ' } @@ -149,6 +158,12 @@ test_expect_success 'repack removes multi-pack-index' ' compare_results_with_midx "after repack" +test_expect_success 'multi-pack-index and pack-bitmap' ' + git -c repack.writeBitmaps=true repack -ad && + git multi-pack-index write && + git rev-list --test-bitmap HEAD +' + test_expect_success 'multi-pack-index and alternates' ' git init --bare alt.git && echo $(pwd)/alt.git/objects >.git/objects/info/alternates && -- 2.18.0.118.gd4f65b8d14
[PATCH 9/9] pack-objects: consider packs in multi-pack-index
When running 'git pack-objects --local', we want to avoid packing objects that are in an alternate. Currently, we check for these objects using the packed_git_mru list, which excludes the pack-files covered by a multi-pack-index. Add a new iteration over the multi-pack-indexes to find these copies and mark them as unwanted. Signed-off-by: Derrick Stolee --- builtin/pack-objects.c | 28 t/t5319-multi-pack-index.sh | 8 +++- 2 files changed, 35 insertions(+), 1 deletion(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 1a896d7810..4a9a42d29a 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -31,6 +31,7 @@ #include "packfile.h" #include "object-store.h" #include "dir.h" +#include "midx.h" #define IN_PACK(obj) oe_in_pack(&to_pack, obj) #define SIZE(obj) oe_size(&to_pack, obj) @@ -1034,6 +1035,7 @@ static int want_object_in_pack(const struct object_id *oid, { int want; struct list_head *pos; + struct multi_pack_index *m; if (!exclude && local && has_loose_object_nonlocal(oid)) return 0; @@ -1048,6 +1050,32 @@ static int want_object_in_pack(const struct object_id *oid, if (want != -1) return want; } + + for (m = get_multi_pack_index(the_repository); m; m = m->next) { + struct pack_entry e; + if (fill_midx_entry(oid, &e, m)) { + struct packed_git *p = e.p; + off_t offset; + + if (p == *found_pack) + offset = *found_offset; + else + offset = find_pack_entry_one(oid->hash, p); + + if (offset) { + if (!*found_pack) { + if (!is_pack_valid(p)) + continue; + *found_offset = offset; + *found_pack = p; + } + want = want_found_object(exclude, p); + if (want != -1) + return want; + } + } + } + list_for_each(pos, get_packed_git_mru(the_repository)) { struct packed_git *p = list_entry(pos, struct packed_git, mru); off_t offset; diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index 424d0c640f..6f56b38674 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -176,7 +176,13 @@ test_expect_success 'multi-pack-index and alternates' ' compare_results_with_midx "with alternate (local midx)" test_expect_success 'multi-pack-index in an alternate' ' - mv .git/objects/pack/* alt.git/objects/pack + mv .git/objects/pack/* alt.git/objects/pack && + test_commit add_local_objects && + git repack --local && + git multi-pack-index write && + midx_read_expect 1 3 4 $objdir && + git reset --hard HEAD~1 && + rm -f .git/objects/pack/* ' compare_results_with_midx "with alternate (remote midx)" -- 2.18.0.118.gd4f65b8d14
[PATCH 2/9] multi-pack-index: store local property
A pack-file is 'local' if it is stored within the usual object directory. If it is stored in an alternate, it is non-local. Pack-files are stored using a 'pack_local' member in the packed_git struct. Add a similar 'local' member to the multi_pack_index struct and 'local' parameters to the methods that load and prepare multi- pack-indexes. Signed-off-by: Derrick Stolee --- midx.c| 11 ++- midx.h| 6 -- packfile.c| 4 ++-- t/helper/test-read-midx.c | 2 +- 4 files changed, 13 insertions(+), 10 deletions(-) diff --git a/midx.c b/midx.c index 19b7df338e..6824acf5f8 100644 --- a/midx.c +++ b/midx.c @@ -37,7 +37,7 @@ static char *get_midx_filename(const char *object_dir) return xstrfmt("%s/pack/multi-pack-index", object_dir); } -struct multi_pack_index *load_multi_pack_index(const char *object_dir) +struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local) { struct multi_pack_index *m = NULL; int fd; @@ -73,6 +73,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir) m->fd = fd; m->data = midx_map; m->data_len = midx_size; + m->local = local; m->signature = get_be32(m->data); if (m->signature != MIDX_SIGNATURE) { @@ -209,7 +210,7 @@ static int prepare_midx_pack(struct multi_pack_index *m, uint32_t pack_int_id) strbuf_addf(&pack_name, "%s/pack/%s", m->object_dir, m->pack_names[pack_int_id]); - m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1); + m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, m->local); strbuf_release(&pack_name); return !m->packs[pack_int_id]; } @@ -318,7 +319,7 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_name) return 0; } -int prepare_multi_pack_index_one(struct repository *r, const char *object_dir) +int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local) { struct multi_pack_index *m = r->objects->multi_pack_index; struct multi_pack_index *m_search; @@ -332,7 +333,7 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir) if (!strcmp(object_dir, m_search->object_dir)) return 1; - r->objects->multi_pack_index = load_multi_pack_index(object_dir); + r->objects->multi_pack_index = load_multi_pack_index(object_dir, local); if (r->objects->multi_pack_index) { r->objects->multi_pack_index->next = m; @@ -746,7 +747,7 @@ int write_midx_file(const char *object_dir) midx_name); } - packs.m = load_multi_pack_index(object_dir); + packs.m = load_multi_pack_index(object_dir, 1); packs.nr = 0; packs.alloc_list = packs.m ? packs.m->num_packs : 16; diff --git a/midx.h b/midx.h index e3b07f1586..8aa79f4b62 100644 --- a/midx.h +++ b/midx.h @@ -18,6 +18,8 @@ struct multi_pack_index { uint32_t num_packs; uint32_t num_objects; + int local; + const unsigned char *chunk_pack_names; const uint32_t *chunk_oid_fanout; const unsigned char *chunk_oid_lookup; @@ -29,14 +31,14 @@ struct multi_pack_index { char object_dir[FLEX_ARRAY]; }; -struct multi_pack_index *load_multi_pack_index(const char *object_dir); +struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local); int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result); struct object_id *nth_midxed_object_oid(struct object_id *oid, struct multi_pack_index *m, uint32_t n); int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct multi_pack_index *m); int midx_contains_pack(struct multi_pack_index *m, const char *idx_name); -int prepare_multi_pack_index_one(struct repository *r, const char *object_dir); +int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local); int write_midx_file(const char *object_dir); void clear_midx_file(const char *object_dir); diff --git a/packfile.c b/packfile.c index 12db1a9d7d..896da460ac 100644 --- a/packfile.c +++ b/packfile.c @@ -963,11 +963,11 @@ static void prepare_packed_git(struct repository *r) if (r->objects->packed_git_initialized) return; - prepare_multi_pack_index_one(r, r->objects->objectdir); + prepare_multi_pack_index_one(r, r->objects->objectdir, 1); prepare_packed_git_one(r, r->objects->objectdir, 1); prepare_alt_odb(r); for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
[PATCH 4/9] midx: stop reporting garbage
When prepare_packed_git is called with the report_garbage method initialized, we report unexpected files in the objects directory as garbage. Stop reporting the multi-pack-index and the pack-files it covers as garbage. Signed-off-by: Derrick Stolee --- packfile.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/packfile.c b/packfile.c index 896da460ac..fe713a0242 100644 --- a/packfile.c +++ b/packfile.c @@ -820,9 +820,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len, struct packed_git *p; size_t base_len = full_name_len; - if (strip_suffix_mem(full_name, &base_len, ".idx")) { - if (data->m && midx_contains_pack(data->m, file_name)) - return; + if (strip_suffix_mem(full_name, &base_len, ".idx") && + !(data->m && midx_contains_pack(data->m, file_name))) { /* Don't reopen a pack we already have. */ for (p = data->r->objects->packed_git; p; p = p->next) { size_t len; @@ -842,6 +841,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len, if (!report_garbage) return; + if (!strcmp(file_name, "multi-pack-index")) + return; if (ends_with(file_name, ".idx") || ends_with(file_name, ".pack") || ends_with(file_name, ".bitmap") || -- 2.18.0.118.gd4f65b8d14
[PATCH 3/9] midx: mark bad packed objects
When an object fails to decompress from a pack-file, we mark the object as 'bad' so we can retry with a different copy of the object (if such a copy exists). Before now, the multi-pack-index did not update the bad objects list for the pack-files it contains, and we did not check the bad objects list when reading an object. Now, do both. Signed-off-by: Derrick Stolee --- midx.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/midx.c b/midx.c index 6824acf5f8..7fa75a37a3 100644 --- a/midx.c +++ b/midx.c @@ -280,6 +280,16 @@ static int nth_midxed_pack_entry(struct multi_pack_index *m, struct pack_entry * if (!is_pack_valid(p)) return 0; + if (p->num_bad_objects) { + uint32_t i; + struct object_id oid; + nth_midxed_object_oid(&oid, m, pos); + for (i = 0; i < p->num_bad_objects; i++) + if (!hashcmp(oid.hash, +p->bad_object_sha1 + the_hash_algo->rawsz * i)) + return 0; + } + e->offset = nth_midxed_offset(m, pos); e->p = p; -- 2.18.0.118.gd4f65b8d14
[PATCH 7/9] treewide: use get_all_packs
There are many places in the codebase that want to iterate over all packfiles known to Git. The purposes are wide-ranging, and those that can take advantage of the multi-pack-index already do. So, use get_all_packs() instead of get_packed_git() to be sure we are iterating over all packfiles. Signed-off-by: Derrick Stolee --- builtin/count-objects.c | 2 +- builtin/fsck.c | 4 ++-- builtin/gc.c | 4 ++-- builtin/pack-objects.c | 14 +++--- builtin/pack-redundant.c | 4 ++-- fast-import.c| 4 ++-- http-backend.c | 4 ++-- pack-bitmap.c| 2 +- pack-objects.c | 2 +- packfile.c | 2 +- server-info.c| 4 ++-- 11 files changed, 23 insertions(+), 23 deletions(-) diff --git a/builtin/count-objects.c b/builtin/count-objects.c index d51e2ce1ec..a7cad052c6 100644 --- a/builtin/count-objects.c +++ b/builtin/count-objects.c @@ -123,7 +123,7 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix) struct strbuf pack_buf = STRBUF_INIT; struct strbuf garbage_buf = STRBUF_INIT; - for (p = get_packed_git(the_repository); p; p = p->next) { + for (p = get_all_packs(the_repository); p; p = p->next) { if (!p->pack_local) continue; if (open_pack_index(p)) diff --git a/builtin/fsck.c b/builtin/fsck.c index c96f3f4fcc..184d8e7f4e 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -740,7 +740,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) struct progress *progress = NULL; if (show_progress) { - for (p = get_packed_git(the_repository); p; + for (p = get_all_packs(the_repository); p; p = p->next) { if (open_pack_index(p)) continue; @@ -749,7 +749,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) progress = start_progress(_("Checking objects"), total); } - for (p = get_packed_git(the_repository); p; + for (p = get_all_packs(the_repository); p; p = p->next) { /* verify gives error messages itself */ if (verify_pack(p, fsck_obj_buffer, diff --git a/builtin/gc.c b/builtin/gc.c index 57069442b0..2b592260e9 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -183,7 +183,7 @@ static struct packed_git *find_base_packs(struct string_list *packs, { struct packed_git *p, *base = NULL; - for (p = get_packed_git(the_repository); p; p = p->next) { + for (p = get_all_packs(the_repository); p; p = p->next) { if (!p->pack_local) continue; if (limit) { @@ -208,7 +208,7 @@ static int too_many_packs(void) if (gc_auto_pack_limit <= 0) return 0; - for (cnt = 0, p = get_packed_git(the_repository); p; p = p->next) { + for (cnt = 0, p = get_all_packs(the_repository); p; p = p->next) { if (!p->pack_local) continue; if (p->pack_keep) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 4391504a91..1a896d7810 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -2784,7 +2784,7 @@ static void add_objects_in_unpacked_packs(struct rev_info *revs) memset(&in_pack, 0, sizeof(in_pack)); - for (p = get_packed_git(the_repository); p; p = p->next) { + for (p = get_all_packs(the_repository); p; p = p->next) { struct object_id oid; struct object *o; @@ -2848,7 +2848,7 @@ static int has_sha1_pack_kept_or_nonlocal(const struct object_id *oid) struct packed_git *p; p = (last_found != (void *)1) ? last_found : - get_packed_git(the_repository); + get_all_packs(the_repository); while (p) { if ((!p->pack_local || p->pack_keep || @@ -2858,7 +2858,7 @@ static int has_sha1_pack_kept_or_nonlocal(const struct object_id *oid) return 1; } if (p == last_found) - p = get_packed_git(the_repository); + p = get_all_packs(the_repository); else p = p->next; if (p == last_found) @@ -2894,7 +2894,7 @@ static void loosen_unused_packed_objects(struct rev_info *revs) uint32_t i; struct object_id oid; - for (p = get_packed_git(the_repository); p;