Re: [PATCH 0/3] Make add_missing_tags() linear
On 11/2/2018 10:58 AM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee wrote: On 11/1/2018 2:57 PM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee wrote: No rush. I'd just like to understand how removing the commit-graph file can make the new algorithm faster. Putting a similar count in the old algorithm would involve giving a count for every call to in_merge_bases_many(), which would be very noisy. $ time git push --dry-run --follow-tags /home/newren/repo-mirror count: 92912 To /home/newren/repo-mirror * [new branch] test5 -> test5 real0m3.024s user0m2.752s sys0m0.320s Is the above test with or without the commit-graph file? Can you run it in the other mode, too? I'd like to see if the "count" value changes when the only difference is the presence of a commit-graph file. I apologize for providing misleading information earlier; this was an apples to oranges comparison. Here's what I did: git clone coworker.bundle coworker-repo cd coworker-repo time git push --dry-run --follow-tags /home/newren/repo-mirror git config core.commitgraph true git config gc.writecommitgraph true git gc time git push --dry-run --follow-tags /home/newren/nucleus-mirror I figured I had just done a fresh clone, so surely the gc wouldn't do anything other than write the .git/objects/info/commit-graph file. However, the original bundle contained many references outside of refs/heads/ and refs/tags/: $ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e refs/tags/ -e HEAD | wc -l 2396 These other refs apparently referred to objects not otherwise referenced in refs/heads/ and refs/tags/, and caused the gc to explode lots of loose objects: $ git count-objects -v count: 147604 size: 856416 in-pack: 1180692 packs: 1 size-pack: 796143 prune-packable: 0 garbage: 0 size-garbage: 0 The slowdown with commit-graph was entirely due to there being lots of loose objects (147K of them). If I add a git-prune before doing the timing with commit-graph, then the timing with commit-graph is faster than the run without a commit-graph. Sorry for the wild goose chase. And thanks for the fixes; get_reachable_subset() makes things much faster even without a commit-graph, and the commit-graph just improves it more. :-) Thanks for double-checking! It's good to have confidence that this is a good algorithm to use. Thanks, -Stolee
Re: [PATCH 0/3] Make add_missing_tags() linear
On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee wrote: > > On 11/1/2018 2:57 PM, Elijah Newren wrote: > > On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee wrote: > >> No rush. I'd just like to understand how removing the commit-graph file > >> can make the new algorithm faster. Putting a similar count in the old > >> algorithm would involve giving a count for every call to > >> in_merge_bases_many(), which would be very noisy. > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > > count: 92912 > > To /home/newren/repo-mirror > > * [new branch] test5 -> test5 > > > > real0m3.024s > > user0m2.752s > > sys0m0.320s > > Is the above test with or without the commit-graph file? Can you run it > in the other mode, too? I'd like to see if the "count" value changes > when the only difference is the presence of a commit-graph file. I apologize for providing misleading information earlier; this was an apples to oranges comparison. Here's what I did: git clone coworker.bundle coworker-repo cd coworker-repo time git push --dry-run --follow-tags /home/newren/repo-mirror git config core.commitgraph true git config gc.writecommitgraph true git gc time git push --dry-run --follow-tags /home/newren/nucleus-mirror I figured I had just done a fresh clone, so surely the gc wouldn't do anything other than write the .git/objects/info/commit-graph file. However, the original bundle contained many references outside of refs/heads/ and refs/tags/: $ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e refs/tags/ -e HEAD | wc -l 2396 These other refs apparently referred to objects not otherwise referenced in refs/heads/ and refs/tags/, and caused the gc to explode lots of loose objects: $ git count-objects -v count: 147604 size: 856416 in-pack: 1180692 packs: 1 size-pack: 796143 prune-packable: 0 garbage: 0 size-garbage: 0 The slowdown with commit-graph was entirely due to there being lots of loose objects (147K of them). If I add a git-prune before doing the timing with commit-graph, then the timing with commit-graph is faster than the run without a commit-graph. Sorry for the wild goose chase. And thanks for the fixes; get_reachable_subset() makes things much faster even without a commit-graph, and the commit-graph just improves it more. :-)
Re: [PATCH 0/3] Make add_missing_tags() linear
On 11/1/2018 2:57 PM, Elijah Newren wrote: On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee wrote: No rush. I'd just like to understand how removing the commit-graph file can make the new algorithm faster. Putting a similar count in the old algorithm would involve giving a count for every call to in_merge_bases_many(), which would be very noisy. $ time git push --dry-run --follow-tags /home/newren/repo-mirror count: 92912 To /home/newren/repo-mirror * [new branch] test5 -> test5 real0m3.024s user0m2.752s sys0m0.320s Is the above test with or without the commit-graph file? Can you run it in the other mode, too? I'd like to see if the "count" value changes when the only difference is the presence of a commit-graph file. Thanks, -Stolee
Re: [PATCH 0/3] Make add_missing_tags() linear
On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee wrote: > On 11/1/2018 2:52 AM, Elijah Newren wrote: > > On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee wrote: > >> On 10/31/2018 2:04 AM, Elijah Newren wrote: > >>> > >>> On the original repo where the topic was brought up, with commit-graph > >>> NOT turned on and using origin/master, I see: > >>> > >>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror > >>> To /home/newren/repo-mirror > >>> * [new branch] test5 -> test5 > >>> > >>> real 1m20.081s > >>> user 1m19.688s > >>> sys 0m0.292s > >>> > >>> Merging this series in, I now get: > >>> > >>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror > >>> To /home/newren/repo-mirror > >>> * [new branch] test5 -> test5 > >>> > >>> real 0m2.857s > >>> user 0m2.580s > >>> sys 0m0.328s > >>> > >>> which provides a very nice speedup. > >>> > >>> Oddly enough, if I _also_ do the following: > >>> $ git config core.commitgraph true > >>> $ git config gc.writecommitgraph true > >>> $ git gc > >>> > >>> then my timing actually slows down just slightly: > >>> $ time git push --follow-tags --dry-run /home/newren/repo-mirror > >>> To /home/newren/repo-mirror > >>> * [new branch] test5 -> test5 > >>> > >>> real 0m3.027s > >>> user 0m2.696s > >>> sys 0m0.400s > >> So you say that the commit-graph is off in the 2.8s case, but not here > >> in the 3.1s case? I would expect _at minimum_ that the cost of parsing > >> commits would have a speedup in the commit-graph case. There may be > >> something else going on here, since you are timing a `push` event that > >> is doing more than the current walk. > >> > >>> (run-to-run variation seems pretty consistent, < .1s variation, so > >>> this difference is just enough to notice.) I wouldn't be that > >>> surprised if that means there's some really old tags with very small > >>> generation numbers, meaning it's not gaining anything in this special > >>> case from the commit-graph, but it does pay the cost of loading the > >>> commit-graph. > >> While you have this test environment, do you mind applying the diff > >> below and re-running the tests? It will output a count for how many > >> commits are walked by the algorithm. This should help us determine if > >> this is another case where generation numbers are worse than commit-date, > >> or if there is something else going on. Thanks! > > I can do that, but wouldn't you want a similar patch for the old > > get_merge_bases_many() in order to compare? Does an absolute number > > help by itself? > > It's going to have to be tomorrow, though; not enough time tonight. > > No rush. I'd just like to understand how removing the commit-graph file > can make the new algorithm faster. Putting a similar count in the old > algorithm would involve giving a count for every call to > in_merge_bases_many(), which would be very noisy. $ time git push --dry-run --follow-tags /home/newren/repo-mirror count: 92912 To /home/newren/repo-mirror * [new branch] test5 -> test5 real0m3.024s user0m2.752s sys0m0.320s Also: $ git rev-list --count HEAD 55764 $ git rev-list --count --all 91820 Seems a little odd to me that count is greater than `git rev-list --count --all`. However, the fact that they are close in magnitude isn't surprising since I went digging for the commit with smallest generation number not found in the upstream repo, and found: $ git ls-remote /home/newren/repo-mirror/ | grep refs/tags/v0.2.0; echo $? 1 $ git rev-list --count refs/tags/v0.2.0 4 $ git rev-list --count refs/tags/v0.2.0 ^HEAD 4 So, the commit-graph can only help us avoid parsing 3 or so commits, but we have to parse the 5M .git/objects/info/commit-graph file, and then for every parse_commit() call we need to bsearch_graph() for the commit.My theory is that parsing the commit-graph file and binary searching it for commits is relatively fast, but that the time is just big enough to measure and notice, while avoiding parsing 3 more commits is a negligible time savings. To me, I'm thinking this is one of those bizarre corner cases where the commit-graph is almost imperceptibly slower than without the commit-graph. (And it is a very weird repo; someone repeatedly filter-branched lots of small independent repos into a monorepo, but didn't always push everything and didn't clean out all old stuff.) But if you still see weird stuff you want to dig into further (maybe the 92912 > 91820 bit?), I'm happy to try out other stuff.
Re: [PATCH 0/3] Make add_missing_tags() linear
On 11/1/2018 2:52 AM, Elijah Newren wrote: On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee wrote: On 10/31/2018 2:04 AM, Elijah Newren wrote: On the original repo where the topic was brought up, with commit-graph NOT turned on and using origin/master, I see: $ time git push --dry-run --follow-tags /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 1m20.081s user 1m19.688s sys 0m0.292s Merging this series in, I now get: $ time git push --dry-run --follow-tags /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 0m2.857s user 0m2.580s sys 0m0.328s which provides a very nice speedup. Oddly enough, if I _also_ do the following: $ git config core.commitgraph true $ git config gc.writecommitgraph true $ git gc then my timing actually slows down just slightly: $ time git push --follow-tags --dry-run /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 0m3.027s user 0m2.696s sys 0m0.400s So you say that the commit-graph is off in the 2.8s case, but not here in the 3.1s case? I would expect _at minimum_ that the cost of parsing commits would have a speedup in the commit-graph case. There may be something else going on here, since you are timing a `push` event that is doing more than the current walk. (run-to-run variation seems pretty consistent, < .1s variation, so this difference is just enough to notice.) I wouldn't be that surprised if that means there's some really old tags with very small generation numbers, meaning it's not gaining anything in this special case from the commit-graph, but it does pay the cost of loading the commit-graph. While you have this test environment, do you mind applying the diff below and re-running the tests? It will output a count for how many commits are walked by the algorithm. This should help us determine if this is another case where generation numbers are worse than commit-date, or if there is something else going on. Thanks! I can do that, but wouldn't you want a similar patch for the old get_merge_bases_many() in order to compare? Does an absolute number help by itself? It's going to have to be tomorrow, though; not enough time tonight. No rush. I'd just like to understand how removing the commit-graph file can make the new algorithm faster. Putting a similar count in the old algorithm would involve giving a count for every call to in_merge_bases_many(), which would be very noisy. Thanks! -Stolee
Re: [PATCH 0/3] Make add_missing_tags() linear
On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee wrote: > > On 10/31/2018 2:04 AM, Elijah Newren wrote: > > On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget > > wrote: > >> > >> As reported earlier [1], the add_missing_tags() method in remote.c has > >> quadratic performance. Some of that performance is curbed due to the > >> generation-number cutoff in in_merge_bases_many(). However, that fix > >> doesn't > >> help users without a commit-graph, and it can still be painful if that > >> cutoff is sufficiently low compared to the tags we are using for > >> reachability testing. > >> > >> Add a new method in commit-reach.c called get_reachable_subset() which does > >> a many-to-many reachability test. Starting at the 'from' commits, walk > >> until > >> the generation is below the smallest generation in the 'to' commits, or all > >> 'to' commits have been discovered. This performs only one commit walk for > >> the entire add_missing_tags() method, giving linear performance in the > >> worst > >> case. > >> > >> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() > >> works independently of its application in add_missing_tags(). > > > > On the original repo where the topic was brought up, with commit-graph > > NOT turned on and using origin/master, I see: > > > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > > To /home/newren/repo-mirror > > * [new branch] test5 -> test5 > > > > real 1m20.081s > > user 1m19.688s > > sys 0m0.292s > > > > Merging this series in, I now get: > > > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > > To /home/newren/repo-mirror > > * [new branch] test5 -> test5 > > > > real 0m2.857s > > user 0m2.580s > > sys 0m0.328s > > > > which provides a very nice speedup. > > > > Oddly enough, if I _also_ do the following: > > $ git config core.commitgraph true > > $ git config gc.writecommitgraph true > > $ git gc > > > > then my timing actually slows down just slightly: > > $ time git push --follow-tags --dry-run /home/newren/repo-mirror > > To /home/newren/repo-mirror > > * [new branch] test5 -> test5 > > > > real 0m3.027s > > user 0m2.696s > > sys 0m0.400s > > So you say that the commit-graph is off in the 2.8s case, but not here > in the 3.1s case? I would expect _at minimum_ that the cost of parsing > commits would have a speedup in the commit-graph case. There may be > something else going on here, since you are timing a `push` event that > is doing more than the current walk. > > > (run-to-run variation seems pretty consistent, < .1s variation, so > > this difference is just enough to notice.) I wouldn't be that > > surprised if that means there's some really old tags with very small > > generation numbers, meaning it's not gaining anything in this special > > case from the commit-graph, but it does pay the cost of loading the > > commit-graph. > > While you have this test environment, do you mind applying the diff > below and re-running the tests? It will output a count for how many > commits are walked by the algorithm. This should help us determine if > this is another case where generation numbers are worse than commit-date, > or if there is something else going on. Thanks! I can do that, but wouldn't you want a similar patch for the old get_merge_bases_many() in order to compare? Does an absolute number help by itself? It's going to have to be tomorrow, though; not enough time tonight.
Re: [PATCH 0/3] Make add_missing_tags() linear
On 10/31/2018 2:04 AM, Elijah Newren wrote: > On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget > wrote: >> >> As reported earlier [1], the add_missing_tags() method in remote.c has >> quadratic performance. Some of that performance is curbed due to the >> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't >> help users without a commit-graph, and it can still be painful if that >> cutoff is sufficiently low compared to the tags we are using for >> reachability testing. >> >> Add a new method in commit-reach.c called get_reachable_subset() which does >> a many-to-many reachability test. Starting at the 'from' commits, walk until >> the generation is below the smallest generation in the 'to' commits, or all >> 'to' commits have been discovered. This performs only one commit walk for >> the entire add_missing_tags() method, giving linear performance in the worst >> case. >> >> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() >> works independently of its application in add_missing_tags(). > > On the original repo where the topic was brought up, with commit-graph > NOT turned on and using origin/master, I see: > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > To /home/newren/repo-mirror > * [new branch] test5 -> test5 > > real 1m20.081s > user 1m19.688s > sys 0m0.292s > > Merging this series in, I now get: > > $ time git push --dry-run --follow-tags /home/newren/repo-mirror > To /home/newren/repo-mirror > * [new branch] test5 -> test5 > > real 0m2.857s > user 0m2.580s > sys 0m0.328s > > which provides a very nice speedup. > > Oddly enough, if I _also_ do the following: > $ git config core.commitgraph true > $ git config gc.writecommitgraph true > $ git gc > > then my timing actually slows down just slightly: > $ time git push --follow-tags --dry-run /home/newren/repo-mirror > To /home/newren/repo-mirror > * [new branch] test5 -> test5 > > real 0m3.027s > user 0m2.696s > sys 0m0.400s So you say that the commit-graph is off in the 2.8s case, but not here in the 3.1s case? I would expect _at minimum_ that the cost of parsing commits would have a speedup in the commit-graph case. There may be something else going on here, since you are timing a `push` event that is doing more than the current walk. > (run-to-run variation seems pretty consistent, < .1s variation, so > this difference is just enough to notice.) I wouldn't be that > surprised if that means there's some really old tags with very small > generation numbers, meaning it's not gaining anything in this special > case from the commit-graph, but it does pay the cost of loading the > commit-graph. While you have this test environment, do you mind applying the diff below and re-running the tests? It will output a count for how many commits are walked by the algorithm. This should help us determine if this is another case where generation numbers are worse than commit-date, or if there is something else going on. Thanks! -->8-- >From 2115e7dcafb2770455b7b4793f90edc2254bad97 Mon Sep 17 00:00:00 2001 From: Derrick Stolee Date: Wed, 31 Oct 2018 11:40:50 + Subject: [PATCH] DO-NOT-MERGE: count commits in get_reachable_subset Signed-off-by: Derrick Stolee --- commit-reach.c | 5 + 1 file changed, 5 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index a98532ecc8..b512461cf7 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -700,6 +700,7 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from, struct commit **from_last = from + nr_from; uint32_t min_generation = GENERATION_NUMBER_INFINITY; int num_to_find = 0; + int count = 0; struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; @@ -729,6 +730,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from, while (num_to_find && (current = prio_queue_get()) != NULL) { struct commit_list *parents; + count++; + if (current->object.flags & PARENT1) { current->object.flags &= ~PARENT1; current->object.flags |= reachable_flag; @@ -755,6 +758,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from, clear_commit_marks_many(nr_to, to, PARENT1); clear_commit_marks_many(nr_from, from, PARENT2); + fprintf(stderr, "count: %d\n", count); + return found_commits; } -- 2.19.1.542.gc4df23f792
Re: [PATCH 0/3] Make add_missing_tags() linear
On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget wrote: > > As reported earlier [1], the add_missing_tags() method in remote.c has > quadratic performance. Some of that performance is curbed due to the > generation-number cutoff in in_merge_bases_many(). However, that fix doesn't > help users without a commit-graph, and it can still be painful if that > cutoff is sufficiently low compared to the tags we are using for > reachability testing. > > Add a new method in commit-reach.c called get_reachable_subset() which does > a many-to-many reachability test. Starting at the 'from' commits, walk until > the generation is below the smallest generation in the 'to' commits, or all > 'to' commits have been discovered. This performs only one commit walk for > the entire add_missing_tags() method, giving linear performance in the worst > case. > > Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() > works independently of its application in add_missing_tags(). On the original repo where the topic was brought up, with commit-graph NOT turned on and using origin/master, I see: $ time git push --dry-run --follow-tags /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 1m20.081s user 1m19.688s sys 0m0.292s Merging this series in, I now get: $ time git push --dry-run --follow-tags /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 0m2.857s user 0m2.580s sys 0m0.328s which provides a very nice speedup. Oddly enough, if I _also_ do the following: $ git config core.commitgraph true $ git config gc.writecommitgraph true $ git gc then my timing actually slows down just slightly: $ time git push --follow-tags --dry-run /home/newren/repo-mirror To /home/newren/repo-mirror * [new branch] test5 -> test5 real 0m3.027s user 0m2.696s sys 0m0.400s (run-to-run variation seems pretty consistent, < .1s variation, so this difference is just enough to notice.) I wouldn't be that surprised if that means there's some really old tags with very small generation numbers, meaning it's not gaining anything in this special case from the commit-graph, but it does pay the cost of loading the commit-graph. Anyway, looks good in my testing. Thanks much for working on this!
Re: [PATCH 0/3] Make add_missing_tags() linear
"Derrick Stolee via GitGitGadget" writes: > Add a new method in commit-reach.c called get_reachable_subset() which does > a many-to-many reachability test. Starting at the 'from' commits, walk until > the generation is below the smallest generation in the 'to' commits, or all > 'to' commits have been discovered. This performs only one commit walk for > the entire add_missing_tags() method, giving linear performance in the worst > case. ;-) I think in_merge_bases_many() was an attempt to do one half of this (i.e. it checks only one 'to' against main 'from' to see if any of them reach). I wonder why we didn't extend it to multiple 'to' back when we invented it. In any case, good to see this code optimized. Thanks.
[PATCH 0/3] Make add_missing_tags() linear
As reported earlier [1], the add_missing_tags() method in remote.c has quadratic performance. Some of that performance is curbed due to the generation-number cutoff in in_merge_bases_many(). However, that fix doesn't help users without a commit-graph, and it can still be painful if that cutoff is sufficiently low compared to the tags we are using for reachability testing. Add a new method in commit-reach.c called get_reachable_subset() which does a many-to-many reachability test. Starting at the 'from' commits, walk until the generation is below the smallest generation in the 'to' commits, or all 'to' commits have been discovered. This performs only one commit walk for the entire add_missing_tags() method, giving linear performance in the worst case. Tests are added in t6600-test-reach.sh to ensure get_reachable_subset() works independently of its application in add_missing_tags(). Thanks, -Stolee [1] https://public-inbox.org/git/cabpp-becpsoxudovjbdg_3w9wus102rw+e+qpmd4g3qyd-q...@mail.gmail.com/ Derrick Stolee (3): commit-reach: implement get_reachable_subset test-reach: test get_reachable_subset remote: make add_missing_tags() linear commit-reach.c| 70 +++ commit-reach.h| 10 +++ remote.c | 34 - t/helper/test-reach.c | 34 ++--- t/t6600-test-reach.sh | 52 5 files changed, 195 insertions(+), 5 deletions(-) base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/60 -- gitgitgadget