Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
Derrick Stolee writes: > We could do this, but it does come with a performance hit when the following > are all true: > > 1. 'to' is not reachable from any 'from' commits. > > 2. The 'to' and 'from' commits are close in commit-date. > > 3. Generation numbers are not available, or the topology is skewed to have >commits with high commit date and low generation number. > > Since in_merge_bases_many() calls paint_down_to_common(), it has the same > issues with the current generation numbers. This can be fixed when we have > the next version of generation numbers available. > > I'll make a note to have in_merge_bases_many() call get_reachable_subset() > conditionally (like the generation_numbers_available() trick in the > --topo-order > series) after the generation numbers are settled and implemented. Sounds good. I wondered how this compares with in_merge_bases_many() primarily because how performance characteristics between two approaches trade off. After all, what it needs to compute is a specialized case of get_reachable_subset() where "to" happens to be a set with only single element. If the latter with a single element "to" has the above performance "issues" compared to paint-down-to-common based approach, it could be possible that, for small enough N>1, running N in_merge_bases_many() traversals is more performant than a single get_reachable_subset() call that works on N-element "to". I am hoping that an update to better generation numbers to help get_reachable_subset() would help paint-down-to-common the same way, and would eventually allow us to use a single traversal that is best for N==1 and N>1. Thanks.
Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
On 10/30/2018 11:35 PM, Junio C Hamano wrote: "Derrick Stolee via GitGitGadget" writes: +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +int reachable_flag) This is OR'ed into object.flags, and I somoehow expected to see it as 'unsigned int', not a signed one. Will do. Thanks. +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(, *item); + } + } OK, we marked "to" with PARENT1 and counted them in num_to_find without dups. We also marked "from" with PARENT2 and threw them in the "queue" without dups. Mental note: the caller must guarantee that everybody reachable from "to" and "from" have PARENT1 and PARENT2 clear. This might deserve to be in the comment before the function. I'll put that in the header file. [snip] OK, this all makes sense. Unlike merge-base traversals, this does not have to traverse from the "to" side at all, which makes it quite simpler and straight-forward. I do wonder if we can now reimplement in_merge_bases_many() in terms of this helper, and if that gives us a better performance. It asks "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable from, one of the 'references', i.e. 'from'"? We could do this, but it does come with a performance hit when the following are all true: 1. 'to' is not reachable from any 'from' commits. 2. The 'to' and 'from' commits are close in commit-date. 3. Generation numbers are not available, or the topology is skewed to have commits with high commit date and low generation number. Since in_merge_bases_many() calls paint_down_to_common(), it has the same issues with the current generation numbers. This can be fixed when we have the next version of generation numbers available. I'll make a note to have in_merge_bases_many() call get_reachable_subset() conditionally (like the generation_numbers_available() trick in the --topo-order series) after the generation numbers are settled and implemented. Thanks, -Stolee
Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
On 10/31/2018 2:07 AM, Elijah Newren wrote: On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget wrote: --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; So, when we don't have a commit-graph, is c->generation just going to be 0, making min_generation also be 0? (meaning we get no possible speed benefit from the commit-graph, since we just don't have that information available)? If we don't have a commit-graph, then we can only terminate the loop early if we discover all of the commits (num_to_find == 0).Otherwise, we need to walk the entire graph in order to determine non-reachability. This relates to Junio's point about in_merge_bases_many(), which I'll respond to his message in more detail about that. Thanks, -Stolee
Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget wrote: > --- a/commit-reach.c > +++ b/commit-reach.c > @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct > commit_list *to, > object_array_clear(_objs); > return result; > } > + > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > +struct commit **to, int nr_to, > +int reachable_flag) > +{ > + struct commit **item; > + struct commit *current; > + struct commit_list *found_commits = NULL; > + struct commit **to_last = to + nr_to; > + struct commit **from_last = from + nr_from; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; > + int num_to_find = 0; > + > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > + > + for (item = to; item < to_last; item++) { > + struct commit *c = *item; > + > + parse_commit(c); > + if (c->generation < min_generation) > + min_generation = c->generation; So, when we don't have a commit-graph, is c->generation just going to be 0, making min_generation also be 0? (meaning we get no possible speed benefit from the commit-graph, since we just don't have that information available)? > + > + if (!(c->object.flags & PARENT1)) { > + c->object.flags |= PARENT1; > + num_to_find++; > + } > + } > + > + for (item = from; item < from_last; item++) { > + struct commit *c = *item; > + if (!(c->object.flags & PARENT2)) { > + c->object.flags |= PARENT2; > + parse_commit(c); > + > + prio_queue_put(, *item); > + } > + } > + > + while (num_to_find && (current = prio_queue_get()) != NULL) { > + struct commit_list *parents; > + > + if (current->object.flags & PARENT1) { > + current->object.flags &= ~PARENT1; > + current->object.flags |= reachable_flag; > + commit_list_insert(current, _commits); > + num_to_find--; > + } > + > + for (parents = current->parents; parents; parents = > parents->next) { > + struct commit *p = parents->item; > + > + parse_commit(p); > + > + if (p->generation < min_generation) > + continue; > + > + if (p->object.flags & PARENT2) > + continue; > + > + p->object.flags |= PARENT2; > + prio_queue_put(, p); > + } > + } > + > + clear_commit_marks_many(nr_to, to, PARENT1); > + clear_commit_marks_many(nr_from, from, PARENT2); > + > + return found_commits; > +} > + > diff --git a/commit-reach.h b/commit-reach.h > index 7d313e297..43bd50a70 100644 > --- a/commit-reach.h > +++ b/commit-reach.h > @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, > int can_all_from_reach(struct commit_list *from, struct commit_list *to, >int commit_date_cutoff); > > + > +/* > + * Return a list of commits containing the commits in the 'to' array > + * that are reachable from at least one commit in the 'from' array. > + * Also add the given 'flag' to each of the commits in the returned list. > + */ > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > +struct commit **to, int nr_to, > +int reachable_flag); > + > #endif > -- > gitgitgadget >
Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
"Derrick Stolee via GitGitGadget" writes: > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > + struct commit **to, int nr_to, > + int reachable_flag) This is OR'ed into object.flags, and I somoehow expected to see it as 'unsigned int', not a signed one. > +{ > + struct commit **item; > + struct commit *current; > + struct commit_list *found_commits = NULL; > + struct commit **to_last = to + nr_to; > + struct commit **from_last = from + nr_from; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; > + int num_to_find = 0; > + > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > + > + for (item = to; item < to_last; item++) { > + struct commit *c = *item; > + > + parse_commit(c); > + if (c->generation < min_generation) > + min_generation = c->generation; > + > + if (!(c->object.flags & PARENT1)) { > + c->object.flags |= PARENT1; > + num_to_find++; > + } > + } > + > + for (item = from; item < from_last; item++) { > + struct commit *c = *item; > + if (!(c->object.flags & PARENT2)) { > + c->object.flags |= PARENT2; > + parse_commit(c); > + > + prio_queue_put(, *item); > + } > + } OK, we marked "to" with PARENT1 and counted them in num_to_find without dups. We also marked "from" with PARENT2 and threw them in the "queue" without dups. Mental note: the caller must guarantee that everybody reachable from "to" and "from" have PARENT1 and PARENT2 clear. This might deserve to be in the comment before the function. > + while (num_to_find && (current = prio_queue_get()) != NULL) { > + struct commit_list *parents; > + > + if (current->object.flags & PARENT1) { > + current->object.flags &= ~PARENT1; > + current->object.flags |= reachable_flag; > + commit_list_insert(current, _commits); > + num_to_find--; What is in "queue" are all reachable from "from" so we found this object in the "to" set is reachable. One down. > + } > + > + for (parents = current->parents; parents; parents = > parents->next) { > + struct commit *p = parents->item; > + > + parse_commit(p); > + > + if (p->generation < min_generation) > + continue; Any commit reachable from "from" whose generation is too small (i.e. old) can never reach any of "to", because all "to" are at generation "min_generation" or greater. Makes sense. > + if (p->object.flags & PARENT2) > + continue; If the object is painted with PARENT2, we know we have it in queue and traversing to find more of its ancestors. Avoid recursing into it twice. Makes sense. > + p->object.flags |= PARENT2; > + prio_queue_put(, p); Otherwise, explore this parent. > + } > + } > + > + clear_commit_marks_many(nr_to, to, PARENT1); Among "to", the ones that were not found still have PARENT1. Because we do not propagate this flag down the ancestry chain like merge-base discovery, this only has to clear just one level. > + clear_commit_marks_many(nr_from, from, PARENT2); And then we smudged everything that are reachable from "from" with this flag. In the worst case, i.e. when num_to_find is still positive here, we would have painted all commits down to the root, and we need to clear all of them. > + return found_commits; > +} OK, this all makes sense. Unlike merge-base traversals, this does not have to traverse from the "to" side at all, which makes it quite simpler and straight-forward. I do wonder if we can now reimplement in_merge_bases_many() in terms of this helper, and if that gives us a better performance. It asks "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable from, one of the 'references', i.e. 'from'"? > diff --git a/commit-reach.h b/commit-reach.h > index 7d313e297..43bd50a70 100644 > --- a/commit-reach.h > +++ b/commit-reach.h > @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, > int can_all_from_reach(struct commit_list *from, struct commit_list *to, > int commit_date_cutoff); > > + > +/* > + * Return a list of commits containing the commits in the 'to' array > + * that are reachable from at least one commit in the 'from' array. > + * Also add the given 'flag' to each of the commits in the returned list. > + */ > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > + struct commit **to, int nr_to, >
[PATCH 1/3] commit-reach: implement get_reachable_subset
From: Derrick Stolee The existing reachability algorithms in commit-reach.c focus on finding merge-bases or determining if all commits in a set X can reach at least one commit in a set Y. However, for two commits sets X and Y, we may also care about which commits in Y are reachable from at least one commit in X. Implement get_reachable_subset() which answers this question. Given two arrays of commits, 'from' and 'to', return a commit_list with every commit from the 'to' array that is reachable from at least one commit in the 'from' array. The algorithm is a simple walk starting at the 'from' commits, using the PARENT2 flag to indicate "this commit has already been added to the walk queue". By marking the 'to' commits with the PARENT1 flag, we can determine when we see a commit from the 'to' array. We remove the PARENT1 flag as we add that commit to the result list to avoid duplicates. The order of the resulting list is a reverse of the order that the commits are discovered in the walk. There are a couple shortcuts to avoid walking more than we need: 1. We determine the minimum generation number of commits in the 'to' array. We do not walk commits with generation number below this minimum. 2. We count how many distinct commits are in the 'to' array, and decrement this count when we discover a 'to' commit during the walk. If this number reaches zero, then we can terminate the walk. Tests will be added using the 'test-tool reach' helper in a subsequent commit. Signed-off-by: Derrick Stolee --- commit-reach.c | 70 ++ commit-reach.h | 10 2 files changed, 80 insertions(+) diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a2..a98532ecc 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, +struct commit **to, int nr_to, +int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(, *item); + } + } + + while (num_to_find && (current = prio_queue_get()) != NULL) { + struct commit_list *parents; + + if (current->object.flags & PARENT1) { + current->object.flags &= ~PARENT1; + current->object.flags |= reachable_flag; + commit_list_insert(current, _commits); + num_to_find--; + } + + for (parents = current->parents; parents; parents = parents->next) { + struct commit *p = parents->item; + + parse_commit(p); + + if (p->generation < min_generation) + continue; + + if (p->object.flags & PARENT2) + continue; + + p->object.flags |= PARENT2; + prio_queue_put(, p); + } + } + + clear_commit_marks_many(nr_to, to, PARENT1); + clear_commit_marks_many(nr_from, from, PARENT2); + + return found_commits; +} + diff --git a/commit-reach.h b/commit-reach.h index 7d313e297..43bd50a70 100644 --- a/commit-reach.h +++ b/commit-reach.h @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, int can_all_from_reach(struct commit_list *from, struct commit_list *to, int commit_date_cutoff); + +/* + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag' to each of the commits in the returned list. + */ +struct commit_list *get_reachable_subset(struct commit