Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
Derrick Stoleewrites: > Define compare_commits_by_gen_then_commit_date(), which uses generation > numbers as a primary comparison and commit date to break ties (or as a > comparison when both commits do not have computed generation numbers). All right, this looks reasonable thing to do when we have access to commit generation numbers.. > Since the commit-graph file is closed under reachability, we know that > all commits in the file have generation at most GENERATION_NUMBER_MAX > which is less than GENERATION_NUMBER_INFINITY. Thus the condition that if B is reachable from A, then gen(A) >= gen(B), even if they have generation numbers _INFINITY, _MAX or _ZERO. We use generation numbers, if possible, to choose closest commit; if not, we use dates. > > This change does not affect the number of commits that are walked during > the execution of paint_down_to_common(), only the order that those > commits are inspected. In the case that commit dates violate topological > order (i.e. a parent is "newer" than a child), the previous code could > walk a commit twice: if a commit is reached with the PARENT1 bit, but > later is re-visited with the PARENT2 bit, then that PARENT2 bit must be > propagated to its parents. Using generation numbers avoids this extra > effort, even if it is somewhat rare. Actually the ordering of commits walked does not affect the correctness of the result. Better ordering means that commits do not need to be walked twice; I think it would be possible to craft repository in which unlucky clock skew would lead to depth-first walk of commits later part of walk would mark STALE. Pedantry aside, I think it is a good description of analysis of change results. > > Signed-off-by: Derrick Stolee > --- > commit.c | 20 +++- > commit.h | 1 + > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/commit.c b/commit.c > index 711f674c18..4d00b0a1d6 100644 > --- a/commit.c > +++ b/commit.c > @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void > *a_, const void *b_, > return 0; > } > > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, > void *unused) > +{ > + const struct commit *a = a_, *b = b_; > + > + /* newer commits first */ To be pedantic, larger generation number does not necessary mean that commit was created later (is newer), only that it is on longer chain since common ancestor or root commit. > + if (a->generation < b->generation) > + return 1; > + else if (a->generation > b->generation) > + return -1; If the commit-graph feature is not available, or is disabled, all commits would have the same generation number (_INFINITY), then this block would be always practically no-op. This is not very costly: 2 access to data which should be in cache, and 1 to 2 comparison operations. But I wonder if we wouldn't want to avoid this nano-cost if possible... > + > + /* use date as a heuristic when generations are equal */ > + if (a->date < b->date) > + return 1; > + else if (a->date > b->date) > + return -1; > + return 0; The above is the same code as in compare_commits_by_commit_date(), but there it is with "newer commits with larger date first" as comment instead. All right: we need inlining for speed. > +} > + > int compare_commits_by_commit_date(const void *a_, const void *b_, void > *unused) > { > const struct commit *a = a_, *b = b_; > @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) > /* all input commits in one and twos[] must have been parsed! */ > static struct commit_list *paint_down_to_common(struct commit *one, int n, > struct commit **twos) > { > - struct prio_queue queue = { compare_commits_by_commit_date }; > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; I wonder if it would be worth it to avoid comparing by generation numbers without commit graph data: + struct prio_queue queue; [...] + if (commit_graph) + queue.compare = compare_commits_by_gen_then_commit_date; + else + queue.compare = compare_commits_by_commit_date; Or something like that. But perhaps this nano-optimization is not worth it (it is not that complicated, though). Sidenote: when I searched for compare_commits_by_commit_date use, I have noticed that it is used, I think as heuristics, for packfile creation in upload-pack.c and fetch-pack.c. Would they similarly improve with compare_commits_by_gen_then_commit_date? This is of course not something that this commit, or this patch series, needs to address now. > struct commit_list *result = NULL; > int i; > > diff --git a/commit.h b/commit.h > index aac3b8c56f..64436ff44e 100644 > --- a/commit.h > +++ b/commit.h > @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); > extern int
Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
Jakub Narebskiwrites: > Junio C Hamano writes: >> Derrick Stolee writes: [...] >>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void >>> *b_, void *unused) >>> +{ >>> + const struct commit *a = a_, *b = b_; >>> + >>> + /* newer commits first */ >>> + if (a->generation < b->generation) >>> + return 1; >>> + else if (a->generation > b->generation) >>> + return -1; >> >> ... this does not check if a->generation is _ZERO or _INF. >> >> Both being _MAX is OK (the control will fall through and use the >> dates below). One being _MAX and the other being a normal value is >> also OK (the above comparisons will declare the commit with _MAX is >> farther than less-than-max one from a root). >> >> Or is the assumption that if one has _ZERO, that must have come from >> an ancient commit-graph file and none of the commits have anything >> but _ZERO? > > There is stronger and weaker version of the negative-cut criteria based > on generation numbers. > > The strong criteria: > > if A != B and gen(A) <= gen(B), then A cannot reach B > > The weaker criteria: > > if gen(A) < gen(B), then A cannot reach B > > > Because commit-graph is closed under reachability, this means that > > if A is in commit graph, and B is outside of it, then A cannot reach B > > If A is in commit graph, then either _MAX >= gen(A) >= 1, > or gen(A) == _ZERO. Because _INFINITY > _MAX > _ZERO, then we have > > if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY > then A cannot reach B > > which also fullfils the weaker criteria > > if gen(A) < gen(B), then A cannot reach B > > > If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY, > or if both A and B have gen(A) = gen(B) = _MAX, > or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO, > then we cannot say anything about reachability... and weak criteria > also does not say anything about reachability. > > > Maybe the following ASCII table would make it clear. > > | gen(B) > | ::: > gen(A) | _INFINITY | _MAX | larger | smaller | _ZERO > -+---+--+--+--+ > _INFINITY| = | >| >| >| > > _MAX | < Nn | =| >| >| > > larger | < Nn | < Nn | = n | >| > > smaller | < Nn | < Nn | < Nn | = n | > > _ZERO| < Nn | < Nn | < Nn | < Nn | = > > Here "n" denotes stronger condition, and "N" denotes weaker condition. > We have _INFINITY > _MAX > larger > smaller > _ZERO. > > > NOTE however that it is a *tradeoff*. Using weaker criteria, with > strict inequality, means that we don't need to handle _INFINITY, _MAX > and _ZERO corner-cases in a special way; but it also means that we would > walk slightly more commits than if we used stronger criteria, with less > or equals. Actually, if we look at the table above, it turns out that we can use the stronger version of negative-cut criteria without special-casing all the possible combinations. Just use stronger criteria on normal range, weaker criteria if any of generation numbers is special generation number. if _MAX > gen(A) > _ZERO and _MAX > gen(B) > _ZERO then if A != B and gen(A) <= gen(B) then A cannot reach B else A can reach B else /* at least one special case */ if gen(A) < gen(B) then A cannot reach B else A can reach B NOTE that it specifically does not matter for created here compare_commits_by_gen_then_commit_date(), as it requires strict inequality for sorting - which using weak criteria explains why we don't need any special cases in the code here. Best, -- Jakub Narębski
Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
Junio C Hamanowrites: > Derrick Stolee writes: > >> Define compare_commits_by_gen_then_commit_date(), which uses generation >> numbers as a primary comparison and commit date to break ties (or as a >> comparison when both commits do not have computed generation numbers). >> >> Since the commit-graph file is closed under reachability, we know that >> all commits in the file have generation at most GENERATION_NUMBER_MAX >> which is less than GENERATION_NUMBER_INFINITY. > > I suspect that my puzzlement may be coming from my not "getting" > what you meant by "closed under reachability", It means that if commit A is in the commit graph, then all of its ancestors (all commits reachable from A) are also in the commit graph. >but could you also > explain how _INF and _ZERO interact with commits with normal > generation numbers? I've always assumed that genno will be used > only when comparing two commits with valid genno and otherwise we'd > fall back to the traditional date based one, but... > >> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, >> void *unused) >> +{ >> +const struct commit *a = a_, *b = b_; >> + >> +/* newer commits first */ >> +if (a->generation < b->generation) >> +return 1; >> +else if (a->generation > b->generation) >> +return -1; > > ... this does not check if a->generation is _ZERO or _INF. > > Both being _MAX is OK (the control will fall through and use the > dates below). One being _MAX and the other being a normal value is > also OK (the above comparisons will declare the commit with _MAX is > farther than less-than-max one from a root). > > Or is the assumption that if one has _ZERO, that must have come from > an ancient commit-graph file and none of the commits have anything > but _ZERO? There is stronger and weaker version of the negative-cut criteria based on generation numbers. The strong criteria: if A != B and gen(A) <= gen(B), then A cannot reach B The weaker criteria: if gen(A) < gen(B), then A cannot reach B Because commit-graph is closed under reachability, this means that if A is in commit graph, and B is outside of it, then A cannot reach B If A is in commit graph, then either _MAX >= gen(A) >= 1, or gen(A) == _ZERO. Because _INFINITY > _MAX > _ZERO, then we have if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY then A cannot reach B which also fullfils the weaker criteria if gen(A) < gen(B), then A cannot reach B If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY, or if both A and B have gen(A) = gen(B) = _MAX, or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO, then we cannot say anything about reachability... and weak criteria also does not say anything about reachability. Maybe the following ASCII table would make it clear. | gen(B) | ::: gen(A) | _INFINITY | _MAX | larger | smaller | _ZERO -+---+--+--+--+ _INFINITY| = | >| >| >| > _MAX | < Nn | =| >| >| > larger | < Nn | < Nn | = n | >| > smaller | < Nn | < Nn | < Nn | = n | > _ZERO| < Nn | < Nn | < Nn | < Nn | = Here "n" denotes stronger condition, and "N" denotes weaker condition. We have _INFINITY > _MAX > larger > smaller > _ZERO. NOTE however that it is a *tradeoff*. Using weaker criteria, with strict inequality, means that we don't need to handle _INFINITY, _MAX and _ZERO corner-cases in a special way; but it also means that we would walk slightly more commits than if we used stronger criteria, with less or equals. For Linux kernel public repository commit graph[1] we have maximum of 512 commits sharing the same level, 5.43 sharing the same commit on average, and 50% of time only 2 commits sharing the same level (median, or 2nd quartile, or 50% percentile). This is roughly the amount of commits we walk more with weaker cut-off condition. [1]: with 750k commits, but which is not largest commit graph any more :-0 >> +/* use date as a heuristic when generations are equal */ >> +if (a->date < b->date) >> +return 1; >> +else if (a->date > b->date) >> +return -1; >> +return 0; >> +} HTH -- Jakub Narębski
Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common()
Derrick Stoleewrites: > Define compare_commits_by_gen_then_commit_date(), which uses generation > numbers as a primary comparison and commit date to break ties (or as a > comparison when both commits do not have computed generation numbers). > > Since the commit-graph file is closed under reachability, we know that > all commits in the file have generation at most GENERATION_NUMBER_MAX > which is less than GENERATION_NUMBER_INFINITY. I suspect that my puzzlement may be coming from my not "getting" what you meant by "closed under reachability", but could you also explain how _INF and _ZERO interact with commits with normal generation numbers? I've always assumed that genno will be used only when comparing two commits with valid genno and otherwise we'd fall back to the traditional date based one, but... > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, > void *unused) > +{ > + const struct commit *a = a_, *b = b_; > + > + /* newer commits first */ > + if (a->generation < b->generation) > + return 1; > + else if (a->generation > b->generation) > + return -1; ... this does not check if a->generation is _ZERO or _INF. Both being _MAX is OK (the control will fall through and use the dates below). One being _MAX and the other being a normal value is also OK (the above comparisons will declare the commit with _MAX is farther than less-than-max one from a root). Or is the assumption that if one has _ZERO, that must have come from an ancient commit-graph file and none of the commits have anything but _ZERO? > + /* use date as a heuristic when generations are equal */ > + if (a->date < b->date) > + return 1; > + else if (a->date > b->date) > + return -1; > + return 0; > +} > + > int compare_commits_by_commit_date(const void *a_, const void *b_, void > *unused) > { > const struct commit *a = a_, *b = b_; > @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) > /* all input commits in one and twos[] must have been parsed! */ > static struct commit_list *paint_down_to_common(struct commit *one, int n, > struct commit **twos) > { > - struct prio_queue queue = { compare_commits_by_commit_date }; > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > struct commit_list *result = NULL; > int i; > > diff --git a/commit.h b/commit.h > index aac3b8c56f..64436ff44e 100644 > --- a/commit.h > +++ b/commit.h > @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); > extern int check_commit_signature(const struct commit *commit, struct > signature_check *sigc); > > int compare_commits_by_commit_date(const void *a_, const void *b_, void > *unused); > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, > void *unused); > > LAST_ARG_MUST_BE_NULL > extern int run_commit_hook(int editor_is_used, const char *index_file, const > char *name, ...);
[PATCH v4 04/10] commit: use generations in paint_down_to_common()
Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. Signed-off-by: Derrick Stolee--- commit.c | 20 +++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..4d00b0a1d6 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb