Re: commit-graph: change in "best" merge-base when ambiguous
On 05/25/2018 12:08 AM, Jakub Narebski wrote: > Derrick Stoleewrites: >> On 5/22/2018 1:39 AM, Michael Haggerty wrote: >>> On 05/21/2018 08:10 PM, Derrick Stolee wrote: [...] >>> This may be beyond the scope of what you are working on, but there are >>> significant advantages to selecting a "best" merge base from among the >>> candidates. Long ago [1] I proposed that the "best" merge base is the >>> merge base candidate that minimizes the number of non-merge commits that >>> are in >>> >>> git rev-list $candidate..$branch >>> >>> that are already in master: >>> >>> git rev-list $master >>> >>> (assuming merging branch into master), which is equivalent to choosing >>> the merge base that minimizes >>> >>> git rev-list --count $candidate..$branch > > Is the above correct... > >>> In fact, this criterion is symmetric if you exchange branch ↔ master, >>> which is a nice property, and indeed generalizes pretty simply to >>> computing the merge base of more than two commits. > > ...as it doesn't seem to have the described symmetry. The first email that I referenced [1] demonstrates this in the section "Symmetry; generalization to more than two branches". The same thing is demonstrated in a simpler way using set notation in a later email in that thread [2]. Michael [1] https://public-inbox.org/git/539a25bf.4060...@alum.mit.edu/ [2] https://public-inbox.org/git/53a06264.9080...@alum.mit.edu/
Re: commit-graph: change in "best" merge-base when ambiguous
Derrick Stoleewrites: > On 5/22/2018 1:39 AM, Michael Haggerty wrote: >> On 05/21/2018 08:10 PM, Derrick Stolee wrote: >>> [...] >>> In the Discussion section of the `git merge-base` docs [1], we have the >>> following: >>> >>> When the history involves criss-cross merges, there can be more than >>> one best common ancestor for two commits. For example, with this topology: >>> >>> ---1---o---A >>> \ / >>> X >>> / \ >>> ---2---o---o---B >>> >>> both 1 and 2 are merge-bases of A and B. Neither one is better than >>> the other (both are best merge bases). When the --all option is not >>> given, it is unspecified which best one is output. >>> >>> This means our official documentation mentions that we do not have a >>> concrete way to differentiate between these choices. This makes me think >>> that this change in behavior is not a bug, but it _is_ a change in >>> behavior. It's worth mentioning, but I don't think there is any value in >>> making sure `git merge-base` returns the same output. >>> >>> Does anyone disagree? Is this something we should solidify so we always >>> have a "definitive" merge-base? >>> [...] >> This may be beyond the scope of what you are working on, but there are >> significant advantages to selecting a "best" merge base from among the >> candidates. Long ago [1] I proposed that the "best" merge base is the >> merge base candidate that minimizes the number of non-merge commits that >> are in >> >> git rev-list $candidate..$branch >> >> that are already in master: >> >> git rev-list $master >> >> (assuming merging branch into master), which is equivalent to choosing >> the merge base that minimizes >> >> git rev-list --count $candidate..$branch Is the above correct... >> In fact, this criterion is symmetric if you exchange branch ↔ master, >> which is a nice property, and indeed generalizes pretty simply to >> computing the merge base of more than two commits. ...as it doesn't seem to have the described symmetry. >> >> In that email I also included some data showing that the "best" merge >> base almost always results in either the same or a shorter diff than the >> more or less arbitrary algorithm that we currently use. Sometimes the >> difference in diff length is dramatic. >> >> To me it feels like the best *deterministic* merge base would be based >> on the above criterion, maybe with first-parent reachability, commit >> times, and SHA-1s used (in that order) to break ties. > > Thanks, everyone, for your perspective on this. I'm walking away with > these conclusions: > > 1. While this is a change in behavior, it is not a regression. We do > not need to act immediately to preserve old behavior in these > ambiguous cases. > > 2. We should (eventually) define tie-breaking conditions. I like > Michael's suggestion above. One thing I'd like to point out is that when searching for some algorithm to speed up merge-base calculation (which is called lowest common ancestor in graph theory, and for which I have currently found only an algorithm with O(|V|*{E|) preparation time, and U(1) query) I have found instead attempts to rigorously define single representative lowest common ancestor. It might be worth a look how it is done. Another possible source to compare against is the algorithm used by Mercurial (which as far as I know doesn't use recursive merge strategy, so it needs to chose one merge base). HTH, -- Jakub Narębski
Re: commit-graph: change in "best" merge-base when ambiguous
On 5/22/2018 1:39 AM, Michael Haggerty wrote: On 05/21/2018 08:10 PM, Derrick Stolee wrote: [...] In the Discussion section of the `git merge-base` docs [1], we have the following: When the history involves criss-cross merges, there can be more than one best common ancestor for two commits. For example, with this topology: ---1---o---A \ / X / \ ---2---o---o---B both 1 and 2 are merge-bases of A and B. Neither one is better than the other (both are best merge bases). When the --all option is not given, it is unspecified which best one is output. This means our official documentation mentions that we do not have a concrete way to differentiate between these choices. This makes me think that this change in behavior is not a bug, but it _is_ a change in behavior. It's worth mentioning, but I don't think there is any value in making sure `git merge-base` returns the same output. Does anyone disagree? Is this something we should solidify so we always have a "definitive" merge-base? [...] This may be beyond the scope of what you are working on, but there are significant advantages to selecting a "best" merge base from among the candidates. Long ago [1] I proposed that the "best" merge base is the merge base candidate that minimizes the number of non-merge commits that are in git rev-list $candidate..$branch that are already in master: git rev-list $master (assuming merging branch into master), which is equivalent to choosing the merge base that minimizes git rev-list --count $candidate..$branch In fact, this criterion is symmetric if you exchange branch ↔ master, which is a nice property, and indeed generalizes pretty simply to computing the merge base of more than two commits. In that email I also included some data showing that the "best" merge base almost always results in either the same or a shorter diff than the more or less arbitrary algorithm that we currently use. Sometimes the difference in diff length is dramatic. To me it feels like the best *deterministic* merge base would be based on the above criterion, maybe with first-parent reachability, commit times, and SHA-1s used (in that order) to break ties. Thanks, everyone, for your perspective on this. I'm walking away with these conclusions: 1. While this is a change in behavior, it is not a regression. We do not need to act immediately to preserve old behavior in these ambiguous cases. 2. We should (eventually) define tie-breaking conditions. I like Michael's suggestion above. Thanks, -Stolee
Re: commit-graph: change in "best" merge-base when ambiguous
On 05/21/2018 08:10 PM, Derrick Stolee wrote: > [...] > In the Discussion section of the `git merge-base` docs [1], we have the > following: > > When the history involves criss-cross merges, there can be more than > one best common ancestor for two commits. For example, with this topology: > > ---1---o---A > \ / > X > / \ > ---2---o---o---B > > both 1 and 2 are merge-bases of A and B. Neither one is better than > the other (both are best merge bases). When the --all option is not > given, it is unspecified which best one is output. > > This means our official documentation mentions that we do not have a > concrete way to differentiate between these choices. This makes me think > that this change in behavior is not a bug, but it _is_ a change in > behavior. It's worth mentioning, but I don't think there is any value in > making sure `git merge-base` returns the same output. > > Does anyone disagree? Is this something we should solidify so we always > have a "definitive" merge-base? > [...] This may be beyond the scope of what you are working on, but there are significant advantages to selecting a "best" merge base from among the candidates. Long ago [1] I proposed that the "best" merge base is the merge base candidate that minimizes the number of non-merge commits that are in git rev-list $candidate..$branch that are already in master: git rev-list $master (assuming merging branch into master), which is equivalent to choosing the merge base that minimizes git rev-list --count $candidate..$branch In fact, this criterion is symmetric if you exchange branch ↔ master, which is a nice property, and indeed generalizes pretty simply to computing the merge base of more than two commits. In that email I also included some data showing that the "best" merge base almost always results in either the same or a shorter diff than the more or less arbitrary algorithm that we currently use. Sometimes the difference in diff length is dramatic. To me it feels like the best *deterministic* merge base would be based on the above criterion, maybe with first-parent reachability, commit times, and SHA-1s used (in that order) to break ties. I don't plan to work on the implementation of this idea myself (though we've long used a script-based implementation of this algorithm internally at GitHub). Michael [1] https://public-inbox.org/git/539a25bf.4060...@alum.mit.edu/ See the rest of the thread for more interesting discussion. [2] https://public-inbox.org/git/8a9b3f20-eed2-c59b-f7ea-3c68b3c30...@alum.mit.edu/ Higher in this thread, Junio proposes a different criterion.
Re: commit-graph: change in "best" merge-base when ambiguous
On Mon, May 21, 2018 at 2:50 PM, Jeff Kingwrote: > On Mon, May 21, 2018 at 11:33:11AM -0700, Elijah Newren wrote: > >> > In t6024-recursive-merge.sh, we have the following commit structure: >> > >> > # 1 - A - D - F >> > # \ X / >> > # B X >> > # X \ >> > # 2 - C - E - G >> > >> > When merging F to G, there are two "best" merge-bases, A and C. With >> > core.commitGraph=false, 'git merge-base F G' returns A, while it returns C >> > when core.commitGraph=true. This is due to the new walk order when using >> > generation numbers, although I have not dug deep into the code to point out >> > exactly where the choice between A and C is made. Likely it's just whatever >> > order they are inserted into a list. >> >> Ooh, interesting. >> >> Just a guess, but could it be related to relative ordering of >> committer timestamps? Ordering of committer timestamps apparently >> affects order of merge-bases returned to merge-recursive, and although >> that shouldn't have mattered, a few bugs meant that it did and the >> order ended up determining what contents a successful merge would >> have. See this recent post: >> >> https://public-inbox.org/git/CABPp-BFc1OLYKzS5rauOehvEugPc0oGMJp-NMEAmVMW7QR=4...@mail.gmail.com/ >> >> The fact that the merge was successful for both orderings of merge >> bases was the real bug, though; it should have detected and reported a >> conflict both ways. > > Traditionally we've inserted commits into the walk queue in commit-date > ordering, but with identical dates it may depend on the order in which > you reach the commits. Many of the tests are particularly bad for > showing this off because they do not use test_tick, and so you end up > with a bunch of commits with identical timestamps. > > If we're just using generation numbers for queue ordering, we're even > more likely to hit these cases, since they're expected to increase along > parallel branches at roughly the same rate. It's probably a good idea to > have some tie-breakers to make things more deterministic (walk order > shouldn't matter, but it can be confusing if we sometimes use one order > and sometimes the other). > > Even ordering by {generation, timestamp} isn't quite enough, since you > could still tie there. Perhaps {generation, timestamp, hash} would be a > sensible ordering? The hash sounds reasonable as the definite tie breaker. git merge-base is documented as "Find as good common ancestors as possible for a merge", so in case we do not require the tie breaking to be cheap, we could go by "smallest diff output" of the two diffs against the potential merge commit. Though I don't think this is really optimal for performance reasons.
Re: commit-graph: change in "best" merge-base when ambiguous
On Mon, May 21, 2018 at 2:54 PM, Jeff Kingwrote: > Yes, I think this is clearly a case where all of the single merge-bases > we could show are equally good. And I don't think we should promise to > show a particular one, but I _do_ think it's friendly for us to have > deterministic tie-breakers (we certainly don't now). > > -Peff Right. I think we should probably have some mechanism that will allow us to always give the same answer, even if it's somewhat arbitrary. Regards, Jake
Re: commit-graph: change in "best" merge-base when ambiguous
On Mon, May 21, 2018 at 02:10:54PM -0400, Derrick Stolee wrote: > In the Discussion section of the `git merge-base` docs [1], we have the > following: > > When the history involves criss-cross merges, there can be more than one > best common ancestor for two commits. For example, with this topology: > > ---1---o---A > \ / > X > / \ > ---2---o---o---B > > both 1 and 2 are merge-bases of A and B. Neither one is better than the > other (both are best merge bases). When the --all option is not given, > it is unspecified which best one is output. > > This means our official documentation mentions that we do not have a > concrete way to differentiate between these choices. This makes me think > that this change in behavior is not a bug, but it _is_ a change in behavior. > It's worth mentioning, but I don't think there is any value in making sure > `git merge-base` returns the same output. > > Does anyone disagree? Is this something we should solidify so we always have > a "definitive" merge-base? Heh, I should have read your whole original message before responding, not just the part that Elijah quoted. Yes, I think this is clearly a case where all of the single merge-bases we could show are equally good. And I don't think we should promise to show a particular one, but I _do_ think it's friendly for us to have deterministic tie-breakers (we certainly don't now). -Peff
Re: commit-graph: change in "best" merge-base when ambiguous
On Mon, May 21, 2018 at 11:33:11AM -0700, Elijah Newren wrote: > > In t6024-recursive-merge.sh, we have the following commit structure: > > > > # 1 - A - D - F > > # \ X / > > # B X > > # X \ > > # 2 - C - E - G > > > > When merging F to G, there are two "best" merge-bases, A and C. With > > core.commitGraph=false, 'git merge-base F G' returns A, while it returns C > > when core.commitGraph=true. This is due to the new walk order when using > > generation numbers, although I have not dug deep into the code to point out > > exactly where the choice between A and C is made. Likely it's just whatever > > order they are inserted into a list. > > Ooh, interesting. > > Just a guess, but could it be related to relative ordering of > committer timestamps? Ordering of committer timestamps apparently > affects order of merge-bases returned to merge-recursive, and although > that shouldn't have mattered, a few bugs meant that it did and the > order ended up determining what contents a successful merge would > have. See this recent post: > > https://public-inbox.org/git/CABPp-BFc1OLYKzS5rauOehvEugPc0oGMJp-NMEAmVMW7QR=4...@mail.gmail.com/ > > The fact that the merge was successful for both orderings of merge > bases was the real bug, though; it should have detected and reported a > conflict both ways. Traditionally we've inserted commits into the walk queue in commit-date ordering, but with identical dates it may depend on the order in which you reach the commits. Many of the tests are particularly bad for showing this off because they do not use test_tick, and so you end up with a bunch of commits with identical timestamps. If we're just using generation numbers for queue ordering, we're even more likely to hit these cases, since they're expected to increase along parallel branches at roughly the same rate. It's probably a good idea to have some tie-breakers to make things more deterministic (walk order shouldn't matter, but it can be confusing if we sometimes use one order and sometimes the other). Even ordering by {generation, timestamp} isn't quite enough, since you could still tie there. Perhaps {generation, timestamp, hash} would be a sensible ordering? As for this specific case, even with the current code asking for `git merge-base G F` will return the other answer. This is clearly a case with multiple merge bases, and I'd expect "merge-base --all" to return both (and actually it shows "B" as well, which makes sense). In the non-all case, there is no "best", so we're free to show any. -Peff
Re: commit-graph: change in "best" merge-base when ambiguous
Hi, On Mon, May 21, 2018 at 11:10 AM, Derrick Stoleewrote: > Hello all, > > While working on the commit-graph feature, I made a test commit that sets > core.commitGraph and gc.commitGraph to true by default AND runs 'git > commit-graph write --reachable' after each 'git commit' command. This helped > me find instances in the test suite where the commit-graph feature changes > existing functionality. Most of these were in regards to grafts, > replace-objects, and shallow-clones (as expected) or when trying to find a > corrupt or hidden commit (the commit-graph hides this corrupt/missing data). > However, there was one interesting case that I'd like to mention on-list. > > In t6024-recursive-merge.sh, we have the following commit structure: > > # 1 - A - D - F > # \ X / > # B X > # X \ > # 2 - C - E - G > > When merging F to G, there are two "best" merge-bases, A and C. With > core.commitGraph=false, 'git merge-base F G' returns A, while it returns C > when core.commitGraph=true. This is due to the new walk order when using > generation numbers, although I have not dug deep into the code to point out > exactly where the choice between A and C is made. Likely it's just whatever > order they are inserted into a list. Ooh, interesting. Just a guess, but could it be related to relative ordering of committer timestamps? Ordering of committer timestamps apparently affects order of merge-bases returned to merge-recursive, and although that shouldn't have mattered, a few bugs meant that it did and the order ended up determining what contents a successful merge would have. See this recent post: https://public-inbox.org/git/CABPp-BFc1OLYKzS5rauOehvEugPc0oGMJp-NMEAmVMW7QR=4...@mail.gmail.com/ The fact that the merge was successful for both orderings of merge bases was the real bug, though; it should have detected and reported a conflict both ways. I'm not sure where else we have an accidental and incorrect dependence on merge-base tie-breaker or ordering logic, but if it's like this one, changing the tie-breaker should be okay.
commit-graph: change in "best" merge-base when ambiguous
Hello all, While working on the commit-graph feature, I made a test commit that sets core.commitGraph and gc.commitGraph to true by default AND runs 'git commit-graph write --reachable' after each 'git commit' command. This helped me find instances in the test suite where the commit-graph feature changes existing functionality. Most of these were in regards to grafts, replace-objects, and shallow-clones (as expected) or when trying to find a corrupt or hidden commit (the commit-graph hides this corrupt/missing data). However, there was one interesting case that I'd like to mention on-list. In t6024-recursive-merge.sh, we have the following commit structure: # 1 - A - D - F # \ X / # B X # X \ # 2 - C - E - G When merging F to G, there are two "best" merge-bases, A and C. With core.commitGraph=false, 'git merge-base F G' returns A, while it returns C when core.commitGraph=true. This is due to the new walk order when using generation numbers, although I have not dug deep into the code to point out exactly where the choice between A and C is made. Likely it's just whatever order they are inserted into a list. In the Discussion section of the `git merge-base` docs [1], we have the following: When the history involves criss-cross merges, there can be more than one best common ancestor for two commits. For example, with this topology: ---1---o---A \ / X / \ ---2---o---o---B both 1 and 2 are merge-bases of A and B. Neither one is better than the other (both are best merge bases). When the --all option is not given, it is unspecified which best one is output. This means our official documentation mentions that we do not have a concrete way to differentiate between these choices. This makes me think that this change in behavior is not a bug, but it _is_ a change in behavior. It's worth mentioning, but I don't think there is any value in making sure `git merge-base` returns the same output. Does anyone disagree? Is this something we should solidify so we always have a "definitive" merge-base? The biggest reason I think we should avoid sticking to the existing behavior is that the current behavior depends on the walk order. That means we would not be able to concretely define a tie-breaker without changing the existing behavior anyway. Thanks, -Stolee [1] https://git-scm.com/docs/git-merge-base#_discussion