Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
Junio C Hamano gits...@pobox.com writes: Junio C Hamano gits...@pobox.com writes: While the result is more consistent and more predictable in the case of merged cherry picks, it is also slower in every case. Consistent and predictable, perhaps, but I am not sure exact would be a good word. Another thing I am not enthusiasitc about this change is that I am afraid that this may make git blame -- path and git log -- path work inconsistenly. The both cull side branches whenever one of the parents gave the resulting blob, even that parent is not the first one. But git blame --prefer-first -- path, afaict, behaves quite differently from git log --first-parent -- path, even though they share similar option names, adding more to the confusion. I think I am starting to understand why this patch felt wrong to me. This wasn't about --first-parent at all, and you are correct that you didn't call the option as such), but I somehow thought that they were related; perhaps the fact that both disable the if the result exactly matches one parent, all the other parents can be culled to simplify the history logic blinded me. In reality, the new flag is a lot closer in spirit to the total opposite of --first-parent, i.e. --full-history. That option also disables that if same to one parent, other parents do not matter logic, but its effect is quite different. It makes the other histories that did not have to have contributed the end result shown in the output. Now, when we step back and think about how the normal git blame logic apportions the blame to multiple parents when there is no exact match, it does so in a pretty arbitrary way. It lets earlier parents to claim the responsibility and later parents only get leftover contents that weren't claimed by the earlier ones. We can call that favouring earler ones, i.e. --prefer-first. It was implemented this way, not because this order makes any sense, but primarily because no order is particularly better than any other, and the designer (me) happened to have picked the easiest one at random. The pick the one that exactly matches if exists can be thought of an easy hack to hide the problems that come from this arbitrary choice. Without it, if the result matches the second parent (i.e. a typical merge of a work done on the topic branch while the mainline has been quiescent in the same area), the give earlier parents a chance to claim responsiblity before later ones rule would have split the blame for parts that weren't changed in the side branch topic to the mainline and blame would have been passed to the side branch only for the portion that were changed by the side branch. Instead, pass the whole blame to the one that exactly matches hack keeps larger blocks of text unsplit, clumping related contents together as long as possible while we traverse the history. It is an easy hack, because we only need to compare the object name, but a logical extension to it would have been to compute the similarity scores between the result and each of the parents, sort the parents by that similarity score order, and give more similar ones a chance to claim responsibility before less similar ones. We could call it favouring similar ones, i.e. --prefer-similar or something. That would have made the result more stable. Imagine that in one history, a merge's result matchs exactly the second parent, and in another history, a merge's result almost matches exactly the second parent but the difference is the result adds one blank line at the end of the file relative to what the second parent has. With the current code, blaming the file will get quite a different result. In the former history, the sub-history leading to the second parent of the merge will get all the blame, but in the latter history, the sub-history leading to the first parent of the merge will have a chance to claim the responsibility for the shared part before the second parent has a say in the output. If we sorted the parents in the similarity order and gave the first refusal right to more similar parents before less similar ones, then the resulting output from git blame would be very similar in these two histories, which would be a very desirable property. If the only difference between the results of the merge in the former and the latter histories is one blank line at the end of the file in question, blames for the remaining part of the file should be assigned the same between the two histories, but the pass the entire blame to the second parent only when the second parent exactly matches hack gets in the way for that ideal, and sort the parents in similarity order will fix that. Of course, it would make the computation a lot more costly, but it would make the behaviour more predictable and understandable. But that is a different tangent. I think the new feature introduced by your change can be explained as 'git blame' uses the same history simplification as the commands in the 'git log'
Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
Junio C Hamano gits...@pobox.com writes: The pick the one that exactly matches if exists can be thought of an easy hack to hide the problems that come from this arbitrary choice. ... Instead, pass the whole blame to the one that exactly matches hack keeps larger blocks of text unsplit, clumping related contents together as long as possible while we traverse the history. It is an easy hack, because we only need to compare the object name, but a logical extension to it would have been to compute the similarity scores between the result and each of the parents, sort the parents by that similarity score order, and give more similar ones a chance to claim responsibility before less similar ones. We could call it favouring similar ones, i.e. --prefer-similar or something. Extending along the tangent further. Another thing that I found the argument in the proposed log message of the patch weak was that the claim that changed code will assign the blame to the same commit for both path b and c. There are two reasons why. One is that we do not look at b while chasing the ancestry of c, so if a different traverse order assigns the blame to the same commit for them, it is a mere happenstance. But a more important reason is that the changed code will still assign the blame for different commits if the final merge were made in the opposite direction. In your original topology, we skip over the first parent and give the whole blame to the second parent without the change, and with the change, we stop doing so and instead give some blame to the first parent and then allow the second parent a chance to claim the blame for the remainder. But in a history where the final merge went in the opposite direction, even with the change, we compare with the first parent (which was the second one in your original topology) with the result, find out that the contents exactly match, and that parent grabs the whole blame. So in that sense, the updated code that consistently gives earlier parents chance to claim the blame before later ones does not behave consitently on the same history with different merge parent order. That makes me think that the reason why the result you got with the change is better (assuming it is better) is _not_ because the updated code lets earlier parents give chance to claim the blame; it could be an indication that the keep larger blocks of text unsplit, clumping related contents together as long as possible heuristics is what prevents us from having a better result. If that is really the case, that would mean that letting the blame split early would give us a better result. I alluded to give more similar parents first chance to claim responsibility before less similar ones in the previous message, but perhaps this is indicating that we might get a better result if we did the opposite---instead of assigning blames to earlier parents and then to later ones, compare the result with each parent, order the parents by how few lines of blame they could claim if each of them were allowed to go first, and then actually compute and assign the blame in that order, favouring dissimilar ones. That may produce the result you are after in a more consistent way, regardless of the merge order. I think I've done thinking about this issue, at least for now. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
Bernhard R. Link brl...@debian.org writes: Allows to disable the git blame optimization of assuming that if there is a parent of a merge commit that has the exactly same file content, then only this parent is to be looked at. This optimization, while being faster in the usual case, means that in the case of cherry-picks the blamed commit depends on which other commits touched a file. If for example one commit A modified both files b and c. And there are commits B and C, B only modifies file b and C only modifies file c (so that no conflicts happen), and assume A is cherry-picked as A' and the two branches then merged: --o-B---A \ \ ---C---A'--M--- Then without this new option git blame blames the A|A' changes of file b to A while blaming the changes of c to A'. With the new option --prefer-first it blames both changes to the same commit and to the one more on the left side of the graph. Signed-off-by: Bernhard R. Link brl...@debian.org --- Documentation/blame-options.txt | 6 ++ builtin/blame.c | 7 +-- 2 files changed, 11 insertions(+), 2 deletions(-) Differences to first round: rename option and describe the effect instead of the implementation in documentation. I read the updated documentation three times but it still does not answer any of my questions I had in $gmane/239888, the most important part of which was: Yeah, the cherry-picked one will introduce the same change as the one that was cherry-picked, so if you look at the end result and ask where did _this_ line come from?, there are two equally plausible candidates, as blame output can give only one answer to each line. I still do not see why the one that is picked with the new option is better. At best, it looks to me that it is saying running with this option may (or may not) give a different answer, so run the command with and without it and see which one you like, which does not sound too useful to the end users. To put it another way, why/when would an end user choose to use this option? If the result of using this option is always better than without, why/when would an end user choose not to use this option? diff --git a/Documentation/blame-options.txt b/Documentation/blame-options.txt index 0cebc4f..b2e7fb8 100644 --- a/Documentation/blame-options.txt +++ b/Documentation/blame-options.txt @@ -48,6 +48,12 @@ include::line-range-format.txt[] Show the result incrementally in a format designed for machine consumption. +--prefer-first:: + If a line was introduced by two commits (for example via + a merged cherry-pick), prefer the commit that was + first merged in the history of always following the + first parent. + --encoding=encoding:: Specifies the encoding used to output author names and commit summaries. Setting it to `none` makes blame -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
* Junio C Hamano gits...@pobox.com [140113 23:31]: I read the updated documentation three times but it still does not answer any of my questions I had in $gmane/239888, the most important part of which was: Yeah, the cherry-picked one will introduce the same change as the one that was cherry-picked, so if you look at the end result and ask where did _this_ line come from?, there are two equally plausible candidates, as blame output can give only one answer to each line. I still do not see why the one that is picked with the new option is better. Because: - it will blame the modifications of merged cherry-picked commit to only one commit. Without the option parts of the modification will be reported as coming from the one, parts will be reported to be from the other. With the option only one of those two commits is reported as the origin at the same time and not both. - it is more predictable which commit is blamed, so if one is interested in where some commit was introduced first into a mainline, one gets this information, and not somtimes a different one due to unrelated reasons. To put it another way, why/when would an end user choose to use this option? If the result of using this option is always better than without, why/when would an end user choose not to use this option? While the result is more consistent and more predictable in the case of merged cherry picks, it is also slower in every case. Usually speed will be more important than this exactness, especially as the result will not differ for the common case (if there are no cherry-picked commits merged or when those commits do not touch any files that are otherwise only modified in the merged branch). Bernhard R. Link -- F8AC 04D5 0B9B 064B 3383 C3DA AFFC 96D1 151D FFDC -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
Bernhard R. Link brl+...@mail.brlink.eu writes: * Junio C Hamano gits...@pobox.com [140113 23:31]: I read the updated documentation three times but it still does not answer any of my questions I had in $gmane/239888, the most important part of which was: Yeah, the cherry-picked one will introduce the same change as the one that was cherry-picked, so if you look at the end result and ask where did _this_ line come from?, there are two equally plausible candidates, as blame output can give only one answer to each line. I still do not see why the one that is picked with the new option is better. Because: - it will blame the modifications of merged cherry-picked commit to only one commit. Without the option parts of the modification will be reported as coming from the one, parts will be reported to be from the other. With the option only one of those two commits is reported as the origin at the same time and not both. - it is more predictable which commit is blamed, so if one is interested in where some commit was introduced first into a mainline, one gets this information, and not somtimes a different one due to unrelated reasons. To put it another way, why/when would an end user choose to use this option? If the result of using this option is always better than without, why/when would an end user choose not to use this option? While the result is more consistent and more predictable in the case of merged cherry picks, it is also slower in every case. Consistent and predictable, perhaps, but I am not sure exact would be a good word. Wouldn't the result depend on which way the cherry pick went, and then the later merge went? In the particular topology you depicted in the log message, the end result may happen to point at the same commit for these two paths, but I am not sure how the change guarantees that we always point at the same original commit not the cherry-picked one, which was implied by the log message, if your cherry-pick and merge went in different direction in similar topologies. And that is why I said: At best, it looks to me that it is saying running with this option may (or may not) give a different answer, so run the command with and without it and see which one you like With the stress on different answer; it the change were with the option the result is always better, albeit you will have to wait longer, I would not have this much trouble accepting the change, though. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v2] blame: new option --prefer-first to better handle merged cherry-picks
Allows to disable the git blame optimization of assuming that if there is a parent of a merge commit that has the exactly same file content, then only this parent is to be looked at. This optimization, while being faster in the usual case, means that in the case of cherry-picks the blamed commit depends on which other commits touched a file. If for example one commit A modified both files b and c. And there are commits B and C, B only modifies file b and C only modifies file c (so that no conflicts happen), and assume A is cherry-picked as A' and the two branches then merged: --o-B---A \ \ ---C---A'--M--- Then without this new option git blame blames the A|A' changes of file b to A while blaming the changes of c to A'. With the new option --prefer-first it blames both changes to the same commit and to the one more on the left side of the graph. Signed-off-by: Bernhard R. Link brl...@debian.org --- Documentation/blame-options.txt | 6 ++ builtin/blame.c | 7 +-- 2 files changed, 11 insertions(+), 2 deletions(-) Differences to first round: rename option and describe the effect instead of the implementation in documentation. diff --git a/Documentation/blame-options.txt b/Documentation/blame-options.txt index 0cebc4f..b2e7fb8 100644 --- a/Documentation/blame-options.txt +++ b/Documentation/blame-options.txt @@ -48,6 +48,12 @@ include::line-range-format.txt[] Show the result incrementally in a format designed for machine consumption. +--prefer-first:: + If a line was introduced by two commits (for example via + a merged cherry-pick), prefer the commit that was + first merged in the history of always following the + first parent. + --encoding=encoding:: Specifies the encoding used to output author names and commit summaries. Setting it to `none` makes blame diff --git a/builtin/blame.c b/builtin/blame.c index 4916eb2..8ea34cf 100644 --- a/builtin/blame.c +++ b/builtin/blame.c @@ -45,6 +45,7 @@ static int incremental; static int xdl_opts; static int abbrev = -1; static int no_whole_file_rename; +static int prefer_first; static enum date_mode blame_date_mode = DATE_ISO8601; static size_t blame_date_width; @@ -1248,7 +1249,8 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt) porigin = find(sb, p, origin); if (!porigin) continue; - if (!hashcmp(porigin-blob_sha1, origin-blob_sha1)) { + if (!prefer_first + !hashcmp(porigin-blob_sha1, origin-blob_sha1)) { pass_whole_blame(sb, origin, porigin); origin_decref(porigin); goto finish; @@ -2247,7 +2249,8 @@ int cmd_blame(int argc, const char **argv, const char *prefix) static const char *contents_from = NULL; static const struct option options[] = { OPT_BOOL(0, incremental, incremental, N_(Show blame entries as we find them, incrementally)), - OPT_BOOL('b', NULL, blank_boundary, N_(Show blank SHA-1 for boundary commits (Default: off))), + OPT_BOOL(0, prefer-first, prefer_first, N_(Prefer blaming commits merged earlier)), + OPT_BOOL('b', NULL, blank_boundary, N_(Show blank SHA-1 for boundary commits (Default: ff))), OPT_BOOL(0, root, show_root, N_(Do not treat root commits as boundaries (Default: off))), OPT_BOOL(0, show-stats, show_stats, N_(Show work cost statistics)), OPT_BIT(0, score-debug, output_option, N_(Show output score for blame entries), OUTPUT_SHOW_SCORE), -- 1.8.5.1 Bernhard R. Link -- F8AC 04D5 0B9B 064B 3383 C3DA AFFC 96D1 151D FFDC -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html