Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

2014-01-14 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Junio C Hamano gits...@pobox.com writes:

 While the result is more consistent and more predictable in the case
 of merged cherry picks, it is also slower in every case.

 Consistent and predictable, perhaps, but I am not sure exact would
 be a good word.

 Another thing I am not enthusiasitc about this change is that I am
 afraid that this may make git blame -- path and git log -- path
 work inconsistenly.  The both cull side branches whenever one of the
 parents gave the resulting blob, even that parent is not the first
 one.  But git blame --prefer-first -- path, afaict, behaves quite
 differently from git log --first-parent -- path, even though they
 share similar option names, adding more to the confusion.

I think I am starting to understand why this patch felt wrong to me.
This wasn't about --first-parent at all, and you are correct that
you didn't call the option as such), but I somehow thought that they
were related; perhaps the fact that both disable the if the result
exactly matches one parent, all the other parents can be culled to
simplify the history logic blinded me.

In reality, the new flag is a lot closer in spirit to the total
opposite of --first-parent, i.e. --full-history.  That option
also disables that if same to one parent, other parents do not
matter logic, but its effect is quite different.  It makes the
other histories that did not have to have contributed the end result
shown in the output.

Now, when we step back and think about how the normal git blame
logic apportions the blame to multiple parents when there is no
exact match, it does so in a pretty arbitrary way.  It lets earlier
parents to claim the responsibility and later parents only get
leftover contents that weren't claimed by the earlier ones.  We can
call that favouring earler ones, i.e. --prefer-first.

It was implemented this way, not because this order makes any sense,
but primarily because no order is particularly better than any
other, and the designer (me) happened to have picked the easiest one
at random.

The pick the one that exactly matches if exists can be thought of
an easy hack to hide the problems that come from this arbitrary
choice.  Without it, if the result matches the second parent (i.e. a
typical merge of a work done on the topic branch while the mainline
has been quiescent in the same area), the give earlier parents a
chance to claim responsiblity before later ones rule would have
split the blame for parts that weren't changed in the side branch
topic to the mainline and blame would have been passed to the side
branch only for the portion that were changed by the side branch.
Instead, pass the whole blame to the one that exactly matches hack
keeps larger blocks of text unsplit, clumping related contents
together as long as possible while we traverse the history.

It is an easy hack, because we only need to compare the object
name, but a logical extension to it would have been to compute the
similarity scores between the result and each of the parents, sort
the parents by that similarity score order, and give more similar
ones a chance to claim responsibility before less similar ones.
We could call it favouring similar ones, i.e. --prefer-similar
or something.

That would have made the result more stable.  Imagine that in one
history, a merge's result matchs exactly the second parent, and in
another history, a merge's result almost matches exactly the second
parent but the difference is the result adds one blank line at the
end of the file relative to what the second parent has.

With the current code, blaming the file will get quite a different
result.

In the former history, the sub-history leading to the second parent
of the merge will get all the blame, but in the latter history, the
sub-history leading to the first parent of the merge will have a
chance to claim the responsibility for the shared part before the
second parent has a say in the output.

If we sorted the parents in the similarity order and gave the first
refusal right to more similar parents before less similar ones, then
the resulting output from git blame would be very similar in these
two histories, which would be a very desirable property.  If the
only difference between the results of the merge in the former and
the latter histories is one blank line at the end of the file in
question, blames for the remaining part of the file should be
assigned the same between the two histories, but the pass the
entire blame to the second parent only when the second parent
exactly matches hack gets in the way for that ideal, and sort the
parents in similarity order will fix that.

Of course, it would make the computation a lot more costly, but it
would make the behaviour more predictable and understandable.

But that is a different tangent.

I think the new feature introduced by your change can be explained
as 'git blame' uses the same history simplification as the commands
in the 'git log' 

Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

2014-01-14 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 The pick the one that exactly matches if exists can be thought of
 an easy hack to hide the problems that come from this arbitrary
 choice.  ...
 Instead, pass the whole blame to the one that exactly matches hack
 keeps larger blocks of text unsplit, clumping related contents
 together as long as possible while we traverse the history.

 It is an easy hack, because we only need to compare the object
 name, but a logical extension to it would have been to compute the
 similarity scores between the result and each of the parents, sort
 the parents by that similarity score order, and give more similar
 ones a chance to claim responsibility before less similar ones.
 We could call it favouring similar ones, i.e. --prefer-similar
 or something.

Extending along the tangent further.

Another thing that I found the argument in the proposed log message
of the patch weak was that the claim that changed code will assign
the blame to the same commit for both path b and c.  There are two
reasons why.  One is that we do not look at b while chasing the
ancestry of c, so if a different traverse order assigns the blame to
the same commit for them, it is a mere happenstance.  But a more
important reason is that the changed code will still assign the
blame for different commits if the final merge were made in the
opposite direction.  In your original topology, we skip over the
first parent and give the whole blame to the second parent without
the change, and with the change, we stop doing so and instead give
some blame to the first parent and then allow the second parent a
chance to claim the blame for the remainder.  But in a history where
the final merge went in the opposite direction, even with the
change, we compare with the first parent (which was the second
one in your original topology) with the result, find out that the
contents exactly match, and that parent grabs the whole blame.  So
in that sense, the updated code that consistently gives earlier
parents chance to claim the blame before later ones does not behave
consitently on the same history with different merge parent order.

That makes me think that the reason why the result you got with the
change is better (assuming it is better) is _not_ because the
updated code lets earlier parents give chance to claim the blame; it
could be an indication that the keep larger blocks of text unsplit,
clumping related contents together as long as possible heuristics
is what prevents us from having a better result.

If that is really the case, that would mean that letting the blame
split early would give us a better result.  I alluded to give more
similar parents first chance to claim responsibility before less
similar ones in the previous message, but perhaps this is
indicating that we might get a better result if we did the
opposite---instead of assigning blames to earlier parents and then
to later ones, compare the result with each parent, order the
parents by how few lines of blame they could claim if each of them
were allowed to go first, and then actually compute and assign the
blame in that order, favouring dissimilar ones.  That may produce
the result you are after in a more consistent way, regardless of the
merge order.

I think I've done thinking about this issue, at least for now.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

2014-01-13 Thread Junio C Hamano
Bernhard R. Link brl...@debian.org writes:

 Allows to disable the git blame optimization of assuming that if there is a
 parent of a merge commit that has the exactly same file content, then
 only this parent is to be looked at.

 This optimization, while being faster in the usual case, means that in
 the case of cherry-picks the blamed commit depends on which other commits
 touched a file.

 If for example one commit A modified both files b and c. And there are
 commits B and C, B only modifies file b and C only modifies file c
 (so that no conflicts happen), and assume A is cherry-picked as A'
 and the two branches then merged:

 --o-B---A
\ \
 ---C---A'--M---

 Then without this new option git blame blames the A|A' changes of
 file b to A while blaming the changes of c to A'.
 With the new option --prefer-first it blames both changes to the
 same commit and to the one more on the left side of the graph.

 Signed-off-by: Bernhard R. Link brl...@debian.org
 ---
  Documentation/blame-options.txt | 6 ++
  builtin/blame.c | 7 +--
  2 files changed, 11 insertions(+), 2 deletions(-)

  Differences to first round: rename option and describe the effect
  instead of the implementation in documentation.

I read the updated documentation three times but it still does not
answer any of my questions I had in $gmane/239888, the most
important part of which was:

Yeah, the cherry-picked one will introduce the same change as
the one that was cherry-picked, so if you look at the end result
and ask where did _this_ line come from?, there are two
equally plausible candidates, as blame output can give only
one answer to each line.  I still do not see why the one that is
picked with the new option is better.  At best, it looks to me
that it is saying running with this option may (or may not)
give a different answer, so run the command with and without it
and see which one you like, which does not sound too useful to
the end users.

To put it another way, why/when would an end user choose to use this
option?  If the result of using this option is always better than
without, why/when would an end user choose not to use this option?

 diff --git a/Documentation/blame-options.txt b/Documentation/blame-options.txt
 index 0cebc4f..b2e7fb8 100644
 --- a/Documentation/blame-options.txt
 +++ b/Documentation/blame-options.txt
 @@ -48,6 +48,12 @@ include::line-range-format.txt[]
   Show the result incrementally in a format designed for
   machine consumption.
  
 +--prefer-first::
 + If a line was introduced by two commits (for example via
 + a merged cherry-pick), prefer the commit that was
 + first merged in the history of always following the
 + first parent.
 +
  --encoding=encoding::
   Specifies the encoding used to output author names
   and commit summaries. Setting it to `none` makes blame
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

2014-01-13 Thread Bernhard R. Link
* Junio C Hamano gits...@pobox.com [140113 23:31]:
 I read the updated documentation three times but it still does not
 answer any of my questions I had in $gmane/239888, the most
 important part of which was:

 Yeah, the cherry-picked one will introduce the same change as
 the one that was cherry-picked, so if you look at the end result
 and ask where did _this_ line come from?, there are two
 equally plausible candidates, as blame output can give only
 one answer to each line.  I still do not see why the one that is
 picked with the new option is better.

Because:
  - it will blame the modifications of merged cherry-picked commit
to only one commit. Without the option parts of the modification
will be reported as coming from the one, parts will be reported
to be from the other. With the option only one of those two commits
is reported as the origin at the same time and not both.
  - it is more predictable which commit is blamed, so if one is
interested in where some commit was introduced first into a
mainline, one gets this information, and not somtimes a different
one due to unrelated reasons.

 To put it another way, why/when would an end user choose to use this
 option?  If the result of using this option is always better than
 without, why/when would an end user choose not to use this option?

While the result is more consistent and more predictable in the case
of merged cherry picks, it is also slower in every case. Usually speed
will be more important than this exactness, especially as the result
will not differ for the common case (if there are no cherry-picked
commits merged or when those commits do not touch any files that are
otherwise only modified in the merged branch).

Bernhard R. Link
-- 
F8AC 04D5 0B9B 064B 3383  C3DA AFFC 96D1 151D FFDC
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

2014-01-13 Thread Junio C Hamano
Bernhard R. Link brl+...@mail.brlink.eu writes:

 * Junio C Hamano gits...@pobox.com [140113 23:31]:
 I read the updated documentation three times but it still does not
 answer any of my questions I had in $gmane/239888, the most
 important part of which was:

 Yeah, the cherry-picked one will introduce the same change as
 the one that was cherry-picked, so if you look at the end result
 and ask where did _this_ line come from?, there are two
 equally plausible candidates, as blame output can give only
 one answer to each line.  I still do not see why the one that is
 picked with the new option is better.

 Because:
   - it will blame the modifications of merged cherry-picked commit
 to only one commit. Without the option parts of the modification
 will be reported as coming from the one, parts will be reported
 to be from the other. With the option only one of those two commits
 is reported as the origin at the same time and not both.
   - it is more predictable which commit is blamed, so if one is
 interested in where some commit was introduced first into a
 mainline, one gets this information, and not somtimes a different
 one due to unrelated reasons.

 To put it another way, why/when would an end user choose to use this
 option?  If the result of using this option is always better than
 without, why/when would an end user choose not to use this option?

 While the result is more consistent and more predictable in the case
 of merged cherry picks, it is also slower in every case.

Consistent and predictable, perhaps, but I am not sure exact would
be a good word.

Wouldn't the result depend on which way the cherry pick went, and
then the later merge went?  In the particular topology you depicted
in the log message, the end result may happen to point at the same
commit for these two paths, but I am not sure how the change
guarantees that we always point at the same original commit not the
cherry-picked one, which was implied by the log message, if your
cherry-pick and merge went in different direction in similar
topologies.

And that is why I said:

At best, it looks to me that it is saying running with this
option may (or may not) give a different answer, so run the
command with and without it and see which one you like

With the stress on different answer; it the change were with the
option the result is always better, albeit you will have to wait
longer, I would not have this much trouble accepting the change,
though.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC v2] blame: new option --prefer-first to better handle merged cherry-picks

2014-01-12 Thread Bernhard R. Link
Allows to disable the git blame optimization of assuming that if there is a
parent of a merge commit that has the exactly same file content, then
only this parent is to be looked at.

This optimization, while being faster in the usual case, means that in
the case of cherry-picks the blamed commit depends on which other commits
touched a file.

If for example one commit A modified both files b and c. And there are
commits B and C, B only modifies file b and C only modifies file c
(so that no conflicts happen), and assume A is cherry-picked as A'
and the two branches then merged:

--o-B---A
   \ \
---C---A'--M---

Then without this new option git blame blames the A|A' changes of
file b to A while blaming the changes of c to A'.
With the new option --prefer-first it blames both changes to the
same commit and to the one more on the left side of the graph.

Signed-off-by: Bernhard R. Link brl...@debian.org
---
 Documentation/blame-options.txt | 6 ++
 builtin/blame.c | 7 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

 Differences to first round: rename option and describe the effect
 instead of the implementation in documentation.

diff --git a/Documentation/blame-options.txt b/Documentation/blame-options.txt
index 0cebc4f..b2e7fb8 100644
--- a/Documentation/blame-options.txt
+++ b/Documentation/blame-options.txt
@@ -48,6 +48,12 @@ include::line-range-format.txt[]
Show the result incrementally in a format designed for
machine consumption.
 
+--prefer-first::
+   If a line was introduced by two commits (for example via
+   a merged cherry-pick), prefer the commit that was
+   first merged in the history of always following the
+   first parent.
+
 --encoding=encoding::
Specifies the encoding used to output author names
and commit summaries. Setting it to `none` makes blame
diff --git a/builtin/blame.c b/builtin/blame.c
index 4916eb2..8ea34cf 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -45,6 +45,7 @@ static int incremental;
 static int xdl_opts;
 static int abbrev = -1;
 static int no_whole_file_rename;
+static int prefer_first;
 
 static enum date_mode blame_date_mode = DATE_ISO8601;
 static size_t blame_date_width;
@@ -1248,7 +1249,8 @@ static void pass_blame(struct scoreboard *sb, struct 
origin *origin, int opt)
porigin = find(sb, p, origin);
if (!porigin)
continue;
-   if (!hashcmp(porigin-blob_sha1, origin-blob_sha1)) {
+   if (!prefer_first 
+   !hashcmp(porigin-blob_sha1, origin-blob_sha1)) {
pass_whole_blame(sb, origin, porigin);
origin_decref(porigin);
goto finish;
@@ -2247,7 +2249,8 @@ int cmd_blame(int argc, const char **argv, const char 
*prefix)
static const char *contents_from = NULL;
static const struct option options[] = {
OPT_BOOL(0, incremental, incremental, N_(Show blame entries 
as we find them, incrementally)),
-   OPT_BOOL('b', NULL, blank_boundary, N_(Show blank SHA-1 for 
boundary commits (Default: off))),
+   OPT_BOOL(0, prefer-first, prefer_first, N_(Prefer blaming 
commits merged earlier)),
+   OPT_BOOL('b', NULL, blank_boundary, N_(Show blank SHA-1 for 
boundary commits (Default: ff))),
OPT_BOOL(0, root, show_root, N_(Do not treat root commits 
as boundaries (Default: off))),
OPT_BOOL(0, show-stats, show_stats, N_(Show work cost 
statistics)),
OPT_BIT(0, score-debug, output_option, N_(Show output score 
for blame entries), OUTPUT_SHOW_SCORE),
-- 
1.8.5.1

Bernhard R. Link
-- 
F8AC 04D5 0B9B 064B 3383  C3DA AFFC 96D1 151D FFDC
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html