Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines

Barret Rhoden Tue, 09 Apr 2019 08:57:11 -0700

Hi -

On 4/7/19 5:46 PM, [email protected] wrote:

From: Michael Platings <[email protected]>


Hi Barret,
This is the updated fuzzy matching algorithm, sorry for the delay. It does
highlight a bug in the calculation for the number of lines ("int nr_parent_lines
  = e->num_lines - delta;") - if you apply the patch, build it, then try to
./git blame --ignore-rev <the patch commit ID> blame.c then you'll get a 
segfault
because nr_parent_lines is a negative number. I haven't had time to investigate 
further
but I have confirmed that the bug is not due to my patch.

Although I couldn't recreate this, I saw how it could happen. I have anupdated version that fixes it. Short version: I just pass throughparent_len instead of trying to recreate it with 'delta.' Recreatingcan go awry because e->num_lines may be less than a chunk size.

The matching algorithm might not be obvious so it could do with more commenting.
In the mean time I hope the tests will make the intent clear. In particular I
want to avoid lines being reordered, because for the interesting use cases
usually sequences are unchanged even if they shift across different lines.

I thought you wanted to detect similarity across potentially reorderedlines. E.g.


commit a 10) #include <sys/a.h>
commit b 11) #include <foo/b.h>
commit c 12) #include <bar/c.h>
commit d 13) #include <dir/d.h>

Where commit X (to be ignored) alphabetized them (or something):

commit X 10) #include <bar/c.h>
commit X 11) #include <dir/d.h>
commit X 12) #include <foo/b.h>
commit X 13) #include <sys/a.h>

In this case, you want to find the line number in the parent thatcorresponds to each line in the target (X is the target, in the relevantblame code). The mapping is: (target -> parent)


10 -> 12
11 -> 13
12 -> 11
13 -> 10

And the parent's line numbers are out of order. But that's fine - atleast with the infrastructure from my patch, since it sorts the ignoredqueue at the end of blame_chunk(). The rule in blame.c is that blameentries attached to a suspect need to be sorted by s_lno. (At leastbased on a comment, and my reading of the code).

Maybe we have a different definition of reordering or are havingdifferent issues regarding line numbers? TBH, I didn't follow therecursion and partitioning strategy too closely.

Regarding the existing implementation I've got to say I find it unhelpful
marking "unblameable" lines with a 000000 commit ID. That commit ID already has
a meaning - lines that aren't yet committed. Further, the purpose of ignoring
commits should be to avoid obscuring other useful information, not to absolutely
refuse to show that commit at all. If there's no other commit to show then it's
harmless to show the commit that would otherwise be ignored.

For me, if I tell git blame to not tell me about a commit, then I don'twant to hear about it. My typical blame session involves running blame,then digging up the commit it pointed to. If it points to the commit Ialready told it to not show, then it'll just annoy me. All 0s madesense to me as in "don't bother looking this up."

If you *did* want to see the ignored commit, then I'd run it with--ignore-revs-file="", to disable ignore.

That being said, I realize this is a preference of mine, and I can makea blame config option to control this. Probablyblame.maskIgnoredCommits or something.

- How about matching *outside* the parent's diff hunk?
I'd like to know what the use case would be for that. For the use case of
looking "through" a reformatting or renaming commit I think it would be 
unhelpful.

I noticed that I have some commits where a hunk in a diff is broken upinto multiple calls to blame_chunk. So a better way to prhase it wouldbe "looking outside a blame_chunk()'s chunk."

For example, I have a hunk like this (modified to show the contiguouslines of context, subtraction, addition):


@@ -256,99 +260,99 @@
  3 context
-83 subtract
+23 addition
  1 context
- 2 sub
+16 add
  1 context
- 5 sub
+ 6 add
  1 context
+45 add
  3 context

blame_chunk() sees that as four separate calls, broken up by the linesof context:


        blame_chunk tlno 262 off -4 same 285 plen 83 enl 23
        blame_chunk tlno 286 off 56 same 302 plen 2 enl 16
        blame_chunk tlno 303 off 42 same 309 plen 5 enl 6
        blame_chunk tlno 310 off 41 same 355 plen 0 enl 45

For a lot of those lines, the source of the change was in another diffhunk - especially that 45 line addition with nothing from the parent.

Part of the reason for that large diff is whitespace changes, so I canrun blame with -w. But there are other scenarios. Here's another one:the header file alphabetizing case:


@@ -4,9 +4,9 @@
  3 context
- 1 #include <header.h>
  2 context (other headers)
+ 1 #include <header.h>
  3 context

That shows up as two blame chunks:
        blame_chunk tlno 6 off 0 same 6 plen 1 enl 0
        blame_chunk tlno 8 off 1 same 9 plen 0 enl 1

What's funny about it is that if the commit we're ignoring did moredamage to the file (changed every line, for instance), it'd be easier tofind the right line in the parent, since we'd have the entire diff hunk.

Anyway, being able to look outside the current blame_chunk would help inthose scenarios. Specifically, I'm talking about letting blame_chunk()point anywhere in the parent. Right now, it can only look in theparent's part of the chunk passed to blame_chunk, which can berelatively small.


Barret

Re: [PATCH v5 6/6] RFC blame: use a fingerprint heuristic to match ignored lines

Reply via email to