On 4/9/19 11:56 AM, Barret Rhoden wrote:
Anyway, being able to look outside the current blame_chunk would help in
those scenarios. Specifically, I'm talking about letting blame_chunk()
point anywhere in the parent. Right now, it can only look in the
parent's part of the chunk passed to blame_chunk, which can be
relatively small.
I hacked up the ability to look outside of a diff chunk. The change to
the heuristic-independent part of the code was very minor, both in code
and in performance.
The change to make the fingerprinting algorithm from my RFC patch look
at the entire parent was pretty minor too - I can also cache the
fingerprints. The main drawback is performance, but Michael's new
fingerprinting code alleviates this.
Here's a quick analysis. When run on a 1000 line C file, with large
changes from an ignored commit, after the file has been paged in (so,
run twice):
not ignoring at all:
-------------
real 0m0.062s
user 0m0.042s
sys 0m0.021s
scan only in the parent chunk:
----------------------------
real 0m0.097s
user 0m0.085s
sys 0m0.012s
scan parent chunk, scan entire parent on failure:
-------------------------
real 0m1.773s
user 0m1.752s
sys 0m0.021s
scan the entire parent:
-----------------------
real 0m3.049s
user 0m3.024s
sys 0m0.024s
Scanning the parent chunk first helped a lot. Scanning the entire
parent is O(nr_parent_lines * nr_lines_changed). In my test file,
that's about 1000 * 600.
It still takes a little while even when checking the parent chunk first.
Let's call that one the 'smaller scan.'
Caching the fingerprints (meaning, calculate once and attach to the
blame_origin) didn't help much here. It's actually worse, possibly due
to fingerprinting more than I needed to.
smaller scan, without caching:
----------------
real 0m1.651s
user 0m1.626s
sys 0m0.025s
smaller scan, with caching:
----------------
real 0m1.774s
user 0m1.753s
sys 0m0.021s
Let's try Michael's new fingerprinting code (get_fingerprint() and
fingerprint_similarity())
smaller scan, caching:
-------------------
real 0m0.240s
user 0m0.215s
sys 0m0.025s
smaller scan, no caching:
----------------------
real 0m0.295s
user 0m0.266s
sys 0m0.029s
full parent scan, caching:
--------------------------
real 0m0.377s
user 0m0.356s
sys 0m0.021s
full parent scan, no caching:
-----------------------------
real 0m0.458s
user 0m0.430s
sys 0m0.028s
And for completenesss,
scan only in the parent chunk:
------------------------------
real 0m0.072s
user 0m0.048s
sys 0m0.024s
So first off, Michael's new fingerprinting is much better. Caching the
fingerprints also helps. Overall, it's fast enough now that I don't
notice the delay.
I'll reroll the patchset with the ability to find lines in the entire
parent, and see what you all think.
Barret