[RFC] Diff (blame) optimization: how to go forward

Johan Corveleyn Thu, 07 Oct 2010 16:44:59 -0700

Hi all,

This is a follow-up to the WIP-patches I posted last week [1], which
are about improving performance of svn_diff (and therefor also blame
on the client-side), especially for large files.


To summarize: the idea was (is) to strip off the identical prefix and
suffix, and then letting the existing diff algorithm work on the
remainder. As stated in [2], I ran into some difficulties to get the
exact same result of the existing diff algorithm (because of too eager
suffix stripping). I've spent some time searching for a "perfect"
solution, but I can't find one. So now I'm stuck and would like to
have some feedback on what to do with it.

First: the thing works, and it can reduce the time needed for svn_diff
by 20% up to 80% for large files (depending on the distribution of the
changes). An extreme example is a reduction from ~150ms to ~30ms for a
60000-lines file with a small number of lines changed (say up to 100)
located close together (so lots of identical prefix/suffix).

Blame gets similar benefits, *if the server is fast enough*. I used
svnserve built from Stefan^2's performance branch to test this. A
normal svnserve with FSFS isn't really fast enough (unless maybe with
an SSD, but I don't have one here) to serve up the 2500 revisions of
the large file quickly enough. But the performance-enhanced-svnserve,
with a hot cache filled with the necessary full-texts, is definitely
fast enough to make the time of blame completely dependent on the
client-side performance. Conclusion: the diff optimization we're
talking about is much more useful for blame together with a server
with the full-text membuffer caching of the performance branch.

Now, the reason why I'm stuck: suffix scanning sometimes strips a
little bit too much (see [2] for full explanation). In this case the
resulting diff is still correct, but it's not as nice for the user. As
I noted in [2], GNU diff also has this problem, but the user can help
a little bit by specifying a large enough number of
prefix/suffix-lines to keep (--horizon-lines).

So, what to do?
- I have not found a better solution than to use some form of
"horizon-lines" to get the "nice" result, but still have a similar
performance improvement (unless if I leave out suffix stripping
entirely, only strip prefix, but that halves the power of the
optimization).

- I've tested with keeping up to 50 suffix-lines. It didn't have the
slightest impact on the performance improvement (we're talking about
stripping away 1000's of lines). This fixed all real-life examples I
could find/test (diff/blame output is precisely identical to original
svn). If I kept only 20 lines, I still ran into some differences (30
lines was enough for my example file, but I took it up to 50 to be
sure).

- I would like to avoid leaving the decision to the user to come up
with an adequate number of suffix-lines-to-keep. So I'd like to just
hardcode some value, say 50, or even 100. I think this would give good
(=identical to original svn) diff/blame results in all but the most
extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
block of text that has its last 50 lines identical to the 50 lines
preceding the insertion point, to mess up the diff result.

- If really necessary, we could say default=50, but it can be
overridden by the user with another -x option.

So the main question is: is such a change in behavior (unlikely to
have an impact in most realistic cases) acceptable for this kind of
performance improvement?

Other options:
- Come up with a clever solution to fix the optimization properly (but
I don't know if that's possible -> suggestions welcome :-)).
- Throw this optimization away, and take a look at something like
"Patience diff" (supposed to be fast as well, and give nice results).
However, I don't know Patience diff well enough to code up something,
or even to judge whether it would give good results in all (or most)
cases.
- Just throw it away and don't look back :-)
- Something else?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0032.shtml
[2] http://svn.haxx.se/dev/archive-2010-10/0050.shtml

[RFC] Diff (blame) optimization: how to go forward

Reply via email to