Hi all, This is a follow-up to the WIP-patches I posted last week [1], which are about improving performance of svn_diff (and therefor also blame on the client-side), especially for large files.
To summarize: the idea was (is) to strip off the identical prefix and suffix, and then letting the existing diff algorithm work on the remainder. As stated in [2], I ran into some difficulties to get the exact same result of the existing diff algorithm (because of too eager suffix stripping). I've spent some time searching for a "perfect" solution, but I can't find one. So now I'm stuck and would like to have some feedback on what to do with it. First: the thing works, and it can reduce the time needed for svn_diff by 20% up to 80% for large files (depending on the distribution of the changes). An extreme example is a reduction from ~150ms to ~30ms for a 60000-lines file with a small number of lines changed (say up to 100) located close together (so lots of identical prefix/suffix). Blame gets similar benefits, *if the server is fast enough*. I used svnserve built from Stefan^2's performance branch to test this. A normal svnserve with FSFS isn't really fast enough (unless maybe with an SSD, but I don't have one here) to serve up the 2500 revisions of the large file quickly enough. But the performance-enhanced-svnserve, with a hot cache filled with the necessary full-texts, is definitely fast enough to make the time of blame completely dependent on the client-side performance. Conclusion: the diff optimization we're talking about is much more useful for blame together with a server with the full-text membuffer caching of the performance branch. Now, the reason why I'm stuck: suffix scanning sometimes strips a little bit too much (see [2] for full explanation). In this case the resulting diff is still correct, but it's not as nice for the user. As I noted in [2], GNU diff also has this problem, but the user can help a little bit by specifying a large enough number of prefix/suffix-lines to keep (--horizon-lines). So, what to do? - I have not found a better solution than to use some form of "horizon-lines" to get the "nice" result, but still have a similar performance improvement (unless if I leave out suffix stripping entirely, only strip prefix, but that halves the power of the optimization). - I've tested with keeping up to 50 suffix-lines. It didn't have the slightest impact on the performance improvement (we're talking about stripping away 1000's of lines). This fixed all real-life examples I could find/test (diff/blame output is precisely identical to original svn). If I kept only 20 lines, I still ran into some differences (30 lines was enough for my example file, but I took it up to 50 to be sure). - I would like to avoid leaving the decision to the user to come up with an adequate number of suffix-lines-to-keep. So I'd like to just hardcode some value, say 50, or even 100. I think this would give good (=identical to original svn) diff/blame results in all but the most extreme cases. With suffix-lines-to-keep=50, you'd need to insert a block of text that has its last 50 lines identical to the 50 lines preceding the insertion point, to mess up the diff result. - If really necessary, we could say default=50, but it can be overridden by the user with another -x option. So the main question is: is such a change in behavior (unlikely to have an impact in most realistic cases) acceptable for this kind of performance improvement? Other options: - Come up with a clever solution to fix the optimization properly (but I don't know if that's possible -> suggestions welcome :-)). - Throw this optimization away, and take a look at something like "Patience diff" (supposed to be fast as well, and give nice results). However, I don't know Patience diff well enough to code up something, or even to judge whether it would give good results in all (or most) cases. - Just throw it away and don't look back :-) - Something else? Cheers, -- Johan [1] http://svn.haxx.se/dev/archive-2010-10/0032.shtml [2] http://svn.haxx.se/dev/archive-2010-10/0050.shtml