On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin <philip.mar...@wandisco.com> wrote: > Johan Corveleyn <jcor...@gmail.com> writes: >> On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej <br...@xbc.nu> wrote: >>> svn_diff uses basically the same algorithm as GNU diff but implemented >>> slightly differently and IIRC it doesn't have some of GNU diff's >>> optimizations. I'm sure it can be speeded up, but haven't a clue about >>> how much. >> >> Ok, thanks. In the meantime I saw that there is not that much >> difference anymore between GNU diff and svn_diff, after running the >> latter from a release build, and disabling my anti-virus (which makes >> me wonder why my anti-virus slows down svn_diff (impact when opening >> the modified datasource), but not on GNU diff). > > The big difference between Subversion's diff and GNU diff is that GNU > uses heuristics to cut short the diff algorithm whereas Subversion > grinds on to the end. Certain files trigger the heuristics and then > GNU diff is much faster than Subversion. Typically the file has a > large number of matches and non-matches repeated through the file, > machine generated files sometimes fit this pattern. > > GNU diff's heuristics work well so when they trigger the resulting > diff is usually good enough. They can be disabled using the --minimal > option and using that makes GNU diff performance much more like > Subversion.
Hmm, I thought harder about this. If I try --minimal with GNU diff, it's just as fast, still significantly faster than svn diff. Then I read a little bit more about diff algorithms, like the blog entry that Greg Hudson suggested in the other thread [1], about the Patience Diff algorithm [2]. In there the following two lines caught my eye, first steps in the diff algo: [[[ 1) Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match. 2) Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match. ]]] (then the longest common subsequence algorithm kicks in) Now, these steps are not specific to Patience Diff, they are common to most diff algorithms. And I bet it's a much more efficient to exclude the common prefix/suffix this way, than to include them into the entire lcs algorithm. Right now, svn diff doesn't do this AFAICS. However, GNU diff does, and I think that's where most of the performance difference comes from. For big files, where a couple of lines are modified somewhere in the middle (and the first 30000 and last 30000 lines are identical), I think this can make a big difference. Thoughts? If no-one else fancies this, I think I'll take a stab at implementing this optimization (first with a POC to see if it's significant). Unless feedback tells me it's not worth it, not a good idea, ... Cheers, -- Johan [1] http://svn.haxx.se/dev/archive-2010-08/0411.shtml [2] http://bramcohen.livejournal.com/73318.html