On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <mor...@gmail.com> wrote:
> On Fri, May 27, 2011 at 11:00 PM, Morten Kloster <mor...@gmail.com> wrote:
>> [[[
>> Speeds up LCS algorithm for well-separated sections of change.
>> By exploiting unique matches between the two files, it is possible to
>> "restart" the LCS algorithm at times and reset the build-up of
>> quadratic computational complexity.
>>
>> * subversion/libsvn_diff/lcs.c
>> (svn_diff__snake_t): Added uniquematchcount member.
>> (svn_diff__snake): Increments uniquematchcount each time a matching
>> token occurs only once in each file.
>> (clear_fp): New function that removes nonviable states.
>> (svn_diff__lcs): Restarts the LCS algorithm from p=0 if the number of
>> unique matches exceeds the number of mismatches, using the first
>> state where that occurs as the new start state.
>> ]]]
>>
>> This patch is a partial implementation of a possible enhancement to
>> the LCS algorithm. The patch builds on the patch in the thread
>> "[PATCH] Speed-up of libsvn_diff using token counts". I have only
>> included the lcs.c file in this patch; the other files from that
>> patch must already be applied.
>>
>>
>> Theory behind the enhancement:
>> If the number of unique matches - tokens that occur only once in each
>> file - found so far exceeds the number of mismatches so far, and this
>> is the FIRST time that happens, then we are guaranteed that that
>> state is part of the true LCS, since eliminating some number of the
>> mismatches would necessarily create more new mismatches among the
>> current unique matches. We can therefore "start over" from the
>> current state and stop the build-up of computational complexity
>> represented by the 'p' variable. By also resetting the number of
>> unique matches found to 0, we can repeat this process many times.
>>
>> This implementation is partial in the sense that it does not reach
>> the full potential of the enhancement. It will still scan at least
>> 'd' states for each value of 'p', whereas an optimal implementation
>> would interrupt the scan if there is a mismatch for what would have
>> been a unique match (since this always makes the potential LCS worse,
>> regardless of whether the change is an insertion or a deletion).
>> However, a full implementation will be significantly more
>> complicated, as far as I can see.
>>
>> The current implementation can reduce the average computational
>> complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)), while
>> an optimal implementation would be O(sum((p_n + d_n)*p_n)). Here
>> p_max is the total number of mismatches in the shorter file, while
>> p_n and d_n are the lesser number of mismatches and the length
>> difference between the files within each area of difference that is
>> separated from other areas of difference by a large number of (not
>> necessarily contiguous) unique matches. The current version is only
>> really useful if there is a similar number of mismatches in each
>> file, i.e. if there is a similar number of insertions and deletions
>> (not counting tokens that are unique to one of the files).
>>
>> Johan: I initially thought it would be easiest to give unique matches
>> higher weight in the diff algorithm. This would have meant a change
>> in the definition of the optimal diff - rather than asking simply for
>> the largest number of matching tokens, we would prefer diffs with
>> slightly fewer total matches but more unique matches.
>> It would not be a heuristic, since we would still get a guaranteed
>> optimal diff under the new optimality criterion, but it would no
>> longer be an LCS. However, it turned out to be easier to keep the
>> same weights.
>>
>>
>> The patch will need significant changes to integrate with the idx
>> patch (also a libsvn_diff speed-up). The current version doesn't give
>> much speed improvement for most of my tests so far, but can give very
>> large improvements for "ideal" cases (many small areas of non-unique
>> changes that are separated by many unique matches, and where the
>> files are of equal length).
>>
>> I'm not sure whether the current version (integrated with idx) is
>> worthwhile. I'll look into making a full implementation of the
>> enhancement.
>>
>>
>> Morten
>>
> Here's an updated patch based on the current HEAD, and with one or two
> minor improvements. While, as stated, the patch can give many-fold
> speed increases for ideal cases, it also slows down the LCS algorithm
> significantly for poorly matching files; I got about an 18% increase
> in running time on my system for files that were very poor matches but
> had no lines unique to either file.
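A minimal sketch of the restart rule described above -- using simplified,
made-up names rather than the actual svn_diff__snake_t / svn_diff__lcs
code in the patch -- might look like this:

[[[
/* Illustrative only; not the patch's lcs.c code.  Each search state
 * carries, besides its position, the number of mismatches and the
 * number of "unique matches" (tokens occurring exactly once in each
 * file) accumulated since the last restart. */
typedef struct restart_state_t
{
  int mismatches;      /* mismatches since the last restart ('p' so far) */
  int unique_matches;  /* unique matches since the last restart          */
} restart_state_t;

/* Once the unique matches outnumber the mismatches, any alternative
 * path that avoids some of those mismatches must turn even more of the
 * unique matches into mismatches, so this state is guaranteed to lie on
 * the true LCS and can become a fresh origin with p reset to 0. */
static int
can_restart(const restart_state_t *state)
{
  return state->unique_matches > state->mismatches;
}
]]]

The first state satisfying this test would become the new start state;
resetting its unique_matches to 0 then lets the same test fire again in
the next well-separated area of change, which is what keeps the build-up
of 'p' local to each area instead of global to the whole file pair.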
Hi Morten,

Hmmm, that sounds like a lot of overhead in some cases (I actually wonder
why -- I haven't fully understood the implementation, but at first sight
it doesn't seem that computationally intensive).

But the most important question that's bugging me: do those "ideal cases"
occur often in real-world examples? I'm sure you could craft good
examples, but how likely am I to benefit from this in my average
repository, given that the other optimizations already take away a lot of
the work (prefix/suffix elimination; elimination of non-matching lines
due to token counting; ...)?

Can you give examples from Subversion's own repository, or other public
repositories, where it can have a big impact? Or are you looking at this
for some specific purpose, some repository where this situation occurs
regularly and has a big impact (huge LCSes ...)?

-- Johan