Here's an updated patch based on the current HEAD, with one or two minor improvements. While, as stated, the patch can give many-fold speed increases in ideal cases, it also slows down the LCS algorithm significantly for poorly matching files; I measured about an 18% increase in running time on my system for files that were very poor matches but had no lines unique to either file.
On Fri, May 27, 2011 at 11:00 PM, Morten Kloster <mor...@gmail.com> wrote:
> [[[
> Speeds up LCS algorithm for well-separated sections of change
>
> By exploiting unique matches between the two files, it is possible to
> "restart" the LCS algorithm at times and reset the buildup of square
> computational complexity.
>
> * subversion/libsvn_diff/lcs.c
>   (svn_diff__snake_t): Added uniquematchcount member.
>   (svn_diff__snake): Increments uniquematchcount each time
>    a matching token occurs only once in each file.
>   (clear_fp): New function that removes nonviable states.
>   (svn_diff__lcs): Restarts the LCS algorithm from p=0 if the number
>    of unique matches exceeds the number of mismatches, using
>    the first state where that occurs as the new start state.
> ]]]
>
> This patch is a partial implementation of a possible enhancement to
> the LCS algorithm. The patch builds on the patch in the thread
> "[PATCH] Speed-up of libsvn_diff using token counts". I have only
> included the lcs.c file in this patch; the other files in that patch
> must already be applied.
>
>
> Theory behind the enhancement:
> If the number of unique matches - tokens that occur only once in each
> file - found so far exceeds the number of mismatches so far, and this
> is the FIRST time that happens, then we are guaranteed that that
> state is part of the true LCS, since eliminating some number of the
> mismatches would necessarily add more new mismatches among the
> current unique matches. We can thus "start over" from the current
> state and stop the build-up of computational complexity
> represented by the 'p' variable. By also resetting the number
> of unique matches found to 0, we can repeat this process many
> times.
>
> This implementation is partial in the sense that it does not reach the
> full potential of the enhancement.
> It will still scan at least 'd' states
> for each value of 'p', whereas an optimal implementation would
> interrupt the scan if there is a mismatch for what would have been
> a unique match (since this always makes the potential LCS worse,
> regardless of whether the change is an insertion or a deletion).
> However, a full implementation will be significantly more complicated,
> as far as I can see.
>
> The current implementation can reduce the average computational
> complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)),
> while an optimal implementation would be O(sum((p_n + d_n)*p_n)).
> Here p_max is the total number of mismatches in the shorter file,
> while p_n and d_n are the lesser number of mismatches and the
> length difference between the files within each area of difference that
> is separated from other areas of difference by a large number of
> (not necessarily contiguous) unique matches. The current version
> is only really useful if there is a similar number of mismatches in
> each file, i.e. if there is a similar number of insertions and deletions
> (not counting tokens that are unique to one of the files).
>
> Johan: I initially thought it would be easiest to give unique matches
> higher weight in the diff algorithm. This would have meant a change
> in the definition of the optimal diff - rather than asking for simply the
> largest number of matching tokens, we would prefer diffs with slightly
> fewer total matches but more unique matches. It would not be a
> heuristic, since we would still get a guaranteed optimal diff given the
> new optimality criterion, but it would no longer be an LCS.
> However, it turned out to be easier to keep the same weights.
>
>
> The patch will need significant changes to integrate with the idx
> patch (also a libsvn_diff speed-up).
> The current version doesn't give
> much speed improvement for most of my tests so far, but can
> give very large improvements for "ideal" cases (many small areas
> of non-unique changes that are separated by many unique matches,
> and where the files are of equal length).
>
> I'm not sure whether the current version (integrated with idx)
> is worthwhile. I'll look into making a full implementation of the
> enhancement.
>
>
> Morten
>
Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1133037)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -77,6 +77,7 @@
   apr_off_t y;
   svn_diff__lcs_t *lcs;
   svn_diff__position_t *position[2];
+  svn_diff__token_index_t uniquematchcount;
 };
 
 static APR_INLINE void
@@ -89,6 +90,7 @@
   svn_diff__position_t *position[2];
   svn_diff__lcs_t *lcs;
   svn_diff__lcs_t *previous_lcs;
+  svn_diff__token_index_t uniquematchcount;
 
   /* The previous entry at fp[k] is going to be replaced. See if we
    * can mark that lcs node for reuse, because the sequence up to this
@@ -111,6 +113,7 @@
     {
       start_position[0] = fp_k[-1].position[0];
       start_position[1] = fp_k[-1].position[1]->next;
+      uniquematchcount = fp_k[-1].uniquematchcount;
       previous_lcs = fp_k[-1].lcs;
     }
@@ -118,6 +121,7 @@
     {
       start_position[0] = fp_k[1].position[0]->next;
       start_position[1] = fp_k[1].position[1];
+      uniquematchcount = fp_k[1].uniquematchcount;
       previous_lcs = fp_k[1].lcs;
     }
@@ -139,6 +143,8 @@
     {
       while (position[0]->token_index == position[1]->token_index)
         {
+          if (token_counts[0][position[0]->token_index] == 1 && token_counts[1][position[0]->token_index] == 1)
+            uniquematchcount++;
           position[0] = position[0]->next;
           position[1] = position[1]->next;
         }
@@ -181,6 +187,7 @@
 
   fp_k[0].position[1] = position[1];
   fp_k[0].y = position[1]->offset;
+  fp_k[0].uniquematchcount = uniquematchcount;
 }
 
@@ -228,6 +235,42 @@
 }
 
 
+static void
+clear_fp(svn_diff__snake_t *fp,
+         const apr_off_t k,
+         const apr_off_t min_index,
+         const apr_off_t max_index,
+         svn_diff__lcs_t **lcs_freelist)
+{
+  svn_diff__snake_t best;
+  svn_diff__lcs_t *lcs, *prev_lcs;
+  apr_off_t index;
+
+  best = fp[k];
+  best.lcs->refcount++;
+  for (index = min_index; index <= max_index; index++)
+    {
+      lcs = fp[index].lcs;
+      fp[index].y = 0;
+      fp[index].lcs = NULL;
+      fp[index].uniquematchcount = 0;
+      while (lcs)
+        {
+          lcs->refcount--;
+          if (lcs->refcount)
+            break;
+
+          prev_lcs = lcs->next;
+          lcs->next = *lcs_freelist;
+          *lcs_freelist = lcs;
+          lcs = prev_lcs;
+        }
+    }
+  fp[k] = best;
+  fp[k].uniquematchcount = 0;
+}
+
+
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -245,6 +288,9 @@
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
+  svn_diff__token_index_t limit;
+  apr_off_t min_index;
+  apr_off_t max_index;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
@@ -340,17 +386,32 @@
   p = 0;
   do
     {
+      min_index = (d<0 ? d : 0) - p;
+      max_index = (d>0 ? d : 0) + p;
+      limit = (d >= 0 ? 2 * p + d : 2 * p - d);
       /* For k < 0, insertions are free */
-      for (k = (d < 0 ? d : 0) - p; k < 0; k++)
+      for (k = min_index; k < 0; k++)
         {
           svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
+          if (fp[k].uniquematchcount > limit + k)
+            {
+              d = k;
+              clear_fp(fp, k, min_index, max_index, &lcs_freelist);
+              p = 0;
+              max_index = (d>0 ? d : 0);
+            }
         }
 
       /* for k > 0, deletions are free */
-      for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
+      for (k = max_index; k >= 0; k--)
         {
           svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
+          if (fp[k].uniquematchcount > limit - k)
+            {
+              d = k;
+              clear_fp(fp, k, min_index, max_index, &lcs_freelist);
+              p = 0;
+            }
         }
-      p++;
     }
   while (fp[0].position[1] != &sentinel_position[1]);