Here's an updated patch based on the current HEAD, with one or
two minor improvements. While, as stated, the patch can give
many-fold speed increases in ideal cases, it also slows down the
LCS algorithm significantly for poorly matching files: on my system
I measured an increase of about 18% in running time for files that
were very poor matches but had no lines unique to either file.
On Fri, May 27, 2011 at 11:00 PM, Morten Kloster <[email protected]> wrote:
> [[[
> Speeds up LCS algorithm for well-separated sections of change
> By exploiting unique matches between the two files, it is possible to
> "restart" the LCS algorithm at times and reset the buildup of square
> computational complexity.
>
> * subversion/libsvn_diff/lcs.c
> (svn_diff__snake_t): Added uniquematchcount member.
> (svn_diff__snake): Increments uniquematchcount each time
> a matching token occurs only once in each file.
> (clear_fp): New function that removes nonviable states.
> (svn_diff__lcs): Restarts LCS algorithm from p=0 if the number
> of unique matches exceeds the number of mismatches, using
> the first state where that occurs as the new start state.
> ]]]
>
> This patch is a partial implementation of a possible enhancement to
> the LCS algorithm. The patch builds on the patch in the thread
> "[PATCH] Speed-up of libsvn_diff using token counts". I have only
> included the lcs.c file in this patch; the other files in that other patch
> must already be patched.
>
>
> Theory behind the enhancement:
> If the number of unique matches - tokens that occur only once in each
> file - found so far exceeds the number of mismatches found so far, and
> this is the FIRST time that happens, then that state is guaranteed to
> be part of the true LCS: eliminating some number of the mismatches
> would necessarily add more new mismatches among the current unique
> matches. We can therefore "start over" from the current state and
> stop the build-up of computational complexity represented by the 'p'
> variable. By also resetting the number of unique matches found to 0,
> we can repeat this process many times.
>
> This implementation is partial in the sense that it does not reach the
> full potential of the enhancement. It will still scan at least 'd' states
> for each value of 'p', whereas an optimal implementation would
> interrupt the scan if there's a mismatch for what would have been
> a unique match (since this always makes the potential LCS worse,
> regardless of whether the change is an insertion or a deletion).
> However, a full implementation will be significantly more complicated,
> as far as I can see.
>
> The current implementation can reduce the average computational
> complexity from O((p_max + d)*p_max) to O(sum((p_n + d)*p_n)),
> while an optimal implementation would be O(sum((p_n + d_n)*p_n)).
> p_max is here the total number of mismatches in the shorter file,
> while p_n and d_n are the lesser number of mismatches and the
> length difference between the files within each area of difference that
> is separated from other areas of difference by a large number of
> (not necessarily contiguous) unique matches. The current version
> is only really useful if there is a similar number of mismatches in
> each file, i.e. if there is a similar number of insertions and deletions
> (not counting tokens that are unique to one of the files).
>
> Johan: I initially thought it would be easiest to give unique matches
> higher weight in the diff algorithm. This would have meant a change
> in the definition of the optimal diff - rather than simply asking for
> the largest number of matching tokens, we would prefer diffs with
> slightly fewer total matches but more unique matches. It would not be
> a heuristic, since the result would still be a guaranteed optimal diff
> given the new optimality criterion, but it would no longer be an LCS.
> However, it turned out to be easier to keep the same weights.
>
>
> The patch will need significant changes to integrate with the idx
> patch (also libsvn_diff speed-up). The current version doesn't give
> much speed improvement for most of my tests so far, but can
> give very large improvements for "ideal" cases (many small areas
> of non-unique changes that are separated by many unique matches,
> and where the files are of equal length).
>
> I'm not sure whether the current version (integrated with idx) is
> worthwhile. I'll look into making a full implementation of the
> enhancement.
>
>
> Morten
>
Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c (revision 1133037)
+++ subversion/libsvn_diff/lcs.c (working copy)
@@ -77,6 +77,7 @@
apr_off_t y;
svn_diff__lcs_t *lcs;
svn_diff__position_t *position[2];
+ svn_diff__token_index_t uniquematchcount;
};
static APR_INLINE void
@@ -89,6 +90,7 @@
svn_diff__position_t *position[2];
svn_diff__lcs_t *lcs;
svn_diff__lcs_t *previous_lcs;
+ svn_diff__token_index_t uniquematchcount;
/* The previous entry at fp[k] is going to be replaced. See if we
* can mark that lcs node for reuse, because the sequence up to this
@@ -111,6 +113,7 @@
{
start_position[0] = fp_k[-1].position[0];
start_position[1] = fp_k[-1].position[1]->next;
+ uniquematchcount = fp_k[-1].uniquematchcount;
previous_lcs = fp_k[-1].lcs;
}
@@ -118,6 +121,7 @@
{
start_position[0] = fp_k[1].position[0]->next;
start_position[1] = fp_k[1].position[1];
+ uniquematchcount = fp_k[1].uniquematchcount;
previous_lcs = fp_k[1].lcs;
}
@@ -139,6 +143,8 @@
{
while (position[0]->token_index == position[1]->token_index)
{
+ if (token_counts[0][position[0]->token_index] == 1 && token_counts[1][position[0]->token_index] == 1)
+ uniquematchcount++;
position[0] = position[0]->next;
position[1] = position[1]->next;
}
@@ -181,6 +187,7 @@
fp_k[0].position[1] = position[1];
fp_k[0].y = position[1]->offset;
+ fp_k[0].uniquematchcount = uniquematchcount;
}
@@ -228,6 +235,42 @@
}
+static void
+clear_fp(svn_diff__snake_t *fp,
+ const apr_off_t k,
+ const apr_off_t min_index,
+ const apr_off_t max_index,
+ svn_diff__lcs_t **lcs_freelist)
+{
+ svn_diff__snake_t best;
+ svn_diff__lcs_t *lcs, *prev_lcs;
+ apr_off_t index;
+
+ best = fp[k];
+ best.lcs->refcount++;
+ for (index = min_index; index <= max_index; index++)
+ {
+ lcs = fp[index].lcs;
+ fp[index].y = 0;
+ fp[index].lcs = NULL;
+ fp[index].uniquematchcount = 0;
+ while (lcs)
+ {
+ lcs->refcount--;
+ if (lcs->refcount)
+ break;
+
+ prev_lcs = lcs->next;
+ lcs->next = *lcs_freelist;
+ *lcs_freelist = lcs;
+ lcs = prev_lcs;
+ }
+ }
+ fp[k] = best;
+ fp[k].uniquematchcount = 0;
+}
+
+
svn_diff__lcs_t *
svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -245,6 +288,9 @@
svn_diff__snake_t *fp;
apr_off_t d;
apr_off_t k;
+ svn_diff__token_index_t limit;
+ apr_off_t min_index;
+ apr_off_t max_index;
apr_off_t p = 0;
svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
@@ -340,17 +386,32 @@
p = 0;
do
{
+ min_index = (d<0 ? d : 0) - p;
+ max_index = (d>0 ? d : 0) + p;
+ limit = (d >= 0 ? 2 * p + d : 2 * p - d);
/* For k < 0, insertions are free */
- for (k = (d < 0 ? d : 0) - p; k < 0; k++)
+ for (k = min_index; k < 0; k++)
{
svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
+ if (fp[k].uniquematchcount > limit + k)
+ {
+ d = k;
+ clear_fp(fp, k, min_index, max_index, &lcs_freelist);
+ p = 0;
+ max_index = (d>0 ? d : 0);
+ }
}
/* for k > 0, deletions are free */
- for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
+ for (k = max_index; k >= 0; k--)
{
svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
+ if (fp[k].uniquematchcount > limit - k)
+ {
+ d = k;
+ clear_fp(fp, k, min_index, max_index, &lcs_freelist);
+ p = 0;
+ }
}
-
p++;
}
while (fp[0].position[1] != &sentinel_position[1]);