Cool. Sounds good and I'll have a go at applying your patch. Any chance of you submitting your other unit tests for the class?
Hen On Wed, 22 Oct 2003, Chas Emerick wrote: > I've never submitted a patch for an open-source project before (never > got around to it lo these many years I guess), so I apologize for any > errors in form or convention that I commit. > > I was planning on using the > StringUtils.getLevenshteinDistance(String,String) method in a > particular circumstance where I needed to get the edit distance between > two large strings (anywhere between 10K - 500K characters each). > However, I found that calling that method (as of v2.0 of commons-lang) > would result in an OutOfMemoryError when strings of such lengths were > provided. > > A quick look at the source revealed that the current implementation > (which uses the sample impl. at http://www.merriampark.com/ld.htm) > creates a matrix with dimensions corresponding to the lengths of the > two strings provided. Clearly, a 100K x 100K int[] is problematic. > > Therefore, I've (mostly) rewritten the method using a two int arrays of > the size of the first string's length, and the method now works > properly with larger strings (beastly slow, but that's quadratic > algorithms for you) . I've tested the new implementation against the > ten or so testcases mentioned in the javadocs, as well as another > half-dozen of my own, and everything looks good. > > If my mail client botches the patch diff, you can get it at > http://www.snowtide.com/commons-lang-LDpatch.txt > > Chas Emerick | [EMAIL PROTECTED] > > http://www.snowtide.com > Snowtide Informatics Systems, Inc. > > ======================================================================== > == > --- StringUtils.java.old Wed Oct 22 12:58:04 2003 > +++ StringUtils.java Wed Oct 22 13:51:36 2003 > @@ -4255,8 +4255,8 @@ > * another, where each change is a single character modification > (deletion, > * insertion or substitution).</p> > * > - * <p>This implementation of the Levenshtein distance algorithm > - * is from <a > href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld. > htm</a></p> > + * <p>This implementation of the Levenshtein distance algorithm > was originally based > upon the one > + * presented at <a > href="http://www.merriampark.com/ld.htm">http://www.merriampark.co > m/ld.htm</a></p> > * > * <pre> > * StringUtils.getLevenshteinDistance(null, *) = > IllegalArgumentException > @@ -4281,76 +4281,64 @@ > if (s == null || t == null) { > throw new IllegalArgumentException("Strings must not be > null"); > } > - int d[][]; // matrix > - int n; // length of s > - int m; // length of t > - int i; // iterates through s > - int j; // iterates through t > - char s_i; // ith character of s > - char t_j; // jth character of t > - int cost; // cost > - > - // Step 1 > - n = s.length(); > - m = t.length(); > + > + /* > + The difference between this impl. and the previous is > that, rather than cr > eating and retaining a matrix of size > + s.length()+1 by t.length()+1, we maintain two > single-dimensional arrays of > length s.length()+1. The first, d, > + is the 'current working' distance array that maintains > the newest distance > cost counts as we iterate through > + the characters of String s. Each time we increment the > index of String t > we are comparing, d is copied to p, > + the second int[]. Doing so allows us to retain the > previous cost counts a > s required by the algorithm > + (taking the minimum of the cost count to the left, up > one, and diagonally > up and to the left of the current > + cost count being calculated). (Note that the arrays > aren't really copied > anymore, just switched...this is clearly much > + better than cloning an array or doing a > System.arraycopy() each time throu > gh the outer loop.) > + > + Effectively, the difference between the two > implementations is this one do > es not cause an out of memory condition > + when calculating the LD over two very large strings. > + */ > + > + int n = s.length(); // length of s > + int m = t.length(); // length of t > + > if (n == 0) { > return m; > - } > - if (m == 0) { > + } else if (m == 0) { > return n; > } > - d = new int[n + 1][m + 1]; > > - // Step 2 > - for (i = 0; i <= n; i++) { > - d[i][0] = i; > - } > + int p[] = new int[n+1]; //'previous' cost array, horizontally > + int d[] = new int[n+1]; // cost array, horizontally > + int _d[]; //placeholder to assist in swapping p and d > > - for (j = 0; j <= m; j++) { > - d[0][j] = j; > - } > - > - // Step 3 > - for (i = 1; i <= n; i++) { > - s_i = s.charAt(i - 1); > - > - // Step 4 > - for (j = 1; j <= m; j++) { > - t_j = t.charAt(j - 1); > - > - // Step 5 > - if (s_i == t_j) { > - cost = 0; > - } else { > - cost = 1; > - } > + //indexes into strings s and t > + int i; // iterates through s > + int j; // iterates through t > > - // Step 6 > - d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - > 1][j - 1] + cost); > - } > - } > + char t_j; // jth character of t > > - // Step 7 > - return d[n][m]; > - } > + int cost; // cost > > - /** > - * <p>Gets the minimum of three <code>int</code> values.</p> > - * > - * @param a value 1 > - * @param b value 2 > - * @param c value 3 > - * @return the smallest of the values > - */ > - private static int min(int a, int b, int c) { > - // Method copied from NumberUtils to avoid dependency on > subpackage > - if (b < a) { > - a = b; > - } > - if (c < a) { > - a = c; > + for (i = 0; i<=n; i++) { > + p[i] = i; > } > - return a; > + > + for (j = 1; j<=m; j++) { > + t_j = t.charAt(j-1); > + d[0] = j; > + > + for (i=1; i<=n; i++) { > + cost = s.charAt(i-1)==t_j ? 0 : 1; > + > + d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), > p[i-1]+cost); //minimum of c > ell to the left+1, to the top+1, diagonally left and up +cost > + } > + > + //copy current distance counts to 'previous row' distance > counts > + _d = p; > + p = d; > + d = _d; > + } > + > + //our last action in the above loop was to switch d and p, so > p now actual > ly has the most recent cost counts > + return p[n]; > } > > } > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]