I've never submitted a patch for an open-source project before (never
got around to it lo these many years I guess), so I apologize for any
errors in form or convention that I commit.
I was planning on using the
StringUtils.getLevenshteinDistance(String,String) method in a
particular circumstance where I needed to get the edit distance between
two large strings (anywhere between 10K - 500K characters each).
However, I found that calling that method (as of v2.0 of commons-lang)
would result in an OutOfMemoryError when strings of such lengths were
provided.
A quick look at the source revealed that the current implementation
(which uses the sample impl. at http://www.merriampark.com/ld.htm)
creates a matrix with dimensions corresponding to the lengths of the
two strings provided. Clearly, a 100K x 100K int[] is problematic.
Therefore, I've (mostly) rewritten the method using a two int arrays of
the size of the first string's length, and the method now works
properly with larger strings (beastly slow, but that's quadratic
algorithms for you) . I've tested the new implementation against the
ten or so testcases mentioned in the javadocs, as well as another
half-dozen of my own, and everything looks good.
If my mail client botches the patch diff, you can get it at
http://www.snowtide.com/commons-lang-LDpatch.txt
Chas Emerick | [EMAIL PROTECTED]
http://www.snowtide.com
Snowtide Informatics Systems, Inc.
==
--- StringUtils.java.oldWed Oct 22 12:58:04 2003
+++ StringUtils.javaWed Oct 22 13:51:36 2003
@@ -4255,8 +4255,8 @@
* another, where each change is a single character modification
(deletion,
* insertion or substitution)./p
*
- * pThis implementation of the Levenshtein distance algorithm
- * is from a
href=http://www.merriampark.com/ld.htm;http://www.merriampark.com/ld.
htm/a/p
+ * pThis implementation of the Levenshtein distance algorithm
was originally based
upon the one
+ * presented at a
href=http://www.merriampark.com/ld.htm;http://www.merriampark.co
m/ld.htm/a/p
*
* pre
* StringUtils.getLevenshteinDistance(null, *) =
IllegalArgumentException
@@ -4281,76 +4281,64 @@
if (s == null || t == null) {
throw new IllegalArgumentException(Strings must not be
null);
}
-int d[][]; // matrix
-int n; // length of s
-int m; // length of t
-int i; // iterates through s
-int j; // iterates through t
-char s_i; // ith character of s
-char t_j; // jth character of t
-int cost; // cost
-
-// Step 1
-n = s.length();
-m = t.length();
+
+ /*
+ The difference between this impl. and the previous is
that, rather than cr
eating and retaining a matrix of size
+ s.length()+1 by t.length()+1, we maintain two
single-dimensional arrays of
length s.length()+1. The first, d,
+ is the 'current working' distance array that maintains
the newest distance
cost counts as we iterate through
+ the characters of String s. Each time we increment the
index of String t
we are comparing, d is copied to p,
+ the second int[]. Doing so allows us to retain the
previous cost counts a
s required by the algorithm
+ (taking the minimum of the cost count to the left, up
one, and diagonally
up and to the left of the current
+ cost count being calculated). (Note that the arrays
aren't really copied
anymore, just switched...this is clearly much
+ better than cloning an array or doing a
System.arraycopy() each time throu
gh the outer loop.)
+
+ Effectively, the difference between the two
implementations is this one do
es not cause an out of memory condition
+ when calculating the LD over two very large strings.
+ */
+
+int n = s.length(); // length of s
+int m = t.length(); // length of t
+
if (n == 0) {
return m;
-}
-if (m == 0) {
+} else if (m == 0) {
return n;
}
-d = new int[n + 1][m + 1];
-// Step 2
-for (i = 0; i = n; i++) {
-d[i][0] = i;
-}
+int p[] = new int[n+1]; //'previous' cost array, horizontally
+int d[] = new int[n+1]; // cost array, horizontally
+ int _d[]; //placeholder to assist in swapping p and d
-for (j = 0; j = m; j++) {
-d[0][j] = j;
-}
-
-// Step 3
-for (i = 1; i = n; i++) {
-s_i = s.charAt(i - 1);
-
-// Step 4
-for (j = 1; j = m; j++) {
-t_j = t.charAt(j - 1);
-
-// Step 5
-if (s_i == t_j) {
-cost = 0;
-} else