Might be worth including as a separate subclass of testcase then - anyone looking for quick runs can take it out of the test suite temporarily.
-AMT -----Original Message----- From: Chas Emerick [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 22, 2003 3:52 PM To: Jakarta Commons Developers List Subject: Re: [lang] [PATCH] StringUtils.getLevenshteinDistance() fix for OutOfMemoryError when used with LONG strings Sure...although there's not much I can add with regard to the simpler test cases, I could use RandomStringUtils to generate some long strings to test behaviour that the patch addresses. The downside is that the algorithm is fundamentally quadratic, so doing such a test would significantly extend the time required to complete the java-lang tests (i.e. ~7 minutes to calculate the character-level edit distance of two strings of 50K each). If you think that's OK, I'll submit a patch to the test class. Chas Emerick | [EMAIL PROTECTED] http://www.snowtide.com Snowtide Informatics Systems, Inc. On Wednesday, October 22, 2003, at 04:53 PM, Henri Yandell wrote: > > Cool. Sounds good and I'll have a go at applying your patch. > > Any chance of you submitting your other unit tests for the class? > > Hen > > On Wed, 22 Oct 2003, Chas Emerick wrote: > >> I've never submitted a patch for an open-source project before (never >> got around to it lo these many years I guess), so I apologize for any >> errors in form or convention that I commit. >> >> I was planning on using the >> StringUtils.getLevenshteinDistance(String,String) method in a >> particular circumstance where I needed to get the edit distance >> between >> two large strings (anywhere between 10K - 500K characters each). >> However, I found that calling that method (as of v2.0 of commons-lang) >> would result in an OutOfMemoryError when strings of such lengths were >> provided. >> >> A quick look at the source revealed that the current implementation >> (which uses the sample impl. at http://www.merriampark.com/ld.htm) >> creates a matrix with dimensions corresponding to the lengths of the >> two strings provided. Clearly, a 100K x 100K int[] is problematic. >> >> Therefore, I've (mostly) rewritten the method using a two int arrays >> of >> the size of the first string's length, and the method now works >> properly with larger strings (beastly slow, but that's quadratic >> algorithms for you) . I've tested the new implementation against the >> ten or so testcases mentioned in the javadocs, as well as another >> half-dozen of my own, and everything looks good. >> >> If my mail client botches the patch diff, you can get it at >> http://www.snowtide.com/commons-lang-LDpatch.txt >> >> Chas Emerick | [EMAIL PROTECTED] >> >> http://www.snowtide.com >> Snowtide Informatics Systems, Inc. >> >> ===================================================================== >> = >> == >> == >> --- StringUtils.java.old Wed Oct 22 12:58:04 2003 >> +++ StringUtils.java Wed Oct 22 13:51:36 2003 >> @@ -4255,8 +4255,8 @@ >> * another, where each change is a single character >> modification (deletion, >> * insertion or substitution).</p> >> * >> - * <p>This implementation of the Levenshtein distance algorithm >> - * is from <a >> href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ >> ld. >> htm</a></p> >> + * <p>This implementation of the Levenshtein distance algorithm >> was originally based >> upon the one >> + * presented at <a >> href="http://www.merriampark.com/ld.htm">http://www.merriampark.co >> m/ld.htm</a></p> >> * >> * <pre> >> * StringUtils.getLevenshteinDistance(null, *) = >> IllegalArgumentException >> @@ -4281,76 +4281,64 @@ >> if (s == null || t == null) { >> throw new IllegalArgumentException("Strings must not be >> null"); >> } >> - int d[][]; // matrix >> - int n; // length of s >> - int m; // length of t >> - int i; // iterates through s >> - int j; // iterates through t >> - char s_i; // ith character of s >> - char t_j; // jth character of t >> - int cost; // cost >> - >> - // Step 1 >> - n = s.length(); >> - m = t.length(); >> + >> + /* >> + The difference between this impl. and the previous is >> that, rather than cr >> eating and retaining a matrix of size >> + s.length()+1 by t.length()+1, we maintain two >> single-dimensional arrays of >> length s.length()+1. The first, d, >> + is the 'current working' distance array that >> + maintains >> the newest distance >> cost counts as we iterate through >> + the characters of String s. Each time we increment >> the >> index of String t >> we are comparing, d is copied to p, >> + the second int[]. Doing so allows us to retain the >> previous cost counts a >> s required by the algorithm >> + (taking the minimum of the cost count to the left, up >> one, and diagonally >> up and to the left of the current >> + cost count being calculated). (Note that the arrays >> aren't really copied >> anymore, just switched...this is clearly much >> + better than cloning an array or doing a >> System.arraycopy() each time throu >> gh the outer loop.) >> + >> + Effectively, the difference between the two >> implementations is this one do >> es not cause an out of memory condition >> + when calculating the LD over two very large strings. >> + */ >> + >> + int n = s.length(); // length of s >> + int m = t.length(); // length of t >> + >> if (n == 0) { >> return m; >> - } >> - if (m == 0) { >> + } else if (m == 0) { >> return n; >> } >> - d = new int[n + 1][m + 1]; >> >> - // Step 2 >> - for (i = 0; i <= n; i++) { >> - d[i][0] = i; >> - } >> + int p[] = new int[n+1]; //'previous' cost array, horizontally >> + int d[] = new int[n+1]; // cost array, horizontally >> + int _d[]; //placeholder to assist in swapping p and d >> >> - for (j = 0; j <= m; j++) { >> - d[0][j] = j; >> - } >> - >> - // Step 3 >> - for (i = 1; i <= n; i++) { >> - s_i = s.charAt(i - 1); >> - >> - // Step 4 >> - for (j = 1; j <= m; j++) { >> - t_j = t.charAt(j - 1); >> - >> - // Step 5 >> - if (s_i == t_j) { >> - cost = 0; >> - } else { >> - cost = 1; >> - } >> + //indexes into strings s and t >> + int i; // iterates through s >> + int j; // iterates through t >> >> - // Step 6 >> - d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - >> 1][j - 1] + cost); >> - } >> - } >> + char t_j; // jth character of t >> >> - // Step 7 >> - return d[n][m]; >> - } >> + int cost; // cost >> >> - /** >> - * <p>Gets the minimum of three <code>int</code> values.</p> >> - * >> - * @param a value 1 >> - * @param b value 2 >> - * @param c value 3 >> - * @return the smallest of the values >> - */ >> - private static int min(int a, int b, int c) { >> - // Method copied from NumberUtils to avoid dependency on >> subpackage >> - if (b < a) { >> - a = b; >> - } >> - if (c < a) { >> - a = c; >> + for (i = 0; i<=n; i++) { >> + p[i] = i; >> } >> - return a; >> + >> + for (j = 1; j<=m; j++) { >> + t_j = t.charAt(j-1); >> + d[0] = j; >> + >> + for (i=1; i<=n; i++) { >> + cost = s.charAt(i-1)==t_j ? 0 : 1; >> + >> + d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), >> p[i-1]+cost); //minimum of c >> ell to the left+1, to the top+1, diagonally left and up +cost >> + } >> + >> + //copy current distance counts to 'previous row' >> + distance >> counts >> + _d = p; >> + p = d; >> + d = _d; >> + } >> + >> + //our last action in the above loop was to switch d and p, >> + so >> p now actual >> ly has the most recent cost counts >> + return p[n]; >> } >> >> } >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]