Re: Commons Lang + Improved getLevenshtein() implementation

2006-03-14 Thread Chas Emerick

Henri,

Well, better late than never!  It's good of you to follow up -- I had  
forgotten about that code a long time ago.  I've continued to use my  
version since it didn't show up in commons after a month or so, and I  
just never remembered to circle back about it.


Yes, by all means, use the code.

Thanks,

Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction

[EMAIL PROTECTED]
http://snowtide.com


On Mar 14, 2006, at 1:23 AM, Henri Yandell wrote:



Back in October 2003, Chas submitted an improved implementation of  
the getLevenshtein method in Jakarta Commons Lang's StringUtils  
class. By the looks of the history, I then failed to apply it. My  
apologies for that.


Cedrik posted an issue reminding us of this fact, and finally we've  
applied the patch, though we've used the version from http:// 
www.merriampark.com/ldjava.htm rather than the original email. I  
presume it's the more up to date and I'm assuming that Chas is  
still happy for us to use the code. Let me know if that's not the  
case, Chas.


Mostly though, I wanted to thank all three of you for your  
involvement. Michael for the original implementation, Chas for the  
optimised one and Cedrik for following up so well.


The code should go out in Commons Lang 2.2 at some point in the  
coming months.


Thanks,

Hen



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[PATCH] StringUtils.getLevenshteinDistance() fix for OutOfMemoryError when used with LONG strings

2003-10-22 Thread Chas Emerick
I've never submitted a patch for an open-source project before (never  
got around to it lo these many years I guess), so I apologize for any  
errors in form or convention that I commit.

I was planning on using the  
StringUtils.getLevenshteinDistance(String,String) method in a  
particular circumstance where I needed to get the edit distance between  
two large strings (anywhere between 10K - 500K characters each).   
However, I found that calling that method (as of v2.0 of commons-lang)  
would result in an OutOfMemoryError when strings of such lengths were  
provided.

A quick look at the source revealed that the current implementation  
(which uses the sample impl. at http://www.merriampark.com/ld.htm)  
creates a matrix with dimensions corresponding to the lengths of the  
two strings provided.  Clearly, a 100K x 100K int[] is problematic.

Therefore, I've (mostly) rewritten the method using a two int arrays of  
the size of the first string's length, and the method now works  
properly with larger strings (beastly slow, but that's quadratic  
algorithms for you) .  I've tested the new implementation against the  
ten or so testcases mentioned in the javadocs, as well as another  
half-dozen of my own, and everything looks good.

If my mail client botches the patch diff, you can get it at  
http://www.snowtide.com/commons-lang-LDpatch.txt

Chas Emerick   |   [EMAIL PROTECTED]

http://www.snowtide.com
Snowtide Informatics Systems, Inc.
 
==
--- StringUtils.java.oldWed Oct 22 12:58:04 2003
+++ StringUtils.javaWed Oct 22 13:51:36 2003
@@ -4255,8 +4255,8 @@
  * another, where each change is a single character modification  
(deletion,
  * insertion or substitution)./p
  *
- * pThis implementation of the Levenshtein distance algorithm
- * is from a  
href=http://www.merriampark.com/ld.htm;http://www.merriampark.com/ld.
htm/a/p
+ * pThis implementation of the Levenshtein distance algorithm  
was originally based
upon the one
+ * presented at a  
href=http://www.merriampark.com/ld.htm;http://www.merriampark.co
m/ld.htm/a/p
  *
  * pre
  * StringUtils.getLevenshteinDistance(null, *) =  
IllegalArgumentException
@@ -4281,76 +4281,64 @@
 if (s == null || t == null) {
 throw new IllegalArgumentException(Strings must not be  
null);
 }
-int d[][]; // matrix
-int n; // length of s
-int m; // length of t
-int i; // iterates through s
-int j; // iterates through t
-char s_i; // ith character of s
-char t_j; // jth character of t
-int cost; // cost
-
-// Step 1
-n = s.length();
-m = t.length();
+
+   /*
+   The difference between this impl. and the previous is  
that, rather than cr
eating and retaining a matrix of size
+   s.length()+1 by t.length()+1, we maintain two  
single-dimensional arrays of
 length s.length()+1.  The first, d,
+   is the 'current working' distance array that maintains  
the newest distance
 cost counts as we iterate through
+   the characters of String s.  Each time we increment the  
index of String t
we are comparing, d is copied to p,
+   the second int[].  Doing so allows us to retain the  
previous cost counts a
s required by the algorithm
+   (taking the minimum of the cost count to the left, up  
one, and diagonally
up and to the left of the current
+   cost count being calculated).  (Note that the arrays  
aren't really copied
anymore, just switched...this is clearly much
+   better than cloning an array or doing a  
System.arraycopy() each time throu
gh the outer loop.)
+
+   Effectively, the difference between the two  
implementations is this one do
es not cause an out of memory condition
+   when calculating the LD over two very large strings.
+   */
+
+int n = s.length(); // length of s
+int m = t.length(); // length of t
+
 if (n == 0) {
 return m;
-}
-if (m == 0) {
+} else if (m == 0) {
 return n;
 }
-d = new int[n + 1][m + 1];

-// Step 2
-for (i = 0; i = n; i++) {
-d[i][0] = i;
-}
+int p[] = new int[n+1]; //'previous' cost array, horizontally
+int d[] = new int[n+1]; // cost array, horizontally
+   int _d[]; //placeholder to assist in swapping p and d
-for (j = 0; j = m; j++) {
-d[0][j] = j;
-}
-
-// Step 3
-for (i = 1; i = n; i++) {
-s_i = s.charAt(i - 1);
-
-// Step 4
-for (j = 1; j = m; j++) {
-t_j = t.charAt(j - 1);
-
-// Step 5
-if (s_i == t_j) {
-cost = 0;
-} else