Re: Improving String Distance calculation performance

2010-12-28 Thread Robert Muir
On Tue, Dec 28, 2010 at 5:26 AM, Biedermann,S.,Fa. Post Direkt wrote:
> Hi Robert,
>
> Thanks for your hint about LevenshteinAutomata. Are AutomatonQueries planned
> for an upcoming release?

yes, but it's in trunk, so you can use it now...

>
> At the moment, we build the reference data to boost, at query time, those
> documents which contain fuzzily matched, seldom-used tokens within a queried
> region; in a manner of speaking, a fuzzified, localised idf(). The boosts are
> injected via payloads. Since Levenshtein must be calculated within a
> (fuzzified) region only, O(mn) applies "only" to each region. On the outside,
> we have O(#regions).
>
> The problem could equivalently be solved at query time. But this would mean
> counting the matched documents of each fuzzy query within a more complex query.
> In release 3.0.2 it looks quite complicated to me to incorporate a different
> scoring model that first counts the matches of each fuzzy sub-query and then
> applies the boosts to the matched tokens. I haven't seen a Scorer doing this
> so far. Furthermore, we are sensitive about query time.
>
> Do you have any ideas?

Not sure I fully understand what your app needs to do, but you can
take a look at using a different rewrite method.

For example, it seems like rewriting to span queries (see
SpanMultiTermQueryWrapper) might be close to what you want, except it
suffers from the problem that boosting is completely broken in
Lucene's span queries (since they don't combine with real Scorers but
instead with Spans)...
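
A minimal sketch of that rewrite, assuming the trunk API of the time (the
wrapper class and exact signatures may differ in your checkout):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
    import org.apache.lucene.search.spans.SpanQuery;

    // FuzzyQuery is a MultiTermQuery, so it can be wrapped; the wrapper swaps
    // in a rewrite method that produces a SpanQuery, which can then be nested
    // in SpanNearQuery/SpanOrQuery to express "within a queried region".
    FuzzyQuery fuzzy = new FuzzyQuery(new Term("region", "token"), 0.75f);
    SpanQuery spanFuzzy = new SpanMultiTermQueryWrapper<FuzzyQuery>(fuzzy);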

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Improving String Distance calculation performance

2010-12-28 Thread Biedermann,S.,Fa. Post Direkt
Hi Robert,

Thanks for your hint about LevenshteinAutomata. Are AutomatonQueries planned for
an upcoming release?

At the moment, we build the reference data to boost, at query time, those
documents which contain fuzzily matched, seldom-used tokens within a queried
region; in a manner of speaking, a fuzzified, localised idf(). The boosts are
injected via payloads. Since Levenshtein must be calculated within a
(fuzzified) region only, O(mn) applies "only" to each region. On the outside,
we have O(#regions).

The problem could equivalently be solved at query time. But this would mean
counting the matched documents of each fuzzy query within a more complex query.
In release 3.0.2 it looks quite complicated to me to incorporate a different
scoring model that first counts the matches of each fuzzy sub-query and then
applies the boosts to the matched tokens. I haven't seen a Scorer doing this
so far. Furthermore, we are sensitive about query time.

Do you have any ideas?



-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, December 27, 2010 17:11
To: dev@lucene.apache.org
Subject: Re: Improving String Distance calculation performance

On Mon, Dec 27, 2010 at 10:31 AM, Biedermann,S.,Fa. Post Direkt wrote:
>
> As for our problem: we are trying to build reference data against which
> requests shall be matched. For this we need a huge number of string
> distance calculations to prepare this reference.
>

If this is your problem, I wouldn't recommend using the StringDistance
directly. As I mentioned, it's not designed for your use case: the way it's
used by the spellchecker, it only needs something like 20-50 comparisons...

If you try to use it the way you describe, it will be very slow: it must do
O(k) comparisons, where k is the number of strings, and each comparison is
O(mn), where m and n are the lengths of the input string and the string being
compared, respectively.

Easier would be to index your terms and simply do a FuzzyQuery (with trunk),
specifying the exact max edit distance you want. Or, if you care about getting
all exact results within Levenshtein distance of some degree N, use an
AutomatonQuery built from LevenshteinAutomata.

This will give you a sublinear number of comparisons, something complicated but 
more like O(sqrt(k)) where k is the number of strings, and each comparison is 
O(n), where n is the length of the target string.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Improving String Distance calculation performance

2010-12-27 Thread Robert Muir
On Mon, Dec 27, 2010 at 10:31 AM, Biedermann,S.,Fa. Post Direkt wrote:
>
> As for our problem: we are trying to build reference data against which
> requests shall be matched. For this we need a huge number of string
> distance calculations to prepare this reference.
>

If this is your problem, I wouldn't recommend using the StringDistance
directly. As I mentioned, it's not designed for your use case: the way
it's used by the spellchecker, it only needs something like 20-50
comparisons...

If you try to use it the way you describe, it will be very slow: it
must do O(k) comparisons, where k is the number of strings, and each
comparison is O(mn), where m and n are the lengths of the input string
and the string being compared, respectively.

Easier would be to index your terms and simply do a FuzzyQuery (with
trunk), specifying the exact max edit distance you want. Or, if you
care about getting all exact results within Levenshtein distance of
some degree N, use an AutomatonQuery built from LevenshteinAutomata.

This will give you a sublinear number of comparisons, something
complicated but more like O(sqrt(k)) where k is the number of strings,
and each comparison is O(n), where n is the length of the target
string.
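
A minimal sketch of both options, assuming the trunk APIs of the time
(constructor signatures may differ in your checkout):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.AutomatonQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.LevenshteinAutomata;

    // option 1: FuzzyQuery enumerates only candidate terms instead of
    // comparing against every term in the dictionary
    FuzzyQuery fuzzy = new FuzzyQuery(new Term("name", "levenstein"), 0.75f);

    // option 2: build a DFA accepting every term within edit distance 2 of
    // the input, then intersect it with the term dictionary
    Automaton dfa = new LevenshteinAutomata("levenstein").toAutomaton(2);
    AutomatonQuery exact = new AutomatonQuery(new Term("name", "levenstein"), dfa);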

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Improving String Distance calculation performance

2010-12-27 Thread Biedermann,S.,Fa. Post Direkt
Hi Robert,

I don't use the spellchecker, but of course I want to re-use the string
distance algorithms from a well-implemented library. I find that these
algorithms have a broader scope than spellchecking only. For instance,
FuzzyTermEnum could rely on them, and it could be refactored to use other
string distance measurements as well...

As for our problem: we are trying to build reference data against which
requests shall be matched. For this we need a huge number of string
distance calculations to prepare this reference.

For score scaling I took 1 - (#edits/maxTermLength), as suggested by the
original. I ran the candidate in parallel to the original LevensteinDistance
from the spellchecker and found no difference so far. Of course, this is no
proof.


Sven


-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, December 27, 2010 16:07
To: dev@lucene.apache.org
Subject: Re: Improving String Distance calculation performance

Hi Biedermann:

You are correct that the comparator in the spellchecker could maybe use some
optimizations.

But I'm curious as to why you would be doing a lot of comparisons with the
spellchecker? Are you using this class separately for some other purpose?

The reason is that the spellchecker works like this (in two phases) to
retrieve N suggestions for a word:
* the first phase is to do an n-gram query against the spellcheck index.
This is a simple BooleanQuery that returns 10 * N suggestions. For example,
if you want the top 5 suggestions for a word, it will get the top 50 based
on an n-gram ranking.
* these top candidates (in my example only 50) are then re-ranked according
to the spellcheck comparator, such as Levenshtein. So in this example only
50 Levenshtein comparisons are done.

Of course there is no reason why we shouldn't optimize the compare, if it's
safe. For the particular optimization you mention I only have one concern:
whether the optimization is correct for FuzzyTermsEnum, where the score
scaling is 1 - (#edits/minTermLength). But in the spellchecker comparator,
the scores are scaled as 1 - (#edits/maxTermLength).

It might be that your optimization is just fine... but I just wanted to
mention this difference.

On Mon, Dec 27, 2010 at 8:08 AM, Biedermann,S.,Fa. Post Direkt wrote:
> Hi,
>
> this is a re-post, because the first time I re-used another thread 
> (sorry for any inconvenience):
>
>
> this is my first post to this mailing list, so I first want to say 
> hello to all of you!
>
>        You are doing a great job
>
> In org.apache.lucene.search.FuzzyTermEnum I found an optimised
> implementation of the Levenshtein algorithm which makes use of the
> fact that the algorithm can be aborted if a given minimum similarity
> cannot be reached anymore. I isolated that algorithm into a subclass
> of org.apache.lucene.search.spell.StringDistance, since we usually can
> make use of this optimisation.
>
> With our current minimum similarity setting of 0.75, this algorithm
> needs only about 22% of the run time of the original algorithm from
> the spell package against our test data.
>
> With a further optimisation candidate (see below), the runtime can be
> reduced by another third, to only 14% of the original Levenshtein.
>
> So, my first question is: is it worth adding a further method to the
> StringDistance interface:
>
>        float getDistance(String left, String right, float minimumSimilarity)
>
> so that applications can make use of possible optimisations
> (StringDistance implementations without optimisations would just ignore
> the minimumSimilarity parameter)?
>
>
> The idea of the optimisation candidate is to calculate only those
> fields in the "virtual" matrix that are near its diagonal.
> It is only a candidate since we have not proven it to work. But with
> all our test data (0.5 billion comparisons) there is no difference
> from the original algorithm.
>
>
> Do you have any counter examples?
> Since this is my first post, is this the right mailing list?
>
> Best Regards,
>
> Sven
>
>
>
> Here is the code taken from FuzzyTermEnum with some modifications (p
> and d are initialised somewhere else):
>
>
>    public float getDistance(final String left, final String right, float minimumSimilarity) {
>        if (left.length() > right.length())   // candidate works only if the longer string is right
>            return getDistanceInner(right, left, minimumSimilarity);
>        else
>            return getDistanceInner(left, right, minimumSimilarity);
>    }
>
>    private float getDistanceInner(final String left, final String right, float minimumSimilarity) {
>        final int m = right.length();
>        final int n = left.length();

Re: Improving String Distance calculation performance

2010-12-27 Thread Robert Muir
Hi Biedermann:

You are correct that the comparator in the spellchecker could maybe use
some optimizations.

But I'm curious as to why you would be doing a lot of comparisons with
the spellchecker? Are you using this class separately for some other
purpose?

The reason is that the spellchecker works like this (in two phases) to
retrieve N suggestions for a word:
* the first phase is to do an n-gram query against the spellcheck index.
This is a simple BooleanQuery that returns 10 * N suggestions. For
example, if you want the top 5 suggestions for a word, it will get the
top 50 based on an n-gram ranking.
* these top candidates (in my example only 50) are then re-ranked
according to the spellcheck comparator, such as Levenshtein. So in this
example only 50 Levenshtein comparisons are done.
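
For illustration, a hedged sketch of how the spellchecker is normally driven
(assuming the 3.x API; spellDir is a Directory holding an index built with
indexDictionary):

    import org.apache.lucene.search.spell.LevensteinDistance;
    import org.apache.lucene.search.spell.SpellChecker;

    SpellChecker spellChecker = new SpellChecker(spellDir);
    spellChecker.setStringDistance(new LevensteinDistance());
    // phase 1: an n-gram BooleanQuery fetches 10 * 5 = 50 candidates;
    // phase 2: only those 50 are re-ranked with the Levenshtein comparator
    String[] suggestions = spellChecker.suggestSimilar("levenstein", 5);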

Of course there is no reason why we shouldn't optimize the compare, if
it's safe. For the particular optimization you mention I only have one
concern: whether the optimization is correct for FuzzyTermsEnum, where
the score scaling is 1 - (#edits/minTermLength). But in the spellchecker
comparator, the scores are scaled as 1 - (#edits/maxTermLength).

It might be that your optimization is just fine... but I just wanted to
mention this difference.
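
To make the difference concrete, take the "pre" / "prefixes" example from
your code (5 edits, term lengths 3 and 8):

    int edits = 5;
    int minTermLength = 3, maxTermLength = 8;
    // FuzzyTermsEnum-style scaling can go negative once the edit count
    // exceeds the shorter term's length:
    float fuzzyScaled = 1f - ((float) edits / minTermLength);  // = -0.67
    // the spellchecker comparator's scaling stays within [0, 1]:
    float spellScaled = 1f - ((float) edits / maxTermLength);  // =  0.375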

On Mon, Dec 27, 2010 at 8:08 AM, Biedermann,S.,Fa. Post Direkt wrote:
> Hi,
>
> this is a re-post, because the first time I re-used another thread
> (sorry for any inconvenience):
>
>
> this is my first post to this mailing list, so I first want to say hello
> to all of you!
>
>        You are doing a great job
>
> In org.apache.lucene.search.FuzzyTermEnum I found an optimised
> implementation of the Levenshtein algorithm which makes use of the fact
> that the algorithm can be aborted if a given minimum similarity cannot
> be reached anymore. I isolated that algorithm into a subclass of
> org.apache.lucene.search.spell.StringDistance, since we usually can make
> use of this optimisation.
>
> With our current minimum similarity setting of 0.75, this algorithm needs
> only about 22% of the run time of the original algorithm from the spell
> package against our test data.
>
> With a further optimisation candidate (see below), the runtime can be
> reduced by another third, to only 14% of the original Levenshtein.
>
> So, my first question is: is it worth adding a further method to the
> StringDistance interface:
>
>        float getDistance(String left, String right, float minimumSimilarity)
>
> so that applications can make use of possible optimisations
> (StringDistance implementations without optimisations would just ignore
> the minimumSimilarity parameter)?
>
>
> The idea of the optimisation candidate is to calculate only those
> fields in the "virtual" matrix that are near its diagonal.
> It is only a candidate since we have not proven it to work. But with
> all our test data (0.5 billion comparisons) there is no difference
> from the original algorithm.
>
>
> Do you have any counter examples?
> Since this is my first post, is this the right mailing list?
>
> Best Regards,
>
> Sven
>
>
>
> Here is the code taken from FuzzyTermEnum with some modifications (p and
> d are initialised somewhere else):
>
>
>    public float getDistance(final String left, final String right, float minimumSimilarity) {
>        if (left.length() > right.length())   // candidate works only if the longer string is right
>            return getDistanceInner(right, left, minimumSimilarity);
>        else
>            return getDistanceInner(left, right, minimumSimilarity);
>    }
>
>    private float getDistanceInner(final String left, final String right, float minimumSimilarity) {
>        final int m = right.length();
>        final int n = left.length();
>        final int maxLength = Math.max(m, n);
>        if (n == 0) {
>            // we don't have anything to compare.  That means if we just add
>            // the letters for m we get the new word
>            return (m == 0) ? 1f : 0f;
>        }
>        if (m == 0) {
>            return 0f;
>        }
>
>        // be patient with rounding errors (1.001f instead of 1f).
>        final int maxDistance = (int) ((1.001f - minimumSimilarity) * maxLength);
>
>        if (maxDistance < Math.abs(m - n)) {
>            // just adding the characters of m to n or vice-versa results in
>            // too many edits.
>            // for example "pre" length is 3 and "prefixes" length is 8.  We can
>            // see that, given this optimal circumstance, the edit distance cannot
>            // be less than 5, which is 8-3, or more precisely Math.abs(3-8).
>            // if our maximum edit distance is 4, then we can discard this word
>            // without looking at it.
>            return 0.0f;
>        }
>
>        // if no edits are allowed, strings must be equal
>        if (maxDistance == 0)
>            return left.equals(right) ? 1f : 0f;
>
>        // init matrix d
>        for (int i = 0; i <= n; i++) {
>            p[i] = i;
>        }
>
>        // start computing edit distance

Improving String Distance calculation performance

2010-12-27 Thread Biedermann,S.,Fa. Post Direkt
Hi,

this is a re-post, because the first time I re-used another thread
(sorry for any inconvenience):


this is my first post to this mailing list, so I first want to say hello
to all of you!

You are doing a great job 

In org.apache.lucene.search.FuzzyTermEnum I found an optimised
implementation of the Levenshtein algorithm which makes use of the fact
that the algorithm can be aborted if a given minimum similarity cannot
be reached anymore. I isolated that algorithm into a subclass of
org.apache.lucene.search.spell.StringDistance, since we usually can make
use of this optimisation.

With our current minimum similarity setting of 0.75, this algorithm needs
only about 22% of the run time of the original algorithm from the spell
package against our test data.

With a further optimisation candidate (see below), the runtime can be
reduced by another third, to only 14% of the original Levenshtein.

So, my first question is: is it worth adding a further method to the
StringDistance interface:

float getDistance(String left, String right, float minimumSimilarity)

so that applications can make use of possible optimisations
(StringDistance implementations without optimisations would just ignore
the minimumSimilarity parameter)?
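
As a usage sketch of the proposal (OptimizedLevensteinDistance is a
hypothetical name for the isolated subclass described above):

    import org.apache.lucene.search.spell.StringDistance;

    // hypothetical implementation offering the proposed three-argument method
    StringDistance distance = new OptimizedLevensteinDistance();
    // an implementation supporting the optimisation can abort the matrix
    // computation once a similarity of 0.75 is no longer reachable; others
    // would simply ignore the extra threshold parameter
    float similarity = distance.getDistance("levenstein", "levenshtein", 0.75f);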


The idea of the optimisation candidate is to calculate only those
fields in the "virtual" matrix that are near its diagonal: with
maxDistance = 2 and n = 10, for example, row j = 5 only evaluates
columns max(1, 5-2) = 3 through min(10, 2-1+5) = 6 instead of 1 through 10.
It is only a candidate since we have not proven it to work. But with
all our test data (0.5 billion comparisons) there is no difference
from the original algorithm.


Do you have any counter examples?
Since this is my first post, is this the right mailing list?

Best Regards,

Sven



Here is the code taken from FuzzyTermEnum with some modifications (p and
d are initialised somewhere else):


public float getDistance(final String left, final String right, float minimumSimilarity) {
    if (left.length() > right.length())   // candidate works only if the longer string is right
        return getDistanceInner(right, left, minimumSimilarity);
    else
        return getDistanceInner(left, right, minimumSimilarity);
}

private float getDistanceInner(final String left, final String right, float minimumSimilarity) {
    final int m = right.length();
    final int n = left.length();
    final int maxLength = Math.max(m, n);
    if (n == 0) {
        // we don't have anything to compare.  That means if we just add
        // the letters for m we get the new word
        return (m == 0) ? 1f : 0f;
    }
    if (m == 0) {
        return 0f;
    }

    // be patient with rounding errors (1.001f instead of 1f).
    final int maxDistance = (int) ((1.001f - minimumSimilarity) * maxLength);

    if (maxDistance < Math.abs(m - n)) {
        // just adding the characters of m to n or vice-versa results in
        // too many edits.
        // for example "pre" length is 3 and "prefixes" length is 8.  We can
        // see that, given this optimal circumstance, the edit distance cannot
        // be less than 5, which is 8-3, or more precisely Math.abs(3-8).
        // if our maximum edit distance is 4, then we can discard this word
        // without looking at it.
        return 0.0f;
    }

    // if no edits are allowed, strings must be equal
    if (maxDistance == 0)
        return left.equals(right) ? 1f : 0f;

    // init matrix d
    for (int i = 0; i <= n; i++) {
        p[i] = i;
    }

    // start computing edit distance
    for (int j = 1; j <= m; j++) { // iterates through target
        int bestPossibleEditDistance = m;
        final char t_j = right.charAt(j - 1); // jth character of t
        d[0] = j;

        // ---> here is the optimisation candidate

        // only iterate through a maxDistance corridor
        final int startAt  = Math.max(1, j - maxDistance);
        final int finishAt = Math.min(n, maxDistance - 1 + j);

        for (int i = startAt; i <= finishAt; ++i) { // iterates through text
            // minimum of cell to the left+1, to the top+1, diagonally left and up +(0|1)
            final char t_i = left.charAt(i - 1);
            if (t_j != t_i) {
                d[i] = Math.min(Math.min(d[i - 1], p[i]), p[i - 1]) + 1;
            } else {
                d[i] = Math.min(Math.min(d[i - 1], p[i]) + 1, p[i - 1]);
            }
            bestPossibleEditDistance = Math.min(bestPossibleEditDistance, d[i]);
        }

        // After calculating row j, the best possible edit distance
        // can be found by finding the smallest value in a given column.
        // If the bestPossibleEditDistance is greater than the max distance, abort.
        if (j > maxDistance && bestPossibleEditDistance > maxDistance) { // equal is okay, but not greater
            // the closest the target can be to the text is just too far away;
            // this target is leaving the party early.
            return 0.0f;
        }

RE: Improving String Distance calculation performance

2010-12-27 Thread Uwe Schindler
Sven:
Please start a new thread when asking a new question. From Hossman's Apache
page:

When starting a new discussion on a mailing list, please do not reply to an
existing message; instead, start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread you
replied to, and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.
See also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Biedermann,S.,Fa. Post Direkt [mailto:s.biederm...@postdirekt.de]
> Sent: Monday, December 27, 2010 12:23 PM
> To: dev@lucene.apache.org
> Subject: Improving String Distance calculation performance
> 
> Hi,
>
> this is my first post to this mailing list, so I first want to say hello
> to all of you!
>
>   You are doing a great job
>
> In org.apache.lucene.search.FuzzyTermEnum I found an optimised
> implementation of the Levenshtein algorithm which makes use of the fact
> that the algorithm can be aborted if a given minimum similarity cannot be
> reached anymore. I isolated that algorithm into a subclass of
> org.apache.lucene.search.spell.StringDistance, since we usually can make
> use of this optimisation.
>
> With our current minimum similarity setting of 0.75, this algorithm needs
> only about 22% of the run time of the original algorithm from the spell
> package against our test data.
>
> With a further optimisation candidate (see below), the runtime can be
> reduced by another third, to only 14% of the original Levenshtein.
>
> So, my first question is: is it worth adding a further method to the
> StringDistance interface:
>
>   float getDistance(String left, String right, float minimumSimilarity)
>
> so that applications can make use of possible optimisations
> (StringDistance implementations without optimisations would just ignore
> the minimumSimilarity parameter)?
>
> The idea of the optimisation candidate is to calculate only those fields
> in the "virtual" matrix that are near its diagonal.
> It is only a candidate since we have not proven it to work. But with all
> our test data (0.5 billion comparisons) there is no difference from the
> original algorithm.
>
> Do you have any counter examples?
> Since this is my first post, is this the right mailing list?
>
> Best Regards,
>
> Sven
>
>
> Here is the code taken from FuzzyTermEnum with some modifications (p and
> d are initialised somewhere else):
> 
>    public float getDistance(final String left, final String right, float minimumSimilarity) {
>        final int m = right.length();
>        final int n = left.length();
>        final int maxLength = Math.max(m, n);
>        if (n == 0) {
>            // we don't have anything to compare.  That means if we just add
>            // the letters for m we get the new word
>            return (m == 0) ? 1f : 0f;
>        }
>        if (m == 0) {
>            return 0f;
>        }
>
>        // be patient with rounding errors (1.001f instead of 1f).
>        final int maxDistance = (int) ((1.001f - minimumSimilarity) * maxLength);
>
>        if (maxDistance < Math.abs(m - n)) {
>            // just adding the characters of m to n or vice-versa results in
>            // too many edits.
>            // for example "pre" length is 3 and "prefixes" length is 8.  We can
>            // see that, given this optimal circumstance, the edit distance cannot
>            // be less than 5, which is 8-3, or more precisely Math.abs(3-8).
>            // if our maximum edit distance is 4, then we can discard this word
>            // without looking at it.
>            return 0.0f;
>        }
>
>        // if no edits are allowed, strings must be equal
>        if (maxDistance == 0)
>            return left.equals(right) ? 1f : 0f;
>
>        // init matrix d
>        for (int i = 0; i <= n; i++) {
>            p[i] = i;
>        }
>
>        // start computing edit distance
>        for (int j = 1; j <= m; j++) { // iterates through target
>            int bestPossibleEditDistance = m;
>            final char t_j = right.charAt(j - 1); // jth character of t
>            d[0] = j;
>
>            // ---> here is the optimisation candidate
>
>            // only iterate through a maxDistance corridor
>            final int startAt  = Math.max(1, j - maxDistance);
>            final int finishAt = Math.min(n, maxDistance - 1 + j);

PS: Improving String Distance calculation performance

2010-12-27 Thread Biedermann,S.,Fa. Post Direkt
Oops... I forgot to say that the candidate only works if left.length() <=
right.length()!

-Original Message-
From: Biedermann,S.,Fa. Post Direkt
Sent: Monday, December 27, 2010 12:23
To: 'dev@lucene.apache.org'
Subject: Improving String Distance calculation performance

Hi,

this is my first post to this mailing list, so I first want to say hello to all 
of you!

You are doing a great job 

In org.apache.lucene.search.FuzzyTermEnum I found an optimised implementation
of the Levenshtein algorithm which makes use of the fact that the algorithm can
be aborted if a given minimum similarity cannot be reached anymore. I isolated
that algorithm into a subclass of org.apache.lucene.search.spell.StringDistance,
since we usually can make use of this optimisation.

With our current minimum similarity setting of 0.75, this algorithm needs
only about 22% of the run time of the original algorithm from the spell
package against our test data.

With a further optimisation candidate (see below), the runtime can be
reduced by another third, to only 14% of the original Levenshtein.

So, my first question is: is it worth adding a further method to the
StringDistance interface:

float getDistance(String left, String right, float minimumSimilarity)

so that applications can make use of possible optimisations
(StringDistance implementations without optimisations would just ignore the
minimumSimilarity parameter)?


The idea of the optimisation candidate is to calculate only those fields in
the "virtual" matrix that are near its diagonal.
It is only a candidate since we have not proven it to work. But with all our
test data (0.5 billion comparisons) there is no difference from the
original algorithm.

Do you have any counter examples?
Since this is my first post, is this the right mailing list?

Best Regards,

Sven





Here is the code taken from FuzzyTermEnum with some modifications (p and d are
initialised somewhere else):

public float getDistance(final String left, final String right, float minimumSimilarity) {
    final int m = right.length();
    final int n = left.length();
    final int maxLength = Math.max(m, n);
    if (n == 0) {
        // we don't have anything to compare.  That means if we just add
        // the letters for m we get the new word
        return (m == 0) ? 1f : 0f;
    }
    if (m == 0) {
        return 0f;
    }

    // be patient with rounding errors (1.001f instead of 1f).
    final int maxDistance = (int) ((1.001f - minimumSimilarity) * maxLength);

    if (maxDistance < Math.abs(m - n)) {
        // just adding the characters of m to n or vice-versa results in
        // too many edits.
        // for example "pre" length is 3 and "prefixes" length is 8.  We can
        // see that, given this optimal circumstance, the edit distance cannot
        // be less than 5, which is 8-3, or more precisely Math.abs(3-8).
        // if our maximum edit distance is 4, then we can discard this word
        // without looking at it.
        return 0.0f;
    }

    // if no edits are allowed, strings must be equal
    if (maxDistance == 0)
        return left.equals(right) ? 1f : 0f;

    // init matrix d
    for (int i = 0; i <= n; i++) {
        p[i] = i;
    }

    // start computing edit distance
    for (int j = 1; j <= m; j++) { // iterates through target
        int bestPossibleEditDistance = m;
        final char t_j = right.charAt(j - 1); // jth character of t
        d[0] = j;

        // ---> here is the optimisation candidate

        // only iterate through a maxDistance corridor
        final int startAt  = Math.max(1, j - maxDistance);
        final int finishAt = Math.min(n, maxDistance - 1 + j);

        for (int i = startAt; i <= finishAt; ++i) { // iterates through text
            // minimum of cell to the left+1, to the top+1, diagonally left and up +(0|1)
            final char t_i = left.charAt(i - 1);
            if (t_j != t_i) {
                d[i] = Math.min(Math.min(d[i - 1], p[i]), p[i - 1]) + 1;
            } else {
                d[i] = Math.min(Math.min(d[i - 1], p[i]) + 1, p[i - 1]);
            }
            bestPossibleEditDistance = Math.min(bestPossibleEditDistance, d[i]);
        }

        // After calculating row j, the best possible edit distance
        // can be found by finding the smallest value in a given column.
        // If the bestPossibleEditDistance is greater than the max distance, abort.
        if (j > maxDistance && bestPossibleEditDistance > maxDistance) { // equal is okay, but not greater
            // the closest the target can be to the text is just too far away;
            // this target is leaving the party early.
            return 0.0f;
        }

        // copy current distance counts to 'previous row' distance counts: swap p and d

Improving String Distance calculation performance

2010-12-27 Thread Biedermann,S.,Fa. Post Direkt
Hi,

this is my first post to this mailing list, so I first want to say hello
to all of you!

You are doing a great job 

In org.apache.lucene.search.FuzzyTermEnum I found an optimised
implementation of the Levenshtein algorithm which makes use of the fact
that the algorithm can be aborted if a given minimum similarity cannot
be reached anymore. I isolated that algorithm into a subclass of
org.apache.lucene.search.spell.StringDistance, since we usually can make
use of this optimisation.

With our current minimum similarity setting of 0.75, this algorithm needs
only about 22% of the run time of the original algorithm from the spell
package against our test data.

With a further optimisation candidate (see below), the runtime can be
reduced by another third, to only 14% of the original Levenshtein.

So, my first question is: is it worth adding a further method to the
StringDistance interface:

float getDistance(String left, String right, float minimumSimilarity)

so that applications can make use of possible optimisations
(StringDistance implementations without optimisations would just ignore
the minimumSimilarity parameter)?


The idea of the optimisation candidate is to calculate only those
fields in the "virtual" matrix that are near its diagonal.
It is only a candidate since we have not proven it to work. But with
all our test data (0.5 billion comparisons) there is no difference
from the original algorithm.

Do you have any counter examples?
Since this is my first post, is this the right mailing list?

Best Regards,

Sven





Here is the code taken from FuzzyTermEnum with some modifications (p and
d are initialised somewhere else):

public float getDistance(final String left, final String right, float minimumSimilarity) {
    final int m = right.length();
    final int n = left.length();
    final int maxLength = Math.max(m, n);
    if (n == 0) {
        // we don't have anything to compare.  That means if we just add
        // the letters for m we get the new word
        return (m == 0) ? 1f : 0f;
    }
    if (m == 0) {
        return 0f;
    }

    // be patient with rounding errors (1.001f instead of 1f).
    final int maxDistance = (int) ((1.001f - minimumSimilarity) * maxLength);

    if (maxDistance < Math.abs(m - n)) {
        // just adding the characters of m to n or vice-versa results in
        // too many edits.
        // for example "pre" length is 3 and "prefixes" length is 8.  We can
        // see that, given this optimal circumstance, the edit distance cannot
        // be less than 5, which is 8-3, or more precisely Math.abs(3-8).
        // if our maximum edit distance is 4, then we can discard this word
        // without looking at it.
        return 0.0f;
    }

    // if no edits are allowed, strings must be equal
    if (maxDistance == 0)
        return left.equals(right) ? 1f : 0f;

    // init matrix d
    for (int i = 0; i <= n; i++) {
        p[i] = i;
    }

    // start computing edit distance
    for (int j = 1; j <= m; j++) { // iterates through target
        int bestPossibleEditDistance = m;
        final char t_j = right.charAt(j - 1); // jth character of t
        d[0] = j;

        // ---> here is the optimisation candidate

        // only iterate through a maxDistance corridor
        final int startAt  = Math.max(1, j - maxDistance);
        final int finishAt = Math.min(n, maxDistance - 1 + j);

        for (int i = startAt; i <= finishAt; ++i) { // iterates through text
            // minimum of cell to the left+1, to the top+1, diagonally left and up +(0|1)
            final char t_i = left.charAt(i - 1);
            if (t_j != t_i) {
                d[i] = Math.min(Math.min(d[i - 1], p[i]), p[i - 1]) + 1;
            } else {
                d[i] = Math.min(Math.min(d[i - 1], p[i]) + 1, p[i - 1]);
            }
            bestPossibleEditDistance = Math.min(bestPossibleEditDistance, d[i]);
        }

        // After calculating row j, the best possible edit distance
        // can be found by finding the smallest value in a given column.
        // If the bestPossibleEditDistance is greater than the max distance, abort.
        if (j > maxDistance && bestPossibleEditDistance > maxDistance) { // equal is okay, but not greater
            // the closest the target can be to the text is just too far away;
            // this target is leaving the party early.
            return 0.0f;
        }

        // copy current distance counts to 'previous row' distance counts: swap p and d
        int _d[] = p;
        p = d;
        d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts

    // this will return less than 0.0 when the edit distance is
    // greater than the number of characters in the shorter string