Hi Robert,
I don't use the spellchecker, but of course I want to re-use the string
distance algorithms from a well-implemented library. These algorithms have
a broader scope than spellchecking alone; for instance, FuzzyTermEnum could
rely on them, and FuzzyTermEnum could be refactored to use other string
distance measurements as well...
As for our problem: we are trying to build reference data against which
requests shall be matched, and preparing this reference requires a huge
number of string distance computations.
For score scaling I took 1 - (#edits/maxTermLength), as suggested by the
original. I ran the candidate in parallel with the original LevensteinDistance
from the spellchecker and found no difference so far. Of course, this is no proof.
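For illustration, the abort criterion that follows from this scaling can be sketched like this. This is a simplified stand-alone sketch, not the actual patch; the class and method names are made up, and it uses a plain two-row DP rather than the FuzzyTermEnum code:

```java
// Sketch only: plain two-row Levenshtein with the 1 - (#edits/maxTermLength)
// scaling discussed above, plus an early abort once the minimum similarity
// can no longer be reached. Not the Lucene implementation.
public class EarlyAbortDistance {
    public static float getDistance(String a, String b, float minSimilarity) {
        int n = a.length(), m = b.length();
        int maxLen = Math.max(n, m);
        if (n == 0 || m == 0) {
            return (n == m) ? 1f : 0f;
        }
        // similarity = 1 - edits/maxLen >= minSimilarity
        //   <=>  edits <= (1 - minSimilarity) * maxLen
        int maxEdits = (int) Math.floor((1 - minSimilarity) * maxLen);
        // the length difference alone already forces that many edits
        if (Math.abs(n - m) > maxEdits) {
            return 0f;
        }
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // row minima are non-decreasing, so once even the best cell
            // exceeds the edit budget, the final distance must as well
            if (rowMin > maxEdits) {
                return 0f;
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        int edits = prev[m];
        return edits > maxEdits ? 0f : 1f - ((float) edits / maxLen);
    }
}
```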
Sven
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, 27 December 2010 16:07
To: dev@lucene.apache.org
Subject: Re: Improving String Distance calculation performance
Hi Biedermann:
you are correct in that the comparator in spellcheck could maybe use some
optimizations.
But I'm curious as to why you would be doing a lot of comparisons with the
spellchecker? Are you using this class separately for some other purpose?
The reason is that the spellchecker works like this (in two phases) to retrieve
N suggestions for a word:
* first phase is to do an n-gram query against the spellcheck index.
This is a simple BooleanQuery that returns 10 * N suggestions. For example if
you want the top 5 suggestions for a word, it will get the top 50 based on an
n-gram ranking.
* these top candidates (in my example only 50) are then re-ranked according
to the spellcheck comparator, such as Levenshtein. So in this example only 50
Levenshtein comparisons are done.
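The two-phase flow can be sketched roughly like this. This is illustrative only: candidate retrieval is stubbed out with a hard-coded list, and none of these names are the real SpellChecker internals.

```java
// Sketch of the two-phase flow described above (illustrative only; the real
// logic lives in Lucene's SpellChecker). The point is that only 10 * N
// strings ever reach the expensive string-distance comparison.
import java.util.*;

public class TwoPhaseSuggest {
    /** Phase 1 stand-in: in Lucene this is an n-gram BooleanQuery. */
    static List<String> ngramCandidates(String word, int howMany) {
        // hypothetical: pretend the index returned these by n-gram overlap
        return Arrays.asList("leven", "levers", "eleven", "seven", "level");
    }

    /** Phase 2: re-rank the small candidate set by an edit-distance score. */
    static List<String> suggest(String word, int n) {
        List<String> candidates = ngramCandidates(word, 10 * n);
        candidates.sort(Comparator.comparingDouble(c -> -similarity(word, c)));
        return candidates.subList(0, Math.min(n, candidates.size()));
    }

    /** 1 - edits/maxLen, computed with a plain DP (no early abort here). */
    static double similarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / maxLen;
    }
}
```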
Of course there is no reason why we shouldn't optimize the compare, if it's
safe. For the particular optimization you mention I have only one concern:
the optimization is correct for FuzzyTermsEnum, where the score scaling is
1 - (#edits/minTermLength). But in the spellchecker comparator, the scores
are scaled as 1 - (#edits/maxTermLength).
It might be that your optimization is just fine... I just wanted to mention
this difference.
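To make the difference concrete (a tiny hypothetical helper, not library code):

```java
// Hypothetical helper, not library code: the two scalings side by side.
// For "abcd" vs "ab" (2 edits), the FuzzyTermsEnum-style score is
// 1 - 2/2 = 0.0, while the spellchecker-style score is 1 - 2/4 = 0.5,
// so an abort bound derived from one scaling is not automatically safe
// for the other.
public class ScalingDiff {
    // FuzzyTermsEnum-style: divide by the shorter term's length
    static float fuzzyScale(int edits, int lenA, int lenB) {
        return 1f - (float) edits / Math.min(lenA, lenB);
    }
    // spellchecker-comparator-style: divide by the longer term's length
    static float spellScale(int edits, int lenA, int lenB) {
        return 1f - (float) edits / Math.max(lenA, lenB);
    }
}
```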
On Mon, Dec 27, 2010 at 8:08 AM, Biedermann,S.,Fa. Post Direkt
wrote:
> Hi,
>
> this is a re-post, because the first time I re-used another thread
> (sorry for any inconvenience):
>
>
> this is my first post to this mailing list, so I first want to say
> hello to all of you!
>
> You are doing a great job!
>
> In org.apache.lucene.search.FuzzyTermEnum I found an optimised
> implementation of the Levenshtein algorithm which exploits the fact
> that the computation can be aborted once a given minimum similarity
> can no longer be reached. I isolated that algorithm into an
> implementation of org.apache.lucene.search.spell.StringDistance, since
> we can usually make use of this optimisation.
>
> With our current minimum similarity setting of 0.75, this algorithm
> needs only about 22% of the run time of the original algorithm from
> the spell package on our test data.
>
> With a further optimisation candidate (see below) the runtime can be
> reduced by another third, to only 14% of the original Levenshtein.
>
> So, my first question is: is it worth adding a further method to the
> StringDistance interface:
>
>     float getDistance(String left, String right, float minimumSimilarity)
>
> so that applications can make use of possible optimisations
> (StringDistance implementations without optimisations would just ignore
> the minimumSimilarity parameter)?
>
>
> The idea of the optimisation candidate is to calculate only those
> fields in the "virtual" matrix that are near its diagonal.
> It is only a candidate since we have not proven it to work. But with
> all our test data (0.5 billion comparisons) there is no difference to
> the original algorithm.
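A minimal sketch of the diagonal-band idea described above, assuming a fixed maximum edit budget k; the class and method names are illustrative and this is not the code from this thread:

```java
// Sketch of the banded idea: with at most k allowed edits, only cells with
// |i - j| <= k can lie on a path of cost <= k, so each row is computed
// only inside that band. Illustrative code, not the posted patch.
public class BandedLevenshtein {
    /** Returns the edit distance if it is <= k, otherwise k + 1. */
    public static int distanceWithin(String a, String b, int k) {
        int n = a.length(), m = b.length();
        if (Math.abs(n - m) > k) return k + 1;
        final int INF = k + 1;                 // "more than k edits"
        int[] prev = new int[m + 1], curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = (j <= k) ? j : INF;
        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k), hi = Math.min(m, i + k);
            java.util.Arrays.fill(curr, INF);
            if (lo == 1) curr[0] = (i <= k) ? i : INF;
            for (int j = lo; j <= hi; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int best = prev[j - 1] + cost;                      // match/substitute
                if (prev[j] + 1 < best) best = prev[j] + 1;         // delete from a
                if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1; // insert into a
                curr[j] = Math.min(best, INF);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return Math.min(prev[m], INF);
    }
}
```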
>
>
> Do you have any counter examples?
> Since this is my first post, is this the right mailing list?
>
> Best Regards,
>
> Sven
>
>
>
> Here is the code taken from FuzzyTermEnum with some modifications (p
> and d are initialised somewhere else):
>
>
> public float getDistance(final String left, final String right,
>                          float minimumSimilarity) {
>     // the candidate works only if the longer string is on the right
>     if (left.length() > right.length())
>         return getDistanceInner(right, left, minimumSimilarity);
>     else
>         return getDistanceInner(left, right, minimumSimilarity);
> }
>
> private float getDistanceInner(final String left, final String right,
>                                float minimumSimilarity) {
>     final int m = right.length();
>     final int n = left.length();
>     final int maxLength = Math.max(m, n);
>     if (n == 0) {
>         // we don't have anything to compare. That means if we just add
>         // the letters for m we get the new word
>         return (m == 0) ? 1f : 0f;