Re: Improving String Distance calculation performance

Robert Muir Mon, 27 Dec 2010 08:11:21 -0800

On Mon, Dec 27, 2010 at 10:31 AM, Biedermann,S.,Fa. Post Direkt
<[email protected]> wrote:
>
> As for our problem: we are trying to build reference data against which 
> requests shall be matched. In this case we need quite a huge amount of string 
> distance measurements for preparing this reference.
>


If this is your problem, I wouldn't recommend using StringDistance
directly. As I mentioned, it's not designed for your use case: the
way the spellchecker uses it, it only needs something like 20-50
comparisons...

If you try to use it the way you describe, it will be very slow. It
must do O(k) comparisons, where k is the number of strings, and each
comparison is O(mn), where m and n are the lengths of the input string
and the string being compared, respectively.
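
To make that cost concrete, here is a minimal sketch of the brute-force
approach in plain Java. It is not Lucene's StringDistance code; the
class name, method names, and sample strings are made up for
illustration. It assumes the textbook dynamic-programming Levenshtein
distance, which fills an m-by-n table per comparison, and a linear scan
that runs one such comparison for each of the k reference strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
class BruteForceDistance {

    // Textbook dynamic-programming Levenshtein distance.
    // Time is O(m*n) for strings of length m and n; only two rows are kept.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Linear scan: one O(m*n) comparison for each of the k reference strings.
    static List<String> withinDistance(String query, List<String> reference, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String candidate : reference) {
            if (levenshtein(query, candidate) <= maxEdits) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample data is invented for this sketch.
        List<String> reference = Arrays.asList("berlin", "bremen", "bern", "bonn");
        System.out.println(withinDistance("berlim", reference, 1));
    }
}
```

Every query walks the whole reference list, so preparing reference data
that compares many strings against many others multiplies this cost
again.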

It would be easier to index your terms and simply run a FuzzyQuery (with
trunk), specifying the exact max edit distance you want. Or, if you
care about getting all exact results within Levenshtein distance N,
use an AutomatonQuery built from LevenshteinAutomata.

This will give you a sublinear number of comparisons, something
complicated but more like O(sqrt(k)) where k is the number of strings,
and each comparison is O(n), where n is the length of the target
string.
