Hi Robert,

Thanks for your hint about LevenshteinAutomata. Are AutomatonQueries planned for an upcoming release?
At the moment we build the reference data so as to boost, at query time, those documents that contain fuzzily matched, seldom-used tokens within a queried region; in a manner of speaking, a fuzzified, localised idf(). The boosts are injected via payloads. Since Levenshtein only has to be calculated within a (fuzzified) region, O(mn) applies "only" per region; on the outside we have O(#regions). The problem could equivalently be solved at query time, but that would mean counting the matched documents of each fuzzy sub-query within a more complex query. In release 3.0.2 it looks quite complicated to me to incorporate a different scoring model that first counts the matches of each fuzzy sub-query and then applies the boosts to the matched tokens. I haven't seen a Scorer doing this so far. Furthermore, we are sensitive about query time. Do you have any ideas?

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, 27 December 2010 17:11
To: dev@lucene.apache.org
Subject: Re: Improving String Distance calculation performance

On Mon, Dec 27, 2010 at 10:31 AM, Biedermann,S.,Fa. Post Direkt <s.biederm...@postdirekt.de> wrote:
>
> As for our problem: we are trying to build reference data against which
> requests shall be matched. In this case we need quite a huge amount of string
> distance measurements for preparing this reference.
>

If this is your problem, I wouldn't recommend using the StringDistance directly. As I mentioned, it is not designed for your use case: the way it is used by the spellchecker, it only needs something like 20-50 comparisons. If you try to use it the way you describe, it will be very slow; it must do O(k) comparisons, where k is the number of strings, and each comparison is O(mn), where m and n are the lengths of the input string and the string being compared, respectively. Easier would be to index your terms and simply do a FuzzyQuery (with trunk), specifying the exact max edit distance you want.
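For reference, the O(mn) per-comparison cost discussed above comes from the classic dynamic-programming Levenshtein distance. The sketch below is a plain-Java illustration of that algorithm, not Lucene's StringDistance implementation:

```java
// Classic dynamic-programming Levenshtein distance: O(m*n) time and space,
// which is the per-comparison cost discussed in the thread.
// Illustrative sketch only; not Lucene's implementation.
public class Levenshtein {
    static int distance(String a, String b) {
        int m = a.length(), n = b.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= n; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,        // deletion
                        d[i][j - 1] + 1),       // insertion
                        d[i - 1][j - 1] + cost); // substitution or match
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(Levenshtein.distance("kitten", "sitting")); // 3
    }
}
```

Running k such comparisons against k indexed strings is where the O(k) * O(mn) total cost in Robert's reply comes from.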
Or, if you care about getting all exact results within Levenshtein distance of some degree N, use an AutomatonQuery built from LevenshteinAutomata. This will give you a sublinear number of comparisons, something complicated but more like O(sqrt(k)), where k is the number of strings, and each comparison is O(n), where n is the length of the target string.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
---------------------------------------------------------------------
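The "each comparison is O(n)" claim can be illustrated without Lucene. The sketch below is not the LevenshteinAutomata DFA approach itself, but a banded dynamic program making the same point: for a fixed maximum distance k, deciding "within edit distance k?" only needs a band of 2k+1 cells per row, i.e. O(kn) work, linear in the string length for fixed k, instead of the full O(mn) table:

```java
// Banded Levenshtein bound check: answers "is edit distance <= k?" in
// O(k*n) time, linear in n for fixed k. Illustrative sketch, assuming a
// fixed max distance; not Lucene's DFA-based LevenshteinAutomata.
public class EditDistanceBound {
    static boolean withinEditDistance(String a, String b, int k) {
        int m = a.length(), n = b.length();
        if (Math.abs(m - n) > k) return false; // length gap alone exceeds k
        final int INF = k + 1;                 // sentinel: "already over budget"
        int[] prev = new int[n + 1];
        int[] cur = new int[n + 1];
        java.util.Arrays.fill(prev, INF);
        for (int j = 0; j <= Math.min(n, k); j++) prev[j] = j;
        for (int i = 1; i <= m; i++) {
            java.util.Arrays.fill(cur, INF);
            int lo = Math.max(0, i - k), hi = Math.min(n, i + k);
            if (lo == 0) cur[0] = i;
            // Only the 2k+1 cells inside the band can stay <= k.
            for (int j = Math.max(1, lo); j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;                    // substitution/match
                if (prev[j] + 1 < best) best = prev[j] + 1;       // deletion
                if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1; // insertion
                cur[j] = Math.min(best, INF);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[n] <= k;
    }

    public static void main(String[] args) {
        System.out.println(EditDistanceBound.withinEditDistance("kitten", "sitting", 3)); // true
        System.out.println(EditDistanceBound.withinEditDistance("kitten", "sitting", 2)); // false
    }
}
```

The DFA that AutomatonQuery intersects with the term dictionary goes further still: it also prunes which of the k indexed terms need to be visited at all, which is where the sublinear number of comparisons comes from.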