Hi Robert,

Thanks for your hint about LevenshteinAutomata. Are AutomatonQueries planned for an upcoming release?
At the moment we build the reference data so as to boost, at query time, those documents that contain fuzzily matched, seldom-used tokens within a queried region; in a manner of speaking, a fuzzified, localised idf(). The boosts are injected via payloads. Since Levenshtein only has to be calculated within a (fuzzified) region, O(mn) applies "only" per region; on the outside we have O(#regions). The problem could equivalently be solved at query time, but that would mean counting the matched documents of each fuzzy sub-query within a more complex query. In release 3.0.2 it looks quite complicated to me to incorporate a different scoring model that first counts the matches of each fuzzy sub-query and then applies the boosts to the matched tokens. I haven't seen a Scorer doing this so far. Furthermore, we are sensitive about query time. Do you have any ideas?

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, 27 December 2010 17:11
To: dev@lucene.apache.org
Subject: Re: Improving String Distance calculation performance

On Mon, Dec 27, 2010 at 10:31 AM, Biedermann,S.,Fa. Post Direkt <s.biederm...@postdirekt.de> wrote:
>
> As for our problem: we are trying to build reference data against which
> requests shall be matched. In this case we need quite a huge amount of string
> distance measurements for preparing this reference.
>

If this is your problem, I wouldn't recommend using the StringDistance directly. As I mentioned, it is not designed for your use case: the way it is used by the spellchecker, it only needs something like 20-50 comparisons. If you try to use it the way you describe, it will be very slow; it must do O(k) comparisons, where k is the number of strings, and each comparison is O(mn), where m and n are the lengths of the input string and the string being compared, respectively. Easier would be to index your terms and simply do a FuzzyQuery (with trunk), specifying the exact max edit distance you want.
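For reference, the O(mn) per-comparison cost discussed above comes from the classic dynamic-programming Levenshtein distance. The sketch below is a plain-Java illustration of that algorithm, not Lucene's StringDistance implementation:

```java
// Classic dynamic-programming Levenshtein distance: O(m*n) time and space,
// which is the per-comparison cost discussed in the thread.
// Illustrative sketch only; not Lucene's implementation.
public class Levenshtein {
    static int distance(String a, String b) {
        int m = a.length(), n = b.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= n; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,        // deletion
                        d[i][j - 1] + 1),       // insertion
                        d[i - 1][j - 1] + cost); // substitution or match
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(Levenshtein.distance("kitten", "sitting")); // 3
    }
}
```

Running k such comparisons against k indexed strings is where the O(k) * O(mn) total cost in Robert's reply comes from.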
Or, if you care about getting all exact results within Levenshtein distance of some degree N, use an AutomatonQuery built from LevenshteinAutomata. This will give you a sublinear number of comparisons, something complicated but more like O(sqrt(k)), where k is the number of strings, and each comparison is O(n), where n is the length of the target string.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
---------------------------------------------------------------------
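The "each comparison is O(n)" claim can be illustrated without Lucene. The sketch below is not the LevenshteinAutomata DFA approach itself, but a banded dynamic program making the same point: for a fixed maximum distance k, deciding "within edit distance k?" only needs a band of 2k+1 cells per row, i.e. O(kn) work, linear in the string length for fixed k, instead of the full O(mn) table:

```java
// Banded Levenshtein bound check: answers "is edit distance <= k?" in
// O(k*n) time, linear in n for fixed k. Illustrative sketch, assuming a
// fixed max distance; not Lucene's DFA-based LevenshteinAutomata.
public class EditDistanceBound {
    static boolean withinEditDistance(String a, String b, int k) {
        int m = a.length(), n = b.length();
        if (Math.abs(m - n) > k) return false; // length gap alone exceeds k
        final int INF = k + 1;                 // sentinel: "already over budget"
        int[] prev = new int[n + 1];
        int[] cur = new int[n + 1];
        java.util.Arrays.fill(prev, INF);
        for (int j = 0; j <= Math.min(n, k); j++) prev[j] = j;
        for (int i = 1; i <= m; i++) {
            java.util.Arrays.fill(cur, INF);
            int lo = Math.max(0, i - k), hi = Math.min(n, i + k);
            if (lo == 0) cur[0] = i;
            // Only the 2k+1 cells inside the band can stay <= k.
            for (int j = Math.max(1, lo); j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;                    // substitution/match
                if (prev[j] + 1 < best) best = prev[j] + 1;       // deletion
                if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1; // insertion
                cur[j] = Math.min(best, INF);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[n] <= k;
    }

    public static void main(String[] args) {
        System.out.println(EditDistanceBound.withinEditDistance("kitten", "sitting", 3)); // true
        System.out.println(EditDistanceBound.withinEditDistance("kitten", "sitting", 2)); // false
    }
}
```

The DFA that AutomatonQuery intersects with the term dictionary goes further still: it also prunes which of the k indexed terms need to be visited at all, which is where the sublinear number of comparisons comes from.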