Why exactly are fuzzy queries so slow?

2007-11-24 Thread Timo Nentwig
Hi! I search an 1.5 gig index and fuzzy queries are really slow; something like avg. ~500ms (IndexSearcher.search(Query, HitCollector)). When performing exact queries I archieve response times <25ms. What is it that makes fuzzy queries so slow? Increased index access due to more terms, i.e. d

Re: Why exactly are fuzzy queries so slow?

2007-11-24 Thread Mathieu Lecarme
fuzzy are simply not indexed. If you wont to search quickly with fuzzy search, you should index word and their ngrams, it's the "do you mean" pattern. you first select used word wich share ngram with the query word, the distance is computed with levenstein, and you use this word as a synon

Re: Why exactly are fuzzy queries so slow?

2007-11-24 Thread markharw00d
The added IO is one factor. Another is the CPU load from doing many edit-distance comparisons between index terms and the provided search term. You can limit the number of edit distance comparisons conducted by setting the minimum prefix length. This is a property of the QueryParser if parsing

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Timo Nentwig
On Saturday 24 November 2007 18:28:48 markharw00d wrote: > The added IO is one factor. Another is the CPU load from doing many > edit-distance comparisons between index terms and the provided search You mean FuzzyQuery.rewrite(). Are you sure this is a CPU and not an IO issue (reading the terms f

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Timo Nentwig
On Saturday 24 November 2007 18:48:18 Mathieu Lecarme wrote: > fuzzy are simply not indexed. > If you wont to search quickly with fuzzy search, you should index word > and their ngrams, it's the "do you mean" pattern. replacing fuzzy with "did you mean" is indeed my favourite option however so fa

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Timo Nentwig
On Saturday 24 November 2007 18:28:48 markharw00d wrote: > term. You can limit the number of edit distance comparisons conducted by > setting the minimum prefix length. This is a property of the QueryParser Well, javadoc: "prefixLength - length of common (non-fuzzy) prefix". So, this is some kind

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread markharw00d
For "fuzzy" you're going to pay one way or another. You can use ngram analyzers on indexed content and queries which will add IO costs ("files" becomes "fi","fil", "file","il","ile","iles" in both your query and index) or you can use some form of query-time edit distance comparison on "files" a

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Mathieu Lecarme
Well, javadoc: "prefixLength - length of common (non-fuzzy) prefix". So, this is some kind of "wildcard fuzzy" but not real fuzzy anymore. I understand the optimitation but right now I hardly can image a reasonable use-case. Who care whether the levenstein distance is a the beginnen, middle

Re: Why exactly are fuzzy queries so slow?

2007-11-26 Thread Timo Nentwig
On Sunday 25 November 2007 11:54:15 markharw00d wrote: > For "fuzzy" you're going to pay one way or another. But which one is the cheapest? :) > You can use ngram analyzers on indexed content and queries which will > add IO costs ("files" becomes "fi","fil", "file","il","ile","iles" in > both you