The bottleneck may simply be that there are a lot of docs to score, since you are using fairly common terms.
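As a rough back-of-envelope (the numbers here are assumptions, not measurements): if a query matches ~1 million docs and DisMax expands, say, 5 terms across your 5 fields, that's up to 25 term scorers per request, i.e. on the order of tens of millions of postings reads and score computations. At a few tens of nanoseconds apiece that lands somewhere between hundreds of milliseconds and a couple of seconds of pure CPU, which lines up with what you're describing.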

Also, a few diagnostic questions:

- What file format (compound or non-compound) are you using, and is the index optimized?
- Have you profiled your app for these queries?
- When you say the query is "longer", how long is that: 5 terms? 50 terms?
- Do you have lots of deleted docs?
- Can you share your DisMax params?
- Are you doing wildcard queries?
- Can you share the syntax of one of the offending queries?
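One quick way to get most of that in one shot (purely illustrative; adjust the host, handler name, filter value and query to your setup, and this assumes you have a "dismax" handler configured): hit it with debugQuery=true and paste the parsedquery and timing sections of the response, e.g.

http://localhost:8983/solr/select?qt=dismax&q=one+of+the+slow+queries&fq=status:1&rows=10&debugQuery=true

That shows exactly what DisMax expanded the query into across your fields and roughly where the time is going.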

Since you want to keep "stopwords", you might consider a slightly smarter use of them: fold them into n-grams so that, at query-parsing time, the query searches the (much rarer) n-gram terms instead of the bare common words.

See also https://issues.apache.org/jira/browse/LUCENE-494 for related stuff.
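To make that concrete, here is a minimal sketch of the common-grams idea, written against the CommonGramsFilter/CommonGramsQueryFilter that exist in current Lucene's analysis-common module (so treat it as an illustration of the approach, not a drop-in for the version you're running): at index time you emit the single terms plus bigrams for pairs involving a common word; at query time you emit only those bigrams, so a query like "the who" searches the comparatively rare term "the_who" instead of every doc containing "the".

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.commongrams.CommonGramsQueryFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CommonGramsSketch {

  // Words too frequent to be worth searching alone, but which we still
  // want to keep for phrases ("the who", "to be or not to be", ...).
  static final CharArraySet COMMON =
      new CharArraySet(Arrays.asList("the", "a", "an", "of", "to", "and"), true);

  // Index-time analyzer: emits unigrams AND common-word bigrams,
  // e.g. "the who" -> "the", "the_who", "who".
  static Analyzer indexAnalyzer() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        TokenStream ts = new LowerCaseFilter(tok);
        ts = new CommonGramsFilter(ts, COMMON);
        return new TokenStreamComponents(tok, ts);
      }
    };
  }

  // Query-time analyzer: keeps only the bigrams around common words,
  // e.g. "the who" -> "the_who", so the scorer no longer has to visit
  // the million-odd docs that merely contain "the".
  static Analyzer queryAnalyzer() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        TokenStream ts = new LowerCaseFilter(tok);
        ts = new CommonGramsQueryFilter(new CommonGramsFilter(ts, COMMON));
        return new TokenStreamComponents(tok, ts);
      }
    };
  }
}

In Solr the same idea would be a field type whose index analyzer chain includes the common-grams filter and whose query analyzer adds the query-time variant; the doc frequency of "the_who" is tiny compared to "the", so far fewer docs end up getting scored.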

-Grant


On Sep 11, 2008, at 11:24 AM, Jason Rennie wrote:

We have a 14 million document index that we only use for querying (optimized, read-only). When we issue queries that have few, relatively rare words, the query returns quickly. However, when the query is longer and uses more common words (hitting, say, ~1 million docs), it might take seconds to return. I'd like to know: what's the bottleneck? It doesn't seem to be disk: I/O wait times on the machine are much, much lower than on our database servers (e.g. 3% vs. 50%). Our search server is an 8-core machine and we do see CPU regularly holding above 100%, so CPU seems plausible, but would it really take that long to compute scores?

We're using DisMax. There are a number of different fields that we search over (5 to be exact). We also have an fq on a single-digit status field. Does it make sense that computation time could easily exceed a second? If CPU is the bottleneck, is there anything we could do to easily trim down computation time (besides removing common words from the query)?

Jason

--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

