The bottleneck may simply be that there are a lot of docs to score,
since you are using fairly common terms.
Also, what file format (compound, non-compound) are you using? Is it
optimized? Have you profiled your app for these queries? When you
say the "query is longer", define "longer"... 5 terms? 50 terms? Do
you have lots of deleted docs? Can you share your DisMax params? Are
you doing wildcard queries? Can you share the syntax of one of the
offending queries?
Since you want to keep "stopwords", you might consider making slightly
better use of them: during query parsing, use them only as part of
n-grams (combined with neighboring terms) rather than as standalone
terms.
See also https://issues.apache.org/jira/browse/LUCENE-494 for related
stuff.
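To make the n-gram idea concrete, here is a rough sketch in Python (not
Lucene code). It assumes the query is already tokenized, that the
stopword list below is only a placeholder, and that the index contains
the same bigrams so the rewritten terms can match. The function name
and the "_" separator are made up for illustration.

```python
# Hypothetical stopword list; a real one would come from your analyzer config.
STOPWORDS = frozenset({"the", "of", "a", "and", "in", "to", "for"})


def common_grams(tokens, stopwords=STOPWORDS):
    """Rewrite query tokens so stopwords appear only inside bigrams.

    Every adjacent pair that contains at least one stopword becomes a
    single bigram term "left_right". A token is emitted on its own only
    if it is not part of any such bigram.
    """
    n = len(tokens)
    # Mark tokens that are covered by at least one stopword bigram.
    covered = [False] * n
    for i in range(n - 1):
        if tokens[i] in stopwords or tokens[i + 1] in stopwords:
            covered[i] = covered[i + 1] = True

    terms = []
    for i in range(n):
        if not covered[i]:
            terms.append(tokens[i])  # plain term, no stopword neighbor
        if i + 1 < n and (tokens[i] in stopwords or tokens[i + 1] in stopwords):
            terms.append(tokens[i] + "_" + tokens[i + 1])
    return terms


if __name__ == "__main__":
    print(common_grams("the history of rome".split()))
    print(common_grams("red leather shoes".split()))
```

The point is that a term like "history_of" should match far fewer docs
than "of" alone, so there is less to score, while the stopword still
contributes to matching.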
-Grant
On Sep 11, 2008, at 11:24 AM, Jason Rennie wrote:
We have a 14 million document index that we only use for querying
(optimized, read-only). When we issue queries that have few, relatively
rare words, the query returns quickly. However, when the query is longer
and uses more common words (hitting, say, ~1 million docs), it might take
seconds to return. I'd like to know: what's the bottleneck? It doesn't
seem to be disk---i/o wait times on the machine are much, much lower than
on our database servers (e.g. 3% vs. 50%). Our search server is an 8-core
machine and we do see cpu regularly holding above 100%, so cpu seems
plausible, but would it really take that long to compute scores?
We're using DisMax. There are a number of different fields that we search
over (5 to be exact). We also have an fq on a single-digit status field.
Does it make sense that computation time could easily exceed a second? If
cpu is the bottleneck, is there anything we could do to easily trim down
computation time (besides removing common words from the query)?
Jason
--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/