Re: What's the bottleneck?

Grant Ingersoll Fri, 12 Sep 2008 05:40:54 -0700

The bottleneck may simply be that there are a lot of docs to score, since you are using fairly common terms.

Also, what file format (compound, non-compound) are you using? Is it optimized? Have you profiled your app for these queries? When you say the "query is longer", define "longer"... 5 terms? 50 terms? Do you have lots of deleted docs? Can you share your DisMax params? Are you doing wildcard queries? Can you share the syntax of one of the offending queries?

Since you want to keep "stopwords", you might consider a slightly better use of them, whereby you use them in n-grams only during query parsing.

See also https://issues.apache.org/jira/browse/LUCENE-494 for related stuff.
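
To make the n-gram idea concrete, here is a rough sketch in Python (for brevity, not Lucene's Java, and not the code from LUCENE-494). It assumes the query has already been tokenized and lowercased, and that the index was built with both single terms and word-pair terms so the pairs have something to match. The word list, the "_" separator, and the name query_terms are all made up for illustration.

```python
# Hypothetical list of very common words; a real list would come from
# your own index statistics.
COMMON_WORDS = {"the", "of", "a", "and", "to", "in"}


def query_terms(tokens, common=COMMON_WORDS, sep="_"):
    """Build query terms in which common words appear only inside bigrams.

    Rare words are kept as single terms. Any adjacent pair that contains
    at least one common word is also emitted as a joined bigram term.
    Common words are never emitted on their own.
    """
    terms = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok not in common:
            terms.append(tok)  # rare word: keep the single term
        if nxt is not None and (tok in common or nxt in common):
            terms.append(tok + sep + nxt)  # pair involving a common word
    return terms


print(query_terms(["the", "history", "of", "rome"]))
# ['the_history', 'history', 'history_of', 'of_rome', 'rome']
```

The point is that a pair like "of_rome" matches far fewer docs than "of" alone, so there are fewer docs to score. Note that in this sketch a query made up of a single common word produces no terms at all, so you would need a fallback for that case.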


-Grant


On Sep 11, 2008, at 11:24 AM, Jason Rennie wrote:

We have a 14 million document index that we only use for querying (optimized, read-only). When we issue queries that have few, relatively rare words, the query returns quickly. However, when the query is longer and uses more common words (hitting, say, ~1 million docs), it might take seconds to return. I'd like to know: what's the bottleneck? It doesn't seem to be disk---i/o wait times on the machine are much, much lower than on our database servers (e.g. 3% vs. 50%). Our search server is an 8-core machine and we do see cpu regularly holding above 100%, so cpu seems plausible, but would it really take that long to compute scores?

We're using DisMax. There are a number of different fields that we search over (5 to be exact). We also have an fq on a single-digit status field. Does it make sense that computation time could easily exceed a second? If cpu is the bottleneck, is there anything we could do to easily trim down computation time (besides removing common words from the query)?

Jason

--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
