On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:
> Date: Wed, 31 Oct 2007 17:54:53 -0700
> Subject: Re: Phrase Query Performance Question
> From: [EMAIL PROTECTED]
> To: solr-[EMAIL PROTECTED]
>
> "hurricane katrina" is a very expensive query against a collection
> focused on Hurricane Katrina. There will be many matches in many
> documents. If you want to measure worst-case, this is fine.
>
> I'd try other things, like:
>
> * ninth ward
> * Ray Nagin
> * Audubon Park
> * Canal Street
> * French Quarter
> * FEMA mistakes
> * storm surge
> * Jackson Square
>
> Of course, real query logs are the only real test.
>
> wunder
These terms are not frequent in my index. I believe they are going
to be fast. The thing is that I feel 2 million documents is a small
index.
100,000 or 200,000 hits is a small set and should always have sub
second query performance. Now I am only querying one field and the
response is almost one second. I feel I can't achieve sub second
performance if I add a bit more complexity to the query.
Many of the category terms in my index will appear in more than 5%
of the documents and those category terms are very popular search
terms. So the examples I gave were not extreme cases for my index.
I think that you are somewhat misguided about what constitutes a
small set. A query term that appears in 5-10% of the index in a
natural language corpus is _extremely_ frequent. Not quite on the
order of stopwords, but getting there. As a comparison, on an
extremely large corpus that I have handy, documents containing both
the word 'auto' and 'repair' (not necessarily adjacent) constitute
0.1% of the index. The frequency of the phrase "auto repair" is 0.025%.
At that rate, 200k matching docs is what you would get from an 800-million-doc corpus.
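For what it's worth, the arithmetic behind that comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable and function names are mine, and the figures are just the ones quoted in this thread (0.025% phrase frequency, 200,000 hits, a 2-million-doc index).

```python
# Back-of-the-envelope check of the figures quoted in this thread.
# All names here are illustrative; nothing below comes from Solr or Lucene.

PHRASE_FREQUENCY = 0.025 / 100   # "auto repair" matches 0.025% of documents
TARGET_HITS = 200_000            # hit count described as "a small set"
INDEX_SIZE = 2_000_000           # size of the index being queried


def corpus_size_for_hits(hits: int, frequency: float) -> float:
    """Corpus size needed for a phrase of the given frequency to yield `hits`."""
    return hits / frequency


def hit_fraction(hits: int, index_size: int) -> float:
    """Fraction of the index matched by a query returning `hits` documents."""
    return hits / index_size


implied_corpus = corpus_size_for_hits(TARGET_HITS, PHRASE_FREQUENCY)
fraction = hit_fraction(TARGET_HITS, INDEX_SIZE)

print(f"Corpus needed for {TARGET_HITS:,} hits: {implied_corpus:,.0f} docs")
print(f"Share of a {INDEX_SIZE:,}-doc index matched: {fraction:.1%}")
```

Put differently, 200,000 hits out of 2 million documents means the query matches about a tenth of the index, which is several hundred times more frequent than the "auto repair" phrase in the corpus mentioned above.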
What data are you indexing, and what is the intended effect of the
phrase queries you are performing? Perhaps getting at the issue from
this end would be more productive than hammering at the phrase query
performance question.
When I started Tomcat I saw this message:
The Apache Tomcat Native library which allows optimal performance
in production environments was not found on the java.library.path
Does that mean query performance will be better if I use the Apache
Tomcat Native library? Does anyone have experience with that?
Unlikely, though it might help you slightly at a high query rate with
high cache hit ratios.
-Mike