Fergus McMenemie wrote:
On Tue, Jun 9, 2009 at 7:25 PM, Michael Ludwig <m...@as-guides.com>
wrote:

A filter query is cached, which means that it is the more useful
the more often it is repeated. We know how often certain queries
arise, or at least have the means to collect that data - so we
know what might be candidates for filtering.
Sorry, but I can't make any sense of the above. Could you have
another go at explaining it?

Filtering a given query result R on bla:eins, bla:zwei, bla:drei or
bla:vier is very common in my application. So while I could include
this criterion in my main query (q) and hope for the queryResultCache
to kick in, this would be unlikely to be efficient as my primary
query, which gave me R, likely varies a lot, resulting in a high
number of distinct queries, with relatively low probability for a
given query to occur frequently. So each of these query result sets
would enter the queryResultCache as a distinct set, hence high
contention, high eviction rate, poor cache efficiency.

Now I'm going to factor out those bla:{eins,zwei,drei,vier} filters
from my primary query (q) and put them in the filter query (fq). The
benefit is twofold:

(1) Solr has a dedicated cache space for filters, the usage of which I
control by my usage of the filter query (fq). I can set up things so
the usage of the primary query (q) is under the user's control while
the usage of the filter query (fq) is under my application's control.
I control this cache, I ensure its efficiency.

(2) Factoring out the filter query bla:{eins,zwei,drei,vier} from the
primary query also reduces variation in the primary query, thus making
the queryResultCache more efficient.

So instead of having, say, 10000 distinct primary queries, no usage of
the filterCache, and poor usage of the queryResultCache, I may have
only, say, 3000 distinct primary queries, four cached filters in the
filterCache (bla:{eins,zwei,drei,vier}), and a somewhat better usage
of the queryResultCache.
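
To illustrate the counting argument, here is a toy model in Python. The user queries and the helper names are made up for illustration, and the sketch only counts distinct q strings; how Solr actually builds its cache keys is not modeled.

```python
from itertools import product

# Hypothetical data: five user-entered queries and the four common filters.
user_queries = ["jazz", "opera", "theatre", "cinema", "ballet"]
filters = ["bla:eins", "bla:zwei", "bla:drei", "bla:vier"]

def combined_keys(queries, filters):
    """Distinct main queries when the filter is folded into q."""
    return {f"({q}) AND {f}" for q, f in product(queries, filters)}

def factored_keys(queries, filters):
    """Distinct main queries and distinct filters when fq is used."""
    return set(queries), set(filters)

combined = combined_keys(user_queries, filters)
distinct_q, distinct_fq = factored_keys(user_queries, filters)

print(len(combined))     # distinct q strings with the filter inlined
print(len(distinct_q))   # distinct q strings with the filter factored out
print(len(distinct_fq))  # distinct filter entries
```

With the filter inlined, every combination of user query and filter is its own q string; with the filter factored out, the q strings and the filters are counted separately.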

I wrote that we "know how often certain queries arise, or at least
have the means to collect that data", because we know the application
we're writing, so we either know the frequency of a given search
pattern based on the usage our application makes of Solr and on the
restrictions it imposes on the user by, say, using Dismax; or - if we
give the user fine-grained control over the query language - we may
somehow collect and analyze the actual queries in order to empirically
determine actual search engine usage and optimize accordingly.

The result of a filter query is cached and then used to filter a
primary query result using set intersection. If my filter query
result comprises more than 50 % of the entire document collection,
its selectivity is poor. I might need it despite this fact, but it
might also be worthwhile to think about how to reframe the
requirement, allowing for more efficient filters.
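
As a rough sketch of the idea, with document sets modeled as plain Python sets (the document IDs and function names are invented for illustration; Solr's internal representation is not shown here):

```python
def selectivity(filter_docs, total_docs):
    """Fraction of the collection that passes the filter (lower is more selective)."""
    return len(filter_docs) / total_docs

def apply_filter(query_result, filter_docs):
    """Restrict a primary query result to documents in the cached filter set."""
    return query_result & filter_docs

total_docs = 10
cached_filter = {0, 1, 2, 3, 4, 5, 6}  # hypothetical IDs matching the filter
query_result = {2, 5, 8, 9}            # hypothetical IDs matching q

print(selectivity(cached_filter, total_docs))             # 0.7
print(sorted(apply_filter(query_result, cached_filter)))  # [2, 5]
```

Here the filter lets through 70 % of the collection, so it removes little from most query results.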

So, just to be explicit, if I have a query containing:

   &fq=EventType:fair&fq=EventType:film&fq=LAT:[50 TO 60]&fq=LONG:[-1 TO 1]

The first time this is encountered, it is going to run four
queries against the entire index and cache four sets of document
IDs. Subsequent queries will reuse the various cached
entries as appropriate. Is that correct?

I do think so.
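
A toy model of that behaviour, assuming each fq string is cached under its own key (the class, the index layout, and the predicates below are hypothetical and only mimic the idea, not Solr's implementation):

```python
class ToyFilterCache:
    """Caches one document-ID set per distinct fq string."""

    def __init__(self, index):
        self.index = index    # doc_id -> dict of field values
        self.entries = {}     # fq string -> frozenset of doc IDs
        self.hits = 0
        self.misses = 0

    def doc_set(self, fq, predicate):
        if fq in self.entries:
            self.hits += 1
        else:
            # Cache miss: scan the whole (toy) index once for this filter.
            self.misses += 1
            self.entries[fq] = frozenset(
                doc_id for doc_id, doc in self.index.items() if predicate(doc)
            )
        return self.entries[fq]

# Hypothetical three-document index.
index = {
    1: {"EventType": "fair", "LAT": 55.0, "LONG": 0.5},
    2: {"EventType": "film", "LAT": 51.5, "LONG": -0.1},
    3: {"EventType": "fair", "LAT": 40.0, "LONG": 3.0},
}

request_filters = [
    ("EventType:fair", lambda d: d["EventType"] == "fair"),
    ("EventType:film", lambda d: d["EventType"] == "film"),
    ("LAT:[50 TO 60]", lambda d: 50 <= d["LAT"] <= 60),
    ("LONG:[-1 TO 1]", lambda d: -1 <= d["LONG"] <= 1),
]

cache = ToyFilterCache(index)
for _ in range(2):  # send the same request twice
    for fq, predicate in request_filters:
        cache.doc_set(fq, predicate)

print(cache.misses, cache.hits, len(cache.entries))
```

The first pass populates four entries; the second pass finds all four already cached.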

I guess in the above case, where my GEO search window will keep
changing, I should ideally arrange that the lat and long elements are
added to the q parameter to stop my cache being cluttered.

My understanding is that what varies heavily should *not* go into the
filterCache. Your GEO search window might vary quite a bit (probably
much more than EventType), so to me it looks like a candidate for the
main query.
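
One way this split might look when building the request parameters, as a sketch only: the function name and the user query "concert" are invented, and the way the ranges are ANDed into q assumes the standard Lucene query syntax rather than Dismax.

```python
from urllib.parse import urlencode, parse_qs

def build_params(user_query, event_type, lat_range, long_range):
    """Put the fast-changing geo window into q and the stable EventType into fq."""
    lat_lo, lat_hi = lat_range
    long_lo, long_hi = long_range
    q = (
        f"({user_query}) AND LAT:[{lat_lo} TO {lat_hi}] "
        f"AND LONG:[{long_lo} TO {long_hi}]"
    )
    return [("q", q), ("fq", f"EventType:{event_type}")]

params = build_params("concert", "fair", (50, 60), (-1, 1))
query_string = urlencode(params)  # percent-encodes ':', '[', ']' and spaces
print(query_string)

decoded = parse_qs(query_string)  # round-trip to check what the server would see
print(decoded["q"][0])
print(decoded["fq"])
```

Only EventType:fair ends up as a filter here, so a changing search window does not add new filter entries.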

Also, what happens when the filter cache is full? Is there any accounting
of which cache entries are getting the most or most recent hits?

Good question!

Michael Ludwig
