When I enable faceting in Solr, our incoming user queries for some
reason start being cached in the filter cache. This very quickly leads
the instance to run out of memory. We could lower the size of the
filterCache, but I feel this would be a band-aid over a far odder problem.
I have been investigating the heap dumps that were created on our
instances when we ran out of memory. These dumps show (unless YourKit is
being dishonest) that the filter cache contains
BoostedQueries(BooleanQueries(DisjunctionMaxQueries)) objects, each of
which contains term objects that I would not expect to see in the
filterCache.
A snapshot of the object graph can be seen here:
http://gbowyer.freeshell.org/filter-cache2.html
In terms of our index, queries and setup: we have a Solr 3.3 setup with
sharding. Some nodes act as aggregators, with the rest acting as
slaves or shards. As per recommendations, the aggregators act as
dispatchers for searches, but do not themselves surface any index data.
Most of our search queries differ on the search terms but generally have
the following form:
path=/aggregator/
params={fl=docid,pid,score&start=0&q=dat+data+cartridge&fq=+parent_cids:438&fq=+dtype:(1+OR+2)&rows=20
path=/select
params={fl=docid,score&start=0&q=polyethylene+bench+storage&enable=true&isShard=true&wt=javabin&fq=+rev_type:[1+TO+2]&fq=+parent_cids:25000500&fq=+dtype:(1+OR+2)&fsv=true&rows=20&version=2
Breaking this down, the fqs defined are against three fields:
* parent_cids - This field contains roughly 1394 terms. There are a few
  permutations for this field, but I would expect at most ~10000 fqs
  for this field.
* dtype - This field has 2 terms, and we only ever query it as shown
  above. It's reserved for some future work and would at most only ever
  have 8 terms.
* rev_type - Similar to dtype; we only have 3 terms in this field.
Our filters are generally not user accessible, and we ensure that
clients always provide filter queries in the same order to remove the
duplication of fqs (that is, we go to some length to avoid things like
fq=+dtype:(2+OR+1) appearing since we already cache fq=+dtype:(1+OR+2)).
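To illustrate the kind of normalization described above, here is a
minimal Python sketch. It is not our actual client code: the function
names are hypothetical, it works on URL-decoded fq strings, and it
assumes that only flat groups of the form field:(a OR b) need their
terms reordered.

```python
import re

# Hypothetical pattern: a flat OR-group such as "dtype:(2 OR 1)".
# Range filters like "rev_type:[1 TO 2]" deliberately do not match.
_OR_GROUP = re.compile(r"^(\w+):\((.+)\)$")


def canonical_fq(fq):
    """Return one filter query with the terms of a flat OR-group sorted."""
    fq = fq.strip()
    match = _OR_GROUP.match(fq)
    if not match:
        return fq  # leave anything we do not recognise untouched
    field, body = match.groups()
    # String sort: the order only has to be deterministic, not numeric.
    terms = sorted(term.strip() for term in body.split(" OR "))
    return "{}:({})".format(field, " OR ".join(terms))


def canonical_fqs(fqs):
    """Canonicalize every filter, then put the filters in a fixed order."""
    return sorted(canonical_fq(fq) for fq in fqs)
```

With this, dtype:(2 OR 1) and dtype:(1 OR 2) both become the same
string, so they would share a single filterCache entry.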
Our search handler is defined with some basic parameters as follows:
---- %< ----
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request
    -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="qf">title^1.0 descr^0.5 mft^0.5 brand^0.5</str>
    <str name="pf">title^3 descr^0.5</str>
    <str name="boost">product(redir,bid)</str>
    <str name="ps">4</str>
    <str name="mm">50%</str>
    <str name="defType">edismax</str>
    <int name="rows">20</int>
    <str name="facet">true</str>
    <str name="facet.field">price_bucket</str>
    <str name="facet.price_bucket.sort">count</str>
    <str name="facet.price_bucket.mincount">1</str>
    <str name="facet.price_bucket.limit">100</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>
---- >% ----
price_bucket is a field that we derive at index time: it takes a field
we store called price and creates a term that reflects the range (or
bucket) of prices that the given document falls into. I did originally
attempt to use facet counts directly, but found that the instance failed
due to running out of memory; at the time it was assumed that our range
of prices and the granularity of our "buckets" were creating too many
filter queries. For reference, there are 239 unique terms in the
price_bucket field.
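For illustration only, the bucketing idea can be sketched in Python as
below. This is not our indexing code: the function name, the fixed
bucket width and the "low-high" label format are all assumptions made
up for the example, and our real buckets need not be uniform.

```python
import math


def price_bucket(price, width=10.0):
    """Map a price to a term naming the fixed-width range it falls in.

    Hypothetical scheme: buckets are [low, low + width) and the term is
    the string "low-high".
    """
    if price < 0:
        raise ValueError("price must be non-negative")
    low = math.floor(price / width) * width
    return "{:g}-{:g}".format(low, low + width)
```

Every document priced from 10 up to (but not including) 20 would then
carry the same term, so faceting on the field counts documents per
price range rather than per distinct price.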
At present our installation, indexing practices and queries are very
vanilla; we are doing nothing esoteric beyond what comes out of the box.
This is a fairly undesirable issue, as it means that our filter cache
rapidly fills with cache items that are unlikely to ever be
required again.
Does anyone have any ideas on what could be causing this?
-- Greg Bowyer