Manuel:

First off, anything that Mike McCandless says about low-level
details should override anything I say. The memory savings
he's talking about there are actually something he walked me
through once in a chat.

The savings there, as I understand it, aren't huge. For large
sets I think it's a 25% savings (if I calculated right). But consider
that even without those savings, 8 filter cache entries will take up
more memory than the entire structure that JIRA talks about....

As to your fq question, absolutely! Any yes/no clause that,
as you say, doesn't contribute to the score is a candidate to be
moved to a fq clause. There are a couple of things to
be aware of though:
1> be a little careful of using NOW. If you don't use it correctly,
     fq clauses will not be re-used. See:
     http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
2> How you usually do this is through the UI, not the users entering
     a query. For instance, if you have a date-range picker, your app
     constructs the fq clause from that. Or you append fq clauses to the
     links you create when you display facets or.... (see the sketch below)
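
For instance, the date-picker case might look like this from SolrJ (just a
sketch -- the "timestamp" field and the 7-day window are made-up examples,
not anything from your setup):

    import org.apache.solr.client.solrj.SolrQuery;

    public class DatePickerQuery {
        public static void main(String[] args) {
            // The user's free-text search still scores normally...
            SolrQuery q = new SolrQuery("ipod");
            // ...while the UI turns the date-range picker into an fq.
            // Rounding to NOW/DAY keeps the fq text identical across
            // requests, so the filterCache entry actually gets re-used.
            q.addFilterQuery("timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]");
            System.out.println(q); // prints the encoded q and fq parameters
        }
    }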

No, there's no automatic tool for this. There's not likely to be one
since there's no way to infer the intent. Say you put in a clause like
q=a AND b.
That scores things. It would give the same result set as
q=*:*&fq=a&fq=b
which would compute no scores. How could a tool infer when this
was or wasn't OK?
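
If it helps to see the two forms side by side in code, here's a quick SolrJ
sketch ("a" and "b" just stand in for whatever the real clauses are):

    import org.apache.solr.client.solrj.SolrQuery;

    public class ScoredVsFiltered {
        public static void main(String[] args) {
            // Both clauses participate in scoring:
            SolrQuery scored = new SolrQuery("a AND b");

            // Same document set, but nothing useful is scored, and each fq
            // gets its own re-usable filterCache entry:
            SolrQuery filtered = new SolrQuery("*:*");
            filtered.addFilterQuery("a", "b");

            System.out.println(scored);
            System.out.println(filtered);
        }
    }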

Best
Erick

On Sun, Jul 14, 2013 at 6:10 PM, Manuel Le Normand
<manuel.lenorm...@gmail.com> wrote:
> Alright, thanks Erick. For the question about memory usage of merges, this is
> taken from Mike McCandless' blog:
>
> The big thing that stays in RAM is a logical int[] mapping old docIDs to
> new docIDs, but in more recent versions of Lucene (4.x) we use a much more
> efficient structure than a simple int[] ... see
> https://issues.apache.org/jira/browse/LUCENE-2357
>
> How much RAM is required is mostly a function of how many documents (lots
> of tiny docs use more RAM than fewer huge docs).
>
>
> A related clarification:
> As my users are not aware of the fq possibility, I was wondering how to
> make the best use of the filter cache. Would it be efficient to implicitly
> transform their queries into filter queries on fields that are boolean
> searches (date ranges etc. that do not affect the score of a document)? Is
> this a good practice? Is there any query parser plugin that does this?
>
>
>
>>
>> Inline
>>
>> On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand
>> <manuel.lenorm...@gmail.com> wrote:
>> > Hello,
>> > As a result of frequent java OOM exceptions, I try to investigate more
> into
>> > the solr jvm memory heap usage.
>> > Please correct me if I am mistaking, this is my understanding of usages
> for
>> > the heap (per replica on a solr instance):
>> > 1. Buffers for indexing - bounded by ramBufferSize
>> > 2. Solr caches
>> > 3. Segment merge
>> > 4. Miscellaneous- buffers for Tlogs, servlet overhead etc.
>> >
>> > Particularly I'm concerned by Solr caches and segment merges.
>> > 1. How much memory (bytes per doc) do filterCaches (bitDocSet)
>> > and queryResultCaches (DocList) consume? I understand it is related to the
>> > gaps between the doc ids that match (so it's not saved as a bitmap). But
>> > basically, is every id saved as a Java int?
>>
>> Different beasts. filterCache consumes, essentially, maxDoc/8 bytes per
>> entry (you can get the maxDoc number from your Solr admin page), plus some
>> overhead for storing the fq text, but that's usually not much. This applies
>> to each entry, up to "size" entries.
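
To put rough numbers on that (back-of-the-envelope only -- the maxDoc and
cache size below are made up, plug in your own):

    public class FilterCacheSizing {
        public static void main(String[] args) {
            long maxDoc = 100000000L;          // assumed 100M docs; read yours off the admin page
            long bytesPerEntry = maxDoc / 8;   // one bit per doc, ~12.5 MB here
            long ifFull = bytesPerEntry * 512; // a size="512" cache fully populated, ~6.4 GB
            System.out.println(bytesPerEntry + " bytes/entry, " + ifFull + " bytes if full");
        }
    }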
>
>
>
>>
>> queryResultCache is usually trivial unless you've configured it
>> extravagantly.
>> It's the query string length + queryResultWindowSize integers per entry
>> (queryResultWindowSize is from solrconfig.xml).
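
Again as a rough sketch (assuming a queryResultWindowSize of 50 and a cache
size of 512 -- my numbers, not yours):

    public class QueryResultCacheSizing {
        public static void main(String[] args) {
            int windowSize = 50;                // queryResultWindowSize from solrconfig.xml (assumed)
            int bytesPerEntry = windowSize * 4; // ~200 bytes of doc ids per entry, plus the query string
            int entries = 512;                  // assumed cache size
            System.out.println(entries * bytesPerEntry + " bytes of ids total -- trivial");
        }
    }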
>>
>> > 2. QueryResultMaxDocsCached - (for example = 100) means that any query
>> > resulting in more than 100 docs will not be cached (at all) in the
>> > queryResultCache? Or does it have to do with the documentCache?
>> It's just a limit on the queryResultCache entry size as far as I can
>> tell. But again, this cache is relatively small; I'd be surprised if it
>> used significant resources.
>>
>> > 3. DocumentCache - the wiki says it should be greater than
>> > max_results*concurrent_queries. Max results is just the number of rows
>> > displayed (the rows/start params), right? Not queryResultWindowSize.
>>
>> Yes. This is a cache (I think) for the _contents_ of the documents you'll
>> be returning to be manipulated by various components during the life
>> of the query.
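
For scale (illustrative numbers only): with rows=20 and, say, 50 queries in
flight at once, the wiki formula above suggests a documentCache of at least
20 * 50 = 1000 entries, each holding one document's stored fields.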
>>
>> > 4. LazyFieldLoading=true - when querying for ids only (fl=id), will this
>> > cache be used (at the expense of evicting docs that were already loaded
>> > with stored fields)?
>>
>> Not sure, but I don't think this will contribute much to memory pressure.
>> This is about how many fields are loaded to get a single value from a doc
>> in the results list, and since one is usually working with 20 or so docs
>> this is usually a small amount of memory.
>>
>> > 5. How much heap is used by merges? Assuming we have a merge of 10
>> > segments of 500MB each (half inverted files - *.pos, *.doc, etc., half
>> > non-inverted files - *.fdt, *.tvd), how much heap should be left unused
>> > for this merge?
>>
>> Again, I don't think this is much of a memory consumer, although I
>> confess I don't
>> know the internals. Merging is mostly about I/O.
>>
>> >
>> > Thanks in advance,
>> > Manu
>>
>> But take a look at the admin page; you can see how much memory various
>> caches are using by looking at the plugins/stats section.
>>
>> Best
>> Erick
