filterCache: This is bounded by (maxDoc / 8) bytes * (num filters in the
cache). Notice the /8: it reflects the fact that each cached filter is
represented by a bitset over the _internal_ Lucene doc IDs, one bit per
document. The uniqueKey has no bearing here whatsoever. This is, in a
nutshell, why warming is required: the internal Lucene IDs may change
whenever a new searcher is opened. Note also that it's maxDoc, not
numDocs; the bitset still has slots ("holes") for deleted documents.

Note this is an _upper_ bound. If only a few docs match a filter, the
entry is stored as a list of doc IDs instead, and its size is roughly
(num of matching docs) * sizeof(int).
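
To put rough numbers on that with the figures from your scenario: maxDoc
around 10M means each fully-populated entry is about 10,000,000 / 8 =
~1.25MB, so a filterCache of, say, 512 entries (a placeholder size, not a
recommendation) tops out around 640MB. The autowarmCount on the cache is
what re-runs the most recently used cached filters against the new
searcher after a commit, along these lines in solrconfig.xml:

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="128"/>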

fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It
depends on whether these are "per-segment" caches or not; any per-segment
cache is still valid after the commit, since unchanged segments aren't
rewritten.

Think of documentCache as intended to hold the stored fields while
various components operate on them, thus avoiding repeatedly fetching the
same data from disk. It's _usually_ not too big a worry.
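
For reference, these are the related knobs in solrconfig.xml (the sizes
are just the stock example values, not a sizing recommendation):

<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

<enableLazyFieldLoading>true</enableLazyFieldLoading>

Note that the documentCache can't usefully be autowarmed, since its
entries are keyed on internal Lucene IDs, hence autowarmCount="0".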

About hard commits once a day: that's _extremely_ long. Think instead of
committing much more frequently with openSearcher=false. If nothing else,
your transaction log will grow lots and lots and lots. I'm thinking on
the order of 15 minutes, or possibly even much less, with soft commits
happening more often, maybe every 15 seconds. In fact, I'd start out with
soft commits every 15 seconds and hard commits (openSearcher=false) every
5 minutes. The problem with hard commits happening only once a day is
that, if the server is interrupted for any reason, on startup Solr will
try to replay the entire transaction log to assure index integrity. Not
to mention that your tlog will be huge, and that there is some memory
usage for each document in the tlog. Hard commits roll over the tlog,
flush the in-memory tlog pointers, close index segments, etc.
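
In solrconfig.xml that starting point looks something like the following
(maxTime is in milliseconds; treat the values as a first guess to tune
against your indexing rate):

<autoCommit>
  <maxTime>300000</maxTime>            <!-- hard commit every 5 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>15000</maxTime>             <!-- soft commit every 15 seconds -->
</autoSoftCommit>

The soft commits control how quickly new documents become searchable; the
hard commits with openSearcher=false roll the tlog and flush segments to
disk without opening a new searcher.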

Best
Erick

On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:
> Hi,
>
> I am going to build a big Solr (4.0?) index, which will hold tens of
> millions of documents. Each document has a few dozen fields and one big
> textual field.
> The queries on the index are non-trivial and a bit long (might be
> hundreds of terms). No query is identical to another.
>
> Now, I want to analyze the cache performance (before setting up the whole
> environment), in order to estimate how much RAM will I need.
>
> filterCache:
> In my scenario, every query has some filters. Let's say that each filter
> matches 1M documents out of 10M. Should the estimated memory usage be
> 1M * sizeof(uniqueId) * num-of-filters-in-cache?
>
> fieldValueCache:
> Due to the differences between queries, I guess that fieldValueCache is
> the most important factor in query performance. Here comes a generic
> question: I'm indexing new documents constantly, and soft commits will be
> performed every 10 minutes. Does that mean the cache becomes meaningless
> every 10 minutes?
>
> documentCache:
> enableLazyFieldLoading will be enabled, and "fl" contains a very small set
> of fields. BUT, I need to return highlighting on about (possibly) 20
> fields. Does the highlighting component use the documentCache? I guess that
> highlighting requires the whole field to be loaded into the documentCache.
> Will it happen only for fields that matched a term from the query?
>
> And one more question: I'm planning to hard-commit once a day. Should I
> prepare for significant RAM usage growth between hard commits? (Consider
> a lot of new documents in this period...)
> Does this RAM come from the same pool as the caches? Can an OutOfMemory
> exception happen in this scenario?
>
> Thanks a lot.
