filterCache: This is bounded by (maxDoc / 8) bytes * (num filters in cache). Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID, one bit per document. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc, not numDocs, because the internal arrays have "holes" for deleted documents.
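
To put rough numbers on the bitset reasoning above, here is a minimal sketch. It assumes one bit per internal Lucene doc ID for each cached filter and ignores object overhead and word alignment, so treat the result as a ballpark only. The function name and the figure of 512 cached filters are made up for illustration; the 10M maxDoc comes from the question.

```python
def filter_cache_upper_bound_bytes(max_doc: int, num_filters: int) -> int:
    """Rough upper bound on filterCache memory in bytes.

    Assumes each cached filter is a bitset with one bit per internal
    Lucene doc ID (maxDoc bits), ignoring JVM object overhead.
    """
    bytes_per_filter = (max_doc + 7) // 8  # round up to whole bytes
    return bytes_per_filter * num_filters


if __name__ == "__main__":
    # maxDoc = 10M is from the question; 512 cached filters is an assumed cache size.
    bound = filter_cache_upper_bound_bytes(10_000_000, 512)
    print(f"{bound} bytes (~{bound / 1024**2:.0f} MiB)")
```

Note that the number of documents a filter matches does not appear in this bound at all; only maxDoc and the number of cached entries do.
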
Note this is an _upper_ bound: if only a few docs match, the size will be (num of matching docs) * sizeof(int).

fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether these are "per-segment" caches or not. Any "per-segment" cache is still valid.

Think of documentCache as intended to hold the stored fields while various components operate on them, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry.

About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less, with soft commits happening more often, maybe every 15 seconds. In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes.

The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc.

Best
Erick

On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh <isaac.he...@gmail.com> wrote:

> Hi,
>
> I am going to build a big Solr (4.0?) index, which holds some dozens of
> millions of documents. Each document has some dozens of fields, and one big
> textual field.
> The queries on the index are non-trivial, and a little bit long (might be
> hundreds of terms). No query is identical to another.
>
> Now, I want to analyze the cache performance (before setting up the whole
> environment), in order to estimate how much RAM I will need.
>
> filterCache:
> In my scenario, every query has some filters. Let's say that each filter
> matches 1M documents, out of 10M. Should the estimated memory usage be
> 1M * sizeof(uniqueId) * num-of-filters-in-cache?
>
> fieldValueCache:
> Due to the difference between queries, I guess that fieldValueCache is the
> most important factor in query performance. Here comes a generic question:
> I'm indexing new documents to the index constantly. Soft commits will be
> performed every 10 mins. Does that mean the cache is meaningless after
> every 10 minutes?
>
> documentCache:
> enableLazyFieldLoading will be enabled, and "fl" contains a very small set
> of fields. BUT, I need to return highlighting on about (possibly) 20
> fields. Does the highlighting component use the documentCache? I guess that
> highlighting requires the whole field to be loaded into the documentCache.
> Will it happen only for fields that matched a term from the query?
>
> And one more question: I'm planning to hard-commit once a day. Should I
> prepare for significant RAM usage growth between hard commits? (Consider a
> lot of new documents in this period...)
> Does this RAM come from the same pool as the caches? Can an OutOfMemory
> exception happen in this scenario?
>
> Thanks a lot.