Brian:

Many thanks for letting us know what you found. I’ll attach this to SOLR-13003, which is about this exact issue but doesn’t contain this information. This is a great help.
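For anyone who finds this thread later, here is a rough sketch of the arithmetic in Brian's message quoted below. It uses only the two byte counts he reported from the heap dump. The function names are made up for illustration, and treating 3072 MB as the real heap budget for the cache is an assumption about what the original setting was meant to achieve, not something Solr computes.

```python
# Figures reported in the quoted message (read from the Eclipse MAT heap dump).
REPORTED_RAM_BYTES = 1_894_333_464    # ramBytes: what Solr thinks the cache uses
RETAINED_HEAP_BYTES = 12_212_779_160  # what the ConcurrentLRUCache actually held


def undercount_factor(reported_bytes: int, retained_bytes: int) -> float:
    """Ratio of real heap usage to the cache's own estimate (hypothetical helper)."""
    return retained_bytes / reported_bytes


def safe_max_ram_mb(heap_budget_mb: int, factor: float) -> int:
    """Largest maxRamMB whose real footprint should stay within heap_budget_mb.

    heap_budget_mb is an assumed target, not a Solr setting.
    """
    return int(heap_budget_mb // factor)


factor = undercount_factor(REPORTED_RAM_BYTES, RETAINED_HEAP_BYTES)
print(f"observed undercount: {factor:.2f}x")
print("maxRamMB for a 3072 MB budget at 8x:", safe_max_ram_mb(3072, 8))
```

With the 8x factor Brian settled on, a 3072 MB budget corresponds to maxRamMB="384"; the ratio actually observed in his dump is about 6.45x. The factor presumably depends on how large the query keys are, so it is worth measuring from your own heap dump rather than reusing these numbers.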
> On May 2, 2019, at 6:15 AM, Brian Ecker <briancec...@gmail.com> wrote:
>
> Just to update here in order to help others that might run into similar
> issues in the future, the problem is resolved. The issue was caused by the
> queryResultCache. This was very easy to determine by analyzing a heap dump.
> In our setup we had the following config:
>
> <queryResultCache class="solr.FastLRUCache" maxRamMB="3072"
> autowarmCount="0"/>
>
> In reality this maxRamMB="3072" was not as expected, and this cache was
> using *way* more memory (about 6-8 times the amount). See the following
> screenshot from Eclipse MAT (http://oi63.tinypic.com/epn341.jpg). Notice in
> the left window that ramBytes, the internal calculation of how much memory
> Solr currently thinks this cache is using, is 1894333464B (1894MB). Now
> notice that the highlighted line, the ConcurrentLRUCache used internally by
> the FastLRUCache representing the queryResultCache, is actually using
> 12212779160B (12212MB). On further investigation, I realized that this
> cache is a map from a query with all its associated objects as the key, to
> a very simple object containing an array of document (integer) ids as the
> value.
>
> Looking into the lucene-solr source, I found the following line for the
> calculation of ramBytesUsed
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/util/ConcurrentLRUCache.java#L605.
> Surprisingly, the query objects used as keys in the queryResultCache do not
> implement Accountable as far as I can tell, and this lines up very well
> with our observation of memory usage because in the heap dump we can also
> see that the keys in the cache are using substantially more memory than the
> values and completely account for the additional memory usage.
> It was quite surprising to me that the keys were given a default value of
> 192B as specified in LRUCache.DEFAULT_RAM_BYTES_USED because I can't actually
> imagine a case where the keys in the queryResultCache would be so small. I
> imagine that in almost all cases the keys would actually be larger than the
> values for the queryResultCache, but that's probably not true for all
> usages of a FastLRUCache.
>
> We solved our memory usage issue by drastically reducing the maxRamMB value
> and calculating the actual max usage as maxRamMB * 8. It would be quite
> useful to have this detail at least documented somewhere.
>
> -Brian
>
> On Tue, Apr 23, 2019 at 9:49 PM Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 4/23/2019 11:48 AM, Brian Ecker wrote:
>>> I see. The other files I meant to attach were the GC log (
>>> https://pastebin.com/raw/qeuQwsyd), the heap histogram (
>>> https://pastebin.com/raw/aapKTKTU), and the screenshot from top (
>>> http://oi64.tinypic.com/21r0bk.jpg).
>>
>> I have no idea what to do with the histogram. I doubt it's all that
>> useful anyway, as it wouldn't have any information about what parts of
>> the system are using the most.
>>
>> The GC log is not complete. It only covers 2 min 47 sec 674 ms of time.
>> To get anything useful out of a GC log, it would probably need to
>> cover hours of runtime.
>>
>> But if you are experiencing OutOfMemoryError, then either you have run
>> into something where a memory leak exists, or there's something about
>> your index or your queries that needs more heap than you have allocated.
>> Memory leaks are not super common in Solr, but they have happened.
>>
>> Tuning GC will never help OOME problems.
>>
>> The screenshot looks like it matches the info below.
>>
>>> I'll work on getting the heap dump, but would it also be sufficient to use
>>> say a 5GB dump from when it's half full and then extrapolate to the
>>> contents of the heap when it's full?
>>> That way the dump would be a bit easier to work with.
>>
>> That might be useful. The only way to know for sure is to take a look
>> at it to see if the part of the code using lots of heap is detectable.
>>
>>> There are around 2,100,000 documents.
>> <snip>
>>> The data takes around 9GB on disk.
>>
>> Ordinarily, I would expect that level of data to not need a whole lot of
>> heap. 10GB would be more than I would think necessary, but if your
>> queries are big consumers of memory, I could be wrong. I ran indexes
>> with 30 million documents taking up 50GB of disk space on an 8GB heap.
>> I probably could have gone lower with no problems.
>>
>> I have absolutely no idea what kind of requirements the spellcheck
>> feature has. I've never used that beyond a few test queries. If the
>> query information you sent is complete, I wouldn't expect the
>> non-spellcheck parts to require a whole lot of heap. So perhaps
>> spellcheck is the culprit here. Somebody else will need to comment on
>> that.
>>
>> Thanks,
>> Shawn
>>