Brian: 

Many thanks for letting us know what you found. I’ll attach this to SOLR-13003,
which is about this exact issue but doesn’t contain this information. This is a
great help.

> On May 2, 2019, at 6:15 AM, Brian Ecker <briancec...@gmail.com> wrote:
> 
> Just to update here to help others who might run into similar issues in the
> future: the problem is resolved. The issue was caused by the
> queryResultCache, and this was very easy to determine by analyzing a heap
> dump. In our setup we had the following config:
> 
> <queryResultCache class="solr.FastLRUCache" maxRamMB="3072"
> autowarmCount="0"/>
> 
> In reality this maxRamMB="3072" did not behave as expected, and the cache was
> using *way* more memory (about 6-8 times that amount). See the following
> screenshot from Eclipse MAT (http://oi63.tinypic.com/epn341.jpg). Notice in
> the left window that ramBytes, Solr's internal estimate of how much memory
> this cache is currently using, is 1894333464 B (about 1894 MB). Now notice
> that the highlighted line, the ConcurrentLRUCache used internally by the
> FastLRUCache representing the queryResultCache, is actually using
> 12212779160 B (about 12212 MB). On further investigation, I realized that
> this cache is a map from a query, with all its associated objects, as the
> key to a very simple object containing an array of document (integer) ids
> as the value.
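> 
> (For reference, 12212779160 B / 1894333464 B is roughly 6.4, which matches
> the 6-8x discrepancy mentioned above.)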
> 
> Looking into the lucene-solr source, I found the following line for the
> calculation of ramBytesUsed:
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/util/ConcurrentLRUCache.java#L605.
> Surprisingly, the query objects used as keys in the queryResultCache do not
> implement Accountable as far as I can tell. This lines up very well with our
> observed memory usage, because in the heap dump we can also see that the
> keys in the cache use substantially more memory than the values and
> completely account for the additional memory usage. It was quite surprising
> to me that the keys are given a default size of 192 B, as specified in
> LRUCache.DEFAULT_RAM_BYTES_USED, because I can't actually imagine a case
> where the keys in the queryResultCache would be that small. I imagine that
> in almost all cases the keys would actually be larger than the values for
> the queryResultCache, but that's probably not true for all usages of a
> FastLRUCache.
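> 
> As a rough illustration of that fallback (a standalone sketch with made-up
> names, not the actual Solr classes; SizeReporting stands in for Lucene's
> Accountable, and the 192-byte constant mirrors
> LRUCache.DEFAULT_RAM_BYTES_USED):
> 
>     // Illustrative sketch only; the real accounting lives in Solr's
>     // ConcurrentLRUCache / LRUCache.
>     interface SizeReporting {
>         long ramBytesUsed();
>     }
> 
>     final class CacheAccountingSketch {
>         static final long DEFAULT_ENTRY_BYTES = 192;
> 
>         // Estimate the heap cost of one cache entry. A key or value that
>         // does not report its own size is counted at the fixed default,
>         // even though a real query key's object graph is usually far
>         // larger, so the tracked total can sit far below the real heap
>         // footprint.
>         static long estimateEntry(Object key, Object value) {
>             long keyBytes = (key instanceof SizeReporting)
>                     ? ((SizeReporting) key).ramBytesUsed()
>                     : DEFAULT_ENTRY_BYTES;
>             long valueBytes = (value instanceof SizeReporting)
>                     ? ((SizeReporting) value).ramBytesUsed()
>                     : DEFAULT_ENTRY_BYTES;
>             return keyBytes + valueBytes;
>         }
>     }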
> 
> We solved our memory usage issue by drastically reducing the maxRamMB value
> and by treating the actual maximum usage as roughly maxRamMB * 8. It would
> be quite useful to have this detail at least documented somewhere.
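> 
> (For example, assuming the same roughly 8x factor holds for your queries,
> keeping this cache near the originally intended 3 GB of real heap would mean
> setting maxRamMB to about 3072 / 8 = 384; the exact factor will vary.)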
> 
> -Brian
> 
> On Tue, Apr 23, 2019 at 9:49 PM Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 4/23/2019 11:48 AM, Brian Ecker wrote:
>>> I see. The other files I meant to attach were the GC log (
>>> https://pastebin.com/raw/qeuQwsyd), the heap histogram (
>>> https://pastebin.com/raw/aapKTKTU), and the screenshot from top (
>>> http://oi64.tinypic.com/21r0bk.jpg).
>> 
>> I have no idea what to do with the histogram.  I doubt it's all that
>> useful anyway, as it wouldn't have any information about which parts of
>> the system are using the most memory.
>> 
>> The GC log is not complete.  It only covers 2 min 47 sec 674 ms of time.
>>  To get anything useful out of a GC log, it would probably need to
>> cover hours of runtime.
>> 
>> But if you are experiencing OutOfMemoryError, then either you have run
>> into something where a memory leak exists, or there's something about
>> your index or your queries that needs more heap than you have allocated.
>>  Memory leaks are not super common in Solr, but they have happened.
>> 
>> Tuning GC will never help OOME problems.
>> 
>> The screenshot looks like it matches the info below.
>> 
>>> I'll work on getting the heap dump, but would it also be sufficient to
>>> use
>>> say a 5GB dump from when it's half full and then extrapolate to the
>>> contents of the heap when it's full? That way the dump would be a bit
>>> easier to work with.
>> 
>> That might be useful.  The only way to know for sure is to take a look
>> at it to see if the part of the code using lots of heap is detectable.
>> 
>>> There are around 2,100,000 documents.
>> <snip>
>>> The data takes around 9GB on disk.
>> 
>> Ordinarily, I would expect that level of data to not need a whole lot of
>> heap.  10GB would be more than I would think necessary, but if your
>> queries are big consumers of memory, I could be wrong.  I ran indexes
>> with 30 million documents taking up 50GB of disk space on an 8GB heap.
>> I probably could have gone lower with no problems.
>> 
>> I have absolutely no idea what kind of requirements the spellcheck
>> feature has.  I've never used that beyond a few test queries.  If the
>> query information you sent is complete, I wouldn't expect the
>> non-spellcheck parts to require a whole lot of heap.  So perhaps
>> spellcheck is the culprit here.  Somebody else will need to comment on
>> that.
>> 
>> Thanks,
>> Shawn
>> 
