Hi Shawn,

My answers are in-line below...

Cheers,
Vassil

-----Original Message-----
From: Shawn Heisey <apa...@elyograg.org> 
Sent: Monday, October 14, 2019 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
tests with Java 13 and the new ZGC?

On 10/14/2019 6:18 AM, Vassil Velichkov (Sensika) wrote:
> We have 1 x Replica with 1 x Solr Core per JVM and each JVM runs in a 
> separate VMware VM.
> We have 32 x JVMs/VMs in total, containing between 50M to 180M documents per 
> replica/core/JVM.

With 180 million documents, each filterCache entry will be 22.5 megabytes in 
size.  They will ALL be this size.

>>>>> Oops, I didn't know that, but this makes things even worse. Looking at the 
>>>>> GC log, it seems evicted entries are never discarded.
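>>>>> Just to sanity-check the arithmetic on our largest core (the formula is 
>>>>> maxDoc / 8 bytes per filterCache bitset; the document count below is ours):
>>>>>
>>>>>     echo $((180000000 / 8))    # 22500000 bytes, i.e. ~22.5MB per cache entry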

> In our case most filterCache entries (maxDoc/8 + overhead) are typically 
> more than 16MB, which is more than 50% of the max setting for 
> "-XX:G1HeapRegionSize" (which is 32MB). That's why I am so interested in Java 
> 13 and ZGC, because ZGC does not have this weird limitation and collects even 
> _large_ garbage pieces :-). We have almost no documentCache or queryCache 
> entries.

I am not aware of any Solr testing with the new garbage collector.  I'm 
interested in knowing whether it does a better job than CMS and G1, but do not 
have any opportunities to try it.

>>>>> Currently we have some 2TB of free RAM on the cluster, so I guess we could 
>>>>> test it in the coming days. The plan is to re-index at least 2B documents 
>>>>> in a separate cluster and stress-test it with real production data and real 
>>>>> production code on Java 13 with ZGC.
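>>>>> If it helps, the idea is to point GC_TUNE in solr.in.sh on the test cluster 
>>>>> at something along these lines (ZGC is still experimental in Java 13, hence 
>>>>> the unlock flag; treat it as a sketch rather than our final settings):
>>>>>
>>>>>     GC_TUNE="-XX:+UnlockExperimentalVMOptions -XX:+UseZGC"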

Have you tried letting Solr use its default garbage collection settings instead 
of G1?  Have you tried Java 11?  Java 9 is one of the releases without long 
term support, so as Erick says, it is not recommended.

>>>>> After the migration from 6.x to 7.6 we kept the default GC for a couple 
>>>>> of weeks, then we started experimenting with G1 and managed to reduce the 
>>>>> frequency of the OOM crashes, but not by much.
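>>>>> For reference, the G1 settings we have been experimenting with look roughly 
>>>>> like this in solr.in.sh (exact values differ a bit per node, so take them 
>>>>> as illustrative):
>>>>>
>>>>>     GC_TUNE="-XX:+UseG1GC -XX:G1HeapRegionSize=32m -XX:MaxGCPauseMillis=250"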

> By some time tonight all shards will be rebalanced (we've added 6 more) and 
> will contain up to 100-120M documents (14.31MB + overhead should be < 16MB), 
> so hopefully this will help us to alleviate the OOM crashes.

It doesn't sound to me like your filterCache can cause OOM.  The total size of 
256 filterCache entries that are each 22.5 megabytes should be less than 6GB, 
and I would expect the other Solr caches to be smaller.

>>>>> As I explained in my previous e-mail, the unused filterCache entries are 
>>>>> not discarded, even after a new SolrSearcher is started. The Replicas are 
>>>>> synced with the Masters every 5 minutes, the filterCache is auto-warmed, 
>>>>> and the JVM heap utilization keeps going up. Within 1 to 2 hours a 64GB 
>>>>> heap is exhausted. The GC log entries clearly show more and more humongous 
>>>>> allocations piling up.
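>>>>> A quick way to see it (the GC log path below is ours, adjust as needed):
>>>>>
>>>>>     grep -ci "humongous" /var/solr/logs/solr_gc.log*
>>>>>
>>>>> and the count keeps growing right up to the OOM crash.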
 
If you are hitting OOMs, then some other aspect of your setup is the reason 
that's happening.  I would not normally expect a single core with
180 million documents to need more than about 16GB of heap, and 31GB should 
definitely be enough.  Hitting OOM with the heap sizes you have described is 
very strange.

>>>>>> We have a really stressful use-case: a single user opens a live report 
>>>>>> with 20-30 widgets, and each widget performs a Solr search or facet 
>>>>>> aggregation, sometimes with 5-15 complex filter queries attached to the 
>>>>>> main query, so the end results are visualized as pivot charts. One user 
>>>>>> could therefore trigger hundreds of queries in a very short period of 
>>>>>> time, and when we have several analysts working on the same time period, 
>>>>>> we usually end up with OOM. This logic used to work quite well on Solr 
>>>>>> 6.x. The only other difference that comes to my mind is that with Solr 
>>>>>> 7.6 we've started using DocValues. I could not find documentation about 
>>>>>> DocValues memory consumption, so it might be related.

Perhaps the root cause of your OOMs is not heap memory, but some other system 
resource.  Do you have log entries showing the stacktrace on the OOM?

>>>>>> Yep, but I plan to generate some detailed JVM heap dumps, so we can 
>>>>>> analyze which class / data structure causes the OOM. Any recommendations 
>>>>>> about what tool to use for a detailed JVM dump?
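>>>>>> Unless there is a better option, I was planning to capture the dumps with 
>>>>>> jcmd or jmap and open them in Eclipse MAT, roughly like this (the PID and 
>>>>>> the output path are placeholders):
>>>>>>
>>>>>>     jcmd <solr-pid> GC.heap_dump /tmp/solr-heap.hprof
>>>>>>     # or, alternatively:
>>>>>>     jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr-pid>
>>>>>>
>>>>>> plus -XX:+HeapDumpOnOutOfMemoryError, so we also get a dump at the moment 
>>>>>> of the crash.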
>>>>>> Also, not sure if I could send attachments to the mailing list, but there 
>>>>>> must be a way to share logs...?

Thanks,
Shawn
