Hi everyone,

Since we upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6, we have been seeing frequent OOM crashes on specific nodes.
All investigations (detailed below) lead to a hard-coded limitation in the G1 garbage collector: the Java heap is exhausted by filterCache allocations that G1 never discards. Our hope is to use Java 13 with the new ZGC, which is specifically designed for large heap sizes and should handle and dispose of larger allocations. The Solr release notes claim that Solr 7.6 builds are tested with Java 11 / 12 / 13 (pre-release), but does anyone use Java 13 in production and have experience with the new ZGC and large heap sizes / large document sets of more than 150M documents per shard?

Some background information follows, with a reference to the possible root cause, described by Shawn Heisey in the Solr 1.4 documentation.

Our current setup is as follows:

1. All nodes run on VMware 6.5 VMs with Debian 9u5 / Java 9 / Solr 7.6.
2. Each VM has 6 or 8 vCPUs, 128GB or 192GB RAM (50% for the Java heap / 50% for the OS) and 1 Solr core with 80M to 160M documents, NO stored fields, DocValues ON.
3. The only "hot" and frequently used cache is the filterCache, configured with the default size of 256 entries. If we increase the setting to 512 or 1024 entries, we get a 4-5x better hit ratio, but the OOM crashes become too frequent.
4. Regardless of the Java heap size (we have tested with even larger heaps and VM sizing up to 384GB), all nodes with more than approx. 120-130M documents crash with OOM under heavy load (hundreds of simultaneous searches with a variety of filter queries).

The filterCache is very heavily used and some of the BitSets span 80-90% of the DocSet of each shard, so in many cases the filterCache entries become larger than 16MB. We believe we have pinpointed the problem to the G1 garbage collector and the hard-coded limit for "-XX:G1HeapRegionSize", which allows a maximum of 32MB, regardless of whether it is auto-calculated or set manually in the JVM startup options.
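To illustrate why our shard sizes cross that 16MB line: a filterCache entry backed by a bitset needs roughly one bit per document in the core, i.e. about maxDoc / 8 bytes. A minimal back-of-the-envelope sketch (the class and method names are ours, for illustration only):

```java
// Sketch: estimate a bitset-backed filterCache entry's size against G1's
// "humongous" threshold (half of the 32MB maximum G1HeapRegionSize).
public class FilterCacheSizing {
    static final long MAX_G1_REGION_BYTES = 32L * 1024 * 1024;      // hard-coded G1 maximum
    static final long HUMONGOUS_THRESHOLD = MAX_G1_REGION_BYTES / 2; // 16 MiB

    // One bit per document in the core -> maxDoc / 8 bytes.
    static long bitsetBytes(long maxDoc) {
        return maxDoc / 8;
    }

    public static void main(String[] args) {
        for (long maxDoc : new long[] {80_000_000L, 120_000_000L, 160_000_000L}) {
            long bytes = bitsetBytes(maxDoc);
            System.out.printf("maxDoc=%dM -> entry ~%.1f MiB, humongous=%b%n",
                    maxDoc / 1_000_000, bytes / (1024.0 * 1024.0),
                    bytes > HUMONGOUS_THRESHOLD);
        }
    }
}
```

At 80M documents an entry is around 9.5 MiB and fits in a region, but somewhere past ~130M documents every such entry exceeds 16 MiB and becomes a humongous allocation, which matches exactly where our nodes start crashing.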
The JVM memory allocation algorithm tracks every memory allocation request, and if the request exceeds 50% of G1HeapRegionSize, it is considered a humongous allocation (he-he, an extremely large allocation in 2019?!?), so it is not scanned and evaluated during standard garbage collection cycles. Unused humongous allocations are basically freed only during full garbage collection cycles, which the G1 garbage collector never really invokes before it is too late and the JVM crashes with OOM.

For now we are rebalancing the cluster to have no more than 100-120M documents per shard, following an ancient, but probably still valid, limitation suggested in the Solr 1.4 documentation by Shawn Heisey<https://cwiki.apache.org/confluence/display/solr/ShawnHeisey>:

"If you have an index with about 100 million documents in it, you'll want to use a region size of 32MB, which is the maximum possible size. Because of this limitation of the G1 collector, we recommend always keeping a Solr index below a maxDoc value of around 100 to 120 million."

Cheers,
Vassil Velichkov
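P.S. For anyone who wants to experiment along with us: a minimal sketch of the JVM options we are considering for ZGC on Java 13, assuming a standard install that reads GC settings from solr.in.sh. Note that ZGC is still experimental in JDK 13 and has to be unlocked explicitly:

```shell
# solr.in.sh (sketch, our assumption): replace the default G1 tuning with ZGC.
# ZGC handles large objects in dedicated large pages instead of treating them
# as "humongous" the way G1 does with its 32MB region-size ceiling.
# -XX:+UnlockExperimentalVMOptions is required on JDK 13, since ZGC only
# became production-ready in later JDK releases.
GC_TUNE="-XX:+UnlockExperimentalVMOptions -XX:+UseZGC"
```

We have not yet validated this under our production load, so treat it as a starting point rather than a recommendation.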