Thanks Jörn,

Yep, we are rebalancing the cluster to keep up to ~100M documents per shard, but that's not quite optimal in our use-case.
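To put rough numbers on that ceiling (a back-of-the-envelope sketch - one bit per document, packed into longs, as in Lucene's FixedBitSet; the class below is just for illustration):

    // Approximate heap footprint of a full-shard filter BitSet.
    // G1 treats any single allocation larger than half a region
    // (16MB at the maximum 32MB region size) as "humongous".
    public class BitSetFootprint {
        public static void main(String[] args) {
            System.out.println(100_000_000L / 8 / (1 << 20) + " MB"); // prints "11 MB" -> regular allocation
            System.out.println(160_000_000L / 8 / (1 << 20) + " MB"); // prints "19 MB" -> humongous allocation
        }
    }

So on our biggest shards every large filterCache entry crosses the humongous threshold, while at ~100M documents per shard it would still fit in a regular region.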
We've tried various ratios of JVM heap to OS RAM (up to 128GB / 256GB) and we get the same Java heap OOM crashes. For example, a BitSet covering 160M documents is larger than 16MB, and when we look at the G1 logs it seems G1 never discards the humongous allocations, so they keep piling up. Forcing a full garbage collection is just not practical - it takes forever and the shard is not usable in the meantime. Even when a new Searcher is started (every several minutes), the old large filterCache entries are not freed, and sooner or later the JVM crashes.

On the other hand, ZGC has a completely different architecture and does not have the hard-coded 16MB threshold for *humongous allocations*: https://wiki.openjdk.java.net/display/zgc/Main

Anyway, we will probably be testing Java 13 and ZGC with the real data; we just have to reindex 30+ shards to new Solr servers, which will take a couple of days :-)

Cheers,
Vassil
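P.S. In case anyone wants to repeat the experiment: as far as I can tell from the OpenJDK docs, ZGC in Java 13 is still experimental and has to be unlocked explicitly. Something like this in solr.in.sh should do it (untested on our side yet; it replaces the G1 flags rather than adding to them):

    GC_TUNE="-XX:+UnlockExperimentalVMOptions -XX:+UseZGC"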
-----Original Message-----
From: Jörn Franke <jornfra...@gmail.com>
Sent: Monday, October 14, 2019 1:47 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

I would try JDK 11 - it works much better than JDK 9 in general. I don't think JDK 13 with ZGC will bring you better results. There seems to be something strange with the JDK version or the Solr version and some settings.

Then, make sure that you have much more free memory for the OS cache than for the heap. Nearly 100 GB for the Solr heap sounds excessive - try to reduce it to much less. Try the default options of Solr and use the latest 7.x or 8.x version of Solr. Additionally, you can try to shard more.

> On 14.10.2019 at 19:19, Vassil Velichkov (Sensika) <vassil.velich...@sensika.com> wrote:
>
> Hi Everyone,
>
> Since we've upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6, we have frequent OOM crashes on specific nodes.
>
> All investigations (detailed below) lead to a hard-coded limitation in the G1 garbage collector: the Java heap is exhausted by filterCache allocations that are never discarded by G1.
>
> Our hope is to use Java 13 with the new ZGC, which is specifically designed for large heap sizes and supposedly would handle and dispose of larger allocations. The Solr release notes claim that Solr 7.6 builds are tested with Java 11 / 12 / 13 (pre-release).
>
> Does anyone use Java 13 in production and have experience with the new ZGC and large heap sizes / large document sets of more than 150M documents per shard?
>
> Some background information and a reference to the possible root cause, described by Shawn Heisey in the Solr 1.4 documentation:
>
> Our current setup is as follows:
>
> 1. All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / Solr 7.6.
>
> 2. Each VM has 6 or 8 vCPUs, 128GB or 192GB RAM (50% for the Java heap / 50% for the OS) and 1 Solr core with 80M to 160M documents, NO stored fields, DocValues ON.
>
> 3. The only "hot" and frequently used cache is the filterCache, configured with the default value of 256 entries (the stock solrconfig.xml entry we resize is quoted at the very bottom of this message). If we increase the setting to 512 or 1024 entries we get a 4-5 times better hit-ratio, but the OOM crashes become too frequent.
>
> 4. Regardless of the Java heap size (we've tested with even larger heaps and VM sizing up to 384GB), all nodes with more than approx. 120-130M documents crash with OOM under heavy load (hundreds of simultaneous searches with a variety of filter queries).
>
> The filterCache is really heavily used and some of the BitSets span 80-90% of the DocSet of each shard, so in many cases the cache entries become larger than 16MB. We believe we've pinpointed the problem to the G1 garbage collector and the hard-coded limit for -XX:G1HeapRegionSize, which allows a maximum of 32MB, regardless of whether it is auto-calculated or set manually in the JVM startup options. The JVM tracks every memory allocation request, and if a request exceeds 50% of G1HeapRegionSize it is considered a humongous allocation (he-he, an "extremely large" allocation in 2019?!?), so it is not scanned and evaluated during standard garbage-collection cycles. Unused humongous allocations are basically freed only during full garbage-collection cycles, which G1 never really invokes before it is too late and the JVM crashes with OOM.
>
> Now we are rebalancing the cluster to have up to 100-120M documents per shard, following an ancient, but probably still valid, limitation suggested in the Solr 1.4 documentation by Shawn Heisey <https://cwiki.apache.org/confluence/display/solr/ShawnHeisey>: "If you have an index with about 100 million documents in it, you'll want to use a region size of 32MB, which is the maximum possible size. Because of this limitation of the G1 collector, we recommend always keeping a Solr index below a maxDoc value of around 100 to 120 million."
>
> Cheers,
> Vassil Velichkov
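P.P.S. For completeness, the filterCache definition referenced in point 3 above is the stock one from solrconfig.xml - we only change the size attributes (the values shown are illustrative):

    <filterCache class="solr.FastLRUCache"
                 size="256"
                 initialSize="256"
                 autowarmCount="0"/>

Each entry can hold a full BitSet over maxDoc, so at ~20MB per entry on a 160M-document shard, even 256 entries can pin several GB of heap in humongous regions.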