Hi Erick,

We have 1 x Replica with 1 x Solr Core per JVM, and each JVM runs in a separate VMware VM. We have 32 x JVMs/VMs in total, holding between 50M and 180M documents per replica/core/JVM. In our case most filterCache entries (maxDoc/8 bytes + overhead) are typically larger than 16MB, which is more than 50% of the maximum setting for "-XX:G1HeapRegionSize" (which is 32MB). That's why I am so interested in Java 13 and ZGC: ZGC does not have this weird limitation and collects even _large_ pieces of garbage :-). We have almost no documentCache or queryCache entries.
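The sizing above can be sketched with a quick back-of-the-envelope calculation (assuming one bit per document per filterCache BitSet and ignoring object overhead):

```python
# Rough filterCache entry sizes vs. the G1 humongous threshold.
# Assumes one bit per document (maxDoc / 8 bytes) and the maximum
# -XX:G1HeapRegionSize of 32MB; per-object overhead is ignored.

G1_MAX_REGION_BYTES = 32 * 1024 * 1024
HUMONGOUS_BYTES = G1_MAX_REGION_BYTES // 2   # half a region or more

def filter_entry_bytes(max_doc: int) -> int:
    """Approximate size of one filterCache BitSet for max_doc documents."""
    return max_doc // 8

for docs in (50_000_000, 100_000_000, 120_000_000, 160_000_000, 180_000_000):
    size = filter_entry_bytes(docs)
    flag = "humongous" if size >= HUMONGOUS_BYTES else "ok"
    print(f"{docs:>11,} docs -> {size / 2**20:5.2f} MiB ({flag})")
```

At 120M documents an entry stays just under the 16 MiB threshold (the 14.31MB figure mentioned below); at 160-180M documents every filterCache entry is a humongous allocation.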
By some time tonight all shards will be rebalanced (we've added 6 more) and each will contain up to 100-120M documents (14.31MB + overhead should be < 16MB), so hopefully this will alleviate the OOM crashes.

Cheers,
Vassil

-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com>
Sent: Monday, October 14, 2019 3:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?

The filterCache isn't a single huge allocation, it's made up of _size_ entries; each individual entry shouldn't be that big, capping at around maxDoc/8 bytes + some overhead.

I just scanned the e-mail, and I'm not clear how many _replicas_ per JVM you have, nor how many JVMs per server you're running. One strategy for dealing with large heaps, if you have a lot of replicas, is to run multiple JVMs, each with a smaller heap.

One peculiarity of heaps is that at 32GB the JVM must switch to uncompressed 64-bit object pointers, so a 32GB heap actually has less usable memory than a 31GB heap if many of the objects are small.

> On Oct 14, 2019, at 7:00 AM, Vassil Velichkov (Sensika) <vassil.velich...@sensika.com> wrote:
> 
> Thanks Jörn,
> 
> Yep, we are rebalancing the cluster to keep up to ~100M documents per shard, but that's not quite optimal in our use-case.
> 
> We've tried various ratios of JVM heap to OS RAM (up to 128GB / 256GB) and we get the same Java heap OOM crashes.
> For example, a BitSet for 160M documents is > 16MB, and when we look at the G1 logs it seems the collector never discards the humongous allocations, so they keep piling up. Forcing a full garbage collection is just not practical - it takes forever and the shard is unusable meanwhile. Even when a new Searcher is started (every several minutes), the old large filterCache entries are not freed, and sooner or later the JVM crashes.
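Erick's 32GB remark above can be sketched as a solr.in.sh heap setting (the 31g value is illustrative, not a recommendation for this index size):

```shell
# solr.in.sh (sketch): at -Xmx32g and above, HotSpot disables compressed
# (32-bit) object pointers, so a 32g heap actually holds *less* live data
# than a 31g one. Staying just under the boundary keeps oops compressed.
SOLR_HEAP="31g"

# One way to confirm compressed oops are still enabled for a given -Xmx:
#   java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
```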
> 
> On the other hand, ZGC has a completely different architecture and does not have the hard-coded 16MB threshold for *humongous allocations*: https://wiki.openjdk.java.net/display/zgc/Main
> 
> Anyway, we will probably be testing Java 13 and ZGC with the real data; we just have to reindex 30+ shards onto new Solr servers, which will take a couple of days :-)
> 
> Cheers,
> Vassil
> 
> -----Original Message-----
> From: Jörn Franke <jornfra...@gmail.com>
> Sent: Monday, October 14, 2019 1:47 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any tests with Java 13 and the new ZGC?
> 
> I would try JDK 11 - it works much better than JDK 9 in general.
> 
> I don't think JDK 13 with ZGC will bring you better results. There seems to be something strange with the JDK version or Solr version and some settings.
> 
> Then, make sure that you have much more free memory for the OS cache than for the heap. Nearly 100GB for the Solr heap sounds excessive; try to reduce it to much less.
> 
> Try the default options of Solr and use the latest 7.x or 8.x version of Solr.
> 
> Additionally, you can try to shard more.
> 
>> On Oct 14, 2019, at 7:19 PM, Vassil Velichkov (Sensika) <vassil.velich...@sensika.com> wrote:
>> 
>> Hi Everyone,
>> 
>> Since we've upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6, we have frequent OOM crashes on specific nodes.
>> 
>> All investigations (detailed below) point to a hard-coded limitation in the G1 garbage collector: the Java heap is exhausted by filterCache allocations that are never discarded by G1.
>> 
>> Our hope is to use Java 13 with the new ZGC, which is specifically designed for large heap sizes and supposedly would handle and dispose of larger allocations. The Solr release notes claim that Solr 7.6 builds are tested with Java 11 / 12 / 13 (pre-release).
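The Java 13 / ZGC test being planned above could be sketched as solr.in.sh GC settings (an assumption, not a tested configuration; ZGC is experimental in JDK 13 and must be unlocked explicitly):

```shell
# solr.in.sh (sketch, assumes JDK 13): replace the default G1 tuning
# with ZGC, which has no fixed region size and therefore no 16MB
# "humongous allocation" threshold. ZGC is experimental until JDK 15,
# hence the unlock flag.
GC_TUNE="-XX:+UnlockExperimentalVMOptions -XX:+UseZGC"
```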
>> Does anyone use Java 13 in production and have experience with the new ZGC and large heap sizes / large document sets of more than 150M documents per shard?
>> 
>> >>>>> Some background information and a reference to the possible root-cause, described by Shawn Heisey in the Solr 1.4 documentation >>>>>
>> 
>> Our current setup is as follows:
>> 
>> 1. All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / Solr 7.6.
>> 
>> 2. Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for the Java heap / 50% for the OS) and 1 x Solr core with 80M to 160M documents, NO stored fields, DocValues ON.
>> 
>> 3. The only "hot" and frequently used cache is filterCache, configured with the default value of 256 entries. If we increase the setting to 512 or 1024 entries we get a 4-5 times better hit-ratio, but the OOM crashes become too frequent.
>> 
>> 4. Regardless of the Java heap size (we've tested with even larger heaps and VM sizing up to 384GB), all nodes that have more than approx. 120-130M documents crash with OOM under heavy load (hundreds of simultaneous searches with a variety of filter queries).
>> 
>> The filterCache is very heavily used, and some of the BitSets span 80-90% of the DocSet of each shard, so in many cases the FC entries become larger than 16MB. We believe we've pinpointed the problem to the G1 garbage collector and the hard-coded limit for "-XX:G1HeapRegionSize", which allows a maximum of 32MB, regardless of whether it is auto-calculated or set manually in the JVM startup options. The JVM memory-allocation algorithm tracks every memory-allocation request, and if a request exceeds 50% of G1HeapRegionSize it is considered a humongous allocation (he-he, an extremely large allocation in 2019?!?), so it is not scanned and evaluated during standard garbage-collection cycles.
>> Unused humongous allocations are basically freed only during full garbage-collection cycles, which are never really invoked by the G1 collector before it is too late and the JVM crashes with OOM.
>> 
>> Now we are rebalancing the cluster to have up to 100-120M documents per shard, following an ancient, but probably still valid, limitation suggested in the Solr 1.4 documentation by Shawn Heisey<https://cwiki.apache.org/confluence/display/solr/ShawnHeisey>: "If you have an index with about 100 million documents in it, you'll want to use a region size of 32MB, which is the maximum possible size. Because of this limitation of the G1 collector, we recommend always keeping a Solr index below a maxDoc value of around 100 to 120 million."
>> 
>> Cheers,
>> Vassil Velichkov
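The humongous-allocation behaviour described in this thread can be observed directly in the GC logs; a sketch using JDK 9+ unified logging (file path and rotation values are illustrative):

```shell
# solr.in.sh (sketch): unified GC logging for JDK 9+. The gc* selector
# includes region-level detail, so counts of humongous regions surviving
# each collection cycle show up in the log.
GC_LOG_OPTS="-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M"
```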