Hi Erick,

We have 1 x Replica with 1 x Solr Core per JVM and each JVM runs in a separate 
VMware VM.
We have 32 x JVMs/VMs in total, containing between 50M and 180M documents per 
replica/core/JVM.
In our case most filterCache entries (maxDoc/8 + overhead) are typically larger 
than 16MB, i.e. more than 50% of the maximum "-XX:G1HeapRegionSize" 
(which is 32MB). That's why I am so interested in Java 13 and ZGC: ZGC does not 
have this weird limitation and collects even _large_ garbage pieces 
:-). We have almost no documentCache or queryCache entries.

By some time tonight all shards will be rebalanced (we've added 6 more) and 
each will contain up to 100-120M documents (120M/8 ≈ 14.31MB + overhead, which 
should stay < 16MB), so hopefully this will help alleviate the OOM crashes.
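
For anyone who wants to sanity-check that arithmetic, here's a rough sketch 
(assuming each filterCache entry is a plain bitset of maxDoc bits and ignoring 
object headers and cache overhead):

    public class FilterCacheEntrySize {
        public static void main(String[] args) {
            long maxDoc = 120_000_000L;                      // docs per core after rebalancing
            long entryBytes = maxDoc / 8;                    // one bit per doc -> 15,000,000 bytes
            long regionSize = 32L * 1024 * 1024;             // -XX:G1HeapRegionSize maximum (32MB)
            boolean humongous = entryBytes > regionSize / 2; // G1 treats > half a region as humongous
            System.out.printf("entry ~= %.2f MiB, humongous: %b%n",
                    entryBytes / (1024.0 * 1024.0), humongous); // prints ~14.31 MiB, false
        }
    }

By that math a single entry crosses the 16MB humongous threshold somewhere 
above ~134M documents per core.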

Cheers,
Vassil


-----Original Message-----
From: Erick Erickson <erickerick...@gmail.com> 
Sent: Monday, October 14, 2019 3:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
tests with Java 13 and the new ZGC?

The filterCache isn’t a single huge allocation; it’s made up of _size_ entries, 
and each individual entry shouldn’t be that big: each entry should cap out at 
around maxDoc/8 bytes + some overhead.

I just scanned the e-mail; I’m not clear how many _replicas_ per JVM you have, 
nor how many JVMs per server you’re running. One strategy for dealing with large 
heaps, if you have a lot of replicas, is to run multiple JVMs, each with a 
smaller heap.

One peculiarity of heaps is that at 32G the JVM can no longer use compressed 
object pointers, so a 32G heap actually has less usable memory than a 31G heap 
if many of the objects are small.
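
If you want to see where that cutoff falls on your particular build, something 
like this should show whether compressed oops are still in use (illustrative; 
the exact boundary can vary slightly between JVMs):

    java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
    java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

On a typical 64-bit JVM the first should report true and the second false.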


> On Oct 14, 2019, at 7:00 AM, Vassil Velichkov (Sensika) 
> <vassil.velich...@sensika.com> wrote:
> 
> Thanks Jörn,
> 
> Yep, we are rebalancing the cluster to keep up to ~100M documents per shard, 
> but that's not quite optimal in our use-case.
> 
> We've tried various ratios of JVM Heap to OS RAM (up to 128GB / 256GB) and we 
> get the same Java heap OOM crashes.
> For example, a BitSet for 160M documents is ~20MB (> 16MB), and when we look at 
> the G1 logs it seems the humongous allocations are never discarded, so they 
> keep piling up. Forcing a full garbage collection is just not practical - it 
> takes forever and the shard is unusable meanwhile. Even when a new Searcher is 
> opened (every few minutes) the old large filterCache entries are not freed, and 
> sooner or later the JVM crashes.
> 
> On the other hand, ZGC has a completely different architecture and has no 
> hard-coded 16MB threshold for *humongous allocations* at all:
> https://wiki.openjdk.java.net/display/zgc/Main
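> 
> If we do test it, my understanding is that ZGC is still experimental on JDK 13, 
> so the startup options would be roughly the following (replacing the current G1 
> settings in solr.in.sh / GC_TUNE):
> 
>     -XX:+UnlockExperimentalVMOptions -XX:+UseZGC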
> 
> Anyway, we will probably be testing Java 13 and ZGC with the real data; we 
> just have to reindex 30+ shards onto new Solr servers, which will take a couple 
> of days :-)
> 
> Cheers,
> Vassil
> 
> -----Original Message-----
> From: Jörn Franke <jornfra...@gmail.com> 
> Sent: Monday, October 14, 2019 1:47 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.6 frequent OOM with Java 9, G1 and large heap sizes - any 
> tests with Java 13 and the new ZGC?
> 
> I would try JDK11 - it works much better than JDK9 in general. 
> 
> I don’t think JDK13 with ZGC will bring you better results. There seems to be 
> something strange with the JDK version or Solr version and some settings. 
> 
> Then, make sure that you have much more free memory for the OS cache than for 
> the heap. Nearly 100 GB for the Solr heap sounds excessive. Try to reduce it to 
> much less.
> 
> Try the default Solr options and use the latest 7.x or 8.x version of Solr.
> 
> Additionally you can try to shard more.
> 
>> On 14.10.2019 at 19:19, Vassil Velichkov (Sensika) 
>> <vassil.velich...@sensika.com> wrote:
>> 
>> Hi Everyone,
>> 
>> Since we upgraded our cluster (legacy sharding) from Solr 6.x to Solr 7.6, we 
>> have been getting frequent OOM crashes on specific nodes.
>> 
>> All investigations (detailed below) point to a hard-coded limitation in the 
>> G1 garbage collector: the Java heap is exhausted by filterCache allocations 
>> that are never discarded by G1.
>> 
>> Our hope is to use Java 13 with the new ZGC, which is specifically designed 
>> for large heap sizes and should supposedly handle and dispose of larger 
>> allocations. The Solr release notes claim that Solr 7.6 builds are tested 
>> with Java 11 / 12 / 13 (pre-release).
>> Does anyone use Java 13 in production and have experience with the new ZGC 
>> and large heap sizes / large document sets of more than 150M documents per 
>> shard?
>> 
>>>>>>>>>>> Some background information and reference to the possible 
>>>>>>>>>>> root-cause, described by Shawn Heisey in Solr 1.4 documentation 
>>>>>>>>>>> >>>>>
>> 
>> Our current setup is as follows:
>> 
>> 1.       All nodes are running on VMware 6.5 VMs with Debian 9u5 / Java 9 / 
>> Solr 7.6
>> 
>> 2.       Each VM has 6 or 8 x vCPUs, 128GB or 192GB RAM (50% for Java Heap / 
>> 50% for OS) and 1 x Solr Core with 80M to 160M documents, NO stored fields, 
>> DocValues ON
>> 
>> 3.       The only “hot” and frequently used cache is filterCache, configured 
>> with the default value of 256 entries (see the snippet after this list). If we 
>> increase the setting to 512 or 1024 entries, we get a 4-5 times better 
>> hit-ratio, but the OOM crashes become too frequent.
>> 
>> 4.       Regardless of the Java Heap size (we’ve tested with even larger 
>> heaps and VM sizing up to 384GB), all nodes that have approx. more than 
>> 120-130M documents crash with OOM under heavy load (hundreds of simultaneous 
>> searches with a variety of Filter Queries).
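>> 
>> Regarding point 3, the relevant piece of our solrconfig.xml looks roughly like 
>> this (shown for illustration; the exact class and autowarm settings may differ):
>> 
>>    <filterCache class="solr.FastLRUCache" size="256" initialSize="256" autowarmCount="0"/>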
>> 
>> The filterCache is really heavily used and some of the BitSets span 80-90% of 
>> the DocSet of each shard, so in many cases the FC entries become larger than 
>> 16MB. We believe we’ve pinpointed the problem to the G1 Garbage Collector and 
>> the hard-coded limit for "-XX:G1HeapRegionSize", which allows a maximum of 
>> 32MB, regardless of whether it is auto-calculated or set manually in the JVM 
>> startup options. The JVM memory allocation algorithm tracks every memory 
>> allocation request, and if a request exceeds 50% of G1HeapRegionSize it is 
>> considered a humongous allocation (he-he, an extremely large allocation in 
>> 2019?!?), so it is not scanned and evaluated during standard garbage 
>> collection cycles. Unused humongous allocations are basically freed only 
>> during Full Garbage Collection cycles, which G1 never really invokes before it 
>> is too late and the JVM crashes with OOM.
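>> 
>> The only G1 knob left for us is pinning the region size at its maximum in 
>> solr.in.sh, roughly like this (illustrative; it merely lifts the humongous 
>> threshold to 16MB, it does not remove it):
>> 
>>    GC_TUNE="-XX:+UseG1GC -XX:G1HeapRegionSize=32m"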
>> 
>> Now we are rebalancing the cluster to have up to 100-120M documents per 
>> shard, following an ancient, but probably still valid, limitation suggested 
>> in the Solr 1.4 documentation by Shawn 
>> Heisey<https://cwiki.apache.org/confluence/display/solr/ShawnHeisey>: “If 
>> you have an index with about 100 million documents in it, you'll want to use 
>> a region size of 32MB, which is the maximum possible size. Because of this 
>> limitation of the G1 collector, we recommend always keeping a Solr index 
>> below a maxDoc value of around 100 to 120 million.”
>> 
>> Cheers,
>> Vassil Velichkov
