You are running with a 290 GB heap (-Xmx290000m is 290,000 MB) on a 30 GB
machine (!!!!). That is the worst Java config I have ever seen.

Use this:

SOLR_JAVA_MEM="-Xms8g -Xmx8g"

That starts with an 8 GB heap and stays there.
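
Setting -Xms equal to -Xmx allocates the whole heap up front, so the JVM never
resizes it. For a quick test without editing solr.in.sh, the bin/solr script
can set both at once (a rough sketch, assuming a default install path):

/opt/solr/bin/solr start -m 8g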

Also, you might think about simplifying the GC configuration, or, if you are
on a recent release of Java 8, switching to the G1 collector. We're getting
great performance with this config:

SOLR_HEAP=8g
# Use G1 GC  -- wunder 2017-01-23
# Settings from https://wiki.apache.org/solr/ShawnHeisey
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
"

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 18, 2017, at 7:24 AM, Shamik Bandopadhyay <sham...@gmail.com> wrote:
> 
> Hi,
> 
>   I recently upgraded to Solr 6.6 from 5.5. After running for a couple of
> days, the entire Solr cluster suddenly came down with an OOM exception. Once
> the servers are restarted, the memory footprint stays stable for a
> while before a sudden spike in memory occurs. The heap surges up quickly
> and hits the max, causing the JVM to shut down due to OOM. It starts with
> one server but eventually trickles down to the rest of the nodes, bringing
> the entire cluster down within a span of 10-15 mins.
> 
> The cluster consists of 6 nodes with two shards having 2 replicas each.
> There are two collections with a total index size close to 24 GB. Each server
> has 8 CPUs with 30 GB of memory. Solr is running on embedded Jetty on JDK
> 1.8. The JVM parameters are identical to 5.5:
> 
> SOLR_JAVA_MEM="-Xms1000m -Xmx290000m"
> 
> GC_LOG_OPTS="-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails \
>  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime"
> 
> GC_TUNE="-XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:TargetSurvivorRatio=90 \
> -XX:MaxTenuringThreshold=8 \
> -XX:+UseConcMarkSweepGC \
> -XX:+UseParNewGC \
> -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=50 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled"
> 
> I've tried G1GC based on Shawn's wiki, but it didn't make any difference.
> Though G1GC seemed to do well initially, it showed similar
> behaviour during the spike, which prompted me to revert to CMS.
> 
> I'm doing a hard commit every 5 mins.
> 
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=300000"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"
> 
> Other Solr configurations:
> 
> <autoSoftCommit>
> <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
> 
> Cache settings:
> 
> <maxBooleanClauses>4096</maxBooleanClauses>
> <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
> <filterCache class="solr.FastLRUCache" size="20000" initialSize="4096"
> autowarmCount="512"/>
> <queryResultCache class="solr.LRUCache" size="2000" initialSize="500"
> autowarmCount="100"/>
> <documentCache class="solr.LRUCache" size="60000" initialSize="5000"
> autowarmCount="0"/>
> <cache name="perSegFilter" class="solr.search.LRUCache" size="10"
> initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
> <fieldValueCache class="solr.FastLRUCache" size="20000"
> autowarmCount="4096" showItems="1024" />
> <cache enable="${solr.ltr.enabled:false}" name="QUERY_DOC_FV"
> class="solr.search.LRUCache" size="4096" initialSize="2048"
> autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
> <enableLazyFieldLoading>true</enableLazyFieldLoading>
> <queryResultWindowSize>200</queryResultWindowSize>
> <queryResultMaxDocsCached>400</queryResultMaxDocsCached>
> 
> I'm not sure what has changed so drastically in 6.6 compared to 5.5. I
> never had a single OOM in 5.5, which had been running for a couple of years.
> Moreover, the memory footprint was much lower with 15 GB set as Xmx. All my
> facet fields have docValues enabled, which should handle the memory part
> efficiently.
> 
> I'm struggling to figure out the root cause. Does 6.6 require more memory
> than what is currently available on our servers (30 GB)? What might be the
> probable cause of this sort of scenario? What are the best practices for
> troubleshooting such issues?
> 
> Any pointers will be appreciated.
> 
> Thanks,
> Shamik
