Hi,

  We recently upgraded Solr from 6.5 to 7.5. We are observing some weird
issues in terms of recovery and memory usage. We've a cluster of 6 physical
nodes with 2 shards having two replica each. 7.5 seemed to have a higher
memory consumption where the average heap utilization hovers around 18 gb.
Couple of days back, one of the replicas went down as heap (30 gb) was
exhausted. Upon restart, the instance came back quickly but then started a
spiral effect where one of the nodes in the cluster kept going down one
after the other. So at any point of time, there were 5 instances available
instead of 6. Every time we would bring the bad instance back up, it would
be functional immediately but the shard it was recovering from will
eventually (within minutes) go down . This cycle (of restarting instances)
went for almost an hour before all the nodes were finally started active.
It again occurred today where we are observing similar behavior. We even
stopped the indexing pipeline to make sure that recovery is minimal. But it
didn't make any difference, the error is consistent, the affected node goes
into a recovery (not sure why) and encounters session time out in the
process.

2018-12-12 19:59:16.813 ERROR
(recoveryExecutor-4-thread-3-processing-n:x.x.23.51:8983_solr
x:knowledge_shard1_replica_n1 c:knowledge s:shard1 r:core_node7)
[c:knowledge s:shard1 r:core_node7 x:knowledge_shard1_replica_n1]
o.a.s.c.RecoveryStrategy Error while trying to recover.
core=knowledge_shard1_replica_n1:org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer/queue/qn-
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
 at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:786)
 at
org.apache.solr.common.cloud.SolrZkClient.lambda$create$7(SolrZkClient.java:398)
 at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
 at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:398)
 at
org.apache.solr.cloud.ZkDistributedQueue.offer(ZkDistributedQueue.java:321)
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1548)
 at org.apache.solr.cloud.ZkController.publish(ZkController.java:1436)
 at
org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:549)
 at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:310)
 at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:294)
 at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
 at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
 at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
 at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
 at java.base/java.lang.Thread.run(Thread.java:844)

I've added the dropbox links to the relevant error from solr log, solr gc
log, top screenshot and heap usage from spm monitor.

Solr log --> https://www.dropbox.com/s/als26je0dgrp10r/solr.log?dl=0

Solr gc log 1 -->
https://www.dropbox.com/s/m0ikb3kc9enme4f/solr_gc.log.0?dl=0

Solr gc log 2 -->
https://www.dropbox.com/s/jfs7wcjyult5ud8/solr_gc.log.1?dl=0

Top --> https://www.dropbox.com/s/x6f0zwqfbabybd0/top.png?dl=0

SPM monitor screenshots:

JVM pool --> https://www.dropbox.com/s/nbko83eflp8y2tp/jvm_pool.png?dl=0

GC screenshot --> https://www.dropbox.com/s/6zofkvgfknxwjgd/gc.png?dl=0

Solr Cache --> https://www.dropbox.com/s/o6zsxwal6pzspve/cache.png?dl=0

Here are the relevant entries from solr startup script:

SOLR_JAVA_MEM="-Xms40g -Xmx40g"

GC_LOG_OPTS='-Xlog:gc*'

GC_TUNE="-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:OnOutOfMemoryError=/mnt/ebs2/solrhome/bin/oom_solr.sh"

ZK_CLIENT_TIMEOUT="60000"

SOLR_WAIT_FOR_ZK="180"

SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=120000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=600000"
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"

All our nodes are running on AWS having 16 vCPU and 128gb . We started with
30gb heap space as the average utilization was between 18-19 gb. For this
recovery issue, we tried bumping up to 40gb but didn't make any difference.
We are using jdk 9.0.4+11 with 6 Solr nodes and 3 zookeeper (3.4.10) quorum.

Out index has close to 10 million documents with 55gb index size. Not sure
if it's relevant, but we have noticed that filter cache utilization has
drastically reduced (0.17) while document cache has gone up (0.61). It used
to be 0.9 and 0.3 in Solr 6.5.

Not sure what we are missing here in terms of Solr upgrade to 7.5 I can
provide other relevant information.

Thanks,
Shamik

Reply via email to