Hi,

I'm having a weird issue where Solr nodes suddenly go into recovery mode and eventually fail. That single failure kicks off a cascading effect and eventually impacts the other nodes; without a restart, the entire cluster ends up in limbo after a while. Looking at the logs and the SPM monitoring tool, the issue happens under the following circumstances:

1. The node gets a spike in query/index requests and exhausts its allocated memory.
2. GC drives the CPU to 100% of its capacity.
3. None of the above: both JVM heap and CPU are within their limits.
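In case it matters for telling cases 1 and 2 apart, GC logging and a heap dump on OOM can be enabled in solr.in.sh along these lines. Treat this as an illustrative sketch rather than my exact config: the log path is a placeholder, the flags are the standard Java 8 HotSpot ones, and GC_LOG_OPTS is simply the hook solr.in.sh exposes for GC flags:

# Sketch only: standard Java 8 GC logging plus a heap dump on OOM; paths are placeholders
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M"
SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/logs"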
I'm using Solr 6.6. Here are the details about the node:

Hardware type: AWS m4.4xlarge instance
Total memory: 64 GB
CPU: 16
Storage: SSD

SOLR_JAVA_MEM="-Xms35g -Xmx35g"

GC_TUNE="-XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=200 \
  -XX:+UseLargePages \
  -XX:+AggressiveOpts"

SOLR_OPTS="$SOLR_OPTS -Xss256k"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=600000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"

Cache parameters:

<maxBooleanClauses>4096</maxBooleanClauses>
<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
<filterCache class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="20"/>
<queryResultCache class="solr.LRUCache" size="256" initialSize="256" autowarmCount="10"/>
<documentCache class="solr.LRUCache" size="256" initialSize="256" autowarmCount="10"/>
<!-- custom cache currently used by block join -->
<cache name="perSegFilter" class="solr.search.LRUCache" size="10" initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
<fieldValueCache class="solr.FastLRUCache" size="256" autowarmCount="256" showItems="10" />
<cache enable="${solr.ltr.enabled:false}" name="QUERY_DOC_FV" class="solr.search.LRUCache" size="4096" initialSize="2048" autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>60</queryResultWindowSize>

I currently have 2 shards, each with 2 replicas. The index size is approximately 70 GB.

Here's a Solr log trace from the series of events once the node starts getting into trouble; I've posted only the relevant entries here:

org.apache.solr.common.SolrException.log(SolrException.java:148) - org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
    at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143) - It has been requested that we recover: core=knowledge
INFO647718[qtp2039328061-1526] - org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732) - [admin] webapp=null path=/admin/cores params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2} status=0 QTime=0
INFO647808[qtp2039328061-1540] - org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187) - [knowledge] webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0
WARN657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659) - Socket timeout on send prep recovery cmd, retrying..
INFO657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676) - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr]; [WaitForState: action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
WARN667514[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659) - Socket timeout on send prep recovery cmd, retrying..

The retry happens a few times, then:

INFO689389[qtp2039328061-1649] - org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147) - request has come without principal. failed permission { "name":"select", "collection":"knowledge", "path":"/select", "role":[ "admin", "dev", "read"], "index":3}
INFO689390[qtp2039328061-1649] - org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) - USER_REQUIRED auth header null context : userPrincipal: [null] type: [READ], collections: [knowledge,], Path: [/select] path : /select params :q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
INFO907854[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676) - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr]; [WaitForState: action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
INFO913811[commitScheduler-13-thread-1] - org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:603) - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.SolrIndexWriter.setCommitData(SolrIndexWriter.java:174) - Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@29f7c7fc
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [IW][commitScheduler-13-thread-1]: commit: start
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [IW][commitScheduler-13-thread-1]: commit: enter lock
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [IW][commitScheduler-13-thread-1]: commit: now prepare
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [IW][commitScheduler-13-thread-1]: prepareCommit: flush
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [IW][commitScheduler-13-thread-1]: index before flush _1e0k(6.6.0):C4171894/2478712:delGen=4036 _1pl8(6.6.0):C774233/198857:delGen=642 _1g34(6.6.0):C1023729/434317:delGen=3518 _1q40(6.6.0):C978521/95955:delGen=457 _1pzp(6.6.0):c202089/41205:delGen=494 _1qqe(6.6.0):c64285/33487:delGen=251 _1qsz(6.6.0):c163910/69541:delGen=233 _1rhp(6.6.0):c114624/15323:delGen=51 _1rc2(6.6.0):c47728/12355:delGen=93 _1rm0(6.6.0):c43606/4233:delGen=14
_1rl6(6.6.0):c1349/425:delGen=18 _1rk9(6.6.0):C2237/901:delGen=26 _1rmg(6.6.0):c3698/1000:delGen=10 _1rni(6.6.0):c5632/602:delGen=2 _1rlw(6.6.0):C3/2:delGen=2 _1rn1(6.6.0):c475/134:delGen=6 _1rmi(6.6.0):C709/361:delGen=9 _1rnp(6.6.0):c194/21:delGen=1 _1rmr(6.6.0):C1058/525:delGen=7 _1rn4(6.6.0):C661/316:delGen=6 _1rnc(6.6.0):C1111/534:delGen=4 _1rne(6.6.0):C1103/523:delGen=3 _1rnf(6.6.0):C1087/466:delGen=3 _1rng(6.6.0):C1099/535:delGen=3 _1rnl(6.6.0):C1222/473:delGen=2 _1rnn(6.6.0):C445/162:delGen=2 _1rnm(6.6.0):C2455/1084:delGen=2 _1rnq(6.6.0):C588/200:delGen=1 _1rns(6.6.0):C1153/306:delGen=1 _1rnr(6.6.0):C1434/309:delGen=1 _1rnt(6.6.0):C7
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [DW][commitScheduler-13-thread-1]: startFullFlush
INFO913812[commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [DW][commitScheduler-13-thread-1]: anyChanges? numDocsInRam=1725 deletes=true hasTickets:false pendingChangesInFullFlush: false
INFO951165[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.core.SolrCore.getIndexFingerprint(SolrCore.java:1853) - Caching fingerprint for searcher:Searcher@61856c91[knowledge] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1e0k(6.6.0):C4171894/2478713:delGen=4037) Uninverting(_1pl8(6.6.0):C774233/198909:delGen=643) Uninverting(_1g34(6.6.0):C1023729/434329:delGen=3519) Uninverting(_1q40(6.6.0):C978521/96008:delGen=458) Uninverting(_1pzp(6.6.0):c202089/41209:delGen=495) Uninverting(_1qqe(6.6.0):c64285/33505:delGen=252) Uninverting(_1qsz(6.6.0):c163910/69577:delGen=234) Uninverting(_1rhp(6.6.0):c114624/15521:delGen=52) Uninverting(_1rc2(6.6.0):c47728/12387:delGen=94) Uninverting(_1rm0(6.6.0):c43606/4440:delGen=15) Uninverting(_1rl6(6.6.0):c1349/433:delGen=19) Uninverting(_1rk9(6.6.0):C2237/908:delGen=27) Uninverting(_1rmg(6.6.0):c3698/1044:delGen=11) Uninverting(_1rni(6.6.0):c5632/708:delGen=3) Uninverting(_1rlw(6.6.0):C3/2:delGen=2) Uninverting(_1rn1(6.6.0):c475/143:delGen=7) Uninverting(_1rmi(6.6.0):C709/367:delGen=10) Uninverting(_1rnp(6.6.0):c194/28:delGen=2) Uninverting(_1rmr(6.6.0):C1058/541:delGen=8) Uninverting(_1rn4(6.6.0):C661/325:delGen=7) Uninverting(_1rnc(6.6.0):C1111/550:delGen=5) Uninverting(_1rne(6.6.0):C1103/548:delGen=4) Uninverting(_1rnf(6.6.0):C1087/500:delGen=4) Uninverting(_1rng(6.6.0):C1099/553:delGen=4) Uninverting(_1rnl(6.6.0):C1222/518:delGen=3) Uninverting(_1rnn(6.6.0):C445/178:delGen=3) Uninverting(_1rnm(6.6.0):C2455/1138:delGen=3) Uninverting(_1rnq(6.6.0):C588/238:delGen=2) Uninverting(_1rns(6.6.0):C1153/397:delGen=2) Uninverting(_1rnr(6.6.0):C1434/479:delGen=2) Uninverting(_1rnt(6.6.0):C7/1:delGen=1) Uninverting(_1rnw(6.6.0):C4/1:delGen=1) Uninverting(_1rnx(6.6.0):C679/149:delGen=1) Uninverting(_1rnv(6.6.0):C107/33:delGen=1) Uninverting(_1rnu(6.6.0):C935/205:delGen=1)))} leafReaderContext:LeafReaderContext(FilterLeafReader(Uninverting(_1rnu(6.6.0):C935/205:delGen=1)) docBase=7613129 ord=34) mavVersion:9223372036854775807

This goes on for a while till I see this:

index=_1e0k(6.6.0):C4171894/2478715:delGen=4037 _1pl8(6.6.0):C774233/198956:delGen=643 _1g34(6.6.0):C1023729/434341:delGen=3519 _1q40(6.6.0):C978521/96071:delGen=458 _1pzp(6.6.0):c202089/41213:delGen=495 _1qqe(6.6.0):c64285/33523:delGen=252 _1qsz(6.6.0):c163910/69626:delGen=234 _1rhp(6.6.0):c114624/15701:delGen=52 _1rc2(6.6.0):c47728/12410:delGen=94
_1rm0(6.6.0):c43606/4621:delGen=15 _1rl6(6.6.0):c1349/443:delGen=19 _1rk9(6.6.0):C2237/913:delGen=27 _1rmg(6.6.0):c3698/1074:delGen=11 _1rna(6.6.0):c6218/1527:delGen=5 _1rlw(6.6.0):C3/2:delGen=2 _1rn1(6.6.0):c475/151:delGen=7 _1rmi(6.6.0):C709/374:delGen=10 _1rnn(6.6.0):c233/33:delGen=2 _1rmr(6.6.0):C1058/553:delGen=8 _1rmw(6.6.0):C1 _1rnd(6.6.0):C576/312:delGen=5 _1rne(6.6.0):C568/277:delGen=5 _1rng(6.6.0):C2136/1055:delGen=4 _1rnh(6.6.0):C2570/1260:delGen=4 _1rni(6.6.0):C1 _1rnk(6.6.0):C1462/712:delGen=3 _1rnj(6.6.0):C2436/1106:delGen=3 _1rnp(6.6.0):C1560/603:delGen=2 _1rno(6.6.0):C141/71:delGen=2 _1rnq(6.6.0):C1393/517:delGen=2 _1rns(6.6.0):C4/1:delGen=1 _1rnr(6.6.0):C12/5:delGen=1 _1rnu(6.6.0):C1846/484:delGen=1 _1rnv(6.6.0):C513/106:delGen=1 _1rnt(6.6.0):C600/185:delGen=1 _1rnw(6.6.0):C1 _1rnx(6.6.0):C3 _1rnz(6.6.0):C1
version=6.6.0 analyzer=org.apache.solr.update.SolrIndexConfig$DelayedSchemaAnalyzer ramBufferSizeMB=320.0 maxBufferedDocs=-1 maxBufferedDeleteTerms=-1 mergedSegmentWarmer=null delPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper commit=null openMode=APPEND similarity=org.apache.solr.search.similarities.SchemaSimilarityFactory$SchemaSimilarity mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1, maxMergeCount=-1, ioThrottle=true codec=Lucene62 infoStream=org.apache.solr.update.LoggingInfoStream mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.1 indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@63603ac1 readerPooling=false perThreadHardLimitMB=1945 useCompoundFile=false commitOnClose=true indexSort=null writer=org.apache.solr.update.SolrIndexWriter@3360cb8
INFO265541[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) - [IW][recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9]: MMapDirectory.UNMAP_SUPPORTED=true
INFO265541[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.update.DefaultSolrCoreState.changeWriter(DefaultSolrCoreState.java:212) - New IndexWriter is ready to be used.
ERROR267470[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:519) - Could not publish as ACTIVE after succesful recovery
org.apache.solr.common.SolrException: Cannot publish state of core 'knowledge' as active without recovering first!
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1175)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
    at org.apache.solr.cloud.ZkController.publish(ZkController.java:1127)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:517)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:284)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
ERROR267471[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:541) - Recovery failed - trying again... (0)
INFO267471[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:563) - Wait [2.0] seconds before trying to recover again (attempt=1)

The entire cycle keeps repeating till the node goes offline. I've also seen the following exceptions at times:

WARN670422[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1536) - Error in fetching file: _1e0k.fdt (downloaded 2385510400 of 6610887701 bytes)
org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 32768; actual size: 28517)
    at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
INFO831840[qtp1389808948-125825] - org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147) - request has come without principal.
failed permission { "name":"select", "collection":"knowledge", "path":"/select", "role":[ "admin", "dev", "read"]}
INFO831841[qtp1389808948-125836] - org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) - USER_REQUIRED auth header null context : userPrincipal: [null] type: [READ], collections: [knowledge,], Path: [/select] path : /select params :q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
INFO831840[qtp1389808948-125855] - org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) - USER_REQUIRED auth header null context : userPrincipal: [null] type: [READ], collections: [knowledge,], Path: [/select] path : /select params :q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
ERROR831840[qtp1389808948-125740] - org.apache.solr.common.SolrException.log(SolrException.java:148) - org.apache.solr.common.SolrException: no servers hosting shard: shard1
    at org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:413)
    at org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:227)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:265)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)

I'm not sure what the issue is here or how to address it. I've played around with different memory settings but haven't had any success so far (an illustrative example of the kind of variation I mean is in the P.S. below). I'm also not sure why it affects the entire cluster. When I restart the instance, it goes into recovery mode and updates its index with the delta, which is understandable. But at the same time, the other replica in the same shard stalls and goes offline. This starts a cascading effect, and I end up having to restart all the nodes.

Any pointers will be appreciated.

Thanks,
Shamik
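P.S. To make "playing with memory settings" concrete: one obvious variation is a smaller heap, so that more of the 64 GB stays available to the OS page cache for the ~70 GB index. The values below are purely illustrative (solr.in.sh style), not the exact settings from any particular run:

# Illustrative only: one heap-sizing variation, not production values
SOLR_JAVA_MEM="-Xms16g -Xmx16g"
# GC_TUNE kept on G1 with the same region size; only the heap size changes in this sketch
GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200"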