Hi,

  I'm having a weird issue where Solr nodes suddenly go into recovery
mode and eventually fail. That one failure kicks off a cascade that
eventually impacts the other nodes. Without a restart, the entire
cluster ends up in limbo after a while. Looking at the logs and the SPM
monitoring tool, the issue seems to happen under the following
circumstances (a quick way to sanity-check the JVM/GC theory is sketched
right after this list):
1. A node gets a spike in query/index requests and exhausts its allocated
memory.
2. GC drives the CPU to 100% of its capacity.
3. Neither of the above: both JVM memory and CPU are within limits.
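
A quick way to see where a node stands on heap and GC when this starts is
to pull the JVM metrics over HTTP, roughly like this (host is a placeholder,
and credentials need to be added as required by the authorization plugin):

# JVM heap usage and GC counters/times for one node (metrics API, Solr 6.4+)
curl "http://xx.xxx.xxx.251:8983/solr/admin/metrics?group=jvm&wt=json"

The memory.heap.* and gc.* entries in the response are the ones I compare
against the SPM graphs.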

I'm using Solr 6.6. Here are the details about each node:

Hardware type: AWS m4.4xlarge instance
Total memory: 64 GB
CPU: 16 vCPUs
Storage: SSD
SOLR_JAVA_MEM="-Xms35g -Xmx35g"
GC_TUNE="-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts"
SOLR_OPTS="$SOLR_OPTS -Xss256k"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=600000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"

Cache Parameters:
<maxBooleanClauses>4096</maxBooleanClauses>
<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
<filterCache class="solr.FastLRUCache" size="4096" initialSize="1024"
autowarmCount="20"/>
<queryResultCache class="solr.LRUCache" size="256" initialSize="256"
autowarmCount="10"/>
<documentCache class="solr.LRUCache" size="256" initialSize="256"
autowarmCount="10"/>
<!-- custom cache currently used by block join -->
<cache name="perSegFilter" class="solr.search.LRUCache" size="10"
initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
<fieldValueCache class="solr.FastLRUCache" size="256" autowarmCount="256"
showItems="10" />
<cache enable="${solr.ltr.enabled:false}" name="QUERY_DOC_FV"
class="solr.search.LRUCache" size="4096" initialSize="2048"
autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>60</queryResultWindowSize>
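
In case cache sizing turns out to be part of the problem, the hit ratios
and evictions for these caches can be read from the core's MBeans; something
like the following (host is a placeholder, credentials as needed; I believe
cat=CACHE restricts the output to the cache beans):

# Cache stats (hitratio, evictions, warmupTime) for the knowledge core
curl "http://xx.xxx.xxx.251:8983/solr/knowledge/admin/mbeans?stats=true&cat=CACHE&wt=json"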

I currently have 2 shards, each with 2 replicas. The index size is
approximately 70 GB.
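
While the cascade is unfolding, the per-replica states (active, recovering,
down) can be watched with the Collections API, roughly like this (host is a
placeholder, credentials as needed):

# Per-replica state for the collection
curl "http://xx.xxx.xxx.251:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=knowledge&wt=json"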

Here's a Solr log trace of the series of events once the node starts
getting into trouble. I've posted only the relevant entries here.


org.apache.solr.common.SolrException.log(SolrException.java:148) -
org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates
are disabled.
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)

 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143)
- It has been requested that we recover: core=knowledge
INFO647718[qtp2039328061-1526] -
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732)
- [admin] webapp=null path=/admin/cores
params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2}
status=0 QTime=0
INFO647808[qtp2039328061-1540] -
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187)
- [knowledge]  webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from=
http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0

WARN657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
- Socket timeout on send prep recovery cmd, retrying..
INFO657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
- Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
[WaitForState:
action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
WARN667514[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
- Socket timeout on send prep recovery cmd, retrying..

The retry happens a few times, then:

INFO689389[qtp2039328061-1649] -
org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
- request has come without principal. failed permission {
  "name":"select",
  "collection":"knowledge",
  "path":"/select",
  "role":[
    "admin",
    "dev",
    "read"],
  "index":3}
INFO689390[qtp2039328061-1649] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
INFO907854[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
- Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
[WaitForState:
action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
INFO913811[commitScheduler-13-thread-1] -
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:603)
- start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.SolrIndexWriter.setCommitData(SolrIndexWriter.java:174)
- Calling setCommitData with
IW:org.apache.solr.update.SolrIndexWriter@29f7c7fc
 INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [IW][commitScheduler-13-thread-1]: commit: start
 INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [IW][commitScheduler-13-thread-1]: commit: enter lock
INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [IW][commitScheduler-13-thread-1]: commit: now prepare
INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [IW][commitScheduler-13-thread-1]: prepareCommit: flush
 INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [IW][commitScheduler-13-thread-1]:   index before flush
_1e0k(6.6.0):C4171894/2478712:delGen=4036
_1pl8(6.6.0):C774233/198857:delGen=642
_1g34(6.6.0):C1023729/434317:delGen=3518
_1q40(6.6.0):C978521/95955:delGen=457 _1pzp(6.6.0):c202089/41205:delGen=494
_1qqe(6.6.0):c64285/33487:delGen=251 _1qsz(6.6.0):c163910/69541:delGen=233
_1rhp(6.6.0):c114624/15323:delGen=51 _1rc2(6.6.0):c47728/12355:delGen=93
_1rm0(6.6.0):c43606/4233:delGen=14 _1rl6(6.6.0):c1349/425:delGen=18
_1rk9(6.6.0):C2237/901:delGen=26 _1rmg(6.6.0):c3698/1000:delGen=10
_1rni(6.6.0):c5632/602:delGen=2 _1rlw(6.6.0):C3/2:delGen=2
_1rn1(6.6.0):c475/134:delGen=6 _1rmi(6.6.0):C709/361:delGen=9
_1rnp(6.6.0):c194/21:delGen=1 _1rmr(6.6.0):C1058/525:delGen=7
_1rn4(6.6.0):C661/316:delGen=6 _1rnc(6.6.0):C1111/534:delGen=4
_1rne(6.6.0):C1103/523:delGen=3 _1rnf(6.6.0):C1087/466:delGen=3
_1rng(6.6.0):C1099/535:delGen=3 _1rnl(6.6.0):C1222/473:delGen=2
_1rnn(6.6.0):C445/162:delGen=2 _1rnm(6.6.0):C2455/1084:delGen=2
_1rnq(6.6.0):C588/200:delGen=1 _1rns(6.6.0):C1153/306:delGen=1
_1rnr(6.6.0):C1434/309:delGen=1 _1rnt(6.6.0):C7
INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [DW][commitScheduler-13-thread-1]: startFullFlush
 INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [DW][commitScheduler-13-thread-1]: anyChanges? numDocsInRam=1725
deletes=true hasTickets:false pendingChangesInFullFlush: false
INFO951165[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.core.SolrCore.getIndexFingerprint(SolrCore.java:1853) -
Caching fingerprint for searcher:Searcher@61856c91[knowledge]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1e0k(6.6.0):C4171894/2478713:delGen=4037)
Uninverting(_1pl8(6.6.0):C774233/198909:delGen=643)
Uninverting(_1g34(6.6.0):C1023729/434329:delGen=3519)
Uninverting(_1q40(6.6.0):C978521/96008:delGen=458)
Uninverting(_1pzp(6.6.0):c202089/41209:delGen=495)
Uninverting(_1qqe(6.6.0):c64285/33505:delGen=252)
Uninverting(_1qsz(6.6.0):c163910/69577:delGen=234)
Uninverting(_1rhp(6.6.0):c114624/15521:delGen=52)
Uninverting(_1rc2(6.6.0):c47728/12387:delGen=94)
Uninverting(_1rm0(6.6.0):c43606/4440:delGen=15)
Uninverting(_1rl6(6.6.0):c1349/433:delGen=19)
Uninverting(_1rk9(6.6.0):C2237/908:delGen=27)
Uninverting(_1rmg(6.6.0):c3698/1044:delGen=11)
Uninverting(_1rni(6.6.0):c5632/708:delGen=3)
Uninverting(_1rlw(6.6.0):C3/2:delGen=2)
Uninverting(_1rn1(6.6.0):c475/143:delGen=7)
Uninverting(_1rmi(6.6.0):C709/367:delGen=10)
Uninverting(_1rnp(6.6.0):c194/28:delGen=2)
Uninverting(_1rmr(6.6.0):C1058/541:delGen=8)
Uninverting(_1rn4(6.6.0):C661/325:delGen=7)
Uninverting(_1rnc(6.6.0):C1111/550:delGen=5)
Uninverting(_1rne(6.6.0):C1103/548:delGen=4)
Uninverting(_1rnf(6.6.0):C1087/500:delGen=4)
Uninverting(_1rng(6.6.0):C1099/553:delGen=4)
Uninverting(_1rnl(6.6.0):C1222/518:delGen=3)
Uninverting(_1rnn(6.6.0):C445/178:delGen=3)
Uninverting(_1rnm(6.6.0):C2455/1138:delGen=3)
Uninverting(_1rnq(6.6.0):C588/238:delGen=2)
Uninverting(_1rns(6.6.0):C1153/397:delGen=2)
Uninverting(_1rnr(6.6.0):C1434/479:delGen=2)
Uninverting(_1rnt(6.6.0):C7/1:delGen=1)
Uninverting(_1rnw(6.6.0):C4/1:delGen=1)
Uninverting(_1rnx(6.6.0):C679/149:delGen=1)
Uninverting(_1rnv(6.6.0):C107/33:delGen=1)
Uninverting(_1rnu(6.6.0):C935/205:delGen=1)))}
leafReaderContext:LeafReaderContext(FilterLeafReader(Uninverting(_1rnu(6.6.0):C935/205:delGen=1))
docBase=7613129 ord=34) mavVersion:9223372036854775807


This goes on for a while until I see this:


index=_1e0k(6.6.0):C4171894/2478715:delGen=4037
_1pl8(6.6.0):C774233/198956:delGen=643
_1g34(6.6.0):C1023729/434341:delGen=3519
_1q40(6.6.0):C978521/96071:delGen=458 _1pzp(6.6.0):c202089/41213:delGen=495
_1qqe(6.6.0):c64285/33523:delGen=252 _1qsz(6.6.0):c163910/69626:delGen=234
_1rhp(6.6.0):c114624/15701:delGen=52 _1rc2(6.6.0):c47728/12410:delGen=94
_1rm0(6.6.0):c43606/4621:delGen=15 _1rl6(6.6.0):c1349/443:delGen=19
_1rk9(6.6.0):C2237/913:delGen=27 _1rmg(6.6.0):c3698/1074:delGen=11
_1rna(6.6.0):c6218/1527:delGen=5 _1rlw(6.6.0):C3/2:delGen=2
_1rn1(6.6.0):c475/151:delGen=7 _1rmi(6.6.0):C709/374:delGen=10
_1rnn(6.6.0):c233/33:delGen=2 _1rmr(6.6.0):C1058/553:delGen=8
_1rmw(6.6.0):C1 _1rnd(6.6.0):C576/312:delGen=5
_1rne(6.6.0):C568/277:delGen=5 _1rng(6.6.0):C2136/1055:delGen=4
_1rnh(6.6.0):C2570/1260:delGen=4 _1rni(6.6.0):C1
_1rnk(6.6.0):C1462/712:delGen=3 _1rnj(6.6.0):C2436/1106:delGen=3
_1rnp(6.6.0):C1560/603:delGen=2 _1rno(6.6.0):C141/71:delGen=2
_1rnq(6.6.0):C1393/517:delGen=2 _1rns(6.6.0):C4/1:delGen=1
_1rnr(6.6.0):C12/5:delGen=1 _1rnu(6.6.0):C1846/484:delGen=1
_1rnv(6.6.0):C513/106:delGen=1 _1rnt(6.6.0):C600/185:delGen=1
_1rnw(6.6.0):C1 _1rnx(6.6.0):C3 _1rnz(6.6.0):C1
version=6.6.0
analyzer=org.apache.solr.update.SolrIndexConfig$DelayedSchemaAnalyzer
ramBufferSizeMB=320.0
maxBufferedDocs=-1
maxBufferedDeleteTerms=-1
mergedSegmentWarmer=null
delPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper
commit=null
openMode=APPEND
similarity=org.apache.solr.search.similarities.SchemaSimilarityFactory$SchemaSimilarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1,
maxMergeCount=-1, ioThrottle=true
codec=Lucene62
infoStream=org.apache.solr.update.LoggingInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0,
forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0,
maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.1
indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@63603ac1
readerPooling=false
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
writer=org.apache.solr.update.SolrIndexWriter@3360cb8

INFO265541[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
- [IW][recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9]:
MMapDirectory.UNMAP_SUPPORTED=true
INFO265541[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.update.DefaultSolrCoreState.changeWriter(DefaultSolrCoreState.java:212)
- New IndexWriter is ready to be used.

ERROR267470[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:519)
- Could not publish as ACTIVE after succesful recovery
org.apache.solr.common.SolrException: Cannot publish state of core
'knowledge' as active without recovering first!
        at
org.apache.solr.cloud.ZkController.publish(ZkController.java:1175)
        at
org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
        at
org.apache.solr.cloud.ZkController.publish(ZkController.java:1127)
        at
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:517)
        at
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:284)
        at
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
ERROR267471[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:541)
- Recovery failed - trying again... (0)
INFO267471[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:563)
- Wait [2.0] seconds before trying to recover again (attempt=1)


The entire cycle keeps repeating until the node goes offline. I've also seen
the following exceptions at times:


WARN670422[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1536)
- Error in fetching file: _1e0k.fdt (downloaded 2385510400 of 6610887701
bytes)
org.apache.http.TruncatedChunkException: Truncated chunk ( expected size:
32768; actual size: 28517)
        at
org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
        at
org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)

INFO831840[qtp1389808948-125825] -
org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
- request has come without principal. failed permission {
  "name":"select",
  "collection":"knowledge",
  "path":"/select",
  "role":[
    "admin",
    "dev",
    "read"]}
INFO831841[qtp1389808948-125836] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
INFO831840[qtp1389808948-125855] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
ERROR831840[qtp1389808948-125740] -
org.apache.solr.common.SolrException.log(SolrException.java:148) -
org.apache.solr.common.SolrException: no servers hosting shard: shard1
        at
org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:413)
        at
org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:227)
        at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:265)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)



I'm not sure what the issue is here or how to address it. I've played around
with different memory settings but haven't had any success so far. I'm also
not sure why it affects the entire cluster. When I restart the instance, it
goes into recovery mode and updates its index with the delta, which is
understandable. But at the same time, the other replica in the same shard
stalls and goes offline. That starts the cascade, and I end up having to
restart all the nodes.
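
For reference, this is the core-admin call that appears in the logs above
when recovery is requested (the leader issues it automatically); as far as I
understand it can also be sent by hand against a stuck core (host is a
placeholder, core name as in the logs):

# Ask a core to re-enter recovery (same action seen in the /admin/cores log line)
curl "http://xx.xxx.xxx.251:8983/solr/admin/cores?action=REQUESTRECOVERY&core=knowledge&wt=json"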

Any pointers will be appreciated.

Thanks,
Shamik
