Erick,

   Thanks for your input. All our fields used for faceting, grouping, and
sorting have had docValues enabled since Solr 6.5, including the id field.
Here are the fieldCache stats:

CACHE.core.fieldCache.entries_count:    0
CACHE.core.fieldCache.total_size:          0 bytes
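
In case it's useful to double-check a particular field, something like this
against the Schema API should show it -- a sketch only, assuming the
'knowledge' collection and that showDefaults pulls in properties inherited
from the fieldType:

# Sketch: confirm docValues on the id field via the Schema API.
curl "http://localhost:8983/solr/knowledge/schema/fields/id?showDefaults=true"
# expect "docValues":true for the fields used in facet, group & sort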

Based on what I've seen so far, I don't think ZooKeeper is the culprit here.
All the nodes, including ZooKeeper, were set up recently. They are all inside
the same VPC, within the same AZ, and the instances talk to each other over a
dedicated network. Both the ZooKeeper and Solr instances have SSDs.

Here's what's happening based on my observation. Whenever an instance is
restarted, it sends a prep-recovery (PREPRECOVERY) command to its leader or
to a different node in the other shard. The node that receives the recovery
request is the one that goes down next: within a few minutes its old-gen heap
reaches the maximum allocated heap, stalling the process. I suspect that
because of this it fails to send the credentials for its ZooKeeper session
within the stipulated timeframe, which is why ZooKeeper terminates the
session. Here's the relevant startup log:

2018-12-13 04:02:34.910 INFO 
(recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr
x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9)
[c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4]
o.a.s.c.RecoveryStrategy Sending prep recovery command to
[http://x.x.240.225:8983/solr]; [WaitForState:
action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]

The node sends a recovery command to its replica, which immediately causes
the G1 Old Gen JVM pool to hit the maximum heap size. Please see the
screenshot below, which shows the sudden jump in heap usage. We've made sure
the indexing process is completely switched off at this point, so there are
no commits happening.

JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0
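
To isolate this, I'm thinking of replaying the same prep-recovery call by
hand against the receiving node while watching its heap. If I have the
endpoint right, the WaitForState above maps onto the CoreAdmin handler
roughly like this (a sketch only; the parameters are copied verbatim from the
log above):

# Sketch: replay the prep-recovery request manually against the receiving
# node (x.x.240.225) and watch its heap while the request is in flight.
curl "http://x.x.240.225:8983/solr/admin/cores?action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true"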

I'm totally puzzled by this behavior; I've never seen anything like it
before. Could the G1GC settings be contributing to this issue?
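
For context, by G1GC settings I mean the GC_TUNE block in solr.in.sh. The
flags below are only illustrative of the kind of tuning involved, not
necessarily what we run in production:

# Illustrative G1 flags in solr.in.sh (example values, not our production
# settings).
GC_TUNE="-XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=8m \
  -XX:MaxGCPauseMillis=250 \
  -XX:InitiatingHeapOccupancyPercent=75 \
  -XX:+PerfDisableSharedMem"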

From the ZooKeeper log:

2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
Accepted socket connection from /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
Accepted socket connection from /10.0.0.160:58378
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client
attempting to establish new session at /10.0.0.160:58376
2018-12-13 03:47:27,905 [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x0, likely client has closed
socket
2018-12-13 03:47:27,905 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed
socket connection for client /10.0.0.160:58378 (no session established for
client)
2018-12-13 03:47:27,907 [myid:1] - INFO 
[CommitProcessor:1:ZooKeeperServer@683] - Established session
0x100c46d01440072 with negotiated timeout 10000 for client /10.0.0.160:58376
2018-12-13 03:47:39,386 [myid:1] - INFO 
[CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client
/10.0.0.160:58376 which had sessionid 0x100c46d01440072
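
One thing that stands out above is the negotiated session timeout of 10000
ms, which seems short relative to the length of the GC stall. If I understand
it correctly, the client side of that comes from zkClientTimeout; assuming
solr.xml still carries the default ${zkClientTimeout:...} placeholder, it can
be raised via a system property in solr.in.sh, e.g.:

# Sketch: request a longer ZooKeeper session timeout (the server still caps
# it at its own maxSessionTimeout during negotiation).
SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=30000"

I realize that would only paper over the GC pause rather than fix it, but it
might keep the session alive long enough to see what the node is doing.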



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
