Hello Solr gurus,

We have a scenario where, on a Solr cluster restart, the replica nodes go into full index replication for about 7 hours. Both replicas are restarted around the same time for maintenance. During normal operation, too, if one node goes down for any reason, it again performs full index replication upon restart. In some instances, replicas simply fail to recover.
*SolrCloud 7.2.1 cluster configuration:*
============================
16 shards, replication factor = 2

Per server configuration:
======================
32 GB machine, 16 GB heap space for Solr
Index size: 3 TB per server
autoCommit (openSearcher=false) every 3 minutes

We have a heavy indexing load of about 10,000 documents every 150 seconds, and a relatively light query load.

Reading through some threads on a similar topic, I suspect the cause is the disparity in the number of updates (>100) between the replicas, courtesy of our indexing load. One of the suggestions I saw was to raise numRecordsToKeep. However, as Erick mentioned in one of the threads, that is a band-aid measure, and I am trying to eliminate any fundamental issues that might exist first.

1) Is the heap too small for that index size? If so, what max heap size would you recommend?
2) Is there a general guideline for estimating the required max heap from the index size on disk?
3) What autoCommit and autoSoftCommit intervals would you recommend?
4) Are there any configurations that would help improve restart time and avoid full replication?
5) Does Solr retain numRecordsToKeep documents in the tlog *per replica*?
6) The reasons for the PeerSync failures in the logs below are not completely clear to me. Can someone please elaborate?
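For context on questions 3–5, here is a sketch of the solrconfig.xml section where numRecordsToKeep and the commit intervals live. The numeric values below are illustrative placeholders for the discussion, not our production settings or recommendations:

```xml
<!-- solrconfig.xml sketch; values are illustrative, not recommendations -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- default is 100; a larger value keeps more updates in the tlog of
         each replica, widening the window in which PeerSync can succeed
         before falling back to full index replication -->
    <int name="numRecordsToKeep">1000</int>
    <int name="maxNumLogsToKeep">10</int>
  </updateLog>

  <!-- hard commit every 3 minutes without opening a searcher
       (matches our current setting) -->
  <autoCommit>
    <maxTime>180000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- soft commit controls document visibility; interval is a placeholder -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```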
*PeerSync fails with:*

Failure type 1:
-----------------
2019-02-04 20:43:50.018 INFO (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45) [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45 x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42] org.apache.solr.update.PeerSync Fingerprint comparison: 1

2019-02-04 20:43:50.018 INFO (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45) [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45 x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42] org.apache.solr.update.PeerSync Other fingerprint: {maxVersionSpecified=1624579878580912128, maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128, versionsHash=-8308981502886241345, numVersions=32966082, numDocs=32966165, maxDoc=1828452}, Our fingerprint: {maxVersionSpecified=1624579878580912128, maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128, versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165, maxDoc=1828452}

2019-02-04 20:43:50.018 INFO (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45) [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45 x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42] org.apache.solr.update.PeerSync PeerSync: core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 url=http://indexnode1:8983/solr DONE. sync failed

2019-02-04 20:43:50.018 INFO (recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45) [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45 x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42] org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not successful - trying replication.

Failure type 2:
------------------
2019-02-02 20:26:56.256 WARN (recoveryExecutor-4-thread-11-processing-n:indexnode1:20000_solr x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49) [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49 x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46] org.apache.solr.update.PeerSync PeerSync: core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=http://indexnode1:20000/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates

Thanks,
Rahul