Hello Solr gurus,

I have a scenario where, on a Solr cluster restart, a replica goes into
full index replication for about 7 hours. Both replica nodes are restarted
around the same time for maintenance. Also, during normal operation, if one
node goes down for whatever reason, it again does a full index replication
when it comes back up. In certain instances, some replicas simply fail to
recover.

SolrCloud 7.2.1 cluster configuration:
============================
16 shards - replication factor=2

Per server configuration:
======================
32GB machine - 16GB heap space for Solr
Index size: 3TB per server

autoCommit (openSearcher=false) of 3 minutes

We have a heavy indexing load of about 10,000 documents every 150 seconds.
Not so heavy query load.
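
For context, the commit settings in our solrconfig.xml look roughly like
the snippet below. The 3-minute hard commit with openSearcher=false is what
we actually run; the autoSoftCommit value is only a placeholder to show
where a soft commit interval would go, since that is one of the things I am
asking about:

<autoCommit>
  <maxTime>180000</maxTime>            <!-- 3 minutes, hard commit -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- placeholder only - we have not settled on a soft commit interval -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>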

Reading through some threads on similar topics, I suspect the cause is the
disparity in the number of updates (>100) between the replicas, courtesy of
our indexing load. One of the suggestions I saw was to increase
numRecordsToKeep. However, as Erick mentioned in one of those threads,
that's a band-aid measure, and I am trying to eliminate the more
fundamental issues that might exist.
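
To be concrete, my understanding is that the numRecordsToKeep suggestion
refers to the updateLog section of solrconfig.xml, something along these
lines (the 1000 is just an illustrative value from those threads, not
something we have applied; the default is 100):

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <!-- raise the tlog window so PeerSync can cover bigger gaps between replicas -->
  <int name="numRecordsToKeep">1000</int>
  <int name="maxNumLogsToKeep">10</int>
</updateLog>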

1) Is the heap too small for that index size? If yes, what would be a
recommended max heap size?
2) Is there a general guideline for estimating the required max heap based
on the index size on disk?
3) What would be recommended autoCommit and autoSoftCommit intervals?
4) Are there any configurations that would help improve the restart time
and avoid full replication?
5) Does Solr retain "numRecordsToKeep" documents in the tlog *per replica*?
6) The reasons for the PeerSync failures in the logs below are not
completely clear to me. Can someone please elaborate?

PeerSync fails with:

Failure type 1:
-----------------
2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Fingerprint comparison: 1

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Other fingerprint:
{maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128,
versionsHash=-8308981502886241345, numVersions=32966082, numDocs=32966165,
maxDoc=1828452}, Our fingerprint: {maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128,
versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165,
maxDoc=1828452}

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 url=
http://indexnode1:8983/solr DONE. sync failed

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not successful
- trying replication.


Failure type 2:
------------------
2019-02-02 20:26:56.256 WARN
(recoveryExecutor-4-thread-11-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=
http://indexnode1:20000/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates


Thanks,
Rahul
