We found that for the shard that does not get a leader, the tlog replay did not complete for hours: we never see the "log replay finished", "creating leader registration node", "I am the new leader", etc. log messages.
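As a rough sanity check on how long the replay might take, here is a minimal Python sketch that estimates replay throughput and remaining time from the two UpdateLog lines in the log output below. The log lines are abbreviated copies (node name, thread name and MDC context dropped), the function and constant names are hypothetical, and the estimate assumes the replay rate stays constant and the tlog stops growing, neither of which we have verified.

```python
import re
from datetime import datetime

# Abbreviated copies of the two UpdateLog lines from the log output in this
# post (node name, thread name and MDC context removed for readability).
START_LINE = (
    "2021-02-12 19:36:05.966 WARN o.a.s.u.UpdateLog Starting log replay "
    "tlog{file=/opt/solr/volumes/data1/mydata_0_80000000-9fffffff/tlog/"
    "tlog.0000000000000003525 refcount=2} active=false starting pos=0 "
    "inSortedOrder=true"
)
STATUS_LINE = (
    "2021-02-12 22:13:03.084 INFO o.a.s.u.UpdateLog log replay status "
    "tlog{file=/opt/solr/volumes/data1/mydata_0_80000000-9fffffff/tlog/"
    "tlog.0000000000000003525 refcount=3} active=false starting pos=0 "
    "current pos=27101174167 current size=33357447222 % read=81.0"
)

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")
POS_RE = re.compile(r"current pos=(\d+) current size=(\d+)")


def parse_ts(line):
    """Extract the first timestamp found in a log line."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no timestamp found in line")
    return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f")


def replay_estimate(start_line, status_line):
    """Return (elapsed seconds, bytes per second, estimated seconds left).

    Assumes a constant replay rate and a tlog that is no longer growing.
    """
    elapsed = (parse_ts(status_line) - parse_ts(start_line)).total_seconds()
    match = POS_RE.search(status_line)
    if match is None:
        raise ValueError("no replay position found in status line")
    pos, size = (int(g) for g in match.groups())
    rate = pos / elapsed
    remaining = (size - pos) / rate
    return elapsed, rate, remaining


elapsed, rate, remaining = replay_estimate(START_LINE, STATUS_LINE)
print(
    f"elapsed: {elapsed / 3600:.2f} h, "
    f"rate: {rate / 1e6:.2f} MB/s, "
    f"remaining: {remaining / 60:.0f} min"
)
```

With these figures the replay works out to about 2.9 MB/s, leaving roughly 36 minutes at that rate, so this 33 GB tlog would need a bit over three hours to replay end to end.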
We are also not sure why the tlogs are tens of GB in size (anywhere from 30 to 40 GB). The collection's shards have three replicas each, use TLOG replication, and have a 10-second hard commit. This is causing outages twice a day. Our current workaround is to stop ingestion, issue a collection rebalance and/or restart nodes, and repeat until the shards are online (a couple of hours of manual recovery each day).

Any suggestions or workarounds? We're not sure whether we're running into these issues:

https://issues.apache.org/jira/browse/SOLR-13486
https://issues.apache.org/jira/browse/SOLR-14679

Partial log output from both nodes (sorted by time, ascending):

myapp-data-solr-0 2021-02-12 19:36:05.965 INFO (zkCallback-14-thread-51) [c:mydata s:0_80000000-9fffffff r:core_node3 x:mydata_0_80000000-9fffffff_replica_t1] o.a.s.c.ShardLeaderElectionContext Replaying tlog before become new leader

myapp-data-solr-0 2021-02-12 19:36:05.966 WARN (recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr x:mydata_0_80000000-9fffffff_replica_t1 c:mydata s:0_80000000-9fffffff r:core_node3) [c:mydata s:0_80000000-9fffffff r:core_node3 x:mydata_0_80000000-9fffffff_replica_t1] o.a.s.u.UpdateLog Starting log replay tlog{file=/opt/solr/volumes/data1/mydata_0_80000000-9fffffff/tlog/tlog.0000000000000003525 refcount=2} active=false starting pos=0 inSortedOrder=true

myapp-data-solr-0 2021-02-12 22:13:03.084 INFO (recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr x:mydata_0_80000000-9fffffff_replica_t1 c:mydata s:0_80000000-9fffffff r:core_node3) [c:mydata s:0_80000000-9fffffff r:core_node3 x:mydata_0_80000000-9fffffff_replica_t1] o.a.s.u.UpdateLog log replay status tlog{file=/opt/solr/volumes/data1/mydata_0_80000000-9fffffff/tlog/tlog.0000000000000003525 refcount=3} active=false starting pos=0 current pos=27101174167 current size=33357447222 % read=81.0

myapp-data-solr-0 2021-02-12 22:13:06.602 ERROR (indexFetcher-4092-thread-1) [ ] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: mydata slice: 0_80000000-9fffffff saw state=DocCollection(mydata//collections/mydata/state.json/750)={ "pullReplicas":"0", "replicationFactor":"0", "shards":{ "0_80000000-9fffffff":{ "range":null, "state":"active", "replicas":{ "core_node3":{ "core":"mydata_0_80000000-9fffffff_replica_t1", "base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{ "core":"mydata_0_80000000-9fffffff_replica_t2", "base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false", "property.preferredleader":"true"}, "core_node6":{ "core":"mydata_0_80000000-9fffffff_replica_t4", "base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false"}}},

myapp-data-solr-0 2021-02-12 22:45:51.600 ERROR (indexFetcher-4092-thread-1) [ ] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: mydata slice: 0_80000000-9fffffff saw state=DocCollection(mydata//collections/mydata/state.json/754)={ "pullReplicas":"0", "replicationFactor":"0", "shards":{ "0_80000000-9fffffff":{ "range":null, "state":"active", "replicas":{ "core_node3":{ "core":"mydata_0_80000000-9fffffff_replica_t1", "base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{ "core":"mydata_0_80000000-9fffffff_replica_t2", "base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false", "property.preferredleader":"true"}, "core_node6":{ "core":"mydata_0_80000000-9fffffff_replica_t4", "base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false"}}},...

myapp-data-solr-1 2021-02-12 22:45:56.600 ERROR (indexFetcher-4092-thread-1) [ ] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: mydata slice: 0_80000000-9fffffff saw state=DocCollection(mydata//collections/mydata/state.json/755)={ "pullReplicas":"0", "replicationFactor":"0", "shards":{ "0_80000000-9fffffff":{ "range":null, "state":"active", "replicas":{ "core_node3":{ "core":"mydata_0_80000000-9fffffff_replica_t1", "base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{ "core":"mydata_0_80000000-9fffffff_replica_t2", "base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false", "property.preferredleader":"true"}, "core_node6":{ "core":"mydata_0_80000000-9fffffff_replica_t4", "base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false"}}},...

--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html