We found that for the shard that does not get a leader, the tlog replay did
not complete for hours (we don't see the "log replay finished", "creating
leader registration node", "I am the new leader", etc. log messages).

We're also not sure why the tlogs are tens of GB in size (anywhere from 30 to
40 GB).

The collection's shards each have 3 replicas, using TLOG replication and a
10-second hard commit.

This is causing an outage twice a day. The current workaround is to stop
ingestion, issue a collection rebalance and/or node restarts, and repeat
until the shards are online (a couple of hours of manual recovery each day).

Any suggestions or workarounds?

We're not sure whether we're running into either of these issues:
https://issues.apache.org/jira/browse/SOLR-13486
https://issues.apache.org/jira/browse/SOLR-14679


Partial log output from both nodes (sorted by time asc):

myapp-data-solr-0
2021-02-12 19:36:05.965 INFO (zkCallback-14-thread-51) [c:mydata
s:0_80000000-9fffffff r:core_node3 x:mydata_0_80000000-9fffffff_replica_t1]
o.a.s.c.ShardLeaderElectionContext Replaying tlog before become new leader


myapp-data-solr-0 
2021-02-12 19:36:05.966 WARN 
(recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr
x:mydata_0_80000000-9fffffff_replica_t1 c:mydata s:0_80000000-9fffffff
r:core_node3) [c:mydata s:0_80000000-9fffffff r:core_node3
x:mydata_0_80000000-9fffffff_replica_t1] o.a.s.u.UpdateLog Starting log
replay
tlog{file=/opt/solr/volumes/data1/mydata_0_80000000-9fffffff/tlog/tlog.0000000000000003525
refcount=2}  active=false starting pos=0 inSortedOrder=true


myapp-data-solr-0 
2021-02-12 22:13:03.084 INFO 
(recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr
x:mydata_0_80000000-9fffffff_replica_t1 c:mydata s:0_80000000-9fffffff
r:core_node3) [c:mydata s:0_80000000-9fffffff r:core_node3
x:mydata_0_80000000-9fffffff_replica_t1] o.a.s.u.UpdateLog log replay status
tlog{file=/opt/solr/volumes/data1/mydata_0_80000000-9fffffff/tlog/tlog.0000000000000003525
refcount=3} active=false starting pos=0 current pos=27101174167 current
size=33357447222 % read=81.0
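
For a rough sense of how slow the replay is, the two UpdateLog entries above can be turned into a throughput figure and a time-to-finish estimate. This is only a back-of-the-envelope sketch: it assumes both entries belong to the same replay pass (same tlog file, starting pos=0) and that the read position advances at a roughly constant rate, neither of which the logs confirm. The function name `replay_estimate` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the Solr log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def replay_estimate(start_ts, status_ts, current_pos, total_size):
    """Return (elapsed seconds, bytes/sec, estimated seconds remaining).

    Assumes a constant replay rate between the two log entries.
    """
    start = datetime.strptime(start_ts, FMT)
    status = datetime.strptime(status_ts, FMT)
    elapsed = (status - start).total_seconds()
    rate = current_pos / elapsed
    remaining = (total_size - current_pos) / rate
    return elapsed, rate, remaining


# Values copied from the "Starting log replay" and "log replay status" entries.
elapsed, rate, remaining = replay_estimate(
    "2021-02-12 19:36:05.966",
    "2021-02-12 22:13:03.084",
    current_pos=27101174167,
    total_size=33357447222,
)

print(f"elapsed:   {elapsed / 3600:.2f} h")
print(f"rate:      {rate / 1e6:.2f} MB/s")
print(f"remaining: {remaining / 60:.0f} min (if the rate holds)")
```

With these numbers the replay is moving at only a few MB/s, so a 30 to 40 GB tlog takes on the order of hours to replay before a leader can be registered.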


myapp-data-solr-0
2021-02-12 22:13:06.602 ERROR (indexFetcher-4092-thread-1) [ ]
o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: mydata slice: 0_80000000-9fffffff saw
state=DocCollection(mydata//collections/mydata/state.json/750)={
"pullReplicas":"0", "replicationFactor":"0", "shards":{
"0_80000000-9fffffff":{ "range":null, "state":"active", "replicas":{
"core_node3":{ "core":"mydata_0_80000000-9fffffff_replica_t1",
"base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{
"core":"mydata_0_80000000-9fffffff_replica_t2",
"base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false",
"property.preferredleader":"true"}, "core_node6":{
"core":"mydata_0_80000000-9fffffff_replica_t4",
"base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false"}}},


myapp-data-solr-0
2021-02-12 22:45:51.600 ERROR (indexFetcher-4092-thread-1) [ ]
o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: mydata slice: 0_80000000-9fffffff saw
state=DocCollection(mydata//collections/mydata/state.json/754)={
"pullReplicas":"0", "replicationFactor":"0", "shards":{
"0_80000000-9fffffff":{ "range":null, "state":"active", "replicas":{
"core_node3":{ "core":"mydata_0_80000000-9fffffff_replica_t1",
"base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{
"core":"mydata_0_80000000-9fffffff_replica_t2",
"base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false",
"property.preferredleader":"true"}, "core_node6":{
"core":"mydata_0_80000000-9fffffff_replica_t4",
"base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false"}}},...


myapp-data-solr-1
2021-02-12 22:45:56.600 ERROR (indexFetcher-4092-thread-1) [ ]
o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: mydata slice: 0_80000000-9fffffff saw
state=DocCollection(mydata//collections/mydata/state.json/755)={
"pullReplicas":"0", "replicationFactor":"0", "shards":{
"0_80000000-9fffffff":{ "range":null, "state":"active", "replicas":{
"core_node3":{ "core":"mydata_0_80000000-9fffffff_replica_t1",
"base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{
"core":"mydata_0_80000000-9fffffff_replica_t2",
"base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false",
"property.preferredleader":"true"}, "core_node6":{
"core":"mydata_0_80000000-9fffffff_replica_t4",
"base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr",
"node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false"}}},...




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
