[
https://issues.apache.org/jira/browse/SOLR-17652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris M. Hostetter updated SOLR-17652:
--------------------------------------
Attachment: SOLR-17652.patch
Status: Open (was: Open)
Every now and then over the past few years, I've heard vague rumors of weird
situations involving TLOG+PULL replica based collections after "catastrophic"
failure situations with a Solr cloud cluster impacting the shard leader
(network partitions, disk errors on leader, multi-node failures, etc...).
Typically, once the underlying problem is dealt with, and the leader election
eventually "succeeds" (either on its own after nodes get restarted, or
sometimes with a manual FORCELEADER), all of the replicas should eventually get
healthy again ... BUT! ... in some of these vague rumored situations, I'm told
there are occasionally PULL replicas that never recover no matter how much time
passes. That never really made sense to me: I know the recovery code
operates in an exponential back-off loop, and the anecdotal reports
only ever described PULL replicas being affected, not TLOG replicas, even
though PULL replicas should never really care which TLOG replica is the leader
(they recheck the current leader on every PULL attempt).
----
Today I was doing some experiments with PULL replicas, and reading through some
of the core initialization code, and I think I figured out where/how this kind
of bug can occur:
1. A node hosting a PULL replica gets restarted (as part of whatever broader
catastrophic problem impacts the cluster)
2. the PULL replica comes online while there is no elected leader
3. the leader election takes longer than ~13 min (due to whatever catastrophic
problem is still being dealt with, and/or slow TLOG replay on the leader)
...in this case, the PULL replica will be left in a "DOWN" state and will never
attempt to recover.
The cause of this problem is code in {{ZkController.register(...)}} which
expects that it can do the following in sequence:
* {{joinElection(...)}} (If replica type is eligible)
* call {{getLeader(cloudDesc, leaderVoteWait + 600000)}} to see if the current
replica is the leader
** If this call throws a SolrException, {{ZkController.register(...)}}
propagates the exception w/o any sort of retry.
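To make the ~13 min figure concrete, here is a rough sketch of the timeout arithmetic behind that {{getLeader(...)}} call. It assumes {{leaderVoteWait}} is 180000 ms, which is not stated in this comment; only the 600000 ms constant and the ~13 min total are. The class and method names are illustrative and are not Solr code.

```java
import java.time.Duration;

// Illustrative only: models the timeout handed to getLeader(...) in the
// register sequence described above. Not Solr code.
final class RegisterTimeoutSketch {
    // Extra wait added on top of leaderVoteWait in the getLeader(...) call.
    static final long EXTRA_WAIT_MS = 600_000L;

    // ASSUMPTION: leaderVoteWait is taken to be 180000 ms; the original
    // comment does not state the configured value.
    static final long ASSUMED_LEADER_VOTE_WAIT_MS = 180_000L;

    // Total time register(...) would wait for a leader before giving up.
    static Duration getLeaderTimeout(long leaderVoteWaitMs) {
        return Duration.ofMillis(leaderVoteWaitMs + EXTRA_WAIT_MS);
    }

    public static void main(String[] args) {
        Duration timeout = getLeaderTimeout(ASSUMED_LEADER_VOTE_WAIT_MS);
        // 180000 + 600000 = 780000 ms
        System.out.println("getLeader timeout: " + timeout.toMinutes() + " min");
    }
}
```

Under that assumption the wait comes to 13 minutes, which matches the threshold in step 3 above; a different {{leaderVoteWait}} setting would shift the threshold accordingly.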
For NRT or TLOG replicas, this logic is fine: We've triggered an election and
even if no other replicas come along and join the election, eventually we'll
recognize that we're the leader and put our props into the leader znode so we
can read them back. But in the case of a PULL replica, there is no guarantee
if/when another replica comes along to become the leader – so if that
{{getLeader(...)}} times out, we get a log message that looks like this...
{noformat}
2025-02-04 18:38:50.124 ERROR (coreZkRegister-1-thread-3-processing-localhost:8983_solr techproducts_shard1_replica_p3 techproducts shard1 core_node4) [c:techproducts s:shard1 r:core_node4 x:techproducts_shard1_replica_p3 t:] o.a.s.c.ZkController Error getting leader from zk => org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1554)
org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1554) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1514) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1456) ~[?:?]
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1318) ~[?:?]
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1229) ~[?:?]
    at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:208) ~[?:?]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:380) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/techproducts/leaders/shard1/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:117) ~[?:?]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:53) ~[?:?]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972) ~[?:?]
    at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$6(SolrZkClient.java:452) ~[?:?]
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:70) ~[?:?]
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:452) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1534) ~[?:?]
    ... 9 more
2025-02-04 18:38:50.125 ERROR (coreZkRegister-1-thread-3-processing-localhost:8983_solr techproducts_shard1_replica_p3 techproducts shard1 core_node4) [c:techproducts s:shard1 r:core_node4 x:techproducts_shard1_replica_p3 t:] o.a.s.c.ZkContainer Exception registering core techproducts_shard1_replica_p3 => org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1499)
org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1499) ~[?:?]
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1318) ~[?:?]
    at org.apache.solr.cloud.ZkController.register(ZkController.java:1229) ~[?:?]
    at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:208) ~[?:?]
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:380) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.solr.common.SolrException: Could not get leader props
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1554) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1514) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1456) ~[?:?]
    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/techproducts/leaders/shard1/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:117) ~[?:?]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:53) ~[?:?]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972) ~[?:?]
    at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$6(SolrZkClient.java:452) ~[?:?]
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:70) ~[?:?]
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:452) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1534) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1514) ~[?:?]
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1456) ~[?:?]
    ... 7 more
{noformat}
...and we get stuck in limbo where we are DOWN and stay down and never recover
even if/when another replica *does* become a leader later.
*EVEN IF YOU RELOAD THE CORE, THE REPLICA STAYS IN A DOWN STATE!*
----
The fix (see attached patch) is fairly simple: Since the only reason
{{ZkController.register(...)}} calls {{getLeader(...)}} is to see if the
current replica _might_ be the leader, skip that call when we know from the
replica type that we _can't_ be the leader, and go straight to doing recovery
(which {{RecoveryStrategy}} does in a perpetual loop – even if it can't find a
leader).
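The decision described above can be sketched roughly as follows. This is a simplified model and not the attached patch or the real {{ZkController}} code: {{RegisterSketch}}, {{Outcome}}, the {{leaderEligible}} flag and the {{Supplier}} standing in for {{getLeader(...)}} are all hypothetical names introduced for illustration.

```java
import java.util.function.Supplier;

// Illustrative model of the register-time decision. All names here are
// hypothetical stand-ins, not the real ZkController code or the patch.
final class RegisterSketch {
    enum ReplicaType {
        NRT(true), TLOG(true), PULL(false);

        final boolean leaderEligible;

        ReplicaType(boolean leaderEligible) {
            this.leaderEligible = leaderEligible;
        }
    }

    enum Outcome { BECAME_LEADER, START_RECOVERY }

    /**
     * leaderLookup stands in for getLeader(...): it returns the leader's core
     * name, or throws if no leader shows up before the timeout.
     */
    static Outcome register(ReplicaType type, String myCoreName, Supplier<String> leaderLookup) {
        if (!type.leaderEligible) {
            // A PULL replica can never be the leader, so the lookup (and its
            // timeout) is skipped entirely; recovery keeps retrying on its own.
            return Outcome.START_RECOVERY;
        }
        // Leader-eligible replicas joined the election, so waiting is reasonable.
        String leader = leaderLookup.get(); // may throw, as before
        return leader.equals(myCoreName) ? Outcome.BECAME_LEADER : Outcome.START_RECOVERY;
    }

    public static void main(String[] args) {
        // Simulates a shard whose leader election has not finished yet.
        Supplier<String> noLeaderYet = () -> {
            throw new IllegalStateException("Could not get leader props");
        };
        // The PULL replica no longer fails registration when there is no leader.
        System.out.println(register(ReplicaType.PULL, "replica_p3", noLeaderYet));
    }
}
```

In this model the failing lookup is never consulted for a PULL replica, so a slow election cannot abort its registration; NRT and TLOG replicas keep the existing wait-and-check behavior.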
> PULL replicas can be stuck permanently in DOWN state if leader election takes
> too long
> --------------------------------------------------------------------------------------
>
> Key: SOLR-17652
> URL: https://issues.apache.org/jira/browse/SOLR-17652
> Project: Solr
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Attachments: SOLR-17652.patch
>
>
> A bug exists in {{ZkController}} that can cause PULL replicas to be
> permanently stuck in a DOWN state (such that even a core RELOAD cannot fix
> it) if that PULL replica was initially loaded during a leader election that
> takes a significant amount of time.
>
> Details to follow in comments