[ 
https://issues.apache.org/jira/browse/SOLR-17652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-17652:
--------------------------------------
    Attachment: SOLR-17652.patch
        Status: Open  (was: Open)

Every now and then over the past few years, I've heard vague rumors of weird 
situations involving TLOG+PULL replica based collections after "catastrophic" 
failures in a SolrCloud cluster that impact the shard leader 
(network partitions, disk errors on leader, multi-node failures, etc...).

Typically, once the underlying problem is dealt with, and the leader election 
eventually "succeeds" (either on its own after nodes get restarted, or 
sometimes with a manual FORCELEADER) all of the replicas should eventually get 
healthy again ... BUT! ... in some of these vague rumored situations, I'm told 
there are occasionally PULL replicas that never recover no matter how much time 
passes – which never really made sense to me, since I know the recovery code 
operates in an exponential back-off loop, and because the anecdotal reports 
only ever described PULL replicas being affected – not TLOG replicas – even 
though PULL replicas should never really care which TLOG replica is the leader 
(they recheck the current leader on every PULL attempt).
----
Today I was doing some experiments with PULL replicas, and reading through some 
of the core initialization code, and I think I figured out where/how this kind 
of bug can occur:

1. A node hosting a PULL replica gets restarted (as part of whatever broader 
catastrophic problem impacts the cluster)
2. The PULL replica comes online while there is no elected leader
3. The leader election takes longer than ~13 min (due to whatever catastrophic 
problem is still being dealt with, and/or slow TLOG replay on the leader)

...in this case, the PULL replica will be left in a "DOWN" state and will never 
attempt to recover.

The cause of this problem is code in {{ZkController.register(...)}} which 
expects that it can do the following in sequence:
 * {{joinElection(...)}} (if the replica type is eligible)
 * call {{getLeader(cloudDesc, leaderVoteWait + 600000)}} to see if the current 
replica is the leader
 ** If this call throws a SolrException, {{ZkController.register(...)}} 
propagates the exception w/o any sort of retry.
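
To make the "~13 min" figure concrete, here is a minimal sketch of the timeout arithmetic behind that {{getLeader(...)}} call. The 600000 ms margin comes from the call quoted above; the 180000 ms value for {{leaderVoteWait}} is an assumption chosen because it is consistent with the ~13 min total, and the class and method names are hypothetical, not Solr code.

```java
import java.time.Duration;

public class RegisterTimeoutSketch {
    // Assumption: leaderVoteWait of 180000 ms (3 min); the report only states the ~13 min total.
    static final long ASSUMED_LEADER_VOTE_WAIT_MS = 180_000L;

    // From the report: extra margin added in getLeader(cloudDesc, leaderVoteWait + 600000).
    static final long EXTRA_WAIT_MS = 600_000L;

    /** Total time register(...) would wait for a leader before the exception propagates. */
    static Duration getLeaderTimeout(long leaderVoteWaitMs) {
        return Duration.ofMillis(leaderVoteWaitMs + EXTRA_WAIT_MS);
    }

    public static void main(String[] args) {
        Duration timeout = getLeaderTimeout(ASSUMED_LEADER_VOTE_WAIT_MS);
        System.out.println("getLeader timeout: " + timeout.toMillis() + " ms ("
                + timeout.toMinutes() + " min)");
    }
}
```

With a different {{leaderVoteWait}} setting the window shifts accordingly, but it stays a fixed deadline with nothing retrying after it expires.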

For NRT or TLOG replicas, this logic is fine: We've triggered an election and 
even if no other replicas come along and join the election, eventually we'll 
recognize that we're the leader and put our props into the leader znode so we 
can read them back. But in the case of a PULL replica, there is no guarantee 
if/when another replica comes along to become the leader – so if that 
{{getLeader(...)}} times out, we get a log message that looks like this...
{noformat}
2025-02-04 18:38:50.124 ERROR 
(coreZkRegister-1-thread-3-processing-localhost:8983_solr 
techproducts_shard1_replica_p3 techproducts shard1 core_node4) [c:techproducts 
s:shard1 r:core_node4 x:techproducts_shard1_replica_p3 t:] o.a.s.c.ZkController 
Error getting leader from zk => org.apache.solr.common.SolrException: 
Could not get leader props
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1554)
org.apache.solr.common.SolrException: Could not get leader props
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1554) ~[?:?]
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1514) ~[?:?]
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1456) 
~[?:?]
        at org.apache.solr.cloud.ZkController.register(ZkController.java:1318) 
~[?:?]
        at org.apache.solr.cloud.ZkController.register(ZkController.java:1229) 
~[?:?]
        at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:208) 
~[?:?]
        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:380)
 ~[?:?]
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 ~[?:?]
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/techproducts/leaders/shard1/leader
        at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:117) ~[?:?]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:53) 
~[?:?]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972) ~[?:?]
        at 
org.apache.solr.common.cloud.SolrZkClient.lambda$getData$6(SolrZkClient.java:452)
 ~[?:?]
        at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:70)
 ~[?:?]
        at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:452) ~[?:?]
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1534) ~[?:?]
        ... 9 more
2025-02-04 18:38:50.125 ERROR 
(coreZkRegister-1-thread-3-processing-localhost:8983_solr 
techproducts_shard1_replica_p3 techproducts shard1 core_node4) [c:techproducts 
s:shard1 r:core_node4 x:techproducts_shard1_replica_p3 t:] o.a.s.c.ZkContainer 
Exception registering core techproducts_shard1_replica_p3 => 
org.apache.solr.common.SolrException: Error getting leader from zk for shard 
shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1499)
org.apache.solr.common.SolrException: Error getting leader from zk for shard 
shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1499) 
~[?:?]
        at org.apache.solr.cloud.ZkController.register(ZkController.java:1318) 
~[?:?]
        at org.apache.solr.cloud.ZkController.register(ZkController.java:1229) 
~[?:?]
        at 
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:208) 
~[?:?]
        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:380)
 ~[?:?]
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 ~[?:?]
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.solr.common.SolrException: Could not get leader props
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1554) ~[?:?]
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1514) ~[?:?]
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1456) 
~[?:?]
        ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/techproducts/leaders/shard1/leader
        at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:117) ~[?:?]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:53) 
~[?:?]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972) ~[?:?]
        at 
org.apache.solr.common.cloud.SolrZkClient.lambda$getData$6(SolrZkClient.java:452)
 ~[?:?]
        at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:70)
 ~[?:?]
        at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:452) ~[?:?]
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1534) ~[?:?]
        at 
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1514) ~[?:?]
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1456) 
~[?:?]
        ... 7 more
{noformat}
...and we get stuck in limbo where we are DOWN and stay down and never recover 
even if/when another replica *does* become a leader later.

*EVEN IF YOU RELOAD THE CORE, THE REPLICA STAYS IN A DOWN STATE!*
----
The fix (see attached patch) is fairly simple: Since the only reason 
{{ZkController.register(...)}} calls {{getLeader(...)}} is to see if the 
current replica _might_ be the leader, skip that call when we know from the 
replica type that we _can't_ be the leader, and go straight to doing recovery 
(which {{RecoveryStrategy}} does in a perpetual loop – even if it can't find a 
leader).
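
The idea can be illustrated with a small self-contained model. This is only a sketch of the control flow described above, not the attached patch: the {{LeaderLookup}} interface, the outcome strings, and the method names are all hypothetical stand-ins, and the real leader-eligible path does more than is shown here.

```java
import java.util.concurrent.TimeoutException;

public class RegisterFixSketch {
    /** Replica types from the report; only NRT and TLOG can become leader. */
    enum ReplicaType {
        NRT(true), TLOG(true), PULL(false);

        final boolean leaderEligible;

        ReplicaType(boolean leaderEligible) {
            this.leaderEligible = leaderEligible;
        }
    }

    /** Hypothetical stand-in for the blocking leader lookup in ZooKeeper. */
    interface LeaderLookup {
        void awaitLeader() throws TimeoutException;
    }

    /** Models the reported behaviour: every replica type waits for a leader. */
    static String registerBefore(ReplicaType type, LeaderLookup zk) {
        try {
            zk.awaitLeader();
        } catch (TimeoutException e) {
            // Models the exception propagating with no retry.
            return "STUCK_DOWN";
        }
        return "START_RECOVERY";
    }

    /** Models the described fix: skip the lookup when we can never be the leader. */
    static String registerAfter(ReplicaType type, LeaderLookup zk) {
        if (type.leaderEligible) {
            try {
                zk.awaitLeader();
            } catch (TimeoutException e) {
                return "STUCK_DOWN";
            }
        }
        // Recovery is assumed to keep looking for a leader on its own.
        return "START_RECOVERY";
    }

    public static void main(String[] args) {
        // Simulates an election that outlasts the lookup timeout.
        LeaderLookup noLeaderYet = () -> {
            throw new TimeoutException("no leader elected in time");
        };
        System.out.println("before: " + registerBefore(ReplicaType.PULL, noLeaderYet));
        System.out.println("after:  " + registerAfter(ReplicaType.PULL, noLeaderYet));
    }
}
```

In this model a PULL replica no longer depends on the election finishing inside the timeout window, while the leader-eligible types keep their existing path.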

> PULL replicas can be stuck permanently in DOWN state if leader election takes 
> too long
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-17652
>                 URL: https://issues.apache.org/jira/browse/SOLR-17652
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-17652.patch
>
>
> A bug exists in {{ZkController}} that can cause PULL replicas to be 
> permanently stuck in a DOWN state (such that even a core RELOAD cannot fix 
> it) if that PULL replica was initially loaded during a leader election that 
> takes a significant amount of time.
>  
> Details to follow in comments


