[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-28 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858184#comment-13858184
 ] 

ASF subversion and git services commented on SOLR-5552:
---

Commit 1553978 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1553978 ]

SOLR-5552: Add CHANGES entry
SOLR-5569: Add CHANGES entry
SOLR-5568: Add CHANGES entry

> Leader recovery process can select the wrong leader if all replicas for a 
> shard are down and trying to recover as well as lose updates that should have 
> been recovered.
> ---
>
> Key: SOLR-5552
> URL: https://issues.apache.org/jira/browse/SOLR-5552
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Reporter: Timothy Potter
>Assignee: Mark Miller
>Priority: Critical
>  Labels: leader, recovery
> Fix For: 5.0, 4.7, 4.6.1
>
> Attachments: SOLR-5552.patch, SOLR-5552.patch
>
>
> This is one particular issue that leads to out-of-sync shards, related to
> SOLR-4260.
> Here's what I know so far, which admittedly isn't much:
> As cloud85 (replica before it crashed) is initializing, it enters the wait 
> process in ShardLeaderElectionContext#waitForReplicasToComeUp; this is 
> expected and a good thing.
> A short time later, cloud84 (leader before it crashed)
> begins initializing and gets to a point where it adds itself as a possible 
> leader for the shard (by creating a znode under 
> /collections/cloud/leaders_elect/shard1/election), which leads to cloud85 
> being able to return from waitForReplicasToComeUp and try to determine who 
> should be the leader.
> cloud85 then tries to run the SyncStrategy, which cannot work in this
> scenario because the Jetty HTTP listener is not active yet on either node, so
> all replication work that uses HTTP requests fails on both nodes. PeerSync
> treats these failures as an indication that the other replicas in the shard
> are unavailable and assumes success. Here's the log message:
> 2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN 
> solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 
> url=http://cloud85:8985/solr couldn't connect to 
> http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success
> The Jetty HTTP listener doesn't start accepting connections until long after 
> this process has completed and already selected the wrong leader.
> From what I can see, we have a leader recovery process that is based partly
> on HTTP requests to the other nodes, but the HTTP listener on those nodes
> isn't active yet. We need a leader recovery process that doesn't rely on HTTP
> requests. Perhaps leader recovery for a shard without a current leader needs
> to work differently from leader election in a shard that still has replicas
> that can respond to HTTP requests? Everything I'm seeing makes perfect sense
> for leader election when there are active replicas and the current leader
> fails.
> All this aside, I'm not asserting that this is the only cause for the 
> out-of-sync issues reported in this ticket, but it definitely seems like it 
> could happen in a real cluster.
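
To make the failure mode described above concrete, here is a minimal, hypothetical Java sketch - not Solr's actual PeerSync implementation - of sync logic that counts a refused connection as success. The class name, method, and request path below are illustrative only.

>>>

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

// Illustrative only: mimics sync logic that treats "peer unreachable" the same
// as "peer has nothing newer" - which is wrong while the peer's HTTP listener
// simply hasn't started accepting connections yet.
public class NaiveSync {

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    // Returns true if this node believes it is in sync with all replicas.
    public boolean syncWith(List<String> replicaUrls) {
        boolean success = true;
        for (String url : replicaUrls) {
            try {
                HttpRequest req = HttpRequest.newBuilder(URI.create(url + "/versions"))
                        .timeout(Duration.ofSeconds(5))
                        .build();
                HttpResponse<String> rsp = http.send(req, HttpResponse.BodyHandlers.ofString());
                // A real implementation would compare version lists here.
                success &= rsp.statusCode() == 200;
            } catch (IOException e) {
                // The problematic branch described above: a connection failure is
                // logged and then counted as success, so a node whose peers are
                // merely still starting up concludes it is fully in sync.
                System.err.println("couldn't connect to " + url + ", counting as success");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return success;
    }
}

<<<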






[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-28 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858177#comment-13858177
 ] 

ASF subversion and git services commented on SOLR-5552:
---

Commit 1553973 from [~markrmil...@gmail.com] in branch 
'dev/branches/lucene_solr_4_6'
[ https://svn.apache.org/r1553973 ]

SOLR-5552: Leader recovery process can select the wrong leader if all replicas 
for a shard are down and trying to recover as well as lose updates that should 
have been recovered.
SOLR-5569 A replica should not try and recover from a leader until it has 
published that it is ACTIVE.
SOLR-5568 A SolrCore cannot decide to be the leader just because the cluster 
state says no other SolrCore's are active.
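
For readers following along, here is a simplified, hypothetical sketch of the kind of guard the SOLR-5568 summary describes - a core refusing to assume leadership solely because the cluster state shows no other active cores. The types and helper names are illustrative stand-ins, not Solr's actual classes.

>>>

import java.util.Collection;

// Illustrative stand-ins; not Solr's real API.
enum ReplicaState { DOWN, RECOVERING, ACTIVE }

record ReplicaView(String coreUrl, ReplicaState state) {}

final class LeaderGuard {

    // Returns true only if it is plausibly safe for this core to try to become
    // leader: either some other replica is reported active (the normal election
    // path), or this core itself last published ACTIVE, so it was a live
    // participant rather than a core that is still starting up and only sees
    // stale "everyone is down" cluster state.
    static boolean maySeekLeadership(Collection<ReplicaView> others,
                                     ReplicaState myLastPublishedState) {
        boolean anyOtherActive = others.stream()
                .anyMatch(r -> r.state() == ReplicaState.ACTIVE);
        if (anyOtherActive) {
            return true;
        }
        return myLastPublishedState == ReplicaState.ACTIVE;
    }
}

<<<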




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-28 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858169#comment-13858169
 ] 

ASF subversion and git services commented on SOLR-5552:
---

Commit 1553970 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1553970 ]

SOLR-5552: Add CHANGES entry
SOLR-5569: Add CHANGES entry
SOLR-5568: Add CHANGES entry




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-28 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858172#comment-13858172
 ] 

ASF subversion and git services commented on SOLR-5552:
---

Commit 1553971 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1553971 ]

SOLR-5552: Add CHANGES entry
SOLR-5569: Add CHANGES entry
SOLR-5568: Add CHANGES entry




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-28 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858168#comment-13858168
 ] 

Mark Miller commented on SOLR-5552:
---

Sweet, thanks! 




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-23 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856159#comment-13856159
 ] 

Timothy Potter commented on SOLR-5552:
--

Ran my manual test process on trunk and could not reproduce the out-of-sync 
issue! From the logs, the recovery process definitely starts after the HTTP 
listener is up. Looking good on trunk.




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-23 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856062#comment-13856062
 ] 

Timothy Potter commented on SOLR-5552:
--

Glad it was helpful even though my patch was crap ;-) I'll test against trunk 
in my env as well. Thanks.




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855284#comment-13855284
 ] 

ASF subversion and git services commented on SOLR-5552:
---

Commit 1553034 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1553034 ]

SOLR-5552: Leader recovery process can select the wrong leader if all replicas 
for a shard are down and trying to recover as well as lose updates that should 
have been recovered.
SOLR-5569 A replica should not try and recover from a leader until it has 
published that it is ACTIVE. 
SOLR-5568 A SolrCore cannot decide to be the leader just because the cluster 
state says no other SolrCore's are active.




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered.

2013-12-22 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855279#comment-13855279
 ] 

ASF subversion and git services commented on SOLR-5552:
---

Commit 1553031 from [~markrmil...@gmail.com] in branch 'dev/trunk'
[ https://svn.apache.org/r1553031 ]

SOLR-5552: Leader recovery process can select the wrong leader if all replicas 
for a shard are down and trying to recover as well as lose updates that should 
have been recovered.
SOLR-5569 A replica should not try and recover from a leader until it has 
published that it is ACTIVE. 
SOLR-5568 A SolrCore cannot decide to be the leader just because the cluster 
state says no other SolrCore's are active.




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover

2013-12-22 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855266#comment-13855266
 ] 

Mark Miller commented on SOLR-5552:
---

Fantastic investigation and report, Mr. Potter - extremely helpful.




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover

2013-12-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849290#comment-13849290
 ] 

Mark Miller commented on SOLR-5552:
---

It might be a bit of a rabbit trail, but one I think will be well worth 
following. The ZooKeeper expiration path is not as well tested, and anything we 
find there is likely to lead to further bug fixes around it, I hope.




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover

2013-12-16 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849245#comment-13849245
 ] 

Timothy Potter commented on SOLR-5552:
--

Thanks for the feedback. I was originally thinking that would be the better way 
to go, but didn't know how many rabbit trails it would lead down. I'll get 
working on another patch using this approach.





[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover

2013-12-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848693#comment-13848693
 ] 

Mark Miller commented on SOLR-5552:
---

I think what we want to do here is look at having the core actually accept HTTP 
requests before it registers and enters leader election - any issues we find 
there should be issues anyway, as we already hit this case on ZooKeeper 
expiration and recovery.
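
As a rough illustration of that ordering - assuming a Jetty 9.x-style embedded setup, and with registerCoresWithZooKeeper() as a purely hypothetical placeholder, not a Solr API - the idea is simply to have the HTTP connector accepting connections before any ZooKeeper registration or leader election begins:

>>>

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;

public class StartupOrder {
    public static void main(String[] args) throws Exception {
        Server jetty = new Server(8983);
        jetty.setHandler(new ServletContextHandler());

        // 1. Bring the HTTP listener up first so peers can actually reach us.
        jetty.start();

        // 2. Only now join ZooKeeper: publish core state and enter the
        //    election (ShardLeaderElectionContext in Solr's case).
        registerCoresWithZooKeeper();

        jetty.join();
    }

    // Hypothetical placeholder - not a Solr API.
    private static void registerCoresWithZooKeeper() {
        // In Solr this is where cores would register and start leader election,
        // now guaranteed to happen after HTTP requests can be served.
    }
}

<<<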




[jira] [Commented] (SOLR-5552) Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover

2013-12-13 Thread Timothy Potter (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847614#comment-13847614
 ] 

Timothy Potter commented on SOLR-5552:
--

Here's a first cut at a solution, sans unit tests, which relies on a new Slice 
property - last_known_leader_core_url. However, I'm open to other suggestions if 
someone sees a cleaner way to solve this.

During the leader recovery process outlined in this ticket's description, the 
ShardLeaderElectionContext can use this property as a hint for replicas to defer 
to the previously known leader if it is one of the replicas trying to recover. 
Specifically, the patch only applies if all replicas are "down" and the 
previously known leader is on a "live" node and is one of the replicas trying to 
recover. This may be too restrictive, but it covers this issue nicely and 
minimizes the chance of regression for other leader election / recovery cases.

Here are some log messages from the replica as it exits the 
waitForReplicasToComeUp process that show this patch working:

>>>

2013-12-13 08:51:26,992 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - Enough replicas found to continue.
2013-12-13 08:51:26,992 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - Last known leader is 
http://cloud84:8984/solr/cloud_shard1_replica1/ and I am 
http://cloud85:8985/solr/cloud_shard1_replica2/
2013-12-13 08:51:26,992 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - Found previous? true and numDown is 2
2013-12-13 08:51:26,992 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - All 2 replicas are down. Choosing to 
let last known leader http://cloud84:8984/solr/cloud_shard1_replica1/ try first 
...
2013-12-13 08:51:26,992 [coreLoadExecutor-3-thread-1] INFO  
solr.cloud.ShardLeaderElectionContext  - There may be a better leader candidate 
than us - going back into recovery

<<<
The end result was that my shard recovered correctly and the data remained 
consistent between the leader and the replica. I've also tried this with 3 
replicas in a Slice, and with the last known leader never coming back; both 
cases work as they did previously.

Lastly, I'm not entirely certain I like how the property gets set in the Slice 
constructor. It may be better to set this property in the Overseer, or even to 
store last_known_leader_core_url in a separate znode, such as 
/collections//last_known_leader/shardN. I do see comments in places about 
keeping the leader property on the Slice vs. in the leader Replica, so maybe 
that figures into this as well?
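
To summarize the decision the patch makes, here is a rough, hypothetical sketch rather than the actual patch code; ReplicaInfo and the helper method are illustrative stand-ins for what ShardLeaderElectionContext would see:

>>>

import java.util.List;

// Minimal stand-in for the per-replica view the election context would have.
record ReplicaInfo(String coreUrl, String state, boolean onLiveNode) {}

final class LastKnownLeaderHint {

    // Defer to the previously known leader only when (a) every replica in the
    // shard is "down", (b) the last known leader is back on a live node and is
    // one of the recovering replicas, and (c) we are not that leader ourselves.
    static boolean shouldDefer(List<ReplicaInfo> replicas,
                               String lastKnownLeaderCoreUrl,
                               String myCoreUrl) {
        boolean allDown = replicas.stream()
                .allMatch(r -> "down".equals(r.state()));
        boolean lastLeaderBackAndLive = replicas.stream()
                .anyMatch(r -> r.coreUrl().equals(lastKnownLeaderCoreUrl) && r.onLiveNode());
        return allDown
                && lastLeaderBackAndLive
                && !myCoreUrl.equals(lastKnownLeaderCoreUrl);
    }
}

<<<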
