[ 
https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727549#comment-14727549
 ] 

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------


bq.  1.  nit - RecoverShardTest has an unused notLeader1 variable
Thanks. I refactored the test, and the unused variable is gone now.

bq.   2.    Shouldn't the "Wait for a long time for a steady state" piece of 
code be before the proxies for the two replicas are reopened? The LIR state 
will surely be set at indexing time and only if the proxy is closed. Also if 
you move that wait before the proxy is reopened then you are sure to have the 
LIR state as 'down'.
This makes sense; I've made the change.

bq.   3.    The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' 
should be done after force refreshing the cluster state of the cloudClient 
otherwise spurious failures can happen

I didn't know about this force update of the cluster state; I've now added it.

bq.  4.    nit - Why is sendDoc overridden in RecoverShardTest? The minRf is 
same, just the max retries has been increased and wait between retries has been 
decreased
The tests were (and still are) taking too long, so reducing the wait from 
30 seconds to 1 second helped.
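
As a rough illustration of that trade-off (more retries, shorter wait between 
them), here is a minimal retry helper. It is only a sketch: RetrySketch and 
retry() are hypothetical names and are not the sendDoc code from the patch.

```java
import java.util.concurrent.Callable;

final class RetrySketch {

    /**
     * Hypothetical helper: runs the action once, then retries it up to
     * maxRetries more times, sleeping waitMillis after each failed attempt.
     * Rethrows the last failure if every attempt fails.
     */
    static <T> T retry(int maxRetries, long waitMillis, Callable<T> action) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                // Only sleep if another attempt will follow.
                if (attempt < maxRetries) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }
}
```

With a 1 second wait, a request that succeeds on its third attempt costs about 
2 seconds of waiting instead of 60, which is where the test time is saved.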

bq. 5.    The OCMH.recoverShard() isn't unsetting the leader properly. It 
should be as simple as:
Thanks, I've cleaned this up.

bq.  6.    Can you please write a test to ensure that this API works with 
'async' parameter?
TODO.

bq.    Leader is live but 'down' -> mark it 'active'
This works now. Added testLeaderDown() method.

bq.    Leader itself is in LIR -> delete the LIR node
This should work, since the API method first clears the LIR state. I haven't 
added a test for this, because I couldn't simulate this state in a test.

bq.    Leader is not live:       Replicas are live but 'down' or 'recovering' 
-> mark them 'active'
This works now. Added testAllReplicasDownNoLeader() method.

bq.    Leader is not live:       Replicas are live but in LIR -> delete the LIR 
nodes
This works as in the last patch. The corresponding test is now 
testReplicasInLIRNoLeader().
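
The four scenarios above can be summarised as a small decision routine. This 
is a simplified sketch for illustration only: RecoveryPlanSketch, ReplicaInfo 
and the action strings are hypothetical, and nothing here is the actual 
OCMH.recoverShard() code or a Solr API.

```java
import java.util.ArrayList;
import java.util.List;

final class RecoveryPlanSketch {

    /** Hypothetical snapshot of one replica; not a Solr class. */
    static final class ReplicaInfo {
        final String name;
        final boolean leader;
        final boolean live;
        final String state;   // "active", "down" or "recovering"
        final boolean inLir;  // true if a LIR node exists for this replica

        ReplicaInfo(String name, boolean leader, boolean live, String state, boolean inLir) {
            this.name = name;
            this.leader = leader;
            this.live = live;
            this.state = state;
            this.inLir = inLir;
        }
    }

    /** Returns the planned actions as plain strings, LIR cleanup first. */
    static List<String> plan(List<ReplicaInfo> replicas) {
        ReplicaInfo liveLeader = null;
        for (ReplicaInfo r : replicas) {
            if (r.leader && r.live) {
                liveLeader = r;
            }
        }
        List<String> actions = new ArrayList<>();
        if (liveLeader != null) {
            // Leader is live: clear its LIR node, then mark it active if 'down'.
            if (liveLeader.inLir) {
                actions.add("delete-lir:" + liveLeader.name);
            }
            if ("down".equals(liveLeader.state)) {
                actions.add("mark-active:" + liveLeader.name);
            }
            return actions;
        }
        // Leader is not live: apply the same steps to every live replica.
        for (ReplicaInfo r : replicas) {
            if (!r.live) {
                continue;
            }
            if (r.inLir) {
                actions.add("delete-lir:" + r.name);
            }
            if ("down".equals(r.state) || "recovering".equals(r.state)) {
                actions.add("mark-active:" + r.name);
            }
        }
        return actions;
    }
}
```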

bq. Did you find out why/how that happened? If this is reproducible, can you 
please create an issue and post the test there?
I've filed SOLR-7989 for this and will look deeper soon.

> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
>                 Key: SOLR-7569
>                 URL: https://issues.apache.org/jira/browse/SOLR-7569
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-high
>         Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, 
> SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard, e.g. all 
> replicas' last published state was recovery, or bugs cause a leader to be 
> marked as 'down'. While the best solution is that shards never get into this 
> state, we need a manual way to fix it when they do. Right now we can do a 
> dance that involves bouncing the node (since the recovery paths for bouncing 
> and REQUESTRECOVERY are different), but that is difficult when running a 
> large cluster. Although such a manual API may lead to some data loss, in some 
> cases it is the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force 
> replicas into recovering a leader while avoiding data loss on a best effort 
> basis.


