[jira] [Commented] (SOLR-12011) Consistence problem when in-sync replicas are DOWN

Cao Manh Dat (JIRA) Thu, 01 Mar 2018 22:49:14 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-12011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383261#comment-16383261
 ]


Cao Manh Dat commented on SOLR-12011:
-------------------------------------

{quote}So I traced the code and a live fetch on the leader works fine today, but 
only as a side effect. We set the term equal to max for a recovering replica (using 
the ZkShardTerms.startRecovering() method) in ZkController.publish *before* we 
publish the replica state to the overseer queue. So if the leader (during prep 
recovery) sees the replica state as recovering, then ZooKeeper also guarantees that 
it will see the max term published before the recovering state was published. 
I think we should make this behavior clear via a code comment.
{quote}
Yeah, I will add a code comment to clarify this.
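
The ordering described in the quote can be illustrated with a toy model. This is a single-threaded sketch, not Solr code: the {{store}} map, the key names, and the helpers {{publishRecovering}} and {{leaderSeesMaxTerm}} are all hypothetical. It only shows that the term write happens before the state write, so a reader that observes the recovering state can also find the raised term.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PublishOrderSketch {
    // Hypothetical stand-in for ZooKeeper: keeps writes in insertion order.
    static final Map<String, String> store = new LinkedHashMap<>();

    // Mirrors the order described above: raise the term first, then publish the state.
    static void publishRecovering(String replica, long maxTerm) {
        store.put("term/" + replica, Long.toString(maxTerm));
        store.put("state/" + replica, "recovering");
    }

    // Hypothetical leader-side check during prep recovery: once the
    // recovering state is visible, the max term must be visible too.
    static boolean leaderSeesMaxTerm(String replica, long maxTerm) {
        if (!"recovering".equals(store.get("state/" + replica))) {
            return false;
        }
        return Long.toString(maxTerm).equals(store.get("term/" + replica));
    }

    public static void main(String[] args) {
        publishRecovering("core_node2", 1L);
        System.out.println(leaderSeesMaxTerm("core_node2", 1L));
    }
}
```

The real guarantee depends on ZooKeeper's ordering of writes, which this in-memory map does not model; the sketch only makes the intended write order explicit.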
{quote}The changes in SolrCmdDistributor fix a different bug, no? Describe 
the problem here in this issue and how it is solved. Otherwise extract it to 
its own ticket.
{quote}
Hmm, you're right. Maybe the changes in SolrCmdDistributor should go into 
SOLR-11702.
{quote}The latest patch added changes for RestoreCoreOp and SplitOp, where new 
data is added to an empty core
{quote}
The destination for {{RestoreCoreOp}} and {{SplitOp}} should be a slice with 
no more than 1 replica, and that is how some Collections API commands use these 
admin APIs. If not, how can other replicas in the same shard acknowledge the 
changes and put themselves into recovery? 

> Consistence problem when in-sync replicas are DOWN
> --------------------------------------------------
>
>                 Key: SOLR-12011
>                 URL: https://issues.apache.org/jira/browse/SOLR-12011
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12011.patch, SOLR-12011.patch, SOLR-12011.patch, 
> SOLR-12011.patch
>
>
> Currently, we run into a consistency problem when in-sync replicas are DOWN. 
> For example:
>  1. A collection has 1 shard with 1 leader and 2 replicas
>  2. The nodes containing the 2 replicas go down
>  3. The leader receives an update A, which succeeds
>  4. The node containing the leader goes down
>  5. The 2 replicas come back
>  6. One of them becomes leader --> But neither should become leader, since 
> they missed update A
> A solution to this issue:
>  * The idea here is that the term value of each replica (SOLR-11702) is 
> enough to tell whether a replica received the latest updates. Therefore 
> only replicas with the highest term can become the leader.
>  * There are a couple of things that need to be done for this issue:
>  ** When the leader receives its first update, its term should change from 0 
> -> 1, so replicas added later to the same shard won't be able to become 
> leader (their term = 0) until they finish recovery
>  ** For DOWN replicas, the leader also needs to check (in DUP.finish()) 
> that those replicas have a term lower than the leader's before returning 
> results to users
>  ** The term value of a replica alone is not enough to tell us whether 
> that replica is in sync with the leader, because the replica might not have 
> finished the recovery process. We need to introduce another flag (stored on 
> the shard term node in ZK) to tell us whether the replica has finished 
> recovery. It will look like this:
>  *** {"core_node1" : 1, "core_node2" : 0} (when core_node2 starts recovery) 
> --->
>  *** {"core_node1" : 1, "core_node2" : 1, "core_node2_recovering" : 1} 
> (when core_node2 finishes recovery) --->
>  *** {"core_node1" : 1, "core_node2" : 1}
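
The term map and recovering flag described in the issue can be sketched in a few lines of Java. This is a toy in-memory model under stated assumptions, not the actual ZkShardTerms implementation: the class {{ShardTermsSketch}} and all of its methods are hypothetical, and a real implementation would store the map on the shard term node in ZK with versioned updates.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardTermsSketch {
    // Maps core node name (or "<name>_recovering") to its term value.
    private final Map<String, Long> terms = new HashMap<>();

    // New replicas start at term 0.
    void register(String coreNode) {
        terms.putIfAbsent(coreNode, 0L);
    }

    void setTerm(String coreNode, long term) {
        terms.put(coreNode, term);
    }

    long maxTerm() {
        return terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    // Raise the replica to the max term, but mark it as still recovering.
    void startRecovering(String coreNode) {
        long max = maxTerm();
        terms.put(coreNode, max);
        terms.put(coreNode + "_recovering", max);
    }

    // Clear the flag once recovery has finished.
    void doneRecovering(String coreNode) {
        terms.remove(coreNode + "_recovering");
    }

    // Eligible only with the highest term and no recovering flag.
    boolean canBecomeLeader(String coreNode) {
        Long term = terms.get(coreNode);
        return term != null
            && term == maxTerm()
            && !terms.containsKey(coreNode + "_recovering");
    }

    public static void main(String[] args) {
        ShardTermsSketch shard = new ShardTermsSketch();
        shard.register("core_node1");
        shard.register("core_node2");
        shard.setTerm("core_node1", 1L);                           // leader got its first update
        System.out.println(shard.canBecomeLeader("core_node2"));   // false: term 0 < 1
        shard.startRecovering("core_node2");
        System.out.println(shard.canBecomeLeader("core_node2"));   // false: still recovering
        shard.doneRecovering("core_node2");
        System.out.println(shard.canBecomeLeader("core_node2"));   // true
    }
}
```

The three calls in {{main}} walk through the same three states listed in the issue description: term behind the leader, term raised with the recovering flag set, and flag cleared.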



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
