[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393388#comment-15393388 ]

Arshad Mohammad commented on ZOOKEEPER-2172:
--------------------------------------------

bq. The existence of .next file indicates that there was a failure in the 
middle of reconfig, and the commitandactivate message didn't arrive to the 
server on which you find this file. 

A failure is one scenario in which the .next file is not deleted, but it is not the 
only one. In this scenario the .next file exists because of a logical problem in 
the code: the commitandactivate message arrives but is not processed, for the 
reason below.
Sequence of events:
1) case Leader.NEWLEADER:
lastSeenQuorumVerifier is updated to version 100000000 and the .next file is created.
2) case Leader.PROPOSAL (reconfig):
lastSeenQuorumVerifier is updated to version 200000000 and the earlier .next file 
is overwritten.
3) case Leader.COMMITANDACTIVATE:
Because the snapshot was already taken in step 1, snapshotTaken=true and 
{{self.processReconfig();}} is not called. This call was supposed to delete the 
.next file and create the updated zoo.cfg.dynamic.200000000 file.
Code reference:
{code}
if (!snapshotTaken) {
    boolean majorChange =
        self.processReconfig(qv, ByteBuffer.wrap(qp.getData()).getLong(), qp.getZxid(), true);
}
{code}
4) case Leader.UPTODATE:
This calls self.processReconfig, but it is again skipped because the 
lastSeenQuorumVerifier version is already higher; it was updated in step 2 
(a simplified sketch of the whole sequence follows the code reference below).
{code}
public synchronized QuorumVerifier setQuorumVerifier(QuorumVerifier qv, boolean writeToDisk) {
    if ((quorumVerifier != null) && (quorumVerifier.getVersion() >= qv.getVersion())) {
        // incoming verifier is not newer than the current one, so the update is skipped
        ...
{code}
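To make the interaction concrete, here is a minimal, self-contained sketch of the 
sequence above. This is not ZooKeeper code: the class, fields, methods and version 
literals are simplified stand-ins for quorumVerifier, lastSeenQuorumVerifier, the 
snapshotTaken flag and the zoo.cfg.dynamic.next file, and the guard only models the 
skip described in steps 3 and 4.
{code}
// Simplified model of the sequence above; all names are illustrative stand-ins,
// not ZooKeeper's actual classes or methods.
public class NextFileLeakSketch {

    static long committedVersion = 0;       // stands in for quorumVerifier.getVersion()
    static long lastSeenVersion = 0;        // stands in for lastSeenQuorumVerifier.getVersion()
    static boolean nextFileExists = false;  // stands in for zoo.cfg.dynamic.next on disk
    static boolean snapshotTaken = false;

    // Models setLastSeenQuorumVerifier(): record the proposed config and write .next.
    static void setLastSeen(long version) {
        lastSeenVersion = version;
        nextFileExists = true;
    }

    // Models the guard shown above: a verifier that is not strictly newer than what
    // the peer already has is ignored, so the committed config is never written and
    // the .next file stays behind.
    static void processReconfig(long version) {
        if (committedVersion >= version || lastSeenVersion >= version) {
            return;                         // update skipped
        }
        committedVersion = version;
        nextFileExists = false;             // committed config written, .next removed
    }

    public static void main(String[] args) {
        setLastSeen(100000000L);            // 1) NEWLEADER: .next created
        snapshotTaken = true;               //    snapshot taken in step 1
        setLastSeen(200000000L);            // 2) PROPOSAL (reconfig): .next overwritten
        if (!snapshotTaken) {               // 3) COMMITANDACTIVATE: skipped, snapshot already taken
            processReconfig(200000000L);
        }
        processReconfig(200000000L);        // 4) UPTODATE: skipped by the version guard
        System.out.println(".next still on disk? " + nextFileExists);  // prints true
    }
}
{code}
With this sequence the last call is rejected by the guard, so the .next file is left 
behind even though the node never crashed.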
bq. Which server was it ? just 4 or all of them ?
just 4
bq. is actually expected if server 4 crashed before it got the 
commitandactivate message. I described this in the manual:
But the server did not crash; it is in the normal flow.
bq. Could you please provide the logs ?
Can you please try to reproduce with the above steps? That way we can reach a 
conclusion fast. Let me know; if it does not reproduce for you, I will reproduce it 
and share the logs.
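For reference, the reproduction in the description boils down to starting the three 
servers one by one and then issuing an incremental reconfig from zkCli to add the 
third server as a participant. The server id, host and ports below are illustrative, 
not the reporter's actual values:
{code}
# from zkCli.sh connected to the existing ensemble (values are illustrative)
reconfig -add server.3=127.0.0.1:2890:3890:participant;2183
{code}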

> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Assignee: Arshad Mohammad
>            Priority: Critical
>             Fix For: 3.5.3
>
>         Attachments: ZOOKEEPER-2172-02.patch, ZOOKEEPER-2172.patch, 
> history.txt, node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, 
> zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, 
> zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log, 
> zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d, 
> zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-1.out, zookeeper-2.log, 
> zookeeper-2.out, zookeeper-3.log, zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then 
> reconfig the cluster to add the new one as a participant. When I add the  
> third one, the zk cluster may enter a weird state and cannot recover.
>  
>       I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log. 
> So the first node received the reconfig cmd at 12:53:48. Later, it logged 
> “2015-04-20  12:53:52,230 [myid:1] - ERROR 
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] 
> - WARN  [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE 
>  /10.0.0.2:55890 ********”. From then on, the first node and second node 
> rejected all client connections and the third node didn’t join the cluster as 
> a participant. The whole cluster was done.
>  
>      When the problem happened, all three nodes just used the same dynamic 
> config file zoo.cfg.dynamic.10000005d which only contained the first two 
> nodes. But there was another unused dynamic config file in node-1 directory 
> zoo.cfg.dynamic.next  which already contained three nodes.
>  
>      When I extended the waiting time between starting the third node and 
> reconfiguring the cluster, the problem didn’t show again. So it should be a 
> race condition problem.


