[
https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593156#comment-14593156
]
Akihiro Suda commented on ZOOKEEPER-2172:
-----------------------------------------
Hi,
It is interesting that both servers 1 and 2 time out at 15:55:08,439, after the
reconfig has begun at 15:55:04.
After these timeouts, neither {{ZooKeeperServer}} can be revived and the
ensemble gets into a weird state.
(However, in zoo-3-2.log (Jun 3), server 2 raises {{EOFException}} rather than
{{SocketTimeoutException}}, at 17:15:31.)
These timeouts are raised by [this while
loop|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L515]
in server 1 and [this while
loop|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L89]
in server 2.
Unfortunately, we are not sure which types of {{QuorumPacket}} are triggering
these timeouts.
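(For context on the mechanism: the while loops above block in a packet read
whose deadline comes from the socket read timeout, roughly tickTime * syncLimit,
so when no {{QuorumPacket}} arrives in time the read itself throws. Below is a
minimal, self-contained sketch of that behaviour, not ZooKeeper code; all names
in it are made up for the illustration.)
{code:java}
import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// A peer that accepts the connection but never writes anything, standing in
// for a leader/learner that stops sending packets, and a reader whose socket
// timeout fires while it waits -- the same exception type seen in the logs.
public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket silentPeer = new ServerSocket(0);
             Socket sock = new Socket("127.0.0.1", silentPeer.getLocalPort())) {
            sock.setSoTimeout(2000); // stand-in for the quorum read deadline
            try (DataInputStream in = new DataInputStream(sock.getInputStream())) {
                in.readInt(); // blocks waiting for data that never comes
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out: " + e);
            }
        }
    }
}
{code}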
So I think it might be helpful to add {{LOG.debug(qp.getType())}} at [this
switch|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L532]
for server 1 and [this
switch|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L114]
for server 2.
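Something along these lines, shown here against the {{Follower}} loop linked
above (a rough sketch that logs just before the dispatch into the switch; the
exact placement is up for discussion):
{code:java}
// Sketch: log every packet type the follower receives before dispatching it.
QuorumPacket qp = new QuorumPacket();
while (this.isRunning()) {
    readPacket(qp);
    if (LOG.isDebugEnabled()) {
        LOG.debug("Received QuorumPacket of type {}", qp.getType());
    }
    processPacket(qp);
}
{code}
The same one-liner would go into the {{LearnerHandler}} loop for server 1.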
Perhaps they are not pinging each other?
[This|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L924-925]
comment in {{LearnerHandler.ping()}} seems interesting.
{panel}
// If learner hasn't sync properly yet, don't send ping packet
// otherwise, the learner will crash
{panel}
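If I remember the surrounding code correctly, that comment guards an early
return, roughly like the sketch below (the flag name is my guess, not the
actual field). If that early return keeps firing for a learner that never
finishes syncing, the learner never receives a PING, and its blocking read
eventually hits the timeout discussed above.
{code:java}
// Sketch only: the shape of the guard implied by the quoted comment.
void ping() {
    if (!learnerHasSyncedYet) {   // hypothetical flag name
        // skip the ping, as the comment says
        return;
    }
    // otherwise queue a PING QuorumPacket for this learner, e.g.
    // queuePacket(new QuorumPacket(Leader.PING, lastSeenZxid, null, null));
}
{code}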
> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-2172
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.0
> Environment: Ubuntu 12.04 + java 7
> Reporter: Ziyou Wang
> Priority: Critical
> Attachments: node-1.log, node-2.log, node-3.log, zoo-1.log,
> zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log,
> zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log,
> zoo-3.log, zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next, zookeeper-1.log,
> zookeeper-2.log, zookeeper-3.log
>
>
> The operations are quite simple: start three zk servers one by one, then
> reconfig the cluster to add the new one as a participant. When I add the
> third one, the zk cluster may enter a weird state and cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log.
> So the first node received the reconfig command at 12:53:48. Later, it logged
> “2015-04-20 12:53:52,230 [myid:1] - ERROR
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1]
> - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE
> /10.0.0.2:55890 ********”. From then on, the first and second nodes
> rejected all client connections, and the third node never joined the cluster as
> a participant. The whole cluster was down.
>
> When the problem happened, all three nodes were still using the same dynamic
> config file, zoo.cfg.dynamic.10000005d, which only contained the first two
> nodes. But there was another, unused dynamic config file in node-1's directory,
> zoo.cfg.dynamic.next, which already contained all three nodes.
>
> When I extended the waiting time between starting the third node and
> reconfiguring the cluster, the problem did not show up again, so this looks
> like a race condition.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)