[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634903#comment-14634903 ]

Hitoshi Mitake commented on ZOOKEEPER-2172:
-------------------------------------------

Hi [~ziyouw],

It seems that I was able to reproduce this problem: simply adding new servers 
with reconfig one by one makes the ensemble reject every client request.
(Of course, there is a possibility that I am misunderstanding something.)
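
For reference, here is a minimal sketch of the sequence I used, assuming the 
3.5.0 client API (where reconfig is still a method on 
org.apache.zookeeper.ZooKeeper); the hosts, ports, and session timeout are 
placeholders from my setup:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ReconfigRepro {
        public static void main(String[] args) throws Exception {
            // Connect to the initial single-server ensemble.
            ZooKeeper zk = new ZooKeeper("10.0.0.1:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* unused */ }
            });

            // After the server 2 process has started, add it as a participant.
            zk.reconfig("server.2=10.0.0.2:2888:3888:participant;2181",
                        null, null, -1, new Stat());

            // After the server 3 process has started, do the same. In my runs
            // the failure window lies between that startup and this reconfig.
            zk.reconfig("server.3=10.0.0.3:2888:3888:participant;2181",
                        null, null, -1, new Stat());

            zk.close();
        }
    }

The server specification strings passed to reconfig use the same line format 
that appears in the zoo.cfg.dynamic.* files.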

I used our distributed-systems debugger, 
[earthquake|https://github.com/osrg/earthquake]. It uses byteman to inspect 
the execution of the debuggee (the zookeeper server in this case), and it 
tries to cause corner-case situations that are hard to produce in ordinary 
testing by reordering the inspected method calls and returns.
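
To give a concrete idea of the mechanism, an inspection hook looks roughly 
like the byteman rule below. The target class and method are just illustrative 
choices on my part, and earthquake's real hooks report each event to an 
external orchestrator and block until it decides an ordering, rather than 
merely tracing:

    RULE trace QuorumPeer run entry
    CLASS org.apache.zookeeper.server.quorum.QuorumPeer
    METHOD run
    AT ENTRY
    IF TRUE
    DO traceln("QuorumPeer.run entered on " + Thread.currentThread().getName())
    ENDRULE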

We are preparing a docker image so the problem can easily be reproduced in 
your environment. Please wait a while.

I'm analyzing the problem and would like to post the root cause and a patch, 
but it may take some time because I'm new to zookeeper. In the meantime I have 
attached the logs (zookeeper-123.out) and the history of the ensemble 
(history.txt). The logs seem to be similar to yours.
The history is in an earthquake-specific format, so it won't be easy to read, 
but I think you can roughly interpret the event sequence (it is just a 
sequence of method calls and returns plus their stack traces). It would be 
great to hear your comments.

> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Priority: Critical
>         Attachments: history.txt, node-1.log, node-2.log, node-3.log, 
> zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, 
> zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, 
> zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d, 
> zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-1.out, zookeeper-2.log, 
> zookeeper-2.out, zookeeper-3.log, zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then 
> reconfig the cluster to add each new one as a participant. When I add the 
> third one, the zk cluster may enter a weird state from which it cannot 
> recover.
>  
>       I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 
> log. So the first node received the reconfig cmd at 12:53:48. Later, it logged 
> “2015-04-20  12:53:52,230 [myid:1] - ERROR 
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] 
> - WARN  [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE 
>  /10.0.0.2:55890 ********”. From then on, the first node and second node 
> rejected all client connections and the third node didn’t join the cluster as 
> a participant. The whole cluster was down.
>  
>      When the problem happened, all three nodes were still using the same 
> dynamic config file zoo.cfg.dynamic.10000005d, which contained only the 
> first two nodes. But there was another, unused dynamic config file in the 
> node-1 directory, zoo.cfg.dynamic.next, which already contained all three 
> nodes.
>  
>      When I extended the waiting time between starting the third node and 
> reconfiguring the cluster, the problem didn’t show up again. So it should be 
> a race condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
