[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381347#comment-16381347 ]
Jiafu Jiang commented on ZOOKEEPER-2930: ---------------------------------------- I hope this PB can be fix in version 3.4.X, since 3.4.X is the stable version. > Leader cannot be elected due to network timeout of some members. > ---------------------------------------------------------------- > > Key: ZOOKEEPER-2930 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server > Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.5.4, 3.4.12 > Environment: Java 8 > ZooKeeper 3.4.11(from github) > Centos6.5 > Reporter: Jiafu Jiang > Priority: Critical > Attachments: zoo.cfg, zookeeper1.log, zookeeper2.log > > > I deploy a cluster of ZooKeeper with three nodes: > ofs_zk1:20.10.11.101, 30.10.11.101 > ofs_zk2:20.10.11.102, 30.10.11.102 > ofs_zk3:20.10.11.103, 30.10.11.103 > I shutdown the network interfaces of ofs_zk2 using "ifdown eth0 eth1" command. > It is supposed that the new Leader should be elected in some seconds, but the > fact is, ofs_zk1 and ofs_zk3 just keep electing again and again, but none of > them can become the new Leader. > I change the log level to DEBUG (the default is INFO), and restart zookeeper > servers on ofs_zk1 and ofs_zk2 again, but it can not fix the problem. > I read the log and the ZooKeeper source code, and I think I find the reason. > When the potential leader(says ofs_zk3) begins the > election(FastLeaderElection.lookForLeader()), it will send notifications to > all the servers. > When it fails to receive any notification during a timeout, it will resend > the notifications, and double the timeout. This process will repeat until any > notification is received or the timeout reaches a max value. > The FastLeaderElection.sendNotifications() just put the notification message > into a queue and return. The WorkerSender is responsable to send the > notifications. > The WorkerSender just process the notifications one by one by passing the > notifications to QuorumCnxManager. Here comes the problem, the > QuorumCnxManager.toSend() blocks for a long time when the notification is > send to ofs_zk2(whose network is down) and some notifications (which belongs > to ofs_zk1) will thus be blocked for a long time. The repeated notifications > by FastLeaderElection.sendNotifications() just make things worse. > Here is the related source code: > {code:java} > public void toSend(Long sid, ByteBuffer b) { > /* > * If sending message to myself, then simply enqueue it (loopback). > */ > if (this.mySid == sid) { > b.position(0); > addToRecvQueue(new Message(b.duplicate(), sid)); > /* > * Otherwise send to the corresponding thread to send. > */ > } else { > /* > * Start a new connection if doesn't have one already. > */ > ArrayBlockingQueue<ByteBuffer> bq = new > ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY); > ArrayBlockingQueue<ByteBuffer> bqExisting = > queueSendMap.putIfAbsent(sid, bq); > if (bqExisting != null) { > addToSendQueue(bqExisting, b); > } else { > addToSendQueue(bq, b); > } > > // This may block!!! > connectOne(sid); > > } > } > {code} > Therefore, when ofs_zk3 believes that it is the leader, it begins to wait the > epoch ack, but in fact the ofs_zk1 does not receive the notification(which > says the leader is ofs_zk3) because the ofs_zk3 has not sent the > notification(which may still exist in the sendqueue of WorkerSender). At > last, the potential leader ofs_zk3 fails to receive the epoch ack in timeout, > so it quits the leader and begins a new election. > The log files of ofs_zk1 and ofs_zk3 are attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)