Zookeeper servers fail to elect a leader succesfully.
-----------------------------------------------------
Key: ZOOKEEPER-1005
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.2.2
Environment: zookeeper-3.2.2; debian
Reporter: Alexandre Hardy
We were running 3 zookeeper servers, and simulated a failure on one of the
servers.
The one zookeeper node follows the other, but has trouble connecting. It looks
like the following exception is the cause:
{noformat}
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --
[org.apache.zookeeper.server.ZooKeeperServer] Created server
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.Follower] Following
zookeeper3/192.168.131.11:2888
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --
[org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING java.net.ConnectException:
-- Connection refused
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
java.net.PlainSocketImpl.socketConnect(Native Method)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
java.net.Socket.connect(Socket.java:546)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
{noformat}
The last exception while connecting was:
{noformat}
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --
[org.apache.zookeeper.server.quorum.Follower] Unexpected exception
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: --
Connection refused
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
java.net.PlainSocketImpl.socketConnect(Native Method)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
java.net.Socket.connect(Socket.java:546)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --
[org.apache.zookeeper.server.quorum.Follower] Exception when following the
leader
{noformat}
The leader started leading a bit later
{noformat}
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0,
94489312534, 25, 2, LOOKING, LOOKING, 0
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --
[org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1
at election address zookeeper2/192.168.132.10:3888
2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
{noformat}
But at that time the follower had already terminated and started a new
election, so the leader failed:
{noformat}
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.ZooKeeperServer] Created server
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot
/var/lib/zookeeper/version-2/snapshot.1600007d16
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting:
1600007d16
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --
[org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1
at election address zookeeper2/192.168.132.10:3888
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
notification.
2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
notification.
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --
[org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1
at election address zookeeper2/192.168.132.10:3888
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
notification.
2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new
notification.
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --
[org.apache.zookeeper.server.quorum.Leader] Shutdown called
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at
org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at
org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
{noformat}
>From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
{quote}
The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time
the ZooKeeper servers in quorum have to connect to a leader
{quote}
Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a
zookeeper server to contact the leader.
However, in the source code
src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
{noformat}
152 for (int tries = 0; tries < 5; tries++) {
153 try {
154 //sock = new Socket();
155 //sock.setSoTimeout(self.tickTime * self.initLimit);
156 sock.connect(addr, self.tickTime * self.syncLimit);
157 sock.setTcpNoDelay(nodelay);
158 break;
159 } catch (IOException e) {
160 if (tries == 4) {
161 LOG.error("Unexpected exception",e);
162 throw e;
163 } else {
164 LOG.warn("Unexpected exception,
tries="+tries,e);
165 sock = new Socket();
166 sock.setSoTimeout(self.tickTime *
self.initLimit);
167 }
168 }
169 Thread.sleep(1000);
170 }
{noformat}
It appears as if we only have 4 seconds to contact the leader. The timeouts are
applied to the socket, but do not take into account that the zookeeper leader
may not have started its zookeeper service yet.
Is this the expected behaviour? Or is the expected behaviour that followers
should always be able to connect to the leader?
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira