[ https://issues.apache.org/jira/browse/ZOOKEEPER-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Nauroth updated ZOOKEEPER-1005: ------------------------------------- Fix Version/s: (was: 3.5.2) 3.5.3 > Zookeeper servers fail to elect a leader succesfully. > ----------------------------------------------------- > > Key: ZOOKEEPER-1005 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005 > Project: ZooKeeper > Issue Type: Bug > Components: quorum > Affects Versions: 3.2.2 > Environment: zookeeper-3.2.2; debian > Reporter: Alexandre Hardy > Fix For: 3.6.0, 3.5.3 > > > We were running 3 zookeeper servers, and simulated a failure on one of the > servers. > The one zookeeper node follows the other, but has trouble connecting. It > looks like the following exception is the cause: > {noformat} > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- > [org.apache.zookeeper.server.ZooKeeperServer] Created server > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.Follower] Following > zookeeper3/192.168.131.11:2888 > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] -- > [org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0 > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING > java.net.ConnectException: -- Connection refused > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > java.net.PlainSocketImpl.socketConnect(Native Method) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > java.net.Socket.connect(Socket.java:546) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156) > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING -- at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549) > {noformat} > The last exception while connecting was: > {noformat} > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] -- > [org.apache.zookeeper.server.quorum.Follower] Unexpected exception > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: -- > Connection refused > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > java.net.PlainSocketImpl.socketConnect(Native Method) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > java.net.Socket.connect(Socket.java:546) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR -- at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549) > 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] -- > [org.apache.zookeeper.server.quorum.Follower] Exception when following the > leader > {noformat} > The leader started leading a bit later > {noformat} > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, > 94489312534, 25, 2, LOOKING, LOOKING, 0 > 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote > 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- > [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to > 1 at election address zookeeper2/192.168.132.10:3888 > 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) > 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.QuorumPeer] LEADING > {noformat} > But at that time the follower had already terminated and started a new > election, so the leader failed: > {noformat} > 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.ZooKeeperServer] Created server > 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot > /var/lib/zookeeper/version-2/snapshot.1600007d16 > 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: > 1600007d16 > 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- > [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to > 1 at election address zookeeper2/192.168.132.10:3888 > 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) > 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) > 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323) > 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296) > 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new > notification. > 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new > notification. > 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] -- > [org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to > 1 at election address zookeeper2/192.168.132.10:3888 > 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323) > 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302) > 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323) > 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING -- at > org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296) > 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new > notification. > 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new > notification. > 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] -- > [org.apache.zookeeper.server.quorum.Leader] Shutdown called > 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371) > 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at > org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297) > 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO -- at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562) > {noformat} > From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html: > {quote} > The new entry, initLimit is timeouts ZooKeeper uses to limit the length of > time the ZooKeeper servers in quorum have to connect to a leader > {quote} > Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a > zookeeper server to contact the leader. > However, in the source code > src/java/main/org/apache/zookeeper/server/quorum/Follower.java: > {noformat} > 152 for (int tries = 0; tries < 5; tries++) { > 153 try { > 154 //sock = new Socket(); > 155 //sock.setSoTimeout(self.tickTime * > self.initLimit); > 156 sock.connect(addr, self.tickTime * > self.syncLimit); > 157 sock.setTcpNoDelay(nodelay); > 158 break; > 159 } catch (IOException e) { > 160 if (tries == 4) { > 161 LOG.error("Unexpected exception",e); > 162 throw e; > 163 } else { > 164 LOG.warn("Unexpected exception, > tries="+tries,e); > 165 sock = new Socket(); > 166 sock.setSoTimeout(self.tickTime * > self.initLimit); > 167 } > 168 } > 169 Thread.sleep(1000); > 170 } > {noformat} > It appears as if we only have 4 seconds to contact the leader. The timeouts > are applied to the socket, but do not take into account that the zookeeper > leader may not have started its zookeeper service yet. > Is this the expected behaviour? Or is the expected behaviour that followers > should always be able to connect to the leader? -- This message was sent by Atlassian JIRA (v6.3.4#6332)