Zookeeper servers fail to elect a leader succesfully.
-----------------------------------------------------

                 Key: ZOOKEEPER-1005
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.2.2
         Environment: zookeeper-3.2.2; debian
            Reporter: Alexandre Hardy


We were running 3 zookeeper servers, and simulated a failure on one of the 
servers. 

The one zookeeper node follows the other, but has trouble connecting. It looks 
like the following exception is the cause:
{noformat}
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.QuorumPeer] FOLLOWING
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
[org.apache.zookeeper.server.ZooKeeperServer] Created server
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.Follower] Following 
zookeeper3/192.168.131.11:2888
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  
[org.apache.zookeeper.server.quorum.Follower] Unexpected exception, tries=0
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING java.net.ConnectException: 
--  Connection refused
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
java.net.PlainSocketImpl.socketConnect(Native Method)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
java.net.Socket.connect(Socket.java:546)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
{noformat}
The last exception while connecting was:
{noformat}
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --  
[org.apache.zookeeper.server.quorum.Follower] Unexpected exception
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: --  
Connection refused
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
java.net.PlainSocketImpl.socketConnect(Native Method)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
java.net.Socket.connect(Socket.java:546)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  
[org.apache.zookeeper.server.quorum.Follower] Exception when following the 
leader
{noformat}

The leader started leading a bit later 
{noformat}
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.FastLeaderElection] Notification: 0, 
94489312534, 25, 2, LOOKING, LOOKING, 0
2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.FastLeaderElection] Adding vote
2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
[org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 
at election address zookeeper2/192.168.132.10:3888
2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.QuorumPeer] LEADING
{noformat}

But at that time the follower had already terminated and started a new 
election, so the leader failed:
{noformat}
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.ZooKeeperServer] Created server
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.persistence.FileSnap] Reading snapshot 
/var/lib/zookeeper/version-2/snapshot.1600007d16
2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.persistence.FileTxnSnapLog] Snapshotting: 
1600007d16
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
[org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 
at election address zookeeper2/192.168.132.10:3888
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
notification.
2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
notification.
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  
[org.apache.zookeeper.server.quorum.QuorumCnxManager] Cannot open channel to 1 
at election address zookeeper2/192.168.132.10:3888
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at 
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
notification.
2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.FastLeaderElection] Sending new 
notification.
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  
[org.apache.zookeeper.server.quorum.Leader] Shutdown called
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
{noformat}

>From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
{quote}
The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time 
the ZooKeeper servers in quorum have to connect to a leader
{quote}

Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a 
zookeeper server to contact the leader.

However, in the source code 
src/java/main/org/apache/zookeeper/server/quorum/Follower.java:

{noformat}
152                 for (int tries = 0; tries < 5; tries++) {
153                     try {
154                         //sock = new Socket();
155                         //sock.setSoTimeout(self.tickTime * self.initLimit);
156                         sock.connect(addr, self.tickTime * self.syncLimit);
157                         sock.setTcpNoDelay(nodelay);
158                         break;
159                     } catch (IOException e) {
160                         if (tries == 4) {
161                             LOG.error("Unexpected exception",e);
162                             throw e;
163                         } else {
164                             LOG.warn("Unexpected exception, 
tries="+tries,e);
165                             sock = new Socket();
166                             sock.setSoTimeout(self.tickTime * 
self.initLimit);
167                         }
168                     }
169                     Thread.sleep(1000);
170                 }
{noformat}

It appears as if we only have 4 seconds to contact the leader. The timeouts are 
applied to the socket, but do not take into account that the zookeeper leader 
may not have started its zookeeper service yet. 

Is this the expected behaviour? Or is the expected behaviour that followers 
should always be able to connect to the leader?

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to