[ https://issues.apache.org/jira/browse/ZOOKEEPER-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flavio Junqueira updated ZOOKEEPER-1515:
----------------------------------------

    Issue Type: Improvement  (was: Bug)

> Long reconnect timeout if leader failed.
> ----------------------------------------
>
> Key: ZOOKEEPER-1515
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1515
> Project: ZooKeeper
> Issue Type: Improvement
> Components: leaderElection, quorum, server
> Affects Versions: 3.3.5
> Environment: Gentoo linux, but every environment is affected.
> Reporter: Ian Babrou
> Labels: patch, performance
>
> In zookeeper 3.3.5, in file
> src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325, you may see
> Thread.sleep(1000);
> This sleep always happens after a leader failure or restart. Zookeeper reelects
> a new leader and all followers try to connect to it. But the first attempt always
> fails because of "Connection refused":
> {quote}
> 2012-07-23 18:55:48,159 - WARN [QuorumPeer:/0.0.0.0:2181:Learner@229] -
> Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> at java.net.Socket.connect(Socket.java:529)
> at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:221)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:65)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
> {quote}
> I propose to change this line to the following code:
> {code:title=Learner.java|borderStyle=solid}
> if (tries > 0) {
>     Thread.sleep(self.tickTime);
> }
> {code}
> This way the first reconnect attempt will be made immediately, and later ones will
> wait for one tick time (this is a good semantic change, I suppose).
> The result of this change - leader reelection time lowered from >1500ms to
> 300-400ms with 50ms tick time. This is pretty important for our production
> environment and will not break any existing installations.
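
For illustration only, the proposed retry behaviour can be sketched as a small standalone Java program. This is not the actual Learner.java code: the class ConnectRetrySketch, the Attempt and Sleeper interfaces, and the connectWithRetry method are hypothetical names, and the sketch assumes the sleep sits at the end of the retry loop, after a failed attempt, as the description above suggests. The Sleeper stands in for Thread.sleep so that the delays can be observed without waiting.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

public class ConnectRetrySketch {

    /** Hypothetical stand-in for one socket connect attempt. */
    interface Attempt {
        void connect() throws IOException;
    }

    /** Hypothetical stand-in for Thread.sleep, so delays can be recorded. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /**
     * Attempts to connect up to maxTries times. After the first failure
     * (tries == 0) the retry happens immediately; after every later failure
     * the loop waits tickTimeMs before retrying. Returns the number of
     * attempts used, or rethrows the IOException from the final attempt.
     */
    static int connectWithRetry(Attempt attempt, int maxTries, long tickTimeMs, Sleeper sleeper)
            throws IOException, InterruptedException {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be at least 1");
        }
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                attempt.connect();
                return tries + 1;
            } catch (IOException e) {
                if (tries == maxTries - 1) {
                    throw e; // out of attempts, give up
                }
            }
            // The proposed change: no delay after the very first failure.
            if (tries > 0) {
                sleeper.sleep(tickTimeMs);
            }
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) throws Exception {
        List<Long> delays = new ArrayList<>();
        int[] calls = {0};
        // Simulated leader that refuses only the first connection.
        Attempt refusesOnce = () -> {
            if (calls[0]++ == 0) {
                throw new ConnectException("Connection refused");
            }
        };
        int used = connectWithRetry(refusesOnce, 5, 50L, ms -> delays.add(ms));
        System.out.println("attempts=" + used + ", delays=" + delays);
    }
}
```

In the simulated case the second attempt follows the refused one without any recorded delay, which corresponds to the scenario in the log above where only the tries=0 attempt is refused. A leader that needs longer to start listening would still be retried, with one tick time between the later attempts.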