[ https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michi Mutsuzaki updated ZOOKEEPER-1865:
---------------------------------------
    Fix Version/s: 3.6.0

> Fix retry logic in Learner.connectToLeader()
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>             Fix For: 3.5.1, 3.6.0
>
>         Attachments: ZOOKEEPER-1865-nanoTime.patch,
> ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensembles.
> Here is a description of the event.
> Before the old leader went down, it was able to send out its notification
> message, so 3 out of 5 servers (including the old leader) elected the old
> leader as the new leader for the next epoch. While the old leader was being
> rebooted, the 2 other machines kept trying to connect to it, so a quorum
> couldn't form until those 2 machines gave up and moved on to the next round
> of leader election.
> This is because Learner.connectToLeader() uses simple retry logic. The
> contract for this method is that it should never spend longer than initLimit
> trying to connect to the leader. In our outage, each sock.connect() call
> probably blocked for initLimit, and it is called 5 times, so the method could
> take roughly five times initLimit before giving up.
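
The contract described above (never spend longer than initLimit in total, however many attempts are made) can be illustrated with a deadline-based retry loop. The following is only a minimal sketch and not the attached patch: the class `BoundedConnectSketch`, the `Attempt` interface, and the parameters `budgetMs`, `maxAttempts`, and `pauseMs` are hypothetical names introduced for illustration. It assumes the caller has already converted initLimit (which ZooKeeper expresses in ticks) into milliseconds, and it uses `System.nanoTime()` as a monotonic clock so that wall-clock adjustments do not affect the budget.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

final class BoundedConnectSketch {

    /** One connection attempt that must give up within timeoutMs (hypothetical interface). */
    interface Attempt {
        Socket connect(int timeoutMs) throws IOException;
    }

    /**
     * Retries up to maxAttempts times, but every attempt is given only the time
     * still left in the overall budget, so the total stays close to budgetMs
     * instead of growing to maxAttempts * budgetMs.
     */
    static Socket connectWithBudget(Attempt attempt, long budgetMs, int maxAttempts, long pauseMs)
            throws IOException, InterruptedException {
        final long start = System.nanoTime();
        final long budgetNanos = TimeUnit.MILLISECONDS.toNanos(budgetMs);
        IOException last = null;

        for (int tries = 1; tries <= maxAttempts; tries++) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(budgetNanos - (System.nanoTime() - start));
            if (remainingMs <= 0) {
                break; // budget exhausted; a timeout of 0 would mean "wait forever"
            }
            try {
                return attempt.connect((int) Math.min(remainingMs, Integer.MAX_VALUE));
            } catch (IOException e) {
                last = e; // remember the failure and try again if time allows
            }
            if (tries < maxAttempts) {
                long leftMs = TimeUnit.NANOSECONDS.toMillis(budgetNanos - (System.nanoTime() - start));
                if (leftMs > 0) {
                    Thread.sleep(Math.min(pauseMs, leftMs)); // never sleep past the deadline
                }
            }
        }
        throw new IOException("could not connect within " + budgetMs + " ms", last);
    }

    /** A TCP attempt; a fresh Socket is created per try and closed on failure. */
    static Attempt tcp(InetSocketAddress addr) {
        return timeoutMs -> {
            Socket s = new Socket();
            try {
                s.connect(addr, timeoutMs);
                return s;
            } catch (IOException e) {
                s.close();
                throw e;
            }
        };
    }
}
```

A caller would use it along the lines of `BoundedConnectSketch.connectWithBudget(BoundedConnectSketch.tcp(addr), budgetMs, 5, 1000)`. The point of the design is that the per-attempt timeout shrinks as the budget is consumed, so a leader that is down for the whole window costs about one budget rather than five.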