[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flavio Junqueira updated ZOOKEEPER-822:
---------------------------------------

    Attachment: ZOOKEEPER-822.patch

I believe the patch I'm attaching achieves the same goal and is even simpler, but I'd like to make sure it suits your needs, Vishal. If you agree with the modifications, I can work on a test.

I was also thinking that the 2-second timeout you used before is too low, and I've raised it to 5 seconds. But instead of trying to argue which value is ideal, I'd rather use a system property with a default value of at least 5 seconds.

I also commit to redesigning QuorumCnxManager for either 3.4.0 or 4.0.0 to use a selector or some other approach we agree upon. I've been wanting to do it for a while anyway, and I actually thought there was a jira open for it... Maybe not, I can't find it right now.

> Leader election taking a long time to complete
> -----------------------------------------------
>
>                 Key: ZOOKEEPER-822
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.0
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.2, 3.4.0
>
>         Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log,
> test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz,
> ZOOKEEPER-822.patch, ZOOKEEPER-822.patch_v1
>
>
> Created a 3 node cluster.
> 1. Fail the ZK leader
> 2. Let leader election finish. Restart the leader and let it join the
> 3. Repeat
> After a few rounds, leader election takes anywhere from 25 to 60 seconds to finish.
> Note: we didn't have any ZK clients and no new znodes were created.
> zoo.cfg is shown below:
> #Mon Jul 19 12:15:10 UTC 2010
> server.1=192.168.4.12\:2888\:3888
> server.0=192.168.4.11\:2888\:3888
> clientPort=2181
> dataDir=/var/zookeeper
> syncLimit=2
> server.2=192.168.4.13\:2888\:3888
> initLimit=5
> tickTime=2000
> I have attached logs from two nodes that took a long time to form the cluster
> after failing the leader. The leader was down anyways, so logs from that node
> shouldn't matter.
> Look for "START HERE". Logs after that point should be of interest.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
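
For reference, the system-property idea from the comment above (a configurable connect timeout with a 5-second default) might look roughly like the sketch below. This is not the attached patch. The class name, the property name example.election.cnxTimeoutMs, and the helper methods are all made up for illustration; only the 5-second default comes from the comment.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not code from ZOOKEEPER-822.patch.
final class ElectionConnectTimeout {
    // Assumed property name, chosen only for this illustration.
    static final String TIMEOUT_PROPERTY = "example.election.cnxTimeoutMs";
    // Default of 5 seconds, as proposed in the comment.
    static final int DEFAULT_TIMEOUT_MS = 5000;

    // Returns the configured timeout in milliseconds. Falls back to the
    // default when the property is unset, unparseable, or not positive.
    static int connectTimeoutMs() {
        Integer configured = Integer.getInteger(TIMEOUT_PROPERTY);
        if (configured == null || configured <= 0) {
            return DEFAULT_TIMEOUT_MS;
        }
        return configured;
    }

    // Opens a socket to a peer, bounding the connect attempt by the timeout
    // so that an unreachable peer cannot stall the caller indefinitely.
    static Socket connect(String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs());
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }
}
```

With something like this, the value could be overridden at startup with -Dexample.election.cnxTimeoutMs=8000 on the java command line, and left unset to get the 5-second default.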