[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900108#action_12900108
 ] 

Vishal K commented on ZOOKEEPER-822:
------------------------------------

I am suspecting that one of the node (10.17.119.101) is not sending the 
notification to the other node. sendNotifications() is called to send 
notification to all peers. This functions enteres the notification in 
sendqueue. However, either the entry was not put in the queue (sendqueue.offer 
failed) or the thread that polls sendqueue did not wake up. I am not sure what 
the cause is yet.

I had added extra debug messages. Three messages are of main interest:

1. in sendNotifications():  Print "IN FLE sending notification to server id = 
1" for each server. Also print  "proposedLeader, proposedZxid, logicalclock"
2. In FastLeaderElection.lookForLeader() print "Updating proposa" before 
calling upgradeProposal if (totalOrderPredicate(n.leader, n.zxid, proposedZxid) 
is true
3. in WorkerSender.process(), log -  LOG.info("WorkSender.process() QUEUEING 
m.state= " + m.state + " m.leader=" + m.leader + " m.sid=" + m.sid);

Suppporting log entries from 10.17.119.101-zookeeper.log.  I  have added 
description inline..
------------------
2010-08-18 14:53:56,451 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@475] - IN FLE sending 
notification to server id = 1
2010-08-18 14:53:56,451 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@476] - proposedLeader, 
proposedZxid, logicalclock 0343597383684
2010-08-18 14:53:56,452 - INFO  [WorkerSender 
Thread:fastleaderelection$messenger$workersen...@352] - WorkSender.process() 
QUEUEING m.state= LOOKING m.leader=0 m.sid=1
2010-08-18 14:53:56,452 - DEBUG [WorkerSender Thread:quorumcnxmana...@347] - 
Opening channel to server 1
2010-08-18 14:53:56,453 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@475] - IN FLE sending 
notification to server id = 2
2010-08-18 14:53:56,453 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@476] - proposedLeader, 
proposedZxid, logicalclock 0343597383684
2010-08-18 14:53:56,453 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@690] - Notification: 1, 
34359738368, 4, 0, LOOKING, LOOKING, 1
2010-08-18 14:53:56,454 - DEBUG 
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@496] - id: 1, proposed id: 
0, zxid: 34359738368, proposed zxid: 34359738368
2010-08-18 14:53:56,454 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@710] - Updating proposal
2010-08-18 14:53:56,454 - INFO  [WorkerSender Thread:quorumcnxmana...@162] - 
Have smaller server identifier, so dropping the connection: (1, 0)
2010-08-18 14:53:56,455 - INFO  [WorkerSender 
Thread:fastleaderelection$messenger$workersen...@352] - WorkSender.process() 
QUEUEING m.state= LOOKING m.leader=0 m.sid=2
2010-08-18 14:53:56,458 - DEBUG [WorkerSender Thread:quorumcnxmana...@347] - 
Opening channel to server 2
2010-08-18 14:53:56,458 - WARN  [Thread-19:quorumcnxmanager$recvwor...@659] - 
Connection broken:
java.io.IOException: Channel eof
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:631)
2010-08-18 14:53:56,459 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@475] - IN FLE sending 
notification to server id = 0

2010-08-18 14:53:56,460 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@476] - proposedLeader, 
proposedZxid, logicalclock 1343597383684  

***** The above line shows that this node (server 0) is going to vote for 1. 
see - proposedLeader, proposedZxid, logicalclock 1 34359738368 4  Forgot to add 
spaces in the message  :-) ****


2010-08-18 14:53:56,460 - DEBUG [Thread-1:quorumcnxmanager$liste...@446] - 
Connection request /10.17.119.102:41597
2010-08-18 14:53:56,461 - DEBUG [Thread-1:quorumcnxmanager$liste...@449] - 
Connection request: 0
2010-08-18 14:53:56,461 - DEBUG [Thread-1:quorumcnxmanager$sendwor...@505] - 
Address of remote peer: 1
2010-08-18 14:53:56,461 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@475] - IN FLE sending 
notification to server id = 1
2010-08-18 14:53:56,462 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@476] - proposedLeader, 
proposedZxid, logicalclock 1343597383684 

***** Above, server 0 queued a notification to be sent to server 1. The 
notfication is saying that it accepts 1 as the leader. But the notification 
never got sent. process() was not called at all from WorkerSender. Its almost 
as if the notification was never entered in sendqueue (in sendNotifications). 
*****

2010-08-18 14:53:56,462 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@475] - IN FLE sending 
notification to server id = 2
2010-08-18 14:53:56,462 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@476] - proposedLeader, 
proposedZxid, logicalclock 1343597383684
2010-08-18 14:53:56,463 - DEBUG 
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@716] - Adding vote: From = 
1, Proposed leader = 1, Porposed zxid = 34359738368, Proposed epoch = 4
2010-08-18 14:53:56,463 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@690] - Notification: 0, 
34359738368, 4, 0, LOOKING, LOOKING, 0
2010-08-18 14:53:56,463 - DEBUG 
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@496] - id: 0, proposed id: 
1, zxid: 34359738368, proposed zxid: 34359738368
2010-08-18 14:53:56,463 - DEBUG 
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@716] - Adding vote: From = 
0, Proposed leader = 0, Porposed zxid = 34359738368, Proposed epoch = 4


> Leader election taking a long time  to complete
> -----------------------------------------------
>
>                 Key: ZOOKEEPER-822
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.0
>            Reporter: Vishal K
>            Priority: Blocker
>         Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, 
> test_zookeeper_2.log, zk_leader_election.tar.gz
>
>
> Created a 3 node cluster.
> 1 Fail the ZK leader
> 2. Let leader election finish. Restart the leader and let it join the 
> 3. Repeat 
> After a few rounds leader election takes anywhere 25- 60 seconds to finish. 
> Note- we didn't have any ZK clients and no new znodes were created.
> zoo.cfg is shown below:
> #Mon Jul 19 12:15:10 UTC 2010
> server.1=192.168.4.12\:2888\:3888
> server.0=192.168.4.11\:2888\:3888
> clientPort=2181
> dataDir=/var/zookeeper
> syncLimit=2
> server.2=192.168.4.13\:2888\:3888
> initLimit=5
> tickTime=2000
> I have attached logs from two nodes that took a long time to form the cluster 
> after failing the leader. The leader was down anyways so logs from that node 
> shouldn't matter.
> Look for "START HERE". Logs after that point should be of our interest.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to