Ivo Vrdoljak created ZOOKEEPER-4771:
---------------------------------------
Summary: Fast leader election taking too long
Key: ZOOKEEPER-4771
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.4.10
Reporter: Ivo Vrdoljak
Attachments: zookeeper10.log, zookeeper11.log, zookeeper12.log,
zookeeper20.log, zookeeper21.log
Hello ZooKeeper Community,
Background:
We are using ZooKeeper version 3.4.10. in our system and we have 5 Zookeeper
servers running, that are distributed across 2 clusters of servers.
In the first cluster, we have 3 Zookeeper servers, each deployed on its own
machine, and in the second cluster we have 2 Zookeeper servers, also each on
its own machine. Zookeeper servers that are distributed on the same cluster
communicate through the local network, and with the servers on the remote
cluster through an external network.
The situation is the following:
{code:java}
Cluster 1
Zookeeper server 10
Zookeeper server 11
Zookeeper server 12 -> Leader
Cluster 2
Zookeeper server 20
Zookeeper server 21
{code}
Problem:
We have an issue with Fast Leader Election when we kill the ZooKeeper leader
process.
After the leader (server 12) is killed and leader election starts, we can see
in Zookeeper logs that voting notifications are exchanged from each Zookeeper
server that remained alive towards all the others. Notification on Zookeeper
servers located in the same cluster (communicating over the local network) are
successfully exchanged. The problem seems to be for Zookeeper server sending
votes over external network as according to the logs they are only sent in one
direction.
Logs from zookeeper server 10:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election
protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader),
0xe00009c97 (newzxid), 12 (oldleader), 0xd00001380 (oldzxid)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO New election. My id = 10, proposed
zxid=0xe00009c97
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
Logs from zookeeper server 20:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election
protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=11, proposed
leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=20, proposed
leader=20, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed
leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed
leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed
leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
These are just some of the lines from the log relevant to the issue and
complete logs from all the servers are attached as well as the server
configuration.
As it can be seen from the logs the server with ID 20 is sending notifications
to all cluster members but in the logs from server 10 there is no indication
that a notification from server 20 is received. However, it is obvious from the
logs that server 20 received notifications from server 10 at the same time.
This part is a bit strange since the same socket is used for communication in
both directions.
The first suspect was the network, but after investigation by following tcpdump
we could not find any packages leaving from the cluster where the Zk server 20
is located which is sending notifications.
About almost half a minute later, the server realizes that there is no
connection to Zk servers 10 and 11 and then reopens channels towards them.
After that, the notifications are exchanged and the quorum is formed quickly
after:
{code:java}
Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN Connection broken for id 11, my id =
20, error = #012java.net.SocketException: Connection timed out (Read
failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method)
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source)
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source)
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source)
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source)
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source)
~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source)
~[?:1.8.0_261]#012#011at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
[zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN Connection broken for id 10, my id =
20, error = #012java.net.SocketException: Connection timed out (Read
failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method)
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source)
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source)
~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source)
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source)
~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source)
~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source)
~[?:1.8.0_261]#012#011at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
[zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
.....
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO FOLLOWING
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO Created server with tickTime 1500
minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/version-2 snapdir
/local/version-2
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO FOLLOWING - LEADER ELECTION TOOK -
27333{code}
We have two main questions related to this issue if you could help:
1) Is there any way to know for sure that the vote notification was actually
sent other than the following log:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader),
0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe
(n.peerEpoch){code}
Or is there any chance that the receiving server would discard the notification
for some reason and not even log the event, since we cannot see anything in the
logs even with debug enabled?
2) Are there any options to reduce this timeout since half a minute for leader
election is too much for our highly available product with only 5 zookeeper
servers? Is there any configuration parameter that could be tuned?
Also, do you know if the same or similar issue has been already reported or
fixed in some of the later versions of the product? We could not find something
similar in Zookeeper Jira.
Thanks and regards,
Ivo
--
This message was sent by Atlassian Jira
(v8.20.10#820010)