[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930788#action_12930788 ]
Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------
Hi Vishal, my understanding is that the readRecord call in readPacket will
time out even if the TCP connection is still up. The documentation at
http://download.oracle.com/javase/6/docs/api/java/net/SocketOptions.html
says:
{noformat}
static int SO_TIMEOUT
Set a timeout on blocking Socket operations:
{noformat}
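To illustrate the point, here is a minimal sketch, not ZooKeeper code. It assumes a local peer that accepts the connection and then stays silent, standing in for a hung leader. The class name SoTimeoutDemo and the method readTimesOut are made up for this example. With SO_TIMEOUT set, the blocking read should raise SocketTimeoutException while the socket remains connected:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; none of these names come from ZooKeeper.
class SoTimeoutDemo {
    /** Returns true if a blocking read timed out while the connection was still open. */
    static boolean readTimesOut(int timeoutMs) {
        // A local "leader" that accepts the connection but never writes anything.
        try (ServerSocket silentLeader = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket follower = new Socket(silentLeader.getInetAddress(), silentLeader.getLocalPort());
             Socket accepted = silentLeader.accept()) {
            follower.setSoTimeout(timeoutMs); // SO_TIMEOUT applies to blocking reads
            InputStream in = follower.getInputStream();
            try {
                in.read(); // blocks: the peer is connected but silent
                return false; // data or end-of-stream arrived, so no timeout
            } catch (SocketTimeoutException e) {
                // The socket itself is still usable after the timeout.
                return follower.isConnected() && !follower.isClosed();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out with connection up: " + readTimesOut(200));
    }
}
```

Whether the follower's socket actually has SO_TIMEOUT set at this point in followLeader() is the thing to check in the ZooKeeper source; the sketch only shows what the option does once it is set.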
> Follower should stop following and start FLE if it does not receive pings
> from the leader
> -----------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-928
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928
> Project: Zookeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.3.2
> Reporter: Vishal K
> Priority: Critical
> Fix For: 3.3.3, 3.4.0
>
>
> In Follower.followLeader() after syncing with the leader, the follower does:
> while (self.isRunning()) {
>     readPacket(qp);
>     processPacket(qp);
> }
> It looks like it relies on socket timeout expiry to figure out if the
> connection with the leader has gone down. So a follower *with no clients*
> may never notice a faulty leader if the leader has a software hang but the TCP
> connections with the peers are still valid. Since it has no clients, it won't
> heartbeat with the leader. If a majority of followers are not connected to any
> clients, then FLE will fail even if other followers attempt to elect a new
> leader.
> We should keep track of pings received from the leader and, if we haven't seen
> a ping packet from the leader for (syncLimit * tickTime) time, give up
> following the leader.
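The check proposed at the end of the description could be sketched roughly as follows. This is an illustration under assumptions, not a patch: the class LeaderPingTracker and its methods are hypothetical names, and the caller is assumed to pass in the current time in milliseconds and to call onPing whenever a ping packet arrives from the leader.

```java
// Hypothetical helper; not part of ZooKeeper's Follower class.
class LeaderPingTracker {
    private final long timeoutMs; // syncLimit * tickTime, in milliseconds
    private long lastPingMs;      // time the last ping from the leader was seen

    LeaderPingTracker(int syncLimit, int tickTimeMs, long nowMs) {
        this.timeoutMs = (long) syncLimit * tickTimeMs;
        this.lastPingMs = nowMs;
    }

    /** Record that a ping packet from the leader was just received. */
    void onPing(long nowMs) {
        lastPingMs = nowMs;
    }

    /** True once more than syncLimit * tickTime has passed without a ping. */
    boolean leaderTimedOut(long nowMs) {
        return nowMs - lastPingMs > timeoutMs;
    }

    public static void main(String[] args) {
        // Example values only: syncLimit = 5 ticks, tickTime = 2000 ms.
        LeaderPingTracker tracker = new LeaderPingTracker(5, 2000, 0L);
        System.out.println(tracker.leaderTimedOut(10_000L));
        System.out.println(tracker.leaderTimedOut(10_001L));
    }
}
```

A follower loop would consult leaderTimedOut on each iteration and stop following the leader once it returns true. Note that this only helps if the loop itself is not stuck in a read that never returns, which is where the socket timeout discussed in the comment above comes in.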