[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930800#action_12930800 ]
Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

My understanding is that SO_TIMEOUT also affects SocketChannel, since it builds on top of a Socket object.

> Follower should stop following and start FLE if it does not receive pings from the leader
> -----------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-928
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Priority: Critical
>
> In Follower.followLeader(), after syncing with the leader, the follower does:
>     while (self.isRunning()) {
>         readPacket(qp);
>         processPacket(qp);
>     }
> It looks like it relies on socket timeout expiry to figure out if the
> connection with the leader has gone down. So a follower *with no clients*
> may never notice a faulty leader if the Leader has a software hang but the TCP
> connections with the peers are still valid. Since it has no clients, it won't
> heartbeat with the Leader. If a majority of followers are not connected to any
> clients, then FLE will fail even if other followers attempt to elect a new
> leader.
> We should keep track of pings received from the leader and, if we haven't
> seen a ping packet from the leader for (syncLimit * tickTime), give up
> following the leader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
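
For reference, the ping-tracking idea from the issue description could be sketched roughly as below. This is only an illustration, not ZooKeeper code: the class LeaderPingTracker and its methods onPing() and leaderTimedOut() are hypothetical names, and only syncLimit and tickTime come from the issue text. Timestamps are passed in by the caller (an assumption made here so the logic does not depend on the wall clock).

```java
/**
 * Hypothetical helper (not part of ZooKeeper): remembers when the last
 * ping from the leader arrived and reports whether the leader has been
 * silent for longer than syncLimit * tickTime.
 */
final class LeaderPingTracker {
    private final long timeoutMillis;
    private long lastPingMillis;

    /**
     * @param syncLimit      number of ticks a follower may go without a ping
     * @param tickTimeMillis length of one tick in milliseconds
     * @param nowMillis      current time, supplied by the caller
     */
    LeaderPingTracker(int syncLimit, int tickTimeMillis, long nowMillis) {
        // Widen before multiplying so large settings cannot overflow int.
        this.timeoutMillis = (long) syncLimit * tickTimeMillis;
        this.lastPingMillis = nowMillis;
    }

    /** Record that a ping packet from the leader was just processed. */
    void onPing(long nowMillis) {
        lastPingMillis = nowMillis;
    }

    /** True once more than syncLimit * tickTime has passed since the last ping. */
    boolean leaderTimedOut(long nowMillis) {
        return nowMillis - lastPingMillis > timeoutMillis;
    }
}
```

Under these assumptions, a follower loop would call onPing() whenever it handles a ping packet and check leaderTimedOut() on each iteration, leaving the loop (and returning to leader election) when it returns true. The check is only reached if the read itself returns periodically, so it complements the socket read timeout rather than replacing it.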