[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930785#action_12930785 ]
Mahadev konar commented on ZOOKEEPER-928:
-----------------------------------------

vishal,
Here is the definition of setSoTimeout:

{code}
public void setSoTimeout(int timeout) throws SocketException

Enable/disable SO_TIMEOUT with the specified timeout, in milliseconds. With this option set to a non-zero timeout, a read() call on the InputStream associated with this Socket will block for only this amount of time. If the timeout expires, a java.net.SocketTimeoutException is raised, though the Socket is still valid. The option must be enabled prior to entering the blocking operation to have effect. The timeout must be > 0. A timeout of zero is interpreted as an infinite timeout.
{code}

This means that the read would block until the timeout and throw an exception if it doesn't hear from the leader during that time. Wouldn't this suffice?

> Follower should stop following and start FLE if it does not receive pings from the leader
> -----------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-928
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>
> In Follower.followLeader() after syncing with the leader, the follower does:
>                 while (self.isRunning()) {
>                     readPacket(qp);
>                     processPacket(qp);
>                 }
> It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down. So a follower *with no clients* may never notice a faulty leader if the leader has a software hang but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader.
> We should keep track of pings received from the leader and, if we haven't seen a ping packet from the leader for (syncLimit * tickTime), give up following the leader.
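
To make the two mechanisms discussed in this thread concrete, here is a minimal Java sketch. It is not ZooKeeper's Follower code: the class LeaderLivenessSketch, the PingTracker helper, and the values 2000 (tickTime) and 5 (syncLimit) are all hypothetical and chosen only for illustration. readTimesOut() shows the setSoTimeout behaviour quoted in the comment, using a local peer that accepts the connection but never writes. PingTracker shows the bookkeeping proposed in the issue description, where the follower gives up once no ping has been seen for syncLimit * tickTime.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Illustrative sketch only; none of these names exist in ZooKeeper. */
class LeaderLivenessSketch {

    /** Hypothetical helper: remembers when the last ping from the leader arrived. */
    static final class PingTracker {
        private final long limitMillis;
        private long lastPingMillis;

        PingTracker(int tickTimeMillis, int syncLimit, long nowMillis) {
            // The deadline proposed in the issue: syncLimit * tickTime.
            this.limitMillis = (long) tickTimeMillis * syncLimit;
            this.lastPingMillis = nowMillis;
        }

        /** Call whenever a ping packet from the leader is processed. */
        void onPing(long nowMillis) {
            lastPingMillis = nowMillis;
        }

        /** True once more than syncLimit * tickTime has passed without a ping. */
        boolean leaderExpired(long nowMillis) {
            return nowMillis - lastPingMillis > limitMillis;
        }
    }

    /**
     * Connects to a local peer that stays connected but never sends data,
     * and reports whether the blocking read was ended by SO_TIMEOUT.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentLeader = new ServerSocket(0, 1, loopback);
             Socket follower = new Socket(loopback, silentLeader.getLocalPort());
             Socket accepted = silentLeader.accept()) {
            // Must be set before entering the blocking read to have effect.
            follower.setSoTimeout(timeoutMillis);
            InputStream in = follower.getInputStream();
            try {
                in.read(); // blocks: the peer never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the socket itself is still valid here
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("silent peer, read timed out: " + readTimesOut(200));

        // Example values only: tickTime = 2000 ms, syncLimit = 5.
        PingTracker tracker = new PingTracker(2000, 5, System.currentTimeMillis());
        tracker.onPing(System.currentTimeMillis());
        System.out.println("leader expired right after a ping: "
                + tracker.leaderExpired(System.currentTimeMillis()));
    }
}
```

One difference worth noting: SO_TIMEOUT applies to each read() call separately, so any bytes arriving on the socket let the read return, whereas the tracker only resets when the caller reports an actual ping. Whether that difference matters for a follower depends on what else the leader sends on that connection, which this sketch does not model.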