[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163290#comment-13163290 ]
Jonathan Ellis commented on CASSANDRA-3569:
-------------------------------------------

I think there are two motivations for using the FD:

- socket-based failure detection is prone to false negatives, e.g., I remember talking to the Twitter guys in the context of CASSANDRA-3005 about connections that appeared to be alive but made no progress
- socket-based failure detection is also prone to false positives, in the sense that a transient network failure shouldn't insta-fail a streaming operation. Trying to strike the right balance between retrying "enough, but not too much" would basically be reinventing the FD IMO.

I also note that while I agree with "just fix it at the OS level" in principle, we already have a higher bar than average for sysadmin kung-fu. Other things being equal, I'd like to work as well as possible even in the face of an OS running with default tuning, i.e., almost every cluster in the wild.

> Failure detector downs should not break streams
> -----------------------------------------------
>
> Key: CASSANDRA-3569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
> Project: Cassandra
> Issue Type: Bug
> Reporter: Peter Schuller
> Assignee: Peter Schuller
>
> CASSANDRA-2433 introduced this behavior just so that repairs don't sit
> there waiting forever. In my opinion the correct fix to that problem is to
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a
> communication method, the TCP transport, that we know is used for
> long-running processes that you don't want to be killed incorrectly for no
> good reason, and we are using a failure detector tuned to detecting when not
> to send real-time sensitive requests to nodes in order to actively kill a
> working connection.
>
> So, rather than add complexity with protocol-based ping/pongs and such, I
> propose that we simply use TCP keep alive for streaming connections and
> instruct operators of production clusters to tweak
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever equivalent
> on their OS).
>
> I can submit the patch. Awaiting opinions.
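
To make the keep-alive trade-off in the description concrete, here is a minimal sketch in Python. It is not the proposed patch: the ticket suggests enabling keep alive and tuning the system-wide sysctls, whereas this sketch swaps in per-socket overrides (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT), which are platform specific. The function names and the timer values passed by default are hypothetical and chosen only for illustration. The second function shows why the stock Linux settings (7200 s idle, 75 s interval, 9 probes) are "insanely high" for this purpose.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=6):
    """Enable TCP keep-alive on sock and, where supported, tighten its timers.

    The default values here are illustrative assumptions, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides are platform specific, so apply only the ones
    # this Python build exposes and the OS accepts.
    overrides = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in overrides:
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option known to Python but rejected by this OS


def worst_case_detection_s(idle_s, interval_s, probes):
    """Upper bound on the time before a silent peer is declared dead."""
    return idle_s + interval_s * probes


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    s.close()
    # Linux defaults: tcp_keepalive_time=7200, intvl=75, probes=9
    print(worst_case_detection_s(7200, 75, 9))  # 7875 seconds, just over two hours
```

With the illustrative values above, the same bound drops to 120 seconds, which is the kind of tuning the description asks operators to do via net.ipv4.tcp_keepalive_{probes,intvl}.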