[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165004#comment-13165004 ]
Peter Schuller commented on CASSANDRA-3569:
-------------------------------------------

So yes, I *am* saying that the actual TCP connection is a more reliable detector - in the sense that false negatives are by definition impossible - than the failure detector, since the only thing the streaming actually depends on is the data being transferred over that TCP connection. The idea that the failure detector might somehow be better than a keep-alive is IMO not worth the negative side-effect of bogusly killing running repairs. I *do* however agree that since we cannot choose the TCP keep-alive behavior on a per-socket basis, there is in fact some actual functionality in the failure detector that we don't get from keep-alive, so I will grant that the failure detector is not strictly <= the TCP connection.

When I say that the FD is orthogonal, I am specifically referring to the fact that our FD is essentially doing messaging *on top* of a TCP connection. If we had been UDP based, it would make sense to let the FD be the keeper of the logic having to do with our connectivity to other nodes. But given that we are using TCP, it's not adding anything, it seems to me, except for the fact that the FD is more tweakable (from the app) than the keep-alive timeouts. Or, put another way, the FD communicates over a channel which by design is never supposed to "block"; we could effectively use a TCP timeout there, which we cannot for streaming, because we don't want streaming TCP connections to die just because e.g. the receiver is legitimately blocking for an extended period of time. This is why keep-alive is more suitable: it applies to the health of the underlying transport and is independent of whether in-band data is being sent.

{quote}
I'm not claiming that at all. I was really only reminding that the main goal of detecting failure in the first place was to address frustrated and confused users of hanging repair, to be sure we were on the same page.
{quote}

I completely agree with the goal (after all I *am* a user, and a user with a long history of pain with repair), just not the method used to achieve it - because the cost of slaying perfectly working repair sessions is just too high. Given that, I prefer keep-alive because it *does* do the job, except for the problem of non-tweakable time frames, which is still acceptable. A two hour wait in the unusual cases where a connection silently dies in a way that does not generate an RST or otherwise close the connection does not feel like the end of the world to me. It is absolutely not optimal though, I agree with that.

Quoting out of order:

{quote}
And I'm personally fine having a 'different and wider discussion' if that helps improving the code (btw, if our way of doing failure detection is really fundamentally broken in many ways, it shouldn't be too hard to show how).
{quote}

The short version is that the failure detector seems to be designed for non-reliable datagram messaging. Given that we use TCP, TCP already does what we need to determine "are we able to communicate with this guy at all?". For everything else we would need good handling of, like half-down failure modes, the FD is mostly useless for us anyway (the best thing we have there is the dynamic snitch, but that also has issues). It seems to me the FD is more of a show-off of an algorithm because the paper looked cool than something addressing a real-world problem for Cassandra (I'm sorry, but that really is what it seems like to me). Witness CASSANDRA-3294 for a good example of this reality-vs-theory split-brain problem. Why on earth are we waiting on the FD to figure out that we shouldn't send messages to a node, when we have *absolute perfect proof* that the *only* communication channel we have with the other node is *not working*?
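To make the keep-alive side of the argument concrete: in classic `java.net`, the per-socket knob is just the on/off SO_KEEPALIVE flag; the probe schedule itself comes from the OS-wide sysctls, which is exactly the "non-tweakable time frames" limitation above. A minimal illustrative sketch (the class name is mine, not anything in Cassandra):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws IOException {
        // Loopback socket pair, standing in for a long-running streaming connection.
        try (ServerSocket server = new ServerSocket(0);
             Socket sender = new Socket("127.0.0.1", server.getLocalPort());
             Socket receiver = server.accept()) {
            // SO_KEEPALIVE is all the classic java.net API lets us set per
            // socket. Once the connection is idle, the kernel begins probing
            // after net.ipv4.tcp_keepalive_time seconds and declares the
            // connection dead after tcp_keepalive_probes failed probes sent
            // tcp_keepalive_intvl seconds apart - all system-wide settings.
            sender.setKeepAlive(true);
            System.out.println("keep-alive enabled: " + sender.getKeepAlive());
        }
    }
}
```

Keep-alive probes are generated by the kernel independently of in-band traffic, which is why a receiver that legitimately blocks for a long time does not cause the connection to be torn down.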
Having 1/N requests (in an N node cluster) suddenly stall for rpc_timeout is not sensible behavior for high-request-rate production systems, when the failure mode is one of the friendliest you can imagine (clean process death, the OS knows it, the TCP connection dies, the other end gets an RST). It is valid that a TCP connection hiccup is probably not a good reason to consider a node down in the sense that we shouldn't be trying to e.g. stream from it during bootstrap; it makes sense to filter out flapping there, maybe. But "sending" messages to a node with whom we, by definition, currently cannot communicate seems obviously bad. And it is a real problem in production, although mitigated once you realize this (undocumented) limitation and start doing the disablegossip+disablethrift+wait-for-a-while dance. (If there is some kind of network flap so that e.g. all replicas for a row are temporarily unreachable, it's also perfectly valid to fast-fail and return an error back to the client, rather than queueing up messages to be delivered later or something like that.)

So essentially, my feeling is that the FD is trying to solve a very general problem which only partially overlaps with the real-world concerns of Cassandra, while failing to address the simplest cases, which are also very common. Anyway, this is really something that could be discussed in a ticket of its own. But, in my opinion, there are far more important things to change than the workings of the failure detector.

> Failure detector downs should not break streams
> -----------------------------------------------
>
> Key: CASSANDRA-3569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
> Project: Cassandra
> Issue Type: Bug
> Reporter: Peter Schuller
> Assignee: Peter Schuller
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting
> there waiting forever. In my opinion the correct fix to that problem is to
> use TCP keep-alive.
> Unfortunately the TCP keep-alive period is insanely high by default on
> modern Linux, so just doing that is not entirely good either. But using the
> failure detector seems nonsensical to me. We have a communication method,
> the TCP transport, which we know is used for long-running processes that we
> don't want killed incorrectly for no good reason, and we are using a failure
> detector - tuned for deciding when not to send latency-sensitive requests to
> nodes - to actively kill a working connection.
>
> So, rather than add complexity with protocol-based ping/pongs and such, I
> propose that we simply use TCP keep-alive for streaming connections and
> instruct operators of production clusters to tweak
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever the
> equivalent is on their OS).
>
> I can submit the patch. Awaiting opinions.
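For reference, the worst-case time for keep-alive to detect a silently dead, idle connection is tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl. With stock Linux defaults (7200 s / 9 probes / 75 s) that works out to the roughly-two-hour wait mentioned above; tuned values bring it down to minutes. A small sketch of the arithmetic (the tuned numbers are purely illustrative, not a recommendation):

```java
public class KeepAliveMath {
    /** Worst-case seconds until the kernel gives up on a dead, idle connection. */
    static int detectionSeconds(int keepaliveTime, int probes, int intvl) {
        return keepaliveTime + probes * intvl;
    }

    public static void main(String[] args) {
        // Linux defaults: net.ipv4.tcp_keepalive_time=7200,
        // net.ipv4.tcp_keepalive_probes=9, net.ipv4.tcp_keepalive_intvl=75
        int defaults = detectionSeconds(7200, 9, 75); // 7875 s, about 2h11m
        // Hypothetical tuned values for a production cluster.
        int tuned = detectionSeconds(600, 5, 30);     // 750 s, 12.5 minutes
        System.out.println("defaults: " + defaults + "s, tuned: " + tuned + "s");
    }
}
```

Since these sysctls are system-wide, the tuned schedule applies to every keep-alive-enabled socket on the host, not just streaming connections - the trade-off acknowledged in the comment thread above.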