[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165004#comment-13165004 ]
Peter Schuller commented on CASSANDRA-3569:
-------------------------------------------

So yes, I *am* saying that the actual TCP connection is a more reliable detector - in the sense that false negatives are by definition impossible - than the failure detector, since the only thing the streaming actually depends on is the data being transferred over that TCP connection. The idea that the failure detector might somehow be better than a keep-alive is IMO not worth the negative side-effect of bogusly killing running repairs. I *do* however agree that since we cannot choose the TCP keep-alive behavior on a per-socket basis, there is in fact some actual functionality in the failure detector that we don't get from keep-alive, so I will grant that the failure detector is not strictly <= the TCP connection.

When I say that the FD is orthogonal, I am specifically referring to the fact that our FD is essentially doing messaging *on top* of a TCP connection. If we had been UDP based, it would make sense to let the FD be the keeper of the logic having to do with our connectivity to other nodes. But given that we are using TCP, it's not adding anything, it seems to me, except for the fact that the FD is more tweakable (from the app) than the keep-alive timeouts. Or, put another way, the FD communicates over a channel which by design is never supposed to "block"; we could effectively use a TCP timeout there, which we cannot for streaming, because we don't want streaming TCP connections to die just because e.g. the receiver is legitimately blocking for an extended period of time. This is why keep-alive is more suitable: it applies to the health of the underlying transport and is independent of whether in-band data is being sent.

{quote}
I'm not claiming that at all. I was really only reminding that the main goal of detecting failure in the first place was to address frustrated and confused users of hanging repair, to be sure we were on the same page.
{quote}

I completely agree with the goal (after all I *am* a user, and a user with a long history of pain with repair), just not the method used to achieve it - because the cost of slaying perfectly working repair sessions is just too high. Given that, I prefer keep-alive because it *does* do the job, except for the problem of non-tweakable time frames, which is still acceptable. A two hour wait in the unusual cases where a connection silently dies in a way that does not generate an RST or otherwise close the connection does not feel like the end of the world to me. It is absolutely not optimal though, I agree with that.

Quoting out of order:

{quote}
And I'm personally fine having a 'different and wider discussion' if that helps improving the code (btw, if our way of doing failure detection is really fundamentally broken in many ways, it shouldn't be too hard to show how).
{quote}

The short version is that the failure detector seems to be designed for non-reliable datagram messaging. Given that we use TCP, TCP already does what we need to determine "are we able to communicate with this guy at all?". For everything else we would need good handling of, like half-down failure modes, the FD is mostly useless for us anyway (the best thing we have there is the dynamic snitch, but that also has issues). It seems to me the FD is more of a show-off of an algorithm because the paper looked cool than something addressing a real-world problem for Cassandra (I'm sorry, but that really is what it seems like to me). Witness CASSANDRA-3294 for a good example of this reality-vs-theory split-brain problem. Why on earth are we waiting on the FD to figure out that we shouldn't send messages to a node, when we have *absolute perfect proof* that the *only* communication channel we have with the other node is *not working*?
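To make the keep-alive side of the argument concrete: in classic `java.net`, the per-socket knob is just the on/off SO_KEEPALIVE flag; the probe schedule itself comes from the OS-wide sysctls, which is exactly the "non-tweakable time frames" limitation above. A minimal illustrative sketch (the class name is mine, not anything in Cassandra):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws IOException {
        // Loopback socket pair, standing in for a long-running streaming connection.
        try (ServerSocket server = new ServerSocket(0);
             Socket sender = new Socket("127.0.0.1", server.getLocalPort());
             Socket receiver = server.accept()) {
            // SO_KEEPALIVE is all the classic java.net API lets us set per
            // socket. Once the connection is idle, the kernel begins probing
            // after net.ipv4.tcp_keepalive_time seconds and declares the
            // connection dead after tcp_keepalive_probes failed probes sent
            // tcp_keepalive_intvl seconds apart - all system-wide settings.
            sender.setKeepAlive(true);
            System.out.println("keep-alive enabled: " + sender.getKeepAlive());
        }
    }
}
```

Keep-alive probes are generated by the kernel independently of in-band traffic, which is why a receiver that legitimately blocks for a long time does not cause the connection to be torn down.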
Having 1/N requests (in an N node cluster) suddenly stall for rpc_timeout is not sensible behavior for high-request-rate production systems, when the failure mode is one of the friendliest you can imagine (clean process death, the OS knows it, the TCP connection dies, the other end gets an RST). It is valid that a TCP connection hiccup is probably not a good reason to consider a node down in the sense that we shouldn't be trying to e.g. stream from it during bootstrap; it makes sense to filter out flapping there, maybe. But "sending" messages to a node with whom we, by definition, currently cannot communicate seems obviously bad. And it is a real problem in production, although mitigated once you realize this (undocumented) limitation and start doing the disablegossip+disablethrift+wait-for-a-while dance. (If there is some kind of network flap so that e.g. all replicas for a row are temporarily unreachable, it's also perfectly valid to fast-fail and return an error back to the client, rather than queueing up messages to be delivered later or something like that.)

So essentially, my feeling is that the FD is trying to solve a very general problem which only partially overlaps with the real-world concerns of Cassandra, while failing to address the simplest cases, which are also very common. Anyway, this is really something that could be discussed in a ticket of its own. But, in my opinion, there are far more important things to change than the workings of the failure detector.

> Failure detector downs should not break streams
> -----------------------------------------------
>
> Key: CASSANDRA-3569
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
> Project: Cassandra
> Issue Type: Bug
> Reporter: Peter Schuller
> Assignee: Peter Schuller
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting
> there waiting forever. In my opinion the correct fix to that problem is to
> use TCP keep-alive.
> Unfortunately the TCP keep-alive period is insanely high by default on
> modern Linux, so just doing that is not entirely good either. But using the
> failure detector seems nonsensical to me. We have a communication method,
> the TCP transport, which we know is used for long-running processes that we
> don't want killed incorrectly for no good reason, and we are using a failure
> detector - tuned for deciding when not to send latency-sensitive requests to
> nodes - to actively kill a working connection.
>
> So, rather than add complexity with protocol-based ping/pongs and such, I
> propose that we simply use TCP keep-alive for streaming connections and
> instruct operators of production clusters to tweak
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever the
> equivalent is on their OS).
>
> I can submit the patch. Awaiting opinions.
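For reference, the worst-case time for keep-alive to detect a silently dead, idle connection is tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl. With stock Linux defaults (7200 s / 9 probes / 75 s) that works out to the roughly-two-hour wait mentioned above; tuned values bring it down to minutes. A small sketch of the arithmetic (the tuned numbers are purely illustrative, not a recommendation):

```java
public class KeepAliveMath {
    /** Worst-case seconds until the kernel gives up on a dead, idle connection. */
    static int detectionSeconds(int keepaliveTime, int probes, int intvl) {
        return keepaliveTime + probes * intvl;
    }

    public static void main(String[] args) {
        // Linux defaults: net.ipv4.tcp_keepalive_time=7200,
        // net.ipv4.tcp_keepalive_probes=9, net.ipv4.tcp_keepalive_intvl=75
        int defaults = detectionSeconds(7200, 9, 75); // 7875 s, about 2h11m
        // Hypothetical tuned values for a production cluster.
        int tuned = detectionSeconds(600, 5, 30);     // 750 s, 12.5 minutes
        System.out.println("defaults: " + defaults + "s, tuned: " + tuned + "s");
    }
}
```

Since these sysctls are system-wide, the tuned schedule applies to every keep-alive-enabled socket on the host, not just streaming connections - the trade-off acknowledged in the comment thread above.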