demystify failure detector, consider partial failure handling, latency 
optimizations
------------------------------------------------------------------------------------

                 Key: CASSANDRA-3927
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3927
             Project: Cassandra
          Issue Type: Wish
            Reporter: Peter Schuller
            Assignee: Peter Schuller
            Priority: Minor


[My aim with this ticket is to explain my current understanding of the FD and 
its behavior, express some opinions, and invite others to let me know if I'm 
misunderstanding something.]

So I was getting back to CASSANDRA-3569 and re-read the ticket history, and I 
want to add a few things about the FD in general that I've realized since the 
last batch of discussion.

Firstly, as a result of investigating gossip more, and of reading 
CASSANDRA-2597 (Paul Cannon's excellent break-down of what the FD actually does 
mathematically - thank you Paul!), I now have a much better understanding of 
the behavior of the failure detector than I did before - unfortunately for the 
failure detector. Under the premise that the break-down in CASSANDRA-2597 (and 
the resulting commit to the behavior of Cassandra) is correct, and if we ignore 
all the gaussian/normal distribution stuff (I openly admit I lack the necessary 
math background for a rigorous analysis), the behavior of the failure detector 
is the following (not a quote despite the use of quote syntax; this is me 
speaking):

{quote}
For a given node, keep track of the last 1000 intervals between heartbeats 
received via gossip (indirectly or directly from the node, it doesn't matter). 
At any given moment, the phi "score" of the node is the *time since the last 
heartbeat divided by the average time between heartbeats over the past 1000 
intervals* (scaled by a constant factor which is essentially ignorable, since 
it is equivalent to a corresponding adjustment of the convict threshold). If it 
goes above a certain value, we consider the node down. We check for this on 
every gossip round (meaning about every second).
{quote}
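
To make that restatement concrete, here is a minimal sketch of the simplified 
model in Java. To be clear, this is *not* the actual 
org.apache.cassandra.gms.FailureDetector code - the 1000-interval window and 
the 1/ln(10) scaling factor are taken from CASSANDRA-2597, and all names are 
made up:

{code:java}
import java.util.ArrayDeque;

// Simplified model of the FD's phi computation as restated above.
// Not the real implementation; illustrative names only.
class SimplePhi
{
    private static final int WINDOW = 1000;                     // last N intervals kept
    private static final double PHI_FACTOR = 1 / Math.log(10);  // per CASSANDRA-2597

    private final ArrayDeque<Long> intervals = new ArrayDeque<Long>();
    private long lastHeartbeatMillis = -1;

    // Called whenever gossip delivers a new heartbeat for this node
    // (directly or indirectly from the node, doesn't matter).
    void report(long nowMillis)
    {
        if (lastHeartbeatMillis >= 0)
        {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW)
                intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // phi == (time since last heartbeat) / (mean interval), scaled by a constant.
    double phi(long nowMillis)
    {
        if (intervals.isEmpty())
            return 0;
        double sum = 0;
        for (long interval : intervals)
            sum += interval;
        double mean = sum / intervals.size();
        return PHI_FACTOR * (nowMillis - lastHeartbeatMillis) / mean;
    }
}
{code}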

I want to re-state the behavior like this because it makes it trivial to get an 
intuitive understanding of what it does without delving into the AFD paper.

In addition, consider that the failure detector will *immediately* consider a 
node up when it receives a new heartbeat, if the node is considered down at the 
time.

Further, while the accrual FD paper talks about giving a non-binary score, and 
we do compute one, we don't actually use it except to trigger a binary up/down 
flip.
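
Building on the sketch above, the binary conversion plus the instant-up 
behavior amounts to roughly this (again with made-up names; the default convict 
threshold of 8 corresponds to phi_convict_threshold in cassandra.yaml):

{code:java}
// Binary up/down interpretation on top of the phi score.
class BinaryInterpreter
{
    private static final double CONVICT_THRESHOLD = 8; // phi_convict_threshold default
    private boolean up = true;

    // Checked roughly every gossip round (~once per second).
    void onGossipRound(SimplePhi fd, long nowMillis)
    {
        if (fd.phi(nowMillis) > CONVICT_THRESHOLD)
            up = false;
    }

    // A single heartbeat from a convicted node flips it straight back up;
    // there is no "prove yourself healthy for a while" grace period.
    void onHeartbeat(SimplePhi fd, long nowMillis)
    {
        fd.report(nowMillis);
        up = true;
    }
}
{code}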

Given the above, and general background, it seems to me that:

* In terms of detecting that something goes down, the FD is just barely one 
step above slapping a fixed timeout on heartbeats; essentially, it is a timeout 
scaled relative to average historic latency.
** On the other hand, it's also fuzzed (relative to simple TCP timeouts) due to 
variation in gossip propagation time, which is almost certainly higher than the 
variation in network latency between two nodes.
* The gist of the previous two items is that the FD is really not doing 
anything advanced/magic or otherwise "opaque".
* In addition, because any heartbeat from a downed node implies that it goes up 
immediately, the failure detector has very little ability to effectively do 
something "non-trivial" to deal with partial failures, such as demanding that a 
flapping node show itself healthy for a while before going back up.
** Indeed, as far as I can tell, if a node's heartbeats are slowly growing 
worse it might never get marked as down - if the rate of worsening is slow 
enough, you'll just slowly scale up the past latency history and never hit the 
threshold. (Untested/unconfirmed; see the simulation sketch after this list.)
* The FD is oblivious to input from real traffic (2,000,000 messages backed up? 
doesn't matter, it's "up", even though neighboring nodes have nothing backed 
up). This is not necessarily wrong in any way, but it needs to be kept in mind 
when considering what to do *with* the FD.
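
To sanity-check the slow-degradation item in the list above, here is a 
throwaway simulation using the SimplePhi sketch from earlier (same assumptions: 
1000-interval window, 1/ln(10) factor, threshold of 8; I have not verified this 
against the real FD):

{code:java}
// Grow the heartbeat interval by 1% per heartbeat and watch phi.
public class SlowDegradation
{
    public static void main(String[] args)
    {
        SimplePhi fd = new SimplePhi();
        double interval = 1000; // start at ~1s between heartbeats
        long now = 0;
        double maxPhi = 0;
        for (int i = 0; i < 3000; i++)
        {
            fd.report(now);
            interval *= 1.01; // the node is slowly getting worse
            now += (long) interval;
            // Evaluate just before the next heartbeat arrives:
            maxPhi = Math.max(maxPhi, fd.phi(now - 1));
        }
        // In this model the mean tracks the growing gap closely enough that
        // phi plateaus around ~4.3 and never reaches the threshold of 8,
        // even though the interval has grown by a factor of ~10^13.
        System.out.printf("max phi seen: %.2f%n", maxPhi);
    }
}
{code}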

Now, CASSANDRA-3910 was recently filed, in which there is an attempt to use the 
FD for something I personally think is better dealt with in the request path 
(see that ticket).

It seems to me that the FD as it works now is *definitely* not suitable for 
handling partial failures or smoothly redirecting traffic, since we are 
converting the output of the underlying FD algorithm to a binary up/down state. 
Further, even if we directly propagated the current phi and used it to do 
relative weighting when sending requests, we would still have the instant 
return to low phi on the very next heartbeat. As currently implemented, it is 
just not suitable for anything other than binary up/down flagging, as far as I 
can tell.
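
For example (hypothetical names, nothing like this exists in the codebase), 
even a phi-weighted picker would snap back to treating a struggling node as 
healthy the moment a single heartbeat gets through:

{code:java}
import java.util.List;
import java.util.Random;

// Weight replicas by 1/(1 + phi): higher phi means less traffic. The flaw:
// one heartbeat resets phi (and hence the weight) as if the node were healthy.
class PhiWeightedPicker
{
    private final Random random = new Random();

    String pick(List<String> replicas, List<SimplePhi> fds, long nowMillis)
    {
        double[] weights = new double[replicas.size()];
        double total = 0;
        for (int i = 0; i < replicas.size(); i++)
        {
            weights[i] = 1.0 / (1.0 + fds.get(i).phi(nowMillis));
            total += weights[i];
        }
        double r = random.nextDouble() * total;
        for (int i = 0; i < weights.length; i++)
        {
            r -= weights[i];
            if (r <= 0)
                return replicas.get(i);
        }
        return replicas.get(replicas.size() - 1);
    }
}
{code}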

This reinforces, in my view, my skepticism towards CASSANDRA-2433, and my 
position in CASSANDRA-3294. We seem to be treating the failure detector as if 
it's doing something non-trivial, where in reality it isn't (gossip as a 
communication channel may be non-trivial, but the FD algorithm isn't). It seems 
to me the main function of the failure detector is that it allows us to scale 
to very large clusters; we need not do full-mesh ping-pong (whether at app 
level or at socket level) in order for up/down state to be communicated. This 
is a useful feature. However, it is actually a result of using gossip as the 
underlying channel, rather than due to anything specific to the FD algorithm 
(except insofar as the FD algorithm doesn't need more detailed or specific 
information than is reasonable to communicate over gossip).

I believe that the FD *is* good at:

* Efficiently (in terms of scaling to large clusters) allowing controlled 
shutdowns and "official" up/down marking.
** Useful for e.g. selecting hosts for streaming - but only in its capacity as 
an "official flag", not in the sense that it detects failure conditions.
** Nodes coming up from scratch can get a feel for the cluster immediately 
without having to go through an initial burst of failed messages (but currently 
we don't actually guarantee this anyway because we don't wait for gossip to 
settle on start-up - but that's a separate issue).

I believe that the FD is *not* good at:

* Indicating whether to send a message to a node, or how often, except for the 
special case of "the node is completely down, don't even bother at all, ever".
* Optimizing for request latency, at all. It is not an effective tool to 
mitigate request latencies.
* Optimizing for avoiding heap growth as requests back up; this is a very 
separate concern that should take into account things like relative queue 
sizes, past history of real requests, etc.
* Dealing with high latencies, node hiccups, or congestion.

I think that the things it is good at are a legitimate job for the FD. The 
things I list under bad are, I think, the job of other things like 
CASSANDRA-2540 and CASSANDRA-3294 that add intelligence to the way we handle 
requests. Even an infinitely smart FD will never work for these tasks as long 
as we retain the binary up/down output (unless you want to start talking about 
flapping up/down at high frequency). I also think that with a sufficiently good 
request path, we probably don't even need any active ping-pong at all, except 
maybe very seldom if we don't send any traffic to the node. So while using 
gossip for this is a cool application of it, I am unconvinced that we need 
active ping-pong to any great extent if we are sufficiently good at not routing 
requests to nodes of unknown status, and instantly prefer nodes that are 
actively responding over those that don't (e.g. least pending requests as input 
to routing).
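
To illustrate that last point (purely hypothetical sketch; none of these names 
exist in Cassandra), least-pending routing could be as simple as:

{code:java}
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Prefer the replica with the fewest in-flight requests; a node that hiccups
// or backs up accumulates in-flight requests and naturally stops being chosen.
class LeastPendingRouter
{
    private final ConcurrentHashMap<String, AtomicInteger> inFlight =
            new ConcurrentHashMap<String, AtomicInteger>();

    String pick(List<String> replicas)
    {
        String best = null;
        int bestPending = Integer.MAX_VALUE;
        for (String replica : replicas)
        {
            int pending = pendingFor(replica).get();
            if (pending < bestPending)
            {
                best = replica;
                bestPending = pending;
            }
        }
        return best;
    }

    void onSend(String replica)  { pendingFor(replica).incrementAndGet(); }
    void onReply(String replica) { pendingFor(replica).decrementAndGet(); }

    private AtomicInteger pendingFor(String replica)
    {
        inFlight.putIfAbsent(replica, new AtomicInteger());
        return inFlight.get(replica);
    }
}
{code}

Note that the FD is not involved at all here; responsiveness is inferred from 
real traffic.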

Finally, there is one additional, particularly interesting feature of the 
failure detector and its integration with gossip, as it works now, that I am 
aware of and feel is worth mentioning: my understanding is that the intention 
is for the FD, due to its coupling with gossip, to guarantee that we never send 
messages to a node whose impact on the token ring is somehow "incorrect". So
for example, on a cluster partition you're supposed to be able to make 
topological changes during a partition and it will just magically resolve 
itself when you de-partition. (I would *personally* never ever trust this in 
production; others may disagree. But theoretically, if it works, it's a cool 
feature.)

Thoughts?

