[ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210815#comment-13210815 ]

Jonathan Ellis commented on CASSANDRA-3927:
-------------------------------------------

Granted, the ML archive options we have are pretty awful.  If creating a 
linkable record is your goal, perhaps some "here is my thinking through of X" 
trains of thought are best suited to a blog post?  Creating jira tickets is a 
Call To Action, and it's exhausting to parse through 1300+ words looking for 
what action is actually being proposed, only to come up empty-handed or with 
an exercise-for-the-reader.

On a related note, I personally very much appreciate writing that is structured 
as abstract/summary followed by paragraphs each of which is introduced by topic 
sentences.  This lets me as a reader decide where I need to dig in and where 
it's okay to skim.  (Newspaper articles are invariably written this way, 
incidentally, if I may invoke that soon-to-be-obsolete medium.)

Admittedly writing this way takes more time, sometimes a lot more time, but 
it's worth it when communicating in a group setting like this, where many 
people are trying to keep up with a large volume of tickets and proposals.

                
> demystify failure detector, consider partial failure handling, latency 
> optimizations
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3927
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3927
>             Project: Cassandra
>          Issue Type: Wish
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>
> [My aim with this ticket is to explain my current understanding of the FD and 
> its behavior, express some opinions, and invite others to let me know if I'm 
> misunderstanding something.]
> So I was getting back to CASSANDRA-3569 and I re-read the ticket history, and 
> I want to add a few things about the FD in general that I've realized since 
> the last batch of discussion.
> Firstly, as a result of investigating gossip more, and of reading 
> CASSANDRA-2597 (Paul Cannon's excellent break-down of what the FD actually 
> does mathematically - thank you Paul!), I now have a much better 
> understanding of the behavior of the failure detector than I did before. 
> Unfortunately for the failure detector. Under the premise that the break-down 
> in CASSANDRA-2597 (and the resulting commit to the behavior of Cassandra) is 
> correct, if we ignore all the Gaussian/normal distribution stuff (I openly 
> admit I lack the necessary math background for a rigorous analysis), the 
> behavior of the failure detector is the following (not a quote despite use of 
> quote syntax, I'm speaking):
> {quote}
> For a given node, keep track of the last 1000 intervals between heartbeats 
> received via gossip (indirectly or directly from the node, doesn't matter). 
> At any given moment, the phi "score" of the node is the *time since the last 
> heartbeat divided by the average time between heartbeats over the past 1000 
> intervals* (scaled by some constant factor which is essentially ignorable, 
> since it is equivalent to a corresponding adjustment of the convict 
> threshold). If 
> it goes above a certain value, we consider it down. We check for this on 
> every gossip round (meaning about every second).
> {quote}
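> To make this concrete, here is a minimal sketch of the behavior as restated 
> above. This is my own illustrative code, *not* the actual 
> org.apache.cassandra.gms.FailureDetector implementation; all names are mine:
> {code:java}
> // Minimal sketch of the restated behavior; illustrative only.
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> class SimplifiedPhiEstimator
> {
>     private static final int SAMPLE_SIZE = 1000; // last 1000 intervals
>     private final Deque<Long> intervals = new ArrayDeque<>();
>     private long lastHeartbeatMillis = -1;
> 
>     // Called whenever a heartbeat arrives, directly or via gossip.
>     void report(long nowMillis)
>     {
>         if (lastHeartbeatMillis >= 0)
>         {
>             if (intervals.size() == SAMPLE_SIZE)
>                 intervals.removeFirst();
>             intervals.addLast(nowMillis - lastHeartbeatMillis);
>         }
>         lastHeartbeatMillis = nowMillis;
>     }
> 
>     // "phi" ~ time since last heartbeat / mean historic interval,
>     // ignoring the constant scale factor discussed above.
>     double phi(long nowMillis)
>     {
>         if (intervals.isEmpty())
>             return 0;
>         double mean = intervals.stream().mapToLong(Long::longValue).average().getAsDouble();
>         return (nowMillis - lastHeartbeatMillis) / mean;
>     }
> 
>     // Checked roughly every gossip round (~1s): binary conviction.
>     boolean isDown(long nowMillis, double convictThreshold)
>     {
>         return phi(nowMillis) > convictThreshold;
>     }
> }
> {code}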
> I want to re-state the behavior like this because it makes it trivial to get 
> an intuitive understanding of what it does without delving into the AFD 
> paper.
> In addition, consider that the failure detector will *immediately* consider 
> a node up when it receives a new heartbeat, if the node is considered down 
> at the time.
> Further, while the accrual FD paper talks about giving a non-binary score, 
> and we do this, we don't actually use it except to trigger a binary up/down 
> flap.
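> (In the sketch above, these two observations correspond to report() resetting 
> the elapsed time, so phi collapses back toward zero on the very next 
> heartbeat, and to the non-binary score only ever feeding the single threshold 
> check in isDown().)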
> Given the above, and general background, it seems to me that:
> * In terms of detecting that something goes down, the FD is barely one step 
> above slapping a fixed timeout on heartbeats; essentially a timeout scaled 
> relative to average historic latency.
> ** But on the other hand, it's also fuzzed (relative to simple TCP timeouts) 
> due to variation in gossip propagation time, which is almost certainly 
> higher than the variation in network latency between two nodes.
> * The gist of the previous two items is that the FD is really not doing 
> anything advanced/magic or otherwise "opaque".
> * In addition, because any heartbeat from a downed node implies that it goes 
> up immediately, the failure detector has very little ability to effectively 
> do something "non-trivial" to deal with partial failures, such as demanding 
> that a flapping node show itself healthy for a while before going back up.
> ** Indeed, as far as I can tell, if a node's heartbeat intervals are slowly 
> growing worse it might never get marked as down - if the rate of worsening 
> is slow enough you'll just slowly scale the past latency history and never 
> hit the threshold. (Untested/unconfirmed; see the simulation sketch after 
> this list.)
> * The FD is oblivious to input from real traffic (2,000,000 messages backed 
> up? doesn't matter, it's "up", even though neighboring nodes have nothing 
> backed up). This is not necessarily wrong in any way, but it needs to be kept 
> in mind when considering what to do *with* the FD.
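> To make the "slowly growing worse" point concrete, here is a hypothetical 
> simulation (again my own sketch, not Cassandra code) of a node whose 
> heartbeat intervals grow by 0.1% per beat. The phi-style ratio converges to 
> roughly 1.6 no matter how bad the intervals get in absolute terms, because 
> the history window scales along with the degradation:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> class WorseningNodeSimulation
> {
>     public static void main(String[] args)
>     {
>         Deque<Double> window = new ArrayDeque<>(); // last 1000 intervals
>         double interval = 1000.0; // start at ~1s between heartbeats
>         double peakRatio = 0;
>         for (int beat = 0; beat < 10000; beat++)
>         {
>             if (!window.isEmpty())
>             {
>                 double mean = window.stream()
>                                     .mapToDouble(Double::doubleValue)
>                                     .average()
>                                     .getAsDouble();
>                 // Score just before the next heartbeat finally lands:
>                 peakRatio = Math.max(peakRatio, interval / mean);
>             }
>             window.addLast(interval);
>             if (window.size() > 1000)
>                 window.removeFirst();
>             interval *= 1.001; // the node gets 0.1% slower every beat
>         }
>         // Intervals have grown from ~1s to ~6 hours, yet the peak ratio
>         // stays around ~1.6 - far below any plausible convict threshold.
>         System.out.printf("peak ratio: %.2f, final interval: %.0f ms%n",
>                           peakRatio, interval);
>     }
> }
> {code}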
> Now, CASSANDRA-3910 was recently filed where there is an attempt to use the 
> FD for what I personally think is better dealt with in the request path (see 
> that ticket).
> It seems to me that the FD as it works now is *definitely* not suitable to 
> handle partial failures or smoothly redirecting traffic from anywhere, since 
> we are converting the output of the underlying FD algo to a binary up/down 
> state. Further, even if we directly propagated current phi and used it to do 
> relative weighting on sending requests, we still have the instant return to 
> low phi on the very next heartbeat. It is just not suitable, as currently 
> implemented, for anything other than binary up/down flagging as far as I can 
> tell.
> This reinforces, in my view, my skepticism towards CASSANDRA-2433, and my 
> position in CASSANDRA-3294. We seem to be treating the failure detector as if 
> it's doing something non-trivial, where in reality it isn't (gossip as a 
> communication channel may be non-trivial, but the FD algorithm isn't). It 
> seems to me the main function of the failure detector is that it allows us to 
> scale to very large clusters; we do not need full-mesh ping-pong (whether at 
> app level or at socket level) in order for up/down state to be communicated. 
> This is a useful feature. However, it is actually a result of using gossip as 
> the underlying channel, rather than due to anything specific to the FD 
> algorithm (except insofar as the FD algorithm doesn't need more detailed or 
> specific information than is reasonable to communicate over gossip).
> I believe that the FD *is* good at:
> * Efficiently (in terms of scaling to large clusters) allowing controlled 
> shutdowns and "official" up/down marking.
> ** Useful for e.g. selecting hosts for streaming - but only for its 
> usefulness as "official flag" marking, not in the sense that it detects 
> failure conditions.
> ** Nodes coming up from scratch can get a feel for the cluster immediately 
> without having to go through an initial burst of failed messages (though 
> currently we don't actually guarantee this anyway, because we don't wait 
> for gossip to settle on start-up - but that's a separate issue).
> I believe that the FD is *not* good at:
> * Indicating whether to send a message to a node, or how often, except for 
> the special case of "the node is completely down, don't even bother at all, 
> ever".
> * Optimizing for request latency, at all. It is not an effective tool to 
> mitigate request latencies.
> * Optimizing for avoiding heap growth as requests back up; this is a very 
> separate concern that should take into account things like relative queue 
> sizes, past history of real requests, etc.
> * High latencies, node hiccups, congestion.
> I think the things it is good at are a legitimate job for the FD. The 
> things I list under bad are, I think, the job of other mechanisms like 
> CASSANDRA-2540 and CASSANDRA-3294 that add intelligence to the way we handle 
> requests. Even an infinitely smart FD will never work for these tasks, as 
> long as we retain the binary up/down output (unless you want to start 
> talking about flapping up/down at high frequency). I also think that with a 
> sufficiently good request path, we probably don't even need any active 
> ping-pong at all, except maybe very seldom if we don't send any traffic to 
> the node. So while using gossip for this is a cool application of it, I am 
> unconvinced that we need active ping-pong to any great extent if we are 
> sufficiently good at not routing requests to nodes of unknown status, and 
> instantly prefer nodes that are actively responding over those that don't 
> (e.g. least pending requests as an input to routing; see the sketch below).
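> As a hypothetical sketch of what "least pending requests as an input to 
> routing" could look like (illustrative names, not an actual Cassandra API):
> {code:java}
> import java.util.Comparator;
> import java.util.List;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.AtomicInteger;
> 
> class LeastPendingRouter
> {
>     private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();
> 
>     // Prefer the replica with the fewest in-flight requests; a node that
>     // has stopped responding accumulates pending requests and is
>     // deprioritized without any explicit ping-pong.
>     String pickReplica(List<String> candidates)
>     {
>         return candidates.stream()
>                          .min(Comparator.comparingInt(r -> counter(r).get()))
>                          .orElseThrow(IllegalStateException::new);
>     }
> 
>     void onRequestSent(String replica) { counter(replica).incrementAndGet(); }
>     void onResponse(String replica) { counter(replica).decrementAndGet(); }
> 
>     private AtomicInteger counter(String replica)
>     {
>         return pending.computeIfAbsent(replica, r -> new AtomicInteger());
>     }
> }
> {code}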
> Finally, there is one additional, particularly interesting feature of the 
> failure detector and its integration with gossip, as it works now, that I am 
> aware of and feel is worth mentioning: my understanding is that the 
> intention is for the FD, due to its coupling with gossip, to guarantee that 
> we never send messages to a node whose impact on the token ring is somehow 
> "incorrect". So for example, you're supposed to be able to make topological 
> changes during a cluster partition, and it will just magically resolve 
> itself when you de-partition. (I would *personally* never ever trust this in 
> production; others may disagree. But theoretically, if it works, it's a cool 
> feature.)
> Thoughts?
