[ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210815#comment-13210815 ]
Jonathan Ellis commented on CASSANDRA-3927:
-------------------------------------------

Granted, the ML archive options we have are pretty awful. If creating a linkable record is your goal, perhaps some "here is my thinking through of X" trains of thought are best suited to a blog post? Creating jira tickets is a Call To Action, and it's exhausting to parse through 1300+ words looking for what action is actually being proposed, only to come up empty-handed or with an exercise-for-the-reader.

On a related note, I personally very much appreciate writing that is structured as an abstract/summary followed by paragraphs, each of which is introduced by a topic sentence. This lets me as a reader decide where I need to dig in and where it's okay to skim. (Newspaper articles are invariably written this way, incidentally, if I may invoke that soon-to-be-obsolete medium.) Admittedly writing this way takes more time, sometimes a lot more time, but it's worth it when communicating in a group setting like this, where many people are trying to keep up with a large volume of tickets and proposals.

> demystify failure detector, consider partial failure handling, latency optimizations
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3927
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3927
>             Project: Cassandra
>          Issue Type: Wish
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>
> [My aim with this ticket is to explain my current understanding of the FD and its behavior, express some opinions, and invite others to let me know if I'm misunderstanding something.]
>
> So I was getting back to CASSANDRA-3569 and I re-read the ticket history, and I want to add a few things that are more about the FD in general, which I've realized since the last batch of discussion.
>
> Firstly, as a result of investigating gossip more, and of reading CASSANDRA-2597 (Paul Cannon's excellent break-down of what the FD actually does mathematically - thank you Paul!), I now have a much better understanding of the behavior of the failure detector than I did before. Unfortunately for the failure detector. Under the premise that the break-down in CASSANDRA-2597 (and the resulting commit to the behavior of Cassandra) is correct, and if we ignore all the gaussian/normal distribution stuff (I openly admit I lack the necessary math background for a rigorous analysis), the behavior of the failure detector is the following (not a quote despite use of quote syntax, I'm speaking):
>
> {quote}
> For a given node, keep track of the last 1000 intervals between heartbeats received via gossip (indirectly or directly from the node, doesn't matter). At any given moment, the phi "score" of the node is the *time since the last heartbeat divided by the average time between heartbeats over the past 1000 intervals* (scaled by some constant factor which is essentially ignorable, since scaling it is equivalent to making the corresponding adjustment to the convict threshold). If it goes above a certain value, we consider the node down. We check for this on every gossip round (meaning about every second).
> {quote}
>
> I want to re-state the behavior like this because it makes it trivial to get an intuitive understanding of what it does without delving into the AFD paper.
>
> In addition, consider that the failure detector will *immediately* consider a node up when it receives a new heartbeat, if the node is considered down at the time.
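>
> To make that restatement concrete, here is a minimal sketch of what I believe the behavior boils down to. This is *not* Cassandra's actual FailureDetector code - the class and constant names are made up, and the gaussian/normal machinery is deliberately left out, per the simplification above - it is just the restated behavior written down as code:
>
> {code:java}
> // Illustration only: a simplified model of the behavior restated above, with
> // made-up names. Not Cassandra's actual FailureDetector implementation.
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> class SimplifiedPhi
> {
>     private static final int WINDOW = 1000;        // last 1000 heartbeat intervals
>     private static final double PHI_FACTOR = 1.0;  // stand-in for the constant scale factor
>
>     private final Deque<Long> intervalsMillis = new ArrayDeque<>();
>     private long lastHeartbeatMillis = -1;
>
>     /** Called whenever a heartbeat for the node arrives via gossip (directly or indirectly). */
>     void report(long nowMillis)
>     {
>         if (lastHeartbeatMillis >= 0)
>         {
>             intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
>             if (intervalsMillis.size() > WINDOW)
>                 intervalsMillis.removeFirst();
>         }
>         lastHeartbeatMillis = nowMillis;
>         // Note: the very next convict() check now sees a tiny "time since last
>         // heartbeat", so a node that was considered down is instantly "up" again.
>     }
>
>     /** Checked roughly once per gossip round (about every second). */
>     boolean convict(long nowMillis, double phiConvictThreshold)
>     {
>         if (intervalsMillis.isEmpty())
>             return false;
>         double meanInterval = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
>         double phi = PHI_FACTOR * (nowMillis - lastHeartbeatMillis) / meanInterval;
>         return phi > phiConvictThreshold;
>     }
> }
> {code}
>
> In other words, the score is simply the ratio of "silence so far" to "typical silence", checked about once a second, and reset to a low value by any single heartbeat.
>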
> Further, while the accrual FD paper talks about giving a non-binary score, and we do compute one, we don't actually use it except to trigger a binary up/down flap.
>
> Given the above, and general background, it seems to me that:
>
> * In terms of detecting that something goes down, the FD is barely one step above slapping a fixed timeout on heartbeats; essentially a timeout scaled relative to average historic latency.
> ** But on the other hand, it's also fuzzed (relative to simple TCP timeouts) due to variation in gossip propagation time, which is almost certainly higher than the variation in network latency between two nodes.
> * The gist of the previous two items is that the FD is really not doing anything advanced/magic or otherwise "opaque".
> * In addition, because any heartbeat from a downed node implies that it goes up immediately, the failure detector has very little ability to do something "non-trivial" to deal with partial failures, such as demanding that a flapping node show itself healthy for a while before going back up.
> ** Indeed, as far as I can tell, if a node's heartbeats are slowly growing worse, it might never get marked as down - if the rate of worsening is slow enough you'll just slowly inflate the past interval history and never hit the threshold. (Untested/unconfirmed.)
> * The FD is oblivious to input from real traffic (2,000,000 messages backed up? doesn't matter, the node is "up", even though neighboring nodes have nothing backed up). This is not necessarily wrong in any way, but it needs to be kept in mind when considering what to do *with* the FD.
>
> Now, CASSANDRA-3910 was recently filed where there is an attempt to use the FD for what I personally think is better dealt with in the request path (see that ticket).
>
> It seems to me that the FD as it works now is *definitely* not suitable for handling partial failures or for smoothly redirecting traffic anywhere, since we are converting the output of the underlying FD algo to a binary up/down state. Further, even if we directly propagated the current phi and used it to do relative weighting when sending requests, we would still have the instant return to low phi on the very next heartbeat. It is just not suitable, as currently implemented, for anything other than binary up/down flagging as far as I can tell.
>
> This reinforces, in my view, my skepticism towards CASSANDRA-2433, and my position in CASSANDRA-3294. We seem to be treating the failure detector as if it's doing something non-trivial, when in reality it isn't (gossip as a communication channel may be non-trivial, but the FD algorithm isn't). It seems to me the main benefit of the failure detector is that it allows us to scale to very large clusters; we do not need full-mesh ping-pong (whether at the app level or at the socket level) in order for up/down state to be communicated. This is a useful feature. However, it is actually a result of using gossip as the underlying channel, rather than due to anything specific to the FD algorithm (except insofar as the FD algorithm doesn't need more detailed or specific information than is reasonable to communicate over gossip).
>
> I believe that the FD *is* good at:
>
> * Efficiently (in terms of scaling to large clusters) allowing controlled shutdowns and "official" up/down marking.
> ** Useful for e.g. selecting hosts for streaming - but only in its capacity as an "official flag", not in the sense that it detects failure conditions.
> ** Nodes coming up from scratch can get a feel for the cluster immediately without having to go through an initial burst of failed messages (though currently we don't actually guarantee this anyway, because we don't wait for gossip to settle on start-up - but that's a separate issue).
>
> I believe that the FD is *not* good at:
>
> * Indicating whether to send a message to a node, or how often, except for the special case of "the node is completely down, don't even bother at all, ever".
> * Optimizing for request latency, at all. It is not an effective tool to mitigate request latencies.
> * Optimizing for avoiding heap growth as requests back up; this is a very separate concern that should take into account things like relative queue sizes, past history of real requests, etc.
> * Handling high latencies, node hiccups, or congestion.
>
> I think that the things it is good at are a legitimate job for the FD. The things I list under "not good at" are, I think, the job of other efforts like CASSANDRA-2540 and CASSANDRA-3294, which add intelligence to the way we handle requests. Even an infinitely smart FD will never work for these tasks as long as we retain the binary up/down output (unless you want to start talking about flapping up/down at high frequency). I also think that with a sufficiently good request path, we probably don't even need any active ping-pong at all, except maybe very seldom when we aren't sending any traffic to the node. So while using gossip for this is a cool application of it, I am unconvinced that we need active ping-pong to any great extent if we are sufficiently good at not routing requests to nodes of unknown status, and instantly prefer nodes that are actively responding over those that aren't (e.g. least pending requests as an input to routing; a rough sketch follows at the end).
>
> Finally, there is one additional, particularly interesting feature of the failure detector and its integration with gossip as it works now that I am aware of and feel is worth mentioning: my understanding is that the intention is for the FD, due to its coupling with gossip, to guarantee that we never send messages to a node whose impact on the token ring is somehow "incorrect". So for example, on a cluster partition you're supposed to be able to make topological changes during the partition and it will just magically resolve itself when you de-partition. (I would *personally* never ever trust this in production; others may disagree. But theoretically, if it works, it's a cool feature.)
>
> Thoughts?
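>
> (Postscript: to make the "prefer nodes that are actively responding" idea slightly more concrete, here is a minimal sketch of least-pending routing. This is not existing Cassandra code; the class and method names are made up purely for illustration, and a real implementation in the request path would also account for latency history, queue sizes, and so on.)
>
> {code:java}
> // Illustration only: pick the live replica with the fewest requests in flight.
> import java.net.InetAddress;
> import java.util.List;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.AtomicInteger;
>
> class LeastPendingRouter
> {
>     // Outstanding (sent but not yet answered) requests per replica.
>     private final Map<InetAddress, AtomicInteger> pending = new ConcurrentHashMap<>();
>
>     /** Choose the replica with the fewest requests currently in flight; null if none given. */
>     InetAddress pick(List<InetAddress> liveReplicas)
>     {
>         InetAddress best = null;
>         int bestPending = Integer.MAX_VALUE;
>         for (InetAddress replica : liveReplicas)
>         {
>             int inFlight = pending.computeIfAbsent(replica, r -> new AtomicInteger()).get();
>             if (inFlight < bestPending)
>             {
>                 best = replica;
>                 bestPending = inFlight;
>             }
>         }
>         return best;
>     }
>
>     void onSend(InetAddress replica)
>     {
>         pending.computeIfAbsent(replica, r -> new AtomicInteger()).incrementAndGet();
>     }
>
>     void onResponseOrTimeout(InetAddress replica)
>     {
>         pending.computeIfAbsent(replica, r -> new AtomicInteger()).decrementAndGet();
>     }
> }
> {code}
>
> A node that stops responding (hiccup, congestion, partial failure) naturally accumulates in-flight requests and stops being preferred, without any binary up/down decision having to be made.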