[ https://issues.apache.org/jira/browse/CASSANDRA-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401410#comment-13401410 ]
Brandon Williams commented on CASSANDRA-4375:
---------------------------------------------

What if we set MAX_INTERVAL_IN_MS to the greater of DD.getRpcTimeout() or the gossip interval * 2?

> FD incorrectly using RPC timeout to ignore gossip heartbeats
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-4375
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4375
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Peter Schuller
>            Priority: Critical
>
> Short version: you can't run a cluster with short RPC timeouts, because nodes just constantly flap up/down.
>
> Long version:
>
> CASSANDRA-3273 tried to fix a problem resulting from the way the failure detector works, but did so by introducing a much more severe bug: with RPC timeouts lower than the typical gossip propagation time, a cluster will constantly have all nodes flapping other nodes up and down.
>
> The cause is this:
> {code}
> +    // in the event of a long partition, never record an interval longer than the rpc timeout,
> +    // since if a host is regularly experiencing connectivity problems lasting this long we'd
> +    // rather mark it down quickly instead of adapting
> +    private final double MAX_INTERVAL_IN_MS = DatabaseDescriptor.getRpcTimeout();
> {code}
> And then:
> {code}
> -        tLast_ = value;
> -        arrivalIntervals_.add(interArrivalTime);
> +        if (interArrivalTime <= MAX_INTERVAL_IN_MS)
> +            arrivalIntervals_.add(interArrivalTime);
> +        else
> +            logger_.debug("Ignoring interval time of {}", interArrivalTime);
> {code}
> Using the RPC timeout to ignore unreasonably long intervals is not correct, as the RPC timeout is completely orthogonal to gossip propagation delay (see CASSANDRA-3927 for a quick description of how the FD works).
>
> In practice, the propagation delay ends up being in the 0-3 second range on a cluster with good local latency. With a low RPC timeout of, say, 200 ms, very few heartbeat updates come in fast enough to avoid being ignored by the failure detector. This in turn means that the FD records a completely skewed average heartbeat interval, which in turn means that nodes almost always get flapped down on interpret() unless they happen to have *just* had their heartbeat updated. Then they flap back up whenever the next heartbeat comes in (since the node gets brought up immediately).
>
> In our build, we are replacing the FD with an implementation that simply uses a fixed {{N}}-second time to convict, because this is just one of many ways in which the current FD hurts, while we still haven't found a way in which it actually helps relative to a trivial fixed-second conviction policy.
>
> For upstream, assuming people won't agree to change it to a fixed timeout, I suggest at minimum never using a value lower than something like 10 seconds when determining whether to ignore an interval. Slightly better would be to make it a config option.
>
> (I should note that if propagation delays are significantly off from the expected level, things other than the FD already break, such as the whole concept of {{RING_DELAY}}, which assumes the propagation time is roughly constant with, e.g., cluster size.)
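For reference, a minimal sketch of what the comment above might amount to in FailureDetector. The names DatabaseDescriptor.getRpcTimeout() and Gossiper.intervalInMillis are assumed from the Cassandra source of this era; this is an untested illustration of the idea, not a reviewed patch:

{code}
// Sketch only: take whichever is larger, the RPC timeout or two gossip
// rounds (assuming Gossiper.intervalInMillis is the per-round gossip
// interval), so a very short rpc_timeout_in_ms can no longer cause almost
// every inter-arrival interval to be ignored.
private final double MAX_INTERVAL_IN_MS =
    Math.max(DatabaseDescriptor.getRpcTimeout(), 2L * Gossiper.intervalInMillis);
{code}

With the default one-second gossip interval this keeps the ignore threshold at two seconds or more even when the RPC timeout is set very low, while leaving the existing behaviour unchanged for clusters whose RPC timeout already exceeds two gossip rounds.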