Joseph Lynch created CASSANDRA-14001:
----------------------------------------

             Summary: Gossip after node restart can take a long time to 
converge about "down" nodes in large clusters
                 Key: CASSANDRA-14001
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14001
             Project: Cassandra
          Issue Type: Improvement
          Components: Lifecycle
            Reporter: Joseph Lynch
            Priority: Minor


When nodes restart in a large cluster, they mark all nodes as "alive". This 
first calls {{markDead}}, then sends an {{EchoMessage}}, and only in the 
callback to that echo marks the node as alive. This works great, except when 
that initial echo fails for whatever reason and the node is marked as dead, in 
which case it can remain dead for a long while.
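
As a rough illustration of that failure mode, here is a minimal sketch of the 
pattern (this is not the actual {{Gossiper}} code; {{sendEcho}}, {{peerState}} 
and the {{CompletableFuture}} wiring are illustrative stand-ins for the real 
{{EchoMessage}} round trip and its callback):

{code:java}
import java.util.concurrent.CompletableFuture;

// Simplified sketch of the restart path described above (not Cassandra's
// actual Gossiper code): the endpoint is first marked dead, and only the
// echo callback can flip it back to alive.
public class EchoSketch
{
    enum State { ALIVE, DEAD }

    State peerState = State.ALIVE;

    void markAliveViaEcho()
    {
        peerState = State.DEAD;                       // markDead(...)
        CompletableFuture<Boolean> echo = sendEcho(); // hypothetical async echo helper
        echo.thenAccept(ok -> {
            if (ok)
                peerState = State.ALIVE;              // realMarkAlive(...) in the callback
            // If the echo is dropped or times out, no callback fires and the
            // peer stays DEAD until some later gossip round fixes it.
        });
    }

    CompletableFuture<Boolean> sendEcho()
    {
        return new CompletableFuture<>();             // placeholder; never completes here
    }
}
{code}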

We mostly see this on 100+ node clusters, almost always when nodes are in 
different datacenters with unreliable network connections (e.g., cross region 
in AWS), and I think it comes down to a combination of:
1. Only a node itself can mark another node as "UP"
2. Each gossip round, a node gossips with a dead node only with probability 
{{#dead / (#live + 1)}} (see the sketch below)
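
A minimal sketch of rule #2, assuming the probability check works roughly as 
described above (class and method names here are illustrative, not the actual 
{{Gossiper}} internals):

{code:java}
import java.util.Random;
import java.util.Set;

// Sketch of rule #2: each gossip round we only talk to an unreachable
// (dead) endpoint with probability #dead / (#live + 1).
public class UnreachableGossipSketch
{
    private final Random random = new Random();

    <T> void maybeGossipToUnreachable(Set<T> live, Set<T> unreachable)
    {
        double liveCount = live.size();
        double unreachableCount = unreachable.size();
        if (unreachableCount > 0)
        {
            double prob = unreachableCount / (liveCount + 1);
            if (random.nextDouble() < prob)
                gossipToRandomMember(unreachable);   // illustrative helper
        }
    }

    <T> void gossipToRandomMember(Set<T> endpoints) { /* send a gossip SYN to one of them */ }
}
{code}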

In particular the algorithm in #2 leads to long convergence times because the 
number of dead nodes is typically very small compared to the cluster size. My 
back-of-the-envelope model of this algorithm indicates that for a 100 node 
cluster this would take an average of ~50 seconds with a stdev of ~50 seconds, 
which means we might be waiting _minutes_ for the nodes to gossip with each 
other. I'm modeling this as the minimum of two [geometric 
distributions|https://en.wikipedia.org/wiki/Geometric_distribution] with 
parameter {{p = 1/#nodes}}, which yields a geometric distribution with 
parameter {{p = 1 - (1 - 1/#nodes)^2}}. So for a 100 node cluster:

{noformat}
100 node cluster =>
X = rounds until node1 gossips with node2 ~ geom(0.01)
Y = rounds until node2 gossips with node1 ~ geom(0.01)
Z = min(X, Y) ~ geom(1 - (1 - 0.01)^2) ≈ geom(0.02)
E[Z] = 1/0.02 = 50
V[Z] = (1 - 0.02)/(0.02)^2 = 2450

1000 node cluster =>
Z ~ geom(1 - (1 - 0.001)^2) ≈ geom(0.002)
E[Z] = 500
V[Z] = (1 - 0.002)/(0.002)^2 = 249500
{noformat}

Since we gossip every second, that means that in expectation these nodes would 
see each other after about a minute in a 100 node cluster, and after ~8 minutes 
in a 1000 node cluster. For 1000 node clusters the variance is astounding (a 
stdev of ~500 seconds), which means that in particularly unlucky cases we might 
be waiting tens of minutes or more before these nodes gossip with each other.
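
For anyone who wants to sanity check the numbers, here is a quick Monte Carlo 
of the min-of-two-geometrics model above (it only checks the math; it does not 
simulate the actual dead-node probability from #2):

{code:java}
import java.util.Random;

// Quick Monte Carlo check of the "min of two geometrics" model above:
// each second, node1 picks node2 with probability 1/N and vice versa;
// we measure how many rounds pass before either one picks the other.
public class GossipConvergenceSim
{
    public static void main(String[] args)
    {
        int nodes = 100;            // try 1000 as well
        int trials = 100_000;
        double p = 1.0 / nodes;
        Random random = new Random();

        long totalRounds = 0;
        long totalSquared = 0;
        for (int t = 0; t < trials; t++)
        {
            int rounds = 1;
            // loop continues only while neither node picks the other this round
            while (random.nextDouble() >= p && random.nextDouble() >= p)
                rounds++;
            totalRounds += rounds;
            totalSquared += (long) rounds * rounds;
        }
        double mean = (double) totalRounds / trials;
        double variance = (double) totalSquared / trials - mean * mean;
        // For N=100 this prints roughly mean=50, stdev=50; for N=1000,
        // roughly mean=500, stdev=500 (matching geom(~2/N)).
        System.out.printf("mean=%.1f rounds, stdev=%.1f rounds%n", mean, Math.sqrt(variance));
    }
}
{code}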

I'm thinking of writing a patch which either:
# Makes the gossip order a shuffled list that includes dead nodes, a la [SWIM 
gossip|https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf]. This would 
waste some rounds on dead nodes but would guarantee a linear bound on how long 
any node waits to be gossiped with (see the sketch after this list).
# Adds an endpoint that re-triggers gossip with all nodes. Operators could call 
this a few times after a restart if they detect a gossip inconsistency.
# Bounds the probability we gossip with a dead node from below at some 
reasonable number like 1/10. This might cause a lot of gossip load in large 
clusters when a node is actually down, but would also bound the variance.
# Something else?
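
To make option #1 concrete, here is a rough sketch of what the selection 
change could look like (a sketch only, not the WIP patch itself; the class and 
method names are illustrative): iterate over a shuffled list of _all_ known 
endpoints, live and dead, so the time until any particular node is contacted 
is linearly bounded:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of option #1: SWIM-style round-robin over a shuffled member list
// that includes dead nodes, so the time to gossip with any given node is
// linearly bounded instead of geometrically distributed.
public class ShuffledGossipOrderSketch<T>
{
    private List<T> order = new ArrayList<>();
    private int index = 0;

    /** Pick the next gossip target; reshuffle once we've gone through everyone. */
    synchronized T nextTarget(List<T> allKnownEndpoints)
    {
        if (index >= order.size())
        {
            order = new ArrayList<>(allKnownEndpoints); // live *and* dead nodes
            Collections.shuffle(order);
            index = 0;
        }
        return order.get(index++);
    }
}
{code}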

I've got a WIP 
[branch|https://github.com/apache/cassandra/compare/cassandra-3.11...jolynch:force_gossip]
 on 3.11 which implements options #1 and #2, but I can reduce/change/modify it 
as needed if people think there is a better way. The patch doesn't pass tests 
yet, but I'm not going to change/add tests unless we think moving to 
time-bounded gossip for down nodes is a good idea.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
