[jira] [Commented] (CASSANDRA-14001) Gossip after node restart can take a long time to converge about "down" nodes in large clusters
[ https://issues.apache.org/jira/browse/CASSANDRA-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421084#comment-16421084 ] Joseph Lynch commented on CASSANDRA-14001: -- After digging in deeply, I think the evidence indicates not an issue with Gossip itself, but rather with how we establish connections on startup. I think we're just hitting a combination of CASSANDRA-13993 and CASSANDRA-14001. I'm going to close this out since I don't think the issue is Gossip related. > Gossip after node restart can take a long time to converge about "down" nodes > in large clusters > --- > > Key: CASSANDRA-14001 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14001 > Project: Cassandra > Issue Type: Improvement > Components: Lifecycle > Reporter: Joseph Lynch > Priority: Minor > > When nodes restart in a large cluster, they mark all nodes as "alive", which > first calls {{markDead}} and then creates an {{EchoMessage}}, and in the > callback to that marks the node as alive. This works great, except when that > initial echo fails for whatever reason and that node is marked as dead, in which > case it will remain dead for a long while. > We mostly see this on 100+ node clusters, and almost always when nodes are in > different datacenters that have unreliable network connections (e.g., cross > region in AWS), and I think that it comes down to a combination of: > 1. Only a node itself can mark another node as "UP" > 2. Nodes only gossip with dead nodes with probability {{#dead / (#live +1)}} > In particular, the algorithm in #2 leads to long convergence times because the > number of dead nodes is typically very small compared to the cluster size. My > back-of-the-envelope model of this algorithm indicates that for a 100 node > cluster this would take an average of ~50 seconds with a stdev of ~50 seconds, > which means we might be waiting _minutes_ for the nodes to gossip with each > other. 
I'm modeling this as the minimum of two [geometric > distributions|https://en.wikipedia.org/wiki/Geometric_distribution], each with > parameter {{p=1/#nodes}}, yielding a geometric distribution with parameter > {{p=1-(1-1/#nodes)^2}}. So for a 100 node cluster: > {noformat} > 100 node cluster => > X = Pr(node1 gossips with node2) = geom(0.01) > Y = Pr(node2 gossips with node1) = geom(0.01) > Z = min(X, Y) = geom(1 - (1 - 0.01)^2) = geom(0.02) > E[Z] = 1/0.02 = 50 > V[Z] = (1-0.02)/(0.02)^2 = 2450 > 1000 node cluster => > Z = geom(1 - (1 - 0.001)^2) = geom(0.002) > E[Z] = 500 > V[Z] = (1-0.002)/(0.002)^2 = 249500 > {noformat} > Since we gossip every second, that means that in expectation in a 100 node > cluster these nodes would see each other after about a minute, and in a > thousand node cluster after ~8 minutes. For 100 node clusters the variance > is astounding, and means that in particular edge cases we might be waiting > hours before these nodes gossip with each other. > I'm thinking of writing a patch which either: > # Makes gossip order a shuffled list that includes dead nodes a la [swim > gossip|https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf]. This would > make it so that we waste some rounds on dead nodes but guarantee a linear > bound on gossip convergence. > # Adds an endpoint that re-triggers gossip with all nodes. Operators could > call this after a restart a few times if they detect a gossip inconsistency. > # Bounds the probability we gossip with a dead node at some reasonable > number like 1/10. This might cause a lot of gossip load when a > node is actually down in large clusters, but would also act to bound the > variance. > # Something else? > I've got a WIP > [branch|https://github.com/apache/cassandra/compare/cassandra-3.11...jolynch:force_gossip] > on 3.11 which implements options #1 and #2, but I can reduce/change/modify > as needed if people think there is a better way. 
The patch doesn't pass tests > yet, but I'm not going to change/add the tests unless we think moving to > time-bounded gossip for down nodes is a good idea. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
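The back-of-the-envelope numbers in the description are easy to sanity-check numerically. Below is a minimal Monte Carlo sketch (illustrative Python, not Cassandra code; the one-gossip-round-per-second and uniform-peer-selection assumptions are taken from the model above):

```python
import random

def seconds_until_mutual_gossip(n_nodes: int, trials: int = 50_000,
                                seed: int = 42) -> float:
    """Estimate E[Z], where Z = min of two geometric waits: each second,
    each of two specific nodes independently picks the other as a gossip
    target with probability p = 1/n_nodes, and we wait until either pick
    succeeds."""
    rng = random.Random(seed)
    # Per second, contact happens unless BOTH independent picks miss:
    p_round = 1.0 - (1.0 - 1.0 / n_nodes) ** 2
    total = 0
    for _ in range(trials):
        seconds = 1
        while rng.random() >= p_round:
            seconds += 1
        total += seconds
    return total / trials

# Closed form for n_nodes=100: p_round = 1 - 0.99^2 = 0.0199,
# so E[Z] = 1/0.0199 ~= 50.3 seconds, matching the ~50s average above.
```

The simulated average agrees with the geometric model's expectation, which supports using the closed form directly for larger clusters instead of simulating them.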
[ https://issues.apache.org/jira/browse/CASSANDRA-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16327795#comment-16327795 ] Joseph Lynch commented on CASSANDRA-14001: -- [~jasonstack] thanks for the reply :-) If I understand you correctly, you're saying that since the remote node is still heartbeating to other nodes (versions are increasing), we'll end up in [{{applyStateLocally}}|https://github.com/apache/cassandra/blob/6d324f9d769f24ac209f6ea7649fee02b0200ba0/src/java/org/apache/cassandra/gms/Gossiper.java#L1153-L1166], and that will cause us to call {{markAlive}} again, which sends another Echo after marking dead? That makes sense; let me spend some more time trying to get a reproduction so we can narrow this down. I can simulate a partitioned cluster using {{CCM}} and {{ipfw}} and use {{wireshark}} to determine if Cassandra continues sending echo messages. bq. Roughly speaking, there should be one node ( N * #dead / (#live +1) ) in the cluster that will talk to the "dead" node per second. Convergence time is O( log2(N) ) where N is the size of the cluster. bq. I suppose that the long convergence time you observed is not related to gossip peer selection. Hm, even in that case isn't it probabilistically O(log2(N))? I guess we'd need something like the broadcast trees proposed in CASSANDRA-12345 to guarantee O(log2(N)) convergence? I'll work on modeling the variance of the existing system, since I think that may be the issue (rather than the expected value). bq. Could you share the phi_convict_threshold value? We don't tune {{phi_convict_threshold}}, so I believe it's the default of 8. 
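On the variance question raised in this comment, the min-of-two-geometrics model from the issue description has a closed-form tail, P(Z > t) = (1 - p_round)^t. A small illustrative sketch (same modeling assumptions as the description, not measured Cassandra behavior):

```python
def still_waiting_prob(n_nodes: int, seconds: int) -> float:
    """P(two specific nodes have still not gossiped with each other
    after `seconds` one-second rounds) = (1 - p_round)^seconds,
    under the min-of-two-geometrics model."""
    p_round = 1.0 - (1.0 - 1.0 / n_nodes) ** 2
    return (1.0 - p_round) ** seconds

# 100-node cluster: still_waiting_prob(100, 180) ~= 0.027, i.e. a few
# percent of restarts would still be waiting after 3 minutes; at 1000
# nodes the same 3-minute wait is more likely than not.
```

This makes the variance concern concrete: the expected wait may be tolerable, but the geometric tail decays slowly enough that multi-minute waits are routine at larger cluster sizes.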
[ https://issues.apache.org/jira/browse/CASSANDRA-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312809#comment-16312809 ] ZhaoYang commented on CASSANDRA-14001: -- bq. 1. Only a node itself can mark another node as "UP" bq. 2. Nodes only gossip with dead nodes with probability #dead / (#live +1) You are right, but node-A can use the exchanged gossip status (containing not only node-B's status, but also the statuses of other nodes that node-B knows about) from node-B to determine UP/DOWN of node-C, assuming node-B has gossiped with node-C. So node-A doesn't need to directly gossip with node-C to determine node-C's status, like an epidemic infection... Roughly speaking, there should be one node ( N * #dead / (#live +1) ) in the cluster that will talk to the "dead" node per second. Convergence time is {{O( log2(N) )}} where N is the size of the cluster. I suppose that the long convergence time you observed is not related to gossip peer selection. Could you share the phi_convict_threshold value? 
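The O(log2(N)) epidemic-convergence claim can be illustrated with a toy push-gossip simulation. This is a sketch under the simplifying assumption that every informed node pushes the new status to one uniformly random peer per round; it is not Cassandra's actual peer-selection logic:

```python
import random

def rounds_to_inform_all(n: int, seed: int = 7) -> int:
    """Toy push-gossip: each round, every already-informed node tells
    one uniformly random peer. Returns the number of rounds until all
    n nodes are informed; this completes in O(log n) rounds
    (roughly log2(n) + ln(n) for push-only rumor spreading)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 learns the status change first
    rounds = 0
    while len(informed) < n:
        rounds += 1
        # Only nodes informed at the start of the round push this round:
        for _ in range(len(informed)):
            informed.add(rng.randrange(n))
    return rounds
```

Note that this only shows the epidemic phase is fast once some node is gossiping the new status; it does not model the {{#dead / (#live +1)}} dead-peer selection that delays the first direct contact, which is the part the geometric model in the issue description targets.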
[ https://issues.apache.org/jira/browse/CASSANDRA-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243223#comment-16243223 ] Joseph Lynch commented on CASSANDRA-14001: -- I think CASSANDRA-13993 might help with this, but I _think_ it's solving a slightly different problem. 