[ 
https://issues.apache.org/jira/browse/CASSANDRA-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500565#comment-14500565
 ] 

Jason Brown commented on CASSANDRA-9206:
----------------------------------------

TBH, I'm kinda +0 on this ticket. While I agree the original motivation behind 
the probabilistic desire to contact seeds is a bit spurious/funky/undocumented, 
I'm not completely convinced adding more traffic will help much in cluster 
convergence. For small clusters (fewer than 20 nodes), there will be near-zero 
impact, so I don't have much of a problem in that case - but then, they probably 
don't suffer from the problems we're trying to address here. 

However, for larger clusters (greater than 500 nodes), I think the extra 
messaging might be an issue. The problem I see is that when things slow down, 
and you have a very low number of seed nodes (i.e. fewer than 5), the gossip 
messages will back up on those nodes and we'll spend a lot of cycles just trying 
to broadcast the same redundant data over and over again. What's worse is that 
the operator won't really have any great insight to discover that gossip (our 
membership dissemination protocol) is contributing to things going weird; and, 
thus, the advice to "add more seeds" is neither obvious nor simple, in some cases. 
(I'm thinking of Netflix's Priam programmed to use up to two nodes per 
availability zone as seeds. It would require a non-trivial effort to change 
that core assumption, fwiw.) Further, in 3.0, we've now split the OTCP 
(OutboundTcpConnectionPool) by message size, not function. Thus, all the 
excess gossip messages on the seeds could start interfering with the normal 
read/write traffic.
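
To put rough numbers on the per-seed load, here is a back-of-the-envelope
sketch. It is not code from Cassandra or from the attached patch: the class
and method names are made up for illustration, and it assumes all nodes are
counted in live + unreachable and that the contacted seed is picked uniformly
at random.

```java
// Hypothetical helper for illustration; not part of Cassandra or the attached patch.
final class SeedGossipLoad {

    // Current policy: each node gossips to a seed with probability
    // seeds / (live + unreachable). Assumes every node is counted in that sum.
    static double sessionsPerRoundProbabilistic(int nodes, int seeds) {
        double probability = Math.min(1.0, seeds / (double) nodes);
        return nodes * probability;
    }

    // Proposed policy: every node starts one seed session in every round.
    static double sessionsPerRoundAlways(int nodes) {
        return nodes;
    }

    // Assumes the target seed is picked uniformly at random among the seeds.
    static double perSeed(double sessionsPerRound, int seeds) {
        return sessionsPerRound / seeds;
    }

    public static void main(String[] args) {
        int seeds = 2; // a deployment with very few seeds
        for (int nodes : new int[] {20, 500, 1000}) {
            System.out.printf("nodes=%d  probabilistic=%.1f/seed  always=%.1f/seed%n",
                    nodes,
                    perSeed(sessionsPerRoundProbabilistic(nodes, seeds), seeds),
                    perSeed(sessionsPerRoundAlways(nodes), seeds));
        }
    }
}
```

Under those assumptions, the expected load on each seed goes from roughly one
session per round to nodes / seeds sessions per round, which is why a handful
of seeds in a large cluster is the worrying case.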

Also, we will not create a spanning tree by increasing the number of nodes 
contacted during a gossip round. What that does is increase the fanout (the 
number of nodes contacted) from a fixed size of 1 to 2. We still have randomly 
selected peers at every step, not a static or dynamic tree that covers all 
nodes from a given sender.

Lastly, there is a minor error in the number of messages to be generated: in a 
cluster of 1000 nodes, we will start 1000 more gossip sessions to the seeds, 
and each gossip session consists of 3 messages. Thus, the message count is 
3000. If you are actually running a cluster that large, and the network can't 
sustain that extra load, you're probably screwed anyway.
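
As a quick check of that arithmetic, a minimal sketch (the class and method
names are hypothetical and only the figures above are used: one extra session
per node, 3 messages per session):

```java
// Hypothetical helper for illustration; not part of Cassandra or the attached patch.
final class GossipMessageCount {

    // Each gossip session is a three-message exchange (SYN, ACK, ACK2).
    static final int MESSAGES_PER_SESSION = 3;

    // One extra seed-directed session per node, each costing three messages.
    static long extraMessages(int nodes) {
        return (long) nodes * MESSAGES_PER_SESSION;
    }

    public static void main(String[] args) {
        int nodes = 1000;
        System.out.println("extra sessions: " + nodes);
        System.out.println("extra messages: " + extraMessages(nodes));
    }
}
```

For 1000 nodes this gives 3000 extra messages rather than the ~1000 quoted in
the ticket description.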

While this might help in convergence (primarily for heartbeat dissemination), 
the trade-off is more (non-directed) traffic. All in all (and thinking 
while I'm typing), this patch is probably fine for the vast majority of use 
cases, and if anything, the clarity in the code that will come from it should 
be worthwhile.

> Remove seed gossip probability
> ------------------------------
>
>                 Key: CASSANDRA-9206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.1.5
>
>         Attachments: 9206.txt
>
>
> Currently, we use probability to determine whether a node will gossip with a 
> seed:
> {noformat} 
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
> {noformat}
> I propose that we remove this probability, and instead *always* gossip with a 
> seed.  This of course means increased traffic and processing on the seed(s), 
> but even a 1000 node cluster with a single seed will only put ~1000 messages 
> per second on the seed, which is virtually nothing.  Should it become a 
> problem, the solution is simple: add more seeds.  Since seeds will also 
> always gossip with each other, this effectively gives us a poor man's 
> spanning tree, with the only cost being removing a few lines of code, and 
> should greatly improve our gossip convergence time, especially in large 
> clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
