[ https://issues.apache.org/jira/browse/CASSANDRA-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500565#comment-14500565 ]
Jason Brown commented on CASSANDRA-9206:
----------------------------------------

TBH, I'm kinda +0 on this ticket. While I agree the original motivation behind the probabilistic desire to contact seeds is a bit spurious/funky/undocumented, I'm not completely convinced that adding more traffic will help much in cluster convergence.

For small clusters (less than 20 nodes), there will be near zero impact, so I don't have much of a problem in that case - but then, they probably don't suffer from the problems we're trying to address here.

However, for larger clusters (greater than 500 nodes), I think the extra messaging might be an issue. The problem I see is that when things slow down, and you have a very low number of seed nodes (i.e. less than 5), the gossip messages will back up on those nodes and we'll spend a lot of cycles just trying to broadcast the same redundant data over and over again. What's worse is that the operator won't really have any great insight to discover that gossip (our membership dissemination protocol) is contributing to things going weird; and, thus, the advice to "add more seeds" is neither obvious nor simple, in some cases. (I'm thinking of Netflix's Priam, which is programmed to use up to two nodes per availability zone as seeds. It would require a non-trivial effort to change that core assumption, fwiw.)

Further, in 3.0, we've now split the OTCP by message size, not function. Thus, all the excess gossip messages on the seeds could start interfering with the normal read/write traffic.

Also, we will not create a spanning tree by increasing the number of nodes contacted during a gossip round. What that does is increase the fanout (the number of nodes contacted) from a fixed size of 1 to 2. We still have randomly selected peers at every step, and neither a static nor a dynamic tree that covers all nodes from a given sender.

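To put rough numbers on the traffic concern, here is a back-of-the-envelope sketch in Java. It is not Cassandra code: the class and method names are hypothetical, it assumes every node knows about every other node, it ignores the other conditions the real Gossiper applies before contacting a seed, and it takes the three-messages-per-session figure from this comment. It compares the expected number of seed-directed gossip sessions per round under the current rule (probability = seeds / known endpoints, as in the snippet quoted from the ticket) with the proposed rule (every node always contacts a seed).

```java
// Hypothetical helper for estimating seed-directed gossip load per round.
// Not part of Cassandra; all names here are illustrative assumptions.
final class SeedGossipLoad {
    // Per this comment, one gossip session consists of 3 messages.
    static final int MESSAGES_PER_SESSION = 3;

    // Probability used by the ticket's snippet: seeds / (live + unreachable).
    static double seedProbability(int seeds, int liveEndpoints, int unreachableEndpoints) {
        int known = liveEndpoints + unreachableEndpoints;
        if (known <= 0) {
            return 0.0; // nothing known yet, so nothing to divide by
        }
        return Math.min(1.0, seeds / (double) known);
    }

    // Expected seed-directed sessions per round under the current rule,
    // assuming (simplification) that each node knows all `nodes` endpoints.
    static double expectedSessionsProbabilistic(int nodes, int seeds) {
        return nodes * seedProbability(seeds, nodes, 0);
    }

    // Seed-directed sessions per round if every node always contacts one seed.
    static long sessionsAlways(int nodes) {
        return nodes;
    }

    // Total messages generated by a given number of sessions.
    static long messages(long sessions) {
        return sessions * MESSAGES_PER_SESSION;
    }

    public static void main(String[] args) {
        int nodes = 1000;
        for (int seeds : new int[] {1, 2, 5}) {
            double current = expectedSessionsProbabilistic(nodes, seeds);
            long proposed = sessionsAlways(nodes);
            System.out.printf(
                "seeds=%d current=%.1f sessions/round, proposed=%d sessions/round "
                    + "(%d messages, %.1f sessions per seed)%n",
                seeds, current, proposed, messages(proposed), proposed / (double) seeds);
        }
    }
}
```

Under these assumptions the current rule yields roughly as many seed-directed sessions per round as there are seeds, while the proposal yields one per node, so the per-seed load scales with cluster size divided by seed count.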
Lastly, there is a minor error in the number of messages to be generated: in a cluster of 1000 nodes, we will start 1000 more gossip sessions to the seeds, and each gossip session is comprised of 3 messages. Thus, the message count is 3000. If you are actually running a cluster that large, and the network can't sustain that extra load, you're probably screwed anyway. While this might help in convergence (primarily for heartbeat dissemination), the trade-off is more (non-directed) traffic.

All in all (and thinking while I'm typing), this patch is probably fine for the vast majority of use cases, and if anything, the clarity in the code that will come from it should be worthwhile.

> Remove seed gossip probability
> ------------------------------
>
>                 Key: CASSANDRA-9206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.1.5
>
>         Attachments: 9206.txt
>
>
> Currently, we use probability to determine whether a node will gossip with a
> seed:
> {noformat}
>         double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>         double randDbl = random.nextDouble();
>         if (randDbl <= probability)
>             sendGossip(prod, seeds);
> {noformat}
> I propose that we remove this probability, and instead *always* gossip with a
> seed. This of course means increased traffic and processing on the seed(s),
> but even a 1000 node cluster with a single seed will only put ~1000 messages
> per second on the seed, which is virtually nothing. Should it become a
> problem, the solution is simple: add more seeds. Since seeds will also
> always gossip with each other, this effectively gives us a poor man's
> spanning tree, with the only cost being removing a few lines of code, and
> should greatly improve our gossip convergence time, especially in large
> clusters.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)