Re: [Discuss] num_tokens default in Cassandra 4.0

2020-07-08 Thread Jeremy Hanna
Just to close the loop on this, https://issues.apache.org/jira/browse/CASSANDRA-13701 is getting tested now. The project testing will get updated to utilize the new defaults (both num_tokens and the new allocation algorithm, enabled by uncommenting allocate_tokens_for_local_replication_factor: 3).

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-03-31 Thread Jeremy Hanna
As discussed, let's go with 16. Speaking with Anthony privately as well, I had forgotten that some of the analysis that Branimir had initially done on the skew and allocation may have been internal to DataStax so I should have mentioned that previously. Thanks to Mick, Alex, and Anthony for

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-03-10 Thread Mick Semb Wever
> I propose we drop it to 16 immediately. I'll add the production docs in CASSANDRA-15618 with notes on token count, the reasons why you'd want 1, 4, or 16. As a follow up, if we can get a token simulation written we can try all sorts of topologies with whatever token algorithms we

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-03-09 Thread Jon Haddad
There's a lot going on here... hopefully I can respond to everything in a coherent manner. > Perhaps a simple way to avoid this is to update the random allocation algorithm to re-generate tokens when the ranges created do not have a good size distribution? Instead of using random tokens for the
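A minimal sketch of the quoted idea (re-drawing random tokens when the resulting ranges are badly sized), not Cassandra's actual allocator. The function names, the max/min range-size ratio used as the quality measure, and the thresholds are all assumptions made for illustration; only the 64-bit Murmur3 token space is taken as given.

```python
import random

MIN_TOKEN = -2**63        # Murmur3Partitioner token bounds
MAX_TOKEN = 2**63 - 1
RING_SIZE = 2**64


def range_sizes(tokens):
    """Size of the ring range ending at each token, including the wrap-around range."""
    ordered = sorted(tokens)
    if len(ordered) == 1:
        return [RING_SIZE]  # a single token owns the whole ring
    # ordered[i - 1] with i == 0 is the last token, which gives the wrap-around gap.
    return [(t - ordered[i - 1]) % RING_SIZE for i, t in enumerate(ordered)]


def pick_tokens(existing, num_tokens, max_ratio=4.0, max_attempts=1000, rng=None):
    """Draw random tokens repeatedly and keep the draw with the most even ranges.

    Hypothetical helper: stops early once largest/smallest range <= max_ratio,
    otherwise returns the best draw seen after max_attempts.
    """
    if rng is None:
        rng = random.Random()
    best, best_ratio = None, float("inf")
    for _ in range(max_attempts):
        candidate = [rng.randrange(MIN_TOKEN, MAX_TOKEN + 1) for _ in range(num_tokens)]
        sizes = range_sizes(list(existing) + candidate)
        smallest = min(sizes)
        ratio = max(sizes) / smallest if smallest > 0 else float("inf")
        if best is None or ratio < best_ratio:
            best, best_ratio = candidate, ratio
        if ratio <= max_ratio:
            break
    return best, best_ratio


if __name__ == "__main__":
    tokens, ratio = pick_tokens([], 16, rng=random.Random(1))
    print(f"largest/smallest range ratio after retries: {ratio:.1f}")
```

Note that this measures evenness per range rather than per node, and ignores replication; a real allocator would need to account for both.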

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-03-09 Thread Paulo Motta
Great investigation, good job guys! > Personally I would have liked to have seen even more iterations. While 14 run iterations gives an indication, the average of randomness is not what is important here. What concerns me is the consequence to imbalances as the cluster grows when you're very

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-21 Thread Mick Semb Wever
The appeal to 'perfect is the enemy...' is appreciated. But I (we) have seen from experience that this is about what is good rather than what is perfect. I'm not suggesting we create a foolproof system, just one that is safe against what we know happens all too often in production systems. I

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-19 Thread Jon Haddad
Joey Lynch had a good idea - that if the allocate tokens for RF isn't set we use 1 as the RF. I suggested we take it a step further and use the rack count as the RF if it's not set. This should take care of most clusters even if they don't set the RF, and will handle the uneven distribution when
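The fallback described above could look roughly like this. It is a sketch of the proposal as worded in this message, not code from the Cassandra tree; the function name and parameters are hypothetical.

```python
from typing import Optional


def effective_allocation_rf(configured_rf: Optional[int], rack_count: int) -> int:
    """Replication factor the token allocator would optimise for.

    configured_rf stands in for allocate_tokens_for_local_replication_factor
    (None when the operator left it unset); rack_count is the number of racks
    in the local datacenter. Both names are assumptions for this sketch.
    """
    if configured_rf is not None:
        return configured_rf
    # Unset: fall back to the rack count, which is 1 for a single-rack cluster,
    # so this also covers the simpler "use 1 as the RF" suggestion.
    return max(1, rack_count)
```

With one rack and nothing configured this yields 1; with three racks and nothing configured it yields 3.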

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-19 Thread Jeremiah Jordan
If you don’t know what you are doing you will have one rack which will also be safe. If you are setting up racks then you most likely read something about doing that, and should also be fine. This discussion has gone off the rails 100 times with what ifs that are “letting perfect be the enemy

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-18 Thread Jeremiah D Jordan
+1 for 8 + algorithm assignment being the default. Why do we have to assume random assignment? If someone turns off algorithm assignment they are changing away from defaults, so they should also adjust the num tokens. -Jeremiah > On Feb 18, 2020, at 1:44 AM, Mick Semb Wever wrote: > > -1 >

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-18 Thread Joshua McKenzie
> > Discussions here and on slack have brought up a number of important > concerns. Sounds like we're letting the perfect be the enemy of the good. Is anyone arguing that 256 is a better default than 16? Or is the fear that going to 16 now would make a default change in, say, 5.0 more painful?

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-18 Thread Ben Slater
In case it helps move the decision along, we moved to 16 vnodes as default in Nov 2018 and haven't looked back (many clusters from 3-100s of nodes later). The testing we did in making that decision is summarised here: https://www.instaclustr.com/cassandra-vnodes-how-many-should-i-use/

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Mick Semb Wever
-1 Discussions here and on slack have brought up a number of important concerns. I think those concerns need to be summarised here before any informal vote. It was my understanding that some of those concerns may even be blockers to a move to 16. That is we have to presume the worst case

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Rahul Singh
+1 on 8 rahul.xavier.si...@gmail.com http://cassandra.link The Apache Cassandra Knowledge Base. On Feb 17, 2020, 5:20 PM -0500, Erick Ramirez , wrote: > +1 on 8 tokens. I'd personally like us to be able to move this along pretty > quickly as it's confusing for users looking for direction.

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-17 Thread Erick Ramirez
+1 on 8 tokens. I'd personally like us to be able to move this along pretty quickly as it's confusing for users looking for direction. Cheers! On Tue, 18 Feb 2020, 9:14 am Jeremy Hanna, wrote: > I just wanted to close the loop on this if possible. After some discussion > in slack about various

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-03 Thread Jon Haddad
I think it's a good idea to take a step back and get a high level view of the problem we're trying to solve. First, high token counts result in decreased availability as each node has data overlap with more nodes in the cluster. Specifically, a node can share data with RF-1 * 2 *
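To make the overlap point concrete, here is a small simulation that counts how many other nodes each node shares at least one replicated range with. It is a sketch under simplifying assumptions that are not from the thread: random token assignment, a single datacenter, no racks, and replicas placed on the next distinct nodes clockwise around the ring. All function names are hypothetical.

```python
import random


def replica_peers(num_nodes, num_tokens, rf, seed=0):
    """Map each node to the set of other nodes it shares at least one range with."""
    rng = random.Random(seed)
    ring = sorted(
        (rng.randrange(-2**63, 2**63), node)
        for node in range(num_nodes)
        for _ in range(num_tokens)
    )
    owners = [node for _, node in ring]
    peers = {node: set() for node in range(num_nodes)}
    for start in range(len(owners)):
        # Walk clockwise from this token, collecting the first rf distinct nodes.
        replicas = []
        i = start
        while len(replicas) < rf and i < start + len(owners):
            node = owners[i % len(owners)]
            if node not in replicas:
                replicas.append(node)
            i += 1
        for a in replicas:
            peers[a].update(b for b in replicas if b != a)
    return peers


def average_peer_count(num_nodes, num_tokens, rf=3, seed=0):
    """Average number of distinct peers per node."""
    peers = replica_peers(num_nodes, num_tokens, rf, seed)
    return sum(len(p) for p in peers.values()) / num_nodes


if __name__ == "__main__":
    for v in (1, 4, 16, 256):
        print(v, average_peer_count(100, v))
```

With a single token per node and RF=3, every node overlaps with exactly four others (two on each side); raising the token count pushes the peer count toward every other node in the cluster, which is the availability concern being raised.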

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Jeremy Hanna
I think Mick and Anthony make some valid operational and skew points for smaller/starting clusters with 4 num_tokens. There’s an arbitrary line between small and large clusters but I think most would agree that most clusters are on the small to medium side. (A small nuance is afaict the

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Jeff Jirsa
On Fri, Jan 31, 2020 at 11:25 AM Joseph Lynch wrote: > I think that we might be bikeshedding this number a bit because it is easy > to debate and there is not yet one right answer. > https://www.youtube.com/watch?v=v465T5u9UKo

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Joseph Lynch
I think that we might be bikeshedding this number a bit because it is easy to debate and there is not yet one right answer. I hope we recognize either choice (4 or 16) is fine in that users can always override us and we can always change our minds later or better yet improve allocation so users

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Carl Mueller
"large/giant clusters and admins are the target audience for the value we select" There are reasons aside from massive scale to pick Cassandra, but the primary reason Cassandra is selected technically is to support horizontally scaling to large clusters. Why pick a value that once you reach scale

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Carl Mueller
edit: 4 is bad at small cluster sizes and could scare off adoption On Fri, Jan 31, 2020 at 12:15 PM Carl Mueller wrote: > "large/giant clusters and admins are the target audience for the value we > select" > > There are reasons aside from massive scale to pick cassandra, but the > primary

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Michael Shuler
On 1/31/20 9:58 AM, Dimitar Dimitrov wrote: one corollary of the way the algorithm works (or more precisely might not work) with multiple seeds or simultaneous multi-node bootstraps or decommissions, is that a lot of dtests start failing due to deterministic token conflicts. I wasn't able to fix

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Carl Mueller
So why even have virtual nodes at all, why not work on improving single token approaches so that we can support cluster doubling, which IMO would enable cassandra to more quickly scale for volatile loads? It's my guess/understanding that vnodes eliminate the token rebalancing that existed back in

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Dimitar Dimitrov
Hey all, At some point not too long ago I spent some time trying to make the token allocation algorithm the default. I didn't foresee it, although it might be obvious for many of you, but one corollary of the way the algorithm works (or more precisely might not work) with multiple seeds or

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Joshua McKenzie
> > We should be using the default value that benefits the most people, rather > than an arbitrary compromise. I'd caution we're talking about the default value *we believe* will benefit the most people according to our respective understandings of C* usage. Most clusters don't shrink, they

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Alexander Dejanovski
While I (mostly) understand the maths behind using 4 vnodes as a default (which really is a question of extreme availability), I don't think they provide noticeable performance improvements over using 16, while 16 vnodes will protect folks from imbalances. It is very hard to deal with unbalanced

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Mick Semb Wever
> TLDR, based on availability concerns, skew concerns, operational concerns, and based on the fact that the new allocation algorithm can be configured fairly simply now, this is a proposal to go with 4 as the new default and the allocate_tokens_for_local_replication_factor set to 3.

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-30 Thread Jon Haddad
Yes, I'm against it. We should be using the default value that benefits the most people, rather than an arbitrary compromise. Most clusters don't shrink, they stay the same size or grow. I'd say 90% or more fall in this category. Let's do the right thing by default and include good comments that

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-30 Thread Joseph Lynch
Any objections to the compromise of 16 as proposed in Chris's original patch? -Joey On Thu, Jan 30, 2020, 3:47 PM Anthony Grasso wrote: > I think lowering the number of tokens is a great idea! Similar to Jon, when > I have reduced the number of tokens for clients it has been improvement in >

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-30 Thread Anthony Grasso
I think lowering the number of tokens is a great idea! Similar to Jon, when I have reduced the number of tokens for clients there has been an improvement in repair performance. I am concerned that the proposed default value for num_tokens is too low. If you set up a cluster using the proposed defaults,

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-30 Thread Jon Haddad
Larger clusters are where high token counts do the most damage. That's why it's such a problem. You start out with a small cluster using 256, as you grow into the hundreds it becomes more and more unstable. On Thu, Jan 30, 2020, 8:19 AM onmstester onmstester wrote: > Shouldn't we consider the

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-30 Thread onmstester onmstester
Shouldn't we consider the cluster size to configure num_tokens? For example, is it OK to use num_tokens=4 for a cluster of more than 100 nodes? Another question that is not so relevant to this: when we use the token assignment algorithm (the new/non-random one) for a specific

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-29 Thread Jeremy Hanna
The new default wouldn't be retroactively set for 3.x, but the same principles apply. The new algorithm is in 3.x as well as the simplification of the configuration. So no reason not to use the same configuration on 3.x. > On Jan 30, 2020, at 4:34 AM, Chen-Becker, Derek > wrote: > > Does

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-29 Thread Chen-Becker, Derek
Does the same guidance apply to 3.x clusters? I read through the JIRA ticket linked below, along with tickets that it links to, but it's not clear that the new allocation algorithm is available in 3.x or if there are other reasons that this would be problematic. Thanks, Derek On 1/29/20,

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-29 Thread Jon Haddad
I've put a lot of my previous clients on 4 tokens, all of which have resulted in a major improvement. I wouldn't use any more than 4 except under some pretty unusual circumstances. Jon On Wed, Jan 29, 2020, 11:18 AM Ben Bromhead wrote: > +1 to reducing the number of tokens as low as possible

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-29 Thread Ben Bromhead
+1 to reducing the number of tokens as low as possible for availability issues. 4 lgtm On Wed, Jan 29, 2020 at 1:14 AM Dinesh Joshi wrote: > Thanks for restarting this discussion Jeremy. I personally think 4 is a > good number as a default. I think whatever we pick, we should have enough >

Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-28 Thread Dinesh Joshi
Thanks for restarting this discussion Jeremy. I personally think 4 is a good number as a default. I think whatever we pick, we should have enough documentation for operators to make sense of the new defaults in 4.0. Dinesh > On Jan 28, 2020, at 9:25 PM, Jeremy Hanna wrote: > > I wanted to

[Discuss] num_tokens default in Cassandra 4.0

2020-01-28 Thread Jeremy Hanna
I wanted to start a discussion about the default for num_tokens that we'd like for people starting in Cassandra 4.0. This is for ticket CASSANDRA-13701 (which has been duplicated a number of times, most recently by me). TLDR, based on