2020-03-09 4.0 Status

2020-03-09 Thread Jon Meredith
Link to JIRA board:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA

It's been a week of toiling on the tasks we need to ship a release --
fixing bugs and flaky tests.

We've had 0 new tickets opened against 4.0 since the last status email (6d).
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=1661=1670

We've closed out 10 tickets since the last status email (6d).
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=1671

That's a net delta of -10 tickets, leaving a grand total of 98 tickets
currently unresolved.


The cumulative flow diagram continues to show healthy progress:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=reporting=cumulativeFlowDiagram=939=936=931=1505=1506=1514=1509=1512=1507=90

[Unassigned Tickets]

We have 2 alpha, 3 beta, and 17 RC tickets without assignees.
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=CASSANDRA=CASSANDRA-15338=1658=1661

A couple of test failures would be good to investigate; meanwhile, the two
BufferPool tickets still remain unassigned.

CASSANDRA-15306: Investigate why we are allocating 8MiB chunks and reaching
the maximum BufferPool size
https://issues.apache.org/jira/browse/CASSANDRA-15306

Our current unassigned white whale from July of last year:
CASSANDRA-15229: BufferPool Regression
https://issues.apache.org/jira/browse/CASSANDRA-15229
The TL;DR from Benedict on the ticket: "The BufferPool was never intended
to be used for a ChunkCache, and we need to either change our behaviour to
handle uncorrelated lifetimes or use something else."
So this one could be quite an interesting challenge for someone who knows
this portion of the codebase.


[Stuck tickets - needs reviewer]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=355=1661=1659
(removing tickets that have had discussion since the most recent patch was
made available and checked with a few assignees on Slack)

CASSANDRA-15565 Fix flaky test
org.apache.cassandra.index.internal.CassandraIndexTest
indexCorrectlyMarkedAsBuildAndRemoved -- it's fairly small.

[Notable tickets closed in the past week]

CASSANDRA-15338, CASSANDRA-15552, CASSANDRA-15613 fixes for flaky tests.
Fans of green CI runs can rejoice.

CASSANDRA-15476, CASSANDRA-15481, CASSANDRA-15353 docs for Transient
Replication, Data Modeling, and more.

CASSANDRA-15616 Expose Cassandra related system properties in a virtual
table


Thanks everyone for all your hard work.

Jon


Re: [Discuss] num_tokens default in Cassandra 4.0

2020-03-09 Thread Jon Haddad
There's a lot going on here... hopefully I can respond to everything in a
coherent manner.

> Perhaps a simple way to avoid this is to update the random allocation
algorithm to re-generate tokens when the ranges created do not have a good
size distribution?

Instead of using random tokens for the first node, I think we'd be better
off picking a random initial token then using an even distribution around
the ring, using the first token as an offset.  The main benefit of random
is that we don't get collisions, not the distribution.  I haven't read
through the change in CASSANDRA-15600; maybe it addresses this problem
already, in which case we can ignore my suggestion here.
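That even-spacing idea can be sketched in a few lines (assuming Murmur3Partitioner's 64-bit token range; the function name and approach are illustrative, not the CASSANDRA-15600 implementation):

```python
import random

MIN_TOKEN = -2**63       # Murmur3Partitioner token range is [-2**63, 2**63)
TOKEN_RANGE = 2**64

def evenly_spaced_tokens(num_tokens, seed=None):
    """Pick one random offset, then space the remaining tokens as evenly
    as possible around the ring."""
    rng = random.Random(seed)
    offset = rng.randrange(MIN_TOKEN, MIN_TOKEN + TOKEN_RANGE)
    step = TOKEN_RANGE // num_tokens
    return sorted(MIN_TOKEN + (offset - MIN_TOKEN + i * step) % TOKEN_RANGE
                  for i in range(num_tokens))
```

The random offset keeps two independently bootstrapped first nodes from colliding, while the even spacing gives the distribution that purely random tokens only approximate.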

> Clusters where we have used num_tokens 4 we have regretted.
> While we accept the validity and importance of the increased availability
provided by num_tokens 4, we have never seen or used it in practice.

While we worked together, I personally moved quite a few clusters to 4
tokens, and didn't run into any balance issues.  I'm not sure why you're
saying you've never seen it in practice, I did it with a whole bunch of our
clients.

Mick said:

> We know of a number of production clusters that have been set up this
way. I am unaware of any Cassandra docs or community recommendations that
say you should avoid doing this. So, this is a problem regardless of the
value for num_tokens.

Paulo:

> Having the number of racks not a multiple of the replication factor is
not a good practice since it can lead to imbalance and other problems like
this, so we should not only document this but perhaps add a warning or even
hard fail when this is encountered during node startup?

Agreed on both the above - I intend to document this in CASSANDRA-15618.

Mick, from your test:

>  Each cluster was configured with one rack.

This is an important nuance of the results you're seeing.  It sounds like
the test covers the edge case of using a single rack / AZ for an entire
cluster.  I can't remember many times where I actually saw this, out of
the several hundred clusters I looked at over the almost 4 years I was at
TLP.  This isn't to say it's not out there in the wild, but I don't think
it should drive us to pick a token count.  We can probably do better than
using a completely random algorithm for the corner case of using a single
rack or fewer racks than RF, and we should also encourage people to run
Cassandra in a way that doesn't set themselves up for a gunshot to the foot.

In a world of tradeoffs, I'm still not convinced that 16 tokens makes any
sense as a default.  Assuming we can fix the worst case random imbalance in
small clusters, 4 is a significantly better option as it will make it
easier for teams to scale Cassandra out the way we claim they can.  Using
16 tokens brings an unnecessary (and probably unknown) ceiling to people's
abilities to scale and for the *majority* of clusters where people pick
Cassandra for scalability and availability it's still too high.  I'd rather
we put a default that works best for the majority of people and document
the cases where people might want to deviate from it, rather than picking a
somewhat crappy (but better than 256) default.

That said, we don't have the better token distribution yet, so if we're
going to assume people just put C* in production with minimal configuration
changes, 16 will help us deal with the imbalance issues *today*.  We know
it works better than 256, so I'm willing to take this as a win *today*, on
the assumption that folks are OK changing this value again before we
release 4.0 if we find we can make it work without the super sharp edges
that we can currently stab ourselves with.  I'd much rather ship C* with 16
tokens than 256, and I don't want to keep debating this so much we don't
end up making any change at all.

I propose we drop it to 16 immediately.  I'll add the production docs
in CASSANDRA-15618 with notes on token count, the reasons why you'd want 1,
4, or 16.  As a follow up, if we can get a token simulation written we can
try all sorts of topologies with whatever token algorithms we want.  Once
that simulation is written and we've got some reports we can revisit.
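Concretely, the proposal amounts to a one-line change in the default cassandra.yaml (the comment text here is illustrative, not the shipped wording):

```yaml
# Number of tokens randomly assigned to this node on the ring.
# Proposed 4.0 default; earlier releases shipped with 256.
num_tokens: 16
```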

Eventually we'll probably need to add the ability for folks to fix cluster
imbalances without adding / removing hardware, but I suspect we've got a
fair amount of plumbing to rework to make something like that doable.

Jon



Re: [Discuss] num_tokens default in Cassandra 4.0

2020-03-09 Thread Paulo Motta
Great investigation, good job guys!

> Personally I would have liked to have seen even more iterations. While 14
run iterations gives an indication, the average of randomness is not what
is important here. What concerns me is the consequence to imbalances as the
cluster grows when you're very unlucky with initial random tokens, for
example when random tokens land very close together. The token allocation
can deal with breaking up large token ranges but is unable to do anything
about such tiny token ranges. Even a bad 1-in-a-100 experience should be a
consideration when picking a default num_tokens.

Perhaps a simple way to avoid this is to update the random allocation
algorithm to re-generate tokens when the ranges created do not have a good
size distribution?
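That re-generation idea could look something like the following sketch: draw candidate token sets and reject any whose largest-to-smallest range ratio is too skewed (all names and thresholds here are illustrative):

```python
import random

MIN_TOKEN = -2**63       # Murmur3Partitioner token range
TOKEN_RANGE = 2**64

def random_tokens_with_retry(num_tokens, max_ratio=2.0, attempts=100, seed=None):
    """Draw random token sets, rejecting draws whose range sizes are too
    uneven; fall back to the least-skewed draw seen if none qualifies."""
    rng = random.Random(seed)
    best = None
    for _ in range(attempts):
        tokens = sorted(rng.randrange(MIN_TOKEN, MIN_TOKEN + TOKEN_RANGE)
                        for _ in range(num_tokens))
        ranges = [(tokens[(i + 1) % num_tokens] - tokens[i]) % TOKEN_RANGE
                  for i in range(num_tokens)]
        ratio = max(ranges) / max(min(ranges), 1)  # guard against a 0-width range
        if ratio <= max_ratio:
            return tokens
        if best is None or ratio < best[0]:
            best = (ratio, tokens)
    return best[1]
```

This only papers over the worst draws for the nodes that get random tokens; once the cluster is past that point, the allocation algorithm takes over.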

> But it can be worse, for example if you have RF=3 and only two racks then
you will only get random tokens. We know of a number of production clusters
that have been set up this way. I am unaware of any Cassandra docs or
community recommendations that say you should avoid doing this. So, this is
a problem regardless of the value for num_tokens.

Having the number of racks not a multiple of the replication factor is not
a good practice since it can lead to imbalance and other problems like
this, so we should not only document this but perhaps add a warning or even
hard fail when this is encountered during node startup?
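A startup check along those lines might look like this sketch (the function name, messages, and the warn-vs-fail policy are all illustrative, not Cassandra's actual behaviour):

```python
import warnings

def check_rack_topology(num_racks, replication_factor):
    """Fail fast when the rack count defeats token allocation; warn when
    it merely risks imbalance."""
    if 1 < num_racks < replication_factor:
        # Fewer racks than RF (but more than one): allocation falls back
        # to random tokens, so refuse to start.
        raise RuntimeError(
            f"{num_racks} racks with RF={replication_factor}: use a single "
            f"rack or at least {replication_factor} racks")
    if num_racks > replication_factor and num_racks % replication_factor:
        # A rack count that is not a multiple of RF is legal but can lead
        # to imbalance.
        warnings.warn(
            f"{num_racks} racks is not a multiple of "
            f"RF={replication_factor}; this can lead to imbalance")
```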

Cheers,

Paulo

On Mon, 9 Mar 2020 at 08:25, Mick Semb Wever wrote:

>
> > Can we ask for some analysis and data against the risks different
> > num_tokens choices present. We shouldn't rush into a new default, and
> such
> > background information and data is operator value added.
>
>
> Thanks for everyone's patience on this topic.
> The following is further input on a number of fronts.
>
>
> ** Analysis of Token Distributions
>
> The following is work done by Alex Dejanovski and Anthony Grasso. It
> builds upon their previous work at The Last Pickle and why we recommend 16
> as the best value to clients. (Please buy beers for these two for the
> effort they have done here.)
>
> The following three graphs show the ranges of imbalance that occur on
> clusters growing from 4 nodes to 12 nodes, for the different values of
> num_tokens: 4, 8 and 16. The range is based on 14 run iterations (except 16
> which only got ten).
>
>
> num_tokens: 4  [graph]
>
> num_tokens: 8  [graph]
>
> num_tokens: 16  [graph]
>
> These graphs were generated using clusters created in AWS by tlp-cluster (
> https://github.com/thelastpickle/tlp-cluster). A script was written to
> automate the testing and generate the data for each value of num_tokens.
> Each cluster was configured with one rack.  Of course these interpretations
> are debatable. The data for the graphs is at
> https://docs.google.com/spreadsheets/d/1gPZpSOUm3_pSCo9y-ZJ8WIctpvXNr5hDdupJ7K_9PHY/edit?usp=sharing
>
>
> What I see from these graphs is…
>  a)  token allocation is pretty good at fixing initial bad random token
> imbalances. By the time you are at 12 nodes, presuming you have set up the
> cluster correctly so that token allocation actually works, your nodes will
> be balanced with num_tokens 4 or greater.
>  b) you need to get to ~12 nodes with num_tokens 4 to have a good balance.
>  c) you need to get to ~9 nodes with num_tokens 8 to have a good balance.
>  d) you need to get to ~6 nodes with num_tokens 16 to have a good balance.
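The balance the findings above refer to can be quantified with a small sketch like this (illustrative, not the script used to generate the graphs): compute each node's share of the ring from its tokens and report the max/min spread:

```python
def ownership_imbalance(tokens_by_node, ring=2**64):
    """Return each node's fraction of the ring and the ratio between the
    most- and least-loaded nodes (1.0 means perfectly balanced).  Each
    range is credited to the node owning its start token, which is fine
    for measuring balance."""
    ring_tokens = sorted((t, node)
                         for node, ts in tokens_by_node.items() for t in ts)
    shares = dict.fromkeys(tokens_by_node, 0)
    n = len(ring_tokens)
    for i, (t, node) in enumerate(ring_tokens):
        nxt = ring_tokens[(i + 1) % n][0]
        shares[node] += (nxt - t) % ring   # wraps correctly at the ring end
    shares = {node: s / ring for node, s in shares.items()}
    return shares, max(shares.values()) / min(shares.values())
```

For example, two nodes with perfectly interleaved tokens come out at a spread of exactly 1.0.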
>
> Personally I would have liked to have seen even more iterations. While 14
> run iterations gives an indication, the average of randomness is not what
> is important here. What concerns me is the consequence to imbalances as the
> cluster grows when you're very unlucky with initial random tokens, for
> example when random tokens land very close together. The token allocation
> can deal with breaking up large token ranges but is unable to do anything
> about such tiny token ranges. Even a bad 1-in-a-100 experience should be a
> consideration when picking a default num_tokens.
>
>
> ** When does the Token Allocation work…
>
> This has been touched on already in this thread. There are cases where
> token allocation fails to kick in. The first node in up to RF racks
> generates random tokens; this typically means the first three nodes.
>
> But it can be worse, for example if you have RF=3 and only two racks then
> you will only get random tokens. We know of a number of production clusters
> that have been set up this way. I am unaware of any Cassandra docs or
> community recommendations that say you should avoid doing this. So, this is
> a problem regardless of the value for num_tokens.
>
>
> ** Algorithmic token allocation does not handle the racks = RF case well
> (CASSANDRA-15600)
>
> This recently landed in trunk. My understanding is that this improves the
> situation the graphs cover, but not the situation just described where a DC
> has 1 < racks < RF.  Ekaterina, maybe you could elaborate?
>
>
> ** Decommissioning