Degradation of availability when using NTS and RF > number of racks

Miklosovic, Stefan Mon, 06 Mar 2023 02:51:56 -0800

Hi all,

some time ago we identified an issue with NetworkTopologyStrategy (NTS). The
problem is that when RF > number of racks, NTS may place replicas in such a
way that when a whole rack is lost, we lose QUORUM and the data is no longer
available if QUORUM CL is used.

To illustrate this problem, consider this setup:

9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place
replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in
rack3. Hence, when rack1 is lost, we do not have QUORUM.
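
A minimal sketch of the arithmetic behind this example, in Java. The class and
method names are made up for illustration and are not part of Cassandra; the
only assumption is the usual quorum size of floor(RF / 2) + 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RackLossQuorumCheck {

    /** Quorum size for a replication factor: floor(rf / 2) + 1. */
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    /** True if QUORUM is still reachable after losing any single rack. */
    static boolean survivesAnySingleRackLoss(Map<String, Integer> replicasPerRack) {
        int rf = replicasPerRack.values().stream().mapToInt(Integer::intValue).sum();
        int needed = quorum(rf);
        for (int lost : replicasPerRack.values()) {
            // Replicas left once the rack holding `lost` replicas is gone.
            if (rf - lost < needed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> skewed = new LinkedHashMap<>();
        skewed.put("rack1", 3);
        skewed.put("rack2", 1);
        skewed.put("rack3", 1);

        Map<String, Integer> balanced = new LinkedHashMap<>();
        balanced.put("rack1", 2);
        balanced.put("rack2", 2);
        balanced.put("rack3", 1);

        System.out.println("3/1/1 survives: " + survivesAnySingleRackLoss(skewed));
        System.out.println("2/2/1 survives: " + survivesAnySingleRackLoss(balanced));
    }
}
```

With RF = 5 the quorum is 3, so a 3/1/1 placement leaves only 2 replicas when
rack1 is lost, while a 2/2/1 placement always leaves at least 3.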

It seems to us that there is already some logic around this scenario (1) but
the implementation is not entirely correct: it does not compute the replica
placement in a way that would address the above problem.

We created a draft here (2, 3) which fixes it.

There is also a test which simulates this scenario. I assign 256 tokens to
each node randomly (by the same means the generatetokens command uses), compute
the natural replicas for 1 billion random tokens, and count the cases where 3
replicas out of 5 are placed in the same rack (so that by losing it we would
lose quorum). For the above setup I get around 6%.

For 12 nodes, 3 racks, 4 nodes per rack, RF = 5, this happens in 10% of cases.

To interpret this number: with such a topology, RF and CL, when a random rack
fails completely and we do a random read, there is a 6% chance that the data
will not be available (or 10%, respectively).
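
The kind of simulation described above could be sketched roughly as follows.
This is not the test from the draft and not Cassandra's code: the placement
rule is a simplified single-DC model that assumes a budget of (RF - number of
racks) rack repeats, tokens come from java.util.Random rather than from
whatever generatetokens uses, and all names are hypothetical. The figures it
prints are therefore not expected to match the ones above exactly.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class RackSkewSimulation {

    /**
     * Simplified placement (an assumption, not Cassandra's implementation):
     * walk the ring clockwise from the token, always accept a node in a rack
     * that has no replica yet, and accept a node in an already used rack only
     * while a budget of (rf - rackCount) repeats lasts.
     * Returns the number of replicas placed in each rack.
     */
    static int[] place(long[] tokens, int[] ownerOfToken, int[] rackOfNode,
                       long token, int rf, int rackCount) {
        int[] perRack = new int[rackCount];
        boolean[] chosen = new boolean[rackOfNode.length];
        int repeatsLeft = Math.max(0, rf - rackCount);
        int rfLeft = rf;

        // First ring position whose token is >= the searched token.
        int start = Arrays.binarySearch(tokens, token);
        if (start < 0) start = -start - 1;

        for (int step = 0; step < tokens.length && rfLeft > 0; step++) {
            // The modulo wraps the walk around the end of the ring.
            int node = ownerOfToken[(start + step) % tokens.length];
            if (chosen[node]) continue;
            int rack = rackOfNode[node];
            if (perRack[rack] > 0) {
                if (repeatsLeft == 0) continue; // repeat budget is used up
                repeatsLeft--;
            }
            chosen[node] = true;
            perRack[rack]++;
            rfLeft--;
        }
        return perRack;
    }

    /** Fraction of sampled tokens for which losing one rack would break quorum. */
    static double fractionLosingQuorum(int nodesPerRack, int rackCount, int vnodes,
                                       int rf, int samples, long seed) {
        Random rnd = new Random(seed);
        int nodes = nodesPerRack * rackCount;

        // Random tokens per node; a collision would overwrite an entry, which is negligible here.
        TreeMap<Long, Integer> ring = new TreeMap<>();
        for (int node = 0; node < nodes; node++)
            for (int v = 0; v < vnodes; v++)
                ring.put(rnd.nextLong(), node);

        long[] tokens = new long[ring.size()];
        int[] owner = new int[ring.size()];
        int i = 0;
        for (Map.Entry<Long, Integer> e : ring.entrySet()) {
            tokens[i] = e.getKey();
            owner[i] = e.getValue();
            i++;
        }

        // Round-robin rack assignment gives every rack nodesPerRack nodes.
        int[] rackOfNode = new int[nodes];
        for (int node = 0; node < nodes; node++) rackOfNode[node] = node % rackCount;

        int quorum = rf / 2 + 1;
        int bad = 0;
        for (int s = 0; s < samples; s++) {
            int[] perRack = place(tokens, owner, rackOfNode, rnd.nextLong(), rf, rackCount);
            for (int count : perRack) {
                if (rf - count < quorum) { bad++; break; }
            }
        }
        return bad / (double) samples;
    }

    public static void main(String[] args) {
        System.out.println("9 nodes:  " + fractionLosingQuorum(3, 3, 256, 5, 1_000_000, 1L));
        System.out.println("12 nodes: " + fractionLosingQuorum(4, 3, 256, 5, 1_000_000, 1L));
    }
}
```

Note that this counts tokens for which some rack holds enough replicas to
break quorum, which is the same quantity the test above counts (3 replicas
out of 5 in one rack).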

One caveat here is that this new strategy is not compatible with NTS because
it places replicas differently. So I guess that fixing this in NTS itself will
not be possible because of upgrades. I think people would need to set up a
completely new keyspace and somehow migrate data if they wish, or just start
from scratch with this strategy.

Questions:

1) do you think this is meaningful to fix, and might it end up in trunk?

2) shouldn't we just ban this scenario entirely? It might be possible to check
the configuration upon keyspace creation (RF > number of racks) and, if we see
it is problematic, just fail that query. A guardrail, maybe?

3) people in the ticket mention writing a "CEP" for this but I do not see any
reason to do so. It is just a strategy like any other. What would that CEP
even be about? Is it necessary?
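
For question 2, the check itself could be as small as the sketch below. It is
a standalone illustration only: the class, method and message are hypothetical
and do not correspond to any existing guardrail or validation API in
Cassandra.

```java
final class RackCountGuardrail {

    /**
     * Hypothetical keyspace-creation check: rejects a replication factor
     * that is larger than the number of racks in the datacenter.
     */
    static void validate(String datacenter, int rf, int rackCount) {
        if (rf > rackCount) {
            throw new IllegalArgumentException(String.format(
                "RF %d in datacenter %s exceeds its %d racks",
                rf, datacenter, rackCount));
        }
    }

    public static void main(String[] args) {
        validate("dc1", 3, 3); // passes silently
        try {
            validate("dc1", 5, 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```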

Regards

(1)
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L126-L128
(2) https://github.com/apache/cassandra/pull/2191
(3) https://issues.apache.org/jira/browse/CASSANDRA-16203
