Nothing complex. All we do is retry the read up to x times (a configurable 
parameter), and if it still fails, flag an alert.
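
Roughly, the shape of it (a minimal sketch, assuming the DataStax Java driver 
3.x; readWithRetry, alert, and maxRetries are illustrative names, not our 
actual code):

    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.Statement;

    final class RetryingReader {
        // maxRetries is the configurable parameter mentioned above.
        static Row readWithRetry(Session session, Statement query, int maxRetries) {
            for (int attempt = 1; attempt <= maxRetries; attempt++) {
                Row row = session.execute(query).one();
                if (row != null) {
                    return row;  // the expected row became visible
                }
            }
            alert("read failed after " + maxRetries + " attempts");
            return null;
        }

        // Stand-in for whatever alerting hook the application really uses.
        static void alert(String message) {
            System.err.println("ALERT: " + message);
        }
    }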

But we agree with the solution that Jeff provided and will use that approach.

Thanks for all the responses.

________________________________
From: Reid Pinchback <rpinchb...@tripadvisor.com>
Sent: Tuesday, May 26, 2020 11:33 PM
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?


By retry logic, I’m going to guess you are doing some kind of 
version-consistency trick where a non-key column manages a visibility horizon 
to simulate a transaction, and you poll for a horizon value >= some threshold 
that the app keeps track of.
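
If so, the shape is roughly this (a sketch under assumed names; the table, 
horizon column, threshold, and poll interval are all illustrative, assuming 
the DataStax Java driver 3.x):

    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    final class HorizonPoller {
        // Poll until the non-key "horizon" column reaches the threshold the
        // app is tracking; every unsuccessful poll is another read against
        // the cluster, which is where the extra load comes from.
        static boolean awaitHorizon(Session session, String id, long threshold,
                                    int maxPolls) throws InterruptedException {
            for (int i = 0; i < maxPolls; i++) {
                Row row = session.execute(
                    "SELECT horizon FROM app.state WHERE id = ?", id).one();
                if (row != null && row.getLong("horizon") >= threshold) {
                    return true;  // the simulated transaction is now visible
                }
                Thread.sleep(50); // illustrative back-off between polls
            }
            return false;
        }
    }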



Note that these assorted variations on doing battle with eventual consistency 
can generate a lot of load on the cluster, unless there is enough latency in 
the logical flow at the app level that the optimistic-concurrency hack almost 
always succeeds on the first try anyway.



If this generates the degree of Java garbage collection that I suspect, then 
the advice to upgrade C* becomes even more significant. Repairs themselves can 
generate substantial memory load, and you could have a node or two drop out on 
you if they OOM. I’d definitely take Jeff’s advice and switch your reads to 
LOCAL_QUORUM until you’re done, to buffer yourself from that risk.
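
Per statement, that switch is small (a sketch assuming the DataStax Java 
driver 3.x; the keyspace, table, and method names are placeholders):

    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    final class QuorumReads {
        // Raise just the reads to LOCAL_QUORUM while the RF change and
        // repairs are in flight; writes keep their existing level.
        static Row readAtLocalQuorum(Session session, String userId) {
            SimpleStatement read = new SimpleStatement(
                "SELECT * FROM app.users WHERE id = ?", userId);
            read.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            return session.execute(read).one();
        }
    }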





From: Leena Ghatpande <lghatpa...@hotmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 26, 2020 at 1:20 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?




Thank you for the response. We will follow the recommendation for the update. 
So with reads at LOCAL_QUORUM we should see some added latency, but not 
failures, during the RF change, right?



We do mitigate the issue of not seeing writes at LOCAL_ONE by having retry 
logic in the app.





________________________________

From: Leena Ghatpande <lghatpa...@hotmail.com>
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra <user@cassandra.apache.org>
Subject: any risks with changing replication factor on live production cluster 
without downtime and service interruption?



We are on Cassandra 3.7 and have a 12-node cluster across 2 DCs, with 6 nodes 
in each DC. RF=3.

We have around 150M rows across tables.



We are planning to add more nodes to the cluster and are thinking of changing 
the replication factor to 5 in each DC.
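
Concretely, the change we have in mind is one ALTER per keyspace followed by 
full repairs; roughly (keyspace and DC names here are placeholders, assuming 
NetworkTopologyStrategy and the DataStax Java driver 3.x):

    import com.datastax.driver.core.Session;

    final class RaiseReplication {
        // One ALTER per keyspace; afterwards run "nodetool repair -full" on
        // each node so the two new replicas per DC actually receive data.
        static void raiseRf(Session session) {
            session.execute(
                "ALTER KEYSPACE my_keyspace WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'DC1': 5, 'DC2': 5}");
        }
    }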



Our application uses the consistency levels below:

 read-level: LOCAL_ONE

 write-level: LOCAL_QUORUM
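
For reference, the replica counts behind those levels (a small worked example; 
LOCAL_QUORUM needs floor(RF/2)+1 replicas in the local DC, LOCAL_ONE needs 
just one):

    // Replicas required in the local DC for each consistency level.
    final class QuorumMath {
        static int localQuorum(int rf) {
            return rf / 2 + 1;               // floor(RF/2) + 1
        }

        public static void main(String[] args) {
            System.out.println(localQuorum(3)); // RF=3 -> 2 replicas
            System.out.println(localQuorum(5)); // RF=5 -> 3 replicas
            // LOCAL_ONE always needs only 1, so during an RF change a read
            // can land on a brand-new replica that has not been repaired yet.
        }
    }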



If we change to RF=5 on the live cluster and run full repairs, would we see 
read/write errors while data is being replicated?

If so, that is not something we can afford in production, so how would we 
avoid it?
