Re: [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Jindal, Himanshu Thu, 04 Sep 2025 09:44:03 -0700

Hi Andres,

This looks like a great CEP. Having official, source-controlled code within 
Cassandra (or a sidecar in this case) to handle common operator actions would 
centralize best practices and make the operator experience smoother—especially 
for users who may not have deep Cassandra expertise.

A couple of questions:

  1.  Have we considered introducing the concept of a datacenter alongside 
cluster? I imagine there will be cases where a user wants to perform a rolling 
restart on a single datacenter rather than across the entire cluster.
  2.  Do we see this framework extending to other cluster- or datacenter-wide 
operations, such as scale-up/scale-down operations, or backups/restores, or 
nodetool rebuilds run as part of adding a new datacenter?
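
To make question 1 concrete, here is a minimal sketch of what an optional datacenter scope could look like. It is only an illustration under my own assumptions: `Node`, `restart_targets`, and the `datacenter` parameter are hypothetical names, not part of the CEP or the Sidecar API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a Cassandra instance known to Sidecar."""
    host: str
    datacenter: str


def restart_targets(nodes: Iterable[Node],
                    datacenter: Optional[str] = None) -> List[Node]:
    """Return the nodes a rolling restart would visit, in a stable order.

    datacenter=None is assumed to mean "the whole cluster"; otherwise
    only nodes in the named datacenter are selected.
    """
    selected = [n for n in nodes
                if datacenter is None or n.datacenter == datacenter]
    # Group by datacenter, then by host, so the order is deterministic.
    return sorted(selected, key=lambda n: (n.datacenter, n.host))
```

With this shape, a cluster-wide restart and a single-datacenter restart would share one code path and differ only in the scope argument.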

Best,
Himanshu

From: Andrés Beck-Ruiz <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, September 2, 2025 at 11:58 AM
To: "[email protected]" <[email protected]>
Subject: RE: [EXTERNAL] [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Thanks everyone for the feedback. +1 to using the term 'cluster-wide 
operations'.

> The only suggestion I have is to keep in mind the pluggability aspect of
> Sidecar. For example, for the Distributed Restart portion of the work, we
> should consider making interfaces that would allow us to potentially move
> the responsibility of keeping the state outside of Cassandra.

Are you referring to tracking the state of a restart job (and cluster-wide 
operations in general) outside of sidecar_internal Cassandra tables?
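
If so, one possible shape is an interface that hides where job state lives. The sketch below is only an assumption about what such an abstraction could look like: `JobStateStore` and `InMemoryJobStateStore` are hypothetical names, and a Cassandra-backed implementation writing to the sidecar_internal tables would be one more subclass.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class JobStateStore(ABC):
    """Hypothetical storage interface for cluster-wide operation state."""

    @abstractmethod
    def save(self, job_id: str, state: str) -> None:
        """Persist the latest state of a job."""

    @abstractmethod
    def load(self, job_id: str) -> Optional[str]:
        """Return the last saved state, or None if the job is unknown."""


class InMemoryJobStateStore(JobStateStore):
    """Stand-in implementation; a real one could target Cassandra tables
    or an external system without changing callers."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def save(self, job_id: str, state: str) -> None:
        self._states[job_id] = state

    def load(self, job_id: str) -> Optional[str]:
        return self._states.get(job_id)
```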

> What do you think about broadening the scope of the CEP to propose a way 
> (API) to perform bulk operations, and propose the current Rolling restarts as 
> the first implementation for that bulk operations API? I’m proposing this 
> because I see value in reusing this proposal for other bulk operations such as 
> enabling CDC (it requires enabling CDC in cassandra.yaml and some other
> operations) to better support CEP-44.

We propose a way to persist and monitor cluster-wide operations in the new 
sidecar_internal system tables 
(https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables).
I think it would make sense to also generalize the API to apply to 
cluster-wide operations. I'm curious about any feedback on whether this should 
be a separate API from the current operational job framework and live under the 
/cluster resource. We've discussed why we didn't propose to use the existing 
API and how the current framework would need to be extended here 
(https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework).

> I’m not quite sold on using a PATCH to move from pending state to running 
> state. Quick question, what is the goal of the pending state? I see a PATCH 
> operation as modifying part of an object's data. In this case, modifying the 
> state looks like a change on the operation state, not on its metadata. I’d 
> love to hear your thoughts on this one.

The "PENDING" state allows an operator to double-check a submitted 
cluster-wide operation, which could have unintended consequences, before 
starting it. For example, performing a rolling restart could prevent other 
operations on the cluster that might be scheduled or needed, such as replacing 
a Cassandra instance. While an operator should be able to abort a restart job, 
I see value in having this guard against operator error.

Given that we are applying a partial update to the resource, which in this 
context would be the restart job, we chose PATCH for this API.
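
As a rough sketch of that idea (not the CEP's actual API), the job can be modeled as a resource with a status field, where a PATCH-style partial update is only accepted for legal transitions. `RestartJob`, `apply_patch`, the `ABORTED` status, and the transition table are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    ABORTED = "ABORTED"  # assumed name for an aborted job


# Assumed transition table: a job must be explicitly started from PENDING,
# and can be aborted either before or after it starts.
ALLOWED = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.ABORTED},
    JobStatus.RUNNING: {JobStatus.ABORTED},
    JobStatus.ABORTED: set(),
}


@dataclass
class RestartJob:
    job_id: str
    status: JobStatus = JobStatus.PENDING  # new jobs wait for confirmation


def apply_patch(job: RestartJob, patch: dict) -> RestartJob:
    """Apply a partial update; only the 'status' field is patchable here."""
    unknown = set(patch) - {"status"}
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    if "status" in patch:
        target = JobStatus(patch["status"])
        if target not in ALLOWED[job.status]:
            raise ValueError(
                f"illegal transition {job.status.value} -> {target.value}")
        job.status = target
    return job
```

Under these assumptions, submitting a job leaves it in PENDING, and nothing is restarted until an operator sends a patch such as `{"status": "RUNNING"}`.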

Best,
Andrés

On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi 
<[email protected]<mailto:[email protected]>> wrote:
I would like to chime in and say that we need to refine our vocabulary. The 
term 'bulk commands' was used originally in CEP-1. This is totally my fault, as 
I originally wrote that down, but over time it has caused confusion. I believe 
'cluster-wide operations' is a better term to describe those operations. We 
have also used 'Bulk' in the context of CEP-28, where it means something rather 
different, which adds to the confusion. So I propose using the term 'cluster-wide 
operations' for operations that have to be run across all nodes in the cluster.

Thanks,

Dinesh

On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella 
<[email protected]<mailto:[email protected]>> wrote:
This is an incredible contribution. Thanks a lot!

Now, let me throw out some thoughts :-)

Rolling restarts are a great example of a broader feature that could be seen as 
bulk operations on a cluster via Sidecar.

What do you think about broadening the scope of the CEP to propose a way (API) 
to perform bulk operations, and propose the current Rolling restarts as the 
first implementation for that bulk operations API? I’m proposing this because I 
see value in reusing this proposal for other bulk operations such as enabling 
CDC (it requires enabling CDC in cassandra.yaml and some other operations) to 
better support CEP-44.

I’m not quite sold on using a PATCH to move from pending state to running 
state. Quick question, what is the goal of the pending state? I see a PATCH 
operation as modifying part of an object's data. In this case, modifying the 
state looks like a change on the operation state, not on its metadata. I’d love 
to hear your thoughts on this one.

Again, thanks a lot for the contribution!
Bernardo

> On Aug 30, 2025, at 7:02 AM, Francisco Guerrero 
> <[email protected]<mailto:[email protected]>> wrote:
>
> Thanks Andrés for the CEP. This is a great contribution to the project and
> aligns with the original intent of the Sidecar stated in CEP-1. I've gone
> over the CEP details and it is consistent with the internals of Sidecar.
>
> The only suggestion I have is to keep in mind the pluggability aspect of
> Sidecar. For example, for the Distributed Restart portion of the work, we
> should consider making interfaces that would allow us to potentially move
> the responsibility of keeping the state outside of Cassandra.
>
> Best,
> - Francisco
>
> On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>> Hello everyone,
>>
>> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>> )
>>
>> This CEP builds off of CEP-1
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>> and proposes a design for safe, efficient, and operator-friendly rolling
>> restarts on Cassandra clusters, as well as an extensible approach for
>> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>> leverage this infrastructure in the future to implement upgrade automation.
>>
>> We welcome all feedback and discussion. Thank you in advance for your time
>> and consideration of this proposal!
>>
>> Best,
>> Andrés and Paulo
>>
