Hi Andrés,

This looks like a great CEP. Having official, source-controlled code within Cassandra (or a sidecar in this case) to handle common operator actions would centralize best practices and make the operator experience smoother, especially for users who may not have deep Cassandra expertise.

A couple of questions:

1. Have we considered introducing the concept of a datacenter alongside cluster? I imagine there will be cases where a user wants to perform a rolling restart on a single datacenter rather than across the entire cluster.
2. Do we see this framework extending to other cluster- or datacenter-wide operations, such as scale-up/scale-down operations, or backups/restores, or nodetool rebuilds run as part of adding a new datacenter?

Best,
Himanshu

From: Andrés Beck-Ruiz <andresbeckr...@gmail.com>
Reply-To: "dev@cassandra.apache.org" <dev@cassandra.apache.org>
Date: Tuesday, September 2, 2025 at 11:58 AM
To: "dev@cassandra.apache.org" <dev@cassandra.apache.org>
Subject: RE: [EXTERNAL] [DISCUSS] CEP 53: Cassandra Rolling Restarts via Sidecar

Thanks everyone for the feedback. +1 to using the term 'cluster-wide operations'.

> The only suggestion I have is to keep in mind the pluggability aspect of
> Sidecar. For example, for the Distributed Restart portion of the work, we
> should consider making interfaces that would allow us to potentially move
> the responsibility of keeping the state outside of Cassandra.

Are you referring to tracking the state of a restart job (and cluster-wide operations in general) outside of sidecar_internal Cassandra tables?

> What do you think about broadening the scope of the CEP to propose a way
> (API) to perform bulk operations, and propose the current Rolling restarts as
> the first implementation for that bulk operations API? I’m proposing this as
> I see value to reuse this proposal for other bulk operations such as enabling
> CDC (it requires enabling cdc on cassandra.yaml and some other
> operations) for better supporting CEP-44.

We propose a way to persist and monitor cluster-wide operations in the new sidecar_internal system tables.
(https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-CassandraSidecarSystemTables). I think it would make sense to also generalize the API to apply to cluster-wide operations. I'm curious about any feedback on whether this should be a separate API from the current operational job framework and live under the /cluster resource. We've discussed why we didn't propose to use the existing API and how the current framework would need to be extended here (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar#CEP53:CassandraRollingRestartsviaSidecar-OperationalJobFramework).

> I’m not quite sold on using a PATCH to move from pending state to running
> state. Quick question, what is the goal of the pending state? I see a PATCH
> operation as modifying part of an object data. In this case, modifying the
> state looks like a change on the operation state, not on its metadata. I’d
> love to hear your thoughts on this one.

The "PENDING" state allows for an operator to double check a submitted cluster-wide operation, which could have unintended consequences, before starting it. For example, performing a rolling restart could prevent other operations on the cluster that might be scheduled or needed, such as replacing a Cassandra instance. While an operator should be able to abort a restart job, I see value in having this guard against operator error. Given that we are applying a partial update to the resource, which in this context would be the restart job, we chose PATCH for this API.

Best,
Andrés

On Tue, Sep 2, 2025 at 12:33 PM Dinesh Joshi <djo...@apache.org> wrote:

I would like to chime in and say that we need to refine our vocabulary. The term 'bulk commands' was used originally in CEP-1. This is my fault totally as I originally wrote that down. But over time it has caused confusion.
I believe 'cluster-wide operations' is a better term to describe those operations. We have also used 'Bulk' in the context of CEP-28 which means something rather different which leads to confusion. So I propose using the term 'cluster-wide operations' for operations that have to be run across all nodes in the cluster.

Thanks,
Dinesh

On Tue, Sep 2, 2025 at 9:21 AM Bernardo Botella <conta...@bernardobotella.com> wrote:

This is an incredible contribution. Thanks a lot!

Now, let me throw some thoughts :-)

Rolling restarts is a great example of a broader feature that could be seen as bulk operations on a cluster via Sidecar. What do you think about broadening the scope of the CEP to propose a way (API) to perform bulk operations, and propose the current Rolling restarts as the first implementation for that bulk operations API? I’m proposing this as I see value to reuse this proposal for other bulk operations such as enabling CDC (it requires enabling cdc on cassandra.yaml and some other operations) for better supporting CEP-44.

I’m not quite sold on using a PATCH to move from pending state to running state. Quick question, what is the goal of the pending state? I see a PATCH operation as modifying part of an object data. In this case, modifying the state looks like a change on the operation state, not on its metadata. I’d love to hear your thoughts on this one.

Again, thanks a lot for the contribution!

Bernardo

> On Aug 30, 2025, at 7:02 AM, Francisco Guerrero
> <fran...@apache.org> wrote:
>
> Thanks Andrés for the CEP. This is a great contribution to the project and
> aligns with the original intent of the Sidecar stated in CEP-1. I've gone
> over the CEP details and it is consistent with the internals of Sidecar.
>
> The only suggestion I have is to keep in mind the pluggability aspect of
> Sidecar.
> For example, for the Distributed Restart portion of the work, we
> should consider making interfaces that would allow us to potentially move
> the responsibility of keeping the state outside of Cassandra.
>
> Best,
> - Francisco
>
> On 2025/08/29 19:56:08 Andrés Beck-Ruiz wrote:
>> Hello everyone,
>>
>> We would like to propose CEP 53: Cassandra Rolling Restarts via Sidecar (
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-53%3A+Cassandra+Rolling+Restarts+via+Sidecar
>> )
>>
>> This CEP builds off of CEP-1
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-1%3A+Apache+Cassandra+Management+Process%28es%29+-+Deprecated>
>> and proposes a design for safe, efficient, and operator friendly rolling
>> restarts on Cassandra clusters, as well as an extensible approach for
>> persisting future cluster-wide operations in Cassandra Sidecar. We hope to
>> leverage this infrastructure in the future to implement upgrade automation.
>>
>> We welcome all feedback and discussion. Thank you in advance for your time
>> and consideration of this proposal!
>>
>> Best,
>> Andrés and Paulo
>>