[ 
https://issues.apache.org/jira/browse/CASSSIDECAR-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013787#comment-18013787
 ] 

Andres Beck-Ruiz commented on CASSSIDECAR-274:
----------------------------------------------

Following the existing Sidecar API structure that organizes keyspace and table 
operations under {{/keyspace}} or {{/table}} resources, respectively, we can 
create a new {{/cluster}} resource for all cluster-wide operations, which could 
include upgrades and configuration changes in the future.
----
{{POST /api/v1/cassandra/cluster/restart}}

Creates a restart job. By default, the job is placed in the "PENDING" state. 
If {{executeImmediately = true}} is submitted while another restart is already 
active ("RUNNING" or "PAUSED"), the request fails with a 409 Conflict error.
||Field||Type||Optional||Default Value||Description||
|hosts|List<String>|true|[]|Subset of hosts to restart. By default, all hosts 
will be restarted|
|rackParallelism|int|true|1|Number of nodes that can be restarted in parallel 
within a rack. The maximum parallelism is the number of nodes in the rack.|
|executeImmediately|boolean|true|false|Optional flag to skip pending state and 
set the restart task to "RUNNING"|

{{RestartJob}} Object

The order of the lists of hosts in {{hosts_pending}} represents the order in 
which hosts will be restarted, and each list of hosts can be restarted in 
parallel.
{code:java}
{
     id: int,
     state: String,
     start_time: String,
     hosts_succeeded: List<String>,
     hosts_failed: List<String>,
     hosts_restarting: List<String>,
     hosts_pending: List<List<String>>,
     last_update: String (message containing job updates)
}
{code}
Response
 * 201 Created
 ** {{restart_job: RestartJob}}
 * 400 Bad Request
 ** {{error :: string}}
 * 409 Conflict
 ** {{error :: string}}
 * 500 Internal Server Error
 ** {{error :: string}}
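
The conflict rule for job creation can be sketched as follows. This is a minimal Python illustration, not Sidecar code: {{create_restart_job}}, the in-memory {{jobs}} list, and the id scheme are hypothetical. Only the state names and status codes come from the proposal above.

```python
from typing import Dict, List, Tuple

# States that count as an "active" restart in the proposal.
ACTIVE_STATES = {"RUNNING", "PAUSED"}


def create_restart_job(jobs: List[Dict], execute_immediately: bool = False) -> Tuple[int, Dict]:
    """Return (http_status, body) for a create request against an in-memory job list."""
    has_active = any(job["state"] in ACTIVE_STATES for job in jobs)
    if execute_immediately and has_active:
        # Only an immediate start conflicts with an active restart.
        return 409, {"error": "an active restart job already exists"}
    new_job = {
        "id": len(jobs) + 1,  # hypothetical id scheme, for illustration only
        "state": "RUNNING" if execute_immediately else "PENDING",
    }
    jobs.append(new_job)
    return 201, {"restart_job": new_job}
```

Note that in this sketch a job created without {{executeImmediately}} is always accepted as "PENDING", even while another restart is active.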

Sample input:
{code:java}
{
       "hosts": ["192.168.1.10", "192.168.1.11", "192.168.1.12", 
"192.168.1.13", "192.168.1.14", "192.168.1.15"],
       "rackParallelism": 2,
       "executeImmediately": true
}
{code}
Sample output:
{code:java}
{
    "id": 123,
    "state": "RUNNING",
    "start_time": "2025-08-04T22:55:00+00:00",
    "hosts_succeeded": [],
    "hosts_failed": [],
    "hosts_restarting": [],
    "hosts_pending": [
        ["192.168.1.10", "192.168.1.11"],
        ["192.168.1.12", "192.168.1.13"],
        ["192.168.1.14", "192.168.1.15"]
    ],
    "last_update": "Starting restart 123"
}
{code}
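
To illustrate how {{hosts_pending}} could be derived from the request, here is a minimal Python sketch. It assumes, for simplicity, that all requested hosts sit in a single rack, so each batch is just the next {{rackParallelism}} hosts. A real implementation would need rack topology, which this proposal does not specify. {{plan_batches}} is a hypothetical helper name.

```python
from typing import List


def plan_batches(hosts: List[str], rack_parallelism: int = 1) -> List[List[str]]:
    """Split hosts into ordered batches; hosts within one batch may restart in parallel."""
    if rack_parallelism < 1:
        raise ValueError("rackParallelism must be at least 1")
    return [
        hosts[i:i + rack_parallelism]
        for i in range(0, len(hosts), rack_parallelism)
    ]


# The six hosts from the sample input, restarted two at a time.
hosts = [f"192.168.1.{n}" for n in range(10, 16)]
batches = plan_batches(hosts, rack_parallelism=2)
```

With six hosts and a parallelism of 2, this yields three batches of two hosts each, in request order.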
----
{{PATCH /api/v1/cassandra/cluster/restart/{id}}}

Updates the state of the restart job specified by ID. This endpoint is used to 
start, pause, and abort a restart job. It returns a 409 Conflict error if the 
requested state transition is invalid, or if a job is set to "RUNNING" while 
another restart is already active. Valid state transitions are depicted below, 
and the parameters follow the guidelines described in [RFC 6902 
JavaScript Object Notation (JSON) 
Patch|https://www.rfc-editor.org/rfc/rfc6902#section-4.3]. 
||Field||Type||Optional||Default Value||Description||
|op|String|false|N/A|Patch operation (must be "replace")|
|path|String|false|N/A|Resource being replaced (must be "/state")|
|value|String|false|N/A|State to patch for the current restart job. Valid states 
to patch on update are RUNNING, PAUSED, or ABORTED.|

Valid state transitions:

!Screenshot 2025-08-13 at 12.34.43 PM.png|width=590,height=454!

Response
 * 200 OK
 ** {{restart_job: RestartJob object}}
 * 400 Bad Request
 ** {{error :: string}}
 * 409 Conflict (Invalid state transition or active restart ongoing)
 ** {{error :: string}}
 * 500 Internal Server Error
 ** {{error :: string}}

Sample Input
{code:java}
{
     "op": "replace",
     "path": "/state",
     "value": "PAUSED"
}
{code}
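
The shape checks on the patch body can be sketched as below. This is a minimal Python illustration under the constraints stated in the table above ("replace", "/state", and the three patchable states). {{validate_state_patch}} is a hypothetical name, and the sketch deliberately does not encode the state-transition diagram, so it only covers the 400 Bad Request case, not the 409 Conflict case.

```python
from typing import Dict, Optional

# States a client is allowed to request, per the parameter table.
PATCHABLE_STATES = {"RUNNING", "PAUSED", "ABORTED"}


def validate_state_patch(body: Dict) -> Optional[str]:
    """Return an error message suitable for a 400 response, or None if the body is well formed."""
    if body.get("op") != "replace":
        return 'op must be "replace"'
    if body.get("path") != "/state":
        return 'path must be "/state"'
    value = body.get("value")
    if not isinstance(value, str) or value not in PATCHABLE_STATES:
        return "value must be one of RUNNING, PAUSED, ABORTED"
    return None
```

A body that passes this check would still need to be compared against the job's current state before the transition is applied.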
----
{{GET /api/v1/cassandra/cluster/restart/{id}}}

Get the restart job specified by the ID.

Response
 * 200 OK
 ** {{restart_job :: RestartJob}}
 * 404 Not Found
 ** {{error :: string}}
 * 500 Internal Server Error
 ** {{error :: string}}

----
{{GET /api/v1/cassandra/cluster/restart}}

Gets the history of all restart jobs, including any active restart. The amount 
of restart history that is kept can be configured in {{sidecar.yaml}}.

Response
 * 200 OK
 ** {{restart_jobs :: [RestartJob]}} (can be empty)
 * 500 Internal Server Error
 ** {{error :: string}}

Sample response:
{code:java}
{
  "restart_jobs": [
    {
         "id": 123,
         "state": "ABORTED",
         "start_time": "2025-08-04T22:55:00+00:00",
         "hosts_succeeded": ["192.168.1.10", "192.168.1.11"],
         "hosts_failed": [],
         "hosts_restarting": [],
         "hosts_pending": [
             ["192.168.1.12", "192.168.1.13"],
             ["192.168.1.14", "192.168.1.15"]
         ],
         "last_update": "Restart 123 aborted"
    },
    {
         "id": 456,
         "state": "COMPLETED",
         "start_time": "2025-08-03T22:55:00+00:00",
         "hosts_succeeded": ["192.168.1.10", "192.168.1.11", "192.168.1.12",
             "192.168.1.13", "192.168.1.14", "192.168.1.15"],
         "hosts_failed": [],
         "hosts_restarting": [],
         "hosts_pending": [],
         "last_update": "Restart 456 completed"
    }
  ]
}
{code}
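
One way to keep a bounded restart history is sketched below. This is a minimal Python illustration only: the class {{RestartHistory}} and the {{max_jobs}} parameter are hypothetical, since the proposal does not name the {{sidecar.yaml}} setting, and returning the newest job first is an assumption based on the ordering in the sample response.

```python
from collections import deque
from typing import Deque, Dict, List


class RestartHistory:
    """Keeps at most max_jobs restart jobs, discarding the oldest first."""

    def __init__(self, max_jobs: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._jobs: Deque[Dict] = deque(maxlen=max_jobs)

    def record(self, job: Dict) -> None:
        self._jobs.append(job)

    def list_jobs(self) -> Dict[str, List[Dict]]:
        # Newest first (an assumption, see the lead-in).
        return {"restart_jobs": list(reversed(self._jobs))}


history = RestartHistory(max_jobs=2)
for job_id in (1, 2, 3):
    history.record({"id": job_id, "state": "COMPLETED"})
response = history.list_jobs()
```

With a limit of two, recording three jobs leaves only the two most recent in the response, and an empty history still returns an empty {{restart_jobs}} list.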

> Enable rolling restarts of Cassandra clusters via Sidecar
> ---------------------------------------------------------
>
>                 Key: CASSSIDECAR-274
>                 URL: https://issues.apache.org/jira/browse/CASSSIDECAR-274
>             Project: Sidecar for Apache Cassandra
>          Issue Type: Improvement
>            Reporter: Isaac Reath
>            Priority: Major
>         Attachments: Screenshot 2025-08-13 at 12.34.43 PM.png
>
>
> Rolling restarts are frequently used in Cassandra to apply changes to a 
> cluster such as configuration changes, or version upgrades. In 
> CASSSIDECAR-266, we are adding functionality to safely start and stop a 
> single Cassandra node via Sidecar. This ticket will build on that work to 
> implement a coordinated rolling restart. 
> The scope of this effort includes:
>  * Adding API endpoints to enable operators to start, monitor, pause and stop 
> a rolling restart.
>  * Updating Sidecar to orchestrate start and stop operations across the 
> cluster, allowing for a configurable amount of nodes to be offline 
> simultaneously.
>  * Creating safeguards to ensure that a rolling restart is safe to perform 
> and does not interfere with other operations ongoing in the cluster such as 
> node bootstraps or decommissions. 


