[ https://issues.apache.org/jira/browse/CASSANDRA-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732564#comment-17732564 ]

Stefan Miklosovic edited comment on CASSANDRA-18555 at 6/14/23 2:31 PM:
------------------------------------------------------------------------

The compromise would be to put it into nodetool info, so we reuse an already
existing nodetool command, but we would not save the state to a table. The fact
that the decommission failed would be visible only as long as the Cassandra
process runs.
So in practice, an operator would see "ah, this node is leaving", as it would be
in UL (I think that's right). But if the operator sees that it is taking too
long and something is probably wrong because the node has been stuck in UL
forever, they could run nodetool info to see the state of the decommission
(i.e. what state the node is in), and it would show DECOMMISSION_FAILED.
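A minimal sketch of the compromise, assuming the decommission outcome is tracked in a plain in-memory field and exposed over JMX so that a nodetool-style client can probe it later. Every name here (DecommissionTracker, DecommissionStatusMBean, the DecommissionState attribute, the ObjectName domain, and the states other than DECOMMISSION_FAILED) is hypothetical and is not Cassandra's actual API; only the standard javax.management classes are real.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicReference;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

// Hypothetical sketch: names are illustrative, not Cassandra's real API.
public class DecommissionTracker {

    // Only DECOMMISSION_FAILED comes from the discussion; the rest are assumed.
    public enum DecommissionState { NONE, IN_PROGRESS, DECOMMISSIONED, DECOMMISSION_FAILED }

    // JMX-facing interface; the getter becomes the "DecommissionState" attribute.
    public interface DecommissionStatusMBean {
        String getDecommissionState();
    }

    public static class DecommissionStatus implements DecommissionStatusMBean {
        // Held only in memory: it is lost on restart, matching the
        // "do not save the state to a table" compromise.
        private final AtomicReference<DecommissionState> state =
                new AtomicReference<>(DecommissionState.NONE);

        @Override
        public String getDecommissionState() {
            return state.get().name();
        }

        void set(DecommissionState s) {
            state.set(s);
        }
    }

    // Runs a (stubbed) decommission and records its outcome in memory.
    static void decommission(DecommissionStatus status, Runnable work) {
        status.set(DecommissionState.IN_PROGRESS);
        try {
            work.run();
            status.set(DecommissionState.DECOMMISSIONED);
        } catch (RuntimeException e) {
            status.set(DecommissionState.DECOMMISSION_FAILED);
            throw e; // the caller still receives the exception, as it does today
        }
    }

    public static void main(String[] args) throws Exception {
        DecommissionStatus status = new DecommissionStatus();
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.cassandra:type=DecommissionStatus");
        mbs.registerMBean(new StandardMBean(status, DecommissionStatusMBean.class), name);

        try {
            decommission(status, () -> { throw new RuntimeException("streaming failed"); });
        } catch (RuntimeException ignored) {
            // The original caller may already have timed out; the state remains queryable.
        }

        // A later probe reads the attribute over JMX, as a nodetool-style client would.
        Object current = mbs.getAttribute(name, "DecommissionState");
        System.out.println("Decommission state: " + current);
    }
}
```

Running main prints "Decommission state: DECOMMISSION_FAILED". Because the state lives only in the JVM, a restart resets it to NONE, which is exactly the trade-off of not persisting it to a table.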

> A new nodetool/JMX command that tells whether node's decommission failed or 
> not
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18555
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18555
>             Project: Cassandra
>          Issue Type: Task
>          Components: Observability/JMX
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Normal
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Currently, when a node is being decommissioned and a failure happens, an
> exception is thrown back to the caller.
> But Cassandra's decommission takes considerable time, ranging from minutes to
> hours to days. There are various scenarios in which the caller may need to
> probe the status again:
>  * The caller times out
>  * It is not possible to keep the caller hanging for such a long time
> If the caller does not know what happened internally, then it cannot
> retry, etc., which leads to other issues.
> So, in this ticket, I am going to add a new nodetool/JMX command that can be
> invoked by the caller at any time, and it will return the correct status.
> It might look like a small change, but when we need to operate Cassandra in a
> large-scale fleet, this becomes a bottleneck and requires constant operator
> intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)