[ 
https://issues.apache.org/jira/browse/CASSANDRA-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732474#comment-17732474
 ] 

Brandon Williams commented on CASSANDRA-18555:
----------------------------------------------

I don't think we need to persist any state here.  If decom fails, you are free 
to try again, so saving the failed state doesn't make sense since there is no 
logic based on it.  I think we are overdoing this.  The original problem can be 
summarized as "JMX times out so I can't tell if the node is still 
decommissioning", which led to proposing a new command to essentially allow 
retrying the query.

Let's take a step back.  If the node is in the decom process, even if the 
nodetool process times out, can this state not be seen from nodetool status?  
If it has failed, should it not be reflected there too?  Why isn't status 
sufficient here?  Gossip is already the de facto way to get this information 
and is being used to convey that decom has begun, so I don't see why it 
shouldn't be used to indicate it has failed.
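
To make the "just look at status" idea concrete, here is a rough sketch of how 
a caller could read a node's state out of nodetool status text after its own 
decommission call has timed out.  It relies only on the two-letter 
status/state prefix that nodetool status prints (U/D followed by N/L/J/M).  
The helper name node_state, the SAMPLE text, and the addresses and host IDs in 
it are made up for illustration, and it does not assume any failure indicator 
exists in status, since that is exactly what is being discussed here.

```python
import re

# First letter: Up/Down. Second letter: Normal/Leaving/Joining/Moving.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")
STATES = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}

# Made-up sample in the general shape of `nodetool status` output.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns  Host ID                               Rack
UN  10.0.0.1  1.2 GiB  16      ?     11111111-1111-1111-1111-111111111111  r1
UL  10.0.0.2  1.1 GiB  16      ?     22222222-2222-2222-2222-222222222222  r1
"""


def node_state(status_output, address):
    """Return the state name for `address`, or None if the node is not listed."""
    for line in status_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match and match.group(3) == address:
            return STATES[match.group(2)]
    return None


if __name__ == "__main__":
    # A node that is still decommissioning shows up as Leaving.
    print(node_state(SAMPLE, "10.0.0.2"))
```

A caller whose JMX call timed out could poll this and treat "Leaving" as 
"still in progress"; what it should see after a failure is the open question.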

> A new nodetool/JMX command that tells whether node's decommission failed or 
> not
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18555
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18555
>             Project: Cassandra
>          Issue Type: Task
>          Components: Observability/JMX
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Normal
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Currently, when a node is being decommissioned and if any failure happens, 
> then an exception is thrown back to the caller.
> But Cassandra's decommission takes considerable time, ranging from minutes to 
> hours to days. There are various scenarios in which the caller may need to 
> probe the status again:
>  * The caller times out
>  * It is not possible to keep the caller hanging for such a long time
> And if the caller does not know what happened internally, then it cannot 
> retry, etc., leading to other issues.
> So, in this ticket, I am going to add a new nodetool/JMX command that can be 
> invoked by the caller anytime, and it will return the correct status.
> It might look like a small change, but when we need to operate Cassandra at 
> scale in a large fleet, this becomes a bottleneck and requires 
> constant operator intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

