[ 
https://issues.apache.org/jira/browse/CASSANDRA-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732543#comment-17732543
 ] 

Brandon Williams commented on CASSANDRA-18555:
----------------------------------------------

bq. If you checked the whole communication

I did... I think here's where the disconnect is:

bq. If it is not persisted, if we fail to decommission and we kill the instance 
and it is started again, what state that node is actually in?

The state it was in before starting decom: a normal member of the ring.  If you 
accidentally begin a decommission and restart the node to get out of it, this 
is what you would expect.

bq. partially decommission itself which is quite dangerous, isn't it?

No.  Barring space issues, a node can attempt to decom as many times as it 
needs to until it completes.  We shouldn't be changing this behavior, and so 
persisting in the bootstrap state doesn't make sense.  We only persist when 
decom completes so that we can prevent it from rejoining, but retrying failures 
is allowed.
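
To make the behavior above concrete, here is a toy model (not Cassandra's 
actual code; the Node class, the "persisted" dict, and all method and field 
names are hypothetical). It assumes only what is described above: a completed 
decommission is the one thing written to durable state, so a restart after a 
failed attempt brings the node back as a normal ring member, and retries are 
allowed.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "NORMAL"
    LEAVING = "LEAVING"
    DECOMMISSIONED = "DECOMMISSIONED"


class Node:
    """Toy node: only a *completed* decommission survives a restart."""

    def __init__(self, persisted=None):
        # `persisted` stands in for durable local state (hypothetical).
        self.persisted = persisted if persisted is not None else {}
        # On startup, the mode is derived from persisted state alone.
        if self.persisted.get("decommissioned"):
            self.mode = Mode.DECOMMISSIONED
        else:
            self.mode = Mode.NORMAL
        # In-memory flag only; it is lost on restart by design.
        self.last_decommission_failed = False

    def decommission(self, stream_ok):
        """Attempt a decommission; `stream_ok` simulates success or failure."""
        if self.mode is Mode.DECOMMISSIONED:
            raise RuntimeError("node is already decommissioned")
        self.mode = Mode.LEAVING
        if not stream_ok:
            # Failure: nothing is persisted, so a retry or restart is safe.
            self.last_decommission_failed = True
            return False
        # Success: persist so the node cannot rejoin after a restart.
        self.persisted["decommissioned"] = True
        self.mode = Mode.DECOMMISSIONED
        self.last_decommission_failed = False
        return True

    def restart(self):
        """Simulate a process restart: only persisted state carries over."""
        return Node(self.persisted)


node = Node()
node.decommission(stream_ok=False)   # first attempt fails
node = node.restart()                # back to a normal ring member
retry_ok = node.decommission(stream_ok=True)
after_restart = node.restart()       # stays decommissioned
```

In this sketch the failure flag lives only in memory, which is the kind of 
status a nodetool/JMX probe could report without changing what is persisted.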




> A new nodetool/JMX command that tells whether node's decommission failed or 
> not
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18555
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18555
>             Project: Cassandra
>          Issue Type: Task
>          Components: Observability/JMX
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Normal
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Currently, when a node is being decommissioned and a failure happens, 
> an exception is thrown back to the caller.
> But Cassandra's decommission takes considerable time, ranging from minutes to 
> hours to days. There are various scenarios in which the caller may need to 
> probe the status again:
>  * The caller times out
>  * It is not possible to keep the caller hanging for such a long time
> And if the caller does not know what happened internally, then it cannot 
> retry, etc., leading to other issues.
> So, in this ticket, I am going to add a new nodetool/JMX command that can be 
> invoked by the caller anytime, and it will return the correct status.
> It might look like a small change, but when we need to operate Cassandra at 
> scale in a large fleet, this becomes a bottleneck and requires 
> constant operator intervention.



