[ 
https://issues.apache.org/jira/browse/CASSANDRA-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carl Yeksigian updated CASSANDRA-8801:
--------------------------------------
    Attachment: 8801-v2.txt

I was able to bring a node back up after decommissioning it; the 
{{DECOMMISSIONED}} state does not appear to get saved to the {{system.local}} 
table.

The cause is IOErrors thrown by MessagingService while it was trying to close 
the socket threads. Wrapping that MessagingService call in a try block fixed 
the problem: when I restarted, the node reported an error saying that it had 
been decommissioned, and I was able to use the {{override_decommission}} flag.
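
For illustration only (this is not the attached patch), here is a minimal Java 
sketch of the idea: if the messaging shutdown throws, catch the error so that 
the decommissioned state is still recorded afterwards. The class, the 
{{Messaging}} interface, and the {{State}} enum are hypothetical stand-ins and 
not Cassandra's actual types; the sketch also assumes the failure surfaces as 
{{java.io.IOError}}.

```java
import java.io.IOError;

public class DecommissionSketch {
    /** Hypothetical stand-in for the state a node saves about itself. */
    public enum State { NORMAL, DECOMMISSIONED }

    /** Hypothetical stand-in for the messaging layer that closes sockets. */
    public interface Messaging {
        void shutdown();
    }

    private State persisted = State.NORMAL;

    public State persistedState() {
        return persisted;
    }

    public void decommission(Messaging messaging) {
        try {
            messaging.shutdown();
        } catch (IOError e) {
            // Closing sockets can fail; report it, but do not let the error
            // skip the state update below.
            System.err.println("Ignoring error during messaging shutdown: " + e);
        }
        // Reached whether or not shutdown() threw an IOError.
        persisted = State.DECOMMISSIONED;
    }
}
```

Without the try block, an IOError from {{shutdown()}} would propagate out of 
{{decommission()}} before the state was updated, which matches the symptom 
described above.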

I've attached the change that I made to make it work.

Just one nit otherwise: there is an unnecessary whitespace change in 
StorageService.

> Decommissioned nodes are willing to rejoin the cluster if restarted
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-8801
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8801
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Eric Stevens
>            Assignee: Brandon Williams
>             Fix For: 3.0
>
>         Attachments: 8801-v2.txt, 8801.txt
>
>
> This issue comes from the Cassandra user group.
> If a node which was successfully decommissioned gets restarted with its data 
> directory intact, it will rejoin the cluster, immediately going to {{UN}} 
> and beginning to serve client requests.
> This is wrong - the node has consistency issues, having missed any writes 
> while it was offline because no hinted handoffs were being kept. And in the 
> best-case scenario (it's spotted and remediated immediately), near-100% 
> overstreaming will still occur.
> Also, whatever reasons the operator had for decommissioning the node would 
> presumably still be valid, so this action may threaten cluster stability if 
> the node is underpowered or suffering hardware issues.
> But what elevates this to critical is that if the node had been offline 
> longer than gc_grace_seconds, it may cause permanent and unrecoverable 
> consistency issues due to data resurrection.
> h3. Recommendation:
> A node should remember that it was decommissioned and refuse to rejoin a 
> cluster without at least a -Dflag forcing it to.  
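
The recommendation above could be sketched in Java roughly as follows. This 
is a hedged illustration, not the project's implementation: the class and 
method names are hypothetical, and the system property name passed to 
{{overrideFromSystemProperty}} is left to the caller because the exact 
spelling of the -D flag is not given in the description.

```java
public class StartupCheckSketch {
    /**
     * Refuses to continue if the node's saved state says it was
     * decommissioned and the operator has not explicitly overridden that.
     */
    public static void checkMayJoin(boolean wasDecommissioned, boolean overrideRequested) {
        if (wasDecommissioned && !overrideRequested) {
            throw new IllegalStateException(
                "This node was decommissioned and will not rejoin the cluster "
                + "unless the override flag is set");
        }
    }

    /** Reads a boolean -D style system property; true only if it is set to "true". */
    public static boolean overrideFromSystemProperty(String propertyName) {
        return Boolean.getBoolean(propertyName);
    }
}
```

A startup path would call {{checkMayJoin}} with the saved state and the 
override value before announcing itself to the cluster, so a decommissioned 
node fails fast instead of going to {{UN}}.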



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
