[
https://issues.apache.org/jira/browse/CASSANDRA-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900200#comment-17900200
]
Michael Semb Wever edited comment on CASSANDRA-20033 at 11/22/24 11:56 AM:
---------------------------------------------------------------------------
bq. the previous logic had issues due to version values not being relative to
the node that went down, which led to the cluster not converging properly...
In CASSANDRA-18913, GossipShutdown was only added to 5.0+.
If we want to fix this for 4.0 and 4.1, then 18913's 5.0 commit will need to be
back-ported.
I don't yet see any alternative, and I don't think this bug warrants that change.
(the workaround is just to restart the node again – one should be checking node
status on other nodes before progressing to the next node in a rolling restart)
The check is in this 5.0 patch:
https://github.com/apache/cassandra/compare/cassandra-5.0...thelastpickle:cassandra:mck/20033/5.0
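The idea behind the check can be sketched as follows. This is an illustrative model only, not the actual patch: the class and field names here are made up, and the real handler works against Gossiper's endpoint state. The rule is the one proposed in the description — apply a shutdown only when its generation matches the latest generation seen locally for that endpoint, and ignore it otherwise:

{code:java}
// Illustrative sketch only -- not the CASSANDRA-20033 patch itself.
// Models the proposed rule: a GOSSIP_SHUTDOWN carrying a generation is
// applied only if it matches the local generation for that endpoint.
public class ShutdownCheckSketch
{
    static final class EndpointState
    {
        int generation;      // incarnation number (set at node start)
        boolean shutdown;    // whether we consider the endpoint shut down

        EndpointState(int generation) { this.generation = generation; }
    }

    // Returns true if the shutdown was applied, false if it was stale.
    static boolean onShutdown(EndpointState local, int shutdownGeneration)
    {
        if (shutdownGeneration != local.generation)
            return false; // shutdown from an older incarnation: ignore it
        local.shutdown = true;
        return true;
    }

    public static void main(String[] args)
    {
        EndpointState state = new EndpointState(1729812724);
        // The node restarted and we already gossiped its new generation...
        state.generation = 1729812725;
        // ...then the delayed shutdown from the old incarnation arrives:
        boolean applied = onShutdown(state, 1729812724);
        System.out.println(applied ? "applied" : "ignored"); // prints "ignored"
    }
}
{code}

Without the generation in the message (pre-18913, i.e. 4.0/4.1), there is nothing to compare against, which is why the back-port would be needed.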
> Shutdown message doesn't have a generation check, causing a normal node to be
> considered shut down by other nodes in the cluster
> ----------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-20033
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20033
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Runtian Liu
> Assignee: Michael Semb Wever
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
> Attachments: ci_summary_thelastpickle_mck-20033-5.0_132.html,
> results_details_thelastpickle_mck-20033-5.0_132.tar.xz
>
>
> Recently we ran into an issue during a rolling restart of a cluster:
> one node was UN in its own gossip state, but all other nodes in the
> same cluster considered it a DN node.
> Gossip info from the node that the other nodes consider down:
>
> {code:java}
> /1.1.1.1
> generation:1729812724
> heartbeat:20676
> STATUS:26:NORMAL,-1215648874011476782
> LOAD:20620:8.030878944E10
> SCHEMA:41:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
> DC:13:dc4
> RACK:15:dc4-0
> RELEASE_VERSION:6:4.1.3
> INTERNAL_IP:11:1.1.1.1
> RPC_ADDRESS:5:1.1.1.1
> NET_VERSION:2:12
> HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
> RPC_READY:161:true
> INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
> NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
> STATUS_WITH_PORT:25:NORMAL,-1215648874011476782
> SSTABLE_VERSIONS:7:big-nb
> TOKENS:24:<hidden> {code}
> Gossip state from other nodes for this node:
>
> {code:java}
> /1.1.1.1
> generation:1729812724
> heartbeat:2147483647
> STATUS:332020:shutdown,true
> LOAD:30:8.032911052E10
> SCHEMA:19:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
> DC:13:dc4
> RACK:15:dc4-0
> RELEASE_VERSION:6:4.1.3
> INTERNAL_IP:11:1.1.1.1
> RPC_ADDRESS:5:1.1.1.1
> NET_VERSION:2:12
> HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
> RPC_READY:332021:false
> INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
> NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
> STATUS_WITH_PORT:332020:shutdown,true
> SSTABLE_VERSIONS:7:big-nb
> TOKENS:24:<hidden> {code}
> They share the same generation, but the other nodes still consider the node
> shut down.
>
>
> After a closer look into the problem, I think here's what happened.
> When the node gets restarted:
> 1. It is first gracefully shut down, broadcasting the GOSSIP_SHUTDOWN
> message to the rest of the cluster.
> 2. When it comes back up, it updates its generation and gossips with
> other nodes.
>
> If a node receives the new generation for the 1.1.1.1 node first, and only
> then receives the GOSSIP_SHUTDOWN message from step 1 (assuming a very large
> network delay between the 1.1.1.1 node and the bad receiver node), we run
> into the above situation.
>
> I think the GOSSIP_SHUTDOWN message should carry the generation, and
> GossipShutdownVerbHandler should only bump the heartbeat for the local state
> if the generation is the same. If the local generation is higher, we should
> ignore the shutdown message.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]