[ 
https://issues.apache.org/jira/browse/CASSANDRA-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900680#comment-17900680
 ] 

Michael Semb Wever edited comment on CASSANDRA-20033 at 11/24/24 4:37 PM:
--------------------------------------------------------------------------

~5 tests are failing, around {{ActiveRepairService}} throwing this exception:
{noformat}
java.lang.RuntimeException: Did not get replies from all endpoints.)
                at org.apache.cassandra.utils.concurrent.AsyncPromise.setSuccess(AsyncPromise.java:106)
{noformat}
-I might not have time to look into this for a week, maybe [~curlylrt] you're 
interested in jumping in and co-authoring this on…?-  EDIT:  failure not 
related after all, see below.

Can be reproduced with
{code}
.build/docker/run-tests.sh jvm-dtest "\\.RepairTest" 11
{code}


was (Author: michaelsembwever):
~5 tests are failing, around {{ActiveRepairService}} throwing this exception:
{noformat}
java.lang.RuntimeException: Did not get replies from all endpoints.)
                at org.apache.cassandra.utils.concurrent.AsyncPromise.setSuccess(AsyncPromise.java:106)
{noformat}
I might not have time to look into this for a week, maybe [~curlylrt] you're 
interested in jumping in and co-authoring this on…?

Can be reproduced with
{code}
.build/docker/run-tests.sh jvm-dtest "\\.RepairTest" 11
{code}

> Shutdown message doesn't have generation check causing normal node considered 
> shutdown by other nodes in cluster
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20033
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20033
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Runtian Liu
>            Assignee: Michael Semb Wever
>            Priority: Normal
>             Fix For: 5.0.x
>
>         Attachments: ci_summary_thelastpickle_mck-20033-5.0_132.html, 
> results_details_thelastpickle_mck-20033-5.0_132.tar.xz
>
>
> Recently we ran into an issue during a rolling restart of a cluster. We 
> found one node was UN in its own gossip state, but all other nodes in the 
> same cluster considered it a DN node.
> Gossip info from the node that is being considered as down by other nodes:
>  
> {code:java}
> /1.1.1.1
>   generation:1729812724
>   heartbeat:20676
>   STATUS:26:NORMAL,-1215648874011476782
>   LOAD:20620:8.030878944E10
>   SCHEMA:41:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
>   DC:13:dc4
>   RACK:15:dc4-0
>   RELEASE_VERSION:6:4.1.3
>   INTERNAL_IP:11:1.1.1.1
>   RPC_ADDRESS:5:1.1.1.1
>   NET_VERSION:2:12
>   HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
>   RPC_READY:161:true
>   INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
>   NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
>   STATUS_WITH_PORT:25:NORMAL,-1215648874011476782
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:24:<hidden> {code}
> Gossip state from other nodes for this node:
>  
> {code:java}
> /1.1.1.1
>   generation:1729812724
>   heartbeat:2147483647
>   STATUS:332020:shutdown,true
>   LOAD:30:8.032911052E10
>   SCHEMA:19:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
>   DC:13:dc4
>   RACK:15:dc4-0
>   RELEASE_VERSION:6:4.1.3
>   INTERNAL_IP:11:1.1.1.1
>   RPC_ADDRESS:5:1.1.1.1
>   NET_VERSION:2:12
>   HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
>   RPC_READY:332021:false
>   INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
>   NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
>   STATUS_WITH_PORT:332020:shutdown,true
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:24:<hidden> {code}
> They share the same generation, but the other nodes consider the node shut 
> down.
>  
>  
> After a closer look into the problem, I think here's what happened.
> When the node gets restarted:
> 1. It is first gracefully shut down, broadcasting the GOSSIP_SHUTDOWN 
> message to the rest of the cluster.
> 2. When it comes back up, it updates its generation and gossips with the 
> other nodes.
>  
> If another node receives the new generation for the 1.1.1.1 node first, and 
> only then receives the GOSSIP_SHUTDOWN message from step 1 (assuming a very 
> large network delay between the 1.1.1.1 node and the bad receiver node), we 
> run into the situation above.
>  
> I think the GOSSIP_SHUTDOWN message should carry the generation 
> information, and GossipShutdownVerbHandler should only bump the heartbeat 
> for the local state if the generation is the same. If the local generation 
> is higher, we should ignore the shutdown message. 
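The proposed generation check can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: the {{EndpointState}} stand-in, {{localStates}} map, and {{applyShutdown}} helper are hypothetical and are not Cassandra's actual GossipShutdownVerbHandler API. The idea shown is only that a delayed GOSSIP_SHUTDOWN carrying a generation lower than the locally known one is ignored, so the restarted node keeps its NORMAL status.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed generation check; all names here are
// illustrative, not Cassandra's actual gossip implementation.
public class ShutdownGenerationCheck
{
    // Minimal stand-in for the locally known gossip state of an endpoint.
    static final class EndpointState
    {
        int generation;
        String status;
        EndpointState(int generation, String status)
        {
            this.generation = generation;
            this.status = status;
        }
    }

    static final Map<String, EndpointState> localStates = new HashMap<>();

    // Proposed handling of a GOSSIP_SHUTDOWN message: apply it only when its
    // generation is at least the locally known one. A lower generation means
    // the message predates a restart we have already observed, so ignore it.
    static boolean applyShutdown(String endpoint, int msgGeneration)
    {
        EndpointState state = localStates.get(endpoint);
        if (state == null || msgGeneration < state.generation)
            return false; // stale shutdown from before the restart: ignore
        state.status = "shutdown";
        return true;
    }

    public static void main(String[] args)
    {
        // Other nodes know /1.1.1.1 as NORMAL with its pre-restart generation.
        localStates.put("/1.1.1.1", new EndpointState(1729812724, "NORMAL"));
        // The node restarts; its new, higher generation is learned first...
        localStates.get("/1.1.1.1").generation = 1729812725;
        // ...then the delayed GOSSIP_SHUTDOWN with the old generation arrives.
        boolean applied = applyShutdown("/1.1.1.1", 1729812724);
        // With the check in place the stale message is dropped and the node
        // stays NORMAL instead of being wrongly marked shutdown.
        System.out.println(applied + " " + localStates.get("/1.1.1.1").status);
        // prints "false NORMAL"
    }
}
```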



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
