Runtian Liu created CASSANDRA-20033:
---------------------------------------

             Summary: Shutdown message doesn't have generation check causing 
normal node considered shutdown by other nodes in cluster
                 Key: CASSANDRA-20033
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20033
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip
            Reporter: Runtian Liu


Recently we ran into an issue while doing a rolling restart of a cluster. We 
found that one node is UN in its own gossip state, but all other nodes in the 
same cluster consider it a DN node.

Gossip info from the node that is being considered as down by other nodes:

 
{code:java}
/1.1.1.1
  generation:1729812724
  heartbeat:20676
  STATUS:26:NORMAL,-1215648874011476782
  LOAD:20620:8.030878944E10
  SCHEMA:41:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
  DC:13:dc4
  RACK:15:dc4-0
  RELEASE_VERSION:6:4.1.3
  INTERNAL_IP:11:1.1.1.1
  RPC_ADDRESS:5:1.1.1.1
  NET_VERSION:2:12
  HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
  RPC_READY:161:true
  INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
  NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
  STATUS_WITH_PORT:25:NORMAL,-1215648874011476782
  SSTABLE_VERSIONS:7:big-nb
  TOKENS:24:<hidden> {code}
Gossip state from other nodes for this node:

 
{code:java}
/1.1.1.1
  generation:1729812724
  heartbeat:2147483647
  STATUS:332020:shutdown,true
  LOAD:30:8.032911052E10
  SCHEMA:19:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
  DC:13:dc4
  RACK:15:dc4-0
  RELEASE_VERSION:6:4.1.3
  INTERNAL_IP:11:1.1.1.1
  RPC_ADDRESS:5:1.1.1.1
  NET_VERSION:2:12
  HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
  RPC_READY:332021:false
  INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
  NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
  STATUS_WITH_PORT:332020:shutdown,true
  SSTABLE_VERSIONS:7:big-nb
  TOKENS:24:<hidden> {code}
Both views share the same generation, but the other nodes consider the node to 
be shut down.

 

 

After a closer look at the problem, I think this is what happened.

When the node gets restarted:

1. It is first shut down gracefully and broadcasts the GOSSIP_SHUTDOWN 
message to the rest of the cluster.

2. When it comes back up, it updates its generation and gossips with 
other nodes.

 

If a node receives the new generation for 1.1.1.1 first and only then receives 
the GOSSIP_SHUTDOWN message from step 1 (assuming a very large network delay 
between 1.1.1.1 and the bad receiver node), we run into the situation 
above.
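
To make the ordering concrete, here is a minimal model of the suspected race. It is not Cassandra code: `PeerView`, `onGossipState` and `onShutdown` are hypothetical stand-ins for the receiver's endpoint state and handlers, and the old generation value is made up. The only thing taken from the dumps above is that the receiver ends up with the new generation, heartbeat 2147483647 and a shutdown status.

```java
public class ShutdownRaceSketch {
    /** Hypothetical, simplified view that a peer keeps for one endpoint. */
    static final class PeerView {
        int generation;
        int heartbeat;
        String status;

        PeerView(int generation, int heartbeat, String status) {
            this.generation = generation;
            this.heartbeat = heartbeat;
            this.status = status;
        }
    }

    /** A strictly newer generation (after a restart) replaces the old state. */
    static void onGossipState(PeerView view, int generation, int heartbeat, String status) {
        if (generation > view.generation) {
            view.generation = generation;
            view.heartbeat = heartbeat;
            view.status = status;
        }
    }

    /**
     * Assumed current behaviour: the shutdown message carries no generation,
     * so it is applied to whatever state the receiver holds at that moment.
     */
    static void onShutdown(PeerView view) {
        view.status = "shutdown";
        view.heartbeat = Integer.MAX_VALUE; // 2147483647, as in the second dump
    }

    /** Replays the suspected order: new generation first, delayed shutdown second. */
    static PeerView replayDelayedShutdown() {
        // 1729800000 is a made-up generation for the pre-restart process.
        PeerView view = new PeerView(1729800000, 500, "NORMAL");
        onGossipState(view, 1729812724, 1, "NORMAL"); // gossip from the restarted node
        onShutdown(view);                             // late GOSSIP_SHUTDOWN from step 1
        return view;
    }

    public static void main(String[] args) {
        PeerView view = replayDelayedShutdown();
        System.out.println(view.generation + " " + view.heartbeat + " " + view.status);
    }
}
```

In this model the receiver keeps the new generation but marks the node as shutdown, which matches the second gossip dump.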

 

I think the GOSSIP_SHUTDOWN message should carry the generation, and 
GossipShutdownVerbHandler should only bump the heartbeat for the local state if 
the generation is the same. If the local generation is higher, we should ignore 
the shutdown message.
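
A rough sketch of the check I have in mind follows. The class, fields and method are hypothetical and do not mirror the real GossipShutdownVerbHandler or message format. The case where the message carries a higher generation than the local state is not covered by the proposal; the sketch simply leaves the state untouched there, which is an assumption.

```java
public class GenerationCheckedShutdown {
    /** Hypothetical, simplified local state kept for a remote endpoint. */
    static final class EndpointState {
        int generation;
        int heartbeat;
        String status;

        EndpointState(int generation, int heartbeat, String status) {
            this.generation = generation;
            this.heartbeat = heartbeat;
            this.status = status;
        }
    }

    /**
     * Applies a shutdown message only when it refers to the generation we
     * currently hold for the sender. Returns true if the state was changed.
     */
    static boolean applyShutdown(EndpointState local, int shutdownGeneration) {
        if (local.generation > shutdownGeneration) {
            // Stale message from before the restart: ignore it.
            return false;
        }
        if (local.generation == shutdownGeneration) {
            local.status = "shutdown";
            local.heartbeat = Integer.MAX_VALUE; // bump the heartbeat as today
            return true;
        }
        // Message generation is newer than local state: outside the proposal,
        // so this sketch does nothing (assumption).
        return false;
    }

    public static void main(String[] args) {
        EndpointState state = new EndpointState(1729812724, 20676, "NORMAL");
        // 1729800000 is a made-up generation for the pre-restart process.
        System.out.println(applyShutdown(state, 1729800000) + " " + state.status);
        System.out.println(applyShutdown(state, 1729812724) + " " + state.status);
    }
}
```

With this check, the delayed shutdown from the previous process is dropped, while a shutdown sent by the current process is still applied.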



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

