Just filed https://issues.apache.org/jira/browse/CASSANDRA-18913 (Gossip NPE due to shutdown event corrupting empty statuses), which is where I saw this issue.
When we shut down gossip we send a GOSSIP_SHUTDOWN message, which is handled by org.apache.cassandra.gms.Gossiper#markAsShutdown. There is an issue with the current implementation: the peers mutate the state for the node shutting down, which causes pending gossip events to be ignored!

A simple example of the issue:

1. Node1 starts up and starts bootstrapping
2. Node1 joins the ring
3. Node1 disables gossip (or halts)

In this case some nodes in the cluster will see node1 join the ring, and others won't. The ones that have seen the gossip shutdown will set the version to Integer.MAX_VALUE, which stops gossip from syncing any unseen states.

Why is this a problem? Let's say you now need to host replace node1, and the seeds you are using didn't see the join ring event. You then get the following error during the host replacement:

"Could not find tokens for %s to replace"

To solve this and clean things up, I would like to send the state from the node shutting down and avoid peers mutating endpoint states they don't own; with this the cluster should eventually converge! This would be a protocol change, so I would need to make sure everyone is cool with me doing this in 5.0.
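To make the scenario above concrete, here is a toy model of it in Java. It is only a sketch and not Cassandra's actual code: the class names (GossipShutdownSketch, Versioned, EndpointView), the version numbers, and the simplified "only accept states newer than my local max version" merge rule are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are NOT Cassandra's gossip classes.
final class GossipShutdownSketch {
    /** One versioned application state, e.g. STATUS or TOKENS. */
    static final class Versioned {
        final String value;
        final int version;
        Versioned(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    /** A peer's local view of node1's endpoint state. */
    static final class EndpointView {
        final Map<String, Versioned> states = new HashMap<>();

        int maxVersion() {
            int max = 0;
            for (Versioned v : states.values()) max = Math.max(max, v.version);
            return max;
        }

        // Assumed sync rule: accept only states newer than the highest version held locally.
        void mergeNewerThanLocalMax(Map<String, Versioned> remote) {
            final int localMax = maxVersion();
            remote.forEach((key, v) -> {
                if (v.version > localMax) states.put(key, v);
            });
        }

        // Behaviour described in the post: the peer rewrites state it does not own.
        void markAsShutdownLocally() {
            states.put("STATUS", new Versioned("shutdown", Integer.MAX_VALUE));
        }
    }

    /** Returns whether a peer that missed the ring join ends up knowing node1's tokens. */
    static boolean laggingPeerLearnsTokens(boolean ownerSendsState) {
        // node1's own state after it joined the ring
        Map<String, Versioned> owner = new HashMap<>();
        owner.put("TOKENS", new Versioned("t1", 2));
        owner.put("STATUS", new Versioned("NORMAL", 3));

        EndpointView upToDate = new EndpointView(); // saw the ring join
        upToDate.states.putAll(owner);

        EndpointView lagging = new EndpointView();  // only saw bootstrapping
        lagging.states.put("STATUS", new Versioned("BOOT", 1));

        if (ownerSendsState) {
            // Proposed: node1 bumps its own version and ships its state with the shutdown.
            owner.put("STATUS", new Versioned("shutdown", 4));
            lagging.mergeNewerThanLocalMax(owner);
        } else {
            // Current: the peer mutates node1's state itself.
            lagging.markAsShutdownLocally();
        }

        // A later gossip round with the up-to-date peer.
        lagging.mergeNewerThanLocalMax(upToDate.states);
        return lagging.states.containsKey("TOKENS");
    }

    public static void main(String[] args) {
        System.out.println("current:  " + laggingPeerLearnsTokens(false));
        System.out.println("proposed: " + laggingPeerLearnsTokens(true));
    }
}
```

In this model the lagging peer never learns TOKENS under the current behaviour, because nothing another peer holds can exceed Integer.MAX_VALUE. When node1 sends its own state with the shutdown, the peer picks up the tokens along with the new status.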