Ines Potier created CASSANDRA-18319:
---------------------------------------
             Summary: Cassandra in Kubernetes: IP switch decommission issue
                 Key: CASSANDRA-18319
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
             Project: Cassandra
          Issue Type: Bug
            Reporter: Ines Potier


We have recently encountered a recurring old-IP reappearance issue while testing decommissions on some of our Kubernetes Cassandra staging clusters.

*Issue Description*

In Kubernetes, a Cassandra node can change IP at each pod bounce. We have noticed that this behavior, combined with a decommission operation, can get the cluster into an erroneous state.

Consider the following situation: a Cassandra node {{node1}}, with {{hostId1}}, owning 20.5% of the token ring, bounces and switches IP ({{old_IP}} → {{new_IP}}). After a couple of gossip iterations, every other node's {{nodetool status}} output includes a {{new_IP}} UN entry owning 20.5% of the token ring, and no {{old_IP}} entry.

Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not hold much data, and the decommission completes quickly. Logs on other nodes start acknowledging that {{node1}} has left, and soon the {{new_IP}} UL entry disappears from {{nodetool status}}. {{node1}}'s pod is deleted.

After about a minute's delay, the cluster enters the erroneous state: an {{old_IP}} DN entry reappears in {{nodetool status}}, owning 20.5% of the token ring. No node owns this IP anymore, and according to the logs, {{old_IP}} is still associated with {{hostId1}}.

*Issue Root Cause*

By digging through Cassandra logs and re-testing this scenario repeatedly, we have reached the following conclusions:
* Other nodes continue exchanging gossip about {{old_IP}} even after it becomes a fatClient.
* The fatClient timeout and subsequent quarantine do not stop {{old_IP}} from reappearing in a node's gossip state once its quarantine is over. We believe this is due to a misalignment across nodes on {{old_IP}}'s expiration time.
* Once {{new_IP}} has left the cluster and a node receives {{old_IP}}'s next gossip state message, StorageService will no longer face collisions (or will, but with an even older IP) for {{hostId1}} and its corresponding tokens. As a result, {{old_IP}} regains ownership of 20.5% of the token ring.

*Proposed fix*

Following the above investigation, we are considering the following fix. When a node receives a gossip status change with {{STATE_LEFT}} for a leaving endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring, purge from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the following criteria (see the sketch below):
* {{endpointStateMap}} contains this endpoint
* The endpoint is not currently a token owner ({{!tokenMetadata.isMember(endpoint)}})
* The endpoint's {{hostId}} matches the {{hostId}} of {{new_IP}}
* The endpoint is older than the leaving endpoint {{new_IP}} ({{Gossiper.instance.compareEndpointStartup}})
* The endpoint's token range (from {{endpointStateMap}}) intersects with {{new_IP}}'s

The intention of this modification is to force all nodes to realign on {{old_IP}}'s expiration and expunge it from Gossip, so that it does not reappear after {{new_IP}} leaves the ring.

Another approach we have also been considering is expunging {{old_IP}} at the moment of the StorageService collision resolution.
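To make the purge criteria concrete, below is a minimal, self-contained sketch of the proposed {{STATE_LEFT}} handling. It models gossip state with plain Java collections rather than the real Gossiper internals: the class, field, and method names ({{StateLeftPurgeSketch}}, {{startupGeneration}}, {{purgeStaleAliases}}, etc.) are illustrative assumptions, and only {{endpointStateMap}}, {{tokenMetadata.isMember}}, {{Gossiper.instance.compareEndpointStartup}}, and {{evictFromMembership}} correspond to the names used above.

{code:java}
import java.util.*;

// Standalone model of the proposed STATE_LEFT purge. All names here are
// illustrative stand-ins for Cassandra's Gossiper/TokenMetadata internals,
// not the actual Cassandra API.
final class StateLeftPurgeSketch
{
    static final class EndpointState
    {
        final UUID hostId;
        final long startupGeneration; // stand-in for Gossiper.instance.compareEndpointStartup ordering
        final Set<String> tokens;     // tokens gossiped for this endpoint

        EndpointState(UUID hostId, long startupGeneration, Set<String> tokens)
        {
            this.hostId = hostId;
            this.startupGeneration = startupGeneration;
            this.tokens = tokens;
        }
    }

    final Map<String, EndpointState> endpointStateMap = new HashMap<>(); // IP -> gossip state
    final Set<String> tokenOwners = new HashSet<>();                     // stand-in for tokenMetadata.isMember

    // Applied when a STATE_LEFT status change arrives for leavingIp (new_IP in
    // the description above), before new_IP itself is evicted from the token ring.
    void purgeStaleAliases(String leavingIp)
    {
        EndpointState leaving = endpointStateMap.get(leavingIp);
        if (leaving == null)
            return;
        // removeIf models evictFromMembership for every endpoint matching all criteria.
        endpointStateMap.entrySet().removeIf(e -> {
            String endpoint = e.getKey();
            EndpointState state = e.getValue();
            return !endpoint.equals(leavingIp)
                && !tokenOwners.contains(endpoint)                       // not currently a token owner
                && state.hostId.equals(leaving.hostId)                   // same hostId as new_IP
                && state.startupGeneration < leaving.startupGeneration   // strictly older than new_IP
                && !Collections.disjoint(state.tokens, leaving.tokens);  // token ranges intersect
        });
    }

    public static void main(String[] args)
    {
        StateLeftPurgeSketch gossip = new StateLeftPurgeSketch();
        UUID hostId1 = UUID.randomUUID();
        // old_IP: stale entry left behind by the bounce; new_IP: current, leaving endpoint.
        gossip.endpointStateMap.put("old_IP", new EndpointState(hostId1, 1, Set.of("t1", "t2")));
        gossip.endpointStateMap.put("new_IP", new EndpointState(hostId1, 2, Set.of("t1", "t2")));
        gossip.tokenOwners.add("new_IP"); // only new_IP still owns tokens

        gossip.purgeStaleAliases("new_IP");
        System.out.println(gossip.endpointStateMap.keySet()); // prints [new_IP]
    }
}
{code}

Running {{main}} prints {{[new_IP]}}: the stale {{old_IP}} alias is expunged before {{new_IP}} itself leaves the ring, so no node would be left holding a gossip entry that could resurface after quarantine.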