[ https://issues.apache.org/jira/browse/CASSANDRA-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706439#comment-17706439 ]
Brandon Williams commented on CASSANDRA-18319:
----------------------------------------------

bq. there are few places where the host_id won't be available

Unfortunately, I think this is a larger problem. We need to be able to ignore/quarantine a justRemoved endpoint when we only have the IP address before further states have been processed.

> Cassandra in Kubernetes: IP switch decommission issue
> -----------------------------------------------------
>
> Key: CASSANDRA-18319
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Ines Potier
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 3.11_gossipinfo.zip, node1_gossipinfo.txt,
> test_decommission_after_ip_change_logs.zip,
> v4.0_1678853171792_test_decommission_after_ip_change.zip, write_failure.txt
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We have recently encountered a recurring old IP reappearance issue while
> testing decommissions on some of our Kubernetes Cassandra staging clusters.
> *Issue Description*
> In Kubernetes, a Cassandra node can change IP at each pod bounce. We have
> noticed that this behavior, associated with a decommission operation, can get
> the cluster into an erroneous state.
> Consider the following situation: a Cassandra node {{node1}}, with
> {{hostId1}}, owning 20.5% of the token ring, bounces and switches IP
> ({{old_IP}} → {{new_IP}}). After a couple of gossip iterations, all
> other nodes’ nodetool status output includes a {{new_IP}} UN entry owning
> 20.5% of the token ring and no {{old_IP}} entry.
> Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not
> have a lot of data, and the decommission operation completes pretty quickly.
> Logs on other nodes start showing acknowledgment that {{node1}} has left and
> soon, nodetool status’ {{new_IP}} UL entry disappears. {{node1}}’s pod is
> deleted.
> After a minute delay, the cluster enters the erroneous state. An {{old_IP}}
> DN entry reappears in nodetool status, owning 20.5% of the token ring. No
> node owns this IP anymore and according to logs, {{old_IP}} is still
> associated with {{hostId1}}.
> *Issue Root Cause*
> By digging through Cassandra logs, and re-testing this scenario over and over
> again, we have reached the following conclusions:
> * Other nodes will continue exchanging gossip about {{old_IP}}, even after
> it becomes a fatClient.
> * The fatClient timeout and subsequent quarantine do not stop {{old_IP}}
> from reappearing in a node’s Gossip state, once its quarantine is over. We
> believe that this is due to a misalignment on all nodes’ {{old_IP}}
> expiration time.
> * Once {{new_IP}} has left the cluster, and {{old_IP}}’s next gossip state
> message is received by a node, StorageService will no longer face collisions
> (or will, but with an even older IP) for {{hostId1}} and its corresponding
> tokens. As a result, {{old_IP}} will regain ownership of 20.5% of the token
> ring.
> *Proposed fix*
> Following the above investigation, we were thinking about implementing the
> following fix:
> When a node receives a gossip status change with {{STATE_LEFT}} for a leaving
> endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring,
> purge from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the
> following criteria:
> * {{endpointStateMap}} contains this endpoint
> * The endpoint is not currently a token owner
> ({{!tokenMetadata.isMember(endpoint)}})
> * The endpoint’s {{hostId}} matches the {{hostId}} of {{new_IP}}
> * The endpoint is older than {{leaving_IP}}
> ({{Gossiper.instance.compareEndpointStartup}})
> * The endpoint’s token range (from {{endpointStateMap}}) intersects with
> {{new_IP}}’s
> This modification’s intention is to force nodes to realign on {{old_IP}}
> expiration, and expunge it from Gossip so it does not reappear after
> {{new_IP}} leaves the ring.
> Another approach we have also been considering is expunging {{old_IP}} at the
> moment of the StorageService collision resolution.
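
The purge criteria listed under the proposed fix can be sketched as a single filter. This is a minimal, self-contained illustration and not Cassandra code: the class {{StaleEndpointPurgeSketch}}, the {{EndpointInfo}} type, the {{findStaleEndpoints}} method, the integer {{generation}} field, and the modeling of a token range as a set of {{Long}} tokens are all assumptions made for the example. A real patch would presumably work against {{endpointStateMap}}, {{tokenMetadata.isMember}} and {{Gossiper.instance.compareEndpointStartup}} instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch of the purge criteria; none of these types are Cassandra's. */
final class StaleEndpointPurgeSketch {

    /** Simplified stand-in for the gossip state kept per endpoint (an assumption). */
    static final class EndpointInfo {
        final UUID hostId;
        final int generation;     // assumed startup ordering: lower means started earlier
        final Set<Long> tokens;   // tokens advertised in gossip for this endpoint
        final boolean tokenOwner; // stand-in for tokenMetadata.isMember(endpoint)

        EndpointInfo(UUID hostId, int generation, Set<Long> tokens, boolean tokenOwner) {
            this.hostId = hostId;
            this.generation = generation;
            this.tokens = tokens;
            this.tokenOwner = tokenOwner;
        }
    }

    /**
     * Returns the endpoints that would be purged from gossip when leavingIp
     * reports STATE_LEFT, sorted by address for a deterministic result.
     */
    static List<String> findStaleEndpoints(Map<String, EndpointInfo> endpointStateMap,
                                           String leavingIp) {
        List<String> stale = new ArrayList<>();
        EndpointInfo leaving = endpointStateMap.get(leavingIp);
        if (leaving == null) {
            return stale; // nothing known about the leaving endpoint
        }
        // Iterating the map covers the "endpointStateMap contains this endpoint" criterion.
        for (Map.Entry<String, EndpointInfo> entry : endpointStateMap.entrySet()) {
            if (entry.getKey().equals(leavingIp)) {
                continue; // the leaving endpoint itself is evicted separately
            }
            EndpointInfo candidate = entry.getValue();
            boolean notTokenOwner = !candidate.tokenOwner;
            boolean sameHostId = candidate.hostId.equals(leaving.hostId);
            boolean olderThanLeaving = candidate.generation < leaving.generation;
            boolean tokensIntersect = !Collections.disjoint(candidate.tokens, leaving.tokens);
            if (notTokenOwner && sameHostId && olderThanLeaving && tokensIntersect) {
                stale.add(entry.getKey());
            }
        }
        Collections.sort(stale);
        return stale;
    }
}
```

In the scenario from the issue, a map holding {{old_IP}} (same {{hostId1}}, earlier startup, not a token owner) and {{new_IP}} (current token owner) would yield only {{old_IP}} when called with {{new_IP}} as the leaving endpoint, so {{old_IP}} could be expunged before {{new_IP}} is evicted from the ring.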