Ines Potier created CASSANDRA-18319:
---------------------------------------

             Summary: Cassandra in Kubernetes: IP switch decommission issue
                 Key: CASSANDRA-18319
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
             Project: Cassandra
          Issue Type: Bug
            Reporter: Ines Potier


While testing decommissions on some of our Kubernetes Cassandra staging clusters, we have repeatedly run into an issue where a decommissioned node's old IP reappears in the cluster.


*Issue Description*

In Kubernetes, a Cassandra node can change IP at each pod bounce. We have noticed that this behavior, combined with a decommission operation, can put the cluster into an erroneous state.

Consider the following situation: a Cassandra node {{node1}}, with host ID {{hostId1}}, owning 20.5% of the token ring, bounces and switches IP ({{old_IP}} → {{new_IP}}). After a couple of gossip iterations, all other nodes' nodetool status output includes a {{new_IP}} UN entry owning 20.5% of the token ring, and no {{old_IP}} entry.

Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not hold a lot of data, and the decommission operation completes quickly. Logs on other nodes start acknowledging that {{node1}} has left, and soon the {{new_IP}} UL entry disappears from nodetool status. {{node1}}'s pod is deleted.

About a minute later, the cluster enters the erroneous state: an {{old_IP}} DN entry reappears in nodetool status, owning 20.5% of the token ring. No node owns this IP anymore, and according to the logs, {{old_IP}} is still associated with {{hostId1}}.
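
For illustration, the erroneous state looks roughly like this in nodetool status (the addresses, load figures, token counts and host IDs below are placeholders, not real output from our cluster):

{noformat}
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns (effective)  Host ID    Rack
UN  10.0.1.21  1.1 MiB  16      19.8%             <hostId2>  rack1
UN  10.0.1.35  1.0 MiB  16      21.2%             <hostId3>  rack1
DN  10.0.1.7   ?        16      20.5%             <hostId1>  rack1
{noformat}

The DN line is the stale {{old_IP}} entry: no live pod owns that address anymore.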

*Issue Root Cause*

By digging through Cassandra logs and re-testing this scenario repeatedly, we have reached the following conclusions:

 * Other nodes will continue exchanging gossip about {{old_IP}}, even after it becomes a fatClient.
 * The fatClient timeout and subsequent quarantine do not stop {{old_IP}} from reappearing in a node's Gossip state once its quarantine is over. We believe this is because nodes do not agree on {{old_IP}}'s expiration time: each node evicts and quarantines {{old_IP}} on its own schedule, so a node whose quarantine has expired re-learns {{old_IP}} from a peer that has not evicted it yet (the toy simulation below illustrates this).
 * Once {{new_IP}} has left the cluster and a node receives {{old_IP}}'s next gossip state message, StorageService will no longer face collisions (or will, but with an even older IP) for {{hostId1}} and its corresponding tokens. As a result, {{old_IP}} will regain ownership of 20.5% of the token ring.
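
To make the expiration misalignment concrete, here is a small self-contained toy simulation (plain Java, not Cassandra code; the tick counts, quarantine length and eviction times are made up, with the quarantine window standing in for Gossiper's QUARANTINE_DELAY):

{code:java}
import java.util.*;

public class GossipExpiryMisalignment
{
    static final int NODES = 3;
    static final int QUARANTINE_TICKS = 3;   // stand-in for the gossip quarantine delay
    static final int[] EVICT_AT = {2, 5, 8}; // each node evicts old_IP at a different tick

    public static void main(String[] args)
    {
        boolean[] knowsOldIp = new boolean[NODES]; // does node n still have old_IP in its state map?
        int[] quarantinedUntil = new int[NODES];   // per-node quarantine expiry for old_IP
        Arrays.fill(knowsOldIp, true);

        for (int tick = 1; tick <= 12; tick++)
        {
            // local fatClient-style eviction, happening at different times per node
            for (int n = 0; n < NODES; n++)
            {
                if (tick == EVICT_AT[n] && knowsOldIp[n])
                {
                    knowsOldIp[n] = false;
                    quarantinedUntil[n] = tick + QUARANTINE_TICKS;
                }
            }
            // one gossip round: any node past its quarantine re-learns old_IP
            // from any peer that still has it in its endpoint state map
            for (int n = 0; n < NODES; n++)
            {
                if (!knowsOldIp[n] && tick >= quarantinedUntil[n])
                {
                    for (int peer = 0; peer < NODES; peer++)
                        if (knowsOldIp[peer]) { knowsOldIp[n] = true; break; }
                }
            }
            System.out.printf("tick %2d: %s%n", tick, Arrays.toString(knowsOldIp));
        }
    }
}
{code}

Because at every tick at least one node still has {{old_IP}} in its map, any node coming out of quarantine re-learns it, and the entry never disappears cluster-wide unless all evictions happen to align.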


*Proposed fix*

Following the above investigation, we are considering the following fix:

When a node receives a gossip status change with {{STATE_LEFT}} for a leaving endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring, purge from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the following criteria (see the sketch after this list):

 * {{endpointStateMap}} contains this endpoint
 * The endpoint is not currently a token owner ({{!tokenMetadata.isMember(endpoint)}})
 * The endpoint's {{hostId}} matches the {{hostId}} of {{new_IP}}
 * The endpoint is older than {{new_IP}} ({{Gossiper.instance.compareEndpointStartup}})
 * The endpoint's token range (from {{endpointStateMap}}) intersects with {{new_IP}}'s
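
As a rough sketch of what this purge could look like (the method name {{purgeStaleEndpointsFor}} and the {{getTokensFor}} helper are ours, and exact signatures and visibility may differ from trunk; this is not a tested patch):

{code:java}
// Sketch only: run when a STATE_LEFT status change is received for
// leavingEndpoint (new_IP), before new_IP is evicted from the token ring.
private void purgeStaleEndpointsFor(InetAddressAndPort leavingEndpoint)
{
    UUID leavingHostId = Gossiper.instance.getHostId(leavingEndpoint);
    Collection<Token> leavingTokens = getTokensFor(leavingEndpoint); // assumed helper

    for (Map.Entry<InetAddressAndPort, EndpointState> entry : Gossiper.instance.getEndpointStates())
    {
        InetAddressAndPort endpoint = entry.getKey();
        if (endpoint.equals(leavingEndpoint))
            continue; // the leaving endpoint itself is handled by the normal STATE_LEFT path
        if (tokenMetadata.isMember(endpoint))
            continue; // still a token owner: keep
        if (!leavingHostId.equals(Gossiper.instance.getHostId(endpoint)))
            continue; // different hostId: keep
        if (Gossiper.instance.compareEndpointStartup(endpoint, leavingEndpoint) >= 0)
            continue; // not older than the leaving endpoint: keep
        if (Collections.disjoint(getTokensFor(endpoint), leavingTokens))
            continue; // token ranges do not intersect: keep
        // all criteria met: this is a stale entry such as old_IP; expunge it
        Gossiper.instance.evictFromMembership(endpoint);
    }
}
{code}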

The intention of this modification is to force nodes to realign on {{old_IP}}'s expiration and to expunge it from Gossip, so that it does not reappear after {{new_IP}} leaves the ring.

Another approach we have been considering is expunging {{old_IP}} at the moment of StorageService's collision resolution, as sketched below.
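
In that variant, the hook point would be where {{StorageService}} resolves the {{hostId1}} endpoint collision (paraphrased from our reading of the collision handling in {{handleStateNormal}}; not verbatim trunk code):

{code:java}
// endpoint is the newer address (new_IP), existing the one currently
// holding the token (old_IP). Today the newer endpoint merely wins the
// token; this sketch additionally expunges the loser from gossip.
if (Gossiper.instance.compareEndpointStartup(endpoint, existing) > 0)
{
    // evicting old_IP here would keep it from regaining token
    // ownership once new_IP later leaves the ring
    Gossiper.instance.evictFromMembership(existing);
}
{code}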


