[ https://issues.apache.org/jira/browse/CASSANDRA-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701356#comment-17701356 ]

Brandon Williams commented on CASSANDRA-18319:
----------------------------------------------

bq. I will add that the rolling restart is necessary to get the cluster into 
this state. Since the restarts are staggered, the FatClient timeout each node 
has for the old IP will trigger at different times, and their subsequent gossip 
quarantines for the old IP will also end at different times. 

Yes, if a rolling restart occurs before a fat client is fully removed, this 
will happen.
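
To make the timing concrete, here is a rough sketch of the windows involved, 
assuming the default {{cassandra.ring_delay_ms}} of 30 seconds (constant names 
are paraphrased from Gossiper/StorageService and vary a little between versions):

{code:java}
// Sketch only - approximate defaults, not verbatim Cassandra code.
long ringDelayMs        = 30_000;                 // StorageService.RING_DELAY default
long quarantineDelayMs  = 2 * ringDelayMs;        // ~60s quarantine after an endpoint is removed
long fatClientTimeoutMs = quarantineDelayMs / 2;  // ~30s before an idle non-member is evicted

// With a rolling restart, each node starts these timers at a different moment, so the
// per-node quarantine windows for old_IP end at different times; a node whose window
// has already expired can re-learn old_IP from a peer that is still gossiping it.
{code}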

bq. Presumably there is some value for QUARANTINE_DELAY that is large enough, 
but it would probably need to scale with the number of nodes in the cluster.

One way to solve this is to do what we did for removed nodes in CASSANDRA-2961 
and use a timestamp to coordinate the removal without any previous state, except 
that fat clients don't have a concrete expiration time since they are simply 
non-members - a bootstrapping node that hasn't completed joining, for instance.
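
For comparison, a member that leaves the ring advertises an absolute expiration 
timestamp in its gossip state, which is what lets every node forget it at the 
same point. Roughly (a sketch of the existing STATUS_LEFT path, not exact code):

{code:java}
// Sketch of the CASSANDRA-2961-style coordination when a member leaves:
// the LEFT status embeds an absolute expire time, so all nodes agree on when
// the endpoint may be forgotten.
long expireTime = Gossiper.computeExpireTime(); // now + ~3 days
Gossiper.instance.addLocalApplicationState(ApplicationState.STATUS,
        StorageService.instance.valueFactory.left(tokens, expireTime));

// A fat client carries no such embedded timestamp - each node only has its own
// local idle timer - so there is nothing to coordinate the removal against.
{code}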

The good news is that this is a cosmetic problem - it is only log noise. You 
can also make it cease by assassinating the fat client IP, which will give it a 
dead state with an expiration time in addition to removing it. In the future, 
[transactional cluster 
metadata|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata] 
should be able to handle this in a more graceful manner.
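
In practice the assassination is just {{nodetool assassinate <old_IP>}}; what it 
does underneath is roughly the following (a paraphrase of 
{{Gossiper.assassinateEndpoint}}, not the verbatim implementation):

{code:java}
// Sketch: assassination gives the stale endpoint a LEFT state with an explicit
// expire time, so every node converges on forgetting old_IP instead of each
// relying on its own (misaligned) fat client timer.
Gossiper.instance.assassinateEndpoint("10.0.0.12"); // hypothetical old_IP address

// Internally this bumps the endpoint's generation, publishes a left STATUS with
// Gossiper.computeExpireTime(), and evicts the endpoint once that time passes.
{code}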

> Cassandra in Kubernetes: IP switch decommission issue
> -----------------------------------------------------
>
>                 Key: CASSANDRA-18319
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ines Potier
>            Priority: Normal
>         Attachments: 3.11_gossipinfo.zip, node1_gossipinfo.txt, 
> test_decommission_after_ip_change_logs.zip, 
> v4.0_1678853171792_test_decommission_after_ip_change.zip
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have recently encountered a recurring old IP reappearance issue while 
> testing decommissions on some of our Kubernetes Cassandra staging clusters.
> *Issue Description*
> In Kubernetes, a Cassandra node can change IP at each pod bounce. We have 
> noticed that this behavior, combined with a decommission operation, can get 
> the cluster into an erroneous state.
> Consider the following situation: a Cassandra node {{node1}}, with 
> {{hostId1}}, owning 20.5% of the token ring, bounces and switches IP 
> ({{old_IP}} → {{new_IP}}). After a couple of gossip iterations, the nodetool 
> status output on every other node includes a {{new_IP}} UN entry owning 
> 20.5% of the token ring and no {{old_IP}} entry.
> Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not 
> have a lot of data, and the decommission operation completes pretty quickly. 
> Logs on other nodes start showing acknowledgment that {{node1}} has left, and 
> soon the {{new_IP}} UL entry disappears from nodetool status. {{node1}}’s pod 
> is deleted.
> After a roughly one-minute delay, the cluster enters the erroneous state. An 
> {{old_IP}} DN entry reappears in nodetool status, owning 20.5% of the token 
> ring. No node owns this IP anymore, and according to the logs, {{old_IP}} is 
> still associated with {{hostId1}}.
> *Issue Root Cause*
> By digging through the Cassandra logs and re-testing this scenario over and 
> over again, we have reached the following conclusions: 
>  * Other nodes will continue exchanging gossip about {{old_IP}}, even after 
> it becomes a fatClient.
>  * The fatClient timeout and subsequent quarantine do not stop {{old_IP}} 
> from reappearing in a node’s Gossip state once its quarantine is over. We 
> believe this is due to the {{old_IP}} expiration times being misaligned 
> across nodes.
>  * Once {{new_IP}} has left the cluster and {{old_IP}}’s next gossip state 
> message is received by a node, StorageService will no longer face collisions 
> (or will, but with an even older IP) for {{hostId1}} and its corresponding 
> tokens. As a result, {{old_IP}} will regain ownership of 20.5% of the token 
> ring.
> *Proposed fix*
> Following the above investigation, we were thinking about implementing the 
> following fix:
> When a node receives a gossip status change with {{STATE_LEFT}} for a leaving 
> endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring, purge 
> from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the 
> following criteria (a rough sketch follows the list):
>  * {{endpointStateMap}} contains this endpoint
>  * The endpoint is not currently a token owner 
> ({{!tokenMetadata.isMember(endpoint)}})
>  * The endpoint’s {{hostId}} matches the {{hostId}} of {{new_IP}}
>  * The endpoint is older than {{leaving_IP}} 
> ({{Gossiper.instance.compareEndpointStartup}})
>  * The endpoint’s token range (from {{endpointStateMap}}) intersects with 
> {{new_IP}}’s
> This modification’s intention is to force nodes to realign on {{old_IP}} 
> expiration, and expunge it from Gossip so it does not reappear after 
> {{new_IP}} leaves the ring.
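> A rough sketch of this purge (illustrative only - {{tokensIntersect}} is a 
> hypothetical helper, and since {{evictFromMembership}} is private to Gossiper 
> the logic would have to live inside it or behind a new hook):
> {code:java}
> // Sketch, assuming this runs inside Gossiper when STATE_LEFT is observed for
> // newIp (4.0's InetAddressAndPort is assumed; 3.11 would use InetAddress).
> UUID newHostId = StorageService.instance.getTokenMetadata().getHostId(newIp);
> for (InetAddressAndPort candidate : endpointStateMap.keySet())
> {
>     if (candidate.equals(newIp))
>         continue;
>     if (StorageService.instance.getTokenMetadata().isMember(candidate))
>         continue;                               // still a token owner - keep it
>     if (!Objects.equals(getHostId(candidate), newHostId))
>         continue;                               // different host - keep it
>     if (compareEndpointStartup(candidate, newIp) >= 0)
>         continue;                               // not older than the leaving IP - keep it
>     if (!tokensIntersect(candidate, newIp))     // hypothetical helper
>         continue;
>     evictFromMembership(candidate);             // expunge old_IP for good
> }
> {code}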
> Another approach we have been considering is expunging {{old_IP}} at the 
> moment of the StorageService collision resolution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
