[ https://issues.apache.org/jira/browse/CASSANDRA-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700384#comment-17700384 ]

Raymond Huffman commented on CASSANDRA-18319:
---------------------------------------------

I've implemented a dtest that reproduces the issue:
https://github.com/apache/cassandra-dtest/pull/215
I've confirmed that the test fails on v3.0.28 and v3.11.14. Logs from these
runs are attached: [^test_decommission_after_ip_change_logs.zip]

In the test, the node at 127.0.0.6 changes its IP to 127.0.0.9.

This test performs the following steps:
 * creates a 6-node cluster
 * changes the IP of Node6 from {{127.0.0.6}} to {{127.0.0.9}}
 * performs a rolling restart of the cluster
 * decommissions Node6
 * asserts that the log message {{"Node /127.0.0.6 is now part of the cluster"}} does not appear after the rolling restart
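A rough sketch of the test flow is below, assuming ccm-style dtest helpers
(stop/start, mark_log, grep_log, nodetool). The change_node_ip() helper is
hypothetical and stands in for the ccm configuration changes the actual PR
uses to move the node from 127.0.0.6 to 127.0.0.9:

import time

def test_decommission_after_ip_change(self):
    cluster = self.cluster
    cluster.populate(6).start(wait_for_binary_proto=True)
    nodes = cluster.nodelist()
    node6 = nodes[5]

    # Change Node6's IP from 127.0.0.6 to 127.0.0.9 while it is down.
    node6.stop(wait_other_notice=True)
    change_node_ip(node6, '127.0.0.9')   # hypothetical helper
    node6.start(wait_for_binary_proto=True)

    # Rolling restart of the whole cluster.
    for node in nodes:
        node.stop(wait_other_notice=True)
        node.start(wait_for_binary_proto=True)

    # Remember where each surviving node's log ends after the restart.
    marks = {node: node.mark_log() for node in nodes[:5]}

    # Decommission the node that changed IP, then give gossip time to
    # (incorrectly) resurrect the old IP, per the report above.
    node6.nodetool('decommission')
    time.sleep(90)

    # The old IP must not be re-announced after the rolling restart.
    for node in nodes[:5]:
        found = node.grep_log('Node /127.0.0.6 is now part of the cluster',
                              from_mark=marks[node])
        assert len(found) == 0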

Running nodetool status a few seconds after the decommission looks like this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  188.92 KiB  1       16.7%             82d3d6c1-c4c8-4bdc-afc5-5827cd17544a  rack1
UN  127.0.0.2  162.84 KiB  1       16.7%             1399d77b-06d0-4d3b-9248-dbe24486a310  rack1
UN  127.0.0.3  162.07 KiB  1       16.7%             1289aa44-e4a6-422f-ab72-b5daf53a55d2  rack1
UN  127.0.0.4  162.48 KiB  1       16.7%             b38c9f92-b651-4660-941c-ca2072d24501  rack1
UN  127.0.0.5  188.36 KiB  1       16.7%             7125bfc7-a519-419b-ab1a-e9995aed40d9  rack1
?N  127.0.0.6  110.25 KiB  1       16.7%             29cbf560-7686-49f6-a06a-7184ebd42aa2  rack1
The gossipinfo output is attached: [^3.11_gossipinfo.zip]

> Cassandra in Kubernetes: IP switch decommission issue
> -----------------------------------------------------
>
>                 Key: CASSANDRA-18319
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ines Potier
>            Priority: Normal
>         Attachments: 3.11_gossipinfo.zip, node1_gossipinfo.txt, 
> test_decommission_after_ip_change_logs.zip
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have recently encountered a recurring old IP reappearance issue while 
> testing decommissions on some of our Kubernetes Cassandra staging clusters.
> *Issue Description*
> In Kubernetes, a Cassandra node can change IP at each pod bounce. We have 
> noticed that this behavior, combined with a decommission operation, can leave 
> the cluster in an erroneous state.
> Consider the following situation: a Cassandra node {{node1}}, with {{hostId1}}, 
> owning 20.5% of the token ring, bounces and switches IP ({{old_IP}} → 
> {{new_IP}}). After a couple of gossip iterations, every other node's nodetool 
> status output includes a {{new_IP}} UN entry owning 20.5% of the token ring and 
> no {{old_IP}} entry.
> Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not 
> have a lot of data, and the decommission operation completes quickly. Logs on 
> other nodes start showing acknowledgment that {{node1}} has left, and soon the 
> {{new_IP}} UL entry disappears from nodetool status. {{node1}}'s pod is deleted.
> After about a minute, the cluster enters the erroneous state: an {{old_IP}} DN 
> entry reappears in nodetool status, owning 20.5% of the token ring. No node 
> owns this IP anymore, and according to the logs, {{old_IP}} is still associated 
> with {{hostId1}}.
> *Issue Root Cause*
> By digging through Cassandra logs and re-testing this scenario repeatedly, we 
> have reached the following conclusions:
>  * Other nodes continue exchanging gossip about {{old_IP}}, even after it 
> becomes a fatClient.
>  * The fatClient timeout and subsequent quarantine do not stop {{old_IP}} from 
> reappearing in a node's gossip state once its quarantine is over. We believe 
> this is due to a misalignment of the {{old_IP}} expiration time across nodes.
>  * Once {{new_IP}} has left the cluster and {{old_IP}}'s next gossip state 
> message is received by a node, StorageService will no longer face a collision 
> (or will face one with an even older IP) for {{hostId1}} and its corresponding 
> tokens. As a result, {{old_IP}} will regain ownership of 20.5% of the token 
> ring.
> *Proposed fix*
> Following the above investigation, we were thinking about implementing the 
> following fix:
> When a node receives a gossip status change with {{STATE_LEFT}} for a leaving 
> endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring, purge 
> from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the 
> following criteria:
>  * {{endpointStateMap}} contains this endpoint
>  * The endpoint is not currently a token owner 
> ({{!tokenMetadata.isMember(endpoint)}})
>  * The endpoint's {{hostId}} matches the {{hostId}} of {{new_IP}}
>  * The endpoint is older than {{leaving_IP}} 
> ({{Gossiper.instance.compareEndpointStartup}})
>  * The endpoint's token range (from {{endpointStateMap}}) intersects with 
> {{new_IP}}'s
> This modification’s intention is to force nodes to realign on {{old_IP}} 
> expiration, and expunge it from Gossip so it does not reappear after 
> {{new_IP}} leaves the ring.
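> As an illustration only, the purge described by the criteria above could look 
> roughly like the following sketch (written in Python for readability; the real 
> change would live in Cassandra's Java gossip/StorageService code, and every 
> name below is a stand-in for the internals referenced above, not an actual API):
> 
> from dataclasses import dataclass
> 
> @dataclass
> class EndpointState:
>     host_id: str
>     tokens: set
>     generation: int   # stand-in for gossip generation / endpoint startup time
> 
> def purge_stale_endpoints_on_left(new_ip, endpoint_state_map, is_token_owner,
>                                   evict_from_membership):
>     """Before evicting new_ip on STATE_LEFT, expunge older incarnations of the
>     same host (e.g. old_IP) that still linger in the gossip state map."""
>     new_state = endpoint_state_map[new_ip]
>     for endpoint, state in list(endpoint_state_map.items()):
>         if endpoint == new_ip:
>             continue
>         if is_token_owner(endpoint):               # tokenMetadata.isMember(endpoint)
>             continue
>         if state.host_id != new_state.host_id:     # must share new_IP's hostId
>             continue
>         if state.generation >= new_state.generation:
>             continue                               # only purge the older incarnation
>         if not (state.tokens & new_state.tokens):  # token ranges must intersect
>             continue
>         evict_from_membership(endpoint)            # expunge old_IP from gossip
> 
> # Example: old_IP (127.0.0.6) lingers with the same hostId and tokens as new_IP.
> states = {
>     '127.0.0.9': EndpointState('hostId1', {42}, generation=2),
>     '127.0.0.6': EndpointState('hostId1', {42}, generation=1),
> }
> purge_stale_endpoints_on_left('127.0.0.9', states,
>                               is_token_owner=lambda ep: ep == '127.0.0.9',
>                               evict_from_membership=lambda ep: print('evict', ep))
> # prints: evict 127.0.0.6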
> Another approach we have also been considering is expunging {{old_IP}} at the 
> moment of the StorageService collision resolution.


