[ 
https://issues.apache.org/jira/browse/CASSANDRA-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Kramer updated CASSANDRA-17572:
-----------------------------------
    Description: 
Hi,

We noticed a race condition in the trunk of the 3.x code, and confirmed that it 
is present in 4.x as well. For a very short period of time it results in 
incorrect reads and missed writes.

The race condition came to our attention when we started noticing a couple of 
missed writes on our Cassandra clusters in Kubernetes. The Kubernetes piece is 
interesting because IP changes are much more frequent there than in a 
traditional setup.

More concretely:
 # When a Cassandra node with IP address X is turned off and then restarts with 
a new IP address Z, it announces to the cluster (via gossip) that it has IP Z 
for Host ID Y.
 # If there are no conflicts, each node will decide to remove the old IP 
address associated with Host ID Y 
([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532])
 from the storage ring. This also causes us to invalidate our token ring cache 
([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/TokenMetadata.java#L488]
 ).
 # At this time, a new request could come in (read or write), and will 
re-calculate which endpoints to send the request to, as we’ve invalidated our 
token ring cache 
([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/AbstractReplicationStrategy.java#L88-L104]).
 # However, at this time we’ve only removed the IP address X, and have not 
re-added IP address Z.
 # As a result, we will choose a new host to route our request to. 

In our case, our keyspaces all run with NetworkTopologyStrategy, and so we 
simply choose the node with the next closest token in the same rack as host Y 
([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L149-L191]).
 # Thus, the request is routed to a _different_ host, rather than the host that 
has come back online.
 # However, shortly afterwards, we re-add the host (via its _new_ endpoint) to 
the token ring 
([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2549]).
 # This will result in us invalidating our cache, and then again re-routing 
requests appropriately.
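
The sequence above can be illustrated with a toy model. This is only a sketch 
under simplifying assumptions: it is not Cassandra's TokenMetadata or 
NetworkTopologyStrategy, it ignores racks and datacenters, and every name in it 
(RingRaceSketch, addEndpoint, removeEndpoint, replicasFor, the addresses and 
tokens) is hypothetical. It shows only that a lookup made between the removal of 
the old address and the addition of the new one picks a different replica set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy token ring. Hypothetical names; not Cassandra's actual classes. */
final class RingRaceSketch {
    // token -> endpoint address, kept sorted by token
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addEndpoint(long token, String address) {
        ring.put(token, address);
    }

    void removeEndpoint(String address) {
        ring.values().removeIf(address::equals);
    }

    /** Walk the ring clockwise from keyToken and collect up to rf endpoints. */
    List<String> replicasFor(long keyToken, int rf) {
        List<String> clockwise = new ArrayList<>(ring.tailMap(keyToken, true).values());
        clockwise.addAll(ring.headMap(keyToken, false).values());
        List<String> replicas = new ArrayList<>();
        for (String endpoint : clockwise) {
            if (replicas.size() == rf) break;
            if (!replicas.contains(endpoint)) replicas.add(endpoint);
        }
        return replicas;
    }

    public static void main(String[] args) {
        RingRaceSketch sketch = new RingRaceSketch();
        sketch.addEndpoint(0L, "10.0.0.1");
        sketch.addEndpoint(100L, "10.0.0.2"); // host Y, old IP X
        sketch.addEndpoint(200L, "10.0.0.3");
        sketch.addEndpoint(300L, "10.0.0.4");

        List<String> before = sketch.replicasFor(50L, 2);
        sketch.removeEndpoint("10.0.0.2");     // step 2: old IP removed
        List<String> during = sketch.replicasFor(50L, 2); // request in the window
        sketch.addEndpoint(100L, "10.0.0.9");  // later: new IP Z re-added
        List<String> after = sketch.replicasFor(50L, 2);

        System.out.println("before: " + before); // [10.0.0.2, 10.0.0.3]
        System.out.println("during: " + during); // [10.0.0.3, 10.0.0.4]
        System.out.println("after:  " + after);  // [10.0.0.9, 10.0.0.3]
    }
}
```

In the "during" lookup, 10.0.0.4 is selected even though it is a replica for 
that key neither before nor after the address change.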

A couple of additional thoughts:
 - This doesn’t affect clusters where the number of nodes <= RF with 
NetworkTopologyStrategy.
 - During this very brief period of time, CL for all user queries is violated, 
but the queries are still ACK’d as successful.
 - It’s easy to reproduce this race condition by simply adding a sleep here 
([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532])
 - If a cleanup is not run before any range movement, it’s possible for rows 
that were temporarily written to the wrong node to re-appear. 
 - We tested that the race condition exists in our Cassandra 2.x fork (we're 
not on 3.x or 4.x). So it is possible, however unlikely, that it is specific to 
Cassandra 2.x. 
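
The first point in the list above (nodes <= RF) can be checked on a toy ring. 
This is a sketch under the same kind of simplifying assumptions, not Cassandra 
code: a single rack, a plain clockwise walk, and hypothetical names throughout 
(SmallClusterSketch, replicasFor, strayReplicas, node-0 ... node-N). When every 
node is already a replica, the window cannot route a request to a non-replica.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Toy check of the "nodes <= RF" case. Hypothetical names; not Cassandra code. */
final class SmallClusterSketch {
    /** Walk the ring clockwise from keyToken and collect up to rf endpoints. */
    static Set<String> replicasFor(TreeMap<Long, String> ring, long keyToken, int rf) {
        List<String> clockwise = new ArrayList<>(ring.tailMap(keyToken, true).values());
        clockwise.addAll(ring.headMap(keyToken, false).values());
        Set<String> replicas = new LinkedHashSet<>();
        for (String endpoint : clockwise) {
            if (replicas.size() == rf) break;
            replicas.add(endpoint);
        }
        return replicas;
    }

    /**
     * Endpoints chosen while node-1 is missing from the ring that were
     * not replicas for the key before it was removed.
     */
    static Set<String> strayReplicas(int nodeCount, int rf, long keyToken) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (int i = 0; i < nodeCount; i++) {
            ring.put(i * 100L, "node-" + i);
        }
        Set<String> before = replicasFor(ring, keyToken, rf);
        ring.remove(100L); // node-1's old address dropped, new one not yet added
        Set<String> during = new LinkedHashSet<>(replicasFor(ring, keyToken, rf));
        during.removeAll(before);
        return during;
    }

    public static void main(String[] args) {
        // 3 nodes, RF 3: every node already owns the key, nothing strays.
        System.out.println(strayReplicas(3, 3, 50L)); // []
        // 4 nodes, RF 3: node-0 is picked although it was not a replica.
        System.out.println(strayReplicas(4, 3, 50L)); // [node-0]
    }
}
```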

> Race condition when IP address changes for a node can cause reads/writes to 
> route to the wrong node
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17572
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17572
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sam Kramer
>            Priority: Normal
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
