[ 
https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228317#comment-15228317
 ] 

Paulo Motta commented on CASSANDRA-8523:
----------------------------------------

There are two scenarios we should consider when replacing a node:
1) The replacing node has the same IP as the previous node
2) The replacing node has a different IP from the previous node

On CASSANDRA-9244 I have gotten pretty far in an 
[implementation|https://issues.apache.org/jira/browse/CASSANDRA-9244?focusedCommentId=15211202&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15211202]
 that adds a new non-dead gossip state {{BOOT_REPLACE}} and considers the 
replacing endpoint as a bootstrapping pending endpoint, solving case 2 
transparently.

Case 1 is trickier because when the replacing node enters gossip with a 
non-dead state, other nodes will think the previous node is back up and send 
reads to it (since it is a natural endpoint).

A simple way to solve this is to special-case the read path and ignore nodes in 
"NON-NORMAL" state when sending reads to natural endpoints. While this will 
probably solve the problem, there are quite a few different paths we need to 
hack to make sure this is enforced correctly (Paxos, reads, hints, etc.), so I'm 
not totally comfortable with that.
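
As a rough illustration of that special case, here is a minimal standalone 
sketch, not actual Cassandra code. The {{ReadEndpointFilter}} class, its 
{{State}} enum and the {{readTargets}} method are hypothetical names made up 
for this example, and endpoints are plain strings to keep it self-contained. 
The same filter would have to be repeated on every path that picks natural 
endpoints, which is the part I'm not comfortable with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReadEndpointFilter {
    // Hypothetical, simplified gossip states; not Cassandra's actual state model.
    enum State { NORMAL, BOOT_REPLACE, HIBERNATE }

    // Keep only natural endpoints whose known gossip state is NORMAL.
    // Endpoints with no recorded state are dropped as well.
    static List<String> readTargets(List<String> naturalEndpoints,
                                    Map<String, State> gossipStates) {
        List<String> targets = new ArrayList<>();
        for (String endpoint : naturalEndpoints) {
            if (gossipStates.get(endpoint) == State.NORMAL) {
                targets.add(endpoint);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, State> states = new HashMap<>();
        // Replacement node that reuses the IP of the node it replaces (case 1).
        states.put("127.0.0.1", State.BOOT_REPLACE);
        states.put("127.0.0.2", State.NORMAL);
        states.put("127.0.0.3", State.NORMAL);

        List<String> natural = Arrays.asList("127.0.0.1", "127.0.0.2", "127.0.0.3");
        // The replacing node is skipped for reads even though gossip sees it as up.
        System.out.println(readTargets(natural, states));
    }
}
```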

A more transparent, though somewhat costlier, way to solve case 1 would be to 
change the {{TokenMetadata}} to keep nodes as {{(InetAddress, UUID)}} pairs, 
and create a new interface to the {{FailureDetector}} indexed by {{UUID}}. This 
way we could keep {{(IP=127.0.0.1,UUID=1)}} in {{TokenMetadata}} as a natural 
endpoint, and add a replacement node {{(IP=127.0.0.1,UUID=2)}} as a pending 
endpoint. So, during reads, {{FD.isAlive(UUID=1)}} would return false, and 
natural reads would not be sent to {{(IP=127.0.0.1,UUID=1)}}, while pending 
writes would be sent to {{(IP=127.0.0.1,UUID=2)}} because 
{{FD.isAlive(UUID=2)}} would return true.
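
To make the example above concrete, here is a minimal standalone sketch under 
the same assumptions. It does not use the real {{TokenMetadata}} or 
{{FailureDetector}} classes: {{Node}}, {{UuidFailureDetector}}, 
{{shouldSendRead}} and {{shouldSendPendingWrite}} are hypothetical names 
invented for illustration, and liveness is just a set of host IDs.

```java
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class HostIdLiveness {
    // Hypothetical (InetAddress, UUID) pair identifying one incarnation of a node.
    static final class Node {
        final InetAddress address;
        final UUID hostId;

        Node(InetAddress address, UUID hostId) {
            this.address = address;
            this.hostId = hostId;
        }
    }

    // Hypothetical failure-detector view indexed by host ID instead of IP.
    static final class UuidFailureDetector {
        private final Set<UUID> alive = new HashSet<>();

        void markAlive(UUID hostId) { alive.add(hostId); }
        void markDead(UUID hostId) { alive.remove(hostId); }
        boolean isAlive(UUID hostId) { return alive.contains(hostId); }
    }

    // Reads go to a natural endpoint only if that exact incarnation is alive.
    static boolean shouldSendRead(Node natural, UuidFailureDetector fd) {
        return fd.isAlive(natural.hostId);
    }

    // Pending writes go to the replacement only if its own host ID is alive.
    static boolean shouldSendPendingWrite(Node pending, UuidFailureDetector fd) {
        return fd.isAlive(pending.hostId);
    }

    public static void main(String[] args) {
        InetAddress ip = InetAddress.getLoopbackAddress(); // same IP for both nodes
        Node replaced = new Node(ip, new UUID(0L, 1L));    // natural endpoint, UUID=1
        Node replacement = new Node(ip, new UUID(0L, 2L)); // pending endpoint, UUID=2

        UuidFailureDetector fd = new UuidFailureDetector();
        fd.markAlive(replacement.hostId); // only the replacement is gossiping

        System.out.println("read to UUID=1: " + shouldSendRead(replaced, fd));
        System.out.println("write to UUID=2: " + shouldSendPendingWrite(replacement, fd));
    }
}
```

Because liveness is looked up by host ID, two nodes sharing one IP no longer 
collide, which is what an IP-indexed lookup cannot express.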

I'd be happy to continue working on this, so feedback on any of the above or 
alternative approaches would be greatly appreciated.

> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8523
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8523
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Brandon Williams
>             Fix For: 2.1.x
>
>
> In our operations, we make heavy use of replace_address (or 
> replace_address_first_boot) in order to replace broken nodes. We now realize 
> that writes are not sent to the replacement nodes while they are in hibernate 
> state and streaming in data. This runs counter to what our expectations were, 
> especially since we know that writes ARE sent to nodes when they are 
> bootstrapped into the ring.
> It seems like Cassandra should arrange to send writes to a node that is in 
> the process of replacing another node, just like it does for nodes that are 
> bootstrapping. I hesitate to phrase this as "we should send writes to a node 
> in hibernate" because the concept of hibernate may be useful in other 
> contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period 
> makes subsequent repairs more expensive, proportional to the number of writes 
> that we miss (and depending on the amount of data that needs to be streamed 
> during replacement and the time it may take to rebuild secondary indexes, we 
> could miss many many hours worth of writes). It also leaves us more exposed 
> to consistency violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
