[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228317#comment-15228317 ]
Paulo Motta commented on CASSANDRA-8523:
----------------------------------------

There are two scenarios we should consider when replacing a node:
1) The replacing node has the same IP as the previous node
2) The replacing node has a different IP from the previous node

On CASSANDRA-9244 I have gotten pretty far in an [implementation|https://issues.apache.org/jira/browse/CASSANDRA-9244?focusedCommentId=15211202&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15211202] that adds a new non-dead gossip state {{BOOT_REPLACE}} and considers the replacing endpoint as a bootstrapping pending endpoint, solving case 2 transparently.

Case 1 is trickier because when the replacing node enters gossip with a non-dead state, other nodes will think the previous node is back up and send reads to it (since it is a natural endpoint). A simple way to solve this is to special-case the read path and ignore nodes in "NON-NORMAL" state when sending reads to natural endpoints. While this will probably solve the problem, there are quite a few different paths we need to hack to make sure this is enforced correctly (paxos, read, hints, etc), so I'm not totally comfortable with that.

A more transparent but somewhat costlier approach to solve case 1 would be to change the {{TokenMetadata}} to keep nodes as {{(InetAddress, UUID)}} pairs, and create a new interface to the {{FailureDetector}} indexed by {{UUID}}. This way we could keep {{(IP=127.0.0.1,UUID=1)}} in {{TokenMetadata}} as a natural endpoint, and add a replacement node {{(IP=127.0.0.1,UUID=2)}} as a pending endpoint. So, during reads, {{FD.isAlive(UUID=1)}} would return false, and natural reads would not be sent to {{(IP=127.0.0.1,UUID=1)}}, while pending writes would be sent to {{(IP=127.0.0.1,UUID=2)}} because {{FD.isAlive(UUID=2)}} would return true.

I'd be happy to continue working on this, so feedback on any of the above or alternative approaches would be greatly appreciated.
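
To make the {{(InetAddress, UUID)}} idea more concrete, here is a rough sketch of how reads and writes could be routed for case 1. It is only an illustration under assumptions: the names {{Endpoint}}, {{FailureDetectorByHostId}}, {{RangeReplicas}} and {{sameIpReplacement}} are hypothetical and are not Cassandra's actual classes or API, and a single token range with a single replica is modelled for brevity.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch only; none of these types exist in Cassandra.
final class ReplaceSketch {

    // A ring member identified by address AND host ID, so two nodes
    // sharing one IP can be told apart.
    static final class Endpoint {
        final InetAddress address;
        final UUID hostId;

        Endpoint(InetAddress address, UUID hostId) {
            this.address = address;
            this.hostId = hostId;
        }
    }

    // Stand-in for a failure detector interface indexed by host ID
    // instead of IP address.
    static final class FailureDetectorByHostId {
        private final Set<UUID> alive = new HashSet<>();

        void markAlive(UUID hostId) { alive.add(hostId); }
        void markDead(UUID hostId) { alive.remove(hostId); }
        boolean isAlive(UUID hostId) { return alive.contains(hostId); }
    }

    // Stand-in for the slice of token metadata covering one token range.
    static final class RangeReplicas {
        final List<Endpoint> natural = new ArrayList<>();
        final List<Endpoint> pending = new ArrayList<>();

        // Reads go only to natural endpoints whose host ID is alive.
        List<Endpoint> readTargets(FailureDetectorByHostId fd) {
            return live(natural, fd);
        }

        // Writes go to live natural endpoints plus live pending endpoints.
        List<Endpoint> writeTargets(FailureDetectorByHostId fd) {
            List<Endpoint> targets = live(natural, fd);
            targets.addAll(live(pending, fd));
            return targets;
        }

        private static List<Endpoint> live(List<Endpoint> candidates,
                                           FailureDetectorByHostId fd) {
            List<Endpoint> result = new ArrayList<>();
            for (Endpoint e : candidates) {
                if (fd.isAlive(e.hostId)) {
                    result.add(e);
                }
            }
            return result;
        }
    }

    // Builds case 1: the old node (dead) stays a natural endpoint while the
    // replacement (alive) joins as a pending endpoint on the same IP.
    static RangeReplicas sameIpReplacement(FailureDetectorByHostId fd,
                                           UUID oldId, UUID newId) {
        InetAddress ip = InetAddress.getLoopbackAddress();
        RangeReplicas replicas = new RangeReplicas();
        replicas.natural.add(new Endpoint(ip, oldId));
        replicas.pending.add(new Endpoint(ip, newId));
        fd.markDead(oldId);
        fd.markAlive(newId);
        return replicas;
    }
}
```

With this shape, {{readTargets}} skips the old node because its host ID is not alive, while {{writeTargets}} still includes the replacement, even though both share one address. The cost is that every caller currently keyed on {{InetAddress}} alone would need to carry the host ID as well.
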
> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8523
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8523
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Brandon Williams
>             Fix For: 2.1.x
>
>
> In our operations, we make heavy use of replace_address (or
> replace_address_first_boot) in order to replace broken nodes. We now realize
> that writes are not sent to the replacement nodes while they are in hibernate
> state and streaming in data. This runs counter to what our expectations were,
> especially since we know that writes ARE sent to nodes when they are
> bootstrapped into the ring.
> It seems like Cassandra should arrange to send writes to a node that is in
> the process of replacing another node, just like it does for nodes that are
> bootstrapping. I hesitate to phrase this as "we should send writes to a node
> in hibernate" because the concept of hibernate may be useful in other
> contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period
> makes subsequent repairs more expensive, proportional to the number of writes
> that we miss (and depending on the amount of data that needs to be streamed
> during replacement and the time it may take to rebuild secondary indexes, we
> could miss many, many hours' worth of writes). It also leaves us more exposed
> to consistency violations.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)