[ https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337288#comment-15337288 ]
Paulo Motta commented on CASSANDRA-8523:
----------------------------------------

Due to the limitations of forwarding writes to replacement nodes with the same IP, I propose initially adding this support only to replacement nodes with a different IP, since it's much simpler and we can do it in a backward-compatible way, so it can probably go into 2.2+. After CASSANDRA-11559, we can extend this support to nodes with the same IP quite easily by setting an inactive flag on nodes being replaced and ignoring these nodes on read.

The central idea is:
{quote}
* Add a new non-dead gossip state for replace: BOOT_REPLACE
* When receiving BOOT_REPLACE, other nodes add the replacing node as a bootstrapping endpoint
* Pending ranges are calculated, and writes are sent to the replacing node during replace
* When the replacing node changes state to NORMAL, the old node is removed and the new node becomes a natural endpoint on TokenMetadata
* The final step is to change the original node state to REMOVED_TOKEN so other nodes evict the original node from gossip
{quote}

Since it's no longer necessary to forward hints to the replacement node when {{replace_address != broadcast_address}}, the replacement node does not need to inherit the same ID as the original node.

The replacing process remains unchanged when the replacement node has the same IP as the original node. If that's the case, I added a warning message so users know they need to run repair if the node is down for longer than {{max_hint_window_in_ms}}:
{noformat}
Writes will not be redirected to this node while it is performing replace because it has the same address as the node to be replaced ({}). If that node has been down for longer than max_hint_window_in_ms, repair must be run after the replacement process in order to make this node consistent.
{noformat}

I adapted the current dtests to test replace_address for both the old and the new path and, when {{replace_address != broadcast_address}}, to make sure writes are being redirected to the replacement node.

Initial patch and tests below (will provide 2.2+ patches after initial review):
||2.2||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-8523]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:8523]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-8523-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-8523-dtest/lastCompletedBuild/testReport/]|


> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8523
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8523
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Paulo Motta
>             Fix For: 2.1.x
>
>
> In our operations, we make heavy use of replace_address (or
> replace_address_first_boot) in order to replace broken nodes. We now realize
> that writes are not sent to the replacement nodes while they are in hibernate
> state and streaming in data. This runs counter to what our expectations were,
> especially since we know that writes ARE sent to nodes when they are
> bootstrapped into the ring.
> It seems like Cassandra should arrange to send writes to a node that is in
> the process of replacing another node, just like it does for nodes that are
> bootstrapping. I hesitate to phrase this as "we should send writes to a node
> in hibernate" because the concept of hibernate may be useful in other
> contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period
> makes subsequent repairs more expensive, proportional to the number of writes
> that we miss (and depending on the amount of data that needs to be streamed
> during replacement and the time it may take to rebuild secondary indexes, we
> could miss many, many hours' worth of writes). It also leaves us more exposed
> to consistency violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
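
For readers following the proposal in the comment above, the state transitions it describes (BOOT_REPLACE, then NORMAL, then REMOVED_TOKEN for the original node) can be pictured with a toy model. This is only a sketch under simplifying assumptions, not the code in the patch: the class {{ReplaceFlowSketch}} and all of its methods are hypothetical names, each token has a single owner (no replication factor), and gossip is reduced to direct method calls.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed replace flow. Every name here is hypothetical;
// this is NOT Cassandra's TokenMetadata or gossip API.
class ReplaceFlowSketch {
    // token -> endpoint that currently owns it (the "natural" endpoint)
    private final Map<String, String> naturalEndpoints = new HashMap<>();
    // token -> endpoint that is replacing the owner (the "pending" endpoint)
    private final Map<String, String> pendingEndpoints = new HashMap<>();
    // endpoints evicted from the ring (stand-in for REMOVED_TOKEN)
    private final Set<String> removed = new LinkedHashSet<>();

    void addNormal(String token, String endpoint) {
        naturalEndpoints.put(token, endpoint);
    }

    // Stand-in for receiving BOOT_REPLACE: the replacement becomes a
    // pending endpoint for every token owned by the original node.
    void onBootReplace(String original, String replacement) {
        if (original.equals(replacement)) {
            // The proposal only covers replace_address != broadcast_address.
            throw new IllegalArgumentException("same-address replace is not modelled");
        }
        for (Map.Entry<String, String> e : naturalEndpoints.entrySet()) {
            if (e.getValue().equals(original)) {
                pendingEndpoints.put(e.getKey(), replacement);
            }
        }
    }

    // Writes go to the natural endpoint plus any pending endpoint.
    Set<String> writeEndpoints(String token) {
        Set<String> targets = new LinkedHashSet<>();
        String natural = naturalEndpoints.get(token);
        if (natural != null) targets.add(natural);
        String pending = pendingEndpoints.get(token);
        if (pending != null) targets.add(pending);
        return targets;
    }

    // Stand-in for the replacement announcing NORMAL: it takes over the
    // original's tokens, and the original is marked as removed.
    void onNormal(String original, String replacement) {
        for (Map.Entry<String, String> e : naturalEndpoints.entrySet()) {
            if (e.getValue().equals(original)) {
                e.setValue(replacement);
            }
        }
        pendingEndpoints.values().removeIf(replacement::equals);
        removed.add(original);
    }

    boolean isRemoved(String endpoint) {
        return removed.contains(endpoint);
    }
}
```

In this model, a write for a token owned by the node being replaced targets both the original and the replacement between {{onBootReplace}} and {{onNormal}}, and only the replacement afterwards, which is the behaviour the issue description asks for.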