[ 
https://issues.apache.org/jira/browse/CASSANDRA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312567#comment-15312567
 ] 

Paulo Motta commented on CASSANDRA-8523:
----------------------------------------

Solving this when the replacing node has a different IP address is quite simple 
(a rough toy-model sketch follows the list):
* Add a new non-dead gossip state for replacement: BOOT_REPLACE
* When receiving BOOT_REPLACE, other nodes add the replacing node as a 
bootstrapping AND replacing endpoint
* Pending ranges are calculated, and writes are sent to the replacing node 
during replace
* When the replacing node changes state to NORMAL, the old node is removed and 
the new node becomes a natural endpoint in {{TokenMetadata}}
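
To make the intended flow concrete, here is a rough toy-model sketch of the 
state transitions above. This is not Cassandra's actual 
{{TokenMetadata}}/{{Gossiper}} API; all class and method names below are 
illustrative only:

{code:java}
import java.net.InetAddress;
import java.util.*;

// Toy model of the proposed BOOT_REPLACE flow (illustrative names only,
// not the real TokenMetadata/Gossiper classes).
public class ReplaceFlowSketch
{
    // natural endpoints own their ranges and serve reads;
    // pending endpoints only receive writes while bootstrapping/replacing
    private final Set<InetAddress> natural = new HashSet<>();
    private final Set<InetAddress> pending = new HashSet<>();
    private final Map<InetAddress, InetAddress> replacing = new HashMap<>(); // new -> old

    // gossip: replacing node announces BOOT_REPLACE(oldNode)
    public void onBootReplace(InetAddress newNode, InetAddress oldNode)
    {
        pending.add(newNode);            // starts receiving writes immediately
        replacing.put(newNode, oldNode); // remember which endpoint it replaces
    }

    // gossip: replacing node announces NORMAL once streaming finishes
    public void onNormal(InetAddress newNode)
    {
        InetAddress oldNode = replacing.remove(newNode);
        if (oldNode != null)
            natural.remove(oldNode);     // old node leaves the ring
        pending.remove(newNode);
        natural.add(newNode);            // new node becomes a natural endpoint
    }

    // writes go to natural + pending endpoints, reads only to natural ones
    public Set<InetAddress> writeTargets()
    {
        Set<InetAddress> targets = new HashSet<>(natural);
        targets.addAll(pending);
        return targets;
    }

    public Set<InetAddress> readTargets()
    {
        return natural;
    }
}
{code}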

I have a functional WIP patch with this approach 
[here|https://github.com/pauloricardomg/cassandra/commit/a4be7f7a0f298b1d52821f67ff2e927a1d1f1c5e].
The tricky part is when the replacing node has the same IP as the old node, 
because if this node ever joins gossip with a non-dead state, 
{{FailureDetector.isAlive(InetAddress)}} returns true and reads are sent to the 
replacing node, since the original node is a natural endpoint.

My initial idea of removing the old node as a natural endpoint from 
{{TokenMetadata}} obviously does not work, because it changes range 
assignments and makes reads go to the next node in the ring, which has no data 
for that range. I also considered replacing the node's IP in {{TokenMetadata}} 
with a special marker, but that would not play well with snitches and 
topologies either.

A clean fix for this is to enhance the node representation in {{TokenMetadata}}, 
which is currently a plain {{InetAddress}}, to include whether an endpoint is 
available for reads (CASSANDRA-11559). This cannot be done before Cassandra 4.0 
because it changes the public interfaces of {{TokenMetadata}}, 
{{AbstractReplicationStrategy}} and other related classes, so it's a major 
refactoring of the codebase. One step further would be to change the 
{{FailureDetector}} interface to query nodes by {{UUID}} instead of 
{{InetAddress}}, so a replacing endpoint will not be confused with the original 
node since they have different host IDs.
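
Just to sketch what such a representation could look like (purely illustrative; 
these names are hypothetical and not part of any existing interface):

{code:java}
import java.net.InetAddress;
import java.util.UUID;

// Hypothetical sketch of the CASSANDRA-11559 direction: identify a ring member
// by host id + address plus an explicit read-availability flag, instead of a
// bare InetAddress. None of these names exist in the codebase today.
public final class RingMember
{
    public final UUID hostId;          // stable identity, survives IP reuse
    public final InetAddress address;  // may be the same as the replaced node's
    public final boolean availableForReads;

    public RingMember(UUID hostId, InetAddress address, boolean availableForReads)
    {
        this.hostId = hostId;
        this.address = address;
        this.availableForReads = availableForReads;
    }
}
{code}

With something like this, a failure detector keyed on {{hostId}} would never 
confuse a replacing node with the original one, even when both share the same IP.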

One workaround before that would be to only send a read to an endpoint if both 
{{FailureDetector.isAlive(endpoint)}} is true and the endpoint is in {{NORMAL}} 
state, so we avoid sending reads to a replacing endpoint with the same IP 
address. A big downside is that this check must also be performed for other 
operations such as bootstrap, repairs, hints, batches, etc. I experimented with 
that route [in this 
commit|https://github.com/pauloricardomg/cassandra/commit/a439edcbd8d4186ea2e3301681b7cc0b5012a265]
by creating a method {{StorageService.isAvailable(InetAddress endpoint, 
boolean isPending) = \{FailureDetector.instance.isAlive(endpoint) && 
(isPending || Gossiper.instance.isNormalStatus(endpoint))\}}} (see the sketch 
after the list below) that should be used instead of 
{{FailureDetector.isAlive(endpoint)}}, but I'm not very comfortable with this 
approach for the following reasons:
* It's pretty fragile, because that check must be used instead of 
{{FailureDetector.isAlive()}} in every place throughout the code
* Other developers/drivers/tools will also need to use the same approach or 
similar logic
* We will need to modify the write path to segregate writes between natural and 
pending endpoints (see an example in 
[doPaxosCommit|https://github.com/pauloricardomg/cassandra/commit/a439edcbd8d4186ea2e3301681b7cc0b5012a265#diff-71f06c193f5b5e270cf8ac695164f43aR498])
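
For reference, this is roughly the shape of the check in that commit 
(paraphrased from the pseudocode above, not the exact patch; 
{{isNormalStatus()}} is a helper added in the WIP branch, not an existing 
{{Gossiper}} method):

{code:java}
// Only treat an endpoint as a valid target if it is alive AND either it is a
// pending (bootstrapping/replacing) endpoint or it is in NORMAL state, so a
// replacing node that reuses the old node's IP is not considered read-available.
public static boolean isAvailable(InetAddress endpoint, boolean isPending)
{
    return FailureDetector.instance.isAlive(endpoint)
           && (isPending || Gossiper.instance.isNormalStatus(endpoint));
}
{code}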

So unless someone comes up with a magic idea, I think we're left with two 
options to fix this before CASSANDRA-11559:
1. Go ahead with the {{StorageService.isAvailable}} approach despite its 
downsides and fragility
2. Fix this only for replacing endpoints with a different IP address, falling 
back to the previous behavior when the replacing node has the same IP as the 
old node. This should already fix the issue in most cloud deployments, where 
replacement nodes usually have different IPs.

I'm leaning more towards 2, since it's much simpler/safer, and we can work on a 
proper general solution for 4.0 after CASSANDRA-11559. WDYT?

> Writes should be sent to a replacement node while it is streaming in data
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8523
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8523
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Wagner
>            Assignee: Paulo Motta
>             Fix For: 2.1.x
>
>
> In our operations, we make heavy use of replace_address (or 
> replace_address_first_boot) in order to replace broken nodes. We now realize 
> that writes are not sent to the replacement nodes while they are in hibernate 
> state and streaming in data. This runs counter to what our expectations were, 
> especially since we know that writes ARE sent to nodes when they are 
> bootstrapped into the ring.
> It seems like Cassandra should arrange to send writes to a node that is in 
> the process of replacing another node, just like it does for nodes that are 
> bootstrapping. I hesitate to phrase this as "we should send writes to a node 
> in hibernate" because the concept of hibernate may be useful in other 
> contexts, as per CASSANDRA-8336. Maybe a new state is needed here?
> Among other things, the fact that we don't get writes during this period 
> makes subsequent repairs more expensive, proportional to the number of writes 
> that we miss (and depending on the amount of data that needs to be streamed 
> during replacement and the time it may take to rebuild secondary indexes, we 
> could miss many many hours worth of writes). It also leaves us more exposed 
> to consistency violations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
