[ 
https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837573#comment-17837573
 ] 

Alex Petrov edited comment on CASSANDRA-19344 at 4/16/24 8:04 AM:
------------------------------------------------------------------

Wanted to point out a somewhat unintuitive albeit correct behaviour that 
involves Transient Replicas. I think it is worth talking through such things 
because pending ranges with transient replicas work slightly differently from 
their "normal" counterparts. 

We have a four-node cluster with nodes 1, 2, 3, 4 owning tokens 100, 200, 300, 400, 
and node 4 moving its token from 400 to 350.

Original/start state (READ/WRITE placements):

{code}
    (400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Transient(/127.0.0.3:7012,(400,MIN])]}
    (MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Transient(/127.0.0.3:7012,(MIN,100])]}
    (100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200])]}
    (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]}
    (300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
    (350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.1:7012,(350,400]), Transient(/127.0.0.2:7012,(350,400])]}
{code}
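
As a side note, the starting placements above follow the usual clockwise walk around the ring. Below is a minimal, hypothetical sketch (not Cassandra's actual placement code; the {{Replica}} record and the range formatting are simplified) that derives the same full/transient assignment for rf = 3 with one transient replica. Once {{START_MOVE}} is executed this simple derivation no longer holds, which is exactly what the listings further down show.

{code:java}
import java.util.*;

public class TransientPlacementSketch
{
    // hypothetical stand-in for a replica: endpoint plus full/transient flag
    record Replica(String endpoint, boolean full)
    {
        @Override
        public String toString() { return (full ? "Full(" : "Transient(") + endpoint + ")"; }
    }

    public static void main(String[] args)
    {
        // token -> endpoint, as in the example: nodes 1..4 own tokens 100, 200, 300, 400
        NavigableMap<Long, String> ring = new TreeMap<>(Map.of(
            100L, "/127.0.0.1:7012",
            200L, "/127.0.0.2:7012",
            300L, "/127.0.0.3:7012",
            400L, "/127.0.0.4:7012"));
        int rf = 3;
        int transientCount = 1;

        for (long token : ring.keySet())
        {
            Long prev = ring.lowerKey(token);
            // for the first token this prints (MIN,100]; the wrapping piece (400,MIN]
            // has the same replicas in the listing above
            String range = "(" + (prev == null ? "MIN" : prev) + "," + token + "]";

            List<Replica> placement = new ArrayList<>();
            long current = token;
            while (placement.size() < rf)
            {
                boolean full = placement.size() < rf - transientCount; // last replica is transient
                placement.add(new Replica(ring.get(current), full));
                Long next = ring.higherKey(current);                   // clockwise walk, wrapping around
                current = next == null ? ring.firstKey() : next;
            }
            System.out.println(range + " -> " + placement);
        }
    }
}
{code}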

State after {{START_MOVE}} (which is the point at which streaming starts, so 
think of additional replicas as pending), for WRITE placements:

{code}
    (400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]}
    (MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]}
    (100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200]), 
Transient(/127.0.0.1:7012,(100,200])]}
    (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]}
    (300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
    (350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.1:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), 
Transient(/127.0.0.3:7012,(350,400])]}
{code}

READ placements at the same moment:

{code}
    (400,MIN] -> [Transient(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]}
    (MIN,100] -> [Transient(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]}
    (100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]}
    (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
    (300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
    (350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]}
{code} 

Please note that READ placements are always a subset of WRITE ones (or, in a way, 
we can technically read from a full replica to satisfy a transient read).

After FINISH_MOVE, we get, for both READ and WRITE:

{code}
    (400,MIN] -> [Full(/127.0.0.2:7012,(400,MIN]), 
Full(/127.0.0.3:7012,(400,MIN]), Transient(/127.0.0.1:7012,(400,MIN])]}
    (MIN,200] -> [Full(/127.0.0.2:7012,(MIN,200]), 
Transient(/127.0.0.1:7012,(MIN,200]), Full(/127.0.0.3:7012,(MIN,200])]}
    (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
    (300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
    (350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} 
{code}
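
The "reads are a subset of writes" note above can be phrased as a small invariant check. This is only an illustrative sketch with a made-up {{Replica}} record (not the types from the patch); it treats a transient read as satisfiable by either a transient or a full write replica, and uses the mid-move placements of {{(200,300]}} from the listings above:

{code:java}
import java.util.*;

public class ReadWriteSubsetCheck
{
    record Replica(String endpoint, boolean full) {}

    static void assertReadsAreSubsetOfWrites(Map<String, List<Replica>> reads,
                                             Map<String, List<Replica>> writes)
    {
        for (Map.Entry<String, List<Replica>> e : reads.entrySet())
        {
            List<Replica> writeReplicas = writes.getOrDefault(e.getKey(), List.of());
            for (Replica read : e.getValue())
            {
                // a full read replica must also be full for writes;
                // a transient read replica just needs to be present in the write placement
                boolean covered = writeReplicas.stream()
                                               .anyMatch(w -> w.endpoint().equals(read.endpoint())
                                                              && (w.full() || !read.full()));
                if (!covered)
                    throw new AssertionError("Read replica " + read + " of " + e.getKey() +
                                             " is not covered by write placements " + writeReplicas);
            }
        }
    }

    public static void main(String[] args)
    {
        // mid-move placements of (200,300] from the listings above
        Map<String, List<Replica>> reads = Map.of("(200,300]",
            List.of(new Replica("/127.0.0.3:7012", true),
                    new Replica("/127.0.0.1:7012", true),
                    new Replica("/127.0.0.4:7012", false)));
        Map<String, List<Replica>> writes = Map.of("(200,300]",
            List.of(new Replica("/127.0.0.3:7012", true),
                    new Replica("/127.0.0.4:7012", true),
                    new Replica("/127.0.0.1:7012", true)));
        assertReadsAreSubsetOfWrites(reads, writes); // passes for the states above
    }
}
{code}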

After executing START_MOVE, we get 3 full and no transient nodes for 
{{(200,300]}}. If we put transitions together, we see: 

{code}
    1. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]}
    2. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]}
    3. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
{code}

In {{2.}}, you see that {{127.0.0.1}} went from transient to full, since it is 
now gaining a range and should be a target for pending writes for this range. 
At the same time, it remains a _transient read replica_. In {{3.}}, 
{{127.0.0.4}} went from full to transient; it was kept full up until that point 
since it was a streaming source, and to keep consistency levels correct we only 
demote it to transient once the move completes.

What is unintuitive here is that usually, with a replication factor of 3, we see 
a _fourth_ node added as pending, since without transient replication a node is 
either a full replica of the range or not a replica at all. With transient 
replication, a node can remain an owner of the same range but, because of the 
new token in the ring, change its transient status (from transient to full or 
vice versa).
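
To make that concrete, a hypothetical diff over two placement versions (again, not code from the Cassandra tree) would classify {{127.0.0.1}}'s change for {{(200,300]}} between states {{1.}} and {{2.}} as a promotion from transient to full, rather than as a brand-new pending endpoint:

{code:java}
import java.util.*;

public class PlacementDiff
{
    enum Change { ADDED, REMOVED, PROMOTED, DEMOTED, UNCHANGED }

    record Replica(String endpoint, boolean full) {}

    static Map<String, Change> diff(List<Replica> before, List<Replica> after)
    {
        Map<String, Boolean> old = new HashMap<>();
        before.forEach(r -> old.put(r.endpoint(), r.full()));

        Map<String, Change> result = new LinkedHashMap<>();
        for (Replica r : after)
        {
            Boolean wasFull = old.remove(r.endpoint());
            if (wasFull == null)          result.put(r.endpoint(), Change.ADDED);     // new pending endpoint
            else if (wasFull == r.full()) result.put(r.endpoint(), Change.UNCHANGED);
            else                          result.put(r.endpoint(), r.full() ? Change.PROMOTED : Change.DEMOTED);
        }
        old.keySet().forEach(ep -> result.put(ep, Change.REMOVED));
        return result;
    }

    public static void main(String[] args)
    {
        // transition 1. -> 2. for (200,300] from the listing above
        List<Replica> original = List.of(new Replica("/127.0.0.3:7012", true),
                                         new Replica("/127.0.0.4:7012", true),
                                         new Replica("/127.0.0.1:7012", false));
        List<Replica> afterStartMove = List.of(new Replica("/127.0.0.3:7012", true),
                                               new Replica("/127.0.0.4:7012", true),
                                               new Replica("/127.0.0.1:7012", true));
        System.out.println(diff(original, afterStartMove)); // 127.0.0.1 -> PROMOTED
    }
}
{code}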

At the same time, when the node's transient status does not change, we can end 
up with 2 full and 2 transient. See {{(100,200]}} or {{(350,400]}}. 

All these cases are a consequence of streaming during range handoff.  



> Range movements involving transient replicas must safely enact changes to 
> read and write replica sets
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19344
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19344
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Ekaterina Dimitrova
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: ci_summary.html, remove-n4-post-19344.txt, 
> remove-n4-pre-19344.txt, result_details.tar.gz
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> (edit) This was originally opened due to a flaky test 
> {{org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode-_jdk17}}
> The test can fail in two different ways:
> {code:java}
> junit.framework.AssertionFailedError: NOT IN CURRENT: 31 -- [(00,20), 
> (31,50)] at 
> org.apache.cassandra.distributed.test.TransientRangeMovementTest.assertAllContained(TransientRangeMovementTest.java:203)
>  at 
> org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:183)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43){code}
> as in here - 
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2639/workflows/32b92ce7-5e9d-4efb-8362-d200d2414597/jobs/55139/tests#failed-test-0]
> and
> {code:java}
> junit.framework.AssertionFailedError: nodetool command [removenode, 
> 6d194555-f6eb-41d0-c000-000000000003, --force] was not successful stdout: 
> stderr: error: Node /127.0.0.4:7012 is alive and owns this ID. Use 
> decommission command to remove it from the ring -- StackTrace -- 
> java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and 
> owns this ID. Use decommission command to remove it from the ring at 
> org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>  at 
> org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>  at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020) at 
> org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51) at 
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373) at 
> org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272) at 
> org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>  at 
> org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>  at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) at 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.base/java.lang.Thread.run(Thread.java:833) Notifications: Error: 
> java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and 
> owns this ID. Use decommission command to remove it from the ring at 
> org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>  at 
> org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>  at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020) at 
> org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51) at 
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373) at 
> org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272) at 
> org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>  at 
> org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>  at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) at 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.base/java.lang.Thread.run(Thread.java:833) at 
> org.apache.cassandra.distributed.api.NodeToolResult$Asserts.fail(NodeToolResult.java:214)
>  at 
> org.apache.cassandra.distributed.api.NodeToolResult$Asserts.success(NodeToolResult.java:97)
>  at 
> org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:173)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43){code}
> as in here - 
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2634/workflows/24617d26-e297-4857-bc43-b6a04e64a6ea/jobs/54534/tests#failed-test-0]


