[ 
https://issues.apache.org/jira/browse/CASSANDRA-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063752#comment-15063752
 ] 

Sylvain Lebresne commented on CASSANDRA-10887:
----------------------------------------------

I can confirm that having a node be both a natural endpoint and a pending 
endpoint for the same range is bad (it's possible for a moving node to keep 
some of its ranges after the move, but in that case those ranges shouldn't be 
added as pending). This would definitely trigger CASSANDRA-10423 and is thus 
likely the culprit.
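
To make the failure mode concrete, here is a minimal Python sketch of the ack 
arithmetic. It is not Cassandra's code: the helper names (quorum, 
write_can_succeed) are made up, and it assumes a QUORUM write has to collect a 
quorum of the replication factor plus one ack per pending endpoint, which 
matches the blockFor of 3 reported in the description below. The node 
addresses are the ones from the ring in that description.

```python
def quorum(rf):
    """Acks needed for QUORUM at replication factor rf."""
    return rf // 2 + 1


def write_can_succeed(natural, pending, live, rf):
    """Assumption: the write blocks for quorum(rf) + len(pending) acks.

    Simplification: every entry in natural + pending is a separate write
    target, and a target acks only if its node is in `live`.
    """
    block_for = quorum(rf) + len(pending)
    acks = sum(1 for node in list(natural) + list(pending) if node in live)
    return acks >= block_for


# Replicas of the affected range in the reported ring; node 3 is down.
NATURAL = ["127.0.0.3", "127.0.0.4", "127.0.0.5"]
LIVE = {"127.0.0.1", "127.0.0.2", "127.0.0.4", "127.0.0.5"}

# Buggy case: node 3 is both natural and pending, so 3 acks are required
# but only nodes 4 and 5 can answer.
buggy = write_can_succeed(NATURAL, ["127.0.0.3"], LIVE, rf=3)

# Expected case: node 1 is the pending endpoint, so nodes 4, 5 and 1 answer.
fixed = write_can_succeed(NATURAL, ["127.0.0.1"], LIVE, rf=3)

print(buggy, fixed)  # False True
```

With node 3 down and wrongly listed as pending, only two acks are possible 
against a requirement of three, hence the timeout. With node 1 as the pending 
endpoint the same write can complete.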

I'm not an expert in our pending range calculations though, and I'm not sure 
who our expert there is. [~blambov], you've worked on somewhat related 
problems lately (token assignment); would you have some time to have a look? 

[~cdaw] would you have someone who could turn [~sashley]'s reproduction steps 
above into a dtest in parallel?

> Pending range calculator gives wrong pending ranges for moves
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-10887
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10887
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Richard Low
>            Priority: Critical
>
> My understanding is that the PendingRangeCalculator is meant to calculate who 
> should receive extra writes during range movements. However, it adds the 
> wrong ranges for moves. An extreme example of this can be seen in the 
> following reproduction. Create a 5-node cluster (I did this on 2.0.16 and 
> 2.2.4), a keyspace with RF=3, and a simple table. Then start moving a node and 
> immediately kill -9 it. The node now shows as down and moving in the ring. 
> Try a quorum write for a partition that is stored on that node: it will fail 
> with a timeout. Further, all CAS reads or writes fail immediately with an 
> unavailable exception because they attempt to include the moving node twice. 
> This is likely to be the cause of CASSANDRA-10423.
> In my example I had this ring:
> 127.0.0.1  rack1  Up    Normal  170.97 KB  20.00%  -9223372036854775808
> 127.0.0.2  rack1  Up    Normal  124.06 KB  20.00%  -5534023222112865485
> 127.0.0.3  rack1  Down  Moving  108.7 KB   40.00%  1844674407370955160
> 127.0.0.4  rack1  Up    Normal  142.58 KB  0.00%   1844674407370955161
> 127.0.0.5  rack1  Up    Normal  118.64 KB  20.00%  5534023222112865484
> Node 3 was moving to -1844674407370955160. I added logging to print the 
> pending and natural endpoints. For ranges owned by node 3, node 3 appeared in 
> both the pending and the natural endpoints. The blockFor is increased to 3, so 
> we’re effectively doing CL.ALL operations. This manifests as write timeouts 
> and CAS unavailable exceptions when the node is down.
> The correct pending range for this scenario is that node 1 gains the range 
> (-1844674407370955160, 1844674407370955160). So node 1 should be added as a 
> destination for writes and CAS for this range, not node 3.
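
The expected pending range above can be checked by hand with a small Python 
sketch. It is not the PendingRangeCalculator: it assumes one token per node 
and SimpleStrategy-style placement (the owner of a token is the first node at 
or after it, followed by the next RF-1 nodes clockwise), and the function 
names are hypothetical. It compares the natural endpoints before and after the 
move and reports, per range, any node that is a replica only afterwards.

```python
from bisect import bisect_left


def natural_endpoints(ring, token, rf):
    """Replicas for `token`, given ring = {node: token} (one token per node)."""
    ordered = sorted(ring.items(), key=lambda kv: kv[1])
    tokens = [t for _, t in ordered]
    # First node whose token is >= the key token, wrapping around the ring.
    start = bisect_left(tokens, token) % len(ordered)
    return [ordered[(start + i) % len(ordered)][0]
            for i in range(min(rf, len(ordered)))]


def pending_ranges_for_move(ring, node, new_token, rf):
    """Map each node to the (left, right] ranges it gains when `node` moves."""
    after = {**ring, node: new_token}
    # Every token from either ring is a boundary, so replica sets are
    # constant inside each (left, right] range and `right` represents it.
    bounds = sorted(set(ring.values()) | {new_token})
    pending = {}
    for i, right in enumerate(bounds):
        left = bounds[i - 1]  # i == 0 wraps to the last boundary
        before = set(natural_endpoints(ring, right, rf))
        gaining = set(natural_endpoints(after, right, rf)) - before
        for n in sorted(gaining):
            pending.setdefault(n, []).append((left, right))
    return pending


# Ring from the issue description.
RING = {
    "127.0.0.1": -9223372036854775808,
    "127.0.0.2": -5534023222112865485,
    "127.0.0.3": 1844674407370955160,
    "127.0.0.4": 1844674407370955161,
    "127.0.0.5": 5534023222112865484,
}

result = pending_ranges_for_move(RING, "127.0.0.3", -1844674407370955160, rf=3)
print(result)
# {'127.0.0.1': [(-1844674407370955160, 1844674407370955160)]}
```

Under these assumptions only node 1 gains a range, and node 3 never shows up 
as pending for a range it already replicates.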



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)