[ https://issues.apache.org/jira/browse/CASSANDRA-14801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068746#comment-17068746 ]
Aleksandr Sorokoumov commented on CASSANDRA-14801: -------------------------------------------------- I reproduced the bug using simple [randomized test|https://gist.github.com/Gerrrr/f59dc5dedaf6501bb31d79b068244213] that creates a cluster of N nodes and adds, moves, and removes nodes from it. I also found another bug where {{TokenMetadata#calculatePendingRanges}} fails during move-affected replica calculation. Patch that addresses both problems ([link to PR|https://github.com/apache/cassandra/pull/495]): * [a534b2|https://github.com/Gerrrr/cassandra/commit/a534b2be9a653fb0cdda75043c0b79e481ca1701] adds tests that reproduce both bugs. * [bb36cd|https://github.com/Gerrrr/cassandra/commit/bb36cd09c164840ad9ab3231f1039c443f8f040c] fixes the newly discovered bug ({{test1Leave1Move}} in [a534b2|https://github.com/Gerrrr/cassandra/commit/a534b2be9a653fb0cdda75043c0b79e481ca1701]). There, the failure happens because for the same pending range we can include 2 replicas with the same endpoint and different ranges (violation of {{PendingRangeMaps Conflict.DUPLICATE}} policy). This happens because right now we include in {{PendingRangeMaps}} the entire new replica after leave for leave-affected ranges and only the pending part of it for move-affected ranges. This commit marks as pending only the missing part of the new replica. * [a33b03|https://github.com/Gerrrr/cassandra/commit/a33b032cd75044db3475505cd32e6afd6f98ad40] solves the original issue. Without this commit {{getPendingRanges}} can fail if the same replica covers more than 1 pending range. I think that this is a valid situation. Consider {{testLeave2}} in [a534b2|https://github.com/Gerrrr/cassandra/commit/a534b2be9a653fb0cdda75043c0b79e481ca1701]. In this case replica {{Full(127.0.0.1:7012,(-9,0])}} covers 2 ranges - {{(-9,-4]}} and {{(-4,0]}}. If we run the same test against 3.11 {{(-9, 0]}} is represented as {{\{(-9, -4], (-4, 0]\}}} and that possible representation would not trigger the bug in 4.0. However, I think that we should not base safety of {{getPendingRanges}} execution on the way a particular {{AbstractReplicationStrategy}} represents a range and we should allow duplicate entries while building pending {{RangesAtEndpoint}}. With these changes, the randomized test hasn't failed after running for a while. As I mentioned, it is a very simplistic test, so any suggestions for improvement are welcome. I haven't included it in the PR, as it seems to be a good candidate to become flaky. Maybe it is worth adding as a {{long}} test? > calculatePendingRanges no longer safe for multiple adjacent range movements > --------------------------------------------------------------------------- > > Key: CASSANDRA-14801 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14801 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Coordination, Legacy/Distributed Metadata > Reporter: Benedict Elliott Smith > Assignee: Aleksandr Sorokoumov > Priority: Normal > Labels: pull-request-available > Fix For: 4.0, 4.0-beta > > Time Spent: 10m > Remaining Estimate: 0h > > Correctness depended upon the narrowing to a {{Set<InetAddressAndport>}}, > which we no longer do - we maintain a collection of all {{Replica}}. Our > {{RangesAtEndpoint}} collection built by {{getPendingRanges}} can as a result > contain the same endpoint multiple times; and our {{EndpointsForToken}} > obtained by {{TokenMetadata.pendingEndpointsFor}} may fail to be constructed, > resulting in cluster-wide failures for writes to the affected token ranges > for the duration of the range movement. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org