[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212949#comment-17212949 ]
Alexander Dejanovski commented on CASSANDRA-15580:
--------------------------------------------------

Here's a test plan proposal:

Generate/restore a workload of ~100GB to 200GB per node. Some SSTables will have to be deleted (in a random fashion?) to make repair go through streaming sessions. Repairs will be performed on a 3-node cluster with 4 cores and 16GB RAM per node. Repaired keyspaces will use RF=3, or RF=2 in some cases (the latter is for subranges with different sets of replicas).

|| Mode || Version || Settings || Checks ||
| Full repair | trunk | Sequential + all token ranges | No anticompaction (repairedAt == 0); out-of-sync ranges > 0; subsequent run must show no out-of-sync range |
| Full repair | trunk | Parallel + primary range | No anticompaction (repairedAt == 0); out-of-sync ranges > 0; subsequent run must show no out-of-sync range |
| Full repair | trunk | Force terminate repair shortly after it is triggered | Repair threads must be cleaned up |
| Full repair | Mixed trunk + latest 3.11.x | Sequential + all token ranges | Repair should fail |
| Subrange repair | trunk | Sequential + single token range | No anticompaction (repairedAt == 0); out-of-sync ranges > 0; subsequent run must show no out-of-sync range |
| Subrange repair | trunk | Parallel + 10 token ranges which have the same replicas | No anticompaction (repairedAt == 0); out-of-sync ranges > 0; subsequent run must show no out-of-sync range; repair sessions must be cleaned up after a force terminate |
| Subrange repair | trunk | Parallel + 10 token ranges which have different replicas | No anticompaction (repairedAt == 0); out-of-sync ranges > 0; subsequent run must show no out-of-sync range; repair sessions must be cleaned up after a force terminate |
| Subrange repair | trunk | Single token range; force terminate repair shortly after it is triggered | Repair threads must be cleaned up |
| Subrange repair | Mixed trunk + latest 3.11.x | Sequential + single token range | Repair should fail |
| Incremental repair | trunk | Parallel (mandatory); no compaction during repair | Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion; out-of-sync ranges > 0; subsequent run must show no out-of-sync range |
| Incremental repair | trunk | Parallel (mandatory); major compaction triggered during repair | Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion; out-of-sync ranges > 0; subsequent run must show no out-of-sync range |
| Incremental repair | trunk | Force terminate repair shortly after it is triggered | Repair threads must be cleaned up |
| Incremental repair | Mixed trunk + latest 3.11.x | Parallel | Repair should fail |

I'm not sure about fuzz testing repair, though. It isn't a resilient process and isn't designed to be one; resiliency is obtained through third-party tools that reschedule failed repairs. If a node that should be part of a repair session is down or goes down, the session will simply fail, AFAIK.

The mixed-version tests could be challenging to set up, as we probably don't want to pin a specific version as being the "previous" one. Should this test be performed consistently between trunk and the previous major version? On a major version bump (when trunk moves to 5.0), I'd expect the test to pass, as repair will probably keep working for a bit, unless there's a check on version numbers during repair/streaming?
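The subrange cases above need the ring split into contiguous token ranges. A minimal sketch of how the 10-subrange runs could be driven, assuming the default Murmur3Partitioner token space of [-2^63, 2^63 - 1]; the keyspace name "ks" is a placeholder, and the `-full`/`-st`/`-et` flags are the standard nodetool repair options:

```python
# Sketch only: splits the full Murmur3 token ring into n contiguous
# subranges and prints the corresponding nodetool invocations.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1  # Murmur3Partitioner token bounds

def split_ring(n):
    """Return n contiguous (start, end] subranges covering the full ring."""
    width = (MAX_TOKEN - MIN_TOKEN) // n
    bounds = [MIN_TOKEN + i * width for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))

def repair_command(start, end, keyspace="ks"):
    # Full (non-incremental) subrange repair; parallel is the default mode.
    return f"nodetool repair -full -st {start} -et {end} {keyspace}"

if __name__ == "__main__":
    for st, et in split_ring(10):
        print(repair_command(st, et))
```

Splitting the ring evenly like this yields ranges that generally span different replica sets; the "same replicas" variant would instead subdivide a single replica's range, which depends on the cluster's actual token assignment.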
> 4.0 quality testing: Repair
> ---------------------------
>
>                 Key: CASSANDRA-15580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
>             Project: Cassandra
>          Issue Type: Task
>          Components: Test/dtest/python
>            Reporter: Josh McKenzie
>            Assignee: Alexander Dejanovski
>            Priority: Normal
>             Fix For: 4.0-rc
>
> Reference [doc from NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] for context.
>
> *Shepherd: Alexander Dejanovski*
>
> We aim for 4.0 to have the first fully functioning incremental repair solution (CASSANDRA-9143)! Furthermore, we aim to verify that all types of repair (full range, subrange, incremental) function as expected, as well as ensuring that community tools such as Reaper work. CASSANDRA-3200 adds an experimental option to reduce the amount of data streamed during repair; we should write more tests and see how it works with big nodes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)