[ https://issues.apache.org/jira/browse/CASSANDRA-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Capwell updated CASSANDRA-16585:
--------------------------------------
    Test and Documentation Plan: added tests
                         Status: Patch Available  (was: Open)

> Periodic failures in *RepairCoordinator*Test caused by race condition with nodetool repair
> ------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16585
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16585
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI, Consistency/Repair, Test/dtest/java
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Periodic failures in *RepairCoordinator*Test cause errors such as
> FullRepairCoordinatorNeighbourDownTest#validationParticipentCrashesAndComesBack[DATACENTER_AWARE/true]
>
> {code}
> nodetool command [repair, distributed_test_keyspace, validationparticipentcrashesandcomesback_full_datacenter_aware_true, --dc-parallel, --full] Error message 'Some repair failed' does not contain any of [/127.0.0.2:7012 died]
> stdout:
> [2021-04-07 22:45:24,887] Starting repair command #10 (f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace distributed_test_keyspace with repair options (parallelism: dc_parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [validationparticipentcrashesandcomesback_full_datacenter_aware_true], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)
> [2021-04-07 22:45:32,864] Repair command #10 failed with error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died
> [2021-04-07 22:45:32,887] After waiting for poll interval of 1 seconds queried for parent session status and discovered repair failed.
> [2021-04-07 22:45:32,887] Repair command #10 finished with error
> [2021-04-07 22:45:32,887] Some repair failed
> [2021-04-07 22:45:32,888] Repair command #10 finished with error
> stderr:
> error: Some repair failed
> -- StackTrace --
> java.io.IOException: Some repair failed
>     at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
>     at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
>     at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
>     at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
>     at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
>     at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
>     at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Notifications:
> Notification{type=START, src=repair:10, message=Starting repair command #10 (f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace distributed_test_keyspace with repair options (parallelism: dc_parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [validationparticipentcrashesandcomesback_full_datacenter_aware_true], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)}
> Notification{type=ERROR, src=repair:10, message=Repair command #10 failed with error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died}
> Notification{type=COMPLETE, src=repair:10, message=Repair command #10 finished with error}
> Error:
> java.io.IOException: Some repair failed
>     at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
>     at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
>     at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
>     at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
>     at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
>     at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
>     at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
>     at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> There seems to be a race condition in nodetool repair: we query the error
> state before we get the notification, and then throw a generic error rather
> than the specific one.
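
The race described in this issue can be illustrated with a small standalone sketch. It is not the RepairRunner code and not the patch attached to the ticket: the class RepairErrorRace and the methods onErrorNotification, failureRacy and failureWithGrace are hypothetical names chosen for illustration. The sketch assumes one possible mitigation, namely giving the ERROR notification a bounded grace period once polling has found the failed state, and falling back to the generic message only if nothing arrives.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the race; not the actual RepairRunner or the committed fix.
public class RepairErrorRace {
    // Completed by the notification handler when an ERROR notification arrives.
    private final CompletableFuture<String> specificError = new CompletableFuture<>();

    /** Hypothetical hook, called from the notification thread. */
    public void onErrorNotification(String message) {
        specificError.complete(message);
    }

    /** Racy variant: if polling wins, only the generic message is available. */
    public IOException failureRacy() {
        String msg = specificError.getNow(null);
        return new IOException(msg != null ? msg : "Some repair failed");
    }

    /** Waits a bounded time for the specific error before falling back to the generic one. */
    public IOException failureWithGrace(long timeout, TimeUnit unit) throws InterruptedException {
        try {
            return new IOException(specificError.get(timeout, unit));
        } catch (TimeoutException | ExecutionException e) {
            return new IOException("Some repair failed");
        }
    }

    public static void main(String[] args) throws Exception {
        RepairErrorRace runner = new RepairErrorRace();

        // Polling observes the failed state before the notification is delivered.
        System.out.println(runner.failureRacy().getMessage());

        // The notification arrives afterwards on another thread.
        Thread notifier = new Thread(
            () -> runner.onErrorNotification("Endpoint /127.0.0.2:7012 died"));
        notifier.start();
        System.out.println(runner.failureWithGrace(5, TimeUnit.SECONDS).getMessage());
        notifier.join();
    }
}
```

In this sketch the first call reports the generic "Some repair failed" because the notification has not been delivered yet, while the second call reports the endpoint-specific message. The trade-off of a grace period is extra latency on genuine generic failures, so the timeout would need to be kept short.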