[ https://issues.apache.org/jira/browse/CASSANDRA-19085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812636#comment-17812636 ]
Berenguer Blasi commented on CASSANDRA-19085:
---------------------------------------------

The [original|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L610] method marking a node as shutdown was deprecated in 5.0, and the [new|https://github.com/apache/cassandra/blob/cassandra-5.0/src/java/org/apache/cassandra/gms/Gossiper.java#L634] replacement method was missing some code. Different SCM ({{storage_compatibility_mode}}) settings exercised different code paths, hence the test failing under some settings and passing under others. The failure came from post-shutdown ECHOs that got through and messed up the node's status, marking it UP again:

{noformat}
DEBUG [node1_GossipStage:1] node1 2024-01-30 09:24:17,482 Marked /127.0.0.3:7012 as shutdown
TRACE [node1_GossipStage:1] node1 2024-01-30 09:24:17,595 Updating heartbeat state version to 56 from 49 for /127.0.0.3:7012 ...
TRACE [node1_GossipStage:1] node1 2024-01-30 09:24:17,595 Sending ECHO_REQ to /127.0.0.3:7012
TRACE [node1_GossipStage:1] node1 2024-01-30 09:24:17,595 /127.0.0.1:7012 sending ECHO_REQ to 76@/127.0.0.3:7012
TRACE [node1_GossipStage:1] node1 2024-01-30 09:24:17,595 marking as alive /127.0.0.3:7012
DEBUG [node1_GossipStage:1] node1 2024-01-30 09:24:17,595 removing expire time for endpoint : /127.0.0.3:7012
INFO [node1_GossipStage:1] node1 2024-01-30 09:24:17,595 InetAddress /127.0.0.3:7012 is now UP
{noformat}

The PR includes CI runs for both vanilla and NONE SCM. If we're happy with the fix, I suggest we merge it into CASSANDRA-18753, where it will be exercised by a better CI run.

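The race visible in the log above can be illustrated with a small, self-contained sketch. This is not the Gossiper code and not the change in the PR: the class and every method name below ({{ShutdownGuardSketch}}, {{markShutdown}}, {{onEchoResponseGuarded}}, and so on) are hypothetical, and the sketch only assumes that the missing piece is some check that stops a late ECHO response from marking an already shut-down endpoint alive.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: none of these names exist in Cassandra.
class ShutdownGuardSketch
{
    enum Status { ALIVE, SHUTDOWN }

    // Endpoint address -> last known status.
    private final Map<String, Status> endpoints = new ConcurrentHashMap<>();

    // Used here only to set the initial state of an endpoint.
    void markAlive(String endpoint)
    {
        endpoints.put(endpoint, Status.ALIVE);
    }

    void markShutdown(String endpoint)
    {
        endpoints.put(endpoint, Status.SHUTDOWN);
    }

    // Unguarded handler: any ECHO response flips the endpoint to ALIVE,
    // which mirrors the "Marked ... as shutdown" -> "is now UP" sequence in the log.
    void onEchoResponseUnguarded(String endpoint)
    {
        endpoints.put(endpoint, Status.ALIVE);
    }

    // Guarded handler: the check and the update happen atomically per key,
    // so a response arriving after shutdown is ignored.
    // Returns true only if the endpoint was marked alive by this call.
    boolean onEchoResponseGuarded(String endpoint)
    {
        boolean[] marked = { false };
        endpoints.compute(endpoint, (key, current) -> {
            if (current == Status.SHUTDOWN)
                return current; // late ECHO: keep the shutdown status
            marked[0] = true;
            return Status.ALIVE;
        });
        return marked[0];
    }

    boolean isAlive(String endpoint)
    {
        return endpoints.get(endpoint) == Status.ALIVE;
    }

    public static void main(String[] args)
    {
        ShutdownGuardSketch sketch = new ShutdownGuardSketch();
        String node3 = "/127.0.0.3:7012";

        sketch.markAlive(node3);
        sketch.markShutdown(node3);
        sketch.onEchoResponseUnguarded(node3);
        System.out.println("unguarded: alive=" + sketch.isAlive(node3)); // prints "unguarded: alive=true"

        sketch.markShutdown(node3);
        sketch.onEchoResponseGuarded(node3);
        System.out.println("guarded: alive=" + sketch.isAlive(node3)); // prints "guarded: alive=false"
    }
}
```

In this sketch a shut-down endpoint stays down until something other than an ECHO response resets it; how the real code lets a restarted node come back up is outside what the sketch models.
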
> In-jvm dtest RepairTest fails with storage_compatibility_mode: NONE
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-19085
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19085
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Branimir Lambov
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 5.0-rc, 5.x
>
>
> More precisely, when the {{MessagingService}} version is set to {{VERSION_50}},
> the test fails with an exception that appears to be a genuine problem:
> {code:java}
> junit.framework.AssertionFailedError: Exception found expected null, but was:<java.lang.RuntimeException: Did not get replies from all endpoints.
>     at org.apache.cassandra.service.ActiveRepairService.lambda$prepareForRepair$2(ActiveRepairService.java:678)
>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:833)
> >
>     at org.apache.cassandra.distributed.test.DistributedRepairUtils.lambda$assertParentRepairSuccess$4(DistributedRepairUtils.java:129)
>     at org.apache.cassandra.distributed.test.DistributedRepairUtils.validateExistingParentRepair(DistributedRepairUtils.java:164)
>     at org.apache.cassandra.distributed.test.DistributedRepairUtils.assertParentRepairSuccess(DistributedRepairUtils.java:124)
>     at org.apache.cassandra.distributed.test.RepairTest.testForcedNormalRepairWithOneNodeDown(RepairTest.java:211)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test
>     at org.apache.cassandra.distributed.impl.AbstractCluster.checkAndResetUncaughtExceptions(AbstractCluster.java:1117)
>     at org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1103)
>     at org.apache.cassandra.distributed.test.RepairTest.closeCluster(RepairTest.java:160)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     Suppressed: java.lang.IllegalStateException: complete already: (failure: java.lang.RuntimeException: Did not get replies from all endpoints.)
>         at org.apache.cassandra.utils.concurrent.AsyncPromise.setSuccess(AsyncPromise.java:106)
>         at org.apache.cassandra.service.ActiveRepairService$2.ack(ActiveRepairService.java:721)
>         at org.apache.cassandra.service.ActiveRepairService$2.onResponse(ActiveRepairService.java:697)
>         at org.apache.cassandra.repair.messages.RepairMessage$2.onResponse(RepairMessage.java:187)
>         at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:58)
>         at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>         at org.apache.cassandra.net.InboundSink$Filtered.accept(InboundSink.java:64)
>         at org.apache.cassandra.net.InboundSink$Filtered.accept(InboundSink.java:50)
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>         at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>         at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> {code}
> The updates to {{pending}} in ActiveRepairService are not concurrency-safe,
> but fixing them by doing e.g.
> {code:java}
> Index: src/java/org/apache/cassandra/service/ActiveRepairService.java
> IDEA additional info:
> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> <+>UTF-8
> ===================================================================
> diff --git a/src/java/org/apache/cassandra/service/ActiveRepairService.java b/src/java/org/apache/cassandra/service/ActiveRepairService.java
> --- a/src/java/org/apache/cassandra/service/ActiveRepairService.java (revision 04552046f74f596e69e2d98c3f3e522fb5888c99)
> +++ b/src/java/org/apache/cassandra/service/ActiveRepairService.java (date 1700839874092)
> @@ -675,7 +675,7 @@
>              if (promise.isDone())
>                  return;
>              String errorMsg = "Did not get replies from all endpoints.";
> -            if (promise.tryFailure(new RuntimeException(errorMsg)))
> +            if (pending.getAndSet(-1) > 0 && promise.tryFailure(new RuntimeException(errorMsg)))
>                  participateFailed(parentRepairSession, errorMsg);
>          }, timeoutMillis, MILLISECONDS);
>
> @@ -703,8 +703,8 @@
>                  failedNodes.add(from.toString());
>                  if (failureReason == RequestFailureReason.TIMEOUT)
>                  {
> -                    pending.set(-1);
> -                    promise.setFailure(failRepairException(parentRepairSession, "Did not get replies from all endpoints."));
> +                    if (pending.getAndSet(-1) > 0)
> +                        promise.setFailure(failRepairException(parentRepairSession, "Did not get replies from all endpoints."));
>                  }
>                  else
>                  {
> {code}
> still results in a test failure:
> {code:java}
> java.lang.AssertionError: Exception found expected null, but was:<java.lang.RuntimeException: Did not get replies from all endpoints.
>     at org.apache.cassandra.service.ActiveRepairService.lambda$prepareForRepair$2(ActiveRepairService.java:678)
>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Thread.java:829)>
>     at org.junit.Assert.fail(Assert.java:88)
>     at org.junit.Assert.failNotNull(Assert.java:755)
>     at org.junit.Assert.assertNull(Assert.java:737)
>     at org.apache.cassandra.distributed.test.DistributedRepairUtils.lambda$assertParentRepairSuccess$4(DistributedRepairUtils.java:129)
>     at org.apache.cassandra.distributed.test.DistributedRepairUtils.validateExistingParentRepair(DistributedRepairUtils.java:164)
>     at org.apache.cassandra.distributed.test.DistributedRepairUtils.assertParentRepairSuccess(DistributedRepairUtils.java:124)
>     at org.apache.cassandra.distributed.test.RepairTest.testForcedNormalRepairWithOneNodeDown(RepairTest.java:211)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>     at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>     at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
>     at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
>     at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
>     at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:232)
>     at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:55)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)