ConfX created CASSANDRA-19742:
---------------------------------

             Summary: Cassandra Dtest Cluster is not fully up after upgrading and may fail on queries
                 Key: CASSANDRA-19742
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19742
             Project: Cassandra
          Issue Type: Bug
          Components: Test/dtest/java
            Reporter: ConfX

This may not be a bug, but the Cassandra Dtest framework can definitely be improved.

h2. What happened

In the Dtest framework, a cluster node may not be fully up and operational when the test logic in {{runAfterNodeUpgrade()}} is executed. This may cause unexpected behavior and even test flakiness.

h2. How to reproduce

Take the following upgrade test as an example. Put it under cassandra/test/distributed/org/apache/cassandra/distributed/upgrade/ and build the dtest jars.

{code:java}
package org.apache.cassandra.distributed.upgrade;

import org.junit.Test;

import org.apache.cassandra.distributed.api.ConsistencyLevel;
import org.apache.cassandra.distributed.api.ICoordinator;

import static org.apache.cassandra.distributed.api.Feature.GOSSIP;
import static org.apache.cassandra.distributed.api.Feature.NATIVE_PROTOCOL;
import static org.apache.cassandra.distributed.shared.AssertUtils.assertRows;
import static org.apache.cassandra.distributed.shared.AssertUtils.row;

public class demoUpgradeTest extends UpgradeTestBase
{
    @Test
    public void demoTest() throws Throwable
    {
        new TestCase()
            .nodes(2)
            .nodesToUpgrade(1)
            .withConfig(c -> c.with(GOSSIP, NATIVE_PROTOCOL).set("drop_compact_storage_enabled", true))
            .upgradesToCurrentFrom(v3X)
            .setup((cluster) -> {
                cluster.schemaChange("CREATE KEYSPACE k WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
                cluster.schemaChange("CREATE TABLE k.t ( k int, c int, total counter, PRIMARY KEY (k, c))");
            })
            .runAfterNodeUpgrade((cluster, node) -> {
                ConsistencyLevel cl = ConsistencyLevel.ONE;
                String select = "SELECT total FROM k.t WHERE k = 1 AND c = ?";
                // Use every node in turn as coordinator for counter updates and reads
                for (int i = 1; i <= cluster.size(); i++)
                {
                    ICoordinator coordinator = cluster.coordinator(i);
                    coordinator.execute("UPDATE k.t SET total = total + 1 WHERE k = 1 AND c = ?", cl, i);
                    assertRows(coordinator.execute(select, cl, i), row(1L));
                    coordinator.execute("UPDATE k.t SET total = total - 4 WHERE k = 1 AND c = ?", cl, i);
                    assertRows(coordinator.execute(select, cl, i), row(-3L));
                }
            }).run();
    }
}
{code}

Run the test with:

{code:java}
$ ant test-jvm-dtest-some -Duse.jdk11=true -Dtest.name=org.apache.cassandra.distributed.upgrade.demoUpgradeTest
{code}

You will see the following failure:

{code:java}
[junit-timeout] Testcase: demoTest(org.apache.cassandra.distributed.upgrade.demoUpgradeTest)-_jdk11: FAILED
[junit-timeout] Error in test '4.0.13 -> [4.1-alpha1]' while upgrading to '4.1-alpha1'; successful upgrades []
[junit-timeout] junit.framework.AssertionFailedError: Error in test '4.0.13 -> [4.1-alpha1]' while upgrading to '4.1-alpha1'; successful upgrades []
[junit-timeout]     at org.apache.cassandra.distributed.upgrade.UpgradeTestBase$TestCase.run(UpgradeTestBase.java:442)
[junit-timeout]     at org.apache.cassandra.distributed.upgrade.demoUpgradeTest.demoTest(demoUpgradeTest.java:62)
[junit-timeout]     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit-timeout]     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit-timeout]     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit-timeout] Caused by: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
[junit-timeout]     at org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
[junit-timeout]     at org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:31)
[junit-timeout]     at org.apache.cassandra.service.StorageProxy.findSuitableReplica(StorageProxy.java:1617)
[junit-timeout]     at org.apache.cassandra.service.StorageProxy.mutateCounter(StorageProxy.java:1565)
[junit-timeout]     at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:809)
[junit-timeout]     at org.apache.cassandra.service.StorageProxy.mutateWithTriggers(StorageProxy.java:1054)
[junit-timeout]     at org.apache.cassandra.cql3.statements.ModificationStatement.executeWithoutCondition(ModificationStatement.java:476)
[junit-timeout]     at org.apache.cassandra.cql3.statements.ModificationStatement.execute(ModificationStatement.java:454)
[junit-timeout]     at org.apache.cassandra.distributed.impl.Coordinator.executeInternal(Coordinator.java:103)
[junit-timeout]     at org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:65)
[junit-timeout]     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[junit-timeout]     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[junit-timeout]     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[junit-timeout]     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[junit-timeout]     at java.base/java.lang.Thread.run(Thread.java:829)
{code}

The failure occurs because the test executes the UPDATE statement immediately after the restart, before the upgraded node is fully operational. It can be worked around manually by adding a sleep at the beginning of {{runAfterNodeUpgrade()}}, as below:

{code:java}
...
            .runAfterNodeUpgrade((cluster, node) -> {
                // Wait for the node to be fully operational
                cluster.get(node).nodetool("status");
                // Adding a small delay to ensure the node is fully integrated
                Thread.sleep(10000);
                ...
            }).run();
    }
}
{code}

However, by design, the Dtest framework should wait for the node to be fully operational before executing {{runAfterNodeUpgrade()}}. It would be good to add some waiting logic for this purpose to prevent such unexpected behavior.
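
As a rough illustration of what such waiting logic could look like, here is a minimal sketch of a bounded polling helper that could stand in for the fixed 10-second sleep. The class name {{ReadinessWait}} and its {{await}} method are hypothetical and not part of the Dtest framework; the readiness check itself (for example, retrying a probe query until it stops throwing) is left to the caller and is an assumption, not something the framework is known to provide.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of the Dtest framework. */
public final class ReadinessWait
{
    private ReadinessWait() {}

    /**
     * Polls the check until it returns true or the timeout elapses.
     * A RuntimeException thrown by the check is treated as "not ready yet".
     * The check is always evaluated at least once.
     *
     * @return true if the check passed in time, false on timeout
     */
    public static boolean await(BooleanSupplier check, long timeoutMillis, long pollMillis)
            throws InterruptedException
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true)
        {
            try
            {
                if (check.getAsBoolean())
                    return true;
            }
            catch (RuntimeException e)
            {
                // Assumed transient (e.g. the node is still starting); retry below
            }
            if (System.nanoTime() - deadline >= 0)
                return false; // timed out without the check passing
            Thread.sleep(pollMillis);
        }
    }
}
```

Compared with a fixed sleep, a helper of this shape returns as soon as the check passes and fails explicitly on timeout, so a test would neither wait longer than needed nor proceed silently against a node that never became ready.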