jmckenzie-dev commented on code in PR #203:
URL: 
https://github.com/apache/cassandra-analytics/pull/203#discussion_r3189358775


##########
cassandra-analytics-integration-tests/src/test/java/org/apache/cassandra/analytics/BulkReaderMultiDCConsistencyTest.java:
##########
@@ -134,93 +135,140 @@ public static PartitionedDataLayer.AvailabilityHint 
getAvailability(CassandraIns
      * @throws NoSuchMethodException
      */
     @Test
-    void eachQuorumIsNotQuorum() throws NoSuchMethodException
+    void eachQuorumIsNotQuorum() throws IOException, NoSuchMethodException
     {
-        List<String> updatedDataSet = new ArrayList<>(OG_DATASET);
-        updatedDataSet.set(1, TEST_VAL);
-
-        // Internally update value for TEST_KEY for node5 and node6. This 
update doesn't propagate to other nodes.
-        updateValueNodeInternal(5, TEST_KEY, TEST_VAL);
-        updateValueNodeInternal(6, TEST_KEY, TEST_VAL);
-
-        // Bytecode injection to simulate a scenario where node5 and node6 are 
at the end of the replica list for bulk reader.
-        // This simulation mimics a real world scenario.
-        // With this arrangement PartitionedDataLayer.splitReplicas method for 
QUORUM will split the replicas like below:
-        // primaryReplicas: [Node1, Node2, Node3, Node4]
-        // secondaryReplicas: [Node5, Node6]
-        // Number of nodes required for QUORUM read id 6/1 + 1 = 4. Bulk 
reader will read from [Node1, Node2, Node3, Node4] only.
+        // The agent must be installed before 
ClassReloadingStrategy.fromInstalledAgent() — otherwise
+        // it throws IllegalStateException when the JVM hosting this test 
class was started without a
+        // prior ByteBuddy install (e.g. when the CI harness runs this class 
in its own gradle invocation).
         ByteBuddyAgent.install();
-        new ByteBuddy()
-        .redefine(CassandraDataLayer.class)
-        .method(ElementMatchers.named("getAvailability"))
-        .intercept(
-        
MethodCall.invoke(BulkReaderMultiDCConsistencyTest.class.getMethod("getAvailability",
 CassandraInstance.class))
-                  .withAllArguments()
-        )
-        .make()
-        .load(
-        CassandraDataLayer.class.getClassLoader(),
-        ClassReloadingStrategy.fromInstalledAgent()
-        );
+        // Hold on to the original strategy to reset ByteBuddy after this test
+        ClassReloadingStrategy crStrategy = 
ClassReloadingStrategy.fromInstalledAgent();
 
-        // Bulk read with QUORUM consistency
-        List<Row> rowList = bulkRead(ConsistencyLevel.QUORUM.name());
-        // Validate that the result doesn't have the updated data.
-        validateBulkReadRows(rowList, OG_DATASET);
-
-        // Message filter to mimic message drops from Node5 and Node6 to Node1.
-        // We are setting this up to simulate a scenario where reading values 
with QUORUM consistency with driver
-        // and using Node1 as the coordinator doesn't get the values from 
Node5 and Node6.
-        cluster.filters().allVerbs().from(5).to(1).drop();
-        cluster.filters().allVerbs().from(6).to(1).drop();
-
-        // Read value for TEST_KEY with driver using Node1 as coordinator
-        String quorumVal = readValueForKey(cluster.get(1).coordinator(), 
TEST_KEY, ConsistencyLevel.QUORUM);
-        // Validate that the updated value is not read
-        assertThat(quorumVal).isEqualTo(OG_DATASET.get(TEST_KEY));
-
-        // Cleanup message filter
-        cluster.filters().reset();
-
-        // Bulk read with EACH_QUORUM consistency
-        rowList = bulkRead(ConsistencyLevel.EACH_QUORUM.name());
-        // Validate that bulk reader was able to read the updated value
-        validateBulkReadRows(rowList, updatedDataSet);
-        // Read value using driver with EACH_QUORUM
-        String eachQuorumVal = readValueForKey(TEST_KEY, 
ConsistencyLevel.EACH_QUORUM);
-        // Validate that EACH_QUORUM read using driver and the bulk reader are 
the same
-        
assertThat(eachQuorumVal).isEqualTo(rowList.get(TEST_KEY).getString(1));
-
-        // Revert the value update for all nodes
-        setValueForALL(TEST_KEY, OG_DATASET.get(TEST_KEY));
+        // We need the try/finally structure since this test mutates global 
state through the ByteBuddy changes; if we
+        // time out during the test run that change persists and, pending test 
ordering, will cascade and take the rest
+        // with it.
+        try
+        {
+            List<String> updatedDataSet = new ArrayList<>(OG_DATASET);
+            updatedDataSet.set(1, TEST_VAL);
+
+            // Internally update value for TEST_KEY for node5 and node6. This 
update doesn't propagate to other nodes.
+            updateValueNodeInternal(5, TEST_KEY, TEST_VAL);
+            updateValueNodeInternal(6, TEST_KEY, TEST_VAL);
+
+            // Bytecode injection to simulate a scenario where node5 and node6 
are at the end of the replica list for bulk reader.
+            // This simulation mimics a real world scenario.
+            // With this arrangement PartitionedDataLayer.splitReplicas method 
for QUORUM will split the replicas like below:
+            // primaryReplicas: [Node1, Node2, Node3, Node4]
+            // secondaryReplicas: [Node5, Node6]
+            // Number of nodes required for QUORUM read id 6/1 + 1 = 4. Bulk 
reader will read from [Node1, Node2, Node3, Node4] only.
+            new ByteBuddy()
+                    .redefine(CassandraDataLayer.class)
+                    .method(ElementMatchers.named("getAvailability"))
+                    .intercept(
+                            
MethodCall.invoke(BulkReaderMultiDCConsistencyTest.class.getMethod("getAvailability",
 CassandraInstance.class))
+                                    .withAllArguments()
+                    )
+                    .make()
+                    .load(CassandraDataLayer.class.getClassLoader(), 
crStrategy);
+
+            // Bulk read with QUORUM consistency
+            List<Row> rowList = bulkRead(ConsistencyLevel.QUORUM.name());
+            // Validate that the result doesn't have the updated data.
+            validateBulkReadRows(rowList, OG_DATASET);
+
+            // Message filter to mimic message drops from Node5 and Node6 to 
Node1.
+            // We are setting this up to simulate a scenario where reading 
values with QUORUM consistency with driver
+            // and using Node1 as the coordinator doesn't get the values from 
Node5 and Node6.
+            cluster.filters().allVerbs().from(5).to(1).drop();
+            cluster.filters().allVerbs().from(6).to(1).drop();
+
+            // Read value for TEST_KEY with driver using Node1 as coordinator.
+            // The message filters above drop responses from Node5/Node6 to 
Node1, but the coordinator's
+            // snitch-based replica selection may still pick Node5 or Node6 as 
one of its 4 QUORUM replicas.
+            // When that happens the dropped response causes a 
ReadTimeoutException. This is infrastructure
+            // noise from the message filter and not a real test failure, 
which would manifest as an
+            // AssertionError (wrong value returned), not a timeout. Retry to 
tolerate the non-deterministic
+            // replica selection. This test was _very_ intermittently flaky 
before, so if we take something that
+            // flaked out 1% of the time for instance and then repeat 10x, we 
_should_ be in a better place.
+            // Another option here would be to just keep spinning on 
ReadTimeoutExceptions until the junit
+            // timeout timer but that just seems excessive.
+            String quorumVal = null;
+            for (int attempt = 1; attempt <= 10; attempt++)
+            {
+                try
+                {
+                    quorumVal = readValueForKey(cluster.get(1).coordinator(), 
TEST_KEY, ConsistencyLevel.QUORUM);
+                    break;
+                }
+                catch (Exception e)
+                {
+                    if (attempt == 10 || 
!e.getClass().getSimpleName().equals("ReadTimeoutException"))

Review Comment:
   Yeah, completely. Probably a leftover of digging out wrapped exceptions but 
these 2 are functionally identical. I'll commit the change and if CI's still 
good take your prior +1 as still applying? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to