chihsuan commented on code in PR #10451:
URL: https://github.com/apache/ozone/pull/10451#discussion_r3371632728


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java:
##########
@@ -805,6 +804,24 @@ public void testInstallSnapshotWithClientWrite() throws 
Exception {
     assertLogCapture(logCapture, msg);
     assertLogCapture(logCapture, "Install Checkpoint is finished");
 
+    // All newKeys writes have completed (writeFuture.get() above), so the
+    // leader must already contain them.
+    OMMetadataManager leaderOmMetaMgr = leaderOM.getMetadataManager();
+    for (String key : newKeys) {
+      assertNotNull(leaderOmMetaMgr.getKeyTable(
+          TEST_BUCKET_LAYOUT)
+          .get(leaderOmMetaMgr.getOzoneKey(volumeName, bucketName, key)));
+    }

Review Comment:
   Good point, I'll move the leader key check right after `writeFuture.get()`. 
Thanks!



##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java:
##########
@@ -805,6 +804,24 @@ public void testInstallSnapshotWithClientWrite() throws 
Exception {
     assertLogCapture(logCapture, msg);
     assertLogCapture(logCapture, "Install Checkpoint is finished");
 
+    // All newKeys writes have completed (writeFuture.get() above), so the
+    // leader must already contain them.
+    OMMetadataManager leaderOmMetaMgr = leaderOM.getMetadataManager();
+    for (String key : newKeys) {
+      assertNotNull(leaderOmMetaMgr.getKeyTable(
+          TEST_BUCKET_LAYOUT)
+          .get(leaderOmMetaMgr.getOzoneKey(volumeName, bucketName, key)));
+    }
+
+    // Wait for the follower to apply everything the leader has applied; all
+    // writes have completed on the leader, so after this no further snapshot
+    // install (and DB reload) can occur and the follower DB reads below are
+    // safe from "Rocks Database is closed" races.
+    long leaderApplied = leaderOM.getOmRatisServer()
+        .getLastAppliedTermIndex().getIndex();
+    GenericTestUtils.waitFor(() -> followerOM.getOmRatisServer()
+        .getLastAppliedTermIndex().getIndex() >= leaderApplied, 100, 30_000);
+

Review Comment:
   Agreed, that's exactly the approach here. The first sleep's waitFor was 
already there (the leaderOMSnapshotIndex - 1 check right below it), so removing 
the sleep is safe. 
   
   The second sleep is now replaced by the leaderApplied waitFor, which matches 
your snippet. I kept the timeout at `30_000` to stay consistent with the other 
waitFor calls in this file (since it returns as soon as the condition is met, 
the ceiling only affects the failure case.



##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java:
##########
@@ -169,11 +169,13 @@ public void init(TestInfo testInfo) throws Exception {
     clientConfig.setRpcTimeOut(TimeUnit.SECONDS.toMillis(5));
     conf.setFromObject(clientConfig);
 
-    cluster = MiniOzoneCluster.newHABuilder(conf)
-        .setOMServiceId("om-service-test1")
+    MiniOzoneHAClusterImpl.Builder clusterBuilder =
+        MiniOzoneCluster.newHABuilder(conf);
+    clusterBuilder.setOMServiceId("om-service-test1")
         .setNumOfOzoneManagers(NUM_OF_OMS)
         .setNumOfActiveOMs(2)
-        .build();
+        .setNumDatanodes(1);
+    cluster = clusterBuilder.build();

Review Comment:
   I'd love to keep it as a single chain, but it won't work. The 
`setNumDatanodes` is only defined on the base `MiniOzoneCluster.Builder,` so 
calling it downgrades the static type to the base builder. After that, 
`.build()` resolves to the base `build()` which returns `MiniOzoneCluster`, not 
`MiniOzoneHAClusterImpl`. 



##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java:
##########
@@ -931,8 +941,6 @@ public void testInstallSnapshotWithClientRead() throws 
Exception {
           .get(followerOMMetaMngr.getOzoneKey(volumeName, bucketName, key)));
     }
 
-    // Wait installation finish
-    Thread.sleep(5000);

Review Comment:
   For this one we don't really need a `waitFor`. The line right after is 
`assertLogCapture(...)`, which is itself a `waitFor` (polls up to `30s`). The 
sleep was purely redundant, so I just removed it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to