[ https://issues.apache.org/jira/browse/HDFS-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693444#comment-17693444 ]
ASF GitHub Bot commented on HDFS-16935:
---------------------------------------

virajjasani commented on code in PR #5432:
URL: https://github.com/apache/hadoop/pull/5432#discussion_r1117883225


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##########

@@ -1101,15 +1099,12 @@ public void testReportBadBlocks() throws Exception {
       block = DFSTestUtil.getFirstBlock(fs, filePath);

       // Test for the overloaded method reportBadBlocks
-      dataNode.reportBadBlocks(block, dataNode.getFSDataset()
-          .getFsVolumeReferences().get(0));
-      Thread.sleep(3000);
-      BlockManagerTestUtil.updateState(cluster.getNamesystem()
-          .getBlockManager());
-      // Verify the bad block has been reported to namenode
-      Assert.assertEquals(1, cluster.getNamesystem().getCorruptReplicaBlocks());
-    } finally {
-      cluster.shutdown();
+      dataNode.reportBadBlocks(block, dataNode.getFSDataset().getFsVolumeReferences().get(0));
+      GenericTestUtils.waitFor(() -> {
+        BlockManagerTestUtil.updateState(cluster.getNamesystem().getBlockManager());
+        // Verify the bad block has been reported to namenode
+        return 1 == cluster.getNamesystem().getCorruptReplicaBlocks();
+      }, 100, 10000, "Corrupted replica blocks could not be found");

Review Comment:
   Basically, what I am trying to say is: should we also consider increasing the wait time here, to say 500/1000 ms instead of 100 ms?

   ```
   void triggerHeartbeatForTests() {
     synchronized (ibrManager) {
       final long nextHeartbeatTime = scheduler.scheduleHeartbeat();
       ibrManager.notifyAll();
       while (nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0) {
         try {
           ibrManager.wait(100);   // <=== how about 500 ms at least?
         } catch (InterruptedException e) {
           return;
         }
       }
     }
   }
   ```

   Edit: Anyway, until we have concrete proof of heartbeat-based tests being flaky, this change might not be useful, at least not for this Jira.
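For context on the pattern under discussion: the change replaces a fixed `Thread.sleep(3000)` with a poll-until-true wait, which passes as soon as the condition holds and only fails after the full timeout. A minimal standalone sketch of such a helper (the `PollUntil` class below is hypothetical and only mirrors the `(check, intervalMs, timeoutMs)` shape of `GenericTestUtils.waitFor`; Hadoop's actual implementation differs) could look like:

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a waitFor-style polling helper; not Hadoop's implementation. */
public class PollUntil {
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs,
      String failMessage) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    // Re-check at the polling interval instead of taking one long fixed sleep.
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(failMessage);
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    // Condition becomes true after ~300 ms; waitFor returns shortly afterwards
    // rather than sleeping for the full 10 s budget.
    waitFor(() -> System.currentTimeMillis() - start >= 300, 100, 10000,
        "condition never became true");
    System.out.println("condition met");
  }
}
```

On a fast machine this returns in a few hundred milliseconds, while a slow or loaded machine still gets the whole timeout budget, which is exactly why polling is less brittle than a fixed sleep.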
I updated the test to trigger the heartbeat, as I am not able to reproduce any failures with an inconsistent corrupt replica count.

> TestFsDatasetImpl.testReportBadBlocks brittle
> ---------------------------------------------
>
>                 Key: HDFS-16935
>                 URL: https://issues.apache.org/jira/browse/HDFS-16935
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 3.4.0, 3.3.5, 3.3.9
>            Reporter: Steve Loughran
>            Assignee: Viraj Jasani
>            Priority: Minor
>              Labels: pull-request-available
>
> Jenkins failure because the sleep() time is not long enough:
> {code}
> Failing for the past 1 build (Since #4 )
> Took 7.4 sec.
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
> 	at org.junit.Assert.fail(Assert.java:89)
> 	at org.junit.Assert.failNotEquals(Assert.java:835)
> 	at org.junit.Assert.assertEquals(Assert.java:647)
> 	at org.junit.Assert.assertEquals(Assert.java:633)
> {code}
> The assert comes after a 3 s sleep waiting for the reports to come in:
> {code}
>     dataNode.reportBadBlocks(block, dataNode.getFSDataset()
>         .getFsVolumeReferences().get(0));
>     Thread.sleep(3000);  // 3 s sleep
>     BlockManagerTestUtil.updateState(cluster.getNamesystem()
>         .getBlockManager());
>     // Verify the bad block has been reported to namenode
>     Assert.assertEquals(1,
>         cluster.getNamesystem().getCorruptReplicaBlocks());  // here
> {code}
> LambdaTestUtils.eventually() should be used around this assert, maybe with an
> even shorter initial delay so that on faster systems the test is faster.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)