[ https://issues.apache.org/jira/browse/HDFS-16935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693444#comment-17693444 ]
ASF GitHub Bot commented on HDFS-16935:
---------------------------------------

virajjasani commented on code in PR #5432:
URL: https://github.com/apache/hadoop/pull/5432#discussion_r1117883225


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##########

@@ -1101,15 +1099,12 @@ public void testReportBadBlocks() throws Exception {
       block = DFSTestUtil.getFirstBlock(fs, filePath);

       // Test for the overloaded method reportBadBlocks
-      dataNode.reportBadBlocks(block, dataNode.getFSDataset()
-          .getFsVolumeReferences().get(0));
-      Thread.sleep(3000);
-      BlockManagerTestUtil.updateState(cluster.getNamesystem()
-          .getBlockManager());
-      // Verify the bad block has been reported to namenode
-      Assert.assertEquals(1, cluster.getNamesystem().getCorruptReplicaBlocks());
-    } finally {
-      cluster.shutdown();
+      dataNode.reportBadBlocks(block, dataNode.getFSDataset().getFsVolumeReferences().get(0));
+      GenericTestUtils.waitFor(() -> {
+        BlockManagerTestUtil.updateState(cluster.getNamesystem().getBlockManager());
+        // Verify the bad block has been reported to namenode
+        return 1 == cluster.getNamesystem().getCorruptReplicaBlocks();
+      }, 100, 10000, "Corrupted replica blocks could not be found");

Review Comment:
   Basically, what I am trying to say is: should we also consider increasing the wait time here, to say 500/1000 ms instead of 100 ms?

   ```
   void triggerHeartbeatForTests() {
     synchronized (ibrManager) {
       final long nextHeartbeatTime = scheduler.scheduleHeartbeat();
       ibrManager.notifyAll();
       while (nextHeartbeatTime - scheduler.nextHeartbeatTime >= 0) {
         try {
           ibrManager.wait(100);   // <=== how about 500 ms at least?
         } catch (InterruptedException e) {
           return;
         }
       }
     }
   }
   ```

   Edit: Anyway, until we have concrete proof of heartbeat-based tests being flaky, this change might not be useful, at least not for this Jira.
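For context on the pattern under discussion: the change replaces a fixed `Thread.sleep(3000)` with a poll-until-true wait, which passes as soon as the condition holds and only fails after the full timeout. A minimal standalone sketch of such a helper (the `PollUntil` class below is hypothetical and only mirrors the `(check, intervalMs, timeoutMs)` shape of `GenericTestUtils.waitFor`; Hadoop's actual implementation differs) could look like:

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a waitFor-style polling helper; not Hadoop's implementation. */
public class PollUntil {
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs,
      String failMessage) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    // Re-check at the polling interval instead of taking one long fixed sleep.
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new TimeoutException(failMessage);
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    // Condition becomes true after ~300 ms; waitFor returns shortly afterwards
    // rather than sleeping for the full 10 s budget.
    waitFor(() -> System.currentTimeMillis() - start >= 300, 100, 10000,
        "condition never became true");
    System.out.println("condition met");
  }
}
```

On a fast machine this returns in a few hundred milliseconds, while a slow or loaded machine still gets the whole timeout budget, which is exactly why polling is less brittle than a fixed sleep.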
I updated the test to trigger the heartbeat, as I am not able to reproduce any failures with an inconsistent corrupt replica count.

> TestFsDatasetImpl.testReportBadBlocks brittle
> ---------------------------------------------
>
>                 Key: HDFS-16935
>                 URL: https://issues.apache.org/jira/browse/HDFS-16935
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 3.4.0, 3.3.5, 3.3.9
>            Reporter: Steve Loughran
>            Assignee: Viraj Jasani
>            Priority: Minor
>              Labels: pull-request-available
>
> Jenkins failure because the sleep() time is not long enough:
> {code}
> Failing for the past 1 build (Since #4 )
> Took 7.4 sec.
> Error Message
> expected:<1> but was:<0>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<0>
> 	at org.junit.Assert.fail(Assert.java:89)
> 	at org.junit.Assert.failNotEquals(Assert.java:835)
> 	at org.junit.Assert.assertEquals(Assert.java:647)
> 	at org.junit.Assert.assertEquals(Assert.java:633)
> {code}
> The assert comes after a 3 s sleep waiting for the reports to come in:
> {code}
>     dataNode.reportBadBlocks(block, dataNode.getFSDataset()
>         .getFsVolumeReferences().get(0));
>     Thread.sleep(3000);  // 3 s sleep
>     BlockManagerTestUtil.updateState(cluster.getNamesystem()
>         .getBlockManager());
>     // Verify the bad block has been reported to namenode
>     Assert.assertEquals(1,
>         cluster.getNamesystem().getCorruptReplicaBlocks());  // here
> {code}
> LambdaTestUtils.eventually() should be used around this assert, maybe with an
> even shorter initial delay so that on faster systems the test is faster.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)