[ https://issues.apache.org/jira/browse/HDFS-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843328#comment-15843328 ]
Xiao Chen commented on HDFS-11353:
----------------------------------

Thanks [~linyiqun] for the work, good to see test improvements! I haven't looked into the {{TestDataNodeVolumeFailureReporting}} case yet, but here are some general comments/questions.
- Since {{DataNodeTestUtils#checkDiskErrorSync}} is really loop-waiting, it may be better to rename it to something like {{waitForDiskError}}.
- In the same method, {{GenericTestUtils.waitFor}} could replace the while loop + {{assertTrue}}.
- In the test classes, instead of adding a specific timeout to each test case, we can just add a {{@Rule}} for the timeout to the entire test class. This is also a little more future-proof. The only downside is that we need to examine all existing test cases to make sure this timeout isn't too aggressive.

> Improve the unit tests relevant to DataNode volume failure testing
> ------------------------------------------------------------------
>
>                 Key: HDFS-11353
>                 URL: https://issues.apache.org/jira/browse/HDFS-11353
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Yiqun Lin
>            Assignee: Yiqun Lin
>         Attachments: HDFS-11353.001.patch, HDFS-11353.002.patch, HDFS-11353.003.patch, HDFS-11353.004.patch
>
> Currently, many tests whose names start with {{TestDataNodeVolumeFailure*}} frequently time out or fail. I found one failing test in a recent Jenkins build. The stack trace:
> {code}
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
> java.util.concurrent.TimeoutException: Timed out waiting for DN to die
> 	at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
> 	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)
> {code}
> The related code:
> {code}
>     /*
>      * Now fail the 2nd volume on the 3rd datanode. All its volumes
>      * are now failed and so it should report two volume failures
>      * and that it's no longer up. Only wait for two replicas since
>      * we'll never get a third.
>      */
>     DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
>     Path file3 = new Path("/test3");
>     DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
>     DFSTestUtil.waitReplication(fs, file3, (short)2);
>     // The DN should consider itself dead
>     DFSTestUtil.waitForDatanodeDeath(dns.get(2));
> {code}
> Here the code waits for the datanode to fail all of its volumes and then become dead, but it timed out. It would be better to first verify that all the volumes have failed, and only then wait for the datanode to become dead.
> In addition, we can use the method {{checkDiskErrorSync}} to do the disk error check instead of creating files. In this JIRA, I would like to extract this logic and define it in {{DataNodeTestUtils}}, so that we can reuse the method for datanode volume failure testing in the future.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
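The {{GenericTestUtils.waitFor}} suggestion in the comment above amounts to replacing a hand-rolled while loop + {{assertTrue}} with a generic polling helper that retries a condition until a deadline. A minimal, self-contained sketch of that pattern in plain Java (this illustrates the idea only; it is not Hadoop's actual {{GenericTestUtils}}, and the class name {{WaitForSketch}} is invented for this example):

```java
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class WaitForSketch {
  /**
   * Polls {@code check} every {@code checkEveryMillis} milliseconds until it
   * returns true; throws TimeoutException once {@code waitForMillis} elapses.
   */
  public static void waitFor(Supplier<Boolean> check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    while (System.currentTimeMillis() < deadline) {
      if (check.get()) {
        return;  // condition satisfied, stop waiting
      }
      Thread.sleep(checkEveryMillis);
    }
    throw new TimeoutException("Timed out waiting for condition");
  }
}
```

A renamed {{waitForDiskError}} helper in {{DataNodeTestUtils}} could then pass the datanode's failed-volume check as the {{Supplier}} condition instead of asserting inside its own loop. The class-level timeout suggestion corresponds to JUnit 4's {{org.junit.rules.Timeout}} rule, declared once per class as a {{@Rule}} field rather than repeating a {{timeout}} attribute on every {{@Test}} annotation.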