Lin Yiqun created HDFS-10275:
--------------------------------

             Summary: TestDataNodeMetrics failing intermittently due to TotalWriteTime counted incorrectly
                 Key: HDFS-10275
                 URL: https://issues.apache.org/jira/browse/HDFS-10275
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: test
            Reporter: Lin Yiqun
            Assignee: Lin Yiqun
The unit test {{TestDataNodeMetrics}} fails intermittently. The failure output shows:
{code}
Results :

Failed tests:
  TestDataNodeVolumeFailureToleration.testVolumeAndTolerableConfiguration:195->testVolumeConfig:232 expected:<false> but was:<true>

Tests in error:
  TestOpenFilesWithSnapshot.testWithCheckpoint:94 ? IO Timed out waiting for Min...
  TestDataNodeMetrics.testDataNodeTimeSpend:279 ? Timeout Timed out waiting for ...
  TestHFlush.testHFlushInterrupted ? IO The stream is closed
{code}
The timeout happens at line 279 of {{TestDataNodeMetrics}}. Looking into the code, the real reason is that the {{TotalWriteTime}} metric frequently counts 0 in each iteration of creating a file, which leads to retrying until the timeout. I debugged the test locally, and the most likely cause of {{TotalWriteTime}} always counting 0 is that we use {{SimulatedFSDataset}} for the time-spent test. In {{SimulatedFSDataset}}, the write time is measured around the inner class's method {{SimulatedOutputStream#write}}, which just updates the {{length}} and throws its data away:
{code}
@Override
public void write(byte[] b, int off, int len) throws IOException {
  length += len;
}
{code}
So the write operation costs almost no time at millisecond granularity. We should therefore create the file in a real way instead of the simulated way. I have tested locally that the test passes on the first attempt once the simulated way is removed, while with the old way the test retries many times trying to accumulate write time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
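To illustrate the failure mode, here is a minimal, self-contained sketch (not the actual Hadoop classes; {{SimulatedOutputStream}} below is a hypothetical re-creation of the inner class's behavior). Because {{write}} only bumps a counter and discards the bytes, a millisecond-granularity timer wrapped around it almost always reads 0, which is exactly why {{TotalWriteTime}} never advances:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical minimal re-creation of SimulatedFSDataset's inner
// SimulatedOutputStream: it only tracks the length and discards the
// data, so no real I/O time is ever spent in write().
class SimulatedOutputStream extends OutputStream {
  long length = 0;

  @Override
  public void write(int b) throws IOException {
    length += 1;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    length += len; // data is thrown away; no disk write happens
  }
}

public class Demo {
  public static void main(String[] args) throws IOException {
    SimulatedOutputStream out = new SimulatedOutputStream();
    byte[] block = new byte[64 * 1024];

    // Time the writes the way a millisecond-granularity metric would.
    long start = System.currentTimeMillis();
    for (int i = 0; i < 100; i++) {
      out.write(block, 0, block.length);
    }
    long totalWriteTimeMs = System.currentTimeMillis() - start;

    System.out.println("bytes recorded:   " + out.length);
    System.out.println("totalWriteTimeMs: " + totalWriteTimeMs);
    // totalWriteTimeMs is almost always 0 here, mirroring the stuck metric.
  }
}
```

Running this shows the length advancing normally while the measured write time stays at (or near) 0 ms, so a test that waits for the metric to become positive can only pass by luck of scheduler jitter.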