Wei-Chiu Chuang created HDFS-12279:
--------------------------------------

             Summary: TestPipelinesFailover#testPipelineRecoveryStress fails due to race condition
                 Key: HDFS-12279
                 URL: https://issues.apache.org/jira/browse/HDFS-12279
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode, test
            Reporter: Wei-Chiu Chuang

Saw a test failure in a precommit build:
https://builds.apache.org/job/PreCommit-HDFS-Build/20600/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestPipelinesFailover/testPipelineRecoveryStress/

{noformat}
Error Message

Deferred

Stacktrace

java.lang.RuntimeException: Deferred
	at org.apache.hadoop.test.MultithreadedTestUtil$TestContext.checkException(MultithreadedTestUtil.java:130)
	at org.apache.hadoop.test.MultithreadedTestUtil$TestContext.stop(MultithreadedTestUtil.java:166)
	at org.apache.hadoop.hdfs.server.namenode.ha.HAStressTestHarness.shutdown(HAStressTestHarness.java:154)
	at org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testPipelineRecoveryStress(TestPipelinesFailover.java:493)
Caused by: java.lang.AssertionError: null
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.addBlocksToBeInvalidated(DatanodeDescriptor.java:641)
	at org.apache.hadoop.hdfs.server.blockmanagement.InvalidateBlocks.invalidateWork(InvalidateBlocks.java:299)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.invalidateWorkForOneNode(BlockManager.java:4236)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeInvalidateWork(BlockManager.java:1736)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerTestUtil.computeInvalidationWork(BlockManagerTestUtil.java:169)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerTestUtil.computeAllPendingWork(BlockManagerTestUtil.java:185)
	at org.apache.hadoop.hdfs.server.namenode.ha.HAStressTestHarness$1.doAnAction(HAStressTestHarness.java:102)
	at org.apache.hadoop.test.MultithreadedTestUtil$RepeatingTestThread.doWork(MultithreadedTestUtil.java:222)
	at org.apache.hadoop.test.MultithreadedTestUtil$TestingThread.run(MultithreadedTestUtil.java:189)
{noformat}

From studying the code, the assertion can only fail because of a race condition that occurs only in the test.

Specifically, the test uses BlockManagerTestUtil to call {{BlockManager#computeInvalidateWork}}, which gets the datanode list from {{invalidateBlocks.getDatanodes()}}. It then uses that list to perform block invalidation via {{InvalidateBlocks#invalidateWork}}, which calls {{DatanodeDescriptor#addBlocksToBeInvalidated}}; that method asserts that the invalidation list is not empty.

However, if the BlockManager performs block invalidation before the test reaches {{DatanodeDescriptor#addBlocksToBeInvalidated}}, the invalidation list can be empty, because no lock makes the two steps atomic.

This is not a problem for a real cluster, because there is only one BlockManager per NameNode process, so the potential race condition is not exposed.
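
To make the suspected interleaving concrete, here is a minimal single-threaded sketch that replays it deterministically. It is not the HDFS code: {{InvalidationRaceSketch}}, {{drain}}, and {{replayInterleaving}} are hypothetical names, and the map is only an assumed stand-in for the per-datanode invalidation bookkeeping. Each method is synchronized on its own, as a simplifying assumption, but nothing holds a lock across the "get datanodes" step and the "invalidate" step, so a second caller can act on a stale snapshot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for the invalidation bookkeeping; not the HDFS classes. */
public class InvalidationRaceSketch {
    // datanode name -> block IDs pending invalidation
    private final Map<String, List<Long>> pending = new HashMap<>();

    public synchronized void add(String datanode, long blockId) {
        pending.computeIfAbsent(datanode, k -> new ArrayList<>()).add(blockId);
    }

    /** Step 1: snapshot the datanodes that currently have pending work. */
    public synchronized List<String> getDatanodes() {
        return new ArrayList<>(pending.keySet());
    }

    /** Step 2: drain one datanode's pending blocks; empty if another caller drained first. */
    public synchronized List<Long> drain(String datanode) {
        List<Long> blocks = pending.remove(datanode);
        return blocks == null ? new ArrayList<>() : blocks;
    }

    /** Stand-in for the failing check: the list handed over must not be empty. */
    public static void addBlocksToBeInvalidated(List<Long> blocks) {
        if (blocks.isEmpty()) {
            throw new AssertionError("empty invalidation list");
        }
    }

    /** Replays the interleaving by hand; returns true if the second caller trips the check. */
    public static boolean replayInterleaving() {
        InvalidationRaceSketch queue = new InvalidationRaceSketch();
        queue.add("dn1", 1001L);

        List<String> seenByTest = queue.getDatanodes();        // test thread takes its snapshot
        List<String> seenByBlockManager = queue.getDatanodes(); // BlockManager takes its snapshot

        // The BlockManager gets there first and drains dn1.
        addBlocksToBeInvalidated(queue.drain(seenByBlockManager.get(0)));

        try {
            // The test thread now acts on its stale snapshot and finds nothing left.
            addBlocksToBeInvalidated(queue.drain(seenByTest.get(0)));
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("assertion tripped: " + replayInterleaving());
    }
}
```

Running {{main}} prints {{assertion tripped: true}}. If this picture matches the real code path, one option might be to have the test helper tolerate an empty list, and another to hold a single lock across both steps; either is only a suggestion for discussion, not a tested fix.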