[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186974#comment-15186974 ]
Walter Su commented on HDFS-9822:
---------------------------------

bq. I am still a little confused how this error happens.

Me too. I don't think we have found the right cause.

bq. But if there are same block group entry exists in different queue..

No two queues can contain the same BG; the update(..) logic is correct. Nor can a single queue contain the same item twice. The queue is a HashSet.

My pure guess is that it's caused by a race condition. We have a guard at
{code}
// BlockManager#scheduleReconstruction(..)
if (block.isStriped()) {
  if (pendingNum > 0) {
    // Wait the previous reconstruction to finish.
    return null;
  }
{code}
which is inside the namesystem lock. But before the {{ReplicationMonitor}} thread reaches {{validateReconstructionWork(..)}}, it releases the lock, so it's possible for the junit thread to get the lock in between. If both threads pass the guard, eventually one of them will fail the assert.

> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped block at the same time
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9822
>                 URL: https://issues.apache.org/jira/browse/HDFS-9822
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Rakesh R
>         Attachments: HDFS-9822-001.patch, HDFS-9822-002.patch
>
>
> Found the following AssertionError in
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
> 	at java.lang.Thread.run(Thread.java:745)
> 	at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
> 	at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
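
The race guessed at in the comment above (both threads pass the guard under the lock, the lock is released, and only then does each thread validate) might be sketched roughly as follows. This is not BlockManager code: the class {{ReconstructionRaceSketch}}, the methods {{passesGuard()}} and {{validateAndRecord()}}, and the single {{pendingNum}} counter are hypothetical stand-ins, and the two "threads" are interleaved by hand in {{main}} so the outcome is deterministic. Returning false from {{validateAndRecord()}} stands in for the failed assert.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of the suspected check-then-act race.
class ReconstructionRaceSketch {
    // Stand-in for the namesystem lock (assumption: a plain ReentrantLock).
    private final ReentrantLock lock = new ReentrantLock();
    // Stand-in for the pending reconstruction count of one striped block group.
    private int pendingNum = 0;

    // Models the guard in scheduleReconstruction(..): checked under the lock,
    // but the lock is released again before any work is recorded.
    public boolean passesGuard() {
        lock.lock();
        try {
            return pendingNum == 0;
        } finally {
            lock.unlock();
        }
    }

    // Models the later validation step: the lock is re-acquired and the
    // condition is checked a second time. Returns false where the real code
    // would trip "Should wait the previous reconstruction to finish".
    public boolean validateAndRecord() {
        lock.lock();
        try {
            if (pendingNum > 0) {
                return false;
            }
            pendingNum++; // record the scheduled reconstruction
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ReconstructionRaceSketch bg = new ReconstructionRaceSketch();

        // Both callers pass the guard before either records its work.
        boolean monitorPassed = bg.passesGuard(); // ReplicationMonitor thread
        boolean junitPassed = bg.passesGuard();   // junit thread

        // Each caller then validates; only the first can succeed.
        boolean firstOk = bg.validateAndRecord();
        boolean secondOk = bg.validateAndRecord();

        System.out.println("both passed guard: " + (monitorPassed && junitPassed));
        System.out.println("first validate ok: " + firstOk);
        System.out.println("second validate ok: " + secondOk);
    }
}
```

In this interleaving the second validation fails even though both callers saw {{pendingNum == 0}} at the guard. In the sketch, holding the lock across both steps, or recording the pending work in the same critical section as the guard, would close the window; whether that applies to the real code depends on whether the guess above is correct.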