[ https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178302#comment-15178302 ]
Rakesh R commented on HDFS-9822:
--------------------------------

Thanks a lot [~drankye] for the interest and useful comments.

bq. 1. Why multiple reconstruction tasks for the same striped block or block group are figured out and put into queues?

I came across this situation while testing corrupted striped blocks. It is not a straightforward scenario and unfortunately it occurred only once in my env. Please see the logs below, where the same block group {{9223372036854775792_1001}} is added to two different priority queues. Initially the block {{9223372036854775792_1001}} is added to neededReplications at {{priority queue 2}}. Later, while processing a redundant addStoredBlock request, the same block group {{9223372036854775792_1001}} is added to neededReplications at {{priority queue 1}}.

{code}
2016-03-03 11:42:42,544 DEBUG BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-9223372036854775792 added as corrupt on 127.0.0.1:7517 by null because TEST
2016-03-03 11:42:42,545 DEBUG org.apache.hadoop.hdfs.StateChange: UnderReplicationBlocks.update blk_-9223372036854775792_1001 curReplicas 8 curExpectedReplicas 9 oldReplicas 9 oldExpectedReplicas 9 curPri 2 oldPri 3
2016-03-03 11:42:42,545 DEBUG BlockStateChange: BLOCK* NameSystem.UnderReplicationBlock.update: blk_-9223372036854775792_1001 has only 8 replicas and needs 9 replicas so is added to neededReplications at priority level 2
{code}

{code}
2016-03-03 11:42:42,920 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk_-9223372036854775792_1001 on node 127.0.0.1:7517 size 786432
2016-03-03 11:42:42,921 DEBUG org.apache.hadoop.hdfs.StateChange: UnderReplicationBlocks.update blk_-9223372036854775792_1001 curReplicas 7 curExpectedReplicas 9 oldReplicas 7 oldExpectedReplicas 9 curPri 1 oldPri 1
2016-03-03 11:42:42,921 DEBUG BlockStateChange: BLOCK* NameSystem.UnderReplicationBlock.update: blk_-9223372036854775792_1001 has only 7 replicas and needs 9 replicas so is added to neededReplications at priority level 1
{code}

bq. 2. Is it possible to maintain a separate queue for striped block groups, where a block group is ensured to be put into exactly once

As we know, both contiguous and striped under-replicated blocks can exist in the system at the same time. Currently, when choosing under-replicated blocks for reconstruction, there is a natural ordering across both contiguous and striped blocks. Providing a separate queue is an interesting idea. Just a quick thought: with a separate queue for the striped blocks, I'm wondering how efficiently we would be able to maintain the ordering between the under-replicated contiguous and striped blocks.

> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped
> block at the same time
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9822
>                 URL: https://issues.apache.org/jira/browse/HDFS-9822
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Rakesh R
>         Attachments: HDFS-9822-001.patch
>
>
> Found the following AssertionError in
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
>         at java.lang.Thread.run(Thread.java:745)
>         at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
>         at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
>         at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
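To illustrate the "exactly once" idea from point 2 of the comment above: a minimal toy sketch, not Hadoop's actual UnderReplicatedBlocks implementation. It assumes a hypothetical `PriorityQueues` class with a level-indexed list of sets; the key move is recording which level a block currently sits in, so that an update first removes it from its recorded old queue instead of recomputing an "old priority" from replica counts that may be stale (as in the logs, where curPri/oldPri disagree across the two reports).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Toy model (all names illustrative, not Hadoop's real classes) showing how
// tracking each block's current priority level makes insertion exactly-once,
// even when the same block is reported at two different priorities.
public class PriorityQueues {
  static final int LEVELS = 5;
  private final List<LinkedHashSet<String>> queues = new ArrayList<>();
  // blockId -> priority level the block currently sits in
  private final Map<String, Integer> location = new HashMap<>();

  public PriorityQueues() {
    for (int i = 0; i < LEVELS; i++) {
      queues.add(new LinkedHashSet<>());
    }
  }

  // Exactly-once insert/update: remove the block from its recorded old
  // queue first, rather than trusting a recomputed "old priority".
  public void update(String blockId, int newPriority) {
    Integer old = location.get(blockId);
    if (old != null) {
      queues.get(old).remove(blockId);
    }
    queues.get(newPriority).add(blockId);
    location.put(blockId, newPriority);
  }

  // Counts how many priority queues currently contain the block.
  public int countOccurrences(String blockId) {
    int n = 0;
    for (LinkedHashSet<String> q : queues) {
      if (q.contains(blockId)) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) {
    PriorityQueues q = new PriorityQueues();
    q.update("blk_-9223372036854775792_1001", 2); // first report: priority 2
    q.update("blk_-9223372036854775792_1001", 1); // redundant report: priority 1
    // The block sits in exactly one queue despite the two updates.
    System.out.println(q.countOccurrences("blk_-9223372036854775792_1001")); // prints 1
  }
}
```

With an update path like the buggy one in the logs (add to the new queue, remove from a recomputed old one), a stale old-priority calculation leaves the block in both queues and two reconstruction tasks can be scheduled; the recorded-location approach sidesteps that, though as noted in the comment it says nothing about preserving the ordering between contiguous and striped blocks.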