[ 
https://issues.apache.org/jira/browse/HDFS-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178302#comment-15178302
 ] 

Rakesh R commented on HDFS-9822:
--------------------------------

Thanks a lot [~drankye] for your interest and useful comments.

bq. 1. Why multiple reconstruction tasks for the same striped block or block 
group are figured out and put into queues?
I came across this situation while testing corrupted striped blocks. It is 
not a straightforward scenario, and unfortunately it occurred only once in my 
environment. Please see the logs below: the same block group 
{{blk_-9223372036854775792_1001}} is added to two different priority queues. 
Initially, the block group is added to neededReplications at 
{{priority queue 2}}. The second time, while the addStoredBlock request is 
being handled, the same block group is added to neededReplications at 
{{priority queue 1}}.

{code}
2016-03-03 11:42:42,544 DEBUG BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_-9223372036854775792 added as corrupt 
on 127.0.0.1:7517 by null  because TEST
2016-03-03 11:42:42,545 DEBUG org.apache.hadoop.hdfs.StateChange: 
UnderReplicationBlocks.update blk_-9223372036854775792_1001 curReplicas 8 
curExpectedReplicas 9 oldReplicas 9 oldExpectedReplicas  9 curPri  2 oldPri  3
2016-03-03 11:42:42,545 DEBUG BlockStateChange: BLOCK* 
NameSystem.UnderReplicationBlock.update: blk_-9223372036854775792_1001 has only 
8 replicas and needs 9 replicas so is added to neededReplications at priority 
level 2
{code}

{code}
2016-03-03 11:42:42,920 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant 
addStoredBlock request received for blk_-9223372036854775792_1001 on node 
127.0.0.1:7517 size 786432
2016-03-03 11:42:42,921 DEBUG org.apache.hadoop.hdfs.StateChange: 
UnderReplicationBlocks.update blk_-9223372036854775792_1001 curReplicas 7 
curExpectedReplicas 9 oldReplicas 7 oldExpectedReplicas  9 curPri  1 oldPri  1
2016-03-03 11:42:42,921 DEBUG BlockStateChange: BLOCK* 
NameSystem.UnderReplicationBlock.update: blk_-9223372036854775792_1001 has only 
7 replicas and needs 9 replicas so is added to neededReplications at priority 
level 1
{code}
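
One possible reading of the two log excerpts above, sketched as a simplified 
model. This is not the actual UnderReplicatedBlocks code: the names 
PriorityQueuesModel, update and queuesContaining are hypothetical. The sketch 
assumes that an update removes the block only from the queue matching the 
recomputed old priority. Under that assumption, when the recomputed old 
priority (1) differs from the queue the block really sits in (2), the block 
ends up in both queues.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model of a set of priority queues; not Hadoop code.
class PriorityQueuesModel {
    private final List<Set<String>> queues = new ArrayList<>();

    PriorityQueuesModel(int levels) {
        for (int i = 0; i < levels; i++) {
            queues.add(new LinkedHashSet<>());
        }
    }

    // Assumption: the block is removed only from the queue at oldPri, where
    // oldPri is recomputed by the caller rather than looked up.
    void update(String block, int oldPri, int curPri) {
        if (oldPri != curPri) {
            queues.get(oldPri).remove(block);
        }
        queues.get(curPri).add(block);
    }

    // Returns every priority level whose queue currently holds the block.
    List<Integer> queuesContaining(String block) {
        List<Integer> levels = new ArrayList<>();
        for (int i = 0; i < queues.size(); i++) {
            if (queues.get(i).contains(block)) {
                levels.add(i);
            }
        }
        return levels;
    }

    public static void main(String[] args) {
        PriorityQueuesModel model = new PriorityQueuesModel(5);
        String block = "blk_-9223372036854775792_1001";
        model.update(block, 3, 2); // first excerpt: oldPri 3, curPri 2
        model.update(block, 1, 1); // second excerpt: oldPri 1, curPri 1
        System.out.println(model.queuesContaining(block)); // prints [1, 2]
    }
}
```

In this model the second call never touches queue 2, because it only compares 
oldPri with curPri and both are 1.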

bq. 2. Is it possible to maintain a separate queue for striped block groups, 
where a block group is ensured to be put into exactly once
As we know, both contiguous and striped under-replicated blocks can exist in 
the system at the same time. Currently, when under-replicated blocks are 
chosen for reconstruction, there is a natural ordering across both contiguous 
and striped blocks. Providing a separate queue is an interesting idea. Just a 
quick thought: with a separate queue for the striped blocks, I'm wondering 
how efficiently we would be able to maintain the ordering between the 
under-replicated contiguous and striped blocks.
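
To make the "exactly once" requirement concrete, here is a minimal sketch, 
assuming a hypothetical index that records the priority level each block 
currently occupies. SingleMembershipQueues and its methods are illustrative 
names only; they are not part of HDFS or of the attached patch. The sketch 
keeps one shared set of priority levels for contiguous and striped blocks, so 
the ordering between them is unchanged, and it moves a block to its new level 
instead of adding a second copy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: priority queues in which a block lives in one level only.
class SingleMembershipQueues {
    private final List<Set<String>> queues = new ArrayList<>();
    // Index from block to the level it currently occupies.
    private final Map<String, Integer> levelOf = new HashMap<>();

    SingleMembershipQueues(int levels) {
        for (int i = 0; i < levels; i++) {
            queues.add(new LinkedHashSet<>());
        }
    }

    // Places the block at the given level, removing it from its actual
    // previous level (taken from the index, not recomputed) if that differs.
    void put(String block, int priority) {
        Integer previous = levelOf.put(block, priority);
        if (previous != null && previous != priority) {
            queues.get(previous).remove(block);
        }
        queues.get(priority).add(block);
    }

    // Counts how many levels hold the block; always 0 or 1 in this sketch.
    int copies(String block) {
        int count = 0;
        for (Set<String> queue : queues) {
            if (queue.contains(block)) {
                count++;
            }
        }
        return count;
    }

    // Returns the level recorded for the block, or null if it is not queued.
    Integer level(String block) {
        return levelOf.get(block);
    }

    public static void main(String[] args) {
        SingleMembershipQueues q = new SingleMembershipQueues(5);
        String block = "blk_-9223372036854775792_1001";
        q.put(block, 2);
        q.put(block, 1);
        System.out.println(q.copies(block) + " copy at level " + q.level(block));
    }
}
```

The extra map costs one entry per queued block, which is the trade-off for 
not depending on a recomputed old priority.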

> Erasure Coding: Avoids scheduling multiple reconstruction tasks for a striped 
> block at the same time
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9822
>                 URL: https://issues.apache.org/jira/browse/HDFS-9822
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Rakesh R
>         Attachments: HDFS-9822-001.patch
>
>
> Found the following AssertionError in 
> https://builds.apache.org/job/PreCommit-HDFS-Build/14501/testReport/org.apache.hadoop.hdfs.server.namenode/TestReconstructStripedBlocks/testMissingStripedBlockWithBusyNode2/
> {code}
> AssertionError: Should wait the previous reconstruction to finish
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.validateReconstructionWork(BlockManager.java:1680)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1536)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1472)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4229)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4100)
>       at java.lang.Thread.run(Thread.java:745)
>       at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126)
>       at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4119)
>       at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
