[ 
https://issues.apache.org/jira/browse/HDFS-9381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032807#comment-15032807
 ] 

Zhe Zhang commented on HDFS-9381:
---------------------------------

Thanks Jing for the comment.

Let's consider this case:
# Cluster has 100 nodes
# DN_1 and DN_2 failed
# They are on different racks
# They happen to share 1000 striped blocks and 1000 contiguous blocks (the 
calculated numbers scale easily to n x 1000 blocks). So 2000 striped internal 
blocks and 2000 contiguous block replicas are missing.

So in each iteration ReplicationMonitor tries to pick up 200 items. Without the 
change, that will be 100 striped and 100 contiguous on average. Assuming EC 
recovery work takes longer than 3 seconds ({{replicationRecheckInterval}}), 
the 2nd iteration will pick up about 5 invalid striped items (ones already 
being recovered). If EC recovery work takes long enough, the 3rd round will 
pick up about 10 (2/18) invalid striped items, and the 4th round about 18. This 
way the replication work for the lost contiguous replicas will take about 
20 x 3 = 60 seconds to be distributed to DNs.

With the change, the 2nd round will pick up about 95 striped and 105 contiguous 
items, the 3rd round about 110 contiguous items, and so on. It's tricky to be 
very accurate, but it seems we can save a few 3-second cycles.
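The per-round numbers above can be sanity-checked with a quick back-of-the-envelope simulation (a rough sketch, not HDFS code, under the simplifying assumptions that the monitor picks 200 items per round, picks are spread proportionally over the queue, and no EC recovery finishes during the window):

```java
// Toy model of the "without the change" case: striped blocks already being
// recovered stay in neededReplications, so later rounds re-pick them and the
// pick returns null ("invalid" picks that waste slots).
public class InvalidPickEstimate {
  public static void main(String[] args) {
    double stripedInQueue = 2000;   // striped items stay queued while pending
    double stripedPending = 0;      // striped items already being recovered
    double contiguous = 2000;       // contiguous replicas still to schedule
    for (int round = 1; round <= 4; round++) {
      double queue = stripedInQueue + contiguous;
      double stripedPicks = 200 * stripedInQueue / queue;
      double contiguousPicks = 200 - stripedPicks;
      // picks that hit an item already in pendingReplications return null
      double invalid = stripedPicks * stripedPending / stripedInQueue;
      System.out.printf("round %d: ~%.0f invalid striped picks%n", round, invalid);
      stripedPending += stripedPicks - invalid;
      contiguous -= contiguousPicks;
    }
  }
}
```

Under these assumptions the simulation gives roughly 5, 10 and 16 invalid picks in rounds 2-4, which lines up with the estimates above.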

[~umamaheswararao] Does the example itself make sense to you? If so, how should 
we calculate the saving in locking time?

bq. it is also possible that because of the longer processing time, there is 
higher chance for the striped blocks to be updated in the UC queue before being 
processed by the replication monitor for the first time
I'm not fully following the above. Jing, do you mind elaborating a little bit?


> When same block came for replication for Striped mode, we can move that block 
> to PendingReplications
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9381
>                 URL: https://issues.apache.org/jira/browse/HDFS-9381
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding, namenode
>    Affects Versions: 3.0.0
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-9381-02.patch, HDFS-9381-03.patch, 
> HDFS-9381.00.patch, HDFS-9381.01.patch
>
>
> I noticed that currently we just return null if the block already exists in 
> pendingReplications in the replication flow for striped blocks.
> {code}
> if (block.isStriped()) {
>       if (pendingNum > 0) {
>         // Wait the previous recovery to finish.
>         return null;
>       }
> {code}
>  Here, if we just return null and neededReplications contains only a few 
> blocks (by default, fewer than numLiveNodes * 2), then the same blocks can 
> be picked again from neededReplications in the next loop, since we are not 
> removing the element from neededReplications. Since this replication 
> processing needs to take the FSNamesystem lock, we may spend some time 
> unnecessarily in every loop.
> So my suggestion/improvement is:
>  Instead of just returning null, how about incrementing pendingReplications 
> for this block and removing it from neededReplications? Another point to 
> consider here: to add into pendingReplications, we generally need a target, 
> which is the node to which we issued the replication command. Later, after 
> replication succeeds and the DN reports it, the block will be removed from 
> pendingReplications by the NN in addBlock.
>  Since this is a newly picked block from neededReplications, we would not 
> have selected a target yet. So which target should be passed to 
> pendingReplications if we add this block? One option I am thinking of is to 
> pass srcNode itself as the target for this special condition. If the block 
> is really missing, srcNode will never report it, so the block will not be 
> removed from pendingReplications; when the entry times out, the block will 
> be considered for replication again, and at that time the regular 
> replication flow will find an actual target to replicate to.
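The timed-out fallback described in the quoted proposal can be modeled with a small self-contained toy (a sketch only, not HDFS code; the class, fields, and timeout value here are hypothetical stand-ins for pendingReplications / neededReplications):

```java
import java.util.*;

// Toy model of the proposed flow: a striped block whose recovery is already
// pending moves from neededReplications into a pending map (with srcNode as a
// placeholder target); if no DN ever reports it, the entry times out and the
// block falls back into neededReplications for a fresh attempt.
public class PendingTimeoutModel {
  static final long TIMEOUT_MS = 5 * 60 * 1000;   // assumed pending timeout

  final Set<String> neededReplications = new LinkedHashSet<>();
  final Map<String, Long> pendingSince = new HashMap<>(); // block -> park time

  // ReplicationMonitor picked a striped block that already has recovery work
  // in flight (the case that currently just returns null): park it.
  void parkInPending(String block, long now) {
    neededReplications.remove(block);
    pendingSince.put(block, now);   // srcNode stands in as the target
  }

  // A DN reported the block: recovery succeeded, drop the pending entry.
  void blockReported(String block) {
    pendingSince.remove(block);
  }

  // Periodic scan: expired entries fall back into neededReplications.
  void checkTimeouts(long now) {
    Iterator<Map.Entry<String, Long>> it = pendingSince.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> e = it.next();
      if (now - e.getValue() >= TIMEOUT_MS) {
        neededReplications.add(e.getKey());
        it.remove();
      }
    }
  }
}
```

While parked, the block no longer burns a pick slot in each ReplicationMonitor round; a really-missing block reappears in neededReplications once the timeout fires, as the proposal describes.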



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
