[ https://issues.apache.org/jira/browse/HDFS-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143525#comment-13143525 ]

Nathan Roberts commented on HDFS-2537:
--------------------------------------



IIRC, the basic algorithm for this is:
{code}
Every dfs.replication-interval (default 3 seconds):
    Choose 2 * nr_live_nodes under-replicated blocks as datanode work
    Prune this list by making sure no datanode has more than
        dfs.max-repl-streams (default 2) active
Work is distributed to datanodes as they heartbeat in
{code}
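As a rough sketch of that loop (the class and type names below are illustrative 
stand-ins, not the actual FSNamesystem/BlockManager code), the per-pass quota and 
per-node stream cap would look something like:

{code}
import java.util.*;

/** Illustrative sketch only; UnderReplicatedBlock and DatanodeInfo are
 *  simplified stand-ins, not the real block-management types. */
class ReplicationScheduler {
  static final int MAX_REPL_STREAMS = 2;        // dfs.max-repl-streams

  static class DatanodeInfo {
    final String name;
    final int activeReplStreams;                // replication streams already in flight
    DatanodeInfo(String name, int active) { this.name = name; this.activeReplStreams = active; }
  }

  static class UnderReplicatedBlock {
    final long blockId;
    final DatanodeInfo source;                  // chosen replication source
    UnderReplicatedBlock(long id, DatanodeInfo src) { blockId = id; source = src; }
  }

  /** Called every dfs.replication-interval (default 3 seconds). */
  List<UnderReplicatedBlock> computeWork(List<UnderReplicatedBlock> queue,
                                         int numLiveNodes) {
    int quota = 2 * numLiveNodes;               // 2 * nr_live_nodes blocks per pass
    List<UnderReplicatedBlock> chosen = new ArrayList<>();
    Map<String, Integer> streams = new HashMap<>();

    for (UnderReplicatedBlock b : queue) {
      if (chosen.size() >= quota) break;
      int inFlight = streams.getOrDefault(b.source.name, b.source.activeReplStreams);
      if (inFlight >= MAX_REPL_STREAMS) continue; // prune: source node is at its cap
      streams.put(b.source.name, inFlight + 1);
      chosen.add(b);                              // handed out as that node heartbeats in
    }
    return chosen;
  }
}
{code}

The problem is visible in this shape: both the quota and the per-node cap are 
fixed constants, so the re-replication rate can't rise when the cluster is 
otherwise idle.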

It seems like more responsibility needs to be given to the datanodes to properly 
pace this activity. The datanode probably needs to be told how much resource it 
can use for re-replication, adjusted during heartbeats (much like how balancer 
bandwidth is adjusted). The namenode also has to be able to give datanodes more 
work during each individual heartbeat so the cluster can re-replicate at the 
prescribed pace. 
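
For the datanode side, a minimal sketch of what "adjusted during heartbeats" 
could mean, assuming a hypothetical per-datanode re-replication bandwidth budget 
(the class and method names are made up for illustration, in the spirit of the 
existing balancer-bandwidth adjustment rather than an actual DataNode API):

{code}
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical datanode-side throttle whose budget the namenode can raise or
 *  lower in any heartbeat response, similar in spirit to how balancer
 *  bandwidth is pushed out. */
class ReplicationThrottle {
  // Bytes per second the datanode may spend on re-replication traffic.
  private final AtomicLong maxBytesPerSec = new AtomicLong(10L * 1024 * 1024);

  /** Namenode adjusts the budget via the heartbeat response. */
  void onHeartbeatResponse(long newBytesPerSec) {
    if (newBytesPerSec > 0) {
      maxBytesPerSec.set(newBytesPerSec);
    }
  }

  /** Sleep just long enough that bytesSent stays under the current budget. */
  void throttle(long bytesSent, long elapsedMs) throws InterruptedException {
    long budget = maxBytesPerSec.get();
    long expectedMs = bytesSent * 1000 / budget;
    if (expectedMs > elapsedMs) {
      Thread.sleep(expectedMs - elapsedMs);
    }
  }
}
{code}

With something like this in place, the namenode could hand out larger batches of 
block transfers per heartbeat and rely on the datanode's budget, rather than a 
static stream cap, to keep the load in check.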

I think it would also be beneficial if the re-replication work could be 
scheduled as a coordinated piece of work instead of a slew of individual blocks 
that each need re-replicating. For example, on a 1000-node cluster where each 
node holds 100,000 blocks, ideally each node would only need to connect to one 
other node and stream its share of the under-replicated blocks to it ("one" 
isn't exactly precise due to replication rules, but you get the point). Instead 
of 100,000 random block re-replications, the work is coordinated as what can be 
moved from node A to node B while still obeying replication rules. This would 
let node A do almost all of its work with exactly one other node (possibly 
on-rack), which in turn allows for other efficiencies (connection caching, a 
single stream of blocks, etc.).
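
A sketch of what that coordination could look like, again with made-up types and 
a deliberately simplified target choice (a real plan also has to enforce the 
placement/rack rules):

{code}
import java.util.*;

/** Hypothetical planner that groups under-replicated blocks into (source, target)
 *  batches so each pair of nodes needs only one connection and one stream,
 *  instead of scheduling every block independently. */
class BulkReplicationPlanner {
  static class Transfer {
    final long blockId; final String source; final String target;
    Transfer(long id, String s, String t) { blockId = id; source = s; target = t; }
  }

  /** Pick one target per source node and batch all of its blocks onto that pair. */
  Map<String, List<Transfer>> plan(Map<String, List<Long>> blocksBySource,
                                   List<String> candidateTargets) {
    Map<String, List<Transfer>> batches = new LinkedHashMap<>();
    int i = 0;
    for (Map.Entry<String, List<Long>> e : blocksBySource.entrySet()) {
      String source = e.getKey();
      // Simplified round-robin choice; a real policy must also obey replication
      // rules, so some blocks would end up needing a different target.
      String target = candidateTargets.get(i++ % candidateTargets.size());
      String pair = source + "->" + target;
      for (long blockId : e.getValue()) {
        batches.computeIfAbsent(pair, k -> new ArrayList<>())
               .add(new Transfer(blockId, source, target));
      }
    }
    return batches;   // each batch can be streamed over a single cached connection
  }
}
{code}

Each batch then maps naturally onto the connection-caching and single-stream 
efficiencies mentioned above.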
                
> re-replicating under replicated blocks should be more dynamic
> -------------------------------------------------------------
>
>                 Key: HDFS-2537
>                 URL: https://issues.apache.org/jira/browse/HDFS-2537
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.20.205.0, 0.23.0
>            Reporter: Nathan Roberts
>
> When a node fails or is decommissioned, a large number of blocks become 
> under-replicated. Since re-replication work is distributed, the hope would be 
> that all blocks could be restored to their desired replication factor in very 
> short order. This doesn't happen though because the load the cluster is 
> willing to devote to this activity is mostly static (controlled by 
> configuration variables). Since it's mostly static, the rate has to be set 
> conservatively to avoid overloading the cluster with replication work.
> This problem is especially noticeable when you have lots of small blocks. It 
> can take many hours to re-replicate the blocks that were on a node while the 
> cluster is mostly idle. 
