[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920441#comment-16920441 ]

He Xiaoqiao commented on HDFS-13157:
------------------------------------

Thanks [~belugabehr] for the great work and detailed analysis. I believe this 
issue is more obvious in a Federation setup. +1 for the deep dig by 
[~sodonnell]. We could tune the parameters [blocksReplWorkMultiplier, 
maxReplicationStreams, maxReplicationStreamsHardLimit] per namespace, but 
decommissioning a node is commonly triggered from the shell for every 
namespace at once, and if the node reports to multiple namespaces, the 
different NameNodes all send replication commands at the same time. The load 
on a decommission-in-progress node is then out of control. I have seen both 
network and single-disk I/O bottlenecks.
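For reference, those three fields map to hdfs-site.xml keys that can be set 
independently for each NameNode; a minimal sketch of overriding them 
programmatically (the values here are illustrative only, not recommendations):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ReplicationThrottles {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // blocksReplWorkMultiplier: blocks scheduled per live DataNode per heartbeat round.
    conf.setInt("dfs.namenode.replication.work.multiplier.per.iteration", 2);
    // maxReplicationStreams: concurrent replication streams per DataNode for normal work.
    conf.setInt("dfs.namenode.replication.max-streams", 2);
    // maxReplicationStreamsHardLimit: hard cap that also covers highest-priority work.
    conf.setInt("dfs.namenode.replication.max-streams-hard-limit", 4);
    System.out.println(conf.getInt("dfs.namenode.replication.max-streams", -1));
  }
}
{code}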
I believe the current parameters are enough to solve the network bottleneck. 
For the single-disk I/O bottleneck, +1 for updating 
DatanodeDescriptor#BlockIterator to iterate blocks across alternating disks 
rather than draining one disk at a time.
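To make that concrete, a minimal sketch of such an interleaving iterator (a 
standalone stand-in, not the actual DatanodeDescriptor#BlockIterator API):

{code:java}
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

public class RoundRobinBlockIterator<B> implements Iterator<B> {
  // One cursor per storage volume, rotated so consecutive blocks come from different disks.
  private final Queue<Iterator<B>> volumes = new ArrayDeque<>();

  public RoundRobinBlockIterator(List<? extends Iterable<B>> perVolumeBlocks) {
    for (Iterable<B> volume : perVolumeBlocks) {
      Iterator<B> it = volume.iterator();
      if (it.hasNext()) {
        volumes.add(it);
      }
    }
  }

  @Override
  public boolean hasNext() {
    return !volumes.isEmpty();
  }

  @Override
  public B next() {
    Iterator<B> it = volumes.poll();
    if (it == null) {
      throw new NoSuchElementException();
    }
    B block = it.next();
    if (it.hasNext()) {
      volumes.add(it); // send this volume to the back of the rotation
    }
    return block;
  }

  public static void main(String[] args) {
    List<List<String>> blocksByVolume = List.of(
        List.of("blk_1", "blk_2"), List.of("blk_3"), List.of("blk_4", "blk_5"));
    new RoundRobinBlockIterator<>(blocksByVolume)
        .forEachRemaining(System.out::println); // blk_1, blk_3, blk_4, blk_2, blk_5
  }
}
{code}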
Another thought: we should not dispatch write operations to a 
decommission-in-progress node, and we should decrease its read priority to 
the lowest, just as for a decommissioned node. Then high load on 
decommissioning nodes would not affect clients or the cluster at all.
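A minimal sketch of the read-ordering part, with a hypothetical Replica type 
standing in for the NameNode's real located-block sorting:

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DecommissionReadOrdering {
  enum AdminState { IN_SERVICE, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

  // Hypothetical stand-in for a replica location of a block.
  static class Replica {
    final String datanode;
    final AdminState state;
    Replica(String datanode, AdminState state) {
      this.datanode = datanode;
      this.state = state;
    }
  }

  // Rank IN_SERVICE replicas first and treat DECOMMISSION_INPROGRESS like
  // DECOMMISSIONED, so clients read from the loaded node only as a last resort.
  static final Comparator<Replica> READ_ORDER =
      Comparator.comparingInt(r -> r.state == AdminState.IN_SERVICE ? 0 : 1);

  public static void main(String[] args) {
    List<Replica> replicas = new ArrayList<>();
    replicas.add(new Replica("dn2", AdminState.DECOMMISSION_INPROGRESS));
    replicas.add(new Replica("dn1", AdminState.IN_SERVICE));
    replicas.sort(READ_ORDER);
    System.out.println(replicas.get(0).datanode); // dn1
  }
}
{code}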
This discussion does not cover RAID or the scenarios [~zhangchen] mentioned 
above.

> Do Not Remove Blocks Sequentially During Decommission 
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through 
> each data volume and schedules each block to be replicated elsewhere. The 
> net effect is that during a decommission, all of the DataNode transfer 
> threads slam on a single volume until it is cleaned out. At which point, 
> they all slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.
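
For illustration, a minimal sketch of the randomization the description asks 
for, assuming the node's blocks were materialized into a flat list in volume 
order (the real iterator walks storages lazily):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffledDecommissionOrder {
  public static void main(String[] args) {
    // Hypothetical block IDs in the order the sequential iterator would yield them.
    List<String> blocks = new ArrayList<>(List.of(
        "vol1/blk_1", "vol1/blk_2", "vol1/blk_3",
        "vol2/blk_4", "vol2/blk_5",
        "vol3/blk_6"));
    // Shuffle so replication work spreads across volumes instead of draining vol1 first.
    Collections.shuffle(blocks);
    blocks.forEach(System.out::println);
  }
}
{code}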


