[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920441#comment-16920441 ]
He Xiaoqiao commented on HDFS-13157:
------------------------------------

Thanks [~belugabehr] for the great work and detailed analysis. I believe this issue is even more pronounced in a Federation setup. +1 for the deep dive by [~sodonnell]. We can tune the parameters [blocksReplWorkMultiplier, maxReplicationStreams, maxReplicationStreamsHardLimit] per namespace, but decommissioning a node is usually triggered from the shell for all namespaces at the same time, and if the node reports to multiple namespaces, the different NameNodes also send replication commands simultaneously. The load on the decommission-in-progress node is then out of control. I have met both network and single-disk I/O bottlenecks. I believe the current parameters are sufficient to address the network bottleneck. For the single-disk I/O bottleneck, +1 for updating DatanodeDescriptor#BlockIterator to iterate blocks across alternating disks rather than draining one disk at a time. Another thought: we should not dispatch write operations to decommission-in-progress nodes, and we should lower their read priority to the lowest, just as for decommissioned nodes; then high load on decommissioning nodes would not affect clients or the cluster at all. This discussion does not cover RAID or the scenarios [~zhangchen] mentioned above.
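To illustrate the "iterate blocks across alternating disks" idea, here is a minimal, hypothetical sketch of round-robin interleaving over per-volume block lists. The class and method names are illustrative only, not the actual DatanodeDescriptor#BlockIterator code; the point is just that scheduling cycles across volumes instead of draining one volume before the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not the real HDFS implementation: given per-volume
// block lists, emit blocks round-robin across volumes so replication reads
// during decommission spread I/O over all disks instead of one at a time.
public class RoundRobinBlockIterator {

    static <T> List<T> interleave(List<List<T>> perVolume) {
        List<Iterator<T>> its = new ArrayList<>();
        for (List<T> vol : perVolume) {
            its.add(vol.iterator());
        }
        List<T> out = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            // One pass = at most one block from each volume.
            for (Iterator<T> it : its) {
                if (it.hasNext()) {
                    out.add(it.next());
                    progress = true;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three volumes with differing block counts.
        List<List<String>> vols = Arrays.asList(
            Arrays.asList("a1", "a2", "a3"),
            Arrays.asList("b1", "b2"),
            Arrays.asList("c1"));
        // Sequential iteration would schedule a1, a2, a3 first, hammering
        // one disk; interleaving yields a1, b1, c1, a2, b2, a3 instead.
        System.out.println(interleave(vols));
    }
}
```

Randomizing the combined list (as the issue description proposes) achieves a similar spread on average; round-robin gives a stricter per-pass guarantee of one block per volume.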
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>
> From what I understand of [DataNode
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
> it appears that all the blocks are scheduled for removal _in order_. I'm
> not 100% sure what the ordering is exactly, but I think it loops through each
> data volume and schedules each block to be replicated elsewhere. The net
> effect is that during a decommission, all of the DataNode transfer threads
> slam on a single volume until it is cleaned out. At that point, they all
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution
> across all volumes when decommissioning a node.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org