[ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163589#comment-14163589 ]
Hudson commented on HDFS-7128:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1920 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1920/])
HDFS-7128. Decommission slows way down when it gets towards the end. Contributed by Ming Ma. (cnauroth: rev 9b8a35aff6d4bd7bb066ce01fa63a88fa49245ee)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestUnderReplicatedBlocks.java

> Decommission slows way down when it gets towards the end
> ---------------------------------------------------------
>
>                 Key: HDFS-7128
>                 URL: https://issues.apache.org/jira/browse/HDFS-7128
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 2.6.0
>
>         Attachments: HDFS-7128-2.patch, HDFS-7128.patch
>
>
> When we decommission nodes across different racks, the decommission process
> becomes really slow at the end, hardly making any progress. The problem is
> that some blocks reside on 3 decomm-in-progress DNs, and the way replications
> are scheduled causes unnecessary delay. Here is the analysis.
> When BlockManager schedules replication work from neededReplication, it first
> needs to pick the source node for each replication via chooseSourceDatanode.
> The core policies for picking the source node are:
> 1. Prefer a decomm-in-progress node.
> 2. Only pick nodes whose outstanding replication counts are below the
> thresholds dfs.namenode.replication.max-streams or
> dfs.namenode.replication.max-streams-hard-limit, depending on the replication
> priority.
> When we decommission nodes,
> 1. All the decommissioning nodes' blocks will be added to neededReplication.
> 2. BM will pick X blocks from neededReplication in each iteration. X is based
> on cluster size and a configurable multiplier. So if the cluster has 2000
> nodes, X will be around 4000.
> 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends
> up being chosen as the source node for all 4000 of them. The reason the
> outstanding replication thresholds don't kick in is the implementation of
> BlockManager.computeReplicationWorkForBlocks:
> node.getNumberOfBlocksToBeReplicated() remains zero during source selection
> because node.addBlockToBeReplicated is called only after the source-node
> selection loop (see the snippet and the sketch below).
> {noformat}
> ...
> synchronized (neededReplications) {
>   for (int priority = 0; priority < blocksToReplicate.size(); priority++) {
>     ...
>     chooseSourceDatanode
>     ...
>   }
> }
> ...
> for (ReplicationWork rw : work) {
>   ...
>   rw.srcNode.addBlockToBeReplicated(block, targets);
>   ...
> }
> {noformat}
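To make the ordering issue concrete, here is a minimal, self-contained sketch of the two-pass pattern in the snippet above. It uses a simplified stand-in class rather than the actual BlockManager/DatanodeDescriptor code, and the numbers (max-streams of 2, 4000 blocks) are only illustrative:

{noformat}
// Simplified stand-in for the scheduling order shown in the snippet above;
// not the actual Hadoop code. Class and field names are illustrative.
import java.util.ArrayList;
import java.util.List;

public class ReplicationSchedulingSketch {

    static class DatanodeStandIn {
        int blocksToBeReplicated = 0;       // plays the role of getNumberOfBlocksToBeReplicated()
        static final int MAX_STREAMS = 2;   // plays the role of dfs.namenode.replication.max-streams

        boolean underStreamLimit() {
            // The source-node check only sees the counter's current value.
            return blocksToBeReplicated < MAX_STREAMS;
        }
    }

    public static void main(String[] args) {
        DatanodeStandIn nodeA = new DatanodeStandIn();   // decomm-in-progress node A
        int blocksThisIteration = 4000;                  // roughly X for a ~2000-node cluster
        List<DatanodeStandIn> chosenSources = new ArrayList<>();

        // Pass 1: pick a source for every block. The counter is still zero for
        // every check, so node A passes the max-streams test 4000 times.
        for (int i = 0; i < blocksThisIteration; i++) {
            if (nodeA.underStreamLimit()) {
                chosenSources.add(nodeA);
            }
        }

        // Pass 2: only now is the per-node counter incremented, once per
        // scheduled block (the addBlockToBeReplicated step in the snippet).
        for (DatanodeStandIn src : chosenSources) {
            src.blocksToBeReplicated++;
        }

        // Prints 4000, far above the max-streams limit the first pass was
        // supposed to enforce.
        System.out.println("node A backlog: " + nodeA.blocksToBeReplicated);
    }
}
{noformat}

Because the counter is only updated in the second pass, nothing in the first pass ever disqualifies A within a single iteration; accounting for work scheduled earlier in the same iteration is one way to make the limit effective there.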
> 4. So several decomm-in-progress nodes A, B, and C each end up with
> node.getNumberOfBlocksToBeReplicated() around 4000.
> 5. If we assume each node can replicate 5 blocks per minute, it is going to
> take 800 minutes to finish replicating these blocks.
> 6. The pending replication timeout kicks in after 5 minutes. The items will be
> removed from the pending replication queue and added back to
> neededReplication, and the replications will then be handled by other source
> nodes of these blocks. But the blocks still remain in nodes A, B, and C's
> per-node replication queues (DatanodeDescriptor.replicateBlocks), so A, B, and
> C keep replicating these blocks, even though they might already have been
> replicated by other DNs after the timeout.
> 7. Some block's replicas exist only on A, B, and C, and the block sits at the
> end of A's replication queue. Even after the block's replication times out, no
> source node can be chosen because A, B, and C all have high pending
> replication counts. So we have to wait until A drains its replication queue,
> even though the items in that queue have already been taken care of by other
> nodes and are no longer under-replicated.
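As a rough illustration of steps 6 and 7, the following self-contained sketch models the timeout handoff with plain collections. The class name, collections, and block id are hypothetical stand-ins, not the real PendingReplicationBlocks or DatanodeDescriptor.replicateBlocks:

{noformat}
// Toy model of the timeout behavior in steps 6-7; the collections below are
// illustrative stand-ins, not Hadoop's actual data structures.
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class PendingTimeoutSketch {
    public static void main(String[] args) {
        // Per-node work queue, in the role of DatanodeDescriptor.replicateBlocks.
        Queue<String> nodeAQueue = new ArrayDeque<>();
        // Cluster-wide pending set, in the role of the pending replication queue.
        Set<String> pending = new HashSet<>();
        // Blocks still needing replicas, in the role of neededReplication.
        Queue<String> neededReplication = new ArrayDeque<>();

        String block = "blk_1001";   // hypothetical block id

        // Iteration 1: the block is scheduled with decommissioning node A as source.
        nodeAQueue.add(block);
        pending.add(block);

        // About 5 minutes later the pending entry times out: it is removed from
        // pending and put back into neededReplication so another source node
        // can be chosen for it.
        pending.remove(block);
        neededReplication.add(block);

        // Nothing removes the block from node A's own queue, though, so A will
        // still copy it when it eventually drains its backlog, even if the
        // block was already re-replicated elsewhere after the timeout.
        System.out.println("still queued on A: " + nodeAQueue.contains(block));            // true
        System.out.println("re-queued for a new source: " + neededReplication.contains(block)); // true
    }
}
{noformat}

Step 7 is the worst case of the same pattern: when the only replicas live on A, B, and C, the re-queued block cannot pick a new source at all and simply waits behind A's backlog.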