[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925523#comment-16925523 ]
Stephen O'Donnell commented on HDFS-13157:
------------------------------------------

> How is it handled, iterating through each DataNode, that a block is scheduled
> to be replicated onto a DataNode that will be decommissioned further down in
> the list?

The block manager takes care of this when allocating a new block target. Nodes that are in a decommissioning state will not be considered as a new target.

In the current decommissioning implementation, the nodes selected for decommissioning are processed very conservatively: for nodes with more than 500K blocks (dfs.namenode.decommission.blocks.per.interval), it will process one node and then sleep for 30 seconds before processing the next one. I believe this is to prevent locking the Namenode too often in close succession.

When processing a node, we really need to hold a lock while processing some unit of work. The work on the datanode is split by its storage volumes, and we use an iterator to process all the blocks on each volume. If you drop the lock part-way through that iteration, a block report or a file modification in HDFS can change the contents of the storage, and the iterator will then throw a ConcurrentModificationException. Therefore, interleaving many DNs for processing at the same time is tricky: each one needs an exclusive lock, and they will all be contending for it. If we drop and re-take the lock for each block, we will need to bookmark the iterator and handle ConcurrentModificationException, possibly frequently. There is also no guarantee a user would not decommission node 1, then 10 minutes later decommission node 2, and so on, and the suggested strategy would not help with that.
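The fail-fast behaviour described above can be sketched in plain Java. This is a hypothetical stand-in using an ArrayList of block IDs, not Hadoop's actual DatanodeAdminManager or its per-volume block iterator; it only illustrates why releasing the lock mid-iteration is unsafe:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Hypothetical stand-in for iterating a storage volume's block list
    // while the namenode lock has been dropped; not Hadoop's real code.
    static boolean iterationFailsOnConcurrentChange(List<Long> blocks) {
        try {
            for (Long blockId : blocks) {
                if (blockId == 2L) {
                    // Simulate a block report arriving after the lock was
                    // released: it structurally modifies the collection.
                    blocks.add(99L);
                }
            }
            return false; // iteration completed (will not happen here)
        } catch (ConcurrentModificationException e) {
            return true;  // the fail-fast iterator detected the change
        }
    }

    public static void main(String[] args) {
        List<Long> blocks = new ArrayList<>(Arrays.asList(1L, 2L, 3L));
        if (!iterationFailsOnConcurrentChange(blocks)) {
            throw new AssertionError("expected ConcurrentModificationException");
        }
        System.out.println("ok: concurrent modification invalidated the iterator");
    }
}
```

This is why each decommissioning node's scan either holds the lock for the whole unit of work, or has to bookmark its position and restart the iterator after any concurrent change.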
I still believe the simplest fix for this issue is to change the implementation of the pending replication queue to process it in a random order rather than FIFO. That does not help with nodes which have had some blocks skipped on the first pass and need to be processed a second time, but we may be able to solve that by retrying them a few times, as you also suggested.

> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Attachments: HDFS-13157.1.patch
>
> From what I understand of [DataNode decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java] it appears that all the blocks are scheduled for removal _in order_. I'm not 100% sure what the ordering is exactly, but I think it loops through each data volume and schedules each block to be replicated elsewhere. The net effect is that during a decommission, all of the DataNode transfer threads slam on a single volume until it is cleaned out, at which point they all slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution across all volumes when decommissioning a node.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)