[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921331#comment-16921331 ]
Stephen O'Donnell edited comment on HDFS-13157 at 9/3/19 10:58 AM:
-------------------------------------------------------------------

{quote}
Add a configuration, which makes NN to release the lock every 10000 (configurable) blocks.
{quote}

There was some discussion related to this in HDFS-10477, where it was decided to drop the lock after processing each storage. The reason is that the iterator for the storage could hit a ConcurrentModificationException if its contents change while the lock is dropped and retaken. Locking at the storage level is probably a good middle ground between how it works currently and locking on a block count threshold.

Thinking about the problem of replicating older blocks first: we currently have several replication queues, and blocks with only 1 replica should go into the highest-priority queue. That means other blocks (only 2 replicas) and decommissioning blocks are in the 'normal' queue. Looking at how that queue is currently processed, it begins at the start and:

# Gets 2 * live_nodes blocks
# Attempts to schedule them for replication based on max-streams limits
# Any that are not scheduled are simply dropped until all other blocks have been tried and the iterator cycles round

Therefore, even in the current implementation, some blocks can get left behind for some time. This does seem to be a tricky problem to get right, as there are quite a few edge cases and scenarios to consider.
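For illustration, the storage-at-a-time lock release discussed above might look roughly like the following Java sketch. The class, method, and lock names here are invented for this example; this is not the HDFS-10477 code, and storages are modelled as plain lists of block IDs:

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: "StorageScopedLocking" and "processStorages"
// are placeholder names, not actual HDFS classes.
public class StorageScopedLocking {
    private final ReentrantLock namesystemLock = new ReentrantLock();

    /** Process every storage, holding the lock for one storage at a time. */
    public int processStorages(List<List<String>> storages) {
        int processed = 0;
        for (List<String> storage : storages) {
            namesystemLock.lock();
            try {
                // The storage's iterator is only used while the lock is held,
                // so it cannot observe concurrent modifications.
                for (String block : storage) {
                    processed++;
                }
            } finally {
                namesystemLock.unlock();
            }
            // Lock released here: other threads may mutate the block maps
            // between storages, which is safe because the next iteration
            // takes a fresh iterator over the next storage.
        }
        return processed;
    }
}
```

The key point is that each iterator lives entirely within one lock acquisition, which is what avoids the ConcurrentModificationException a block-count threshold could trigger mid-iteration.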
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HDFS-13157.1.patch
>
> From what I understand of [DataNode decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java] it appears that all the blocks are scheduled for removal _in order_. I'm not 100% sure what the ordering is exactly, but I think it loops through each data volume and schedules each block to be replicated elsewhere. The net effect is that during a decommission, all of the DataNode transfer threads slam a single volume until it is cleaned out, at which point they all slam the next volume, and so on.
> Please randomize the block list so that there is a more even distribution across all volumes when decommissioning a node.
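The randomization the issue description asks for could be sketched roughly as below. This is a placeholder illustration, not the attached patch: blocks are modelled as plain strings, and a seeded Random keeps the example deterministic, whereas real code would use an unseeded one:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: "ShuffledBlockOrder" is an invented name.
public class ShuffledBlockOrder {
    /**
     * Return the blocks in a randomized order instead of volume-by-volume,
     * so replication work is spread across all volumes of the node.
     */
    public static List<String> randomize(List<String> blocksInVolumeOrder,
                                         long seed) {
        List<String> shuffled = new ArrayList<>(blocksInVolumeOrder);
        // Seeded only to make this sketch reproducible for testing.
        Collections.shuffle(shuffled, new Random(seed));
        return shuffled;
    }
}
```

A shuffle preserves the set of blocks while breaking the volume-contiguous ordering, so consecutive replication tasks are unlikely to all read from the same disk.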
--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org