[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921331#comment-16921331 ]
Stephen O'Donnell edited comment on HDFS-13157 at 9/3/19 10:58 AM:
-------------------------------------------------------------------

{quote}
Add a configuration, which makes NN to release the lock every 10000 (configurable) blocks.
{quote}

There was some discussion related to this in HDFS-10477, where it was decided to drop the lock after processing each storage. The reason is that the iterator for the storage could hit a ConcurrentModificationException if its contents change while the lock is dropped and retaken. Locking at the storage level is probably a good middle ground between how it works currently and locking on a block count threshold.

Thinking about the problem of replicating older blocks first: we currently have several replication queues, and blocks with only 1 replica should go into the highest-priority queue. That means other blocks (only 2 replicas) and decommissioning blocks are in the 'normal' queue. Looking at how that queue is currently processed, it begins at the start and:

# Gets 2 * live_nodes blocks
# Attempts to schedule them for replication based on max-streams limits
# Any that are not scheduled are simply dropped until all other blocks have been tried and the iterator cycles round

Therefore, even in the current implementation, some blocks can get left behind for some time. This does seem to be a tricky problem to get right, as there are quite a few edge cases and scenarios to consider.
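For illustration, the storage-at-a-time lock release discussed above might look roughly like the following Java sketch. The class, method, and lock names here are invented for this example; this is not the HDFS-10477 code, and storages are modelled as plain lists of block IDs:

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: "StorageScopedLocking" and "processStorages"
// are placeholder names, not actual HDFS classes.
public class StorageScopedLocking {
    private final ReentrantLock namesystemLock = new ReentrantLock();

    /** Process every storage, holding the lock for one storage at a time. */
    public int processStorages(List<List<String>> storages) {
        int processed = 0;
        for (List<String> storage : storages) {
            namesystemLock.lock();
            try {
                // The storage's iterator is only used while the lock is held,
                // so it cannot observe concurrent modifications.
                for (String block : storage) {
                    processed++;
                }
            } finally {
                namesystemLock.unlock();
            }
            // Lock released here: other threads may mutate the block maps
            // between storages, which is safe because the next iteration
            // takes a fresh iterator over the next storage.
        }
        return processed;
    }
}
```

The key point is that each iterator lives entirely within one lock acquisition, which is what avoids the ConcurrentModificationException a block-count threshold could trigger mid-iteration.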
> Do Not Remove Blocks Sequentially During Decommission
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HDFS-13157.1.patch
>
> From what I understand of [DataNode decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java] it appears that all the blocks are scheduled for removal _in order_. I'm not 100% sure what the ordering is exactly, but I think it loops through each data volume and schedules each block to be replicated elsewhere. The net effect is that during a decommission, all of the DataNode transfer threads slam a single volume until it is cleaned out, at which point they all slam the next volume, and so on.
> Please randomize the block list so that there is a more even distribution across all volumes when decommissioning a node.
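The randomization the issue description asks for could be sketched roughly as below. This is a placeholder illustration, not the attached patch: blocks are modelled as plain strings, and a seeded Random keeps the example deterministic, whereas real code would use an unseeded one:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: "ShuffledBlockOrder" is an invented name.
public class ShuffledBlockOrder {
    /**
     * Return the blocks in a randomized order instead of volume-by-volume,
     * so replication work is spread across all volumes of the node.
     */
    public static List<String> randomize(List<String> blocksInVolumeOrder,
                                         long seed) {
        List<String> shuffled = new ArrayList<>(blocksInVolumeOrder);
        // Seeded only to make this sketch reproducible for testing.
        Collections.shuffle(shuffled, new Random(seed));
        return shuffled;
    }
}
```

A shuffle preserves the set of blocks while breaking the volume-contiguous ordering, so consecutive replication tasks are unlikely to all read from the same disk.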
--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org