[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920783#comment-16920783
 ] 

Stephen O'Donnell commented on HDFS-13157:
------------------------------------------

I tested my theory that this problem can also result in only one node making 
decommission progress when several are decommissioned at the same time. Using a 
simulated cluster with two carefully picked nodes, chosen so that node 1 and 
node 2 do not host any of the same blocks, I can see that the first node to 
start decommissioning makes progress while the other makes none. In a real 
cluster there is likely to be some overlap in the blocks between the two nodes, 
so both will make some progress, but only because the replication monitor 
notices that such a block needs 2 new replicas and schedules both copies at the 
same time.
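
To illustrate the behaviour, here is a toy simulation rather than Hadoop code - 
the node names, block count and per-tick transfer limit are just assumptions. A 
FIFO queue that receives all of node A's blocks before any of node B's 
replicates nothing for B until A has completely drained:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of a strictly FIFO replication queue fed by two decommissioning
// nodes. Node A starts first, so every one of its blocks is queued ahead of
// node B's; B therefore shows zero progress until A is finished.
public class FifoDecommissionSim {
    public static void main(String[] args) {
        Queue<String> replicationQueue = new ArrayDeque<>();
        int blocksPerNode = 1000;        // assumed block count per node

        for (int i = 0; i < blocksPerNode; i++) replicationQueue.add("A-blk-" + i);
        for (int i = 0; i < blocksPerNode; i++) replicationQueue.add("B-blk-" + i);

        int transfersPerTick = 50;       // assumed cluster-wide replication rate
        int tick = 0, aDone = 0, bDone = 0;
        while (!replicationQueue.isEmpty()) {
            for (int i = 0; i < transfersPerTick && !replicationQueue.isEmpty(); i++) {
                if (replicationQueue.poll().startsWith("A-")) aDone++; else bDone++;
            }
            tick++;
            System.out.printf("tick %2d: A=%d B=%d%n", tick, aDone, bDone);
        }
    }
}
{code}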

Therefore this problem is worse than just concentrating decommission traffic on 
a single disk: it also means decommission does not really make progress on more 
than one node at a time.

I am also concerned about the time the NN lock is held when processing a node 
for decommission. In tests on my laptop, the scan takes about 300ms for a node 
with 340K blocks and about 660ms for 1M blocks. Scaling up to 5M blocks, this 
could hold the lock for about 3 seconds per node. There is a delay between 
processing each node, but it is still not ideal to block the NN for that long.

Randomizing the iterator in the way suggested here would prevent us from later 
changing the scan to drop and retake the NN lock per storage on the DN, which 
would improve the lock hold time.
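
The kind of change I mean is sketched below. This is not the real 
DatanodeAdminManager/FSNamesystem code - the Storage and Block types and the 
scanStorage() method are hypothetical stand-ins - it just shows the write lock 
being taken and released once per storage rather than once per node:

{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PerStorageScanSketch {
    static class Block { long id; }
    static class Storage { List<Block> blocks; }

    private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();

    // Rather than scanning every block on the node under one long lock hold,
    // take and release the lock once per storage (i.e. per DataNode volume),
    // so other NameNode operations can interleave between storages.
    void scanNodeForDecommission(List<Storage> storagesOnNode) {
        for (Storage storage : storagesOnNode) {
            namesystemLock.writeLock().lock();
            try {
                scanStorage(storage);   // queue under-replicated blocks, etc.
            } finally {
                namesystemLock.writeLock().unlock();
            }
            // The lock is free here between storages. A block order randomized
            // across the whole node would cut across this per-storage boundary.
        }
    }

    private void scanStorage(Storage storage) {
        for (Block b : storage.blocks) {
            // placeholder for the real per-block decommission checks
        }
    }
}
{code}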

This makes me think the solution to this problem is not to randomize the blocks 
from one node as they are added to the replication queue, but instead to 
somehow randomize the order in which the replication queue is processed.
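
As a rough sketch of what I mean - again using plain JDK types and an assumed 
batch size rather than the actual replication queue classes - blocks could be 
queued in their natural scan order and the randomization applied only when a 
batch of work is drained from the queue:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomizedQueueDrainSketch {
    private final List<String> replicationQueue = new ArrayList<>();
    private final Random random = new Random();

    // Blocks are enqueued in whatever order the per-node/per-storage scan
    // produces them; no shuffling happens at enqueue time.
    void add(String blockId) {
        replicationQueue.add(blockId);
    }

    // Draining picks each block from a random position instead of the head, so
    // transfers spread across nodes and volumes even though each node's blocks
    // were queued sequentially.
    List<String> nextBatch(int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && !replicationQueue.isEmpty()) {
            int idx = random.nextInt(replicationQueue.size());
            Collections.swap(replicationQueue, idx, replicationQueue.size() - 1);
            batch.add(replicationQueue.remove(replicationQueue.size() - 1));
        }
        return batch;
    }
}
{code}

Picking a random index and swap-removing keeps each dequeue O(1), so this would 
not add meaningful overhead to the replication monitor loop.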

> Do Not Remove Blocks Sequentially During Decommission 
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, and so on.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.


