[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925523#comment-16925523 ]

Stephen O'Donnell commented on HDFS-13157:
------------------------------------------

> How is it handled, iterating through each DataNode, that a block is scheduled 
> to be replicated onto a DataNode that will be decommissioned further down in 
> the list?

The block manager takes care of this when allocating a new block target. Nodes 
that are in a decommissioning state will not be considered as a new target.
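As a minimal sketch of that exclusion (invented names, not the actual BlockPlacementPolicy code), choosing targets amounts to filtering candidates by admin state:

```java
import java.util.ArrayList;
import java.util.List;

public class TargetFilterSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    static class Node {
        final String name;
        final AdminState state;
        Node(String name, AdminState state) { this.name = name; this.state = state; }
    }

    // Only nodes in NORMAL admin state are eligible as new block targets;
    // decommissioning and decommissioned nodes are skipped.
    static List<Node> eligibleTargets(List<Node> candidates) {
        List<Node> out = new ArrayList<>();
        for (Node n : candidates) {
            if (n.state == AdminState.NORMAL) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Node> cluster = new ArrayList<>();
        cluster.add(new Node("dn1", AdminState.NORMAL));
        cluster.add(new Node("dn2", AdminState.DECOMMISSION_INPROGRESS));
        cluster.add(new Node("dn3", AdminState.NORMAL));
        System.out.println(eligibleTargets(cluster).size()); // prints 2
    }
}
```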

In the current decommissioning implementation, the nodes selected for 
decommissioning are processed very conservatively: for nodes with more than 
500K blocks (dfs.namenode.decommission.blocks.per.interval) it will process 
one node, then sleep for 30 seconds before processing the next one. I believe 
this is to avoid holding the Namenode lock too often in close succession.
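The pacing can be sketched roughly as a per-interval block budget (a simplified illustration; the method and list of counts are invented, and the real logic lives in DatanodeAdminManager):

```java
import java.util.List;

public class BudgetedScanSketch {
    // Loosely modeled on dfs.namenode.decommission.blocks.per.interval:
    // stop scanning once the budget of checked blocks is exhausted, and
    // resume the remaining nodes on the next interval.
    static int scanWithBudget(List<Integer> nodeBlockCounts, int blocksPerInterval) {
        int checked = 0;
        int nodesProcessed = 0;
        for (int blocks : nodeBlockCounts) {
            if (checked >= blocksPerInterval) {
                break; // budget spent; sleep until the next interval
            }
            checked += blocks;
            nodesProcessed++;
        }
        return nodesProcessed;
    }

    public static void main(String[] args) {
        // A node holding 600K blocks consumes the whole 500K budget,
        // so only one node is processed this interval.
        System.out.println(scanWithBudget(List.of(600_000, 10_000), 500_000)); // prints 1
    }
}
```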

When processing a node, we really need to hold a lock while processing some 
unit of work. The work in the datanode is split by its storage volumes, and 
we use an iterator to process all the blocks in each storage. If you drop the 
lock part way through that iteration, then a block report or file modification 
in HDFS can change the contents of the storage, and the iterator will throw a 
ConcurrentModificationException. Therefore interleaving many DNs for 
processing at the same time is tricky. Each one needs an exclusive lock and 
they will all be contending for it. If we drop and re-take the lock for each 
block, we will need to bookmark the iterator and handle 
ConcurrentModificationException, possibly frequently.
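The iterator problem can be demonstrated in isolation with a plain fail-fast Java collection (this is only an analogy for the storage block list, not the actual NN data structure):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class IteratorInvalidation {
    // Mutating the underlying collection while a live iterator exists
    // invalidates that iterator, which is why the lock cannot simply be
    // dropped mid-scan.
    static boolean invalidatedByMutation() {
        List<String> blocks = new ArrayList<>(List.of("blk_1", "blk_2", "blk_3"));
        Iterator<String> it = blocks.iterator();
        it.next();           // scanning the storage's block list under the lock
        blocks.add("blk_4"); // lock dropped: a block report mutates the list
        try {
            it.next();       // resuming the old iterator now fails
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(invalidatedByMutation()); // prints true
    }
}
```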

There is also no guarantee a user would not decommission node 1, then 10 
minutes later decommission node 2, and so on; the suggested strategy would not 
help in that case.

I still believe the simplest fix for this issue is to change the 
implementation of the pending replication queue to process it in a random 
order rather than FIFO. That alone does not handle nodes which have had some 
blocks skipped on the first pass and need to be processed a second time, but 
we may be able to solve that by retrying them a few times, as you also 
suggested.
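The randomized ordering could be sketched like this (names invented; the real queue lives in the block manager):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomizedQueueSketch {
    // Shuffle the pending blocks so replication work is spread across
    // source volumes/nodes instead of draining them in FIFO order.
    static List<String> scheduleOrder(List<String> pendingBlocks, long seed) {
        List<String> order = new ArrayList<>(pendingBlocks);
        Collections.shuffle(order, new Random(seed));
        return order;
    }

    public static void main(String[] args) {
        // FIFO order would group all of vol1's blocks first; shuffling
        // interleaves the volumes.
        List<String> pending =
                List.of("vol1-blk1", "vol1-blk2", "vol2-blk1", "vol2-blk2");
        System.out.println(scheduleOrder(pending, 42));
    }
}
```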

> Do Not Remove Blocks Sequentially During Decommission 
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
