[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920125#comment-16920125
 ] 

Stephen O'Donnell commented on HDFS-13157:
------------------------------------------

Thinking about this some more, the current logic looks like:
{code:java}
for each block on DN {
  if (not_sufficiently_replicated) {
    add_to_replication_queue(block)
    add_to_insufficientList(block)
  }
} {code}
If it is possible to drop and re-take the namenode lock for each datanode disk, 
in a similar way to HDFS-10477, then I wonder if we could shuffle the order the 
blocks are added to the replication_queue, rather than shuffle the order they 
are read from the datanode storage?

Eg, something like:
{code:java}
for each storage on DN {
  get_nn_lock()
  for each block on storage {
    if (not_sufficiently_replicated) {
      add_to_insufficientList(block)
    }
  }
  release_nn_lock()
}
add_to_replication_queue_in_random_order(insufficientList) {code}
However, it may not be that simple: between lock acquisitions the state of some 
blocks may have changed. Eg, they could have been deleted or had their replica 
count changed, meaning they would need to be rechecked for sufficient 
replication, duplicating work already done.
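As a minimal sketch of the collect-then-shuffle idea above (illustrative only; the class and method names here are made up, not the actual DatanodeAdminManager code, and the per-lock-hold collection is represented simply as one inner list per storage):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShuffledEnqueueSketch {
    /**
     * Gather the under-replicated blocks found on each storage (in the
     * real namenode, each inner list would be built while holding the NN
     * lock), then shuffle the combined list before handing it to the
     * replication queue, so that no single volume's blocks dominate the
     * front of the queue.
     */
    static <B> List<B> shuffledEnqueueOrder(List<List<B>> perStorageInsufficient) {
        List<B> insufficientList = new ArrayList<>();
        for (List<B> storageBlocks : perStorageInsufficient) {
            // Collected under the lock, one storage at a time.
            insufficientList.addAll(storageBlocks);
        }
        // Randomize the order before enqueueing for replication.
        Collections.shuffle(insufficientList);
        return insufficientList;
    }
}
```

The shuffle itself is cheap; the open question, as noted, is whether the per-block state is still valid by the time the shuffled list is enqueued.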

> Do Not Remove Blocks Sequentially During Decommission 
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.
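An alternative to a full shuffle of the randomization requested above would be to interleave the per-volume block lists round-robin, which evens out volume load deterministically. This is only a sketch with made-up names, not existing HDFS code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InterleaveVolumes {
    /**
     * Round-robin across per-volume block lists so that decommission
     * work is spread evenly over all volumes, instead of draining one
     * volume completely before moving to the next.
     */
    static <T> List<T> interleave(List<List<T>> perVolume) {
        List<T> out = new ArrayList<>();
        List<Iterator<T>> its = new ArrayList<>();
        for (List<T> volumeBlocks : perVolume) {
            its.add(volumeBlocks.iterator());
        }
        boolean progress = true;
        while (progress) {
            progress = false;
            // Take at most one block from each volume per pass.
            for (Iterator<T> it : its) {
                if (it.hasNext()) {
                    out.add(it.next());
                    progress = true;
                }
            }
        }
        return out;
    }
}
```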



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
