[ https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389145#comment-14389145 ]
Hudson commented on HDFS-7742:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2099 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2099/])
HDFS-7742. Favoring decommissioning node for replication can cause a block to stay underreplicated for long periods (kihwal: rev 04ee18ed48ceef34598f954ff40940abc9fde1d2)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

> favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7742
>                 URL: https://issues.apache.org/jira/browse/HDFS-7742
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7742-v0.patch
>
>
> When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes, so in theory they are less loaded.
> However, the same selection algorithm also tries to make sure it doesn't get "stuck" on any particular node:
> {noformat}
> // switch to a different node randomly
> // this to prevent from deterministically selecting the same node even
> // if the node failed to replicate the block on previous iterations
> {noformat}
> Unfortunately, the decommissioning check comes before this randomness, so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice: a decommissioning datanode failed to replicate a block for many days while other viable replicas of the block were available.
> Given that we limit the number of replication streams we'll assign to a given node (default soft limit of 2, hard limit of 4), favoring a decommissioning node doesn't seem to have significant benefit. When there is significant replication work to do, we quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work, then in theory we have plenty of replication bandwidth available, so choosing a decommissioning node isn't much of a win.
> I see two choices:
> 1) Change the algorithm to still favor decommissioning nodes, but with enough randomness to avoid always selecting the decommissioning node.
> 2) Remove the favoritism for decommissioning nodes.
> I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes.
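
To make the failure mode concrete, below is a minimal, hypothetical Java sketch of the selection behavior described above. It is not the actual BlockManager code: the Replica class, the chooseSource* methods, and the stream-limit constant are illustrative stand-ins. It shows how an unconditional decommissioning check returns before the random "switch to a different node" step ever runs, so a failing decommissioning node is reselected deterministically, while dropping the favoritism (choice #2) subjects every eligible replica to the same randomization.

{noformat}
import java.util.List;
import java.util.Random;

// Hypothetical sketch only; not the real
// org.apache.hadoop.hdfs.server.blockmanagement.BlockManager API.
class SourceSelectionSketch {
  static final int HARD_STREAM_LIMIT = 4; // default hard stream limit per node

  static class Replica {
    final String node;
    final boolean decommissioning;
    final int activeStreams;
    Replica(String node, boolean decommissioning, int activeStreams) {
      this.node = node;
      this.decommissioning = decommissioning;
      this.activeStreams = activeStreams;
    }
  }

  private static final Random rand = new Random();

  // Pre-fix behavior: a decommissioning replica wins unconditionally, so the
  // random switch below never runs for it and the same (possibly failing)
  // node is chosen on every iteration.
  static Replica chooseSourceOld(List<Replica> replicas) {
    Replica src = null;
    for (Replica r : replicas) {
      if (r.activeStreams >= HARD_STREAM_LIMIT) {
        continue; // never pick a node already at the hard stream limit
      }
      if (r.decommissioning) {
        return r; // favoritism: short-circuits the randomness below
      }
      // switch to a different node randomly so we don't deterministically
      // reselect a node that failed to replicate on previous iterations
      if (src == null || rand.nextBoolean()) {
        src = r;
      }
    }
    return src;
  }

  // Post-fix behavior (choice #2): no favoritism; every eligible replica is
  // subject to the same random switch.
  static Replica chooseSourceNew(List<Replica> replicas) {
    Replica src = null;
    for (Replica r : replicas) {
      if (r.activeStreams >= HARD_STREAM_LIMIT) {
        continue;
      }
      if (src == null || rand.nextBoolean()) {
        src = r;
      }
    }
    return src;
  }

  public static void main(String[] args) {
    List<Replica> replicas = List.of(
        new Replica("dn-decommissioning", true, 0), // keeps failing to replicate
        new Replica("dn-healthy-1", false, 0),
        new Replica("dn-healthy-2", false, 1));
    // The old policy always returns dn-decommissioning; the new policy
    // spreads selections across all eligible replicas over repeated calls.
    System.out.println("old: " + chooseSourceOld(replicas).node);
    System.out.println("new: " + chooseSourceNew(replicas).node);
  }
}
{noformat}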