[ 
https://issues.apache.org/jira/browse/HDFS-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027204#comment-15027204
 ] 

Rushabh S Shah commented on HDFS-4861:
--------------------------------------

Recently we hit this bug in one of our production clusters.
We decommissioned multiple racks (more than 10) at once.
All the blocks with a high replication factor were failing to replicate.
Since maxNodesPerRack is determined without considering the number of racks 
that are being decommissioned, BlockPlacementPolicyDefault could not find any 
good target to copy the blocks to.
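To make the failure mode concrete, here is a minimal self-contained sketch, NOT the actual NameNode code; the rack counts and replication factor below are made-up numbers that only approximate our scenario:

{code}
// Minimal sketch, NOT the actual NameNode code; the rack counts and the
// replication factor below are illustrative assumptions.
public class MaxNodesPerRackExample {
  public static void main(String[] args) {
    int totalNumOfReplicas = 10;   // block with a high replication factor
    int numOfRacks = 12;           // all racks known to NetworkTopology
    int decommissioningRacks = 10; // not subtracted by the current code

    // Current formula from getMaxNodesPerRack():
    int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;  // = 2

    // Only 2 racks still have good (non-decommissioning) nodes, so at most
    // 2 racks * 2 nodes/rack = 4 replicas can be placed; the block stays
    // under-replicated even though plenty of nodes exist on those racks.
    int usableRacks = numOfRacks - decommissioningRacks;
    System.out.println("placeable replicas <= " + usableRacks * maxNodesPerRack);

    // If the formula excluded decommissioning racks, the per-rack limit
    // would be (10 - 1) / 2 + 2 = 6 and all 10 replicas could be placed.
    int adjusted = (totalNumOfReplicas - 1) / usableRacks + 2;
    System.out.println("adjusted maxNodesPerRack = " + adjusted);
  }
}
{code}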

I have an approach that can fix this bug:
We can get rid of maxNodesPerRack altogether.
NetworkTopology (and NetworkTopologyWithNodeGroup) can maintain a list of all 
racks (and NodeGroups) in the current topology.
In BlockPlacementPolicyDefault#chooseRandom we can get the list of racks for 
the given scope.
If we get multiple racks back, we can go to each rack and select a target in a 
round-robin fashion.
This way we can place the replicas evenly across all the racks for blocks with 
a high replication factor.
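A rough, self-contained sketch of the round-robin idea is below. This is not the actual BlockPlacementPolicyDefault code: racks and nodes are plain strings, the topology is hypothetical, and the decommissioning check stands in for the usual isGoodTarget()-style checks.

{code}
import java.util.*;

// Sketch of round-robin target selection across racks; illustrative only.
public class RoundRobinPlacementSketch {
  public static void main(String[] args) {
    // Hypothetical topology: 4 racks, two of them fully decommissioning.
    Map<String, List<String>> rackToNodes = new LinkedHashMap<>();
    rackToNodes.put("/rack1", Arrays.asList("n1", "n2", "n3"));
    rackToNodes.put("/rack2", Arrays.asList("n4", "n5"));
    rackToNodes.put("/rack3", Arrays.asList("n6", "n7"));   // decommissioning
    rackToNodes.put("/rack4", Arrays.asList("n8"));         // decommissioning
    Set<String> decommissioning = new HashSet<>(Arrays.asList("n6", "n7", "n8"));

    System.out.println(choose(rackToNodes, decommissioning, 5));
  }

  static List<String> choose(Map<String, List<String>> rackToNodes,
                             Set<String> decommissioning, int replicasNeeded) {
    List<String> racks = new ArrayList<>(rackToNodes.keySet());
    List<String> chosen = new ArrayList<>();
    int rackIndex = 0;
    int attempts = 0;
    int maxAttempts = racks.size() * replicasNeeded;  // simple give-up bound
    while (chosen.size() < replicasNeeded && attempts++ < maxAttempts) {
      String rack = racks.get(rackIndex++ % racks.size());
      for (String node : rackToNodes.get(rack)) {
        // Skip decommissioning or already-chosen nodes
        // (stand-in for the real isGoodTarget() checks).
        if (!decommissioning.contains(node) && !chosen.contains(node)) {
          chosen.add(node);
          break;   // take one node per rack per pass, so replicas spread evenly
        }
      }
    }
    return chosen;
  }
}
{code}

With no per-rack cap, the loop simply keeps cycling over the racks that still have good nodes, so decommissioning racks are skipped naturally instead of shrinking the allowed target set up front.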

This approach breaks the BlockPlacementPolicyRackFaultTolerant policy.
Unlike the other placement policies, that policy is heavily dependent on 
maxNodesPerRack (and overrides its implementation).
Asking the author [~walter.k.su] to share the motivation behind that policy.
My approach will try to spread blocks with a high replication factor across 
most racks, but it will still put the first 2 of the 3 copies on one rack.

I have a patch that works for branch-2.7, since 
BlockPlacementPolicyRackFaultTolerant has not been committed to branch-2.7.

> BlockPlacementPolicyDefault does not consider decommissioning racks
> -------------------------------------------------------------------
>
>                 Key: HDFS-4861
>                 URL: https://issues.apache.org/jira/browse/HDFS-4861
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.7, 2.1.0-beta
>            Reporter: Kihwal Lee
>            Assignee: Rushabh S Shah
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-4861-v2.patch, HDFS-4861.patch
>
>
> getMaxNodesPerRack() calculates the max replicas/rack like this:
> {code}
> int maxNodesPerRack = (totalNumOfReplicas-1)/clusterMap.getNumOfRacks()+2;
> {code}
> Since this does not consider the racks that are being decommissioned and the 
> decommissioning state is only checked later in isGoodTarget(), certain blocks 
> are not replicated even when there are many racks and nodes.


