[ 
https://issues.apache.org/jira/browse/HDFS-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binglin Chang updated HDFS-5580:
--------------------------------

    Attachment: HDFS-5580.v1.patch

Bug analysis:
In Balancer.PendingBlockMove.chooseProxySource()
{code}
      boolean find = false;
      for (BalancerDatanode loc : block.getLocations()) {
        // check if there is replica which is on the same rack with the target
        if (cluster.isOnSameRack(loc.getDatanode(), targetDN) && addTo(loc)) {
          find = true;
          // if cluster is not nodegroup aware or the proxy is on the same 
          // nodegroup with target, then we already find the nearest proxy
          if (!cluster.isNodeGroupAware() 
              || cluster.isOnSameNodeGroup(loc.getDatanode(), targetDN)) {
            return true;
          }
        }
        
        if (!find) {
          // find out a non-busy replica out of rack of target
          find = addTo(loc);
        }
      }
{code}
Because the loop can call addTo(loc) more than once, a PendingBlockMove may be 
added to multiple locations instead of one. The consumer thread pool removes 
only a pair of PendingBlockMove entries at a time, leaving stray 
PendingBlockMove entries in the queue. Balancer.waitForMoveCompletion waits 
for the queue to become empty, which never happens, so the Balancer hangs.
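
To make the failure mode concrete, here is a minimal, self-contained sketch of the same control flow. It is not the Balancer code and not necessarily what the attached patch does: the class ProxyChoiceSketch, the Node type, addPendingMove() and both choose methods are hypothetical stand-ins, node groups are left out (the cluster is treated as not node-group aware), and the final "return find" of the buggy variant is assumed, since the quoted snippet ends before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProxyChoiceSketch {
    /** Hypothetical stand-in for BalancerDatanode: records moves queued on it. */
    static class Node {
        final String rack;
        final List<String> pendingMoves = new ArrayList<>();
        Node(String rack) { this.rack = rack; }
        // Stand-in for addTo(loc); in this sketch a node is never busy.
        boolean addPendingMove(String move) { return pendingMoves.add(move); }
    }

    /** Same shape as the quoted loop: an off-rack fallback is queued eagerly. */
    static boolean buggyChoose(List<Node> locations, String targetRack, String move) {
        boolean find = false;
        for (Node loc : locations) {
            if (loc.rack.equals(targetRack) && loc.addPendingMove(move)) {
                return true; // same-rack proxy found, but a fallback may already hold the move
            }
            if (!find) {
                find = loc.addPendingMove(move); // queued before all locations were checked
            }
        }
        return find; // assumed: the quoted snippet stops before this point
    }

    /** One possible restructuring: look for a same-rack proxy first, then fall back. */
    static boolean singleChoose(List<Node> locations, String targetRack, String move) {
        for (Node loc : locations) {
            if (loc.rack.equals(targetRack) && loc.addPendingMove(move)) {
                return true;
            }
        }
        for (Node loc : locations) {
            if (loc.addPendingMove(move)) {
                return true;
            }
        }
        return false;
    }

    /** Total number of copies of the move queued across all locations. */
    static int queuedCopies(List<Node> locations) {
        int total = 0;
        for (Node loc : locations) {
            total += loc.pendingMoves.size();
        }
        return total;
    }

    // Replica order matters: the off-rack replica is visited before the same-rack one.
    static List<Node> sampleLocations() {
        return Arrays.asList(new Node("/rackB"), new Node("/rackA"));
    }

    static int demoBuggy() {
        List<Node> locations = sampleLocations();
        buggyChoose(locations, "/rackA", "move-1");
        return queuedCopies(locations);
    }

    static int demoSingle() {
        List<Node> locations = sampleLocations();
        singleChoose(locations, "/rackA", "move-1");
        return queuedCopies(locations);
    }

    public static void main(String[] args) {
        System.out.println("buggy loop queued copies:  " + demoBuggy());
        System.out.println("two-pass loop queued copies: " + demoSingle());
    }
}
```

With the off-rack replica listed first, the buggy loop queues the move on both nodes, while the two-pass variant queues it once. Any extra copy that is never dequeued is what keeps a wait-until-empty loop from terminating.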



> Infinite loop in Balancer.waitForMoveCompletion
> -----------------------------------------------
>
>                 Key: HDFS-5580
>                 URL: https://issues.apache.org/jira/browse/HDFS-5580
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>         Attachments: HDFS-5580.v1.patch, TestBalancerWithNodeGroupTimeout.log
>
>
> In a recent 
> [build|https://builds.apache.org/job/PreCommit-HDFS-Build/5592//testReport/org.apache.hadoop.hdfs.server.balancer/TestBalancerWithNodeGroup/testBalancerWithNodeGroup/]
>  in HDFS-5574, TestBalancerWithNodeGroup timed out; this is also mentioned in 
> HDFS-4376 
> [here|https://issues.apache.org/jira/browse/HDFS-4376?focusedCommentId=13799402&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13799402].
>  
> It looks like the bug was introduced by HDFS-4376.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
