Daryn Sharp created HDFS-9287:
---------------------------------

             Summary: Block placement completely fails if too many nodes are 
decommissioning
                 Key: HDFS-9287
                 URL: https://issues.apache.org/jira/browse/HDFS-9287
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0
            Reporter: Daryn Sharp
            Priority: Critical


The DatanodeManager coordinates with the HeartbeatManager to update 
HeartbeatManager.Stats, which tracks cluster capacity and load.  This is 
crucial for block placement to consider space and load.  The accounting is 
completely broken for decommissioning (decomm) nodes.
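
For context, a condensed sketch of the add/subtract pair in 
HeartbeatManager.Stats (field and accessor names follow the real classes, but 
the bodies are simplified and omit some counters):

{code:java}
// Condensed sketch of the accounting pair in HeartbeatManager.Stats.
// Field names follow the real class; bodies are simplified for
// illustration and omit the used/blockPoolUsed bookkeeping.
class Stats {
  long capacityTotal;             // bytes on in-service nodes
  long capacityRemaining;         // free bytes on in-service nodes
  int nodesInService;             // denominator for load averaging
  int nodesInServiceXceiverCount; // total active transceivers

  void add(DatanodeDescriptor node) {
    if (!(node.isDecommissionInProgress() || node.isDecommissioned())) {
      nodesInService++;
      nodesInServiceXceiverCount += node.getXceiverCount();
      capacityTotal += node.getCapacity();
      capacityRemaining += node.getRemaining();
    }
  }

  void subtract(DatanodeDescriptor node) {
    // Mirror image of add(); only correct if the node's decomm state
    // has not changed since the matching add().
    if (!(node.isDecommissionInProgress() || node.isDecommissioned())) {
      nodesInService--;
      nodesInServiceXceiverCount -= node.getXceiverCount();
      capacityTotal -= node.getCapacity();
      capacityRemaining -= node.getRemaining();
    }
  }
}
{code}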

The heartbeat manager subtracts a node's prior values before it adds the new 
values.  During registration of a decomm node, it subtracts before the initial 
values are seeded.  The subtract decrements nodesInService (the node is not 
yet marked decomm), the state then flips to decomm, and the subsequent add 
does not increment nodesInService (which is correct for a decomm node).  There 
are other math bugs (double adding) that only accidentally work because the 
values are still 0.
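
A minimal reconstruction of that sequence (not the literal HDFS code path; the 
method and helper names are illustrative stand-ins for the DatanodeManager 
registration logic, and stats is the Stats instance sketched above):

{code:java}
// Hypothetical reconstruction of the buggy order of operations when a
// decommissioning node registers.
void registerNode(DatanodeDescriptor node) {
  // 1. Subtract runs before the initial values are seeded.  The fresh
  //    descriptor is not yet marked decomm, so the in-service branch
  //    fires: nodesInService--.  The other counters subtract 0s, which
  //    is why the remaining math bugs (double adding) go unnoticed.
  stats.subtract(node);

  // 2. Seeding the initial values flips the node's admin state to
  //    decommission-in-progress.
  seedInitialValues(node);  // illustrative helper

  // 3. add() now sees a decomm node and, correctly, does not increment
  //    nodesInService.  Net effect: nodesInService ends up one lower
  //    for every decomm node that registers.
  stats.add(node);
}
{code}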

The result is that every decomm node permanently decrements the node count 
used for block placement.  When enough nodes are decomm, the replication 
monitor silently stops working, with no logging: it searches all nodes and 
just gives up.  Eventually, all block allocation will also completely fail.  
No files can be created.  No jobs can be submitted.
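
To illustrate why placement gives up: the default policy caps a target's load 
at twice the cluster-average transceiver count, and the average is computed 
over nodesInService.  A simplified sketch (condensed from the considerLoad 
check in BlockPlacementPolicyDefault; names approximate the real code):

{code:java}
// Simplified from the considerLoad check in BlockPlacementPolicyDefault.
boolean passesLoadCheck(DatanodeDescriptor node, Stats stats) {
  double avgLoad = 0;
  if (stats.nodesInService != 0) {
    avgLoad = (double) stats.nodesInServiceXceiverCount
        / stats.nodesInService;
  }
  final double maxLoad = 2.0 * avgLoad;
  // Every decomm registration shrinks nodesInService by one.  Once it
  // reaches 0 (or goes negative), maxLoad drops to <= 0 and any node
  // with active transceivers is rejected as overloaded, so placement
  // walks the whole cluster and finds no valid target.
  return node.getXceiverCount() <= maxLoad;
}
{code}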



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
