Ming Ma created HDFS-7150:
-----------------------------

             Summary: MissingBlocks > 0 when all replicas are on decomm-in-progress nodes
                 Key: HDFS-7150
                 URL: https://issues.apache.org/jira/browse/HDFS-7150
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Ming Ma


Our clusters recently raised a false alert: the NN metric MissingBlocks was > 0 
even though all replicas of the affected blocks were on decomm-in-progress 
nodes. Normally, when a block has replicas only on decomm-in-progress nodes, it 
is not counted as missing. It turns out this can happen if decomm-in-progress 
nodes lose their heartbeat and then reconnect to the NN. The scenario is as 
follows.

1. Kick off decomm on several nodes across different racks.
2. NN loses heartbeat from 3 decomm-in-progress nodes at around the same time. 
BM's neededReplications is updated as part of the BM.removeStoredBlock process. 
If block A's 3 replicas happen to be on these 3 nodes, block A is moved to 
BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. At this point, 
block A is counted as missing.
3. These 3 nodes reconnect to the NN. However, block A remains in BM's 
neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue until it is replicated to 
other live nodes.
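
The steps above can be illustrated with a small toy model. This is only a 
sketch under simplifying assumptions: StaleMissingBlockSketch and all of its 
methods are hypothetical names, not HDFS classes, and every node in it stands 
for a decomm-in-progress DataNode. It shows how a queue that is updated on 
heartbeat loss but not on reconnect keeps reporting a missing block.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the reported scenario; none of these names are real HDFS APIs. */
final class StaleMissingBlockSketch {
    /** Blocks currently filed in the toy "corrupt/missing" queue. */
    private final Set<String> corruptQueue = new HashSet<>();
    /** Block id -> decomm-in-progress nodes holding a replica. */
    private final Map<String, Set<String>> replicas = new HashMap<>();
    /** Nodes the toy NameNode currently hears heartbeats from. */
    private final Set<String> liveNodes = new HashSet<>();

    void addReplica(String block, String node) {
        replicas.computeIfAbsent(block, b -> new HashSet<>()).add(node);
        liveNodes.add(node);
    }

    /** Step 2: heartbeat lost; blocks with no reachable replica are queued as missing. */
    void nodeLost(String node) {
        liveNodes.remove(node);
        for (String block : replicas.keySet()) {
            if (reachableReplicas(block) == 0) {
                corruptQueue.add(block);
            }
        }
    }

    /** Step 3: the node comes back, but the queue is not re-evaluated. */
    void nodeReconnected(String node) {
        liveNodes.add(node);
    }

    int reachableReplicas(String block) {
        int count = 0;
        for (String node : replicas.getOrDefault(block, Collections.emptySet())) {
            if (liveNodes.contains(node)) {
                count++;
            }
        }
        return count;
    }

    int missingBlocks() {
        return corruptQueue.size();
    }

    public static void main(String[] args) {
        StaleMissingBlockSketch nn = new StaleMissingBlockSketch();
        String[] nodes = {"dn1", "dn2", "dn3"};
        for (String n : nodes) nn.addReplica("blockA", n);
        for (String n : nodes) nn.nodeLost(n);
        for (String n : nodes) nn.nodeReconnected(n);
        // All three replicas are reachable again, yet the block is still counted.
        System.out.println("reachable=" + nn.reachableReplicas("blockA")
                + " missing=" + nn.missingBlocks());
    }
}
```

In this model the stale entry would only disappear if something re-evaluated 
the block, which in the report happens only once block A is replicated to 
other live nodes.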

HDFS-7128 will mitigate the issue by making decommission faster, but the 
correctness problem should still be fixed. When decomm-in-progress nodes 
reconnect to the NN, their blocks should be moved out of BM's 
neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. This will also give 
replication of these blocks higher priority.
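
One way to picture the proposed behavior is a queue that re-files a block 
whenever a reconnecting node reports it again. The following is a minimal 
sketch, not a patch: ReconnectRequeueSketch, Level, levelFor and 
onReplicaReported are hypothetical names, and the three-level rule is an 
assumption made for illustration rather than the actual BM priority logic.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy re-queueing on reconnect; all names here are hypothetical, not HDFS APIs. */
final class ReconnectRequeueSketch {
    /** Simplified stand-ins for the replication priority queues. */
    enum Level { HIGHEST_PRIORITY, NORMAL, CORRUPT }

    /** Block id -> queue the block is currently filed under. */
    private final Map<String, Level> queued = new HashMap<>();

    /** Assumed rule: a block is missing only if no replica of any kind is reachable. */
    static Level levelFor(int liveReplicas, int decommInProgressReplicas) {
        if (liveReplicas == 0 && decommInProgressReplicas == 0) {
            return Level.CORRUPT;
        }
        if (liveReplicas == 0) {
            // Only decomm-in-progress copies remain: replicate urgently.
            return Level.HIGHEST_PRIORITY;
        }
        return Level.NORMAL;
    }

    /** Re-evaluate and re-file a block each time its replica counts change. */
    Level onReplicaReported(String block, int liveReplicas, int decommInProgressReplicas) {
        Level level = levelFor(liveReplicas, decommInProgressReplicas);
        queued.put(block, level);
        return level;
    }

    long missingBlocks() {
        return queued.values().stream().filter(l -> l == Level.CORRUPT).count();
    }

    public static void main(String[] args) {
        ReconnectRequeueSketch q = new ReconnectRequeueSketch();
        // Heartbeats lost from all three decomm-in-progress nodes.
        q.onReplicaReported("blockA", 0, 0);
        System.out.println("after loss: " + q.missingBlocks());
        // The nodes reconnect and report block A again.
        q.onReplicaReported("blockA", 0, 3);
        System.out.println("after reconnect: " + q.missingBlocks());
    }
}
```

Under these assumptions the block leaves the corrupt queue as soon as the 
nodes report it again, and it lands in a higher-priority queue, which matches 
the intent described above.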



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
