Ming Ma created HDFS-7150:
-----------------------------

             Summary: MissingBlocks > 0 when all replicas are on decomm-in-progress nodes
                 Key: HDFS-7150
                 URL: https://issues.apache.org/jira/browse/HDFS-7150
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Ming Ma

Our clusters recently raised a false alert: the NN metric MissingBlocks was > 0 while all replicas of the affected blocks were on decomm-in-progress nodes. Normally, when a block has replicas only on decomm-in-progress nodes, it is not counted as missing. It turns out this can happen when decomm-in-progress nodes lose their heartbeat and then reconnect to the NN. The scenario is the following.

1. Kick off decomm on several nodes across different racks.
2. The NN loses the heartbeat from 3 decomm-in-progress nodes at around the same time. BM's neededReplications is updated as part of the BM.removeStoredBlock process. If block A's 3 replicas happen to be on these 3 nodes, block A is moved to BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. At this point, block A is counted as missing.
3. These 3 nodes reconnect to the NN. However, block A remains in BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue until it has been replicated to other live nodes.

The issue will be mitigated by HDFS-7128, which makes decommission faster, but it is better to fix the correctness issue itself. When decomm-in-progress nodes reconnect to the NN, their blocks should be moved out of BM's neededReplications.QUEUE_WITH_CORRUPT_BLOCKS queue. This will also give replication of these blocks higher priority.
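
The three steps above can be reproduced with a small toy model. This is only a sketch of the queueing behaviour described in this report, not Hadoop's BlockManager code: the class DecommMissingBlockSketch, its methods, the HIGHEST_PRIORITY queue name and the requeueOnReconnect switch are all hypothetical names introduced for illustration, and the model assumes every node is decomm-in-progress and that a block's queue depends only on how many of its replica holders are currently reachable.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the scenario in this report; NOT Hadoop code. */
public class DecommMissingBlockSketch {
    // Only WITH_CORRUPT_BLOCKS is named in the report; HIGHEST_PRIORITY is an assumed
    // stand-in for "needs replication urgently but is not counted as missing".
    enum Queue { HIGHEST_PRIORITY, WITH_CORRUPT_BLOCKS }

    // Every node in this model is assumed to be decomm-in-progress.
    private final Map<String, Boolean> nodeAlive = new HashMap<>();
    private final Map<String, Set<String>> blockToNodes = new HashMap<>();
    private final Map<String, Queue> neededReplications = new HashMap<>();
    // Hypothetical switch: true models the fix proposed in this report.
    private final boolean requeueOnReconnect;

    DecommMissingBlockSketch(boolean requeueOnReconnect) {
        this.requeueOnReconnect = requeueOnReconnect;
    }

    void addBlock(String block, String... nodes) {
        Set<String> holders = new HashSet<>();
        for (String node : nodes) {
            nodeAlive.put(node, true);
            holders.add(node);
        }
        blockToNodes.put(block, holders);
        requeue(block);
    }

    /** Step 2: losing a heartbeat re-evaluates the blocks stored on that node. */
    void heartbeatLost(String node) {
        nodeAlive.put(node, false);
        requeueBlocksOn(node);
    }

    /** Step 3: without the proposed fix, a reconnect leaves the queues untouched. */
    void reconnect(String node) {
        nodeAlive.put(node, true);
        if (requeueOnReconnect) {
            requeueBlocksOn(node);
        }
    }

    /** Models the MissingBlocks metric as the size of the corrupt-blocks queue. */
    int missingBlocks() {
        int count = 0;
        for (Queue q : neededReplications.values()) {
            if (q == Queue.WITH_CORRUPT_BLOCKS) {
                count++;
            }
        }
        return count;
    }

    private void requeueBlocksOn(String node) {
        for (Map.Entry<String, Set<String>> e : blockToNodes.entrySet()) {
            if (e.getValue().contains(node)) {
                requeue(e.getKey());
            }
        }
    }

    private void requeue(String block) {
        int reachable = 0;
        for (String node : blockToNodes.get(block)) {
            if (nodeAlive.get(node)) {
                reachable++;
            }
        }
        neededReplications.put(block,
            reachable == 0 ? Queue.WITH_CORRUPT_BLOCKS : Queue.HIGHEST_PRIORITY);
    }

    /** Runs steps 1-3 for block A, whose replicas are on n1, n2 and n3. */
    static DecommMissingBlockSketch runScenario(boolean requeueOnReconnect) {
        DecommMissingBlockSketch bm = new DecommMissingBlockSketch(requeueOnReconnect);
        bm.addBlock("A", "n1", "n2", "n3");
        for (String n : new String[] {"n1", "n2", "n3"}) bm.heartbeatLost(n);
        for (String n : new String[] {"n1", "n2", "n3"}) bm.reconnect(n);
        return bm;
    }

    public static void main(String[] args) {
        System.out.println("no requeue on reconnect: MissingBlocks = "
            + runScenario(false).missingBlocks());
        System.out.println("requeue on reconnect:    MissingBlocks = "
            + runScenario(true).missingBlocks());
    }
}
```

In this model, block A stays in the corrupt-blocks queue after all three nodes come back unless the reconnect path re-evaluates it, which is the behaviour change this issue asks for.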