ZanderXu commented on PR #5583: URL: https://github.com/apache/hadoop/pull/5583#issuecomment-1519301097
@Hexiaoqiao @ayushtkn Master, thanks for your comments. I will try to explain this problem clearly.

First, we can reproduce the problem with the following steps. Suppose there is a cluster with one Active NameNode, one Standby NameNode, and three datanodes (DN1, DN2 and DN3).

1. The client creates a file with three replicas, writes some data and closes it. (Suppose the file has only one block, blk_1024_1001.)
2. The client appends some data to this file and closes it. (The file still has only one block, now blk_1024_1002.)
3. The client appends some data to this file again and closes it. (The file still has only one block, now blk_1024_1003.)
4. The Standby is unstable, and it replays all edits first, i.e. the edits for blk_1024_1001, blk_1024_1002 and blk_1024_1003.
5. The Standby then receives blockReceivedAndDeleted requests from the datanodes and processes them in this order:
   [ (DN1, blk_1024_1001), (DN2, blk_1024_1001), (DN3, blk_1024_1001),
     (DN1, blk_1024_1002), (DN2, blk_1024_1002), (DN3, blk_1024_1002),
     (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003) ]
6. The Standby NameNode puts the report messages for blk_1024_1001 and blk_1024_1002 into PendingDataNodeMessage. The GS of the stored block is 1003, and 1001 and 1002 are both less than 1003, so the Standby considers that these reported replicas may be corrupt and only queues the messages.
7. At this point, the block status in the Standby NameNode is as follows:
   PendingDataNodeMessage: (DN1, blk_1024_1001), (DN2, blk_1024_1001), (DN3, blk_1024_1001), (DN1, blk_1024_1002), (DN2, blk_1024_1002), (DN3, blk_1024_1002)
   BlockMap: (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003)
8. At this point, the block status in the Active NameNode is normal:
   BlockMap: (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003)
9. HA failover happens: Active -> Standby, Standby -> Active. While starting the Active services, the NameNode processes all messages in PendingDataNodeMessage.
Because GS 1001 and 1002 are less than 1003, the NameNode marks the replicas in these pending messages as corrupt and puts them into the corruptReplicas list.
10. At this point, the block status in the (new) Active NameNode is as follows:
   CorruptReplicas: (DN1, blk_1024_1001), (DN2, blk_1024_1001), (DN3, blk_1024_1001), (DN1, blk_1024_1002), (DN2, blk_1024_1002), (DN3, blk_1024_1002)
   BlockMap: (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003)
11. The Active NameNode tries to remove such invalid corrupt entries while processing a block report or a blockReceived report, if the datanode has reported a healthy replica:

```
// add block to the datanode
AddBlockResult result = storageInfo.addBlock(storedBlock, reportedBlock);

int curReplicaDelta;
if (result == AddBlockResult.ADDED) {
  curReplicaDelta =
      (node.isDecommissioned() || node.isDecommissionInProgress()) ? 0 : 1;
  if (logEveryBlock) {
    blockLog.info("BLOCK* addStoredBlock: {} is added to {} (size={})",
        node, storedBlock, storedBlock.getNumBytes());
  }
} else if (result == AddBlockResult.REPLACED) {
  curReplicaDelta = 0;
  blockLog.warn("BLOCK* addStoredBlock: block {} moved to storageType "
      + "{} on node {}", storedBlock, storageInfo.getStorageType(), node);
} else {
  // if the same block is added again and the replica was corrupt
  // previously because of a wrong gen stamp, remove it from the
  // corrupt block list.
  corruptReplicas.removeFromCorruptReplicasMap(block, node,
      Reason.GENSTAMP_MISMATCH);
  curReplicaDelta = 0;
  blockLog.debug("BLOCK* addStoredBlock: Redundant addStoredBlock request"
      + " received for {} on node {} size {}", storedBlock, node,
      storedBlock.getNumBytes());
}
```

When I say that a pending message, or "the corrupted block", is invalid, I mean that the datanode has already reported a healthy replica, such as blk_1024_1003, so the pending messages from that datanode with a smaller GS, such as blk_1024_1001 and blk_1024_1002, are invalid.
So the NameNode can judge whether these pending messages are valid according to the status of the stored block.
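
To make the scenario above concrete, here is a small toy model of it. This is only a sketch based on my reading of the steps in this comment: `StaleReportSketch`, `Report`, `receiveOnStandby` and `drainPending` are hypothetical names that do not exist in Hadoop, and the real BlockManager logic is considerably more involved. It only illustrates the idea that a queued report with an older GS can be skipped when the same datanode has already reported a replica with the stored GS.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the failover scenario; all names here are hypothetical, not Hadoop APIs. */
public class StaleReportSketch {

  /** One replica report: (datanode, block id, generation stamp). */
  static final class Report {
    final String datanode;
    final long blockId;
    final long genStamp;

    Report(String datanode, long blockId, long genStamp) {
      this.datanode = datanode;
      this.blockId = blockId;
      this.genStamp = genStamp;
    }

    @Override
    public String toString() {
      return "(" + datanode + ", blk_" + blockId + "_" + genStamp + ")";
    }
  }

  final long storedGenStamp;                          // GS known after replaying edits
  final List<Report> pending = new ArrayList<>();     // stands in for PendingDataNodeMessage
  final Set<String> healthy = new LinkedHashSet<>();  // datanodes with a replica at the stored GS
  final List<Report> corrupt = new ArrayList<>();     // stands in for the corruptReplicas list

  StaleReportSketch(long storedGenStamp) {
    this.storedGenStamp = storedGenStamp;
  }

  /** Standby side: queue reports with an older GS, accept reports matching the stored GS. */
  void receiveOnStandby(Report r) {
    if (r.genStamp < storedGenStamp) {
      pending.add(r);
    } else if (r.genStamp == storedGenStamp) {
      healthy.add(r.datanode);
    }
    // Reports with a newer GS are outside this scenario and are ignored by the toy model.
  }

  /**
   * Failover: drain the queue. When skipIfHealthy is true, a queued report is dropped
   * if its datanode already holds a replica at the stored GS; otherwise it is marked corrupt.
   */
  void drainPending(boolean skipIfHealthy) {
    for (Report r : pending) {
      if (skipIfHealthy && healthy.contains(r.datanode)) {
        continue; // stale message, nothing to mark
      }
      corrupt.add(r);
    }
    pending.clear();
  }

  /** Replays steps 4 to 7: edits are already applied (GS 1003), then nine reports arrive in order. */
  static StaleReportSketch replayScenario() {
    StaleReportSketch nn = new StaleReportSketch(1003);
    for (long gs = 1001; gs <= 1003; gs++) {
      for (String dn : new String[] {"DN1", "DN2", "DN3"}) {
        nn.receiveOnStandby(new Report(dn, 1024, gs));
      }
    }
    return nn;
  }

  public static void main(String[] args) {
    StaleReportSketch withoutCheck = replayScenario();
    withoutCheck.drainPending(false);
    System.out.println("without check, corrupt = " + withoutCheck.corrupt);

    StaleReportSketch withCheck = replayScenario();
    withCheck.drainPending(true);
    System.out.println("with check, corrupt = " + withCheck.corrupt);
  }
}
```

In this toy model the first run ends with the six older-GS reports in `corrupt` (matching step 10), while the second run leaves `corrupt` empty because DN1, DN2 and DN3 have each already reported blk_1024_1003.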