ZanderXu commented on PR #5583:
URL: https://github.com/apache/hadoop/pull/5583#issuecomment-1519301097

   @Hexiaoqiao @ayushtkn Thanks for your comments. Let me try to explain this problem clearly.
   
   First, we can reproduce this problem through the following steps.
   Suppose there is a cluster with one Active NameNode, one Standby NameNode, and three DataNodes (DN1, DN2 and DN3).
   
   1. The client creates a file with three replicas, writes some data and closes it (suppose this file has only one block, blk_1024_1001).
   2. The client appends some data to this file and closes it (the file still has only one block, now blk_1024_1002).
   3. The client appends some data to this file again and closes it (the file still has only one block, now blk_1024_1003).
   4. The Standby is unstable, and it first replays all edits, i.e. the edits for blk_1024_1001, blk_1024_1002 and blk_1024_1003.
   5. Then the Standby receives some blockReceivedAndDeleted requests from the DataNodes and processes them in this order: [ (DN1, blk_1024_1001), (DN2, blk_1024_1001), (DN3, blk_1024_1001), (DN1, blk_1024_1002), (DN2, blk_1024_1002), (DN3, blk_1024_1002), (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003) ]
   6. The Standby NameNode puts the report messages for blk_1024_1001 and blk_1024_1002 into PendingDataNodeMessage. The GS of the stored block is 1003, and 1001 and 1002 are both less than 1003, so the Standby considers that these reports may be corrupt and queues them instead of applying them.
   7. At this point, the block status in the Standby NameNode is as follows:
   PendingDataNodeMessage:  (DN1, blk_1024_1001), (DN2, blk_1024_1001), (DN3, 
blk_1024_1001), (DN1, blk_1024_1002), (DN2, blk_1024_1002), (DN3, blk_1024_1002)
   BlockMap: (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003)
   8. At this point, the block status in the Active NameNode is normal. BlockMap: (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003)
   9. An HA failover happens: Active -> Standby, Standby -> Active. While starting the Active service, the NameNode processes all messages in PendingDataNodeMessage. Because GS 1001 and 1002 are less than 1003, the NameNode marks these pending messages as corrupt blocks and puts the replicas into the corruptReplicas list.
   10. At this point, the block status in the new Active NameNode is as follows:
   CorruptReplicas:  (DN1, blk_1024_1001), (DN2, blk_1024_1001), (DN3, 
blk_1024_1001), (DN1, blk_1024_1002), (DN2, blk_1024_1002), (DN3, blk_1024_1002)
   BlockMap: (DN1, blk_1024_1003), (DN2, blk_1024_1003), (DN3, blk_1024_1003)
   11. The Active NameNode will try to remove such invalid corrupt-replica entries while processing a block report or a blockReceived report, if the DataNode has reported a healthy replica.
   
   ```
   // add block to the datanode
       AddBlockResult result = storageInfo.addBlock(storedBlock, reportedBlock);
   
       int curReplicaDelta;
       if (result == AddBlockResult.ADDED) {
         curReplicaDelta =
             (node.isDecommissioned() || node.isDecommissionInProgress()) ? 0 : 1;
         if (logEveryBlock) {
           blockLog.info("BLOCK* addStoredBlock: {} is added to {} (size={})",
               node, storedBlock, storedBlock.getNumBytes());
         }
       } else if (result == AddBlockResult.REPLACED) {
         curReplicaDelta = 0;
         blockLog.warn("BLOCK* addStoredBlock: block {} moved to storageType " +
             "{} on node {}", storedBlock, storageInfo.getStorageType(), node);
       } else {
         // if the same block is added again and the replica was corrupt
         // previously because of a wrong gen stamp, remove it from the
         // corrupt block list.
         corruptReplicas.removeFromCorruptReplicasMap(block, node,
             Reason.GENSTAMP_MISMATCH);
         curReplicaDelta = 0;
         blockLog.debug("BLOCK* addStoredBlock: Redundant addStoredBlock request"
                 + " received for {} on node {} size {}", storedBlock, node,
             storedBlock.getNumBytes());
       }
   ```
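
To make steps 5 to 10 concrete, here is a small toy model of the sequence. It is only a sketch and not Hadoop code: `StaleReportRepro`, `Report`, `receiveOnStandby` and `becomeActive` are hypothetical names, and the three lists only stand in for PendingDataNodeMessage, the BlockMap and corruptReplicas as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of steps 5-10 above. Not Hadoop code; every name here is hypothetical.
class StaleReportRepro {
  /** One (datanode, generation stamp) report for block id 1024. */
  static final class Report {
    final String datanode;
    final long genStamp;

    Report(String datanode, long genStamp) {
      this.datanode = datanode;
      this.genStamp = genStamp;
    }

    @Override
    public String toString() {
      return "(" + datanode + ", blk_1024_" + genStamp + ")";
    }
  }

  final long storedGenStamp;                        // GS known from the replayed edits
  final List<Report> pending = new ArrayList<>();   // stands in for PendingDataNodeMessage
  final List<Report> blockMap = new ArrayList<>();  // stands in for the BlockMap
  final List<Report> corruptReplicas = new ArrayList<>();

  StaleReportRepro(long storedGenStamp) {
    this.storedGenStamp = storedGenStamp;
  }

  /** Steps 5-6: the standby queues any report whose GS is smaller than the stored GS. */
  void receiveOnStandby(Report r) {
    if (r.genStamp < storedGenStamp) {
      pending.add(r);
    } else {
      blockMap.add(r);
    }
  }

  /** Step 9: on failover, every queued report with a smaller GS is marked corrupt. */
  void becomeActive() {
    for (Report r : pending) {
      if (r.genStamp < storedGenStamp) {
        corruptReplicas.add(r);
      }
    }
    pending.clear();
  }

  /** Replays the report order from step 5 and then performs the failover. */
  static StaleReportRepro run() {
    StaleReportRepro nn = new StaleReportRepro(1003L);
    for (long gs : new long[] {1001L, 1002L, 1003L}) {
      for (String dn : Arrays.asList("DN1", "DN2", "DN3")) {
        nn.receiveOnStandby(new Report(dn, gs));
      }
    }
    nn.becomeActive();
    return nn;
  }

  public static void main(String[] args) {
    StaleReportRepro nn = run();
    System.out.println("CorruptReplicas: " + nn.corruptReplicas);
    System.out.println("BlockMap: " + nn.blockMap);
  }
}
```

In this model all six reports with GS 1001 and 1002 end up in `corruptReplicas`, even though every DataNode also has an entry with GS 1003 in `blockMap`.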
   
   
   When I said that the pending message, or "the corrupted block", is invalid, I meant that the DataNode has already reported a healthy replica of this block, such as blk_1024_1003, so the pending messages from that DataNode with a smaller GS, such as blk_1024_1001 and blk_1024_1002, are invalid.
   
   So the NameNode can judge whether these pending messages are valid according to the status of the stored block.
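
As a rough illustration of that idea (a sketch under my own assumptions, not the actual patch in this PR): a pending message could be treated as stale when its GS is smaller than the stored GS and the same DataNode already has a replica with the stored GS. `PendingMessageFilter`, `isStale` and `keepValid` are hypothetical names, and each pending message is modeled simply as a (datanode, GS) pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the validity check; not Hadoop code and not the PR's patch.
class PendingMessageFilter {
  /**
   * A pending report is stale if its GS is smaller than the stored GS and the
   * same datanode has already reported a replica with the stored GS.
   */
  static boolean isStale(long reportedGenStamp, long storedGenStamp,
      boolean nodeHasCurrentReplica) {
    return nodeHasCurrentReplica && reportedGenStamp < storedGenStamp;
  }

  /** Keeps only the pending (datanode, GS) messages that are not stale. */
  static List<Map.Entry<String, Long>> keepValid(
      List<Map.Entry<String, Long>> pending, long storedGenStamp,
      Set<String> nodesWithCurrentReplica) {
    List<Map.Entry<String, Long>> valid = new ArrayList<>();
    for (Map.Entry<String, Long> msg : pending) {
      boolean hasCurrent = nodesWithCurrentReplica.contains(msg.getKey());
      if (!isStale(msg.getValue(), storedGenStamp, hasCurrent)) {
        valid.add(msg);
      }
    }
    return valid;
  }

  static List<Map.Entry<String, Long>> demo() {
    List<Map.Entry<String, Long>> pending = new ArrayList<>();
    for (long gs : new long[] {1001L, 1002L}) {
      for (String dn : Arrays.asList("DN1", "DN2", "DN3")) {
        pending.add(new SimpleEntry<String, Long>(dn, Long.valueOf(gs)));
      }
    }
    // Assumption for contrast only: DN1 and DN2 have already reported
    // blk_1024_1003, while DN3 has not, so DN3's messages are kept.
    Set<String> current = new HashSet<>(Arrays.asList("DN1", "DN2"));
    return keepValid(pending, 1003L, current);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In the scenario above all three DataNodes have reported blk_1024_1003, so every pending message with GS 1001 or 1002 would be dropped rather than added to corruptReplicas. The demo leaves DN3 out only to show that messages from a DataNode without a healthy replica are still kept.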


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

