Ming Ma created HDFS-7269:
-----------------------------

             Summary: NN and DN don't check whether corrupted blocks reported by clients are actually corrupted
                 Key: HDFS-7269
                 URL: https://issues.apache.org/jira/browse/HDFS-7269
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma
We had a case where the client machine had a memory issue and therefore failed checksum validation of a given block against all of its replicas. The client ended up informing the NN about the "corrupted" replicas on all DNs via reportBadBlocks (see the schematic sketch below). However, the block isn't actually corrupted on any of the DNs, and DFSClient can still read it from a healthy machine. But to get rid of the NN's corrupt-block warning, we currently have to either fail over the NN or repair the file manually: a) copy the file somewhere else, b) remove the file, c) copy the file back.

It would be useful if the NN and DN could validate the client's report. In fact, there is a comment in NamenodeRpcServer about this:

{noformat}
  /**
   * The client has detected an error on the specified located blocks
   * and is reporting them to the server. For now, the namenode will
   * mark the block as corrupt. In the future we might
   * check the blocks are actually corrupt.
   */
{noformat}

To let the system recover quickly from an invalid client report, we could support automatic recovery, a manual admin command, or both:

1. Have the NN send a new DatanodeCommand, e.g. ValidateBlockCommand (sketched after this list). The DN would re-verify the replica and notify the NN of the result via IBR, using a new ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK.
2. Add a new admin command to move such blocks out of the BlockManager's CorruptReplicasMap and UnderReplicatedBlocks.

Appreciate any input.
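For context, here is a schematic sketch of the report path described above. ClientProtocol.reportBadBlocks(LocatedBlock[]) is the real RPC; the driver class and method around it are illustrative only, not actual Hadoop source.

{code:java}
import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

// Illustrative only: shows how a client that fails checksum validation on
// every replica ends up poisoning all of them on the NN side.
public class BadBlockReporter {
  private final ClientProtocol namenode;  // RPC proxy to the NameNode

  public BadBlockReporter(ClientProtocol namenode) {
    this.namenode = namenode;
  }

  // After local checksum validation fails against every replica (which can
  // happen because of bad client memory, not bad replicas), the client
  // reports the block. The NN takes the report at face value and marks each
  // listed replica corrupt without asking any DN to re-verify.
  public void reportSuspectBlock(LocatedBlock block) throws IOException {
    namenode.reportBadBlocks(new LocatedBlock[] { block });
  }
}
{code}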
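For proposal 1, here is a minimal sketch of what the new command could look like. The class, the DNA_VALIDATE action code, and its value are all hypothetical; nothing like this exists in the code base today.

{code:java}
package org.apache.hadoop.hdfs.server.protocol;

import org.apache.hadoop.hdfs.protocol.Block;

// Hypothetical command: asks a DN to re-read the listed replicas and verify
// their checksums, instead of trusting the client's corruption report.
public class ValidateBlockCommand extends DatanodeCommand {
  // Hypothetical new action code; would be defined in DatanodeProtocol
  // alongside DNA_TRANSFER, DNA_INVALIDATE, etc.
  public static final int DNA_VALIDATE = 12;

  private final Block[] blocks;  // replicas the NN wants re-verified

  public ValidateBlockCommand(Block[] blocks) {
    super(DNA_VALIDATE);
    this.blocks = blocks;
  }

  public Block[] getBlocks() {
    return blocks;
  }
}
{code}

On the DN side, the command handler could reuse the block scanner's checksum verification and then queue something like new ReceivedDeletedBlockInfo(block, BlockStatus.VALIDATED_BLOCK, null), so the result reaches the NN in the next incremental block report and the NN can clear the entry from CorruptReplicasMap when the replica checks out.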