Ming Ma created HDFS-7269:
-----------------------------

             Summary: NN and DN don't check whether corrupted blocks reported by clients are actually corrupted
                 Key: HDFS-7269
                 URL: https://issues.apache.org/jira/browse/HDFS-7269
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ming Ma


We had a case where the client machine had a memory issue and therefore failed 
checksum validation of a given block against all of its replicas. The client 
then reported the block as corrupt on every DN via reportBadBlocks. However, 
the block wasn't actually corrupt on any of the DNs; DFSClient could still 
read it. To clear the NN's warning message about the corrupt block, we had to 
either fail over the NN or repair the file by a) copying the file elsewhere, 
b) removing the file, and c) copying it back.

It would be useful if the NN and DNs could validate the client's report. In 
fact, there is a comment in NamenodeRpcServer about this:

{noformat}
  /**
   * The client has detected an error on the specified located blocks 
   * and is reporting them to the server.  For now, the namenode will 
   * mark the block as corrupt.  In the future we might 
   * check the blocks are actually corrupt. 
   */
{noformat}

To let the system recover quickly from an invalid client report, we could 
support automatic recovery or a manual admin command:

1. Have the NN send a new DatanodeCommand such as ValidateBlockCommand. The DN 
would report the validation result via an IBR with a new 
ReceivedDeletedBlockInfo.BlockStatus.VALIDATED_BLOCK.
2. Add a new admin command to move the reported blocks out of the 
BlockManager's CorruptReplicasMap and UnderReplicatedBlocks.
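For option 1, the DN-side check could boil down to re-reading the replica and comparing a recomputed checksum against the stored one; only if they disagree is the client's report confirmed. A minimal sketch of that comparison, using plain CRC32 as a stand-in for HDFS's per-chunk checksums (the class and method names here are illustrative, not actual HDFS APIs):

```java
import java.util.zip.CRC32;

// Hypothetical sketch of the check a ValidateBlockCommand could trigger on a DN:
// recompute the replica's checksum and compare it with the stored value.
public class BlockValidator {

    /** Returns true if the replica's data no longer matches its stored checksum. */
    public static boolean isReplicaCorrupt(byte[] blockData, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue() != storedChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "replica-bytes".getBytes();

        // Checksum recorded when the replica was written.
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long stored = crc.getValue();

        // Intact replica: the client's corruption report would be refuted,
        // and the block could leave CorruptReplicasMap.
        System.out.println(isReplicaCorrupt(data, stored));

        // Flip a byte to simulate genuine on-disk corruption: the report
        // would be confirmed and normal replication repair would proceed.
        data[0] ^= 0xFF;
        System.out.println(isReplicaCorrupt(data, stored));
    }
}
```

If validation refutes the report, the NN could drop the replica from CorruptReplicasMap instead of scheduling invalidation and re-replication.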

Appreciate any input.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
