[ https://issues.apache.org/jira/browse/HDFS-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026085#comment-14026085 ]

Gordon Wang commented on HDFS-6505:
-----------------------------------

This issue causes the last block to be reported as missing and the file to be 
marked as corrupt. But actually, the data on the DataNode is correct.

I went through the code, and I think a safety check is missing when the 
namenode receives a bad-block report from a datanode.
See the following code snippet in the namenode's BlockManager:
{code}
  public void findAndMarkBlockAsCorrupt(final ExtendedBlock blk,
      final DatanodeInfo dn, String storageID, String reason)
      throws IOException {
    assert namesystem.hasWriteLock();
    final BlockInfo storedBlock = getStoredBlock(blk.getLocalBlock());
    if (storedBlock == null) {
      // Check if the replica is in the blockMap, if not
      // ignore the request for now. This could happen when BlockScanner
      // thread of Datanode reports bad block before Block reports are sent
      // by the Datanode on startup
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: "
          + blk + " not found");
      return;
    }
    markBlockAsCorrupt(new BlockToMarkCorrupt(storedBlock, reason,
        Reason.CORRUPTION_REPORTED), dn, storageID);
  }
{code} 
We should compare the generation stamp (the timestamp) of the reported block 
with that of the stored block. If the reported block carries an older 
generation stamp, the block should not be marked as corrupt. The reported 
stamp can legitimately be stale when the client has recovered the write 
pipeline, since pipeline recovery bumps the generation stamp of the stored 
block.
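
A minimal sketch of such a guard, placed after the storedBlock null check 
above (assuming a generation-stamp comparison is the right staleness test; 
the exact placement and log wording are illustrative, not a tested patch):
{code}
    // Both ExtendedBlock and BlockInfo inherit getGenerationStamp() from
    // Block, so the two stamps can be compared directly.
    if (blk.getGenerationStamp() < storedBlock.getGenerationStamp()) {
      // The report refers to a replica from before a pipeline recovery
      // bumped the generation stamp; the stored block is newer, so do
      // not mark it as corrupt.
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk
          + " has an older generation stamp than stored block "
          + storedBlock + ", ignoring the corrupt-replica report");
      return;
    }
{code}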

> Can not close file due to last block is marked as corrupt
> ---------------------------------------------------------
>
>                 Key: HDFS-6505
>                 URL: https://issues.apache.org/jira/browse/HDFS-6505
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Gordon Wang
>
> After appending to a file, the client could not close it, because the 
> namenode could not complete the last block of the file. The 
> under-construction (UC) state of the last block remained COMMITTED and never 
> changed.
> The namenode log looked like this:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> checkFileProgress: blk_1073741920_13948{blockUCState=COMMITTED, 
> primaryNodeIndex=-1,
> replicas=[ReplicaUnderConstruction[172.28.1.2:50010|RBW]]} has not reached 
> minimal replication 1
> {code}
> After going through the namenode log, I found an entry like this:
> {code}
> INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: 
> blk_1073741920 added as corrupt on 172.28.1.2:50010 by sdw3/172.28.1.3 
> because client machine reported it
> {code}
> But actually, the last block was finished successfully on the datanode, 
> because I could find these entries in the datanode log:
> {code}
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: 
> Transmitted BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13808 
> (numBytes=50120352) to /172.28.1.3:50010
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
> /172.28.1.2:36860, dest: /172.28.1.2:50010, bytes: 51686616, op: HDFS_WRITE, 
> cliID: 
> libhdfs3_client_random_741511239_count_1_pid_215802_tid_140085714196576, 
> offset: 0, srvID: DS-2074102060-172.28.1.2-50010-1401432768690, blockid: 
> BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, duration: 
> 189226453336
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: 
> BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, 
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)