[ https://issues.apache.org/jira/browse/HDFS-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026085#comment-14026085 ]
Gordon Wang commented on HDFS-6505:
-----------------------------------

This issue causes the last block of the file to be reported as missing, so the file is marked corrupt, even though the data on the DataNode is actually correct. I went through the code, and I think a safety check is missing when the NameNode receives a bad-block report from a DataNode. See the following snippet from the NameNode's BlockManager:

{code}
  public void findAndMarkBlockAsCorrupt(final ExtendedBlock blk, final
      DatanodeInfo dn, String storageID,
      String reason) throws IOException {
    assert namesystem.hasWriteLock();
    final BlockInfo storedBlock = getStoredBlock(blk.getLocalBlock());
    if (storedBlock == null) {
      // Check if the replica is in the blockMap, if not
      // ignore the request for now. This could happen when BlockScanner
      // thread of Datanode reports bad block before Block reports are sent
      // by the Datanode on startup
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk + " not found");
      return;
    }
    markBlockAsCorrupt(new BlockToMarkCorrupt(storedBlock, reason,
        Reason.CORRUPTION_REPORTED), dn, storageID);
  }
{code}

We should compare the timestamp (generation stamp) of the reported block with that of the stored block. If the reported block carries an older generation stamp, it should not be marked as corrupt. A reported block can legitimately have an older generation stamp when the client has performed pipeline recovery, which bumps the stamp on the live replicas. A sketch of such a check is appended after the quoted issue below.

> Can not close file due to last block is marked as corrupt
> ---------------------------------------------------------
>
>                 Key: HDFS-6505
>                 URL: https://issues.apache.org/jira/browse/HDFS-6505
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Gordon Wang
>
> After appending to a file, the client could not close it, because the NameNode could not complete the file's last block. The under-construction state of the last block remained COMMITTED and never changed.
> The NameNode log looked like this:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* checkFileProgress: blk_1073741920_13948{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[172.28.1.2:50010|RBW]]} has not reached minimal replication 1
> {code}
> Going through the NameNode log, I found an entry like this:
> {code}
> INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1073741920 added as corrupt on 172.28.1.2:50010 by sdw3/172.28.1.3 because client machine reported it
> {code}
> But the last block was actually finished successfully on the DataNode, as these DataNode log entries show:
> {code}
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13808 (numBytes=50120352) to /172.28.1.3:50010
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /172.28.1.2:36860, dest: /172.28.1.2:50010, bytes: 51686616, op: HDFS_WRITE, cliID: libhdfs3_client_random_741511239_count_1_pid_215802_tid_140085714196576, offset: 0, srvID: DS-2074102060-172.28.1.2-50010-1401432768690, blockid: BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, duration: 189226453336
> INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-649434182-172.28.1.251-1401432753616:blk_1073741920_13948, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> {code}
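A minimal sketch of the proposed generation-stamp guard, for illustration only (not a tested patch). It assumes the reported ExtendedBlock still carries the replica's generation stamp; both ExtendedBlock and the stored BlockInfo expose it via getGenerationStamp().

{code}
  public void findAndMarkBlockAsCorrupt(final ExtendedBlock blk, final
      DatanodeInfo dn, String storageID,
      String reason) throws IOException {
    assert namesystem.hasWriteLock();
    final BlockInfo storedBlock = getStoredBlock(blk.getLocalBlock());
    if (storedBlock == null) {
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk + " not found");
      return;
    }

    // Proposed guard (sketch): a report carrying an older generation stamp
    // than the stored block refers to a pre-recovery replica, e.g. one left
    // behind after the client recovered the write pipeline. Such a report
    // is stale and must not corrupt the live block.
    if (blk.getGenerationStamp() < storedBlock.getGenerationStamp()) {
      blockLog.info("BLOCK* findAndMarkBlockAsCorrupt: " + blk
          + " has an older generation stamp than stored block " + storedBlock
          + ", ignoring stale corruption report");
      return;
    }

    markBlockAsCorrupt(new BlockToMarkCorrupt(storedBlock, reason,
        Reason.CORRUPTION_REPORTED), dn, storageID);
  }
{code}

With such a guard, a stale report left over from before pipeline recovery would be logged and dropped, while reports whose generation stamp matches or exceeds the stored block's would still mark the replica as corrupt as before.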