Wei-Chiu Chuang created HDFS-13999:
--------------------------------------

             Summary: Bogus missing block warning if the file is under 
construction when NN starts
                 Key: HDFS-13999
                 URL: https://issues.apache.org/jira/browse/HDFS-13999
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang
         Attachments: webui missing blocks.png

We found an interesting case where the web UI displays a few missing blocks but 
doesn't state which files are corrupt, while fsck reports that the file system 
is healthy. This bug is similar to HDFS-10827 and HDFS-8533. 

 (See the attachment for an example)

Using Dynamometer, I was able to reproduce the bug, and realized that the 
"missing" blocks are actually healthy, but somehow neededReplications doesn't 
get updated when the NN receives block reports. What's more interesting is that 
the files associated with the "missing" blocks are under construction when the 
NN starts, and so after a while the NN prints file recovery log messages.

Given that, I determined the following code is the source of the bug:
{code:java|title=BlockManager#addStoredBlock}
....
    // if file is under construction, then done for now
    if (bc.isUnderConstruction()) {
      return storedBlock;
    }
{code}
That check is wrong, because an under-construction file may have multiple 
blocks, and an earlier block (for example, the first) may already be complete. 
In that case, the neededReplications structure doesn't get updated for the 
complete block, hence the missing block warning on the web UI. The check should 
look at the state of the block itself, not the file.
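
As a rough illustration of the difference between the two checks, here is a toy 
model. It is not the actual BlockManager code: the class MissingBlockSketch, the 
replay method and the simplified "needed" set are made up for this sketch, and 
it assumes every complete block stays queued until a block report for it is 
processed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the early-return check; NOT the real BlockManager. */
final class MissingBlockSketch {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static final class Block {
        final long id;
        final BlockState state;

        Block(long id, BlockState state) {
            this.id = id;
            this.state = state;
        }

        boolean isCompleteOrCommitted() {
            return state != BlockState.UNDER_CONSTRUCTION;
        }
    }

    /**
     * Replays one block report per block of a single file and returns the ids
     * that are still queued as needing replication afterwards.
     *
     * @param fileUnderConstruction whether the owning file is still open
     * @param checkBlockState       true = per-block check, false = per-file check
     */
    static Set<Long> replay(List<Block> blocks,
                            boolean fileUnderConstruction,
                            boolean checkBlockState) {
        // Assumption: at startup every complete block is queued until reported.
        Set<Long> needed = new HashSet<>();
        for (Block b : blocks) {
            if (b.state == BlockState.COMPLETE) {
                needed.add(b.id);
            }
        }
        for (Block b : blocks) {
            boolean doneForNow = checkBlockState
                    ? !b.isCompleteOrCommitted()   // check the block itself
                    : fileUnderConstruction;       // check the whole file
            if (doneForNow) {
                continue; // early return: the queue is never updated
            }
            needed.remove(b.id); // the reported replica satisfies the block
        }
        return needed;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(1L, BlockState.COMPLETE));
        blocks.add(new Block(2L, BlockState.UNDER_CONSTRUCTION));
        System.out.println("file-level check leaves:  " + replay(blocks, true, false));
        System.out.println("block-level check leaves: " + replay(blocks, true, true));
    }
}
```

In this model the file-level check leaves block 1 queued even though a replica 
was reported for it, while the block-level check clears it and only skips the 
block that is still being written.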

Fortunately, it was unintentionally fixed via HDFS-9754:
{code:java}
    // if block is still under construction, then done for now
    if (!storedBlock.isCompleteOrCommitted()) {
      return storedBlock;
    }
{code}
We should bring this fix into branch-2.7 too. That said, this is a harmless 
warning, and it should go away after the under-construction files are recovered 
and the NN restarts (or full block reports are forced).

Kudos to Dynamometer! It would have been impossible to reproduce this bug 
without the tool. And thanks [~smeng] for helping with the reproduction.



