Mingliang Liu created HDFS-11030:
------------------------------------

             Summary: TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
                 Key: HDFS-11030
                 URL: https://issues.apache.org/jira/browse/HDFS-11030
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: datanode, test
    Affects Versions: 2.7.0
            Reporter: Mingliang Liu
            Assignee: Mingliang Liu


TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies the 
blocks and files are replicated correctly.

To fail a volume, the test deletes all the blocks and sets the data dir read-only.
{code}
    // fail the volume
    // delete/make non-writable one of the directories (failed volume)
    data_fail = new File(dataDir, "data3");
    failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
        cluster.getNamesystem().getBlockPoolId());
    if (failedDir.exists() &&
        //!FileUtil.fullyDelete(failedDir)
        !deteteBlocks(failedDir)
        ) {
      throw new IOException("Could not delete hdfs directory '" + failedDir + "'");
    }
    data_fail.setReadOnly();
    failedDir.setReadOnly();
{code}
However, there are two bugs here:
- The {{failedDir}} directory for finalized blocks is not calculated correctly: it should use {{data_fail}} instead of {{dataDir}} as the base directory.
- When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that there are no subdirectories in the data dir. This assumption was also noted in the comments:
{quote}
    // we use only small number of blocks to avoid creating subdirs in the data dir..
{quote}
This is not true: on my local cluster and in MiniDFSCluster, there are two levels of directories ({{subdir0/subdir0/}}) regardless of the number of blocks.

Because of these two bugs, the blocks are never actually deleted.
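
As an illustration only (not a committed patch), a fix for both bugs might look roughly like the following. {{deleteBlocksRecursively()}} is a hypothetical replacement for {{deteteBlocks()}} that descends into the subdir levels, and the finalized dir is derived from {{data_fail}} instead of {{dataDir}}:
{code}
    // fail the volume: derive the finalized dir from the failed volume's
    // data dir (data_fail), not from the parent dataDir
    data_fail = new File(dataDir, "data3");
    failedDir = MiniDFSCluster.getFinalizedDir(data_fail,
        cluster.getNamesystem().getBlockPoolId());
    if (failedDir.exists() && !deleteBlocksRecursively(failedDir)) {
      throw new IOException("Could not delete hdfs directory '" + failedDir + "'");
    }
    data_fail.setReadOnly();
    failedDir.setReadOnly();

  // ...and, elsewhere in the test class, the hypothetical recursive helper:

  /**
   * Delete block and meta files under dir, descending into the
   * subdir0/subdir0/... levels instead of assuming a flat layout.
   * The "blk_" prefix (Block.BLOCK_FILE_PREFIX) matches both block and meta files.
   */
  private boolean deleteBlocksRecursively(File dir) {
    File[] children = dir.listFiles();
    if (children == null) {
      return false;
    }
    boolean ok = true;
    for (File f : children) {
      if (f.isDirectory()) {
        ok &= deleteBlocksRecursively(f);
      } else if (f.getName().startsWith(Block.BLOCK_FILE_PREFIX)) {
        ok &= f.delete();
      }
    }
    return ok;
  }
{code}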

To fail a volume, the test also needs to trigger the DataNode to remove the volume and send a block report to the NameNode. This is basically what the {{triggerFailure()}} method does.
{code}
  /**
   * go to each block on the 2nd DataNode until it fails...
   * @param path
   * @param size
   * @throws IOException
   */
  private void triggerFailure(String path, long size) throws IOException {
    NamenodeProtocols nn = cluster.getNameNodeRpc();
    List<LocatedBlock> locatedBlocks =
      nn.getBlockLocations(path, 0, size).getLocatedBlocks();
    
    for (LocatedBlock lb : locatedBlocks) {
      DatanodeInfo dinfo = lb.getLocations()[1];
      ExtendedBlock b = lb.getBlock();
      try {
        accessBlock(dinfo, lb);
      } catch (IOException e) {
        System.out.println("Failure triggered, on block: " + b.getBlockId() +  
            "; corresponding volume should be removed by now");
        break;
      }
    }
  }
{code}
Accessing those blocks will not trigger failures if the directory is read-only (the block files are all still there). I ran the test multiple times without triggering this failure. Either we have to write new block files to the data directories, or we have to delete the existing blocks correctly.
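
For example, one way to exercise the first option (just a sketch; {{fs}} is assumed to be the test's file system handle, and the file length and replication values are arbitrary) would be to write a new file after the volume is failed, so the DataNode has to touch the read-only volume:
{code}
    // Sketch: force new writes after the volume has been failed so the
    // DataNode actually hits the read-only directories.
    Path newFile = new Path("/test_after_failure.dat");
    DFSTestUtil.createFile(fs, newFile, 1024L, (short) 2, 1L);
    DFSTestUtil.waitReplication(fs, newFile, (short) 2);
{code}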

This unit test has been there for years and seldom fails, simply because it has never triggered a real volume failure.


