[ https://issues.apache.org/jira/browse/HDFS-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587343#comment-15587343 ]
Mingliang Liu commented on HDFS-11030:
--------------------------------------

The code for sending the block report seems complex. It duplicates internal logic of {{BPServiceActor}}, and we may have to update this code if that logic changes. I think {{cluster.triggerBlockReport()}} is a good alternative.

{code:title=TestDataNodeVolumeFailure#testVolumeFailure()}
// make sure a block report is sent
DataNode dn = cluster.getDataNodes().get(1); //corresponds to dir data3
String bpid = cluster.getNamesystem().getBlockPoolId();
DatanodeRegistration dnR = dn.getDNRegistrationForBP(bpid);
Map<DatanodeStorage, BlockListAsLongs> perVolumeBlockLists =
    dn.getFSDataset().getBlockReports(bpid);

// Send block report
StorageBlockReport[] reports =
    new StorageBlockReport[perVolumeBlockLists.size()];
int reportIndex = 0;
for (Map.Entry<DatanodeStorage, BlockListAsLongs> kvPair :
    perVolumeBlockLists.entrySet()) {
  DatanodeStorage dnStorage = kvPair.getKey();
  BlockListAsLongs blockList = kvPair.getValue();
  reports[reportIndex++] = new StorageBlockReport(dnStorage, blockList);
}
cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
    new BlockReportContext(1, 0, System.nanoTime(), 0, false));
{code}

> TestDataNodeVolumeFailure#testVolumeFailure is flaky (though passing)
> ---------------------------------------------------------------------
>
>                 Key: HDFS-11030
>                 URL: https://issues.apache.org/jira/browse/HDFS-11030
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, test
>    Affects Versions: 2.7.0
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>
> TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies
> that the blocks and files are replicated correctly.
> To fail a volume, it deletes all the blocks and sets the data dir read only.
> {code:title=testVolumeFailure() snippet}
> // fail the volume
> // delete/make non-writable one of the directories (failed volume)
> data_fail = new File(dataDir, "data3");
> failedDir = MiniDFSCluster.getFinalizedDir(dataDir,
>     cluster.getNamesystem().getBlockPoolId());
> if (failedDir.exists() &&
>     //!FileUtil.fullyDelete(failedDir)
>     !deteteBlocks(failedDir)
>     ) {
>   throw new IOException("Could not delete hdfs directory '" + failedDir +
>       "'");
> }
> data_fail.setReadOnly();
> failedDir.setReadOnly();
> {code}
> However, there are two bugs here, which leave the blocks undeleted.
> # The {{failedDir}} directory for finalized blocks is not calculated
> correctly. It should use {{data_fail}} instead of {{dataDir}} as the base
> directory.
> # When deleting block files in {{deteteBlocks(failedDir)}}, it assumes that
> there are no subdirectories in the data dir. This assumption was also noted
> in the comments:
> {quote}
> // we use only small number of blocks to avoid creating subdirs in the
> data dir..
> {quote}
> This is not true. On my local cluster and in MiniDFSCluster, there will be
> a two-level subdir0/subdir0/ directory structure regardless of the number
> of blocks.
> Meanwhile, to fail a volume, the test also needs to trigger the DataNode to
> remove the volume and to send a block report to the NN. This is basically
> done in the {{triggerFailure()}} method.
> {code}
>   private void triggerFailure(String path, long size) throws IOException {
>     NamenodeProtocols nn = cluster.getNameNodeRpc();
>     List<LocatedBlock> locatedBlocks =
>         nn.getBlockLocations(path, 0, size).getLocatedBlocks();
>
>     for (LocatedBlock lb : locatedBlocks) {
>       DatanodeInfo dinfo = lb.getLocations()[1];
>       ExtendedBlock b = lb.getBlock();
>       try {
>         accessBlock(dinfo, lb);
>       } catch (IOException e) {
>         System.out.println("Failure triggered, on block: " + b.getBlockId() +
>             "; corresponding volume should be removed by now");
>         break;
>       }
>     }
>   }
> {code}
> Accessing those blocks will not trigger failures if the directory is
> read-only (while the block files are all there). I ran the tests multiple
> times without triggering this failure. We have to write new block files to
> the data directories, or we should have deleted the blocks correctly.
> This unit test has been there for years and it seldom fails, just because
> it has never triggered a real volume failure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
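The fix for bug #2 in the issue above amounts to deleting block files recursively instead of listing a single directory level. The following is a minimal sketch in plain Java (standard library only, not Hadoop's {{FileUtil}}); the class and method names are hypothetical, and the {{blk_}} file-name prefix is the convention the DataNode uses for block and meta files:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeleteBlocksRecursively {
  /**
   * Delete every block file (blk_* data files and their .meta companions)
   * under dir, descending into nested subdir0/subdir0/... levels as well.
   * Returns the number of files deleted.
   */
  static int deleteBlockFiles(Path dir) throws IOException {
    try (Stream<Path> tree = Files.walk(dir)) {
      // Collect first, then delete, so we never mutate a directory
      // while the walk is still iterating over it.
      List<Path> blockFiles = tree
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().startsWith("blk_"))
          .collect(Collectors.toList());
      for (Path p : blockFiles) {
        Files.delete(p);
      }
      return blockFiles.size();
    }
  }

  public static void main(String[] args) throws IOException {
    // Simulate a finalized dir with the two-level subdir layout the
    // issue describes (subdir0/subdir0/).
    Path root = Files.createTempDirectory("finalized");
    Path nested =
        Files.createDirectories(root.resolve("subdir0").resolve("subdir0"));
    Files.createFile(nested.resolve("blk_1073741825"));
    Files.createFile(nested.resolve("blk_1073741825_1001.meta"));
    Files.createFile(root.resolve("blk_1073741826"));
    System.out.println(deleteBlockFiles(root));
  }
}
```

A single-level listing would miss the two files under subdir0/subdir0/, which is exactly why {{deteteBlocks(failedDir)}} silently leaves blocks behind.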
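For bug #1 above, the finalized directory must be derived from the failed volume's own root ({{data_fail}}), not from the parent {{dataDir}}. A hedged sketch of the path arithmetic in plain Java, assuming the standard DataNode storage layout {{<volume>/current/<bpid>/current/finalized}} that {{MiniDFSCluster.getFinalizedDir()}} encodes; the helper and the block pool id are illustrative, not Hadoop API:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FinalizedDirDemo {
  // Mirrors the assumed MiniDFSCluster.getFinalizedDir() layout:
  // <storageDir>/current/<bpid>/current/finalized
  static Path finalizedDir(Path storageDir, String bpid) {
    return storageDir.resolve("current").resolve(bpid)
        .resolve("current").resolve("finalized");
  }

  public static void main(String[] args) {
    String bpid = "BP-1-127.0.0.1-1";           // hypothetical block pool id
    Path dataDir = Paths.get("/tmp/dfs/data");  // parent of all volumes
    Path dataFail = dataDir.resolve("data3");   // the volume being failed

    // Buggy: based on the parent of all volumes, so it names a path
    // that belongs to no real volume and the deletion is a no-op.
    System.out.println(finalizedDir(dataDir, bpid));
    // Fixed: based on the failing volume itself.
    System.out.println(finalizedDir(dataFail, bpid));
  }
}
```

The buggy variant never matches an actual volume's finalized directory, so {{failedDir.exists()}} can be false and the whole delete-and-verify branch is skipped.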