[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184788#comment-13184788 ]
Hudson commented on HBASE-5163: ------------------------------- Integrated in HBase-0.92-security #72 (See [https://builds.apache.org/job/HBase-0.92-security/72/]) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.") (N Keywal) tedyu : Files : * /hbase/branches/0.92/CHANGES.txt * /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java > TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or > hadoop QA ("The directory is already locked.") > ---------------------------------------------------------------------------------------------------------------------- > > Key: HBASE-5163 > URL: https://issues.apache.org/jira/browse/HBASE-5163 > Project: HBase > Issue Type: Bug > Components: test > Affects Versions: 0.92.0 > Environment: all > Reporter: nkeywal > Assignee: nkeywal > Priority: Minor > Fix For: 0.92.0, 0.94.0 > > Attachments: 5163-92.txt, 5163.patch > > > The stack is typically: > {noformat} > <error message="Cannot lock storage > /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. > The directory is already locked." > type="java.io.IOException">java.io.IOException: Cannot lock storage > /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. > The directory is already locked. > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602) > at > org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455) > at > org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492) > at > org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467) > at > org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417) > at > org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460) > at > org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470) > // ... > {noformat} > It can be reproduced without parallelization or without executing the other > tests in the class. It seems to fail about 5% of the time. > This comes from the naming policy for the directories in > MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* > in the cluster, and does not take into account previous starts/stops: > {noformat} > for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) { > if (manageDfsDirs) { > File dir1 = new File(data_dir, "data"+(2*i+1)); > File dir2 = new File(data_dir, "data"+(2*i+2)); > dir1.mkdirs(); > dir2.mkdirs(); > // [...] > {noformat} > This means that it if we want to stop/start a datanode, we should always stop > the last one, if not the names will conflict. This test exhibits the behavior: > {noformat} > @Test > public void testMiniDFSCluster_startDataNode() throws Exception { > assertTrue( dfsCluster.getDataNodes().size() == 2 ); > // Works, as we kill the last datanode, we can now start a datanode > dfsCluster.stopDataNode(1); > dfsCluster > .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null); > // Fails, as it's not the last datanode, the directory will conflict on > // creation > dfsCluster.stopDataNode(0); > try { > dfsCluster > .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null); > fail("There should be an exception because the directory already > exists"); > } catch (IOException e) { > assertTrue( e.getMessage().contains("The directory is already > locked.")); > LOG.info("Expected (!) exception caught " + e.getMessage()); > } > // Works, as we kill the last datanode, we can now restart 2 datanodes > // This makes us back with 2 nodes > dfsCluster.stopDataNode(0); > dfsCluster > .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null); > } > {noformat} > And then this behavior is randomly triggered in testLogRollOnDatanodeDeath > because when we do > {noformat} > DatanodeInfo[] pipeline = getPipeline(log); > assertTrue(pipeline.length == fs.getDefaultReplication()); > {noformat} > and then kill the datanodes in the pipeline, we will have: > - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a > new datanode that will reuse the available 2's directory. > - sometimes: pipeline = 1 & 3. In this case,when we try to launch the new > datanode, it fails because it wants to use the same directory as the still > alive '2'. > There are two ways of fixing the test: > 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure > that the directory names will not be reused. But I wonder if there is not a > testCase somewhere (may be not in HBase) depending on this behavior. > 2) Kill explicitly the first and second datanode without using the pipeline > to be sure that the names won't conflict. > Feedback welcome and the choice to make here... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira