[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or
> hadoop QA ("The directory is already locked.")
> ----------------------------------------------------------------------
>
>                 Key: HBASE-5163
>                 URL: https://issues.apache.org/jira/browse/HBASE-5163
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5163-92.txt, 5163.patch
>
>
> The stack is typically:
> {noformat}
> java.io.IOException: Cannot lock storage
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
> The directory is already locked.
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
>   // ...
> {noformat}
> It can be reproduced without parallelization and without executing the other
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently*
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
> for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
>   if (manageDfsDirs) {
>     File dir1 = new File(data_dir, "data" + (2 * i + 1));
>     File dir2 = new File(data_dir, "data" + (2 * i + 2));
>     dir1.mkdirs();
>     dir2.mkdirs();
>     // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop
> the last one; if not, the names will conflict. This test exhibits the behavior:
> {noformat}
> @Test
> public void testMiniDFSCluster_startDataNode() throws Exception {
>   assertTrue(dfsCluster.getDataNodes().size() == 2);
>   // Works: as we kill the last datanode, we can now start a datanode
>   dfsCluster.stopDataNode(1);
>   dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   // Fails: as it's not the last datanode, the directory will conflict on
>   // creation
>   dfsCluster.stopDataNode(0);
>   try {
>     dfsCluster
>         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>     fail("There should be an exception because the directory already exists");
>   } catch (IOException e) {
>     assertTrue(e.getMessage().contains("The directory is already locked."));
>     LOG.info("Expected (!) exception caught " + e.getMessage());
>   }
>   // Works: as we kill the last datanode, we can now restart 2 datanodes.
>   // This brings us back to 2 nodes.
>   dfsCluster.stopDataNode(0);
>   dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
> }
> {noformat}
> This behavior is then randomly triggered in testLogRollOnDatanodeDeath,
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we have:
> - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a
> new datanode that will reuse the now-available directory of '2'.
> - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new
> datanode, it fails because it wants to use the same directory as the still
> alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode,
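For what it's worth, the collision mechanism described above can be simulated without Hadoop at all. The sketch below is a minimal, hypothetical model (class and field names are mine, not MiniDFSCluster's): each node claims "data"+(2*i+1) and "data"+(2*i+2), where i is the number of nodes currently alive, so stopping any node other than the last one makes the next start reclaim directories a surviving node still holds.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy simulation of MiniDFSCluster's directory-naming policy
 * (illustrative only; not the real Hadoop code).
 */
public class DirNamingClash {
  // One entry per live simulated datanode: the dirs it has "locked".
  static final List<String[]> liveNodes = new ArrayList<>();

  /** Returns the dirs the new node claims, or null on a lock clash. */
  static String[] startDataNode() {
    int i = liveNodes.size(); // naming depends on the *current* node count
    String[] dirs = {"data" + (2 * i + 1), "data" + (2 * i + 2)};
    for (String[] locked : liveNodes) {
      for (String held : locked) {
        for (String wanted : dirs) {
          if (held.equals(wanted)) {
            return null; // "The directory is already locked."
          }
        }
      }
    }
    liveNodes.add(dirs);
    return dirs;
  }

  public static void main(String[] args) {
    startDataNode();     // node 0 -> data1, data2
    startDataNode();     // node 1 -> data3, data4
    liveNodes.remove(0); // stop the FIRST node, not the last one
    // The new node is named from the current count (1), so it wants
    // data3/data4 -- still held by the surviving node: clash.
    System.out.println(startDataNode() == null ? "clash" : "ok");
  }
}
```

Stopping the last node instead (remove(1)) frees data3/data4, and the restart succeeds, matching the behavior the test above demonstrates.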
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Affects Version/s:     (was: 0.94.0)
                       0.92.0
        Fix Version/s: 0.94.0
                       0.92.0
         Hadoop Flags: Reviewed
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Attachment: 5163-92.txt

Patch I would integrate to 0.92
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Summary: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")  (was: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked."))