[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or
> hadoop QA ("The directory is already locked.")
> ----------------------------------------------------------------------
>
>                 Key: HBASE-5163
>                 URL: https://issues.apache.org/jira/browse/HBASE-5163
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5163-92.txt, 5163.patch
>
>
> The stack is typically:
> {noformat}
> java.io.IOException: Cannot lock storage
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
> The directory is already locked.
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>   at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>   at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
>   // ...
> {noformat}
> It can be reproduced without parallelization and without executing the other
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently*
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
> for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
>   if (manageDfsDirs) {
>     File dir1 = new File(data_dir, "data" + (2 * i + 1));
>     File dir2 = new File(data_dir, "data" + (2 * i + 2));
>     dir1.mkdirs();
>     dir2.mkdirs();
>     // [...]
> {noformat}
> This means that if we want to stop/start a datanode, we should always stop
> the last one; if not, the names will conflict. This test exhibits the behavior:
> {noformat}
> @Test
> public void testMiniDFSCluster_startDataNode() throws Exception {
>   assertTrue(dfsCluster.getDataNodes().size() == 2);
>   // Works: as we kill the last datanode, we can now start a datanode
>   dfsCluster.stopDataNode(1);
>   dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>   // Fails: as it's not the last datanode, the directory will conflict on
>   // creation
>   dfsCluster.stopDataNode(0);
>   try {
>     dfsCluster
>         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>     fail("There should be an exception because the directory already exists");
>   } catch (IOException e) {
>     assertTrue(e.getMessage().contains("The directory is already locked."));
>     LOG.info("Expected (!) exception caught " + e.getMessage());
>   }
>   // Works: as we kill the last datanode, we can now restart 2 datanodes.
>   // This brings us back to 2 nodes.
>   dfsCluster.stopDataNode(0);
>   dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
> }
> {noformat}
> This behavior is then randomly triggered in testLogRollOnDatanodeDeath,
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we have:
> - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a
> new datanode that will reuse the now-available directory of '2'.
> - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new
> datanode, it fails because it wants to use the same directory as the still
> alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode,
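For what it's worth, the collision mechanism described above can be simulated without Hadoop at all. The sketch below is a minimal, hypothetical model (class and field names are mine, not MiniDFSCluster's): each node claims "data"+(2*i+1) and "data"+(2*i+2), where i is the number of nodes currently alive, so stopping any node other than the last one makes the next start reclaim directories a surviving node still holds.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy simulation of MiniDFSCluster's directory-naming policy
 * (illustrative only; not the real Hadoop code).
 */
public class DirNamingClash {
  // One entry per live simulated datanode: the dirs it has "locked".
  static final List<String[]> liveNodes = new ArrayList<>();

  /** Returns the dirs the new node claims, or null on a lock clash. */
  static String[] startDataNode() {
    int i = liveNodes.size(); // naming depends on the *current* node count
    String[] dirs = {"data" + (2 * i + 1), "data" + (2 * i + 2)};
    for (String[] locked : liveNodes) {
      for (String held : locked) {
        for (String wanted : dirs) {
          if (held.equals(wanted)) {
            return null; // "The directory is already locked."
          }
        }
      }
    }
    liveNodes.add(dirs);
    return dirs;
  }

  public static void main(String[] args) {
    startDataNode();     // node 0 -> data1, data2
    startDataNode();     // node 1 -> data3, data4
    liveNodes.remove(0); // stop the FIRST node, not the last one
    // The new node is named from the current count (1), so it wants
    // data3/data4 -- still held by the surviving node: clash.
    System.out.println(startDataNode() == null ? "clash" : "ok");
  }
}
```

Stopping the last node instead (remove(1)) frees data3/data4, and the restart succeeds, matching the behavior the test above demonstrates.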
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Affects Version/s:     (was: 0.94.0)
                       0.92.0
        Fix Version/s: 0.94.0
                       0.92.0
         Hadoop Flags: Reviewed
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Attachment: 5163-92.txt

Patch I would integrate to 0.92
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Summary: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")  (was: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked."))