[ https://issues.apache.org/jira/browse/HDDS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDDS-5632:
---------------------------------
    Labels: pull-request-available  (was: )

> Intermittent failure in TestOzoneManagerBootstrap#testBootstrapTwoNewOMs
> ------------------------------------------------------------------------
>
>                 Key: HDDS-5632
>                 URL: https://issues.apache.org/jira/browse/HDDS-5632
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Janus Chow
>            Assignee: Janus Chow
>            Priority: Major
>              Labels: pull-request-available
>
> Stacktrace as follows:
> {code:java}
> Error:  testBootstrapTwoNewOMs  Time elapsed: 66.255 s  <<< ERROR!
> java.io.IOException: Failed init RocksDB, db path : 
> /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db,
>  exception :org.rocksdb.RocksDBException lock hold by current process, 
> acquire time 1629174403 acquiring thread 140196605994752: 
> /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db/LOCK:
>  No locks available; status : IOError; message : lock hold by current 
> process, acquire time 1629174403 acquiring thread 140196605994752: 
> /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db/LOCK:
>  No locks available 
>     at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:564)
>     at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:164)
>     at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:191)
>     at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:397)
>     at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:387)
>     at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.start(OmMetadataManagerImpl.java:379)
>     at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:246)
>     at org.apache.hadoop.ozone.om.OzoneManager.instantiateServices(OzoneManager.java:581)
>     at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:505)
>     at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:552)
>     at org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.bootstrapNewOM(MiniOzoneHAClusterImpl.java:791)
>     at org.apache.hadoop.ozone.MiniOzoneHAClusterImpl.bootstrapOzoneManager(MiniOzoneHAClusterImpl.java:706)
>     at org.apache.hadoop.ozone.om.TestOzoneManagerBootstrap.testBootstrapOMs(TestOzoneManagerBootstrap.java:156)
>     at org.apache.hadoop.ozone.om.TestOzoneManagerBootstrap.testBootstrapTwoNewOMs(TestOzoneManagerBootstrap.java:180)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:258)
>     at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>     at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, 
> acquire time 1629174403 acquiring thread 140196605994752: 
> /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-585ca9ba-85a5-4b41-a979-d85c644b1560/omNode-bootstrap-1/om.db/LOCK:
>  No locks available 
>     at org.rocksdb.RocksDB.open(Native Method)
>     at org.rocksdb.RocksDB.open(RocksDB.java:306)
>     at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:119)
>     ... 26 more
> {code}
> Root cause: when MiniOzoneHAClusterImpl#bootstrapOzoneManager creates a new
> OM, it may hit a port conflict. The function then retries with a new port,
> but the metadataManager of the first (failed) OM never released its lock on
> the RocksDB directory, so the retry fails when it opens the same om.db.
> Options to solve:
>  # I tried adding a "metadataManager.stop()" call in the OM constructor for
> the case where the RPC server fails to start, but this leads to another error
> about the lock on the Ratis directory.
>  # I tried stopping the ratisServer too, but in
> [https://github.com/apache/ratis/blob/dc0b68b4c0b8c187a08f669422a2cd099d7be0b7/ratis-common/src/main/java/org/apache/ratis/util/LifeCycle.java#L308],
>  the close function is not called, so the lock is not released. I also tried
> calling the closeMethod for State.NEW, but then something else went wrong.
>  # So I think it is much easier to just check whether the port is available
> in MiniOzoneHAClusterImpl.
> Steps to reproduce:
> Change the generation of basePort as in the following diff; the error then
> occurs for omNode-bootstrap-2 in testBootstrapTwoNewOMs.
> {code:java}
> @@ -697,9 +698,11 @@ public void bootstrapOzoneManager(String omNodeId) throws Exception {
>  
>      long leaderSnapshotIndex = getOMLeader().getRatisSnapshotIndex();
>  
> +    int start = 0;
>      while (true) {
>        try {
> -        basePort = 10000 + RANDOM.nextInt(1000) * 4;
> +//        basePort = 10000 + RANDOM.nextInt(1000) * 4;
> +        basePort = 10000 + start * 4;
>          OzoneConfiguration newConf = addNewOMToConfig(getOMServiceId(),
>              omNodeId, basePort);
>  
> @@ -721,6 +724,7 @@ public void bootstrapOzoneManager(String omNodeId) throws Exception {
>          if (e instanceof BindException ||
>              e.getCause() instanceof BindException) {
>            ++retryCount;
> +          start++;
>            LOG.info("MiniOzoneHACluster port conflicts, retried {} times",
>                retryCount);
>          } else {
> {code}
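
Option 3 in the description (checking whether the port is available before creating the OM) could look roughly like the sketch below. It is not the change from the pull request: the class name PortCheckSketch and both helper methods are hypothetical, and the assumption that each OM needs four consecutive ports is only inferred from the "* 4" stride in the basePort computation.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper class; not part of MiniOzoneHAClusterImpl.
final class PortCheckSketch {

  private PortCheckSketch() {
  }

  // Returns true if a server socket can currently be bound to the port.
  public static boolean isPortAvailable(int port) {
    try (ServerSocket socket = new ServerSocket()) {
      socket.setReuseAddress(false);
      socket.bind(new InetSocketAddress(port));
      return true;
    } catch (IOException e) {
      // BindException (a subclass of IOException) means the port is in use.
      return false;
    }
  }

  // Returns true only if basePort .. basePort + count - 1 are all free.
  // count = 4 is an assumption based on the "* 4" stride in the test code.
  public static boolean arePortsAvailable(int basePort, int count) {
    for (int i = 0; i < count; i++) {
      if (!isPortAvailable(basePort + i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    // Hold an ephemeral port, then confirm the check reports it as busy.
    try (ServerSocket held = new ServerSocket(0)) {
      int busy = held.getLocalPort();
      System.out.println("port " + busy + " available: "
          + isPortAvailable(busy));
    }
  }
}
```

A check like this only narrows the window: another process can still take the port between the check and the real bind, so the existing BindException retry loop would remain necessary. What it avoids, under the assumptions above, is constructing an OM (and taking the RocksDB lock) for a port that is already known to be busy.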



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
