[ 
https://issues.apache.org/jira/browse/HDDS-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758159#comment-17758159
 ] 

Aryan Gupta commented on HDDS-5547:
-----------------------------------

I looked into this, and the solution below looks like it may work.
We can avoid creating a Ratis group directory if one already exists. The group 
directory is created internally by the Ratis code when the OzoneManager Ratis 
server is initialized. 
1. On first start, the Ratis storage directory has no group directories inside 
it, so in that case we should create one.
2. If the OM service ID has changed, check whether any group directory is 
present in the Ratis storage directory. If one is, we should avoid creating a 
new group directory for the new OM service ID.
3. If no group directory is present, create one.
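
The three steps above can be sketched roughly as follows. This is only an illustration, not the actual patch: the class and method names (RatisGroupDirResolver, findExistingGroupId, resolveGroupId) are hypothetical, and it assumes that group directories are named by the group's UUID and that the default group ID is a name-based UUID derived from the OM service ID. It uses only JDK classes and leaves directory creation to Ratis.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.UUID;

/** Hypothetical helper: decide which group ID to hand to Ratis at OM startup. */
final class RatisGroupDirResolver {

  private RatisGroupDirResolver() { }

  /**
   * Looks for an existing group directory under the Ratis storage directory.
   * Assumption: a group directory is a sub-directory whose name parses as a UUID.
   * If several match, the first one encountered is returned (order unspecified).
   */
  static Optional<UUID> findExistingGroupId(Path ratisStorageDir) throws IOException {
    if (!Files.isDirectory(ratisStorageDir)) {
      return Optional.empty(); // first start: storage directory not created yet
    }
    try (DirectoryStream<Path> children = Files.newDirectoryStream(ratisStorageDir)) {
      for (Path child : children) {
        if (!Files.isDirectory(child)) {
          continue;
        }
        try {
          return Optional.of(UUID.fromString(child.getFileName().toString()));
        } catch (IllegalArgumentException e) {
          // Name is not a UUID, so treat it as some other kind of directory.
        }
      }
    }
    return Optional.empty();
  }

  /**
   * Steps 1-3: reuse the existing group ID if a group directory is present,
   * otherwise derive a new one from the OM service ID (assumed derivation).
   * The directory itself is not created here; that is left to Ratis.
   */
  static UUID resolveGroupId(Path ratisStorageDir, String omServiceId) throws IOException {
    Optional<UUID> existing = findExistingGroupId(ratisStorageDir);
    if (existing.isPresent()) {
      return existing.get();
    }
    return UUID.nameUUIDFromBytes(omServiceId.getBytes(StandardCharsets.UTF_8));
  }
}
```

With something like this, a changed service ID would still resolve to the group ID of the directory already on disk, while a fresh storage directory would get an ID derived from the configured service ID. Skipping names that do not parse as UUIDs is one way to handle the open question below about other kinds of directories.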

Question: The above solution makes sense only if group directories are the 
only directories present in the Ratis storage directory. If any other kind of 
directory is also present, this solution may not work. From Bharat's Jira, it 
seems to me that only a group directory is there. Please correct me if I'm 
wrong, [~szetszwo]. 

cc [~sumitagrawl] 

> Generation of raftgroupId should not depend on OM service id
> ------------------------------------------------------------
>
>                 Key: HDDS-5547
>                 URL: https://issues.apache.org/jira/browse/HDDS-5547
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Bharat Viswanadham
>            Assignee: Aryan Gupta
>            Priority: Major
>
> In OM HA, raftGroupID is generated from the OM service ID.
> So if the OM service ID changes, OM startup fails with the error below:
> {code:java}
> 2021-08-05 12:20:03,043 ERROR org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
> java.io.IOException: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
>         at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>         at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>         at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:71)
>         at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:354)
>         at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:371)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.start(OzoneManagerRatisServer.java:390)
>         at org.apache.hadoop.ozone.om.OzoneManager.start(OzoneManager.java:1109)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:126)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:79)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:67)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:38)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
>         at picocli.CommandLine.access$1100(CommandLine.java:145)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
>         at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2152)
>         at picocli.CommandLine.parseWithHandlers(CommandLine.java:2530)
>         at picocli.CommandLine.parseWithHandler(CommandLine.java:2465)
>         at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
>         at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
>         at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:51)
> Caused by: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
>         at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:121)
>         at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:164)
>         at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:268)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.initialize(OzoneManagerStateMachine.java:127)
>         at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
>         at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:193)
>         at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$4(RaftServerProxy.java:266)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> One possible solution:
> If a Ratis group dir already exists, use it as is, since it belongs to an 
> existing cluster and we cannot change it. For new clusters, we could use the 
> clusterID, which does not change for an Ozone cluster; this way we would be 
> tolerant of service ID config changes.
> This is just one idea; we can discuss other approaches to solve this 
> issue.
> Right now, OM does not allow changing the OM service ID.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
