[
https://issues.apache.org/jira/browse/HDDS-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758667#comment-17758667
]
Aryan Gupta commented on HDDS-5547:
-----------------------------------
Thanks, [~szetszwo]. In that case, let's not change it and keep it this way. The
OM service ID is necessary for Raft group ID generation. For now, I believe we
should throw an exception if someone tries to change the OM service ID. That
exception was reverted in HDDS-7088 and converted to a warning because of the
problem with co-locating the SCM and OM storage directories. I think we should
also not allow the OM and SCM Raft directories to be co-located. By default they
are configured differently, but if they are configured the same we should throw
an exception. I also saw your
[comment|https://github.com/apache/ozone/pull/3809#:~:text=Indeed%2C%20it%20is%20uncommon%20in%20production%20(SCM%20and%20OM%20usually%20use%20different%20directories%2C%20or%20even%20different%20machine)%20but%20it%20may%20happen%20in%20testing%20setup.]
on this [PR|https://github.com/apache/ozone/pull/3809] that it is very rare
for the OM and SCM directories to be the same, except in some testing
scenarios. I'm planning to raise the PR soon. It will also help us fix
[HDDS-7553|https://issues.apache.org/jira/browse/HDDS-7553], since it
prevents the creation of a new Raft group directory when someone changes the
OM service ID.
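
To make the two checks concrete, here is a minimal sketch of what they might look like at OM startup. The class and method names (OmStartupChecks, checkNotColocated, checkServiceIdUnchanged) are hypothetical and are not existing Ozone APIs. How the previously used service ID gets persisted is also an assumption and is left out here.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical startup checks; a sketch, not the actual Ozone code. */
public final class OmStartupChecks {

  private OmStartupChecks() { }

  /**
   * Returns true if both paths normalize to the same absolute location.
   * Symbolic links are not resolved in this sketch.
   */
  public static boolean sameDirectory(Path a, Path b) {
    return a.toAbsolutePath().normalize()
        .equals(b.toAbsolutePath().normalize());
  }

  /** Fails fast when OM and SCM point at the same Ratis directory. */
  public static void checkNotColocated(Path omRatisDir, Path scmRatisDir)
      throws IOException {
    if (sameDirectory(omRatisDir, scmRatisDir)) {
      throw new IOException(
          "OM and SCM Ratis directories must differ, both resolve to "
              + omRatisDir.toAbsolutePath().normalize());
    }
  }

  /**
   * Fails fast when the configured service ID differs from the one used
   * earlier. A null persistedId means a fresh install (assumption).
   */
  public static void checkServiceIdUnchanged(String persistedId,
      String configuredId) throws IOException {
    if (persistedId != null && !persistedId.equals(configuredId)) {
      throw new IOException("OM service ID changed from " + persistedId
          + " to " + configuredId);
    }
  }

  public static void main(String[] args) throws IOException {
    // Different directories and an unchanged service ID: both checks pass.
    checkNotColocated(Paths.get("/data/om/ratis"),
        Paths.get("/data/scm/ratis"));
    checkServiceIdUnchanged("omservice", "omservice");
    System.out.println("startup checks passed");
  }
}
```

Throwing instead of warning means a misconfiguration is caught before a second Raft group directory can be created.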
> Generation of raftgroupId should not depend on OM service id
> ------------------------------------------------------------
>
> Key: HDDS-5547
> URL: https://issues.apache.org/jira/browse/HDDS-5547
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Bharat Viswanadham
> Assignee: Aryan Gupta
> Priority: Major
>
> In OM HA, the raftGroupID is generated from the OM service ID.
> So, if the OM service ID changes, OM startup fails with the error below:
> {code:java}
> 2021-08-05 12:20:03,043 ERROR org.apache.hadoop.ozone.om.OzoneManagerStarter: OM start failed with exception
> java.io.IOException: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
> at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
> at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
> at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:71)
> at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:354)
> at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:371)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.start(OzoneManagerRatisServer.java:390)
> at org.apache.hadoop.ozone.om.OzoneManager.start(OzoneManager.java:1109)
> at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.start(OzoneManagerStarter.java:126)
> at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOm(OzoneManagerStarter.java:79)
> at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:67)
> at org.apache.hadoop.ozone.om.OzoneManagerStarter.call(OzoneManagerStarter.java:38)
> at picocli.CommandLine.executeUserObject(CommandLine.java:1933)
> at picocli.CommandLine.access$1100(CommandLine.java:145)
> at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2332)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2326)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2291)
> at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2152)
> at picocli.CommandLine.parseWithHandlers(CommandLine.java:2530)
> at picocli.CommandLine.parseWithHandler(CommandLine.java:2465)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
> at org.apache.hadoop.ozone.om.OzoneManagerStarter.main(OzoneManagerStarter.java:51)
> Caused by: java.lang.IllegalStateException: ILLEGAL TRANSITION: In OzoneManagerStateMachine:om1:group-8A65FD498CB6, RUNNING -> STARTING
> at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
> at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:121)
> at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:164)
> at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:268)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.initialize(OzoneManagerStateMachine.java:127)
> at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:120)
> at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:193)
> at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$4(RaftServerProxy.java:266)
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> One possible solution:
> If a Ratis group dir already exists, use it, since it belongs to an existing
> cluster that we cannot change. For new clusters, we could use the clusterID,
> which does not change for an Ozone cluster; this way we would be tolerant of
> service ID config changes.
> This is just one idea; we can discuss other approaches to solve this issue.
> Right now, OM does not allow the OM service ID to be changed.
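
The idea proposed in the description (reuse an existing Ratis group directory, otherwise derive the group ID from the clusterID) could be sketched roughly as below. This is only an illustration under stated assumptions: GroupIdResolver and its methods are hypothetical names. It also assumes that a group directory is a subdirectory of the Ratis storage directory named after the group UUID, and that a name-based UUID derived from the clusterID would be acceptable as a group ID.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.UUID;

/** Hypothetical resolver; a sketch of the proposal, not Ozone code. */
public final class GroupIdResolver {

  private GroupIdResolver() { }

  /**
   * Looks for a subdirectory whose name parses as a UUID (assumed to be
   * an existing Raft group directory) and returns the first one found.
   */
  public static Optional<UUID> findExistingGroup(Path ratisDir)
      throws IOException {
    if (!Files.isDirectory(ratisDir)) {
      return Optional.empty();
    }
    try (DirectoryStream<Path> children = Files.newDirectoryStream(ratisDir)) {
      for (Path child : children) {
        if (!Files.isDirectory(child)) {
          continue;
        }
        try {
          return Optional.of(UUID.fromString(child.getFileName().toString()));
        } catch (IllegalArgumentException e) {
          // Name is not a UUID, so it is not treated as a group directory.
        }
      }
    }
    return Optional.empty();
  }

  /**
   * Existing cluster: keep the group ID already on disk.
   * New cluster: derive a stable ID from the cluster ID, so a later
   * change of the service ID does not change the group ID.
   */
  public static UUID resolve(Path ratisDir, String clusterId)
      throws IOException {
    Optional<UUID> existing = findExistingGroup(ratisDir);
    if (existing.isPresent()) {
      return existing.get();
    }
    return UUID.nameUUIDFromBytes(clusterId.getBytes(StandardCharsets.UTF_8));
  }
}
```

With this shape, the service ID never enters the computation, so changing it in the configuration would neither fail startup nor create a second group directory.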