[ 
https://issues.apache.org/jira/browse/HDDS-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen resolved HDDS-15103.
-------------------------------
    Fix Version/s: 2.2.0
       Resolution: Fixed

> OM termination due to race condition between om start and install snapshot
> --------------------------------------------------------------------------
>
>                 Key: HDDS-15103
>                 URL: https://issues.apache.org/jira/browse/HDDS-15103
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.2.0
>
>
> OM is created
> {code:java}
> 2026-04-19 17:19:36,490 INFO 
> [main]-org.apache.hadoop.ozone.om.OzoneManagerStarter: STARTUP_MSG: 
> {code}
> OzoneManagerRatisServer(omRatisServer) in created in OM constructor
> {code:java}
> 2026-04-19 17:19:45,113 INFO 
> [main]-org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer: 
> Instantiating OM Ratis server with groupID: ozone1776408527 and peers: 
> hostname-7:9872, hostname-1:9872, hostname-5:9872
> 2026-04-19 17:19:45,121 INFO 
> [main]-org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: 
> LastAppliedIndex is set from TransactionInfo from OM DB as (t:11, i:975246)
> {code}
> OM is started
> {code:java}
> 2026-04-19 17:19:46,456 INFO [main]-org.apache.hadoop.ozone.om.OzoneManager: 
> OzoneManager RPC server is listening at hostname-7/10.140.136.10:9862
> {code}
> OzoneManagerRatisServer(omRatisServer) is started
> {code:java}
> 2026-04-19 17:19:46,456 INFO 
> [main]-org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer: Starting 
> OzoneManagerRatisServer om43 at port 9872
> ...
> 2026-04-19 17:19:51,299 INFO 
> [om43-impl-thread1]-org.apache.ratis.server.raftlog.segmented.LogSegment: 
> Successfully read 85006 entries from segment file 
> /var/lib/hadoop-ozone/om/ratis/5341b4ef-bcd6-3183-aa83-a6bf76f23ef4/current/log_inprogress_890723
> 2026-04-19 17:19:51,306 INFO 
> [om43-impl-thread1]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
>  om43@group-A6BF76F23EF4-SegmentedRaftLogWorker: flushIndex: 
> setUnconditionally 0 -> 975728
> 2026-04-19 17:19:51,306 INFO 
> [om43-impl-thread1]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
>  om43@group-A6BF76F23EF4-SegmentedRaftLogWorker: safeCacheEvictIndex: 
> setUnconditionally 0 -> 890722
> 2026-04-19 17:19:51,356 INFO 
> [om43-impl-thread1]-org.apache.ratis.server.raftlog.RaftLog: 
> om43@group-A6BF76F23EF4-SegmentedRaftLog: commitIndex: updateToMax 
> old=975246, new=975726, updated? true
> {code}
> OM starts secretKeyClient
> {code:java}
> 2026-04-19 17:19:51,445 INFO [main]-org.apache.hadoop.ozone.om.OzoneManager: 
> Starting secret key client.
> {code}
> OzoneManagerRatisServer(omRatisServer) receives install snapshot request and 
> download snapshot file
> {code:java}
> 2026-04-19 17:19:51,622 INFO 
> [grpc-default-executor-0]-org.apache.ratis.server.impl.SnapshotInstallationHandler:
>  om43@group-A6BF76F23EF4: receive installSnapshot: 
> om45->om43#0-t11,notify:(t:11, i:1151853)
> 2026-04-19 17:19:51,624 INFO 
> [om43-server-thread1]-org.apache.ratis.server.RaftServer$Division: 
> om43@group-A6BF76F23EF4: change Leader from null to om45 at term 11 for 
> appendEntries, leader elected after 6159ms
> 2026-04-19 17:19:51,626 INFO 
> [grpc-default-executor-0]-org.apache.ratis.server.impl.SnapshotInstallationHandler:
>  om43@group-A6BF76F23EF4: Received notification to install snapshot at index 
> 1151853
> 2026-04-19 17:19:51,640 INFO 
> [grpc-default-executor-0]-org.apache.ratis.server.impl.SnapshotInstallationHandler:
>  om43@group-A6BF76F23EF4: notifyInstallSnapshot: nextIndex is 975729 but the 
> leader's first available index is 1151853.
> 2026-04-19 17:19:51,640 INFO 
> [grpc-default-executor-0]-org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine:
>  Received install snapshot notification from OM leader: om45 with term index: 
> (t:11, i:1151853)
> ...
> 2026-04-19 17:19:51,649 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.ratis_snapshot.OmRatisSnapshotProvider:
>  Downloading latest checkpoint from Leader OM om45. Checkpoint: 
> om.db-om45-1776619191643.tar URL: 
> https:hostname-5:9875/dbCheckpoint?includeSnapshotData=true&flushBeforeCheckpoint=true
> ...
> 2026-04-19 17:19:52,637 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.ratis_snapshot.OmRatisSnapshotProvider:
>  Download completed for 'om.db-om45-1776619191643.tar'. Total size: 26695 KB
> 2026-04-19 17:19:52,637 INFO 
> [pool-31-thread-1]-org.apache.hadoop.hdds.utils.RDBSnapshotProvider: 
> Successfully download the latest snapshot 
> /var/lib/hadoop-ozone/om/ozone-metadata/snapshot/om.db-om45-1776619191643.tar 
> from leader OM: om45
> {code}
> OM starts to install snapshot
> {code:java}
> 2026-04-19 17:19:52,796 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Installing 
> checkpoint with OMTransactionInfo 11#1251978
> 2026-04-19 17:19:52,796 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Stopping secret 
> key client.
> 2026-04-19 17:19:52,796 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Stopping OM 
> delegation token secret manager.
> 2026-04-19 17:19:52,796 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: 
> OzoneManagerStateMachine is pausing
> 2026-04-19 17:19:52,796 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer: 
> Stopping OMDoubleBuffer flush thread
> ..
> 2026-04-19 17:19:52,797 INFO [pool-31-thread-1]-org.apache.hadoop.ipc.Server: 
> Stopping server on 9862
> 2026-04-19 17:19:52,797 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: RPC server is 
> stopped. Spend 0 ms.
> 2026-04-19 17:19:52,797 INFO 
> [pool-31-thread-1]-org.apache.ozone.rocksdiff.RocksDBCheckpointDiffer: 
> Shutting down CompactionDagPruningService.
> ...
> 2026-04-19 17:19:52,806 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Replaced DB with 
> checkpoint from OM: om45, term: 11, index: 1251978, time: 4 ms
> ...
> 2026-04-19 17:19:52,903 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Starting secret 
> key client.
> 2026-04-19 17:19:54,964 INFO 
> [pool-31-thread-1]-org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeySignerClient:
>  Initial secret key fetched from SCM: SecretKey(id = 
> b7587d07-308b-4aac-9f4c-37acb0eb76db, creation at: 2026-04-19T06:57:45.934Z, 
> expire at: 2026-04-28T06:57:45.934Z).
> 2026-04-19 17:19:54,964 INFO 
> [pool-31-thread-1]-org.apache.hadoop.hdds.security.symmetric.DefaultSecretKeySignerClient:
>  Scheduling SecretKeyPoller with initial delay of PT13H37M50.969104546S and 
> interval of PT10M
> *2026-04-19 17:19:54,965 INFO 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Starting OM 
> delegation token secret manager*
> 2026-04-19 17:19:54,966 ERROR 
> [pool-31-thread-1]-org.apache.hadoop.ozone.om.OzoneManager: Terminating with 
> exit status 1: Failed to reload OM state and instantiate services.
> java.lang.IllegalStateException
>       at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:496)
>       at 
> org.apache.hadoop.hdds.security.OzoneSecretManager.start(OzoneSecretManager.java:214)
>       at 
> org.apache.hadoop.ozone.security.OzoneDelegationTokenSecretManager.start(OzoneDelegationTokenSecretManager.java:619)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.startSecretManager(OzoneManager.java:1286)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.startSecretManagerIfNecessary(OzoneManager.java:2519)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.reloadOMState(OzoneManager.java:4199)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:4037)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.installCheckpoint(OzoneManager.java:3943)
>       at 
> org.apache.hadoop.ozone.om.OzoneManager.installSnapshotFromLeader(OzoneManager.java:3920)
>       at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$5(OzoneManagerStateMachine.java:512)
>       at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
> OM starts continuous
> {code:java}
> 2026-04-19 17:19:53,554 INFO [main]-org.apache.hadoop.ozone.om.OzoneManager: 
> Starting OM delegation token secret manager
> 2026-04-19 17:19:53,556 INFO [main]-org.apache.hadoop.ozone.om.OzoneManager: 
> Version File has different layout version (6) than OM DB (null). That is 
> expected if this OM has never been finalized to a newer layout version.
> {code}
> The exception happens when both install snapshot thread and OzoneManager 
> start main thread call OzoneDelegationTokenSecretManager.start(), whoever 
> calls later will fail at the checkstate().
> To fix this problem, one solution is changing OzoneSecretManager.start()
> {code:java}
>   Preconditions.checkState(!isRunning());
> {code}
> to
> {code:java}
> if (isRunning()) {
>    return;
> }
> {code}
> But the issue is due to install snapshot thread and ozone manager main thread 
> are running OM start process in the same time. Other than 
> OzoneDelegationTokenSecretManager, other services also have the chance to go 
> wrong if multi-thread running is not expected and test. So the safe idea, is 
> wait for the OM start to finish, before that, OzoneManagerStatemachine will 
> ignore snapshot installation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to