Hi Snehasish,

Since you already have a test, could you share the code change?  You may
attach a patch file or create a pull request.   I will run it to reproduce
the failure.

In the meantime, I will try to understand the details you provided.
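
For reference, the check that fails in the reported stack trace can be
modeled as below. This is only a minimal, self-contained sketch and not
Ratis code: the class TermIndexOrderSketch, its nested TermIndex record and
the method updateLastApplied are hypothetical names, and it assumes that a
(term, index) pair is ordered by term first and then by index, which is
what the message "newTI = (t:1, i:21) < oldTI = (t:2, i:20)" suggests.

```java
import java.util.Comparator;

public class TermIndexOrderSketch {
    /** Hypothetical stand-in for a (term, index) pair; not the Ratis TermIndex class. */
    record TermIndex(long term, long index) implements Comparable<TermIndex> {
        // Assumption: pairs are ordered by term first, then by index.
        private static final Comparator<TermIndex> ORDER =
                Comparator.comparingLong(TermIndex::term)
                        .thenComparingLong(TermIndex::index);

        @Override
        public int compareTo(TermIndex that) {
            return ORDER.compare(this, that);
        }

        @Override
        public String toString() {
            return "(t:" + term + ", i:" + index + ")";
        }
    }

    // Last applied (term, index), initialized from the restored snapshot.
    private TermIndex lastApplied;

    TermIndexOrderSketch(TermIndex fromSnapshot) {
        this.lastApplied = fromSnapshot;
    }

    /** Rejects any update that would move the last applied pair backwards. */
    void updateLastApplied(TermIndex newTI) {
        if (newTI.compareTo(lastApplied) < 0) {
            throw new IllegalStateException(
                    "Failed updateLastAppliedTermIndex: newTI = " + newTI
                            + " < oldTI = " + lastApplied);
        }
        lastApplied = newTI;
    }

    public static void main(String[] args) {
        // The snapshot restored at startup was taken at term 2, index 20.
        TermIndexOrderSketch sm = new TermIndexOrderSketch(new TermIndex(2, 20));
        try {
            // A leader elected at term 1 appends and applies index 21.
            sm.updateLastApplied(new TermIndex(1, 21));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // An entry at term 2, index 21 passes the same check.
        sm.updateLastApplied(new TermIndex(2, 21));
        System.out.println("lastApplied = " + sm.lastApplied);
    }
}
```

Under that assumption, index 21 being larger than 20 does not help: the
lower term alone makes the new pair compare as smaller than the one
restored from the snapshot.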

Tsz-Wo


On Thu, Mar 5, 2026 at 3:14 AM Snehasish Roy <[email protected]>
wrote:

> Hi Tsz-Wo,
>
> Thank you for your prompt response. I was able to reproduce this issue
> using CounterStateMachine.
>
> I added a utility in the CounterClient to trigger a snapshot.
>
> ```
> private void takeSnapshot() throws IOException {
>     RaftClientReply raftClientReply = client.getSnapshotManagementApi()
>             .create(true, 30_000);
>     System.out.println(raftClientReply);
> }
> ```
>
> Once the snapshot is triggered, I move it to a different directory to
> simulate a clean restart.
>
> I also updated the SimpleStateMachineStorage::loadLatestSnapshot() to look
> for snapshots in a different directory.
>
> ```
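> public SingleFileSnapshotInfo loadLatestSnapshot() {
>     // Look for snapshots in the directory they were moved to.
>     final File dir = new File("/tmp/snapshots");
>     // ... (rest of the method unchanged)
> }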
> ```
>
> Full steps for reproduction:
> 1. Started a 3-node CounterServer cluster and performed some updates to the
> state machine using the CounterClient.
>
> 2. Triggered a snapshot via the CounterClient and then moved the snapshot
> to a different directory. The snapshot name has the format term_index.
> Here the term is initially 1; let's assume the index is at 10.
>
> 3. Killed the leader; the term then increased to 2.
>
> 4. Performed some updates and triggered another snapshot. Let's assume the
> index is at 20 and the term is at 2. Moved the snapshot to a different
> directory.
>
> 5. Stopped all nodes. Cleared the storage directories of all the nodes to
> simulate a clean restart.
>
> 6. Started the 3-node CounterServer cluster and observed the failure at
> startup.
>
> ```
> 2026-03-05 15:48:56 INFO  SimpleStateMachineStorage:229 - Latest snapshot is
> SingleFileSnapshotInfo(t:2, i:20):[/tmp/snapshots/snapshot.2_20] in
> /tmp/snapshots
> 2026-03-05 15:48:56 INFO  SimpleStateMachineStorage:229 - Latest snapshot
> is SingleFileSnapshotInfo(t:2, i:20):[/tmp/snapshots/snapshot.2_20] in
> /tmp/snapshots
> 2026-03-05 15:48:56 INFO  RaftServerConfigKeys:62 -
> raft.server.log.use.memory = false (default)
> 2026-03-05 15:48:56 INFO  RaftServer$Division:155 - n0@group-ABB3109A44C1:
> getLatestSnapshot(CounterStateMachine-1:n0:group-ABB3109A44C1) returns
> SingleFileSnapshotInfo(t:2, i:20):[/tmp/snapshots/snapshot.2_20]
> 2026-03-05 15:48:56 INFO  RaftLog:90 -
> n0@group-ABB3109A44C1-SegmentedRaftLog: snapshotIndexFromStateMachine = 20
> ....
> 2026-03-05 15:49:02 INFO  RaftServer$Division:577 - n1@group-ABB3109A44C1:
> set firstElectionSinceStartup to false for becomeLeader
> 2026-03-05 15:49:02 INFO  RaftServer$Division:278 - n1@group-ABB3109A44C1:
> change Leader from null to n1 at term 1 for becomeLeader, leader elected
> after 672ms
> 2026-03-05 15:49:02 INFO  SegmentedRaftLogWorker:440 -
> n1@group-ABB3109A44C1-SegmentedRaftLogWorker: Starting segment from
> index:21
> 2026-03-05 15:49:02 INFO  SegmentedRaftLogWorker:647 -
> n1@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment
> /ratis/./n1/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_21
> ....
> 2026-03-05 15:49:02 INFO  RaftServer$Division:309 - Leader
> n1@group-ABB3109A44C1-LeaderStateImpl is ready since appliedIndex ==
> startIndex == 21
> 2026-03-05 15:49:02 ERROR StateMachineUpdater:207 -
> n1@group-ABB3109A44C1-StateMachineUpdater caught a Throwable.
> 2026-03-05 15:49:02 ERROR StateMachineUpdater:207 -
> n1@group-ABB3109A44C1-StateMachineUpdater caught a Throwable.
> java.lang.IllegalStateException: n1: Failed updateLastAppliedTermIndex:
> newTI = (t:1, i:21) < oldTI = (t:2, i:20)
> at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> at org.apache.ratis.statemachine.impl.BaseStateMachine.updateLastAppliedTermIndex(BaseStateMachine.java:148)
> at org.apache.ratis.statemachine.impl.BaseStateMachine.updateLastAppliedTermIndex(BaseStateMachine.java:139)
> at org.apache.ratis.statemachine.impl.BaseStateMachine.notifyTermIndexUpdated(BaseStateMachine.java:135)
> at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1893)
> at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:255)
> at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:194)
> at java.base/java.lang.Thread.run(Thread.java:1575)
> 2026-03-05 15:49:02 INFO  RaftServer$Division:528 - n1@group-ABB3109A44C1:
> shutdown
> ```
>
> As you can see from the stack trace, during the snapshot restore the last
> applied TermIndex was set to the value from the latest snapshot, (t:2,
> i:20). However, since the server was started from a clean slate, the term
> was reset by RaftServerImpl at startup, and the new leader was elected at
> term 1. When it then applied the log entry (t:1, i:21), the update failed
> the precondition check that the last applied TermIndex must not decrease.
>
> Please let me know if you need more information.
>
> Regards
>
> On Wed, 4 Mar 2026 at 06:33, Tsz Wo Sze <[email protected]> wrote:
>
> > Hi Snehasish,
> >
> > > ... newTI = (t:1, i:21) ...
> >
> > The newTI was invalid.  It probably was from the state machine.  It should
> > just use the TermIndex from LogEntryProto.  See CounterStateMachine [1] as
> > an example.
> >
> > Tsz-Wo
> > [1]
> > https://github.com/apache/ratis/blob/3d9f5af376409de7e635bb67c7dfbeadc882c413/ratis-examples/src/main/java/org/apache/ratis/examples/counter/server/CounterStateMachine.java#L263-L266
> >
> > On Tue, Mar 3, 2026 at 10:52 AM Snehasish Roy via dev <
> > [email protected]>
> > wrote:
> >
> > > Hello everyone,
> > >
> > > I was exploring the snapshot restore capability of Ratis and found one
> > > scenario that failed.
> > >
> > > 1. Start a 3 Node ratis cluster and perform some updates to the state
> > > machine.
> > > 2. Take the snapshot - the snapshot will be of the format term_index. Here
> > > the term will initially be 1, and let's assume the index is at 10.
> > > 3. Kill the leader, the term would have increased to 2.
> > > 4. Perform some updates and trigger another snapshot. Let's assume the
> > > index is at 20 and term is at 2.
> > > 5. Stop all nodes.
> > > 6. A failure is observed while starting the node.
> > >
> > > ```
> > > Failed updateLastAppliedTermIndex: newTI = (t:1, i:21) < oldTI = (t:2,
> > > i:20)
> > > ```
> > >
> > > Based on the error logs, I suspect the state machine updated the last
> > > applied term index to t:2, i:20, but the ServerState has a separate
> > > variable for tracking the currentTerm, which is initialized to 0 at
> > > startup. Once the leader was elected, it tried to update the log entry,
> > > but the update failed due to the precondition check.
> > >
> > > What's the correct way to solve this problem? Should the term be reset
> > > to 0 while loading the snapshot at the server startup?
> > >
> > > References:
> > >
> > > https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L82
> > >
> > > https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/BaseStateMachine.java#L138
> > >
> > > Thank you for looking into this issue.
> > >
> > >
> > > Regards,
> > > Snehasish
> > >
> >
>
