[
https://issues.apache.org/jira/browse/IGNITE-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-25240:
---------------------------------------
Description:
If a node gets stopped while installing a Raft snapshot to another node, log
entries like the following ones appear:
2025-04-23 17:18:54:998 +0200 [ERROR][%defaultNode%JRaft-Common-Executor-2][SnapshotExecutorImpl] Fail to save snapshot: Status[EIO<1014>: Fail to save snapshot to /.../work/partitions/meta/370_part_21-0/snapshot, reason java.util.concurrent.CancellationException].
They are accompanied by
2025-04-23 17:18:54:999 +0200 [ERROR][%defaultNode%JRaft-FSMCaller-Disruptor_stripe_9-0][StateMachineAdapter] Encountered an error=Status[EIO<1014>: Fail to save snapshot.] on StateMachine org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save snapshot.]]
    at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:687)
    at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:411)
    at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:127)
    at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:123)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
None of these errors is fatal during node stop, so we should avoid logging them.
This might also happen if a partition gets evicted from a node while the node
installs a Raft snapshot on a follower. This needs to be checked.
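
One possible shape for the fix, as a minimal sketch only: pick the log level for a
failed snapshot save based on whether the node is already stopping and whether the
failure was caused by a CancellationException. The class SnapshotFailureLogPolicy
and its methods are hypothetical names invented for illustration; they are not part
of Ignite or JRaft, and the real change may need to hook into different places.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper for illustration; not an existing Ignite or JRaft class. */
final class SnapshotFailureLogPolicy {
    /** Log levels this sketch distinguishes between. */
    enum Level { DEBUG, ERROR }

    /** Set once node shutdown has begun. */
    private final AtomicBoolean stopping = new AtomicBoolean();

    /** Assumed to be called by the node's stop procedure before components are cancelled. */
    void markStopping() {
        stopping.set(true);
    }

    /**
     * Chooses the level for a failed snapshot save: a cancellation observed while
     * the node is stopping is expected noise, anything else is still an error.
     */
    Level levelFor(Throwable failure) {
        return stopping.get() && isCancellation(failure) ? Level.DEBUG : Level.ERROR;
    }

    /** Walks the cause chain looking for a CancellationException. */
    static boolean isCancellation(Throwable failure) {
        for (Throwable cur = failure; cur != null; cur = cur.getCause()) {
            if (cur instanceof CancellationException) {
                return true;
            }
        }
        return false;
    }
}
```

With such a policy, a cancellation seen before shutdown starts would still be logged
at ERROR, so genuine snapshot failures would not be hidden. Whether the same check
is appropriate for partition eviction is exactly the open question noted above.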
> Partition raft snapshot interrupted due to node stop causes garbage in logs
> ---------------------------------------------------------------------------
>
> Key: IGNITE-25240
> URL: https://issues.apache.org/jira/browse/IGNITE-25240
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3