[ https://issues.apache.org/jira/browse/IGNITE-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy updated IGNITE-25240:
---
Description:
If a node gets stopped while 'saving' a Raft snapshot, log entries like the
following ones appear:
2025-04-23 17:18:54:998 +0200
[ERROR][%defaultNode%JRaft-Common-Executor-2][SnapshotExecutorImpl] Fail to
save snapshot: Status[EIO<1014>: Fail to save snapshot to
/.../work/partitions/meta/370_part_21-0/snapshot, reason
java.util.concurrent.CancellationException].
They are accompanied by
2025-04-23 17:18:54:999 +0200
[ERROR][%defaultNode%JRaft-FSMCaller-Disruptor_stripe_9-0][StateMachineAdapter]
Encountered an error=Status[EIO<1014>: Fail to save snapshot.] on StateMachine
org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine,
it's highly recommended to implement this method as raft stops working since
some error occurs, you should figure out the cause and repair or remove this
node.
Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save
snapshot.]]
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:687)
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:411)
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:127)
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
None of these errors is fatal during node stop, so we should avoid logging
them.
This may also happen if a partition gets evicted from a node while the node
'saves' a Raft snapshot. This needs to be checked.
Here, 'saves' is quoted because no actual data is saved to the filesystem; the
operation only flushes the storages and truncates the log.
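One possible way to avoid the noise, as a minimal sketch only: treat a snapshot
failure as expected when the node is already stopping and the cause chain
contains a CancellationException, and log it at a lower level. The class
SnapshotFailureLogPolicy, its Level enum, and the markStopping() hook are
hypothetical names used for illustration; they are not part of Ignite or JRaft,
and the actual fix may look different.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not an Ignite/JRaft class): picks a log level for a snapshot failure. */
final class SnapshotFailureLogPolicy {
    enum Level { DEBUG, ERROR }

    /** Set once the node begins to stop (assumed to be called from the stop path). */
    private final AtomicBoolean stopping = new AtomicBoolean();

    void markStopping() {
        stopping.set(true);
    }

    /** Cancellation while stopping is expected, so it is demoted; everything else stays ERROR. */
    Level levelFor(Throwable failure) {
        return stopping.get() && isCancellation(failure) ? Level.DEBUG : Level.ERROR;
    }

    /** Walks the cause chain (bounded, to guard against cycles) looking for a cancellation. */
    static boolean isCancellation(Throwable t) {
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof CancellationException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SnapshotFailureLogPolicy policy = new SnapshotFailureLogPolicy();
        Throwable cancelled = new CompletionException(new CancellationException());

        System.out.println("running:  " + policy.levelFor(cancelled));
        policy.markStopping();
        System.out.println("stopping: " + policy.levelFor(cancelled));
    }
}
```

Keeping ERROR for cancellations that occur while the node is not stopping is
deliberate in this sketch: outside of node stop (or, possibly, partition
eviction) a cancelled snapshot may still indicate a real problem.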
was:
If a node gets stopped while installing a Raft snapshot to another node, log
entries like the following ones appear:
2025-04-23 17:18:54:998 +0200
[ERROR][%defaultNode%JRaft-Common-Executor-2][SnapshotExecutorImpl] Fail to
save snapshot: Status[EIO<1014>: Fail to save snapshot to
/.../work/partitions/meta/370_part_21-0/snapshot, reason
java.util.concurrent.CancellationException].
They are accompanied by
2025-04-23 17:18:54:999 +0200
[ERROR][%defaultNode%JRaft-FSMCaller-Disruptor_stripe_9-0][StateMachineAdapter]
Encountered an error=Status[EIO<1014>: Fail to save snapshot.] on StateMachine
org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine,
it's highly recommended to implement this method as raft stops working since
some error occurs, you should figure out the cause and repair or remove this
node.
Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save
snapshot.]]
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:687)
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:411)
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:127)
at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
None of these errors is fatal during node stop, so we should avoid logging
them.
This may also happen if a partition gets evicted from a node while the node
installs a Raft snapshot on a follower. This needs to be checked.
> Partition raft snapshot interrupted due to node stop causes garbage in logs
> ---
>
> Key: IGNITE-25240
> URL: https://issues.apache.org/jira/browse/IGNITE-25240
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
>
> If a node gets stopped while 'saving' a Raft snapshot, log entries like the
> following ones appear:
> 2025-04-23 17:18:54:998 +0200
> [ERROR][%defaultNode%JRaft-Common-Executor-2][SnapshotExecutorImpl] Fail to
> save snapshot: Status[EIO<1014>: Fail to save snapshot to
> /.../work/partitions/meta