[jira] [Updated] (IGNITE-25240) Partition raft snapshot interrupted due to node stop causes garbage in logs

Roman Puchkovskiy (Jira) Fri, 25 Apr 2025 04:23:24 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roman Puchkovskiy updated IGNITE-25240:
---------------------------------------
    Description: 
If a node gets stopped while it is installing a Raft snapshot on another node,
log entries like the following appear:

2025-04-23 17:18:54:998 +0200 [ERROR][%defaultNode%JRaft-Common-Executor-2][SnapshotExecutorImpl] Fail to save snapshot: Status[EIO<1014>: Fail to save snapshot to /.../work/partitions/meta/370_part_21-0/snapshot, reason java.util.concurrent.CancellationException].

They are accompanied by

2025-04-23 17:18:54:999 +0200 [ERROR][%defaultNode%JRaft-FSMCaller-Disruptor_stripe_9-0][StateMachineAdapter] Encountered an error=Status[EIO<1014>: Fail to save snapshot.] on StateMachine org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save snapshot.]]
        at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:687)
        at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:411)
        at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:127)
        at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

 

None of these are fatal during node stop, so we should avoid logging them.

This might also happen if a partition gets evicted from a node while the node
is installing a Raft snapshot on a follower. This needs to be checked.
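
A minimal sketch of one possible way to suppress these entries, assuming the
node exposes some "stopping" flag that is set before its components are shut
down. The class and method names (SnapshotFailureLogFilter, markStopping,
shouldLogAsError) are hypothetical and do not exist in Ignite or JRaft; only
the JDK exception types are real. The idea is to keep logging at ERROR level
unless the node is stopping and the failure was caused by a
CancellationException:

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper, for illustration only; not part of Ignite or JRaft. */
final class SnapshotFailureLogFilter {
    /** Assumed to be set by the node stop procedure before components are stopped. */
    private final AtomicBoolean stopping = new AtomicBoolean();

    /** Marks the node as stopping; cancellations seen afterwards are considered expected. */
    void markStopping() {
        stopping.set(true);
    }

    /** Returns true if the snapshot failure should still be reported at ERROR level. */
    boolean shouldLogAsError(Throwable failure) {
        return !(stopping.get() && isCancellation(failure));
    }

    /** Walks the cause chain looking for a CancellationException. */
    static boolean isCancellation(Throwable t) {
        // Bound the walk to guard against cyclic cause chains.
        for (int depth = 0; t != null && depth < 16; depth++, t = t.getCause()) {
            if (t instanceof CancellationException) {
                return true;
            }
        }
        return false;
    }
}
```

A caller would check shouldLogAsError(...) before emitting the ERROR entry and
log at a lower level (or not at all) otherwise. Failures that are not
cancellations, or cancellations seen while the node is not stopping, are still
reported as errors.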

> Partition raft snapshot interrupted due to node stop causes garbage in logs
> ---------------------------------------------------------------------------
>
>                 Key: IGNITE-25240
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25240
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>
> If a node gets stopped while it is installing a Raft snapshot on another node,
> log entries like the following appear:
> 2025-04-23 17:18:54:998 +0200 [ERROR][%defaultNode%JRaft-Common-Executor-2][SnapshotExecutorImpl] Fail to save snapshot: Status[EIO<1014>: Fail to save snapshot to /.../work/partitions/meta/370_part_21-0/snapshot, reason java.util.concurrent.CancellationException].
> They are accompanied by
> 2025-04-23 17:18:54:999 +0200 [ERROR][%defaultNode%JRaft-FSMCaller-Disruptor_stripe_9-0][StateMachineAdapter] Encountered an error=Status[EIO<1014>: Fail to save snapshot.] on StateMachine org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
> Error [type=ERROR_TYPE_SNAPSHOT, status=Status[EIO<1014>: Fail to save snapshot.]]
>         at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:687)
>         at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl.onSnapshotSaveDone(SnapshotExecutorImpl.java:411)
>         at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.continueRun(SnapshotExecutorImpl.java:127)
>         at org.apache.ignite.raft.jraft.storage.snapshot.SnapshotExecutorImpl$SaveSnapshotDone.lambda$run$0(SnapshotExecutorImpl.java:123)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at java.base/java.lang.Thread.run(Thread.java:1583)
>  
> None of these are fatal during node stop, so we should avoid logging them.
> This might also happen if a partition gets evicted from a node while the node
> is installing a Raft snapshot on a follower. This needs to be checked.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
