[ 
https://issues.apache.org/jira/browse/IGNITE-26037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-26037:
-------------------------------------
    Description: 
While analyzing the log, I found an error that occurred when saving FreeList 
metadata. It crashed the Checkpointer, which in turn left the node 
inoperative. This needs to be investigated and fixed.

Scenario: a lot of data was loaded into a three-node cluster in which all 
tables were in a zone with a replica count of 1. After all the data was 
loaded, the replica count was changed from 1 to 3, which triggered multiple 
rebalances via raft snapshots. Some time later, this problem appeared.

This may be difficult to reproduce until the issue in IGNITE-26034 is fixed.

{noformat}
2025-07-24 14:11:42:486 +0000 [ERROR][%node1%checkpoint-thread][FailureManager] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_TERMINATION]
org.apache.ignite.internal.failure.StackTraceCapturingException: IGN-CMN-65535 Unknown error TraceId:00fda422
        at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
        at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263)
        at org.apache.ignite.internal.util.worker.IgniteWorker.run(IgniteWorker.java:89)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.ignite.internal.lang.IgniteInternalCheckedException: IGN-CMN-65535 java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38] TraceId:00fda422
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:347)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:243)
        ... 2 more
Caused by: java.util.concurrent.CompletionException: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
        at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
        at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:55)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        ... 1 more
Caused by: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
        at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:819)
        at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:705)
        at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:677)
        at org.apache.ignite.internal.pagememory.util.PageHandler.writePage(PageHandler.java:202)
        at org.apache.ignite.internal.pagememory.datastructure.DataStructure.write(DataStructure.java:250)
        at org.apache.ignite.internal.pagememory.freelist.PagesList.flushBucketsCache(PagesList.java:380)
        at org.apache.ignite.internal.pagememory.freelist.PagesList.saveMetadata(PagesList.java:322)
        at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.saveMetadata(FreeListImpl.java:813)
        at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.saveFreeListMetadataBusy(PersistentPageMemoryMvPartitionStorage.java:579)
        at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$17(PersistentPageMemoryMvPartitionStorage.java:494)
        at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busySafe(AbstractPageMemoryMvPartitionStorage.java:1052)
        at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$18(PersistentPageMemoryMvPartitionStorage.java:494)
        at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:51)
        ... 3 more
{noformat}
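
For anyone reading the assertion message: the two IDs in it differ only in 
their upper 16 bits. The sketch below splits the logged pageId into bit 
fields and checks that clearing the upper 16 bits yields the logged 
effectivePageId. The 16/16/32 split, the mask, and the names PageIdBits, 
high16, mid16 and low32 are assumptions inferred from these two values only; 
they are not taken from the Ignite source.

```java
public class PageIdBits {
    // Assumption: the effective page id keeps only the low 48 bits of the page id.
    static final long LOW_48_BITS_MASK = 0x0000FFFFFFFFFFFFL;

    // Clears the upper 16 bits of the page id.
    static long effective(long pageId) {
        return pageId & LOW_48_BITS_MASK;
    }

    // Upper 16 bits (hypothetical label; the part dropped by the mask).
    static int high16(long pageId) {
        return (int) (pageId >>> 48);
    }

    // Next 16 bits (hypothetical label).
    static int mid16(long pageId) {
        return (int) ((pageId >>> 32) & 0xFFFF);
    }

    // Low 32 bits (hypothetical label).
    static int low32(long pageId) {
        return (int) pageId;
    }

    public static void main(String[] args) {
        // Values copied from the AssertionError message in the log.
        long pageId = 0x000100020000003cL;
        long loggedEffectivePageId = 0x000000020000003cL;

        System.out.println(String.format("high16=%04x mid16=%04x low32=%08x",
                high16(pageId), mid16(pageId), low32(pageId)));
        System.out.println(String.format("effective=%016x matchesLog=%b",
                effective(pageId), effective(pageId) == loggedEffectivePageId));
    }
}
```

Under this assumption the failing page has 0x0001 in the upper field, 0x0002 
in the middle field and 0x3c in the low field.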


> Error saving FreeList metadata causing checkpointer to crash
> ------------------------------------------------------------
>
>                 Key: IGNITE-26037
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26037
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Kirill Tkalenko
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.1
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)