[ https://issues.apache.org/jira/browse/IGNITE-26037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirill Tkalenko updated IGNITE-26037:
-------------------------------------
Description:
While analyzing the log, I found an error when saving FreeList metadata. It
crashed the Checkpointer, which in turn left the node inoperative. This needs
to be investigated.
Scenario: a three-node cluster was loaded with a lot of data, with all tables
in a zone with a replica count of 1. After all the data was loaded, the replica
count was changed from 1 to 3, which triggered multiple rebalances via raft
snapshots. Some time later, this problem appeared.
This may be difficult to reproduce until the issue in IGNITE-26034 is fixed.
{noformat}
2025-07-24 14:11:42:486 +0000 [ERROR][%node1%checkpoint-thread][FailureManager] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_TERMINATION]
org.apache.ignite.internal.failure.StackTraceCapturingException: IGN-CMN-65535 Unknown error TraceId:00fda422
    at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
    at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263)
    at org.apache.ignite.internal.util.worker.IgniteWorker.run(IgniteWorker.java:89)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.ignite.internal.lang.IgniteInternalCheckedException: IGN-CMN-65535 java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38] TraceId:00fda422
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:347)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:243)
    ... 2 more
Caused by: java.util.concurrent.CompletionException: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:55)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    ... 1 more
Caused by: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
    at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:819)
    at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:705)
    at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:677)
    at org.apache.ignite.internal.pagememory.util.PageHandler.writePage(PageHandler.java:202)
    at org.apache.ignite.internal.pagememory.datastructure.DataStructure.write(DataStructure.java:250)
    at org.apache.ignite.internal.pagememory.freelist.PagesList.flushBucketsCache(PagesList.java:380)
    at org.apache.ignite.internal.pagememory.freelist.PagesList.saveMetadata(PagesList.java:322)
    at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.saveMetadata(FreeListImpl.java:813)
    at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.saveFreeListMetadataBusy(PersistentPageMemoryMvPartitionStorage.java:579)
    at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$17(PersistentPageMemoryMvPartitionStorage.java:494)
    at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busySafe(AbstractPageMemoryMvPartitionStorage.java:1052)
    at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$18(PersistentPageMemoryMvPartitionStorage.java:494)
    at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:51)
    ... 3 more
{noformat}
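The two IDs in the assertion message differ only in their top 16 bits, which can be checked by decoding them. Below is a minimal sketch, assuming the page ID is laid out (from low to high bits) as a 32-bit page index, a 16-bit partition ID and an 8-bit flag, and that the effective page ID keeps only the low 48 bits. That layout is an assumption and is not confirmed by the log; only the two hex values are taken from it. The class and method names are hypothetical and are not the Ignite API.

```java
public class PageIdSketch {
    // Assumed field widths (hypothetical, not read from the Ignite sources).
    static final int PAGE_IDX_BITS = 32;
    static final int PART_ID_BITS = 16;

    // Low 32 bits: index of the page inside the partition.
    static long pageIndex(long pageId) {
        return pageId & 0xFFFFFFFFL;
    }

    // Next 16 bits: partition ID.
    static int partitionId(long pageId) {
        return (int) ((pageId >>> PAGE_IDX_BITS) & 0xFFFF);
    }

    // Next 8 bits: page flag.
    static int flag(long pageId) {
        return (int) ((pageId >>> (PAGE_IDX_BITS + PART_ID_BITS)) & 0xFF);
    }

    // Keep only the low 48 bits (partition ID and page index).
    static long effectivePageId(long pageId) {
        return pageId & ((1L << (PAGE_IDX_BITS + PART_ID_BITS)) - 1);
    }

    public static void main(String[] args) {
        long pageId = 0x000100020000003cL;      // pageId from the assertion message
        long loggedEffective = 0x000000020000003cL; // effectivePageId from the same message

        System.out.printf(
                "pageIndex=%d partitionId=%d flag=%d effectivePageId=%016x matches=%b%n",
                pageIndex(pageId),
                partitionId(pageId),
                flag(pageId),
                effectivePageId(pageId),
                effectivePageId(pageId) == loggedEffective);
        // prints: pageIndex=60 partitionId=2 flag=1 effectivePageId=000000020000003c matches=true
    }
}
```

Under these assumptions the failing page would be page index 60 of partition 2 in group 38, and masking the logged pageId reproduces the logged effectivePageId.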
> Error saving FreeList metadata causing checkpointer to crash
> ------------------------------------------------------------
>
> Key: IGNITE-26037
> URL: https://issues.apache.org/jira/browse/IGNITE-26037
> Project: Ignite
> Issue Type: Bug
> Reporter: Kirill Tkalenko
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: ignite-3
> Fix For: 3.1
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)