[ 
https://issues.apache.org/jira/browse/IGNITE-26037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016538#comment-18016538
 ] 

Roman Puchkovskiy commented on IGNITE-26037:
--------------------------------------------

The patch looks good to me, thanks!

> Error saving FreeList metadata causing checkpointer to crash
> ------------------------------------------------------------
>
>                 Key: IGNITE-26037
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26037
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Kirill Tkalenko
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.1
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> While analyzing a log, I found an error during the saving of FreeList
> metadata that crashed the Checkpointer and, as a consequence, left the node
> inoperative. This needs to be investigated.
> Scenario: a cluster of three nodes was loaded with a lot of data, and all
> tables were in a zone with a replica count of 1. After all the data was
> loaded, the replica count was changed from 1 to 3, which triggered multiple
> rebalances via raft snapshots. After some time, the problem appeared. As far
> as I understand, the exception itself occurred while a raft snapshot was
> being saved.
> This may be difficult to reproduce until the issue in IGNITE-26034 is fixed.
> h3. {color:red}Update{color}
> The root cause is a race between recreating the storage structures at the
> start of a rebalance and a checkpoint. There is a small chance that the
> *closed* FreeList tries to trigger the metadata sync at the checkpoint, which
> causes an error at the checkpoint and shuts the node down.
> In my opinion, the correct fix is to remove the checkpoint listener that
> synchronizes the FreeList metadata before closing the structures and to add
> it back after they are recreated. There is also a small chance that the
> checkpoint starts executing the callback for the closed FreeList before the
> listener is removed, so we need to take that into account.
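
The fix proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: `Checkpointer`, `CheckpointListener`, `FreeList` and `PartitionStorage` here are simplified, hypothetical stand-ins and do not mirror the Ignite classes in the stack trace. The listener is removed before the structures are closed, and a read-write lock covers the case where a checkpoint has already picked up the callback before the removal.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only: every type below is a hypothetical stand-in, not an Ignite class.
class FreeListCheckpointSketch {
    /** Callback that a checkpoint invokes to sync metadata. */
    interface CheckpointListener {
        void onCheckpoint();
    }

    /** Minimal checkpointer: a listener registry plus a synchronous checkpoint. */
    static class Checkpointer {
        final List<CheckpointListener> listeners = new CopyOnWriteArrayList<>();

        void checkpoint() {
            // Iterates over a snapshot, so a listener removed concurrently may still be called.
            for (CheckpointListener listener : listeners) {
                listener.onCheckpoint();
            }
        }
    }

    /** Free list that rejects metadata saves once it is closed. */
    static class FreeList {
        final AtomicInteger saves = new AtomicInteger();
        private volatile boolean closed;

        void saveMetadata() {
            if (closed) {
                throw new IllegalStateException("FreeList is closed");
            }
            saves.incrementAndGet();
        }

        void close() {
            closed = true;
        }
    }

    /** Storage that owns the free list and its checkpoint listener. */
    static class PartitionStorage {
        private final Checkpointer checkpointer;
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final CheckpointListener listener = this::syncMetadataOnCheckpoint;
        private FreeList freeList = new FreeList(); // Guarded by lock.

        PartitionStorage(Checkpointer checkpointer) {
            this.checkpointer = checkpointer;
            checkpointer.listeners.add(listener);
        }

        FreeList freeList() {
            lock.readLock().lock();
            try {
                return freeList;
            } finally {
                lock.readLock().unlock();
            }
        }

        /** Checkpoint callback: the read lock keeps the free list from being swapped mid-save. */
        void syncMetadataOnCheckpoint() {
            lock.readLock().lock();
            try {
                freeList.saveMetadata();
            } finally {
                lock.readLock().unlock();
            }
        }

        /** Called at the start of a rebalance to recreate the structures. */
        void recreateStructures() {
            checkpointer.listeners.remove(listener); // 1. No new callbacks are scheduled.
            lock.writeLock().lock();                 // 2. Wait for a callback that is already running.
            try {
                freeList.close();
                freeList = new FreeList();
            } finally {
                lock.writeLock().unlock();
            }
            checkpointer.listeners.add(listener);    // 3. Return the listener.
        }
    }

    public static void main(String[] args) {
        Checkpointer checkpointer = new Checkpointer();
        PartitionStorage storage = new PartitionStorage(checkpointer);

        checkpointer.checkpoint();
        FreeList old = storage.freeList();
        storage.recreateStructures();
        checkpointer.checkpoint();

        // Prints "old saves=1, new saves=1".
        System.out.println("old saves=" + old.saves.get() + ", new saves=" + storage.freeList().saves.get());
    }
}
```

In this sketch, a callback that was picked up before the listener was removed either finishes before the close or runs afterwards against the recreated free list, so it never touches the closed instance. Whether the real fix should skip the stale callback instead is a separate design decision.
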
> {noformat}
> 2025-07-24 14:11:42:486 +0000 [ERROR][%node1%checkpoint-thread][FailureManager] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_TERMINATION]
> org.apache.ignite.internal.failure.StackTraceCapturingException: IGN-CMN-65535 Unknown error TraceId:00fda422
>       at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
>       at org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
>       at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263)
>       at org.apache.ignite.internal.util.worker.IgniteWorker.run(IgniteWorker.java:89)
>       at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: org.apache.ignite.internal.lang.IgniteInternalCheckedException: IGN-CMN-65535 java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38] TraceId:00fda422
>       at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:347)
>       at org.apache.ignite.internal.pagememory.persistence.checkpoint.Checkpointer.body(Checkpointer.java:243)
>       ... 2 more
> Caused by: java.util.concurrent.CompletionException: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
>       at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
>       at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)
>       at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:874)
>       at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
>       at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
>       at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
>       at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:55)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>       ... 1 more
> Caused by: java.lang.AssertionError: FullPageId [pageId=000100020000003c, effectivePageId=000000020000003c, groupId=38]
>       at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:819)
>       at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:705)
>       at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.acquirePage(PersistentPageMemory.java:677)
>       at org.apache.ignite.internal.pagememory.util.PageHandler.writePage(PageHandler.java:202)
>       at org.apache.ignite.internal.pagememory.datastructure.DataStructure.write(DataStructure.java:250)
>       at org.apache.ignite.internal.pagememory.freelist.PagesList.flushBucketsCache(PagesList.java:380)
>       at org.apache.ignite.internal.pagememory.freelist.PagesList.saveMetadata(PagesList.java:322)
>       at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.saveMetadata(FreeListImpl.java:813)
>       at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.saveFreeListMetadataBusy(PersistentPageMemoryMvPartitionStorage.java:579)
>       at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$17(PersistentPageMemoryMvPartitionStorage.java:494)
>       at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busySafe(AbstractPageMemoryMvPartitionStorage.java:1052)
>       at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$syncMetadataOnCheckpoint$18(PersistentPageMemoryMvPartitionStorage.java:494)
>       at org.apache.ignite.internal.pagememory.persistence.checkpoint.AwaitTasksCompletionExecutor.lambda$execute$1(AwaitTasksCompletionExecutor.java:51)
>       ... 3 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
