Hi Sergey!

It seems that the root cause of your problem is a possibly corrupted data
page:

Caused by: java.lang.IllegalStateException: Item not found: 3
at
org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)

and it happened twice: not only during expiration cleanup, but also during
rebalance eviction. Your scenario looks quite ordinary, though, and shouldn't
cause such problems.
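
To illustrate how the innermost "Caused by" is picked out of a nested
exception chain like the ones in your logs, here is a minimal plain-Java
sketch. The class and method names (RootCause, rootCauseOf) and the wrapper
exceptions are made up for the illustration; this is not Ignite code.

```java
public class RootCause {
    /** Follows getCause() links until the innermost exception is reached. */
    public static Throwable rootCauseOf(Throwable t) {
        Throwable cur = t;
        // getCause() returns null when there is no further cause.
        while (cur.getCause() != null)
            cur = cur.getCause();
        return cur;
    }

    public static void main(String[] args) {
        // Stand-ins for the wrappers seen in the log; plain RuntimeExceptions
        // are used here instead of the real Ignite exception classes.
        Throwable inner = new IllegalStateException("Item not found: 3");
        Throwable middle = new RuntimeException(inner);
        Throwable outer = new RuntimeException("B+Tree is corrupted", middle);

        System.out.println(rootCauseOf(outer));
    }
}
```

Both of your failures end in the same innermost exception type and the same
frame, which is why I treat them as one problem.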

I've found a similar existing ticket,
https://issues.apache.org/jira/browse/IGNITE-11400
but its assignee couldn't reproduce the error.

Could you please provide more information about what happened on your
cluster beforehand, and about the cache configuration? Do you use expiration
on persistent caches? Any details could be useful for localizing the problem.

Denis Chudov
Software Engineer, Moscow
+7 905 5775239
https://www.gridgain.com
Powered by Apache® Ignite™


On Thu, Nov 18, 2021 at 2:25 PM Sergey Korotkov <serge.korot...@gmail.com>
wrote:

> Hello,
>
> We have trouble with a CorruptedTreeException: B+Tree is corrupted
> during node start after a cluster restart. It looks like the caches with
> an Expiry Policy configured are the source of the problem.
>
> I have attached the log from the problem node. The exact steps with
> timestamps are as follows. Before the deactivation the cluster had worked
> fine for about 5 days.
>
> 2021-11-08 10:54:44         cluster deactivate request
>
> 2021-11-08 10:59:33         cluster deactivated
>
> 2021-11-08 11:02:30         stop all nodes
>
> 2021-11-08 11:02:39         start all nodes
>
> 2021-11-08 11:03:14         auto-activation start
>
> 2021-11-08 11:03:16         cluster activated
>
> 2021-11-08 11:03:21         'B+Tree is corrupted' exception in
> GridCacheTtlManager.expire() on one of the nodes (see the
> 10.12.86.29-ignite-2021-11-08.0.log):
>
> [2021-11-08 11:03:21,820][ERROR][ttl-cleanup-worker-#215][ROOT]{}
> Critical system error detected. Will be handled accordingly to
> configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false,
> timeout=0, super=AbstractFailureHandler
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=CRITICAL_ERROR, err=class
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree
> is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on bounds:
> [lower=null, upper=PendingRow []]]]]
>
> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
> bounds: [lower=null, upper=PendingRow []]]
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:3076)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:3023)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1255)
>         at
>
> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246)
>         at
>
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:193)
>         at
>
> java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>         at
>
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:192)
>         at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by:
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
> java.lang.IllegalStateException: Item not found: 3
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>         ... 11 common frames omitted
> Caused by:
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
> java.lang.IllegalStateException: Item not found: 3
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:123)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.PendingRow.initKey(PendingRow.java:73)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:127)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:32)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>         ... 12 common frames omitted
> Caused by: java.lang.IllegalStateException: Item not found: 3
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>         ... 22 common frames omitted
>
>
> The error message suggests running
> org.apache.ignite.development.utils.IgniteWalConverter to diagnose the
> problem. I attached the output of this utility:
>
>  - corruptedPages_2021-11-08_11-03-21_999.txt - file created by Ignite
> on crash
>
>  - diag-2021-11-08.txt - output of the diagnostic utility.
>
>
> The next day we tried to start this node again and it still failed with
> 'B+Tree is corrupted', but in a different place:
>
> 2021-11-09 12:37:34     'B+Tree is corrupted' exception in
> PartitionsEvictManager$PartitionEvictionTask.run()  (see the
> 10.12.86.29-ignite-2021-11-09.0.log)
>
>
> [2021-11-09 12:43:10,857][ERROR][rebalance-#344][ROOT]{} Critical system
> error detected. Will be handled accordingly to configured handler
> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler
> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
> [type=CRITICAL_ERROR, err=class
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree
> is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on bounds:
> [lower=null, upper=null]]]]
>
> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
> bounds: [lower=null, upper=null]]
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:167)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:63)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>         at
>
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.cursor(IgniteCacheOffheapManagerImpl.java:2914)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.cursor(GridCacheOffheapManager.java:2856)
>         at
>
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.onHasNext(IgniteCacheOffheapManagerImpl.java:938)
>         at
>
> org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
>         at
>
> org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>         at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:987)
>         at
>
> org.apache.ignite.internal.processors.cache.distributed.dht.topology.PartitionsEvictManager$PartitionEvictionTask.run(PartitionsEvictManager.java:409)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by:
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
> java.lang.IllegalStateException: Item not found: 16
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>         ... 16 common frames omitted
> Caused by:
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
> java.lang.IllegalStateException: Item not found: 16
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.DataRow.<init>(DataRow.java:55)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.CacheDataRowStore.dataRow(CacheDataRowStore.java:129)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:422)
>         at
>
> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:63)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>         ... 17 common frames omitted
> Caused by: java.lang.IllegalStateException: Item not found: 16
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>         at
>
> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>
>
> I attached the corrupted pages file and diagnostic output for this
> case as well:
>
> - corruptedPages_2021-11-09_12-43-12_449.txt
>
> - diag-2021-11-09.txt
>
>
> In both cases the pages belong to caches with an Expiry Policy configured.
>
>
> ****
>
> What can be done about that? Is there a recommended way to stop/start
> an Ignite cluster to prevent such data loss problems?
>
> ****
>
> I see some similar fixed issues in Jira, such as
> https://issues.apache.org/jira/browse/IGNITE-12489 or
> https://issues.apache.org/jira/browse/IGNITE-14093, but it looks like
> something is still not working in 2.11.0.
>
>
> Thanks,
>
> --
>
>   Sergey
>
>
>
>
