Any other thoughts on this? The data corruption occurred while we were running version 2.7.5. I have looked at a couple of tickets involving corrupted trees, but none of them seem to apply to our use of Ignite. I would like to understand at least how we get into this corrupted state in the first place, and how to handle it when it happens. Is there a way to detect and log this error without crashing the process?
From: user@ignite.apache.org
At: 02/19/21 14:18:44
To: Mitchell Rathbun (BLOOMBERG/ 731 LEX), user@ignite.apache.org
Subject: Re: Corrupted B+ Tree Causing Repeated Crashes

Hello!

What version of Apache Ignite are you using?

19.02.2021, 22:07, "Mitchell Rathbun (BLOOMBERG/ 731 LEX)" <mrathb...@bloomberg.net>:
> We are encountering the following error repeatedly, which causes our node to crash:
>
> 2021-02-19 13:30:38,175 ERROR STDIO [pool-32-thread-5] {} Feb 19, 2021 1:30:38 PM org.apache.ignite.logger.java.JavaLogger error
> SEVERE: Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-128547534, val2=281474976721835]], msg=Runtime failure on lookup row: SearchRow [key=com.bloomberg.aim.wingman.cachemgr.Ts3DataCache$Ts3SecurityCacheKey [idHash=1436767547, hash=-931214342, accountCusip=com.bloomberg.aim.wingman.common.dto.submgr.AccountCusip [idHash=316813954, hash=343304888, accountId=0, cusip=com.bloomberg.aim.wingman.common.dto.Cusip [idHash=1325824124, hash=2123451959, cusip1=136125, cusip2=9001, cusip3=541401120, dept=2, subflag=2]]], hash=-931214342, cacheId=0]]]]
> class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-128547534, val2=281474976721835]], msg=Runtime failure on lookup row: SearchRow [key=com.bloomberg.aim.wingman.cachemgr.Ts3DataCache$Ts3SecurityCacheKey [idHash=1436767547, hash=-931214342, accountCusip=com.bloomberg.aim.wingman.common.dto.submgr.AccountCusip [idHash=316813954, hash=343304888, accountId=0, cusip=com.bloomberg.aim.wingman.common.dto.Cusip [idHash=1325824124, hash=2123451959, cusip1=136125, cusip2=9001, cusip3=541401120, dept=2, subflag=2]]], hash=-931214342, cacheId=0]]
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6106)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findOne(BPlusTree.java:1367)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findOne(BPlusTree.java:1344)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.find(IgniteCacheOffheapManagerImpl.java:2755)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.find(GridCacheOffheapManager.java:2469)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.read(IgniteCacheOffheapManagerImpl.java:637)
>     at org.apache.ignite.internal.processors.cache.local.atomic.GridLocalAtomicCache.getAllInternal(GridLocalAtomicCache.java:410)
>     at org.apache.ignite.internal.processors.cache.local.atomic.GridLocalAtomicCache.getAll(GridLocalAtomicCache.java:323)
>     at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGetAll(GridCacheAdapter.java:4907)
>     at org.apache.ignite.internal.processors.cache.GridCacheAdapter.getAll(GridCacheAdapter.java:1617)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.getAll(IgniteCacheProxyImpl.java:1157)
>     at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.getAll(GatewayProtectedCacheProxy.java:724)
>     at com.bloomberg.aim.wingman.cachemgr.Ts3DataCache.fetchCalcrtDataByKeySync(Ts3DataCache.java:1535)
>     at com.bloomberg.aim.wingman.cachemgr.Ts3DataCache.lambda$fetchCalcrtDataBySecurityKeyAccountAsync$11(Ts3DataCache.java:895)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.IllegalStateException: Item not found: 1
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:351)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:459)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:501)
>     at org.apache.ignite.internal.processors.cache.tree.CacheDataTree.compareKeys(CacheDataTree.java:447)
>     at org.apache.ignite.internal.processors.cache.tree.CacheDataTree.compare(CacheDataTree.java:386)
>     at org.apache.ignite.internal.processors.cache.tree.CacheDataTree.compare(CacheDataTree.java:63)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:5377)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:5297)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1100(BPlusTree.java:98)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:302)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:5888)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run(BPlusTree.java:282)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:5874)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:169)
>     at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:364)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.read(BPlusTree.java:6075)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1424)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1433)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1433)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doFind(BPlusTree.java:1391)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findOne(BPlusTree.java:1359)
>     ... 16 more
> 2021-02-19 13:30:38,177 ERROR STDIO [pool-32-thread-5] {} Feb 19, 2021 1:30:38 PM org.apache.ignite.logger.java.JavaLogger error
> SEVERE: A critical problem with persistence data structures was detected. Please make backup of persistence storage and WAL files for further analysis. Persistence storage path: null WAL path: db/wal WAL archive path: db/wal/archive
>
> I think we can fix this by just clearing the persistent storage and restarting our node, but we can't have this happen in production so I want to understand two things:
>
> 1. How can this happen?
>
> 2. How can we prevent this from happening/best respond when it does happen? We don't want our process to crash as a result of this, we would rather just invalidate the cache and clear it if at all possible.
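
On the "detect and log" part of the question, here is a minimal sketch that only works on the log text. It assumes the message keeps the `pages(groupId, pageId)=[IgniteBiTuple [val1=..., val2=...]]` layout shown in the trace above. `CorruptedTreeLogParser`, `PageRef`, `mentionsCorruptedTree` and `parsePages` are hypothetical names made up for this illustration, not Ignite APIs. It does not stop the node from halting; that is decided by the configured failure handler (`StopNodeOrHaltFailureHandler` in the log), which this sketch does not touch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Ignite): extracts page references from a logged message. */
public class CorruptedTreeLogParser {

    /** One (groupId, pageId) pair as printed in the "pages(groupId, pageId)=[...]" part of the message. */
    public static final class PageRef {
        public final long groupId;
        public final long pageId;

        PageRef(long groupId, long pageId) {
            this.groupId = groupId;
            this.pageId = pageId;
        }
    }

    // Assumption: tuples are printed as "IgniteBiTuple [val1=<groupId>, val2=<pageId>]", as in the log above.
    private static final Pattern PAGE =
        Pattern.compile("IgniteBiTuple \\[val1=(-?\\d+), val2=(-?\\d+)\\]");

    /** True if the text mentions the corrupted-tree exception by its simple class name. */
    public static boolean mentionsCorruptedTree(String text) {
        return text.contains("CorruptedTreeException");
    }

    /** Returns every (groupId, pageId) pair found in the text; the list is empty if nothing matches. */
    public static List<PageRef> parsePages(String text) {
        List<PageRef> pages = new ArrayList<>();
        Matcher m = PAGE.matcher(text);
        while (m.find()) {
            pages.add(new PageRef(Long.parseLong(m.group(1)), Long.parseLong(m.group(2))));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the message from the log above (the lookup row is cut short here).
        String line = "CorruptedTreeException: B+Tree is corrupted "
            + "[pages(groupId, pageId)=[IgniteBiTuple [val1=-128547534, val2=281474976721835]], "
            + "msg=Runtime failure on lookup row]";

        if (mentionsCorruptedTree(line)) {
            for (PageRef p : parsePages(line)) {
                System.out.println("corrupted tree: groupId=" + p.groupId + " pageId=" + p.pageId);
            }
        }
    }
}
```

Something like this could feed an alert or a metrics counter from the application log, so the affected cache group and page are recorded before the persistent storage is cleared.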