Mitchell,

Can you provide the full log and the cache configuration?

On Thu, 25 Feb 2021 at 03:55, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <mrathb...@bloomberg.net> wrote:
>
> Any other thoughts on this? The data corruption occurred when we were using
> version 2.7.5. I have looked at a couple of tickets involving corrupted
> trees, but it doesn't seem like any of them apply to our use case of Ignite.
> Would like to understand at least how we get into this corrupted state in the
> first place, and how to handle it when it happens. Is there a way to detect
> and log this error while avoiding crashing the process?
>
> From: user@ignite.apache.org At: 02/19/21 14:18:44
> To: Mitchell Rathbun (BLOOMBERG/ 731 LEX ) , user@ignite.apache.org
> Subject: Re: Corrupted B+ Tree Causing Repeated Crashes
>
> Hello! What version of Apache Ignite are you using?
>
> 19.02.2021, 22:07, "Mitchell Rathbun (BLOOMBERG/ 731 LEX)" <mrathb...@bloomberg.net>:
> > We are encountering the following error repeatedly, which causes our node to crash:
> >
> > 2021-02-19 13:30:38,175 ERROR STDIO [pool-32-thread-5] {} Feb 19, 2021 1:30:38 PM org.apache.ignite.logger.java.JavaLogger error
> > SEVERE: Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-128547534, val2=281474976721835]], msg=Runtime failure on lookup row: SearchRow [key=com.bloomberg.aim.wingman.cachemgr.Ts3DataCache$Ts3SecurityCacheKey [idHash=1436767547, hash=-931214342, accountCusip=com.bloomberg.aim.wingman.common.dto.submgr.AccountCusip [idHash=316813954, hash=343304888, accountId=0, cusip=com.bloomberg.aim.wingman.common.dto.Cusip [idHash=1325824124, hash=2123451959, cusip1=136125, cusip2=9001, cusip3=541401120, dept=2, subflag=2]]], hash=-931214342, cacheId=0]]]]
> > class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-128547534, val2=281474976721835]], msg=Runtime failure on lookup row: SearchRow [key=com.bloomberg.aim.wingman.cachemgr.Ts3DataCache$Ts3SecurityCacheKey [idHash=1436767547, hash=-931214342, accountCusip=com.bloomberg.aim.wingman.common.dto.submgr.AccountCusip [idHash=316813954, hash=343304888, accountId=0, cusip=com.bloomberg.aim.wingman.common.dto.Cusip [idHash=1325824124, hash=2123451959, cusip1=136125, cusip2=9001, cusip3=541401120, dept=2, subflag=2]]], hash=-931214342, cacheId=0]]
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6106)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findOne(BPlusTree.java:1367)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findOne(BPlusTree.java:1344)
> >     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.find(IgniteCacheOffheapManagerImpl.java:2755)
> >     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.find(GridCacheOffheapManager.java:2469)
> >     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.read(IgniteCacheOffheapManagerImpl.java:637)
> >     at org.apache.ignite.internal.processors.cache.local.atomic.GridLocalAtomicCache.getAllInternal(GridLocalAtomicCache.java:410)
> >     at org.apache.ignite.internal.processors.cache.local.atomic.GridLocalAtomicCache.getAll(GridLocalAtomicCache.java:323)
> >     at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGetAll(GridCacheAdapter.java:4907)
> >     at org.apache.ignite.internal.processors.cache.GridCacheAdapter.getAll(GridCacheAdapter.java:1617)
> >     at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.getAll(IgniteCacheProxyImpl.java:1157)
> >     at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.getAll(GatewayProtectedCacheProxy.java:724)
> >     at com.bloomberg.aim.wingman.cachemgr.Ts3DataCache.fetchCalcrtDataByKeySync(Ts3DataCache.java:1535)
> >     at com.bloomberg.aim.wingman.cachemgr.Ts3DataCache.lambda$fetchCalcrtDataBySecurityKeyAccountAsync$11(Ts3DataCache.java:895)
> >     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> >     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> >     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> >     at java.base/java.lang.Thread.run(Thread.java:834)
> > Caused by: java.lang.IllegalStateException: Item not found: 1
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:351)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:459)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:501)
> >     at org.apache.ignite.internal.processors.cache.tree.CacheDataTree.compareKeys(CacheDataTree.java:447)
> >     at org.apache.ignite.internal.processors.cache.tree.CacheDataTree.compare(CacheDataTree.java:386)
> >     at org.apache.ignite.internal.processors.cache.tree.CacheDataTree.compare(CacheDataTree.java:63)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:5377)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:5297)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1100(BPlusTree.java:98)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:302)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:5888)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run(BPlusTree.java:282)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:5874)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:169)
> >     at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:364)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.read(BPlusTree.java:6075)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1424)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1433)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1433)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doFind(BPlusTree.java:1391)
> >     at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findOne(BPlusTree.java:1359)
> >     ... 16 more
> > 2021-02-19 13:30:38,177 ERROR STDIO [pool-32-thread-5] {} Feb 19, 2021 1:30:38 PM org.apache.ignite.logger.java.JavaLogger error
> > SEVERE: A critical problem with persistence data structures was detected. Please make backup of persistence storage and WAL files for further analysis. Persistence storage path: null WAL path: db/wal WAL archive path: db/wal/archive
> >
> > I think we can fix this by just clearing the persistent storage and
> > restarting our node, but we can't have this happen in production so I want to
> > understand two things:
> >
> > 1. How can this happen?
> >
> > 2. How can we prevent this from happening/best respond when it does happen?
> > We don't want our process to crash as a result of this, we would rather just
> > invalidate the cache and clear it if at all possible.
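
On the question of avoiding a process crash: the quoted log shows StopNodeOrHaltFailureHandler with tryStop=false, which halts the JVM on a critical error. A minimal sketch of one possible alternative, assuming it is acceptable for the Ignite node to stop while the application process keeps running, is to configure StopNodeFailureHandler instead. The class name FailureHandlerSketch is hypothetical, and the real data storage and cache settings are not shown in this thread, so they are left out. This does not repair the tree or explain how the corruption arose.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeFailureHandler;

// Hypothetical class name, for illustration only.
public class FailureHandlerSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // On a critical failure, stop only the Ignite node rather than halting the JVM
        // (the quoted log shows StopNodeOrHaltFailureHandler with tryStop=false).
        cfg.setFailureHandler(new StopNodeFailureHandler());

        // Assumption: the real persistence and cache configuration would be applied to cfg here.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Local node id: " + ignite.cluster().localNode().id());
        }
    }
}
```

With a handler like this, the thread calling getAll should still receive the exception, so the application would need to catch and log it. Whether to clear the persistence directory and restart the node afterwards remains an application-level decision.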