FYI, I raised Jira ticket https://issues.apache.org/jira/browse/IGNITE-20299 for this.
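
Until that ticket is resolved, one possible client-side mitigation is to check the data region name before the client ever asks the grid to create the cache. The following is only a minimal sketch under stated assumptions, not something we have deployed: `DataRegionGuard` is a hypothetical helper, it does not call any Ignite API, and the allow-list of region names (including the made-up "Default-Immutable") would have to be maintained by hand to match what the target grid's servers actually configure. "SiteModelMetadata" and "Default-Mutable" are the cache and region names from the logs quoted below.

```java
import java.util.Set;

/**
 * Hypothetical pre-flight check: refuse to request a cache whose data region
 * is not on an allow-list of regions assumed to exist on the target grid.
 * Nothing here talks to Ignite; all names are illustrative only.
 */
final class DataRegionGuard {
    private final Set<String> knownRegions;

    DataRegionGuard(Set<String> knownRegions) {
        // Defensive, immutable copy of the allow-list.
        this.knownRegions = Set.copyOf(knownRegions);
    }

    /** Returns true only if the region name is non-null and on the allow-list. */
    boolean isKnown(String dataRegionName) {
        return dataRegionName != null && knownRegions.contains(dataRegionName);
    }

    /** Throws before any cache-creation request would be sent to the grid. */
    void requireKnown(String cacheName, String dataRegionName) {
        if (!isKnown(dataRegionName)) {
            throw new IllegalArgumentException(
                "Refusing to create cache '" + cacheName
                    + "': data region '" + dataRegionName
                    + "' is not on the allow-list for this grid");
        }
    }

    public static void main(String[] args) {
        // "Default-Immutable" is an assumed region name for this sketch;
        // "Default-Mutable" is the region named in the reported exception.
        DataRegionGuard guard = new DataRegionGuard(Set.of("Default-Immutable"));
        try {
            guard.requireKnown("SiteModelMetadata", "Default-Mutable");
            // Only past this point would real code go on to create the cache.
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

A check like this only protects well-behaved clients that use it; it does not address the underlying problem that a failed cache creation leaves the grid unable to restart.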

On Mon, Aug 28, 2023 at 3:42 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:

> We have tried the same renaming in the dev environment which has multiple
> server nodes impacted and contains some data (unlike the local grid I
> tested this on which had a single server node containing no data). This
> environment failed to restart after those changes.
>
> We are still looking into it and will try to delete the cache with the
> Control.sh script, but if that is not feasible and there are no other ways
> to mitigate it I would rank this as a hot-fix candidate where a simple
> error on a customer's part is capable of causing complete loss.
>
> Raymond.
>
> On Sun, Aug 27, 2023 at 9:23 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>
>> Looking at the cache-SiteModelMetaData folder in the persistence folder
>> for a server node shows a "cache_data" file 6kb in size. No other cache
>> folders contain this file.
>>
>> As an experiment I renamed this file to "cache_dataxxx". This appeared to
>> be sufficient to permit the grid to restart. Similarly renaming the cache
>> folder to "xxxcache-SiteModelMetaData" also permitted the grid to restart;
>> we will be testing this further to verify.
>>
>> Raymond.
>>
>> On Sun, Aug 27, 2023 at 5:20 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>
>>> I have reproduced the possible bug I reported in my earlier email.
>>>
>>> Given a running grid, having a client node in the grid attempt to create
>>> a cache using a DataRegionName that does not exist in the grid causes
>>> immediate failure in the client node with the following log output.
>>>
>>> 2023-08-27 17:08:48,520 [44] INF [ImmutableClientServer] Completed
>>> partition exchange [localNode=15122bd7-bf81-44e6-a548-e70dbd9334c0,
>>> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
>>> [topVer=15, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
>>> [id=9d5ed68d-38bb-447d-aed5-189f52660716,
>>> consistentId=9d5ed68d-38bb-447d-aed5-189f52660716, addrs=ArrayList
>>> [127.0.0.1], sockAddrs=null, discPort=0, order=8, intOrder=8,
>>> lastExchangeTime=1693112858024, loc=false,
>>> ver=2.15.0#20230425-sha1:f98f7f35, isClient=true], rebalanced=false,
>>> done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=15,
>>> minorTopVer=0]]
>>> 2023-08-27 17:08:48,520 [44] INF [ImmutableClientServer] Exchange
>>> timings [startVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>>> resVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], stage="Waiting
>>> in exchange queue" (14850 ms), stage="Exchange parameters initialization"
>>> (2 ms), stage="Determine exchange type" (3 ms), stage="Exchange done" (4
>>> ms), stage="Total time" (14859 ms)]
>>> 2023-08-27 17:08:48,522 [44] INF [ImmutableClientServer] Exchange
>>> longest local stages [startVer=AffinityTopologyVersion [topVer=15,
>>> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=15, minorTopVer=0]]
>>> 2023-08-27 17:08:48,524 [44] INF [ImmutableClientServer] Finished
>>> exchange init [topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>>> crd=false]
>>> 2023-08-27 17:08:48,525 [44] INF [ImmutableClientServer]
>>> AffinityTopologyVersion [topVer=15, minorTopVer=0], evt=NODE_FAILED,
>>> evtNode=9d5ed68d-38bb-447d-aed5-189f52660716, client=true]
>>> Unhandled exception: Apache.Ignite.Core.Cache.CacheException: class
>>> org.apache.ignite.IgniteCheckedException: Failed to complete exchange
>>> process.
>>>  ---> Apache.Ignite.Core.Common.IgniteException: Failed to complete
>>> exchange process.
>>>  ---> Apache.Ignite.Core.Common.JavaException:
>>> javax.cache.CacheException: class org.apache.ignite.IgniteCheckedException:
>>> Failed to complete exchange process.
>>>    at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1272)
>>>    at org.apache.ignite.internal.IgniteKernal.getOrCreateCache0(IgniteKernal.java:2278)
>>>    at org.apache.ignite.internal.IgniteKernal.getOrCreateCache(IgniteKernal.java:2242)
>>>    at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutObject(PlatformProcessorImpl.java:643)
>>>    at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutObject(PlatformTargetProxyImpl.java:79)
>>> Caused by: class org.apache.ignite.IgniteCheckedException: Failed to
>>> complete exchange process.
>>>    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.createExchangeException(GridDhtPartitionsExchangeFuture.java:3709)
>>>    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.sendExchangeFailureMessage(GridDhtPartitionsExchangeFuture.java:3737)
>>>    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3832)
>>>    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3813)
>>>    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1796)
>>>    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1053)
>>>    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3348)
>>>    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3182)
>>>    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
>>>    at java.base/java.lang.Thread.run(Thread.java:829)
>>>    Suppressed: class org.apache.ignite.IgniteCheckedException:
>>> Failed to initialize exchange locally
>>> [locNodeId=e9325b04-00fa-452e-9796-989b47b860ea]
>>>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1483)
>>>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:979)
>>>       ... 4 more
>>>    Caused by: class org.apache.ignite.IgniteCheckedException:
>>> Requested DataRegion is not configured: Default-Mutable
>>>       at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.dataRegion(IgniteCacheDatabaseSharedManager.java:896)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2463)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2181)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:1991)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:1926)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$55a0e703$1(GridCacheProcessor.java:1801)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCachesIfPossible$16(GridCacheProcessor.java:1771)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:1798)
>>>       at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCachesIfPossible(GridCacheProcessor.java:1769)
>>>       at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.processCacheStartRequests(CacheAffinitySharedManager.java:1000)
>>>       at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:886)
>>>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1472)
>>>       ... 5 more
>>>
>>>    at Apache.Ignite.Core.Impl.Unmanaged.Jni.Env.ExceptionCheck()
>>>    at Apache.Ignite.Core.Impl.Unmanaged.Jni.Env.CallObjectMethod(GlobalRef obj, IntPtr methodId, Int64* argsPtr)
>>>    at Apache.Ignite.Core.Impl.Unmanaged.UnmanagedUtils.TargetInStreamOutObject(GlobalRef target, Int32 opType, Int64 inMemPtr)
>>>    at Apache.Ignite.Core.Impl.PlatformJniTarget.InStreamOutObject(Int32 type, Action`1 writeAction)
>>>    --- End of inner exception stack trace ---
>>>    --- End of inner exception stack trace ---
>>>    at Apache.Ignite.Core.Impl.PlatformJniTarget.InStreamOutObject(Int32 type, Action`1 writeAction)
>>>    at Apache.Ignite.Core.Impl.PlatformTargetAdapter.DoOutOpObject(Int32 type, Action`1 action)
>>>    at Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration configuration, NearCacheConfiguration nearConfiguration, PlatformCacheConfiguration platformCacheConfiguration, Op op)
>>>    at Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration configuration, NearCacheConfiguration nearConfiguration, PlatformCacheConfiguration platformCacheConfiguration)
>>>    at Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration configuration, NearCacheConfiguration nearConfiguration)
>>>    at Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration configuration)
>>>
>>> This failure causes issues in the server nodes in the grid, which now
>>> fail to restart with errors such as the one below (shown for the
>>> incorrectly created cache), repeated for every defined cache in the grid:
>>>
>>> 2023-08-27 17:11:36,882 [42] INF [ImmutableCacheComputeServer] Can not
>>> finish proxy initialization because proxy does not exist,
>>> cacheName=SiteModelMetadata,
>>> localNodeId=3d4a75e8-174d-4947-877e-e45784d8d08d
>>> 2
>>>
>>> At this point the grid is unusable.
>>>
>>> To summarise: attempted creation of a cache with an unknown
>>> DataRegionName causes immediate and unrecoverable failure of the entire
>>> grid.
>>>
>>> Raymond.
>>>
>>> On Fri, Aug 25, 2023 at 7:47 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>>
>>>> We believe we had some code on a dev environment attempt to create a
>>>> cache that was intended for another Ignite grid.
>>>>
>>>> The creation of this cache would have failed (at least) because the
>>>> data region referenced in the cache configuration does not exist on that
>>>> environment.
>>>>
>>>> On a subsequent restart of the environment some time later, the nodes
>>>> on which the failed cache would have been stored, had its creation
>>>> succeeded, failed to initialise.
>>>>
>>>> The failing nodes report this in the log:
>>>>
>>>> 2023-08-25 04:20:24,540 [44] WRN [ImmutableCacheComputeServer] Cache
>>>> can not be started : cache=SiteModelMetadata
>>>>
>>>> 2023-08-25 04:20:11,265 [1] WRN [ImmutableCacheComputeServer] WAL
>>>> segment tail reached.
>>>> [idx=414, isWorkDir=true,
>>>> serVer=org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV2Serializer@c3719e5,
>>>> actualFilePtr=WALPointer [idx=414, fileOff=452480679, len=0]]
>>>>
>>>> This error implies that (somehow) Ignite considers this to be a cache
>>>> existing in the grid and is attempting to set it up.
>>>>
>>>> Raymond.
>>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> raymond_wil...@trimble.com
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>