We have tried the same renaming in the dev environment, which has multiple
affected server nodes and contains some data (unlike the local grid I
tested this on, which had a single server node containing no data). The dev
environment failed to restart after those changes.
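
For reference, a minimal sketch of the renaming experiment described in the
quoted message below. It assumes cache folders sit directly under a node's
persistence folder and are named "cache-<CacheName>"; the function names and
the dry-run default are hypothetical, and this is not a supported recovery
procedure.

```python
from pathlib import Path


def find_cache_data_folders(persistence_dir, marker="cache_data"):
    """Return cache folders under persistence_dir that contain a `marker` file."""
    root = Path(persistence_dir)
    return sorted(p for p in root.glob("cache-*") if (p / marker).is_file())


def sideline_cache_data(cache_dir, marker="cache_data", suffix="xxx", dry_run=True):
    """Rename <cache_dir>/cache_data to cache_dataxxx and return the new path.

    With dry_run=True (the default) nothing is renamed; only the path that
    would be used is returned.
    """
    src = Path(cache_dir) / marker
    dst = src.with_name(src.name + suffix)
    if not dry_run:
        src.rename(dst)
    return dst
```

Listing the folders first and renaming in a separate step lets the affected
folders be reviewed on every server node, with the node stopped, before
anything is changed.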

We are still looking into it and will try to delete the cache with the
control.sh script. If that is not feasible and there is no other way to
mitigate it, I would rank this as a hot-fix candidate: a simple error on a
customer's part is capable of causing complete loss.
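
While we look into it, the root cause can be confirmed from a client log:
the innermost "Caused by" in the trace quoted below names the missing data
region. A minimal sketch for pulling that name out of a log, assuming the
message text is exactly as it appears in that trace (the function name is
hypothetical):

```python
import re

# Message text as it appears in the stack trace quoted below.
_NOT_CONFIGURED = re.compile(
    r"Requested\s+DataRegion\s+is\s+not\s+configured:\s*(\S+)"
)


def missing_data_regions(log_text):
    """Return the distinct data region names reported as not configured.

    Names are returned in order of first appearance.
    """
    # Mail clients wrap long lines and prefix quoted text with '>', so strip
    # leading quote markers and whitespace, then join the lines before matching.
    cleaned = " ".join(
        re.sub(r"^[>\s]+", "", line) for line in log_text.splitlines()
    )
    seen = []
    for name in _NOT_CONFIGURED.findall(cleaned):
        if name not in seen:
            seen.append(name)
    return seen
```

Joining the lines first means the match still works when the exception
message has been wrapped onto the next line, as it is in the quoted trace.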

Raymond.

On Sun, Aug 27, 2023 at 9:23 PM Raymond Wilson <raymond_wil...@trimble.com>
wrote:

> Looking at the cache-SiteModelMetaData folder in the persistence folder
> for a server node shows a "cache_data" file 6kb in size. No other cache
> folders contain this file.
>
> As an experiment I renamed this file to "cache_dataxxx". This appeared to
> be sufficient to permit the grid to restart. Similarly, renaming the cache
> folder to "xxxcache-SiteModelMetaData" also permitted the grid to restart;
> we will be testing this further to verify.
>
> Raymond.
>
>
> On Sun, Aug 27, 2023 at 5:20 PM Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
>> I have reproduced the possible bug I reported in my earlier email.
>>
>> Given a running grid, having a client node in the grid attempt to create
>> a cache using a DataRegionName that does not exist in the grid causes
>> immediate failure in the client node with the following log output.
>>
>> 2023-08-27 17:08:48,520 [44] INF [ImmutableClientServer]   Completed
>> partition exchange [localNode=15122bd7-bf81-44e6-a548-e70dbd9334c0,
>> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
>> [topVer=15, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
>> [id=9d5ed68d-38bb-447d-aed5-189f52660716,
>> consistentId=9d5ed68d-38bb-447d-aed5-189f52660716, addrs=ArrayList
>> [127.0.0.1], sockAddrs=null, discPort=0, order=8, intOrder=8,
>> lastExchangeTime=1693112858024, loc=false,
>> ver=2.15.0#20230425-sha1:f98f7f35, isClient=true], rebalanced=false,
>> done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=15,
>> minorTopVer=0]]
>> 2023-08-27 17:08:48,520 [44] INF [ImmutableClientServer]   Exchange
>> timings [startVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>> resVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], stage="Waiting
>> in exchange queue" (14850 ms), stage="Exchange parameters initialization"
>> (2 ms), stage="Determine exchange type" (3 ms), stage="Exchange done" (4
>> ms), stage="Total time" (14859 ms)]
>> 2023-08-27 17:08:48,522 [44] INF [ImmutableClientServer]   Exchange
>> longest local stages [startVer=AffinityTopologyVersion [topVer=15,
>> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=15, minorTopVer=0]]
>> 2023-08-27 17:08:48,524 [44] INF [ImmutableClientServer]   Finished
>> exchange init [topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>> crd=false]
>> 2023-08-27 17:08:48,525 [44] INF [ImmutableClientServer]
>> AffinityTopologyVersion [topVer=15, minorTopVer=0], evt=NODE_FAILED,
>> evtNode=9d5ed68d-38bb-447d-aed5-189f52660716, client=true]
>> Unhandled exception: Apache.Ignite.Core.Cache.CacheException: class
>> org.apache.ignite.IgniteCheckedException: Failed to complete exchange
>> process.
>>  ---> Apache.Ignite.Core.Common.IgniteException: Failed to complete
>> exchange process.
>>  ---> Apache.Ignite.Core.Common.JavaException:
>> javax.cache.CacheException: class org.apache.ignite.IgniteCheckedException:
>> Failed to complete exchange process.
>>         at
>> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1272)
>>         at
>> org.apache.ignite.internal.IgniteKernal.getOrCreateCache0(IgniteKernal.java:2278)
>>         at
>> org.apache.ignite.internal.IgniteKernal.getOrCreateCache(IgniteKernal.java:2242)
>>         at
>> org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutObject(PlatformProcessorImpl.java:643)
>>         at
>> org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutObject(PlatformTargetProxyImpl.java:79)
>> Caused by: class org.apache.ignite.IgniteCheckedException: Failed to
>> complete exchange process.
>>         at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.createExchangeException(GridDhtPartitionsExchangeFuture.java:3709)
>>         at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.sendExchangeFailureMessage(GridDhtPartitionsExchangeFuture.java:3737)
>>         at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3832)
>>         at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3813)
>>         at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1796)
>>         at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1053)
>>         at
>> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3348)
>>         at
>> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3182)
>>         at
>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
>>         at java.base/java.lang.Thread.run(Thread.java:829)
>>         Suppressed: class org.apache.ignite.IgniteCheckedException:
>> Failed to initialize exchange locally
>> [locNodeId=e9325b04-00fa-452e-9796-989b47b860ea]
>>                 at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1483)
>>                 at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:979)
>>                 ... 4 more
>>         Caused by: class org.apache.ignite.IgniteCheckedException:
>> Requested DataRegion is not configured: Default-Mutable
>>                 at
>> org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.dataRegion(IgniteCacheDatabaseSharedManager.java:896)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2463)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2181)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:1991)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:1926)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$55a0e703$1(GridCacheProcessor.java:1801)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCachesIfPossible$16(GridCacheProcessor.java:1771)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:1798)
>>                 at
>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCachesIfPossible(GridCacheProcessor.java:1769)
>>                 at
>> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.processCacheStartRequests(CacheAffinitySharedManager.java:1000)
>>                 at
>> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:886)
>>                 at
>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1472)
>>                 ... 5 more
>>
>>    at Apache.Ignite.Core.Impl.Unmanaged.Jni.Env.ExceptionCheck()
>>    at
>> Apache.Ignite.Core.Impl.Unmanaged.Jni.Env.CallObjectMethod(GlobalRef obj,
>> IntPtr methodId, Int64* argsPtr)
>>    at
>> Apache.Ignite.Core.Impl.Unmanaged.UnmanagedUtils.TargetInStreamOutObject(GlobalRef
>> target, Int32 opType, Int64 inMemPtr)
>>    at Apache.Ignite.Core.Impl.PlatformJniTarget.InStreamOutObject(Int32
>> type, Action`1 writeAction)
>>    --- End of inner exception stack trace ---
>>    --- End of inner exception stack trace ---
>>    at Apache.Ignite.Core.Impl.PlatformJniTarget.InStreamOutObject(Int32
>> type, Action`1 writeAction)
>>    at Apache.Ignite.Core.Impl.PlatformTargetAdapter.DoOutOpObject(Int32
>> type, Action`1 action)
>>    at
>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>> configuration, NearCacheConfiguration nearConfiguration,
>> PlatformCacheConfiguration platformCacheConfiguration, Op op)
>>    at
>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>> configuration, NearCacheConfiguration nearConfiguration,
>> PlatformCacheConfiguration platformCacheConfiguration)
>>    at
>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>> configuration, NearCacheConfiguration nearConfiguration)
>>    at
>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>> configuration)
>>
>>
>> This failure causes issues in the server nodes in the grid, which now fail
>> to restart with errors such as the one below (shown for the incorrectly
>> created cache) but repeated for every defined cache in the grid:
>>
>> 2023-08-27 17:11:36,882 [42] INF [ImmutableCacheComputeServer]   Can not
>> finish proxy initialization because proxy does not exist,
>> cacheName=SiteModelMetadata,
>> localNodeId=3d4a75e8-174d-4947-877e-e45784d8d08d
>> 2
>>
>> At this point the grid is now unusable.
>>
>> To summarise: Attempted creation of a cache with an unknown
>> DataRegionName causes immediate and unrecovered failure in the entire grid.
>>
>> Raymond.
>>
>>
>> On Fri, Aug 25, 2023 at 7:47 PM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> We believe we had some code on a dev environment attempt to create a
>>> cache that was intended for another Ignite environment.
>>>
>>> The creation of this cache would have failed (at least) because the data
>>> region referenced in the cache configuration does not exist on that
>>> environment.
>>>
>>> A subsequent restart of the environment some time later failed to
>>> initialise the nodes on which the failed cache would have been stored had
>>> its creation succeeded.
>>>
>>> The failing nodes report this in the log:
>>>
>>> 2023-08-25 04:20:24,540 [44] WRN [ImmutableCacheComputeServer]   Cache
>>> can not be started : cache=SiteModelMetadata
>>>
>>> 2023-08-25 04:20:11,265 [1] WRN [ImmutableCacheComputeServer]   WAL
>>> segment tail reached. [idx=414, isWorkDir=true,
>>> serVer=org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV2Serializer@c3719e5,
>>> actualFilePtr=WALPointer [idx=414, fileOff=452480679, len=0]]
>>>
>>> This error implies that (somehow) Ignite considers this to be a cache
>>> existing in the grid and is attempting to set it up.
>>>
>>> Raymond.
>>>
>>>
>>
>> --
>> <http://www.trimble.com/>
>> Raymond Wilson
>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> raymond_wil...@trimble.com
>>
>>
>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>
>
>

