FYI, I raised Jira ticket https://issues.apache.org/jira/browse/IGNITE-20299
for this.


On Mon, Aug 28, 2023 at 3:42 PM Raymond Wilson <raymond_wil...@trimble.com>
wrote:

> We have tried the same renaming in the dev environment, which has multiple
> affected server nodes and contains some data (unlike the local grid I
> tested this on, which had a single server node containing no data). The dev
> environment failed to restart after those changes.
>
> We are still looking into it and will try to delete the cache with the
> control.sh script. If that is not feasible and there is no other way
> to mitigate it, I would rank this as a hot-fix candidate, since a simple
> error on a customer's part is capable of causing complete loss.
>
> Raymond.
>
> On Sun, Aug 27, 2023 at 9:23 PM Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
>> The cache-SiteModelMetaData folder in the persistence folder for a server
>> node contains a "cache_data" file 6 KB in size. No other cache
>> folders contain this file.
>>
>> As an experiment I renamed this file to "cache_dataxxx". This appeared to
>> be sufficient to permit the grid to restart. Similarly, renaming the cache
>> folder to "xxxcache-SiteModelMetaData" also permitted the grid to restart.
>> We will be testing this further to verify.
>>
>> Raymond.
>>
>>
>> On Sun, Aug 27, 2023 at 5:20 PM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> I have reproduced the possible bug I reported in my earlier email.
>>>
>>> Given a running grid, having a client node in the grid attempt to create
>>> a cache using a DataRegionName that does not exist in the grid causes
>>> immediate failure in the client node with the following log output.
>>>
>>> 2023-08-27 17:08:48,520 [44] INF [ImmutableClientServer]   Completed
>>> partition exchange [localNode=15122bd7-bf81-44e6-a548-e70dbd9334c0,
>>> exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
>>> [topVer=15, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
>>> [id=9d5ed68d-38bb-447d-aed5-189f52660716,
>>> consistentId=9d5ed68d-38bb-447d-aed5-189f52660716, addrs=ArrayList
>>> [127.0.0.1], sockAddrs=null, discPort=0, order=8, intOrder=8,
>>> lastExchangeTime=1693112858024, loc=false,
>>> ver=2.15.0#20230425-sha1:f98f7f35, isClient=true], rebalanced=false,
>>> done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=15,
>>> minorTopVer=0]]
>>> 2023-08-27 17:08:48,520 [44] INF [ImmutableClientServer]   Exchange
>>> timings [startVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>>> resVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], stage="Waiting
>>> in exchange queue" (14850 ms), stage="Exchange parameters initialization"
>>> (2 ms), stage="Determine exchange type" (3 ms), stage="Exchange done" (4
>>> ms), stage="Total time" (14859 ms)]
>>> 2023-08-27 17:08:48,522 [44] INF [ImmutableClientServer]   Exchange
>>> longest local stages [startVer=AffinityTopologyVersion [topVer=15,
>>> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=15, minorTopVer=0]]
>>> 2023-08-27 17:08:48,524 [44] INF [ImmutableClientServer]   Finished
>>> exchange init [topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
>>> crd=false]
>>> 2023-08-27 17:08:48,525 [44] INF [ImmutableClientServer]
>>> AffinityTopologyVersion [topVer=15, minorTopVer=0], evt=NODE_FAILED,
>>> evtNode=9d5ed68d-38bb-447d-aed5-189f52660716, client=true]
>>> Unhandled exception: Apache.Ignite.Core.Cache.CacheException: class
>>> org.apache.ignite.IgniteCheckedException: Failed to complete exchange
>>> process.
>>>  ---> Apache.Ignite.Core.Common.IgniteException: Failed to complete
>>> exchange process.
>>>  ---> Apache.Ignite.Core.Common.JavaException:
>>> javax.cache.CacheException: class org.apache.ignite.IgniteCheckedException:
>>> Failed to complete exchange process.
>>>         at
>>> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1272)
>>>         at
>>> org.apache.ignite.internal.IgniteKernal.getOrCreateCache0(IgniteKernal.java:2278)
>>>         at
>>> org.apache.ignite.internal.IgniteKernal.getOrCreateCache(IgniteKernal.java:2242)
>>>         at
>>> org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutObject(PlatformProcessorImpl.java:643)
>>>         at
>>> org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutObject(PlatformTargetProxyImpl.java:79)
>>> Caused by: class org.apache.ignite.IgniteCheckedException: Failed to
>>> complete exchange process.
>>>         at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.createExchangeException(GridDhtPartitionsExchangeFuture.java:3709)
>>>         at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.sendExchangeFailureMessage(GridDhtPartitionsExchangeFuture.java:3737)
>>>         at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3832)
>>>         at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3813)
>>>         at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1796)
>>>         at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1053)
>>>         at
>>> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3348)
>>>         at
>>> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3182)
>>>         at
>>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
>>>         at java.base/java.lang.Thread.run(Thread.java:829)
>>>         Suppressed: class org.apache.ignite.IgniteCheckedException:
>>> Failed to initialize exchange locally
>>> [locNodeId=e9325b04-00fa-452e-9796-989b47b860ea]
>>>                 at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1483)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:979)
>>>                 ... 4 more
>>>         Caused by: class org.apache.ignite.IgniteCheckedException:
>>> Requested DataRegion is not configured: Default-Mutable
>>>                 at
>>> org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.dataRegion(IgniteCacheDatabaseSharedManager.java:896)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.startCacheGroup(GridCacheProcessor.java:2463)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.getOrCreateCacheGroupContext(GridCacheProcessor.java:2181)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheContext(GridCacheProcessor.java:1991)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:1926)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCaches$55a0e703$1(GridCacheProcessor.java:1801)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.lambda$prepareStartCachesIfPossible$16(GridCacheProcessor.java:1771)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCaches(GridCacheProcessor.java:1798)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareStartCachesIfPossible(GridCacheProcessor.java:1769)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.processCacheStartRequests(CacheAffinitySharedManager.java:1000)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:886)
>>>                 at
>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1472)
>>>                 ... 5 more
>>>
>>>    at Apache.Ignite.Core.Impl.Unmanaged.Jni.Env.ExceptionCheck()
>>>    at
>>> Apache.Ignite.Core.Impl.Unmanaged.Jni.Env.CallObjectMethod(GlobalRef obj,
>>> IntPtr methodId, Int64* argsPtr)
>>>    at
>>> Apache.Ignite.Core.Impl.Unmanaged.UnmanagedUtils.TargetInStreamOutObject(GlobalRef
>>> target, Int32 opType, Int64 inMemPtr)
>>>    at Apache.Ignite.Core.Impl.PlatformJniTarget.InStreamOutObject(Int32
>>> type, Action`1 writeAction)
>>>    --- End of inner exception stack trace ---
>>>    --- End of inner exception stack trace ---
>>>    at Apache.Ignite.Core.Impl.PlatformJniTarget.InStreamOutObject(Int32
>>> type, Action`1 writeAction)
>>>    at Apache.Ignite.Core.Impl.PlatformTargetAdapter.DoOutOpObject(Int32
>>> type, Action`1 action)
>>>    at
>>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>>> configuration, NearCacheConfiguration nearConfiguration,
>>> PlatformCacheConfiguration platformCacheConfiguration, Op op)
>>>    at
>>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>>> configuration, NearCacheConfiguration nearConfiguration,
>>> PlatformCacheConfiguration platformCacheConfiguration)
>>>    at
>>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>>> configuration, NearCacheConfiguration nearConfiguration)
>>>    at
>>> Apache.Ignite.Core.Impl.Ignite.GetOrCreateCache[TK,TV](CacheConfiguration
>>> configuration)
>>>
>>>
>>> This failure causes issues in the server nodes in the grid, which now
>>> fail to restart with errors such as the one below (shown for the incorrectly
>>> created cache), repeated for every defined cache in the grid:
>>>
>>> 2023-08-27 17:11:36,882 [42] INF [ImmutableCacheComputeServer]   Can not
>>> finish proxy initialization because proxy does not exist,
>>> cacheName=SiteModelMetadata,
>>> localNodeId=3d4a75e8-174d-4947-877e-e45784d8d08d
>>>
>>> At this point the grid is now unusable.
>>>
>>> To summarise: attempting to create a cache with an unknown
>>> DataRegionName causes immediate and unrecoverable failure in the entire grid.
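
Until this is fixed in Ignite itself, a client-side pre-flight check might
reduce the risk. The following is a minimal sketch in plain Java, not Ignite
API: DataRegionGuard and its methods are hypothetical, and the set of known
region names is assumed to be supplied by the application (for example from
deployment configuration), since the sketch does not query the grid. It also
assumes that a null region name means the cache uses the default data region.

```java
import java.util.Set;

/** Hypothetical pre-flight check; this class is not part of the Ignite API. */
public final class DataRegionGuard {
    private DataRegionGuard() {
        // Static helpers only.
    }

    /**
     * Returns true when the region name is safe to use for cache creation.
     * Assumption: a null name selects the default data region, so it is allowed.
     */
    public static boolean isKnownRegion(String regionName, Set<String> knownRegions) {
        return regionName == null || knownRegions.contains(regionName);
    }

    /**
     * Throws before any request reaches the grid if the region name is unknown.
     * The knownRegions set is application-supplied (assumption), not read from Ignite.
     */
    public static void requireKnownRegion(String cacheName, String regionName,
                                          Set<String> knownRegions) {
        if (!isKnownRegion(regionName, knownRegions)) {
            throw new IllegalArgumentException(
                "Refusing to create cache '" + cacheName
                    + "': data region '" + regionName
                    + "' is not in the known set " + knownRegions);
        }
    }
}
```

The idea would be to call requireKnownRegion with the value of
CacheConfiguration.getDataRegionName() (DataRegionName in Ignite.NET) before
calling getOrCreateCache, so that a misconfigured client fails locally rather
than sending the cache start request to the grid.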
>>>
>>> Raymond.
>>>
>>>
>>> On Fri, Aug 25, 2023 at 7:47 PM Raymond Wilson <
>>> raymond_wil...@trimble.com> wrote:
>>>
>>>> We believe some code on a dev environment attempted to create a
>>>> cache that was intended for another Ignite grid.
>>>>
>>>> The creation of this cache would have failed (at least) because the
>>>> data region referenced in the cache configuration does not exist on that
>>>> environment.
>>>>
>>>> A subsequent restart of the environment some time later started failing
>>>> to initialise nodes on which the failed cache would have been stored had it
>>>> succeeded.
>>>>
>>>> The failing nodes report this in the log:
>>>>
>>>> 2023-08-25 04:20:24,540 [44] WRN [ImmutableCacheComputeServer]   Cache
>>>> can not be started : cache=SiteModelMetadata
>>>>
>>>> 2023-08-25 04:20:11,265 [1] WRN [ImmutableCacheComputeServer]   WAL
>>>> segment tail reached. [idx=414, isWorkDir=true,
>>>> serVer=org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV2Serializer@c3719e5,
>>>> actualFilePtr=WALPointer [idx=414, fileOff=452480679, len=0]]
>>>>
>>>> This error implies that (somehow) Ignite considers this to be a cache
>>>> existing in the grid and is attempting to set it up.
>>>>
>>>> Raymond.
>>>>
>>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> raymond_wil...@trimble.com
>>>
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>
>>
>>
>
>


