Vyacheslav Koptilin created IGNITE-11253:
--------------------------------------------

Summary: When a node that is not part of the base topology joins the cluster, it may lead to a node failure.
Key: IGNITE-11253
URL: https://issues.apache.org/jira/browse/IGNITE-11253
Project: Ignite
Issue Type: Bug
Affects Versions: 2.7
Reporter: Vyacheslav Koptilin
Assignee: Vyacheslav Koptilin
Fix For: 2.8

* If eager TTL is configured, a starting node creates and starts {{cleanupWorker}} (see {{GridCacheTtlManager.start0()}}).
* {{GridCacheSharedTtlCleanupManager.CleanupWorker}}, in turn, has to wait for the {{discovery().localJoin()}} future, which is completed by the discovery thread.
* On the other hand, the exchange thread stops cache contexts and therefore stops the {{cleanupWorker}} as well:

{code:java}
org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.stopCleanupWorker(GridCacheSharedTtlCleanupManager.java:109)
org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.unregister(GridCacheSharedTtlCleanupManager.java:82)
org.apache.ignite.internal.processors.cache.GridCacheTtlManager.onKernalStop0(GridCacheTtlManager.java:110)
org.apache.ignite.internal.processors.cache.GridCacheManagerAdapter.onKernalStop(GridCacheManagerAdapter.java:111)
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:1495)
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStopCaches(GridCacheProcessor.java:1182)
org.apache.ignite.internal.processors.cache.GridCacheProcessor$CacheRecoveryLifecycle.onBaselineChange(GridCacheProcessor.java:5637)
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:910)
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:792)
{code}

So, the exchange thread may try to stop the {{cleanupWorker}} before the {{localJoin}} future is completed by the
discovery thread. Unfortunately, {{cleanupWorker}} handles this situation incorrectly: the interrupt sent to stop the worker surfaces as an {{IgniteException}} from {{localJoin()}}, and this can lead to a node failure:

{code:java}
Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Got interrupted while waiting for future to complete.]]
class org.apache.ignite.IgniteException: Got interrupted while waiting for future to complete.
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2217)
    at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:136)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: Got interrupted while waiting for future to complete.
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:186)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2214)
    ... 3 more
{code}
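
The race described above can be reproduced in miniature outside Ignite. The following is only a sketch under stated assumptions: it is not Ignite code and not necessarily the fix chosen for this ticket. The class {{CleanupWorkerSketch}}, its {{Outcome}} enum and its {{cancelled}} flag are hypothetical names, and a plain JDK {{CompletableFuture}} stands in for the {{localJoin}} future. It shows one possible way for a worker to tell "interrupted because I was asked to stop" apart from an unexpected interrupt, so that the former is not reported as a failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Hypothetical stand-in for a cleanup worker; not the Ignite implementation. */
final class CleanupWorkerSketch {
    enum Outcome { RUNNING, JOINED, STOPPED_BEFORE_JOIN, FAILED }

    /** Stand-in for the localJoin future (assumption: completed by another thread). */
    private final CompletableFuture<Void> localJoin;

    /** Set by stop() before the interrupt, so the worker can recognise a requested stop. */
    private volatile boolean cancelled;

    private volatile Outcome outcome = Outcome.RUNNING;

    private final Thread thread = new Thread(this::body, "cleanup-worker-sketch");

    CleanupWorkerSketch(CompletableFuture<Void> localJoin) {
        this.localJoin = localJoin;
    }

    void start() {
        thread.start();
    }

    /** Plays the role of the exchange thread: flags cancellation, interrupts, waits for exit. */
    Outcome stop() throws InterruptedException {
        cancelled = true;
        thread.interrupt();
        thread.join();
        return outcome;
    }

    private void body() {
        try {
            // Blocks until the join future is completed (or the thread is interrupted).
            localJoin.get();
            outcome = Outcome.JOINED;
            // A real worker would run its TTL cleanup loop here, checking 'cancelled'.
        }
        catch (InterruptedException e) {
            // A requested stop is a normal exit; only an unexpected interrupt is a failure.
            outcome = cancelled ? Outcome.STOPPED_BEFORE_JOIN : Outcome.FAILED;
            Thread.currentThread().interrupt(); // Preserve the interrupt status.
        }
        catch (ExecutionException e) {
            outcome = Outcome.FAILED;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The future is never completed, mimicking a stop that arrives before local join.
        CleanupWorkerSketch worker = new CleanupWorkerSketch(new CompletableFuture<>());
        worker.start();
        System.out.println(worker.stop());
    }
}
```

In this sketch the stop request and the interrupt are ordered so that the worker always observes {{cancelled}} when it catches the interrupt. Without such a check, the interrupt is indistinguishable from a genuine failure, which matches the {{SYSTEM_WORKER_TERMINATION}} report shown in the log.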