[jira] [Commented] (IGNITE-11253) When a node that is not part of the baseline topology joins the cluster, it may lead to a node failure.
    [ https://issues.apache.org/jira/browse/IGNITE-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763636#comment-16763636 ]

Ignite TC Bot commented on IGNITE-11253:
----------------------------------------

{panel:title=--> Run :: All: Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Service Grid (legacy mode){color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3038050]]

{color:#d04437}Cache (Restarts) 1{color} [[tests 4|https://ci.ignite.apache.org/viewLog.html?buildId=3038010]]
* IgniteCacheRestartTestSuite: GridCacheReplicatedNodeRestartSelfTest.testRestartWithPutFourNodesNoBackups - 0,0% fails in last 384 master runs.

{color:#d04437}ZooKeeper (Discovery) 1{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=3037987]]
* ZookeeperDiscoverySpiTestSuite1: ZookeeperDiscoveryClientDisconnectTest.testReconnectServersRestart_3 - 3,8% fails in last 286 master runs.

{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=3038065&buildTypeId=IgniteTests24Java8_RunAll]
[jira] [Commented] (IGNITE-11253) When a node that is not part of the baseline topology joins the cluster, it may lead to a node failure.
    [ https://issues.apache.org/jira/browse/IGNITE-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764906#comment-16764906 ]

Andrey Gura commented on IGNITE-11253:
--------------------------------------

[~slava.koptilin] Could you please add at least {{InterruptedException}} to the {{X.hasCause}} method's parameters? A worker is usually stopped via a {{cancel}} call, which interrupts the worker's runner thread.

> When a node that is not part of the baseline topology joins the cluster, it may lead to a node failure.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-11253
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11253
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.7
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> * If eager TTL is configured, a starting node creates and starts {{cleanupWorker}} (see {{GridCacheTtlManager.start0()}}).
> * {{GridCacheSharedTtlCleanupManager.CleanupWorker}}, in turn, has to wait for the {{discovery().localJoin()}} future, which is completed by the discovery thread.
> * Meanwhile, the exchange thread stops cache contexts and therefore stops the {{cleanupWorker}} as well:
>
> {code:java}
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.stopCleanupWorker(GridCacheSharedTtlCleanupManager.java:109)
> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.unregister(GridCacheSharedTtlCleanupManager.java:82)
> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.onKernalStop0(GridCacheTtlManager.java:110)
> org.apache.ignite.internal.processors.cache.GridCacheManagerAdapter.onKernalStop(GridCacheManagerAdapter.java:111)
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:1495)
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStopCaches(GridCacheProcessor.java:1182)
> org.apache.ignite.internal.processors.cache.GridCacheProcessor$CacheRecoveryLifecycle.onBaselineChange(GridCacheProcessor.java:5637)
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:910)
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:792)
> {code}
> So the exchange thread may try to stop the {{cleanupWorker}} before the {{localJoin}} future is completed by the discovery thread. Unfortunately, {{cleanupWorker}} handles this situation incorrectly, which can lead to a node failure:
> {code:java}
> Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Got interrupted while waiting for future to complete.]]
> class org.apache.ignite.IgniteException: Got interrupted while waiting for future to complete.
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2217)
>     at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:136)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: Got interrupted while waiting for future to complete.
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:186)
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
>     at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2214)
>     ... 3 more
> {code}
> As the trace shows, the interruption reaches the worker wrapped in an {{IgniteException}}, so a plain {{instanceof}} check does not recognize it. The obvious fix is to change the catch block
> {code:java}
> catch (Throwable t) {
>     if (!(t instanceof IgniteInterruptedCheckedException))
>         err = t;
>
>     throw t;
> }
> {code}
> to the following:
> {code:java}
> catch (Throwable t) {
>     if (!(X.hasCause(t, IgniteInterruptedCheckedException.class)))
>         err = t;
>
>     throw t;
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
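
For readers following the thread, the difference between the two catch blocks can be reproduced without Ignite on the classpath. The sketch below is only an illustration under stated assumptions: {{InterruptedCheckedException}} and {{hasCause}} are hypothetical stand-ins for {{IgniteInterruptedCheckedException}} and {{X.hasCause}}, and a plain {{RuntimeException}} plays the role of the {{IgniteException}} seen in the trace.

```java
public class CauseChainSketch {
    /** Hypothetical stand-in for IgniteInterruptedCheckedException. */
    public static class InterruptedCheckedException extends Exception {
        public InterruptedCheckedException(String msg) {
            super(msg);
        }
    }

    /**
     * Hypothetical stand-in for X.hasCause: returns true if the throwable
     * itself or any throwable in its cause chain is an instance of cls.
     */
    public static boolean hasCause(Throwable t, Class<? extends Throwable> cls) {
        for (Throwable th = t; th != null; th = th.getCause()) {
            if (cls.isInstance(th))
                return true;
        }

        return false;
    }

    /** Mimics the failure in the trace: the interruption is wrapped in an unchecked exception. */
    public static Throwable wrappedInterruption() {
        String msg = "Got interrupted while waiting for future to complete.";

        return new RuntimeException(msg, new InterruptedCheckedException(msg));
    }

    public static void main(String[] args) {
        Throwable t = wrappedInterruption();

        // Old check: looks only at the outermost exception type.
        System.out.println("instanceof check:  " + (t instanceof InterruptedCheckedException));

        // New check: walks the cause chain.
        System.out.println("cause chain check: " + hasCause(t, InterruptedCheckedException.class));
    }
}
```

With the old check the wrapped interruption is recorded as an error, which in the worker corresponds to the SYSTEM_WORKER_TERMINATION failure; with the cause-chain check it is treated as a regular stop.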
[jira] [Commented] (IGNITE-11253) When a node that is not part of the baseline topology joins the cluster, it may lead to a node failure.
    [ https://issues.apache.org/jira/browse/IGNITE-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765746#comment-16765746 ]

Ignite TC Bot commented on IGNITE-11253:
----------------------------------------

{panel:title=--> Run :: All: Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Platform .NET (NuGet)*{color} [[tests 0 Exit Code , Compilation Error |https://ci.ignite.apache.org/viewLog.html?buildId=3066504]]

{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=3064210&buildTypeId=IgniteTests24Java8_RunAll]
[jira] [Commented] (IGNITE-11253) When a node that is not part of the baseline topology joins the cluster, it may lead to a node failure.
    [ https://issues.apache.org/jira/browse/IGNITE-11253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765807#comment-16765807 ]

Vyacheslav Koptilin commented on IGNITE-11253:
----------------------------------------------

Hi [~agura],

> Could you please add at least {{InterruptedException}} to {{X.hasCause}} method's parameters.

Done!
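
The reviewer's point about {{cancel}} can be sketched in the same spirit. This is an assumption-laden illustration, not the committed patch: {{hasCause}} here is a hypothetical varargs helper modelled on {{X.hasCause}}, {{InterruptedCheckedException}} is a hypothetical stand-in for {{IgniteInterruptedCheckedException}}, and a never-completed {{CompletableFuture}} stands in for the {{localJoin}} future. The worker thread is interrupted while waiting, and the catch block accepts either interruption type anywhere in the cause chain.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicReference;

public class WorkerCancelSketch {
    /** Hypothetical stand-in for IgniteInterruptedCheckedException. */
    public static class InterruptedCheckedException extends Exception {
    }

    /** Hypothetical varargs helper: true if any throwable in the cause chain matches any given type. */
    @SafeVarargs
    public static boolean hasCause(Throwable t, Class<? extends Throwable>... types) {
        for (Throwable th = t; th != null; th = th.getCause()) {
            for (Class<? extends Throwable> type : types) {
                if (type.isInstance(th))
                    return true;
            }
        }

        return false;
    }

    /**
     * Starts a worker that blocks on a future which never completes, interrupts it
     * (what a cancel call would do), and returns the error the worker recorded.
     * Returns null when the interruption was recognized as a regular stop.
     */
    public static Throwable runAndCancel() {
        CompletableFuture<Void> localJoin = new CompletableFuture<>();
        AtomicReference<Throwable> err = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            try {
                try {
                    localJoin.get();
                }
                catch (InterruptedException | ExecutionException e) {
                    // Wrap the interruption, as localJoin() does in the trace.
                    throw new IllegalStateException("Got interrupted while waiting for future to complete.", e);
                }
            }
            catch (Throwable t) {
                // Record an error only if no interruption is found in the cause chain.
                if (!hasCause(t, InterruptedCheckedException.class, InterruptedException.class))
                    err.set(t);
            }
        });

        worker.start();
        worker.interrupt();

        try {
            worker.join();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            throw new IllegalStateException(e);
        }

        return err.get();
    }

    public static void main(String[] args) {
        System.out.println("recorded error: " + runAndCancel());
    }
}
```

Listing {{InterruptedException}} alongside the checked wrapper covers the case where the raw JDK exception, rather than the Ignite-specific one, sits at the bottom of the chain.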