[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799408#comment-16799408 ]
Ignite TC Bot commented on IGNITE-6587:
---------------------------------------

{panel:title=--> Run :: All: Possible Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Platform .NET (Core Linux){color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386454]]

{color:#d04437}ZooKeeper (Discovery) 1{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386456]]

{color:#d04437}Client Nodes{color} [[tests 0 TIMEOUT , Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386458]]

{color:#d04437}Platform C++ (Linux Clang){color} [[tests 0 Exit Code , Failure on metric |https://ci.ignite.apache.org/viewLog.html?buildId=3386476]]

{color:#d04437}Thin client: PHP{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386482]]

{color:#d04437}Hibernate 5.3{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386484]]

{color:#d04437}Thin client: Node.js{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386486]]

{color:#d04437}Thin client: Python{color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386492]]

{color:#d04437}Spring (Data){color} [[tests 0 Exit Code |https://ci.ignite.apache.org/viewLog.html?buildId=3386496]]

{color:#d04437}Queries 1{color} [[tests 6|https://ci.ignite.apache.org/viewLog.html?buildId=3386460]]
* IgniteBinaryCacheQueryTestSuite: SchemaExchangeSelfTest.testServerRestartWithNewTypes - 0,0% fails in last 422 master runs.

{color:#d04437}Cache 1{color} [[tests 10|https://ci.ignite.apache.org/viewLog.html?buildId=3386462]]
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testTwoClientsAllowOverwrite - 0,0% fails in last 419 master runs.
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testOneClientAllowOverwrite - 0,0% fails in last 419 master runs.
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testTwoClients - 0,0% fails in last 419 master runs.
* IgniteBinaryCacheTestSuite: DataStreamerClientReconnectAfterClusterRestartTest.testOneClient - 0,0% fails in last 419 master runs.

{color:#d04437}PDS (Indexing){color} [[tests 3 Out Of Memory Error |https://ci.ignite.apache.org/viewLog.html?buildId=3386464]]
* IgnitePdsWithIndexingCoreTestSuite: IgniteLogicalRecoveryTest.testRecoveryOnDynamicallyStartedCaches - 0,0% fails in last 414 master runs.
* IgnitePdsWithIndexingCoreTestSuite: IgnitePdsThreadInterruptionTest.testInterruptsOnWALWrite - 0,0% fails in last 414 master runs.

{color:#d04437}Cache 3{color} [[tests 3|https://ci.ignite.apache.org/viewLog.html?buildId=3386466]]
* IgniteBinaryObjectsCacheTestSuite3: CacheMetricsManageTest.testJmxPdsStatisticsEnable
* IgniteBinaryObjectsCacheTestSuite3: CacheGroupsMetricsRebalanceTest.testRebalanceEstimateFinishTime

{color:#d04437}Queries 2{color} [[tests 13|https://ci.ignite.apache.org/viewLog.html?buildId=3386468]]
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: IgniteCacheQueryNodeRestartSelfTest2.testRestarts - 0,0% fails in last 0 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexReplicatedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicPartitionedSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexPartitionedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithNonDynamicCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexPartitionedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicIndexReplicatedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentAtomicPartitionedSelfTest.testClientReconnectWithNonDynamicCacheRestart - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithCacheRestart - 0,0% fails in last 426 master runs.

{color:#d04437}ZooKeeper (Discovery) 2{color} [[tests 4|https://ci.ignite.apache.org/viewLog.html?buildId=3386470]]
* ZookeeperDiscoverySpiTestSuite2: IgniteClientReconnectCacheTest.testReconnect - 0,0% fails in last 416 master runs.
* ZookeeperDiscoverySpiTestSuite2: IgniteClientReconnectCacheTest.testReconnectClusterRestart - 0,0% fails in last 416 master runs.
* ZookeeperDiscoverySpiTestSuite2: IgniteClientReconnectCacheTest.testReconnectCacheDestroyedAndCreated - 0,0% fails in last 416 master runs.

{color:#d04437}Cache 2{color} [[tests 2|https://ci.ignite.apache.org/viewLog.html?buildId=3386472]]
* IgniteCacheTestSuite2: IgniteCacheClientNodeChangingTopologyTest.testPessimisticTx2 - 0,0% fails in last 418 master runs.
* IgniteCacheTestSuite2: IgniteClientCacheStartFailoverTest.testClientStartLastServerFailsTx - 0,0% fails in last 418 master runs.

{color:#d04437}Continuous Query 1{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=3386480]]
* IgniteCacheQuerySelfTestSuite3: CacheContinuousQueryConcurrentPartitionUpdateTest.testConcurrentUpdatesAndQueryStartTx - 0,0% fails in last 422 master runs.

{color:#d04437}Web Sessions{color} [[tests 4|https://ci.ignite.apache.org/viewLog.html?buildId=3386490]]
* IgniteWebSessionSelfTestSuite: WebSessionSelfTest.testClientReconnectRequest - 0,0% fails in last 426 master runs.

{color:#d04437}Basic 3{color} [[tests 1|https://ci.ignite.apache.org/viewLog.html?buildId=3386494]]
* IgniteBasicWithPersistenceTestSuite: PluginNodeValidationTest.testValidationException

{color:#d04437}Platform C++ (Win x64 | Release){color} [[tests 9 Failure on metric , BuildFailureOnMessage |https://ci.ignite.apache.org/viewLog.html?buildId=3386478]]
* IgniteOdbcTest: QueriesTestSuite: TestNotFullInsertBatchSelect4096 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestManyCursorsSelectMerge2 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestManyCursorsTwoSelects2 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect1025 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect100 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect2000 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect1000 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestNotFullInsertBatchSelect1500 - 0,8% fails in last 790 master runs.
* IgniteOdbcTest: QueriesTestSuite: TestInsertBatchSelect1024 - 0,6% fails in last 790 master runs.

{panel}
[TeamCity *--> Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=3372451&buildTypeId=IgniteTests24Java8_RunAll]

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.7
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical
> threads. We should implement a periodic check that calls the failure handler
> when one of the following conditions is detected:
> * A critical thread is no longer alive.
> * A critical thread 'hangs' for a long time, e.g. while executing a task
> taken from a task queue.
> When a failure condition is detected, the call stacks of all threads should
> be logged before the failure handler is invoked.
> The actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because
> that thread becomes a vulnerable point with respect to thread termination and
> CPU resource starvation. So we are to use a self-monitoring approach: the
> critical threads themselves should monitor each other.
> Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility,
> which is the best fit for storing and tracking system-critical threads. All
> of them should be refactored into {{GridWorker}}s and added to
> {{WorkersRegistry}}. Each worker should periodically choose some subset of
> peer workers and check whether
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp must be added to the worker in order to implement the
> latter check. Additionally, infinite queue polls, waits on monitors and thread
> parks in system-critical threads should be refactored to their timed
> equivalents.
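For illustration only, the peer check described above might be sketched roughly as follows. This is a minimal sketch, not the actual Ignite code: {{MonitoredWorker}} and {{WorkerWatch}} are hypothetical stand-ins for {{GridWorker}} and {{WorkersRegistry}}, the hang threshold is a plain constructor argument, and the current time is passed in explicitly so the check stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for a system-critical worker (not Ignite's GridWorker). */
class MonitoredWorker {
    private final String name;
    private final Thread runner;

    /** Last time this worker reported progress; volatile so that peers see updates. */
    private volatile long heartbeatTs;

    MonitoredWorker(String name, Thread runner, long nowMs) {
        this.name = name;
        this.runner = runner;
        this.heartbeatTs = nowMs;
    }

    String name() { return name; }

    /** Meant to be called by the worker itself on every iteration of its loop. */
    void updateHeartbeat(long nowMs) { heartbeatTs = nowMs; }

    long heartbeatTs() { return heartbeatTs; }

    boolean isAlive() { return runner.isAlive(); }
}

/** Hypothetical stand-in for a workers registry (not Ignite's WorkersRegistry). */
class WorkerWatch {
    private final List<MonitoredWorker> workers = new CopyOnWriteArrayList<>();
    private final long hangThresholdMs;

    WorkerWatch(long hangThresholdMs) { this.hangThresholdMs = hangThresholdMs; }

    void register(MonitoredWorker w) { workers.add(w); }

    /**
     * Run by a worker against its peers. Returns one entry per peer that is
     * either not alive or has not updated its heartbeat within the threshold.
     */
    List<String> checkPeers(MonitoredWorker self, long nowMs) {
        List<String> failed = new ArrayList<>();
        for (MonitoredWorker w : workers) {
            if (w == self)
                continue; // a worker does not check itself
            if (!w.isAlive())
                failed.add(w.name() + ": terminated");
            else if (nowMs - w.heartbeatTs() > hangThresholdMs)
                failed.add(w.name() + ": blocked");
        }
        return failed;
    }
}
```

In this sketch a non-empty result is the point at which thread dumps would be logged and the failure handler invoked. The stale-heartbeat check only makes sense if workers never block indefinitely, which is why the description asks for timed polls, waits and parks.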
> Monitoring parameters (enable/disable, check interval, thread 'hang'
> threshold, etc.) are to be set via system properties.
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)