[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2019-03-27 Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802823#comment-16802823
 ] 

Ignite TC Bot commented on IGNITE-6587:
---

{panel:title=-- Run :: All: Possible 
Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Platform .NET (Core Linux){color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395254]]

{color:#d04437}ZooKeeper (Discovery) 1{color} [[tests 0 TIMEOUT , Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395256]]
* ZookeeperDiscoverySpiTest.testDisconnectOnServersLeft_3 (last started)

{color:#d04437}Client Nodes{color} [[tests 0 TIMEOUT , Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395258]]
* IgniteClientRejoinTest.testClientsReconnect (last started)

{color:#d04437}Cache 3{color} [[tests 0 TIMEOUT , Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395266]]
* IgniteCacheGroupsTest.testRestartsAndCacheCreateDestroy (last started)

{color:#d04437}Platform C++ (Linux Clang){color} [[tests 0 Exit Code , Failure 
on metric |https://ci.ignite.apache.org/viewLog.html?buildId=3395274]]

{color:#d04437}Hibernate 5.3{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395282]]

{color:#d04437}Thin client: PHP{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395280]]

{color:#d04437}Thin client: Node.js{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395286]]

{color:#d04437}Thin client: Python{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395290]]

{color:#d04437}Spring (Data){color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395294]]

{color:#d04437}Cache 1{color} [[tests 
11|https://ci.ignite.apache.org/viewLog.html?buildId=3395262]]
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testTwoClientsAllowOverwrite 
- 0,0% fails in last 405 master runs.
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testOneClientAllowOverwrite 
- 0,0% fails in last 405 master runs.
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testTwoClients - 0,0% fails 
in last 405 master runs.
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testOneClient - 0,0% fails 
in last 405 master runs.

{color:#d04437}Queries 1{color} [[tests 
6|https://ci.ignite.apache.org/viewLog.html?buildId=3395260]]
* IgniteBinaryCacheQueryTestSuite: 
SchemaExchangeSelfTest.testServerRestartWithNewTypes - 0,0% fails in last 409 
master runs.

{color:#d04437}PDS (Indexing){color} [[tests 4 Out Of Memory Error 
|https://ci.ignite.apache.org/viewLog.html?buildId=3395264]]
* IgnitePdsWithIndexingCoreTestSuite: 
IgniteLogicalRecoveryTest.testRecoveryOnJoinToDifferentBlt - 0,0% fails in last 
398 master runs.
* IgnitePdsWithIndexingCoreTestSuite: 
IgniteLogicalRecoveryTest.testRecoveryOnDynamicallyStartedCaches - 0,0% fails 
in last 398 master runs.
* IgnitePdsWithIndexingCoreTestSuite: 
IgnitePdsThreadInterruptionTest.testInterruptsOnWALWrite - 0,0% fails in last 
398 master runs.
* IgniteLogicalRecoveryTest.testRecoveryOnDynamicallyStartedCaches (last 
started)

{color:#d04437}Queries 2{color} [[tests 
14|https://ci.ignite.apache.org/viewLog.html?buildId=3395268]]
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentAtomicPartitionedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicIndexPartitionedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithNonDynamicCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
IgniteCacheQueryNodeRestartSelfTest2.testRestarts - 0,0% fails in last 0 master 
runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicIndexReplicatedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 414 master runs.
* IgniteBinaryCacheQueryTestSuite2: 

[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2019-03-22 Ignite TC Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799408#comment-16799408
 ] 

Ignite TC Bot commented on IGNITE-6587:
---

{panel:title=-- Run :: All: Possible 
Blockers|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Platform .NET (Core Linux){color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386454]]

{color:#d04437}ZooKeeper (Discovery) 1{color} [[tests 0 TIMEOUT , Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386456]]

{color:#d04437}Client Nodes{color} [[tests 0 TIMEOUT , Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386458]]

{color:#d04437}Platform C++ (Linux Clang){color} [[tests 0 Exit Code , Failure 
on metric |https://ci.ignite.apache.org/viewLog.html?buildId=3386476]]

{color:#d04437}Thin client: PHP{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386482]]

{color:#d04437}Hibernate 5.3{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386484]]

{color:#d04437}Thin client: Node.js{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386486]]

{color:#d04437}Thin client: Python{color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386492]]

{color:#d04437}Spring (Data){color} [[tests 0 Exit Code 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386496]]

{color:#d04437}Queries 1{color} [[tests 
6|https://ci.ignite.apache.org/viewLog.html?buildId=3386460]]
* IgniteBinaryCacheQueryTestSuite: 
SchemaExchangeSelfTest.testServerRestartWithNewTypes - 0,0% fails in last 422 
master runs.

{color:#d04437}Cache 1{color} [[tests 
10|https://ci.ignite.apache.org/viewLog.html?buildId=3386462]]
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testTwoClientsAllowOverwrite 
- 0,0% fails in last 419 master runs.
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testOneClientAllowOverwrite 
- 0,0% fails in last 419 master runs.
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testTwoClients - 0,0% fails 
in last 419 master runs.
* IgniteBinaryCacheTestSuite: 
DataStreamerClientReconnectAfterClusterRestartTest.testOneClient - 0,0% fails 
in last 419 master runs.

{color:#d04437}PDS (Indexing){color} [[tests 3 Out Of Memory Error 
|https://ci.ignite.apache.org/viewLog.html?buildId=3386464]]
* IgnitePdsWithIndexingCoreTestSuite: 
IgniteLogicalRecoveryTest.testRecoveryOnDynamicallyStartedCaches - 0,0% fails 
in last 414 master runs.
* IgnitePdsWithIndexingCoreTestSuite: 
IgnitePdsThreadInterruptionTest.testInterruptsOnWALWrite - 0,0% fails in last 
414 master runs.

{color:#d04437}Cache 3{color} [[tests 
3|https://ci.ignite.apache.org/viewLog.html?buildId=3386466]]
* IgniteBinaryObjectsCacheTestSuite3: 
CacheMetricsManageTest.testJmxPdsStatisticsEnable
* IgniteBinaryObjectsCacheTestSuite3: 
CacheGroupsMetricsRebalanceTest.testRebalanceEstimateFinishTime

{color:#d04437}Queries 2{color} [[tests 
13|https://ci.ignite.apache.org/viewLog.html?buildId=3386468]]
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
IgniteCacheQueryNodeRestartSelfTest2.testRestarts - 0,0% fails in last 0 master 
runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicIndexReplicatedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentAtomicPartitionedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentAtomicReplicatedSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicIndexPartitionedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentTransactionalPartitionedSelfTest.testClientReconnectWithNonDynamicCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicIndexPartitionedAtomicConcurrentSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicColumnsConcurrentTransactionalReplicatedSelfTest.testClientReconnectWithNonDynamicCacheRestart
 - 0,0% fails in last 426 master runs.
* IgniteBinaryCacheQueryTestSuite2: 
DynamicIndexReplicatedTransactionalConcurrentSelfTest.testClientReconnectWithCacheRestart
 - 0,0% fails in last 

[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-09-24 Andrey Gura (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625766#comment-16625766
 ] 

Andrey Gura commented on IGNITE-6587:
-

[~andrey-kuznetsov] And it's finally merged to the master branch! Thanks for 
the contribution!

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 2.2
>Reporter: Alexey Goncharuk
>Assignee: Andrey Kuznetsov
>Priority: Major
>  Labels: IEP-5
> Fix For: 2.7
>
> Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls the failure handler 
> when one of the following conditions is detected:
> * A critical thread is not alive anymore.
> * A critical thread 'hangs' for a long time, e.g. while executing a task 
> extracted from its task queue.
> When a failure condition is detected, the call stacks of all threads should 
> be logged before the failure handler is invoked.
> The actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because 
> that thread becomes a vulnerable point with respect to thread termination and 
> CPU resource starvation. So we are to use a self-monitoring approach: the 
> critical threads themselves should monitor each other.
> Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility, 
> which fits best for storing and tracking system-critical threads. All of 
> them should be refactored into {{GridWorker}}s and added to 
> {{WorkersRegistry}}. Each worker should periodically choose some subset of 
> peer workers and check whether
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp must be added to each worker in order to implement 
> the latter check; see the sketch after this description. Additionally, 
> infinite queue polls, waits on monitors, and thread parks in system-critical 
> threads should be replaced with their timed equivalents.
> Monitoring parameters (enable/disable, check interval, thread 'hang' 
> threshold, etc.) are to be set via system properties.
> [1] 
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
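
To make the mutual-check scheme concrete, here is a minimal, self-contained 
sketch of the idea. This is not the actual Ignite code: the class and member 
names ({{Worker}}, {{Registry}}, {{hangThreshold}}, {{checkPeers()}}) are 
illustrative only. Each worker refreshes its own heartbeat timestamp, and 
every worker periodically verifies that its peers are both alive and making 
progress.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

class Worker {
    final Thread thread;

    /** Refreshed by the worker itself from its run loop. */
    volatile long heartbeatTs = System.currentTimeMillis();

    Worker(Thread thread) { this.thread = thread; }

    void updateHeartbeat() { heartbeatTs = System.currentTimeMillis(); }
}

class Registry {
    private final Map<String, Worker> workers = new ConcurrentHashMap<>();
    private final long hangThreshold;          // thread 'hang' threshold, ms
    private final Consumer<Worker> failureHnd; // invoked on a detected failure

    Registry(long hangThreshold, Consumer<Worker> failureHnd) {
        this.hangThreshold = hangThreshold;
        this.failureHnd = failureHnd;
    }

    void register(String name, Worker w) { workers.put(name, w); }

    /** Called periodically by every worker, so the workers monitor each other. */
    void checkPeers() {
        long now = System.currentTimeMillis();

        for (Worker peer : workers.values()) {
            if (!peer.thread.isAlive())                      // not alive anymore
                failureHnd.accept(peer);
            else if (now - peer.heartbeatTs > hangThreshold) // 'hangs' for too long
                failureHnd.accept(peer);
        }
    }
}
{code}

In the real implementation each worker would check only a subset of peers per 
iteration, and the threshold would come from a system property rather than a 
constructor argument.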





[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-09-18 Andrey Kuznetsov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618882#comment-16618882
 ] 

Andrey Kuznetsov commented on IGNITE-6587:
--

[~agura], I've updated the implementation after discussing your points, see 
[1]. Now it's waiting for your review.



[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-09-03 Andrey Gura (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602284#comment-16602284
 ] 

Andrey Gura commented on IGNITE-6587:
-

[~andrey-kuznetsov] I've looked at your changes, and I realized that I don't 
like the idea of invoking the failure processor when some critical worker is 
blocked. There are many situations where a thread can be blocked intentionally 
and explicitly, e.g. an fsync in the checkpointer or wal-writer threads. In 
such cases we can use guards to exempt the thread from liveness checking (see 
the sketch below).

Moreover, a worker could be blocked implicitly (e.g. the exchange-worker or 
some thread from the striped pool reaching the fsync point). Guards are 
useless here.

While an overly long fsync isn't good, it is still a valid situation. If we 
stop a node with a blocked worker, we can eventually stop all nodes of the 
cluster, because the load will be redistributed among the live nodes, making 
each remaining node's load higher.

[~andrey-kuznetsov] Could you please initiate a discussion on the dev list? 
The goal of this discussion is to find an approach to addressing the described 
problems.
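
As a rough illustration of the 'guard' idea, assuming a hypothetical API 
({{blockedIntentionally}} and {{runBlocking()}} are made up for this sketch): 
the worker marks itself as deliberately blocked before a long operation such 
as fsync, so liveness checks can skip it. As noted above, this only covers 
explicit blocking and cannot help when a thread blocks implicitly.

{code:java}
import java.util.concurrent.Callable;

class GuardedWorker {
    /** Refreshed by the worker itself; inspected by peer liveness checks. */
    volatile long heartbeatTs = System.currentTimeMillis();

    /** While true, liveness checks should skip this worker. */
    volatile boolean blockedIntentionally;

    <T> T runBlocking(Callable<T> fsyncLikeOp) throws Exception {
        blockedIntentionally = true; // announce the intentional block

        try {
            return fsyncLikeOp.call(); // e.g. a long fsync in the checkpointer
        }
        finally {
            blockedIntentionally = false;
            heartbeatTs = System.currentTimeMillis();
        }
    }
}
{code}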



[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-06-26 Andrey Kuznetsov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523549#comment-16523549
 ] 

Andrey Kuznetsov commented on IGNITE-6587:
--

[~sergey-chugunov], thanks for your remarks. My way of controlling flow by 
throwing {{GridWorkerFailureException}} looks like an antipattern. I'm about 
to drop this exception class and instead pass a closure to {{WorkersRegistry}} 
at construction time, then call it when a critical worker failure is detected.
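
For illustration, a sketch of what that refactoring could look like; the names 
below are made up and do not reflect the actual {{WorkersRegistry}} API. The 
registry takes a failure-handling closure at construction time and simply 
invokes it, so no exception-based control flow is needed.

{code:java}
import java.util.function.BiConsumer;

class WorkersRegistrySketch {
    /** Invoked when a critical worker failure is detected. */
    private final BiConsumer<String, Throwable> critFailureHnd;

    WorkersRegistrySketch(BiConsumer<String, Throwable> critFailureHnd) {
        this.critFailureHnd = critFailureHnd;
    }

    void onCriticalWorkerFailure(String workerName, Throwable cause) {
        // A plain call instead of throwing GridWorkerFailureException.
        critFailureHnd.accept(workerName, cause);
    }
}
{code}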



[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-06-25 Sergey Chugunov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522457#comment-16522457
 ] 

Sergey Chugunov commented on IGNITE-6587:
-

[~andrey-kuznetsov],

I haven't finished the review yet, but I have some high-level questions:
* As I can see from the code, there is no way to turn this functionality off. 
It may be worth adding such an ability, e.g. at the FailureHandler level. 
Users may even want not to stop the node, but to log a message at ERROR level 
and let monitoring systems identify such nodes and take appropriate actions 
(see the sketch after this list).
* It is not clear what to do with GC pauses longer than 
criticalHeartbeatTimeout: in that case the node will stop once GC finishes. 
[~agoncharuk], [~agura], what do you guys think about handling this? Should we 
come up with some mechanism to deal with it?
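
Regarding the first question: one way to get the 'log only, do not stop the 
node' behavior is to plug a custom {{FailureHandler}} into the node 
configuration. {{FailureHandler}} and {{IgniteConfiguration.setFailureHandler()}} 
are public Ignite APIs; the logging body below is only a sketch.

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;

public class LogOnlyFailureHandlerExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        cfg.setFailureHandler(new FailureHandler() {
            @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
                // Log at ERROR level and let external monitoring react.
                ignite.log().error("Critical failure detected: " + failureCtx);

                return false; // false: do not invalidate (stop) the node
            }
        });

        Ignition.start(cfg);
    }
}
{code}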



[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-05-22 Andrey Kuznetsov (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483947#comment-16483947
 ] 

Andrey Kuznetsov commented on IGNITE-6587:
--

Changing critical threads to {{GridWorker}}s has been moved to a separate 
issue, since it has its own value.



[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2018-05-04 Alexey Goncharuk (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464085#comment-16464085
 ] 

Alexey Goncharuk commented on IGNITE-6587:
--

I would also note that we also have some loops in the exchange-worker where we 
synchronously wait for other futures (for example, the partition release 
future). These places should also be corrected to update the thread's alive 
marker.
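
A sketch of that correction, assuming the worker exposes some heartbeat marker 
({{updateHeartbeat()}} here is illustrative): the indefinite wait is replaced 
with a timed-wait loop that refreshes the marker on every iteration, so peer 
checks do not mistake the wait for a hang.

{code:java}
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedWaitExample {
    volatile long heartbeatTs = System.currentTimeMillis();

    void updateHeartbeat() { heartbeatTs = System.currentTimeMillis(); }

    /** Waits for the future while signalling liveness every pollTimeoutMs. */
    <T> T awaitWithHeartbeat(Future<T> fut, long pollTimeoutMs) throws Exception {
        while (true) {
            try {
                return fut.get(pollTimeoutMs, TimeUnit.MILLISECONDS);
            }
            catch (TimeoutException ignored) {
                updateHeartbeat(); // still alive, just waiting
            }
        }
    }
}
{code}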



[jira] [Commented] (IGNITE-6587) Ignite watchdog service

2017-11-15 Sergey Puchnin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253348#comment-16253348
 ] 

Sergey Puchnin commented on IGNITE-6587:


Old thread list:

* All TCP discovery threads
* All communication NIO threads (acceptor and workers)
* Exchange worker
* Striped pool threads
* Timeout Worker
* Checkpointer
* WAL archiver

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 2.2
>Reporter: Alexey Goncharuk
>Assignee: Dmitriy Pavlov
>  Labels: IEP-5
> Fix For: 2.4
>
> Attachments: watchdog.sh
>
>
> We need to come up with a 'watchdog service' to monitor an Ignite node's 
> local health and kill the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite 
> node must be stopped.
> At first glance, the list of critical threads is:
> disco-event-worker
> tcp-disco-sock-reader
> tcp-disco-srvr
> tcp-disco-msg-worker
> tcp-comm-worker
> grid-nio-worker-tcp-comm
> exchange-worker
> sys-stripe
> grid-timeout-worker
> db-checkpoint-thread
> wal-file-archiver
> ttl-cleanup-worker
> nio-acceptor
> The mechanism should support pluggable components so that the self-check can 
> be extended via plugins.
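
For context, a minimal sketch of this original single-watchdog-thread design 
(names are illustrative; this is not the attached {{watchdog.sh}}): a daemon 
thread periodically scans the registered critical threads and halts the 
process once one of them has died. The newer issue description above explains 
why such a dedicated thread is fragile and why mutual self-monitoring was 
chosen instead.

{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class WatchdogSketch {
    private final List<Thread> criticalThreads = new CopyOnWriteArrayList<>();

    void register(Thread t) { criticalThreads.add(t); }

    void start(long checkIntervalMs) {
        Thread watchdog = new Thread(() -> {
            while (true) {
                for (Thread t : criticalThreads) {
                    if (!t.isAlive()) {
                        // A mission-critical thread died: stop the node process.
                        System.err.println("Critical thread died: " + t.getName());

                        Runtime.getRuntime().halt(1);
                    }
                }

                try {
                    Thread.sleep(checkIntervalMs);
                }
                catch (InterruptedException e) {
                    return; // watchdog stopped
                }
            }
        }, "watchdog");

        watchdog.setDaemon(true);
        watchdog.start();
    }
}
{code}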



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)