[jira] [Comment Edited] (IGNITE-6587) Ignite watchdog service

2018-09-18 Thread Andrey Kuznetsov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618882#comment-16618882
 ] 

Andrey Kuznetsov edited comment on IGNITE-6587 at 9/18/18 10:28 AM:


[~agura], I've updated the implementation after discussing your points, see 
[1]. Now it's waiting for your review.

[1] 
http://apache-ignite-developers.2346864.n4.nabble.com/Critical-worker-threads-liveness-checking-drawbacks-td34783.html



was (Author: andrey-kuznetsov):
[~agura], I've updated the implementation after discussing your points, see 
[1]. Now it's waiting for your review.

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 2.2
>Reporter: Alexey Goncharuk
>Assignee: Andrey Kuznetsov
>Priority: Major
>  Labels: IEP-5
> Fix For: 2.7
>
> Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls failure handler when 
> one of the following conditions has been detected:
> * Critical thread is not alive anymore.
> * Critical thread 'hangs' for a long time, e.g. while executing a task 
> extracted from task queue.
> In case of failure condition, call stacks of all threads should be logged 
> before invoking failure handler.
> Actual list of system-critical threads can be found at [1].
> Implementations based on separate diagnostic thread seem fragile, cause this 
> thread become a vulnerable point with respect to thread termination and CPU 
> resource starvation. So we are to use self-monitoring approach: critical 
> threads themselves should monitor each other.
> Currently we have {{o.a.i.internal.worker.WorkersRegistry}} facility that 
> fits best to store and track system critical threads. All of them should be 
> refactored to be {{GridWorker's}} and added to {{WorkersRegistry}}. Each 
> worker should periodically choose some subset of peer workers and check 
> whether
> * All of them are alive.
> * All of them are actively running.
> It's required to add a 'heartbeat' timestamp to worker in order to implement 
> latter check. Additionally, infinite queue polls, waits on monitors or thread 
> parks should be refactored to their timed equivalents in system critical 
> threads.
> Monitoring parameters (enable/disable, check interval, thread 'hang' 
> threshold, etc.) are to be set via system properties.
> [1] 
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-6587) Ignite watchdog service

2018-06-26 Thread Sergey Chugunov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522457#comment-16522457
 ] 

Sergey Chugunov edited comment on IGNITE-6587 at 6/26/18 9:12 AM:
--

[~andrey-kuznetsov],

I haven't finished review yet but have some high-level questions:
* As I see from code there is no way to turn this functionality off. It may 
worth adding such ability e.g. on FailureHandler level. Users may even want not 
to stop the node but log message at ERROR level and let monitoring systems to 
identify such nodes and make appropriate actions.
* And even if we are not talking about turning this feature on and off, 
consider that we configured NoOpFailureHandler and, say, *grid-timeout-worker* 
is hanging. If *wal-file-archiver* executes onIdle logic it will get an 
exception and exits its main loop, as a result one hanging thread will cause 
another to terminate.
* It is not clear what to do with GC pauses longer than 
criticalHeartbeatTimeout - in that case node will stop when GC is finished. 
[~agoncharuk], [~agura], what do you guys think about handling this? Should we 
come up with some mechanism to deal with it?


was (Author: sergey-chugunov):
[~andrey-kuznetsov],

I haven't finished review yet but have some high-level questions:
* As I see from code there is no way to turn this functionality off. It may 
worth adding such ability e.g. on FailureHandler level. Users may even want not 
to stop the node but log message at ERROR level and let monitoring systems to 
identify such nodes and make appropriate actions.
* It is not clear what to do with GC pauses longer than 
criticalHeartbeatTimeout - in that case node will stop when GC is finished. 
[~agoncharuk], [~agura], what do you guys think about handling this? Should we 
come up with some mechanism to deal with it?

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 2.2
>Reporter: Alexey Goncharuk
>Assignee: Andrey Kuznetsov
>Priority: Major
>  Labels: IEP-5
> Fix For: 2.6
>
> Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls failure handler when 
> one of the following conditions has been detected:
> * Critical thread is not alive anymore.
> * Critical thread 'hangs' for a long time, e.g. while executing a task 
> extracted from task queue.
> In case of failure condition, call stacks of all threads should be logged 
> before invoking failure handler.
> Actual list of system-critical threads can be found at [1].
> Implementations based on separate diagnostic thread seem fragile, cause this 
> thread become a vulnerable point with respect to thread termination and CPU 
> resource starvation. So we are to use self-monitoring approach: critical 
> threads themselves should monitor each other.
> Currently we have {{o.a.i.internal.worker.WorkersRegistry}} facility that 
> fits best to store and track system critical threads. All of them should be 
> refactored to be {{GridWorker's}} and added to {{WorkersRegistry}}. Each 
> worker should periodically choose some subset of peer workers and check 
> whether
> * All of them are alive.
> * All of them are actively running.
> It's required to add a 'heartbeat' timestamp to worker in order to implement 
> latter check. Additionally, infinite queue polls, waits on monitors or thread 
> parks should be refactored to their timed equivalents in system critical 
> threads.
> Monitoring parameters (enable/disable, check interval, thread 'hang' 
> threshold, etc.) are to be set via system properties.
> [1] 
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-6587) Ignite watchdog service

2018-05-29 Thread Andrey Kuznetsov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483947#comment-16483947
 ] 

Andrey Kuznetsov edited comment on IGNITE-6587 at 5/29/18 7:17 PM:
---

Changing critical threads to {{GridWorkers}} has been brought to separate 
issue, since it has its own value.


was (Author: andrey-kuznetsov):
Changing critical threads to {{GridWorkers}} has been brought to separate 
issue, since it has it's own value.

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
>  Issue Type: Improvement
>  Components: general
>Affects Versions: 2.2
>Reporter: Alexey Goncharuk
>Assignee: Andrey Kuznetsov
>Priority: Major
>  Labels: IEP-5
> Fix For: 2.6
>
> Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls failure handler when 
> one of the following conditions has been detected:
> * Critical thread is not alive anymore.
> * Critical thread 'hangs' for a long time, e.g. while executing a task 
> extracted from task queue.
> In case of failure condition, call stacks of all threads should be logged 
> before invoking failure handler.
> Actual list of system-critical threads can be found at [1].
> Implementations based on separate diagnostic thread seem fragile, cause this 
> thread become a vulnerable point with respect to thread termination and CPU 
> resource starvation. So we are to use self-monitoring approach: critical 
> threads themselves should monitor each other.
> Currently we have {{o.a.i.internal.worker.WorkersRegistry}} facility that 
> fits best to store and track system critical threads. All of them should be 
> refactored to be {{GridWorker's}} and added to {{WorkersRegistry}}. Each 
> worker should periodically choose some subset of peer workers and check 
> whether
> * All of them are alive.
> * All of them are actively running.
> It's required to add a 'heartbeat' timestamp to worker in order to implement 
> latter check. Additionally, infinite queue polls, waits on monitors or thread 
> parks should be refactored to their timed equivalents in system critical 
> threads.
> Monitoring parameters (enable/disable, check interval, thread 'hang' 
> threshold, etc.) are to be set via system properties.
> [1] 
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)