[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618882#comment-16618882 ]
Andrey Kuznetsov edited comment on IGNITE-6587 at 9/18/18 10:28 AM:
--------------------------------------------------------------------

[~agura], I've updated the implementation after discussing your points, see [1]. Now it's waiting for your review.

[1] http://apache-ignite-developers.2346864.n4.nabble.com/Critical-worker-threads-liveness-checking-drawbacks-td34783.html

was (Author: andrey-kuznetsov):
[~agura], I've updated the implementation after discussing your points, see [1]. Now it's waiting for your review.

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.7
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical
> threads. We should implement a periodic check that calls the failure handler
> when one of the following conditions has been detected:
> * A critical thread is not alive anymore.
> * A critical thread 'hangs' for a long time, e.g. while executing a task
> extracted from a task queue.
> In case of a failure condition, call stacks of all threads should be logged
> before invoking the failure handler.
> The actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because
> this thread becomes a vulnerable point with respect to thread termination and
> CPU resource starvation. So we should use a self-monitoring approach:
> critical threads themselves should monitor each other.
> Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility,
> which fits best to store and track system-critical threads. All of them
> should be refactored to be {{GridWorker}}s and added to {{WorkersRegistry}}.
> Each worker should periodically choose some subset of peer workers and check
> whether
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp has to be added to each worker in order to implement
> the latter check. Additionally, infinite queue polls, waits on monitors and
> thread parks should be refactored to their timed equivalents in
> system-critical threads.
> Monitoring parameters (enable/disable, check interval, thread 'hang'
> threshold, etc.) are to be set via system properties.
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
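
The self-monitoring scheme described in the issue (a registry of workers, a per-worker heartbeat timestamp, and peer checks for liveness and progress) could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: SketchWorker and SketchRegistry are hypothetical names, not Ignite's GridWorker or WorkersRegistry; every peer is checked rather than a chosen subset; and the current time is passed in explicitly so the logic stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a monitored worker; not Ignite's GridWorker. */
class SketchWorker {
    final String name;

    /** Liveness probe, e.g. thread::isAlive for a real thread. */
    final BooleanSupplier alive;

    /** Last time (in ms) at which the worker reported progress. */
    private volatile long heartbeatTs;

    SketchWorker(String name, BooleanSupplier alive, long now) {
        this.name = name;
        this.alive = alive;
        this.heartbeatTs = now;
    }

    /** Called by the worker itself, e.g. once per loop iteration. */
    void updateHeartbeat(long now) {
        heartbeatTs = now;
    }

    long heartbeatTs() {
        return heartbeatTs;
    }
}

/** Hypothetical stand-in for a workers registry; not Ignite's WorkersRegistry. */
class SketchRegistry {
    private final Map<String, SketchWorker> workers = new ConcurrentHashMap<>();

    /** Assumed 'hang' threshold: maximum tolerated heartbeat age in ms. */
    private final long hangThresholdMs;

    SketchRegistry(long hangThresholdMs) {
        this.hangThresholdMs = hangThresholdMs;
    }

    void register(SketchWorker w) {
        workers.put(w.name, w);
    }

    /**
     * Run by a worker to check its peers. Returns the names of peers that
     * are either not alive or have a heartbeat older than the threshold.
     */
    List<String> checkPeers(SketchWorker self, long now) {
        List<String> failed = new ArrayList<>();

        for (SketchWorker peer : workers.values()) {
            if (peer == self)
                continue; // A worker does not check itself.

            boolean dead = !peer.alive.getAsBoolean();
            boolean hung = now - peer.heartbeatTs() > hangThresholdMs;

            if (dead || hung)
                failed.add(peer.name);
        }

        return failed;
    }
}
```

In a real worker loop one would presumably call updateHeartbeat(System.currentTimeMillis()) on each iteration, replace indefinite waits with timed ones (for example BlockingQueue.poll with a timeout) so the heartbeat keeps advancing, and pass any non-empty result of checkPeers to the failure handler after logging thread stacks. The threshold could be read from a system property via Long.getLong with a default value; the property name would be up to the implementation.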