[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrey Gura updated IGNITE-6587:
--------------------------------
    Ignite Flags: Docs Required

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.7
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical
> threads. We should implement a periodic check that calls the failure handler
> when one of the following conditions is detected:
> * A critical thread is no longer alive.
> * A critical thread 'hangs' for a long time, e.g. while executing a task
> extracted from a task queue.
> When a failure condition is detected, the call stacks of all threads should
> be logged before the failure handler is invoked.
> The actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because
> that thread becomes a vulnerable point with respect to thread termination and
> CPU resource starvation. We should therefore use a self-monitoring approach:
> critical threads should monitor each other.
> Currently the {{o.a.i.internal.worker.WorkersRegistry}} facility is the best
> fit for storing and tracking system-critical threads. All of them should be
> refactored into {{GridWorker}} instances and added to {{WorkersRegistry}}.
> Each worker should periodically choose some subset of peer workers and check
> whether
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp must be added to the worker in order to implement the
> latter check. Additionally, infinite queue polls, waits on monitors, and
> thread parks in system-critical threads should be replaced with their timed
> equivalents.
> Monitoring parameters (enable/disable, check interval, thread 'hang'
> threshold, etc.) are to be set via system properties.
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
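
To make the proposed self-monitoring scheme concrete, here is a minimal Java sketch of heartbeat-based peer checking. It is only an illustration under stated assumptions, not the Ignite implementation: the class names (WatchdogSketch, Worker, Registry), the method names, and the "sketch.watchdog.*" system property names are all hypothetical and do not correspond to the real GridWorker or WorkersRegistry API. Time is passed in explicitly so the checks are easy to reason about.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Illustrative sketch only; these are NOT Ignite's GridWorker or WorkersRegistry classes. */
class WatchdogSketch {
    /** A critical worker that publishes a heartbeat timestamp whenever it makes progress. */
    static class Worker {
        final String name;
        final Thread thread;
        private volatile long heartbeatMs;

        Worker(String name, Thread thread, long nowMs) {
            this.name = name;
            this.thread = thread;
            this.heartbeatMs = nowMs;
        }

        void updateHeartbeat(long nowMs) { heartbeatMs = nowMs; }

        long heartbeatMs() { return heartbeatMs; }
    }

    /** Registry through which workers check each other; there is no dedicated watchdog thread. */
    static class Registry {
        private final Map<String, Worker> workers = new ConcurrentHashMap<>();
        private final boolean enabled;
        private final long hangThresholdMs;
        private final BiConsumer<Worker, String> failureHandler;

        Registry(boolean enabled, long hangThresholdMs, BiConsumer<Worker, String> failureHandler) {
            this.enabled = enabled;
            this.hangThresholdMs = hangThresholdMs;
            this.failureHandler = failureHandler;
        }

        /** Reads monitoring parameters from system properties (hypothetical property names). */
        static Registry fromSystemProperties(BiConsumer<Worker, String> failureHandler) {
            boolean on = Boolean.parseBoolean(System.getProperty("sketch.watchdog.enabled", "true"));
            long threshold = Long.getLong("sketch.watchdog.hangThresholdMs", 10_000L);
            return new Registry(on, threshold, failureHandler);
        }

        void register(Worker w) { workers.put(w.name, w); }

        /** Checks up to sampleSize randomly chosen peers of self; returns the number of failures found. */
        int checkPeers(Worker self, int sampleSize, long nowMs) {
            if (!enabled)
                return 0;
            List<Worker> peers = new ArrayList<>(workers.values());
            peers.remove(self);
            Collections.shuffle(peers, ThreadLocalRandom.current());
            int failures = 0;
            for (Worker peer : peers.subList(0, Math.min(sampleSize, peers.size()))) {
                String reason = null;
                if (!peer.thread.isAlive())
                    reason = "terminated";
                else if (nowMs - peer.heartbeatMs() > hangThresholdMs)
                    reason = "blocked";
                if (reason != null) {
                    System.err.println(dumpAllThreads()); // log call stacks before invoking the handler
                    failureHandler.accept(peer, reason);
                    failures++;
                }
            }
            return failures;
        }
    }

    /** Renders the call stacks of all live threads as text. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            sb.append("Thread ").append(e.getKey().getName()).append('\n');
            for (StackTraceElement el : e.getValue())
                sb.append("    at ").append(el).append('\n');
        }
        return sb.toString();
    }

    /** Worker loop: a timed poll replaces an infinite take(), so the heartbeat advances while idle. */
    static void runLoop(Worker self, BlockingQueue<Runnable> queue, Registry reg, long pollMs)
        throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            self.updateHeartbeat(System.currentTimeMillis());
            Runnable task = queue.poll(pollMs, TimeUnit.MILLISECONDS);
            if (task != null)
                task.run();
            reg.checkPeers(self, 2, System.currentTimeMillis());
        }
    }
}
```

In this sketch the poll timeout should be well below the hang threshold; otherwise an idle worker would look blocked to its peers. A real implementation would also need to avoid reporting the same failed peer repeatedly, which is left out here for brevity.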