[ 
https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602284#comment-16602284
 ] 

Andrey Gura commented on IGNITE-6587:
-------------------------------------

[~andrey-kuznetsov] I've looked at your changes, and I don't like the idea of 
invoking the failure processor when a critical worker is blocked. There are 
many situations in which a thread is blocked intentionally and explicitly, 
e.g. fsync in the checkpointer or wal-writer threads. In such cases we can use 
guards to exclude the thread from liveness checking.
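
To make the guard idea concrete, here is a minimal sketch. The class and 
method names are hypothetical and are not taken from the patch or from any 
Ignite API; timestamps are passed in explicitly only to keep the example 
deterministic. It covers only the explicit case, where the worker itself knows 
it is about to block.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-worker liveness state; not an Ignite class. */
final class LivenessGuardSketch {
    /** Sentinel: the worker is blocked on purpose, so skip the check. */
    private static final long SUSPENDED = Long.MAX_VALUE;

    /** Time of the last heartbeat, or SUSPENDED inside a guarded section. */
    private final AtomicLong heartbeatTs;

    LivenessGuardSketch(long nowMs) {
        heartbeatTs = new AtomicLong(nowMs);
    }

    /** Called by the worker on every loop iteration. */
    void updateHeartbeat(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Called right before an intentionally blocking call such as fsync. */
    void blockingSectionBegin() {
        heartbeatTs.set(SUSPENDED);
    }

    /** Called right after the blocking call returns. */
    void blockingSectionEnd(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Called by whoever performs the liveness check. */
    boolean isSuspectedHung(long nowMs, long thresholdMs) {
        long ts = heartbeatTs.get();

        // A guarded worker is never reported, however long it stays blocked.
        return ts != SUSPENDED && nowMs - ts > thresholdMs;
    }
}
```

A worker would wrap the fsync call in blockingSectionBegin() and 
blockingSectionEnd(), with the latter in a finally block.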

Moreover, a worker can be blocked implicitly (e.g. the exchange-worker or a 
thread from the striped pool reaches an fsync point). Guards are useless here.

While an overly long fsync isn't good, it is still a valid situation. If we 
stop a node because of a blocked worker, we can eventually stop all nodes of 
the cluster, because the load will be redistributed among the remaining live 
nodes and each of them will carry a higher load.

[~andrey-kuznetsov] Could you please initiate a discussion on the dev list? 
The goal of this discussion is to find an approach to addressing the problems 
described above.

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.7
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls the failure handler 
> when one of the following conditions is detected:
> * A critical thread is not alive anymore.
> * A critical thread 'hangs' for a long time, e.g. while executing a task 
> extracted from the task queue.
> In case of a failure condition, call stacks of all threads should be logged 
> before invoking the failure handler.
> The actual list of system-critical threads can be found at [1].
> Implementations based on a separate diagnostic thread seem fragile, because 
> this thread becomes a vulnerable point with respect to thread termination and 
> CPU resource starvation. So we are to use a self-monitoring approach: 
> critical threads themselves should monitor each other.
> Currently we have the {{o.a.i.internal.worker.WorkersRegistry}} facility, 
> which is best suited to store and track system-critical threads. All of them 
> should be refactored to be {{GridWorker}}s and added to {{WorkersRegistry}}. 
> Each worker should periodically choose some subset of peer workers and check 
> whether
> * All of them are alive.
> * All of them are actively running.
> A 'heartbeat' timestamp must be added to the worker in order to implement 
> the latter check. Additionally, infinite queue polls, waits on monitors and 
> thread parks should be refactored to their timed equivalents in 
> system-critical threads.
> Monitoring parameters (enable/disable, check interval, thread 'hang' 
> threshold, etc.) are to be set via system properties.
> [1] 
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
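
The peer check described in the issue could be sketched roughly as follows. 
This is an illustration under assumptions, not the actual WorkersRegistry or 
GridWorker code: every class, field and method name below is hypothetical, the 
'alive' flag stands in for Thread.isAlive(), and the round-robin choice of 
peers is just one possible way to pick a subset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical registry in which workers check a subset of their peers. */
final class PeerCheckSketch {
    /** Minimal stand-in for a monitored worker. */
    static final class Worker {
        final String name;
        volatile boolean alive = true;  // Stand-in for Thread.isAlive().
        volatile long heartbeatTs;      // Updated by the worker's own loop.

        Worker(String name, long nowMs) {
            this.name = name;
            this.heartbeatTs = nowMs;
        }
    }

    private final List<Worker> workers = new CopyOnWriteArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();
    private final long hangThresholdMs;
    private final int peersPerCheck;

    PeerCheckSketch(long hangThresholdMs, int peersPerCheck) {
        this.hangThresholdMs = hangThresholdMs;
        this.peersPerCheck = peersPerCheck;
    }

    Worker register(String name, long nowMs) {
        Worker w = new Worker(name, nowMs);
        workers.add(w);
        return w;
    }

    /** Checks the next few registered workers; returns found problems. */
    List<String> checkPeers(Worker self, long nowMs) {
        List<String> failures = new ArrayList<>();
        int n = workers.size();

        for (int i = 0; i < Math.min(peersPerCheck, n); i++) {
            // Shared cursor, so successive checks cover different peers.
            Worker peer = workers.get(Math.floorMod(cursor.getAndIncrement(), n));

            if (peer == self)
                continue;

            if (!peer.alive)
                failures.add(peer.name + ": terminated");
            else if (nowMs - peer.heartbeatTs > hangThresholdMs)
                failures.add(peer.name + ": blocked");
        }

        return failures;
    }
}
```

In this sketch a non-empty result is the point at which thread dumps would be 
logged and the failure handler invoked. The timed polls and waits requested in 
the issue matter here because they let an idle worker wake up regularly to 
refresh its own heartbeat and to run the check.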



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
