[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668842#comment-16668842 ]
Andrey Kuznetsov commented on IGNITE-9679:
------------------------------------------

[~Artem Budnikov], thanks, great job! Please consider some minor remarks.

* A blocked (aka hanging) worker could be included in the Critical Failures list.
* Workers of the Data Streamer striped pool could be added to the mission-critical worker list.
* Due to [1], blocked worker timeout configuration became a bit trickier. Should this be mentioned in the docs?

[1] https://issues.apache.org/jira/browse/IGNITE-9737?focusedCommentId=16632210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16632210

> Document critical workers liveness checking implementation
> ----------------------------------------------------------
>
>                 Key: IGNITE-9679
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9679
>             Project: Ignite
>          Issue Type: Task
>          Components: documentation
>            Reporter: Andrey Kuznetsov
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>             Fix For: 2.7
>
>
> Newly implemented critical worker thread liveness checks should be mentioned
> in Ignite Documentation. A brief description of the functionality follows.
> An Ignite node has a number of critical worker threads that should be alive and
> responsive, otherwise the node's health is not guaranteed. These threads monitor
> each other periodically and track two aspects of the thread being checked:
> - whether it's alive;
> - whether it updates its internal heartbeat timestamp.
> Whenever at least one of the above conditions is violated, the checker thread
> logs the error and calls the currently configured {{FailureHandler}}.
> The {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property
> affects monitoring behavior. At runtime, monitoring settings can be changed
> via {{FailureHandlingMxBean}}.
> By default, liveness checks are enabled, but blocked system worker detection
> will not lead to failure handler invocation, see
> {{FailureProcessor#getDefaultFailureHandler}}.
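
For readers of the docs, the two checks from the description (is the thread alive, and is its heartbeat timestamp fresh) could be illustrated roughly as follows. This is only a self-contained sketch and does not use Ignite's actual classes: `LivenessSketch`, `Worker`, `Checker`, `Handler` and `Problem` are hypothetical names, a single checker stands in for the workers monitoring each other, and the clock is passed in explicitly so the result is deterministic. The 1000 ms value plays the role of the blocked-worker timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LivenessSketch {
    /** The two violations named in the description (hypothetical enum). */
    enum Problem { TERMINATED, BLOCKED }

    /** Hypothetical stand-in for a configured failure handler. */
    interface Handler { void onFailure(String workerName, Problem problem); }

    /** A monitored worker: its thread plus the last heartbeat timestamp (ms). */
    static final class Worker {
        final String name;
        final Thread thread;
        volatile long heartbeatTs;

        Worker(String name, Thread thread, long heartbeatTs) {
            this.name = name;
            this.thread = thread;
            this.heartbeatTs = heartbeatTs;
        }
    }

    /** Checks every registered worker against both conditions. */
    static final class Checker {
        private final long blockedTimeoutMs;
        private final Handler handler;
        private final List<Worker> workers = new ArrayList<>();

        Checker(long blockedTimeoutMs, Handler handler) {
            this.blockedTimeoutMs = blockedTimeoutMs;
            this.handler = handler;
        }

        void register(Worker w) { workers.add(w); }

        void check(long now) {
            for (Worker w : workers) {
                if (!w.thread.isAlive())
                    handler.onFailure(w.name, Problem.TERMINATED); // thread has died
                else if (now - w.heartbeatTs > blockedTimeoutMs)
                    handler.onFailure(w.name, Problem.BLOCKED);    // heartbeat is stale
            }
        }
    }

    /** Starts a daemon thread that stays alive until the latch is released. */
    private static Thread parked(CountDownLatch release) {
        Thread t = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Runs one check pass and returns the reported failures in order. */
    static List<String> demo() {
        CountDownLatch release = new CountDownLatch(1);
        List<String> failures = new ArrayList<>();
        Checker checker = new Checker(1_000, (name, problem) -> failures.add(name + ":" + problem));

        // A thread that has already finished, to trigger the "not alive" check.
        Thread finished = new Thread(() -> { });
        finished.start();
        try {
            finished.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }

        checker.register(new Worker("responsive", parked(release), 1_400)); // fresh heartbeat
        checker.register(new Worker("stalled", parked(release), 0));        // stale heartbeat
        checker.register(new Worker("finished", finished, 1_400));          // dead thread

        checker.check(1_500); // "now" is 1500 ms, timeout is 1000 ms
        release.countDown();  // let the parked threads exit
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [stalled:BLOCKED, finished:TERMINATED]
    }
}
```

Passing the clock into `check` is a choice made for the sketch only; a real checker would read the system time and run periodically.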