[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695341#comment-16695341 ] Denis Magda commented on IGNITE-9679: - Thanks folks, everything is clear. Closing the ticket. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Denis Magda >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690218#comment-16690218 ] Prachi Garg commented on IGNITE-9679: - [~Artem Budnikov], looks good. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Prachi Garg >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686258#comment-16686258 ] Artem Budnikov commented on IGNITE-9679: [~pgarg] I updated the description because the implementation has changed. By default, the failure handler is not called when a critical worker stops responding. If the user wants to call the failure handler in such situations, additional configuring is required, that's why I added another configuration example in the "Critical Workers Health Check" section. Please review. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Artem Budnikov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672441#comment-16672441 ] Prachi Garg commented on IGNITE-9679: - [~Artem Budnikov], Is failure handler called under two circumstances - either there is critical failure or there is a critical worker that is not operational? If so, then what is the difference in configuration for Failure Handler section and Critical Workers Health Check section? > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Prachi Garg >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670176#comment-16670176 ] Artem Budnikov commented on IGNITE-9679: [~pgarg] please review the page. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Prachi Garg >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669859#comment-16669859 ] Andrey Kuznetsov commented on IGNITE-9679: -- [~Artem Budnikov], sorry. Maybe I overlooked something yesterday. Now I see that configuration description is up to date. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Andrey Kuznetsov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669837#comment-16669837 ] Artem Budnikov commented on IGNITE-9679: [~andrey-kuznetsov] I've added the first two points you mentioned to the page. With regard to configuration, I think I've described that already, or is something missing? > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Andrey Kuznetsov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668842#comment-16668842 ] Andrey Kuznetsov commented on IGNITE-9679: -- [~Artem Budnikov], thanks, great job! Please consider some minor remarks. * Blocked (aka hanging) worker could be included to Critical Failures list. * Workers of Data Streamer striped pool could be added to mission critical worker list. * Due to [1], blocked worker timeout configuration became a bit trickier. Should this be mentioned in docs? [1] https://issues.apache.org/jira/browse/IGNITE-9737?focusedCommentId=16632210=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16632210 > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Andrey Kuznetsov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668757#comment-16668757 ] Artem Budnikov commented on IGNITE-9679: I added a description of this feature to [https://apacheignite.readme.io/v2.6/docs/critical-failures-handling-27#section-critical-workers-health-check] [~andrey-kuznetsov], [~agura] please review and provide feedback. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Artem Budnikov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655548#comment-16655548 ] Andrey Kuznetsov commented on IGNITE-9679: -- [~Artem Budnikov], this issue is not blocked anymore, your help with it is appreciated. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > {{IgniteConfiguration.SystemWorkerBlockedTimeout}} configuration property > affects monitoring behavior. At runtime monitoring settings can be changed > via {{FailureHandlingMxBean}}. > By default, liveness checks are enabled, but blocked system worker detection > will not lead to failure handler invocation, see > {{FailureProcessor#getDefaultFailureHandler}} . -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (IGNITE-9679) Document critical workers liveness checking implementation
[ https://issues.apache.org/jira/browse/IGNITE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626422#comment-16626422 ] Denis Magda commented on IGNITE-9679: - [~Artem Budnikov], please help to document this issue. > Document critical workers liveness checking implementation > -- > > Key: IGNITE-9679 > URL: https://issues.apache.org/jira/browse/IGNITE-9679 > Project: Ignite > Issue Type: Task > Components: documentation >Reporter: Andrey Kuznetsov >Assignee: Artem Budnikov >Priority: Major > Fix For: 2.7 > > > Newly implemented critical worker thread liveness checks should be mentioned > in Ignite Documentation. Brief description of the functionality follows. > Ignite node has a number of critical worker threads that should be alive and > responsive, otherwise node's health is not guaranteed. These threads monitor > each other periodically and track two aspects for a thread being checked: > - whether it's alive; > - whether it updates its internal heartbeat timestamp. > Both checks use {{IgniteConfiguration.failureDetectionTimeout}} property as a > threshold value. > Whenever at least one of the above conditions is violated, checker thread > logs the error and calls currently configured {{FailureHandler}}. > Liveness checks are enabled by default, but can be disabled through > {{WorkersControlMXBean.healthMonitoringEnabled}} property. -- This message was sent by Atlassian JIRA (v7.6.3#76005)