[ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184378#comment-16184378 ]
Gour Saha edited comment on SLIDER-1246 at 9/28/17 4:58 PM:
------------------------------------------------------------

4 new resources config properties have been introduced in the patch which should provide the health-threshold control required for this feature -

{code}
yarn.container.health.threshold.percent
e.g.
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 100 containers for a component, then 80 or more running containers will deem the component as healthy)
{code}
There is no default, so it needs to be explicitly set in the resources file to enable the health monitor. It can be defined at the global level to enable monitors for all components, in which case the same percent applies to all components. It can also be defined at the component level to override the global value for a specific component.

Note, if the health monitor is enabled for a component then the failure threshold is automatically disabled for that component. So, if a value is set for *_yarn.container.failure.threshold_* as well, it will be ignored for that component. If the health threshold is set for one component and the failure threshold for another, the two compete in determining whether the app is unhealthy. Whichever is breached first brings the app down, so app owners need to understand the behavior of these competing properties and set them appropriately.

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 hour)
{code}
Default is 600 secs (10 mins). This is the amount of time a component is allowed to be below the health-threshold percent, after which the application is stopped. If the health rises above the threshold before the window expires then the window is reset to 0. So, if the health later drops below the threshold again, it has to stay there for the entire window to be considered unhealthy.

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by the application owner, unless the app owner knows exactly what she/he is doing.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the initial delay to 30 mins)
{code}
Default is 600 secs (same as the default for *_yarn.container.health.threshold.window.secs_*). It controls the health monitor's behavior in exactly the same way as *_yarn.container.health.threshold.window.secs_* does, except that it comes into play only once, when the application is first started and is working its way up to cross the health-threshold percent for the first time.

Note, the node failure blacklisting feature is implemented by SLIDER-1199.


was (Author: gsaha):
4 new config properties have been introduced in the patch which should provide the health threshold control required for this feature -

{code}
yarn.container.health.threshold.percent
e.g.
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 100 containers for a component, then 80 or more running containers will deem the component as healthy)
{code}
There is no default, so needs to be explicitly set in resources file to enable health monitor. It can be defined at the global level to enable monitor for all components. When defined at global level the same percent is applicable for all components. It can be defined at component level also to override the global value for a specific component.

Note, if health monitor is enabled for a component then failure threshold is disabled for that component. So, if a value is set for yarn.container.failure.threshold it will be ignored.

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 hour)
{code}
Default is 600 secs (10 mins).
The amount of time a component is allowed to be below the health threshold percent, after which the application is stopped. If the health crosses above threshold before the window expires then this window is reset to 0. So, if the health goes below threshold later again, it has to be there for the entire window to be considered unhealthy.

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by the application owner.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the initial delay to 30 mins)
{code}
Default is 600 secs (same as default for yarn.container.health.threshold.window.secs). Controls the health monitor's behavior the exact same way as yarn.container.health.threshold.window.secs does, except that it comes into play only the first time when the application is started.

Note, the node failure blacklisting feature is implemented by SLIDER-1199.

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
> Key: SLIDER-1246
> URL: https://issues.apache.org/jira/browse/SLIDER-1246
> Project: Slider
> Issue Type: Bug
> Components: appmaster, core
> Affects Versions: Slider 0.92
> Reporter: Prasanth Jayachandran
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an
> application failure.
> Observed this in HIVE-16927, where container failures in certain nodes bring
> down the entire application. Slider has to provide a way to not mark an application
> as unhealthy if a certain threshold of containers is running. Tuning the failure
> threshold is not optimal, as setting the correct default on a large cluster is
> not trivial.
> Beyond a certain number of failures, Slider should mark the node as
> unhealthy and report that back to the client/AM. The application could continue to
> run as long as the container request is partially satisfied (example: 80% of
> containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
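
The interaction between the health-threshold percent, the window and the init delay described in the comment above can be sketched in a few lines of Python. This is only an illustrative model, not Slider's implementation: the class `HealthMonitor`, its `poll` method and all field names are hypothetical. Two points are assumptions: that the init delay is a startup grace period measured from application start, and that the app is stopped once the elapsed time is greater than or equal to the window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthMonitor:
    """Hypothetical model of the health-threshold window (not Slider code)."""
    threshold_percent: int            # yarn.container.health.threshold.percent (no default)
    window_secs: int = 600            # yarn.container.health.threshold.window.secs
    init_delay_secs: int = 600        # yarn.container.health.threshold.init.delay.secs
    below_since: Optional[int] = None  # when health last dropped below the threshold
    reached_once: bool = False        # True once the threshold has been met at least once

    def poll(self, now: int, running: int, desired: int) -> bool:
        """Return True if the app should be stopped at `now` (secs since app start)."""
        healthy = running * 100 >= self.threshold_percent * desired
        if healthy:
            # Crossing back above the threshold resets the window.
            self.reached_once = True
            self.below_since = None
            return False
        if not self.reached_once:
            # Assumption: during startup only the init delay applies.
            return now >= self.init_delay_secs
        if self.below_since is None:
            self.below_since = now  # a new unhealthy window starts here
        # Stop only if health stayed below the threshold for the whole window.
        return now - self.below_since >= self.window_secs
```

Calling `poll` once every `yarn.container.health.threshold.poll.frequency.secs` seconds with the current running and desired container counts for a component would mimic the monitor: a component that dips below 80% and recovers before the window expires never triggers a stop, while one that stays below for the full window does.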