[ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184378#comment-16184378 ]
Gour Saha edited comment on SLIDER-1246 at 9/28/17 6:46 PM:
------------------------------------------------------------

Four new resources config properties have been introduced in the patch, which should provide the health-threshold control required for this feature -

{code}
yarn.container.health.threshold.percent
e.g.
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 100 containers for a component, then 80 or more running containers will deem the component as healthy)
{code}
There is no default, so it needs to be set explicitly in the resources file to enable the health monitor. It can be defined at the global level to enable monitors for all components. When defined at the global level, the same percent applies to all components. It can also be defined at the component level to override the global value for a specific component. Note, if the health monitor is enabled for a component, then the failure threshold is automatically disabled for that component. So, if a value is set for *_yarn.container.failure.threshold_* as well, it will be ignored for that component. If the health threshold is set for one component and the failure threshold for another, they will compete with each other in determining whether an app is unhealthy. Whichever trips first brings the app down, so app owners need to understand the behavior of these competing properties and set them appropriately.

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 hour)
{code}
Default is 600 secs (10 mins). This is the amount of time a component is allowed to be below the health-threshold percent, after which the application is stopped. If the health crosses above the threshold before the window expires, then this window is reset to 0. So, if the health goes below the threshold again later, it has to stay there for the entire window to be considered unhealthy.

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll frequency to 20 secs)
{code}
The frequency at which the monitor wakes up and checks the component health. Default is 10 secs. For most purposes, this property does not need to be set by the application owner, unless the app owner knows exactly what she/he is doing.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "400" (sets the init delay to 400 secs)
{code}
Default is 600 secs. This provides additional lead time before the health monitor kicks in. Note, the component health check timer starts at the end of this init delay. It gives the application extra lead time to bring up its containers the first time it is started (while it is working its way up to cross the health-threshold percent). Once the init delay expires, if a component is still below the health-threshold percent, the monitor kicks in and waits for *_yarn.container.health.threshold.window.secs_* more time before it stops the app (assuming it never crosses the threshold percent). Hence, when the app starts, it technically gets yarn.container.health.threshold.init.delay.secs + yarn.container.health.threshold.window.secs time to cross the health-threshold percent. If no extra initial lead time is required, set it to 0.

Note, the node failure blacklisting feature is implemented by SLIDER-1199.


was (Author: gsaha):
4 new resources config properties have been introduced in the patch which should provide the health-threshold control required for this feature -

{code}
yarn.container.health.threshold.percent
e.g.
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 100 containers for a component, then 80 or more running containers will deem the component as healthy)
{code}
There is no default, so needs to be explicitly set in resources file to enable health monitor.
It can be defined at the global level to enable monitors for all components. When defined at global-level the same percent is applicable for all components. It can be defined at component-level also to override the global value for a specific component. Note, if health monitor is enabled for a component then failure threshold is automatically disabled for that component. So, if a value is set for *_yarn.container.failure.threshold_* as well, it will be ignored for that component. If health-threshold is set for one component and failure threshold for another, they will compete with each other against determining an app to be unhealthy. Whoever wins brings the app down first, so the app owners need to understand the behavior of these competing properties and set them appropriately.

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 hour)
{code}
Default is 600 secs (5 mins). The amount of time a component is allowed to be below the health-threshold percent after which the application is stopped. If the health crosses above threshold before the window expires then this window is reset to 0. So, if the health goes below threshold later again, it has to be there for the entire window to be considered unhealthy.

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by the application owner, unless the app owner knows exactly what she/he is doing.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the window to 30 mins)
{code}
Default is 600 secs (same as default for *_yarn.container.health.threshold.window.secs_*).
Controls the health monitor's behavior the exact same way as *_yarn.container.health.threshold.window.secs_* does, except that it comes into play only the first time when the application is started while it is working its way up to cross the health-threshold percent for the first time.

Note, the node failure blacklisting feature is implemented by SLIDER-1199.


> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
> Key: SLIDER-1246
> URL: https://issues.apache.org/jira/browse/SLIDER-1246
> Project: Slider
> Issue Type: Bug
> Components: appmaster, core
> Affects Versions: Slider 0.92
> Reporter: Prasanth Jayachandran
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an
> application failure.
> Observed this in HIVE-16927, where container failures in certain nodes brings
> down entire application. Slider has to provide a way to not mark application
> as unhealthy if certain threshold of containers are running. Tuning failure
> threshold is not optimal as setting the correct default on large cluster is
> not trivial. Beyond certain failures, slider should mark the node as
> unhealthy and report that back to client/AM. Application could continue to
> run as long as container request is satisfied partially (example: 80%
> containers are running).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
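
For readers who want to reason about how the percent, window, poll frequency and init delay interact in the edited comment above, here is a minimal Python sketch. It is only a model of the behavior as described in the comment, not Slider's actual implementation: the names HealthPolicy, is_healthy and simulate_stop_time are hypothetical, and it assumes the first check happens exactly when the init delay ends and that the window is measured from the first poll that finds the component below the threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealthPolicy:
    """Hypothetical holder for the four properties; defaults follow the comment."""
    threshold_percent: int         # yarn.container.health.threshold.percent (no default)
    window_secs: int = 600         # yarn.container.health.threshold.window.secs
    poll_frequency_secs: int = 10  # yarn.container.health.threshold.poll.frequency.secs
    init_delay_secs: int = 600     # yarn.container.health.threshold.init.delay.secs


def is_healthy(running: int, desired: int, threshold_percent: int) -> bool:
    """True when running containers are at or above the threshold percent of desired."""
    if desired == 0:
        return True  # assumption: a component with nothing requested counts as healthy
    return running * 100 >= threshold_percent * desired


def simulate_stop_time(policy: HealthPolicy,
                       desired: int,
                       running_at: Callable[[int], int],
                       horizon_secs: int) -> Optional[int]:
    """Return the second at which the app would be stopped, or None within the horizon.

    running_at(t) gives the number of running containers t seconds after app start.
    """
    unhealthy_since: Optional[int] = None
    t = policy.init_delay_secs  # assumption: first poll at the end of the init delay
    while t <= horizon_secs:
        if is_healthy(running_at(t), desired, policy.threshold_percent):
            unhealthy_since = None  # crossing back above the threshold resets the window
        else:
            if unhealthy_since is None:
                unhealthy_since = t
            if t - unhealthy_since >= policy.window_secs:
                return t  # below threshold for the entire window
        t += policy.poll_frequency_secs
    return None


# A component stuck at 50 of 100 containers with an 80% threshold and default
# timings is stopped after init delay + window = 600 + 600 secs.
policy = HealthPolicy(threshold_percent=80)
print(simulate_stop_time(policy, 100, lambda t: 50, 3600))  # 1200
```

Under these assumptions, a brief recovery above the threshold restarts the window, so a component that dips again later must stay below the threshold for a full window before the app is stopped.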