[jira] [Comment Edited] (SLIDER-1246) Application health should not be affected by faulty nodes

Gour Saha (JIRA) Thu, 28 Sep 2017 11:48:15 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184378#comment-16184378
 ]


Gour Saha edited comment on SLIDER-1246 at 9/28/17 6:46 PM:
------------------------------------------------------------

The patch introduces 4 new resources config properties, which should provide 
the health-threshold control required for this feature:
{code}
yarn.container.health.threshold.percent
e.g. 
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 
100 containers for a component, then 80 or more running containers will deem 
the component as healthy)
{code}
There is no default, so this property must be set explicitly in the resources 
file to enable the health monitor. It can be defined at the global level to 
enable monitors for all components, in which case the same percent applies to 
all of them. It can also be defined at the component level to override the 
global value for a specific component. Note that if the health monitor is 
enabled for a component, the failure threshold is automatically disabled for 
that component, so any value set for *_yarn.container.failure.threshold_* is 
ignored for it. If the health threshold is set for one component and the 
failure threshold for another, the two compete in determining whether the app 
is unhealthy. Whichever is breached first brings the app down, so app owners 
need to understand the behavior of these competing properties and set them 
appropriately.
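
The override rule and the percent check described above can be sketched as 
follows. This is only an illustration, not the code in the patch: the function 
names and the dict-based view of the resources file are assumptions made for 
the example, as is the treatment of a component with zero desired containers.

```python
from typing import Mapping, Optional

HEALTH_KEY = "yarn.container.health.threshold.percent"


def resolve_health_threshold(
    global_props: Mapping[str, str],
    component_props: Mapping[str, str],
) -> Optional[int]:
    """Return the effective threshold percent, or None if the monitor is off.

    A component-level value overrides the global one. With neither set there
    is no default, so the health monitor stays disabled for the component.
    (Hypothetical helper; the name and signature are not from the patch.)
    """
    value = component_props.get(HEALTH_KEY, global_props.get(HEALTH_KEY))
    return int(value) if value is not None else None


def is_component_healthy(running: int, desired: int, threshold_percent: int) -> bool:
    """True when running containers are at or above the threshold percent."""
    if desired <= 0:
        # Assumption: a component that wants no containers is never unhealthy.
        return True
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return running * 100 >= desired * threshold_percent
```

With a global value of "80" and no component override, a component with 100 
desired containers is healthy at 80 running containers and unhealthy at 79.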

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 
hour)
{code}
Default is 600 secs (10 mins). This is the amount of time a component is 
allowed to stay below the health-threshold percent before the application is 
stopped. If the health crosses back above the threshold before the window 
expires, the window is reset to 0. So, if the health later drops below the 
threshold again, it has to stay there for the entire window for the component 
to be considered unhealthy.
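
The reset behavior of the window can be sketched as a small state tracker. 
This is a hedged illustration rather than the monitor from the patch: the 
class `HealthWindow`, its `observe` method, and the choice to stop exactly when 
the elapsed time reaches the window are all assumptions.

```python
from typing import Optional


class HealthWindow:
    """Tracks how long a component has been continuously below its threshold.

    Hypothetical helper for illustration; not a class from Slider.
    """

    def __init__(self, window_secs: int = 600) -> None:
        self.window_secs = window_secs
        # Time at which the current unhealthy streak began, if any.
        self.below_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True if the app should be stopped."""
        if healthy:
            self.below_since = None  # recovery resets the window to 0
            return False
        if self.below_since is None:
            self.below_since = now  # a new unhealthy streak starts here
        return now - self.below_since >= self.window_secs
```

A component that is unhealthy for 590 secs, recovers briefly, and then drops 
again must stay below the threshold for a full 600 secs from the second drop 
before `observe` returns True.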

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll 
frequency to 20 secs)
{code}
The frequency at which the monitor wakes up and checks the component health. 
Default is 10 secs. For most purposes the application owner does not need to 
set this property, unless they know exactly what they are doing.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "400" (sets the init delay 
to 400 secs)
{code}
Default is 600 secs. This provides additional lead time before the health 
monitor kicks in: the component health check timer starts only at the end of 
this init delay. It gives the application extra time to bring up its 
containers the first time it is started (while it is working its way up to 
cross the health-threshold percent). If a component is still below the 
health-threshold percent when the init delay expires, the monitor kicks in and 
waits for *_yarn.container.health.threshold.window.secs_* more before it stops 
the app (assuming the component never crosses the threshold percent in that 
time). Hence, when the app starts, it effectively gets 
*_yarn.container.health.threshold.init.delay.secs_* + 
*_yarn.container.health.threshold.window.secs_* to cross the health-threshold 
percent. If no extra initial lead time is required, set it to 0.
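
As a quick worked example of the startup arithmetic above, the following 
sketch adds the two settings. The function name `startup_grace_secs` is 
hypothetical, and the defaults are the ones stated in this comment.

```python
def startup_grace_secs(init_delay_secs: int = 600, window_secs: int = 600) -> int:
    """Seconds a newly started app has to reach the health-threshold percent.

    Hypothetical helper: init delay plus window, as described in the comment.
    """
    if init_delay_secs < 0 or window_secs < 0:
        raise ValueError("delays must be non-negative")
    return init_delay_secs + window_secs


print(startup_grace_secs())           # defaults: 600 + 600 = 1200
print(startup_grace_secs(400, 3600))  # example values above: 400 + 3600 = 4000
print(startup_grace_secs(0))          # no extra lead time: 0 + 600 = 600
```

So with the defaults an app gets 20 minutes after start before it can be 
stopped for being below the threshold.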

Note, the node failure blacklisting feature is implemented by SLIDER-1199.


was (Author: gsaha):
4 new resources config properties have been introduced in the patch which 
should provide the health-threshold control required for this feature -
{code}
yarn.container.health.threshold.percent
e.g. 
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 
100 containers for a component, then 80 or more running containers will deem 
the component as healthy)
{code}
There is no default, so needs to be explicitly set in resources file to enable 
health monitor. It can be defined at the global level to enable monitors for 
all components. When defined at global-level the same percent is applicable for 
all components. It can be defined at component-level also to override the 
global value for a specific component. Note, if health monitor is enabled for a 
component then failure threshold is automatically disabled for that component. 
So, if a value is set for *_yarn.container.failure.threshold_* as well, it will 
be ignored for that component. If health-threshold is set for one component and 
failure threshold for another, they will compete with each other against 
determining an app to be unhealthy. Whoever wins brings the app down first, so 
the app owners need to understand the behavior of these competing properties 
and set them appropriately.

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 
hour)
{code}
Default is 600 secs (5 mins). The amount of time a component is allowed to be 
below the health-threshold percent after which the application is stopped. If 
the health crosses above threshold before the window expires then this window 
is reset to 0. So, if the health goes below threshold later again, it has to be 
there for the entire window to be considered unhealthy.

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll 
frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by 
the application owner, unless the app owner knows exactly what she/he is doing.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the window to 
30 mins)
{code}
Default is 600 secs (same as default for 
*_yarn.container.health.threshold.window.secs_*). Controls the health monitor's 
behavior the exact same way as *_yarn.container.health.threshold.window.secs_* 
does, except that it comes into play only the first time when the application 
is started while it is working its way up to cross the health-threshold percent 
for the first time.

Note, the node failure blacklisting feature is implemented by SLIDER-1199.

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (example: 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
