[jira] [Comment Edited] (SLIDER-1246) Application health should not be affected by faulty nodes

Gour Saha (JIRA) Thu, 28 Sep 2017 10:05:23 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184378#comment-16184378
 ]


Gour Saha edited comment on SLIDER-1246 at 9/28/17 4:58 PM:
------------------------------------------------------------

Four new resources config properties have been introduced in the patch, which 
should provide the health-threshold control required for this feature:
{code}
yarn.container.health.threshold.percent
e.g. 
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 
100 containers for a component, then 80 or more running containers will deem 
the component as healthy)
{code}
There is no default, so it must be set explicitly in the resources file to 
enable the health monitor. It can be defined at the global level to enable 
monitors for all components, in which case the same percent applies to every 
component. It can also be defined at the component level to override the 
global value for a specific component. Note that if the health monitor is 
enabled for a component, the failure threshold is automatically disabled for 
that component, so any value set for *_yarn.container.failure.threshold_* will 
be ignored for that component. If the health threshold is set for one component 
and the failure threshold for another, the two compete in determining whether 
the app is unhealthy. Whichever trips first brings the app down, so app owners 
need to understand the behavior of these competing properties and set them 
appropriately.
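
The override and health-check rules described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the patch: the function names and the dict-based representation of the global and component sections are assumptions, as is the treatment of a component with no requested containers.

```python
from typing import Mapping, Optional

HEALTH_KEY = "yarn.container.health.threshold.percent"


def effective_health_percent(
    global_opts: Mapping[str, str], component_opts: Mapping[str, str]
) -> Optional[int]:
    """Component-level value overrides the global one; None means the monitor is off."""
    raw = component_opts.get(HEALTH_KEY, global_opts.get(HEALTH_KEY))
    return int(raw) if raw is not None else None


def is_component_healthy(running: int, desired: int, percent: int) -> bool:
    """Healthy when running containers are at least `percent` of the desired count."""
    if desired <= 0:
        return True  # assumption: nothing requested means nothing can be unhealthy
    return running * 100 >= percent * desired


def failure_threshold_applies(
    global_opts: Mapping[str, str], component_opts: Mapping[str, str]
) -> bool:
    """The failure threshold is ignored wherever the health monitor is enabled."""
    return effective_health_percent(global_opts, component_opts) is None
```

With a global value of "80", `is_component_healthy(80, 100, 80)` is true and `is_component_healthy(79, 100, 80)` is false, which matches the 100-container example given for the property.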

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 
hour)
{code}
Default is 600 secs (10 mins). This is the amount of time a component is 
allowed to stay below the health-threshold percent before the application is 
stopped. If the health rises back above the threshold before the window 
expires, the window is reset to 0. So, if the health later drops below the 
threshold again, it has to stay there for the entire window to be considered 
unhealthy.
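
The reset behavior of the window can be illustrated with a small sketch. The class below is hypothetical and not taken from the patch; in particular, whether the application is stopped exactly at the window boundary or only after it (`>=` versus `>`) is an assumption here.

```python
from typing import Optional


class HealthWindow:
    """Tracks how long a component has stayed below its health threshold."""

    def __init__(self, window_secs: int = 600) -> None:
        self.window_secs = window_secs
        self.below_since: Optional[float] = None  # None while the component is healthy

    def poll(self, healthy: bool, now: float) -> bool:
        """Record one poll result; return True when the application should be stopped."""
        if healthy:
            self.below_since = None  # recovery resets the window
            return False
        if self.below_since is None:
            self.below_since = now  # first poll below the threshold starts the window
        return now - self.below_since >= self.window_secs
```

A component that dips below the threshold for 590 secs, recovers, and then dips again gets a fresh 600-sec window for the second dip rather than being stopped 10 secs later.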

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll 
frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by 
the application owner, unless the app owner knows exactly what she/he is doing.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the init 
delay to 30 mins)
{code}
Default is 600 secs (same as the default for 
*_yarn.container.health.threshold.window.secs_*). It controls the health 
monitor's behavior in exactly the same way as 
*_yarn.container.health.threshold.window.secs_* does, except that it applies 
only once, right after the application is started, while the application is 
working its way up to cross the health-threshold percent for the first time.
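
One way to picture the relationship between the two settings is as a choice of which grace period is in force. The helper below is a hypothetical sketch, not the patch's implementation; the `reached_threshold_once` flag is an assumed piece of monitor state.

```python
def active_window_secs(
    reached_threshold_once: bool,
    window_secs: int = 600,
    init_delay_secs: int = 600,
) -> int:
    """Return the grace period that applies to the current below-threshold spell."""
    # Until the component has crossed the threshold for the first time after
    # application start, the init delay applies; afterwards the regular window does.
    return window_secs if reached_threshold_once else init_delay_secs
```

So with a window of 3600 secs and an init delay of 1800 secs, a freshly started application gets 30 mins to reach the threshold, and any later dip is given the full hour.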

Note, the node failure blacklisting feature is implemented by SLIDER-1199.


was (Author: gsaha):
4 new config properties have been introduced in the patch which should provide 
the health threshold control required for this feature -
{code}
yarn.container.health.threshold.percent
e.g. 
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are 
100 containers for a component, then 80 or more running containers will deem 
the component as healthy)
{code}
There is no default, so needs to be explicitly set in resources file to enable 
health monitor. It can be defined at the global level to enable monitor for all 
components. When defined at global level the same percent is applicable for all 
components. It can be defined at component level also to override the global 
value for a specific component. Note, if health monitor is enabled for a 
component then failure threshold is disabled for that component. So, if a value 
is set for yarn.container.failure.threshold it will be ignored.

{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1 
hour)
{code}
Default is 600 secs (10 mins). The amount of time a component is allowed to be 
below the health threshold percent after which the application is stopped. If 
the health crosses above threshold before the window expires then this window 
is reset to 0. So, if the health goes below threshold later again, it has to be 
there for the entire window to be considered unhealthy.

{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll 
frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by 
the application owner.

{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the window to 
30 mins)
{code}
Default is 600 secs (same as default for 
yarn.container.health.threshold.window.secs). Controls the health monitor's 
behavior the exact same way as yarn.container.health.threshold.window.secs 
does, except that it comes into play only the first time when the application 
is started.

Note, the node failure blacklisting feature is implemented by SLIDER-1199.

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (for example, 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
