[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

Billie Rinaldi (JIRA) Fri, 29 Sep 2017 11:25:19 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186187#comment-16186187
 ]


Billie Rinaldi commented on SLIDER-1246:
----------------------------------------

> it seems like the health threshold check will not be effective unless we
> consider the age of the containers

After further contemplation of the feature, it seems that the effective failure
condition for apps under this implementation is (# of non-blacklisted nodes <
health fraction * desired containers) holding for longer than the health
window. IMO this is not ideal: the condition can never be true without
blacklisting, and "less than 80% of containers healthy" is a much more
understandable criterion. This could be solved by requiring a container to
have been running for a minimum length of time before it is counted as healthy.
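
To make the suggestion concrete, here is a rough sketch of a container-age
based check. It is not what the attached patches implement, and every name in
it (Container, healthy_count, below_threshold, HealthWindow, min_age_secs) is
hypothetical rather than part of Slider's API. It only illustrates the idea:
count a container as healthy once it has been running long enough, and fail
the app only if the healthy count stays below the threshold for longer than
the health window.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Container:
    """Hypothetical container record; not a Slider class."""
    start_time: float  # seconds, on the same clock as `now` below


def healthy_count(containers: Iterable[Container], now: float,
                  min_age_secs: float) -> int:
    """Count containers that have been running for at least min_age_secs."""
    return sum(1 for c in containers if now - c.start_time >= min_age_secs)


def below_threshold(containers: Iterable[Container], desired: int,
                    health_fraction: float, now: float,
                    min_age_secs: float) -> bool:
    """True when fewer than health_fraction * desired containers are healthy."""
    return healthy_count(containers, now, min_age_secs) < health_fraction * desired


class HealthWindow:
    """Tracks how long the app has been continuously below the threshold."""

    def __init__(self, window_secs: float) -> None:
        self.window_secs = window_secs
        self.below_since: Optional[float] = None

    def update(self, below: bool, now: float) -> bool:
        """Record one observation; return True if the app should be failed."""
        if not below:
            # Any healthy observation resets the window.
            self.below_since = None
            return False
        if self.below_since is None:
            self.below_since = now
        return now - self.below_since > self.window_secs
```

With this shape, containers that keep crashing and restarting never reach
min_age_secs, so they never count as healthy, and the "less than 80% of
containers healthy" criterion can become true even when no node has been
blacklisted.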

However, this implementation does meet my basic requirement that it will 
eventually kill an app whose containers are constantly failing. I am okay with 
us committing the feature without a health condition for containers (once the 
other issues are addressed).

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch, 
> SLIDER-1246.03.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an
> application failure.
> Observed this in HIVE-16927, where container failures on certain nodes bring
> down the entire application. Slider has to provide a way to not mark the
> application as unhealthy if a certain threshold of containers is running.
> Tuning the failure threshold is not optimal, as setting the correct default
> on a large cluster is not trivial. Beyond a certain number of failures,
> Slider should mark the node as unhealthy and report that back to the
> client/AM. The application could continue to run as long as the container
> request is partially satisfied (for example, 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
