[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

Billie Rinaldi (JIRA) Fri, 29 Sep 2017 08:37:59 -0700

    [ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185999#comment-16185999 ]


Billie Rinaldi commented on SLIDER-1246:
----------------------------------------

bq. there is an advantage for the containers such that if one fails and a new 
one is allocated by Yarn (within the poll frequency) then it might not dip the 
health percent at all
This is not an advantage for an unhealthy app whose containers always fail. 
I think what you're saying is that nodes will eventually be blacklisted, and 
once enough nodes are blacklisted that the container requests can't be 
satisfied, the health percent will dip below the threshold. Let's say we have 
100 nodes and an app with 10 containers and a health threshold of 80%. We 
would need 93 nodes to be blacklisted to fall below the health threshold 
(leaving 7 nodes for the 10 containers), which would mean at least 
93*(3 + 1) = 372 container failures before the health threshold would take 
effect. Seems like a lot, but this is better than I thought it would be 
because I had forgotten to consider the blacklisting feature.
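
The arithmetic above can be checked with a small sketch. It is not Slider 
code: the function and parameter names are hypothetical, and it assumes at 
most one container per node and that a node is blacklisted on its 
(limit + 1)-th container failure, which is how the 93 and the (3 + 1) in the 
example appear to be derived.

```python
def nodes_to_blacklist(total_nodes, containers, threshold_pct):
    """Fewest blacklisted nodes that push health below threshold_pct."""
    # Containers that must be running to stay at or above the threshold
    # (integer ceiling division avoids floating-point rounding).
    min_running = (containers * threshold_pct + 99) // 100
    # Assumption: at most one container per node, so the number of running
    # containers cannot exceed the number of usable nodes. Health falls
    # below the threshold once usable nodes drop to min_running - 1.
    return total_nodes - (min_running - 1)


def failures_before_unhealthy(total_nodes, containers, threshold_pct,
                              node_failure_limit):
    """Minimum container failures before the health threshold is breached."""
    # Assumption: a node is blacklisted on failure number limit + 1.
    blacklisted = nodes_to_blacklist(total_nodes, containers, threshold_pct)
    return blacklisted * (node_failure_limit + 1)


print(nodes_to_blacklist(100, 10, 80))            # 93
print(failures_before_unhealthy(100, 10, 80, 3))  # 372
```

With 92 nodes blacklisted, 8 nodes remain and 8 of 10 containers (80%) can 
still run, so health stays at the threshold; the 93rd blacklisted node is 
what drops it to 70%.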

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch, 
> SLIDER-1246.03.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (example: 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
