[jira] [Updated] (SLIDER-1246) Application health should not be affected by faulty nodes

Gour Saha (JIRA) Fri, 29 Sep 2017 03:02:34 -0700

     [ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Gour Saha updated SLIDER-1246:
------------------------------
    Attachment: SLIDER-1246.03.patch

[~billie.rinaldi], I uploaded the 03 patch incorporating all your comments. Let 
me know if I missed any.

Regarding this point:
{quote}
it seems like the health threshold check will not be effective unless we 
consider the age of the containers. i can imagine that an app that is 
restarting containers constantly would by chance be able to meet the health 
threshold. have you tested this scenario?
{quote}
The health-threshold feature works in the containers' favor: if a container 
fails and Yarn allocates a replacement within one poll interval, the health 
percent might not dip at all. Note that the actual install and start of the app 
process can take longer than the allocation. On this premise, even if multiple 
containers are restarting constantly, the health percent might not drop, as 
long as replacement containers come up to take their place.
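
A minimal sketch of why a restart between two polls can go unnoticed, assuming 
the health percent is simply allocated containers over desired containers, 
sampled once per poll. The function names, the timeline and the 30-second poll 
frequency are hypothetical illustrations, not Slider's actual code or defaults.

```python
def health_percent(allocated: int, desired: int) -> float:
    """Percent of desired containers that are allocated at the moment of a poll."""
    if desired <= 0:
        return 100.0
    return 100.0 * allocated / desired


# Hypothetical timeline of (seconds, allocated containers) for 10 desired containers.
# One container fails at t=12 and Yarn allocates its replacement at t=18.
DESIRED = 10
POLL_FREQUENCY_SECS = 30
events = [(0, 10), (12, 9), (18, 10)]


def allocated_at(t: int) -> int:
    """Allocated count in effect at time t (the last event at or before t)."""
    return [count for when, count in events if when <= t][-1]


# Polls happen at t=0, 30 and 60, so the dip between t=12 and t=18 is never sampled.
samples = [
    health_percent(allocated_at(t), DESIRED)
    for t in range(0, 61, POLL_FREQUENCY_SECS)
]
```

Under these assumptions every sample reads full health, even though a container 
was lost and replaced in between.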

Typically, it is only when a node failure threshold is reached and Yarn cannot 
allocate a container on any other node that the health percent starts to drop. 
App owners can set the threshold-window so that they have sufficient time to 
act when they are alerted that their app is below the health threshold. Even if 
the health falls below the threshold, it needs to stay there for the full 
threshold-window before the app is brought down. I tested this scenario and the 
behavior is as described above. Did I misunderstand your scenario?
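
The window behavior described above can be sketched roughly as follows. This is 
an illustration under stated assumptions, not the code in the patch: the class 
name, field names and the choice to treat a health percent equal to the 
threshold as healthy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthMonitor:
    """Hypothetical monitor: stop the app only after a sustained dip."""

    threshold_percent: float
    window_secs: float
    below_since: Optional[float] = None  # time of the first poll found below threshold

    def poll(self, now: float, health_percent: float) -> bool:
        """Record one poll; return True if the app should be brought down."""
        if health_percent >= self.threshold_percent:
            self.below_since = None  # recovered, so the window starts over
            return False
        if self.below_since is None:
            self.below_since = now
        return now - self.below_since >= self.window_secs


# Health stays at 70% from t=30 onward with an 80% threshold and a 60s window.
sustained = HealthMonitor(threshold_percent=80.0, window_secs=60.0)
sustained_decisions = [
    sustained.poll(t, h) for t, h in [(0, 100.0), (30, 70.0), (60, 70.0), (90, 70.0)]
]

# Health dips, recovers at t=30, then dips again: the window restarts at t=60.
flapping = HealthMonitor(threshold_percent=80.0, window_secs=60.0)
flapping_decisions = [
    flapping.poll(t, h) for t, h in [(0, 70.0), (30, 90.0), (60, 70.0), (90, 70.0)]
]
```

In this sketch only the sustained dip triggers a stop (at the t=90 poll), while 
the dip that recovers in between never does.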

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch, 
> SLIDER-1246.03.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (example: 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
