[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

Gour Saha (JIRA) Thu, 28 Sep 2017 15:31:31 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185014#comment-16185014
 ]


Gour Saha commented on SLIDER-1246:
-----------------------------------

Thanks [~billie.rinaldi], I am incorporating all your comments now.

On this point -
{quote}
as discussed previously offline, i don't think the failure threshold should be 
automatically disabled when the health percent is enabled. but since we 
disagree on this, i am okay with having the automatic disable until someone 
expresses interest in using both features
{quote}
I thought this over and concluded that the failure threshold, which is an 
absolute value, will always get in the way of a monitor that is driven by a 
percent value. Whatever absolute value we set for the failure threshold, for a 
component with a large number of containers it can be less than the absolute 
number of containers given by (100 - health.percent)%. Hence the failure 
threshold will always win in this scenario, which is as good as not setting the 
health threshold in the first place.

Also, with flex up and flex down the health percent always scales with the 
component size, but the absolute value of the failure threshold will cease to 
make sense. It is also very difficult to document this, with a use case, so 
that app owners understand how app health is tracked when both the failure 
threshold and the health threshold are in play for the same component.

Additionally, the current failure threshold logic counts a single container 
failing multiple times (while all other n-1 containers are healthy) the same as 
multiple containers failing at the same time. This can result in the app being 
shut down although effectively n-1 containers were always running (unless it is 
saved by the blacklisting feature of the node failure threshold, when that is 
set to a value less than the failure threshold and the containers were cycling 
through the same node). The health threshold logic departs significantly from 
this: if n-1 containers are healthy and only 1 container fails multiple times, 
it is counted only once.
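
To make the difference concrete, here is a minimal Python sketch. It is not 
Slider code: the function names, the strict ">" comparisons, the rounding of 
the healthy-container minimum, and the absence of any failure time window are 
all assumptions made for illustration.

```python
# Illustrative sketch only, not Slider code. All names and comparison rules
# here are hypothetical; real Slider logic may round or window differently.

def allowed_unhealthy(desired: int, health_percent: int) -> int:
    """Max containers that may be unhealthy while health_percent is still met."""
    # Ceiling division in integer arithmetic to avoid float rounding.
    min_healthy = -(-desired * health_percent // 100)
    return desired - min_healthy


def failure_threshold_trips(failure_events: list, failure_threshold: int) -> bool:
    """Absolute rule: every failure event counts, even repeats of one container."""
    return len(failure_events) > failure_threshold


def health_percent_trips(failure_events: list, desired: int,
                         health_percent: int) -> bool:
    """Percent rule: a container that fails repeatedly is counted only once."""
    unhealthy = len(set(failure_events))
    return unhealthy > allowed_unhealthy(desired, health_percent)


# One container out of 100 fails 11 times; the other 99 stay healthy.
events = ["container_7"] * 11
absolute_verdict = failure_threshold_trips(events, failure_threshold=10)
percent_verdict = health_percent_trips(events, desired=100, health_percent=80)

# Flexing from 100 to 1000 containers: the percent budget scales with the
# component size, while a fixed failure threshold stays where it was set.
budget_small = allowed_unhealthy(100, 80)
budget_large = allowed_unhealthy(1000, 80)
```

Under these assumptions the absolute rule flags the app (11 failure events 
against a threshold of 10), while the percent rule does not (1 unhealthy 
container against a budget of 20). After a flex up to 1000 containers the 
percent budget grows to 200, but the threshold of 10 does not move.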

If you still think there is value in having both in play, I can introduce a 
boolean config which, when set to true, lets both be in play. Let me know what 
you think.

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (example: 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
