[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

Billie Rinaldi (JIRA) Thu, 28 Sep 2017 12:20:31 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184698#comment-16184698
 ]


Billie Rinaldi commented on SLIDER-1246:
----------------------------------------

Hi [~gsaha], thanks for taking on this patch. Comments follow:
* Rename CONTAINER_HEALTH_THRESHOLD_DISABLED_PERCENT to 
CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED for clarity and to match the 
existing convention.
* Create a DEFAULT_CONTAINER_HEALTH_THRESHOLD_PERCENT property set to 
CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED and use it wherever a default is 
passed (some method or constructor calls that take a default use the disabled 
percent and some are hardcoded to -1).
* In scheduleHealthThresholdMonitor, the global vs. component option handling 
is unnecessary and should be removed. Slider handles this automatically, so 
you only need to retrieve the component options.
* Based on the javadocs, it looks like appMaster.queue should be used instead 
of appMaster.signalAMComplete to queue the stop action.
* In MonitorHealthThreshold, I would set currentTimestamp = now() and then 
optionally set firstOccurrenceTimestamp = currentTimestamp.
* The separation between AppMaster#scheduleHealthThresholdMonitor, 
MonitorHealthThreshold, and AppState is a bit muddy. RoleStatus and 
ProviderRole do not need to be used outside of AppState. 
AppMaster#scheduleHealthThresholdMonitor can iterate over the resource 
components instead of the role status map. MonitorHealthThreshold can store 
the component name instead of the role status. You would then just need to 
add a couple of methods like appState.isHealthThresholdMet(name) and 
appState.setHealthThresholdMonitorEnabled(name).
* As discussed previously offline, I don't think the failure threshold should 
be automatically disabled when the health percent is enabled. But since we 
disagree on this, I am okay with keeping the automatic disable until someone 
expresses interest in using both features.
* It seems like the health threshold check will not be effective unless we 
consider the age of the containers. I can imagine that an app that is 
constantly restarting containers could, by chance, meet the health threshold 
at the moment of the check. Have you tested this scenario?
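
To make the suggested separation concrete, here is a minimal, self-contained 
sketch. It is not taken from the patch: the class names AppStateSketch and 
MonitorSketch, the setter methods, the window length, and the "stop after the 
threshold has been unmet for a full window" rule are all assumptions for 
illustration. Only the constant names, isHealthThresholdMet(name), 
currentTimestamp, and firstOccurrenceTimestamp come from the comments above, 
and the -1 value for "disabled" is inferred from the hardcoded -1 mentioned 
there.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the review suggestions; not the actual Slider code. */
class HealthThresholdSketch {
  // Sentinel and default as named in the review; the value -1 is an inference.
  static final int CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED = -1;
  static final int DEFAULT_CONTAINER_HEALTH_THRESHOLD_PERCENT =
      CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED;

  /** Stand-in for AppState: per-component details stay private, lookups are by name. */
  static class AppStateSketch {
    private final Map<String, int[]> counts = new HashMap<>(); // name -> {desired, running}
    private final Map<String, Integer> thresholds = new HashMap<>();

    void setCounts(String name, int desired, int running) {
      counts.put(name, new int[] {desired, running});
    }

    void setThresholdPercent(String name, int percent) {
      thresholds.put(name, percent);
    }

    boolean isHealthThresholdMet(String name) {
      int threshold =
          thresholds.getOrDefault(name, DEFAULT_CONTAINER_HEALTH_THRESHOLD_PERCENT);
      if (threshold == CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED) {
        return true; // feature disabled: never report the component as unhealthy
      }
      int[] c = counts.get(name);
      if (c == null || c[0] == 0) {
        return true; // nothing requested, nothing to be unhealthy about
      }
      // running / desired >= threshold / 100, kept in integer arithmetic
      return c[1] * 100 >= threshold * c[0];
    }
  }

  /** Stand-in for MonitorHealthThreshold: stores the component name, not a RoleStatus. */
  static class MonitorSketch {
    private final String name;
    private final long windowMillis;            // assumed grace period
    private long firstOccurrenceTimestamp = -1; // -1 means "currently healthy"

    MonitorSketch(String name, long windowMillis) {
      this.name = name;
      this.windowMillis = windowMillis;
    }

    /** Returns true when the caller should queue the stop action. */
    boolean check(AppStateSketch appState, long now) {
      long currentTimestamp = now;
      if (appState.isHealthThresholdMet(name)) {
        firstOccurrenceTimestamp = -1; // healthy again: reset the window
        return false;
      }
      if (firstOccurrenceTimestamp < 0) {
        firstOccurrenceTimestamp = currentTimestamp; // first unhealthy observation
      }
      return currentTimestamp - firstOccurrenceTimestamp >= windowMillis;
    }
  }
}
```

Note that this sketch only counts running containers at the moment of each 
check, so it has exactly the weakness raised in the last comment: a component 
that keeps restarting containers can look healthy whenever the check happens 
to run. Accounting for container age would need extra state that is not shown 
here.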

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (example: 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
