[ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184698#comment-16184698 ]
Billie Rinaldi commented on SLIDER-1246:
----------------------------------------

Hi [~gsaha], thanks for taking on this patch. Comments follow:
* Rename CONTAINER_HEALTH_THRESHOLD_DISABLED_PERCENT to CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED for clarity and to match the existing convention.
* Create a DEFAULT_CONTAINER_HEALTH_THRESHOLD_PERCENT property that is set to CONTAINER_HEALTH_THRESHOLD_PERCENT_DISABLED, and use it consistently (some method or constructor calls that take a default use the disabled percent and some are hardcoded to -1).
* In scheduleHealthThresholdMonitor, the global vs. component option handling is unnecessary and should be removed. Slider handles this automatically, so you only need to retrieve the component options.
* Based on the javadocs, it looks like appMaster.queue should be used instead of appMaster.signalAMComplete to queue the stop action.
* In MonitorHealthThreshold, I would set currentTimestamp = now() and then optionally set firstOccurrenceTimestamp = currentTimestamp.
* The separation between AppMaster#scheduleHealthThresholdMonitor, MonitorHealthThreshold, and AppState is a bit muddy. RoleStatus and ProviderRole do not need to be used outside of AppState. AppMaster#scheduleHealthThresholdMonitor can iterate over the resource components instead of the role status map. In MonitorHealthThreshold you can store the name instead of the role status. You would then just need to add a couple of methods like appState.isHealthThresholdMet(name) and appState.setHealthThresholdMonitorEnabled(name).
* As discussed previously offline, I don't think the failure threshold should be automatically disabled when the health percent is enabled. But since we disagree on this, I am okay with having the automatic disable until someone expresses interest in using both features.
* It seems like the health threshold check will not be effective unless we consider the age of the containers. I can imagine that an app that is constantly restarting containers would, by chance, be able to meet the health threshold. Have you tested this scenario?

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an
> application failure.
> Observed this in HIVE-16927, where container failures on certain nodes bring
> down the entire application. Slider has to provide a way to not mark the
> application as unhealthy if a certain threshold of containers is running.
> Tuning the failure threshold is not optimal, as setting the correct default
> on a large cluster is not trivial. Beyond a certain number of failures,
> Slider should mark the node as unhealthy and report that back to the
> client/AM. The application could continue to run as long as the container
> request is partially satisfied (example: 80% of containers are running).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
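
For illustration only, the two review suggestions about MonitorHealthThreshold (take the current timestamp once and reuse it, and hold a component name rather than a RoleStatus) could be sketched roughly as below. This is not code from the SLIDER-1246 patch: the `HealthView` interface, the `HealthThresholdMonitorSketch` class, the `windowMillis` parameter, and the `windowExceeded` method are hypothetical names chosen for the example. Only `isHealthThresholdMet(name)`, `currentTimestamp`, and `firstOccurrenceTimestamp` come from the comment above, and their exact signatures in Slider are assumed.

```java
/** Hypothetical stand-in for the name-based query suggested for AppState. */
interface HealthView {
    boolean isHealthThresholdMet(String componentName);
}

/** Illustrative monitor; an assumption-laden sketch, not the patch's class. */
class HealthThresholdMonitorSketch {
    private static final long NOT_SET = -1L;

    private final String componentName; // the name is stored, not a RoleStatus
    private final HealthView appState;
    private final long windowMillis;    // assumed grace window before acting
    private long firstOccurrenceTimestamp = NOT_SET;

    HealthThresholdMonitorSketch(String componentName, HealthView appState,
                                 long windowMillis) {
        this.componentName = componentName;
        this.appState = appState;
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true once the component has stayed below its health threshold
     * for at least windowMillis. The caller passes in the current time.
     */
    boolean windowExceeded(long now) {
        long currentTimestamp = now; // read the clock once per check
        if (appState.isHealthThresholdMet(componentName)) {
            firstOccurrenceTimestamp = NOT_SET; // healthy again, reset
            return false;
        }
        if (firstOccurrenceTimestamp == NOT_SET) {
            // first unhealthy reading: reuse the same timestamp
            firstOccurrenceTimestamp = currentTimestamp;
        }
        return currentTimestamp - firstOccurrenceTimestamp >= windowMillis;
    }
}
```

In this sketch a caller would invoke `windowExceeded(System.currentTimeMillis())` on each monitor tick and queue the stop action when it returns true. The sketch does not address the container-age concern raised in the last review point.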