[ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185014#comment-16185014 ]
Gour Saha commented on SLIDER-1246:
-----------------------------------

Thanks [~billie.rinaldi]. I am incorporating all your comments now.

On this point -
{quote}
as discussed previously offline, i don't think the failure threshold should be automatically disabled when the health percent is enabled. but since we disagree on this, i am okay with having the automatic disable until someone expresses interest in using both features
{quote}

I thought this over and concluded that the failure threshold, which is an absolute value, will always get in the way of a monitor that is driven by a percent value. Whatever absolute value we set for the failure threshold, for a component with a high number of containers it can be less than the absolute number of containers given by (100 - health.percent)%. In that scenario the failure threshold always wins, which is as good as not setting the health threshold in the first place.

Also, with flex up and flex down, health percent scales with the component size, but the absolute value of the failure threshold ceases to make sense. It is also very difficult to document, and to provide a use case for, how app health is tracked when both failure threshold and health threshold are in play for the same component.

Additionally, the current failure threshold logic counts a single container failing multiple times (while all other n-1 containers are healthy) the same as multiple containers failing at the same time. This can result in the app being shut down even though n-1 containers were effectively always running (unless it is saved by the blacklisting feature of the node failure threshold, when that is set to a value less than the failure threshold and the containers were cycling through the same node). The health threshold logic is a significant departure from this: if n-1 containers are healthy and only 1 container fails multiple times, it is counted only once.
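
To make the two points above concrete, here is a rough sketch. It is not Slider's code: the function names are hypothetical, the failure threshold is assumed to trip when the failure count exceeds it, and the health check is reduced to "each distinct failed container counts once".

```python
def allowed_unhealthy(num_containers: int, health_percent: int) -> int:
    """Containers that may be unhealthy while still meeting health_percent.

    This is the absolute count given by (100 - health.percent)%.
    """
    return num_containers * (100 - health_percent) // 100


def failure_threshold_tripped(failed_container_ids, failure_threshold: int) -> bool:
    """Assumed failure-threshold check: every failure event counts,
    including repeated failures of the same container."""
    return len(failed_container_ids) > failure_threshold


def health_percent_tripped(failed_container_ids, num_containers: int,
                           health_percent: int) -> bool:
    """Simplified health-percent check: a container that failed
    several times is counted as one unhealthy container."""
    unhealthy = len(set(failed_container_ids))
    healthy = num_containers - unhealthy
    # Compare in integers: healthy / num_containers < health_percent / 100
    return healthy * 100 < health_percent * num_containers


if __name__ == "__main__":
    # Large component: health percent 90 tolerates 100 unhealthy containers,
    # so an absolute failure threshold of 50 trips long before that.
    print(allowed_unhealthy(1000, 90))
    # After a flex down to 10 containers the same percent tolerates only 1.
    print(allowed_unhealthy(10, 90))

    # One container ("c1") fails six times; the other nine stay healthy.
    events = ["c1"] * 6
    print(failure_threshold_tripped(events, 5))
    print(health_percent_tripped(events, 10, 80))
```

With a failure threshold of 5 and a health percent of 80 on a 10-container component, the repeated failures of one container trip the failure threshold in this sketch, while the health-percent check still sees 9 of 10 containers healthy.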

If you still think there is value in having both in play, I can introduce a boolean config which, when set to true, will let both be in play. Let me know what you think.

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an
> application failure.
> Observed this in HIVE-16927, where container failures in certain nodes brings
> down entire application. Slider has to provide a way to not mark application
> as unhealthy if certain threshold of containers are running. Tuning failure
> threshold is not optimal as setting the correct default on large cluster is
> not trivial. Beyond certain failures, slider should mark the node as
> unhealthy and report that back to client/AM. Application could continue to
> run as long as container request is satisfied partially (example: 80%
> containers are running).