[ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184609#comment-16184609 ]

Gour Saha commented on SLIDER-1246:
-----------------------------------

A sample resources JSON for Hive:
{code}
{
  "schema" : "http://example.org/specification/v2.0.0",
  "metadata" : {
  },
  "global" : {
    "yarn.log.include.patterns" : ".*\\.done"
  },
  "credentials" : {
  },
  "components" : {
    "LLAP" : {
      "yarn.role.priority" : "1",
      "yarn.component.instances" : "5",
      "yarn.memory" : "10240",
      "yarn.component.placement.policy" : "0",
      "yarn.resource.normalization.enabled" : "false",
      "yarn.container.health.threshold.percent" : "80", // 80%
      "yarn.container.health.threshold.window.secs" : "600", // acceptable to be below 80% for up to 10 mins at a stretch
      "yarn.container.health.threshold.init.delay.secs" : "400" // additional lead time of 400 secs before the threshold monitor kicks in to do its job
    },
    "slider-appmaster" : {
      "yarn.vcores" : "1",
      "yarn.component.instances" : "1",
      "yarn.memory" : "1024"
    }
  }
}
{code}

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an
> application failure.
> Observed this in HIVE-16927, where container failures on certain nodes bring
> down the entire application. Slider has to provide a way to not mark the
> application as unhealthy if a certain threshold of containers is running.
> Tuning the failure threshold is not optimal, as setting the correct default
> on a large cluster is not trivial. Beyond a certain number of failures,
> Slider should mark the node as unhealthy and report that back to the
> client/AM. The application could continue to run as long as the container
> request is partially satisfied (example: 80% of containers are running).

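
One way to read the three health threshold properties in the sample resources JSON is sketched below in Python. This is only an illustration of the semantics described in the inline comments, not Slider's actual monitor code: the `HealthThreshold` class, the `is_unhealthy` function, and the sampled `(seconds_since_start, running_containers)` input are all hypothetical names introduced here. It also assumes that "below 80% for up to 10 mins" means exactly 600 seconds below the threshold is still acceptable.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class HealthThreshold:
    """Hypothetical holder for the three properties from the sample JSON."""
    percent: int          # yarn.container.health.threshold.percent
    window_secs: int      # yarn.container.health.threshold.window.secs
    init_delay_secs: int  # yarn.container.health.threshold.init.delay.secs


def is_unhealthy(samples: Iterable[Tuple[int, int]],
                 desired: int,
                 cfg: HealthThreshold) -> bool:
    """Return True if the component stays below the threshold too long.

    samples: time-ordered (seconds_since_start, running_containers) pairs.
    desired: requested instance count (yarn.component.instances).
    """
    below_since = None  # time at which the current "below threshold" run began
    for t, running in samples:
        # Assumption: samples taken before the initial delay are ignored.
        if t < cfg.init_delay_secs:
            continue
        # Integer comparison avoids floating point: running/desired < percent/100
        if running * 100 < cfg.percent * desired:
            if below_since is None:
                below_since = t
            # Assumption: only a run strictly longer than the window is a failure.
            if t - below_since > cfg.window_secs:
                return True
        else:
            # Health recovered, so the window starts over.
            below_since = None
    return False


if __name__ == "__main__":
    cfg = HealthThreshold(percent=80, window_secs=600, init_delay_secs=400)
    # 5 LLAP instances requested; 3 running is 60%, which is below 80%.
    print(is_unhealthy([(0, 0), (400, 3), (1001, 3)], desired=5, cfg=cfg))
```

With 5 instances requested, 4 running containers are exactly 80% and count as healthy under this reading, so a single recovery to 4 resets the window.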
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)