[jira] [Commented] (SLIDER-1246) Application health should not be affected by faulty nodes

Gour Saha (JIRA) Thu, 28 Sep 2017 11:29:45 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184609#comment-16184609
 ]


Gour Saha commented on SLIDER-1246:
-----------------------------------

A sample resources JSON for Hive:
{code}
{
  "schema" : "http://example.org/specification/v2.0.0",
  "metadata" : { },
  "global" : {
    "yarn.log.include.patterns" : ".*\\.done"
  },
  "credentials" : { },
  "components" : {
    "LLAP" : {
      "yarn.role.priority" : "1",
      "yarn.component.instances" : "5",
      "yarn.memory" : "10240",
      "yarn.component.placement.policy" : "0",
      "yarn.resource.normalization.enabled" : "false",
      "yarn.container.health.threshold.percent" : "80", // 80%
      "yarn.container.health.threshold.window.secs" : "600", // acceptable to be below 80% for up to 10 mins at a stretch
      "yarn.container.health.threshold.init.delay.secs" : "400" // additional lead time of 400 secs before the threshold monitor kicks in to do its job
    },
    "slider-appmaster" : {
      "yarn.vcores" : "1",
      "yarn.component.instances" : "1",
      "yarn.memory" : "1024"
    }
  }
}
{code}
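
To make the intended semantics of the three threshold properties concrete, here is a minimal sketch in Python. It is not Slider's implementation: the class name `HealthThresholdMonitor`, its `check` method, and the assumption that the init delay is measured from application start are all hypothetical, chosen only to illustrate the comments in the sample above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthThresholdMonitor:
    """Hypothetical illustration only; not a Slider class."""
    desired_instances: int      # yarn.component.instances
    threshold_percent: float    # yarn.container.health.threshold.percent
    window_secs: float          # yarn.container.health.threshold.window.secs
    init_delay_secs: float      # yarn.container.health.threshold.init.delay.secs
    below_since: Optional[float] = None  # when health first dropped below the threshold

    def check(self, now_secs: float, healthy_instances: int) -> bool:
        """Return True while the component is still considered acceptable.

        now_secs is assumed to be seconds since application start.
        """
        if now_secs < self.init_delay_secs:
            return True  # lead time: the monitor has not kicked in yet
        percent = 100.0 * healthy_instances / self.desired_instances
        if percent >= self.threshold_percent:
            self.below_since = None  # recovered, so the window resets
            return True
        if self.below_since is None:
            self.below_since = now_secs  # start of a below-threshold stretch
        # Acceptable only while the stretch is no longer than the window.
        return (now_secs - self.below_since) <= self.window_secs


# Values taken from the LLAP component in the sample above.
llap = HealthThresholdMonitor(5, 80.0, 600.0, 400.0)
during_delay = llap.check(100, 2)    # inside the 400 sec lead time
at_threshold = llap.check(500, 4)    # 4 of 5 is exactly 80%
window_start = llap.check(600, 3)    # 60%: below-threshold stretch begins
window_edge = llap.check(1200, 3)    # 600 secs below: still within the window
window_over = llap.check(1201, 3)    # stretch now exceeds the window
```

With these values, only the last call reports the component as unacceptable; a return to 4 or more healthy instances at any point would reset the window.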

> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
>                 Key: SLIDER-1246
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1246
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster, core
>    Affects Versions: Slider 0.92
>            Reporter: Prasanth Jayachandran
>            Assignee: Gour Saha
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an 
> application failure. 
> Observed this in HIVE-16927, where container failures on certain nodes bring 
> down the entire application. Slider has to provide a way to not mark the 
> application as unhealthy if a certain threshold of containers is running. 
> Tuning the failure threshold is not optimal, as setting the correct default 
> on a large cluster is not trivial. Beyond a certain number of failures, 
> Slider should mark the node as unhealthy and report that back to the 
> client/AM. The application could continue to run as long as the container 
> request is partially satisfied (example: 80% of containers are running).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
