[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

Steve Loughran (JIRA) Tue, 09 Jun 2015 01:26:37 -0700

    [ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578562#comment-14578562 ]


Steve Loughran commented on YARN-2005:
--------------------------------------

This is what we do for Slider 
[http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html],
 with SLIDER-856 containing [the 
failure-analysis|https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;a=commitdiff;h=f61dc2b;hp=585fc4c0a6821efa2e23e87b450a738bc5c11b5a],
 which is part of the placement rework of SLIDER-611.

It differentiates between:
* known node failure events (count against node reliability)
* known app failures (limits exceeded) (count against component reliability, 
not nodes)
* pre-emption events (ignored)
* startup failures (often a symptom of a TCP port conflict, localisation 
failure, lack of keytabs, or some other incompatibility between container and 
node)
* general "container exit" events (count against node and component)
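
As a rough illustration of the classification above, here is a minimal Java 
sketch. All names (FailureClassifier, Outcome, Target, countsAgainst) are 
hypothetical and are not Slider's actual classes. The list does not say what 
startup failures count against; treating them as node failures is an 
assumption made only for this example.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only: hypothetical names, not Slider's real implementation. */
final class FailureClassifier {

    /** The five kinds of container outcome that are told apart. */
    enum Outcome { NODE_FAILURE, APP_LIMITS_EXCEEDED, PREEMPTED, STARTUP_FAILURE, GENERAL_EXIT }

    /** What a failure is charged to. */
    enum Target { NODE, COMPONENT }

    /** Returns which reliability counters an outcome should increment. */
    static Set<Target> countsAgainst(Outcome outcome) {
        switch (outcome) {
            case NODE_FAILURE:
                return EnumSet.of(Target.NODE);
            case APP_LIMITS_EXCEEDED:
                // The app broke its own limits, so the node is not to blame.
                return EnumSet.of(Target.COMPONENT);
            case PREEMPTED:
                // Pre-emption is not a failure of either party.
                return EnumSet.noneOf(Target.class);
            case STARTUP_FAILURE:
                // ASSUMPTION: charged to the node, since the usual causes are
                // container/node incompatibilities. The source does not specify this.
                return EnumSet.of(Target.NODE);
            case GENERAL_EXIT:
                return EnumSet.of(Target.NODE, Target.COMPONENT);
            default:
                throw new IllegalArgumentException("Unknown outcome: " + outcome);
        }
    }
}
```

Keeping the mapping in one place makes it easy to change a single policy, 
such as how startup failures are charged, without touching the counters.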

Also, it
* resets the counters regularly.
* has different failure thresholds for different components (e.g. for 30+ 
region servers, we have a higher threshold than for the 2 HBase masters).
* doesn't let the unreliability of one component on a node count against that 
node being used for requesting different components on it. (Mixed merit here: 
good for things like port conflicts, bad for other causes.)
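
The three points above (periodic reset, per-component thresholds, and 
per-node, per-component counting) could be sketched roughly as follows. This 
is a simplified illustration with hypothetical names (NodeFailureTracker, 
recordFailure, isUsable, resetCounters), not Slider's code. It assumes a node 
stays usable for a component while its failure count is below that 
component's threshold, and that the caller invokes resetCounters() on some 
regular schedule.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: hypothetical names, not Slider's real implementation. */
final class NodeFailureTracker {
    // component name -> failure threshold for that component
    private final Map<String, Integer> thresholds = new HashMap<>();
    // "node/component" -> failures seen since the last reset
    private final Map<String, Integer> failures = new HashMap<>();
    private final int defaultThreshold;

    NodeFailureTracker(int defaultThreshold) {
        this.defaultThreshold = defaultThreshold;
    }

    /** Overrides the default threshold for one component. */
    void setThreshold(String component, int threshold) {
        thresholds.put(component, threshold);
    }

    /** Counts a failure against this (node, component) pair only. */
    void recordFailure(String node, String component) {
        failures.merge(key(node, component), 1, Integer::sum);
    }

    /** ASSUMPTION: usable while the count is strictly below the threshold. */
    boolean isUsable(String node, String component) {
        int count = failures.getOrDefault(key(node, component), 0);
        return count < thresholds.getOrDefault(component, defaultThreshold);
    }

    /** Intended to be called regularly, e.g. from a timer owned by the caller. */
    void resetCounters() {
        failures.clear();
    }

    private static String key(String node, String component) {
        return node + "/" + component;
    }
}
```

Because counts are keyed by node and component together, a node that keeps 
failing one component (say, through a port conflict) is still offered to 
other components, which is the trade-off noted in the last bullet.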

None of this looks at AM failures. We haven't seen specific problems there to 
the same extent as with some containers, because YARN does the tracking, the 
AM doesn't have any hard-coded ports, and with one AM per app, the failure 
rate is much lower. Where we do have problems, they are usually immediately 
obvious at launch time, and almost invariably environment-related.

> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Anubhav Dhoot
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
