[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578562#comment-14578562 ]
Steve Loughran commented on YARN-2005:
--------------------------------------

This is what we do for Slider [http://steveloughran.blogspot.co.uk/2015/05/dynamic-datacentre-applications.html], with SLIDER-856 containing [the failure-analysis|https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;a=commitdiff;h=f61dc2b;hp=585fc4c0a6821efa2e23e87b450a738bc5c11b5a], part of the placement rework of SLIDER-611.

It differentiates:
* known node failure events (count against node reliability)
* known app failures, i.e. limits exceeded (count against component reliability, not nodes)
* pre-emptions (don't worry about them)
* startup failures (often a symptom of TCP port conflict, localisation failure, lack of keytabs, or some other incompatibility between container and node)
* general "container exit" events (count against both node and component)

Also, it:
* resets the counters regularly
* has different failure thresholds for different components (e.g. for 30+ region servers, we have a higher threshold than for the 2 HBase masters)
* doesn't let the unreliability of one component on a node count against that node being used for requesting different components. (Mixed merit here: good for things like port conflicts, bad for other causes.)

None of this looks at AM failures. We haven't seen specific problems there to the same extent as with some containers, because YARN does the tracking, the AM doesn't have any hard-coded ports, and with one AM per app, the failure rate is much lower. Where we do have problems, they are usually immediately obvious at launch time, and almost invariably environment related.
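To make the bookkeeping described in the comment concrete, here is a minimal Java sketch of that kind of failure accounting. It is not the Slider code: the class name {{FailureTracker}}, the enum constants, the method names and the threshold values are all hypothetical, and counting startup failures against the (node, component) pair is an assumption, since the comment does not say exactly where they are counted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; names and thresholds are illustrative, not Slider's API.
class FailureTracker {
    // Categories taken from the list in the comment.
    enum FailureType { NODE_FAILURE, APP_LIMITS_EXCEEDED, PREEMPTED, STARTUP_FAILURE, CONTAINER_EXIT }

    private final Map<String, Integer> nodeWide = new HashMap<>();          // node -> failures
    private final Map<String, Integer> nodePerComponent = new HashMap<>();  // node/component -> failures
    private final Map<String, Integer> componentFailures = new HashMap<>(); // component -> failures
    private final Map<String, Integer> componentThresholds = new HashMap<>();
    private final int nodeThreshold;
    private final int defaultComponentThreshold;

    FailureTracker(int nodeThreshold, int defaultComponentThreshold) {
        this.nodeThreshold = nodeThreshold;
        this.defaultComponentThreshold = defaultComponentThreshold;
    }

    // Per-component thresholds, e.g. higher for many region servers than for two masters.
    void setComponentThreshold(String component, int threshold) {
        componentThresholds.put(component, threshold);
    }

    private static String key(String node, String component) {
        return node + "/" + component;
    }

    void record(String node, String component, FailureType type) {
        switch (type) {
            case NODE_FAILURE:        // counts against the node as a whole
                nodeWide.merge(node, 1, Integer::sum);
                break;
            case APP_LIMITS_EXCEEDED: // counts against the component, not the node
                componentFailures.merge(component, 1, Integer::sum);
                break;
            case PREEMPTED:           // ignored
                break;
            case STARTUP_FAILURE:     // assumption: counted for this component on this node only
                nodePerComponent.merge(key(node, component), 1, Integer::sum);
                break;
            case CONTAINER_EXIT:      // counts against both node (per component) and component
                nodePerComponent.merge(key(node, component), 1, Integer::sum);
                componentFailures.merge(component, 1, Integer::sum);
                break;
        }
    }

    // One component's failures on a node do not block other components there.
    boolean isNodeUsable(String node, String component) {
        int failures = nodeWide.getOrDefault(node, 0)
                + nodePerComponent.getOrDefault(key(node, component), 0);
        return failures < nodeThreshold;
    }

    boolean hasComponentFailed(String component) {
        int limit = componentThresholds.getOrDefault(component, defaultComponentThreshold);
        return componentFailures.getOrDefault(component, 0) >= limit;
    }

    // Meant to be called periodically so old failures are forgotten.
    void resetCounters() {
        nodeWide.clear();
        nodePerComponent.clear();
        componentFailures.clear();
    }

    public static void main(String[] args) {
        FailureTracker tracker = new FailureTracker(2, 3);
        tracker.record("node1", "master", FailureType.STARTUP_FAILURE);
        tracker.record("node1", "master", FailureType.STARTUP_FAILURE);
        System.out.println("master on node1 usable: " + tracker.isNodeUsable("node1", "master"));
        System.out.println("regionserver on node1 usable: " + tracker.isNodeUsable("node1", "regionserver"));
    }
}
```

Keying the node-side counts by (node, component) is what gives the "mixed merit" behaviour mentioned above: a port conflict that breaks one component on a node does not stop other components being requested there, but neither does a cause that would in fact affect every component.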
> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Anubhav Dhoot
>
> It would be nice if the RM supported blacklisting a node for an AM launch
> after the same node fails a configurable number of AM attempts. This would
> be similar to the blacklisting support for scheduling task attempts in the
> MapReduce AM but for scheduling AM attempts on the RM side.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)