[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578577#comment-14578577 ]
Steve Loughran commented on YARN-2005:
--------------------------------------

Don't do it yet, but plan for a future version to add liveness probes, which is what we're adding to Slider soon.

The AM already registers its IPC and HTTP ports; if the AM could also register a health URL, such as the Codahale /healthy URL, then something near the RM could decide when the AM had failed.

For that we need:
* URLs to be provided at AM registration, or updated later.
* Something to do the liveness checks. The RM is overloaded on a big cluster, but a little YARN service that could be launched standalone or embedded would be enough. I have all the code for liveness probes (basic TCP, HTTP GETs & status, with a launch track policy: you are given time to start, but once a probe is up, it must stay up). Of course, it'd need to run on an RM node so that the redirect logic doesn't bounce it through the RM proxy.
* AMs to provide simple health URLs which return 200 if happy and an HTTP error code on failure.

> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Anubhav Dhoot
>
> It would be nice if the RM supported blacklisting a node for an AM launch
> after the same node fails a configurable number of AM attempts. This would
> be similar to the blacklisting support for scheduling task attempts in the
> MapReduce AM but for scheduling AM attempts on the RM side.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
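
For illustration only, the launch track policy mentioned in the comment above (time allowed to start, but once a probe is up it must stay up) might look roughly like the sketch below. This is not the Slider code and not a YARN API. The names `Health`, `http_probe` and `LaunchTrackMonitor` are hypothetical, and treating a failure as terminal is an assumption of this sketch.

```python
import time
import urllib.error
import urllib.request
from enum import Enum


class Health(Enum):
    STARTING = "starting"  # never seen up, still inside the startup grace period
    LIVE = "live"          # the most recent probe succeeded
    FAILED = "failed"      # never came up in time, or went down after being up


def http_probe(url, timeout=5.0):
    """Return True if a GET on url answers 200; False on any error or other status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP error codes, refused connections, timeouts and malformed URLs
        # are all treated as "not healthy".
        return False


class LaunchTrackMonitor:
    """Hypothetical monitor applying the launch track policy to one probe.

    probe: zero-argument callable returning True when the target is healthy.
    startup_grace_secs: how long the target may stay down before its first success.
    clock: injectable time source, so the policy can be tested without sleeping.
    """

    def __init__(self, probe, startup_grace_secs, clock=time.monotonic):
        self._probe = probe
        self._clock = clock
        self._deadline = clock() + startup_grace_secs
        self._seen_up = False
        self.state = Health.STARTING

    def check(self):
        """Run the probe once and return the updated state."""
        if self.state is Health.FAILED:
            return self.state  # assumption: failure is terminal in this sketch
        if self._probe():
            self._seen_up = True
            self.state = Health.LIVE
        elif self._seen_up or self._clock() >= self._deadline:
            # Either it was up and has now gone down, or it never came up in time.
            self.state = Health.FAILED
        return self.state
```

A caller would build the probe from whatever health URL the AM registered, for example `LaunchTrackMonitor(lambda: http_probe(health_url), startup_grace_secs=60)`, where `health_url` is a placeholder, and call `check()` on a timer.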