[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295171#comment-14295171
 ] 

Jason Lowe commented on YARN-2005:
----------------------------------

bq.  App name is the first point came in to my thoughts.

The problem with app name in the workflow spamming case is that many workflows 
I've seen use a different app name each time they submit, since the app name 
often includes some timestamp indicating which data window it's 
consuming/producing.  If the workflow is retrying the same failed apps then the 
app name may not be changing, but if it's plowing ahead submitting other jobs 
then it very likely is changing.

bq. If an app from "user1" with name "job2" fails on node1, it is very much 
appropriate to try its second attempt in a different node.

Totally agree.  I think it's worthwhile to implement a relatively simple 
app-specific blacklisting logic to avoid this fairly common scenario.  We can 
then follow that up with a much more sophisticated blacklisting algorithm 
with weighting, time decays, etc., but the biggest problem we're seeing 
probably doesn't need anything that fancy to solve 80% of the cases we see.
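
As a rough illustration of what the simple app-specific logic could look like 
(a sketch only: the class and method names below are hypothetical and are not 
existing YARN APIs), the RM could count failed AM attempts per node for each 
application and skip a node once a configurable threshold is reached.  Keying 
on the application ID rather than the app name sidesteps the changing-name 
problem for retries of the same app:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the YARN code base.
class AmNodeBlacklist {
  // Assumed config knob: AM failures on one node before that node is
  // skipped for further AM attempts of the same application.
  private final int maxAmFailuresPerNode;

  // appId -> (node -> number of failed AM attempts on that node)
  private final Map<String, Map<String, Integer>> failures = new HashMap<>();

  AmNodeBlacklist(int maxAmFailuresPerNode) {
    if (maxAmFailuresPerNode < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.maxAmFailuresPerNode = maxAmFailuresPerNode;
  }

  // Record that an AM attempt of appId failed on the given node.
  synchronized void recordAmFailure(String appId, String node) {
    failures.computeIfAbsent(appId, k -> new HashMap<>())
        .merge(node, 1, Integer::sum);
  }

  // True if the node should be avoided for the next AM attempt of appId.
  synchronized boolean isBlacklisted(String appId, String node) {
    Map<String, Integer> perNode = failures.get(appId);
    return perNode != null
        && perNode.getOrDefault(node, 0) >= maxAmFailuresPerNode;
  }

  // All nodes currently blacklisted for AM attempts of appId.
  synchronized Set<String> blacklistedNodes(String appId) {
    Map<String, Integer> perNode = failures.get(appId);
    if (perNode == null) {
      return Collections.emptySet();
    }
    Set<String> result = new HashSet<>();
    for (Map.Entry<String, Integer> e : perNode.entrySet()) {
      if (e.getValue() >= maxAmFailuresPerNode) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  // Drop all state for an application once it has completed.
  synchronized void appFinished(String appId) {
    failures.remove(appId);
  }
}
```

With a threshold of 1, a failed first attempt of an app on node1 would steer 
its second attempt to a different node, while other apps could still use 
node1.  This deliberately does nothing for the cross-app workflow spamming 
case, which would need the more sophisticated follow-up.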

bq. I feel i could jot down few points and share as a doc for same

Sounds good, feel free to post one.

> Blacklisting support for scheduling AMs
> ---------------------------------------
>
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
