[ 
https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202584#comment-15202584
 ] 

Wangda Tan commented on YARN-4576:
----------------------------------

Thanks [~vinodkv] for starting this discussion,

After catching up on most of the discussion in this JIRA and related JIRAs, my 
suggestions are:
1) An AM blacklist is unnecessary in my view: 
- When YARN detects *possible* failures, it should blacklist nodes *within the 
app* (from [~sjlee0]). If the AM container of an app fails on a node for 
node-specific reasons, other containers of the app could fail for the same 
reason. But we shouldn't spread the blacklist to other apps, because different 
apps have different settings. We could do that only if we were confident that 
the two apps are very similar in configuration.
- When YARN detects fatal failures, it should blacklist the node globally by 
marking it UNHEALTHY. As 
[~djp][commented|https://issues.apache.org/jira/browse/YARN-4576?focusedCommentId=15201559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15201559],
 we may need to fix one issue: if a node's state flips back and forth between 
HEALTHY and UNHEALTHY, we need to detect that and keep the node marked UNHEALTHY.
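
To make the flapping case concrete, here is a minimal sketch of how such 
detection could work. It is not existing YARN code: the class name, the method 
name, and both parameters (number of tolerated transitions, window length) are 
assumptions made purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of YARN): keeps a flapping node UNHEALTHY. */
class NodeFlapDetector {
    private final int maxTransitions;  // state changes tolerated inside the window
    private final long windowMillis;   // length of the sliding window
    private final Map<String, Deque<Long>> transitions = new HashMap<>();
    private final Map<String, Boolean> lastReported = new HashMap<>();

    NodeFlapDetector(int maxTransitions, long windowMillis) {
        this.maxTransitions = maxTransitions;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one health report and returns the state the scheduler should use:
     * false (UNHEALTHY) if the node reports unhealthy or has changed state more
     * than maxTransitions times within the window.
     */
    boolean effectiveHealthy(String node, boolean reportedHealthy, long nowMillis) {
        Deque<Long> times = transitions.computeIfAbsent(node, n -> new ArrayDeque<>());
        Boolean previous = lastReported.put(node, reportedHealthy);
        if (previous != null && previous != reportedHealthy) {
            times.addLast(nowMillis);  // a HEALTHY <-> UNHEALTHY transition
        }
        // Forget transitions that fell out of the sliding window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        boolean flapping = times.size() > maxTransitions;
        return reportedHealthy && !flapping;
    }
}
```

With a sketch like this, a node that keeps reporting HEALTHY becomes 
schedulable again only after its recent transitions age out of the window.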

2) Framework-specified container blacklists should be transparent to end users.
YARN should make the right decisions about the best places for apps, and apps 
should trust YARN's decisions. Take the UNHEALTHY status of a node: a node at 
90% disk utilization may be quite acceptable to some 
apps, but we shouldn't allow apps to say: I know it's risky, but I still want 
to schedule on these UNHEALTHY nodes.

3) Apps should be able to choose their own preferred nodes, hosts, etc.
As Junping commented:
bq. We don't really give application that freedom - where and how to launch 
application's AM container is never be application's business so far, that's 
why we call it out here - give applications the right to set their bar for AM 
launching.
We need this for AMs. I cannot find the original JIRA for the AM resource 
request, but I believe there's an open JIRA for it. I also think an AM should 
be able to add blacklisted nodes through ApplicationSubmissionContext.

> Enhancement for tracking Blacklist in AM Launching
> --------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: EnhancementAMLaunchingBlacklist.pdf
>
>
> Before YARN-2005, the YARN blacklist mechanism tracked bad nodes in the AM: 
> if the AM's attempts to launch containers on a specific node failed several 
> times, the AM would blacklist that node in future resource requests. This 
> mechanism works fine for normal containers. However, from our observations of 
> several clusters: if an AM fails to launch on a problematic node, the RM can 
> pick the same node for the next AM attempts again and again, which 
> causes application failure when the other, functional nodes are busy. Normally, 
> the customized health checker script cannot be sensitive enough to mark a 
> node as unhealthy when only one or two container launches fail. 
> After YARN-2005, each RMApp can have a BlacklistManager, so nodes 
> on which AM attempts of a specific application have failed get 
> blacklisted. To avoid the risk of all nodes being blacklisted 
> by the BlacklistManager, a disable-failure-threshold was introduced to stop 
> adding more nodes to the blacklist once a certain ratio has been reached. 
> There are already some enhancements to this AM blacklist mechanism: 
> YARN-4284 addresses the wider case of AM container launch 
> failures, and YARN-4389 tries to make the configuration settings 
> changeable per app to meet app-specific requirements. However, there are still 
> several gaps to close in order to cover more scenarios:
> 1. We may need a global blacklist instead of each app maintaining a separate 
> one. The reason is that an AM is more likely to fail where other AMs have 
> failed before. A quick example: in a busy cluster, all nodes are busy except 
> two problematic nodes, node a and node b. app1 has already been submitted and 
> has failed in two AM attempts on a and b. app2 and other apps should wait for 
> the busy nodes rather than waste attempts on these two problematic nodes.
> 2. If an AM container failure is treated as a global event rather than the 
> app's own issue, the blacklist should not be permanent but should apply 
> within a specific time window. 
> 3. We could have user-defined blacklist policies to address more possible 
> cases and scenarios, so it is reasonable to make the blacklist policy pluggable.
> 4. For some test scenarios, we could have a whitelist mechanism for AM launching.
> 5. Some minor issues: it appears that, so far, an NM reconnect does not 
> refresh the blacklist.
> Will try to address all issues here.
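
As an illustration of the description above, the following sketch combines a 
per-app AM blacklist that stops growing once a disable-failure-threshold ratio 
is reached with the time window proposed in gap 2. It is not the actual 
BlacklistManager from YARN-2005: the class, its methods, and the expiry 
parameter are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not YARN's BlacklistManager. */
class AmLaunchBlacklist {
    private final int clusterNodeCount;     // total nodes in the cluster
    private final double disableThreshold;  // max blacklisted fraction, e.g. 0.8
    private final long expiryMillis;        // assumed time window for an entry
    private final Map<String, Long> blacklistedAt = new HashMap<>();

    AmLaunchBlacklist(int clusterNodeCount, double disableThreshold, long expiryMillis) {
        this.clusterNodeCount = clusterNodeCount;
        this.disableThreshold = disableThreshold;
        this.expiryMillis = expiryMillis;
    }

    /** Drops entries whose time window has elapsed. */
    private void purgeExpired(long nowMillis) {
        blacklistedAt.values().removeIf(t -> nowMillis - t >= expiryMillis);
    }

    /**
     * Records a failed AM launch on a node. Returns false when the node was
     * not added because the blacklist already covers the threshold ratio.
     */
    boolean recordAmFailure(String node, long nowMillis) {
        purgeExpired(nowMillis);
        boolean known = blacklistedAt.containsKey(node);
        if (!known && blacklistedAt.size() + 1 > disableThreshold * clusterNodeCount) {
            return false;  // stop adding nodes so some remain schedulable
        }
        blacklistedAt.put(node, nowMillis);  // add, or refresh the timestamp
        return true;
    }

    /** Nodes the next AM attempt should avoid at the given time. */
    Set<String> currentBlacklist(long nowMillis) {
        purgeExpired(nowMillis);
        return new HashSet<>(blacklistedAt.keySet());
    }
}
```

Whether the blacklist should stop growing at the threshold, as sketched here, 
or be ignored entirely once the threshold is hit is a policy choice, which is 
one reason a pluggable policy (gap 3) seems attractive.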



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
