[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130272#comment-15130272
 ] 

Junping Du commented on YARN-4635:
----------------------------------

Thanks [~jianhe] for the review and comments.
First, I would like to state an assumption: the blacklist mechanism for AM 
launching is not for tracking nodes that do not work at all (unhealthy), but for 
tracking nodes that are suspected of failing the AM container based on previous 
failures. We already have the unhealthy-report mechanism for reporting serious 
issues on an NM, so this is a separate mechanism with a higher bar (in some 
sense, the AM container is more important than other containers), driven by 
failure history. 
My responses below are based on this assumption.
bq. why should below container exit status back list the node ?
These container failures could be due to resource congestion (like 
KILLED_EXCEEDED_PMEM) or an unknown reason (ABORTED, INVALID), either of which 
makes this NM more suspect than normal nodes.

bq. For DISKS_FAILED which is considered as global blacklist node in this jira, 
I think in this case, the node will report as unhealthy and RM should remove 
the node already.
Some DISKS_FAILED failures can happen because the failed container filled up a 
disk. The node may still have other directories available, so it can still 
launch normal containers, but it is not suitable for risking an AM container.
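
To make the classification discussed above concrete, here is a minimal sketch of the decision. It is not the code in the attached patch. A local enum stands in for the exit statuses named in this thread (the real values are integer constants in ContainerExitStatus), and the class and method names are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

public class AmBlacklistSketch {
    // Local stand-in for the exit statuses named in this thread; not the real YARN class.
    enum ExitStatus { SUCCESS, KILLED_EXCEEDED_PMEM, ABORTED, INVALID, DISKS_FAILED }

    // Assumption: these statuses make the NM a higher suspect for AM launching,
    // following the reasoning in this comment.
    static final Set<ExitStatus> NODE_SUSPECT = EnumSet.of(
            ExitStatus.KILLED_EXCEEDED_PMEM, // possible resource congestion on the node
            ExitStatus.ABORTED,              // unknown reason
            ExitStatus.INVALID,              // unknown reason
            ExitStatus.DISKS_FAILED);        // a disk may be full while other dirs still work

    // Hypothetical helper: should a failed AM container with this status
    // put its node on the global AM blacklist?
    static boolean shouldBlacklistNodeForAm(ExitStatus status) {
        return NODE_SUSPECT.contains(status);
    }

    public static void main(String[] args) {
        for (ExitStatus s : ExitStatus.values()) {
            System.out.println(s + " -> " + shouldBlacklistNodeForAm(s));
        }
    }
}
```

A node flagged this way would still be eligible for normal containers; only AM placement avoids it.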

bq. AMBlackListingRequest contains a boolean flag and a threshold number. Do 
you think it’s ok to just use the threshold number only ? 0 means disabled, and 
numbers larger than 0 means enabled?
If so, the job submitter has to know how many nodes the current cluster has, 
and the job parameter has to be updated whenever the job is submitted to a 
different cluster (with a different number of nodes). IMO, that adds 
complexity for users.
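
A small sketch of the concern, under stated assumptions: the thread does not spell out the unit of the threshold, so the code below simply contrasts a hypothetical absolute node count with a hypothetical fraction of the cluster. All names are illustrative and are not part of AMBlackListingRequest.

```java
public class ThresholdSketch {
    // Hypothetical: blacklisting stops once the blacklisted nodes reach an absolute count.
    static boolean disabledByCount(int blacklisted, int maxCount) {
        return blacklisted >= maxCount;
    }

    // Hypothetical: blacklisting stops once the blacklisted nodes reach a fraction of the cluster.
    static boolean disabledByFraction(int blacklisted, int clusterNodes, double fraction) {
        return blacklisted >= fraction * clusterNodes;
    }

    public static void main(String[] args) {
        int blacklisted = 5;
        for (int clusterNodes : new int[] {10, 1000}) {
            System.out.println(clusterNodes + " nodes: byCount(5)="
                    + disabledByCount(blacklisted, 5)
                    + ", byFraction(0.5)="
                    + disabledByFraction(blacklisted, clusterNodes, 0.5));
        }
    }
}
```

With an absolute count, the same setting means half of a 10-node cluster but a tiny share of a 1000-node cluster, so the submitter would need to retune it per cluster. A relative value carries over unchanged.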

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist in addition to each app’s blacklist to track AM 
> container failures that have a global effect. That means we need to 
> differentiate whether a non-successful ContainerExitStatus is caused by the 
> NM or is more related to the app. 
> For more details, please refer to the document in YARN-4576.



