[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130272#comment-15130272 ]
Junping Du commented on YARN-4635:
----------------------------------

Thanks [~jianhe] for the review and comments.

First, I would like to state an assumption: the blacklist mechanism for AM launching is not for tracking nodes that do not work at all (unhealthy nodes), but for tracking nodes that are suspected of failing the AM container, based on previous failures. We already have the unhealthy-report mechanism for reporting serious NM issues, so this is a separate mechanism that should set a higher bar according to the node's history (in some sense, the AM container is more important than other containers). My responses below are based on this assumption.

bq. why should below container exit status back list the node ?

These container failures could be due to resource congestion (like KILLED_EXCEEDED_PMEM) or to an unknown reason (ABORTED, INVALID), either of which makes this NM more suspect than normal nodes.

bq. For DISKS_FAILED which is considered as global blacklist node in this jira, I think in this case, the node will report as unhealthy and RM should remove the node already.

Some DISKS_FAILED exits can happen because the failed container filled up a disk. The node could still have other directories available to use, so it could still launch normal containers, but it is not suitable to risk an AM container on it.

bq. AMBlackListingRequest contains a boolean flag and a threshold number. Do you think it’s ok to just use the threshold number only ? 0 means disabled, and numbers larger than 0 means enabled?

If so, the job submitter would have to know how many nodes the current cluster has, and the job parameter would have to be updated whenever the job is submitted to a different cluster (with a different number of nodes). IMO, that adds complexity for users.

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app’s blacklist, to track AM
> container failures that have a global effect. That means we need to
> differentiate whether a non-successful ContainerExitStatus stems from the
> NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
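
For readers following the discussion, the exit-status reasoning in the comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the attached patches: the class `AmBlacklistSketch`, the enum `ExitReason`, the `APP_ERROR` value, and all method names are hypothetical, and treating the threshold as a fraction of the cluster size is an assumption made here to show why a relative value travels between clusters more easily than an absolute node count.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the YARN-4635 patch.
public class AmBlacklistSketch {

    // Illustrative stand-in for a few exit reasons named in the comment.
    // This is not YARN's ContainerExitStatus; APP_ERROR is a made-up
    // placeholder for failures attributed to the application itself.
    enum ExitReason { SUCCESS, ABORTED, INVALID, DISKS_FAILED, KILLED_EXCEEDED_PMEM, APP_ERROR }

    // Assumption: reasons the comment treats as making a node suspect
    // for AM placement, even if the node is still reported healthy.
    static final Set<ExitReason> NODE_SUSPECT = EnumSet.of(
        ExitReason.ABORTED, ExitReason.INVALID,
        ExitReason.DISKS_FAILED, ExitReason.KILLED_EXCEEDED_PMEM);

    private final Set<String> suspectNodes = new HashSet<>();
    private final int clusterSize;
    private final float disableThreshold; // assumed: fraction of cluster nodes

    AmBlacklistSketch(int clusterSize, float disableThreshold) {
        this.clusterSize = clusterSize;
        this.disableThreshold = disableThreshold;
    }

    static boolean isNodeSuspect(ExitReason reason) {
        return NODE_SUSPECT.contains(reason);
    }

    // Record an AM container exit; only node-related reasons mark the node.
    void onAmContainerFinished(String node, ExitReason reason) {
        if (isNodeSuspect(reason)) {
            suspectNodes.add(node);
        }
    }

    // Blacklisting stays active only while the suspect nodes are fewer than
    // the configured fraction of the cluster.
    boolean isBlacklistingActive() {
        return suspectNodes.size() < disableThreshold * clusterSize;
    }

    // Nodes to avoid for the next AM launch (empty once blacklisting is off).
    Set<String> blacklist() {
        return isBlacklistingActive()
            ? new HashSet<>(suspectNodes)
            : Collections.<String>emptySet();
    }

    public static void main(String[] args) {
        AmBlacklistSketch tracker = new AmBlacklistSketch(4, 0.5f);
        tracker.onAmContainerFinished("node1", ExitReason.DISKS_FAILED);
        tracker.onAmContainerFinished("node2", ExitReason.APP_ERROR);
        System.out.println("AM blacklist: " + tracker.blacklist());
    }
}
```

With a relative threshold, the same job parameter (for example `0.5f`) keeps its meaning on a 4-node or a 400-node cluster, whereas an absolute count would need retuning per cluster, which is the usability concern raised in the last reply.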