[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697677#comment-16697677 ]
Zhenqiu Huang edited comment on FLINK-10868 at 11/24/18 5:48 PM:
-----------------------------------------------------------------

[~suez1224] [~till.rohrmann]

I agree with Shuyi's proposal. Since maximum-failed-containers is a job-level setting rather than a session-cluster-level one, we can start with a simple fix for the per-job cluster to regain feature parity with the former release.

1) I will add a boolean parameter to the createResourceManager function to distinguish whether it runs for a per-job cluster or a session cluster, and also pass LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever as one of the parameters of the createResourceManager function in ResourceManagerFactory.
2) If it is a per-job cluster, once the threshold is hit, call shutdownCluster through the DispatcherGateway.

What do you think?


was (Author: zhenqiuhuang):

[~suez1224] [~till.rohrmann]

I agree with Shuyi's proposal. Since yarn.maximum-failed-containers is a job-level setting rather than a session-cluster-level one, we can start with a simple fix for the per-job cluster to regain feature parity with the former release.

1) I will add a boolean parameter to YarnResourceManager to distinguish whether it runs for a per-job cluster or a session cluster, and also pass LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever as a parameter of the YarnResourceManager constructor.
2) If it is a per-job cluster, once the threshold is hit, call shutdownCluster through the DispatcherGateway.

What do you think?

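To make the two steps in the comment above concrete, here is a minimal, self-contained sketch of the idea. It is not Flink code: FailedContainerTracker, its constructor arguments, and onContainerFailed are hypothetical names, and the Runnable stands in for whatever call would go through the DispatcherGateway to shut the cluster down. It also assumes that "the threshold is hit" means the failure count has reached the configured maximum.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of Flink) that counts failed containers and
 * triggers a shutdown action once a per-job cluster reaches the limit.
 */
final class FailedContainerTracker {
    private final boolean perJobCluster;
    private final int maxFailedContainers;
    // Stand-in for a shutdown request sent through the DispatcherGateway (assumption).
    private final Runnable shutdownAction;
    private final AtomicInteger failedContainers = new AtomicInteger();
    private final AtomicBoolean shutdownTriggered = new AtomicBoolean();

    FailedContainerTracker(boolean perJobCluster, int maxFailedContainers, Runnable shutdownAction) {
        this.perJobCluster = perJobCluster;
        this.maxFailedContainers = maxFailedContainers;
        this.shutdownAction = shutdownAction;
    }

    /**
     * Records one container failure.
     *
     * @return true if a replacement container should still be requested,
     *         false if the limit was reached and shutdown has been triggered
     */
    boolean onContainerFailed() {
        int failures = failedContainers.incrementAndGet();
        // Session clusters ignore the limit in this sketch; only per-job clusters enforce it.
        if (!perJobCluster || failures < maxFailedContainers) {
            return true;
        }
        // Run the shutdown action at most once, even if more failures arrive later.
        if (shutdownTriggered.compareAndSet(false, true)) {
            shutdownAction.run();
        }
        return false;
    }

    int getFailedContainers() {
        return failedContainers.get();
    }
}
```

In this sketch a per-job tracker created with a limit of 2 would allow one retry and then run the shutdown action on the second failure, while a tracker created with perJobCluster set to false would keep returning true. Whether a session cluster should ignore the limit entirely is a design question left open in the discussion.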
> Flink's JobCluster ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as the limit of resource acquirement. In the worst case, when newly started containers consistently fail, YarnResourceManager will go into an infinite resource acquirement process without failing the job. Together with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all resources of the YARN queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)