HuangZhenQiu commented on issue #7356: [FLINK-10868][flink-yarn] Enforce maximum failed TMs in YarnResourceManager URL: https://github.com/apache/flink/pull/7356#issuecomment-457305863

@tillrohrmann Thanks for your comments.

1) Totally agree. This PR should also support MesosResourceManager. Let me rephrase the title of the JIRA ticket.

2) Yes, it should be disabled by default. I can simply initialize the default value of the maximum number of allowed failed TaskManagers to Integer.MAX_VALUE.

3) There are several types of failure scenarios:
   - When a new container starts, there may be a NameNode failover, or HDFS may be down. The container then cannot fetch the job jar to bootstrap.
   - We allocate an HDFS quota for each job's checkpoint folder. When the quota is hit, containers fail consistently, and the restart strategy restarts the job by allocating more containers. In this situation the job continues to run for a while, but it is actually in a wrong state.

   I would prefer to apply the threshold per job rather than to the whole cluster, but when I started on the implementation I found it hard to determine in YarnResourceManager which allocated container belongs to which JobMaster. Any suggestion for this?

4) As you suggested in the initial conversation on the JIRA ticket, MaximumFailedTaskManagerExceedingException is currently thrown to the ExecutionGraph, which then relies on the configured RestartStrategy to take action. Since preventing infinite restarts regardless of the restart strategy is exactly what we want, I will make MaximumFailedTaskManagerExceedingException extend SuppressRestartsException. I will update the PR accordingly.
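To illustrate points 2) and 4), here is a minimal, self-contained sketch of the intended behavior. It is not the actual PR code: `SuppressRestartsException` stands in for Flink's marker exception of the same name, `FailedTaskManagerTracker` is a hypothetical helper, and the only claims it demonstrates are that a default of `Integer.MAX_VALUE` effectively disables the check and that exceeding the limit throws a restart-suppressing exception.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for Flink's SuppressRestartsException marker type. */
class SuppressRestartsException extends RuntimeException {
    SuppressRestartsException(String message) {
        super(message);
    }
}

/** The exception proposed in this PR, extending the restart-suppressing marker. */
class MaximumFailedTaskManagerExceedingException extends SuppressRestartsException {
    MaximumFailedTaskManagerExceedingException(int max) {
        super("Number of failed TaskManagers exceeded the maximum of " + max);
    }
}

/**
 * Hypothetical tracker for failed TaskManager containers.
 * With the default of Integer.MAX_VALUE the check is effectively disabled.
 */
class FailedTaskManagerTracker {
    private final int maxFailedTaskManagers;
    private final AtomicInteger failedTaskManagers = new AtomicInteger();

    FailedTaskManagerTracker(int maxFailedTaskManagers) {
        this.maxFailedTaskManagers = maxFailedTaskManagers;
    }

    /** Called whenever a TaskManager container fails. */
    void recordFailure() {
        if (failedTaskManagers.incrementAndGet() > maxFailedTaskManagers) {
            // Extending SuppressRestartsException signals the restart
            // strategy to fail the job instead of restarting it.
            throw new MaximumFailedTaskManagerExceedingException(maxFailedTaskManagers);
        }
    }

    int failures() {
        return failedTaskManagers.get();
    }
}
```

The per-job vs. per-cluster question from point 3) is orthogonal to this sketch; here the counter is simply scoped to wherever the tracker instance lives.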