HuangZhenQiu commented on issue #7356: [FLINK-10868][flink-yarn] Enforce maximum failed TMs in YarnResourceManager URL: https://github.com/apache/flink/pull/7356#issuecomment-457305863

@tillrohrmann Thanks for your comments.

1) Totally agree. This PR should also support MesosResourceManager. Let me rephrase the title of the JIRA ticket.

2) Yes, it should be disabled by default. I can simply initialize the default value of the maximum number of allowed failed TaskManagers to Integer.MAX_VALUE.

3) There are several types of failure scenarios:
   - When a new container starts, there may be a NameNode failover, or HDFS may be down. The container then cannot fetch the job jar to bootstrap.
   - We allocate an HDFS quota for each job's checkpoint folder. When the quota is hit, containers fail consistently, and the restart strategy restarts the job by allocating more containers. In this situation the job continues to run for a while, but it is actually in a wrong state.

   I would prefer to apply the threshold per job rather than to the whole cluster, but when I started on the implementation I found it hard to determine in YarnResourceManager which allocated container belongs to which JobMaster. Any suggestion for this?

4) As you suggested in the initial conversation on the JIRA ticket, MaximumFailedTaskManagerExceedingException is currently thrown to the ExecutionGraph, which then relies on the configured RestartStrategy to take action. Since preventing infinite restarts regardless of the restart strategy is exactly what we want, I will make MaximumFailedTaskManagerExceedingException extend SuppressRestartsException. I will update the PR accordingly.
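To illustrate points 2) and 4), here is a minimal, self-contained sketch of the intended behavior. It is not the actual PR code: `SuppressRestartsException` stands in for Flink's marker exception of the same name, `FailedTaskManagerTracker` is a hypothetical helper, and the only claims it demonstrates are that a default of `Integer.MAX_VALUE` effectively disables the check and that exceeding the limit throws a restart-suppressing exception.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for Flink's SuppressRestartsException marker type. */
class SuppressRestartsException extends RuntimeException {
    SuppressRestartsException(String message) {
        super(message);
    }
}

/** The exception proposed in this PR, extending the restart-suppressing marker. */
class MaximumFailedTaskManagerExceedingException extends SuppressRestartsException {
    MaximumFailedTaskManagerExceedingException(int max) {
        super("Number of failed TaskManagers exceeded the maximum of " + max);
    }
}

/**
 * Hypothetical tracker for failed TaskManager containers.
 * With the default of Integer.MAX_VALUE the check is effectively disabled.
 */
class FailedTaskManagerTracker {
    private final int maxFailedTaskManagers;
    private final AtomicInteger failedTaskManagers = new AtomicInteger();

    FailedTaskManagerTracker(int maxFailedTaskManagers) {
        this.maxFailedTaskManagers = maxFailedTaskManagers;
    }

    /** Called whenever a TaskManager container fails. */
    void recordFailure() {
        if (failedTaskManagers.incrementAndGet() > maxFailedTaskManagers) {
            // Extending SuppressRestartsException signals the restart
            // strategy to fail the job instead of restarting it.
            throw new MaximumFailedTaskManagerExceedingException(maxFailedTaskManagers);
        }
    }

    int failures() {
        return failedTaskManagers.get();
    }
}
```

The per-job vs. per-cluster question from point 3) is orthogonal to this sketch; here the counter is simply scoped to wherever the tracker instance lives.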