[ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010522#comment-17010522 ]
Zhu Zhu commented on FLINK-13554:
---------------------------------

This issue is triggered only when a TM gets stuck while launching, before it registers with the RM. So far we have seen this only in our stability tests, which intentionally break ZooKeeper and network connections. So I agree that we can postpone it as long as we do not encounter this issue in production.

> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
>                 Key: FLINK-13554
>                 URL: https://issues.apache.org/jira/browse/FLINK-13554
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Xintong Song
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during
> launching on Yarn (without failing), so that the job could not recover from
> continuous failovers.
> The TaskExecutor gets stuck because of a problem in our environment. It
> gets stuck somewhere after the ResourceManager starts the TaskExecutor,
> while the ResourceManager waits for it to be brought up and register.
> Later, when the slot request times out, the job fails over and requests slots
> from the ResourceManager again. The ResourceManager still sees that a
> TaskExecutor (the stuck one) is being started and will not request a new
> container from Yarn. Therefore, the job cannot recover from the failure.
> I think that, to avoid such an unrecoverable state, the ResourceManager needs
> to have a timeout on starting a new TaskExecutor. If starting the TaskExecutor
> takes too long, it should just fail that TaskExecutor and start a new one.



--
This message was sent by Atlassian Jira (v8.3.4#803005)