[ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010460#comment-17010460 ]
Xintong Song commented on FLINK-13554:
--------------------------------------

IMO, a clean solution would be for the RM to monitor a timeout for starting new TMs. But this approach involves introducing config options for the timeout, monitoring the timeout asynchronously, and properly un-monitoring on TM registration, which may not be suitable to add after the feature freeze.

Also, this does not seem to be a common case. We have not seen any reports of this bug from users. We ran into this problem (both this ticket and FLINK-15456) only when testing the stability of Flink with ChaosMonkey intentionally breaking the network connections.

Therefore, I'm in favor of not fixing this problem in release 1.10.0.

> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
>                 Key: FLINK-13554
>                 URL: https://issues.apache.org/jira/browse/FLINK-13554
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Xintong Song
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during
> launching on Yarn (without failing), so that the job could not recover from
> continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It gets
> stuck somewhere after the ResourceManager has started it, while the
> ResourceManager is waiting for it to be brought up and register.
> Later, when the slot request times out, the job fails over and requests slots
> from the ResourceManager again. The ResourceManager still sees a TaskExecutor
> (the stuck one) being started and will not request a new container from Yarn.
> Therefore, the job cannot recover from the failure.
> I think that, to avoid such an unrecoverable state, the ResourceManager needs
> to have a timeout on starting new TaskExecutors. If starting a TaskExecutor
> takes too long, it should just fail that TaskExecutor and start a new one.
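
The approach outlined in the comment (the RM monitors a timeout for each TM it starts, and un-monitors it when the TM registers) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class PendingWorkerTimeoutTracker, its methods, and the onTimeout callback are hypothetical names and not part of Flink's API, worker IDs are assumed to be unique, and the timeout value would come from a config option that is not defined in this ticket.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not a Flink class): tracks workers that have been
 * requested but have not registered yet, and reports those that take too long.
 */
final class PendingWorkerTimeoutTracker {
    private final ScheduledExecutorService scheduler;
    private final Duration startupTimeout;
    private final Consumer<String> onTimeout;
    // Worker ID -> scheduled timeout check; guarded by "this".
    private final Map<String, ScheduledFuture<?>> pending = new HashMap<>();

    PendingWorkerTimeoutTracker(
            ScheduledExecutorService scheduler,
            Duration startupTimeout,
            Consumer<String> onTimeout) {
        this.scheduler = scheduler;
        this.startupTimeout = startupTimeout;
        this.onTimeout = onTimeout;
    }

    /** Call when a new worker is requested; starts monitoring it asynchronously. */
    synchronized void workerRequested(String workerId) {
        ScheduledFuture<?> check = scheduler.schedule(
                () -> expire(workerId),
                startupTimeout.toMillis(),
                TimeUnit.MILLISECONDS);
        pending.put(workerId, check);
    }

    /**
     * Call when a worker registers; stops monitoring it.
     * Returns false if the worker was unknown or had already timed out.
     */
    synchronized boolean workerRegistered(String workerId) {
        ScheduledFuture<?> check = pending.remove(workerId);
        if (check == null) {
            return false;
        }
        check.cancel(false);
        return true;
    }

    /** Runs on the scheduler thread once the startup timeout has elapsed. */
    private void expire(String workerId) {
        synchronized (this) {
            if (pending.remove(workerId) == null) {
                return; // The worker registered in the meantime.
            }
        }
        // Invoked outside the lock; the callback could, for example,
        // release the stuck container and request a new one.
        onTimeout.accept(workerId);
    }
}
```

In such a design, the callback is where the RM would decide to fail the stuck TM and request a replacement; a registration that arrives after the timeout is reported as unknown (workerRegistered returns false), so the caller can reject it.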