Xintong Song created FLINK-13554:
------------------------------------

             Summary: ResourceManager should have a timeout on starting new 
TaskExecutors.
                 Key: FLINK-13554
                 URL: https://issues.apache.org/jira/browse/FLINK-13554
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.9.0
            Reporter: Xintong Song


Recently, we encountered a case where a TaskExecutor got stuck while 
launching on Yarn (without failing), so the job could not recover from 
continuous failovers.

The TaskExecutor got stuck because of a problem in our environment. It hung 
somewhere after the ResourceManager had started it, while the ResourceManager 
was waiting for it to come up and register. Later, when the slot request timed 
out, the job failed over and requested slots from the ResourceManager again. 
The ResourceManager still saw a TaskExecutor (the stuck one) being started, so 
it did not request a new container from Yarn. Therefore, the job could not 
recover from the failure.

To avoid such an unrecoverable state, I think the ResourceManager needs a 
timeout on starting a new TaskExecutor. If starting a TaskExecutor takes too 
long, the ResourceManager should fail that TaskExecutor and start a new one.
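
A minimal sketch of the idea, to make the proposal concrete. It is not Flink's 
actual ResourceManager code: the class PendingWorkerTracker, its method names, 
and the string worker ids are all hypothetical. It assumes the ResourceManager 
records when each start was requested and periodically checks for workers 
that have not registered within the timeout.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Flink): tracks workers whose start was
 * requested but which have not yet registered with the ResourceManager.
 */
class PendingWorkerTracker {
    private final Duration startTimeout;
    // worker id -> time in milliseconds at which the start was requested
    private final Map<String, Long> pending = new LinkedHashMap<>();

    PendingWorkerTracker(Duration startTimeout) {
        this.startTimeout = startTimeout;
    }

    /** Called when a container is requested and the worker is being started. */
    void onWorkerRequested(String workerId, long nowMillis) {
        pending.put(workerId, nowMillis);
    }

    /** Called when the worker registers; it no longer counts as pending. */
    void onWorkerRegistered(String workerId) {
        pending.remove(workerId);
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Removes and returns every worker that has been pending for at least the
     * timeout. The caller would release these workers and request new ones.
     */
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= startTimeout.toMillis()) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

With something like this, a stuck worker stops counting as "being started" 
once it expires, so the next slot request would trigger a new container 
request instead of waiting on the stuck one indefinitely.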



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
