[ https://issues.apache.org/jira/browse/FLINK-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473980#comment-16473980 ]
Sihua Zhou commented on FLINK-9351:
-----------------------------------

Hi [~till.rohrmann], do you mind if I take this ticket? If not, I'd like to take it now.

> RM stop assigning slot to Job because the TM killed before connecting to JM successfully
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-9351
>                 URL: https://issues.apache.org/jira/browse/FLINK-9351
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> The steps are the following (copied from Stephan's comments in [5931|https://github.com/apache/flink/pull/5931]):
> - JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager.
> - ResourceManager starts a container with a TaskManager.
> - TaskManager registers at the ResourceManager, which tells the TaskManager to push a slot to the JobManager.
> - TaskManager container is killed.
> - The ResourceManager does not queue back the slot requests (AllocationIDs) that it sent to the previous TaskManager, so the requests are lost and need to time out before another attempt is tried.
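
The behaviour the steps above ask for can be sketched roughly as follows. This is only an illustration under stated assumptions: `SlotRequestTracker` and all of its methods are hypothetical names invented here, not Flink's actual ResourceManager or SlotManager API, and allocation IDs are modelled as plain strings. The idea is that requests handed to a TaskManager stay tracked until the JobMaster confirms the slot, so they can be queued back if the TaskManager is lost first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not Flink's actual ResourceManager or SlotManager. */
class SlotRequestTracker {
    // Allocation IDs still waiting to be handed to a TaskManager.
    private final Deque<String> pending = new ArrayDeque<>();
    // Allocation IDs sent to a TaskManager but not yet confirmed by the JobMaster.
    private final Map<String, List<String>> inFlight = new HashMap<>();

    /** JobMaster / SlotPool asks for a slot. */
    void requestSlot(String allocationId) {
        pending.addLast(allocationId);
    }

    /** A TaskManager registered: hand it every pending request. */
    void onTaskManagerRegistered(String taskManagerId) {
        List<String> assigned =
                inFlight.computeIfAbsent(taskManagerId, k -> new ArrayList<>());
        while (!pending.isEmpty()) {
            assigned.add(pending.pollFirst());
        }
    }

    /** The JobMaster accepted the slot: the request is fulfilled. */
    void onSlotAccepted(String taskManagerId, String allocationId) {
        List<String> assigned = inFlight.get(taskManagerId);
        if (assigned != null) {
            assigned.remove(allocationId);
        }
    }

    /** The TaskManager was killed: queue its unconfirmed requests back. */
    void onTaskManagerLost(String taskManagerId) {
        List<String> lost = inFlight.remove(taskManagerId);
        if (lost != null) {
            for (String allocationId : lost) {
                pending.addLast(allocationId);
            }
        }
    }

    /** Snapshot of the requests currently waiting for a TaskManager. */
    List<String> pendingRequests() {
        return new ArrayList<>(pending);
    }
}
```

With this bookkeeping, a request sent to a TaskManager that dies before reaching the JobMaster becomes pending again immediately instead of waiting for a timeout.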