[jira] [Commented] (FLINK-9351) RM stop assigning slot to Job because the TM killed before connecting to JM successfully

Sihua Zhou (JIRA) Mon, 14 May 2018 02:49:12 -0700

    [ https://issues.apache.org/jira/browse/FLINK-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473980#comment-16473980 ]


Sihua Zhou commented on FLINK-9351:
-----------------------------------

Hi [~till.rohrmann], do you mind if I take this ticket? If not, I'd like to 
pick it up now.

> RM stop assigning slot to Job because the TM killed before connecting to JM 
> successfully
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-9351
>                 URL: https://issues.apache.org/jira/browse/FLINK-9351
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> The steps are the following (copied from Stephan's comments in 
> [5931|https://github.com/apache/flink/pull/5931]):
> - JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
> - ResourceManager starts a container with a TaskManager
> - TaskManager registers at ResourceManager, which tells the TaskManager to 
> push a slot to the JobManager.
> - TaskManager container is killed
> - The ResourceManager does not queue back the slot requests (AllocationIDs) 
> that it sent to the previous TaskManager, so the requests are lost and need 
> to time out before another attempt is tried.
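
To make the failure mode in the steps above concrete, here is a minimal sketch of the bookkeeping that would avoid losing the requests. It is not Flink's actual ResourceManager or SlotManager code: the class SlotRequestTracker, its methods, and the use of plain strings for AllocationIDs and TaskManager IDs are all hypothetical names introduced only for illustration. The idea is to remember which requests were handed to which TaskManager, and to put the unconfirmed ones back into the pending queue when that TaskManager is lost, instead of waiting for them to time out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch only; not Flink's real ResourceManager/SlotManager API. */
class SlotRequestTracker {
    // Slot requests (AllocationIDs, modeled as strings) waiting for a TaskManager.
    private final Queue<String> pending = new ArrayDeque<>();
    // Requests sent to a TaskManager but not yet confirmed by the JobMaster.
    private final Map<String, List<String>> assignedByTaskManager = new HashMap<>();

    /** JobMaster / SlotPool asks for a slot. */
    void requestSlot(String allocationId) {
        pending.add(allocationId);
    }

    /** A TaskManager registered: hand it the oldest pending request, if any. */
    String assignNext(String taskManagerId) {
        String allocationId = pending.poll();
        if (allocationId != null) {
            assignedByTaskManager
                .computeIfAbsent(taskManagerId, k -> new ArrayList<>())
                .add(allocationId);
        }
        return allocationId;
    }

    /** The slot reached the JobMaster, so the request is fulfilled. */
    void confirm(String taskManagerId, String allocationId) {
        List<String> assigned = assignedByTaskManager.get(taskManagerId);
        if (assigned != null) {
            assigned.remove(allocationId);
        }
    }

    /** The TaskManager was killed: re-queue every request it never fulfilled. */
    int taskManagerLost(String taskManagerId) {
        List<String> lost = assignedByTaskManager.remove(taskManagerId);
        if (lost == null) {
            return 0;
        }
        pending.addAll(lost);
        return lost.size();
    }

    /** Snapshot of the requests still waiting for a slot. */
    List<String> pendingRequests() {
        return new ArrayList<>(pending);
    }
}
```

In this sketch, calling taskManagerLost("tm-1") after the container is killed moves the outstanding AllocationIDs back to the pending queue, so the next TaskManager that registers can serve them right away. Whether the real fix should re-queue inside the ResourceManager or be handled elsewhere is a design question for the ticket, not something this sketch decides.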



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
