[ 
https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007193#comment-17007193
 ] 

Zhu Zhu commented on FLINK-15456:
---------------------------------

Synced with [~xintongsong] offline. The RM recovered because an existing TM 
disconnected and reconnected, which fulfilled the request that had been pending 
for a long time, after which the RM would request new TMs for new slot requests.

Now we can focus on the original issue. I will try to reproduce it with debug 
logs.

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs 
> for slot requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15456
>                 URL: https://issues.apache.org/jira/browse/FLINK-15456
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: jm_part.log, jm_part2.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29 
> registered. The job therefore fails because it cannot acquire all 30 slots 
> it needs in time.
> When the failover happens and tasks are re-scheduled, the RM does not ask 
> for new TMs even if it cannot fulfill the slot requests, so the job keeps 
> failing on slot allocation timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
