[ 
https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006738#comment-17006738
 ] 

Xintong Song commented on FLINK-15456:
--------------------------------------

When the Flink RM recovers containers of a previous attempt from Yarn after a 
failover, it does not create pending slots for them, unlike when it requests 
new TM containers. The RM only adds the recovered containers' information to 
its worker map, so that later TM registrations can be accepted. Existing TMs 
proactively register with the new leader RM. This means that if a TM from a 
recovered container never registers with the RM, it does not prevent the RM 
from allocating new slots.
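
To make the bookkeeping above concrete, here is a minimal sketch in Java. It 
is a toy model, not Flink's actual ResourceManager code: the class 
ToyResourceManager, its methods, the container ids, and the assumption of one 
slot per TM are all hypothetical and only illustrate the described behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the described bookkeeping; NOT Flink's real ResourceManager. */
public class ToyResourceManager {
    /** Hypothetical per-container state; one slot per TM is assumed. */
    enum State { RECOVERED, PENDING, REGISTERED }

    private final Map<String, State> workerMap = new HashMap<>();
    private int pendingSlots = 0;     // slots promised by newly requested containers
    private int pendingRequests = 0;  // slot requests not yet fulfilled
    private int freeSlots = 0;        // registered slots not yet assigned
    private int nextContainerId = 0;

    /** Recovery path: remember the container so its TM may register; no pending slot. */
    public void recoverContainer(String containerId) {
        workerMap.put(containerId, State.RECOVERED);
    }

    /** New-container path: remember the container and count one pending slot. */
    public String requestNewContainer() {
        String containerId = "new-" + nextContainerId++;
        workerMap.put(containerId, State.PENDING);
        pendingSlots++;
        return containerId;
    }

    /** A TM registers; it is accepted only if its container is in the worker map. */
    public boolean registerTaskManager(String containerId) {
        State state = workerMap.get(containerId);
        if (state == null || state == State.REGISTERED) {
            return false;
        }
        if (state == State.PENDING) {
            pendingSlots--;
        }
        workerMap.put(containerId, State.REGISTERED);
        if (pendingRequests > 0) {
            pendingRequests--;  // the new slot immediately serves a waiting request
        } else {
            freeSlots++;
        }
        return true;
    }

    /** Serve from a free slot, or request a container if no pending slot covers it. */
    public boolean requestSlot() {
        if (freeSlots > 0) {
            freeSlots--;
            return true;
        }
        pendingRequests++;
        if (pendingRequests > pendingSlots) {
            requestNewContainer();
        }
        return false;
    }

    public int getPendingSlots() { return pendingSlots; }
    public int getPendingRequests() { return pendingRequests; }
    public int getFreeSlots() { return freeSlots; }
}
```

In this sketch, a recovered container that never registers contributes no 
pending slot, so a later requestSlot() call still triggers 
requestNewContainer().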

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs 
> for slot requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15456
>                 URL: https://issues.apache.org/jira/browse/FLINK-15456
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: jm_part.log, jm_part2.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29 
> registered, so the job failed because it could not acquire all 30 required 
> slots in time.
> When the failover happens and tasks are re-scheduled, the RM does not ask 
> for new TMs even if it cannot fulfill the slot requests, so the job keeps 
> failing on slot allocation timeout.


