[ https://issues.apache.org/jira/browse/FLINK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010440#comment-17010440 ]
Xintong Song commented on FLINK-15456:
--------------------------------------

Thanks [~zhuzh] for looking into the problem. I agree with you that this should be the same problem as FLINK-13554. I'm closing this ticket as a duplicate. Let's keep the discussion of how to fix this issue in FLINK-13554.

> Job keeps failing on slot allocation timeout due to RM not allocating new TMs
> for slot requests
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15456
>                 URL: https://issues.apache.org/jira/browse/FLINK-15456
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: jm.log, jm_part.log, jm_part2.log, tm_container_07.log
>
>
> As shown in the attached JM log, the job tried to start 30 TMs but only 29
> registered. So the job fails because it cannot acquire all 30 required slots
> in time.
> When the failover happens and tasks are re-scheduled, the RM does not ask
> for new TMs even if it cannot fulfill the slot requests. So the job keeps
> failing on slot allocation timeout.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)