Chenyu Zheng created FLINK-27350:
------------------------------------
Summary: JobManager doesn't bring up new TaskManager during
failure recovery
Key: FLINK-27350
URL: https://issues.apache.org/jira/browse/FLINK-27350
Project: Flink
Issue Type: Bug
Reporter: Chenyu Zheng
Attachments: jobmanager.log,
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10.log
I hit a strange bug during Flink failure recovery. The JobManager does not seem
to bring up new TaskManagers after a failure. Some logs and information about
the Flink job are pasted below. Could you take a look and give me some
guidance? Thank you so much!
Flink version: 1.13.2
Deploy mode: K8s native
Timeline of the bug:
# The Flink job started running with 8 TaskManagers.
# At *2022-04-17 00:28:15,286*, the job hit an error and the JobManager
decided to restart 2 tasks (pods
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1 and
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
# The two old pods were stopped and the JobManager created 2 new pods
(stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17
00:33:15,376*.
# The JobManager discarded the registrations of the two new pods at
*2022-04-17 00:33:32,393*.
# The new pods exited at *2022-04-17 00:33:32,396* because their
registrations were rejected.
# The JobManager did not bring up any further pods and printed the error “Slot
request bulk is not fulfillable! Could not allocate the required slot within
slot request timeout” over and over.
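To make the gaps in the timeline above easier to read, here is a minimal Python sketch that computes the elapsed time between the listed timestamps. The event labels and the `gaps` helper are illustrative shorthand, not log output or Flink APIs; only the timestamps come from the report.

```python
from datetime import datetime

# Timestamps copied from the timeline; the labels are illustrative shorthand.
EVENTS = [
    ("job error, restart decided", "2022-04-17 00:28:15,286"),
    ("new pods created", "2022-04-17 00:33:15,376"),
    ("registrations discarded", "2022-04-17 00:33:32,393"),
    ("new pods exited", "2022-04-17 00:33:32,396"),
]

# Log timestamps use a comma before the millisecond part.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    parsed = [(label, datetime.strptime(ts, FMT)) for label, ts in events]
    return [
        (a[0], b[0], (b[1] - a[1]).total_seconds())
        for a, b in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for start, end, seconds in gaps(EVENTS):
        print(f"{start} -> {end}: {seconds:.3f} s")
```

The first gap comes out at roughly 300 seconds and the second at roughly 17 seconds, so it may be worth comparing those intervals against the timeouts configured for this job. This is an observation about the timestamps, not a diagnosis.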
--
This message was sent by Atlassian Jira
(v8.20.7#820007)