Hi developers!

I ran into a strange bug during Flink failure recovery. It seems the JobManager 
does not bring up new TaskManagers during failure recovery. Some logs and 
information about the Flink job are pasted below. Could you take a look and give 
me some guidance? Thank you so much!

Flink version: 1.13.2
Deploy mode: K8s native
Timeline of the bug:

  1.  The Flink job started running with 8 TaskManagers.
  2.  At 2022-04-17 00:28:15,286, the job hit an error and the JobManager decided 
to restart 2 tasks (pods 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1, 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
  3.  The two old pods were stopped, and the JobManager created 2 new pods (pods 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9, 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at 2022-04-17 
00:33:15,376.
  4.  The JobManager discarded the two new pods’ registrations at 2022-04-17 
00:33:32,393.
  5.  The new pods exited at 2022-04-17 00:33:32,396 because their registration 
was rejected.
  6.  The JobManager did not bring up any new pods and printed the error “Slot 
request bulk is not fulfillable! Could not allocate the required slot within 
slot request timeout” over and over.
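
To make the spacing of these events easier to see, here is a minimal sketch that computes the gaps between the timestamps in the timeline. The timestamps are copied from the list above; the event labels and the helper name `gaps` are my own shorthand and do not come from the Flink logs.

```python
from datetime import datetime

# Timestamp layout used in the timeline above (comma before the milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the timeline; the labels are shorthand, not log text.
events = [
    ("restart decided", "2022-04-17 00:28:15,286"),
    ("new pods created", "2022-04-17 00:33:15,376"),
    ("registration discarded", "2022-04-17 00:33:32,393"),
    ("new pods exited", "2022-04-17 00:33:32,396"),
]


def gaps(evts):
    """Return (previous label, next label, seconds elapsed) for consecutive events."""
    parsed = [(label, datetime.strptime(ts, FMT)) for label, ts in evts]
    return [
        (a[0], b[0], (b[1] - a[1]).total_seconds())
        for a, b in zip(parsed, parsed[1:])
    ]


for prev, nxt, seconds in gaps(events):
    print(f"{prev} -> {nxt}: {seconds:.3f} s")
```

The first gap comes out at roughly five minutes, and the new pods lived for about 17 seconds before their registration was discarded. I have not confirmed whether the five-minute gap corresponds to any timeout setting.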

Flink logs:
1.      JobManager: 
https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing
2.      TaskManager: 
https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing


BRs,
Chenyu
