Thank you, Yang!

In fact I have a fine-grained dashboard for Kubernetes cluster health (apiserver 
QPS/latency, etc.), and I didn't find anything unusual… Also, the JobManager 
container's CPU/memory usage is low.

Besides, I did a deep dive into these logs and the Flink resource manager code, 
and found something interesting. I'll use taskmanager-1-9 as an example:

  1.  I can see the log “Requesting new worker with resource spec 
WorkerResourceSpec” at 2022-04-17 00:33:15,333. The code location is 
here<https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L283>.
  2.  “Creating new TaskManager pod with name 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and resource 
<16384,4.0>” at 2022-04-17 00:33:15,376, code 
location<https://github.com/apache/flink/blob/release-1.13.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L167>.
  3.  “Pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 is created.” 
at 2022-04-17 00:33:15,412, code 
location<https://github.com/apache/flink/blob/release-1.13.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L190>.
 The request was sent and the pod was created here, so I think the apiserver 
was healthy at that moment.
  4.  But I cannot find any of the logs printed at this 
line<https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L301>
 or this 
line<https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L314>.
  5.  “Discard registration from TaskExecutor 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9” at 2022-04-17 
00:33:32,393. The root cause of this log is that the 
workerNodeMap<https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L81>
 doesn't contain a ResourceID linked with taskmanager-1-9.
That’s why I think things are strange here. Flink should have put the ResourceID 
into the workerNodeMap 
here<https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L310>,
 but that code didn't execute, even though the Java future it is chained on 
should have completed (the pod was created successfully). And I don't see any 
code related to Kubernetes events in this piece of logic.
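
To make my reading concrete, here is a toy model of that bookkeeping (stand-in 
types and simplified names, NOT the actual Flink code; see the links above for 
the real implementation):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the worker bookkeeping as I read ActiveResourceManager
// (stand-in types and simplified names, not the actual Flink code).
public class WorkerBookkeepingSketch {

    record ResourceId(String name) {}   // stand-in for Flink's ResourceID
    record WorkerNode(ResourceId id) {} // stand-in for the worker node type

    // Corresponds to the workerNodeMap field (L81 in the linked file).
    private final Map<ResourceId, WorkerNode> workerNodeMap = new ConcurrentHashMap<>();

    // Stand-in for resourceManagerDriver.requestResource(...). In this toy it
    // completes immediately; when and from where the real Kubernetes driver
    // completes it is exactly what I'm unsure about.
    private CompletableFuture<WorkerNode> requestResource(String podName) {
        return CompletableFuture.completedFuture(new WorkerNode(new ResourceId(podName)));
    }

    void requestNewWorker(String podName) {
        System.out.println("Requesting new worker " + podName); // seen at 00:33:15,333
        requestResource(podName).whenComplete((worker, failure) -> {
            if (failure != null) {
                System.out.println("Failed requesting worker");    // ~L301, never seen in our logs
            } else {
                workerNodeMap.put(worker.id(), worker);             // ~L310, apparently never ran
                System.out.println("Requested worker " + podName); // ~L314, never seen either
            }
        });
    }

    // Registrations are only accepted for ResourceIDs present in workerNodeMap;
    // otherwise the RM logs "Discard registration from TaskExecutor ...".
    boolean acceptRegistration(ResourceId id) {
        return workerNodeMap.containsKey(id);
    }

    public static void main(String[] args) {
        WorkerBookkeepingSketch rm = new WorkerBookkeepingSketch();
        ResourceId tm9 = new ResourceId("taskmanager-1-9");
        System.out.println(rm.acceptRegistration(tm9)); // false until the callback runs
        rm.requestNewWorker("taskmanager-1-9");
        System.out.println(rm.acceptRegistration(tm9)); // true once the future completed
    }
}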

By the way, in this case, would we expect the JobManager to create new 
TaskManagers?

BRs,
Chenyu

From: Yang Wang <danrtsey...@gmail.com>
Date: Friday, April 22, 2022 at 4:49 PM
To: "Zheng, Chenyu" <chenyu.zh...@disneystreaming.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>, "user...@flink.apache.org" 
<user...@flink.apache.org>
Subject: Re: JobManager doesn't bring up new TaskManager during failure recovery

The root cause might be that your APIServer is overloaded or not running 
normally, and then all the pod events of
taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in the 
FlinkResourceManager.
So the two TaskManagers are not recognized by the ResourceManager and their 
registrations are rejected.

The ResourceManager also did not receive the terminated pod events. That's why 
it does not allocate new TaskManager pods.
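
To illustrate the dependency, here is a hypothetical sketch (invented names, 
not the real Flink interface; the real handling lives in 
KubernetesResourceManagerDriver's pod watch callback). The ResourceManager only 
reacts to pod lifecycle through callbacks like these, so if the APIServer drops 
the events, neither branch ever runs:

// Hypothetical sketch of the watch-driven flow (names invented).
public class PodWatchSketch {

    interface PodEventHandler {
        void onPodAdded(String podName);      // new pod becomes known to the RM
        void onPodTerminated(String podName); // terminated pod triggers a replacement request
    }

    public static void main(String[] args) {
        PodEventHandler rm = new PodEventHandler() {
            @Override public void onPodAdded(String podName) {
                System.out.println("RM now recognizes " + podName);
            }
            @Override public void onPodTerminated(String podName) {
                System.out.println("RM requests a replacement for " + podName);
            }
        };
        // With a healthy APIServer the watch delivers both kinds of events:
        rm.onPodAdded("taskmanager-1-9");
        rm.onPodTerminated("taskmanager-1-1");
        // If the events are never delivered, neither method is ever invoked,
        // which matches the rejected registrations and missing replacements.
    }
}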

All in all, I believe you need to check the K8s APIServer status.

Best,
Yang

Zheng, Chenyu <chenyu.zh...@disneystreaming.com> 
wrote on Friday, April 22, 2022 at 12:54:
Hi developers!

I ran into a strange bug during failure recovery of a Flink job: the JobManager 
doesn't bring up new TaskManagers during failure recovery. Some logs and 
information about the job are pasted below. Could you take a look and give me 
some guidance? Thank you so much!

Flink version: 1.13.2
Deploy mode: K8s native
Timeline of the bug:

  1.  The Flink job started with 8 TaskManagers.
  2.  At 2022-04-17 00:28:15,286, the job hit an error and the JobManager 
decided to restart 2 tasks (pods 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1 and 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
  3.  The two old pods were stopped and the JobManager created 2 new pods 
(stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at 2022-04-17 
00:33:15,376.
  4.  The JobManager discarded the two new pods’ registrations at 2022-04-17 
00:33:32,393.
  5.  The new pods exited at 2022-04-17 00:33:32,396, due to the rejected 
registrations.
  6.  The JobManager didn’t bring up new pods and printed the error “Slot 
request bulk is not fulfillable! Could not allocate the required slot within 
slot request timeout” over and over (see the sketch below).
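
Here is a toy replay of steps 3-6 (hypothetical names, not actual Flink code): 
the ResourceManager accepts registrations only from pods it has recorded, and 
it only requests replacements when it observes a pod terminate, so if both 
signals are missing the job gets stuck exactly as described above.

import java.util.HashSet;
import java.util.Set;

// Toy replay of the observed timeline (hypothetical names).
public class TimelineSketch {

    public static void main(String[] args) {
        Set<String> knownPods = new HashSet<>();

        // Step 3: pods created, but never recorded by the ResourceManager:
        // knownPods.add("taskmanager-1-9");   // this bookkeeping never happened
        // knownPods.add("taskmanager-1-10");

        // Step 4: registrations arrive and are discarded:
        for (String pod : new String[] {"taskmanager-1-9", "taskmanager-1-10"}) {
            if (!knownPods.contains(pod)) {
                System.out.println("Discard registration from TaskExecutor " + pod);
            }
        }

        // Steps 5-6: the rejected pods exit; since no terminated-pod event is
        // observed either, nothing triggers a replacement request, and the
        // pending slot requests eventually time out with
        // "Slot request bulk is not fulfillable!".
    }
}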

Flink logs:
1.      JobManager: 
https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing
2.      TaskManager: 
https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing


BRs,
Chenyu
