Ruibin Xing created FLINK-34347:
-----------------------------------

             Summary: Kubernetes native resource manager request wrong spec.
                 Key: FLINK-34347
                 URL: https://issues.apache.org/jira/browse/FLINK-34347
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes, Kubernetes Operator
    Affects Versions: kubernetes-operator-1.6.1, 1.18.0
            Reporter: Ruibin Xing
         Attachments: jobmanager.csv, taskmanager_octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.csv

We had a Flink spec in which the TM CPU was set to 0.5, and we then updated it to 4.0. After the update, we found the job manager requesting TMs with both 0.5 CPU and 4 CPU. Most of the TMs with 0.5 CPU were released soon afterwards; however, one TM with 0.5 CPU remained and caused lag in the job.

Logs for the mixed TM requests:
{code:java}
2024-02-03 10:10:41,414 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker octopus-16-323-octopus-engine-write-proxy-taskmanager-3-244 with resource spec WorkerResourceSpec {cpuCores=4.0, taskHeapSize=5.637gb (6053219520 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}.
2024-02-03 10:10:44,844 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}, current pending count: 1.
2024-02-03 10:10:44,920 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, numSlots=4}, current pending count: 2.
{code}

The name of the TM with the wrong spec is octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326. Relevant logs are attached.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)