Ruibin Xing created FLINK-34347:
-----------------------------------

             Summary: Kubernetes native resource manager request wrong spec.
                 Key: FLINK-34347
                 URL: https://issues.apache.org/jira/browse/FLINK-34347
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes, Kubernetes Operator
    Affects Versions: kubernetes-operator-1.6.1, 1.18.0
            Reporter: Ruibin Xing
         Attachments: jobmanager.csv, 
taskmanager_octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.csv

We had a Flink spec in which the TM CPU was set to 0.5, then we upgraded it to 
4.0. We found the job manager requesting TMs with both 0.5 CPU and 4 CPU. Most 
TMs with 0.5 CPU were released soon afterwards, but one TM with 0.5 CPU 
remained and caused lag in the job.

 

Logs for mixed TM requests:
{code:java}
2024-02-03 10:10:41,414 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requested worker octopus-16-323-octopus-engine-write-proxy-taskmanager-3-244 
with resource spec WorkerResourceSpec {cpuCores=4.0, taskHeapSize=5.637gb 
(6053219520 bytes), taskOffHeapSize=1024.000mb (1073741824 bytes), 
networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, 
numSlots=4}.
2024-02-03 10:10:44,844 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, 
taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 
bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, 
numSlots=4}, current pending count: 1.
2024-02-03 10:10:44,920 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=0.5, 
taskHeapSize=1.137gb (1221381320 bytes), taskOffHeapSize=1024.000mb (1073741824 
bytes), networkMemSize=64.000mb (67108864 bytes), managedMemSize=0 bytes, 
numSlots=4}, current pending count: 2.
{code}
The name of the TM with the wrong spec: 
octopus-16-323-octopus-engine-write-proxy-taskmanager-3-326.

Relevant logs are attached.
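
For anyone checking a JobManager log for the same pattern, the following is a minimal Python sketch, not part of Flink, that counts how many worker requests were logged per cpuCores value. The function name count_cpu_specs and the regular expression are illustrative assumptions based only on the log format quoted above; more than one distinct cpuCores value after an upgrade would suggest mixed requests like the ones reported here.

```python
import re
from collections import Counter

# Assumption: entries look like the ActiveResourceManager lines quoted above,
# i.e. "... WorkerResourceSpec {cpuCores=0.5, taskHeapSize=...".
SPEC_RE = re.compile(r"WorkerResourceSpec\s*\{cpuCores=([0-9.]+)")


def count_cpu_specs(log_text: str) -> Counter:
    """Count WorkerResourceSpec log entries per cpuCores value."""
    # Log lines may be hard-wrapped, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter(float(value) for value in SPEC_RE.findall(flat))
```

Calling count_cpu_specs on the text of a JobManager log returns a Counter keyed by the CPU value, which can then be compared against the CPU value configured in the current spec.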



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
