Thanks Roman for providing the details.

I also made more observations that have increased my confusion about this topic 😝
To ease the calculations, I deployed a test cluster, this time providing 1 CPU 
in K8s (with Docker) to each taskmanager container.

When I check the taskmanager CPU load, the value is on the order of 
"0.002158428663932657".
Assuming that the underlying JVM recognizes the 1 CPU allocated to the Docker 
container, this value means a CPU usage of about 0.2%.

However, if I look at the K8s metrics (formula below) for this container, it 
comes out in the ballpark of 10-16%.
There is no other process running in the container apart from the flink 
taskmanager.

These two CPU usage percentages differ by well over an order of magnitude.

Am I comparing the right metrics here?
How are folks running Flink on K8s monitoring the CPU load?

~ Abhi

% CPU usage from K8s metrics:
sum(rate(container_cpu_usage_seconds_total{pod=~"my-taskmanager-pod-.*", container="taskmanager"}[5m])) by (pod)
/ sum(container_spec_cpu_quota{pod=~"my-taskmanager-pod-.*", container="taskmanager"}
    / container_spec_cpu_period{pod=~"my-taskmanager-pod-.*", container="taskmanager"}) by (pod)
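
As a rough cross-check of what the query above computes, here is a minimal Python sketch with made-up numbers. The function name and the sample values are assumptions for illustration only; the sketch relies on container_cpu_usage_seconds_total being a cumulative CPU-seconds counter and on quota/period giving the CPU limit in cores. Prometheus' rate() also extrapolates at the window edges, so real results will differ slightly.

```python
def container_cpu_fraction(usage_start_s, usage_end_s, window_s, quota_us, period_us):
    """Fraction of the container's CPU limit used over a window.

    Approximates rate(container_cpu_usage_seconds_total[window])
    divided by (container_spec_cpu_quota / container_spec_cpu_period).
    """
    cores_used = (usage_end_s - usage_start_s) / window_s  # average cores kept busy
    cores_limit = quota_us / period_us                     # 100000/100000 = 1 CPU
    return cores_used / cores_limit


# Hypothetical samples: 36 CPU-seconds consumed in a 5-minute window, 1 CPU limit.
fraction = container_cpu_fraction(1000.0, 1036.0, 300.0, 100_000, 100_000)
print(f"{fraction:.2%}")  # 12.00%
```

Note that the result is a fraction of the limit; multiply by 100 to read it as a percentage.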

From: Roman Grebennikov <g...@dfdx.me>
Date: Tuesday, August 4, 2020 at 12:42 AM
To: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Flink CPU load metrics in K8s

Hi,

JVM.CPU.Load is just a wrapper (MetricUtils.instantiateCPUMetrics) on top of 
OperatingSystemMXBean.getProcessCpuLoad; see 
https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad()

The value usually looks odd if you have multiple CPU cores. For example, if you 
have a job with a single slot fully utilizing one CPU core on an 8-core machine, 
JVM.CPU.Load will be 1.0/8.0 = 0.125. It is also a point-in-time snapshot of 
current CPU usage, so if you collect your metrics every minute and the job has 
a spiky workload within that minute (say it is idle almost all the time and 
consumes 100% CPU for one second once a minute), you may miss the spike 
completely.
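
To illustrate both effects with made-up numbers (a toy model in Python, not how the JVM actually computes the value): one busy core out of eight gives 0.125, and a once-a-minute scrape of a point-in-time value can miss a one-second spike that an average over the minute would still show. The function name and the trace are hypothetical.

```python
def process_load(busy_cores, total_cores):
    """Load as a fraction of all cores, the scale described for JVM.CPU.Load."""
    return busy_cores / total_cores


print(process_load(1.0, 8.0))  # 0.125

# Toy per-second load trace for one minute on a 1-core container:
# idle except for a single second at 100%.
trace = [0.0] * 60
trace[37] = 1.0  # hypothetical position of the spike

sampled = trace[0]                 # a once-a-minute point-in-time scrape
average = sum(trace) / len(trace)  # what a cumulative counter would imply

print(sampled)            # 0.0
print(round(average, 4))  # 0.0167
```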

Personally, I find JVM.CPU.Time a clearer indicator of CPU usage: it is an 
always-increasing count of the nanoseconds of CPU time spent executing your 
code, so it also catches CPU usage spikes.
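
A minimal Python sketch of turning two JVM.CPU.Time samples into an average utilization. It assumes the metric is in nanoseconds, as returned by OperatingSystemMXBean.getProcessCpuTime; the function name, the second sample, and the 60-second interval are made up for illustration.

```python
def avg_cpu_utilization(cpu_time_start_ns, cpu_time_end_ns, wall_seconds, cores=1):
    """Average fraction of the available CPU used between two counter samples."""
    cpu_seconds = (cpu_time_end_ns - cpu_time_start_ns) / 1e9
    return cpu_seconds / (wall_seconds * cores)


first = 23_260_000_000          # value from the REST response quoted below
second = first + 7_200_000_000  # hypothetical sample taken 60 s later

util = avg_cpu_utilization(first, second, wall_seconds=60, cores=1)
print(f"{util:.1%}")  # 12.0%
```

Because the counter only ever grows, a short burst between two scrapes still shows up in the difference, which a sampled load value may not capture.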

Roman Grebennikov | g...@dfdx.me


On Mon, Aug 3, 2020, at 23:34, Bajaj, Abhinav wrote:

Hi,

I am trying to understand the CPU Load metrics reported by Flink 1.7.1 running 
with openjdk 1.8.0_212 on K8s.

After deploying the Flink Job on K8s, I tried to get CPU Load metrics following 
this documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#rest-api-integration

curl localhost:8081/taskmanagers/7737ac33b311ea0a696422680711597b/metrics?get=Status.JVM.CPU.Load,Status.JVM.CPU.Time

[{"id":"Status.JVM.CPU.Load","value":"0.0023815194093831865"},{"id":"Status.JVM.CPU.Time","value":"23260000000"}]
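
For reference, a small Python sketch that decodes the response above. The unit conversions are assumptions based on the JDK methods behind these metrics: the load is treated as a fraction between 0 and 1, and the time as nanoseconds.

```python
import json

# Response body copied from the curl call above.
response = (
    '[{"id":"Status.JVM.CPU.Load","value":"0.0023815194093831865"},'
    '{"id":"Status.JVM.CPU.Time","value":"23260000000"}]'
)

# The REST API returns values as strings, so convert them explicitly.
metrics = {m["id"]: float(m["value"]) for m in json.loads(response)}

load_percent = metrics["Status.JVM.CPU.Load"] * 100  # fraction -> percent
cpu_seconds = metrics["Status.JVM.CPU.Time"] / 1e9   # nanoseconds -> seconds

print(f"CPU load: {load_percent:.2f}%")   # CPU load: 0.24%
print(f"CPU time: {cpu_seconds:.2f} s")   # CPU time: 23.26 s
```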



The value of the CPU load looks odd to me.

What is the unit and scale of this value?
How does Flink determine this value?

Appreciate your time and help here.

~ Abhinav Bajaj


