Re: Flink CPU load metrics in K8s

Arvid Heise Thu, 13 Aug 2020 00:34:03 -0700

Hi Abhinav,

According to [1], you need JDK 8u261 for the OperatingSystemMXBean to work as
expected.


[1] https://bugs.openjdk.java.net/browse/JDK-8242287
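A minimal sketch for checking what the JVM inside the container actually
reports. It assumes a HotSpot/OpenJDK runtime, where the platform bean can be
cast to com.sun.management.OperatingSystemMXBean. The class name CpuView is
made up for illustration. Run it inside the taskmanager image and compare the
output with the CPU limit set on the container.

```java
import java.lang.management.ManagementFactory;

final class CpuView {
    // Builds a one-line summary of how this JVM sees the CPU.
    static String describe() {
        // Assumption: HotSpot/OpenJDK, where this cast succeeds.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        return "java.version=" + System.getProperty("java.version")
            + " availableProcessors=" + Runtime.getRuntime().availableProcessors()
            // Fraction in [0.0, 1.0]; negative if the value is not available.
            + " processCpuLoad=" + os.getProcessCpuLoad()
            // CPU time used by this JVM process, in nanoseconds.
            + " processCpuTimeNanos=" + os.getProcessCpuTime();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The first call to getProcessCpuLoad() may return 0.0 or a negative value, so
it is worth sampling it a few times under load before drawing conclusions.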

On Thu, Aug 13, 2020 at 1:10 AM Bajaj, Abhinav <abhinav.ba...@here.com>
wrote:

> Thanks Xintong for your input.
>
>
>
> From the information I could find, I understand the JDK version 1.8.0_212
> we use includes the docker/container support.
>
> I also had a quick test inside the docker image using the below –
>
> Runtime.getRuntime().availableProcessors()
>
>
>
> It showed the right number of CPU cores assigned to the container.
>
>
>
> But I am not familiar with OperatingSystemMXBean used by Flink.
>
> So I don’t know if it will pick up docker CPU limits set by K8s or not. I
> will continue to investigate that.
>
>
>
> In the meantime, the K8s metric container_cpu_usage_seconds_total does seem
> to provide the expected CPU usage for the containers.
>
>
>
>
>
> I was hoping that someone in the community may have already run into this
> behavior on K8s and can share their specific experience 😊.
>
>
>
> Thanks much.
>
> ~ Abhinav Bajaj
>
>
>
> *From: *Xintong Song <tonysong...@gmail.com>
> *Date: *Wednesday, August 12, 2020 at 3:56 AM
> *To: *"Bajaj, Abhinav" <abhinav.ba...@here.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, Roman Grebennikov <
> g...@dfdx.me>
> *Subject: *Re: Flink CPU load metrics in K8s
>
>
>
> Hi Abhinav,
>
>
>
> Do you know how many CPUs in total the physical machine has where the
> Kubernetes container is running?
>
>
>
> I'm asking because I doubt that the JVM is aware that only 1 CPU is
> configured for the container. The JVM does not detect how many CPUs are
> configured and then restrict itself to that number. Instead, it tries to
> use as much CPU time as possible, and the limit is enforced externally
> (OS, docker, cgroup, ...).
>
>
>
> Please understand that docker containers are not virtual machines. They do
> not "pretend" to have only certain hardware. I did a simple test on my
> laptop, launching a docker container with a CPU limit configured. Inside the
> container, I could still see all my machine's CPUs.
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Wed, Aug 12, 2020 at 1:19 AM Bajaj, Abhinav <abhinav.ba...@here.com>
> wrote:
>
> Hi,
>
>
>
> Reaching out to folks running Flink on K8s.
>
>
>
> ~ Abhinav Bajaj
>
>
>
> *From: *"Bajaj, Abhinav" <abhinav.ba...@here.com>
> *Date: *Wednesday, August 5, 2020 at 1:46 PM
> *To: *Roman Grebennikov <g...@dfdx.me>, "user@flink.apache.org" <
> user@flink.apache.org>
> *Subject: *Re: Flink CPU load metrics in K8s
>
>
>
> Thanks Roman for providing the details.
>
>
>
> I also made more observations that have increased my confusion about this
> topic 😝
>
> To ease the calculations, I deployed a test cluster, this time providing 1
> CPU in K8s (with docker) for each taskmanager container.
>
>
>
> When I check the taskmanager CPU load, the value is on the order of
> "0.002158428663932657".
>
> Assuming that the underlying JVM recognizes 1 CPU allocated to the docker
> container, this value means a CPU usage in the ballpark of 0.21%.
>
>
>
> However, if I look at the K8s metrics (formula below) for this container,
> the usage turns out to be in the ballpark of 10-16%.
>
> There is no other process running in the container apart from the flink
> taskmanager.
>
>
>
> These two CPU usage values differ by more than an order of magnitude.
>
>
>
> *Am I comparing the right metrics here?*
>
> *How are folks running Flink on K8s monitoring the CPU load?*
>
>
>
> ~ Abhi
>
>
>
> *% CPU usage from K8s metrics*
>
> sum(rate(container_cpu_usage_seconds_total{pod=~"my-taskmanagers-.*",
> container="taskmanager"}[5m])) by (pod)
> / sum(container_spec_cpu_quota{pod=~"my-taskmanager-pod-.*",
> container="taskmanager"}
> / container_spec_cpu_period{pod=~"my-taskmanager-pod-.*",
> container="taskmanager"}) by (pod)
>
>
>
> *From: *Roman Grebennikov <g...@dfdx.me>
> *Date: *Tuesday, August 4, 2020 at 12:42 AM
> *To: *"user@flink.apache.org" <user@flink.apache.org>
> *Subject: *Re: Flink CPU load metrics in K8s
>
>
>
> Hi,
>
>
>
> JVM.CPU.Load is just a wrapper (MetricUtils.instantiateCPUMetrics) on top
> of OperatingSystemMXBean.getProcessCpuLoad() (see
> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad
> ).
>
>
>
> Usually it looks weird if you have multiple CPU cores. For example, if you
> have a job with a single slot fully utilizing a single CPU core on an 8-core
> machine, JVM.CPU.Load will be 1.0/8.0 = 0.125. It is also a point-in-time
> snapshot of current CPU usage, so if you collect your metrics every minute
> and the job has a spiky workload within that minute (say it is idle almost
> all the time and once a minute consumes 100% CPU for one second), you may
> miss the spike in the metrics completely.
>
>
>
> Personally, I find JVM.CPU.Time a clearer indicator of CPU usage: it is a
> monotonically increasing count of the nanoseconds of CPU time spent
> executing your code, so it also catches CPU usage spikes.
>
>
>
> Roman Grebennikov | g...@dfdx.me
>
>
>
>
>
> On Mon, Aug 3, 2020, at 23:34, Bajaj, Abhinav wrote:
>
> Hi,
>
>
>
> I am trying to understand the CPU Load metrics reported by Flink 1.7.1
> running with openjdk 1.8.0_212 on K8s.
>
>
>
> After deploying the Flink Job on K8s, I tried to get CPU Load metrics
> following this documentation
> <https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html#rest-api-integration>.
>
> curl localhost:8081/taskmanagers/7737ac33b311ea0a696422680711597b/metrics?get=Status.JVM.CPU.Load,Status.JVM.CPU.Time
>
> [{"id":"Status.JVM.CPU.Load","value":"0.0023815194093831865"},{"id":"Status.JVM.CPU.Time","value":"23260000000"}]
>
>
>
> The value of the CPU load looks odd to me.
>
>
>
> What is the unit and scale of this value?
>
> How does Flink determine this value?
>
>
>
> Appreciate your time and help here.
>
> ~ Abhinav Bajaj
>
>
>
>
>
>
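
Roman's suggestion in the quoted thread, deriving usage from the
ever-increasing JVM.CPU.Time counter instead of the point-in-time
JVM.CPU.Load, can be sketched roughly as follows. This assumes the counter is
in nanoseconds, as documented for OperatingSystemMXBean.getProcessCpuTime().
The class name, method names and sample numbers are hypothetical and only
illustrate the arithmetic.

```java
final class CpuTimeRate {
    // Average number of cores kept busy between two samples of a CPU time
    // counter (nanoseconds), taken wallNanos nanoseconds apart.
    static double coresUsed(long cpuNanosStart, long cpuNanosEnd, long wallNanos) {
        if (wallNanos <= 0) {
            throw new IllegalArgumentException("wallNanos must be positive");
        }
        return (double) (cpuNanosEnd - cpuNanosStart) / wallNanos;
    }

    // The same usage expressed as a fraction of a CPU limit given in cores.
    static double fractionOfLimit(double coresUsed, double cpuLimitCores) {
        if (cpuLimitCores <= 0) {
            throw new IllegalArgumentException("cpuLimitCores must be positive");
        }
        return coresUsed / cpuLimitCores;
    }

    public static void main(String[] args) {
        // Hypothetical samples 60 s apart; the counter advanced by 6 s of CPU time.
        long first = 23_260_000_000L;
        long second = 29_260_000_000L;
        double cores = coresUsed(first, second, 60_000_000_000L);
        System.out.println("cores used: " + cores);
        System.out.println("fraction of a 1-CPU limit: " + fractionOfLimit(cores, 1.0));
        // One fully busy core measured against 8 cores, as in Roman's example.
        System.out.println("one core of eight: " + fractionOfLimit(1.0, 8.0));
    }
}
```

Because the counter only ever grows, a one-second burst between two samples
still shows up in the difference, which a single load snapshot can miss.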

-- 

Arvid Heise | Senior Java Developer
