Hi Piper,

I personally like looking at the system load average (as long as Flink is
the only major process on the machine). It nicely captures the "stress"
Flink puts on the system; this is the "System.CPU.Load5min" class of
metrics. There are plenty of articles on interpreting Linux load averages.

I don't think there's anything built into Flink for getting the CPU
utilization across the whole cluster.

For the difference between the two REST endpoints: according to the Flink
documentation, (1) captures the CPU usage of the JVM process (with the
issue Roman described), while (2) captures the overall system CPU usage:
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#cpu
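
On turning Status.JVM.CPU.Time into a utilization figure: one possible
approach is to sample the counter twice and divide the CPU time consumed by
the wall-clock time elapsed. Below is a minimal sketch, not something Flink
ships. The function names are made up for illustration, and it assumes that
the counter is reported in nanoseconds (it wraps getProcessCpuTime) and that
the REST endpoint returns a JSON list of {"id": ..., "value": ...} entries.
Please verify both against your Flink version.

```python
import json


def parse_metric(body, metric_id):
    """Extract one metric value from a metrics REST response body.

    Assumes the body is a JSON list of {"id": ..., "value": ...} objects.
    """
    for entry in json.loads(body):
        if entry["id"] == metric_id:
            return float(entry["value"])
    raise KeyError(metric_id)


def cpu_utilization(cpu_start_ns, cpu_end_ns, wall_start_s, wall_end_s,
                    num_cores=1):
    """Average CPU utilization between two samples of a cumulative counter.

    cpu_*_ns   : cumulative process CPU time (assumed to be nanoseconds)
    wall_*_s   : wall-clock timestamps of the two samples, in seconds
    num_cores  : divide by the core count to get a fraction of the machine
    """
    elapsed_ns = (wall_end_s - wall_start_s) * 1e9
    if elapsed_ns <= 0:
        raise ValueError("second sample must be taken after the first")
    used_ns = cpu_end_ns - cpu_start_ns
    return used_ns / (elapsed_ns * num_cores)
```

With num_cores=1 the result is in units of "cores busy"; with the machine's
core count it is a fraction of total capacity, comparable to JVM.CPU.Load.
Because the counter is cumulative, a short spike between two samples still
shows up in the difference. For a job spread over several nodes you would
have to compute this per TaskManager and combine the numbers yourself.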

Best,
Robert


On Thu, Sep 10, 2020 at 11:08 PM Piper Piper <piperfl...@gmail.com> wrote:

> Hello,
>
> What is the best way to measure the CPU utilization of a TaskManager in
> Flink, as opposed to using Linux's "top" command? Is querying the REST
> endpoint
> http://<IP>:<port>/taskmanagers/<TM_ID>/metrics?get=Status.JVM.CPU.Load
> the best option? Roman's reply (copied below) from the archives suggests
> that it returns the CPU usage for the whole system including
> other processes currently in the system, and would not give the CPU
> utilization only of that Task Manager.
>
> Based on Roman's reply that JVM.CPU.Time is a more clear indicator of CPU
> usage, can you suggest how I would use it to calculate CPU utilization? Is
> there any way I can get the CPU utilization for a Job that is distributed
> over several nodes in the cluster?
>
> Also, what is the difference between the two REST API endpoints below:
>
> 1. http://<IP>:<port>/taskmanagers/<TM_ID>/metrics?get=Status.JVM.CPU.Load
> 2. http://<IP>:<port>/taskmanagers/<TM_ID>/metrics?get=System.CPU.Usage
>
> Thanks,
>
> Piper
>
> Hi,
>
> JVM.CPU.Load is just a wrapper (MetricUtils.instantiateCPUMetrics) on top of
> OperatingSystemMXBean.getProcessCpuLoad() (see
> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad).
>
> It usually looks odd if you have multiple CPU cores. For example, if you
> have a job with a single slot fully utilizing one CPU core on an 8-core
> machine, JVM.CPU.Load will be 1.0/8.0 = 0.125. It is also a point-in-time
> snapshot of the current CPU usage, so if you collect your metrics once a
> minute and the job has a spiky workload within that minute (say it is idle
> almost all the time and consumes 100% CPU for one second once a minute),
> you may miss the spike in the metrics completely.
>
> Personally, I find JVM.CPU.Time a clearer indicator of CPU usage: it is a
> monotonically increasing counter of the CPU time spent executing your
> code, so it also captures CPU usage spikes.
>
> Roman Grebennikov | g...@dfdx.me
>
>
