[ 
https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193732#comment-15193732
 ] 

Sangjin Lee commented on YARN-4712:
-----------------------------------

I also think that {{cpuUsagePercentPerCore}} might be a better metric to record 
than {{cpuUsageTotalCoresPercentage}}.

One way to understand the difference is that with the former the unit is the 
core, and with the latter it is the machine. In all other respects they are 
the same. It follows that {{cpuUsagePercentPerCore}} is a finer-grained value 
than {{cpuUsageTotalCoresPercentage}}.

For example, to come up with a relative utilization of an app against the full 
cluster, you need the number of cores as the denominator with the former, and 
the number of machines with the latter. Granted, obtaining the number of cores 
can be more difficult than the number of machines.
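
As a rough illustration of the two denominators, here is a minimal sketch. The 
class, the method names and the cluster shape (4 machines with 8 cores each) 
are made up for this example and are not part of the patch. It assumes a 
per-core value of 100 means one fully used core, and a total-cores value of 
100 means one fully used machine.

```java
class ClusterShare {
    // Input: cpuUsagePercentPerCore summed over an app's containers.
    // 100 means one fully used core, so the denominator is the core count.
    static double shareFromPerCore(double[] perCorePercents, int totalClusterCores) {
        double sum = 0;
        for (double p : perCorePercents) {
            sum += p;
        }
        return sum / (100.0 * totalClusterCores);
    }

    // Input: cpuUsageTotalCoresPercentage summed over an app's containers.
    // 100 means one fully used machine, so the denominator is the machine count.
    static double shareFromTotalCores(double[] totalCoresPercents, int machines) {
        double sum = 0;
        for (double p : totalCoresPercents) {
            sum += p;
        }
        return sum / (100.0 * machines);
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 4 machines x 8 cores = 32 cores.
        // Two containers keep 2 cores and 6 cores busy.
        double[] perCore = {200.0, 600.0};
        double[] totalCores = {200.0 / 8, 600.0 / 8}; // 25% and 75% of a machine

        System.out.println(shareFromPerCore(perCore, 32));      // 0.25
        System.out.println(shareFromTotalCores(totalCores, 4)); // 0.25
    }
}
```

On a uniform cluster both metrics give the same cluster share; only the 
denominator you need to know differs.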

Either model breaks down when those units are no longer interchangeable. For 
example, {{cpuUsageTotalCoresPercentage}} yields inaccurate values if the 
machines are not of equal size (e.g. machines with different numbers of 
cores). {{cpuUsagePercentPerCore}} can report inaccurate utilization of the 
cluster if clock speeds differ between machines.
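
To make the unequal-machine case concrete, here is a small sketch with a 
made-up cluster of one 4-core and one 16-core machine, each running one 
container that keeps 2 cores busy. All names and numbers are illustrative 
assumptions. Aggregating the whole-machine percentages over the machine count 
overstates the share of cores actually in use.

```java
class HeterogeneousCluster {
    // Hypothetical cluster: machine 0 has 4 cores, machine 1 has 16 cores.
    static final int[] CORES_PER_MACHINE = {4, 16};

    // Share of all cluster cores in use, from per-core percentages
    // (index i = the container on machine i, 100 = one busy core).
    static double shareFromPerCore(double[] perCorePercents) {
        double used = 0;
        int totalCores = 0;
        for (int i = 0; i < perCorePercents.length; i++) {
            used += perCorePercents[i];
            totalCores += CORES_PER_MACHINE[i];
        }
        return used / (100.0 * totalCores);
    }

    // Same usage, but first converted to whole-machine percentages and then
    // averaged over the number of machines.
    static double shareFromTotalCores(double[] perCorePercents) {
        double sum = 0;
        for (int i = 0; i < perCorePercents.length; i++) {
            sum += perCorePercents[i] / CORES_PER_MACHINE[i];
        }
        return sum / (100.0 * CORES_PER_MACHINE.length);
    }

    public static void main(String[] args) {
        double[] perCore = {200.0, 200.0}; // 2 busy cores on each machine

        // 4 busy cores out of 20: about 0.20
        System.out.println(shareFromPerCore(perCore));
        // (50% + 12.5%) / 2 machines: 0.3125
        System.out.println(shareFromTotalCores(perCore));
    }
}
```

The per-core metric has the analogous problem only when the cores themselves 
differ (e.g. in clock speed), which this sketch does not model.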

[1] cpuUsagePercentPerCore
- pro: more accurate and finer-grained reporting of utilization
- con: requires the number of cores to come up with the cluster-wide 
utilization of anything
- con: still doesn’t account for different core performance

[2] cpuUsageTotalCoresPercentage
- pro: easier to come up with cluster-wide utilization
- con: coarser-grained metric that breaks down the moment machines are not 
equivalent

[other points]
[1] stick with pure utilization
One point to consider is whether we should take into account the available 
capacity as opposed to the full machine capacity. There are a couple of ways 
the available capacity can differ from the full capacity. One is via 
{{nodeCpuPercentageForYARN}} (coming from the cpu-limit config). Another is 
via the allocated vcores. Either way, for example, one may allocate only 6 
cores out of an 8-core machine. If a container is using 6 cores, the question 
is whether that should be reported as 100% utilization or 75% utilization.
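
The two possible answers differ only in the denominator. A tiny sketch of the 
arithmetic, with a hypothetical helper name:

```java
class CapacityBasis {
    // Utilization in percent, given the cores in use and the capacity
    // chosen as the denominator.
    static double percentOf(double coresUsed, double capacityCores) {
        return 100.0 * coresUsed / capacityCores;
    }

    public static void main(String[] args) {
        double physicalCores = 8; // full machine capacity
        double coresForYarn = 6;  // capacity made available to YARN
        double coresUsed = 6;     // what the container is actually using

        // Pure utilization against the full machine
        System.out.println(percentOf(coresUsed, physicalCores)); // 75.0
        // Utilization against the capacity available to YARN
        System.out.println(percentOf(coresUsed, coresForYarn));  // 100.0
    }
}
```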

Although an argument can be made for either outcome, I think it might be 
simpler to stick with a pure utilization approach. It would be easier to match 
those numbers against CPU measurements coming from direct means. We should 
consider CPU reported by the NM as plain utilization numbers.

[2] stick with physical cores vs. vcores
Another potentially complicating factor is whether we should consider using 
vcores. Using vcores would put this closer to YARN’s resource scheduling model. 
However, IMO it would make things unnecessarily more complicated.

Again, in the vein of treating the CPU as plain utilization that can be matched 
against the direct measurements, I think we should stick with physical cores.

Thoughts?

> CPU Usage Metric is not captured properly in YARN-2928
> ------------------------------------------------------
>
>                 Key: YARN-4712
>                 URL: https://issues.apache.org/jira/browse/YARN-4712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4712-YARN-2928.v1.001.patch, 
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch, 
> YARN-4712-YARN-2928.v1.004.patch, YARN-4712-YARN-2928.v1.005.patch
>
>
> There are 2 issues with CPU usage collection:
> * I observed that the CPU usage obtained from 
> {{pTree.getCpuUsagePercent()}} is often 
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor 
> still does the calculation, i.e. {{cpuUsageTotalCoresPercentage = 
> cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}, so 
> the UNAVAILABLE check in 
> {{NMTimelinePublisher.reportContainerResourceUsage}} is never triggered. 
> Proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but 
> ContainersMonitor publishes decimal values for the CPU usage.
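
The first issue in the description above can be reproduced with plain 
arithmetic. The sketch below is standalone and does not use the Hadoop 
classes: the {{UNAVAILABLE}} constant is a local stand-in for 
ResourceCalculatorProcessTree.UNAVAILABLE (-1, as stated above), and the 
method name is made up. The rounding at the end is only one possible way to 
fit a decimal value into a long-valued column, not necessarily what the 
patches do.

```java
class CpuUsageGuard {
    // Local stand-in for ResourceCalculatorProcessTree.UNAVAILABLE.
    static final int UNAVAILABLE = -1;

    // Divide by the processor count only when a real measurement exists,
    // so the UNAVAILABLE marker survives for later checks.
    static float totalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
        if (cpuUsagePercentPerCore == UNAVAILABLE) {
            return UNAVAILABLE;
        }
        return cpuUsagePercentPerCore / numProcessors;
    }

    public static void main(String[] args) {
        // Unguarded division turns -1 into -0.125, which no longer
        // matches the UNAVAILABLE marker.
        System.out.println(-1f / 8 == UNAVAILABLE);                      // false
        // The guarded version keeps the marker intact.
        System.out.println(totalCoresPercentage(-1f, 8) == UNAVAILABLE); // true
        // 350 / 8 = 43.75; rounded to the nearest whole number it is 44.
        System.out.println(Math.round(totalCoresPercentage(350f, 8)));   // 44
    }
}
```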



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
