[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193732#comment-15193732 ]
Sangjin Lee commented on YARN-4712:
-----------------------------------

I also think that {{cpuUsagePercentPerCore}} might be a better metric to record than {{cpuUsageTotalCoresPercentage}}.

One way to understand the difference is that with the former the unit is the core, and with the latter it is the machine. Other aspects are the same. It follows that {{cpuUsagePercentPerCore}} is a finer-grained value than {{cpuUsageTotalCoresPercentage}}. For example, to come up with the relative utilization of an app against the full cluster, you need the number of cores as the denominator with the former, and the number of machines with the latter. Granted, obtaining the number of cores can be more difficult than obtaining the number of machines.

Either model breaks down when those units are no longer interchangeable. With {{cpuUsageTotalCoresPercentage}}, the values are inaccurate if the machines are not of equal size (e.g. machines with different numbers of cores). With {{cpuUsagePercentPerCore}}, the reported cluster utilization can be inaccurate if clock speeds differ between machines.

[1] cpuUsagePercentPerCore
- pro: more accurate and finer-grained reporting of utilization
- con: requires the number of cores to come up with the cluster-wide utilization of anything
- con: still doesn’t account for different core performance

[2] cpuUsageTotalCoresPercentage
- pro: easier to come up with cluster-wide utilization
- con: coarser-grained metric that breaks down the moment machines are not equivalent

[other points]

[1] stick with pure utilization

One point to consider is whether we should take into account the available capacity as opposed to the full machine capacity. There are a couple of ways the available capacity can differ from the full capacity. One is via {{nodeCpuPercentageForYARN}} (coming from the cpu-limit config). Another is via the allocated vcores mechanism.
Either way, one may, for example, allocate only 6 cores out of an 8-core machine. If a container is using 6 cores, the question is whether that should be reported as 100% utilization or 75% utilization. Although an argument can be made for either outcome, I think it might be simpler to stick with a pure utilization approach. It would be easier to match those numbers against CPU measurements coming from direct means. We should consider CPU reported by the NM as plain utilization numbers.

[2] stick with physical cores vs. vcores

Another potentially complicating factor is whether we should consider using vcores. Using vcores would put this closer to YARN’s resource scheduling model. However, IMO it would make things unnecessarily more complicated. Again, in the vein of treating the CPU as plain utilization that can be matched against the direct measurements, I think we should stick with physical cores.

Thoughts?

> CPU Usage Metric is not captured properly in YARN-2928
> ------------------------------------------------------
>
>                 Key: YARN-4712
>                 URL: https://issues.apache.org/jira/browse/YARN-4712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4712-YARN-2928.v1.001.patch, YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch,
>                      YARN-4712-YARN-2928.v1.004.patch, YARN-4712-YARN-2928.v1.005.patch
>
>
> There are 2 issues with CPU usage collection
> * I observed that the CPU usage obtained from {{pTree.getCpuUsagePercent()}} is often ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor still does the calculation {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE check in {{NMTimelinePublisher.reportContainerResourceUsage}} is never triggered.
> So proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but ContainersMonitor is publishing decimal values for the CPU usage.
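
To make the arithmetic in this thread concrete, here is a minimal Java sketch of the two metrics and of the UNAVAILABLE problem from the issue description. It is not taken from the YARN code base or from any of the attached patches: the class {{CpuUsageSketch}} and all of its method names are hypothetical, the {{UNAVAILABLE}} constant is a local stand-in for ResourceCalculatorProcessTree.UNAVAILABLE, and rounding before publishing is only one assumed way of coping with a long-only converter.

```java
/**
 * Illustrative sketch only. Apart from the identifiers quoted in the issue
 * (UNAVAILABLE, cpuUsagePercentPerCore, cpuUsageTotalCoresPercentage),
 * every name here is hypothetical and not taken from the YARN code base.
 */
final class CpuUsageSketch {

    /** Local stand-in for ResourceCalculatorProcessTree.UNAVAILABLE (-1 per the issue). */
    static final int UNAVAILABLE = -1;

    /**
     * Converts a per-core percent (100 = one fully busy core) into a
     * whole-machine percent (100 = every core busy). UNAVAILABLE is passed
     * through untouched so a later UNAVAILABLE check can still recognize it.
     */
    static float toTotalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
        if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
            return UNAVAILABLE;
        }
        return cpuUsagePercentPerCore / numProcessors;
    }

    /** Cluster-wide percent from per-core values: the denominator is the total core count. */
    static double clusterPercentFromPerCore(float[] perCorePercents, int[] coresPerMachine) {
        double usedCorePercent = 0;
        for (float p : perCorePercents) {
            usedCorePercent += p;
        }
        int totalCores = 0;
        for (int c : coresPerMachine) {
            totalCores += c;
        }
        return usedCorePercent / totalCores;
    }

    /** Cluster-wide percent from whole-machine values: the denominator is the machine count. */
    static double clusterPercentFromTotalCores(float[] totalCoresPercents) {
        double sum = 0;
        for (float p : totalCoresPercents) {
            sum += p;
        }
        return sum / totalCoresPercents.length;
    }

    /** Assumed workaround for a long-only metric column: round before publishing. */
    static long toLongMetric(float percent) {
        return Math.round((double) percent);
    }

    public static void main(String[] args) {
        // 6 busy cores on an 8-core machine: 600 per-core percent, 75 whole-machine percent.
        System.out.println(toTotalCoresPercentage(600f, 8));
        // Without the guard this would be -1 / 8 = -0.125 and slip past an UNAVAILABLE check.
        System.out.println(toTotalCoresPercentage(UNAVAILABLE, 8));
        // An app using 4 cores on a 4-core machine and 4 cores on a 16-core machine (8 of 20 cores).
        System.out.println(clusterPercentFromPerCore(new float[] {400f, 400f}, new int[] {4, 16}));
        System.out.println(clusterPercentFromTotalCores(new float[] {100f, 25f}));
    }
}
```

With the unequal machines in the last two calls, the aggregates disagree: summing per-core values over 20 cores gives 40, which matches 8 busy cores out of 20, while averaging the whole-machine values over 2 machines gives 62.5. That is the "machines are not equivalent" breakdown described in the comment.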