[ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Naganarasimha G R updated YARN-4712:
------------------------------------
    Attachment: YARN-4712-YARN-2928.v1.004.patch

Thanks for the comments, [~sjlee0]. I am reverting the changes to the trunk code and limiting the scope to YARN-2928.

bq. My position is that we should skip reporting the value rather than reporting 0.

IIUC the existing patches already take care of this: I set *cpuUsageTotalCoresPercentage* to -1 when *cpuUsagePercentPerCore* is -1, and in *NMTimelinePublisher* I skip publishing if *cpuUsageTotalCoresPercentage* is -1.

bq. Most of YARN's CPU accounting is based on cores rather than nodes/machines. IMO cpuUsagePercentPerCore would be a better value to emit. Thoughts?

IMO *cpuUsageTotalCoresPercentage* is important for gauging how much of the cluster's CPU is being utilized. If we emit only *cpuUsagePercentPerCore*, I believe aggregating it across all containers does not give the cluster's CPU usage. In fact we need to report both. Also, IMO *cpuUsageTotalCoresPercentage* is not calculated properly; it should be
{code}
cpuUsageTotalCoresPercentage = (cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()) / (nodeCpuPercentageForYARN / 100);
{code}
This way we will be able to identify what percentage of the cluster's CPU is being utilized. Should we broaden the scope of this JIRA further, or discuss this in a different JIRA?

bq. Why are we appending the process id to the metric id? Doesn't this cause issues when we do the aggregation?

Agreed, I have handled this. I believe YARN-3816 was also trying to address it, but as that might take a little more time I am addressing it as part of this JIRA.
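
For illustration only, the proposed calculation and the -1 handling described in this comment could be sketched roughly as below. This is not the code from the attached patch: the class {{CpuUsageSketch}} and its method names are hypothetical, the {{UNAVAILABLE}} constant merely mirrors the -1 value mentioned for ResourceCalculatorProcessTree, and the number of processors is passed in as a plain argument instead of being read from the resource calculator plugin.

```java
// Hypothetical, self-contained sketch; not taken from the YARN-4712 patch.
final class CpuUsageSketch {
    // Mirrors the -1 "unavailable" marker described in this issue (assumption).
    static final int UNAVAILABLE = -1;

    // Applies the formula proposed in the comment:
    // (perCore / numProcessors) / (nodeCpuPercentageForYARN / 100).
    // Propagates UNAVAILABLE instead of dividing it.
    static float totalCoresPercentage(float cpuUsagePercentPerCore,
                                      int numProcessors,
                                      int nodeCpuPercentageForYARN) {
        if (cpuUsagePercentPerCore == UNAVAILABLE) {
            return UNAVAILABLE;
        }
        return (cpuUsagePercentPerCore / numProcessors)
            / (nodeCpuPercentageForYARN / 100f);
    }

    // The publisher side would skip the metric when the value is unavailable.
    static boolean shouldPublish(float cpuUsageTotalCoresPercentage) {
        return cpuUsageTotalCoresPercentage != UNAVAILABLE;
    }

    public static void main(String[] args) {
        // Two cores fully busy (200%) on an 8-processor node where YARN
        // may use 50% of the CPU.
        float total = totalCoresPercentage(200f, 8, 50);
        System.out.println("total=" + total + " publish=" + shouldPublish(total));

        float missing = totalCoresPercentage(UNAVAILABLE, 8, 50);
        System.out.println("total=" + missing + " publish=" + shouldPublish(missing));
    }
}
```

The guard matters because dividing -1 by the processor count yields a small negative number that no longer equals -1, which is how the UNAVAILABLE check on the publisher side ends up being bypassed.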
> CPU Usage Metric is not captured properly in YARN-2928
> ------------------------------------------------------
>
>                 Key: YARN-4712
>                 URL: https://issues.apache.org/jira/browse/YARN-4712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4712-YARN-2928.v1.001.patch,
> YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch,
> YARN-4712-YARN-2928.v1.004.patch
>
>
> There are 2 issues with CPU usage collection:
> * I observed that many times the CPU usage returned by
> {{pTree.getCpuUsagePercent()}} is
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor
> still does the calculation, i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore
> /resourceCalculatorPlugin.getNumProcessors()}}, because of which the UNAVAILABLE
> check in {{NMTimelinePublisher.reportContainerResourceUsage}} is never
> hit. Proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but
> ContainersMonitor is publishing decimal values for the CPU usage.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
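
As a rough illustration of the second issue in the description (a long-only converter receiving decimal CPU values), one possible approach is to round the value to a long before publishing it. The sketch below is an assumption for illustration, not a statement of what the attached patches do; {{CpuMetricRoundingSketch}} and {{toStorableValue}} are hypothetical names.

```java
// Hypothetical, self-contained sketch; not taken from the YARN-4712 patch.
final class CpuMetricRoundingSketch {
    // Rounds a decimal CPU percentage to the nearest long so that a
    // long-only storage converter can hold it without a type mismatch.
    static long toStorableValue(float cpuPercent) {
        // Math.round(double) returns a long.
        return Math.round((double) cpuPercent);
    }

    public static void main(String[] args) {
        float cpuPercent = 12.6f;
        // A plain cast truncates toward zero, while rounding picks the
        // nearest whole number.
        System.out.println("cast=" + (long) cpuPercent
            + " rounded=" + toStorableValue(cpuPercent));
    }
}
```

Rounding loses the fractional part, so whether that precision matters for aggregated CPU metrics is a design question for the issue itself.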