[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Payne updated YARN-415: ---------------------------- Attachment: YARN-415.201407242148.txt [~leftnoteasy] Thank you very much for helping me review this. {quote} 1) RMAppAttemptImpl.java 1.1 There're some irrelevant line changes in RMAppAttemptImpl, could you please revert them? Like {code} RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition()) - + {code} {quote} Changes completed. {quote} 1.2 getResourceUtilization: {code} + if (rmApps != null) { + RMApp app = rmApps.get(attemptId.getApplicationId()); + if (app != null) { {code} I think the two cannot happen, we don't need check null to avoid potential bug here {quote} Changes completed. {quote} {code} + ApplicationResourceUsageReport appResUsageRpt = {code} It's better to name it appResUsageReport since rpt is not a common abbr of report. {quote} Changes completed. {quote} 2) RMContainerImpl.java 2.1 updateAttemptMetrics: {code} if (rmApps != null) { RMApp rmApp = rmApps.get(container.getApplicationAttemptId().getApplicationId()); if (rmApp != null) { {code} Again, I think the two null check is unnecessary {quote} I was able to remove the rmApps variable, but I had to leave the check for {{app != null}} because if I try to take that out, several unit tests would fail with NullPointerException. Even with removing the rmApps variable, I needed to change TestRMContainerImpl.java to mock rmContext.getRMApps(). {quote} 3) SchedulerApplicationAttempt.java 3.1 Some rename suggestions: (Please let me know if you have better idea) CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS lastTime -> lastMemoryUtilizationUpdateTime cachedMemorySeconds -> lastMemorySeconds same for cachedVCore ... {quote} Changes complete. {quote} 4) AppBlock.java Should we rename "Resource Seconds:" to "Resource Utilization" or something? {quote} I changed it as you suggested. It feels like there should be something that would describe it better, but I can't think of anything right now. {quote} 5) Test 5.1 I'm wondering if we need add a end to end test, since we changed RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt. It can consist submit an application, launch several containers, and finish application. And it's better to make the launched application contains several application attempt. While the application running, there're muliple containers running, and multiple containers finished. We can check if total resource utilization are expected. {quote} I'm still working on the unit tests as you suggested, but I wanted to get the rest of the patch up first so you can look at it :-) {quote} bq. One thing I did notice when these values are cached is that there is a race where containers can get counted twice: I think this can not be avoid, it should be a transient state and Jian He and I discussed about this long time before. But apparently, 3 sec cache make it not only a transient state. I suggest you can make "lastTime" in SchedulerApplicationAttempt protected. And in FiCaSchedulerApp/FSSchedulerApp, when remove container from liveContainer (in completedContainer method). You can set lastTime to a negtive value like -1, and next time when trying to get accumulated resource utilization, it will recompute all container utilization. {quote} I made the changes to {{lastTime}} as you suggested. I agree that there will always be a possibility that the container will still be in the {{liveContainers}} list for a very brief period after the container has finished. With the cached values the way I had them before, this gap was noticeable in the resource calculations. Your suggested changes brought the race back down even for the cached values. > Capture memory utilization at the app-level for chargeback > ---------------------------------------------------------- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager > Affects Versions: 0.23.6 > Reporter: Kendall Thrapp > Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)