[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:
----------------------------

    Attachment: YARN-415.201407242148.txt

[~leftnoteasy]

Thank you very much for helping me review this.

{quote}
1) RMAppAttemptImpl.java
 1.1 There're some irrelevant line changes in RMAppAttemptImpl, could you 
please revert them? Like
{code}
           RMAppAttemptEventType.RECOVER, new AttemptRecoveredTransition())
-          
+
{code}
{quote}
Changes completed.

{quote}
1.2 getResourceUtilization:
{code}
+    if (rmApps != null) {
+      RMApp app = rmApps.get(attemptId.getApplicationId());
+      if (app != null) {
{code}
I think the two cannot happen, we don't need check null to avoid potential bug 
here
{quote}
Changes completed.

{quote}
{code}
+          ApplicationResourceUsageReport appResUsageRpt =
{code}

It's better to name it appResUsageReport since rpt is not a common abbr of 
report.
{quote}
Changes completed.

{quote}
2) RMContainerImpl.java
 2.1 updateAttemptMetrics:

{code}
      if (rmApps != null) {
        RMApp rmApp = 
            rmApps.get(container.getApplicationAttemptId().getApplicationId());
        if (rmApp != null) {
{code}

Again, I think the two null check is unnecessary
{quote}
I was able to remove the rmApps variable, but I had to leave the check for 
{{app != null}} because if I try to take that out, several unit tests would 
fail with NullPointerException. Even with removing the rmApps variable, I 
needed to change TestRMContainerImpl.java to mock rmContext.getRMApps().

{quote}
3) SchedulerApplicationAttempt.java
 3.1 Some rename suggestions: (Please let me know if you have better idea)
 CACHE_MILLI -> MEMORY_UTILIZATION_CACHE_MILLISECONDS
 lastTime -> lastMemoryUtilizationUpdateTime
 cachedMemorySeconds -> lastMemorySeconds
 same for cachedVCore ...
{quote}
Changes complete.

{quote}
4) AppBlock.java
 Should we rename "Resource Seconds:" to "Resource Utilization" or something?
{quote}
I changed it as you suggested. It feels like there should be something that 
would describe it better, but I can't think of anything right now.

{quote}
5) Test
 5.1 I'm wondering if we need add a end to end test, since we changed 
RMAppAttempt/RMContainerImpl/SchedulerApplicationAttempt.
 It can consist submit an application, launch several containers, and finish 
application. And it's better to make the launched application contains several 
application attempt.
 While the application running, there're muliple containers running, and 
multiple containers finished. We can check if total resource utilization are 
expected.
{quote}
I'm still working on the unit tests as you suggested, but I wanted to get the 
rest of the patch up first so you can look at it :-)

{quote}
bq. One thing I did notice when these values are cached is that there is a race 
where containers can get counted twice:

I think this can not be avoid, it should be a transient state and Jian He and I 
discussed about this long time before.
 But apparently, 3 sec cache make it not only a transient state. I suggest you 
can make "lastTime" in SchedulerApplicationAttempt protected. And in 
FiCaSchedulerApp/FSSchedulerApp, when remove container from liveContainer (in 
completedContainer method). You can set lastTime to a negtive value like -1, 
and next time when trying to get accumulated resource utilization, it will 
recompute all container utilization.
{quote}
I made the changes to {{lastTime}} as you suggested. I agree that there will 
always be a possibility that the container will still be in the 
{{liveContainers}} list for a very brief period after the container has 
finished. With the cached values the way I had them before, this gap was 
noticeable in the resource calculations. Your suggested changes brought the 
race back down even for the cached values.


> Capture memory utilization at the app-level for chargeback
> ----------------------------------------------------------
>
>                 Key: YARN-415
>                 URL: https://issues.apache.org/jira/browse/YARN-415
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>    Affects Versions: 0.23.6
>            Reporter: Kendall Thrapp
>            Assignee: Andrey Klochkov
>         Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
> YARN-415.201407242148.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to