[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-03-08 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391687#comment-16391687
 ] 

Wangda Tan commented on YARN-4488:
--

[~ywskycn], apologies that I missed your ping in YARN-7844. I think it will be
useful to understand the latency between a resource request being added and the
container being allocated. However, as I commented above:
https://issues.apache.org/jira/browse/YARN-4488?focusedCommentId=16382744&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16382744,
recording a more accurate delay could be expensive. IMHO, if we don't record an
accurate delay, the metrics will not be useful.

So my personal suggestion is: can we check whether it is possible to put the
container allocation delay code into an isolated module, and enable it only on
demand (just like the approach of YARN-7844)? [~maniraj...@gmail.com], does that
make sense to you? Could you investigate this option?
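
One way to picture the "isolated module, enabled only on demand" idea is a small recorder interface with a no-op default, so the scheduler's hot path pays almost nothing when the feature is off. This is only a sketch under assumptions: {{DelayRecorder}}, {{DelayRecorders}} and the {{enabled}} flag are hypothetical names, not classes or configs from YARN or from YARN-7844.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical interface: the only thing the scheduler would call. */
interface DelayRecorder {
  void record(long delayMillis);
  long count();
  long totalMillis();
}

/** Hypothetical factory that picks the implementation once, at startup. */
final class DelayRecorders {
  private DelayRecorders() { }

  /** Used when the feature is disabled: every call is a no-op. */
  static final DelayRecorder NOOP = new DelayRecorder() {
    @Override public void record(long delayMillis) { }
    @Override public long count() { return 0; }
    @Override public long totalMillis() { return 0; }
  };

  /** The enabled flag stands in for an assumed on-demand config switch. */
  static DelayRecorder create(boolean enabled) {
    if (!enabled) {
      return NOOP;
    }
    return new DelayRecorder() {
      // LongAdder keeps concurrent updates cheap.
      private final LongAdder ops = new LongAdder();
      private final LongAdder total = new LongAdder();

      @Override public void record(long delayMillis) {
        ops.increment();
        total.add(delayMillis);
      }
      @Override public long count() { return ops.sum(); }
      @Override public long totalMillis() { return total.sum(); }
    };
  }
}
```

With this shape the caller always invokes {{record(...)}}; whether anything is stored depends only on which implementation was created.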

> CapacityScheduler: Compute per-container allocation latency and roll up to 
> get per-application and per-queue
> 
>
> Key: YARN-4488
> URL: https://issues.apache.org/jira/browse/YARN-4488
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Karthik Kambatla
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-4485.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-03-08 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391597#comment-16391597
 ] 

Wei Yan commented on YARN-4488:
---

Thanks for pinging, [~leftnoteasy]. I created YARN-7844 previously, which
mostly exposes related metrics at the scheduler level, including (may not be
fully included in YARN-7844.001.patch) various scheduler ops (node_add,
node_remove, allocate, update...) and event queue size. This set of metrics
would help us understand whether the RM scheduler is under pressure, what the
throughput of the scheduler is, and whether the scheduler itself becomes a
system bottleneck.

For this JIRA, the scheduling delay for a container or an application can vary
for different reasons: the scheduler itself, resource availability, queue
configs... I'm not sure how we can use this info in prod to tune queue configs.
In our prod env, the top complaint from customers is that their jobs take a long
time to run, mostly because their queues are short of resources, which is
already covered by existing metrics (tracking available resources for each
queue).








[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-03-01 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382744#comment-16382744
 ] 

Wangda Tan commented on YARN-4488:
--

[~maniraj...@gmail.com], thanks for the explanation; I understand the
approach better now.

Regarding the metrics, here's the behavior I expect:

The delay of a container should be T1 (container_allocated_time) - T2 (requested
time). In your proposal, T2 is {{time while creating ResourceRequest object}},
which does not look correct to me. We have to consider a more complex case.

What I expect:
{code:java}
(time=1) An app has a resource request asking for 5 * 2G containers
(time=3) 3 containers allocated, delay of the 3 containers = 3-1 = 2. Pending ask = 2
(time=5) App requested 10 containers (instead of 2) on the same priority.
(time=7) 5 containers allocated:
         2 containers (from the original ask) have delay = 7-1 = 6
         3 containers (from the additional ask) have delay = 7-5 = 2
{code}
This is a common scenario for apps that make additional asks for failed
containers (for example MR): if a container fails, the app asks for additional
containers using the same priority (FAILED_MAPPER_PRIORITY), so we should
consider it.
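
The expected behavior in the timeline above can be sketched as a per-priority FIFO of (request time, pending count) entries. This is a minimal illustration, not code from the patch: {{PendingAskTracker}} is a hypothetical class, it assumes the ask passed in is the absolute pending total for the priority, it assumes allocations are charged to the oldest ask first, and it assumes a shrinking ask cancels the newest entries first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper, not part of YARN: pending asks for one priority. */
class PendingAskTracker {
  // Each entry is {requestTime, containersStillPending}, oldest first.
  private final Deque<long[]> pending = new ArrayDeque<>();
  private long totalPending = 0;

  /** The app sends its ask; newTotal is the absolute pending count. */
  void updateAsk(long now, long newTotal) {
    if (newTotal > totalPending) {
      // Only the newly added containers get the new timestamp.
      pending.addLast(new long[] {now, newTotal - totalPending});
    } else {
      // Assumption: a shrinking ask cancels the newest entries first.
      long toDrop = totalPending - newTotal;
      while (toDrop > 0 && !pending.isEmpty()) {
        long[] last = pending.peekLast();
        long dropped = Math.min(toDrop, last[1]);
        last[1] -= dropped;
        toDrop -= dropped;
        if (last[1] == 0) {
          pending.removeLast();
        }
      }
    }
    totalPending = newTotal;
  }

  /** Allocates up to count containers, oldest ask first; returns the delays. */
  List<Long> allocate(long now, int count) {
    List<Long> delays = new ArrayList<>();
    while (count > 0 && !pending.isEmpty()) {
      long[] first = pending.peekFirst();
      delays.add(now - first[0]);
      first[1]--;
      totalPending--;
      count--;
      if (first[1] == 0) {
        pending.removeFirst();
      }
    }
    return delays;
  }
}
```

Replaying the timeline ({{updateAsk(1, 5)}}, {{allocate(3, 3)}}, {{updateAsk(5, 10)}}, {{allocate(7, 5)}}) yields delays of 2, 2, 2 and then 6, 6, 2, 2, 2, matching the numbers above. The cost is one queue entry per distinct request time.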

The downside of this approach is that it needs additional memory to record an
accurate requested time for each resource request. An alternative approach is
to remember an average requested time for each priority. (Assume we have X
containers requested at T1 and Y additional containers requested at T2; the
average time will be {{(X * T1 + Y * T2) / (X + Y)}}).
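
The averaging alternative can be sketched with two fields per priority. Again this is only an illustration under assumptions: {{AverageRequestTime}} is a hypothetical class, and it assumes the caller passes the number of newly added containers rather than the absolute total.

```java
/** Hypothetical helper: one averaged request time per priority. */
class AverageRequestTime {
  private long pending = 0;      // containers still pending
  private double avgTime = 0.0;  // weighted average request time

  /** Adds `added` containers requested at time `now`. */
  void addAsk(long now, long added) {
    if (added <= 0) {
      return;
    }
    // (X * T1 + Y * T2) / (X + Y), with X = pending and T1 = avgTime.
    avgTime = (pending * avgTime + added * (double) now) / (pending + added);
    pending += added;
  }

  /** Allocates up to `count` containers; all share the same averaged delay. */
  double allocate(long now, long count) {
    if (pending == 0) {
      return 0.0; // nothing was pending, so there is no delay to report
    }
    pending -= Math.min(count, pending);
    return now - avgTime;
  }
}
```

On the same timeline as above, 2 containers pending since time 1 plus 8 added at time 5 average to a request time of 4.2, so each of the 5 containers allocated at time 7 is charged a delay of about 2.8, whereas exact tracking gives a mean of 3.6 (6, 6, 2, 2, 2). That gap is the accuracy given up in exchange for constant memory.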

*Regarding implementation:*

I'm not sure whether massive changes are required; let's figure out the
semantics of the delay first, and look at the implementation later.

+ [~sunilg]

+ [~ywskycn] to the thread: You pinged me offline about metrics-related stuff
before; I think you might be interested in this Jira.








[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-03-01 Thread Manikandan R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381997#comment-16381997
 ] 

Manikandan R commented on YARN-4488:


[~leftnoteasy] Have you had a chance to review this approach? Can you please
check?








[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-02-21 Thread Manikandan R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371732#comment-16371732
 ] 

Manikandan R commented on YARN-4488:


[~leftnoteasy] Thanks.

The overall approach (as described in YARN-4485) is:
 # Set the request start time while creating the ResourceRequest object.
 # Once the container is allocated, subtract #1 from now() to compute the
time taken (duration).
 # Use #2 to update the corresponding queue metrics. For example, for a job
using 2 containers, the newly added queue metrics would look like

{"ContainerAllocationDelayNumOps" : 2, "ContainerAllocationDelayAvgTime" : 1755.5}
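
The three steps above can be sketched in plain Java as follows. This is an illustration only: {{AllocationDelayMetric}} and its methods are hypothetical names, the clock is injected as a {{LongSupplier}} so the sketch does not depend on the Hadoop Clock classes, and the real patch would presumably update {{QueueMetrics}} rather than a standalone object.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the NumOps/AvgTime pair on a queue. */
class AllocationDelayMetric {
  private final LongSupplier clock; // assumed time source, in milliseconds
  private long numOps = 0;
  private long totalMillis = 0;

  AllocationDelayMetric(LongSupplier clock) {
    this.clock = clock;
  }

  /** Step 1: stamp the request with its start time. */
  long requestCreated() {
    return clock.getAsLong();
  }

  /** Steps 2 and 3: compute the duration and fold it into the metric. */
  void containerAllocated(long requestStartTime) {
    totalMillis += clock.getAsLong() - requestStartTime;
    numOps++;
  }

  long getNumOps() {
    return numOps;
  }

  double getAvgTime() {
    return numOps == 0 ? 0.0 : (double) totalMillis / numOps;
  }
}
```

In production the clock could be {{System::currentTimeMillis}}; a test can pass a controllable supplier instead. As a made-up illustration, two allocations with delays of 1500 ms and 2011 ms would report NumOps = 2 and AvgTime = 1755.5.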

As described in YARN-7902, to set the start time in ResourceRequest (which is
in the hadoop-yarn-api package), the SystemClock class and the Clock interface
it depends on must be in hadoop-common instead of hadoop-yarn-common (which
depends on the hadoop-yarn-api package). Hence, the idea is to move the
SystemClock class etc. to the hadoop-common package so that it can be used by
any package (not only those related to YARN). Subsequently, we will need to
make the corresponding import changes in all classes where those clock-related
classes are used. Thoughts? If it makes sense, can we also move other classes
such as MonotonicClock?

The main changes are in the {{AppSchedulingInfo}}, {{QueueMetrics}} and
{{ResourceRequest}} classes. The attached patch covers per-queue metrics only.
Once the approach is confirmed, I will do the same per app, as that is also a
requirement of this JIRA.








[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-02-20 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370884#comment-16370884
 ] 

Wangda Tan commented on YARN-4488:
--

[~maniraj...@gmail.com], thanks for working on the patch. I took a glance at
it, but I think most of the changes are related to class imports.

Could you elaborate:
1) What's the approach?
2) What changes are needed?
3) Apart from YARN, do we need to change other projects?








[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-02-18 Thread Manikandan R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368782#comment-16368782
 ] 

Manikandan R commented on YARN-4488:


[~leftnoteasy] Can you please review the patch?

> CapacityScheduler: Compute per-container allocation latency and roll up to 
> get per-application and per-queue
> 
>
> Key: YARN-4488
> URL: https://issues.apache.org/jira/browse/YARN-4488
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Karthik Kambatla
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-4485.001.patch
>
>







[jira] [Commented] (YARN-4488) CapacityScheduler: Compute per-container allocation latency and roll up to get per-application and per-queue

2018-02-10 Thread Manikandan R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359370#comment-16359370
 ] 

Manikandan R commented on YARN-4488:


[~leftnoteasy] Thanks.

Attaching a patch containing changes to compute container latency for every
ResourceRequest (RR) at the queue level. This patch consolidates the changes
required for the other sub-tasks as well. Please review and let me know if you
want me to split the patches. I will add tests once you have reviewed the
approach.

The resulting metrics are copied below for better understanding:

"ContainerAllocationDelayNumOps" : 2,
"ContainerAllocationDelayAvgTime" : 1755.5,

> CapacityScheduler: Compute per-container allocation latency and roll up to 
> get per-application and per-queue
> 
>
> Key: YARN-4488
> URL: https://issues.apache.org/jira/browse/YARN-4488
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Karthik Kambatla
>Assignee: Wangda Tan
>Priority: Major
>



