[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585160#comment-15585160
 ] 

Varun Saxena commented on YARN-3816:
------------------------------------

bq. What you've mentioned here, IIUC, is something closer to the concept of 
"accumulation" as we discussed before. Accumulation will apply an accumulative 
method on the same metric for the same timeline entity across time.
Sort of. In the earlier code on this JIRA, accumulation meant a time-based 
integral, i.e. computing the area under the curve using the trapezoidal rule. 
It should be fine to address this use case when we do accumulation.
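
For illustration only, a minimal sketch of what a trapezoidal-rule 
accumulation over reported samples could look like. The function name, the 
(timestamp, value) tuple layout and the sample values are assumptions made for 
this sketch, not the code from the patches on this JIRA.

```python
def trapezoidal_accumulation(samples):
    """Approximate the time integral of a metric from (timestamp_seconds, value)
    samples using the trapezoidal rule. Hypothetical helper for illustration."""
    ordered = sorted(samples, key=lambda s: s[0])  # order by timestamp
    area = 0.0
    # Each adjacent pair of samples contributes one trapezoid:
    # width = time delta, height = mean of the two values.
    for (t0, v0), (t1, v1) in zip(ordered, ordered[1:]):
        area += (t1 - t0) * (v0 + v1) / 2.0
    return area


# Illustrative samples 5 seconds apart (values are made up for the sketch).
samples = [(0, 40), (5, 30), (10, 7)]
area = trapezoidal_accumulation(samples)
print(area)
```

A single sample yields an area of 0, since there is no interval to integrate 
over.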

bq. We also had a discussion on how often node managers should publish 
container metrics (YARN-4712 and YARN-4821). Currently they emit them every 3 
seconds, but I think we should do a time average on the NMTimelinePublisher and 
emit them less often. It may help in this regard.
Yes, this should largely address the concern I had, depending on what the 
configured interval is. 
Assume the aggregation interval is 15 seconds and the config we add in 
YARN-4821 is set to 5 seconds; then we can potentially have 3 CPU values for a 
container reported to the Collector. Assume these values are (t1, 40), 
(t2, 30) and (t3, 7), where t1, t2 and t3 are 5 seconds apart. Currently we 
will pick up only 7 as the value used for aggregation. My point is: should the 
value for aggregation instead be 
((5*40) + (5*30) + (5*7)) / 15 = 25.67, i.e. about 26?
If the last value were 70 instead of 7, it would be reported as 70, whereas 
the time average would be 700 / 15 = 46.67, i.e. about 47.
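
The difference between the two options can be sketched as follows. This is 
only an illustration of the arithmetic above; the function names and the 
concrete timestamps (5, 10, 15) are assumptions, and each value is assumed to 
hold for one full reporting interval.

```python
def latest_value(samples):
    """Current behaviour as described above: only the most recent value is used."""
    return max(samples, key=lambda s: s[0])[1]


def time_weighted_average(samples, report_interval, aggregation_interval):
    """Weight every reported value by the reporting interval it covers and
    divide by the aggregation interval. Hypothetical helper for illustration."""
    return sum(report_interval * value for _, value in samples) / aggregation_interval


# (timestamp_seconds, cpu_value); timestamps stand in for t1, t2, t3.
samples = [(5, 40), (10, 30), (15, 7)]
snapshot = latest_value(samples)
average = time_weighted_average(samples, report_interval=5, aggregation_interval=15)
print(snapshot, round(average, 2))

# Same series with 70 as the last value instead of 7.
spiky = [(5, 40), (10, 30), (15, 70)]
spiky_snapshot = latest_value(spiky)
spiky_average = time_weighted_average(spiky, report_interval=5, aggregation_interval=15)
print(spiky_snapshot, round(spiky_average, 2))
```

With equal reporting intervals this reduces to a plain mean of the reported 
values; the weighting only matters if the intervals differ.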

We can, however, treat aggregation as just the latest value at a particular 
time (a sort of snapshot of the system) and handle the above use case during 
accumulation, as Li suggested. 

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: yarn-2928-1st-milestone
>             Fix For: 3.0.0-alpha1
>
>         Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-YARN-2928-v6.patch, 
> YARN-3816-YARN-2928-v7.patch, YARN-3816-YARN-2928-v8.patch, 
> YARN-3816-YARN-2928-v9.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end users with aggregated states for each application, 
> including: resource (CPU, Memory) consumption across all containers, number 
> of containers launched/completed/failed, etc. We need this for apps while 
> they are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states at the framework level.
> - Aggregation at other levels (Flow/User/Queue) can be more efficient when 
> based on application-level aggregations rather than raw entity-level data, 
> as far fewer rows need to be scanned (after filtering out non-aggregated 
> entities such as events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to