[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231612#comment-15231612
 ] 

Sangjin Lee commented on YARN-3816:
-----------------------------------

[~gtCarrera9], thanks much for posting an updated patch for this! I just had an 
opportunity to go over it fairly completely once, and have some high-level 
comments as well as more detailed code feedback.

Starting with high-level comments:
1. "aggregation" v. "accumulation"
This came up several times on this JIRA, and I think the distinction is crucial 
in getting this completed. I believe what we agreed on is as follows: 
"aggregation" is about rolling up metrics from a child type to a parent type 
(e.g. rolling up metrics from containers to applications), and "accumulation" 
is about computing/deriving secondary values based on the *time dimension* 
(e.g. area under the curve or the running maximum). Those two are rather 
independent, and we should not mix them.

Unfortunately in the latest patch, these two terms are used very much 
interchangeably. Can we make that distinction clear and rename all the 
classes/methods/variables that pertain to accumulation from "aggregation" to 
"accumulation"? It would be good if we reserve "aggregation" to child-to-parent 
rollups.

2. container-to-application aggregation
Related to above, this JIRA was meant to implement 2 features: (1) 
"aggregating" metrics from containers to applications, and (2) "accumulating" 
metrics for (certain) entity types. Both should be done. However, in the latest 
patch, *I do not see (1) being done*. In other words, I didn't find code that 
rolls up metrics from the container entities and sets them to the parent 
application entities. Am I missing something? The previous patches did 
implement that. Without this, we will *NOT* see things like container CPU or 
memory being rolled up to applications, and as a consequence to flow runs, and 
so on. This is a MUST.

IMO that is a separate functionality from the accumulation. I think we should 
do it clearly and explicitly. And the rolled-up metrics should be set onto the 
application entities.

3. time-based accumulation
We also said that the time-based accumulation should be conditional on a 
configuration (see [the previous 
patch|https://issues.apache.org/jira/secure/attachment/12761120/YARN-3816-YARN-2928-v4.patch]).
 I see that condition is not there in the latest patch. Can we please make the 
accumulation conditional on that configuration?

Also, this was an issue with the previous patches and I think it exists with 
the latest patch. It appears that we are doing the time-based accumulation for 
*all metrics for all entity types*. We might want to think about whether that 
would be OK. There are some performance and storage implications in doing so. 
Also, I raised some semantic issues with that idea. See the previous comment 
[here|https://issues.apache.org/jira/browse/YARN-3816?focusedCommentId=15067321&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15067321].
 I'm not 100% certain if the latest patch has the same issue or not although I 
suspect it might.

4. new YARN_APPLICATION_AGGREGATION entity type
I also raised a concern whether we should use a separate entity type for this. 
First of all, the "aggregation" (from containers to applications) *should* go 
to the actual application type. Second, even for "accumulation" you might want 
to think about what you want to do. I assume that the accumulated metrics 
(YARN_APPLICATION_AGGREGATION) are being written to the entities table. Note 
that they are not really considered as part of the application, and are not 
available for application queries. So there is an implication for queries. And 
they are not going to be aggregated up to the flow runs.

I know this is a lot to parse, and obviously there is much history in this 
discussion. However, it would help to replay the main discussions up to this 
point so that we don't lose these important points. Thanks much!

> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: yarn-2928-1st-milestone
>         Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-YARN-2928-v5.patch, YARN-3816-feature-YARN-2928.v4.1.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to