[ 
https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876928#comment-14876928
 ] 

Sangjin Lee commented on YARN-3816:
-----------------------------------

My apologies for truly belated review comments. I just had time to go over this 
in some depth after working on YARN-4074. I think the latest patch is much more 
aligned with the overall design, and thanks much for working on that patiently 
[~djp].

First off, this overlaps with YARN-4074 and YARN-4075 that are getting wrapped 
up. So it would be good if this goes in after those 2 JIRAs. Let me know if 
you're OK with that.

Also, I do have some basic questions and issues to discuss, and I'll mention 
them here. But I'm comfortable with having follow-on JIRAs after this one to 
address some of these that turn out to be major changes.

*(aggregating metrics from all types of entities to application)*
It appears that the current code will aggregate metrics from all types of 
entities to the application. This seems problematic to me.

The main goal of this aggregation is to roll up metrics from individual 
*containers* to the application. But just by having the same metric id, any 
entity can have its metric aggregated by this (incorrectly). For example, any 
arbitrary entity can simply declare a metric named "MEMORY". By virtue of that, 
it would get aggregated and added to the application-level value. There can be 
variations of this: for example, the same metrics can be reported by the 
container entity, app attempt entity, and so on. Then the values may be 
aggregated double or triple.

I think we should ensure strongly that the aggregation happens only along the 
path of YARN container entities to application to prevent these accidental 
cases.

On a semi-related note, what happens if clients send metrics directly at the 
application entity level? We should expect most framework-specific AMs to do 
that. For example, MR AM already has all the job-level counters, and it can 
(and should) report those job-level counters as metrics at the YARN application 
entity. Is that case handled correctly, or will we end up getting incorrect 
values (double counting) in that situation?

On to individual files:

(TimelineMetric.java)
- l.122: Although the method name is {{accumulateTo()}}, most of the variables 
and comments say "aggregate". Can we clean them up to say "accumulate"?

(TimelineMetricCalculator.java)
- we should add the annotations (public? unstable?)
- l.34: if {{n1 == null}}, shouldn't we return {{-n2}}?
- for both {{sub()}} and {{sum()}}, would it be simpler just to handle the 
arithmetic as longs even if they're integers?

(yarn-default.xml)
- The default defined in YarnConfiguration is true, but in yarn-default.xml it 
is false; which is correct? We should reconcile them.

(NMTimelinePublisher.java)
- Shouldn't these metrics set {{toAggregate}} to true (because the default is 
false)? These metrics are *THE* main ones we want to aggregate from containers 
to application, right? For that matter, should the default itself for 
{{toAggregate}} on {{TimelineMetric}} be true? I feel we should aggregate 
unless specified otherwise, not the other way around. Thoughts?

(TimelineCollector.java)
- l.124: nit: you can simply call {{aggregateMetrics()}} instead of 
{{TimelineCollector.aggregateMetrics()}}
- l.130: the same for {{appendAggregatedMetricsToEntities()}}
- l.212: What is the point of nulling out the value for metric id in 
{{perIdAggregatedNum}}? It doesn't seem necessary.

(TimelineReaderWebServices.java)
- I'm not so sure if we need a separate REST end point for "aggregates". If I 
understand correctly, they are all stored in the same application table under 
the same app id. What does it mean to have a separate REST URL for aggregates? 
Can we query for the application and be done?

(HBaseTimelineWriterImpl.java)
- I see that you're appending the {{toAggregate flag}} to the column name. I 
think it is fine for now, but we will need to look at this again, as there are 
other dimensions of metrics that need to be persisted. Some examples include 
single value v. time series, long v. float (possibly), and so on. We will need 
to arrive at a conclusion on how to encode them all cleanly and efficiently. We 
can address this later together with [~varun_saxena] as he's dealing with a 
related JIRA.

(HBaseTimelineReaderImpl.java)
- l.506: nit: it can just be
{code}
boolean toAggregate = toAggregateStr.equals("1");
{code}


> [Aggregation] App-level aggregation and accumulation for YARN system metrics
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3816
>                 URL: https://issues.apache.org/jira/browse/YARN-3816
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: Application Level Aggregation of Timeline Data.pdf, 
> YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, 
> YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, 
> YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, 
> YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, 
> YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch
>
>
> We need application level aggregation of Timeline data:
> - To present end user aggregated states for each application, include: 
> resource (CPU, Memory) consumption across all containers, number of 
> containers launched/completed/failed, etc. We need this for apps while they 
> are running as well as when they are done.
> - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be 
> aggregated to show details of states in framework level.
> - Other level (Flow/User/Queue) aggregation can be more efficient to be based 
> on Application-level aggregations rather than raw entity-level data as much 
> less raws need to scan (with filter out non-aggregated entities, like: 
> events, configurations, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to