[ https://issues.apache.org/jira/browse/YARN-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876928#comment-14876928 ]
Sangjin Lee commented on YARN-3816: ----------------------------------- My apologies for truly belated review comments. I just had time to go over this in some depth after working on YARN-4074. I think the latest patch is much more aligned with the overall design, and thanks much for working on that patiently [~djp]. First off, this overlaps with YARN-4074 and YARN-4075 that are getting wrapped up. So it would be good if this goes in after those 2 JIRAs. Let me know if you're OK with that. Also, I do have some basic questions and issues to discuss, and I'll mention them here. But I'm comfortable with having follow-on JIRAs after this one to address some of these that turn out to be major changes. *(aggregating metrics from all types of entities to application)* It appears that the current code will aggregate metrics from all types of entities to the application. This seems problematic to me. The main goal of this aggregation is to roll up metrics from individual *containers* to the application. But just by having the same metric id, any entity can have its metric aggregated by this (incorrectly). For example, any arbitrary entity can simply declare a metric named "MEMORY". By virtue of that, it would get aggregated and added to the application-level value. There can be variations of this: for example, the same metrics can be reported by the container entity, app attempt entity, and so on. Then the values may be aggregated double or triple. I think we should ensure strongly that the aggregation happens only along the path of YARN container entities to application to prevent these accidental cases. On a semi-related note, what happens if clients send metrics directly at the application entity level? We should expect most framework-specific AMs to do that. For example, MR AM already has all the job-level counters, and it can (and should) report those job-level counters as metrics at the YARN application entity. Is that case handled correctly, or will we end up getting incorrect values (double counting) in that situation? On to individual files: (TimelineMetric.java) - l.122: Although the method name is {{accumulateTo()}}, most of the variables and comments say "aggregate". Can we clean them up to say "accumulate"? (TimelineMetricCalculator.java) - we should add the annotations (public? unstable?) - l.34: if {{n1 == null}}, shouldn't we return {{-n2}}? - for both {{sub()}} and {{sum()}}, would it be simpler just to handle the arithmetic as longs even if they're integers? (yarn-default.xml) - The default defined in YarnConfiguration is true, but in yarn-default.xml it is false; which is correct? We should reconcile them. (NMTimelinePublisher.java) - Shouldn't these metrics set {{toAggregate}} to true (because the default is false)? These metrics are *THE* main ones we want to aggregate from containers to application, right? For that matter, should the default itself for {{toAggregate}} on {{TimelineMetric}} be true? I feel we should aggregate unless specified otherwise, not the other way around. Thoughts? (TimelineCollector.java) - l.124: nit: you can simply call {{aggregateMetrics()}} instead of {{TimelineCollector.aggregateMetrics()}} - l.130: the same for {{appendAggregatedMetricsToEntities()}} - l.212: What is the point of nulling out the value for metric id in {{perIdAggregatedNum}}? It doesn't seem necessary. (TimelineReaderWebServices.java) - I'm not so sure if we need a separate REST end point for "aggregates". If I understand correctly, they are all stored in the same application table under the same app id. What does it mean to have a separate REST URL for aggregates? Can we query for the application and be done? (HBaseTimelineWriterImpl.java) - I see that you're appending the {{toAggregate flag}} to the column name. I think it is fine for now, but we will need to look at this again, as there are other dimensions of metrics that need to be persisted. Some examples include single value v. time series, long v. float (possibly), and so on. We will need to arrive at a conclusion on how to encode them all cleanly and efficiently. We can address this later together with [~varun_saxena] as he's dealing with a related JIRA. (HBaseTimelineReaderImpl.java) - l.506: nit: it can just be {code} boolean toAggregate = toAggregateStr.equals("1"); {code} > [Aggregation] App-level aggregation and accumulation for YARN system metrics > ---------------------------------------------------------------------------- > > Key: YARN-3816 > URL: https://issues.apache.org/jira/browse/YARN-3816 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver > Reporter: Junping Du > Assignee: Junping Du > Attachments: Application Level Aggregation of Timeline Data.pdf, > YARN-3816-YARN-2928-v1.patch, YARN-3816-YARN-2928-v2.1.patch, > YARN-3816-YARN-2928-v2.2.patch, YARN-3816-YARN-2928-v2.3.patch, > YARN-3816-YARN-2928-v2.patch, YARN-3816-YARN-2928-v3.1.patch, > YARN-3816-YARN-2928-v3.patch, YARN-3816-YARN-2928-v4.patch, > YARN-3816-poc-v1.patch, YARN-3816-poc-v2.patch > > > We need application level aggregation of Timeline data: > - To present end user aggregated states for each application, include: > resource (CPU, Memory) consumption across all containers, number of > containers launched/completed/failed, etc. We need this for apps while they > are running as well as when they are done. > - Also, framework specific metrics, e.g. HDFS_BYTES_READ, should be > aggregated to show details of states in framework level. > - Other level (Flow/User/Queue) aggregation can be more efficient to be based > on Application-level aggregations rather than raw entity-level data as much > less raws need to scan (with filter out non-aggregated entities, like: > events, configurations, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)