[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run table

Joep Rottinghuis (JIRA) Fri, 28 Aug 2015 17:32:37 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720837#comment-14720837
 ]


Joep Rottinghuis commented on YARN-3901:
----------------------------------------

Reviewed / discussed 1.patch with [~vrushalic]
Comments may sound cryptic to others, but roughly we discussed these changes to 
make things generic (and more clear / reusable for the long run):

No timestamp is needed in the FlowActivity table, since runs can start one day 
and end another.
Probably start without one and add it later if needed.

Min/Max: why is the app ID needed?
FlowScanner currentMinCell should not consider the app ID.
If there is a start time for an app ID, and a later start time then arrives, we 
should still keep the min, not the latest value.
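
The min rule above can be sketched as follows. This is only an illustration, 
assuming start times arrive as plain long values; the class and method names 
are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

final class MinStartTimeSketch {
    // Returns the smallest start time among all written values.
    // The app ID that wrote each value is deliberately not consulted.
    // An empty list yields Long.MAX_VALUE in this sketch.
    static long currentMin(List<Long> startTimes) {
        long min = Long.MAX_VALUE;
        for (long t : startTimes) {
            min = Math.min(min, t);
        }
        return min;
    }

    public static void main(String[] args) {
        // A later write with a larger start time must not replace the min.
        List<Long> writes = Arrays.asList(1000L, 1500L, 1200L);
        System.out.println(currentMin(writes));
    }
}
```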

UI based on FlowActivity can enumerate active flows for that day, plus show 
number of runs, and # of distinct versions.

Update javadoc on FlowRunKey.

FlowRunTable: add increment and decrement for the number of running apps (on 
app start and app end).
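
As a rough in-memory illustration of the running-apps counter (the real column 
would live in HBase; the class below is hypothetical and only shows the 
intended bookkeeping):

```java
import java.util.concurrent.atomic.AtomicLong;

final class RunningAppsCounter {
    // Stand-in for the running apps column of a single flow run.
    private final AtomicLong runningApps = new AtomicLong();

    // Called when an app in the flow run starts.
    long onAppStart() {
        return runningApps.incrementAndGet();
    }

    // Called when an app in the flow run ends.
    long onAppEnd() {
        return runningApps.decrementAndGet();
    }

    long current() {
        return runningApps.get();
    }
}
```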

MIN, MAX, SUM, SUM_FINAL should be AggOps

Aggregation dimension = metric name (stored in column)
Aggregation compaction dimension = application id

For store, make the Attributes... the last argument.
An attribute is a tuple of String, byte[]

The MIN AggregationOperation should have a createAttribute method that takes
an AggCompactionDimension as argument and returns an Attribute.

Assumption is that all the cells in a put are the same operation.
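
One possible shape for these types, as a sketch only. The type names come from 
the notes above, but the fields, the constructor signatures, and the choice to 
use the operation name as the attribute key are assumptions made for 
illustration.

```java
import java.nio.charset.StandardCharsets;

// An attribute is a (String, byte[]) tuple, as described in the notes.
final class Attribute {
    final String name;
    final byte[] value;

    Attribute(String name, byte[] value) {
        this.name = name;
        this.value = value;
    }
}

// Assumed here to wrap the application id.
final class AggCompactionDimension {
    final String appId;

    AggCompactionDimension(String appId) {
        this.appId = appId;
    }
}

enum AggregationOperation {
    MIN, MAX, SUM, SUM_FINAL;

    // Assumption: the attribute key is the operation name and the
    // attribute value is the UTF-8 encoded compaction dimension.
    Attribute createAttribute(AggCompactionDimension dimension) {
        return new Attribute(name(),
            dimension.appId.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because all cells in a put are assumed to carry the same operation, a single 
attribute per put would be enough under this layout.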

In the general coprocessor, read the attribute (does not have to be unique).
Always add a tag with the aggregationCompaction dimension.
Set the compaction tag only if compaction needs to be done (if the operation is 
SUM_FINAL).
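
The tagging rule could look roughly like this. Tags are modelled as plain 
strings for illustration; this does not use the HBase Tag API, and the tag 
text and all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TagSketch {
    enum Op { MIN, MAX, SUM, SUM_FINAL }

    // Hypothetical tag strings, not the real on-disk encoding.
    static final String DIMENSION_PREFIX = "dimension:";
    static final String COMPACTION_TAG = "compact";

    static List<String> tagsFor(Op op, String appId) {
        List<String> tags = new ArrayList<>();
        // The aggregation compaction dimension tag is always added.
        tags.add(DIMENSION_PREFIX + appId);
        // The compaction tag is added only for SUM_FINAL.
        if (op == Op.SUM_FINAL) {
            tags.add(COMPACTION_TAG);
        }
        return tags;
    }
}
```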



> Populate flow run data in the flow_run table
> --------------------------------------------
>
>                 Key: YARN-3901
>                 URL: https://issues.apache.org/jira/browse/YARN-3901
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Vrushali C
>         Attachments: YARN-3901-YARN-2928.1.patch, 
> YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
>
> As per the schema proposed in YARN-3815 in 
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table. 
> Some points that are being  considered:
> - Stores per flow run information aggregated across applications, flow version
> - RM’s collector writes to it on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency 
> than the metric updates to application table
> - Primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even 
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and 
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for 
> the applicationId. A coprocessor will return the min value of all written 
> values.
> - Upon flush and compactions, the min value between all the cells of this 
> column will be written to the cell without any tag (empty tag) and all the 
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can 
> indicate running (1) or complete (2). In those cases (for metrics) only 
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are 
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow 
> numbers are retained in a separate column for historical tracking: we don’t 
> want to re-aggregate for those upon replay
> 
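
The sum-on-read and collapse-on-compaction behaviour in the quoted description 
can be sketched as below. The tag types (0, 1, 2) come from the description; 
the Cell class, the method names, and the choice to fold an existing untagged 
cell into the collapsed value are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class FlowMetricCells {
    static final int NOT_SET = 0;
    static final int RUNNING = 1;
    static final int COMPLETE = 2;

    static final class Cell {
        final int tagType;
        final String appId;
        final long value;

        Cell(int tagType, String appId, long value) {
            this.tagType = tagType;
            this.appId = appId;
            this.value = value;
        }
    }

    // On read, all cells are summed regardless of tag type.
    static long sumOnRead(List<Cell> cells) {
        long sum = 0;
        for (Cell c : cells) {
            sum += c.value;
        }
        return sum;
    }

    // On compaction, only cells of completed apps (and any previously
    // collapsed untagged cell) are merged; running cells are kept as-is.
    static List<Cell> compact(List<Cell> cells) {
        long collapsed = 0;
        boolean anyCollapsed = false;
        List<Cell> out = new ArrayList<>();
        for (Cell c : cells) {
            if (c.tagType == RUNNING) {
                out.add(c);
            } else {
                collapsed += c.value;
                anyCollapsed = true;
            }
        }
        if (anyCollapsed) {
            out.add(0, new Cell(NOT_SET, "", collapsed));
        }
        return out;
    }
}
```

Under these assumptions the value returned on read is the same before and 
after compaction; only the number of stored cells changes.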



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
