[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369763#comment-15369763 ]

Hudson commented on YARN-3901:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10074/])
YARN-3901. Populate flow run data in the flow_run & flow activity tables (sjlee: rev a68e3839218523403f42acd7bdd7ce1da59a5e60)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/entity/EntityColumn.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/common/TimestampGenerator.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowActivityTable.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/apptoflow/AppToFlowColumn.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/application/ApplicationColumn.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/HBaseTimelineWriterImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowRunColumnFamily.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/pom.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/storage/TestHBaseTimelineStorage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/application/ApplicationColumnPrefix.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/entity/EntityColumnPrefix.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/TestHBaseStorageFlowRun.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowActivityColumnPrefix.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowRunColumnPrefix.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowActivityColumnFamily.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowActivityRowKey.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowRunCoprocessor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/common/ColumnHelper.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/common/ColumnPrefix.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/Attribute.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/TimelineSchemaCreator.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/FlowRunTable.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/common/TimelineWriterUtils.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/storage/flow/AggregationCompactionDimension.java
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791626#comment-14791626 ]

Vrushali C commented on YARN-3901:
----------------------------------

Thanks [~jrottinghuis], it is quite exciting to work on hbase cell tags and coprocessors. Looking forward to the upcoming jiras!

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch,
> YARN-3901-YARN-2928.10.patch, YARN-3901-YARN-2928.2.patch,
> YARN-3901-YARN-2928.3.patch, YARN-3901-YARN-2928.4.patch,
> YARN-3901-YARN-2928.5.patch, YARN-3901-YARN-2928.6.patch,
> YARN-3901-YARN-2928.7.patch, YARN-3901-YARN-2928.8.patch,
> YARN-3901-YARN-2928.9.patch
>
>
> As per the schema proposed in YARN-3815 in
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for
> the applicationId. A coprocessor will return the min value of all written
> values.
> - Upon flush and compactions, the min value between all the cells of this
> column will be written to the cell without any tag (empty tag) and all the
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can
> indicate running (1) or complete (2). In those cases (for metrics) only
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow
> numbers are retained in a separate column for historical tracking: we don’t
> want to re-aggregate for those upon replay
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
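The tag-based aggregation rules in the issue description can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `TagType`, `TaggedCell`, and `FlowRunAggregationSketch` are hypothetical names, not classes from the patch or from HBase; each list stands for the cells of one column in one flow run row; and the previously collapsed (untagged) cell is assumed to be folded into the next collapse.

```java
import java.util.ArrayList;
import java.util.List;

/** Tag types named in the description: not set (0), running (1), complete (2). */
enum TagType { NOT_SET, RUNNING, COMPLETE }

/** Hypothetical stand-in for one tagged cell of a column; not an HBase class. */
final class TaggedCell {
  final TagType type;
  final String appId; // "" models the cell written without any tag
  final long value;

  TaggedCell(TagType type, String appId, long value) {
    this.type = type;
    this.appId = appId;
    this.value = value;
  }
}

/** Illustrative model only; the real logic would live in a coprocessor. */
final class FlowRunAggregationSketch {
  private FlowRunAggregationSketch() { }

  /** Read path for min_start_time: the min of all written values. */
  static long minOnRead(List<TaggedCell> cells) {
    long min = Long.MAX_VALUE;
    for (TaggedCell c : cells) {
      min = Math.min(min, c.value);
    }
    return min;
  }

  /** Read path for max_end_time: the max of all written values. */
  static long maxOnRead(List<TaggedCell> cells) {
    long max = Long.MIN_VALUE;
    for (TaggedCell c : cells) {
      max = Math.max(max, c.value);
    }
    return max;
  }

  /** Read path for an m! metric column: values are summed. */
  static long sumOnRead(List<TaggedCell> cells) {
    long sum = 0L;
    for (TaggedCell c : cells) {
      sum += c.value;
    }
    return sum;
  }

  /** Flush/compaction for min_start_time: one untagged cell keeps the min. */
  static List<TaggedCell> compactMin(List<TaggedCell> cells) {
    List<TaggedCell> out = new ArrayList<>();
    if (!cells.isEmpty()) {
      out.add(new TaggedCell(TagType.NOT_SET, "", minOnRead(cells)));
    }
    return out;
  }

  /**
   * Compaction for a metric column: cells of running apps are kept as-is,
   * while completed (and already collapsed, untagged) cells are summed
   * into a single untagged cell. The summed read value is unchanged.
   */
  static List<TaggedCell> compactMetric(List<TaggedCell> cells) {
    List<TaggedCell> out = new ArrayList<>();
    long collapsed = 0L;
    boolean anyCollapsed = false;
    for (TaggedCell c : cells) {
      if (c.type == TagType.RUNNING) {
        out.add(c);
      } else {
        collapsed += c.value;
        anyCollapsed = true;
      }
    }
    if (anyCollapsed) {
      out.add(new TaggedCell(TagType.NOT_SET, "", collapsed));
    }
    return out;
  }
}
```

The point of the model is the invariant that compaction never changes what a reader sees: `sumOnRead` returns the same total before and after `compactMetric`, and `minOnRead` returns the same value before and after `compactMin`.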
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803137#comment-14803137 ]

Junping Du commented on YARN-3901:
----------------------------------

Patch LGTM too. It looks like Jenkins was not triggered against the v10 patch for some reason, so I just kicked it off manually. Let's wait for the Jenkins result.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803109#comment-14803109 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

The latest patch (v.10) LGTM. Thanks much [~vrushalic] for the update! Please let me know if you have additional feedback. I'd like to commit this soon.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803213#comment-14803213 ]

Li Lu commented on YARN-3901:
-----------------------------

Patch LGTM. Pending Jenkins.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803240#comment-14803240 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

Jenkins does kick in but for some reason, it cannot post the result to the JIRA. The result is the following:

-1 overall

| Vote | Subsystem | Runtime | Comment |
| -1 | pre-patch | 16m 3s | Findbugs (version ) appears to be broken on YARN-2928. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 4 new or modified test files. |
| +1 | javac | 8m 26s | There were no new javac warning messages. |
| +1 | javadoc | 10m 47s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 0m 16s | There were no new checkstyle issues. |
| -1 | whitespace | 0m 50s | The patch has 9 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | install | 1m 39s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 43s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 0m 52s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| +1 | yarn tests | 2m 37s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | 42m 43s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12756422/YARN-3901-YARN-2928.10.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / b1960e0 |
| whitespace | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/whitespace.txt |
| hadoop-yarn-server-timelineservice test log | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9191/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |

I'll remove the whitespace as I commit it. +1?
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791359#comment-14791359 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

I did a quick review, and it looks real close now. Some minor comments (mostly javadoc comment nits).

(TimestampGenerator.java)
- l.57: nit: "unlikely" -> "Unlikely"
- l.60: nit: should have a period (".") at the end
- l.76, 79: the same

(FlowScanner.java)
- l.50: nit: should have a period (".") at the end
- There are other javadoc comments here that do not start with capital letters or end with periods. Could you scan them and fix them?

(TestFlowDataGenerator.java)
- it can be non-public, right?
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791446#comment-14791446 ]

Joep Rottinghuis commented on YARN-3901:
----------------------------------------

Thanks, [~vrushalic], for patiently making updates as we figured out the details of the algorithm for readless aggregation and ran into the sharp edges of HBase coprocessors.

The patch looks fine to me to be committed. I'll put further comments and tweaks in YARN-4062.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791334#comment-14791334 ]

Vrushali C commented on YARN-3901:
----------------------------------

Thanks Sangjin, I corrected the patch to include those files now.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791330#comment-14791330 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

It appears the latest patch is missing some new files. I see {{TestHBaseStorageFlowRun}} instead of {{TestHBaseStorageFlowRunFlowActivity}} but it seems much shorter than before. I'm assuming we're missing {{TestHBaseStorageFlowActivity}}? Also, I think we're missing {{TestFlowDataGenerator}}. Could you make sure all the new files are included in the patch? Thanks.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746516#comment-14746516 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

{quote}
Yes, the algorithm makes sense. I thought JVM may have some special optimization with the getAndAdd method on x86 since x86 has a specific fetch-and-add (LOCK:XADD). However I'm not sure if this is actually reflected in most current JVMs (https://bugs.openjdk.java.net/browse/JDK-6973482). I'm OK with both, since the advantage of either one of them in our use case is nontrivial (https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs). If keeping up with current time is important, we need to stick to the CAS based solution and not to change it in future.
{quote}

I see. I missed the part where you were comparing CAS with fetch-and-add. Yes, I think keeping up with the current time is an important requirement here. Also, this may not be one of the bigger performance issues IMO.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746558#comment-14746558 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

[~vrushalic], it seems that {{package-info.java}} is not quite right, and that might be related to the javadoc errors. Your patch removes {{package-info.java}} at org/apache/hadoop/yarn/server/timelineservice/storage/common, but incorrectly changes the package to org.apache.hadoop.yarn.server.timelineservice.storage.common for the {{package-info.java}} at org/apache/hadoop/yarn/server/timelineservice/storage.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746305#comment-14746305 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

I can answer and clarify some of [~gtCarrera9]'s questions.

bq. Any special considerations on not directly using getAndAdd (fetch-and-increment) here?

It's essentially using AtomicLong.compareAndSet(). The few lines around it are mostly to keep pace with the current time. I hope that makes sense.

bq. It's fine to directly read data out in the UT now. We may want to switch this to use readers sometime later?

I'm actually adding more unit tests that use the reader in YARN-4074.

bq. I noticed we never remove or disable anything in this table, so do we do this by setting the ttl of the table? Or we create different rows for the same application to differentiate the day an entity was posted? I think we're using the latter but would like to confirm.

We're doing the latter. We create a new record in this table any time a new activity is done for a given day for a flow.

bq. Also, with the current design, when the reader tries to get latest activities on a cluster, it can craft a row key prefix cluster!inv(currTime-24h) and scan from the very beginning to this record, right? (I'm trying to connect the two pieces together.)

If you're interested in the latest only, then it's even simpler. The prefix is just {{cluster!}} and we can grab from the beginning. What you mention would get activities for today only, which is slightly different.
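As a rough illustration of the compareAndSet approach discussed in the comment above, here is a minimal sketch. It is not the {{TimestampGenerator}} from the patch: the class name, method names and the millisecond granularity are assumptions made for this example. The idea is that each call returns a value strictly greater than the previous one, while following the wall clock whenever the clock has moved ahead.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a CAS-based unique timestamp; not the class from the patch. */
final class CasTimestampSketch {
  // Last value handed out by this instance.
  private final AtomicLong last = new AtomicLong(0L);

  /** Returns a value strictly greater than any value this instance returned before. */
  long nextUnique(long nowMillis) {
    while (true) {
      long prev = last.get();
      // Follow the clock when it is ahead; otherwise step just past the last value.
      long next = Math.max(nowMillis, prev + 1);
      if (last.compareAndSet(prev, next)) {
        return next;
      }
      // Another thread updated 'last' in between; re-read and retry.
    }
  }

  /** Convenience overload that reads the wall clock. */
  long nextUnique() {
    return nextUnique(System.currentTimeMillis());
  }
}
```

A plain fetch-and-add on the counter would also yield unique values, but the counter would then advance independently of the clock, which is the property the thread wants to avoid giving up.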
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746352#comment-14746352 ]

Li Lu commented on YARN-3901:
-----------------------------

Thanks [~sjlee0]!

bq. It's essentially using AtomicLong.compareAndSet(). The few lines around it are mostly to keep pace with the current time. I hope that makes sense.

Yes, the algorithm makes sense. I thought JVM may have some special optimization with the getAndAdd method on x86 since x86 has a specific fetch-and-add (LOCK:XADD). However I'm not sure if this is actually reflected in most current JVMs (https://bugs.openjdk.java.net/browse/JDK-6973482). I'm OK with both, since the advantage of either one of them in our use case is nontrivial (https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs). If keeping up with current time is important, we need to stick to the CAS based solution and _not_ to change it in future.

bq. We create a new record in this table any time a new activity is done for a given day for a flow.

Thanks. I missed the {{getTopOfTheDayTimestamp}} part.

bq. The prefix is just cluster! and we can grab from the beginning. What you mention would get activities for today only, which is slightly different.

Right. Thanks for the clarification! I meant activities for the past 24 hours.
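To make the row key discussion above more concrete, the sketch below shows one way a per-day, newest-first key component could be derived. It is only an assumption-laden illustration: the class and method names are invented, the top of the day is computed in UTC for non-negative epoch milliseconds, and inversion is taken to be {{Long.MAX_VALUE - value}}; the thread itself only mentions {{getTopOfTheDayTimestamp}} and {{inv(...)}} without giving their definitions.

```java
/** Hypothetical sketch of a day-bucketed, inverted timestamp; not code from the patch. */
final class FlowActivityKeySketch {
  static final long MILLIS_PER_DAY = 24L * 60L * 60L * 1000L;

  /** Rounds a non-negative epoch-millisecond timestamp down to 00:00 UTC of its day. */
  static long topOfTheDay(long tsMillis) {
    return tsMillis - (tsMillis % MILLIS_PER_DAY);
  }

  /** Inverts a value so that larger (more recent) inputs sort first in ascending order. */
  static long invert(long value) {
    return Long.MAX_VALUE - value;
  }

  /** Key component shared by every activity recorded on the same UTC day. */
  static long invertedDay(long tsMillis) {
    return invert(topOfTheDay(tsMillis));
  }
}
```

With a component like this after the cluster name, all activity for a flow on a given day falls into one row, and a scan that starts at the {{cluster!}} prefix meets the most recent day first, which matches the reading order described in the thread.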
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746239#comment-14746239 ]

Li Lu commented on YARN-3901:
-----------------------------

Hi [~vrushalic], thanks for the work! I feel the current patch is much closer. Some of my review comments:

- Column.java
Let's make it clear in the comments whether attributes can be null. (I believe they can be null, but let's make it explicit.)

- TimelineWriterUtils.java
Quick question: when checking for the application finish event and its timestamp, we only look at the very last event. Could there be corner cases where the application finish event is not the last one, such as another event accidentally having the same timestamp as the finish event? I'm not sure if we have any strong guarantee on this (so that the chance is low).

- TimestampGenerator.java
The lock-free algorithm to generate (locally) unique timestamps looks good to me. Depending on the usage, we may want to promote this algorithm to a more common place if other parts of the code need it as well. Any special considerations on not directly using getAndAdd (fetch-and-increment) here?

- TimelineWriterUtils.java
TimelineWriterUtils#getTagFromAttribute is much better than in the previous versions. It may be better to let the key of the attribute entry identify its own type? Right now we're requiring AggregationOperations and AggregationCompactionDimension to have disjoint enum values. This implicit requirement may be error-prone: if two attributes accidentally both have DEFAULT, we would confuse them. Not sure if this is the top priority for now, but we may need to fix it sometime later.

- FlowScanner.java
Unify findMaxCell and findMinCell? I'm a little confused about SUM and SUM_FINAL: what's the difference (we treat them the same way in collectCells)? Right now we treat all metric values as Long. Let's decide that later.

- FlowRunTable.java
We're mixing real "info" with metrics (info.m!x) in the info column family. I think this is worth noting somewhere?

- TestHBaseStorageFlowRunFlowActivity.java
Split into a flowRun test and a flowActivity test? Although they were added in the same JIRA, the functions of the flowRun and flowActivity tables are different (online aggregation and real-time activity info).
It's fine to directly read data out in the UT now. We may want to switch this to use readers sometime later?
Do we need to test single data aggregation? The two {{getEntityMetricsApp}} methods only generate time series metrics.
In checkFlowRunTable(), shall we have a better way to check the values 141 and 57? Say, is it possible to compute them dynamically rather than hard-code them?

- With regard to FlowActivityTable, maybe I'm still missing some points here, but how do we maintain the active flow list for the past 24 hrs? I noticed we never remove or disable anything in this table, so do we do this by setting the ttl of the table? Or we create different rows for the same application to differentiate the day an entity was posted? I think we're using the latter but would like to confirm. Also, with the current design, when the reader tries to get latest activities on a cluster, it can craft a row key prefix {{cluster!inv(currTime-24h)}} and scan from the very beginning to this record, right? (I'm trying to connect the two pieces together.)
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743869#comment-14743869 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

Thanks for updating the patch [~vrushalic]. It seems that the patch does not apply cleanly because {{TimelineSchemaCreator.java}} has been updated for YARN-4102. Could you please update the patch so it applies cleanly? Thanks.

https://builds.apache.org/job/PreCommit-YARN-Build/9119/console
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744270#comment-14744270 ]

Li Lu commented on YARN-3901:
-----------------------------

OK, thanks for the info!
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744320#comment-14744320 ]

Joep Rottinghuis commented on YARN-3901:
----------------------------------------

[~sjlee0]

bq. I remember Joep Rottinghuis mentioning that an unset timestamp is equivalent to Cell.getTimestamp() returning Long.MAX_VALUE. Joep Rottinghuis?

Yes, if you see the org.apache.hadoop.hbase.client.Put#addColumn method without a timestamp, it uses the timestamp from the Put. There are two constructors for a Put, one with the timestamp, one without. The one without uses HConstants.LATEST_TIMESTAMP which is defined as:

{code}
/**
 * Timestamp to use when we want to refer to the latest cell.
 * This is the timestamp sent by clients when no timestamp is specified on
 * commit.
 */
public static final long LATEST_TIMESTAMP = Long.MAX_VALUE;
{code}

That is then used (indirectly through LATEST_TIMESTAMP_BYTES) in the KeyValue class in the #isLatestTimestamp method, which in turn is used in the KeyValue#updateLatestStamp that sets it to "now" on the server side.

I'm not 100% sure (we need to test this) but I'm assuming that the transformation of this isLatestTimestamp happens after coprocessors or never at all (the cells might just be written with the latest timestamp, and one might not have the ability to ask what the row looked like at any particular time at all). I thought this might be overwritten on the server side, but can't find that code now.
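The point about unset timestamps can be sketched without depending on HBase. In the example below the constant is a local stand-in that mirrors the value quoted from HConstants in the comment above; the class and helper names are hypothetical. Because a cell timestamp is a primitive long, a null check cannot detect "no timestamp supplied"; a comparison against the sentinel can.

```java
/** Hypothetical sketch of sentinel handling; compiles without HBase on the classpath. */
final class LatestTimestampSketch {
  // Local stand-in mirroring the HConstants.LATEST_TIMESTAMP value quoted in the thread.
  static final long LATEST_TIMESTAMP = Long.MAX_VALUE;

  /** True when the client did not supply a timestamp, i.e. the sentinel is still in place. */
  static boolean isUnset(long cellTimestamp) {
    return cellTimestamp == LATEST_TIMESTAMP;
  }

  /** Substitutes a caller-provided fallback when no timestamp was supplied. */
  static long effectiveTimestamp(long cellTimestamp, long fallback) {
    return isUnset(cellTimestamp) ? fallback : cellTimestamp;
  }
}
```

Whether a coprocessor actually observes the sentinel, or a value already rewritten on the server side, is exactly what the comment says still needs to be tested, so this sketch should be read as a statement of intent rather than of observed HBase behavior.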
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744789#comment-14744789 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

Again the jenkins result wasn't posted. Here it is:

-1 overall

| Vote | Subsystem | Runtime | Comment |
| -1 | pre-patch | 16m 1s | Findbugs (version ) appears to be broken on YARN-2928. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | javac | 8m 4s | There were no new javac warning messages. |
| -1 | javadoc | 7m 42s | The applied patch generated 166 additional warning messages. |
| +1 | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 0m 19s | There were no new checkstyle issues. |
| -1 | whitespace | 0m 47s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | install | 1m 33s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 0m 52s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| +1 | yarn tests | 1m 56s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | 38m 22s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12755862/YARN-3901-YARN-2928.8.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / b1960e0 |
| javadoc | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/diffJavadocWarnings.txt |
| whitespace | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/whitespace.txt |
| hadoop-yarn-server-timelineservice test log | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9134/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741507#comment-14741507 ]

Sangjin Lee commented on YARN-3901:
---

Thanks [~vrushalic] for updating the patch! I did a quick review (I know you'll be making a few more changes for findbugs, etc.), and wanted to share feedback.

(ColumnHelper.java)
- l.70: if the timestamp is null, the *current* timestamp (not server) is used, right? So we should update this comment?
- l.99,104: let's use primitive long over object Long
- l.99: does this need to be non-private?

(FlowRunCoprocessor.java)
- l.146: Since {{Cell.getTimestamp()}} returns a primitive long, it will never be a null Long object. I remember [~jrottinghuis] mentioning that an unset timestamp is equivalent to {{Cell.getTimestamp()}} returning Long.MAX_VALUE. [~jrottinghuis]?

(TimestampGenerator.java)
- If we're going to have {{ColumnHelper}} use this, I suggest moving this to the storage.common package.

> Populate flow run data in the flow_run & flow activity tables
> -
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch,
> YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch,
> YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch,
> YARN-3901-YARN-2928.6.patch
>
>
> As per the schema proposed in YARN-3815 in
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency
> than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even
> if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and
> decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for
> the applicationId. A coprocessor will return the min value of all written
> values. -
> - Upon flush and compactions, the min value between all the cells of this
> column will be written to the cell without any tag (empty tag) and all the
> other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can
> indicate running (1) or complete (2). In those cases (for metrics) only
> complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are
> completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow
> numbers are retained in a separate column for historical tracking: we don’t
> want to re-aggregate for those upon replay
>

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
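The read-side aggregation rules in the issue description above (min for min_start_time, max for max_end_time, sum for the m! metric values) can be sketched roughly as follows. This is only an illustration of the arithmetic, not the coprocessor code from the patch: the class {{FlowRunAggregationSketch}}, the {{AppValue}} holder, and the assumption of exactly one value per application are all hypothetical.

```java
import java.util.List;

/** Illustrative sketch only; not the coprocessor code from the patch. */
final class FlowRunAggregationSketch {

  /** Hypothetical stand-in for one cell: the app that wrote it and its value. */
  static final class AppValue {
    final String appId;
    final long value;

    AppValue(String appId, long value) {
      this.appId = appId;
      this.value = value;
    }
  }

  /** min_start_time: smallest value written by any app (Long.MAX_VALUE if empty). */
  static long minStartTime(List<AppValue> cells) {
    long min = Long.MAX_VALUE;
    for (AppValue cell : cells) {
      min = Math.min(min, cell.value);
    }
    return min;
  }

  /** max_end_time: largest value written by any app (Long.MIN_VALUE if empty). */
  static long maxEndTime(List<AppValue> cells) {
    long max = Long.MIN_VALUE;
    for (AppValue cell : cells) {
      max = Math.max(max, cell.value);
    }
    return max;
  }

  /** Metric value: sum across apps, assuming one (latest) value per app. */
  static long sumMetric(List<AppValue> cells) {
    long sum = 0L;
    for (AppValue cell : cells) {
      sum += cell.value;
    }
    return sum;
  }
}
```

Under these assumptions, collapsing on flush or compaction would amount to replacing the per-application cells with a single untagged cell holding the result of the matching method.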
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741575#comment-14741575 ]

Vrushali C commented on YARN-3901:
--

Thanks Sangjin, I will make these updates. I am also looking at the patch with Joep and hope to have an updated patch shortly.

bq. Since Cell.getTimestamp() returns a primitive long, it will never be a null Long object. I remember Joep Rottinghuis mentioning that an unset timestamp is equivalent to Cell.getTimestamp() returning Long.MAX_VALUE.

Yes, I will update this.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739036#comment-14739036 ]

Vrushali C commented on YARN-3901:
--

Hi [~gtCarrera9]

The start and end times for a flow run can be derived from the known start and end times of the applications in that flow run, by taking the min and max timestamps. Hence this can be determined from the flow run table.

But in the flow activity table, the purpose is to note that a flow was "active" on that day, meaning an application in that flow either started, completed or was running on that day. When Joep and I reviewed my patch together, we realized that calculating the min/max in the flow activity table won't work for apps that span day boundaries. So in his comment on Aug 29th there is a note: "No timestamp needed in FlowActivity table. Runs can start one day and end another. Probably start without, add later if needed."

That meant we did not need the coprocessor to determine min or max in the flow activity table. Hence I removed it.

HTH
Vrushali
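To make the day-boundary point in the comment above concrete, here is a minimal sketch of which days an application would count as "active" on. It assumes UTC day boundaries and non-negative epoch-millisecond timestamps; the class and method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and UTC day boundaries are assumptions. */
final class FlowActivityDaysSketch {
  static final long MILLIS_PER_DAY = 24L * 60L * 60L * 1000L;

  /** Truncates an epoch-millis timestamp to the start of its (UTC) day. */
  static long topOfDay(long millis) {
    return millis - (millis % MILLIS_PER_DAY);
  }

  /** Every day on which an app running from startMillis to endMillis was active. */
  static List<Long> activeDays(long startMillis, long endMillis) {
    List<Long> days = new ArrayList<>();
    for (long day = topOfDay(startMillis); day <= endMillis; day += MILLIS_PER_DAY) {
      days.add(day);
    }
    return days;
  }
}
```

An app that starts at 23:00 on one day and ends at 01:00 on the next yields two entries, which is why a single min or max timestamp per day row does not describe the run.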
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737890#comment-14737890 ]

Li Lu commented on YARN-3901:
-

Hi [~vrushalic], one quick question: in which discussion did we decide to remove the flow activity coprocessor? I'd appreciate a pointer.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735452#comment-14735452 ]

Vrushali C commented on YARN-3901:
--

Thanks [~sjlee0] for the review! I will correct the variable ordering for static and private members as well as making variables final.

bq. l.210: Strictly speaking, GenericObjectMapper will return an integer if the value fits within an integer; so it's not exactly a concern for min/max (timestamps) but for caution we might want to stay with Number instead of long

Comparisons are not allowed for Number datatype.
{code}
The operator < is undefined for the argument type(s) java.lang.Number, java.lang.Number
{code}
So I would have to do something like
{code}
Number d = a.longValue() + b.longValue();
{code}
Do you think this is better?

bq. l.52: Is the TimestampGenerator class going to be used outside FlowRunCoprocessor? If not, I would argue that we should make it an inner class of FlowRunCoprocessor. At least we should make it non-public to keep it within the package. If it would see general use outside this class, then it might be better to make it a true public class in the common package. I suspect a non-public class might be what we want here.

I am thinking I will need this when the flush/compaction scanner is added in. If you'd like, I can move it in as a non-public class for now and then move it out if needed.

bq. It's up to you, but you could leave the row key improvement to YARN-4074. That might be easier to manage the changes between yours and mine. I'm restructuring all *RowKey classes uniformly.

I actually needed this in the unit test while checking the FlowActivityTable contents, if you want I can take it out and you can add that test case in when you add in the RowKey changes?

bq. l.144: This would mean that some cell timestamps would have the unit of the milliseconds and others would be in nanoseconds. I'm a little bit concerned if we ever interpret these timestamps incorrectly. Could there be a more explicit way of clearly differentiating them? I don't have good suggestions at the moment.

Yeah, I was thinking about that too. Right now, metrics will get their own timestamps. For other columns, we'd be using the nanoseconds. I am trying to see if we can just use milliseconds.

bq. it might be good to have short comments on what each method is testing

I did try to make the unit test names themselves descriptive like testFlowActivityTable or testWriteFlowRunMinMaxToHBase or testWriteFlowRunMetricsOneFlow or testWriteFlowActivityOneFlow but I agree some more comments in the unit test will surely help.

Will upload a new patch shortly, thanks!
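As a small illustration of the {{Number}} discussion in the comment above: relational operators do not compile on {{java.lang.Number}} operands, so one way to stay with {{Number}} is to compare and add through {{longValue()}}. The helper class below is a hypothetical sketch, not code from the patch, and it assumes the values are integral (a floating-point {{Number}} would be truncated by {{longValue()}}).

```java
/** Illustrative sketch only; assumes integral values that fit in a long. */
final class NumberCompareSketch {

  /** Returns whichever argument has the smaller long value. */
  static Number min(Number a, Number b) {
    return a.longValue() <= b.longValue() ? a : b;
  }

  /** Returns whichever argument has the larger long value. */
  static Number max(Number a, Number b) {
    return a.longValue() >= b.longValue() ? a : b;
  }

  /** Sums as longs; the long result is boxed to a Long, which is a Number. */
  static Number sum(Number a, Number b) {
    return a.longValue() + b.longValue();
  }
}
```

This lets an {{Integer}} and a {{Long}} be mixed freely, which is the situation described for values that happen to fit within an integer.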
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735384#comment-14735384 ]

Sangjin Lee commented on YARN-3901:
---

Thanks for the updated patch [~vrushalic]! I went over the new patch, and the following is the quick feedback. I'll also apply it with YARN-4074, and test it a little more.

(HBaseTimelineWriterImpl.java)
- l.141-155: the whole thing could be inside {{if (isApplication)...}}
- l.264: this null check is not needed

(FlowRunCoprocessor.java)
- l.52: Is the {{TimestampGenerator}} class going to be used outside {{FlowRunCoprocessor}}? If not, I would argue that we should make it an inner class of {{FlowRunCoprocessor}}. At least we should make it non-public to keep it within the package. If it would see general use outside this class, then it might be better to make it a true public class in the common package. I suspect a non-public class might be what we want here.
- l.52: let's make it final
- l.54: style nit: I think the common style is to place the static variables before instance variables
- Also, overall it seems we're using both the diamond operator (<>) and the old style generic declaration. It might be good to stick with one style (in which case the diamond operator might be better).
- l.144: This would mean that some cell timestamps would have the unit of the milliseconds and others would be in nanoseconds. I'm a little bit concerned if we ever interpret these timestamps incorrectly. Could there be a more explicit way of clearly differentiating them? I don't have good suggestions at the moment.

(FlowScanner.java)
- variable ordering
- l.210: Strictly speaking, {{GenericObjectMapper}} will return an integer if the value fits within an integer; so it's not exactly a concern for min/max (timestamps) but for caution we might want to stay with {{Number}} instead of long.

(TimestampGenerator.java)
- l.29: make it final
- variable ordering
- see above for the public/non-public comment

(FlowActivityRowKey.java)
- It's up to you, but you could leave the row key improvement to YARN-4074. That might be easier to manage the changes between yours and mine. I'm restructuring all *RowKey classes uniformly.

(TestHBaseTimelineWriterImplFlowRun.java)
- it might be good to have short comments on what each method is testing
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735887#comment-14735887 ]

Joep Rottinghuis commented on YARN-3901:

Thanks [~vrushalic]. I'm going to dig through the details on the latest patch.

Separately [~sjlee0] and I further discussed the challenges of taking the timestamp on the coprocessor, buffering writes, app restarts, timestamp collisions and ordering of various writes that come in.

1) Given that we have timestamps in millis, multiplying by 1,000 should suffice. It is unlikely that we'd have > 1M writes for one column in one region server for one flow. If we multiply by 1M we get close to the total date range that can fit in a long (still years to come, but still).

2) If we do any shifting of time, we should do the same everywhere to keep things consistent, and to keep the ability to ask what a particular row (roughly) looked like at any particular time (like last night midnight, what was the state of this entire row).

3) We think in the column helper, if the ATS client supplies a timestamp, we should multiply by 1,000. If we read any timestamp from HBase, we'll divide by 1,000.

4) If the ATS client doesn't supply the timestamp, we'll grab the timestamp in the ats writer the moment the write arrives (and before it is batched / buffered in the buffered mutator, HBase client, or RS queue). We then take this time and multiply by 1,000. Reads again divide by 1,000 to get back to millis in epoch as before.

5) For Agg operation SUM, MIN, and MAX we take the least significant 3 digits of the app_id and add this to the (timestamp*1000), so that we create a unique timestamp per app in an active flow-run. This should avoid any collisions. This takes care of uniqueness (no collisions on a single ms), but also solves for older instances of a writer (in case of a second AM attempt for example) or any other kind of ordering issue. The writes are timestamped when they arrive at the writer.

6) If some piece of client code doesn't set any timestamp (this should be an error) then we cannot effectively order the writes as per the previous point. We still need to ensure that we don't have collisions. If the client-supplied timestamp is Long.MAX_VALUE, then we can generate the timestamp in the coprocessor on the server side, modulo the counter to ensure uniqueness. We should still multiply by 1K to make the same amount of space for the unique counter.
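The arithmetic in points 3) to 6) above can be sketched as follows. This is a rough illustration under stated assumptions, not the ColumnHelper or TimestampGenerator code: the class and method names are hypothetical, the app id is represented only by its numeric sequence part, and Long.MAX_VALUE is used as the "timestamp not set" marker as described in the thread.

```java
/** Illustrative sketch only; names and the numeric app sequence are assumptions. */
final class CellTimestampSketch {
  /** Room for 1,000 distinct values inside each millisecond. */
  static final long TS_MULTIPLIER = 1000L;
  /** Marker for "no timestamp supplied", per the discussion in the thread. */
  static final long UNSET = Long.MAX_VALUE;

  /** Points 4) and 6): fall back to the arrival time when the client set none. */
  static long resolveMillis(long clientMillis, long arrivalMillis) {
    return clientMillis == UNSET ? arrivalMillis : clientMillis;
  }

  /** Points 3) and 5): scale millis by 1,000 and add the last 3 digits of the app sequence. */
  static long toCellTimestamp(long millis, long appSequence) {
    return millis * TS_MULTIPLIER + (appSequence % TS_MULTIPLIER);
  }

  /** Point 3): reads divide by 1,000 to get back to millis since the epoch. */
  static long toMillis(long cellTimestamp) {
    return cellTimestamp / TS_MULTIPLIER;
  }
}
```

With a multiplier of 1,000 a signed 64-bit long still covers roughly 292,000 years of milliseconds, whereas a multiplier of 1,000,000 would leave only about 292 years from the epoch, which is the range concern raised in point 1).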
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735900#comment-14735900 ]

Joep Rottinghuis commented on YARN-3901:

The one remaining issue we have to tackle is when there are two app attempts. The previous app attempt ends up buffering some writes, and the new app attempt ends up writing a final_value. Now if the flush happens before the first attempt's write comes in, we no longer have the unaggregated value for that app_id in order to discard against (the timestamp should have taken care of this order).

We can deal with this issue in three ways:

1) Ignore (risky and very hard to debug if it ever happens)

2) Keep the final value around until it has aged a certain time. Upside is that the value is initially kept (for example, for 1-2 days?) and then later discarded. Downside is that we won't collapse values as quickly on flush as we can. The collapse would probably happen when a compaction happens, possibly only when a major compaction happens. But previous unaggregated values may have been written to disk anyway, so not sure how much of an issue this really is.

3) Keep a list of the last x app_ids (aggregation compaction dimension values) on the aggregated flow-level data. What we would then do in the aggregator is to go through all the values as we currently do. We'd collapse all the values to keep only the latest per flow. Before we sum an item for the flow, we'd compare if the app_id was in the list of most recent x (10) apps that were completed and collapsed. Pro is that with a lower app completion rate in a flow, we'd be guarded against stale writes for longer than a fixed time period. We'd still limit the size of extra storage in tags to a list of x (10?) items. Downside is that if apps complete in very rapid succession, we would potentially be protected from stale writes from an app for a shorter period of time. Given that there is a correlation between an app completion and its previous run, this may not be a huge factor. It's not like random previous app attempts are launched. This is really to cover the case when a new app attempt is launched, but the previous writer had some buffered writes that somehow still got through.

I'm sort of tempted towards 2, since that is the most similar to the existing TTL functionality, and probably the easiest to code and understand. Simply compact only after a certain time period has passed.
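Options 2) and 3) above can be sketched side by side. This is a hedged illustration of the two guards only, not anything from the patch: the class names, the capacity parameter, and the idea of keeping the recent app ids in an in-memory bounded map (rather than in cell tags) are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; names and in-memory storage are assumptions. */
final class StaleWriteGuardSketch {

  /** Option 2: only collapse a final value once it has aged at least minAgeMillis. */
  static boolean oldEnoughToCollapse(long cellMillis, long nowMillis, long minAgeMillis) {
    return nowMillis - cellMillis >= minAgeMillis;
  }

  /** Option 3: remember only the last few app ids that were completed and collapsed. */
  static final class RecentAppIds {
    private final Map<String, Boolean> ids;

    RecentAppIds(final int capacity) {
      // LinkedHashMap evicts the oldest inserted entry once capacity is exceeded.
      this.ids = new LinkedHashMap<String, Boolean>() {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
          return size() > capacity;
        }
      };
    }

    /** Records that this app's values were collapsed into the flow total. */
    void markCollapsed(String appId) {
      ids.put(appId, Boolean.TRUE);
    }

    /** A late write from an app that is still remembered should not be summed again. */
    boolean isStaleWrite(String appId) {
      return ids.containsKey(appId);
    }
  }
}
```

The trade-off described in the comment shows up directly: the age check protects for a fixed time window, while the bounded list protects for as long as fewer than the configured number of other apps have completed since.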
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736140#comment-14736140 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

Somehow the jenkins info didn't make it to the JIRA:
https://builds.apache.org/job/PreCommit-YARN-Build/9044/

-1 overall

| Vote | Subsystem | Runtime | Comment |
| -1 | pre-patch | 15m 55s | Findbugs (version ) appears to be broken on YARN-2928. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | javac | 8m 6s | There were no new javac warning messages. |
| +1 | javadoc | 10m 12s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| +1 | checkstyle | 0m 16s | There were no new checkstyle issues. |
| -1 | whitespace | 0m 31s | The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | install | 1m 34s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. |
| -1 | findbugs | 0m 56s | The patch appears to introduce 7 new Findbugs (version 3.0.0) warnings. |
| +1 | yarn tests | 1m 54s | Tests passed in hadoop-yarn-server-timelineservice. |
| | | 40m 33s | |

| Reason | Tests |
| FindBugs | module:hadoop-yarn-server-timelineservice |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12754731/YARN-3901-YARN-2928.5.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / e6afe26 |
| whitespace | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/whitespace.txt |
| Findbugs warnings | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-timelineservice.html |
| hadoop-yarn-server-timelineservice test log | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9044/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, YARN-3901-YARN-2928.4.patch, YARN-3901-YARN-2928.5.patch
>
> As per the schema proposed in YARN-3815 in
> https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for the applicationId. A coprocessor will return the min value of all written values.
> - Upon flush and compactions, the min value between all the cells of this column will be written to the cell without any tag (empty tag) and all the other cells
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731056#comment-14731056 ]

Junping Du commented on YARN-3901:
----------------------------------

Thanks Vrushali for updating the patch. This is great (and huge) work, which also means it may need more rounds of review and could receive more criticism than usual. Thanks again for being patient.

After going through the code, I have a few comments (omitting ideas that duplicate Joep's or Li's comments):

In HBaseTimelineWriterImpl.java,
For isApplicationFinished() and getApplicationFinishedTime(), the event list in TimelineEntity is a SortedSet, so can we use last() to retrieve the last event instead of iterating over every element? There shouldn't be any other events after FINISHED_EVENT_TYPE, should there? In addition, because these methods are general enough, we can consider moving them to the TimelineUtils class so they can be reused by other classes. Also, some indentation issues in this class need fixing.

In ColumnHelper.java,
{code}
+ for (Attribute attribute : attributes) {
+   if (attribute != null) {
+     p.setAttribute(attribute.getName(), attribute.getValue());
+   }
+ }
{code}
Do we expect null elements to be added to attributes? If not, we should complain with an NPE or another exception instead of ignoring them silently.

In ColumnPrefix.java,
Indentation issue in Javadoc.

In TimelineWriterUtils.java,
I think getIncomingAttributes() tries to clone an array of attributes while appending an extra attribute from AggregationOperations. Maybe we should have a Javadoc to describe it. The 3 if-else cases seem unnecessary and can be combined.

I didn't go into the coprocessor classes very deeply, but I agree with Joep's comments above that they need more Javadoc to explain what the major methods are doing.

In FlowRunCoprocessor.java,
getTagFromAttribute() looks like it uses an exception to handle the normal case of matching a string against enum elements. Can we improve it by using EnumUtils?

In AggregationCompactionDimension.java,
I think the only usage here is to provide a getAttribute() method that returns an attribute object mixed with the app_id (as a byte array). If so, why make this an enum instead of a regular class, given that APPLICATION_ID is the only element? A more straightforward way may be to have a utility class that provides getAttribute() directly.

In AggregationOperations.java,
Indentation issues.

I haven't quite gone through the code around the flow activity table; more comments should come in my second-round review.

A quick check on the test code, for TestHBaseTimelineWriterImplFlowRun.java:
{code}
+ Result r1 = table1.get(g);
+ if (r1 != null && !r1.isEmpty()) {
+   Map<byte[], byte[]> values = r1.getFamilyMap(FlowRunColumnFamily.INFO
+       .getBytes());
+   assertEquals(2, r1.size());
...
{code}
Do we accept r1 being a null or empty result? I don't think so, so maybe we should check the size of r1 earlier so that we don't ignore real failure cases.
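The last() suggestion in the review above can be sketched as follows. This is only an illustration, not the patch's code: the Event class, its fields, and the "FINISHED" type string are hypothetical stand-ins for the real timeline classes, and the sketch assumes events sort by ascending timestamp (if the real ordering is descending, first() would be the call to use instead).

```java
import java.util.SortedSet;

// Hypothetical stand-in for a timeline event; not the real TimelineEvent class.
final class Event implements Comparable<Event> {
  final String type;
  final long timestamp;

  Event(String type, long timestamp) {
    this.type = type;
    this.timestamp = timestamp;
  }

  // Assumption: ascending timestamp order, ties broken by type.
  @Override
  public int compareTo(Event other) {
    int c = Long.compare(timestamp, other.timestamp);
    return c != 0 ? c : type.compareTo(other.type);
  }
}

final class FinishedEventCheck {
  static final String FINISHED = "FINISHED"; // hypothetical event type id

  // Inspects only the final element instead of scanning the whole set.
  static boolean isFinished(SortedSet<Event> events) {
    return !events.isEmpty() && FINISHED.equals(events.last().type);
  }

  // Returns the finish time, or -1 if the last event is not a finish event.
  static long finishedTime(SortedSet<Event> events) {
    return isFinished(events) ? events.last().timestamp : -1L;
  }
}
```

The emptiness check matters because SortedSet.last() throws NoSuchElementException on an empty set.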
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731173#comment-14731173 ]

Vrushali C commented on YARN-3901:
----------------------------------

Thanks [~djp], appreciate the feedback! I too am figuring out how to make this more readable and sincerely appreciate everyone's time and efforts in reviewing it.

Some responses:

bq. TestHBaseTimelineWriterImplFlowRun.java, Do we accept r1 to be null or empty result? I don't think so, so may be we should check the size of r1 earlier so we are not ignore the real failure cases?

Right, I will add an assert not null here.

bq. In TimelineWriterUtils.java, I think getIncomingAttributes() tries to clone an array of attributes with appending an extra attribute in AggregationOperations. May be we should have a javadoc to describe it. The 3 if else cases sounds unnecessary and can be combined.

I think I will update this method a bit more and add more comments so that it explains well what the code is doing.

bq. In ColumnHelper.java, Do we expect null element added to attributes? If not, we should complain with NPE or other exception instead of ignore it silently.

Hmm. So this method in ColumnHelper is called from several places and is nested within many calls from the HBase writer down to here. Since the list of attributes is a variable-length list of parameters, it could be null or turn out to be empty if some function in between decides to remove an attribute, so this was more of a safety check. The list of Attributes can be modified at several places in the call stack, so it is not actually an error if it arrives at this point as an empty list. But I will think over this a bit more.

bq. For isApplicationFinished() and getApplicationFinishedTime(), the event list in TimelineEntity is a SortedSet, so we can use last() to retrieve the last event instead of for each every element? It shouldn't be other events after FINISHED_EVENT_TYPE. Isn't it?

Ah, I did not know that it was a sorted set, will update the code accordingly.

bq. In addition, because the method is general enough. In addition, we can consider to move them to TimelineUtils class so can be reused by other classes.

Sounds good, will refactor it.

Looks like the indentation is a bit off in some places; I will update the formatting as recommended by [~gtCarrera9], [~jrottinghuis] and [~djp].
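The two positions on null handling in this exchange are compatible and can be sketched together: a null or empty variable-length attribute list is tolerated as "no attributes", while a null element inside the list fails fast. The helper below is hypothetical and uses plain strings in place of the patch's Attribute type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class AttributeChecks {
  // Hypothetical helper: a null or empty varargs array means "no attributes",
  // but a null element is treated as a programming error and rejected.
  static List<String> validated(String... attributes) {
    List<String> result = new ArrayList<>();
    if (attributes == null) {
      return result;
    }
    for (String attribute : attributes) {
      result.add(Objects.requireNonNull(attribute, "null attribute"));
    }
    return result;
  }
}
```

Whether a null element should really be an error depends on whether any caller in the write path legitimately produces one, which the thread leaves open.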
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731650#comment-14731650 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

[~gtCarrera9], I think we're not far off actually. I've been testing using the v.3 patch, and with a few more changes that Vrushali will be doing, it should be pretty close to a reasonably complete state. At this point, it would be more efficient to bring this to completion. My 2 cents.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731639#comment-14731639 ]

Li Lu commented on YARN-3901:
-----------------------------

Hi [~vrushalic], since part of this JIRA is blocking the web UI work, but we may still need some time to reach a stable state on this JIRA, is it possible to separate out the flow activity part? That way we could unblock the flow activity table queries, as well as the web services. However, if it's too hard to split the JIRA, we can proceed with all the tasks in this JIRA. Thoughts?
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731648#comment-14731648 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

(1) FlowScanner cell order issue

When I added the reader code and started testing it against the flow run unit tests, I found that reading the END_TIME column on the flow run table didn't work. The flow run column read on END_TIME is essentially {{Result.getValue()}}. However, HBase was failing to find the END_TIME column although it clearly existed in the result. It was basically failing at the binary search:
{code}
public Cell getColumnLatestCell(byte [] family, byte [] qualifier) {
  Cell [] kvs = rawCells(); // side effect possibly.
  if (kvs == null || kvs.length == 0) {
    return null;
  }
  int pos = binarySearch(kvs, family, qualifier);
  if (pos == -1) {
    return null;
  }
  if (CellUtil.matchingColumn(kvs[pos], family, qualifier)) {
    return kvs[pos];
  }
  return null;
}
{code}

The binary search was failing because the cells in the result were stored in the wrong order. The cells were stored in the wrong order because they were being added by our co-processor (in FlowScanner.nextInternal()):
{code}
if (runningSum.size() > 0) {
  for (Map.Entry newCellSum : runningSum.entrySet()) {
    // create a new cell that represents the flow metric
    Cell c = newCell(metricCell.get(newCellSum.getKey()),
        newCellSum.getValue());
    cells.add(c);
  }
}
if (currentMinCell != null) {
  cells.add(currentMinCell);
}
if (currentMaxCell != null) {
  cells.add(currentMaxCell);
}
{code}

And this order is preserved all the way to the reader. The fix is to add the cells in the right order via KeyValueComparator. This fix is included in my patch on YARN-4074. This will be fixed in this JIRA.
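The failure mode described in (1) can be reproduced without HBase. The sketch below is an analogy under stated assumptions: plain strings stand in for cells, natural string order stands in for the KeyValue comparator, and the qualifier names are illustrative only. It shows a binary search missing an element that is present because the list was built in append order, and finding it once the list is sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CellOrderDemo {
  // Binary-search lookup; only meaningful when the list is sorted.
  static boolean contains(List<String> qualifiers, String wanted) {
    return Collections.binarySearch(qualifiers, wanted) >= 0;
  }

  // Returns a copy in natural order, leaving the input untouched.
  static List<String> sorted(List<String> qualifiers) {
    List<String> copy = new ArrayList<>(qualifiers);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    // Append order as in the scanner: summed metric, then min cell, then max cell.
    List<String> appended =
        Arrays.asList("metric", "min_start_time", "max_end_time");
    System.out.println(contains(appended, "max_end_time"));         // false
    System.out.println(contains(sorted(appended), "max_end_time")); // true
  }
}
```

Sorting the output before returning it (or collecting it in a sorted structure to begin with) restores the invariant that the reader's binary search relies on.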
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731651#comment-14731651 ]

Li Lu commented on YARN-3901:
-----------------------------

[~sjlee0] sure, then let's keep all the work here!
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731654#comment-14731654 ]

Sangjin Lee commented on YARN-3901:
-----------------------------------

(2) colliding puts in the co-processor

We found another issue with the write side of things via a unit test. It fails only occasionally (and more often in some environments than others). It happens when 2 puts on the same column come in very close together (namely within 1 millisecond).

The code in question is {{FlowRunCoprocessor.prePut()}}:
{code}
for (Map.Entry<byte[], List<Cell>> entry : put.getFamilyCellMap()
    .entrySet()) {
  List<Cell> newCells = new ArrayList<>(entry.getValue().size());
  for (Cell cell : entry.getValue()) {
    // for each cell in the put add the tags
    // Assumption is that all the cells in
    // one put are the same operation
    newCells.add(CellUtil.createCell(CellUtil.cloneRow(cell),
        CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
        cell.getTimestamp(), KeyValue.Type.Put,
        CellUtil.cloneValue(cell), Tag.fromList(tags)));
  }
  newFamilyMap.put(entry.getKey(), newCells);
} // for each entry
{code}

If 2 cells, for example, carry the same timestamp, then the later one ends up overwriting the previous one, effectively losing one put. This was triggered by one of the tests in {{TestHBaseTimelineWriterImplFlowRun.java}}.

It's an edge case that is rather unlikely to happen normally, but it is an issue nonetheless. How to solve this problem is pretty complicated. We'll soon post possible approaches for handling it. At any rate, I suspect we could isolate this issue into a separate JIRA and tackle it post-UI-POC. I'd appreciate your feedback.
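The collision described in (2) can be modelled with a toy store in which each (column, timestamp) pair holds exactly one value. Everything here is hypothetical and is not HBase code; StrictlyIncreasingClock is one conceivable mitigation shown purely for illustration, not the approach the thread settles on.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Toy model: one value per (column, timestamp) version.
final class CollidingPutsDemo {
  private final Map<String, Long> versions = new HashMap<>();

  void put(String column, long timestamp, long value) {
    // A second put with the same column and timestamp silently overwrites.
    versions.put(column + "@" + timestamp, value);
  }

  int versionCount() {
    return versions.size();
  }

  long sum() {
    long total = 0;
    for (long v : versions.values()) {
      total += v;
    }
    return total;
  }
}

// Hypothetical mitigation: never hand out the same timestamp twice.
final class StrictlyIncreasingClock {
  private final AtomicLong last = new AtomicLong();

  // Returns the wall-clock value, bumped past the previous result if needed.
  long next(long now) {
    return last.updateAndGet(prev -> Math.max(now, prev + 1));
  }
}
```

With raw timestamps, two puts of value 1 at time 1000 leave a single version and a sum of 1; routing the timestamps through next() yields 1000 and 1001, so both versions survive. The trade-off is that stored timestamps can drift slightly ahead of the wall clock under bursts, and a per-process counter does nothing for writers on different hosts.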
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729511#comment-14729511 ]

Vrushali C commented on YARN-3901:
----------------------------------

Thanks [~gtCarrera9] for the review. Let me try to answer your questions above.

bq. Name of Attribute seems to be quite general. Maybe we want something more specific? From my understanding, Attribute acts as the "command" (as the meaning in design pattern) of the aggregation?

Yes, the attribute indicates what action needs to be taken in the aggregation step/reading step. What name do you recommend?

bq. TimelineSchemaCreator May conflict with YARN-4102. I'm fine with either order to put them in.

Yes, I thought about that, but the patch in YARN-4102 is not committed yet, so I could not rebase. I am good with rebasing the YARN-3901 patch if YARN-4102 gets in.

bq. Are we assuming there will be at most two attributes for each column prefix? In FlowScanner we're only dealing with two attributes, one from compaction one from operations. But in FlowActivityColumnPrefix we're assuming there's a list of attributes?

No, there can be any number of attributes for a column prefix. Currently MIN, MAX and SUM happen to be exclusive, in the sense that if you want a min for the start time, it's unlikely that you also want to be SUMing up the start times. FlowScanner looks for the application id from the aggregation compaction dimensions and for MIN/MAX/SUM from the aggregation operations. The FlowScanner class is very different from FlowActivityColumnPrefix. FlowActivityColumnPrefix, or any class that generates a Put, will deal with a list of attributes.

bq. What is our plan on FlowActivityColumnPrefix#IN_PROGRESS_TIME?

Yes, this is the timestamp that needs to be put into the flow activity table for all (long) running applications. If an application in a flow starts on, say, day 1, runs through day 2 and day 3, and ends on day 4, then the flow activity table needs to have an entry for this flow for day 2 and day 3. This is the in-progress time of that application; it is the TBD part being thought over in YARN-4069. We need to decide whether we want the RM to write it, or the app master, or something else offline.

bq. In FlowScanner, after aggregation (in nextInternal) we're simply adding aggregated data as a Cell. However I haven't found where we're guaranteeing the new node is not aggregate again (and we create another new cell for the aggregation result). Are we doing this deliberately or I'm missing anything here?

Hmm, not sure I got the question, but let me try to explain what the FlowScanner should be doing. It reads each cell one by one. Say, for the start time column, it reads the cells. For a flow, we want the lowest value for the start time of the flow, hence these cells have a tag of MIN. So nextInternal will return one cell with the min value for the start time column. Similarly for MAX; for SUM, it sums up the cell values. Hope this helps.

Also, I will double check the formatting related comments and update as necessary. Appreciate the review!

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
> As per the schema proposed in YARN-3815 in https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for the applicationId. A coprocessor will return the min value of all written values.
> - Upon flush and compactions, the min value between all the cells of this column will be written to the cell without any tag (empty tag) and all the other cells will be discarded.
> - Ditto for the max_end_time, but then the
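The in-progress case from the comment above (an application that starts on day 1 and ends on day 4 must also show up under day 2 and day 3) could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and it assumes that flow activity entries are bucketed by a top-of-day timestamp in UTC, which the thread does not spell out. Who performs the write is still open per YARN-4069.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the patch: finds the days an app was "in progress". */
public class InProgressDays {
  private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

  /** Truncates a timestamp (ms since epoch) to the start of its UTC day (assumed bucketing). */
  public static long topOfDay(long ts) {
    return ts - Math.floorMod(ts, MILLIS_PER_DAY);
  }

  /**
   * Returns the top-of-day timestamps strictly between the start day and the
   * end day, i.e. the days on which the app neither started nor finished.
   */
  public static List<Long> inProgressDays(long startTs, long endTs) {
    List<Long> days = new ArrayList<>();
    long lastDay = topOfDay(endTs);
    for (long day = topOfDay(startTs) + MILLIS_PER_DAY; day < lastDay; day += MILLIS_PER_DAY) {
      days.add(day);
    }
    return days;
  }
}
```

With a start on day 1 and an end on day 4, the helper yields exactly the two intermediate days; an app that starts and ends on the same or adjacent days yields an empty list.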
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729886#comment-14729886 ]

Vrushali C commented on YARN-3901:
----------------------------------

bq. From the code I can see we add the newly created cell to the list of cells (cells.add(someNewCell)). Is this operation only modifying the returned value of the next call, or it permanently writes the aggregated values back to HBase? This may sound silly but I'm not very familiar with the observer coprocessors.

Yes, in this code, we are only "reading" or returning back cells to the client (hbase client). But when we add in the compaction/flush coprocessors, they will write back to hbase as well.

bq. Maybe a name that is more specific will help? How about something like AggregationPutAttribute?

Hmm. It's not really an attribute for the Put, it's more like a characteristic of the cell value itself. We could use these attributes outside of Aggregation as well, so I don't want the agg prefix here. Still thinking about what to call it. Perhaps FlowValueAttribute?

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
> As per the schema proposed in YARN-3815 in https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for the applicationId. A coprocessor will return the min value of all written values.
> - Upon flush and compactions, the min value between all the cells of this column will be written to the cell without any tag (empty tag) and all the other cells will be discarded.
> - Ditto for the max_end_time, but then the max will be kept.
> - Tags are represented as #type:value. The type can be not set (0), or can indicate running (1) or complete (2). In those cases (for metrics) only complete app metrics are collapsed on compaction.
> - The m! values are aggregated (summed) upon read. Only when applications are completed (indicated by tag type 2) can the values be collapsed.
> - The application ids that have completed and been aggregated into the flow numbers are retained in a separate column for historical tracking: we don’t want to re-aggregate for those upon replay
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
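The read-side behavior discussed in this thread (many cells written for one column, one aggregated value returned to the client) can be illustrated without any HBase dependency. The sketch below is not the patch's FlowScanner: the `Op` enum and the `aggregate` method are hypothetical stand-ins, and plain `long` values take the place of tagged HBase cells.

```java
import java.util.List;

/** Illustration only: collapses the values of one column according to its operation. */
public class FlowAggregationSketch {
  /** Hypothetical stand-in for the aggregation operation carried in a cell tag. */
  public enum Op { MIN, MAX, SUM }

  /** Returns the single value a reader would see for the column. */
  public static long aggregate(Op op, List<Long> cellValues) {
    if (cellValues.isEmpty()) {
      throw new IllegalArgumentException("no cells to aggregate");
    }
    long result = cellValues.get(0);
    for (long v : cellValues.subList(1, cellValues.size())) {
      switch (op) {
        case MIN: result = Math.min(result, v); break; // e.g. min_start_time
        case MAX: result = Math.max(result, v); break; // e.g. max_end_time
        case SUM: result += v; break;                  // e.g. metric values
        default: throw new IllegalStateException("unknown op " + op);
      }
    }
    return result;
  }
}
```

As the comment above explains, the read path only changes what is returned to the client; writing the collapsed value back would happen in the flush/compaction coprocessors, which this sketch does not model.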
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729890#comment-14729890 ]

Li Lu commented on YARN-3901:
-----------------------------

bq. Yes, in this code, we are only "reading" or returning back cells to the client (hbase client). But when we add in the compaction/flush coprocessors, they will write back to hbase as well.

Oh I see. This is the missing piece that confused me. Thanks for the clarification!

bq. We could use these attributes outside of Aggregation as well, so don't want the agg prefix here.

OK, if this is the plan, let's leave it as is to unblock the critical path of the whole JIRA? We can clean this up later.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729910#comment-14729910 ]

Joep Rottinghuis commented on YARN-3901:
----------------------------------------

Not sure if [~gtCarrera9] is asking what happens if we have consolidated a "finished" app into the flow aggregate and then we get an older batched put from the same app (from an older AM that wasn't properly killed or something). That is currently an open problem that we can address by storing the list of the x most recently completed apps in a tag on the total sum. We can't quite store all app IDs that have been collapsed into the flow sum, because that list could potentially be really long. We could keep a list of the last x apps, to radically reduce the likelihood that stale batched writes end up messing up the aggregation if there were to be a race condition (however rare that might be). I think we can add that guarding behavior in a separate jira and not further complicate the first cut (which might suffer from this rare race condition).

As [~sjlee0] pointed out and we were discussing offline this morning, scan.setMaxResultSize(limit); doesn't limit the number of rows that are returned, but limits the size in bytes. Not sure we want to address that here, or if we'll let him adjust that in his patch. We should limit the number of rows we retrieve from the scan and, if needed (as [~sjlee0] pointed out), add a PageFilter with a limit in addition to the limit to further restrict what is prefetched in a buffer (this gets more complicated when the scan spans region servers).

It is a little confusing to read the patch and see where ColumnPrefix has just an additional argument in the store method for Attribute... attributes and when the method with a String qualifier is changed to one with a byte[]. There seems to be a discrepancy in how EntityTablePrefix and ApplicationTablePrefix are handled.
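The "last x completed apps" guard proposed above could look roughly like the following. This is a sketch under stated assumptions, not something from the patch: the class and method names are hypothetical, the list lives in memory here rather than in a tag on the total sum, and app ids are plain strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: remembers the most recently completed app ids, up to a fixed capacity. */
public class RecentlyCompletedApps {
  private final Map<String, Boolean> recent;

  public RecentlyCompletedApps(final int capacity) {
    // Insertion-ordered map that drops its oldest entry once capacity is exceeded.
    this.recent = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > capacity;
      }
    };
  }

  /** Records that the app's metrics have been collapsed into the flow aggregate. */
  public void markCompleted(String appId) {
    recent.put(appId, Boolean.TRUE);
  }

  /** A write from an app that is still remembered as completed is stale and is rejected. */
  public boolean shouldAccept(String appId) {
    return !recent.containsKey(appId);
  }
}
```

Because the list is bounded, a stale write from an app that has already been evicted would still be accepted. That matches the comment's framing: the guard reduces the likelihood of the race rather than eliminating it.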
I'm not sure if it is necessary to have a getColumnQualifier with a long in ColumnHelper, but we may have to review this together interactively behind two laptops.

In HBaseTimelineWriterImpl.onApplicationFinished you have an old comment:
281 // clear out the hashmap in case the downstream store/coprocessor
282 // added any more attributes during the put for END_TIME
that no longer makes sense. Also, I'd simply do:
{code}
storeFlowMetrics(rowKey, metrics, attribute1, AggregationOperations.SUM_FINAL.getAttribute());
{code}

I'd update the javadoc in AggregationOperations before the enum definitions. They are in the old style and no longer make sense in the more generic case. SUM indicates that the values need to be summed up, MIN means that only the minimum needs to be kept, etc.

Initially I found the method name getIncomingAttributes and the corresponding member names somewhat confusing (from the method's perspective these aren't the incoming values, they are the outgoing values). Perhaps combinedAttribute and combineAttributes(...) make more sense, but the provided logic seems correct.

FlowActivityColumnPrefix.IN_PROGRESS_TIME needs a better javadoc description to describe its use and meaning.

The coprocessor methods need a little more javadoc to explain what is going on. To the casual reader this is total voodoo. The preGetOp creates a new scanner (ok), then does a single next on it (why?) and then bypasses the environment (huh?). Similarly, if in preScannerOpen we already set scan.setMaxVersions(); then why is the same still needed in PreGetOp, while in PostScannerOpen we don't do it anymore (presumably already done in the preOpen)?

I like the more generic FlowRunCoprocessor (although it can have a name that is not associated with a table, because the behavior is generic, and arg names such as frpa are probably an artifact from a previous version).

In getTagFromAttribute, is it possible to recognize an operation from an AggregationCompactionDimension without relying on an exception and catching it? For example, can you do AggregationOperation.isA(Attribute a) or something like that?

The other thing I realize with the coprocessor is this. It nicely maps attributes to tags, but we unnecessarily bloat every single put with the operation. We could get creative and use a different column prefix for min and max columns. Then the coprocessor can pick that up during read/flush/compaction. That makes queries (and filters) much harder. So for now we're probably stuck with tagging each value. Perhaps that is not so bad for min and max, given that after flush and compact we store only one value. For SUM we will always have an Aggregation dimension, so adding a SUM tag then isn't needed. We assume an aggregation dimension w/o agg operation would default to SUM. We do certainly need to tag values with SUM_FINAL.

Aside from that, in FlowRunProcessor.prePut do we have to keep doing Tag.fromList(tags) for each cell, or can we create a Tag once and re-use it?

When reading through the
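The isA(...) idea raised above (telling an operation apart from a compaction dimension without catching an exception) is commonly done with a precomputed name lookup. The sketch below is hypothetical and is not the patch's AggregationOperations class: it keys on a plain attribute name rather than the patch's Attribute type, and the enum constants are limited to the operations named in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: exception-free membership test for an enum of operations. */
public class AttributeLookupSketch {
  public enum AggregationOperation {
    MIN, MAX, SUM, SUM_FINAL;

    // Built once when the enum is loaded; avoids valueOf(...) plus try/catch on every lookup.
    private static final Map<String, AggregationOperation> BY_NAME = new HashMap<>();
    static {
      for (AggregationOperation op : values()) {
        BY_NAME.put(op.name(), op);
      }
    }

    /** True if the attribute name denotes an aggregation operation. */
    public static boolean isA(String attributeName) {
      return BY_NAME.containsKey(attributeName);
    }

    /** Returns the matching operation, or null if the name is not an operation. */
    public static AggregationOperation fromName(String attributeName) {
      return BY_NAME.get(attributeName);
    }
  }
}
```

A caller in the style of getTagFromAttribute could then branch on isA(...) first and treat everything else as a compaction dimension, keeping exceptions for genuinely unexpected input.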
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729821#comment-14729821 ]

Li Lu commented on YARN-3901:
-----------------------------

Hi [~vrushalic], some of my comments:

bq. Yes, attribute indicates what action needs to be taken in the aggregation step/reading step. What name do you recommend?

Maybe a name that is more specific will help? How about something like AggregationPutAttribute?

bq. I am good with rebasing the YARN-3901 patch if YARN-4102 gets in.

Given the timing of both patches, let's proceed with both patches and rebase the later one. Should be trivial.

bq. No, there can be any number of attributes for a column prefix.

Sure. So maybe we can clarify somewhere that these AggregationOperations options are exclusive in FlowScanner (i.e. adding more than one of them will not generate two aggregated results)?

bq. Yes, this is timestamp that needs to be put into the flow activity table for all (long) running applications.

Ah I see. That makes sense.

bq. It will read each cell one by one. Say for start time column, it reads the cells. Now for a flow, we want that value which is the lowest for the start time of the flow. Hence these cells have a tag of MIN. So, the nextInternal will return one cell with the min value for the column start time. Similarly for max and for SUM, it sums up the cell values.

Thanks for the clarification. From the code I can see we add the newly created cell to the list of cells ({{cells.add(someNewCell)}}). Is this operation only modifying the returned value of the next call, or does it permanently write the aggregated values back to HBase? This may sound silly but I'm not very familiar with the observer coprocessors.
[jira] [Commented] (YARN-3901) Populate flow run data in the flow_run & flow activity tables
[ https://issues.apache.org/jira/browse/YARN-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14728367#comment-14728367 ]

Li Lu commented on YARN-3901:
-----------------------------

Hi [~vrushalic], thanks for the work! I looked at the current patch and have the following comments/questions:

- Name of Attribute seems to be quite general. Maybe we want something more specific? From my understanding, Attribute acts as the "command" (as the meaning in design pattern) of the aggregation?

HBaseTimelineWriterImpl
- storeInFlowActivityTable, why two levels of indirection? The te parameter is never used in the first wrapper.
- Move the few static helper methods into TimelineEntity?
- Since both AggregationCompactionDimension and AggregationOperations may generate an Attribute, maybe it's helpful to distinguish Attributes from these two sources by their names? A name like attribute1 does not look helpful. :(

getIncomingAttributes in TimelineWriterUtils may need some more comments on its function.

TimelineSchemaCreator may conflict with YARN-4102. I'm fine with either order to put them in.

Are we assuming there will be at most two attributes for each column prefix? In FlowScanner we're only dealing with two attributes, one from compaction and one from operations. But in FlowActivityColumnPrefix we're assuming there's a list of attributes?

Maybe I'm missing something, but why are we converting hbase attributes into tags in FlowRunCoprocessor, but not doing the same thing in FlowActivityCoprocessor? Or, what does FlowActivityCoprocessor aggregate on?

What is our plan on FlowActivityColumnPrefix#IN_PROGRESS_TIME?

In FlowScanner, after aggregation (in nextInternal) we're simply adding aggregated data as a Cell. However, I haven't found where we're guaranteeing the new node is not aggregated again (and we create another new cell for the aggregation result). Are we doing this deliberately or am I missing something here?
nits:
- l.149, l.310, HBaseTimelineWriterImpl: indentation problems?
- There are some lines longer than 80.
- l.64 AggregationOperations, wrong indentation with tab.
- FlowRunColumnPrefix, FlowActivityColumnPrefix (maybe somewhere else): in Hadoop a common practice is to only have one space before the name of member variables. We don't really need to make all of them start in the same column.
- Just curious, how did you choose the numbers associated with different AggregationOperations?

It's a big patch so I may find something more tomorrow. Sorry about that, but I do not want to block the whole review process.

> Populate flow run data in the flow_run & flow activity tables
> -------------------------------------------------------------
>
> Key: YARN-3901
> URL: https://issues.apache.org/jira/browse/YARN-3901
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Vrushali C
> Assignee: Vrushali C
> Attachments: YARN-3901-YARN-2928.1.patch, YARN-3901-YARN-2928.2.patch, YARN-3901-YARN-2928.3.patch, YARN-3901-YARN-2928.WIP.2.patch, YARN-3901-YARN-2928.WIP.patch
>
> As per the schema proposed in YARN-3815 in https://issues.apache.org/jira/secure/attachment/12743391/hbase-schema-proposal-for-aggregation.pdf
> filing jira to track creation and population of data in the flow run table.
> Some points that are being considered:
> - Stores per flow run information aggregated across applications, flow version
> RM’s collector writes to on app creation and app completion
> - Per App collector writes to it for metric updates at a slower frequency than the metric updates to application table
> primary key: cluster ! user ! flow ! flow run id
> - Only the latest version of flow-level aggregated metrics will be kept, even if the entity and application level keep a timeseries.
> - The running_apps column will be incremented on app creation, and decremented on app completion.
> - For min_start_time the RM writer will simply write a value with the tag for the applicationId.
A coprocessor will return the min value of all written > values. - > - Upon flush and compactions, the min value between all the cells of this > column will be written to the cell without any tag (empty tag) and all the > other cells will be discarded. > - Ditto for the max_end_time, but then the max will be kept. > - Tags are represented as #type:value. The type can be not set (0), or can > indicate running (1) or complete (2). In those cases (for metrics) only > complete app metrics are collapsed on compaction. > - The m! values are aggregated (summed) upon read. Only when applications are > completed (indicated by tag type 2) can the values be collapsed. > - The application ids that have completed and been aggregated into the flow > numbers are retained in a separate column for historical tracking: we don’t > want