[ https://issues.apache.org/jira/browse/GOBBLIN-1672?focusedWorklogId=797771&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-797771 ]
ASF GitHub Bot logged work on GOBBLIN-1672:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 03/Aug/22 20:53
Start Date: 03/Aug/22 20:53
Worklog Time Spent: 10m
Work Description: Will-Lo commented on code in PR #3532:
URL: https://github.com/apache/gobblin/pull/3532#discussion_r937128646
##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -1132,24 +1035,39 @@ private void cleanUp() {
DagNode<JobExecutionPlan> dagNode = dagNodeList.poll();
deleteJobState(dagId, dagNode);
}
- log.info("Dag {} has finished with status FAILED; Cleaning up dag from the state store.", dagId);
- onFlowFailure(dagId);
+ Dag<JobExecutionPlan> dag = this.dags.get(dagId);
+ String status = TimingEvent.FlowTimings.FLOW_FAILED;
+ if (TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+ this.dagManagerMetrics.emitFlowSlaExceededMetrics(DagManagerUtils.getFlowId(dag));
+ } else if (!TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+ dagManagerMetrics.emitFlowFailedMetrics(DagManagerUtils.getFlowId(this.dags.get(dagId)));
+ }
+ addFailedDag(dagId);
+ log.info("Dag {} has finished with status {}; Cleaning up dag from the state store.", dagId, status);
// send an event before cleaning up dag
- DagManagerUtils.emitFlowEvent(this.eventSubmitter, this.dags.get(dagId), TimingEvent.FlowTimings.FLOW_FAILED);
+ DagManagerUtils.emitFlowEvent(this.eventSubmitter, this.dags.get(dagId), status);
dagIdstoClean.add(dagId);
}
- //Clean up completed dags
- for (String dagId : this.dags.keySet()) {
+ // Remove dags that are finished and emit their appropriate metrics
+ for (Map.Entry<String, Dag<JobExecutionPlan>> dagIdKeyPair : this.dags.entrySet()) {
+ String dagId = dagIdKeyPair.getKey();
+ Dag<JobExecutionPlan> dag = dagIdKeyPair.getValue();
if (!hasRunningJobs(dagId) && !this.failedDagIdsFinishRunning.contains(dagId)) {
String status = TimingEvent.FlowTimings.FLOW_SUCCEEDED;
if (this.failedDagIdsFinishAllPossible.contains(dagId)) {
- onFlowFailure(dagId);
+ if (TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+ this.dagManagerMetrics.emitFlowSlaExceededMetrics(DagManagerUtils.getFlowId(dag));
+ } else if (!TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+ this.dagManagerMetrics.conditionallyMarkFlowAsState(DagManagerUtils.getFlowId(this.dags.get(dagId)),
+ DagManager.FlowState.FAILED);
+ dagManagerMetrics.emitFlowFailedMetrics(DagManagerUtils.getFlowId(this.dags.get(dagId)));
+ }
status = TimingEvent.FlowTimings.FLOW_FAILED;
+ addFailedDag(dagId);
this.failedDagIdsFinishAllPossible.remove(dagId);
- conditionallyUpdateFlowGaugeExecutionState(flowGauges, DagManagerUtils.getFlowId(this.dags.get(dagId)), FlowState.FAILED);
Review Comment:
I think it was unintentional. I went back and forth on where to add the flow
gauge removal and mistakenly left it out of addFailedDag(); I will add it there.
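
The branch that picks which failure metric to emit appears twice in the diff above, once in each cleanup path. As a rough illustration only, the sketch below shows how that decision could live in a single helper so that neither call site can omit a step. Every name in it (`FailedDagMetricsSketch`, `Metric`, `metricFor`, `recordFailure`, `count`) is hypothetical, and the string constants are placeholders standing in for the `TimingEvent.FlowTimings` values; none of this is Gobblin's actual code.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Gobblin APIs.
class FailedDagMetricsSketch {
  // Placeholder values standing in for the TimingEvent.FlowTimings constants.
  static final String FLOW_RUN_DEADLINE_EXCEEDED = "FlowRunDeadlineExceeded";
  static final String FLOW_START_DEADLINE_EXCEEDED = "FlowStartDeadlineExceeded";

  enum Metric { SLA_EXCEEDED, FAILED, NONE }

  // In-memory counters used here in place of a real metrics backend.
  private final Map<Metric, Integer> counts = new EnumMap<>(Metric.class);

  // Mirrors the if / else-if chain in the diff: a run-deadline kill counts as
  // SLA exceeded, a start-deadline kill emits nothing in this branch, and
  // anything else (including a null flow event) counts as a plain failure.
  static Metric metricFor(String flowEvent) {
    if (FLOW_RUN_DEADLINE_EXCEEDED.equals(flowEvent)) {
      return Metric.SLA_EXCEEDED;
    }
    if (FLOW_START_DEADLINE_EXCEEDED.equals(flowEvent)) {
      return Metric.NONE;
    }
    return Metric.FAILED;
  }

  // Single entry point that both cleanup paths could call.
  Metric recordFailure(String flowEvent) {
    Metric metric = metricFor(flowEvent);
    if (metric != Metric.NONE) {
      counts.merge(metric, 1, Integer::sum);
    }
    return metric;
  }

  int count(Metric metric) {
    return counts.getOrDefault(metric, 0);
  }

  public static void main(String[] args) {
    FailedDagMetricsSketch metrics = new FailedDagMetricsSketch();
    metrics.recordFailure(FLOW_RUN_DEADLINE_EXCEEDED);
    metrics.recordFailure("FlowFailed");
    System.out.println(metrics.count(Metric.SLA_EXCEEDED) + " SLA exceeded, "
        + metrics.count(Metric.FAILED) + " failed");
  }
}
```

Keeping the decision in one method means a side effect such as the flow gauge update would only need to be added in one place, which is the kind of omission discussed in this thread.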
Issue Time Tracking
-------------------
Worklog Id: (was: 797771)
Time Spent: 2h 10m (was: 2h)
> Refactor metrics in dagmanager and add per spec executor metrics
> ----------------------------------------------------------------
>
> Key: GOBBLIN-1672
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1672
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: gobblin-service
> Reporter: William Lo
> Assignee: Abhishek Tiwari
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> Add the following metrics:
> 1. Success per executor
> 2. Fail per executor
> 3. SLA killed per executor
> 4. SLA killed per flowgroup
> 5. SLA killed per user
> 6. SLA killed overall