[ 
https://issues.apache.org/jira/browse/GOBBLIN-1672?focusedWorklogId=797426&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-797426
 ]

ASF GitHub Bot logged work on GOBBLIN-1672:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Aug/22 22:27
            Start Date: 02/Aug/22 22:27
    Worklog Time Spent: 10m 
      Work Description: arjun4084346 commented on code in PR #3532:
URL: https://github.com/apache/gobblin/pull/3532#discussion_r936076417


##########
gobblin-service/src/main/java/org/apache/gobblin/service/modules/orchestration/DagManager.java:
##########
@@ -1132,24 +1035,39 @@ private void cleanUp() {
           DagNode<JobExecutionPlan> dagNode = dagNodeList.poll();
           deleteJobState(dagId, dagNode);
         }
-        log.info("Dag {} has finished with status FAILED; Cleaning up dag from the state store.", dagId);
-        onFlowFailure(dagId);
+        Dag<JobExecutionPlan> dag = this.dags.get(dagId);
+        String status = TimingEvent.FlowTimings.FLOW_FAILED;
+        if (TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+          this.dagManagerMetrics.emitFlowSlaExceededMetrics(DagManagerUtils.getFlowId(dag));
+        } else if (!TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+          dagManagerMetrics.emitFlowFailedMetrics(DagManagerUtils.getFlowId(this.dags.get(dagId)));
+        }
+        addFailedDag(dagId);
+        log.info("Dag {} has finished with status {}; Cleaning up dag from the state store.", dagId, status);
         // send an event before cleaning up dag
-        DagManagerUtils.emitFlowEvent(this.eventSubmitter, this.dags.get(dagId), TimingEvent.FlowTimings.FLOW_FAILED);
+        DagManagerUtils.emitFlowEvent(this.eventSubmitter, this.dags.get(dagId), status);
         dagIdstoClean.add(dagId);
       }
 
-      //Clean up completed dags
-      for (String dagId : this.dags.keySet()) {
+      // Remove dags that are finished and emit their appropriate metrics
+      for (Map.Entry<String, Dag<JobExecutionPlan>> dagIdKeyPair : this.dags.entrySet()) {
+        String dagId = dagIdKeyPair.getKey();
+        Dag<JobExecutionPlan> dag = dagIdKeyPair.getValue();
         if (!hasRunningJobs(dagId) && !this.failedDagIdsFinishRunning.contains(dagId)) {
           String status = TimingEvent.FlowTimings.FLOW_SUCCEEDED;
           if (this.failedDagIdsFinishAllPossible.contains(dagId)) {
-            onFlowFailure(dagId);
+            if (TimingEvent.FlowTimings.FLOW_RUN_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+              this.dagManagerMetrics.emitFlowSlaExceededMetrics(DagManagerUtils.getFlowId(dag));
+            } else if (!TimingEvent.FlowTimings.FLOW_START_DEADLINE_EXCEEDED.equals(dag.getFlowEvent())) {
+              this.dagManagerMetrics.conditionallyMarkFlowAsState(DagManagerUtils.getFlowId(this.dags.get(dagId)),
+                  DagManager.FlowState.FAILED);
+              dagManagerMetrics.emitFlowFailedMetrics(DagManagerUtils.getFlowId(this.dags.get(dagId)));
+            }
             status = TimingEvent.FlowTimings.FLOW_FAILED;
+            addFailedDag(dagId);
             this.failedDagIdsFinishAllPossible.remove(dagId);
-            conditionallyUpdateFlowGaugeExecutionState(flowGauges, DagManagerUtils.getFlowId(this.dags.get(dagId)), FlowState.FAILED);

Review Comment:
   Why are we removing flowGauges?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 797426)
    Time Spent: 2h  (was: 1h 50m)

> Refactor metrics in dagmanager and add per spec executor metrics
> ----------------------------------------------------------------
>
>                 Key: GOBBLIN-1672
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1672
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-service
>            Reporter: William Lo
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Add the following metrics:
> 1. Success per executor
> 2. Fail per executor
> 3. SLA killed per executor
> 4. SLA killed per flowgroup
> 5. SLA killed per user
> 6. SLA killed overall



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
