[ 
https://issues.apache.org/jira/browse/GOBBLIN-2011?focusedWorklogId=907871&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-907871
 ]

ASF GitHub Bot logged work on GOBBLIN-2011:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Mar/24 00:11
            Start Date: 02/Mar/24 00:11
    Worklog Time Spent: 10m 
      Work Description: umustafi commented on code in PR #3888:
URL: https://github.com/apache/gobblin/pull/3888#discussion_r1509707292


##########
gobblin-runtime/src/main/java/org/apache/gobblin/service/monitoring/FlowStatusGenerator.java:
##########
@@ -153,15 +153,20 @@ public List<FlowStatus> getFlowStatusesAcrossGroup(String flowGroup, int countPe
    * @return true, if any jobs of the flow are RUNNING.
    */
   public boolean isFlowRunning(String flowName, String flowGroup, long flowExecutionId) {
-    List<FlowStatus> flowStatusList = getLatestFlowStatus(flowName, flowGroup, 1, null);
+    List<FlowStatus> flowStatusList = getLatestFlowStatus(flowName, flowGroup, 2, null);
     if (flowStatusList == null || flowStatusList.isEmpty()) {
       return false;
+    }
+    FlowStatus flowStatus = flowStatusList.get(0);

Review Comment:
   Can you add a comment making clear that the first entry is the most recent 
and may or may not match the pending flowExecutionId attempt, and that the 
second entry is an older one? It's a bit hard to keep track of what's going on 
here without reading your commit description.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 907871)
    Time Spent: 0.5h  (was: 20m)

> Fix bug where concurrent flows can be kicked off depending on a jobstatus 
> race condition
> ----------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-2011
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2011
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: William Lo
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There's a bug that causes GaaS multileader to kick off unintended concurrent 
> flows. It occurs in the following order:
> 1. Host A checks the latest flow execution status to ensure the prior flow is 
> not running, and sees that the prior execution is still running.
> 2. Host A fails the pending flow execution because it cannot run a concurrent 
> flow. This emits a FAILED event to GaaS, which is ingested by the 
> JobStatusMonitor.
> 3. Host B checks the latest flow execution status and sees the current flow 
> execution ID, which is FAILED (considered a finished flow).
> 4. Host B kicks off the pending flow execution when it shouldn't.
>
> To resolve this, we need to look at the past 2 flow executions and apply the 
> following behavior:
> 1. If there is no prior execution, kick off the pending flow.
> 2. If the prior execution is IN PROGRESS, indicate that there is a 
> concurrent flow and block the pending execution.
> 3. If the prior execution is FINISHED, kick off the pending 
> execution (rely on the DagManager for deduplication of flows, because we do 
> not know whether the host managing this pending flow is running behind the 
> other hosts).
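
The three rules in the issue description can be sketched roughly as follows. This is only an illustration, not the code in PR #3888: the class ConcurrentFlowCheckSketch, the Execution type, the Decision enum and the decide method are all hypothetical names, and the step that skips an entry whose ID equals the pending flowExecutionId is an assumption about how the "prior" execution would be picked out of the two most recent statuses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from the Gobblin codebase.
class ConcurrentFlowCheckSketch {

  enum Decision { LAUNCH, BLOCK }

  /** Simplified stand-in for a flow status: an execution ID and whether it has finished. */
  static final class Execution {
    final long flowExecutionId;
    final boolean finished;

    Execution(long flowExecutionId, boolean finished) {
      this.flowExecutionId = flowExecutionId;
      this.finished = finished;
    }
  }

  /**
   * @param latestTwo up to two most recent executions, newest first
   * @param pendingFlowExecutionId ID of the execution waiting to be launched
   */
  static Decision decide(List<Execution> latestTwo, long pendingFlowExecutionId) {
    Execution prior = null;
    for (Execution e : latestTwo) {
      // Assumption: the newest entry may be the pending execution itself
      // (for example, already marked FAILED by another host), so skip it.
      if (e.flowExecutionId != pendingFlowExecutionId) {
        prior = e;
        break;
      }
    }
    if (prior == null) {
      return Decision.LAUNCH;   // rule 1: no prior execution
    }
    if (!prior.finished) {
      return Decision.BLOCK;    // rule 2: prior execution still in progress
    }
    return Decision.LAUNCH;     // rule 3: prior finished; deduplication is left to the DagManager
  }

  public static void main(String[] args) {
    // Host B's view in the race: the pending execution (200) is already FAILED,
    // while the prior execution (100) is still running.
    List<Execution> seenByHostB =
        Arrays.asList(new Execution(200L, true), new Execution(100L, false));
    System.out.println(decide(seenByHostB, 200L));
    System.out.println(decide(Collections.<Execution>emptyList(), 200L));
  }
}
```

With only the single latest status, Host B would see the FAILED entry for the pending execution and launch; looking one entry further back is what lets it notice the prior execution that is still running.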



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
