William Lo created GOBBLIN-2011:
-----------------------------------

             Summary: Fix bug where concurrent flows can be kicked off 
depending on a jobstatus race condition
                 Key: GOBBLIN-2011
                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2011
             Project: Apache Gobblin
          Issue Type: Bug
            Reporter: William Lo


A bug causes GaaS multileader to kick off unintended concurrent flows. It
occurs in the following order:

1. Host A checks the latest flow execution status to ensure that the prior
flow is not running, and sees that the prior execution is still running.
2. Host A fails the pending flow execution because it cannot run a concurrent
flow. This emits a FAILED event to GaaS, which is ingested by the
JobStatusMonitor.
3. Host B checks the latest flow execution status and sees the current flow
execution ID, which is now FAILED (considered a finished flow).
4. Host B kicks off the pending flow execution when it should not.
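
The sequence above can be replayed with a minimal, single-threaded Java
sketch. It is an illustration only and not Gobblin code: the class, the
simplified Status enum, and the in-memory list standing in for the job status
store are all hypothetical, and the check is assumed to look only at the
newest execution.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the check described above; not Gobblin's classes.
final class LatestOnlyRaceDemo {
  enum Status { RUNNING, COMPLETE, FAILED }

  static final class FlowExecution {
    final long executionId;
    final Status status;

    FlowExecution(long executionId, Status status) {
      this.executionId = executionId;
      this.status = status;
    }
  }

  // Assumed buggy check: only the newest execution (index 0) is inspected.
  static boolean latestOnlyCheckAllowsKickOff(List<FlowExecution> newestFirst) {
    return newestFirst.isEmpty() || newestFirst.get(0).status != Status.RUNNING;
  }

  // Replays steps 1-4 and returns {Host A's decision, Host B's decision}.
  static boolean[] replayRace() {
    List<FlowExecution> store = new ArrayList<>();
    store.add(new FlowExecution(1L, Status.RUNNING)); // prior execution

    // Step 1: Host A sees the prior execution still running.
    boolean hostAKicksOff = latestOnlyCheckAllowsKickOff(store);

    // Step 2: Host A fails pending execution 2; the FAILED event is ingested
    // and becomes the newest status in the store.
    store.add(0, new FlowExecution(2L, Status.FAILED));

    // Steps 3-4: Host B sees FAILED as the latest status and proceeds.
    boolean hostBKicksOff = latestOnlyCheckAllowsKickOff(store);

    return new boolean[] {hostAKicksOff, hostBKicksOff};
  }
}
```

In this sketch Host A correctly refuses to start the flow, while Host B is
allowed to start it even though execution 1 is still running.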

To resolve this, we need to look at the past 2 flow executions and apply the
following behavior:
1. If there is no prior execution, kick off the pending flow.
2. If the prior execution is IN PROGRESS, indicate that there is a concurrent
flow and block the pending execution.
3. If the prior execution is FINISHED, kick off the pending execution. We rely
on the DagManager to deduplicate flows, because we do not know whether the
host managing this pending flow is running behind the other hosts.
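
The three rules above could be sketched in Java roughly as follows. This is a
hedged illustration, not the actual patch: the class, enums, and method are
hypothetical, the status list is assumed to hold at most the two most recent
executions ordered newest first, and the prior execution is assumed to be the
newest entry whose ID differs from the pending execution's ID.

```java
import java.util.List;

// Hypothetical sketch of the proposed decision logic; not Gobblin's API.
final class ConcurrentFlowCheck {
  enum Status { RUNNING, COMPLETE, FAILED, CANCELLED }

  enum Decision { KICK_OFF, BLOCK_AS_CONCURRENT }

  static final class FlowExecution {
    final long executionId;
    final Status status;

    FlowExecution(long executionId, Status status) {
      this.executionId = executionId;
      this.status = status;
    }
  }

  // In this simplified model, anything other than RUNNING counts as finished.
  static boolean isFinished(Status status) {
    return status != Status.RUNNING;
  }

  static Decision decide(long pendingExecutionId, List<FlowExecution> latestTwo) {
    for (FlowExecution execution : latestTwo) {
      // Skip the pending execution itself: another host may already have
      // marked it FAILED, which must not be mistaken for the prior execution.
      if (execution.executionId == pendingExecutionId) {
        continue;
      }
      // Rules 2 and 3: block while the prior execution runs, else kick off.
      return isFinished(execution.status)
          ? Decision.KICK_OFF
          : Decision.BLOCK_AS_CONCURRENT;
    }
    // Rule 1: no prior execution exists.
    return Decision.KICK_OFF;
  }
}
```

With this shape, a FAILED status recorded against the pending execution by
another host no longer hides a prior execution that is still running.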



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
