William Lo created GOBBLIN-2011:
-----------------------------------
Summary: Fix bug where concurrent flows can be kicked off
depending on a jobstatus race condition
Key: GOBBLIN-2011
URL: https://issues.apache.org/jira/browse/GOBBLIN-2011
Project: Apache Gobblin
Issue Type: Bug
Reporter: William Lo
A bug causes GaaS multileader to kick off unintended concurrent flows. The
race unfolds in the order described below:
1. Host A checks the latest flow execution status to ensure the prior flow is
not running, and sees that the prior execution is still running.
2. Host A fails the pending flow execution because it cannot run a concurrent
flow. This emits a FAILED event to GaaS, which is ingested by the
JobStatusMonitor.
3. Host B checks the latest flow execution status and sees the current flow
execution ID, whose status is FAILED (considered a finished flow).
4. Host B kicks off the pending flow execution when it should not.
To resolve this, we need to look at the past 2 flow executions and apply the
following behavior:
1. If there is no prior execution, kick off the pending flow.
2. If the prior execution is IN PROGRESS, we want to indicate that there is a
concurrent flow and block the pending execution.
3. If the prior execution is FINISHED, then we want to kick off the pending
execution (rely on the DagManager for deduplication of flows because we do not
know if the host managing this pending flow is running behind the other hosts).
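
The three rules above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the actual patch: the class
ConcurrentFlowCheck, the FlowExecution type, the Status enum, and both method
names are hypothetical and are not Gobblin APIs. It also assumes the status
history is ordered newest first and that the pending execution's own entry
(already marked FAILED by another host) may appear in it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these types are Gobblin's actual classes.
final class ConcurrentFlowCheck {

    // Simplified statuses; anything other than RUNNING counts as finished.
    enum Status { RUNNING, COMPLETE, FAILED }

    static final class FlowExecution {
        final long executionId;
        final Status status;

        FlowExecution(long executionId, Status status) {
            this.executionId = executionId;
            this.status = status;
        }
    }

    static boolean isFinished(Status status) {
        return status != Status.RUNNING;
    }

    // Racy check: consults only the single latest execution, so a FAILED
    // entry for the pending execution itself looks like a finished prior flow.
    static boolean naiveShouldLaunch(List<FlowExecution> newestFirst) {
        return newestFirst.isEmpty() || isFinished(newestFirst.get(0).status);
    }

    // Sketch of the proposed check: look at the past 2 executions, skip the
    // pending execution's own entry, and decide based on the prior execution.
    static boolean shouldLaunch(long pendingExecutionId, List<FlowExecution> newestFirst) {
        Optional<FlowExecution> prior = newestFirst.stream()
            .limit(2)
            .filter(e -> e.executionId != pendingExecutionId)
            .findFirst();
        // No prior execution: launch. Prior still running: block.
        // Prior finished: launch, leaving deduplication to the DagManager.
        return prior.map(e -> isFinished(e.status)).orElse(true);
    }

    public static void main(String[] args) {
        // Host B's view after Host A has failed pending execution 200
        // while prior execution 100 is still running.
        List<FlowExecution> history = Arrays.asList(
            new FlowExecution(200L, Status.FAILED),
            new FlowExecution(100L, Status.RUNNING));
        System.out.println(naiveShouldLaunch(history));   // true (the bug)
        System.out.println(shouldLaunch(200L, history));  // false (blocked)
    }
}
```

In this sketch the pending execution is identified by its execution ID so
that its own FAILED status is never mistaken for the prior flow's status; how
the real fix distinguishes the two is not stated in this issue.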
--
This message was sent by Atlassian Jira
(v8.20.10#820010)