jack-moseley opened a new pull request #3179: URL: https://github.com/apache/incubator-gobblin/pull/3179
Dear Gobblin maintainers, Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below! ### JIRA - [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR" - https://issues.apache.org/jira/browse/GOBBLIN-1342 ### Description - [x] Here are some details about my PR, including screenshots (if applicable): This PR adds flow resume functionality, which means after a flow fails, resume can be called on it, and the flow will be reattempted from the point before it failed (i.e. not rerunning the successful jobs). How it works: - Whenever a flow fails, on cleanup, it is now added to a `failedDagStateStore`, similar to the normal dag state store. (It is also kept in an in-memory `failedDags` map to avoid querying the store for every resume request). - User triggers a resume request by calling partial update on the FlowExecutions endpoint, updating the status from `FAILED` to `RUNNING`. This is a little confusing, but as per [this discussion](https://softwareengineering.stackexchange.com/questions/261552/which-http-verb-should-i-use-to-trigger-an-action-in-a-rest-web-service) it seems to be the most correct way to represent a triggered action like this in rest. - The request is forwarded to the `DagManager` as a `ResumeFlowEvent` through the eventbus, similar to how killing a flow works. - `DagManager` adds it to a `resumeQueue` to be picked up by a thread - The thread calls `beginResumingDag` on the dag, which sets the status of both the flow and each FAILED/CANCELLED job to `PENDING_RESUME` (and also sends events so this will be updated in the status store). The dag is then added to another map `resumingDags` - In `finishResumingDags`, the dags in `resumingDags` are checked by querying the status store. If the dag has reached the correct state, then it is passed back to `initialize` to be launched. This is separated from `beginResumingDag` because it takes some time for the status to be updated in the status store. Other related changes: - Fixed a bug where `addDag` sets all jobs to `PENDING`, even if the dag is already running. This would cause their initial jobs to get relaunched on service restart. - Added a `flowStartTime` variable in `JobExecutionPlan` which is set on flow resume so that flow SLA will not be calculated based on the original flow. - Added `PENDING_RESUME` to execution status enum, it is a special status required because resuming flows are allowed to move "backwards" in status from `FAILED` to `PENDING`. ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: Add unit test ### Commits - [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 2. Subject is limited to 50 characters 3. Subject does not end with a period 4. Subject uses the imperative mood ("add", not "adding") 5. Body wraps at 72 characters 6. Body explains "what" and "why", not "how" ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
