jack-moseley opened a new pull request #3179:
URL: https://github.com/apache/incubator-gobblin/pull/3179


   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [x] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references 
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1342
   
   
   ### Description
   - [x] Here are some details about my PR, including screenshots (if 
applicable):
   
   This PR adds flow resume functionality, which means after a flow fails, 
resume can be called on it, and the flow will be reattempted from the point 
before it failed (i.e. not rerunning the successful jobs).
   
   How it works:
   - Whenever a flow fails, on cleanup, it is now added to a 
`failedDagStateStore`, similar to the normal dag state store. (It is also kept 
in an in-memory `failedDags` map to avoid querying the store for every resume 
request).
   - User triggers a resume request by calling partial update on the 
FlowExecutions endpoint, updating the status from `FAILED` to `RUNNING`. This 
is a little confusing, but as per [this 
discussion](https://softwareengineering.stackexchange.com/questions/261552/which-http-verb-should-i-use-to-trigger-an-action-in-a-rest-web-service)
 it seems to be the most correct way to represent a triggered action like this 
in rest.
   - The request is forwarded to the `DagManager` as a `ResumeFlowEvent` 
through the eventbus, similar to how killing a flow works.
   - `DagManager` adds it to a `resumeQueue` to be picked up by a thread
   - The thread calls `beginResumingDag` on the dag, which sets the status of 
both the flow and each FAILED/CANCELLED job to `PENDING_RESUME` (and also sends 
events so this will be updated in the status store). The dag is then added to 
another map `resumingDags`
   - In `finishResumingDags`, the dags in `resumingDags` are checked by 
querying the status store. If the dag has reached the correct state, then it is 
passed back to `initialize` to be launched. This is separated from 
`beginResumingDag` because it takes some time for the status to be updated in 
the status store.
   
   Other related changes:
   - Fixed a bug where `addDag` sets all jobs to `PENDING`, even if the dag is 
already running. This would cause their initial jobs to get relaunched on 
service restart.
   - Added a `flowStartTime` variable in `JobExecutionPlan` which is set on 
flow resume so that flow SLA will not be calculated based on the original flow.
   - Added `PENDING_RESUME` to execution status enum, it is a special status 
required because resuming flows are allowed to move "backwards" in status from 
`FAILED` to `PENDING`.
   
   
   ### Tests
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   Add unit test
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to