[GitHub] [flink] dawidwys commented on a change in pull request #18092: [FLINK-25191] Skip savepoints for recovery

GitBox Thu, 16 Dec 2021 03:13:00 -0800


dawidwys commented on a change in pull request #18092:
URL: https://github.com/apache/flink/pull/18092#discussion_r770443223




##########
File path: docs/content/docs/ops/state/savepoints.md
##########
@@ -130,9 +130,23 @@ Unlike savepoints, checkpoints cannot generally be moved to a different location
 
 If you use `JobManagerCheckpointStorage`, metadata *and* savepoint state will be stored in the `_metadata` file, so don't be confused by the absence of additional data files.
 
-{{< hint warning  >}}
-It is discouraged to move or delete the last savepoint of a running job, because this might interfere with failure-recovery. Savepoints have side-effects on exactly-once sinks, therefore
-to ensure exactly-once semantics, if there is no checkpoint after the last savepoint, the savepoint will be used for recovery.
+{{< hint warning  >}}
+Starting from Flink 1.15 intermediate savepoints (savepoints other than
+created with [stop-with-savepoint](#stopping-a-job-with-savepoint)) are not used for recovery and do
+not commit any side effects.
+
+This has to be taken into consideration, especially when running multiple jobs in the same
+checkpointing timeline. It is possible in that solution that if the original job (after taking a
+savepoint) fails, then it will fall back to a checkpoint prior to the savepoint. However, if we now
+resume a job from the savepoint, then we might commit transactions that might’ve never happened
+because of falling back to a checkpoint before the savepoint (assuming non-determinism).

Review comment:
   It does guarantee correctness. If you start a single job from a savepoint, the next checkpoint will commit the transactions from the savepoint as well.

   The issue arises if you keep running the original job and also start a new one from the savepoint. If the original job fails before its next checkpoint, it might recreate the data from those transactions.

   The purpose of such savepoints is:
   * you want to take a savepoint and verify it before stopping the original job
   * you want to replicate the job into a separate zone/cluster/... (you need to drop the transactional state then)

   We discussed dropping the sink's state automatically from intermediate savepoints, but decided it is better to keep it there and possibly drop it on restore. If we drop it while taking a savepoint, there is no turning back.
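
The two-job scenario described in this comment can be sketched with a small toy model. This is not Flink code and uses no Flink API: `Job`, `Snapshot`, `run_scenario` and the `drop_transactional_state` flag are hypothetical names invented for illustration. The model also assumes, as a simplification of the real two-phase commit, that a checkpoint commits the pending transaction immediately and that both jobs write to the same sink.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical snapshot: source position plus uncommitted sink records."""
    offset: int
    pending: list


@dataclass
class Job:
    """Toy stand-in for a job with a transactional sink (not a Flink class)."""
    name: str
    offset: int = 0
    pending: list = field(default_factory=list)

    def process(self, n):
        # Read n records from the source into the open transaction.
        for _ in range(n):
            self.pending.append(self.offset)
            self.offset += 1

    def checkpoint(self, sink):
        # Simplification: completing a checkpoint commits the open transaction.
        sink.extend(self.pending)
        self.pending = []
        return Snapshot(self.offset, [])

    def intermediate_savepoint(self):
        # Snapshots the state, including the open transaction, but commits nothing.
        return Snapshot(self.offset, list(self.pending))

    @classmethod
    def restore(cls, name, snapshot, drop_transactional_state=False):
        pending = [] if drop_transactional_state else list(snapshot.pending)
        return cls(name, snapshot.offset, pending)


def run_scenario(drop_transactional_state):
    sink = []
    original = Job("original")
    original.process(3)
    last_checkpoint = original.checkpoint(sink)    # commits records 0, 1, 2
    original.process(2)                            # records 3, 4 are uncommitted
    savepoint = original.intermediate_savepoint()  # no side effects committed

    # A second job starts from the savepoint while the original keeps running.
    clone = Job.restore("clone", savepoint, drop_transactional_state)
    clone.checkpoint(sink)  # commits the restored transaction, if it was kept

    # The original fails before its next checkpoint and falls back to the
    # checkpoint taken before the savepoint, not to the savepoint itself.
    original = Job.restore("original", last_checkpoint)
    original.process(2)     # recreates records 3, 4
    original.checkpoint(sink)
    return sink
```

In this model, `run_scenario(False)` returns `[0, 1, 2, 3, 4, 3, 4]`: records 3 and 4 are committed once by the new job and once more by the recovered original. `run_scenario(True)` returns `[0, 1, 2, 3, 4]`, which is the effect of dropping the transactional state on restore.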




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
