gaoyunhaii commented on code in PR #545: URL: https://github.com/apache/flink-web/pull/545#discussion_r901234152
########## _posts/2022-06-01-final-checkpoint-part1.md: ########## @@ -81,59 +81,61 @@ running subtasks. The states could be repartitioned on restarting and all the ne To support creating such a checkpoint for jobs with finished tasks, we extended the checkpoint procedure. Previously the checkpoint coordinator inside the JobManager first notifies all the sources to report snapshots, then all the sources further notify their descendants via broadcasting barrier events. Since now the sources might -already finish, the checkpoint coordinator would instead treat the running tasks who do not have running precedent -tasks as “new sources”, and notifies these tasks to initiate the checkpoints. The checkpoint could then deduce -which operator is fully finished based on the task states when triggering checkpoint and the received snapshots. +have already finished, the checkpoint coordinator would instead treat the running tasks who also do not have running +precedent tasks as "new sources", and it notifies these tasks to initiate the checkpoints. Finally, if the subtasks of +an operator are either finished on triggering checkpoint or have finished processing all the data on snapshotting states, +the operator would be marked as fully finished. The changes of the checkpoint procedure are transparent to users except that for checkpoints indeed containing -finished tasks, we disallowed adding new operators before the fully finished ones, since it would make the fully +finished tasks, we disallowed adding new operators as precedents of the fully finished ones, since it would make the fully finished operators have running precedents after restarting, which conflicts with the design that tasks finished in topological order. # Revise the Process of Finishing Based on the ability to take checkpoints with finished tasks, we could then solve the issue that two-phase-commit -operators could not commit all the data when running in streaming mode. As the background, currently Flink jobs +operators could not commit all the data when running in streaming mode. As the background, Flink jobs have two ways to finish: -1. All sources are bound and they processed all the input records. The job will start to finish after all the sources are finished. -2. Users execute `stop-with-savepoint [--drain]`. The job will take a savepoint and then finish. If the `–-drain` parameter is not set, -the savepoint might be used to start new jobs and the operators will not flush all the event times or call methods marking all -records processed (like `endInput()`). +1. All sources are bound and they processed all the input records. The job will finish after all the sources are finished. +2. Users execute `stop-with-savepoint [--drain]`. The job will take a savepoint and then finish. With `–-drain`, the job +will be stopped permanently and is also required to commit all the data. However, without `--drain` the job might +be resumed from the savepoint later, thus not all data are required to be committed, as long as the state of the data could be +recovered from the savepoint. -Ideally we should ensure exactly-once semantics in both cases. +Let's first have a look at the case of bounded sources. To achieve end-to-end exactly-once, +two-phase-commit operators only commit data after a checkpoint following this piece of data succeed. +However, previously there is no such an opportunity for the data between the last periodic checkpoint and job finish, Review Comment: Changed to `job getting finished`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org