[ https://issues.apache.org/jira/browse/FLINK-4975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636184#comment-15636184 ]
ASF GitHub Bot commented on FLINK-4975: --------------------------------------- GitHub user StephanEwen opened a pull request: https://github.com/apache/flink/pull/2754 [FLINK-4975] [checkpointing] Add a limit for how much data may be buffered in alignment In corner case situations, checkpoint alignment can take very long and buffer/spill a lot of data. This PR introduces setting a limit to how much data may be buffered during alignments. If that volume is exceeded, the checkpoint will abort. While these overly large alignment situation should not occur in a healthy environment, it is an important safety net to have. This Pull Request consists of three parts: ### Introduce Cancellation Barriers These *Cancellation Barriers* are like checkpoint barriers, flowing with the data, but signalling that a checkpoint should be aborted rather that the position of that stream in the checkpoint. This adds extensive tests to the `BarrierBuffer` and `BarrierTracker` that these Cancellation Barriers are correctly interpreted and interplay well with other situations of alignment starts and cancellations (such as when newer barriers come early). ### Adjust and Checkpoint Coordinator Tasks emit cancellation barriers whenever they cannot start a checkpoint or whenever a checkpoint alignment was canceled. That lets downstream tasks know earlier that they should stop the alignment for that checkpoint, because it will not be able to complete. Tasks also explicitly send "decline" messages to the checkpoint coordinator for checkpoints they "skipped" due to alignment being cancelled or superseded. Previously the assumptions were: - When a Source Task cannot start a checkpoint, a new checkpoint must be triggered immediately, to dissolve any started downstream alignments that otherwise would not be able to complete. - Whenever an alignment is aborted by a newer checkpoint barrier coming in, that newer barrier will eventually reach the downstream task and break outdated pending alignments. The cancellation barrier will not break the outdated alignment earlier. ### Alignment Size Limit When the `BarrierBuffer` has buffered more than a certain number of bytes, it aborts the alignment and signals the Task that the checkpoint was aborted. The Task sends a cancellation barrier for that checkpoint downstream, to signal the downstream tasks that they should not wait for a proper barrier. You can merge this pull request into a Git repository by running: $ git pull https://github.com/StephanEwen/incubator-flink checkpoint_cancellation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2754.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2754 ---- commit 9837f551e58c7d7d40b85e3ae2292f14be9d73e4 Author: Stephan Ewen <se...@apache.org> Date: 2016-10-23T16:41:32Z [FLINK-4984] [checkpointing] Add Cancellation Barriers as a way to signal aborted checkpoints commit 844d15c6bd50f7d4e5449ba6958e404685c6eb59 Author: Stephan Ewen <se...@apache.org> Date: 2016-11-02T21:34:59Z [FLINK-4985] [checkpointing] Report canceled / declined checkpoints to the Checkpoint Coordinator commit 3b922bb6ec3b798042c265c4d49d4d5dad940759 Author: Stephan Ewen <se...@apache.org> Date: 2016-11-03T14:28:15Z [FLINK-4975] [checkpointing] Add a limit for how much data may be buffered in alignment. If more data than the defined amount is buffered, the alignment is aborted and the checkpoint canceled. ---- > Add a limit for how much data may be buffered during checkpoint alignment > ------------------------------------------------------------------------- > > Key: FLINK-4975 > URL: https://issues.apache.org/jira/browse/FLINK-4975 > Project: Flink > Issue Type: Improvement > Components: State Backends, Checkpointing > Affects Versions: 1.1.3 > Reporter: Stephan Ewen > Assignee: Stephan Ewen > Fix For: 1.2.0, 1.1.4 > > > During checkpoint alignment, data may be buffered/spilled. > We should introduce an upper limit for the spilled data volume. After > exceeding that limit, the checkpoint alignment should abort and the > checkpoint be canceled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)