[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427941#comment-15427941 ]

ASF GitHub Bot commented on FLINK-4322:

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/2385

> Unify CheckpointCoordinator and SavepointCoordinator
>
>                 Key: FLINK-4322
>                 URL: https://issues.apache.org/jira/browse/FLINK-4322
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.1.0
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>             Fix For: 1.2.0
>
> The Checkpoint coordinator should have the functionality of both handling checkpoints and savepoints.
> The difference between checkpoints and savepoints is minimal:
>   - savepoints always write the root metadata of the checkpoint
>   - savepoints are always full (never incremental)
> The commonalities are large
>   - jobs should be able to resume from checkpoint or savepoints
>   - jobs should fall back to the latest checkpoint or savepoint
> This subsumes issue https://issues.apache.org/jira/browse/FLINK-3397

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427942#comment-15427942 ]

Ufuk Celebi commented on FLINK-4322:

Addendum in 8854d75 and 5d7f880 (master).
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427931#comment-15427931 ]

ASF GitHub Bot commented on FLINK-4322:

Github user uce commented on the issue:

    https://github.com/apache/flink/pull/2385

I will merge this and add the following test for min pause. This fails with the current master, but works with your PR.

```java
/**
 * Tests that no minimum delay between savepoints is enforced.
 */
@Test
public void testMinDelayBetweenSavepoints() throws Exception {
    JobID jobId = new JobID();

    final ExecutionAttemptID attemptID1 = new ExecutionAttemptID();
    ExecutionVertex vertex1 = mockExecutionVertex(attemptID1);

    CheckpointCoordinator coord = new CheckpointCoordinator(
        jobId,
        10,
        20,
        1L, // very long min delay => should not affect savepoints
        1,
        42,
        new ExecutionVertex[] { vertex1 },
        new ExecutionVertex[] { vertex1 },
        new ExecutionVertex[] { vertex1 },
        cl,
        new StandaloneCheckpointIDCounter(),
        new StandaloneCompletedCheckpointStore(2, cl),
        new HeapSavepointStore(),
        new DisabledCheckpointStatsTracker());

    Future savepoint0 = coord.triggerSavepoint(0);
    assertFalse("Did not trigger savepoint", savepoint0.isCompleted());

    Future savepoint1 = coord.triggerSavepoint(1);
    assertFalse("Did not trigger savepoint", savepoint1.isCompleted());
}
```
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427909#comment-15427909 ]

ASF GitHub Bot commented on FLINK-4322:

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2385

Looks good to me. +1
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426317#comment-15426317 ]

ASF GitHub Bot commented on FLINK-4322:

GitHub user ramkrish86 opened a pull request:

    https://github.com/apache/flink/pull/2385

    FLINK-4322 (addendum)Issue with savepoint considering the time interval like that of check…

    Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
    If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html).
    In addition to going through the list, please provide a meaningful description of your changes.

    - [ ] General
      - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
      - The pull request addresses only one issue
      - Each commit in the PR has a meaningful commit message (including the JIRA id)

    - [ ] Documentation
      - Documentation has been added for new functionality
      - Old documentation affected by the pull request has been updated
      - JavaDoc for public methods has been added

    - [ ] Tests & Build
      - Functionality added by the pull request is covered by tests
      - `mvn clean verify` has been executed successfully locally or a Travis build has passed

    …point

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ramkrish86/flink FLINK-4322_addendum

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2385.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2385

commit fd864d425dea920a0c49d8e4f9b433d219424252
Author: Ramkrishna
Date:   2016-08-18T11:57:41Z

    Issue with savepoint considering the time interval like that of checkpoint
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426222#comment-15426222 ]

ASF GitHub Bot commented on FLINK-4322:

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2366

A JIRA issue never hurts. But in this case, it could piggyback on the previous issue. Your choice ;-)
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426219#comment-15426219 ]

ASF GitHub Bot commented on FLINK-4322:

Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/2366

Yes. I can do it. Thanks. Is it better to raise a JIRA for that for accounting or just raise a PR with the same issue ID?
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426212#comment-15426212 ]

ASF GitHub Bot commented on FLINK-4322:

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2366

You are right, this was an oversight on my side. This check should also be within the `if (!isSavepoint())` block.

Thanks for reviewing this. Do you want to create a fix for that?
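The fix agreed on in this exchange can be sketched as follows. This is a minimal illustration, not the actual `CheckpointCoordinator` code: the class `TriggerGateSketch`, its `Result` enum, and the `tryTrigger` method are hypothetical names. Only the `lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp` check and the idea of guarding it with a not-a-savepoint condition come from the thread. The thread does not say whether a savepoint should update the last-trigger timestamp; the sketch assumes it does not.

```java
/** Hypothetical sketch: limits on periodic checkpoints do not apply to savepoints. */
final class TriggerGateSketch {

    enum Result { TRIGGERED, DECLINED_TOO_MANY_CONCURRENT, DECLINED_MIN_PAUSE }

    // Assumed to be non-negative.
    private final long minPauseBetweenCheckpoints;
    private final int maxConcurrentCheckpoints;

    // Long.MIN_VALUE means "no periodic checkpoint triggered yet".
    private long lastTriggeredCheckpoint = Long.MIN_VALUE;
    private int pendingPeriodicCheckpoints;

    TriggerGateSketch(long minPauseBetweenCheckpoints, int maxConcurrentCheckpoints) {
        this.minPauseBetweenCheckpoints = minPauseBetweenCheckpoints;
        this.maxConcurrentCheckpoints = maxConcurrentCheckpoints;
    }

    Result tryTrigger(long timestamp, boolean isSavepoint) {
        // Both limits sit inside the same guard, so a savepoint skips them.
        if (!isSavepoint) {
            if (pendingPeriodicCheckpoints >= maxConcurrentCheckpoints) {
                return Result.DECLINED_TOO_MANY_CONCURRENT;
            }
            if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {
                return Result.DECLINED_MIN_PAUSE;
            }
            lastTriggeredCheckpoint = timestamp;
            pendingPeriodicCheckpoints++;
        }
        return Result.TRIGGERED;
    }
}
```

With a very long minimum pause, a second periodic checkpoint shortly after the first is declined, while two savepoints triggered back to back both go through.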
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426208#comment-15426208 ]

ASF GitHub Bot commented on FLINK-4322:

Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/2366

Yes. So maybe this code

`if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {`

should also go inside the previous `if` block that checks for `!savePoint`, right?
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426201#comment-15426201 ]

ASF GitHub Bot commented on FLINK-4322:

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2366

Savepoints are exempt from the concurrent-checkpoints and time-between-checkpoints limitations. These checks run only if the triggered checkpoint is not a savepoint (in the `triggerCheckpoint(...)` method).
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426167#comment-15426167 ]

ASF GitHub Bot commented on FLINK-4322:

Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/2366

Reading the code once again, just a few doubts/questions:
-> If a savepoint is triggered within the duration of the 'MINIMUM_TIME_BETWEEN_CHECKPOINTS', do we still throw back a decline result? Since savepoints are externally triggered, shouldn't they be isolated from the internal timing we maintain? Correct me if I am missing something here. Thanks.
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425928#comment-15425928 ]

ASF GitHub Bot commented on FLINK-4322:

Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/2366

The unification looks great and now things are simple in terms of restoration. Thanks.
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424997#comment-15424997 ]

ASF GitHub Bot commented on FLINK-4322:

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2366

Looked very good. Fixed a minor issue (test log level) and merged this.
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15424993#comment-15424993 ]

ASF GitHub Bot commented on FLINK-4322:

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/2366
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419941#comment-15419941 ]

ASF GitHub Bot commented on FLINK-4322:

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2366

Thanks for joining this effort. I'll try to have a look at the tests in the next days.
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418952#comment-15418952 ]

ASF GitHub Bot commented on FLINK-4322:

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/2366

    [FLINK-4322] Unify CheckpointCoordinator and SavepointCoordinator

    The CheckpointCoordinator now also takes over the role of the SavepointCoordinator. Savepoints are just like other checkpoints - they only store the metadata in addition. Restoring from a savepoint means loading it into the CheckpointStore at startup.

    This simplifies the code quite a bit. We get rid of the savepoint coordinator and related classes and cumbersome restoring logic in the main code. For the tests, we can replace some integration tests with unit tests.

    `PendingSavepoint` instances are finalized to become a `CompletedCheckpoint` like regular `PendingCheckpoint` instances, but in addition store the savepoint meta data and complete a Promise for callbacks. `PendingSavepoints` cannot be subsumed and a `CompletedCheckpoint` from a savepoint does not delete its associated state when being disposed.

    @StephanEwen did most of the work and I added and fixed some tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink savepointunify

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2366.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2366

commit 9d13d3b9c78b5fe6ec436c476492a82b846338aa
Author: Stephan Ewen
Date:   2016-08-08T17:18:44Z

    [FLINK-4322] [checkpointing] Unify CheckpointCoordinator and SavepointCoordinator

    The CheckpointCoordinator now also takes over the role of the SavepointCoordinator. Savepoints are just like other checkpoints - they only store the metadata in addition. Restoring from a savepoint means loading it into the CheckpointStore at startup.

commit bcb6cf0b573314449437bc869febfc68f798b0f4
Author: Ufuk Celebi
Date:   2016-08-11T17:40:07Z

    [FLINK-4322] [checkpointing] Add and fix tests
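The two savepoint-specific rules in the pull request description (a pending savepoint cannot be subsumed, and a completed checkpoint that came from a savepoint keeps its state when disposed) can be illustrated with a small sketch. All names below (`SavepointRulesSketch`, `Pending`, `Completed`, `subsumeOlderThan`, the `stateDeleted` flag) are hypothetical stand-ins, not the Flink classes; the flag merely stands in for deleting state files.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the two savepoint-specific rules named in the PR description. */
final class SavepointRulesSketch {

    /** A checkpoint that is still waiting for acknowledgements. */
    static final class Pending {
        final long id;
        final boolean savepoint;

        Pending(long id, boolean savepoint) {
            this.id = id;
            this.savepoint = savepoint;
        }

        /** Pending savepoints are never dropped in favour of a newer checkpoint. */
        boolean canBeSubsumed() {
            return !savepoint;
        }
    }

    /** A finished checkpoint; stateDeleted stands in for removing its state. */
    static final class Completed {
        final long id;
        final boolean fromSavepoint;
        boolean stateDeleted;

        Completed(long id, boolean fromSavepoint) {
            this.id = id;
            this.fromSavepoint = fromSavepoint;
        }

        /** Disposing a savepoint-derived checkpoint leaves its state in place. */
        void dispose() {
            if (!fromSavepoint) {
                stateDeleted = true;
            }
        }
    }

    /** Removes every subsumable pending checkpoint older than completedId and returns them. */
    static List<Pending> subsumeOlderThan(List<Pending> pending, long completedId) {
        List<Pending> dropped = new ArrayList<>();
        Iterator<Pending> it = pending.iterator();
        while (it.hasNext()) {
            Pending p = it.next();
            if (p.id < completedId && p.canBeSubsumed()) {
                it.remove();
                dropped.add(p);
            }
        }
        return dropped;
    }
}
```

When checkpoint 3 completes while checkpoint 1 and savepoint 2 are still pending, only checkpoint 1 is dropped; the savepoint stays pending until it completes or fails on its own.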
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411056#comment-15411056 ]

Ufuk Celebi commented on FLINK-4322:

+1 I had similar thoughts about this
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410662#comment-15410662 ]

ramkrishna.s.vasudevan commented on FLINK-4322:

Ok, got it. I understand. Thanks.
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409961#comment-15409961 ]

Stephan Ewen commented on FLINK-4322:

I would like to make some more radical changes to the current checkpoint / savepoint design.
  - There should be no difference at all between checkpoint and savepoint any more.
  - A savepoint is merely a parameterized checkpoint (with a future to announce its result)
  - There is no SavepointCoordinator any more
  - a savepoint restore simply adds a completed checkpoint to the checkpoint coordinator

The execution graph code should become very simple that way.

Since that is a pretty deep change in the checkpointing mechanism, I would be happy to actually take this over.

> Unify CheckpointCoordinator and SavepointCoordinator
>
>                 Key: FLINK-4322
>                 URL: https://issues.apache.org/jira/browse/FLINK-4322
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.1.0
>            Reporter: Stephan Ewen
>             Fix For: 1.2.0
>
> The Checkpoint coordinator should have the functionality of both handling checkpoints and savepoints.
> The difference between checkpoints and savepoints is minimal:
>   - savepoints always write the root metadata of the checkpoint
>   - savepoints are always full (never incremental)
> The commonalities are large
>   - jobs should be able to resume from checkpoint or savepoints
>   - jobs should fall back to the latest checkpoint or savepoint
> This subsumes issue https://issues.apache.org/jira/browse/FLINK-3397
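The design proposed in this comment can be sketched roughly as follows: one coordinator, a savepoint as a checkpoint with one extra parameter plus a future for its result, and a restore that just adds a completed checkpoint. Everything here is an assumption made for illustration: the class `UnifiedCoordinatorSketch`, its method names, and the use of `java.util.concurrent.CompletableFuture` are not taken from Flink, and the sketch completes checkpoints synchronously, which a real coordinator would not.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: one coordinator, with a savepoint as a parameterized checkpoint. */
final class UnifiedCoordinatorSketch {

    /** A completed checkpoint; savepointPath is null for a periodic checkpoint. */
    static final class Completed {
        final long id;
        final String savepointPath;

        Completed(long id, String savepointPath) {
            this.id = id;
            this.savepointPath = savepointPath;
        }

        boolean isSavepoint() {
            return savepointPath != null;
        }
    }

    private final Deque<Completed> completed = new ArrayDeque<>();
    private long nextId = 1;

    /** Shared path for both kinds; the only parameter is the optional savepoint target. */
    private Completed trigger(String savepointPath) {
        Completed c = new Completed(nextId++, savepointPath);
        completed.addLast(c); // the sketch completes synchronously
        return c;
    }

    Completed triggerCheckpoint() {
        return trigger(null);
    }

    /** Same as a checkpoint, plus a future that announces where the savepoint was written. */
    CompletableFuture<String> triggerSavepoint(String targetPath) {
        Objects.requireNonNull(targetPath, "targetPath");
        return CompletableFuture.completedFuture(trigger(targetPath).savepointPath);
    }

    /** Restoring a savepoint just adds it as the latest completed checkpoint. */
    void restoreSavepoint(Completed savepoint) {
        completed.addLast(savepoint);
        nextId = Math.max(nextId, savepoint.id + 1);
    }

    /** The checkpoint or savepoint a job would fall back to, or null if none exists. */
    Completed latest() {
        return completed.peekLast();
    }
}
```

Because a restored savepoint sits in the same store as regular checkpoints, "fall back to the latest checkpoint or savepoint" needs no separate code path in this sketch: recovery only ever asks for `latest()`.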
[jira] [Commented] (FLINK-4322) Unify CheckpointCoordinator and SavepointCoordinator
[ https://issues.apache.org/jira/browse/FLINK-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409799#comment-15409799 ]

ramkrishna.s.vasudevan commented on FLINK-4322:

> jobs should be able to resume from checkpoint or savepoints
> jobs should fall back to the latest checkpoint or savepoint

Yes, this is what FLINK-3397 tries to achieve too. [~uce] had given some comments, and I have updated the document but not yet uploaded it to that JIRA. He also mentioned the job and savepoint mapping. I raised some comments and doubts regarding that in the other JIRA. As he is very busy, I am just waiting for his inputs.
As this and FLINK-3397 are related, I am happy to work on these and will do so once we arrive at a consensus on the design part. Thanks.