[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334154#comment-17334154 ]

Flink Jira Bot commented on FLINK-9598:
---

This issue was marked "stale-assigned" and has not received an update in 7 days. It is now automatically unassigned. If you are still working on it, you can assign it to yourself again. Please also give an update about the status of the work.

> [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
> ---
>
> Key: FLINK-9598
> URL: https://issues.apache.org/jira/browse/FLINK-9598
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.3.2
> Reporter: Prem Santosh
> Assignee: Yun Tang
> Priority: Major
> Labels: pull-request-available, stale-assigned
> Attachments: Screen Shot 2018-06-20 at 7.44.10 AM.png
>
>
> We have set the config Minimum Pause Between Checkpoints to be 10 min but noticed that when a checkpoint fails (because it times out before it completes) the application immediately starts taking the next checkpoint. This basically stalls the application's progress since it's always taking checkpoints.
> [^Screen Shot 2018-06-20 at 7.44.10 AM.png] is a screenshot of this issue.
> Details:
> * Running Flink-1.3.2 on EMR
> * checkpoint timeout duration: 40 min
> * minimum pause between checkpoints: 10 min
> There is also a [relevant thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Having-a-backoff-while-experiencing-checkpointing-failures-td20618.html] that I found on the Flink users group.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323531#comment-17323531 ]

Flink Jira Bot commented on FLINK-9598:
---

This issue is assigned but has not received an update in 7 days so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign so someone else may work on it. In 7 days the issue will be automatically unassigned.
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605855#comment-16605855 ]

ASF GitHub Bot commented on FLINK-9598:
---

Myasuka commented on issue #6346: [FLINK-9598] Refine java-doc about the min pause between checkpoints
URL: https://github.com/apache/flink/pull/6346#issuecomment-419119077

@zentol I think the decision whether to change the javadocs or fix the checkpoint coordinator's behavior depends on whether we give an explicit definition of the minimum pause between checkpoints. Our docs on [Minimum Pause Between Checkpoints](https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/checkpoint_monitoring.html#configuration-tab) only cover the successful scenario:

> After a checkpoint has completed successfully, we wait at least for this amount of time before triggering the next one.

If we gave a clearer definition for the checkpoint-failure scenario, this might no longer need discussion.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571678#comment-16571678 ]

ASF GitHub Bot commented on FLINK-9598:
---

zentol commented on issue #6346: [FLINK-9598] Refine java-doc about the min pause between checkpoints
URL: https://github.com/apache/flink/pull/6346#issuecomment-411062550

After looking at the discussion threads I'm not sure if it makes sense to merge this PR. If the behavior is deemed buggy we shouldn't touch the javadocs and should fix the behavior instead. One could argue that they should still outline the _current_ state, but then we end up switching back and forth between versions.
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546195#comment-16546195 ]

ASF GitHub Bot commented on FLINK-9598:
---

Github user Myasuka commented on the issue:

    https://github.com/apache/flink/pull/6346

@zentol Sorry, it was my mistake to change the public APIs. I'm reverting the API changes to the original ones and will only refine the docs.
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546112#comment-16546112 ]

ASF GitHub Bot commented on FLINK-9598:
---

GitHub user Myasuka opened a pull request:

    https://github.com/apache/flink/pull/6346

    [FLINK-9598] Refine java-doc about the min pause between checkpoints

## What is the purpose of the change

This pull request clarifies the docs for the config `minPauseBetweenCheckpoints`. Many users were confused by this parameter because they saw a new checkpoint triggered as soon as a checkpoint failed. Related threads: [thread-1](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/minPauseBetweenCheckpoints-for-failed-checkpoints-td20152.html), [thread-2](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Having-a-backoff-while-experiencing-checkpointing-failures-td20618.html).

## Brief change log

Refine java-doc about the min pause between checkpoints.

## Verifying this change

This change is a trivial rework without any test coverage.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? JavaDocs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Myasuka/flink min-pause-checkpoint-FLINK-9598

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6346.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6346

commit f47b487f74767859df19ac09ebc15bb4117a36f2
Author: Yun Tang
Date: 2018-07-17T06:44:44Z

    [FLINK-9598] Refactor java-doc about the min pause between checkpoints
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545352#comment-16545352 ]

vinoyang commented on FLINK-9598:
---

[~yunta] I agree with you. I will unassign myself; please feel free to assign the issue to yourself and refine the docs if you want.
[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure
[ https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545318#comment-16545318 ]

Yun Tang commented on FLINK-9598:
---

Hi [~premsantosh], the "Minimum Pause Between Checkpoints" is actually the initial delay between successful checkpoints. You can find the logic in the CheckpointCoordinator#_triggerCheckpoint_() method: after the expired-checkpoint cleaner detects that a checkpoint has expired, it triggers another checkpoint ASAP through the CheckpointCoordinator#_triggerQueuedRequests_() method. This is the case in both Flink-1.3.2 and the latest Flink-1.5.1.

I think a user usually wants to get a successful checkpoint again as quickly as possible. A running checkpoint would not stall your application in general, because the sub-tasks only start their snapshot when the checkpoint barrier arrives, so not all sub-tasks are executing the snapshot process.

In my view, it would be better to refine some of the javadocs, e.g. for the attribute _minPauseBetweenCheckpointsNanos_ in CheckpointCoordinator.

What's your opinion, [~yanghua]? If you don't have time for this trivial work, I'd like to take some time to refine all the related javadocs.
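As a rough illustration of the behavior described in the comment above, the following Java sketch contrasts the described trigger rule with the one the reporter expected. It is a simplified model under stated assumptions, not Flink's CheckpointCoordinator code: the class `MinPauseSketch`, the enum `Outcome`, and both methods are hypothetical names introduced here, and the model ignores the checkpoint interval and any limit on concurrent checkpoints. The 10 min pause is the value from the issue report.

```java
import java.time.Duration;

/** Simplified model of the discussion; NOT Flink's CheckpointCoordinator. */
public final class MinPauseSketch {

    /** How the previous checkpoint ended (hypothetical enum for this sketch). */
    public enum Outcome { COMPLETED, EXPIRED }

    /**
     * Rule as described in the comment: the minimum pause is honored only
     * after a checkpoint that completed; an expired (timed-out) checkpoint
     * is followed by a new trigger as soon as possible.
     */
    public static Duration delayAsDescribed(Outcome previous, Duration minPause) {
        return previous == Outcome.COMPLETED ? minPause : Duration.ZERO;
    }

    /**
     * Rule the reporter expected: the minimum pause follows every
     * checkpoint, whether it completed or expired.
     */
    public static Duration delayAsExpected(Outcome previous, Duration minPause) {
        return minPause;
    }

    public static void main(String[] args) {
        Duration minPause = Duration.ofMinutes(10); // value from the issue report
        for (Outcome outcome : Outcome.values()) {
            System.out.println(outcome
                    + ": described=" + delayAsDescribed(outcome, minPause).toMinutes() + " min"
                    + ", expected=" + delayAsExpected(outcome, minPause).toMinutes() + " min");
        }
    }
}
```

The two rules differ only in the `EXPIRED` case, which is the case the issue and the pull request discussion are about.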