[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2021-04-27 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334154#comment-17334154
 ] 

Flink Jira Bot commented on FLINK-9598:
---

This issue was marked "stale-assigned" and has not received an update in 7 
days. It is now automatically unassigned. If you are still working on it, you 
can assign it to yourself again. Please also give an update about the status of 
the work.

> [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when 
> there's a checkpoint failure
> -
>
> Key: FLINK-9598
> URL: https://issues.apache.org/jira/browse/FLINK-9598
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.3.2
>Reporter: Prem Santosh
>Assignee: Yun Tang
>Priority: Major
>  Labels: pull-request-available, stale-assigned
> Attachments: Screen Shot 2018-06-20 at 7.44.10 AM.png
>
>
> We have set the config Minimum Pause Between Checkpoints to be 10 min but 
> noticed that when a checkpoint fails (because it times out before it 
> completes) the application immediately starts taking the next checkpoint. 
> This basically stalls the application's progress since it's always taking 
> checkpoints.
> [^Screen Shot 2018-06-20 at 7.44.10 AM.png] is a screenshot of this issue.
> Details:
>  * Running Flink-1.3.2 on EMR
>  * checkpoint timeout duration: 40 min
>  * minimum pause between checkpoints: 10 min
> There is also a [relevant 
> thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Having-a-backoff-while-experiencing-checkpointing-failures-td20618.html]
>  that I found on the Flink users group.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2021-04-16 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323531#comment-17323531
 ] 

Flink Jira Bot commented on FLINK-9598:
---

This issue is assigned but has not received an update in 7 days so it has been 
labeled "stale-assigned". If you are still working on the issue, please give an 
update and remove the label. If you are no longer working on the issue, please 
unassign so someone else may work on it. In 7 days the issue will be 
automatically unassigned.



[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2018-09-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605855#comment-16605855
 ] 

ASF GitHub Bot commented on FLINK-9598:
---

Myasuka commented on issue #6346: [FLINK-9598] Refine java-doc about the min 
pause between checkpoints
URL: https://github.com/apache/flink/pull/6346#issuecomment-419119077
 
 
   @zentol I think the decision whether to change the javadocs or to fix the 
checkpoint coordinator's behavior depends on whether we give an explicit 
definition of the minimal pause between checkpoints. Our docs on 
[Minimum Pause Between Checkpoints](https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/checkpoint_monitoring.html#configuration-tab) 
only describe the successful scenario:
   
   > After a checkpoint has completed successfully, we wait at least for this 
amount of time before triggering the next one.
   
   If we gave a clearer definition for the checkpoint-failure scenario, 
this might no longer be a problem to discuss.
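
   For illustration, the documented success-case rule can be written as a small formula. This is only a sketch of how the quoted sentence reads, not the checkpoint coordinator's code; the class, method, and parameter names are hypothetical, and it deliberately says nothing about failed checkpoints, which is exactly the gap discussed here.

```java
/** Sketch of the documented success-case rule; not Flink's actual implementation. */
public class MinPauseRule {

    /**
     * Earliest time (in ms) at which the next checkpoint may be triggered,
     * given when the last checkpoint completed successfully.
     * All names are hypothetical and chosen for this example only.
     */
    public static long earliestNextTriggerMs(
            long lastSuccessfulCompletionMs, long minPauseMs, long nowMs) {
        // Wait at least minPauseMs after the last successful completion.
        return Math.max(nowMs, lastSuccessfulCompletionMs + minPauseMs);
    }

    public static void main(String[] args) {
        long tenMinutesMs = 10 * 60 * 1000L;
        // Last checkpoint completed at t = 0; a trigger request arrives 3 minutes later.
        long next = earliestNextTriggerMs(0L, tenMinutesMs, 3 * 60 * 1000L);
        System.out.println(next); // 600000
    }
}
```

   With a 10 min pause, a request arriving 3 min after a successful checkpoint is deferred to the 10 min mark; a request arriving after the pause has elapsed is served immediately.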


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2018-08-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571678#comment-16571678
 ] 

ASF GitHub Bot commented on FLINK-9598:
---

zentol commented on issue #6346: [FLINK-9598] Refine java-doc about the min 
pause between checkpoints
URL: https://github.com/apache/flink/pull/6346#issuecomment-411062550
 
 
   After looking at the discussion threads I'm not sure if it makes sense to 
merge this PR. If the behavior is deemed buggy we shouldn't touch the javadocs 
and fix the behavior instead. One could argue that they should still outline 
the _current_ state, but then we end up switching back and forth between versions.






[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2018-07-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546195#comment-16546195
 ] 

ASF GitHub Bot commented on FLINK-9598:
---

Github user Myasuka commented on the issue:

https://github.com/apache/flink/pull/6346
  
@zentol Sorry, it was my mistake to touch the public APIs. I'm reverting the 
API changes back to the original ones and will only refine the docs.




[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2018-07-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546112#comment-16546112
 ] 

ASF GitHub Bot commented on FLINK-9598:
---

GitHub user Myasuka opened a pull request:

https://github.com/apache/flink/pull/6346

[FLINK-9598] Refine java-doc about the min pause between checkpoints

## What is the purpose of the change

This pull request clarifies the docs for the config `minPauseBetweenCheckpoints`. 
Many users were confused by this parameter because they saw a new checkpoint 
triggered as soon as a checkpoint failed. Related threads: 
[thread-1](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/minPauseBetweenCheckpoints-for-failed-checkpoints-td20152.html), 
[thread-2](http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Having-a-backoff-while-experiencing-checkpointing-failures-td20618.html).


## Brief change log
Refine java-doc about the min pause between checkpoints.


## Verifying this change

This change is a trivial rework without any test coverage.


## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): no
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
  - The serializers: no
  - The runtime per-record code paths (performance sensitive): no
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  - The S3 file system connector: no

## Documentation

  - Does this pull request introduce a new feature? no
  - If yes, how is the feature documented? JavaDocs


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Myasuka/flink min-pause-checkpoint-FLINK-9598

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/6346.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6346


commit f47b487f74767859df19ac09ebc15bb4117a36f2
Author: Yun Tang 
Date:   2018-07-17T06:44:44Z

[FLINK-9598] Refactor java-doc about the min pause between checkpoints






[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2018-07-16 Thread vinoyang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545352#comment-16545352
 ] 

vinoyang commented on FLINK-9598:
-

[~yunta] I agree with you. I will unassign myself; please feel free to assign 
the issue to yourself and rework the docs if you want.



[jira] [Commented] (FLINK-9598) [Checkpoints] The config Minimum Pause Between Checkpoints doesn't work when there's a checkpoint failure

2018-07-16 Thread Yun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545318#comment-16545318
 ] 

Yun Tang commented on FLINK-9598:
-

Hi [~premsantosh], the "Minimum Pause Between Checkpoints" is actually only the 
delay enforced between successful checkpoints. You can find the logic in the 
CheckpointCoordinator#_triggerCheckpoint_() method, where, after the 
expired-checkpoint cleaner detects that a checkpoint has expired, it triggers 
another checkpoint ASAP through the CheckpointCoordinator#_triggerQueuedRequests_() 
method. This holds for both Flink-1.3.2 and the latest Flink-1.5.1.

I think a user usually wants to get a successful checkpoint again as quickly as 
possible. A running checkpoint generally does not stall your application, 
because a sub-task only starts its snapshot when the checkpoint barrier 
arrives, so not all sub-tasks are executing the snapshot process at once.

From my point of view, it would be better to rework some of the javadocs, e.g. 
for the attribute _minPauseBetweenCheckpointsNanos_ in CheckpointCoordinator. 
What's your opinion, [~yanghua]? If you don't have time for this trivial work, 
I'd like to take some time to rework all related javadocs.
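
To make the timing concrete, here is a minimal sketch of the trigger timeline using the reporter's settings (40 min timeout, 10 min minimum pause). It is a simplified model under stated assumptions, not the CheckpointCoordinator code: it ignores the periodic checkpoint interval and concurrent checkpoints, and the class name, method name, and the 5 min duration of a successful checkpoint are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified timeline model of the behavior described above; not Flink's actual code. */
public class CheckpointTimelineSketch {

    /**
     * Returns the trigger time (in minutes) of each checkpoint attempt.
     *
     * @param completed         completed[i] is true if attempt i succeeds, false if it expires
     * @param durationMin       hypothetical run time of a successful checkpoint
     * @param timeoutMin        checkpoint timeout
     * @param minPauseMin       minimum pause between checkpoints
     * @param pauseAfterFailure false models the reported behavior (retrigger ASAP after expiry),
     *                          true models the behavior the reporter expected
     */
    public static List<Long> triggerTimes(
            boolean[] completed,
            long durationMin,
            long timeoutMin,
            long minPauseMin,
            boolean pauseAfterFailure) {
        List<Long> times = new ArrayList<>();
        long now = 0;
        for (boolean ok : completed) {
            times.add(now);
            if (ok) {
                // The pause always applies after a successful checkpoint.
                now += durationMin + minPauseMin;
            } else {
                // An expired checkpoint occupies the full timeout.
                now += timeoutMin + (pauseAfterFailure ? minPauseMin : 0);
            }
        }
        return times;
    }

    public static void main(String[] args) {
        boolean[] allExpire = {false, false, false};
        System.out.println(triggerTimes(allExpire, 5, 40, 10, false)); // [0, 40, 80]
        System.out.println(triggerTimes(allExpire, 5, 40, 10, true));  // [0, 50, 100]
    }
}
```

In the first timeline every expired checkpoint is followed immediately by the next attempt, which matches what the reporter observed; the second timeline shows the 10 min gap that would appear if the pause were also applied after failures.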
