[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182272#comment-17182272 ]

Ravi Bhushan Ratnakar commented on FLINK-18675:
-----------------------------------------------

[~dian.fu], this ticket has been addressed by FLINK-18856.

> Checkpoint not maintaining minimum pause duration between checkpoints
> ----------------------------------------------------------------------
>
>                 Key: FLINK-18675
>                 URL: https://issues.apache.org/jira/browse/FLINK-18675
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>         Environment: !image.png!
>            Reporter: Ravi Bhushan Ratnakar
>            Priority: Critical
>             Fix For: 1.12.0, 1.11.2
>
>         Attachments: image.png
>
>
> I am running a streaming job with Flink 1.11.0 on Kubernetes
> infrastructure. I have configured checkpointing as follows:
> Interval - 3 minutes
> Minimum pause between checkpoints - 3 minutes
> Checkpoint timeout - 10 minutes
> Checkpointing Mode - Exactly Once
> Number of Concurrent Checkpoints - 1
>
> Other configs
> Time Characteristics - Processing Time
>
> I am observing an unusual behaviour. *When a checkpoint completes successfully*
> *and its end-to-end duration is almost equal to or greater than the Minimum
> pause duration, the next checkpoint gets triggered immediately without
> maintaining the Minimum pause duration*. Kindly notice this behaviour from
> checkpoint id 194 onward in the attached screenshot.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182236#comment-17182236 ]

Dian Fu commented on FLINK-18675:
---------------------------------

This seems to be a duplicate of FLINK-18856 and to have been addressed there. If so, we should close this issue.

[~raviratnakar] Could you help check whether this is the case?
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173755#comment-17173755 ]

Congxian Qiu(klion26) commented on FLINK-18675:
-----------------------------------------------

There seems to be a similar issue: FLINK-18856.
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170497#comment-17170497 ]

Congxian Qiu(klion26) commented on FLINK-18675:
-----------------------------------------------

[~raviratnakar] I think the problem here is that {{CheckpointRequestDecider}} has a wrong value of {{lastCheckpointCompletionRelativeTime}} when checking whether the checkpoint request is too early.

1. We retrieve the value of {{lastCheckpointCompletionRelativeTime}} when calling {{CheckpointRequestDecider#chooseRequestToExecute}} in {{CheckpointCoordinator#triggerCheckpoint}}.
2. A pending checkpoint completes and updates the variables {{pendingCheckpoints}} and {{lastCheckpointCompletionRelativeTime}}.
3. In {{CheckpointRequestDecider#chooseRequestToExecute}} we use the previous {{lastCheckpointCompletionRelativeTime}} to check whether the current checkpoint request is too early.

I think we can solve the problem by reading the value of {{lastCheckpointCompletionRelativeTime}} inside {{CheckpointRequestDecider#chooseRequestToExecute}}.
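The stale-read sequence in steps 1 to 3 of Congxian Qiu's comment (the one about {{lastCheckpointCompletionRelativeTime}}) can be sketched as follows. This is only an illustration under stated assumptions, not Flink's code: the class MinPauseCheck, the method tooEarly, and all timestamps are hypothetical and were chosen to mirror the 3-minute minimum pause from the issue description.

```java
public class MinPauseCheck {

    /**
     * Hypothetical stand-in for the "too early" test: a request at nowMs is too
     * early if it arrives before lastCompletionMs + minPauseMs.
     */
    public static boolean tooEarly(long nowMs, long lastCompletionMs, long minPauseMs) {
        return nowMs < lastCompletionMs + minPauseMs;
    }

    public static void main(String[] args) {
        long minPauseMs = 180_000L;      // 3 minutes, as in the issue description
        long[] lastCompletion = {0L};    // mutable holder for the shared timestamp

        // Step 1: the trigger path copies the timestamp before deciding.
        long snapshot = lastCompletion[0];

        // Step 2: a pending checkpoint completes and updates the timestamp.
        lastCompletion[0] = 185_000L;

        // Step 3: the decision is taken just after that completion.
        long now = 185_001L;

        // With the stale copy the request is not considered too early,
        // so the next checkpoint would start right away.
        System.out.println("stale read, too early? " + tooEarly(now, snapshot, minPauseMs));

        // Reading the current value at decision time (the proposed fix)
        // recognises that the minimum pause has not elapsed yet.
        System.out.println("fresh read, too early? " + tooEarly(now, lastCompletion[0], minPauseMs));
    }
}
```

With these made-up numbers the stale copy lets the request through one millisecond after the previous checkpoint completed, which matches the behaviour reported in the issue; reading the value at decision time postpones it.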
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165387#comment-17165387 ]

Congxian Qiu(klion26) commented on FLINK-18675:
-----------------------------------------------

Hi [~raviratnakar], from the git history, the code at line number 1512 was added in 1.9.0 (and it seems that change would not affect this problem). IIUC, we check somewhere whether the checkpoint can be triggered; I need to look at it carefully, as the code has changed a lot. I will reply here if I find anything.
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163289#comment-17163289 ]

Ravi Bhushan Ratnakar commented on FLINK-18675:
-----------------------------------------------

[~klion26] In the attached screenshot there is also a case, at checkpoint ids 190 and 191, where the checkpoint's end-to-end duration is just below the minimum pause duration, yet the next checkpoint was triggered much earlier than the minimum pause duration allows. What do you think is the root cause of this issue? Could it be that the scheduleAtFixedRate method is used at line number 1512 of the CheckpointCoordinator class?
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162932#comment-17162932 ]

Ravi Bhushan Ratnakar commented on FLINK-18675:
-----------------------------------------------

As per my understanding of the code, the CheckpointCoordinator class uses the scheduleAtFixedRate method at line number [1512|https://github.com/apache/flink/blob/release-1.11.0/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1512]. I think that we should use [scheduleWithFixedDelay|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-] instead.
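The difference between the two ScheduledExecutorService methods named in Ravi Bhushan Ratnakar's comment on line 1512 can be sketched with a small model. It computes start times arithmetically instead of running real timers, assumes an initial delay of 0 and a task that blocks for its whole duration, and uses hypothetical names (SchedulingModes, fixedRateStarts, fixedDelayStarts). It only illustrates the documented java.util.concurrent semantics; it makes no claim about how Flink's checkpoint trigger behaves or about the root cause of this issue.

```java
import java.util.Arrays;

public class SchedulingModes {

    /**
     * Fixed rate: run i is due at i * period, but it cannot begin before
     * run i-1 has finished, so an overrunning task is followed immediately
     * by the next one.
     */
    public static long[] fixedRateStarts(long period, long[] durations) {
        long[] starts = new long[durations.length];
        long prevEnd = 0;
        for (int i = 0; i < durations.length; i++) {
            starts[i] = Math.max(i * period, prevEnd);
            prevEnd = starts[i] + durations[i];
        }
        return starts;
    }

    /**
     * Fixed delay: each run begins a full delay after the previous run
     * finished, regardless of how long that run took.
     */
    public static long[] fixedDelayStarts(long delay, long[] durations) {
        long[] starts = new long[durations.length];
        long prevEnd = 0;
        for (int i = 0; i < durations.length; i++) {
            starts[i] = (i == 0) ? 0 : prevEnd + delay;
            prevEnd = starts[i] + durations[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        long[] durations = {1, 4, 1}; // minutes; the second run overruns the period

        // Third run starts at minute 7, the moment the second run ends.
        System.out.println(Arrays.toString(fixedRateStarts(3, durations)));  // [0, 3, 7]

        // Every run waits 3 minutes after the previous run ended.
        System.out.println(Arrays.toString(fixedDelayStarts(3, durations))); // [0, 4, 11]
    }
}
```

In the fixed-rate model the run after the 4-minute task starts with no gap at all, which is the pattern the comment suspects; in the fixed-delay model the 3-minute gap is always preserved.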