[ 
https://issues.apache.org/jira/browse/FLINK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487418#comment-17487418
 ] 

Ken Burford edited comment on FLINK-24881 at 2/5/22, 4:37 AM:
--------------------------------------------------------------

I recently migrated from Flink 1.11 to 1.14, and I strongly suspect I'm
observing the same issue. With a 30s checkpoint interval and a 15s min pause,
checkpoints take up to five minutes on average, and sometimes closer to ten
minutes under heavier backpressure. This is a very high parallelism job. The
cluster is able to keep up with the work, but temporary hot spots in the input
stream cause continuous bursts of backpressure against the source.
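
For reference, here is a rough sketch of the timing I would expect from those
settings. It assumes at most one checkpoint is in flight, and that the next
periodic checkpoint can start no earlier than one interval after the previous
trigger and one min pause after the previous completion. CheckpointTiming and
earliestNextStart are hypothetical names for illustration, not Flink APIs.

```java
import java.time.Duration;

public class CheckpointTiming {

    /**
     * Earliest start of the next periodic checkpoint, as an offset from the
     * previous trigger. Simplified model (an assumption, not Flink code):
     * the later of "interval after the previous trigger" and
     * "min pause after the previous completion".
     */
    static Duration earliestNextStart(Duration interval,
                                      Duration minPause,
                                      Duration previousDuration) {
        Duration byMinPause = previousDuration.plus(minPause);
        return interval.compareTo(byMinPause) >= 0 ? interval : byMinPause;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofSeconds(30);
        Duration minPause = Duration.ofSeconds(15);

        // Fast checkpoint (5s): the interval dominates.
        System.out.println(
            earliestNextStart(interval, minPause, Duration.ofSeconds(5)));  // PT30S

        // Slow checkpoint (5 min): completion time plus min pause dominates.
        System.out.println(
            earliestNextStart(interval, minPause, Duration.ofMinutes(5)));  // PT5M15S
    }
}
```

Under this model, a five-minute checkpoint pushes the next trigger out to
roughly 5m15s instead of 30s, so the configured interval stops being the
limiting factor.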

If I submit a savepoint, I'll see the trigger submission in the JM log, but the 
actual savepoint won't start until about when I'd expect the next checkpoint to 
begin, often many minutes later.

I don't suppose you've found a mitigation for this by any chance?



> When the Source is back pressured, the checkpoint interval may not take effect
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-24881
>                 URL: https://issues.apache.org/jira/browse/FLINK-24881
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.14.0, 1.13.3
>            Reporter: Zongwen Li
>            Priority: Major
>         Attachments: image-2021-11-12-11-21-15-910.png
>
>
> Checkpoint config:
>  * EXACTLY_ONCE
>  * aligned
>  * interval: 10s
>  * min-pause: 10s
>  * max-attempts: 2
> When the Source was back pressured for a long time, I found that multiple
> checkpoints were triggered at the same time, so the settings for concurrent
> checkpoints and the checkpoint interval did not have the intended effect.
> I also found that one checkpoint usually fails when this happens, but the
> failure does not cause the job to restart.
> !image-2021-11-12-11-21-15-910.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
