Hi, I checked the checkpointing documentation: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
 
tolerable checkpoint failure number: This defines how many consecutive 
checkpoint failures will be tolerated, before the whole job is failed over. The 
default value is 0, which means no checkpoint failures will be tolerated, and 
the job will fail on first reported checkpoint failure.

You can configure the job to tolerate checkpoint failures. As the documentation says, add the corresponding setting:
// only two consecutive checkpoint failures are tolerated
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
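
For completeness, here is a minimal sketch of where this setting fits into the overall checkpoint configuration; the 60-second interval is an illustrative assumption, and the 10-minute timeout matches the value mentioned in the original question below:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// trigger a checkpoint every 60 seconds (illustrative value)
env.enableCheckpointing(60_000);

CheckpointConfig config = env.getCheckpointConfig();

// fail a checkpoint that has not completed within 10 minutes
// (the timeout described in the question below)
config.setCheckpointTimeout(600_000);

// tolerate two consecutive checkpoint failures before failing the job
config.setTolerableCheckpointFailureNumber(2);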
Hope this helps.

> On 4 Mar 2022, at 10:07 AM, yu'an huang <h.yuan...@gmail.com> wrote:
> 
> Hi. By default, a checkpoint timeout does not cause the job to restart. Could you provide the JM log so we can check why the job restarted?
> 
>> On 3 Mar 2022, at 9:15 PM, kong <62...@163.com> wrote:
>> 
>> Hello, I've recently run into a problem:
>> I use Flink to consume Kafka data; the job graph looks roughly like this: Source -> map -> filter -> flatMap -> Map -> Sink
>> At one point the Kafka producer side generates a huge burst of data that Flink cannot consume in time; my checkpoint timeout is set to 10 minutes.
>> Eventually the job fails with "Checkpoint expired before completing.", which restarts it. It then recovers from the last checkpoint and re-consumes the data, which makes the checkpoint time out again, so it is stuck in an endless loop.
>> 
>> 
>> I don't know of a good way to solve this problem.
>> Thanks!
>> 
> 
