Hi:
    In my experience, there are several possible reasons for checkpoint failures:
        1. If you use RocksDB as the state backend, insufficient disk space can cause the failure, because the working state files are kept on local disk; in that case you will usually see an exception (see the sketch after this list).
        2. The sink cannot be written to. None of the parallel subtasks can complete the checkpoint, and there is often no obvious symptom in the logs.
        3. Back pressure. Data skew makes one subtask take on more of the work than the others, so its checkpoint cannot finish.
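    As a rough illustration of point 1, this is how the RocksDB backend is typically configured (only a sketch: the hdfs:// path is a placeholder, and you need the flink-statebackend-rocksdb dependency on the classpath):

        import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

        // inside e.g. main(String[] args) throws Exception
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // completed checkpoints are written to the URI below, but RocksDB keeps its working
        // state on each task manager's local disk (io.tmp.dirs / state.backend.rocksdb.localdir),
        // which is where the space can run out
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));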
    Here is my advice:
        1. Learn more about how checkpointing works:
            https://ci.apache.org/projects/flink/flink-docs-release-1.10/internals/stream_checkpointing.html
        2. Test for back pressure:
            https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/back_pressure.html
        3. If there is no data skew, you can increase the parallelism, or you can adjust the checkpoint parameters (see the sketch below this list).
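    For point 3, here is a minimal sketch of the knobs I mean (the values are placeholders to show where the settings go, not recommendations):

        import org.apache.flink.streaming.api.CheckpointingMode;
        import org.apache.flink.streaming.api.environment.CheckpointConfig;
        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);                    // more parallelism only helps if there is no skew
        env.enableCheckpointing(60_000);          // trigger a checkpoint every 60 s
        CheckpointConfig cc = env.getCheckpointConfig();
        cc.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        cc.setCheckpointTimeout(10 * 60_000);     // give slow subtasks longer before the checkpoint is declared failed
        cc.setMinPauseBetweenCheckpoints(30_000); // let the job catch up between checkpoints
        cc.setMaxConcurrentCheckpoints(1);        // avoid overlapping checkpoints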
    On my machine I have a Hadoop environment, so I submit the job to YARN and use the Flink dashboard to check back pressure.

On 2020/03/23 15:14:33, Stephen Connolly <s...@gmail.com> wrote: 
> We have a topology and the checkpoints fail to complete a *lot* of the time.
> 
> Typically it is just one subtask that fails.
> 
> We have a parallelism of 2 on this topology at present and the other
> subtask will complete in 3ms though the end to end duration on the rare
> times when the checkpointing completes is like 4m30
> 
> How can I start debugging this? When I run locally on my development
> cluster I have no issues, the issues only seem to show in production.
