Hi,
Many different things can cause a checkpoint timeout, so more detail
would help narrow it down. Could you share the detailed checkpoint
statistics from the web UI [1], such as the end-to-end duration and state
size of the checkpoints that time out? Also, which Flink version are you
using?

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/checkpoint_monitoring.html
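
If copying the numbers out of the web UI is inconvenient, the same
statistics should also be available from the JobManager REST API under
/jobs/<jobid>/checkpoints. Below is a minimal Python sketch that lists
checkpoints that failed or ran longer than a threshold. The address
http://localhost:8081, the "<job-id>" placeholder, the 3-minute threshold,
and the helper names are assumptions for illustration, and the JSON field
names ("history", "status", "end_to_end_duration", "state_size") are what I
would expect from the REST response, so please check them against your
Flink version.

```python
import json
import urllib.request


def fetch_checkpoint_stats(base_url, job_id):
    """Fetch checkpoint statistics for one job from the JobManager REST API."""
    url = f"{base_url}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def slow_checkpoints(stats, threshold_ms):
    """Return (id, status, duration_ms, state_size_bytes) for each checkpoint
    in the history that failed or took longer than threshold_ms."""
    rows = []
    for cp in stats.get("history", []):
        duration = cp.get("end_to_end_duration", 0)
        status = cp.get("status")
        if status == "FAILED" or duration > threshold_ms:
            rows.append((cp.get("id"), status, duration, cp.get("state_size", 0)))
    return rows


if __name__ == "__main__":
    # Hypothetical address and job ID; replace them with your own values.
    stats = fetch_checkpoint_stats("http://localhost:8081", "<job-id>")
    # 180_000 ms = 3 minutes, i.e. half of the 6-minute timeout (an arbitrary choice).
    for row in slow_checkpoints(stats, threshold_ms=180_000):
        print(row)
```

The output for the slow or failed checkpoints, together with the Flink
version, would be a good starting point for further diagnosis.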

Best,
Guowei


On Thu, Apr 1, 2021 at 4:33 PM Geldenhuys, Morgan Karl <
morgan.geldenh...@tu-berlin.de> wrote:

> Hi Community,
>
>
> I have a number of Flink jobs running inside my session cluster with
> varying checkpoint intervals and a large amount of operator state. In
> times of high load, the jobs fail due to checkpoint timeouts (set to 6
> minutes). I can only assume this is because the latency of saving
> checkpoints increases under high load. I have a 30-node HDFS
> cluster for checkpoints; however, I see that only 4 of these nodes are
> being used for storage. Is there a way of ensuring the load is evenly
> spread? Could there be another reason for these checkpoint timeouts? Events
> are consumed from Kafka and written back to Kafka with exactly-once
> guarantees enabled.
>
>
> Thank you very much!
>
>
> M.
>
