Hi Eva

If a checkpoint failed, please check the web UI or the jobmanager log to see 
why it failed; it might have been declined by a specific task.
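
If you prefer, the checkpoint statistics behind the web UI can also be fetched 
from the JobManager's REST endpoint. Below is a minimal Java sketch; the host, 
port and job id are placeholders, and the exact JSON layout depends on your 
Flink version:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CheckpointStatsProbe {
    public static void main(String[] args) throws Exception {
        // Placeholders: substitute your JobManager address and job id.
        String restBase = "http://localhost:8081";
        String jobId = "<your-job-id>";

        // REST endpoint backing the web UI's checkpoint statistics tab.
        URL url = new URL(restBase + "/jobs/" + jobId + "/checkpoints");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // JSON with per-checkpoint status, duration and failure info.
                System.out.println(line);
            }
        }
    }
}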

If a checkpoint expired, you can also check the web UI to see which tasks did 
not respond in time; a hot task might not be able to acknowledge the checkpoint 
quickly enough. Generally speaking, checkpoint expiration is mostly caused by 
back pressure, which delays the checkpoint barriers so that they do not arrive 
in time. Resolving the back pressure should help checkpoints finish before the 
timeout.
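
Besides working on the back pressure itself, you could also give each 
checkpoint a bit more headroom. A minimal sketch of the relevant 
CheckpointConfig settings (the interval and timeout values below are only 
illustrative, not recommendations for your job):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 5 minutes (interval is illustrative).
        env.enableCheckpointing(5 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // Give each checkpoint more headroom before it is declared expired.
        env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000L);

        // Leave a pause between checkpoints so a slow one does not
        // immediately trigger the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2 * 60 * 1000L);

        // Keep only one checkpoint in flight at a time.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    }
}

A longer timeout and a minimum pause will not remove the back pressure, but 
they can keep one slow checkpoint from immediately cascading into the next 
failure.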

I think the documentation on checkpoint monitoring in the web UI [1] and on 
back pressure monitoring [2] could help you.

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/checkpoint_monitoring.html
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/back_pressure.html

Best
Yun Tang
________________________________
From: Eva Eva <eternalsunshine2...@gmail.com>
Sent: Friday, January 10, 2020 10:29
To: user <user@flink.apache.org>
Subject: Please suggest helpful tools

Hi,

I'm running Flink job on 1.9 version with blink planner.

My checkpoints are timing out intermittently, but as the state grows they are 
timing out more and more often, eventually killing the job.

The state size is large, with Minimum=10.2MB and Maximum=49GB (the maximum 
accumulated due to prior failed checkpoints), Average=8.44GB.

Although the size is huge, I have enough space on the EC2 instance on which 
I'm running the job. I'm using RocksDB for checkpointing.
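
For reference, the state backend is wired up roughly like the sketch below 
(the checkpoint path here is a placeholder, not my actual configuration):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path for illustration only.
        String checkpointPath = "file:///data/flink/checkpoints";

        // RocksDB keeps working state on local disk and snapshots it to
        // checkpointPath; the two-argument constructor's boolean flag would
        // additionally switch on incremental checkpoints.
        RocksDBStateBackend backend = new RocksDBStateBackend(checkpointPath);
        env.setStateBackend(backend);
    }
}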

The logs do not have any useful information to explain why checkpoints are 
expiring/failing. Can someone please point me to tools that can be used to 
investigate and understand why checkpoints are failing?

Also, any other related suggestions are welcome.


Thanks,
Reva.
