Hi,

We have a Flink job that was stopped erroneously, leaving no checkpoint or savepoint available to restore from, and we are looking for some help to narrow down the problem.
How we ran into this problem:

We stopped the job with the cancel-with-savepoint command (due to a compatibility issue), but the command timed out after one minute because of backpressure, so we force-killed the job via the YARN kill command. Usually this would not cause trouble, because we could still restore the job from the last retained checkpoint. But this time, the last checkpoint directory had been cleaned up and was empty (the number of retained checkpoints was set to 1). According to ZooKeeper and the logs, the savepoint completed right after the cancel command timed out (the job master logged "Savepoint stored in …"). However, the savepoint directory contains only the _metadata file; the state files referenced by the metadata are absent.

Environment & Config:
- Flink 1.11.0
- YARN per-job cluster
- HA via ZooKeeper
- FsStateBackend
- Aligned, non-incremental checkpoints

For reference, a sketch of how the checkpointing is configured is appended below the signature.

Any comments and suggestions are appreciated! Thanks!

Best,
Paul Lam
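
P.S. In case it helps to reproduce the setup, here is a minimal sketch (not our actual code) of how the checkpointing side of the job is configured; the checkpoint path, interval, and job name below are placeholders:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FsStateBackend: working state on the TaskManager heap, checkpoint data
        // files written to the given directory (placeholder path).
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // Aligned, non-incremental, exactly-once checkpoints every 60s
        // (FsStateBackend does not support incremental checkpoints).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Retain the externalized checkpoint on cancellation so it can be used
        // for recovery after a forced kill.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Note: the number of retained checkpoints (1 in our case) is set in
        // flink-conf.yaml via state.checkpoints.num-retained, not in code.

        // Trivial placeholder pipeline so the sketch runs on its own.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-config-sketch");
    }
}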