Hi,

We have a Flink job that was stopped erroneously, leaving no checkpoint or savepoint available to restore from, and we are looking for some help to narrow down the problem.
How we ran into this problem:

We stopped the job with the cancel-with-savepoint command (due to a compatibility issue), but the command timed out after one minute because of backpressure, so we force-killed the job via the YARN kill command. Usually this would not cause trouble, because we could still restore the job from the last retained checkpoint. But this time, the last checkpoint directory had been cleaned up and was empty (the number of retained checkpoints was set to 1). According to ZooKeeper and the logs, the savepoint completed right after the cancel command timed out (the job master logged "Savepoint stored in …"). However, the savepoint directory contains only the _metadata file; the state files referenced by the metadata are absent.

Environment & Config:
- Flink 1.11.0
- YARN per-job cluster
- HA via ZooKeeper
- FsStateBackend
- Aligned, non-incremental checkpoints

For reference, a sketch of how the checkpointing is configured is appended below the signature.

Any comments and suggestions are appreciated! Thanks!

Best,
Paul Lam
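
P.S. In case it helps to reproduce the setup, here is a minimal sketch (not our actual code) of how the checkpointing side of the job is configured; the checkpoint path, interval, and job name below are placeholders:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FsStateBackend: working state on the TaskManager heap, checkpoint data
        // files written to the given directory (placeholder path).
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // Aligned, non-incremental, exactly-once checkpoints every 60s
        // (FsStateBackend does not support incremental checkpoints).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Retain the externalized checkpoint on cancellation so it can be used
        // for recovery after a forced kill.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Note: the number of retained checkpoints (1 in our case) is set in
        // flink-conf.yaml via state.checkpoints.num-retained, not in code.

        // Trivial placeholder pipeline so the sketch runs on its own.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-config-sketch");
    }
}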