Ambiguous behavior of Flink on Job cancellation with checkpoint configured

Parth Sarathy Thu, 21 Mar 2019 00:58:27 -0700

Hi All,

We are using Flink 1.7.2 with checkpointing enabled, RocksDB configured as
the state backend, and checkpoints retained on job cancellation. In our
scenario we cancel the job and, when resubmitting it, try to restore from
the latest available checkpoint or savepoint. We are observing inconsistent
behavior depending on how the job is cancelled. Our observations are below.


Observations:
1. When we cancel the job with the savepoint option, a savepoint is created
as expected, but Flink deletes the latest checkpoint directory of the
running job. Is this expected behavior even when the configuration asks to
retain checkpoints on job cancellation?
2. When we cancel the job without the savepoint option, the latest
checkpoint is retained by Flink, unlike in case 1, where it was deleted.

Since we have configured Flink to retain only a single checkpoint at any
point in time, could the following be an issue: we cancel the job with a
savepoint, and the savepoint is triggered but fails midway? We would then
end up with an incomplete savepoint and no trace of a checkpoint for the
job, as it would have been erased.
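
To make the restore step concrete, here is a rough sketch of the kind of
check we have in mind before resubmitting. It is only an illustration and not
anything Flink provides: the helper names (is_complete, latest_checkpoint,
choose_restore_path) are hypothetical, and it assumes that a snapshot
directory is usable only if it contains a _metadata file and that retained
checkpoints sit in chk-<n> directories under the job's checkpoint directory.

```python
import os
import re

# Assumption: retained checkpoints are stored in directories named chk-<n>.
CHK_PATTERN = re.compile(r"^chk-(\d+)$")


def is_complete(snapshot_dir):
    """Treat a snapshot as usable only if its _metadata file exists (assumption)."""
    return os.path.isfile(os.path.join(snapshot_dir, "_metadata"))


def latest_checkpoint(job_checkpoint_dir):
    """Return the complete chk-<n> directory with the highest n, or None."""
    if not os.path.isdir(job_checkpoint_dir):
        return None
    best = None  # (checkpoint number, path)
    for name in os.listdir(job_checkpoint_dir):
        match = CHK_PATTERN.match(name)
        if not match:
            continue
        path = os.path.join(job_checkpoint_dir, name)
        if not is_complete(path):
            continue
        number = int(match.group(1))
        if best is None or number > best[0]:
            best = (number, path)
    return best[1] if best else None


def choose_restore_path(savepoint_dir, job_checkpoint_dir):
    """Prefer a complete savepoint; otherwise fall back to the latest checkpoint."""
    if savepoint_dir and is_complete(savepoint_dir):
        return savepoint_dir
    return latest_checkpoint(job_checkpoint_dir)
```

With a check like this, the case we are worried about shows up as
choose_restore_path returning None: the savepoint is incomplete and no
retained checkpoint is left to fall back on.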

Thanks
Parth Sarathy



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
