You cannot change the checkpointing configuration at runtime.

You should be able to resume the job from the last checkpoint.
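
As a rough sketch of that resume step, assuming the retained checkpoints live under a directory laid out as <checkpoint-dir>/<job-id>/chk-<n>: the directory, checkpoint number, and jar name below are placeholders, not values from this thread, and the script only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Placeholder values: substitute your own checkpoint directory and job jar.
CHECKPOINT_DIR="hdfs:///flink/checkpoints"   # assumed value of state.checkpoints.dir
JOB_ID="9054d277265950c07ab90cf7ba0641d0"    # job id taken from the error message
CHECKPOINT_N="1000"                          # hypothetical: number of the last completed checkpoint
JOB_JAR="my-job.jar"                         # hypothetical jar name

# Build the path of a retained checkpoint: <dir>/<job-id>/chk-<n>.
restore_path() {
    printf '%s/%s/chk-%s' "$1" "$2" "$3"
}

RESTORE_PATH="$(restore_path "$CHECKPOINT_DIR" "$JOB_ID" "$CHECKPOINT_N")"

# The -s option of "flink run" also accepts a retained checkpoint path.
# Print the command instead of executing it (dry run).
echo "bin/flink run -s $RESTORE_PATH $JOB_JAR"
```

Before resubmitting, the checkpoint settings (for example a longer checkpointTimeout) can be changed in the job code, since they are fixed at submission time.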

On 22.01.2019 19:39, knur wrote:
I'm running a streaming job that uses the following config:

     checkpointInterval = 5 mins
     minPauseBetweenCheckpoints = 2 mins
     checkpointTimeout = 1 minute
     maxConcurrentCheckpoints = 1

This is using incremental, async checkpoints with the RocksDB backend. So
far around 2K checkpoints have been triggered, but I just noticed that after
the first ~1K the checkpoints have been failing with:

     Checkpoint 1560 of job 9054d277265950c07ab90cf7ba0641d0 expired before completing.

Now I'm in a very interesting position: I want to trigger a `savepoint` or a
`cancel -s`, but both of those commands are coupled to the checkpoint
mechanism, so they fail precisely because the checkpoints are timing out.

Hence my question... is there a way to change the configuration of the
checkpoints at runtime? It seems like there is no such thing, but also not a
good reason why it couldn't be implemented (we already allow modifying the
parallelism of a job which looks like a harder problem to solve).

Assuming there is no way to do this... how should I try to save my job? I do
have the `RETAIN_ON_CANCELLATION` policy enabled.

Should I be able to resume the job from the last checkpoint using the
--savepoint flag?



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

