Hi experts,
I am running a Flink job cluster. The application jar is packaged together
with Flink in a Docker image, and the job cluster runs in Kubernetes. The
restart strategy is below:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 20
restart-strategy.failure-rate.failure-rate-interval: 3 min
restart-strategy.failure-rate.delay: 100 ms
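
To make clear how I am reading these settings, here is a minimal Python
sketch of a failure-rate policy as I understand it: restarts are allowed
until more than 20 failures fall inside a 3 minute window, with a 100 ms
wait before each restart (the wait is not modeled). This is only my
assumption about the behavior, not Flink's actual code, and the class and
method names are hypothetical.

```python
from collections import deque


class FailureRateTracker:
    """Hypothetical model of a failure-rate restart policy (not Flink code)."""

    def __init__(self, max_failures_per_interval, interval_seconds):
        self.max_failures = max_failures_per_interval
        self.interval = interval_seconds
        self.failures = deque()  # timestamps (seconds) of recent failures

    def can_restart(self, now):
        """Record a failure at time `now` and report whether a restart is allowed."""
        self.failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.interval:
            self.failures.popleft()
        return len(self.failures) <= self.max_failures


# Settings from my configuration: 20 failures per 3 min (180 s).
tracker = FailureRateTracker(max_failures_per_interval=20, interval_seconds=180)
decisions = [tracker.can_restart(t) for t in range(21)]  # one failure per second
```

In this model the first 20 failures still allow a restart and the 21st
inside the same window does not.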

The job manager is not set up in HA mode, so there is only 1 pod.

What I have observed is that the job manager pod has restarted a few times,
and each time it restarts, it starts as a new Flink job (hence a new Flink
job ID). So it seems it could not restart from the last successful
checkpoint. The log lines highlighted in yellow are the evidence.

So I wonder, in this case, should I set the Flink job ID to a fixed value
(if there is a way to set it), or should I set the restart strategy to
retry indefinitely? Or is there something else I should do?

Thanks a lot!

{"@timestamp":"2020-10-21T09:45:30.571Z","@version":"1","message":"1 tasks
should be restarted to recover the failed task

9ikvi0743v9rkayb1qof (0b4c1ed9cd2cb47ee99ddb173a9beee5) switched from state
Could not submit task because there is no JobManager associated for the job
