Hi all! We are running a Flink cluster on Kubernetes and deploying a single
job on it through "flink run <jar>". Whenever we want to modify the jar, we
cancel the job and run the "flink run" command again with the new jar and the
retained checkpoint URL from the first run.
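
For reference, the upgrade currently looks roughly like the following (the
job id, checkpoint path and jar name are placeholders, and this assumes
checkpoints are retained on cancellation, e.g. via
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION):

    # cancel the running job; its latest checkpoint stays on durable storage
    flink cancel <job-id>

    # resubmit the new jar, restoring from the retained checkpoint
    flink run -s <checkpoint-dir>/<job-id>/chk-1234 new-version.jar
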
This works well, but it adds some unavoidable downtime for each update:
- Downtime between the "cancel" and the "run"
- Downtime while the new job restores its state

We aim to reduce this downtime to a minimum and are wondering whether there is
a way to submit a new version of a job without completely stopping the old
one, e.g. by having each TM pick up the new jar one at a time and waiting for
its state to be restored before moving on to the next TM. This would limit the
downtime to a single TM at a time.

Otherwise, we could try to submit a second job and let it restore its state
before canceling the old one, but this raises complex synchronisation issues,
especially since the second job would be restoring from the first job's
retained (incremental) checkpoint while that job is still running, which
seems like a bad idea...
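
To make that second option concrete, it would look roughly like this, with
both jobs briefly running side by side (job ids, paths and jar names are
again placeholders):

    # start the new version from the old job's retained checkpoint while the
    # old job is still running (detached mode)
    flink run -d -s <checkpoint-dir>/<old-job-id>/chk-1234 new-version.jar

    # once the new job has restored its state, cancel the old one
    flink cancel <old-job-id>
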

What would be the best way to reduce downtime in this scenario, in your
opinion?

Thanks!
