Hi, I was able to figure out what was causing this. We were using the restart strategy `fixed-delay` with the maximum number of restarts set to 10; once those attempts were exhausted, the job failed terminally and came back up with fresh state. Switching to `exponential-delay` resolved the issue of the job restarting from scratch.
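For anyone hitting the same thing, here is a minimal sketch of the relevant `flink-conf.yaml` change (the backoff values below are illustrative assumptions, not the ones we ran with):

```yaml
# Before: fixed-delay gives up after N attempts, after which the job
# fails terminally and a resubmission starts from fresh state.
# restart-strategy: fixed-delay
# restart-strategy.fixed-delay.attempts: 10
# restart-strategy.fixed-delay.delay: 10 s

# After: exponential-delay retries indefinitely, backing off between
# attempts, so the job keeps restoring from the latest checkpoint.
restart-strategy: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 1 s
restart-strategy.exponential-delay.max-backoff: 1 min
restart-strategy.exponential-delay.backoff-multiplier: 2.0
restart-strategy.exponential-delay.reset-backoff-threshold: 1 h
restart-strategy.exponential-delay.jitter-factor: 0.1
```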
On Thu, Aug 19, 2021 at 2:04 PM Chesnay Schepler <ches...@apache.org> wrote:

> How do you deploy Flink on Kubernetes? Do you use the standalone
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/>
> or native
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/>
> mode?
>
> Is it really just task managers going down? It seems unlikely that the
> loss of a TM could have such an effect.
>
> Can you provide us with the JobManager logs at the time the TM crash
> occurred? They should contain some hints as to how Flink handled the TM
> failure.
>
> On 19/08/2021 16:06, Kevin Lam wrote:
>
> Hi all,
>
> I've noticed that sometimes when task managers go down--it looks like the
> job is not restored from checkpoint, but instead restarted from a fresh
> state (when I go to the job's checkpoint tab in the UI, I don't see the
> restore, and the numbers in the job overview all get reset). Under what
> circumstances does this happen?
>
> I've been trying to debug and we really want the job to restore from the
> checkpoint at all times for our use case.
>
> We're running Apache Flink 1.13 on Kubernetes in a high availability
> set-up.
>
> Thanks in advance!