Hi, I was able to figure out what was causing this. We were using the restart strategy `fixed-delay` with the maximum number of restarts set to 10; once those attempts were exhausted, the job failed terminally and came back up with fresh state. Switching to `exponential-delay` resolved the issue of the job restarting from scratch.
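For anyone hitting the same thing, here is a minimal sketch of the relevant `flink-conf.yaml` change (the backoff values below are illustrative assumptions, not the ones we ran with):

```yaml
# Before: fixed-delay gives up after N attempts, after which the job
# fails terminally and a resubmission starts from fresh state.
# restart-strategy: fixed-delay
# restart-strategy.fixed-delay.attempts: 10
# restart-strategy.fixed-delay.delay: 10 s

# After: exponential-delay retries indefinitely, backing off between
# attempts, so the job keeps restoring from the latest checkpoint.
restart-strategy: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 1 s
restart-strategy.exponential-delay.max-backoff: 1 min
restart-strategy.exponential-delay.backoff-multiplier: 2.0
restart-strategy.exponential-delay.reset-backoff-threshold: 1 h
restart-strategy.exponential-delay.jitter-factor: 0.1
```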
On Thu, Aug 19, 2021 at 2:04 PM Chesnay Schepler <ches...@apache.org> wrote:

> How do you deploy Flink on Kubernetes? Do you use the standalone
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/>
> or native
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/>
> mode?
>
> Is it really just task managers going down? It seems unlikely that the
> loss of a TM could have such an effect.
>
> Can you provide us with the JobManager logs at the time the TM crash
> occurred? They should contain some hints as to how Flink handled the TM
> failure.
>
> On 19/08/2021 16:06, Kevin Lam wrote:
>
> Hi all,
>
> I've noticed that sometimes when task managers go down--it looks like the
> job is not restored from checkpoint, but instead restarted from a fresh
> state (when I go to the job's checkpoint tab in the UI, I don't see the
> restore, and the numbers in the job overview all get reset). Under what
> circumstances does this happen?
>
> I've been trying to debug and we really want the job to restore from the
> checkpoint at all times for our use case.
>
> We're running Apache Flink 1.13 on Kubernetes in a high availability
> set-up.
>
> Thanks in advance!