[ https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936726#comment-17936726 ]
Matthias Pohl commented on FLINK-37483:
---------------------------------------
Is this the stacktrace you got when the JobManager restarted (point 5 in your
description), or from the initial failed state (point 2 in your description)?
> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
> Key: FLINK-37483
> URL: https://issues.apache.org/jira/browse/FLINK-37483
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.20.1
> Reporter: Max Feng
> Priority: Major
>
> We're running Flink 1.20 native Kubernetes application-mode clusters, and
> we're running into an issue where clusters restart without restoring from the
> checkpoints referenced in the HA configmaps.
> To the best of our understanding, here's what's happening:
> 1) We're running application-mode clusters in native Kubernetes with
> externalized checkpoints, retained on cancellation (the relevant configuration
> is sketched after this list). We're attempting to restore a job from a
> checkpoint; the checkpoint reference is held in the Kubernetes HA configmap.
> 2) The JobManager encounters an issue during startup, and the job goes to
> state FAILED.
> 3) The HA configmap containing the checkpoint reference is cleaned up.
> 4) The Kubernetes pod exits. Because the JobManager is managed by a Kubernetes
> Deployment, the pod is immediately restarted.
> 5) Upon restart, the new JobManager finds no checkpoints to restore from.
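> A minimal configuration sketch of the setup described in (1), using the
> standard Flink 1.20 option keys; the cluster-id, bucket, and paths below are
> placeholders rather than our actual values:
> {code:yaml}
> kubernetes.cluster-id: my-app                         # placeholder
> high-availability.type: kubernetes
> high-availability.storageDir: s3://my-bucket/flink-ha       # placeholder path
> state.checkpoints.dir: s3://my-bucket/checkpoints           # placeholder path
> execution.checkpointing.interval: 60s
> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
> {code}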
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which removes the HA configmaps in native
> Kubernetes mode
> * FAILED does not actually stop a job in native Kubernetes mode; instead, the
> JobManager pod comes back and the job is immediately retried (see the sketch
> below)
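> The second behavior follows from the JobManager being created as a Kubernetes
> Deployment, which only permits restartPolicy: Always, so the pod is recreated
> even after the job has reached FAILED. A rough sketch of the relevant part of
> such a resource (names, labels, and image are illustrative placeholders, not
> the exact manifest Flink generates):
> {code:yaml}
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: my-app                   # derived from kubernetes.cluster-id (placeholder)
> spec:
>   replicas: 1
>   selector:
>     matchLabels:
>       app: my-app                # placeholder label
>   template:
>     metadata:
>       labels:
>         app: my-app
>     spec:
>       restartPolicy: Always      # the only policy a Deployment allows; the pod always comes back
>       containers:
>         - name: flink-main-container
>           image: flink:1.20      # placeholder image
> {code}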