Hi flink-kubernetes-operator maintainers,

We recently migrated to the official operator and are seeing a new issue where our FlinkDeployments can fail and crashloop looking for a non-existent savepoint. On further inspection, the job is attempting to restart from the savepoint specified in execution.savepoint.path. This config is new for us (it wasn't set by our previous operator) and seems to be set automatically behind the scenes by the official operator. The savepoint referenced in execution.savepoint.path did exist, but it gets deleted after some amount of time (in the latest example, a few hours). Then, when there is some pod disruption, the job attempts to restart from the now-deleted savepoint and starts crashlooping.
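Our working theory (please correct us if we are off base) is that the operator's savepoint history cleanup disposed of the savepoint while execution.savepoint.path still pointed at it. If that is the mechanism, this is a sketch of the kind of flinkConfiguration change we are considering, based on our reading of the operator docs; the option names and defaults below are our assumption for 1.4.0 and we have not yet verified that these settings govern the deletion we observed:

    # Sketch only -- assumes the operator's savepoint-history cleanup is what
    # deleted savepoint-bad5e5-6ab08cf0808e; option names per the operator docs.
    kubernetes.operator.savepoint.history.max.age: 72 h    # retain savepoints longer
    kubernetes.operator.savepoint.history.max.count: 50    # retain more of them
    # or disable automatic disposal entirely while we debug:
    kubernetes.operator.savepoint.cleanup.enabled: false

Guidance on whether this is the right knob (or whether the real fix lies elsewhere) would be much appreciated.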
Hoping you can help us troubleshoot and figure out whether this can be solved through configuration (we are using the equivalent configs from our previous operator, where we did not have this issue). Adding some details on versions and Kubernetes state for your reference. Thank you for your support!

Flink version: 1.14.5
Flink operator version: 1.4.0

At the time of the issue, here is the flink-config we see in the ConfigMap (the savepoint savepoint-bad5e5-6ab08cf0808e had already been deleted from S3 at this point):

    kubernetes.jobmanager.replicas: 1
    jobmanager.rpc.address: <SOMETHING>
    metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
    kubernetes.service-account: <SOMETHING>
    kubernetes.cluster-id: <SOMETHING>
    pipeline.auto-generate-uids: false
    metrics.scope.tm: flink.taskmanager.metric
    parallelism.default: 2
    kubernetes.namespace: <SOMETHING>
    metrics.reporters: prom
    kubernetes.jobmanager.owner.reference: <SOMETHING>
    metrics.reporter.prom.port: 9090
    taskmanager.memory.process.size: 10G
    kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
    pipeline.name: <SOMETHING>
    execution.savepoint.path: s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e
    kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_12924532349572558288.yaml
    state.backend.rocksdb.localdir: /rocksdb/
    kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_1129545383743356980.yaml
    web.cancel.enable: false
    execution.checkpointing.timeout: 5 min
    kubernetes.container.image.pull-policy: IfNotPresent
    $internal.pipeline.job-id: bad5e5682b8f4fbefbf75b00d285ac10
    kubernetes.jobmanager.cpu: 2.0
    state.backend: filesystem
    $internal.flink.version: v1_14
    kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_824610597202468981.yaml
    blob.server.port: 6124
    kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:14
    metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
    state.savepoints.dir: s3://<SOMETHING>/savepoints
    kubernetes.taskmanager.cpu: 2.0
    execution.savepoint.ignore-unclaimed-state: true
    $internal.application.program-args:
    kubernetes.container.image: <SOMETHING>
    taskmanager.numberOfTaskSlots: 1
    metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
    kubernetes.rest-service.exposed.type: ClusterIP
    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    $internal.application.main: <SOMETHING>
    metrics.scope.jm: flink.jobmanager.metric
    execution.target: kubernetes-application
    jobmanager.memory.process.size: 10G
    metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
    taskmanager.rpc.port: 6122
    internal.cluster.execution-mode: NORMAL
    execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
    pipeline.jars: local:///build/flink/usrlib/<SOMETHING>.jar
    state.checkpoints.dir: s3://<SOMETHING>/checkpoints

At the time of the issue, here is our FlinkDeployment spec:

    Spec:
      Flink Configuration:
        execution.checkpointing.timeout: 5 min
        kubernetes.operator.job.restart.failed: true
        kubernetes.operator.periodic.savepoint.interval: 600s
        metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
        metrics.reporter.prom.port: 9090
        metrics.reporters: prom
        metrics.scope.jm: flink.jobmanager.metric
        metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
        metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
        metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
        metrics.scope.tm: flink.taskmanager.metric
        metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
        pipeline.auto-generate-uids: false
        pipeline.name: <SOMETHING>
        state.backend: filesystem
        state.backend.rocksdb.localdir: /rocksdb/
        state.checkpoints.dir: s3://<SOMETHING>/checkpoints
        state.savepoints.dir: s3://<SOMETHING>/savepoints
      Flink Version: v1_14
      Image: <SOMETHING>
      Image Pull Policy: IfNotPresent
      Job:
        Allow Non Restored State: true
        Args:
        Entry Class: <SOMETHING>
        Initial Savepoint Path: s3a://<SOMETHING>/savepoint-bad5e5-577c6a76aec5
        Jar URI: local:///build/flink/usrlib/<SOMETHING>.jar
        Parallelism: 2
        State: running
        Upgrade Mode: savepoint
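One more data point: the flink-config above has no high-availability settings. If JobManager HA metadata is what normally lets a restarted JobManager recover from the latest checkpoint instead of re-applying execution.savepoint.path (our assumption, not something we have confirmed), then enabling Kubernetes HA might also be relevant here. A minimal sketch for Flink 1.14, with a placeholder storage dir:

    # Sketch, assuming Kubernetes HA is applicable to our situation;
    # <SOMETHING> is a placeholder bucket/path, not our real config.
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://<SOMETHING>/ha

Happy to share operator logs or anything else that would help narrow this down. Thanks again!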