Hi flink-kubernetes-operator maintainers,

We recently migrated to the official operator and are seeing a new issue
where our FlinkDeployments can fail and crashloop looking for a
non-existent savepoint. On further inspection, the job is attempting to
restart from the savepoint specified in execution.savepoint.path. This
config is new for us (it wasn't set by our previous operator) and seems to
be set automatically behind the scenes by the official operator. We can
see that the savepoint in execution.savepoint.path existed at some point
but gets deleted after some amount of time (in the latest example, a few
hours). Then, when there is some pod disruption, the job attempts to
restart from the (now deleted) savepoint and starts crashlooping.
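
For reference, this is roughly the check we have been running to confirm
the mismatch (the ConfigMap name assumes Flink's standard
flink-config-<cluster-id> naming; <namespace>, <cluster-id> and the bucket
are placeholders):

# savepoint path the operator injected into the generated flink-conf.yaml
kubectl -n <namespace> get configmap flink-config-<cluster-id> \
  -o jsonpath='{.data.flink-conf\.yaml}' | grep execution.savepoint.path

# listing the referenced savepoint prefix comes back empty, i.e. it is gone
aws s3 ls s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e/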

Hoping you can help us troubleshoot and figure out whether this can be
solved through configuration (we are using configs equivalent to the ones
from our previous operator, where we did not have this issue). Adding some
details on versions and Kubernetes state below, plus a note after the spec
on the savepoint-related settings we are (not) setting. Thank you for your
support!

Flink Version: 1.14.5
Flink Operator Version: 1.4.0

At the time of the issue, here is the Flink configuration we see in the
ConfigMap (the savepoint savepoint-bad5e5-6ab08cf0808e had already been
deleted from S3 at this point):

kubernetes.jobmanager.replicas: 1
jobmanager.rpc.address: <SOMETHING>
metrics.scope.task: flink.taskmanager.job.<job_name>.task.<task_name>.metric
kubernetes.service-account: <SOMETHING>
kubernetes.cluster-id: <SOMETHING>
pipeline.auto-generate-uids: false
metrics.scope.tm: flink.taskmanager.metric
parallelism.default: 2
kubernetes.namespace: <SOMETHING>
metrics.reporters: prom
kubernetes.jobmanager.owner.reference: <SOMETHING>
metrics.reporter.prom.port: 9090
taskmanager.memory.process.size: 10G
kubernetes.internal.jobmanager.entrypoint.class: org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
pipeline.name: <SOMETHING>
execution.savepoint.path: s3://<SOMETHING>/savepoint-bad5e5-6ab08cf0808e
kubernetes.pod-template-file: /tmp/flink_op_generated_podTemplate_12924532349572558288.yaml
state.backend.rocksdb.localdir: /rocksdb/
kubernetes.pod-template-file.taskmanager: /tmp/flink_op_generated_podTemplate_1129545383743356980.yaml
web.cancel.enable: false
execution.checkpointing.timeout: 5 min
kubernetes.container.image.pull-policy: IfNotPresent
$internal.pipeline.job-id: bad5e5682b8f4fbefbf75b00d285ac10
kubernetes.jobmanager.cpu: 2.0
state.backend: filesystem
$internal.flink.version: v1_14
kubernetes.pod-template-file.jobmanager: /tmp/flink_op_generated_podTemplate_824610597202468981.yaml
blob.server.port: 6124
kubernetes.jobmanager.annotations: flinkdeployment.flink.apache.org/generation:14
metrics.scope.operator: flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
state.savepoints.dir: s3://<SOMETHING>/savepoints
kubernetes.taskmanager.cpu: 2.0
execution.savepoint.ignore-unclaimed-state: true
$internal.application.program-args:
kubernetes.container.image: <SOMETHING>
taskmanager.numberOfTaskSlots: 1
metrics.scope.jm.job: flink.jobmanager.job.<job_name>.metric
kubernetes.rest-service.exposed.type: ClusterIP
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
$internal.application.main: <SOMETHING>
metrics.scope.jm: flink.jobmanager.metric
execution.target: kubernetes-application
jobmanager.memory.process.size: 10G
metrics.scope.tm.job: flink.taskmanager.job.<job_name>.metric
taskmanager.rpc.port: 6122
internal.cluster.execution-mode: NORMAL
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
pipeline.jars: local:///build/flink/usrlib/<SOMETHING>.jar
state.checkpoints.dir: s3://<SOMETHING>/checkpoints

At the time of the issue, here is our FlinkDeployment Spec:

Spec:
  Flink Configuration:
    execution.checkpointing.timeout:                  5 min
    kubernetes.operator.job.restart.failed:           true
    kubernetes.operator.periodic.savepoint.interval:  600s
    metrics.reporter.prom.class:                      org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port:                       9090
    metrics.reporters:                                prom
    metrics.scope.jm:                                 flink.jobmanager.metric
    metrics.scope.jm.job:                             flink.jobmanager.job.<job_name>.metric
    metrics.scope.operator:                           flink.taskmanager.job.<job_name>.operator.<operator_name>.metric
    metrics.scope.task:                               flink.taskmanager.job.<job_name>.task.<task_name>.metric
    metrics.scope.tm:                                 flink.taskmanager.metric
    metrics.scope.tm.job:                             flink.taskmanager.job.<job_name>.metric
    pipeline.auto-generate-uids:                      false
    pipeline.name:                                    <SOMETHING>
    state.backend:                                    filesystem
    state.backend.rocksdb.localdir:                   /rocksdb/
    state.checkpoints.dir:                            s3://<SOMETHING>/checkpoints
    state.savepoints.dir:                             s3://<SOMETHING>/savepoints
  Flink Version:                                      v1_14
  Image:                                              <SOMETHING>
  Image Pull Policy:                                  IfNotPresent
  Job:
    Allow Non Restored State:  true
    Args:
    Entry Class:             <SOMETHING>
    Initial Savepoint Path:  s3a://<SOMETHING>/savepoint-bad5e5-577c6a76aec5
    Jar URI:                 local:///build/flink/usrlib/<SOMETHING>.jar
    Parallelism:             2
    State:                   running
    Upgrade Mode:            savepoint
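
One more data point on configuration: beyond
kubernetes.operator.periodic.savepoint.interval and the savepoint/upgrade
settings shown in the spec above, we have not overridden any operator-level
savepoint history / cleanup options, so presumably the defaults apply. This
is a rough sketch of how we are inspecting the operator's default
configuration (assuming the standard flink-operator-config ConfigMap
created by the Helm chart; the namespace is a placeholder):

kubectl -n <operator-namespace> get configmap flink-operator-config \
  -o jsonpath='{.data.flink-conf\.yaml}' | grep -E 'savepoint|checkpoint'

If one of the operator's savepoint retention/cleanup settings is what ends
up removing the savepoint that execution.savepoint.path still references,
a pointer to the relevant configuration would be much appreciated.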
