Filippo Ghibellini created FLINK-39989:
------------------------------------------
Summary: flinksessionjob stuck on "Job Not Found" if jobmanager
terminates before operator learns about new job state transitions
Key: FLINK-39989
URL: https://issues.apache.org/jira/browse/FLINK-39989
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.13
Environment: * Flink version: 2.1.2
* Flink Kubernetes operator version: 1.13
Reporter: Filippo Ghibellini
h3. Steps to reproduce
# create a `flinksessionjob` and wait for it to start running
# scale down the k8s operator to 0 replicas to simulate a delayed
reconciliation
# use the Flink web UI to cancel the job (or use any other method to put the
job in a [globally terminal
state|https://nightlies.apache.org/flink/flink-docs-stable/docs/internals/job_scheduling/#jobmanager-data-structures])
# delete all the job manager pods (will be replaced automatically by the k8s
deployment)
# scale the k8s operator back to 1 replica - this will resume the
reconciliation process
# observe that the `flinksessionjob` k8s entity now reports `Job Not Found`
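
For convenience, the kubectl side of the steps above could be scripted roughly as follows. This is only a sketch: the namespace `flink`, the operator deployment name `flink-kubernetes-operator` and the pod label selector `component=jobmanager` are assumptions about a typical setup, not values from our cluster, and the helper `repro_commands` is hypothetical. It only builds the command lines and does not execute anything.

```python
from typing import List


def repro_commands(
    namespace: str = "flink",                                # assumed namespace
    operator_deployment: str = "flink-kubernetes-operator",  # assumed deployment name
    jm_selector: str = "component=jobmanager",               # assumed job manager pod label
) -> List[List[str]]:
    """Return the kubectl invocations for steps 2, 4 and 5 (nothing is executed)."""
    scale = ["kubectl", "-n", namespace, "scale", "deployment", operator_deployment]
    return [
        # step 2: stop the operator so that it misses the upcoming state transition
        scale + ["--replicas=0"],
        # step 3 happens here, outside kubectl: cancel the job in the Flink web UI
        # step 4: delete the job manager pods; the k8s deployment recreates them
        ["kubectl", "-n", namespace, "delete", "pod", "-l", jm_selector],
        # step 5: bring the operator back so that reconciliation resumes
        scale + ["--replicas=1"],
    ]


if __name__ == "__main__":
    for argv in repro_commands():
        print(" ".join(argv))
```

The commands are returned as argument lists so that they can be reviewed, or passed to `subprocess.run`, one at a time, with the manual cancellation done between the first and the second command.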
It seems that the entire reconciliation process relies heavily on the k8s
operator learning about job terminations from the job-manager {*}before the
job-manager restarts{*}.
A newly started job-manager will not recover jobs that reached a "globally
terminal state" (since those are not even persisted in the HA state).
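
To illustrate the suspected race, here is a toy model in Python. It is not operator or Flink code: `ToyJobManager`, `observe` and `scenario` are hypothetical names, and the only behaviour modelled is the one described above, namely that a job in a globally terminal state is dropped from the HA state and is therefore unknown to a restarted job-manager.

```python
from typing import Dict, Optional, Tuple

# Flink's globally terminal job states
TERMINAL = {"CANCELED", "FINISHED", "FAILED"}


class ToyJobManager:
    """Toy stand-in for a session cluster job-manager (not Flink code)."""

    def __init__(self, ha_store: Dict[str, str]) -> None:
        self.ha_store = ha_store    # survives job-manager restarts
        self.jobs = dict(ha_store)  # in-memory view, rebuilt from HA state on start

    def set_state(self, job_id: str, state: str) -> None:
        self.jobs[job_id] = state
        if state in TERMINAL:
            # assumption taken from the report: terminal jobs are not kept in HA state
            self.ha_store.pop(job_id, None)
        else:
            self.ha_store[job_id] = state

    def get_state(self, job_id: str) -> Optional[str]:
        return self.jobs.get(job_id)


def observe(jm: ToyJobManager, job_id: str) -> str:
    """What the toy operator sees when it asks the job-manager about a job."""
    state = jm.get_state(job_id)
    return state if state is not None else "Job Not Found"


def scenario(observe_before_restart: bool) -> Tuple[str, str]:
    """Return (last state the operator saw before the restart, state seen after it)."""
    ha_store: Dict[str, str] = {}
    jm = ToyJobManager(ha_store)
    jm.set_state("job-1", "RUNNING")
    last_seen = observe(jm, "job-1")      # operator sees RUNNING, then goes away
    jm.set_state("job-1", "CANCELED")     # job reaches a globally terminal state
    if observe_before_restart:
        last_seen = observe(jm, "job-1")  # operator learns about the termination in time
    jm = ToyJobManager(ha_store)          # job-manager pod is replaced
    return last_seen, observe(jm, "job-1")


if __name__ == "__main__":
    print(scenario(observe_before_restart=True))
    print(scenario(observe_before_restart=False))
```

In this model, both runs end with the restarted job-manager not knowing the job. The difference is that in the second run the last state the operator recorded is still `RUNNING`, so it has no way to tell that the job was terminated rather than lost.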
In our case, the trigger for the jobs reaching a globally terminal state appears
to have been us setting `spec.job.state=suspended` on the k8s `flinksessionjob`
entity. So although the reproduction steps cancel the job through the UI, the
problem can also manifest when the Flink cluster is managed exclusively through
the k8s operator (it's just harder to reproduce).
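
For reference, a spec change of this kind could be applied with a merge patch along these lines. This is a sketch only: the resource name `my-session-job` and the namespace `flink` are placeholders, and the helper `suspend_patch_command` is hypothetical. It only builds the kubectl command line without running it.

```python
import json
from typing import List


def suspend_patch_command(name: str, namespace: str = "flink") -> List[str]:
    """Build a kubectl merge patch that sets spec.job.state to suspended."""
    patch = {"spec": {"job": {"state": "suspended"}}}
    return [
        "kubectl", "-n", namespace,
        "patch", "flinksessionjob", name,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # "my-session-job" is a placeholder resource name
    print(" ".join(suspend_patch_command("my-session-job")))
```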