Filippo Ghibellini created FLINK-39989:
------------------------------------------

             Summary: flinksessionjob stuck on "Job Not Found" if jobmanager 
terminates before operator learns about new job state transitions
                 Key: FLINK-39989
                 URL: https://issues.apache.org/jira/browse/FLINK-39989
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.13
         Environment: * Flink version: 2.1.2
 * Flink Kubernetes operator version: 1.13
            Reporter: Filippo Ghibellini


h3. Steps to reproduce
 # create a `flinksessionjob` and wait for it to start running
 # scale down the k8s operator to 0 replicas to simulate a delayed 
reconciliation
 # use the Flink web UI to cancel the job (or use any other method to put the 
job in a [globally terminal 
state|https://nightlies.apache.org/flink/flink-docs-stable/docs/internals/job_scheduling/#jobmanager-data-structures])
 # delete all the job manager pods (they will be replaced automatically by the 
k8s deployment)
 # scale the k8s operator back to 1 replica - this will resume the 
reconciliation process
 # observe that the `flinksessionjob` k8s entity now reports `Job Not Found`

It seems that the entire reconciliation process relies heavily on the k8s 
operator learning about job terminations from the job-manager {*}before the 
job-manager restarts{*}.

A newly started job-manager will not recover jobs that reached a "globally 
terminal state" (since those are not even persisted in the HA state).
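
The race described above can be illustrated with a small toy model. This is 
only a sketch under stated assumptions, not the operator's or Flink's actual 
code: `ToyJobManager`, `observe` and `scenario` are hypothetical names, and the 
HA store is modelled as a plain dict that drops a job once it is cancelled.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    RUNNING = "RUNNING"
    CANCELED = "CANCELED"  # a globally terminal state


@dataclass
class ToyJobManager:
    """Hypothetical stand-in for the session cluster's job-manager."""
    jobs: dict = field(default_factory=dict)      # in-memory view of jobs
    ha_store: dict = field(default_factory=dict)  # state that survives a restart

    def submit(self, job_id):
        self.jobs[job_id] = JobState.RUNNING
        self.ha_store[job_id] = JobState.RUNNING

    def cancel(self, job_id):
        self.jobs[job_id] = JobState.CANCELED
        # Assumption taken from the report: globally terminal jobs are
        # not kept in the HA state.
        self.ha_store.pop(job_id, None)

    def restart(self):
        # A fresh job-manager only recovers what the HA store still holds.
        self.jobs = dict(self.ha_store)

    def status(self, job_id):
        return self.jobs.get(job_id)  # None models "Job Not Found"


def observe(jm, job_id):
    """Toy operator observation: report what the job-manager currently knows."""
    state = jm.status(job_id)
    return state.value if state is not None else "Job Not Found"


def scenario(operator_observes_before_restart):
    """Return (state seen before the restart, state seen after the restart)."""
    jm = ToyJobManager()
    jm.submit("job-1")
    jm.cancel("job-1")
    last_seen = None
    if operator_observes_before_restart:
        last_seen = observe(jm, "job-1")
    jm.restart()
    return last_seen, observe(jm, "job-1")
```

In this model `scenario(True)` leaves the operator with a recorded `CANCELED` 
state before the job disappears, whereas `scenario(False)` (the operator was 
scaled down) only ever sees `Job Not Found` and has no way to tell that the job 
terminated.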

In our case, the trigger for the jobs reaching a globally terminal state 
appears to have been us setting `spec.job.state=suspended` on the k8s 
`flinksessionjob` entity. So although the reproduction steps cancel the job 
through the UI, the problem can also manifest when the Flink cluster is managed 
exclusively through the k8s operator (it is just harder to reproduce).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)