[ 
https://issues.apache.org/jira/browse/FLINK-39989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091440#comment-18091440
 ] 

Oleksandr Shulgin commented on FLINK-39989:
-------------------------------------------

Filippo points out that this issue was already reported at: 
https://issues.apache.org/jira/browse/FLINK-32631, but got closed with "Cannot 
Reproduce".

This time it is clear how to reproduce the problem.

> flinksessionjob stuck on "Job Not Found" if jobmanager terminates before 
> operator learns about new job state transitions
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39989
>                 URL: https://issues.apache.org/jira/browse/FLINK-39989
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.13
>         Environment: * Flink version: 2.1.2
>  * Flink Kubernetes operator version: 1.13
>            Reporter: Filippo Ghibellini
>            Priority: Major
>
> h3. Steps to reproduce
>  # create a `flinksessionjob` and wait for it to start running
>  # scale down the k8s operator to 0 replicas to simulate a delayed 
> reconciliation
>  # use the Flink web UI to cancel the job (or use any other method to put the 
> job in a [globally terminal 
> state|https://nightlies.apache.org/flink/flink-docs-stable/docs/internals/job_scheduling/#jobmanager-data-structures])
>  # delete all the job manager pods (will be replaced automatically by the k8s 
> deployment)
>  # scale the k8s operator back to 1 replica - this will resume the 
> reconciliation process
>  # the `flinksessionjob` k8s entity now reports `Job Not Found`
>
> It seems that the reconciliation process depends on the k8s operator 
> learning about job terminations from the job-manager {*}before the 
> job-manager restarts{*}. A newly started job-manager will not recover jobs 
> that reached a "globally terminal state" (since those are not even persisted 
> in the HA state), so after the restart the operator presumably finds no 
> trace of the job it still expects to exist.
> In our case the trigger for the jobs reaching a globally terminal state 
> seems to have been us setting `spec.job.state=suspended` on the k8s 
> `flinksessionjob` entity. So even though the reproduction steps cancel the 
> job through the UI, the problem can manifest even if the Flink cluster is 
> managed exclusively through the k8s operator (it's just harder to reproduce).
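
Steps 2 to 6 of the reproduction quoted above could be scripted roughly as follows. This is only a sketch: it builds the commands and prints them, it does not run anything. The function name `repro_commands`, the operator deployment name, the pod label selector (`app=<cluster>,component=jobmanager`) and the REST address are assumptions that are not stated in the issue and will likely need adjusting for a given cluster. The cancel call assumes the Flink REST endpoint `PATCH /jobs/<jobid>?mode=cancel` is reachable, as a stand-in for cancelling through the web UI.

```python
import shlex


def repro_commands(namespace, operator_deployment, cluster_id,
                   session_job, job_id, rest_url):
    """Return the commands for steps 2-6 as argv lists (nothing is executed)."""
    kubectl = ["kubectl", "-n", namespace]
    return [
        # Step 2: scale the operator to 0 so reconciliation is delayed.
        kubectl + ["scale", "deployment", operator_deployment, "--replicas=0"],
        # Step 3: cancel the job via the REST API instead of the web UI.
        ["curl", "-X", "PATCH", f"{rest_url}/jobs/{job_id}?mode=cancel"],
        # Step 4: delete the job manager pods; the deployment replaces them.
        # The label selector is an assumption about how the pods are labelled.
        kubectl + ["delete", "pod", "-l",
                   f"app={cluster_id},component=jobmanager"],
        # Step 5: bring the operator back so reconciliation resumes.
        kubectl + ["scale", "deployment", operator_deployment, "--replicas=1"],
        # Step 6: inspect the resource, which should now show the error.
        kubectl + ["get", "flinksessionjob", session_job, "-o", "yaml"],
    ]


if __name__ == "__main__":
    # All values below are placeholders, not taken from the issue.
    for argv in repro_commands("flink", "flink-kubernetes-operator",
                               "my-session-cluster", "my-session-job",
                               "<job-id>", "http://localhost:8081"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; each line can be reviewed and pasted into a shell once the placeholders are filled in.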



--
This message was sent by Atlassian Jira
(v8.20.10#820010)