[ https://issues.apache.org/jira/browse/FLINK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550394#comment-17550394 ]
Gyula Fora commented on FLINK-27868:
------------------------------------

I think the best approach would be to improve the JobStatus observer with the following logic: if RUNNING status was observed, we should apply a second check to verify that all tasks are indeed running. If yes, keep the job in RUNNING; otherwise set the state to CREATED.

This way we can leverage the improved running observation throughout the operator code where we already use it, instead of having to inject custom logic all over the place.

> Harden running job check before triggering savepoints or savepoint upgrades
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-27868
>                 URL: https://issues.apache.org/jira/browse/FLINK-27868
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Assignee: Matyas Orhidi
>            Priority: Major
>             Fix For: kubernetes-operator-1.1.0
>
>
> Even if the job is in RUNNING state, often not all subtasks are yet running,
> which leads to savepoint upgrade / savepoint trigger failures. We should
> harden the isRunning check we use to include subtask states as well.
> This suggestion is described in more detail by [~matyas] here:
> https://github.com/apache/flink-kubernetes-operator/pull/237#issuecomment-1137054088

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
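
For illustration only, the status refinement proposed in the comment above could be sketched roughly as follows. This is a minimal, self-contained sketch and not the operator's actual code: the class RunningCheckSketch, the enums JobState and TaskState, and the method refine are hypothetical stand-ins, and treating a job with no reported tasks as "not yet running" is an assumption that the comment does not address.

```java
import java.util.List;

public class RunningCheckSketch {

    // Hypothetical stand-ins for job and task states; not the operator's real types.
    enum JobState { CREATED, RUNNING, FAILED, FINISHED }

    enum TaskState { CREATED, SCHEDULED, DEPLOYING, INITIALIZING, RUNNING, FINISHED, FAILED }

    /**
     * Returns the job state the observer would record.
     * A RUNNING observation is kept only if every task is RUNNING;
     * otherwise the job is reported as CREATED. Other states pass through unchanged.
     */
    static JobState refine(JobState observed, List<TaskState> taskStates) {
        if (observed != JobState.RUNNING) {
            return observed;
        }
        // Assumption: an empty task list means the tasks are not deployed yet.
        boolean allRunning = !taskStates.isEmpty()
                && taskStates.stream().allMatch(s -> s == TaskState.RUNNING);
        return allRunning ? JobState.RUNNING : JobState.CREATED;
    }

    public static void main(String[] args) {
        // One task is still deploying, so the refined state falls back to CREATED.
        System.out.println(refine(JobState.RUNNING,
                List.of(TaskState.RUNNING, TaskState.DEPLOYING)));
        // All tasks are running, so RUNNING is kept.
        System.out.println(refine(JobState.RUNNING,
                List.of(TaskState.RUNNING, TaskState.RUNNING)));
    }
}
```

With a refinement like this applied in one place, callers that already gate savepoint triggers or savepoint upgrades on the observed RUNNING state would pick up the stricter check without any additional logic of their own.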