Yang Wang created FLINK-26930:
---------------------------------

             Summary: Rethink last-state upgrade implementation in 
flink-kubernetes-operator
                 Key: FLINK-26930
                 URL: https://issues.apache.org/jira/browse/FLINK-26930
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Yang Wang


This follows the discussion in FLINK-26916.

 

How does the last-state upgrade work now?

First, the Flink cluster is deleted directly while the HA ConfigMap is retained. 
This leaves the job in a "SUSPENDED" state. The flink-kubernetes-operator then 
deploys a new Flink application with the same cluster-id so that it can recover 
from the latest checkpoint. Note that before the application is started, the 
JobGraph is deleted from the HA ConfigMap. This ensures that the newly changed 
job options take effect.
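
The sequence above can be sketched as a small in-memory model. This is only an 
illustration of the ordering of the steps, not the operator's actual code: the 
classes HaConfigMap and Cluster, the function last_state_upgrade, and the keys 
"jobGraph" and "latestCheckpoint" are all hypothetical names chosen for the 
example.

```python
from dataclasses import dataclass, field


@dataclass
class HaConfigMap:
    """Hypothetical stand-in for the HA ConfigMap; the key names are made up."""
    data: dict = field(default_factory=dict)


@dataclass
class Cluster:
    """Hypothetical stand-in for a deployed Flink application cluster."""
    cluster_id: str
    job_options: dict
    job_state: str = "RUNNING"


def last_state_upgrade(cluster, ha, new_options):
    # Step 1: delete the cluster but keep the HA ConfigMap.
    # The job is left in a SUSPENDED state rather than being cancelled.
    cluster.job_state = "SUSPENDED"
    cluster_id = cluster.cluster_id

    # Step 2: drop the stored JobGraph so that the changed job options
    # take effect instead of the old graph being recovered.
    ha.data.pop("jobGraph", None)

    # Step 3: deploy a new application with the same cluster-id, so it
    # finds the checkpoint pointer in the retained ConfigMap.
    restored_from = ha.data.get("latestCheckpoint")
    new_cluster = Cluster(cluster_id, new_options)
    return new_cluster, restored_from


# Example run with made-up values.
ha = HaConfigMap({"jobGraph": "<serialized graph>", "latestCheckpoint": "chk-42"})
old = Cluster("my-app", {"parallelism": 2})
new, restored = last_state_upgrade(old, ha, {"parallelism": 4})
```

In this model the checkpoint pointer survives only because the ConfigMap is 
retained and the cluster-id is reused, which is why the approach depends on 
HA metadata rather than on an explicit record of the last checkpoint.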

 

Some community developers are considering extending the JobResultStore (JRS) so 
that the stored job result contains the list of retained checkpoints. This of 
course implies that the cluster is shut down / the job is terminated properly 
(other cases should be used for fail-over scenarios only).
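
As a rough sketch of what such an extended job result might allow, assuming a 
hypothetical entry shape (JobResultEntry, terminated_properly and 
retained_checkpoints are invented for this example and are not Flink's actual 
format) and assuming the checkpoint list is ordered from oldest to newest:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobResultEntry:
    """Hypothetical shape of a stored job result; not Flink's actual format."""
    job_id: str
    terminated_properly: bool
    retained_checkpoints: List[str] = field(default_factory=list)


def last_checkpoint(entry: JobResultEntry) -> Optional[str]:
    # Only trust the list when the job terminated properly; otherwise
    # the caller should fall back to the fail-over path.
    if not entry.terminated_properly or not entry.retained_checkpoints:
        return None
    # Assumption: the list is ordered oldest to newest.
    return entry.retained_checkpoints[-1]


clean = JobResultEntry("job-1", True, ["chk-40", "chk-41", "chk-42"])
crashed = JobResultEntry("job-2", False, ["chk-7"])
```

Here last_checkpoint(clean) yields the newest retained checkpoint, while 
last_checkpoint(crashed) yields None to signal that the entry should not be 
used for an upgrade.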

 

As soon as there is a straightforward way of accessing the last checkpoint, we 
should improve the current implementation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
