[ https://issues.apache.org/jira/browse/FLINK-34009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804118#comment-17804118 ]
Vijay commented on FLINK-34009:
-------------------------------

Since Flink supports multi-job execution in Application mode (with HA
disabled), we need more details on how to enable job restoration from
checkpoints when the application or Flink itself is upgraded. Please help us
resolve this issue. Thanks.

> Apache flink: Checkpoint restoration issue on Application Mode of deployment
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-34009
>                 URL: https://issues.apache.org/jira/browse/FLINK-34009
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>         Environment: Flink version: 1.18
> Zookeeper version: 3.7.2
> Env: Custom flink docker image (with embedded application class) deployed
> over kubernetes (v1.26.11).
>            Reporter: Vijay
>            Priority: Major
>
> Hi Team,
> Good day, and a happy new year 2024 to you all.
> We are running Flink 1.18 on our Flink cluster. The JobManager is deployed
> in Application mode with HA disabled (high-availability.type: NONE). With
> this configuration we are able to start multiple jobs of a single
> application (using env.executeAsync()).
> Note: We have also set up checkpointing to an S3 instance with
> RETAIN_ON_CANCELLATION mode (plus other required settings).
> Say we start two jobs of the same application (e.g. Jobidxxx1, jobidxxx2)
> and they are running in the k8s environment. To perform a minor Flink
> upgrade, or to upgrade our application with minor changes, we stop the
> JobManager and TaskManager instances, perform the upgrade, and then start
> the JobManager and TaskManager instances again. On startup we expect the
> jobs to be restored from the last checkpoint, but no job restoration happens
> when the JobManager starts. Please let us know whether this is a bug or the
> expected behavior of Flink in Application mode.
> Additional information: If we enable HA (using ZooKeeper) in Application
> mode, we are able to start only one job (i.e., per-job behavior). In that
> setup, when we perform a minor Flink upgrade or upgrade our application with
> minor changes, checkpoint restoration works properly when the JobManager and
> TaskManagers restart.
> It seems checkpoint restoration and HA are interrelated, but why doesn't
> checkpoint restoration work when HA is disabled?
>
> Please let us know if anyone has experienced similar issues or has any
> suggestions; it would be highly appreciated. Thanks in advance for your
> assistance.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)