[jira] [Created] (FLINK-34009) Apache flink: Checkpoint restoration issue on Application Mode of deployment

2024-01-07 Thread Vijay (Jira)
Vijay created FLINK-34009:
-

 Summary: Apache flink: Checkpoint restoration issue on Application 
Mode of deployment
 Key: FLINK-34009
 URL: https://issues.apache.org/jira/browse/FLINK-34009
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.18.0
 Environment: Flink version: 1.18

Zookeeper version: 3.7.2

Env: Custom flink docker image (with embedded application class) deployed over 
kubernetes (v1.26.11).
Reporter: Vijay


Hi Team,

Good day. We wish you all a happy new year 2024.

We are using Flink 1.18 on our Flink cluster. The Job Manager has been deployed in 
Application Mode with HA disabled (high-availability.type: NONE); under this 
configuration we are able to start multiple jobs of a single application 
(using env.executeAsync()).

Note: We have also set up checkpointing to an S3 bucket with the 
RETAIN_ON_CANCELLATION externalized-checkpoint mode (plus the other required settings).
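A minimal sketch of this setup, assuming a standard DataStream application; the class name, job names, bucket path, and sources are illustrative, not our actual code:

```java
// Hedged sketch: two jobs submitted from one application's main(),
// with retained externalized checkpoints. Names/paths are illustrative.
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiJobApplication {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // checkpoint every 60 seconds
        env.getCheckpointConfig().setCheckpointStorage("s3://bucket/checkpoints");
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // First pipeline; executeAsync() submits it and returns immediately.
        env.fromSequence(0, 1_000_000).name("source-1").print();
        env.executeAsync("jobxxx1");

        // A second pipeline built on the same environment becomes a second job.
        env.fromSequence(0, 1_000_000).name("source-2").print();
        env.executeAsync("jobxxx2");
    }
}
```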

Let's say we start two jobs of the same application (e.g. jobidxxx1, jobidxxx2) 
and they are currently running in the k8s environment. If we have to perform a 
Flink minor upgrade (or an upgrade of our application with minor changes), we 
stop the Job Manager and Task Manager instances, perform the necessary upgrade, 
and then start both the Job Manager and Task Manager instances again. On startup 
we expect the jobs to be restored from the last checkpoint, but this restoration 
does not happen when the Job Manager starts. Please let us know whether this is 
a bug or the general behavior of Flink under the Application Mode of deployment.

Additional information: If we enable HA (using ZooKeeper) in Application Mode, 
we are able to start only one job (i.e., per-job behavior). In that case, when 
we perform a Flink minor upgrade (or an upgrade of our application with minor 
changes), checkpoint restoration works properly across the Job Manager and Task 
Manager restart process.

It seems checkpoint restoration and HA are inter-related, but why doesn't 
checkpoint restoration work when HA is disabled?
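For reference, the manual workaround we are evaluating: since a retained checkpoint can be used like a savepoint for restore, the restarted cluster could be pointed at it explicitly. A hedged flink-conf.yaml sketch (the checkpoint path is illustrative):

```yaml
# Restore from a retained checkpoint on startup (acts like a savepoint restore)
execution.savepoint.path: s3://bucket/checkpoints/<job-id>/chk-42
# Tolerate state that no longer maps to an operator after the upgrade
execution.savepoint.ignore-unclaimed-state: true
```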

 

Please let us know if anyone has experienced similar issues or has any 
suggestions; it would be highly appreciated. Thanks in advance for your 
assistance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33944) Apache Flink: Process to restore more than one job on job manager startup from the respective savepoints

2023-12-26 Thread Vijay (Jira)
Vijay created FLINK-33944:
-

 Summary: Apache Flink: Process to restore more than one job on job 
manager startup from the respective savepoints
 Key: FLINK-33944
 URL: https://issues.apache.org/jira/browse/FLINK-33944
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.18.0
Reporter: Vijay


 
We are using Flink 1.18 for our Flink cluster. The Job Manager has been deployed 
in Application Mode, and we are looking for a process to restore multiple jobs 
(using their respective savepoint directories) when the Job Manager is started. 
Currently we have the option to restore only one job when running 
"standalone-job.sh", using --fromSavepoint and --allowNonRestoredState. However, 
we need a way to trigger multiple job executions via the Java client.

Note: We are not using a Kubernetes native deployment; we are using the k8s 
standalone mode of deployment.
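For context, the single-job restore that works today looks roughly like this (the class name and savepoint path are illustrative):

```shell
# Standalone Application Mode entrypoint: it accepts exactly ONE savepoint
# path, so only one job can be restored this way.
./bin/standalone-job.sh start \
    --job-classname com.example.MyFlinkApp \
    --fromSavepoint s3://bucket/savepoints/savepoint-xxxx \
    --allowNonRestoredState
```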

*Expected process:*
 # Before starting the Flink/application image upgrade, trigger savepoints for 
all the currently running jobs.
 # Once the savepoint process has completed for all jobs, trigger the scale-down 
of the Job Manager and Task Manager instances.
 # Update the image version in the k8s deployment with the updated application 
image.
 # After the image version is updated, scale the Job Manager and Task Managers 
back up.
 # We need a process to restore the previously running jobs from their savepoint 
directories and start all the jobs.
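The steps above can be sketched as follows; the deployment names, image tag, bucket, and the `flink list` output parsing are illustrative assumptions, not a definitive procedure:

```shell
# Hedged sketch of the upgrade procedure described above.
# 1. Trigger a savepoint for every running job.
for job in $(./bin/flink list -r | awk -F' : ' '/RUNNING/ {print $2}'); do
  ./bin/flink savepoint "$job" s3://bucket/savepoints/
done
# 2. Scale down the Job Manager and Task Managers.
kubectl scale deployment/flink-jobmanager deployment/flink-taskmanager --replicas=0
# 3. Update the application image.
kubectl set image deployment/flink-jobmanager jobmanager=myrepo/flink-app:new
kubectl set image deployment/flink-taskmanager taskmanager=myrepo/flink-app:new
# 4. Scale back up.
kubectl scale deployment/flink-jobmanager --replicas=1
kubectl scale deployment/flink-taskmanager --replicas=2
# 5. Missing piece (this issue): restore EACH job from its own savepoint;
#    standalone-job.sh only accepts a single --fromSavepoint path.
```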





[jira] [Created] (FLINK-33943) Apache flink: Issues after configuring HA (using zookeeper setting)

2023-12-26 Thread Vijay (Jira)
Vijay created FLINK-33943:
-

 Summary: Apache flink: Issues after configuring HA (using 
zookeeper setting)
 Key: FLINK-33943
 URL: https://issues.apache.org/jira/browse/FLINK-33943
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.18.0
 Environment: Flink version: 1.18

Zookeeper version: 3.7.2

Env: Custom flink docker image (with embedded application class) deployed over 
kubernetes (v1.26.11).

 
Reporter: Vijay


Hi Team,

Note: Not sure whether I have picked the right component while raising the 
issue.

Good day. We are using Flink 1.18 and ZooKeeper 3.7.2 for our Flink cluster. The 
Job Manager has been deployed in Application Mode, and when HA is disabled 
(high-availability.type: NONE) we are able to start multiple jobs (using 
env.executeAsync()) for a single application. But when we set up ZooKeeper as 
the HA type (high-availability.type: zookeeper), we see only one job getting 
executed on the Flink dashboard. The following are the parameters set for the 
ZooKeeper-based HA setup in flink-conf.yaml. Please let us know if anyone has 
experienced similar issues or has any suggestions. Thanks in advance for your 
assistance.

Note: We are using a streaming application; the following are the 
flink-conf.yaml configurations.
 # high-availability.storageDir: /opt/flink/data
 # high-availability.cluster-id: test
 # high-availability.zookeeper.quorum: localhost:2181
 # high-availability.type: zookeeper
 # high-availability.zookeeper.path.root: /dp/configs/flinkha


