I confirm that 1.8.0 fixes all of the above issues. The JM process exits with code 0 and the pod terminates ( TERMINATED state ). The above is true for both the PATCH cancel and the POST save point with cancel, as above.
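For reference, the two cancellation paths confirmed above can be sketched as a small dry-run script. The endpoint, job id and savepoint directory are placeholders (the real hosts are starred out in this thread), and the script only prints each request instead of issuing it, so it can be run without a cluster:

```shell
#!/bin/sh
# Hosts are redacted in the thread, so the endpoint and paths below are
# placeholders (assumptions), not the poster's real values.
FLINK_JM="http://localhost:8081"            # JobManager REST endpoint
JOB_ID="00000000000000000000000000000000"   # default job-cluster job id

run() { echo "$@"; }                        # dry-run: print, don't execute

# Path 1: plain cancel via PATCH; with 1.8.0 the JM then exits with code 0.
run curl -X PATCH "${FLINK_JM}/jobs/${JOB_ID}?mode=cancel"

# Path 2: savepoint with cancel via POST; the savepoint is taken first,
# then the job is cancelled and the JM exits with code 0.
run curl -H "Content-Type: application/json" -X POST \
  -d '{"target-directory":"hdfs:///tmp/savepoints","cancel-job":true}' \
  "${FLINK_JM}/jobs/${JOB_ID}/savepoints"
```

Swapping `run` for `eval "$@"` would execute the requests for real against an actual JobManager endpoint.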
Thank you for fixing this issue. On Wed, Mar 13, 2019 at 10:17 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote: > BTW, does 1.8 also solve the issue where we can cancel with a save point? > That too is broken in 1.7.2 > > curl --header "Content-Type: application/json" --request POST --data > '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":*true*}' > https://*************/jobs/00000000000000000000000000000000/savepoints > > > On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi < > vishal.santo...@gmail.com> wrote: > >> Awesome, thanks! >> >> On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <g...@ververica.com> wrote: >> >>> The RC artifacts are only deployed to the Maven Central Repository when >>> the RC >>> is promoted to a release. As written in the 1.8.0 RC1 voting email [1], >>> you >>> can find the maven artifacts and the Flink binaries here: >>> >>> - >>> https://repository.apache.org/content/repositories/orgapacheflink-1210/ >>> - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/ >>> >>> Alternatively, you can apply the patch yourself and build Flink 1.7 from >>> sources [2]. On my machine this takes around 10 minutes if tests are >>> skipped. >>> >>> Best, >>> Gary >>> >>> [1] >>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html >>> [2] >>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink >>> >>> On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi < >>> vishal.santo...@gmail.com> wrote: >>> >>>> Do you have an mvn repository ( at mvn central ) set up for the 1.8 release >>>> candidate? We could test it for you. >>>> >>>> Without 1.8 and this exit code we are essentially held up. >>>> >>>> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote: >>>> >>>>> Nobody can tell with 100% certainty. 
We want to give the RC some >>>>> exposure >>>>> first, and there is also a release process that is prescribed by the >>>>> ASF [1]. >>>>> You can look at past releases to get a feeling for how long the release >>>>> process lasts [2]. >>>>> >>>>> [1] http://www.apache.org/legal/release-policy.html#release-approval >>>>> [2] >>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0 >>>>> >>>>> >>>>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi < >>>>> vishal.santo...@gmail.com> wrote: >>>>> >>>>>> And when is the 1.8.0 release expected ? >>>>>> >>>>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi < >>>>>> vishal.santo...@gmail.com> wrote: >>>>>> >>>>>>> :) That makes so much more sense. Is k8s native flink a part of >>>>>>> this release ? >>>>>>> >>>>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Vishal, >>>>>>>> >>>>>>>> This issue was fixed recently [1], and the patch will be released >>>>>>>> with 1.8. If >>>>>>>> the Flink job gets cancelled, the JVM should exit with code 0. >>>>>>>> There is a >>>>>>>> release candidate [2], which you can test. >>>>>>>> >>>>>>>> Best, >>>>>>>> Gary >>>>>>>> >>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743 >>>>>>>> [2] >>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html >>>>>>>> >>>>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi < >>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Thanks Vijay, >>>>>>>>> >>>>>>>>> This is the larger issue. The cancellation routine is itself >>>>>>>>> broken. 
>>>>>>>>> >>>>>>>>> On cancellation Flink does remove the checkpoint counter >>>>>>>>> >>>>>>>>> *2019-03-12 14:12:13,143 >>>>>>>>> INFO >>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - >>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from >>>>>>>>> ZooKeeper * >>>>>>>>> >>>>>>>>> but exits with a non-zero code >>>>>>>>> >>>>>>>>> *2019-03-12 14:12:13,477 >>>>>>>>> INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - >>>>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint >>>>>>>>> with >>>>>>>>> exit code 1444.* >>>>>>>>> >>>>>>>>> >>>>>>>>> That I think is an issue. A cancelled job is a complete job and >>>>>>>>> thus the exit code should be 0 for k8s to mark it complete. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar < >>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Yes Vishal. That's correct. >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> Bhaskar >>>>>>>>>> >>>>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi < >>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed >>>>>>>>>>> that this cannot be this painful. The cancel does not exit with an >>>>>>>>>>> exit >>>>>>>>>>> code of 0 and thus the job has to be deleted manually. Vijay, does this >>>>>>>>>>> align >>>>>>>>>>> with what you have had to do ? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Take a save point. 
This returns a request id >>>>>>>>>>> >>>>>>>>>>> curl --header "Content-Type: application/json" --request POST >>>>>>>>>>> --data >>>>>>>>>>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' >>>>>>>>>>> >>>>>>>>>>> https://*************/jobs/00000000000000000000000000000000/savepoints >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Make sure the save point succeeded >>>>>>>>>>> >>>>>>>>>>> curl --request GET >>>>>>>>>>> https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - cancel the job >>>>>>>>>>> >>>>>>>>>>> curl --request PATCH >>>>>>>>>>> https://***************/jobs/00000000000000000000000000000000?mode=cancel >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Delete the job and deployment >>>>>>>>>>> >>>>>>>>>>> kubectl delete -f >>>>>>>>>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> kubectl delete -f >>>>>>>>>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Edit the job-cluster-job-deployment.yaml. Add/Edit >>>>>>>>>>> >>>>>>>>>>> args: ["job-cluster", >>>>>>>>>>> >>>>>>>>>>> "--fromSavepoint", >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", >>>>>>>>>>> "--job-classname", ......... >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Restart >>>>>>>>>>> >>>>>>>>>>> kubectl create -f >>>>>>>>>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> kubectl create -f >>>>>>>>>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Make sure, from the UI, that it restored from the specific >>>>>>>>>>> save point. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar < >>>>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Yes, it's supposed to work. 
But unfortunately it was not >>>>>>>>>>>> working. The Flink community needs to respond to this behavior. >>>>>>>>>>>> >>>>>>>>>>>> Regards >>>>>>>>>>>> Bhaskar >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi < >>>>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Aah. >>>>>>>>>>>>> Let me try this out and will get back to you. >>>>>>>>>>>>> Though I would assume that save point with cancel is a single >>>>>>>>>>>>> atomic step, rather than a save point *followed* by a >>>>>>>>>>>>> cancellation ( else why would that be an option ). >>>>>>>>>>>>> Thanks again. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar < >>>>>>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Vishal, >>>>>>>>>>>>>> >>>>>>>>>>>>>> yarn-cancel isn't meant only for yarn clusters. It works for >>>>>>>>>>>>>> all clusters. It's the recommended command. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Use the following command to issue a save point. >>>>>>>>>>>>>> curl --header "Content-Type: application/json" --request >>>>>>>>>>>>>> POST --data >>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1", >>>>>>>>>>>>>> "cancel-job":false}' \ https:// >>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints >>>>>>>>>>>>>> >>>>>>>>>>>>>> Then issue yarn-cancel. >>>>>>>>>>>>>> After that, follow the process to restore the save point. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards >>>>>>>>>>>>>> Bhaskar >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi < >>>>>>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hello Vijay, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank you for the reply. This though is a k8s >>>>>>>>>>>>>>> deployment ( rather than yarn ), but maybe they follow the same >>>>>>>>>>>>>>> lifecycle. 
>>>>>>>>>>>>>>> I issue a *save point with cancel* as documented here >>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, >>>>>>>>>>>>>>> a straight up >>>>>>>>>>>>>>> curl --header "Content-Type: application/json" --request >>>>>>>>>>>>>>> POST --data >>>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' >>>>>>>>>>>>>>> \ https:// >>>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I would assume that after taking the save point, the jvm >>>>>>>>>>>>>>> should exit; after all, the k8s deployment is of kind: job and >>>>>>>>>>>>>>> if it is a >>>>>>>>>>>>>>> job cluster then a cancellation should exit the jvm and hence >>>>>>>>>>>>>>> the pod. It >>>>>>>>>>>>>>> does seem to do some things right. It stops a bunch of stuff ( >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> JobMaster, the SlotPool, the zookeeper coordinator etc ). It also >>>>>>>>>>>>>>> removes the >>>>>>>>>>>>>>> checkpoint counter but does not exit the jvm. And after a >>>>>>>>>>>>>>> little bit the >>>>>>>>>>>>>>> job is restarted, which does not make sense and is absolutely not >>>>>>>>>>>>>>> the right >>>>>>>>>>>>>>> thing to do ( to me at least ). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Further, if I delete the deployment and the job from k8s and >>>>>>>>>>>>>>> restart the job and deployment fromSavepoint, it refuses to >>>>>>>>>>>>>>> honor the >>>>>>>>>>>>>>> fromSavepoint. I have to delete the zk chroot for it to >>>>>>>>>>>>>>> consider the save >>>>>>>>>>>>>>> point. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a >>>>>>>>>>>>>>> k8s job cluster deployment seems to be >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - cancel with save point as defined here >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints >>>>>>>>>>>>>>> - delete the job manager job and task manager >>>>>>>>>>>>>>> deployments from k8s almost immediately. >>>>>>>>>>>>>>> - clear the ZK chroot for the 0000000...... job and maybe >>>>>>>>>>>>>>> the checkpoints directory. >>>>>>>>>>>>>>> - resumeFromCheckPoint >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Can somebody confirm that this is indeed the process ? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Logs are attached. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster >>>>>>>>>>>>>>> - Savepoint stored in >>>>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. >>>>>>>>>>>>>>> Now >>>>>>>>>>>>>>> cancelling 00000000000000000000000000000000. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Job anomaly_echo (00000000000000000000000000000000) >>>>>>>>>>>>>>> switched from state RUNNING to CANCELLING. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator >>>>>>>>>>>>>>> - Completed checkpoint 10 for job >>>>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms). 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: >>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched >>>>>>>>>>>>>>> from RUNNING >>>>>>>>>>>>>>> to CANCELING. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: >>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched >>>>>>>>>>>>>>> from >>>>>>>>>>>>>>> CANCELING to CANCELED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Job anomaly_echo (00000000000000000000000000000000) >>>>>>>>>>>>>>> switched from state CANCELLING to CANCELED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator >>>>>>>>>>>>>>> - Stopping checkpoint coordinator for job >>>>>>>>>>>>>>> 00000000000000000000000000000000. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>>>>>>>>>>>>> - Shutting down >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint >>>>>>>>>>>>>>> - Checkpoint with ID 8 at >>>>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' >>>>>>>>>>>>>>> not >>>>>>>>>>>>>>> discarded. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>>>>>>>>>>>>> - Removing >>>>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 >>>>>>>>>>>>>>> from ZooKeeper >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint >>>>>>>>>>>>>>> - Checkpoint with ID 10 at >>>>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' >>>>>>>>>>>>>>> not >>>>>>>>>>>>>>> discarded. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter >>>>>>>>>>>>>>> - Shutting down. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter >>>>>>>>>>>>>>> - Removing >>>>>>>>>>>>>>> /checkpoint-counter/00000000000000000000000000000000 from >>>>>>>>>>>>>>> ZooKeeper >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher >>>>>>>>>>>>>>> - Job 00000000000000000000000000000000 reached globally >>>>>>>>>>>>>>> terminal state CANCELED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster >>>>>>>>>>>>>>> - Stopping the JobMaster for job >>>>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint >>>>>>>>>>>>>>> - Shutting StandaloneJobClusterEntryPoint down with >>>>>>>>>>>>>>> application status CANCELED. Diagnostics null. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint >>>>>>>>>>>>>>> - Shutting down rest endpoint. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService >>>>>>>>>>>>>>> /leader/resource_manager_lock. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster >>>>>>>>>>>>>>> - Close ResourceManager connection >>>>>>>>>>>>>>> d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down.. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool >>>>>>>>>>>>>>> - Suspending SlotPool. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool >>>>>>>>>>>>>>> - Stopping SlotPool. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager >>>>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca >>>>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for >>>>>>>>>>>>>>> job 00000000000000000000000000000000 from the resource manager. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService >>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService >>>>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> After a little bit >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Starting the job-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace >>>>>>>>>>>>>>> with key `jobmanager.heap.size` >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Starting standalonejob as a console application on host >>>>>>>>>>>>>>> anomalyecho-mmg6t. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> .. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> .. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar < >>>>>>>>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Vishal >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST >>>>>>>>>>>>>>>> API, which is not a stable API. It always exits with 404. The best >>>>>>>>>>>>>>>> way to issue it >>>>>>>>>>>>>>>> is: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> a) First issue the save point REST API >>>>>>>>>>>>>>>> b) Then issue the /yarn-cancel REST API ( as described in >>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E >>>>>>>>>>>>>>>> ) >>>>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path >>>>>>>>>>>>>>>> returned by (a) as an argument to the run jar REST API >>>>>>>>>>>>>>>> The above is the smoother way. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>>> Bhaskar >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi < >>>>>>>>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> There are some issues I see and would want to get some >>>>>>>>>>>>>>>>> feedback >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory >>>>>>>>>>>>>>>>> , the k8s job does not exit ( it is not a deployment ) . I >>>>>>>>>>>>>>>>> would assume >>>>>>>>>>>>>>>>> that on cancellation the jvm should exit, after cleanup etc, >>>>>>>>>>>>>>>>> and thus the >>>>>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod >>>>>>>>>>>>>>>>> remains live. Is >>>>>>>>>>>>>>>>> that expected ? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to >>>>>>>>>>>>>>>>> delete the job id ( 0000000000.... 
) from ZooKeeper ( this >>>>>>>>>>>>>>>>> is HA ), else >>>>>>>>>>>>>>>>> it defaults to the latest checkpoint no matter what >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I am kind of curious as to what, in 1.7.2, is the tested >>>>>>>>>>>>>>>>> process of cancelling with a save point and resuming, and >>>>>>>>>>>>>>>>> what is the >>>>>>>>>>>>>>>>> cogent story around the job id ( defaults to 000000000000.. ). >>>>>>>>>>>>>>>>> Note that >>>>>>>>>>>>>>>>> --job-id does not work with 1.7.2, so even though that does >>>>>>>>>>>>>>>>> not make sense, >>>>>>>>>>>>>>>>> I still cannot provide a new job id. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Vishal. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>
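Pulling the thread together, the 1.7.2 workaround the discussion converges on can be sketched as a dry-run script. All hosts, HDFS paths, manifest names and the trigger id are placeholders (the originals are starred out above), and each step is printed rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of the 1.7.2 cancel-and-resume workaround from this thread.
# Endpoint, paths and manifest names are placeholders, not the real values.
FLINK_JM="http://localhost:8081"          # JobManager REST endpoint (assumption)
JOB_ID="00000000000000000000000000000000" # default job-cluster job id
SP_DIR="hdfs:///tmp/savepoints"           # savepoint target dir (assumption)
MANIFESTS="manifests"                     # k8s manifest directory (assumption)

run() { echo "$@"; }                      # print each step instead of running it

# 1. Take a savepoint WITHOUT cancelling; the response carries a trigger id.
run curl -H "Content-Type: application/json" -X POST \
  -d "{\"target-directory\":\"${SP_DIR}\",\"cancel-job\":false}" \
  "${FLINK_JM}/jobs/${JOB_ID}/savepoints"

# 2. Poll the trigger id (placeholder) until the savepoint is reported done.
run curl "${FLINK_JM}/jobs/${JOB_ID}/savepoints/<trigger-id>"

# 3. Cancel the job; on 1.7.2 the JM exits non-zero, hence the manual cleanup.
run curl -X PATCH "${FLINK_JM}/jobs/${JOB_ID}?mode=cancel"

# 4. Delete the k8s job and deployment, and clear the ZK chroot for the job
#    id, or the restart will resume from the latest checkpoint instead.
run kubectl delete -f "${MANIFESTS}/job-cluster-job-deployment.yaml"
run kubectl delete -f "${MANIFESTS}/task-manager-deployment.yaml"

# 5. Add --fromSavepoint <savepoint path> to the job manifest args, recreate.
run kubectl create -f "${MANIFESTS}/job-cluster-job-deployment.yaml"
run kubectl create -f "${MANIFESTS}/task-manager-deployment.yaml"
```

Swap `run` for `eval "$@"` to execute the steps against a real cluster; on 1.8.0, per the confirmation at the top of the thread, steps 3-4 collapse into a single cancel, since the JM exits with code 0 and the pod terminates on its own.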