Thanks Vijay. This is the larger issue: the cancellation routine is itself broken.
On cancellation flink does remove the checkpoint counter

*2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper*

but exits with a non-zero code

*2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.*

That, I think, is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.

On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

> Yes Vishal. That's correct.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> This is really not cool, but here you go. This seems to work. Agreed
>> that it cannot be this painful. The cancel does not exit with an exit
>> code of 0, and thus the job has to be deleted manually. Vijay, does this
>> align with what you have had to do?
>>
>> - Take a save point. This returns a request id:
>>
>> curl --header "Content-Type: application/json" --request POST --data
>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>> https://*************/jobs/00000000000000000000000000000000/savepoints
>>
>> - Make sure the save point succeeded:
>>
>> curl --request GET
>> https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>
>> - Cancel the job:
>>
>> curl --request PATCH
>> https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>
>> - Delete the job and deployment:
>>
>> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>> - Edit the job-cluster-job-deployment.yaml.
>> Add/Edit:
>>
>> args: ["job-cluster",
>> "--fromSavepoint",
>> "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>> "--job-classname", .........
>>
>> - Restart:
>>
>> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>> - Make sure from the UI that it restored from the specific save point.
>>
>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> Yes, it's supposed to work. But unfortunately it was not working. The
>>> Flink community needs to respond to this behavior.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> Aah.
>>>> Let me try this out and will get back to you.
>>>> Though I would assume that save point with cancel is a single atomic
>>>> step, rather than a save point *followed* by a cancellation (else why
>>>> would that be an option).
>>>> Thanks again.
>>>>
>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> yarn-cancel isn't meant only for a YARN cluster. It works for all
>>>>> clusters. It's the recommended command.
>>>>>
>>>>> Use the following command to issue a save point:
>>>>>
>>>>> curl --header "Content-Type: application/json" --request POST --data
>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> Then issue yarn-cancel.
>>>>> After that, follow the process to restore the save point.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> Hello Vijay,
>>>>>>
>>>>>> Thank you for the reply. This though is a k8s deployment
>>>>>> (rather than YARN), but maybe they follow the same lifecycle.
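The savepoint-then-cancel sequence discussed in the thread can be sketched as a small shell script. This is a hedged sketch, not the exact commands used above: `FLINK` is a placeholder for the (redacted) REST endpoint, and the trigger id in step 2 is illustrative. With `DRY_RUN=1` (the default here) it only prints the requests instead of sending them:

```shell
# Sketch of the savepoint -> verify -> cancel flow. FLINK and the trigger
# id are placeholders; the real endpoint is redacted in the thread.
# DRY_RUN=1 (default) prints the curl invocations instead of running them.
FLINK="${FLINK:-https://flink.example.com}"
JOB="00000000000000000000000000000000"
DIR="hdfs://nn-crunchy:8020/tmp/xyz14"

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "curl $*"; else curl "$@"; fi; }

# 1. Trigger the savepoint; the JSON response carries a request (trigger) id.
run --header "Content-Type: application/json" --request POST \
    --data "{\"target-directory\":\"$DIR\",\"cancel-job\":false}" \
    "$FLINK/jobs/$JOB/savepoints"

# 2. Poll that trigger id until the savepoint status is COMPLETED
#    (the id below is an illustrative placeholder).
run --request GET "$FLINK/jobs/$JOB/savepoints/2c053ce3bea31276aa25e63784629687"

# 3. Only after the savepoint is confirmed, cancel the job.
run --request PATCH "$FLINK/jobs/$JOB?mode=cancel"
```

Keeping the savepoint and the cancel as two separate calls is exactly the workaround the thread converges on, since savepoint-with-cancel in one request was not behaving atomically.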
>>>>>> I issue a *save point with cancel* as documented here
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>> a straight up
>>>>>>
>>>>>> curl --header "Content-Type: application/json" --request POST --data
>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>
>>>>>> I would assume that after taking the save point, the jvm should exit;
>>>>>> after all, the k8s deployment is of kind: job, and if it is a job
>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It
>>>>>> does seem to do some things right. It stops a bunch of stuff (the
>>>>>> JobMaster, the SlotPool, the zookeeper coordinator etc.). It also
>>>>>> removes the checkpoint counter but does not exit the job. And after a
>>>>>> little bit the job is restarted, which does not make sense and is
>>>>>> absolutely not the right thing to do (to me at least).
>>>>>>
>>>>>> Further, if I delete the deployment and the job from k8s and restart
>>>>>> the job and deployment fromSavePoint, it refuses to honor the
>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the
>>>>>> save point.
>>>>>>
>>>>>> Thus the process of cancelling and resuming from a save point on a
>>>>>> k8s job cluster deployment seems to be:
>>>>>>
>>>>>> - cancel with save point as defined here
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>> - delete the job manager job and task manager deployments from k8s
>>>>>> almost immediately.
>>>>>> - clear the ZK chroot for the 0000000...... job and maybe the
>>>>>> checkpoints directory.
>>>>>> - resumeFromCheckPoint
>>>>>>
>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>
>>>>>> Logs are attached.
>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>
>>>>>> After a little bit:
>>>>>>
>>>>>> Starting the job-cluster
>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>> ..
>>>>>> ..
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vishal
>>>>>>>
>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>> which is not a stable API. It always exits with 404.
>>>>>>> The best way to issue it is:
>>>>>>>
>>>>>>> a) First issue the save point REST API.
>>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>>> by (a) as an argument to the run jar REST API.
>>>>>>>
>>>>>>> The above is the smoother way.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> There are some issues I see and would want to get some feedback.
>>>>>>>>
>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory, the k8s
>>>>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>>>>> cancellation the jvm should exit, after cleanup etc., and thus the
>>>>>>>> pod should too. That does not happen, and thus the job pod remains
>>>>>>>> live. Is that expected?
>>>>>>>>
>>>>>>>> 2. To resume from a save point, it seems that I have to delete the
>>>>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>
>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>>>>> cancelling with a save point and resuming, and what is the cogent
>>>>>>>> story around the job id (defaults to 000000000000..). Note that
>>>>>>>> --job-id does not work with 1.7.2, so even though that does not
>>>>>>>> make sense, I still can not provide a new job id.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Vishal.
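The k8s teardown-and-resume half of the procedure can likewise be sketched as a shell script. This is a hedged sketch, not the thread's verbatim commands: the manifest paths are taken from the thread, the savepoint handle's host is a placeholder (it is redacted above, consistent with the earlier nn-crunchy paths), and `DRY_RUN=1` (the default) prints the commands instead of executing them:

```shell
# Sketch: after the savepoint completes and the job is cancelled, delete
# the k8s job/deployment and recreate the job-cluster with --fromSavepoint.
# Savepoint host is a placeholder; DRY_RUN=1 (default) prints the commands.
SAVEPOINT="hdfs://nn-crunchy:8020/tmp/xyz14/savepoint-000000-1d4f71345e22"
JM="manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml"
TM="manifests/bf2-PRODUCTION/task-manager-deployment.yaml"

k8s() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "kubectl $*"; else kubectl "$@"; fi; }

# Tear down the cancelled job and its task managers (kind: job does not
# exit on its own, per the thread, so this must be done by hand).
k8s delete -f "$JM"
k8s delete -f "$TM"

# Edit $JM so the container args include the savepoint, e.g.:
#   args: ["job-cluster", "--fromSavepoint", "<savepoint path>", "--job-classname", ...]
echo "edit $JM: add --fromSavepoint $SAVEPOINT"

# Recreate; the UI should then show the job restored from $SAVEPOINT.
k8s create -f "$JM"
k8s create -f "$TM"
```

Per the thread, with ZooKeeper HA the chroot for the all-zeros job id may also need to be cleared first, or the job resumes from the latest checkpoint and ignores `--fromSavepoint`.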