Yes, it's supposed to work, but unfortunately it was not working. The Flink community needs to respond to this behavior.
Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> Aah. Let me try this out and will get back to you.
> Though I would assume that savepoint-with-cancel is a single atomic step,
> rather than a savepoint *followed* by a cancellation (else why would that
> be an option).
> Thanks again.
>
> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> Hi Vishal,
>>
>> yarn-cancel isn't only for YARN clusters; it works for all clusters.
>> It is the recommended command.
>>
>> Use the following command to issue a savepoint:
>>
>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>
>> Then issue yarn-cancel.
>> After that, follow the process to restore the savepoint.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> Hello Vijay,
>>>
>>> Thank you for the reply. This, though, is a k8s deployment (rather
>>> than YARN), but maybe they follow the same lifecycle. I issue a
>>> *savepoint with cancel* as documented here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>> a straight-up
>>>
>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>
>>> I would assume that after taking the savepoint the JVM should exit;
>>> after all, the k8s deployment is of kind: Job, and if it is a job
>>> cluster then a cancellation should exit the JVM and hence the pod.
>>> It does seem to do some things right: it stops a bunch of things (the
>>> JobMaster, the SlotPool, the ZooKeeper coordinator, etc.).
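For reference, the savepoint trigger above is asynchronous: the POST returns a request-id, which is then polled for completion via GET /jobs/:jobid/savepoints/:request-id. A minimal dry-run sketch of that flow, assuming placeholder host, job id, and target directory (the curl commands are printed rather than executed, so adapt before running for real):

```shell
# Dry-run sketch of the async savepoint-with-cancel flow against the
# Flink 1.7 REST API. Host, job id, and target directory are placeholders.
FLINK_HOST="http://localhost:8081"               # placeholder host:port
JOB_ID="00000000000000000000000000000000"
TARGET_DIR="hdfs://namenode:8020/tmp/savepoints" # placeholder HDFS path

# Build (not execute) the trigger request; the response carries a request-id.
savepoint_trigger_cmd() {
  printf '%s\n' "curl -s -H 'Content-Type: application/json' -X POST -d '{\"target-directory\":\"$TARGET_DIR\",\"cancel-job\":true}' $FLINK_HOST/jobs/$JOB_ID/savepoints"
}

# Build the status poll for a given request-id (IN_PROGRESS vs COMPLETED).
savepoint_status_cmd() {
  printf '%s\n' "curl -s $FLINK_HOST/jobs/$JOB_ID/savepoints/$1"
}

savepoint_trigger_cmd
savepoint_status_cmd "2691d9dd6aed93de1e38bbbb02fc6843"  # example request-id
```

Polling until COMPLETED before tearing anything down avoids racing the savepoint upload.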
>>> It also removes the checkpoint counter, but does not exit the job. And
>>> after a little while the job is restarted, which does not make sense
>>> and is absolutely not the right thing to do (to me at least).
>>>
>>> Further, if I delete the deployment and the job from k8s and restart
>>> the job and deployment fromSavePoint, it refuses to honor the
>>> fromSavePoint. I have to delete the ZK chroot for it to consider the
>>> savepoint.
>>>
>>> Thus the process of cancelling and resuming from a savepoint on a k8s
>>> job cluster deployment seems to be:
>>>
>>> - cancel with savepoint as defined here:
>>>   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>> - delete the job manager job and task manager deployments from k8s
>>>   almost immediately.
>>> - clear the ZK chroot for the 0000000...... job, and maybe the
>>>   checkpoints directory.
>>> - resume fromSavepoint.
>>>
>>> Can somebody confirm that this is indeed the process?
>>>
>>> Logs are attached.
>>>
>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>
>>> After a little bit:
>>>
>>> Starting the job-cluster
>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>> ..
>>> ..
>>> Regards.
>>>
>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> Savepoint-with-cancellation internally uses the /cancel REST API,
>>>> which is not a stable API; it always exits with a 404. The best way
>>>> to issue it is:
>>>>
>>>> a) First issue the savepoint REST API.
>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>    http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>> c) Then, when resuming your job, provide the savepoint path returned
>>>>    by (a) as an argument to the run-jar REST API.
>>>>
>>>> The above is the smoother way.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> There are some issues I see and would want to get some feedback.
>>>>>
>>>>> 1. On cancellation with savepoint with a target directory, the k8s
>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>> cancellation the JVM should exit, after cleanup etc., and thus the
>>>>> pod should too. That does not happen, and thus the job pod remains
>>>>> live. Is that expected?
>>>>>
>>>>> 2. To resume from a savepoint, it seems that I have to delete the
>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>> defaults to the latest checkpoint no matter what.
>>>>>
>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>> cancelling with a savepoint and resuming, and what the cogent story
>>>>> is around the job id (which defaults to 000000000000..). Note that
>>>>> --job-id does not work with 1.7.2, so even though that does not make
>>>>> sense, I still cannot provide a new job id.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Vishal.
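For anyone following along, Vijay's three-step flow (a/b/c in the quoted mail) can be sketched as dry-run shell commands. Everything here is a placeholder or assumption: host, job id, jar name, and savepoint path are made up, /yarn-cancel is the legacy endpoint from the linked archive thread, and the commands are printed rather than executed so they can be reviewed before use:

```shell
# Dry-run sketch of the savepoint -> cancel -> resume flow.
# FLINK_HOST and JOB_ID are placeholders; verify endpoints against your
# cluster's Flink version before running anything for real.
FLINK_HOST="http://localhost:8081"
JOB_ID="00000000000000000000000000000000"

# (a) trigger the savepoint WITHOUT cancelling the job:
step_a() {
  printf '%s\n' "curl -s -H 'Content-Type: application/json' -X POST -d '{\"target-directory\":\"hdfs://namenode:8020/tmp/savepoints\",\"cancel-job\":false}' $FLINK_HOST/jobs/$JOB_ID/savepoints"
}

# (b) cancel via the /yarn-cancel endpoint from the linked thread:
step_b() {
  printf '%s\n' "curl -s $FLINK_HOST/jobs/$JOB_ID/yarn-cancel"
}

# (c) resume, handing the savepoint path returned by (a) to the job:
step_c() {
  printf '%s\n' "flink run -s $1 my-job.jar"  # my-job.jar is a placeholder
}

step_a
step_b
step_c "hdfs://namenode:8020/tmp/savepoints/savepoint-000000-6d5bdc9b53ae"
```

Separating (a) and (b) like this sidesteps the flaky atomic cancel-with-savepoint, at the cost of a small window where the job keeps running after the savepoint completes.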
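The four-step k8s workaround Vishal lists earlier in the thread could likewise be sketched as a dry-run script. The kubectl resource names, the ZK connection string and chroot, and the entrypoint flag are all assumptions to adapt to your deployment; the function only prints the commands:

```shell
# Dry-run sketch of the k8s cancel-and-resume workaround described in the
# thread. All names below (flink-jobmanager, flink-taskmanager, the ZK
# chroot, the entrypoint invocation) are assumptions, not verified values.
JOB_ID="00000000000000000000000000000000"
SAVEPOINT="hdfs://namenode:8020/tmp/xyz1/savepoint-000000-6d5bdc9b53ae"  # placeholder

k8s_cancel_resume() {
  # step 1 is the cancel-with-savepoint REST call shown earlier in the thread;
  # step 2: tear down the jobmanager Job and taskmanager Deployment right away
  echo "kubectl delete job flink-jobmanager"
  echo "kubectl delete deployment flink-taskmanager"
  # step 3: clear the HA state for the fixed job id, otherwise ZK recovery
  # wins over the savepoint (rmr on ZK 3.4; deleteall on ZK 3.5+)
  echo "zkCli.sh -server zk:2181 rmr /k8s_anomalyecho/k8s_anomalyecho"
  # step 4: redeploy the job cluster, handing the savepoint to the entrypoint
  echo "standalone-job.sh start-foreground --fromSavepoint $SAVEPOINT"
}

k8s_cancel_resume
```

The ZK cleanup in step 3 is the part Vishal found necessary: with the fixed 000...0 job id, HA recovery otherwise restores from the latest checkpoint registered in ZooKeeper no matter what savepoint is passed.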