I must add that there has to be more love for k8s flink deployments. IMHO
that is the way to go. Maintaining a captive/session cluster, if you have
k8s on premise, is pretty much a no-go for various reasons.

On Tue, Mar 12, 2019 at 9:44 AM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> This is really not cool, but here you go. This seems to work. Agreed that
> it cannot be this painful. The cancel does not exit with an exit code of 0,
> and thus the job has to be deleted manually. Vijay, does this align with
> what you have had to do?
>
>
>    - Take a save point. This returns a request id.
>
>    curl  --header "Content-Type: application/json" --request POST --data 
> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'  
>   https://*************/jobs/00000000000000000000000000000000/savepoints
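>
>    If I am reading the 1.7 async savepoint API right, that POST comes back
>    with the request id as JSON, something like
>
>    {"request-id":"2c053ce3bea31276aa25e63784629687"}
>
>    and that id is what the status check in the next step polls for.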
>
>
>
>    - Make sure the save point succeeded
>
>    curl  --request GET   
> https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
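>
>    Once it is done, that GET should (if I have the async operation result
>    format right) return the status and the savepoint location, roughly
>
>    {"status":{"id":"COMPLETED"},"operation":{"location":"hdfs://nn-crunchy:8020/tmp/xyz14/savepoint-000000-1d4f71345e22"}}
>
>    and the location is what goes into --fromSavepoint later.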
>
>
>
>    - Cancel the job
>
>    curl  --request PATCH 
> https://***************/jobs/00000000000000000000000000000000?mode=cancel
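>
>    Since the cancel itself does not give a clean exit code, I just poll the
>    job status until it reports CANCELED (the state field, if I am reading
>    the jobs API right):
>
>    curl  --request GET https://***************/jobs/00000000000000000000000000000000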
>
>
>
>    - Delete the job and deployment
>
>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>    - Edit the job-cluster-job-deployment.yaml. Add/edit the args:
>
>    args: ["job-cluster",
>           "--fromSavepoint",
>           "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>           "--job-classname", .........
>
>
>
>    - Restart
>
>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>    - Make sure from the UI (or via the REST API, as sketched below) that it
>    restored from the specific save point.
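>
>    If you prefer the REST API over the UI, my understanding is that the
>    checkpoint statistics endpoint also reports what was restored, roughly
>
>    curl  --request GET https://*************/jobs/00000000000000000000000000000000/checkpoints
>
>    with the restored savepoint path under latest/restored/external_path.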
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com>
> wrote:
>
>> Yes, it's supposed to work. But unfortunately it was not working. The
>> Flink community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that a save point with cancel is a single atomic
>>> step, rather than a save point *followed* by a cancellation (else why
>>> would that be an option).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com>
>>> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> yarn-cancel isn't meant only for yarn clusters. It works for all
>>>> clusters. It's the recommended command.
>>>>
>>>> Use the following command to issue a save point:
>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>
>>>> Then issue yarn-cancel.
>>>> After that, follow the process to restore the save point.
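>>>>
>>>> For the yarn-cancel step, something like this should work (my
>>>> understanding is that the legacy endpoint is a plain GET):
>>>>
>>>>  curl  --request GET https://************.ingress.*******/jobs/00000000000000000000000000000000/yarn-cancel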
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>> vishal.santo...@gmail.com> wrote:
>>>>
>>>>> Hello Vijay,
>>>>>
>>>>>                Thank you for the reply. This though is a k8s deployment
>>>>> (rather than yarn), but maybe they follow the same lifecycle. I issue a
>>>>> *save point with cancel* as documented here
>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>> a straight up
>>>>>  curl  --header "Content-Type: application/json" --request POST --data
>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> I would assume that after taking the save point the jvm should exit;
>>>>> after all, the k8s deployment is of kind: job, and if it is a job
>>>>> cluster then a cancellation should exit the jvm and hence the pod. It
>>>>> does seem to do some things right. It stops a bunch of stuff (the
>>>>> JobMaster, the SlotPool, the zookeeper coordinator etc.). It also
>>>>> removes the checkpoint counter, but it does not exit the job. And after
>>>>> a little bit the job is restarted, which does not make sense and is
>>>>> absolutely not the right thing to do (to me at least).
>>>>>
>>>>> Further, if I delete the deployment and the job from k8s and restart
>>>>> the job and deployment with --fromSavepoint, it refuses to honor the
>>>>> --fromSavepoint. I have to delete the zk chroot for it to consider the
>>>>> save point.
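>>>>>
>>>>> For the record, clearing the chroot is just a zkCli delete, e.g. (the
>>>>> path is taken from our HA config and the logs below, adjust to yours):
>>>>>
>>>>>  zkCli.sh -server <zk-host>:2181
>>>>>  rmr /k8s_anomalyecho/k8s_anomalyecho
>>>>>
>>>>> (deleteall instead of rmr on newer ZooKeeper clients).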
>>>>>
>>>>>
>>>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>>>> cluster deployment seems to be:
>>>>>
>>>>>    - cancel with save point as defined here
>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>    - delete the job manager job and task manager deployments from k8s
>>>>>    almost immediately.
>>>>>    - clear the ZK chroot for the 0000000...... job and maybe the
>>>>>    checkpoints directory.
>>>>>    - resume with --fromSavepoint
>>>>>
>>>>> Can somebody confirm that this is indeed the process?
>>>>>
>>>>>
>>>>>
>>>>>  Logs are attached.
>>>>>
>>>>>
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>> Savepoint stored in
>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>> cancelling 00000000000000000000000000000000.
>>>>>
>>>>> 2019-03-12 08:10:43,871 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>> anomaly_echo (00000000000000000000000000000000) switched from state 
>>>>> RUNNING
>>>>> to CANCELLING.
>>>>>
>>>>> 2019-03-12 08:10:44,227 INFO  
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>     - Completed checkpoint 10 for job
>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>
>>>>> 2019-03-12 08:10:44,232 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>
>>>>> 2019-03-12 08:10:44,274 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,276 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
>>>>> anomaly_echo (00000000000000000000000000000000) switched from state
>>>>> CANCELLING to CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,276 INFO  
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>     - Stopping checkpoint coordinator for job
>>>>> 00000000000000000000000000000000.
>>>>>
>>>>> 2019-03-12 08:10:44,277 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>> - Shutting down
>>>>>
>>>>> 2019-03-12 08:10:44,323 INFO  
>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>       - Checkpoint with ID 8 at
>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>> discarded.
>>>>>
>>>>> 2019-03-12 08:10:44,437 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>> - Removing
>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>> from ZooKeeper
>>>>>
>>>>> 2019-03-12 08:10:44,437 INFO  
>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>       - Checkpoint with ID 10 at
>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not
>>>>> discarded.
>>>>>
>>>>> 2019-03-12 08:10:44,447 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Shutting down.
>>>>>
>>>>> 2019-03-12 08:10:44,447 INFO
>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from 
>>>>> ZooKeeper
>>>>>
>>>>> 2019-03-12 08:10:44,463 INFO
>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job
>>>>> 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>
>>>>> 2019-03-12 08:10:44,467 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>> Stopping the JobMaster for job
>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>
>>>>> 2019-03-12 08:10:44,468 INFO  
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>> application status CANCELED. Diagnostics null.
>>>>>
>>>>> 2019-03-12 08:10:44,468 INFO
>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>> Shutting down rest endpoint.
>>>>>
>>>>> 2019-03-12 08:10:44,473 INFO
>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>> /leader/resource_manager_lock.
>>>>>
>>>>> 2019-03-12 08:10:44,475 INFO
>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  - Close
>>>>> ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is
>>>>> shutting down..
>>>>>
>>>>> 2019-03-12 08:10:44,475 INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>> Suspending SlotPool.
>>>>>
>>>>> 2019-03-12 08:10:44,476 INFO
>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>> Stopping SlotPool.
>>>>>
>>>>> 2019-03-12 08:10:44,476 INFO
>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>>>>> Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>
>>>>> 2019-03-12 08:10:44,477 INFO
>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>
>>>>>
>>>>> After a little bit
>>>>>
>>>>>
>>>>> Starting the job-cluster
>>>>>
>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key
>>>>> `jobmanager.heap.size`
>>>>>
>>>>> Starting standalonejob as a console application on host
>>>>> anomalyecho-mmg6t.
>>>>>
>>>>> ..
>>>>>
>>>>> ..
>>>>>
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>
>>>>>> Hi Vishal
>>>>>>
>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>> which is not a stable API. It always exits with 404. The best way to
>>>>>> issue it is:
>>>>>>
>>>>>> a) First issue the save point REST API.
>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
>>>>>> ).
>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>> by (a) as an argument to the run jar REST API.
>>>>>> The above is the smoother way.
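>>>>>>
>>>>>> For (c), on a session cluster where the jar is already uploaded, my
>>>>>> understanding of the run-jar API is roughly as follows (the jar id and
>>>>>> paths here are just placeholders):
>>>>>>
>>>>>> curl -X POST \
>>>>>>   "http://<jobmanager>/jars/<jar-id>/run?savepointPath=hdfs://<nn>/tmp/sp/savepoint-xxx&allowNonRestoredState=false"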
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>
>>>>>>> There are some issues I see and would want to get some feedback.
>>>>>>>
>>>>>>> 1. On cancellation with savepoint with a target directory, the k8s
>>>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>>>> cancellation the jvm should exit, after cleanup etc., and thus the pod
>>>>>>> should too. That does not happen, and thus the job pod remains live.
>>>>>>> Is that expected?
>>>>>>>
>>>>>>> 2. To resume from a save point it seems that I have to delete the job
>>>>>>> id (0000000000....) from ZooKeeper (this is HA), else it defaults to
>>>>>>> the latest checkpoint no matter what.
>>>>>>>
>>>>>>>
>>>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>>>> cancelling with a save point and resuming, and what is the cogent
>>>>>>> story around job id (defaults to 000000000000..). Note that --job-id
>>>>>>> does not work with 1.7.2, so even though that does not make sense, I
>>>>>>> still can not provide a new job id.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Vishal.
>>>>>>>
>>>>>>>
