Thanks Vijay. This is the larger issue: the cancellation routine is itself broken.
On cancellation flink does remove the checkpoint counter

*2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper*

but exits with a non-zero code

*2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.*

That, I think, is an issue. A cancelled job is a complete job, and thus the exit code should be 0 for k8s to mark it complete.

On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:

> Yes Vishal. That's correct.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> This is really not cool, but here you go. This seems to work. Agreed
>> that it cannot be this painful. The cancel does not exit with an exit
>> code of 0, and thus the job has to be deleted manually. Vijay, does this
>> align with what you have had to do?
>>
>> - Take a save point. This returns a request id:
>>
>> curl --header "Content-Type: application/json" --request POST --data
>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>> https://*************/jobs/00000000000000000000000000000000/savepoints
>>
>> - Make sure the save point succeeded:
>>
>> curl --request GET
>> https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>
>> - Cancel the job:
>>
>> curl --request PATCH
>> https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>
>> - Delete the job and deployment:
>>
>> kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>> kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>> - Edit the job-cluster-job-deployment.yaml.
>> Add/Edit:
>>
>> args: ["job-cluster",
>> "--fromSavepoint",
>> "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>> "--job-classname", .........
>>
>> - Restart:
>>
>> kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>> kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>> - Make sure from the UI that it restored from the specific save point.
>>
>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>
>>> Yes, it's supposed to work. But unfortunately it was not working. The
>>> Flink community needs to respond to this behavior.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>
>>>> Aah.
>>>> Let me try this out and will get back to you.
>>>> Though I would assume that save point with cancel is a single atomic
>>>> step, rather than a save point *followed* by a cancellation (else why
>>>> would that be an option).
>>>> Thanks again.
>>>>
>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> yarn-cancel isn't meant only for a YARN cluster. It works for all
>>>>> clusters. It's the recommended command.
>>>>>
>>>>> Use the following command to issue a save point:
>>>>>
>>>>> curl --header "Content-Type: application/json" --request POST --data
>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>
>>>>> Then issue yarn-cancel.
>>>>> After that, follow the process to restore the save point.
>>>>>
>>>>> Regards
>>>>> Bhaskar
>>>>>
>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> Hello Vijay,
>>>>>>
>>>>>> Thank you for the reply. This though is a k8s deployment
>>>>>> (rather than YARN), but maybe they follow the same lifecycle.
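The savepoint-then-cancel sequence discussed in the thread can be sketched as a small shell script. This is a hedged sketch, not the exact commands used above: `FLINK` is a placeholder for the (redacted) REST endpoint, and the trigger id in step 2 is illustrative. With `DRY_RUN=1` (the default here) it only prints the requests instead of sending them:

```shell
# Sketch of the savepoint -> verify -> cancel flow. FLINK and the trigger
# id are placeholders; the real endpoint is redacted in the thread.
# DRY_RUN=1 (default) prints the curl invocations instead of running them.
FLINK="${FLINK:-https://flink.example.com}"
JOB="00000000000000000000000000000000"
DIR="hdfs://nn-crunchy:8020/tmp/xyz14"

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "curl $*"; else curl "$@"; fi; }

# 1. Trigger the savepoint; the JSON response carries a request (trigger) id.
run --header "Content-Type: application/json" --request POST \
    --data "{\"target-directory\":\"$DIR\",\"cancel-job\":false}" \
    "$FLINK/jobs/$JOB/savepoints"

# 2. Poll that trigger id until the savepoint status is COMPLETED
#    (the id below is an illustrative placeholder).
run --request GET "$FLINK/jobs/$JOB/savepoints/2c053ce3bea31276aa25e63784629687"

# 3. Only after the savepoint is confirmed, cancel the job.
run --request PATCH "$FLINK/jobs/$JOB?mode=cancel"
```

Keeping the savepoint and the cancel as two separate calls is exactly the workaround the thread converges on, since savepoint-with-cancel in one request was not behaving atomically.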
>>>>>> I issue a *save point with cancel* as documented here
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>> a straight up
>>>>>>
>>>>>> curl --header "Content-Type: application/json" --request POST --data
>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>>> https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>
>>>>>> I would assume that after taking the save point, the jvm should exit;
>>>>>> after all, the k8s deployment is of kind: job, and if it is a job
>>>>>> cluster then a cancellation should exit the jvm and hence the pod. It
>>>>>> does seem to do some things right. It stops a bunch of stuff (the
>>>>>> JobMaster, the SlotPool, the zookeeper coordinator etc.). It also
>>>>>> removes the checkpoint counter but does not exit the job. And after a
>>>>>> little bit the job is restarted, which does not make sense and is
>>>>>> absolutely not the right thing to do (to me at least).
>>>>>>
>>>>>> Further, if I delete the deployment and the job from k8s and restart
>>>>>> the job and deployment fromSavePoint, it refuses to honor the
>>>>>> fromSavePoint. I have to delete the zk chroot for it to consider the
>>>>>> save point.
>>>>>>
>>>>>> Thus the process of cancelling and resuming from a save point on a
>>>>>> k8s job cluster deployment seems to be:
>>>>>>
>>>>>> - cancel with save point as defined here
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>> - delete the job manager job and task manager deployments from k8s
>>>>>> almost immediately.
>>>>>> - clear the ZK chroot for the 0000000...... job and maybe the
>>>>>> checkpoints directory.
>>>>>> - resumeFromCheckPoint
>>>>>>
>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>
>>>>>> Logs are attached.
>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>
>>>>>> After a little bit:
>>>>>>
>>>>>> Starting the job-cluster
>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>> ..
>>>>>> ..
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vishal
>>>>>>>
>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>> which is not a stable API. It always exits with 404.
>>>>>>> The best way to issue it is:
>>>>>>>
>>>>>>> a) First issue the save point REST API.
>>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>>>>> c) Then, when resuming your job, provide the save point path returned
>>>>>>> by (a) as an argument to the run jar REST API.
>>>>>>>
>>>>>>> The above is the smoother way.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> There are some issues I see and would want to get some feedback.
>>>>>>>>
>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory, the k8s
>>>>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>>>>> cancellation the jvm should exit, after cleanup etc., and thus the
>>>>>>>> pod should too. That does not happen, and thus the job pod remains
>>>>>>>> live. Is that expected?
>>>>>>>>
>>>>>>>> 2. To resume from a save point, it seems that I have to delete the
>>>>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>
>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>>>>> cancelling with a save point and resuming, and what is the cogent
>>>>>>>> story around the job id (defaults to 000000000000..). Note that
>>>>>>>> --job-id does not work with 1.7.2, so even though that does not
>>>>>>>> make sense, I still can not provide a new job id.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Vishal.
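The k8s teardown-and-resume half of the procedure can likewise be sketched as a shell script. This is a hedged sketch, not the thread's verbatim commands: the manifest paths are taken from the thread, the savepoint handle's host is a placeholder (it is redacted above, consistent with the earlier nn-crunchy paths), and `DRY_RUN=1` (the default) prints the commands instead of executing them:

```shell
# Sketch: after the savepoint completes and the job is cancelled, delete
# the k8s job/deployment and recreate the job-cluster with --fromSavepoint.
# Savepoint host is a placeholder; DRY_RUN=1 (default) prints the commands.
SAVEPOINT="hdfs://nn-crunchy:8020/tmp/xyz14/savepoint-000000-1d4f71345e22"
JM="manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml"
TM="manifests/bf2-PRODUCTION/task-manager-deployment.yaml"

k8s() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "kubectl $*"; else kubectl "$@"; fi; }

# Tear down the cancelled job and its task managers (kind: job does not
# exit on its own, per the thread, so this must be done by hand).
k8s delete -f "$JM"
k8s delete -f "$TM"

# Edit $JM so the container args include the savepoint, e.g.:
#   args: ["job-cluster", "--fromSavepoint", "<savepoint path>", "--job-classname", ...]
echo "edit $JM: add --fromSavepoint $SAVEPOINT"

# Recreate; the UI should then show the job restored from $SAVEPOINT.
k8s create -f "$JM"
k8s create -f "$TM"
```

Per the thread, with ZooKeeper HA the chroot for the all-zeros job id may also need to be cleared first, or the job resumes from the latest checkpoint and ignores `--fromSavepoint`.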