Yes, it's supposed to work, but unfortunately it was not working. The Flink community needs to respond to this behavior.
Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:

> Aah. Let me try this out and will get back to you.
> Though I would assume that savepoint-with-cancel is a single atomic step,
> rather than a savepoint *followed* by a cancellation (else why would that
> be an option).
> Thanks again.
>
> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>
>> Hi Vishal,
>>
>> yarn-cancel isn't only for YARN clusters; it works for all clusters.
>> It is the recommended command.
>>
>> Use the following command to issue a savepoint:
>>
>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>
>> Then issue yarn-cancel.
>> After that, follow the process to restore the savepoint.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> Hello Vijay,
>>>
>>> Thank you for the reply. This, though, is a k8s deployment (rather
>>> than YARN), but maybe they follow the same lifecycle. I issue a
>>> *savepoint with cancel* as documented here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>> a straight-up
>>>
>>> curl --header "Content-Type: application/json" --request POST --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>
>>> I would assume that after taking the savepoint the JVM should exit;
>>> after all, the k8s deployment is of kind: Job, and if it is a job
>>> cluster then a cancellation should exit the JVM and hence the pod.
>>> It does seem to do some things right: it stops a bunch of things (the
>>> JobMaster, the SlotPool, the ZooKeeper coordinator, etc.).
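For reference, the savepoint trigger above is asynchronous: the POST returns a request-id, which is then polled for completion via GET /jobs/:jobid/savepoints/:request-id. A minimal dry-run sketch of that flow, assuming placeholder host, job id, and target directory (the curl commands are printed rather than executed, so adapt before running for real):

```shell
# Dry-run sketch of the async savepoint-with-cancel flow against the
# Flink 1.7 REST API. Host, job id, and target directory are placeholders.
FLINK_HOST="http://localhost:8081"               # placeholder host:port
JOB_ID="00000000000000000000000000000000"
TARGET_DIR="hdfs://namenode:8020/tmp/savepoints" # placeholder HDFS path

# Build (not execute) the trigger request; the response carries a request-id.
savepoint_trigger_cmd() {
  printf '%s\n' "curl -s -H 'Content-Type: application/json' -X POST -d '{\"target-directory\":\"$TARGET_DIR\",\"cancel-job\":true}' $FLINK_HOST/jobs/$JOB_ID/savepoints"
}

# Build the status poll for a given request-id (IN_PROGRESS vs COMPLETED).
savepoint_status_cmd() {
  printf '%s\n' "curl -s $FLINK_HOST/jobs/$JOB_ID/savepoints/$1"
}

savepoint_trigger_cmd
savepoint_status_cmd "2691d9dd6aed93de1e38bbbb02fc6843"  # example request-id
```

Polling until COMPLETED before tearing anything down avoids racing the savepoint upload.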
>>> It also removes the checkpoint counter, but does not exit the job. And
>>> after a little while the job is restarted, which does not make sense
>>> and is absolutely not the right thing to do (to me at least).
>>>
>>> Further, if I delete the deployment and the job from k8s and restart
>>> the job and deployment fromSavePoint, it refuses to honor the
>>> fromSavePoint. I have to delete the ZK chroot for it to consider the
>>> savepoint.
>>>
>>> Thus the process of cancelling and resuming from a savepoint on a k8s
>>> job cluster deployment seems to be:
>>>
>>> - cancel with savepoint as defined here:
>>>   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>> - delete the job manager job and task manager deployments from k8s
>>>   almost immediately.
>>> - clear the ZK chroot for the 0000000...... job, and maybe the
>>>   checkpoints directory.
>>> - resume fromSavepoint.
>>>
>>> Can somebody confirm that this is indeed the process?
>>>
>>> Logs are attached.
>>>
>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>
>>> After a little bit:
>>>
>>> Starting the job-cluster
>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>> ..
>>> ..
>>> Regards.
>>>
>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> Savepoint-with-cancellation internally uses the /cancel REST API,
>>>> which is not a stable API; it always exits with a 404. The best way
>>>> to issue it is:
>>>>
>>>> a) First issue the savepoint REST API.
>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>    http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E).
>>>> c) Then, when resuming your job, provide the savepoint path returned
>>>>    by (a) as an argument to the run-jar REST API.
>>>>
>>>> The above is the smoother way.
>>>>
>>>> Regards
>>>> Bhaskar
>>>>
>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> There are some issues I see and would want to get some feedback.
>>>>>
>>>>> 1. On cancellation with savepoint with a target directory, the k8s
>>>>> job does not exit (it is not a deployment). I would assume that on
>>>>> cancellation the JVM should exit, after cleanup etc., and thus the
>>>>> pod should too. That does not happen, and thus the job pod remains
>>>>> live. Is that expected?
>>>>>
>>>>> 2. To resume from a savepoint, it seems that I have to delete the
>>>>> job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>> defaults to the latest checkpoint no matter what.
>>>>>
>>>>> I am kind of curious as to what in 1.7.2 is the tested process of
>>>>> cancelling with a savepoint and resuming, and what the cogent story
>>>>> is around the job id (which defaults to 000000000000..). Note that
>>>>> --job-id does not work with 1.7.2, so even though that does not make
>>>>> sense, I still cannot provide a new job id.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Vishal.
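For anyone following along, Vijay's three-step flow (a/b/c in the quoted mail) can be sketched as dry-run shell commands. Everything here is a placeholder or assumption: host, job id, jar name, and savepoint path are made up, /yarn-cancel is the legacy endpoint from the linked archive thread, and the commands are printed rather than executed so they can be reviewed before use:

```shell
# Dry-run sketch of the savepoint -> cancel -> resume flow.
# FLINK_HOST and JOB_ID are placeholders; verify endpoints against your
# cluster's Flink version before running anything for real.
FLINK_HOST="http://localhost:8081"
JOB_ID="00000000000000000000000000000000"

# (a) trigger the savepoint WITHOUT cancelling the job:
step_a() {
  printf '%s\n' "curl -s -H 'Content-Type: application/json' -X POST -d '{\"target-directory\":\"hdfs://namenode:8020/tmp/savepoints\",\"cancel-job\":false}' $FLINK_HOST/jobs/$JOB_ID/savepoints"
}

# (b) cancel via the /yarn-cancel endpoint from the linked thread:
step_b() {
  printf '%s\n' "curl -s $FLINK_HOST/jobs/$JOB_ID/yarn-cancel"
}

# (c) resume, handing the savepoint path returned by (a) to the job:
step_c() {
  printf '%s\n' "flink run -s $1 my-job.jar"  # my-job.jar is a placeholder
}

step_a
step_b
step_c "hdfs://namenode:8020/tmp/savepoints/savepoint-000000-6d5bdc9b53ae"
```

Separating (a) and (b) like this sidesteps the flaky atomic cancel-with-savepoint, at the cost of a small window where the job keeps running after the savepoint completes.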
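The four-step k8s workaround Vishal lists earlier in the thread could likewise be sketched as a dry-run script. The kubectl resource names, the ZK connection string and chroot, and the entrypoint flag are all assumptions to adapt to your deployment; the function only prints the commands:

```shell
# Dry-run sketch of the k8s cancel-and-resume workaround described in the
# thread. All names below (flink-jobmanager, flink-taskmanager, the ZK
# chroot, the entrypoint invocation) are assumptions, not verified values.
JOB_ID="00000000000000000000000000000000"
SAVEPOINT="hdfs://namenode:8020/tmp/xyz1/savepoint-000000-6d5bdc9b53ae"  # placeholder

k8s_cancel_resume() {
  # step 1 is the cancel-with-savepoint REST call shown earlier in the thread;
  # step 2: tear down the jobmanager Job and taskmanager Deployment right away
  echo "kubectl delete job flink-jobmanager"
  echo "kubectl delete deployment flink-taskmanager"
  # step 3: clear the HA state for the fixed job id, otherwise ZK recovery
  # wins over the savepoint (rmr on ZK 3.4; deleteall on ZK 3.5+)
  echo "zkCli.sh -server zk:2181 rmr /k8s_anomalyecho/k8s_anomalyecho"
  # step 4: redeploy the job cluster, handing the savepoint to the entrypoint
  echo "standalone-job.sh start-foreground --fromSavepoint $SAVEPOINT"
}

k8s_cancel_resume
```

The ZK cleanup in step 3 is the part Vishal found necessary: with the fixed 000...0 job id, HA recovery otherwise restores from the latest checkpoint registered in ZooKeeper no matter what savepoint is passed.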