The RC artifacts are only deployed to the Maven Central Repository when the
RC is promoted to a release. As written in the 1.8.0 RC1 voting email [1],
you can find the Maven artifacts and the Flink binaries here:

    - https://repository.apache.org/content/repositories/orgapacheflink-1210/
    - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself, and build Flink 1.7 from
sources [2]. On my machine this takes around 10 minutes if tests are
skipped.
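
For anyone who wants to test the RC from a build, a minimal sketch (the
artifact coordinates below are just an example; any 1.8.0 artifact from the
staging repository works the same way):

    # Resolve one RC artifact straight from the staging repository above
    # (RC artifacts are not on Maven Central until the release is promoted).
    mvn dependency:get \
      -DremoteRepositories=https://repository.apache.org/content/repositories/orgapacheflink-1210/ \
      -Dartifact=org.apache.flink:flink-core:1.8.0

    # Or build from sources as described in [2]; skipping tests is what
    # brings the build down to ~10 minutes:
    mvn clean install -DskipTests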

Best,
Gary

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink

On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> Do you have a mvn repository (at mvn central) set up for the 1.8 release
> candidate? We could test it for you.
>
> Without 1.8 and this exit code, we are essentially held up.
>
> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:
>
>> Nobody can tell with 100% certainty. We want to give the RC some exposure
>> first, and there is also a release process that is prescribed by the ASF [1].
>> You can look at past releases to get a feeling for how long the release
>> process lasts [2].
>>
>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>
>>
>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> And when is the 1.8.0 release expected?
>>>
>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
>>>> :) That makes so much more sense. Is k8s-native Flink part of this
>>>> release?
>>>>
>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> This issue was fixed recently [1], and the patch will be released with
>>>>> 1.8. If the Flink job gets cancelled, the JVM should exit with code 0.
>>>>> There is a release candidate [2], which you can test.
>>>>>
>>>>> Best,
>>>>> Gary
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>>> [2]
>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>>
>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>>> vishal.santo...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Vijay,
>>>>>>
>>>>>> This is the larger issue.  The cancellation routine is itself broken.
>>>>>>
>>>>>> On cancellation, Flink does remove the checkpoint counter
>>>>>>
>>>>>> *2019-03-12 14:12:13,143
>>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>>> ZooKeeper *
>>>>>>
>>>>>> but exits with a non-zero code
>>>>>>
>>>>>> *2019-03-12 14:12:13,477
>>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint
>>>>>> with exit code 1444.*
>>>>>>
>>>>>>
>>>>>> That I think is an issue. A cancelled job is a complete job and thus
>>>>>> the exit code should be 0 for k8s to mark it complete.
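>>>>>>
>>>>>> A quick way to see what k8s recorded, assuming the job-cluster pod carries
>>>>>> the label job-name=anomalyecho (a hypothetical label; k8s Jobs add
>>>>>> job-name=<job name> to their pods automatically):
>>>>>>
>>>>>>    kubectl get pod -l job-name=anomalyecho -o \
>>>>>>      jsonpath='{.items[0].status.containerStatuses[0].state.terminated.exitCode}'
>>>>>>
>>>>>> Since the manifest is kind: Job, any non-zero exit code makes k8s count the
>>>>>> pod as failed and retry it instead of marking the Job complete.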
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes Vishal, that's correct.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>>> that it should not be this painful. The cancel does not exit with an exit
>>>>>>>> code of 0, and thus the job has to be deleted manually. Vijay, does this
>>>>>>>> align with what you have had to do?
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Take a save point. This returns a request id.
>>>>>>>>
>>>>>>>>    curl  --header "Content-Type: application/json" --request POST 
>>>>>>>> --data 
>>>>>>>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>>>>>>>>     
>>>>>>>> https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Make sure the save point succeeded
>>>>>>>>
>>>>>>>>    curl  --request GET   
>>>>>>>> https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - cancel the job
>>>>>>>>
>>>>>>>>    curl  --request PATCH 
>>>>>>>> https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Delete the job and deployment
>>>>>>>>
>>>>>>>>    kubectl delete -f 
>>>>>>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>
>>>>>>>>    kubectl delete -f 
>>>>>>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>>>
>>>>>>>>    args: ["job-cluster",
>>>>>>>>
>>>>>>>>                   "--fromSavepoint",
>>>>>>>>
>>>>>>>>                   
>>>>>>>> "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>>                   "--job-classname", .........
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Restart
>>>>>>>>
>>>>>>>>    kubectl create -f 
>>>>>>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>>
>>>>>>>>    kubectl create -f 
>>>>>>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>    - Make sure from the UI, that it restored from the specific
>>>>>>>>    save point.
>>>>>>>>
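>>>>>>>> For the "make sure the save point succeeded" step, a sketch of the polling
>>>>>>>> (the host placeholder and the use of jq are assumptions; the request id
>>>>>>>> comes back from the POST in the first step):
>>>>>>>>
>>>>>>>>    REQ=$(curl -s --header "Content-Type: application/json" --request POST \
>>>>>>>>      --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' \
>>>>>>>>      https://<host>/jobs/00000000000000000000000000000000/savepoints \
>>>>>>>>      | jq -r '."request-id"')
>>>>>>>>
>>>>>>>>    # Repeat until status.id reports COMPLETED; operation.location then
>>>>>>>>    # holds the savepoint path to pass to --fromSavepoint.
>>>>>>>>    curl -s https://<host>/jobs/00000000000000000000000000000000/savepoints/$REQ \
>>>>>>>>      | jq -r '.status.id, .operation.location'
>>>>>>>>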
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, it's supposed to work, but unfortunately it was not working.
>>>>>>>>> The Flink community needs to respond to this behavior.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Aah.
>>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>>>> atomic step, rather than a save point *followed* by a
>>>>>>>>>> cancellation (else why would that be an option).
>>>>>>>>>> Thanks again.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vishal,
>>>>>>>>>>>
>>>>>>>>>>> yarn-cancel is not meant only for YARN clusters. It works for
>>>>>>>>>>> all clusters. It is the recommended command.
>>>>>>>>>>>
>>>>>>>>>>> Use the following command to issue the save point:
>>>>>>>>>>>
>>>>>>>>>>>  curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>>>    --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>>>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>
>>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>>> After that, follow the process to restore the save point.
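>>>>>>>>>>>
>>>>>>>>>>> A sketch of the yarn-cancel call against the same endpoint (the GET verb
>>>>>>>>>>> and path follow the mail-archive workaround quoted further down in this
>>>>>>>>>>> thread):
>>>>>>>>>>>
>>>>>>>>>>>  curl --request GET \
>>>>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/yarn-cancel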
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>>
>>>>>>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>>>>>>> deployment (rather than YARN), but maybe they follow the same
>>>>>>>>>>>> lifecycle. I issue a *save point with cancel* as documented here
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>>> a straight up
>>>>>>>>>>>>  curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>>>>    --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>>
>>>>>>>>>>>> I would assume that after taking the save point, the JVM should
>>>>>>>>>>>> exit; after all, the k8s deployment is of kind: job, and if it is a
>>>>>>>>>>>> job cluster then a cancellation should exit the JVM and hence the pod.
>>>>>>>>>>>> It does seem to do some things right. It stops a bunch of stuff (the
>>>>>>>>>>>> JobMaster, the SlotPool, the ZooKeeper coordinator, etc.). It also
>>>>>>>>>>>> removes the checkpoint counter but does not exit the job. And after a
>>>>>>>>>>>> little bit the job is restarted, which does not make sense and is
>>>>>>>>>>>> absolutely not the right thing to do (to me at least).
>>>>>>>>>>>>
>>>>>>>>>>>> Further, if I delete the deployment and the job from k8s and
>>>>>>>>>>>> restart the job and deployment fromSavepoint, it refuses to honor
>>>>>>>>>>>> the fromSavepoint. I have to delete the ZK chroot for it to consider
>>>>>>>>>>>> the save point.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s
>>>>>>>>>>>> job cluster deployment seems to be:
>>>>>>>>>>>>
>>>>>>>>>>>>    - cancel with save point as defined here
>>>>>>>>>>>>    
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>>>    - delete the job manager job and task manager deployments
>>>>>>>>>>>>    from k8s almost immediately.
>>>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job  and may be
>>>>>>>>>>>>    the checkpoints directory.
>>>>>>>>>>>>    - resumeFromCheckPoint
>>>>>>>>>>>>
>>>>>>>>>>>> Can somebody confirm that this is indeed the process?
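>>>>>>>>>>>>
>>>>>>>>>>>> For the ZK step, a sketch with the stock ZooKeeper 3.4 CLI (the server
>>>>>>>>>>>> address is a placeholder; the chroot path is the one visible in the logs
>>>>>>>>>>>> below, and newer CLIs spell rmr as deleteall):
>>>>>>>>>>>>
>>>>>>>>>>>>  # recursively remove the HA state kept under the chroot
>>>>>>>>>>>>  bin/zkCli.sh -server <zk-host>:2181 rmr /k8s_anomalyecho/k8s_anomalyecho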
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  Logs are attached.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>> Savepoint stored in
>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from 
>>>>>>>>>>>> state
>>>>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging 
>>>>>>>>>>>> Sink (1/1)
>>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to 
>>>>>>>>>>>> CANCELING.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging 
>>>>>>>>>>>> Sink (1/1)
>>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to 
>>>>>>>>>>>> CANCELED.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from 
>>>>>>>>>>>> state
>>>>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>> - Shutting down
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>>>> discarded.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>>> - Removing
>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' 
>>>>>>>>>>>> not
>>>>>>>>>>>> discarded.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>> - Shutting down.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>>> - Removing /checkpoint-counter/00000000000000000000000000000000
>>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal 
>>>>>>>>>>>> state
>>>>>>>>>>>> CANCELED.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  
>>>>>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>>>>>> Shutting down rest endpoint.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>>>>> JobManager is shutting down..
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>>> Suspending SlotPool.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>>> Stopping SlotPool.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>>>
>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After a little bit
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>>>
>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with
>>>>>>>>>>>> key `jobmanager.heap.size`
>>>>>>>>>>>>
>>>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>>>
>>>>>>>>>>>> ..
>>>>>>>>>>>>
>>>>>>>>>>>> ..
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>>
>>>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST
>>>>>>>>>>>>> API, which is not a stable API. It always exits with 404. The best
>>>>>>>>>>>>> way to issue it is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> a) First issue the save point REST API.
>>>>>>>>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
>>>>>>>>>>>>> ).
>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>>>> returned by (a) as an argument to the run jar REST API.
>>>>>>>>>>>>> The above is the smoother way.
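>>>>>>>>>>>>>
>>>>>>>>>>>>> A sketch of step (c), assuming the job jar was uploaded through the REST
>>>>>>>>>>>>> API first (the jar id and savepoint path are placeholders):
>>>>>>>>>>>>>
>>>>>>>>>>>>>  curl --request POST \
>>>>>>>>>>>>>    "https://<host>/jars/<jar-id>/run?savepointPath=hdfs://<nn>:8020/tmp/xyz1/savepoint-000000-xxxx"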
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are some issues I see, and I would like to get some
>>>>>>>>>>>>>> feedback:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. On cancellation with save point with a target directory,
>>>>>>>>>>>>>> the k8s job does not exit (it is not a deployment). I would assume
>>>>>>>>>>>>>> that on cancellation the JVM should exit, after cleanup etc., and
>>>>>>>>>>>>>> thus the pod should too. That does not happen, and thus the job pod
>>>>>>>>>>>>>> remains live. Is that expected?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete
>>>>>>>>>>>>>> the job id ( 0000000000.... ) from ZooKeeper (this is HA), else it
>>>>>>>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>>> process of cancelling with a save point and resuming, and what
>>>>>>>>>>>>>> is the cogent story around the job id ( defaults to 000000000000.. ).
>>>>>>>>>>>>>> Note that --job-id does not work with 1.7.2, so even though that
>>>>>>>>>>>>>> does not make sense, I still cannot provide a new job id.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
