I confirm that 1.8.0 fixes all of the above issues. The JM process exits with code 0 and the pod terminates ( TERMINATED state ). The above is true for both the PATCH cancel and the POST save point with cancel, as above.
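For reference, the two cancellation paths confirmed above can be sketched as a small dry-run script. The endpoint, job id and savepoint directory are placeholders (the real hosts are starred out in this thread), and the script only prints each request instead of issuing it, so it can be run without a cluster:

```shell
#!/bin/sh
# Hosts are redacted in the thread, so the endpoint and paths below are
# placeholders (assumptions), not the poster's real values.
FLINK_JM="http://localhost:8081"            # JobManager REST endpoint
JOB_ID="00000000000000000000000000000000"   # default job-cluster job id

run() { echo "$@"; }                        # dry-run: print, don't execute

# Path 1: plain cancel via PATCH; with 1.8.0 the JM then exits with code 0.
run curl -X PATCH "${FLINK_JM}/jobs/${JOB_ID}?mode=cancel"

# Path 2: savepoint with cancel via POST; the savepoint is taken first,
# then the job is cancelled and the JM exits with code 0.
run curl -H "Content-Type: application/json" -X POST \
  -d '{"target-directory":"hdfs:///tmp/savepoints","cancel-job":true}' \
  "${FLINK_JM}/jobs/${JOB_ID}/savepoints"
```

Swapping `run` for `eval "$@"` would execute the requests for real against an actual JobManager endpoint.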
Thank you for fixing this issue. On Wed, Mar 13, 2019 at 10:17 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote: > BTW, does 1.8 also solve the issue where we can cancel with a save point? > That too is broken in 1.7.2 > > curl --header "Content-Type: application/json" --request POST --data > '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":*true*}' > https://*************/jobs/00000000000000000000000000000000/savepoints > > > On Tue, Mar 12, 2019 at 11:55 AM Vishal Santoshi < > vishal.santo...@gmail.com> wrote: > >> Awesome, thanks! >> >> On Tue, Mar 12, 2019 at 11:53 AM Gary Yao <g...@ververica.com> wrote: >> >>> The RC artifacts are only deployed to the Maven Central Repository when >>> the RC >>> is promoted to a release. As written in the 1.8.0 RC1 voting email [1], >>> you >>> can find the maven artifacts and the Flink binaries here: >>> >>> - >>> https://repository.apache.org/content/repositories/orgapacheflink-1210/ >>> - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/ >>> >>> Alternatively, you can apply the patch yourself and build Flink 1.7 from >>> sources [2]. On my machine this takes around 10 minutes if tests are >>> skipped. >>> >>> Best, >>> Gary >>> >>> [1] >>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html >>> [2] >>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink >>> >>> On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi < >>> vishal.santo...@gmail.com> wrote: >>> >>>> Do you have an mvn repository ( at mvn central ) set up for the 1.8 release >>>> candidate? We could test it for you. >>>> >>>> Without 1.8 and this exit code we are essentially held up. >>>> >>>> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote: >>>> >>>>> Nobody can tell with 100% certainty. 
We want to give the RC some >>>>> exposure >>>>> first, and there is also a release process that is prescribed by the >>>>> ASF [1]. >>>>> You can look at past releases to get a feeling for how long the release >>>>> process lasts [2]. >>>>> >>>>> [1] http://www.apache.org/legal/release-policy.html#release-approval >>>>> [2] >>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0 >>>>> >>>>> >>>>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi < >>>>> vishal.santo...@gmail.com> wrote: >>>>> >>>>>> And when is the 1.8.0 release expected ? >>>>>> >>>>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi < >>>>>> vishal.santo...@gmail.com> wrote: >>>>>> >>>>>>> :) That makes so much more sense. Is k8s native flink a part of >>>>>>> this release ? >>>>>>> >>>>>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Vishal, >>>>>>>> >>>>>>>> This issue was fixed recently [1], and the patch will be released >>>>>>>> with 1.8. If >>>>>>>> the Flink job gets cancelled, the JVM should exit with code 0. >>>>>>>> There is a >>>>>>>> release candidate [2], which you can test. >>>>>>>> >>>>>>>> Best, >>>>>>>> Gary >>>>>>>> >>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743 >>>>>>>> [2] >>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html >>>>>>>> >>>>>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi < >>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Thanks Vijay, >>>>>>>>> >>>>>>>>> This is the larger issue. The cancellation routine is itself >>>>>>>>> broken. 
>>>>>>>>> >>>>>>>>> On cancellation Flink does remove the checkpoint counter >>>>>>>>> >>>>>>>>> *2019-03-12 14:12:13,143 >>>>>>>>> INFO >>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - >>>>>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from >>>>>>>>> ZooKeeper * >>>>>>>>> >>>>>>>>> but exits with a non-zero code >>>>>>>>> >>>>>>>>> *2019-03-12 14:12:13,477 >>>>>>>>> INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - >>>>>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint >>>>>>>>> with >>>>>>>>> exit code 1444.* >>>>>>>>> >>>>>>>>> >>>>>>>>> That I think is an issue. A cancelled job is a complete job and >>>>>>>>> thus the exit code should be 0 for k8s to mark it complete. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar < >>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Yes Vishal. That's correct. >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> Bhaskar >>>>>>>>>> >>>>>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi < >>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> This is really not cool, but here you go. This seems to work. Agreed >>>>>>>>>>> that this cannot be this painful. The cancel does not exit with an >>>>>>>>>>> exit >>>>>>>>>>> code of 0 and thus the job has to be deleted manually. Vijay, does this >>>>>>>>>>> align >>>>>>>>>>> with what you have had to do ? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Take a save point. 
This returns a request id >>>>>>>>>>> >>>>>>>>>>> curl --header "Content-Type: application/json" --request POST >>>>>>>>>>> --data >>>>>>>>>>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' >>>>>>>>>>> >>>>>>>>>>> https://*************/jobs/00000000000000000000000000000000/savepoints >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Make sure the save point succeeded >>>>>>>>>>> >>>>>>>>>>> curl --request GET >>>>>>>>>>> https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - cancel the job >>>>>>>>>>> >>>>>>>>>>> curl --request PATCH >>>>>>>>>>> https://***************/jobs/00000000000000000000000000000000?mode=cancel >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Delete the job and deployment >>>>>>>>>>> >>>>>>>>>>> kubectl delete -f >>>>>>>>>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> kubectl delete -f >>>>>>>>>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Edit the job-cluster-job-deployment.yaml. Add/Edit >>>>>>>>>>> >>>>>>>>>>> args: ["job-cluster", >>>>>>>>>>> >>>>>>>>>>> "--fromSavepoint", >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22", >>>>>>>>>>> "--job-classname", ......... >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Restart >>>>>>>>>>> >>>>>>>>>>> kubectl create -f >>>>>>>>>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> kubectl create -f >>>>>>>>>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> - Make sure, from the UI, that it restored from the specific >>>>>>>>>>> save point. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar < >>>>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Yes, it's supposed to work. 
But unfortunately it was not >>>>>>>>>>>> working. The Flink community needs to respond to this behavior. >>>>>>>>>>>> >>>>>>>>>>>> Regards >>>>>>>>>>>> Bhaskar >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi < >>>>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Aah. >>>>>>>>>>>>> Let me try this out and will get back to you. >>>>>>>>>>>>> Though I would assume that save point with cancel is a single >>>>>>>>>>>>> atomic step, rather than a save point *followed* by a >>>>>>>>>>>>> cancellation ( else why would that be an option ). >>>>>>>>>>>>> Thanks again. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar < >>>>>>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Vishal, >>>>>>>>>>>>>> >>>>>>>>>>>>>> yarn-cancel isn't meant only for yarn clusters. It works for >>>>>>>>>>>>>> all clusters. It's the recommended command. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Use the following command to issue a save point. >>>>>>>>>>>>>> curl --header "Content-Type: application/json" --request >>>>>>>>>>>>>> POST --data >>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1", >>>>>>>>>>>>>> "cancel-job":false}' \ https:// >>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints >>>>>>>>>>>>>> >>>>>>>>>>>>>> Then issue yarn-cancel. >>>>>>>>>>>>>> After that, follow the process to restore the save point. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards >>>>>>>>>>>>>> Bhaskar >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi < >>>>>>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hello Vijay, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank you for the reply. This though is a k8s >>>>>>>>>>>>>>> deployment ( rather than yarn ), but maybe they follow the same >>>>>>>>>>>>>>> lifecycle. 
>>>>>>>>>>>>>>> I issue a *save point with cancel* as documented here >>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints, >>>>>>>>>>>>>>> a straight up >>>>>>>>>>>>>>> curl --header "Content-Type: application/json" --request >>>>>>>>>>>>>>> POST --data >>>>>>>>>>>>>>> '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' >>>>>>>>>>>>>>> \ https:// >>>>>>>>>>>>>>> ************.ingress.*******/jobs/00000000000000000000000000000000/savepoints >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I would assume that after taking the save point, the jvm >>>>>>>>>>>>>>> should exit; after all, the k8s deployment is of kind: job and >>>>>>>>>>>>>>> if it is a >>>>>>>>>>>>>>> job cluster then a cancellation should exit the jvm and hence >>>>>>>>>>>>>>> the pod. It >>>>>>>>>>>>>>> does seem to do some things right. It stops a bunch of stuff ( >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> JobMaster, the SlotPool, the zookeeper coordinator etc ). It also >>>>>>>>>>>>>>> removes the >>>>>>>>>>>>>>> checkpoint counter but does not exit the jvm. And after a >>>>>>>>>>>>>>> little bit the >>>>>>>>>>>>>>> job is restarted, which does not make sense and is absolutely not >>>>>>>>>>>>>>> the right >>>>>>>>>>>>>>> thing to do ( to me at least ). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Further, if I delete the deployment and the job from k8s and >>>>>>>>>>>>>>> restart the job and deployment fromSavepoint, it refuses to >>>>>>>>>>>>>>> honor the >>>>>>>>>>>>>>> fromSavepoint. I have to delete the zk chroot for it to >>>>>>>>>>>>>>> consider the save >>>>>>>>>>>>>>> point. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a >>>>>>>>>>>>>>> k8s job cluster deployment seems to be >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - cancel with save point as defined here >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints >>>>>>>>>>>>>>> - delete the job manager job and task manager >>>>>>>>>>>>>>> deployments from k8s almost immediately. >>>>>>>>>>>>>>> - clear the ZK chroot for the 0000000...... job and maybe >>>>>>>>>>>>>>> the checkpoints directory. >>>>>>>>>>>>>>> - resumeFromCheckPoint >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Can somebody confirm that this is indeed the process ? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Logs are attached. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster >>>>>>>>>>>>>>> - Savepoint stored in >>>>>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. >>>>>>>>>>>>>>> Now >>>>>>>>>>>>>>> cancelling 00000000000000000000000000000000. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Job anomaly_echo (00000000000000000000000000000000) >>>>>>>>>>>>>>> switched from state RUNNING to CANCELLING. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator >>>>>>>>>>>>>>> - Completed checkpoint 10 for job >>>>>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms). 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: >>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched >>>>>>>>>>>>>>> from RUNNING >>>>>>>>>>>>>>> to CANCELING. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: >>>>>>>>>>>>>>> Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched >>>>>>>>>>>>>>> from >>>>>>>>>>>>>>> CANCELING to CANCELED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph >>>>>>>>>>>>>>> - Job anomaly_echo (00000000000000000000000000000000) >>>>>>>>>>>>>>> switched from state CANCELLING to CANCELED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator >>>>>>>>>>>>>>> - Stopping checkpoint coordinator for job >>>>>>>>>>>>>>> 00000000000000000000000000000000. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>>>>>>>>>>>>> - Shutting down >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint >>>>>>>>>>>>>>> - Checkpoint with ID 8 at >>>>>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' >>>>>>>>>>>>>>> not >>>>>>>>>>>>>>> discarded. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore >>>>>>>>>>>>>>> - Removing >>>>>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 >>>>>>>>>>>>>>> from ZooKeeper >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint >>>>>>>>>>>>>>> - Checkpoint with ID 10 at >>>>>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' >>>>>>>>>>>>>>> not >>>>>>>>>>>>>>> discarded. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter >>>>>>>>>>>>>>> - Shutting down. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter >>>>>>>>>>>>>>> - Removing >>>>>>>>>>>>>>> /checkpoint-counter/00000000000000000000000000000000 from >>>>>>>>>>>>>>> ZooKeeper >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher >>>>>>>>>>>>>>> - Job 00000000000000000000000000000000 reached globally >>>>>>>>>>>>>>> terminal state CANCELED. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster >>>>>>>>>>>>>>> - Stopping the JobMaster for job >>>>>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint >>>>>>>>>>>>>>> - Shutting StandaloneJobClusterEntryPoint down with >>>>>>>>>>>>>>> application status CANCELED. Diagnostics null. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint >>>>>>>>>>>>>>> - Shutting down rest endpoint. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService >>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService >>>>>>>>>>>>>>> /leader/resource_manager_lock. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster >>>>>>>>>>>>>>> - Close ResourceManager connection >>>>>>>>>>>>>>> d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down.. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool >>>>>>>>>>>>>>> - Suspending SlotPool. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool >>>>>>>>>>>>>>> - Stopping SlotPool. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager >>>>>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca >>>>>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for >>>>>>>>>>>>>>> job 00000000000000000000000000000000 from the resource manager. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO >>>>>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService >>>>>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService >>>>>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> After a little bit >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Starting the job-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace >>>>>>>>>>>>>>> with key `jobmanager.heap.size` >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Starting standalonejob as a console application on host >>>>>>>>>>>>>>> anomalyecho-mmg6t. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> .. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> .. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar < >>>>>>>>>>>>>>> bhaskar.eba...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Vishal >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST >>>>>>>>>>>>>>>> API, which is not a stable API. It always exits with 404. The best >>>>>>>>>>>>>>>> way to issue it >>>>>>>>>>>>>>>> is: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> a) First issue the save point REST API >>>>>>>>>>>>>>>> b) Then issue the /yarn-cancel REST API ( as described in >>>>>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E >>>>>>>>>>>>>>>> ) >>>>>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path >>>>>>>>>>>>>>>> returned by (a) as an argument to the run jar REST API >>>>>>>>>>>>>>>> The above is the smoother way. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards >>>>>>>>>>>>>>>> Bhaskar >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi < >>>>>>>>>>>>>>>> vishal.santo...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> There are some issues I see and would want to get some >>>>>>>>>>>>>>>>> feedback >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. On Cancellation With SavePoint with a Target Directory >>>>>>>>>>>>>>>>> , the k8s job does not exit ( it is not a deployment ) . I >>>>>>>>>>>>>>>>> would assume >>>>>>>>>>>>>>>>> that on cancellation the jvm should exit, after cleanup etc, >>>>>>>>>>>>>>>>> and thus the >>>>>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod >>>>>>>>>>>>>>>>> remains live. Is >>>>>>>>>>>>>>>>> that expected ? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to >>>>>>>>>>>>>>>>> delete the job id ( 0000000000.... 
) from ZooKeeper ( this >>>>>>>>>>>>>>>>> is HA ), else >>>>>>>>>>>>>>>>> it defaults to the latest checkpoint no matter what >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I am kind of curious as to what, in 1.7.2, is the tested >>>>>>>>>>>>>>>>> process of cancelling with a save point and resuming, and >>>>>>>>>>>>>>>>> what is the >>>>>>>>>>>>>>>>> cogent story around the job id ( defaults to 000000000000.. ). >>>>>>>>>>>>>>>>> Note that >>>>>>>>>>>>>>>>> --job-id does not work with 1.7.2, so even though that does >>>>>>>>>>>>>>>>> not make sense, >>>>>>>>>>>>>>>>> I still cannot provide a new job id. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Vishal. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>
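Pulling the thread together, the 1.7.2 workaround the discussion converges on can be sketched as a dry-run script. All hosts, HDFS paths, manifest names and the trigger id are placeholders (the originals are starred out above), and each step is printed rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of the 1.7.2 cancel-and-resume workaround from this thread.
# Endpoint, paths and manifest names are placeholders, not the real values.
FLINK_JM="http://localhost:8081"          # JobManager REST endpoint (assumption)
JOB_ID="00000000000000000000000000000000" # default job-cluster job id
SP_DIR="hdfs:///tmp/savepoints"           # savepoint target dir (assumption)
MANIFESTS="manifests"                     # k8s manifest directory (assumption)

run() { echo "$@"; }                      # print each step instead of running it

# 1. Take a savepoint WITHOUT cancelling; the response carries a trigger id.
run curl -H "Content-Type: application/json" -X POST \
  -d "{\"target-directory\":\"${SP_DIR}\",\"cancel-job\":false}" \
  "${FLINK_JM}/jobs/${JOB_ID}/savepoints"

# 2. Poll the trigger id (placeholder) until the savepoint is reported done.
run curl "${FLINK_JM}/jobs/${JOB_ID}/savepoints/<trigger-id>"

# 3. Cancel the job; on 1.7.2 the JM exits non-zero, hence the manual cleanup.
run curl -X PATCH "${FLINK_JM}/jobs/${JOB_ID}?mode=cancel"

# 4. Delete the k8s job and deployment, and clear the ZK chroot for the job
#    id, or the restart will resume from the latest checkpoint instead.
run kubectl delete -f "${MANIFESTS}/job-cluster-job-deployment.yaml"
run kubectl delete -f "${MANIFESTS}/task-manager-deployment.yaml"

# 5. Add --fromSavepoint <savepoint path> to the job manifest args, recreate.
run kubectl create -f "${MANIFESTS}/job-cluster-job-deployment.yaml"
run kubectl create -f "${MANIFESTS}/task-manager-deployment.yaml"
```

Swap `run` for `eval "$@"` to execute the steps against a real cluster; on 1.8.0, per the confirmation at the top of the thread, steps 3-4 collapse into a single cancel, since the JM exits with code 0 and the pod terminates on its own.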