subject:"K8s job cluster and cancel and resume from a save point \?"

Awesome, thanks!

On Tue, Mar 12, 2019 at 11:53 AM Gary Yao  wrote:

> The RC artifacts are only deployed to the Maven Central Repository when
> the RC
> is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you
> can find the maven artifacts, and the Flink binaries here:
>
> -
> https://repository.apache.org/content/repositories/orgapacheflink-1210/
> - https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/
>
> Alternatively, you can apply the patch yourself, and build Flink 1.7 from
> sources [2]. On my machine this takes around 10 minutes if tests are
> skipped.
>
> Best,
> Gary
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink
>
> On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi 
> wrote:
>
>> Do you have a mvn repository ( at mvn central )  set up for 1,8 release
>> candidate. We could test it for you.
>>
>> Without 1.8and this exit code we are essentially held up.
>>
>> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao  wrote:
>>
>>> Nobody can tell with 100% certainty. We want to give the RC some exposure
>>> first, and there is also a release process that is prescribed by the ASF
>>> [1].
>>> You can look at past releases to get a feeling for how long the release
>>> process lasts [2].
>>>
>>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>>> [2]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>>
>>>
>>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 And when is the 1.8.0 release expected ?

 On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> :) That makes so much more sense. Is  k8s native flink a part of this
> release ?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released
>> with 1.8. If
>> the Flink job gets cancelled, the JVM should exit with code 0. There
>> is a
>> release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>
>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Thanks Vijay,
>>>
>>> This is the larger issue.  The cancellation routine is itself broken.
>>>
>>> On cancellation flink does remove the checkpoint counter
>>>
>>> *2019-03-12 14:12:13,143
>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  
>>> -
>>> Removing /checkpoint-counter/ from
>>> ZooKeeper *
>>>
>>> but exist with a non zero code
>>>
>>> *2019-03-12 14:12:13,477
>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint 
>>> with
>>> exit code 1444.*
>>>
>>>
>>> That I think is an issue. A cancelled job is a complete job and thus
>>> the exit code should be 0 for k8s to mark it complete.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>> bhaskar.eba...@gmail.com> wrote:
>>>
 Yes Vishal. Thats correct.

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> This really not cool but here you go. This seems to work. Agreed
> that this cannot be this painful. The cancel does not exit with an 
> exit
> code pf 0 and thus the job has to manually delete. Vijay does this 
> align
> with what you have had to do ?
>
>
>- Take a save point . This returns a request id
>
>curl  --header "Content-Type: application/json" --request POST 
> --data 
> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
> 
> https://*/jobs//savepoints
>
>
>
>- Make sure the save point succeeded
>
>curl  --request GET   
> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>- cancel the job
>
>curl  --request PATCH 
> https://***/jobs/?mode=cancel
>

Re: K8s job cluster and cancel and resume from a save point ?

The RC artifacts are only deployed to the Maven Central Repository when the
RC
is promoted to a release. As written in the 1.8.0 RC1 voting email [1], you
can find the maven artifacts, and the Flink binaries here:

-
https://repository.apache.org/content/repositories/orgapacheflink-1210/
- https://dist.apache.org/repos/dist/dev/flink/flink-1.8.0-rc1/

Alternatively, you can apply the patch yourself, and build Flink 1.7 from
sources [2]. On my machine this takes around 10 minutes if tests are
skipped.

Best,
Gary

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink

On Tue, Mar 12, 2019 at 4:01 PM Vishal Santoshi 
wrote:

> Do you have a mvn repository ( at mvn central )  set up for 1,8 release
> candidate. We could test it for you.
>
> Without 1.8and this exit code we are essentially held up.
>
> On Tue, Mar 12, 2019 at 10:56 AM Gary Yao  wrote:
>
>> Nobody can tell with 100% certainty. We want to give the RC some exposure
>> first, and there is also a release process that is prescribed by the ASF
>> [1].
>> You can look at past releases to get a feeling for how long the release
>> process lasts [2].
>>
>> [1] http://www.apache.org/legal/release-policy.html#release-approval
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>>
>>
>> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> And when is the 1.8.0 release expected ?
>>>
>>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 :) That makes so much more sense. Is  k8s native flink a part of this
 release ?

 On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:

> Hi Vishal,
>
> This issue was fixed recently [1], and the patch will be released with
> 1.8. If
> the Flink job gets cancelled, the JVM should exit with code 0. There
> is a
> release candidate [2], which you can test.
>
> Best,
> Gary
>
> [1] https://issues.apache.org/jira/browse/FLINK-10743
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>
> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> Thanks Vijay,
>>
>> This is the larger issue.  The cancellation routine is itself broken.
>>
>> On cancellation flink does remove the checkpoint counter
>>
>> *2019-03-12 14:12:13,143
>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Removing /checkpoint-counter/ from
>> ZooKeeper *
>>
>> but exist with a non zero code
>>
>> *2019-03-12 14:12:13,477
>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint 
>> with
>> exit code 1444.*
>>
>>
>> That I think is an issue. A cancelled job is a complete job and thus
>> the exit code should be 0 for k8s to mark it complete.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>> bhaskar.eba...@gmail.com> wrote:
>>
>>> Yes Vishal. Thats correct.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 This really not cool but here you go. This seems to work. Agreed
 that this cannot be this painful. The cancel does not exit with an exit
 code pf 0 and thus the job has to manually delete. Vijay does this 
 align
 with what you have had to do ?


- Take a save point . This returns a request id

curl  --header "Content-Type: application/json" --request POST 
 --data 
 '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
 
 https://*/jobs//savepoints



- Make sure the save point succeeded

curl  --request GET   
 https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687



- cancel the job

curl  --request PATCH 
 https://***/jobs/?mode=cancel



- Delete the job and deployment

kubectl delete -f 
 manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

kubectl delete -f 
 manifests/bf2-PRODUCTION/task

Re: K8s job cluster and cancel and resume from a save point ?

Do you have a mvn repository ( at mvn central )  set up for 1,8 release
candidate. We could test it for you.

Without 1.8and this exit code we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao  wrote:

> Nobody can tell with 100% certainty. We want to give the RC some exposure
> first, and there is also a release process that is prescribed by the ASF
> [1].
> You can look at past releases to get a feeling for how long the release
> process lasts [2].
>
> [1] http://www.apache.org/legal/release-policy.html#release-approval
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>
>
> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi 
> wrote:
>
>> And when is the 1.8.0 release expected ?
>>
>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> :) That makes so much more sense. Is  k8s native flink a part of this
>>> release ?
>>>
>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:
>>>
 Hi Vishal,

 This issue was fixed recently [1], and the patch will be released with
 1.8. If
 the Flink job gets cancelled, the JVM should exit with code 0. There is
 a
 release candidate [2], which you can test.

 Best,
 Gary

 [1] https://issues.apache.org/jira/browse/FLINK-10743
 [2]
 http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html

 On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> Thanks Vijay,
>
> This is the larger issue.  The cancellation routine is itself broken.
>
> On cancellation flink does remove the checkpoint counter
>
> *2019-03-12 14:12:13,143
> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Removing /checkpoint-counter/ from
> ZooKeeper *
>
> but exist with a non zero code
>
> *2019-03-12 14:12:13,477
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
> exit code 1444.*
>
>
> That I think is an issue. A cancelled job is a complete job and thus
> the exit code should be 0 for k8s to mark it complete.
>
>
>
>
>
>
>
>
>
> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
> bhaskar.eba...@gmail.com> wrote:
>
>> Yes Vishal. Thats correct.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> This really not cool but here you go. This seems to work. Agreed
>>> that this cannot be this painful. The cancel does not exit with an exit
>>> code pf 0 and thus the job has to manually delete. Vijay does this align
>>> with what you have had to do ?
>>>
>>>
>>>- Take a save point . This returns a request id
>>>
>>>curl  --header "Content-Type: application/json" --request POST 
>>> --data 
>>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>>> 
>>> https://*/jobs//savepoints
>>>
>>>
>>>
>>>- Make sure the save point succeeded
>>>
>>>curl  --request GET   
>>> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>>>
>>>
>>>
>>>- cancel the job
>>>
>>>curl  --request PATCH 
>>> https://***/jobs/?mode=cancel
>>>
>>>
>>>
>>>- Delete the job and deployment
>>>
>>>kubectl delete -f 
>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>kubectl delete -f 
>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>
>>>args: ["job-cluster",
>>>
>>>   "--fromSavepoint",
>>>
>>>   
>>> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>>>   "--job-classname", .
>>>
>>>
>>>
>>>- Restart
>>>
>>>kubectl create -f 
>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>kubectl create -f 
>>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>- Make sure from the UI, that it restored from the specific save
>>>point.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>> bhaskar.eba...@gmail.com> wrote:
>>>
 Yes Its supposed to work.  But unfortunately it was not working.
 Flink community needs to respond to this b

Re: K8s job cluster and cancel and resume from a save point ?

Nobody can tell with 100% certainty. We want to give the RC some exposure
first, and there is also a release process that is prescribed by the ASF
[1].
You can look at past releases to get a feeling for how long the release
process lasts [2].

[1] http://www.apache.org/legal/release-policy.html#release-approval
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0


On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi 
wrote:

> And when is the 1.8.0 release expected ?
>
> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> :) That makes so much more sense. Is  k8s native flink a part of this
>> release ?
>>
>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:
>>
>>> Hi Vishal,
>>>
>>> This issue was fixed recently [1], and the patch will be released with
>>> 1.8. If
>>> the Flink job gets cancelled, the JVM should exit with code 0. There is a
>>> release candidate [2], which you can test.
>>>
>>> Best,
>>> Gary
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>> [2]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>
>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 Thanks Vijay,

 This is the larger issue.  The cancellation routine is itself broken.

 On cancellation flink does remove the checkpoint counter

 *2019-03-12 14:12:13,143
 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
 Removing /checkpoint-counter/ from
 ZooKeeper *

 but exist with a non zero code

 *2019-03-12 14:12:13,477
 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
 Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
 exit code 1444.*


 That I think is an issue. A cancelled job is a complete job and thus
 the exit code should be 0 for k8s to mark it complete.









 On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
 bhaskar.eba...@gmail.com> wrote:

> Yes Vishal. Thats correct.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> This really not cool but here you go. This seems to work. Agreed that
>> this cannot be this painful. The cancel does not exit with an exit code 
>> pf
>> 0 and thus the job has to manually delete. Vijay does this align with 
>> what
>> you have had to do ?
>>
>>
>>- Take a save point . This returns a request id
>>
>>curl  --header "Content-Type: application/json" --request POST --data 
>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>> 
>> https://*/jobs//savepoints
>>
>>
>>
>>- Make sure the save point succeeded
>>
>>curl  --request GET   
>> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>>
>>
>>
>>- cancel the job
>>
>>curl  --request PATCH 
>> https://***/jobs/?mode=cancel
>>
>>
>>
>>- Delete the job and deployment
>>
>>kubectl delete -f 
>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>
>>kubectl delete -f 
>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>>
>>
>>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>>
>>args: ["job-cluster",
>>
>>   "--fromSavepoint",
>>
>>   
>> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>>   "--job-classname", .
>>
>>
>>
>>- Restart
>>
>>kubectl create -f 
>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>
>>kubectl create -f 
>> manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>>
>>
>>- Make sure from the UI, that it restored from the specific save
>>point.
>>
>>
>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>> bhaskar.eba...@gmail.com> wrote:
>>
>>> Yes Its supposed to work.  But unfortunately it was not working.
>>> Flink community needs to respond to this behavior.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 Aah.
 Let me try this out and will get back to you.
 Though I would assume that save point with cancel is a single
 atomic step, rather then a save point *followed*  by a
 ca

Re: K8s job cluster and cancel and resume from a save point ?

This really not cool but here you go. This seems to work. Agreed that this
cannot be this painful. The cancel does not exit with an exit code pf 0 and
thus the job has to manually delete. Vijay does this align with what you
have had to do ?

   - Take a save point . This returns a request id

   curl  --header "Content-Type: application/json" --request POST
--data 
'{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
   https://*/jobs//savepoints

   - Make sure the save point succeeded

   curl  --request GET
https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687

   - cancel the job

   curl  --request PATCH
https://***/jobs/?mode=cancel

   - Delete the job and deployment

   kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

   kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

   - Edit the job-cluster-job-deployment.yaml. Add/Edit

   args: ["job-cluster",

  "--fromSavepoint",

  "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
  "--job-classname", .

   - Restart

   kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

   kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml

   - Make sure from the UI, that it restored from the specific save point.

On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
wrote:

> Yes Its supposed to work.  But unfortunately it was not working. Flink
> community needs to respond to this behavior.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi 
> wrote:
>
>> Aah.
>> Let me try this out and will get back to you.
>> Though I would assume that save point with cancel is a single atomic
>> step, rather then a save point *followed*  by a cancellation ( else why
>> would that be an option ).
>> Thanks again.
>>
>>
>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar 
>> wrote:
>>
>>> Hi Vishal,
>>>
>>> yarn-cancel doesn't mean to be for yarn cluster. It works for all
>>> clusters. Its recommended command.
>>>
>>> Use the following command to issue save point.
>>>  curl  --header "Content-Type: application/json" --request POST --data
>>> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":false}'
>>> \ https://
>>> .ingress.***/jobs//savepoints
>>>
>>> Then issue yarn-cancel.
>>> After that  follow the process to restore save point
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 Hello Vijay,

Thank you for the reply. This though is k8s deployment (
 rather then yarn ) but may be they follow the same lifecycle.  I issue a*
 save point with cancel*  as documented here
 https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
 a straight up
  curl  --header "Content-Type: application/json" --request POST --data
 '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
 \ https://
 .ingress.***/jobs//savepoints

 I would assume that after taking the save point, the jvm should exit,
 after all the k8s deployment is of kind: job and if it is a job cluster
 then a cancellation should exit the jvm and hence the pod. It does seem to
 do some things right. It stops a bunch of stuff ( the JobMaster, the
 slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
 counter but does not exit  the job. And after a little bit the job is
 restarted which does not make sense and absolutely not the right thing to
 do  ( to me at least ).

 Further if I delete the deployment and the job from k8s and restart the
 job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
 have to delete the zk chroot for it to consider the save point.

 Thus the process of cancelling and resuming from a SP on a k8s job
 cluster deployment  seems to be

- cancel with save point as defined hre

 https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manger job  and task manager deployments from k8s
almost immediately.
- clear the ZK chroot for the 000.. job  and may be the
checkpoints directory.
- resumeFromCheckPoint

 If some body can say that this indeed is the process ?

  Logs are attached.

 2019-03-12 08:10:43,871 INFO
 org.apache.flink.runtime.jobmaster.JobMaster  -
 Savepoint stored in
 hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53

Re: K8s job cluster and cancel and resume from a save point ?

Hi Vishal,

I'm afraid not but there are open pull requests for that. You can track the
progress here:

https://issues.apache.org/jira/browse/FLINK-9953

Best,
Gary

On Tue, Mar 12, 2019 at 3:32 PM Vishal Santoshi 
wrote:

> :) That makes so much more sense. Is  k8s native flink a part of this
> release ?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released with
>> 1.8. If
>> the Flink job gets cancelled, the JVM should exit with code 0. There is a
>> release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>
>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Thanks Vijay,
>>>
>>> This is the larger issue.  The cancellation routine is itself broken.
>>>
>>> On cancellation flink does remove the checkpoint counter
>>>
>>> *2019-03-12 14:12:13,143
>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Removing /checkpoint-counter/ from
>>> ZooKeeper *
>>>
>>> but exist with a non zero code
>>>
>>> *2019-03-12 14:12:13,477
>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>> exit code 1444.*
>>>
>>>
>>> That I think is an issue. A cancelled job is a complete job and thus the
>>> exit code should be 0 for k8s to mark it complete.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar 
>>> wrote:
>>>
 Yes Vishal. Thats correct.

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> This really not cool but here you go. This seems to work. Agreed that
> this cannot be this painful. The cancel does not exit with an exit code pf
> 0 and thus the job has to manually delete. Vijay does this align with what
> you have had to do ?
>
>
>- Take a save point . This returns a request id
>
>curl  --header "Content-Type: application/json" --request POST --data 
> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
> https://*/jobs//savepoints
>
>
>
>- Make sure the save point succeeded
>
>curl  --request GET   
> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>- cancel the job
>
>curl  --request PATCH 
> https://***/jobs/?mode=cancel
>
>
>
>- Delete the job and deployment
>
>kubectl delete -f 
> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>
>args: ["job-cluster",
>
>   "--fromSavepoint",
>
>   
> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>   "--job-classname", .
>
>
>
>- Restart
>
>kubectl create -f 
> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Make sure from the UI, that it restored from the specific save
>point.
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
> bhaskar.eba...@gmail.com> wrote:
>
>> Yes Its supposed to work.  But unfortunately it was not working.
>> Flink community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that save point with cancel is a single atomic
>>> step, rather then a save point *followed*  by a cancellation ( else
>>> why would that be an option ).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>> bhaskar.eba...@gmail.com> wrote:
>>>
 Hi Vishal,

 yarn-cancel doesn't mean to be for yarn cluster. It works for all
 clusters. Its recommended command.

 Use the following command to issue save point.
  curl  --header "Content-Type: application/json" --request POST
 --data '{"target-directory":"hdfs://*:8020/tmp/xyz1",
 "cancel-job":false}'  \ https://
 .ingress.***/jobs/0

Re: K8s job cluster and cancel and resume from a save point ?

And when is the 1.8.0 release expected ?

On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi 
wrote:

> :) That makes so much more sense. Is  k8s native flink a part of this
> release ?
>
> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:
>
>> Hi Vishal,
>>
>> This issue was fixed recently [1], and the patch will be released with
>> 1.8. If
>> the Flink job gets cancelled, the JVM should exit with code 0. There is a
>> release candidate [2], which you can test.
>>
>> Best,
>> Gary
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>
>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Thanks Vijay,
>>>
>>> This is the larger issue.  The cancellation routine is itself broken.
>>>
>>> On cancellation flink does remove the checkpoint counter
>>>
>>> *2019-03-12 14:12:13,143
>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>> Removing /checkpoint-counter/ from
>>> ZooKeeper *
>>>
>>> but exist with a non zero code
>>>
>>> *2019-03-12 14:12:13,477
>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>> exit code 1444.*
>>>
>>>
>>> That I think is an issue. A cancelled job is a complete job and thus the
>>> exit code should be 0 for k8s to mark it complete.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar 
>>> wrote:
>>>
 Yes Vishal. Thats correct.

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> This really not cool but here you go. This seems to work. Agreed that
> this cannot be this painful. The cancel does not exit with an exit code pf
> 0 and thus the job has to manually delete. Vijay does this align with what
> you have had to do ?
>
>
>- Take a save point . This returns a request id
>
>curl  --header "Content-Type: application/json" --request POST --data 
> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
> https://*/jobs//savepoints
>
>
>
>- Make sure the save point succeeded
>
>curl  --request GET   
> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>- cancel the job
>
>curl  --request PATCH 
> https://***/jobs/?mode=cancel
>
>
>
>- Delete the job and deployment
>
>kubectl delete -f 
> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>
>args: ["job-cluster",
>
>   "--fromSavepoint",
>
>   
> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>   "--job-classname", .
>
>
>
>- Restart
>
>kubectl create -f 
> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Make sure from the UI, that it restored from the specific save
>point.
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
> bhaskar.eba...@gmail.com> wrote:
>
>> Yes Its supposed to work.  But unfortunately it was not working.
>> Flink community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that save point with cancel is a single atomic
>>> step, rather then a save point *followed*  by a cancellation ( else
>>> why would that be an option ).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>> bhaskar.eba...@gmail.com> wrote:
>>>
 Hi Vishal,

 yarn-cancel doesn't mean to be for yarn cluster. It works for all
 clusters. Its recommended command.

 Use the following command to issue save point.
  curl  --header "Content-Type: application/json" --request POST
 --data '{"target-directory":"hdfs://*:8020/tmp/xyz1",
 "cancel-job":false}'  \ https://
 .ingress.***/jobs//savepoints

 Then issue yarn-cancel.
 After that  follow the process to res

Re: K8s job cluster and cancel and resume from a save point ?

:) That makes so much more sense. Is  k8s native flink a part of this
release ?

On Tue, Mar 12, 2019 at 10:27 AM Gary Yao  wrote:

> Hi Vishal,
>
> This issue was fixed recently [1], and the patch will be released with
> 1.8. If
> the Flink job gets cancelled, the JVM should exit with code 0. There is a
> release candidate [2], which you can test.
>
> Best,
> Gary
>
> [1] https://issues.apache.org/jira/browse/FLINK-10743
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>
> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi 
> wrote:
>
>> Thanks Vijay,
>>
>> This is the larger issue.  The cancellation routine is itself broken.
>>
>> On cancellation flink does remove the checkpoint counter
>>
>> *2019-03-12 14:12:13,143
>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>> Removing /checkpoint-counter/ from
>> ZooKeeper *
>>
>> but exist with a non zero code
>>
>> *2019-03-12 14:12:13,477
>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>> exit code 1444.*
>>
>>
>> That I think is an issue. A cancelled job is a complete job and thus the
>> exit code should be 0 for k8s to mark it complete.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar 
>> wrote:
>>
>>> Yes Vishal. Thats correct.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 This really not cool but here you go. This seems to work. Agreed that
 this cannot be this painful. The cancel does not exit with an exit code pf
 0 and thus the job has to manually delete. Vijay does this align with what
 you have had to do ?


- Take a save point . This returns a request id

curl  --header "Content-Type: application/json" --request POST --data 
 '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
 https://*/jobs//savepoints



- Make sure the save point succeeded

curl  --request GET   
 https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687



- cancel the job

curl  --request PATCH 
 https://***/jobs/?mode=cancel



- Delete the job and deployment

kubectl delete -f 
 manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml



- Edit the job-cluster-job-deployment.yaml. Add/Edit

args: ["job-cluster",

   "--fromSavepoint",

   
 "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
   "--job-classname", .



- Restart

kubectl create -f 
 manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml

kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml



- Make sure from the UI, that it restored from the specific save
point.


 On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
 wrote:

> Yes Its supposed to work.  But unfortunately it was not working. Flink
> community needs to respond to this behavior.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> Aah.
>> Let me try this out and will get back to you.
>> Though I would assume that save point with cancel is a single atomic
>> step, rather then a save point *followed*  by a cancellation ( else
>> why would that be an option ).
>> Thanks again.
>>
>>
>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>> bhaskar.eba...@gmail.com> wrote:
>>
>>> Hi Vishal,
>>>
>>> yarn-cancel doesn't mean to be for yarn cluster. It works for all
>>> clusters. Its recommended command.
>>>
>>> Use the following command to issue save point.
>>>  curl  --header "Content-Type: application/json" --request POST
>>> --data '{"target-directory":"hdfs://*:8020/tmp/xyz1",
>>> "cancel-job":false}'  \ https://
>>> .ingress.***/jobs//savepoints
>>>
>>> Then issue yarn-cancel.
>>> After that  follow the process to restore save point
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 Hello Vijay,

Thank you for the reply. This though is k8s
 deployment ( rather then yarn ) but may

Re: K8s job cluster and cancel and resume from a save point ?

Hi Vishal,

This issue was fixed recently [1], and the patch will be released with 1.8.
If
the Flink job gets cancelled, the JVM should exit with code 0. There is a
release candidate [2], which you can test.

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-10743
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html

On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi 
wrote:

> Thanks Vijay,
>
> This is the larger issue.  The cancellation routine is itself broken.
>
> On cancellation flink does remove the checkpoint counter
>
> *2019-03-12 14:12:13,143
> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Removing /checkpoint-counter/ from
> ZooKeeper *
>
> but exist with a non zero code
>
> *2019-03-12 14:12:13,477
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
> exit code 1444.*
>
>
> That I think is an issue. A cancelled job is a complete job and thus the
> exit code should be 0 for k8s to mark it complete.
>
>
>
>
>
>
>
>
>
> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar 
> wrote:
>
>> Yes Vishal. Thats correct.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> This really not cool but here you go. This seems to work. Agreed that
>>> this cannot be this painful. The cancel does not exit with an exit code pf
>>> 0 and thus the job has to manually delete. Vijay does this align with what
>>> you have had to do ?
>>>
>>>
>>>- Take a save point . This returns a request id
>>>
>>>curl  --header "Content-Type: application/json" --request POST --data 
>>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>>> https://*/jobs//savepoints
>>>
>>>
>>>
>>>- Make sure the save point succeeded
>>>
>>>curl  --request GET   
>>> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>>>
>>>
>>>
>>>- cancel the job
>>>
>>>curl  --request PATCH 
>>> https://***/jobs/?mode=cancel
>>>
>>>
>>>
>>>- Delete the job and deployment
>>>
>>>kubectl delete -f 
>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>
>>>args: ["job-cluster",
>>>
>>>   "--fromSavepoint",
>>>
>>>   
>>> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>>>   "--job-classname", .
>>>
>>>
>>>
>>>- Restart
>>>
>>>kubectl create -f 
>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>- Make sure from the UI, that it restored from the specific save
>>>point.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
>>> wrote:
>>>
 Yes Its supposed to work.  But unfortunately it was not working. Flink
 community needs to respond to this behavior.

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> Aah.
> Let me try this out and will get back to you.
> Though I would assume that save point with cancel is a single atomic
> step, rather then a save point *followed*  by a cancellation ( else
> why would that be an option ).
> Thanks again.
>
>
> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
> bhaskar.eba...@gmail.com> wrote:
>
>> Hi Vishal,
>>
>> yarn-cancel doesn't mean to be for yarn cluster. It works for all
>> clusters. Its recommended command.
>>
>> Use the following command to issue save point.
>>  curl  --header "Content-Type: application/json" --request POST
>> --data '{"target-directory":"hdfs://*:8020/tmp/xyz1",
>> "cancel-job":false}'  \ https://
>> .ingress.***/jobs//savepoints
>>
>> Then issue yarn-cancel.
>> After that  follow the process to restore save point
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Hello Vijay,
>>>
>>>Thank you for the reply. This though is k8s
>>> deployment ( rather then yarn ) but may be they follow the same 
>>> lifecycle.
>>> I issue a* save point with cancel*  as documented here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>> a straight up
>>>  curl  --header "Content-Type: application/json" --reques

Re: K8s job cluster and cancel and resume from a save point ?

Oh, Yeah this is larger issue indeed :)

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:51 PM Vishal Santoshi 
wrote:

> Thanks Vijay,
>
> This is the larger issue.  The cancellation routine is itself broken.
>
> On cancellation flink does remove the checkpoint counter
>
> *2019-03-12 14:12:13,143
> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Removing /checkpoint-counter/ from
> ZooKeeper *
>
> but exist with a non zero code
>
> *2019-03-12 14:12:13,477
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
> exit code 1444.*
>
>
> That I think is an issue. A cancelled job is a complete job and thus the
> exit code should be 0 for k8s to mark it complete.
>
>
>
>
>
>
>
>
>
> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar 
> wrote:
>
>> Yes Vishal. Thats correct.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> This really not cool but here you go. This seems to work. Agreed that
>>> this cannot be this painful. The cancel does not exit with an exit code pf
>>> 0 and thus the job has to manually delete. Vijay does this align with what
>>> you have had to do ?
>>>
>>>
>>>- Take a save point . This returns a request id
>>>
>>>curl  --header "Content-Type: application/json" --request POST --data 
>>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'
>>> https://*/jobs//savepoints
>>>
>>>
>>>
>>>- Make sure the save point succeeded
>>>
>>>curl  --request GET   
>>> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>>>
>>>
>>>
>>>- cancel the job
>>>
>>>curl  --request PATCH 
>>> https://***/jobs/?mode=cancel
>>>
>>>
>>>
>>>- Delete the job and deployment
>>>
>>>kubectl delete -f 
>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>
>>>args: ["job-cluster",
>>>
>>>   "--fromSavepoint",
>>>
>>>   
>>> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>>>   "--job-classname", .
>>>
>>>
>>>
>>>- Restart
>>>
>>>kubectl create -f 
>>> manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>
>>>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>
>>>
>>>
>>>- Make sure from the UI, that it restored from the specific save
>>>point.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
>>> wrote:
>>>
 Yes Its supposed to work.  But unfortunately it was not working. Flink
 community needs to respond to this behavior.

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> Aah.
> Let me try this out and will get back to you.
> Though I would assume that save point with cancel is a single atomic
> step, rather then a save point *followed*  by a cancellation ( else
> why would that be an option ).
> Thanks again.
>
>
> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
> bhaskar.eba...@gmail.com> wrote:
>
>> Hi Vishal,
>>
>> yarn-cancel doesn't mean to be for yarn cluster. It works for all
>> clusters. Its recommended command.
>>
>> Use the following command to issue save point.
>>  curl  --header "Content-Type: application/json" --request POST
>> --data '{"target-directory":"hdfs://*:8020/tmp/xyz1",
>> "cancel-job":false}'  \ https://
>> .ingress.***/jobs//savepoints
>>
>> Then issue yarn-cancel.
>> After that  follow the process to restore save point
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Hello Vijay,
>>>
>>>Thank you for the reply. This though is k8s
>>> deployment ( rather then yarn ) but may be they follow the same 
>>> lifecycle.
>>> I issue a* save point with cancel*  as documented here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>> a straight up
>>>  curl  --header "Content-Type: application/json" --request POST
>>> --data
>>> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
>>> \ https://
>>> .ingress.***/jobs//savepoints
>>>
>>> I would assume that after taking the save point, the jvm should
>>> exit, after all the k8s deployment is of

Re: K8s job cluster and cancel and resume from a save point ?

Thanks Vijay,

This is the larger issue.  The cancellation routine is itself broken.

On cancellation flink does remove the checkpoint counter

*2019-03-12 14:12:13,143
INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
Removing /checkpoint-counter/ from
ZooKeeper *

but exist with a non zero code

*2019-03-12 14:12:13,477
INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint -
Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
exit code 1444.*


That I think is an issue. A cancelled job is a complete job and thus the
exit code should be 0 for k8s to mark it complete.









On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar 
wrote:

> Yes Vishal. Thats correct.
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi 
> wrote:
>
>> This really not cool but here you go. This seems to work. Agreed that
>> this cannot be this painful. The cancel does not exit with an exit code pf
>> 0 and thus the job has to manually delete. Vijay does this align with what
>> you have had to do ?
>>
>>
>>- Take a save point . This returns a request id
>>
>>curl  --header "Content-Type: application/json" --request POST --data 
>> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' 
>>https://*/jobs//savepoints
>>
>>
>>
>>- Make sure the save point succeeded
>>
>>curl  --request GET   
>> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>>
>>
>>
>>- cancel the job
>>
>>curl  --request PATCH 
>> https://***/jobs/?mode=cancel
>>
>>
>>
>>- Delete the job and deployment
>>
>>kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>
>>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>>
>>
>>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>>
>>args: ["job-cluster",
>>
>>   "--fromSavepoint",
>>
>>   
>> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>>   "--job-classname", .
>>
>>
>>
>>- Restart
>>
>>kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>
>>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>
>>
>>
>>- Make sure from the UI, that it restored from the specific save
>>point.
>>
>>
>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
>> wrote:
>>
>>> Yes Its supposed to work.  But unfortunately it was not working. Flink
>>> community needs to respond to this behavior.
>>>
>>> Regards
>>> Bhaskar
>>>
>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>> vishal.santo...@gmail.com> wrote:
>>>
 Aah.
 Let me try this out and will get back to you.
 Though I would assume that save point with cancel is a single atomic
 step, rather then a save point *followed*  by a cancellation ( else
 why would that be an option ).
 Thanks again.


 On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar 
 wrote:

> Hi Vishal,
>
> yarn-cancel doesn't mean to be for yarn cluster. It works for all
> clusters. Its recommended command.
>
> Use the following command to issue save point.
>  curl  --header "Content-Type: application/json" --request POST --data
> '{"target-directory":"hdfs://*:8020/tmp/xyz1",
> "cancel-job":false}'  \ https://
> .ingress.***/jobs//savepoints
>
> Then issue yarn-cancel.
> After that  follow the process to restore save point
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
> vishal.santo...@gmail.com> wrote:
>
>> Hello Vijay,
>>
>>Thank you for the reply. This though is k8s deployment
>> ( rather then yarn ) but may be they follow the same lifecycle.  I issue 
>> a*
>> save point with cancel*  as documented here
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>> a straight up
>>  curl  --header "Content-Type: application/json" --request POST
>> --data
>> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
>> \ https://
>> .ingress.***/jobs//savepoints
>>
>> I would assume that after taking the save point, the jvm should exit,
>> after all the k8s deployment is of kind: job and if it is a job cluster
>> then a cancellation should exit the jvm and hence the pod. It does seem 
>> to
>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>> slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
>> counter but does not exit  the job. And after a little bit the job i

Re: K8s job cluster and cancel and resume from a save point ?

Yes Vishal. Thats correct.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi 
wrote:

> This really not cool but here you go. This seems to work. Agreed that this
> cannot be this painful. The cancel does not exit with an exit code pf 0 and
> thus the job has to manually delete. Vijay does this align with what you
> have had to do ?
>
>
>- Take a save point . This returns a request id
>
>curl  --header "Content-Type: application/json" --request POST --data 
> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'  
>   https://*/jobs//savepoints
>
>
>
>- Make sure the save point succeeded
>
>curl  --request GET   
> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>- cancel the job
>
>curl  --request PATCH 
> https://***/jobs/?mode=cancel
>
>
>
>- Delete the job and deployment
>
>kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>
>args: ["job-cluster",
>
>   "--fromSavepoint",
>
>   
> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>   "--job-classname", .
>
>
>
>- Restart
>
>kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Make sure from the UI, that it restored from the specific save point.
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
> wrote:
>
>> Yes Its supposed to work.  But unfortunately it was not working. Flink
>> community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that save point with cancel is a single atomic
>>> step, rather then a save point *followed*  by a cancellation ( else why
>>> would that be an option ).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar 
>>> wrote:
>>>
 Hi Vishal,

 yarn-cancel doesn't mean to be for yarn cluster. It works for all
 clusters. Its recommended command.

 Use the following command to issue save point.
  curl  --header "Content-Type: application/json" --request POST --data
 '{"target-directory":"hdfs://*:8020/tmp/xyz1",
 "cancel-job":false}'  \ https://
 .ingress.***/jobs//savepoints

 Then issue yarn-cancel.
 After that  follow the process to restore save point

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> Hello Vijay,
>
>Thank you for the reply. This though is k8s deployment
> ( rather then yarn ) but may be they follow the same lifecycle.  I issue 
> a*
> save point with cancel*  as documented here
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
> a straight up
>  curl  --header "Content-Type: application/json" --request POST --data
> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
> \ https://
> .ingress.***/jobs//savepoints
>
> I would assume that after taking the save point, the jvm should exit,
> after all the k8s deployment is of kind: job and if it is a job cluster
> then a cancellation should exit the jvm and hence the pod. It does seem to
> do some things right. It stops a bunch of stuff ( the JobMaster, the
> slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
> counter but does not exit  the job. And after a little bit the job is
> restarted which does not make sense and absolutely not the right thing to
> do  ( to me at least ).
>
> Further if I delete the deployment and the job from k8s and restart
> the job and deployment fromSavePoint, it refuses to honor the
> fromSavePoint. I have to delete the zk chroot for it to consider the save
> point.
>
>
> Thus the process of cancelling and resuming from a SP on a k8s job
> cluster deployment  seems to be
>
>- cancel with save point as defined hre
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>- delete the job manger job  and task manager deployments from k8s
>almost immediately.
>- clear the ZK chroot for the 000.. job  and may be the
>checkpoints directory.
>- resumeFromChe

Re: K8s job cluster and cancel and resume from a save point ?

I must add that there has to be more love for k8s flink deployments. IMHO
that is the way to go.  Maintaining a captive/session cluster, if you have
k8s on premise is pretty much a no go  for various reasons.

On Tue, Mar 12, 2019 at 9:44 AM Vishal Santoshi 
wrote:

> This really not cool but here you go. This seems to work. Agreed that this
> cannot be this painful. The cancel does not exit with an exit code pf 0 and
> thus the job has to manually delete. Vijay does this align with what you
> have had to do ?
>
>
>- Take a save point . This returns a request id
>
>curl  --header "Content-Type: application/json" --request POST --data 
> '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}'  
>   https://*/jobs//savepoints
>
>
>
>- Make sure the save point succeeded
>
>curl  --request GET   
> https:///jobs//savepoints/2c053ce3bea31276aa25e63784629687
>
>
>
>- cancel the job
>
>curl  --request PATCH 
> https://***/jobs/?mode=cancel
>
>
>
>- Delete the job and deployment
>
>kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Edit the job-cluster-job-deployment.yaml. Add/Edit
>
>args: ["job-cluster",
>
>   "--fromSavepoint",
>
>   
> "hdfs:///tmp/xyz14/savepoint-00-1d4f71345e22",
>   "--job-classname", .
>
>
>
>- Restart
>
>kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>
>kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>
>
>
>- Make sure from the UI, that it restored from the specific save point.
>
>
> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar 
> wrote:
>
>> Yes Its supposed to work.  But unfortunately it was not working. Flink
>> community needs to respond to this behavior.
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Aah.
>>> Let me try this out and will get back to you.
>>> Though I would assume that save point with cancel is a single atomic
>>> step, rather then a save point *followed*  by a cancellation ( else why
>>> would that be an option ).
>>> Thanks again.
>>>
>>>
>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar 
>>> wrote:
>>>
 Hi Vishal,

 yarn-cancel doesn't mean to be for yarn cluster. It works for all
 clusters. Its recommended command.

 Use the following command to issue save point.
  curl  --header "Content-Type: application/json" --request POST --data
 '{"target-directory":"hdfs://*:8020/tmp/xyz1",
 "cancel-job":false}'  \ https://
 .ingress.***/jobs//savepoints

 Then issue yarn-cancel.
 After that  follow the process to restore save point

 Regards
 Bhaskar

 On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
 vishal.santo...@gmail.com> wrote:

> Hello Vijay,
>
>Thank you for the reply. This though is k8s deployment
> ( rather then yarn ) but may be they follow the same lifecycle.  I issue 
> a*
> save point with cancel*  as documented here
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
> a straight up
>  curl  --header "Content-Type: application/json" --request POST --data
> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
> \ https://
> .ingress.***/jobs//savepoints
>
> I would assume that after taking the save point, the jvm should exit,
> after all the k8s deployment is of kind: job and if it is a job cluster
> then a cancellation should exit the jvm and hence the pod. It does seem to
> do some things right. It stops a bunch of stuff ( the JobMaster, the
> slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
> counter but does not exit  the job. And after a little bit the job is
> restarted which does not make sense and absolutely not the right thing to
> do  ( to me at least ).
>
> Further if I delete the deployment and the job from k8s and restart
> the job and deployment fromSavePoint, it refuses to honor the
> fromSavePoint. I have to delete the zk chroot for it to consider the save
> point.
>
>
> Thus the process of cancelling and resuming from a SP on a k8s job
> cluster deployment  seems to be
>
>- cancel with save point as defined hre
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>- delete the job manger job  and task manager deployments fr

Re: K8s job cluster and cancel and resume from a save point ?

Yes Its supposed to work.  But unfortunately it was not working. Flink
community needs to respond to this behavior.

Regards
Bhaskar

On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi 
wrote:

> Aah.
> Let me try this out and will get back to you.
> Though I would assume that save point with cancel is a single atomic step,
> rather then a save point *followed*  by a cancellation ( else why would
> that be an option ).
> Thanks again.
>
>
> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar 
> wrote:
>
>> Hi Vishal,
>>
>> yarn-cancel doesn't mean to be for yarn cluster. It works for all
>> clusters. Its recommended command.
>>
>> Use the following command to issue save point.
>>  curl  --header "Content-Type: application/json" --request POST --data
>> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":false}'
>> \ https://
>> .ingress.***/jobs//savepoints
>>
>> Then issue yarn-cancel.
>> After that  follow the process to restore save point
>>
>> Regards
>> Bhaskar
>>
>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Hello Vijay,
>>>
>>>Thank you for the reply. This though is k8s deployment (
>>> rather then yarn ) but may be they follow the same lifecycle.  I issue a*
>>> save point with cancel*  as documented here
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>> a straight up
>>>  curl  --header "Content-Type: application/json" --request POST --data
>>> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
>>> \ https://
>>> .ingress.***/jobs//savepoints
>>>
>>> I would assume that after taking the save point, the jvm should exit,
>>> after all the k8s deployment is of kind: job and if it is a job cluster
>>> then a cancellation should exit the jvm and hence the pod. It does seem to
>>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>>> slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
>>> counter but does not exit  the job. And after a little bit the job is
>>> restarted which does not make sense and absolutely not the right thing to
>>> do  ( to me at least ).
>>>
>>> Further if I delete the deployment and the job from k8s and restart the
>>> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
>>> have to delete the zk chroot for it to consider the save point.
>>>
>>>
>>> Thus the process of cancelling and resuming from a SP on a k8s job
>>> cluster deployment  seems to be
>>>
>>>- cancel with save point as defined hre
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>- delete the job manger job  and task manager deployments from k8s
>>>almost immediately.
>>>- clear the ZK chroot for the 000.. job  and may be the
>>>checkpoints directory.
>>>- resumeFromCheckPoint
>>>
>>> If some body can say that this indeed is the process ?
>>>
>>>
>>>
>>>  Logs are attached.
>>>
>>>
>>>
>>> 2019-03-12 08:10:43,871 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster  -
>>> Savepoint stored in
>>> hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53ae. Now
>>> cancelling .
>>>
>>> 2019-03-12 08:10:43,871 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
>>> anomaly_echo () switched from state RUNNING
>>> to CANCELLING.
>>>
>>> 2019-03-12 08:10:44,227 INFO  
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>> - Completed checkpoint 10 for job 
>>> (7238 bytes in 311 ms).
>>>
>>> 2019-03-12 08:10:44,232 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
>>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>
>>> 2019-03-12 08:10:44,274 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
>>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>
>>> 2019-03-12 08:10:44,276 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
>>> anomaly_echo () switched from state
>>> CANCELLING to CANCELED.
>>>
>>> 2019-03-12 08:10:44,276 INFO  
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>> - Stopping checkpoint coordinator for job
>>> .
>>>
>>> 2019-03-12 08:10:44,277 INFO
>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>>> Shutting down
>>>
>>> 2019-03-12 08:10:44,323 INFO  
>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>   - Checkpoint with ID 8 at
>>> 'hdfs://n

Re: K8s job cluster and cancel and resume from a save point ?

Aah.
Let me try this out and will get back to you.
Though I would assume that save point with cancel is a single atomic step,
rather then a save point *followed*  by a cancellation ( else why would
that be an option ).
Thanks again.


On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar 
wrote:

> Hi Vishal,
>
> yarn-cancel doesn't mean to be for yarn cluster. It works for all
> clusters. Its recommended command.
>
> Use the following command to issue save point.
>  curl  --header "Content-Type: application/json" --request POST --data
> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":false}'
> \ https://
> .ingress.***/jobs//savepoints
>
> Then issue yarn-cancel.
> After that  follow the process to restore save point
>
> Regards
> Bhaskar
>
> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi 
> wrote:
>
>> Hello Vijay,
>>
>>Thank you for the reply. This though is k8s deployment (
>> rather then yarn ) but may be they follow the same lifecycle.  I issue a*
>> save point with cancel*  as documented here
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>> a straight up
>>  curl  --header "Content-Type: application/json" --request POST --data
>> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
>> \ https://
>> .ingress.***/jobs//savepoints
>>
>> I would assume that after taking the save point, the jvm should exit,
>> after all the k8s deployment is of kind: job and if it is a job cluster
>> then a cancellation should exit the jvm and hence the pod. It does seem to
>> do some things right. It stops a bunch of stuff ( the JobMaster, the
>> slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
>> counter but does not exit  the job. And after a little bit the job is
>> restarted which does not make sense and absolutely not the right thing to
>> do  ( to me at least ).
>>
>> Further if I delete the deployment and the job from k8s and restart the
>> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
>> have to delete the zk chroot for it to consider the save point.
>>
>>
>> Thus the process of cancelling and resuming from a SP on a k8s job
>> cluster deployment  seems to be
>>
>>- cancel with save point as defined hre
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>- delete the job manger job  and task manager deployments from k8s
>>almost immediately.
>>- clear the ZK chroot for the 000.. job  and may be the
>>checkpoints directory.
>>- resumeFromCheckPoint
>>
>> If some body can say that this indeed is the process ?
>>
>>
>>
>>  Logs are attached.
>>
>>
>>
>> 2019-03-12 08:10:43,871 INFO
>> org.apache.flink.runtime.jobmaster.JobMaster  -
>> Savepoint stored in
>> hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53ae. Now
>> cancelling .
>>
>> 2019-03-12 08:10:43,871 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
>> anomaly_echo () switched from state RUNNING
>> to CANCELLING.
>>
>> 2019-03-12 08:10:44,227 INFO  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>> - Completed checkpoint 10 for job 
>> (7238 bytes in 311 ms).
>>
>> 2019-03-12 08:10:44,232 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>
>> 2019-03-12 08:10:44,274 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
>> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>
>> 2019-03-12 08:10:44,276 INFO
>> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
>> anomaly_echo () switched from state
>> CANCELLING to CANCELED.
>>
>> 2019-03-12 08:10:44,276 INFO  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>> - Stopping checkpoint coordinator for job
>> .
>>
>> 2019-03-12 08:10:44,277 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>> Shutting down
>>
>> 2019-03-12 08:10:44,323 INFO  
>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>   - Checkpoint with ID 8 at
>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-00-859e626cbb00' not
>> discarded.
>>
>> 2019-03-12 08:10:44,437 INFO
>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
>> Removing
>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/
>> from ZooKeeper
>>
>> 2019-03-12 08:10:44,437 INFO  
>> org.apache.fl

Re: K8s job cluster and cancel and resume from a save point ?

Hi Vishal,

yarn-cancel doesn't mean to be for yarn cluster. It works for all clusters.
Its recommended command.

Use the following command to issue save point.
 curl  --header "Content-Type: application/json" --request POST --data
'{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":false}'
\ https://
.ingress.***/jobs//savepoints

Then issue yarn-cancel.
After that  follow the process to restore save point

Regards
Bhaskar

On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi 
wrote:

> Hello Vijay,
>
>Thank you for the reply. This though is k8s deployment (
> rather then yarn ) but may be they follow the same lifecycle.  I issue a*
> save point with cancel*  as documented here
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
> a straight up
>  curl  --header "Content-Type: application/json" --request POST --data
> '{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
> \ https://
> .ingress.***/jobs//savepoints
>
> I would assume that after taking the save point, the jvm should exit,
> after all the k8s deployment is of kind: job and if it is a job cluster
> then a cancellation should exit the jvm and hence the pod. It does seem to
> do some things right. It stops a bunch of stuff ( the JobMaster, the
> slotPol, zookeeper coordinator etc ) . It also remove the checkpoint
> counter but does not exit  the job. And after a little bit the job is
> restarted which does not make sense and absolutely not the right thing to
> do  ( to me at least ).
>
> Further if I delete the deployment and the job from k8s and restart the
> job and deployment fromSavePoint, it refuses to honor the fromSavePoint. I
> have to delete the zk chroot for it to consider the save point.
>
>
> Thus the process of cancelling and resuming from a SP on a k8s job cluster
> deployment  seems to be
>
>- cancel with save point as defined hre
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>- delete the job manger job  and task manager deployments from k8s
>almost immediately.
>- clear the ZK chroot for the 000.. job  and may be the
>checkpoints directory.
>- resumeFromCheckPoint
>
> If some body can say that this indeed is the process ?
>
>
>
>  Logs are attached.
>
>
>
> 2019-03-12 08:10:43,871 INFO  org.apache.flink.runtime.jobmaster.JobMaster
> - Savepoint stored in
> hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53ae. Now
> cancelling .
>
> 2019-03-12 08:10:43,871 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
> anomaly_echo () switched from state RUNNING
> to CANCELLING.
>
> 2019-03-12 08:10:44,227 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> - Completed checkpoint 10 for job 
> (7238 bytes in 311 ms).
>
> 2019-03-12 08:10:44,232 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>
> 2019-03-12 08:10:44,274 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
> Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>
> 2019-03-12 08:10:44,276 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
> anomaly_echo () switched from state
> CANCELLING to CANCELED.
>
> 2019-03-12 08:10:44,276 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> - Stopping checkpoint coordinator for job
> .
>
> 2019-03-12 08:10:44,277 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
> Shutting down
>
> 2019-03-12 08:10:44,323 INFO  
> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>   - Checkpoint with ID 8 at
> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-00-859e626cbb00' not
> discarded.
>
> 2019-03-12 08:10:44,437 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  -
> Removing
> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/
> from ZooKeeper
>
> 2019-03-12 08:10:44,437 INFO  
> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>   - Checkpoint with ID 10 at
> 'hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53ae' not
> discarded.
>
> 2019-03-12 08:10:44,447 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Shutting down.
>
> 2019-03-12 08:10:44,447 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
> Removing /checkpoint-counter/

Re: K8s job cluster and cancel and resume from a save point ?

Hello Vijay,

Thank you for the reply. This though is k8s deployment (
rather then yarn ) but may be they follow the same lifecycle. I issue a*
save point with cancel* as documented here
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
a straight up
curl --header "Content-Type: application/json" --request POST --data
'{"target-directory":"hdfs://*:8020/tmp/xyz1","cancel-job":true}'
\ https://
.ingress.***/jobs//savepoints

I would assume that after taking the save point, the jvm should exit, after
all the k8s deployment is of kind: job and if it is a job cluster then a
cancellation should exit the jvm and hence the pod. It does seem to do some
things right. It stops a bunch of stuff ( the JobMaster, the slotPol,
zookeeper coordinator etc ) . It also remove the checkpoint counter but
does not exit the job. And after a little bit the job is restarted which
does not make sense and absolutely not the right thing to do ( to me at
least ).

Further if I delete the deployment and the job from k8s and restart the job
and deployment fromSavePoint, it refuses to honor the fromSavePoint. I have
to delete the zk chroot for it to consider the save point.

Thus the process of cancelling and resuming from a SP on a k8s job cluster
deployment seems to be

- cancel with save point as defined hre

https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
- delete the job manger job and task manager deployments from k8s
almost immediately.
- clear the ZK chroot for the 000.. job and may be the
checkpoints directory.
- resumeFromCheckPoint

If some body can say that this indeed is the process ?

Logs are attached.

2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster
- Savepoint stored in
hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53ae. Now
cancelling .

2019-03-12 08:10:43,871 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
anomaly_echo () switched from state RUNNING
to CANCELLING.

2019-03-12 08:10:44,227 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
- Completed checkpoint 10 for job
(7238 bytes in 311 ms).

2019-03-12 08:10:44,232 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
(e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.

2019-03-12 08:10:44,274 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1)
(e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.

2019-03-12 08:10:44,276 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
anomaly_echo () switched from state
CANCELLING to CANCELED.

2019-03-12 08:10:44,276 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
- Stopping checkpoint coordinator for job
.

2019-03-12 08:10:44,277 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Shutting down

2019-03-12 08:10:44,323 INFO
org.apache.flink.runtime.checkpoint.CompletedCheckpoint
- Checkpoint with ID 8 at
'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-00-859e626cbb00' not
discarded.

2019-03-12 08:10:44,437 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Removing
/k8s_anomalyecho/k8s_anomalyecho/checkpoints/
from ZooKeeper

2019-03-12 08:10:44,437 INFO
org.apache.flink.runtime.checkpoint.CompletedCheckpoint
- Checkpoint with ID 10 at
'hdfs://*:8020/tmp/xyz3/savepoint-00-6d5bdc9b53ae' not
discarded.

2019-03-12 08:10:44,447 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter -
Shutting down.

2019-03-12 08:10:44,447 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter -
Removing /checkpoint-counter/ from ZooKeeper

2019-03-12 08:10:44,463 INFO
org.apache.flink.runtime.dispatcher.MiniDispatcher- Job
reached globally terminal state CANCELED.

2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster
- Stopping the JobMaster for job
anomaly_echo().

2019-03-12 08:10:44,468 INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint
- Shutting StandaloneJobClusterEntryPoint down with application
status CANCELED. Diagnostics null.

2019-03-12 08:10:44,468 INFO
org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting
down rest endpoint.

2019-03-12 08:10:44,473 INFO
org.apache.flink.runt

Re: K8s job cluster and cancel and resume from a save point ?