Do you have a Maven repository (at Maven Central) set up for the 1.8 release
candidate? We could test it for you.

Without 1.8 and this exit code we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:

> Nobody can tell with 100% certainty. We want to give the RC some exposure
> first, and there is also a release process that is prescribed by the ASF
> [1].
> You can look at past releases to get a feeling for how long the release
> process lasts [2].
>
> [1] http://www.apache.org/legal/release-policy.html#release-approval
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>
>
> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> And when is the 1.8.0 release expected?
>>
>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> :) That makes so much more sense. Is k8s-native Flink part of this
>>> release?
>>>
>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> This issue was fixed recently [1], and the patch will be released with
>>>> 1.8. If
>>>> the Flink job gets cancelled, the JVM should exit with code 0. There is
>>>> a
>>>> release candidate [2], which you can test.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>> [2]
>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>
>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <
>>>> vishal.santo...@gmail.com> wrote:
>>>>
>>>>> Thanks Vijay,
>>>>>
>>>>> This is the larger issue.  The cancellation routine is itself broken.
>>>>>
>>>>> On cancellation Flink does remove the checkpoint counter
>>>>>
>>>>> *2019-03-12 14:12:13,143
>>>>> INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
>>>>> Removing /checkpoint-counter/00000000000000000000000000000000 from
>>>>> ZooKeeper *
>>>>>
>>>>> but exits with a non-zero code
>>>>>
>>>>> *2019-03-12 14:12:13,477
>>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>>> Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with
>>>>> exit code 1444.*
>>>>>
>>>>>
>>>>> That, I think, is an issue. A cancelled job is a complete job, and thus
>>>>> the exit code should be 0 for k8s to mark it complete.
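Until a release with the fix is out, one possible interim workaround (a sketch under assumptions, not a shipped fix) is to wrap the container entrypoint and translate the non-zero exit code that 1.7.x emits on cancellation into 0, so the k8s Job is marked complete instead of restarting. `CANCEL_EXIT_CODE` is hypothetical; check your own JobManager logs for the actual value.

```shell
#!/bin/sh
# Sketch of a hypothetical wrapper entrypoint: run the job cluster and map the
# cancellation exit code to 0 so Kubernetes marks the Job as completed.
# Note: the JVM's exit(1444) is observed by the shell as 1444 mod 256 = 164.
CANCEL_EXIT_CODE=${CANCEL_EXIT_CODE:-164}

run_job() {
  "$@"
  code=$?
  if [ "$code" -eq "$CANCEL_EXIT_CODE" ]; then
    echo "mapping cancellation exit code $code to 0"
    code=0
  fi
  return $code
}

# Stand-in for the real job-cluster entrypoint, simulating a cancelled job:
run_job sh -c "exit $CANCEL_EXIT_CODE"
echo "final exit code: $?"
```

This only papers over the symptom; the real fix is the patch released with 1.8.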
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <
>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>
>>>>>> Yes Vishal. That's correct.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <
>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>
>>>>>>> This is really not cool, but here you go. This seems to work. Agreed
>>>>>>> that it cannot be this painful. The cancel does not exit with an exit
>>>>>>> code of 0, so the job has to be deleted manually. Vijay, does this
>>>>>>> align with what you have had to do?
>>>>>>>
>>>>>>>
>>>>>>>    - Take a savepoint. This returns a request id
>>>>>>>
>>>>>>>    curl --header "Content-Type: application/json" --request POST \
>>>>>>>      --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' \
>>>>>>>      https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Make sure the savepoint succeeded
>>>>>>>
>>>>>>>    curl --request GET \
>>>>>>>      https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - cancel the job
>>>>>>>
>>>>>>>    curl --request PATCH \
>>>>>>>      https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Delete the job and deployment
>>>>>>>
>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>
>>>>>>>    kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Edit the job-cluster-job-deployment.yaml. Add/edit:
>>>>>>>
>>>>>>>    args: ["job-cluster",
>>>>>>>           "--fromSavepoint",
>>>>>>>           "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>           "--job-classname", .........
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Restart
>>>>>>>
>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>
>>>>>>>    kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    - Make sure, from the UI, that it restored from the specific
>>>>>>>    savepoint.
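For reference, the steps above can be sketched as one dry-run script. The host is a placeholder, the savepoint request id must be filled in from step 1's response, and the `run` helper only echoes each command; swap the echo for real execution against your cluster.

```shell
#!/bin/sh
# Dry-run sketch of the manual savepoint/cancel/redeploy workflow above.
set -eu
FLINK=https://flink.example.com                 # placeholder host
JOB=00000000000000000000000000000000
TARGET=hdfs://nn-crunchy:8020/tmp/xyz14

run() { echo "+ $*"; }                          # echo-only; swap in real calls

# 1. trigger a savepoint (the response carries a request id to poll)
run curl -H 'Content-Type: application/json' -X POST \
    -d "{\"target-directory\":\"$TARGET\",\"cancel-job\":false}" \
    "$FLINK/jobs/$JOB/savepoints"

# 2. poll the returned request id until the savepoint has completed
run curl "$FLINK/jobs/$JOB/savepoints/<request-id>"

# 3. cancel the job
run curl -X PATCH "$FLINK/jobs/$JOB?mode=cancel"

# 4. drop and re-create the k8s resources, after adding --fromSavepoint
#    to the job-cluster args in the manifest
run kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
run kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
run kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
run kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
```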
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <
>>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, it's supposed to work, but unfortunately it was not working.
>>>>>>>> The Flink community needs to respond to this behavior.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <
>>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Aah.
>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>> Though I would assume that savepoint-with-cancel is a single
>>>>>>>>> atomic step, rather than a savepoint *followed* by a
>>>>>>>>> cancellation (else why would that be an option).
>>>>>>>>> Thanks again.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <
>>>>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vishal,
>>>>>>>>>>
>>>>>>>>>> yarn-cancel isn't meant only for YARN clusters; it works for all
>>>>>>>>>> clusters. It's the recommended command.
>>>>>>>>>>
>>>>>>>>>> Use the following command to issue a savepoint:
>>>>>>>>>>
>>>>>>>>>>  curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>>    --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":false}' \
>>>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>
>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>> After that, follow the process to restore the savepoint.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <
>>>>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>
>>>>>>>>>>>                Thank you for the reply. This though is a k8s
>>>>>>>>>>> deployment (rather than YARN), but maybe they follow the same
>>>>>>>>>>> lifecycle.
>>>>>>>>>>> I issue a *savepoint with cancel* as documented here
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>> a straight-up
>>>>>>>>>>>
>>>>>>>>>>>  curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>>>    --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>>>>>>>>    https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>
>>>>>>>>>>> I would assume that after taking the savepoint, the JVM should
>>>>>>>>>>> exit; after all, the k8s deployment is of kind: Job, and if it is
>>>>>>>>>>> a job cluster then a cancellation should exit the JVM and hence
>>>>>>>>>>> the pod. It does seem to do some things right. It stops a bunch
>>>>>>>>>>> of stuff (the JobMaster, the SlotPool, the ZooKeeper coordinator,
>>>>>>>>>>> etc.). It also removes the checkpoint counter but does not exit
>>>>>>>>>>> the job. And after a little bit the job is restarted, which does
>>>>>>>>>>> not make sense and is absolutely not the right thing to do (to me
>>>>>>>>>>> at least).
>>>>>>>>>>>
>>>>>>>>>>> Further, if I delete the deployment and the job from k8s and
>>>>>>>>>>> restart the job and deployment fromSavepoint, it refuses to honor
>>>>>>>>>>> the fromSavepoint. I have to delete the ZK chroot for it to
>>>>>>>>>>> consider the savepoint.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thus the process of cancelling and resuming from a SP on a k8s
>>>>>>>>>>> job cluster deployment seems to be
>>>>>>>>>>>
>>>>>>>>>>>    - cancel with savepoint as defined here
>>>>>>>>>>>    https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>>    - delete the job manager job and task manager deployments
>>>>>>>>>>>    from k8s almost immediately.
>>>>>>>>>>>    - clear the ZK chroot for the 0000000...... job and maybe
>>>>>>>>>>>    the checkpoints directory.
>>>>>>>>>>>    - resume fromSavepoint
>>>>>>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  Logs are attached.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>> Savepoint stored in
>>>>>>>>>>> hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now
>>>>>>>>>>> cancelling 00000000000000000000000000000000.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO
>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from 
>>>>>>>>>>> state
>>>>>>>>>>> RUNNING to CANCELLING.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO  
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>     - Completed checkpoint 10 for job
>>>>>>>>>>> 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO
>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink 
>>>>>>>>>>> (1/1)
>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to 
>>>>>>>>>>> CANCELING.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO
>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>> Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink 
>>>>>>>>>>> (1/1)
>>>>>>>>>>> (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to 
>>>>>>>>>>> CANCELED.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO
>>>>>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
>>>>>>>>>>> Job anomaly_echo (00000000000000000000000000000000) switched from 
>>>>>>>>>>> state
>>>>>>>>>>> CANCELLING to CANCELED.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO  
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
>>>>>>>>>>>     - Stopping checkpoint coordinator for job
>>>>>>>>>>> 00000000000000000000000000000000.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>> - Shutting down
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO  
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>       - Checkpoint with ID 8 at
>>>>>>>>>>> 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not
>>>>>>>>>>> discarded.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
>>>>>>>>>>> - Removing
>>>>>>>>>>> /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000
>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO  
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.CompletedCheckpoint
>>>>>>>>>>>       - Checkpoint with ID 10 at
>>>>>>>>>>> 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' 
>>>>>>>>>>> not
>>>>>>>>>>> discarded.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>> - Shutting down.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO
>>>>>>>>>>> org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
>>>>>>>>>>> - Removing /checkpoint-counter/00000000000000000000000000000000
>>>>>>>>>>> from ZooKeeper
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO
>>>>>>>>>>> org.apache.flink.runtime.dispatcher.MiniDispatcher            -
>>>>>>>>>>> Job 00000000000000000000000000000000 reached globally terminal state
>>>>>>>>>>> CANCELED.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>> Stopping the JobMaster for job
>>>>>>>>>>> anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO  
>>>>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>>>>>>>>>>>         - Shutting StandaloneJobClusterEntryPoint down with
>>>>>>>>>>> application status CANCELED. Diagnostics null.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint  -
>>>>>>>>>>> Shutting down rest endpoint.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO
>>>>>>>>>>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>>>> - Stopping ZooKeeperLeaderRetrievalService
>>>>>>>>>>> /leader/resource_manager_lock.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmaster.JobMaster                  -
>>>>>>>>>>> Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2:
>>>>>>>>>>> JobManager is shutting down..
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>> Suspending SlotPool.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          -
>>>>>>>>>>> Stopping SlotPool.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO
>>>>>>>>>>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>>>>>>>>>>> - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca
>>>>>>>>>>> @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job
>>>>>>>>>>> 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO
>>>>>>>>>>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>>>> - Stopping ZooKeeperLeaderElectionService
>>>>>>>>>>> ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> After a little bit
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>>
>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with
>>>>>>>>>>> key `jobmanager.heap.size`
>>>>>>>>>>>
>>>>>>>>>>> Starting standalonejob as a console application on host
>>>>>>>>>>> anomalyecho-mmg6t.
>>>>>>>>>>>
>>>>>>>>>>> ..
>>>>>>>>>>>
>>>>>>>>>>> ..
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <
>>>>>>>>>>> bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>
>>>>>>>>>>>> Savepoint with cancellation internally uses the /cancel REST
>>>>>>>>>>>> API, which is not a stable API; it always exits with 404. The
>>>>>>>>>>>> best way to issue it is:
>>>>>>>>>>>>
>>>>>>>>>>>> a) First issue the savepoint REST API
>>>>>>>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E
>>>>>>>>>>>> )
>>>>>>>>>>>> c) Then, when resuming your job, provide the savepoint path
>>>>>>>>>>>> returned by (a) as an argument to the run-jar REST API
>>>>>>>>>>>>
>>>>>>>>>>>> The above is the smoother way.
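Steps (a)-(c) can be sketched as follows, with a stubbed `flink_api` standing in for curl so the control flow runs as-is; swap the stub for real REST calls against your cluster. The endpoints follow the Flink 1.7 monitoring REST API, and the canned JSON field names (`request-id`, `status.id`) are assumptions to verify against your version.

```shell
#!/bin/sh
JOB=00000000000000000000000000000000

flink_api() {  # stub echoing canned responses instead of calling the REST API
  case "$1" in
    trigger) echo '{"request-id":"2c053ce3bea31276aa25e63784629687"}' ;;
    status)  echo '{"status":{"id":"COMPLETED"}}' ;;
    cancel)  echo '{}' ;;
  esac
}

# (a) POST /jobs/<id>/savepoints and keep the returned request id
REQ=$(flink_api trigger | sed 's/.*"request-id":"\([^"]*\)".*/\1/')
echo "savepoint request: $REQ"

# (b) poll GET /jobs/<id>/savepoints/<request-id> until COMPLETED
until flink_api status "$REQ" | grep -q '"id":"COMPLETED"'; do sleep 2; done
echo "savepoint completed"

# (c) cancel (yarn-cancel works for non-YARN clusters too), then resubmit
#     with the savepoint path reported in (b)
flink_api cancel "$JOB" > /dev/null
echo "job cancelled"
```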
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <
>>>>>>>>>>>> vishal.santo...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> There are some issues I see and would like some feedback on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. On cancellation with savepoint with a target directory,
>>>>>>>>>>>>> the k8s Job does not exit (it is not a Deployment). I would
>>>>>>>>>>>>> assume that on cancellation the JVM should exit, after cleanup
>>>>>>>>>>>>> etc., and thus the pod should too. That does not happen and
>>>>>>>>>>>>> thus the job pod remains live. Is that expected?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. To resume from a savepoint it seems that I have to delete
>>>>>>>>>>>>> the job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>>>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>> process of cancelling with a savepoint and resuming, and what
>>>>>>>>>>>>> the cogent story is around the job id (which defaults to
>>>>>>>>>>>>> 000000000000..). Note that --job-id does not work with 1.7.2, so
>>>>>>>>>>>>> even though that default does not make sense, I still cannot
>>>>>>>>>>>>> provide a new job id.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Vishal.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
