Do you have a Maven repository (at Maven Central) set up for the 1.8 release candidate? We could test it for you.
Without 1.8 and this exit-code fix we are essentially held up.

On Tue, Mar 12, 2019 at 10:56 AM Gary Yao <g...@ververica.com> wrote:

> Nobody can tell with 100% certainty. We want to give the RC some exposure
> first, and there is also a release process that is prescribed by the ASF [1].
> You can look at past releases to get a feeling for how long the release
> process lasts [2].
>
> [1] http://www.apache.org/legal/release-policy.html#release-approval
> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=%5BVOTE%5D+Release&days=0
>
> On Tue, Mar 12, 2019 at 3:38 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>
>> And when is the 1.8.0 release expected?
>>
>> On Tue, Mar 12, 2019 at 10:32 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>
>>> :) That makes so much more sense. Is k8s-native Flink a part of this
>>> release?
>>>
>>> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao <g...@ververica.com> wrote:
>>>
>>>> Hi Vishal,
>>>>
>>>> This issue was fixed recently [1], and the patch will be released with 1.8.
>>>> If the Flink job gets cancelled, the JVM should exit with code 0. There is
>>>> a release candidate [2], which you can test.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10743
>>>> [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Release-1-8-0-release-candidate-1-td27637.html
>>>>
>>>> On Tue, Mar 12, 2019 at 3:21 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>
>>>>> Thanks Vijay,
>>>>>
>>>>> This is the larger issue. The cancellation routine is itself broken.
>>>>>
>>>>> On cancellation Flink does remove the checkpoint counter
>>>>>
>>>>> *2019-03-12 14:12:13,143 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper*
>>>>>
>>>>> but exits with a non-zero code
>>>>>
>>>>> *2019-03-12 14:12:13,477 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process StandaloneJobClusterEntryPoint with exit code 1444.*
>>>>>
>>>>> That I think is an issue. A cancelled job is a complete job and thus
>>>>> the exit code should be 0 for k8s to mark it complete.
>>>>>
>>>>> On Tue, Mar 12, 2019 at 10:18 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>
>>>>>> Yes Vishal. That's correct.
>>>>>>
>>>>>> Regards
>>>>>> Bhaskar
>>>>>>
>>>>>> On Tue, Mar 12, 2019 at 7:14 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>
>>>>>>> This is really not cool but here you go. This seems to work. Agreed
>>>>>>> that it cannot be this painful. The cancel does not exit with an exit
>>>>>>> code of 0 and thus the job has to be deleted manually. Vijay, does this
>>>>>>> align with what you have had to do?
>>>>>>>
>>>>>>> - Take a save point.
>>>>>>>   This returns a request id:
>>>>>>>
>>>>>>>   curl --header "Content-Type: application/json" --request POST \
>>>>>>>     --data '{"target-directory":"hdfs://nn-crunchy:8020/tmp/xyz14","cancel-job":false}' \
>>>>>>>     https://*************/jobs/00000000000000000000000000000000/savepoints
>>>>>>>
>>>>>>> - Make sure the save point succeeded
>>>>>>>
>>>>>>>   curl --request GET \
>>>>>>>     https://****************/jobs/00000000000000000000000000000000/savepoints/2c053ce3bea31276aa25e63784629687
>>>>>>>
>>>>>>> - Cancel the job
>>>>>>>
>>>>>>>   curl --request PATCH \
>>>>>>>     https://***************/jobs/00000000000000000000000000000000?mode=cancel
>>>>>>>
>>>>>>> - Delete the job and deployment
>>>>>>>
>>>>>>>   kubectl delete -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>   kubectl delete -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>
>>>>>>> - Edit the job-cluster-job-deployment.yaml. Add/Edit
>>>>>>>
>>>>>>>   args: ["job-cluster",
>>>>>>>          "--fromSavepoint",
>>>>>>>          "hdfs://************/tmp/xyz14/savepoint-000000-1d4f71345e22",
>>>>>>>          "--job-classname", .........
>>>>>>>
>>>>>>> - Restart
>>>>>>>
>>>>>>>   kubectl create -f manifests/bf2-PRODUCTION/job-cluster-job-deployment.yaml
>>>>>>>   kubectl create -f manifests/bf2-PRODUCTION/task-manager-deployment.yaml
>>>>>>>
>>>>>>> - Make sure from the UI that it restored from the specific save point.
>>>>>>>
>>>>>>> On Tue, Mar 12, 2019 at 7:26 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, it's supposed to work. But unfortunately it was not working.
>>>>>>>> The Flink community needs to respond to this behavior.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Bhaskar
>>>>>>>>
>>>>>>>> On Tue, Mar 12, 2019 at 3:45 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Aah.
>>>>>>>>> Let me try this out and will get back to you.
>>>>>>>>> Though I would assume that save point with cancel is a single
>>>>>>>>> atomic step, rather than a save point *followed* by a
>>>>>>>>> cancellation (else why would that be an option).
>>>>>>>>> Thanks again.
>>>>>>>>>
>>>>>>>>> On Tue, Mar 12, 2019 at 4:50 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Vishal,
>>>>>>>>>>
>>>>>>>>>> yarn-cancel isn't meant only for YARN clusters. It works for all
>>>>>>>>>> clusters. It's the recommended command.
>>>>>>>>>>
>>>>>>>>>> Use the following command to issue a save point:
>>>>>>>>>>
>>>>>>>>>> curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>>   --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1", "cancel-job":false}' \
>>>>>>>>>>   https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>
>>>>>>>>>> Then issue yarn-cancel.
>>>>>>>>>> After that, follow the process to restore the save point.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Bhaskar
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 12, 2019 at 2:11 PM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Vijay,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for the reply. This though is a k8s
>>>>>>>>>>> deployment (rather than YARN), but maybe they follow the same
>>>>>>>>>>> lifecycle.
>>>>>>>>>>> I issue a *save point with cancel* as documented here
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints,
>>>>>>>>>>> a straight up
>>>>>>>>>>>
>>>>>>>>>>> curl --header "Content-Type: application/json" --request POST \
>>>>>>>>>>>   --data '{"target-directory":"hdfs://*********:8020/tmp/xyz1","cancel-job":true}' \
>>>>>>>>>>>   https://************.ingress.*******/jobs/00000000000000000000000000000000/savepoints
>>>>>>>>>>>
>>>>>>>>>>> I would assume that after taking the save point, the JVM should
>>>>>>>>>>> exit; after all, the k8s deployment is of kind: Job, and if it is a job
>>>>>>>>>>> cluster then a cancellation should exit the JVM and hence the pod. It does
>>>>>>>>>>> seem to do some things right. It stops a bunch of stuff (the JobMaster,
>>>>>>>>>>> the SlotPool, the ZooKeeper coordinator etc.). It also removes the checkpoint
>>>>>>>>>>> counter but does not exit the job. And after a little bit the job is
>>>>>>>>>>> restarted, which does not make sense and is absolutely not the right thing to
>>>>>>>>>>> do (to me at least).
>>>>>>>>>>>
>>>>>>>>>>> Further, if I delete the deployment and the job from k8s and
>>>>>>>>>>> restart the job and deployment fromSavepoint, it refuses to honor the
>>>>>>>>>>> fromSavepoint. I have to delete the ZK chroot for it to consider the save
>>>>>>>>>>> point.
>>>>>>>>>>>
>>>>>>>>>>> Thus the process of cancelling and resuming from a save point on a k8s
>>>>>>>>>>> job cluster deployment seems to be
>>>>>>>>>>>
>>>>>>>>>>> - cancel with save point as defined here
>>>>>>>>>>>   https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jobs-jobid-savepoints
>>>>>>>>>>> - delete the job manager job and task manager deployments
>>>>>>>>>>>   from k8s almost immediately.
>>>>>>>>>>> - clear the ZK chroot for the 0000000...... job and maybe
>>>>>>>>>>>   the checkpoints directory.
>>>>>>>>>>> - resumeFromCheckPoint
>>>>>>>>>>>
>>>>>>>>>>> Can somebody confirm that this is indeed the process?
>>>>>>>>>>>
>>>>>>>>>>> Logs are attached.
>>>>>>>>>>>
>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.jobmaster.JobMaster - Savepoint stored in hdfs://*********:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae. Now cancelling 00000000000000000000000000000000.
>>>>>>>>>>> 2019-03-12 08:10:43,871 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state RUNNING to CANCELLING.
>>>>>>>>>>> 2019-03-12 08:10:44,227 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 10 for job 00000000000000000000000000000000 (7238 bytes in 311 ms).
>>>>>>>>>>> 2019-03-12 08:10:44,232 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from RUNNING to CANCELING.
>>>>>>>>>>> 2019-03-12 08:10:44,274 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Barnacle Anomalies Kafka topic -> Map -> Sink: Logging Sink (1/1) (e2d02ca40a9a6c96a0c1882f5a2e4dd6) switched from CANCELING to CANCELED.
>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job anomaly_echo (00000000000000000000000000000000) switched from state CANCELLING to CANCELED.
>>>>>>>>>>> 2019-03-12 08:10:44,276 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 00000000000000000000000000000000.
>>>>>>>>>>> 2019-03-12 08:10:44,277 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down
>>>>>>>>>>> 2019-03-12 08:10:44,323 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 8 at 'hdfs://nn-crunchy:8020/tmp/xyz2/savepoint-000000-859e626cbb00' not discarded.
>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /k8s_anomalyecho/k8s_anomalyecho/checkpoints/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>>> 2019-03-12 08:10:44,437 INFO org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with ID 10 at 'hdfs://*************:8020/tmp/xyz3/savepoint-000000-6d5bdc9b53ae' not discarded.
>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
>>>>>>>>>>> 2019-03-12 08:10:44,447 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
>>>>>>>>>>> 2019-03-12 08:10:44,463 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 00000000000000000000000000000000 reached globally terminal state CANCELED.
>>>>>>>>>>> 2019-03-12 08:10:44,467 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job anomaly_echo(00000000000000000000000000000000).
>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status CANCELED. Diagnostics null.
>>>>>>>>>>> 2019-03-12 08:10:44,468 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
>>>>>>>>>>> 2019-03-12 08:10:44,473 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection d38c6e599d16415a69c65c8b2a72d9a2: JobManager is shutting down..
>>>>>>>>>>> 2019-03-12 08:10:44,475 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
>>>>>>>>>>> 2019-03-12 08:10:44,476 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Disconnect job manager a0dcf8aaa3fadcfd6fef49666d7344ca @akka.tcp://flink@anomalyecho:6123/user/jobmanager_0 for job 00000000000000000000000000000000 from the resource manager.
>>>>>>>>>>> 2019-03-12 08:10:44,477 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
>>>>>>>>>>>
>>>>>>>>>>> After a little bit:
>>>>>>>>>>>
>>>>>>>>>>> Starting the job-cluster
>>>>>>>>>>> used deprecated key `jobmanager.heap.mb`, please replace with key `jobmanager.heap.size`
>>>>>>>>>>> Starting standalonejob as a console application on host anomalyecho-mmg6t.
>>>>>>>>>>> ..
>>>>>>>>>>> ..
>>>>>>>>>>>
>>>>>>>>>>> Regards.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 12, 2019 at 3:25 AM Vijay Bhaskar <bhaskar.eba...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vishal
>>>>>>>>>>>>
>>>>>>>>>>>> Save point with cancellation internally uses the /cancel REST API,
>>>>>>>>>>>> which is not a stable API. It always exits with 404. The best way to
>>>>>>>>>>>> issue it is:
>>>>>>>>>>>>
>>>>>>>>>>>> a) First issue the save point REST API
>>>>>>>>>>>> b) Then issue the /yarn-cancel REST API (as described in
>>>>>>>>>>>>    http://mail-archives.apache.org/mod_mbox/flink-user/201804.mbox/%3c0ffa63f4-e6ed-42d8-1928-37a7adaaa...@apache.org%3E)
>>>>>>>>>>>> c) Then, when resuming your job, provide the save point path
>>>>>>>>>>>>    returned by (a) as an argument to the run-jar REST API
>>>>>>>>>>>>
>>>>>>>>>>>> The above is the smoother way.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 12, 2019 at 2:46 AM Vishal Santoshi <vishal.santo...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> There are some issues I see and would want to get some feedback on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> On cancellation with savepoint with a target directory,
>>>>>>>>>>>>> the k8s job does not exit (it is not a deployment). I would assume
>>>>>>>>>>>>> that on cancellation the JVM should exit, after cleanup etc., and thus the
>>>>>>>>>>>>> pod should too. That does not happen and thus the job pod remains live. Is
>>>>>>>>>>>>> that expected?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. To resume from a save point it seems that I have to delete
>>>>>>>>>>>>> the job id (0000000000....) from ZooKeeper (this is HA), else it
>>>>>>>>>>>>> defaults to the latest checkpoint no matter what.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am kind of curious as to what in 1.7.2 is the tested
>>>>>>>>>>>>> process of cancelling with a save point and resuming, and what is the
>>>>>>>>>>>>> cogent story around the job id (defaults to 000000000000..). Note that
>>>>>>>>>>>>> --job-id does not work with 1.7.2, so even though that does not make sense,
>>>>>>>>>>>>> I still cannot provide a new job id.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Vishal.
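[For reference, the trigger/poll/cancel REST calls discussed in this thread can be sketched as plain request builders. This is a minimal sketch, not an official client: the endpoint paths follow the Flink 1.7 REST API documentation linked above, while the base URL, job id, and target directory are placeholders standing in for the masked values in the thread. The actual HTTP requests are left to the caller, e.g. via curl as shown in the messages.]

```python
import json


def savepoint_trigger_request(base_url, job_id, target_dir, cancel=False):
    """Build (url, json_body) for POST /jobs/:jobid/savepoints.
    The response contains a request id used to poll for completion."""
    url = "%s/jobs/%s/savepoints" % (base_url.rstrip("/"), job_id)
    body = json.dumps({"target-directory": target_dir, "cancel-job": cancel})
    return url, body


def savepoint_status_url(base_url, job_id, trigger_id):
    """Build the URL for GET /jobs/:jobid/savepoints/:triggerid,
    polled until the savepoint reports a completed status."""
    return "%s/jobs/%s/savepoints/%s" % (base_url.rstrip("/"), job_id, trigger_id)


def cancel_url(base_url, job_id):
    """Build the URL for PATCH /jobs/:jobid?mode=cancel,
    the separate cancellation step recommended in the thread."""
    return "%s/jobs/%s?mode=cancel" % (base_url.rstrip("/"), job_id)
```

[Usage mirrors the steps in the thread: POST the trigger request, GET the status URL until the savepoint is complete, then PATCH the cancel URL before deleting the k8s job and redeploying with --fromSavepoint.]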