I see. I think we have seen this issue with other users before. In Flink 1.15 it is solved by the newly introduced JobResultStore, and the operator configures that automatically for 1.15 to avoid this.
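
For reference, here is a minimal sketch of the JobResultStore-related settings in flink-conf.yaml for Flink 1.15. The operator sets this up on its own for 1.15 deployments, so it should only matter when configuring things by hand. The bucket paths are placeholders I made up for illustration, not values from this thread.

```yaml
# Placeholder path (assumption): point this at your own HA storage location.
high-availability.storageDir: s3p://<your-bucket>/flink-ha

# Where the JobResultStore persists the results of globally terminated jobs.
# If unset, Flink derives a location under the HA storage directory.
job-result-store.storage-path: s3p://<your-bucket>/flink-ha/job-result-store

# Keep the entry (marked as clean) instead of deleting it after cleanup, so a
# restarted JobManager can still see that the job already terminated.
# With this set to false, removing old entries is left to the user.
job-result-store.delete-on-commit: false
```
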
Gyula

On Tue, Sep 20, 2022 at 3:27 PM Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:

> Thanks for the answer.
> I think this is not an operator issue: the Kubernetes deployment just
> restarts the failed pod, and the restarted jobmanager, without HA metadata,
> starts the job itself from an empty state.
>
> I'm looking for a way to prevent it from exiting in case of a job error
> (we use an application mode cluster).
>
> ------------------------------
> *From:* Gyula Fóra <gyula.f...@gmail.com>
> *Sent:* September 20, 2022, 19:49:37
> *To:* Evgeniy Lyutikov
> *Cc:* user@flink.apache.org
> *Subject:* Re: JobManager restarts on job failure
>
> The best thing for you to do would be to upgrade to Flink 1.15 and the
> latest operator version.
> In Flink 1.15 we have the option to interact with the Flink jobmanager
> even after the job FAILED, and the operator leverages this for much more
> robust behaviour.
>
> In any case, the operator should never start the job from an empty state
> (even if it FAILED). If you think that's happening, could you please open a
> JIRA ticket with the accompanying JM and operator logs?
>
> Thanks
> Gyula
>
> On Tue, Sep 20, 2022 at 1:00 PM Evgeniy Lyutikov <eblyuti...@avito.ru>
> wrote:
>
>> Hi,
>> We are using Flink 1.14.4 with the Flink Kubernetes operator.
>>
>> Sometimes when updating a job, it fails on startup, and Flink removes all
>> HA metadata and exits the jobmanager.
>>
>> 2022-09-14 14:54:44,534 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 00000000000000000000000000000000 from Checkpoint 30829 @ 1663167158684 for 00000000000000000000000000000000 located at s3p://flink-checkpoints/k8s-checkpoint-job-name/00000000000000000000000000000000/chk-30829.
>> 2022-09-14 14:54:44,638 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 00000000000000000000000000000000 reached terminal state FAILED.
>> org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
>> Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state 4e1d9dde287c33a35e7970cbe64a40fe
>> 2022-09-14 14:54:44,930 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
>> 2022-09-14 14:54:45,020 INFO org.apache.flink.kubernetes.highavailability.KubernetesHaServices [] - Clean up the high availability data for job 00000000000000000000000000000000.
>> 2022-09-14 14:54:45,020 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
>> 2022-09-14 14:54:45,026 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
>> 2022-09-14 14:54:46,122 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Shutting down remote daemon.
>> 2022-09-14 14:54:46,321 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator [] - Remoting shut down.
>>
>> Kubernetes restarts the jobmanager pod, and the new instance, not finding
>> the HA metadata, starts the job from an empty state.
>> Is there some option to prevent the jobmanager from exiting after a job
>> reaches the FAILED state?
>>
>> ------------------------------
>> "This message contains confidential information/commercial secret. If you
>> are not the intended addressee of this message you may not copy, save,
>> print or forward it to any third party and you are kindly requested to
>> destroy this message and notify the sender thereof by email."