Hi Yang, Thanks again for all the help!
We are still seeing this with 1.11.2 and ZK. Looks like others are seeing this as well and they found a solution https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search Should this solution be added to 1.12? Best kevin On 2020/08/14 02:48:50, Yang Wang <d...@gmail.com<mailto:d...@gmail.com>> wrote: > Hi kevin,> > > Thanks for sharing more information. You are right. Actually, "too old> > resource version" is caused by a bug> > of fabric8 kubernetes-client[1]. It has been fix in v4.6.1. And we have> > bumped the kubernetes-client version> > to v4.9.2 in Flink release-1.11. Also it has been backported to release> > 1.10 and will be included in the next> > minor release version(1.10.2).> > > BTW, if you really want all your jobs recovered when jobmanager crashed,> > you still need to configure the Zookeeper high availability.> > > [1]. https://github.com/fabric8io/kubernetes-client/pull/1800> > > > Best,> > Yang> > > Bohinski, Kevin <ke...@comcast.com<mailto:ke...@comcast.com>> 于2020年8月14日周五 > 上午6:32写道:> > > > Might be useful> > >> > > https://stackoverflow.com/a/61437982> > >> > >> > >> > > Best,> > >> > > kevin> > >> > >> > >> > >> > >> > > *From: *"Bohinski, Kevin" <ke...@comcast.com<mailto:ke...@comcast.com>>> > > *Date: *Thursday, August 13, 2020 at 6:13 PM> > > *To: *Yang Wang <da...@gmail.com<mailto:da...@gmail.com>>> > > *Cc: *"user@flink.apache.org<mailto:user@flink.apache.org>" > > <us...@flink.apache.org<mailto:us...@flink.apache.org>>> > > *Subject: *Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job> > > never recovers> > >> > >> > >> > > Hi> > >> > >> > >> > > Got the logs on crash, hopefully they help.> > >> > >> > >> > > 2020-08-13 22:00:40,336 ERROR> > > org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal> > > error occurred in ResourceManager.> > >> > > io.fabric8.kubernetes.client.KubernetesClientException: too old resource> > > version: 8617182 (8633230)> > >> > > at> > > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)> > > [?:1.8.0_262]> > >> > > at> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)> > > [?:1.8.0_262]> > >> > > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]> > >> > > 2020-08-13 22:00:40,337 ERROR> > > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal> > > error occurred in the cluster entrypoint.> > >> > > io.fabric8.kubernetes.client.KubernetesClientException: too old resource> > > version: 8617182 (8633230)> > >> > > at> > > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)> > > [?:1.8.0_262]> > >> > > at> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)> > > [?:1.8.0_262]> > >> > > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]> > >> > > 2020-08-13 22:00:40,416 INFO> > > org.apache.flink.runtime.blob.BlobServer [] - Stopped> > > BLOB server at 0.0.0.0:6124> > >> > >> > >> > > Best,> > >> > > kevin> > >> > >> > >> > >> > >> > > *From: *Yang Wang <da...@gmail.com<mailto:da...@gmail.com>>> > > *Date: *Sunday, August 9, 2020 at 10:29 PM> > > *To: *"Bohinski, Kevin" <ke...@comcast.com<mailto:ke...@comcast.com>>> > > *Cc: *"user@flink.apache.org<mailto:user@flink.apache.org>" > > <us...@flink.apache.org<mailto:us...@flink.apache.org>>> > > *Subject: *[EXTERNAL] Re: Native K8S Jobmanager restarts and job never> > > recovers> > >> > >> > >> > > Hi Kevin,> > >> > >> > >> > > I think you may not set the high availability configurations in your> > > native K8s session. Currently, we only> > >> > > support zookeeper HA, so you need to add the following configuration.> > > After the HA is configured, the> > >> > > running job, checkpoint and other meta could be stored. When the> > > jobmanager failover, all the jobs> > >> > > could be recovered then. I have tested it could work properly.> > >> > >> > >> > > high-availability: zookeeper> > >> > > high-availability.zookeeper.quorum: zk-client:2181> > >> > > high-availability.storageDir: hdfs:///flink/recovery> > >> > >> > >> > > I know you may not have a zookeeper cluster.You could a zookeeper K8s> > > operator[1] to deploy a new one.> > >> > >> > >> > > More over, it is not very convenient to use zookeeper as HA. So a K8s> > > native HA support[2] is in plan and we> > >> > > are trying to finish it in the next major release cycle(1.12).> > >> > >> > >> > >> > >> > > [1]. https://github.com/pravega/zookeeper-operator> > > <https://urldefense.com/v3/__https:/github.com/pravega/zookeeper-operator__;!!CQl3mcHX2A!WYnCnwXm-Wu9wKZW8f_xzLnOD6R_o4EMi-s9gJYBegeB-7lUopfm3lexybTzSVWj1J1mzw$><https://urldefense.com/v3/__https:/github.com/pravega/zookeeper-operator__;!!CQl3mcHX2A!WYnCnwXm-Wu9wKZW8f_xzLnOD6R_o4EMi-s9gJYBegeB-7lUopfm3lexybTzSVWj1J1mzw$%3e>> > >> > > [2]. https://issues.apache.org/jira/browse/FLINK-12884> > > <https://urldefense.com/v3/__https:/issues.apache.org/jira/browse/FLINK-12884__;!!CQl3mcHX2A!WYnCnwXm-Wu9wKZW8f_xzLnOD6R_o4EMi-s9gJYBegeB-7lUopfm3lexybTzSVVglqEFAw$><https://urldefense.com/v3/__https:/issues.apache.org/jira/browse/FLINK-12884__;!!CQl3mcHX2A!WYnCnwXm-Wu9wKZW8f_xzLnOD6R_o4EMi-s9gJYBegeB-7lUopfm3lexybTzSVVglqEFAw$%3e>> > >> > >> > >> > >> > >> > > Best,> > >> > > Yang> > >> > >> > >> > > Bohinski, Kevin <ke...@comcast.com<mailto:ke...@comcast.com>> 于2020年8月7日周五 > > 下午11:40写道:> > >> > > Hi all,> > >> > >> > >> > > In our 1.11.1 native k8s session after we submit a job it will run> > > successfully for a few hours then fail when the jobmanager pod restarts.> > >> > >> > >> > > Relevant logs after restart are attached below. Any suggestions?> > >> > >> > >> > > Best> > >> > > kevin> > >> > >> > >> > > 2020-08-06 21:50:24,425 INFO> > > org.apache.flink.kubernetes.KubernetesResourceManager [] - Recovered> > > 32 pods from previous attempts, current attempt id is 2.> > >> > > 2020-08-06 21:50:24,610 DEBUG> > > org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] -> > > Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16,> > > details: PodStatus(conditions=[PodCondition(lastProbeTime=null,> > > lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null,> > > status=True, type=Initialized, additionalProperties={}),> > > PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z,> > > message=null, reason=null, status=True, type=Ready,> > > additionalProperties={}), PodCondition(lastProbeTime=null,> > > lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null,> > > status=True, type=ContainersReady, additionalProperties={}),> > > PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z,> > > message=null, reason=null, status=True, type=PodScheduled,> > > additionalProperties={})],> > > containerStatuses=[ContainerStatus(containerID=docker://REDACTED,> > > image=REDACTED/flink:1.11.1-scala_2.11-s3-0,> > > imageID=docker-pullable://REDACTED/flink@sha256:REDACTED,> > > lastState=ContainerState(running=null, terminated=null, waiting=null,> > > additionalProperties={}), name=flink-task-manager, ready=true,> > > restartCount=0, started=true,> > > state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z,> > > additionalProperties={}), terminated=null, waiting=null,> > > additionalProperties={}), additionalProperties={})],> > > ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[],> > > message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED,> > > podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed,> > > reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})> > >> > > 2020-08-06 21:50:24,613 DEBUG> > > org.apache.flink.kubernetes.KubernetesResourceManager [] - Ignore> > > TaskManager pod that is already added:> > > REDACTED-flink-session-taskmanager-1-16> > >> > > 2020-08-06 21:50:24,615 INFO> > > org.apache.flink.kubernetes.KubernetesResourceManager [] - Received> > > 0 new TaskManager pods. Remaining pending pod requests: 0> > >> > > 2020-08-06 21:50:24,631 DEBUG> > > org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] -> > > Could not load archived execution graph for job id> > > 8c76f37962afd87783c65c95387fb828.> > >> > > java.util.concurrent.ExecutionException: java.io.FileNotFoundException:> > > Could not find file for archived execution graph> > > 8c76f37962afd87783c65c95387fb828. This indicates that the file either has> > > been deleted or never written.> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)> > > ~[?:1.8.0_262]> > >> > > at> > > java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898)> > > ~[?:1.8.0_262]> > >> > > at> > > java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209)> > > ~[?:1.8.0_262]> > >> > > at> > > org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native> > > Method) ~[?:1.8.0_262]> > >> > > at> > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)> > > ~[?:1.8.0_262]> > >> > > at> > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)> > > ~[?:1.8.0_262]> > >> > > at java.lang.reflect.Method.invoke(Method.java:498)> > > ~[?:1.8.0_262]> > >> > > at> > > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)> > > ~[flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)> > > [flink-dist_2.11-1.11.1.jar:1.11.1]> > >> > > at> > > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)> > > [flink-dist_2.11 [message truncated...]