A Flink 1.11.1 job is running in Application Mode on a K8s cluster, and the
jobmanager restarts about once every hour with the error [Fatal error occurred
in ResourceManager. io.fabric8.kubernetes.client.KubernetesClientException: too
old resource version]

Pod restarts:
<http://apache-flink.147419.n8.nabble.com/file/t1176/11.jpg> 

Cause of the restart:
2020-12-10 07:21:19,290 ERROR org.apache.flink.kubernetes.KubernetesResourceManager        [] - Fatal error occurred in ResourceManager.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
2020-12-10 07:21:19,291 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]


From what I found online, the cause lies at line 212 of the class
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient:

@Override
public KubernetesWatch watchPodsAndDoCallback(
        Map<String, String> labels,
        PodCallbackHandler podCallbackHandler) {
    return new KubernetesWatch(
            this.internalClient.pods()
                    .withLabels(labels)
                    .watch(new KubernetesPodsWatcher(podCallbackHandler)));
}

etcd, however, only retains resource version information for a limited time:
【 I think it's standard behavior of Kubernetes to give 410 after some time
during watch. It's usually client's responsibility to handle it. In the
context of a watch, it will return HTTP_GONE when you ask to see changes for
a resourceVersion that is too old - i.e. when it can no longer tell you what
has changed since that version, since too many things have changed. In that
case, you'll need to start again, by not specifying a resourceVersion in
which case the watch will send you the current state of the thing you are
watching and then send updates from that point.】
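
To make the quoted advice concrete, here is a minimal, self-contained sketch of the "start again without a resourceVersion" pattern. It does not use the real fabric8 or Flink APIs: RewatchSketch, WatchSource, ResourceVersionTooOldException and watchWithRestart are all hypothetical names invented for illustration, and the fake server in main only simulates the 410 response. How this would map onto KubernetesResourceManager is an open question, not something the sketch answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class RewatchSketch {

    /** Hypothetical stand-in for the HTTP 410 (Gone) "too old resource version" error. */
    static class ResourceVersionTooOldException extends RuntimeException {
        ResourceVersionTooOldException(String message) {
            super(message);
        }
    }

    /** Hypothetical watch source: pushes events to the handler until it fails or ends. */
    interface WatchSource {
        // A null resourceVersion means "start from the current state".
        void watch(String resourceVersion, Consumer<String> handler);
    }

    /**
     * Runs the watch. If the server reports the resource version as too old,
     * the watch is started again without a resource version instead of the
     * error being treated as fatal. Returns the number of restarts performed.
     */
    static int watchWithRestart(WatchSource source, String initialVersion,
                                Consumer<String> handler, int maxRestarts) {
        String version = initialVersion;
        int restarts = 0;
        while (true) {
            try {
                source.watch(version, handler);
                return restarts; // the watch ended normally
            } catch (ResourceVersionTooOldException e) {
                if (restarts >= maxRestarts) {
                    throw e; // give up after too many attempts
                }
                restarts++;
                version = null; // start over from the current state
            }
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // Fake server: rejects any explicit (stale) version, serves the current state otherwise.
        WatchSource fake = (version, handler) -> {
            if (version != null) {
                throw new ResourceVersionTooOldException(
                        "too old resource version: " + version);
            }
            handler.accept("ADDED pod-a");
            handler.accept("MODIFIED pod-a");
        };
        int restarts = watchWithRestart(fake, "247468999", seen::add, 3);
        System.out.println("restarts=" + restarts + ", events=" + seen);
    }
}
```

The retry cap (maxRestarts) is a design choice of this sketch so that a persistently failing watch still surfaces as an error rather than looping forever.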

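Has anyone run into the same problem, and how did you handle it? I have a few
possible approaches and would like to discuss them with everyone.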




