Yang Wang created FLINK-20417:
---------------------------------

             Summary: Handle "Too old resource version" exception in Kubernetes watch more gracefully
                 Key: FLINK-20417
                 URL: https://issues.apache.org/jira/browse/FLINK-20417
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes
            Reporter: Yang Wang

Currently, when a watcher (pods watcher or ConfigMap watcher) is closed with an exception, we call {{WatchCallbackHandler#handleFatalError}}. This can cause the JobManager to terminate and then fail over.

In most cases this is correct, but not for the "too old resource version" exception. See [1] for more information. This exception usually happens when the APIServer is restarted. In that case we only need to create a new watch and continue watching the pods/ConfigMaps. This would reduce the impact of a K8s cluster restart on the Flink cluster.

This issue was inspired by a technical article [2]. Thanks to the folks at Tencent for the debugging. Note that the article is written in Chinese.

[1]. [https://stackoverflow.com/questions/61409596/kubernetes-too-old-resource-version]
[2]. [https://cloud.tencent.com/developer/article/1731416]
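A minimal sketch of the proposed handling, in plain Java with no Kubernetes client dependency. It assumes the close callback can see the HTTP status of the failure; the API server reports an expired resource version as 410 Gone. All names here (WatchRecoverySketch, WatchClosedException, onClose, recreateWatch, fatalErrorHandler) are hypothetical and are not Flink or Kubernetes client APIs. The real change would go through {{WatchCallbackHandler}}.

```java
import java.util.function.Consumer;

/** Sketch only: every name here is hypothetical, not Flink or Kubernetes client API. */
final class WatchRecoverySketch {

    /** Status the API server returns when the requested resource version has expired. */
    static final int HTTP_GONE = 410;

    /** Hypothetical stand-in for the exception handed to a watcher's close callback. */
    static final class WatchClosedException extends RuntimeException {
        final int httpStatusCode;

        WatchClosedException(String message, int httpStatusCode) {
            super(message);
            this.httpStatusCode = httpStatusCode;
        }
    }

    private final Runnable recreateWatch;
    private final Consumer<Throwable> fatalErrorHandler;

    WatchRecoverySketch(Runnable recreateWatch, Consumer<Throwable> fatalErrorHandler) {
        this.recreateWatch = recreateWatch;
        this.fatalErrorHandler = fatalErrorHandler;
    }

    /** Assumption: "too old resource version" is identified by the 410 status code. */
    static boolean isTooOldResourceVersion(WatchClosedException cause) {
        return cause.httpStatusCode == HTTP_GONE;
    }

    /** Called when the watch is closed; a null cause means a normal close. */
    void onClose(WatchClosedException cause) {
        if (cause == null) {
            return;
        }
        if (isTooOldResourceVersion(cause)) {
            // Recoverable: start a fresh watch instead of failing the JobManager.
            recreateWatch.run();
        } else {
            // Any other failure keeps the current behavior.
            fatalErrorHandler.accept(cause);
        }
    }
}
```

With this split, only the 410 case is treated as recoverable; every other exception still reaches the fatal error path, matching the current behavior described above.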