[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806586#comment-17806586 ]
Xintong Song commented on FLINK-33728:
--------------------------------------

Sorry for the late reply; I was distracted by other work last week.

I think you are right that the JM will kill itself if the re-watch does not succeed. In most cases it is expected that the client tries to re-watch immediately after seeing a ResourceVersionTooOld exception. However, if the first re-watch attempt fails, the JM should not kill itself immediately; it could instead retry with some backoff interval.

cc [~wangyang0918]

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I hit a massive production problem when Kubernetes etcd became slow to respond.
> After Kubernetes recovered after 1 hour, thousands of Flink jobs using
> KubernetesResourceManagerDriver rewatched upon receiving
> ResourceVersionTooOld, which put great pressure on the API server and made
> the API server fail again...
>
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method?
>
> We could simply ignore the disconnection of the watch and try to rewatch
> the next time requestResource is called. We could also rely on the Akka
> heartbeat timeout to discover TM failures, just as YARN mode does.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
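
To illustrate the "retry with some backoff interval" idea from the comment, here is a minimal sketch in Java. It is not Flink code: the class RewatchBackoff, its methods, and all parameter names are hypothetical, and the Supplier merely stands in for whatever call re-creates the pod watch. The first attempt runs immediately (matching the usual reaction to ResourceVersionTooOld), later attempts wait with a capped exponential delay, and the last failure is rethrown only once the attempts are exhausted.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Flink): retry a re-watch with capped exponential backoff. */
final class RewatchBackoff {

    private RewatchBackoff() {}

    /** Delay before the retry that follows failed attempt number `attempt` (0-based). */
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Double the delay, saturating at the cap instead of overflowing.
            delay = delay > maxMillis / 2 ? maxMillis : delay * 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Calls `rewatch` up to `maxAttempts` times. The first call happens immediately;
     * each failure is followed by a backoff sleep. The last failure is rethrown
     * only after all attempts have been used.
     */
    static <T> T retry(Supplier<T> rewatch, int maxAttempts, long initialMillis, long maxMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return rewatch.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and fall through to the backoff
            }
            if (attempt + 1 < maxAttempts) {
                try {
                    Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to re-watch", ie);
                }
            }
        }
        throw last;
    }
}
```

With thousands of jobs reacting to the same API server outage, adding random jitter to each delay would probably also help spread the retries out; that is left out here to keep the sketch short.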