[jira] [Updated] (FLINK-33728) Do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xintong Song updated FLINK-33728: - Fix Version/s: 1.20.0 (was: 1.19.0) > Do not rewatch when KubernetesResourceManagerDriver watch fail > -- > > Key: FLINK-33728 > URL: https://issues.apache.org/jira/browse/FLINK-33728 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Major > Labels: pull-request-available > Fix For: 1.20.0 > > > I met massive production problem when kubernetes ETCD slow responding happen. > After Kube recoverd after 1 hour, Thousands of Flink jobs using > kubernetesResourceManagerDriver rewatched when recieving > ResourceVersionTooOld, which caused great pressure on API Server and made > API server failed again... > > I am not sure is it necessary to > getResourceEventHandler().onError(throwable) > in PodCallbackHandlerImpl# handleError method? > > We can just neglect the disconnection of watching process. and try to rewatch > once new requestResource called. And we can leverage on the akka heartbeat > timeout to discover the TM failure, just like YARN mode do. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33728) Do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33728: -- Summary: Do not rewatch when KubernetesResourceManagerDriver watch fail (was: do not rewatch when KubernetesResourceManagerDriver watch fail) > Do not rewatch when KubernetesResourceManagerDriver watch fail > -- > > Key: FLINK-33728 > URL: https://issues.apache.org/jira/browse/FLINK-33728 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Major > Labels: pull-request-available > > I met massive production problem when kubernetes ETCD slow responding happen. > After Kube recoverd after 1 hour, Thousands of Flink jobs using > kubernetesResourceManagerDriver rewatched when recieving > ResourceVersionTooOld, which caused great pressure on API Server and made > API server failed again... > > I am not sure is it necessary to > getResourceEventHandler().onError(throwable) > in PodCallbackHandlerImpl# handleError method? > > We can just neglect the disconnection of watching process. and try to rewatch > once new requestResource called. And we can leverage on the akka heartbeat > timeout to discover the TM failure, just like YARN mode do. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-33728: --- Labels: pull-request-available (was: ) > do not rewatch when KubernetesResourceManagerDriver watch fail > -- > > Key: FLINK-33728 > URL: https://issues.apache.org/jira/browse/FLINK-33728 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes >Reporter: xiaogang zhou >Priority: Major > Labels: pull-request-available > > I met massive production problem when kubernetes ETCD slow responding happen. > After Kube recoverd after 1 hour, Thousands of Flink jobs using > kubernetesResourceManagerDriver rewatched when recieving > ResourceVersionTooOld, which caused great pressure on API Server and made > API server failed again... > > I am not sure is it necessary to > getResourceEventHandler().onError(throwable) > in PodCallbackHandlerImpl# handleError method? > > We can just neglect the disconnection of watching process. and try to rewatch > once new requestResource called. And we can leverage on the akka heartbeat > timeout to discover the TM failure, just like YARN mode do. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33728: -- Description: I met massive production problem when kubernetes ETCD slow responding happen. After Kube recoverd after 1 hour, Thousands of Flink jobs using kubernetesResourceManagerDriver rewatched when recieving ResourceVersionTooOld, which caused great pressure on API Server and made API server failed again... I am not sure is it necessary to getResourceEventHandler().onError(throwable) in PodCallbackHandlerImpl# handleError method? We can just neglect the disconnection of watching process. and try to rewatch once new requestResource called. And we can leverage on the akka heartbeat timeout to discover the TM failure, just like YARN mode do. was: is it necessary to getResourceEventHandler().onError(throwable) in PodCallbackHandlerImpl# handleError method. We can just neglect the disconnection of watching process. and try to rewatch once new requestResource called > do not rewatch when KubernetesResourceManagerDriver watch fail > -- > > Key: FLINK-33728 > URL: https://issues.apache.org/jira/browse/FLINK-33728 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes >Reporter: xiaogang zhou >Priority: Major > > I met massive production problem when kubernetes ETCD slow responding happen. > After Kube recoverd after 1 hour, Thousands of Flink jobs using > kubernetesResourceManagerDriver rewatched when recieving > ResourceVersionTooOld, which caused great pressure on API Server and made > API server failed again... > > I am not sure is it necessary to > getResourceEventHandler().onError(throwable) > in PodCallbackHandlerImpl# handleError method? > > We can just neglect the disconnection of watching process. and try to rewatch > once new requestResource called. And we can leverage on the akka heartbeat > timeout to discover the TM failure, just like YARN mode do. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33728: -- Description: is it necessary to getResourceEventHandler().onError(throwable) in PodCallbackHandlerImpl# handleError method. We can just neglect the disconnection of watching process. and try to rewatch once new requestResource called > do not rewatch when KubernetesResourceManagerDriver watch fail > -- > > Key: FLINK-33728 > URL: https://issues.apache.org/jira/browse/FLINK-33728 > Project: Flink > Issue Type: New Feature > Components: Deployment / Kubernetes >Reporter: xiaogang zhou >Priority: Major > > is it necessary to > getResourceEventHandler().onError(throwable) > in PodCallbackHandlerImpl# handleError method. > > We can just neglect the disconnection of watching process. and try to rewatch > once new requestResource called -- This message was sent by Atlassian Jira (v8.20.10#820010)