[jira] [Updated] (FLINK-33728) Do not rewatch when KubernetesResourceManagerDriver watch fail

2024-02-21 Thread Xintong Song (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xintong Song updated FLINK-33728:
-
Fix Version/s: 1.20.0
   (was: 1.19.0)

> Do not rewatch when KubernetesResourceManagerDriver watch fail
> --
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: xiaogang zhou
>Assignee: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.20.0
>
>
> I met massive production problem when kubernetes ETCD slow responding happen. 
> After Kube recoverd after 1 hour, Thousands of Flink jobs using 
> kubernetesResourceManagerDriver rewatched when recieving 
> ResourceVersionTooOld,  which caused great pressure on API Server and made 
> API server failed again... 
>  
> I am not sure is it necessary to
> getResourceEventHandler().onError(throwable)
> in  PodCallbackHandlerImpl# handleError method?
>  
> We can just neglect the disconnection of watching process. and try to rewatch 
> once new requestResource called. And we can leverage on the akka heartbeat 
> timeout to discover the TM failure, just like YARN mode do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33728) Do not rewatch when KubernetesResourceManagerDriver watch fail

2024-01-22 Thread xiaogang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaogang zhou updated FLINK-33728:
--
Summary: Do not rewatch when KubernetesResourceManagerDriver watch fail  
(was: do not rewatch when KubernetesResourceManagerDriver watch fail)

> Do not rewatch when KubernetesResourceManagerDriver watch fail
> --
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: xiaogang zhou
>Assignee: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> I met massive production problem when kubernetes ETCD slow responding happen. 
> After Kube recoverd after 1 hour, Thousands of Flink jobs using 
> kubernetesResourceManagerDriver rewatched when recieving 
> ResourceVersionTooOld,  which caused great pressure on API Server and made 
> API server failed again... 
>  
> I am not sure is it necessary to
> getResourceEventHandler().onError(throwable)
> in  PodCallbackHandlerImpl# handleError method?
>  
> We can just neglect the disconnection of watching process. and try to rewatch 
> once new requestResource called. And we can leverage on the akka heartbeat 
> timeout to discover the TM failure, just like YARN mode do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail

2023-12-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33728:
---
Labels: pull-request-available  (was: )

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: xiaogang zhou
>Priority: Major
>  Labels: pull-request-available
>
> I met massive production problem when kubernetes ETCD slow responding happen. 
> After Kube recoverd after 1 hour, Thousands of Flink jobs using 
> kubernetesResourceManagerDriver rewatched when recieving 
> ResourceVersionTooOld,  which caused great pressure on API Server and made 
> API server failed again... 
>  
> I am not sure is it necessary to
> getResourceEventHandler().onError(throwable)
> in  PodCallbackHandlerImpl# handleError method?
>  
> We can just neglect the disconnection of watching process. and try to rewatch 
> once new requestResource called. And we can leverage on the akka heartbeat 
> timeout to discover the TM failure, just like YARN mode do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail

2023-12-04 Thread xiaogang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaogang zhou updated FLINK-33728:
--
Description: 
I met massive production problem when kubernetes ETCD slow responding happen. 
After Kube recoverd after 1 hour, Thousands of Flink jobs using 
kubernetesResourceManagerDriver rewatched when recieving ResourceVersionTooOld, 
 which caused great pressure on API Server and made API server failed again... 

 

I am not sure is it necessary to

getResourceEventHandler().onError(throwable)

in  PodCallbackHandlerImpl# handleError method?

 

We can just neglect the disconnection of watching process. and try to rewatch 
once new requestResource called. And we can leverage on the akka heartbeat 
timeout to discover the TM failure, just like YARN mode do.

  was:
is it necessary to

getResourceEventHandler().onError(throwable)

in  PodCallbackHandlerImpl# handleError method.

 

We can just neglect the disconnection of watching process. and try to rewatch 
once new requestResource called


> do not rewatch when KubernetesResourceManagerDriver watch fail
> --
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: xiaogang zhou
>Priority: Major
>
> I met massive production problem when kubernetes ETCD slow responding happen. 
> After Kube recoverd after 1 hour, Thousands of Flink jobs using 
> kubernetesResourceManagerDriver rewatched when recieving 
> ResourceVersionTooOld,  which caused great pressure on API Server and made 
> API server failed again... 
>  
> I am not sure is it necessary to
> getResourceEventHandler().onError(throwable)
> in  PodCallbackHandlerImpl# handleError method?
>  
> We can just neglect the disconnection of watching process. and try to rewatch 
> once new requestResource called. And we can leverage on the akka heartbeat 
> timeout to discover the TM failure, just like YARN mode do.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail

2023-12-03 Thread xiaogang zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaogang zhou updated FLINK-33728:
--
Description: 
is it necessary to

getResourceEventHandler().onError(throwable)

in  PodCallbackHandlerImpl# handleError method.

 

We can just neglect the disconnection of watching process. and try to rewatch 
once new requestResource called

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --
>
> Key: FLINK-33728
> URL: https://issues.apache.org/jira/browse/FLINK-33728
> Project: Flink
>  Issue Type: New Feature
>  Components: Deployment / Kubernetes
>Reporter: xiaogang zhou
>Priority: Major
>
> is it necessary to
> getResourceEventHandler().onError(throwable)
> in  PodCallbackHandlerImpl# handleError method.
>  
> We can just neglect the disconnection of watching process. and try to rewatch 
> once new requestResource called



--
This message was sent by Atlassian Jira
(v8.20.10#820010)