[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail

Xintong Song (Jira) Sun, 14 Jan 2024 18:00:06 -0800


    [ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806586#comment-17806586 ]


Xintong Song commented on FLINK-33728:
--------------------------------------

Sorry for the late reply, I was distracted by some other work last week.

I think you are right that the JM will kill itself if the re-watch does not 
succeed. I think it is expected in most cases that the client tries to re-watch 
immediately after seeing a ResourceVersionTooOld exception. However, if the 
first attempt to re-watch fails, the JM should not kill itself immediately, but 
may retry with some backoff interval.
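
To make the idea concrete, here is a rough sketch of what "re-watch immediately, then retry with backoff" could look like. It is not the actual KubernetesResourceManagerDriver code: the class RewatchBackoff, the WatchCreator interface, and all parameters are hypothetical names made up for illustration, and a real implementation would more likely schedule the retries on the main thread executor than block in a loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

public class RewatchBackoff {

    /** Hypothetical stand-in for whatever call re-creates the pod watch. */
    public interface WatchCreator {
        void createWatch() throws Exception;
    }

    /** Delay before the given retry (1-based): doubles each time, capped at maxMillis. */
    public static long backoffMillis(int retry, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 1; i < retry && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * First attempt runs immediately; each later attempt waits for the backoff delay.
     * Returns false once maxAttempts is exhausted, which is the point where the
     * caller could still escalate to a fatal error.
     */
    public static boolean rewatch(
            WatchCreator creator,
            int maxAttempts,
            long initialMillis,
            long maxMillis,
            LongConsumer pause) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attempt > 1) {
                // Attempt 2 is retry 1, attempt 3 is retry 2, and so on.
                pause.accept(backoffMillis(attempt - 1, initialMillis, maxMillis));
            }
            try {
                creator.createWatch();
                return true;
            } catch (Exception e) {
                // Swallow the failure and fall through to the next attempt.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> pauses = new ArrayList<>();
        int[] calls = {0};
        // Simulate an API server that rejects the first three watch requests.
        boolean ok = rewatch(
                () -> {
                    if (++calls[0] < 4) {
                        throw new IllegalStateException("API server unavailable");
                    }
                },
                5, 1000, 3000, ms -> pauses.add(ms));
        System.out.println(ok + " after " + calls[0] + " attempts, pauses=" + pauses);
    }
}
```

The pause is passed in as a callback only so the sketch can be exercised without real sleeping. Adding some random jitter to the delay would probably also help, so that thousands of jobs do not hit the API server at the same moment.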

cc [~wangyang0918]

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I hit a massive production problem when Kubernetes etcd became slow to respond. 
> After Kubernetes recovered an hour later, thousands of Flink jobs using 
> KubernetesResourceManagerDriver re-watched upon receiving 
> ResourceVersionTooOld, which put great pressure on the API server and made 
> it fail again... 
>  
> I am not sure whether it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method?
>  
> We could simply ignore the disconnection of the watch and try to re-watch 
> once a new requestResource is called. And we could rely on the Akka heartbeat 
> timeout to discover TM failures, just as YARN mode does.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
