[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185069#comment-17185069 ]

Xintong Song commented on FLINK-17273:

Thanks [~felixzheng].

> Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
>
> Key: FLINK-17273
> URL: https://issues.apache.org/jira/browse/FLINK-17273
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Coordination
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Assignee: Canbin Zheng
> Priority: Major
> Fix For: 1.12.0
>
> At the moment, the {{KubernetesResourceManager}} does not call {{ResourceManager#closeTaskManagerConnection}} once it detects that a currently registered task executor has failed. This ticket proposes to fix this problem.

This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185032#comment-17185032 ]

Canbin Zheng commented on FLINK-17273:

Hi [~xintongsong], I am not working on this issue; go ahead!
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183869#comment-17183869 ]

Xintong Song commented on FLINK-17273:

Hi [~felixzheng],

Are there any updates on this ticket? I'm asking because we are making good progress in revisiting the boundary between {{ResourceManager}} and its deployment-specific implementations (FLINK-18620), so the previously discussed solution might no longer apply. If you have not already started working on this, would you be OK with me taking over this ticket?
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151535#comment-17151535 ]

Robert Metzger commented on FLINK-17273:

no problem & done.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151238#comment-17151238 ]

Canbin Zheng commented on FLINK-17273:

[~rmetzger] Sorry, I have been busy for the past two months. How about changing the fix version to 1.12?

> Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
>
> Key: FLINK-17273
> URL: https://issues.apache.org/jira/browse/FLINK-17273
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Coordination
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Assignee: Canbin Zheng
> Priority: Major
> Fix For: 1.11.0
>
> At the moment, the {{KubernetesResourceManager}} does not call {{ResourceManager#closeTaskManagerConnection}} once it detects that a currently registered task executor has failed. This ticket proposes to fix this problem.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150857#comment-17150857 ]

Robert Metzger commented on FLINK-17273:

[~felixzheng] What's the status of this ticket? I'm currently checking some old tickets, and I found that there have been no updates here in the past weeks.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091374#comment-17091374 ]

Yang Wang commented on FLINK-17273:

+1 to rethinking and drawing a clear boundary between {{ResourceManager}} and the specific cluster deployment implementations. Some future developments will also benefit a lot from this.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091156#comment-17091156 ]

Canbin Zheng commented on FLINK-17273:

Thanks a lot for the input [~trohrmann] [~xintongsong].

I agree that we need to revisit the boundary between {{ResourceManager}} and its deployment-specific implementations, especially for the worker lifecycle control flow; I will take a closer look at the overall architecture and get back to you to discuss it further.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091096#comment-17091096 ]

Xintong Song commented on FLINK-17273:

+1 for revisiting the boundary between {{ResourceManager}} and its deployment-specific implementations. I think this would help deduplicate the worker lifecycle control flow across deployments. The RM implementations should only handle the minimum set of deployment-specific API/behavior differences.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090578#comment-17090578 ]

Till Rohrmann commented on FLINK-17273:

I think part of the reason we missed calling this function is that {{ResourceManager}} does not enforce a certain control flow. I think it would be better if the {{ResourceManager}} offered some calls like {{notifyWorkerFailed}} which trigger the failover behaviour controlled by the {{ResourceManager}} and not by the subclass.

In order to make this work, I guess we should take a look at the overall architecture and think about what callbacks the {{ResourceManager}} would need in order to do its job. Then the {{ResourceManager}} should be responsible for reacting to failures and other signals and simply call the implementation-specific callbacks (e.g. terminating a pod). In contrast to that, our current {{ResourceManager}} implementations handle most of the logic themselves, which can lead to problems such as forgetting to call a method that the contract requires.
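
A minimal sketch of the control flow suggested in the comment above, under stated assumptions: only the name {{notifyWorkerFailed}} comes from the discussion. Every class, field, and other method here is a hypothetical stand-in and not Flink's actual API. The base class owns the failure handling in a final method, so a subclass can only supply the deployment-specific step and cannot forget the connection cleanup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical base class that owns the worker-failure control flow. */
abstract class SketchResourceManager {
    // Registered workers keyed by a hypothetical worker id (value: address).
    private final Map<String, String> registeredWorkers = new HashMap<>();

    final void registerWorker(String workerId, String address) {
        registeredWorkers.put(workerId, address);
    }

    final boolean isRegistered(String workerId) {
        return registeredWorkers.containsKey(workerId);
    }

    /**
     * Called by a deployment-specific subclass when it learns that a worker failed.
     * Declared final so that subclasses cannot skip the connection cleanup.
     */
    final void notifyWorkerFailed(String workerId, String cause) {
        if (registeredWorkers.remove(workerId) != null) {
            closeWorkerConnection(workerId, cause);
        }
        releaseDeploymentResource(workerId);
    }

    /** Stand-in for closing the connection to the failed worker. */
    private void closeWorkerConnection(String workerId, String cause) {
        System.out.println("Closing connection to " + workerId + ": " + cause);
    }

    /** Deployment-specific callback, e.g. terminating a pod. */
    protected abstract void releaseDeploymentResource(String workerId);
}

/** Hypothetical Kubernetes flavour: it only knows how to delete a pod. */
class SketchKubernetesResourceManager extends SketchResourceManager {
    final List<String> deletedPods = new ArrayList<>();

    /** Hypothetical entry point invoked when a pod is reported as terminated. */
    void onPodTerminated(String podName) {
        notifyWorkerFailed(podName, "Pod terminated.");
    }

    @Override
    protected void releaseDeploymentResource(String workerId) {
        deletedPods.add(workerId); // stand-in for a Kubernetes API call
    }
}

class NotifyWorkerFailedDemo {
    public static void main(String[] args) {
        SketchKubernetesResourceManager rm = new SketchKubernetesResourceManager();
        rm.registerWorker("taskmanager-1", "10.0.0.1:6122");
        rm.onPodTerminated("taskmanager-1");
        System.out.println("still registered: " + rm.isRegistered("taskmanager-1"));
        System.out.println("deleted pods: " + rm.deletedPods);
    }
}
```

The design choice illustrated is the template-method pattern: the subclass reports the signal and implements one callback, while the ordering of cleanup steps lives in a single place.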
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089254#comment-17089254 ]

Canbin Zheng commented on FLINK-17273:

Thanks for your attention! [~xintongsong] [~fly_in_gis]

Given the following call stack,
{quote}
{{ResourceManager#releaseResource}}
- {{KubernetesResourceManager#stopWorker}}
- {{KubernetesResourceManager#internalStopPod}}
- {{ResourceManager#closeTaskManagerConnection}}
{quote}
I think it's enough to explicitly call {{ResourceManager#closeTaskManagerConnection}} in {{KubernetesResourceManager#removePodIfTerminated}} for this issue.
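
The fix proposed in the comment above could look roughly like the following sketch. It reuses the method names mentioned in the thread ({{removePodIfTerminated}}, {{closeTaskManagerConnection}}), but the classes, fields, and simplified signatures are assumptions made for illustration and do not reproduce Flink's real code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the base resource manager. */
class StubResourceManager {
    // Ids of task executors that are currently registered.
    final Set<String> registeredTaskExecutors = new HashSet<>();

    /** Simplified stand-in: drops the registration state for the given id. */
    void closeTaskManagerConnection(String resourceId, Exception cause) {
        registeredTaskExecutors.remove(resourceId);
    }
}

/** Hypothetical stand-in for the Kubernetes-specific resource manager. */
class StubKubernetesResourceManager extends StubResourceManager {
    // Names of pods this resource manager still tracks.
    final Set<String> pods = new HashSet<>();

    /** Simplified signature: the real method inspects a pod object instead. */
    void removePodIfTerminated(String podName, boolean terminated) {
        if (!terminated) {
            return;
        }
        pods.remove(podName); // stand-in for deleting the pod
        // The call this ticket asks for: clean up the registration right away
        // instead of waiting for a timeout.
        closeTaskManagerConnection(podName, new Exception("Pod terminated."));
    }
}

class RemovePodDemo {
    public static void main(String[] args) {
        StubKubernetesResourceManager rm = new StubKubernetesResourceManager();
        rm.pods.add("taskmanager-1");
        rm.registeredTaskExecutors.add("taskmanager-1");
        rm.removePodIfTerminated("taskmanager-1", true);
        System.out.println("pods: " + rm.pods);
        System.out.println("registered: " + rm.registeredTaskExecutors);
    }
}
```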
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089219#comment-17089219 ]

Xintong Song commented on FLINK-17273:

I think this is a valid issue. +1 for fixing it.

IIUC, we can call {{ResourceManager#closeTaskManagerConnection}} in {{KubernetesResourceManager#internalStopPod}}?
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088438#comment-17088438 ]

Till Rohrmann commented on FLINK-17273:

{{ResourceManager#closeTaskManagerConnection}} cleans up the registration state on the RM side and tries to notify the TM about the closed connection (this might succeed or not). Hence, I guess the K8s RM should also call this method whenever a TM or pod is signalled to have failed/disappeared.
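
The behaviour described in the comment above (clean up the registration state first, then notify the TM on a best-effort basis) can be sketched as follows. The gateway interface, the registry class, and the boolean return value are hypothetical and only illustrate the idea; they are not Flink's implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical gateway to a task manager; the call may fail if the TM is gone. */
interface SketchTaskExecutorGateway {
    void disconnectResourceManager(Exception cause) throws Exception;
}

/** Hypothetical registry holding the RM-side registration state. */
class SketchRegistry {
    private final Map<String, SketchTaskExecutorGateway> taskExecutors = new HashMap<>();

    void register(String resourceId, SketchTaskExecutorGateway gateway) {
        taskExecutors.put(resourceId, gateway);
    }

    boolean isRegistered(String resourceId) {
        return taskExecutors.containsKey(resourceId);
    }

    /**
     * Removes the registration state unconditionally, then tries to notify the TM.
     * Returns true only if the notification was delivered.
     */
    boolean closeTaskManagerConnection(String resourceId, Exception cause) {
        SketchTaskExecutorGateway gateway = taskExecutors.remove(resourceId);
        if (gateway == null) {
            return false; // nothing was registered under this id
        }
        try {
            gateway.disconnectResourceManager(cause);
            return true;
        } catch (Exception e) {
            // The TM may already be gone; the RM-side cleanup has happened regardless.
            return false;
        }
    }
}

class CloseConnectionDemo {
    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();
        registry.register("taskmanager-1", cause -> {
            throw new Exception("task manager unreachable");
        });
        boolean notified =
                registry.closeTaskManagerConnection("taskmanager-1", new Exception("Pod failed."));
        System.out.println("notified: " + notified);
        System.out.println("still registered: " + registry.isRegistered("taskmanager-1"));
    }
}
```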
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088324#comment-17088324 ]

Yang Wang commented on FLINK-17273:

[~felixzheng] Do you mean that when a TaskManager pod crashes unexpectedly, we should call {{closeTaskManagerConnection}} before removing the pod in {{KubernetesResourceManager}}? If so, I think it is a valid fix. Otherwise, we need to wait for the timeout.
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088276#comment-17088276 ]

Canbin Zheng commented on FLINK-17273:

cc [~xintongsong]
[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure
[ https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088255#comment-17088255 ]

Zili Chen commented on FLINK-17273:

[~trohrmann] [~fly_in_gis] This looks like a fail-fast path for when a TM (pod) fails, which we already have in the YARN and Mesos code paths. It would be good if you could also have a look.