[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-08-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185069#comment-17185069
 ] 

Xintong Song commented on FLINK-17273:
--

Thanks [~felixzheng].

> Fix not calling ResourceManager#closeTaskManagerConnection in 
> KubernetesResourceManager in case of registered TaskExecutor failure
> --
>
> Key: FLINK-17273
> URL: https://issues.apache.org/jira/browse/FLINK-17273
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Coordination
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Assignee: Canbin Zheng
> Priority: Major
> Fix For: 1.12.0
>
>
> At the moment, the {{KubernetesResourceManager}} does not call
> {{ResourceManager#closeTaskManagerConnection}} once it detects that a
> currently registered task executor has failed. This ticket proposes to fix
> this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-08-26 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185032#comment-17185032
 ] 

Canbin Zheng commented on FLINK-17273:
--

Hi [~xintongsong], I am not working on this issue, so please go ahead!



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-08-25 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183869#comment-17183869
 ] 

Xintong Song commented on FLINK-17273:
--

Hi [~felixzheng],

Are there any updates on this ticket?

I'm asking because we are making good progress in revisiting the boundary
between {{ResourceManager}} and its deployment-specific implementations
(FLINK-18620), so the previously discussed solution might no longer apply.

If you have not already started working on this, would you be ok with me taking 
over this ticket?



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-07-05 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151535#comment-17151535
 ] 

Robert Metzger commented on FLINK-17273:


no problem & done.



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-07-04 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151238#comment-17151238
 ] 

Canbin Zheng commented on FLINK-17273:
--

[~rmetzger] Sorry, I have been busy for the past two months. How about changing
the fix version to 1.12?

> Fix not calling ResourceManager#closeTaskManagerConnection in 
> KubernetesResourceManager in case of registered TaskExecutor failure
> --
>
> Key: FLINK-17273
> URL: https://issues.apache.org/jira/browse/FLINK-17273
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes, Runtime / Coordination
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Assignee: Canbin Zheng
> Priority: Major
> Fix For: 1.11.0
>
>
> At the moment, the {{KubernetesResourceManager}} does not call
> {{ResourceManager#closeTaskManagerConnection}} once it detects that a
> currently registered task executor has failed. This ticket proposes to fix
> this problem.





[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-07-03 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150857#comment-17150857
 ] 

Robert Metzger commented on FLINK-17273:


[~felixzheng] What's the status of this ticket?
I'm currently checking some old tickets, and I found that there have been no
updates here in the past weeks.



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-24 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091374#comment-17091374
 ] 

Yang Wang commented on FLINK-17273:
---

+1 to rethinking and drawing a clear boundary between {{ResourceManager}} and
the specific cluster deployment implementations. Some future developments will
also benefit a lot from this.



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-23 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091156#comment-17091156
 ] 

Canbin Zheng commented on FLINK-17273:
--

Thanks a lot for the input [~trohrmann] [~xintongsong]. I agree that we need to
revisit the boundary between {{ResourceManager}} and its deployment-specific
implementations, especially for the worker lifecycle control flow. I will take
a closer look at the overall architecture and get back to you to discuss it
further.



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-23 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091096#comment-17091096
 ] 

Xintong Song commented on FLINK-17273:
--

+1 for revisiting the boundary between {{ResourceManager}} and its
deployment-specific implementations.
I think this would help deduplicate the worker lifecycle control flow across
deployments. The RM implementations should only handle the minimum set of
deployment-specific API/behavior differences.



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-23 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090578#comment-17090578
 ] 

Till Rohrmann commented on FLINK-17273:
---

I think part of the reason we missed calling this function is that
{{ResourceManager}} does not enforce a certain control flow. I think it would
be better if the {{ResourceManager}} offered some calls like
{{notifyWorkerFailed}} which would trigger the failover behaviour controlled by
the {{ResourceManager}} and not by the subclass. In order to make this work, I
guess we should take a look at the overall architecture and think about what
callbacks the {{ResourceManager}} would need in order to do its job. Then the
{{ResourceManager}} should be responsible for reacting to failures and other
signals and simply call the implementation-specific callbacks (e.g. terminating
a pod). In contrast to that, our current {{ResourceManager}} implementations
handle most of the logic themselves, which can lead to problems such as
forgetting to call a method that the contract requires.
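
A minimal sketch of the control flow described above, using a template-method style. This is not Flink code: the class names {{SketchResourceManager}} and {{SketchKubernetesResourceManager}}, the {{releaseWorker}} callback, and the string worker IDs are all hypothetical, and only {{notifyWorkerFailed}} is a name taken from this comment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical base class (not Flink's ResourceManager): it owns the failure flow.
abstract class SketchResourceManager {
    private final Set<String> registeredWorkers = new HashSet<>();

    final void registerWorker(String workerId) {
        registeredWorkers.add(workerId);
    }

    final boolean isRegistered(String workerId) {
        return registeredWorkers.contains(workerId);
    }

    // Final, so a subclass cannot skip the connection cleanup.
    final void notifyWorkerFailed(String workerId) {
        closeWorkerConnection(workerId); // stand-in for closeTaskManagerConnection
        releaseWorker(workerId);         // deployment-specific callback
    }

    private void closeWorkerConnection(String workerId) {
        registeredWorkers.remove(workerId);
    }

    // The only thing a deployment has to implement, e.g. terminating a pod.
    protected abstract void releaseWorker(String workerId);
}

// Hypothetical Kubernetes flavour: records pod deletions instead of calling a real API.
final class SketchKubernetesResourceManager extends SketchResourceManager {
    final List<String> deletedPods = new ArrayList<>();

    @Override
    protected void releaseWorker(String workerId) {
        deletedPods.add(workerId);
    }
}
```

Because the failure entry point is final in the base class, a deployment-specific subclass has no opportunity to forget the cleanup step; it only supplies the callback.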



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-21 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089254#comment-17089254
 ] 

Canbin Zheng commented on FLINK-17273:
--

Thanks for your attention! [~xintongsong] [~fly_in_gis] Given the following 
call stack,
{quote}{{ResourceManager#releaseResource}}

   - {{KubernetesResourceManager#stopWorker}}

      - {{KubernetesResourceManager#internalStopPod}}

   - {{ResourceManager#closeTaskManagerConnection}}
{quote}
 

I think it's enough to explicitly call 
{{ResourceManager#closeTaskManagerConnection}} in 
{{KubernetesResourceManager#removePodIfTerminated}} for this issue.
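
The proposal can be illustrated with a toy model. The sketch below is not the Flink implementation: {{ToyKubernetesResourceManager}}, its two sets, and the boolean {{terminated}} parameter are assumptions made for illustration, and the methods only mimic the names mentioned in this comment.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model, not Flink code: only the two pieces of state that matter here.
class ToyKubernetesResourceManager {
    private final Set<String> pods = new HashSet<>();
    private final Set<String> taskManagerConnections = new HashSet<>();

    void addRegisteredTaskManager(String podName) {
        pods.add(podName);
        taskManagerConnections.add(podName);
    }

    // Stand-in for ResourceManager#closeTaskManagerConnection.
    void closeTaskManagerConnection(String podName) {
        taskManagerConnections.remove(podName);
    }

    // Stand-in for KubernetesResourceManager#internalStopPod.
    private void internalStopPod(String podName) {
        pods.remove(podName);
    }

    // Stand-in for KubernetesResourceManager#removePodIfTerminated.
    void removePodIfTerminated(String podName, boolean terminated) {
        if (!terminated) {
            return;
        }
        internalStopPod(podName);
        // The explicit call proposed in this comment; without it the
        // connection entry would outlive the pod.
        closeTaskManagerConnection(podName);
    }

    boolean hasPod(String podName) {
        return pods.contains(podName);
    }

    boolean hasConnection(String podName) {
        return taskManagerConnections.contains(podName);
    }
}
```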



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-21 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089219#comment-17089219
 ] 

Xintong Song commented on FLINK-17273:
--

I think this is a valid issue. +1 for fixing it.
IIUC, we can call {{ResourceManager#closeTaskManagerConnection}} in 
{{KubernetesResourceManager#internalStopPod}}?



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-21 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088438#comment-17088438
 ] 

Till Rohrmann commented on FLINK-17273:
---

{{ResourceManager#closeTaskManagerConnection}} cleans up the registration state 
on the RM side and tries to notify the TM about the closed connection (this 
might succeed or not). Hence, I guess the K8s RM should also call this method 
whenever a TM or pod is signalled to have failed/disappeared.
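
The two effects described above (unconditional cleanup on the RM side, best-effort notification of the TM) can be sketched as follows. This is a simplified, synchronous toy, not Flink's RPC-based code: {{ToyTaskExecutorGateway}}, {{ToyRegistrationTable}}, the string IDs, and the boolean return value are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the RPC handle to a TaskManager.
interface ToyTaskExecutorGateway {
    void disconnectResourceManager(String cause);
}

// Toy registry, not Flink code.
class ToyRegistrationTable {
    private final Map<String, ToyTaskExecutorGateway> registrations = new HashMap<>();

    void register(String resourceId, ToyTaskExecutorGateway gateway) {
        registrations.put(resourceId, gateway);
    }

    boolean isRegistered(String resourceId) {
        return registrations.containsKey(resourceId);
    }

    // Returns true only if the notification reached the TaskManager.
    boolean closeTaskManagerConnection(String resourceId, String cause) {
        // RM-side cleanup happens first and unconditionally.
        ToyTaskExecutorGateway gateway = registrations.remove(resourceId);
        if (gateway == null) {
            return false; // nothing was registered under this id
        }
        try {
            gateway.disconnectResourceManager(cause);
            return true;
        } catch (RuntimeException e) {
            // Best effort only: a crashed TM cannot be reached.
            return false;
        }
    }
}
```

Removing the registration before attempting the notification means the RM-side state is consistent even when the TM has already disappeared.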



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-20 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088324#comment-17088324
 ] 

Yang Wang commented on FLINK-17273:
---

[~felixzheng] Do you mean that when a TaskManager pod crashes unexpectedly, we
should call {{closeTaskManagerConnection}} before removing the pod in
{{KubernetesResourceManager}}? If so, I think it is a valid fix. Otherwise,
we need to wait for the timeout.



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-20 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088276#comment-17088276
 ] 

Canbin Zheng commented on FLINK-17273:
--

cc [~xintongsong]



[jira] [Commented] (FLINK-17273) Fix not calling ResourceManager#closeTaskManagerConnection in KubernetesResourceManager in case of registered TaskExecutor failure

2020-04-20 Thread Zili Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088255#comment-17088255
 ] 

Zili Chen commented on FLINK-17273:
---

[~trohrmann] [~fly_in_gis]

This looks like a fast-fail path for when a TM (pod) fails, which we already
have in the YARN & Mesos code paths. It would be good if you could also have a look.


