[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323338#comment-17323338 ]

Flink Jira Bot commented on FLINK-11914:

This issue is assigned but has not received an update in 7 days so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign so someone else may work on it. In 7 days the issue will be automatically unassigned.

> Expose a REST endpoint in JobManager to kill specific TaskManager
>
> Key: FLINK-11914
> URL: https://issues.apache.org/jira/browse/FLINK-11914
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / REST
> Reporter: Shuyi Chen
> Assignee: Shuyi Chen
> Priority: Major
> Labels: stale-assigned
>
> we want to add capability in the Flink web UI to kill each individual TM by
> clicking a button, this would require first exposing the capability from the
> REST API endpoint. The reason is that some TM might be running on a heavily
> loaded YARN host over time, and we want to kill just that TM and have flink
> JM to reallocate a TM to restart the job graph. The other approach would be
> restart the entire YARN job and this is heavy-weight.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227308#comment-17227308 ]

Matthias commented on FLINK-11914:

Hi [~suez1224], I visited this issue as part of the backlog grooming of the Engine team and noticed that it has been idle for some time. It looks like there is no need for this feature for now, so I would close it unless you object.
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813432#comment-16813432 ]

Till Rohrmann commented on FLINK-11914:

I agree with [~suez1224], we should not expose any Akka endpoints to the external world. Akka should exclusively be used for internal cluster communication.
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810193#comment-16810193 ]

Shuyi Chen commented on FLINK-11914:

Hi [~feng.xu], I don't think we should expose an Akka endpoint, because Akka is an internal implementation detail and, AFAIK, the community is trying to deprecate the use of Akka in Flink. Thanks.
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804300#comment-16804300 ]

Feng Xu commented on FLINK-11914:

Hi [~suez1224] and [~Zentol], alternatively, we could have a command-line tool that sends the Akka disconnect message to the JobManager, so there is no need to expose the endpoint or add a UI button. Of course, we would need appropriate access control on this.
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803209#comment-16803209 ]

Shuyi Chen commented on FLINK-11914:

Hi [~Zentol], thanks a lot for the comments. cc [~till.rohrmann], since we had some offline discussion as well.

Currently, the YARN resource scheduler does not take dynamic resource usage into account. Over time, the resource usage of some containers might increase, or some containers might use more than they asked for and thus oversubscribe host resources. The resource that causes the lag might be CPU, memory, file descriptors, disk, or network, or even some application-specific cause. This commonly happens in a shared cluster, and it is not possible for the resource scheduler to predict and regulate runtime resource usage effectively. As in other frameworks such as MapReduce or Spark, if there is a straggling task, it should be the responsibility of the framework, not the resource scheduler, to restart that task on a different node, since the resource scheduler has no idea what it means for one container to be slow.

I think exposing an endpoint to disconnect a TM will enable us to build an external monitor/controller that recovers the Flink job by relocating the straggling TM. The external controller will synthesize information from Flink metrics, application metrics and host metrics to determine whether a TM is straggling and relocate it. This will greatly help scale our platform to manage more Flink jobs.

Also, you are correct that the same slow host might get allocated again after the kill. To mitigate the issue, I propose adding a reason parameter to the API and letting the Flink resource scheduler blacklist that host when acquiring resources from YARN/Mesos.

With regard to adding a UI button for this, I understand your concern, and we can discuss the need in a follow-up.
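A minimal sketch of the external controller idea from the comment above, assuming the controller already has one metrics snapshot per TaskManager. Everything here is hypothetical: the `TmSnapshot` fields, the CPU threshold, and the "lag well above the median" rule are illustrative choices, not Flink metric names or an agreed design. The host of each flagged TM is what a `reason`-style parameter could pass on for blacklisting.

```java
import java.util.ArrayList;
import java.util.List;

/** Flags TaskManagers that sit on an overloaded host AND lag well behind their peers. */
final class StragglerDetector {
    private final double cpuLoadThreshold; // hypothetical tuning knob, e.g. 0.9
    private final double lagFactor;        // hypothetical tuning knob, e.g. 3.0

    StragglerDetector(double cpuLoadThreshold, double lagFactor) {
        this.cpuLoadThreshold = cpuLoadThreshold;
        this.lagFactor = lagFactor;
    }

    /** Returns the snapshots whose host load exceeds the threshold and whose lag exceeds lagFactor times the median lag. */
    List<TmSnapshot> findStragglers(List<TmSnapshot> snapshots) {
        List<TmSnapshot> stragglers = new ArrayList<>();
        if (snapshots.isEmpty()) {
            return stragglers;
        }
        long[] lags = snapshots.stream().mapToLong(s -> s.recordsLagged).sorted().toArray();
        // Upper median, floored at 1 so an all-zero lag does not flag every TM with any lag.
        long medianLag = Math.max(lags[lags.length / 2], 1L);
        for (TmSnapshot s : snapshots) {
            boolean hostOverloaded = s.hostCpuLoad > cpuLoadThreshold;
            boolean lagging = s.recordsLagged > lagFactor * medianLag;
            if (hostOverloaded && lagging) {
                stragglers.add(s);
            }
        }
        return stragglers;
    }

    public static void main(String[] args) {
        StragglerDetector detector = new StragglerDetector(0.9, 3.0);
        List<TmSnapshot> snapshots = List.of(
                new TmSnapshot("tm-1", "host-a", 0.30, 100L),
                new TmSnapshot("tm-2", "host-b", 0.95, 5000L),
                new TmSnapshot("tm-3", "host-c", 0.40, 120L));
        for (TmSnapshot s : detector.findStragglers(snapshots)) {
            System.out.println("relocate " + s.taskManagerId + ", avoid host " + s.host);
        }
    }
}

/** Hypothetical per-TaskManager snapshot; the fields are illustrative, not real Flink metric names. */
final class TmSnapshot {
    final String taskManagerId;
    final String host;
    final double hostCpuLoad;  // 0.0 to 1.0, assumed to come from host-level monitoring
    final long recordsLagged;  // assumed application-level lag metric

    TmSnapshot(String taskManagerId, String host, double hostCpuLoad, long recordsLagged) {
        this.taskManagerId = taskManagerId;
        this.host = host;
        this.hostCpuLoad = hostCpuLoad;
        this.recordsLagged = recordsLagged;
    }
}
```

Requiring both signals is a deliberate choice in this sketch: a busy host alone does not mean the TM is slow, and lag alone may come from data skew rather than the host.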
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797006#comment-16797006 ]

Chesnay Schepler commented on FLINK-11914:

I think we have to be careful here. This would add a new kind of operation to the REST API (cluster control), which realistically would have to be disabled by default, as I don't see many users being willing to expose a shutdown button/call to their users.

Is this supposed to work in all deployment modes, or just YARN? Conceptually, outside of standalone deployments, it should never be required for users to manually shut down TaskManagers. Shouldn't the container management (in this case YARN) ensure that a single host is not overloaded? If it isn't capable of doing so, what prevents YARN from allocating another TM on the same host?
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793176#comment-16793176 ]

Shuyi Chen commented on FLINK-11914:

[~gjy], thanks a lot for the quick reply. Yes, killing the TM process on a host requires sudo permission, and we don't give individual job owners this privilege for security reasons, as they might accidentally kill other users' jobs colocated on the same host. Also, exposing the API will allow our external monitoring service (called watchdog) to monitor TM health and programmatically disconnect a TM if it experiences issues. I see that the JobMasterGateway already has a disconnectTaskManager() interface, so it won't be too much effort to add a REST endpoint that exposes the capability. What do you think?
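For illustration only, here is roughly how an external watchdog might build a call to such an endpoint using the JDK's `java.net.http` API. The endpoint does not exist in Flink: the `/taskmanagers/<id>/disconnect` path, the use of POST, and the `DisconnectRequests` helper are all assumptions made for this sketch. The code only constructs the request and does not send it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Builds requests against a HYPOTHETICAL TaskManager-disconnect endpoint (not part of Flink's REST API). */
final class DisconnectRequests {

    /**
     * Creates a POST request for the assumed path /taskmanagers/{id}/disconnect.
     * The request is only built here; sending it is left to the caller.
     */
    static HttpRequest disconnect(String restBaseUrl, String taskManagerId) {
        // Form-style encoding; adequate for the alphanumeric/underscore IDs assumed in this sketch.
        String encodedId = URLEncoder.encode(taskManagerId, StandardCharsets.UTF_8);
        URI uri = URI.create(restBaseUrl + "/taskmanagers/" + encodedId + "/disconnect");
        return HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Base URL and TaskManager ID are placeholder values.
        HttpRequest request = disconnect("http://localhost:8081", "container_01_000002");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

A watchdog would pass the built request to `HttpClient.send(...)` and would need whatever authentication the deployment puts in front of the REST port, since this is a cluster-control operation.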
[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager
[ https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792380#comment-16792380 ]

Gary Yao commented on FLINK-11914:

[~suez1224] I assume that you need this because you are not able to run commands directly on the machines that are running the YARN NodeManagers?