[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2021-04-16 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323338#comment-17323338
 ] 

Flink Jira Bot commented on FLINK-11914:


This issue is assigned but has not received an update in 7 days so it has been 
labeled "stale-assigned". If you are still working on the issue, please give an 
update and remove the label. If you are no longer working on the issue, please 
unassign so someone else may work on it. In 7 days the issue will be 
automatically unassigned.

> Expose a REST endpoint in JobManager to kill specific TaskManager
> -
>
> Key: FLINK-11914
> URL: https://issues.apache.org/jira/browse/FLINK-11914
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / REST
>Reporter: Shuyi Chen
>Assignee: Shuyi Chen
>Priority: Major
>  Labels: stale-assigned
>
> We want to add the capability in the Flink web UI to kill an individual TM by 
> clicking a button. This would first require exposing the capability through a 
> REST API endpoint. The reason is that some TMs might end up running on a 
> heavily loaded YARN host over time, and we want to kill just that TM and have 
> the Flink JM reallocate a TM to restart the job graph. The other approach 
> would be to restart the entire YARN job, which is heavyweight.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2020-11-06 Thread Matthias (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227308#comment-17227308
 ] 

Matthias commented on FLINK-11914:
--

Hi [~suez1224], I visited this issue as part of the backlog grooming of the 
Engine team and realized that it has been idle for some time now. It looks like 
there is no need for this feature for now. Hence, I would close it if there is 
no objection from your side.



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-04-09 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813432#comment-16813432
 ] 

Till Rohrmann commented on FLINK-11914:
---

I agree with [~suez1224], we should not expose any Akka endpoints to the 
external world. Akka should exclusively be used for internal cluster 
communication.



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-04-04 Thread Shuyi Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810193#comment-16810193
 ] 

Shuyi Chen commented on FLINK-11914:


Hi [~feng.xu], I don't think we should expose an Akka endpoint, because Akka is 
an internal implementation detail and, AFAIK, the community is trying to 
deprecate the use of Akka in Flink. Thanks.



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-03-28 Thread Feng Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804300#comment-16804300
 ] 

Feng Xu commented on FLINK-11914:
-

Hi [~suez1224] and [~Zentol],

Alternatively, we could have a command line tool that sends the Akka message 
for the disconnection to the JobManager, so there is no need to expose the 
endpoint or add a UI button. Of course, we would need appropriate access 
control on this.

 



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-03-27 Thread Shuyi Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803209#comment-16803209
 ] 

Shuyi Chen commented on FLINK-11914:


Hi [~Zentol], thanks a lot for the comments. cc [~till.rohrmann], since we had 
some offline discussion as well.

Currently, the YARN resource scheduler does not take dynamic resource usage 
into account. Also, over time, the resource usage of some containers might 
increase, or some containers might use more than they asked for and thus 
oversubscribe the host's resources. The resource causing the lag might be 
CPU, memory, file descriptors, disk, or network, or even some 
application-specific cause. This commonly happens in a shared cluster, and it 
is not possible for the resource scheduler to predict and regulate runtime 
resource usage effectively. As in other frameworks such as MapReduce or Spark, 
if there is a straggler task, it should be the responsibility of the 
framework, not the resource scheduler, to restart that task on a different 
node, since the resource scheduler has no idea what it means for one 
container to be slow.

I think exposing an endpoint to disconnect a TM will enable us to build an 
external monitor/controller that recovers the Flink job by relocating the 
straggling TM. The external controller will synthesize information from Flink 
metrics, application metrics, and host metrics to determine whether a TM is 
straggling and, if so, relocate it. This will greatly help scale our platform 
to manage more Flink jobs.
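
As a rough illustration of the kind of decision such an external controller 
might make, here is a minimal sketch in Python. It assumes a single per-TM 
signal (for example, a lag figure already aggregated from Flink, application, 
and host metrics); the function name, the metric, and the thresholds are all 
hypothetical and not part of Flink.

```python
from statistics import median


def find_stragglers(lag_by_tm, factor=3.0, min_lag=1000.0):
    """Return the IDs of TaskManagers whose lag is far above the median.

    lag_by_tm maps a TaskManager ID to one aggregated lag value. Both the
    metric and the thresholds are illustrative assumptions.
    """
    if not lag_by_tm:
        return []
    typical = median(lag_by_tm.values())
    # Apply a floor so that a mostly idle cluster with near-zero lag does
    # not flag every small fluctuation as a straggler.
    threshold = max(typical * factor, min_lag)
    return sorted(tm for tm, lag in lag_by_tm.items() if lag > threshold)


lags = {"tm-1": 120.0, "tm-2": 150.0, "tm-3": 9000.0, "tm-4": 130.0}
print(find_stragglers(lags))
```

Comparing against the median rather than the mean keeps one extreme outlier 
from raising the threshold that is supposed to detect it.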

Also, you are correct that the same slow host might get allocated again after 
the kill. To mitigate the issue, I propose adding a reason parameter to the 
API and letting the Flink resource scheduler blacklist that host when 
acquiring resources from YARN/Mesos.
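
The blacklisting idea could look roughly like the following Python sketch. It 
is not Flink code: the class, its methods, and the expiry time (added here so 
that a host is not excluded forever) are assumptions made for illustration.

```python
import time


class HostBlacklist:
    """Tracks hosts to avoid when accepting new containers (hypothetical)."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # host -> (reason, expiry time)

    def add(self, host, reason):
        # Record why the host was excluded and when the entry expires.
        self._entries[host] = (reason, self._clock() + self._ttl)

    def is_blocked(self, host):
        entry = self._entries.get(host)
        if entry is None:
            return False
        if self._clock() >= entry[1]:
            # The entry has expired, so the host becomes eligible again.
            del self._entries[host]
            return False
        return True

    def filter_offers(self, hosts):
        # Keep only the hosts that are not currently blacklisted.
        return [h for h in hosts if not self.is_blocked(h)]


blacklist = HostBlacklist(ttl_seconds=600.0)
blacklist.add("host-a", "sustained CPU overload")
print(blacklist.filter_offers(["host-a", "host-b"]))
```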

With regard to adding a UI button for this, I understand your concern, and we 
can discuss the need in a follow-up.



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-03-20 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797006#comment-16797006
 ] 

Chesnay Schepler commented on FLINK-11914:
--

I think we have to be careful here. This would add a new kind of operation to 
the REST API (cluster control), which realistically would have to be disabled 
by default, as I don't see many users being willing to expose a shutdown 
button/call to their users.

Is this supposed to work in all deployment modes, or just YARN?

Conceptually, outside of standalone deployments, it should never be required 
for users to manually shut down TaskManagers. Shouldn't the container 
management (in this case YARN) ensure that a single host is not overloaded? If 
it isn't capable of doing so, what prevents YARN from allocating another TM on 
the same host?

 



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-03-14 Thread Shuyi Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793176#comment-16793176
 ] 

Shuyi Chen commented on FLINK-11914:


[~gjy], thanks a lot for the quick reply. Yes, killing the TM process on a 
host would require sudo permission. We don't allow individual job owners to 
have this privilege for security reasons, as they might accidentally kill 
another user's job colocated on the same host.

Also, exposing the API will allow our external monitoring service (called 
watchdog) to monitor TM health and programmatically disconnect a TM if it 
experiences issues. I see that JobMasterGateway already has a 
disconnectTaskManager() method, so it won't take much effort to add a REST 
endpoint that exposes the capability. What do you think?
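
To make the watchdog side concrete, here is a minimal Python sketch of how an 
external service might build such a call. The DELETE method and the 
/taskmanagers/<id> path are assumptions for illustration only; this issue 
proposes the endpoint, and no such endpoint is confirmed to exist. The 
function name and host name are likewise hypothetical, and the request is only 
constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_disconnect_request(base_url, taskmanager_id):
    """Build, without sending, a request to a hypothetical disconnect endpoint."""
    # Percent-encode the ID so that unusual characters cannot alter the path.
    path = "/taskmanagers/" + quote(taskmanager_id, safe="")
    return Request(base_url.rstrip("/") + path, method="DELETE")


req = build_disconnect_request("http://jobmanager:8081", "container_01")
print(req.get_method(), req.full_url)
```

Since this is a cluster-control operation, a real watchdog would also need to 
authenticate, which the sketch leaves out.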



[jira] [Commented] (FLINK-11914) Expose a REST endpoint in JobManager to kill specific TaskManager

2019-03-13 Thread Gary Yao (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792380#comment-16792380
 ] 

Gary Yao commented on FLINK-11914:
--

[~suez1224] I assume that you need this because you are not able to run 
commands directly on the machines that are running the YARN NodeManagers?
