[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2021-04-29 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336811#comment-17336811
 ] 

Flink Jira Bot commented on FLINK-5621:
---

This issue was labeled "stale-major" 7 days ago and has not received any 
updates, so it is being deprioritized. If this ticket is actually Major, please 
raise the priority and ask a committer to assign you the issue or revive the 
public discussion.


> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers 
> with operational issues
> 
>
> Key: FLINK-5621
> URL: https://issues.apache.org/jira/browse/FLINK-5621
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Affects Versions: 1.1.4
> Reporter: Jamie Grier
> Priority: Major
> Labels: stale-major
>
> There are cases where jobs can get into a state where no progress can be made 
> if there is something pathologically wrong with one of the TaskManager nodes 
> in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk 
> space.  Flink never considers the TM to be "bad" and will keep using it to 
> attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will 
> commit suicide if that TM was the source of an exception that caused a job to 
> fail/restart.
> I'm sure there are plenty of other approaches to solving this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328839#comment-17328839
 ] 

Flink Jira Bot commented on FLINK-5621:
---

This major issue is unassigned, and neither it nor any of its Sub-Tasks has 
been updated for 30 days, so it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update, and then 
remove the label. In 7 days the issue will be deprioritized.



[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2019-05-16 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841238#comment-16841238
 ] 

Till Rohrmann commented on FLINK-5621:
--

I think this issue is closely related to a general blacklisting mechanism for 
TMs in a Flink cluster. There was a [design 
discussion|http://osdir.com/apache-flink-development/msg09858.html] on the dev 
ML and also a design document. Since this is a bigger feature, we should first 
reach consensus on how to do it.

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers 
> with operational issues
> 
>
> Key: FLINK-5621
> URL: https://issues.apache.org/jira/browse/FLINK-5621
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Coordination
> Affects Versions: 1.1.4
> Reporter: Jamie Grier
> Assignee: vinoyang
> Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2019-05-16 Thread vinoyang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841028#comment-16841028
 ] 

vinoyang commented on FLINK-5621:
-

Hi [~till.rohrmann], can you tell me what your opinion is now?



[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2018-03-02 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383439#comment-16383439
 ] 

Till Rohrmann commented on FLINK-5621:
--

Hi [~yanghua], I think such a feature would indeed be a nice addition to 
Flink. Blacklisting TMs with known issues could be done in the 
{{ResourceManager}}. We could also add an RPC call that tells the {{TMs}} to 
shut down in such a case.
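
To make the idea concrete, here is a minimal sketch of the bookkeeping such a 
blacklist might need. It is not Flink's actual {{ResourceManager}} API: the 
class {{TaskManagerBlacklist}}, the failure threshold, and the 
{{shutdownRequest}} callback (standing in for the RPC call) are all 
hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Flink. */
final class TaskManagerBlacklist {
    private final int maxFailures;
    // Stands in for the RPC call that tells a TaskManager to shut down.
    private final Consumer<String> shutdownRequest;
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    TaskManagerBlacklist(int maxFailures, Consumer<String> shutdownRequest) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be at least 1");
        }
        this.maxFailures = maxFailures;
        this.shutdownRequest = shutdownRequest;
    }

    /**
     * Records a task failure attributed to the given TaskManager.
     * Returns true if the TaskManager is blacklisted after this report.
     */
    boolean reportFailure(String taskManagerId) {
        int count = failureCounts.merge(taskManagerId, 1, Integer::sum);
        // Set.add returns true only on first insertion, so the shutdown
        // request is sent exactly once per TaskManager.
        if (count >= maxFailures && blacklisted.add(taskManagerId)) {
            shutdownRequest.accept(taskManagerId);
        }
        return blacklisted.contains(taskManagerId);
    }

    /** The scheduler would skip TaskManagers for which this returns false. */
    boolean isSchedulable(String taskManagerId) {
        return !blacklisted.contains(taskManagerId);
    }
}
```

A real design would also need a way to take a TM off the blacklist again, for 
example after a timeout, which this sketch leaves out.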

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers 
> with operational issues
> 
>
> Key: FLINK-5621
> URL: https://issues.apache.org/jira/browse/FLINK-5621
> Project: Flink
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.1.4
> Reporter: Jamie Grier
> Priority: Critical


[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2018-03-01 Thread vinoyang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381763#comment-16381763
 ] 

vinoyang commented on FLINK-5621:
-

Hi [~till.rohrmann], what's your opinion about this idea? Since Flink 1.5, the 
snapshots produced by its local recovery feature may also frequently lead to 
insufficient disk space. If we collect the TaskManagers' metrics and mark the 
TaskManagers according to some rules, the ResourceManager can treat the marked 
TaskManagers as 'dangerous', and the scheduler can then avoid these TMs.
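
As a rough illustration of what "marking by rules" could look like, the sketch 
below evaluates named predicates against a metrics snapshot. Everything in it 
is an assumption made for the example: {{TaskManagerMetrics}}, its single 
{{freeDiskBytes}} field, and {{HealthRules}} are hypothetical and not part of 
Flink's metrics system.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical metrics snapshot for one TaskManager; the field is illustrative. */
final class TaskManagerMetrics {
    final String taskManagerId;
    final long freeDiskBytes;

    TaskManagerMetrics(String taskManagerId, long freeDiskBytes) {
        this.taskManagerId = taskManagerId;
        this.freeDiskBytes = freeDiskBytes;
    }
}

/** Hypothetical rule set: each named rule returns true when it is violated. */
final class HealthRules {
    // LinkedHashMap keeps rules in registration order for stable reporting.
    private final Map<String, Predicate<TaskManagerMetrics>> rules = new LinkedHashMap<>();

    HealthRules addRule(String name, Predicate<TaskManagerMetrics> isViolated) {
        rules.put(name, isViolated);
        return this;
    }

    /** Returns the names of all violated rules; an empty list means healthy. */
    List<String> violations(TaskManagerMetrics metrics) {
        List<String> violated = new ArrayList<>();
        for (Map.Entry<String, Predicate<TaskManagerMetrics>> rule : rules.entrySet()) {
            if (rule.getValue().test(metrics)) {
                violated.add(rule.getKey());
            }
        }
        return violated;
    }

    /** A TaskManager counts as 'dangerous' if it violates at least one rule. */
    boolean isDangerous(TaskManagerMetrics metrics) {
        return !violations(metrics).isEmpty();
    }
}
```

A rule for the disk-space case might be registered as 
{{new HealthRules().addRule("low-disk", m -> m.freeDiskBytes < 1_000_000_000L)}}, 
where the 1 GB threshold is an arbitrary example value.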



[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

2018-02-28 Thread vinoyang (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381479#comment-16381479
 ] 

vinoyang commented on FLINK-5621:
-

Hi [~jgrier], what if we introduced a "TM tag / label" mechanism (like YARN 
node labels) for standalone clusters to mark different types of TaskManagers, 
for example "disk space insufficient", "network congestion", and so on? The 
task scheduler would pay attention to critical tags and avoid the risk of 
potential task failures. We could also report the tags as metrics and show 
them in the web interface so that DevOps can monitor their nodes.
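
A minimal sketch of how a scheduler might consult such tags follows. It 
assumes tags are plain strings and that a fixed set of them is treated as 
critical; {{TagAwareSelector}} and its methods are hypothetical names, not an 
existing Flink or YARN interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of tag-based TaskManager filtering. */
final class TagAwareSelector {
    private final Set<String> criticalTags;
    private final Map<String, Set<String>> tagsByTaskManager = new HashMap<>();

    TagAwareSelector(Set<String> criticalTags) {
        this.criticalTags = new HashSet<>(criticalTags);
    }

    /** Attaches a tag such as "disk-space-insufficient" to a TaskManager. */
    void tag(String taskManagerId, String tag) {
        tagsByTaskManager.computeIfAbsent(taskManagerId, id -> new HashSet<>()).add(tag);
    }

    /** Removes a tag once the underlying condition has cleared. */
    void untag(String taskManagerId, String tag) {
        Set<String> tags = tagsByTaskManager.get(taskManagerId);
        if (tags != null) {
            tags.remove(tag);
        }
    }

    /** Returns the candidates that carry no critical tag, in input order. */
    List<String> eligible(List<String> candidateIds) {
        List<String> result = new ArrayList<>();
        for (String id : candidateIds) {
            Set<String> tags = tagsByTaskManager.getOrDefault(id, Collections.emptySet());
            if (Collections.disjoint(tags, criticalTags)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

Non-critical tags stay attached without affecting scheduling, so they could 
still be reported as metrics or shown in the web interface.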

We are considering this feature for our internal Flink version at Tencent.
