[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336811#comment-17336811 ]

Flink Jira Bot commented on FLINK-5621:
---------------------------------------

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Major
>              Labels: stale-major
>
> There are cases where jobs can get into a state where no progress can be made
> if there is something pathologically wrong with one of the TaskManager nodes
> in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk
> space. Flink never considers the TM to be "bad" and will keep using it to
> attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will
> commit suicide if that TM was the source of an exception that caused a job to
> fail/restart.
> I'm sure there are plenty of other approaches to solving this.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328839#comment-17328839 ]

Flink Jira Bot commented on FLINK-5621:
---------------------------------------

This major issue is unassigned, and neither it nor any of its Sub-Tasks has been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841238#comment-16841238 ]

Till Rohrmann commented on FLINK-5621:
--------------------------------------

I think this issue is tightly related to a general blacklisting mechanism for TMs in a Flink cluster. There was a [design discussion|http://osdir.com/apache-flink-development/msg09858.html] on the dev ML and also a design document. Since this is a bigger feature, we should first reach consensus on how to do it.

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Assignee: vinoyang
>            Priority: Major
[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841028#comment-16841028 ]

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~till.rohrmann], can you tell me what your opinion is now?
[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383439#comment-16383439 ]

Till Rohrmann commented on FLINK-5621:
--------------------------------------

Hi [~yanghua], I think such a feature would indeed be a nice addition to Flink. Blacklisting TMs with known issues could be done in the {{ResourceManager}}. We could also add an RPC call which tells the {{TMs}} to shut down in such a case.

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Critical
[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381763#comment-16381763 ]

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~till.rohrmann], what's your opinion about this idea? Since Flink 1.5+, the snapshots produced by its local recovery feature may also frequently exhaust the available disk space. If we collect the TaskManagers' metrics and evaluate them against some rules, the ResourceManager can mark the affected TaskManagers as 'dangerous', and the scheduler can then avoid these TMs.
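The rule-based marking described in this comment could look roughly like the following. This is only a minimal sketch under stated assumptions, not Flink code: `TmMetrics`, `HealthRule`, and `TmHealthEvaluator` are hypothetical names, and the metric fields (free disk bytes, recent task failures) are illustrative choices that do not come from the ticket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical metrics snapshot for one TaskManager; the fields are illustrative only. */
final class TmMetrics {
    final String tmId;
    final long freeDiskBytes;
    final int recentTaskFailures;

    TmMetrics(String tmId, long freeDiskBytes, int recentTaskFailures) {
        this.tmId = tmId;
        this.freeDiskBytes = freeDiskBytes;
        this.recentTaskFailures = recentTaskFailures;
    }
}

/** Hypothetical named rule: the predicate returns true when the rule is violated. */
final class HealthRule {
    final String name;
    final Predicate<TmMetrics> violated;

    HealthRule(String name, Predicate<TmMetrics> violated) {
        this.name = name;
        this.violated = violated;
    }
}

/** Evaluates every rule against a metrics snapshot (assumed to run on the ResourceManager side). */
final class TmHealthEvaluator {
    private final List<HealthRule> rules;

    TmHealthEvaluator(List<HealthRule> rules) {
        this.rules = new ArrayList<>(rules);
    }

    /** Names of all violated rules, in rule order; an empty list means no rule was violated. */
    List<String> violations(TmMetrics metrics) {
        List<String> names = new ArrayList<>();
        for (HealthRule rule : rules) {
            if (rule.violated.test(metrics)) {
                names.add(rule.name);
            }
        }
        return names;
    }

    /** A TaskManager counts as 'dangerous' as soon as one rule is violated. */
    boolean isDangerous(TmMetrics metrics) {
        return !violations(metrics).isEmpty();
    }
}
```

A scheduler could then skip every TaskManager for which `isDangerous` returns true. Returning the rule names instead of a plain boolean keeps the reason for the marking available for logging or display.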
[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381479#comment-16381479 ]

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~jgrier], what if we introduced a "TM tag / label" mechanism (like YARN node labels) for standalone clusters to mark different types of TaskManagers, for example "disk space insufficient", "network congestion", and so on? The task scheduler would pay attention to critical tags and avoid the risk of task failures. We could also report the tags as metrics and show them in the web interface so that DevOps can monitor their nodes. We are thinking about this feature in our internal Flink version at Tencent.
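The tag / label idea in this comment can be sketched as a small registry that a scheduler consults before placing tasks. Everything here is an assumption made for illustration: `TmTag` and `TmTagRegistry` are hypothetical names that do not exist in Flink, the two tags simply mirror the examples in the comment, and which tags count as "critical" is left to the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical tags, mirroring the examples given in the comment. */
enum TmTag { DISK_SPACE_INSUFFICIENT, NETWORK_CONGESTION }

/** Hypothetical registry mapping a TaskManager id to the tags currently attached to it. */
final class TmTagRegistry {
    // Tags that exclude a TaskManager from scheduling (an assumed policy, supplied by the caller).
    private final Set<TmTag> criticalTags;
    // LinkedHashMap keeps registration order so results are deterministic.
    private final Map<String, Set<TmTag>> tagsByTm = new LinkedHashMap<>();

    TmTagRegistry(Set<TmTag> criticalTags) {
        this.criticalTags = new HashSet<>(criticalTags);
    }

    /** Registers a TaskManager without any tags; does nothing if it is already known. */
    void register(String tmId) {
        tagsByTm.putIfAbsent(tmId, EnumSet.noneOf(TmTag.class));
    }

    /** Attaches a tag, registering the TaskManager first if necessary. */
    void tag(String tmId, TmTag tag) {
        tagsByTm.computeIfAbsent(tmId, id -> EnumSet.noneOf(TmTag.class)).add(tag);
    }

    /** Removes a tag, for example once the operational issue has been resolved. */
    void untag(String tmId, TmTag tag) {
        Set<TmTag> tags = tagsByTm.get(tmId);
        if (tags != null) {
            tags.remove(tag);
        }
    }

    /** Ids of all TaskManagers that carry no critical tag, in registration order. */
    List<String> schedulable() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<TmTag>> entry : tagsByTm.entrySet()) {
            if (Collections.disjoint(entry.getValue(), criticalTags)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With only `DISK_SPACE_INSUFFICIENT` treated as critical, a TaskManager tagged `NETWORK_CONGESTION` stays schedulable while still exposing the tag for metrics or a web interface. Separating "has a tag" from "tag is critical" is one possible way to support both the monitoring and the scheduling use described above.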