[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Attachment: (was: TEZ-3718.4.patch)

> Better handling of 'bad' nodes
> --
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch,
> TEZ-3718.4.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to
> do nothing other than not schedule new tasks on this node.
> The alternative, via config, is to retroactively kill every task which ran on
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of
> relying on a timeout which leads to the attempt being marked as FAILED after
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure
> heuristics. Normally source tasks require multiple consumers to report
> failure for them to be marked as bad. If a single consumer reports failure
> against a source which ran on a bad node, consider it bad and re-schedule
> immediately. (Otherwise failures can take a while to propagate, and jobs get
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to
> restart sources which ran on a bad node. Also, running tasks are being
> counted as FAILURES instead of KILLS.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
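The second proposal in the description above (treat a single consumer report as sufficient when the source ran on a node already known to be bad) can be sketched roughly as follows. This is only an illustration of the described heuristic, not code from any of the attached patches: the class `BadNodeHeuristic`, its methods, and the `consumersRequired` threshold are all hypothetical names, and the real Tez failure heuristics may weigh other inputs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are taken from Tez. */
final class BadNodeHeuristic {
    // Assumed threshold: distinct consumers that must normally report a
    // read failure before a source task is considered bad.
    private final int consumersRequired;
    // Nodes that have been reported as bad so far.
    private final Set<String> badNodes = new HashSet<>();

    BadNodeHeuristic(int consumersRequired) {
        this.consumersRequired = consumersRequired;
    }

    void markNodeBad(String nodeId) {
        badNodes.add(nodeId);
    }

    /**
     * Decide whether a source task should be re-scheduled.
     *
     * @param sourceNodeId node on which the source task ran
     * @param distinctConsumersReporting consumers that reported a failure
     *        against this source
     */
    boolean shouldRerunSource(String sourceNodeId, int distinctConsumersReporting) {
        if (distinctConsumersReporting <= 0) {
            return false; // nobody has complained about this source
        }
        if (badNodes.contains(sourceNodeId)) {
            return true; // one report is enough for a source on a bad node
        }
        // Otherwise fall back to requiring multiple consumers.
        return distinctConsumersReporting >= consumersRequired;
    }
}
```

With a threshold of 3, a single report against a source on a healthy node leaves the source alone, while the same single report re-schedules it once its node has been marked bad.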
[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Attachment: TEZ-3718.4.patch
[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Attachment: TEZ-3718.4.patch

Addressed the previous comments in a new patch. This patch keeps the old behavior of killing nothing, but that behavior won't work together with the task fail-fast feature. Please help review. CC [~sseth], [~jlowe].
[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Target Version/s: 0.9.next (was: 0.9.0)
[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Attachment: TEZ-3718.3.patch
[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Attachment: TEZ-3718.2.patch

Uploaded a new patch to address comments.
[jira] [Updated] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang updated TEZ-3718:
--
    Attachment: TEZ-3718.1.patch

Uploaded a patch that supports keeping completed tasks and failing tasks fast on an unhealthy node. Please help review, [~sseth], [~jlowe].
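For readers who want a concrete picture of the behaviour discussed in this message (completed work is kept, while attempts still running on the unhealthy node are ended right away and recorded as KILLED rather than left to time out as FAILED), here is a minimal sketch. It is an assumption-laden illustration, not the contents of TEZ-3718.1.patch: `UnhealthyNodeHandler`, `Attempt`, and `AttemptState` are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these types are not part of Tez. */
final class UnhealthyNodeHandler {
    enum AttemptState { RUNNING, SUCCEEDED, KILLED, FAILED }

    static final class Attempt {
        final String id;
        final String nodeId;
        AttemptState state;

        Attempt(String id, String nodeId, AttemptState state) {
            this.id = id;
            this.nodeId = nodeId;
            this.state = state;
        }
    }

    /**
     * Kill attempts that are still RUNNING on the unhealthy node and leave
     * every other attempt (including SUCCEEDED ones on that node) untouched.
     *
     * @return ids of the attempts that were moved to KILLED
     */
    static List<String> onNodeUnhealthy(String nodeId, List<Attempt> attempts) {
        List<String> killed = new ArrayList<>();
        for (Attempt a : attempts) {
            if (a.nodeId.equals(nodeId) && a.state == AttemptState.RUNNING) {
                a.state = AttemptState.KILLED; // a KILL, not a FAILURE
                killed.add(a.id);
            }
        }
        return killed;
    }
}
```

Marking the attempt KILLED immediately, instead of waiting for a timeout to mark it FAILED, matches the first proposal in the issue description; whether completed output on the node is later re-run is left to the failure heuristics.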