[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796234#comment-16796234 ]

TezQA commented on TEZ-3718:

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 4s | TEZ-3718 does not apply to master. Rebase required? Wrong branch? See https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3718 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/128/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> Better handling of 'bad' nodes
> ------------------------------
>
> Key: TEZ-3718
> URL: https://issues.apache.org/jira/browse/TEZ-3718
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Zhiyuan Yang
> Priority: Major
> Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch, TEZ-3718.4.patch
>
> At the moment, the default behaviour when a node is marked bad is to do nothing other than not schedule new tasks on that node.
> The alternative, via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.
> Proposing the following changes:
> 1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout, which leads to the attempt being marked as FAILED after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower.)
> [~jlowe] - I think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and restart sources which ran on a bad node. Also, running tasks being counted as FAILURES instead of KILLS.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
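The failure heuristic proposed in point 2 of the description can be sketched as follows. This is an illustrative model only: the class, method names, and the threshold value are invented for the example and do not come from the Tez patch.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the bad-node-aware failure heuristic: a source attempt is
// normally re-run only after several consumers report fetch failures
// against it, but if the attempt ran on a node that has since been marked
// bad, a single report is enough to reschedule it immediately.
public class BadNodeHeuristicSketch {
    // Illustrative; the real threshold in Tez is configurable.
    static final int NORMAL_FAILURE_THRESHOLD = 3;

    private final Set<String> badNodes = new HashSet<>();

    // Called when the cluster reports a node as unhealthy/blacklisted.
    public void markNodeBad(String nodeId) {
        badNodes.add(nodeId);
    }

    // true => mark the source attempt bad and reschedule it immediately.
    public boolean shouldReschedule(String sourceNodeId, int consumerFailureReports) {
        if (badNodes.contains(sourceNodeId)) {
            // Fast path: the node is known bad, don't wait for more reports.
            return consumerFailureReports >= 1;
        }
        return consumerFailureReports >= NORMAL_FAILURE_THRESHOLD;
    }
}
```

This captures why the proposal speeds jobs up: without the fast path, every consumer of a task that ran on a dead node must independently hit a fetch failure before the source is re-run.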
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548570#comment-16548570 ]

TezQA commented on TEZ-3718:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch against master revision 7e397b4.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified test files.
-1 javac. The applied patch generated 180 javac compiler warnings (more than the master's current 177 warnings).
+1 javadoc. There were no new javadoc warning messages.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests: org.apache.tez.test.TestRecovery

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2866//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/2866//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2866//console

This message is automatically generated.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548343#comment-16548343 ]

Zhiyuan Yang commented on TEZ-3718:

This one has been pending review for a long time; a review would be greatly appreciated. But feel free to drop this from the release.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548325#comment-16548325 ]

Eric Wohlstadter commented on TEZ-3718:

[~kshukla] [~jlowe] I'm thinking we'll need to drop this from the 0.9.2 and 0.10 releases, since the assignee is not active and no one else has picked this up. Does that sound right to you?
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199602#comment-16199602 ]

TezQA commented on TEZ-3718:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch against master revision c82b2ea.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified test files.
-1 javac. The applied patch generated 25 javac compiler warnings (more than the master's current 24 warnings).
+1 javadoc. There were no new javadoc warning messages.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests.

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//console

This message is automatically generated.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090848#comment-16090848 ]

Siddharth Seth commented on TEZ-3718:

Patch mostly looks good to me. More at the end, though, on potential changes.
- TaskAttemptEventNodeFailed - the failure reason is still a boolean. This can be an enum as well.
- Config still being read in TAImpl. To your point about it already being read, that needs to be changed. Trying to make sure there's no access beyond the Vertex level at most. (Configuration has historically been slow to access.) This should be a simple change via getVertex.getVertexConfig.
- Changing the config parameter node-unhealthy-reschedule-tasks is an incompatible change. It should be deprecated and a new one introduced. Filed TEZ-3799 to make blacklisting behave the same.

This does change behaviour to kill the currently running task on the container irrespective of the config setting. (With the previous setting, not only would old tasks not re-run, the current one would not be terminated either.)

[~jlowe], [~rohini] - based on the offline conversation we had about this, the preference was to have this configurable. With the current patch, trying to make this configurable is a big change to the Container state machine. It's wired to complete in case of a node failure (which is correct IMHO), and if it completes, the running task will end up completing.

One possible way to handle this: retain the old behaviour (AMNode will not send out events - and this can be covered by the current config). If this is enabled, the old behaviour continues, and the changes to Task are irrelevant (the new configs don't apply). With the old config set to send Container termination messages, the new flags can kick in. The running task will be killed. For completed ones, fast exits can be enabled in case of Input errors.
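The first review point above suggests replacing the boolean failure flag on TaskAttemptEventNodeFailed with an enum. A hypothetical sketch of that shape is below; the class and constant names mirror those mentioned in the review, but this simplified stand-in is not the actual Tez event class.

```java
// Hypothetical sketch of the review suggestion: carry an enum reason on the
// node-failed event instead of a bare "isUnhealthy" boolean, so consumers
// can distinguish why the node was lost. Not the real Tez API.
public class NodeFailureEventSketch {

    // The enum the review asks for (UNHEALTHY vs. BLACKLISTED).
    public enum NodeFailureReason { UNHEALTHY, BLACKLISTED }

    // Simplified stand-in for TaskAttemptEventNodeFailed.
    public static class TaskAttemptEventNodeFailed {
        private final NodeFailureReason reason;

        public TaskAttemptEventNodeFailed(NodeFailureReason reason) {
            this.reason = reason;
        }

        public NodeFailureReason getReason() {
            return reason;
        }
    }

    // Consumers can now branch on the reason rather than a boolean flag,
    // and new reasons can be added without changing the event's signature.
    public static String describe(TaskAttemptEventNodeFailed event) {
        switch (event.getReason()) {
            case UNHEALTHY:
                return "node reported unhealthy";
            case BLACKLISTED:
                return "node blacklisted after task failures";
            default:
                return "unknown";
        }
    }
}
```

The advantage over a boolean is extensibility: a third failure reason (say, decommissioning) becomes a new enum constant rather than another flag threaded through the event.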
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078469#comment-16078469 ]

Zhiyuan Yang commented on TEZ-3718:

Weird - neither the javac warning nor the test failure appears on my local build.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1607#comment-1607 ]

TezQA commented on TEZ-3718:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12875797/TEZ-3718.3.patch against master revision b915a07.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified test files.
-1 javac. The applied patch generated 25 javac compiler warnings (more than the master's current 24 warnings).
+1 javadoc. There were no new javadoc warning messages.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests: org.apache.tez.client.TestTezClient

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2565//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/2565//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2565//console

This message is automatically generated.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075195#comment-16075195 ]

Zhiyuan Yang commented on TEZ-3718:

Moving this out to unblock the 0.9 release.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075194#comment-16075194 ]

Zhiyuan Yang commented on TEZ-3718:

Thanks [~sseth] for the review!

bq. Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY, BLACKLISTED)

Used an enum in the new patch.

bq. Don't change container state - allow a task action to change the state via a STOP_REQUEST depending on task-level configs. OR, if there are no running fragments on the container, change state to a COMPLETED state, so that new task allocations are not accepted.

In the third patch, I removed the 'do nothing' behavior previously introduced by TEZ-2972. The sole purpose of that behavior was to avoid unnecessary rescheduling of completed tasks. With the third patch we can do this more precisely, so there is no need to keep the 'do nothing' behavior. The semantics of the configuration are changed accordingly. With that said, there is no issue with sending TA_CONTAINER_TERMINATING from the container to TAImpl, since a running task should always be killed.

bq. Don't read from a Configuration instance within each AMContainer / TaskAttemptImpl - there's example code on how to avoid this in TaskImpl/TaskAttemptImpl

Didn't change this part. Are you saying this because of a locking issue? Even in TAImpl, configuration is read by each instance, just in a different place (the constructor).
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067111#comment-16067111 ]

Jason Lowe commented on TEZ-3718:

bq. I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted differently from each other w.r.t the config which determines whether tasks need to be restarted or not.

I'm not sure I know the full history here, but unhealthy vs. blacklisted can represent different contexts. For example, a node could go unhealthy because too many disks have failed. In that case we probably want to proactively re-run upstream tasks rather than wait for the fetch failures.

Node blacklisting is a little different from unhealthy nodes. If a task runs and fails before completing, we may want to blacklist that node to prevent other tasks from also failing to complete on that node. But if we have _completed_ tasks on that blacklisted node, there's a decent chance we can complete the shuffle despite the fact that tasks are failing. For example, suppose a task needs to use a GPU, but something about the GPU setup on that node causes all tasks trying to use it to crash. If tasks that didn't need the GPU ran and succeeded on the node, why are we proactively re-running them rather than just fetching their inputs? That could be a huge waste of work and end up being a large performance hit to the job. TEZ-3072 was filed because of behavior like this.

It all comes down to these two questions:
- If a node is unhealthy, is it likely I won't be able to successfully shuffle data from it?
- If a node is blacklisted, is it likely I won't be able to successfully shuffle data from it?

If the answer to both of them is always the same regarding whether we re-run completed tasks, then yes, we should treat them equivalently. I think there are cases where we would want them to be different.

Full disclosure -- we've been running in a mode where we do _not_ re-run completed tasks on nodes if they go unhealthy or are blacklisted. We found many cases where the node was still able to shuffle most (sometimes all) of the completed data for tasks despite being declared unhealthy or blacklisted. In short, re-running was causing more problems than it was fixing for us, so now we simply wait for the fetch failures. It's not always optimal, of course, and there are cases where proactively re-running would have been preferable.
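The two questions above amount to a per-reason policy decision: should completed work on a lost node be proactively re-run? A minimal sketch of that decision, with the two cases kept independently configurable as the comment argues for, might look like this. All names here are invented for illustration and are not from the Tez patch.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative policy sketch: decide whether tasks that COMPLETED on a lost
// node should be proactively re-run, with a separate answer per failure
// reason. E.g. unhealthy-because-disks-failed likely breaks shuffle, while
// blacklisted-because-GPU-tasks-crash may leave shuffle perfectly usable.
public class RerunPolicySketch {
    public enum Reason { UNHEALTHY, BLACKLISTED }

    // Per-reason answer to "re-run completed tasks from such a node?".
    private final Map<Reason, Boolean> rerunCompleted = new EnumMap<>(Reason.class);

    public RerunPolicySketch(boolean rerunOnUnhealthy, boolean rerunOnBlacklisted) {
        rerunCompleted.put(Reason.UNHEALTHY, rerunOnUnhealthy);
        rerunCompleted.put(Reason.BLACKLISTED, rerunOnBlacklisted);
    }

    public boolean shouldRerunCompleted(Reason reason) {
        return rerunCompleted.get(reason);
    }
}
```

Treating the two reasons equivalently is then just the special case where both flags are configured to the same value.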
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065786#comment-16065786 ]

Siddharth Seth commented on TEZ-3718:

I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted differently from each other w.r.t the config which determines whether tasks need to be restarted or not. I think both can be treated the same. [~jlowe] - you may have more context on this.

The changes to the Node-related classes mostly look good to me. Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY, BLACKLISTED).

For AMContainer, not sure why the fail-task config needs to be read. Will the following work?
- When the event is received, annotate the container to say "On A Failed Node" (already done).
- Inform prior and current attempts of the node failure.
- Don't change container state - allow a task action to change the state via a STOP_REQUEST depending on task-level configs. OR, if there are no running fragments on the container, change state to a COMPLETED state, so that new task allocations are not accepted. Do not accept new tasks since nodeFailure has been set.

TaskAttempt - from a brief glance, the functionality looks good: Fail_Fast / decide whether to keep a task / cause it to be killed on a node failure.

General
- Don't read from a Configuration instance within each AMContainer / TaskAttemptImpl - there's example code on how to avoid this in TaskImpl/TaskAttemptImpl.
- Thought the configs would be the following:
  TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS=false - current, default=false
  TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING=true - new, default=true (overrides TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS)

The third config in the patch looks good.
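The proposed two-config layout above can be sketched as follows. The property strings are modeled on the TEZ_AM_NODE_UNHEALTHY_* constants mentioned in the comment, but the exact key names, the Map-based stand-in for Hadoop's Configuration, and the defaults shown are assumptions for illustration, not the patch's actual code.

```java
import java.util.Map;

// Sketch of the proposed config layout: the existing reschedule flag keeps
// its default (false), while a new kill-running flag (default true) controls
// whether RUNNING attempts on an unhealthy node are killed regardless of
// the reschedule setting. Parsed once, per the "don't re-read Configuration
// in every AMContainer/TaskAttemptImpl" review point.
public class NodeConfigSketch {
    // Key strings are illustrative; the real Tez property names may differ.
    static final String RESCHEDULE_KEY = "tez.am.node-unhealthy-reschedule-tasks";
    static final String KILL_RUNNING_KEY = "tez.am.node-unhealthy-kill-running";

    final boolean rescheduleCompleted; // re-run COMPLETED attempts from the node
    final boolean killRunning;         // KILL currently RUNNING attempts

    // Stand-in for Hadoop's Configuration: parse once into plain fields so
    // attempts read pre-parsed booleans instead of the (slow) config object.
    NodeConfigSketch(Map<String, String> conf) {
        this.rescheduleCompleted =
            Boolean.parseBoolean(conf.getOrDefault(RESCHEDULE_KEY, "false"));
        this.killRunning =
            Boolean.parseBoolean(conf.getOrDefault(KILL_RUNNING_KEY, "true"));
    }
}
```

With these defaults, running attempts on a bad node are killed promptly while completed work is left in place, matching the behaviour the thread converges on.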
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064683#comment-16064683 ] TezQA commented on TEZ-3718: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12874660/TEZ-3718.2.patch against master revision 5b0f5a0.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests.
Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2548//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2548//console
This message is automatically generated.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047013#comment-16047013 ] Siddharth Seth commented on TEZ-3718: - bq. For not accessing nodeTracker from TaskAttempt, are you suggesting notifying TaskAttempt about node health? That would be more overhead than pulling info from the node tracker, since every TaskAttempt would need to note down the node health status.
Isn't there enough information available in the message itself? Node->Container->TaskAttempt - whatever state change triggered the node event to go out, that information can be retained in the event sent out by the container.
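Carrying the reason in the event itself, as suggested, might look roughly like this (an illustrative Python sketch; the real Tez types are Java, and the class and function names here are hypothetical):

```python
from dataclasses import dataclass
from enum import Enum

class NodeEventReason(Enum):
    UNHEALTHY = 1      # the node was reported unhealthy (e.g. by the NM)
    BLACKLISTED = 2    # the AM blacklisted the node after repeated failures

@dataclass(frozen=True)
class NodeFailedEvent:
    """Hypothetical event: the node-level state change travels with it."""
    node_id: str
    reason: NodeEventReason

def notify_attempts(event, prior_attempts, current_attempt):
    # The container fans the reason out to its prior and current attempts,
    # so no attempt ever has to query the node tracker itself.
    targets = list(prior_attempts) + [current_attempt]
    return [(attempt, event.reason) for attempt in targets]
```

Because the reason rides along Node -> Container -> TaskAttempt, each receiver can decide locally without a shared lookup.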
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045142#comment-16045142 ] Zhiyuan Yang commented on TEZ-3718: --- Thanks [~sseth] for the review! Doing it the other way around makes sense, as our event should be more like a message than a command. I'll change the patch for that.
About the TaskAttempt configuration: that's for allowing the user to fall back to the original behaviour, and TaskAttempt has to check it.
For not accessing nodeTracker from TaskAttempt, are you suggesting notifying TaskAttempt about node health? That would be more overhead than pulling info from the node tracker, since every TaskAttempt would need to note down the node health status.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035239#comment-16035239 ] Siddharth Seth commented on TEZ-3718: - Can we do this a little differently? Typically entities send out events, and the receiving entity takes a decision on what to do. In this case, Nodes would always send out an event. Container and/or TaskAttempt, based on state and Configuration, would take a call on what to do next. The event from Node may still need some augmenting to indicate whether the node was blacklisted or marked "UNHEALTHY" for some other reason.
In the current patch, I'm not sure why TaskAttempt needs to look up the configuration. It can avoid accessing the nodeTracker node status, and rely upon the event instead.
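The "entities send events, receivers decide" shape described above could be sketched as follows (an illustrative Python sketch; the attempt states and return values are hypothetical, not Tez's actual state machine):

```python
def on_node_failed(attempt_state, kill_running_config):
    """Receiver-side decision: the node only reports what happened;
    the attempt decides, based on its own state and its own config."""
    if attempt_state == "RUNNING":
        # Proactively terminate as KILLED (not FAILED) when configured to,
        # instead of waiting for a heartbeat timeout.
        return "SEND_KILL" if kill_running_config else "IGNORE"
    if attempt_state == "SUCCEEDED":
        # A completed attempt's output is now suspect; remembering that lets
        # a single downstream fetch failure trigger an immediate re-run.
        return "MARK_OUTPUT_SUSPECT"
    return "IGNORE"
```

The node itself never consults task-level configs; all policy lives with the receiver.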
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031616#comment-16031616 ] Zhiyuan Yang commented on TEZ-3718: --- TestUnorderedPartitionedKVWriter is flaky, and TestMockDAGAppMaster doesn't fail on local run.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030539#comment-16030539 ] TezQA commented on TEZ-3718: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12870479/TEZ-3718.1.patch against master revision 241a7fa.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests:
org.apache.tez.runtime.library.common.writers.TestUnorderedPartitionedKVWriter
org.apache.tez.dag.app.TestMockDAGAppMaster
Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2491//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2491//console
This message is automatically generated.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005282#comment-16005282 ] Jason Lowe commented on TEZ-3718: - Last I checked, MapReduce doesn't have task duration factored into the decision to re-submit. As for container timeout vs. task timeout, I definitely agree that if we have fresh heartbeats from the container yet fail to receive a task heartbeat in the required interval then that's a task failure. If the container heartbeat timed out then we might want to treat it as a kill, although I'm a bit worried if the issue that caused the timeout is pathological in the setup and every container does it. Wouldn't we run forever, constantly killing instead of failing?
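The task-heartbeat vs. container-heartbeat distinction, including a guard for the pathological everything-times-out case raised above, might look like this (an illustrative sketch; the cap of 4 is an arbitrary example value, and the function and state names are hypothetical):

```python
def classify_timeout(task_heartbeat_fresh, container_heartbeat_fresh,
                     prior_timeout_kills=0, max_timeout_kills=4):
    """Classify a heartbeat timeout as a FAILED attempt or a KILLED one."""
    if container_heartbeat_fresh and not task_heartbeat_fresh:
        # Container is alive but the task went silent: count it against
        # the task as a failure.
        return "FAILED"
    if not container_heartbeat_fresh:
        # Whole container lost: likely an environment problem, so KILL -
        # but cap it, so a setup where every container times out cannot
        # loop forever without ever accumulating failures.
        return "KILLED" if prior_timeout_kills < max_timeout_kills else "FAILED"
    return "HEALTHY"
```

The cap is one possible answer to the "wouldn't we run forever" concern: after a few timeout-driven kills for the same task, subsequent timeouts start counting as failures.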
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005247#comment-16005247 ] Siddharth Seth commented on TEZ-3718: - bq. Not specifically related to bad node handling, but we could also improve fetch failure handling by taking the upstream task runtime into account when deciding how to handle failures. Does it really make sense to retry fetching for minutes when the upstream task can regenerate the data in a few seconds?
I believe MapReduce may already have this factored into its retry handling. Definitely makes sense to get something like this in as well.
In terms of the 'bad' indication, AMNode does track failures on a node. Again, the reason is not really known. The AM has a lot of information on what is going on in the cluster - transfer rate per node, execution rate per node, shuffle failures etc. It should, in theory, be able to make much better calls. Don't think we've ever gotten around to getting all of this connected together in the AM though.
[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes
[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003529#comment-16003529 ] Jason Lowe commented on TEZ-3718: - Sure, killing tasks that are active on a node that was marked bad can make sense, and it also makes sense to be more sensitive to rescheduling upstream tasks when downstream tasks start reporting failures, given we already have other evidence the node is bad. I'm not a big fan of declaring all upstream tasks bad that ran on the node, since this often creates as many problems as it solves. In some cases a node can be declared unhealthy but still be able to serve up shuffle data. Unfortunately the 'bad' indication is just a boolean, so we don't get the required fidelity to know whether re-running all tasks really makes sense.
Not specifically related to bad node handling, but we could also improve fetch failure handling by taking the upstream task runtime into account when deciding how to handle failures. Does it really make sense to retry fetching for minutes when the upstream task can regenerate the data in a few seconds? On the flip side, it might make sense to try a bit harder depending upon the type of failure (e.g. read timeouts for slow nodes) when we suspect it will take hours to complete a reschedule of a task.
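Scaling the fetch-retry budget by the upstream task's runtime, as suggested in the comment above, could be as simple as the following (an illustrative sketch; all constants are example values, and the function name is hypothetical):

```python
def fetch_retry_budget_ms(upstream_runtime_ms, min_ms=5_000, max_ms=600_000):
    """How long to keep retrying a fetch before giving up and asking for
    the upstream task to be re-run. Cheap producers get a short budget
    (re-running them is fast); expensive producers get a longer one."""
    budget = upstream_runtime_ms // 2   # retry for up to half the re-run cost
    return max(min_ms, min(budget, max_ms))
```

A producer that ran for a few seconds yields the floor (give up quickly and regenerate), while an hours-long producer yields the ceiling (try harder before forcing a reschedule), which covers both directions raised in the comment.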