[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2019-03-19 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796234#comment-16796234
 ] 

TezQA commented on TEZ-3718:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s{color} | {color:red} TEZ-3718 does not apply to master. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3718 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/128/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Better handling of 'bad' nodes
> --
>
> Key: TEZ-3718
> URL: https://issues.apache.org/jira/browse/TEZ-3718
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Zhiyuan Yang
>Priority: Major
> Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch, 
> TEZ-3718.4.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to 
> do nothing other than not schedule new tasks on this node.
> The alternative, via config, is to retroactively kill every task which ran on 
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of 
> relying on a timeout, which leads to the attempt being marked as FAILED after 
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure 
> heuristics. Normally source tasks require multiple consumers to report 
> failure for them to be marked as bad. If a single consumer reports failure 
> against a source which ran on a bad node, consider it bad and re-schedule it 
> immediately. (Otherwise failures can take a while to propagate, and jobs get 
> a lot slower.)
> [~jlowe] - I think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and 
> restart sources which ran on a bad node, and running tasks being counted as 
> FAILURES instead of KILLS.
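
For illustration, a minimal sketch of the heuristic proposed in (2), with 
hypothetical names - this is not code from any of the attached patches:

{code:java}
// Hypothetical names; sketches proposal (2), not code from the patches.
import java.util.HashSet;
import java.util.Set;

class SourceFailureHeuristic {
  private final Set<String> badNodes = new HashSet<>(); // unhealthy/blacklisted nodes
  private final int normalConsumerThreshold; // consumers needed when the node looks fine

  SourceFailureHeuristic(int normalConsumerThreshold) {
    this.normalConsumerThreshold = normalConsumerThreshold;
  }

  void onNodeMarkedBad(String nodeId) {
    badNodes.add(nodeId);
  }

  /** True if the source attempt should be rescheduled now. */
  boolean shouldReschedule(String sourceNodeId, int distinctConsumerFailures) {
    if (badNodes.contains(sourceNodeId)) {
      // One consumer report suffices when the source ran on a known-bad node,
      // so failures propagate immediately instead of waiting on the threshold.
      return distinctConsumerFailures >= 1;
    }
    return distinctConsumerFailures >= normalConsumerThreshold;
  }
}
{code}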





[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2018-07-18 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548570#comment-16548570
 ] 

TezQA commented on TEZ-3718:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch
  against master revision 7e397b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 180 javac 
compiler warnings (more than the master's current 177 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
    org.apache.tez.test.TestRecovery

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2866//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2866//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2866//console

This message is automatically generated.




[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2018-07-18 Thread Zhiyuan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548343#comment-16548343
 ] 

Zhiyuan Yang commented on TEZ-3718:
---

This one has been pending review for a long time; review would be greatly 
appreciated. But feel free to drop it from the release. 



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2018-07-18 Thread Eric Wohlstadter (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548325#comment-16548325
 ] 

Eric Wohlstadter commented on TEZ-3718:
---

[~kshukla] [~jlowe]

I'm thinking we'll need to drop this from the 0.9.2 and 0.10 releases, since the 
assignee is not active and no one else has picked it up.

Does that sound right to you?



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-10-10 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199602#comment-16199602
 ] 

TezQA commented on TEZ-3718:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch
  against master revision c82b2ea.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 25 javac 
compiler warnings (more than the master's current 24 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2658//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2658//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//console

This message is automatically generated.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-07-17 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090848#comment-16090848
 ] 

Siddharth Seth commented on TEZ-3718:
-

Patch mostly looks good to me. More at the end, though, on potential changes.
- TaskAttemptEventNodeFailed - the failure reason is still a boolean. This can 
be an enum as well.
- Config still being read in TAImpl. To your point about it already being read, 
that needs to be changed. Trying to make sure there's no access beyond the 
Vertex level at most. (Configuration has historically been slow to access.) This 
should be a simple change via getVertex.getVertexConfig.
- Changing the config parameter node-unhealthy-reschedule-tasks is an 
incompatible change. It should be deprecated, and a new one introduced.

Filed TEZ-3799 to make blacklisting behave the same.

This does change behaviour to kill the currently running task on the container 
irrespective of the config setting. (With the previous setting, not only would 
old tasks not re-run, the current one would not be terminated either.) 
[~jlowe], [~rohini] - based on the offline conversation we had about this, the 
preference was to have this configurable.
With the current patch, making this configurable is a big change to the 
Container state machine. It's wired to complete in case of a node failure 
(which is correct, IMHO), and if it completes, the running task will end up 
completing.
One possible way to handle this: retain the old behaviour (AMNode will not send 
out events - this can be covered by the current config). If that is enabled, 
the old behaviour continues, and the changes to Task are irrelevant (the new 
configs don't apply). With the old config set to send Container termination 
messages, the new flags can kick in: the running task will be killed, and for 
completed attempts, fast exits can be enabled in case of Input errors.
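
A rough sketch of that gating, under the stated assumptions - flag and method 
names are placeholders, not actual Tez configuration keys or APIs:

{code:java}
// Hypothetical gating logic for the proposal above.
class NodeFailureGating {
  private final boolean sendNodeEvents;                // old config: reschedule-tasks behaviour
  private final boolean killRunningOnBadNode;          // new flag
  private final boolean fastFailCompletedOnInputError; // new flag

  NodeFailureGating(boolean sendNodeEvents, boolean killRunningOnBadNode,
      boolean fastFailCompletedOnInputError) {
    this.sendNodeEvents = sendNodeEvents;
    this.killRunningOnBadNode = killRunningOnBadNode;
    this.fastFailCompletedOnInputError = fastFailCompletedOnInputError;
  }

  void onNodeMarkedBad(boolean attemptRunning, boolean inputErrorReported) {
    if (!sendNodeEvents) {
      return; // old behaviour: AMNode stays silent, nothing is killed or re-run
    }
    if (attemptRunning && killRunningOnBadNode) {
      System.out.println("KILL the running attempt");
    } else if (!attemptRunning && inputErrorReported && fastFailCompletedOnInputError) {
      System.out.println("fast-exit: reschedule the completed attempt immediately");
    }
  }
}
{code}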



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-07-07 Thread Zhiyuan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078469#comment-16078469
 ] 

Zhiyuan Yang commented on TEZ-3718:
---

Weird - neither the javac warning nor the test failure appears in my local build.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-07-05 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1607#comment-1607
 ] 

TezQA commented on TEZ-3718:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12875797/TEZ-3718.3.patch
  against master revision b915a07.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 25 javac 
compiler warnings (more than the master's current 24 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
    org.apache.tez.client.TestTezClient

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2565//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2565//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2565//console

This message is automatically generated.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-07-05 Thread Zhiyuan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075195#comment-16075195
 ] 

Zhiyuan Yang commented on TEZ-3718:
---

Moving this out to unblock 0.9 release.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-07-05 Thread Zhiyuan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075194#comment-16075194
 ] 

Zhiyuan Yang commented on TEZ-3718:
---

Thanks [~sseth] for the review! 
bq. Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY, 
BLACKLISTED)
Used an enum in the new patch.

bq. Don't change container state - allow a task action to change the state via 
a STOP_REQUEST depending on task level configs. OR If there are no running 
fragments on the container, change state to a COMPLETED state, so that new 
task allocations are not accepted.
In the third patch, I removed the 'do nothing' behavior previously introduced 
by TEZ-2972. The sole purpose of that behavior was to avoid unnecessary 
rescheduling of completed tasks. With the third patch we can do this more 
precisely, so there is no need to keep the 'do nothing' behavior. The semantics 
of the configuration are changed accordingly. That said, there is no issue with 
sending TA_CONTAINER_TERMINATING from the container to TAImpl, since the 
running task should always be killed.

bq. Don't read from a Configuration instance within each AMContainer / 
TaskAttemptImpl - there's example code on how to avoid this in 
TaskImpl/TaskAttemptImpl
Didn't change this part. Are you saying this because of a locking issue? Even 
in TAImpl, the configuration is read by each instance, just in a different 
place (the ctor).



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-06-28 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067111#comment-16067111
 ] 

Jason Lowe commented on TEZ-3718:
-

bq. I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted 
differently from each other w.r.t the config which determines whether tasks 
need to be restarted or not.

I'm not sure I know the full history here, but unhealthy vs. blacklisted can 
represent different contexts.  For example, a node could go unhealthy because 
too many disks have failed.  We probably want to proactively re-run upstream 
tasks rather than wait for the fetch failures.  Node blacklisting is a little 
different from node unhealthiness.  If a task runs and fails before completing, 
we may want to blacklist that node to prevent other tasks from also failing to 
complete on that node.  But if we have _completed_ tasks on that blacklisted 
node, then there's a decent chance we can complete the shuffle despite the fact 
that tasks are failing.  For example, a task needs to use a GPU, but something 
about the GPU setup on that node causes all tasks trying to use it to crash.  
If tasks that didn't need the GPU ran and succeeded on the node, why are we 
proactively re-running them rather than just fetching their inputs?  That could 
be a huge waste of work and end up being a large performance hit to the job.  
TEZ-3072 was filed because of behavior like this.  It all comes down to these 
two questions:
- if a node is unhealthy, is it likely I won't be able to successfully shuffle 
data from it?
- if a node is blacklisted, is it likely I won't be able to successfully 
shuffle data from it?

If the answer to both of them is always the same regarding whether we re-run 
completed tasks then yes, we should treat them equivalently.  I think there are 
cases where we would want them to be different.  Full disclosure -- we've been 
running in a mode where we do _not_ re-run completed tasks on nodes if they go 
unhealthy or are blacklisted.  We found many cases where the node was still 
able to shuffle most (sometimes all) of the completed data for tasks despite 
being declared unhealthy or blacklisted.  In short, re-running was causing more 
problems than it was fixing for us, so now we simply wait for the fetch 
failures.  It's not always optimal, of course, and there are cases where 
proactively re-running would have been preferable.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-06-27 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065786#comment-16065786
 ] 

Siddharth Seth commented on TEZ-3718:
-

I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted 
differently from each other w.r.t the config which determines whether tasks 
need to be restarted or not. I think both can be treated the same. [~jlowe] - 
you may have more context on this.

The changes to the Node related classes mostly look good to me. Instead of 
isUnhealthy in the event - this could be an enum (UNHEALTHY, BLACKLISTED).

For AMContainer, I'm not sure why the fail task config needs to be read. Will 
the following work?
- When the event is received, annotate the container to say "On A Failed Node" 
(already done)
- Inform prior and current attempts of the node failure.
- Don't change container state - allow a task action to change the state via a 
STOP_REQUEST depending on task level configs.
  OR
  If there are no running fragments on the container, change state to a 
COMPLETED state, so that new task allocations are not accepted.
  Do not accept new tasks since nodeFailure has been set.
TaskAttempt
- From a brief glance, the functionality looks good. Fail_Fast / decide whether 
to keep a task / cause it to be killed on a node failure.

General
- Don't read from a Configuration instance within each AMContainer / 
TaskAttemptImpl - there's example code on how to avoid this in 
TaskImpl/TaskAttemptImpl
- I thought the configs would be the following (a sketch follows this list):
TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS=false - Current, default=false
TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING=true - New, default=true (overrides 
TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS)
The third config in the patch looks good.
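
A sketch of how those two keys might be declared. The property strings and 
defaults come from the list above and should be read as a proposal, not the 
final API; only the reschedule-tasks key mirrors an existing option:

{code:java}
// The kill-running key is a proposed name from this comment,
// not a shipped Tez configuration.
public final class ProposedNodeFailureConfigs {
  public static final String TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS =
      "tez.am.node-unhealthy-reschedule-tasks";
  public static final boolean TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS_DEFAULT = false;

  // New: kill attempts still RUNNING on a node that turns unhealthy, so they
  // are counted as KILLED rather than timing out as FAILED. Overrides the
  // reschedule-tasks setting for running attempts.
  public static final String TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING =
      "tez.am.node-unhealthy-kill-running";
  public static final boolean TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING_DEFAULT = true;

  private ProposedNodeFailureConfigs() {}
}
{code}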



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-06-27 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064683#comment-16064683
 ] 

TezQA commented on TEZ-3718:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12874660/TEZ-3718.2.patch
  against master revision 5b0f5a0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2548//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2548//console

This message is automatically generated.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-06-12 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047013#comment-16047013
 ] 

Siddharth Seth commented on TEZ-3718:
-

bq. For not accessing nodeTracker from TaskAttempt, are you suggesting 
notifying TaskAttempt about node health? That would be more overhead than 
pulling info from the node tracker, since every TaskAttempt needs to note down 
the node health status.
Isn't there enough information available in the message itself? 
Node->Container->TaskAttempt - whatever state change triggered the node event 
to go out, that information can be retained in the event sent out by the 
container.
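
To make the suggested flow concrete, a tiny hypothetical sketch (names are 
illustrative, not from the patch) of a container retaining the node-event 
information and fanning it out to prior and current attempts:

{code:java}
// The container keeps the information from the node-level event and forwards
// it, so attempts never need to query the nodeTracker themselves.
import java.util.ArrayList;
import java.util.List;

class AMContainerSketch {
  interface AttemptHandle {
    void onNodeFailed(String nodeId, String reason);
  }

  private final List<AttemptHandle> priorAndCurrentAttempts = new ArrayList<>();

  void registerAttempt(AttemptHandle attempt) {
    priorAndCurrentAttempts.add(attempt);
  }

  void handleNodeFailedEvent(String nodeId, String reason) {
    for (AttemptHandle attempt : priorAndCurrentAttempts) {
      attempt.onNodeFailed(nodeId, reason); // the reason travels inside the event
    }
  }
}
{code}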



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-06-09 Thread Zhiyuan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045142#comment-16045142
 ] 

Zhiyuan Yang commented on TEZ-3718:
---

Thanks [~sseth] for the review! Doing it the other way around makes sense, as 
our event should be more like a message than a command. I'll change the patch 
for that. About the TaskAttempt configuration: that's for allowing the user to 
fall back to the original behavior, and TaskAttempt has to check it. For not 
accessing the nodeTracker from TaskAttempt, are you suggesting notifying 
TaskAttempt about node health? That would be more overhead than pulling info 
from the node tracker, since every TaskAttempt needs to note down the node 
health status.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-06-02 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035239#comment-16035239
 ] 

Siddharth Seth commented on TEZ-3718:
-

Can we do this a little differently? Typically, entities send out events, and 
the receiving entity decides what to do. In this case, Nodes would always send 
out an event. Container and/or TaskAttempt, based on state and Configuration, 
would take a call on what to do next.

The event from the Node may still need some augmenting to indicate whether the 
node was blacklisted or marked "UNHEALTHY" for some other reason.

In the current patch, I'm not sure why TaskAttempt needs to look up the 
configuration. It can avoid accessing the nodeTracker node status and rely upon 
the event instead.
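
A minimal sketch of this 'node emits, receiver decides' pattern, with 
hypothetical names; the enum mirrors the blacklisted-vs-unhealthy distinction 
mentioned above:

{code:java}
// All names here are hypothetical. The Node always emits the event; the
// attempt decides using only the event payload plus its own configuration,
// never by calling back into the nodeTracker.
enum NodeFailureReason { UNHEALTHY, BLACKLISTED }

class NodeFailedEvent {
  final NodeFailureReason reason;
  NodeFailedEvent(NodeFailureReason reason) { this.reason = reason; }
}

class TaskAttemptSketch {
  private final boolean rescheduleCompletedOnUnhealthy; // read once from config

  TaskAttemptSketch(boolean rescheduleCompletedOnUnhealthy) {
    this.rescheduleCompletedOnUnhealthy = rescheduleCompletedOnUnhealthy;
  }

  void handle(NodeFailedEvent event, boolean attemptCompleted) {
    if (!attemptCompleted) {
      System.out.println("KILL running attempt, reason=" + event.reason);
    } else if (event.reason == NodeFailureReason.UNHEALTHY
        && rescheduleCompletedOnUnhealthy) {
      System.out.println("reschedule completed attempt");
    }
  }
}
{code}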





[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-05-31 Thread Zhiyuan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031616#comment-16031616
 ] 

Zhiyuan Yang commented on TEZ-3718:
---

TestUnorderedPartitionedKVWriter is flaky, and TestMockDAGAppMaster doesn't 
fail in a local run. 



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-05-30 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030539#comment-16030539
 ] 

TezQA commented on TEZ-3718:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12870479/TEZ-3718.1.patch
  against master revision 241a7fa.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
    org.apache.tez.runtime.library.common.writers.TestUnorderedPartitionedKVWriter
    org.apache.tez.dag.app.TestMockDAGAppMaster

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2491//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2491//console

This message is automatically generated.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-05-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005282#comment-16005282
 ] 

Jason Lowe commented on TEZ-3718:
-

Last I checked, MapReduce doesn't have task duration factored into the decision 
to re-submit.

As for container timeout vs. task timeout, I definitely agree that if we have 
fresh heartbeats from the container yet fail to receive a task heartbeat in the 
required interval then that's a task failure.  If the container heartbeat timed 
out then we might want to treat it as a kill, although I'm a bit worried about 
the case where the issue that caused the timeout is pathological in the setup 
and every container hits it.  Wouldn't we run forever, constantly killing 
instead of failing?



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-05-10 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005247#comment-16005247
 ] 

Siddharth Seth commented on TEZ-3718:
-

bq. Not specifically related to bad node handling, but we could also improve 
fetch failure handling by taking the upstream task runtime into account when 
deciding how to handle failures. Does it really make sense to retry fetching 
for minutes when the upstream task can regenerate the data in a few seconds?

I believe MapReduce may already have this factored into its retry handling. 
It definitely makes sense to get something like this in as well.

In terms of the 'bad' indication, AMNode does track failures on a node. Again, 
the reason is not really known. The AM has a lot of information on what is 
going on in the cluster - transfer rate per node, execution rate per node, 
shuffle failures, etc. It should, in theory, be able to make much better calls. 
I don't think we've ever gotten around to getting all of this connected 
together in the AM, though.



[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

2017-05-09 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003529#comment-16003529
 ] 

Jason Lowe commented on TEZ-3718:
-

Sure, killing tasks that are active on a node that was marked bad can make 
sense, and it also makes sense to be more sensitive to rescheduling upstream 
tasks when downstream tasks start reporting failures, given we already have 
other evidence the node is bad.  I'm not a big fan of declaring bad all 
upstream tasks that ran on the node, since this often creates as many problems 
as it solves.  In some cases a node can be declared unhealthy but still be able 
to serve up shuffle data.  Unfortunately the 'bad' indication is just a 
boolean, so we don't get the required fidelity to know whether re-running all 
tasks really makes sense.

Not specifically related to bad node handling, but we could also improve fetch 
failure handling by taking the upstream task runtime into account when deciding 
how to handle failures.  Does it really make sense to retry fetching for 
minutes when the upstream task can regenerate the data in a few seconds?  On 
the flip side, it might make sense to try a bit harder depending upon the type 
of failure (e.g. read timeouts for slow nodes) when we suspect it will take 
hours to complete a reschedule of a task.
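
As a toy illustration of that trade-off (purely hypothetical numbers and names, 
not Tez code), a fetch-retry budget could be scaled by the upstream task's 
runtime and clamped:

{code:java}
// Toy numbers: budget half the upstream runtime, clamped to [10s, 10min].
class FetchRetryPolicy {
  /** Max total time (ms) to spend retrying fetches from one upstream source. */
  long maxRetryMillis(long upstreamTaskRuntimeMillis) {
    // Cheap-to-regenerate upstream output: give up quickly and re-run it.
    // Expensive upstream output: try considerably harder before rescheduling.
    long budget = upstreamTaskRuntimeMillis / 2;
    return Math.max(10_000L, Math.min(budget, 600_000L));
  }
}
{code}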
