[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

2016-02-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147747#comment-15147747
 ] 

Bikas Saha commented on TEZ-3072:
-

When we handle node decommissioning this may be partly relevant, e.g. in that
case we could send inputfailed to consumers. However, that's a discussion for
the future.

I am +1 for the changes in TaskAttemptImpl. Blacklisting a node should not
arbitrarily rerun all completed attempts on that node, because downstream
consumers may have already finished processing. We should probably rename the
config to signify this aspect, e.g. bad_node_rerun_attempts, and give it a
default of false.
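
As a rough illustration of the behaviour described above (not the attached
patch), the sketch below gates the rerun of completed non-leaf attempts behind
a boolean flag that defaults to false. Every name in it (BlacklistRerunSketch,
NodeEvent, Attempt, attemptsToRerun, rerunOnBadNode) is hypothetical and not a
Tez API. It also assumes that a node removed by YARN still triggers reruns,
which is an assumption made for the example rather than something settled in
this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names are real Tez classes. */
class BlacklistRerunSketch {

    /** Why the AM stopped trusting a node. */
    enum NodeEvent { REMOVED_BY_YARN, BLACKLISTED }

    /** Minimal stand-in for a completed task attempt. */
    static final class Attempt {
        final String id;
        final boolean leaf; // leaf output is not fetched by downstream tasks

        Attempt(String id, boolean leaf) {
            this.id = id;
            this.leaf = leaf;
        }
    }

    /**
     * Returns the ids of completed attempts that should be rerun.
     * rerunOnBadNode models a flag such as bad_node_rerun_attempts.
     */
    static List<String> attemptsToRerun(NodeEvent event,
                                        List<Attempt> completedOnNode,
                                        boolean rerunOnBadNode) {
        List<String> rerun = new ArrayList<>();
        // Assumption: a node removed by YARN cannot serve its output any more,
        // while a blacklisted node may still serve shuffle data just fine.
        boolean shouldRerun = event == NodeEvent.REMOVED_BY_YARN || rerunOnBadNode;
        if (!shouldRerun) {
            return rerun;
        }
        for (Attempt attempt : completedOnNode) {
            if (!attempt.leaf) {
                rerun.add(attempt.id);
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        List<Attempt> done = Arrays.asList(
            new Attempt("attempt_1", false),
            new Attempt("attempt_2", true));
        // With the flag at its default, blacklisting reruns nothing.
        System.out.println(attemptsToRerun(NodeEvent.BLACKLISTED, done, false));
        // Opting in restores the old behaviour for non-leaf attempts.
        System.out.println(attemptsToRerun(NodeEvent.BLACKLISTED, done, true));
    }
}
```

With the flag left at false, the blacklisted node simply receives no new work,
and the normal fetch failure handling deals with any output that really cannot
be read.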

However, I would like to be cautious about the changes in TaskImpl. If a task
has been marked as failed retroactively, then it implies that consumers have
reported enough errors against it. Also, that attempt will be retried after
this. So informing the node about this seems the right thing to do. It is
likely that a number of such errors indicates issues with that node, some of
which may be temporary. With TEZ-3075, which would temporarily decommission
the nodes, we should be able to handle the temporary cases. But getting the
information about failures (including fetch failures) is important for making
decisions at the node level. Hence, IMO we should not make the change proposed
in TaskImpl. If such a change is needed, then it could be made in the
AMNode/AMNodeTracker logic that handles AMNodeEventTaskAttemptEnded. There we
could filter attempt failures by type and ignore fetch failures (based on a
separate config). Or we could postpone that change in favor of TEZ-3075.
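
To make the filtering idea concrete, here is a minimal sketch of a per-node
failure counter that still receives every failure but can be configured to
leave fetch failures out of the blacklisting decision. It is not the
AMNode/AMNodeTracker code; NodeFailureFilterSketch, FailureType,
onAttemptFailed, maxFailures and ignoreFetchFailures are all hypothetical names
chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of type-based failure filtering at the node level. */
class NodeFailureFilterSketch {

    /** Coarse failure categories assumed for this example. */
    enum FailureType { TASK_ERROR, FETCH_FAILURE }

    private final Map<String, Integer> countedFailures = new HashMap<>();
    private final int maxFailures;
    private final boolean ignoreFetchFailures;

    NodeFailureFilterSketch(int maxFailures, boolean ignoreFetchFailures) {
        this.maxFailures = maxFailures;
        this.ignoreFetchFailures = ignoreFetchFailures;
    }

    /**
     * Records a failed attempt on the given node.
     * Returns true if the node has reached the blacklisting threshold.
     */
    boolean onAttemptFailed(String nodeId, FailureType type) {
        boolean counted = !(ignoreFetchFailures && type == FailureType.FETCH_FAILURE);
        if (counted) {
            countedFailures.merge(nodeId, 1, Integer::sum);
        }
        return countedFailures.getOrDefault(nodeId, 0) >= maxFailures;
    }

    public static void main(String[] args) {
        NodeFailureFilterSketch tracker = new NodeFailureFilterSketch(2, true);
        // Fetch failures are seen by the tracker but do not count here.
        System.out.println(tracker.onAttemptFailed("node1", FailureType.FETCH_FAILURE));
        System.out.println(tracker.onAttemptFailed("node1", FailureType.TASK_ERROR));
        System.out.println(tracker.onAttemptFailed("node1", FailureType.TASK_ERROR));
    }
}
```

Keeping the filter in the node-level logic means the senders keep reporting
every failure, so the decision about what counts can change later without
touching them.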

Separately, AMNodeEventTaskAttemptEnded seems to be sent from both
TaskScheduler and TaskImpl, whereas it could be sent from a single source in
TaskAttemptImpl. With two senders, the current approach is open to getting out
of sync.

> Node blacklisting always reruns completed non-leaf tasks
>
> Key: TEZ-3072
> URL: https://issues.apache.org/jira/browse/TEZ-3072
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: TEZ-3072.001.patch
>
> Recently a user ran a job with many vertices, and there was a bug in the 
> user's code that caused a problem in one of the trailing vertices in the 
> task.  On some nodes enough tasks failed that the AM thought it needed to 
> blacklist those nodes.  That blacklisting then caused many completed vertices 
> to re-run because it thought it needed to re-execute the non-leaf tasks that 
> had completed on those nodes.  This wasted a lot of cluster resources and job 
> time for no benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

2016-01-26 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117350#comment-15117350
 ] 

Jason Lowe commented on TEZ-3072:
-

In this particular case even blacklisting the node was the wrong thing to do 
because the node was irrelevant to the task failures.

I noticed that the code treats a node removed by YARN and a node blacklisted
due to attempt failures equally.  I could see that being problematic in
practice, because a node that is failing tasks could serve up data from shuffle
just fine.  Re-running completed tasks would only help iff the shuffle would be
problematic.  I suspect that in most cases re-running completed tasks to avoid
the theoretical possibility of shuffle problems (without even trying to verify
that the problem exists) makes the job slower than just assuming the shuffle
will work and letting normal fetch failure handling take care of any problem.
Yes, there are going to be pathological cases where predictive re-execution
would drastically speed up the job, but we're seeing plenty of cases where this
preemptive strike against potential shuffle problems is causing much more harm.
Saw another case of this yesterday where a job re-ran dozens of tasks from
upstream completed vertices for no benefit.




[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

2016-01-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117668#comment-15117668
 ] 

Bikas Saha commented on TEZ-3072:
-

Agree. Which is why I am suggesting that we stop doing this in the short term
(the regular read-error based path is going to provide protection in case the
machine is really down). The current logic in there is mostly derived from MR
and may be getting triggered more often due to more notifications being sent
from other parts of the Tez code, for which the node handling logic is not
prepared. Opened TEZ-3075 for a longer term revamp of that logic. But for
now, I think, not re-running all completed work may be a good enough fix for
the common cases we are seeing in this jira. Is that correct? Or should the
larger changes in TEZ-3075 be done now?



[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

2016-01-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15117885#comment-15117885
 ] 

TezQA commented on TEZ-3072:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12784463/TEZ-3072.001.patch
  against master revision 2bf27de.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestFaultTolerance

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1432//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1432//console

This message is automatically generated.



[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

2016-01-25 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115814#comment-15115814
 ] 

Jason Lowe commented on TEZ-3072:
-

We also have issues where temporary fetch failures against a node cause all
completed tasks from that node to re-run.  In many ways the blacklisting logic
is causing more problems than it is solving, at least with respect to
fetch-failure related processing.  It would be nice if we could configure
blacklisting to ignore node effects involving shuffle (e.g., fetch failures are
not reported to the blacklisting logic, and blacklisted nodes don't cause
completed tasks to re-run).



[jira] [Commented] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

2016-01-25 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116156#comment-15116156
 ] 

Bikas Saha commented on TEZ-3072:
-

A short term fix could be disabling the rerun of completed tasks (but
continuing to blacklist the node to avoid scheduling more work there). A longer
term effort towards putting machines on probation for a few cycles before
giving up on them might help prevent cliffs like this, especially those due to
temporary glitches.
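
One possible shape for the probation idea, purely as a sketch: a node that
trips the failure threshold is benched for a while and then readmitted, and is
only blacklisted for good after it has used up a fixed number of probations.
The class, the states, and the maxProbations parameter are all hypothetical,
and how a "cycle" is timed is left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of probation before permanent blacklisting. */
class NodeProbationSketch {

    enum State { ACTIVE, PROBATION, BLACKLISTED }

    private final int maxProbations;
    private final Map<String, State> states = new HashMap<>();
    private final Map<String, Integer> probationsUsed = new HashMap<>();

    NodeProbationSketch(int maxProbations) {
        this.maxProbations = maxProbations;
    }

    /** Called when a node crosses the failure threshold. */
    State onTooManyFailures(String nodeId) {
        int used = probationsUsed.merge(nodeId, 1, Integer::sum);
        // Give up on the node only after it has exhausted its probations.
        State next = used > maxProbations ? State.BLACKLISTED : State.PROBATION;
        states.put(nodeId, next);
        return next;
    }

    /** Called by the caller's timer when a probation period ends. */
    State onProbationExpired(String nodeId) {
        if (stateOf(nodeId) == State.PROBATION) {
            states.put(nodeId, State.ACTIVE);
        }
        return stateOf(nodeId);
    }

    State stateOf(String nodeId) {
        return states.getOrDefault(nodeId, State.ACTIVE);
    }

    /** New work is scheduled only on nodes that are currently active. */
    boolean canSchedule(String nodeId) {
        return stateOf(nodeId) == State.ACTIVE;
    }

    public static void main(String[] args) {
        NodeProbationSketch nodes = new NodeProbationSketch(2);
        System.out.println(nodes.onTooManyFailures("node1"));
        System.out.println(nodes.onProbationExpired("node1"));
    }
}
```

A temporary glitch then costs the node one probation period instead of
triggering the cliff described in this issue.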
