[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115814#comment-15115814
 ] 

Jason Lowe edited comment on TEZ-3072 at 1/25/16 7:30 PM:
----------------------------------------------------------

We also see temporary fetch failures on a node causing all completed tasks 
from that node to re-run.  In many ways the blacklisting logic is causing more 
problems than it is solving, at least with respect to fetch-failure related 
processing.  It would be nice if we could configure blacklisting to ignore 
node effects involving shuffle (e.g., fetch failures are not reported to the 
blacklisting logic, and blacklisted nodes don't cause completed tasks to 
re-run).



> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and a bug in the user's code 
> caused a problem in one of the trailing vertices of the job.  On some nodes 
> enough tasks failed that the AM decided to blacklist those nodes.  That 
> blacklisting then caused many completed vertices to re-run because the AM 
> thought it needed to re-execute the non-leaf tasks that had completed on 
> those nodes.  This wasted a lot of cluster resources and job time for no 
> benefit.



