[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117668#comment-15117668
 ] 

Bikas Saha commented on TEZ-3072:
---------------------------------

Agree. Which is why I am suggesting that we stop doing this in the short term 
(the regular read-error based path is going to provide protection in case the 
machine is really down). The current logic in there is mostly derived from MR 
and may be getting triggered more often due to more notifications being sent 
from other parts of the Tez code, which the node handling logic is not 
prepared for. Opened TEZ-3075 for a longer term revamp of that logic. But for 
now, I think, not re-running all completed work may be a good enough fix for 
the common cases we are seeing in this jira. Is that correct? Or should the 
larger changes in TEZ-3075 be done now?
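
As a rough illustration of the difference between the two behaviors discussed 
here, a minimal sketch follows. It is not taken from the Tez code base: the 
class BlacklistSketch, the Attempt record, and both method names are 
hypothetical, and the real node handling logic involves far more state.

```java
import java.util.ArrayList;
import java.util.List;

public class BlacklistSketch {

    /** Hypothetical record of one task attempt; not a Tez class. */
    static final class Attempt {
        final String id;
        final String node;
        final boolean completed;
        final boolean leaf;

        Attempt(String id, String node, boolean completed, boolean leaf) {
            this.id = id;
            this.node = node;
            this.completed = completed;
            this.leaf = leaf;
        }
    }

    /**
     * Behavior described in this issue: blacklisting a node re-runs every
     * completed non-leaf attempt that ran on it, whether or not its output
     * is actually unreadable.
     */
    static List<String> rerunOnBlacklist(List<Attempt> attempts, String blacklistedNode) {
        List<String> rerun = new ArrayList<>();
        for (Attempt a : attempts) {
            if (a.completed && !a.leaf && a.node.equals(blacklistedNode)) {
                rerun.add(a.id);
            }
        }
        return rerun;
    }

    /**
     * Assumed short-term alternative: blacklisting alone re-runs nothing.
     * Only an attempt whose output a consumer failed to read is re-run.
     */
    static List<String> rerunOnReadError(List<Attempt> attempts, String unreadableAttemptId) {
        List<String> rerun = new ArrayList<>();
        for (Attempt a : attempts) {
            if (a.completed && a.id.equals(unreadableAttemptId)) {
                rerun.add(a.id);
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(new Attempt("m1", "n1", true, false));
        attempts.add(new Attempt("m2", "n1", true, false));
        attempts.add(new Attempt("r1", "n1", true, true));
        attempts.add(new Attempt("m3", "n2", true, false));

        // Blacklisting n1 re-runs all completed non-leaf work on that node.
        System.out.println(rerunOnBlacklist(attempts, "n1"));
        // A read error re-runs only the attempt whose output was unreadable.
        System.out.println(rerunOnReadError(attempts, "m2"));
    }
}
```

In this sketch, blacklisting node n1 re-runs both m1 and m2, while the 
read-error path re-runs only the attempt whose output could not be fetched.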

> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and there was a bug in the 
> user's code that caused a problem in one of the trailing vertices in the 
> task.  On some nodes enough tasks failed that the AM thought it needed to 
> blacklist those nodes.  That blacklisting then caused many completed vertices 
> to re-run because it thought it needed to re-execute the non-leaf tasks that 
> had completed on those nodes.  This wasted a lot of cluster resources and job 
> time for no benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
