[ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115814#comment-15115814 ]
Jason Lowe edited comment on TEZ-3072 at 1/25/16 7:30 PM:
----------------------------------------------------------

We also have problems with temporary fetch failures on a node causing all completed tasks from that node to re-run. In many ways the blacklisting logic is causing more problems than it is solving, at least with respect to fetch-failure-related processing. It would be nice if we could configure blacklisting to ignore node effects involving shuffle (e.g., fetch failures are not reported to the blacklisting logic, and blacklisted nodes don't cause completed tasks to re-run).

> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and there was a bug in the
> user's code that caused a problem in one of the trailing vertices in the
> task. On some nodes enough tasks failed that the AM thought it needed to
> blacklist those nodes. That blacklisting then caused many completed vertices
> to re-run because it thought it needed to re-execute the non-leaf tasks that
> had completed on those nodes. This wasted a lot of cluster resources and job
> time for no benefit.
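
To make the configuration request in the comment concrete, here is a minimal sketch of the two behaviors it asks to be able to switch off. Everything in it is an assumption for illustration: `BlacklistPolicy`, `NodeTracker`, both flags, and the failure threshold are hypothetical names and values, not existing Tez classes or settings, and the model is far simpler than the real AM logic.

```python
from dataclasses import dataclass, field


@dataclass
class BlacklistPolicy:
    # Both switches are hypothetical; they are not existing Tez settings.
    count_fetch_failures: bool = True          # report fetch failures to blacklisting
    rerun_completed_on_blacklist: bool = True  # re-run finished tasks from a blacklisted node
    failure_threshold: int = 3                 # illustrative value only


@dataclass
class NodeTracker:
    policy: BlacklistPolicy
    failures: dict = field(default_factory=dict)
    blacklisted: set = field(default_factory=set)

    def report_failure(self, node: str, is_fetch_failure: bool) -> None:
        """Count a failure against a node unless the policy ignores it."""
        if is_fetch_failure and not self.policy.count_fetch_failures:
            return
        self.failures[node] = self.failures.get(node, 0) + 1
        if self.failures[node] >= self.policy.failure_threshold:
            self.blacklisted.add(node)

    def tasks_to_rerun(self, completed: dict) -> list:
        """Return completed task ids to re-run; `completed` maps task id to node."""
        if not self.policy.rerun_completed_on_blacklist:
            return []
        return sorted(t for t, n in completed.items() if n in self.blacklisted)


def simulate(policy: BlacklistPolicy) -> list:
    """Report three fetch failures on nodeA, then list the tasks that would re-run."""
    tracker = NodeTracker(policy)
    for _ in range(3):
        tracker.report_failure("nodeA", is_fetch_failure=True)
    completed = {"v1_t0": "nodeA", "v1_t1": "nodeB"}
    return tracker.tasks_to_rerun(completed)


with_shuffle_effects = simulate(BlacklistPolicy())
ignore_fetch_failures = simulate(BlacklistPolicy(count_fetch_failures=False))
no_rerun = simulate(BlacklistPolicy(rerun_completed_on_blacklist=False))
```

With the default policy the transient fetch failures blacklist `nodeA` and its completed task is scheduled again; with either switch turned off, nothing is re-run. The two switches are kept separate because the comment names them as separate effects: one controls what feeds the blacklist, the other controls what the blacklist triggers.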