[ https://issues.apache.org/jira/browse/TEZ-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804876#comment-14804876 ]
Rajesh Balamohan commented on TEZ-814:
--------------------------------------

LGTM, +1. Even when the tez.task.max.allowed.output.failures and tez.task.max.allowed.output.failures.fraction thresholds are not reached, this would end up restarting the producer after 300 seconds in case of an output read error.

Should this be backported to 0.6 and 0.5 as well?

> Improve heuristic for determining a task has failed outputs
> -----------------------------------------------------------
>
>                 Key: TEZ-814
>                 URL: https://issues.apache.org/jira/browse/TEZ-814
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>             Fix For: 0.7.1
>
>         Attachments: TEZ-814.1.patch, TEZ-814.2.patch
>
>
> Currently 25% of consumers need to report failure. However, we may not always
> have that many error reports. E.g., this is the last consumer and its
> source is lost. Or some consumers are cut off from the source. The job may
> hang on those consumers waiting for a re-run.
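
For readers following the thread, the decision described in the comment above can be sketched roughly as follows. This is only an illustration of the idea (count threshold OR fraction threshold OR time bound), not the code from the attached patches. The class name, method name, and parameters are hypothetical; the 25% fraction and the 300 second bound are taken from the text above, and the count threshold of 10 is an assumed placeholder standing in for whatever tez.task.max.allowed.output.failures is set to.

```java
// Illustrative sketch only: NOT the actual TEZ-814 patch.
// All names and the count threshold below are assumptions for this example.
final class OutputFailureHeuristic {

    // Assumed placeholder for tez.task.max.allowed.output.failures.
    static final int MAX_ALLOWED_OUTPUT_FAILURES = 10;

    // 25% of consumers, as stated in the issue description.
    static final double MAX_ALLOWED_OUTPUT_FAILURES_FRACTION = 0.25;

    // 300 seconds, as mentioned in the review comment.
    static final long MAX_ALLOWED_TIME_FOR_READ_ERROR_SEC = 300;

    private OutputFailureHeuristic() {
    }

    /**
     * Decides whether the producer task should be treated as having failed
     * outputs and be re-run.
     *
     * @param failureReports             total read-error reports received
     * @param failedConsumers            distinct consumers that reported an error
     * @param totalConsumers             consumers that read this producer's output
     * @param secondsSinceFirstReadError time elapsed since the first report
     */
    static boolean shouldRerunProducer(int failureReports,
                                       int failedConsumers,
                                       int totalConsumers,
                                       long secondsSinceFirstReadError) {
        if (failureReports <= 0) {
            // No read error has been reported, so there is nothing to act on.
            return false;
        }
        boolean countReached = failureReports >= MAX_ALLOWED_OUTPUT_FAILURES;
        boolean fractionReached = totalConsumers > 0
                && (double) failedConsumers / totalConsumers
                        >= MAX_ALLOWED_OUTPUT_FAILURES_FRACTION;
        // The time bound covers the cases from the issue description: a lone
        // last consumer, or a few consumers cut off from the source, can never
        // push the count or the fraction over its threshold.
        boolean waitedTooLong =
                secondsSinceFirstReadError >= MAX_ALLOWED_TIME_FOR_READ_ERROR_SEC;
        return countReached || fractionReached || waitedTooLong;
    }
}
```

With these assumed values, a single report from 1 of 100 consumers does not trigger a re-run after 10 seconds, but does once 300 seconds have passed, which is the behavior the comment describes.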