[ 
https://issues.apache.org/jira/browse/TEZ-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804876#comment-14804876
 ] 

Rajesh Balamohan commented on TEZ-814:
--------------------------------------

lgtm. +1. Even when tez.task.max.allowed.output.failures & 
tez.task.max.allowed.output.failures.fraction are not converging, this would 
end up restarting producer after 300 seconds in case of output read-error. 
Should this be backported to 0.6 and 0.5 as well?
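
For illustration only, the combined check discussed above might be sketched 
as follows. This is not the code from the attached patches. The class name, 
method name, and the absolute failure limit of 10 are hypothetical; only the 
25% fraction and the 300-second wait come from this thread, and the mapping of 
the constants to the two configuration keys is an assumption.

```java
/** Hypothetical sketch of a "producer has failed outputs" heuristic. */
public class OutputFailureHeuristic {
    // Assumed stand-in for tez.task.max.allowed.output.failures (value is made up).
    static final int MAX_ALLOWED_OUTPUT_FAILURES = 10;
    // Assumed stand-in for tez.task.max.allowed.output.failures.fraction (25% per the issue text).
    static final double MAX_ALLOWED_OUTPUT_FAILURES_FRACTION = 0.25;
    // Time-based fallback mentioned in the comment above (300 seconds).
    static final long MAX_WAIT_SECONDS = 300;

    /**
     * Returns true if the producer should be re-run.
     *
     * @param failureReports          number of consumers that reported a read error
     * @param totalConsumers          number of consumers reading this producer's output
     * @param secondsSinceFirstReport time elapsed since the first read error was reported
     */
    static boolean shouldRerunProducer(int failureReports,
                                       int totalConsumers,
                                       long secondsSinceFirstReport) {
        if (failureReports <= 0) {
            return false; // no read errors reported, nothing to decide
        }
        boolean countExceeded = failureReports >= MAX_ALLOWED_OUTPUT_FAILURES;
        boolean fractionExceeded = totalConsumers > 0
                && (double) failureReports / totalConsumers >= MAX_ALLOWED_OUTPUT_FAILURES_FRACTION;
        // Fallback: even if neither threshold is reached, do not wait forever.
        boolean waitedTooLong = secondsSinceFirstReport >= MAX_WAIT_SECONDS;
        return countExceeded || fractionExceeded || waitedTooLong;
    }
}
```

Under these assumptions, a single consumer out of 100 reporting a read error 
does not trigger a re-run right away, but does once 300 seconds have passed, 
which covers the "last consumer" case from the issue description.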

> Improve heuristic for determining a task has failed outputs
> -----------------------------------------------------------
>
>                 Key: TEZ-814
>                 URL: https://issues.apache.org/jira/browse/TEZ-814
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>             Fix For: 0.7.1
>
>         Attachments: TEZ-814.1.patch, TEZ-814.2.patch
>
>
> Currently 25% of consumers need to report failure. However, we may not always 
> have that many error reports, e.g. when this is the last consumer and its 
> source is lost, or when some consumers are cut off from the source. The job may 
> hang on those consumers waiting for a re-run.



