[ 
https://issues.apache.org/jira/browse/TEZ-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606216#comment-16606216
 ] 

Kuhu Shukla commented on TEZ-3972:
----------------------------------

Good point [~jeagles]. I think that if the number of running tasks is zero, we 
might want to avoid a rerun: zero running tasks indicates that the reporter 
vertex has in fact finished, so we can treat this input failure as stale and 
allow the DAG to finish. That would also save us from other possible races 
that won't show up if everything succeeds. 
Thoughts?

> Tez DAG can hang when a single task fails to fetch
> --------------------------------------------------
>
>                 Key: TEZ-3972
>                 URL: https://issues.apache.org/jira/browse/TEZ-3972
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3972.001.patch, TEZ-3972.002.patch
>
>
> Description of the hung DAG:
> A DAG with 2 vertices. The {{Map}} vertex has 22k maps; the downstream 
> {{Reduce}} vertex has 1009 tasks. All tasks succeed but one, which hangs. 
> This one task (attempt) is doing a local fetch from a node that (now) has a 
> bad disk. It fails to fetch and reports the offending input attempt 
> identifiers to the AM. However, the AM does not schedule a re-run because the 
> size of {{uniquefailedOutputReports}} is 1 (since only this task attempt 
> failed to fetch) and the failure fraction threshold is not met. The 
> denominator for this fraction is the total number of tasks, so the re-run 
> never occurs. This JIRA tracks the AM side of the change to alleviate this 
> problem.
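
To make the arithmetic in the description concrete, here is a minimal sketch of the fraction check. It is not the Tez implementation: the class name, method names, and the 0.1 threshold are assumptions chosen for illustration, and the denominator is assumed to be the 1009 tasks of the {{Reduce}} vertex. With a single unique failure report the fraction is roughly 0.001, far below any threshold of that order, so no re-run is scheduled.

```java
public class FetchFailureFractionSketch {

    // Hypothetical threshold, for illustration only; not taken from the issue.
    static final double ASSUMED_MAX_FAILURE_FRACTION = 0.1;

    // Fraction of tasks that reported a fetch failure for a given output.
    static double failureFraction(int uniqueFailedReports, int totalTasks) {
        if (totalTasks <= 0) {
            throw new IllegalArgumentException("totalTasks must be positive");
        }
        return (double) uniqueFailedReports / totalTasks;
    }

    // A re-run is triggered only when the fraction exceeds the threshold.
    static boolean rerunTriggered(int uniqueFailedReports, int totalTasks,
                                  double threshold) {
        return failureFraction(uniqueFailedReports, totalTasks) > threshold;
    }

    public static void main(String[] args) {
        int uniqueFailedReports = 1; // only one task attempt failed to fetch
        int totalTasks = 1009;       // assumed denominator: Reduce vertex tasks

        System.out.printf("failure fraction = %.6f%n",
                failureFraction(uniqueFailedReports, totalTasks));
        System.out.println("re-run triggered = "
                + rerunTriggered(uniqueFailedReports, totalTasks,
                        ASSUMED_MAX_FAILURE_FRACTION));
    }
}
```

Under these assumptions, a lone failing attempt can never cross the threshold on its own, because the count of unique reports stays at 1 while the denominator stays fixed at the total task count.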



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
