[ 
https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083818#comment-17083818
 ] 

Rajesh Balamohan commented on TEZ-4139:
---------------------------------------

[~abstractdog]: Thanks for sharing the WIP patch. Tez should consider the node
details of the downstream tasks. Currently the patch uses "String sourceHost =
attempt.getNodeId().getHost();", which gives the node details of the source
task. The intent is to compute the number of nodes that could not download
data from this sourceHost. In certain cases, all of the failing downstream
tasks run on the same node, and they end up spiking the fraction. IMO, the
failure event needs to be extended to pass the downstream node details, and
failures from the same node should then be counted as a single failure (e.g.
10 tasks scheduled on node d1 would count as 1 failure in the calculation).
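
To make the suggestion concrete, here is a minimal sketch of counting failures
per downstream node instead of per downstream attempt. It is not the Tez
implementation: FetchFailure, attemptBasedFraction and nodeBasedFraction are
hypothetical names, and the denominators (total downstream attempts, total
downstream nodes) are assumptions chosen for illustration only.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: none of these types exist in Tez.
final class NodeAwareFailureFraction {

    /** Hypothetical report of one downstream attempt failing to fetch source output. */
    static final class FetchFailure {
        final String downstreamAttemptId;
        final String downstreamHost; // assumed to be carried in the failure event

        FetchFailure(String downstreamAttemptId, String downstreamHost) {
            this.downstreamAttemptId = downstreamAttemptId;
            this.downstreamHost = downstreamHost;
        }
    }

    /** Counts each unique downstream attempt as one failure. */
    static double attemptBasedFraction(List<FetchFailure> failures, int totalDownstreamAttempts) {
        Set<String> attempts = new HashSet<>();
        for (FetchFailure f : failures) {
            attempts.add(f.downstreamAttemptId);
        }
        return (double) attempts.size() / totalDownstreamAttempts;
    }

    /** Counts all failures reported from the same downstream host as one failure. */
    static double nodeBasedFraction(List<FetchFailure> failures, int totalDownstreamNodes) {
        Set<String> hosts = new HashSet<>();
        for (FetchFailure f : failures) {
            hosts.add(f.downstreamHost);
        }
        return (double) hosts.size() / totalDownstreamNodes;
    }

    private NodeAwareFailureFraction() {
    }
}
```

With 10 failed attempts that all ran on d1, the attempt-based count is 10 while
the node-based count is 1, so a single bad downstream node no longer spikes the
fraction for the source task.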

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch
>
>
> When many downstream attempts fail to pull data from a source task, the
> source task is marked as failed and retried. Currently the failure fraction
> is computed from the unique downstream task attempts that reported a
> failure. However, it should also take node information into account when
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
