[ https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083818#comment-17083818 ]
Rajesh Balamohan commented on TEZ-4139:
---------------------------------------

[~abstractdog]: Thanks for sharing the WIP patch.

Tez should consider the node details of the downstream tasks. Currently the patch uses "String sourceHost = attempt.getNodeId().getHost();", which should give the node details of the source task. The intent is to compute the number of nodes that could not download data from this sourceHost. In certain cases, all of these downstream tasks run on the same node and end up spiking the fraction.

IMO, the failure event needs to be extended so that it carries the downstream node details. Failures from the same node should then be counted as a single failure (e.g. 10 tasks scheduled on d1 would count as 1 failure in the calculation).

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch
>
>
> When lots of downstream attempts fail to pull the information from source
> task, source task is marked as failed and it is retried. Currently failure
> fraction is handled by looking at unique task attempts from downstream.
> However, it should consider taking into account node information for
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
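
For illustration, the per-node counting suggested in the comment could look roughly like the sketch below. It is not code from the patch or from TaskAttemptImpl: the class NodeAwareFailureFraction, the FetchFailure type, and the methods byAttempt and byNode are all hypothetical names. It also assumes that the failure event already carries the downstream host, and that the denominator stays the number of consumer tasks; the comment does not specify either.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: these class, field, and method names are
// hypothetical and are not taken from the Tez codebase or the WIP patch.
final class NodeAwareFailureFraction {

  /** One fetch-failure report from a downstream (consumer) attempt. */
  static final class FetchFailure {
    final String downstreamAttemptId;
    final String downstreamHost; // assumed to be carried in the failure event

    FetchFailure(String downstreamAttemptId, String downstreamHost) {
      this.downstreamAttemptId = downstreamAttemptId;
      this.downstreamHost = downstreamHost;
    }
  }

  /** Attempt-based fraction: every unique downstream attempt counts once. */
  static float byAttempt(List<FetchFailure> failures, int consumerCount) {
    requirePositive(consumerCount);
    Set<String> attempts = new HashSet<>();
    for (FetchFailure f : failures) {
      attempts.add(f.downstreamAttemptId);
    }
    return (float) attempts.size() / consumerCount;
  }

  /** Node-based fraction: all failures from one downstream host count once. */
  static float byNode(List<FetchFailure> failures, int consumerCount) {
    requirePositive(consumerCount);
    Set<String> hosts = new HashSet<>();
    for (FetchFailure f : failures) {
      hosts.add(f.downstreamHost);
    }
    // Assumption: the denominator is unchanged; only the numerator is deduplicated.
    return (float) hosts.size() / consumerCount;
  }

  private static void requirePositive(int consumerCount) {
    if (consumerCount <= 0) {
      throw new IllegalArgumentException("consumerCount must be positive");
    }
  }
}
```

With 10 consumer tasks that all ran on d1 and all reported a failure, byAttempt counts 10 failures out of 10, while byNode counts a single failure out of 10, which matches the "10 tasks on d1 count as 1 failure" example above.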