[ 
https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080277#comment-17080277
 ] 

László Bodor commented on TEZ-4139:
-----------------------------------

[~rajesh.balamohan]: could you please take a look at [^TEZ-4139.01.WIP.patch]?
Basically, I've changed the bookkeeping so that attempt failures are stored per source host:
{code}
  private Map<String, Map<TezTaskAttemptID, Long>> uniquefailedOutputReports = Maps.newHashMap();
{code}
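To make the shape of that map concrete, here is a minimal, self-contained sketch of how such per-host bookkeeping could behave. It is not the code from the patch: the class and method names are hypothetical, a plain String stands in for TezTaskAttemptID, and I'm assuming the Long value is the time of the first report.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the per-host bookkeeping; not the code in the patch. */
final class PerHostFailureReports {
  // host -> (reporting attempt id -> time of its first report); String replaces TezTaskAttemptID
  private final Map<String, Map<String, Long>> reportsByHost = new HashMap<>();

  /** Records a report; returns true only if this attempt had not yet reported for this host. */
  boolean record(String host, String reportingAttemptId, long timestampMillis) {
    Map<String, Long> perHost = reportsByHost.computeIfAbsent(host, h -> new HashMap<>());
    return perHost.putIfAbsent(reportingAttemptId, timestampMillis) == null;
  }

  /** Number of unique reporting attempts recorded for one host. */
  int countForHost(String host) {
    Map<String, Long> perHost = reportsByHost.get(host);
    return perHost == null ? 0 : perHost.size();
  }

  /** Sum over all hosts; an attempt recorded under two hosts is counted twice. */
  int totalCount() {
    int total = 0;
    for (Map<String, Long> perHost : reportsByHost.values()) {
      total += perHost.size();
    }
    return total;
  }
}
```

With this layout the old "total unique reports" number is still available as a sum over hosts, while a per-host count becomes a single lookup.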
My question is: how exactly should the host be taken into account? Given the
failureFraction calculation:
{code}
  float failureFraction = runningTasks > 0 ? ((float) totalUniqueReportsCount) / runningTasks : 0;
{code}
In the snippet above, the denominator (runningTasks) is the number of running
tasks in the current vertex, and totalUniqueReportsCount is the same count as
before (I kept it working the same way as earlier, regardless of the underlying
data structure).

If I want to change this calculation to take the failures for a given host into
account, how should I change the denominator? (For the numerator, I'll most
probably switch to the failure count per host.)

1. Not changing the denominator: this yields a lower failure fraction, which I
guess is not the intention (it would only work if the user sets
"tez.task.max.allowed.output.failures.fraction" to a lower value).

2. Changing the denominator somehow? Maybe to reflect some "per vertex" number.

(3. Not changing the denominator, but introducing something like
"tez.task.max.allowed.output.failures.fraction.per.source.host" and setting it
to a lower value by default?)
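
To illustrate the concern behind options 1 and 3, here is a small sketch that applies the quoted formula once with the total report count and once with a per-host count. The class name, the exceeds() helper, and the "fraction > limit" comparison are my assumptions for illustration, not the actual TaskAttemptImpl code.

```java
/** Hypothetical sketch; only the fraction formula itself is taken from the snippet above. */
final class FailureFractionSketch {
  /** Mirrors the quoted formula: reports / runningTasks, or 0 when nothing is running. */
  static float failureFraction(int reportCount, int runningTasks) {
    return runningTasks > 0 ? ((float) reportCount) / runningTasks : 0;
  }

  /** Assumed check: the limit is crossed when the fraction is strictly greater than it. */
  static boolean exceeds(int reportCount, int runningTasks, float maxAllowedFraction) {
    return failureFraction(reportCount, runningTasks) > maxAllowedFraction;
  }
}
```

For example, with 10 running tasks and 4 unique reports in total, of which 2 belong to one host, failureFraction(4, 10) is 0.4 while failureFraction(2, 10) is only 0.2. A limit of 0.3 (an arbitrary value for this example) would be crossed by the total count but not by the per-host count, which is why keeping the denominator seems to need either a lower limit (option 1) or a separate, lower per-host limit (option 3).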

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch
>
>
> When lots of downstream attempts fail to pull the information from source 
> task, source task is marked as failed and it is retried. Currently failure 
> fraction is handled by looking at unique task attempts from downstream. 
> However, it should consider taking into account node information for 
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
