[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216906#comment-15216906
 ] 

Hitesh Shah commented on TEZ-3161:
----------------------------------


FailureType seems to be specific to task failures. Maybe change enum to 
TaskFailureType. Thinking a bit more and looking at our previous API which says 
fatalError(), we might need to figure out better naming for both the enum and 
the API to ensure that the developer understands that one will fail the task 
completely i.e. no more re-attempts and the other will fail the task attempt 
but do a re-try. Maybe abort() or something similar for the API. Additionally, 
any reason why we cant leave the current API as is to denote a non-fatal error? 
A different approach could be to have failureType enum not be public but rather 
private impl and we expose fatalError() or abort() kind of APIs only? Open to 
any approach.  

"public void reportFailure(FailureType unsuccessfulEndReason, @Nullable 
Throwable exception, @Nullable String message);" - 
s/unsuccessfulEndReason/failureType/ and s/message/unsuccessfulEndReason/ - 
failure type does not indicate a reason of why the task failed. 

killSelf() - I dont think this should be supported. It allows mis-use from 
user-code where it could end up re-running itself multiple times. Additionally, 
if the usecase is pipelined shuffle, if the input cannot handle certain failure 
scenarios, it may be better to consider the task as failed. It is okay to 
retain the code changes for now but lets mark it Private for now at the very 
least.  Also killSelf documentation of "Kill the currently running attempt." is 
not really clear to a developer implement either an Input/Output/Processor.  

TaskAttemptTerminationCause vs FailureType:
   - For failure type fatal what is termination cause? 
   - It seems a bit messy that we have both termination cause and failure type 
where literally all termination causes are likely to be non-fatal. 
   - Additionally, this impacts UI in terms of how we display this information. 
Should file a follow-up jira for the UI to figure out how this should be 
visualized \cc [~Sreenath]

{code}
        LOG.info("Failing task: " + task.getTaskId()
            + ", currentFailedAttempts: " + task.failedAttempts + ", 
maxFailedAttempts: "
            + task.maxFailedAttempts + ", FailureType: " + 
castEvent.getFailureType());
{code}
   - Might be good to add another log line if task is being failed to a fatal 
TA failure. Above might lead to confusion as first glance for users seeing a 
task fail even though attempt counts have not hit max. 

Why is the failure type not being written to history? Shouldn't it be needed by 
the UI? Additionally, for a task finishing, we should likely have either a new 
field or atleast its diagnostics denote that it failed due to a fatal TA 
failure. Current patch I believe we rely only on diagnostics and expect the 
user to set up the necessary info as needed via the reportFailure() API call. 

For tests, might be good to enhance one of the TestFaultTolerance based tests 
to allow for more complex fatal/non-fatal run tests.

An additional end-to-end test in recovery for a fatal failure might be good 
too. \cc [~zjffdu] for suggestions on the best approach. 



> Allow task to report different kinds of errors - fatal / kill
> -------------------------------------------------------------
>
>                 Key: TEZ-3161
>                 URL: https://issues.apache.org/jira/browse/TEZ-3161
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to