[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216906#comment-15216906 ]
Hitesh Shah commented on TEZ-3161: ---------------------------------- FailureType seems to be specific to task failures. Maybe change enum to TaskFailureType. Thinking a bit more and looking at our previous API which says fatalError(), we might need to figure out better naming for both the enum and the API to ensure that the developer understands that one will fail the task completely i.e. no more re-attempts and the other will fail the task attempt but do a re-try. Maybe abort() or something similar for the API. Additionally, any reason why we cant leave the current API as is to denote a non-fatal error? A different approach could be to have failureType enum not be public but rather private impl and we expose fatalError() or abort() kind of APIs only? Open to any approach. "public void reportFailure(FailureType unsuccessfulEndReason, @Nullable Throwable exception, @Nullable String message);" - s/unsuccessfulEndReason/failureType/ and s/message/unsuccessfulEndReason/ - failure type does not indicate a reason of why the task failed. killSelf() - I dont think this should be supported. It allows mis-use from user-code where it could end up re-running itself multiple times. Additionally, if the usecase is pipelined shuffle, if the input cannot handle certain failure scenarios, it may be better to consider the task as failed. It is okay to retain the code changes for now but lets mark it Private for now at the very least. Also killSelf documentation of "Kill the currently running attempt." is not really clear to a developer implement either an Input/Output/Processor. TaskAttemptTerminationCause vs FailureType: - For failure type fatal what is termination cause? - It seems a bit messy that we have both termination cause and failure type where literally all termination causes are likely to be non-fatal. - Additionally, this impacts UI in terms of how we display this information. Should file a follow-up jira for the UI to figure out how this should be visualized \cc [~Sreenath] {code} LOG.info("Failing task: " + task.getTaskId() + ", currentFailedAttempts: " + task.failedAttempts + ", maxFailedAttempts: " + task.maxFailedAttempts + ", FailureType: " + castEvent.getFailureType()); {code} - Might be good to add another log line if task is being failed to a fatal TA failure. Above might lead to confusion as first glance for users seeing a task fail even though attempt counts have not hit max. Why is the failure type not being written to history? Shouldn't it be needed by the UI? Additionally, for a task finishing, we should likely have either a new field or atleast its diagnostics denote that it failed due to a fatal TA failure. Current patch I believe we rely only on diagnostics and expect the user to set up the necessary info as needed via the reportFailure() API call. For tests, might be good to enhance one of the TestFaultTolerance based tests to allow for more complex fatal/non-fatal run tests. An additional end-to-end test in recovery for a fatal failure might be good too. \cc [~zjffdu] for suggestions on the best approach. > Allow task to report different kinds of errors - fatal / kill > ------------------------------------------------------------- > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)