[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227749#comment-15227749 ] Bikas Saha commented on TEZ-3161: - Does a fatal error affect the recovery code path? E.g. fatal error got stored but dag failure did not get stored. What happens in recovery? Should dag fail after recovery because task fatal error was recovered. Likely yes, but does it work? Please ignore this comment in case its already covered. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Fix For: 0.8.3 > > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, > TEZ-3161.4.txt, TEZ-3161.5.txt, TEZ-3161.6.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226748#comment-15226748 ] Siddharth Seth commented on TEZ-3161: - [~oae] - I noticed you watching the jira. Is this something you intend to use ?, and if so, could you please provide some context around scenarios in which it will be used. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, > TEZ-3161.4.txt, TEZ-3161.5.txt, TEZ-3161.6.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226465#comment-15226465 ] Hitesh Shah commented on TEZ-3161: -- Looks good. Minor nits: - remove KKK - use of assertNull instead of Assert.assertNull - doesnt match convention in other parts of code - noticed it in the timeline history test for example. Likewise of use of InterfaceAudience.Private instead of Private ( annotations in same class use @Public ) Feel free to commit after changing above. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, > TEZ-3161.4.txt, TEZ-3161.5.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223070#comment-15223070 ] TezQA commented on TEZ-3161: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12796697/TEZ-3161.4.txt against master revision 0c7e1c5. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 18 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 35 javac compiler warnings (more than the master's current 33 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestFaultTolerance org.apache.tez.dag.history.logging.ats.TestHistoryEventTimelineConversion org.apache.tez.dag.app.rm.TestContainerReuse Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1604//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/1604//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1604//console This message is automatically generated. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, > TEZ-3161.4.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220917#comment-15220917 ] Hitesh Shah commented on TEZ-3161: -- bq. In terms of alternate naming - do you have suggestions on what would be less confusing Not sure - fatalError(), abortProcessing() - not sure I have good suggestions especially as fatalError is probably the one which should be indicating a fatal error instead of the current non-fatal behavior. bq. I'm OK marking it as private Lets mark it so initially until we can figure out a clear use-case for self-kills. bq. Any suggestion on this. Duplicate the TerminationCause to include FATAL_, and KILL_ for almost all the existing TerminationCauses ? Wouldnt there be only one specific termination cause to indicate that the user-code told the framework to abort itself or kill itself? bq. I though it was being written to history. TaskAttemptFinished event is being written to history but the failure type bit is not in the data being pushed to ATS ( check TimelineHistoryEventConversion or the *JsonConversion ). The proto was changed but that is only used in Recovery. Tests in sbubsequent follow-ups should be ok. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218546#comment-15218546 ] Siddharth Seth commented on TEZ-3161: - Thanks for the review. bq. FailureType seems to be specific to task failures. Maybe change enum to TaskFailureType Will do. bq. Thinking a bit more and looking at our previous API which says fatalError(), we might need to figure out better naming for both the enum and the API to ensure that the developer understands that one will fail the task completely i.e. no more re-attempts and the other will fail the task attempt but do a re-try. Maybe abort() or something similar for the API. Additionally, any reason why we cant leave the current API as is to denote a non-fatal error? A different approach could be to have failureType enum not be public but rather private impl and we expose fatalError() or abort() kind of APIs only? Open to any approach. The existing API is named fatalError - which makes it fairly confusing if we leave it as is and process it as a task that can be retried. Even after deprecation, it continues to be misleading - however the fact that it's deprecated should encourage people to read the javadoc. In terms of alternate naming - do you have suggestions on what would be less confusing ? Depending on this we can decide on keeping the current API and adding one more, or deprecating and exposing the enum. I prefer the enum approach though, since it's a single method. bq. "public void reportFailure(FailureType unsuccessfulEndReason, @Nullable Throwable exception, @Nullable String message);" - s/unsuccessfulEndReason/failureType/ and s/message/unsuccessfulEndReason/ - failure type does not indicate a reason of why the task failed. Leftovers from previous changes. Will rename the parameters. bq. killSelf() - I dont think this should be supported. It allows mis-use from user-code where it could end up re-running itself multiple times. Additionally, if the usecase is pipelined shuffle, if the input cannot handle certain failure scenarios, it may be better to consider the task as failed. It is okay to retain the code changes for now but lets mark it Private for now at the very least. Also killSelf documentation of "Kill the currently running attempt." is not really clear to a developer implement either an Input/Output/Processor. I had though about this. There's nothing stopping user code from causing a single task to hang forever. I'm OK marking it as private, but don't really see issues with exposing this functionality. Misuse will result in the dag running forever, which can easily be done within a bad processor as well. bq. For failure type fatal what is termination cause? As of now, this isn't changed, since the TerminationCause is determined based on who reported the error - Input/Output/Processor. The FailureType can be used as a separate field altogether to indicate whether the reported error was fatal or not, and the TerminationCause provides the source information. bq. It seems a bit messy that we have both termination cause and failure type where literally all termination causes are likely to be non-fatal. Any suggestion on this. Duplicate the TerminationCause to include FATAL_*, and KILL_* for almost all the existing TerminationCauses ? bq. Might be good to add another log line if task is being failed to a fatal TA failure. Above might lead to confusion as first glance for users seeing a task fail even though attempt counts have not hit max. Will change this to be easier to understand. bq. Why is the failure type not being written to history? Shouldn't it be needed by the UI? Additionally, for a task finishing, we should likely have either a new field or atleast its diagnostics denote that it failed due to a fatal TA failure. Current patch I believe we rely only on diagnostics and expect the user to set up the necessary info as needed via the reportFailure() API call. I though it was being written to history. At least for recovery purposes. I'll look again, in case there's a different history event which needs this information. Will check about augmenting the message provided by the user with fatal / non-fatal (or whatever it is eventually called) bq. For tests, might be good to enhance one of the TestFaultTolerance based tests to allow for more complex fatal/non-fatal run tests. This may be usefdul, especially for KILLED. I don't want to address this in the current jira though. There's several tests added already to cover basic scenarios. TestFaultTolerance is already flakey. bq. An additional end-to-end test in recovery for a fatal failure might be good too. If there are tests which handle task failure, that covers most of the scenarios. Again - I'd prefer a follow up jira for this. One additional test here may be to have a dag restart after a task sees a fatal error, but before the DAG has been
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216906#comment-15216906 ] Hitesh Shah commented on TEZ-3161: -- FailureType seems to be specific to task failures. Maybe change enum to TaskFailureType. Thinking a bit more and looking at our previous API which says fatalError(), we might need to figure out better naming for both the enum and the API to ensure that the developer understands that one will fail the task completely i.e. no more re-attempts and the other will fail the task attempt but do a re-try. Maybe abort() or something similar for the API. Additionally, any reason why we cant leave the current API as is to denote a non-fatal error? A different approach could be to have failureType enum not be public but rather private impl and we expose fatalError() or abort() kind of APIs only? Open to any approach. "public void reportFailure(FailureType unsuccessfulEndReason, @Nullable Throwable exception, @Nullable String message);" - s/unsuccessfulEndReason/failureType/ and s/message/unsuccessfulEndReason/ - failure type does not indicate a reason of why the task failed. killSelf() - I dont think this should be supported. It allows mis-use from user-code where it could end up re-running itself multiple times. Additionally, if the usecase is pipelined shuffle, if the input cannot handle certain failure scenarios, it may be better to consider the task as failed. It is okay to retain the code changes for now but lets mark it Private for now at the very least. Also killSelf documentation of "Kill the currently running attempt." is not really clear to a developer implement either an Input/Output/Processor. TaskAttemptTerminationCause vs FailureType: - For failure type fatal what is termination cause? - It seems a bit messy that we have both termination cause and failure type where literally all termination causes are likely to be non-fatal. - Additionally, this impacts UI in terms of how we display this information. Should file a follow-up jira for the UI to figure out how this should be visualized \cc [~Sreenath] {code} LOG.info("Failing task: " + task.getTaskId() + ", currentFailedAttempts: " + task.failedAttempts + ", maxFailedAttempts: " + task.maxFailedAttempts + ", FailureType: " + castEvent.getFailureType()); {code} - Might be good to add another log line if task is being failed to a fatal TA failure. Above might lead to confusion as first glance for users seeing a task fail even though attempt counts have not hit max. Why is the failure type not being written to history? Shouldn't it be needed by the UI? Additionally, for a task finishing, we should likely have either a new field or atleast its diagnostics denote that it failed due to a fatal TA failure. Current patch I believe we rely only on diagnostics and expect the user to set up the necessary info as needed via the reportFailure() API call. For tests, might be good to enhance one of the TestFaultTolerance based tests to allow for more complex fatal/non-fatal run tests. An additional end-to-end test in recovery for a fatal failure might be good too. \cc [~zjffdu] for suggestions on the best approach. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209029#comment-15209029 ] TezQA commented on TEZ-3161: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12795036/TEZ-3161.2.txt against master revision 19280cf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 18 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 35 javac compiler warnings (more than the master's current 33 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestShuffle Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1586//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/1586//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1586//console This message is automatically generated. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189975#comment-15189975 ] Siddharth Seth commented on TEZ-3161: - The same functionality could be added on other user plugin points as well. > Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)