[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-04-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227749#comment-15227749
 ] 

Bikas Saha commented on TEZ-3161:
-

Does a fatal error affect the recovery code path? E.g. fatal error got stored 
but dag failure did not get stored. What happens in recovery? Should dag fail 
after recovery because task fatal error was recovered. Likely yes, but does it 
work? Please ignore this comment in case its already covered.

> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Fix For: 0.8.3
>
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, 
> TEZ-3161.4.txt, TEZ-3161.5.txt, TEZ-3161.6.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-04-05 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226748#comment-15226748
 ] 

Siddharth Seth commented on TEZ-3161:
-

[~oae] - I noticed you watching the jira. Is this something you intend to use 
?, and if so, could you please provide some context around scenarios in which 
it will be used.

> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, 
> TEZ-3161.4.txt, TEZ-3161.5.txt, TEZ-3161.6.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-04-05 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226465#comment-15226465
 ] 

Hitesh Shah commented on TEZ-3161:
--

Looks good. 

Minor nits:
  - remove KKK
  - use of assertNull instead of Assert.assertNull - doesnt match convention in 
other parts of code - noticed it in the timeline history test for example. 
Likewise of use of InterfaceAudience.Private instead of Private ( annotations 
in same class use @Public ) 

Feel free to commit after changing above. 

> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, 
> TEZ-3161.4.txt, TEZ-3161.5.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-04-02 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223070#comment-15223070
 ] 

TezQA commented on TEZ-3161:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12796697/TEZ-3161.4.txt
  against master revision 0c7e1c5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 18 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 35 javac 
compiler warnings (more than the master's current 33 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestFaultTolerance
  
org.apache.tez.dag.history.logging.ats.TestHistoryEventTimelineConversion
  org.apache.tez.dag.app.rm.TestContainerReuse

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1604//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1604//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1604//console

This message is automatically generated.

> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt, 
> TEZ-3161.4.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220917#comment-15220917
 ] 

Hitesh Shah commented on TEZ-3161:
--

bq. In terms of alternate naming - do you have suggestions on what would be 
less confusing

Not sure - fatalError(), abortProcessing() - not sure I have good suggestions 
especially as fatalError is probably the one which should be indicating a fatal 
error instead of the current non-fatal behavior. 

bq. I'm OK marking it as private

Lets mark it so initially until we can figure out a clear use-case for 
self-kills.  

bq. Any suggestion on this. Duplicate the TerminationCause to include FATAL_, 
and KILL_ for almost all the existing TerminationCauses ?

Wouldnt there be only one specific termination cause to indicate that the 
user-code told the framework to abort itself or kill itself?

bq. I though it was being written to history. 

TaskAttemptFinished event is being written to history but the failure type bit 
is not in the data being pushed to ATS ( check TimelineHistoryEventConversion 
or the *JsonConversion ). The proto was changed but that is only used in 
Recovery. 

Tests in sbubsequent follow-ups should be ok.  




> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-03-30 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218546#comment-15218546
 ] 

Siddharth Seth commented on TEZ-3161:
-

Thanks for the review.

bq. FailureType seems to be specific to task failures. Maybe change enum to 
TaskFailureType
Will do.

bq. Thinking a bit more and looking at our previous API which says 
fatalError(), we might need to figure out better naming for both the enum and 
the API to ensure that the developer understands that one will fail the task 
completely i.e. no more re-attempts and the other will fail the task attempt 
but do a re-try. Maybe abort() or something similar for the API. Additionally, 
any reason why we cant leave the current API as is to denote a non-fatal error? 
A different approach could be to have failureType enum not be public but rather 
private impl and we expose fatalError() or abort() kind of APIs only? Open to 
any approach.
The existing API is named fatalError - which makes it fairly confusing if we 
leave it as is and process it as a task that can be retried. Even after 
deprecation, it continues to be misleading - however the fact that it's 
deprecated should encourage people to read the javadoc.
In terms of alternate naming - do you have suggestions on what would be less 
confusing ? Depending on this we can decide on keeping the current API and 
adding one more, or deprecating and exposing the enum. I prefer the enum 
approach though, since it's a single method.

bq. "public void reportFailure(FailureType unsuccessfulEndReason, @Nullable 
Throwable exception, @Nullable String message);" - 
s/unsuccessfulEndReason/failureType/ and s/message/unsuccessfulEndReason/ - 
failure type does not indicate a reason of why the task failed.
Leftovers from previous changes. Will rename the parameters.

bq. killSelf() - I dont think this should be supported. It allows mis-use from 
user-code where it could end up re-running itself multiple times. Additionally, 
if the usecase is pipelined shuffle, if the input cannot handle certain failure 
scenarios, it may be better to consider the task as failed. It is okay to 
retain the code changes for now but lets mark it Private for now at the very 
least. Also killSelf documentation of "Kill the currently running attempt." is 
not really clear to a developer implement either an Input/Output/Processor.
I had though about this. There's nothing stopping user code from causing a 
single task to hang forever. I'm OK marking it as private, but don't really see 
issues with exposing this functionality. Misuse will result in the dag running 
forever, which can easily be done within a bad processor as well.

bq. For failure type fatal what is termination cause?
As of now, this isn't changed, since the TerminationCause is determined based 
on who reported the error - Input/Output/Processor. The FailureType can be used 
as a separate field altogether to indicate whether the reported error was fatal 
or not, and the TerminationCause provides the source information.

bq. It seems a bit messy that we have both termination cause and failure type 
where literally all termination causes are likely to be non-fatal.
Any suggestion on this. Duplicate the TerminationCause to include FATAL_*, and 
KILL_* for almost all the existing TerminationCauses ?

bq. Might be good to add another log line if task is being failed to a fatal TA 
failure. Above might lead to confusion as first glance for users seeing a task 
fail even though attempt counts have not hit max.
Will change this to be easier to understand.

bq. Why is the failure type not being written to history? Shouldn't it be 
needed by the UI? Additionally, for a task finishing, we should likely have 
either a new field or atleast its diagnostics denote that it failed due to a 
fatal TA failure. Current patch I believe we rely only on diagnostics and 
expect the user to set up the necessary info as needed via the reportFailure() 
API call.
I though it was being written to history. At least for recovery purposes. I'll 
look again, in case there's a different history event which needs this 
information. Will check about augmenting the message provided by the user with 
fatal / non-fatal (or whatever it is eventually called)

bq. For tests, might be good to enhance one of the TestFaultTolerance based 
tests to allow for more complex fatal/non-fatal run tests.
This may be usefdul, especially for KILLED. I don't want to address this in the 
current jira though. There's several tests added already to cover basic 
scenarios. TestFaultTolerance is already flakey.

bq. An additional end-to-end test in recovery for a fatal failure might be good 
too.
If there are tests which handle task failure, that covers most of the 
scenarios. Again - I'd prefer a follow up jira for this. One additional test 
here may be to have a dag restart after a task sees a fatal error, but before 
the DAG has been 

[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-03-29 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216906#comment-15216906
 ] 

Hitesh Shah commented on TEZ-3161:
--


FailureType seems to be specific to task failures. Maybe change enum to 
TaskFailureType. Thinking a bit more and looking at our previous API which says 
fatalError(), we might need to figure out better naming for both the enum and 
the API to ensure that the developer understands that one will fail the task 
completely i.e. no more re-attempts and the other will fail the task attempt 
but do a re-try. Maybe abort() or something similar for the API. Additionally, 
any reason why we cant leave the current API as is to denote a non-fatal error? 
A different approach could be to have failureType enum not be public but rather 
private impl and we expose fatalError() or abort() kind of APIs only? Open to 
any approach.  

"public void reportFailure(FailureType unsuccessfulEndReason, @Nullable 
Throwable exception, @Nullable String message);" - 
s/unsuccessfulEndReason/failureType/ and s/message/unsuccessfulEndReason/ - 
failure type does not indicate a reason of why the task failed. 

killSelf() - I dont think this should be supported. It allows mis-use from 
user-code where it could end up re-running itself multiple times. Additionally, 
if the usecase is pipelined shuffle, if the input cannot handle certain failure 
scenarios, it may be better to consider the task as failed. It is okay to 
retain the code changes for now but lets mark it Private for now at the very 
least.  Also killSelf documentation of "Kill the currently running attempt." is 
not really clear to a developer implement either an Input/Output/Processor.  

TaskAttemptTerminationCause vs FailureType:
   - For failure type fatal what is termination cause? 
   - It seems a bit messy that we have both termination cause and failure type 
where literally all termination causes are likely to be non-fatal. 
   - Additionally, this impacts UI in terms of how we display this information. 
Should file a follow-up jira for the UI to figure out how this should be 
visualized \cc [~Sreenath]

{code}
LOG.info("Failing task: " + task.getTaskId()
+ ", currentFailedAttempts: " + task.failedAttempts + ", 
maxFailedAttempts: "
+ task.maxFailedAttempts + ", FailureType: " + 
castEvent.getFailureType());
{code}
   - Might be good to add another log line if task is being failed to a fatal 
TA failure. Above might lead to confusion as first glance for users seeing a 
task fail even though attempt counts have not hit max. 

Why is the failure type not being written to history? Shouldn't it be needed by 
the UI? Additionally, for a task finishing, we should likely have either a new 
field or atleast its diagnostics denote that it failed due to a fatal TA 
failure. Current patch I believe we rely only on diagnostics and expect the 
user to set up the necessary info as needed via the reportFailure() API call. 

For tests, might be good to enhance one of the TestFaultTolerance based tests 
to allow for more complex fatal/non-fatal run tests.

An additional end-to-end test in recovery for a fatal failure might be good 
too. \cc [~zjffdu] for suggestions on the best approach. 



> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-03-23 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209029#comment-15209029
 ] 

TezQA commented on TEZ-3161:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12795036/TEZ-3161.2.txt
  against master revision 19280cf.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 18 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 35 javac 
compiler warnings (more than the master's current 33 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestShuffle

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1586//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1586//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1586//console

This message is automatically generated.

> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-03-10 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189975#comment-15189975
 ] 

Siddharth Seth commented on TEZ-3161:
-

The same functionality could be added on other user plugin points as well.

> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)