[ 
https://issues.apache.org/jira/browse/TEZ-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186098#comment-14186098
 ] 

Jeff Zhang commented on TEZ-1689:
---------------------------------

[~sseth] I have attached a new patch.

There is only one change: I reverted the change to 
VertexImpl.getOutputSpecList(taskIndex) and kept the original behavior.  It is 
not perfect, though.
The one remaining issue is that it produces diagnostics like the ones in the 
{code} block below (although it works). TaskAttempt_0 catches the real exception 
from VertexImpl.getOutputSpecList, but TaskAttempt_1 catches a different 
exception from the Processor. Because the outputSpec is cached during 
TaskAttempt_0, the exception is not thrown on the second call; the cached 
outputSpec is empty, so the Processor throws an exception instead.
We can't stop the Task from starting a new task attempt. I could add a hack 
here (such as setting max_attempt to 1, or setting a flag that stops new task 
attempts from starting), but I'm afraid this may cause new issues, especially 
on recovery, so I didn't do that. 
Another option is to remember the exception from the first call and rethrow it 
on subsequent calls to VertexImpl.getOutputSpecList (as I did in the first 
patch).

Any opinions on this?

{code}
taskAttempt=task_1414454439916_0001_1_00_000000, Fail to getSourceSpec, sourceTaskIndex=0, EdgeInfo: sourceVertexName=v1, destinationVertexName=v2, java.lang.RuntimeException: EM_GetNumSourceTaskPhysicalOutputs
        at org.apache.tez.test.TestExceptionPropagation$CustomEdgeManager.getNumSourceTaskPhysicalOutputs(TestExceptionPropagation.java:711)
        at org.apache.tez.dag.app.dag.impl.Edge.getSourceSpec(Edge.java:228)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.getOutputSpecList(VertexImpl.java:3932)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.createRemoteTaskSpec(TaskAttemptImpl.java:510)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$ScheduleTaskattemptTransition.transition(TaskAttemptImpl.java:1056)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$ScheduleTaskattemptTransition.transition(TaskAttemptImpl.java:1043)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:723)
        at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:108)
        at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1660)
        at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.NullPointerException
        at org.apache.tez.test.TestExceptionPropagation$ProcessorWithException.run(TestExceptionPropagation.java:540)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}
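
For reference, the second option above (remember the exception from the first call and rethrow it on later calls) could look roughly like the sketch below. This is only an illustration, not the code from any of the patches: the class CachingSpecLookup, its fields, and the Supplier-based loader are hypothetical stand-ins for VertexImpl.getOutputSpecList and the edge manager call behind it.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a cached output-spec lookup; NOT the real VertexImpl code.
class CachingSpecLookup {
    // Assumed wrapper around user code that may throw (e.g. an edge manager plugin).
    private final Supplier<List<String>> loader;
    private List<String> cachedSpecs;        // result of the first successful call
    private RuntimeException cachedFailure;  // exception thrown by the first failed call
    private int loaderCalls = 0;             // how often the user code was actually invoked

    CachingSpecLookup(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    List<String> getOutputSpecList() {
        if (cachedFailure != null) {
            // Later task attempts see the original failure, not a secondary one.
            throw cachedFailure;
        }
        if (cachedSpecs == null) {
            loaderCalls++;
            try {
                cachedSpecs = loader.get();
            } catch (RuntimeException e) {
                // Remember the failure instead of leaving a half-filled cache behind.
                cachedFailure = e;
                throw e;
            }
        }
        return cachedSpecs;
    }

    int getLoaderCalls() {
        return loaderCalls;
    }
}
```

With something like this, both TaskAttempt_0 and TaskAttempt_1 would report the same root-cause exception in the diagnostics, and the user code would be invoked only once.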

> Exception handling for EdgeManagerPlugin
> ----------------------------------------
>
>                 Key: TEZ-1689
>                 URL: https://issues.apache.org/jira/browse/TEZ-1689
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-1689-2.patch, TEZ-1689-3.patch, TEZ-1689-4.patch, 
> TEZ-1689.patch
>
>
> EdgeManagerPlugin and InputInitializer are both user code that could throw 
> exceptions; we should handle them, fail the DAG, and display the exception on 
> the client side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
