[ https://issues.apache.org/jira/browse/TEZ-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186098#comment-14186098 ]
Jeff Zhang commented on TEZ-1689:
---------------------------------

[~sseth] I have attached a new patch. It has only one change: it reverts the change to VertexImpl.getOutputSpecList(taskIndex) and keeps the original behavior.

It is not perfect, though. It works, but the one remaining issue is that it produces diagnostics like those in the trace below. TaskAttempt_0 catches the real exception from VertexImpl.getOutputSpecList, but TaskAttempt_1 catches a different exception from the Processor. Because the outputSpec is cached during TaskAttempt_0, the exception is not thrown on the second call; the cached outputSpec is empty, so the Processor throws instead.

We can't stop the Task from starting a new task attempt. I could add a hack here (such as setting max_attempt to 1, or setting a flag to stop new task attempts from starting), but I am afraid this may cause new issues, especially on recovery, so I didn't do that. Another approach is to remember the exception from the first call and throw it again on subsequent calls to VertexImpl.getOutputSpecList (as I did in the first patch).
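As a rough illustration of the "remember the exception" approach, here is a minimal sketch. It is not the code from any of the attached patches: the class name CachedOutcome and its loader parameter are hypothetical, and the real VertexImpl.getOutputSpecList may be structured quite differently. The sketch only shows the caching idea: the first call records either the value or the exception, and every later call replays that same outcome.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Tez): caches the outcome of the first
 * call, whether it is a value or a RuntimeException, and replays it on
 * every later call.
 */
final class CachedOutcome<T> {
  private final Supplier<T> loader;   // assumed to wrap the user-code call
  private boolean computed;           // true once the loader has run
  private T value;                    // cached result, if the loader succeeded
  private RuntimeException failure;   // cached exception, if the loader failed

  CachedOutcome(Supplier<T> loader) {
    this.loader = loader;
  }

  synchronized T get() {
    if (!computed) {
      try {
        value = loader.get();
      } catch (RuntimeException e) {
        // Remember the original exception instead of caching an empty value.
        failure = e;
      }
      computed = true;
    }
    if (failure != null) {
      // Every caller sees the same exception the first caller saw.
      throw failure;
    }
    return value;
  }
}
```

With something like this, a second task attempt would fail with the original exception from the edge manager rather than with an unrelated error from the Processor. Whether rethrowing the same exception instance is acceptable (its stack trace reflects only the first call) is a design choice this sketch leaves open.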

Any opinions on this?

{code}
taskAttempt=task_1414454439916_0001_1_00_000000, Fail to getSourceSpec, sourceTaskIndex=0, EdgeInfo: sourceVertexName=v1, destinationVertexName=v2, java.lang.RuntimeException: EM_GetNumSourceTaskPhysicalOutputs
	at org.apache.tez.test.TestExceptionPropagation$CustomEdgeManager.getNumSourceTaskPhysicalOutputs(TestExceptionPropagation.java:711)
	at org.apache.tez.dag.app.dag.impl.Edge.getSourceSpec(Edge.java:228)
	at org.apache.tez.dag.app.dag.impl.VertexImpl.getOutputSpecList(VertexImpl.java:3932)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.createRemoteTaskSpec(TaskAttemptImpl.java:510)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$ScheduleTaskattemptTransition.transition(TaskAttemptImpl.java:1056)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl$ScheduleTaskattemptTransition.transition(TaskAttemptImpl.java:1043)
	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:723)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:108)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1660)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.NullPointerException
	at org.apache.tez.test.TestExceptionPropagation$ProcessorWithException.run(TestExceptionPropagation.java:540)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}

> Exception handling for EdgeManagerPlugin
> ----------------------------------------
>
> Key: TEZ-1689
> URL: https://issues.apache.org/jira/browse/TEZ-1689
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: TEZ-1689-2.patch, TEZ-1689-3.patch, TEZ-1689-4.patch, TEZ-1689.patch
>
>
> EdgeManagerPlugin and InputInitializer are both user code that could throw
> exceptions; we should handle them, fail the DAG, and display the exception
> on the client side.