[ 
https://issues.apache.org/jira/browse/TEZ-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534565#comment-14534565
 ] 

Jeff Zhang commented on TEZ-2429:
---------------------------------

The following event sequence can cause the invalid transition (Invalid event: 
DAG_VERTEX_RERUNNING at SUCCEEDED):
1. Task completes (enqueue VertexEventTaskComplete)
2. Task reruns (enqueue VertexEventTaskRerun)
3. Vertex goes to SUCCEEDED due to VertexEventTaskComplete (dequeue 
VertexEventTaskComplete, enqueue DAGEventVertexComplete)
4. Vertex goes to re-running due to VertexEventTaskRerun (dequeue 
VertexEventTaskRerun, enqueue DAGEventVertexRerun)
5. DAG goes to SUCCEEDED due to DAGEventVertexComplete (dequeue 
DAGEventVertexComplete)
6. DAG goes to ERROR due to DAGEventVertexRerun (dequeue DAGEventVertexRerun, 
InvalidTransition happens)

Oddly, though, I can only reproduce this issue on code from before TEZ-2404. 
It looks like TEZ-2404 fixed it and prevents this event sequence. I will look 
into it further.
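
To make the ordering in the six steps above easier to follow, here is a 
minimal, self-contained sketch that replays them on a single FIFO queue. It is 
not Tez code: the class DispatcherRaceSketch, its enums, and the shortened 
event names are hypothetical, and it assumes a DAG with one vertex whose 
completion alone moves the DAG to SUCCEEDED. Only the DAG_VERTEX_RERUNNING 
name and the state names are taken from the log.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of a single-threaded dispatcher queue.
// This is an illustration of the event ordering only, not Tez code.
class DispatcherRaceSketch {
    // Shortened stand-ins for the event classes named in the steps above.
    enum Event { TASK_COMPLETE, TASK_RERUN, DAG_VERTEX_COMPLETE, DAG_VERTEX_RERUNNING }
    enum VertexState { RUNNING, SUCCEEDED }
    enum DagState { RUNNING, SUCCEEDED, ERROR }

    static DagState replay() {
        Queue<Event> queue = new ArrayDeque<>();
        VertexState vertex = VertexState.RUNNING;
        DagState dag = DagState.RUNNING;

        // Steps 1 and 2: both task events are queued before either is handled.
        queue.add(Event.TASK_COMPLETE);
        queue.add(Event.TASK_RERUN);

        while (!queue.isEmpty()) {
            Event event = queue.remove();
            switch (event) {
                case TASK_COMPLETE:
                    // Step 3: vertex succeeds and notifies the DAG.
                    vertex = VertexState.SUCCEEDED;
                    queue.add(Event.DAG_VERTEX_COMPLETE);
                    break;
                case TASK_RERUN:
                    // Step 4: vertex goes back to running and notifies the DAG,
                    // but the completion notice is already ahead in the queue.
                    vertex = VertexState.RUNNING;
                    queue.add(Event.DAG_VERTEX_RERUNNING);
                    break;
                case DAG_VERTEX_COMPLETE:
                    // Step 5: assumed single-vertex DAG, so the DAG succeeds.
                    dag = DagState.SUCCEEDED;
                    break;
                case DAG_VERTEX_RERUNNING:
                    // Step 6: no transition is defined from SUCCEEDED for this
                    // event, which the sketch models as moving to ERROR.
                    if (dag == DagState.SUCCEEDED) {
                        dag = DagState.ERROR;
                    }
                    break;
            }
        }
        return dag;
    }
}
```

Calling DispatcherRaceSketch.replay() walks the queue in the order listed 
above. The point of the sketch is that the rerun notice reaches the DAG only 
after the DAG has already acted on the completion notice.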

> Tez AM does not die after hitting internal error 
> -------------------------------------------------
>
>                 Key: TEZ-2429
>                 URL: https://issues.apache.org/jira/browse/TEZ-2429
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Priority: Blocker
>         Attachments: syslog_dag_1430956448478_0001_16_post, 
> syslog_dag_1430956448478_0001_17
>
>
> From https://builds.apache.org/job/Tez-Build/1055/: 
> 2015-05-06 23:55:54,421 ERROR [Dispatcher thread: Central] impl.DAGImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> DAG_VERTEX_RERUNNING at SUCCEEDED
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>       at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>       at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079)
>       at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143)
>       at 
> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871)
>       at 
> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862)
>       at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>       at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
>       at java.lang.Thread.run(Thread.java:662)
> 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: 
> Cleaning up DAG: name=testRandomFailingInputs, with 
> id=dag_1430956448478_0001_16
> 2015-05-06 23:55:54,423 INFO [Dispatcher thread: Central] app.DAGAppMaster: 
> Completed cleanup for DAG: name=testRandomFailingInputs, with 
> id=dag_1430956448478_0001_16
> 2015-05-06 23:55:54,424 INFO [Dispatcher thread: Central] impl.DAGImpl: 
> dag_1430956448478_0001_16 terminating due to internal error
> 2015-05-06 23:55:54,433 INFO [IPC Server handler 0 on 47432] 
> app.DAGAppMaster: Starting DAG submitted via RPC: 
> testBasicInputFailureWithExit
> 2015-05-06 23:55:54,455 ERROR [Dispatcher thread: Central] 
> common.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService.doFlush(RecoveryService.java:458)
>       at 
> org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:289)
>       at 
> org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:102)
>       at 
> org.apache.tez.dag.app.dag.impl.DAGImpl.logJobHistoryUnsuccesfulEvent(DAGImpl.java:1161)
>       at org.apache.tez.dag.app.dag.impl.DAGImpl.finished(DAGImpl.java:1275)
>       at org.apache.tez.dag.app.dag.impl.DAGImpl.access$2600(DAGImpl.java:144)
>       at 
> org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2151)
>       at 
> org.apache.tez.dag.app.dag.impl.DAGImpl$InternalErrorTransition.transition(DAGImpl.java:2140)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>       at 
> org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>       at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:1079)
>       at org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:143)
>       at 
> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1871)
>       at 
> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1862)
>       at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>       at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
>       at java.lang.Thread.run(Thread.java:662)
> 2015-05-06 23:55:54,456 INFO [Dispatcher thread: Central] impl.VertexImpl: 
> Killing tasks in vertex: vertex_1430956448478_0001_16_10 [l4v1] due to 
> trigger: INTERNAL_ERROR
> 2015-05-06 23:55:54,456 INFO [Dispatcher thread: Central] impl.VertexImpl: 
> vertex_1430956448478_0001_16_10 [l4v1] transitioned from RUNNING to 
> TERMINATING due to event V_TERMINATE
> 2015-05-06 23:55:54,456 INFO [AsyncDispatcher ShutDown handler] 
> common.AsyncDispatcher: Exiting, bbye..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
