[ https://issues.apache.org/jira/browse/TEZ-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524013#comment-14524013 ]
Hitesh Shah commented on TEZ-2379:
----------------------------------

{code}
.addTransition(
    TaskAttemptStateInternal.SUCCEEDED,
    EnumSet.of(TaskAttemptStateInternal.KILLED, TaskAttemptStateInternal.SUCCEEDED),
    TaskAttemptEventType.TA_KILL_REQUEST,
    new TerminatedAfterSuccessTransition())
.addTransition(
    TaskAttemptStateInternal.SUCCEEDED,
    EnumSet.of(TaskAttemptStateInternal.KILLED, TaskAttemptStateInternal.SUCCEEDED),
    TaskAttemptEventType.TA_NODE_FAILED,
    new TerminatedAfterSuccessTransition())
.addTransition(
    TaskAttemptStateInternal.SUCCEEDED,
    EnumSet.of(TaskAttemptStateInternal.FAILED, TaskAttemptStateInternal.SUCCEEDED),
    TaskAttemptEventType.TA_OUTPUT_FAILED,
    new OutputReportedFailedTransition())
{code}

Based on the above, there are 3 cases where a succeeded attempt can revert to a non-succeeded state:

- TA_NODE_FAILED could potentially show up while a DAG is being killed.
- TA_KILL_REQUEST could probably be ignored, based on your analysis above.
- TA_OUTPUT_FAILED could also race with a user kill. We need to check whether a failed attempt is handled correctly in that case.
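To make the three arcs easier to reason about, here is a minimal, self-contained sketch of the same shape in plain Java. It does not use the YARN StateMachineFactory or any Tez class. The names SucceededTransitionsSketch, Attempt, handle and outputStillNeeded are hypothetical, and the rule used to pick between the two possible post-states is an assumption made for illustration, not what TerminatedAfterSuccessTransition or OutputReportedFailedTransition necessarily do. Ignoring events that arrive after the attempt has already left SUCCEEDED is likewise just one possible handling.

```java
import java.util.EnumSet;

final class SucceededTransitionsSketch {
    enum State { SUCCEEDED, KILLED, FAILED }
    enum Event { TA_KILL_REQUEST, TA_NODE_FAILED, TA_OUTPUT_FAILED }

    /** Hypothetical stand-in for a task attempt; not a Tez class. */
    static final class Attempt {
        State state = State.SUCCEEDED;
        // Assumed decision input: whether anything still depends on this attempt's output.
        final boolean outputStillNeeded;

        Attempt(boolean outputStillNeeded) {
            this.outputStillNeeded = outputStillNeeded;
        }
    }

    /** Applies one event and returns the resulting state. */
    static State handle(Attempt attempt, Event event) {
        if (attempt.state != State.SUCCEEDED) {
            // Sketch-only choice: late events in KILLED/FAILED are ignored, not rejected.
            return attempt.state;
        }
        EnumSet<State> allowed;
        State next;
        switch (event) {
            case TA_KILL_REQUEST:
            case TA_NODE_FAILED:
                allowed = EnumSet.of(State.KILLED, State.SUCCEEDED);
                next = attempt.outputStillNeeded ? State.KILLED : State.SUCCEEDED;
                break;
            case TA_OUTPUT_FAILED:
                allowed = EnumSet.of(State.FAILED, State.SUCCEEDED);
                next = attempt.outputStillNeeded ? State.FAILED : State.SUCCEEDED;
                break;
            default:
                throw new IllegalArgumentException("Unhandled event: " + event);
        }
        // A multi-arc transition may only land in one of its declared post-states.
        if (!allowed.contains(next)) {
            throw new IllegalStateException("Invalid post-state " + next + " for " + event);
        }
        attempt.state = next;
        return next;
    }

    public static void main(String[] args) {
        Attempt attempt = new Attempt(true);
        System.out.println(handle(attempt, Event.TA_NODE_FAILED));
        // A user kill racing in afterwards finds the attempt already out of SUCCEEDED.
        System.out.println(handle(attempt, Event.TA_KILL_REQUEST));
    }
}
```

The second call in main mirrors the race discussed above: once one event has moved the attempt out of SUCCEEDED, whichever event arrives next has to be tolerated by the state it finds.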

> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_KILLED at KILLED
> ------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2379
>                 URL: https://issues.apache.org/jira/browse/TEZ-2379
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Hitesh Shah
>            Priority: Blocker
>         Attachments: TEZ-2379.1.patch
>
>
> {noformat}
> 2015-04-28 04:49:32,455 ERROR [Dispatcher thread: Central] impl.TaskImpl: Can't handle this event at current state for task_1429683757595_0479_1_03_000013
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: T_ATTEMPT_KILLED at KILLED
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:853)
>     at org.apache.tez.dag.app.dag.impl.TaskImpl.handle(TaskImpl.java:106)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1874)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher.handle(DAGAppMaster.java:1860)
>     at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
>     at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Additional notes:
> ============
> Hive - latest build
> Tez - master
> tpch-200 gb scale q_17 (kill the job in the middle of execution)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)