[ https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557994#comment-14557994 ]
Jeff Zhang commented on TEZ-2304:
---------------------------------

In this log, there are only recovery events for attempt_1428329756093_168563_1_00_006728_1 (attempt_1) and none for attempt_1428329756093_168563_1_00_006728_0 (attempt_0).

It is possible that attempt_0 was killed before it started, so there are no recovery events for it. We should log the TaskAttemptFinishedEvent even when there is no TaskAttemptStartedEvent. (link this with TEZ-2456)

In this case, attempt_0 is not recovered but attempt_1 is. When a new attempt is then scheduled, its task attempt id is the same as that of attempt_1, because we create the task attempt id from attempts.size():
{code}
TaskAttempt attempt = createAttempt(attempts.size());
{code}

That is why we see the following weird transitions (from NEW to KILLED, and then from NEW to START_WAIT). These are actually two different task attempts with the same attempt id, so their state machines get mixed up.
{noformat}
2015-04-09 20:05:42,055 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to KILLED due to event TA_RECOVER
{noformat}
{noformat}
2015-04-09 20:05:45,748 INFO [AsyncDispatcher event handler] impl.TaskAttemptImpl: attempt_1428329756093_168563_1_00_006728_1 TaskAttempt Transitioned from NEW to START_WAIT due to event TA_SCHEDULE
{noformat}

> InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
> ------------------------------------------------------------------------
>
>                 Key: TEZ-2304
>                 URL: https://issues.apache.org/jira/browse/TEZ-2304
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Jason Lowe
>         Attachments: 168563_recovery.gz
>
>
> I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances
> during recovery complaining about TA_SCHEDULE arriving at the START_WAIT
> state.
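
The id collision described in the comment can be illustrated with a minimal, standalone sketch. It is not Tez code: the class name, the map of attempt numbers to states, and the helpers recoveredAttempts, nextIdFromSize and nextIdFromMax are all hypothetical, and the max-based alternative is only one possible way to avoid the clash, not necessarily the fix adopted for this issue. The sketch assumes that only attempt_1 was restored from the recovery log.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AttemptIdCollisionSketch {

    // Hypothetical stand-in for a task's attempts after recovery
    // (attempt number -> recovered state).
    static Map<Integer, String> recoveredAttempts() {
        Map<Integer, String> attempts = new LinkedHashMap<>();
        // attempt_0 left no recovery events, so only attempt_1 is restored.
        attempts.put(1, "KILLED");
        return attempts;
    }

    // Mirrors the createAttempt(attempts.size()) pattern quoted in the comment:
    // the next attempt number is simply the number of known attempts.
    static int nextIdFromSize(Map<Integer, String> attempts) {
        return attempts.size();
    }

    // One possible alternative (an assumption, not the Tez fix):
    // always go one past the highest attempt number already known.
    static int nextIdFromMax(Map<Integer, String> attempts) {
        int max = -1;
        for (int id : attempts.keySet()) {
            max = Math.max(max, id);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        Map<Integer, String> attempts = recoveredAttempts();

        int sizeBased = nextIdFromSize(attempts);
        int maxBased = nextIdFromMax(attempts);

        // With only attempt_1 recovered, size() is 1, so the size-based id
        // reuses attempt number 1; the max-based id is 2 and does not clash.
        System.out.println("size-based next id: " + sizeBased
                + ", collides: " + attempts.containsKey(sizeBased));
        System.out.println("max-based next id: " + maxBased
                + ", collides: " + attempts.containsKey(maxBased));
    }
}
```

With a gap at attempt_0, the size-based id lands on the already recovered attempt_1, which matches the two log lines above where the same attempt id goes from NEW to KILLED and then from NEW to START_WAIT.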