[ https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166432#comment-14166432 ]
Jeff Zhang commented on TEZ-1470:
---------------------------------

[~hitesh] Using a map looks clean to me. I saw some ugly code like this (incrementing numberUncompletedAttempts in TaskRetroactiveFailureTransition):

{code}
// fake values for code for super.transition
++task.numberUncompletedAttempts;
task.finishedAttempts--;
TaskStateInternal returnState = super.transition(task, event);
{code}

Are you worried that this change may cause new issues? If so, for the next release I could first limit the change to recovery, and make the other changes later.

> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for the same task
> ------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1470
>                 URL: https://issues.apache.org/jira/browse/TEZ-1470
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Minor
>         Attachments: Tez-1470.patch
>
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this
> case TaskAttemptFinishedEvent may be recorded twice, which will cause a
> failure in recovery.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for attempt finished, more completions than starts encountered, taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for attempt finished, more completions than starts encountered, taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, incompleteAttempts=-1
>       at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>       at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>       at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>       at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>       at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>       at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}
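
To illustrate why tracking attempts in a map avoids the "more completions than starts" failure, here is a minimal sketch. It is not the code from Tez-1470.patch or from TaskImpl; the class AttemptTracker, its methods, and the attempt ID string are hypothetical names chosen for illustration. The idea is that counts are derived from per-attempt state, so a second finished event for the same attempt (SUCCEEDED, then KILLED after node failure) overwrites the same entry instead of bumping a counter twice.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of map-based attempt tracking; not Tez's TaskImpl. */
final class AttemptTracker {
    enum AttemptState { STARTED, FINISHED }

    // One entry per attempt ID, so replayed events cannot be double-counted.
    private final Map<String, AttemptState> attempts = new HashMap<>();

    /** Records a start; a replayed start never downgrades a FINISHED attempt. */
    void recordStarted(String attemptId) {
        attempts.putIfAbsent(attemptId, AttemptState.STARTED);
    }

    /** Records a finish; returns true only the first time this attempt finishes. */
    boolean recordFinished(String attemptId) {
        AttemptState previous = attempts.put(attemptId, AttemptState.FINISHED);
        return previous != AttemptState.FINISHED;
    }

    int finishedAttempts() {
        return count(AttemptState.FINISHED);
    }

    int incompleteAttempts() {
        return count(AttemptState.STARTED);
    }

    // Counts are computed from the map, so they can never go negative.
    private int count(AttemptState wanted) {
        int n = 0;
        for (AttemptState state : attempts.values()) {
            if (state == wanted) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        AttemptTracker tracker = new AttemptTracker();
        tracker.recordStarted("attempt_0");
        tracker.recordFinished("attempt_0"); // first finished event (SUCCEEDED)
        tracker.recordFinished("attempt_0"); // duplicate finished event (KILLED)
        System.out.println("finished=" + tracker.finishedAttempts()
                + ", incomplete=" + tracker.incompleteAttempts());
        // prints "finished=1, incomplete=0"
    }
}
```

With plain counters, the same event sequence would give finishedAttempts=2 and incompleteAttempts=-1, which matches the values in the exception message above.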