[ 
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166432#comment-14166432
 ] 

Jeff Zhang commented on TEZ-1470:
---------------------------------

[~hitesh] Using a map looks cleaner to me. I saw some ugly code like this 
(incrementing numberUncompletedAttempts in TaskRetroactiveFailureTransition):
{code}
      // fake values for code for super.transition
      ++task.numberUncompletedAttempts;
      task.finishedAttempts--;
      TaskStateInternal returnState = super.transition(task, event);
{code}

Are you worried that this change may introduce new issues? If so, for the next 
release I could first limit the change to recovery and make the other changes later.
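
To make the idea concrete, here is a minimal sketch of why counters break on a duplicate finished event while a map keyed by attempt ID does not. This is not the actual TaskImpl code or the attached patch: the class names, the AttemptState enum and the attempt ID string are all hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for recovery bookkeeping; not the real TaskImpl logic. */
public class AttemptRecoverySketch {

    /** Hypothetical per-attempt states used only in this sketch. */
    enum AttemptState { STARTED, FINISHED }

    /** Counter-based bookkeeping: every finished event moves both counters. */
    static class CounterTracker {
        int finishedAttempts = 0;
        int incompleteAttempts = 0;

        void onStarted(String attemptId) {
            incompleteAttempts++;
        }

        void onFinished(String attemptId) {
            // A second finished event for the same attempt is counted again.
            finishedAttempts++;
            incompleteAttempts--;
        }
    }

    /** Map-based bookkeeping: state is keyed by attempt ID, so a repeat is a no-op. */
    static class MapTracker {
        private final Map<String, AttemptState> attempts = new LinkedHashMap<>();

        void onStarted(String attemptId) {
            // Do not overwrite a state that was already recorded for this attempt.
            attempts.putIfAbsent(attemptId, AttemptState.STARTED);
        }

        void onFinished(String attemptId) {
            // Overwriting FINISHED with FINISHED leaves the totals unchanged.
            attempts.put(attemptId, AttemptState.FINISHED);
        }

        int finishedAttempts() {
            return count(AttemptState.FINISHED);
        }

        int incompleteAttempts() {
            return count(AttemptState.STARTED);
        }

        private int count(AttemptState wanted) {
            int n = 0;
            for (AttemptState s : attempts.values()) {
                if (s == wanted) {
                    n++;
                }
            }
            return n;
        }
    }

    public static void main(String[] args) {
        String attempt = "attempt_0"; // hypothetical attempt ID

        // One start, then two finished events (SUCCEEDED, later KILLED).
        CounterTracker counters = new CounterTracker();
        counters.onStarted(attempt);
        counters.onFinished(attempt);
        counters.onFinished(attempt);
        // prints "counters: finishedAttempts=2, incompleteAttempts=-1"
        System.out.println("counters: finishedAttempts=" + counters.finishedAttempts
                + ", incompleteAttempts=" + counters.incompleteAttempts);

        MapTracker map = new MapTracker();
        map.onStarted(attempt);
        map.onFinished(attempt);
        map.onFinished(attempt);
        // prints "map: finishedAttempts=1, incompleteAttempts=0"
        System.out.println("map: finishedAttempts=" + map.finishedAttempts()
                + ", incompleteAttempts=" + map.incompleteAttempts());
    }
}
```

With counters, the replayed duplicate gives the same finishedAttempts=2, incompleteAttempts=-1 values as in the reported exception. With the map, the totals are derived from per-attempt state, so there is nothing to fake before calling into shared transition code.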


> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for 
> the same task
> ------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1470
>                 URL: https://issues.apache.org/jira/browse/TEZ-1470
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Minor
>         Attachments: Tez-1470.patch
>
>
> A TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this 
> case TaskAttemptFinishedEvent may be recorded twice, which causes recovery 
> to fail.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for attempt finished, more completions than starts encountered, taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for attempt finished, more completions than starts encountered, taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, incompleteAttempts=-1
>       at org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>       at org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>       at org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>       at org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>       at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>       at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
