[ https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001874#comment-15001874 ]
Jeff Zhang edited comment on TEZ-2581 at 11/12/15 9:23 AM: ----------------------------------------------------------- bq. Please create a jira instead of leaving behind a TODO Created TEZ-2938 bq. Secondly, for handling commit recovery, would it be possible to do the check and recovery of commit operation in the recovery flow of the ta_done inside the attempt itself. That way the full responsibility of recovering the attempt (including its commit) will stay in the attempt instead of spilling over into the task? IF you think this would be better, we can do it in a follow up jira or within this patch, your call. I think it would be better to leave it in Task. Because I think the commit of task attempt should be controlled by task rather than itself (just like TaskAttemptListener also control the commit of task attempt). From the semantics of the recovery API (OutputComitter#recoverTask), it is to recover the task rather than task attempt. It would be weird to call OutputCommitter#recoverTask in TaskAttemptImpl bq. Why are we sending TaskAttemptEventAttemptFailed and then TaskAttemptEventContainerTerminated? Shouldn't the first be enough to change the state to failed? This is because All the Termination related transition extends TerminateTransition which is SingleArcTransition. So after TaskAttemptEventAttemptFailed event, task attempt must go to FAIL_IN_PROGRESS. That's why I send TaskAttemptEventContainerTerminated after that. Another solution I can think of is to create another new event e.g. FAIL_IN_RECOVERY, and then make a new transition for this event. bq. I skimmed over the test changes. Looks good. I see that we have used the tez examples for some e2e cases. It may be useful to follow the pattern of TestFaultTolerance to create some specific controlled cases and use them for a formal test matrix. E.g. Lets say that we create a 3 level dag. And lets say vertices can be Initializing (I), HalfRunning (H), Done(D). Then we can have cases like III, HII, HHH, DDD, HIH etc. This would create a methodical coverage for various cases. If you agree then this can be a follow up jira. Created TEZ-2939 for that. * Currently I use OrderedWordCount & HashJoin as the system test. I think these are 2 typical dag topological. And for each example, there's one test matrix,will cover the cases as you mentioned (III, HII, HHH, DDD, HIH) * Another reason I use examples is that I also need to verify the job result. I do see before that job recover successfully but get incorrect result. (due to multiple copies of DM events) * There's one drawback about the system test. It would take lots of time because for each case the job would run under minicluster in distributed mode. And totally there's more than 30 cases. was (Author: zjffdu): bq. Please create a jira instead of leaving behind a TODO Created TEZ-2938 bq. Secondly, for handling commit recovery, would it be possible to do the check and recovery of commit operation in the recovery flow of the ta_done inside the attempt itself. That way the full responsibility of recovering the attempt (including its commit) will stay in the attempt instead of spilling over into the task? IF you think this would be better, we can do it in a follow up jira or within this patch, your call. bq. Why are we sending TaskAttemptEventAttemptFailed and then TaskAttemptEventContainerTerminated? Shouldn't the first be enough to change the state to failed? This is because All the Termination related transition extends TerminateTransition which is SingleArcTransition. So after TaskAttemptEventAttemptFailed event, task attempt must go to FAIL_IN_PROGRESS. That's why I send TaskAttemptEventContainerTerminated after that. Another solution I can think of is to create another new event e.g. FAIL_IN_RECOVERY, and then make a new transition for this event. bq. I skimmed over the test changes. Looks good. I see that we have used the tez examples for some e2e cases. It may be useful to follow the pattern of TestFaultTolerance to create some specific controlled cases and use them for a formal test matrix. E.g. Lets say that we create a 3 level dag. And lets say vertices can be Initializing (I), HalfRunning (H), Done(D). Then we can have cases like III, HII, HHH, DDD, HIH etc. This would create a methodical coverage for various cases. If you agree then this can be a follow up jira. Created TEZ-2939 for that. * Currently I use OrderedWordCount & HashJoin as the system test. I think these are 2 typical dag topological. And for each example, there's one test matrix,will cover the cases as you mentioned (III, HII, HHH, DDD, HIH) * Another reason I use examples is that I also need to verify the job result. I do see before that job recover successfully but get incorrect result. (due to multiple copies of DM events) * There's one drawback about the system test. It would take lots of time because for each case the job would run under minicluster in distributed mode. And totally there's more than 30 cases. > Umbrella for Tez Recovery Redesign > ---------------------------------- > > Key: TEZ-2581 > URL: https://issues.apache.org/jira/browse/TEZ-2581 > Project: Apache Tez > Issue Type: Improvement > Reporter: Jeff Zhang > Assignee: Jeff Zhang > Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-10.patch, > TEZ-2581-WIP-11.patch, TEZ-2581-WIP-2.patch, TEZ-2581-WIP-3.patch, > TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, TEZ-2581-WIP-6.patch, > TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, TEZ-2581-WIP-9.patch, > TezRecoveryRedesignProposal.pdf, TezRecoveryRedesignV1.1.pdf > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)