[ 
https://issues.apache.org/jira/browse/TEZ-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219187#comment-14219187
 ] 

Jeff Zhang commented on TEZ-992:
--------------------------------

This JIRA seems closely related to TEZ-714, so I am trying to resolve them 
together.  I have created a FinishSavingService that logs recovery events and 
performs commit/abort.

I have attached the two state machine diagrams (DAG/Vertex).
* The main change is one additional state: FINISH_SAVING.  ( 
RUNNING/TERMINATING will transition to FINISH_SAVING first and then go to 
SUCCEEDED/FAILED/KILLED )
** In the Vertex's FINISH_SAVING, it will log recovery data 
(VertexFinishedEvent, VertexCommitStartedEvent) and commit/abort the data if 
necessary. 
** In the DAG's FINISH_SAVING, it will log recovery data 
(DAGCommitStartedEvent, DAGFinishedEvent) and commit/abort the data if 
necessary.
** For VertexGroupCommitStartedEvent / VertexGroupCommitFinishedEvent, I will 
run them in FinishSavingService and keep the DAG in the RUNNING state.
** During recovery, recovery events are still logged and commits still run on 
the main AsyncDispatcher.  I am not confident about moving this out yet, and 
may leave it for another JIRA. 
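
To make the proposed flow concrete, here is a minimal sketch of the 
FINISH_SAVING idea. It is not the patch: the class and method names 
(FinishSavingSketch, finish, awaitSaving, onSavingCompleted) are hypothetical, 
the recovery store is stood in for by an in-memory list, and a 
single-threaded executor plays the role of FinishSavingService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only; none of these names are Tez classes.
class FinishSavingSketch {
    enum State { RUNNING, TERMINATING, FINISH_SAVING, SUCCEEDED, FAILED, KILLED }

    private State state = State.RUNNING;
    private State pendingFinalState;
    private Future<?> pendingSave;
    // Stands in for the recovery log and the committer calls.
    private final List<String> recoveryLog = new ArrayList<>();
    // Stands in for FinishSavingService: slow work runs off the dispatcher thread.
    private final ExecutorService savingService = Executors.newSingleThreadExecutor();

    State getState() { return state; }
    List<String> getRecoveryLog() { return recoveryLog; }

    // Dispatcher thread: enter FINISH_SAVING and hand the slow work off.
    void finish(State finalState) {
        if (state != State.RUNNING && state != State.TERMINATING) {
            throw new IllegalStateException("cannot finish from " + state);
        }
        if (finalState != State.SUCCEEDED && finalState != State.FAILED
                && finalState != State.KILLED) {
            throw new IllegalArgumentException("not a final state: " + finalState);
        }
        state = State.FINISH_SAVING;
        pendingFinalState = finalState;
        pendingSave = savingService.submit(() -> {
            if (finalState == State.SUCCEEDED) {
                recoveryLog.add("CommitStarted");
                recoveryLog.add("commit");
            } else {
                recoveryLog.add("abort");
            }
            recoveryLog.add("Finished:" + finalState);
        });
    }

    // Test helper: block until the background save is done.
    void awaitSaving() {
        try {
            pendingSave.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Dispatcher thread: the service reported completion, so move to the final state.
    void onSavingCompleted() {
        if (state != State.FINISH_SAVING) {
            throw new IllegalStateException("not in FINISH_SAVING: " + state);
        }
        state = pendingFinalState;
    }

    void shutdown() { savingService.shutdown(); }
}
```

In a real dispatcher the completion would arrive as an event from the service 
rather than through a blocking wait; awaitSaving exists here only so the 
sketch can be exercised in isolation.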
 
I have implemented a prototype of this feature (the Tez examples and 
TestAMRecovery both run successfully), but the code still needs some 
refinement. I will attach the patch soon. 
While implementing it, I ran into the following questions and would 
appreciate feedback on them.

* The committer is still initialized on the main AsyncDispatcher thread. Is 
that acceptable? 
* Can a TerminateEvent be ignored when the DAG/Vertex is in FINISH_SAVING? 
IMO it can be ignored: if we don't ignore it, we still need to abort the 
committer. Since we have to call commit or abort either way, I think ignoring 
it is acceptable.
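
The second question can be illustrated with a small transition function. 
This is only a sketch of the "ignore" option, assuming it is the one adopted; 
the names (TerminateHandlingSketch, Event, next) are hypothetical and do not 
come from the Tez state machine code.

```java
// Hypothetical sketch only; none of these names are Tez classes.
class TerminateHandlingSketch {
    enum State { RUNNING, TERMINATING, FINISH_SAVING, SUCCEEDED, FAILED, KILLED }
    enum Event { TERMINATE, SAVING_COMPLETED }

    // Returns the next state. pendingFinal is the final state chosen when
    // FINISH_SAVING was entered; it is only used for SAVING_COMPLETED.
    static State next(State current, Event event, State pendingFinal) {
        switch (current) {
            case RUNNING:
                // A terminate request while running starts termination.
                return event == Event.TERMINATE ? State.TERMINATING : current;
            case FINISH_SAVING:
                if (event == Event.TERMINATE) {
                    // Ignored: commit or abort is already in progress.
                    return State.FINISH_SAVING;
                }
                // Saving finished, so move to the pending final state.
                return pendingFinal;
            default:
                // Other states are left unchanged in this sketch.
                return current;
        }
    }
}
```

With this choice a terminate request that arrives during FINISH_SAVING does 
not change the outcome; the final state is whatever was decided when 
FINISH_SAVING was entered.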





> Recovery data should not be written on AsyncDispatcher thread
> -------------------------------------------------------------
>
>                 Key: TEZ-992
>                 URL: https://issues.apache.org/jira/browse/TEZ-992
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Jeff Zhang
>
> This may block the DAG operations in case the recovery data needs to be 
> synchronously stored. The operations requiring this blocking operation should 
> change their state machines to wait for the store operation before moving 
> ahead. They will move ahead after they receive notification from the 
> RecoveryService that their operation has completed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
