[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383586#comment-14383586
 ] 

Jeff Zhang edited comment on TEZ-714 at 3/27/15 9:31 AM:
---------------------------------------------------------

Upload new patch to address the review comment.

* Why is this calling Vertex.abortVertex() instead of directly calling 
committer.abort() for the outputs?
Several reasons.
** reuse the abort code of vertex
** there's one bug in the existing code that it would abort the committed 
output when dag is failed in the case of 
TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false, but it is supposed not to 
abort in this case.
* Both for DAG & Vertex wrap all the commit related code into one 
CommitCompletedTransition
* bq. Should we wait for all outstanding commit operations to get cancelled or 
complete and then call abort on all outputs?
It is done in the new patch except when internal error happens. Still use the 
original InternaErrorTransition 
** For Vertex, when commit completed event and task completed event happens, it 
would call checkVertexForCompletion
** For DAG, when commit completed event and vertex completed event happens, it 
would call checkDAGForCompletion

* Add unit test in TestCommit & e2e test in TestMockDAGAppMaster

* Maybe an issue: when vertex's output is being committing as an vertex group 
output, then what state should this vertex in ? Currently vertex will go to 
SUCCEEDED state, but may be better to move to COMMITTING state
* Should Task wait for all the task attempts completed before move to 
SUCCEEDED, otherwise it is possible that vertex is in COMMITTING while there 
still task attempt is still running in the case of speculation.




was (Author: zjffdu):
Upload new patch to address the review comment.

* Why is this calling Vertex.abortVertex() instead of directly calling 
committer.abort() for the outputs?
Several reasons.
** reuse the abort code of vertex
** there's one bug in the existing code that it would abort the committed 
output when dag is failed, but it is supposed not to abort in this case.
* Both for DAG & Vertex wrap all the commit related code into one 
CommitCompletedTransition
* bq. Should we wait for all outstanding commit operations to get cancelled or 
complete and then call abort on all outputs?
It is done in the new patch except when internal error happens. Still use the 
original InternaErrorTransition 
** For Vertex, when commit completed event and task completed event happens, it 
would call checkVertexForCompletion
** For DAG, when commit completed event and vertex completed event happens, it 
would call checkDAGForCompletion

* Add unit test in TestCommit & e2e test in TestMockDAGAppMaster

* Maybe an issue: when vertex's output is being committing as an vertex group 
output, then what state should this vertex in ? Currently vertex will go to 
SUCCEEDED state, but may be better to move to COMMITTING state
* Should Task wait for all the task attempts completed before move to 
SUCCEEDED, otherwise it is possible that vertex is in COMMITTING while there 
still task attempt is still running in the case of speculation.



> OutputCommitters should not run in the main AM dispatcher thread
> ----------------------------------------------------------------
>
>                 Key: TEZ-714
>                 URL: https://issues.apache.org/jira/browse/TEZ-714
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, 
> TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there's multiple OutputCommitters on a Vertex, they can be run in 
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event 
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to