[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383586#comment-14383586 ]
Jeff Zhang edited comment on TEZ-714 at 3/27/15 9:31 AM: --------------------------------------------------------- Upload new patch to address the review comment. * Why is this calling Vertex.abortVertex() instead of directly calling committer.abort() for the outputs? Several reasons. ** reuse the abort code of vertex ** there's one bug in the existing code that it would abort the committed output when dag is failed in the case of TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false, but it is supposed not to abort in this case. * Both for DAG & Vertex wrap all the commit related code into one CommitCompletedTransition * bq. Should we wait for all outstanding commit operations to get cancelled or complete and then call abort on all outputs? It is done in the new patch except when internal error happens. Still use the original InternaErrorTransition ** For Vertex, when commit completed event and task completed event happens, it would call checkVertexForCompletion ** For DAG, when commit completed event and vertex completed event happens, it would call checkDAGForCompletion * Add unit test in TestCommit & e2e test in TestMockDAGAppMaster * Maybe an issue: when vertex's output is being committing as an vertex group output, then what state should this vertex in ? Currently vertex will go to SUCCEEDED state, but may be better to move to COMMITTING state * Should Task wait for all the task attempts completed before move to SUCCEEDED, otherwise it is possible that vertex is in COMMITTING while there still task attempt is still running in the case of speculation. was (Author: zjffdu): Upload new patch to address the review comment. * Why is this calling Vertex.abortVertex() instead of directly calling committer.abort() for the outputs? Several reasons. ** reuse the abort code of vertex ** there's one bug in the existing code that it would abort the committed output when dag is failed, but it is supposed not to abort in this case. * Both for DAG & Vertex wrap all the commit related code into one CommitCompletedTransition * bq. Should we wait for all outstanding commit operations to get cancelled or complete and then call abort on all outputs? It is done in the new patch except when internal error happens. Still use the original InternaErrorTransition ** For Vertex, when commit completed event and task completed event happens, it would call checkVertexForCompletion ** For DAG, when commit completed event and vertex completed event happens, it would call checkDAGForCompletion * Add unit test in TestCommit & e2e test in TestMockDAGAppMaster * Maybe an issue: when vertex's output is being committing as an vertex group output, then what state should this vertex in ? Currently vertex will go to SUCCEEDED state, but may be better to move to COMMITTING state * Should Task wait for all the task attempts completed before move to SUCCEEDED, otherwise it is possible that vertex is in COMMITTING while there still task attempt is still running in the case of speculation. > OutputCommitters should not run in the main AM dispatcher thread > ---------------------------------------------------------------- > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Assignee: Jeff Zhang > Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, > TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)