[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375512#comment-14375512 ]
Jeff Zhang commented on TEZ-714: -------------------------------- Upload a new patch. [~bikassaha] Please help review it. * Wrap the commit in the CallableEvent both in DAG & Vertex, but for the abort, still call it inline. Make the abort asyn will complicate the patch, so still keep it a sync call as before. * Introduce new state COMMITTING for Vertex & DAG ** Vertex's COMMITTING means vertex is in the middle of committing, if vertex has no committers or the option of TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true, vertex would not to to COMMITTING state. ** DAG's COMMITTING has 2 cases, one is when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true and all the vertices are completed, another case is that TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and all the vertices are completed, but still some vertex group committers are running. * Regarding the issue of "not sure why group-commit and non-group commit need to be differentiated in different transitions.", I rename it to NonFinalCommitCompletedTransition and FinalCommitCompletetionTransition (maybe there's better names ). One mean the committer when TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is false and the other means TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS is true. The reason I differentiate them is that for the NonFinalCommitCompletedEvent, we need to log the recovery log of VertexGroupCommitCompletedEvent while it is not necessary for FinalCommitCompletedEvent. * Unit test is still not perfect. Because currently in the DAGImpl/VertexImpl we run the shared thread pool in the AsynDispatcher thread ( that means Committer still run in the thread of AsynDispather) so this may hide some potential issues and under this thread mode, it is not possible for test some cases like kill dag while it is in committing. I am trying to think of ways to simulate the shared thread pool in the unit test. * For the some existing transition, like (RUNNING to ERROR due to INTERNAL ERROR), I am not sure why it go to ERROR directly rather than TERMINATING. Maybe it is to allow the client get the final status as earyl as possible. > OutputCommitters should not run in the main AM dispatcher thread > ---------------------------------------------------------------- > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Assignee: Jeff Zhang > Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)