[ https://issues.apache.org/jira/browse/TEZ-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298522#comment-14298522 ]
Jeff Zhang commented on TEZ-1895:
---------------------------------

bq. Was this added because the new tests revealed the bug that vertex would not complete because of the counting error?
Yes. Otherwise, the DAG won't finish.

bq. INVALID_RERUN -> VERTEX_RERUN_AFTER_COMMIT???
Done.

bq. Maybe add the diagnostic when the failure is triggered in vertexReRunning() rather than inside checkForCompletion()?
bq. The diagnostic is less informative than the log. Can we get the vertex information in the diagnostic?
Assuming these two points are the same thing, the diagnostics are added in vertexReRunning().

bq. Perhaps in a separate jira we should rename TaskAttemptTerminationCause to FailureReason and consolidate DAGTerminationCause and VertexTerminationCause into it. Currently there is too much duplication and essentially we are only looking for a programmatic enum for a common set of failure reasons.
I have an impression that there may already be a jira for consolidating the termination causes, but I don't remember the jira number.

> Vertex reRunning should decrease successfulMembers of VertexGroupInfo
> ---------------------------------------------------------------------
>
>                 Key: TEZ-1895
>                 URL: https://issues.apache.org/jira/browse/TEZ-1895
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1895-1.patch, TEZ-1895-2.patch, TEZ-1895-3.patch, TEZ-1895-4.patch
>
>
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo,
> otherwise commit may happen while the vertex is still rerunning.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
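
For readers following the thread, the counting problem in the issue description can be illustrated with a minimal sketch. This is not the Tez implementation or the attached patch: the class VertexGroupInfoSketch and the methods vertexSucceeded() and readyToCommit() are hypothetical names chosen for illustration, and only successfulMembers and vertexReRunning() echo names used in the issue.

```java
// Hypothetical sketch of the bookkeeping discussed in TEZ-1895.
// Names are illustrative assumptions; this is not the actual Tez class.
class VertexGroupInfoSketch {
    private final int groupSize;   // number of member vertices in the group
    private int successfulMembers; // members currently counted as succeeded

    VertexGroupInfoSketch(int groupSize) {
        this.groupSize = groupSize;
    }

    // Called when a member vertex succeeds.
    void vertexSucceeded() {
        successfulMembers++;
    }

    // Called when a previously succeeded member vertex starts rerunning.
    // Without this decrement the stale success would still be counted.
    void vertexReRunning() {
        if (successfulMembers > 0) {
            successfulMembers--;
        }
    }

    // The group output may be committed only when every member has succeeded.
    boolean readyToCommit() {
        return successfulMembers == groupSize;
    }
}
```

In a two-vertex group, if the first vertex succeeds and then reruns, the decrement keeps readyToCommit() false when the second vertex succeeds. Without the decrement the count would reach 2 at that point and the commit could start while the first vertex is still rerunning.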