[ 
https://issues.apache.org/jira/browse/TEZ-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298522#comment-14298522
 ] 

Jeff Zhang commented on TEZ-1895:
---------------------------------

bq. Was this added because the new tests revealed the bug that vertex would not 
complete because of the counting error?
Yes. Otherwise, the DAG won't finish.

bq. INVALID_RERUN -> VERTEX_RERUN_AFTER_COMMIT???
Done

bq.  Maybe add the diagnostic when the failure is triggered in 
vertexReRunning() rather than inside checkForCompletion()? 
bq. The diagnostic is less informative than the log. Can we get the vertex 
information in the diagnostic?
I assume these two points are the same thing, so the diagnostics go in 
vertexReRunning().

bq. Perhaps in a separate jira we should rename TaskAttemptTerminationCause to 
FailureReason and consolidate DAGTerminationCause and VertexTerminationCause 
into it. Currently there is too much duplication and essentially we are only 
looking for a programmatic enum for a common set of failure reasons.
I have an impression there may already be a jira for consolidating the 
termination causes, but I don't remember the jira number.


> Vertex reRunning should decrease successfulMembers of VertexGroupInfo
> ---------------------------------------------------------------------
>
>                 Key: TEZ-1895
>                 URL: https://issues.apache.org/jira/browse/TEZ-1895
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1895-1.patch, TEZ-1895-2.patch, TEZ-1895-3.patch, 
> TEZ-1895-4.patch
>
>
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo; 
> otherwise the commit may happen while the vertex is still rerunning. 
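
To illustrate the counting issue in the description above, here is a minimal 
sketch. It is not the actual Tez code: the class VertexGroupSketch and the 
methods vertexSucceeded(), readyToCommit() and commit() are hypothetical names 
made up for this example. Only successfulMembers and vertexReRunning() come 
from the issue itself. The rerun-after-commit check loosely mirrors the 
VERTEX_RERUN_AFTER_COMMIT case discussed in the comment and is likewise an 
assumption about how such a case might be handled.

```java
// Hypothetical stand-in for the vertex group bookkeeping; not the Tez class.
class VertexGroupSketch {
    private final int groupSize;        // number of vertices in the group
    private int successfulMembers = 0;  // members currently in a succeeded state
    private boolean committed = false;

    VertexGroupSketch(int groupSize) {
        this.groupSize = groupSize;
    }

    // Called when a member vertex succeeds.
    void vertexSucceeded() {
        successfulMembers++;
    }

    // Called when a previously succeeded member vertex starts rerunning.
    void vertexReRunning() {
        if (committed) {
            // Assumption: a rerun after the group commit cannot be recovered.
            throw new IllegalStateException("vertex rerun after group commit");
        }
        // Without this decrement the group would still look complete
        // while one of its members is running again.
        successfulMembers--;
    }

    // The group may commit only when every member has succeeded.
    boolean readyToCommit() {
        return !committed && successfulMembers == groupSize;
    }

    void commit() {
        if (!readyToCommit()) {
            throw new IllegalStateException("group is not ready to commit");
        }
        committed = true;
    }
}
```

With a group of two vertices, readyToCommit() is true after both succeed, 
becomes false once one of them reruns, and is true again only after that 
vertex succeeds a second time.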



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
