[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055872#comment-16055872
 ] 

Jonathan Eagles commented on TEZ-3758:
--------------------------------------

Thanks, [~kshukla]. The code looks good. A couple of changes to the test would 
make this a little better.

* Let's not expose the mapping from task attempt to completed status to the 
tests. Just counting completed vs uncompleted attempts should be sufficient.
* Please move the mockVertex initialization from the getVertex() method to the 
constructor.
* Please add an assertTaskSucceededState check before starting the 
retroactive failure condition.
* In addition to the assertTaskScheduledState at the end of the test, please 
add another completed vs uncompleted count check. The main difference with the 
code path before and after this patch is that a new task attempt is scheduled.
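
To illustrate the first and last points, here is a minimal sketch of a count-only view for the test. AttemptCounts and its method names are hypothetical and model the bookkeeping with a plain map; this is not the TaskImpl API, just the shape of check I have in mind.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a task's attempt bookkeeping; not Tez code. */
public class AttemptCounts {
    // attempt number -> true once that attempt has completed
    private final Map<Integer, Boolean> status = new LinkedHashMap<>();

    void schedule(int attempt) { status.put(attempt, false); }

    void complete(int attempt) { status.put(attempt, true); }

    // Tests read only these aggregate counts, never the map itself.
    int completedCount() {
        int n = 0;
        for (boolean done : status.values()) {
            if (done) n++;
        }
        return n;
    }

    int uncompletedCount() {
        return status.size() - completedCount();
    }

    public static void main(String[] args) {
        AttemptCounts counts = new AttemptCounts();
        counts.schedule(0);
        counts.schedule(1);
        counts.complete(0);
        System.out.println(counts.completedCount() + " completed, "
            + counts.uncompletedCount() + " uncompleted");
    }
}
```

With something like this, the test can assert on the two counts before the retroactive failure and again at the end, without seeing the per-attempt state.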

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-3758
>                 URL: https://issues.apache.org/jira/browse/TEZ-3758
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1, 0.9.0
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-3758.001.patch
>
>
> A vertex's count of completed tasks can go wrong when two task attempts 
> finish very close together, say within a millisecond of each other. We had 
> a case where a task that was marked successful never scheduled another 
> attempt upon getting a retroactive failure, since it thought it already had 
> one uncompleted task attempt. This is because the attempt that finished 1 
> ms later transitioned to SUCCEEDED, but we don't take any action on the 
> taskAttemptStatus data structure and its entry stays false. This JIRA will 
> attempt to solve that race.
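
The race described above can be modeled with a small, simplified sketch. Everything here is an assumption made for illustration: the class RetroactiveFailureSketch, its methods, and the recordLateSuccess flag are hypothetical and do not reproduce the Tez state machine or the attached patch. The flag only toggles whether a success arriving after the task has already succeeded is recorded in the attempt-status map.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of the bookkeeping race; not Tez code. */
public class RetroactiveFailureSketch {
    // attempt id -> true once the attempt is known to have completed
    private final Map<Integer, Boolean> attemptCompleted = new HashMap<>();
    private final boolean recordLateSuccess;
    private boolean taskSucceeded = false;
    private int nextAttempt = 0;

    RetroactiveFailureSketch(boolean recordLateSuccess) {
        this.recordLateSuccess = recordLateSuccess;
    }

    int launchAttempt() {
        int id = nextAttempt++;
        attemptCompleted.put(id, false);
        return id;
    }

    void onAttemptSucceeded(int id) {
        if (!taskSucceeded) {
            taskSucceeded = true;
            attemptCompleted.put(id, true);
        } else if (recordLateSuccess) {
            // Assumed fix: also record a success that arrives after the task succeeded.
            attemptCompleted.put(id, true);
        }
        // Otherwise the late success is ignored and the entry stays false.
    }

    /** Returns true if a replacement attempt was scheduled. */
    boolean onRetroactiveFailure(int id) {
        taskSucceeded = false;
        attemptCompleted.put(id, true); // the failed attempt is finished, not pending
        if (uncompletedCount() == 0) {
            launchAttempt();
            return true;
        }
        return false; // believes another attempt is still running
    }

    int uncompletedCount() {
        int n = 0;
        for (boolean done : attemptCompleted.values()) {
            if (!done) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            RetroactiveFailureSketch task = new RetroactiveFailureSketch(fix);
            int a0 = task.launchAttempt();
            int a1 = task.launchAttempt();
            task.onAttemptSucceeded(a0);
            task.onAttemptSucceeded(a1); // finishes about 1 ms later
            boolean rescheduled = task.onRetroactiveFailure(a0);
            System.out.println("recordLateSuccess=" + fix + " rescheduled=" + rescheduled);
        }
    }
}
```

In this model, ignoring the late success leaves one stale uncompleted entry, so no replacement attempt is scheduled after the retroactive failure and the task would wait forever. Recording the late success lets the replacement attempt be scheduled.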



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
