[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-22 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059513#comment-16059513
 ] 

Jonathan Eagles commented on TEZ-3758:
--

+1. Thanks, [~kshukla]

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, 
> TEZ-3758.003.patch, TEZ-3758.004.patch
>
>
> A vertex's count of completed tasks can become incorrect when two task 
> attempts finish very close together, say within a millisecond of each other. 
> We had a case where a task that was marked successful never scheduled another 
> attempt after a retroactive failure, because it thought it already had one 
> uncompleted task attempt. This happens because the attempt that finished 1 ms 
> later transitioned to SUCCEEDED, but we take no action on the 
> taskAttempStatus data structure, so its entry stays false. This JIRA will 
> attempt to solve that race.
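
For illustration only, here is a minimal sketch of the bookkeeping problem described in the issue text above. It is a toy model, not the Tez implementation or the attached patch: the class TaskSketch, its methods, and the recordLateSuccess flag are all hypothetical names, and the map merely stands in for the per-attempt status structure the description mentions.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one task's attempt bookkeeping. Not the actual Tez code. */
class TaskSketch {
    // attempt id -> has this attempt completed? (stand-in for the status map in the description)
    private final Map<Integer, Boolean> attemptCompleted = new HashMap<>();
    private final boolean recordLateSuccess; // false reproduces the hang, true models a fix
    private boolean taskSucceeded = false;
    private int nextAttemptId = 0;

    TaskSketch(boolean recordLateSuccess) {
        this.recordLateSuccess = recordLateSuccess;
    }

    /** Starts a new attempt, initially recorded as uncompleted, and returns its id. */
    int scheduleAttempt() {
        int id = nextAttemptId++;
        attemptCompleted.put(id, false);
        return id;
    }

    void onAttemptSucceeded(int id) {
        if (!taskSucceeded) {
            // First success: the task becomes SUCCEEDED and the attempt is recorded.
            taskSucceeded = true;
            attemptCompleted.put(id, true);
        } else if (recordLateSuccess) {
            // Second success arriving moments later: record it as completed too.
            attemptCompleted.put(id, true);
        }
        // Without recordLateSuccess the late attempt's entry silently stays false.
    }

    /** Handles a retroactive failure; returns true if a replacement attempt was scheduled. */
    boolean onRetroactiveFailure() {
        taskSucceeded = false;
        if (uncompletedCount() > 0) {
            // The task believes another attempt is still in flight, so it waits.
            return false;
        }
        scheduleAttempt();
        return true;
    }

    int uncompletedCount() {
        int count = 0;
        for (boolean done : attemptCompleted.values()) {
            if (!done) {
                count++;
            }
        }
        return count;
    }
}
```

With recordLateSuccess set to false, two attempts that both succeed leave one entry at false, so onRetroactiveFailure() schedules nothing and the task waits indefinitely. With it set to true, the same sequence schedules a replacement attempt.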



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-21 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058586#comment-16058586
 ] 

TezQA commented on TEZ-3758:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12873959/TEZ-3758.004.patch
  against master revision a925c83.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2533//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//console

This message is automatically generated.

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, 
> TEZ-3758.003.patch, TEZ-3758.004.patch


[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-21 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057948#comment-16057948
 ] 

Kuhu Shukla commented on TEZ-3758:
--

The test failure is unrelated: the test timed out because it ran for more than 
a minute. I have opened TEZ-3768 to track the minor change to the test timeout. 
[~jeagles], I would appreciate any comments or review. Thanks a lot!

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, 
> TEZ-3758.003.patch


[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-21 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057920#comment-16057920
 ] 

TezQA commented on TEZ-3758:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12873897/TEZ-3758.003.patch
  against master revision a925c83.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.auxservices.TestShuffleHandlerJobs

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2530//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//console

This message is automatically generated.

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, 
> TEZ-3758.003.patch


[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-21 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057496#comment-16057496
 ] 

Kuhu Shukla commented on TEZ-3758:
--

Test failures are relevant. Looking into them now.

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch


[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-20 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055872#comment-16055872
 ] 

Jonathan Eagles commented on TEZ-3758:
--

Thanks, [~kshukla]. The code looks good. A couple of changes to the test will 
make this a little better.

* Let's not expose the task-attempt-to-completed-status mapping to the tests. 
Just counting completed vs uncompleted attempts should be sufficient.
* Please move the mockVertex initialization from the getVertex() method to the 
constructor.
* Please add an assertTaskSucceededState check before starting the 
retroactive failure condition.
* In addition to the assertTaskScheduledState at the end of the test, please 
add another completed vs uncompleted count check. The main difference between 
the code path before and after this patch is that a new task attempt is 
scheduled.
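
As a rough sketch of the "counts only" idea in the first and last points above: the holder below keeps its per-attempt map private and exposes just the completed and uncompleted totals to a test. AttemptCounts and its method names are hypothetical and not taken from the patch; the real test would use whatever accessors the task class provides.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping holder; only the counts are visible to callers. */
class AttemptCounts {
    // attempt id -> completed flag, deliberately not exposed
    private final Map<Integer, Boolean> completedByAttempt = new HashMap<>();

    void attemptScheduled(int attemptId) {
        completedByAttempt.put(attemptId, false);
    }

    void attemptCompleted(int attemptId) {
        completedByAttempt.put(attemptId, true);
    }

    /** Number of attempts recorded as completed. */
    int completed() {
        int count = 0;
        for (boolean done : completedByAttempt.values()) {
            if (done) {
                count++;
            }
        }
        return count;
    }

    /** Number of attempts still recorded as uncompleted. */
    int uncompleted() {
        return completedByAttempt.size() - completed();
    }
}
```

A test built this way would check for two completed and zero uncompleted attempts before triggering the retroactive failure, then two completed and one uncompleted once the replacement attempt has been scheduled.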

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch


[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures

2017-06-12 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047011#comment-16047011
 ] 

Siddharth Seth commented on TEZ-3758:
-

cc [~aplusplus], [~harishjp]

> Vertex can hang in RUNNING state when two task attempts finish very closely 
> and have retroactive failures
> -
>
> Key: TEZ-3758
> URL: https://issues.apache.org/jira/browse/TEZ-3758
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: TEZ-3758.001.patch