[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059513#comment-16059513 ]

Jonathan Eagles commented on TEZ-3758:
--------------------------------------

+1. Thanks, [~kshukla]

> Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-3758
>                 URL: https://issues.apache.org/jira/browse/TEZ-3758
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1, 0.9.0
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, TEZ-3758.003.patch, TEZ-3758.004.patch
>
>
> A vertex's count of completed tasks can become incorrect when two task
> attempts finish very close together, say within a millisecond of each other.
> We had a case where a task that was marked successful never scheduled another
> attempt after receiving a retroactive failure, because it thought it still had
> one uncompleted task attempt. The attempt that finished 1 ms later
> transitioned to SUCCEEDED, but no action is taken on the taskAttempStatus data
> structure for it, so its entry stays false. This JIRA will attempt to solve
> that race.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
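The race in the issue description may be easier to follow with a toy model. The sketch below is not the Tez code and not the attached patch. The class `TaskAttemptRaceSketch`, its nested `Task`, the `recordLateSuccess` flag, and all method names are hypothetical, and the idea that the second attempt is a speculative one is an assumption. It models only the bookkeeping the report describes: a per-attempt "completed" map whose entry for a late-succeeding attempt stays false, which blocks the reschedule after a retroactive failure.

```java
import java.util.HashMap;
import java.util.Map;

public class TaskAttemptRaceSketch {

    /** Hypothetical, heavily simplified task; not the Tez implementation. */
    static class Task {
        // attempt id -> whether the task has recorded that attempt as completed
        private final Map<Integer, Boolean> attemptCompleted = new HashMap<>();
        // false models the reported behavior; true models recording the late success
        private final boolean recordLateSuccess;
        private Integer succeededAttempt = null;
        private int nextAttemptId = 0;

        Task(boolean recordLateSuccess) {
            this.recordLateSuccess = recordLateSuccess;
        }

        int scheduleAttempt() {
            int id = nextAttemptId++;
            attemptCompleted.put(id, false);
            return id;
        }

        void onAttemptSucceeded(int id) {
            if (succeededAttempt == null) {
                // First success: the task itself becomes successful.
                succeededAttempt = id;
                attemptCompleted.put(id, true);
            } else if (recordLateSuccess) {
                // Task already succeeded; still record that this attempt is done.
                attemptCompleted.put(id, true);
            }
            // Otherwise the late success is ignored and its entry stays false.
        }

        /** Returns true if a replacement attempt was scheduled. */
        boolean onRetroactiveFailure(int id) {
            attemptCompleted.put(id, true); // the attempt did finish; its result is now unusable
            if (succeededAttempt != null && succeededAttempt == id) {
                succeededAttempt = null;
            }
            if (uncompletedAttempts() > 0) {
                return false; // the task believes another attempt is still outstanding
            }
            scheduleAttempt();
            return true;
        }

        long uncompletedAttempts() {
            return attemptCompleted.values().stream().filter(done -> !done).count();
        }
    }

    /** Replays the scenario from the report and says whether a retry was scheduled. */
    static boolean retryScheduledAfterRetroactiveFailure(boolean recordLateSuccess) {
        Task task = new Task(recordLateSuccess);
        int first = task.scheduleAttempt();
        int second = task.scheduleAttempt(); // assumed here to be a speculative attempt
        task.onAttemptSucceeded(first);
        task.onAttemptSucceeded(second);     // finishes about 1 ms later
        return task.onRetroactiveFailure(first);
    }

    public static void main(String[] args) {
        System.out.println("late success ignored:  retry scheduled = "
            + retryScheduledAfterRetroactiveFailure(false));
        System.out.println("late success recorded: retry scheduled = "
            + retryScheduledAfterRetroactiveFailure(true));
    }
}
```

In this model, ignoring the late success leaves one attempt counted as uncompleted, so no replacement is scheduled and nothing can make progress, which matches the hang described in the report. Recording the late success lets the replacement attempt be scheduled.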
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16058586#comment-16058586 ]

TezQA commented on TEZ-3758:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12873959/TEZ-3758.004.patch
  against master revision a925c83.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//console

This message is automatically generated.
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057948#comment-16057948 ]

Kuhu Shukla commented on TEZ-3758:
----------------------------------

The test failure is unrelated: the test timed out because it ran for more than a minute. I have opened TEZ-3768 to track the minor change to the test timeout.

[~jeagles], I would appreciate any comments or review. Thanks a lot!
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057920#comment-16057920 ]

TezQA commented on TEZ-3758:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12873897/TEZ-3758.003.patch
  against master revision a925c83.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in :
                   org.apache.tez.auxservices.TestShuffleHandlerJobs

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//console

This message is automatically generated.
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057496#comment-16057496 ]

Kuhu Shukla commented on TEZ-3758:
----------------------------------

Test failures are relevant. Looking into them now.
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055872#comment-16055872 ]

Jonathan Eagles commented on TEZ-3758:
--------------------------------------

Thanks, [~kshukla]. The code looks good. A couple of changes to the test would make this a little better.

* Let's not expose the task-attempt-to-completed-status mapping to the tests. Just counting completed vs. uncompleted attempts should be sufficient.
* Please move the mockVertex initialization from the getVertex() method to the constructor.
* Please add an assertTaskSucceededState check before starting the retroactive failure condition.
* In addition to the assertTaskScheduledState at the end of the test, please add another completed vs. uncompleted count check. The main difference between the code path before and after this patch is that a new task attempt is scheduled.
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
    [ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047011#comment-16047011 ]

Siddharth Seth commented on TEZ-3758:
-------------------------------------

cc [~aplusplus], [~harishjp]