[jira] [Commented] (TEZ-3768) Test timeout value for TestShuffleHandlerJobs is low
[ https://issues.apache.org/jira/browse/TEZ-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058602#comment-16058602 ] TezQA commented on TEZ-3768: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873961/TEZ-3768.002.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2534//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2534//console This message is automatically generated. > Test timeout value for TestShuffleHandlerJobs is low > > > Key: TEZ-3768 > URL: https://issues.apache.org/jira/browse/TEZ-3768 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: TEZ-3768.001.patch, TEZ-3768.002.patch > > > The test can fail with a timeout on slow build machines. One minute is > clearly too less. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Success: TEZ-3768 PreCommit Build #2534
Jira: https://issues.apache.org/jira/browse/TEZ-3768 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2534/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 338.88 KB...] [INFO] Tez SUCCESS [ 0.020 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 58:45 min [INFO] Finished at: 2017-06-22T02:05:21Z [INFO] Final Memory: 83M/1220M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873961/TEZ-3768.002.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2534//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2534//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. b49b17c8546b3999d480b0250b74762d21a03b05 logged out == == Finished build. == == Archiving artifacts [description-setter] Description set: TEZ-3768 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058586#comment-16058586 ] TezQA commented on TEZ-3758: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873959/TEZ-3758.004.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//console This message is automatically generated. > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, > TEZ-3758.003.patch, TEZ-3758.004.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. 
This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Success: TEZ-3758 PreCommit Build #2533
Jira: https://issues.apache.org/jira/browse/TEZ-3758 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2533/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 339.07 KB...] [INFO] Tez SUCCESS [ 0.022 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 59:51 min [INFO] Finished at: 2017-06-22T01:44:02Z [INFO] Final Memory: 80M/1411M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873959/TEZ-3758.004.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2533//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. cd5e9a1c777411944bf2996881283108c80f4cc6 logged out == == Finished build. == == Archiving artifacts [description-setter] Description set: TEZ-3758 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-3768) Test timeout value for TestShuffleHandlerJobs is low
[ https://issues.apache.org/jira/browse/TEZ-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3768: - Attachment: TEZ-3768.002.patch Thanks [~jeagles], I changed the timeout to 5 minutes. > Test timeout value for TestShuffleHandlerJobs is low > > > Key: TEZ-3768 > URL: https://issues.apache.org/jira/browse/TEZ-3768 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: TEZ-3768.001.patch, TEZ-3768.002.patch > > > The test can fail with a timeout on slow build machines. One minute is > clearly too less. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
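Editorial illustration of the timeout change discussed above. In JUnit 4 a per-test limit is usually declared as `@Test(timeout = ...)` in milliseconds; whether the attached patch does exactly that is not shown in this thread. The sketch below uses only the JDK to show the effect of such a limit. The class name, the constants, and the `finishesWithin` helper are hypothetical and are not taken from the Tez test.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {
    // Hypothetical constants mirroring the two values mentioned in the thread.
    static final long OLD_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);
    static final long NEW_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(5);

    /**
     * Runs the body on a worker thread and reports whether it completed
     * within timeoutMs. This mimics what a test-level timeout enforces.
     */
    static <T> boolean finishesWithin(Callable<T> body, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> result = executor.submit(body);
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The body was still running when the limit expired.
            return false;
        } finally {
            // Interrupts the worker if it is still running.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A body that sleeps 200 ms passes a 1 s limit but fails a 50 ms limit.
        Callable<Integer> slowBody = () -> {
            Thread.sleep(200);
            return 0;
        };
        System.out.println(finishesWithin(slowBody, 1000));
        System.out.println(finishesWithin(slowBody, 50));
    }
}
```

The same body passes or fails depending only on the limit, which is the situation the issue describes for slow build machines.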
[jira] [Updated] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3758: - Attachment: TEZ-3758.004.patch Thank you [~jeagles] for the review! Attached is the revised patch addressing those comments. > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, > TEZ-3758.003.patch, TEZ-3758.004.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TEZ-3761) NPE in Fetcher under load
[ https://issues.apache.org/jira/browse/TEZ-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058333#comment-16058333 ]
Jonathan Eagles commented on TEZ-3761:
[~rajesh.balamohan], can you please have a look at the v2 patch to see if it better handles internal server errors from the shuffle handler?
> NPE in Fetcher under load
>
> Key: TEZ-3761
> URL: https://issues.apache.org/jira/browse/TEZ-3761
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Jonathan Eagles
> Attachments: TEZ-3618.2.patch, TEZ-3761.debug.patch
>
> Env: apache tez + apache hive master
> {noformat}
> 2017-06-14 00:24:53,795 [INFO] [Dispatcher thread {Central}] |HistoryEventHandler.criticalEvents|: [HISTORY][DAG:dag_1490656001509_5009_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Reducer 36, taskAttemptId=attempt_1490656001509_5009_1_15_13_0, creationTime=1497414223481, allocationTime=1497414290240, startTime=1497414290240, finishTime=1497414293795, timeTaken=3555, status=FAILED, taskFailureType=NON_FATAL, errorEnum=INPUT_READ_ERROR, diagnostics=Error: Error while running task ( failure ) : java.lang.NullPointerException
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:914)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:599)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:486)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:284)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:76)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> , errorMessage=Fetch failed:java.lang.NullPointerException
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:914)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:599)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:486)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:284)
>   at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:76)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Query for ref: Q4 with 10 TB TPC-DS
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TEZ-3761) NPE in Fetcher under load
[ https://issues.apache.org/jira/browse/TEZ-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058330#comment-16058330 ] TezQA commented on TEZ-3761: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873933/TEZ-3618.2.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2532//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2532//console This message is automatically generated. 
> NPE in Fetcher under load > - > > Key: TEZ-3761 > URL: https://issues.apache.org/jira/browse/TEZ-3761 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Jonathan Eagles > Attachments: TEZ-3618.2.patch, TEZ-3761.debug.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Failed: TEZ-3761 PreCommit Build #2532
Jira: https://issues.apache.org/jira/browse/TEZ-3761 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2532/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 339.13 KB...] [INFO] Total time: 57:22 min [INFO] Finished at: 2017-06-21T22:00:46Z [INFO] Final Memory: 84M/1401M [INFO] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873933/TEZ-3618.2.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2532//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2532//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 9bd0f3f57091949360ea75d07dab7b78c930acec logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Compressed 3.50 MB of artifacts by 11.6% relative to #2531 [description-setter] Could not determine description. Recording test results Email was triggered for: Failure - Any Sending email for trigger: Failure - Any ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-3768) Test timeout value for TestShuffleHandlerJobs is low
[ https://issues.apache.org/jira/browse/TEZ-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058286#comment-16058286 ] Jonathan Eagles commented on TEZ-3768: -- [~kshukla], can you please move this to a 5 minute timeout? > Test timeout value for TestShuffleHandlerJobs is low > > > Key: TEZ-3768 > URL: https://issues.apache.org/jira/browse/TEZ-3768 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: TEZ-3768.001.patch > > > The test can fail with a timeout on slow build machines. One minute is > clearly too less. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058242#comment-16058242 ] Jonathan Eagles commented on TEZ-3758: -- Hey [~kshukla]. A couple more things I noticed in the v3 version of the patch. {noformat:title=TestTaskImpl} * failAttempt: the spies are not being used and can be taken out. {noformat} {noformat:title=TaskImpl} * Please group the succeeded and failed state machine transitions again and reformat them to match the surrounding code. * A status put is missing in the transition from SUCCEEDED on T_ATTEMPT_FAILED. If the failed event is not for the succeeded attempt, we return without marking the attempt as completed. {noformat} > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, > TEZ-3758.003.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
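Editorial illustration of the race described in the issue text above. This is a simplified model, not Tez's TaskImpl state machine: the class, its methods, and the `recordLateSuccess` switch are all hypothetical, and the map is only a stand-in for the per-attempt status structure the issue mentions. It replays the reported sequence (two attempts succeed back to back, then the first gets a retroactive failure) and shows that a new attempt is scheduled only if the late success was recorded as completed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TaskAttemptRaceSketch {
    // Attempt id -> true once the attempt is known to be completed.
    private final Map<String, Boolean> attemptCompleted = new LinkedHashMap<>();
    // When false, a success arriving after the task already succeeded is ignored,
    // which models the behaviour described in the issue.
    private final boolean recordLateSuccess;
    private String succeededAttempt;

    TaskAttemptRaceSketch(boolean recordLateSuccess) {
        this.recordLateSuccess = recordLateSuccess;
    }

    void scheduleAttempt(String attemptId) {
        attemptCompleted.put(attemptId, false);
    }

    void onAttemptSucceeded(String attemptId) {
        if (succeededAttempt == null) {
            succeededAttempt = attemptId;
            attemptCompleted.put(attemptId, true);
        } else if (recordLateSuccess) {
            // Late success: the task result is unchanged, but the attempt is done.
            attemptCompleted.put(attemptId, true);
        }
    }

    /** Handles a failure event; returns true if a replacement attempt was scheduled. */
    boolean onAttemptFailed(String attemptId) {
        // Mark the attempt completed even when it is not the succeeded one.
        attemptCompleted.put(attemptId, true);
        if (!attemptId.equals(succeededAttempt)) {
            return false;
        }
        succeededAttempt = null;
        if (attemptCompleted.containsValue(false)) {
            // The task believes another attempt is still running and waits for it.
            return false;
        }
        scheduleAttempt(attemptId + "_retry");
        return true;
    }

    /** Replays the sequence from the issue description. */
    static boolean replay(boolean recordLateSuccess) {
        TaskAttemptRaceSketch task = new TaskAttemptRaceSketch(recordLateSuccess);
        task.scheduleAttempt("attempt_0");
        task.scheduleAttempt("attempt_1");
        task.onAttemptSucceeded("attempt_0");
        task.onAttemptSucceeded("attempt_1"); // finishes about 1 ms later
        return task.onAttemptFailed("attempt_0"); // retroactive failure
    }

    public static void main(String[] args) {
        System.out.println("late success ignored, rescheduled: " + replay(false));
        System.out.println("late success recorded, rescheduled: " + replay(true));
    }
}
```

In this model, ignoring the late success leaves a stale "not completed" entry, so the task waits forever on an attempt that has already finished.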
[jira] [Commented] (TEZ-3761) NPE in Fetcher under load
[ https://issues.apache.org/jira/browse/TEZ-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058142#comment-16058142 ] Jonathan Eagles commented on TEZ-3761: -- [~rajesh.balamohan], I put up a patch that adds the same logic from ordered to unordered to ensure that path components are correct. > NPE in Fetcher under load > - > > Key: TEZ-3761 > URL: https://issues.apache.org/jira/browse/TEZ-3761 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Jonathan Eagles > Attachments: TEZ-3618.2.patch, TEZ-3761.debug.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
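Editorial illustration of the kind of check the comment above refers to. The thread does not show the patch, so this is only a sketch under assumptions: the class, the method names, and the expected prefix "attempt" are hypothetical and are not taken from the Tez Fetcher. The idea is to validate a path component read from a shuffle response before using it, so that a malformed or missing value surfaces as an explicit error instead of a later NullPointerException.

```java
final class PathComponentCheckSketch {
    // Assumed prefix for a valid path component; hypothetical for this sketch.
    static final String EXPECTED_PREFIX = "attempt";

    /** Returns true only for a non-null component with the expected prefix. */
    static boolean isValidPathComponent(String pathComponent) {
        return pathComponent != null && pathComponent.startsWith(EXPECTED_PREFIX);
    }

    /**
     * Returns the component unchanged if it is valid; otherwise fails fast
     * with a descriptive exception that a caller can treat as a failed fetch.
     */
    static String requireValidPathComponent(String pathComponent) {
        if (!isValidPathComponent(pathComponent)) {
            throw new IllegalArgumentException(
                "Invalid path component in shuffle header: " + pathComponent);
        }
        return pathComponent;
    }

    public static void main(String[] args) {
        System.out.println(isValidPathComponent("attempt_1490656001509_5009_1_15_13_0"));
        System.out.println(isValidPathComponent(null));
    }
}
```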
[jira] [Updated] (TEZ-3761) NPE in Fetcher under load
[ https://issues.apache.org/jira/browse/TEZ-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-3761: - Attachment: TEZ-3618.2.patch > NPE in Fetcher under load > - > > Key: TEZ-3761 > URL: https://issues.apache.org/jira/browse/TEZ-3761 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Jonathan Eagles > Attachments: TEZ-3618.2.patch, TEZ-3761.debug.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TEZ-3761) NPE in Fetcher under load
[ https://issues.apache.org/jira/browse/TEZ-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-3761: - Attachment: (was: TEZ-3761.2.patch) > NPE in Fetcher under load > - > > Key: TEZ-3761 > URL: https://issues.apache.org/jira/browse/TEZ-3761 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Jonathan Eagles > Attachments: TEZ-3761.debug.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TEZ-3761) NPE in Fetcher under load
[ https://issues.apache.org/jira/browse/TEZ-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-3761: - Attachment: TEZ-3761.2.patch > NPE in Fetcher under load > - > > Key: TEZ-3761 > URL: https://issues.apache.org/jira/browse/TEZ-3761 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Jonathan Eagles > Attachments: TEZ-3761.2.patch, TEZ-3761.debug.patch -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TEZ-3768) Test timeout value for TestShuffleHandlerJobs is low
[ https://issues.apache.org/jira/browse/TEZ-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058104#comment-16058104 ] TezQA commented on TEZ-3768: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873910/TEZ-3768.001.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2531//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2531//console This message is automatically generated. > Test timeout value for TestShuffleHandlerJobs is low > > > Key: TEZ-3768 > URL: https://issues.apache.org/jira/browse/TEZ-3768 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: TEZ-3768.001.patch > > > The test can fail with a timeout on slow build machines. One minute is > clearly too low. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Success: TEZ-3768 PreCommit Build #2531
Jira: https://issues.apache.org/jira/browse/TEZ-3768 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2531/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 339.22 KB...] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 59:32 min [INFO] Finished at: 2017-06-21T20:05:23Z [INFO] Final Memory: 83M/1496M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873910/TEZ-3768.001.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2531//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2531//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. a1ea92d26297bec81b9ef6ca88aded7b07c62275 logged out == == Finished build. == == Archiving artifacts Compressed 3.50 MB of artifacts by 10.7% relative to #2529 [description-setter] Description set: TEZ-3768 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-3768) Test timeout value for TestShuffleHandlerJobs is low
[ https://issues.apache.org/jira/browse/TEZ-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3768: - Attachment: TEZ-3768.001.patch v1 patch that removes the timeout value. Alternatively, I could raise it from 1 minute to around 5 minutes, but I am going with this for now. > Test timeout value for TestShuffleHandlerJobs is low > > > Key: TEZ-3768 > URL: https://issues.apache.org/jira/browse/TEZ-3768 > Project: Apache Tez > Issue Type: Bug >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla >Priority: Minor > Attachments: TEZ-3768.001.patch > > > The test can fail with a timeout on slow build machines. One minute is > clearly too low. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
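Editorial note: the trade-off discussed above (a fixed one-minute limit versus a larger or absent one) can be illustrated with a small standalone sketch. This is not the TEZ-3768 patch and not JUnit's implementation; `TimeoutSketch` and `finishesWithin` are hypothetical names, and the sleep durations merely stand in for a job that takes longer on a slow build machine. It only shows the general pattern of running work on a separate thread and giving up after a deadline.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of Tez or JUnit.
class TimeoutSketch {

  /** Runs body on a worker thread and reports whether it finished within timeoutMillis. */
  static boolean finishesWithin(Callable<?> body, long timeoutMillis)
      throws InterruptedException, ExecutionException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<?> future = executor.submit(body);
      try {
        future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        // The deadline passed before the work completed: interrupt the worker.
        future.cancel(true);
        return false;
      }
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for a job whose duration depends on the machine it runs on.
    Callable<Void> slowJob = () -> {
      Thread.sleep(200);
      return null;
    };
    // A tight limit fails the same job that a generous limit lets through.
    System.out.println("tight limit ok?    " + finishesWithin(slowJob, 50));
    System.out.println("generous limit ok? " + finishesWithin(slowJob, 2000));
  }
}
```

The same job passes or fails purely as a function of the limit, which is why a limit tuned on a fast machine can produce unrelated failures on a slow one.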
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057948#comment-16057948 ] Kuhu Shukla commented on TEZ-3758: -- The test failure is unrelated: the test timed out because it ran for more than a minute. I have opened TEZ-3768 to track the minor change to the test timeout. [~jeagles], I would appreciate any comments/review. Thanks a lot! > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, > TEZ-3758.003.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (TEZ-3768) Test timeout value for TestShuffleHandlerJobs is low
Kuhu Shukla created TEZ-3768: Summary: Test timeout value for TestShuffleHandlerJobs is low Key: TEZ-3768 URL: https://issues.apache.org/jira/browse/TEZ-3768 Project: Apache Tez Issue Type: Bug Reporter: Kuhu Shukla Assignee: Kuhu Shukla Priority: Minor The test can fail with a timeout on slow build machines. One minute is clearly too low. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Failed: TEZ-3758 PreCommit Build #2530
Jira: https://issues.apache.org/jira/browse/TEZ-3758 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2530/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 337.56 KB...] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :tez-aux-services [INFO] Build failures were ignored. {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873897/TEZ-3758.003.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.auxservices.TestShuffleHandlerJobs Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 76a3422af5ee27fd50fa12144f3fd27eefcd4ddf logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts [description-setter] Could not determine description. 
Recording test results Email was triggered for: Failure - Any Sending email for trigger: Failure - Any ### ## FAILED TESTS (if any) ## 1 tests failed. FAILED: org.apache.tez.auxservices.TestShuffleHandlerJobs.testOrderedWordCount Error Message: test timed out after 6 milliseconds Stack Trace: java.lang.Exception: test timed out after 6 milliseconds at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.hadoop.ipc.Client.call(Client.java:1462) at org.apache.hadoop.ipc.Client.call(Client.java:1407) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy90.getDAGStatus(Unknown Source) at org.apache.tez.dag.api.client.rpc.DAGClientRPCImpl.getDAGStatusViaAM(DAGClientRPCImpl.java:199) at org.apache.tez.dag.api.client.rpc.DAGClientRPCImpl.getDAGStatus(DAGClientRPCImpl.java:97) at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatusViaAM(DAGClientImpl.java:371) at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatusInternal(DAGClientImpl.java:221) at org.apache.tez.dag.api.client.DAGClientImpl.getDAGStatus(DAGClientImpl.java:208) at org.apache.tez.dag.api.client.DAGClientImpl._waitForCompletionWithStatusUpdates(DAGClientImpl.java:540) at org.apache.tez.dag.api.client.DAGClientImpl.waitForCompletionWithStatusUpdates(DAGClientImpl.java:349) at org.apache.tez.examples.TezExampleBase.runDag(TezExampleBase.java:187) at org.apache.tez.examples.OrderedWordCount.runJob(OrderedWordCount.java:204) at org.apache.tez.examples.TezExampleBase._execute(TezExampleBase.java:232) at org.apache.tez.examples.TezExampleBase.run(TezExampleBase.java:150) at org.apache.tez.auxservices.TestShuffleHandlerJobs.testOrderedWordCount(TestShuffleHandlerJobs.java:129)
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057920#comment-16057920 ] TezQA commented on TEZ-3758: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873897/TEZ-3758.003.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.auxservices.TestShuffleHandlerJobs Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2530//console This message is automatically generated. > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, > TEZ-3758.003.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. 
We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kuhu Shukla updated TEZ-3758: - Attachment: TEZ-3758.003.patch Revised patch that adds the new redundant transition only for cases where attempts are launched or succeed. Renamed the transition class accordingly. Also made the analogous change for when the task state is FAILED. The current inconsistency of the 'status' data structure does not affect us when the task is marked failed, since the DAG would fail as well, but after this change the status data structure at least reflects correct values. The KILLED task state transitions did not need this change since they already mark the statuses correctly before adding another attempt. The previously failing tests pass after this change. > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch, > TEZ-3758.003.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
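Editorial note: the race described in this issue can be modelled with a toy state machine. The sketch below is an assumption-laden simplification, not the Tez task implementation and not the attached patch: `TaskSketch`, its `attemptFinished` map (standing in for the per-attempt status structure mentioned above), and the `recordLateSuccess` switch are all hypothetical names invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model for illustration only; none of these names come from Tez.
class TaskSketch {
  // attempt id -> true once the attempt is known to have finished
  private final Map<Integer, Boolean> attemptFinished = new HashMap<>();
  private final boolean recordLateSuccess;
  private boolean succeeded = false;
  private int nextAttemptId = 0;

  TaskSketch(boolean recordLateSuccess) {
    this.recordLateSuccess = recordLateSuccess;
  }

  int launchAttempt() {
    int id = nextAttemptId++;
    attemptFinished.put(id, false);
    return id;
  }

  void onAttemptSucceeded(int id) {
    if (!succeeded) {
      // First success: the task becomes successful and the attempt is recorded.
      succeeded = true;
      attemptFinished.put(id, true);
    } else if (recordLateSuccess) {
      // A second attempt succeeded just after the first: record it as well.
      attemptFinished.put(id, true);
    }
    // Otherwise the late success is ignored and its entry stays false.
  }

  /** Returns true if a replacement attempt was scheduled. */
  boolean onRetroactiveFailure(int id) {
    succeeded = false;
    attemptFinished.put(id, true);
    if (attemptFinished.containsValue(false)) {
      // The task believes another attempt is still running, so it waits.
      return false;
    }
    launchAttempt();
    return true;
  }

  /** Replays the scenario: two near-simultaneous successes, then a retroactive failure. */
  static boolean rescheduledAfterRace(boolean recordLateSuccess) {
    TaskSketch task = new TaskSketch(recordLateSuccess);
    int first = task.launchAttempt();
    int second = task.launchAttempt();
    task.onAttemptSucceeded(first);
    task.onAttemptSucceeded(second);
    return task.onRetroactiveFailure(first);
  }

  public static void main(String[] args) {
    System.out.println(rescheduledAfterRace(false)); // prints false: the task waits forever
    System.out.println(rescheduledAfterRace(true));  // prints true: a new attempt is scheduled
  }
}
```

In the first run the stale `false` entry makes the task wait on an attempt that has already finished, which is the hang the issue title refers to; recording the late success removes the stale entry.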
[jira] [Commented] (TEZ-3758) Vertex can hang in RUNNING state when two task attempts finish very closely and have retroactive failures
[ https://issues.apache.org/jira/browse/TEZ-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057496#comment-16057496 ] Kuhu Shukla commented on TEZ-3758: -- Test failures are relevant. Looking into them now. > Vertex can hang in RUNNING state when two task attempts finish very closely > and have retroactive failures > - > > Key: TEZ-3758 > URL: https://issues.apache.org/jira/browse/TEZ-3758 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.9.0 >Reporter: Kuhu Shukla >Assignee: Kuhu Shukla > Attachments: TEZ-3758.001.patch, TEZ-3758.002.patch > > > A vertex's count of what tasks are done can go off in a case where two task > attempts finish very closely, say within a millisecond of each other. We had > a case where this task, which was marked successful, never scheduled another > attempt upon getting a retroactive failure since it thought it had one > uncompleted task attempt already. This is because the attempt that finished 1 > ms later transitioned to SUCCEEDED but we don't take any action on the > taskAttempStatus data structure and it stays false. This JIRA will attempt to > solve that race. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Success: TEZ-3767 PreCommit Build #2529
Jira: https://issues.apache.org/jira/browse/TEZ-3767 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2529/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 338.96 KB...] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 51:56 min [INFO] Finished at: 2017-06-21T13:07:40Z [INFO] Final Memory: 97M/1376M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873854/TEZ-3767.2.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2529//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2529//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 71a21fbd1ae01ff3c75dc2494221deb5eec33e3f logged out == == Finished build. == == Archiving artifacts Compressed 3.50 MB of artifacts by 12.5% relative to #2528 [description-setter] Description set: TEZ-3767 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-3767) Shuffle should not report error to AM during inputContext.killSelf()
[ https://issues.apache.org/jira/browse/TEZ-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057440#comment-16057440 ] TezQA commented on TEZ-3767: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12873854/TEZ-3767.2.patch against master revision a925c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2529//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2529//console This message is automatically generated. > Shuffle should not report error to AM during inputContext.killSelf() > > > Key: TEZ-3767 > URL: https://issues.apache.org/jira/browse/TEZ-3767 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-3767.1.patch, TEZ-3767.2.patch > > > {{ShuffleScheduler::killSelf}} kills the current attempt when it encounters > certain errors. As a part of cleanup, it invokes {{close}} which internally > releases the resources. > If merge is happening in the middle, it could throw the following exception. > This is caught in {{RunShuffleCallable}} and reported to AM immediately. This > causes tasks to fail. 
> {noformat} > » Error: Error while running task ( failure ) : > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: > Error while doing final merge > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:320) > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.util.ConcurrentModificationException > at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211) > at java.util.TreeMap$KeyIterator.next(TreeMap.java:1265) > at java.util.AbstractCollection.toArray(AbstractCollection.java:141) > at java.util.ArrayList.addAll(ArrayList.java:577) > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:636) > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:316) > ... 6 more > {noformat} > When {{isShutDown}} is set to true, it would be good to avoid sending error > messages to AM. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TEZ-3767) Shuffle should not report error to AM during inputContext.killSelf()
[ https://issues.apache.org/jira/browse/TEZ-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-3767: -- Attachment: TEZ-3767.2.patch Updated comments > Shuffle should not report error to AM during inputContext.killSelf() > > > Key: TEZ-3767 > URL: https://issues.apache.org/jira/browse/TEZ-3767 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-3767.1.patch, TEZ-3767.2.patch > > > {{ShuffleScheduler::killSelf}} kills the current attempt when it encounters > certain errors. As a part of cleanup, it invokes {{close}} which internally > releases the resources. > If merge is happening in the middle, it could throw the following exception. > This is caught in {{RunShuffleCallable}} and reported to AM immediately. This > causes tasks to fail. > {noformat} > » Error: Error while running task ( failure ) : > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: > Error while doing final merge > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:320) > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.util.ConcurrentModificationException > at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211) > at java.util.TreeMap$KeyIterator.next(TreeMap.java:1265) > at java.util.AbstractCollection.toArray(AbstractCollection.java:141) > at java.util.ArrayList.addAll(ArrayList.java:577) > at > 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:636) > at > org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:316) > ... 6 more > {noformat} > When {{isShutDown}} is set to true, it would be good to avoid sending error > messages to AM. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
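Editorial note: the suggestion at the end of this issue (skip error reporting once shutdown has begun) can be sketched as follows. This is a minimal illustration under stated assumptions, not the TEZ-3767 patch: `ShuffleRunnerSketch`, `runFinalMerge`, and the `reportedToAm` list are hypothetical stand-ins, and only the `isShutDown` flag name is taken from the description above.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for illustration only; not the Tez Shuffle class.
class ShuffleRunnerSketch {
  private final AtomicBoolean isShutDown = new AtomicBoolean(false);
  // Stand-in for "errors reported to the AM".
  private final List<Throwable> reportedToAm = new ArrayList<>();

  void shutDown() {
    isShutDown.set(true);
  }

  void runFinalMerge(Runnable merge) {
    try {
      merge.run();
    } catch (RuntimeException e) {
      if (isShutDown.get()) {
        // Cleanup is already in progress, so the failure is a side effect
        // of shutting down and is not reported.
        return;
      }
      reportedToAm.add(e);
    }
  }

  List<Throwable> reported() {
    return reportedToAm;
  }

  public static void main(String[] args) {
    Runnable failingMerge = () -> {
      throw new ConcurrentModificationException("merge interrupted by close");
    };

    ShuffleRunnerSketch shuttingDown = new ShuffleRunnerSketch();
    shuttingDown.shutDown();
    shuttingDown.runFinalMerge(failingMerge);
    System.out.println(shuttingDown.reported().size()); // prints 0

    ShuffleRunnerSketch running = new ShuffleRunnerSketch();
    running.runFinalMerge(failingMerge);
    System.out.println(running.reported().size()); // prints 1
  }
}
```

The check is made inside the catch block so that genuine merge failures during normal operation are still reported.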