[jira] [Commented] (TEZ-3077) TezClient.waitTillReady should support timeout
[ https://issues.apache.org/jira/browse/TEZ-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221245#comment-15221245 ] Kuhu Shukla commented on TEZ-3077: -- [~sseth], [~hitesh], Request for comments/review. Thanks a lot! > TezClient.waitTillReady should support timeout > -- > > Key: TEZ-3077 > URL: https://issues.apache.org/jira/browse/TEZ-3077 > Project: Apache Tez > Issue Type: Bug >Reporter: Sergey Shelukhin >Assignee: Kuhu Shukla > Attachments: TEZ-3077.001.patch, TEZ-3077.002.patch > > > Also preWarm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3193) Deadlock in AM during task commit request
[ https://issues.apache.org/jira/browse/TEZ-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221046#comment-15221046 ] Bikas Saha commented on TEZ-3193: - This is probably a leftover from the removal of such reverse calls. There were more of them and some were removed by making sure that such objects/members are available locally to the TaskAttemptImpl (from the Task passed in via the constructor) instead of calling back into the task to get these objects/members. Hence, task location hint and taskSpec could be passed in via the constructor and referenced locally. Doing this helps other future scenarios as well. If the TA location hint is passed in via a constructor then it could be made different for each attempt. E.g. remove the machine for v.1 from the location hint of v.2 for a speculative execution so that the speculated attempt does not end up on the same machine. There is a jira open for this. Similarly, change the spec of v.1 to have higher memory than the default for that vertex because v.0 died with OOM. > Deadlock in AM during task commit request > - > > Key: TEZ-3193 > URL: https://issues.apache.org/jira/browse/TEZ-3193 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.1, 0.8.2 >Reporter: Jason Lowe >Priority: Blocker > > The AM can deadlock between TaskImpl and TaskAttemptImpl. Stacktrace and > details in a followup comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
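The constructor-passing idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Task`, `Attempt`, `newAttempt` and `describe` are hypothetical stand-ins, not Tez's TaskImpl/TaskAttemptImpl or their real signatures. The point is that the attempt reads a value handed to it at construction time, so it never calls back into the task (and never needs the task's lock) while holding its own lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for illustration only; these are not Tez's TaskImpl or
// TaskAttemptImpl, and the field and method names are assumptions.
final class ConstructorInjectionSketch {

    static final class Attempt {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final int attemptNumber;
        private final String locationHint; // handed in once, then read locally

        Attempt(int attemptNumber, String locationHint) {
            this.attemptNumber = attemptNumber;
            this.locationHint = locationHint;
        }

        // Only the attempt's own lock is taken; there is no call back into the
        // owning task, so this path never needs the task's lock.
        String describe() {
            lock.readLock().lock();
            try {
                return "attempt " + attemptNumber + " -> " + locationHint;
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    static final class Task {
        private final String defaultLocationHint;

        Task(String defaultLocationHint) {
            this.defaultLocationHint = defaultLocationHint;
        }

        // The task decides the hint when it constructs the attempt.
        Attempt newAttempt(int attemptNumber) {
            return new Attempt(attemptNumber, defaultLocationHint);
        }

        // Per-attempt override, e.g. a speculative attempt steered to another machine.
        Attempt newAttempt(int attemptNumber, String hintForThisAttempt) {
            return new Attempt(attemptNumber, hintForThisAttempt);
        }
    }

    public static void main(String[] args) {
        Task task = new Task("host1");
        System.out.println(task.newAttempt(0).describe());
        System.out.println(task.newAttempt(1, "host2").describe());
    }
}
```

Because the hint is a constructor argument, giving a later attempt a different hint (or a different spec) is just a different argument at creation time, which is the follow-on benefit the comment mentions.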
[jira] [Commented] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing
[ https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220953#comment-15220953 ] TezQA commented on TEZ-3192: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12796422/TEZ-3192.1.patch against master revision e416991. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.test.TestRecovery Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1601//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1601//console This message is automatically generated. > IFile#checkState creating unnecessary objects though auto-boxing > > > Key: TEZ-3192 > URL: https://issues.apache.org/jira/browse/TEZ-3192 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 0.7.1, 0.8.3 > > Attachments: TEZ-3192.1.patch > > > checkState is a varargs function which takes Objects. ints and longs create > unnecessary Integers and Long objects through Integer.valueOf and > Long.valueOf. This is used in the read key and read value loop so while > small, puts this on par with the MR equivalent. 
Failed: TEZ-3192 PreCommit Build #1601
Jira: https://issues.apache.org/jira/browse/TEZ-3192 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1601/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 4402 lines...] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :tez-tests [INFO] Build failures were ignored. {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12796422/TEZ-3192.1.patch against master revision e416991. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.test.TestRecovery Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1601//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1601//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. ad39565c18aa36dd00957cbafa960ac847fb8f43 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Compressed 3.40 MB of artifacts by 11.0% relative to #1599 [description-setter] Could not determine description. 
Recording test results Email was triggered for: Failure - Any Sending email for trigger: Failure - Any ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing
[ https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220921#comment-15220921 ] Jonathan Eagles commented on TEZ-3192: -- Thanks [~rajesh.balamohan]. Committed this patch to master and branch-0.7. > IFile#checkState creating unnecessary objects though auto-boxing > > > Key: TEZ-3192 > URL: https://issues.apache.org/jira/browse/TEZ-3192 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3192.1.patch > > > checkState is a varargs function which takes Objects. ints and longs create > unnecessary Integers and Long objects through Integer.valueOf and > Long.valueOf. This is used in the read key and read value loop so while > small, puts this on par with the MR equivalent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill
[ https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220917#comment-15220917 ] Hitesh Shah commented on TEZ-3161: --
bq. In terms of alternate naming - do you have suggestions on what would be less confusing
Not sure - fatalError(), abortProcessing() - not sure I have good suggestions especially as fatalError is probably the one which should be indicating a fatal error instead of the current non-fatal behavior.
bq. I'm OK marking it as private
Let's mark it so initially until we can figure out a clear use-case for self-kills.
bq. Any suggestion on this. Duplicate the TerminationCause to include FATAL_, and KILL_ for almost all the existing TerminationCauses ?
Wouldn't there be only one specific termination cause to indicate that the user-code told the framework to abort itself or kill itself?
bq. I thought it was being written to history.
TaskAttemptFinished event is being written to history but the failure type bit is not in the data being pushed to ATS ( check TimelineHistoryEventConversion or the *JsonConversion ). The proto was changed but that is only used in Recovery. Tests in subsequent follow-ups should be ok.
> Allow task to report different kinds of errors - fatal / kill > - > > Key: TEZ-3161 > URL: https://issues.apache.org/jira/browse/TEZ-3161 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Siddharth Seth > Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt > > > In some cases, task failures will be the same across all attempts - e.g. > exceeding memory utilization on an operation. In this case, there's no point > in running another attempt of the same task. > There's other cases where a task may want to mark itself as KILLED - i.e. a > temporary error. An example of this is pipelined shuffle. > Tez should allow both operations. > cc [~vikram.dixit], [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing
[ https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220880#comment-15220880 ] Rajesh Balamohan commented on TEZ-3192: --- +1. lgtm. Thanks [~jeagles] > IFile#checkState creating unnecessary objects though auto-boxing > > > Key: TEZ-3192 > URL: https://issues.apache.org/jira/browse/TEZ-3192 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3192.1.patch > > > checkState is a varargs function which takes Objects. ints and longs create > unnecessary Integers and Long objects through Integer.valueOf and > Long.valueOf. This is used in the read key and read value loop so while > small, puts this on par with the MR equivalent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3193) Deadlock in AM during task commit request
[ https://issues.apache.org/jira/browse/TEZ-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220835#comment-15220835 ] Jason Lowe commented on TEZ-3193: - Here are the relevant portions of the AM stacktrace when it deadlocks:
{noformat}
"TaskSchedulerAppCaller #0" #106 daemon prio=5 os_prio=0 tid=0x7fb1cc1bb800 nid=0x4619 waiting on condition [0x7fb1b6509000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xc5d8e398> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.getState(TaskAttemptImpl.java:630)
	at org.apache.tez.dag.app.dag.impl.TaskImpl.selectBestAttempt(TaskImpl.java:745)
	at org.apache.tez.dag.app.dag.impl.TaskImpl.getProgress(TaskImpl.java:483)
	at org.apache.tez.dag.app.dag.impl.VertexImpl.computeProgress(VertexImpl.java:1285)
	at org.apache.tez.dag.app.dag.impl.VertexImpl.getProgress(VertexImpl.java:1195)
	at org.apache.tez.dag.app.dag.impl.DAGImpl.getProgress(DAGImpl.java:829)
	at org.apache.tez.dag.app.DAGAppMaster.getProgress(DAGAppMaster.java:1181)
	at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.getProgress(TaskSchedulerEventHandler.java:560)
	at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$GetProgressCallable.call(TaskSchedulerAppCallbackWrapper.java:291)
	at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$GetProgressCallable.call(TaskSchedulerAppCallbackWrapper.java:282)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 4 on 52743" #64 daemon prio=5 os_prio=0 tid=0x7fb1c454c800 nid=0x45ca waiting on condition [0x7fb1b920e000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xc1421810> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
	at org.apache.tez.dag.app.dag.impl.TaskImpl.canCommit(TaskImpl.java:768)
	at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.canCommit(TaskAttemptListenerImpTezDag.java:274)
	at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:483)
	at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:514)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)

"Dispatcher thread {Central}" #37 prio=5 os_prio=0 tid=0x7fb1c422f000 nid=0x45aa waiting on condition [0x7fb1ba722000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xc1421810> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
[jira] [Created] (TEZ-3193) Deadlock in AM during task commit request
Jason Lowe created TEZ-3193: --- Summary: Deadlock in AM during task commit request Key: TEZ-3193 URL: https://issues.apache.org/jira/browse/TEZ-3193 Project: Apache Tez Issue Type: Bug Affects Versions: 0.8.2, 0.7.1 Reporter: Jason Lowe Priority: Blocker The AM can deadlock between TaskImpl and TaskAttemptImpl. Stacktrace and details in a followup comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing
[ https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-3192: - Attachment: TEZ-3192.1.patch > IFile#checkState creating unnecessary objects though auto-boxing > > > Key: TEZ-3192 > URL: https://issues.apache.org/jira/browse/TEZ-3192 > Project: Apache Tez > Issue Type: Bug >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Attachments: TEZ-3192.1.patch > > > checkState is a varargs function which takes Objects. ints and longs create > unnecessary Integers and Long objects through Integer.valueOf and > Long.valueOf. This is used in the read key and read value loop so while > small, puts this on par with the MR equivalent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing
Jonathan Eagles created TEZ-3192: Summary: IFile#checkState creating unnecessary objects though auto-boxing Key: TEZ-3192 URL: https://issues.apache.org/jira/browse/TEZ-3192 Project: Apache Tez Issue Type: Bug Reporter: Jonathan Eagles Assignee: Jonathan Eagles checkState is a varargs function which takes Objects. ints and longs create unnecessary Integers and Long objects through Integer.valueOf and Long.valueOf. This is used in the read key and read value loop so while small, puts this on par with the MR equivalent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
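A minimal sketch of the boxing cost described in the issue, assuming a simplified check helper. `CheckStateSketch`, `checkStateVarargs` and the primitive `checkState` overload are hypothetical names for illustration; this is not the IFile code or the attached patch. The varargs form boxes each int/long and allocates an Object[] on every call, even when the check passes, whereas a primitive overload defers any boxing to the failure path.

```java
// Hypothetical illustration only; not the actual Tez IFile code or the TEZ-3192 patch.
final class CheckStateSketch {

    // Varargs form: every int/long argument is boxed via Integer.valueOf or
    // Long.valueOf and an Object[] is allocated on each call, even when ok is true.
    static void checkStateVarargs(boolean ok, String template, Object... args) {
        if (!ok) {
            throw new IllegalStateException(String.format(template, args));
        }
    }

    // Primitive overload: the longs are passed as-is; boxing happens only inside
    // String.format, which runs only when the check fails.
    static void checkState(boolean ok, String template, long a, long b) {
        if (!ok) {
            throw new IllegalStateException(String.format(template, a, b));
        }
    }

    public static void main(String[] args) {
        long position = 10L;
        long length = 100L;
        // Both calls pass the check; only the first one allocates on the way in.
        checkStateVarargs(position <= length, "position %d exceeds length %d", position, length);
        checkState(position <= length, "position %d exceeds length %d", position, length);
        System.out.println("both checks passed");
    }
}
```

In a tight loop such as reading keys and values, the second form avoids per-call garbage while keeping the same failure message.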
[jira] [Commented] (TEZ-3177) Non-DAG events should use the session domain or no domain if the data does not need protection
[ https://issues.apache.org/jira/browse/TEZ-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220597#comment-15220597 ] TezQA commented on TEZ-3177: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12796369/TEZ-3177.1.patch against master revision e416991. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestFaultTolerance org.apache.tez.dag.app.rm.TestContainerReuse Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1600//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1600//console This message is automatically generated. > Non-DAG events should use the session domain or no domain if the data does > not need protection > --- > > Key: TEZ-3177 > URL: https://issues.apache.org/jira/browse/TEZ-3177 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-3177.1.patch > > > There have been issues noticed where when using dag specific domains, > container events get generated under different dags causing issues as they > are updated using different domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-3177 PreCommit Build #1600
Jira: https://issues.apache.org/jira/browse/TEZ-3177 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1600/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 4678 lines...] [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :tez-dag [INFO] Build failures were ignored. {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12796369/TEZ-3177.1.patch against master revision e416991. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in : org.apache.tez.test.TestFaultTolerance org.apache.tez.dag.app.rm.TestContainerReuse Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1600//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1600//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 6507b1af9699df5582387621c34975e408f092d5 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Compressed 3.39 MB of artifacts by 10.2% relative to #1599 [description-setter] Could not determine description. Recording test results Email was triggered for: Failure - Any Sending email for trigger: Failure - Any ### ## FAILED TESTS (if any) ## 8 tests failed. 
FAILED: org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources Error Message: Wanted but not invoked: taskSchedulerManagerForTest.taskAllocated( 0, Mock for TA attempt_0_0001_0_01_03_1, , Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, NodeHttpAddress: host1:0, Resource: , Priority: 1, Token: null, ] ); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1254) However, there were other interactions with this mock: taskSchedulerManagerForTest.init( Configuration: core-default.xml, core-site.xml, yarn-default.xml, yarn-site.xml ); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1143) taskSchedulerManagerForTest.setConfig( Configuration: core-default.xml, core-site.xml, yarn-default.xml, yarn-site.xml ); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1143) taskSchedulerManagerForTest.serviceInit( Configuration: core-default.xml, core-site.xml, yarn-default.xml, yarn-site.xml ); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1143) taskSchedulerManagerForTest.start(); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144) taskSchedulerManagerForTest.serviceStart(); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144) taskSchedulerManagerForTest.instantiateSchedulers( "host", 0, "", Mock for AppContext, hashCode: 469698423 ); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144) taskSchedulerManagerForTest.getContainerSignatureMatcher(); -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144) taskSchedulerManagerForTest.getConfig(); -> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContaine
[jira] [Commented] (TEZ-3177) Non-DAG events should use the session domain or no domain if the data does not need protection
[ https://issues.apache.org/jira/browse/TEZ-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220406#comment-15220406 ] Hitesh Shah commented on TEZ-3177: -- [~sseth] [~rajesh.balamohan] please review > Non-DAG events should use the session domain or no domain if the data does > not need protection > --- > > Key: TEZ-3177 > URL: https://issues.apache.org/jira/browse/TEZ-3177 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-3177.1.patch > > > There have been issues noticed where when using dag specific domains, > container events get generated under different dags causing issues as they > are updated using different domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3177) Non-DAG events should use the session domain or no domain if the data does not need protection
[ https://issues.apache.org/jira/browse/TEZ-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-3177: - Attachment: TEZ-3177.1.patch Made AM/Container events use sessionDomainId always. Also minor fix to use dag counter for domainIds as dag names are not unique. > Non-DAG events should use the session domain or no domain if the data does > not need protection > --- > > Key: TEZ-3177 > URL: https://issues.apache.org/jira/browse/TEZ-3177 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Hitesh Shah > Attachments: TEZ-3177.1.patch > > > There have been issues noticed where when using dag specific domains, > container events get generated under different dags causing issues as they > are updated using different domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer
[ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220297#comment-15220297 ] Kurt Muehlner commented on TEZ-3187: [~rajesh.balamohan] task attempt logs for all those tasks are in task_attempts.tar.gz. It appears to me they have all completed successfully. > Pig on tez hang with java.io.IOException: Connection reset by peer > -- > > Key: TEZ-3187 > URL: https://issues.apache.org/jira/browse/TEZ-3187 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 > Environment: Hadoop 2.5.0 > Pig 0.15.0 > Tez 0.8.2 >Reporter: Kurt Muehlner > Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, > dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, > syslog_dag_1437886552023_169758_3.gz, task_attempts.tar.gz > > > We are experiencing occasional application hangs, when testing an existing > Pig MapReduce script, executing on Tez. When this occurs, we find this in > the syslog for the executing dag: > 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout > delay expired or is new. Releasing container, > containerId=container_e11_1437886552023_169758_01_000822, > containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, > heldContainers=112, delayedContainers=27, isNew=false > 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout > delay expired or is new.
Releasing container, > containerId=container_e11_1437886552023_169758_01_000824, > containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, > heldContainers=111, delayedContainers=26, isNew=false > 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] > |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client > 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer] > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593) > at org.apache.hadoop.ipc.Server.access$2800(Server.java:135) > at > org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471) > at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762) > at > org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636) > at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607) > 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout > delay expired or is new. Releasing container, > containerId=container_e11_1437886552023_169758_01_000811, > containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, > heldContainers=110, delayedContainers=25, isNew=false > In all cases I've been able to analyze so far, this also correlates with a > warning in the node identified in the IOException: > 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] > |retry.RetryInvocationHandler|: A failover has occurred since the start of > this method invocation attempt. 
> However, it does not appear that any namenode failover has actually occurred > (the most recent failover we see in logs is from 2015). > Attached: > syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs > 10.102.173.86.logs.gz: aggregated logs from the host identified in the > IOException -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer
[ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kurt Muehlner updated TEZ-3187: --- Attachment: task_attempts.tar.gz > Pig on tez hang with java.io.IOException: Connection reset by peer > -- > > Key: TEZ-3187 > URL: https://issues.apache.org/jira/browse/TEZ-3187 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 > Environment: Hadoop 2.5.0 > Pig 0.15.0 > Tez 0.8.2 >Reporter: Kurt Muehlner > Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, > dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, > syslog_dag_1437886552023_169758_3.gz, task_attempts.tar.gz > > > We are experiencing occasional application hangs, when testing an existing > Pig MapReduce script, executing on Tez. When this occurs, we find this in > the syslog for the executing dag: > 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout > delay expired or is new. Releasing container, > containerId=container_e11_1437886552023_169758_01_000822, > containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, > heldContainers=112, delayedContainers=27, isNew=false > 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout > delay expired or is new.
Releasing container, > containerId=container_e11_1437886552023_169758_01_000824, > containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, > heldContainers=111, delayedContainers=26, isNew=false > 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] > |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client > 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer] > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593) > at org.apache.hadoop.ipc.Server.access$2800(Server.java:135) > at > org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471) > at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762) > at > org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636) > at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607) > 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout > delay expired or is new. Releasing container, > containerId=container_e11_1437886552023_169758_01_000811, > containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, > heldContainers=110, delayedContainers=25, isNew=false > In all cases I've been able to analyze so far, this also correlates with a > warning in the node identified in the IOException: > 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] > |retry.RetryInvocationHandler|: A failover has occurred since the start of > this method invocation attempt. 
> However, it does not appear that any namenode failover has actually occurred > (the most recent failover we see in logs is from 2015). > Attached: > syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs > 10.102.173.86.logs.gz: aggregated logs from the host identified in the > IOException -- This message was sent by Atlassian JIRA (v6.3.4#6332)
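For anyone triaging similar syslogs, here is a minimal sketch of pulling the relevant details out of entries like the ones quoted in this issue: the key=value fields of a "Releasing container" line, and the client address from a "Connection reset by peer" entry. The regular expressions and the function names `parse_release` and `find_resets` are hypothetical, derived only from the log excerpt in this report; they are not part of Tez or Hadoop, and they assume the raw syslog with one entry per line rather than the wrapped email text.

```python
import re

# Assumption: these patterns mirror the log excerpt in this issue only.
RELEASE_RE = re.compile(r"Releasing container, (?P<fields>.*)")
RESET_RE = re.compile(
    r"readAndProcess from client (?P<client>\d{1,3}(?:\.\d{1,3}){3}) threw exception "
    r"\[java\.io\.IOException: Connection reset by peer\]"
)
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_release(line):
    """Return the key=value fields of a 'Releasing container' line, or None."""
    match = RELEASE_RE.search(line)
    if match is None:
        return None
    fields = {}
    for pair in match.group("fields").split(","):
        key, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that are not key=value
            fields[key] = value
    return fields


def find_resets(lines):
    """Yield (timestamp, client_ip) for each 'Connection reset by peer' entry."""
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            stamp = TIMESTAMP_RE.match(line)
            yield (stamp.group(1) if stamp else None, match.group("client"))


# Two entries copied from the excerpt in this issue, unwrapped to one line each.
sample = [
    "2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] "
    "|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout "
    "delay expired or is new. Releasing container, "
    "containerId=container_e11_1437886552023_169758_01_000824, "
    "containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, "
    "heldContainers=111, delayedContainers=26, isNew=false",
    "2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] "
    "|ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client "
    "10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]",
]

if __name__ == "__main__":
    print(parse_release(sample[0]))
    print(list(find_resets(sample)))
```

The client addresses returned by `find_resets` identify the hosts whose aggregated logs are worth checking for the `RetryInvocationHandler` warning mentioned in the description.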
[jira] [Commented] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer
[ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220110#comment-15220110 ]

Kurt Muehlner commented on TEZ-3187:

I'll get those task attempt logs. Meanwhile, I've deployed the Pig config param changes suggested by Daniel, and we do see a change in behavior. This application consists of four pig scripts which execute sequentially. When it hangs, it has been consistently doing so in the third script. I deployed the param changes only in that third script. On the first run thereafter, the application hung, but in the fourth script. That's the first time that's happened. I then deployed the param changes in the fourth script, and as of yet the application hasn't hung. I'll attach the task attempt logs soon.