[jira] [Updated] (TEZ-1621) Should report error to AM before shuting down TezChild
[ https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1621: Summary: Should report error to AM before shuting down TezChild (was: Actual error message not thrown on console, does appear in the YARN application log) > Should report error to AM before shuting down TezChild > -- > > Key: TEZ-1621 > URL: https://issues.apache.org/jira/browse/TEZ-1621 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Deepesh Khandelwal >Assignee: Jeff Zhang > Attachments: Tez-1621.patch, app_logs.txt, console.txt > > > While running an in session testorderedwordcount example the DAG failed with > the following error on the console: > {noformat} > 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: > [Vertex failed, vertexName=initialmap, > vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, > taskId=task_1411586515507_0110_1_00_00, diagnostics=[TaskAttempt 0 > failed, info=[Container container_1411586515507_0110_01_02 finished with > diagnostics set to [Container failed. Exception from container-launch. 
> Container id: container_1411586515507_0110_01_02 > Exit code: 255 > Stack trace: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This wasn't very helpful, the root cause is in the application log: > {noformat} > 2014-09-25 01:55:41,246 ERROR [TezChild] > org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. 
Exiting > now > java.lang.UnsatisfiedLinkError: > org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V > at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native > Method) > at > org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57) > at > org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291) > at > org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344) > at > org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444) > at > org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575) > at > org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539) > at > org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683) > at > org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) > at java.io.DataInputStream.read(DataInputStream.java:100) > at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180) > at > org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149) > at > org.apache.hadoop.mapreduce.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.nextKeyValue(TezGroupedSplitsInputFormat.java:167) > at > org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116) > at > org.apache.tez.mapreduce.processor.map.MapProcessor$NewRecordReader.nextKeyValue(MapProcessor.java:266) > at > org.apache.tez.mapreduce.hadoop.mapreduce.MapContextImpl.nextKeyValue(MapContextImpl.java:81) > at > 
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at > org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:237) > at > org.apache.te
[jira] [Resolved] (TEZ-1555) TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty failing on Windows
[ https://issues.apache.org/jira/browse/TEZ-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha resolved TEZ-1555.
-----------------------------
Resolution: Fixed
Fix Version/s: 0.6.0
Hadoop Flags: Reviewed

Thanks for the fix. Committed.

commit a56f9ef4e81002f64d961a195bd68fd5c10e2bee
Author: Bikas Saha
Date: Fri Sep 26 11:00:15 2014 -0700

TEZ-1555. TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty failing on Windows (Prakash Ramachandran vi

> TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty failing on Windows
> ------------------------------------------------------------------------------------
>
> Key: TEZ-1555
> URL: https://issues.apache.org/jira/browse/TEZ-1555
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Assignee: Prakash Ramachandran
> Fix For: 0.6.0
> Attachments: tez-1555.1.patch
>
> Error Message
> Wrong FS: file://D:/w/tez/tez-api/target/org.apache.tez.client.TestTezClientUtils-tmpDir/emptyDir, expected: file:///
> Stacktrace
> java.lang.IllegalArgumentException: Wrong FS: file://D:/w/tez/tez-api/target/org.apache.tez.client.TestTezClientUtils-tmpDir/emptyDir, expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:465)
>   at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:119)
>   at org.apache.tez.client.TezClientUtils.getLRFileStatus(TezClientUtils.java:132)
>   at org.apache.tez.client.TezClientUtils.setupTezJarsLocalResources(TezClientUtils.java:198)
>   at org.apache.tez.client.TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty(TestTezClientUtils.java:77)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging
[ https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149738#comment-14149738 ]

Bikas Saha commented on TEZ-1609:
---------------------------------

lgtm. Do the fetchers also log the host name from which they are fetching?

> Add hostname to logIdentifiers of fetchers for easy debugging
> -------------------------------------------------------------
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Rajesh Balamohan
> Assignee: Gopal V
> Attachments: TEZ-1609.1.patch
[jira] [Commented] (TEZ-1624) Flaky tests in TestContainerReuse
[ https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149754#comment-14149754 ] Bikas Saha commented on TEZ-1624: - lgtm. Good catch. > Flaky tests in TestContainerReuse > - > > Key: TEZ-1624 > URL: https://issues.apache.org/jira/browse/TEZ-1624 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch > > > Couple of TestContainerReuse tests are failing due to minor race condition in > DelayedContainerManager thread. > Wanted but not invoked: > taskSchedulerEventHandlerForTest.taskAllocated( > Mock for TaskAttempt, hashCode: 290467934, > , > Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, > NodeHttpAddress: host1:0, Resource: , Priority: 1, > Token: null, ] > ); > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:580) > However, there were other interactions with this mock: > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532) > -> at > 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:534) > -> at > org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:570) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:571) > Wanted but not invoked: > taskSchedulerEventHandlerForTest.taskAllocated( > Mock for TaskAttempt, hashCode: 392638651, > , > Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, > NodeHttpAddress: host1:0, Resource: , Priority: 5, > Token: null, ] > ); > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333) > However, there were other interactions with this mock: > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:292) > -> at > 
org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:323) > -> at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:324) > at > org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333) > org.mockito.exceptions.verification.WantedButNotInvoked: > Wanted but not invoked: > taskSchedulerEventHandlerForTest.taskAllocated( > Mock for TaskAttempt, hashCode: 1830222901, > , > Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, > NodeHttpAddress: host1:0, Resource: , Priority: 3, > Token: null, ] > ); > -> at > org.apache.tez.dag.app
[jira] [Updated] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging
[ https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated TEZ-1609:
-------------------------
Assignee: Rajesh Balamohan (was: Gopal V)

> Add hostname to logIdentifiers of fetchers for easy debugging
> -------------------------------------------------------------
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1609.1.patch
[jira] [Commented] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging
[ https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149779#comment-14149779 ]

Gopal V commented on TEZ-1609:
------------------------------

Yes, they do print the URL and speed.

> Add hostname to logIdentifiers of fetchers for easy debugging
> -------------------------------------------------------------
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1609.1.patch
[jira] [Commented] (TEZ-1621) Should report error to AM before shuting down TezChild
[ https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149794#comment-14149794 ] Bikas Saha commented on TEZ-1621: - This will break when we stop sending the failure notification to the AM inline (thus blocking the System.exit() until the AM has been notified). IMO we should remove the System.exit() from these places and move them up to TezChild such that TezChild can observe the Exception/Error and determine if it needs to exit or not. If it needs to exit it can make sure all pending notifications are complete and the AM gets a proper error/diagnostic before exiting. These current exit()s sprayed across the code make graceful cleanup hard to do. And are probably the cause of this jira. If doing the global system.exit() is difficult in this jira then we should at least remove the current system.exit()s and open a follow up jira to handle Error and exit in one place. That will remove the need to special case local mode everywhere in this jira. Ideally, all of Tez code should be using a common util to handle shutdown which exits in non-local mode and does not exit in local mode. > Should report error to AM before shuting down TezChild > -- > > Key: TEZ-1621 > URL: https://issues.apache.org/jira/browse/TEZ-1621 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Deepesh Khandelwal >Assignee: Jeff Zhang > Attachments: Tez-1621.patch, app_logs.txt, console.txt > > > While running an in session testorderedwordcount example the DAG failed with > the following error on the console: > {noformat} > 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: > [Vertex failed, vertexName=initialmap, > vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, > taskId=task_1411586515507_0110_1_00_00, diagnostics=[TaskAttempt 0 > failed, info=[Container container_1411586515507_0110_01_02 finished with > diagnostics set to [Container failed. 
Exception from container-launch. > Container id: container_1411586515507_0110_01_02 > Exit code: 255 > Stack trace: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This wasn't very helpful, the root cause is in the application log: > {noformat} > 2014-09-25 01:55:41,246 ERROR [TezChild] > org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. 
Exiting > now > java.lang.UnsatisfiedLinkError: > org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V > at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native > Method) > at > org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57) > at > org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291) > at > org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344) > at > org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444) > at > org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575) > at > org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539) > at > org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683) > at > org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837) > at java.io.DataInputStream.read(DataInputStream.java:100) > at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180) > at > org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.ap
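For readers of this thread, the "common util to handle shutdown" idea in the comment above could look roughly like the sketch below. This is only an illustration under stated assumptions: `ShutdownSketch`, `ExitHandler`, `track`, and `shutdown` are hypothetical names and do not come from the Tez codebase or from the attached patch. The sketch waits for pending notifications (for example, the failure report to the AM) for a bounded time, and only then calls System.exit(), skipping the exit entirely in local mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names come from the Tez codebase.
public class ShutdownSketch {

    /** Single place that decides whether the JVM really exits. */
    static final class ExitHandler {
        private final boolean localMode;
        private final List<CompletableFuture<Void>> pending = new ArrayList<>();
        private Integer requestedExitCode; // null until shutdown() is called

        ExitHandler(boolean localMode) {
            this.localMode = localMode;
        }

        /** Track an in-flight notification, e.g. a failure report to the AM. */
        synchronized void track(CompletableFuture<Void> notification) {
            pending.add(notification);
        }

        synchronized Integer requestedExitCode() {
            return requestedExitCode;
        }

        /** Wait (bounded) for tracked notifications, then exit unless in local mode. */
        void shutdown(int exitCode, long timeoutMs) {
            CompletableFuture<?>[] snapshot;
            synchronized (this) {
                snapshot = pending.toArray(new CompletableFuture<?>[0]);
            }
            try {
                CompletableFuture.allOf(snapshot).get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            } catch (ExecutionException | TimeoutException e) {
                // Best effort: a failed or slow notification must not block exit forever.
            }
            synchronized (this) {
                requestedExitCode = exitCode;
            }
            if (!localMode) {
                System.exit(exitCode);
            }
        }
    }

    public static void main(String[] args) {
        ExitHandler handler = new ExitHandler(true); // local mode: never calls System.exit
        CompletableFuture<Void> report = CompletableFuture.runAsync(
            () -> System.out.println("reporting error to AM (simulated)"));
        handler.track(report);
        handler.shutdown(255, 1000L);
        System.out.println("report done=" + report.isDone()
            + ", requested exit code=" + handler.requestedExitCode());
    }
}
```

With one handler owned by the top-level process, code deeper in the stack would rethrow the Error instead of exiting, and local mode needs no special casing at each call site. The timeout is a design choice of this sketch, so that an unreachable AM cannot keep a failed container alive indefinitely.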
[jira] [Comment Edited] (TEZ-1621) Should report error to AM before shuting down TezChild
[ https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149794#comment-14149794 ] Bikas Saha edited comment on TEZ-1621 at 9/26/14 6:51 PM: -- This will break when we stop sending the failure notification to the AM inline (thus blocking the System.exit() until the AM has been notified). IMO we should remove the System.exit() from these places and move them up to TezChild such that TezChild can observe the Exception/Error and determine if it needs to exit or not. If it needs to exit it can make sure all pending notifications are complete and the AM gets a proper error/diagnostic before exiting. These current exit()s sprayed across the code make graceful cleanup hard to do. And are probably the cause of this jira. If doing the global system.exit() is difficult in this jira then we should at least remove the current system.exit()s and open a follow up jira to handle Error and exit in one place. That will remove the need to special case local mode everywhere in this jira. Ideally, all of Tez code should be using a common util to handle shutdown which exits in non-local mode and does not exit in local mode. The change to report the error looks good. The above comments are about the existing System.exit()s. Let me know what you think? was (Author: bikassaha): This will break when we stop sending the failure notification to the AM inline (thus blocking the System.exit() until the AM has been notified). IMO we should remove the System.exit() from these places and move them up to TezChild such that TezChild can observe the Exception/Error and determine if it needs to exit or not. If it needs to exit it can make sure all pending notifications are complete and the AM gets a proper error/diagnostic before exiting. These current exit()s sprayed across the code make graceful cleanup hard to do. And are probably the cause of this jira. 
If doing the global system.exit() is difficult in this jira then we should at least remove the current system.exit()s and open a follow up jira to handle Error and exit in one place. That will remove the need to special case local mode everywhere in this jira. Ideally, all of Tez code should be using a common util to handle shutdown which exits in non-local mode and does not exit in local mode. > Should report error to AM before shuting down TezChild > -- > > Key: TEZ-1621 > URL: https://issues.apache.org/jira/browse/TEZ-1621 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Deepesh Khandelwal >Assignee: Jeff Zhang > Attachments: Tez-1621.patch, app_logs.txt, console.txt > > > While running an in session testorderedwordcount example the DAG failed with > the following error on the console: > {noformat} > 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: > [Vertex failed, vertexName=initialmap, > vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, > taskId=task_1411586515507_0110_1_00_00, diagnostics=[TaskAttempt 0 > failed, info=[Container container_1411586515507_0110_01_02 finished with > diagnostics set to [Container failed. Exception from container-launch. 
> Container id: container_1411586515507_0110_01_02 > Exit code: 255 > Stack trace: ExitCodeException exitCode=255: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This wasn't very helpful, the root cause is in the application log: > {noformat} > 2014-09-25 01:55:41,246 ERROR [TezChild] > org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting > now > java.lang.UnsatisfiedLinkError: > org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V > at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native > Method) > at > org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(Nati
[jira] [Created] (TEZ-1626) Tez : Generate per dag logs
Mostafa Mokhtar created TEZ-1626:
---------------------------------

Summary: Tez : Generate per dag logs
Key: TEZ-1626
URL: https://issues.apache.org/jira/browse/TEZ-1626
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Mostafa Mokhtar
Fix For: 0.6.0

When a user submits multiple Hive queries using the same connection (ApplicationId), the logs are not generated for the DAGs that completed; as a result, the user needs to wait until all queries complete.
This behavior makes it very difficult to isolate failures per query, as a single log file will have results from multiple queries.
[jira] [Commented] (TEZ-1612) ShuffleVertexManager's EdgeManager should not hard code source num tasks
[ https://issues.apache.org/jira/browse/TEZ-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149962#comment-14149962 ]

Daniel Dai commented on TEZ-1612:
---------------------------------

Shall we port the patch to the 0.5 branch? Pig 0.14 will use Tez 0.5.

> ShuffleVertexManager's EdgeManager should not hard code source num tasks
> ------------------------------------------------------------------------
>
> Key: TEZ-1612
> URL: https://issues.apache.org/jira/browse/TEZ-1612
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Daniel Dai
> Assignee: Bikas Saha
> Fix For: 0.6.0
> Attachments: DAG1.png, TEZ-1612.1.patch, runwithmaster.tar.gz, syslog_dag_1411413615885_0001_1, testfail1.log.tar.gz
>
> Several Pig unit tests hang intermittently. For example, TestNewPlanImplicitSplit.testImplicitSplitInCoGroup, which is a DAG of 4 nodes:
> !DAG1.png!
> It uses auto-parallelism: vertex 106 changes parallelism from 2->1, and vertex 107 from 21->1.
> Log attached.
[jira] [Commented] (TEZ-1626) Tez : Generate per dag logs
[ https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149969#comment-14149969 ]

Bikas Saha commented on TEZ-1626:
---------------------------------

Separate logs are generated at the AM per DAG and named syslog_dag_xyz. These per-DAG logs can be downloaded from the YARN web UI via the logs link on the AM page.

> Tez : Generate per dag logs
> ---------------------------
>
> Key: TEZ-1626
> URL: https://issues.apache.org/jira/browse/TEZ-1626
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Mostafa Mokhtar
> Labels: tez
> Fix For: 0.6.0
>
> When a user submits multiple Hive queries using the same connection (ApplicationId), the logs are not generated for the DAGs that completed; as a result, the user needs to wait until all queries complete.
> This behavior makes it very difficult to isolate failures per query, as a single log file will have results from multiple queries.
[jira] [Updated] (TEZ-1626) Tez : Generate per DAG logs
[ https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated TEZ-1626:
---------------------------------
Summary: Tez : Generate per DAG logs (was: Tez : Generate per dag logs)

> Tez : Generate per DAG logs
> ---------------------------
>
> Key: TEZ-1626
> URL: https://issues.apache.org/jira/browse/TEZ-1626
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Mostafa Mokhtar
> Labels: tez
> Fix For: 0.6.0
>
> When a user submits multiple Hive queries using the same connection (ApplicationId), the logs are not generated for the DAGs that completed; as a result, the user needs to wait until all queries complete.
> This behavior makes it very difficult to isolate failures per query, as a single log file will have results from multiple queries.
[jira] [Commented] (TEZ-1612) ShuffleVertexManager's EdgeManager should not hard code source num tasks
[ https://issues.apache.org/jira/browse/TEZ-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149975#comment-14149975 ]

Bikas Saha commented on TEZ-1612:
---------------------------------

Yes. This will go into 0.5.1.

> ShuffleVertexManager's EdgeManager should not hard code source num tasks
> ------------------------------------------------------------------------
>
> Key: TEZ-1612
> URL: https://issues.apache.org/jira/browse/TEZ-1612
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Daniel Dai
> Assignee: Bikas Saha
> Fix For: 0.6.0
> Attachments: DAG1.png, TEZ-1612.1.patch, runwithmaster.tar.gz, syslog_dag_1411413615885_0001_1, testfail1.log.tar.gz
>
> Several Pig unit tests hang intermittently. For example, TestNewPlanImplicitSplit.testImplicitSplitInCoGroup, which is a DAG of 4 nodes:
> !DAG1.png!
> It uses auto-parallelism: vertex 106 changes parallelism from 2->1, and vertex 107 from 21->1.
> Log attached.
[jira] [Commented] (TEZ-1626) Tez : Generate per DAG logs
[ https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150023#comment-14150023 ]

Mostafa Mokhtar commented on TEZ-1626:
--------------------------------------

How can the user extract them programmatically?

> Tez : Generate per DAG logs
> ---------------------------
>
> Key: TEZ-1626
> URL: https://issues.apache.org/jira/browse/TEZ-1626
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Mostafa Mokhtar
> Labels: tez
> Fix For: 0.6.0
>
> When a user submits multiple Hive queries using the same connection (ApplicationId), the logs are not generated for the DAGs that completed; as a result, the user needs to wait until all queries complete.
> This behavior makes it very difficult to isolate failures per query, as a single log file will have results from multiple queries.
[jira] [Commented] (TEZ-1626) Tez : Generate per DAG logs
[ https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150051#comment-14150051 ]

Bikas Saha commented on TEZ-1626:
---------------------------------

Ah. That would be a YARN jira. Accessing logs from YARN apps, while running or after completion, falls in the YARN domain.

> Tez : Generate per DAG logs
> ---------------------------
>
> Key: TEZ-1626
> URL: https://issues.apache.org/jira/browse/TEZ-1626
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Mostafa Mokhtar
> Labels: tez
> Fix For: 0.6.0
>
> When a user submits multiple Hive queries using the same connection (ApplicationId), the logs are not generated for the DAGs that completed; as a result, the user needs to wait until all queries complete.
> This behavior makes it very difficult to isolate failures per query, as a single log file will have results from multiple queries.
[jira] [Updated] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in
[ https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1624:
    Summary: Flaky tests in TestContainerReuse due to race condition in  (was: Flaky tests in TestContainerReuse)

> Flaky tests in TestContainerReuse due to race condition in
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
>
> Couple of TestContainerReuse tests are failing due to minor race condition in DelayedContainerManager thread.
>
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
>     Mock for TaskAttempt, hashCode: 290467934,
>     ,
>     Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, NodeHttpAddress: host1:0, Resource: , Priority: 1, Token: null, ]
> );
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:580)
> However, there were other interactions with this mock:
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:534)
> -> at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:570)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:571)
>
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
>     Mock for TaskAttempt, hashCode: 392638651,
>     ,
>     Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, NodeHttpAddress: host1:0, Resource: , Priority: 5, Token: null, ]
> );
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333)
> However, there were other interactions with this mock:
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:292)
> -> at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:323)
> -> at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:324)
>     at org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333)
> org.mockito.exceptions.verification.WantedButNotInvoked:
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
>     Mock for TaskAttempt, hashCode: 1830222901,
>     ,
>     Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0,
> Nod
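The failure mode reported above (a callback that is verified before a background thread has had a chance to deliver it) can be reproduced in miniature. The attached patches are not shown in this thread, so the following is not the actual fix; it is only a minimal sketch of one common way to make such a check robust, by waiting for the callback with a bounded timeout. `AllocationListener`, `assignLater`, and `allocatedWithin` are hypothetical names invented for this illustration and are not part of Tez.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitCallback {
    /** Hypothetical callback, standing in for the mocked event handler. */
    interface AllocationListener {
        void taskAllocated(String task, String containerId);
    }

    /** Hypothetical scheduler: delivers the allocation from a background thread after a delay. */
    static Thread assignLater(String task, String containerId, long delayMillis,
                              AllocationListener listener) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
                listener.taskAllocated(task, containerId);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "DelayedAssigner");
        t.setDaemon(true); // do not keep the JVM alive for a pending delivery
        t.start();
        return t;
    }

    /** Returns true if the callback arrived within the timeout, instead of checking immediately. */
    static boolean allocatedWithin(long delayMillis, long timeoutMillis)
            throws InterruptedException {
        CountDownLatch allocated = new CountDownLatch(1);
        assignLater("task-1", "container-1", delayMillis,
                (task, container) -> allocated.countDown());
        return allocated.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // An immediate check would race with the background thread; a bounded wait does not.
        System.out.println("allocated in time: " + allocatedWithin(50, 5000));
    }
}
```

With Mockito, a similar effect is usually obtained by passing a `timeout(...)` verification mode to `verify`, so the verification polls until the interaction appears or the timeout expires.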
[jira] [Updated] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread
[ https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1624:
    Summary: Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread  (was: Flaky tests in TestContainerReuse due to race condition in )

> Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
[jira] [Resolved] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread
[ https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan resolved TEZ-1624.
    Resolution: Fixed
    Hadoop Flags: Reviewed

Thanks [~bikassaha]. Committed to master and branch-0.5

> Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
[jira] [Updated] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread
[ https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1624:
    Fix Version/s: 0.5.1

> Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Fix For: 0.5.1
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
[jira] [Resolved] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging
[ https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan resolved TEZ-1609.
    Resolution: Fixed
    Fix Version/s: 0.6.0
    Target Version/s: 0.6.0
    Hadoop Flags: Reviewed

Thanks [~bikassaha] and [~gopalv]. Committed the changes to master.

> Add hostname to logIdentifiers of fetchers for easy debugging
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Fix For: 0.6.0
> Attachments: TEZ-1609.1.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1621) Should report error to AM before shuting down TezChild
[ https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150320#comment-14150320 ]

Jeff Zhang commented on TEZ-1621:

bq. This will break when we stop sending the failure notification to the AM inline (thus blocking the System.exit() until the AM has been notified).

What does this mean? sendFailure should not block the System.exit(); it would either succeed (heartbeat success) or throw an exception (heartbeat failure, which would also cause TezChild to shut down).
But I agree on putting the System.exit into TezChild, which would make the code cleaner and easier to maintain.

> Should report error to AM before shuting down TezChild
>
> Key: TEZ-1621
> URL: https://issues.apache.org/jira/browse/TEZ-1621
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Deepesh Khandelwal
> Assignee: Jeff Zhang
> Attachments: Tez-1621.patch, app_logs.txt, console.txt
>
> While running an in session testorderedwordcount example the DAG failed with the following error on the console:
> {noformat}
> 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: [Vertex failed, vertexName=initialmap, vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, taskId=task_1411586515507_0110_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[Container container_1411586515507_0110_01_02 finished with diagnostics set to [Container failed. Exception from container-launch.
> Container id: container_1411586515507_0110_01_02
> Exit code: 255
> Stack trace: ExitCodeException exitCode=255:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This wasn't very helpful, the root cause is in the application log:
> {noformat}
> 2014-09-25 01:55:41,246 ERROR [TezChild] org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting now
> java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V
>     at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native Method)
>     at org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57)
>     at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291)
>     at org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344)
>     at org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444)
>     at org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575)
>     at org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539)
>     at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
>     at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
>     at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
>     at java.io.DataInputStream.read(DataInputStream.java:100)
>     at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>     at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
>     at org.apache.hadoop.mapreduce.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.nextKeyValue(TezGroupedSplitsInputFormat.java:167)
>     at org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116)
>     at org.apache.tez.mapreduce.processor.map.MapProcessor$NewRecordReader.nextKeyValue(MapProcessor.java:266)
>     at org.apache.tez.mapreduce.hadoop.mapreduce.MapContextImpl.
[jira] [Commented] (TEZ-1621) Should report error to AM before shuting down TezChild
[ https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150325#comment-14150325 ]

Bikas Saha commented on TEZ-1621:

I mean this is working now because the error reporting is synchronous (sendFailure) - the error is notified all the way to the AM synchronously before calling system.exit(). If that was not the case then system.exit() might end up bringing down the JVM before the error gets reported to the AM.
{code}
+sendFailure(failureCause, "Fatal error cause TezChild exit.");
+if (isLocal) {
+  throw new TezException("Fatal error cause TezChild exit.", failureCause);
+} else {
+  ExitUtil.terminate(-1, failureCause);
+}
{code}

> Should report error to AM before shuting down TezChild
>
> Key: TEZ-1621
> URL: https://issues.apache.org/jira/browse/TEZ-1621
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Deepesh Khandelwal
> Assignee: Jeff Zhang
> Attachments: Tez-1621.patch, app_logs.txt, console.txt
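The ordering concern raised in the comment above (the failure has to reach the AM before the JVM goes down) can be illustrated with a small stand-alone sketch. This is not the Tez implementation: `FailureReporter`, `Terminator`, and `handleFatalError` are hypothetical names introduced here, and the real patch calls `sendFailure` and `ExitUtil.terminate` as quoted in the comment. The sketch only shows the report-then-exit sequence, with termination injected so the ordering can be observed without killing the process.

```java
import java.util.ArrayList;
import java.util.List;

public class ReportBeforeExit {
    /** Hypothetical stand-in for the synchronous channel to the AM. */
    interface FailureReporter {
        void sendFailure(Throwable cause, String diagnostics) throws Exception;
    }

    /** Hypothetical stand-in for process termination, so the order can be tested. */
    interface Terminator {
        void terminate(int status);
    }

    /**
     * Reports the failure synchronously and only then terminates.
     * If reporting itself fails, the process is still brought down.
     */
    static void handleFatalError(Throwable cause, FailureReporter reporter,
                                 Terminator terminator) {
        try {
            // Blocks until the report has been delivered or has failed.
            reporter.sendFailure(cause, "Fatal error cause TezChild exit.");
        } catch (Exception e) {
            System.err.println("Could not report failure: " + e);
        } finally {
            terminator.terminate(-1);
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        handleFatalError(new UnsatisfiedLinkError("native method missing"),
                (cause, diag) -> events.add("reported: " + cause.getMessage()),
                status -> events.add("exit " + status));
        System.out.println(events); // the report is recorded before the exit
    }
}
```

If the report were handed to another thread and the exit issued right away, nothing would guarantee that the first event is recorded before the process ends, which is the risk the comment describes.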