[jira] [Updated] (TEZ-1621) Should report error to AM before shuting down TezChild

2014-09-26 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1621:

Summary: Should report error to AM before shuting down TezChild  (was: 
Actual error message not thrown on console, does appear in the YARN application 
log)

> Should report error to AM before shuting down TezChild
> --
>
> Key: TEZ-1621
> URL: https://issues.apache.org/jira/browse/TEZ-1621
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Deepesh Khandelwal
>Assignee: Jeff Zhang
> Attachments: Tez-1621.patch, app_logs.txt, console.txt
>
>
> While running an in session testorderedwordcount example the DAG failed with 
> the following error on the console:
> {noformat}
> 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: 
> [Vertex failed, vertexName=initialmap, 
> vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, 
> taskId=task_1411586515507_0110_1_00_00, diagnostics=[TaskAttempt 0 
> failed, info=[Container container_1411586515507_0110_01_02 finished with 
> diagnostics set to [Container failed. Exception from container-launch.
> Container id: container_1411586515507_0110_01_02
> Exit code: 255
> Stack trace: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This wasn't very helpful, the root cause is in the application log:
> {noformat}
> 2014-09-25 01:55:41,246 ERROR [TezChild] 
> org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting 
> now
> java.lang.UnsatisfiedLinkError: 
> org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V
> at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native 
> Method)
> at 
> org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57)
> at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539)
> at 
> org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
> at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
> at 
> org.apache.hadoop.mapreduce.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.nextKeyValue(TezGroupedSplitsInputFormat.java:167)
> at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116)
> at 
> org.apache.tez.mapreduce.processor.map.MapProcessor$NewRecordReader.nextKeyValue(MapProcessor.java:266)
> at 
> org.apache.tez.mapreduce.hadoop.mapreduce.MapContextImpl.nextKeyValue(MapContextImpl.java:81)
> at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at 
> org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:237)
> at 
> org.apache.te

[jira] [Resolved] (TEZ-1555) TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty failing on Windows

2014-09-26 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha resolved TEZ-1555.
-
   Resolution: Fixed
Fix Version/s: 0.6.0
 Hadoop Flags: Reviewed

Thanks for the fix. Committed
commit a56f9ef4e81002f64d961a195bd68fd5c10e2bee
Author: Bikas Saha 
Date:   Fri Sep 26 11:00:15 2014 -0700

TEZ-1555. TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty 
failing on Windows (Prakash Ramachandran vi



> TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty failing on 
> Windows
> 
>
> Key: TEZ-1555
> URL: https://issues.apache.org/jira/browse/TEZ-1555
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Hitesh Shah
>Assignee: Prakash Ramachandran
> Fix For: 0.6.0
>
> Attachments: tez-1555.1.patch
>
>
> Error Message
> Wrong FS: 
> file://D:/w/tez/tez-api/target/org.apache.tez.client.TestTezClientUtils-tmpDir/emptyDir,
>  expected: file:///
> Stacktrace
> java.lang.IllegalArgumentException: Wrong FS: 
> file://D:/w/tez/tez-api/target/org.apache.tez.client.TestTezClientUtils-tmpDir/emptyDir,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:465)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:119)
>   at 
> org.apache.tez.client.TezClientUtils.getLRFileStatus(TezClientUtils.java:132)
>   at 
> org.apache.tez.client.TezClientUtils.setupTezJarsLocalResources(TezClientUtils.java:198)
>   at 
> org.apache.tez.client.TestTezClientUtils.validateSetTezJarLocalResourcesDefinedButEmpty(TestTezClientUtils.java:77)
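
One possible reading of the "Wrong FS" message above, offered as an assumption and not taken from the attached patch: in a URI spelled file://D:/..., the text between "//" and the next "/" is parsed as the authority, so the drive letter lands in the authority slot and not in the path. The sketch below uses only java.net.URI to contrast the two spellings. The class name WrongFsDemo and the shortened paths are made up for illustration.

```java
import java.net.URI;

// Hypothetical demo class, not part of Tez or Hadoop.
class WrongFsDemo {
    // Summarise how java.net.URI splits a file URI into authority and path.
    static String describe(String uri) {
        URI u = URI.create(uri);
        return "authority=" + u.getAuthority() + " path=" + u.getPath();
    }

    public static void main(String[] args) {
        // Two slashes: "D:" is read as the authority, the path starts at /w.
        System.out.println(describe("file://D:/w/tez/emptyDir"));
        // Three slashes: empty authority, the drive letter stays in the path.
        System.out.println(describe("file:///D:/w/tez/emptyDir"));
    }
}
```

A file system whose own URI is file:/// has no authority, so a path carrying the authority "D:" would plausibly be rejected by a check that compares the two. Whether that is what the committed fix addresses is not stated in this thread.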



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149738#comment-14149738
 ] 

Bikas Saha commented on TEZ-1609:
-

lgtm. Do the fetchers also log the host name from which they are fetching?

> Add hostname to logIdentifiers of fetchers for easy debugging
> -
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Rajesh Balamohan
>Assignee: Gopal V
> Attachments: TEZ-1609.1.patch
>
>






[jira] [Commented] (TEZ-1624) Flaky tests in TestContainerReuse

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149754#comment-14149754
 ] 

Bikas Saha commented on TEZ-1624:
-

lgtm. Good catch.

> Flaky tests in TestContainerReuse
> -
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
>
>
> Couple of TestContainerReuse tests are failing due to minor race condition in 
> DelayedContainerManager thread.  
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
> Mock for TaskAttempt, hashCode: 290467934,
> ,
> Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
> NodeHttpAddress: host1:0, Resource: , Priority: 1, 
> Token: null, ]
> );
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:580)
> However, there were other interactions with this mock:
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:534)
> -> at 
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:570)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:571)
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
> Mock for TaskAttempt, hashCode: 392638651,
> ,
> Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
> NodeHttpAddress: host1:0, Resource: , Priority: 5, 
> Token: null, ]
> );
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333)
> However, there were other interactions with this mock:
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:292)
> -> at 
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:323)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:324)
> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333)
> org.mockito.exceptions.verification.WantedButNotInvoked:
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
> Mock for TaskAttempt, hashCode: 1830222901,
> ,
> Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
> NodeHttpAddress: host1:0, Resource: , Priority: 3, 
> Token: null, ]
> );
> -> at 
> org.apache.tez.dag.app
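
The "Wanted but not invoked" failures quoted above are typical of a test that checks for a callback before a background thread has had a chance to make it. The sketch below illustrates that pattern and one common remedy, waiting with a bounded timeout before asserting. It deliberately avoids Mockito so it runs with the JDK alone. The names AsyncVerifyDemo, Recorder and allocateAndWait are hypothetical, and nothing here is taken from TEZ-1624.1.patch or TEZ-1624.2.patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo, not Tez code.
class AsyncVerifyDemo {
    // Stand-in for a mocked event handler whose method is called
    // from a background thread.
    static final class Recorder {
        final AtomicInteger calls = new AtomicInteger();
        final CountDownLatch called = new CountDownLatch(1);

        void taskAllocated() {
            calls.incrementAndGet();
            called.countDown();
        }
    }

    // Starts a thread that invokes the callback after delayMs, then waits
    // up to timeoutMs for it. Returns true if the call was observed in time.
    static boolean allocateAndWait(Recorder r, long delayMs, long timeoutMs) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            r.taskAllocated();
        });
        t.setDaemon(true);
        t.start();
        try {
            return r.called.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Recorder r = new Recorder();
        // Checking r.calls right after starting the thread would race with it;
        // waiting on the latch removes the dependence on thread timing.
        boolean seen = allocateAndWait(r, 50, 5000);
        System.out.println("callback observed: " + seen);
    }
}
```

In a Mockito-based test the same idea is usually expressed with verify(mock, timeout(millis)), which retries the verification until the timeout expires.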

[jira] [Updated] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging

2014-09-26 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated TEZ-1609:
-
Assignee: Rajesh Balamohan  (was: Gopal V)

> Add hostname to logIdentifiers of fetchers for easy debugging
> -
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-1609.1.patch
>
>






[jira] [Commented] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging

2014-09-26 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149779#comment-14149779
 ] 

Gopal V commented on TEZ-1609:
--

Yes, they do print the URL and speed.

> Add hostname to logIdentifiers of fetchers for easy debugging
> -
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-1609.1.patch
>
>






[jira] [Commented] (TEZ-1621) Should report error to AM before shuting down TezChild

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149794#comment-14149794
 ] 

Bikas Saha commented on TEZ-1621:
-

This will break when we stop sending the failure notification to the AM inline 
(thus blocking the System.exit() until the AM has been notified). IMO we should 
remove the System.exit() from these places and move them up to TezChild such 
that TezChild can observe the Exception/Error and determine if it needs to exit 
or not. If it needs to exit it can make sure all pending notifications are 
complete and the AM gets a proper error/diagnostic before exiting. The 
current exit()s sprayed across the code make graceful cleanup hard to do and 
are probably the cause of this jira. If doing the global System.exit() is 
difficult in this jira, then we should at least remove the current 
System.exit()s and open a follow-up jira to handle Error and exit in one place. 
That will remove the need to special case local mode everywhere in this jira. 
Ideally, all of Tez code should be using a common util to handle shutdown which 
exits in non-local mode and does not exit in local mode.
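
A minimal sketch of the kind of common shutdown util suggested above, under stated assumptions: the class name ShutdownHelper, its methods and the localMode flag are all hypothetical and do not correspond to any existing Tez API. The idea shown is only that pending notifications (for example, the failure report to the AM) are flushed first, and that the process exits in non-local mode but not in local mode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tez: a single place that decides
// whether the JVM should exit.
class ShutdownHelper {
    private final boolean localMode;
    private final List<Runnable> pendingNotifications = new ArrayList<>();

    ShutdownHelper(boolean localMode) {
        this.localMode = localMode;
    }

    // Register work that must complete before the process may exit,
    // such as reporting an error to the AM.
    synchronized void addPendingNotification(Runnable notification) {
        pendingNotifications.add(notification);
    }

    // Run all pending notifications, then exit unless in local mode.
    // Returns true if exit(...) was invoked, false in local mode.
    synchronized boolean shutdown(int exitCode) {
        for (Runnable notification : pendingNotifications) {
            try {
                notification.run();
            } catch (RuntimeException e) {
                // A failed notification should not prevent shutdown.
                System.err.println("Notification failed: " + e);
            }
        }
        pendingNotifications.clear();
        if (localMode) {
            return false;
        }
        exit(exitCode);
        return true;
    }

    // Kept separate so tests can override it without killing the JVM.
    void exit(int exitCode) {
        System.exit(exitCode);
    }

    public static void main(String[] args) {
        ShutdownHelper helper = new ShutdownHelper(true); // local mode
        helper.addPendingNotification(
            () -> System.out.println("reporting error to AM"));
        boolean exited = helper.shutdown(1);
        System.out.println("exited: " + exited);
    }
}
```

With something along these lines, callers would hand the Exception/Error to one component and never call System.exit() themselves, which also removes the need to special-case local mode at each call site.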


[jira] [Comment Edited] (TEZ-1621) Should report error to AM before shuting down TezChild

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149794#comment-14149794
 ] 

Bikas Saha edited comment on TEZ-1621 at 9/26/14 6:51 PM:
--

This will break when we stop sending the failure notification to the AM inline 
(thus blocking the System.exit() until the AM has been notified). IMO we should 
remove the System.exit() from these places and move them up to TezChild such 
that TezChild can observe the Exception/Error and determine if it needs to exit 
or not. If it needs to exit it can make sure all pending notifications are 
complete and the AM gets a proper error/diagnostic before exiting. The 
current exit()s sprayed across the code make graceful cleanup hard to do and 
are probably the cause of this jira. If doing the global System.exit() is 
difficult in this jira, then we should at least remove the current 
System.exit()s and open a follow-up jira to handle Error and exit in one place. 
That will remove the need to special case local mode everywhere in this jira. 
Ideally, all of Tez code should be using a common util to handle shutdown which 
exits in non-local mode and does not exit in local mode.
The change to report the error looks good. The above comments are about the 
existing System.exit()s. Let me know what you think.


was (Author: bikassaha):
This will break when we stop sending the failure notification to the AM inline 
(thus blocking the System.exit() until the AM has been notified). IMO we should 
remove the System.exit() from these places and move them up to TezChild such 
that TezChild can observe the Exception/Error and determine if it needs to exit 
or not. If it needs to exit it can make sure all pending notifications are 
complete and the AM gets a proper error/diagnostic before exiting. These 
current exit()s sprayed across the code make graceful cleanup hard to do. And 
are probably the cause of this jira. If doing the global system.exit() is 
difficult in this jira then we should at least remove the current 
system.exit()s and open a follow up jira to handle Error and exit in one place. 
That will remove the need to special case local mode everywhere in this jira. 
Ideally, all of Tez code should be using a common util to handle shutdown which 
exits in non-local mode and does not exit in local mode.


[jira] [Created] (TEZ-1626) Tez : Generate per dag logs

2014-09-26 Thread Mostafa Mokhtar (JIRA)

Mostafa Mokhtar created TEZ-1626:


 Summary: Tez : Generate per dag logs
 Key: TEZ-1626
 URL: https://issues.apache.org/jira/browse/TEZ-1626
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Mostafa Mokhtar
 Fix For: 0.6.0


When a user submits multiple Hive queries using the same connection 
(ApplicationId), logs are not generated for the DAGs that have completed; as a 
result, the user needs to wait until all queries complete.

This behavior makes it very difficult to isolate failures per query, as a single 
log file will have results from multiple queries.





[jira] [Commented] (TEZ-1612) ShuffleVertexManager's EdgeManager should not hard code source num tasks

2014-09-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149962#comment-14149962
 ] 

Daniel Dai commented on TEZ-1612:
-

Shall we port the patch to 0.5 branch? Pig 0.14 will use tez 0.5.

> ShuffleVertexManager's EdgeManager should not hard code source num tasks
> 
>
> Key: TEZ-1612
> URL: https://issues.apache.org/jira/browse/TEZ-1612
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Daniel Dai
>Assignee: Bikas Saha
> Fix For: 0.6.0
>
> Attachments: DAG1.png, TEZ-1612.1.patch, runwithmaster.tar.gz, 
> syslog_dag_1411413615885_0001_1, testfail1.log.tar.gz
>
>
> Several Pig unit tests hang intermittently. For example, 
> TestNewPlanImplicitSplit.testImplicitSplitInCoGroup, which is a DAG of 4 
> nodes:
> !DAG1.png!
> It uses auto-parallelism, vertex 106 change parallelism from 2->1, and vertex 
> 107 from 21->1.
> Log attached.





[jira] [Commented] (TEZ-1626) Tez : Generate per dag logs

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149969#comment-14149969
 ] 

Bikas Saha commented on TEZ-1626:
-

Separate logs are generated at the AM per DAG and named syslog_dag_xyz.
These separate per dag logs are available from the YARN web UI for download 
from the logs link on the AM page.


> Tez : Generate per dag logs
> ---
>
> Key: TEZ-1626
> URL: https://issues.apache.org/jira/browse/TEZ-1626
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Mostafa Mokhtar
>  Labels: tez
> Fix For: 0.6.0
>
>
> When a user submits multiple Hive queries using the same connection 
> (ApplicationId) the logs are not generated for the Dags that completed, as a 
> result the user needs to wait till all queries complete.
> This behavior makes it very difficult to isolate failures per query as a 
> single log file will have results from multiple queries.





[jira] [Updated] (TEZ-1626) Tez : Generate per DAG logs

2014-09-26 Thread Mostafa Mokhtar (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated TEZ-1626:
-
Summary: Tez : Generate per DAG logs  (was: Tez : Generate per dag logs)

> Tez : Generate per DAG logs
> ---
>
> Key: TEZ-1626
> URL: https://issues.apache.org/jira/browse/TEZ-1626
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Mostafa Mokhtar
>  Labels: tez
> Fix For: 0.6.0
>
>
> When a user submits multiple Hive queries using the same connection 
> (ApplicationId) the logs are not generated for the Dags that completed, as a 
> result the user needs to wait till all queries complete.
> This behavior makes it very difficult to isolate failures per query as a 
> single log file will have results from multiple queries.





[jira] [Commented] (TEZ-1612) ShuffleVertexManager's EdgeManager should not hard code source num tasks

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149975#comment-14149975
 ] 

Bikas Saha commented on TEZ-1612:
-

yes. This will go into 0.5.1






[jira] [Commented] (TEZ-1626) Tez : Generate per DAG logs

2014-09-26 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150023#comment-14150023
 ] 

Mostafa Mokhtar commented on TEZ-1626:
--

How can the user extract them programmatically?






[jira] [Commented] (TEZ-1626) Tez : Generate per DAG logs

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150051#comment-14150051
 ] 

Bikas Saha commented on TEZ-1626:
-

Ah. That would be a YARN jira. Accessing logs from YARN apps while running or 
after completion is YARN domain.






[jira] [Updated] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in

2014-09-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1624:
--
Summary: Flaky tests in TestContainerReuse due to race condition in   (was: 
Flaky tests in TestContainerReuse)

> Flaky tests in TestContainerReuse due to race condition in 
> ---
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
>
>
> Couple of TestContainerReuse tests are failing due to minor race condition in 
> DelayedContainerManager thread.  
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
> Mock for TaskAttempt, hashCode: 290467934,
> ,
> Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
> NodeHttpAddress: host1:0, Resource: , Priority: 1, 
> Token: null, ]
> );
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:580)
> However, there were other interactions with this mock:
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:531)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:532)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:534)
> -> at 
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:570)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testReuseWithTaskSpecificLaunchCmdOption(TestContainerReuse.java:571)
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
> Mock for TaskAttempt, hashCode: 392638651,
> ,
> Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
> NodeHttpAddress: host1:0, Resource: , Priority: 5, 
> Token: null, ]
> );
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333)
> However, there were other interactions with this mock:
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:289)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:290)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:292)
> -> at 
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$SetApplicationRegistrationDataCallable.call(TaskSchedulerAppCallbackWrapper.java:244)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:323)
> -> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:324)
> at 
> org.apache.tez.dag.app.rm.TestContainerReuse.testDelayedReuseContainerNotAvailable(TestContainerReuse.java:333)
> org.mockito.exceptions.verification.WantedButNotInvoked:
> Wanted but not invoked:
> taskSchedulerEventHandlerForTest.taskAllocated(
> Mock for TaskAttempt, hashCode: 1830222901,
> ,
> Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
> Nod
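The "Wanted but not invoked" failures above are typical of a test that verifies a callback before the background thread has delivered it. A minimal sketch of one way to avoid that kind of race is shown here, assuming the callback arrives from another thread after some delay. All names (AwaitCallbackSketch, allocateLater, callbackObserved) are hypothetical and are not Tez or Mockito classes, and this is not necessarily what the attached patches do.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: illustrative names only, not Tez or Mockito classes.
class AwaitCallbackSketch {

    /** Stand-in for a scheduler that delivers its callback from a background thread. */
    static Thread allocateLater(Runnable callback, long delayMillis) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            callback.run();
        }, "background-allocator");
        t.setDaemon(true); // do not keep the JVM alive for a late callback
        t.start();
        return t;
    }

    /** Returns true only if exactly one callback was observed within the timeout. */
    static boolean callbackObserved(long delayMillis, long timeoutMillis) {
        CountDownLatch invoked = new CountDownLatch(1);
        AtomicInteger calls = new AtomicInteger();
        allocateLater(() -> {
            calls.incrementAndGet();
            invoked.countDown();
        }, delayMillis);
        // Reading calls.get() right here would race with the background thread.
        try {
            boolean done = invoked.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return done && calls.get() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(callbackObserved(50, 5000));
    }
}
```

The test waits on the latch with a bounded timeout instead of asserting immediately, so it passes as soon as the callback arrives and fails only if it never does. Mockito's `timeout(...)` verification mode targets the same situation when the callback target is a mock.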

[jira] [Updated] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread

2014-09-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1624:
--
Summary: Flaky tests in TestContainerReuse due to race condition in 
DelayedContainerManager thread  (was: Flaky tests in TestContainerReuse due to 
race condition in )

> Flaky tests in TestContainerReuse due to race condition in 
> DelayedContainerManager thread
> -
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
>
>

[jira] [Resolved] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread

2014-09-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan resolved TEZ-1624.
---
  Resolution: Fixed
Hadoop Flags: Reviewed

Thanks [~bikassaha].  Committed to master and branch-0.5

> Flaky tests in TestContainerReuse due to race condition in 
> DelayedContainerManager thread
> -
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
>
>

[jira] [Updated] (TEZ-1624) Flaky tests in TestContainerReuse due to race condition in DelayedContainerManager thread

2014-09-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1624:
--
Fix Version/s: 0.5.1

> Flaky tests in TestContainerReuse due to race condition in 
> DelayedContainerManager thread
> -
>
> Key: TEZ-1624
> URL: https://issues.apache.org/jira/browse/TEZ-1624
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 0.5.1
>
> Attachments: TEZ-1624.1.patch, TEZ-1624.2.patch
>
>

[jira] [Resolved] (TEZ-1609) Add hostname to logIdentifiers of fetchers for easy debugging

2014-09-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan resolved TEZ-1609.
---
  Resolution: Fixed
   Fix Version/s: 0.6.0
Target Version/s: 0.6.0
Hadoop Flags: Reviewed

Thanks [~bikassaha] and [~gopalv].  Committed the changes to master.

> Add hostname to logIdentifiers of fetchers for easy debugging
> -
>
> Key: TEZ-1609
> URL: https://issues.apache.org/jira/browse/TEZ-1609
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Fix For: 0.6.0
>
> Attachments: TEZ-1609.1.patch
>
>






[jira] [Commented] (TEZ-1621) Should report error to AM before shuting down TezChild

2014-09-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150320#comment-14150320
 ] 

Jeff Zhang commented on TEZ-1621:
-

bq. This will break when we stop sending the failure notification to the AM 
inline (thus blocking the System.exit() until the AM has been notified).
What does this mean? sendFailure should not block System.exit(); it will 
either succeed (heartbeat success) or throw an exception (heartbeat failure, 
which would also cause TezChild to shut down).


But I agree on putting the System.exit() call into TezChild, which would make 
the code cleaner and easier to maintain. 

> Should report error to AM before shuting down TezChild
> --
>
> Key: TEZ-1621
> URL: https://issues.apache.org/jira/browse/TEZ-1621
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Deepesh Khandelwal
>Assignee: Jeff Zhang
> Attachments: Tez-1621.patch, app_logs.txt, console.txt
>
>
> While running an in session testorderedwordcount example the DAG failed with 
> the following error on the console:
> {noformat}
> 14/09/25 01:55:53 INFO examples.TestOrderedWordCount: DAG 1 diagnostics: 
> [Vertex failed, vertexName=initialmap, 
> vertexId=vertex_1411586515507_0110_1_00, diagnostics=[Task failed, 
> taskId=task_1411586515507_0110_1_00_00, diagnostics=[TaskAttempt 0 
> failed, info=[Container container_1411586515507_0110_01_02 finished with 
> diagnostics set to [Container failed. Exception from container-launch.
> Container id: container_1411586515507_0110_01_02
> Exit code: 255
> Stack trace: ExitCodeException exitCode=255:
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:290)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This wasn't very helpful, the root cause is in the application log:
> {noformat}
> 2014-09-25 01:55:41,246 ERROR [TezChild] 
> org.apache.tez.runtime.task.TezTaskRunner: Exception of type Error. Exiting 
> now
> java.lang.UnsatisfiedLinkError: 
> org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(IILjava/nio/ByteBuffer;ILjava/nio/ByteBuffer;IILjava/lang/String;J)V
> at org.apache.hadoop.util.NativeCrc32.nativeVerifyChunkedSums(Native 
> Method)
> at 
> org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:57)
> at 
> org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:291)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.fillBuffer(BlockReaderLocal.java:344)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.fillDataBuf(BlockReaderLocal.java:444)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.readWithBounceBuffer(BlockReaderLocal.java:575)
> at 
> org.apache.hadoop.hdfs.BlockReaderLocal.read(BlockReaderLocal.java:539)
> at 
> org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
> at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
> at 
> org.apache.hadoop.mapreduce.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.nextKeyValue(TezGroupedSplitsInputFormat.java:167)
> at 
> org.apache.tez.mapreduce.lib.MRReaderMapReduce.next(MRReaderMapReduce.java:116)
> at 
> org.apache.tez.mapreduce.processor.map.MapProcessor$NewRecordReader.nextKeyValue(MapProcessor.java:266)
> at 
> org.apache.tez.mapreduce.hadoop.mapreduce.MapContextImpl.

[jira] [Commented] (TEZ-1621) Should report error to AM before shuting down TezChild

2014-09-26 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150325#comment-14150325
 ] 

Bikas Saha commented on TEZ-1621:
-

I mean this works now because the error reporting (sendFailure) is 
synchronous: the error is reported all the way to the AM before System.exit() 
is called. If that were not the case, System.exit() might bring down the JVM 
before the error is reported to the AM.
{code}
+sendFailure(failureCause, "Fatal error cause TezChild exit.");
+if (isLocal) {
+  throw new TezException("Fatal error cause TezChild exit.", failureCause);
+} else {
+  ExitUtil.terminate(-1, failureCause);
+}
{code}
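
The ordering concern above can be illustrated with a small standalone sketch: the failure is reported through a blocking call, and only then is the process terminated. The interfaces FailureReporter and Terminator and the class ReportThenExitSketch are hypothetical stand-ins for illustration; they are not Tez or Hadoop APIs, and the real patch may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are Tez or Hadoop classes.
class ReportThenExitSketch {

    /** Stand-in for the channel to the AM; assumed to block until delivery or throw. */
    interface FailureReporter {
        void sendFailure(Throwable cause, String diagnostics) throws Exception;
    }

    /** Stand-in for System.exit()/ExitUtil.terminate(), so the sketch stays testable. */
    interface Terminator {
        void terminate(int status);
    }

    /** Reports the failure synchronously, then terminates in every case. */
    static void handleFatalError(Throwable cause, FailureReporter reporter, Terminator terminator) {
        try {
            // Returns only after the report has been delivered, or throws.
            reporter.sendFailure(cause, "Fatal error cause TezChild exit.");
        } catch (Exception e) {
            // Reporting failed; the process still has to go down.
            System.err.println("Could not report failure: " + e);
        } finally {
            terminator.terminate(-1);
        }
    }

    /** Runs the handler with recording stand-ins and returns the observed event order. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        handleFatalError(new UnsatisfiedLinkError("simulated"),
            (cause, diagnostics) -> events.add("reported: " + diagnostics),
            status -> events.add("terminated: " + status));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If sendFailure were asynchronous, the terminate step could run before the report left the process, which is the breakage described in the comment. Keeping the exit call next to the reporting call makes that dependency visible in one place.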

> Should report error to AM before shuting down TezChild
> --
>
> Key: TEZ-1621
> URL: https://issues.apache.org/jira/browse/TEZ-1621
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Deepesh Khandelwal
>Assignee: Jeff Zhang
> Attachments: Tez-1621.patch, app_logs.txt, console.txt
>
>