[jira] [Commented] (TEZ-3077) TezClient.waitTillReady should support timeout

2016-03-31 Thread Kuhu Shukla (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221245#comment-15221245
 ] 

Kuhu Shukla commented on TEZ-3077:
--

[~sseth], [~hitesh], Request for comments/review. Thanks a lot!

> TezClient.waitTillReady should support timeout
> --
>
> Key: TEZ-3077
> URL: https://issues.apache.org/jira/browse/TEZ-3077
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Kuhu Shukla
> Attachments: TEZ-3077.001.patch, TEZ-3077.002.patch
>
>
> Also preWarm.
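
A timeout-aware wait of the kind requested here could follow the usual deadline-polling pattern. The sketch below is illustrative only: TimedWait, waitUntil and the pollMillis parameter are hypothetical names, not part of the TezClient API, and the readiness check is a stand-in for whatever status call the client actually makes.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of the Tez API. */
final class TimedWait {
  private TimedWait() {}

  /**
   * Polls isReady until it returns true or the timeout elapses.
   * Returns true if the condition was met, false on timeout or interruption.
   */
  static boolean waitUntil(BooleanSupplier isReady, long timeout, TimeUnit unit,
                           long pollMillis) {
    final long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!isReady.getAsBoolean()) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // timed out; the caller decides what to do next
      }
      // Sleep at least 1 ms, but never (much) past the deadline.
      long sleepMillis = Math.max(1L,
          Math.min(pollMillis, TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

Returning a boolean rather than throwing on timeout is just one possible design; it leaves the decision about stopping the session to the caller.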



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3193) Deadlock in AM during task commit request

2016-03-31 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221046#comment-15221046
 ] 

Bikas Saha commented on TEZ-3193:
-

This is probably a leftover from the removal of such reverse calls. There were 
more of them and some were removed by making sure that such objects/members are 
available locally to the TaskAttemptImpl (from the Task passed in via the 
constructor) instead of calling back into the task to get these objects/members. 
Hence, task location hint and taskSpec could be passed in via the constructor 
and referenced locally.
Doing this helps other future scenarios as well. If the TA location hint is 
passed in via a constructor then it could be made different for each attempt. 
E.g. remove the machine for v.1 from the location hint of v.2 for a speculative 
execution so that the speculated attempt does not end up on the same machine. 
There is a jira open for this.
Similarly, change the spec of v.1 to have higher memory than the default for 
that vertex because v.0 died with OOM.
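
The constructor-passing idea above can be sketched roughly as follows. AttemptSketch and TaskSketch are hypothetical stand-ins, not the real TaskAttemptImpl and TaskImpl, and the host and memory fields are simplified assumptions. The point is only that each attempt owns an immutable copy of its hint and spec, so it never has to call back into the task (and take the task's lock) to read them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a task attempt; not the real TaskAttemptImpl. */
final class AttemptSketch {
  private final int attemptNumber;
  private final List<String> locationHint;
  private final int memoryMb;

  AttemptSketch(int attemptNumber, List<String> locationHint, int memoryMb) {
    this.attemptNumber = attemptNumber;
    // Defensive copy: the attempt owns its hint and never asks the task for it.
    this.locationHint = Collections.unmodifiableList(new ArrayList<>(locationHint));
    this.memoryMb = memoryMb;
  }

  int getAttemptNumber() { return attemptNumber; }
  List<String> getLocationHint() { return locationHint; }
  int getMemoryMb() { return memoryMb; }
}

/** Hypothetical stand-in for a task; not the real TaskImpl. */
final class TaskSketch {
  private final List<String> defaultHosts;
  private final int defaultMemoryMb;

  TaskSketch(List<String> defaultHosts, int defaultMemoryMb) {
    this.defaultHosts = new ArrayList<>(defaultHosts);
    this.defaultMemoryMb = defaultMemoryMb;
  }

  /** First attempt: uses the defaults unchanged. */
  AttemptSketch firstAttempt() {
    return new AttemptSketch(0, defaultHosts, defaultMemoryMb);
  }

  /** Speculative attempt: drop the host the running attempt already occupies. */
  AttemptSketch speculativeAttempt(int attemptNumber, String hostOfRunningAttempt) {
    List<String> hosts = new ArrayList<>(defaultHosts);
    hosts.remove(hostOfRunningAttempt);
    return new AttemptSketch(attemptNumber, hosts, defaultMemoryMb);
  }

  /** Retry after an out-of-memory failure: same hosts, more memory. */
  AttemptSketch retryAfterOom(int attemptNumber, int extraMemoryMb) {
    return new AttemptSketch(attemptNumber, defaultHosts, defaultMemoryMb + extraMemoryMb);
  }
}
```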

> Deadlock in AM during task commit request
> -
>
> Key: TEZ-3193
> URL: https://issues.apache.org/jira/browse/TEZ-3193
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1, 0.8.2
>Reporter: Jason Lowe
>Priority: Blocker
>
> The AM can deadlock between TaskImpl and TaskAttemptImpl.  Stacktrace and 
> details in a followup comment.





[jira] [Commented] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing

2016-03-31 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220953#comment-15220953
 ] 

TezQA commented on TEZ-3192:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12796422/TEZ-3192.1.patch
  against master revision e416991.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.test.TestRecovery

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1601//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1601//console

This message is automatically generated.

> IFile#checkState creating unnecessary objects though auto-boxing
> 
>
> Key: TEZ-3192
> URL: https://issues.apache.org/jira/browse/TEZ-3192
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Fix For: 0.7.1, 0.8.3
>
> Attachments: TEZ-3192.1.patch
>
>
> checkState is a varargs function which takes Objects. ints and longs create 
> unnecessary Integers and Long objects through Integer.valueOf and 
> Long.valueOf. This is used in the read key and read value loop so while 
> small, puts this on par with the MR equivalent.
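
The allocation cost described above can be illustrated with a small sketch. CheckStateSketch and both method names are hypothetical and are not the code in the attached patch. The varargs form boxes each int/long and allocates an Object[] on every call, even when the check passes; a primitive overload allocates nothing on the success path.

```java
/** Hypothetical illustration of varargs boxing; not the actual IFile code. */
final class CheckStateSketch {
  private CheckStateSketch() {}

  // Varargs form: callers passing int/long values box them (Integer.valueOf,
  // Long.valueOf) and allocate an Object[] per call, even when ok is true.
  static void checkStateVarargs(boolean ok, String template, Object... args) {
    if (!ok) {
      throw new IllegalStateException(String.format(template, args));
    }
  }

  // Primitive overload: boxing only happens on the failure path, inside
  // String.format, so the hot loop allocates nothing when the check passes.
  static void checkState(boolean ok, String template, int a, long b) {
    if (!ok) {
      throw new IllegalStateException(String.format(template, a, b));
    }
  }
}
```

In a read-key/read-value loop, a call such as checkState(len >= 0, "len=%d pos=%d", len, pos) would then stay allocation-free while the state is valid.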





Failed: TEZ-3192 PreCommit Build #1601

2016-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3192
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1601/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 4402 lines...]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :tez-tests
[INFO] Build failures were ignored.




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12796422/TEZ-3192.1.patch
  against master revision e416991.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.test.TestRecovery

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1601//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1601//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
ad39565c18aa36dd00957cbafa960ac847fb8f43 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Compressed 3.40 MB of artifacts by 11.0% relative to #1599
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing

2016-03-31 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220921#comment-15220921
 ] 

Jonathan Eagles commented on TEZ-3192:
--

Thanks [~rajesh.balamohan]. Committed this patch to master and branch-0.7.

> IFile#checkState creating unnecessary objects though auto-boxing
> 
>
> Key: TEZ-3192
> URL: https://issues.apache.org/jira/browse/TEZ-3192
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: TEZ-3192.1.patch
>
>
> checkState is a varargs function which takes Objects. ints and longs create 
> unnecessary Integers and Long objects through Integer.valueOf and 
> Long.valueOf. This is used in the read key and read value loop so while 
> small, puts this on par with the MR equivalent.





[jira] [Commented] (TEZ-3161) Allow task to report different kinds of errors - fatal / kill

2016-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220917#comment-15220917
 ] 

Hitesh Shah commented on TEZ-3161:
--

bq. In terms of alternate naming - do you have suggestions on what would be 
less confusing

Not sure - fatalError(), abortProcessing() - not sure I have good suggestions 
especially as fatalError is probably the one which should be indicating a fatal 
error instead of the current non-fatal behavior. 

bq. I'm OK marking it as private

Let's mark it so initially until we can figure out a clear use-case for 
self-kills.

bq. Any suggestion on this. Duplicate the TerminationCause to include FATAL_, 
and KILL_ for almost all the existing TerminationCauses ?

Wouldn't there be only one specific termination cause to indicate that the 
user-code told the framework to abort itself or kill itself?

bq. I thought it was being written to history. 

TaskAttemptFinished event is being written to history but the failure type bit 
is not in the data being pushed to ATS ( check TimelineHistoryEventConversion 
or the *JsonConversion ). The proto was changed but that is only used in 
Recovery. 

Tests in subsequent follow-ups should be ok.




> Allow task to report different kinds of errors - fatal / kill
> -
>
> Key: TEZ-3161
> URL: https://issues.apache.org/jira/browse/TEZ-3161
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Siddharth Seth
> Attachments: TEZ-3161.1.txt, TEZ-3161.2.txt, TEZ-3161.3.txt
>
>
> In some cases, task failures will be the same across all attempts - e.g. 
> exceeding memory utilization on an operation. In this case, there's no point 
> in running another attempt of the same task.
> There's other cases where a task may want to mark itself as KILLED - i.e. a 
> temporary error. An example of this is pipelined shuffle.
> Tez should allow both operations.
> cc [~vikram.dixit], [~rajesh.balamohan]
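
The two cases in the description can be sketched as a retry decision. Everything here is hypothetical: ErrorKind, shouldRetry and the attempt counters are illustrative names, and the assumption that a self-kill does not count against the failure limit is mine, not something stated in this thread.

```java
/** Hypothetical sketch of how reported error kinds could drive retries. */
final class TaskErrorSketch {
  private TaskErrorSketch() {}

  enum ErrorKind {
    NON_FATAL, // ordinary failure: retry until the attempt limit is reached
    FATAL,     // would fail the same way on every attempt, e.g. memory limits
    SELF_KILL  // temporary condition: the attempt asks to be killed and rerun
  }

  /** Returns true if another attempt of the same task should be scheduled. */
  static boolean shouldRetry(ErrorKind kind, int failedAttempts, int maxFailedAttempts) {
    switch (kind) {
      case FATAL:
        return false; // no point in running another attempt
      case SELF_KILL:
        return true;  // assumption: kills do not count as failures
      default:
        return failedAttempts < maxFailedAttempts;
    }
  }
}
```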





[jira] [Commented] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing

2016-03-31 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220880#comment-15220880
 ] 

Rajesh Balamohan commented on TEZ-3192:
---

+1. lgtm. Thanks [~jeagles]

> IFile#checkState creating unnecessary objects though auto-boxing
> 
>
> Key: TEZ-3192
> URL: https://issues.apache.org/jira/browse/TEZ-3192
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: TEZ-3192.1.patch
>
>
> checkState is a varargs function which takes Objects. ints and longs create 
> unnecessary Integers and Long objects through Integer.valueOf and 
> Long.valueOf. This is used in the read key and read value loop so while 
> small, puts this on par with the MR equivalent.





[jira] [Commented] (TEZ-3193) Deadlock in AM during task commit request

2016-03-31 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220835#comment-15220835
 ] 

Jason Lowe commented on TEZ-3193:
-

Here are the relevant portions of the AM stacktrace when it deadlocks:
{noformat}
"TaskSchedulerAppCaller #0" #106 daemon prio=5 os_prio=0 tid=0x7fb1cc1bb800 nid=0x4619 waiting on condition [0x7fb1b6509000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xc5d8e398> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.getState(TaskAttemptImpl.java:630)
    at org.apache.tez.dag.app.dag.impl.TaskImpl.selectBestAttempt(TaskImpl.java:745)
    at org.apache.tez.dag.app.dag.impl.TaskImpl.getProgress(TaskImpl.java:483)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.computeProgress(VertexImpl.java:1285)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.getProgress(VertexImpl.java:1195)
    at org.apache.tez.dag.app.dag.impl.DAGImpl.getProgress(DAGImpl.java:829)
    at org.apache.tez.dag.app.DAGAppMaster.getProgress(DAGAppMaster.java:1181)
    at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.getProgress(TaskSchedulerEventHandler.java:560)
    at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$GetProgressCallable.call(TaskSchedulerAppCallbackWrapper.java:291)
    at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper$GetProgressCallable.call(TaskSchedulerAppCallbackWrapper.java:282)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 4 on 52743" #64 daemon prio=5 os_prio=0 tid=0x7fb1c454c800 nid=0x45ca waiting on condition [0x7fb1b920e000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xc1421810> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    at org.apache.tez.dag.app.dag.impl.TaskImpl.canCommit(TaskImpl.java:768)
    at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.canCommit(TaskAttemptListenerImpTezDag.java:274)
    at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:514)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)

"Dispatcher thread {Central}" #37 prio=5 os_prio=0 tid=0x7fb1c422f000 nid=0x45aa waiting on condition [0x7fb1ba722000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xc1421810> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
  

[jira] [Created] (TEZ-3193) Deadlock in AM during task commit request

2016-03-31 Thread Jason Lowe (JIRA)
Jason Lowe created TEZ-3193:
---

 Summary: Deadlock in AM during task commit request
 Key: TEZ-3193
 URL: https://issues.apache.org/jira/browse/TEZ-3193
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.2, 0.7.1
Reporter: Jason Lowe
Priority: Blocker


The AM can deadlock between TaskImpl and TaskAttemptImpl.  Stacktrace and 
details in a followup comment.






[jira] [Updated] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing

2016-03-31 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated TEZ-3192:
-
Attachment: TEZ-3192.1.patch

> IFile#checkState creating unnecessary objects though auto-boxing
> 
>
> Key: TEZ-3192
> URL: https://issues.apache.org/jira/browse/TEZ-3192
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: TEZ-3192.1.patch
>
>
> checkState is a varargs function which takes Objects. ints and longs create 
> unnecessary Integers and Long objects through Integer.valueOf and 
> Long.valueOf. This is used in the read key and read value loop so while 
> small, puts this on par with the MR equivalent.





[jira] [Created] (TEZ-3192) IFile#checkState creating unnecessary objects though auto-boxing

2016-03-31 Thread Jonathan Eagles (JIRA)
Jonathan Eagles created TEZ-3192:


 Summary: IFile#checkState creating unnecessary objects though 
auto-boxing
 Key: TEZ-3192
 URL: https://issues.apache.org/jira/browse/TEZ-3192
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles


checkState is a varargs function which takes Objects. ints and longs create 
unnecessary Integers and Long objects through Integer.valueOf and Long.valueOf. 
This is used in the read key and read value loop so while small, puts this on 
par with the MR equivalent.





[jira] [Commented] (TEZ-3177) Non-DAG events should use the session domain or no domain if the data does not need protection

2016-03-31 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220597#comment-15220597
 ] 

TezQA commented on TEZ-3177:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12796369/TEZ-3177.1.patch
  against master revision e416991.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestFaultTolerance
  org.apache.tez.dag.app.rm.TestContainerReuse

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1600//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1600//console

This message is automatically generated.

> Non-DAG events should use the session domain or no domain if the data does 
> not need protection 
> ---
>
> Key: TEZ-3177
> URL: https://issues.apache.org/jira/browse/TEZ-3177
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-3177.1.patch
>
>
> There have been issues noticed where when using dag specific domains, 
> container events get generated under different dags causing issues as they 
> are updated using different domains. 





Failed: TEZ-3177 PreCommit Build #1600

2016-03-31 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3177
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1600/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 4678 lines...]
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :tez-dag
[INFO] Build failures were ignored.




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12796369/TEZ-3177.1.patch
  against master revision e416991.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestFaultTolerance
  org.apache.tez.dag.app.rm.TestContainerReuse

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1600//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1600//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
6507b1af9699df5582387621c34975e408f092d5 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Compressed 3.39 MB of artifacts by 10.2% relative to #1599
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
8 tests failed.
FAILED:  
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources

Error Message:

Wanted but not invoked:
taskSchedulerManagerForTest.taskAllocated(
0,
Mock for TA attempt_0_0001_0_01_03_1,
,
Container: [ContainerId: container_1_0001_01_01, NodeId: host1:0, 
NodeHttpAddress: host1:0, Resource: , Priority: 1, 
Token: null, ]
);
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1254)

However, there were other interactions with this mock:
taskSchedulerManagerForTest.init(
Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
yarn-site.xml
);
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1143)

taskSchedulerManagerForTest.setConfig(
Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
yarn-site.xml
);
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1143)

taskSchedulerManagerForTest.serviceInit(
Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
yarn-site.xml
);
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1143)

taskSchedulerManagerForTest.start();
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144)

taskSchedulerManagerForTest.serviceStart();
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144)

taskSchedulerManagerForTest.instantiateSchedulers(
"host",
0,
"",
Mock for AppContext, hashCode: 469698423
);
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144)

taskSchedulerManagerForTest.getContainerSignatureMatcher();
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContainerReuse.java:1144)

taskSchedulerManagerForTest.getConfig();
-> at 
org.apache.tez.dag.app.rm.TestContainerReuse.testReuseConflictLocalResources(TestContaine

[jira] [Commented] (TEZ-3177) Non-DAG events should use the session domain or no domain if the data does not need protection

2016-03-31 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220406#comment-15220406
 ] 

Hitesh Shah commented on TEZ-3177:
--

[~sseth] [~rajesh.balamohan] please review

> Non-DAG events should use the session domain or no domain if the data does 
> not need protection 
> ---
>
> Key: TEZ-3177
> URL: https://issues.apache.org/jira/browse/TEZ-3177
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-3177.1.patch
>
>
> There have been issues noticed where when using dag specific domains, 
> container events get generated under different dags causing issues as they 
> are updated using different domains. 





[jira] [Updated] (TEZ-3177) Non-DAG events should use the session domain or no domain if the data does not need protection

2016-03-31 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-3177:
-
Attachment: TEZ-3177.1.patch

Made AM/Container events use sessionDomainId always. Also minor fix to use dag 
counter for domainIds as dag names are not unique. 
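
A rough sketch of the selection rule described in this update follows. DomainIdSketch, EventScope, the method names and the ID format are all hypothetical and are not taken from the patch. The sketch only shows AM and container events always mapping to the session domain, and DAG domain IDs being built from a counter rather than from the (non-unique) DAG name.

```java
/** Hypothetical sketch of domain selection; names and ID format are assumed. */
final class DomainIdSketch {
  private DomainIdSketch() {}

  enum EventScope { AM, CONTAINER, DAG }

  /** AM and container events always use the session domain. */
  static String domainFor(EventScope scope, String sessionDomainId, String dagDomainId) {
    if (scope == EventScope.DAG && dagDomainId != null) {
      return dagDomainId;
    }
    return sessionDomainId;
  }

  /** Builds a DAG domain ID from a per-session counter, since DAG names can repeat. */
  static String dagDomainId(String prefix, String applicationId, int dagCounter) {
    return prefix + applicationId + "_" + dagCounter;
  }
}
```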

> Non-DAG events should use the session domain or no domain if the data does 
> not need protection 
> ---
>
> Key: TEZ-3177
> URL: https://issues.apache.org/jira/browse/TEZ-3177
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
> Attachments: TEZ-3177.1.patch
>
>
> There have been issues noticed where when using dag specific domains, 
> container events get generated under different dags causing issues as they 
> are updated using different domains. 





[jira] [Commented] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer

2016-03-31 Thread Kurt Muehlner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220297#comment-15220297
 ] 

Kurt Muehlner commented on TEZ-3187:


[~rajesh.balamohan] task attempt logs for all those tasks are in 
task_attempts.tar.gz.

It appears to me they have all completed successfully.

> Pig on tez hang with java.io.IOException: Connection reset by peer
> --
>
> Key: TEZ-3187
> URL: https://issues.apache.org/jira/browse/TEZ-3187
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.8.2
> Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
>Reporter: Kurt Muehlner
> Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, 
> dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, 
> syslog_dag_1437886552023_169758_3.gz, task_attempts.tar.gz
>
>
> We are experiencing occasional application hangs, when testing an existing 
> Pig MapReduce script, executing on Tez.  When this occurs, we find this in 
> the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000822, 
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000824, 
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] 
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>     at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>     at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>     at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>     at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>     at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>     at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000811, 
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a 
> warning in the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] 
> |retry.RetryInvocationHandler|: A failover has occurred since the start of 
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred 
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the 
> IOException





[jira] [Updated] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer

2016-03-31 Thread Kurt Muehlner (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kurt Muehlner updated TEZ-3187:
---
Attachment: task_attempts.tar.gz

> Pig on tez hang with java.io.IOException: Connection reset by peer
> --
>
> Key: TEZ-3187
> URL: https://issues.apache.org/jira/browse/TEZ-3187
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.8.2
> Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
>Reporter: Kurt Muehlner
> Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, 
> dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, 
> syslog_dag_1437886552023_169758_3.gz, task_attempts.tar.gz
>
>
> We are experiencing occasional application hangs, when testing an existing 
> Pig MapReduce script, executing on Tez.  When this occurs, we find this in 
> the syslog for the executing dag:
> 016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000822, 
> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000824, 
> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] 
> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 
> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>     at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>     at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>     at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>     at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>     at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>     at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
> delay expired or is new. Releasing container, 
> containerId=container_e11_1437886552023_169758_01_000811, 
> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
> heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a 
> warning logged on the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] 
> |retry.RetryInvocationHandler|: A failover has occurred since the start of 
> this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred 
> (the most recent failover we see in logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the 
> IOException
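
The correlation described above (a "Connection reset by peer" entry in the AM syslog, preceded by a RetryInvocationHandler failover warning on the client node) could be checked across many runs with a small script. What follows is only a sketch, not part of the report: the function names, the 5-minute window, and the assumption that both hosts' clocks are in sync are all hypothetical, and it expects each log entry on a single unwrapped line in the format shown in the excerpts.

```python
import re
from datetime import datetime, timedelta

# Log prefix as seen in the excerpts: "<timestamp> [LEVEL] [thread] |logger|: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>[A-Z]+)\] \[(?P<thread>.*?)\] \|(?P<logger>[^|]+)\|: (?P<msg>.*)$"
)
# ipc.Server message that names the client whose connection was reset.
RESET_RE = re.compile(
    r"readAndProcess from client (?P<ip>[\d.]+) threw exception "
    r"\[java\.io\.IOException: Connection reset by peer\]"
)

def parse(lines):
    """Yield (timestamp, logger, message) for lines matching the log prefix."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group("logger"), m.group("msg")

def resets(am_lines):
    """Return (timestamp, client_ip) for each connection reset in the AM syslog."""
    found = []
    for ts, logger, msg in parse(am_lines):
        hit = RESET_RE.search(msg)
        if logger == "ipc.Server" and hit:
            found.append((ts, hit.group("ip")))
    return found

def failover_warnings(node_lines):
    """Return timestamps of RetryInvocationHandler failover warnings in a node log."""
    return [ts for ts, logger, msg in parse(node_lines)
            if logger == "retry.RetryInvocationHandler"
            and "A failover has occurred" in msg]

def correlate(am_lines, node_lines, window=timedelta(minutes=5)):
    """Pair each reset with the failover warnings logged up to `window` before it.

    The window size is an arbitrary assumption, and clock skew is ignored.
    """
    warnings = failover_warnings(node_lines)
    return [(ts, ip, [w for w in warnings if timedelta(0) <= ts - w <= window])
            for ts, ip in resets(am_lines)]

# Sample entries copied from the report, joined onto single lines.
AM_SAMPLE = [
    "2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] "
    "|ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client "
    "10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]",
]
NODE_SAMPLE = [
    "2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] "
    "|retry.RetryInvocationHandler|: A failover has occurred since the start of "
    "this method invocation attempt.",
]

RESULT = correlate(AM_SAMPLE, NODE_SAMPLE)
for reset_ts, client_ip, preceding in RESULT:
    for warn_ts in preceding:
        print(client_ip, "reset", (reset_ts - warn_ts).total_seconds(),
              "s after failover warning")
```

On the two entries quoted in this report, the warning precedes the reset by a little under three minutes, so a shorter window would also match; the right window for other runs is unknown.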





[jira] [Commented] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer

2016-03-31 Thread Kurt Muehlner (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220110#comment-15220110
 ] 

Kurt Muehlner commented on TEZ-3187:


I'll get those task attempt logs.  Meanwhile, I've deployed the Pig config 
param changes suggested by Daniel, and we do see a change in behavior.  This 
application consists of four Pig scripts which execute sequentially.  When it 
hangs, it has consistently done so in the third script.  I deployed the 
param changes only in that third script.  On the first run thereafter, the 
application hung, but in the fourth script.  That's the first time that has 
happened.  I then deployed the param changes in the fourth script, and so 
far the application hasn't hung.  I'll attach the task attempt logs soon.



