[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560251#comment-14560251
 ] 

Siddharth Seth commented on TEZ-2475:
-

Don't think it's related. The message shows up for pretty much all tasks - 
should investigate what it is, but I don't think it's causing the job to hang.

 Tez local mode hanging in big testsuite
 ---

 Key: TEZ-2475
 URL: https://issues.apache.org/jira/browse/TEZ-2475
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.1
Reporter: André Kelpe
 Attachments: 2015-05-21_15-55-20_buildLog.log.gz


 We have a big test suite for Lingual, our SQL layer for Cascading. We are 
 trying very hard to make it work correctly on Tez, but I am stuck:
 The setup is a huge suite of SQL-based tests (6000+), which are executed 
 in order in local mode. At certain moments the whole process just 
 stops. Nothing gets executed any longer. This does not happen every time, but 
 quite often. Note that it does not happen at the same line of code; it is more 
 at random, which makes it quite complex to debug.
 What I am seeing is this kind of stack trace in the middle of the run:
 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner 
 (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 This looks like it could be related to the hang, but the hang does not 
 happen immediately afterwards; it happens some time later.
 I have gone through quite a few JIRAs and saw that there were problems with 
 locks and hanging threads before, which should be fixed, but it still happens.
 I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
 This gist contains a thread dump of a hanging build: 
 https://gist.github.com/fs111/1ee44469bf5cc31e5a52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-1883 PreCommit Build #744

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-1883
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/744/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2545 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735492/TEZ-1883.5.txt
  against master revision 9dabf94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/744//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/744//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/744//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
848d5cf082251406a0dc1162af54cf959a4d58e2 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 22 blocks and 2167622 bytes
Compression is 25.0%
Took 1.1 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Comment Edited] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatibility

2015-05-26 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560295#comment-14560295
 ] 

Rajesh Balamohan edited comment on TEZ-2490 at 5/27/15 2:43 AM:


[~sseth], [~hitesh], [~pramachandran] - Please review.  Tested with 2.2, 2.4, 
2.6. (related hadoop jira HADOOP-11243)


was (Author: rajesh.balamohan):
[~sseth], [~hitesh] - Please review.  Tested with 2.2, 2.4, 2.6. (related 
hadoop jira HADOOP-11243)

 TEZ-2450 breaks Hadoop 2.2 and 2.4 compatibility
 

 Key: TEZ-2490
 URL: https://issues.apache.org/jira/browse/TEZ-2490
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2490.1.patch








[jira] [Updated] (TEZ-1883) Change findbugs version to 3.x

2015-05-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1883:

Attachment: TEZ-1883.4.txt

Added excludes for the DAGAM Inconsistent sync warnings. [~hitesh] - please 
review. 

 Change findbugs version to 3.x 
 ---

 Key: TEZ-1883
 URL: https://issues.apache.org/jira/browse/TEZ-1883
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
 Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, 
 TEZ-1883.4.txt








[jira] [Commented] (TEZ-1954) Multiple instances of Inconsistent synchronization in org.apache.tez.dag.app.DAGAppMaster.

2015-05-26 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560154#comment-14560154
 ] 

Siddharth Seth commented on TEZ-1954:
-

Some more after findbugs 3:

Code  Warning
IS  Inconsistent synchronization of 
org.apache.tez.dag.app.DAGAppMaster.containers; locked 80% of time
IS  Inconsistent synchronization of 
org.apache.tez.dag.app.DAGAppMaster.currentRecoveryDataDir; locked 66% of time
IS  Inconsistent synchronization of 
org.apache.tez.dag.app.DAGAppMaster.execService; locked 75% of time
IS  Inconsistent synchronization of 
org.apache.tez.dag.app.DAGAppMaster.historyEventHandler; locked 91% of time
IS  Inconsistent synchronization of 
org.apache.tez.dag.app.DAGAppMaster.nodes; locked 80% of time
IS  Inconsistent synchronization of 
org.apache.tez.dag.app.DAGAppMaster.recoveryEnabled; locked 66% of time
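
For reference, a minimal standalone sketch of the pattern FindBugs flags here and one common fix. The class and field names below are invented for illustration; this is not Tez code:

```java
// Illustration of the "IS: inconsistent synchronization" warning: a field
// written under a lock but read without one. Declaring the field volatile
// (or synchronizing every access on the same lock) resolves the warning.
public class InconsistentSyncDemo {

    // Pattern FindBugs complains about: mixed guarded/unguarded access.
    static class Before {
        private boolean recoveryEnabled;
        public synchronized void enable() { recoveryEnabled = true; } // guarded write
        public boolean isEnabled() { return recoveryEnabled; }        // unguarded read
    }

    // One common fix: make the field volatile so every access is safe.
    static class After {
        private volatile boolean recoveryEnabled;
        public void enable() { recoveryEnabled = true; }
        public boolean isEnabled() { return recoveryEnabled; }
    }

    public static void main(String[] args) {
        After a = new After();
        a.enable();
        System.out.println(a.isEnabled()); // prints "true"
    }
}
```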

 Multiple instances of Inconsistent synchronization in 
 org.apache.tez.dag.app.DAGAppMaster.
 --

 Key: TEZ-1954
 URL: https://issues.apache.org/jira/browse/TEZ-1954
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah

 Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.amTokens; 
 locked 50% of time
 Inconsistent synchronization of 
 org.apache.tez.dag.app.DAGAppMaster.appMasterUgi; locked 66% of time
 Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.context; 
 locked 65% of time
 Inconsistent synchronization of 
 org.apache.tez.dag.app.DAGAppMaster.currentDAG; locked 72% of time
 Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.state; 
 locked 80% of time
 Inconsistent synchronization of 
 org.apache.tez.dag.app.DAGAppMaster.taskSchedulerEventHandler; locked 78% of 
 time
 Inconsistent synchronization of 
 org.apache.tez.dag.app.DAGAppMaster.versionMismatch; locked 83% of time
 Inconsistent synchronization of 
 org.apache.tez.dag.app.DAGAppMaster.versionMismatchDiagnostics; locked 80% of 
 time





[jira] [Commented] (TEZ-1954) Multiple instances of Inconsistent synchronization in org.apache.tez.dag.app.DAGAppMaster.

2015-05-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560185#comment-14560185
 ] 

Jeff Zhang commented on TEZ-1954:
-

I believe things will change after TEZ-1273. 



[jira] [Commented] (TEZ-2304) InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery

2015-05-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560187#comment-14560187
 ] 

Jeff Zhang commented on TEZ-2304:
-

bq. Maybe createAttempt could be changed to use the last seen attempt id 
instead?
That would also solve this issue. But I think it would be better to recover 
the task attempt even if it has not started (log a TaskAttemptFinishedEvent 
even if there is no TaskAttemptStartedEvent); otherwise we may get a wrong 
killedTaskAttemptCount, although that is not critical. And I believe recovery 
should restore the AM to the same state as the last application attempt. 

 InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
 

 Key: TEZ-2304
 URL: https://issues.apache.org/jira/browse/TEZ-2304
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jason Lowe
  Labels: Recovery
 Attachments: 168563_recovery.gz


 I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances 
 during recovery complaining about TA_SCHEDULE arriving at the START_WAIT 
 state.





[jira] [Commented] (TEZ-1883) Change findbugs version to 3.x

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560274#comment-14560274
 ] 

TezQA commented on TEZ-1883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735492/TEZ-1883.5.txt
  against master revision 9dabf94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/744//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/744//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/744//console

This message is automatically generated.

 Change findbugs version to 3.x 
 ---

 Key: TEZ-1883
 URL: https://issues.apache.org/jira/browse/TEZ-1883
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
 Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, 
 TEZ-1883.4.txt, TEZ-1883.5.txt








[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560312#comment-14560312
 ] 

Siddharth Seth commented on TEZ-2475:
-

My best guess here is a RuntimeException in the 
LocalContainerLauncher-SubTaskRunner thread while creating a TezChild instance. 
These exceptions aren't caught or logged anywhere. I'm assuming the trace and 
the logs on this jira are unrelated.

That's the last message during TezChild creation:
{code}2015-05-26 13:10:23,128 WARN  [LocalContainerLauncher-SubTaskRunner] 
token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for 
token kind tez.job{code}

After this, the LocalTaskExecutionThread doesn't show up at all - which leads 
me to believe the failure happened during TezChild construction itself. The 
previous container holding on to the thread (it's a single-thread pool) would 
have generated log messages when it tried fetching new work.

A patch to at least log exceptions when the sub-task runner is about to die 
should be simple. That should help diagnose this further.

[~fs111] - is it possible to get instructions on how to reproduce this? Also, 
could you attach a set of logs and a stack trace the next time this happens?
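
A hedged sketch of what such a patch could look like. The wrapper and the names below are assumptions for illustration, not the actual LocalContainerLauncher code; the idea is simply to catch and log any RuntimeException escaping the sub-task runnable before its thread dies:

```java
// Sketch: wrap a worker task so failures are logged instead of silently
// killing the (single-threaded) executor's current runnable.
public class LoggingRunnableDemo {

    public static Runnable logOnFailure(Runnable task, String threadName) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // In Tez this would go through the normal logger.
                System.err.println("Sub-task runner '" + threadName
                        + "' is about to die: " + e);
                throw e; // still surface the failure to the executor
            }
        };
    }

    public static void main(String[] args) {
        Runnable failing = () -> {
            throw new RuntimeException("simulated failure during TezChild construction");
        };
        try {
            logOnFailure(failing, "LocalContainerLauncher-SubTaskRunner").run();
        } catch (RuntimeException expected) {
            System.out.println("failure was logged, not swallowed");
        }
    }
}
```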



[jira] [Updated] (TEZ-2467) document tez-history-parser usage

2015-05-26 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2467:
--
Target Version/s: 0.8.0

 document tez-history-parser usage
 -

 Key: TEZ-2467
 URL: https://issues.apache.org/jira/browse/TEZ-2467
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2467.1.patch








[jira] [Commented] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.

2015-05-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560243#comment-14560243
 ] 

Jeff Zhang commented on TEZ-2488:
-

[~hitesh] Here the DAG specifies a memory request that is beyond the limit of 
the YARN scheduler's property RM_SCHEDULER_MAXIMUM_ALLOCATION_MB. This causes 
a SCHEDULING_SERVICE_ERROR, which shuts down the AM.

Ideally I think this should only fail the DAG, and the AM should be able to 
continue serving the next DAG. But it is hard to identify whether the 
SCHEDULING_SERVICE_ERROR is caused by the DAG or by something else, so I think 
shutting down the AM is reasonable here. One thing we can do is add the error 
to the diagnostics to propagate it to the client side. Any thoughts?

 Tez AM crashes if a submitted DAG is configured to use invalid resource 
 sizes. 
 ---

 Key: TEZ-2488
 URL: https://issues.apache.org/jira/browse/TEZ-2488
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Priority: Critical
 Attachments: applogs.txt


 2015-05-26 21:54:03,485 ERROR [AMRM Heartbeater thread] 
 impl.AMRMClientAsyncImpl: Exception on heartbeat
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, requested memory < 0, or requested memory > max configured, 
 requestedMemory=682, maxMemory=512
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
   at 
 org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 2015-05-26 21:54:03,495 INFO [Dispatcher thread: Central] app.DAGAppMaster: 
 Error in the TaskScheduler. Shutting down.
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, requested memory < 0, or requested memory > max configured, 
 requestedMemory=682, maxMemory=512
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
   at 
 

[jira] [Commented] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.

2015-05-26 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560265#comment-14560265
 ] 

Hitesh Shah commented on TEZ-2488:
--

The fix would be to never go to the RM with invalid data. On registering with 
the RM, the RM sends back a RegisterApplicationMasterResponse in the 
registerAppMaster() call. The response object has getMaximumResourceCapability, 
which can be used to do basic checks on the resources being requested before 
making the request.

By doing this check in, say, DAG initialization, we can fail the DAG before 
making any allocation request calls to the RM. The check, though, would need 
to be done for all the vertices (and the configured task settings). If we 
enhance the VertexManager at some point, this check will need to be done every 
time the VertexManager modifies the resources needed, throwing an error back 
to the VM in such cases. 
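
An illustrative sketch of that basic check, using plain ints in place of YARN's Resource type (the method name is invented; the numbers come from the log above):

```java
// Sketch: validate a requested memory size against the cluster maximum
// (as returned by getMaximumResourceCapability at AM registration) before
// any allocate() call is sent to the RM, so a bad DAG fails fast locally.
public class ResourceCheckDemo {

    static void validate(int requestedMemMb, int maxMemMb) {
        // Mirrors the RM-side rule: memory must be >= 0 and <= the maximum.
        if (requestedMemMb < 0 || requestedMemMb > maxMemMb) {
            throw new IllegalArgumentException("Invalid resource request: requestedMemory="
                    + requestedMemMb + ", maxMemory=" + maxMemMb);
        }
    }

    public static void main(String[] args) {
        validate(512, 512); // within the limit: accepted
        try {
            validate(682, 512); // the case from the log above: rejected locally
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```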

 


[jira] [Commented] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()

2015-05-26 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560324#comment-14560324
 ] 

Rajesh Balamohan commented on TEZ-2440:
---

Thanks [~mitdesai]. Can you please rebase the patch for the master branch? 
indexCacheList.isEmpty() might be an easier check.
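
A minimal standalone sketch of the guard being discussed (invented method name; not the actual PipelinedSorter code): flush() should skip its final-index handling when nothing was ever spilled, instead of indexing into an empty list.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: guard against ArrayIndexOutOfBoundsException when the index
// cache is empty, e.g. because the task was killed before spilling.
public class FlushGuardDemo {

    static Integer lastSpillIndexOrNull(List<Integer> indexCacheList) {
        if (indexCacheList.isEmpty()) {
            return null; // nothing spilled: skip instead of get(size() - 1)
        }
        return indexCacheList.get(indexCacheList.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println(lastSpillIndexOrNull(new ArrayList<>())); // prints "null"
        System.out.println(lastSpillIndexOrNull(List.of(1, 2, 3)));  // prints "3"
    }
}
```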

 Sorter should check for indexCacheList.size() in flush()
 

 Key: TEZ-2440
 URL: https://issues.apache.org/jira/browse/TEZ-2440
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Mit Desai
 Attachments: TEZ-2440-1.patch


 {noformat}
 015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread 
 interrupted, cleaned up stale data, sorter threads shutdown=true, 
 terminated=false
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for 
 the router thread to die
 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an 
 error while executing task: attempt_1429683757595_0875_1_07_00_0
 java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:418)
 at java.util.ArrayList.get(ArrayList.java:431)
 at 
 org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
 at 
 org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 {noformat}
 When a DAG is killed in the middle, sometimes these exceptions are thrown 
 (e.g. q_17 in TPC-DS). Even though it is completely harmless, it would be 
 better to fix it to avoid distraction when debugging.





[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560220#comment-14560220
 ] 

Jeff Zhang commented on TEZ-2475:
-

[~sseth] Is it related to TEZ-1802? I see the following messages at the end of 
the logs:
{noformat}
2015-05-26 13:10:23,128 WARN  [LocalContainerLauncher-SubTaskRunner] 
token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for 
token kind tez.job
2015-05-26 13:10:23,128 WARN  [LocalContainerLauncher-SubTaskRunner] 
token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for 
token kind tez.job
Kind: tez.job, Service: application_1432638619418_0001, Ident: 1e 61 70 70 
6c 69 63 61 74 69 6f 6e 5f 31 34 33 32 36 33 38 36 31 39 34 31 38 5f 30 30 30 31
2015-05-26 13:12:23,155 INFO  [cascading shutdown hooks] flow.Flow 
(BaseFlow.java:logInfo(1433)) - [20150526-131019-64BE78...] shutdown hook 
calling stop on flow
2015-05-26 13:12:23,155 INFO  [cascading shutdown hooks] flow.Flow 
(BaseFlow.java:logInfo(1433)) - [20150526-131019-64BE78...] stopping all jobs
2015-05-26 13:12:23,156 INFO  [cascading shutdown hooks] flow.Flow 
(BaseFlow.java:logInfo(1433)) - [20150526-131019-64BE78...] stopping: (1/1) 
...26-131019-64BE78F366.tcsv
{noformat}




--


[jira] [Commented] (TEZ-1883) Change findbugs version to 3.x

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560217#comment-14560217
 ] 

TezQA commented on TEZ-1883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735475/TEZ-1883.4.txt
  against master revision 7be325e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/742//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/742//console

This message is automatically generated.

 Change findbugs version to 3.x 
 ---

 Key: TEZ-1883
 URL: https://issues.apache.org/jira/browse/TEZ-1883
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
 Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, 
 TEZ-1883.4.txt






--


Failed: TEZ-1883 PreCommit Build #742

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-1883
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/742/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2538 lines...]


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735475/TEZ-1883.4.txt
  against master revision 7be325e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/742//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/742//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
be222332e57a87b4e74afc091305a27a6cae1204 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 28 blocks and 1986589 bytes
Compression is 31.6%
Took 0.88 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Updated] (TEZ-1883) Change findbugs version to 3.x

2015-05-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1883:

Attachment: TEZ-1883.5.txt

Attempting to get the findbugs version fixed in the report.

 Change findbugs version to 3.x 
 ---

 Key: TEZ-1883
 URL: https://issues.apache.org/jira/browse/TEZ-1883
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
 Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, 
 TEZ-1883.4.txt, TEZ-1883.5.txt






--


[jira] [Commented] (TEZ-2467) document tez-history-parser usage

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560259#comment-14560259
 ] 

TezQA commented on TEZ-2467:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12734060/TEZ-2467.1.patch
  against master revision 9dabf94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/743//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/743//console

This message is automatically generated.

 document tez-history-parser usage
 -

 Key: TEZ-2467
 URL: https://issues.apache.org/jira/browse/TEZ-2467
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2467.1.patch






--


Failed: TEZ-2467 PreCommit Build #743

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2467
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/743/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2532 lines...]


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12734060/TEZ-2467.1.patch
  against master revision 9dabf94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/743//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/743//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
a15227eed89bd1e06af1c5f4b0af62dec73a6679 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 8 blocks and 2626338 bytes
Compression is 9.1%
Took 3 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Created] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability

2015-05-26 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created TEZ-2490:
-

 Summary: TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
 Key: TEZ-2490
 URL: https://issues.apache.org/jira/browse/TEZ-2490
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan






--


[jira] [Updated] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-2475:

Attachment: TEZ-2475.debug.1.txt

Adds some debug logging to the subTaskRunner. [~fs111] - could you try this out 
please? The patch applies on 0.6.
Also, did you see any strange GC activity for this process? I wouldn't be 
surprised if this were an OOM, though the client heartbeat continued on for 2 minutes.

This looks like it's running in non-session mode, and I don't think 
tezClient.stop() is being called after each job completes. That leaves 
AppMaster instances hanging around.
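
The lifecycle Siddharth describes — stopping the client once each DAG completes in 
non-session mode — looks roughly like this sketch against the public TezClient API 
(configuration setup elided; `tezConf` and `dag` are assumed to exist in the caller):

```java
// Sketch only: per-job client lifecycle in non-session mode.
// Without the stop() in the finally block, each submitted job can leave
// a local AppMaster instance behind, which is the leak suspected here.
TezClient tezClient = TezClient.create("lingual-job", tezConf);
try {
    tezClient.start();
    DAGClient dagClient = tezClient.submitDAG(dag);
    dagClient.waitForCompletion();
} finally {
    tezClient.stop(); // release the AppMaster for this job
}
```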

 Tez local mode hanging in big testsuite
 ---

 Key: TEZ-2475
 URL: https://issues.apache.org/jira/browse/TEZ-2475
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.1
Reporter: André Kelpe
 Attachments: 2015-05-21_15-55-20_buildLog.log.gz, TEZ-2475.debug.1.txt


 we have a big test suite for lingual, our SQL layer for cascading. We are 
 trying very hard to make it work correctly on Tez, but I am stuck:
 The setup is a huge suite of SQL based tests (6000+), which are being 
 executed in order in local mode. At certain moments the whole process just 
 stops. Nothing gets executed any longer. This is not all the time, but quite 
 often. Note that it is not happening at the same line of code, more at 
 random, which makes it quite complex to debug.
 What I am seeing, is these kind of stacktraces in the middle of the run:
 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner 
 (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 This looks like it could be related to the hang, but the hang is not 
 happening immediately afterwards, but some time later.
 I have gone through quite a few JIRAs and saw that there were problems with 
 locks and hanging threads before, which should be fixed, but it still happens.
 I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
 This gist contains a thread dump of a hanging build: 
 https://gist.github.com/fs111/1ee44469bf5cc31e5a52



--


[jira] [Commented] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()

2015-05-26 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560357#comment-14560357
 ] 

Mit Desai commented on TEZ-2440:


Yes. It was based on branch 0.7. I will post another patch tomorrow.

 Sorter should check for indexCacheList.size() in flush()
 

 Key: TEZ-2440
 URL: https://issues.apache.org/jira/browse/TEZ-2440
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Mit Desai
 Attachments: TEZ-2440-1.patch


 {noformat}
 015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread 
 interrupted, cleaned up stale data, sorter threads shutdown=true, 
 terminated=false
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for 
 the router thread to die
 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an 
 error while executing task: attempt_1429683757595_0875_1_07_00_0
 java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:418)
 at java.util.ArrayList.get(ArrayList.java:431)
 at 
 org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
 at 
 org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 {noformat}
 When a DAG is killed in the middle, these exceptions are sometimes thrown 
 (e.g. q_17 in TPC-DS). Even though they are completely harmless, it would be 
 better to fix this to avoid distraction when debugging.
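
The guard the issue title suggests can be sketched in isolation; the names below 
(`indexCacheList`, `lastIndexOrNull`) mirror the report but are hypothetical 
stand-ins, not the actual PipelinedSorter fields:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flush() should check the spill-index cache size
// before indexing, so an interrupted task with zero completed spills does
// not trigger ArrayIndexOutOfBoundsException: -1 as in the trace above.
class SpillIndexGuard {
    private final List<String> indexCacheList = new ArrayList<>();

    void addSpill(String indexRecord) {
        indexCacheList.add(indexRecord);
    }

    // Returns the last cached spill index, or null when nothing was spilled.
    String lastIndexOrNull() {
        if (indexCacheList.isEmpty()) {
            return null; // the unguarded code would call get(-1) here
        }
        return indexCacheList.get(indexCacheList.size() - 1);
    }
}
```

With such a check, a kill arriving before the first spill yields null and flush() 
can return early instead of throwing.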



--


[jira] [Commented] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability

2015-05-26 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560356#comment-14560356
 ] 

Siddharth Seth commented on TEZ-2490:
-

+1. Would be worth putting into a shim at a later point.

 TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
 

 Key: TEZ-2490
 URL: https://issues.apache.org/jira/browse/TEZ-2490
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2490.1.patch






--


[jira] [Comment Edited] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability

2015-05-26 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560363#comment-14560363
 ] 

Rajesh Balamohan edited comment on TEZ-2490 at 5/27/15 3:52 AM:


Thanks [~sseth]. Committed to master.

commit dac59a2aa71aab5daaa6fabdda9d8f48539e1bda



was (Author: rajesh.balamohan):
Thanks [~sseth]

commit dac59a2aa71aab5daaa6fabdda9d8f48539e1bda


 TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
 

 Key: TEZ-2490
 URL: https://issues.apache.org/jira/browse/TEZ-2490
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.8.0

 Attachments: TEZ-2490.1.patch






--


[jira] [Updated] (TEZ-2483) Tez should close task if processor fail

2015-05-26 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2483:

Attachment: TEZ-2483-3.patch

 Tez should close task if processor fail
 ---

 Key: TEZ-2483
 URL: https://issues.apache.org/jira/browse/TEZ-2483
 Project: Apache Tez
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch, TEZ-2483-3.patch


 The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, this 
 creates a problem since the Pig client cannot remove the input file.
 In general, if a task fails, Tez should close all input/output handles during 
 cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly 
 right now. Attaching a demo patch.
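
The framework-owned cleanup the description asks for is essentially a try/finally 
around the processor; this self-contained sketch uses stand-in types (`Handle`, 
`runTask`), not the real Tez runtime classes:

```java
// Illustrative stand-ins for Tez's input/output handles; the real
// interfaces are LogicalInput/LogicalOutput in the Tez runtime.
interface Handle {
    void close();
}

class CleanupDemo {
    static int closedCount = 0;

    // Runs the "processor", closing both handles even when it throws --
    // the cleanup behavior this issue asks the framework, not Pig, to own.
    static void runTask(Handle input, Handle output, Runnable processor) {
        try {
            processor.run();
        } finally {
            try { input.close(); } catch (RuntimeException ignored) { }
            try { output.close(); } catch (RuntimeException ignored) { }
        }
    }

    public static void main(String[] args) {
        Handle h = () -> closedCount++;
        try {
            runTask(h, h, () -> { throw new RuntimeException("processor failed"); });
        } catch (RuntimeException expected) {
            // the processor failure still propagates after cleanup
        }
        System.out.println(closedCount); // 2
    }
}
```

Note that the original failure still propagates to the caller; the finally block 
only guarantees the handles are released first.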



--


[jira] [Commented] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560336#comment-14560336
 ] 

TezQA commented on TEZ-2490:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735499/TEZ-2490.1.patch
  against master revision 9dabf94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/745//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/745//console

This message is automatically generated.

 TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
 

 Key: TEZ-2490
 URL: https://issues.apache.org/jira/browse/TEZ-2490
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2490.1.patch






--


Failed: TEZ-2490 PreCommit Build #745

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2490
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/745/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 3013 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735499/TEZ-2490.1.patch
  against master revision 9dabf94.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/745//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/745//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
9b4860e2528902398c822cbeb2e8af38cbf72870 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 22 blocks and 2199772 bytes
Compression is 24.7%
Took 0.9 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558974#comment-14558974
 ] 

André Kelpe commented on TEZ-2475:
--

I have attached the output of a run with loglevel set to DEBUG. After a while 
the process just stopped and it kept in logging RpcProtobufEngine messages.

 Tez local mode hanging in big testsuite
 ---

 Key: TEZ-2475
 URL: https://issues.apache.org/jira/browse/TEZ-2475
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.1
Reporter: André Kelpe
 Attachments: 2015-05-21_15-55-20_buildLog.log.gz


 we have a big test suite for lingual, our SQL layer for cascading. We are 
 trying very hard to make it work correctly on Tez, but I am stuck:
 The setup is a huge suite of SQL based tests (6000+), which are being 
 executed in order in local mode. At certain moments the whole process just 
 stops. Nothing gets executed any longer. This is not all the time, but quite 
 often. Note that it is not happening at the same line of code, more at 
 random, which makes it quite complex to debug.
 What I am seeing, is these kind of stacktraces in the middle of the run:
 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner 
 (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 This looks like it could be related to the hang, but the hang is not 
 happening immediately afterwards, but some time later.
 I have gone through quite a few JIRAs and saw that there were problems with 
 locks and hanging threads before, which should be fixed, but it still happens.
 I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
 This gist contains a thread dump of a hanging build: 
 https://gist.github.com/fs111/1ee44469bf5cc31e5a52



--


[jira] [Comment Edited] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558974#comment-14558974
 ] 

André Kelpe edited comment on TEZ-2475 at 5/26/15 11:15 AM:


I have attached the output of a run with loglevel set to DEBUG. After a while 
the process just stopped and it kept in logging RpcProtobufEngine messages.

Edit: The file was too big for JIRA, please use this link:

https://www.dropbox.com/s/41ugvhyb3lb2d5c/2015-05-26_13-00-07_buildLog.log.gz?dl=0


was (Author: fs111):
I have attached the output of a run with loglevel set to DEBUG. After a while 
the process just stopped and it kept in logging RpcProtobufEngine messages.

 Tez local mode hanging in big testsuite
 ---

 Key: TEZ-2475
 URL: https://issues.apache.org/jira/browse/TEZ-2475
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.1
Reporter: André Kelpe
 Attachments: 2015-05-21_15-55-20_buildLog.log.gz


 we have a big test suite for lingual, our SQL layer for cascading. We are 
 trying very hard to make it work correctly on Tez, but I am stuck:
 The setup is a huge suite of SQL based tests (6000+), which are being 
 executed in order in local mode. At certain moments the whole process just 
 stops. Nothing gets executed any longer. This is not all the time, but quite 
 often. Note that it is not happening at the same line of code, more at 
 random, which makes it quite complex to debug.
 What I am seeing, is these kind of stacktraces in the middle of the run:
 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner 
 (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 This looks like it could be related to the hang, but the hang is not 
 happening immediately afterwards, but some time later.
 I have gone through quite a few JIRAs and saw that there were problems with 
 locks and hanging threads before, which should be fixed, but it still happens.
 I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
 This gist contains a thread dump of a hanging build: 
 https://gist.github.com/fs111/1ee44469bf5cc31e5a52



--


[jira] [Comment Edited] (TEZ-2475) Tez local mode hanging in big testsuite

2015-05-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558974#comment-14558974
 ] 

André Kelpe edited comment on TEZ-2475 at 5/26/15 11:23 AM:


I have attached the output of a run with loglevel set to DEBUG. After a while 
the process just stopped and it kept on logging RpcProtobufEngine messages.

Edit: The file was too big for JIRA, please use this link:

https://www.dropbox.com/s/41ugvhyb3lb2d5c/2015-05-26_13-00-07_buildLog.log.gz?dl=0


was (Author: fs111):
I have attached the output of a run with loglevel set to DEBUG. After a while 
the process just stopped and it kept in logging RpcProtobufEngine messages.

Edit: The file was too big for JIRA, please use this link:

https://www.dropbox.com/s/41ugvhyb3lb2d5c/2015-05-26_13-00-07_buildLog.log.gz?dl=0

 Tez local mode hanging in big testsuite
 ---

 Key: TEZ-2475
 URL: https://issues.apache.org/jira/browse/TEZ-2475
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.1
Reporter: André Kelpe
 Attachments: 2015-05-21_15-55-20_buildLog.log.gz


 we have a big test suite for lingual, our SQL layer for cascading. We are 
 trying very hard to make it work correctly on Tez, but I am stuck:
 The setup is a huge suite of SQL based tests (6000+), which are being 
 executed in order in local mode. At certain moments the whole process just 
 stops. Nothing gets executed any longer. This is not all the time, but quite 
 often. Note that it is not happening at the same line of code, more at 
 random, which makes it quite complex to debug.
 What I am seeing, is these kind of stacktraces in the middle of the run:
 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner 
 (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
 java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
 at 
 org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 This looks like it could be related to the hang, but the hang does not
 happen immediately afterwards; it happens some time later.
 I have gone through quite a few JIRAs and saw that there were problems with
 locks and hanging threads before, which should have been fixed, but it still happens.
 I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
 This gist contains a thread dump of a hanging build: 
 https://gist.github.com/fs111/1ee44469bf5cc31e5a52
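A practical way to investigate hangs like the one captured in the gist above is to take thread dumps at intervals while the suite runs. Besides `jstack <pid>`, the JVM can dump its own threads via `ThreadMXBean`; a minimal, self-contained sketch (the class name is ours, purely illustrative, not part of Tez):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {

    // Dump all live threads with their lock information, similar in spirit
    // to the jstack output attached to this issue. Note that
    // ThreadInfo.toString() truncates each stack to a handful of frames.
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Printing this from a watchdog thread at intervals helps show
        // which threads (e.g. TaskHeartbeatThread) are parked once the hang starts.
        System.out.println(dumpAllThreads());
    }
}
```

Comparing two dumps taken a minute apart shows which threads made no progress, which is usually more telling than a single snapshot.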



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559573#comment-14559573
 ] 

Jonathan Eagles commented on TEZ-2485:
--

I'll try to post a breakdown of what is taking up the most space soon so we 
can start brainstorming.

 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the timeline server are
 many times higher than those needed for the equivalent MapReduce job.
 Based on the storage improvements in YARN-3448, the timeline server may support up to
 30,000 jobs / 10,000,000 tasks a day.
 While I understand there is community effort on timeline server v2, it
 would be good if Tez could reduce its pressure on the timeline server by
 auditing both the number of events and the size of events.
 Here are some observations based on my understanding of the design of
 the timeline stores:
 Each timeline entity pushed explodes into many records in the database:
 1 marker record
 1 domain record
 1 record per event
 2 records per related entity
 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb
 rewrites the entire entity record per primary filter)
 1 record per other-info entry
 For example:
 Task Attempt Start
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 7 other info entries
 4 primary filters X 2
 20 records written to the database for task attempt start
 Task Attempt Finish
 1 marker
 1 domain
 1 task attempt finish event
 1 related entity X 2
 5 other info entries
 5 primary filters X 2
 20 records written to the database for task attempt finish
 =
 QUESTIONS:
 =
 Is there any data we are publishing to the timeline server that is not
 shown in the UI?
 Do we use all the entities (TEZ_CONTAINER_ID, for example)?
 Do we use all the primary filters?
 Do we use all the related entities specified?
 Are there any fields we don't use?
 Are there other approaches to consider to reduce entity count/size?
 Is there a way to store the same information in less space?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Jonathan Eagles (JIRA)
Jonathan Eagles created TEZ-2485:


 Summary: Reduce the Resource Load on the Timeline Server
 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles


The disk, network, and memory resources needed by the timeline server are
many times higher than those needed for the equivalent MapReduce job.

Based on the storage improvements in YARN-3448, the timeline server may support up to
30,000 jobs / 10,000,000 tasks a day.

While I understand there is community effort on timeline server v2, it
would be good if Tez could reduce its pressure on the timeline server by
auditing both the number of events and the size of events.

Here are some observations based on my understanding of the design of
the timeline stores:

Each timeline entity pushed explodes into many records in the database:
1 marker record
1 domain record
1 record per event
2 records per related entity
2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb
rewrites the entire entity record per primary filter)
1 record per other-info entry

For example:

Task Attempt Start
1 marker
1 domain
1 task attempt start event
1 related entity X 2
7 other info entries
4 primary filters X 2

20 records written to the database for task attempt start

Task Attempt Finish
1 marker
1 domain
1 task attempt finish event
1 related entity X 2
5 other info entries
5 primary filters X 2

20 records written to the database for task attempt finish

=
QUESTIONS:
=

Is there any data we are publishing to the timeline server that is not
shown in the UI?

Do we use all the entities (TEZ_CONTAINER_ID, for example)?
Do we use all the primary filters?
Do we use all the related entities specified?
Are there any fields we don't use?
Are there other approaches to consider to reduce entity count/size?
Is there a way to store the same information in less space?
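The record totals follow directly from the per-entity breakdown above. A small helper (our own sketch, not Tez code) makes the arithmetic explicit:

```java
public class TimelineRecordCount {

    // Records written per timeline entity, per the breakdown above:
    // 1 marker + 1 domain + 1 per event + 2 per related entity
    // + 2 per primary filter + 1 per other-info entry.
    static int recordsPerEntity(int events, int relatedEntities,
                                int primaryFilters, int otherInfoEntries) {
        return 1 + 1 + events + 2 * relatedEntities
            + 2 * primaryFilters + otherInfoEntries;
    }

    public static void main(String[] args) {
        // Task attempt start: 1 event, 1 related entity, 4 primary filters, 7 other info
        System.out.println(recordsPerEntity(1, 1, 4, 7)); // prints 20
        // Task attempt finish: 1 event, 1 related entity, 5 primary filters, 5 other info
        System.out.println(recordsPerEntity(1, 1, 5, 5)); // prints 20
    }
}
```

At the quoted scale of 10,000,000 task attempts a day, 20 records per start plus 20 per finish is on the order of 400 million database records daily before counting vertex and DAG entities.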



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2391) TestVertexImpl timing out at times on jenkins builds

2015-05-26 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559607#comment-14559607
 ] 

Mit Desai commented on TEZ-2391:


[~bikassaha] so do we want to increase the timeout, or do we want to take a different 
approach to fix this problem?

 TestVertexImpl timing out at times on jenkins builds 
 -

 Key: TEZ-2391
 URL: https://issues.apache.org/jira/browse/TEZ-2391
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Mit Desai
 Attachments: TEZ-2391.patch, TestVertexImpl-output.txt


 For example, https://builds.apache.org/job/Tez-Build/1028/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1529) ATS and TezClient integration in secure kerberos enabled cluster

2015-05-26 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559505#comment-14559505
 ] 

Hitesh Shah commented on TEZ-1529:
--

Minor nit: 

{code}
if (!TimelineReaderFactory.isTimelineClientSupported()) {
  throw new TezException("Reading from Timeline is not supported");
}
{code}

The exception message should be a bit more descriptive about why it may not be 
supported.

+1 once the above is fixed. Feel free to commit after fixing exception message. 
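As an illustration only (the helper and message text below are hypothetical, not the committed fix), a more descriptive message might spell out the likely cause:

```java
// Hypothetical sketch of the review comment: say *why* timeline reads
// may be unsupported rather than throwing a bare one-line message.
public class TimelineSupportMessage {

    static String unsupportedMessage(boolean secureCluster) {
        StringBuilder sb = new StringBuilder(
            "Reading from the timeline server is not supported");
        if (secureCluster) {
            // Assumed reason for illustration: in a secure cluster the client
            // needs delegation-token capable timeline APIs on the classpath.
            sb.append(": a timeline client with delegation token support"
                + " was not found on the classpath");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unsupportedMessage(true));
    }
}
```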
 


 ATS and TezClient integration  in secure kerberos enabled cluster
 -

 Key: TEZ-1529
 URL: https://issues.apache.org/jira/browse/TEZ-1529
 Project: Apache Tez
  Issue Type: Bug
Reporter: Prakash Ramachandran
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-1529-branch6.2.patch, TEZ-1529.1.patch, 
 TEZ-1529.2.patch, TEZ-1529.3.patch, TEZ-1529.4.patch, TEZ-1529.5.patch


 This is a follow-up to TEZ-1495, which addresses ATS - TezClient integration;
 however, it does not enable it in a secure Kerberos-enabled cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2483) Tez should close task if processor fail

2015-05-26 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559521#comment-14559521
 ] 

Siddharth Seth commented on TEZ-2483:
-

Agree with what Rajesh said. It would be better to add this to cleanup. Another 
thing to consider is exceptions thrown while invoking close() on a failed 
Processor / Input / Output: those should be ignored so that each 
Input/Output still gets closed. The TEZ-2003 branch already has some of this 
code in place.
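The close-everything-and-ignore-failures cleanup described above can be sketched with plain `AutoCloseable` resources standing in for Tez's Processor/Input/Output (the names and types here are ours, not the TEZ-2003 code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaskCleanup {

    // Close every resource; collect failures instead of propagating them,
    // so that one throwing close() cannot prevent the remaining resources
    // from being closed.
    static List<Exception> closeAll(List<? extends AutoCloseable> resources) {
        List<Exception> failures = new ArrayList<>();
        for (AutoCloseable r : resources) {
            try {
                r.close();
            } catch (Exception e) {
                failures.add(e); // log-and-continue in real code
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<AutoCloseable> resources = Arrays.asList(
            () -> { throw new RuntimeException("failed input"); },
            () -> System.out.println("output closed"));
        // The second resource is closed even though the first close() threw.
        System.out.println(closeAll(resources).size()); // prints 1
    }
}
```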

 Tez should close task if processor fail
 ---

 Key: TEZ-2483
 URL: https://issues.apache.org/jira/browse/TEZ-2483
 Project: Apache Tez
  Issue Type: Bug
Reporter: Daniel Dai
 Fix For: 0.7.1

 Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch


 The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, this
 creates a problem since the Pig client cannot remove the input file.
 In general, if a task fails, Tez should close all input/output handles in
 cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly
 right now. Attaching a demo patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()

2015-05-26 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai reassigned TEZ-2440:
--

Assignee: Mit Desai

 Sorter should check for indexCacheList.size() in flush()
 

 Key: TEZ-2440
 URL: https://issues.apache.org/jira/browse/TEZ-2440
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Mit Desai

 {noformat}
 2015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread 
 interrupted, cleaned up stale data, sorter threads shutdown=true, 
 terminated=false
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for 
 the router thread to die
 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an 
 error while executing task: attempt_1429683757595_0875_1_07_00_0
 java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:418)
 at java.util.ArrayList.get(ArrayList.java:431)
 at 
 org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
 at 
 org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 {noformat}
 When a DAG is killed in the middle, these exceptions are sometimes thrown
 (e.g. q_17 in TPC-DS). Even though they are completely harmless, it would be
 better to fix this to avoid distraction when debugging.
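The ArrayIndexOutOfBoundsException(-1) in PipelinedSorter.flush() points at an empty spill-index cache being indexed after an interrupted sort. The shape of the guard the summary asks for might look like this (an illustrative sketch, not the actual Tez patch):

```java
import java.util.ArrayList;
import java.util.List;

public class SorterFlushGuard {

    // Sketch of the suggested check: skip the merge step in flush() when no
    // spill indices exist (e.g. after a shouldDie-triggered shutdown),
    // instead of indexing the empty list and hitting index -1.
    static boolean canFlush(List<?> indexCacheList) {
        if (indexCacheList.isEmpty()) {
            return false; // nothing was spilled; nothing to flush
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canFlush(new ArrayList<>()));  // prints false
        List<String> oneSpill = new ArrayList<>();
        oneSpill.add("spill-0.index");
        System.out.println(canFlush(oneSpill));           // prints true
    }
}
```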



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2478) Move OneToOne routing to store events in Tasks

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559622#comment-14559622
 ] 

TezQA commented on TEZ-2478:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735369/TEZ-2478.1.txt
  against master revision 7be325e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/739//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/739//console

This message is automatically generated.

 Move OneToOne routing to store events in Tasks
 --

 Key: TEZ-2478
 URL: https://issues.apache.org/jira/browse/TEZ-2478
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
 Attachments: 1-1-wip.patch, TEZ-2478.1.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-2478 PreCommit Build #739

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2478
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/739/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2535 lines...]




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735369/TEZ-2478.1.txt
  against master revision 7be325e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/739//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/739//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
80740b09fe3515f3a7d5d594281fbe4107317ef8 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 25 blocks and 2057520 bytes
Compression is 28.5%
Took 0.99 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2450) support async http clients in ordered & unordered inputs

2015-05-26 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559747#comment-14559747
 ] 

Siddharth Seth commented on TEZ-2450:
-

+1. Looks good. Thanks [~rajesh.balamohan]. 

 support async http clients in ordered & unordered inputs
 

 Key: TEZ-2450
 URL: https://issues.apache.org/jira/browse/TEZ-2450
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2450.1.patch, TEZ-2450.2.WIP.patch, 
 TEZ-2450.2.patch, TEZ-2450.3.patch, TEZ-2450.4.patch, TEZ-2450.WIP.patch


 It will be helpful to switch between the JDK & other async http impls. For LLAP
 scenarios, it would be useful to make http clients interruptible, which is
 supported in async libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2487) Scheduler should be able to preempt tasks instead of containers

2015-05-26 Thread Siddharth Seth (JIRA)
Siddharth Seth created TEZ-2487:
---

 Summary: Scheduler should be able to preempt tasks instead of 
containers
 Key: TEZ-2487
 URL: https://issues.apache.org/jira/browse/TEZ-2487
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth


The scheduler currently preempts containers, since task-level preemption was not 
supported. There are changes in TEZ-2003 which allow tasks to be killed. Adding 
support in the AM would be useful so that containers can be re-used even if a 
running task needs to be preempted.

Assigning to myself for now. If anyone wants to take it over, please ping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1883) Change findbugs version to 3.x

2015-05-26 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1883:

Attachment: TEZ-1883.3.txt

Updated patch to fix the simpler findbugs warnings. Will leave the sync issues 
for TEZ-1900.

 Change findbugs version to 3.x 
 ---

 Key: TEZ-1883
 URL: https://issues.apache.org/jira/browse/TEZ-1883
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
 Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2481) Tez UI: graphical view does not render properly on IE11

2015-05-26 Thread Prakash Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559146#comment-14559146
 ] 

Prakash Ramachandran commented on TEZ-2481:
---

+1 LGTM. Checked on IE11 and Chrome; committing shortly.

 Tez UI: graphical view does not render properly on IE11
 ---

 Key: TEZ-2481
 URL: https://issues.apache.org/jira/browse/TEZ-2481
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sreenath Somarajapuram
Assignee: Sreenath Somarajapuram
 Attachments: Screen-Shot-2015-05-25-at-4.02.46-PM.jpg, 
 TEZ-2481.1.patch, TEZ-2481.2.patch, TEZ-2481.3.patch


 The issue was because of IE's poor/broken support for CSS in SVG.
 # IE doesn't support transform in CSS like other browsers. This caused the
 bubbles in a vertex to appear at the origin -
 https://connect.microsoft.com/IE/feedbackdetail/view/920928
 # IE has broken support for the marker (arrow on the path). This was
 causing the links/paths to disappear -
 https://connect.microsoft.com/IE/feedback/details/801938



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2486) Update tez website to include links based on http://www.apache.org/foundation/marks/pmcs.html#navigation

2015-05-26 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2486:


 Summary: Update tez website to include links based on 
http://www.apache.org/foundation/marks/pmcs.html#navigation
 Key: TEZ-2486
 URL: https://issues.apache.org/jira/browse/TEZ-2486
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1883) Change findbugs version to 3.x

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559882#comment-14559882
 ] 

TezQA commented on TEZ-1883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735400/TEZ-1883.3.txt
  against master revision 7be325e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 6 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/740//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/740//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/740//console

This message is automatically generated.

 Change findbugs version to 3.x 
 ---

 Key: TEZ-1883
 URL: https://issues.apache.org/jira/browse/TEZ-1883
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
 Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-1883 PreCommit Build #740

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-1883
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/740/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2539 lines...]

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735400/TEZ-1883.3.txt
  against master revision 7be325e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 6 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/740//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/740//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/740//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
f2b92a8c0a5fec61c70c3cd7be659d4cb2b77d9f logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 8 blocks and 2783929 bytes
Compression is 8.6%
Took 1 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559919#comment-14559919
 ] 

Hitesh Shah commented on TEZ-2485:
--

Thanks for starting this [~jeagles]. \cc [~rajesh.balamohan] [~gopalv] as they 
will need to look at how it impacts the job analysers and [~Sreenath] 
[~pramachandran] for UI impact. 

 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the timeline server are
 many times higher than those needed for the equivalent MapReduce job.
 Based on the storage improvements in YARN-3448, the timeline server may support up to
 30,000 jobs / 10,000,000 tasks a day.
 While I understand there is community effort on timeline server v2, it
 would be good if Tez could reduce its pressure on the timeline server by
 auditing both the number of events and the size of events.
 Here are some observations based on my understanding of the design of
 the timeline stores:
 Each timeline entity pushed explodes into many records in the database:
 1 marker record
 1 domain record
 1 record per event
 2 records per related entity
 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb
 rewrites the entire entity record per primary filter)
 1 record per other-info entry
 For example:
 Task Attempt Start
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 7 other info entries
 4 primary filters X 2
 20 records written to the database for task attempt start
 Task Attempt Finish
 1 marker
 1 domain
 1 task attempt finish event
 1 related entity X 2
 5 other info entries
 5 primary filters X 2
 20 records written to the database for task attempt finish
 =
 QUESTIONS:
 =
 Is there any data we are publishing to the timeline server that is not
 shown in the UI?
 Do we use all the entities (TEZ_CONTAINER_ID, for example)?
 Do we use all the primary filters?
 Do we use all the related entities specified?
 Are there any fields we don't use?
 Are there other approaches to consider to reduce entity count/size?
 Is there a way to store the same information in less space?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559929#comment-14559929
 ] 

Hitesh Shah commented on TEZ-2485:
--

Thanks for starting this [~jeagles]. \cc [~rajesh.balamohan] [~gopalv] as they 
will need to look at how it impacts the job analysers and [~Sreenath] 
[~pramachandran] for UI impact. 

 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the timeline server are
 many times higher than those needed for the equivalent MapReduce job.
 Based on the storage improvements in YARN-3448, the timeline server may support up to
 30,000 jobs / 10,000,000 tasks a day.
 While I understand there is community effort on timeline server v2, it
 would be good if Tez could reduce its pressure on the timeline server by
 auditing both the number of events and the size of events.
 Here are some observations based on my understanding of the design of
 the timeline stores:
 Each timeline entity pushed explodes into many records in the database:
 1 marker record
 1 domain record
 1 record per event
 2 records per related entity
 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb
 rewrites the entire entity record per primary filter)
 1 record per other-info entry
 For example:
 Task Attempt Start
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 7 other info entries
 4 primary filters X 2
 20 records written to the database for task attempt start
 Task Attempt Finish
 1 marker
 1 domain
 1 task attempt finish event
 1 related entity X 2
 5 other info entries
 5 primary filters X 2
 20 records written to the database for task attempt finish
 =
 QUESTIONS:
 =
 Is there any data we are publishing to the timeline server that is not
 shown in the UI?
 Do we use all the entities (TEZ_CONTAINER_ID, for example)?
 Do we use all the primary filters?
 Do we use all the related entities specified?
 Are there any fields we don't use?
 Are there other approaches to consider to reduce entity count/size?
 Is there a way to store the same information in less space?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2485:
-
Comment: was deleted

(was: Thanks for starting this [~jeagles]. \cc [~rajesh.balamohan] [~gopalv] as 
they will need to look at how it impacts the job analysers and [~Sreenath] 
[~pramachandran] for UI impact. )

 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the timeline server are
 many times higher than those needed for the equivalent MapReduce job.
 Based on the storage improvements in YARN-3448, the timeline server may support up to
 30,000 jobs / 10,000,000 tasks a day.
 While I understand there is community effort on timeline server v2, it
 would be good if Tez could reduce its pressure on the timeline server by
 auditing both the number of events and the size of events.
 Here are some observations based on my understanding of the design of
 the timeline stores:
 Each timeline entity pushed explodes into many records in the database:
 1 marker record
 1 domain record
 1 record per event
 2 records per related entity
 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb
 rewrites the entire entity record per primary filter)
 1 record per other-info entry
 For example:
 Task Attempt Start
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 7 other info entries
 4 primary filters X 2
 20 records written to the database for task attempt start
 Task Attempt Finish
 1 marker
 1 domain
 1 task attempt finish event
 1 related entity X 2
 5 other info entries
 5 primary filters X 2
 20 records written to the database for task attempt finish
 =
 QUESTIONS:
 =
 Is there any data we are publishing to the timeline server that is not
 shown in the UI?
 Do we use all the entities (TEZ_CONTAINER_ID, for example)?
 Do we use all the primary filters?
 Do we use all the related entities specified?
 Are there any fields we don't use?
 Are there other approaches to consider to reduce entity count/size?
 Is there a way to store the same information in less space?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2481) Tez UI: graphical view does not render properly on IE11

2015-05-26 Thread Prakash Ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Ramachandran updated TEZ-2481:
--
Summary: Tez UI: graphical view does not render properly on IE11  (was: Tez 
UI: IE11 - graphical view renders incorrectly)

 Tez UI: graphical view does not render properly on IE11
 ---

 Key: TEZ-2481
 URL: https://issues.apache.org/jira/browse/TEZ-2481
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sreenath Somarajapuram
Assignee: Sreenath Somarajapuram
 Attachments: Screen-Shot-2015-05-25-at-4.02.46-PM.jpg, 
 TEZ-2481.1.patch, TEZ-2481.2.patch, TEZ-2481.3.patch


 The issue was because of IE's poor/broken support for CSS in SVG.
 # IE doesn't support transform in CSS like other browsers. This caused the
 bubbles in a vertex to appear at the origin -
 https://connect.microsoft.com/IE/feedbackdetail/view/920928
 # IE has broken support for the marker (arrow on the path). This was
 causing the links/paths to disappear -
 https://connect.microsoft.com/IE/feedback/details/801938





[jira] [Updated] (TEZ-2483) Tez should close task if processor fail

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2483:
-
Fix Version/s: (was: 0.7.1)

 Tez should close task if processor fail
 ---

 Key: TEZ-2483
 URL: https://issues.apache.org/jira/browse/TEZ-2483
 Project: Apache Tez
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch


 The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, 
 this creates a problem since the Pig client cannot remove the input file.
 In general, if a task fails, Tez should close all input/output handles during 
 cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly 
 right now. Attaching a demo patch.
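The cleanup behavior the issue asks for can be sketched as follows. The names here (Closable, TaskCleanup) are illustrative stand-ins, not Tez's real MRInput/MROutput API: the point is that every handle gets closed even when the processor has already failed, so file handles are released and the client can delete the input files afterwards (which matters on Windows).

```java
import java.util.List;

// Illustrative stand-in for a Tez input/output handle; the real classes
// (MRInput, MROutput) are not reproduced here.
interface Closable {
    void close() throws Exception;
}

class TaskCleanup {
    // Close every handle even if the processor (or an earlier close) failed,
    // remembering the first error instead of aborting the sweep.
    static Exception closeAll(List<Closable> handles) {
        Exception first = null;
        for (Closable handle : handles) {
            try {
                handle.close();
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        return first;
    }
}
```

A failure while closing one handle does not prevent the remaining handles from being closed; the first error is surfaced afterwards.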





[jira] [Updated] (TEZ-2483) Tez should close task if processor fail

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2483:
-
Assignee: Daniel Dai

 Tez should close task if processor fail
 ---

 Key: TEZ-2483
 URL: https://issues.apache.org/jira/browse/TEZ-2483
 Project: Apache Tez
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch


 The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, 
 this creates a problem since the Pig client cannot remove the input file.
 In general, if a task fails, Tez should close all input/output handles during 
 cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly 
 right now. Attaching a demo patch.





[jira] [Updated] (TEZ-2483) Tez should close task if processor fail

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2483:
-
Target Version/s: 0.6.2, 0.7.1

 Tez should close task if processor fail
 ---

 Key: TEZ-2483
 URL: https://issues.apache.org/jira/browse/TEZ-2483
 Project: Apache Tez
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch


 The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, 
 this creates a problem since the Pig client cannot remove the input file.
 In general, if a task fails, Tez should close all input/output handles during 
 cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly 
 right now. Attaching a demo patch.





[jira] [Updated] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2488:
-
Attachment: applogs.txt

 Tez AM crashes if a submitted DAG is configured to use invalid resource 
 sizes. 
 ---

 Key: TEZ-2488
 URL: https://issues.apache.org/jira/browse/TEZ-2488
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Priority: Critical
 Attachments: applogs.txt


 2015-05-26 21:54:03,485 ERROR [AMRM Heartbeater thread] 
 impl.AMRMClientAsyncImpl: Exception on heartbeat
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, requested memory < 0, or requested memory > max configured, 
 requestedMemory=682, maxMemory=512
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
   at 
 org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
   at 
 org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 2015-05-26 21:54:03,495 INFO [Dispatcher thread: Central] app.DAGAppMaster: 
 Error in the TaskScheduler. Shutting down.
 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
 resource request, requested memory < 0, or requested memory > max configured, 
 requestedMemory=682, maxMemory=512
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
   at 
 org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
   at 
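The failure above reduces to a server-side bounds check (the real logic lives in YARN's SchedulerUtils; this sketch uses illustrative names): a container request whose memory is negative or above the configured maximum is rejected on the heartbeat, which is why the AM sees the exception only after the DAG is submitted.

```java
// Illustrative sketch of the bounds check that produced the exception above;
// the real implementation is YARN's SchedulerUtils.validateResourceRequest.
class ResourceCheck {
    static void validate(int requestedMemory, int maxMemory) {
        if (requestedMemory < 0 || requestedMemory > maxMemory) {
            throw new IllegalArgumentException(
                "Invalid resource request, requested memory < 0, or requested"
                + " memory > max configured, requestedMemory=" + requestedMemory
                + ", maxMemory=" + maxMemory);
        }
    }
}
```

Validating DAG resource settings against the cluster maximum before submission would let the client fail fast instead of crashing the AM.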

[jira] [Updated] (TEZ-2489) Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2489:
-
Description: 
15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot be 
enforced correctly for history data in Timeline
org.apache.tez.dag.api.TezUncheckedException: Unable to load class: 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
   at 
org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45)
   at 
org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88)
   at org.apache.tez.client.TezClient.start(TezClient.java:317)
   at 
cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137)
   at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248)
   at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172)
   at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134)
   at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

Reported by @chris wensel

  was:

15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot be 
enforced correctly for history data in Timeline
org.apache.tez.dag.api.TezUncheckedException: Unable to load class: 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
   at 
org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45)
   at 
org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88)
   at org.apache.tez.client.TezClient.start(TezClient.java:317)
   at 
cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137)
   at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248)
   at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172)
   at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134)
   at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)


 Disable warn log for Timeline error when tez.allow.disabled.timeline-domains 
 set to true 
 -

 Key: TEZ-2489
 URL: https://issues.apache.org/jira/browse/TEZ-2489
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah

 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for 
 org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot 
 be enforced correctly for history data in Timeline
 org.apache.tez.dag.api.TezUncheckedException: Unable to load class: 
 org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
at 
 org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45)
at 
 org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88)
at org.apache.tez.client.TezClient.start(TezClient.java:317)
at 
 cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
 Caused by: 

[jira] [Updated] (TEZ-2489) Disable warn log for Timeline ACL error when tez.allow.disabled.timeline-domains set to true

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2489:
-
Summary: Disable warn log for Timeline ACL error when 
tez.allow.disabled.timeline-domains set to true   (was: Disable warn log for 
Timeline error when tez.allow.disabled.timeline-domains set to true )

 Disable warn log for Timeline ACL error when 
 tez.allow.disabled.timeline-domains set to true 
 -

 Key: TEZ-2489
 URL: https://issues.apache.org/jira/browse/TEZ-2489
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah

 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for 
 org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot 
 be enforced correctly for history data in Timeline
 org.apache.tez.dag.api.TezUncheckedException: Unable to load class: 
 org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
at 
 org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45)
at 
 org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88)
at org.apache.tez.client.TezClient.start(TezClient.java:317)
at 
 cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 Reported by @chris wensel





[jira] [Created] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.

2015-05-26 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2488:


 Summary: Tez AM crashes if a submitted DAG is configured to use 
invalid resource sizes. 
 Key: TEZ-2488
 URL: https://issues.apache.org/jira/browse/TEZ-2488
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Priority: Critical


2015-05-26 21:54:03,485 ERROR [AMRM Heartbeater thread] 
impl.AMRMClientAsyncImpl: Exception on heartbeat
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request, requested memory < 0, or requested memory > max configured, 
requestedMemory=682, maxMemory=512
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at 
org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

2015-05-26 21:54:03,495 INFO [Dispatcher thread: Central] app.DAGAppMaster: 
Error in the TaskScheduler. Shutting down.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
resource request, requested memory < 0, or requested memory > max configured, 
requestedMemory=682, maxMemory=512
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 

[jira] [Updated] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()

2015-05-26 Thread Mit Desai (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mit Desai updated TEZ-2440:
---
Attachment: TEZ-2440-1.patch

[~rajesh.balamohan], can you take a look at the patch?

 Sorter should check for indexCacheList.size() in flush()
 

 Key: TEZ-2440
 URL: https://issues.apache.org/jira/browse/TEZ-2440
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Mit Desai
 Attachments: TEZ-2440-1.patch


 {noformat}
 2015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread 
 interrupted, cleaned up stale data, sorter threads shutdown=true, 
 terminated=false
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for 
 the router thread to die
 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an 
 error while executing task: attempt_1429683757595_0875_1_07_00_0
 java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:418)
 at java.util.ArrayList.get(ArrayList.java:431)
 at 
 org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
 at 
 org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 {noformat}
 When a DAG is killed in the middle, these exceptions are sometimes thrown 
 (e.g. q_17 in TPC-DS). Even though it is completely harmless, it would be 
 better to fix it to avoid distraction when debugging.
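A minimal sketch of the guard the summary proposes (field and return types are simplified stand-ins, not PipelinedSorter's real ones): flush() should bail out when no spill index was ever recorded, instead of indexing into an empty list and hitting the ArrayIndexOutOfBoundsException shown in the trace.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for PipelinedSorter: if the task is interrupted before
// any spill is recorded, indexCacheList is empty and get(size - 1) would be
// get(-1), throwing ArrayIndexOutOfBoundsException. The guard avoids that.
class SorterSketch {
    final List<String> indexCacheList = new ArrayList<>();

    String flush() {
        if (indexCacheList.isEmpty()) {
            return null; // nothing spilled; skip the final merge instead of crashing
        }
        return indexCacheList.get(indexCacheList.size() - 1);
    }
}
```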





[jira] [Comment Edited] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560070#comment-14560070
 ] 

Jonathan Eagles edited comment on TEZ-2485 at 5/26/15 11:07 PM:


Posted the data storage breakdown by entity type and by column type. The 
database in this instance was approximately 315MB on disk. Leveldb uses snappy 
compression, so the expanded key/value breakdown is 508MB/710MB 
respectively. Another thing to consider is the key overhead per record. Keys 
are of the form |Entity Type|8 bytes for timestamp|Entity Id|column-specific 
data|. To calculate the amount of space used by a type, multiply the type 
length by the count. The majority of this data was generated using Pig.
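Following the comment's recipe, the key bytes attributable to the entity-type prefix alone can be estimated as type length multiplied by record count — a rough lower bound on key overhead, since the timestamp, entity id, and column-specific suffix come on top. The helper name below is purely illustrative; the counts used in the usage example come from the entity-type breakdown table in this issue.

```java
// Estimate the key bytes contributed by the entity-type prefix alone:
// every record's key repeats the type string, so prefix bytes ≈
// typeName.length() * recordCount (assumes one byte per character).
class KeyOverhead {
    static long typeBytes(String typeName, long count) {
        return (long) typeName.length() * count;
    }
}
```

For example, the 2,471,393 TEZ_TASK_ATTEMPT_ID records spend roughly 19 bytes × 2,471,393 ≈ 47MB on the type prefix alone.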


was (Author: jeagles):
Posted the data storage breakdown by entity type and by column type. The 
database in the instance was approximately 315MB on disk. Leveldb uses snappy 
compression so that the expanded key/value breakdown is 508MB/710MB 
respectively. Another thing to consider is the key overhead per record. Keys 
are of the form |Entity Type|8bytes for timestamp| Entity Id | column specific 
data|. To calculate the amount of space utilized by type multiple the type 
length by the count.

 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the timeline server are 
 many times higher than the need for the equivalent mapreduce job. 
 Based on storage improvements in YARN-3448, the timeline server may support 
 up to 30,000 jobs / 10,000,000 tasks a day.
 While I understand there is community effort on timeline server v2, it
 would be good if Tez could reduce its pressure on the timeline server by
 auditing both the number of events and the size of events.
 Here are some observations based on my understanding of the design of
 timeline stores:
 Each timeline entity pushed explodes into many records in the database
 1 marker record
 1 domain record
 1 record per event
 2 records per related entity
 2 records per primary filter (2 records per primary filter in
 RollingLevelDBTimelineStore; plain leveldb rewrites the entire entity
 record per primary filter)
 1 record per other info
 For example
 Task Attempt Start
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 7 other info entries
 4 primary filters X 2
 20 records written in the database for task attempt start
 Task Attempt Finish
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 5 other info entries
 5 primary filters X 2
 20 records written in the database for task attempt finish
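The per-entity totals quoted above can be checked by summing the contributions listed for each store write; the helper below is purely illustrative, not timeline-server code.

```java
// Records written per timeline entity, per the breakdown above:
// 1 marker + 1 domain + 1 per event + 2 per related entity
// + 1 per other-info entry + 2 per primary filter.
class RecordCount {
    static int records(int events, int relatedEntities, int otherInfo, int primaryFilters) {
        return 1 /* marker */ + 1 /* domain */ + events
             + 2 * relatedEntities + otherInfo + 2 * primaryFilters;
    }
}
```

Task-attempt start (1 event, 1 related entity, 7 other-info entries, 4 primary filters) and finish (1, 1, 5, 5) each come out to 20 records.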
 =
 QUESTION:
 =
 Is there any data we are publishing to the timeline server that is not
 in the UI?
 Do we use all the entities (TEZ_CONTAINER_ID for example)
 Do we use all the primary filters?
 Do we use all the related entities specified?
 Are there any fields we don't use?
 Are there other approaches to consider to reduce entity count/size?
 Is there a way to store the same information in less space?
 ===
 Key Value Breakdown
 ||Count||Key Size||Value Size||
 |5642512|533690380|745454867|
 Entity Type Breakdown
 ||Type||Count||Key Size||Value Size||
 |TEZ_CONTAINER_ID|843850|86244392|5654341|
 |applicationAttemptId|544|53248|6174|
 |applicationId|544|44412|6174|
 |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209|
 |TEZ_APPLICATION|1048|84312|13057630|
 |containerId|362443|37013813|4135845|
 |TEZ_VERTEX_ID|99239|10387114|1559948|
 |TEZ_DAG_ID|5402|387705|2910830|
 |TEZ_TASK_ID|1762211|146210017|344478400|
 |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316|
 Column Breakdown
 ||Column||Count||Key Size||Value Size||
 |primarykeys|1092413|118768299|0|
 |marker|373515|25740507|2988120|
 |events|578196|55148482|1156392|
 |domain|373515|26114022|15314115|
 |reverserelated|587815|73721347|0|
 |otherinfo|2143751|170983893|725996240|
 |related|493307|63213830|0|





[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560070#comment-14560070
 ] 

Jonathan Eagles commented on TEZ-2485:
--

Posted the data storage breakdown by entity type and by column type. The 
database in this instance was approximately 315MB on disk. Leveldb uses snappy 
compression, so the expanded key/value breakdown is 508MB/710MB 
respectively. Another thing to consider is the key overhead per record. Keys 
are of the form |Entity Type|8 bytes for timestamp|Entity Id|column-specific 
data|. To calculate the amount of space used by a type, multiply the type 
length by the count.

 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the timeline server are 
 many times higher than the need for the equivalent mapreduce job. 
 Based on storage improvements in YARN-3448, the timeline server may support 
 up to 30,000 jobs / 10,000,000 tasks a day.
 While I understand there is community effort on timeline server v2, it
 would be good if Tez could reduce its pressure on the timeline server by
 auditing both the number of events and the size of events.
 Here are some observations based on my understanding of the design of
 timeline stores:
 Each timeline entity pushed explodes into many records in the database
 1 marker record
 1 domain record
 1 record per event
 2 records per related entity
 2 records per primary filter (2 records per primary filter in
 RollingLevelDBTimelineStore; plain leveldb rewrites the entire entity
 record per primary filter)
 1 record per other info
 For example
 Task Attempt Start
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 7 other info entries
 4 primary filters X 2
 20 records written in the database for task attempt start
 Task Attempt Finish
 1 marker
 1 domain
 1 task attempt start event
 1 related entity X 2
 5 other info entries
 5 primary filters X 2
 20 records written in the database for task attempt finish
 =
 QUESTION:
 =
 Is there any data we are publishing to the timeline server that is not
 in the UI?
 Do we use all the entities (TEZ_CONTAINER_ID for example)
 Do we use all the primary filters?
 Do we use all the related entities specified?
 Are there any fields we don't use?
 Are there other approaches to consider to reduce entity count/size?
 Is there a way to store the same information in less space?
 ===
 Key Value Breakdown
 ||Count||Key Size||Value Size||
 |5642512|533690380|745454867|
 Entity Type Breakdown
 ||Type||Count||Key Size||Value Size||
 |TEZ_CONTAINER_ID|843850|86244392|5654341|
 |applicationAttemptId|544|53248|6174|
 |applicationId|544|44412|6174|
 |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209|
 |TEZ_APPLICATION|1048|84312|13057630|
 |containerId|362443|37013813|4135845|
 |TEZ_VERTEX_ID|99239|10387114|1559948|
 |TEZ_DAG_ID|5402|387705|2910830|
 |TEZ_TASK_ID|1762211|146210017|344478400|
 |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316|
 Column Breakdown
 ||Column||Count||Key Size||Value Size||
 |primarykeys|1092413|118768299|0|
 |marker|373515|25740507|2988120|
 |events|578196|55148482|1156392|
 |domain|373515|26114022|15314115|
 |reverserelated|587815|73721347|0|
 |otherinfo|2143751|170983893|725996240|
 |related|493307|63213830|0|





[jira] [Created] (TEZ-2489) Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true

2015-05-26 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2489:


 Summary: Disable warn log for Timeline error when 
tez.allow.disabled.timeline-domains set to true 
 Key: TEZ-2489
 URL: https://issues.apache.org/jira/browse/TEZ-2489
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah



15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot be 
enforced correctly for history data in Timeline
org.apache.tez.dag.api.TezUncheckedException: Unable to load class: 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
   at 
org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45)
   at 
org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88)
   at org.apache.tez.client.TezClient.start(TezClient.java:317)
   at 
cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137)
   at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248)
   at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172)
   at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134)
   at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: 
org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
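
The behavior the summary asks for can be sketched as a guard around the reflective load. This is an illustrative Python sketch, not Tez code: the config key is the one named in the summary, and `load_class` is a hypothetical stand-in for `ReflectionUtils.createClazzInstance`.

```python
import logging

logger = logging.getLogger("TezClient")

ALLOW_DISABLED_TIMELINE_DOMAINS = "tez.allow.disabled.timeline-domains"

def load_acl_policy_manager(conf, load_class):
    """Instantiate the ATS ACL policy manager, logging quietly when the
    user has explicitly allowed timeline domains to be disabled."""
    try:
        return load_class(
            "org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager")
    except Exception as e:
        if conf.get(ALLOW_DISABLED_TIMELINE_DOMAINS) == "true":
            # Expected when the ATS classes are absent on purpose: DEBUG only.
            logger.debug("ACL policy manager unavailable: %s", e)
        else:
            # Preserve the existing WARN for genuinely unexpected failures.
            logger.warning("Could not instantiate ACL policy manager; ACLs "
                           "cannot be enforced for history data: %s", e)
        return None
```

The point of the sketch is only that the log level becomes conditional on the opt-out flag; the instantiation path itself is unchanged.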





[jira] [Updated] (TEZ-2485) Reduce the Resource Load on the Timeline Server

2015-05-26 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated TEZ-2485:
-
Description: 
The disk, network, and memory resources needed by the timeline server are 
many times higher than those needed for the equivalent mapreduce job. 

Based on the storage improvements in YARN-3448, the timeline server may support up to 
30,000 jobs / 10,000,000 tasks a day.

While I understand there is community effort on timeline server v2, it
would be good if Tez could reduce its pressure on the timeline server by
auditing both the number and size of events.

Here are some observations based on my understanding of the design of
timeline stores:

Each timeline entity pushed explodes into many records in the database:
1 marker record
1 domain record
1 record per event
2 records per related entity
2 records per primary filter (2 records per primary filter in
RollingLevelDBTimelineStore; in leveldb it rewrites the entire entity
record per primary filter)
1 record per other-info entry

For example

Task Attempt Start
1 marker
1 domain
1 task attempt start event
1 related entity X 2
7 other info entries
4 primary filters X 2

20 records written in the database for task attempt start

Task Attempt Finish
1 marker
1 domain
1 task attempt finish event
1 related entity X 2
5 other info entries
5 primary filters X 2

20 records written in the database for task attempt finish
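
The per-entity arithmetic above can be captured in a small sketch (illustrative Python; the function name is ours, not a Tez API):

```python
def records_per_entity(events, related_entities, primary_filters, other_info):
    """Records written to the store for one timeline entity, per the
    breakdown above: marker + domain + 1 per event + 2 per related
    entity + 2 per primary filter + 1 per other-info entry."""
    return (1 + 1 + events
            + 2 * related_entities
            + 2 * primary_filters
            + other_info)

# Task attempt start: 1 event, 1 related entity, 4 primary filters, 7 other info
assert records_per_entity(1, 1, 4, 7) == 20
# Task attempt finish: 1 event, 1 related entity, 5 primary filters, 5 other info
assert records_per_entity(1, 1, 5, 5) == 20
```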

=
QUESTION:
=

Is there any data we are publishing to the timeline server that is not
in the UI?

Do we use all the entities (TEZ_CONTAINER_ID, for example)?
Do we use all the primary filters?
Do we use all the related entities specified?
Are there any fields we don't use?
Are there other approaches to consider to reduce entity count/size?
Is there a way to store the same information in less space?

===
Key Value Breakdown
||Count||Key Size||Value Size||
|5642512|533690380|745454867|

Entity Type Breakdown
||Type||Count||Key Size||Value Size||
|TEZ_CONTAINER_ID|843850|86244392|5654341|
|applicationAttemptId|544|53248|6174|
|applicationId|544|44412|6174|
|TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209|
|TEZ_APPLICATION|1048|84312|13057630|
|containerId|362443|37013813|4135845|
|TEZ_VERTEX_ID|99239|10387114|1559948|
|TEZ_DAG_ID|5402|387705|2910830|
|TEZ_TASK_ID|1762211|146210017|344478400|
|TEZ_APPLICATION_ATTEMPT|95838|13741814|8316|

Column Breakdown
||Column||Count||Key Size||Value Size||
|primarykeys|1092413|118768299|0|
|marker|373515|25740507|2988120|
|events|578196|55148482|1156392|
|domain|373515|26114022|15314115|
|reverserelated|587815|73721347|0|
|otherinfo|2143751|170983893|725996240|
|related|493307|63213830|0|
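
As a cross-check, both the entity-type and the column breakdowns sum to the grand total in the Key Value Breakdown; a quick Python verification of the figures above:

```python
# Counts transcribed from the tables above.
entity_counts = {
    "TEZ_CONTAINER_ID": 843850, "applicationAttemptId": 544,
    "applicationId": 544, "TEZ_TASK_ATTEMPT_ID": 2471393,
    "TEZ_APPLICATION": 1048, "containerId": 362443,
    "TEZ_VERTEX_ID": 99239, "TEZ_DAG_ID": 5402,
    "TEZ_TASK_ID": 1762211, "TEZ_APPLICATION_ATTEMPT": 95838,
}
column_counts = {
    "primarykeys": 1092413, "marker": 373515, "events": 578196,
    "domain": 373515, "reverserelated": 587815,
    "otherinfo": 2143751, "related": 493307,
}
GRAND_TOTAL = 5642512  # from the Key Value Breakdown row

# Both partitions of the record count should add up to the same total.
assert sum(entity_counts.values()) == GRAND_TOTAL
assert sum(column_counts.values()) == GRAND_TOTAL
```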

  was:
The disk, network, and memory resources needed by the timeline server are 
many times higher than those needed for the equivalent mapreduce job. 

Based on the storage improvements in YARN-3448, the timeline server may support up to 
30,000 jobs / 10,000,000 tasks a day.

While I understand there is community effort on timeline server v2, it
would be good if Tez could reduce its pressure on the timeline server by
auditing both the number and size of events.

Here are some observations based on my understanding of the design of
timeline stores:

Each timeline entity pushed explodes into many records in the database:
1 marker record
1 domain record
1 record per event
2 records per related entity
2 records per primary filter (2 records per primary filter in
RollingLevelDBTimelineStore; in leveldb it rewrites the entire entity
record per primary filter)
1 record per other-info entry

For example

Task Attempt Start
1 marker
1 domain
1 task attempt start event
1 related entity X 2
7 other info entries
4 primary filters X 2

20 records written in the database for task attempt start

Task Attempt Finish
1 marker
1 domain
1 task attempt finish event
1 related entity X 2
5 other info entries
5 primary filters X 2

20 records written in the database for task attempt finish

=
QUESTION:
=

Is there any data we are publishing to the timeline server that is not
in the UI?

Do we use all the entities (TEZ_CONTAINER_ID, for example)?
Do we use all the primary filters?
Do we use all the related entities specified?
Are there any fields we don't use?
Are there other approaches to consider to reduce entity count/size?
Is there a way to store the same information in less space?


 Reduce the Resource Load on the Timeline Server
 ---

 Key: TEZ-2485
 URL: https://issues.apache.org/jira/browse/TEZ-2485
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles

 The disk, network, and memory resources needed by the 

Failed: TEZ-2440 PreCommit Build #741

2015-05-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2440
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/741/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 19 lines...]
No emails were triggered.
[PreCommit-TEZ-Build] $ /bin/bash /tmp/hudson4691266686886385787.sh
Running in Jenkins mode


==
==
Testing patch for TEZ-2440.
==
==


HEAD is now at 7be325e TEZ-2481. Tez UI: graphical view does not render 
properly on IE11 (Sreenath Somarajapuram via pramachandran)
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
Current branch master is up to date.
TEZ-2440 patch is being downloaded at Tue May 26 22:22:30 UTC 2015 from
http://issues.apache.org/jira/secure/attachment/12735421/TEZ-2440-1.patch
The patch does not appear to apply with p0 to p2
PATCH APPLICATION FAILED




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735421/TEZ-2440-1.patch
  against master revision 7be325e.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/741//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
fd483760be6fe714a044e2ac25a35f6a8baa79b8 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Commented] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()

2015-05-26 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559991#comment-14559991
 ] 

TezQA commented on TEZ-2440:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12735421/TEZ-2440-1.patch
  against master revision 7be325e.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/741//console

This message is automatically generated.

 Sorter should check for indexCacheList.size() in flush()
 

 Key: TEZ-2440
 URL: https://issues.apache.org/jira/browse/TEZ-2440
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Mit Desai
 Attachments: TEZ-2440-1.patch


 {noformat}
 2015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread 
 interrupted, cleaned up stale data, sorter threads shutdown=true, 
 terminated=false
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
 2015-05-11 20:28:20,231 INFO [TezChild] 
 runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for 
 the router thread to die
 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an 
 error while executing task: attempt_1429683757595_0875_1_07_00_0
 java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:418)
 at java.util.ArrayList.get(ArrayList.java:431)
 at 
 org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
 at 
 org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
 at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
 at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 {noformat}
 When a DAG is killed mid-run, these exceptions are sometimes thrown 
 (e.g. q_17 in TPC-DS). Even though it is completely harmless, it would be 
 better to fix it to avoid distraction when debugging.
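
The check the summary asks for can be sketched like this (illustrative Python, not the Tez source; the names only mirror `indexCacheList` and `flush()`):

```python
class SorterSketch:
    """Minimal stand-in for a pipelined sorter: flush() returns early when
    no spill index was recorded (e.g. the task was interrupted mid-run),
    instead of indexing an empty list and failing with index -1."""

    def __init__(self):
        self.index_cache_list = []  # mirrors indexCacheList

    def record_spill(self, spill_index):
        self.index_cache_list.append(spill_index)

    def flush(self):
        # The proposed guard: bail out if nothing was ever spilled.
        if len(self.index_cache_list) == 0:
            return None
        # Safe now: the last element is guaranteed to exist.
        return self.index_cache_list[len(self.index_cache_list) - 1]
```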





[jira] [Comment Edited] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559463#comment-14559463
 ] 

Hitesh Shah edited comment on TEZ-2484 at 5/26/15 5:35 PM:
---

This is related to how Hive uses Tez sessions. There is no 1:1 relationship 
between a YARN application and a Hive query (multiple queries can be run 
within a single YARN application), hence the application status cannot be 
mapped to the failure of one of the queries that ran within a given Tez 
application on YARN.


was (Author: hitesh):
This is related to how hive is using Tez sessions. There is no 1:1 relationship 
between a yarn application and a Hive query hence the application status cannot 
be mapped to the failure of one of the queries that ran within a given Tez 
application on yarn.
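
The session model can be sketched to show why the two statuses diverge (illustrative Python; the names are ours, not a Tez or YARN API):

```python
def session_report(dag_statuses, am_exit_clean=True):
    """One YARN application (a Tez session) hosts many DAGs (Hive queries).
    The application's final status reflects whether the session's AM shut
    down cleanly, not whether every DAG inside it succeeded."""
    return {
        "yarn_app_final_status": "SUCCEEDED" if am_exit_clean else "FAILED",
        "failed_dags": [i for i, s in enumerate(dag_statuses)
                        if s == "FAILED"],
    }

# A failed query inside a healthy session: the RM still reports SUCCEEDED,
# so per-query status must be read from the DAG, not the application.
report = session_report(["SUCCEEDED", "FAILED", "SUCCEEDED"])
assert report["yarn_app_final_status"] == "SUCCEEDED"
assert report["failed_dags"] == [1]
```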

 Tez vertex for Hive fails but Resource Manager reports job succeeded
 

 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon
 Attachments: Tez_RM_misreporting_succeeded.png


 When running a Hive on Tez job via Hive CLI the job fails and I get the error 
 shown below but in the Resource Manager the job is shown as Succeeded, even 
 though it's clearly failed:
 {code}
 Status: Running (Executing on YARN cluster with App id 
 application_1432310690008_0103)
 
 VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
 KILLED
 
 Map 1 FAILED   1478  00 1478   1
 1477
 
 VERTICES: 00/01  [--] 0%ELAPSED TIME: 1589.41 s
 
 Status: Failed
 Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
 diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
 diagnostics=[TaskAttempt 0 failed, info=[ 
 Containercontainer_e122_1432310690008_0103_01_94 received a 
 STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, 
 Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
 DAG failed due to vertex failure. failedVertices:1 killedVertices:0
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask
 {code}





[jira] [Commented] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559463#comment-14559463
 ] 

Hitesh Shah commented on TEZ-2484:
--

This is related to how Hive uses Tez sessions. There is no 1:1 relationship 
between a YARN application and a Hive query, hence the application status cannot 
be mapped to the failure of one of the queries that ran within a given Tez 
application on YARN.

 Tez vertex for Hive fails but Resource Manager reports job succeeded
 

 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon
 Attachments: Tez_RM_misreporting_succeeded.png


 When running a Hive on Tez job via Hive CLI the job fails and I get the error 
 shown below but in the Resource Manager the job is shown as Succeeded, even 
 though it's clearly failed:
 {code}
 Status: Running (Executing on YARN cluster with App id 
 application_1432310690008_0103)
 
 VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
 KILLED
 
 Map 1 FAILED   1478  00 1478   1
 1477
 
 VERTICES: 00/01  [--] 0%ELAPSED TIME: 1589.41 s
 
 Status: Failed
 Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
 diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
 diagnostics=[TaskAttempt 0 failed, info=[ 
 Containercontainer_e122_1432310690008_0103_01_94 received a 
 STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, 
 Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
 DAG failed due to vertex failure. failedVertices:1 killedVertices:0
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask
 {code}





[jira] [Updated] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2484:
-
Attachment: Tez_RM_misreporting_succeeded.png

Attaching a screenshot of the YARN Resource Manager entry showing this Tez job 
being incorrectly reported as succeeded despite the failure output in the user session.

 Tez vertex for Hive fails but Resource Manager reports job succeeded
 

 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon
 Attachments: Tez_RM_misreporting_succeeded.png


 When running a Hive on Tez job via Hive CLI the job fails and I get the error 
 shown below but in the Resource Manager the job is shown as Succeeded, even 
 though it's clearly failed:
 {code}
 Status: Running (Executing on YARN cluster with App id 
 application_1432310690008_0103)
 
 VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
 KILLED
 
 Map 1 FAILED   1478  00 1478   1
 1477
 
 VERTICES: 00/01  [--] 0%ELAPSED TIME: 1589.41 s
 
 Status: Failed
 Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
 diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
 diagnostics=[TaskAttempt 0 failed, info=[ 
 Containercontainer_e122_1432310690008_0103_01_94 received a 
 STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, 
 Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
 DAG failed due to vertex failure. failedVertices:1 killedVertices:0
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask
 {code}





[jira] [Resolved] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah resolved TEZ-2484.
--
Resolution: Invalid

 Tez vertex for Hive fails but Resource Manager reports job succeeded
 

 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon
 Attachments: Tez_RM_misreporting_succeeded.png


 When running a Hive on Tez job via Hive CLI the job fails and I get the error 
 shown below but in the Resource Manager the job is shown as Succeeded, even 
 though it's clearly failed:
 {code}
 Status: Running (Executing on YARN cluster with App id 
 application_1432310690008_0103)
 
 VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
 KILLED
 
 Map 1 FAILED   1478  00 1478   1
 1477
 
 VERTICES: 00/01  [--] 0%ELAPSED TIME: 1589.41 s
 
 Status: Failed
 Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
 diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
 diagnostics=[TaskAttempt 0 failed, info=[ 
 Containercontainer_e122_1432310690008_0103_01_94 received a 
 STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, 
 Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
 DAG failed due to vertex failure. failedVertices:1 killedVertices:0
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask
 {code}





[jira] [Created] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hari Sekhon (JIRA)
Hari Sekhon created TEZ-2484:


 Summary: Tez vertex for Hive fails but Resource Manager reports 
job succeeded
 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon


When running a Hive on Tez job via Hive CLI the job fails and I get the error 
shown below but in the Resource Manager the job is shown as Succeeded, even 
though it's clearly failed:
{code}
Status: Running (Executing on YARN cluster with App id 
application_1432310690008_0103)


VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 FAILED   1478  00 1478   11477

VERTICES: 00/01  [--] 0%ELAPSED TIME: 1589.41 s

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
diagnostics=[TaskAttempt 0 failed, info=[ 
Containercontainer_e122_1432310690008_0103_01_94 received a STOP_REQUEST]], 
Vertex failed as one or more tasks failed. failedTasks:1, Vertex 
vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.tez.TezTask
{code}


