[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560251#comment-14560251 ]

Siddharth Seth commented on TEZ-2475:
-------------------------------------

Don't think it's related. The message shows up for pretty much all tasks - should investigate what it is, but I don't think it's causing the job to hang.

Tez local mode hanging in big testsuite
---------------------------------------

Key: TEZ-2475
URL: https://issues.apache.org/jira/browse/TEZ-2475
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.7.0, 0.6.1
Reporter: André Kelpe
Attachments: 2015-05-21_15-55-20_buildLog.log.gz

We have a big test suite for Lingual, our SQL layer for Cascading. We are trying very hard to make it work correctly on Tez, but I am stuck. The setup is a huge suite of SQL-based tests (6000+), which are executed in order in local mode. At certain moments the whole process just stops and nothing gets executed any longer. This does not happen every time, but quite often. Note that it does not happen at the same line of code; it appears random, which makes it quite complex to debug.

What I am seeing is this kind of stack trace in the middle of the run:

{noformat}
2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

This looks like it could be related to the hang, but the hang does not happen immediately afterwards; it happens some time later. I have gone through quite a few JIRAs and saw that there were problems with locks and hanging threads before, which should be fixed, but it still happens. I have tried 0.6.1 and 0.7.0; both show the same behaviour. This gist contains a thread dump of a hanging build: https://gist.github.com/fs111/1ee44469bf5cc31e5a52

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-1883 PreCommit Build #744
Jira: https://issues.apache.org/jira/browse/TEZ-1883
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/744/

###
## LAST 60 LINES OF THE CONSOLE
###

[...truncated 2545 lines...]

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12735492/TEZ-1883.5.txt
against master revision 9dabf94.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The following test timeouts occurred in: org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/744//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/744//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/744//console

This message is automatically generated.

======
Adding comment to Jira.
======
Comment added.
848d5cf082251406a0dc1162af54cf959a4d58e2 logged out
======
Finished build.
======

Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #737
Archived 47 artifacts
Archive block size is 32768
Received 22 blocks and 2167622 bytes
Compression is 25.0%
Took 1.1 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any)
###
All tests passed
[jira] [Comment Edited] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
[ https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560295#comment-14560295 ]

Rajesh Balamohan edited comment on TEZ-2490 at 5/27/15 2:43 AM:
----------------------------------------------------------------

[~sseth], [~hitesh], [~pramachandran] - Please review. Tested with 2.2, 2.4, 2.6. (related hadoop jira HADOOP-11243)

was (Author: rajesh.balamohan):
[~sseth], [~hitesh] - Please review. Tested with 2.2, 2.4, 2.6. (related hadoop jira HADOOP-11243)

TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
------------------------------------------------

Key: TEZ-2490
URL: https://issues.apache.org/jira/browse/TEZ-2490
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attachments: TEZ-2490.1.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1883) Change findbugs version to 3.x
[ https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-1883:
--------------------------------
Attachment: TEZ-1883.4.txt

Added excludes for the DAGAM Inconsistent sync warnings. [~hitesh] - please review.

Change findbugs version to 3.x
------------------------------

Key: TEZ-1883
URL: https://issues.apache.org/jira/browse/TEZ-1883
Project: Apache Tez
Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, TEZ-1883.4.txt

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1954) Multiple instances of Inconsistent synchronization in org.apache.tez.dag.app.DAGAppMaster.
[ https://issues.apache.org/jira/browse/TEZ-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560154#comment-14560154 ]

Siddharth Seth commented on TEZ-1954:
-------------------------------------

Some more after findbugs3:

{noformat}
Code	Warning
IS	Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.containers; locked 80% of time
IS	Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.currentRecoveryDataDir; locked 66% of time
IS	Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.execService; locked 75% of time
IS	Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.historyEventHandler; locked 91% of time
IS	Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.nodes; locked 80% of time
IS	Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.recoveryEnabled; locked 66% of time
{noformat}

Multiple instances of Inconsistent synchronization in org.apache.tez.dag.app.DAGAppMaster.
------------------------------------------------------------------------------------------

Key: TEZ-1954
URL: https://issues.apache.org/jira/browse/TEZ-1954
Project: Apache Tez
Issue Type: Sub-task
Reporter: Hitesh Shah

Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.amTokens; locked 50% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.appMasterUgi; locked 66% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.context; locked 65% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.currentDAG; locked 72% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.state; locked 80% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.taskSchedulerEventHandler; locked 78% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.versionMismatch; locked 83% of time
Inconsistent synchronization of org.apache.tez.dag.app.DAGAppMaster.versionMismatchDiagnostics; locked 80% of time

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
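The IS warnings above flag fields that are written while holding the DAGAppMaster lock but read without it. A minimal standalone sketch (illustrative only, not Tez code) of the pattern FindBugs reports, together with the usual fix of synchronizing the reader (or making the field volatile):

```java
// Minimal illustration of the FindBugs "IS: Inconsistent synchronization" pattern:
// a field locked on most accesses but read unsynchronized on at least one path.
public class IsWarningDemo {
    private int state; // written under the monitor, sometimes read without it

    public synchronized void setState(int s) { state = s; }

    // Unsynchronized read: this is what makes FindBugs report IS, since the
    // field is "locked N% of the time" rather than always.
    public int getStateUnsafe() { return state; }

    // Typical fix: synchronize the reader too, so every access holds the lock.
    public synchronized int getStateSafe() { return state; }

    public static void main(String[] args) {
        IsWarningDemo d = new IsWarningDemo();
        d.setState(42);
        System.out.println(d.getStateSafe());
    }
}
```

Alternatively, declaring the field `volatile` removes the warning for simple reads and writes, at the cost of not covering compound check-then-act updates.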
[jira] [Commented] (TEZ-1954) Multiple instances of Inconsistent synchronization in org.apache.tez.dag.app.DAGAppMaster.
[ https://issues.apache.org/jira/browse/TEZ-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560185#comment-14560185 ]

Jeff Zhang commented on TEZ-1954:
---------------------------------

I believe things will change after TEZ-1273.

Multiple instances of Inconsistent synchronization in org.apache.tez.dag.app.DAGAppMaster.
------------------------------------------------------------------------------------------

Key: TEZ-1954
URL: https://issues.apache.org/jira/browse/TEZ-1954

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2304) InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
[ https://issues.apache.org/jira/browse/TEZ-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560187#comment-14560187 ]

Jeff Zhang commented on TEZ-2304:
---------------------------------

bq. Maybe createAttempt could be changed to use the last seen attempt id instead?

That should also solve this issue. But I think it would be better to recover the task attempt even if it has not started (log TaskAttemptFinishedEvent even if there's no TaskAttemptStartedEvent); otherwise we may get a wrong killedTaskAttemptCount, although that is not critical. And I believe recovery should restore the AM to the same state as the last application attempt.

InvalidStateTransitonException TA_SCHEDULE at START_WAIT during recovery
------------------------------------------------------------------------

Key: TEZ-2304
URL: https://issues.apache.org/jira/browse/TEZ-2304
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jason Lowe
Labels: Recovery
Attachments: 168563_recovery.gz

I saw a Tez AM throw a few InvalidStateTransitonException (sic) instances during recovery complaining about TA_SCHEDULE arriving at the START_WAIT state.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1883) Change findbugs version to 3.x
[ https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560274#comment-14560274 ]

TezQA commented on TEZ-1883:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12735492/TEZ-1883.5.txt
against master revision 9dabf94.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The following test timeouts occurred in: org.apache.tez.dag.app.dag.impl.TestVertexImpl

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/744//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/744//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/744//console

This message is automatically generated.

Change findbugs version to 3.x
------------------------------

Key: TEZ-1883
URL: https://issues.apache.org/jira/browse/TEZ-1883
Project: Apache Tez
Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Siddharth Seth
Priority: Minor
Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, TEZ-1883.4.txt, TEZ-1883.5.txt

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560312#comment-14560312 ]

Siddharth Seth commented on TEZ-2475:
-------------------------------------

My best guess here is a RuntimeException in the LocalContainerLauncher-SubTaskRunner thread while creating a TezChild instance. These exceptions aren't caught or logged anywhere. I'm assuming the trace and the logs on this jira are unrelated. This is the last message during TezChild creation:

{code}
2015-05-26 13:10:23,128 WARN [LocalContainerLauncher-SubTaskRunner] token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for token kind tez.job
{code}

After this, the LocalTaskExecutionThread doesn't show up at all - which leads me to believe the failure happened during TezChild construction itself. The previous container holding on to the thread (single thread pool) would have generated log messages when it tried fetching new work.

A patch to at least log exceptions when the sub-task-runner is about to die should be simple, and should help diagnose this further. [~fs111] - is it possible to get instructions on how to reproduce this? Also a set of logs / a stack trace from the next time this happens.

Tez local mode hanging in big testsuite
---------------------------------------

Key: TEZ-2475
URL: https://issues.apache.org/jira/browse/TEZ-2475

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
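The fix Siddharth proposes - logging exceptions before the sub-task-runner thread dies - can be sketched generically (class and method names here are illustrative, not actual Tez code). A task submitted to an executor via submit() has its RuntimeException captured silently in the returned Future; wrapping the task logs the failure before it escapes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: wrap work submitted to a single-thread pool so that a
// RuntimeException (e.g. thrown while constructing a child task runner) is
// logged instead of being swallowed by the executor's FutureTask.
public class LoggingSubTaskRunner {
    static Runnable logged(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // Without this, the failure leaves no trace in the logs and the
                // pool thread simply stops picking up new work.
                System.err.println("Sub-task runner dying with exception: " + e);
                throw e;
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // submit() would otherwise hide this exception inside the Future.
        pool.submit(logged(() -> { throw new RuntimeException("boom"); }));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

An equivalent alternative is installing a Thread.UncaughtExceptionHandler via a ThreadFactory, which covers tasks started with execute() but not those wrapped by submit().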
[jira] [Updated] (TEZ-2467) document tez-history-parser usage
[ https://issues.apache.org/jira/browse/TEZ-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2467:
----------------------------------
Target Version/s: 0.8.0

document tez-history-parser usage
---------------------------------

Key: TEZ-2467
URL: https://issues.apache.org/jira/browse/TEZ-2467
Project: Apache Tez
Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attachments: TEZ-2467.1.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.
[ https://issues.apache.org/jira/browse/TEZ-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560243#comment-14560243 ]

Jeff Zhang commented on TEZ-2488:
---------------------------------

[~hitesh] Here the DAG specifies a memory request beyond the limit of the YARN scheduler's RM_SCHEDULER_MAXIMUM_ALLOCATION_MB property. This causes SCHEDULING_SERVICE_ERROR, which causes the AM to shut down. Ideally I think this should only fail the DAG, and the AM should be able to continue to serve the next DAG. But it is hard to identify whether the SCHEDULING_SERVICE_ERROR is caused by the DAG or by other reasons, so I think shutting down the AM is reasonable here. One thing we can do is add the error to the diagnostics to propagate it to the client side. Any thoughts?

Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.
------------------------------------------------------------------------------

Key: TEZ-2488
URL: https://issues.apache.org/jira/browse/TEZ-2488
Project: Apache Tez
Issue Type: Bug
Reporter: Hitesh Shah
Priority: Critical
Attachments: applogs.txt

{noformat}
2015-05-26 21:54:03,485 ERROR [AMRM Heartbeater thread] impl.AMRMClientAsyncImpl: Exception on heartbeat
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=682, maxMemory=512
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2015-05-26 21:54:03,495 INFO [Dispatcher thread: Central] app.DAGAppMaster: Error in the TaskScheduler. Shutting down.
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=682, maxMemory=512
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234)
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505)
{noformat}
[jira] [Commented] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.
[ https://issues.apache.org/jira/browse/TEZ-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560265#comment-14560265 ]

Hitesh Shah commented on TEZ-2488:
----------------------------------

The fix would be to never go to the RM with invalid data. On registering with the RM, it sends back a RegisterApplicationMasterResponse in the registerAppMaster() call. The response object has getMaximumResourceCapability, which can be used to do basic checks on the resources being requested before making the request. By doing this check in, say, DAG initialization, we can fail the DAG before making any allocation request calls to the RM. The check would need to be done for all the vertices (and the configured task settings). If we enhance the VertexManager at some point, this check will need to be done every time the VertexManager modifies the resources needed, throwing an error back to the VM in such cases.

Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.
------------------------------------------------------------------------------

Key: TEZ-2488
URL: https://issues.apache.org/jira/browse/TEZ-2488
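Hitesh's proposed check can be sketched in isolation (the types and method below are simplified stand-ins, not the actual Tez or YARN API): compare each vertex's requested resources against the maximum capability returned at AM registration, and fail the DAG locally instead of sending an invalid allocate() to the RM:

```java
import java.util.Map;

// Illustrative sketch of validating per-vertex resource requests against the
// cluster maximum (as returned by getMaximumResourceCapability()) before any
// allocation request reaches the ResourceManager.
public class ResourceCheck {
    /**
     * vertexResources maps vertex name -> {memoryMb, vcores}.
     * Throws if any vertex asks for a non-positive amount or more than the
     * RM would accept, mirroring YARN's InvalidResourceRequestException check.
     */
    static void validate(Map<String, int[]> vertexResources, int maxMemMb, int maxVcores) {
        for (Map.Entry<String, int[]> e : vertexResources.entrySet()) {
            int mem = e.getValue()[0];
            int vcores = e.getValue()[1];
            if (mem <= 0 || mem > maxMemMb || vcores <= 0 || vcores > maxVcores) {
                throw new IllegalArgumentException("Vertex " + e.getKey()
                    + " requests " + mem + "MB/" + vcores + " vcores, cluster max is "
                    + maxMemMb + "MB/" + maxVcores + " vcores");
            }
        }
    }

    public static void main(String[] args) {
        // Mirrors the log above: requestedMemory=682 against maxMemory=512 fails fast,
        // so only this DAG is rejected and the AM keeps running.
        try {
            validate(Map.of("v1", new int[]{682, 1}), 512, 8);
        } catch (IllegalArgumentException ex) {
            System.out.println("DAG rejected: " + ex.getMessage());
        }
    }
}
```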
[jira] [Commented] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()
[ https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560324#comment-14560324 ]

Rajesh Balamohan commented on TEZ-2440:
---------------------------------------

Thanks [~mitdesai]. Can you please rebase the patch for the master branch? indexCacheList.isEmpty() might be an easier check.

Sorter should check for indexCacheList.size() in flush()
--------------------------------------------------------

Key: TEZ-2440
URL: https://issues.apache.org/jira/browse/TEZ-2440
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Mit Desai
Attachments: TEZ-2440-1.patch

{noformat}
2015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... returning
2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. Shutting down
2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread interrupted, cleaned up stale data, sorter threads shutdown=true, terminated=false
2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for the router thread to die
2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0875_1_07_00_0
java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.elementData(ArrayList.java:418)
	at java.util.ArrayList.get(ArrayList.java:431)
	at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
	at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{noformat}

When a DAG is killed in the middle, these exceptions are sometimes thrown (e.g. q_17 in TPC-DS). Even though it is completely harmless, it would be better to fix it to avoid distraction when debugging.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
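The suggested isEmpty() guard can be sketched on a simplified stand-in for the sorter's flush path (not the actual PipelinedSorter class): when a task is interrupted before any spill is recorded, indexing the spill-index list at size() - 1 hits index -1, and checking emptiness first avoids the ArrayIndexOutOfBoundsException:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the TEZ-2440 fix: guard flush() so an interrupted
// task with no recorded spills does not index indexCacheList at -1.
public class SorterFlushGuard {
    private final List<String> indexCacheList = new ArrayList<>();

    void addSpillIndex(String spillRecord) {
        indexCacheList.add(spillRecord);
    }

    /** Returns the last spill index, or null if the task never spilled. */
    String flush() {
        if (indexCacheList.isEmpty()) {
            // Interrupted/killed before any spill: nothing to merge, so skip
            // the get(size() - 1) that would throw ArrayIndexOutOfBoundsException.
            return null;
        }
        return indexCacheList.get(indexCacheList.size() - 1);
    }

    public static void main(String[] args) {
        SorterFlushGuard sorter = new SorterFlushGuard();
        System.out.println(sorter.flush()); // no spills: returns null, no exception
        sorter.addSpillIndex("spill-0.index");
        System.out.println(sorter.flush());
    }
}
```

isEmpty() reads slightly clearer than comparing size() against zero, which is likely why it was suggested as the easier check.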
[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560220#comment-14560220 ]

Jeff Zhang commented on TEZ-2475:
---------------------------------

[~sseth] Is it related to TEZ-1802? I see the following messages at the end of the logs:

{noformat}
2015-05-26 13:10:23,128 WARN [LocalContainerLauncher-SubTaskRunner] token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for token kind tez.job
2015-05-26 13:10:23,128 WARN [LocalContainerLauncher-SubTaskRunner] token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for token kind tez.job
Kind: tez.job, Service: application_1432638619418_0001, Ident: 1e 61 70 70 6c 69 63 61 74 69 6f 6e 5f 31 34 33 32 36 33 38 36 31 39 34 31 38 5f 30 30 30 31
2015-05-26 13:12:23,155 INFO [cascading shutdown hooks] flow.Flow (BaseFlow.java:logInfo(1433)) - [20150526-131019-64BE78...] shutdown hook calling stop on flow
2015-05-26 13:12:23,155 INFO [cascading shutdown hooks] flow.Flow (BaseFlow.java:logInfo(1433)) - [20150526-131019-64BE78...] stopping all jobs
2015-05-26 13:12:23,156 INFO [cascading shutdown hooks] flow.Flow (BaseFlow.java:logInfo(1433)) - [20150526-131019-64BE78...] stopping: (1/1) ...26-131019-64BE78F366.tcsv
{noformat}

Tez local mode hanging in big testsuite
---------------------------------------

Key: TEZ-2475
URL: https://issues.apache.org/jira/browse/TEZ-2475

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1883) Change findbugs version to 3.x
[ https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560217#comment-14560217 ] TezQA commented on TEZ-1883: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735475/TEZ-1883.4.txt against master revision 7be325e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/742//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/742//console This message is automatically generated. Change findbugs version to 3.x --- Key: TEZ-1883 URL: https://issues.apache.org/jira/browse/TEZ-1883 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Siddharth Seth Priority: Minor Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, TEZ-1883.4.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-1883 PreCommit Build #742
Jira: https://issues.apache.org/jira/browse/TEZ-1883 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/742/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2538 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735475/TEZ-1883.4.txt against master revision 7be325e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/742//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/742//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. be222332e57a87b4e74afc091305a27a6cae1204 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #737 Archived 47 artifacts Archive block size is 32768 Received 28 blocks and 1986589 bytes Compression is 31.6% Took 0.88 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-1883) Change findbugs version to 3.x
[ https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1883: Attachment: TEZ-1883.5.txt Attempting to get the findbugs version fixed in the report. Change findbugs version to 3.x --- Key: TEZ-1883 URL: https://issues.apache.org/jira/browse/TEZ-1883 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Siddharth Seth Priority: Minor Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt, TEZ-1883.4.txt, TEZ-1883.5.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2467) document tez-history-parser usage
[ https://issues.apache.org/jira/browse/TEZ-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560259#comment-14560259 ] TezQA commented on TEZ-2467: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12734060/TEZ-2467.1.patch against master revision 9dabf94. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/743//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/743//console This message is automatically generated. document tez-history-parser usage - Key: TEZ-2467 URL: https://issues.apache.org/jira/browse/TEZ-2467 Project: Apache Tez Issue Type: Improvement Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2467.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2467 PreCommit Build #743
Jira: https://issues.apache.org/jira/browse/TEZ-2467 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/743/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2532 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12734060/TEZ-2467.1.patch against master revision 9dabf94. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/743//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/743//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. a15227eed89bd1e06af1c5f4b0af62dec73a6679 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #737 Archived 47 artifacts Archive block size is 32768 Received 8 blocks and 2626338 bytes Compression is 9.1% Took 3 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Created] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
Rajesh Balamohan created TEZ-2490: - Summary: TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability Key: TEZ-2490 URL: https://issues.apache.org/jira/browse/TEZ-2490 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2475: Attachment: TEZ-2475.debug.1.txt Adds some debug logging to the subTaskRunner. [~fs111] - could you try this out please? Patch applies on 0.6. Also, did you see any strange GC activity for this process? I won't be surprised if this were an OOM, though the client heartbeat continued on for 2 minutes. This looks like it's running in non-session mode, and I don't think tezClient.stop() is being called after each job completes. That leaves AppMaster instances hanging around. Tez local mode hanging in big testsuite --- Key: TEZ-2475 URL: https://issues.apache.org/jira/browse/TEZ-2475 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0, 0.6.1 Reporter: André Kelpe Attachments: 2015-05-21_15-55-20_buildLog.log.gz, TEZ-2475.debug.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
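The missing-`stop()` point above is the general resource-leak pattern to look for: in non-session mode each client owns its own AppMaster, so every submitted job needs a matching stop. A sketch of the pairing with a stand-in class (hypothetical names, not the real `org.apache.tez.client.TezClient` API, which exposes `start()`/`submitDAG()`/`stop()`):

```java
// Stand-in for the per-job lifecycle described in the comment above
// (hypothetical class): a job that skips stop() leaves the AppMaster's
// threads alive, a plausible way for a 6000-test local-mode suite to
// accumulate stuck state over time.
public class PerJobClient {
    private boolean amRunning;

    void start()     { amRunning = true; }   // stands in for launching the AM
    void submitDag() { /* run the DAG and wait for completion */ }
    void stop()      { amRunning = false; }  // stands in for tearing the AM down

    boolean isAmRunning() { return amRunning; }

    /** Runs one job with stop() guaranteed, even if the DAG throws. */
    static PerJobClient runOneJob() {
        PerJobClient client = new PerJobClient();
        client.start();
        try {
            client.submitDag();
        } finally {
            client.stop();   // the call the comment suspects is missing
        }
        return client;
    }
}
```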
[jira] [Commented] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()
[ https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560357#comment-14560357 ] Mit Desai commented on TEZ-2440: Yes. I was based on branch 0.7. I will post another patch tomorrow. Sorter should check for indexCacheList.size() in flush() Key: TEZ-2440 URL: https://issues.apache.org/jira/browse/TEZ-2440 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Mit Desai Attachments: TEZ-2440-1.patch {noformat} 015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... returning 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. Shutting down 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread interrupted, cleaned up stale data, sorter threads shutdown=true, terminated=false 2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter 2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for the router thread to die 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0875_1_07_00_0 java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) {noformat} When a DAG is killed in the middle, sometimes these exceptions are thrown (e.g q_17 in TPC-DS). Even though it is completely harmless, it would be better to fix it to avoid distraction when debugging -- This message was sent by Atlassian JIRA (v6.3.4#6332)
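The `ArrayIndexOutOfBoundsException: -1` in the trace is the signature of indexing an empty list: `get(size() - 1)` becomes `get(-1)`. A sketch of the guard the issue title asks for, using a plain `List` rather than the actual `PipelinedSorter` internals:

```java
import java.util.List;

// Sketch only, not the real PipelinedSorter code: flush() reads the last
// entry of indexCacheList, which is empty when the task is killed before
// any spill completes -- hence get(size() - 1) == get(-1) and the
// ArrayIndexOutOfBoundsException: -1 seen above. Checking for an empty
// list first makes the (harmless) failure path clean.
public class IndexCacheGuard {
    static boolean flush(List<int[]> indexCacheList) {
        if (indexCacheList.isEmpty()) {   // the missing size() check
            return false;                 // nothing spilled; nothing to merge
        }
        int[] lastSpillIndex = indexCacheList.get(indexCacheList.size() - 1);
        return lastSpillIndex != null;    // stand-in for the real merge work
    }
}
```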
[jira] [Commented] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
[ https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560356#comment-14560356 ] Siddharth Seth commented on TEZ-2490: - +1. Would be worth putting into a shim at a later point. TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability Key: TEZ-2490 URL: https://issues.apache.org/jira/browse/TEZ-2490 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2490.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
[ https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560363#comment-14560363 ] Rajesh Balamohan edited comment on TEZ-2490 at 5/27/15 3:52 AM: Thanks [~sseth]. Committed to master. commit dac59a2aa71aab5daaa6fabdda9d8f48539e1bda was (Author: rajesh.balamohan): Thanks [~sseth] commit dac59a2aa71aab5daaa6fabdda9d8f48539e1bda TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability Key: TEZ-2490 URL: https://issues.apache.org/jira/browse/TEZ-2490 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.8.0 Attachments: TEZ-2490.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2483) Tez should close task if processor fail
[ https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2483: Attachment: TEZ-2483-3.patch Tez should close task if processor fail --- Key: TEZ-2483 URL: https://issues.apache.org/jira/browse/TEZ-2483 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Assignee: Daniel Dai Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch, TEZ-2483-3.patch The symptom is if PigProcessor fail, MRInput is not closed. On Windows, this creates a problem since Pig client cannot remove the input file. In general, if a task fail, Tez shall close all input/output handles in cleanup. MROutput is closed in MROutput.abort() which Pig invokes explicitly right now. Attach a demo patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
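The fix the description asks for is the standard close-after-failure pattern: the framework, not the processor, owns closing the input/output handles, so a throwing processor still releases them. A sketch with stand-in `AutoCloseable` handles (hypothetical method and class names, not the Tez runtime API):

```java
import java.util.List;

// Sketch of the cleanup TEZ-2483 asks for, with stand-in types: even when
// the processor throws (as PigProcessor does in the report), every handle
// is still closed, so e.g. an input's file handle is released and the
// client can delete the file on Windows.
public class CloseOnFailure {
    static int runThenClose(Runnable processor, List<AutoCloseable> handles) {
        RuntimeException failure = null;
        try {
            processor.run();               // may throw
        } catch (RuntimeException e) {
            failure = e;                   // remember it; do not skip cleanup
        }
        int closed = 0;
        for (AutoCloseable h : handles) {
            try {
                h.close();
                closed++;
            } catch (Exception e) {
                // log and keep closing the remaining handles
            }
        }
        if (failure != null) {
            // the real runtime would mark the attempt failed here
            System.out.println("processor failed after closing " + closed
                + " handle(s): " + failure.getMessage());
        }
        return closed;
    }
}
```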
[jira] [Commented] (TEZ-2490) TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability
[ https://issues.apache.org/jira/browse/TEZ-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560336#comment-14560336 ] TezQA commented on TEZ-2490: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735499/TEZ-2490.1.patch against master revision 9dabf94. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/745//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/745//console This message is automatically generated. TEZ-2450 breaks Hadoop 2.2 and 2.4 compatability Key: TEZ-2490 URL: https://issues.apache.org/jira/browse/TEZ-2490 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2490.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2490 PreCommit Build #745
Jira: https://issues.apache.org/jira/browse/TEZ-2490 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/745/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 3013 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735499/TEZ-2490.1.patch against master revision 9dabf94. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/745//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/745//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 9b4860e2528902398c822cbeb2e8af38cbf72870 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #737 Archived 47 artifacts Archive block size is 32768 Received 22 blocks and 2199772 bytes Compression is 24.7% Took 0.9 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558974#comment-14558974 ] André Kelpe commented on TEZ-2475: -- I have attached the output of a run with loglevel set to DEBUG. After a while the process just stopped and it kept in logging RpcProtobufEngine messages. Tez local mode hanging in big testsuite --- Key: TEZ-2475 URL: https://issues.apache.org/jira/browse/TEZ-2475 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0, 0.6.1 Reporter: André Kelpe Attachments: 2015-05-21_15-55-20_buildLog.log.gz -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558974#comment-14558974 ] André Kelpe edited comment on TEZ-2475 at 5/26/15 11:15 AM: I have attached the output of a run with loglevel set to DEBUG. After a while the process just stopped and it kept in logging RpcProtobufEngine messages. Edit: The file was too big for JIRA, please use this link: https://www.dropbox.com/s/41ugvhyb3lb2d5c/2015-05-26_13-00-07_buildLog.log.gz?dl=0 was (Author: fs111): I have attached the output of a run with loglevel set to DEBUG. After a while the process just stopped and it kept in logging RpcProtobufEngine messages. Tez local mode hanging in big testsuite --- Key: TEZ-2475 URL: https://issues.apache.org/jira/browse/TEZ-2475 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0, 0.6.1 Reporter: André Kelpe Attachments: 2015-05-21_15-55-20_buildLog.log.gz -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2475) Tez local mode hanging in big testsuite
[ https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558974#comment-14558974 ] André Kelpe edited comment on TEZ-2475 at 5/26/15 11:23 AM: I have attached the output of a run with loglevel set to DEBUG. After a while the process just stopped and it kept on logging RpcProtobufEngine messages. Edit: The file was too big for JIRA, please use this link: https://www.dropbox.com/s/41ugvhyb3lb2d5c/2015-05-26_13-00-07_buildLog.log.gz?dl=0 was (Author: fs111): I have attached the output of a run with loglevel set to DEBUG. After a while the process just stopped and it kept in logging RpcProtobufEngine messages. Edit: The file was too big for JIRA, please use this link: https://www.dropbox.com/s/41ugvhyb3lb2d5c/2015-05-26_13-00-07_buildLog.log.gz?dl=0 Tez local mode hanging in big testsuite --- Key: TEZ-2475 URL: https://issues.apache.org/jira/browse/TEZ-2475 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0, 0.6.1 Reporter: André Kelpe Attachments: 2015-05-21_15-55-20_buildLog.log.gz -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server
[ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559573#comment-14559573 ] Jonathan Eagles commented on TEZ-2485: -- I'll try to post a breakdown of what is taking up the most space soon so we can start brainstorming. Reduce the Resource Load on the Timeline Server --- Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent MapReduce job. Based on the storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it would be good if Tez could reduce its pressure on the timeline server by auditing both the number of events and the size of events. Here are some observations based on my understanding of the design of timeline stores. Each timeline entity pushed explodes into many records in the database:
- 1 marker record
- 1 domain record
- 1 record per event
- 2 records per related entity
- 2 records per primary filter (2 records per primary filter in RollingLevelDBTimelineStore; plain leveldb rewrites the entire entity record per primary filter)
- 1 record per other-info entry
For example, Task Attempt Start: 1 marker + 1 domain + 1 task attempt start event + 1 related entity x 2 + 7 other info entries + 4 primary filters x 2 = 20 records written to the database. Task Attempt Finish: 1 marker + 1 domain + 1 task attempt finish event + 1 related entity x 2 + 5 other info entries + 5 primary filters x 2 = 20 records written to the database. = QUESTION: = Is there any data we are publishing to the timeline server that is not in the UI? Do we use all the entities (TEZ_CONTAINER_ID for example)? Do we use all the primary filters? Do we use all the related entities specified? Are there any fields we don't use? Are there other approaches to consider to reduce entity count/size? Is there a way to store the same information in less space? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
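The per-entity counts above reduce to one formula, and a quick arithmetic check (plain Java, no Tez or timeline-store code) confirms both attempt events land at 20 records:

```java
// Record-count arithmetic from the comment above: each entity costs
// 1 marker + 1 domain + 1 record per event + 2 per related entity
// + 2 per primary filter + 1 per other-info entry.
public class TimelineRecordCount {
    static int records(int events, int relatedEntities, int primaryFilters, int otherInfo) {
        return 1                      // marker record
             + 1                      // domain record
             + events                 // 1 record per event
             + 2 * relatedEntities    // 2 records per related entity
             + 2 * primaryFilters     // 2 records per primary filter
             + otherInfo;             // 1 record per other-info entry
    }

    public static void main(String[] args) {
        System.out.println(records(1, 1, 4, 7)); // task attempt start
        System.out.println(records(1, 1, 5, 5)); // task attempt finish
    }
}
```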
[jira] [Created] (TEZ-2485) Reduce the Resource Load on the Timeline Server
Jonathan Eagles created TEZ-2485: Summary: Reduce the Resource Load on the Timeline Server Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2391) TestVertexImpl timing out at times on jenkins builds
[ https://issues.apache.org/jira/browse/TEZ-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559607#comment-14559607 ] Mit Desai commented on TEZ-2391: [~bikassaha] so do we want to increase the timeout or want to have a different approach to fix this problem? TestVertexImpl timing out at times on jenkins builds - Key: TEZ-2391 URL: https://issues.apache.org/jira/browse/TEZ-2391 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Mit Desai Attachments: TEZ-2391.patch, TestVertexImpl-output.txt For example, https://builds.apache.org/job/Tez-Build/1028/console -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1529) ATS and TezClient integration in secure kerberos enabled cluster
[ https://issues.apache.org/jira/browse/TEZ-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559505#comment-14559505 ] Hitesh Shah commented on TEZ-1529: -- Minor nit: {code} if (!TimelineReaderFactory.isTimelineClientSupported()) { throw new TezException("Reading from Timeline is not supported"); } {code} The exception message should be a bit more descriptive about why it may not be supported. +1 once the above is fixed. Feel free to commit after fixing the exception message. ATS and TezClient integration in secure kerberos enabled cluster - Key: TEZ-1529 URL: https://issues.apache.org/jira/browse/TEZ-1529 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-1529-branch6.2.patch, TEZ-1529.1.patch, TEZ-1529.2.patch, TEZ-1529.3.patch, TEZ-1529.4.patch, TEZ-1529.5.patch This is a follow up for TEZ-1495, which addresses ATS - TezClient integration; however, it does not enable it in a secure kerberos enabled cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
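The suggested change might look like the following sketch. The message wording and the boolean stand-in for TimelineReaderFactory.isTimelineClientSupported() are illustrative, not the committed code.

```java
public class TimelineCheckDemo {

    /** The original message was just "Reading from Timeline is not supported";
     *  this wording (illustrative, not the committed text) explains why. */
    static String unsupportedMessage() {
        return "Reading from Timeline is not supported: no compatible timeline "
             + "client was found; check that the YARN timeline service is enabled "
             + "and that its client libraries are on the classpath.";
    }

    // Stand-in for the real check, which inspects the classpath/YARN version.
    static void checkTimelineSupported(boolean supported) throws Exception {
        if (!supported) {
            throw new Exception(unsupportedMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        checkTimelineSupported(true); // no-op when supported
        System.out.println(unsupportedMessage());
    }
}
```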
[jira] [Commented] (TEZ-2483) Tez should close task if processor fail
[ https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559521#comment-14559521 ] Siddharth Seth commented on TEZ-2483: - Agree with what Rajesh said. It would be better to add this to cleanup. Another thing to consider is exceptions that are thrown while invoking close on a failed Processor / Input / Output - those should be ignored so that each Input/Output still gets closed. The TEZ-2003 branch already has some of this code in place. Tez should close task if processor fail --- Key: TEZ-2483 URL: https://issues.apache.org/jira/browse/TEZ-2483 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Fix For: 0.7.1 Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, this creates a problem since the Pig client cannot remove the input file. In general, if a task fails, Tez should close all input/output handles in cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly right now. Attaching a demo patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
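The cleanup behaviour discussed in this comment - close every Input/Output on failure and swallow individual close() exceptions so one failure cannot prevent the rest from being cleaned up - can be sketched as follows. The Closeable2 interface and closeAll helper are hypothetical stand-ins, not the actual Tez runtime types.

```java
import java.util.Arrays;
import java.util.List;

public class CleanupSketch {

    // Stand-in for Tez's Input/Output close() methods, which may throw.
    interface Closeable2 {
        void close() throws Exception;
    }

    /** Closes every element, ignoring individual failures so one bad
     *  close() cannot prevent the rest from being cleaned up.
     *  Returns the number of failed closes. */
    static int closeAll(List<Closeable2> closeables) {
        int failures = 0;
        for (Closeable2 c : closeables) {
            try {
                c.close();
            } catch (Exception e) {
                failures++; // a real implementation would log and continue
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Closeable2> io = Arrays.asList(
            () -> { throw new Exception("input close failed"); },
            () -> { /* output closes fine */ });
        System.out.println(closeAll(io)); // 1 failure, but both closes were attempted
    }
}
```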
[jira] [Assigned] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()
[ https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai reassigned TEZ-2440: -- Assignee: Mit Desai Sorter should check for indexCacheList.size() in flush() Key: TEZ-2440 URL: https://issues.apache.org/jira/browse/TEZ-2440 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Mit Desai {noformat} 015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... returning 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. Shutting down 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread interrupted, cleaned up stale data, sorter threads shutdown=true, terminated=false 2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter 2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for the router thread to die 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0875_1_07_00_0 java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) {noformat} When a DAG is killed in the middle, sometimes these exceptions are thrown (e.g. q_17 in TPC-DS). Even though it is completely harmless, it would be better to fix it to avoid distraction when debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
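A sketch of the suggested guard: PipelinedSorter.flush() indexes the spill index list with (size - 1), which on an empty list yields exactly the ArrayIndexOutOfBoundsException: -1 in the stack trace above. The names echo the report, but this is not the actual PipelinedSorter code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlushGuardSketch {

    /** Returns the last spill's index entry, or null if nothing was spilled.
     *  Without the isEmpty() check, get(size - 1) on an empty list throws
     *  ArrayIndexOutOfBoundsException: -1, as seen in the report. */
    static String lastSpillIndex(List<String> indexCacheList) {
        if (indexCacheList.isEmpty()) {
            // The task was interrupted before any spill completed
            // (e.g. a DAG killed mid-run); there is nothing to flush.
            return null;
        }
        return indexCacheList.get(indexCacheList.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println(lastSpillIndex(new ArrayList<>()));                   // null, no AIOOBE
        System.out.println(lastSpillIndex(Arrays.asList("spill_0", "spill_1"))); // spill_1
    }
}
```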
[jira] [Commented] (TEZ-2478) Move OneToOne routing to store events in Tasks
[ https://issues.apache.org/jira/browse/TEZ-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559622#comment-14559622 ] TezQA commented on TEZ-2478: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735369/TEZ-2478.1.txt against master revision 7be325e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/739//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/739//console This message is automatically generated. Move OneToOne routing to store events in Tasks -- Key: TEZ-2478 URL: https://issues.apache.org/jira/browse/TEZ-2478 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: 1-1-wip.patch, TEZ-2478.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2478 PreCommit Build #739
Jira: https://issues.apache.org/jira/browse/TEZ-2478 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/739/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2535 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735369/TEZ-2478.1.txt against master revision 7be325e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/739//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/739//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 80740b09fe3515f3a7d5d594281fbe4107317ef8 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #737 Archived 47 artifacts Archive block size is 32768 Received 25 blocks and 2057520 bytes Compression is 28.5% Took 0.99 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2450) support async http clients in ordered unordered inputs
[ https://issues.apache.org/jira/browse/TEZ-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559747#comment-14559747 ] Siddharth Seth commented on TEZ-2450: - +1. Looks good. Thanks [~rajesh.balamohan]. support async http clients in ordered unordered inputs Key: TEZ-2450 URL: https://issues.apache.org/jira/browse/TEZ-2450 Project: Apache Tez Issue Type: Improvement Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2450.1.patch, TEZ-2450.2.WIP.patch, TEZ-2450.2.patch, TEZ-2450.3.patch, TEZ-2450.4.patch, TEZ-2450.WIP.patch It would be helpful to be able to switch between the JDK and other async http implementations. For LLAP scenarios, it would be useful to make http clients interruptible, which is supported in async libraries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2487) Scheduler should be able to preempt tasks instead of containers
Siddharth Seth created TEZ-2487: --- Summary: Scheduler should be able to preempt tasks instead of containers Key: TEZ-2487 URL: https://issues.apache.org/jira/browse/TEZ-2487 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Siddharth Seth The scheduler currently preempts containers since task level preemption was not supported. There are changes in TEZ-2003 that allow tasks to be killed. Adding support in the AM would be useful so that containers can be re-used even if a running task needs to be preempted. Assigning to myself for now. If anyone wants to take it over, please ping. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1883) Change findbugs version to 3.x
[ https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1883: Attachment: TEZ-1883.3.txt Updated patch to fix the simpler findbugs warnings. Will leave the sync issues for TEZ-1900. Change findbugs version to 3.x --- Key: TEZ-1883 URL: https://issues.apache.org/jira/browse/TEZ-1883 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Siddharth Seth Priority: Minor Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2481) Tez UI: graphical view does not render properly on IE11
[ https://issues.apache.org/jira/browse/TEZ-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559146#comment-14559146 ] Prakash Ramachandran commented on TEZ-2481: --- +1 LGTM. Checked on IE11 and chrome. Committing shortly. Tez UI: graphical view does not render properly on IE11 --- Key: TEZ-2481 URL: https://issues.apache.org/jira/browse/TEZ-2481 Project: Apache Tez Issue Type: Bug Reporter: Sreenath Somarajapuram Assignee: Sreenath Somarajapuram Attachments: Screen-Shot-2015-05-25-at-4.02.46-PM.jpg, TEZ-2481.1.patch, TEZ-2481.2.patch, TEZ-2481.3.patch The issue was caused by IE's poor/broken support for css in SVG. # IE doesn't support transform in css like other browsers. This caused the bubbles in a vertex to appear at the origin - https://connect.microsoft.com/IE/feedbackdetail/view/920928 # IE has broken support for the marker (arrow on the path). This was causing the links/paths to disappear - https://connect.microsoft.com/IE/feedback/details/801938 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2486) Update tez website to include links based on http://www.apache.org/foundation/marks/pmcs.html#navigation
Hitesh Shah created TEZ-2486: Summary: Update tez website to include links based on http://www.apache.org/foundation/marks/pmcs.html#navigation Key: TEZ-2486 URL: https://issues.apache.org/jira/browse/TEZ-2486 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1883) Change findbugs version to 3.x
[ https://issues.apache.org/jira/browse/TEZ-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559882#comment-14559882 ] TezQA commented on TEZ-1883: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735400/TEZ-1883.3.txt against master revision 7be325e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 6 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/740//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/740//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/740//console This message is automatically generated. Change findbugs version to 3.x --- Key: TEZ-1883 URL: https://issues.apache.org/jira/browse/TEZ-1883 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Siddharth Seth Priority: Minor Attachments: TEZ-1883.1.patch, TEZ-1883.2.txt, TEZ-1883.3.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-1883 PreCommit Build #740
Jira: https://issues.apache.org/jira/browse/TEZ-1883 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/740/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2539 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735400/TEZ-1883.3.txt against master revision 7be325e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 6 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in : org.apache.tez.dag.app.dag.impl.TestVertexImpl Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/740//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/740//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/740//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. f2b92a8c0a5fec61c70c3cd7be659d4cb2b77d9f logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #737 Archived 47 artifacts Archive block size is 32768 Received 8 blocks and 2783929 bytes Compression is 8.6% Took 1 sec [description-setter] Could not determine description. 
Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server
[ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559919#comment-14559919 ] Hitesh Shah commented on TEZ-2485: -- Thanks for starting this [~jeagles]. \cc [~rajesh.balamohan] [~gopalv] as they will need to look at how it impacts the job analysers and [~Sreenath] [~pramachandran] for UI impact. Reduce the Resource Load on the Timeline Server --- Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent mapreduce job. Based on the storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it will be good if Tez can reduce its pressure on the timeline server by auditing both the number of events and size of events. Here are some observations based on my understanding of the design of timeline stores: Each timeline entity pushed explodes into many records in the database 1 marker record 1 domain record 1 record per event 2 records per related entity 2 records per primary filter (2 records per primary filter in RollingLevelDBTimelineStore, in leveldb it rewrites entire entity records per primary filter ) 1 record per other info For example Task Attempt Start 1 marker 1 domain 1 task attempt start event 1 related entity X 2 7 other info entries 4 primary filters X 2 20 records written in the database for task attempt start Task Attempt Finish 1 marker 1 domain 1 task attempt finish event 1 related entity X 2 5 other info entries 5 primary filters X 2 20 records written in the database for task attempt finish = QUESTION: = Is there any data we are publishing to the timeline server that is not in the UI?
Do we use all the entities (TEZ_CONTAINER_ID for example)? Do we use all the primary filters? Do we use all the related entities specified? Are there any fields we don't use? Are there other approaches to consider to reduce entity count/size? Is there a way to store the same information in less space? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (TEZ-2485) Reduce the Resource Load on the Timeline Server
[ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2485: - Comment: was deleted (was: Thanks for starting this [~jeagles]. \cc [~rajesh.balamohan] [~gopalv] as they will need to look at how it impacts the job analysers and [~Sreenath] [~pramachandran] for UI impact. ) Reduce the Resource Load on the Timeline Server --- Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent mapreduce job. Based on the storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it will be good if Tez can reduce its pressure on the timeline server by auditing both the number of events and size of events. Here are some observations based on my understanding of the design of timeline stores: Each timeline entity pushed explodes into many records in the database 1 marker record 1 domain record 1 record per event 2 records per related entity 2 records per primary filter (2 records per primary filter in RollingLevelDBTimelineStore, in leveldb it rewrites entire entity records per primary filter ) 1 record per other info For example Task Attempt Start 1 marker 1 domain 1 task attempt start event 1 related entity X 2 7 other info entries 4 primary filters X 2 20 records written in the database for task attempt start Task Attempt Finish 1 marker 1 domain 1 task attempt finish event 1 related entity X 2 5 other info entries 5 primary filters X 2 20 records written in the database for task attempt finish = QUESTION: = Is there any data we are publishing to the timeline server that is not in the UI? Do we use all the entities (TEZ_CONTAINER_ID for example)? Do we use all the primary filters?
Do we use all the related entities specified? Are there any fields we don't use? Are there other approaches to consider to reduce entity count/size? Is there a way to store the same information in less space? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2481) Tez UI: graphical view does not render properly on IE11
[ https://issues.apache.org/jira/browse/TEZ-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2481: -- Summary: Tez UI: graphical view does not render properly on IE11 (was: Tez UI: IE11 - graphical view renders incorrectly) Tez UI: graphical view does not render properly on IE11 --- Key: TEZ-2481 URL: https://issues.apache.org/jira/browse/TEZ-2481 Project: Apache Tez Issue Type: Bug Reporter: Sreenath Somarajapuram Assignee: Sreenath Somarajapuram Attachments: Screen-Shot-2015-05-25-at-4.02.46-PM.jpg, TEZ-2481.1.patch, TEZ-2481.2.patch, TEZ-2481.3.patch The issue was caused by IE's poor/broken support for css in SVG. # IE doesn't support transform in css like other browsers. This caused the bubbles in a vertex to appear at the origin - https://connect.microsoft.com/IE/feedbackdetail/view/920928 # IE has broken support for the marker (arrow on the path). This was causing the links/paths to disappear - https://connect.microsoft.com/IE/feedback/details/801938 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2483) Tez should close task if processor fail
[ https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2483: - Fix Version/s: (was: 0.7.1) Tez should close task if processor fail --- Key: TEZ-2483 URL: https://issues.apache.org/jira/browse/TEZ-2483 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Assignee: Daniel Dai Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, this creates a problem since the Pig client cannot remove the input file. In general, if a task fails, Tez should close all input/output handles in cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly right now. Attaching a demo patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2483) Tez should close task if processor fail
[ https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2483: - Assignee: Daniel Dai Tez should close task if processor fail --- Key: TEZ-2483 URL: https://issues.apache.org/jira/browse/TEZ-2483 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Assignee: Daniel Dai Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, this creates a problem since the Pig client cannot remove the input file. In general, if a task fails, Tez should close all input/output handles in cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly right now. Attaching a demo patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2483) Tez should close task if processor fail
[ https://issues.apache.org/jira/browse/TEZ-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2483: - Target Version/s: 0.6.2, 0.7.1 Tez should close task if processor fail --- Key: TEZ-2483 URL: https://issues.apache.org/jira/browse/TEZ-2483 Project: Apache Tez Issue Type: Bug Reporter: Daniel Dai Assignee: Daniel Dai Attachments: TEZ-2483-1.patch, TEZ-2483-2.patch The symptom is that if PigProcessor fails, MRInput is not closed. On Windows, this creates a problem since the Pig client cannot remove the input file. In general, if a task fails, Tez should close all input/output handles in cleanup. MROutput is closed in MROutput.abort(), which Pig invokes explicitly right now. Attaching a demo patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.
[ https://issues.apache.org/jira/browse/TEZ-2488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2488: - Attachment: applogs.txt Tez AM crashes if a submitted DAG is configured to use invalid resource sizes. --- Key: TEZ-2488 URL: https://issues.apache.org/jira/browse/TEZ-2488 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Critical Attachments: applogs.txt 2015-05-26 21:54:03,485 ERROR [AMRM Heartbeater thread] impl.AMRMClientAsyncImpl: Exception on heartbeat org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory 0, or requested memory max configured, requestedMemory=682, maxMemory=512 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2015-05-26 21:54:03,495 INFO [Dispatcher thread: Central] app.DAGAppMaster: Error in the TaskScheduler. Shutting down. 
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory 0, or requested memory max configured, requestedMemory=682, maxMemory=512 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at
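The failure above boils down to a simple validity check: YARN's SchedulerUtils rejects an allocate() call whose requested memory is non-positive or exceeds yarn.scheduler.maximum-allocation-mb, and the AM currently treats that rejection as fatal. A client-side pre-check might look like the following sketch (illustrative, not Tez or YARN code):

```java
public class ResourceCheckSketch {

    /** Mirrors the check YARN's SchedulerUtils applies on allocate():
     *  requested memory must be positive and no larger than
     *  yarn.scheduler.maximum-allocation-mb. */
    static boolean isValidTaskMemory(int requestedMb, int clusterMaxMb) {
        return requestedMb > 0 && requestedMb <= clusterMaxMb;
    }

    public static void main(String[] args) {
        // The values from the log above: 682 MB requested, 512 MB maximum.
        System.out.println(isValidTaskMemory(682, 512)); // false -> RM rejects the request
        System.out.println(isValidTaskMemory(512, 512)); // true
    }
}
```

Running such a check at DAG submission time would let the client fail a single misconfigured DAG instead of crashing the whole AM.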
[jira] [Updated] (TEZ-2489) Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true
[ https://issues.apache.org/jira/browse/TEZ-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2489: - Description: 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot be enforced correctly for history data in Timeline org.apache.tez.dag.api.TezUncheckedException: Unable to load class: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45) at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88) at org.apache.tez.client.TezClient.start(TezClient.java:317) at cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) Reported by @chris wensel was: 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. 
ACLs cannot be enforced correctly for history data in Timeline org.apache.tez.dag.api.TezUncheckedException: Unable to load class: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45) at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88) at org.apache.tez.client.TezClient.start(TezClient.java:317) at cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true - Key: TEZ-2489 URL: https://issues.apache.org/jira/browse/TEZ-2489 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. 
ACLs cannot be enforced correctly for history data in Timeline org.apache.tez.dag.api.TezUncheckedException: Unable to load class: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45) at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88) at org.apache.tez.client.TezClient.start(TezClient.java:317) at cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by:
[jira] [Updated] (TEZ-2489) Disable warn log for Timeline ACL error when tez.allow.disabled.timeline-domains set to true
[ https://issues.apache.org/jira/browse/TEZ-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2489: - Summary: Disable warn log for Timeline ACL error when tez.allow.disabled.timeline-domains set to true (was: Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true ) Disable warn log for Timeline ACL error when tez.allow.disabled.timeline-domains set to true - Key: TEZ-2489 URL: https://issues.apache.org/jira/browse/TEZ-2489 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot be enforced correctly for history data in Timeline org.apache.tez.dag.api.TezUncheckedException: Unable to load class: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45) at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88) at org.apache.tez.client.TezClient.start(TezClient.java:317) at cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) Reported by 
@chris wensel -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2488) Tez AM crashes if a submitted DAG is configured to use invalid resource sizes.
Hitesh Shah created TEZ-2488: Summary: Tez AM crashes if a submitted DAG is configured to use invalid resource sizes. Key: TEZ-2488 URL: https://issues.apache.org/jira/browse/TEZ-2488 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Priority: Critical 2015-05-26 21:54:03,485 ERROR [AMRM Heartbeater thread] impl.AMRMClientAsyncImpl: Exception on heartbeat org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=682, maxMemory=512 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2015-05-26 21:54:03,495 INFO [Dispatcher thread: Central] app.DAGAppMaster: Error in the TaskScheduler. Shutting down. org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=682, maxMemory=512 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:249) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:226) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:505) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at
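The failure mode above can be restated as a simple bounds check: the ResourceManager rejects a memory request that falls outside [0, maximum allocation], and here the DAG asked for 682 MB against a 512 MB cluster maximum (yarn.scheduler.maximum-allocation-mb). A minimal sketch of that check follows; the class and method names are illustrative, not YARN's actual API.

```java
// Illustrative re-statement of the validation behind the
// InvalidResourceRequestException in the report: a memory request is
// rejected when it is < 0 or > the scheduler's configured maximum.
// Names here are hypothetical, not YARN's SchedulerUtils API.
public class ResourceRequestCheck {
    /** Returns true when the request fits the scheduler limits. */
    public static boolean isValid(long requestedMemoryMb, long maxAllocationMb) {
        return requestedMemoryMb >= 0 && requestedMemoryMb <= maxAllocationMb;
    }
}
```

In this report the 682 > 512 case fails, so every AMRM heartbeat's allocate() call throws and the AM shuts down instead of rejecting the DAG up front.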
[jira] [Updated] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()
[ https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated TEZ-2440: --- Attachment: TEZ-2440-1.patch [~rajesh.balamohan], can you take a look at the patch? Sorter should check for indexCacheList.size() in flush() Key: TEZ-2440 URL: https://issues.apache.org/jira/browse/TEZ-2440 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Mit Desai Attachments: TEZ-2440-1.patch {noformat} 2015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... returning 2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. Shutting down 2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread interrupted, cleaned up stale data, sorter threads shutdown=true, terminated=false 2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter 2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for the router thread to die 2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0875_1_07_00_0 java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at java.util.ArrayList.get(ArrayList.java:431) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) {noformat} When a DAG is killed in the middle, sometimes these exceptions are thrown (e.g. q_17 in TPC-DS). Even though it is completely harmless, it would be better to fix it to avoid distraction when debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
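The fix the issue summary asks for is a size check before indexing: flush() indexes indexCacheList after an interrupt may have already cleared it, producing the ArrayIndexOutOfBoundsException: -1 above. A minimal sketch of the guard follows; this is an assumed shape (names and return type are illustrative), not the attached TEZ-2440-1.patch.

```java
import java.util.List;

// Sketch of the guard TEZ-2440 proposes: check indexCacheList.size()
// before indexing. When an interrupt cleans up the sorter's state before
// any spill completes, the list is empty and size() - 1 is -1, which is
// what blows up in PipelinedSorter.flush(). Names here are illustrative.
public class IndexCacheGuard {
    /** Returns the last spill's index record, or null when nothing was spilled. */
    static Integer lastSpillIndex(List<Integer> indexCacheList) {
        if (indexCacheList.isEmpty()) {
            return null; // nothing spilled; nothing to flush
        }
        return indexCacheList.get(indexCacheList.size() - 1);
    }
}
```

With the guard, a shutdown mid-DAG simply skips the flush step instead of throwing a distracting (if harmless) exception into the task log.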
[jira] [Comment Edited] (TEZ-2485) Reduce the Resource Load on the Timeline Server
[ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560070#comment-14560070 ] Jonathan Eagles edited comment on TEZ-2485 at 5/26/15 11:07 PM: Posted the data storage breakdown by entity type and by column type. The database in the instance was approximately 315MB on disk. LevelDB uses Snappy compression, so the expanded key/value breakdown is 508MB/710MB respectively. Another thing to consider is the key overhead per record. Keys are of the form |Entity Type|8 bytes for timestamp| Entity Id | column specific data|. To calculate the amount of space utilized by a type, multiply the type length by the count. The majority of this data was generated using Pig. was (Author: jeagles): Posted the data storage breakdown by entity type and by column type. The database in the instance was approximately 315MB on disk. LevelDB uses Snappy compression, so the expanded key/value breakdown is 508MB/710MB respectively. Another thing to consider is the key overhead per record. Keys are of the form |Entity Type|8 bytes for timestamp| Entity Id | column specific data|. To calculate the amount of space utilized by a type, multiply the type length by the count. Reduce the Resource Load on the Timeline Server --- Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent MapReduce job. Based on storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it would be good if Tez could reduce its pressure on the timeline server by auditing both the number of events and the size of events. 
Here are some observations based on my understanding of the design of timeline stores. Each timeline entity pushed explodes into many records in the database:
- 1 marker record
- 1 domain record
- 1 record per event
- 2 records per related entity
- 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb rewrites the entire entity record per primary filter)
- 1 record per other-info entry
For example, Task Attempt Start: 1 marker + 1 domain + 1 task attempt start event + 1 related entity x 2 + 7 other info entries + 4 primary filters x 2 = 20 records written to the database per task attempt start. Task Attempt Finish: 1 marker + 1 domain + 1 task attempt finish event + 1 related entity x 2 + 5 other info entries + 5 primary filters x 2 = 20 records written to the database per task attempt finish. QUESTIONS:
- Is there any data we are publishing to the timeline server that is not in the UI?
- Do we use all the entities (TEZ_CONTAINER_ID, for example)?
- Do we use all the primary filters?
- Do we use all the related entities specified?
- Are there any fields we don't use?
- Are there other approaches to consider to reduce entity count/size?
- Is there a way to store the same information in less space?
=== Key Value Breakdown ||Count||Key Size||Value Size|| |5642512|533690380|745454867| Entity Type Breakdown ||Type||Count||Key Size||Value Size|| |TEZ_CONTAINER_ID|843850|86244392|5654341| |applicationAttemptId|544|53248|6174| |applicationId|544|44412|6174| |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209| |TEZ_APPLICATION|1048|84312|13057630| |containerId|362443|37013813|4135845| |TEZ_VERTEX_ID|99239|10387114|1559948| |TEZ_DAG_ID|5402|387705|2910830| |TEZ_TASK_ID|1762211|146210017|344478400| |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316| Column Breakdown ||Column||Count||Key Size||Value Size|| |primarykeys|1092413|118768299|0| |marker|373515|25740507|2988120| |events|578196|55148482|1156392| |domain|373515|26114022|15314115| |reverserelated|587815|73721347|0| |otherinfo|2143751|170983893|725996240| |related|493307|63213830|0| -- This message was sent by Atlassian JIRA (v6.3.4#6332)
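The "20 records per task-attempt event" arithmetic in the description above can be checked mechanically. A minimal sketch under the stated store layout (1 marker + 1 domain + 1 record per event + 2 per related entity + 1 per other-info entry + 2 per primary filter); the helper name is illustrative, not a Tez or YARN API.

```java
// Record-count arithmetic for one timeline entity write, per the store
// layout described in TEZ-2485. The class/method are illustrative only.
public class TimelineRecordCount {
    public static int records(int events, int relatedEntities,
                              int otherInfoEntries, int primaryFilters) {
        return 1               // marker record
             + 1               // domain record
             + events          // 1 record per event
             + 2 * relatedEntities
             + otherInfoEntries
             + 2 * primaryFilters;
    }
}
```

Task attempt start (1 event, 1 related entity, 7 other-info entries, 4 primary filters) and finish (1 event, 1 related entity, 5 other-info entries, 5 primary filters) each come to 20 records, matching the description.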
[jira] [Commented] (TEZ-2485) Reduce the Resource Load on the Timeline Server
[ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560070#comment-14560070 ] Jonathan Eagles commented on TEZ-2485: -- Posted the data storage breakdown by entity type and by column type. The database in the instance was approximately 315MB on disk. LevelDB uses Snappy compression, so the expanded key/value breakdown is 508MB/710MB respectively. Another thing to consider is the key overhead per record. Keys are of the form |Entity Type|8 bytes for timestamp| Entity Id | column specific data|. To calculate the amount of space utilized by a type, multiply the type length by the count. Reduce the Resource Load on the Timeline Server --- Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent MapReduce job. Based on storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it would be good if Tez could reduce its pressure on the timeline server by auditing both the number of events and the size of events. 
Here are some observations based on my understanding of the design of timeline stores. Each timeline entity pushed explodes into many records in the database:
- 1 marker record
- 1 domain record
- 1 record per event
- 2 records per related entity
- 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb rewrites the entire entity record per primary filter)
- 1 record per other-info entry
For example, Task Attempt Start: 1 marker + 1 domain + 1 task attempt start event + 1 related entity x 2 + 7 other info entries + 4 primary filters x 2 = 20 records written to the database per task attempt start. Task Attempt Finish: 1 marker + 1 domain + 1 task attempt finish event + 1 related entity x 2 + 5 other info entries + 5 primary filters x 2 = 20 records written to the database per task attempt finish. QUESTIONS:
- Is there any data we are publishing to the timeline server that is not in the UI?
- Do we use all the entities (TEZ_CONTAINER_ID, for example)?
- Do we use all the primary filters?
- Do we use all the related entities specified?
- Are there any fields we don't use?
- Are there other approaches to consider to reduce entity count/size?
- Is there a way to store the same information in less space?
=== Key Value Breakdown ||Count||Key Size||Value Size|| |5642512|533690380|745454867| Entity Type Breakdown ||Type||Count||Key Size||Value Size|| |TEZ_CONTAINER_ID|843850|86244392|5654341| |applicationAttemptId|544|53248|6174| |applicationId|544|44412|6174| |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209| |TEZ_APPLICATION|1048|84312|13057630| |containerId|362443|37013813|4135845| |TEZ_VERTEX_ID|99239|10387114|1559948| |TEZ_DAG_ID|5402|387705|2910830| |TEZ_TASK_ID|1762211|146210017|344478400| |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316| Column Breakdown ||Column||Count||Key Size||Value Size|| |primarykeys|1092413|118768299|0| |marker|373515|25740507|2988120| |events|578196|55148482|1156392| |domain|373515|26114022|15314115| |reverserelated|587815|73721347|0| |otherinfo|2143751|170983893|725996240| |related|493307|63213830|0| -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2489) Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true
Hitesh Shah created TEZ-2489: Summary: Disable warn log for Timeline error when tez.allow.disabled.timeline-domains set to true Key: TEZ-2489 URL: https://issues.apache.org/jira/browse/TEZ-2489 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah 15/05/26 22:57:38 WARN client.TezClient: Could not instantiate object for org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager. ACLs cannot be enforced correctly for history data in Timeline org.apache.tez.dag.api.TezUncheckedException: Unable to load class: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at org.apache.tez.common.ReflectionUtils.getClazz(ReflectionUtils.java:45) at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:88) at org.apache.tez.client.TezClient.start(TezClient.java:317) at cascading.flow.tez.planner.Hadoop2TezFlowStepJob.internalNonBlockingStart(Hadoop2TezFlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:248) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:172) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:134) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:45) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
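The behaviour this issue asks for can be sketched as a conditional on the opt-out flag: when the user has explicitly set tez.allow.disabled.timeline-domains=true, the missing ATSHistoryACLPolicyManager is expected and the WARN is just noise. The class and method below are hypothetical illustrations, not the actual Tez patch.

```java
// Hypothetical sketch of the TEZ-2489 request: suppress the WARN about a
// missing ATSHistoryACLPolicyManager when the user has opted out of
// Timeline domain enforcement. Names are illustrative, not Tez's API.
public class TimelineWarnPolicy {
    /** Warn only when ACL enforcement was actually expected. */
    public static boolean shouldWarnOnMissingAclManager(boolean allowDisabledDomains) {
        // tez.allow.disabled.timeline-domains=true means the operator
        // accepts running without Timeline ACLs, so stay quiet.
        return !allowDisabledDomains;
    }
}
```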
[jira] [Updated] (TEZ-2485) Reduce the Resource Load on the Timeline Server
[ https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-2485: - Description: The disk, network, and memory resources needed by the timeline server are many times higher than those needed for the equivalent MapReduce job. Based on storage improvements in YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it would be good if Tez could reduce its pressure on the timeline server by auditing both the number of events and the size of events. Here are some observations based on my understanding of the design of timeline stores. Each timeline entity pushed explodes into many records in the database:
- 1 marker record
- 1 domain record
- 1 record per event
- 2 records per related entity
- 2 records per primary filter (in RollingLevelDBTimelineStore; plain leveldb rewrites the entire entity record per primary filter)
- 1 record per other-info entry
For example, Task Attempt Start: 1 marker + 1 domain + 1 task attempt start event + 1 related entity x 2 + 7 other info entries + 4 primary filters x 2 = 20 records written to the database per task attempt start. Task Attempt Finish: 1 marker + 1 domain + 1 task attempt finish event + 1 related entity x 2 + 5 other info entries + 5 primary filters x 2 = 20 records written to the database per task attempt finish. QUESTIONS:
- Is there any data we are publishing to the timeline server that is not in the UI?
- Do we use all the entities (TEZ_CONTAINER_ID, for example)?
- Do we use all the primary filters?
- Do we use all the related entities specified?
- Are there any fields we don't use?
- Are there other approaches to consider to reduce entity count/size?
- Is there a way to store the same information in less space?
=== Key Value Breakdown ||Count||Key Size||Value Size|| |5642512|533690380|745454867| Entity Type Breakdown ||Type||Count||Key Size||Value Size|| |TEZ_CONTAINER_ID|843850|86244392|5654341| |applicationAttemptId|544|53248|6174| |applicationId|544|44412|6174| |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209| |TEZ_APPLICATION|1048|84312|13057630| |containerId|362443|37013813|4135845| |TEZ_VERTEX_ID|99239|10387114|1559948| |TEZ_DAG_ID|5402|387705|2910830| |TEZ_TASK_ID|1762211|146210017|344478400| |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316| Column Breakdown ||Column||Count||Key Size||Value Size|| |primarykeys|1092413|118768299|0| |marker|373515|25740507|2988120| |events|578196|55148482|1156392| |domain|373515|26114022|15314115| |reverserelated|587815|73721347|0| |otherinfo|2143751|170983893|725996240| |related|493307|63213830|0| was: The disk, network, and memory resources needed by the timeline server are are many times higher than the need for the equivalent mapreduce job. Based on storage improvents YARN-3448, the timeline server may support up to 30,000 jobs / 10,000,000 tasks a day. While I understand there is community effort on timeline server v2, it will be good if Tez can reduce its pressure on the timeline server by auditing both the number of events and size of events. 
Here are some observations based on my understanding of the design of timeline stores: Each timeline entity pushed explodes into many records in the database 1 marker record 1 domain record 1 record per event 2 records per related entity 2 records per primary filter (2 record per primary filter in RollingLevelDBTimelineStore, in leveldb it rewrites entire entity records per primary filter ) 1 record per other info For example Task Attempt Start 1 marker 1 domain 1 task attempt start event 1 related entity X 2 7 other info entries 4 primary filters X 2 20 records written in the database for task attempt start Task Attempt Finish 1 marker 1 domain 1 task attempt start event 1 related entity X 2 5 other info entries 5 primary filters X 2 20 records written in the database for task attempt finish = QUESTION: = Is there any data we are publishing to the timeline server that is not in the UI? Do we use all the entities (TEZ_CONTAINER_ID for example) Do we use all the primary filters? Do we use all the related entities specified? Are there any fields we don't use? Are there other approaches to consider to reduce entity count/size? Is there a way to store the same information in less space? Reduce the Resource Load on the Timeline Server --- Key: TEZ-2485 URL: https://issues.apache.org/jira/browse/TEZ-2485 Project: Apache Tez Issue Type: Improvement Reporter: Jonathan Eagles The disk, network, and memory resources needed by the
Failed: TEZ-2440 PreCommit Build #741
Jira: https://issues.apache.org/jira/browse/TEZ-2440 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/741/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 19 lines...] No emails were triggered. [PreCommit-TEZ-Build] $ /bin/bash /tmp/hudson4691266686886385787.sh Running in Jenkins mode == == Testing patch for TEZ-2440. == == HEAD is now at 7be325e TEZ-2481. Tez UI: graphical view does not render properly on IE11 (Sreenath Somarajapuram via pramachandran) Switched to branch 'master' Your branch is up-to-date with 'origin/master'. Current branch master is up to date. TEZ-2440 patch is being downloaded at Tue May 26 22:22:30 UTC 2015 from http://issues.apache.org/jira/secure/attachment/12735421/TEZ-2440-1.patch The patch does not appear to apply with p0 to p2 PATCH APPLICATION FAILED {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12735421/TEZ-2440-1.patch against master revision 7be325e. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/741//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. fd483760be6fe714a044e2ac25a35f6a8baa79b8 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## No tests ran.
[jira] [Commented] (TEZ-2440) Sorter should check for indexCacheList.size() in flush()
[ https://issues.apache.org/jira/browse/TEZ-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559991#comment-14559991 ]

TezQA commented on TEZ-2440:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12735421/TEZ-2440-1.patch
against master revision 7be325e.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/741//console

This message is automatically generated.

Sorter should check for indexCacheList.size() in flush()
--------------------------------------------------------

                 Key: TEZ-2440
                 URL: https://issues.apache.org/jira/browse/TEZ-2440
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Rajesh Balamohan
            Assignee: Mit Desai
         Attachments: TEZ-2440-1.patch

{noformat}
2015-05-11 20:28:20,225 INFO [main] task.TezTaskRunner: Shutdown requested... returning
2015-05-11 20:28:20,225 INFO [main] task.TezChild: Got a shouldDie notification via hearbeats. Shutting down
2015-05-11 20:28:20,231 INFO [TezChild] impl.PipelinedSorter: Thread interrupted, cleaned up stale data, sorter threads shutdown=true, terminated=false
2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Joining on EventRouter
2015-05-11 20:28:20,231 INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: Ignoring interrupt while waiting for the router thread to die
2015-05-11 20:28:20,232 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1429683757595_0875_1_07_00_0
java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.elementData(ArrayList.java:418)
	at java.util.ArrayList.get(ArrayList.java:431)
	at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.flush(PipelinedSorter.java:462)
	at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:183)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:360)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{noformat}

When a DAG is killed in the middle of a run, these exceptions are sometimes thrown (e.g. q_17 in TPC-DS). Even though the error is completely harmless, it would be better to fix it to avoid distraction when debugging.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
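The check the ticket title asks for can be sketched as a simple empty-list guard before the last-index access in flush(). This is only a hedged illustration of the idea, not the contents of TEZ-2440-1.patch: the class below is a simplified stand-in for PipelinedSorter, and its field and return values are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for PipelinedSorter (names and shapes are illustrative only).
// The stack trace above shows flush() indexing with -1: when a DAG is killed before
// any spill is recorded, indexCacheList is empty and get(size() - 1) throws
// ArrayIndexOutOfBoundsException.
class SorterSketch {
    final List<String> indexCacheList = new ArrayList<>(); // spill index records (simplified)

    String flush() {
        // The guard the ticket proposes: nothing was spilled (e.g. the task was
        // interrupted by a shouldDie notification), so there is nothing to merge.
        if (indexCacheList.isEmpty()) {
            return "nothing-to-flush";
        }
        // Safe now: size() >= 1, so the last-index access cannot go negative.
        return indexCacheList.get(indexCacheList.size() - 1);
    }
}
```

With the guard in place, an interrupted task simply skips the merge instead of surfacing a harmless but distracting exception in the logs.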
[jira] [Comment Edited] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
[ https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559463#comment-14559463 ]

Hitesh Shah edited comment on TEZ-2484 at 5/26/15 5:35 PM:
-----------------------------------------------------------

This is related to how Hive uses Tez sessions. There is no 1:1 relationship between a YARN application and a Hive query (multiple queries can be run within a single YARN application), hence the application status cannot be mapped to the failure of one of the queries that ran within a given Tez application on YARN.

was (Author: hitesh): This is related to how Hive uses Tez sessions. There is no 1:1 relationship between a YARN application and a Hive query, hence the application status cannot be mapped to the failure of one of the queries that ran within a given Tez application on YARN.

Tez vertex for Hive fails but Resource Manager reports job succeeded
--------------------------------------------------------------------

                 Key: TEZ-2484
                 URL: https://issues.apache.org/jira/browse/TEZ-2484
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
         Environment: HDP 2.2.4.2
            Reporter: Hari Sekhon
         Attachments: Tez_RM_misreporting_succeeded.png

When running a Hive on Tez job via the Hive CLI, the job fails with the error shown below, but the Resource Manager shows the job as Succeeded even though it has clearly failed:

{code}
Status: Running (Executing on YARN cluster with App id application_1432310690008_0103)

VERTICES      STATUS   TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1         FAILED    1478          0        0     1478       1    1477

VERTICES: 00/01  [----------------------------] 0%  ELAPSED TIME: 1589.41 s

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[Container container_e122_1432310690008_0103_01_94 received a STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
{code}
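The session semantics in the comment above can be illustrated with a toy model. Everything here is a hypothetical sketch, not a Tez API: one YARN application hosts many DAGs, so the application's final status reflects whether the session shut down cleanly, while per-query outcomes must be read from the individual DAGs (in real Tez, via something like DAGClient's status query rather than the RM).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Hive-on-Tez session mode (all names are hypothetical illustrations).
// Many queries (DAGs) run inside one long-lived YARN application, so a failed
// query does not fail the application the Resource Manager reports on.
class TezSessionModel {
    enum DagState { SUCCEEDED, FAILED }

    private final List<DagState> dags = new ArrayList<>();

    // Each Hive query becomes one DAG submitted to the shared session.
    void submitDag(DagState outcome) {
        dags.add(outcome);
    }

    // What the RM sees: only the application. A clean session shutdown reports
    // SUCCEEDED regardless of individual DAG outcomes -- exactly the mismatch
    // the reporter observed.
    String applicationFinalStatus() {
        return "SUCCEEDED";
    }

    // Per-query status lives at the DAG level, not the application level.
    DagState dagStatus(int index) {
        return dags.get(index);
    }
}
```

The model makes the "Invalid" resolution concrete: the RM screenshot and the CLI failure are describing two different objects, an application and one DAG inside it.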
[jira] [Commented] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
[ https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559463#comment-14559463 ]

Hitesh Shah commented on TEZ-2484:
----------------------------------

This is related to how Hive uses Tez sessions. There is no 1:1 relationship between a YARN application and a Hive query, hence the application status cannot be mapped to the failure of one of the queries that ran within a given Tez application on YARN.
[jira] [Updated] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
[ https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated TEZ-2484:
-----------------------------
    Attachment: Tez_RM_misreporting_succeeded.png

Attaching a screenshot of the YARN Resource Manager line showing this Tez job being incorrectly reported as Succeeded despite the failure output in the user session.
[jira] [Resolved] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
[ https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah resolved TEZ-2484.
------------------------------
    Resolution: Invalid
[jira] [Created] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
Hari Sekhon created TEZ-2484:
-----------------------------

             Summary: Tez vertex for Hive fails but Resource Manager reports job succeeded
                 Key: TEZ-2484
                 URL: https://issues.apache.org/jira/browse/TEZ-2484
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
         Environment: HDP 2.2.4.2
            Reporter: Hari Sekhon