[jira] [Commented] (TEZ-2209) Fix pipelined shuffle to fetch data from any one attempt
[ https://issues.apache.org/jira/browse/TEZ-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370823#comment-14370823 ]

Siddharth Seth commented on TEZ-2209:
-------------------------------------

Minor stuff.
- shuffleInfoEventsMap in ShuffleManager should be a ConcurrentMap - it can be accessed from multiple threads, and outside of any synchronization. Not required in ShuffleScheduler though, since that's synchronized. Missed this in the earlier review for pipelined shuffle.
- In the reportFatalError invocation - it'll be useful to add the currently registered attemptNumber, and the one which caused the error.

The rest looks good to me.

Fix pipelined shuffle to fetch data from any one attempt
--------------------------------------------------------

Key: TEZ-2209
URL: https://issues.apache.org/jira/browse/TEZ-2209
Project: Apache Tez
Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attachments: TEZ-2209.1.patch, TEZ-2209.2.patch, TEZ-2209.3.patch

- Currently, pipelined shuffle will fail-fast the moment it receives data from an attempt other than 0. This was done as an add-on check to prevent data being copied from speculated attempts.
- However, in some scenarios (like LLAP), it could be possible that the task attempt gets killed even before generating any data. In such cases, attempt #1 or later attempts would generate the actual data.
- This jira is created to allow pipelined shuffle to download data from any one attempt.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
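Both review points above can be illustrated with a minimal sketch. This is not the code from the TEZ-2209 patches: the class name ShuffleEventRegistrySketch, its methods, and the Integer-to-Integer map layout are assumptions made for illustration only. It shows a ConcurrentMap being used so that the first attempt seen for an input is registered atomically, and an error message that names both the registered attempt and the offending one.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the map discussed in the review; not Tez code.
final class ShuffleEventRegistrySketch {
    // input index -> attempt number that data is being fetched from
    private final ConcurrentMap<Integer, Integer> registeredAttempt =
            new ConcurrentHashMap<>();

    /** Returns true if attemptNumber is, or atomically becomes, the registered attempt. */
    boolean register(int inputIndex, int attemptNumber) {
        // putIfAbsent is atomic, so two threads cannot both "win" for the same input.
        Integer previous = registeredAttempt.putIfAbsent(inputIndex, attemptNumber);
        return previous == null || previous == attemptNumber;
    }

    /** Builds a diagnostic naming both the registered and the offending attempt. */
    String describeConflict(int inputIndex, int offendingAttempt) {
        return "input=" + inputIndex
                + ", registeredAttempt=" + registeredAttempt.get(inputIndex)
                + ", offendingAttempt=" + offendingAttempt;
    }

    public static void main(String[] args) {
        ShuffleEventRegistrySketch registry = new ShuffleEventRegistrySketch();
        registry.register(0, 1);
        if (!registry.register(0, 2)) {
            System.out.println(registry.describeConflict(0, 2));
        }
    }
}
```

A plain HashMap would need external locking around the check-then-put; the ConcurrentMap keeps that step atomic without it.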
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2204: Attachment: TEZ-2204-1.patch TestAMRecovery increasingly flaky on jenkins builds. - Key: TEZ-2204 URL: https://issues.apache.org/jira/browse/TEZ-2204 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-2204-1.patch In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371062#comment-14371062 ]

Jeff Zhang commented on TEZ-2204:
---------------------------------

Uploaded patch. [~hitesh] [~bikassaha] Please help review it.

Two potential deadlocks:
* Related to YARN-2917. Tez's AsyncDispatcher doesn't integrate its patch.
* Deadlock in DAGAppMaster, between the methods DAGAppMaster::handle and DAGAppMaster::stopService. When stopService is called, it stops the AsyncDispatcher, and the AsyncDispatcher then drains its events, which may call DAGAppMaster::handle. Both handle() and stopService() have the synchronized keyword, so the dispatcher thread can block on the monitor that stopService is holding while stopService waits for the drain to finish.

TestAMRecovery increasingly flaky on jenkins builds.
----------------------------------------------------

Key: TEZ-2204
URL: https://issues.apache.org/jira/browse/TEZ-2204
Project: Apache Tez
Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
Attachments: TEZ-2204-1.patch

In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out.
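The second deadlock can be avoided in more than one way; the sketch below shows one generic option and is not taken from TEZ-2204-1.patch. All names (DrainOutsideLockSketch, drainLoop, submit) are hypothetical. The idea is that stop() only holds the monitor long enough to flip a flag, and waits for the dispatcher thread to drain outside of it, so the synchronized handle() can still run during the drain.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature of an app master with a draining dispatcher; not Tez code.
final class DrainOutsideLockSketch {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
    private final AtomicInteger handled = new AtomicInteger();
    private volatile boolean stopping;
    private final Thread dispatcher = new Thread(this::drainLoop, "dispatcher");

    // Keeps dispatching until stop was requested AND the queue is empty (drain).
    private void drainLoop() {
        try {
            while (!stopping || !events.isEmpty()) {
                Runnable event = events.poll(10, TimeUnit.MILLISECONDS);
                if (event != null) {
                    event.run();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Needs the monitor, like the synchronized handle() described above.
    synchronized void handle() {
        handled.incrementAndGet();
    }

    void start() {
        dispatcher.start();
    }

    void submit() {
        events.add(this::handle);
    }

    void stop() {
        synchronized (this) {
            stopping = true; // short critical section only
        }
        try {
            // Joining while holding the monitor would deadlock as soon as a
            // queued event calls handle(); here the monitor is already released.
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int handledCount() {
        return handled.get();
    }

    public static void main(String[] args) {
        DrainOutsideLockSketch app = new DrainOutsideLockSketch();
        app.start();
        for (int i = 0; i < 100; i++) {
            app.submit();
        }
        app.stop();
        System.out.println("handled=" + app.handledCount());
    }
}
```

If stop() were declared synchronized as a whole, the join would wait on a thread that is itself waiting for the monitor held by stop().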
Failed: TEZ-2204 PreCommit Build #318
Jira: https://issues.apache.org/jira/browse/TEZ-2204 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/318/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2755 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705879/TEZ-2204-1.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 4 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/318//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/318//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/318//artifact/patchprocess/newPatchFindbugsWarningstez-common.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/318//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 0310fbd24a7e22aad56bda62c69f2f57d92cd884 logged out == == Finished build. 
== == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #315 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2578137 bytes Compression is 7.1% Took 1.1 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372307#comment-14372307 ] Hitesh Shah commented on TEZ-2217: -- [~gopalv] Could you clarify on the bit about the AM has received a soft pre-emption message. Any suggestions on potential approach for the AM inferring that the cluster no longer has additional available resources for the AM and should now start releasing held containers? The min-held-containers constraint is not enforced during query runtime Key: TEZ-2217 URL: https://issues.apache.org/jira/browse/TEZ-2217 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Gopal V Assignee: Bikas Saha Attachments: TEZ-2217.txt.bz2 The min-held containers constraint is respected during query idle times, but is not respected when a query is actually in motion. The AM releases unused containers during dag execution without checking for min-held containers. {code} 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing container, containerId=container_1424502260528_1348_01_13, containerExpiryTime=1426891313264, idleTimeoutMin=5000 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1424502260528_1348_01_13 {code} This is actually useful only after the AM has received a soft pre-emption message, doing it on an idle cluster slows down one of the most common query patterns in BI systems. {code} create temporary table smalltable as ...; select ... bigtable JOIN smalltable ON ...; {code} The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372361#comment-14372361 ] Gopal V commented on TEZ-2217: -- The performance bug occurs within the queue, that is a straight-forward failure to utilize available resources properly. I can work-around this issue by upping the max-delay for releasing containers, but that's where the pre-emption scenario is relevant - that approach isn't good on a busy cluster with a lot of applications. I don't want to hurt the performance or concurrency considerations. The AM-RM allocate responses contain the pre-emption messages, which should be a good enough indicator that a certain fraction of currently held resources will be removed soon. The pre-emption period is the gap between that event and the event of termination, in which period the min-held rules will not be enforced - to protect the ones actually doing work, idle containers should be allowed to expire (still following the min-max delay curve). I suspect the pre-emption contracts do not expose that sunset period for the ill-fated containers, which makes it slightly harder to do this directly off the message - perhaps there is one I can't find ? That makes the system both performant on idle cluster, but handles the mid-query bottlenecking that is common on busy clusters - particularly if the container expiry is smaller than the sunset period for pre-empted containers. The min-held-containers constraint is not enforced during query runtime Key: TEZ-2217 URL: https://issues.apache.org/jira/browse/TEZ-2217 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Gopal V Assignee: Bikas Saha Attachments: TEZ-2217.txt.bz2 The min-held containers constraint is respected during query idle times, but is not respected when a query is actually in motion. The AM releases unused containers during dag execution without checking for min-held containers. 
{code} 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing container, containerId=container_1424502260528_1348_01_13, containerExpiryTime=1426891313264, idleTimeoutMin=5000 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1424502260528_1348_01_13 {code} This is actually useful only after the AM has received a soft pre-emption message, doing it on an idle cluster slows down one of the most common query patterns in BI systems. {code} create temporary table smalltable as ...; select ... bigtable JOIN smalltable ON ...; {code} The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
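The release rule proposed in the comment above can be written down as a small predicate. This is a sketch under assumptions, not YarnTaskSchedulerService code: the class IdleReleasePolicySketch and its parameters are hypothetical, and "preemptionPending" stands for whatever signal the AM derives from the pre-emption messages in the AM-RM allocate responses.

```java
// Hypothetical policy object; names and parameters are illustrative only.
final class IdleReleasePolicySketch {
    private final int minHeldContainers;

    IdleReleasePolicySketch(int minHeldContainers) {
        this.minHeldContainers = minHeldContainers;
    }

    /**
     * Decides whether an idle container may be released.
     *
     * @param heldContainers     containers currently held by the AM
     * @param idleTimeoutExpired whether this container's idle timeout has expired
     * @param preemptionPending  whether a pre-emption message has been received
     */
    boolean shouldRelease(int heldContainers, boolean idleTimeoutExpired,
                          boolean preemptionPending) {
        if (!idleTimeoutExpired) {
            return false; // still inside the min-max delay curve
        }
        if (preemptionPending) {
            return true; // min-held is not enforced during the pre-emption period
        }
        // Otherwise never drop below the configured floor.
        return heldContainers > minHeldContainers;
    }

    public static void main(String[] args) {
        IdleReleasePolicySketch policy = new IdleReleasePolicySketch(3);
        System.out.println(policy.shouldRelease(3, true, false)); // at the floor, idle cluster
        System.out.println(policy.shouldRelease(3, true, true));  // pre-emption pending
    }
}
```

With this rule the small query in the BI pattern would keep the pre-warmed containers on an idle cluster, while a busy cluster still gets idle containers back once pre-emption has been signalled.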
[jira] [Updated] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2210: Attachment: TEZ-2210.4.patch Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch, TEZ-2210.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2218) Turn on speculation by default
Bikas Saha created TEZ-2218: --- Summary: Turn on speculation by default Key: TEZ-2218 URL: https://issues.apache.org/jira/browse/TEZ-2218 Project: Apache Tez Issue Type: Sub-task Reporter: Bikas Saha Assignee: Bikas Saha -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372466#comment-14372466 ] Hadoop QA commented on TEZ-2210: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706106/TEZ-2210.4.patch against master revision 6e15b2f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/323//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/323//console This message is automatically generated. Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch, TEZ-2210.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2210 PreCommit Build #323
Jira: https://issues.apache.org/jira/browse/TEZ-2210 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/323/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2762 lines...] [INFO] Final Memory: 72M/1029M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706106/TEZ-2210.4.patch against master revision 6e15b2f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/323//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/323//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. a623705000b5e2ffcba9e6c8cc69bb851de1ce51 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #320 Archived 44 artifacts Archive block size is 32768 Received 2 blocks and 2682311 bytes Compression is 2.4% Took 1.3 sec Description set: TEZ-2210 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Created] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
Gopal V created TEZ-2217: Summary: The min-held-containers constraint is not enforced during query runtime Key: TEZ-2217 URL: https://issues.apache.org/jira/browse/TEZ-2217 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Gopal V Assignee: Bikas Saha The min-held containers constraint is respected during query idle times, but is not respected when a query is actually in motion. The AM releases unused containers during dag execution without checking for min-held containers. {code} 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing container, containerId=container_1424502260528_1348_01_13, containerExpiryTime=1426891313264, idleTimeoutMin=5000 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1424502260528_1348_01_13 {code} This is actually useful only after the AM has received a soft pre-emption message, doing it on an idle cluster slows down one of the most common query patterns in BI systems. {code} create temporary table smalltable as ...; select ... bigtable JOIN smalltable ON ...; {code} The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2216) Expose errors during AM initialization
[ https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372415#comment-14372415 ]

Jeff Zhang edited comment on TEZ-2216 at 3/21/15 1:20 AM:
----------------------------------------------------------

This could be done to a certain extent. It depends on 2 things:
* whether DAGClientServer is started ( Client gets diagnostics from DAGClientServer )
* whether YarnTaskSchedulerService is started ( YarnTaskScheduler will unregister itself from the RM and push diagnostics to the RM; Client gets diagnostics from the RM ).

For errors during AM initialization, maybe YarnTaskSchedulerService is the key.

was (Author: zjffdu):
This could be done to a certain extent. It depends on 2 things:
* whether DAGClientServer is started ( Client gets diagnostics from DAGClientServer )
* whether YarnTaskSchedulerService is started ( YarnTaskScheduler will unregister itself from the RM and push diagnostics to the RM; Client gets diagnostics from the RM )

Expose errors during AM initialization
--------------------------------------

Key: TEZ-2216
URL: https://issues.apache.org/jira/browse/TEZ-2216
Project: Apache Tez
Issue Type: Bug
Reporter: Bikas Saha

If there are bad configs or other issues that cause errors/exceptions during AM initialization (e.g. during service init) then those errors are not exposed to the user. Exposing them would be useful in quickly debugging such issues.
[jira] [Commented] (TEZ-2216) Expose errors during AM initialization
[ https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372415#comment-14372415 ]

Jeff Zhang commented on TEZ-2216:
---------------------------------

This could be done to a certain extent. It depends on 2 things:
* whether DAGClientServer is started ( Client gets diagnostics from DAGClientServer )
* whether YarnTaskSchedulerService is started ( YarnTaskScheduler will unregister itself from the RM and push diagnostics to the RM; Client gets diagnostics from the RM )

Expose errors during AM initialization
--------------------------------------

Key: TEZ-2216
URL: https://issues.apache.org/jira/browse/TEZ-2216
Project: Apache Tez
Issue Type: Bug
Reporter: Bikas Saha

If there are bad configs or other issues that cause errors/exceptions during AM initialization (e.g. during service init) then those errors are not exposed to the user. Exposing them would be useful in quickly debugging such issues.
[jira] [Updated] (TEZ-2163) Task status update should be handled in the START_WAIT state
[ https://issues.apache.org/jira/browse/TEZ-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2163:
--------------------------------
Issue Type: Bug (was: Sub-task)
Parent: (was: TEZ-2149)

Task status update should be handled in the START_WAIT state
------------------------------------------------------------

Key: TEZ-2163
URL: https://issues.apache.org/jira/browse/TEZ-2163
Project: Apache Tez
Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Jeff Zhang
Priority: Critical
Attachments: TEZ-2163-1.patch, TEZ-2163-2.patch

It's possible for a task to send in a STATUS_UPDATE before the TA_STARTED_REMOTELY message is processed within the AM.
{code}
2015-02-27 13:21:15,491 ERROR [Dispatcher thread: Central] impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1424502260528_0177_5_03_000223_0
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_STATUS_UPDATE at START_WAIT
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:670)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:112)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1835)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1820)
	at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
	at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
	at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Updated] (TEZ-2163) Task status update should be handled in the START_WAIT state
[ https://issues.apache.org/jira/browse/TEZ-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2163:
--------------------------------
Closing this out as Won't fix.

Task status update should be handled in the START_WAIT state
------------------------------------------------------------

Key: TEZ-2163
URL: https://issues.apache.org/jira/browse/TEZ-2163
Project: Apache Tez
Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Jeff Zhang
Priority: Critical
Attachments: TEZ-2163-1.patch, TEZ-2163-2.patch

It's possible for a task to send in a STATUS_UPDATE before the TA_STARTED_REMOTELY message is processed within the AM.
{code}
2015-02-27 13:21:15,491 ERROR [Dispatcher thread: Central] impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1424502260528_0177_5_03_000223_0
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_STATUS_UPDATE at START_WAIT
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:670)
	at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:112)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1835)
	at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1820)
	at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
	at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
	at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Commented] (TEZ-2218) Turn on speculation by default
[ https://issues.apache.org/jira/browse/TEZ-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372450#comment-14372450 ] Siddharth Seth commented on TEZ-2218: - Are there any pre-conditions for this, testing required etc ? We may be better leaving it off by default - especially for single node instances, and allow this to be enabled per site. Progress reporting from tasks being one of the concerns. Turn on speculation by default -- Key: TEZ-2218 URL: https://issues.apache.org/jira/browse/TEZ-2218 Project: Apache Tez Issue Type: Sub-task Reporter: Bikas Saha Assignee: Bikas Saha -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372409#comment-14372409 ]

Bikas Saha commented on TEZ-2210:
---------------------------------

bq. AsyncDispatcher should remain as it is not meant for use outside of the Tez project.
Not sure what you mean here. I made no change other than explicitly marking it @private.

bq. if (null == counters) {
I am not changing that part of the code. That is legacy code, probably from MR, that is expected to update a shared counter object.

bq. protected long getElapsedGc() is not thread-safe.
Again, legacy code that I am not touching in this patch. From what I see, it's only called from TaskCounterUpdater, which effectively rules out concurrent calls.

bq. restoreFromEvent probably does not need a synchronized.
Yes. As of now it does not, because it's called during recovery. I am removing it since that has caused the new spurious findbugs message.

bq. Can ManagementFactory.getGarbageCollectorMXBeans() ever return null?
It probably does not, or else GcTimeUpdater would have been broken all this while.

bq. Is private DAGStatus.State getDAGStatusFromState(
Eclipse showed this as unreferenced code. No complaints on removing it.

I will remove the new synchronized and upload the patch for Jenkins.

Record DAG AM CPU usage stats
-----------------------------

Key: TEZ-2210
URL: https://issues.apache.org/jira/browse/TEZ-2210
Project: Apache Tez
Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371273#comment-14371273 ] Jeff Zhang commented on TEZ-2204: - The findbug issue should be OK. TestAMRecovery increasingly flaky on jenkins builds. - Key: TEZ-2204 URL: https://issues.apache.org/jira/browse/TEZ-2204 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2214:
----------------------------------
Attachment: TEZ-2214.1.patch

[~sseth], [~hitesh] - Please review when you have some time.

FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
------------------------------------------------------------------------------------------

Key: TEZ-2214
URL: https://issues.apache.org/jira/browse/TEZ-2214
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Attachments: TEZ-2214.1.patch

Scenario:
- commitMemory and usedMemory are beyond their allowed threshold.
- InMemoryMerge kicks off and is in the process of flushing memory contents to disk.
- As it progresses, it releases memory segments as well (but the merge is not yet over).
- Fetchers that need memory below maxSingleShuffleLimit get scheduled.
- If fetchers are fast, this quickly adds up to commitMemory and usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge().
- Pretty soon all fetchers would be stalled and get into the following state.
{noformat}
Thread 9351: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame)
 - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame)
{noformat}
- Even if InMemoryMerger completes, commitedMem and usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely.
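The scenario above can be replayed with a single-threaded toy model. This is a sketch under stated assumptions and does not reflect the contents of TEZ-2214.1.patch: the class MergeBudgetSketch, the limits 100 and 90, and the "re-check the threshold when a merge finishes" remedy are all hypothetical. In the model, a merge is only ever triggered from commit(), so data committed while a merge is running is never looked at again unless something re-checks it.

```java
// Toy model of the memory accounting described above; not MergeManager code.
final class MergeBudgetSketch {
    private final long memoryLimit;
    private final long mergeThreshold;
    private final boolean recheckOnMergeFinish; // the hypothetical remedy
    private long usedMemory;
    private long commitMemory;
    private boolean mergeInProgress;
    private int mergesStarted;

    MergeBudgetSketch(long memoryLimit, long mergeThreshold, boolean recheckOnMergeFinish) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
        this.recheckOnMergeFinish = recheckOnMergeFinish;
    }

    /** A fetcher asks for memory; false means it has to stall and wait. */
    boolean reserve(long size) {
        if (usedMemory > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** A fetcher finished copying; this is the only normal merge trigger. */
    void commit(long size) {
        commitMemory += size;
        maybeStartMerge();
    }

    /** The running merge frees segments before it is marked finished. */
    void release(long size) {
        usedMemory -= size;
    }

    void mergeFinished() {
        mergeInProgress = false;
        if (recheckOnMergeFinish) {
            maybeStartMerge(); // pick up data committed during the merge
        }
    }

    private void maybeStartMerge() {
        if (!mergeInProgress && commitMemory >= mergeThreshold) {
            mergeInProgress = true;
            mergesStarted++;
            commitMemory = 0;
        }
    }

    int mergesStarted() {
        return mergesStarted;
    }

    /** Replays the scenario with limit 100 and merge threshold 90. */
    static MergeBudgetSketch runScenario(boolean recheckOnMergeFinish) {
        MergeBudgetSketch m = new MergeBudgetSketch(100, 90, recheckOnMergeFinish);
        m.reserve(95);
        m.commit(95);      // crosses the threshold, merge 1 starts
        m.release(95);     // merge frees segments but is not finished yet
        m.reserve(60);
        m.commit(60);
        m.reserve(60);
        m.commit(60);      // 120 committed, but a merge is already running
        m.mergeFinished();
        return m;
    }

    public static void main(String[] args) {
        MergeBudgetSketch stuck = runScenario(false);
        System.out.println("merges=" + stuck.mergesStarted()
                + " canReserve=" + stuck.reserve(10));
        System.out.println("merges with re-check=" + runScenario(true).mergesStarted());
    }
}
```

Without the re-check, the model ends with usedMemory above the limit, no merge running, and no fetcher able to commit, which matches the stalled state in the thread dump.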
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371255#comment-14371255 ] Hadoop QA commented on TEZ-2204: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705906/TEZ-2204-2.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/319//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/319//artifact/patchprocess/newPatchFindbugsWarningstez-common.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/319//console This message is automatically generated. TestAMRecovery increasingly flaky on jenkins builds. - Key: TEZ-2204 URL: https://issues.apache.org/jira/browse/TEZ-2204 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2204 PreCommit Build #319
Jira: https://issues.apache.org/jira/browse/TEZ-2204 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/319/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2753 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705906/TEZ-2204-2.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/319//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/319//artifact/patchprocess/newPatchFindbugsWarningstez-common.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/319//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 82da12738622ddd9da36401cbe820cdc0fe397c9 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #315 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2530138 bytes Compression is 7.2% Took 1.8 sec [description-setter] Could not determine description. 
Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah updated TEZ-2214:
-----------------------------
Target Version/s: 0.7.0, 0.5.4, 0.6.1 (was: 0.7.0)

FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
------------------------------------------------------------------------------------------

Key: TEZ-2214
URL: https://issues.apache.org/jira/browse/TEZ-2214
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attachments: TEZ-2214.1.patch

Scenario:
- commitMemory and usedMemory are beyond their allowed threshold.
- InMemoryMerge kicks off and is in the process of flushing memory contents to disk.
- As it progresses, it releases memory segments as well (but the merge is not yet over).
- Fetchers that need memory below maxSingleShuffleLimit get scheduled.
- If fetchers are fast, this quickly adds up to commitMemory and usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge().
- Pretty soon all fetchers would be stalled and get into the following state.
{noformat}
Thread 9351: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
 - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame)
 - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame)
{noformat}
- Even if InMemoryMerger completes, commitedMem and usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely.
[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372486#comment-14372486 ] Hitesh Shah commented on TEZ-2210: -- bq. Not sure what you mean here. I made no change other than explicitly mark it @private My mistake. Hurried review meant that I read the change in reverse :). Yes, should be marked private. bq. protected long getElapsedGc() is not thread-safe. Maybe add a comment even though it is legacy code? Rest looks fine. One gotcha though which needs to be addressed - if someone invokes DAGClient to retrieve the counters while the DAG is in progress, the cpu and gc stats will not show up. This might affect how the GcTimeUpdater is used though for AM stats. For tasks, calling it multiple times works as counters are incremented based on new values from each heartbeat. Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch, TEZ-2210.4.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2211) Tez UI: Allow users to configure timezone
[ https://issues.apache.org/jira/browse/TEZ-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated TEZ-2211: - Component/s: UI Tez UI: Allow users to configure timezone - Key: TEZ-2211 URL: https://issues.apache.org/jira/browse/TEZ-2211 Project: Apache Tez Issue Type: Improvement Components: UI Reporter: Jonathan Eagles -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2047: - Priority: Blocker (was: Major) Build fails against hadoop-2.2 post TEZ-2018 Key: TEZ-2047 URL: https://issues.apache.org/jira/browse/TEZ-2047 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-2047.1.patch Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-dag: Compilation failure: Compilation failure: [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] cannot find symbol [ERROR] symbol : method withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) [ERROR] location: class org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp> [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] cannot find symbol [ERROR] symbol : method getConnectorAddress(int) [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2204: Attachment: (was: TEZ-2204-2.patch) TestAMRecovery increasingly flaky on jenkins builds. - Key: TEZ-2204 URL: https://issues.apache.org/jira/browse/TEZ-2204 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-2204-1.patch In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2204: Attachment: TEZ-2204-2.patch Upload new patch. TestAMRecovery increasingly flaky on jenkins builds. - Key: TEZ-2204 URL: https://issues.apache.org/jira/browse/TEZ-2204 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372097#comment-14372097 ] Hitesh Shah edited comment on TEZ-2210 at 3/20/15 9:01 PM: --- Comments: AsyncDispatcher should remain as it is not meant for use outside of the Tez project. {code} if (null == counters) { return; // nothing to do. } {code} Why will this ever be null? protected long getElapsedGc() is not thread-safe. Can ManagementFactory.getGarbageCollectorMXBeans() ever return null? restoreFromEvent probably does not need a synchronized. I think the findbugs issue might be fixed if the dag counter is updated in mayBeConstructFinalFullCounters() Is private DAGStatus.State getDAGStatusFromState(DAGState finalState) no longer used in the DAGClient code path? was (Author: hitesh): Comments: AsyncDispatcher should remain as it is not meant for use outside of the Tez project. {code} if (null == counters) { return; // nothing to do. } Why will this ever be null? protected long getElapsedGc() is not thread-safe. Can ManagementFactory.getGarbageCollectorMXBeans() ever return null? restoreFromEvent probably does not need a synchronized. I think the findbugs issue might be fixed if the dag counter is updated in mayBeConstructFinalFullCounters() Is private DAGStatus.State getDAGStatusFromState(DAGState finalState) no longer used in the DAGClient code path? Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
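The thread-safety concern raised in the review above can be illustrated with a small sketch. This is not the Tez GcTimeUpdater; the class name GcElapsedTracker is hypothetical, and the null guard is only there because the review asks whether ManagementFactory.getGarbageCollectorMXBeans() can return null (the JDK documents it as returning a list). The sketch assumes the method is meant to return the GC time accumulated since the previous call, and simply serializes callers with synchronized.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration only; not the Tez GcTimeUpdater. */
final class GcElapsedTracker {
  private final List<GarbageCollectorMXBean> gcBeans;
  private long lastGcMillis = 0L; // guarded by "this"

  GcElapsedTracker() {
    List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
    // Defensive guard (assumption): treat a null list as "no collectors".
    this.gcBeans = (beans == null)
        ? Collections.<GarbageCollectorMXBean>emptyList()
        : beans;
  }

  /** Returns the GC milliseconds accumulated since the previous call. */
  synchronized long getElapsedGc() {
    long total = 0L;
    for (GarbageCollectorMXBean bean : gcBeans) {
      long millis = bean.getCollectionTime(); // -1 when undefined for a collector
      if (millis > 0) {
        total += millis;
      }
    }
    long delta = total - lastGcMillis;
    lastGcMillis = total;
    return delta;
  }
}
```

Without the synchronized keyword, two concurrent callers could both read the same lastGcMillis and report the same GC interval twice.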
[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372097#comment-14372097 ] Hitesh Shah commented on TEZ-2210: -- Comments: AsyncDispatcher should remain as it is not meant for use outside of the Tez project. {code} if (null == counters) { return; // nothing to do. } Why will this ever be null? protected long getElapsedGc() is not thread-safe. Can ManagementFactory.getGarbageCollectorMXBeans() ever return null? restoreFromEvent probably does not need a synchronized. I think the findbugs issue might be fixed if the dag counter is updated in mayBeConstructFinalFullCounters() Is private DAGStatus.State getDAGStatusFromState(DAGState finalState) no longer used in the DAGClient code path? Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2205 PreCommit Build #322
Jira: https://issues.apache.org/jira/browse/TEZ-2205 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/322/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2754 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706018/TEZ-2205.wip.patch against master revision 6e15b2f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/322//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/322//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 1dde7251f57d8acac45b58c10df0e44ed7dd1159 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #320 Archived 44 artifacts Archive block size is 32768 Received 2 blocks and 2688105 bytes Compression is 2.4% Took 0.64 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372109#comment-14372109 ] Hadoop QA commented on TEZ-2205: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706018/TEZ-2205.wip.patch against master revision 6e15b2f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/322//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/322//console This message is automatically generated. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372119#comment-14372119 ] Matt Foley commented on TEZ-1923: - Yes please! FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.7.0 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, TEZ-1923.4.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly gets into infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 
2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. e.g debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. 
syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold) {code} Attaching the
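The stall described in TEZ-1923 can be checked against the logged numbers with a small model. This is a sketch under stated assumptions, not the MergeManager code: the class MergeStallCheck and its method names are hypothetical, the merge condition is read as "no merge in progress and commitMemory >= mergeThreshold" (the comparison operators are reconstructed from the logged values), and a fetcher is assumed to wait whenever usedMemory exceeds memoryLimit.

```java
/** Hypothetical model of the TEZ-1923 stall; not the Tez MergeManager. */
final class MergeStallCheck {

  /** Assumed merge trigger: no merge running and commitMemory has reached the threshold. */
  static boolean shouldStartMerge(boolean mergeInProgress, long commitMemory, long mergeThreshold) {
    return !mergeInProgress && commitMemory >= mergeThreshold;
  }

  /** Assumption: fetchers are told to wait while usedMemory exceeds memoryLimit. */
  static boolean fetcherMustWait(long usedMemory, long memoryLimit) {
    return usedMemory > memoryLimit;
  }

  /** Stalled: fetchers must wait, no merge is running, and none will be started. */
  static boolean isStalled(boolean mergeInProgress, long usedMemory, long memoryLimit,
                           long commitMemory, long mergeThreshold) {
    return fetcherMustWait(usedMemory, memoryLimit)
        && !mergeInProgress
        && !shouldStartMerge(mergeInProgress, commitMemory, mergeThreshold);
  }

  public static void main(String[] args) {
    long memoryLimit = 1073741824L;
    long mergeThreshold = 708669632L;
    // {usedMemory, commitMemory} pairs copied from the three log lines in the report.
    long[][] snapshots = {
        {1551867234L, 883028388L},
        {1273349784L, 347296632L},
        {1191994052L, 523155206L},
    };
    for (long[] s : snapshots) {
      System.out.println("usedMemory=" + s[0] + " commitMemory=" + s[1]
          + " stalled=" + isStalled(false, s[0], memoryLimit, s[1], mergeThreshold));
    }
  }
}
```

Under these assumptions the first snapshot would start a merge, while the second and third leave fetchers waiting with nothing scheduled to free memory, which matches the annotations in the report.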
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372045#comment-14372045 ] Hitesh Shah commented on TEZ-2205: -- [~rohini] At the crux of this, there are effectively a couple of issues/questions in terms of what is the logical behavior and what is the expectation? The fix for any approach is probably trivial. Consider the fact that the user configured both YARN and Tez in a conflicting manner. i.e. Configured YARN to disable timeline but made Tez use Timeline. Should Tez: 1) error out due to a conflicting configuration i.e. YARN timeline disabled but Tez ATS logger enabled. 2) should Tez try and use Timeline (even though YARN flag is set to false ) and ignore its failures as needed? This should be ok for the most part except that I think there are some cases in YARN which are not handled cleanly and end up causing the app to error out. Also, there were some behavioral changes in YARN-2375 - see below. 3) Should Tez look for the YARN configuration property and silently ignore the fact that TimelineATSLogger has been configured but it should not be used? Also, FWIW, earlier ( before YARN-2375 ), even though Tez invoked Timeline::postEntities, if the YARN flag was set to false, the YARN library silently dropped the call. (2) is probably something that YARN needs to address. As for Tez, we can go with either (1) or (3). (1) is more clear-cut in terms of making it very clear to the user in terms of how to configure Tez. (3) merely hides the fact that something is wrongly configured. Also, to clarify, part of this stems from what is the yarn.timeline-service.enabled flag meant to be used for? Is it an admin flag to control whether timeline is enabled or disabled for the whole cluster? It currently is a client-side flag that cannot be enforced at all. 
Furthermore, if it is meant to be used on a per job basis, should it then be a tez-specific setting ( which we already have in the form of the class setting ). Last question for [~rohini]: does the issue of disabling ATS stem from the fact that it is a bit hard to disable ATS logging via the service class name property? Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371983#comment-14371983 ] Jonathan Eagles commented on TEZ-2205: -- This seems like a patch that will get the conversation going for what to do for this jira. [~rohini], can you comment from a pig perspective? Essentially, we are deeming it illegal to specify ATSHistoryLogger when timeline service is disabled. This approach gives clients the control of which history logger to use but leaves yarn.timeline-service.enabled as a flag that describes a feature that is available on the cluster. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371984#comment-14371984 ] Chang Li commented on TEZ-2205: --- [~hitesh] I posted a tentative patch which checks whether Tez has its logging service class set to ATSHistoryLoggingService while timeline-service is not enabled. In that case it throws an exception with an alert, stopping the Tez job from launching. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
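The fail-fast approach described in the comment above might look roughly like the following. This is only a sketch and not the attached TEZ-2205.wip.patch: the class HistoryLoggingConfigCheck is hypothetical, a plain Map stands in for the Hadoop Configuration object, and the Tez key name and logger class name are assumptions used for illustration. Only yarn.timeline-service.enabled is taken from the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fail-fast check; a Map stands in for the real configuration object. */
final class HistoryLoggingConfigCheck {
  static final String YARN_TIMELINE_ENABLED = "yarn.timeline-service.enabled";
  // Assumed key and class names, for illustration only.
  static final String TEZ_LOGGING_CLASS = "tez.history.logging.service.class";
  static final String ATS_LOGGER =
      "org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService";

  /** Rejects the conflicting combination: ATS logger selected, timeline service disabled. */
  static void validate(Map<String, String> conf) {
    boolean timelineEnabled =
        Boolean.parseBoolean(conf.getOrDefault(YARN_TIMELINE_ENABLED, "false"));
    String logger = conf.getOrDefault(TEZ_LOGGING_CLASS, "");
    if (ATS_LOGGER.equals(logger) && !timelineEnabled) {
      throw new IllegalStateException(
          "ATS history logging is configured but " + YARN_TIMELINE_ENABLED + " is false");
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(TEZ_LOGGING_CLASS, ATS_LOGGER);
    conf.put(YARN_TIMELINE_ENABLED, "false");
    try {
      validate(conf);
    } catch (IllegalStateException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Failing at submission time makes the misconfiguration visible to the user, at the cost of blocking jobs that previously ran with the conflicting settings.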
[jira] [Updated] (TEZ-2203) Intern strings in tez counters
[ https://issues.apache.org/jira/browse/TEZ-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2203: Attachment: TEZ-2203.2.patch Intern strings in tez counters -- Key: TEZ-2203 URL: https://issues.apache.org/jira/browse/TEZ-2203 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2203.1.patch, TEZ-2203.2.patch Getting per IO counters is possible today. This jira tracks work needed to enabled them by default. Internalizing strings to save memory is one item needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
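As a rough illustration of why interning helps here: counter names repeat across many tasks and inputs/outputs, so equal names that arrive as separate String objects can be collapsed to one shared instance. The sketch below uses only String.intern() from the JDK; the class CounterNameInterning and the counter name are made up for the example and are not taken from the TEZ-2203 patch.

```java
/** Hypothetical demo of string interning for repeated counter names. */
final class CounterNameInterning {

  /** Returns the canonical shared instance for the given counter name. */
  static String internName(String name) {
    return name.intern();
  }

  public static void main(String[] args) {
    // Two equal names that arrive as distinct objects, e.g. after deserialization.
    String a = new String("FILE_BYTES_READ");
    String b = new String("FILE_BYTES_READ");
    System.out.println("distinct before interning: " + (a != b));
    System.out.println("shared after interning: " + (internName(a) == internName(b)));
  }
}
```

The saving comes from holding one copy of each distinct name instead of one copy per counter instance.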
[jira] [Commented] (TEZ-2203) Intern strings in tez counters
[ https://issues.apache.org/jira/browse/TEZ-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371906#comment-14371906 ] Bikas Saha commented on TEZ-2203: - Thanks! Uploading commit patch with commented code removed. Intern strings in tez counters -- Key: TEZ-2203 URL: https://issues.apache.org/jira/browse/TEZ-2203 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2203.1.patch, TEZ-2203.2.patch Getting per IO counters is possible today. This jira tracks work needed to enabled them by default. Internalizing strings to save memory is one item needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2216) Expose errors during AM initialization
Bikas Saha created TEZ-2216: --- Summary: Expose errors during AM initialization Key: TEZ-2216 URL: https://issues.apache.org/jira/browse/TEZ-2216 Project: Apache Tez Issue Type: Bug Reporter: Bikas Saha If there are bad configs or other issues that cause errors/exceptions during AM initialization (eg. during service init) then those errors are not exposed to the user. Exposing them would be useful in quickly debugging such issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372005#comment-14372005 ] Rohini Palaniswamy commented on TEZ-2205: - {code} public void handle(DAGHistoryEvent event) { eventQueue.add(event); } {code} It would be a simple check to not add to the queue if timeline service is disabled. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
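The check suggested in the comment above could be sketched as follows. This is an assumption-laden illustration, not the Tez logging service: GuardedHistoryLogger is a hypothetical class, String stands in for DAGHistoryEvent, and the enabled flag is assumed to have been read once from yarn.timeline-service.enabled at construction time.

```java
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical logger that drops events when the timeline service is disabled. */
final class GuardedHistoryLogger {
  private final boolean timelineEnabled;
  // String stands in for DAGHistoryEvent in this sketch.
  private final LinkedBlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();

  GuardedHistoryLogger(boolean timelineEnabled) {
    this.timelineEnabled = timelineEnabled;
  }

  /** Queues the event only when the timeline service is enabled. */
  void handle(String event) {
    if (!timelineEnabled) {
      return; // silently drop: nothing will be posted to ATS
    }
    eventQueue.add(event);
  }

  /** Number of events waiting to be posted. */
  int pending() {
    return eventQueue.size();
  }
}
```

This corresponds to the "silently ignore" option discussed elsewhere in the thread: the job keeps running, but the user gets no signal that the logger was configured and is not in use.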
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372025#comment-14372025 ] Jonathan Eagles commented on TEZ-2205: -- [~hitesh], do you want to comment on this? [~rohini], there are many ways to approach this. We have all been discussing the pros and cons. I wanted to loop you into the conversation since you may have better insight as to what users can expect. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371951#comment-14371951 ] Bikas Saha commented on TEZ-2210: - Thanks for the review! bq. In GcTimeUpdater, should this be moved to constructor itself? I don't think so. It's only relevant in the incrementCounter method, which is probably tied to some legacy code style where the counters are just passed into this update object for convenience. bq. In DAGAppMaster, cpuPlugin & GcTimeUpdater are always initialized; Do we need the null check in getAMGCTime, getAMCPUTime? The initialization code may fail while initing them or before initing them (while initing something else). Hence those null checks are needed. bq. In DAGAppMaster, initResourceCalculatorPlugins() is called in serviceInit(). If user makes any mistake in configuring the plugin, initResourceCalculatorPlugins() could get RuntimeException? Info might not be available to end user to find out the reason for AM not starting (Could get ExitCodeException exitCode=??) That's possible, but this is a general problem of failing during init and should probably be fixed for this and other cases. Opened TEZ-2216 for this. Updated patch for the findbugs comment. The patch is fine. The comment is spurious. Removing the unnecessary synchronization in the private methods that is causing the spurious comment. Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2210: Attachment: TEZ-2210.3.patch Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated TEZ-2205: -- Attachment: TEZ-2205.wip.patch Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated TEZ-2205: -- Attachment: TEZ-2205.wip.patch Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chang Li updated TEZ-2205: -- Attachment: (was: TEZ-2205.wip.patch) Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats
[ https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371974#comment-14371974 ] Bikas Saha commented on TEZ-2210: - [~hitesh] Can you please take a look? I have removed the sync from the private method that was causing the spurious warning. All invocations of this method are already thread safe except for a public method restoreFromEvent() related to recovery that I have now synchronized. I think that should be fine. Record DAG AM CPU usage stats - Key: TEZ-2210 URL: https://issues.apache.org/jira/browse/TEZ-2210 Project: Apache Tez Issue Type: Task Reporter: Bikas Saha Assignee: Bikas Saha Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371996#comment-14371996 ] Rohini Palaniswamy commented on TEZ-2205: - Wouldn't it be better if ATSHistoryLogger checked the value of the timeline setting and did nothing if it is false? Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366825#comment-14366825 ] Jeff Zhang edited comment on TEZ-1909 at 3/20/15 9:14 AM: --
Attached the new patch to address the review comments. [~hitesh] Please help review.
Apart from the issues in the review comments, I also found one issue in RecoveryService. When draining events before RecoveryService is stopped, I previously treated an event queue size of zero as an indication that all events had been consumed, but that is not true: even if the event queue is empty, an event may still be being processed. I fixed this bug in the new patch the same way AsyncDispatcher does.
bq. the if (skipAllOtherEvents) { check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.
bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs() ?
Only for unit tests. In the new patch, I removed it and initialize the Set in the setup method.
bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails
Done.
bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests.
Fixed.
bq. please replace import com.sun.tools.javac.util.List; with java.util.List
Fixed.
bq. testCorruptedLastRecord should also verify that the dag submitted event was seen.
Done. The test verifies that DAGAppMaster.createDAG is invoked.
Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken with last corrupt record handling.
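The draining problem described in the TEZ-1909 comment (an empty queue does not mean the last event has finished processing) can be illustrated with a small sketch. Everything here is hypothetical: DrainSketch and its methods are made-up names, and the counter-based approach is a swapped-in technique chosen for brevity, not necessarily what the patch or AsyncDispatcher does. The point is only that "drained" has to cover both queued and in-flight events.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative only: waits for queued AND in-flight events before stopping. */
public class DrainSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private int pending = 0; // queued + currently processing, guarded by lock
    private final Thread worker = new Thread(this::loop, "event-handler");

    public DrainSketch() {
        worker.setDaemon(true);
        worker.start();
    }

    public void enqueue(Runnable event) {
        synchronized (lock) {
            pending++; // counted before it becomes visible to the worker
        }
        queue.add(event);
    }

    private void loop() {
        try {
            while (true) {
                Runnable event = queue.take(); // queue may now be empty...
                try {
                    event.run();               // ...while this event is still running
                } catch (RuntimeException e) {
                    // keep the worker alive in this sketch; real code would log
                } finally {
                    synchronized (lock) {
                        pending--;
                        if (pending == 0) {
                            lock.notifyAll();
                        }
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Blocks until no event is queued or being processed. */
    public void awaitDrained() throws InterruptedException {
        synchronized (lock) {
            while (pending > 0) {
                lock.wait();
            }
        }
    }

    public void stop() throws InterruptedException {
        awaitDrained();
        worker.interrupt();
        worker.join();
    }

    public static void main(String[] args) throws InterruptedException {
        DrainSketch service = new DrainSketch();
        AtomicBoolean handled = new AtomicBoolean(false);
        service.enqueue(() -> {
            try {
                Thread.sleep(200); // simulate slow event handling
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handled.set(true);
        });
        service.stop();
        System.out.println("handled before stop returned: " + handled.get());
    }
}
```

A check based on queue.isEmpty() alone could return while the slow event above is still sleeping; counting an event from enqueue until its handler returns closes that window.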