[jira] [Commented] (TEZ-986) Make conf set on DAG and vertex available in jobhistory
[ https://issues.apache.org/jira/browse/TEZ-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377568#comment-14377568 ] Rohini Palaniswamy commented on TEZ-986: bq. viewable in Tez UI after the job completes. This is very essential for debugging jobs. Just wanted to mention that we need this for Pig and it would be good to have it in one of the upcoming releases. While debugging some of the recent issues, I realized that I don't have access to the Pig script if the user ran it from a gateway (for Oozie I can get it from the launcher job), because the pig.script setting is only set on the DAG config, along with a few other settings useful for debugging like the Pig version. For now, I have other workarounds to get this info or resort to asking the user. The vertex config also has some important debugging info like what feature is being run (group by, etc.), input/output dirs, etc. Even for this I can manage in the short term and figure these out from the explain output of the script. But life would be easier if those were shown in the UI. Make conf set on DAG and vertex available in jobhistory --- Key: TEZ-986 URL: https://issues.apache.org/jira/browse/TEZ-986 Project: Apache Tez Issue Type: Sub-task Components: UI Reporter: Rohini Palaniswamy Priority: Blocker Would like to have the conf set on DAG and Vertex 1) viewable in Tez UI after the job completes. This is very essential for debugging jobs. 2) We have processes that parse jobconf.xml from job history (HDFS) and load them into Hive tables for analysis. Would like to have Tez also make all the configuration (byte array) available in job history so that we can similarly parse them. 1) mandates that you store it in HDFS. 2) just means the stored format should be a contract that others can rely on for parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1433#comment-1433 ] Hadoop QA commented on TEZ-2224: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706880/TEZ-2224-1.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/338//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/338//console This message is automatically generated. EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, the event may still been processing. Should fix it like AsyncDispatcher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2224 PreCommit Build #338
Jira: https://issues.apache.org/jira/browse/TEZ-2224 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/338/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2754 lines...] [INFO] Final Memory: 72M/762M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706880/TEZ-2224-1.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/338//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/338//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. d98f08814b5cfb8ee2fbc55e0c5bb28a28e2e817 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #335 Archived 44 artifacts Archive block size is 32768 Received 4 blocks and 2622022 bytes Compression is 4.8% Took 1.3 sec Description set: TEZ-2224 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Comment Edited] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377733#comment-14377733 ] Jeff Zhang edited comment on TEZ-2224 at 3/24/15 11:39 AM: --- Upload patch for this issue. Just like how AsyncDispatcher does. [~hitesh] Please help review it. was (Author: zjffdu): Upload patch for this issue. Just like how AsyncDispatcher does EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, the event may still been processing. Should fix it like AsyncDispatcher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377562#comment-14377562 ] Rohini Palaniswamy commented on TEZ-2205: - I prefer 3). For jobs launched through Oozie it is easy to turn off ATS via an Oozie server-side setting, and this might be required now and then in the near future considering the issues we are facing with ATS. Since tez-site.xml for those jobs comes from HDFS, it is not easy to change the Tez ATS logger (replacing the file on HDFS is more manual and can cause running jobs to fail as the LocalResource timestamp has changed), and so I do not like 1). Also, having to change multiple settings to turn off something is cumbersome. 2) is what is happening now, but the problem I see is that it impacts performance, as time is wasted trying to connect to ATS and failing due to lack of authentication. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch When yarn.timeline-service.enabled=false is set, Tez still tries posting to ATS, but hits an error as the token is not found. It does not fail the job, because of the fix to not fail the job when there is an error posting to ATS, but it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377733#comment-14377733 ] Jeff Zhang commented on TEZ-2224: - Upload patch for this issue. Just like how AsyncDispatcher does EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, the event may still been processing. Should fix it like AsyncDispatcher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2224: Attachment: TEZ-2224-1.patch EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2224: Description: If the event queue is empty, an event may still be being processed. Should fix it like AsyncDispatcher does. EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, an event may still be being processed. Should fix it like AsyncDispatcher does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2221) VertexGroup name should be unique
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378769#comment-14378769 ] Hitesh Shah commented on TEZ-2221: -- Understood, but we should have 2 checks being done instead of re-writing the existing check. {code} dag.createVertexGroup(group_1, v1, v2); try { dag.createVertexGroup(group_1, v2, v3); Assert.fail(); } ... try { dag.createVertexGroup(group_2, v1, v2); Assert.fail(); } ... {code} VertexGroup name should be unique - Key: TEZ-2221 URL: https://issues.apache.org/jira/browse/TEZ-2221 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2221-1.patch VertexGroupCommitStartedEvent and VertexGroupCommitFinishedEvent use the vertex group name to identify the vertex group commit, so vertex groups with the same name will conflict. While in the current equals/hashCode of VertexGroup, both the vertex group name and the member names are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
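For context, a minimal sketch of the duplicate-name guard being discussed; the field and helper names below are illustrative assumptions, not the actual Tez DAG implementation:
{code}
// Hypothetical sketch only: reject a second VertexGroup with an already-used name, regardless
// of its members, since VertexGroupCommitStartedEvent / VertexGroupCommitFinishedEvent
// identify the group commit by name alone.
private final Set<String> vertexGroupNames = new HashSet<String>();

public synchronized VertexGroup createVertexGroup(String name, Vertex... members) {
  if (!vertexGroupNames.add(name)) {
    throw new IllegalStateException("VertexGroup '" + name + "' is already defined in this DAG");
  }
  return addVertexGroupInternal(name, members);   // assumed existing creation path
}
{code}
With a guard like this, both checks in the test above would pass: reusing group_1 fails regardless of members, while group_2 with the same members succeeds.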
[jira] [Created] (TEZ-2226) Disable writing history to timeline if domain creation fails.
Hitesh Shah created TEZ-2226: Summary: Disable writing history to timeline if domain creation fails. Key: TEZ-2226 URL: https://issues.apache.org/jira/browse/TEZ-2226 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378907#comment-14378907 ] Hadoop QA commented on TEZ-2217: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707032/TEZ-2217.2.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/341//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/341//console This message is automatically generated. The min-held-containers constraint is not enforced during query runtime Key: TEZ-2217 URL: https://issues.apache.org/jira/browse/TEZ-2217 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Gopal V Assignee: Bikas Saha Attachments: TEZ-2217-debug.txt.bz2, TEZ-2217.1.patch, TEZ-2217.2.patch, TEZ-2217.txt.bz2 The min-held containers constraint is respected during query idle times, but is not respected when a query is actually in motion. The AM releases unused containers during dag execution without checking for min-held containers. {code} 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing container, containerId=container_1424502260528_1348_01_13, containerExpiryTime=1426891313264, idleTimeoutMin=5000 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1424502260528_1348_01_13 {code} This is actually useful only after the AM has received a soft pre-emption message, doing it on an idle cluster slows down one of the most common query patterns in BI systems. {code} create temporary table smalltable as ...; select ... bigtable JOIN smalltable ON ...; {code} The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime
[ https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2217: Attachment: TEZ-2217.2.patch New patch. The problem was that the expire time was not updated until the min-held-container expire time actually elapsed. But if task requests came in just before the update happened, then in the next allocation cycle the min-held containers would be released because they had just crossed the expire time boundary. It looks like the timing of the next DAG is currently hitting that race condition and probably was not hitting it earlier. Can you please try this out? The min-held-containers constraint is not enforced during query runtime Key: TEZ-2217 URL: https://issues.apache.org/jira/browse/TEZ-2217 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0, 0.7.0 Reporter: Gopal V Assignee: Bikas Saha Attachments: TEZ-2217-debug.txt.bz2, TEZ-2217.1.patch, TEZ-2217.2.patch, TEZ-2217.txt.bz2 The min-held containers constraint is respected during query idle times, but is not respected when a query is actually in motion. The AM releases unused containers during dag execution without checking for min-held containers. {code} 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing container, containerId=container_1424502260528_1348_01_13, containerExpiryTime=1426891313264, idleTimeoutMin=5000 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Releasing unused container: container_1424502260528_1348_01_13 {code} This is actually useful only after the AM has received a soft pre-emption message; doing it on an idle cluster slows down one of the most common query patterns in BI systems. {code} create temporary table smalltable as ...; select ... bigtable JOIN smalltable ON ...; {code} The smaller query in the beginning throws away the pre-warmed capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
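For readers following along, the kind of guard implied by this issue can be sketched as a standalone check; the names below (and the config key mentioned in the comment) are assumptions for illustration, not the contents of TEZ-2217.2.patch:
{code}
// Sketch only: before releasing a container whose idle timeout expired, honor the
// min-held-containers floor (e.g. tez.am.session.min.held-containers) so pre-warmed
// capacity survives the gap between DAGs instead of being thrown away.
static boolean canReleaseIdleContainer(int heldContainerCount, int minHeldContainers) {
  // At or below the floor, keep the container and push its expiry time forward.
  return heldContainerCount > minHeldContainers;
}
{code}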
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378034#comment-14378034 ] Chang Li commented on TEZ-2205: --- [~hitesh] So should I implement 3) by switching to SimpleHistory, or by wrapping each post call with an if statement, or by just not adding post events to the event queue? Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch When yarn.timeline-service.enabled=false is set, Tez still tries posting to ATS, but hits an error as the token is not found. It does not fail the job, because of the fix to not fail the job when there is an error posting to ATS, but it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2214: -- Attachment: TEZ-2214.2.patch FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging -- Key: TEZ-2214 URL: https://issues.apache.org/jira/browse/TEZ-2214 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch Scenario: - commitMemory usedMemory are beyond their allowed threshold. - InMemoryMerge kicks off and is in the process of flushing memory contents to disk - As it progresses, it releases memory segments as well (but not yet over). - Fetchers who need memory maxSingleShuffleLimit, get scheduled. - If fetchers are fast, this quickly adds up to commitMemory usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge(). - Pretty soon all fetchers would be stalled and get into the following state. {noformat} Thread 9351: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Compiled frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame) {noformat} - Even if InMemoryMerger completes, commitedMem usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379013#comment-14379013 ] Hitesh Shah commented on TEZ-2205: -- A log.warn should be sufficient for now. SimpleHistory comes with its own problems of configuring an HDFS location and cleaning it up on a regular basis. Looks like we have agreement. [~lichangleo] Does the above clarify the implementation requirements for this JIRA? Thanks for the patience in handling the various run-arounds with respect to the design :) Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch When yarn.timeline-service.enabled=false is set, Tez still tries posting to ATS, but hits an error as the token is not found. It does not fail the job, because of the fix to not fail the job when there is an error posting to ATS, but it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
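Putting the pieces of this thread together, the agreed behavior could look roughly like the sketch below; the handler method and its surrounding class are assumed for illustration and are not the final patch, only the YarnConfiguration constants are known API:
{code}
// Sketch: decide once, at service start, whether history events may be posted to ATS.
boolean atsEnabled =
    conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
if (!atsEnabled) {
  // Option 3): no fallback to SimpleHistory, just warn once and drop ATS posting.
  LOG.warn("Tez is configured to log history to ATS but yarn.timeline-service.enabled=false; "
      + "history events will not be posted to the timeline server");
}

// Guard every post so no time is wasted connecting to ATS and failing authentication.
void handle(DAGHistoryEvent event) {
  if (!atsEnabled) {
    return;                 // never enqueue or post when the timeline service is off
  }
  eventQueue.add(event);    // existing queue drained by the ATS posting thread
}
{code}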
[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379039#comment-14379039 ] Jeff Zhang commented on TEZ-2224: - bq. Is there a reason why we want to prevent new events from being processed on a shutdown? No special reason for that; I just borrowed the code from AsyncDispatcher. But I think you are right: AsyncDispatcher is for the general use case, while here for RecoveryService we need to handle events even when it is stopped. bq. TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT being false by default is probably wrong. For a real-world scenario, as many pending events as are seen and can be processed should be processed. Before this flag existed, by default we wouldn't process any pending events; was there any consideration behind that at the time? (Not sure, maybe a performance consideration, like in non-session mode where the DAG is finished while recovery event handling is not yet completed but is no longer necessary.) I tried to be conservative and keep the behavior the same as before. EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, an event may still be being processed. Should fix it like AsyncDispatcher does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
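As a reference point, a condensed sketch of the AsyncDispatcher-style bookkeeping being borrowed here, with assumed names rather than the attached patch: an empty queue alone is not enough, the handler thread must also report that the event it dequeued has been fully written out before stop returns.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Condensed sketch with assumed names; mirrors AsyncDispatcher's 'drained' flag.
class RecoveryEventDrainSketch {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<Object>();
  private volatile boolean stopped = false;
  private volatile boolean drained = true;   // false while an event is being handled

  // Event handling thread: keep draining after stop so pending recovery events are not lost.
  void eventHandlingLoop() throws InterruptedException {
    while (!stopped || !eventQueue.isEmpty()) {
      Object event = eventQueue.poll(100, TimeUnit.MILLISECONDS);
      if (event != null) {
        drained = false;
        handleRecoveryEvent(event);          // write the event to the recovery stream
        drained = eventQueue.isEmpty();
      } else {
        drained = true;
      }
    }
  }

  // serviceStop(): an empty queue does not prove the last event was consumed, so wait on 'drained'.
  void stop() throws InterruptedException {
    stopped = true;
    while (!drained) {
      Thread.sleep(10);
    }
  }

  private void handleRecoveryEvent(Object event) { /* persist to the recovery stream */ }
}
{code}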
[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379179#comment-14379179 ] Siddharth Seth commented on TEZ-2214: - Was looking at the .1 patch. The latest patch addresses the sync / visibility issue. Question: This same block could just as well have been placed in the waitForInMemoryMerge method ? Essentially, any place where it could be triggered after a merge completes. FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging -- Key: TEZ-2214 URL: https://issues.apache.org/jira/browse/TEZ-2214 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch Scenario: - commitMemory usedMemory are beyond their allowed threshold. - InMemoryMerge kicks off and is in the process of flushing memory contents to disk - As it progresses, it releases memory segments as well (but not yet over). - Fetchers who need memory maxSingleShuffleLimit, get scheduled. - If fetchers are fast, this quickly adds up to commitMemory usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge(). - Pretty soon all fetchers would be stalled and get into the following state. {noformat} Thread 9351: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Compiled frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame) {noformat} - Even if InMemoryMerger completes, commitedMem usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379007#comment-14379007 ] Jonathan Eagles commented on TEZ-2205: -- I think what zhijie has posted is similar to what I am thinking as well. This will give the on/off flag to users and keep the client in control. A log WARN should be sufficient to alert clients there is a mismatch in configuration. Whether we fall back to SimpleHistory or have a hollow ATS History isn't much of a difference to me. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378996#comment-14378996 ] Zhijie Shen commented on TEZ-2205: -- bq. i.e. option 3's impl would be: LGTM. For your reference, this is what we did in MR: {code} if (conf.getBoolean(MRJobConfig.MAPREDUCE_JOB_EMIT_TIMELINE_DATA, MRJobConfig.DEFAULT_MAPREDUCE_JOB_EMIT_TIMELINE_DATA)) { if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED, YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) { timelineClient = TimelineClient.createTimelineClient(); timelineClient.init(conf); LOG.info("Timeline service is enabled"); LOG.info("Emitting job history data to the timeline server is enabled"); } else { LOG.info("Timeline service is not enabled"); } } else { LOG.info("Emitting job history data to the timeline server is not enabled"); } {code} And only when {{timelineClient != null}} will MR publish the history info to the timeline server. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch When yarn.timeline-service.enabled=false is set, Tez still tries posting to ATS, but hits an error as the token is not found. It does not fail the job, because of the fix to not fail the job when there is an error posting to ATS, but it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2214 PreCommit Build #342
Jira: https://issues.apache.org/jira/browse/TEZ-2214 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/342/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2754 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707083/TEZ-2214.2.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/342//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/342//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 48d9285cf144bbef730af593e553b3ddf63b6148 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #341 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2555463 bytes Compression is 7.1% Took 1.3 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379091#comment-14379091 ] Hadoop QA commented on TEZ-2214: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707083/TEZ-2214.2.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/342//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/342//console This message is automatically generated. FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging -- Key: TEZ-2214 URL: https://issues.apache.org/jira/browse/TEZ-2214 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch Scenario: - commitMemory usedMemory are beyond their allowed threshold. - InMemoryMerge kicks off and is in the process of flushing memory contents to disk - As it progresses, it releases memory segments as well (but not yet over). - Fetchers who need memory maxSingleShuffleLimit, get scheduled. - If fetchers are fast, this quickly adds up to commitMemory usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge(). - Pretty soon all fetchers would be stalled and get into the following state. {noformat} Thread 9351: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Compiled frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame) {noformat} - Even if InMemoryMerger completes, commitedMem usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-2221 PreCommit Build #343
Jira: https://issues.apache.org/jira/browse/TEZ-2221 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/343/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2752 lines...] [INFO] Final Memory: 69M/960M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707092/TEZ-2221-2.patch against master revision 60ddcba. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/343//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/343//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. e544a39ba0d04ae0bf24d66fb79bc3faafb721e2 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #341 Archived 44 artifacts Archive block size is 32768 Received 24 blocks and 1934195 bytes Compression is 28.9% Took 0.52 sec Description set: TEZ-2221 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2221) VertexGroup name should be unique
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379124#comment-14379124 ] Hadoop QA commented on TEZ-2221: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12707092/TEZ-2221-2.patch against master revision 60ddcba. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/343//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/343//console This message is automatically generated. VertexGroup name should be unique - Key: TEZ-2221 URL: https://issues.apache.org/jira/browse/TEZ-2221 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2221-1.patch, TEZ-2221-2.patch VertexGroupCommitStartedEvent and VertexGroupCommitFinishedEvent use the vertex group name to identify the vertex group commit, so vertex groups with the same name will conflict. While in the current equals/hashCode of VertexGroup, both the vertex group name and the member names are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379172#comment-14379172 ] Siddharth Seth commented on TEZ-2214: - [~rajesh.balamohan] - I'm trying to understand the scenario a little better. bq. Fetchers who need memory < maxSingleShuffleLimit, get scheduled. Won't the fetchers first block on merger.waitForInMemoryMerge, and then on merger.waitForShuffleToMergeMemory() ? That'll happen to fetchers which aren't currently active - or for the ones where the MergeManager returns a WAIT. It's possible for fetchers which already have an active list to keep going - and get memory as it is released by the mergeThread - or just get memory because some is available. Is this the situation which can cause the race ? If the merge threshold is 50% - won't there always be capacity available for a single mergeToMem (after the MemToDiskMerger completes) - which will then trigger another merge. The fact that we allow a single fetch to go over the memory limit probably complicates this - the last fetch puts the usedMemory over 100%. The last release from the merger doesn't bring it below 100% - which will result in everything getting stuck. I think the same last-fetch issue applies to a merge threshold of 50% as well. Other than 'usedMemory' not going below the memoryLimit right after the InMemoryMerger completes, are there any other scenarios in which this will be triggered ? If I'm not mistaken - for Tez 0.4, this would manifest as a tight loop on MergeManager.reserve returning a WAIT. On the patch: Removing synchronization on waitForShuffleToMergeMemory leads to visibility issues for 'commitMemory'. This could be invoked by all Fetchers, and there's no guarantee on the threads reading the latest value. Also it's possible for the currently running merge to complete (thus reducing the commitMemory) between the time the commitMemory is checked and the next merge is triggered - which could result in a merge being triggered before hitting the memory limit. Otherwise I think the approach works. If the above case is correct - should the check be inside of usedMemory > memoryLimit ? Another option would be to have the merger check if another merge is required when it completes. That gets messy though - and will likely get in the way of the MemToMemMerger in the future. A callback from the merge threads may be a better option - to keep the merge threads clean. I was looking at the MapReduce code - that sets commitMemory to 0 the moment a merge starts. I don't think that fixes this particular race. FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging -- Key: TEZ-2214 URL: https://issues.apache.org/jira/browse/TEZ-2214 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch Scenario: - commitMemory & usedMemory are beyond their allowed threshold. - InMemoryMerge kicks off and is in the process of flushing memory contents to disk - As it progresses, it releases memory segments as well (but not yet over). - Fetchers who need memory < maxSingleShuffleLimit, get scheduled. - If fetchers are fast, this quickly adds up to commitMemory & usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge(). - Pretty soon all fetchers would be stalled and get into the following state.
{noformat} Thread 9351: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Compiled frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame) {noformat} - Even if InMemoryMerger completes, committedMem & usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379239#comment-14379239 ] Hitesh Shah commented on TEZ-1923: -- Updated fix versions. FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.5.4, 0.6.1 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, TEZ-1923.4.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly gets into infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. 
e.g. debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in-memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold) {code}
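Reading the debug output above together with that condition, a hedged reconstruction of an amended trigger (not the attached patch verbatim; the names follow the log messages) would also fire when usedMemory has crossed memoryLimit, so fetchers blocked in waitForShuffleToMergeMemory() can eventually be released:
{code}
// Sketch: the original trigger only fired on commitMemory >= mergeThreshold, so a reducer
// could sit with usedMemory > memoryLimit forever. Also firing on the memory limit unblocks it.
static boolean shouldStartMemToDiskMerge(boolean mergeInProgress,
    long commitMemory, long mergeThreshold, long usedMemory, long memoryLimit) {
  boolean overCommitThreshold = commitMemory >= mergeThreshold;   // original condition
  boolean overMemoryLimit = usedMemory > memoryLimit;             // the missed case above
  return !mergeInProgress && (overCommitThreshold || overMemoryLimit);
}
{code}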
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1923: - Fix Version/s: (was: 0.7.0) 0.6.1 0.5.4 FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.5.4, 0.6.1 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, TEZ-1923.4.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly gets into infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. 
e.g debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory = mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() commitMemory =
[jira] [Updated] (TEZ-2221) VertexGroup name should be unique
[ https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2221: Attachment: TEZ-2221-2.patch Thanks [~hitesh]. Uploaded a new patch (addresses the issue in the unit test). VertexGroup name should be unique - Key: TEZ-2221 URL: https://issues.apache.org/jira/browse/TEZ-2221 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2221-1.patch, TEZ-2221-2.patch VertexGroupCommitStartedEvent and VertexGroupCommitFinishedEvent use the vertex group name to identify the vertex group commit, so vertex groups with the same name will conflict. While in the current equals/hashCode of VertexGroup, both the vertex group name and the member names are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1923: -- Target Version/s: 0.7.0, 0.5.4, 0.6.1 (was: 0.7.0) FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.7.0 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, TEZ-1923.4.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly gets into infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. 
e.g debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory = mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() commitMemory = mergeThreshold) {code} Attaching the
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379049#comment-14379049 ] Rajesh Balamohan commented on TEZ-1923: --- committed to branch-0.6 and branch-0.5. FetcherOrderedGrouped gets into infinite loop due to memory pressure Key: TEZ-1923 URL: https://issues.apache.org/jira/browse/TEZ-1923 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Fix For: 0.7.0 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, TEZ-1923.4.patch - Ran a comparatively large job (temp table creation) at 10 TB scale. - Turned on intermediate mem-to-mem (tez.runtime.shuffle.memory-to-memory.enable=true and tez.runtime.shuffle.memory-to-memory.segments=4) - Some reducers get lots of data and quickly gets into infinite loop {code} 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for url=http://m1:13562/mapOutput?job=job_142126204_0201reduce=34map=attempt_142126204_0201_1_00_000420_0_10027keepAlive=true sent hash and receievd reply 0 ms 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned Status.WAIT ... {code} Additional debug/patch statements revealed that InMemoryMerge is not invoked appropriately and not releasing the memory back for fetchers to proceed. 
e.g. debug/patch messages are given below {code} syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, mergeThreshold=708669632 === InMemoryMerge would be started in this case as commitMemory >= mergeThreshold syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO [fetcher [Map_1] #2] orderedgrouped.MergeManager: Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO [fetcher [Map_1] #1] orderedgrouped.MergeManager: Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this case as commitMemory < mergeThreshold. But the usedMemory is higher than memoryLimit. Fetchers would keep waiting indefinitely until memory is released. InMemoryMerge will not kick in and not release memory. {code} In MergeManager, in memory merging is invoked under the following condition {code} if (!inMemoryMerger.isInProgress() && commitMemory
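For illustration, a minimal sketch of the trigger gap described above. The field names mirror the values printed in the log lines (usedMemory, commitMemory, memoryLimit, mergeThreshold); the class and methods are a simplified stand-in, not the actual MergeManager code.
{code}
// Illustrative sketch only; not the actual MergeManager implementation.
// Field names mirror the values printed in the log lines above.
class MergeTriggerSketch {
  long usedMemory;      // bytes held by fetched map outputs (in flight + committed)
  long commitMemory;    // bytes of fully fetched outputs eligible for merging
  long memoryLimit;     // total in-memory shuffle budget
  long mergeThreshold;  // commitMemory level at which an in-memory merge starts
  boolean mergeInProgress;

  boolean shouldStartInMemoryMerge() {
    // Same shape as the condition quoted above: only commitMemory is consulted.
    return !mergeInProgress && commitMemory >= mergeThreshold;
  }

  boolean fetchersWouldStall() {
    // The state seen in the second and third log lines: memory is exhausted,
    // yet the merge trigger never fires, so nothing ever releases memory.
    return usedMemory > memoryLimit && !shouldStartInMemoryMerge();
  }
}
{code}
Plugging in the second log line (usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, mergeThreshold=708669632), fetchersWouldStall() returns true, which matches the indefinite wait described above.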
[jira] [Created] (TEZ-2227) Tez UI shows empty page under IE11
Fengdong Yu created TEZ-2227: Summary: Tez UI shows empty page under IE11 Key: TEZ-2227 URL: https://issues.apache.org/jira/browse/TEZ-2227 Project: Apache Tez Issue Type: Bug Components: UI Affects Versions: 0.6.0 Reporter: Fengdong Yu Priority: Minor Tez UI works well under Chrome and Firefox, but shows empty page under IE11. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2227) Tez UI shows empty page under IE11
[ https://issues.apache.org/jira/browse/TEZ-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379343#comment-14379343 ] Hitesh Shah commented on TEZ-2227: -- Thanks for filing the issue [~azuryy]. Any chance you could provide more details/logs ( if any ) from the browser console? Tez UI shows empty page under IE11 -- Key: TEZ-2227 URL: https://issues.apache.org/jira/browse/TEZ-2227 Project: Apache Tez Issue Type: Bug Components: UI Affects Versions: 0.6.0 Reporter: Fengdong Yu Priority: Minor Tez UI works well under Chrome and Firefox, but shows empty page under IE11. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2214: -- Attachment: TEZ-2214.3.patch It's possible for fetchers which already have an active list to keep going - and get memory as it is released by the mergeThread - or just get memory because some is available. Is this the situation which can cause the race ? Right, this is the case. As merge is happening, memory gets released which is taken up by fetchers. By the time the existing merge completes, commitMemory & usedMemory are already beyond the allowed threshold. And this causes the issue. Question: This same block could just as well have been placed in the waitForInMemoryMerge method ? Essentially, any place where it could be triggered after a merge completes. Yes, it is possible to move the code block to waitForInMemoryMerge(). Addressed it in the current patch. (i.e. after inMemoryMerger.waitForMerge(), we double check if the memory limits are beyond thresholds. If so, we trigger one more merge and block until it is done in order to release memory.) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging -- Key: TEZ-2214 URL: https://issues.apache.org/jira/browse/TEZ-2214 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch, TEZ-2214.3.patch Scenario: - commitMemory & usedMemory are beyond their allowed threshold. - InMemoryMerge kicks off and is in the process of flushing memory contents to disk - As it progresses, it releases memory segments as well (but not yet over). - Fetchers who need memory < maxSingleShuffleLimit, get scheduled. - If fetchers are fast, this quickly adds up to commitMemory & usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge(). - Pretty soon all fetchers would be stalled and get into the following state. {noformat} Thread 9351: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Compiled frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame) {noformat} - Even if InMemoryMerger completes, commitedMem & usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
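A rough sketch of the double check described in the comment above. The type names, the startInMemoryMerge() helper and the exact re-check condition are assumptions made for illustration; this is not the patch itself.
{code}
// Illustrative sketch of "double check after waitForMerge()"; not the actual patch.
class MergeDoubleCheckSketch {
  interface InMemoryMerger {
    void waitForMerge() throws InterruptedException;
  }

  InMemoryMerger inMemoryMerger;   // hypothetical handle to the in-memory merger
  long commitMemory, usedMemory, memoryLimit, mergeThreshold;

  void waitForInMemoryMerge() throws InterruptedException {
    // Block until the currently running in-memory merge (if any) finishes.
    inMemoryMerger.waitForMerge();

    // While waiting, fast fetchers may have pushed memory usage back over the
    // thresholds without a new merge being scheduled (the in-flight merge
    // suppressed the trigger). Re-check and run one more merge if needed.
    if (commitMemory >= mergeThreshold || usedMemory > memoryLimit) {  // assumed re-check
      startInMemoryMerge();           // hypothetical helper that kicks off another merge
      inMemoryMerger.waitForMerge();  // block until it completes and releases memory
    }
  }

  void startInMemoryMerge() {
    // Placeholder: schedule a mem-to-disk merge so memory can be released.
  }
}
{code}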
[jira] [Created] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
Jeff Zhang created TEZ-2224: --- Summary: EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377544#comment-14377544 ] Jeff Zhang commented on TEZ-1909: - bq. It seems like the patch for this jira has been merged with fixes for a different jira? Can these be separated out? Yes, I found one issue in RecoveryService when working on this jira. I have created TEZ-2224 to separate it. And will upload new patch after TEZ-2224 is done, because the unit test depends on TEZ-2224 Remove need to copy over all events from attempt 1 to attempt 2 dir --- Key: TEZ-1909 URL: https://issues.apache.org/jira/browse/TEZ-1909 Project: Apache Tez Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Jeff Zhang Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch Use of file versions should prevent the need for copying over data into a second attempt dir. Care needs to be taken to handle last corrupt record handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2047: -- Attachment: TEZ-2047.2.patch corrected the value for yarn.http.policy verified on 2.6 that the http scheme is correct. Build fails against hadoop-2.2 post TEZ-2018 Key: TEZ-2047 URL: https://issues.apache.org/jira/browse/TEZ-2047 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch, TEZ-2047.2.patch Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-dag: Compilation failure: Compilation failure: [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] cannot find symbol [ERROR] symbol : method withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) [ERROR] location: class org.apache.hadoop.yarn.webapp.WebApps.Builderorg.apache.tez.dag.app.web.WebUIService.TezAMWebApp [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] cannot find symbol [ERROR] symbol : method getConnectorAddress(int) [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2047 PreCommit Build #339
Jira: https://issues.apache.org/jira/browse/TEZ-2047 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/339/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2754 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706940/TEZ-2047.1.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/339//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/339//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. d892bfccf6b5af2779a653b4e76d79f6cc811ac8 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #338 Archived 44 artifacts Archive block size is 32768 Received 8 blocks and 2471608 bytes Compression is 9.6% Took 1.7 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378254#comment-14378254 ] Hadoop QA commented on TEZ-2047: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12706940/TEZ-2047.1.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/339//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/339//console This message is automatically generated. Build fails against hadoop-2.2 post TEZ-2018 Key: TEZ-2047 URL: https://issues.apache.org/jira/browse/TEZ-2047 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch, TEZ-2047.2.patch Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-dag: Compilation failure: Compilation failure: [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] cannot find symbol [ERROR] symbol : method withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) [ERROR] location: class org.apache.hadoop.yarn.webapp.WebApps.Builderorg.apache.tez.dag.app.web.WebUIService.TezAMWebApp [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] cannot find symbol [ERROR] symbol : method getConnectorAddress(int) [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378277#comment-14378277 ] Hitesh Shah commented on TEZ-2047: -- +1 pending pre-commit. Build fails against hadoop-2.2 post TEZ-2018 Key: TEZ-2047 URL: https://issues.apache.org/jira/browse/TEZ-2047 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch, TEZ-2047.2.patch Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-dag: Compilation failure: Compilation failure: [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] cannot find symbol [ERROR] symbol : method withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) [ERROR] location: class org.apache.hadoop.yarn.webapp.WebApps.Builderorg.apache.tez.dag.app.web.WebUIService.TezAMWebApp [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] cannot find symbol [ERROR] symbol : method getConnectorAddress(int) [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-986) Make conf set on DAG and vertex available in jobhistory
[ https://issues.apache.org/jira/browse/TEZ-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378304#comment-14378304 ] Hitesh Shah commented on TEZ-986: - [~rohini] Agreed. Just moved the target version as I am not sure if [~Sreenath] has had a chance to work on it and given that [~jeagles] is looking to turn around a 0.6.1 release soon, this jira should likely be moved out if no one volunteers to work on it within a short timeframe. Make conf set on DAG and vertex available in jobhistory --- Key: TEZ-986 URL: https://issues.apache.org/jira/browse/TEZ-986 Project: Apache Tez Issue Type: Sub-task Components: UI Reporter: Rohini Palaniswamy Priority: Blocker Would like to have the conf set on DAG and Vertex 1) viewable in Tez UI after the job completes. This is very essential for debugging jobs. 2) We have processes, that parse jobconf.xml from job history (hdfs) and load them into hive tables for analysis. Would like to have Tez also make all the configuration (byte array) available in job history so that we can similarly parse them. 1) mandates that you store it in hdfs. 2) is just to say make the format stored as a contract others can rely on for parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2225) Remove instances of LOG.isDebugEnabled
Vasanth kumar RJ created TEZ-2225: - Summary: Remove instances of LOG.isDebugEnabled Key: TEZ-2225 URL: https://issues.apache.org/jira/browse/TEZ-2225 Project: Apache Tez Issue Type: Improvement Reporter: Vasanth kumar RJ Assignee: Vasanth kumar RJ Priority: Minor Remove LOG.isDebugEnabled() and use parameterized debug logging -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Failed: TEZ-2214 PreCommit Build #340
Jira: https://issues.apache.org/jira/browse/TEZ-2214 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/340/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2752 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705913/TEZ-2214.1.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/340//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/340//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/340//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. e5dcef324d906a7ab5d23e71b8107b372a9599a7 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #338 Archived 44 artifacts Archive block size is 32768 Received 6 blocks and 2556109 bytes Compression is 7.1% Took 0.88 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging
[ https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378344#comment-14378344 ] Hadoop QA commented on TEZ-2214: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705913/TEZ-2214.1.patch against master revision f53942c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/340//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/340//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/340//console This message is automatically generated. FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging -- Key: TEZ-2214 URL: https://issues.apache.org/jira/browse/TEZ-2214 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: TEZ-2214.1.patch Scenario: - commitMemory & usedMemory are beyond their allowed threshold. - InMemoryMerge kicks off and is in the process of flushing memory contents to disk - As it progresses, it releases memory segments as well (but not yet over). - Fetchers who need memory < maxSingleShuffleLimit, get scheduled. - If fetchers are fast, this quickly adds up to commitMemory & usedMemory. Since InMemoryMerge is already in progress, this wouldn't trigger another merge(). - Pretty soon all fetchers would be stalled and get into the following state. {noformat} Thread 9351: (state = BLOCKED) - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise) - java.lang.Object.wait() @bci=2, line=502 (Compiled frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory() @bci=17, line=337 (Interpreted frame) - org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run() @bci=34, line=157 (Interpreted frame) {noformat} - Even if InMemoryMerger completes, commitedMem & usedMem are beyond their threshold and no other fetcher threads (all are in stalled state) are there to release memory. This causes fetchers to wait indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
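The {noformat} stack above shows fetchers parked in Object.wait() inside MergeManager.waitForShuffleToMergeMemory(). A simplified sketch of that wait/notify pattern (illustrative names, not the real MergeManager code) shows why the wait never ends once no merge is left to release memory and call notifyAll():
{code}
// Simplified wait/notify pattern behind the blocked threads in the stack trace above;
// illustrative only. If no in-memory merge ever runs, releaseMemory() is never called
// and every fetcher stays parked in wait() indefinitely.
class ShuffleMemoryWaitSketch {
  private final Object memoryLock = new Object();
  private long usedMemory;
  private long memoryLimit;

  void waitForShuffleToMergeMemory() throws InterruptedException {
    synchronized (memoryLock) {
      while (usedMemory > memoryLimit) {  // only changes when memory is released
        memoryLock.wait();                // matches the Object.wait() frame above
      }
    }
  }

  void releaseMemory(long bytes) {        // normally called as a merge flushes segments to disk
    synchronized (memoryLock) {
      usedMemory -= bytes;
      memoryLock.notifyAll();             // wakes the stalled fetchers
    }
  }
}
{code}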
[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378156#comment-14378156 ] Hitesh Shah commented on TEZ-2047: -- The config value being set seems wrong. Build fails against hadoop-2.2 post TEZ-2018 Key: TEZ-2047 URL: https://issues.apache.org/jira/browse/TEZ-2047 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-dag: Compilation failure: Compilation failure: [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] cannot find symbol [ERROR] symbol : method withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) [ERROR] location: class org.apache.hadoop.yarn.webapp.WebApps.Builderorg.apache.tez.dag.app.web.WebUIService.TezAMWebApp [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] cannot find symbol [ERROR] symbol : method getConnectorAddress(int) [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378199#comment-14378199 ] Hitesh Shah commented on TEZ-2205: -- bq. For jobs launched through Oozie it is easy to turn off ATS via Oozie server side setting Could you clarify a bit more on how this is being done? [~jeagles] [~zjshen] [~lichangleo] If (3) is the approach that works the best in Yahoo environments, should the eventual fix be in YARN given that other applications will face the same issue i.e. option (2)? i.e. if yarn timeline enabled flag is set to false, all the relevant yarn client and timeline client libs should automatically not make calls to the server related to timeline ( i.e. do not retrieve delegation tokens, treat all calls as no-op) ? Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018
[ https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-2047: -- Attachment: TEZ-2047.1.patch thanks hitesh, * rebased the patch * sets the yarn.http.policy on the conf to use http_only [~hitesh] can you have a look? Build fails against hadoop-2.2 post TEZ-2018 Key: TEZ-2047 URL: https://issues.apache.org/jira/browse/TEZ-2047 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Prakash Ramachandran Priority: Blocker Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project tez-dag: Compilation failure: Compilation failure: [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13] cannot find symbol [ERROR] symbol : method withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy) [ERROR] location: class org.apache.hadoop.yarn.webapp.WebApps.Builderorg.apache.tez.dag.app.web.WebUIService.TezAMWebApp [ERROR] /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45] cannot find symbol [ERROR] symbol : method getConnectorAddress(int) [ERROR] location: class org.apache.hadoop.http.HttpServer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378199#comment-14378199 ] Hitesh Shah edited comment on TEZ-2205 at 3/24/15 5:27 PM: --- bq. For jobs launched through Oozie it is easy to turn off ATS via Oozie server side setting Could you clarify a bit more on how this is being done? [~jeagles] [~zjshen] [~lichangleo] If (3) is the approach that works the best in Yahoo environments, should the eventual fix be in YARN given that other applications will face the same issue? i.e. if yarn timeline enabled flag is set to false, all the relevant yarn client and timeline client libs should automatically not make calls to the server related to timeline ( i.e. do not retrieve delegation tokens, treat all calls as no-op) ? was (Author: hitesh): bq. For jobs launched through Oozie it is easy to turn off ATS via Oozie server side setting Could you clarify a bit more on how this is being done? [~jeagles] [~zjshen] [~lichangleo] If (3) is the approach that works the best in Yahoo environments, should the eventual fix be in YARN given that other applications will face the same issue i.e. option (2)? i.e. if yarn timeline enabled flag is set to false, all the relevant yarn client and timeline client libs should automatically not make calls to the server related to timeline ( i.e. do not retrieve delegation tokens, treat all calls as no-op) ? Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled
[ https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378401#comment-14378401 ] Bikas Saha commented on TEZ-2225: - This is needed to reduce cruft from code. There is an alternate log format via slf4j that does not have the penalty without the need for the extra if statements everywhere. Remove instances of LOG.isDebugEnabled -- Key: TEZ-2225 URL: https://issues.apache.org/jira/browse/TEZ-2225 Project: Apache Tez Issue Type: Improvement Reporter: Vasanth kumar RJ Assignee: Vasanth kumar RJ Priority: Minor Labels: performance Remove LOG.isDebugEnabled() and use parameterized debug logging -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled
[ https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378369#comment-14378369 ] Hitesh Shah commented on TEZ-2225: -- Is this really needed? Doesn't the slf4j doc mention that if (LOG.isDebugEnabled()) is one way to reduce the perf penalty? Remove instances of LOG.isDebugEnabled -- Key: TEZ-2225 URL: https://issues.apache.org/jira/browse/TEZ-2225 Project: Apache Tez Issue Type: Improvement Reporter: Vasanth kumar RJ Assignee: Vasanth kumar RJ Priority: Minor Labels: performance Remove LOG.isDebugEnabled() and use parameterized debug logging -- This message was sent by Atlassian JIRA (v6.3.4#6332)
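For context on the two styles being discussed: the isDebugEnabled() guard avoids building the message when debug logging is off, while slf4j's parameterized form defers formatting until inside debug(), so the guard is mostly redundant for cheap arguments; it still helps when computing an argument is expensive. A small self-contained example:
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LoggingStyles {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingStyles.class);

  void example(String vertexName, int numTasks) {
    // Guarded style: the if avoids the string concatenation when debug is disabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Scheduling vertex " + vertexName + " with " + numTasks + " tasks");
    }

    // Parameterized style: formatting only happens if debug is enabled,
    // so the explicit guard adds little for simple arguments like these.
    LOG.debug("Scheduling vertex {} with {} tasks", vertexName, numTasks);

    // The guard still pays off when an argument itself is costly to compute.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Vertex state: {}", buildExpensiveDiagnostics());
    }
  }

  private String buildExpensiveDiagnostics() {
    return "...";  // stand-in for an expensive computation
  }
}
{code}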
[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled
[ https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378370#comment-14378370 ] Hitesh Shah commented on TEZ-2225: -- \cc [~bikassaha] [~sseth] Remove instances of LOG.isDebugEnabled -- Key: TEZ-2225 URL: https://issues.apache.org/jira/browse/TEZ-2225 Project: Apache Tez Issue Type: Improvement Reporter: Vasanth kumar RJ Assignee: Vasanth kumar RJ Priority: Minor Labels: performance Remove LOG.isDebugEnabled() and use parameterized debug logging -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377401#comment-14377401 ] Bikas Saha edited comment on TEZ-714 at 3/24/15 6:54 AM: - bq. It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits, besides there's 2 possible states (RUNNING, COMMITTING) when the commit happens and we also need check handle 2 different cases (commit succeeded failure), so there would be totally 8 different cases in one transition which may be difficult to read. I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state transitions as inspiration. There are some standard things to do when a commit operation completes. e.g. decrement the outstanding commit counter. If commit was a group commit then write the recovery entry for it. If the commit fails then set a flag to abort. This can be in a base transition say CommitCompletedTransition. Then we can have CommitCompletedWhileRunningTransition that calls the base for common code and does running specific stuff.e.g. trigger job failure upon commit failure. And another transition for CommitCompletedWhileCommitting that just waits for the commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits for all commit operations to complete and then calls abort (this could be blocking for now). This way we can separate things while still keeping the transitions essentially linear. Instead of multiplying the possibilities by (2 commit types x 3 states x 2 commit results) Perhaps, all commit events need to have a shared boolean that they should check before invoking commit. This boolean could be set to false when the vertex/dag decides to abort. This would make and pending commit operations complete quickly instead of trying to commit unnecessarily. Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. Create custom committers that fail/pass as desired and check that the dag behaved as expected. was (Author: bikassaha): bq. It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits, besides there's 2 possible states (RUNNING, COMMITTING) when the commit happens and we also need check handle 2 different cases (commit succeeded failure), so there would be totally 8 different cases in one transition which may be difficult to read. I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state transitions as inspiration. There are some standard things to do when a commit operation completes. e.g. decrement the outstanding commit counter. If commit was a group commit then write the recovery entry for it. If the commit fails then set a flag to abort. This can be in a base transition say CommitCompletedTransition. Then we can have CommitCompletedWhileRunningTransition that calls the base for common code and does running specific stuff.e.g. trigger job failure upon commit failure. And another transition for CommitCompletedWhileCommitting that just waits for the commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits for all commit operations to complete and then calls abort (this could be blocking for now). Perhaps, all commit events need to have a shared boolean that they should check before invoking commit. This boolean could be set to false when the vertex/dag decides to abort. This would make and pending commit operations complete quickly instead of trying to commit unnecessarily. 
Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. Create custom committers that fail/pass as desired and check that the dag behaved as expected. OutputCommitters should not run in the main AM dispatcher thread Key: TEZ-714 URL: https://issues.apache.org/jira/browse/TEZ-714 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jeff Zhang Priority: Critical Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf Follow up jira from TEZ-41. 1) If there's multiple OutputCommitters on a Vertex, they can be run in parallel. 2) Running an OutputCommitter in the main thread blocks all other event handling, w.r.t the DAG, and causes the event queue to back up. 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377401#comment-14377401 ] Bikas Saha commented on TEZ-714: bq. It could, but this may make the transition complicated. Currently we need to differentiate these 2 kinds of commits, besides there's 2 possible states (RUNNING, COMMITTING) when the commit happens and we also need check handle 2 different cases (commit succeeded failure), so there would be totally 8 different cases in one transition which may be difficult to read. I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state transitions as inspiration. There are some standard things to do when a commit operation completes. e.g. decrement the outstanding commit counter. If commit was a group commit then write the recovery entry for it. If the commit fails then set a flag to abort. This can be in a base transition say CommitCompletedTransition. Then we can have CommitCompletedWhileRunningTransition that calls the base for common code and does running specific stuff.e.g. trigger job failure upon commit failure. And another transition for CommitCompletedWhileCommitting that just waits for the commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits for all commit operations to complete and then calls abort (this could be blocking for now). Perhaps, all commit events need to have a shared boolean that they should check before invoking commit. This boolean could be set to false when the vertex/dag decides to abort. This would make and pending commit operations complete quickly instead of trying to commit unnecessarily. Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. Create custom committers that fail/pass as desired and check that the dag behaved as expected. OutputCommitters should not run in the main AM dispatcher thread Key: TEZ-714 URL: https://issues.apache.org/jira/browse/TEZ-714 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jeff Zhang Priority: Critical Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf Follow up jira from TEZ-41. 1) If there's multiple OutputCommitters on a Vertex, they can be run in parallel. 2) Running an OutputCommitter in the main thread blocks all other event handling, w.r.t the DAG, and causes the event queue to back up. 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
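A compact sketch of the layered transitions described in the comment above. All type and method names here are hypothetical; the real DAGImpl transitions are registered through a StateMachineFactory and use the DAG's own event and context types.
{code}
// Hypothetical sketch of "base transition + per-state specialization"; not DAGImpl code.
interface CommitCompletedEvent {
  boolean succeeded();
  boolean isGroupCommit();
  String getDiagnostics();
}

interface DagContext {
  void decrementOutstandingCommits();
  int outstandingCommits();
  void writeGroupCommitRecoveryEvent(CommitCompletedEvent e);
  void setShouldAbort(boolean abort);     // shared flag pending commits can consult
  void triggerDagFailure(String diagnostics);
  void finish();
}

abstract class CommitCompletedTransition {
  void transition(DagContext dag, CommitCompletedEvent event) {
    // Common bookkeeping for every state in which a commit can complete.
    dag.decrementOutstandingCommits();
    if (event.isGroupCommit() && event.succeeded()) {
      dag.writeGroupCommitRecoveryEvent(event);
    }
    if (!event.succeeded()) {
      dag.setShouldAbort(true);
    }
    onCommitCompleted(dag, event);
  }
  abstract void onCommitCompleted(DagContext dag, CommitCompletedEvent event);
}

class CommitCompletedWhileRunning extends CommitCompletedTransition {
  @Override void onCommitCompleted(DagContext dag, CommitCompletedEvent event) {
    if (!event.succeeded()) {
      dag.triggerDagFailure(event.getDiagnostics());  // running-specific handling
    }
  }
}

class CommitCompletedWhileCommitting extends CommitCompletedTransition {
  @Override void onCommitCompleted(DagContext dag, CommitCompletedEvent event) {
    if (dag.outstandingCommits() == 0) {
      dag.finish();  // all commits accounted for, move to the terminal state
    }
  }
}
{code}
This keeps the number of transitions roughly linear in the number of states instead of multiplying commit types by states by commit results, which is the point being argued above.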
[jira] [Comment Edited] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378536#comment-14378536 ] Hitesh Shah edited comment on TEZ-2205 at 3/24/15 8:31 PM: --- [~lichangleo] In the end, someone has to check that config and make some choices on how to act based on the configured value :). In an ideal world, the yarn config would be checked and enforced by yarn libraries and not by yarn applications but if it comes to it, we can make the change in Tez to handle this config. Also, [~lichangleo], based on [~rohini]'s comment, if YARN does not have the hollow class, this would imply that there needs to be a hollow implementation in Tez. i.e. option 3's impl would be: - check yarn timeline enabled flag and make the relevant classes that use ATS to be a no-op. - Add a log.warn if ats is configured but yarn timeline is disabled. [~zjshen] [~jeagles] comments on this? It would be good to try and get a final consensus on whether we enforce the yarn-specific flag in YARN or in the Application. Based on this, we can unblock [~lichangleo] to be able to make the changes. was (Author: hitesh): [~lichangleo] In the end, someone has to check that config and make some choices on how to act based on the configured value :). In an ideal world, the yarn config would be checked and enforced by yarn libraries and not by yarn applications. Also, [~lichangleo], based on [~rohini]'s comment, if YARN does not have the hollow class, this would imply that there needs to be a hollow implementation in Tez. i.e. option 3's impl would be: - check yarn timeline enabled flag and make the relevant classes that use ATS to be a no-op. - Add a log.warn if ats is configured but yarn timeline is disabled. [~zjshen] [~jeagles] comments on this? It would be good to try and get a final consensus on whether we enforce the yarn-specific flag in YARN or in the Application. Based on this, we can unblock [~lichangleo] to be able to make the changes. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
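A minimal sketch of option 3 as described above, with the flag enforced on the application side by swapping in a no-op logger. The HistoryLogger, NoOpHistoryLogger and factory names are illustrative; only the YarnConfiguration constants are real.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch; HistoryLogger, NoOpHistoryLogger and the factory are made-up names.
interface HistoryLogger {
  void logEvent(Object historyEvent);
}

class NoOpHistoryLogger implements HistoryLogger {
  @Override public void logEvent(Object historyEvent) {
    // Intentionally empty: no delegation tokens fetched, no calls made to ATS.
  }
}

class HistoryLoggerFactory {
  private static final Logger LOG = LoggerFactory.getLogger(HistoryLoggerFactory.class);

  static HistoryLogger create(Configuration conf, HistoryLogger atsLogger) {
    boolean timelineEnabled = conf.getBoolean(
        YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
    if (!timelineEnabled) {
      LOG.warn("ATS logging requested but {} is false; using a no-op history logger",
          YarnConfiguration.TIMELINE_SERVICE_ENABLED);
      return new NoOpHistoryLogger();
    }
    return atsLogger;
  }
}
{code}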
[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2204: Attachment: TEZ-2204-5.patch Upload new patch with minor change (add one more log) TestAMRecovery increasingly flaky on jenkins builds. - Key: TEZ-2204 URL: https://issues.apache.org/jira/browse/TEZ-2204 Project: Apache Tez Issue Type: Bug Reporter: Hitesh Shah Assignee: Jeff Zhang Fix For: 0.7.0 Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch, TEZ-2204-4.patch, TEZ-2204-5.patch In recent pre-commit builds and daily builds, there seem to have been some occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378230#comment-14378230 ] Hitesh Shah commented on TEZ-2224: -- Also, TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT being false by default is probably wrong. For a real-world scenario, as many pending events as are seen and can be processed should be processed. EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, the last event taken from it may still be being processed. Should fix it like AsyncDispatcher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
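A rough sketch of the drain-on-stop behavior argued for above, loosely modeled on AsyncDispatcher's drained flag; illustrative only, not the RecoveryService implementation.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch, loosely modeled on AsyncDispatcher; not RecoveryService code.
class RecoveryDrainSketch {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();
  private final Object waitForDrained = new Object();
  private volatile boolean drained = true;   // true only when queue is empty AND nothing is in flight
  private volatile boolean stopped = false;

  private final Thread handlerThread = new Thread(() -> {
    while (!stopped) {
      try {
        Object event = eventQueue.take();
        handle(event);                        // the event is "in flight" here
        synchronized (waitForDrained) {
          // Only after the event is fully handled is the queue state consulted,
          // so an empty queue really does mean every event has been consumed.
          drained = eventQueue.isEmpty();
          waitForDrained.notifyAll();
        }
      } catch (InterruptedException e) {
        return;
      }
    }
  });

  RecoveryDrainSketch() {
    handlerThread.start();
  }

  void enqueue(Object event) throws InterruptedException {
    synchronized (waitForDrained) {
      drained = false;
    }
    eventQueue.put(event);
  }

  void stop() throws InterruptedException {
    synchronized (waitForDrained) {
      // Process as many pending events as possible before shutting down.
      while (!drained && handlerThread.isAlive()) {
        waitForDrained.wait(1000);
      }
    }
    stopped = true;
    handlerThread.interrupt();
  }

  private void handle(Object event) {
    // Persist the event to the recovery log (placeholder).
  }
}
{code}
The key detail is that the drained flag is only updated after an event has been fully handled, so an empty queue is never mistaken for all events having been consumed.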
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378245#comment-14378245 ] Rohini Palaniswamy commented on TEZ-2205: - OOZIE-2133 is the one that handles getting delegation tokens for ATS for tez jobs. If oozie.action.launcher.yarn.timeline-service.enabled is set to true on the Oozie server configuration, it adds yarn.timeline-service.enabled=true to conf of JobClient that submits the launcher job if the tez-site.xml is part of the distributed cache. JobClient (YARN) fetches ATS delegation token if that setting is set before the job is submitted and adds it to the job. Tez still tries to post to ATS when yarn.timeline-service.enabled=false --- Key: TEZ-2205 URL: https://issues.apache.org/jira/browse/TEZ-2205 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.6.1 Reporter: Chang Li Assignee: Chang Li Attachments: TEZ-2205.wip.patch when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, but hits error as token is not found. Does not fail the job because of the fix to not fail job when there is error posting to ATS. But it should not be trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService
[ https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378217#comment-14378217 ] Hitesh Shah commented on TEZ-2224: -- Is there a reason why we want to prevent new events from being processed on a shutdown? EventQueue empty doesn't mean events are consumed in RecoveryService Key: TEZ-2224 URL: https://issues.apache.org/jira/browse/TEZ-2224 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2224-1.patch If the event queue is empty, the last event taken from it may still be being processed. Should fix it like AsyncDispatcher -- This message was sent by Atlassian JIRA (v6.3.4#6332)