[jira] [Created] (TEZ-2209) Fix pipelined shuffle to fetch data from any one attempt
Rajesh Balamohan created TEZ-2209: - Summary: Fix pipelined shuffle to fetch data from any one attempt Key: TEZ-2209 URL: https://issues.apache.org/jira/browse/TEZ-2209 Project: Apache Tez Issue Type: Improvement Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan - Currently, pipelined shuffle fails fast the moment it receives data from an attempt other than 0. This was done as an additional check to prevent data from being copied from speculative attempts. - However, in some scenarios (such as LLAP), it is possible that the task attempt gets killed even before generating any data. In such cases, attempt #1 or a later attempt would generate the actual data. - This jira is created to allow pipelined shuffle to download data from any one attempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
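A minimal sketch of the check described above (the class and method names are illustrative, not the actual Tez ShuffleManager API): instead of rejecting everything that is not attempt 0, the consumer could remember the first attempt it sees per source task and accept spills only from that one attempt.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: lets pipelined shuffle fetch from any ONE attempt
// per source task, rather than hard-coding attempt 0.
class AnyOneAttemptTracker {
    private final Map<Integer, Integer> acceptedAttempt = new ConcurrentHashMap<>();

    /** Returns true if data from (taskIndex, attemptNumber) may be fetched. */
    boolean accept(int taskIndex, int attemptNumber) {
        // First attempt seen for this task wins; a different attempt is
        // rejected later, so spills from speculative attempts never mix.
        Integer chosen = acceptedAttempt.putIfAbsent(taskIndex, attemptNumber);
        return chosen == null || chosen == attemptNumber;
    }
}
```

This keeps the original safety property (no mixing of data across attempts) while allowing attempt #1 or later to be the chosen producer when attempt #0 never generated data.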
[jira] [Commented] (TEZ-2196) Consider reusing UnorderedPartitionedKVWriter with single output in UnorderedKVOutput
[ https://issues.apache.org/jira/browse/TEZ-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368455#comment-14368455 ] Rajesh Balamohan commented on TEZ-2196: --- [~sseth] - Removed TEZ_RUNTIME_TRANSFER_DATA_VIA_EVENTS_ENABLED in the patch. Can I go ahead and remove the corresponding processing in consumer side as well? (unordered shuffle manager has addCompletedInputWithData). > Consider reusing UnorderedPartitionedKVWriter with single output in > UnorderedKVOutput > - > > Key: TEZ-2196 > URL: https://issues.apache.org/jira/browse/TEZ-2196 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2196.1.patch > > > Can possibly get rid of FileBasedKVWriter and reuse > UnorderedPartitionedKVWriter with single partition in UnorderedKVOutput. > This can also benefit from pipelined shuffle changes done in > UnorderedPartitionedKVWriter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368377#comment-14368377 ] Jeff Zhang edited comment on TEZ-2204 at 3/19/15 2:23 AM: -- Sometimes a DAGAppMaster leak happens. It may be an issue related to YARN-2917, because Tez has its own AsyncDispatcher but has not included the patch from YARN-2917. Pasting the jstack:
{code}
"Thread-1" prio=5 tid=0x7f9d13011800 nid=0xe507 in Object.wait() [0x000117559000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:162)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed61000> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1539)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1674)
        - locked <0x0007fed0dc50> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed0de80> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1940)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

   Locked ownable synchronizers:
        - None

"App Shared Pool - #1" daemon prio=5 tid=0x7f9d13e60800 nid=0xdd03 in Object.wait() [0x00011714c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1355)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x0007ff111ec8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:962)
        at org.apache.tez.test.TestAMRecovery$ControlledImmediateStartVertexManager.onSourceTaskCompleted(TestAMRecovery.java:601)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventSourceTaskCompleted.invoke(VertexManager.java:525)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:580)
        - locked <0x0007fb82fac8> (a org.apache.tez.dag.app.dag.impl.VertexManager)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:1)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:575)
        at org.apache.tez.dag.app.dag.event.CallableEvent.call(CallableEvent.java:27)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - <0x0007fbc182d8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
was (Author: zjffdu): Sometimes a DAGAppMaster leak happens. It may be an issue related to YARN-2917, because Tez has its own AsyncDispatcher but has not included the patch from YARN-2917.
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368377#comment-14368377 ] Jeff Zhang commented on TEZ-2204: - It may be an issue related to YARN-2917, because Tez has its own AsyncDispatcher but has not included the patch from YARN-2917. Copying the jstack:
{code}
"Thread-1" prio=5 tid=0x7f9d13011800 nid=0xe507 in Object.wait() [0x000117559000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:162)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed61000> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1539)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1674)
        - locked <0x0007fed0dc50> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed0de80> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1940)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

   Locked ownable synchronizers:
        - None

"App Shared Pool - #1" daemon prio=5 tid=0x7f9d13e60800 nid=0xdd03 in Object.wait() [0x00011714c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1355)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x0007ff111ec8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:962)
        at org.apache.tez.test.TestAMRecovery$ControlledImmediateStartVertexManager.onSourceTaskCompleted(TestAMRecovery.java:601)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventSourceTaskCompleted.invoke(VertexManager.java:525)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:580)
        - locked <0x0007fb82fac8> (a org.apache.tez.dag.app.dag.impl.VertexManager)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:1)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:575)
        at org.apache.tez.dag.app.dag.event.CallableEvent.call(CallableEvent.java:27)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - <0x0007fbc182d8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
> TestAMRecovery increasingly flaky on jenkins builds. 
> - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-160) Remove 5 second sleep at the end of AM completion.
[ https://issues.apache.org/jira/browse/TEZ-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367980#comment-14367980 ] André Kelpe commented on TEZ-160: - They are independent apps; the shutdown happens after each test so that we have a clean test environment. > Remove 5 second sleep at the end of AM completion. > -- > > Key: TEZ-160 > URL: https://issues.apache.org/jira/browse/TEZ-160 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth > Labels: TEZ-0.2.0 > > ClientServiceDelegate/DAGClient doesn't seem to be getting job completion > status from the AM after job completion. It, instead, always relies on the RM > for this information. The information returned by the AM should be used while > it's available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367849#comment-14367849 ] Bikas Saha commented on TEZ-145: I know what you are talking about, but let me restate to check whether we are on the same page. Combining can happen at multiple levels - task, host, rack etc. Doing these combines in theory requires maintaining partition boundaries per combining level. However, if tasks maintain partition boundaries, then there is a task explosion (== level-arity * partition count). Hence, an efficient multi-level combine operation needs to operate on multiple partitions per task at each level, so that a reasonable number of tasks can be used to process a large number of partitions. This statement can be true even for the final reducer. Partially, that is what happens with auto-reduce, except that the tasks lose their partition boundaries. If the processor can find a way to process multiple partitions while keeping them logically separate, then we could de-link physical tasks from physical partitioning. If that is supported by the processor, the edge manager can be set up to do the correct routing of N output/partition indices to the same task. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
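The "task explosion" arithmetic in the comment above can be sketched as follows (an illustrative back-of-the-envelope calculation, not Tez code; the method names are invented for this sketch):

```java
// Hedged sketch of the task-count trade-off: keeping strict partition
// boundaries needs one task per partition at every combining level, while
// letting a task hold several logically-separate partitions divides the
// per-level task count by that factor.
class CombineTaskCounts {
    // Tasks needed when every level keeps one partition per task:
    // level-arity * partition count, as stated in the comment.
    static long tasksWithPartitionBoundaries(int levels, int partitions) {
        return (long) levels * partitions;
    }

    // Tasks needed when each task processes partitionsPerTask partitions
    // while keeping them logically separate.
    static long tasksWithMultiPartitionTasks(int levels, int partitions,
                                             int partitionsPerTask) {
        long perLevel = (partitions + partitionsPerTask - 1L) / partitionsPerTask; // ceiling
        return levels * perLevel;
    }
}
```

For example, 3 combining levels over 2000 partitions needs 6000 single-partition tasks, but only 60 tasks if each task can hold 100 logically-separate partitions.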
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367813#comment-14367813 ] Gopal V commented on TEZ-145: - This is a question for [~bikassaha]. There is a combiner edge & vertex manager that needs to go along with this, which converts all partitions from all local inputs into one combine processor (i.e. if it finds it has remote fetches to do, it should just forward the DME events using the pipelined mode of >1 event per attempt). To be able to bail out with a no-op like that, all partitioning throughout has to be exactly the reducer partition count. This is the optimal mode, but it makes everything extra complex. Assume you have 600 hosts over 30 racks which ran a map-task + 2000 partitions in the reducer. The host-level combiner input count is actually 600 x 2000 partitions, which can be grouped into 600 x m groups - not 2000 groups. The rack-level combiner input count is actually 30 x 2000 partitions, which can be grouped into 30 x n groups - not 2000 groups. Yet, all the inputs are actually always partitioned into 2000 partitions and the destination task-index is determined by something other than the partition. So, how practical is that? > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
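The input-count arithmetic in the example above can be made concrete with a small sketch (illustrative only; H, R, and P are the hosts, racks, and reducer partitions from the comment):

```java
// Hedged sketch of the combiner input counts: with H map hosts, R racks,
// and P reducer partitions, a host-level combiner sees H * P partition
// streams and a rack-level combiner sees R * P, even though the data is
// always partitioned into exactly P partitions.
class CombinerInputCounts {
    static long hostLevelInputs(int hosts, int partitions) {
        return (long) hosts * partitions;
    }

    static long rackLevelInputs(int racks, int partitions) {
        return (long) racks * partitions;
    }
}
```

With the numbers from the comment (600 hosts, 30 racks, 2000 partitions), the host-level combiner sees 1,200,000 inputs and the rack-level combiner 60,000 - which is why the destination task-index has to be decided by something other than the partition number alone.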
[jira] [Commented] (TEZ-2137) Add task counter to understand sorter final merge time
[ https://issues.apache.org/jira/browse/TEZ-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367764#comment-14367764 ] Hitesh Shah commented on TEZ-2137: -- At this point, these counters are only useful at a task level. We should look at how to make these counters usable at the vertex level. Current counter aggregation (which does a simple sum) is next to useless for timestamp information. Very few users are likely to dig into counters at each task level; most will look at aggregates at the DAG and vertex level. Only the analyser is likely to make use of these counters at the granular level. Given the above, how much memory footprint are we adding for these new counters? Should we have better logic on how to keep memory in check even as we add more and more information at the task level (needed only for later deep analysis)? \cc [~bikassaha] > Add task counter to understand sorter final merge time > -- > > Key: TEZ-2137 > URL: https://issues.apache.org/jira/browse/TEZ-2137 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2137.1.patch, TEZ-2137.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
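The point about sum-aggregation being useless for timestamps can be illustrated with a tiny sketch (illustrative helpers, not Tez's TezCounter API): summing per-task timestamps produces a meaningless number, whereas min/max-style aggregation keeps a timestamp counter interpretable at the vertex level.

```java
// Hedged sketch: two aggregation strategies over per-task timestamp
// counters. Only max (or min) yields a value with meaning at the vertex
// level, e.g. "last task finished final merge at ...".
class CounterAggregation {
    static long sum(long[] values) {
        long s = 0;
        for (long v : values) s += v; // meaningless for timestamps
        return s;
    }

    static long max(long[] values) {
        long m = Long.MIN_VALUE;
        for (long v : values) m = Math.max(m, v); // latest timestamp
        return m;
    }
}
```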
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367762#comment-14367762 ] Gopal V commented on TEZ-145: - [~ozawa]: the CombineProcessor patch looks good. This will help applications which do no in-memory aggregation, but you're effectively moving the data over racks ~3x. So this is a necessary part of the fix, but not the complete fix, as long as the ShuffleVertexManager is being used to connect them up, because that vertex manager has no way to provide task locality (rack-local or host-local) when spinning up tasks. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367685#comment-14367685 ] Hadoop QA commented on TEZ-145: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705400/TEZ-145.2.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/310//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/310//console This message is automatically generated. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-145 PreCommit Build #310
Jira: https://issues.apache.org/jira/browse/TEZ-145 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/310/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2757 lines...] [INFO] Final Memory: 67M/864M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705400/TEZ-145.2.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/310//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/310//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 8b3154918e3c17438ebe050a32bf3a55b2bc8187 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #309 Archived 44 artifacts Archive block size is 32768 Received 8 blocks and 2472767 bytes Compression is 9.6% Took 1.8 sec Description set: TEZ-145 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated TEZ-145: --- Attachment: TEZ-145.2.patch Fix warnings by findbugs. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367561#comment-14367561 ] Hitesh Shah commented on TEZ-2205: -- Added a comment on YARN-2375. Let us see what happens there :) > Tez still tries to post to ATS when yarn.timeline-service.enabled=false > --- > > Key: TEZ-2205 > URL: https://issues.apache.org/jira/browse/TEZ-2205 > Project: Apache Tez > Issue Type: Sub-task >Affects Versions: 0.6.1 >Reporter: Chang Li >Assignee: Chang Li > > when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, > but hits error as token is not found. Does not fail the job because of the > fix to not fail job when there is error posting to ATS. But it should not be > trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Description: Tez has an interface to integrate various input and output types into Tez. This makes jobs pluggable where compatible inputs and outputs can be plugged in/out without changing the main processing logic of the job. This jira tracks adding support for Kafka as an input and output for Tez jobs/tasks. (was: It is something like existing MRInput and MROutput. Kafka I/O is expected to open up the new domain for Tez adoption. More details will be added soon..) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > Tez has an interface to integrate various input and output types into Tez. > This makes jobs pluggable where compatible inputs and outputs can be plugged > in/out without changing the main processing logic of the job. This jira > tracks adding support for Kafka as an input and output for Tez jobs/tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052774#comment-14052774 ] Bikas Saha edited comment on TEZ-1175 at 3/18/15 5:46 PM: -- Any updates on this? was (Author: bikassaha): Any updates on this? > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Labels: gsoc gsoc2015 hadoop java tez (was: gsoc gsoc2015) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Comment: was deleted (was: Any updates on this?) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Labels: gsoc gsoc2015 (was: ) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015 > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367542#comment-14367542 ] Bikas Saha commented on TEZ-714: I understand that. What I expected were code changes around the code that invokes the commit, changing it from sync to async, plus new transitions from the committing state. But there were changes to other parts of the code too. Perhaps I am missing something. I will take a closer look at the next patch, where async operations are on a per-commit basis. Also, I am not sure why group commit and non-group commit need to be differentiated in different transitions. If the next patch continues to differentiate them (instead of just counting pending operations), then perhaps you can add a comment on why it's necessary, so that it's easy to understand the reason. > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367518#comment-14367518 ] Chang Li commented on TEZ-2205: --- Thanks for the clarification [~jeagles]. [~hitesh], which way should I proceed to solve this problem? > Tez still tries to post to ATS when yarn.timeline-service.enabled=false > --- > > Key: TEZ-2205 > URL: https://issues.apache.org/jira/browse/TEZ-2205 > Project: Apache Tez > Issue Type: Sub-task >Affects Versions: 0.6.1 >Reporter: Chang Li >Assignee: Chang Li > > When yarn.timeline-service.enabled=false is set, Tez still tries posting to ATS > but hits an error as the token is not found. It does not fail the job, because of the > fix to not fail the job when there is an error posting to ATS. But it should not be > trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367463#comment-14367463 ] Jonathan Eagles commented on TEZ-1923: -- This will be a good fix for 0.6.1 > FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 0.7.0 > > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, > TEZ-1923.4.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not releasing the memory back for fetchers to proceed. 
e.g > debug/patch messages are given below > {code} > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, > mergeThreshold=708669632 <<=== InMemoryMerge would be started in this case > as commitMemory >= mergeThreshold > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO > [fetcher [Map_1] #1] orderedgrouped.MergeManager: > Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > {code} > In MergeManager, in memory merging is i
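The failure mode in these logs boils down to two conditions that never reconcile: the in-memory merge starts only when commitMemory >= mergeThreshold, while fetchers are made to WAIT whenever usedMemory exceeds memoryLimit. A minimal sketch of that stall condition follows; the names and logic are simplified illustrations, not the actual MergeManager code:

```java
// Illustrative sketch of the stall condition described in the logs above.
// Hypothetical, simplified names; not the actual Tez MergeManager logic.
public class MergeStallSketch {

    /** In-memory merge starts only once committed memory crosses the threshold. */
    static boolean shouldStartInMemoryMerge(long commitMemory, long mergeThreshold) {
        return commitMemory >= mergeThreshold;
    }

    /** Fetchers are told to WAIT while used memory exceeds the limit. */
    static boolean fetchersMustWait(long usedMemory, long memoryLimit) {
        return usedMemory > memoryLimit;
    }

    /** The stall: fetchers wait, but nothing will ever release memory. */
    static boolean isStalled(long usedMemory, long memoryLimit,
                             long commitMemory, long mergeThreshold) {
        return fetchersMustWait(usedMemory, memoryLimit)
            && !shouldStartInMemoryMerge(commitMemory, mergeThreshold);
    }

    public static void main(String[] args) {
        // Values from the 02:05:52 log line above: stalled.
        System.out.println(isStalled(1273349784L, 1073741824L, 347296632L, 708669632L)); // prints true
        // Values from the 02:05:48 log line: merge would start, no stall.
        System.out.println(isStalled(1551867234L, 1073741824L, 883028388L, 708669632L)); // prints false
    }
}
```

Under these assumed conditions, once usedMemory passes memoryLimit while commitMemory is below mergeThreshold, neither side can make progress, matching the indefinite Status.WAIT loop in the logs.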
[jira] [Comment Edited] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367428#comment-14367428 ] Hitesh Shah edited comment on TEZ-1923 at 3/18/15 4:42 PM: --- Is this something that needs to be backported to 0.5 and 0.6 ? \cc [~jeagles] was (Author: hitesh): Is this something that needs to be backported to 0.5 and 0.6 ?
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367428#comment-14367428 ] Hitesh Shah commented on TEZ-1923: -- Is this something that needs to be backported to 0.5 and 0.6 ?
[jira] [Commented] (TEZ-2159) Tez UI: download timeline data for offline use.
[ https://issues.apache.org/jira/browse/TEZ-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367312#comment-14367312 ] Rajesh Balamohan commented on TEZ-2159: --- Thanks [~pramachandran]. Tried out the patch and the zip format looks fine. > Tez UI: download timeline data for offline use. > --- > > Key: TEZ-2159 > URL: https://issues.apache.org/jira/browse/TEZ-2159 > Project: Apache Tez > Issue Type: Improvement > Components: UI >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran > Attachments: TEZ-2159.wip.1.patch > > > It is useful to have capability to download the timeline data for a dag for > offline analysis. for ex. TEZ-2076 uses the timeline data to do offline > analysis of a tez application run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1529) ATS and TezClient integration in secure kerberos enabled cluster
[ https://issues.apache.org/jira/browse/TEZ-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran resolved TEZ-1529. --- Resolution: Not a Problem Closing as this has been tested. > ATS and TezClient integration in secure kerberos enabled cluster > - > > Key: TEZ-1529 > URL: https://issues.apache.org/jira/browse/TEZ-1529 > Project: Apache Tez > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran >Priority: Blocker > > This is a follow up for TEZ-1495 which address ATS - TezClient integration. > however it does not enable it in secure kerberos enabled cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2137) Add task counter to understand sorter final merge time
[ https://issues.apache.org/jira/browse/TEZ-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2137: -- Attachment: TEZ-2137.2.patch Revising the patch post TEZ-2001. Added the following counters, which provide absolute timestamps. Absolute timestamps are needed when we analyze the details at the vertex level. - SORTER_FLUSH_START_TIME: Absolute time at which the sorter starts flush() - SORTER_FINAL_MERGE_START_TIME: Absolute time at which the sorter starts the final merge. In case the final merge is disabled, this counter would not be populated. - SORTER_FLUSH_END_TIME: Absolute time at which the sorter finishes flushing the data for the final result. Need to add test cases to TestDefaultSorter/TestPipelinedSorter after TEZ-2198 gets committed. > Add task counter to understand sorter final merge time > -- > > Key: TEZ-2137 > URL: https://issues.apache.org/jira/browse/TEZ-2137 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2137.1.patch, TEZ-2137.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
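As a rough illustration of what an absolute-timestamp counter of this kind records: the counter names below come from the comment above, but the enum, map, and recording code are hypothetical sketches, not the actual patch:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of recording absolute timestamps as counters.
// Counter names come from the TEZ-2137 comment; everything else is illustrative.
public class SorterTimestampsSketch {
    enum SorterCounter {
        SORTER_FLUSH_START_TIME,
        SORTER_FINAL_MERGE_START_TIME,
        SORTER_FLUSH_END_TIME
    }

    private final Map<SorterCounter, Long> counters = new EnumMap<>(SorterCounter.class);

    /** Store the current wall-clock time as the counter's value. */
    void record(SorterCounter c) {
        counters.put(c, System.currentTimeMillis());
    }

    /** Sketch of a flush: start/end always recorded; merge start only if enabled. */
    void flush(boolean finalMergeEnabled) {
        record(SorterCounter.SORTER_FLUSH_START_TIME);
        if (finalMergeEnabled) {
            // When final merge is disabled, this counter stays unpopulated.
            record(SorterCounter.SORTER_FINAL_MERGE_START_TIME);
        }
        record(SorterCounter.SORTER_FLUSH_END_TIME);
    }

    Long get(SorterCounter c) {
        return counters.get(c);
    }
}
```

Absolute timestamps (rather than durations) let per-task counters be lined up across a vertex to see which task started or finished its merge last.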
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366878#comment-14366878 ] Jeff Zhang commented on TEZ-714: [~bikassaha] I think the biggest issue in my patch is the granularity of the committer thread. Currently I take it as vertex/dag level, but I think it should be one OutputCommitter per thread. I will update the patch later. For the other parts of the patch, here's more description, which I hope clarifies my patch. * VertexImpl ** Main change is in checkVertexForCompletion, where the commit happens. I change it to an async commit by wrapping it in a CallableEvent and submitting it to the shared thread pool. This introduces a new state COMMITTING, which represents that the vertex is committing. ** Also make the abort operation async. No new state is introduced here; if the vertex is aborting, then it is in the TERMINATING state. * DAGImpl ** Main change is in checkDAGForCompletion(), where the dag commit happens, and vertexSucceeded(), where the vertex group commit happens. Like VertexImpl, I also wrap the dag commit and vertex group commit in a CallableEvent and submit it to the shared thread pool. This also introduces a new state COMMITTING, which represents that all the vertices are done but some commits (dag commit or vertex group commit) are not yet completed. ** Like VertexImpl, if the dag is aborting, then it is in the TERMINATING state. > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. 
> 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
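The sync-to-async change discussed in this thread can be sketched as wrapping the blocking commit in a task submitted to a shared pool, with the state machine sitting in COMMITTING until the task completes. The sketch below is illustrative only, with hypothetical names; the real VertexImpl/DAGImpl transitions and event dispatch are considerably more involved:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of moving a blocking commit off the dispatcher thread.
// Hypothetical names; not the actual Tez state machine or CallableEvent code.
public class AsyncCommitSketch {
    enum State { RUNNING, COMMITTING, SUCCEEDED, FAILED }

    private volatile State state = State.RUNNING;
    private final ExecutorService commitPool = Executors.newFixedThreadPool(2);

    /** Instead of committing inline, enter COMMITTING and submit the work. */
    Future<?> startCommit(Runnable committer) {
        state = State.COMMITTING;
        return commitPool.submit(() -> {
            try {
                committer.run();          // the blocking OutputCommitter work
                state = State.SUCCEEDED;  // in Tez this would arrive as a dispatched event
            } catch (RuntimeException e) {
                state = State.FAILED;
            }
        });
    }

    /** Block until the commit task finishes, then report the resulting state. */
    State awaitCommit(Future<?> commitFuture) {
        try {
            commitFuture.get();
        } catch (Exception e) {
            state = State.FAILED;
        }
        return state;
    }

    State getState() {
        return state;
    }

    void shutdown() {
        commitPool.shutdown();
    }
}
```

The design point the review is circling: with one task per OutputCommitter, committers on the same vertex run in parallel and the dispatcher thread only handles the completion events, never the commit itself.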
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825 ] Jeff Zhang edited comment on TEZ-1909 at 3/18/15 8:37 AM: -- Attached the new patch to address the review comments. Apart from the issues in the review comments, I also found one issue in RecoveryService. For the scenario of draining the events before RecoveryService is stopped, previously I took the event queue's size being zero as an indication that all events were consumed, but that is not true: even if the event queue is empty, an event may still be being processed. I fixed this bug in the new patch just like AsyncDispatcher did. bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fixed; also added a unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster "Set getDagIDs()" ? Only for unit tests. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fixed. bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List Fixed. bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. Verify DAGAppMaster.createDAG is invoked. was (Author: zjffdu): Attach the new patch to address the review comment. Apart from the issues in the review comments, I also found there's one issue about RecoveryService. 
For the scenario of draining the events before RecoverySerivce is stopped, previously I take the event queue's size eqaul to zero as an indication of events are all consumed, but it is not true. Because even if the event queue is empty, the event may still being processing. I fix this bug in the new patch just like AsyncDispatcher did. bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fix it. also add unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster "Set getDagIDs()" ? Only for unit test. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fix it. bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List Fix it bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. verify DAGAppMaster.createDAG is invoked. > Remove need to copy over all events from attempt 1 to attempt 2 dir > --- > > Key: TEZ-1909 > URL: https://issues.apache.org/jira/browse/TEZ-1909 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch > > > Use of file versions should prevent the need for copying over data into a > second attempt dir. Care needs to be taken to handle "last corrupt record" > handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
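The draining bug described above (an empty event queue does not mean the last dequeued event has finished processing) is the race that AsyncDispatcher guards against by tracking a separate "drained" flag. A minimal sketch of that idea follows, with hypothetical names rather than the actual RecoveryService code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the drain race: queue.isEmpty() can be true while the
// consumer thread is still running the event it just dequeued. A separate
// volatile flag, updated only after run() returns, closes the gap.
// Hypothetical names; the real RecoveryService/AsyncDispatcher differ.
public class DrainSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean drained = true;   // no event pending or in flight
    private volatile boolean stopped = false;
    private final Object waitLock = new Object();

    void post(Runnable event) {
        drained = false;           // mark an event pending before it is visible
        queue.add(event);
    }

    void consumeLoop() {
        try {
            while (!stopped || !queue.isEmpty()) {
                Runnable event = queue.poll(50, TimeUnit.MILLISECONDS);
                if (event == null) {
                    continue;
                }
                event.run();       // the event is "in flight" here
                synchronized (waitLock) {
                    // Only after run() returns is it safe to report drained.
                    drained = queue.isEmpty();
                    waitLock.notifyAll();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Correct stop: wait until the queue is empty AND no event is in flight. */
    void stopAndDrain() {
        stopped = true;
        synchronized (waitLock) {
            try {
                while (!(queue.isEmpty() && drained)) {
                    waitLock.wait(50);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Checking only `queue.isEmpty()` in `stopAndDrain` would reproduce the reported bug: the service could stop while the final recovery event was still mid-write.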
[jira] [Updated] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1909: Attachment: TEZ-1909-3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825 ] Jeff Zhang commented on TEZ-1909: - Attached the new patch to address the review comments. Apart from the issues in the review comments, I also found one issue in RecoveryService. For the scenario of draining the events before RecoveryService is stopped, previously I took the event queue's size being zero as an indication that all events were consumed, but that is not true: even if the event queue is empty, an event may still be being processed. I fixed this bug in the new patch just like AsyncDispatcher did. bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fixed; also added a unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster "Set getDagIDs()" ? Only for unit tests. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fixed. bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List Fixed. bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. Verify DAGAppMaster.createDAG is invoked. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2208) Counter of REDUCE_INPUT_GROUPS is incorrect
[ https://issues.apache.org/jira/browse/TEZ-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2208: Attachment: Counter of REDUCE_INPUT_GROUPS.png > Counter of REDUCE_INPUT_GROUPS is incorrect > --- > > Key: TEZ-2208 > URL: https://issues.apache.org/jira/browse/TEZ-2208 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang > Attachments: Counter of REDUCE_INPUT_GROUPS.png > > > Counter of REDUCE_INPUT_GROUPS is always 1 less than the real number. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2208) Counter of REDUCE_INPUT_GROUPS is incorrect
Jeff Zhang created TEZ-2208: --- Summary: Counter of REDUCE_INPUT_GROUPS is incorrect Key: TEZ-2208 URL: https://issues.apache.org/jira/browse/TEZ-2208 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Counter of REDUCE_INPUT_GROUPS is always 1 less than the real number. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
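A counter that is always exactly one less than the true value typically points to counting key transitions instead of key groups, which drops one boundary group. The following is a hypothetical illustration of that bug pattern, not the actual Tez reducer code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a classic off-by-one in group counting:
// incrementing only when the key changes misses the final group.
public class GroupCountSketch {

    /** Buggy: counts key transitions, so the group in progress at the end is never counted. */
    static int buggyGroupCount(List<String> sortedKeys) {
        int groups = 0;
        for (int i = 1; i < sortedKeys.size(); i++) {
            if (!sortedKeys.get(i).equals(sortedKeys.get(i - 1))) {
                groups++;   // transition seen: close out the previous group
            }
        }
        return groups;      // off by one whenever the input is non-empty
    }

    /** Fixed: account for the first group up front, then one per transition. */
    static int fixedGroupCount(List<String> sortedKeys) {
        if (sortedKeys.isEmpty()) {
            return 0;
        }
        return 1 + buggyGroupCount(sortedKeys);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "c", "c");
        System.out.println(buggyGroupCount(keys)); // prints 2, but there are 3 groups
        System.out.println(fixedGroupCount(keys)); // prints 3
    }
}
```

Whether REDUCE_INPUT_GROUPS has this exact shape would need to be confirmed against the reducer's key-grouping loop; the constant "1 less than the real number" symptom is what makes this pattern the first place to look.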