[jira] [Created] (TEZ-2209) Fix pipelined shuffle to fetch data from any one attempt
Rajesh Balamohan created TEZ-2209: - Summary: Fix pipelined shuffle to fetch data from any one attempt Key: TEZ-2209 URL: https://issues.apache.org/jira/browse/TEZ-2209 Project: Apache Tez Issue Type: Improvement Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan - Currently, pipelined shuffle fails fast the moment it receives data from an attempt other than 0. This was done as an additional check to prevent data from being copied from speculative attempts. - However, in some scenarios (such as LLAP), it is possible that the task attempt gets killed even before generating any data. In such cases, attempt #1 or a later attempt would generate the actual data. - This jira is created to allow pipelined shuffle to download data from any one attempt. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
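A minimal sketch of the check described above (the class and method names are illustrative, not the actual Tez ShuffleManager API): instead of rejecting everything that is not attempt 0, the consumer could remember the first attempt it sees per source task and accept spills only from that one attempt.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: lets pipelined shuffle fetch from any ONE attempt
// per source task, rather than hard-coding attempt 0.
class AnyOneAttemptTracker {
    private final Map<Integer, Integer> acceptedAttempt = new ConcurrentHashMap<>();

    /** Returns true if data from (taskIndex, attemptNumber) may be fetched. */
    boolean accept(int taskIndex, int attemptNumber) {
        // First attempt seen for this task wins; a different attempt is
        // rejected later, so spills from speculative attempts never mix.
        Integer chosen = acceptedAttempt.putIfAbsent(taskIndex, attemptNumber);
        return chosen == null || chosen == attemptNumber;
    }
}
```

This keeps the original safety property (no mixing of data across attempts) while allowing attempt #1 or later to be the chosen producer when attempt #0 never generated data.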
[jira] [Commented] (TEZ-2196) Consider reusing UnorderedPartitionedKVWriter with single output in UnorderedKVOutput
[ https://issues.apache.org/jira/browse/TEZ-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368455#comment-14368455 ] Rajesh Balamohan commented on TEZ-2196: --- [~sseth] - Removed TEZ_RUNTIME_TRANSFER_DATA_VIA_EVENTS_ENABLED in the patch. Can I go ahead and remove the corresponding processing in consumer side as well? (unordered shuffle manager has addCompletedInputWithData). > Consider reusing UnorderedPartitionedKVWriter with single output in > UnorderedKVOutput > - > > Key: TEZ-2196 > URL: https://issues.apache.org/jira/browse/TEZ-2196 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2196.1.patch > > > Can possibly get rid of FileBasedKVWriter and reuse > UnorderedPartitionedKVWriter with single partition in UnorderedKVOutput. > This can also benefit from pipelined shuffle changes done in > UnorderedPartitionedKVWriter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368377#comment-14368377 ] Jeff Zhang edited comment on TEZ-2204 at 3/19/15 2:23 AM: -- Sometimes a DAGAppMaster leak happens. It may be an issue related to YARN-2917, because Tez has its own AsyncDispatcher but has not included the patch from YARN-2917. Pasting the jstack:
{code}
"Thread-1" prio=5 tid=0x7f9d13011800 nid=0xe507 in Object.wait() [0x000117559000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:162)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed61000> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1539)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1674)
        - locked <0x0007fed0dc50> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed0de80> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1940)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

   Locked ownable synchronizers:
        - None

"App Shared Pool - #1" daemon prio=5 tid=0x7f9d13e60800 nid=0xdd03 in Object.wait() [0x00011714c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1355)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x0007ff111ec8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:962)
        at org.apache.tez.test.TestAMRecovery$ControlledImmediateStartVertexManager.onSourceTaskCompleted(TestAMRecovery.java:601)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventSourceTaskCompleted.invoke(VertexManager.java:525)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:580)
        - locked <0x0007fb82fac8> (a org.apache.tez.dag.app.dag.impl.VertexManager)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:1)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:575)
        at org.apache.tez.dag.app.dag.event.CallableEvent.call(CallableEvent.java:27)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - <0x0007fbc182d8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
was (Author: zjffdu): Sometimes a DAGAppMaster leak happens. It may be an issue related to YARN-2917, because Tez has its own AsyncDispatcher but has not included the patch from YARN-2917.
[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.
[ https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368377#comment-14368377 ] Jeff Zhang commented on TEZ-2204: - It may be an issue related to YARN-2917, because Tez has its own AsyncDispatcher but has not included the patch from YARN-2917. Copying the jstack:
{code}
"Thread-1" prio=5 tid=0x7f9d13011800 nid=0xe507 in Object.wait() [0x000117559000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007fed1c360> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:162)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed61000> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1539)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1674)
        - locked <0x0007fed0dc50> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x0007fed0de80> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1940)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

   Locked ownable synchronizers:
        - None

"App Shared Pool - #1" daemon prio=5 tid=0x7f9d13e60800 nid=0xdd03 in Object.wait() [0x00011714c000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1281)
        - locked <0x0007ff1193b8> (a org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1355)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x0007ff111ec8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:962)
        at org.apache.tez.test.TestAMRecovery$ControlledImmediateStartVertexManager.onSourceTaskCompleted(TestAMRecovery.java:601)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEventSourceTaskCompleted.invoke(VertexManager.java:525)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:580)
        - locked <0x0007fb82fac8> (a org.apache.tez.dag.app.dag.impl.VertexManager)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent$1.run(VertexManager.java:1)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerEvent.call(VertexManager.java:575)
        at org.apache.tez.dag.app.dag.event.CallableEvent.call(CallableEvent.java:27)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - <0x0007fbc182d8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
> TestAMRecovery increasingly flaky on jenkins builds. 
> - > > Key: TEZ-2204 > URL: https://issues.apache.org/jira/browse/TEZ-2204 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Jeff Zhang > > In recent pre-commit builds and daily builds, there seem to have been some > occurrences of TestAMRecovery failing or timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-160) Remove 5 second sleep at the end of AM completion.
[ https://issues.apache.org/jira/browse/TEZ-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367980#comment-14367980 ] André Kelpe commented on TEZ-160: - They are independent apps; the shutdown happens after each test so that we have a clean test environment. > Remove 5 second sleep at the end of AM completion. > -- > > Key: TEZ-160 > URL: https://issues.apache.org/jira/browse/TEZ-160 > Project: Apache Tez > Issue Type: Bug >Reporter: Siddharth Seth > Labels: TEZ-0.2.0 > > ClientServiceDelegate/DAGClient doesn't seem to be getting job completion > status from the AM after job completion. It, instead, always relies on the RM > for this information. The information returned by the AM should be used while > it's available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367849#comment-14367849 ] Bikas Saha commented on TEZ-145: I know what you are talking about, but let me restate to check whether we are on the same page. Combining can happen at multiple levels - task, host, rack etc. Doing these combines in theory requires maintaining partition boundaries per combining level. However, if tasks maintain partition boundaries, then there is a task explosion (== level-arity * partition count). Hence, an efficient multi-level combine operation needs to operate on multiple partitions per task at each level, so that a reasonable number of tasks can be used to process a large number of partitions. This statement can be true even for the final reducer. Partially, that is what happens with auto-reduce, except that the tasks lose their partition boundaries. If the processor can find a way to process multiple partitions while keeping them logically separate, then we could de-link physical tasks from physical partitioning. If that is supported by the processor, the edge manager can be set up to do the correct routing of N output/partition indices to the same task. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
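The "task explosion" arithmetic in the comment above can be sketched as follows (an illustrative back-of-the-envelope calculation, not Tez code; the method names are invented for this sketch):

```java
// Hedged sketch of the task-count trade-off: keeping strict partition
// boundaries needs one task per partition at every combining level, while
// letting a task hold several logically-separate partitions divides the
// per-level task count by that factor.
class CombineTaskCounts {
    // Tasks needed when every level keeps one partition per task:
    // level-arity * partition count, as stated in the comment.
    static long tasksWithPartitionBoundaries(int levels, int partitions) {
        return (long) levels * partitions;
    }

    // Tasks needed when each task processes partitionsPerTask partitions
    // while keeping them logically separate.
    static long tasksWithMultiPartitionTasks(int levels, int partitions,
                                             int partitionsPerTask) {
        long perLevel = (partitions + partitionsPerTask - 1L) / partitionsPerTask; // ceiling
        return levels * perLevel;
    }
}
```

For example, 3 combining levels over 2000 partitions needs 6000 single-partition tasks, but only 60 tasks if each task can hold 100 logically-separate partitions.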
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367813#comment-14367813 ] Gopal V commented on TEZ-145: - This is a question for [~bikassaha]. There is a combiner edge & vertex manager that needs to go along with this, which converts all partitions from all local inputs into one combine processor (i.e. if it finds it has remote fetches to do, it should just forward the DME events using the pipelined mode of >1 event per attempt). To be able to bail out with a no-op like that, all partitioning throughout has to be exactly the reducer partition count. This is the optimal mode, but it makes everything extra complex. Assume you have 600 hosts over 30 racks which ran a map-task + 2000 partitions in the reducer. The host-level combiner input count is actually 600 x 2000 partitions, which can be grouped into 600 x m groups - not 2000 groups. The rack-level combiner input count is actually 30 x 2000 partitions, which can be grouped into 30 x n groups - not 2000 groups. Yet, all the inputs are actually always partitioned into 2000 partitions and the destination task-index is determined by something other than the partition. So, how practical is that? > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
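The input-count arithmetic in the example above can be made concrete with a small sketch (illustrative only; H, R, and P are the hosts, racks, and reducer partitions from the comment):

```java
// Hedged sketch of the combiner input counts: with H map hosts, R racks,
// and P reducer partitions, a host-level combiner sees H * P partition
// streams and a rack-level combiner sees R * P, even though the data is
// always partitioned into exactly P partitions.
class CombinerInputCounts {
    static long hostLevelInputs(int hosts, int partitions) {
        return (long) hosts * partitions;
    }

    static long rackLevelInputs(int racks, int partitions) {
        return (long) racks * partitions;
    }
}
```

With the numbers from the comment (600 hosts, 30 racks, 2000 partitions), the host-level combiner sees 1,200,000 inputs and the rack-level combiner 60,000 - which is why the destination task-index has to be decided by something other than the partition number alone.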
[jira] [Commented] (TEZ-2137) Add task counter to understand sorter final merge time
[ https://issues.apache.org/jira/browse/TEZ-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367764#comment-14367764 ] Hitesh Shah commented on TEZ-2137: -- At this point, these counters are only useful at a task level. We should look at how to make these counters usable at the vertex level. Current counter aggregation (which does a simple sum) is next to useless for timestamp information. Very few users are likely to dig into counters at each task level; most will look at aggregates at the DAG and vertex level. Only the analyser is likely to make use of these counters at the granular level. Given the above, how much memory footprint are we adding for these new counters? Should we have better logic on how to keep memory in check even as we add more and more information at the task level (needed only for later deep analysis)? \cc [~bikassaha] > Add task counter to understand sorter final merge time > -- > > Key: TEZ-2137 > URL: https://issues.apache.org/jira/browse/TEZ-2137 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2137.1.patch, TEZ-2137.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
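The point about sum-aggregation being useless for timestamps can be illustrated with a tiny sketch (illustrative helpers, not Tez's TezCounter API): summing per-task timestamps produces a meaningless number, whereas min/max-style aggregation keeps a timestamp counter interpretable at the vertex level.

```java
// Hedged sketch: two aggregation strategies over per-task timestamp
// counters. Only max (or min) yields a value with meaning at the vertex
// level, e.g. "last task finished final merge at ...".
class CounterAggregation {
    static long sum(long[] values) {
        long s = 0;
        for (long v : values) s += v; // meaningless for timestamps
        return s;
    }

    static long max(long[] values) {
        long m = Long.MIN_VALUE;
        for (long v : values) m = Math.max(m, v); // latest timestamp
        return m;
    }
}
```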
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367762#comment-14367762 ] Gopal V commented on TEZ-145: - [~ozawa]: the CombineProcessor patch looks good. This will help applications which do no in-memory aggregation, but you're effectively moving the data over racks ~3x. So this is a necessary part of the fix, but not the complete fix, as long as the ShuffleVertexManager is being used to connect them up, because that vertex manager has no way to provide task locality (rack-local or host-local) when spinning up tasks. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367685#comment-14367685 ] Hadoop QA commented on TEZ-145: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705400/TEZ-145.2.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/310//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/310//console This message is automatically generated. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Success: TEZ-145 PreCommit Build #310
Jira: https://issues.apache.org/jira/browse/TEZ-145 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/310/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2757 lines...] [INFO] Final Memory: 67M/864M [INFO] {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12705400/TEZ-145.2.patch against master revision 9b845f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/310//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/310//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 8b3154918e3c17438ebe050a32bf3a55b2bc8187 logged out == == Finished build. == == Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #309 Archived 44 artifacts Archive block size is 32768 Received 8 blocks and 2472767 bytes Compression is 9.6% Took 1.8 sec Description set: TEZ-145 Recording test results Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Updated] (TEZ-145) Support a combiner processor that can run non-local to map/reduce nodes
[ https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated TEZ-145: --- Attachment: TEZ-145.2.patch Fix warnings by findbugs. > Support a combiner processor that can run non-local to map/reduce nodes > --- > > Key: TEZ-145 > URL: https://issues.apache.org/jira/browse/TEZ-145 > Project: Apache Tez > Issue Type: Bug >Reporter: Hitesh Shah >Assignee: Tsuyoshi Ozawa > Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch > > > For aggregate operators that can benefit by running in multi-level trees, > support of being able to run a combiner in a non-local mode would allow > performance efficiencies to be gained by running a combiner at a rack-level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367561#comment-14367561 ] Hitesh Shah commented on TEZ-2205: -- Added a comment on YARN-2375. Let us see what happens there :) > Tez still tries to post to ATS when yarn.timeline-service.enabled=false > --- > > Key: TEZ-2205 > URL: https://issues.apache.org/jira/browse/TEZ-2205 > Project: Apache Tez > Issue Type: Sub-task >Affects Versions: 0.6.1 >Reporter: Chang Li >Assignee: Chang Li > > when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, > but hits error as token is not found. Does not fail the job because of the > fix to not fail job when there is error posting to ATS. But it should not be > trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Description: Tez has an interface to integrate various input and output types into Tez. This makes jobs pluggable where compatible inputs and outputs can be plugged in/out without changing the main processing logic of the job. This jira tracks adding support for Kafka as an input and output for Tez jobs/tasks. (was: It is something like existing MRInput and MROutput. Kafka I/O is expected to open up the new domain for Tez adoption. More details will be added soon..) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > Tez has an interface to integrate various input and output types into Tez. > This makes jobs pluggable where compatible inputs and outputs can be plugged > in/out without changing the main processing logic of the job. This jira > tracks adding support for Kafka as an input and output for Tez jobs/tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052774#comment-14052774 ] Bikas Saha edited comment on TEZ-1175 at 3/18/15 5:46 PM: -- Any updates on this? was (Author: bikassaha): Any updates on this? > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Labels: gsoc gsoc2015 hadoop java tez (was: gsoc gsoc2015) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Comment: was deleted (was: Any updates on this?) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015, hadoop, java, tez > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-1175) Support Kafka Input and Output in Tez
[ https://issues.apache.org/jira/browse/TEZ-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-1175: Labels: gsoc gsoc2015 (was: ) > Support Kafka Input and Output in Tez > - > > Key: TEZ-1175 > URL: https://issues.apache.org/jira/browse/TEZ-1175 > Project: Apache Tez > Issue Type: Bug >Reporter: Mohammad Kamrul Islam > Labels: gsoc, gsoc2015 > > It is something like existing MRInput and MROutput. > Kafka I/O is expected to open up the new domain for Tez adoption. > More details will be added soon.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367542#comment-14367542 ] Bikas Saha commented on TEZ-714: I understand that. What I expected were code changes around the code that invokes the commit, changing it from sync to async, plus new transitions from the committing state. But there were changes to other parts of the code too. Perhaps I am missing something. I will take a closer look at the next patch, where async operations are on a per-commit basis. Also, I am not sure why group commit and non-group commit need to be differentiated in different transitions. If the next patch continues to differentiate them (instead of just counting pending operations), then perhaps you can add a comment on why it's necessary, so that it's easy to understand the reason. > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. > 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false
[ https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367518#comment-14367518 ] Chang Li commented on TEZ-2205: --- Thanks for the clarification [~jeagles]. [~hitesh], which way should I proceed to solve this problem? > Tez still tries to post to ATS when yarn.timeline-service.enabled=false > --- > > Key: TEZ-2205 > URL: https://issues.apache.org/jira/browse/TEZ-2205 > Project: Apache Tez > Issue Type: Sub-task >Affects Versions: 0.6.1 >Reporter: Chang Li >Assignee: Chang Li > > When yarn.timeline-service.enabled=false is set, Tez still tries posting to ATS > but hits an error as the token is not found. It does not fail the job, because of the > fix to not fail the job when there is an error posting to ATS. But it should not be > trying to post to ATS in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367463#comment-14367463 ] Jonathan Eagles commented on TEZ-1923: -- This will be a good fix for 0.6.1 > FetcherOrderedGrouped gets into infinite loop due to memory pressure > > > Key: TEZ-1923 > URL: https://issues.apache.org/jira/browse/TEZ-1923 > Project: Apache Tez > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Fix For: 0.7.0 > > Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, > TEZ-1923.4.patch > > > - Ran a comparatively large job (temp table creation) at 10 TB scale. > - Turned on intermediate mem-to-mem > (tez.runtime.shuffle.memory-to-memory.enable=true and > tez.runtime.shuffle.memory-to-memory.segments=4) > - Some reducers get lots of data and quickly gets into infinite loop > {code} > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms > 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... 
> 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms > 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] > orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for > url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true > sent hash and receievd reply 0 ms > 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] > orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned > Status.WAIT ... > {code} > Additional debug/patch statements revealed that InMemoryMerge is not invoked > appropriately and not releasing the memory back for fetchers to proceed. 
e.g > debug/patch messages are given below > {code} > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, > mergeThreshold=708669632 <<=== InMemoryMerge would be started in this case > as commitMemory >= mergeThreshold > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO > [fetcher [Map_1] #2] orderedgrouped.MergeManager: > Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO > [fetcher [Map_1] #1] orderedgrouped.MergeManager: > Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, > mergeThreshold=708669632 <<=== InMemoryMerge would *NOT* be started in this > case as commitMemory < mergeThreshold. But the usedMemory is higher than > memoryLimit. Fetchers would keep waiting indefinitely until memory is > released. InMemoryMerge will not kick in and not release memory. > {code} > In MergeManager, in memory merging is i
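The failure mode in these logs boils down to two conditions that never reconcile: the in-memory merge starts only when commitMemory >= mergeThreshold, while fetchers are made to WAIT whenever usedMemory exceeds memoryLimit. A minimal sketch of that stall condition follows; the names and logic are simplified illustrations, not the actual MergeManager code:

```java
// Illustrative sketch of the stall condition described in the logs above.
// Hypothetical, simplified names; not the actual Tez MergeManager logic.
public class MergeStallSketch {

    /** In-memory merge starts only once committed memory crosses the threshold. */
    static boolean shouldStartInMemoryMerge(long commitMemory, long mergeThreshold) {
        return commitMemory >= mergeThreshold;
    }

    /** Fetchers are told to WAIT while used memory exceeds the limit. */
    static boolean fetchersMustWait(long usedMemory, long memoryLimit) {
        return usedMemory > memoryLimit;
    }

    /** The stall: fetchers wait, but nothing will ever release memory. */
    static boolean isStalled(long usedMemory, long memoryLimit,
                             long commitMemory, long mergeThreshold) {
        return fetchersMustWait(usedMemory, memoryLimit)
            && !shouldStartInMemoryMerge(commitMemory, mergeThreshold);
    }

    public static void main(String[] args) {
        // Values from the 02:05:52 log line above: stalled.
        System.out.println(isStalled(1273349784L, 1073741824L, 347296632L, 708669632L)); // prints true
        // Values from the 02:05:48 log line: merge would start, no stall.
        System.out.println(isStalled(1551867234L, 1073741824L, 883028388L, 708669632L)); // prints false
    }
}
```

Under these assumed conditions, once usedMemory passes memoryLimit while commitMemory is below mergeThreshold, neither side can make progress, matching the indefinite Status.WAIT loop in the logs.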
[jira] [Comment Edited] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367428#comment-14367428 ] Hitesh Shah edited comment on TEZ-1923 at 3/18/15 4:42 PM: --- Is this something that needs to be backported to 0.5 and 0.6 ? \cc [~jeagles] was (Author: hitesh): Is this something that needs to be backported to 0.5 and 0.6 ?
[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure
[ https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367428#comment-14367428 ] Hitesh Shah commented on TEZ-1923: -- Is this something that needs to be backported to 0.5 and 0.6 ?
[jira] [Commented] (TEZ-2159) Tez UI: download timeline data for offline use.
[ https://issues.apache.org/jira/browse/TEZ-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367312#comment-14367312 ] Rajesh Balamohan commented on TEZ-2159: --- Thanks [~pramachandran]. Tried out the patch and the zip format looks fine. > Tez UI: download timeline data for offline use. > --- > > Key: TEZ-2159 > URL: https://issues.apache.org/jira/browse/TEZ-2159 > Project: Apache Tez > Issue Type: Improvement > Components: UI >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran > Attachments: TEZ-2159.wip.1.patch > > > It is useful to have capability to download the timeline data for a dag for > offline analysis. for ex. TEZ-2076 uses the timeline data to do offline > analysis of a tez application run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-1529) ATS and TezClient integration in secure kerberos enabled cluster
[ https://issues.apache.org/jira/browse/TEZ-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran resolved TEZ-1529. --- Resolution: Not a Problem Closing as this has been tested. > ATS and TezClient integration in secure kerberos enabled cluster > - > > Key: TEZ-1529 > URL: https://issues.apache.org/jira/browse/TEZ-1529 > Project: Apache Tez > Issue Type: Bug >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran >Priority: Blocker > > This is a follow up for TEZ-1495 which address ATS - TezClient integration. > however it does not enable it in secure kerberos enabled cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2137) Add task counter to understand sorter final merge time
[ https://issues.apache.org/jira/browse/TEZ-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2137: -- Attachment: TEZ-2137.2.patch Revising the patch post TEZ-2001. Added the following counters, which provide absolute timestamps. Absolute timestamps are needed when we analyze the details at the vertex level. - SORTER_FLUSH_START_TIME: Absolute time at which the sorter starts flush() - SORTER_FINAL_MERGE_START_TIME: Absolute time at which the sorter starts the final merge. In case the final merge is disabled, this counter would not be populated. - SORTER_FLUSH_END_TIME: Absolute time at which the sorter finishes flushing the data for the final result. Need to add test cases to TestDefaultSorter/TestPipelinedSorter after TEZ-2198 gets committed. > Add task counter to understand sorter final merge time > -- > > Key: TEZ-2137 > URL: https://issues.apache.org/jira/browse/TEZ-2137 > Project: Apache Tez > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan > Attachments: TEZ-2137.1.patch, TEZ-2137.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
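As a rough illustration of what an absolute-timestamp counter of this kind records: the counter names below come from the comment above, but the enum, map, and recording code are hypothetical sketches, not the actual patch:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of recording absolute timestamps as counters.
// Counter names come from the TEZ-2137 comment; everything else is illustrative.
public class SorterTimestampsSketch {
    enum SorterCounter {
        SORTER_FLUSH_START_TIME,
        SORTER_FINAL_MERGE_START_TIME,
        SORTER_FLUSH_END_TIME
    }

    private final Map<SorterCounter, Long> counters = new EnumMap<>(SorterCounter.class);

    /** Store the current wall-clock time as the counter's value. */
    void record(SorterCounter c) {
        counters.put(c, System.currentTimeMillis());
    }

    /** Sketch of a flush: start/end always recorded; merge start only if enabled. */
    void flush(boolean finalMergeEnabled) {
        record(SorterCounter.SORTER_FLUSH_START_TIME);
        if (finalMergeEnabled) {
            // When final merge is disabled, this counter stays unpopulated.
            record(SorterCounter.SORTER_FINAL_MERGE_START_TIME);
        }
        record(SorterCounter.SORTER_FLUSH_END_TIME);
    }

    Long get(SorterCounter c) {
        return counters.get(c);
    }
}
```

Absolute timestamps (rather than durations) let per-task counters be lined up across a vertex to see which task started or finished its merge last.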
[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread
[ https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366878#comment-14366878 ] Jeff Zhang commented on TEZ-714: [~bikassaha] I think the biggest issue in my patch is the granularity of the committer thread. Currently I take it as vertex/dag level, but I think it should be one OutputCommitter per thread. I will update the patch later. For the other parts of the patch, here's more description, which I hope clarifies my patch. * VertexImpl ** Main change is in checkVertexForCompletion, where the commit happens. I change it to an async commit by wrapping it in a CallableEvent and submitting it to the shared thread pool. This introduces a new state COMMITTING, which represents that the vertex is committing. ** Also make the abort operation async. No new state is introduced here; if the vertex is aborting, then it is in the TERMINATING state. * DAGImpl ** Main change is in checkDAGForCompletion(), where the dag commit happens, and vertexSucceeded(), where the vertex group commit happens. Like VertexImpl, I also wrap the dag commit and vertex group commit in a CallableEvent and submit it to the shared thread pool. This also introduces a new state COMMITTING, which represents that all the vertices are done but some commits (dag commit or vertex group commit) are not yet completed. ** Like VertexImpl, if the dag is aborting, then it is in the TERMINATING state. > OutputCommitters should not run in the main AM dispatcher thread > > > Key: TEZ-714 > URL: https://issues.apache.org/jira/browse/TEZ-714 > Project: Apache Tez > Issue Type: Improvement >Reporter: Siddharth Seth >Assignee: Jeff Zhang >Priority: Critical > Attachments: DAG_2.pdf, TEZ-714-1.patch, Vertex_2.pdf > > > Follow up jira from TEZ-41. > 1) If there's multiple OutputCommitters on a Vertex, they can be run in > parallel. 
> 2) Running an OutputCommitter in the main thread blocks all other event > handling, w.r.t the DAG, and causes the event queue to back up. > 3) This should also cover shared commits that happen in the DAG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
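The sync-to-async change discussed in this thread can be sketched as wrapping the blocking commit in a task submitted to a shared pool, with the state machine sitting in COMMITTING until the task completes. The sketch below is illustrative only, with hypothetical names; the real VertexImpl/DAGImpl transitions and event dispatch are considerably more involved:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of moving a blocking commit off the dispatcher thread.
// Hypothetical names; not the actual Tez state machine or CallableEvent code.
public class AsyncCommitSketch {
    enum State { RUNNING, COMMITTING, SUCCEEDED, FAILED }

    private volatile State state = State.RUNNING;
    private final ExecutorService commitPool = Executors.newFixedThreadPool(2);

    /** Instead of committing inline, enter COMMITTING and submit the work. */
    Future<?> startCommit(Runnable committer) {
        state = State.COMMITTING;
        return commitPool.submit(() -> {
            try {
                committer.run();          // the blocking OutputCommitter work
                state = State.SUCCEEDED;  // in Tez this would arrive as a dispatched event
            } catch (RuntimeException e) {
                state = State.FAILED;
            }
        });
    }

    /** Block until the commit task finishes, then report the resulting state. */
    State awaitCommit(Future<?> commitFuture) {
        try {
            commitFuture.get();
        } catch (Exception e) {
            state = State.FAILED;
        }
        return state;
    }

    State getState() {
        return state;
    }

    void shutdown() {
        commitPool.shutdown();
    }
}
```

The design point the review is circling: with one task per OutputCommitter, committers on the same vertex run in parallel and the dispatcher thread only handles the completion events, never the commit itself.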
[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825 ] Jeff Zhang edited comment on TEZ-1909 at 3/18/15 8:37 AM: -- Attached the new patch to address the review comments. Apart from the issues in the review comments, I also found one issue in RecoveryService. For the scenario of draining the events before RecoveryService is stopped, previously I took the event queue's size being zero as an indication that all events were consumed, but that is not true: even if the event queue is empty, an event may still be being processed. I fixed this bug in the new patch just like AsyncDispatcher did. bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fixed; also added a unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster "Set getDagIDs()" ? Only for unit tests. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fixed. bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List Fixed. bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. Verify DAGAppMaster.createDAG is invoked. was (Author: zjffdu): Attach the new patch to address the review comment. Apart from the issues in the review comments, I also found there's one issue about RecoveryService. 
For the scenario of draining the events before RecoverySerivce is stopped, previously I take the event queue's size eqaul to zero as an indication of events are all consumed, but it is not true. Because even if the event queue is empty, the event may still being processing. I fix this bug in the new patch just like AsyncDispatcher did. bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fix it. also add unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster "Set getDagIDs()" ? Only for unit test. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fix it. bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List Fix it bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. verify DAGAppMaster.createDAG is invoked. > Remove need to copy over all events from attempt 1 to attempt 2 dir > --- > > Key: TEZ-1909 > URL: https://issues.apache.org/jira/browse/TEZ-1909 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Hitesh Shah >Assignee: Jeff Zhang > Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch > > > Use of file versions should prevent the need for copying over data into a > second attempt dir. Care needs to be taken to handle "last corrupt record" > handling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
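The draining bug described above (an empty event queue does not mean the last dequeued event has finished processing) is the race that AsyncDispatcher guards against by tracking a separate "drained" flag. A minimal sketch of that idea follows, with hypothetical names rather than the actual RecoveryService code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the drain race: queue.isEmpty() can be true while the
// consumer thread is still running the event it just dequeued. A separate
// volatile flag, updated only after run() returns, closes the gap.
// Hypothetical names; the real RecoveryService/AsyncDispatcher differ.
public class DrainSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean drained = true;   // no event pending or in flight
    private volatile boolean stopped = false;
    private final Object waitLock = new Object();

    void post(Runnable event) {
        drained = false;           // mark an event pending before it is visible
        queue.add(event);
    }

    void consumeLoop() {
        try {
            while (!stopped || !queue.isEmpty()) {
                Runnable event = queue.poll(50, TimeUnit.MILLISECONDS);
                if (event == null) {
                    continue;
                }
                event.run();       // the event is "in flight" here
                synchronized (waitLock) {
                    // Only after run() returns is it safe to report drained.
                    drained = queue.isEmpty();
                    waitLock.notifyAll();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Correct stop: wait until the queue is empty AND no event is in flight. */
    void stopAndDrain() {
        stopped = true;
        synchronized (waitLock) {
            try {
                while (!(queue.isEmpty() && drained)) {
                    waitLock.wait(50);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Checking only `queue.isEmpty()` in `stopAndDrain` would reproduce the reported bug: the service could stop while the final recovery event was still mid-write.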
[jira] [Updated] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1909: Attachment: TEZ-1909-3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir
[ https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825 ] Jeff Zhang commented on TEZ-1909: - Attached the new patch to address the review comments. Apart from the issues in the review comments, I also found one issue in RecoveryService. For the scenario of draining the events before RecoveryService is stopped, previously I took the event queue's size being zero as an indication that all events were consumed, but that is not true: even if the event queue is empty, an event may still be being processed. I fixed this bug in the new patch just like AsyncDispatcher did. bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of the loop to prevent new files from being opened and read ( in addition to short-circuiting the read of all events in the given file ). Maybe just log a message that other files were present and skipped Fixed; also added a unit test in TestRecoveryParser bq. any reason why this is needed in the DAGAppMaster "Set getDagIDs()" ? Only for unit tests. But in the new patch, I remove it and initialize the Set in the setup method. bq. also, we should add a test for adding corrupt data to the summary stream and ensuring that its processing fails Done. bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used anywhere apart from being set to true in one of the tests. Fixed. bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List Fixed. bq. testCorruptedLastRecord should also verify that the dag submitted event was seen. Done. Verify DAGAppMaster.createDAG is invoked. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2208) Counter of REDUCE_INPUT_GROUPS is incorrect
[ https://issues.apache.org/jira/browse/TEZ-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2208: Attachment: Counter of REDUCE_INPUT_GROUPS.png > Counter of REDUCE_INPUT_GROUPS is incorrect > --- > > Key: TEZ-2208 > URL: https://issues.apache.org/jira/browse/TEZ-2208 > Project: Apache Tez > Issue Type: Bug >Reporter: Jeff Zhang > Attachments: Counter of REDUCE_INPUT_GROUPS.png > > > Counter of REDUCE_INPUT_GROUPS is always 1 less than the real number. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2208) Counter of REDUCE_INPUT_GROUPS is incorrect
Jeff Zhang created TEZ-2208: --- Summary: Counter of REDUCE_INPUT_GROUPS is incorrect Key: TEZ-2208 URL: https://issues.apache.org/jira/browse/TEZ-2208 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Counter of REDUCE_INPUT_GROUPS is always 1 less than the real number. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
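A counter that is always exactly one less than the true value typically points to counting key transitions instead of key groups, which drops one boundary group. The following is a hypothetical illustration of that bug pattern, not the actual Tez reducer code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a classic off-by-one in group counting:
// incrementing only when the key changes misses the final group.
public class GroupCountSketch {

    /** Buggy: counts key transitions, so the group in progress at the end is never counted. */
    static int buggyGroupCount(List<String> sortedKeys) {
        int groups = 0;
        for (int i = 1; i < sortedKeys.size(); i++) {
            if (!sortedKeys.get(i).equals(sortedKeys.get(i - 1))) {
                groups++;   // transition seen: close out the previous group
            }
        }
        return groups;      // off by one whenever the input is non-empty
    }

    /** Fixed: account for the first group up front, then one per transition. */
    static int fixedGroupCount(List<String> sortedKeys) {
        if (sortedKeys.isEmpty()) {
            return 0;
        }
        return 1 + buggyGroupCount(sortedKeys);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "c", "c");
        System.out.println(buggyGroupCount(keys)); // prints 2, but there are 3 groups
        System.out.println(fixedGroupCount(keys)); // prints 3
    }
}
```

Whether REDUCE_INPUT_GROUPS has this exact shape would need to be confirmed against the reducer's key-grouping loop; the constant "1 less than the real number" symptom is what makes this pattern the first place to look.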