[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-04-15 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497547#comment-14497547
 ] 

Jeff Zhang commented on TEZ-714:


Thanks [~bikassaha]. Committed to master.

commit d932579b002f14b81836eeed75f4bf92d4ed7fbf (HEAD, master, TEZ-714)
Author: Jeff Zhang 
Date:   Thu Apr 16 06:39:34 2015 +0200

TEZ-714. OutputCommitters should not run in the main AM dispatcher thread 
(zjffdu)

> OutputCommitters should not run in the main AM dispatcher thread
> 
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Jeff Zhang
>Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-10.patch, 
> TEZ-714-11.patch, TEZ-714-12.patch, TEZ-714-13.patch, TEZ-714-14.patch, 
> TEZ-714-15.patch, TEZ-714-16.patch, TEZ-714-17.patch, TEZ-714-2.patch, 
> TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, TEZ-714-6.patch, 
> TEZ-714-7.patch, TEZ-714-8.patch, TEZ-714-9.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there's multiple OutputCommitters on a Vertex, they can be run in 
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event 
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.
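
Point 1) above could be sketched as follows. This is a hypothetical illustration, not the actual Tez OutputCommitter API: the `Committer` interface and `commitAll()` are invented names. The idea is simply that each commit is submitted to a worker pool so that commits run in parallel and never occupy the AM's central dispatcher thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: run every committer's commit() on a worker
// pool instead of the dispatcher thread, and wait for all of them.
public class ParallelCommit {
    interface Committer {
        void commit() throws Exception;
    }

    // Run all commits concurrently; true only if every one succeeded.
    static boolean commitAll(List<Committer> committers) throws InterruptedException {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, committers.size()));
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Committer c : committers) {
                // Submitted as a Callable so a thrown exception is captured
                // in the Future rather than lost on the worker thread.
                pending.add(pool.submit(() -> { c.commit(); return null; }));
            }
            for (Future<?> f : pending) {
                try {
                    f.get();                  // wait for this commit to finish
                } catch (ExecutionException e) {
                    return false;             // a committer failed
                }
            }
            return true;
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, the dispatcher thread only ever enqueues work; blocking waits happen off the event-handling path.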



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2328) Add tez.runtime.sorter.class & rename tez.runtime.sort.threads to tez.runtime.pipelinedsorter.sort.threads

2015-04-15 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created TEZ-2328:
-

 Summary: Add tez.runtime.sorter.class & rename 
tez.runtime.sort.threads to tez.runtime.pipelinedsorter.sort.threads
 Key: TEZ-2328
 URL: https://issues.apache.org/jira/browse/TEZ-2328
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan








[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497415#comment-14497415
 ] 

Siddharth Seth commented on TEZ-1897:
-

This is fine, as long as we don't move the central dispatcher to run on 
multiple threads. Alternatively, specific events could be moved to run on 
different dispatchers.

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.





[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497417#comment-14497417
 ] 

Siddharth Seth commented on TEZ-1897:
-

By the way, findbugs will likely report a new warning, since the checks are 
based on counts.

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.





[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497409#comment-14497409
 ] 

Bikas Saha commented on TEZ-1897:
-

From what I see, the drained code is dead, since no one calls 
setDrainEventsOnStop(), which is what actually creates a user for the draining 
logic. We can remove that code, I think.

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.





[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497398#comment-14497398
 ] 

Hitesh Shah commented on TEZ-1897:
--

Need to dig a bit deeper. The basic changes look ok, but the handling of the 
drained flag may need to be tweaked depending on how the event processing works 
with multiple threads in play. 

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.





[jira] [Commented] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497395#comment-14497395
 ] 

Bikas Saha commented on TEZ-2323:
-

lgtm

> Fix TestOrderedWordcount to use MR memory configs
> -
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Hitesh Shah
> Attachments: TEZ-2323.1.patch, TEZ-2323.2.patch
>
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
> tez-site.xml. Because it picks up a mix of the mapred and tez configs, it 
> fails with the error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
> taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
> finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
> errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
> task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
> larger than 0 and should be less than the available task memory (MB):246
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
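
The failure quoted above reduces to a simple precondition: the requested sort buffer must be positive and must fit inside the memory actually available to the task (512 MB requested vs. 246 MB available). The sketch below reproduces the assumed shape of that check; the method name and exact logic are illustrative, not the real ExternalSorter code.

```java
// Illustrative sketch of the validation that produces the error above:
// tez.runtime.io.sort.mb must be > 0 and < the task's available memory.
public class SortBufferCheck {
    static long initialMemoryRequirement(int sortMb, long availableTaskMemoryMb) {
        if (sortMb <= 0 || sortMb >= availableTaskMemoryMb) {
            throw new IllegalArgumentException(
                "tez.runtime.io.sort.mb " + sortMb
                + " should be larger than 0 and should be less than the"
                + " available task memory (MB):" + availableTaskMemoryMb);
        }
        return sortMb * 1024L * 1024L;  // bytes requested from the allocator
    }
}
```

So the fix in the patch amounts to making the test use consistent MR memory configs, so the tez sort buffer derived from them satisfies this check.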





[jira] [Commented] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events

2015-04-15 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497377#comment-14497377
 ] 

Mit Desai commented on TEZ-2282:


bq. it seems like we should probably add a timestamp to the log message
I thought having a starting point to differentiate the attempts was the 
purpose, but I can definitely work on adding the timestamp.

I will also do the same when the task completes and in the DAG App Master 
start/stop logic.

> Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt 
> start/stop events
> ---
>
> Key: TEZ-2282
> URL: https://issues.apache.org/jira/browse/TEZ-2282
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Attachments: TEZ-2282.1.patch, TEZ-2282.2.patch, 
> TEZ-2282.master.1.patch
>
>
> This could help with debugging in some cases where logging is task specific. 
> For example, when the GC log is going to stdout, it would be nice to see task 
> attempt start/stop times.





[jira] [Commented] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497375#comment-14497375
 ] 

Rohini Palaniswamy commented on TEZ-2317:
-

+1 on the patch. Looks good. But let me test it out as well before you commit 
it.

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
>
>






[jira] [Commented] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497372#comment-14497372
 ] 

Bikas Saha commented on TEZ-2317:
-

[~jeagles] [~hitesh] please review

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
>
>






[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497368#comment-14497368
 ] 

Bikas Saha commented on TEZ-1897:
-

The findbugs warning is in existing code, not introduced in this patch, and is 
fine to ignore.

Since there are no concerns about improving the dispatcher to schedule on 
multiple threads (even though this patch is not doing so), let us proceed and 
review the patch. This essentially still runs the central dispatcher on a 
single thread, but instead of explicitly creating the thread ourselves, we use 
a thread created in a thread pool. So everything stays the same. Having the 
code in place allows experimentation with scenarios where increasing the 
threads may help, e.g. speculation events could be executed concurrently.
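
The arrangement described here could be sketched as below. This is an illustrative toy, not the actual Tez AsyncDispatcher: the dispatch loop still drains a single queue, but the consuming thread comes from an ExecutorService, so raising the pool size for concurrency-safe event types is a one-line change.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a dispatcher whose handler thread(s) are supplied
// by a thread pool rather than created explicitly. With threads == 1 the
// behavior matches a classic single-threaded dispatcher.
public class PooledDispatcher {
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
    private final ExecutorService pool;
    private volatile boolean stopped = false;

    PooledDispatcher(int threads) {
        pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (!stopped) {
                    try {
                        // Poll with a timeout so the loop notices stop().
                        Runnable e = events.poll(100, TimeUnit.MILLISECONDS);
                        if (e != null) e.run();   // handle the event
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
    }

    void dispatch(Runnable event) { events.add(event); }

    void stop() { stopped = true; pool.shutdown(); }
}
```

Constructing it with more than one thread is the experiment mentioned above; correctness then depends on the submitted event types being safe to handle concurrently.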

[~zjffdu] [~sseth] [~hitesh] [~rajesh.balamohan] Please review.

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.





[jira] [Commented] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497352#comment-14497352
 ] 

TezQA commented on TEZ-2317:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725731/TEZ-2317.1.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/472//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/472//console

This message is automatically generated.

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
>
>






Success: TEZ-2317 PreCommit Build #472

2015-04-15 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2317
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/472/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2764 lines...]
[INFO] Final Memory: 70M/988M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725731/TEZ-2317.1.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/472//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/472//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
dd0519c53a92d3e0ed3a9a3ec755d464331d87ac logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #471
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2550489 bytes
Compression is 7.2%
Took 1.6 sec
Description set: TEZ-2317
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Comment Edited] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497299#comment-14497299
 ] 

Hitesh Shah edited comment on TEZ-2282 at 4/15/15 11:27 PM:


[~mitdesai] Looking at [~jeagles]'s description, it seems like we should 
probably add a timestamp to the log message. Maybe prefix the message with a 
timestamp? Also, it might be helpful to add a log whenever the task completes 
(after calling close()). 

In addition to this, I think it might be good to have the same logic in the DAG 
App Master for each dag start/stop.
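
A timestamped delimiter of the kind suggested here could look like the sketch below. The method and marker format are hypothetical, as are the event names; this is not Tez's actual logging code, just an illustration of prefixing the start/stop marker with a timestamp so attempts in a reused container can be separated in stdout/stderr.

```java
import java.time.Instant;

// Illustrative sketch: a marker line written to the container's log streams
// when a task attempt starts or stops, prefixed with a timestamp.
public class ContainerLogDelimiter {
    static String marker(String event, String attemptId, Instant when) {
        return "=== " + when + " " + event + " " + attemptId + " ===";
    }

    public static void main(String[] args) {
        // Example: emit a start marker for a (made-up) attempt id.
        System.out.println(marker("TASK_ATTEMPT_START",
            "attempt_1429100089638_0008_1_00_000002_0", Instant.now()));
    }
}
```

The same helper could be called for task completion and for DAG start/stop in the App Master, per the comment above.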



was (Author: hitesh):
[~mitdesai] Looking at [~jeagles]'s description, it seems like we should 
probably add a timestamp to the log message. Maybe prefix the message with a 
timestamp? 

> Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt 
> start/stop events
> ---
>
> Key: TEZ-2282
> URL: https://issues.apache.org/jira/browse/TEZ-2282
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Attachments: TEZ-2282.1.patch, TEZ-2282.2.patch, 
> TEZ-2282.master.1.patch
>
>
> This could help with debugging in some cases where logging is task specific. 
> For example, when the GC log is going to stdout, it would be nice to see task 
> attempt start/stop times.





[jira] [Commented] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497299#comment-14497299
 ] 

Hitesh Shah commented on TEZ-2282:
--

[~mitdesai] Looking at [~jeagles]'s description, it seems like we should 
probably add a timestamp to the log message. Maybe prefix the message with a 
timestamp? 

> Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt 
> start/stop events
> ---
>
> Key: TEZ-2282
> URL: https://issues.apache.org/jira/browse/TEZ-2282
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Attachments: TEZ-2282.1.patch, TEZ-2282.2.patch, 
> TEZ-2282.master.1.patch
>
>
> This could help with debugging in some cases where logging is task specific. 
> For example, when the GC log is going to stdout, it would be nice to see task 
> attempt start/stop times.





[jira] [Updated] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2317:

Attachment: TEZ-2317.1.patch

Updating the patch with a test.

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
>
>






[jira] [Updated] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2317:

Attachment: (was: TEZ-2317.1.patch)

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
>
>






[jira] [Updated] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2317:

Attachment: TEZ-2317.1.patch

[~rohini] Can you please try with your Pig Processor fix and this patch 
applied? Both together should resolve all the unnecessary kills.

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log, TEZ-2317.1.patch
>
>






[jira] [Updated] (TEZ-2327) NPE in shuffle

2015-04-15 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-2327:
--
Attachment: (was: am.log.gz)

> NPE in shuffle
> --
>
> Key: TEZ-2327
> URL: https://issues.apache.org/jira/browse/TEZ-2327
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
>
> {noformat}
> 2015-04-15 15:19:46,529 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1428572510173_0219_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Reducer 2, taskAttemptId=attempt_1428572510173_0219_1_08_000872_0, 
> startTime=1429136298733, finishTime=1429136386528, timeTaken=87795, 
> status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while 
> running task:java.lang.NullPointerException
>at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:93)
>at java.io.FilterInputStream.close(FilterInputStream.java:181)
>at 
> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.close(HttpURLConnection.java:3395)
>at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
>at java.io.FilterInputStream.close(FilterInputStream.java:181)
>at 
> org.apache.tez.runtime.library.common.shuffle.HttpConnection.cleanup(HttpConnection.java:278)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:644)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:634)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdown(Fetcher.java:629)
>at 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.shutdown(ShuffleManager.java:759)
>at 
> org.apache.tez.runtime.library.input.UnorderedKVInput.close(UnorderedKVInput.java:209)
>at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:347)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:182)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:422)
>at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
>at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This caused the task in question to fail
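
The trace above shows a close() call deep in the JDK's keep-alive stream handling throwing an NPE during fetcher shutdown and taking the task down with it. One defensive shape for the cleanup path (a hedged sketch under the assumption that close failures should not propagate, not the actual Tez fix) is:

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative sketch: close a resource during teardown without letting
// IOException, or a RuntimeException like the NPE in the trace above,
// escape the cleanup path and fail the task.
public class QuietCleanup {
    static boolean closeQuietly(Closeable c) {
        if (c == null) return false;
        try {
            c.close();
            return true;
        } catch (IOException | RuntimeException e) {
            // Would normally be logged; cleanup must not propagate
            // close() failures.
            return false;
        }
    }
}
```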





[jira] [Commented] (TEZ-2327) NPE in shuffle

2015-04-15 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497166#comment-14497166
 ] 

Sergey Shelukhin commented on TEZ-2327:
---

Logs are too big; will share separately.

> NPE in shuffle
> --
>
> Key: TEZ-2327
> URL: https://issues.apache.org/jira/browse/TEZ-2327
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
>
> {noformat}
> 2015-04-15 15:19:46,529 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1428572510173_0219_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Reducer 2, taskAttemptId=attempt_1428572510173_0219_1_08_000872_0, 
> startTime=1429136298733, finishTime=1429136386528, timeTaken=87795, 
> status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while 
> running task:java.lang.NullPointerException
>at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:93)
>at java.io.FilterInputStream.close(FilterInputStream.java:181)
>at 
> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.close(HttpURLConnection.java:3395)
>at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
>at java.io.FilterInputStream.close(FilterInputStream.java:181)
>at 
> org.apache.tez.runtime.library.common.shuffle.HttpConnection.cleanup(HttpConnection.java:278)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:644)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:634)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdown(Fetcher.java:629)
>at 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.shutdown(ShuffleManager.java:759)
>at 
> org.apache.tez.runtime.library.input.UnorderedKVInput.close(UnorderedKVInput.java:209)
>at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:347)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:182)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:422)
>at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
>at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This caused the task in question to fail





[jira] [Updated] (TEZ-2327) NPE in shuffle

2015-04-15 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-2327:
--
Attachment: am.log.gz

AM logs

> NPE in shuffle
> --
>
> Key: TEZ-2327
> URL: https://issues.apache.org/jira/browse/TEZ-2327
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
>
> {noformat}
> 2015-04-15 15:19:46,529 INFO [Dispatcher thread: Central] 
> history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1428572510173_0219_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Reducer 2, taskAttemptId=attempt_1428572510173_0219_1_08_000872_0, 
> startTime=1429136298733, finishTime=1429136386528, timeTaken=87795, 
> status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while 
> running task:java.lang.NullPointerException
>at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:93)
>at java.io.FilterInputStream.close(FilterInputStream.java:181)
>at 
> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.close(HttpURLConnection.java:3395)
>at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
>at java.io.FilterInputStream.close(FilterInputStream.java:181)
>at 
> org.apache.tez.runtime.library.common.shuffle.HttpConnection.cleanup(HttpConnection.java:278)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:644)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:634)
>at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdown(Fetcher.java:629)
>at 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.shutdown(ShuffleManager.java:759)
>at 
> org.apache.tez.runtime.library.input.UnorderedKVInput.close(UnorderedKVInput.java:209)
>at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:347)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:182)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:422)
>at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
>at 
> org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
>at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This caused the task in question to fail





[jira] [Updated] (TEZ-119) The AM-RM heartbeat interval should not be static

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-119:
---
Assignee: (was: Bikas Saha)

> The AM-RM heartbeat interval should not be static
> -
>
> Key: TEZ-119
> URL: https://issues.apache.org/jira/browse/TEZ-119
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>
> AMs should be more aggressive in heartbeating to the RM - especially soon 
> after job start to get the initial set of containers, and also in the general 
> case where allocations are pending.





[jira] [Created] (TEZ-2327) NPE in shuffle

2015-04-15 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-2327:
-

 Summary: NPE in shuffle
 Key: TEZ-2327
 URL: https://issues.apache.org/jira/browse/TEZ-2327
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth


{noformat}
2015-04-15 15:19:46,529 INFO [Dispatcher thread: Central] 
history.HistoryEventHandler: 
[HISTORY][DAG:dag_1428572510173_0219_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Reducer 2, taskAttemptId=attempt_1428572510173_0219_1_08_000872_0, 
startTime=1429136298733, finishTime=1429136386528, timeTaken=87795, 
status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while 
running task:java.lang.NullPointerException
   at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:93)
   at java.io.FilterInputStream.close(FilterInputStream.java:181)
   at 
sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.close(HttpURLConnection.java:3395)
   at java.io.BufferedInputStream.close(BufferedInputStream.java:483)
   at java.io.FilterInputStream.close(FilterInputStream.java:181)
   at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.cleanup(HttpConnection.java:278)
   at 
org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:644)
   at 
org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdownInternal(Fetcher.java:634)
   at 
org.apache.tez.runtime.library.common.shuffle.Fetcher.shutdown(Fetcher.java:629)
   at 
org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager.shutdown(ShuffleManager.java:759)
   at 
org.apache.tez.runtime.library.input.UnorderedKVInput.close(UnorderedKVInput.java:209)
   at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.close(LogicalIOProcessorRuntimeTask.java:347)
   at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:182)
   at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
   at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
{noformat}

This caused the task in question to fail.
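The trace shows the NPE being thrown from `KeepAliveStream.close()` while the fetcher was tearing down its streams, which then failed the task. A common way to keep a cleanup path from failing the task is to close each stream defensively. The sketch below is illustrative only (`QuietCloser` is not a Tez class):

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative helper, not part of Tez: close each stream independently
// and swallow failures (including unchecked ones such as the NPE above)
// so that one bad close cannot abort the rest of the shutdown path.
final class QuietCloser {
    private QuietCloser() {}

    static void closeQuietly(Closeable... closeables) {
        for (Closeable c : closeables) {
            if (c == null) {
                continue; // stream was never opened
            }
            try {
                c.close();
            } catch (IOException | RuntimeException e) {
                // A real implementation would log here; cleanup must not
                // fail the task.
            }
        }
    }
}
```

With this pattern, `Fetcher.shutdownInternal` style code can pass every stream it might hold to one call and ignore the ones that were never initialized.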





[jira] [Commented] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497163#comment-14497163
 ] 

Bikas Saha commented on TEZ-2317:
-

bq. Optimize by not sending a commit go/no-go request if there is no hdfs 
output (DataSink) involved. In the above case, it is always intermediate output
Fix in Pig.
bq. Handle the commit go/no-go request after processing events in the event 
queue. Maybe something like asking the task to come back after some time
In this jira.
bq. We saw that for 3058 KilledTaskAttempts, the TA_KILL_REQUEST event count 
was 383519. This is way too high.
That is because each canCommit request from the task was resulting in a kill 
event being enqueued. Not killing (in this jira) will fix that.
bq. In the attached AM-taskkill.log, which has grepped statements for a single 
task that was killed, there are 327 repeats of the message below. Need to see 
why there are so many and fix that.
The log happens for each canCommit call from the task that gets denied because 
the AM task state is not running. We can change it to debug in this patch. The 
pig processor is calling canCommit every 100ms.
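For reference, a commit-polling loop with backoff (names invented here; this is not the actual Tez task umbilical API) would avoid hammering the AM with a canCommit call every 100ms:

```java
// Illustrative sketch only: poll the AM for commit permission, backing
// off exponentially between denials instead of retrying at a fixed
// 100ms, which reduces the flood of canCommit calls described above.
final class CommitPoller {
    interface Umbilical {
        boolean canCommit() throws Exception; // hypothetical AM call
    }

    static boolean waitForCommitGo(Umbilical umbilical, long initialMillis,
                                   long maxMillis, int maxAttempts)
            throws Exception {
        long wait = initialMillis;
        for (int i = 0; i < maxAttempts; i++) {
            if (umbilical.canCommit()) {
                return true;                    // AM said go
            }
            Thread.sleep(wait);                 // denied: back off
            wait = Math.min(wait * 2, maxMillis);
        }
        return false;                           // give up; caller aborts
    }
}
```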

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log
>
>






[jira] [Resolved] (TEZ-2324) Dynamic heartbeat intervals between RM and AM

2015-04-15 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth resolved TEZ-2324.
-
Resolution: Duplicate

> Dynamic heartbeat intervals between RM and AM
> -
>
> Key: TEZ-2324
> URL: https://issues.apache.org/jira/browse/TEZ-2324
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>
> Currently there is a static config (10ms), which can be an issue for large 
> clusters with many jobs heartbeating to the RM. We should be able to scale 
> it up and down based on outstanding requests, etc.; e.g. if there are no 
> outstanding requests, then ping at larger intervals. The heuristics may be 
> more sophisticated than that.
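One possible heuristic along these lines, sketched with invented names and constants (only the 10ms default comes from the description above): shrink the AM-to-RM heartbeat interval while container requests are outstanding, and grow it toward a ceiling when the AM is idle.

```java
// Illustrative heuristic only, not Tez code: aggressive heartbeats while
// requests are pending, exponential decay toward a ceiling when idle.
final class HeartbeatIntervals {
    static final long MIN_INTERVAL_MS = 10;   // current static default
    static final long MAX_INTERVAL_MS = 1000; // assumed idle ceiling

    static long nextInterval(long currentMs, int outstandingRequests) {
        if (outstandingRequests > 0) {
            return MIN_INTERVAL_MS;                    // be aggressive
        }
        return Math.min(currentMs * 2, MAX_INTERVAL_MS); // decay when idle
    }
}
```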





[jira] [Updated] (TEZ-119) The AM-RM heartbeat interval should not be static

2015-04-15 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-119:
---
Issue Type: Sub-task  (was: Improvement)
Parent: TEZ-753

> The AM-RM heartbeat interval should not be static
> -
>
> Key: TEZ-119
> URL: https://issues.apache.org/jira/browse/TEZ-119
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Bikas Saha
>
> AMs should be more aggressive in heartbeating to the RM - especially soon 
> after job start to get the initial set of containers, and also in the general 
> case where allocations are pending.





Success: TEZ-2323 PreCommit Build #471

2015-04-15 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2323
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/471/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2766 lines...]
[INFO] Final Memory: 73M/1113M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725696/TEZ-2323.2.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/471//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/471//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
816b60535179bfaa69f39786e660c5af148afb07 logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #467
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2551229 bytes
Compression is 7.2%
Took 0.58 sec
Description set: TEZ-2323
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497150#comment-14497150
 ] 

TezQA commented on TEZ-2323:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725696/TEZ-2323.2.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/471//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/471//console

This message is automatically generated.

> Fix TestOrderedWordcount to use MR memory configs
> -
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Hitesh Shah
> Attachments: TEZ-2323.1.patch, TEZ-2323.2.patch
>
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
> tez-site.xml. Because it mixes the mapred and tez configs, it fails with the 
> error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
> taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
> finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
> errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
> task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
> larger than 0 and should be less than the available task memory (MB):246
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
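The failing precondition can be restated minimally as below. This is a sketch of the check, not the actual Tez source; the real validation lives in `ExternalSorter.getInitialMemoryRequirement` and uses Guava's `Preconditions.checkArgument`.

```java
// Minimal restatement of the failing check: the configured sort buffer
// (tez.runtime.io.sort.mb) must be positive and must fit within the
// memory available to the task. Constant values here are illustrative.
final class SortBufferCheck {
    static void checkSortMb(int sortMb, int availableTaskMemoryMb) {
        if (sortMb <= 0 || sortMb >= availableTaskMemoryMb) {
            throw new IllegalArgumentException(
                "tez.runtime.io.sort.mb " + sortMb
                + " should be larger than 0 and should be less than the"
                + " available task memory (MB):" + availableTaskMemoryMb);
        }
    }
}
```

In the reported failure, the MR-derived value of 512 MB exceeded the 246 MB available to the task, which is why the mixed config set trips this check.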





[jira] [Commented] (TEZ-2320) GroupByOrderByMRRTest not functional in branch 0.6

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497126#comment-14497126
 ] 

Hitesh Shah commented on TEZ-2320:
--

Ran this against the top of branch 0.6. Could not reproduce. 

[~tiwari] If you are building from source, can you apply the TEZ-2190 patch and 
re-try the run?

> GroupByOrderByMRRTest not functional in branch 0.6 
> ---
>
> Key: TEZ-2320
> URL: https://issues.apache.org/jira/browse/TEZ-2320
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Reported by [~tiwari] in TEZ-1581. 





[jira] [Commented] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497117#comment-14497117
 ] 

TezQA commented on TEZ-1897:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725692/TEZ-1897.3.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/470//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/470//artifact/patchprocess/newPatchFindbugsWarningstez-common.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/470//console

This message is automatically generated.

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.
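The idea can be sketched as follows (a toy model with invented names, not the AsyncDispatcher implementation): keep a single serial thread for order-sensitive events, and hand event classes that are known to be independent of each other to a pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy dispatcher model: order-sensitive events stay on one thread;
// parallel-safe events (e.g. vertex-manager-style events) go to a pool.
final class ConcurrentDispatcherSketch {
    private final ExecutorService serial = Executors.newSingleThreadExecutor();
    private final ExecutorService parallel = Executors.newFixedThreadPool(4);

    void dispatch(Runnable handler, boolean parallelSafe) {
        (parallelSafe ? parallel : serial).execute(handler);
    }

    void shutdown() {
        serial.shutdown();
        parallel.shutdown();
    }

    boolean awaitQuiescence(long millis) throws InterruptedException {
        return serial.awaitTermination(millis, TimeUnit.MILLISECONDS)
            && parallel.awaitTermination(millis, TimeUnit.MILLISECONDS);
    }
}
```

The hard part, which the sketch elides, is classifying which event types are actually safe to reorder relative to each other.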





Failed: TEZ-1897 PreCommit Build #470

2015-04-15 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-1897
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/470/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2769 lines...]




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725692/TEZ-1897.3.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/470//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/470//artifact/patchprocess/newPatchFindbugsWarningstez-common.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/470//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
46267d54a4e88d8295bd4d87f42b26a7f30b6575 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #467
Archived 44 artifacts
Archive block size is 32768
Received 26 blocks and 1896974 bytes
Compression is 31.0%
Took 0.49 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Resolved] (TEZ-2326) Update branch 0.6 version to 0.6.1-SNAPSHOT

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah resolved TEZ-2326.
--
   Resolution: Fixed
Fix Version/s: 0.6.1

Committed to branch 0.6 

> Update branch 0.6 version to 0.6.1-SNAPSHOT
> ---
>
> Key: TEZ-2326
> URL: https://issues.apache.org/jira/browse/TEZ-2326
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Minor
> Fix For: 0.6.1
>
> Attachments: TEZ-2326.1.patch
>
>






[jira] [Updated] (TEZ-2326) Update branch 0.6 version to 0.6.1-SNAPSHOT

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2326:
-
Summary: Update branch 0.6 version to 0.6.1-SNAPSHOT  (was: Update branch 
0.6 version )

> Update branch 0.6 version to 0.6.1-SNAPSHOT
> ---
>
> Key: TEZ-2326
> URL: https://issues.apache.org/jira/browse/TEZ-2326
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Minor
> Attachments: TEZ-2326.1.patch
>
>






[jira] [Updated] (TEZ-2326) Update branch 0.6 version

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2326:
-
Attachment: TEZ-2326.1.patch

Update to 0.6.1-SNAPSHOT

> Update branch 0.6 version 
> --
>
> Key: TEZ-2326
> URL: https://issues.apache.org/jira/browse/TEZ-2326
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>Priority: Minor
> Attachments: TEZ-2326.1.patch
>
>






[jira] [Created] (TEZ-2326) Update branch 0.6 version

2015-04-15 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2326:


 Summary: Update branch 0.6 version 
 Key: TEZ-2326
 URL: https://issues.apache.org/jira/browse/TEZ-2326
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
Priority: Minor








[jira] [Updated] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2323:
-
Attachment: TEZ-2323.2.patch

\cc [~rajesh.balamohan]

> Fix TestOrderedWordcount to use MR memory configs
> -
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Hitesh Shah
> Attachments: TEZ-2323.1.patch, TEZ-2323.2.patch
>
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
> tez-site.xml. Because it mixes the mapred and tez configs, it fails with the 
> error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
> taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
> finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
> errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
> task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
> larger than 0 and should be less than the available task memory (MB):246
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}





[jira] [Updated] (TEZ-1653) Dynamic heartbeat intervals between task and AM

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1653:

Issue Type: Sub-task  (was: Task)
Parent: TEZ-753

> Dynamic heartbeat intervals between task and AM
> ---
>
> Key: TEZ-1653
> URL: https://issues.apache.org/jira/browse/TEZ-1653
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>
> The interval is currently fixed. Based on load (number of running tasks, 
> etc.), the container/task heartbeat could be adjusted. The AM could return 
> the next heartbeat interval in the response. The interval could be small 
> when the number of tasks is small and large when it is high. This will help 
> with AM scalability.
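One way the AM-side scaling rule could look, purely as an illustration (the names and constants are invented, not Tez's): the AM piggybacks the next heartbeat interval on its response, growing it with the number of running tasks so a busy AM is polled less often per task.

```java
// Illustrative policy: base interval scaled by running-task count,
// capped at a ceiling. The AM would return this value to each task in
// the heartbeat response.
final class TaskHeartbeatPolicy {
    static final long BASE_MS = 100;  // assumed per-task baseline
    static final long MAX_MS = 2000;  // assumed ceiling

    static long nextIntervalMs(int runningTasks) {
        long scaled = BASE_MS * Math.max(1, runningTasks / 100);
        return Math.min(scaled, MAX_MS);
    }
}
```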





[jira] [Updated] (TEZ-2324) Dynamic heartbeat intervals between RM and AM

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2324:

Issue Type: Sub-task  (was: Bug)
Parent: TEZ-753

> Dynamic heartbeat intervals between RM and AM
> -
>
> Key: TEZ-2324
> URL: https://issues.apache.org/jira/browse/TEZ-2324
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>
> Currently there is a static config (10ms), which can be an issue for large 
> clusters with many jobs heartbeating to the RM. We should be able to scale 
> it up and down based on outstanding requests, etc.; e.g. if there are no 
> outstanding requests, then ping at larger intervals. The heuristics may be 
> more sophisticated than that.





[jira] [Commented] (TEZ-2294) Add tez-site-template.xml with description of config properties

2015-04-15 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497056#comment-14497056
 ] 

Jonathan Eagles commented on TEZ-2294:
--

[~rajesh.balamohan], I think TEZ-963 is trying to accomplish the same goal. 
Can you close this as a duplicate if so?

> Add tez-site-template.xml with description of config properties
> ---
>
> Key: TEZ-2294
> URL: https://issues.apache.org/jira/browse/TEZ-2294
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2294.wip.patch
>
>






[jira] [Commented] (TEZ-2325) Route status update event directly to the attempt

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497053#comment-14497053
 ] 

Siddharth Seth commented on TEZ-2325:
-

+1. Long pending, instead of routing everything to the Vertex.

> Route status update event directly to the attempt 
> --
>
> Key: TEZ-2325
> URL: https://issues.apache.org/jira/browse/TEZ-2325
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
>
> Today, all events from the attempt heartbeat are routed to the vertex. Then 
> the vertex routes status update events (if any) back to the attempt. This is 
> unnecessary and potentially creates out-of-order scenarios. We could route 
> the status update events directly to the attempts.





[jira] [Commented] (TEZ-2324) Dynamic heartbeat intervals between RM and AM

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497052#comment-14497052
 ] 

Siddharth Seth commented on TEZ-2324:
-

There's already a jira for this, and linked under the scalability improvement 
jira.

> Dynamic heartbeat intervals between RM and AM
> -
>
> Key: TEZ-2324
> URL: https://issues.apache.org/jira/browse/TEZ-2324
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
>
> Currently there is a static config (10ms), which can be an issue for large 
> clusters with many jobs heartbeating to the RM. We should be able to scale 
> it up and down based on outstanding requests, etc.; e.g. if there are no 
> outstanding requests, then ping at larger intervals. The heuristics may be 
> more sophisticated than that.





[jira] [Updated] (TEZ-1897) Allow higher concurrency in AsyncDispatcher

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-1897:

Attachment: TEZ-1897.3.patch

> Allow higher concurrency in AsyncDispatcher
> ---
>
> Key: TEZ-1897
> URL: https://issues.apache.org/jira/browse/TEZ-1897
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
> Attachments: TEZ-1897.1.patch, TEZ-1897.2.patch, TEZ-1897.3.patch
>
>
> Currently, it processes events on a single thread. For events that can be 
> executed in parallel, e.g. vertex manager events, allowing higher concurrency 
> may be beneficial.





[jira] [Commented] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496988#comment-14496988
 ] 

TezQA commented on TEZ-2323:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725669/TEZ-2323.1.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestSecureShuffle

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/469//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/469//console

This message is automatically generated.

> Fix TestOrderedWordcount to use MR memory configs
> -
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Hitesh Shah
> Attachments: TEZ-2323.1.patch
>
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
> tez-site.xml. Because it mixes the mapred and tez configs, it fails with the 
> error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
> taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
> finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
> errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
> task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
> larger than 0 and should be less than the available task memory (MB):246
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}





Failed: TEZ-2323 PreCommit Build #469

2015-04-15 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2323
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/469/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2525 lines...]




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725669/TEZ-2323.1.patch
  against master revision 19378d5.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestSecureShuffle

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/469//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/469//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
520e3e8e00548ce828beb1715efce99c313a6c02 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #467
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2555813 bytes
Compression is 7.1%
Took 0.83 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
2 tests failed.
REGRESSION:  
org.apache.tez.test.TestSecureShuffle.testSecureShuffle[test[sslInCluster:true, 
resultWithTezSSL:0, resultWithoutTezSSL:1]]

Error Message:
expected:<1> but was:<0>

Stack Trace:
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.tez.test.TestSecureShuffle.baseTest(TestSecureShuffle.java:144)
at 
org.apache.tez.test.TestSecureShuffle.testSecureShuffle(TestSecureShuffle.java:162)


REGRESSION:  
org.apache.tez.test.TestSecureShuffle.testSecureShuffle[test[sslInCluster:false,
 resultWithTezSSL:1, resultWithoutTezSSL:0]]

Error Message:
expected:<1> but was:<0>

Stack Trace:
java.lang.AssertionError: expected:<1> but was:<0>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.tez.test.TestSecureShuffle.baseTest(TestSecureShuffle.java:144)
at 
org.apache.tez.test.TestSecureShuffle.testSecureShuffle(TestSecureShuffle.java:157)




[jira] [Created] (TEZ-2325) Route status update event directly to the attempt

2015-04-15 Thread Bikas Saha (JIRA)
Bikas Saha created TEZ-2325:
---

 Summary: Route status update event directly to the attempt 
 Key: TEZ-2325
 URL: https://issues.apache.org/jira/browse/TEZ-2325
 Project: Apache Tez
  Issue Type: Bug
Reporter: Bikas Saha


Today, all events from the attempt heartbeat are routed to the vertex. Then the 
vertex routes status update events (if any) to the attempt. This is unnecessary 
and potentially creates out-of-order scenarios. We could route the status 
update events directly to attempts.





[jira] [Updated] (TEZ-2324) Dynamic heartbeat intervals between RM and AM

2015-04-15 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2324:

Description: Currently there is a config (10ms) which can be an issue for 
large clusters with many jobs heartbeating to the RM. We should be able to 
scale it up and down based on outstanding requests etc. e.g. if there are no 
outstanding requests then ping on larger intervals. The heuristics may be more 
sophisticated than that.  (was: Currently there is a config (10ms) which can be 
an issue for large clusters with many jobs heartbeating to the RM. We should be 
able to scale it up and down based on outstanding requests etc. e.g. if there 
are no outstanding requests then ping on larger intervals.)

> Dynamic heartbeat intervals between RM and AM
> -
>
> Key: TEZ-2324
> URL: https://issues.apache.org/jira/browse/TEZ-2324
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Bikas Saha
>
> Currently there is a config (10ms) which can be an issue for large clusters 
> with many jobs heartbeating to the RM. We should be able to scale it up and 
> down based on outstanding requests etc. e.g. if there are no outstanding 
> requests then ping on larger intervals. The heuristics may be more 
> sophisticated than that.





[jira] [Updated] (TEZ-2294) Add tez-site-template.xml with description of config properties

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2294:
-
Attachment: TEZ-2294.wip.patch

[~rajesh.balamohan] Please take a look. The assembly tarball will generate a 
conf dir with config files after applying this patch. 

Pending work: 
  - use @private annotations to filter config properties
  - change TezConfiguration and TezRuntimeConfiguration to have descriptions in 
annotations so that they can be used by the maven plugin to generate the 
description tag in the config file.
  - make use of scope information
  - possibly combine both files into a single one

> Add tez-site-template.xml with description of config properties
> ---
>
> Key: TEZ-2294
> URL: https://issues.apache.org/jira/browse/TEZ-2294
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
> Attachments: TEZ-2294.wip.patch
>
>






[jira] [Created] (TEZ-2324) Dynamic heartbeat intervals between RM and AM

2015-04-15 Thread Bikas Saha (JIRA)
Bikas Saha created TEZ-2324:
---

 Summary: Dynamic heartbeat intervals between RM and AM
 Key: TEZ-2324
 URL: https://issues.apache.org/jira/browse/TEZ-2324
 Project: Apache Tez
  Issue Type: Bug
Reporter: Bikas Saha


Currently there is a config (10ms) which can be an issue for large clusters 
with many jobs heartbeating to the RM. We should be able to scale it up and 
down based on outstanding requests etc. e.g. if there are no outstanding 
requests then ping on larger intervals.
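The description above can be sketched as a simple backoff policy (a hypothetical illustration only, not Tez code; all names are made up):

```java
// Hypothetical sketch of a dynamic AM->RM heartbeat interval: heartbeat at
// the minimum interval while container requests are outstanding, and back
// off exponentially toward a cap when idle. Names are illustrative only.
final class HeartbeatIntervalPolicy {
    private final long minIntervalMs;   // e.g. the current 10ms config value
    private final long maxIntervalMs;   // upper bound while idle
    private long currentIntervalMs;

    HeartbeatIntervalPolicy(long minIntervalMs, long maxIntervalMs) {
        this.minIntervalMs = minIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
        this.currentIntervalMs = minIntervalMs;
    }

    /** Returns the interval to wait before the next RM heartbeat. */
    long nextIntervalMs(int outstandingRequests) {
        if (outstandingRequests > 0) {
            // Pending requests: ping the RM as fast as allowed.
            currentIntervalMs = minIntervalMs;
        } else {
            // No outstanding requests: ping on larger intervals.
            currentIntervalMs = Math.min(maxIntervalMs, currentIntervalMs * 2);
        }
        return currentIntervalMs;
    }
}
```

As the description says, the real heuristics may be more sophisticated, but the shape would be similar.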





[jira] [Updated] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2323:
-
Attachment: TEZ-2323.1.patch

[~pramachandran] [~sseth] please review. 

> Fix TestOrderedWordcount to use MR memory configs
> -
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Hitesh Shah
> Attachments: TEZ-2323.1.patch
>
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
> tez-site.xml. Due to this mix of mapred and tez configs, it fails 
> with the error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
> taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
> finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
> errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
> task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
> larger than 0 and should be less than the available task memory (MB):246
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}





[jira] [Updated] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2323:
-
Fix Version/s: (was: 0.7.0)

> Fix TestOrderedWordcount to use MR memory configs
> -
>
> Key: TEZ-2323
> URL: https://issues.apache.org/jira/browse/TEZ-2323
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Hitesh Shah
>
> TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
> tez-site.xml. Due to this mix of mapred and tez configs, it fails 
> with the error below.
> {noformat}
> 2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED
> 2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
> input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
> taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
> finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
> errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
> task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
> larger than 0 and should be less than the available task memory (MB):246
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>   at 
> org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
>   at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}





[jira] [Created] (TEZ-2323) Fix TestOrderedWordcount to use MR memory configs

2015-04-15 Thread Yesha Vora (JIRA)
Yesha Vora created TEZ-2323:
---

 Summary: Fix TestOrderedWordcount to use MR memory configs
 Key: TEZ-2323
 URL: https://issues.apache.org/jira/browse/TEZ-2323
 Project: Apache Tez
  Issue Type: Bug
Reporter: Yesha Vora
Assignee: Hitesh Shah
 Fix For: 0.7.0


TestOrderedWordcount takes a combination of configs from mapred-site.xml and 
tez-site.xml. Due to this mix of mapred and tez configs, it fails 
with the error below.

{noformat}
2015-04-15 13:20:53,599 DEBUG [main] app.RecoveryParser: Parsing event from 
input stream, eventType=TASK_ATTEMPT_FINISHED
2015-04-15 13:20:53,619 DEBUG [main] app.RecoveryParser: Parsed event from 
input stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
taskAttemptId=attempt_1429100089638_0008_1_00_02_0, startTime=0, 
finishTime=1429104012181, timeTaken=1429104012181, status=FAILED, 
errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while running 
task:java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 512 should be 
larger than 0 and should be less than the available task memory (MB):246
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at 
org.apache.tez.runtime.library.common.sort.impl.ExternalSorter.getInitialMemoryRequirement(ExternalSorter.java:304)
at 
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.initialize(OrderedPartitionedKVOutput.java:90)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:443)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$InitializeOutputCallable.callInternal(LogicalIOProcessorRuntimeTask.java:422)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
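The IllegalArgumentException above comes from a sanity check that the configured sort buffer fits in the task's available memory. A simplified, hypothetical sketch of that check (not the actual ExternalSorter code):

```java
// Hypothetical simplification of the check behind the error above:
// tez.runtime.io.sort.mb must be positive and strictly less than the
// memory available to the task, otherwise initialization fails fast.
final class SortBufferCheck {
    static long initialMemoryRequirement(int sortMb, long availableTaskMemoryMb) {
        if (sortMb <= 0 || sortMb >= availableTaskMemoryMb) {
            throw new IllegalArgumentException(
                "tez.runtime.io.sort.mb " + sortMb
                + " should be larger than 0 and should be less than the"
                + " available task memory (MB):" + availableTaskMemoryMb);
        }
        return ((long) sortMb) << 20; // convert MB to bytes
    }
}
```

In the log above, the sort buffer of 512 MB against 246 MB of available task memory trips exactly this condition; using the MR memory configs gives the task a container large enough for the configured buffer.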





[jira] [Commented] (TEZ-2319) DAG history in HDFS

2015-04-15 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496800#comment-14496800
 ] 

Jason Lowe commented on TEZ-2319:
-

MR does not dump the final state all at once; rather, it is more like the 
SimpleHistoryLogger.  The JobHistoryEventHandler logs job/task/attempt 
start/stop events to the .jhist avro file in the staging directory as the job 
runs.  Once the job finishes it copies that jhist file over to the done 
intermediate directory for the job history server to pick up.  It does not dump 
it all at once from memory when the job completes.

Note that the MR AM is building the state over time in memory, not because it's 
logging to the jhist file along the way but because it has to provide a UI 
while the job is running.  It could dump the contents to the jhist file all at 
once when the job completes, but it also uses the jhist file as a recovery 
mechanism in case the AM crashes.

I think we'd be OK dumping the events to a file as we get them, in a similar way 
to how JobHistoryEventHandler works in the MR AM.  The biggest concern is that 
adding multiple logging mechanisms adds to the failure potential.  If we're 
generating events faster than the two loggers can process them, then we'll start 
buffering events and putting pressure on the AM heap.
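The heap-pressure concern can be illustrated with a bounded buffer (a hypothetical sketch, not the actual JobHistoryEventHandler): capping the queue bounds AM memory use and lets the producer detect when a logger falls behind.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: history events pass through a bounded queue so a
// slow logger cannot grow an unbounded buffer on the AM heap. A full
// queue signals that events arrive faster than they are drained.
final class BoundedEventBuffer {
    private final BlockingQueue<String> queue;

    BoundedEventBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false when the buffer is full, i.e. the logger is lagging. */
    boolean offerEvent(String event) {
        return queue.offer(event);
    }

    /** Drains buffered events into the sink; returns how many were moved. */
    int drainTo(List<String> sink) {
        return queue.drainTo(sink);
    }
}
```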

> DAG history in HDFS
> ---
>
> Key: TEZ-2319
> URL: https://issues.apache.org/jira/browse/TEZ-2319
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>
>   We have processes, that parse jobconf.xml and job history details (map and 
> reduce task details, etc) in avro files from HDFS and load them into hive 
> tables for analysis for mapreduce jobs. Would like to have Tez also make this 
> information written to a history file in HDFS when AM or each DAG completes 
> so that we can do analytics on Tez jobs. 





[jira] [Commented] (TEZ-1969) Stop the DAGAppMaster when a local mode client is stopped

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496798#comment-14496798
 ] 

Siddharth Seth commented on TEZ-1969:
-

Mostly looks good. Could you please add a comment in the close method on why 
this works, even though the framework client is shared between TezClient and 
DAGClient? I'm assuming this makes use of the fact that DAGClient close is not 
invoked, and that a single instance of DAGClient will be associated with a TezClient.

> Stop the DAGAppMaster when a local mode client is stopped
> -
>
> Key: TEZ-1969
> URL: https://issues.apache.org/jira/browse/TEZ-1969
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Prakash Ramachandran
> Attachments: TEZ-1969.1.patch, TEZ-1969.2.patch
>
>
> https://issues.apache.org/jira/browse/TEZ-1661?focusedCommentId=14275366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14275366
> Running multiple local clients in a single JVM will leak DAGAppMaster and 
> related threads.





[jira] [Commented] (TEZ-1482) Fix memory issues for Local Mode running concurrent tasks

2015-04-15 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496763#comment-14496763
 ] 

Siddharth Seth commented on TEZ-1482:
-

Don't think this needs to go into 0.6. It's really an improvement to local 
mode, and IIRC there are multiple other dependent jiras which haven't been 
pulled into 0.6.

> Fix memory issues for Local Mode running concurrent tasks
> -
>
> Key: TEZ-1482
> URL: https://issues.apache.org/jira/browse/TEZ-1482
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Chen He
>Assignee: Prakash Ramachandran
> Fix For: 0.7.0
>
> Attachments: TEZ-1482.1.patch, TEZ-1482.2.patch, TEZ-1482.3.patch
>
>






[jira] [Commented] (TEZ-2319) DAG history in HDFS

2015-04-15 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496715#comment-14496715
 ] 

Rohini Palaniswamy commented on TEZ-2319:
-

bq. Maybe this should be a primary ask for ATS v2
   This is something that we do not want to wait for ATS v2. But it would be 
good if they captured this as part of the design.

bq. make the SimpleHistoryLogger ( to HDFS ) production-ready and tez should 
allow publishing to multiple loggers.
This history only needs to capture the final state of the DAG, its tasks and 
counters. It does not need to capture intermediate data. I am not sure 
SimpleHistoryLogger in its current form is a good fit. The job history in MR is 
in avro format and gives the whole state of the job on its completion. If AM 
has that in memory, then we can have a config to dump that into HDFS in some 
format (json/avro), which is the easiest thing. Else we will need another Logger to 
- build the state over time (not preferable, as it will consume a lot of 
memory) and dump it on completion, or
- write events as they happen, then parse them and construct only the relevant 
information into another file. 
Both options with another Logger are inefficient, and I don't like the idea 
myself.

  [~jlowe]/[~jeagles] , Any better suggestions on how this can be done based on 
your experience with how it is currently done in MR?

> DAG history in HDFS
> ---
>
> Key: TEZ-2319
> URL: https://issues.apache.org/jira/browse/TEZ-2319
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>
>   We have processes, that parse jobconf.xml and job history details (map and 
> reduce task details, etc) in avro files from HDFS and load them into hive 
> tables for analysis for mapreduce jobs. Would like to have Tez also make this 
> information written to a history file in HDFS when AM or each DAG completes 
> so that we can do analytics on Tez jobs. 





[jira] [Commented] (TEZ-2310) AM Deadlock in VertexImpl

2015-04-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496654#comment-14496654
 ] 

Bikas Saha commented on TEZ-2310:
-

Thanks for the verification [~daijy]. 
[~sseth] [~rajesh.balamohan] [~hitesh] Please review.
The change is basically having notifications sent out to listeners on a 
separate thread. Potentially, we could do multiple of these concurrently via a 
thread pool but for now sticking to a single thread. Will open a separate jira 
to do this for task status updates.
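The approach described can be sketched like this (a hypothetical illustration, not the committed patch): listeners run on a dedicated single-thread executor, so they never execute while the caller still holds a vertex lock, which breaks the lock inversion shown in the two thread stacks below.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the approach: state-update notifications are
// submitted to a single-threaded executor instead of being invoked inline.
// The caller returns immediately, so listeners run only after any locks
// held by the caller are released; a single thread preserves ordering.
final class AsyncStateNotifier {
    private final ExecutorService notifier = Executors.newSingleThreadExecutor();

    void sendStateUpdate(Runnable listenerCallback) {
        notifier.submit(listenerCallback);
    }

    void shutdown() throws InterruptedException {
        notifier.shutdown();
        notifier.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

A thread pool could deliver notifications concurrently, as the comment notes, at the cost of per-listener ordering guarantees.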

> AM Deadlock in VertexImpl
> -
>
> Key: TEZ-2310
> URL: https://issues.apache.org/jira/browse/TEZ-2310
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: TEZ-2310-0.patch, TEZ-2310.1.patch
>
>
> See the following deadlock in testing:
> Thread#1:
> {code}
> Daemon Thread [App Shared Pool - #3] (Suspended)  
>   owns: VertexManager$VertexManagerPluginContextImpl  (id=327)
>   owns: ShuffleVertexManager  (id=328)
>   owns: VertexManager  (id=329)   
>   waiting for: VertexManager$VertexManagerPluginContextImpl  (id=326) 
>   
> VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate)
>  line: 344
>   
> StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) 
> line: 138  
>   
> StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer,
>  VertexStateUpdate) line: 122
>   StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) 
> line: 116   
>   StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 
> 106  
>   VertexImpl.maybeSendConfiguredEvent() line: 3385
>   VertexImpl.doneReconfiguringVertex() line: 1634 
>   VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() 
> line: 339
>   ShuffleVertexManager.schedulePendingTasks(int) line: 561
>   ShuffleVertexManager.schedulePendingTasks() line: 620   
>   ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 
> 731   
>   ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744  
>   VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527  
>   VertexManager$VertexManagerEvent$1.run() line: 612  
>   VertexManager$VertexManagerEvent$1.run() line: 607  
>   AccessController.doPrivileged(PrivilegedExceptionAction, 
> AccessControlContext) line: not available [native method]   
>   Subject.doAs(Subject, PrivilegedExceptionAction) line: 415   
>   UserGroupInformation.doAs(PrivilegedExceptionAction) line: 1548  
>   
> VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call()
>  line: 607  
>   
> VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call()
>  line: 596  
>   ListenableFutureTask(FutureTask).run() line: 262  
>   ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145  
>   ThreadPoolExecutor$Worker.run() line: 615   
>   Thread.run() line: 745  
> {code}
> Thread #2
> {code}
> Daemon Thread [App Shared Pool - #2] (Suspended)  
>   owns: VertexManager$VertexManagerPluginContextImpl  (id=326)
>   owns: PigGraceShuffleVertexManager  (id=344)
>   owns: VertexManager  (id=345)   
>   Unsafe.park(boolean, long) line: not available [native method]  
>   LockSupport.park(Object) line: 186  
>   
> ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt()
>  line: 834
>   
> ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int)
>  line: 964   
>   
> ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int)
>  line: 1282
>   ReentrantReadWriteLock$ReadLock.lock() line: 731
>   VertexImpl.getTotalTasks() line: 952
>   VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) 
> line: 162
>   
> PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() 
> line: 435
>   
> PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map>)
>  line: 353 
>   VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541  
>   VertexManager$VertexManagerEvent$1.run() line: 612  
>   VertexManager$VertexManagerEvent$1.run() line: 607  
>   AccessController.doPrivileged(PrivilegedExceptionAction, 
> AccessControlContext) line: not available [native method]   
>   Subject.doAs(Subject, PrivilegedExceptionAction) line: 415   
>   UserGroupInformation.doAs(PrivilegedExceptionAction) line: 1548  
>  

[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496599#comment-14496599
 ] 

Hitesh Shah commented on TEZ-2322:
--

The running task count seems fine as there may be cases where succeeded task 
attempts may not be recovered properly. 

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
> 0 
> {code}
> Now this may be because tasks failed (some certainly did, due to space 
> exceptions, having checked the logs), but surely once a task has finished 
> successfully and is marked as succeeded it cannot later be removed from 
> the succeeded count? Perhaps the succeeded counter is incremented too early, 
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
> doesn't account for the large drop in the number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same 
> time, which is suspicious. I'm pretty sure there was no contending job that 
> finished and released so much more resource to this Tez job, so it's also 
> unclear how the running count could have jumped up so significantly, given 
> that the cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon





[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496598#comment-14496598
 ] 

Hitesh Shah commented on TEZ-2322:
--

Thanks [~harisekhon]. The command I gave has no relation to Ambari (or the job 
history server) and should work from the command line if you try it. 

In any case, it seems like the failed attempt task count is not getting updated 
on recovery. \cc [~zjffdu]



> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
> 0 
> {code}
> Now this may be because tasks failed (some certainly did, due to space 
> exceptions, having checked the logs), but surely once a task has finished 
> successfully and is marked as succeeded it cannot later be removed from 
> the succeeded count? Perhaps the succeeded counter is incremented too early, 
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
> doesn't account for the large drop in the number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same 
> time, which is suspicious. I'm pretty sure there was no contending job that 
> finished and released so much more resource to this Tez job, so it's also 
> unclear how the running count could have jumped up so significantly, given 
> that the cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon





[jira] [Comment Edited] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496553#comment-14496553
 ] 

Hari Sekhon edited comment on TEZ-2322 at 4/15/15 5:25 PM:
---

IIRC Ambari still doesn't support the Job History server, so that command 
fails, but I've copied the logs out via the RM and attached them to this 
ticket for you.


was (Author: harisekhon):
Iirc Ambari still doesn't support Job History server so that command fails, but 
I've copied the logs out via RM.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
> 0 
> {code}
> Now this may be because tasks failed (some certainly did, due to space 
> exceptions, having checked the logs), but surely once a task has finished 
> successfully and been marked as succeeded it cannot later be removed from 
> the succeeded count? Perhaps the succeeded counter is incremented too early, 
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
> doesn't account for the large drop in the number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same 
> time, which is suspicious. I'm pretty sure there was no contending job that 
> finished and released so much more resource to this Tez job, so it's also 
> unclear how the running count could have jumped up so significantly given 
> that the cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon





[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496555#comment-14496555
 ] 

Hari Sekhon commented on TEZ-2322:
--

There was a point at which space ran out and Kerberos also broke as a result, 
but I fixed that and the job continued and eventually succeeded.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>





[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Attachment: attempt2_syslog_dag_1427546104095_0146_1_post
attempt2_syslog_dag_1427546104095_0146_1
attempt2_syslog
attempt1_syslog_dag_1427546104095_0146_1

Iirc Ambari still doesn't support Job History server so that command fails, but 
I've copied the logs out via RM.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
> Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>





[jira] [Commented] (TEZ-2282) Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt start/stop events

2015-04-15 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496473#comment-14496473
 ] 

Mit Desai commented on TEZ-2282:


[~hitesh], can you take a look at this patch?

> Delimit reused yarn container logs (stderr, stdout, syslog) with task attempt 
> start/stop events
> ---
>
> Key: TEZ-2282
> URL: https://issues.apache.org/jira/browse/TEZ-2282
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Mit Desai
> Attachments: TEZ-2282.1.patch, TEZ-2282.2.patch, 
> TEZ-2282.master.1.patch
>
>
> This could help with debugging in some cases where logging is task specific. 
> For example GC log is going to stdout, it will be nice to see task attempt 
> start/stop times





[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496452#comment-14496452
 ] 

Rohini Palaniswamy commented on TEZ-2322:
-

No. I have only seen:
   - TotalTasks come down when a new vertex is starting and its task count is 
reduced due to auto parallelism with ShuffleVertexManager.
   - Succeeded go to 0 and then increase as recovery kicks in, when the AM 
gets killed and a new one is launched.

I have not seen Succeeded drop to a non-zero count. But I have only seen AM 
relaunches due to OOM or other issues with very big jobs (30K+ tasks), so it 
is worthwhile to check whether a second AM attempt was launched. Pig prints 
that status every 20 secs, and it is possible a new AM was launched and 
recovery had recovered 181 tasks by then.
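
The AM-relaunch theory above can be sketched with a toy model (purely illustrative, not Tez code; the timings and intermediate counts are assumptions based on the log excerpt):

```python
# Toy model: an AM restart resets the Succeeded counter to 0, and recovery
# then re-registers previously finished tasks.  A client polling every 20s
# can easily miss the transient 0 and instead observe 380 -> 181.
def poll_succeeded(history, poll_times):
    """Return the succeeded-task count observed at each poll time."""
    return [history[t] for t in poll_times]

history = {
    0: 380, 20: 380, 40: 380, 60: 380,  # first AM attempt, 380 tasks done
    70: 0,                              # AM killed; second attempt starts
    80: 181, 100: 181, 120: 182,        # recovery re-registers tasks
}

samples = poll_succeeded(history, [40, 60, 80, 100])
print(samples)  # -> [380, 380, 181, 181]
```

On this reading, the 380 => 181 drop reflects sampling the counter mid-recovery rather than succeeded tasks actually being un-succeeded.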

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>





[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496448#comment-14496448
 ] 

Hitesh Shah commented on TEZ-2322:
--

\cc [~daijy] [~rohini] in case either of you has seen this before. 

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>





[jira] [Comment Edited] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496439#comment-14496439
 ] 

Rohini Palaniswamy edited comment on TEZ-2317 at 4/15/15 4:01 PM:
--

Thanks [~bikassaha]. Issue is with PigProcessor calling canCommit. Fixing that 
in PIG-4508.


was (Author: rohini):
Ah. Thanks [~bikassaha]. Issue is with PigProcessor calling canCommit. Fixing 
that in PIG-4508.

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log
>
>






[jira] [Commented] (TEZ-2317) Successful task attempts getting killed

2015-04-15 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496439#comment-14496439
 ] 

Rohini Palaniswamy commented on TEZ-2317:
-

Ah. Thanks [~bikassaha]. Issue is with PigProcessor calling canCommit. Fixing 
that in PIG-4508.

> Successful task attempts getting killed
> ---
>
> Key: TEZ-2317
> URL: https://issues.apache.org/jira/browse/TEZ-2317
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: AM-taskkill.log
>
>






[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496399#comment-14496399
 ] 

Hitesh Shah commented on TEZ-2322:
--

Could you please attach the application logs to the jira (obtained via 
bin/yarn logs -applicationId)?
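
For completeness, the log-collection step might look like the following shell sketch; the application id is a placeholder guessed from the attachment file names and must be replaced with the real one, and the yarn CLI is assumed to be on the PATH of a cluster gateway node:

```shell
# Fetch the aggregated YARN application logs so they can be attached to the
# JIRA.  APP_ID below is a placeholder, not the actual application id.
APP_ID=application_1427546104095_0146
if command -v yarn >/dev/null 2>&1; then
  yarn logs -applicationId "$APP_ID" > "${APP_ID}.log"
else
  echo "yarn CLI not found; run this on a cluster gateway node"
fi
```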

 

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --
>
> Key: TEZ-2322
> URL: https://issues.apache.org/jira/browse/TEZ-2322
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.2
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>





[jira] [Commented] (TEZ-2320) GroupByOrderByMRRTest not functional in branch 0.6

2015-04-15 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496389#comment-14496389
 ] 

Hitesh Shah commented on TEZ-2320:
--

Thanks for the clarification. The test runs fine against master, so I will 
need to take another look at why it is failing in branch 0.6. 

> GroupByOrderByMRRTest not functional in branch 0.6 
> ---
>
> Key: TEZ-2320
> URL: https://issues.apache.org/jira/browse/TEZ-2320
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Reported by [~tiwari] in TEZ-1581. 





[jira] [Commented] (TEZ-2320) GroupByOrderByMRRTest not functional in branch 0.6

2015-04-15 Thread Amit Tiwari (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496339#comment-14496339
 ] 

Amit Tiwari commented on TEZ-2320:
--

Hello Hitesh,
Yes, our build contains the fixes for TEZ-2190.
As advised, I will deprecate this test in our cluster.

thank you
--amit

> GroupByOrderByMRRTest not functional in branch 0.6 
> ---
>
> Key: TEZ-2320
> URL: https://issues.apache.org/jira/browse/TEZ-2320
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hitesh Shah
>Assignee: Hitesh Shah
>
> Reported by [~tiwari] in TEZ-1581. 





[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
Now this may be because the tasks failed, some certainly did due to space 
exceptions having checked the logs, but surely once a task has finished 
successfully and is marked as succeeded it cannot then later be removed from 
the succeeded count? Perhaps the succeeded counter is incremented too early 
before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm fairly sure no contending job finished and 
released that much extra resource to this Tez job, so it's also unclear how the 
running count could have jumped so significantly given that the cluster 
hardware resources were the same throughout.
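As an editorial aside, one plausible mechanism for a decreasing Succeeded count is a framework that re-schedules an already-succeeded task when its output becomes unavailable (for example, after a downstream fetch failure), moving it back to RUNNING and recording the earlier attempt as killed. The sketch below is purely illustrative; the class and state names are assumptions, not Tez's actual implementation.

```python
# Illustrative sketch (not Tez code): a succeeded task whose output is lost
# is re-run, so the DAG-level "Succeeded" count can legitimately decrease.
from collections import Counter

class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.state = "RUNNING"

class DagStatus:
    def __init__(self, tasks):
        self.tasks = tasks

    def counts(self):
        # Aggregate task states, as a progress report would.
        return Counter(t.state for t in self.tasks)

    def mark_succeeded(self, task):
        task.state = "SUCCEEDED"

    def output_lost(self, task):
        # Hypothetical handler: a succeeded task whose output is
        # unavailable is moved back to RUNNING for re-execution,
        # decrementing the succeeded count.
        if task.state == "SUCCEEDED":
            task.state = "RUNNING"

tasks = [Task(i) for i in range(5)]
dag = DagStatus(tasks)
for t in tasks[:3]:
    dag.mark_succeeded(t)
print(dag.counts()["SUCCEEDED"])  # 3
dag.output_lost(tasks[0])
print(dag.counts()["SUCCEEDED"])  # 2 -- the count went down
```

Under this (assumed) mechanism, a burst of re-scheduled tasks would also explain the simultaneous jump in Running tasks and KilledTaskAttempts seen in the log above.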

Hari Sekhon
http://www.linkedin.com/in/harisekhon

  was:
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.ex

[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
This may be because tasks failed; having checked the logs, some certainly did 
fail due to space exceptions. But surely once a task has finished successfully 
and been marked as succeeded it cannot later be removed from the succeeded 
count? Perhaps the succeeded counter is incremented too early, before the task 
results are actually saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm fairly sure no contending job finished and 
released that much extra resource to this Tez job, so it's also unclear how the 
running count could have jumped so significantly.

  was:
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 

[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
This may be because tasks failed; having checked the logs, some certainly did 
fail due to space exceptions. But surely once a task has finished successfully 
and been marked as succeeded it cannot later be removed from the succeeded 
count? Perhaps the succeeded counter is incremented too early, before the task 
results are actually saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm fairly sure no contending job finished and 
released that much extra resource to this Tez job, so it's also unclear how the 
running count could have jumped so significantly given that the cluster 
hardware resources were the same throughout.

  was:
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNN

[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
This may be because tasks failed; having checked the logs, some certainly did 
fail due to space exceptions. But surely once a task has finished successfully 
and been marked as succeeded it cannot later be removed from the succeeded 
count? Perhaps the succeeded counter is incremented too early, before the task 
results are actually saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm fairly sure there was no contending job to 
finish and release so much more resource to this Tez job.

  was:
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 1

[jira] [Created] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)
Hari Sekhon created TEZ-2322:


 Summary: Succeeded count wrong for Pig on Tez job, decreased 380 
=> 181
 Key: TEZ-2322
 URL: https://issues.apache.org/jira/browse/TEZ-2322
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor


During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
This may be because tasks failed; some certainly did fail due to space 
exceptions. But surely once a task has finished successfully and been marked 
as succeeded it cannot then be removed from the succeeded count? Perhaps the 
succeeded counter is incremented too early, before the task results are 
actually saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm fairly sure there was no contending job to 
finish and release so much more resource to this Tez job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1969) Stop the DAGAppMaster when a local mode client is stopped

2015-04-15 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495995#comment-14495995
 ] 

TezQA commented on TEZ-1969:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725531/TEZ-1969.2.patch
  against master revision 11b5843.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/468//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/468//console

This message is automatically generated.

> Stop the DAGAppMaster when a local mode client is stopped
> -
>
> Key: TEZ-1969
> URL: https://issues.apache.org/jira/browse/TEZ-1969
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Prakash Ramachandran
> Attachments: TEZ-1969.1.patch, TEZ-1969.2.patch
>
>
> https://issues.apache.org/jira/browse/TEZ-1661?focusedCommentId=14275366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14275366
> Running multiple local clients in a single JVM will leak DAGAppMaster and 
> related threads.
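As an editorial illustration of the lifecycle fix this issue asks for: a local-mode client that starts an in-process master should also stop it when the client is stopped, otherwise every client in the process leaves the master's threads running. The sketch below shows that pattern in minimal form; the class and method names are assumptions, not Tez's actual API.

```python
# Illustrative sketch (not Tez code): tie the in-process master's lifecycle
# to the client that created it, so stopping the client does not leak threads.
import threading

class LocalAppMaster:
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Stand-in for the master's dispatcher loop: wait until asked to stop.
        self._stop.wait()

    def stop(self):
        self._stop.set()
        self._thread.join()

class LocalClient:
    def __init__(self):
        # Local mode: the client starts its own in-process master.
        self._am = LocalAppMaster()
        self._am.start()

    def stop(self):
        # Without this call, each client would leave its master thread
        # alive -- the leak reported when running multiple local clients
        # in a single JVM.
        self._am.stop()

before = threading.active_count()
clients = [LocalClient() for _ in range(3)]
for c in clients:
    c.stop()
assert threading.active_count() == before  # no leaked master threads
```

The design point is simply ownership: whichever component starts the master is responsible for stopping it, so repeated client create/stop cycles leave no background threads behind.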



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-1969 PreCommit Build #468

2015-04-15 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-1969
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/468/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2765 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12725531/TEZ-1969.2.patch
  against master revision 11b5843.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/468//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/468//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
2365279340c09aa684be8afc1bec9d4750ecb855 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #467
Archived 44 artifacts
Archive block size is 32768
Received 8 blocks and 2486175 bytes
Compression is 9.5%
Took 1.1 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Updated] (TEZ-1969) Stop the DAGAppMaster when a local mode client is stopped

2015-04-15 Thread Prakash Ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Ramachandran updated TEZ-1969:
--
Attachment: TEZ-1969.2.patch

reattaching to trigger a pre-commit build

> Stop the DAGAppMaster when a local mode client is stopped
> -
>
> Key: TEZ-1969
> URL: https://issues.apache.org/jira/browse/TEZ-1969
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Siddharth Seth
>Assignee: Prakash Ramachandran
> Attachments: TEZ-1969.1.patch, TEZ-1969.2.patch
>
>
> https://issues.apache.org/jira/browse/TEZ-1661?focusedCommentId=14275366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14275366
> Running multiple local clients in a single JVM will leak DAGAppMaster and 
> related threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)