[jira] [Commented] (TEZ-2209) Fix pipelined shuffle to fetch data from any one attempt

2015-03-20 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370823#comment-14370823
 ] 

Siddharth Seth commented on TEZ-2209:
-

Minor stuff.
- shuffleInfoEventsMap in ShuffleManager should be a ConcurrentMap - it can be 
accessed from multiple threads, and outside of any synchronization. Not 
required in ShuffleScheduler though since that's synchronized. Missed this in 
the earlier review for pipelined shuffle.
- In the reportFatalError invocation - it'll be useful to add the currently 
registered attemptNumber, and the one which caused the error.

The rest looks good to me.
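
To illustrate the two review points above, here is a minimal sketch, not the actual ShuffleManager code. Only the name shuffleInfoEventsMap comes from the review; the AttemptRegistry class, the validate method, and the use of an attempt number as the map value are assumptions made for illustration. putIfAbsent registers the first attempt seen for a source task without external locking, and the error text carries both the registered and the offending attemptNumber.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch; class and method names are not from the Tez code base. */
final class AttemptRegistry {
    // Assumed shape: source task index -> attempt number whose data is accepted.
    private final ConcurrentMap<Integer, Integer> shuffleInfoEventsMap =
        new ConcurrentHashMap<>();

    /**
     * Registers the first attempt seen for a source task and rejects any other.
     * Returns null when the event is acceptable, or an error message otherwise.
     */
    String validate(int srcTaskIndex, int attemptNumber) {
        // Atomic even when called from several threads without synchronization.
        Integer registered = shuffleInfoEventsMap.putIfAbsent(srcTaskIndex, attemptNumber);
        if (registered == null || registered == attemptNumber) {
            return null;
        }
        return "Data from multiple attempts for source task " + srcTaskIndex
            + ": registered attemptNumber=" + registered
            + ", offending attemptNumber=" + attemptNumber;
    }
}
```

A non-null return value is what a caller would pass on to its fatal-error reporting path.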

 Fix pipelined shuffle to fetch data from any one attempt
 

 Key: TEZ-2209
 URL: https://issues.apache.org/jira/browse/TEZ-2209
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2209.1.patch, TEZ-2209.2.patch, TEZ-2209.3.patch


 - Currently, pipelined shuffle will fail-fast the moment it receives data 
 from an attempt other than 0.  This was done as an add-on check to prevent 
 data being copied from speculated attempts.
 - However, in some scenarios (like LLAP), it is possible that attempt 0 
 gets killed before generating any data.  In such cases, attempt 
 #1 or a later attempt would generate the actual data.
 - This jira is created to allow pipelined shuffle to download data from any 
 one attempt. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-20 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2204:

Attachment: TEZ-2204-1.patch

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-2204-1.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 





[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371062#comment-14371062
 ] 

Jeff Zhang commented on TEZ-2204:
-

Uploaded the patch. [~hitesh] [~bikassaha] Please help review it.

There are 2 potential deadlocks:
* One is related to YARN-2917. Tez's AsyncDispatcher doesn't include that fix.
* The other is in DAGAppMaster, between DAGAppMaster::handle and 
DAGAppMaster::stopService. When stopService is called, it stops the 
AsyncDispatcher, and the AsyncDispatcher drains its events, which may call 
DAGAppMaster::handle. Both handle() and stopService() have the 
synchronized keyword.
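
The DAGAppMaster deadlock can be sketched in isolation. This is a hypothetical model, not DAGAppMaster itself: the class name, the latches, and the bounded wait (used so the demo terminates instead of hanging) are all assumptions. The stopping thread holds the object's monitor while waiting for the drain, and the draining thread needs the same monitor to run handle().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a synchronized stop that waits for a synchronized handler. */
final class StopDrainDeadlockSketch {
    private final CountDownLatch stopRequested = new CountDownLatch(1);
    private final CountDownLatch drained = new CountDownLatch(1);

    // Mirrors the synchronized handle() described above.
    synchronized void handle(String event) {
        // Event processing would go here.
    }

    // Mirrors the synchronized stopService(): holds the monitor while waiting.
    synchronized boolean stopService(long timeoutMs) throws InterruptedException {
        return awaitDrain(timeoutMs);
    }

    // Same wait, but the monitor is not held while the dispatcher drains.
    boolean stopServiceWithoutLock(long timeoutMs) throws InterruptedException {
        return awaitDrain(timeoutMs);
    }

    private boolean awaitDrain(long timeoutMs) throws InterruptedException {
        stopRequested.countDown();
        return drained.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Returns true if the dispatcher finished draining before the timeout. */
    static boolean drainsInTime(boolean holdLockWhileStopping, long timeoutMs) {
        StopDrainDeadlockSketch am = new StopDrainDeadlockSketch();
        Thread dispatcher = new Thread(() -> {
            try {
                am.stopRequested.await();      // drain starts once stop is requested
                am.handle("pending-event");    // blocks while stopService holds the monitor
                am.drained.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "dispatcher");
        dispatcher.setDaemon(true);
        dispatcher.start();
        try {
            boolean ok = holdLockWhileStopping
                ? am.stopService(timeoutMs)
                : am.stopServiceWithoutLock(timeoutMs);
            dispatcher.join(1000);
            return ok;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With the monitor held, the drain cannot complete until the wait gives up; with an unbounded wait that is a deadlock. Waiting without the monitor is one conventional way out; whether the attached patch takes that route is not stated here.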

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-2204-1.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 





Failed: TEZ-2204 PreCommit Build #318

2015-03-20 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2204
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/318/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2755 lines...]

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12705879/TEZ-2204-1.patch
  against master revision 9b845f2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 4 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/318//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/318//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/318//artifact/patchprocess/newPatchFindbugsWarningstez-common.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/318//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
0310fbd24a7e22aad56bda62c69f2f57d92cd884 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #315
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2578137 bytes
Compression is 7.1%
Took 1.1 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime

2015-03-20 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372307#comment-14372307
 ] 

Hitesh Shah commented on TEZ-2217:
--

[~gopalv] Could you clarify the bit about the AM having received a soft 
pre-emption message? Any suggestions on a potential approach for the AM to 
infer that the cluster no longer has additional resources available for the 
AM, and that it should now start releasing held containers?




 The min-held-containers constraint is not enforced during query runtime 
 

 Key: TEZ-2217
 URL: https://issues.apache.org/jira/browse/TEZ-2217
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Gopal V
Assignee: Bikas Saha
 Attachments: TEZ-2217.txt.bz2


 The min-held containers constraint is respected during query idle times, but 
 is not respected when a query is actually in motion.
 The AM releases unused containers during dag execution without checking for 
 min-held containers.
 {code}
 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
 rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing 
 container, containerId=container_1424502260528_1348_01_13, 
 containerExpiryTime=1426891313264, idleTimeoutMin=5000
 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
 rm.YarnTaskSchedulerService: Releasing unused container: 
 container_1424502260528_1348_01_13
 {code}
 This is actually useful only after the AM has received a soft pre-emption 
 message; doing it on an idle cluster slows down one of the most common query 
 patterns in BI systems.
 {code}
 create temporary table smalltable as ...; 
 select ... bigtable JOIN smalltable ON ...;
 {code}
 The smaller query in the beginning throws away the pre-warmed capacity.





[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime

2015-03-20 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372361#comment-14372361
 ] 

Gopal V commented on TEZ-2217:
--

The performance bug occurs within the queue; it is a straightforward failure 
to utilize available resources properly.

I can work-around this issue by upping the max-delay for releasing containers, 
but that's where the pre-emption scenario is relevant - that approach isn't 
good on a busy cluster with a lot of applications. I don't want to hurt the 
performance or concurrency considerations.

The AM-RM allocate responses contain the pre-emption messages, which should be 
a good enough indicator that a certain fraction of currently held resources 
will be removed soon.

The pre-emption period is the gap between that event and the event of 
termination, in which period the min-held rules will not be enforced - to 
protect the ones actually doing work, idle containers should be allowed to 
expire (still following the min-max delay curve).

I suspect the pre-emption contracts do not expose that sunset period for the 
ill-fated containers, which makes it slightly harder to do this directly off 
the message - perhaps there is one I can't find?

That would make the system performant on an idle cluster while still handling the 
mid-query bottlenecking that is common on busy clusters - particularly if the 
container expiry is smaller than the sunset period for pre-empted containers.
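
The release rule described in this comment can be sketched as a small decision function. Everything here is hypothetical: IdleReleasePolicy, shouldRelease, and the boolean pre-emption flag are illustrative names, not Tez or YARN APIs, and the flag is assumed to be set once an AM-RM allocate response has carried a pre-emption message.

```java
/** Hypothetical policy sketch; names are illustrative, not Tez APIs. */
final class IdleReleasePolicy {
    private final int minHeldContainers;

    IdleReleasePolicy(int minHeldContainers) {
        this.minHeldContainers = minHeldContainers;
    }

    /**
     * Decides whether an idle container whose timeout has expired may be released.
     *
     * @param heldContainers    containers currently held by the AM
     * @param preemptionPending true once a pre-emption message has been received
     */
    boolean shouldRelease(int heldContainers, boolean preemptionPending) {
        if (preemptionPending) {
            // Cluster is under pressure: let idle containers expire as usual.
            return true;
        }
        // Idle cluster: keep the pre-warmed floor intact.
        return heldContainers > minHeldContainers;
    }
}
```

On an idle cluster the min-held floor survives the small query, so the following large query starts with warm containers; under pre-emption the floor is waived.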






[jira] [Updated] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2210:

Attachment: TEZ-2210.4.patch

 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch, 
 TEZ-2210.4.patch








[jira] [Created] (TEZ-2218) Turn on speculation by default

2015-03-20 Thread Bikas Saha (JIRA)
Bikas Saha created TEZ-2218:
---

 Summary: Turn on speculation by default
 Key: TEZ-2218
 URL: https://issues.apache.org/jira/browse/TEZ-2218
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Bikas Saha








[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372466#comment-14372466
 ] 

Hadoop QA commented on TEZ-2210:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706106/TEZ-2210.4.patch
  against master revision 6e15b2f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/323//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/323//console

This message is automatically generated.

 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch, 
 TEZ-2210.4.patch








Success: TEZ-2210 PreCommit Build #323

2015-03-20 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2210
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/323/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2762 lines...]
[INFO] Final Memory: 72M/1029M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706106/TEZ-2210.4.patch
  against master revision 6e15b2f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/323//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/323//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
a623705000b5e2ffcba9e6c8cc69bb851de1ce51 logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #320
Archived 44 artifacts
Archive block size is 32768
Received 2 blocks and 2682311 bytes
Compression is 2.4%
Took 1.3 sec
Description set: TEZ-2210
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Created] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime

2015-03-20 Thread Gopal V (JIRA)
Gopal V created TEZ-2217:


 Summary: The min-held-containers constraint is not enforced during 
query runtime 
 Key: TEZ-2217
 URL: https://issues.apache.org/jira/browse/TEZ-2217
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Gopal V
Assignee: Bikas Saha


The min-held containers constraint is respected during query idle times, but is 
not respected when a query is actually in motion.

The AM releases unused containers during dag execution without checking for 
min-held containers.

{code}
2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing 
container, containerId=container_1424502260528_1348_01_13, 
containerExpiryTime=1426891313264, idleTimeoutMin=5000
2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
rm.YarnTaskSchedulerService: Releasing unused container: 
container_1424502260528_1348_01_13
{code}

This is actually useful only after the AM has received a soft pre-emption 
message; doing it on an idle cluster slows down one of the most common query 
patterns in BI systems.

{code}
create temporary table smalltable as ...; 
select ... bigtable JOIN smalltable ON ...;
{code}

The smaller query in the beginning throws away the pre-warmed capacity.





[jira] [Comment Edited] (TEZ-2216) Expose errors during AM initialization

2015-03-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372415#comment-14372415
 ] 

Jeff Zhang edited comment on TEZ-2216 at 3/21/15 1:20 AM:
--

This could be done to a certain extent. It depends on 2 things:
* whether DAGClientServer is started (the client gets diagnostics from 
DAGClientServer)
* whether YarnTaskSchedulerService is started (YarnTaskScheduler will 
unregister the AM from the RM and push diagnostics to the RM, and the client gets 
diagnostics from the RM). For errors during AM initialization, 
YarnTaskSchedulerService may be the key.


was (Author: zjffdu):
This could be done to a certain extent. It depends on 2 things:
* whether DAGClientServer is started (the client gets diagnostics from 
DAGClientServer)
* whether YarnTaskSchedulerService is started (YarnTaskScheduler will 
unregister the AM from the RM and push diagnostics to the RM, and the client gets 
diagnostics from the RM)

 Expose errors during AM initialization
 --

 Key: TEZ-2216
 URL: https://issues.apache.org/jira/browse/TEZ-2216
 Project: Apache Tez
  Issue Type: Bug
Reporter: Bikas Saha

 If there are bad configs or other issues that cause errors/exceptions during 
 AM initialization (eg. during service init) then those errors are not exposed 
 to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2216) Expose errors during AM initialization

2015-03-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372415#comment-14372415
 ] 

Jeff Zhang commented on TEZ-2216:
-

This could be done to a certain extent. It depends on 2 things:
* whether DAGClientServer is started (the client gets diagnostics from 
DAGClientServer)
* whether YarnTaskSchedulerService is started (YarnTaskScheduler will 
unregister the AM from the RM and push diagnostics to the RM, and the client gets 
diagnostics from the RM)

 Expose errors during AM initialization
 --

 Key: TEZ-2216
 URL: https://issues.apache.org/jira/browse/TEZ-2216
 Project: Apache Tez
  Issue Type: Bug
Reporter: Bikas Saha

 If there are bad configs or other issues that cause errors/exceptions during 
 AM initialization (eg. during service init) then those errors are not exposed 
 to the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Updated] (TEZ-2163) Task status update should be handled in the START_WAIT state

2015-03-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-2163:

Issue Type: Bug  (was: Sub-task)
Parent: (was: TEZ-2149)

 Task status update should be handled in the START_WAIT state
 

 Key: TEZ-2163
 URL: https://issues.apache.org/jira/browse/TEZ-2163
 Project: Apache Tez
  Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Jeff Zhang
Priority: Critical
 Attachments: TEZ-2163-1.patch, TEZ-2163-2.patch


 It's possible for a task to send in a STATUS_UPDATE before the 
 TA_STARTED_REMOTELY message is processed within the AM.
 {code}
 2015-02-27 13:21:15,491 ERROR [Dispatcher thread: Central] 
 impl.TaskAttemptImpl: Can't handle this event at current state for 
 attempt_1424502260528_0177_5_03_000223_0
 org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
 TA_STATUS_UPDATE at START_WAIT
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:670)
   at 
 org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:112)
   at 
 org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1835)
   at 
 org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1820)
   at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
   at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
   at java.lang.Thread.run(Thread.java:745)
 {code}
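
The failure in the trace above comes from a transition table with no entry for TA_STATUS_UPDATE in START_WAIT. The sketch below is a hand-rolled illustration and does not use the YARN StateMachineFactory; the class name, the reduced state and event sets, and the tolerateEarlyStatusUpdate switch are assumptions. It shows the rejected event and what handling it in START_WAIT could look like.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical, reduced model of a task attempt state machine. */
final class AttemptStateSketch {
    enum State { START_WAIT, RUNNING }
    enum Event { TA_STARTED_REMOTELY, TA_STATUS_UPDATE }

    private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);
    private State current = State.START_WAIT;

    AttemptStateSketch(boolean tolerateEarlyStatusUpdate) {
        register(State.START_WAIT, Event.TA_STARTED_REMOTELY, State.RUNNING);
        register(State.RUNNING, Event.TA_STATUS_UPDATE, State.RUNNING);
        if (tolerateEarlyStatusUpdate) {
            // Accept the update but stay in START_WAIT until the start event arrives.
            register(State.START_WAIT, Event.TA_STATUS_UPDATE, State.START_WAIT);
        }
    }

    private void register(State from, Event on, State to) {
        transitions.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    /** Applies the event, or throws if no transition is registered for it. */
    State handle(Event event) {
        Map<Event, State> row = transitions.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }
}
```

Without the extra transition, an early status update is rejected exactly as in the log; with it, the update is absorbed and the attempt still moves to RUNNING on TA_STARTED_REMOTELY.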





[jira] [Updated] (TEZ-2163) Task status update should be handled in the START_WAIT state

2015-03-20 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-2163:


Closing this out as Won't fix.






[jira] [Commented] (TEZ-2218) Turn on speculation by default

2015-03-20 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372450#comment-14372450
 ] 

Siddharth Seth commented on TEZ-2218:
-

Are there any pre-conditions for this, testing required, etc.?
We may be better off leaving it off by default - especially for single node 
instances - and allowing it to be enabled per site. Progress reporting from tasks 
is one of the concerns.


 Turn on speculation by default
 --

 Key: TEZ-2218
 URL: https://issues.apache.org/jira/browse/TEZ-2218
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Bikas Saha
Assignee: Bikas Saha







[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372409#comment-14372409
 ] 

Bikas Saha commented on TEZ-2210:
-

bq. AsyncDispatcher should remain as it is not meant for use outside of the Tez 
project.
Not sure what you mean here. I made no change other than explicitly marking it 
@private.

bq. if (null == counters) {
I am not changing that part of the code. That is legacy code probably from MR 
that is expected to update a shared counter object.

bq. protected long getElapsedGc() is not thread-safe.
Again, legacy code that I am not touching in this patch. From what I see, it's 
only called from TaskCounterUpdater which effectively rules out concurrent 
calls.

bq.restoreFromEvent probably does not need a synchronized.
Yes. As of now it does not because it's called during recovery. I am removing it 
since that has caused the new spurious findbugs message.

bq.Can ManagementFactory.getGarbageCollectorMXBeans() ever return null?
It probably does not or else GcTimeUpdater would have been broken all this 
while.

bq. Is  private DAGStatus.State getDAGStatusFromState(
Eclipse showed this as unreferenced code. No complaints about removing it.

I will remove the new synchronized and upload the patch for Jenkins.
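
For readers following the GC-related points, here is a minimal sketch of reading cumulative and elapsed GC time through the standard java.lang.management API. It is not the Tez GcTimeUpdater; GcTimeSketch and its method names are hypothetical. It assumes, as the reply above does, that getGarbageCollectorMXBeans() returns a list rather than null, and it makes the elapsed-time method synchronized so that concurrent callers cannot interleave the read and the update.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper; not the Tez GcTimeUpdater. */
final class GcTimeSketch {
    private long lastGcMillis;

    /** Sums cumulative GC time in milliseconds across all collectors. */
    static long totalGcMillis() {
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        long total = 0;
        for (GarbageCollectorMXBean bean : beans) {
            long millis = bean.getCollectionTime(); // -1 if the collector reports no time
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Returns the GC time accumulated since the previous call on this instance. */
    synchronized long elapsedGcMillis() {
        long now = totalGcMillis();
        long delta = now - lastGcMillis;
        lastGcMillis = now;
        return delta;
    }
}
```

If the method is only ever called from a single updater thread, as the reply suggests for the existing code, the synchronized keyword costs little and simply documents the intent.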





 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch








[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371273#comment-14371273
 ] 

Jeff Zhang commented on TEZ-2204:
-

The findbugs issue should be OK.

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 





[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-20 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2214:
--
Attachment: TEZ-2214.1.patch

[~sseth], [~hitesh] - Please review when you have some time.

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch


 Scenario:
 - commitMemory and usedMemory are beyond their allowed thresholds.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk
 - As it progresses, it releases memory segments as well (but the merge is not yet complete).
 - Fetchers that need less memory than maxSingleShuffleLimit get scheduled.
 - If fetchers are fast, this quickly adds to commitMemory and usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, commitMemory and usedMemory remain beyond their 
 thresholds, and no other fetcher threads (all are stalled) are left 
 to release memory. This causes fetchers to wait indefinitely.
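
The scenario above can be replayed with a small single-threaded model. This is a sketch under stated assumptions, not the MergeManager code: the class, its methods, the numbers, and the simplification that starting a merge resets commitMemory are all hypothetical. The recheckOnFinish flag shows one possible remedy, re-evaluating the merge trigger when a merge completes; the attached patch may do something different.

```java
/** Hypothetical single-threaded model of the missed mem-to-disk merge trigger. */
final class MergeStallSketch {
    private final long memoryLimit;
    private final long mergeThreshold;
    private long usedMemory;
    private long commitMemory;
    private long mergingBytes;
    private boolean mergeInProgress;

    MergeStallSketch(long memoryLimit, long mergeThreshold) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    /** A fetcher asks for memory; false means it must stall and wait. */
    boolean reserve(long size) {
        if (usedMemory > memoryLimit) {
            return false;
        }
        usedMemory += size;
        return true;
    }

    /** A fetcher finished copying; this is the only place a merge gets triggered. */
    void commit(long size) {
        commitMemory += size;
        maybeStartMerge();
    }

    private void maybeStartMerge() {
        if (!mergeInProgress && commitMemory >= mergeThreshold) {
            mergeInProgress = true;
            mergingBytes = commitMemory;
            commitMemory = 0;
        }
    }

    /** The running merge frees some segments before it is done. */
    void releaseDuringMerge(long bytes) {
        long freed = Math.min(bytes, mergingBytes);
        usedMemory -= freed;
        mergingBytes -= freed;
    }

    void finishMerge(boolean recheckOnFinish) {
        usedMemory -= mergingBytes;
        mergingBytes = 0;
        mergeInProgress = false;
        if (recheckOnFinish) {
            maybeStartMerge(); // pick up commits that arrived while merging
        }
    }

    /** Replays the scenario; true means every fetcher would wait forever. */
    static boolean fetchersStayStalled(boolean recheckOnFinish) {
        MergeStallSketch m = new MergeStallSketch(100, 60);
        m.reserve(35); m.commit(35);
        m.reserve(35); m.commit(35);     // commitMemory reaches 70: merge starts
        m.releaseDuringMerge(69);        // merge frees segments as it progresses
        for (int i = 0; i < 3; i++) {    // fast fetchers refill memory meanwhile
            m.reserve(40); m.commit(40); // threshold crossed, but a merge is running
        }
        m.finishMerge(recheckOnFinish);
        while (m.mergeInProgress) {      // run any merge started by the re-check
            m.finishMerge(recheckOnFinish);
        }
        return !m.reserve(10);
    }
}
```

Without the re-check, usedMemory stays above the limit after the merge ends and nothing is left to trigger another merge, which matches the blocked fetcher threads in the dump above.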





[jira] [Commented] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371255#comment-14371255
 ] 

Hadoop QA commented on TEZ-2204:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12705906/TEZ-2204-2.patch
  against master revision 9b845f2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/319//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/319//artifact/patchprocess/newPatchFindbugsWarningstez-common.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/319//console

This message is automatically generated.

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 





Failed: TEZ-2204 PreCommit Build #319

2015-03-20 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2204
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/319/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2753 lines...]


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12705906/TEZ-2204-2.patch
  against master revision 9b845f2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/319//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/319//artifact/patchprocess/newPatchFindbugsWarningstez-common.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/319//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
82da12738622ddd9da36401cbe820cdc0fe397c9 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #315
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2530138 bytes
Compression is 7.2%
Took 1.8 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-20 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2214:
-
Target Version/s: 0.7.0, 0.5.4, 0.6.1  (was: 0.7.0)

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk
 - As it progresses, it releases memory segments as well (but not yet over).
 - Fetchers who need memory < maxSingleShuffleLimit, get scheduled.
 - If fetchers are fast, this quickly adds up to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, commitedMem & usedMem are beyond their 
 threshold and no other fetcher threads (all are in stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.
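
The scenario above can be replayed with a toy, single-threaded model. This is only a sketch under stated assumptions: the class FetcherStallModel, its method names, and the sizes (limit 100, threshold 70) are hypothetical and are not taken from MergeManager. The one behaviour it borrows from the description is that a merge is only considered when a fetcher commits a segment, and never while another merge is in progress.

```java
/** Toy, single-threaded model of the stall; not the actual MergeManager code. */
final class FetcherStallModel {
    private final long memoryLimit;
    private final long mergeThreshold;
    private long usedMemory;      // bytes reserved by fetchers
    private long commitMemory;    // bytes fully fetched and waiting to be merged
    private long mergingBytes;    // bytes still held by the running merge
    private boolean mergeInProgress;

    FetcherStallModel(long memoryLimit, long mergeThreshold) {
        this.memoryLimit = memoryLimit;
        this.mergeThreshold = mergeThreshold;
    }

    /** A fetcher asks for memory; false means it has to wait. */
    boolean reserve(long bytes) {
        if (usedMemory > memoryLimit) {
            return false;
        }
        usedMemory += bytes;
        return true;
    }

    /** A fetcher finished copying; in this model, the only place a merge can start. */
    void commit(long bytes) {
        commitMemory += bytes;
        if (!mergeInProgress && commitMemory >= mergeThreshold) {
            mergeInProgress = true;
            mergingBytes = commitMemory;
            commitMemory = 0;
        }
    }

    /** The running merge flushes some segments to disk and frees their memory. */
    void releaseFromMerge(long bytes) {
        usedMemory -= bytes;
        mergingBytes -= bytes;
    }

    /** The running merge completes and frees whatever it still holds. */
    void finishMerge() {
        usedMemory -= mergingBytes;
        mergingBytes = 0;
        mergeInProgress = false;
    }

    /** No merge is running and no fetcher can reserve, so nobody will call commit(). */
    boolean isStalled() {
        return !mergeInProgress && usedMemory > memoryLimit;
    }

    /** Replays the scenario with made-up sizes and reports whether it ends stalled. */
    static boolean replayScenario() {
        FetcherStallModel m = new FetcherStallModel(100, 70);
        m.reserve(50);
        m.reserve(50);
        m.commit(50);
        m.commit(50);            // commitMemory=100 >= 70: merge starts
        m.releaseFromMerge(50);  // merge frees half: usedMemory=50
        m.reserve(50);
        m.reserve(60);           // fast fetchers refill memory: usedMemory=160
        m.commit(50);
        m.commit(60);            // commitMemory=110, but a merge is already running
        m.finishMerge();         // frees the other 50: usedMemory=110 > 100
        return m.isStalled();
    }

    public static void main(String[] args) {
        System.out.println("stalled = " + replayScenario());
    }
}
```

In this model the final state has commitMemory above the threshold and usedMemory above the limit, yet nothing re-evaluates the merge condition once the merge finishes, which matches the indefinite wait described in the report.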





[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372486#comment-14372486
 ] 

Hitesh Shah commented on TEZ-2210:
--

bq. Not sure what you mean here. I made no change other than explicitly mark it 
@private

My mistake. Hurried review meant that I read the change in reverse :). Yes, 
should be marked private. 

bq. protected long getElapsedGc() is not thread-safe.

Maybe add a comment even though it is legacy code? 

Rest looks fine. One gotcha though which needs to be addressed - if someone 
invokes DAGClient to retrieve the counters while the DAG is in progress, the 
cpu and gc stats will not show up. This might affect how the GcTimeUpdater is 
used though for AM stats. For tasks, calling it multiple times works as 
counters are incremented based on new values from each heartbeat. 
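
As a rough illustration of the two points above (thread safety of the elapsed-GC computation, and counters being incremented from the delta seen at each call), here is a minimal sketch. GcTimeTracker is a hypothetical class, not Tez's GcTimeUpdater; the only real API it relies on is the JDK's ManagementFactory.getGarbageCollectorMXBeans() and GarbageCollectorMXBean.getCollectionTime().

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper; tracks GC time deltas between successive calls. */
final class GcTimeTracker {
    private long lastGcMillis = 0;

    /** Sums the accumulated collection time over all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        for (GarbageCollectorMXBean bean : beans) {
            long millis = bean.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /**
     * Returns GC time accumulated since the previous call. The method is
     * synchronized so that the read-compute-update of lastGcMillis stays atomic
     * when several threads call it.
     */
    synchronized long elapsedSinceLastCall() {
        long now = totalGcMillis();
        long delta = now - lastGcMillis;
        lastGcMillis = now;
        return delta;
    }

    public static void main(String[] args) {
        GcTimeTracker tracker = new GcTimeTracker();
        System.out.println("GC ms since JVM start: " + tracker.elapsedSinceLastCall());
        System.out.println("GC ms since previous call: " + tracker.elapsedSinceLastCall());
    }
}
```

Because each call returns only the delta, a caller that adds the result to a counter on every heartbeat ends up with the running total, whereas a caller that reads it once at the end sees nothing in the counter while work is still in progress.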
 


 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch, 
 TEZ-2210.4.patch








[jira] [Updated] (TEZ-2211) Tez UI: Allow users to configure timezone

2015-03-20 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated TEZ-2211:
-
Component/s: UI

 Tez UI: Allow users to configure timezone
 -

 Key: TEZ-2211
 URL: https://issues.apache.org/jira/browse/TEZ-2211
 Project: Apache Tez
  Issue Type: Improvement
  Components: UI
Reporter: Jonathan Eagles







[jira] [Updated] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018

2015-03-20 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-2047:
-
Priority: Blocker  (was: Major)

 Build fails against hadoop-2.2 post TEZ-2018
 

 Key: TEZ-2047
 URL: https://issues.apache.org/jira/browse/TEZ-2047
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-2047.1.patch


 Failed to execute goal 
 org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
 on project tez-dag: Compilation failure: Compilation failure:
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13]
  cannot find symbol
 [ERROR] symbol  : method 
 withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy)
 [ERROR] location: class 
 org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp>
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45]
  cannot find symbol
 [ERROR] symbol  : method getConnectorAddress(int)
 [ERROR] location: class org.apache.hadoop.http.HttpServer





[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-20 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2204:

Attachment: (was: TEZ-2204-2.patch)

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-2204-1.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 





[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-20 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2204:

Attachment: TEZ-2204-2.patch

Upload new patch.

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 





[jira] [Comment Edited] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372097#comment-14372097
 ] 

Hitesh Shah edited comment on TEZ-2210 at 3/20/15 9:01 PM:
---

Comments:

AsyncDispatcher should remain as it is not meant for use outside of the Tez 
project.

{code}
if (null == counters) {
  return; // nothing to do.
}
{code}
Why will this ever be null? 

protected long getElapsedGc() is not thread-safe. 

Can ManagementFactory.getGarbageCollectorMXBeans() ever return null? 

restoreFromEvent probably does not need a synchronized. 

I think the findbugs issue might be fixed if the dag counter is updated in 
mayBeConstructFinalFullCounters()

Is  private DAGStatus.State getDAGStatusFromState(DAGState finalState)  no 
longer used in the DAGClient code path? 








was (Author: hitesh):
Comments:

AsyncDispatcher should remain as it is not meant for use outside of the Tez 
project.

{code}
if (null == counters) {
  return; // nothing to do.
}

Why will this ever be null? 

protected long getElapsedGc() is not thread-safe. 

Can ManagementFactory.getGarbageCollectorMXBeans() ever return null? 

restoreFromEvent probably does not need a synchronized. 

I think the findbugs issue might be fixed if the dag counter is updated in 
mayBeConstructFinalFullCounters()

Is  private DAGStatus.State getDAGStatusFromState(DAGState finalState)  no 
longer used in the DAGClient code path? 







 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch








[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372097#comment-14372097
 ] 

Hitesh Shah commented on TEZ-2210:
--

Comments:

AsyncDispatcher should remain as it is not meant for use outside of the Tez 
project.

{code}
if (null == counters) {
  return; // nothing to do.
}

Why will this ever be null? 

protected long getElapsedGc() is not thread-safe. 

Can ManagementFactory.getGarbageCollectorMXBeans() ever return null? 

restoreFromEvent probably does not need a synchronized. 

I think the findbugs issue might be fixed if the dag counter is updated in 
mayBeConstructFinalFullCounters()

Is  private DAGStatus.State getDAGStatusFromState(DAGState finalState)  no 
longer used in the DAGClient code path? 







 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch








Failed: TEZ-2205 PreCommit Build #322

2015-03-20 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2205
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/322/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2754 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706018/TEZ-2205.wip.patch
  against master revision 6e15b2f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/322//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/322//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
1dde7251f57d8acac45b58c10df0e44ed7dd1159 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #320
Archived 44 artifacts
Archive block size is 32768
Received 2 blocks and 2688105 bytes
Compression is 2.4%
Took 0.64 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372109#comment-14372109
 ] 

Hadoop QA commented on TEZ-2205:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706018/TEZ-2205.wip.patch
  against master revision 6e15b2f.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/322//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/322//console

This message is automatically generated.

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, 
 but hits error as token is not found. Does not fail the job because of the 
 fix to not fail job when there is error posting to ATS. But it should not be 
 trying to post to ATS in the first place.





[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-03-20 Thread Matt Foley (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372119#comment-14372119
 ] 

Matt Foley commented on TEZ-1923:
-

Yes please!

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.7.0

 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly gets into infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 appropriately and not releasing the memory back for fetchers to proceed. e.g 
 debug/patch messages are given below
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in memory merging is invoked under the following condition
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)
 {code}
 Attaching the 
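
To make the arithmetic in the debug output concrete, the sketch below evaluates the merge condition quoted above against the three logged samples, reading its operators as `&&` and `>=`. The class MergeTriggerCheck and its method names are hypothetical. Two things are assumptions for illustration: that no merge was in progress at those moments, and that fetchers wait whenever usedMemory exceeds memoryLimit, as the annotations in the log suggest.

```java
/** Hypothetical helper that replays the logged values against the merge condition. */
final class MergeTriggerCheck {

    /** Mirrors the quoted condition: merge only if none is running and enough is committed. */
    static boolean wouldStartMerge(boolean mergeInProgress, long commitMemory, long mergeThreshold) {
        return !mergeInProgress && commitMemory >= mergeThreshold;
    }

    /** Assumed wait rule: fetchers cannot reserve while usage is above the limit. */
    static boolean fetchersMustWait(long usedMemory, long memoryLimit) {
        return usedMemory > memoryLimit;
    }

    public static void main(String[] args) {
        // Columns: usedMemory, memoryLimit, commitMemory, mergeThreshold (from the log above).
        long[][] samples = {
            {1551867234L, 1073741824L, 883028388L, 708669632L},
            {1273349784L, 1073741824L, 347296632L, 708669632L},
            {1191994052L, 1073741824L, 523155206L, 708669632L},
        };
        for (long[] s : samples) {
            System.out.println("merge starts: " + wouldStartMerge(false, s[2], s[3])
                + ", fetchers wait: " + fetchersMustWait(s[0], s[1]));
        }
    }
}
```

For the second and third samples the fetchers have to wait while the merge condition is false, which is the combination the report identifies as the cause of the loop.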

[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372045#comment-14372045
 ] 

Hitesh Shah commented on TEZ-2205:
--

[~rohini] At the crux of this, there are effectively a couple of questions: 
what is the logical behavior, and what is the expectation? The fix for any 
approach is probably trivial. 

Consider the case where the user has configured YARN and Tez in a conflicting 
manner, i.e. configured YARN to disable timeline but made Tez use Timeline. 
Should Tez:

1) Error out due to the conflicting configuration, i.e. YARN timeline disabled 
but Tez ATS logger enabled?
2) Try to use Timeline (even though the YARN flag is set to false) and ignore 
its failures as needed? This should be ok for the most part, except that I 
think there are some cases in YARN which are not handled cleanly and end up 
causing the app to error out. Also, there were some behavioral changes in 
YARN-2375 - see below. 
3) Look for the YARN configuration property and silently ignore the fact that 
TimelineATSLogger has been configured but should not be used? 

Also, FWIW, earlier ( before YARN-2375 ), even though Tez invoked 
Timeline::postEntities, if the YARN flag was set to false, the YARN library 
silently dropped the call. 

(2) is probably something that YARN needs to address. 

As for Tez, we can go with either (1) or (3). (1) is more clear-cut: it makes 
it obvious to the user how to configure Tez. (3) merely hides the fact that 
something is wrongly configured. Also, to clarify, part of this stems from the 
question of what the yarn.timeline-service.enabled flag is meant to be used 
for. Is it an admin flag to control whether timeline is enabled or disabled 
for the whole cluster? It currently is a client-side flag that cannot be 
enforced at all. Furthermore, if it is meant to be used on a per-job basis, 
should it then be a Tez-specific setting (which we already have in the form of 
the class setting)?

Last question for [~rohini]: does the issue of disabling ATS stem from the fact 
that it is a bit hard to disable ATS logging via the service class name 
property? 

 








 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, 
 but hits error as token is not found. Does not fail the job because of the 
 fix to not fail job when there is error posting to ATS. But it should not be 
 trying to post to ATS in the first place.





[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371983#comment-14371983
 ] 

Jonathan Eagles commented on TEZ-2205:
--

This seems like a patch that will get the conversation going on what to do for 
this jira. [~rohini], can you comment from a Pig perspective? Essentially, we 
are deeming it illegal to specify ATSHistoryLogger when the timeline service is 
disabled. This approach gives clients control over which history logger to 
use but leaves yarn.timeline-service.enabled as a flag that describes a 
feature that is available on the cluster.

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, 
 but hits error as token is not found. Does not fail the job because of the 
 fix to not fail job when there is error posting to ATS. But it should not be 
 trying to post to ATS in the first place.





[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371984#comment-14371984
 ] 

Chang Li commented on TEZ-2205:
---

[~hitesh] I posted a tentative patch which checks whether Tez has its logging 
service class set to atsHistoryLoggingService while timeline-service is not 
enabled. In that case it throws an exception with an alert, stopping the Tez 
job from launching.
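
A fail-fast check of the kind described in this comment might look roughly like the sketch below. It is not the attached patch. HistoryLoggingConfigCheck is a hypothetical class, a plain Map stands in for the Hadoop Configuration object, and the Tez property key and logger class name used here are assumptions for illustration; only yarn.timeline-service.enabled is taken from this thread. Treating a missing flag as disabled is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fail-fast validation for a conflicting YARN/Tez history setup. */
final class HistoryLoggingConfigCheck {
    static final String TIMELINE_ENABLED_KEY = "yarn.timeline-service.enabled";
    // Assumed names, for illustration only.
    static final String LOGGING_CLASS_KEY = "tez.history.logging.service.class";
    static final String ATS_LOGGING_CLASS =
        "org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService";

    /** Throws if the ATS logger is requested while the timeline service is disabled. */
    static void validate(Map<String, String> conf) {
        // Assumption: an unset flag is treated as "false".
        boolean timelineEnabled =
            Boolean.parseBoolean(conf.getOrDefault(TIMELINE_ENABLED_KEY, "false"));
        boolean atsLoggerConfigured = ATS_LOGGING_CLASS.equals(conf.get(LOGGING_CLASS_KEY));
        if (atsLoggerConfigured && !timelineEnabled) {
            throw new IllegalStateException(
                LOGGING_CLASS_KEY + " is set to the ATS logger but "
                    + TIMELINE_ENABLED_KEY + " is not true");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(TIMELINE_ENABLED_KEY, "false");
        conf.put(LOGGING_CLASS_KEY, ATS_LOGGING_CLASS);
        try {
            validate(conf);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Failing at validation time surfaces the misconfiguration to the user immediately, at the cost of refusing to launch a job that would otherwise have run without history logging.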

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, 
 but hits error as token is not found. Does not fail the job because of the 
 fix to not fail job when there is error posting to ATS. But it should not be 
 trying to post to ATS in the first place.





[jira] [Updated] (TEZ-2203) Intern strings in tez counters

2015-03-20 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2203:

Attachment: TEZ-2203.2.patch

 Intern strings in tez counters
 --

 Key: TEZ-2203
 URL: https://issues.apache.org/jira/browse/TEZ-2203
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2203.1.patch, TEZ-2203.2.patch


 Getting per IO counters is possible today. This jira tracks work needed to 
 enable them by default. Interning strings to save memory is one item 
 needed.
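
As a small illustration of why interning helps here: counter and group names repeat across many tasks, and interning lets equal names share one String instance instead of one copy per counter. The sketch below uses the JDK's String.intern(); whether the patch uses that or another interner is not stated in this thread, so treat the choice as an assumption. CounterNames and the sample counter name are hypothetical.

```java
/** Hypothetical helper showing how interning collapses equal counter names. */
final class CounterNames {

    /** Returns the canonical shared instance for the given name. */
    static String canonical(String name) {
        return name.intern();
    }

    public static void main(String[] args) {
        // Two distinct String objects with the same content, as would arrive
        // from two separately deserialized counters.
        String first = new String("FILE_BYTES_READ");
        String second = new String("FILE_BYTES_READ");

        System.out.println("same object before interning: " + (first == second));
        System.out.println("same object after interning: "
            + (canonical(first) == canonical(second)));
    }
}
```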





[jira] [Commented] (TEZ-2203) Intern strings in tez counters

2015-03-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371906#comment-14371906
 ] 

Bikas Saha commented on TEZ-2203:
-

Thanks! Uploading commit patch with commented code removed.

 Intern strings in tez counters
 --

 Key: TEZ-2203
 URL: https://issues.apache.org/jira/browse/TEZ-2203
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2203.1.patch, TEZ-2203.2.patch


 Getting per IO counters is possible today. This jira tracks work needed to 
 enable them by default. Interning strings to save memory is one item 
 needed.





[jira] [Created] (TEZ-2216) Expose errors during AM initialization

2015-03-20 Thread Bikas Saha (JIRA)
Bikas Saha created TEZ-2216:
---

 Summary: Expose errors during AM initialization
 Key: TEZ-2216
 URL: https://issues.apache.org/jira/browse/TEZ-2216
 Project: Apache Tez
  Issue Type: Bug
Reporter: Bikas Saha


If there are bad configs or other issues that cause errors/exceptions during AM 
initialization (eg. during service init) then those errors are not exposed to 
the user. Exposing them would be useful in quickly debugging such issues.





[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372005#comment-14372005
 ] 

Rohini Palaniswamy commented on TEZ-2205:
-

{code}
public void handle(DAGHistoryEvent event) {
  eventQueue.add(event);
}
{code}

It would be a simple check to not add to the queue if timeline service is 
disabled.
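
The guard suggested here could be sketched as follows. This is not the Tez logging service: GuardedHistoryLogger is a hypothetical class, the type parameter E stands in for DAGHistoryEvent, and the boolean passed to the constructor is assumed to have been read from yarn.timeline-service.enabled elsewhere.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical logger that drops events when the timeline service is disabled. */
final class GuardedHistoryLogger<E> {
    private final boolean timelineEnabled;
    private final BlockingQueue<E> eventQueue = new LinkedBlockingQueue<>();

    GuardedHistoryLogger(boolean timelineEnabled) {
        this.timelineEnabled = timelineEnabled;
    }

    /** Queues the event for posting, or silently drops it when timeline is off. */
    public void handle(E event) {
        if (!timelineEnabled) {
            return; // nothing will consume the queue, so do not fill it
        }
        eventQueue.add(event);
    }

    /** Number of events waiting to be posted. */
    int pending() {
        return eventQueue.size();
    }

    public static void main(String[] args) {
        GuardedHistoryLogger<String> disabled = new GuardedHistoryLogger<>(false);
        disabled.handle("DAG_SUBMITTED");
        System.out.println("pending when disabled: " + disabled.pending());
    }
}
```

Compared with failing at startup, this variant keeps the job running but hides the conflicting configuration from the user, which is the trade-off discussed elsewhere in this thread.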

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, 
 but hits error as token is not found. Does not fail the job because of the 
 fix to not fail job when there is error posting to ATS. But it should not be 
 trying to post to ATS in the first place.





[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372025#comment-14372025
 ] 

Jonathan Eagles commented on TEZ-2205:
--

[~hitesh], do you want to comment on this? [~rohini], there are many ways to 
approach this. We have all been discussing the pros and cons. I wanted to loop 
you into the conversation since you may have better insight as to what users 
can expect.

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 when set yarn.timeline-service.enabled=false, Tez still tries posting to ATS, 
 but hits error as token is not found. Does not fail the job because of the 
 fix to not fail job when there is error posting to ATS. But it should not be 
 trying to post to ATS in the first place.





[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371951#comment-14371951
 ] 

Bikas Saha commented on TEZ-2210:
-

Thanks for the review!
bq. In GcTimeUpdater, should this be moved to constructor itself?
I don't think so. It's only relevant in the incrementCounter method, which is 
probably tied to some legacy code style where the counters are just passed into 
this update object for convenience.

bq. In DAGAppMaster, cpuPlugin & GcTimeUpdater are always initialized; Do we 
need the null check in getAMGCTime, getAMCPUTime?
The initialization code may fail while initing them or before initing them 
(while initing something else). Hence those null checks are needed.

bq. In DAGAppMaster, initResourceCalculatorPlugins() is called in 
serviceInit(). If user makes any mistake in configuring the plugin, 
initResourceCalculatorPlugins() could get RuntimeException? Info might not be 
available to end user to find out the reason for AM not starting (Could get 
ExitCodeException exitCode=??)
That's possible, but this is a general problem of failing during init and should 
probably be fixed for this and other cases. Opened TEZ-2216 for this.

Updated the patch for the findbugs comment. The patch is fine and the comment 
is spurious; I am removing the unnecessary synchronization in the private 
methods that is causing it.


 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch








[jira] [Updated] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2210:

Attachment: TEZ-2210.3.patch

 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch








[jira] [Updated] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated TEZ-2205:
--
Attachment: TEZ-2205.wip.patch

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries to post to 
 ATS but hits an error because the token is not found. The job does not fail, 
 thanks to the earlier fix that keeps ATS posting errors from failing the job, 
 but Tez should not be trying to post to ATS in the first place.





[jira] [Updated] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Chang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Li updated TEZ-2205:
--
Attachment: (was: TEZ-2205.wip.patch)

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries to post to 
 ATS but hits an error because the token is not found. The job does not fail, 
 thanks to the earlier fix that keeps ATS posting errors from failing the job, 
 but Tez should not be trying to post to ATS in the first place.





[jira] [Commented] (TEZ-2210) Record DAG AM CPU usage stats

2015-03-20 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371974#comment-14371974
 ] 

Bikas Saha commented on TEZ-2210:
-

[~hitesh] Can you please take a look? I have removed the synchronization from 
the private method that was causing the spurious warning. All invocations of 
this method are already thread safe except for a public method, 
restoreFromEvent(), related to recovery, which I have now synchronized. I think 
that should be fine.

 Record DAG AM CPU usage stats
 -

 Key: TEZ-2210
 URL: https://issues.apache.org/jira/browse/TEZ-2210
 Project: Apache Tez
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
 Attachments: TEZ-2210.1.patch, TEZ-2210.2.patch, TEZ-2210.3.patch








[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371996#comment-14371996
 ] 

Rohini Palaniswamy commented on TEZ-2205:
-

Wouldn't it be better if ATSHistoryLogger checked the value of the timeline 
setting and did nothing if it is false?
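A rough sketch of that suggestion, assuming the logger reads the flag once at 
construction time. The class below is a hypothetical stand-in that takes a plain 
Map instead of a Hadoop Configuration and records events in a list instead of 
posting them; only the property name comes from this issue. Treating a missing 
key as disabled is also an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a history logger; not the actual Tez class. */
final class TimelineGuardedLoggerSketch {
    static final String TIMELINE_ENABLED_KEY = "yarn.timeline-service.enabled";

    private final boolean timelineEnabled;
    private final List<String> posted = new ArrayList<>();

    TimelineGuardedLoggerSketch(Map<String, String> conf) {
        // Assumption: a missing key is treated the same as "false".
        this.timelineEnabled =
            Boolean.parseBoolean(conf.getOrDefault(TIMELINE_ENABLED_KEY, "false"));
    }

    /** Drops the event when the timeline service is disabled. */
    void handle(String event) {
        if (!timelineEnabled) {
            return; // never reach the code path that needs the ATS token
        }
        posted.add(event); // placeholder for the real post to ATS
    }

    List<String> postedEvents() {
        return posted;
    }
}
```

Checking the flag inside the logger keeps callers unchanged; the alternative is 
to not register the logger at all when the setting is false.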

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries to post to 
 ATS but hits an error because the token is not found. The job does not fail, 
 thanks to the earlier fix that keeps ATS posting errors from failing the job, 
 but Tez should not be trying to post to ATS in the first place.





[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2015-03-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366825#comment-14366825
 ] 

Jeff Zhang edited comment on TEZ-1909 at 3/20/15 9:14 AM:
--

Attached a new patch to address the review comments. [~hitesh] Please help 
review.
Apart from the issues in the review comments, I also found an issue in 
RecoveryService. When draining events before RecoveryService is stopped, I 
previously took an event queue size of zero as an indication that all events 
had been consumed, but that is not true: even if the event queue is empty, an 
event may still be being processed. I fixed this bug in the new patch in the 
same way AsyncDispatcher does.
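To illustrate the drain problem, here is a minimal sketch in which "drained" 
means no event is queued and none is in flight. It uses a pending counter rather 
than the flag-based approach in AsyncDispatcher, and the class and method names 
are hypothetical, not the ones in the patch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical event loop that tracks in-flight events as well as queued ones. */
final class DrainTrackingQueueSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private int pending = 0; // queued + currently processing; guarded by lock
    private final Thread worker;

    DrainTrackingQueueSketch() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    Runnable event = queue.take(); // queue may now be empty...
                    try {
                        event.run();               // ...while this is still running
                    } finally {
                        synchronized (lock) {
                            pending--;
                            lock.notifyAll();
                        }
                    }
                }
            } catch (InterruptedException e) {
                // stop() was called; exit the loop
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    void enqueue(Runnable event) {
        synchronized (lock) {
            pending++; // counted before it becomes visible to the worker
        }
        queue.add(event);
    }

    /** Blocks until every enqueued event has finished; false if interrupted. */
    boolean awaitDrained() {
        synchronized (lock) {
            try {
                while (pending > 0) {
                    lock.wait();
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    void stop() {
        worker.interrupt();
    }
}
```

A stop routine would call awaitDrained() before stop(); checking 
queue.isEmpty() alone would return while the last event is still running.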

bq. the if (skipAllOtherEvents) { check is probably also needed at the top of 
the loop to prevent new files from being opened and read ( in addition to 
short-circuiting the read of all events in the given file ). Maybe just log a 
message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.
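The shape of that check, as a minimal sketch: the flag is tested at the top of 
the per-file loop so that later files are never opened, and the skipped files 
are recorded so they can be logged. Files are modelled as in-memory lists, and 
the STOP marker that sets the flag is a hypothetical trigger chosen for the 
sketch; the real condition in the parser is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser loop; only the flag name comes from the review comment. */
final class SkipRemainingFilesSketch {
    // Hypothetical marker: everything after it must be ignored.
    static final String STOP_MARKER = "STOP";

    /** Returns the events that were read; adds the names of skipped files to skippedFiles. */
    static List<String> parse(List<List<String>> files, List<String> skippedFiles) {
        List<String> seen = new ArrayList<>();
        boolean skipAllOtherEvents = false;
        for (int i = 0; i < files.size(); i++) {
            if (skipAllOtherEvents) {
                // Check at the top of the loop: do not even open this file.
                skippedFiles.add("file-" + i);
                continue;
            }
            for (String event : files.get(i)) {
                if (STOP_MARKER.equals(event)) {
                    skipAllOtherEvents = true;
                    break; // short-circuit the rest of the current file
                }
                seen.add(event);
            }
        }
        return seen;
    }
}
```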

bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs() 
?
Only for the unit test. In the new patch, I removed it and initialize the Set 
in the setup method.

bq. also, we should add a test for adding corrupt data to the summary stream 
and ensuring that its processing fails
Done.

bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used 
anywhere apart from being set to true in one of the tests.
Fixed.

bq. please replace import com.sun.tools.javac.util.List; with java.util.List
Fixed.

bq. testCorruptedLastRecord should also verify that the dag submitted event was 
seen.
Done. The test now verifies that DAGAppMaster.createDAG is invoked.







was (Author: zjffdu):
Attached a new patch to address the review comments.
Apart from the issues in the review comments, I also found an issue in 
RecoveryService. When draining events before RecoveryService is stopped, I 
previously took an event queue size of zero as an indication that all events 
had been consumed, but that is not true: even if the event queue is empty, an 
event may still be being processed. I fixed this bug in the new patch in the 
same way AsyncDispatcher does.

bq. the if (skipAllOtherEvents) { check is probably also needed at the top of 
the loop to prevent new files from being opened and read ( in addition to 
short-circuiting the read of all events in the given file ). Maybe just log a 
message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.

bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs() 
?
Only for the unit test. In the new patch, I removed it and initialize the Set 
in the setup method.

bq. also, we should add a test for adding corrupt data to the summary stream 
and ensuring that its processing fails
Done.

bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used 
anywhere apart from being set to true in one of the tests.
Fixed.

bq. please replace import com.sun.tools.javac.util.List; with java.util.List
Fixed.

bq. testCorruptedLastRecord should also verify that the dag submitted event was 
seen.
Done. The test now verifies that DAGAppMaster.createDAG is invoked.






 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch


 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken when handling a corrupt last 
 record. 


