[jira] [Commented] (TEZ-986) Make conf set on DAG and vertex available in jobhistory

2015-03-24 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377568#comment-14377568
 ] 

Rohini Palaniswamy commented on TEZ-986:


bq. viewable in Tez UI after the job completes. This is very essential for 
debugging jobs.
   Just wanted to mention that we need this for Pig, and it would be good to 
have it in one of the upcoming releases. While debugging some recent issues, I 
realized that I don't have access to the pig script if the user ran it from a 
gateway (for Oozie I can get it from the launcher job), because the pig.script 
setting is only set on the DAG config, along with a few other settings useful 
for debugging, like the pig version. For now, I have other workarounds to get 
this info, or I resort to asking the user.

The vertex config also has some important debugging info, like which feature is 
being run (group by, etc.) and input/output dirs. Even for this I can manage in 
the short term and figure these out from the explain output of the script. But 
life would be easier if those were shown in the UI.
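
For illustration, a minimal sketch of how such settings get attached today, 
assuming the DAG/Vertex setConf(String, String) API; the keys and values shown 
are hypothetical stand-ins for what Pig would set:

{code}
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

// Hypothetical illustration: Pig attaching debug metadata to the DAG and
// vertex configs. TEZ-986 asks that these key/value pairs be persisted to
// job history so the Tez UI can show them after the job completes.
DAG dag = DAG.create("pig-script-dag");
dag.setConf("pig.script", "a = load 'input'; b = group a by $0; dump b;");
dag.setConf("pig.version", "0.14.0");

// Vertex-level settings, e.g. which Pig feature a vertex implements:
Vertex v = Vertex.create("scope-12", ProcessorDescriptor.create("PigProcessor"), 1);
v.setConf("pig.job.feature", "GROUP_BY");
{code}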

 Make conf set on DAG and vertex available in jobhistory
 ---

 Key: TEZ-986
 URL: https://issues.apache.org/jira/browse/TEZ-986
 Project: Apache Tez
  Issue Type: Sub-task
  Components: UI
Reporter: Rohini Palaniswamy
Priority: Blocker

 Would like to have the conf set on DAG and Vertex
   1) viewable in Tez UI after the job completes. This is very essential for 
 debugging jobs.
  2) We have processes that parse jobconf.xml from job history (HDFS) and 
 load them into Hive tables for analysis. Would like to have Tez also make all 
 the configuration (byte array) available in job history so that we can 
 similarly parse them. 1) mandates that you store it in HDFS. 2) just asks 
 that the stored format be a contract others can rely on for parsing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1433#comment-1433
 ] 

Hadoop QA commented on TEZ-2224:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706880/TEZ-2224-1.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/338//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/338//console

This message is automatically generated.

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Success: TEZ-2224 PreCommit Build #338

2015-03-24 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2224
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/338/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2754 lines...]
[INFO] Final Memory: 72M/762M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706880/TEZ-2224-1.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/338//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/338//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
d98f08814b5cfb8ee2fbc55e0c5bb28a28e2e817 logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #335
Archived 44 artifacts
Archive block size is 32768
Received 4 blocks and 2622022 bytes
Compression is 4.8%
Took 1.3 sec
Description set: TEZ-2224
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Comment Edited] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377733#comment-14377733
 ] 

Jeff Zhang edited comment on TEZ-2224 at 3/24/15 11:39 AM:
---

Uploaded a patch for this issue, just like how AsyncDispatcher does it. 
[~hitesh] Please help review it.


was (Author: zjffdu):
Uploaded a patch for this issue, just like how AsyncDispatcher does it.

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377562#comment-14377562
 ] 

Rohini Palaniswamy commented on TEZ-2205:
-

I prefer 3). For jobs launched through Oozie, it is easy to turn off ATS via an 
Oozie server-side setting, and this might be required now and then in the near 
future, considering the issues we are facing with ATS. Since tez-site.xml for 
those jobs comes from HDFS, it is not easy to change the Tez ATS logger 
(replacing the file on HDFS is more manual and can cause running jobs to fail 
as the LocalResource timestamp has changed), so I do not like 1). Also, having 
to change multiple settings to turn off something is cumbersome.  2) is what is 
happening now, but the problem I see is that it impacts performance, as time is 
wasted trying to connect to ATS and failing due to lack of authentication. 

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS. 
 But it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377733#comment-14377733
 ] 

Jeff Zhang commented on TEZ-2224:
-

Uploaded a patch for this issue, just like how AsyncDispatcher does it.

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2224:

Attachment: TEZ-2224-1.patch

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2224:

Description: If the event queue is empty, an event may still be being 
processed. Should fix it like AsyncDispatcher

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378769#comment-14378769
 ] 

Hitesh Shah commented on TEZ-2221:
--

Understood, but we should have 2 checks being done instead of re-writing the 
existing check. 

{code}
dag.createVertexGroup("group_1", v1, v2);
try {
  dag.createVertexGroup("group_1", v2, v3);
  Assert.fail();
} ...
try {
  dag.createVertexGroup("group_2", v1, v2);
  Assert.fail();
} ...
{code}
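
The two checks might look roughly like this inside DAG (a sketch with assumed 
field and accessor names, not the attached patch):

{code}
// Sketch: reject a reused group name, and reject a second group over the
// same member set, since VertexGroupCommitStartedEvent /
// VertexGroupCommitFinishedEvent key the commit on the group name alone.
// (java.util.{Map,HashMap,Set,HashSet,Arrays} imports assumed.)
private final Map<String, VertexGroup> vertexGroups =
    new HashMap<String, VertexGroup>();

public synchronized VertexGroup createVertexGroup(String name, Vertex... members) {
  if (vertexGroups.containsKey(name)) {
    throw new IllegalStateException("VertexGroup name already in use: " + name);
  }
  Set<Vertex> memberSet = new HashSet<Vertex>(Arrays.asList(members));
  for (VertexGroup existing : vertexGroups.values()) {
    if (memberSet.equals(existing.getMembers())) { // accessor name assumed
      throw new IllegalStateException(
          "VertexGroup with the same members already exists");
    }
  }
  VertexGroup group = new VertexGroup(this, name, members); // ctor shape assumed
  vertexGroups.put(name, group);
  return group;
}
{code}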




  

 VertexGroup name should be unqiue
 -

 Key: TEZ-2221
 URL: https://issues.apache.org/jira/browse/TEZ-2221
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2221-1.patch


 VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use the vertex 
 group name to identify the vertex group commit, so vertex groups with the same 
 name will conflict. Meanwhile, the current equals & hashCode of VertexGroup 
 use both the vertex group name and the member names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2226) Disable writing history to timeline if domain creation fails.

2015-03-24 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-2226:


 Summary: Disable writing history to timeline if domain creation 
fails.
 Key: TEZ-2226
 URL: https://issues.apache.org/jira/browse/TEZ-2226
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378907#comment-14378907
 ] 

Hadoop QA commented on TEZ-2217:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12707032/TEZ-2217.2.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/341//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/341//console

This message is automatically generated.

 The min-held-containers constraint is not enforced during query runtime 
 

 Key: TEZ-2217
 URL: https://issues.apache.org/jira/browse/TEZ-2217
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Gopal V
Assignee: Bikas Saha
 Attachments: TEZ-2217-debug.txt.bz2, TEZ-2217.1.patch, 
 TEZ-2217.2.patch, TEZ-2217.txt.bz2


 The min-held containers constraint is respected during query idle times, but 
 is not respected when a query is actually in motion.
 The AM releases unused containers during dag execution without checking for 
 min-held containers.
 {code}
 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
 rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing 
 container, containerId=container_1424502260528_1348_01_13, 
 containerExpiryTime=1426891313264, idleTimeoutMin=5000
 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
 rm.YarnTaskSchedulerService: Releasing unused container: 
 container_1424502260528_1348_01_13
 {code}
 This is actually useful only after the AM has received a soft pre-emption 
 message; doing it on an idle cluster slows down one of the most common query 
 patterns in BI systems.
 {code}
 create temporary table smalltable as ...; 
 select ... bigtable JOIN smalltable ON ...;
 {code}
 The smaller query in the beginning throws away the pre-warmed capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2217) The min-held-containers constraint is not enforced during query runtime

2015-03-24 Thread Bikas Saha (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated TEZ-2217:

Attachment: TEZ-2217.2.patch

New patch. The problem was that the expiry time was not updated until the 
min-held-container expiry time actually elapsed. But if task requests came in 
just before the update happened, then in the next allocation cycle the min-held 
containers would be released, because they had just crossed the expiry time 
boundary. It looks like the timing of the next dag currently hits that race 
condition and probably did not hit it earlier.
Can you please try this out?
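
Roughly the shape of the guard involved (a sketch with assumed names from 
YarnTaskSchedulerService; not the attached patch):

{code}
// Sketch: before releasing an idle container, keep it if we are at or below
// the min-held floor, and refresh its expiry time so the next allocation
// cycle does not act on a stale, already-elapsed timestamp (the race above).
void maybeReleaseIdleContainer(HeldContainer held, long now) {
  if (heldContainers.size() <= sessionMinHeldContainers) {
    // Refresh instead of release; field/method names assumed.
    held.setContainerExpiryTime(now + idleContainerTimeoutMin);
    return;
  }
  if (now >= held.getContainerExpiryTime()) {
    releaseContainer(held.getContainer());
  }
}
{code}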

 The min-held-containers constraint is not enforced during query runtime 
 

 Key: TEZ-2217
 URL: https://issues.apache.org/jira/browse/TEZ-2217
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Gopal V
Assignee: Bikas Saha
 Attachments: TEZ-2217-debug.txt.bz2, TEZ-2217.1.patch, 
 TEZ-2217.2.patch, TEZ-2217.txt.bz2


 The min-held containers constraint is respected during query idle times, but 
 is not respected when a query is actually in motion.
 The AM releases unused containers during dag execution without checking for 
 min-held containers.
 {code}
 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
 rm.YarnTaskSchedulerService: Container's idle timeout expired. Releasing 
 container, containerId=container_1424502260528_1348_01_13, 
 containerExpiryTime=1426891313264, idleTimeoutMin=5000
 2015-03-20 15:41:53,475 INFO [DelayedContainerManager] 
 rm.YarnTaskSchedulerService: Releasing unused container: 
 container_1424502260528_1348_01_13
 {code}
 This is actually useful only after the AM has received a soft pre-emption 
 message; doing it on an idle cluster slows down one of the most common query 
 patterns in BI systems.
 {code}
 create temporary table smalltable as ...; 
 select ... bigtable JOIN smalltable ON ...;
 {code}
 The smaller query in the beginning throws away the pre-warmed capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Chang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378034#comment-14378034
 ] 

Chang Li commented on TEZ-2205:
---

[~hitesh] So should I implement 3) by switching to SimpleHistory, wrapping each 
post call with an if statement, or just not adding post events to the event queue?

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS. 
 But it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-24 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2214:
--
Attachment: TEZ-2214.2.patch

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk.
 - As it progresses, it releases memory segments as well (but is not yet done).
 - Fetchers who need memory < maxSingleShuffleLimit get scheduled.
 - If fetchers are fast, this quickly adds to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, committedMem & usedMem are beyond their 
 threshold, and no other fetcher threads (all are in a stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379013#comment-14379013
 ] 

Hitesh Shah commented on TEZ-2205:
--

A log.warn should be sufficient for now. SimpleHistory comes with its own 
problems of configuring an HDFS location and cleaning it up on a regular basis. 
Looks like we have agreement. 

[~lichangleo] Does the above clarify the implementation requirements for this 
jira? Thanks for the patience in handling the various run-arounds with respect 
to the design :)
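
A minimal sketch of that behavior in Tez's ATS history logging service (method 
and field names illustrative; only the YarnConfiguration constants and the 
TimelineClient factory are real APIs):

{code}
// Sketch: mirror the MR behavior -- only create the timeline client when the
// YARN timeline service is enabled; otherwise warn once and drop events.
@Override
public void serviceInit(Configuration conf) throws Exception {
  historyLoggingEnabled = conf.getBoolean(
      YarnConfiguration.TIMELINE_SERVICE_ENABLED,
      YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
  if (historyLoggingEnabled) {
    timelineClient = TimelineClient.createTimelineClient();
    timelineClient.init(conf);
  } else {
    LOG.warn("ATS history logging is configured but "
        + YarnConfiguration.TIMELINE_SERVICE_ENABLED
        + " is false; history events will not be posted");
  }
}

public void handle(DAGHistoryEvent event) {
  if (!historyLoggingEnabled) {
    return; // never enqueue; avoids wasted connect/auth attempts
  }
  eventQueue.add(event);
}
{code}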




 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS. 
 But it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379039#comment-14379039
 ] 

Jeff Zhang commented on TEZ-2224:
-

bq. Is there a reason why we want to prevent new events from being processed on 
a shutdown?
No special reason for that; I just borrowed the code from AsyncDispatcher. But 
I think you are right: AsyncDispatcher is for the general use case, while here, 
for RecoveryService, we need to handle events even when it is stopped.

bq. TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT being false by 
default is probably wrong. For a real-world scenario, as many pending events 
that are seen and can be processed, should be processed.
Before this flag existed, by default we wouldn't process any pending events; 
was there any consideration at that time? (Not sure; maybe a performance 
consideration, e.g. in non-session mode the dag is finished while recovery 
event handling is not complete but is not necessary.) I tried to be 
conservative and keep the behavior the same as before. 
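
For reference, the drained-flag pattern being borrowed from AsyncDispatcher (a 
simplified sketch, not the Tez patch): queue emptiness alone is not enough, 
because the consumer may have taken the last event and still be processing it.

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the AsyncDispatcher idea: "drained" is set by the
// consumer thread itself, at the top of its loop, i.e. only after the
// previously taken event has been fully processed.
class DrainedAwareConsumer<T> {
  private final BlockingQueue<T> eventQueue = new LinkedBlockingQueue<T>();
  private volatile boolean drained = true;
  private volatile boolean stopped = false;
  private final Object waitForDrained = new Object();

  // Producer side: mark not-drained before enqueueing.
  public void handle(T event) {
    drained = false;
    eventQueue.add(event);
  }

  // Consumer thread body.
  public void runConsumer() throws InterruptedException {
    while (!stopped) {
      synchronized (waitForDrained) {
        drained = eventQueue.isEmpty();
        if (drained) {
          waitForDrained.notifyAll();
        }
      }
      process(eventQueue.take());
    }
  }

  // Stop only once the in-flight event (if any) has completed.
  public void stop() throws InterruptedException {
    synchronized (waitForDrained) {
      while (!drained) {
        waitForDrained.wait(1000);
      }
    }
    stopped = true;
  }

  protected void process(T event) { /* handle recovery event */ }
}
{code}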

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-24 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379179#comment-14379179
 ] 

Siddharth Seth commented on TEZ-2214:
-

Was looking at the .1 patch. The latest patch addresses the sync / visibility 
issue.
Question: This same block could just as well have been placed in the 
waitForInMemoryMerge method ? Essentially, any place where it could be 
triggered after a merge completes.

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk.
 - As it progresses, it releases memory segments as well (but is not yet done).
 - Fetchers who need memory < maxSingleShuffleLimit get scheduled.
 - If fetchers are fast, this quickly adds to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, committedMem & usedMem are beyond their 
 threshold, and no other fetcher threads (all are in a stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379007#comment-14379007
 ] 

Jonathan Eagles commented on TEZ-2205:
--

I think what zhijie has posted is similar to what I am thinking as well. This 
will give the on/off flag to users and keep the client in control. A log WARN 
should be sufficient to alert clients that there is a mismatch in 
configuration. Whether we fall back to SimpleHistory or have a hollow ATS 
History isn't much of a difference to me.

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS. 
 But it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378996#comment-14378996
 ] 

Zhijie Shen commented on TEZ-2205:
--

bq. i.e option 3's impl would be:

LGTM. For your reference, this is what we did in MR:

{code}
if (conf.getBoolean(MRJobConfig.MAPREDUCE_JOB_EMIT_TIMELINE_DATA,
    MRJobConfig.DEFAULT_MAPREDUCE_JOB_EMIT_TIMELINE_DATA)) {
  if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
      YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
    timelineClient = TimelineClient.createTimelineClient();
    timelineClient.init(conf);
    LOG.info("Timeline service is enabled");
    LOG.info("Emitting job history data to the timeline server is enabled");
  } else {
    LOG.info("Timeline service is not enabled");
  }
} else {
  LOG.info("Emitting job history data to the timeline server is not enabled");
}
{code}
And only when {{timelineClient != null}} will MR publish the history info to 
the timeline server.

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS. 
 But it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-2214 PreCommit Build #342

2015-03-24 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2214
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/342/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2754 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12707083/TEZ-2214.2.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/342//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/342//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
48d9285cf144bbef730af593e553b3ddf63b6148 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #341
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2555463 bytes
Compression is 7.1%
Took 1.3 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379091#comment-14379091
 ] 

Hadoop QA commented on TEZ-2214:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12707083/TEZ-2214.2.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/342//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/342//console

This message is automatically generated.

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk.
 - As it progresses, it releases memory segments as well (but is not yet done).
 - Fetchers who need memory < maxSingleShuffleLimit get scheduled.
 - If fetchers are fast, this quickly adds to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, committedMem & usedMem are beyond their 
 threshold, and no other fetcher threads (all are in a stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Success: TEZ-2221 PreCommit Build #343

2015-03-24 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2221
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/343/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2752 lines...]
[INFO] Final Memory: 69M/960M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12707092/TEZ-2221-2.patch
  against master revision 60ddcba.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/343//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/343//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
e544a39ba0d04ae0bf24d66fb79bc3faafb721e2 logged out


==
==
Finished build.
==
==


Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #341
Archived 44 artifacts
Archive block size is 32768
Received 24 blocks and 1934195 bytes
Compression is 28.9%
Took 0.52 sec
Description set: TEZ-2221
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2221) VertexGroup name should be unqiue

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379124#comment-14379124
 ] 

Hadoop QA commented on TEZ-2221:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12707092/TEZ-2221-2.patch
  against master revision 60ddcba.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/343//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/343//console

This message is automatically generated.

 VertexGroup name should be unqiue
 -

 Key: TEZ-2221
 URL: https://issues.apache.org/jira/browse/TEZ-2221
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2221-1.patch, TEZ-2221-2.patch


 VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use the vertex 
 group name to identify the vertex group commit, so vertex groups with the same 
 name will conflict. Meanwhile, the current equals & hashCode of VertexGroup 
 use both the vertex group name and the member names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-24 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379172#comment-14379172
 ] 

Siddharth Seth commented on TEZ-2214:
-

[~rajesh.balamohan] - I'm trying to understand the scenario a little better.

bq. Fetchers who need memory < maxSingleShuffleLimit get scheduled.
Won't the fetchers first block on merger.waitForInMemoryMerge, and then on 
merger.waitForShuffleToMergeMemory() ?
That'll happen to fetchers which aren't currently active - or for the ones 
where the MergeManager returns a WAIT.

It's possible for fetchers which already have an active list to keep going - 
and get memory as it is released by the mergeThread - or just get memory 
because some is available. Is this the situation which can cause the race ?
If the merge threshold is < 50%, won't there always be capacity available for 
a single mergeToMem (after the MemToDiskMerger completes), which will then 
trigger another merge? The fact that we allow a single fetch to go over the 
memory limit probably complicates this: the last fetch puts the usedMemory 
over 100%, and if the last release from the merger doesn't bring it below 100, 
everything gets stuck.
I think the same last-fetch issue applies to a merge threshold of > 50% as well.
Other than 'usedMemory' not going below the memoryLimit right after the 
InMemoryMerger completes, are there any other scenarios in which this will be 
triggered ?
If I'm not mistaken - for Tez 0.4, this would manifest as a tight loop on 
MergeManager.reserve returning a WAIT.


On the patch: 
Removing synchronization on waitForShuffleToMergeMemory leads to visibility 
issues for 'commitMemory'. This could be invoked by all Fetchers, and there's 
no guarantee on the threads reading the latest value. Also it's possible for 
the currently running merge to complete (thus reducing the commitMemory) 
between the time the commitMemory is checked and the next merge is triggered - 
which could result in a merge being triggered before hitting the memory limit.
Otherwise I think the approach works.
If the above case is correct - should the check be inside of usedMemory > 
memoryLimit ?

Another option would be to have the merger check if another merge is required 
when it completes. That gets messy though - and will likely get in the way of 
the MemToMemMerger in the future. A callback from the merge threads may be a 
better option - to keep the merge threads clean.

I was looking at the MapReduce code - that sets commitMemory to 0 the moment a 
merge starts. I don't think that fixes this particular race.
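
One way to picture the synchronized re-check being discussed (a sketch over 
the MergeManager fields named above; startMemToDiskMerge is an assumed helper, 
and this is not the attached patch):

{code}
// Sketch: keep the wait and the re-check of the merge trigger under the same
// lock, so a woken fetcher re-evaluates commitMemory/usedMemory consistently
// and can kick the merge that the in-progress merge "swallowed".
synchronized void waitForShuffleToMergeMemory() throws InterruptedException {
  while (usedMemory > memoryLimit) {
    if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold) {
      startMemToDiskMerge(); // assumed helper wrapping inMemoryMerger.startMerge(...)
    }
    wait();
  }
}
{code}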


 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk.
 - As it progresses, it releases memory segments as well (but is not yet done).
 - Fetchers who need memory < maxSingleShuffleLimit get scheduled.
 - If fetchers are fast, this quickly adds to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, committedMem & usedMem are beyond their 
 threshold, and no other fetcher threads (all are in a stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379239#comment-14379239
 ] 

Hitesh Shah commented on TEZ-1923:
--

Updated fix versions.

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.5.4, 0.6.1

 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 appropriately and does not release memory back for fetchers to proceed; e.g., 
 the debug/patch messages given below
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in memory merging is invoked under the following condition
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)
 {code}
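
The analysis above implies a widened trigger, sketched below (not the exact 
attached patch; startMemToDiskMerge is an assumed helper): also start the merge 
when usedMemory has crossed memoryLimit, even though commitMemory is still 
below mergeThreshold, since otherwise nothing ever releases memory.

{code}
// Sketch of the widened condition implied by the log analysis above:
if (!inMemoryMerger.isInProgress()
    && (commitMemory >= mergeThreshold || usedMemory > memoryLimit)) {
  startMemToDiskMerge(); // assumed helper kicking inMemoryMerger
}
{code}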
 

[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-03-24 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-1923:
-
Fix Version/s: (was: 0.7.0)
   0.6.1
   0.5.4

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.5.4, 0.6.1

 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 appropriately and does not release memory back for fetchers to proceed; e.g., 
 the debug/patch messages given below
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in memory merging is invoked under the following condition
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= 

[jira] [Updated] (TEZ-2221) VertexGroup name should be unqiue

2015-03-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2221:

Attachment: TEZ-2221-2.patch

Thanks [~hitesh]. Uploaded a new patch (addresses the issue in the unit test).

 VertexGroup name should be unqiue
 -

 Key: TEZ-2221
 URL: https://issues.apache.org/jira/browse/TEZ-2221
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2221-1.patch, TEZ-2221-2.patch


 VertexGroupCommitStartedEvent & VertexGroupCommitFinishedEvent use the vertex 
 group name to identify the vertex group commit, so vertex groups with the same 
 name will conflict. Meanwhile, the current equals & hashCode of VertexGroup 
 use both the vertex group name and the member names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-03-24 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1923:
--
Target Version/s: 0.7.0, 0.5.4, 0.6.1  (was: 0.7.0)

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.7.0

 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 appropriately and does not release memory back for fetchers to proceed; e.g., 
 the debug/patch messages given below
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory = mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 <=== InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 <=== InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in-memory merging is invoked under the following condition:
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)
 {code}
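 A rough sketch of the shape of a fix, purely illustrative (the guard and the 
 startMemToDiskMerge helper are assumptions, not the committed patch): also 
 trigger a merge when usedMemory exceeds memoryLimit, so fetchers are not left 
 waiting for a merge that will never start.
 {code}
 // Illustrative only -- not the committed TEZ-1923 change.
 // Besides the commitMemory trigger, also start a merge when total used
 // memory is over the limit while no merge is running, so that memory is
 // eventually released back to the waiting fetchers.
 if (!inMemoryMerger.isInProgress()
     && (commitMemory >= mergeThreshold || usedMemory > memoryLimit)) {
   startMemToDiskMerge(); // hypothetical helper that flushes and frees memory
 }
 {code}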
 Attaching the 

[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-03-24 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379049#comment-14379049
 ] 

Rajesh Balamohan commented on TEZ-1923:
---

committed to branch-0.6 and branch-0.5.

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.7.0

 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 appropriately and so does not release memory back for fetchers to proceed; e.g. 
 debug/patch messages are given below:
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  <=== InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 <=== InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 <=== InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in-memory merging is invoked under the following condition:
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory

[jira] [Created] (TEZ-2227) Tez UI shows empty page under IE11

2015-03-24 Thread Fengdong Yu (JIRA)
Fengdong Yu created TEZ-2227:


 Summary: Tez UI shows empty page under IE11
 Key: TEZ-2227
 URL: https://issues.apache.org/jira/browse/TEZ-2227
 Project: Apache Tez
  Issue Type: Bug
  Components: UI
Affects Versions: 0.6.0
Reporter: Fengdong Yu
Priority: Minor


Tez UI works well under Chrome and Firefox, but shows an empty page under IE11.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2227) Tez UI shows empty page under IE11

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379343#comment-14379343
 ] 

Hitesh Shah commented on TEZ-2227:
--

Thanks for filing the issue [~azuryy]. Any chance you could provide more 
details/logs (if any) from the browser console?

 

 Tez UI shows empty page under IE11
 --

 Key: TEZ-2227
 URL: https://issues.apache.org/jira/browse/TEZ-2227
 Project: Apache Tez
  Issue Type: Bug
  Components: UI
Affects Versions: 0.6.0
Reporter: Fengdong Yu
Priority: Minor

 Tez UI works well under Chrome and Firefox, but shows an empty page under IE11.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-24 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2214:
--
Attachment: TEZ-2214.3.patch


It's possible for fetchers which already have an active list to keep going - 
and get memory as it is released by the mergeThread - or just get memory 
because some is available. Is this the situation which can cause the race?


Right, this is the case. As the merge happens, memory gets released and is taken 
up by fetchers. By the time the existing merge completes, commitMemory & 
usedMemory are already beyond the allowed threshold, and this causes the issue.


Question: This same block could just as well have been placed in the 
waitForInMemoryMerge method? Essentially, any place where it could be 
triggered after a merge completes.


Yes, it is possible to move the code block to waitForInMemoryMerge(). Addressed 
it in the current patch (i.e. after inMemoryMerger.waitForMerge(), we double-check 
whether memory usage is beyond the thresholds; if so, we trigger one more merge 
and block until it is done in order to release memory).
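
A minimal sketch of that double-check, with illustrative names (this is not the 
literal patch):
{code}
// Sketch: after waiting for the in-progress merge, re-check the memory
// accounting. Fetchers may have refilled memory while that merge ran, so
// run one more blocking merge if we are again over the threshold.
void waitForInMemoryMerge() throws InterruptedException {
  inMemoryMerger.waitForMerge();
  if (commitMemory >= mergeThreshold) {
    inMemoryMerger.startMerge(inMemoryMapOutputs); // assumed API shape
    inMemoryMerger.waitForMerge();
  }
}
{code}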

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch, TEZ-2214.2.patch, TEZ-2214.3.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk
 - As it progresses, it releases memory segments as well (but not yet over).
 - Fetchers who need memory < maxSingleShuffleLimit, get scheduled. 
 - If fetchers are fast, this quickly adds up to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, committedMem & usedMem are beyond their 
 threshold and no other fetcher threads (all are in stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.
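 The stack trace above boils down to a guarded wait of roughly this shape (a 
 sketch under assumed field names, not the actual Tez source); with no merge 
 running, nothing ever frees memory or calls notify(), so the loop never exits:
 {code}
 // Sketch of the stall: fetchers block until memory is released, but only
 // an in-memory merge releases memory and none is scheduled.
 synchronized void waitForShuffleToMergeMemory() throws InterruptedException {
   while (usedMemory > memoryLimit) {
     wait(); // woken only when a merge frees memory
   }
 }
 {code}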



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Jeff Zhang (JIRA)
Jeff Zhang created TEZ-2224:
---

 Summary: EventQueue empty doesn't mean events are consumed in 
RecoveryService
 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2015-03-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377544#comment-14377544
 ] 

Jeff Zhang commented on TEZ-1909:
-

bq. It seems like the patch for this jira has been merged with fixes for a 
different jira? Can these be separated out?
Yes, I found one issue in RecoveryService when working on this jira. I have 
created TEZ-2224 to separate it, and will upload a new patch after TEZ-2224 is 
done, because the unit test depends on TEZ-2224.

 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch


 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken to handle last corrupt record 
 handling. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018

2015-03-24 Thread Prakash Ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Ramachandran updated TEZ-2047:
--
Attachment: TEZ-2047.2.patch

Corrected the value for yarn.http.policy.

Verified on 2.6 that the HTTP scheme is correct.
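
For reference, the correction likely amounts to something of this shape (a 
hedged sketch; the exact call site in WebUIService is an assumption, though the 
yarn.http.policy key and the HTTP_ONLY value are standard YARN configuration):
{code}
import org.apache.hadoop.conf.Configuration;

// Pin the AM web UI to plain HTTP so the advertised tracking URL scheme
// stays correct regardless of the cluster-wide HTTP policy.
Configuration conf = new Configuration();
conf.set("yarn.http.policy", "HTTP_ONLY");
{code}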

 Build fails against hadoop-2.2 post TEZ-2018
 

 Key: TEZ-2047
 URL: https://issues.apache.org/jira/browse/TEZ-2047
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch, TEZ-2047.2.patch


 Failed to execute goal 
 org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
 on project tez-dag: Compilation failure: Compilation failure:
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13]
  cannot find symbol
 [ERROR] symbol  : method 
 withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy)
 [ERROR] location: class 
 org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp>
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45]
  cannot find symbol
 [ERROR] symbol  : method getConnectorAddress(int)
 [ERROR] location: class org.apache.hadoop.http.HttpServer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-2047 PreCommit Build #339

2015-03-24 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2047
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/339/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2754 lines...]



{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706940/TEZ-2047.1.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/339//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/339//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
d892bfccf6b5af2779a653b4e76d79f6cc811ac8 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #338
Archived 44 artifacts
Archive block size is 32768
Received 8 blocks and 2471608 bytes
Compression is 9.6%
Took 1.7 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378254#comment-14378254
 ] 

Hadoop QA commented on TEZ-2047:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12706940/TEZ-2047.1.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/339//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/339//console

This message is automatically generated.

 Build fails against hadoop-2.2 post TEZ-2018
 

 Key: TEZ-2047
 URL: https://issues.apache.org/jira/browse/TEZ-2047
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch, TEZ-2047.2.patch


 Failed to execute goal 
 org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
 on project tez-dag: Compilation failure: Compilation failure:
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13]
  cannot find symbol
 [ERROR] symbol  : method 
 withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy)
 [ERROR] location: class 
 org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp>
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45]
  cannot find symbol
 [ERROR] symbol  : method getConnectorAddress(int)
 [ERROR] location: class org.apache.hadoop.http.HttpServer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378277#comment-14378277
 ] 

Hitesh Shah commented on TEZ-2047:
--

+1 pending pre-commit. 

 Build fails against hadoop-2.2 post TEZ-2018
 

 Key: TEZ-2047
 URL: https://issues.apache.org/jira/browse/TEZ-2047
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch, TEZ-2047.2.patch


 Failed to execute goal 
 org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
 on project tez-dag: Compilation failure: Compilation failure:
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13]
  cannot find symbol
 [ERROR] symbol  : method 
 withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy)
 [ERROR] location: class 
 org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp>
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45]
  cannot find symbol
 [ERROR] symbol  : method getConnectorAddress(int)
 [ERROR] location: class org.apache.hadoop.http.HttpServer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-986) Make conf set on DAG and vertex available in jobhistory

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378304#comment-14378304
 ] 

Hitesh Shah commented on TEZ-986:
-

[~rohini] Agreed. Just moved the target version as I am not sure if [~Sreenath] 
has had a chance to work on it and given that [~jeagles] is looking to turn 
around a 0.6.1 release soon, this jira should likely be moved out if no one 
volunteers to work on it within a short timeframe.

 Make conf set on DAG and vertex available in jobhistory
 ---

 Key: TEZ-986
 URL: https://issues.apache.org/jira/browse/TEZ-986
 Project: Apache Tez
  Issue Type: Sub-task
  Components: UI
Reporter: Rohini Palaniswamy
Priority: Blocker

 Would like to have the conf set on DAG and Vertex
   1) viewable in Tez UI after the job completes. This is very essential for 
 debugging jobs.
   2) We have processes, that parse jobconf.xml from job history (hdfs) and 
 load them into hive tables for analysis. Would like to have Tez also make all 
 the configuration (byte array) available in job history so that we can 
 similarly parse them. 1) mandates that you store it in hdfs. 2) is just to 
 say make the format stored as a contract others can rely on for parsing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2225) Remove instances of LOG.isDebugEnabled

2015-03-24 Thread Vasanth kumar RJ (JIRA)
Vasanth kumar RJ created TEZ-2225:
-

 Summary: Remove instances of LOG.isDebugEnabled
 Key: TEZ-2225
 URL: https://issues.apache.org/jira/browse/TEZ-2225
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Vasanth kumar RJ
Assignee: Vasanth kumar RJ
Priority: Minor


Remove LOG.isDebugEnabled() and use parameterized debug logging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-2214 PreCommit Build #340

2015-03-24 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-2214
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/340/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 2752 lines...]


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12705913/TEZ-2214.1.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/340//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/340//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/340//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
e5dcef324d906a7ab5d23e71b8107b372a9599a7 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #338
Archived 44 artifacts
Archive block size is 32768
Received 6 blocks and 2556109 bytes
Compression is 7.1%
Took 0.88 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-2214) FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses memToDiskMerging

2015-03-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378344#comment-14378344
 ] 

Hadoop QA commented on TEZ-2214:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12705913/TEZ-2214.1.patch
  against master revision f53942c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/340//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/340//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/340//console

This message is automatically generated.

 FetcherOrderedGrouped can get stuck indefinitely when MergeManager misses 
 memToDiskMerging
 --

 Key: TEZ-2214
 URL: https://issues.apache.org/jira/browse/TEZ-2214
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-2214.1.patch


 Scenario:
 - commitMemory & usedMemory are beyond their allowed threshold.
 - InMemoryMerge kicks off and is in the process of flushing memory contents 
 to disk
 - As it progresses, it releases memory segments as well (but not yet over).
 - Fetchers who need memory < maxSingleShuffleLimit, get scheduled. 
 - If fetchers are fast, this quickly adds up to commitMemory & usedMemory. 
 Since InMemoryMerge is already in progress, this wouldn't trigger another 
 merge().
 - Pretty soon all fetchers would be stalled and get into the following state.
 {noformat}
 Thread 9351: (state = BLOCKED)
  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
 imprecise)
  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.waitForShuffleToMergeMemory()
  @bci=17, line=337 (Interpreted frame)
  - 
 org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run()
  @bci=34, line=157 (Interpreted frame)
 {noformat}
 - Even if InMemoryMerger completes, committedMem & usedMem are beyond their 
 threshold and no other fetcher threads (all are in stalled state) are there 
 to release memory. This causes fetchers to wait indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378156#comment-14378156
 ] 

Hitesh Shah commented on TEZ-2047:
--

The config value being set seems wrong. 

 Build fails against hadoop-2.2 post TEZ-2018
 

 Key: TEZ-2047
 URL: https://issues.apache.org/jira/browse/TEZ-2047
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch


 Failed to execute goal 
 org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
 on project tez-dag: Compilation failure: Compilation failure:
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13]
  cannot find symbol
 [ERROR] symbol  : method 
 withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy)
 [ERROR] location: class 
 org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp>
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45]
  cannot find symbol
 [ERROR] symbol  : method getConnectorAddress(int)
 [ERROR] location: class org.apache.hadoop.http.HttpServer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378199#comment-14378199
 ] 

Hitesh Shah commented on TEZ-2205:
--

bq. For jobs launched through Oozie it is easy to turn off ATS via Oozie server 
side setting

Could you clarify a bit more on how this is being done? 

[~jeagles] [~zjshen] [~lichangleo] If (3) is the approach that works the best 
in Yahoo environments, should the eventual fix be in YARN given that other 
applications will face the same issue i.e. option (2)? i.e. if yarn timeline 
enabled flag is set to false, all the relevant yarn client and timeline client 
libs should automatically not make calls to the server related to timeline 
(i.e. do not retrieve delegation tokens, treat all calls as no-op)?

 






 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS, but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS, 
 but it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2047) Build fails against hadoop-2.2 post TEZ-2018

2015-03-24 Thread Prakash Ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Ramachandran updated TEZ-2047:
--
Attachment: TEZ-2047.1.patch

Thanks Hitesh, 
* rebased the patch
* set yarn.http.policy on the conf to use HTTP_ONLY

[~hitesh] can you have a look?

 Build fails against hadoop-2.2 post TEZ-2018
 

 Key: TEZ-2047
 URL: https://issues.apache.org/jira/browse/TEZ-2047
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Prakash Ramachandran
Priority: Blocker
 Attachments: TEZ-2047.1.patch, TEZ-2047.1.patch


 Failed to execute goal 
 org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
 on project tez-dag: Compilation failure: Compilation failure:
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[85,13]
  cannot find symbol
 [ERROR] symbol  : method 
 withHttpPolicy(org.apache.hadoop.conf.Configuration,org.apache.hadoop.http.HttpConfig.Policy)
 [ERROR] location: class 
 org.apache.hadoop.yarn.webapp.WebApps.Builder<org.apache.tez.dag.app.web.WebUIService.TezAMWebApp>
 [ERROR] 
 /home/jenkins/jenkins-slave/workspace/Tez-Build-Hadoop-2.2/tez-dag/src/main/java/org/apache/tez/dag/app/web/WebUIService.java:[87,45]
  cannot find symbol
 [ERROR] symbol  : method getConnectorAddress(int)
 [ERROR] location: class org.apache.hadoop.http.HttpServer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378199#comment-14378199
 ] 

Hitesh Shah edited comment on TEZ-2205 at 3/24/15 5:27 PM:
---

bq. For jobs launched through Oozie it is easy to turn off ATS via Oozie server 
side setting

Could you clarify a bit more on how this is being done? 

[~jeagles] [~zjshen] [~lichangleo] If (3) is the approach that works the best 
in Yahoo environments, should the eventual fix be in YARN given that other 
applications will face the same issue? i.e. if yarn timeline enabled flag is 
set to false, all the relevant yarn client and timeline client libs should 
automatically not make calls to the server related to timeline (i.e. do not 
retrieve delegation tokens, treat all calls as no-op)?

 







was (Author: hitesh):
bq. For jobs launched through Oozie it is easy to turn off ATS via Oozie server 
side setting

Could you clarify a bit more on how this is being done? 

[~jeagles] [~zjshen] [~lichangleo] If (3) is the approach that works the best 
in Yahoo environments, should the eventual fix be in YARN given that other 
applications will face the same issue i.e. option (2)? i.e. if yarn timeline 
enabled flag is set to false, all the relevant yarn client and timeline client 
libs should automatically not make calls to the server related to timeline 
(i.e. do not retrieve delegation tokens, treat all calls as no-op)?

 






 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS, but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS, 
 but it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled

2015-03-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378401#comment-14378401
 ] 

Bikas Saha commented on TEZ-2225:
-

This is needed to reduce cruft from the code. There is an alternate log format 
via slf4j that does not have the penalty, without the need for the extra if 
statements everywhere.
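
To make the trade-off concrete, a small self-contained example of the two 
styles (the class is illustrative; the slf4j API shown is real):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogStyle {
  private static final Logger LOG = LoggerFactory.getLogger(LogStyle.class);

  void example(int taskId, long bytes) {
    // Guarded style: the explicit check avoids string concatenation when
    // debug logging is off, at the cost of an if at every call site.
    if (LOG.isDebugEnabled()) {
      LOG.debug("task " + taskId + " shuffled " + bytes + " bytes");
    }
    // Parameterized style: arguments are only formatted when debug is
    // enabled, so the guard is unnecessary for simple messages.
    LOG.debug("task {} shuffled {} bytes", taskId, bytes);
  }
}
{code}
The guard still pays off when computing an argument is itself expensive; for 
plain formatting the parameterized form is equivalent and shorter.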

 Remove instances of LOG.isDebugEnabled
 --

 Key: TEZ-2225
 URL: https://issues.apache.org/jira/browse/TEZ-2225
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Vasanth kumar RJ
Assignee: Vasanth kumar RJ
Priority: Minor
  Labels: performance

 Remove LOG.isDebugEnabled() and use parameterized debug logging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378369#comment-14378369
 ] 

Hitesh Shah commented on TEZ-2225:
--

Is this really needed? Doesn't the slf4j doc mention that if 
(LOG.isDebugEnabled()) is one way to reduce the perf penalty? 

 Remove instances of LOG.isDebugEnabled
 --

 Key: TEZ-2225
 URL: https://issues.apache.org/jira/browse/TEZ-2225
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Vasanth kumar RJ
Assignee: Vasanth kumar RJ
Priority: Minor
  Labels: performance

 Remove LOG.isDebugEnabled() and use parameterized debug logging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2225) Remove instances of LOG.isDebugEnabled

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378370#comment-14378370
 ] 

Hitesh Shah commented on TEZ-2225:
--

\cc [~bikassaha] [~sseth]

 Remove instances of LOG.isDebugEnabled
 --

 Key: TEZ-2225
 URL: https://issues.apache.org/jira/browse/TEZ-2225
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Vasanth kumar RJ
Assignee: Vasanth kumar RJ
Priority: Minor
  Labels: performance

 Remove LOG.isDebugEnabled() and use parameterized debug logging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377401#comment-14377401
 ] 

Bikas Saha edited comment on TEZ-714 at 3/24/15 6:54 AM:
-

bq. It could, but this may make the transition complicated. Currently we need 
to differentiate these 2 kinds of commits; besides, there's 2 possible states 
(RUNNING, COMMITTING) when the commit happens, and we also need to check and 
handle 2 different cases (commit succeeded & failure), so there would be totally 8 
different cases in one transition which may be difficult to read.
I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state 
transitions as inspiration. There are some standard things to do when a commit 
operation completes, e.g. decrement the outstanding commit counter. If the commit 
was a group commit then write the recovery entry for it. If the commit fails 
then set a flag to abort. This can be in a base transition say 
CommitCompletedTransition. Then we can have 
CommitCompletedWhileRunningTransition that calls the base for common code and 
does running-specific stuff, e.g. trigger job failure upon commit failure. And 
another transition for CommitCompletedWhileCommitting that just waits for the 
commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits 
for all commit operations to complete and then calls abort (this could be 
blocking for now). This way we can separate things while still keeping the 
transitions essentially linear. Instead of multiplying the possibilities by (2 
commit types x 3 states x 2 commit results)
Perhaps, all commit events need to have a shared boolean that they should check 
before invoking commit. This boolean could be set to false when the vertex/dag 
decides to abort. This would make any pending commit operations complete 
quickly instead of trying to commit unnecessarily.
Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. 
Create custom committers that fail/pass as desired and check that the dag 
behaved as expected.
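
A rough, self-contained sketch of that layering (every name below is either 
taken from this comment or invented for illustration; this is not the actual 
Tez state machine):
{code}
// Common bookkeeping shared by every "commit completed" transition.
class CommitContext {
  int outstandingCommits;
  boolean shouldAbort;
  void writeGroupCommitRecoveryEvent() { /* persist recovery marker */ }
  void failDag() { /* trigger DAG failure */ }
}

class CommitCompletedTransition {
  void transition(CommitContext ctx, boolean groupCommit, boolean failed) {
    ctx.outstandingCommits--;           // always decrement the counter
    if (groupCommit) ctx.writeGroupCommitRecoveryEvent();
    if (failed) ctx.shouldAbort = true; // record now, act per state later
  }
}

// RUNNING-specific behavior layered on the common code: a failed commit
// fails the DAG immediately.
class CommitCompletedWhileRunningTransition extends CommitCompletedTransition {
  @Override
  void transition(CommitContext ctx, boolean groupCommit, boolean failed) {
    super.transition(ctx, groupCommit, failed);
    if (ctx.shouldAbort) ctx.failDag();
  }
}
{code}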


was (Author: bikassaha):
bq. It could, but this may make the transition complicated. Currently we need 
to differentiate these 2 kinds of commits; besides, there's 2 possible states 
(RUNNING, COMMITTING) when the commit happens, and we also need to check and 
handle 2 different cases (commit succeeded & failure), so there would be totally 8 
different cases in one transition which may be difficult to read.
I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state 
transitions as inspiration. There are some standard things to do when a commit 
operation completes, e.g. decrement the outstanding commit counter. If the commit 
was a group commit then write the recovery entry for it. If the commit fails 
then set a flag to abort. This can be in a base transition say 
CommitCompletedTransition. Then we can have 
CommitCompletedWhileRunningTransition that calls the base for common code and 
does running-specific stuff, e.g. trigger job failure upon commit failure. And 
another transition for CommitCompletedWhileCommitting that just waits for the 
commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits 
for all commit operations to complete and then calls abort (this could be 
blocking for now). 
Perhaps, all commit events need to have a shared boolean that they should check 
before invoking commit. This boolean could be set to false when the vertex/dag 
decides to abort. This would make any pending commit operations complete 
quickly instead of trying to commit unnecessarily.
Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. 
Create custom committers that fail/pass as desired and check that the dag 
behaved as expected.

 OutputCommitters should not run in the main AM dispatcher thread
 

 Key: TEZ-714
 URL: https://issues.apache.org/jira/browse/TEZ-714
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Jeff Zhang
Priority: Critical
 Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf


 Follow up jira from TEZ-41.
 1) If there's multiple OutputCommitters on a Vertex, they can be run in 
 parallel.
 2) Running an OutputCommitter in the main thread blocks all other event 
 handling, w.r.t the DAG, and causes the event queue to back up.
 3) This should also cover shared commits that happen in the DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-714) OutputCommitters should not run in the main AM dispatcher thread

2015-03-24 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377401#comment-14377401
 ] 

Bikas Saha commented on TEZ-714:


bq. It could, but this may make the transition complicated. Currently we need 
to differentiate these 2 kinds of commits; besides, there's 2 possible states 
(RUNNING, COMMITTING) when the commit happens, and we also need to check and 
handle 2 different cases (commit succeeded & failure), so there would be totally 8 
different cases in one transition which may be difficult to read.
I am looking at TaskAttemptImpl#TerminatedBeforeRunningTransition state 
transitions as inspiration. There are some standard things to do when a commit 
operation completes, e.g. decrement the outstanding commit counter. If the commit 
was a group commit then write the recovery entry for it. If the commit fails 
then set a flag to abort. This can be in a base transition say 
CommitCompletedTransition. Then we can have 
CommitCompletedWhileRunningTransition that calls the base for common code and 
does running-specific stuff, e.g. trigger job failure upon commit failure. And 
another transition for CommitCompletedWhileCommitting that just waits for the 
commit counter to drop to 0. Next, CommitCompletedWhileTerminating which waits 
for all commit operations to complete and then calls abort (this could be 
blocking for now). 
Perhaps, all commit events need to have a shared boolean that they should check 
before invoking commit. This boolean could be set to false when the vertex/dag 
decides to abort. This would make any pending commit operations complete 
quickly instead of trying to commit unnecessarily.
Some e2e scenarios could be tested via simulation using the MockDAGAppMaster. 
Create custom committers that fail/pass as desired and check that the dag 
behaved as expected.

 OutputCommitters should not run in the main AM dispatcher thread
 

 Key: TEZ-714
 URL: https://issues.apache.org/jira/browse/TEZ-714
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Jeff Zhang
Priority: Critical
 Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch, Vertex_2.pdf


 Follow up jira from TEZ-41.
 1) If there's multiple OutputCommitters on a Vertex, they can be run in 
 parallel.
 2) Running an OutputCommitter in the main thread blocks all other event 
 handling, w.r.t the DAG, and causes the event queue to back up.
 3) This should also cover shared commits that happen in the DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378536#comment-14378536
 ] 

Hitesh Shah edited comment on TEZ-2205 at 3/24/15 8:31 PM:
---

[~lichangleo] In the end, someone has to check that config and make some 
choices on how to act based on the configured value :). In an ideal world, the 
yarn config would be checked and enforced by yarn libraries and not by yarn 
applications but if it comes to it, we can make the change in Tez to handle 
this config. Also, [~lichangleo], based on [~rohini]'s comment, if YARN does 
not have the hollow class, this would imply that there needs to be a hollow 
implementation in Tez.

i.e. option 3's impl would be: 
   - check yarn timeline enabled flag and make the relevant classes that use 
ATS to be a no-op. 
   - Add a log.warn if ATS is configured but yarn timeline is disabled. 

[~zjshen] [~jeagles] comments on this? 

It would be good to try and get a final consensus on whether we enforce the 
yarn-specific flag in YARN or in the Application. Based on this, we can unblock 
[~lichangleo] to be able to make the changes.
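
For concreteness, option 3's guard could look roughly like this (a sketch, not 
the eventual patch; the YarnConfiguration constants are real, the surrounding 
class is invented):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ATSHistoryGuard {
  private static final Logger LOG =
      LoggerFactory.getLogger(ATSHistoryGuard.class);

  void maybePost(Configuration conf, Object historyEvent) {
    boolean enabled = conf.getBoolean(
        YarnConfiguration.TIMELINE_SERVICE_ENABLED,
        YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED);
    if (!enabled) {
      // ATS logging configured but the YARN flag is off: warn and no-op.
      LOG.warn("ATS history logging requested but"
          + " yarn.timeline-service.enabled=false; dropping event");
      return;
    }
    // ... otherwise post historyEvent to the timeline server ...
  }
}
{code}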

  




 



was (Author: hitesh):
[~lichangleo] In the end, someone has to check that config and make some 
choices on how to act based on the configured value :). In an ideal world, the 
yarn config would be checked and enforced by yarn libraries and not by yarn 
applications. Also, [~lichangleo], based on [~rohini]'s comment, if YARN does 
not have the hollow class, this would imply that there needs to be a hollow 
implementation in Tez.

i.e. option 3's impl would be: 
   - check yarn timeline enabled flag and make the relevant classes that use 
ATS to be a no-op. 
   - Add a log.warn if ATS is configured but yarn timeline is disabled. 

[~zjshen] [~jeagles] comments on this? 

It would be good to try and get a final consensus on whether we enforce the 
yarn-specific flag in YARN or in the Application. Based on this, we can unblock 
[~lichangleo] to be able to make the changes.

  




 


 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS, but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS, 
 but it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2204) TestAMRecovery increasingly flaky on jenkins builds.

2015-03-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-2204:

Attachment: TEZ-2204-5.patch

Uploaded a new patch with a minor change (added one more log).

 TestAMRecovery increasingly flaky on jenkins builds. 
 -

 Key: TEZ-2204
 URL: https://issues.apache.org/jira/browse/TEZ-2204
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Fix For: 0.7.0

 Attachments: TEZ-2204-1.patch, TEZ-2204-2.patch, TEZ-2204-3.patch, 
 TEZ-2204-4.patch, TEZ-2204-5.patch


 In recent pre-commit builds and daily builds, there seem to have been some 
 occurrences of TestAMRecovery failing or timing out. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378230#comment-14378230
 ] 

Hitesh Shah commented on TEZ-2224:
--

Also, TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT being false 
by default is probably wrong. For a real-world scenario, as many pending events 
as have been seen and can be processed should be processed. 



 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher.
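 A minimal sketch of the AsyncDispatcher-style fix implied here (all names 
 illustrative): keep a drained flag that only turns true after the dequeued 
 event has been fully handled, not merely when the queue is empty.
 {code}
 import java.util.concurrent.LinkedBlockingQueue;

 class RecoveryEventLoop implements Runnable {
   private final LinkedBlockingQueue<Object> eventQueue =
       new LinkedBlockingQueue<>();
   private volatile boolean drained = true;

   public void run() {
     try {
       while (!Thread.currentThread().isInterrupted()) {
         Object event = eventQueue.take();
         drained = false;                // an event is now in flight
         handle(event);                  // e.g. write it to the recovery log
         drained = eventQueue.isEmpty(); // drained only after handling
       }
     } catch (InterruptedException e) {
       Thread.currentThread().interrupt();
     }
   }

   // stop() should wait on this, not on queue emptiness alone.
   void awaitDrained() throws InterruptedException {
     while (!drained) Thread.sleep(10);
   }

   private void handle(Object event) { /* process the recovery event */ }
 }
 {code}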



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2205) Tez still tries to post to ATS when yarn.timeline-service.enabled=false

2015-03-24 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378245#comment-14378245
 ] 

Rohini Palaniswamy commented on TEZ-2205:
-

 OOZIE-2133 is the one that handles getting delegation tokens for ATS for Tez 
jobs. If oozie.action.launcher.yarn.timeline-service.enabled is set to true in 
the Oozie server configuration, it adds yarn.timeline-service.enabled=true to 
the conf of the JobClient that submits the launcher job, if tez-site.xml is part 
of the distributed cache. JobClient (YARN) fetches the ATS delegation token, if 
that setting is set, before the job is submitted, and adds it to the job.
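
In code terms, that flow is roughly (a hedged sketch; the config key names are 
from the comment above, the method itself is invented):
{code}
import org.apache.hadoop.conf.Configuration;

class LauncherTimelineSetup {
  // Mirror the Oozie server-side flag into the launcher job conf so that
  // YARN fetches an ATS delegation token when the launcher is submitted.
  static void configure(Configuration oozieServerConf,
      Configuration launcherConf, boolean tezSiteOnDistCache) {
    if (tezSiteOnDistCache && oozieServerConf.getBoolean(
        "oozie.action.launcher.yarn.timeline-service.enabled", false)) {
      launcherConf.setBoolean("yarn.timeline-service.enabled", true);
    }
  }
}
{code}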

 Tez still tries to post to ATS when yarn.timeline-service.enabled=false
 ---

 Key: TEZ-2205
 URL: https://issues.apache.org/jira/browse/TEZ-2205
 Project: Apache Tez
  Issue Type: Sub-task
Affects Versions: 0.6.1
Reporter: Chang Li
Assignee: Chang Li
 Attachments: TEZ-2205.wip.patch


 When yarn.timeline-service.enabled=false is set, Tez still tries posting to 
 ATS, but hits an error as the token is not found. It does not fail the job, 
 because of the fix to not fail the job when there is an error posting to ATS, 
 but it should not be trying to post to ATS in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2224) EventQueue empty doesn't mean events are consumed in RecoveryService

2015-03-24 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378217#comment-14378217
 ] 

Hitesh Shah commented on TEZ-2224:
--

Is there a reason why we want to prevent new events from being processed on a 
shutdown? 

 EventQueue empty doesn't mean events are consumed in RecoveryService
 

 Key: TEZ-2224
 URL: https://issues.apache.org/jira/browse/TEZ-2224
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: TEZ-2224-1.patch


 If the event queue is empty, an event may still be being processed. Should fix 
 it like AsyncDispatcher.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)