[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542750#comment-13542750
 ] 

Bikas Saha commented on MAPREDUCE-4819:
---

In general, you might want to rename some of the new identifiers, such as 
"justShutDown" or "EventEater". I also feel that the change in 
MRAppMaster.init() would benefit from some refactoring.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542748#comment-13542748
 ] 

Bikas Saha commented on MAPREDUCE-4819:
---

It looks like, after the recent changes in JobImpl and with the current 
alternative approach, my original fix for not rerunning the job no longer 
applies. I think you would still want to take the change in my patch that adds 
the jobid to the history staging dir. Since the staging dir is not deleted 
during job history flushing, I had observed that if I made my AM crash (by 
putting an exit(1) in shutdownJob()), the history files would get orphaned and 
not cleaned up, or something like that. To fix that I had to add the jobid to 
the path.
Snippet from my patch:
{code}
+++ hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/jobhistory/JobHistoryUtils.java
@@ -186,10 +186,11 @@ public static PathFilter getHistoryFileFilter() {
    * @return A string representation of the prefix.
    */
   public static String
-  getConfiguredHistoryStagingDirPrefix(Configuration conf)
+  getConfiguredHistoryStagingDirPrefix(Configuration conf, String jobId)
       throws IOException {
     String user = UserGroupInformation.getCurrentUser().getShortUserName();
-    Path path = MRApps.getStagingAreaDir(conf, user);
+    Path stagingPath = MRApps.getStagingAreaDir(conf, user);
+    Path path = new Path(stagingPath, jobId);
     String logDir = path.toString();
     return logDir;
   }
{code}
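
As a rough illustration of why the job id matters in that path, here is a minimal sketch in plain Java. It does not use the Hadoop classes; the class name, method names and the `<root>/<user>/.staging` layout are assumptions made for the example only.

```java
/** Hypothetical stand-in for the staging prefix logic; not Hadoop code. */
class HistoryStagingSketch {

    /** Before the change: all jobs of one user share a single prefix (assumed layout). */
    static String sharedPrefix(String stagingRoot, String user) {
        return stagingRoot + "/" + user + "/.staging";
    }

    /** After the change: the job id is appended, so each job gets its own directory. */
    static String perJobPrefix(String stagingRoot, String user, String jobId) {
        return sharedPrefix(stagingRoot, user) + "/" + jobId;
    }

    public static void main(String[] args) {
        // Two jobs of the same user no longer write history files into the same place.
        System.out.println(perJobPrefix("/tmp/staging", "alice", "job_1_0001"));
        System.out.println(perJobPrefix("/tmp/staging", "alice", "job_1_0002"));
    }
}
```

With a per-job prefix, whatever cleans up a job's staging area can also pick up history files left behind by a crashed attempt, since they sit under that job's own directory.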


For the patch itself I have a few comments.

Why not end in success if the staging dir was cleaned up by the last attempt? I 
am guessing that this code won't be necessary after we move the unregister with 
the RM to before the staging dir cleanup in MAPREDUCE-4841, right?
{code}
+      if(!stagingExists) {
+        copyHistory = false;
+        isLastAMRetry = true;
+        justShutDown = true;
+        shouldNotify = false;
+        forcedState = JobStateInternal.ERROR;
+        shutDownMessage = "Staging dir does not exist " + stagingDir;
+        LOG.fatal(shutDownMessage);
{code}

Why are we only eating/ignoring the JobEvents in the dispatcher? So that the 
JobImpl state machine is not triggered?
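
To make the question concrete, here is a minimal sketch of what "eating" one event type in a dispatcher could look like. The dispatcher class, the event types and the string payloads are all hypothetical; this is not the YARN dispatcher API.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical miniature dispatcher; names are illustrative only. */
class EventEaterSketch {
    enum EventType { JOB, TASK }

    private final Map<EventType, Consumer<String>> handlers = new EnumMap<>(EventType.class);

    void register(EventType type, Consumer<String> handler) {
        handlers.put(type, handler);
    }

    void dispatch(EventType type, String event) {
        Consumer<String> handler = handlers.get(type);
        if (handler == null) {
            throw new IllegalStateException("No handler registered for " + type);
        }
        handler.accept(event);
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        EventEaterSketch dispatcher = new EventEaterSketch();
        // The "event eater": job events are accepted and dropped, so no job
        // state machine would ever be driven by them.
        dispatcher.register(EventType.JOB, event -> { });
        dispatcher.register(EventType.TASK, seen::add);
        dispatcher.dispatch(EventType.JOB, "JOB_INIT");
        dispatcher.dispatch(EventType.TASK, "T_SCHEDULE");
        System.out.println(seen);
    }
}
```

In this sketch only the job handler is replaced by a no-op, which is one way to keep a job state machine idle while other event types are still delivered.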

This might be a question of personal preference, but I think an explicit 
transition from INIT to the final state is cleaner than overriding the state in 
the getter.
{code}
   public JobStateInternal getInternalState() {
     readLock.lock();
     try {
+      if(forcedState != null) {
+        return forcedState;
+      }
{code}
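
For comparison, a minimal sketch of the explicit-transition alternative. The class, the state names and the `failBeforeStart` method are hypothetical and much simpler than JobImpl; the point is only that the stored state changes, so the getter needs no special case.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical miniature state holder; not the JobImpl state machine. */
class JobStateSketch {
    enum State { NEW, INITED, RUNNING, SUCCEEDED, ERROR }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private State state = State.NEW;
    private String diagnostics = "";

    /** Explicit transition from an initial state straight to a final state. */
    void failBeforeStart(String reason) {
        lock.writeLock().lock();
        try {
            if (state != State.NEW && state != State.INITED) {
                throw new IllegalStateException("Cannot abort from " + state);
            }
            state = State.ERROR;
            diagnostics = reason;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Plain getter: returns the stored state, with no override hidden inside. */
    State getInternalState() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }

    String getDiagnostics() {
        lock.readLock().lock();
        try {
            return diagnostics;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because the transition is recorded in one place, anything that inspects the state (counters, listeners, tests) sees the same value the getter reports.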

I didn't quite get this in HistoryFileManager.java. It looks like it is related 
to a recent change in that code.
{code}
+      } else if (old != null && !old.isMovePending()) {
+        //This is a duplicate so just delete it
+        fileInfo.delete();
       }
{code}

Typo in the exception message:
{code}
+throw new Exception("No handler for regitered for " + type);
+  }
{code}



> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Created] (MAPREDUCE-4911) Add node-level aggregation flag feature(setLocalAggregation(boolean)) to JobConf

2013-01-02 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created MAPREDUCE-4911:
-

 Summary: Add node-level aggregation flag 
feature(setLocalAggregation(boolean)) to JobConf
 Key: MAPREDUCE-4911
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4911
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: client
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA


This JIRA adds a node-level aggregation flag feature 
(setLocalAggregation(boolean)) to JobConf.
This task is a subtask of MAPREDUCE-4502.



[jira] [Created] (MAPREDUCE-4910) Adding AggregationWaitMap to some components(MRAppMaster, TaskAttemptListener, JobImpl, MapTaskImpl).

2013-01-02 Thread Tsuyoshi OZAWA (JIRA)
Tsuyoshi OZAWA created MAPREDUCE-4910:
-

 Summary: Adding AggregationWaitMap to some components(MRAppMaster, 
TaskAttemptListener, JobImpl, MapTaskImpl).
 Key: MAPREDUCE-4910
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4910
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2, task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA


To implement MR-4502, AggregationWaitMap needs to be used by some 
components (MRAppMaster, TaskAttemptListener, JobImpl, MapTaskImpl).



[jira] [Commented] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1

2013-01-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542687#comment-13542687
 ] 

Junping Du commented on MAPREDUCE-4904:
---

Sure. Thanks, Luke, for the comments.
If localityLevel == 2 and NodeGroup is not in use, the task should be counted 
in OTHER_LOCAL_MAPS (it should fall through to the "default" case below and be 
handled there, rather than breaking out). This tiny patch fixes the issue.
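
A minimal sketch of the fall-through described above. Only the rule for level 2 without NodeGroup comes from this comment; the enum, the method name and the counters chosen for the other cases are assumptions for illustration and are not taken from the patch.

```java
/** Hypothetical locality counter selection; not the branch-1 JobInProgress code. */
class LocalityCounterSketch {
    enum Counter { DATA_LOCAL_MAPS, NODEGROUP_LOCAL_MAPS, RACK_LOCAL_MAPS, OTHER_LOCAL_MAPS }

    static Counter counterFor(int localityLevel, boolean nodeGroupAware) {
        switch (localityLevel) {
            case 0:
                return Counter.DATA_LOCAL_MAPS;
            case 1:
                // Assumed: with NodeGroup, level 1 is the node group; otherwise the rack.
                return nodeGroupAware ? Counter.NODEGROUP_LOCAL_MAPS : Counter.RACK_LOCAL_MAPS;
            case 2:
                if (nodeGroupAware) {
                    return Counter.RACK_LOCAL_MAPS; // assumed mapping for the NodeGroup case
                }
                // Without NodeGroup: fall through so the task is counted below
                // instead of breaking out of the switch uncounted.
            default:
                return Counter.OTHER_LOCAL_MAPS;
        }
    }

    public static void main(String[] args) {
        System.out.println(counterFor(2, false));
    }
}
```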

> TestMultipleLevelCaching failed in barnch-1
> ---
>
> Key: MAPREDUCE-4904
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.2.0
>Reporter: meng gong
>Assignee: meng gong
> Fix For: 1.2.0
>
> Attachments: MAPREDUCE-4904.patch
>
>
> TestMultipleLevelCaching fails:
> {noformat}
> Testcase: testMultiLevelCaching took 30.406 sec
> FAILED
> Number of local maps expected:<0> but was:<1>
> junit.framework.AssertionFailedError: Number of local maps expected:<0> but 
> was:<1>
> at 
> org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
> at 
> org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
> at 
> org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
> {noformat}



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Attachment: MAPREDUCE-4909.patch

[~sureshms] Removed the Windows-specific comment.

HADOOP-9176 was filed to address the root cause.

Thanks!
Arpit

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> 
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1-win
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, 
> MAPREDUCE-4909.patch
>
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
> appears to be a failure to delete in-use files via LocalFileSystem.delete 
> (RawLocalFileSystem.delete).



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Description: TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. 
The root cause appears to be a failure to delete in-use files via 
LocalFileSystem.delete (RawLocalFileSystem.delete).  (was: 
TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
appears to be a failure to delete in-use files on Windows via 
LocalFileSystem.delete (RawLocalFileSystem.delete).)

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> 
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1-win
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch
>
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
> appears to be a failure to delete in-use files via LocalFileSystem.delete 
> (RawLocalFileSystem.delete).



[jira] [Commented] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542583#comment-13542583
 ] 

Hadoop QA commented on MAPREDUCE-2217:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12562996/MR-2217.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3187//console

This message is automatically generated.

> The expire launching task should cover the UNASSIGNED task
> --
>
> Key: MAPREDUCE-2217
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Affects Versions: 0.23.0, 1.1.1
>Reporter: Scott Chen
>Assignee: Karthik Kambatla
> Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, 
> MR-2217.patch, MR-2217.patch
>
>
> The ExpireLaunchingTask thread kills tasks that are scheduled but have not 
> responded.
> Currently, if a task is scheduled on a tasktracker and for some reason the 
> tasktracker cannot move it to RUNNING, the task will just hang in the 
> UNASSIGNED status and the JobTracker will keep waiting for it.
> JobTracker.ExpireLaunchingTask should be able to kill this task.



[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated MAPREDUCE-2217:


Fix Version/s: (was: 0.24.0)
Affects Version/s: 1.1.1
   Status: Patch Available  (was: Open)

> The expire launching task should cover the UNASSIGNED task
> --
>
> Key: MAPREDUCE-2217
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Affects Versions: 1.1.1, 0.23.0
>Reporter: Scott Chen
>Assignee: Karthik Kambatla
> Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, 
> MR-2217.patch, MR-2217.patch
>
>
> The ExpireLaunchingTask thread kills tasks that are scheduled but have not 
> responded.
> Currently, if a task is scheduled on a tasktracker and for some reason the 
> tasktracker cannot move it to RUNNING, the task will just hang in the 
> UNASSIGNED status and the JobTracker will keep waiting for it.
> JobTracker.ExpireLaunchingTask should be able to kill this task.



[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated MAPREDUCE-2217:


Attachment: MR-2217.patch

Re-uploading the patch for Jenkins sanity.

> The expire launching task should cover the UNASSIGNED task
> --
>
> Key: MAPREDUCE-2217
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Affects Versions: 0.23.0
>Reporter: Scott Chen
>Assignee: Karthik Kambatla
> Fix For: 0.24.0
>
> Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, 
> MR-2217.patch, MR-2217.patch
>
>
> The ExpireLaunchingTask thread kills tasks that are scheduled but have not 
> responded.
> Currently, if a task is scheduled on a tasktracker and for some reason the 
> tasktracker cannot move it to RUNNING, the task will just hang in the 
> UNASSIGNED status and the JobTracker will keep waiting for it.
> JobTracker.ExpireLaunchingTask should be able to kill this task.



[jira] [Commented] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542570#comment-13542570
 ] 

Karthik Kambatla commented on MAPREDUCE-2217:
-

The patch posted on 16/Nov fixes the issue.

To verify this, I ran a Hadoop cluster of 4 nodes with both MR-2217.patch and 
expose-bug-mr-2217.patch applied. The tasks assigned to machine01 time out, are 
subsequently scheduled on other nodes, and the job completes. Without 
MR-2217.patch, the job doesn't progress even after an hour. I used the pi job 
with 8 mappers and 1000 input splits for this.
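
The behaviour being verified can be sketched roughly as follows. This is a simplified, hypothetical model (the class, its methods and the explicit timestamps are assumptions), not the JobTracker.ExpireLaunchingTask implementation: tasks that were handed out but never reported RUNNING are expired after a timeout, so they can be rescheduled elsewhere.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of launch expiry that also covers UNASSIGNED tasks. */
class LaunchExpirySketch {
    private final long timeoutMs;
    // Tasks handed to a tracker that have not reported RUNNING yet (still UNASSIGNED).
    private final Map<String, Long> unassignedSince = new LinkedHashMap<>();

    LaunchExpirySketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void scheduled(String taskId, long nowMs) {
        unassignedSince.put(taskId, nowMs);
    }

    void reportedRunning(String taskId) {
        unassignedSince.remove(taskId);
    }

    /** Returns the tasks whose launch timed out and stops tracking them. */
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = unassignedSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > timeoutMs) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A caller would invoke `expire` periodically and hand the returned task ids back to the scheduler, which matches the observation above that the stuck tasks end up on other nodes.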

> The expire launching task should cover the UNASSIGNED task
> --
>
> Key: MAPREDUCE-2217
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Affects Versions: 0.23.0
>Reporter: Scott Chen
>Assignee: Karthik Kambatla
> Fix For: 0.24.0
>
> Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, 
> MR-2217.patch
>
>
> The ExpireLaunchingTask thread kills tasks that are scheduled but have not 
> responded.
> Currently, if a task is scheduled on a tasktracker and for some reason the 
> tasktracker cannot move it to RUNNING, the task will just hang in the 
> UNASSIGNED status and the JobTracker will keep waiting for it.
> JobTracker.ExpireLaunchingTask should be able to kill this task.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542567#comment-13542567
 ] 

Hadoop QA commented on MAPREDUCE-4819:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12562963/MR-4819-bobby-trunk.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 2015 javac 
compiler warnings (more than the trunk's current 2014 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common:

  
org.apache.hadoop.mapreduce.v2.app.commit.TestCommitterEventHandler

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//console

This message is automatically generated.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542561#comment-13542561
 ] 

Hadoop QA commented on MAPREDUCE-4832:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12562975/MAPREDUCE-4832.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3185//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3185//console

This message is automatically generated.

> MR AM can get in a split brain situation
> 
>
> Key: MAPREDUCE-4832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 2.0.2-alpha, 0.23.5
>Reporter: Robert Joseph Evans
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM it 
> could try to commit either tasks or jobs.  This could cause lots of problems 
> where the second AM finishes and tries to commit too.  This could result in 
> data corruption.  



[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542542#comment-13542542
 ] 

Suresh Srinivas commented on MAPREDUCE-4909:


bq. One minor thing, it might be good to add a "TODO" in the code comments as a 
reminder that the root cause is still under investigation.
The goal of this test is not to delete a file that is in use, so a TODO seems 
unnecessary.

[~arpitagarwal] Also, the Windows-related comments seem inappropriate. Can a 
separate jira be created, related to this, to track deletion of a file that is 
in use? I think there might already be some jiras tracking this for Windows.

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> 
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1-win
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch
>
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
> appears to be a failure to delete in-use files on Windows via 
> LocalFileSystem.delete (RawLocalFileSystem.delete).



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Attachment: MAPREDUCE-4909.patch

Added TODO.

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> 
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1-win
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch
>
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
> appears to be a failure to delete in-use files on Windows via 
> LocalFileSystem.delete (RawLocalFileSystem.delete).



[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542536#comment-13542536
 ] 

Brandon Li commented on MAPREDUCE-4909:
---

+1, the patch looks good as a workaround.
One minor thing, it might be good to add a "TODO" in the code comments as a 
reminder that the root cause is still under investigation.

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> 
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1-win
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch
>
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
> appears to be a failure to delete in-use files on Windows via 
> LocalFileSystem.delete (RawLocalFileSystem.delete).



[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated MAPREDUCE-4909:
-

Attachment: MAPREDUCE-4909.patch

Submitting a patch to work around the test failures. 

Filed HADOOP-9176 to address the root cause.

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> 
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 1-win
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch
>
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
> appears to be a failure to delete in-use files on Windows via 
> LocalFileSystem.delete (RawLocalFileSystem.delete).



[jira] [Created] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows

2013-01-02 Thread Arpit Agarwal (JIRA)
Arpit Agarwal created MAPREDUCE-4909:


 Summary: TestKeyValueTextInputFormat fails with Open JDK 7 on 
Windows
 Key: MAPREDUCE-4909
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 1-win
Reporter: Arpit Agarwal
Assignee: Arpit Agarwal


TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause 
appears to be a failure to delete in-use files on Windows via 
LocalFileSystem.delete (RawLocalFileSystem.delete).



[jira] [Updated] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-02 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4832:
--

Assignee: Jason Lowe
Target Version/s: 2.0.3-alpha, 0.23.6
  Status: Patch Available  (was: Open)

> MR AM can get in a split brain situation
> 
>
> Key: MAPREDUCE-4832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 0.23.5, 2.0.2-alpha
>Reporter: Robert Joseph Evans
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM it 
> could try to commit either tasks or jobs.  This could cause lots of problems 
> where the second AM finishes and tries to commit too.  This could result in 
> data corruption.  



[jira] [Updated] (MAPREDUCE-4832) MR AM can get in a split brain situation

2013-01-02 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4832:
--

Attachment: MAPREDUCE-4832.patch

Patch that implements the "commit window" concept outlined above.  The AM will 
not allow task commits or job commit to proceed unless it has heard back from 
the RM within the configured amount of time (10 seconds by default).
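
The "commit window" check can be sketched roughly as follows. This is only an 
illustration of the idea as stated in this comment, not the code in 
MAPREDUCE-4832.patch; the class name, method names, and the use of explicit 
millisecond timestamps are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a commit window; not taken from the patch.
class CommitWindow {
    private final long windowMillis;
    private final AtomicLong lastRmContactMillis;

    CommitWindow(long windowMillis, long nowMillis) {
        this.windowMillis = windowMillis;
        this.lastRmContactMillis = new AtomicLong(nowMillis);
    }

    /** Call whenever a heartbeat response from the RM arrives. */
    void recordRmContact(long nowMillis) {
        lastRmContactMillis.set(nowMillis);
    }

    /** True only if the RM was heard from within the window ending at nowMillis. */
    boolean commitAllowed(long nowMillis) {
        return nowMillis - lastRmContactMillis.get() <= windowMillis;
    }
}
```

With a 10 second window, a commit requested 5 seconds after the last RM 
response would proceed, while one requested 11 seconds after it would have to 
wait for the next successful heartbeat.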

> MR AM can get in a split brain situation
> 
>
> Key: MAPREDUCE-4832
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: applicationmaster
>Affects Versions: 2.0.2-alpha, 0.23.5
>Reporter: Robert Joseph Evans
>Priority: Critical
> Attachments: MAPREDUCE-4832.patch
>
>
> It is possible for a networking issue to happen where the RM thinks an AM has 
> gone down and launches a replacement, but the previous AM is still up and 
> running.  If the previous AM does not need any more resources from the RM it 
> could try to commit either tasks or jobs.  This could cause lots of problems 
> where the second AM finishes and tries to commit too.  This could result in 
> data corruption.  



[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4819:
---

Attachment: MR-4819-bobby-trunk.txt

This is an updated version of my patch.  It addresses all of the outstanding 
tasks besides integration with the split brain fix MAPREDUCE-4832.  I still 
need to do a lot of manual testing to be sure that this fixes the issues.  But 
I think it is very close to being a final patch.  Please take a look at it.

Bikas, if you have concerns about it or think that there is more from your 
patch that I need to pull in please let me know.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542379#comment-13542379
 ] 

Robert Joseph Evans commented on MAPREDUCE-4819:


Sorry, yes, I have been working very closely with Jason Lowe lately on this and 
MAPREDUCE-4832, so I glossed over a lot more than I should have.

In general this patch more formally couples job commit to job completion, 
because they were only informally coupled previously. FileOutputCommitter can 
optionally mark a directory as complete with a "_SUCCESS" file when the job is 
committed.  Oozie or other workflow systems can use this to recognize that a 
job has finished and start processing that output as input to another job.  If 
we do not couple them there is a race that Oozie may lose.  You are correct 
that we have to be careful about what processing happens after a job is 
committed and verify that it can be redone without any problem.  The things 
that happen here are moving the job history over to where the history server 
can pick it up, job end notification, unregistering from the RM, and cleaning 
up the staging directory.

Looking at each of these one at a time:
For moving job history over, I do need to adopt the change that you made to 
make it more robust, where we copy the log file and do not delete the old one 
until the staging directory is removed. I also need to change the 
HistoryServer so that it ignores the subsequent JobHistory files for the same 
job.

For job end notification: this hits a URL to indicate that the job has 
finished and whether it finished successfully or in error.  I do need to do 
some integration tests with Oozie to validate that it can handle being 
informed more than once without any real problems.  The notification is a 
best-effort contract, so in the short term I plan to disable notification if 
we think that we may double notify (commit finished and we don't know whether 
we notified or not).  I know Oozie can handle this, but it will delay some 
processing. We can then explore changing that contract in a separate JIRA.

Unregistering with the RM is by its very nature atomic.  If we crash after 
unregistering we will not be rerun. 

Deleting the staging directory is also guarded against (the code is commented 
out in the first patch, but I have fixed the unit tests and will include it in 
an upcoming patch).  If for some reason the staging directory was removed and 
a new AM is launched, it will exit with an error.

The only other code that is part of this patch is the JobHistoryCopyService.  
This is kind of a stripped down version of the recovery service for the special 
case where we are not going to rerun anything, we just want the events to be 
put into the new history file. We could have copied the old history file over, 
but it would be missing the section about this new AM.

This first patch was just to show the concepts.  There is still a fair amount 
of work to do before it is really ready to commit, so if you have any other 
suggestions, or potential problems that you see with this approach please point 
them out.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task

2013-01-02 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated MAPREDUCE-2217:


Attachment: expose-bug-mr-2217.patch

Sorry for the delay, just got around to this.

Uploading a patch that exposes the bug on clusters where some hostnames 
contain a 1. Running a sample pi job on 4 nodes whose hostnames share a common 
prefix followed by 01-04 results in the job hanging at 75% map progress.


> The expire launching task should cover the UNASSIGNED task
> --
>
> Key: MAPREDUCE-2217
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: jobtracker
>Affects Versions: 0.23.0
>Reporter: Scott Chen
>Assignee: Karthik Kambatla
> Fix For: 0.24.0
>
> Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, 
> MR-2217.patch
>
>
> The ExpireLaunchingTask thread kills tasks that have been scheduled but have 
> not responded.
> Currently, if a task is scheduled on a tasktracker and for some reason the 
> tasktracker cannot move it to RUNNING, the task just hangs in the UNASSIGNED 
> status and the JobTracker keeps waiting for it.
> JobTracker.ExpireLaunchingTask should be able to kill this task.



[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to "Queue configuration missing child queue names for root"

2013-01-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542345#comment-13542345
 ] 

Hudson commented on MAPREDUCE-4884:
---

Integrated in Hadoop-trunk-Commit #3162 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3162/])
MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing 
queue configuration. Contributed by Chris Nauroth. (Revision 1427945)

 Result = SUCCESS
suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml


> streaming tests fail to start MiniMRCluster due to "Queue configuration 
> missing child queue names for root"
> ---
>
> Key: MAPREDUCE-4884
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming, test
>Affects Versions: 3.0.0, trunk-win
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-4884.1.patch
>
>
> Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to 
> initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue 
> configuration missing child queue names for root".



[jira] [Updated] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to "Queue configuration missing child queue names for root"

2013-01-02 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated MAPREDUCE-4884:
---

   Resolution: Fixed
Fix Version/s: 3.0.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I committed the patch to trunk.

> streaming tests fail to start MiniMRCluster due to "Queue configuration 
> missing child queue names for root"
> ---
>
> Key: MAPREDUCE-4884
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming, test
>Affects Versions: 3.0.0, trunk-win
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-4884.1.patch
>
>
> Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to 
> initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue 
> configuration missing child queue names for root".



[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to "Queue configuration missing child queue names for root"

2013-01-02 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542319#comment-13542319
 ] 

Suresh Srinivas commented on MAPREDUCE-4884:


+1 for the patch.

> streaming tests fail to start MiniMRCluster due to "Queue configuration 
> missing child queue names for root"
> ---
>
> Key: MAPREDUCE-4884
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: contrib/streaming, test
>Affects Versions: 3.0.0, trunk-win
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: MAPREDUCE-4884.1.patch
>
>
> Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to 
> initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue 
> configuration missing child queue names for root".



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542278#comment-13542278
 ] 

Bikas Saha commented on MAPREDUCE-4819:
---

It would really help if you could elaborate on the solution a bit more. I think 
I get the gist (i.e. try to lock the commit using atomic file operations) but I 
am not clear beyond that part. We can quickly discuss the utility of both 
approaches after that. Perhaps you have already done that in your mind :)
The only thing I would like to guard against is linking the job commit 
operation with job completion where they can be independent. I agree that job 
commit is strictly needed before job completion. But making job commit the 
same as job completion may not be correct, e.g. other post-completion 
operations that are unsafe to repeat (maybe none exist now), or perhaps 
committing multiple outputs.
The patch posted earlier made sure that if a job has completed then running it 
again is a no-op. It's a safe change. Also, it notifies the client about job 
success only after making sure that the success state is persisted. I agree it 
does not handle errors in commit, which is perhaps what your patch is 
addressing.
So it could be that both changes are needed.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Commented] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1

2013-01-02 Thread Luke Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542257#comment-13542257
 ] 

Luke Lu commented on MAPREDUCE-4904:


Please add a comment about the switch fall-through, as it's not obvious and 
would raise more questions in later maintenance.
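
As an illustration of the kind of comment being requested, here is a small 
Java sketch of an intentional switch fall-through that is documented in place. 
The method, the locality names, and the counting logic are hypothetical and 
are not taken from MAPREDUCE-4904.patch.

```java
// Hypothetical example; only the commenting style is the point.
class FallThroughSketch {
    /** Counts how many locality levels a match at the given level also satisfies. */
    static int levelsSatisfied(String locality) {
        int satisfied = 0;
        switch (locality) {
            case "node":
                satisfied++;
                // fall through: a node-local match is also rack-local
            case "rack":
                satisfied++;
                // fall through: a rack-local match also counts at the top level
            default:
                satisfied++;
        }
        return satisfied;
    }
}
```

Without the two comments, a later maintainer could reasonably read the missing 
break statements as a bug and "fix" them.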

> TestMultipleLevelCaching failed in barnch-1
> ---
>
> Key: MAPREDUCE-4904
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.2.0
>Reporter: meng gong
>Assignee: meng gong
> Fix For: 1.2.0
>
> Attachments: MAPREDUCE-4904.patch
>
>
> TestMultipleLevelCaching fails:
> {noformat}
> Testcase: testMultiLevelCaching took 30.406 sec
> FAILED
> Number of local maps expected:<0> but was:<1>
> junit.framework.AssertionFailedError: Number of local maps expected:<0> but was:<1>
> at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
> at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
> at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
> {noformat}



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542179#comment-13542179
 ] 

Robert Joseph Evans commented on MAPREDUCE-4819:


The findbugs warning is because the code is not complete.  The javac warning is 
because of a new EventHandler not having the generics on it. Both of these are 
currently expected.
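
For readers unfamiliar with this kind of javac warning, the following sketch 
shows the general pattern of a raw type versus a parameterized one. The 
Handler interface here is a hypothetical stand-in, not Hadoop's EventHandler, 
and the code is not taken from the patch.

```java
// Hypothetical stand-in for an event handler interface; not Hadoop's own type.
interface Handler<T> {
    void handle(T event);
}

class RawTypeSketch {
    // Raw type: this compiles, but javac notes an unchecked call
    // (details are shown with -Xlint:unchecked or -Xlint:rawtypes).
    static void dispatchRaw(Handler handler, Object event) {
        handler.handle(event);
    }

    // Parameterized type: the same call, checked by the compiler, no warning.
    static <T> void dispatchTyped(Handler<T> handler, T event) {
        handler.handle(event);
    }
}
```

Adding the type parameter, as in dispatchTyped, is the usual way to make such a 
warning go away.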

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542174#comment-13542174
 ] 

Hadoop QA commented on MAPREDUCE-4819:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12562909/MR-4819-bobby-trunk.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 2015 javac 
compiler warnings (more than the trunk's current 2014 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//console

This message is automatically generated.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client

2013-01-02 Thread Robert Joseph Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated MAPREDUCE-4819:
---

Attachment: MR-4819-bobby-trunk.txt

Bikas,

I would actually like to propose an alternative fix.  I am attaching a very 
preliminary patch.  It instead puts a "lock" around the job commit by adding a 
few new files to the staging directory.  Task commits would still be required 
to handle the rare possibility of a double commit, just as is possible in 1.0 
now.  We would make a double commit no more likely than it is in 1.0 by also 
putting in MAPREDUCE-4832, which would help to ensure that we don't have two 
AMs telling tasks to do things at the same time.
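
One way to picture a file-based "lock" around job commit is the sketch below. 
It is an assumption-laden illustration, not the attached patch: the marker 
file names, the PriorState enum, and the use of java.nio.file on a local 
directory (rather than whatever file system API the patch uses on the staging 
directory) are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of commit marker files in a staging directory.
class JobCommitMarkers {
    enum PriorState { NOT_COMMITTED, COMMIT_INTERRUPTED, COMMITTED, COMMIT_FAILED }

    // Marker names are made up for this example.
    static final String STARTED = "COMMIT_STARTED";
    static final String SUCCESS = "COMMIT_SUCCESS";
    static final String FAILED = "COMMIT_FAIL";

    /** What a new AM attempt could conclude from the markers left by earlier attempts. */
    static PriorState inspect(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(SUCCESS))) return PriorState.COMMITTED;
        if (Files.exists(stagingDir.resolve(FAILED))) return PriorState.COMMIT_FAILED;
        if (Files.exists(stagingDir.resolve(STARTED))) return PriorState.COMMIT_INTERRUPTED;
        return PriorState.NOT_COMMITTED;
    }

    /** Files.createFile throws FileAlreadyExistsException if the marker exists, so only one attempt can start. */
    static void beginCommit(Path stagingDir) throws IOException {
        Files.createFile(stagingDir.resolve(STARTED));
    }

    /** Records the outcome so that a later attempt does not rerun the job. */
    static void endCommit(Path stagingDir, boolean success) throws IOException {
        Files.createFile(stagingDir.resolve(success ? SUCCESS : FAILED));
    }
}
```

A new attempt would call inspect first and only run the job when the result is 
NOT_COMMITTED; in the other cases it would report the recorded (or unknown) 
outcome instead of committing again.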

I would appreciate any feedback on this approach.  I am going to be working to 
add in more tests and clean up the code.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mr-am
>Affects Versions: 0.23.3, 2.0.1-alpha
>Reporter: Jason Lowe
>Assignee: Bikas Saha
>Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-02 Thread Avner BenHanoch (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542115#comment-13542115
 ] 

Avner BenHanoch commented on MAPREDUCE-4049:


1. So I'll open a case for trunk: "send APPLICATION_INIT event to additional 
AuxiliaryServices instead of hard-coded send to 'mapreduce.shuffle'" 

(1.a Do you have an idea whether to send it to all of them, or to use two sets 
of AuxiliaryServices: one that gets the event and one that doesn't?)

2. My branch-1 code only loads an optionally configured ShuffleProviderPlugin.  
I didn't touch the existing code that loads MapOutputServlet in the 
constructor of the TaskTracker.  Hence, the user will have 1 or 2 
shuffle-providers.

> plugin for generic shuffle service
> --
>
> Key: MAPREDUCE-4049
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
> Project: Hadoop Map/Reduce
>  Issue Type: Sub-task
>  Components: performance, task, tasktracker
>Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
>Reporter: Avner BenHanoch
>Assignee: Avner BenHanoch
>  Labels: merge, plugin, rdma, shuffle
> Fix For: 3.0.0
>
> Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
> mapreduce-4049.patch
>
>
> Support generic shuffle service as set of two plugins: ShuffleProvider & 
> ShuffleConsumer.
> This will satisfy the following needs:
> # Better shuffle and merge performance. For example: we are working on 
> shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, 
> or Infiniband) instead of using the current HTTP shuffle. Based on the fast 
> RDMA shuffle, the plugin can also utilize a suitable merge approach during 
> the intermediate merges. Hence, getting much better performance.
> # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden 
> dependency of NodeManager with a specific version of mapreduce shuffle 
> (currently targeted to 0.24.0).
> References:
> # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu 
> from Auburn University with others, 
> [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
> # I am attaching 2 documents with suggested Top Level Design for both plugins 
> (currently, based on 1.0 branch)
> # I am providing link for downloading UDA - Mellanox's open source plugin 
> that implements generic shuffle service using RDMA and levitated merge.  
> Note: At this phase, the code is in C++ through JNI and you should consider 
> it as beta only.  Still, it can serve anyone that wants to implement or 
> contribute to levitated merge. (Please be advised that levitated merge is 
> mostly suit in very fast networks) - 
> [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-02 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542111#comment-13542111
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4049:
---

bq. my 2nd point above was for the trunk. 

If that is the case, I think we should do that in a follow-up JIRA and have 
the patches for trunk and branch-1 there.

bq. because a 3rd party shuffle-provider always runs in addition to the default 
shuffle-provider.

The shuffle-provider class is a TaskTracker config, so it is the same for ALL 
jobs, meaning the TaskTracker will always use the same shuffle-provider class, 
no?

> plugin for generic shuffle service
> --
>
> Key: MAPREDUCE-4049
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Components: performance, task, tasktracker
> Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0
> Reporter: Avner BenHanoch
> Assignee: Avner BenHanoch
> Labels: merge, plugin, rdma, shuffle
> Fix For: 3.0.0
>
> Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, 
> mapreduce-4049.patch
>
>
> Support a generic shuffle service as a set of two plugins: ShuffleProvider &
> ShuffleConsumer.
> This will satisfy the following needs:
> # Better shuffle and merge performance. For example, we are working on a
> shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE,
> or InfiniBand) instead of using the current HTTP shuffle. Based on the fast
> RDMA shuffle, the plugin can also use a suitable merge approach during
> the intermediate merges, and hence get much better performance.
> # Satisfy MAPREDUCE-3060: a generic shuffle service that avoids the hidden
> dependency of the NodeManager on a specific version of the mapreduce shuffle
> (currently targeted to 0.24.0).
> References:
> # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu
> from Auburn University with others,
> [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf]
> # I am attaching 2 documents with a suggested Top Level Design for both plugins
> (currently based on the 1.0 branch).
> # I am providing a link for downloading UDA, Mellanox's open source plugin
> that implements a generic shuffle service using RDMA and levitated merge.
> Note: At this phase, the code is in C++ through JNI and you should consider
> it beta only.  Still, it can serve anyone who wants to implement or
> contribute to levitated merge. (Please be advised that levitated merge is
> mostly suited to very fast networks.)
> [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69]



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-02 Thread Avner BenHanoch (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542107#comment-13542107
 ] 

Avner BenHanoch commented on MAPREDUCE-4049:
---

Hi Alejandro,
thanks for your comments!

*my 2nd point above was for the trunk.*  True, in MRv2 AuxiliaryServices are
pluggable; still, they don't get APPLICATION_INIT events, hence they don't know
how to map jobId->userId.  We need either to send this event to all
AuxiliaryServices, or to define 2 groups of AuxiliaryServices: one that gets
this event, and one that doesn't. *In the current code, this event is only sent
to "mapreduce.shuffle", using a hard-coded string rather than relying on any
conf settings*.
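
To illustrate the difference between the two delivery strategies, here is a
minimal, self-contained sketch. It is not the NodeManager code: the class and
method names (RecordingAuxService, AuxServiceRegistry, sendInitToShuffleOnly,
broadcastInit) and the "rdma.shuffle" service name are hypothetical, and only
the "mapreduce.shuffle" string comes from the discussion above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical auxiliary service that records the jobId -> userId mapping it is told about. */
class RecordingAuxService {
    final Map<String, String> jobToUser = new LinkedHashMap<>();

    void onApplicationInit(String jobId, String userId) {
        jobToUser.put(jobId, userId);
    }
}

/** Hypothetical registry contrasting the two ways of delivering the init event. */
class AuxServiceRegistry {
    private final Map<String, RecordingAuxService> services = new LinkedHashMap<>();

    void register(String name, RecordingAuxService service) {
        services.put(name, service);
    }

    /** Behavior described above: only the hard-coded service name is notified. */
    void sendInitToShuffleOnly(String jobId, String userId) {
        RecordingAuxService shuffle = services.get("mapreduce.shuffle");
        if (shuffle != null) {
            shuffle.onApplicationInit(jobId, userId);
        }
    }

    /** First proposal above: every registered service receives the event. */
    void broadcastInit(String jobId, String userId) {
        for (RecordingAuxService service : services.values()) {
            service.onApplicationInit(jobId, userId);
        }
    }
}
```

With the hard-coded variant, a third-party service registered under another
name never learns the jobId->userId mapping; with the broadcast variant it
does.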

For the branch-1 comments:
Please notice that ShuffleProvider has different semantics than
ShuffleConsumer, because a 3rd party shuffle-provider always runs in addition
to the default shuffle-provider.  Multiple jobs can run in parallel, resulting
in various ShuffleConsumers running in parallel (in different Jobs/ReduceTasks).
Hence, all possible providers should exist in parallel.  That said, a
ShuffleProvider plugin is additive to the default shuffle-provider, so the TT
should not fail at startup when no ShuffleProviderPlugin is configured.

(for the rest of your branch-1 comments: yes, you are right on all!)



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-02 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542093#comment-13542093
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4049:
---

Avner, thanks for the clarification; I got confused because JIRA emailed an
'updated patch' message.

On #1, the ShuffleConsumerPlugin should be in this JIRA.
On #2, I assume the trunk version does not (will not) have a Map side because 
the ShuffleHandler is already pluggable. Given that, the Map side 
(ShuffleProvider) seems an artifact of the backport of this JIRA. Because of
that, I think it is OK to have it here.

I assume you are working on updating the attached Hadoop-1 patch. Here are
some comments on the current Hadoop-1 patch:

* Not having a ShuffleProviderPlugin in the TaskTracker should be a reason to
fail the TaskTracker at startup, no?
* We should follow the same pattern as in trunk:
** Define an interface instead of an abstract class for ShuffleConsumerPlugin,
with init(), fetchOutput(), createKVIterator(), getMergeThrowable() methods.
** Define a Context for ShuffleConsumerPlugin initialization
** Use ReflectionUtils.newInstance() in ReduceTask to instantiate the
ShuffleConsumerPlugin
** Visibility/stability annotations are missing
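
The pattern suggested above could look roughly like the following sketch. It
is only an illustration under stated assumptions: the method signatures, the
ShuffleContext fields, and the InMemoryShuffleConsumer and ShufflePluginLoader
classes are all hypothetical, and plain Java reflection stands in for Hadoop's
ReflectionUtils.newInstance() so that the example has no Hadoop dependency.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical context object handed to the plugin at initialization. */
final class ShuffleContext {
    final String jobId;
    final int reduceId;

    ShuffleContext(String jobId, int reduceId) {
        this.jobId = jobId;
        this.reduceId = reduceId;
    }
}

/** An interface (not an abstract class) with the four methods named above; signatures are assumed. */
interface ShuffleConsumerPlugin {
    void init(ShuffleContext context);
    boolean fetchOutput();
    Iterator<String> createKVIterator();
    Throwable getMergeThrowable();
}

/** Toy implementation that fabricates two records instead of fetching map output. */
class InMemoryShuffleConsumer implements ShuffleConsumerPlugin {
    private ShuffleContext context;
    private List<String> fetched;

    public void init(ShuffleContext context) { this.context = context; }

    public boolean fetchOutput() {
        fetched = Arrays.asList(context.jobId + "/k1=v1", context.jobId + "/k2=v2");
        return true;
    }

    public Iterator<String> createKVIterator() { return fetched.iterator(); }

    public Throwable getMergeThrowable() { return null; }
}

/** Stand-in for the instantiation step that the reduce task would perform. */
final class ShufflePluginLoader {
    /** Creates the plugin from its class name via the no-arg constructor, then initializes it. */
    static ShuffleConsumerPlugin load(String className, ShuffleContext context) {
        try {
            Class<? extends ShuffleConsumerPlugin> clazz =
                Class.forName(className).asSubclass(ShuffleConsumerPlugin.class);
            ShuffleConsumerPlugin plugin = clazz.getDeclaredConstructor().newInstance();
            plugin.init(context);
            return plugin;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load shuffle plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        ShuffleConsumerPlugin plugin =
            load(InMemoryShuffleConsumer.class.getName(), new ShuffleContext("job_1", 0));
        if (plugin.fetchOutput()) {
            plugin.createKVIterator().forEachRemaining(System.out::println);
        }
    }
}
```

Because the caller only sees the interface and a class name, swapping in a
different consumer is a configuration change rather than a code change.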




[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-02 Thread Avner BenHanoch (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542088#comment-13542088
 ] 

Avner BenHanoch commented on MAPREDUCE-4049:
---

Hi Alejandro,
On Monday I only removed obsolete attachments for the trunk and kept just the 
last one we submitted.

Speaking of that, please let me know:
1. Do you prefer the patch for branch-1 in this issue or in a separate issue?
2. There is still work to do in the trunk for ShuffleProvider - see [this
comment|https://issues.apache.org/jira/browse/MAPREDUCE-4049?focusedCommentId=13444026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13444026].
  Do you want me to address it here or in a separate issue?

Thank you kindly,
  Avner



[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service

2013-01-02 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542077#comment-13542077
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4049:
---

Avner, it seems the attachment you posted on Monday for branch-1 is MIA, would 
you please post it again? thx.



[jira] [Commented] (MAPREDUCE-1688) A failing retry'able notification in JobEndNotifier can affect notifications of other jobs.

2013-01-02 Thread Olga Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542053#comment-13542053
 ] 

Olga Shen commented on MAPREDUCE-1688:
--

MAPREDUCE-3028 only added a timeout setting in
org.apache.hadoop.mapreduce.v2.app.JobEndNotifier.
Would you apply the timeout setting to org.apache.hadoop.mapred.JobEndNotifier
for MRv1 users as well?

> A failing retry'able notification in JobEndNotifier can affect notifications 
> of other jobs.
> ---
>
> Key: MAPREDUCE-1688
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1688
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.20.1, 1.0.0, 1.0.2, 1.0.3
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Ravi Prakash
>
> The JobTracker puts all the notification commands into a delay-queue.  It has
> a single thread that loops through this queue and sends out the
> notifications.  When it hits failures with any notification which is
> configured to be retried via {{job.end.retry.attempts}} and
> {{job.end.retry.interval}}, the notification is queued back again. A single
> notification with a sufficiently large number of configured retries which
> consistently fails will affect other notifications in the queue.
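
To make the effect described above concrete, here is a rough, deterministic
model of a single thread draining a notification queue. It is not the
JobTracker code: the SimNotification and SingleThreadNotifierModel names are
hypothetical, each attempt is assumed to block the thread for a fixed cost, and
a failed notification is simply requeued at the tail (the retry interval is
ignored for simplicity).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of one queued job-end notification. */
class SimNotification {
    final String jobId;
    final boolean alwaysFails;
    int attemptsLeft;

    SimNotification(String jobId, boolean alwaysFails, int attemptsLeft) {
        this.jobId = jobId;
        this.alwaysFails = alwaysFails;
        this.attemptsLeft = attemptsLeft;
    }
}

/** Simulates one worker thread with a logical clock instead of real sleeps. */
final class SingleThreadNotifierModel {
    /** Returns the simulated time (ms) at which each healthy notification was delivered. */
    static Map<String, Long> run(Deque<SimNotification> queue, long failCostMs, long okCostMs) {
        Map<String, Long> deliveredAt = new LinkedHashMap<>();
        long clock = 0;
        while (!queue.isEmpty()) {
            SimNotification n = queue.pollFirst();
            if (n.alwaysFails) {
                clock += failCostMs;          // the only thread is blocked for the whole failed attempt
                n.attemptsLeft--;
                if (n.attemptsLeft > 0) {
                    queue.addLast(n);         // queued back again for another try
                }
            } else {
                clock += okCostMs;
                deliveredAt.put(n.jobId, clock);
            }
        }
        return deliveredAt;
    }

    public static void main(String[] args) {
        Deque<SimNotification> queue = new ArrayDeque<>();
        queue.addLast(new SimNotification("job_bad", true, 3));
        queue.addLast(new SimNotification("job_good1", false, 1));
        queue.addLast(new SimNotification("job_good2", false, 1));
        System.out.println(run(queue, 5000, 10));
    }
}
```

In this model every failed attempt that runs ahead of a healthy notification
adds its full blocking cost to that notification's delivery time, which is why
a per-attempt timeout, as asked about above, bounds the damage.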
