[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542750#comment-13542750 ]

Bikas Saha commented on MAPREDUCE-4819:
---

In general, you might want to rename some of the new identifiers, such as "justShutDown" or "EventEater". I also feel that the change in the MRAppMaster.init() function would benefit from some refactoring.

> AM can rerun job after reporting final job status to the client
> ---
>
> Key: MAPREDUCE-4819
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mr-am
> Affects Versions: 0.23.3, 2.0.1-alpha
> Reporter: Jason Lowe
> Assignee: Bikas Saha
> Priority: Critical
> Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
>
> If the AM reports final job status to the client but then crashes before unregistering with the RM then the RM can run another AM attempt. Currently AM re-attempts assume that the previous attempts did not reach a final job state, and that causes the job to rerun (from scratch, if the output format doesn't support recovery).
> Re-running the job when we've already told the client the final status of the job is bad for a number of reasons. If the job failed, it's confusing at best since the client was already told the job failed but the subsequent attempt could succeed. If the job succeeded there could be data loss, as a subsequent job launched by the client tries to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542748#comment-13542748 ]

Bikas Saha commented on MAPREDUCE-4819:
---

It looks like, after the recent changes in JobImpl and the current alternative approach, my original fix for not rerunning the job no longer really applies.

I think you would want to take the changes in my patch that add the jobid to the history staging dir. Since the staging dir is not deleted during job history flushing, I had observed that if I made my AM crash (by putting an exit(1) in shutdownJob()) then the history files would get orphaned and not cleaned up. Or something like that. To fix that I had to add the jobid to the path. Snippet from my patch:
{code}
+++ hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/jobhistory/JobHistoryUtils.java
@@ -186,10 +186,11 @@ public static PathFilter getHistoryFileFilter() {
    * @return A string representation of the prefix.
    */
   public static String
-      getConfiguredHistoryStagingDirPrefix(Configuration conf)
+      getConfiguredHistoryStagingDirPrefix(Configuration conf, String jobId)
       throws IOException {
     String user = UserGroupInformation.getCurrentUser().getShortUserName();
-    Path path = MRApps.getStagingAreaDir(conf, user);
+    Path stagingPath = MRApps.getStagingAreaDir(conf, user);
+    Path path = new Path(stagingPath, jobId);
     String logDir = path.toString();
     return logDir;
   }
{code}

For the patch itself I have a few comments.

Why not end in success if the staging dir was cleaned up by the last attempt? I am guessing that this code won't be necessary after we move the unregistration with the RM ahead of the staging dir cleanup in MAPREDUCE-4841, right?
{code}
+      if(!stagingExists) {
+        copyHistory = false;
+        isLastAMRetry = true;
+        justShutDown = true;
+        shouldNotify = false;
+        forcedState = JobStateInternal.ERROR;
+        shutDownMessage = "Staging dir does not exist " + stagingDir;
+        LOG.fatal(shutDownMessage);
{code}

Why are we only eating/ignoring the JobEvents in the dispatcher? Is it so that the JobImpl state machine is not triggered?

This might be a question of personal preference, but I think an explicit transition from INIT to the final state is cleaner than overriding the state in the getter.
{code}
   public JobStateInternal getInternalState() {
     readLock.lock();
     try {
+      if(forcedState != null) {
+        return forcedState;
+      }
{code}

I didn't quite get this change in HistoryFileManager.java. It looks like it is related to a recent change in that code.
{code}
+      } else if (old != null && !old.isMovePending()) {
+        //This is a duplicate so just delete it
+        fileInfo.delete();
       }
{code}

Typo:
{code}
+        throw new Exception("No handler for regitered for " + type);
+      }
{code}
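To make the getter-override question in the comment above concrete, here is a minimal sketch contrasting the two styles. `TinyJob`, `TinyJobState`, `forceState`, `failFromInit` and `storedState` are hypothetical names invented for illustration; only the `forcedState` field and the `getInternalState()` getter echo the quoted snippet, and this is not the actual JobImpl code.

```java
// Hypothetical, heavily simplified stand-in for a job state machine.
enum TinyJobState { INIT, RUNNING, ERROR }

class TinyJob {
  private TinyJobState state = TinyJobState.INIT;
  // Style A: a side-channel value that only the getter consults.
  private TinyJobState forcedState = null;

  // Style A: leave the stored state alone and override what the getter reports.
  synchronized void forceState(TinyJobState forced) {
    forcedState = forced;
  }

  // Style B: an explicit transition, so the stored state itself changes.
  synchronized void failFromInit() {
    if (state != TinyJobState.INIT) {
      throw new IllegalStateException("expected INIT but was " + state);
    }
    state = TinyJobState.ERROR;
  }

  // Reports the forced state when one is set, otherwise the stored state.
  synchronized TinyJobState getInternalState() {
    return forcedState != null ? forcedState : state;
  }

  // What the state field really holds, ignoring any override.
  synchronized TinyJobState storedState() {
    return state;
  }
}

class TinyJobDemo {
  public static void main(String[] args) {
    TinyJob overridden = new TinyJob();
    overridden.forceState(TinyJobState.ERROR);
    System.out.println(overridden.getInternalState() + " / " + overridden.storedState());

    TinyJob transitioned = new TinyJob();
    transitioned.failFromInit();
    System.out.println(transitioned.getInternalState() + " / " + transitioned.storedState());
  }
}
```

With the override, the getter reports ERROR while the stored state is still INIT, so any code that reads the field directly sees a different answer. With the explicit transition both views agree, which is presumably the cleanliness argument being made in the review.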
[jira] [Created] (MAPREDUCE-4911) Add node-level aggregation flag feature(setLocalAggregation(boolean)) to JobConf
Tsuyoshi OZAWA created MAPREDUCE-4911:
---

Summary: Add node-level aggregation flag feature(setLocalAggregation(boolean)) to JobConf
Key: MAPREDUCE-4911
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4911
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: client
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA

This JIRA adds a node-level aggregation flag (setLocalAggregation(boolean)) to JobConf. This task is a subtask of MAPREDUCE-4502.
[jira] [Created] (MAPREDUCE-4910) Adding AggregationWaitMap to some components(MRAppMaster, TaskAttemptListener, JobImpl, MapTaskImpl).
Tsuyoshi OZAWA created MAPREDUCE-4910:
---

Summary: Adding AggregationWaitMap to some components(MRAppMaster, TaskAttemptListener, JobImpl, MapTaskImpl).
Key: MAPREDUCE-4910
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4910
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: applicationmaster, mrv2, task
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA

To implement MR-4502, AggregationWaitMap needs to be used by several components (MRAppMaster, TaskAttemptListener, JobImpl, MapTaskImpl).
[jira] [Commented] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542687#comment-13542687 ]

Junping Du commented on MAPREDUCE-4904:
---

Sure. Thanks, Luke, for the comments. If localityLevel = 2 and NodeGroup is not in use, the task should be counted in OTHER_LOCAL_MAPS (it should reach the "default" case below and be handled there, rather than breaking out). This tiny patch fixes this issue.

> TestMultipleLevelCaching failed in barnch-1
> ---
>
> Key: MAPREDUCE-4904
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: test
> Affects Versions: 1.2.0
> Reporter: meng gong
> Assignee: meng gong
> Fix For: 1.2.0
> Attachments: MAPREDUCE-4904.patch
>
> TestMultipleLevelCaching will fail:
> {noformat}
> Testcase: testMultiLevelCaching took 30.406 sec
>   FAILED
> Number of local maps expected:<0> but was:<1>
> junit.framework.AssertionFailedError: Number of local maps expected:<0> but was:<1>
>   at org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78)
>   at org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113)
>   at org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69)
> {noformat}
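The control flow described in the comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in MAPREDUCE-4904.patch: `LocalityClassifier`, `classify`, and every counter name other than OTHER_LOCAL_MAPS are hypothetical, as is the exact meaning assigned to levels 0 and 1.

```java
// Counter names are assumptions, except OTHER_LOCAL_MAPS, which the comment mentions.
enum LocalityCounter {
  DATA_LOCAL_MAPS, NODEGROUP_LOCAL_MAPS, RACK_LOCAL_MAPS, OTHER_LOCAL_MAPS
}

class LocalityClassifier {
  // Picks the counter to increment for a map task chosen at the given level.
  static LocalityCounter classify(int localityLevel, boolean nodeGroupAware) {
    switch (localityLevel) {
      case 0:
        return LocalityCounter.DATA_LOCAL_MAPS;
      case 1:
        // Assumed: level 1 is the node group when node groups exist, else the rack.
        return nodeGroupAware
            ? LocalityCounter.NODEGROUP_LOCAL_MAPS
            : LocalityCounter.RACK_LOCAL_MAPS;
      case 2:
        if (nodeGroupAware) {
          return LocalityCounter.RACK_LOCAL_MAPS;
        }
        // Without node groups, deliberately do not break here:
        // fall through so the task is counted by the default case.
      default:
        return LocalityCounter.OTHER_LOCAL_MAPS;
    }
  }

  public static void main(String[] args) {
    System.out.println(classify(2, false));
    System.out.println(classify(2, true));
  }
}
```

The point of the sketch is the missing `break` in `case 2`: if that case always broke out, a level-2 task on a cluster without node groups would never reach the branch that counts OTHER_LOCAL_MAPS.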
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Attachment: MAPREDUCE-4909.patch [~sureshms] Removed the Windows-specific comment. HADOOP-9176 was filed to address the root cause. Thanks! Arpit > TestKeyValueTextInputFormat fails with Open JDK 7 on Windows > > > Key: MAPREDUCE-4909 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 1-win >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch, > MAPREDUCE-4909.patch > > > TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause > appears to be a failure to delete in-use files via LocalFileSystem.delete > (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Description: TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files via LocalFileSystem.delete (RawLocalFileSystem.delete). (was: TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files on Windows via LocalFileSystem.delete (RawLocalFileSystem.delete).) > TestKeyValueTextInputFormat fails with Open JDK 7 on Windows > > > Key: MAPREDUCE-4909 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 1-win >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch > > > TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause > appears to be a failure to delete in-use files via LocalFileSystem.delete > (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542583#comment-13542583 ] Hadoop QA commented on MAPREDUCE-2217: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562996/MR-2217.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3187//console This message is automatically generated. > The expire launching task should cover the UNASSIGNED task > -- > > Key: MAPREDUCE-2217 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: jobtracker >Affects Versions: 0.23.0, 1.1.1 >Reporter: Scott Chen >Assignee: Karthik Kambatla > Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, > MR-2217.patch, MR-2217.patch > > > The ExpireLaunchingTask thread kills the task that are scheduled but not > responded. > Currently if a task is scheduled on tasktracker and for some reason > tasktracker cannot put it to RUNNING. > The task will just hang in the UNASSIGNED status and JobTracker will keep > waiting for it. > JobTracker.ExpireLaunchingTask should be able to kill this task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated MAPREDUCE-2217: Fix Version/s: (was: 0.24.0) Affects Version/s: 1.1.1 Status: Patch Available (was: Open) > The expire launching task should cover the UNASSIGNED task > -- > > Key: MAPREDUCE-2217 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: jobtracker >Affects Versions: 1.1.1, 0.23.0 >Reporter: Scott Chen >Assignee: Karthik Kambatla > Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, > MR-2217.patch, MR-2217.patch > > > The ExpireLaunchingTask thread kills the task that are scheduled but not > responded. > Currently if a task is scheduled on tasktracker and for some reason > tasktracker cannot put it to RUNNING. > The task will just hang in the UNASSIGNED status and JobTracker will keep > waiting for it. > JobTracker.ExpireLaunchingTask should be able to kill this task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated MAPREDUCE-2217: Attachment: MR-2217.patch Re-uploading the patch for Jenkins sanity. > The expire launching task should cover the UNASSIGNED task > -- > > Key: MAPREDUCE-2217 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: jobtracker >Affects Versions: 0.23.0 >Reporter: Scott Chen >Assignee: Karthik Kambatla > Fix For: 0.24.0 > > Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, > MR-2217.patch, MR-2217.patch > > > The ExpireLaunchingTask thread kills the task that are scheduled but not > responded. > Currently if a task is scheduled on tasktracker and for some reason > tasktracker cannot put it to RUNNING. > The task will just hang in the UNASSIGNED status and JobTracker will keep > waiting for it. > JobTracker.ExpireLaunchingTask should be able to kill this task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542570#comment-13542570 ]

Karthik Kambatla commented on MAPREDUCE-2217:
---

The patch posted on 16/Nov fixes the issue. To verify this, I ran a 4-node Hadoop cluster with both MR-2217.patch and expose-bug-mr-2217.patch applied. The tasks assigned to machine01 time out and are subsequently scheduled on other nodes, and the job completes. Without MR-2217.patch, the job doesn't progress even after an hour. I used the pi job with 8 mappers and 1000 input splits for this.

> The expire launching task should cover the UNASSIGNED task
> ---
>
> Key: MAPREDUCE-2217
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: jobtracker
> Affects Versions: 0.23.0
> Reporter: Scott Chen
> Assignee: Karthik Kambatla
> Fix For: 0.24.0
> Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, MR-2217.patch
>
> The ExpireLaunchingTask thread kills the task that are scheduled but not responded.
> Currently if a task is scheduled on tasktracker and for some reason tasktracker cannot put it to RUNNING.
> The task will just hang in the UNASSIGNED status and JobTracker will keep waiting for it.
> JobTracker.ExpireLaunchingTask should be able to kill this task.
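The behaviour requested in the quoted issue description can be sketched as below. This is a guess at the shape of the logic, not the contents of MR-2217.patch: `LaunchExpirySketch` and all of its members are hypothetical, and the real JobTracker bookkeeping is considerably more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LaunchExpirySketch {
  enum Status { UNASSIGNED, RUNNING }

  private final long timeoutMs;
  // Attempt id -> time (ms) at which the attempt was handed to a tracker.
  private final Map<String, Long> launching = new LinkedHashMap<>();

  LaunchExpirySketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Start the expiry clock when an attempt is scheduled on a tracker.
  void scheduled(String attemptId, long nowMs) {
    launching.put(attemptId, nowMs);
  }

  // Handle a status report from the tracker for this attempt.
  void statusReported(String attemptId, Status reported) {
    // An UNASSIGNED report means the tracker holds the task but has not
    // started it, so the expiry clock keeps running for that attempt.
    if (reported != Status.UNASSIGNED) {
      launching.remove(attemptId);
    }
  }

  // Returns, and stops tracking, every attempt that waited longer than the timeout.
  List<String> expire(long nowMs) {
    List<String> expired = new ArrayList<>();
    for (Map.Entry<String, Long> e : launching.entrySet()) {
      if (nowMs - e.getValue() > timeoutMs) {
        expired.add(e.getKey());
      }
    }
    for (String id : expired) {
      launching.remove(id);
    }
    return expired;
  }
}
```

In this sketch an attempt that only ever reports UNASSIGNED is returned by `expire` once the timeout passes, so a scheduler could kill it and place it on another node, which matches the outcome described in the verification run above.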
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542567#comment-13542567 ] Hadoop QA commented on MAPREDUCE-4819: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562963/MR-4819-bobby-trunk.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 2015 javac compiler warnings (more than the trunk's current 2014 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.mapreduce.v2.app.commit.TestCommitterEventHandler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3186//console This message is automatically generated. > AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). > Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. 
[jira] [Commented] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542561#comment-13542561 ] Hadoop QA commented on MAPREDUCE-4832: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562975/MAPREDUCE-4832.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3185//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3185//console This message is automatically generated. 
> MR AM can get in a split brain situation > > > Key: MAPREDUCE-4832 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster >Affects Versions: 2.0.2-alpha, 0.23.5 >Reporter: Robert Joseph Evans >Assignee: Jason Lowe >Priority: Critical > Attachments: MAPREDUCE-4832.patch > > > It is possible for a networking issue to happen where the RM thinks an AM has > gone down and launches a replacement, but the previous AM is still up and > running. If the previous AM does not need any more resources from the RM it > could try to commit either tasks or jobs. This could cause lots of problems > where the second AM finishes and tries to commit too. This could result in > data corruption. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542542#comment-13542542 ]

Suresh Srinivas commented on MAPREDUCE-4909:
---

bq. One minor thing, it might be good to add a "TODO" in the code comments as a reminder that the root cause is still under investigation.

The goal of this test is not to delete a file that is in use, so a TODO seems unnecessary.

[~arpitagarwal] Also, the Windows-related comments seem inappropriate. Can a separate jira be created, related to this, to track deletion of a file that is in use? I think there might already be some jiras tracking this for Windows.

> TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
> ---
>
> Key: MAPREDUCE-4909
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 1-win
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
> Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch
>
> TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files on Windows via LocalFileSystem.delete (RawLocalFileSystem.delete).
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Attachment: MAPREDUCE-4909.patch Added TODO. > TestKeyValueTextInputFormat fails with Open JDK 7 on Windows > > > Key: MAPREDUCE-4909 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 1-win >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: MAPREDUCE-4909.patch, MAPREDUCE-4909.patch > > > TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause > appears to be a failure to delete in-use files on Windows via > LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542536#comment-13542536 ] Brandon Li commented on MAPREDUCE-4909: --- +1, the patch looks good as a workaround. One minor thing, it might be good to add a "TODO" in the code comments as a reminder that the root cause is still under investigation. > TestKeyValueTextInputFormat fails with Open JDK 7 on Windows > > > Key: MAPREDUCE-4909 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 1-win >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: MAPREDUCE-4909.patch > > > TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause > appears to be a failure to delete in-use files on Windows via > LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
[ https://issues.apache.org/jira/browse/MAPREDUCE-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated MAPREDUCE-4909: - Attachment: MAPREDUCE-4909.patch Submitting a patch to work around the test failures. Filed HADOOP-9176 to address the root cause. > TestKeyValueTextInputFormat fails with Open JDK 7 on Windows > > > Key: MAPREDUCE-4909 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 1-win >Reporter: Arpit Agarwal >Assignee: Arpit Agarwal > Attachments: MAPREDUCE-4909.patch > > > TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause > appears to be a failure to delete in-use files on Windows via > LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAPREDUCE-4909) TestKeyValueTextInputFormat fails with Open JDK 7 on Windows
Arpit Agarwal created MAPREDUCE-4909: Summary: TestKeyValueTextInputFormat fails with Open JDK 7 on Windows Key: MAPREDUCE-4909 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4909 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 1-win Reporter: Arpit Agarwal Assignee: Arpit Agarwal TestKeyValueTextInputFormat.testFormat fails with Open JDK 7. The root cause appears to be a failure to delete in-use files on Windows via LocalFileSystem.delete (RawLocalFileSystem.delete). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
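The workaround in MAPREDUCE-4909.patch is not shown in this thread. Purely as a general illustration of the failure class named in the description (deleting a file that still has an open handle, which Windows typically refuses), the sketch below closes the handle before deleting. It uses plain `java.nio.file` rather than Hadoop's LocalFileSystem, and `CloseBeforeDelete` and `writeThenDelete` are hypothetical names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CloseBeforeDelete {
  // Writes a small key/value line, closes the handle, then deletes the file.
  // Returns true if the file no longer exists afterwards.
  static boolean writeThenDelete(Path file) throws IOException {
    try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      w.write("key\tvalue\n");
    } // try-with-resources closes the writer here, before any delete is attempted

    // Files.delete throws an IOException on failure instead of returning false,
    // so a failed delete cannot go unnoticed.
    Files.delete(file);
    return !Files.exists(file);
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("kv-input", ".txt");
    System.out.println(writeThenDelete(tmp));
  }
}
```

A delete call that signals failure only through a boolean return value is easy to ignore, which is one way an in-use file can survive into a later step of a test; whether that is what happens in TestKeyValueTextInputFormat is not established here.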
[jira] [Updated] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-4832: -- Assignee: Jason Lowe Target Version/s: 2.0.3-alpha, 0.23.6 Status: Patch Available (was: Open) > MR AM can get in a split brain situation > > > Key: MAPREDUCE-4832 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster >Affects Versions: 0.23.5, 2.0.2-alpha >Reporter: Robert Joseph Evans >Assignee: Jason Lowe >Priority: Critical > Attachments: MAPREDUCE-4832.patch > > > It is possible for a networking issue to happen where the RM thinks an AM has > gone down and launches a replacement, but the previous AM is still up and > running. If the previous AM does not need any more resources from the RM it > could try to commit either tasks or jobs. This could cause lots of problems > where the second AM finishes and tries to commit too. This could result in > data corruption. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4832) MR AM can get in a split brain situation
[ https://issues.apache.org/jira/browse/MAPREDUCE-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-4832: -- Attachment: MAPREDUCE-4832.patch Patch that implements the "commit window" concept outlined above. The AM will not allow task commits or job commit to proceed unless it has heard back from the RM within the configured amount of time (10 seconds by default). > MR AM can get in a split brain situation > > > Key: MAPREDUCE-4832 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4832 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster >Affects Versions: 2.0.2-alpha, 0.23.5 >Reporter: Robert Joseph Evans >Priority: Critical > Attachments: MAPREDUCE-4832.patch > > > It is possible for a networking issue to happen where the RM thinks an AM has > gone down and launches a replacement, but the previous AM is still up and > running. If the previous AM does not need any more resources from the RM it > could try to commit either tasks or jobs. This could cause lots of problems > where the second AM finishes and tries to commit too. This could result in > data corruption. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
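The "commit window" idea in the comment above can be sketched in a few lines. This is a minimal illustration, not the code in MAPREDUCE-4832.patch: `CommitWindowSketch`, `heardFromRm` and `mayCommit` are hypothetical names, and only the 10-second default comes from the comment.

```java
class CommitWindowSketch {
  // Default taken from the comment: 10 seconds.
  static final long DEFAULT_WINDOW_MS = 10_000L;

  private final long windowMs;
  private long lastRmContactMs;

  CommitWindowSketch(long windowMs, long startMs) {
    this.windowMs = windowMs;
    this.lastRmContactMs = startMs;
  }

  // Record that the RM answered a heartbeat at the given time.
  synchronized void heardFromRm(long nowMs) {
    lastRmContactMs = nowMs;
  }

  // A task or job commit may start only if the RM answered within the window.
  synchronized boolean mayCommit(long nowMs) {
    return nowMs - lastRmContactMs <= windowMs;
  }

  public static void main(String[] args) {
    CommitWindowSketch window = new CommitWindowSketch(DEFAULT_WINDOW_MS, 0L);
    System.out.println(window.mayCommit(5_000L));   // inside the window
    System.out.println(window.mayCommit(10_001L));  // RM silent for too long
  }
}
```

An AM that has been cut off from the RM for longer than the window refuses to commit, which narrows the period in which an old AM and its replacement could both try to commit. Timestamps are passed in explicitly here only to keep the sketch deterministic.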
[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated MAPREDUCE-4819: --- Attachment: MR-4819-bobby-trunk.txt This is an updated version of my patch. It addresses all of the outstanding tasks besides integration with the split brain fix MAPREDUCE-4832. I still need to do a lot of manual testing to be sure that this fixes the issues. But I think it is very close to being a final patch. Please take a look at it. Bikas, if you have concerns about it or think that there is more from your patch that I need to pull in please let me know. > AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). > Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. 
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542379#comment-13542379 ] Robert Joseph Evans commented on MAPREDUCE-4819: Sorry. Yes, I have been working very closely with Jason Lowe lately on this and MAPREDUCE-4832, so I glossed over a lot more than I should have. In general this patch is more formally coupling job commit to job completion because it was informally coupled previously. FileOutputCommitter optionally will mark a directory as complete with an "_SUCCESS" file when the job is committed. Oozie or other workflow systems can use this to recognize that a job has finished and start processing that output as input to another job. If we do not couple them there is a race that Oozie may lose. You are correct that we have to be careful about what processing happens after a job is committed and verify that it can be redone without any problem. The things that happen here are moving the job history over to where the history server can pick it up, job end notification, unregistering from the RM, and cleaning up the staging directory. Looking at each of these one at a time: For moving job history over I do need to adopt the change that you made to make it more robust where we copy the log file and do not delete the old one until the staging directory is removed. I also need to make changes to the HistoryServer to allow it to ignore the subsequent JobHistory files for the same job. Job End notification hits a URL to indicate that the job has finished and whether it finished successfully or in error. I do need to do some integration tests with Oozie to validate that it can handle being informed more than once without having any real problems. The notification is a best effort contract, so in the short term I plan to disable notification if we think that we may double notify (Commit finished and we don't know if we notified or not). 
I know Oozie can handle this, but it will delay some processing. We can then explore changing that contract on a separate JIRA. Unregistering with the RM is by its very nature atomic. If we crash after unregistering we will not be rerun. Deleting the staging directory is also guarded against (the code is commented out in the first patch, but I have fixed the unit tests and will have it in an upcoming patch). If for some reason the staging directory was removed and a new AM is launched it will exit with an error. The only other code that is part of this patch is the JobHistoryCopyService. This is kind of a stripped down version of the recovery service for the special case where we are not going to rerun anything, we just want the events to be put into the new history file. We could have copied the old history file over, but it would be missing the section about this new AM. This first patch was just to show the concepts. There is still a fair amount of work to do before it is really ready to commit, so if you have any other suggestions, or potential problems that you see with this approach please point them out. > AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). 
> Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
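The restart behaviour walked through in the comment above (exit with an error if the staging directory is gone, do not rerun a job whose commit already finished, and skip the end notification when a double notify is possible) might be summarised as a small decision function. This is a minimal sketch under those stated assumptions; `RestartPlan` and its fields are hypothetical names and do not come from the attached patch.

```java
/**
 * Hypothetical summary of what a new AM attempt should do on startup.
 * The names are illustrative only and are not taken from the patch.
 */
final class RestartPlan {
    final boolean exitWithError;
    final boolean rerunJob;
    final boolean sendEndNotification;

    private RestartPlan(boolean exitWithError, boolean rerunJob, boolean sendEndNotification) {
        this.exitWithError = exitWithError;
        this.rerunJob = rerunJob;
        this.sendEndNotification = sendEndNotification;
    }

    static RestartPlan forNewAttempt(boolean stagingDirExists, boolean previousCommitFinished) {
        if (!stagingDirExists) {
            // Cleanup already happened, so the job is over; a late attempt must not start it again.
            return new RestartPlan(true, false, false);
        }
        if (previousCommitFinished) {
            // Output is final: do not rerun. We cannot tell whether the earlier attempt
            // already notified, so the best-effort notification is skipped.
            return new RestartPlan(false, false, false);
        }
        // No commit finished yet: behave like an ordinary re-attempt.
        return new RestartPlan(false, true, true);
    }
}
```

In the middle case the new attempt would still copy the history events, unregister from the RM and clean up the staging directory, since the comment argues those steps can be redone safely.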
[jira] [Updated] (MAPREDUCE-2217) The expire launching task should cover the UNASSIGNED task
[ https://issues.apache.org/jira/browse/MAPREDUCE-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated MAPREDUCE-2217: Attachment: expose-bug-mr-2217.patch Sorry for the delay, just got around to this. Uploading a patch that exposes the bug on clusters with some hosts with a 1 in their hostname. Running a sample pi job with 4 nodes with common prefix followed by 01-04, results in the job hanging at 75% map progress. > The expire launching task should cover the UNASSIGNED task > -- > > Key: MAPREDUCE-2217 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-2217 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: jobtracker >Affects Versions: 0.23.0 >Reporter: Scott Chen >Assignee: Karthik Kambatla > Fix For: 0.24.0 > > Attachments: expose-bug-mr-2217.patch, MAPREDUCE-2217.1.txt, > MR-2217.patch > > > The ExpireLaunchingTask thread kills the task that are scheduled but not > responded. > Currently if a task is scheduled on tasktracker and for some reason > tasktracker cannot put it to RUNNING. > The task will just hang in the UNASSIGNED status and JobTracker will keep > waiting for it. > JobTracker.ExpireLaunchingTask should be able to kill this task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to "Queue configuration missing child queue names for root"
[ https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542345#comment-13542345 ] Hudson commented on MAPREDUCE-4884: --- Integrated in Hadoop-trunk-Commit #3162 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3162/]) MAPREDUCE-4884. Streaming tests fail to start MiniMRCluster due to missing queue configuration. Contributed by Chris Nauroth. (Revision 1427945) Result = SUCCESS suresh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1427945 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-tools/hadoop-streaming/pom.xml > streaming tests fail to start MiniMRCluster due to "Queue configuration > missing child queue names for root" > --- > > Key: MAPREDUCE-4884 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: contrib/streaming, test >Affects Versions: 3.0.0, trunk-win >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Fix For: 3.0.0 > > Attachments: MAPREDUCE-4884.1.patch > > > Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to > initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue > configuration missing child queue names for root". -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to "Queue configuration missing child queue names for root"
[ https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Srinivas updated MAPREDUCE-4884: --- Resolution: Fixed Fix Version/s: 3.0.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I committed the patch to trunk. > streaming tests fail to start MiniMRCluster due to "Queue configuration > missing child queue names for root" > --- > > Key: MAPREDUCE-4884 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: contrib/streaming, test >Affects Versions: 3.0.0, trunk-win >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Fix For: 3.0.0 > > Attachments: MAPREDUCE-4884.1.patch > > > Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to > initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue > configuration missing child queue names for root". -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4884) streaming tests fail to start MiniMRCluster due to "Queue configuration missing child queue names for root"
[ https://issues.apache.org/jira/browse/MAPREDUCE-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542319#comment-13542319 ] Suresh Srinivas commented on MAPREDUCE-4884: +1 for the patch. > streaming tests fail to start MiniMRCluster due to "Queue configuration > missing child queue names for root" > --- > > Key: MAPREDUCE-4884 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4884 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: contrib/streaming, test >Affects Versions: 3.0.0, trunk-win >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Attachments: MAPREDUCE-4884.1.patch > > > Multiple tests in hadoop-streaming, such as {{TestFileArgs}}, fail to > initialize {{MiniMRCluster}} due to a {{YarnException}} with reason "Queue > configuration missing child queue names for root". -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542278#comment-13542278 ] Bikas Saha commented on MAPREDUCE-4819: --- It would really help if you could elaborate on the solution a bit more. I think I get the gist (i.e. try to lock the commit using atomic file operations) but I am not clear beyond that part. We can quickly discuss the utility of both approaches after that. Perhaps you have already done that in your mind :) The only thing I would like to guard against is linking the job commit operation with job completion where they can be independent. I agree that job commit is strictly needed before job completion. But making job commit the same as job completion may not be correct, e.g. there may be other operations post completion that are unsafe to repeat (maybe none exist now), or perhaps multiple outputs to commit. The patch posted earlier made sure that if a job has completed then it will be a no-op to run it again. It's a safe change. Also, it notifies the client about job success after making sure that the success state is persisted. I agree it does not handle errors in commit, which is perhaps what your patch is addressing. So it could be that both changes are needed. > AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. 
Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). > Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4904) TestMultipleLevelCaching failed in barnch-1
[ https://issues.apache.org/jira/browse/MAPREDUCE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542257#comment-13542257 ] Luke Lu commented on MAPREDUCE-4904: Please add a comment about the switch fall-through, as it's not obvious and would raise more questions in later maintenance. > TestMultipleLevelCaching failed in barnch-1 > --- > > Key: MAPREDUCE-4904 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4904 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: test >Affects Versions: 1.2.0 >Reporter: meng gong >Assignee: meng gong > Fix For: 1.2.0 > > Attachments: MAPREDUCE-4904.patch > > > TestMultipleLevelCaching will failed: > {noformat} > Testcase: testMultiLevelCaching took 30.406 sec > FAILED > Number of local maps expected:<0> but was:<1> > junit.framework.AssertionFailedError: Number of local maps expected:<0> but > was:<1> > at > org.apache.hadoop.mapred.TestRackAwareTaskPlacement.launchJobAndTestCounters(TestRackAwareTaskPlacement.java:78) > at > org.apache.hadoop.mapred.TestMultipleLevelCaching.testCachingAtLevel(TestMultipleLevelCaching.java:113) > at > org.apache.hadoop.mapred.TestMultipleLevelCaching.testMultiLevelCaching(TestMultipleLevelCaching.java:69) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
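As a generic illustration of the review request above (this is not the code from MAPREDUCE-4904.patch, and the method and its stages are made up), an intentional switch fall-through is usually marked with a comment at each point where a `break` is deliberately omitted:

```java
final class FallThroughExample {
    /** Returns how many steps run when starting from the given (hypothetical) stage. */
    static int stepsFrom(int stage) {
        int steps = 0;
        switch (stage) {
            case 2:
                steps++; // work specific to stage 2
                // fall through: stage 2 also needs everything stage 1 does
            case 1:
                steps++; // work specific to stage 1
                // fall through: stage 1 also needs everything stage 0 does
            case 0:
                steps++; // work common to all stages
                break;
            default:
                throw new IllegalArgumentException("unknown stage " + stage);
        }
        return steps;
    }
}
```

Without the comments a later maintainer could reasonably read the missing `break` statements as a bug and "fix" them.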
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542179#comment-13542179 ] Robert Joseph Evans commented on MAPREDUCE-4819: The findbugs warning is because the code is not complete. The javac warning is because of a new EventHandler not having the generics on it. Both of these are currently expected. > AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). > Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542174#comment-13542174 ] Hadoop QA commented on MAPREDUCE-4819: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562909/MR-4819-bobby-trunk.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 2015 javac compiler warnings (more than the trunk's current 2014 warnings). {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3184//console This message is automatically generated. 
> AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). > Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAPREDUCE-4819) AM can rerun job after reporting final job status to the client
[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated MAPREDUCE-4819: --- Attachment: MR-4819-bobby-trunk.txt Bikas, I would actually like to propose an alternative fix. I am attaching a very preliminary patch. This will instead put a "lock" around the job commit by adding a few new files into the staging directory. Task commits would be required to handle the rare possibility of a double commit, just as is possible in 1.0 now. We would make it just as likely to happen as it is in 1.0 by also putting in MAPREDUCE-4832, which would help to ensure that we don't have two AMs telling tasks to do things at the same time. I would appreciate any feedback on this approach. I am going to be working to add in more tests and clean up the code. > AM can rerun job after reporting final job status to the client > --- > > Key: MAPREDUCE-4819 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mr-am >Affects Versions: 0.23.3, 2.0.1-alpha >Reporter: Jason Lowe >Assignee: Bikas Saha >Priority: Critical > Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, > MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt > > > If the AM reports final job status to the client but then crashes before > unregistering with the RM then the RM can run another AM attempt. Currently > AM re-attempts assume that the previous attempts did not reach a final job > state, and that causes the job to rerun (from scratch, if the output format > doesn't support recovery). > Re-running the job when we've already told the client the final status of the > job is bad for a number of reasons. If the job failed, it's confusing at > best since the client was already told the job failed but the subsequent > attempt could succeed. 
If the job succeeded there could be data loss, as a > subsequent job launched by the client tries to consume the job's output as > input just as the re-attempt starts removing output files in preparation for > the output commit. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
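The "lock" around job commit proposed in the update above, built from a few marker files in the staging directory, could look roughly like the sketch below. It uses `java.nio.file` on a local directory rather than the Hadoop `FileSystem` API so that it stays self-contained, and the marker file names and the `JobCommitLock` class are assumptions made for illustration; the message does not name the files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical commit lock based on marker files (file names are assumptions). */
final class JobCommitLock {
    enum State { NOT_STARTED, IN_PROGRESS, SUCCEEDED, FAILED }

    private final Path started;
    private final Path succeeded;
    private final Path failed;

    JobCommitLock(Path stagingDir) {
        this.started = stagingDir.resolve("COMMIT_STARTED");
        this.succeeded = stagingDir.resolve("COMMIT_SUCCESS");
        this.failed = stagingDir.resolve("COMMIT_FAIL");
    }

    /** Atomically claims the commit; returns false if an attempt already claimed it. */
    boolean tryStart() {
        try {
            Files.createFile(started); // fails if the marker already exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Records the outcome so that a later AM attempt can see it. */
    void finish(boolean success) {
        try {
            Files.createFile(success ? succeeded : failed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** What a newly launched AM attempt would observe in the staging directory. */
    State state() {
        if (Files.exists(succeeded)) return State.SUCCEEDED;
        if (Files.exists(failed)) return State.FAILED;
        if (Files.exists(started)) return State.IN_PROGRESS;
        return State.NOT_STARTED;
    }
}
```

A re-attempt that sees `SUCCEEDED` or `FAILED` can report that status without rerunning anything, and one that sees `IN_PROGRESS` knows an earlier attempt died mid-commit and that the output cannot be assumed clean.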
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542115#comment-13542115 ] Avner BenHanoch commented on MAPREDUCE-4049: 1. So I'll open a case for trunk: "send APPLICATION_INIT event to additional AuxiliaryServices instead of hard-coded send to 'mapreduce.shuffle'" (1.a Do you have an idea whether to send it to all of them, or to use two sets of AuxiliaryServices: one that gets the event and one that doesn't?) 2. My branch-1 code only loads an optionally configured ShuffleProviderPlugin. I didn't touch the existing code that loads MapOutputServlet in the constructor of the TT. Hence, the user will have 1 or 2 shuffle-providers. > plugin for generic shuffle service > -- > > Key: MAPREDUCE-4049 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: performance, task, tasktracker >Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 >Reporter: Avner BenHanoch >Assignee: Avner BenHanoch > Labels: merge, plugin, rdma, shuffle > Fix For: 3.0.0 > > Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, > mapreduce-4049.patch > > > Support generic shuffle service as set of two plugins: ShuffleProvider & > ShuffleConsumer. > This will satisfy the following needs: > # Better shuffle and merge performance. For example: we are working on > shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, > or Infiniband) instead of using the current HTTP shuffle. Based on the fast > RDMA shuffle, the plugin can also utilize a suitable merge approach during > the intermediate merges. Hence, getting much better performance. > # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden > dependency of NodeManager with a specific version of mapreduce shuffle > (currently targeted to 0.24.0). > References: > # Hadoop Acceleration through Network Levitated Merging, by Prof. 
Weikuan Yu > from Auburn University with others, > [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] > # I am attaching 2 documents with suggested Top Level Design for both plugins > (currently, based on 1.0 branch) > # I am providing link for downloading UDA - Mellanox's open source plugin > that implements generic shuffle service using RDMA and levitated merge. > Note: At this phase, the code is in C++ through JNI and you should consider > it as beta only. Still, it can serve anyone that wants to implement or > contribute to levitated merge. (Please be advised that levitated merge is > mostly suit in very fast networks) - > [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
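The "1 or 2 shuffle-providers" arrangement described in the comment above, where the default provider is always loaded and an optionally configured plugin is added next to it, might be sketched as follows. Everything here is hypothetical: the interface, the class names and the configuration key are invented for illustration, and a plain `Map` with `Class.forName` stands in for the Hadoop configuration and reflection helpers that the branch-1 patch would presumably use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical provider interface; the real plugin API is defined by the patch. */
interface ShuffleProvider {
    String name();
}

/** Stands in for the built-in HTTP shuffle that is always present. */
final class DefaultHttpProvider implements ShuffleProvider {
    public String name() { return "default-http"; }
}

/** Stands in for a third-party plugin such as an RDMA based provider. */
final class ExampleRdmaProvider implements ShuffleProvider {
    public String name() { return "example-rdma"; }
}

final class ProviderLoader {
    // Hypothetical configuration key, not a real Hadoop property.
    static final String PLUGIN_KEY = "example.shuffle.provider.plugin.class";

    static List<ShuffleProvider> load(Map<String, String> conf) {
        List<ShuffleProvider> providers = new ArrayList<>();
        providers.add(new DefaultHttpProvider()); // the default provider is never replaced
        String className = conf.get(PLUGIN_KEY);
        if (className != null && !className.isEmpty()) {
            try {
                Class<? extends ShuffleProvider> cls =
                        Class.forName(className).asSubclass(ShuffleProvider.class);
                providers.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | ClassCastException e) {
                // The plugin is additive, so a bad plugin is reported and the default keeps working.
                System.err.println("Could not load shuffle provider " + className + ": " + e);
            }
        }
        return providers;
    }
}
```

Whether a plugin that fails to load should instead stop the TaskTracker at startup is exactly the point still under discussion in this thread, so the lenient handling in the `catch` block is only one of the two options.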
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542111#comment-13542111 ] Alejandro Abdelnur commented on MAPREDUCE-4049: --- bq. my 2nd point above was for the trunk. If that is the case, I think we should do that in a follow up JIRA and have there the patch for trunk and branch-1. bq. because a 3rd party shuffle-provider always runs in addition to the default shuffle-provider. The shuffle-provider class is a TaskTracker config, so it is the same for ALL jobs; meaning the TaskTracker will use always the same shuffle-provider class. no? > plugin for generic shuffle service > -- > > Key: MAPREDUCE-4049 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: performance, task, tasktracker >Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 >Reporter: Avner BenHanoch >Assignee: Avner BenHanoch > Labels: merge, plugin, rdma, shuffle > Fix For: 3.0.0 > > Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, > mapreduce-4049.patch > > > Support generic shuffle service as set of two plugins: ShuffleProvider & > ShuffleConsumer. > This will satisfy the following needs: > # Better shuffle and merge performance. For example: we are working on > shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, > or Infiniband) instead of using the current HTTP shuffle. Based on the fast > RDMA shuffle, the plugin can also utilize a suitable merge approach during > the intermediate merges. Hence, getting much better performance. > # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden > dependency of NodeManager with a specific version of mapreduce shuffle > (currently targeted to 0.24.0). > References: > # Hadoop Acceleration through Network Levitated Merging, by Prof. 
Weikuan Yu > from Auburn University with others, > [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] > # I am attaching 2 documents with suggested Top Level Design for both plugins > (currently, based on 1.0 branch) > # I am providing link for downloading UDA - Mellanox's open source plugin > that implements generic shuffle service using RDMA and levitated merge. > Note: At this phase, the code is in C++ through JNI and you should consider > it as beta only. Still, it can serve anyone that wants to implement or > contribute to levitated merge. (Please be advised that levitated merge is > mostly suit in very fast networks) - > [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542107#comment-13542107 ] Avner BenHanoch commented on MAPREDUCE-4049: Hi Alejandro, thanks for your comments! *My 2nd point above was for the trunk.* True, in MRv2 AuxiliaryServices are pluggable; still, they don't get APPLICATION_INIT events, hence they don't know how to map jobId->userId. We need to send this event to all AuxiliaryServices, or to define 2 groups of AuxiliaryServices: one that gets this event, and one that doesn't. *In the current code, this event is only sent to "mapreduce.shuffle" using a hard-coded string rather than relying on any conf settings*. For the branch-1 comments: Please notice that ShuffleProvider has different semantics than ShuffleConsumer, because a 3rd party shuffle-provider always runs in addition to the default shuffle-provider. Multiple jobs can run in parallel, resulting in various shuffleConsumers in parallel (in different Jobs/ReduceTasks). Hence, all possible providers should exist in parallel. That said, the semantics of the ShuffleProvider plugin are that it runs in addition to the default shuffle-provider. Hence the TT should not fail for that. (For the rest of your branch-1 comments: yes, you are right on all!) > plugin for generic shuffle service > -- > > Key: MAPREDUCE-4049 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4049 > Project: Hadoop Map/Reduce > Issue Type: Sub-task > Components: performance, task, tasktracker >Affects Versions: 1.0.3, 1.1.0, 2.0.0-alpha, 3.0.0 >Reporter: Avner BenHanoch >Assignee: Avner BenHanoch > Labels: merge, plugin, rdma, shuffle > Fix For: 3.0.0 > > Attachments: HADOOP-1.x.y.patch, Hadoop Shuffle Plugin Design.rtf, > mapreduce-4049.patch > > > Support generic shuffle service as set of two plugins: ShuffleProvider & > ShuffleConsumer. > This will satisfy the following needs: > # Better shuffle and merge performance. 
For example: we are working on > shuffle plugin that performs shuffle over RDMA in fast networks (10gE, 40gE, > or Infiniband) instead of using the current HTTP shuffle. Based on the fast > RDMA shuffle, the plugin can also utilize a suitable merge approach during > the intermediate merges. Hence, getting much better performance. > # Satisfy MAPREDUCE-3060 - generic shuffle service for avoiding hidden > dependency of NodeManager with a specific version of mapreduce shuffle > (currently targeted to 0.24.0). > References: > # Hadoop Acceleration through Network Levitated Merging, by Prof. Weikuan Yu > from Auburn University with others, > [http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf] > # I am attaching 2 documents with suggested Top Level Design for both plugins > (currently, based on 1.0 branch) > # I am providing link for downloading UDA - Mellanox's open source plugin > that implements generic shuffle service using RDMA and levitated merge. > Note: At this phase, the code is in C++ through JNI and you should consider > it as beta only. Still, it can serve anyone that wants to implement or > contribute to levitated merge. (Please be advised that levitated merge is > mostly suit in very fast networks) - > [http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=144&menu_section=69] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
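The two dispatch options discussed in the comment for this issue (notify only a hard-coded service name versus notify every registered auxiliary service) can be sketched roughly as follows. This is a minimal illustration, not the NodeManager code: `AuxService`, `RecordingService`, `Dispatcher`, and the service name "rdma.shuffle" are all hypothetical; only the name "mapreduce.shuffle" and the jobId->userId mapping come from the comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AuxInitSketch {

    /** Hypothetical stand-in for an auxiliary service; not the YARN AuxiliaryService API. */
    interface AuxService {
        void initApp(String user, String jobId);
    }

    /** Toy service that records the jobId -> userId mapping it learns from init events. */
    static final class RecordingService implements AuxService {
        final Map<String, String> jobToUser = new LinkedHashMap<>();

        @Override public void initApp(String user, String jobId) {
            jobToUser.put(jobId, user);
        }
    }

    /** Hypothetical dispatcher holding the configured services by name. */
    static final class Dispatcher {
        private final Map<String, AuxService> services = new LinkedHashMap<>();

        void register(String name, AuxService service) {
            services.put(name, service);
        }

        /** Behaviour described in the comment: only one hard-coded name is notified. */
        void initAppHardCoded(String user, String jobId) {
            AuxService service = services.get("mapreduce.shuffle");
            if (service != null) {
                service.initApp(user, jobId);
            }
        }

        /** First proposed alternative: every registered service is notified. */
        void initAppAll(String user, String jobId) {
            for (AuxService service : services.values()) {
                service.initApp(user, jobId);
            }
        }
    }

    public static void main(String[] args) {
        Dispatcher dispatcher = new Dispatcher();
        RecordingService rdma = new RecordingService();
        dispatcher.register("mapreduce.shuffle", new RecordingService());
        dispatcher.register("rdma.shuffle", rdma); // hypothetical third-party service name

        dispatcher.initAppHardCoded("alice", "job_1");
        dispatcher.initAppAll("bob", "job_2");
        System.out.println(rdma.jobToUser); // only the broadcast event reached this service
    }
}
```

With the hard-coded variant, a third-party service never learns which user owns a job; the broadcast variant gives every service the mapping at the cost of delivering events to services that may not need them, which is what the "2 groups" alternative in the comment would avoid.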
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542093#comment-13542093 ] Alejandro Abdelnur commented on MAPREDUCE-4049:

Avner, thanks for the clarification. I got confused because JIRA emailed an 'updated patch' message.

On #1, the ShuffleConsumerPlugin should be in this JIRA.

On #2, I assume the trunk version does not (will not) have a Map side because the ShuffleHandler is already pluggable. Given that, the Map side (ShuffleProvider) seems an artifact of the backport of this JIRA. Because of that, I think it is OK to have it here.

I assume you are working on updating the attached Hadoop-1 patch. Some comments on the current Hadoop-1 patch follow:

* Not having a ShuffleProviderPlugin in the TaskTracker should be a reason to fail the TaskTracker at startup, no?
* We should follow the same pattern as in trunk:
** Define an interface instead of an abstract class for ShuffleConsumerPlugin, with init(), fetchOutput(), createKVIterator(), getMergeThrowable() methods.
** Define a Context for ShuffleConsumerPlugin initialization.
** Use ReflectionUtils.newInstance() in ReduceTask to instantiate the ShuffleConsumerPlugin.
** Visibility/stability annotations are missing.
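The interface-plus-context pattern requested in the review comment for this issue might look roughly like the sketch below. The method names init(), fetchOutput(), createKVIterator() and getMergeThrowable() are taken from the comment; every signature, the `Context` fields, and the `InMemoryConsumer` class are assumptions made for illustration. Plain Java reflection stands in for Hadoop's ReflectionUtils.newInstance() so that the example compiles without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class ShufflePluginSketch {

    /** Hypothetical initialization context; the fields are illustrative only. */
    static final class Context {
        final String jobId;
        final int reduceId;

        Context(String jobId, int reduceId) {
            this.jobId = jobId;
            this.reduceId = reduceId;
        }
    }

    /** Method names come from the comment; parameter and return types are assumptions. */
    interface ShuffleConsumerPlugin {
        void init(Context context);
        boolean fetchOutput();
        Iterator<String> createKVIterator();
        Throwable getMergeThrowable();
    }

    /** Toy implementation that "fetches" two in-memory records. */
    static final class InMemoryConsumer implements ShuffleConsumerPlugin {
        private Context context;
        private List<String> records = Collections.emptyList();

        @Override public void init(Context context) {
            this.context = context;
        }

        @Override public boolean fetchOutput() {
            records = Arrays.asList(context.jobId + ":a", context.jobId + ":b");
            return true;
        }

        @Override public Iterator<String> createKVIterator() {
            return records.iterator();
        }

        @Override public Throwable getMergeThrowable() {
            return null; // nothing can fail in this toy merge
        }
    }

    /** Plain-Java stand-in for instantiating the plugin from a configured class name. */
    static ShuffleConsumerPlugin load(String className, Context context) {
        try {
            ShuffleConsumerPlugin plugin = Class.forName(className)
                    .asSubclass(ShuffleConsumerPlugin.class)
                    .getDeclaredConstructor()
                    .newInstance();
            plugin.init(context);
            return plugin;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load shuffle plugin " + className, e);
        }
    }

    public static void main(String[] args) {
        ShuffleConsumerPlugin plugin =
                load(InMemoryConsumer.class.getName(), new Context("job_1", 0));
        if (plugin.fetchOutput()) {
            plugin.createKVIterator().forEachRemaining(System.out::println);
        }
    }
}
```

Using an interface rather than an abstract class keeps third-party implementations free to choose their own base class, and passing a single context object means new initialization parameters can be added later without changing the init() signature.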
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542088#comment-13542088 ] Avner BenHanoch commented on MAPREDUCE-4049:

Hi Alejandro,

On Monday I only removed obsolete attachments for the trunk and kept just the last one we submitted. Speaking of that, please let me know:
1. Do you prefer the patch for branch-1 in this issue or in a separate issue?
2. There is still work to do in the trunk for ShuffleProvider - see [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-4049?focusedCommentId=13444026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13444026]. Do you want me to address it here or in a separate issue?

Thank you kindly,
Avner
[jira] [Commented] (MAPREDUCE-4049) plugin for generic shuffle service
[ https://issues.apache.org/jira/browse/MAPREDUCE-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542077#comment-13542077 ] Alejandro Abdelnur commented on MAPREDUCE-4049: --- Avner, it seems the attachment you posted on Monday for branch-1 is MIA, would you please post it again? thx.
[jira] [Commented] (MAPREDUCE-1688) A failing retry'able notification in JobEndNotifier can affect notifications of other jobs.
[ https://issues.apache.org/jira/browse/MAPREDUCE-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542053#comment-13542053 ] Olga Shen commented on MAPREDUCE-1688:

MAPREDUCE-3028 only added a timeout setting in org.apache.hadoop.mapreduce.v2.app.JobEndNotifier. Would you apply the timeout setting to org.apache.hadoop.mapred.JobEndNotifier for MRv1 users?

> A failing retry'able notification in JobEndNotifier can affect notifications of other jobs.
> ---
>
> Key: MAPREDUCE-1688
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1688
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.20.1, 1.0.0, 1.0.2, 1.0.3
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Ravi Prakash
>
> The JobTracker puts all the notification commands into a delay-queue. It has a single thread that loops through this queue and sends out the notifications. When it hits failures with any notification which is configured to be retried via {{job.end.retry.attempts}} and {{job.end.retry.interval}}, the notification is queued back again. A single notification with a sufficiently large number of configured retries and which consistently fails will affect other notifications in the queue.
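The queueing behaviour described in the issue text can be illustrated with a small, self-contained sketch built on java.util.concurrent.DelayQueue. It is not the JobEndNotifier source: `Notification`, `consumeAttempt`, `drain`, and the example URLs are hypothetical, and the retry count and interval only loosely mirror {{job.end.retry.attempts}} and {{job.end.retry.interval}}. The point it shows is that one consumer loop serves every notification, so each retry of a failing notification takes a turn on the same thread as everyone else's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

final class NotifierQueueSketch {

    /** Hypothetical notification record with a retry budget and a retry interval. */
    static final class Notification implements Delayed {
        final String uri;
        private final long retryIntervalMs;
        private int attemptsLeft;
        private long readyAtMs = System.currentTimeMillis();

        Notification(String uri, int retryAttempts, long retryIntervalMs) {
            this.uri = uri;
            this.attemptsLeft = retryAttempts + 1; // first try plus configured retries
            this.retryIntervalMs = retryIntervalMs;
        }

        /** Uses up one attempt and pushes the ready time out; true if a retry remains. */
        boolean consumeAttempt() {
            attemptsLeft--;
            readyAtMs = System.currentTimeMillis() + retryIntervalMs;
            return attemptsLeft > 0;
        }

        @Override public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    /** Single consumer loop: sends each due notification and re-queues failures. */
    static List<String> drain(DelayQueue<Notification> queue, Predicate<Notification> sender) {
        List<String> log = new ArrayList<>();
        try {
            while (!queue.isEmpty()) {
                Notification n = queue.take(); // blocks until the head is due
                boolean ok = sender.test(n);   // a slow send stalls everything queued behind it
                log.add(n.uri + (ok ? " ok" : " failed"));
                if (!ok && n.consumeAttempt()) {
                    queue.put(n);              // queued back again, as in the description
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }

    public static void main(String[] args) {
        DelayQueue<Notification> queue = new DelayQueue<>();
        queue.put(new Notification("http://bad.example/notify", 2, 10));
        queue.put(new Notification("http://good.example/notify", 0, 10));
        drain(queue, n -> !n.uri.contains("bad")).forEach(System.out::println);
    }
}
```

In this toy version the sender returns immediately, so the failing notification costs little. If each attempt instead blocked on a connection without a timeout, every other job's notification would wait behind it, which is presumably why the comment asks for the timeout setting to be applied to the MRv1 notifier as well.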