[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN
[ https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246166#comment-13246166 ]

Sharad Agarwal commented on MAPREDUCE-3315:
-------------------------------------------

Thanks, Nikhil, for the patch. Will have a look at it.

Master-Worker Application on YARN
---------------------------------
Key: MAPREDUCE-3315
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
Project: Hadoop Map/Reduce
Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Fix For: 0.24.0
Attachments: MAPREDUCE-3315.patch

Currently, master-worker scenarios are force-fit into Map-Reduce. With YARN, they can be first class, which would benefit real-time and near-real-time workloads and make more effective use of cluster resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN
[ https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245108#comment-13245108 ]

Sharad Agarwal commented on MAPREDUCE-3315:
-------------------------------------------

We can have it in the hadoop-yarn-applications module. The example app could be a sub-module of the master-worker app.

Master-Worker Application on YARN
---------------------------------
Key: MAPREDUCE-3315
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
Project: Hadoop Map/Reduce
Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Fix For: 0.24.0

Currently, master-worker scenarios are force-fit into Map-Reduce. With YARN, they can be first class, which would benefit real-time and near-real-time workloads and make more effective use of cluster resources.
[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN
[ https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235534#comment-13235534 ]

Sharad Agarwal commented on MAPREDUCE-3315:
-------------------------------------------

bq. Should I use Hadoop IPC or RMI
Hadoop IPC.

bq. Should the Master be in the ApplicationManager or be run as a Container?
For the Master, you need to write a YARN ApplicationMaster (AM); see http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
The AM runs in one of the containers on the cluster.

Also, it would be good to clearly separate the user API (for writing master-worker apps) from the runtime implementation.

Master-Worker Application on YARN
---------------------------------
Key: MAPREDUCE-3315
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
Project: Hadoop Map/Reduce
Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Fix For: 0.24.0

Currently, master-worker scenarios are force-fit into Map-Reduce. With YARN, they can be first class, which would benefit real-time and near-real-time workloads and make more effective use of cluster resources.
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206860#comment-13206860 ]

Sharad Agarwal commented on MAPREDUCE-3846:
-------------------------------------------

Looks good. We should add a test case that recovers into a third generation.

Restarted+Recovered AM hangs in some corner cases
-------------------------------------------------
Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
Attachments: MAPREDUCE-3846-20120210.txt

[~karams] found this while testing the AM restart/recovery feature. After the first-generation AM crashes (manually killed with kill -9), the second-generation AM starts but hangs after a while.
[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing
[ https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207527#comment-13207527 ]

Sharad Agarwal commented on MAPREDUCE-3858:
-------------------------------------------

+1, looks good. Thanks, Tom, for the patch.

Task attempt failure during commit results in task never completing
-------------------------------------------------------------------
Key: MAPREDUCE-3858
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Reporter: Tom White
Assignee: Tom White
Priority: Critical
Attachments: MAPREDUCE-3858.patch

On a terasort job, a task attempt failed during the commit phase. Another attempt was rescheduled, but when it tried to commit, it failed:
{noformat}
attempt_1329019187148_0083_r_000586_0 already given a go for committing the task output, so killing attempt_1329019187148_0083_r_000586_1
{noformat}
The job hung as new attempts kept getting scheduled, only to fail during commit.
[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases
[ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205325#comment-13205325 ]

Sharad Agarwal commented on MAPREDUCE-3846:
-------------------------------------------

Should this be marked as a duplicate of MAPREDUCE-3802? It is exactly the same behaviour: the AM hanging/failing in the third generation.

Restarted+Recovered AM hangs in some corner cases
-------------------------------------------------
Key: MAPREDUCE-3846
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

[~karams] found this while testing the AM restart/recovery feature. After the first-generation AM crashes (manually killed with kill -9), the second-generation AM starts but hangs after a while.
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203325#comment-13203325 ]

Sharad Agarwal commented on MAPREDUCE-3802:
-------------------------------------------

bq. need to understand a little bit better how these names are determined
The task attempt ids are unique across all generations of the AM. This is to prevent a remote task attempt from a previous AM generation joining the current AM. The assumption is that there won't be more than 1000 attempts of a task in one AM run. The suffix part of the task attempt id is determined as (AMGeneration - 1) * 1000: for the first AM it starts from 0, for the second from 1000, for the third from 2000, and so on.

If an MR AM dies twice it looks like the process freezes
--------------------------------------------------------
Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: applicationmaster, mrv2
Affects Versions: 0.23.1, 0.24.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
Attachments: syslog

It looks like recovering from an MR AM dying works very well on a single failure, but if it fails multiple times we appear to get into a livelock situation.
{noformat}
yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 input output
12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 17
12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application application_1328302034486_0003 to ResourceManager at HOST/IP:8040
12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: http://HOST:8088/proxy/application_1328302034486_0003/
12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in uber mode : false
12/02/03 21:07:03 INFO mapreduce.Job: map 0% reduce 0%
12/02/03 21:07:09 INFO mapreduce.Job: map 5% reduce 0%
12/02/03 21:07:10 INFO mapreduce.Job: map 17% reduce 0%
#KILLED AM with kill -9 here
12/02/03 21:07:16 INFO mapreduce.Job: map 29% reduce 0%
12/02/03 21:07:17 INFO mapreduce.Job: map 35% reduce 0%
12/02/03 21:07:30 INFO mapreduce.Job: map 52% reduce 0%
12/02/03 21:07:35 INFO mapreduce.Job: map 58% reduce 0%
12/02/03 21:07:37 INFO mapreduce.Job: map 70% reduce 0%
12/02/03 21:07:41 INFO mapreduce.Job: map 76% reduce 0%
12/02/03 21:07:43 INFO mapreduce.Job: map 82% reduce 0%
12/02/03 21:07:44 INFO mapreduce.Job: map 88% reduce 0%
12/02/03 21:07:47 INFO mapreduce.Job: map 94% reduce 0%
12/02/03 21:07:49 INFO mapreduce.Job: map 100% reduce 0%
12/02/03 21:07:53 INFO mapreduce.Job: map 100% reduce 3%
12/02/03 21:08:00 INFO mapreduce.Job: map 100% reduce 6%
12/02/03 21:08:06 INFO mapreduce.Job: map 100% reduce 10%
12/02/03 21:08:12 INFO mapreduce.Job: map 100% reduce 13%
12/02/03 21:08:18 INFO mapreduce.Job: map 100% reduce 16%
#killed AM with kill -9 here
12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 0 time(s).
12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 1 time(s).
12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 2 time(s).
12/02/03 21:08:26 INFO mapreduce.Job: map 64% reduce 16%
#It never makes any more progress...
{noformat}
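The generation-based attempt numbering Sharad describes above can be sketched in plain Java. This is an illustrative sketch only; the class and method names here are hypothetical, not the actual MRv2 code:

```java
// Sketch of the attempt-numbering scheme described in the comment: each AM
// generation gets a disjoint block of 1000 attempt numbers, so attempts from
// a previous AM life can never collide with attempts of the current one.
public class AttemptNumbering {
    // Assumed ceiling on attempts per task within a single AM life.
    static final int ATTEMPTS_PER_GENERATION = 1000;

    // First attempt number handed out by the given AM generation (1-based).
    static int firstAttemptNumber(int amGeneration) {
        return (amGeneration - 1) * ATTEMPTS_PER_GENERATION;
    }

    // Recover which AM generation produced a given attempt number.
    static int generationOf(int attemptNumber) {
        return attemptNumber / ATTEMPTS_PER_GENERATION + 1;
    }

    public static void main(String[] args) {
        System.out.println(firstAttemptNumber(1)); // 0
        System.out.println(firstAttemptNumber(2)); // 1000
        System.out.println(firstAttemptNumber(3)); // 2000
        System.out.println(generationOf(2003));    // 3
    }
}
```

Because the blocks are disjoint, a stray attempt from an older AM life is identifiable by its number alone, which is the property the comment relies on.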
[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
[ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203332#comment-13203332 ]

Sharad Agarwal commented on MAPREDUCE-3802:
-------------------------------------------

The bug is in TaskImpl, over here:
{code}
//attempt ids are generated based on MR app startCount so that attempts
//from previous lives don't overstep the current one.
//this assumes that a task won't have more than 1000 attempts in its single
//life
nextAttemptNumber = (startCount - 1) * 1000;
{code}
The completed task could be from any earlier AM generation, not just from the previous one. I am looking into a way to fix this.

If an MR AM dies twice it looks like the process freezes
--------------------------------------------------------
Key: MAPREDUCE-3802
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
Project: Hadoop Map/Reduce
Issue Type: Sub-task
Components: applicationmaster, mrv2
Affects Versions: 0.23.1, 0.24.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
Attachments: syslog

It looks like recovering from an MR AM dying works very well on a single failure, but if it fails multiple times we appear to get into a livelock situation.
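The failure mode Sharad points at can be made concrete with a small sketch: a recovered attempt may come from any earlier AM generation, so logic that assumes recovered attempts belong to the immediately previous generation breaks in the third generation. All names below are hypothetical, not the actual TaskImpl code:

```java
// Sketch of the bug: attempt numbers encode the AM generation in blocks
// of 1000. A check that only recognizes attempts from generation
// (startCount - 1) misses attempts that completed two or more lives ago.
public class RecoveryNumbering {
    static final int BLOCK = 1000;

    // Buggy assumption: recovered attempts came from the previous generation.
    static boolean fromPreviousGeneration(int attemptNumber, int startCount) {
        return attemptNumber / BLOCK == startCount - 2;
    }

    // Safer check: the attempt merely has to predate the current generation.
    static boolean fromAnyEarlierGeneration(int attemptNumber, int startCount) {
        return attemptNumber / BLOCK < startCount - 1;
    }

    public static void main(String[] args) {
        // Third-generation AM (startCount = 3) recovering an attempt that
        // completed in the FIRST generation (attempt number 4):
        System.out.println(fromPreviousGeneration(4, 3));   // false: attempt goes unrecognized
        System.out.println(fromAnyEarlierGeneration(4, 3)); // true
    }
}
```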
[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time
[ https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196747#comment-13196747 ]

Sharad Agarwal commented on MAPREDUCE-3711:
-------------------------------------------

bq. That is the real bug. We should instead be moving the committed outputs of a single task from the $(JobAttemptBaseDir - 1) dir to the $(JobAttemptBaseDir) dir.
True, that's *the* bug; recoverJob is not required. To save HDFS trips, a simple check for non-zero reduces to skip map-output recovery in RecoveryService should be sufficient.

AppMaster recovery for Medium to large jobs take long time
----------------------------------------------------------
Key: MAPREDUCE-3711
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Robert Joseph Evans
Priority: Blocker

Reported by [~karams], with yarn.resourcemanager.am.max-retries=2. Ran test cases with a sort job at 350 scale, having 16800 maps and 680 reduces:
1. 70 secs after job submission, the AM was killed with kill -9; around 3900 maps were completed and 680 reduces were scheduled. The second AM restarted and the job completed in 980 secs. The AM took very little time to recover.
2. 150 secs after job submission, the AM was killed with kill -9; around 90% of the maps were completed and 680 reduces were scheduled. The second AM restarted and the job completed in 1000 secs. The AM recovered.
3. 150 secs after job submission, the AM was killed with kill -9; almost all maps were completed and only the 680 reduces were running. Recovery was too slow: the AM was still recovering after 1 hr 40 mins, when I killed the run.
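As I read the suggestion above, map outputs only need to be recovered from HDFS for map-only jobs (with reduces, map output is intermediate shuffle data), so a non-zero reduce count lets the RecoveryService skip those HDFS round trips. A minimal sketch of that guard, with hypothetical names rather than the actual RecoveryService API:

```java
// Hedged sketch of the proposed optimization: skip map-output recovery
// entirely when the job has reduces, since only map-only jobs commit
// their map output to the job's output location.
public class RecoverySkip {
    static boolean shouldRecoverMapOutput(int numReduces) {
        // With reduces present, map output is local shuffle data, not
        // committed output, so there is nothing to recover from HDFS.
        return numReduces == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldRecoverMapOutput(0));   // map-only job: recover
        System.out.println(shouldRecoverMapOutput(680)); // the sort job above: skip
    }
}
```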
[jira] [Commented] (MAPREDUCE-3634) All daemons should crash instead of hanging around when their EventHandlers get exceptions
[ https://issues.apache.org/jira/browse/MAPREDUCE-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195964#comment-13195964 ]

Sharad Agarwal commented on MAPREDUCE-3634:
-------------------------------------------

Can we set the Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY default to false instead? The number of daemons is much smaller and would remain almost constant, as opposed to the number of tests. Anybody adding a test or using the API should not get surprised by having to set this property.

All daemons should crash instead of hanging around when their EventHandlers get exceptions
------------------------------------------------------------------------------------------
Key: MAPREDUCE-3634
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3634
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Fix For: 0.23.1
Attachments: MAPREDUCE-3634-20120118.1.txt, MAPREDUCE-3634-20120119.txt

We should make sure that the daemons crash in case the dispatchers get exceptions and stop processing. That way we will be debugging RM/NM/AM crashes instead of hard-to-track hanging jobs.
[jira] [Commented] (MAPREDUCE-3489) EventDispatcher should have a call-back on errors for aiding tests
[ https://issues.apache.org/jira/browse/MAPREDUCE-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194633#comment-13194633 ]

Sharad Agarwal commented on MAPREDUCE-3489:
-------------------------------------------

Currently in the AsyncDispatcher, exitOnDispatchException defaults to false, and the daemons don't set it to true either. Is this the intended behaviour? I think daemons should exit on a dispatcher error, while test cases can handle it differently. Right?

EventDispatcher should have a call-back on errors for aiding tests
------------------------------------------------------------------
Key: MAPREDUCE-3489
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3489
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Sharad Agarwal

If one of the dispatched events generates an exception, the dispatcher kills the JVM via a System.exit. Unit tests then end up not running, but they don't fail either. TestTaskAttempt is currently running like this; previously, TestRecovery and TestJobHistoryParsing have done the same. Most of the tests would need to be looked at.
[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time
[ https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194486#comment-13194486 ]

Sharad Agarwal commented on MAPREDUCE-3711:
-------------------------------------------

Karam, can you upload the AM logs for the case when recovery was taking too long?

AppMaster recovery for Medium to large jobs take long time
----------------------------------------------------------
Key: MAPREDUCE-3711
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Robert Joseph Evans
Priority: Blocker

Reported by [~karams], with yarn.resourcemanager.am.max-retries=2. Ran test cases with a sort job at 350 scale, having 16800 maps and 680 reduces:
1. 70 secs after job submission, the AM was killed with kill -9; around 3900 maps were completed and 680 reduces were scheduled. The second AM restarted and the job completed in 980 secs. The AM took very little time to recover.
2. 150 secs after job submission, the AM was killed with kill -9; around 90% of the maps were completed and 680 reduces were scheduled. The second AM restarted and the job completed in 1000 secs. The AM recovered.
3. 150 secs after job submission, the AM was killed with kill -9; almost all maps were completed and only the 680 reduces were running. Recovery was too slow: the AM was still recovering after 1 hr 40 mins, when I killed the run.
[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up
[ https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176621#comment-13176621 ]

Sharad Agarwal commented on MAPREDUCE-3490:
-------------------------------------------

Currently, all bookkeeping and calculations in RMContainerAllocator are based on attempts, not tasks; there is a one-to-one mapping between attempt and container, and completedMapPercent is currently the completed map *attempts* percentage.
I looked in more detail: we can simply change this to reflect the completed map *tasks* percentage. All the information is readily available in Job, so we need not maintain these counts in RMContainerAllocator; Arun also mentioned this. I am attaching a patch which drastically simplifies this, without the need to add new events. I have also removed the completedMaps and completedReduces counts in RMContainerAllocator. Arun/Vinod - see if this makes sense?

RMContainerAllocator counts failed maps towards Reduce ramp up
--------------------------------------------------------------
Key: MAPREDUCE-3490
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MR-3490-alternate.patch

The RMContainerAllocator does not differentiate between failed and successful maps while calculating whether reduce tasks are ready to launch. Failed tasks are also counted towards total completed tasks. Example: 4 failed maps, 10 total maps. Map%complete = 4/14 * 100 instead of being 0.
[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up
[ https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176623#comment-13176623 ]

Sharad Agarwal commented on MAPREDUCE-3490:
-------------------------------------------

Note: I am on vacation for the rest of the week; please expect slow or no response till then.

RMContainerAllocator counts failed maps towards Reduce ramp up
--------------------------------------------------------------
Key: MAPREDUCE-3490
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MR-3490-alternate.patch, MR-3490-alternate1.patch

The RMContainerAllocator does not differentiate between failed and successful maps while calculating whether reduce tasks are ready to launch. Failed tasks are also counted towards total completed tasks. Example: 4 failed maps, 10 total maps. Map%complete = 4/14 * 100 instead of being 0.
[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up
[ https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174797#comment-13174797 ]

Sharad Agarwal commented on MAPREDUCE-3490:
-------------------------------------------

Hi Arun - I just had a brief look at the patch. It seems you don't need the new container-allocator event types: RMContainerAllocator is already getting the CONTAINER_FAILED event. completedMaps includes both succeeded and failed maps; instead of using completedMaps in the calculation, it can use succeededMaps, where succeededMaps = completedMaps - failedMaps.

RMContainerAllocator counts failed maps towards Reduce ramp up
--------------------------------------------------------------
Key: MAPREDUCE-3490
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch

The RMContainerAllocator does not differentiate between failed and successful maps while calculating whether reduce tasks are ready to launch. Failed tasks are also counted towards total completed tasks. Example: 4 failed maps, 10 total maps. Map%complete = 4/14 * 100 instead of being 0.
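The accounting fix discussed above can be sketched with the issue's own example (4 failed maps, 10 total, none succeeded). Names are illustrative, not RMContainerAllocator's actual fields:

```java
// Sketch of the reduce ramp-up accounting: failed map attempts must not
// count toward the map-completion fraction that gates reduce launches.
public class RampUp {
    // Buggy accounting, matching the 4/14 example in the issue report:
    // failed attempts inflate both numerator and denominator.
    static float buggyFraction(int succeededMaps, int failedMaps, int totalMaps) {
        return (float) (succeededMaps + failedMaps) / (totalMaps + failedMaps);
    }

    // Fixed accounting: only succeeded map tasks count,
    // where succeededMaps = completedMaps - failedMaps.
    static float fixedFraction(int succeededMaps, int totalMaps) {
        return (float) succeededMaps / totalMaps;
    }

    public static void main(String[] args) {
        System.out.println(buggyFraction(0, 4, 10)); // 4/14, nonzero despite zero successes
        System.out.println(fixedFraction(0, 10));    // 0.0
    }
}
```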
[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up
[ https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175319#comment-13175319 ]

Sharad Agarwal commented on MAPREDUCE-3490:
-------------------------------------------

bq. I think we need to stop tracking this in RMContainerAllocator and rather rely on Job. For now, my patch seems the closest approximation to that (being conservative).
Doing it in Job or in RMContainerAllocator is a separate discussion; I don't think this patch deals with anything like that. It adds two new events for RMContainerAllocator itself. I am proposing that we don't need these extra events, because this information (failed-attempt info) is already available in RMContainerAllocator.

RMContainerAllocator counts failed maps towards Reduce ramp up
--------------------------------------------------------------
Key: MAPREDUCE-3490
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MR-3490-alternate.patch

The RMContainerAllocator does not differentiate between failed and successful maps while calculating whether reduce tasks are ready to launch. Failed tasks are also counted towards total completed tasks. Example: 4 failed maps, 10 total maps. Map%complete = 4/14 * 100 instead of being 0.
[jira] [Commented] (MAPREDUCE-3473) A single task tracker failure shouldn't result in Job failure
[ https://issues.apache.org/jira/browse/MAPREDUCE-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174636#comment-13174636 ]

Sharad Agarwal commented on MAPREDUCE-3473:
-------------------------------------------

A single machine failure doesn't cause the job to fail; that's the whole point of Hadoop (smile). We are missing the difference between *Task* and *TaskAttempt*: a task gets 4 chances (task attempts) by default before the job is declared failed. I think this issue can be resolved as Invalid.

A single task tracker failure shouldn't result in Job failure
-------------------------------------------------------------
Key: MAPREDUCE-3473
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3473
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: tasktracker
Affects Versions: 0.20.205.0, 0.23.0
Reporter: Eli Collins

Currently some task failures may result in job failures. E.g. a local TT disk failure seen in TaskLauncher#run, TaskRunner#run, or MapTask#run is visible to and can hang the JobClient, causing the job to fail. Job execution should always be able to survive a task failure if there are sufficient resources.
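The Task vs. TaskAttempt distinction above can be sketched as a retry budget: a task fails (and with it the job, by default) only after all of its attempts fail. The default cap of 4 corresponds to the mapreduce.map.maxattempts / mapreduce.reduce.maxattempts settings; the class below is an illustrative sketch, not MapReduce code:

```java
// Sketch of task-attempt retries: one successful TaskAttempt completes the
// Task; the Task fails only when the whole attempt budget is exhausted.
public class TaskRetry {
    // attemptOutcomes[i] is whether attempt i would succeed if run.
    // Returns true if the Task succeeds within maxAttempts tries.
    static boolean runTask(boolean[] attemptOutcomes, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts && attempt < attemptOutcomes.length; attempt++) {
            if (attemptOutcomes[attempt]) {
                return true; // one successful attempt is enough
            }
        }
        return false; // all attempts failed -> Task (and, by default, Job) fails
    }

    public static void main(String[] args) {
        // First attempt lost to a machine failure, second succeeds:
        System.out.println(runTask(new boolean[]{false, true}, 4)); // true
        // Four straight failures exhaust the default budget:
        System.out.println(runTask(new boolean[]{false, false, false, false}, 4)); // false
    }
}
```

This is why a single task tracker failure costs one attempt, not the job.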
[jira] [Commented] (MAPREDUCE-3489) Unit tests failing silently
[ https://issues.apache.org/jira/browse/MAPREDUCE-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160716#comment-13160716 ]

Sharad Agarwal commented on MAPREDUCE-3489:
-------------------------------------------

Definitely, System.exit is not the right thing in a library. Sigh - I knew that when I wrote it and intended to remove it, but couldn't get a chance to get to it. I think we should get rid of it and instead have an error-handling callback registered with the Dispatcher.

Unit tests failing silently
---------------------------
Key: MAPREDUCE-3489
URL: https://issues.apache.org/jira/browse/MAPREDUCE-3489
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Priority: Critical

If one of the dispatched events generates an exception, the dispatcher kills the JVM via a System.exit. Unit tests then end up not running, but they don't fail either. TestTaskAttempt is currently running like this; previously, TestRecovery and TestJobHistoryParsing have done the same. Most of the tests would need to be looked at.
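The callback alternative proposed above can be sketched as follows: instead of the dispatcher calling System.exit when an event handler throws, it routes the error to a callback that the host (a daemon, or a test) registers. The interface names here are hypothetical, not the actual AsyncDispatcher API:

```java
import java.util.function.Consumer;

// Sketch of a dispatcher with a registered error callback: a daemon can
// register a callback that shuts the process down, while a test can
// register one that records the error and fails the test loudly.
public class CallbackDispatcher {
    private Consumer<Throwable> errorHandler = t -> { /* default: ignore */ };

    void registerErrorHandler(Consumer<Throwable> handler) {
        this.errorHandler = handler;
    }

    // Dispatch one event; route any handler exception to the callback
    // rather than killing the JVM.
    void dispatch(Runnable eventHandler) {
        try {
            eventHandler.run();
        } catch (Throwable t) {
            errorHandler.accept(t);
        }
    }

    public static void main(String[] args) {
        CallbackDispatcher d = new CallbackDispatcher();
        final boolean[] sawError = {false};
        d.registerErrorHandler(t -> sawError[0] = true); // a test records the error
        d.dispatch(() -> { throw new RuntimeException("handler blew up"); });
        System.out.println(sawError[0]); // true: the test can fail, the JVM survives
    }
}
```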
[jira] [Commented] (MAPREDUCE-3473) Task failures shouldn't result in Job failures
[ https://issues.apache.org/jira/browse/MAPREDUCE-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13159158#comment-13159158 ] Sharad Agarwal commented on MAPREDUCE-3473: --- Note this is *task* failure, NOT task-attempt failure. A task failure would mean losing the processing of the corresponding InputSplit. Not all applications would be OK with that. Explicitly setting it to a non-zero value makes sense, so that losing data doesn't come as a surprise to applications. Task failures shouldn't result in Job failures --- Key: MAPREDUCE-3473 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3473 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.20.205.0, 0.23.0 Reporter: Eli Collins Currently some task failures may result in job failures. E.g. a local TT disk failure seen in TaskLauncher#run, TaskRunner#run, or MapTask#run is visible to and can hang the JobClient, causing the job to fail. Job execution should always be able to survive a task failure if there are sufficient resources.
[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly
[ https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151029#comment-13151029 ] Sharad Agarwal commented on MAPREDUCE-3402: --- Just FYI, org.apache.hadoop.mapreduce.v2.app.MRAppBenchmark can be used to benchmark the AM, mainly for memory usage, job latencies, and state-machine transitions. However, it doesn't capture remoting/RPC issues, as it doesn't run on a real cluster. AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly --- Key: MAPREDUCE-3402 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3402 Project: Hadoop Map/Reduce Issue Type: Bug Components: applicationmaster, mrv2 Affects Versions: 0.23.0 Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 0.23.1 The world was rosier before October 19-25, [~karams] says. The 100K 1-second-map sleep job used to take around 800 secs, i.e. 13-14 mins. It now runs for 45 mins and still manages to complete only about 45K tasks. One or more of the flurry of commits for 0.23.0 deserve(s) the blame.
[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN
[ https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13140059#comment-13140059 ] Sharad Agarwal commented on MAPREDUCE-3315: --- Some thoughts: - The AM decides the Tasks. - Workers keep polling the AM to get the next Task instance. - Should be able to dynamically increase or shut down workers. - Have minWorkers, maxWorkers, keepAlive. Master-Worker Application on YARN - Key: MAPREDUCE-3315 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Sharad Agarwal Assignee: Sharad Agarwal Fix For: 0.24.0 Currently master-worker scenarios are force-fit into Map-Reduce. Now with YARN, these can be first class, and would benefit real/near-realtime workloads and be more effective in using the cluster resources.
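The polling protocol and pool bounds from the bullets above can be sketched as follows. All names here are hypothetical (this is not from the attached patch): the master owns the task list, workers pull tasks one at a time, and the worker pool is bounded by minWorkers/maxWorkers.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the master side of the master-worker app:
 *  the AM decides the Tasks; workers poll for the next Task instance;
 *  the pool can grow up to maxWorkers while there is a backlog. */
class MasterSketch {
    private final Queue<String> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicInteger workers = new AtomicInteger(0);
    final int minWorkers, maxWorkers;

    MasterSketch(int minWorkers, int maxWorkers) {
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
    }

    void submit(String task) { tasks.add(task); }

    /** Called from a worker's poll loop; null means "no work right now". */
    String nextTask() { return tasks.poll(); }

    /** Scale-up decision: grow while there is backlog and room in the pool. */
    boolean shouldAddWorker() {
        return !tasks.isEmpty() && workers.get() < maxWorkers;
    }

    void workerStarted() { workers.incrementAndGet(); }
    void workerStopped() { workers.decrementAndGet(); }
}
```

A keepAlive timeout would plug into workerStopped(): an idle worker that polls null for longer than keepAlive shuts itself down, as long as the pool stays above minWorkers.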
[jira] [Commented] (MAPREDUCE-3274) Race condition in MR App Master Preemption can cause a deadlock
[ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136881#comment-13136881 ] Sharad Agarwal commented on MAPREDUCE-3274: --- JVM with ID: jvm_1319242394842_0065_r_08 given task: attempt_1319242394842_0065_r_04_0 - there seems to be something wrong here. A JVM with a particular ID should always be given the corresponding task. Race condition in MR App Master Preemption can cause a deadlock --- Key: MAPREDUCE-3274 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2, scheduler Affects Versions: 0.23.0, 0.24.0 Reporter: Robert Joseph Evans Assignee: Robert Joseph Evans Priority: Critical Fix For: 0.23.0, 0.24.0 There appears to be a race condition in the MR App Master in relation to preempting reducers to let a mapper run. In the particular case that I have been debugging, a reducer was selected for preemption that did not have a container assigned to it yet. When the container became available, that reducer started running, and the previous TA_KILL event appears to have been ignored.
[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13133597#comment-13133597 ] Sharad Agarwal commented on MAPREDUCE-2708: --- bq. and the AM restarted with 83% map and 0% reduce Great! It worked. BTW Vinod, when the recovery module is unable to parse the history file (due to the HDFS bug), it should fall back to restarting the job. Just curious: did you notice that? [MR-279] Design and implement MR Application Master recovery Key: MAPREDUCE-2708 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: applicationmaster, mrv2 Affects Versions: 0.23.0 Reporter: Sharad Agarwal Assignee: Sharad Agarwal Priority: Blocker Fix For: 0.23.0 Attachments: MAPREDUCE-2708-20111021.1.txt, MAPREDUCE-2708-20111021.txt, MAPREDUCE-2708-20111022.txt, mr2708_v1.patch, mr2708_v2.patch Design recovery of the MR AM from crashes/node failures. The running job should recover from the state it left off at.
[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13132806#comment-13132806 ] Sharad Agarwal commented on MAPREDUCE-2708: --- bq. You write extremely beautiful tests *smile*. Thanks, Vinod, for taking this up. Hope testing on the cluster goes smoothly with this. [MR-279] Design and implement MR Application Master recovery Key: MAPREDUCE-2708 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: applicationmaster, mrv2 Affects Versions: 0.23.0 Reporter: Sharad Agarwal Assignee: Sharad Agarwal Priority: Blocker Fix For: 0.23.0 Attachments: MAPREDUCE-2708-20111021.1.txt, MAPREDUCE-2708-20111021.txt, mr2708_v1.patch, mr2708_v2.patch Design recovery of the MR AM from crashes/node failures. The running job should recover from the state it left off at.
[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128689#comment-13128689 ] Sharad Agarwal commented on MAPREDUCE-2708: --- bq. All hadoop-mapreduce-client passing. Correction: All hadoop-mapreduce-client tests passing. [MR-279] Design and implement MR Application Master recovery Key: MAPREDUCE-2708 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: applicationmaster, mrv2 Affects Versions: 0.23.0 Reporter: Sharad Agarwal Assignee: Sharad Agarwal Priority: Blocker Fix For: 0.23.0 Attachments: mr2708_v1.patch, mr2708_v2.patch Design recovery of the MR AM from crashes/node failures. The running job should recover from the state it left off at.
[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127307#comment-13127307 ] Sharad Agarwal commented on MAPREDUCE-2708: --- Lots of conflicts while merging. I will try to get this done in the next couple of days. Thanks! [MR-279] Design and implement MR Application Master recovery Key: MAPREDUCE-2708 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708 Project: Hadoop Map/Reduce Issue Type: Sub-task Components: applicationmaster, mrv2 Affects Versions: 0.23.0 Reporter: Sharad Agarwal Assignee: Sharad Agarwal Priority: Blocker Fix For: 0.23.0 Attachments: mr2708_v1.patch Design recovery of the MR AM from crashes/node failures. The running job should recover from the state it left off at.
[jira] [Commented] (MAPREDUCE-2702) [MR-279] OutputCommitter changes for MR Application Master recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120708#comment-13120708 ] Sharad Agarwal commented on MAPREDUCE-2702: --- bq. uniting isRecoverySupported and recoverTask into a single api Combining the APIs has problems: - You won't know whether recovery is supported or not until you use the recoverTask API; currently the recovery code path is separate and is only executed if recovery is supported. - Recovery support is at the job level, while recoverTask is at the task level; semantically it is not very clear, because recoverTask is invoked multiple times. [MR-279] OutputCommitter changes for MR Application Master recovery --- Key: MAPREDUCE-2702 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2702 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mrv2 Reporter: Sharad Agarwal Assignee: Sharad Agarwal Priority: Blocker Attachments: MAPREDUCE-2702.patch, MAPREDUCE-2702.patch, mr2702_v1.patch, mr2702_v2.patch, mr2702_v3.patch, mr2702_v4.patch When the MR AM recovers from a crash, it only reruns the non-completed tasks. The completed tasks (along with their output, if any) need to be recovered from the previous life. This would require some changes in OutputCommitter.
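The two-level split argued for in the comment can be sketched as below. This is a hypothetical, simplified shape (the real methods live on org.apache.hadoop.mapreduce.OutputCommitter with different signatures): isRecoverySupported() is a one-time, job-level question that selects a distinct code path, while recoverTask() is invoked once per completed task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified committer API mirroring the argument above:
 *  one job-level capability check, one per-task recovery call. */
interface CommitterSketch {
    boolean isRecoverySupported();   // job level, checked exactly once
    void recoverTask(String taskId); // task level, called many times
}

class RecoveryDriver {
    /** Returns the tasks that were recovered; an empty list means the
     *  committer cannot recover and the caller must rerun the job. */
    static List<String> recover(CommitterSketch c, List<String> doneTasks) {
        List<String> recovered = new ArrayList<>();
        if (!c.isRecoverySupported()) {
            return recovered;        // distinct code path: no per-task probing
        }
        for (String t : doneTasks) {
            c.recoverTask(t);        // per-task call, multiple invocations
            recovered.add(t);
        }
        return recovered;
    }
}
```

With a single combined call, the driver could only discover "recovery unsupported" mid-loop, after already attempting per-task recovery, which is exactly the semantic muddle the comment objects to.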
[jira] [Commented] (MAPREDUCE-2702) [MR-279] OutputCommitter changes for MR Application Master recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116228#comment-13116228 ] Sharad Agarwal commented on MAPREDUCE-2702: --- Can be done as a separate JIRA for the old API. Anyway, if the OutputCommitter doesn't support recovery (the default), the MR AM falls back to rerunning the job from the beginning. [MR-279] OutputCommitter changes for MR Application Master recovery --- Key: MAPREDUCE-2702 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2702 Project: Hadoop Map/Reduce Issue Type: New Feature Components: mrv2 Reporter: Sharad Agarwal Assignee: Sharad Agarwal Priority: Blocker Attachments: mr2702_v1.patch, mr2702_v2.patch, mr2702_v3.patch, mr2702_v4.patch When the MR AM recovers from a crash, it only reruns the non-completed tasks. The completed tasks (along with their output, if any) need to be recovered from the previous life. This would require some changes in OutputCommitter.
[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM
[ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117006#comment-13117006 ] Sharad Agarwal commented on MAPREDUCE-2693: --- Yes, this bug is valid, but it only appears if job-level node blacklisting is enabled. Sigh! I may not have the bandwidth to work on this in the short term. Feel free to take it up if someone else wants to. Thanks! NPE in AM causes it to lose containers which are never returned back to RM -- Key: MAPREDUCE-2693 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Reporter: Amol Kekre Assignee: Sharad Agarwal Priority: Critical Fix For: 0.23.0 The following exception in the AM of an application at the top of the queue causes this. Once this happens, the AM keeps obtaining containers from the RM and simply loses them. Eventually, on a cluster with multiple jobs, no more scheduling happens because of these lost containers. It happens when there are blacklisted nodes at the app level in the AM. A bug in the AM (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes are simply getting removed from the request-table. We should make sure the RM also knows about this update. 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... 
numContainers=1539 #asks=6 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM. java.lang.NullPointerException at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220) at java.lang.Thread.run(Thread.java:619)
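A self-contained sketch of the failure mode in the stack trace above, with hypothetical, heavily simplified names (the real RMContainerRequestor keys its table by priority and capability as well): blacklisting a host drops its entry from the request table, and a later decrement dereferences the missing entry, producing the NPE. Guarding the lookup keeps the allocator heartbeat thread alive so containers are not silently lost.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified request table keyed only by resource name. */
class RequestTableSketch {
    private final Map<String, Integer> numContainers = new HashMap<>();

    void addResourceRequest(String resourceName, int n) {
        numContainers.merge(resourceName, n, Integer::sum);
    }

    /** Blacklisting removes the host's entry, as the bug report describes. */
    void containerFailedOnHost(String hostName) {
        numContainers.remove(hostName);
    }

    /** Guarded decrement: an unguarded numContainers.get(...) - 1 would
     *  throw NullPointerException once the entry has been removed. */
    void decResourceRequest(String resourceName) {
        Integer n = numContainers.get(resourceName);
        if (n == null) {
            return;  // request already gone; nothing to decrement
        }
        if (n <= 1) numContainers.remove(resourceName);
        else numContainers.put(resourceName, n - 1);
    }

    Integer outstanding(String resourceName) {
        return numContainers.get(resourceName);
    }
}
```

As the report notes, the guard alone is not the whole fix: the RM must also be told that the blacklisted host's requests were withdrawn, otherwise the two sides' views of outstanding asks diverge.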