[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN

2012-04-04 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246166#comment-13246166
 ] 

Sharad Agarwal commented on MAPREDUCE-3315:
---

Thanks Nikhil for the patch. Will have a look at it.

 Master-Worker Application on YARN
 -

 Key: MAPREDUCE-3315
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
 Fix For: 0.24.0

 Attachments: MAPREDUCE-3315.patch


 Currently, master-worker scenarios are force-fit into Map-Reduce. With YARN, 
 these can be first class, which would benefit real-time and near-real-time 
 workloads and make more effective use of cluster resources.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN

2012-04-03 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245108#comment-13245108
 ] 

Sharad Agarwal commented on MAPREDUCE-3315:
---

We can have it in the hadoop-yarn-applications module. The example app could be 
a sub-module of the master-worker app.

 Master-Worker Application on YARN
 -

 Key: MAPREDUCE-3315
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
 Fix For: 0.24.0







[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN

2012-03-22 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235534#comment-13235534
 ] 

Sharad Agarwal commented on MAPREDUCE-3315:
---

bq. Should I use Hadoop IPC or RMI 
Hadoop IPC

bq. Should the Master be in the ApplicationManager or be run as a Container?
For the Master, you need to write a YARN ApplicationMaster (AM); see 
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
The AM runs in one of the containers on the cluster.

Also, it would be good to clearly separate the user API (for writing 
master-worker apps) from the runtime implementation.

 Master-Worker Application on YARN
 -

 Key: MAPREDUCE-3315
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
 Fix For: 0.24.0







[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases

2012-02-13 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206860#comment-13206860
 ] 

Sharad Agarwal commented on MAPREDUCE-3846:
---

Looks good. We should add a test case that recovers in the third generation.

 Restarted+Recovered AM hangs in some corner cases
 -

 Key: MAPREDUCE-3846
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Critical
 Attachments: MAPREDUCE-3846-20120210.txt


 [~karams] found this while testing AM restart/recovery feature. After the 
 first generation AM crashes (manually killed by kill -9), the second 
 generation AM starts, but hangs after a while.





[jira] [Commented] (MAPREDUCE-3858) Task attempt failure during commit results in task never completing

2012-02-13 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207527#comment-13207527
 ] 

Sharad Agarwal commented on MAPREDUCE-3858:
---

+1 looks good. Thanks Tom for the patch.

 Task attempt failure during commit results in task never completing
 ---

 Key: MAPREDUCE-3858
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3858
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Reporter: Tom White
Assignee: Tom White
Priority: Critical
 Attachments: MAPREDUCE-3858.patch


 On a terasort job a task attempt failed during the commit phase. Another 
 attempt was rescheduled, but when it tried to commit it failed.
 {noformat}
 attempt_1329019187148_0083_r_000586_0 already given a go for committing the 
 task output, so killing attempt_1329019187148_0083_r_000586_1
 {noformat}
 The job hung as new attempts kept getting scheduled only to fail during 
 commit.





[jira] [Commented] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases

2012-02-10 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205325#comment-13205325
 ] 

Sharad Agarwal commented on MAPREDUCE-3846:
---

Should this be marked as a duplicate of MAPREDUCE-3802? It is exactly the same 
behaviour of the AM hanging/failing in the third generation.

 Restarted+Recovered AM hangs in some corner cases
 -

 Key: MAPREDUCE-3846
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli






[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes

2012-02-07 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203325#comment-13203325
 ] 

Sharad Agarwal commented on MAPREDUCE-3802:
---

bq. need to understand a little bit better how these names are determined
The task attemptIds are unique across all generations of the AM. This is to 
avoid any remote task attempt from a previous AM generation joining the 
current AM. The assumption is that a task won't have more than 1000 attempts 
in one AM run. The suffix part of the task attemptId is determined as 
(AMGeneration - 1) * 1000: for the first AM it starts from 0, for the second 
from 1000, for the third from 2000, and so on.
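The numbering scheme above can be sketched as follows; the class and method names here are hypothetical, not the actual TaskImpl code:

```java
// Hedged sketch of the attempt-numbering scheme described above.
// Class and method names are hypothetical, not actual Hadoop code.
public class AttemptIdSketch {
    // Each AM generation gets a disjoint block of 1000 attempt numbers,
    // assuming a task never has more than 1000 attempts in one AM life.
    static int startingAttemptNumber(int amGeneration) {
        return (amGeneration - 1) * 1000;
    }

    public static void main(String[] args) {
        System.out.println(startingAttemptNumber(1)); // first AM: 0
        System.out.println(startingAttemptNumber(2)); // second AM: 1000
        System.out.println(startingAttemptNumber(3)); // third AM: 2000
    }
}
```

Because the blocks are disjoint, an attempt id immediately identifies which AM generation produced it.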



 If an MR AM dies twice  it looks like the process freezes
 -

 Key: MAPREDUCE-3802
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 0.23.1, 0.24.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
 Attachments: syslog


 It looks like recovering from an MR AM dying works very well on a single 
 failure, but if it fails multiple times we appear to get into a livelock 
 situation.
 {noformat}
 yarn jar 
 hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar 
 wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 
 input output
 12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. 
 Instead, use fs.defaultFS
 12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser 
 is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
 12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 
 17
 12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
 12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
 12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
 12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application 
 application_1328302034486_0003 to ResourceManager at HOST/IP:8040
 12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: 
 http://HOST:8088/proxy/application_1328302034486_0003/
 12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
 12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in 
 uber mode : false
 12/02/03 21:07:03 INFO mapreduce.Job:  map 0% reduce 0%
 12/02/03 21:07:09 INFO mapreduce.Job:  map 5% reduce 0%
 12/02/03 21:07:10 INFO mapreduce.Job:  map 17% reduce 0%
 #KILLED AM with kill -9 here
 12/02/03 21:07:16 INFO mapreduce.Job:  map 29% reduce 0%
 12/02/03 21:07:17 INFO mapreduce.Job:  map 35% reduce 0%
 12/02/03 21:07:30 INFO mapreduce.Job:  map 52% reduce 0%
 12/02/03 21:07:35 INFO mapreduce.Job:  map 58% reduce 0%
 12/02/03 21:07:37 INFO mapreduce.Job:  map 70% reduce 0%
 12/02/03 21:07:41 INFO mapreduce.Job:  map 76% reduce 0%
 12/02/03 21:07:43 INFO mapreduce.Job:  map 82% reduce 0%
 12/02/03 21:07:44 INFO mapreduce.Job:  map 88% reduce 0%
 12/02/03 21:07:47 INFO mapreduce.Job:  map 94% reduce 0%
 12/02/03 21:07:49 INFO mapreduce.Job:  map 100% reduce 0%
 12/02/03 21:07:53 INFO mapreduce.Job:  map 100% reduce 3%
 12/02/03 21:08:00 INFO mapreduce.Job:  map 100% reduce 6%
 12/02/03 21:08:06 INFO mapreduce.Job:  map 100% reduce 10%
 12/02/03 21:08:12 INFO mapreduce.Job:  map 100% reduce 13%
 12/02/03 21:08:18 INFO mapreduce.Job:  map 100% reduce 16%
 #killed AM with kill -9 here
 12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. 
 Already tried 0 time(s).
 12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. 
 Already tried 1 time(s).
 12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. 
 Already tried 2 time(s).
 12/02/03 21:08:26 INFO mapreduce.Job:  map 64% reduce 16%
 #It never makes any more progress...
 {noformat}





[jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes

2012-02-07 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203332#comment-13203332
 ] 

Sharad Agarwal commented on MAPREDUCE-3802:
---

The bug is in TaskImpl over here:

{code}
//attempt ids are generated based on MR app startCount so that attempts
//from previous lives don't overstep the current one.
//this assumes that a task won't have more than 1000 attempts in its single 
//life
nextAttemptNumber = (startCount - 1) * 1000;
{code}

The completed task could be from any earlier AM generation, not just the 
previous one. I am looking into a way to fix this.
 

 If an MR AM dies twice  it looks like the process freezes
 -

 Key: MAPREDUCE-3802
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 0.23.1, 0.24.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
 Attachments: syslog







[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

2012-01-30 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196747#comment-13196747
 ] 

Sharad Agarwal commented on MAPREDUCE-3711:
---

bq. That is the real bug. We should instead be moving the committed outputs of 
a single task from $(JobAttemptBaseDir - 1) dir to $(JobAttemptBaseDir) dir.
True, that's *the* bug; recoverJob is not required.

For saving HDFS trips, a simple check for non-zero reduces to skip map-output 
recovery in RecoveryService should be sufficient.
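A minimal sketch of that check, assuming the rationale that jobs with reducers re-fetch map output anyway; the names are hypothetical, not the actual RecoveryService API:

```java
// Hedged sketch: skip map-output recovery when the job has reducers,
// since reducers will re-fetch map output from the new attempts anyway.
// Names are hypothetical, not the actual RecoveryService code.
public class RecoverySketch {
    static boolean shouldRecoverMapOutput(int numReduces) {
        // Only map-only jobs write final output from maps, so only they
        // need their committed map output moved during recovery.
        return numReduces == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldRecoverMapOutput(0));   // map-only job: true
        System.out.println(shouldRecoverMapOutput(680)); // job with reduces: false
    }
}
```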

 AppMaster recovery for Medium to large jobs take long time
 --

 Key: MAPREDUCE-3711
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Robert Joseph Evans
Priority: Blocker

 Reported by [~karams]
 yarn.resourcemanager.am.max-retries=2
 Ran test cases with a sort job at 350 scale, having 16800 maps and 680 reduces:
 1. 70 secs after job submission the AM was killed using kill -9; around 3900 
 maps were completed and 680 reduces were scheduled. The second AM restarted 
 and the job completed in 980 secs. The AM took very little time to recover.
 2. 150 secs after job submission the AM was killed using kill -9; around 90% 
 of maps were completed and 680 reduces were scheduled. The second AM restarted 
 and the job completed in 1000 secs. The AM recovered.
 3. 150 secs after job submission the AM was killed using kill -9; almost all 
 maps were completed and only 680 reduces were running. Recovery was too slow; 
 the AM was still recovering after 1 hr 40 mins when I killed the run.





[jira] [Commented] (MAPREDUCE-3634) All daemons should crash instead of hanging around when their EventHandlers get exceptions

2012-01-29 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195964#comment-13195964
 ] 

Sharad Agarwal commented on MAPREDUCE-3634:
---

Can we set the Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY default to false 
instead? The number of daemons is much smaller and stays almost constant, as 
opposed to tests. Anybody adding a test or using an API should not be 
surprised by not setting this property.

 All daemons should crash instead of hanging around when their EventHandlers 
 get exceptions
 --

 Key: MAPREDUCE-3634
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3634
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 0.23.1

 Attachments: MAPREDUCE-3634-20120118.1.txt, 
 MAPREDUCE-3634-20120119.txt


 We should make sure that the daemons crash in case the dispatchers get 
 exceptions and stop processing. That way we will be debugging RM/NM/AM 
 crashes instead of hard-to-track hanging jobs. 





[jira] [Commented] (MAPREDUCE-3489) EventDispatcher should have a call-back on errors for aiding tests

2012-01-27 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194633#comment-13194633
 ] 

Sharad Agarwal commented on MAPREDUCE-3489:
---

Currently in the AsyncDispatcher, exitOnDispatchException defaults to false. 
The daemons don't set it to true either. Is this the intended behaviour? I 
think daemons should exit on dispatcher error, while test cases can handle it 
differently, right?

 EventDispatcher should have a call-back on errors for aiding tests
 --

 Key: MAPREDUCE-3489
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3489
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Sharad Agarwal

 If one of the dispatched events generates an exception - the dispatcher kills 
 the JVM via a System.exit. Unit tests end up not running - but they don't 
 fail either.
 TestTaskAttempt is currently running like this.
 Previously - have seen TestRecovery and TestJobHistoryParsing do the same. 
 Most of the tests would need to be looked at.





[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

2012-01-26 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194486#comment-13194486
 ] 

Sharad Agarwal commented on MAPREDUCE-3711:
---

Karam, can you upload the AM logs for the case when recovery was taking too 
long ?

 AppMaster recovery for Medium to large jobs take long time
 --

 Key: MAPREDUCE-3711
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Robert Joseph Evans
Priority: Blocker






[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up

2011-12-28 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176621#comment-13176621
 ] 

Sharad Agarwal commented on MAPREDUCE-3490:
---

Currently, all bookkeeping and calculations in RMContainerAllocator are based 
on attempts, not on tasks; there is a one-to-one mapping between attempts and 
containers.

completedMapPercent is currently the completed map *attempts* percentage. I 
looked in more detail: we can simply change this to reflect the completed map 
*tasks* percentage. All the information is readily available in Job, so we 
need not maintain these counts in RMContainerAllocator (Arun also mentioned 
this). I am attaching a patch which drastically simplifies this without the 
need to add new events. I have also removed the completedMaps and 
completedReduces counts in RMContainerAllocator.
Arun/Vinod - see if this makes sense?

 RMContainerAllocator counts failed maps towards Reduce ramp up
 --

 Key: MAPREDUCE-3490
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
 Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, 
 MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MR-3490-alternate.patch


 The RMContainerAllocator does not differentiate between failed and successful 
 maps while calculating whether reduce tasks are ready to launch. Failed tasks 
 are also counted towards total completed tasks. 
 Example. 4 failed maps, 10 total maps. Map%complete = 4/14 * 100 instead of 
 being 0.





[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up

2011-12-28 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176623#comment-13176623
 ] 

Sharad Agarwal commented on MAPREDUCE-3490:
---

Note: I am on vacation for the rest of the week; please expect slow or no 
response till then.

 RMContainerAllocator counts failed maps towards Reduce ramp up
 --

 Key: MAPREDUCE-3490
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
 Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, 
 MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MR-3490-alternate.patch, 
 MR-3490-alternate1.patch







[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up

2011-12-22 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174797#comment-13174797
 ] 

Sharad Agarwal commented on MAPREDUCE-3490:
---

Hi Arun - I just had a brief look at the patch. It seems you don't need the 
new ContainerAllocator types: RMContainerAllocator is already getting the 
CONTAINER_FAILED event, and completedMaps includes both succeeded and failed 
attempts.

Instead of using completedMaps in the calculation, it can use succeededMaps:

succeededMaps = completedMaps - failedMaps
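A sketch of the proposed calculation; the surrounding allocator code is omitted and the names are illustrative, not the actual RMContainerAllocator fields:

```java
// Hedged sketch: base reduce ramp-up on *succeeded* maps rather than all
// completed (succeeded + failed) attempts. Names are illustrative only.
public class RampUpSketch {
    static float succeededMapFraction(int completedMaps, int failedMaps,
                                      int totalMaps) {
        int succeededMaps = completedMaps - failedMaps;
        return totalMaps == 0 ? 0f : (float) succeededMaps / totalMaps;
    }

    public static void main(String[] args) {
        // The example from the issue description: 4 failed maps, 10 total
        // maps. Counting failed attempts inflates progress; counting only
        // successes correctly reports zero map progress.
        System.out.println(succeededMapFraction(4, 4, 10)); // 0.0
    }
}
```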

 RMContainerAllocator counts failed maps towards Reduce ramp up
 --

 Key: MAPREDUCE-3490
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
 Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, 
 MAPREDUCE-3490.patch, MAPREDUCE-3490.patch







[jira] [Commented] (MAPREDUCE-3490) RMContainerAllocator counts failed maps towards Reduce ramp up

2011-12-22 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175319#comment-13175319
 ] 

Sharad Agarwal commented on MAPREDUCE-3490:
---

bq. I think we need to stop tracking this in RMContainerAllocator and rather 
rely on Job. For now, my patch seems the closest approximation to that (being 
conservative).
Doing it in Job or in RMContainerAllocator is a separate discussion; I don't 
think this patch deals with anything like that. It adds two new events for 
RMContainerAllocator itself.
I am proposing that we don't need these extra events, because this information 
(failed-attempt info) is already available in RMContainerAllocator.

 

 RMContainerAllocator counts failed maps towards Reduce ramp up
 --

 Key: MAPREDUCE-3490
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3490
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mr-am, mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Assignee: Arun C Murthy
Priority: Blocker
 Attachments: MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, 
 MAPREDUCE-3490.patch, MAPREDUCE-3490.patch, MR-3490-alternate.patch







[jira] [Commented] (MAPREDUCE-3473) A single task tracker failure shouldn't result in Job failure

2011-12-21 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174636#comment-13174636
 ] 

Sharad Agarwal commented on MAPREDUCE-3473:
---

A single machine failure doesn't cause the job to fail; that's the whole point 
of Hadoop. (smile)

We are missing the difference between *Task* and *TaskAttempt*. A task gets 4 
chances (task attempts) by default to run before the job is declared failed. 

I think this issue can be resolved as Invalid.
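The Task vs. TaskAttempt distinction can be sketched as below; 4 matches Hadoop's default for mapreduce.map.maxattempts, while the class and method names are illustrative:

```java
// Hedged sketch of the Task vs. TaskAttempt distinction: a task fails
// only after all of its attempts have failed. The constant matches
// Hadoop's default; the class and method names are illustrative.
public class TaskRetrySketch {
    static final int MAX_ATTEMPTS = 4; // default mapreduce.map.maxattempts

    // A single attempt lost to a machine failure does not fail the task;
    // the task fails only once it has exhausted all of its attempts.
    static boolean taskFailed(int failedAttempts) {
        return failedAttempts >= MAX_ATTEMPTS;
    }

    public static void main(String[] args) {
        System.out.println(taskFailed(1)); // one bad tracker: false
        System.out.println(taskFailed(4)); // all attempts failed: true
    }
}
```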

 A single task tracker failure shouldn't result in Job failure 
 --

 Key: MAPREDUCE-3473
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3473
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: tasktracker
Affects Versions: 0.20.205.0, 0.23.0
Reporter: Eli Collins

 Currently some task failures may result in job failures. E.g., a local TT disk 
 failure seen in TaskLauncher#run, TaskRunner#run, or MapTask#run is visible to 
 and can hang the JobClient, causing the job to fail. Job execution should 
 always be able to survive a task failure if there are sufficient resources. 





[jira] [Commented] (MAPREDUCE-3489) Unit tests failing silently

2011-12-01 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13160716#comment-13160716
 ] 

Sharad Agarwal commented on MAPREDUCE-3489:
---

Definitely, System.exit is not the right thing in a library. Sigh! I knew that 
when I wrote it and intended to remove it, but couldn't get a chance to get to 
it. I think we should get rid of it and instead have an error-handling 
callback registered with the Dispatcher.
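
The callback idea might look roughly like this. All names here are hypothetical, not the actual YARN Dispatcher API; the point is only the control flow that replaces System.exit:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a dispatcher that surfaces event-handling errors to registered
// callbacks instead of killing the JVM. Hypothetical API, not YARN's.
public class SketchDispatcher {
    public interface ErrorHandler {
        void onError(Throwable error);
    }

    private final List<ErrorHandler> handlers = new ArrayList<ErrorHandler>();

    public void register(ErrorHandler handler) {
        handlers.add(handler);
    }

    public void dispatch(Runnable event) {
        try {
            event.run();
        } catch (Throwable t) {
            // Instead of System.exit(-1): notify whoever hosts the
            // dispatcher (an AM, or a unit test) and let them decide
            // whether to shut down - so test JVMs survive and can fail
            // the test visibly.
            for (ErrorHandler h : handlers) {
                h.onError(t);
            }
        }
    }
}
```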

 Unit tests failing silently
 ---

 Key: MAPREDUCE-3489
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3489
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 0.23.0
Reporter: Siddharth Seth
Priority: Critical

 If one of the dispatched events generates an exception, the dispatcher kills 
 the JVM via System.exit. Unit tests end up not running, but they don't fail 
 either.
 TestTaskAttempt is currently running like this.
 Previously, TestRecovery and TestJobHistoryParsing have been seen doing the 
 same. Most of the tests would need to be looked at.





[jira] [Commented] (MAPREDUCE-3473) Task failures shouldn't result in Job failures

2011-11-29 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13159158#comment-13159158
 ] 

Sharad Agarwal commented on MAPREDUCE-3473:
---

Note this is *task* failure, NOT task-attempt failure. A task failure means 
losing the processing of the corresponding input split, and not all 
applications would be OK with that. 
Explicitly setting the tolerated failure percentage to a non-zero value makes 
sense, so losing data doesn't come as a surprise to applications.
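
The "tolerated failure percentage" semantics can be sketched as follows. This is a hypothetical helper, not the real JobImpl logic; the configuration knob it mirrors (the max-failures-percent setting, default 0) means any task failure fails the job unless an application opts in to losing some splits:

```java
// Hypothetical sketch of failures-max-percent semantics; not real
// Hadoop code.
public class FailurePolicy {
    // The job fails only when the percentage of failed tasks exceeds the
    // configured threshold. With the default of 0, a single task failure
    // fails the job - data loss must be an explicit opt-in.
    static boolean jobFails(int failedTasks, int totalTasks, int maxFailurePercent) {
        return failedTasks * 100 > totalTasks * maxFailurePercent;
    }
}
```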

 Task failures shouldn't result in Job failures 
 ---

 Key: MAPREDUCE-3473
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3473
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: tasktracker
Affects Versions: 0.20.205.0, 0.23.0
Reporter: Eli Collins

 Currently some task failures may result in job failures. E.g. a local TT disk 
 failure seen in TaskLauncher#run, TaskRunner#run, or MapTask#run is visible 
 to and can hang the JobClient, causing the job to fail. Job execution should 
 always be able to survive a task failure if there are sufficient resources. 





[jira] [Commented] (MAPREDUCE-3402) AMScalability test of Sleep job with 100K 1-sec maps regressed into running very slowly

2011-11-15 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151029#comment-13151029
 ] 

Sharad Agarwal commented on MAPREDUCE-3402:
---

Just FYI: org.apache.hadoop.mapreduce.v2.app.MRAppBenchmark can be used to 
benchmark the AM, mainly for memory usage, job latencies, and state machine 
transitions. It doesn't capture remoting/RPC issues, however, as it doesn't 
run on a real cluster.

 AMScalability test of Sleep job with 100K 1-sec maps regressed into running 
 very slowly
 ---

 Key: MAPREDUCE-3402
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3402
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: applicationmaster, mrv2
Affects Versions: 0.23.0
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Fix For: 0.23.1


 The world was rosier before October 19-25, [~karams] says.
 The 100K 1-sec sleep job used to take around 800 secs, or 13-14 mins. It now 
 runs for 45 mins and still manages to complete only about 45K tasks.
 One or more of the flurry of commits for 0.23.0 deserve(s) the blame.





[jira] [Commented] (MAPREDUCE-3315) Master-Worker Application on YARN

2011-10-31 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13140059#comment-13140059
 ] 

Sharad Agarwal commented on MAPREDUCE-3315:
---

Some thoughts:
- The AM decides the Tasks.
- Workers keep polling the AM to get the next Task instance.
- We should be able to dynamically add or shut down workers - have minWorkers, 
maxWorkers, keepAlive.
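
The polling protocol sketched above can be illustrated minimally. All names here are hypothetical, and a shared in-memory queue stands in for what the real patch would do over YARN RPC between the AM and worker containers:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy sketch of the master-worker protocol: the AM owns the task queue,
// workers pull from it until drained. Not the proposed implementation.
public class MasterWorkerSketch {
    // The "AM" side: decides the Tasks up front.
    static Queue<String> master(String... tasks) {
        Queue<String> q = new ConcurrentLinkedQueue<String>();
        for (String t : tasks) {
            q.add(t);
        }
        return q;
    }

    // A "worker": keeps polling for the next Task instance until none
    // remain; returns how many it processed.
    static int worker(Queue<String> master) {
        int done = 0;
        String task;
        while ((task = master.poll()) != null) {
            done++; // process the task here
        }
        return done;
    }
}
```

Because workers pull rather than being pushed to, adding a worker (up to maxWorkers) or letting an idle one expire (keepAlive) needs no rebalancing by the master.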

 Master-Worker Application on YARN
 -

 Key: MAPREDUCE-3315
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3315
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
 Fix For: 0.24.0


 Currently master worker scenarios are forced fit into Map-Reduce. Now with 
 YARN, these can be first class and would benefit real/near realtime workloads 
 and be more effective in using the cluster resources.





[jira] [Commented] (MAPREDUCE-3274) Race condition in MR App Master Preemtion can cause a dead lock

2011-10-27 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136881#comment-13136881
 ] 

Sharad Agarwal commented on MAPREDUCE-3274:
---

 JVM with ID: jvm_1319242394842_0065_r_08 given task: 
 attempt_1319242394842_0065_r_04_0

Something seems wrong here: a JVM with a particular ID should always be given 
the corresponding task.

 Race condition in MR App Master Preemtion can cause a dead lock
 ---

 Key: MAPREDUCE-3274
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2, scheduler
Affects Versions: 0.23.0, 0.24.0
Reporter: Robert Joseph Evans
Assignee: Robert Joseph Evans
Priority: Critical
 Fix For: 0.23.0, 0.24.0


 There appears to be a race condition in the MR App Master in relation to 
 preempting reducers to let a mapper run. In the particular case that I have 
 been debugging, a reducer was selected for preemption that did not yet have 
 a container assigned to it. When the container became available, that 
 reducer started running, and the earlier TA_KILL event appears to have been 
 ignored.





[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery

2011-10-23 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13133597#comment-13133597
 ] 

Sharad Agarwal commented on MAPREDUCE-2708:
---

bq. and the AM restarted with 83% map and 0% reduce

Great, it worked!

BTW Vinod, when the recovery module is unable to parse the history file (due 
to the HDFS bug), it should fall back to restarting the job. Just curious, 
did you notice that? 

 [MR-279] Design and implement MR Application Master recovery
 

 Key: MAPREDUCE-2708
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 0.23.0
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Priority: Blocker
 Fix For: 0.23.0

 Attachments: MAPREDUCE-2708-20111021.1.txt, 
 MAPREDUCE-2708-20111021.txt, MAPREDUCE-2708-20111022.txt, mr2708_v1.patch, 
 mr2708_v2.patch


 Design recovery of the MR AM from crashes/node failures. The running job 
 should recover from the state where it left off.





[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery

2011-10-21 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13132806#comment-13132806
 ] 

Sharad Agarwal commented on MAPREDUCE-2708:
---

bq. You write extremely beautiful tests
Smile. Thanks, Vinod, for taking this up. Hope testing on the cluster goes 
smoothly with this.

 [MR-279] Design and implement MR Application Master recovery
 

 Key: MAPREDUCE-2708
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 0.23.0
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Priority: Blocker
 Fix For: 0.23.0

 Attachments: MAPREDUCE-2708-20111021.1.txt, 
 MAPREDUCE-2708-20111021.txt, mr2708_v1.patch, mr2708_v2.patch


 Design recovery of the MR AM from crashes/node failures. The running job 
 should recover from the state where it left off.





[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery

2011-10-17 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128689#comment-13128689
 ] 

Sharad Agarwal commented on MAPREDUCE-2708:
---

bq. All hadoop-mapreduce-client passing.

correction: All hadoop-mapreduce-client tests passing.

 [MR-279] Design and implement MR Application Master recovery
 

 Key: MAPREDUCE-2708
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 0.23.0
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Priority: Blocker
 Fix For: 0.23.0

 Attachments: mr2708_v1.patch, mr2708_v2.patch


 Design recovery of the MR AM from crashes/node failures. The running job 
 should recover from the state where it left off.





[jira] [Commented] (MAPREDUCE-2708) [MR-279] Design and implement MR Application Master recovery

2011-10-14 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13127307#comment-13127307
 ] 

Sharad Agarwal commented on MAPREDUCE-2708:
---

Lots of conflicts while merging. I will try to get this done in the next 
couple of days. Thanks!

 [MR-279] Design and implement MR Application Master recovery
 

 Key: MAPREDUCE-2708
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2708
 Project: Hadoop Map/Reduce
  Issue Type: Sub-task
  Components: applicationmaster, mrv2
Affects Versions: 0.23.0
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Priority: Blocker
 Fix For: 0.23.0

 Attachments: mr2708_v1.patch


 Design recovery of the MR AM from crashes/node failures. The running job 
 should recover from the state where it left off.





[jira] [Commented] (MAPREDUCE-2702) [MR-279] OutputCommitter changes for MR Application Master recovery

2011-10-05 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120708#comment-13120708
 ] 

Sharad Agarwal commented on MAPREDUCE-2702:
---

bq. uniting isRecoverySupported and recoverTask into a single api

Combining the APIs has problems:
- You won't know whether recovery is supported until you use the recover-task 
API. Currently the recovery code path is separate and only executed if 
recovery is supported.
- Recovery support is at the job level, while recovering a task is at the 
task level. Semantically it is not very clear, because recoverTask is invoked 
multiple times.
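
The two-level control flow argued for above can be sketched as follows. The interface is modeled on, but not identical to, OutputCommitter's recovery hooks; the String task IDs and helper names are hypothetical:

```java
import java.util.List;

// Sketch of the separated recovery APIs: a single job-level check gates
// the whole recovery path, then a task-level call runs per completed task.
public class RecoveryFlow {
    public interface Committer {
        boolean isRecoverySupported();   // job-level, asked once
        void recoverTask(String taskId); // task-level, asked per task
    }

    // Returns true if the completed tasks were recovered; false means the
    // AM must fall back to rerunning the job from the beginning.
    static boolean recover(Committer committer, List<String> completedTasks) {
        if (!committer.isRecoverySupported()) {
            return false; // recovery code path never entered
        }
        for (String taskId : completedTasks) {
            committer.recoverTask(taskId);
        }
        return true;
    }
}
```

Folding the two calls into one would force the AM to enter the per-task loop just to discover that recovery isn't supported at all.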




 [MR-279] OutputCommitter changes for MR Application Master recovery
 ---

 Key: MAPREDUCE-2702
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2702
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: mrv2
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Priority: Blocker
 Attachments: MAPREDUCE-2702.patch, MAPREDUCE-2702.patch, 
 mr2702_v1.patch, mr2702_v2.patch, mr2702_v3.patch, mr2702_v4.patch


 When the MR AM recovers from a crash, it only reruns the incomplete tasks. 
 The completed tasks (along with their output, if any) need to be recovered 
 from the previous life. This requires some changes in OutputCommitter.





[jira] [Commented] (MAPREDUCE-2702) [MR-279] OutputCommitter changes for MR Application Master recovery

2011-09-28 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116228#comment-13116228
 ] 

Sharad Agarwal commented on MAPREDUCE-2702:
---

This can be done as a separate JIRA for the old API. Anyway, if the 
OutputCommitter doesn't support recovery (the default), the MR AM falls back 
to rerunning the job from the beginning.

 [MR-279] OutputCommitter changes for MR Application Master recovery
 ---

 Key: MAPREDUCE-2702
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2702
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: mrv2
Reporter: Sharad Agarwal
Assignee: Sharad Agarwal
Priority: Blocker
 Attachments: mr2702_v1.patch, mr2702_v2.patch, mr2702_v3.patch, 
 mr2702_v4.patch


 When the MR AM recovers from a crash, it only reruns the incomplete tasks. 
 The completed tasks (along with their output, if any) need to be recovered 
 from the previous life. This requires some changes in OutputCommitter.





[jira] [Commented] (MAPREDUCE-2693) NPE in AM causes it to lose containers which are never returned back to RM

2011-09-28 Thread Sharad Agarwal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117006#comment-13117006
 ] 

Sharad Agarwal commented on MAPREDUCE-2693:
---

Yes, this bug is valid, but it only appears if job-level node blacklisting is 
enabled.

Sigh! I may not have the bandwidth to work on this in the short term. Feel 
free to take this up if someone else wants to. Thanks!

 NPE in AM causes it to lose containers which are never returned back to RM
 --

 Key: MAPREDUCE-2693
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Reporter: Amol Kekre
Assignee: Sharad Agarwal
Priority: Critical
 Fix For: 0.23.0


 The following exception in the AM of an application at the top of the queue 
 causes this. Once this happens, the AM keeps obtaining containers from the 
 RM and simply loses them. Eventually, on a cluster with multiple jobs, no 
 more scheduling happens because of these lost containers.
 It happens when there are blacklisted nodes at the app level in the AM. A 
 bug in the AM (RMContainerRequestor.containerFailedOnHost(hostName)) is 
 causing this - nodes are simply getting removed from the request-table. We 
 should make sure the RM also knows about this update.
 
 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 
 98.138.163.34
 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: 
 applicationId=30 priority=20
 resourceName=... numContainers=4978 #asks=5
 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: 
 applicationId=30 priority=20
 resourceName=... numContainers=4977 #asks=5
 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: 
 applicationId=30 priority=20
 resourceName=... numContainers=1540 #asks=5
 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: 
 applicationId=30 priority=20
 resourceName=... numContainers=1539 #asks=6
 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM. 
 java.lang.NullPointerException
 at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
 at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
 at
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
 at
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
 at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
 at 
 org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
 at java.lang.Thread.run(Thread.java:619)
