[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201329#comment-13201329
 ] 

Robert Joseph Evans commented on MAPREDUCE-3802:
------------------------------------------------

That appears to be the case. We are getting an NPE which is caused by calling 
RecoveryService.getTaskAttemptInfo() and getting a null back.  
RecoveryService.getTaskAttemptInfo() first gets a task info, and then gets a 
task attempt info from inside that task.  It looks like the task info is parsed 
and populated just fine, but the task attempt info is not.  That seems to be 
caused by no TaskAttemptStarted events being put in the history log at all 
during the recovery process.  It also seems that no MapAttemptFinishedEvents, 
ReduceAttemptFinishedEvents, TaskAttemptFailedEvents, or 
TaskAttemptFinishedEvents are in the log either, or we would get null pointer 
exceptions while parsing them too.
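
To illustrate the two-step lookup described above, here is a minimal sketch in 
Java. It is not the actual RecoveryService code: the class RecoveryLookupSketch, 
the nested TaskInfo type, and the methods recordTask(), recordAttemptStarted() 
and findTaskAttemptInfo() are hypothetical names made up for this example, and 
plain strings stand in for the real task and attempt records. It only shows how 
a task that was parsed without any attempt-started entry yields a null attempt 
info, and how a caller could check for that instead of dereferencing it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RecoveryLookupSketch {
    // Hypothetical stand-in for a parsed task entry; not the Hadoop class.
    static final class TaskInfo {
        // attempt id -> attempt state, filled only when an attempt-started
        // entry was seen while reading the history log
        final Map<String, String> attempts = new HashMap<>();
    }

    private final Map<String, TaskInfo> tasks = new HashMap<>();

    // Called when a task entry is parsed from the log.
    void recordTask(String taskId) {
        tasks.computeIfAbsent(taskId, k -> new TaskInfo());
    }

    // Called when an attempt-started entry is parsed from the log.
    void recordAttemptStarted(String taskId, String attemptId, String state) {
        tasks.computeIfAbsent(taskId, k -> new TaskInfo())
             .attempts.put(attemptId, state);
    }

    // Two-step lookup: task first, then the attempt inside that task.
    // Returns null when the attempt was never recorded.
    String getTaskAttemptInfo(String taskId, String attemptId) {
        TaskInfo task = tasks.get(taskId);
        return task == null ? null : task.attempts.get(attemptId);
    }

    // Null-safe variant so callers have to handle the missing case.
    Optional<String> findTaskAttemptInfo(String taskId, String attemptId) {
        return Optional.ofNullable(getTaskAttemptInfo(taskId, attemptId));
    }

    public static void main(String[] args) {
        RecoveryLookupSketch recovery = new RecoveryLookupSketch();
        // The task is known, but no attempt-started entry was ever logged.
        recovery.recordTask("task_1");
        String info = recovery.getTaskAttemptInfo("task_1", "attempt_1_0");
        System.out.println("attempt info: " + info);
        System.out.println("present: "
            + recovery.findTaskAttemptInfo("task_1", "attempt_1_0").isPresent());
    }
}
```

Whether the right fix is a guard like this or making sure the attempt events 
are written to the history log during recovery is a separate question; the 
sketch only makes the failure mode concrete.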
                
> If an MR AM dies twice it looks like the process freezes
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3802
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>         Attachments: syslog
>
>
> It looks like recovering from an MR AM dying works very well on a single 
> failure.  But if it fails multiple times we appear to get into a livelock 
> situation.
> {noformat}
> yarn jar 
> hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar 
> wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 
> input output
> 12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. 
> Instead, use fs.defaultFS
> 12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser 
> is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
> 12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 
> 17
> 12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
> 12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
> 12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application 
> application_1328302034486_0003 to ResourceManager at HOST/IP:8040
> 12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: 
> http://HOST:8088/proxy/application_1328302034486_0003/
> 12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
> 12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in 
> uber mode : false
> 12/02/03 21:07:03 INFO mapreduce.Job:  map 0% reduce 0%
> 12/02/03 21:07:09 INFO mapreduce.Job:  map 5% reduce 0%
> 12/02/03 21:07:10 INFO mapreduce.Job:  map 17% reduce 0%
> #KILLED AM with kill -9 here
> 12/02/03 21:07:16 INFO mapreduce.Job:  map 29% reduce 0%
> 12/02/03 21:07:17 INFO mapreduce.Job:  map 35% reduce 0%
> 12/02/03 21:07:30 INFO mapreduce.Job:  map 52% reduce 0%
> 12/02/03 21:07:35 INFO mapreduce.Job:  map 58% reduce 0%
> 12/02/03 21:07:37 INFO mapreduce.Job:  map 70% reduce 0%
> 12/02/03 21:07:41 INFO mapreduce.Job:  map 76% reduce 0%
> 12/02/03 21:07:43 INFO mapreduce.Job:  map 82% reduce 0%
> 12/02/03 21:07:44 INFO mapreduce.Job:  map 88% reduce 0%
> 12/02/03 21:07:47 INFO mapreduce.Job:  map 94% reduce 0%
> 12/02/03 21:07:49 INFO mapreduce.Job:  map 100% reduce 0%
> 12/02/03 21:07:53 INFO mapreduce.Job:  map 100% reduce 3%
> 12/02/03 21:08:00 INFO mapreduce.Job:  map 100% reduce 6%
> 12/02/03 21:08:06 INFO mapreduce.Job:  map 100% reduce 10%
> 12/02/03 21:08:12 INFO mapreduce.Job:  map 100% reduce 13%
> 12/02/03 21:08:18 INFO mapreduce.Job:  map 100% reduce 16%
> #killed AM with kill -9 here
> 12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. 
> Already tried 0 time(s).
> 12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. 
> Already tried 1 time(s).
> 12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. 
> Already tried 2 time(s).
> 12/02/03 21:08:26 INFO mapreduce.Job:  map 64% reduce 16%
> #It never makes any more progress...
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
