[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194886#comment-13194886
 ] 

Robert Joseph Evans commented on MAPREDUCE-3711:
------------------------------------------------

Two things happen between {noformat}Sending status update event to 
attempt_1326983991390_0002_m_000001_0{noformat} and {noformat}Saved output of 
job to hdfs://NN:8020/sort/output/_temporary/2{noformat}: a status update 
event is created and dispatched, which I would guess should not take much 
time, and a file/directory is checked for and moved.  My guess is that the 
HDFS operations are taking all of the time, but I need to profile it to be 
sure.
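
A minimal sketch of how the two steps could be timed separately to confirm 
that guess.  Everything here is a stand-in: the helper names (timed, 
dispatch_status_update, recover_output, recover_attempt) are hypothetical, 
and the local filesystem is used in place of HDFS, so this only shows the 
shape of the measurement and not the real AM code path.

```python
import os
import tempfile
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled step (hypothetical helper).
timings = {}

@contextmanager
def timed(label):
    """Add the elapsed time of the enclosed block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

def dispatch_status_update(attempt_id, queue):
    # Stand-in for creating and dispatching a status update event.
    queue.append(("STATUS_UPDATE", attempt_id))

def recover_output(src, dst):
    # Stand-in for the exists-check and move. On a real cluster these would
    # be HDFS calls, which is where the suspected cost lies.
    if os.path.exists(src):
        os.rename(src, dst)
        return True
    return False

def recover_attempt(attempt_id, src, dst, queue):
    # Time the two steps under separate labels so they can be compared.
    with timed("dispatch"):
        dispatch_status_update(attempt_id, queue)
    with timed("fs_ops"):
        return recover_output(src, dst)
```

Calling recover_attempt once per recovered attempt and then comparing 
timings["dispatch"] with timings["fs_ops"] would show which step dominates.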

Another thing that I am a bit confused about is why we are even trying to 
recover the mapper output.  It is not a map-only job; it has reducers too, so 
there should be no mapper output in HDFS.  I guess that might be part of the 
issue here.  Why are we trying to recover HDFS output for something that has 
no output in HDFS?  And why is it taking up to 1.5 sec to do this?
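
To put the 1.5 sec figure in context, here is a rough upper bound.  It 
assumes (and this is only an assumption) that completed maps are recovered 
one after another and that every one of them pays the full 1.5 sec; the 16800 
map count comes from the issue description.

```python
def worst_case_recovery_seconds(num_tasks, seconds_per_task):
    # Assumes tasks are recovered strictly one at a time, with no overlap.
    return num_tasks * seconds_per_task

total_seconds = worst_case_recovery_seconds(16800, 1.5)
total_hours = total_seconds / 3600

print(total_seconds)  # 25200.0
print(total_hours)    # 7.0
```

Under those assumptions recovery could take up to about 7 hours, which would 
be in line with the AM still recovering after 1 hr 40 mins in the third test 
case.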
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with a sort job on 350 scale having 16800 maps and 680 reduces:
> 1. After 70 secs of job submission the AM was killed using kill -9; around 
> 3900 maps were completed and 680 reduces were scheduled. A second AM was 
> started. The job completed in 980 secs. The AM took very little time to 
> recover.
> 2. After 150 secs of job submission the AM was killed using kill -9; around 
> 90% of maps were completed and 680 reduces were scheduled. A second AM was 
> started. The job completed in 1000 secs. The AM recovered.
> 3. After 150 secs of job submission the AM was killed using kill -9; almost 
> all maps were completed and only the 680 reduces were running. Recovery was 
> too slow: the AM was still recovering after 1 hr 40 mins when I killed the 
> run.


        
