[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

Robert Joseph Evans (Commented) (JIRA) Fri, 27 Jan 2012 07:34:36 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194850#comment-13194850
 ]


Robert Joseph Evans commented on MAPREDUCE-3711:
------------------------------------------------

I have gone through more of the logs

# Reading task data starts at 14:54:25,002
# Reading task data ends at 14:54:27,436 (about 2.5 sec later)
# Sending mapper scheduled events ends at 14:54:31,394 (about 4 sec later)
# Sending mapper started events ends at 14:54:36,046 (just under 5 sec later)
# Then it starts trying to recover the output from the task attempt.  This goes 
until the job is killed at 16:43:56,062 (almost 2 hours later)

The following is from the logs (with hosts, IPs and user names removed)
{noformat}
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of job 
to hdfs://NN:8020/sort/output/_temporary/2
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Recovered output 
from task attempt attempt_1326983991390_0002_m_000000_0
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending done event 
to attempt_1326983991390_0002_m_000000_0
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1326983991390_0002_m_000000_0 TaskAttempt Transitioned from UNASSIGNED 
to ASSIGNED
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing 
attempt_1326983991390_0002_m_000001_0 of type TA_ASSIGNED
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved 
gsbl91544.blue.ygrid.yahoo.com to /98.137.103.0
2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending status 
update event to attempt_1326983991390_0002_m_000001_0
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of job 
to hdfs://NN:8020/sort/output/_temporary/2
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Recovered output 
from task attempt attempt_1326983991390_0002_m_000001_0
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending done event 
to attempt_1326983991390_0002_m_000001_0
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1326983991390_0002_m_000001_0 TaskAttempt Transitioned from UNASSIGNED 
to ASSIGNED
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing 
attempt_1326983991390_0002_m_000002_0 of type TA_ASSIGNED
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.util.RackResolver: Resolved HOST to /RACK
2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending status 
update event to attempt_1326983991390_0002_m_000002_0
{noformat}

It looks like it takes about 1.5 sec for each mapper's output to be recovered.  
For all 16800 mappers that would result in 7 hours of time to recover all of 
the mapper's output.  I will look at this code to see exactly what it is doing, 
and how we can optimize it.
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with sort job on 350 scale having 16800 maps and 680 reduces -:
> 1. After 70 secs of Job Sumbission Am is killed using kill -9, around 3900 
> maps were completed and 680 reduces were
> scheduled, Second AM got restart. Job got completed in 980 secs. AM took very 
> less time to recover.
> 2. After 150 secs of Job Sumbission AM is killed using kill -9, around 90% 
> maps were completed and 680 reduces were
> scheduled , Second AM got restart Job got completed in 1000 secs. AM got 
> revocer.
> 3. After 150 secs of Job Sumbission AM as killed using kill -9, almost all 
> maps were completed and only 680 reduces
> were running, Recovery was too slow, AM was still revocering after 1hr :40 
> mis when I killed the run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

Reply via email to