[ https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194850#comment-13194850 ]
Robert Joseph Evans commented on MAPREDUCE-3711: ------------------------------------------------ I have gone through more of the logs # Reading task data starts at 14:54:25,002 # Reading task data ends at 14:54:27,436 (about 2.5 sec later) # Sending mapper scheduled events ends at 14:54:31,394 (about 4 sec later) # Sending mapper started events ends at 14:54:36,046 (just under 5 sec later) # Then it starts trying to recover the output from the task attempt. This goes until the job is killed at 16:43:56,062 (almost 2 hours later) The following is from the logs (with hosts, IPs and user names removed) {noformat} 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of job to hdfs://NN:8020/sort/output/_temporary/2 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Recovered output from task attempt attempt_1326983991390_0002_m_000000_0 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending done event to attempt_1326983991390_0002_m_000000_0 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1326983991390_0002_m_000000_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing attempt_1326983991390_0002_m_000001_0 of type TA_ASSIGNED 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved gsbl91544.blue.ygrid.yahoo.com to /98.137.103.0 2012-01-19 14:54:40,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending status update event to attempt_1326983991390_0002_m_000001_0 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of job to hdfs://NN:8020/sort/output/_temporary/2 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Recovered output from task attempt attempt_1326983991390_0002_m_000001_0 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending done event to attempt_1326983991390_0002_m_000001_0 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1326983991390_0002_m_000001_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing attempt_1326983991390_0002_m_000002_0 of type TA_ASSIGNED 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved HOST to /RACK 2012-01-19 14:54:41,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending status update event to attempt_1326983991390_0002_m_000002_0 {noformat} It looks like it takes about 1.5 sec for each mapper's output to be recovered. For all 16800 mappers that would result in 7 hours of time to recover all of the mapper's output. I will look at this code to see exactly what it is doing, and how we can optimize it. > AppMaster recovery for Medium to large jobs take long time > ---------------------------------------------------------- > > Key: MAPREDUCE-3711 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv2 > Affects Versions: 0.23.0 > Reporter: Siddharth Seth > Assignee: Robert Joseph Evans > Priority: Blocker > > Reported by [~karams] > yarn.resourcemanager.am.max-retries=2 > Ran test cases with sort job on 350 scale having 16800 maps and 680 reduces -: > 1. After 70 secs of Job Sumbission Am is killed using kill -9, around 3900 > maps were completed and 680 reduces were > scheduled, Second AM got restart. Job got completed in 980 secs. AM took very > less time to recover. > 2. After 150 secs of Job Sumbission AM is killed using kill -9, around 90% > maps were completed and 680 reduces were > scheduled , Second AM got restart Job got completed in 1000 secs. AM got > revocer. > 3. After 150 secs of Job Sumbission AM as killed using kill -9, almost all > maps were completed and only 680 reduces > were running, Recovery was too slow, AM was still revocering after 1hr :40 > mis when I killed the run. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira