[ https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175188#comment-13175188 ]
Ravi Prakash commented on MAPREDUCE-3596:
-----------------------------------------

Ok. Here's how far I've got:
{noformat}
$ grep attempt_1324018664143_0002_m -r container_1324018664143_0002_01_000001/ | grep "Created attempt" | awk '{print $10}' | sort | uniq | grep "_1$"
attempt_1324018664143_0002_m_009775_1
attempt_1324018664143_0002_m_012988_1
attempt_1324018664143_0002_m_013199_1
{noformat}
i.e. there are three maps which had to be retried. The first succeeded on being retried:
{noformat}
2011-12-16 07:09:11,013 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1324018664143_0002_m_009775_1
{noformat}
The other two failed. They failed for different reasons, which don't seem to me to be related to this investigation. In any case, after the failure:
{noformat}
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing attempt_1324018664143_0002_m_012988_0 of type TA_CONTAINER_LAUNCH_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1324018664143_0002_m_012988_0 TaskAttempt Transitioned from ASSIGNED to FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_DEALLOCATE
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing task_1324018664143_0002_m_012988 of type T_ATTEMPT_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Created attempt attempt_1324018664143_0002_m_012988_1
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node someNode
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing attempt_1324018664143_0002_m_012988_1 of type TA_RESCHEDULE
2011-12-16 07:09:15,870 INFO [Thread-31] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In HistoryEventHandler TASK_FINISHED
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1324018664143_0002_m_012988_1 TaskAttempt Transitioned from NEW to UNASSIGNED
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the event EventType: CONTAINER_REQ
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Added attempt_1324018664143_0002_m_012988_1 to list of failed maps
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Added priority=priority: 5,
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: addResourceRequest: applicationId=2 priority=5 resourceName=* numContainers=1 #asks=1
{noformat}
And then that attempt is never heard from again in the AM logs. Similarly for the other attempt I could not find the resource request in the RM logs.
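The manual grep above could be automated along these lines. This is only a rough sketch, not part of the original investigation: it assumes the AM logs have one event per line in the "TaskAttempt Transitioned from X to Y" format shown in the excerpt, and it treats an attempt whose last recorded state is UNASSIGNED as "never heard from again". The function names (last_states, stuck_attempts) are made up for illustration.

```python
import re
import sys
from typing import Dict, Iterable, List

# Matches AM log lines such as:
#   "... TaskAttemptImpl: attempt_1324018664143_0002_m_012988_1
#    TaskAttempt Transitioned from NEW to UNASSIGNED"
# The pattern is inferred from the log excerpt only (assumption).
TRANSITION = re.compile(
    r"(attempt_\d+_\d+_[mr]_\d+_\d+) TaskAttempt Transitioned from (\w+) to (\w+)"
)


def last_states(lines: Iterable[str]) -> Dict[str, str]:
    """Return the last state each task attempt transitioned to."""
    states: Dict[str, str] = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            attempt, _old_state, new_state = match.groups()
            states[attempt] = new_state  # later lines overwrite earlier ones
    return states


def stuck_attempts(lines: Iterable[str], stuck_state: str = "UNASSIGNED") -> List[str]:
    """List attempts whose final recorded state is `stuck_state` (hypothetical heuristic)."""
    return sorted(a for a, s in last_states(lines).items() if s == stuck_state)


if __name__ == "__main__":
    # Usage: python stuck_attempts.py <am-log-file> [<am-log-file> ...]
    all_lines: List[str] = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            all_lines.extend(handle)
    for attempt in stuck_attempts(all_lines):
        print(attempt)
```

Run against the files under the AM container log directory, it would print the attempts that were rescheduled but never assigned a container, provided the files are passed in chronological order.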

> Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615
> RE build
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Priority: Critical
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after
> completion 99% map phase. There are some map tasks failed. Also it's not
> scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled: Fri Dec 9 16:25:27 PST 2011 by someone from
> branches/branch-0.23/hadoop-common-project/hadoop-common
> ResourceManager version: revision 1212681 by someone source checksum
> on Fri Dec 9 16:52:07 PST 2011
> Hadoop version: revision 1212592 by someone Fri Dec 9 16:25:27 PST
> 2011
> {quote}