[jira] [Commented] (MAPREDUCE-3596) Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 RE build

Ravi Prakash (Commented) (JIRA) Thu, 22 Dec 2011 16:17:57 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175188#comment-13175188
 ]


Ravi Prakash commented on MAPREDUCE-3596:
-----------------------------------------

OK, here's how far I've got:

{noformat}
$ grep attempt_1324018664143_0002_m -r container_1324018664143_0002_01_000001/ \
  | grep "Created attempt" | awk '{print $10}' | sort | uniq | grep "_1$"
attempt_1324018664143_0002_m_009775_1
attempt_1324018664143_0002_m_012988_1
attempt_1324018664143_0002_m_013199_1
{noformat}
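The pipeline above depends on the attempt ID being the tenth whitespace-separated field. As a rough sketch only (not what was actually run), the same filter can be written with a regular expression so it does not depend on field position. It assumes each log record is a single unwrapped line in the AM log; the function name retried_attempts is made up for illustration, and it keeps any attempt number of 1 or higher, whereas the grep above matches exactly "_1".

```python
import re

# Matches TaskImpl records such as
# "... TaskImpl: Created attempt attempt_1324018664143_0002_m_012988_1"
# and captures the full map attempt ID plus its trailing attempt number.
CREATED = re.compile(r"Created attempt (attempt_\d+_\d+_m_\d+_(\d+))")


def retried_attempts(lines):
    """Return sorted, de-duplicated map attempt IDs whose attempt number is >= 1.

    Hypothetical helper: `lines` is any iterable of unwrapped AM log lines.
    """
    found = set()
    for line in lines:
        match = CREATED.search(line)
        if match and int(match.group(2)) >= 1:
            found.add(match.group(1))
    return sorted(found)
```

Passing it the lines of the AM log files under the container log directory should give the same kind of list as the shell pipeline, provided the log format matches the excerpts in this comment.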

That is, three maps had to be retried. The first succeeded on being
retried:
{noformat}
2011-12-16 07:09:11,013 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with 
attempt attempt_1324018664143_0002_m_009775_1
{noformat}

The other two failed, for different reasons that don't seem to me to be
related to this investigation. In any case, after the failure:
{noformat}
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing 
attempt_1324018664143_0002_m_012988_0 of type TA_CONTAINER_LAUNCH_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1324018664143_0002_m_012988_0 TaskAttempt Transitioned from ASSIGNED to 
FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the 
event EventType: CONTAINER_DEALLOCATE
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing 
task_1324018664143_0002_m_012988 of type T_ATTEMPT_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Created attempt 
attempt_1324018664143_0002_m_012988_1
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the 
event EventType: CONTAINER_FAILED
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node 
someNode
2011-12-16 07:09:15,870 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Processing 
attempt_1324018664143_0002_m_012988_1 of type TA_RESCHEDULE
2011-12-16 07:09:15,870 INFO [Thread-31] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In 
HistoryEventHandler TASK_FINISHED
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1324018664143_0002_m_012988_1 TaskAttempt Transitioned from NEW to 
UNASSIGNED
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Processing the 
event EventType: CONTAINER_REQ
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Added 
attempt_1324018664143_0002_m_012988_1 to list of failed maps
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Added 
priority=priority: 5, 
2011-12-16 07:09:15,871 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: addResourceRequest: 
applicationId=2 priority=5 resourceName=* numContainers=1 #asks=1
{noformat}
After that, the attempt is never heard from again in the AM logs. The same is
true of the other attempt.

I could not find the resource request in the RM logs.
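For anyone who wants to repeat the "never heard from again" check on the AM logs, here is a minimal sketch. It records the last state each attempt transitioned to and lists the attempts still sitting in a given state. It assumes unwrapped log lines in the "TaskAttempt Transitioned from X to Y" format shown in the excerpt above; the helper names last_states and stuck_attempts are hypothetical, and UNASSIGNED is used as the default only because that is the last transition visible in the excerpt.

```python
import re

# Matches TaskAttemptImpl records such as
# "... attempt_1324018664143_0002_m_012988_1 TaskAttempt Transitioned from NEW to UNASSIGNED"
TRANSITION = re.compile(
    r"(attempt_\d+_\d+_[mr]_\d+_\d+) TaskAttempt Transitioned from (\w+) to (\w+)"
)


def last_states(lines):
    """Map each attempt ID to the last state it transitioned to, in log order."""
    states = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            # Later records overwrite earlier ones, so the final value is the last state.
            states[match.group(1)] = match.group(3)
    return states


def stuck_attempts(lines, stuck_state="UNASSIGNED"):
    """Return sorted attempt IDs whose last logged transition ended in `stuck_state`."""
    return sorted(
        attempt for attempt, state in last_states(lines).items() if state == stuck_state
    )
```

This only reports what the AM logged; it says nothing about whether the corresponding resource request ever reached the RM.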

                
> Job got hang after completion of 99% map phase with hadoop-0.23.1.1112091615 
> RE build
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3596
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3596
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Ravi Prakash
>            Priority: Critical
>
> Courtesy [~vinaythota]
> {quote}
> Ran sort benchmark couple of times and every time the job got hang after 
> completion 99% map phase. There are some map tasks failed. Also it's not 
> scheduled some of the pending map tasks.
> Cluster size is 350 nodes.
> Build Details:
> ==============
> Compiled:       Fri Dec 9 16:25:27 PST 2011 by someone from 
> branches/branch-0.23/hadoop-common-project/hadoop-common 
> ResourceManager version:        revision 1212681 by someone source checksum 
> on Fri Dec 9 16:52:07 PST 2011
> Hadoop version:         revision 1212592 by someone Fri Dec 9 16:25:27 PST 
> 2011
> {quote}

