[
https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gera Shegalov updated MAPREDUCE-5044:
-------------------------------------
Attachment: MAPREDUCE-5044.v04.patch
v04 applies on top of YARN-1515.v05. It now makes sure that a thread dump is
also created in uber mode.
Added unit tests for a normal MR job and an uber MR job.
While working on this I realized that we need to discuss how
mapreduce.task.timeout is treated in uber mode. Right now it is effectively
ignored: the AM does not kill itself, and LocalContainerLauncher processes
CONTAINER_REMOTE_CLEANUP inline with the stuck task in SubtaskRunner. The
liveness monitor for the AM in the RM does not catch the problem either,
because RMCommunicator heartbeats in a separate allocator thread.
I am considering two options:
- move heartbeat() into SubtaskRunner for uber mode, so that the liveness
monitor catches the stuck uber task.
- call System.exit(errorcode) when TA_TIMEOUT occurs.
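To illustrate why the first option would help, here is a minimal,
single-threaded simulation of the two heartbeat placements. It is only a
sketch under assumptions: LivenessMonitor, expiresWhenTaskHangs and all the
timing values are hypothetical names and numbers made up for this example,
not the actual YARN or MR classes.

```java
public class UberHeartbeatSketch {

    /** Hypothetical stand-in for a liveness monitor: expires a client whose last ping is too old. */
    static final class LivenessMonitor {
        private final long expiryMs;
        private long lastPingMs;

        LivenessMonitor(long expiryMs, long nowMs) {
            this.expiryMs = expiryMs;
            this.lastPingMs = nowMs;
        }

        void ping(long nowMs) {
            lastPingMs = nowMs;
        }

        boolean isExpired(long nowMs) {
            return nowMs - lastPingMs > expiryMs;
        }
    }

    /**
     * Simulates an uber job whose task hangs at stuckAtMs, stepping a fake clock
     * in 1-second ticks. Returns true if the monitor expires the AM by endMs.
     */
    static boolean expiresWhenTaskHangs(boolean heartbeatFromTaskThread,
                                        long stuckAtMs, long endMs, long expiryMs) {
        LivenessMonitor monitor = new LivenessMonitor(expiryMs, 0L);
        for (long now = 0L; now <= endMs; now += 1000L) {
            boolean taskMakingProgress = now < stuckAtMs;
            // A separate heartbeat thread keeps pinging no matter what the task does;
            // a heartbeat issued from the task thread stops once the task hangs.
            if (!heartbeatFromTaskThread || taskMakingProgress) {
                monitor.ping(now);
            }
            if (monitor.isExpired(now)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Heartbeat in a separate thread: the hang goes unnoticed.
        System.out.println(expiresWhenTaskHangs(false, 5000L, 60000L, 10000L));
        // Heartbeat tied to the task thread: the monitor eventually expires the AM.
        System.out.println(expiresWhenTaskHangs(true, 5000L, 60000L, 10000L));
    }
}
```

The second option avoids touching the heartbeat path at all, at the cost of
the AM exiting abruptly on TA_TIMEOUT.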
> Have AM trigger jstack on task attempts that timeout before killing them
> ------------------------------------------------------------------------
>
> Key: MAPREDUCE-5044
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
> Assignee: Gera Shegalov
> Attachments: MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch,
> MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, Screen Shot 2013-11-12 at
> 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack
> output via SIGQUIT before killing the task attempt. This would be invaluable
> for helping users debug their hung tasks, especially if they do not have
> shell access to the nodes.