Ming Ma created MAPREDUCE-6135:
----------------------------------

             Summary: Job staging directory remains if MRAppMaster is OOM
                 Key: MAPREDUCE-6135
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6135
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Ming Ma


If MRAppMaster attempts run out of memory, they don't go through the normal job
cleanup process that moves history files to the history server location. When
customers try to find out why the job failed, the data isn't available in the
history server web UI.

The workaround is to extract the container id and NM id from the jhist file in
the job staging directory, then use the "yarn logs" command to get the AM logs.
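
As a rough illustration of that workaround, here is a minimal Python sketch.
It assumes the jhist file is in the JSON-encoded Avro form (an "Avro-Json"
header line, a schema line, then one JSON event per line) and that each
AM_STARTED event carries containerId, nodeManagerHost and nodeManagerPort
fields. The function name am_log_commands is made up for this example, and
the field names should be checked against the jhist files on your cluster.

```python
import json
import sys


def am_log_commands(jhist_lines, application_id):
    """Build one 'yarn logs' command per AM attempt found in a jhist file.

    Assumption: each event is a single JSON object per line and AM attempts
    are recorded as events whose "type" is "AM_STARTED".
    """
    commands = []
    for line in jhist_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the header line and blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip anything that is not valid JSON
        if not isinstance(record, dict) or record.get("type") != "AM_STARTED":
            continue  # the schema line and all other events are ignored
        event = record.get("event") or {}
        # The event body is assumed to be wrapped in a single-key object
        # named after the Avro record type.
        for body in event.values():
            commands.append(
                "yarn logs -applicationId {0} -containerId {1} "
                "-nodeAddress {2}:{3}".format(
                    application_id,
                    body["containerId"],
                    body["nodeManagerHost"],
                    body["nodeManagerPort"],
                )
            )
    return commands


if __name__ == "__main__":
    # Usage: python am_logs.py <local copy of the jhist file> <application id>
    with open(sys.argv[1]) as jhist:
        for command in am_log_commands(jhist, sys.argv[2]):
            print(command)
```

The script only prints the commands; the jhist file has to be copied out of
the staging directory first, and "yarn logs" may also need -appOwner when it
is run by a user other than the job owner.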

It would be great if the platform could take care of this by moving these
history files automatically to the history server when AM attempts don't exit
properly.

We have discussed ideas on how to address this and would like to get
suggestions from others. We are not sure whether the timeline server design
covers this scenario.

1. Define some protocol for YARN to tell the AppMaster "you have exceeded the
AM max attempts, please clean up". For example, YARN can launch the AppMaster
one more time after the AM max attempts, and MRAppMaster uses that as the
indication that this is a clean-up-only attempt.

2. Have some program periodically check job statuses and move files from the
job staging directory to the history server for finished jobs.
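
To make idea 2 more concrete, here is a minimal sketch of one sweep pass. It
is only an illustration under several assumptions: the local filesystem
stands in for HDFS, each job has its own subdirectory under the staging root,
history files end in ".jhist", and the caller supplies a lookup from job
directory name to YARN application state. The names sweep_staging and
app_state_of are hypothetical, not existing Hadoop APIs.

```python
import os
import shutil

# Final YARN application states; jobs in any other state are left alone.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def sweep_staging(staging_root, done_dir, app_state_of):
    """Move history files of finished jobs from staging_root to done_dir.

    app_state_of is a caller-supplied function (hypothetical hook) that maps
    a job directory name to its application state, or None if unknown.
    Returns the names of the files that were moved.
    """
    moved = []
    for job_dir in sorted(os.listdir(staging_root)):
        src_dir = os.path.join(staging_root, job_dir)
        if not os.path.isdir(src_dir):
            continue
        if app_state_of(job_dir) not in TERMINAL_STATES:
            continue  # still running, or state unknown: do not touch it
        for name in sorted(os.listdir(src_dir)):
            if not name.endswith(".jhist"):
                continue  # assumption: only the history files are needed
            os.makedirs(done_dir, exist_ok=True)
            shutil.move(os.path.join(src_dir, name),
                        os.path.join(done_dir, name))
            moved.append(name)
    return moved
```

A real implementation would run this periodically, use the HDFS client
instead of os and shutil, and would have to avoid racing with an AM that is
still performing its own cleanup.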



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
