[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488741#comment-13488741
 ] 

Jason Lowe commented on MAPREDUCE-4729:
---------------------------------------

I tried testing the patch with a sleep job using 
-Dyarn.app.mapreduce.am.job.recovery.enable=false and manually killing the 
ApplicationMaster with a kill -9, but it didn't work.  The log showed this 
exception:

{noformat}
2012-11-01 14:37:01,543 WARN [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Could not parse the old history file. Will not have old AMinfos 
java.io.IOException: Incompatible event log version: null
        at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:70)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.readJustAMInfos(MRAppMaster.java:915)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:846)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1143)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1378)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1139)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1098)
{noformat}

It looks like the AM is buffering the history file output, and we didn't flush 
out the AMInfos from previous runs.  When I used a normal kill instead of kill 
-9, it worked.  We will want to flush/sync the job history file after writing 
the AMInfos to help guard against unclean teardowns losing prior AM attempts in 
the history.  This can be fixed in a separate JIRA if we don't want to fix it 
here.
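
To illustrate the buffering problem, here is a minimal sketch using plain JDK streams as a stand-in for the AM's history writer. The class name, record text, and helper methods are all hypothetical and are not taken from the patch. In Hadoop the analogous calls would be hflush()/hsync() on FSDataOutputStream; which of the two the fix should use is not settled here.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HistoryFlushSketch {

  /** Writes one record and, if requested, pushes it out of the buffer and to disk. */
  static void writeRecord(BufferedOutputStream out, FileOutputStream raw,
                          String record, boolean flushAndSync) throws IOException {
    out.write(record.getBytes(StandardCharsets.UTF_8));
    if (flushAndSync) {
      out.flush();         // hand the buffered bytes to the OS
      raw.getFD().sync();  // ask the OS to persist them to the device
    }
  }

  /** Returns how many bytes are in the file while the writer is still open. */
  static long visibleBytes(boolean flushAndSync) throws IOException {
    Path file = Files.createTempFile("jobhistory-sketch", ".jhist");
    try (FileOutputStream raw = new FileOutputStream(file.toFile());
         BufferedOutputStream out = new BufferedOutputStream(raw, 64 * 1024)) {
      // Hypothetical record standing in for an AMStarted event.
      writeRecord(out, raw, "AM_STARTED attempt=1\n", flushAndSync);
      // A kill -9 at this point would leave behind only what is in the file now.
      return Files.size(file);
    } finally {
      Files.deleteIfExists(file);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("without flush: " + visibleBytes(false) + " bytes in file");
    System.out.println("with flush:    " + visibleBytes(true) + " bytes in file");
  }
}
```

Without the flush, the record is still sitting in the in-process buffer when the size is checked, which is the state a kill -9 would freeze. A normal kill gives the process a chance to close the stream, which would explain why that case worked.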

Couple of other comments on the patch:
* Application attempts start from 1 instead of 0, so the first attempt tries to 
recover AMInfos when it shouldn't, which leads to a large FileNotFoundException 
stack trace being logged.
* Nit: In RecoveryService.parse there's an extra space logged before a comma.  
{{LOG.info("Got an error parsing job-history file "}} should be 
{{LOG.info("Got an error parsing job-history file"}}
* Nit: The body of the while loop in readJustAMInfos could be a bit cleaner 
with fewer conditionals.  For example:
{code}
      while ((event = jobHistoryEventReader.getNextEvent()) != null) {
        if (event.getEventType() == EventType.AM_STARTED) {
          amStartedEventsBegan = true;
          AMStartedEvent amStartedEvent = (AMStartedEvent) event;
          amInfos.add(MRBuilderUtils.newAMInfo(
            amStartedEvent.getAppAttemptId(), amStartedEvent.getStartTime(),
            amStartedEvent.getContainerId(),
            StringInterner.weakIntern(amStartedEvent.getNodeManagerHost()),
            amStartedEvent.getNodeManagerPort(),
            amStartedEvent.getNodeManagerHttpPort()));
        } else if (amStartedEventsBegan) {
          // This means AMStartedEvents began and this event is a
          // non-AMStarted event.
          // No need to continue reading all the other events.
          break;
        }
      }
{code}
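
For anyone who wants to try the two points above outside the AM, here is a self-contained sketch. The Event and EventType types are hypothetical stand-ins, not the Hadoop job-history classes, and the attemptId > 1 guard is an assumption about how the first-attempt case could be skipped, not what the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReadAMInfosSketch {

  // Hypothetical stand-ins for the job-history event types.
  enum EventType { JOB_SUBMITTED, AM_STARTED, TASK_STARTED }

  static final class Event {
    final EventType type;
    final int attemptId;
    Event(EventType type, int attemptId) {
      this.type = type;
      this.attemptId = attemptId;
    }
  }

  /** Attempts are numbered from 1, so only later attempts have prior AMInfos. */
  static boolean shouldReadPreviousAMInfos(int attemptId) {
    return attemptId > 1;
  }

  /** Collects the attempt ids of the AM_STARTED run, stopping at the first event after it. */
  static List<Integer> readJustAMInfos(Iterator<Event> events) {
    List<Integer> amInfos = new ArrayList<>();
    boolean amStartedEventsBegan = false;
    while (events.hasNext()) {
      Event event = events.next();
      if (event.type == EventType.AM_STARTED) {
        amStartedEventsBegan = true;
        amInfos.add(event.attemptId);
      } else if (amStartedEventsBegan) {
        // First non-AM_STARTED event after the run: nothing more to read.
        break;
      }
    }
    return amInfos;
  }

  /** Made-up event sequence for illustration only. */
  static List<Event> sampleEvents() {
    return Arrays.asList(
        new Event(EventType.JOB_SUBMITTED, 0),
        new Event(EventType.AM_STARTED, 1),
        new Event(EventType.AM_STARTED, 2),
        new Event(EventType.TASK_STARTED, 0),
        new Event(EventType.TASK_STARTED, 0));
  }

  public static void main(String[] args) {
    int currentAttempt = 3;
    if (shouldReadPreviousAMInfos(currentAttempt)) {
      System.out.println(readJustAMInfos(sampleEvents().iterator()));
    }
  }
}
```

The loop stops consuming the iterator as soon as the AM_STARTED run ends, so the remaining events are never parsed.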
                
> job history UI not showing all job attempts
> -------------------------------------------
>
>                 Key: MAPREDUCE-4729
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4729
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 0.23.3
>            Reporter: Thomas Graves
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4729-20121031.txt
>
>
> We are seeing a case where a job runs but the AM is running out of memory in 
> the first 3 attempts. The job eventually finishes on the 4th attempt.  When 
> you go to the job history UI for that job, it only shows the last attempt.  
> This is bad since we want to see why the first 3 attempts failed.
> The RM web ui shows all 4 attempts. 
> Also I tested this locally by running "kill" on the app master and in that 
> case the history server UI does show all attempts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
