[ https://issues.apache.org/jira/browse/MAPREDUCE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488741#comment-13488741 ]
Jason Lowe commented on MAPREDUCE-4729:
---------------------------------------

I tried testing the patch with a sleep job using -Dyarn.app.mapreduce.am.job.recovery.enable=false and manually killing the ApplicationMaster with a kill -9, but it didn't work. The log showed this exception:

{noformat}
2012-11-01 14:37:01,543 WARN [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Could not parse the old history file. Will not have old AMinfos
java.io.IOException: Incompatible event log version: null
    at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:70)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.readJustAMInfos(MRAppMaster.java:915)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:846)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1143)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1378)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1139)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1098)
{noformat}

It looks like the AM is buffering the history file output, and we didn't flush out the AMInfos from previous runs. When I used a normal kill instead of kill -9, it worked. We will want to flush/sync the job history file after writing the AMInfos to help guard against unclean teardowns losing prior AM attempts in the history. This can be fixed in a separate JIRA if we don't want to fix it here.

Couple of other comments on the patch:

* Application attempts start from 1 instead of 0, so the first attempt tries to recover AMInfos when it shouldn't and leads to a large FileNotFoundException stacktrace being logged.
* Nit: In RecoveryService.parse there's an extra space logged before a comma.
{{LOG.info("Got an error parsing job-history file "}} should be {{LOG.info("Got an error parsing job-history file"}}
* Nit: The body of the while loop in readJustAMInfos could be a bit cleaner with fewer conditionals. For example:

{code}
while ((event = jobHistoryEventReader.getNextEvent()) != null) {
  if (event.getEventType() == EventType.AM_STARTED) {
    amStartedEventsBegan = true;
    AMStartedEvent amStartedEvent = (AMStartedEvent) event;
    amInfos.add(MRBuilderUtils.newAMInfo(
        amStartedEvent.getAppAttemptId(),
        amStartedEvent.getStartTime(),
        amStartedEvent.getContainerId(),
        StringInterner.weakIntern(amStartedEvent.getNodeManagerHost()),
        amStartedEvent.getNodeManagerPort(),
        amStartedEvent.getNodeManagerHttpPort()));
  } else if (amStartedEventsBegan) {
    // This means AMStartedEvents began and this event is a
    // non-AMStarted event.
    // No need to continue reading all the other events.
    break;
  }
}
{code}

> job history UI not showing all job attempts
> -------------------------------------------
>
>                 Key: MAPREDUCE-4729
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4729
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 0.23.3
>            Reporter: Thomas Graves
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4729-20121031.txt
>
>
> We are seeing a case where a job runs but the AM is running out of memory in
> the first 3 attempts. The job eventually finishes on the 4th attempt. When
> you go to the job history UI for that job, it only shows the last attempt.
> This is bad since we want to see why the first 3 attempts failed.
> The RM web ui shows all 4 attempts.
> Also I tested this locally by running "kill" on the app master and in that
> case the history server UI does show all attempts.
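
On the flush/sync point raised in the comment: the following is a minimal, standalone sketch of why a buffered record is lost on an unclean teardown, not the actual MRAppMaster or history-writer code. It uses plain java.io on a local temp file; the class HistoryFlushSketch, its nested BufferedHistoryWriter, and the record text are all hypothetical names made up for illustration. The real job history file is written through Hadoop's filesystem streams, where the analogous calls would presumably be hflush()/hsync(), but that is an assumption about the eventual fix rather than something the patch does.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HistoryFlushSketch {

    /** Hypothetical stand-in for a history writer that buffers records in memory. */
    static final class BufferedHistoryWriter implements AutoCloseable {
        private final FileOutputStream file;
        private final BufferedOutputStream buffer;

        BufferedHistoryWriter(Path path) throws IOException {
            this.file = new FileOutputStream(path.toFile());
            this.buffer = new BufferedOutputStream(file);
        }

        void write(String record) throws IOException {
            // Small records stay in the in-memory buffer until it fills or is flushed.
            buffer.write((record + "\n").getBytes(StandardCharsets.UTF_8));
        }

        void flushAndSync() throws IOException {
            buffer.flush();       // push buffered bytes to the OS
            file.getFD().sync();  // ask the OS to persist them to the device
        }

        @Override
        public void close() throws IOException {
            buffer.close();
        }
    }

    /**
     * Returns the file size observed while the writer is still open, which
     * approximates what a reader would find after the writer is killed with kill -9.
     */
    public static long bytesOnDiskBeforeClose(boolean flushAfterWrite) {
        try {
            Path path = Files.createTempFile("jobhistory-sketch", ".jhist");
            try (BufferedHistoryWriter writer = new BufferedHistoryWriter(path)) {
                writer.write("AM_STARTED attempt=1"); // illustrative record only
                if (flushAfterWrite) {
                    writer.flushAndSync();
                }
                return Files.size(path);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without flush: " + bytesOnDiskBeforeClose(false) + " bytes");
        System.out.println("with flush:    " + bytesOnDiskBeforeClose(true) + " bytes");
    }
}
```

Without the explicit flush, the record only reaches the file when the stream is closed, which never happens on kill -9. That matches the empty (version "null") history file seen in the stack trace above, and a normal kill working because shutdown gets a chance to close the stream.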