[ https://issues.apache.org/jira/browse/MAPREDUCE-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379118#comment-14379118 ]

Hadoop QA commented on MAPREDUCE-4443:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12579168/MAPREDUCE-4443-trunk-3.patch
  against trunk revision 53a28af.

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5332//console

This message is automatically generated.

> MR AM and job history server should be resilient to jobs that exceed counter limits
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4443
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4443
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>            Assignee: Mayank Bansal
>              Labels: usability
>         Attachments: MAPREDUCE-4443-trunk-1.patch, MAPREDUCE-4443-trunk-2.patch,
>                      MAPREDUCE-4443-trunk-3.patch, MAPREDUCE-4443-trunk-draft.patch,
>                      am_failed_counter_limits.txt
>
>
> We saw this problem migrating applications to MapReduceV2:
> Our applications use hadoop counters extensively (1000+ counters for certain jobs). While this may not be one of recommended best practices in hadoop, the real issue here is reliability of the framework when applications exceed counter limits.
> The hadoop servers (yarn, history server) were originally brought up with mapreduce.job.counters.max=1000 under core-site.xml.
> We then ran a map-reduce job under an application using its own job-specific overrides, with mapreduce.job.counters.max=10000.
> All the tasks for the job finished successfully; however, the overall job still failed because the AM encountered exceptions such as:
> {code}
> 2012-07-12 17:31:43,485 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 71
> 2012-07-12 17:31:43,502 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 1001 max=1000
>     at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:58)
>     at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:65)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:77)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:94)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:105)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:202)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:337)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1212)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1198)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1179)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:711)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.checkJobCompleteSuccess(JobImpl.java:737)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.checkJobForCompletion(JobImpl.java:1360)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1340)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1323)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:380)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:666)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:113)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:890)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:886)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
>     at java.lang.Thread.run(Thread.java:662)
> 2012-07-12 17:31:43,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> 2012-07-12 17:31:43,503 INFO [Thread-1] org.apache.had
> {code}
> The overall job failed, and the job history was not accessible at the end of the job either (it did not show up in the job history server).
> We were able to work around the issue by raising the limits in core-site.xml and restarting the yarn servers. However, that forced us to raise the global counter limit to the highest value any individual application might use, which is hard to predict.
> The original job then succeeded with the new global limits.
> However, since we didn't restart the job history server, it was unable to display the job history page for the successful job at all, as it still hit the counter-limit-exceeded exception. Restarting the job history server finally made the application available under job history.
> I'll also attach AM logs to help debug the issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
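
The failure mode reported in the issue can be illustrated with a small standalone model. This is only a sketch and does not use any Hadoop API: `CounterSet`, `LimitExceededError`, `aggregate`, and `run_job` are hypothetical names, and the assumption that the final aggregation step enforces the daemon-side limit rather than the per-job override is inferred from the report and the stack trace, not confirmed from the Hadoop source.

```python
class LimitExceededError(Exception):
    """Hypothetical stand-in for the LimitExceededException seen in the log."""


class CounterSet:
    """A set of named counters that refuses to grow past max_counters."""

    def __init__(self, max_counters):
        self.max_counters = max_counters
        self.values = {}

    def increment(self, name, amount=1):
        # Creating a new counter is the only operation that checks the limit.
        if name not in self.values:
            wanted = len(self.values) + 1
            if wanted > self.max_counters:
                raise LimitExceededError(
                    f"Too many counters: {wanted} max={self.max_counters}"
                )
            self.values[name] = 0
        self.values[name] += amount

    def merge(self, other):
        # Fold another counter set into this one, counter by counter.
        for name, amount in other.values.items():
            self.increment(name, amount)


def aggregate(task_counter_sets, limit):
    """Combine per-task counters into job totals under the given limit."""
    total = CounterSet(limit)
    for task_counters in task_counter_sets:
        total.merge(task_counters)
    return total


def run_job(num_counters, task_limit, aggregation_limit):
    """Model a job whose task succeeds but whose final aggregation may fail."""
    # Assumption: the task runs with the job-specific override.
    task = CounterSet(task_limit)
    for i in range(num_counters):
        task.increment(f"counter_{i}")
    # Assumption: the final aggregation uses the limit the daemons started with.
    try:
        aggregate([task], aggregation_limit)
    except LimitExceededError as exc:
        return f"FAILED: {exc}"
    return "SUCCEEDED"
```

With a task limit of 10000 and an aggregation limit of 1000, `run_job(1001, 10000, 1000)` reports a failure even though the task itself finished, which mirrors the symptom described in the issue. Raising the aggregation limit to 10000 lets the same job succeed, matching the workaround of raising the global limit and restarting the servers.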