[ https://issues.apache.org/jira/browse/MAPREDUCE-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399485#comment-13399485 ]
Robert Joseph Evans commented on MAPREDUCE-4300:
------------------------------------------------

OK, I have some results now from testing. The UncaughtExceptionHandler is helping, but it is not preventing the problem in all cases. I ran a sleep job with 50,000 map tasks, 1,000 reduce tasks, and a 50MB heap on trunk. That is just about enough memory for the AM to initialize and start working, but not to get far before OOMs start to show up.

In one case I hit a very odd issue. No OOM was thrown; instead java threw

{noformat}
2012-06-22 17:11:14,966 WARN [IPC Server handler 0 on 50500] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50500, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from <IP>:41735: error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.yarn.proto.MRClientProtocol
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.yarn.proto.MRClientProtocol
	at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService.getDescriptor(MRClientProtocol.java:370)
	at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.getDescriptorForType(MRClientProtocol.java:185)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:442)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:919)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
{noformat}

This is not a classpath issue, because the same class loaded fine on all other runs of the AM. What is more, this killed off the client.
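For what it is worth, the "Could not initialize class" pattern is consistent with an earlier Error (presumably an OOM that never made it to the log) escaping during the class's static initialization: the JVM then marks the class erroneous, and every later reference throws a bare NoClassDefFoundError with no trace of the original cause. A minimal sketch of that behavior — the demo class is made up, and a RuntimeException stands in for the real OOM:

```java
// Hypothetical demo, not Hadoop code: one failure during static
// initialization permanently poisons a class for the life of the JVM.
public class InitFailureDemo {
    static class Unlucky {
        static { boom(); }
        static void boom() {
            // Stand-in for an allocation that OOMs while the class initializes.
            throw new RuntimeException("simulated OOM in <clinit>");
        }
    }

    // Returns the simple name of whatever is thrown when the class is touched.
    static String touch() {
        try {
            new Unlucky();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    // Touch the class twice, in order, and record what each attempt threw.
    static final String FIRST = touch();   // ExceptionInInitializerError
    static final String SECOND = touch();  // NoClassDefFoundError

    public static void main(String[] args) {
        // Only the first failure carries the real cause; after that the class
        // is in an erroneous state and the JVM reports the bare
        // "Could not initialize class" error -- as in the AM log above.
        System.out.println(FIRST + " then " + SECOND);
    }
}
```

If that is what happened here, the IPC server would keep serving the same NoClassDefFoundError on every getJobReport call, which matches what the client saw.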
{noformat}
12/06/22 17:11:15 ERROR security.UserGroupInformation: PriviledgedActionException as:<USER> (auth:SIMPLE) cause:java.io.IOException
java.io.IOException
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:383)
	at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:481)
	at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
	at org.apache.hadoop.mapreduce.Job$1.run(Job.java:308)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
	at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:308)
	at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:592)
	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1274)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1236)
	at org.apache.hadoop.mapreduce.SleepJob.run(SleepJob.java:262)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.mapreduce.SleepJob.main(SleepJob.java:194)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
	at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:112)
	at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:120)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
{noformat}

and made it impossible for mapred kill to kill the job:

{noformat}
Exception in thread "main" java.io.IOException
	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
	at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:383)
	at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:481)
	at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:184)
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:280)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1244)
{noformat}

Part of the odd thing here is that the AM then refused to request any containers and got stuck doing nothing. We may need to modify the client so it can handle this and still kill the job, and possibly also the IPC server so it handles a class-not-found error better.

In another case the UncaughtExceptionHandler worked as expected and triggered a shutdown, but the AM locked up during shutdown.

{noformat}
71 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AsyncDispatcher event handler,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOfRange(Arrays.java:3209)
	at java.lang.String.<init>(String.java:215)
	at java.io.BufferedReader.readLine(BufferedReader.java:331)
	at java.io.LineNumberReader.readLine(LineNumberReader.java:182)
	at org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:85)
	at org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:407)
	at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:305)
	at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
	at org.apache.hadoop.yarn.ContainerLogAppender.append(ContainerLogAppender.java:57)
	at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
	at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
	at org.apache.log4j.Category.callAppenders(Category.java:206)
	at org.apache.log4j.Category.forcedLog(Category.java:391)
	at org.apache.log4j.Category.log(Category.java:856)
	at org.apache.commons.logging.impl.Log4JLogger.fatal(Log4JLogger.java:280)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:129)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
	at java.lang.Thread.run(Thread.java:662)
2012-06-22 17:25:45,683 WARN [IPC Server handler 0 on 50500] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50500, call org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from <IP>:35485: error: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:47,099 ERROR [TaskHeartbeatHandler PingChecker] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[TaskHeartbeatHandler PingChecker,5,main] threw an Throwable, buyt we are shutting down, so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:47,780 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler.
2012-06-22 17:25:47,780 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that iSignalled was : true
2012-06-22 17:25:47,780 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that isSignalled was true
2012-06-22 17:25:47,780 INFO [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0
2012-06-22 17:25:48,487 INFO [Thread-50] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take interrupted. Returning
2012-06-22 17:25:49,192 ERROR [Socket Reader #1 for port 50500] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Socket Reader #1 for port 50500,5,main] threw an Throwable, buyt we are shutting down, so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:49,193 WARN [Thread-1] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId job_1340385884343_0001 to have not been closed. Will close
2012-06-22 17:25:52,705 ERROR [LeaseRenewer:<USER>@<MACHINE>:9000] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LeaseRenewer:<USER>@<MACHINE>:9000,5,main] threw an Throwable, buyt we are shutting down, so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:55,427 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
2012-06-22 17:25:56,113 ERROR [Thread-51] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-51,5,main] threw an Throwable, buyt we are shutting down, so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
{noformat}

Yes, I am going to fix the spelling (s/buyt/but/g). I will upload a full stack dump of the AM as well in just a bit.

> OOM in AM can turn it into a zombie.
> ------------------------------------
>
>                 Key: MAPREDUCE-4300
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4300
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 0.23.3
>            Reporter: Robert Joseph Evans
>
> It looks like 4 threads in the AM died with OOM but not the one pinging the RM.
> stderr for this AM
> {noformat}
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> May 30, 2012 4:49:55 AM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
> WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
> May 30, 2012 4:49:55 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
> INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
> May 30, 2012 4:49:55 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
> INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
> May 30, 2012 4:49:55 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
> INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
> May 30, 2012 4:49:55 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
> INFO: Initiating Jersey application, version 'Jersey: 1.8 06/24/2011 12:17 PM'
> May 30, 2012 4:49:55 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
> INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
> May 30, 2012 4:49:56 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
> INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
> May 30, 2012 4:49:56 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
> INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
> Exception in thread "ResponseProcessor for block BP-1114822160-<IP>-1322528669066:blk_-6528896407411719649_34227308" java.lang.OutOfMemoryError: Java heap space
> 	at com.google.protobuf.CodedInputStream.<init>(CodedInputStream.java:538)
> 	at com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:55)
> 	at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:201)
> 	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:738)
> 	at org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$PipelineAckProto.parseFrom(DataTransferProtos.java:7287)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)
> Exception in thread "DefaultSpeculator background processing" java.lang.OutOfMemoryError: Java heap space
> 	at java.util.HashMap.resize(HashMap.java:462)
> 	at java.util.HashMap.addEntry(HashMap.java:755)
> 	at java.util.HashMap.put(HashMap.java:385)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.getTasks(JobImpl.java:632)
> 	at org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.maybeScheduleASpeculation(DefaultSpeculator.java:465)
> 	at org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.maybeScheduleAMapSpeculation(DefaultSpeculator.java:433)
> 	at org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.computeSpeculations(DefaultSpeculator.java:509)
> 	at org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.access$100(DefaultSpeculator.java:56)
> 	at org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator$1.run(DefaultSpeculator.java:176)
> 	at java.lang.Thread.run(Thread.java:619)
> Exception in thread "Timer for 'MRAppMaster' metrics system" java.lang.OutOfMemoryError: Java heap space
> Exception in thread "Socket Reader #4 for port 50500" java.lang.OutOfMemoryError: Java heap space
> {noformat}
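On the shutdown lock-up in the second run above: one defensive option, sketched here with made-up class names and messages (this is not the actual YarnUncaughtExceptionHandler source), is to escalate any uncaught Error straight to Runtime.halt(), which terminates the JVM without running shutdown hooks and so cannot wedge on a shutdown path that itself OOMs:

```java
// Sketch only: a defensive uncaught-exception handler. On an Error it halts
// the JVM outright, bypassing shutdown hooks that might themselves OOM or
// deadlock (as the AsyncDispatcher/log4j path did in the log above).
public class HaltOnErrorHandler implements Thread.UncaughtExceptionHandler {

    static volatile boolean shuttingDown = false;

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof Error) {
            if (shuttingDown) {
                return; // already going down; even logging may OOM here
            }
            shuttingDown = true;
            try {
                // Keep this path allocation-light: no log4j, minimal string
                // work, since the heap may already be exhausted.
                System.err.println("Thread " + t.getName()
                        + " threw an Error. Halting.");
            } finally {
                Runtime.getRuntime().halt(-1); // skips shutdown hooks entirely
            }
        } else {
            // Non-Error throwables: report and let the thread die normally.
            System.err.println("Uncaught exception in thread " + t.getName());
            e.printStackTrace();
        }
    }

    // Install as the JVM-wide default handler.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(new HaltOnErrorHandler());
    }
}
```

The trade-off is that halting skips cleanup such as closing the job history file, which the graceful path above was attempting; a middle ground would be a watchdog thread that calls halt() only if the graceful shutdown has not finished within a timeout.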