[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399485#comment-13399485
 ] 

Robert Joseph Evans commented on MAPREDUCE-4300:
------------------------------------------------

OK, I have some results now from testing: the UncaughtExceptionHandler is 
helping, but it is not preventing the problem in all cases.  I ran a sleep job 
with 50,000 map tasks, 1,000 reduce tasks, and a 50MB heap on trunk.  That is 
just about enough memory for the AM to initialize and start working, but not 
enough to get very far before OOMs start to show up.
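For reference, the pattern the handler follows is roughly: the first uncaught Error triggers an immediate halt, and anything thrown after that is logged and ignored.  A simplified sketch of that pattern (the class name here is hypothetical, and this is not the actual YarnUncaughtExceptionHandler code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of the halt-on-Error pattern; HaltOnErrorHandler is a
// hypothetical name, not the real YarnUncaughtExceptionHandler.
public class HaltOnErrorHandler implements Thread.UncaughtExceptionHandler {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    // True only for the first uncaught Error; later Throwables are ignored
    // because the process is already going down.
    boolean shouldShutDown(Throwable t) {
        return t instanceof Error && shuttingDown.compareAndSet(false, true);
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (shouldShutDown(e)) {
            System.err.println("Thread " + t.getName()
                    + " threw an Error.  Shutting down now...");
            // halt() rather than exit(): shutdown hooks run by exit() may
            // themselves OOM or deadlock when the heap is already exhausted.
            Runtime.getRuntime().halt(-1);
        } else {
            System.err.println("Thread " + t.getName()
                    + " threw a Throwable, but we are shutting down, so ignoring it");
        }
    }
}
```

Installed once at startup with {{Thread.setDefaultUncaughtExceptionHandler(new HaltOnErrorHandler())}}, this covers every thread that does not set its own handler.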

In one case I got a very odd issue.  No OOM was thrown; instead Java threw 
{noformat}
2012-06-22 17:11:14,966 WARN [IPC Server handler 0 on 50500] 
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50500, call 
org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 
<IP>:41735: error: java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.hadoop.yarn.proto.MRClientProtocol
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.hadoop.yarn.proto.MRClientProtocol
        at 
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService.getDescriptor(MRClientProtocol.java:370)
        at 
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.getDescriptorForType(MRClientProtocol.java:185)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:442)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:919)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
{noformat}

This is not a classpath issue, because the same classpath worked on all other 
runs of the AM.  What is more, this killed off the client.
{noformat}
12/06/22 17:11:15 ERROR security.UserGroupInformation: 
PriviledgedActionException as:<USER> (auth:SIMPLE) cause:java.io.IOException
java.io.IOException
        at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
        at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:383)
        at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:481)
        at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
        at org.apache.hadoop.mapreduce.Job$1.run(Job.java:308)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
        at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:308)
        at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:592)
        at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1274)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1236)
        at org.apache.hadoop.mapreduce.SleepJob.run(SleepJob.java:262)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.mapreduce.SleepJob.main(SleepJob.java:194)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
        at 
org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:112)
        at 
org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:120)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
{noformat}

and made it impossible for a mapred kill command to kill the job
{noformat}
Exception in thread "main" java.io.IOException
        at 
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:315)
        at 
org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:383)
        at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:481)
        at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:184)
        at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:280)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1244)
{noformat}

Part of the odd thing here is that the AM then refused to request any 
containers and got stuck doing nothing.  We may need to modify the client so 
that it can handle this and still kill the job, and possibly also the IPC 
server so that it handles a class-not-found error better.
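For what it is worth, the "Could not initialize class" form of NoClassDefFoundError is consistent with an earlier OOM rather than a classpath problem: if a class's static initializer fails once (for example with an OutOfMemoryError), the JVM marks the class erroneous, and every later reference throws NoClassDefFoundError with no trace of the original cause.  A small illustration (class names are hypothetical, with a RuntimeException standing in for the OOM):

```java
// Illustration only: the first use of a class whose static initializer throws
// raises ExceptionInInitializerError (carrying the real cause); every use after
// that raises NoClassDefFoundError: Could not initialize class ...
public class StaticInitDemo {
    static class Flaky {
        static {
            // Stand-in for an OutOfMemoryError thrown during class initialization.
            if (true) throw new RuntimeException("simulated OOM in static initializer");
        }
        static int value = 42; // never actually assigned
    }

    // Touch the class and report which error type the JVM raises.
    static String touch() {
        try {
            return "ok:" + Flaky.value;
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(touch()); // ExceptionInInitializerError
        System.out.println(touch()); // NoClassDefFoundError
    }
}
```

That would explain why the classpath looked fine on every other run: an OOM during the static initialization of MRClientProtocol on this one run would poison the class for the rest of the AM's life.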


In another case the UncaughtExceptionHandler worked as expected and triggered a 
shutdown, but the AM locked up during shutdown.
{noformat}
71 FATAL [AsyncDispatcher event handler] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[AsyncDispatcher event handler,5,main] threw an Error.  Shutting down 
now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:215)
        at java.io.BufferedReader.readLine(BufferedReader.java:331)
        at java.io.LineNumberReader.readLine(LineNumberReader.java:182)
        at 
org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:85)
        at 
org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:407)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:305)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
        at 
org.apache.hadoop.yarn.ContainerLogAppender.append(ContainerLogAppender.java:57)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at 
org.apache.commons.logging.impl.Log4JLogger.fatal(Log4JLogger.java:280)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:129)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:662)
2012-06-22 17:25:45,683 WARN [IPC Server handler 0 on 50500] 
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50500, call 
org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.getJobReport from 
<IP>:35485: error: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:47,099 ERROR [TaskHeartbeatHandler PingChecker] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[TaskHeartbeatHandler PingChecker,5,main] threw an Throwable, buyt we are 
shutting down, so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:47,780 INFO [Thread-1] 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. 
Signaling RMCommunicator and JobHistoryEventHandler.
2012-06-22 17:25:47,780 INFO [Thread-1] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator 
notified that iSignalled was : true
2012-06-22 17:25:47,780 INFO [Thread-1] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: 
JobHistoryEventHandler notified that isSignalled was true
2012-06-22 17:25:47,780 INFO [Thread-1] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping 
JobHistoryEventHandler. Size of the outstanding queue size is 0
2012-06-22 17:25:48,487 INFO [Thread-50] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: EventQueue take 
interrupted. Returning
2012-06-22 17:25:49,192 ERROR [Socket Reader #1 for port 50500] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Socket 
Reader #1 for port 50500,5,main] threw an Throwable, buyt we are shutting down, 
so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:49,193 WARN [Thread-1] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Found jobId 
job_1340385884343_0001 to have not been closed. Will close
2012-06-22 17:25:52,705 ERROR [LeaseRenewer:<USER>@<MACHINE>:9000] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LeaseRenewer:<USER>@<MACHINE>:9000,5,main] threw an Throwable, buyt we 
are shutting down, so ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
2012-06-22 17:25:55,427 WARN [Thread-51] org.apache.hadoop.hdfs.DFSClient: 
DataStreamer Exception
2012-06-22 17:25:56,113 ERROR [Thread-51] 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-51,5,main] threw an Throwable, buyt we are shutting down, so 
ignoring this
java.lang.OutOfMemoryError: GC overhead limit exceeded
{noformat}

Yes, I am going to fix the spelling (s/buyt/but/g).  I will also upload a full 
stack dump of the AM in just a bit.
                
> OOM in AM can turn it into a zombie.
> ------------------------------------
>
>                 Key: MAPREDUCE-4300
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4300
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 0.23.3
>            Reporter: Robert Joseph Evans
>
> It looks like 4 threads in the AM died with OOM but not the one pinging the 
> RM.
> stderr for this AM
> {noformat}
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use 
> org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> May 30, 2012 4:49:55 AM 
> com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider
>  get
> WARNING: You are attempting to use a deprecated API (specifically, attempting 
> to @Inject ServletContext inside an eagerly created singleton. While we allow 
> this for backwards compatibility, be warned that this MAY have unexpected 
> behavior if you have more than one injector (with ServletModule) running in 
> the same JVM. Please consult the Guice documentation at 
> http://code.google.com/p/google-guice/wiki/Servlets for more information.
> May 30, 2012 4:49:55 AM 
> com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
> INFO: Registering 
> org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider 
> class
> May 30, 2012 4:49:55 AM 
> com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
> INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a 
> provider class
> May 30, 2012 4:49:55 AM 
> com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
> INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as 
> a root resource class
> May 30, 2012 4:49:55 AM 
> com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
> INFO: Initiating Jersey application, version 'Jersey: 1.8 06/24/2011 12:17 PM'
> May 30, 2012 4:49:55 AM 
> com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory 
> getComponentProvider
> INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver 
> to GuiceManagedComponentProvider with the scope "Singleton"
> May 30, 2012 4:49:56 AM 
> com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory 
> getComponentProvider
> INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to 
> GuiceManagedComponentProvider with the scope "Singleton"
> May 30, 2012 4:49:56 AM 
> com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory 
> getComponentProvider
> INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to 
> GuiceManagedComponentProvider with the scope "PerRequest"
> Exception in thread "ResponseProcessor for block 
> BP-1114822160-<IP>-1322528669066:blk_-6528896407411719649_34227308" 
> java.lang.OutOfMemoryError: Java heap space
>       at com.google.protobuf.CodedInputStream.<init>(CodedInputStream.java:538)
>       at 
> com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:55)
>       at 
> com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:201)
>       at 
> com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:738)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos$PipelineAckProto.parseFrom(DataTransferProtos.java:7287)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
>       at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)
> Exception in thread "DefaultSpeculator background processing" 
> java.lang.OutOfMemoryError: Java heap space
>       at java.util.HashMap.resize(HashMap.java:462)
>       at java.util.HashMap.addEntry(HashMap.java:755)
>       at java.util.HashMap.put(HashMap.java:385)
>       at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.getTasks(JobImpl.java:632)
>       at 
> org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.maybeScheduleASpeculation(DefaultSpeculator.java:465)
>       at 
> org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.maybeScheduleAMapSpeculation(DefaultSpeculator.java:433)
>       at 
> org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.computeSpeculations(DefaultSpeculator.java:509)
>       at 
> org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator.access$100(DefaultSpeculator.java:56)
>       at 
> org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator$1.run(DefaultSpeculator.java:176)
>       at java.lang.Thread.run(Thread.java:619)
> Exception in thread "Timer for 'MRAppMaster' metrics system" 
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "Socket Reader #4 for port 50500" 
> java.lang.OutOfMemoryError: Java heap space
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
