[ https://issues.apache.org/jira/browse/HDFS-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204235#comment-13204235 ]
Suresh Srinivas commented on HDFS-2911:
---------------------------------------

bq. @Eli ... as Todd points out not all OOMs are unrecoverable ...
bq. On the NN I'd rather see the critical threads all get uncaughtExceptionHandlers attached which abort the NN if they fail. So if an individual rpc handler OOMEs (eg by an invalid request making it try to allocate a 4G array or something) it won't take down the NN, whereas if the LeaseManager OOMEs it should.

I think this may not be a good idea. In fact, I would say it is more important to shut down the NN when an RPC handler gets an OOME. Let's say an RPC handler updated the in-memory namespace and was about to add the change to the editlog. The system was indeed running out of memory, and before the editlog could be written the handler got an OOME. If we do not shut down at this point, we could end up with interesting data corruption issues.

Instead of trying to categorize which cases are safe and which are not, we should use the kill -9 option. In cases where the OOME is caused by the system trying to create a large object, we could add appropriate size/limit checks.

> Gracefully handle OutOfMemoryErrors
> -----------------------------------
>
>                 Key: HDFS-2911
>                 URL: https://issues.apache.org/jira/browse/HDFS-2911
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, name-node
>    Affects Versions: 0.23.0, 1.0.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>
> We should gracefully handle j.l.OutOfMemoryError exceptions in the NN or DN.
> We should catch them in a high-level handler, cleanly fail the RPC (vs
> sending back the OOM stacktrace) or background thread, and shut down the NN or
> DN. Currently the process is left in a not well-tested state
> (continuously fails RPCs and internal threads, may or may not recover and
> doesn't shut down gracefully).
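
For illustration only, the "abort on any OOME" position argued in the comment above could be sketched as a single uncaught-exception handler installed for every thread, rather than one per "critical" thread. This is not code from Hadoop or from the HDFS-2911 patch. The class name OomeAbortHandler, the exit code, and the injectable terminator are all hypothetical. The sketch assumes that Runtime.halt(), which skips shutdown hooks, is an acceptable in-process stand-in for the "kill -9 option"; the comment may instead have had an external mechanism in mind, such as the HotSpot flag -XX:OnOutOfMemoryError.

```java
import java.util.function.IntConsumer;

/**
 * Hypothetical sketch: terminate the process when any thread dies from an
 * OutOfMemoryError, instead of deciding per thread whether it is "safe".
 */
final class OomeAbortHandler implements Thread.UncaughtExceptionHandler {
    /** Arbitrary non-zero exit status chosen for this sketch. */
    static final int OOME_EXIT_CODE = 1;

    /** How to terminate; injectable so the handler can be exercised without killing the JVM. */
    private final IntConsumer terminator;

    OomeAbortHandler(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /** Variant that halts the JVM immediately; halt() does not run shutdown hooks. */
    static OomeAbortHandler halting() {
        return new OomeAbortHandler(code -> Runtime.getRuntime().halt(code));
    }

    /** Returns true if the throwable or one of its causes is an OutOfMemoryError. */
    static boolean causedByOome(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        if (causedByOome(error)) {
            // State may be half-updated (e.g. memory changed, log not written): stop now.
            terminator.accept(OOME_EXIT_CODE);
        } else {
            System.err.println("Thread " + thread.getName() + " died: " + error);
        }
    }
}
```

It would be installed once at startup with Thread.setDefaultUncaughtExceptionHandler(OomeAbortHandler.halting()). One caveat: a handler like this only sees errors that actually escape a thread's run method, so any RPC handler loop that catches Throwable itself would have to rethrow or terminate explicitly for the OOME case.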