[ 
https://issues.apache.org/jira/browse/HDFS-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204235#comment-13204235
 ] 

Suresh Srinivas commented on HDFS-2911:
---------------------------------------

bq. @Eli ... as Todd points out not all OOMs are unrecoverable ...
bq. On the NN I'd rather see the critical threads all get 
uncaughtExceptionHandlers attached which abort the NN if they fail. So if an 
individual rpc handler OOMEs (eg by an invalid request making it try to 
allocate a 4G array or something) it won't take down the NN, whereas if the 
LeaseManager OOMEs it should.
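
For illustration, the quoted proposal (attach an uncaught-exception handler to 
each critical thread and abort if it fails) might look roughly like the sketch 
below. The names CriticalThreads, startCritical and abortProcess are 
hypothetical and are not taken from the HDFS code base. The abort action is 
passed in as a parameter so the demo can record the error instead of halting 
the JVM.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class CriticalThreads {

    /**
     * Starts a daemon thread and routes any Throwable that escapes its
     * run() method (including OutOfMemoryError) to onFailure.
     */
    public static Thread startCritical(String name, Runnable body,
                                       Consumer<Throwable> onFailure) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, err) -> onFailure.accept(err));
        t.start();
        return t;
    }

    /**
     * One possible abort action (an assumption, not the HDFS behavior):
     * halt immediately without running shutdown hooks.
     */
    public static void abortProcess(Throwable err) {
        Runtime.getRuntime().halt(1);
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo only: record the failure instead of halting the JVM.
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread t = startCritical("lease-monitor-demo",
                () -> { throw new OutOfMemoryError("simulated"); },
                seen::set);
        t.join();
        System.out.println("handler saw: " + seen.get());
    }
}
```

In a real daemon the critical threads would presumably be started with 
CriticalThreads::abortProcess as the handler, while ordinary RPC handler 
threads would not get one.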

I think this may not be a good idea. In fact, I would say it is more important 
to shut down the NN when an RPC handler gets an OOME. Let's say an RPC handler 
updated the in-memory namespace and was about to add the change to the editlog. 
The system was indeed running out of memory, and the handler got an OOME before 
the editlog could be written. If we do not shut down at this point, the 
in-memory namespace and the editlog disagree, and we could end up with 
interesting data corruption issues.

Instead of trying to categorize which OOMEs are safe and which are not, we 
should use the kill -9 option. In cases where the OOME is caused by the system 
trying to create a large object, we could add appropriate size/limit checks.
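
A size/limit check of the kind mentioned above could be sketched as follows, 
assuming a length-prefixed payload. The class name BoundedRead, the method 
readPayload and the 64 MB limit are hypothetical and only illustrate the idea 
of rejecting an oversized request before allocating memory for it.

```java
import java.io.DataInputStream;
import java.io.IOException;

public class BoundedRead {

    // Hypothetical upper bound; a real limit would come from configuration.
    static final int MAX_PAYLOAD_BYTES = 64 * 1024 * 1024;

    /**
     * Reads a 4-byte length prefix followed by that many bytes, and rejects
     * the request before allocating if the declared length is out of range.
     */
    public static byte[] readPayload(DataInputStream in, int maxBytes)
            throws IOException {
        int len = in.readInt();
        if (len < 0 || len > maxBytes) {
            throw new IOException("payload length " + len
                    + " outside [0, " + maxBytes + "]");
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }
}
```

With a check like this, an invalid request fails with an ordinary IOException 
rather than an OOME. For OOMEs that still occur, HotSpot JVMs accept an option 
of the form -XX:OnOutOfMemoryError="kill -9 %p"; whether that is the exact 
mechanism meant by "kill -9 option" here is an assumption.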
                
> Gracefully handle OutOfMemoryErrors
> -----------------------------------
>
>                 Key: HDFS-2911
>                 URL: https://issues.apache.org/jira/browse/HDFS-2911
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, name-node
>    Affects Versions: 0.23.0, 1.0.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>
> We should gracefully handle j.l.OutOfMemoryError exceptions in the NN or DN. 
> We should catch them in a high-level handler, cleanly fail the RPC (vs 
> sending back the OOM stacktrace) or background thread, and shut down the NN or 
> DN. Currently the process is left in a not well-tested state 
> (continuously fails RPCs and internal threads, may or may not recover, and 
> doesn't shut down gracefully).


        
