[jira] [Comment Edited] (CASSANDRA-7507) OOM creates unreliable state - die instantly better

Karl Mueller (JIRA) Mon, 07 Jul 2014 19:25:16 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054433#comment-14054433
 ]


Karl Mueller edited comment on CASSANDRA-7507 at 7/8/14 2:23 AM:
-----------------------------------------------------------------

If a bug can cause the clean exit after OOM to not work as expected, then isn't 
that considered a problem?

I guess if I'm considering the value of a "clean exit" versus "possibly staying 
up, being in a weird state, or not writing the right data to disk", I would 
always prefer it to die without worrying about a clean exit. As I said, in my 
opinion, Cassandra already handles dying unexpectedly fine - there's no need to 
handle it cleanly when there's any risk. 

If there's no risk of something like 7133 happening (or a similar bug), then 
sure, clean exit is sensible, but that's clearly not guaranteed. Replaying some 
logs and then flushing is not a big deal compared to potentially bad data, 
zombie states, etc. - in my view, at least.



was (Author: kmueller):
If a bug can cause the clean exit after OOM to not work as expected, then isn't 
that considered a problem?

I guess if I'm considering the value of a "clean exit" versus "possibly staying 
up or being in a weird state", I would always prefer it to die without worrying 
about a clean exit. As I said, in my opinion, Cassandra already handles dying 
unexpectedly fine - there's no need to handle it cleanly when there's any risk. 

If there's no risk of something like 7133 happening (or a similar bug), then 
sure, clean exit is sensible, but that's clearly not guaranteed. Replaying some 
logs and then flushing is not a big deal compared to potentially bad data, 
zombie states, etc. - in my view, at least.


> OOM creates unreliable state - die instantly better
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7507
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Karl Mueller
>            Priority: Minor
>
> I had a Cassandra node run OOM. My heap had enough headroom; there was just 
> something which was either a bug or some unfortunate amount of short-term 
> memory utilization. This resulted in the following error:
>  WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java 
> (line 1713) Some hints were not written before shutdown.  This is not 
> supposed to happen.  You should (a) run repair, and (b) file a bug report
> There are no other messages of relevance besides the OOM error about 90 
> minutes earlier.
> My (limited) understanding of the JVM and Cassandra says that when it goes 
> OOM, it will attempt to signal cassandra to shut down "cleanly". The problem, 
> in my view, is that with an OOM situation, nothing is guaranteed anymore. I 
> believe it's impossible to reliably "cleanly shut down" at this point, and 
> therefore it's wrong to even try. 
> Yes, ideally things could be written out, flushed to disk, memory messages 
> written, other nodes notified, etc., but is there any reason to believe that 
> any of those steps could happen? Would happen? Couldn't bad data be written 
> to disk at this point rather than good data? Some network messages delivered, 
> but not others?
> I think Cassandra should have the option (possibly as the default) to kill 
> itself immediately, in a hard way, when the OOM condition happens, and not 
> rely on the Java-based clean shutdown process. Cassandra already handles 
> recovery from unclean shutdown, and it's not a big deal. My node, for 
> example, stayed in a sort-of-alive state for 90 minutes, where who knows what 
> it was doing or not doing.
> I don't know enough about the JVM and options for it to know the best exact 
> implementation of "die instantly on OOM", but it should be something that's 
> possible either with some flags or a C library (which doesn't rely on java 
> memory to do something which it may not be able to get!)
> Short version: a kill -9 of all C* processes in that instance without needing 
> more java memory, when OOM is raised
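
As a rough illustration of the "die instantly on OOM" idea in the description above, here is a minimal Java sketch. It is not Cassandra's code: the class name `HaltOnOom`, the helper `isOutOfMemory`, and the exit status 100 are all made up for this example. It relies on `Runtime.halt()`, which terminates the JVM without running shutdown hooks, so a hook such as the StorageServiceShutdownHook seen in the log would not get a chance to run. A default uncaught-exception handler only sees errors that escape a thread, so an OutOfMemoryError that is caught and swallowed elsewhere would not trigger it. On HotSpot JVMs, the flag `-XX:OnOutOfMemoryError="kill -9 %p"` is another way to get a similar effect from outside the Java code.

```java
// Hypothetical sketch: halt the JVM immediately when an uncaught
// OutOfMemoryError reaches the default handler. Names are illustrative only.
final class HaltOnOom {

    // Arbitrary exit status chosen for this example (an assumption).
    static final int OOM_EXIT_STATUS = 100;

    // Upper bound on how far the cause chain is followed, to avoid
    // looping forever on a cyclic chain.
    private static final int MAX_CAUSE_DEPTH = 32;

    private HaltOnOom() {}

    /** Returns true if t, or any cause within the bounded chain, is an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof OutOfMemoryError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    /** Installs a JVM-wide handler that halts on uncaught OOM. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOutOfMemory(error)) {
                // halt() skips shutdown hooks, unlike System.exit().
                Runtime.getRuntime().halt(OOM_EXIT_STATUS);
            }
        });
    }
}
```

Calling `HaltOnOom.install()` once at startup would be enough for this sketch. The trade-off is the one argued in this thread: nothing is flushed on the way out, and recovery is left to the normal commit log replay at the next start.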



--
This message was sent by Atlassian JIRA
(v6.2#6252)
