[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-11-11 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17230243#comment-17230243
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

Thanks [~jwest]. Addressed your comments and ran CI (unit, jvm dtest and dtest) 
after rebasing. There are a few test failures, but they do not look related to 
the change. 

CI result: 
https://app.circleci.com/pipelines/github/yifan-c/cassandra/159/workflows/a37b8a85-b705-479e-b7ca-846bb71b36dc

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict Elliott Smith
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
> Attachments: oom-experiments.zip
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.
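
The "single thread spawned at startup" idea from the description could be 
sketched roughly as below. This is only an illustration under assumptions: the 
class and method names ({{FatalErrorPropagator}}, {{start}}, {{report}}) are 
hypothetical and are not taken from the Cassandra code base or from the patch. 
Note that the sketch only guarantees that the error reaches the thread's 
uncaught exception handler; what that handler then does (exit, dump, log) is 
left to the application.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not the Cassandra patch: one daemon thread rethrows
// fatal errors handed to it, so they escape every catch-all on the caller's stack.
final class FatalErrorPropagator
{
    private static final BlockingQueue<Error> PENDING = new LinkedBlockingQueue<>();

    // Start the single propagator thread; intended to be called once at startup.
    static Thread start(Thread.UncaughtExceptionHandler handler)
    {
        Thread t = new Thread(() -> {
            Error fatal;
            try
            {
                fatal = PENDING.take(); // block until a fatal error is reported
            }
            catch (InterruptedException e)
            {
                return; // interrupted while idle: shut down quietly
            }
            throw fatal; // nothing above this frame catches it, so the handler sees it
        }, "fatal-error-propagator");
        t.setDaemon(true);
        t.setUncaughtExceptionHandler(handler);
        t.start();
        return t;
    }

    // Called by code that caught an error it must not swallow (e.g. inside a
    // framework callback whose caller catches Throwable).
    static void report(Error fatal)
    {
        PENDING.offer(fatal);
    }
}
```

A caller inside a catch-all framework would invoke 
{{FatalErrorPropagator.report(oom)}} instead of rethrowing. The cost is the one 
parked thread the description mentions.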



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-11-09 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228865#comment-17228865
 ] 

David Capwell commented on CASSANDRA-15214:
---

+1

Need second reviewer, can merge after.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-11-06 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227604#comment-17227604
 ] 

David Capwell commented on CASSANDRA-15214:
---

+1 from me with small comment, see PR.

I tested this patch by breaking byte buffer allocation so that it runs out of 
direct memory. In doing so I found an edge case in the client (.transport 
package) code; once that is fixed, both client and internode shut down on OOM.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-29 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204348#comment-17204348
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

Talked with Benedict on Slack and cleared up my confusion. The 
{{JVMStabilityInspector}} is able to inspect the OOM error, but after it 
re-throws, Netty catches all throwables and simply logs them. It happens 
[here|https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L303-L316].
 Therefore, the {{propagateOutOfMemory}} parameter was added. 

I submitted a PR that forcefully produces a heap space OOM error when a direct 
buffer OOM is caught. 
The PR also removes the {{propagateOutOfMemory}} parameter from the 
{{JVMStabilityInspector}}, because the PR makes sure the instance can crash/exit 
properly on OOM (see the gist below).
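
For readers who want the idea in code, here is a minimal sketch of "force a 
heap space OOM when a direct buffer OOM is caught". It is not the code from the 
PR: the class {{DirectOomEscalation}} and its methods are hypothetical, and the 
message match is an assumption based on the "Direct buffer memory" text quoted 
elsewhere in this thread (matched case-insensitively here, since the exact 
wording may differ between JDK versions).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the idea, not the code from the PR.
final class DirectOomEscalation
{
    // True for an OutOfMemoryError whose message mentions direct buffer memory.
    static boolean isDirectBufferOom(Throwable t)
    {
        return t instanceof OutOfMemoryError
               && t.getMessage() != null
               && t.getMessage().toLowerCase(Locale.ROOT).contains("direct buffer memory");
    }

    // Deliberately exhausts the heap with "sacrificial" long arrays so that the
    // JVM raises a genuine heap space OOM. Never returns normally.
    static void forceHeapOom()
    {
        List<long[]> sacrificial = new ArrayList<>();
        while (true)
            sacrificial.add(new long[1 << 24]); // 128 MiB per array, kept reachable
    }

    // Escalates only direct buffer OOMs; the exhauster is injectable so the
    // decision logic can be exercised without actually running out of heap.
    static void inspect(Throwable t, Runnable heapExhauster)
    {
        if (isDirectBufferOom(t))
            heapExhauster.run();
    }
}
```

In production use the call would be 
{{DirectOomEscalation.inspect(t, DirectOomEscalation::forceHeapOom)}}; a test 
can pass a harmless {{Runnable}} instead.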

PR: https://github.com/apache/cassandra/pull/761
CI: 
https://app.circleci.com/pipelines/github/yifan-c/cassandra/112/workflows/293a4334-d2df-43f9-b532-1d79876701c1

I have also created a separate demo to show that the JVM invokes the OOM handler 
even if the OOM error (not including the direct buffer one) is swallowed by a 
catch block. 
The code and the output can be found in the gist: 
https://gist.github.com/yifan-c/82ff4fd7fbe83fe41113f6f14cba4907.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-19 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198670#comment-17198670
 ] 

Benedict Elliott Smith commented on CASSANDRA-15214:


As I have said, they do not - unless you are confident I am wrong. That is the 
reason this ticket was filed, and I ascertained this at a time when I was 
intimately familiar with Netty’s workings. The non-propagation of OOM by 
inspectThrowable is irrelevant.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-18 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198585#comment-17198585
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

> where would that exception end up if it were rethrown

{{JVMStabilityInspector}} only re-throws {{OutOfMemoryError}}. Depending on 
which of the OOM-related JVM options ({{OnOutOfMemoryError}}, 
{{ExitOnOutOfMemoryError}} or {{HeapDumpOnOutOfMemoryError}}) are present, the 
JVM exits and triggers a heap dump if it is a heap space OOM error. 
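
As a side note, whether those options are actually set on a running node can be 
checked from Java. The sketch below assumes a HotSpot-based JVM (it relies on 
{{com.sun.management.HotSpotDiagnosticMXBean}}); the class name 
{{OomFlagReport}} is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper: reports the current values of the OOM-related VM options.
final class OomFlagReport
{
    static Map<String, String> oomFlags()
    {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Map<String, String> flags = new LinkedHashMap<>();
        String[] names = { "OnOutOfMemoryError", "ExitOnOutOfMemoryError", "HeapDumpOnOutOfMemoryError" };
        for (String name : names)
        {
            try
            {
                flags.put(name, bean.getVMOption(name).getValue());
            }
            catch (IllegalArgumentException e)
            {
                // getVMOption rejects names this JVM does not know about
                flags.put(name, "<unknown to this JVM>");
            }
        }
        return flags;
    }

    public static void main(String[] args)
    {
        oomFlags().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```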

However, the call sites tell the inspector not to re-throw the OOM error (e.g. 
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/InboundMessageHandler.java#L647-L659]),
 and I'd like to learn why we do not let the JVM exit.

By default Netty just logs the exception when {{exceptionCaught()}} is _not_ 
implemented in any of the handlers in the inbound direction. For the outbound 
direction, client code handles exceptions by adding a listener to the 
{{ChannelFuture}} or {{ChannelPromise}}. We have the handling in both directions. 
Besides the inbound/outbound paths, it looks like Netty does a lot of 
catch-{{Throwable}}-and-swallow in its code base, so it is possible that errors 
from Netty internals are not bubbled up. For example, this 
[issue|https://github.com/netty/netty/issues/6096]. 
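
The catch-{{Throwable}}-and-swallow pattern is not specific to Netty. As a 
stand-in that needs no Netty dependency, the sketch below uses the plain JDK 
executor: a task passed to {{submit()}} has its throwable captured in the 
returned {{Future}}, so the worker thread's uncaught exception handler never 
sees it, whereas the same task passed to {{execute()}} does reach the handler. 
The error is a simulated {{OutOfMemoryError}}, not a real one, and the class 
name {{SwallowedErrorDemo}} is made up.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class SwallowedErrorDemo
{
    // Returns what the worker's uncaught exception handler observed, or null.
    static Throwable seenByUncaughtHandler(boolean useSubmit)
    {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        List<Thread> workers = new CopyOnWriteArrayList<>();
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "demo-worker");
            t.setUncaughtExceptionHandler((thread, err) -> seen.set(err));
            workers.add(t);
            return t;
        });
        Runnable failing = () -> { throw new OutOfMemoryError("simulated, not a real OOM"); };
        if (useSubmit)
            executor.submit(failing);   // throwable is stored in the discarded Future
        else
            executor.execute(failing);  // throwable escapes and kills the worker
        executor.shutdown();
        try
        {
            executor.awaitTermination(5, TimeUnit.SECONDS);
            for (Thread t : workers)
                t.join(5000); // the handler runs on the dying worker, so wait for it
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args)
    {
        System.out.println("submit():  handler saw " + seenByUncaughtHandler(true));
        System.out.println("execute(): handler saw " + seenByUncaughtHandler(false));
    }
}
```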







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-18 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198233#comment-17198233
 ] 

Benedict Elliott Smith commented on CASSANDRA-15214:


Zoom out a bit - where would that exception end up if it were rethrown?  I 
can't remember precisely, but it is caught by Netty's default exception 
handling and iirc simply logged.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-17 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198034#comment-17198034
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.

From a code inspection, the exception/throwable from Netty is already handled. 
For inbound, the {{InboundMessageHandler}} implements {{exceptionCaught()}}, 
which invokes {{JVMStabilityInspector}}. The message handler is the last one in 
the inbound direction, and there is no previous handler that handles 
exceptions, so the message handler should handle all exceptions from that 
direction. However, the {{exceptionCaught()}} override in 
{{StreamingInboundHandler}} does not invoke {{JVMStabilityInspector}}, so it 
could swallow OOM errors. 
For outbound, {{JVMStabilityInspector}} is invoked when the channel future 
fails, and in several other places. 

All the above call sites call {{JVMStabilityInspector}} with 
{{propagateOutOfMemory}} disabled, so the inspector just swallows the OOM 
errors and does not let the JVM handle them. [~benedict], what is the reason for 
doing so in the inbound/outbound connections? 
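
To make the effect of the flag concrete, here is a deliberately simplified 
sketch of an inspector with a {{propagateOutOfMemory}} switch. It is a 
hypothetical stand-in written for this comment, not the actual 
{{JVMStabilityInspector}} source; only the parameter name is taken from the 
discussion above.

```java
// Simplified, hypothetical stand-in for an inspector with a propagate switch.
final class InspectorSketch
{
    // Walks the cause chain. An OutOfMemoryError is rethrown only when
    // propagateOutOfMemory is true; otherwise the method returns normally,
    // which from the caller's point of view means the error was swallowed.
    static void inspectThrowable(Throwable t, boolean propagateOutOfMemory)
    {
        for (Throwable cur = t; cur != null; cur = cur.getCause())
        {
            if (cur instanceof OutOfMemoryError && propagateOutOfMemory)
                throw (OutOfMemoryError) cur;
        }
    }
}
```

With the flag disabled, a handler that calls 
{{inspectThrowable(cause, false)}} from {{exceptionCaught()}} carries on as if 
nothing fatal had happened.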







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-07-03 Thread Robert Stupp (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150869#comment-17150869
 ] 

Robert Stupp commented on CASSANDRA-15214:
--

Just read this ticket and the approach looks absolutely reasonable to me.

One thing though is that the (off-heap) row-cache isn't covered here - let 
me know whether it's reasonable to add some support regarding this ticket. 
IMHO, people shouldn't use the row-cache, but I'm not sure whether there are 
reasonable use cases out there in the wild. Don't want to start a discussion 
about the row-cache in this, just a heads-up.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-08 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103019#comment-17103019
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

Sounds good [~jolynch]. 

So for this ticket, the goal is to force the JVM to trigger a heap OOM upon 
receiving the direct buffer OOM. (I can work on it.) 

Do you want the jvmquake integration to be addressed in a different ticket? 







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-05 Thread Joey Lynch (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100330#comment-17100330
 ] 

Joey Lynch commented on CASSANDRA-15214:


> One should be able to trigger the OOM by allocating large chunks of memory, 
> e.g. arrays, in a loop in the Java code, so what is the benefit of doing so 
> using jvmquake? I can see that the killer_thread callback function also does 
> long array allocation once notified by the GC callback. 

Ah sorry, I was not clear. I think the JVMStabilityInspector (which we call into 
via inspectThrowable all over the place) should allocate the long array if we 
see an OutOfMemoryError with message "Direct buffer memory", in turn triggering 
a heap OOM (which will trigger the normal resource exhausted mechanism). Since 
we're not out of _heap_ memory we can trust that JVMStabilityInspector can run.

I guess my proposal is to include jvmquake by default for linux deployments (I 
can add more architectures if we want more, easy to opt out), and if 
JVMStabilityInspector sees a "Direct buffer memory" OOM it should force the JVM 
into a heap OOM, triggering jvmquake's resource exhausted handler.

This setup would guarantee that C* dies (and produces a heap dump) if any of 
the following conditions hold:
 * The JVM is out of heap memory
 * The JVM has accumulated 30s of GC debt with 1:5 runtime weight (meaning that 
we had <85% throughput for at least 30s): aka "GC spirals of death"
 * The JVM is out of metaspace memory
 * The JVM is out of threads
 * (best effort, likely true) The JVM is out of native memory (so basically C* 
is using 2x the heap size) -> triggers a heap oom -> triggers the first case
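
To illustrate the "GC debt" condition in the list above, here is a small model 
of how such a debt counter could behave. This is my reading of the description 
in this comment, not code from jvmquake: the assumption is that every second 
spent in GC adds one second of debt, every second spent running pays back 
1/weight seconds, debt never goes below zero, and the kill condition fires once 
debt exceeds the threshold. The class name {{GcDebtTracker}} is made up.

```java
// Hypothetical model of a GC debt counter with a runtime weight.
final class GcDebtTracker
{
    private final double thresholdSeconds;
    private final double runtimeWeight;
    private double debtSeconds;

    GcDebtTracker(double thresholdSeconds, double runtimeWeight)
    {
        this.thresholdSeconds = thresholdSeconds;
        this.runtimeWeight = runtimeWeight;
    }

    // Records one cycle: some time running application code followed by a GC
    // pause. Returns true once the accumulated debt exceeds the threshold.
    boolean record(double runningSeconds, double gcSeconds)
    {
        // Running time pays debt back at 1/runtimeWeight, never below zero.
        debtSeconds = Math.max(0.0, debtSeconds - runningSeconds / runtimeWeight);
        // GC time adds to the debt one for one.
        debtSeconds += gcSeconds;
        return debtSeconds > thresholdSeconds;
    }

    double debtSeconds()
    {
        return debtSeconds;
    }
}
```

With a 30s threshold and weight 5, a process that runs 5s for every 0.5s of GC 
never accumulates debt in this model, while one that runs 1s for every 4s of GC 
gains 3.8s of debt per cycle and crosses the threshold after a handful of 
cycles.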

Unlike the built-in JVM options, jvmquake really does work in these edge cases 
(not only is there a test suite showing that the built-in Java options don't 
work, but if you run inside the heap you fundamentally can't guarantee you will 
run, which is e.g. why the kill -9 approach never really works).







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-05 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100193#comment-17100193
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

Got it. I did not look closely enough at the discussions in CASSANDRA-13006. 

I agree that leaving it to the JVM is a cleaner and more general solution. Also, 
as you mentioned, "It's relatively easy to ignore the "sacrificial" long array 
in a heap dump and we could log clearly what is happening."

One should be able to trigger the OOM by allocating large chunks of memory, 
e.g. arrays, in a loop in the Java code, so what is the benefit of doing so 
using jvmquake? I can see that the killer_thread callback function also does 
long array allocation once notified by the GC callback. 

The comment of the callback says
{quote}the only way to reliably trigger OutOfMemory
 when we are not actually out of memory (e.g. due to GC behavior) that I
 could find was to make JNI calls that allocate large blobs of memory which
 can only be done from outside of the GC callbacks.
{quote}
Can you elaborate more about preferring jvmquake? 







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-05 Thread Joey Lynch (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100063#comment-17100063
 ] 

Joey Lynch commented on CASSANDRA-15214:


> An alternative could be to programmatically grab the heap dump via 
> [JMX|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/jdk.management/share/classes/com/sun/management/HotSpotDiagnosticMXBean.java#L75]
>  and exit.

I believe that was more or less what C* was doing before CASSANDRA-13006, if I'm 
reading the patch in 
[02aba73|https://github.com/apache/cassandra/commit/02aba73] correctly. Eric 
Evans pointed out that this approach can in general cause C*'s jmap heap dump to 
race with the JVM heap dump, and advocated for just letting the JVM handle it 
with built-in options. The nice thing about the jvmquake technique of just 
running the heap out of memory is that all the normal JVM options work as 
expected (mostly logging and dumping heap to a particular location on disk). 
That being said, I think that for the direct buffer issue in particular this 
won't be a problem since, as we've established, the JVM's 
report_java_out_of_memory isn't triggered on direct memory allocation failures.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-05 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099582#comment-17099582
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

Thanks [~jolynch] for the update.
{quote}force the JVM into a "normal" OOM by [allocating large long 
arrays|https://github.com/Netflix-Skunkworks/jvmquake/blob/master/src/jvmquake.c#L103]
{quote}
An alternative could be to programmatically grab the heap dump via 
[JMX|https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/jdk.management/share/classes/com/sun/management/HotSpotDiagnosticMXBean.java#L75]
 and exit.
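
A minimal sketch of that alternative, assuming a HotSpot-based JVM. 
{{HotSpotDiagnosticMXBean.dumpHeap(String, boolean)}} is the real JMX 
operation behind the link above; the wrapper class {{JmxHeapDump}}, the file 
naming and the choice of {{live = true}} are illustrative assumptions. The 
"and exit" part is left to the caller.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;

// Illustrative wrapper around the JMX heap dump operation.
final class JmxHeapDump
{
    // Writes a heap dump into the given directory and returns the file path.
    static Path dumpHeap(Path directory)
    {
        // The target file must not exist yet, so use a timestamped name.
        Path target = directory.resolve("oom-" + System.currentTimeMillis() + ".hprof");
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try
        {
            bean.dumpHeap(target.toString(), true); // true: dump only live objects
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return target;
    }
}
```

A caller would typically follow a successful dump with its own shutdown logic. 
Whether this races with a dump triggered by {{HeapDumpOnOutOfMemoryError}} is 
not something the sketch addresses.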







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-04 Thread Joey Lynch (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099546#comment-17099546
 ] 

Joey Lynch commented on CASSANDRA-15214:


Quick update on this from the jvmquake side: we are now building [architecture 
specific artifacts|https://github.com/Netflix-Skunkworks/jvmquake/releases] 
that will work with any JVM newer than Java 8; they link only against the 
platform specific libc (we're also now testing on Java 8 and 11, on both zulu 
and openjdk JVMs). I think this means it would be plausible to include the 
{{libjvmquake-linux-x86_64.so}} in {{libs}} and then have a switch on uname -s 
-m to determine whether to pick it up or not. Right now we're only building for 
linux amd64 but if there is interest I can generate more architectures (linux 
arm probably makes sense, and could do osx). I also still like the idea of 
having an agents/available and agents/enabled folder like apache does for 
modules; users can just symlink agents from one to the other to include them 
(and we can symlink jamm and jvmquake by default).
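
For illustration, the platform switch could equally be expressed in Java rather 
than with uname. The sketch below is an assumption-laden example: 
{{AgentSelector}} is a made-up class, the only library name it knows is the 
{{libjvmquake-linux-x86_64.so}} mentioned above, and it treats the JVM's 
{{os.arch}} values {{amd64}} and {{x86_64}} as the same architecture.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper that decides whether a bundled agent fits this platform.
final class AgentSelector
{
    // Returns the agent library to load, or empty if none is built for the platform.
    static Optional<String> agentLibraryFor(String osName, String osArch)
    {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        boolean linux = os.contains("linux");
        boolean x86_64 = arch.equals("amd64") || arch.equals("x86_64");
        return linux && x86_64
               ? Optional.of("libjvmquake-linux-x86_64.so")
               : Optional.empty();
    }

    public static void main(String[] args)
    {
        Optional<String> agent = agentLibraryFor(System.getProperty("os.name"),
                                                 System.getProperty("os.arch"));
        System.out.println(agent.map(name -> "would load " + name)
                                .orElse("no agent available for this platform"));
    }
}
```

Falling back to "no agent" keeps unsupported platforms working, at the cost of 
losing the extra protection there.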

[~yifanc] I agree that the OutOfMemory conditions that do not result in a "true" 
JVM OOM (meaning one that would cause a heap dump via 
{{HeapDumpOnOutOfMemoryError}}) will not get caught by jvmquake; my testing 
confirms your findings, although the jvmquake GC instability algorithm will 
still trigger in various real world scenarios I've run into.

I feel like the right move might be to walk back a small bit of 
CASSANDRA-13006 where we stopped forcibly killing the JVM ourselves and let the 
JVM do it. Specifically if the OOM message contains "Direct buffer memory" we 
could do what jvmquake does and force the JVM into a "normal" OOM by 
[allocating large long 
arrays|https://github.com/Netflix-Skunkworks/jvmquake/blob/master/src/jvmquake.c#L103].
 This will then trigger a proper OOM and get us heap dumping. It's relatively 
easy to ignore the "sacrificial" long array in a heap dump and we could log 
clearly what is happening.







[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-04-07 Thread Manish Ghildiyal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077743#comment-17077743
 ] 

Manish Ghildiyal commented on CASSANDRA-15214:
--

Please let me know if I can contribute here.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2020-03-14 Thread Dinesh Joshi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059535#comment-17059535
 ] 

Dinesh Joshi commented on CASSANDRA-15214:
--

Followed up with [~jolynch] regarding his original comment about including C 
JVMTI agents in C*. If we build the agent for the officially supported JVMs, we 
should be good. We need to detect the platform/JVM combination and load the 
matching agent. If the agent is unavailable for the specific VM/platform 
combination, it can be disabled with a warning in the logs, much like what we 
do with `NativeLibrary`, except this will need to happen as part of the startup 
script.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-06 Thread Yifan Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901558#comment-16901558
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

[~jolynch], you are welcome. Please use them.

The attached test cases focus on the `Unsafe.allocateMemory` path. As far as 
I can see, they differ from jvmquake's own test cases, which only check heap 
OOM. 

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-06 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901555#comment-16901555
 ] 

Joseph Lynch commented on CASSANDRA-15214:
--

[~yifanc] If you are ok with it I can add your test cases to 
[jvmquake|https://github.com/Netflix-Skunkworks/jvmquake/tree/master/tests] to 
ensure it handles all edge cases. For what it's worth jvmquake is a strict 
superset of jvmkill and I wouldn't advocate for using jvmkill (I'm biased 
though). In my production experience jvmquake actually works at detecting GC 
spirals of death that C* runs into while jvmkill simply doesn't work as C* 
doesn't actually go OOM, it just death spirals. See the "hard oom"  [test 
cases|https://github.com/Netflix-Skunkworks/jvmquake/blob/master/tests/test_hard_ooms.py]
 for example where jvmkill won't work while jvmquake will work.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-06 Thread Yifan Cai (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901497#comment-16901497
 ] 

Yifan Cai commented on CASSANDRA-15214:
---

I ran several experiments on the OOM scenario to check whether the HotSpot 
handlers work as expected, namely that they kill the process. 
 
The results show that the handlers, OnOutOfMemoryError and 
ExitOnOutOfMemoryError, are only effective for heap OOM. 
 
*Experiments*
 
The experiments are designed to emulate what happens in C* while being minimal. 
They install a handler via Thread.setDefaultUncaughtExceptionHandler and simply 
re-throw the OOM error, hoping the JVM handlers take care of it. 
 
OpenJDK 8 was used.
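
For readers without the attachment, the shape of such an experiment can be sketched roughly as follows. This is not one of the attached files: the class name is hypothetical, the handler records the error instead of re-throwing it, and the OutOfMemoryError in the usage note is synthetic (thrown from Java code), so it only shows that the error reaches the uncaught exception handler, not how the HotSpot handlers react.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical harness, not part of oom-experiments.zip.
final class UncaughtOomExperiment {
    /**
     * Runs the task on a fresh thread and returns whatever Throwable reached
     * the default uncaught exception handler, or null if nothing escaped.
     */
    public static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread.UncaughtExceptionHandler previous = Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> seen.set(error));
        try {
            Thread worker = new Thread(task, "oom-experiment");
            worker.start();
            worker.join(); // the handler runs on the dying thread before join() returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the experiment", e);
        } finally {
            Thread.setDefaultUncaughtExceptionHandler(previous);
        }
        return seen.get();
    }
}
```

Calling `UncaughtOomExperiment.runAndCapture(() -> { throw new OutOfMemoryError("Direct buffer memory"); })` returns the error as seen by the handler; what the JVM-level handlers do afterwards has to be observed from outside the process.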
 
 
All 5 experiments are in the attached [^oom-experiments.zip].

{code:java}
├── OomExperimentExceedsDirectBuffer.java
├── OomExperimentExceedsDirectBufferRapidAlloc.java
├── OomExperimentExceedsHeap.java
├── OomExperimentSimple.java
└── OomExperimentSimpleJustExit.java{code}
Among those experiments, only one (OomExperimentExceedsHeap) successfully 
triggers the handlers. 
 
The rest do throw the OutOfMemoryError, but the handlers are not triggered. 
 
*Some Research*
 
The likely cause is the difference between the code paths the JVM uses to 
allocate memory on the heap and for direct buffers. (OpenJDK 8 is the reference.)
 
Heap memory allocation happens at 
[collectedHeap.inline.hpp#CollectedHeap::common_mem_allocate_noinit|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp#L149].
 When allocation fails, it calls 
[report_java_out_of_memory|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/hotspot/src/share/vm/utilities/debug.cpp#L287],
 which is responsible for creating a heap dump on OOM and running the handlers. 
 
Meanwhile, allocating a direct buffer takes a different path. In 
java.nio.DirectByteBuffer, OOM can happen in 2 places. 
1. Bits.reserveMemory finds that there is not enough direct memory and throws 
OOM. In this case, I do not think the OOM is caught and handled in the JVM to 
trigger report_java_out_of_memory.
2. unsafe.allocateMemory, which calls malloc directly, [fails to allocate 
and throws 
OOM|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/hotspot/src/share/vm/prims/unsafe.cpp#L606].
 Again, this OOM is thrown for the application to handle. 
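
To illustrate the first case, here is a simplified model of limit accounting in the style of Bits.reserveMemory. It is not the JDK source: the class name is hypothetical, and the real method also attempts to reclaim memory before giving up, which this model omits. The point it shows is that the error is an ordinary `throw new OutOfMemoryError(...)` raised from Java code when a counter would exceed a limit, so no heap allocation failure inside the VM is involved.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of direct-memory accounting, not the JDK implementation.
final class DirectMemoryLimitModel {
    private final long maxBytes;
    private final AtomicLong reserved = new AtomicLong();

    public DirectMemoryLimitModel(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Reserves bytes against the limit, or throws like a direct-buffer OOM. */
    public void reserve(long bytes) {
        while (true) {
            long current = reserved.get();
            if (bytes > maxBytes - current) {
                // Plain Java throw: the VM's heap-allocation failure path is never entered.
                throw new OutOfMemoryError("Direct buffer memory");
            }
            if (reserved.compareAndSet(current, current + bytes)) {
                return;
            }
        }
    }

    /** Returns previously reserved bytes to the pool. */
    public void unreserve(long bytes) {
        reserved.addAndGet(-bytes);
    }

    public long reserved() {
        return reserved.get();
    }
}
```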
 
Further evidence is that 
[report_java_out_of_memory|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/hotspot/src/share/vm/utilities/debug.cpp#L287],
 the only place that triggers the handlers, is not invoked during 
unsafe.allocateMemory. Here are [all the references to the method 
invocation|https://github.com/AdoptOpenJDK/openjdk-jdk8u/search?q=report_java_out_of_memory_q=report_java_out_of_memory].
 
Because of that, jvmkill or jvmquake mentioned in the ticket might not work. 
The tool relies on the notification from 
[JvmtiExport::post_resource_exhausted|https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/hotspot/src/share/vm/gc_interface/collectedHeap.inline.hpp#L153],
 which is not present in the 2 places where direct buffer OOM can happen. Here 
is the implementation of 
[jvmkill|https://github.com/airlift/jvmkill/blob/master/jvmkill.c#L24] (less 
than 100 lines).

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-05 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900410#comment-16900410
 ] 

Dinesh Joshi commented on CASSANDRA-15214:
--

Sounds great. [~benedict] who would be able to take up the audit? Is this 
something I can help with?

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-05 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900394#comment-16900394
 ] 

Benedict commented on CASSANDRA-15214:
--

Sorry, I completely forgot to respond to this ticket so thanks for bumping it 
[~djoshi3]

From my POV, including a C JVMTI agent is absolutely fine.  We'd have to take 
a closer look at jvmkill and jvmquake, and do our own brief audit of the 
version we include to ensure it seems to behave reasonably.  But I don't see 
any problem with utilising non-Java functionality.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-05 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900379#comment-16900379
 ] 

Dinesh Joshi commented on CASSANDRA-15214:
--

I think this issue might be related to 
https://bugs.openjdk.java.net/browse/JDK-8027434. Other projects that use the 
JVM have run into a similar issue and the usual solution is to use 
[jvmkill|https://github.com/airlift/jvmkill]. The issue at hand is when a JVM 
has run out of memory (heap or otherwise), it enters an undefined state. In 
this situation, I would not expect the handlers to work as expected either. I 
think we should either use jvmkill or 
[jvmquake|https://github.com/Netflix-Skunkworks/jvmquake] to solve this issue 
as it has proven to be reliable and Netflix, Facebook and other large JVM users 
are actively using it.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-07-21 Thread Tomas Shestakov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889767#comment-16889767
 ] 

Tomas Shestakov commented on CASSANDRA-15214:
-

There are two options to handle *OOM* as of Java 8u92 
[https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html]

-XX:+ExitOnOutOfMemoryError

-XX:+CrashOnOutOfMemoryError

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.






[jira] [Commented] (CASSANDRA-15214) OOMs caught and not rethrown

2019-07-16 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886475#comment-16886475
 ] 

Joseph Lynch commented on CASSANDRA-15214:
--

We've (Netflix) found handling OOMs to be generally hard to do correctly in all 
the various Java codebases we have so we built an agent solution which attaches 
to the JVM in [https://github.com/Netflix-Skunkworks/jvmquake]. I think the 
only reason that we couldn't just directly include that in C* is because it's a 
C JVMTI agent instead of a Java one, but perhaps we could just solve this with 
some documentation and making it really easy to include agents (which is useful 
regardless)?

The following is the patch for supporting easy pluggable agents for C*:
{noformat}
diff --git a/conf/cassandra-env.sh b/conf/cassandra-env.sh
index d6c48be0a3..92061db3ab 100644
--- a/conf/cassandra-env.sh
+++ b/conf/cassandra-env.sh
@@ -134,6 +134,29 @@ do
   JVM_OPTS="$JVM_OPTS $opt"
 done
 
+# Pull in any agents present in CASSANDRA_HOME
+for agent_file in ${CASSANDRA_HOME}/agents/*.jar; do
+  if [ -e "${agent_file}" ]; then
+    base_file="${agent_file%.jar}"
+    if [ -s "${base_file}.options" ]; then
+      options=`cat ${base_file}.options`
+      agent_file="${agent_file}=${options}"
+    fi
+    JVM_OPTS="$JVM_OPTS -javaagent:${agent_file}"
+  fi
+done
+
+for agent_file in ${CASSANDRA_HOME}/agents/*.so; do
+  if [ -e "${agent_file}" ]; then
+    base_file="${agent_file%.so}"
+    if [ -s "${base_file}.options" ]; then
+      options=`cat ${base_file}.options`
+      agent_file="${agent_file}=${options}"
+    fi
+    JVM_OPTS="$JVM_OPTS -agentpath:${agent_file}"
+  fi
+done
{noformat}
Then we can just drop agents into the {{CASSANDRA_HOME/agents}} folder and they 
are loaded automatically by Cassandra. From a security perspective this is 
identical to "drop a jar".

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.


