[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-29 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15306030#comment-15306030
 ] 

Robert Stupp commented on CASSANDRA-11818:
--

I've tried [~norman]'s patch against Netty 4.1 against trunk. With the patch 
enabled (requires {{-Dio.netty.noDirectBufferNoCleaner=false}}) my overloaded 
node recovers nicely. The CMS-GC-storm caused by {{Bits.reserveMemory()}} does 
not occur and the node remains responsive. However, while the node is in an 
overload situation, it spews a lot of errors. Unfortunately these are 
{{java.lang.OutOfMemoryError: No more memory available}}), which is generally 
fine, but in this case it just indicates that there is not enough direct memory 
to fulfill the *current* request. IMO, passing this OOM to 
{{JMVStabilityInspector}} would be wrong, since it is a recoverable error. 
(Background: Netty has a separate, distinct direct memory pool then, which does 
not affect other operations or memory pools.)

I've also applied the same technique to C* internal direct memory allocations. 
(We already use {{FileUtils.clean()}} to cleanup direct buffers.)

To summarize, {{Bits.reserveMemory}} + {{Cleaner}} are the root cause IMO. Not 
having both reduce client latency as a side effect (from 6ms to <1ms in my 
test).

> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, 
> oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEv

[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-27 Thread Norman Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304464#comment-15304464
 ] 

Norman Maurer commented on CASSANDRA-11818:
---

Sorry for the late response (busy as always :( )... I wonder if this would be 
something that may be helpful for you in terms of Netty: 
https://github.com/netty/netty/pull/5314

> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, 
> oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
> {code}
> The situation does not get better when the load driver is stopped.
> I can reproduce this scenario at will. Managed to get histogram, stack traces 
> and heap dump. Already increased {{-XX:MaxDirectMemorySize}} to {{2g}}.
> A {{nodetool flush}} causes the daemon to exit (as that direct-memory OOM is 
> caught by {{JVMStabilityInspector}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-22 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295617#comment-15295617
 ] 

Robert Stupp commented on CASSANDRA-11818:
--

Hm - one thing that I noted, is how Java handles failed direct memory 
allocations in {{java.nio.Bits#reserveMemory}}: It first tries an "optimistic" 
{{tryReserveMemory}} and exits immediately on success. However, if that somehow 
doesn't succeed, it triggers a {{System.gc()}}, which is bad IMO (however, kind 
of how direct buffers work in Java). After that GC it sleeps and tries to 
reserve the memory up to 9 times - up to 511 ms - and then throws 
{{OutOfMemoryError("Direct buffer memory")}}. I've opened CASSANDRA-11870 to 
think about that.

> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, 
> oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
> {code}
> The situation does not get better when the load driver is stopped.
> I can reproduce this scen

[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-20 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293563#comment-15293563
 ] 

Robert Stupp commented on CASSANDRA-11818:
--

The node no longer responds to any native protocol message. Restarting the 
native protocol listener doesn't help (as expected). JMX requests are still 
working.
I assume it just hangs in a loop trying to allocate more native memory, which 
fails (as in the jstack output).
It does not immediately die. But it does if C* itself requests more off heap 
memory (since that OOM is passed to JVMStabilityInspector).


> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
> {code}
> The situation does not get better when the load driver is stopped.
> I can reproduce this scenario at will. Managed to get histogram, stack traces 
> and heap dump. Already increased {{-XX:MaxDirectMemorySize}} to {{2g}}.
> A {{nodetool flush}} causes the daemon to exit (as that direct-m

[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-20 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293529#comment-15293529
 ] 

Joshua McKenzie commented on CASSANDRA-11818:
-

[~snazy]: what is the state of the node when you repro? It dies and we just 
don't catch / print out on the OOM?

> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
> {code}
> The situation does not get better when the load driver is stopped.
> I can reproduce this scenario at will. Managed to get histogram, stack traces 
> and heap dump. Already increased {{-XX:MaxDirectMemorySize}} to {{2g}}.
> A {{nodetool flush}} causes the daemon to exit (as that direct-memory OOM is 
> caught by {{JVMStabilityInspector}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-17 Thread Michael Shuler (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287276#comment-15287276
 ] 

Michael Shuler commented on CASSANDRA-11818:


It's on the list, now :)

> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
> {code}
> The situation does not get better when the load driver is stopped.
> I can reproduce this scenario at will. Managed to get histogram, stack traces 
> and heap dump. Already increased {{-XX:MaxDirectMemorySize}} to {{2g}}.
> A {{nodetool flush}} causes the daemon to exit (as that direct-memory OOM is 
> caught by {{JVMStabilityInspector}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

2016-05-17 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287103#comment-15287103
 ] 

Joshua McKenzie commented on CASSANDRA-11818:
-

A few observations:

1) We'd need to change our handling of the error 
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/Message.java#L396]
 for it to be inspected by the JVMStabilityInspector. If we're OOM, wrapping's 
going to fail.
2) We both need CASSANDRA-8092 to be integrated into CI to catch new errors 
like this in the future, and it needs its logic revised to check if we're 
immediately rethrowing vs. attempting further operation and/or wrapping an 
exception since that will fail in the OOM condition.

Looks like the state of the code-base has regressed w/regards to this issue 
since I submit that ticket:
{noformat}
Total caught and rethrown as something other than Runtime: 100
Total caught and rethrown as Runtime: 68
Total Swallowed: 81
Total delegated to JVMStabilityInspector: 69
Total 'catch (Throwable ...)' analyzed: 120
Total 'catch (Exception ...)' analyzed: 198
Total catch clauses analyzed: 318
{noformat}

[~mshuler]: Any word on where CASSANDRA-8092 falls on the priority list?

> C* does neither recover nor trigger stability inspector on direct memory OOM
> 
>
> Key: CASSANDRA-11818
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Robert Stupp
> Attachments: oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads 
> against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - 
> Unexpected exception during request; channel = [id: 0x1e02351b, 
> L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.8.0_92]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_92]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314)
>  ~[main/:na]
>   at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445)
>  ~[main/:na]
>   at 
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
>  ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) 
> ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.util.concurre