[jira] [Resolved] (CASSANDRA-7743) Possible C* OOM issue during long running test

Benedict (JIRA) Thu, 21 Aug 2014 01:40:11 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Benedict resolved CASSANDRA-7743.
---------------------------------

    Resolution: Fixed

I've had off-JIRA confirmation that this has been fixed (by both inspecting 
heap dumps and leaving the test to run for its extended period to confirm). 
Kishan's negative result was apparently a mistake.


> Possible C* OOM issue during long running test
> ----------------------------------------------
>
>                 Key: CASSANDRA-7743
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Google Compute Engine, n1-standard-1
>            Reporter: Pierre Laporte
>            Assignee: Benedict
>             Fix For: 2.1 rc6
>
>
> During a long running test, we ended up with a lot of 
> "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra 
> instances.
> Here is an example of stacktrace from system.log :
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - 
> Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) 
> ~[na:1.7.0_25]
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) 
> ~[na:1.7.0_25]
>         at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-nodes cluster of n1-standard-1 GCE instances (1 
> vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance 
> running the test.
> After ~2.5 days, several requests start to fail and we see the previous 
> stacktraces in the system.log file.
> The output from linux ‘free’ and ‘meminfo’ suggest that there is still memory 
> available.
> {code}
> $ free -m
> total              used       free     shared    buffers     cached
> Mem:          3702       3532        169          0        161        854
> -/+ buffers/cache:       2516       1185
> Swap:            0          0          0
> $ head -n 4 /proc/meminfo
> MemTotal:        3791292 kB
> MemFree:          173568 kB
> Buffers:          165608 kB
> Cached:           874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still 
> responsive but is unable to display tracing information using cqlsh :
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load       Tokens  Owns (effective)  Host ID              
>                  Rack
> UN  10.240.98.27    925.17 KB  256     100.0%            
> 41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
> UN  10.240.137.253  1.1 MB     256     100.0%            
> c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
> UN  10.240.72.183   896.57 KB  256     100.0%            
> 15735c4d-98d4-4ea4-a305-7ab2d92f65fc  RAC1
> $ echo 'tracing on; select count(*) from duration_test.ints;' | ./bin/cqlsh 
> 10.240.137.253
> Now tracing requests.
>  count
> -------
>   9486
> (1 rows)
> Statement trace did not complete within 10 seconds
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (CASSANDRA-7743) Possible C* OOM issue during long running test

Reply via email to