As promised: https://issues.apache.org/jira/browse/CASSANDRA-2654

On Sun, May 15, 2011 at 7:09 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Great debugging work!
>
> That workaround sounds like the best alternative to me too.
>
> On Sat, May 14, 2011 at 9:46 PM, Hannes Schmidt <han...@eyealike.com> wrote:
>> Ok, so I think I found one major cause contributing to the increasing
>> resident size of the Cassandra process. Looking at the OpenJDK sources
>> was of great help in understanding the problem but my findings also
>> apply to the Sun/Oracle JDK because the affected code is shared by
>> both.
>>
>> Each IncomingTcpConnection (ITC) thread handles a socket to another
>> node. That socket is returned from ServerSocket.accept() and as such
>> it is implemented on top of an NIO socket channel
>> (sun.nio.ch.SocketAdaptor), which in turn makes use of
>> direct byte buffers. It obtains these buffers from sun.nio.ch.Util
>> which caches the 3 most recently used buffers per thread. If a cached
>> buffer isn't large enough for a message, a new one that is will
>> replace it. The size of the buffer is determined by the amount of data
>> that the application requests to be read. ITC uses the readFully()
>> method of DataInputStream (DIS) to read data into a byte array
>> allocated to hold the entire message:
>>
>> int size = socketStream.readInt();
>> byte[] buffer = new byte[size];
>> socketStream.readFully(buffer);
>>
>> Whatever the value of 'size' is will end up being the size of the
>> direct buffer allocated by the socket channel code.
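>>
>> To make the caching behavior concrete, here is a rough, illustrative
>> model of what I described above. It is not the actual sun.nio.ch.Util
>> code, just a sketch (with made-up names) of a per-thread cache that
>> keeps the 3 most recently used direct buffers and allocates a new
>> buffer sized to the request whenever none of the cached ones is big
>> enough:
>>
>> import java.nio.ByteBuffer;
>>
>> // Illustrative model only; not the real JDK implementation.
>> class PerThreadDirectBufferCache {
>>     private static final int SLOTS = 3; // 3 most recently used buffers
>>
>>     private static final ThreadLocal<ByteBuffer[]> cache =
>>         new ThreadLocal<ByteBuffer[]>() {
>>             @Override protected ByteBuffer[] initialValue() {
>>                 return new ByteBuffer[SLOTS];
>>             }
>>         };
>>
>>     // Return a direct buffer with at least 'size' bytes of capacity.
>>     // If no cached buffer is big enough, a new one of exactly 'size'
>>     // bytes is allocated and the oldest slot is dropped. This is how
>>     // a single 40M read pins 40M of native memory for as long as the
>>     // thread (and, via the soft reference in the real code, even
>>     // longer) is around.
>>     static ByteBuffer get(int size) {
>>         ByteBuffer[] slots = cache.get();
>>         for (ByteBuffer b : slots) {
>>             if (b != null && b.capacity() >= size) {
>>                 b.clear();
>>                 b.limit(size);
>>                 return b;
>>             }
>>         }
>>         ByteBuffer fresh = ByteBuffer.allocateDirect(size);
>>         System.arraycopy(slots, 0, slots, 1, SLOTS - 1); // drop oldest
>>         slots[0] = fresh;
>>         return fresh;
>>     }
>> }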
>>
>> Our application uses range queries whose result sets are around 40
>> megabytes in size. If a range isn't hosted on the node the application
>> client is connected to, the range result set will be fetched from
>> another node. When that other node has prepared the result it will
>> send it back (asynchronously; this took me a while to grasp) and it
>> will end up in the direct byte buffer that is cached by
>> sun.nio.ch.Util for the ITC thread on the original node.
>>
>> The thing is that the buffer holds the entire message, all 40 megs of
>> it. ITC is rather long-lived and so the buffers will simply stick
>> around. Our range queries cover the entire ring (we do a lot of "map
>> reduce") and so each node ends up with as many 40M buffers as we have
>> nodes in the ring, 10 in our case. That's 400M of native heap space
>> wasted on each node.
>>
>> Each ITC thread holds onto the historically largest direct buffer,
>> possibly for a long time. This could be alleviated by periodically
>> closing the connection, thereby releasing the potentially large
>> buffer; the replacement thread would start with a clean slate. If all
>> queries have large result sets, this solution won't help. Another
>> alternative is to read the message incrementally rather than
>> buffering it in its entirety in a byte array as ITC currently does. A
>> third and possibly the simplest solution would be to read the
>> messages into the buffer in chunks of, say, 1M. DIS offers
>> readFully( data, offset, length ) for that. I have tried this
>> solution and it fixes the problem for us. I'll open an issue and
>> submit my patch. We have observed the issue with 0.6.12, but from
>> looking at ITC in trunk it seems to be affected too.
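>>
>> For reference, the chunked read looks roughly like this. This is a
>> sketch of the idea behind my patch, not the patch itself, and the
>> class and method names are made up for illustration:
>>
>> import java.io.DataInputStream;
>> import java.io.IOException;
>>
>> final class ChunkedRead {
>>     // 1M chunks keep the NIO layer's per-thread direct buffer small,
>>     // no matter how large the message is.
>>     private static final int CHUNK_SIZE = 1024 * 1024;
>>
>>     static byte[] readMessage(DataInputStream socketStream)
>>             throws IOException {
>>         int size = socketStream.readInt();
>>         byte[] buffer = new byte[size];
>>         int offset = 0;
>>         while (offset < size) {
>>             int length = Math.min(CHUNK_SIZE, size - offset);
>>             socketStream.readFully(buffer, offset, length);
>>             offset += length;
>>         }
>>         return buffer;
>>     }
>> }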
>>
>> It gets worse though: even after the ITC thread dies, the cached
>> buffers stick around as they are being held via SoftReferences. SRs
>> are cleared only as a last resort, right before the VM would throw an
>> OutOfMemoryError. Using SRs for caching direct buffers is silly
>> because direct buffers have negligible impact on the Java heap but
>> may have a dramatic impact on the native heap. I am not the only one
>> who thinks so [1]. In other words, sun.nio.ch.Util's buffer caching
>> is severely broken. I tried to find a way to explicitly release soft
>> references but haven't found anything other than the allocation of an
>> oversized array to force an OutOfMemoryError. The only thing we can
>> do is to keep the buffer sizes small in order to reduce the impact of
>> the leak. My patch takes care of that.
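>>
>> For completeness, the "oversized array" trick I mentioned looks
>> roughly like this (a sketch only, and not something I would want
>> anywhere near production code):
>>
>> final class SoftReferenceFlusher {
>>     static void flush() {
>>         try {
>>             // Request an allocation far bigger than the heap. The
>>             // SoftReference javadoc guarantees that all softly
>>             // reachable objects are cleared before the VM throws
>>             // OutOfMemoryError.
>>             byte[] hog = new byte[Integer.MAX_VALUE - 8];
>>             hog[0] = 1; // never reached; keeps 'hog' from being elided
>>         } catch (OutOfMemoryError expected) {
>>             // The cached direct buffers are now unreferenced; their
>>             // native memory is freed once they are garbage collected.
>>         }
>>     }
>> }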
>>
>> I will post a link to the JIRA issue with the patch shortly.
>>
>> [1] http://bugs.sun.com/view_bug.do?bug_id=6210541
>>
>> On Wed, May 4, 2011 at 11:50 AM, Hannes Schmidt <han...@eyealike.com> wrote:
>>> Hi,
>>>
>>> We are using Cassandra 0.6.12 in a cluster of 9 nodes. Each node is
>>> 64-bit, has 4 cores and 4G of RAM and runs on Ubuntu Lucid with the
>>> stock 2.6.32-31-generic kernel. We use the Sun/Oracle JDK.
>>>
>>> Here's the problem: The Cassandra process starts up with 1.1G resident
>>> memory (according to top) but slowly grows to 2.1G at a rate that
>>> seems proportional to the write load. No writes, no growth. The node
>>> is running other memory-sensitive applications (a second JVM for our
>>> in-house webapp and a short-lived C++ program) so we need to ensure
>>> that each process stays within certain bounds as far as memory
>>> requirements go. The nodes OOM and crash when the Cassandra process is
>>> at 2.1G so I can't say if the growth is bounded or not.
>>>
>>> Looking at the /proc/$pid/smaps for the Cassandra process it seems to
>>> me that it is the native heap of the Cassandra JVM that is leaking. I
>>> attached a readable version of the smaps file generated by [1].
>>>
>>> Some more data: Cassandra runs with default command line arguments,
>>> which means it gets 1G heap. The JNA jar is present and Cassandra logs
>>> that the memory locking was successful. In storage-conf.xml,
>>> DiskAccessMode is mmap_index_only. Other than that and some increased
>>> timeouts we left the defaults. Swap is completely disabled. I don't
>>> think this is related but I am mentioning it anyways: overcommit [2]
>>> is always-on (vm.overcommit_memory=1). Without that we get OOMs when
>>> our application JVM is fork()'ing and exec()'ing our C++ program even
>>> though there is enough free RAM to satisfy the demands of the C++
>>> program. We think this is caused by a flawed kernel heuristic that
>>> assumes that the forked process (our C++ app) is as big as the forking
>>> one (the 2nd JVM). Anyways, the Cassandra process leaks with both
>>> vm.overcommit_memory=0 (the default) and vm.overcommit_memory=1.
>>>
>>> Whether it is the native heap that leaks or something else, I think
>>> that 1.1G of additional RAM for 1G of Java heap can't be normal. I'd
>>> be grateful for any insights or pointers on what to try next.
>>>
>>> [1] http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
>>> [2] http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6
>>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
