I have a cluster of 5 machines running a Cassandra datastore, and I load
bulk data into it using the Java Thrift API. The first ~250 GB runs fine,
then one of the nodes starts to throw OutOfMemory exceptions. I'm not using
any row or index caches, and since I only have 5 CFs and some 2.5 GB of RAM
allocated to the JVM (-Xmx2500M), in theory that shouldn't happen. All inserts
are done with consistency level ALL.
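For reference, the inserts look roughly like this. This is only a minimal
sketch assuming a 0.7-style Thrift interface (older versions pass the keyspace
and a ColumnPath on every call instead); the keyspace, CF, host, and
column/row names are made-up placeholders:

import java.nio.ByteBuffer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class BulkInsertSketch {
    public static void main(String[] args) throws Exception {
        // Connect to one of the 5 nodes on the default Thrift port.
        TTransport transport = new TFramedTransport(new TSocket("node1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("MyKeyspace");               // hypothetical keyspace name

        ColumnParent parent = new ColumnParent("MyCF");  // one of the 5 CFs (name made up)

        // A single column write; the real loader does this in a tight loop.
        Column col = new Column();
        col.setName(ByteBuffer.wrap("colname".getBytes("UTF-8")));
        col.setValue(ByteBuffer.wrap("value".getBytes("UTF-8")));
        col.setTimestamp(System.currentTimeMillis() * 1000);  // microsecond timestamp

        client.insert(ByteBuffer.wrap("rowkey".getBytes("UTF-8")),
                      parent, col, ConsistencyLevel.ALL);     // every insert at CL.ALL

        transport.close();
    }
}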

I hope with this I have avoided all the 'usual dummy errors' that lead to
OOMs. I have begun to troubleshoot the issue with JMX, but it's difficult
to catch the JVM at the right moment because it runs well for several hours
before this happens.
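To avoid having to catch the JVM by hand at exactly the right moment, a small
polling client could log heap usage every few seconds until the node dies.
This is a rough sketch using only the standard JMX remote API; the host, port,
and polling interval are assumptions for my setup:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HeapWatcher {
    public static void main(String[] args) throws Exception {
        // Assumes JMX remote is reachable on node1:7199; adjust host/port
        // to whatever the node was actually started with.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://node1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbsc = connector.getMBeanServerConnection();

        // Proxy for the standard java.lang:type=Memory MBean.
        MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);

        // Poll every 10 seconds and log heap usage; leave this running for
        // hours so the moments just before the OOM are captured.
        while (true) {
            long used = memory.getHeapMemoryUsage().getUsed();
            long max  = memory.getHeapMemoryUsage().getMax();
            System.out.printf("%tT heap used: %d MB / %d MB%n",
                    System.currentTimeMillis(), used >> 20, max >> 20);
            Thread.sleep(10000);
        }
    }
}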

One thing comes to mind, maybe one of the experts could confirm or reject
this idea for me: is it possible that when one machine slows down a little
(for example because a big compaction is going on), the memtables don't get
flushed to disk as fast as they build up under the continuing bulk import?
That would result in a downward spiral: the system gets slower and slower on
disk I/O, but since more and more data keeps arriving over Thrift, it finally
OOMs.
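One cheap way I could test this hypothesis from the client side would be to
throttle the bulk import and see whether the OOM still occurs: if a capped
write rate keeps the nodes healthy, the flush-backlog theory gains weight.
A rough sketch of a fixed-rate throttle, where the target rate is an arbitrary
placeholder and insertOneRow()/hasMoreRows() stand in for the real import code:

public class ThrottledLoader {
    // Hypothetical cap; tune until the nodes stay comfortably ahead of flushes.
    private static final int MAX_INSERTS_PER_SECOND = 2000;

    public static void main(String[] args) throws Exception {
        long windowStart = System.currentTimeMillis();
        int insertsInWindow = 0;

        while (hasMoreRows()) {
            insertOneRow();              // placeholder for the real Thrift insert
            insertsInWindow++;

            // Once the per-second budget is spent, sleep out the rest of the second.
            if (insertsInWindow >= MAX_INSERTS_PER_SECOND) {
                long elapsed = System.currentTimeMillis() - windowStart;
                if (elapsed < 1000) {
                    Thread.sleep(1000 - elapsed);
                }
                windowStart = System.currentTimeMillis();
                insertsInWindow = 0;
            }
        }
    }

    // Stubs standing in for the real bulk-import plumbing.
    private static boolean hasMoreRows() { return false; }
    private static void insertOneRow()   { /* Thrift insert at CL.ALL goes here */ }
}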

I'm using the "periodic" commit log sync; maybe this could also create a
situation where the commit log writer is too slow to keep up with the data
intake, resulting in ever-growing memory usage?

Maybe these thoughts are just nonsense. Let me know if so... ;-)
