Hello Maxime,

Can you put the complete logs and config somewhere? It would be
interesting to know what the cause of the OOM is.
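In particular, the heap settings from cassandra-env.sh and the state of the
joining node would help. Something along these lines should capture it (just
a sketch; the cassandra-env.sh path assumes a package install, adjust for a
tarball layout):

    # Heap overrides, if any, currently set on the node
    grep -E 'MAX_HEAP_SIZE|HEAP_NEWSIZE' /etc/cassandra/cassandra-env.sh

    # Heap usage, thread pools and ring state as seen from the new node
    nodetool info
    nodetool tpstats
    nodetool status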
On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:

> Thanks a lot, that is comforting. We are also small at the moment, so I
> can definitely relate to the idea of keeping things small and simple at a
> level where it just works.
>
> I see the new Apache version has a lot of fixes, so I will try to upgrade
> before I look into downgrading.
>
>
> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com>
> wrote:
>
>> Since no one else has stepped in...
>>
>> We have run clusters with ridiculously small nodes - I have a production
>> cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance
>> storage. It works fine, but you can see those little puppies struggle...
>>
>> And I ran into problems such as you observe...
>>
>> Upgrading Java to the latest 1.7 and - most importantly - *reverting to
>> the default configuration, esp. for heap*, seemed to settle things down
>> completely. Also make sure that you are using the 'recommended production
>> settings' from the docs on your boxen.
>>
>> However, we are running 2.0.x, not 2.1.0, so YMMV.
>>
>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD
>> storage - still a 'small' machine, but much more reasonable for C*.
>>
>> However, I can't say I am an expert, since I deliberately keep things so
>> simple that we do not encounter problems - it just works, so I dig into
>> other stuff.
>>
>> ml
>>
>>
>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:
>>
>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for
>>> a few days now.
>>>
>>> I started by adding a node similar to my current configuration: 4 GB of
>>> RAM + 2 cores on DigitalOcean. However, every time I would end up
>>> getting OOM errors after many log entries of the type:
>>>
>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240
>>> ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0
>>> (0%) off-heap
>>>
>>> leading to:
>>>
>>> ka-120-Data.db (39291 bytes) for commitlog position
>>> ReplayPosition(segmentId=1414243978538, position=23699418)
>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032
>>> AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread
>>> Thread[SharedPool-Worker-13,5,main]: {}
>>> java.lang.OutOfMemoryError: Java heap space
>>>
>>> Thinking it had to do with either compaction or streaming, two
>>> activities I've had tremendous issues with in the past, I tried slowing
>>> down setstreamthroughput to extremely low values, all the way to 5. I
>>> also tried setting setcompactionthroughput to 0 (unthrottled), and then,
>>> after reading that in some cases unthrottled compaction might be too
>>> fast, down to 8. Nothing worked; it merely shifted the mean time to OOM,
>>> but not in a way suggesting either was anywhere near a solution.
>>>
>>> The nodes were configured with 2 GB of heap initially; I tried to crank
>>> it up to 3 GB, stressing the host memory to its limit.
>>>
>>> After doing some exploration (I am considering writing a Cassandra ops
>>> document with lessons learned, since there seems to be little of that in
>>> an organized fashion), I read that some people had strange issues on
>>> lower-end boxes like that, so I bit the bullet and upgraded my new node
>>> to an 8 GB + 4 core instance, which was anecdotally better.
>>>
>>> To my complete shock, the exact same issues are present, even after
>>> raising the heap to 6 GB. I figure it can't be a "normal" situation
>>> anymore, but must be a bug somehow.
>>>
>>> My cluster is 4 nodes, RF of 2, with about 160 GB of data across all
>>> nodes and about 10 CFs of varying sizes. Runtime writes are between 300
>>> and 900 per second. Cassandra 2.1.0, nothing too wild.
>>>
>>> Has anyone encountered these kinds of issues before? I would really
>>> enjoy hearing about the experiences of people trying to run small-sized
>>> clusters like mine. From everything I read, Cassandra operations go very
>>> well on large (16 GB + 8 core) machines, but I'm sad to report I've had
>>> nothing but trouble trying to run on smaller machines. Perhaps I can
>>> learn from others' experience?
>>>
>>> Full logs can be provided to anyone interested.
>>>
>>> Cheers
>>>
>>
>>
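For reference, the throttling and heap changes described in the thread map
roughly onto the following. This is only a sketch: the values are the ones
Maxime mentions, the HEAP_NEWSIZE value is a placeholder (it is not given in
the thread, but the stock 2.1 cassandra-env.sh expects it whenever
MAX_HEAP_SIZE is set), and commenting both heap lines out restores the
auto-calculated defaults Michael suggests reverting to.

    # Throttle streaming (megabits/s) and compaction (MB/s) on the joining
    # node; 0 means unthrottled
    nodetool setstreamthroughput 5
    nodetool setcompactionthroughput 8

    # Explicit heap override in cassandra-env.sh
    MAX_HEAP_SIZE="6G"
    HEAP_NEWSIZE="800M"   # placeholder value, not from the thread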