Hi, we have a 9-node ring on m1.xlarge AWS hosts. We started having some trouble a while ago, and it's making me pull out all of my hair.

The host in position #3 has been replaced 4 times. Each time, the host joins the ring, I do a nodetool repair -pr, and she seems fine for about a day. Then she gets real slow, sometimes OOMs, sometimes takes down the host in position #5, sometimes gets stuck on a compaction with near-idle disk throughput, and eventually dies without any kind of error message or reason for failing.

Sometimes our cluster gets so slow that it is almost unusable: we get timeout errors from our application, and AWS sends us voluminous alerts about latency.

I've tried changing the amount of RAM between 8G and 12G, changing MAX_HEAP_SIZE and HEAP_NEWSIZE, repeatedly forcing compactions to stop, setting astronomical ulimit values, and praying to available gods. I'm a bit confused. We're not using super-wide rows, and most settings are at their defaults.

Environment: EL5, Cassandra 1.1.9, Java 1.6.0


--

Drew from Zhrodague
lolcat divinator
d...@zhrodague.net
