Also, I am using batch_mutate for all of my writes.

Lee Parker
On Mon, May 17, 2010 at 7:11 PM, Lee Parker <l...@socialagency.com> wrote:

> What are your storage-conf settings for the Memtable thresholds?  One thing
> that could cause lots of CPU usage is dumping the memtables too frequently
> and then having to do lots of compaction.  With that much available heap
> space you could definitely go larger than the default thresholds.  Also, do
> you not have any swap space set up on the machine?  It is a good idea to at
> least set up a swap file so that the system can use it when it needs to.
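>
> If it helps, those thresholds are global settings in conf/storage-conf.xml in
> 0.6, and a swap file only takes a couple of commands.  This is just a rough
> sketch: the element names should match the stock 0.6 config, but the values
> shown are only the approximate defaults, not tuned numbers for your workload:
>
>         # Memtable thresholds in conf/storage-conf.xml (raising these means
>         # fewer flushes and less compaction churn):
>         #   <MemtableThroughputInMB>64</MemtableThroughputInMB>
>         #   <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
>         #   <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
>
>         # Example 2G swap file (size and path are arbitrary):
>         sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
>         sudo chmod 600 /swapfile
>         sudo mkswap /swapfile
>         sudo swapon /swapfile
>         echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots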
>
> We are running a two-node cluster on Amazon large EC2 instances as well.
>  The cluster is using a replication factor of 2, and most of my writes and
> reads are at a consistency level of ONE except for a few QUORUM calls.  The
> only difference in my JVM opts is that my max heap is set at 6G.  I have the
> two ephemeral disks set up as a RAID 0 array, and that is where I'm storing
> the data.  The commit logs are going to the default location, so they are
> using the local disk.  We currently have more than 90G of data running on
> these and have only had issues with CPU utilization when our code was
> accidentally duplicating content to one of the servers.  That duplication
> kept the server in a state of constant major compaction, and it couldn't
> keep up with new writes.  In the end, I completely dropped that server and
> spun up another one to take its place, since the one good server had all the
> data anyway.  So, it might also have been an issue with that box.
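>
> In case it's useful, the RAID 0 array itself is only a few commands.  This is
> a rough sketch assuming the two ephemeral volumes show up as /dev/sdb and
> /dev/sdc (device names vary by AMI and kernel):
>
>         sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
>         sudo mkfs.xfs /dev/md0                      # or ext3, whatever you prefer
>         sudo mkdir -p /var/lib/cassandra/data
>         sudo mount /dev/md0 /var/lib/cassandra/data
>
> and then <DataFileDirectory> in storage-conf.xml points at that mount.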
>
> One more question, are all of the instances in the same region?
>
> Lee Parker
> On Mon, May 17, 2010 at 6:02 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>
>> Here are the current jvm args  and java version:
>>
>> # Arguments to pass to the JVM
>> JVM_OPTS=" \
>>         -ea \
>>         -Xms128M \
>>         -Xmx7G \
>>         -XX:TargetSurvivorRatio=90 \
>>         -XX:+AggressiveOpts \
>>         -XX:+UseParNewGC \
>>         -XX:+UseConcMarkSweepGC \
>>         -XX:+CMSParallelRemarkEnabled \
>>         -XX:+HeapDumpOnOutOfMemoryError \
>>         -XX:SurvivorRatio=128 \
>>         -XX:MaxTenuringThreshold=0 \
>>         -Dcom.sun.management.jmxremote.port=8080 \
>>         -Dcom.sun.management.jmxremote.ssl=false \
>>         -Dcom.sun.management.jmxremote.authenticate=false"
>>
>> java -version outputs:
>> java version "1.6.0_20"
>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>
>> So pretty much the defaults aside from the 7G max heap.  CPU is totally
>> hammered right now, and it is receiving 0 ops/sec from me since I
>> disconnected it from our application until I can figure out what's going
>> on.
>>
>> running top on the machine I get:
>> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
>> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
>> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
>> Swap:        0k total,        0k used,        0k free,  1655556k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
>>
>>
>> I have jconsole up and running, and the jconsole VM Summary tab says:
>>  - Total physical memory: 7,872,040 K
>>  - Free physical memory: 4,253,036 K
>>  - Total swap space: 0 K
>>  - Free swap space: 0 K
>>  - Committed virtual memory: 8,096,648 K
>>
>> Is there a specific thread I can look at in jconsole that might give me a
>> clue?  It's weird that it's still at 100% CPU even though it's getting no
>> traffic from outside right now.  I suppose it might still be talking across
>> the machines though.
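>>
>> One thing I can try in the meantime is mapping the hot OS thread to a Java
>> thread.  A rough sketch, assuming the process can be found with pgrep
>> matching "CassandraDaemon" and <TID> is whatever top -H reports:
>>
>>         PID=$(pgrep -f CassandraDaemon)
>>         top -H -p $PID                       # note the TID of the busiest thread
>>         printf '%x\n' <TID>                  # convert that TID to hex
>>         jstack $PID | grep -A 20 "nid=0x<hex-TID>"
>>
>> jstack reports each thread's native id as nid= in hex, so that grep should
>> show the stack of whatever is burning CPU.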
>>
>> Also, stopping and restarting Cassandra on one of the 4 machines caused the
>> CPU to go back down to almost normal levels.
>>
>> Here's the ring:
>> Address       Status     Load          Range                                      Ring
>>                                        170141183460469231731687303715884105728
>> 10.251.XX.XX  Up         2.15 MB       42535295865117307932921825928971026432     |<--|
>> 10.250.XX.XX  Up         2.42 MB       85070591730234615865843651857942052864     |   |
>> 10.250.XX.XX  Up         2.47 MB       127605887595351923798765477786913079296    |   |
>> 10.250.XX.XX  Up         2.46 MB       170141183460469231731687303715884105728    |-->|
>>
>> Any thoughts?
>>
>> Best,
>>
>> Curt
>> --
>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>> http://apps.facebook.com/happyhabitat
>>
>>
>> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <green...@gmail.com> wrote:
>>
>>> Can you provide us with the current JVM args? Also, what kind of workload
>>> are you giving the ring (ops/sec)?
>>>
>>>
>>> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>>>
>>>> Hello Cassandra users+experts,
>>>>
>>>> Hopefully someone will be able to point me in the correct direction.  We
>>>> have Cassandra 0.6.1 working on our test servers and we *thought* everything
>>>> was great and ready to move to production.  We are currently running a
>>>> production ring of 4 large EC2 instances
>>>> (http://aws.amazon.com/ec2/instance-types/) with a replication factor of 3
>>>> and a QUORUM consistency level.  We ran a test on 1% of our users, and
>>>> everything was writing to and reading from Cassandra fine for the first 3
>>>> hours.  After that point CPU usage spiked to 100% and stayed there,
>>>> basically on all 4 machines at once.  This smells to me like a GC issue,
>>>> and I'm looking into it with jconsole right now.  If anyone can help me
>>>> debug this and get Cassandra all the way up and running without the CPU
>>>> spiking, I would be forever in their debt.
>>>>
>>>> I suspect that anyone else running Cassandra on large EC2 instances might
>>>> just be able to tell me what JVM args they are successfully using in a
>>>> production environment, whether they upgraded from Cassandra 0.6.1 to
>>>> 0.6.2, and whether they went to batched writes because of bug 1014
>>>> (https://issues.apache.org/jira/browse/CASSANDRA-1014).  That might answer
>>>> all my questions.
>>>>
>>>> Is there anyone on the list who is using large EC2 instances in
>>>> production? Would you be kind enough to share your JVM arguments and any
>>>> other tips?
>>>>
>>>> Thanks for any help,
>>>> Curt
>>>> --
>>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>>> http://apps.facebook.com/happyhabitat
>>>>
>>>
>>>
>>
>
