How many different CFs do you have?  If you only have a few, I would highly
recommend increasing MemtableThroughputInMB and MemtableOperationsInMillions.
We only have two CFs and I have them set at 256MB and 2.5 million.  Since
most of our columns are relatively small, those two limits end up being
roughly equivalent for us.  I would also recommend dropping your heap space
to 6G and adding a swap file.  In our case, the large EC2 instances didn't
have any swap set up by default.
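
For reference, the relevant storage-conf.xml entries in 0.6 look something
like this (256MB / 2.5M are our values; the flush interval is just an example
of a higher-than-default setting, not something I've tuned for your workload):

    <MemtableThroughputInMB>256</MemtableThroughputInMB>
    <MemtableOperationsInMillions>2.5</MemtableOperationsInMillions>
    <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>

Setting up the swap file is just the usual Linux steps, e.g. on the
instance's ephemeral disk (adjust the path and size to taste):

    dd if=/dev/zero of=/mnt/swapfile bs=1M count=4096   # 4GB swap file
    chmod 600 /mnt/swapfile
    mkswap /mnt/swapfile
    swapon /mnt/swapfile
    echo '/mnt/swapfile none swap sw 0 0' >> /etc/fstab  # persist across reboots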

Lee Parker
On Mon, May 17, 2010 at 7:31 PM, Curt Bererton <c...@zipzapplay.com> wrote:

> Agreed, and I just saw in storage-conf that a higher value for
> MemtableFlushAfterMinutes is suggested, otherwise you might get a "flush
> storm" of all your memtables flushing at once. I've changed that as well.
>
>
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>
>
> On Mon, May 17, 2010 at 5:27 PM, Mark Greene <green...@gmail.com> wrote:
>
>> Since you only have 7.5GB of memory, it's a really bad idea to set your
>> heap space to a max of 7GB. Remember, the java process's total footprint
>> will be larger than what Xmx allows the heap to grow to. If you reach that
>> level, you can start swapping, which is very, very bad. As Brandon pointed
>> out, you haven't exhausted your physical memory yet, but you still want to
>> lower Xmx to something like 5 or maybe 6 GB.
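>>
>> For example (just a sketch; pick the exact value based on what else runs
>> on the box), that means changing the max heap line in your JVM_OPTS to
>> something like:
>>
>>         -Xmx6G \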
>>
>>
>> On Mon, May 17, 2010 at 7:02 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>>
>>> Here are the current jvm args  and java version:
>>>
>>> # Arguments to pass to the JVM
>>> JVM_OPTS=" \
>>>         -ea \
>>>         -Xms128M \
>>>         -Xmx7G \
>>>         -XX:TargetSurvivorRatio=90 \
>>>         -XX:+AggressiveOpts \
>>>         -XX:+UseParNewGC \
>>>         -XX:+UseConcMarkSweepGC \
>>>         -XX:+CMSParallelRemarkEnabled \
>>>         -XX:+HeapDumpOnOutOfMemoryError \
>>>         -XX:SurvivorRatio=128 \
>>>         -XX:MaxTenuringThreshold=0 \
>>>         -Dcom.sun.management.jmxremote.port=8080 \
>>>         -Dcom.sun.management.jmxremote.ssl=false \
>>>         -Dcom.sun.management.jmxremote.authenticate=false"
>>>
>>> java -version outputs:
>>> java version "1.6.0_20"
>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>
>>> So pretty much the defaults aside from the 7Gig max heap. CPU is totally
>>> hammered right now, and it is receiving 0 ops/sec from me since I
>>> disconnected it from our application until I can figure out what's
>>> going on.
>>>
>>> running top on the machine I get:
>>> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
>>> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
>>> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
>>> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
>>> Swap:        0k total,        0k used,        0k free,  1655556k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
>>>
>>>
>>> I have jconsole up and running, and jconsole vm Summary tab says:
>>>  - Total physical memory: 7,872,040 K
>>>  - Free physical memory: 4,253,036 K
>>>  - Total swap space: 0 K
>>>  - Free swap space: 0 K
>>>  - Committed virtual memory: 8,096,648 K
>>>
>>> Is there a specific thread I can look at in jconsole that might give me a
>>> clue?  It's weird that it's still at 100% cpu even though it's getting no
>>> traffic from outside right now.  I suppose it might still be talking across
>>> the machines though.
>>>
>>> Also, stopping cassandra and starting cassandra on one of the 4 machines
>>> caused the CPU to go back down to almost normal levels.
>>>
>>> Here's the ring:
>>>
>>> Address       Status   Load      Range                                       Ring
>>>                                  170141183460469231731687303715884105728
>>> 10.251.XX.XX  Up       2.15 MB   42535295865117307932921825928971026432     |<--|
>>> 10.250.XX.XX  Up       2.42 MB   85070591730234615865843651857942052864     |   |
>>> 10.250.XX.XX  Up       2.47 MB   127605887595351923798765477786913079296    |   |
>>> 10.250.XX.XX  Up       2.46 MB   170141183460469231731687303715884105728    |-->|
>>>
>>> Any thoughts?
>>>
>>> Best,
>>>
>>> Curt
>>> --
>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>> http://apps.facebook.com/happyhabitat
>>>
>>>
>>> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <green...@gmail.com> wrote:
>>>
>>>> Can you provide us with the current JVM args? Also, what type of work
>>>> load you are giving the ring (op/s)?
>>>>
>>>>
>>>> On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <c...@zipzapplay.com> wrote:
>>>>
>>>>> Hello Cassandra users+experts,
>>>>>
>>>>> Hopefully someone will be able to point me in the correct direction.
>>>>> We have cassandra 0.6.1 working on our test servers and we *thought*
>>>>> everything was great and ready to move to production. We are currently
>>>>> running a ring of 4 large instance EC2
>>>>> (http://aws.amazon.com/ec2/instance-types/) servers on production with
>>>>> a replication factor of 3 and a QUORUM consistency level. We ran a test
>>>>> on 1% of our users, and everything was writing to and reading from
>>>>> cassandra great for the first 3 hours. After that point CPU usage
>>>>> spiked to 100% and stayed there, basically on all 4 machines at once.
>>>>> This smells to me like a GC issue, and I'm looking into it with
>>>>> jconsole right now. If anyone can help me debug this and get cassandra
>>>>> all the way up and running without CPU spiking I would be forever in
>>>>> their debt.
>>>>>
>>>>> I suspect that anyone else running cassandra on large EC2 instances
>>>>> might just be able to tell me what JVM args they are successfully
>>>>> using in a production environment, whether they upgraded to Cassandra
>>>>> 0.6.2 from 0.6.1, and whether they went to batched writes due to bug
>>>>> 1014 (https://issues.apache.org/jira/browse/CASSANDRA-1014). That
>>>>> might answer all my questions.
>>>>>
>>>>> Is there anyone on the list who is using large EC2 instances in
>>>>> production? Would you be kind enough to share your JVM arguments and any
>>>>> other tips?
>>>>>
>>>>> Thanks for any help,
>>>>> Curt
>>>>> --
>>>>> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
>>>>> http://apps.facebook.com/happyhabitat
>>>>>
>>>>
>>>>
>>>
>>
>
