On Fri, Jul 6, 2012 at 9:44 AM, rohit bhatia <rohit2...@gmail.com> wrote:
> On Fri, Jul 6, 2012 at 4:47 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> 12G Heap,
>> 1600Mb Young gen,
>>
>> Is a bit higher than the normal recommendation. A 1600MB young gen can
>> cause some extra ParNew pauses.
> Thanks for the heads up, I'll try tinkering with this.
>
>> 128 Concurrent writer threads
>>
>> Unless you are on SSD this is too many.
>>
> I mean
> http://www.datastax.com/docs/0.8/configuration/node_configuration#concurrent-writes
> , this is not the memtable flush queue writers setting.
> The suggested value is 8 * number of cores (16) = 128 itself.
>>
>> 1) Is using JDK 1.7 in any way detrimental to Cassandra?
>>
>> As far as I know it's not fully certified; thanks for trying it :)
>>
>> 2) What is the max write operation qps that should be expected? Is the
>> Netflix benchmark also applicable for counter-incrementing tasks?
>>
>> Counters use a different write path than normal writes and are a bit slower.
>>
>> To benchmark, get a single node and work out the max throughput. Then
>> multiply by the number of nodes and divide by the RF to get a rough idea.
>>
>> the cpu idle time is around 30%, cassandra is not disk bound (insignificant
>> read operations and cpu's iowait is around 0.05%)
>>
>> Wait until compaction kicks in and handles all your inserts.
>>
>> The os load is around 16-20 and the average write latency is 3ms.
>> tpstats do not show any significant pending tasks.
>>
>> The node is overloaded. What is the write latency for a single thread doing
>> a single increment against a node that has no other traffic? The latency
>> for a request is the time spent working plus the time spent waiting; once
>> you reach the max throughput, the time spent waiting increases. The SEDA
>> architecture is designed to limit the time spent working.
The write latency I reported is the one reported by DataStax OpsCenter for the total latency of a client's request. It is at minimum 0.5ms.
In contrast, the "local write request latency" reported by cfstats is around 50 microseconds, but jumps to 150 microseconds during the crash.
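Aaron's sizing rule of thumb above (measure a single node's max throughput, multiply by the node count, divide by the RF) can be sketched as a quick estimate. The 20,000 qps single-node figure below is a made-up placeholder, not a measurement:

```python
def estimate_cluster_qps(single_node_qps: float, nodes: int, rf: int) -> float:
    """Rough cluster write throughput: each write is replicated rf
    times, so replicas consume a share of every node's capacity."""
    return single_node_qps * nodes / rf

# Topology from this thread (8 nodes, RF=2); the per-node qps here is
# hypothetical -- benchmark a single node first to get the real number.
print(estimate_cluster_qps(20_000, nodes=8, rf=2))  # 80000.0
```

This is only an upper-bound estimate; since counters take a different, slower write path, the measured single-node number for counter increments will be lower than for plain writes.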
>>
>> At this point suddenly, several nodes start dropping several
>> "Mutation" messages. There are also lots of pending
>>
>> The cluster is overwhelmed.
>>
>> Almost all the new threads seem to be named "pool-2-thread-*".
>>
>> These are client connection threads.
>>
>> My guess is that this might be due to the 128 writer threads not being
>> able to perform more writes.
>>
>> Yes.
>> https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L214
>>
>> Work out the latency for a single client and a single node, then start
>> adding replication, nodes and load. When the latency increases, you are
>> getting to the max throughput for that config.
>
> Also, as mentioned in my second mail, I am seeing messages like "Total
> time for which application threads were stopped: 16.7663710 seconds";
> if something pauses for this long, it might be overwhelmed by the
> hints stored at other nodes. This can further cause the node to wait
> on/drop a lot of client connection threads. I'll look into what is
> causing these non-GC pauses. Thanks for the help.
>
>> Hope that helps
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5/07/2012, at 6:49 PM, rohit bhatia wrote:
>>
>> Our Cassandra cluster consists of 8 nodes (16 core, 32G RAM, 12G heap,
>> 1600MB young gen, Cassandra 1.0.5, JDK 1.7, 128 concurrent writer
>> threads). The replication factor is 2 with 10 column families, and we
>> service counter-incrementing, write-intensive tasks (CL=ONE).
>>
>> I am trying to figure out the bottleneck.
>>
>> 1) Is using JDK 1.7 in any way detrimental to Cassandra?
>>
>> 2) What is the max write operation qps that should be expected? Is the
>> Netflix benchmark also applicable for counter-incrementing tasks?
>>
>> http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
>>
>> 3) At around 50,000 qps for the cluster (~12,500 qps per node), the cpu
>> idle time is around 30%, Cassandra is not disk bound (insignificant
>> read operations, and cpu iowait is around 0.05%) and is not swapping
>> (around 15GB of RAM is free or inactive). The average GC pause time
>> for ParNew is 100ms, occurring every second, so Cassandra spends
>> 10% of its time stuck in the "stop the world" collector.
>> The OS load is around 16-20 and the average write latency is 3ms.
>> tpstats do not show any significant pending tasks.
>>
>> At this point suddenly, several nodes start dropping several
>> "Mutation" messages. There are also lots of pending
>> MutationStage and ReplicateOnWriteStage tasks in tpstats.
>> The number of threads in the java process increases to around 25,000
>> from the usual 300-400. Almost all the new threads seem to be named
>> "pool-2-thread-*".
>> The OS load jumps to around 30-40, and the "write request latency" starts
>> spiking to more than 500ms (even to several tens of seconds sometimes).
>> Even the "local write latency" increases fourfold, to 200 microseconds
>> from 50 microseconds. This happens across all the nodes within around
>> 2-3 minutes.
>> My guess is that this might be due to the 128 writer threads not being
>> able to perform more writes (though with an average local write latency
>> of 100-150 microseconds, each thread should be able to serve 10,000
>> qps, and with 128 writer threads we should be able to serve 1,280,000 qps
>> per node).
>> Could there be any other reason for this? What else should I monitor,
>> since system.log does not seem to say anything conclusive before
>> dropping messages?
>>
>>
>> Thanks
>> Rohit
>>
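The back-of-envelope figures quoted in the thread (10% stop-the-world time, and a theoretical per-node ceiling of 1,280,000 qps for 128 writer threads) can be reproduced with a short check; all inputs are taken from the emails above, nothing here is measured against a cluster:

```python
# ParNew pauses of ~100 ms occurring roughly every second.
pause_ms, interval_ms = 100, 1000
stop_the_world_fraction = pause_ms / interval_ms
print(f"stop-the-world time: {stop_the_world_fraction:.0%}")  # 10%

# 128 writer threads at ~100 us per local write (midpoint of the
# 50-150 us range reported by cfstats).
writer_threads = 128
local_write_latency_us = 100
qps_per_thread = 1_000_000 / local_write_latency_us
max_node_qps = writer_threads * qps_per_thread
print(f"theoretical per-node ceiling: {max_node_qps:,.0f} qps")  # 1,280,000 qps
```

The large gap between this theoretical ceiling and the observed ~12,500 qps per node supports the point made in the thread that the writer-thread count alone is unlikely to be the bottleneck.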