Re: concurrent_compactors via JMX
Hello Riccardo,

My understanding is that GP2 is better. I think we did some testing in the past, but I must say I do not remember the exact results. I remember we also considered IO1 at some point, but we were not convinced by this kind of EBS (I am not sure whether it was not as performant as the documentation suggested or just much more expensive). Maybe test it and form your own opinion, or wait for someone else's information. Be aware that the size of a GP2 volume determines its IOPS; the maximum IOPS is reached at ~3.334 TB, which is also a good dataset size for Cassandra (1.5-2 TB, with some spare space for compactions).

> I'd like to deploy on i3.xlarge

If you go for I3, of course, use the ephemeral drives (NVMe). They are incredibly fast ;-). Compared with m1.xlarge you should see a substantial difference. The problem is that with a low number of nodes, it will always cost more to have I3 than m1. This is often not the case with more machines, as each node will work far more efficiently and you can effectively reduce the number of nodes. Here, 3 will probably be the minimum number of nodes, and 3 x I3 might cost more than 5-6 x m1 instances. When scaling up, though, you should come back to an acceptable cost/efficiency ratio. It's your call whether to continue with m1, m5 or r4 instances meanwhile.

> I decided to get safe and scale horizontally with the hardware we have tested

Yes, this is fair enough and a safe approach. To add new hardware the best approach is a data center switch (I will write a post about how to do this sometime soon).

> I'm preparing to migrate inside vpc

This too probably goes through a DC switch. I remembered I asked for help on this in 2014, and I found the reference for you where I published the steps I went through to go from EC2 public --> public VPC --> private VPC. It's old and I did not read it again, but it worked for us at the time. I hope you find it useful as well, as the process is detailed step by step.
It should be easy to adapt, and you should not forget any step this way: http://grokbase.com/t/cassandra/user/1465m9txtw/vpc-aws#20140612k7xq0t280cvyk6waeytxbkx40c

> possibly in Multi-AZ.

Yes, I recommend you do this. It's incredibly powerful when you know that with 3 racks and RF=3 (and proper topology/configuration), each rack owns 100% of the data. Thus, when operating, you can work on one rack at a time with limited risk; even using quorum, the service should stay up no matter what happens, as long as 2 AZs are completely available. When the cluster grows you might really appreciate this to prevent some failures and operate safely.

> PS: I definitely owe you a coffee, actually much more than that!

If we meet we can definitely share a beer (no coffee for me, but I never say no to a beer ;-)). But you don't owe me anything; it was, and still is, for free. Here we all share, period. I like to think that knowledge is the only wealth you can give away while keeping it for yourself. Some even say that knowledge grows when shared. I used this mailing list myself to ramp up with Cassandra, so I have probably been paying the community back somehow for years now :-). Now it's even part of my job, a part of what we do :). And I like it. What I invite you to do is help people around you once you are comfortable with some topics. This way someone else might enjoy this mailing list, making it a nicer place and helping the community grow ;-). Yet, be assured I appreciate the feedback and that you are grateful; it shows this was somehow useful to you. This is enough for me.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 19:21 GMT+02:00 Riccardo Ferrari :

> Alain,
>
> I really appreciate your answers! A little typo is not changing the
> valuable content! For sure I will give a shot to your GC settings and come
> back with my findings.
> Right now I have 6 nodes up and running and everything looks good so far
> (at least much better).
>
> I agree, the hardware I am using is quite old, but rather than experimenting
> with new hardware combinations (on prod) I decided to get safe and scale
> horizontally with the hardware we have tested. I'm preparing to migrate
> inside vpc and I'd like to deploy on i3.xlarge instances, possibly in
> Multi-AZ.
>
> Speaking of EBS: I gave a quick I/O test to m3.xlarge + SSD + EBS (400
> PIOPS). SSD looks great for commitlogs; for EBS I might need more guidance.
> I certainly gain in terms of random I/O, however I'd like to hear what is
> your stand wrt IO1 (PIOPS) vs regular GP2? Or better: what are your
> guidelines when using EBS?
>
> Thanks!
>
> PS: I definitely owe you a coffee, actually much more than that!
>
> On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ wrote:
>
>> Ah, excuse my confusion. I now understand I guided you through changing the
>>> throughput when you wanted to change the compaction throughput
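The gp2 size-to-IOPS relation mentioned above can be sketched as a quick calculation. This is only a hedged sketch: it assumes the 2018-era gp2 limits (roughly 3 IOPS per GiB, a 100 IOPS floor, and a 10,000 IOPS cap, which is what makes ~3.334 TB the size where the cap is reached); re-check current AWS documentation before relying on these numbers.

```shell
# Rough gp2 baseline IOPS for a given volume size (2018-era limits, an assumption):
gp2_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))    # ~3 IOPS per GiB
  (( iops > 10000 )) && iops=10000  # cap, reached around ~3,334 GiB
  (( iops < 100 )) && iops=100      # small volumes get a 100 IOPS floor
  echo "$iops"
}

gp2_iops 3334   # at the cap
gp2_iops 1000   # a 1 TiB volume
```

This also shows why over-sizing a gp2 volume beyond ~3.3 TB buys no extra baseline IOPS.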
Re: concurrent_compactors via JMX
Alain,

I really appreciate your answers! A little typo is not changing the valuable content! For sure I will give a shot to your GC settings and come back with my findings.

Right now I have 6 nodes up and running and everything looks good so far (at least much better).

I agree, the hardware I am using is quite old, but rather than experimenting with new hardware combinations (on prod) I decided to get safe and scale horizontally with the hardware we have tested. I'm preparing to migrate inside vpc and I'd like to deploy on i3.xlarge instances, possibly in Multi-AZ.

Speaking of EBS: I gave a quick I/O test to m3.xlarge + SSD + EBS (400 PIOPS). SSD looks great for commitlogs; for EBS I might need more guidance. I certainly gain in terms of random I/O, however I'd like to hear what is your stand wrt IO1 (PIOPS) vs regular GP2? Or better: what are your guidelines when using EBS?

Thanks!

PS: I definitely owe you a coffee, actually much more than that!

On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ wrote:

>> Ah, excuse my confusion. I now understand I guided you through changing the
>> throughput when you wanted to change the compaction throughput.
>
> Wow, I meant to say "I guided you through changing the compaction
> throughput when you wanted to change the number of concurrent compactors."
>
> I should not answer messages before waking up fully...
>
> :)
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ :
>
>> Ah, excuse my confusion. I now understand I guided you through changing the
>> throughput when you wanted to change the compaction throughput.
>>
>> I also found some commands I ran in the past using jmxterm. As mentioned
>> by Chris - and thanks Chris for answering the question properly - the
>> 'max' can never be lower than the 'core'.
>> Use JMXTERM to REDUCE the concurrent compactors:
>>
>> ```
>> # if we currently have more than 2 threads:
>> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>>   && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```
>>
>> Use JMXTERM to INCREASE the concurrent compactors:
>>
>> ```
>> # if we currently have fewer than 6 threads:
>> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>>   && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```
>>
>> Some comments about the information you shared, as you said, 'thinking
>> out loud' :):
>>
>> *About the hardware*
>>
>> I remember using the 'm1.xlarge' :). They are not that recent. It will
>> probably be worth it to reconsider this hardware choice and migrate to
>> newer hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You should
>> be able to reduce the number of nodes and make it cost-equivalent (or
>> maybe slightly more expensive, but then it works properly). I once moved
>> from a lot of these nodes (80ish) to a few I2 instances (5 - 15? I don't
>> remember). Latency went from 20 ms to 3 - 5 ms (and was improved later
>> on). Also, using the right hardware for your case should avoid headaches
>> for you and your team. I started with t1.micro in prod and went all the
>> way up (m1.small, m1.medium, ...). It's good for learning, not for
>> business.
>> Especially, this does not work well together:
>>
>>> my instances are still on magnetic drives
>>
>> with
>>
>>> most tables on LCS
>>
>>> frequent r/w pattern
>>
>> Having some SSDs here (EBS GP2, or even better I3 - NVMe disks) would most
>> probably help to reduce the latency. I would also pick an instance with
>> more memory (30 GB would probably be more comfortable). The more memory,
>> the better the JVM can be tuned and the more page caching can be done
>> (thus avoiding some disk reads). Given the number of nodes you use, it's
>> complex to keep the cost low doing this change. When the cluster grows you
>> might want to consider changing the instance type again; maybe for now
>> just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory
>> and the same number of CPUs (or more), and see how many nodes are needed.
>> It might be slightly more expensive, but I really believe it could do some
>> good.
>>
>> As a middle-term solution, I think you might be really happy with a
>> change of this kind.
>>
>> *About DTCS/TWCS?*
>>
>>> - few tables with DTCS
>>> - need to upgrade to 3.0.8 for TWCS
>>
>> Indeed, switching from DTCS to TWCS can be a real relief for a cluster.
>> You should not have to wait to upgrade to 3.0.8 to use TWCS. I
>> must say I
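The memory point above (heap room plus page cache) can be made concrete with a rough budget. This is only a sketch: the 30 GB figure comes from the r4.xlarge-class suggestion in the quoted text, and the off-heap estimate is an assumption to adapt per node.

```shell
# Rough node memory budget (all figures are assumptions, adapt to your nodes):
ram_gb=30        # r4.xlarge-class instance
heap_gb=8        # JVM heap
offheap_gb=4     # rough guess for memtables, bloom filters, compression metadata
page_cache_gb=$(( ram_gb - heap_gb - offheap_gb ))
echo "page cache headroom: ${page_cache_gb} GB"
```

The point of the arithmetic: whatever RAM is not claimed by the JVM or off-heap structures is what the OS can use to cache SSTable reads.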
Re: concurrent_compactors via JMX
> Ah, excuse my confusion. I now understand I guided you through changing the
> throughput when you wanted to change the compaction throughput.

Wow, I meant to say "I guided you through changing the compaction throughput when you wanted to change the number of concurrent compactors."

I should not answer messages before waking up fully...

:)

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ :

> Ah, excuse my confusion. I now understand I guided you through changing the
> throughput when you wanted to change the compaction throughput.
>
> I also found some commands I ran in the past using jmxterm. As mentioned
> by Chris - and thanks Chris for answering the question properly - the
> 'max' can never be lower than the 'core'.
>
> Use JMXTERM to REDUCE the concurrent compactors:
>
> ```
> # if we currently have more than 2 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>   && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Use JMXTERM to INCREASE the concurrent compactors:
>
> ```
> # if we currently have fewer than 6 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>   && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Some comments about the information you shared, as you said, 'thinking out
> loud' :):
>
> *About the hardware*
>
> I remember using the 'm1.xlarge' :). They are not that recent.
> It will probably be worth it to reconsider this hardware choice and migrate
> to newer hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You
> should be able to reduce the number of nodes and make it cost-equivalent
> (or maybe slightly more expensive, but then it works properly). I once
> moved from a lot of these nodes (80ish) to a few I2 instances (5 - 15? I
> don't remember). Latency went from 20 ms to 3 - 5 ms (and was improved
> later on). Also, using the right hardware for your case should avoid
> headaches for you and your team. I started with t1.micro in prod and went
> all the way up (m1.small, m1.medium, ...). It's good for learning, not for
> business.
>
> Especially, this does not work well together:
>
>> my instances are still on magnetic drives
>
> with
>
>> most tables on LCS
>
>> frequent r/w pattern
>
> Having some SSDs here (EBS GP2, or even better I3 - NVMe disks) would most
> probably help to reduce the latency. I would also pick an instance with
> more memory (30 GB would probably be more comfortable). The more memory,
> the better the JVM can be tuned and the more page caching can be done
> (thus avoiding some disk reads). Given the number of nodes you use, it's
> complex to keep the cost low doing this change. When the cluster grows you
> might want to consider changing the instance type again; maybe for now
> just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory
> and the same number of CPUs (or more), and see how many nodes are needed.
> It might be slightly more expensive, but I really believe it could do some
> good.
>
> As a middle-term solution, I think you might be really happy with a change
> of this kind.
>
> *About DTCS/TWCS?*
>
>> - few tables with DTCS
>> - need to upgrade to 3.0.8 for TWCS
>
> Indeed, switching from DTCS to TWCS can be a real relief for a cluster.
> You should not have to wait to upgrade to 3.0.8 to use TWCS. I
> must say I am not too sure for 3.0.x (x < 8) versions though.
> Maybe giving a try to http://thelastpickle.com/blog/2017/01/10/twcs-part2.html
> with https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for you?
>
> *Garbage Collection?*
>
> That being said, the CPU load is really high. I suspect Garbage Collection
> is costing a lot of time on the nodes of this cluster. It is probably not
> helping the CPUs either. This might even be the biggest pain point for this
> cluster.
>
> Would you like to try the following settings on a canary node and see how
> it goes? These settings are quite arbitrary; with the gc.log I could be
> more precise about what I believe is a correct setting.
>
> GC Type: CMS
> Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
> New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
> TenuringThreshold: 15 (instead of 1, which is definitely too small and
> tends to let short-lived objects still be promoted to the old gen)
>
> For those settings, I do not trust the cassandra defaults in most cases.
> New_heap_size should be 25-50% of the heap (and not rela
Re: concurrent_compactors via JMX
Ah, excuse my confusion. I now understand I guided you through changing the throughput when you wanted to change the compaction throughput.

I also found some commands I ran in the past using jmxterm. As mentioned by Chris - and thanks Chris for answering the question properly - the 'max' can never be lower than the 'core'.

Use JMXTERM to REDUCE the concurrent compactors:

```
# if we currently have more than 2 threads:
echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
  && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
```

Use JMXTERM to INCREASE the concurrent compactors:

```
# if we currently have fewer than 6 threads:
echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
  && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
```

Some comments about the information you shared, as you said, 'thinking out loud' :):

*About the hardware*

I remember using the 'm1.xlarge' :). They are not that recent. It will probably be worth it to reconsider this hardware choice and migrate to newer hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You should be able to reduce the number of nodes and make it cost-equivalent (or maybe slightly more expensive, but then it works properly). I once moved from a lot of these nodes (80ish) to a few I2 instances (5 - 15? I don't remember). Latency went from 20 ms to 3 - 5 ms (and was improved later on). Also, using the right hardware for your case should avoid headaches for you and your team. I started with t1.micro in prod and went all the way up (m1.small, m1.medium, ...). It's good for learning, not for business.
Especially, this does not work well together:

> my instances are still on magnetic drives

with

> most tables on LCS

> frequent r/w pattern

Having some SSDs here (EBS GP2, or even better I3 - NVMe disks) would most probably help to reduce the latency. I would also pick an instance with more memory (30 GB would probably be more comfortable). The more memory, the better the JVM can be tuned and the more page caching can be done (thus avoiding some disk reads). Given the number of nodes you use, it's complex to keep the cost low doing this change. When the cluster grows you might want to consider changing the instance type again; maybe for now just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory and the same number of CPUs (or more), and see how many nodes are needed. It might be slightly more expensive, but I really believe it could do some good.

As a middle-term solution, I think you might be really happy with a change of this kind.

*About DTCS/TWCS?*

> - few tables with DTCS
> - need to upgrade to 3.0.8 for TWCS

Indeed, switching from DTCS to TWCS can be a real relief for a cluster. You should not have to wait to upgrade to 3.0.8 to use TWCS. I must say I am not too sure for 3.0.x (x < 8) versions though. Maybe giving a try to http://thelastpickle.com/blog/2017/01/10/twcs-part2.html with https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for you?

*Garbage Collection?*

That being said, the CPU load is really high. I suspect Garbage Collection is costing a lot of time on the nodes of this cluster. It is probably not helping the CPUs either. This might even be the biggest pain point for this cluster.

Would you like to try the following settings on a canary node and see how it goes? These settings are quite arbitrary; with the gc.log I could be more precise about what I believe is a correct setting.

GC Type: CMS
Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
TenuringThreshold: 15 (instead of 1, which is definitely too small and tends to let short-lived objects still be promoted to the old gen)

For those settings, I do not trust the cassandra defaults in most cases. New_heap_size should be 25-50% of the heap (and not related to the number of CPU cores). Also, below 16 GB I never had a better result with G1GC than with CMS. But I must say I have been fighting a lot with CMS in the past to tune it nicely, while I did not even play much with G1GC. This (or similar settings) worked for distinct cases having heavy read patterns. In the mailing list I recently explained to someone else my understanding of the JVM and GC; there is also a blog post my colleague Jon wrote here: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. I believe he suggested a slightly different tuning. If none of this is helping, please send the gc.log file over, with and without this change, and we could have a look at what is going on. SurvivorRatio can also be moved down to 2 or 4, if you want to play
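The CMS settings above could be written out in cassandra-env.sh style. This is a hedged sketch, not the author's exact configuration: the flag names are standard HotSpot options, but the exact file (cassandra-env.sh vs jvm.options) and the values should be validated against your own gc.log.

```shell
# Hedged sketch of the canary-node CMS settings discussed above:
JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G"                # fixed 8 GB heap
JVM_OPTS="$JVM_OPTS -Xmn2G"                       # new gen: try 2 GB, then 4 GB
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"  # instead of the default of 1
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"          # optionally try 2
```

Change one value at a time on the canary node, and keep the gc.log from before and after so the comparison is meaningful.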
Re: concurrent_compactors via JMX
Chris,

Thank you for the mbean reference.

On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari wrote:

> Alain, thank you for your email. I really really appreciate it!
>
> I am actually trying to remove the disk I/O from the suspect list, thus I
> want to reduce the number of concurrent compactors. I'll give throughput a
> shot.
> No, I don't have a long list of pending compactions; however, my instances
> are still on magnetic drives and can't really afford a high number of
> compactors.
>
> We started to have slowdowns and most likely we were undersized; new
> features are coming in and I want to be ready for them.
>
> *About the issue:*
>
> - High system load on cassandra nodes. This means top saying 6.0/12.0
>   on a 4 vcpu instance (!)
> - CPU is high:
>   - Dynatrace says 50%
>   - top easily goes to 80%
> - Network around 30Mb (according to Dynatrace)
> - Disks:
>   - ~40 iops
>   - high latency: ~20ms (min 8, max 50!)
>   - negligible iowait
>   - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
> - Client timeouts
>   - mostly when reading
>   - few cases when writing
> - Slowly growing number of "All time blocked" of Native T-R
>   - small numbers: hundreds vs millions of successfully served requests
>
> The system:
>
> - Cassandra 3.0.6
>   - most tables on LCS
>     - frequent r/w pattern
>   - few tables with DTCS
>     - need to upgrade to 3.0.8 for TWCS
>     - mostly TS data, stream write / batch read
>   - All our keyspaces have RF: 3
> - All nodes in the same AZ
> - m1.xlarge
>   - 4x420 drives (ephemeral storage) configured in striping (raid0)
>   - 4 vcpu
>   - 15GB ram
> - workload:
>   - Java applications:
>     - mostly feeding cassandra, writing data coming in
>   - Apache Spark applications:
>     - batch processes to read and write back to C* or other systems
>     - not co-located
>
> So far my effort was put into growing the ring to better distribute the
> load and decrease the pressure, including:
>
> - Increasing the node number from 3 to 5 (6th node joining)
> - jvm memory
> "optimization":
>   - heaps were set by the default script to something a bit smaller than
>     4GB, with CMS gc
>     - gc pressure was high / long gc pauses
>     - clients were suffering read timeouts
>   - increased the heap, still using CMS:
>     - very long GC pauses
>     - not much tuning around CMS
>   - switched to G1 and forced a 6/7GB heap on each node using almost the
>     suggested settings
>     - much more stable
>     - generally < 300ms
>     - I still have long pauses from time to time (mostly around 1200ms,
>       sometimes on some nodes 3000)
>
> *Thinking out loud:*
> Things are much better; however, I still see high cpu usage, especially
> when Spark kicks in, even though the spark jobs are very small in terms of
> resources (single worker with very limited parallelism).
>
> On LCS tables cfstats reports single-digit read latencies and generally
> 0.X write latencies (as of today).
> On DTCS tables I have 0.x ms write latency but still double-digit read
> latency; I guess I should spend some time tuning that, or upgrade and
> move away from DTCS :(
> Yes, Spark reads mostly from DTCS tables.
>
> It is still kind of common to have dropped READ, HINT and MUTATION.
>
> - not on all nodes
> - this generally happens on node restart
>
> On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed
> 14.04 and 16.04) with terrible results: much slower instance startup and
> responsiveness. How could that be?
>
> Once everything is stabilized I'll prepare our move to vpc and possibly
> upgrade to i3 instances. Any comment on the hardware side? Is 4 cores
> still reasonable hardware?
>
> Best,
>
> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ wrote:
>
>> Hello Riccardo,
>>
>> I noticed I have been writing a novel to answer a simple couple of
>> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's
>> what you were looking for :). Also, there is a warning that it might be
>> counter-productive and stress the cluster even more to increase the
>> compaction throughput.
>> There is more information below ('about the issue').
>>
>> *tl;dr*:
>>
>> What about using 'nodetool setcompactionthroughput XX' instead? It should
>> be available there.
>>
>> In the same way, 'nodetool getcompactionthroughput' gives you the current
>> value. Be aware that a change done through JMX/nodetool is *not*
>> permanent.
>> You still need to update the cassandra.yaml file.
>>
>> If you really want to use the MBean through JMX, because using 'nodetool'
>> is too easy (or for any other reason :p):
>>
>> Mbean: org.apache.cassandra.service.StorageServiceMBean
>> Attribute: CompactionThroughputMbPerSec
>>
>> *Long story* with the "how to", since I wen
Re: concurrent_compactors via JMX
Alain, thank you for your email. I really really appreciate it!

I am actually trying to remove the disk I/O from the suspect list, thus I want to reduce the number of concurrent compactors. I'll give throughput a shot.
No, I don't have a long list of pending compactions; however, my instances are still on magnetic drives and can't really afford a high number of compactors.

We started to have slowdowns and most likely we were undersized; new features are coming in and I want to be ready for them.

*About the issue:*

- High system load on cassandra nodes. This means top saying 6.0/12.0 on a 4 vcpu instance (!)
- CPU is high:
  - Dynatrace says 50%
  - top easily goes to 80%
- Network around 30Mb (according to Dynatrace)
- Disks:
  - ~40 iops
  - high latency: ~20ms (min 8, max 50!)
  - negligible iowait
  - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
- Client timeouts
  - mostly when reading
  - few cases when writing
- Slowly growing number of "All time blocked" of Native T-R
  - small numbers: hundreds vs millions of successfully served requests

The system:

- Cassandra 3.0.6
  - most tables on LCS
    - frequent r/w pattern
  - few tables with DTCS
    - need to upgrade to 3.0.8 for TWCS
    - mostly TS data, stream write / batch read
  - All our keyspaces have RF: 3
- All nodes in the same AZ
- m1.xlarge
  - 4x420 drives (ephemeral storage) configured in striping (raid0)
  - 4 vcpu
  - 15GB ram
- workload:
  - Java applications:
    - mostly feeding cassandra, writing data coming in
  - Apache Spark applications:
    - batch processes to read and write back to C* or other systems
    - not co-located

So far my effort was put into growing the ring to better distribute the load and decrease the pressure, including:

- Increasing the node number from 3 to 5 (6th node joining)
- jvm memory "optimization":
  - heaps were set by the default script to something a bit smaller than 4GB, with CMS gc
    - gc pressure was high / long gc pauses
    - clients were suffering read timeouts
  - increased the heap, still using CMS:
    - very
long GC pauses
    - not much tuning around CMS
  - switched to G1 and forced a 6/7GB heap on each node using almost the suggested settings
    - much more stable
    - generally < 300ms
    - I still have long pauses from time to time (mostly around 1200ms, sometimes on some nodes 3000)

*Thinking out loud:*
Things are much better; however, I still see high cpu usage, especially when Spark kicks in, even though the spark jobs are very small in terms of resources (single worker with very limited parallelism).

On LCS tables cfstats reports single-digit read latencies and generally 0.X write latencies (as of today).
On DTCS tables I have 0.x ms write latency but still double-digit read latency; I guess I should spend some time tuning that, or upgrade and move away from DTCS :(
Yes, Spark reads mostly from DTCS tables.

It is still kind of common to have dropped READ, HINT and MUTATION.

- not on all nodes
- this generally happens on node restart

On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed 14.04 and 16.04) with terrible results: much slower instance startup and responsiveness. How could that be?

Once everything is stabilized I'll prepare our move to vpc and possibly upgrade to i3 instances. Any comment on the hardware side? Is 4 cores still reasonable hardware?

Best,

On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ wrote:

> Hello Riccardo,
>
> I noticed I have been writing a novel to answer a simple couple of
> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's
> what you were looking for :). Also, there is a warning that it might be
> counter-productive and stress the cluster even more to increase the
> compaction throughput. There is more information below ('about the issue').
>
> *tl;dr*:
>
> What about using 'nodetool setcompactionthroughput XX' instead? It should
> be available there.
>
> In the same way, 'nodetool getcompactionthroughput' gives you the current
> value. Be aware that a change done through JMX/nodetool is *not* permanent.
> You still need to update the cassandra.yaml file.
>
> If you really want to use the MBean through JMX, because using 'nodetool'
> is too easy (or for any other reason :p):
>
> Mbean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
>
> *Long story* with the "how to", since I went through this search myself; I
> did not know where this MBean was.
>
>> Can someone point me to the right mbean?
>> I cannot really find good docs about mbeans (or tools ...)
>
> I am not sure about the doc, but you can use jmxterm
> (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>
> To replace the doc I use CCM (https://git
Re: concurrent_compactors via JMX
Refer to Alain's email, but to strictly answer the question of increasing concurrent_compactors via jmx: there are two attributes you can increase that would set the maximum number of concurrent compactions.

org.apache.cassandra.db:type=CompactionManager,name=MaximumCompactorThreads -> 6
org.apache.cassandra.db:type=CompactionManager,name=CoreCompactorThreads -> 6

would set it to 6. To decrease them you will want to go in the opposite order (core, then max). Just increasing the number of concurrent compactors doesn't mean that all of them will be utilized, though.

Chris

> On Jul 17, 2018, at 12:18 PM, Alain RODRIGUEZ wrote:
>
> Hello Riccardo,
>
> I noticed I have been writing a novel to answer a simple couple of
> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what
> you were looking for :). Also, there is a warning that it might be
> counter-productive and stress the cluster even more to increase the
> compaction throughput. There is more information below ('about the issue').
>
> tl;dr:
>
> What about using 'nodetool setcompactionthroughput XX' instead? It should
> be available there.
>
> In the same way, 'nodetool getcompactionthroughput' gives you the current
> value. Be aware that a change done through JMX/nodetool is not permanent.
> You still need to update the cassandra.yaml file.
>
> If you really want to use the MBean through JMX, because using 'nodetool'
> is too easy (or for any other reason :p):
>
> Mbean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
>
> Long story with the "how to", since I went through this search myself; I
> did not know where this MBean was.
>
>> Can someone point me to the right mbean?
>> I cannot really find good docs about mbeans (or tools ...)
>
> I am not sure about the doc, but you can use jmxterm
> (http://wiki.cyclopsgroup.org/jmxterm/download.html).
> To replace the doc I use CCM (https://github.com/riptano/ccm) + jconsole
> to find the mbeans locally:
>
> * Add loopback addresses for ccm (see the readme file)
> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
> * Explore MBeans, and try to guess where this could be (and discover other
>   funny stuff in there :)).
>
> I must admit I did not find it this way using C* 3.0.6 and jconsole.
> I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI
> CompactionThroughput' with this result:
> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>
> With this I could find the right MBean; the only code documentation that
> is always up to date is the code itself, I am afraid:
>
> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:public
> void setCompactionThroughputMbPerSec(int value);'
>
> Note that the research in the code also leads to nodetool ;-).
>
> I could finally find the MBean in 'jconsole' too:
> https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will
> live).
>
> jconsole also lets you see which attributes can be set and which cannot.
>
> You can now find any other MBean you need, I hope :).
>
>> see if it helps when the system is under stress
>
> *About the issue*
>
> You don't exactly say what you are observing. What is that "stress"? How
> is it impacting the cluster?
>
> I ask because I am afraid this change might not help and could even be
> counter-productive. Even though having SSTables nicely compacted makes a
> huge difference at read time, if that's already the case for you and the
> data is already nicely compacted, this change won't help.
It might even make > things slightly worse if the current bottleneck is the disk IO during a > stress period as the compactors would increase their disk read throughput, > thus maybe fight with the read requests for disk throughput. > > If you have a similar number of sstables on all nodes, not many compactions > pending (nodetool netstats -H) and read operations are hitting a small number > sstables (nodetool tablehistogram) then you probably don't need to increase > the compaction speed. > > Let's say that the compaction throughput is not often the cause of stress > during peak hours nor a direct
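The two CompactionManager attributes Chris lists can also be set non-interactively with jmxterm instead of clicking through jconsole. A minimal sketch, assuming the jmxterm uber-jar, the default JMX port 7199, and jmxterm's 'set -b' syntax (double-check against its --help; the script file name here is mine):

```shell
# Write the jmxterm command script. Order matters: raise the maximum before
# the core count; when decreasing, lower the core count first.
cat > set_compactors.jmx <<'EOF'
set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6
set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6
EOF

# Run it against a live node (commented out; needs Cassandra and jmxterm):
# java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n < set_compactors.jmx
cat set_compactors.jmx
```

As with the compaction throughput, this is not persisted: concurrent_compactors in cassandra.yaml still needs updating to survive a restart.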
Re: concurrent_compactors via JMX
Hello Riccardo,

I noticed I have been writing a novel to answer a simple couple of questions again ¯\_(ツ)_/¯. So here is a short answer, in case that's what you were looking for :). Also, there is a warning that it might be counter-productive and stress the cluster even more to increase the compaction throughput. There is more information below ('About the issue').

*tl;dr*:

What about using 'nodetool setcompactionthroughput XX' instead? It should be available there. In the same way, 'nodetool getcompactionthroughput' gives you the current value. Be aware that this change done through JMX/nodetool is *not* permanent: you still need to update the cassandra.yaml file.

If you really want to use the MBean through JMX, because using 'nodetool' is too easy (or for any other reason :p):

MBean: org.apache.cassandra.service.StorageServiceMBean
Attribute: CompactionThroughputMbPerSec

*Long story* with the "how to", since I went through this search myself and did not know where this MBean was.

> Can someone point me to the right mbean? I can not really find good docs
> about mbeans (or tools ...)

I am not sure about the doc, but you can use jmxterm (http://wiki.cyclopsgroup.org/jmxterm/download.html). To replace the doc, I use CCM (https://github.com/riptano/ccm) + jconsole to find the mbeans locally:

* Add loopback addresses for ccm (see the readme file)
* Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
* Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
* Explore MBeans, try to guess where this could be (and discover other funny stuff in there :)).

I must admit I did not find it this way using C*3.0.6 and jconsole.
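The pid extraction in that jconsole command is just grep + cut. Here it is run on a fabricated sample line (the 'pid=12345' format is a stand-in for real ccm output) so the quoting can be sanity-checked without a cluster:

```shell
# Simulated 'ccm node1 show' output; a real run prints several key=value lines.
show_output='node1: UP
pid=12345
binary=127.0.0.1:9042'

# Same pipeline as in the jconsole command above.
pid=$(printf '%s\n' "$show_output" | grep pid | cut -d "=" -f 2)
echo "$pid"
```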
I looked at the code: I locally used C*3.0.6 and ran 'grep -RiI CompactionThroughput', with this result: https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006

With this I could find the right MBean; the only code documentation that is always up to date is the code itself, I am afraid:

'./src/java/org/apache/cassandra/service/StorageServiceMBean.java: public void setCompactionThroughputMbPerSec(int value);'

Note that searching the code also leads to nodetool ;-).

I could finally find the MBean in 'jconsole' too: https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will live). jconsole also lets you see which attributes can be set and which cannot.

You can now find any other MBean you need, I hope :).

> see if it helps when the system is under stress

*About the issue*

You don't say exactly what you are observing; what is that "stress"? How is it impacting the cluster?

I ask because I am afraid this change might not help and might even be counter-productive. Even though having SSTables nicely compacted makes a huge difference at read time, if the data is already nicely compacted, this change won't help. It might even make things slightly worse if the current bottleneck is disk IO during a stress period, as the compactors would increase their disk read throughput and thus compete with read requests for disk throughput.

If you have a similar number of sstables on all nodes, not many pending compactions (nodetool compactionstats -H), and read operations are hitting a small number of sstables (nodetool tablehistograms), then you probably don't need to increase the compaction speed.

Let's say that the compaction throughput is not often the cause of stress during peak hours, nor a direct way to make things 'faster'. Generally, when compaction goes wrong, the number of sstables goes *through* the roof. If you have a chart showing the number of sstables, you can see this really well.
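If you have no sstable-count chart, a crude substitute is counting Data.db files under the data directory. A sketch on a throwaway fixture (the directory layout and the 'mc-N-big-Data.db' naming are my assumptions from the 3.x era); on a real node, point it at /var/lib/cassandra/data instead:

```shell
# Count live sstables under a data directory tree: one *-Data.db per sstable.
count_sstables() {
  find "$1" -name '*-Data.db' | wc -l | tr -d ' '
}

# Throwaway fixture standing in for /var/lib/cassandra/data:
fixture=$(mktemp -d)
mkdir -p "$fixture/my_keyspace/my_table-0d6ae310"
touch "$fixture/my_keyspace/my_table-0d6ae310/mc-1-big-Data.db" \
      "$fixture/my_keyspace/my_table-0d6ae310/mc-2-big-Data.db"

count_sstables "$fixture"
rm -r "$fixture"
```

Run it periodically (or per keyspace) and watch the trend: a steadily climbing count is the "through the roof" pattern described above.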
Of course, if you feel you are in this case, increasing the compaction throughput will definitely help, provided the cluster also has spare disk throughput.

To check what's wrong, if you believe it's something different, here are some useful commands:

- nodetool tpstats (check for pending/blocked/dropped threads there)
- Check WARN and ERROR messages in the logs (i.e. grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log)
- Check local latencies (nodetool tablestats / nodetool tablehistograms) and compare them to the client request latency. At the node level, reads should probably be single-digit milliseconds, rather close to 1 ms with SSDs, and writes below the millisecond most probably (it depends on the data size too, etc...).
- Trace a query during this period and see what takes time (for example from 'cqlsh': 'TRACING ON; SELECT ...')

You can also analyze the *Garbage Collection* activity. As Cassandra uses the JVM, a badly tuned GC will induce long pauses. Depending on the workload, and I must say for most of the clusters I work on, the default tuning is not that good and can keep servers busy 10-15% of the time with stop-the-world GC. You might find this post by my colleague Jon about GC tuning for Apache Cassandra interesting: http://thelastpickle.com/blog/2018/04/11/gc-tun
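For a first look at GC pauses without any JVM tooling, you can grep the GCInspector lines Cassandra already writes to system.log. A sketch over a fabricated two-line sample (the exact log format varies by version, so treat the regex as a starting point rather than gospel):

```shell
# Fabricated system.log excerpt; real GCInspector lines look roughly like this.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO  GCInspector.java:284 - ParNew GC in 212ms.  CMS Old Gen: ...
WARN  GCInspector.java:282 - ConcurrentMarkSweep GC in 1403ms. ...
EOF

# Print every stop-the-world pause of 1000ms or more.
grep -oE 'GC in [0-9]+ms' "$log" | awk '{ms=$3+0; if (ms>=1000) print ms"ms"}'
```

On a real node, replace "$log" with /var/log/cassandra/system.log; frequent multi-second pauses are a strong hint that GC tuning, not compaction throughput, is the place to look.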
concurrent_compactors via JMX
Hi list,

Cassandra 3.0.6 here. I'd like to test changing the number of concurrent compactors to see if it helps when the system is under stress. Can someone point me to the right mbean? I can not really find good docs about mbeans (or tools ...).

Any suggestion much appreciated,
best