Hello Riccardo,

My understanding is that GP2 is better. I think we did some testing in the
past, but I must say I do not remember the exact results. I remember we
also considered IO1 at some point, but we were not convinced by that kind
of EBS (I am not sure whether it was less performant than the documentation
suggested or just much more expensive). Maybe test it and form your own
opinion, or wait for someone else's input.

Be aware that the size of a GP2 EBS volume determines its IOPS; the max
IOPS is reached at ~3.334 TB, which is also a good volume size for
Cassandra (1.5-2 TB of data with some spare space for compactions).
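
If it helps, here is a rough way to estimate the baseline IOPS you get for
a given GP2 volume size (a sketch only, assuming the current gp2 model of
3 IOPS per provisioned GB with a floor of 100 and a cap of 10,000 IOPS;
double check the AWS docs, as these numbers change over time):

```
# Hypothetical helper: baseline IOPS for a gp2 volume of a given size (GB).
gp2_baseline_iops() {
  local size_gb=$1
  local iops=$(( size_gb * 3 ))     # 3 IOPS per provisioned GB
  (( iops < 100 )) && iops=100      # gp2 floor
  (( iops > 10000 )) && iops=10000  # gp2 cap, hit around ~3,334 GB
  echo "$iops"
}

gp2_baseline_iops 1000   # -> 3000
gp2_baseline_iops 3334   # -> 10000, the maximum baseline
```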

> I'd like to deploy on i3.xlarge

Yet if you go for I3, of course, use the ephemeral drives (NVMe). They are
incredibly fast ;-). Compared with m1.xlarge you should see a substantial
difference. The problem is that with a low number of nodes, it will always
cost more to have i3 than m1. This is often not the case with more
machines, as each node works far more efficiently and you can effectively
reduce the number of nodes. Here, 3 nodes will probably be the minimum, and
3 x i3 might cost more than 5/6 x m1 instances. When scaling up, though,
you should get back to an acceptable cost/efficiency ratio. It's your call
whether to continue with m1, m5 or r4 instances in the meantime.

> I decided to get safe and scale horizontally with the hardware we have
> tested

Yes, this is fair enough and a safe approach. To add new hardware, the
best approach is a data center (DC) switch (I will write a post about how
to do this sometime soon).
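
For reference, a DC switch usually looks roughly like this (a sketch only,
with made-up keyspace and DC names; adapt the RF, DC names and snitch
settings to your setup, repair and validate before removing the old DC,
and keep clients DC-aware with LOCAL_* consistency levels):

```
# 1. Start the new nodes as a new logical DC (here 'dc_new'), then extend
#    the replication of the keyspaces to it:
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {
  'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};"

# 2. On each new node, stream the existing data from the old DC:
nodetool rebuild dc_old

# 3. Once clients point to 'dc_new' only, drop the old DC from the
#    replication settings and decommission its nodes one by one:
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {
  'class': 'NetworkTopologyStrategy', 'dc_new': 3};"
nodetool decommission
```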

> I'm preparing to migrate inside vpc

This too probably goes through a DC switch. I remembered that I asked for
help on this in 2014, and I found the reference for you where I published
the steps I went through to go from EC2 public --> public VPC --> private
VPC. It's old and I did not read it again, but it worked for us at the
time. I hope you find it useful as well, as the process is detailed step
by step. It should be easy to adapt, and this way you should not forget
any step:
http://grokbase.com/t/cassandra/user/1465m9txtw/vpc-aws#20140612k7xq0t280cvyk6waeytxbkx40c


> possibly in Multi-AZ.

Yes, I recommend doing this. It's incredibly powerful when you know that
with 3 racks and RF=3 (and proper topology/configuration), each rack owns
100% of the data. Thus, when operating, you can work on one rack at a time
with limited risk: even with quorum, the service should stay up no matter
what happens, as long as 2 AZs are fully available. As the cluster grows,
you will really appreciate this to prevent some failures and operate
safely.
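
As an illustration, the relevant pieces of configuration look roughly like
this (a sketch with placeholder names; 'my_keyspace' and the 'us-east' DC
name are assumptions, and on EC2 the snitch derives the DC from the region
and the rack from the AZ):

```
# cassandra.yaml: let Cassandra map each AZ to a rack automatically
# (Ec2Snitch within one region, Ec2MultiRegionSnitch across regions):
#   endpoint_snitch: Ec2Snitch

# With nodes spread over 3 AZs/racks and RF=3 per DC, each rack ends up
# owning a full copy of the data:
cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {
  'class': 'NetworkTopologyStrategy', 'us-east': 3};"
```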

> PS: I definitely owe you a coffee, actually much more than that!


If we meet we can definitely share a beer (no coffee for me, but I never
say no to a beer ;-)).
But you don't owe me anything; it was and still is for free. Here we all
share, period. I like to think that knowledge is the only wealth you can
give away while keeping it for yourself. Some even say that knowledge
grows when shared. I used this mailing list myself to ramp up with
Cassandra, so I have probably been paying the community back somehow for
years now :-). Now it's even part of my job, it is part of what we do :).
And I like it. What I invite you to do is help people around you once you
are comfortable with some topics. This way someone else might enjoy this
mailing list, making it a nicer place and helping the community grow ;-).

Still, rest assured that I appreciate the feedback and your gratitude; it
shows this was somehow useful to you. That is enough for me.

C*heers
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 19:21 GMT+02:00 Riccardo Ferrari <ferra...@gmail.com>:

> Alain,
>
> I really appreciate your answers! A little typo is not changing the
> valuable content! For sure I will give a shot to your GC settings and come
> back with my findings.
> Right now I have 6 nodes up and running and everything looks good so far
> (at least much better).
>
> I agree, the hardware I am using is quite old, but rather than experimenting
> with new hardware combinations (on prod) I decided to get safe and scale
> horizontally with the hardware we have tested. I'm preparing to migrate
> inside vpc and I'd like to deploy on i3.xlarge instances and possibly in
> Multi-AZ.
>
> Speaking of EBS: I gave a quick I/O test to m3.xlarge + SSD + EBS (400
> PIOPS). SSD looks great for commitlogs, EBS I might need more guidance. I
> certainly gain in terms of random I/O; however, I'd like to hear where you
> stand wrt IO2 (PIOPS) vs regular GP2? Or better: what are your guidelines
> when using EBS?
>
> Thanks!
>
> PS: I definitely owe you a coffee, actually much more than that!
>
> On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Ah excuse my confusion. I now understand I guide you through changing the
>>> throughput when you wanted to change the compaction throughput.
>>
>>
>>
>> Wow, I meant to say "I guided you through changing the compaction
>> throughput when you wanted to change the number of concurrent compactors."
>>
>> I should not answer messages before waking up fully...
>>
>> :)
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> 2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:
>>
>>> Ah excuse my confusion. I now understand I guide you through changing
>>> the throughput when you wanted to change the compaction throughput.
>>>
>>> I also found some commands I ran in the past using jmxterm. As mentioned
>>> by Chris - and thanks Chris for answering the question properly -, the
>>> 'max' can never be lower than the 'core'.
>>>
>>> Use JMXTERM to REDUCE the concurrent compactors:
>>>
>>> ```
>>> # if we have more than 2 threads:
>>> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" \
>>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>>>   && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" \
>>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>>> ```
>>>
>>> Use JMXTERM to INCREASE the concurrent compactors:
>>>
>>> ```
>>> # if we currently have fewer than 6 threads:
>>> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" \
>>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>>>   && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" \
>>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>>> ```
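>>>
>>> To check the current values before and after changing them (same jmxterm
>>> jar and path assumptions as above), jmxterm's 'get' command can be used:
>>>
>>> ```
>>> echo "get -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads" \
>>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>>> echo "get -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads" \
>>>   | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>>> ```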
>>>
>>> Some comments about the information you shared, as you said, 'thinking
>>> out loud' :):
>>>
>>> *About the hardware*
>>>
>>> I remember using the 'm1.xlarge' :). They are not that recent. It is
>>> probably worth reconsidering this hardware choice and migrating to newer
>>> hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You should be
>>> able to reduce the number of nodes and make it cost-equivalent (or maybe
>>> slightly more expensive, but then it works properly). I once moved from a
>>> lot of these nodes (80ish) to a few I2 instances (5 - 15? I don't
>>> remember). Latency went from 20 ms to 3 - 5 ms (and was improved later
>>> on). Also, using the right hardware for your case should spare you and
>>> your team some headaches. I started with t1.micro in prod and went all
>>> the way up (m1.small, m1.medium, ...). It's good for learning, not for
>>> business.
>>>
>>> Especially, this does not work well together:
>>>
>>> my instances are still on magnetic drivers
>>>>
>>>
>>> with
>>>
>>> most tables on LCS
>>>
>>> frequent r/w pattern
>>>>
>>>
>>> Having some SSDs here (EBS GP2, or even better I3 with NVMe disks) would
>>> most probably help reduce the latency. I would also pick an instance with
>>> more memory (30 GB would probably be more comfortable). The more memory,
>>> the better you can tune the JVM and the more page caching can be done
>>> (thus avoiding some disk reads). Given the number of nodes you use, it's
>>> hard to keep the cost low while doing this change. When the cluster grows
>>> you might want to consider changing the instance type again; maybe for
>>> now just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of
>>> memory and the same number of CPUs (or more), and see how many nodes are
>>> needed. It might be slightly more expensive, but I really believe it
>>> could do some good.
>>>
>>> As a middle term solution, I think you might be really happy with a
>>> change of this kind.
>>>
>>> *About DTCS/TWCS?*
>>>
>>>
>>>>
>>>> - few tables with DTCS
>>>> - need to upgrade to 3.0.8 for TWCS
>>>
>>> Indeed, switching from DTCS to TWCS can be a real relief for a cluster.
>>> You should not have to wait for an upgrade to 3.0.8 to use TWCS. I must
>>> say I am not too sure about 3.0.x (x < 8) versions though. Maybe giving a
>>> try to http://thelastpickle.com/blog/2017/01/10/twcs-part2.html with
>>> https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for
>>> you?
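>>>
>>> If you go that route, it should mostly be a matter of dropping the jar
>>> into the Cassandra lib directory and altering the tables. A sketch of
>>> what that looks like (the table name is a placeholder, and the class
>>> name and window options are from my memory of that repository's README,
>>> so double check them there):
>>>
>>> ```
>>> cqlsh -e "ALTER TABLE my_keyspace.my_timeseries WITH compaction = {
>>>   'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy',
>>>   'compaction_window_unit': 'DAYS', 'compaction_window_size': 1};"
>>> ```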
>>>
>>> *Garbage Collection?*
>>>
>>> That being said, the CPU load is really high, and I suspect Garbage
>>> Collection is taking a lot of time on the nodes of this cluster. It is
>>> probably not helping the CPUs either. This might even be the biggest pain
>>> point for this cluster.
>>>
>>> Would you like to try the following settings on a canary node and see
>>> how it goes? These settings are quite arbitrary; with the gc.log I could
>>> be more precise about what I believe is a correct setting.
>>>
>>> GC Type: CMS
>>> Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
>>> New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
>>> TenuringThreshold: 15 (instead of 1, which is definitely too small and
>>> tends to let short-lived objects still be promoted to the old gen)
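>>>
>>> As a concrete sketch, this would translate into JVM flags along these
>>> lines (depending on the packaging they go in conf/jvm.options or through
>>> MAX_HEAP_SIZE / HEAP_NEWSIZE / JVM_OPTS in cassandra-env.sh; the values
>>> are the suggestions above, not battle-tested defaults):
>>>
>>> ```
>>> # CMS collector
>>> -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> # 8 GB heap, 2 GB new gen (also try 4 GB for the new gen)
>>> -Xms8G
>>> -Xmx8G
>>> -Xmn2G
>>> # keep short-lived objects in the young generation longer
>>> -XX:MaxTenuringThreshold=15
>>> # optionally experiment with the survivor spaces as well (see below)
>>> # -XX:SurvivorRatio=4
>>> ```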
>>>
>>> For those settings, I do not trust the Cassandra defaults in most cases.
>>> New_heap_size should be 25-50% of the heap (and not related to the number
>>> of CPU cores). Also, below 16 GB I never had better results with G1GC
>>> than with CMS. But I must say I have been fighting a lot with CMS in the
>>> past to tune it nicely, while I did not even play much with G1GC.
>>>
>>> This (or similar settings) worked for distinct cases having heavy read
>>> patterns. In the mailing list I explained recently to someone else my
>>> understanding of JVM and GC, also there is a blog post my colleague Jon
>>> wrote here: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. I
>>> believe he suggested a slightly different tuning.
>>> If none of this is helping, please send the gc.log file over, with and
>>> without this change, and we could have a look at what is going on. SurvivorRatio
>>> can also be moved down to 2 or 4, if you want to play around and check the
>>> difference.
>>>
>>> Make sure to use a canary node first, there is no 'good' configuration
>>> here, it really depends on the workload and the settings above could harm
>>> the cluster.
>>>
>>> I think we can make more of these instances. Nonetheless, after adding a
>>> few more nodes, scaling up the instance type instead of the number of
>>> nodes, to get SSDs and a bit more memory, will make things smoother, and
>>> probably cheaper as well at some point.
>>>
>>>
>>>
>>>
>>> 2018-07-18 17:27 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>>
>>>> Chris,
>>>>
>>>> Thank you for mbean reference.
>>>>
>>>> On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari <ferra...@gmail.com>
>>>> wrote:
>>>>
>>>>> Alain, thank you for the email. I really, really appreciate it!
>>>>>
>>>>> I am actually trying to remove the disk I/O from the suspect list,
>>>>> thus I want to reduce the number of concurrent compactors. I'll give
>>>>> throughput a shot.
>>>>> No, I don't have a long list of pending compactions, however my
>>>>> instances are still on magnetic drives and can't really afford a high
>>>>> number of compactors.
>>>>>
>>>>> We started to have slowdowns and most likely we were undersized; new
>>>>> features are coming in and I want to be ready for them.
>>>>> *About the issue:*
>>>>>
>>>>>
>>>>>    - High system load on cassandra nodes. This means top saying
>>>>>    6.0/12.0 on a 4 vcpu instance (!)
>>>>>
>>>>>
>>>>>    - CPU is high:
>>>>>          - Dynatrace says 50%
>>>>>          - top easily goes to 80%
>>>>>       - Network around 30Mb (according to Dynatrace)
>>>>>       - Disks:
>>>>>          - ~40 iops
>>>>>          - high latency: ~20ms (min 8 max 50!)
>>>>>          - negligible iowait
>>>>>          - testing an empty instance with fio I get 1200 r_iops / 400
>>>>>          w_iops
>>>>>
>>>>>
>>>>>    - Clients timeout
>>>>>       - mostly when reading
>>>>>       - few cases when writing
>>>>>    - Slowly growing number of "All time blocked of Native T-R"
>>>>>       - small numbers: hundreds vs millions of successfully served
>>>>>       requests
>>>>>
>>>>> The system:
>>>>>
>>>>>    - Cassandra 3.0.6
>>>>>       - most tables on LCS
>>>>>          - frequent r/w pattern
>>>>>       - few tables with DTCS
>>>>>          - need to upgrade to 3.0.8 for TWCS
>>>>>          - mostly TS data, stream write / batch read
>>>>>       - All our keyspaces have RF: 3
>>>>>
>>>>>
>>>>>    - All nodes on the same AZ
>>>>>    - m1.xlarge
>>>>>    - 4x420 drives (ephemeral storage) configured in striping (raid0)
>>>>>       - 4 vcpu
>>>>>       - 15GB ram
>>>>>    - workload:
>>>>>       - Java applications;
>>>>>          - Mostly feeding cassandra writing data coming in
>>>>>          - Apache Spark applications:
>>>>>          - batch processes to read and write back to C* or other
>>>>>          systems
>>>>>          - not co-located
>>>>>
>>>>> So far my effort was put into growing the ring to better distribute
>>>>> the load and decrease the pressure, including:
>>>>>
>>>>>    - Increasing the node number from 3 to 5 (6th node joining)
>>>>>    - jvm memory "optimization":
>>>>>    - heaps were set by the default script to something a bit smaller
>>>>>       than 4GB with CMS GC
>>>>>       - gc pressure was high / long gc pauses
>>>>>          - clients were suffering of read timeouts
>>>>>       - increased the heap still using CMS:
>>>>>          - very long GC pauses
>>>>>          - not much tuning around CMS
>>>>>          - switched to G1 and forced 6/7GB heap on each node using
>>>>>       almost suggested settings
>>>>>       - much more stable
>>>>>             - generally < 300ms
>>>>>          - I still have long pauses from time to time (mostly around
>>>>>          1200ms, sometimes on some nodes 3000)
>>>>>
>>>>> *Thinking out loud:*
>>>>> Things are much better, however I still see high CPU usage, especially
>>>>> when Spark kicks in, even though Spark jobs are very small in terms of
>>>>> resources (single worker with very limited parallelism).
>>>>>
>>>>> On LCS tables cfstats reports single digit read latencies and
>>>>> generally 0.X write latencies (as per today).
>>>>> On DTCS tables I have 0.x ms write latency but still double digit read
>>>>> latency, but I guess I should spend some time to tune that or upgrade and
>>>>> move away from DTCS :(
>>>>> Yes, Spark reads mostly from DTCS tables.
>>>>>
>>>>> It is still kinda common to have dropped READ, HINT and MUTATION messages.
>>>>>
>>>>>    - not on all nodes
>>>>>    - this generally happen on node restart
>>>>>
>>>>>
>>>>> On a side note I tried to install libjemalloc1 from Ubuntu repo (mixed
>>>>> 14.04 and 16.04) with terrible results, much slower instance startup and
>>>>> responsiveness, how could that be?
>>>>>
>>>>> Once everything is stabilized I'll prepare our move to VPC and
>>>>> possibly upgrade to i3 instances. Any comment on the hardware side? Are
>>>>> 4 cores still reasonable hardware?
>>>>>
>>>>> Best,
>>>>>
>>>>> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ <arodr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Riccardo,
>>>>>>
>>>>>> I noticed I have been writing a novel to answer a simple couple of
>>>>>> questions again ¯\_(ツ)_/¯. So here is a short answer in the case that's
>>>>>> what you were looking for :). Also, there is a warning that it might be
>>>>>> counter-productive and stress the cluster even more to increase the
>>>>>> compaction throughput. There is more information below ('about the 
>>>>>> issue').
>>>>>>
>>>>>> *tl;dr*:
>>>>>>
>>>>>> What about using 'nodetool setcompactionthroughput XX' instead? It
>>>>>> should be available there.
>>>>>>
>>>>>> In the same way 'nodetool getcompactionthroughput' gives you the
>>>>>> current value. Be aware that this change done through JMX/nodetool is
>>>>>> *not* permanent. You still need to update the cassandra.yaml file.
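>>>>>>
>>>>>> For example (a sketch; 16 is just a placeholder value, and I quote the
>>>>>> yaml property name from memory, so double check it in your cassandra.yaml):
>>>>>>
>>>>>> ```
>>>>>> nodetool getcompactionthroughput       # current value, in MB/s
>>>>>> nodetool setcompactionthroughput 16    # applies immediately, not persistent
>>>>>> # to make it permanent, also set it in cassandra.yaml:
>>>>>> # compaction_throughput_mb_per_sec: 16
>>>>>> ```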
>>>>>>
>>>>>> If you really want to use the MBean through JMX, because using
>>>>>> 'nodetool' is too easy (or for any other reason :p):
>>>>>>
>>>>>> Mbean: org.apache.cassandra.service.StorageServiceMBean
>>>>>> Attribute: CompactionThroughputMbPerSec
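>>>>>>
>>>>>> Through jmxterm (see the longer story below) that would look something
>>>>>> like this (a sketch; I believe the JMX ObjectName is the one below, but
>>>>>> double check it in jconsole):
>>>>>>
>>>>>> ```
>>>>>> echo "set -b org.apache.cassandra.db:type=StorageService CompactionThroughputMbPerSec 16" \
>>>>>>   | java -jar jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>>>>>> ```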
>>>>>>
>>>>>> *Long story* with the "how to", since I went through this search
>>>>>> myself and did not know where this MBean was.
>>>>>>
>>>>>> Can someone point me to the right mbean?
>>>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>>>
>>>>>>
>>>>>> I am not sure about the doc, but you can use jmxterm (
>>>>>> http://wiki.cyclopsgroup.org/jmxterm/download.html).
>>>>>>
>>>>>> To replace the doc I use CCM (https://github.com/riptano/ccm) +
>>>>>> jconsole to find the mbeans locally:
>>>>>>
>>>>>> * Add loopback addresses for ccm (see the readme file)
>>>>>> * Then, create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n
>>>>>> 3 -s'
>>>>>> * Start jconsole using the right pid: 'jconsole $(ccm node1 show |
>>>>>> grep pid | cut -d "=" -f 2)'
>>>>>> * Explore MBeans, try to guess where this could be (and discover
>>>>>> other funny stuff in there :)).
>>>>>>
>>>>>> I must admit I did not find it this way using C*3.0.6 and jconsole.
>>>>>> I looked at the code, I locally used C*3.0.6 and ran 'grep -RiI
>>>>>> CompactionThroughput' with this result:
>>>>>> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>>>>>>
>>>>>> With this I could find the right MBean, the only code documentation
>>>>>> that is always up to date is the code itself I am afraid:
>>>>>>
>>>>>> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:
>>>>>>   public void setCompactionThroughputMbPerSec(int value);'
>>>>>>
>>>>>> Note that the research in the code also leads to nodetool ;-).
>>>>>>
>>>>>> I could finally find the MBean in the 'jconsole' too:
>>>>>> https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link
>>>>>> will live).
>>>>>>
>>>>>> jconsole also allows you to see what attributes it is possible to set
>>>>>> or not.
>>>>>>
>>>>>> You can now find any other MBean you would need I hope :).
>>>>>>
>>>>>>
>>>>>> see if it helps when the system is under stress
>>>>>>
>>>>>>
>>>>>> *About the issue*
>>>>>>
>>>>>> You don't exactly say what you are observing, what is that "stress"?
>>>>>> How is it impacting the cluster?
>>>>>>
>>>>>> I ask because I am afraid this change might not help and even be
>>>>>> counter-productive. Even though having SSTables nicely compacted makes a
>>>>>> huge difference at read time, if that's already the case for you and
>>>>>> the data is already nicely compacted, doing this change won't help. It
>>>>>> might even make things slightly worse if the current bottleneck is the 
>>>>>> disk
>>>>>> IO during a stress period as the compactors would increase their disk 
>>>>>> read
>>>>>> throughput, thus maybe fight with the read requests for disk throughput.
>>>>>>
>>>>>> If you have a similar number of sstables on all nodes, not many
>>>>>> compactions pending (nodetool netstats -H) and read operations are 
>>>>>> hitting
>>>>>> a small number sstables (nodetool tablehistogram) then you probably
>>>>>> don't need to increase the compaction speed.
>>>>>>
>>>>>> Let's say that the compaction throughput is not often the cause of
>>>>>> stress during peak hours nor a direct way to make things 'faster'.
>>>>>> Generally when compaction goes wrong, the number of sstables goes
>>>>>> *through* the roof. If you have a chart showing the number of
>>>>>> sstables, you can see this really well.
>>>>>>
>>>>>> Of course, if you feel you are in this case, increasing the
>>>>>> compaction throughput will definitely help if the cluster also has spare
>>>>>> disk throughput.
>>>>>>
>>>>>> To check what's wrong, if you believe it's something different, here
>>>>>> are some useful commands:
>>>>>>
>>>>>> - nodetool tpstats (check for pending/blocked/dropped threads there)
>>>>>> - check WARN and ERRORS in the logs (ie. grep -e "WARN" -e "ERROR"
>>>>>> /var/log/cassandra/system.log)
>>>>>> - Check local latencies (nodetool tablestats /
>>>>>> nodetool tablehistograms) and compare it to the client request latency. At
>>>>>> the node level, reads should probably be a single digit in milliseconds,
>>>>>> rather close to 1 ms with SSDs and writes below the millisecond most
>>>>>> probably (it depends on the data size too, etc...).
>>>>>> - Trace a query during this period, see what takes time (for example
>>>>>> from  'cqlsh' - 'TRACING ON; SELECT ...')
>>>>>>
>>>>>> You can also analyze the *Garbage Collection* activity. As Cassandra
>>>>>> uses the JVM, a badly tuned GC will induce long pauses. Depending on the
>>>>>> workload, and I must say for most of the clusters I work on, the default
>>>>>> tuning is not that good and can keep servers busy 10-15% of the time
>>>>>> with stop-the-world GC.
>>>>>> You might find this post by my colleague Jon about GC tuning for
>>>>>> Apache Cassandra interesting:
>>>>>> http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. Working on GC
>>>>>> pressure is a very common way to optimize a Cassandra cluster and adapt
>>>>>> it to your workload/hardware.
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------------
>>>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>>>> France / Spain
>>>>>>
>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com
>>>>>>
>>>>>>
>>>>>> 2018-07-17 17:23 GMT+01:00 Riccardo Ferrari <ferra...@gmail.com>:
>>>>>>
>>>>>>> Hi list,
>>>>>>>
>>>>>>> Cassandra 3.0.6
>>>>>>>
>>>>>>> I'd like to test the change of concurrent compactors to see if it
>>>>>>> helps when the system is under stress.
>>>>>>>
>>>>>>> Can someone point me to the right mbean?
>>>>>>> I can not really find good docs about mbeans (or tools ...)
>>>>>>>
>>>>>>> Any suggestion much appreciated, best
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
