Re: concurrent_compactors via JMX
Hello Riccardo,

My understanding is that GP2 is better. I think we did some testing in the past, but I must say I do not remember the exact results. I remember we also considered IO1 at some point, but we were not convinced by this kind of EBS (I am not sure whether it was not as performant as the documentation suggested or just much more expensive). Maybe test it and form your own opinion, or wait for someone else's information. Be aware that the size of a GP2 volume determines its IOPS; the maximum IOPS is reached at ~3.334 TB, which is also a good dataset size for Cassandra (1.5-2 TB, with some spare space for compactions).

> I'd like to deploy on i3.xlarge

If you go for I3, of course, use the ephemeral drives (NVMe). They are incredibly fast ;-). Compared with m1.xlarge you should see a substantial difference. The problem is that with a low number of nodes, it will always cost more to have I3 than m1. This is often not the case with more machines, as each node will work far more efficiently and you can effectively reduce the number of nodes. Here, 3 will probably be the minimum number of nodes, and 3 x I3 might cost more than 5-6 x m1 instances. When scaling up, though, you should come back to an acceptable cost/efficiency ratio. It's your call whether to continue with m1, m5 or r4 instances meanwhile.

> I decided to get safe and scale horizontally with the hardware we have tested

Yes, this is fair enough and a safe approach. To add new hardware the best approach is a data center switch (I will write a post about how to do this sometime soon).

> I'm preparing to migrate inside vpc

This too probably goes through a DC switch. I remembered I asked for help on this in 2014, and I found the reference for you where I published the steps I went through to go from EC2 public --> public VPC --> private VPC. It's old and I did not read it again, but it worked for us at the time. I hope you find it useful as well, as the process is detailed step by step.
It should be easy to adapt, and you should not forget any step this way: http://grokbase.com/t/cassandra/user/1465m9txtw/vpc-aws#20140612k7xq0t280cvyk6waeytxbkx40c

> possibly in Multi-AZ.

Yes, I recommend you do this. It's incredibly powerful when you know that with 3 racks and RF=3 (and proper topology/configuration), each rack owns 100% of the data. Thus, when operating, you can work on one rack at a time with limited risk; even using quorum, the service should stay up no matter what happens, as long as 2 AZs are completely available. When the cluster grows you might really appreciate this to prevent some failures and operate safely.

> PS: I definitely owe you a coffee, actually much more than that!

If we meet we can definitely share a beer (no coffee for me, but I never say no to a beer ;-)). But you don't owe me anything; it was, and still is, for free. Here we all share, period. I like to think that knowledge is the only wealth you can give away while keeping it for yourself. Some even say that knowledge grows when shared. I used this mailing list myself to ramp up with Cassandra, so I have probably been paying the community back somehow for years now :-). Now it's even part of my job, a part of what we do :). And I like it. What I invite you to do is help people around you once you are comfortable with some topics. This way someone else might enjoy this mailing list, making it a nicer place and helping the community grow ;-). Yet, be assured I appreciate the feedback and that you are grateful; it shows this was somehow useful to you. This is enough for me.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 19:21 GMT+02:00 Riccardo Ferrari :

> Alain,
>
> I really appreciate your answers! A little typo is not changing the
> valuable content! For sure I will give a shot to your GC settings and come
> back with my findings.
> Right now I have 6 nodes up and running and everything looks good so far
> (at least much better).
>
> I agree, the hardware I am using is quite old, but rather than experimenting
> with new hardware combinations (on prod) I decided to get safe and scale
> horizontally with the hardware we have tested. I'm preparing to migrate
> inside vpc and I'd like to deploy on i3.xlarge instances, possibly in
> Multi-AZ.
>
> Speaking of EBS: I gave a quick I/O test to m3.xlarge + SSD + EBS (400
> PIOPS). SSD looks great for commitlogs; for EBS I might need more guidance.
> I certainly gain in terms of random I/O, however I'd like to hear what is
> your stand wrt IO1 (PIOPS) vs regular GP2? Or better: what are your
> guidelines when using EBS?
>
> Thanks!
>
> PS: I definitely owe you a coffee, actually much more than that!
>
> On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ wrote:
>
>> Ah, excuse my confusion. I now understand I guided you through changing the
>>> throughput when you wanted to change the compaction throughput
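The gp2 size-to-IOPS relation mentioned above can be sketched as a quick calculation. This is only a hedged sketch: it assumes the 2018-era gp2 limits (roughly 3 IOPS per GiB, a 100 IOPS floor, and a 10,000 IOPS cap, which is what makes ~3.334 TB the size where the cap is reached); re-check current AWS documentation before relying on these numbers.

```shell
# Rough gp2 baseline IOPS for a given volume size (2018-era limits, an assumption):
gp2_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))    # ~3 IOPS per GiB
  (( iops > 10000 )) && iops=10000  # cap, reached around ~3,334 GiB
  (( iops < 100 )) && iops=100      # small volumes get a 100 IOPS floor
  echo "$iops"
}

gp2_iops 3334   # at the cap
gp2_iops 1000   # a 1 TiB volume
```

This also shows why over-sizing a gp2 volume beyond ~3.3 TB buys no extra baseline IOPS.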
Re: concurrent_compactors via JMX
Alain,

I really appreciate your answers! A little typo is not changing the valuable content! For sure I will give a shot to your GC settings and come back with my findings.

Right now I have 6 nodes up and running and everything looks good so far (at least much better).

I agree, the hardware I am using is quite old, but rather than experimenting with new hardware combinations (on prod) I decided to get safe and scale horizontally with the hardware we have tested. I'm preparing to migrate inside vpc and I'd like to deploy on i3.xlarge instances, possibly in Multi-AZ.

Speaking of EBS: I gave a quick I/O test to m3.xlarge + SSD + EBS (400 PIOPS). SSD looks great for commitlogs; for EBS I might need more guidance. I certainly gain in terms of random I/O, however I'd like to hear what is your stand wrt IO1 (PIOPS) vs regular GP2? Or better: what are your guidelines when using EBS?

Thanks!

PS: I definitely owe you a coffee, actually much more than that!

On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ wrote:

>> Ah, excuse my confusion. I now understand I guided you through changing the
>> throughput when you wanted to change the compaction throughput.
>
> Wow, I meant to say "I guided you through changing the compaction
> throughput when you wanted to change the number of concurrent compactors."
>
> I should not answer messages before waking up fully...
>
> :)
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ :
>
>> Ah, excuse my confusion. I now understand I guided you through changing the
>> throughput when you wanted to change the compaction throughput.
>>
>> I also found some commands I ran in the past using jmxterm. As mentioned
>> by Chris - and thanks Chris for answering the question properly - the
>> 'max' can never be lower than the 'core'.
>> Use JMXTERM to REDUCE the concurrent compactors:
>>
>> ```
>> # if we currently have more than 2 threads:
>> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>>   && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```
>>
>> Use JMXTERM to INCREASE the concurrent compactors:
>>
>> ```
>> # if we currently have fewer than 6 threads:
>> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>>   && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
>> ```
>>
>> Some comments about the information you shared, as you said, 'thinking
>> out loud' :):
>>
>> *About the hardware*
>>
>> I remember using the 'm1.xlarge' :). They are not that recent. It will
>> probably be worth it to reconsider this hardware choice and migrate to
>> newer hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You should
>> be able to reduce the number of nodes and make it cost-equivalent (or
>> maybe slightly more expensive, but then it works properly). I once moved
>> from a lot of these nodes (80ish) to a few I2 instances (5 - 15? I don't
>> remember). Latency went from 20 ms to 3 - 5 ms (and was improved later
>> on). Also, using the right hardware for your case should avoid headaches
>> for you and your team. I started with t1.micro in prod and went all the
>> way up (m1.small, m1.medium, ...). It's good for learning, not for
>> business.
>> Especially, this does not work well together:
>>
>>> my instances are still on magnetic drives
>>
>> with
>>
>>> most tables on LCS
>>
>>> frequent r/w pattern
>>
>> Having some SSDs here (EBS GP2, or even better I3 - NVMe disks) would most
>> probably help to reduce the latency. I would also pick an instance with
>> more memory (30 GB would probably be more comfortable). The more memory,
>> the better the JVM can be tuned and the more page caching can be done
>> (thus avoiding some disk reads). Given the number of nodes you use, it's
>> complex to keep the cost low doing this change. When the cluster grows you
>> might want to consider changing the instance type again; maybe for now
>> just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory
>> and the same number of CPUs (or more), and see how many nodes are needed.
>> It might be slightly more expensive, but I really believe it could do some
>> good.
>>
>> As a middle-term solution, I think you might be really happy with a
>> change of this kind.
>>
>> *About DTCS/TWCS?*
>>
>>> - few tables with DTCS
>>> - need to upgrade to 3.0.8 for TWCS
>>
>> Indeed, switching from DTCS to TWCS can be a real relief for a cluster.
>> You should not have to wait to upgrade to 3.0.8 to use TWCS. I
>> must say I
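The memory point above (heap room plus page cache) can be made concrete with a rough budget. This is only a sketch: the 30 GB figure comes from the r4.xlarge-class suggestion in the quoted text, and the off-heap estimate is an assumption to adapt per node.

```shell
# Rough node memory budget (all figures are assumptions, adapt to your nodes):
ram_gb=30        # r4.xlarge-class instance
heap_gb=8        # JVM heap
offheap_gb=4     # rough guess for memtables, bloom filters, compression metadata
page_cache_gb=$(( ram_gb - heap_gb - offheap_gb ))
echo "page cache headroom: ${page_cache_gb} GB"
```

The point of the arithmetic: whatever RAM is not claimed by the JVM or off-heap structures is what the OS can use to cache SSTable reads.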
Re: concurrent_compactors via JMX
> Ah, excuse my confusion. I now understand I guided you through changing the
> throughput when you wanted to change the compaction throughput.

Wow, I meant to say "I guided you through changing the compaction throughput when you wanted to change the number of concurrent compactors."

I should not answer messages before waking up fully...

:)

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ :

> Ah, excuse my confusion. I now understand I guided you through changing the
> throughput when you wanted to change the compaction throughput.
>
> I also found some commands I ran in the past using jmxterm. As mentioned
> by Chris - and thanks Chris for answering the question properly - the
> 'max' can never be lower than the 'core'.
>
> Use JMXTERM to REDUCE the concurrent compactors:
>
> ```
> # if we currently have more than 2 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>   && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Use JMXTERM to INCREASE the concurrent compactors:
>
> ```
> # if we currently have fewer than 6 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
>   && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
> ```
>
> Some comments about the information you shared, as you said, 'thinking out
> loud' :):
>
> *About the hardware*
>
> I remember using the 'm1.xlarge' :). They are not that recent.
> It will probably be worth it to reconsider this hardware choice and migrate
> to newer hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You
> should be able to reduce the number of nodes and make it cost-equivalent
> (or maybe slightly more expensive, but then it works properly). I once
> moved from a lot of these nodes (80ish) to a few I2 instances (5 - 15? I
> don't remember). Latency went from 20 ms to 3 - 5 ms (and was improved
> later on). Also, using the right hardware for your case should avoid
> headaches for you and your team. I started with t1.micro in prod and went
> all the way up (m1.small, m1.medium, ...). It's good for learning, not for
> business.
>
> Especially, this does not work well together:
>
>> my instances are still on magnetic drives
>
> with
>
>> most tables on LCS
>
>> frequent r/w pattern
>
> Having some SSDs here (EBS GP2, or even better I3 - NVMe disks) would most
> probably help to reduce the latency. I would also pick an instance with
> more memory (30 GB would probably be more comfortable). The more memory,
> the better the JVM can be tuned and the more page caching can be done
> (thus avoiding some disk reads). Given the number of nodes you use, it's
> complex to keep the cost low doing this change. When the cluster grows you
> might want to consider changing the instance type again; maybe for now
> just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory
> and the same number of CPUs (or more), and see how many nodes are needed.
> It might be slightly more expensive, but I really believe it could do some
> good.
>
> As a middle-term solution, I think you might be really happy with a change
> of this kind.
>
> *About DTCS/TWCS?*
>
>> - few tables with DTCS
>> - need to upgrade to 3.0.8 for TWCS
>
> Indeed, switching from DTCS to TWCS can be a real relief for a cluster.
> You should not have to wait to upgrade to 3.0.8 to use TWCS. I
> must say I am not too sure for 3.0.x (x < 8) versions though.
> Maybe giving a try to http://thelastpickle.com/blog/2017/01/10/twcs-part2.html
> with https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for you?
>
> *Garbage Collection?*
>
> That being said, the CPU load is really high. I suspect Garbage Collection
> is costing a lot of time on the nodes of this cluster. It is probably not
> helping the CPUs either. This might even be the biggest pain point for this
> cluster.
>
> Would you like to try the following settings on a canary node and see how
> it goes? These settings are quite arbitrary; with the gc.log I could be
> more precise about what I believe is a correct setting.
>
> GC Type: CMS
> Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
> New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
> TenuringThreshold: 15 (instead of 1, which is definitely too small and
> tends to let short-lived objects still be promoted to the old gen)
>
> For those settings, I do not trust the cassandra defaults in most cases.
> New_heap_size should be 25-50% of the heap (and not rela
Re: concurrent_compactors via JMX
Ah, excuse my confusion. I now understand I guided you through changing the throughput when you wanted to change the compaction throughput.

I also found some commands I ran in the past using jmxterm. As mentioned by Chris - and thanks Chris for answering the question properly - the 'max' can never be lower than the 'core'.

Use JMXTERM to REDUCE the concurrent compactors:

```
# if we currently have more than 2 threads:
echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
  && echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
```

Use JMXTERM to INCREASE the concurrent compactors:

```
# if we currently have fewer than 6 threads:
echo "set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199 \
  && echo "set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
```

Some comments about the information you shared, as you said, 'thinking out loud' :):

*About the hardware*

I remember using the 'm1.xlarge' :). They are not that recent. It will probably be worth it to reconsider this hardware choice and migrate to newer hardware (m5/r4 + EBS GP2, or I3 with ephemeral storage). You should be able to reduce the number of nodes and make it cost-equivalent (or maybe slightly more expensive, but then it works properly). I once moved from a lot of these nodes (80ish) to a few I2 instances (5 - 15? I don't remember). Latency went from 20 ms to 3 - 5 ms (and was improved later on). Also, using the right hardware for your case should avoid headaches for you and your team. I started with t1.micro in prod and went all the way up (m1.small, m1.medium, ...). It's good for learning, not for business.
Especially, this does not work well together:

> my instances are still on magnetic drives

with

> most tables on LCS

> frequent r/w pattern

Having some SSDs here (EBS GP2, or even better I3 - NVMe disks) would most probably help to reduce the latency. I would also pick an instance with more memory (30 GB would probably be more comfortable). The more memory, the better the JVM can be tuned and the more page caching can be done (thus avoiding some disk reads). Given the number of nodes you use, it's complex to keep the cost low doing this change. When the cluster grows you might want to consider changing the instance type again; maybe for now just take an r4.xlarge + EBS GP2 volume, which comes with 30+ GB of memory and the same number of CPUs (or more), and see how many nodes are needed. It might be slightly more expensive, but I really believe it could do some good.

As a middle-term solution, I think you might be really happy with a change of this kind.

*About DTCS/TWCS?*

> - few tables with DTCS
> - need to upgrade to 3.0.8 for TWCS

Indeed, switching from DTCS to TWCS can be a real relief for a cluster. You should not have to wait to upgrade to 3.0.8 to use TWCS. I must say I am not too sure for 3.0.x (x < 8) versions though. Maybe giving a try to http://thelastpickle.com/blog/2017/01/10/twcs-part2.html with https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for you?

*Garbage Collection?*

That being said, the CPU load is really high. I suspect Garbage Collection is costing a lot of time on the nodes of this cluster. It is probably not helping the CPUs either. This might even be the biggest pain point for this cluster.

Would you like to try the following settings on a canary node and see how it goes? These settings are quite arbitrary; with the gc.log I could be more precise about what I believe is a correct setting.

GC Type: CMS
Heap: 8 GB (could be bigger, but we are limited by the 15 GB in total).
New_heap: 2 - 4 GB (maybe experiment with the 2 distinct values)
TenuringThreshold: 15 (instead of 1, which is definitely too small and tends to let short-lived objects still be promoted to the old gen)

For those settings, I do not trust the cassandra defaults in most cases. New_heap_size should be 25-50% of the heap (and not related to the number of CPU cores). Also, below 16 GB I never had a better result with G1GC than with CMS. But I must say I have been fighting a lot with CMS in the past to tune it nicely, while I did not even play much with G1GC. This (or similar settings) worked for distinct cases having heavy read patterns. In the mailing list I recently explained to someone else my understanding of the JVM and GC; there is also a blog post my colleague Jon wrote here: http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. I believe he suggested a slightly different tuning. If none of this is helping, please send the gc.log file over, with and without this change, and we could have a look at what is going on. SurvivorRatio can also be moved down to 2 or 4, if you want to play
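The CMS settings above could be written out in cassandra-env.sh style. This is a hedged sketch, not the author's exact configuration: the flag names are standard HotSpot options, but the exact file (cassandra-env.sh vs jvm.options) and the values should be validated against your own gc.log.

```shell
# Hedged sketch of the canary-node CMS settings discussed above:
JVM_OPTS="$JVM_OPTS -Xms8G -Xmx8G"                # fixed 8 GB heap
JVM_OPTS="$JVM_OPTS -Xmn2G"                       # new gen: try 2 GB, then 4 GB
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=15"  # instead of the default of 1
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"          # optionally try 2
```

Change one value at a time on the canary node, and keep the gc.log from before and after so the comparison is meaningful.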
Re: concurrent_compactors via JMX
Chris,

Thank you for the mbean reference.

On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari wrote:

> Alain, thank you for your email. I really really appreciate it!
>
> I am actually trying to remove the disk I/O from the suspect list, thus I
> want to reduce the number of concurrent compactors. I'll give throughput a
> shot.
> No, I don't have a long list of pending compactions; however, my instances
> are still on magnetic drives and can't really afford a high number of
> compactors.
>
> We started to have slowdowns and most likely we were undersized; new
> features are coming in and I want to be ready for them.
>
> *About the issue:*
>
> - High system load on cassandra nodes. This means top saying 6.0/12.0
>   on a 4 vcpu instance (!)
> - CPU is high:
>   - Dynatrace says 50%
>   - top easily goes to 80%
> - Network around 30Mb (according to Dynatrace)
> - Disks:
>   - ~40 iops
>   - high latency: ~20ms (min 8, max 50!)
>   - negligible iowait
>   - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
> - Client timeouts
>   - mostly when reading
>   - few cases when writing
> - Slowly growing number of "All time blocked" of Native T-R
>   - small numbers: hundreds vs millions of successfully served requests
>
> The system:
>
> - Cassandra 3.0.6
>   - most tables on LCS
>     - frequent r/w pattern
>   - few tables with DTCS
>     - need to upgrade to 3.0.8 for TWCS
>     - mostly TS data, stream write / batch read
>   - All our keyspaces have RF: 3
> - All nodes in the same AZ
> - m1.xlarge
>   - 4x420 drives (ephemeral storage) configured in striping (raid0)
>   - 4 vcpu
>   - 15GB ram
> - workload:
>   - Java applications:
>     - mostly feeding cassandra, writing data coming in
>   - Apache Spark applications:
>     - batch processes to read and write back to C* or other systems
>     - not co-located
>
> So far my effort was put into growing the ring to better distribute the
> load and decrease the pressure, including:
>
> - Increasing the node number from 3 to 5 (6th node joining)
> - jvm memory
> "optimization":
>   - heaps were set by the default script to something a bit smaller than
>     4GB, with CMS gc
>     - gc pressure was high / long gc pauses
>     - clients were suffering read timeouts
>   - increased the heap, still using CMS:
>     - very long GC pauses
>     - not much tuning around CMS
>   - switched to G1 and forced a 6/7GB heap on each node using almost the
>     suggested settings
>     - much more stable
>     - generally < 300ms
>     - I still have long pauses from time to time (mostly around 1200ms,
>       sometimes on some nodes 3000)
>
> *Thinking out loud:*
> Things are much better; however, I still see high cpu usage, especially
> when Spark kicks in, even though the spark jobs are very small in terms of
> resources (single worker with very limited parallelism).
>
> On LCS tables cfstats reports single-digit read latencies and generally
> 0.X write latencies (as of today).
> On DTCS tables I have 0.x ms write latency but still double-digit read
> latency; I guess I should spend some time tuning that, or upgrade and
> move away from DTCS :(
> Yes, Spark reads mostly from DTCS tables.
>
> It is still kind of common to have dropped READ, HINT and MUTATION.
>
> - not on all nodes
> - this generally happens on node restart
>
> On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed
> 14.04 and 16.04) with terrible results: much slower instance startup and
> responsiveness. How could that be?
>
> Once everything is stabilized I'll prepare our move to vpc and possibly
> upgrade to i3 instances. Any comment on the hardware side? Is 4 cores
> still reasonable hardware?
>
> Best,
>
> On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ wrote:
>
>> Hello Riccardo,
>>
>> I noticed I have been writing a novel to answer a simple couple of
>> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's
>> what you were looking for :). Also, there is a warning that it might be
>> counter-productive and stress the cluster even more to increase the
>> compaction throughput.
>> There is more information below ('about the issue').
>>
>> *tl;dr*:
>>
>> What about using 'nodetool setcompactionthroughput XX' instead? It should
>> be available there.
>>
>> In the same way, 'nodetool getcompactionthroughput' gives you the current
>> value. Be aware that a change done through JMX/nodetool is *not*
>> permanent.
>> You still need to update the cassandra.yaml file.
>>
>> If you really want to use the MBean through JMX, because using 'nodetool'
>> is too easy (or for any other reason :p):
>>
>> Mbean: org.apache.cassandra.service.StorageServiceMBean
>> Attribute: CompactionThroughputMbPerSec
>>
>> *Long story* with the "how to", since I wen
Re: concurrent_compactors via JMX
Alain, thank you for your email. I really really appreciate it!

I am actually trying to remove the disk I/O from the suspect list, thus I want to reduce the number of concurrent compactors. I'll give throughput a shot.
No, I don't have a long list of pending compactions; however, my instances are still on magnetic drives and can't really afford a high number of compactors.

We started to have slowdowns and most likely we were undersized; new features are coming in and I want to be ready for them.

*About the issue:*

- High system load on cassandra nodes. This means top saying 6.0/12.0 on a 4 vcpu instance (!)
- CPU is high:
  - Dynatrace says 50%
  - top easily goes to 80%
- Network around 30Mb (according to Dynatrace)
- Disks:
  - ~40 iops
  - high latency: ~20ms (min 8, max 50!)
  - negligible iowait
  - testing an empty instance with fio I get 1200 r_iops / 400 w_iops
- Client timeouts
  - mostly when reading
  - few cases when writing
- Slowly growing number of "All time blocked" of Native T-R
  - small numbers: hundreds vs millions of successfully served requests

The system:

- Cassandra 3.0.6
  - most tables on LCS
    - frequent r/w pattern
  - few tables with DTCS
    - need to upgrade to 3.0.8 for TWCS
    - mostly TS data, stream write / batch read
  - All our keyspaces have RF: 3
- All nodes in the same AZ
- m1.xlarge
  - 4x420 drives (ephemeral storage) configured in striping (raid0)
  - 4 vcpu
  - 15GB ram
- workload:
  - Java applications:
    - mostly feeding cassandra, writing data coming in
  - Apache Spark applications:
    - batch processes to read and write back to C* or other systems
    - not co-located

So far my effort was put into growing the ring to better distribute the load and decrease the pressure, including:

- Increasing the node number from 3 to 5 (6th node joining)
- jvm memory "optimization":
  - heaps were set by the default script to something a bit smaller than 4GB, with CMS gc
    - gc pressure was high / long gc pauses
    - clients were suffering read timeouts
  - increased the heap, still using CMS:
    - very
long GC pauses
    - not much tuning around CMS
  - switched to G1 and forced a 6/7GB heap on each node using almost the suggested settings
    - much more stable
    - generally < 300ms
    - I still have long pauses from time to time (mostly around 1200ms, sometimes on some nodes 3000)

*Thinking out loud:*
Things are much better; however, I still see high cpu usage, especially when Spark kicks in, even though the spark jobs are very small in terms of resources (single worker with very limited parallelism).

On LCS tables cfstats reports single-digit read latencies and generally 0.X write latencies (as of today).
On DTCS tables I have 0.x ms write latency but still double-digit read latency; I guess I should spend some time tuning that, or upgrade and move away from DTCS :(
Yes, Spark reads mostly from DTCS tables.

It is still kind of common to have dropped READ, HINT and MUTATION.

- not on all nodes
- this generally happens on node restart

On a side note, I tried to install libjemalloc1 from the Ubuntu repo (mixed 14.04 and 16.04) with terrible results: much slower instance startup and responsiveness. How could that be?

Once everything is stabilized I'll prepare our move to vpc and possibly upgrade to i3 instances. Any comment on the hardware side? Is 4 cores still reasonable hardware?

Best,

On Tue, Jul 17, 2018 at 9:18 PM, Alain RODRIGUEZ wrote:

> Hello Riccardo,
>
> I noticed I have been writing a novel to answer a simple couple of
> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's
> what you were looking for :). Also, there is a warning that it might be
> counter-productive and stress the cluster even more to increase the
> compaction throughput. There is more information below ('about the issue').
>
> *tl;dr*:
>
> What about using 'nodetool setcompactionthroughput XX' instead? It should
> be available there.
>
> In the same way, 'nodetool getcompactionthroughput' gives you the current
> value. Be aware that a change done through JMX/nodetool is *not* permanent.
> You still need to update the cassandra.yaml file.
>
> If you really want to use the MBean through JMX, because using 'nodetool'
> is too easy (or for any other reason :p):
>
> Mbean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
>
> *Long story* with the "how to", since I went through this search myself; I
> did not know where this MBean was.
>
>> Can someone point me to the right mbean?
>> I cannot really find good docs about mbeans (or tools ...)
>
> I am not sure about the doc, but you can use jmxterm
> (http://wiki.cyclopsgroup.org/jmxterm/download.html).
>
> To replace the doc I use CCM (https://git
Re: concurrent_compactors via JMX
Refer to Alain's email, but to strictly answer the question of increasing concurrent_compactors via jmx: there are two attributes you can increase that would set the maximum number of concurrent compactions.

org.apache.cassandra.db:type=CompactionManager,name=MaximumCompactorThreads -> 6
org.apache.cassandra.db:type=CompactionManager,name=CoreCompactorThreads -> 6

would set it to 6. To decrease them you will want to go in the opposite order (core, then max). Just increasing the number of concurrent compactors doesn't mean that all of them will be utilized, though.

Chris

> On Jul 17, 2018, at 12:18 PM, Alain RODRIGUEZ wrote:
>
> Hello Riccardo,
>
> I noticed I have been writing a novel to answer a simple couple of
> questions again ¯\_(ツ)_/¯. So here is a short answer in case that's what
> you were looking for :). Also, there is a warning that it might be
> counter-productive and stress the cluster even more to increase the
> compaction throughput. There is more information below ('about the issue').
>
> tl;dr:
>
> What about using 'nodetool setcompactionthroughput XX' instead? It should
> be available there.
>
> In the same way, 'nodetool getcompactionthroughput' gives you the current
> value. Be aware that a change done through JMX/nodetool is not permanent.
> You still need to update the cassandra.yaml file.
>
> If you really want to use the MBean through JMX, because using 'nodetool'
> is too easy (or for any other reason :p):
>
> Mbean: org.apache.cassandra.service.StorageServiceMBean
> Attribute: CompactionThroughputMbPerSec
>
> Long story with the "how to", since I went through this search myself; I
> did not know where this MBean was.
>
>> Can someone point me to the right mbean?
>> I cannot really find good docs about mbeans (or tools ...)
>
> I am not sure about the doc, but you can use jmxterm
> (http://wiki.cyclopsgroup.org/jmxterm/download.html).
> To replace the doc I use CCM (https://github.com/riptano/ccm) + jconsole
> to find the mbeans locally:
>
> * Add loopback addresses for ccm (see the readme file)
> * Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
> * Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
> * Explore MBeans, and try to guess where this could be (and discover other
>   funny stuff in there :)).
>
> I must admit I did not find it this way using C* 3.0.6 and jconsole.
> I looked at the code: I locally used C* 3.0.6 and ran 'grep -RiI
> CompactionThroughput' with this result:
> https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006
>
> With this I could find the right MBean; the only code documentation that
> is always up to date is the code itself, I am afraid:
>
> './src/java/org/apache/cassandra/service/StorageServiceMBean.java:public
> void setCompactionThroughputMbPerSec(int value);'
>
> Note that the research in the code also leads to nodetool ;-).
>
> I could finally find the MBean in 'jconsole' too:
> https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will
> live).
>
> jconsole also lets you see which attributes can be set and which cannot.
>
> You can now find any other MBean you need, I hope :).
>
>> see if it helps when the system is under stress
>
> *About the issue*
>
> You don't exactly say what you are observing. What is that "stress"? How
> is it impacting the cluster?
>
> I ask because I am afraid this change might not help and could even be
> counter-productive. Even though having SSTables nicely compacted makes a
> huge difference at read time, if that's already the case for you and the
> data is already nicely compacted, this change won't help.
It might even make > things slightly worse if the current bottleneck is the disk IO during a > stress period as the compactors would increase their disk read throughput, > thus maybe fight with the read requests for disk throughput. > > If you have a similar number of sstables on all nodes, not many compactions > pending (nodetool netstats -H) and read operations are hitting a small number > sstables (nodetool tablehistogram) then you probably don't need to increase > the compaction speed. > > Let's say that the compaction throughput is not often the cause of stress > during peak hours nor a direct
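The two CompactionManager attributes Chris lists can also be set non-interactively with jmxterm instead of clicking through jconsole. A minimal sketch, assuming the jmxterm uber-jar, the default JMX port 7199, and jmxterm's 'set -b' syntax (double-check against its --help; the script file name here is mine):

```shell
# Write the jmxterm command script. Order matters: raise the maximum before
# the core count; when decreasing, lower the core count first.
cat > set_compactors.jmx <<'EOF'
set -b org.apache.cassandra.db:type=CompactionManager MaximumCompactorThreads 6
set -b org.apache.cassandra.db:type=CompactionManager CoreCompactorThreads 6
EOF

# Run it against a live node (commented out; needs Cassandra and jmxterm):
# java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n < set_compactors.jmx
cat set_compactors.jmx
```

As with the compaction throughput, this is not persisted: concurrent_compactors in cassandra.yaml still needs updating to survive a restart.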
Re: concurrent_compactors via JMX
Hello Riccardo,

I noticed I have been writing a novel to answer a simple couple of questions again ¯\_(ツ)_/¯. So here is a short answer, in case that's what you were looking for :). Also, there is a warning that it might be counter-productive and stress the cluster even more to increase the compaction throughput. There is more information below ('About the issue').

*tl;dr*:

What about using 'nodetool setcompactionthroughput XX' instead? It should be available there. In the same way, 'nodetool getcompactionthroughput' gives you the current value. Be aware that this change done through JMX/nodetool is *not* permanent: you still need to update the cassandra.yaml file.

If you really want to use the MBean through JMX, because using 'nodetool' is too easy (or for any other reason :p):

MBean: org.apache.cassandra.service.StorageServiceMBean
Attribute: CompactionThroughputMbPerSec

*Long story* with the "how to", since I went through this search myself and did not know where this MBean was.

> Can someone point me to the right mbean? I can not really find good docs
> about mbeans (or tools ...)

I am not sure about the doc, but you can use jmxterm (http://wiki.cyclopsgroup.org/jmxterm/download.html). To replace the doc, I use CCM (https://github.com/riptano/ccm) + jconsole to find the mbeans locally:

* Add loopback addresses for ccm (see the readme file)
* Then create the cluster: 'ccm create Cassandra-3-0-6 -v 3.0.6 -n 3 -s'
* Start jconsole using the right pid: 'jconsole $(ccm node1 show | grep pid | cut -d "=" -f 2)'
* Explore MBeans, try to guess where this could be (and discover other funny stuff in there :)).

I must admit I did not find it this way using C*3.0.6 and jconsole.
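The pid extraction in that jconsole command is just grep + cut. Here it is run on a fabricated sample line (the 'pid=12345' format is a stand-in for real ccm output) so the quoting can be sanity-checked without a cluster:

```shell
# Simulated 'ccm node1 show' output; a real run prints several key=value lines.
show_output='node1: UP
pid=12345
binary=127.0.0.1:9042'

# Same pipeline as in the jconsole command above.
pid=$(printf '%s\n' "$show_output" | grep pid | cut -d "=" -f 2)
echo "$pid"
```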
I looked at the code: I locally used C*3.0.6 and ran 'grep -RiI CompactionThroughput', with this result: https://gist.github.com/arodrime/f9591e4bdd2b1367a496447cdd959006

With this I could find the right MBean; the only code documentation that is always up to date is the code itself, I am afraid:

'./src/java/org/apache/cassandra/service/StorageServiceMBean.java: public void setCompactionThroughputMbPerSec(int value);'

Note that searching the code also leads to nodetool ;-).

I could finally find the MBean in 'jconsole' too: https://cdn.pbrd.co/images/HuUya3x.png (not sure how long this link will live). jconsole also lets you see which attributes can be set and which cannot.

You can now find any other MBean you need, I hope :).

> see if it helps when the system is under stress

*About the issue*

You don't say exactly what you are observing; what is that "stress"? How is it impacting the cluster?

I ask because I am afraid this change might not help and might even be counter-productive. Even though having SSTables nicely compacted makes a huge difference at read time, if the data is already nicely compacted, this change won't help. It might even make things slightly worse if the current bottleneck is disk IO during a stress period, as the compactors would increase their disk read throughput and thus compete with read requests for disk throughput.

If you have a similar number of sstables on all nodes, not many pending compactions (nodetool compactionstats -H), and read operations are hitting a small number of sstables (nodetool tablehistograms), then you probably don't need to increase the compaction speed.

Let's say that the compaction throughput is not often the cause of stress during peak hours, nor a direct way to make things 'faster'. Generally, when compaction goes wrong, the number of sstables goes *through* the roof. If you have a chart showing the number of sstables, you can see this really well.
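If you have no sstable-count chart, a crude substitute is counting Data.db files under the data directory. A sketch on a throwaway fixture (the directory layout and the 'mc-N-big-Data.db' naming are my assumptions from the 3.x era); on a real node, point it at /var/lib/cassandra/data instead:

```shell
# Count live sstables under a data directory tree: one *-Data.db per sstable.
count_sstables() {
  find "$1" -name '*-Data.db' | wc -l | tr -d ' '
}

# Throwaway fixture standing in for /var/lib/cassandra/data:
fixture=$(mktemp -d)
mkdir -p "$fixture/my_keyspace/my_table-0d6ae310"
touch "$fixture/my_keyspace/my_table-0d6ae310/mc-1-big-Data.db" \
      "$fixture/my_keyspace/my_table-0d6ae310/mc-2-big-Data.db"

count_sstables "$fixture"
rm -r "$fixture"
```

Run it periodically (or per keyspace) and watch the trend: a steadily climbing count is the "through the roof" pattern described above.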
Of course, if you feel you are in this case, increasing the compaction throughput will definitely help, provided the cluster also has spare disk throughput.

To check what's wrong, if you believe it's something different, here are some useful commands:

- nodetool tpstats (check for pending/blocked/dropped threads there)
- Check WARN and ERROR messages in the logs (i.e. grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log)
- Check local latencies (nodetool tablestats / nodetool tablehistograms) and compare them to the client request latency. At the node level, reads should probably be single-digit milliseconds, rather close to 1 ms with SSDs, and writes below the millisecond most probably (it depends on the data size too, etc...).
- Trace a query during this period and see what takes time (for example from 'cqlsh': 'TRACING ON; SELECT ...')

You can also analyze the *Garbage Collection* activity. As Cassandra uses the JVM, a badly tuned GC will induce long pauses. Depending on the workload, and I must say for most of the clusters I work on, the default tuning is not that good and can keep servers busy 10-15% of the time with stop-the-world GC. You might find this post by my colleague Jon about GC tuning for Apache Cassandra interesting: http://thelastpickle.com/blog/2018/04/11/gc-tun
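For a first look at GC pauses without any JVM tooling, you can grep the GCInspector lines Cassandra already writes to system.log. A sketch over a fabricated two-line sample (the exact log format varies by version, so treat the regex as a starting point rather than gospel):

```shell
# Fabricated system.log excerpt; real GCInspector lines look roughly like this.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO  GCInspector.java:284 - ParNew GC in 212ms.  CMS Old Gen: ...
WARN  GCInspector.java:282 - ConcurrentMarkSweep GC in 1403ms. ...
EOF

# Print every stop-the-world pause of 1000ms or more.
grep -oE 'GC in [0-9]+ms' "$log" | awk '{ms=$3+0; if (ms>=1000) print ms"ms"}'
```

On a real node, replace "$log" with /var/log/cassandra/system.log; frequent multi-second pauses are a strong hint that GC tuning, not compaction throughput, is the place to look.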
concurrent_compactors via JMX
Hi list,

Cassandra 3.0.6 here. I'd like to test changing the number of concurrent compactors to see if it helps when the system is under stress. Can someone point me to the right mbean? I can not really find good docs about mbeans (or tools ...).

Any suggestion much appreciated,
best