Your dashboards are great. The only challenge is getting all the data to feed them.
On Tue, Oct 16, 2018 at 1:45 PM Carl Mueller <carl.muel...@smartthings.com> wrote:
> metadata.csv: that helps a lot, thank you!
>
> On Fri, Oct 5, 2018 at 5:42 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
>> I feel you for most of the troubles you faced, I've been facing most of
>> them too. Again, Datadog support can probably help you with most of those.
>> You should really consider sharing this feedback with them.
>>
>>> there is re-namespacing of the metric names in lots of cases, and these
>>> don't appear to be centrally documented, but maybe i haven't found the
>>> magic page.
>>
>> I don't know if that would be the 'magic' page, but that's something:
>> https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv
>>
>>> There are sooooo many good stats.
>>
>> Yes, and it's still improving. I love this about Cassandra. It's our work
>> to pick the relevant ones for each situation. I would not like Cassandra to
>> reduce the number of metrics exposed, we need to learn to handle them
>> properly. Also, this is the reason we designed 4 dashboards out of the box,
>> the goal was to have everything we need for distinct scenarios:
>> - Overview - global health-check / anomaly detection
>> - Read Path - troubleshooting / optimizing read ops
>> - Write Path - troubleshooting / optimizing write ops
>> - SSTable Management - troubleshooting / optimizing
>>   compaction/flushes/... anything related to sstables.
>>
>> instead of the single overview dashboard that was present before. We are
>> also perfectly aware that it's far from perfect, but aiming at perfect
>> would only have had us never releasing anything. Anyone interested could
>> now build missing dashboards or improve existing ones for themselves and/or
>> suggest improvements to Datadog :). I hope I'll do some more of this work
>> at some point in the future.
>>
>> Good luck,
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On Thu, Oct 4, 2018 at 21:21, Carl Mueller
>> <carl.muel...@smartthings.com.invalid> wrote:
>>
>>> for 2.1.x we had a custom reporter that delivered metrics to datadog's
>>> endpoint via https, bypassing the agent-imposed 350. But integrating that
>>> required targeting the other shared libs in the cassandra path, so the
>>> build is a bit of a pain when we update major versions.
>>>
>>> We are migrating our 2.1.x specific dashboards, and we will use
>>> agent-delivered metrics for non-table, and adapt the custom library to
>>> deliver the table-based ones, at a slower rate than the "core" ones.
>>>
>>> Datadog is also super annoying because there doesn't appear to be
>>> anything that reports what metrics the agent is sending (the metric count
>>> can indicate if a configured new metric increased the count and is being
>>> reported, but it's still... a guess), and there is re-namespacing of the
>>> metric names in lots of cases, and these don't appear to be centrally
>>> documented, but maybe i haven't found the magic page.
>>>
>>> There are sooooo many good stats. We might also implement some facility
>>> to dynamically turn on the delivery of detailed metrics on the nodes.
>>>
>>> On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>>
>>>> Hello Carl,
>>>>
>>>>> I guess we can use bean_regex to do specific targeted metrics for the
>>>>> important tables anyway.
>>>>
>>>> Yes, this would work, but 350 is very limited for Cassandra dashboards.
>>>> We have a LOT of metrics available.
>>>>
>>>>> Datadog 350 metric limit is a PITA for tables once you get over 10
>>>>> tables
>>>>
>>>> I noticed this while I was working on providing default dashboards for
>>>> Cassandra-Datadog integration.
>>>> I was told by the Datadog team it would not be
>>>> an issue for users, that I should not care about it. As you pointed out,
>>>> per table metrics quickly increase the total number of metrics we need to
>>>> collect.
>>>>
>>>> I believe you can set the following option:
>>>> *"max_returned_metrics: 1000"* - it can be used if metrics are missing,
>>>> to increase the limit on the number of collected metrics. Be aware of the
>>>> CPU utilization that this might imply (greatly improved in dd-agent
>>>> version 6+ I believe -thanks Datadog teams for that- making this fully
>>>> usable for Cassandra). This option should go in the Datadog agent's
>>>> *cassandra.yaml* file for the Cassandra integration, off the top of my
>>>> head.
>>>>
>>>> Also, do not hesitate to reach out to Datadog directly for these kinds of
>>>> questions, I have always been very happy with their support so far, I am
>>>> sure they would guide you through this as well, probably better than we can
>>>> do :). It also provides them with feedback on what people are struggling
>>>> with I imagine.
>>>>
>>>> I am interested to know if you still have issues getting more metrics
>>>> (option above not working / CPU under too much load) as this would make the
>>>> dashboards we built mostly unusable for clusters with more tables. We might
>>>> then need to review the design.
>>>>
>>>> As a side note, I believe metrics are handled the same way across
>>>> versions, they got the same name/label for C*2.1, 2.2 and 3+ on Datadog.
>>>> There is an abstraction layer that removes this complexity (if I remember
>>>> well, we built those dashboards a while ago).
>>>>
>>>> C*heers
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>> On Mon, Oct 1, 2018 at 19:38, Carl Mueller
>>>> <carl.muel...@smartthings.com.invalid> wrote:
>>>>
>>>>> That's great too, thank you.
>>>>>
>>>>> Datadog 350 metric limit is a PITA for tables once you get over 10
>>>>> tables, but I guess we can use bean_regex to do specific targeted metrics
>>>>> for the important tables anyway.
>>>>>
>>>>> On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ <arodr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Carl,
>>>>>>
>>>>>> Here is a message I sent to my team a few months ago. I hope this
>>>>>> will be helpful to you and more people around :). It might not be
>>>>>> exhaustive and we were moving from C*2.1 to C*3+ in this case, thus
>>>>>> skipping C*2.2, but C*2.2 is similar to C*3.0 if I remember correctly in
>>>>>> terms of metrics. Here it is for what it's worth:
>>>>>>
>>>>>> Quite a few things changed between the metric reporter in C*2.1 and
>>>>>> C*3.0.
>>>>>> - ColumnFamily --> Table
>>>>>> - XXpercentile --> pXX
>>>>>> - 1MinuteRate --> m1_rate
>>>>>> - metric name before KS and Table names and some other changes of
>>>>>>   this kind.
>>>>>> - ^ aggregations / aliases indexes changed because of this (using
>>>>>>   graphite for example) ^
>>>>>> - ‘.value’ is not appended in the metric name anymore for gauges,
>>>>>>   nothing instead.
>>>>>>
>>>>>> For example (graphite):
>>>>>>
>>>>>> From
>>>>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile, 2, 3), 1, 7, 8, 9)
>>>>>>
>>>>>> to
>>>>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95, 2, 3), 1, 8, 9, 10)
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------------
>>>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>>>> France / Spain
>>>>>>
>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com
>>>>>>
>>>>>> On Fri, Sep 28, 2018 at 20:38, Carl Mueller
>>>>>> <carl.muel...@smartthings.com.invalid> wrote:
>>>>>>
>>>>>>> VERY NICE!
>>>>>>> Thank you very much
>>>>>>>
>>>>>>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov
>>>>>>> <lyuben.todo...@instaclustr.com> wrote:
>>>>>>>
>>>>>>>> Nothing as fancy as a matrix but a list of what JMX term can see.
>>>>>>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>>>>>>
>>>>>>>> /lyubent
>>>>>>>>
>>>>>>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>>>>>>> <carl.muel...@smartthings.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> It's my understanding that metrics got heavily re-namespaced in
>>>>>>>>> JMX for 2.2 from 2.1
>>>>>>>>>
>>>>>>>>> Did anyone ever make a migration matrix/guide for conversion of
>>>>>>>>> old metrics to new metrics?
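
In lieu of a full migration matrix, the renames listed in the thread for the C*2.1 to C*3.0 move can be sketched as a small Python helper. This is only an illustration under assumptions: the function names are made up for this example, it handles only per-table (ColumnFamily) metrics written as Graphite-style dotted paths, and it applies only the changes named in the thread (ColumnFamily to Table, XXpercentile to pXX, 1MinuteRate to m1_rate, metric name moved before the keyspace and table names, '.value' dropped for gauges). Any other renames between the two versions are not covered.

```python
import re

# Common prefix of Cassandra metric paths, as seen in the graphite example.
PREFIX = "org.apache.cassandra.metrics."


def rename_suffix(part: str) -> str:
    """Rename a trailing path element using the rules listed in the thread."""
    match = re.fullmatch(r"(\d+)percentile", part)
    if match:
        return "p" + match.group(1)        # XXpercentile --> pXX
    if part == "1MinuteRate":
        return "m1_rate"                   # 1MinuteRate --> m1_rate
    return part                            # anything else is left untouched


def convert_table_metric(path: str) -> str:
    """Rewrite a 2.1-style per-table metric path into the 3.0-style layout.

    Paths that do not contain the ColumnFamily prefix are returned unchanged.
    """
    head, marker, tail = path.partition(PREFIX + "ColumnFamily.")
    if not marker:
        return path
    parts = tail.split(".")
    if len(parts) < 3:
        return path                        # not keyspace.table.metric[...]
    keyspace, table, metric, *suffix = parts
    # Gauges no longer carry '.value'; other suffixes get renamed.
    suffix = [rename_suffix(s) for s in suffix if s != "value"]
    # The metric name now comes before the keyspace and table names.
    return head + PREFIX + ".".join(["Table", metric, keyspace, table, *suffix])


if __name__ == "__main__":
    # 'prod', 'dc1', 'host1', 'ks1' and 'users' are placeholder names.
    base = "cassandra.prod.dc1.host1.org.apache.cassandra.metrics.ColumnFamily.ks1.users."
    for old in (base + "ReadLatency.95percentile", base + "LiveSSTableCount.value"):
        print(old, "->", convert_table_metric(old))
```

Because the metric name moves ahead of the keyspace and table, any positional indexes in Graphite functions such as aliasByNode still have to be adjusted by hand, as in the From/to example in the thread.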