Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Carl Mueller Tue, 16 Oct 2018 11:47:02 -0700

Your dashboards are great. The only challenge is getting all the data to
feed them.



On Tue, Oct 16, 2018 at 1:45 PM Carl Mueller <carl.muel...@smartthings.com>
wrote:

> metadata.csv: that helps a lot, thank you!
>
> On Fri, Oct 5, 2018 at 5:42 AM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
>> I feel you for most of the troubles you faced, I've been facing most of
>> them too. Again, Datadog support can probably help you with most of those.
>> You should really consider sharing this feedback to them.
>>
>>> there is re-namespacing of the metric names in lots of cases, and these
>>> don't appear to be centrally documented, but maybe i haven't found the
>>> magic page.
>>>
>>
>> I don't know if that would be the 'magic' page, but that's something:
>> https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv
>>
>>> There are sooooo many good stats.
>>>
>>
>> Yes, and it's still improving. I love this about Cassandra. It's our work
>> to pick the relevant ones for each situation. I would not like Cassandra to
>> reduce the number of metrics exposed; we need to learn to handle them
>> properly. Also, this is the reason we designed 4 dashboards out of the box,
>> the goal was to have everything we need for distinct scenarios:
>> - Overview - global health-check / anomaly detection
>> - Read Path - troubleshooting / optimizing read ops
>> - Write Path - troubleshooting / optimizing write ops
>> - SSTable Management - troubleshooting / optimizing
>> compaction/flushes/... anything related to sstables.
>>
>> instead of the single overview dashboard that was present before. We are
>> also perfectly aware that it's far from perfect, but aiming at perfect
>> would only have had us never releasing anything. Anyone interested can
>> now build missing dashboards or improve existing ones for themselves and/or
>> suggest improvements to Datadog :). I hope I'll do some more of this work
>> at some point in the future.
>>
>> Good luck,
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On Thu, Oct 4, 2018 at 21:21, Carl Mueller
>> <carl.muel...@smartthings.com.invalid> wrote:
>>
>>> for 2.1.x we had a custom reporter that delivered metrics to datadog's
>>> endpoint via https, bypassing the agent-imposed 350-metric limit. But
>>> integrating that required targeting the other shared libs in the cassandra
>>> path, so the build is a bit of a pain when we update major versions.
>>>
>>> We are migrating our 2.1.x specific dashboards, and we will use
>>> agent-delivered metrics for non-table, and adapt the custom library to
>>> deliver the table-based ones, at a slower rate than the "core" ones.
>>>
>>> Datadog is also super annoying because there doesn't appear to be
>>> anything that reports what metrics the agent is sending (the metric count
>>> can indicate if a configured new metric increased the count and is being
>>> reported, but it's still... a guess), and there is re-namespacing of the
>>> metric names in lots of cases, and these don't appear to be centrally
>>> documented, but maybe i haven't found the magic page.
>>>
>>> There are sooooo many good stats. We might also implement some facility
>>> to dynamically turn on the delivery of detailed metrics on the nodes.
>>>
>>> On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>>
>>>> Hello Carl,
>>>>
>>>>> I guess we can use bean_regex to do specific targeted metrics for the
>>>>> important tables anyway.
>>>>>
>>>>
>>>> Yes, this would work, but 350 is very limited for Cassandra dashboards.
>>>> We have a LOT of metrics available.
>>>>
>>>>> Datadog 350 metric limit is a PITA for tables once you get over 10
>>>>> tables
>>>>>
>>>>
>>>> I noticed this while I was working on providing default dashboards for
>>>> Cassandra-Datadog integration. I was told by Datadog team it would not be
>>>> an issue for users, that I should not care about it. As you pointed out,
>>>> per table metrics quickly increase the total number of metrics we need to
>>>> collect.
>>>>
>>>> I believe you can set the following option: *"max_returned_metrics:
>>>> 1000"*. It raises the limit on the number of collected metrics and can
>>>> be used when metrics are missing. Be aware of the CPU utilization this
>>>> might imply (greatly improved in dd-agent version 6+ I believe, thanks
>>>> to the Datadog teams, making this fully usable for Cassandra). Off the
>>>> top of my head, this option goes in the Datadog agent's *cassandra.yaml*
>>>> integration config, not in Cassandra's own cassandra.yaml.
>>>>
>>>> Also, do not hesitate to reach out to Datadog directly with this kind of
>>>> question. I have always been very happy with their support so far, and I
>>>> am sure they would guide you through this as well, probably better than we
>>>> can :). I imagine it also gives them feedback on what people are
>>>> struggling with.
>>>>
>>>> I am interested to know if you still have issues getting more metrics
>>>> (option above not working / CPU under too much load) as this would make the
>>>> dashboards we built mostly unusable for clusters with more tables. We might
>>>> then need to review the design.
>>>>
>>>> As a side note, I believe metrics are handled the same way across
>>>> versions: they get the same name/label for C*2.1, 2.2 and 3+ on Datadog.
>>>> There is an abstraction layer that removes this complexity (if I remember
>>>> correctly, we built those dashboards a while ago).
>>>>
>>>> C*heers
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>> On Mon, Oct 1, 2018 at 19:38, Carl Mueller
>>>> <carl.muel...@smartthings.com.invalid> wrote:
>>>>
>>>>> That's great too, thank you.
>>>>>
>>>>> Datadog 350 metric limit is a PITA for tables once you get over 10
>>>>> tables, but I guess we can use bean_regex to do specific targeted metrics
>>>>> for the important tables anyway.
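
A minimal sketch of the bean_regex idea above, in Python: it builds one regular expression that matches the per-table MBeans of a few chosen tables, so that only those count against the metric limit. The function name, the keyspace and table names, and the assumed MBean name layout (type, keyspace, scope, name, in that order) are illustrative assumptions, not taken from the Datadog documentation; check them against what your own nodes expose before putting a pattern like this into the agent config.

```python
import re


def table_bean_regex(keyspace, tables, metric_type="Table"):
    """Build a regex matching the per-table metric MBeans of selected tables.

    Assumption: MBean names look like
    org.apache.cassandra.metrics:type=Table,keyspace=<ks>,scope=<table>,name=<metric>
    (use metric_type="ColumnFamily" for the older naming).
    """
    # Escape every name so characters like '.' are matched literally.
    alternatives = "|".join(re.escape(t) for t in tables)
    return (
        r"org\.apache\.cassandra\.metrics:type=" + re.escape(metric_type)
        + ",keyspace=" + re.escape(keyspace)
        + ",scope=(" + alternatives + "),name=.*"
    )


if __name__ == "__main__":
    # Hypothetical keyspace and tables, for illustration only.
    pattern = table_bean_regex("shop", ["orders", "users"])
    beans = [
        "org.apache.cassandra.metrics:type=Table,keyspace=shop,scope=orders,name=ReadLatency",
        "org.apache.cassandra.metrics:type=Table,keyspace=shop,scope=audit_log,name=ReadLatency",
    ]
    for bean in beans:
        print(bean, "->", bool(re.fullmatch(pattern, bean)))
```

Keeping the table list explicit makes it easy to see how many per-table beans will be collected, which is the number that matters against the 350 limit.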
>>>>>
>>>>> On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ <arodr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Carl,
>>>>>>
>>>>>> Here is a message I sent to my team a few months ago. I hope this
>>>>>> will be helpful to you and more people around :). It might not be
>>>>>> exhaustive and we were moving from C*2.1 to C*3+ in this case, thus
>>>>>> skipping C*2.2, but C*2.2 is similar to C*3.0 if I remember correctly in
>>>>>> terms of metrics. Here it is for what it's worth:
>>>>>>
>>>>>> Quite a few things changed in the metric reporter between C*2.1 and
>>>>>> C*3.0:
>>>>>> - ColumnFamily --> Table
>>>>>> - XXpercentile --> pXX
>>>>>> - 1MinuteRate --> m1_rate
>>>>>> - the metric name now comes before the KS and table names, plus some
>>>>>> other changes of this kind.
>>>>>> - aggregation / alias node indexes changed because of this (when
>>>>>> using graphite, for example).
>>>>>> - ‘.value’ is no longer appended to the metric name for gauges;
>>>>>> nothing replaces it.
>>>>>>
>>>>>> For example (graphite):
>>>>>>
>>>>>> From
>>>>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
>>>>>> 2, 3), 1, 7, 8, 9)
>>>>>>
>>>>>> to
>>>>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
>>>>>> 2, 3), 1, 8, 9, 10)
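
The renames listed above can be sketched as a small Python helper that rewrites a C*2.1-style per-table graphite path into the C*3.0 layout. This is only an illustration of the listed rules: the function names are made up, and extending the 1MinuteRate rule to other "NMinuteRate" suffixes is an assumption that the message above does not confirm. Paths that do not look like 2.1 per-table metrics are returned unchanged.

```python
import re

PREFIX = "org.apache.cassandra.metrics."


def rename_suffix(part):
    """Apply the suffix renames: XXpercentile -> pXX, 1MinuteRate -> m1_rate."""
    m = re.fullmatch(r"(\d+)percentile", part)
    if m:
        return "p" + m.group(1)
    # Assumption: other N-minute rates follow the same pattern as 1MinuteRate.
    m = re.fullmatch(r"(\d+)MinuteRate", part)
    if m:
        return "m" + m.group(1) + "_rate"
    return part


def convert_table_metric(path):
    """Rewrite ...ColumnFamily.<ks>.<table>.<Metric>[.<suffix>] to the 3.0 layout."""
    head, sep, tail = path.partition(PREFIX + "ColumnFamily.")
    if not sep:
        return path  # not a 2.1 per-table path
    parts = tail.split(".")
    if len(parts) < 3:
        return path  # keyspace, table and metric name are all required
    ks, table, name, *suffix = parts
    # Gauges no longer carry a trailing ".value"; other suffixes are renamed.
    suffix = [rename_suffix(s) for s in suffix if s != "value"]
    # The metric name now comes before the keyspace and table names.
    return head + PREFIX + ".".join(["Table", name, ks, table, *suffix])


if __name__ == "__main__":
    old = ("cassandra.prod.dc1.host1.org.apache.cassandra.metrics."
           "ColumnFamily.ks.tbl.ReadLatency.95percentile")
    print(convert_table_metric(old))
```

Because the metric name moves in front of the keyspace and table, any aliasByNode indexes that pointed at those nodes shift by one, as in the example above.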
>>>>>>
>>>>>> C*heers,
>>>>>> -----------------------
>>>>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>>>>> France / Spain
>>>>>>
>>>>>> The Last Pickle - Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com
>>>>>>
>>>>>> On Fri, Sep 28, 2018 at 20:38, Carl Mueller
>>>>>> <carl.muel...@smartthings.com.invalid> wrote:
>>>>>>
>>>>>>> VERY NICE! Thank you very much
>>>>>>>
>>>>>>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
>>>>>>> lyuben.todo...@instaclustr.com> wrote:
>>>>>>>
>>>>>>>> Nothing as fancy as a matrix, but a list of what jmxterm can see.
>>>>>>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>>>>>>
>>>>>>>> /lyubent
>>>>>>>>
>>>>>>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>>>>>>> <carl.muel...@smartthings.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> It's my understanding that metrics got heavily re-namespaced in
>>>>>>>>> JMX for 2.2 from 2.1
>>>>>>>>>
>>>>>>>>> Did anyone ever make a migration matrix/guide for conversion of
>>>>>>>>> old metrics to new metrics?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
