Maybe it's time to set up Thanos or Mimir: since they just use S3, they would be a lot cheaper than AWS Prometheus. :-D
Prometheus doesn't explicitly track counter resets yet. It's something we've
thought about with the OpenMetrics support for metric creation timestamps, but
it's not implemented. What I meant is that if something did restart and the
counter went to 0 while you have only 2 samples in a rate(), you might get
funky spikes or a little bit of inaccurate extrapolation.

On Wed, Apr 6, 2022 at 3:07 PM Sam Rose <[email protected]> wrote:

> We do make use of that variable, and we configure its minimum value to 1m
> per-graph. I didn't realise you could configure this per-datasource, thanks
> for pointing that out!
>
> We used to scrape at 15s intervals, but we're using AWS's managed
> Prometheus workspaces, and each data point costs money, so we brought it
> down to 1m intervals.
>
> I'm not sure I understand the relationship between scrape interval and
> counter resets, especially considering there doesn't appear to be a counter
> reset in the raw data of the time series in question.
>
> You mentioned "true counter reset"; does Prometheus have some internal
> distinction between types of counter reset?
>
> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>
>> I would recommend using the `$__rate_interval` magic variable in Grafana.
>> Note that Grafana assumes a default interval of 15s in the datasource
>> settings.
>>
>> If your data is mostly 60s scrape intervals, you can configure this
>> setting in the Grafana datasource settings.
>>
>> If you want to be able to view 1m-resolution rates, I recommend
>> decreasing your scrape interval to 15s. This makes sure you have several
>> samples in the rate window, which helps Prometheus better handle true
>> counter resets and lost scrapes.
>>
>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>
>>> Thanks for the heads up! We've flip-flopped a bit between using 1m or
>>> 2m. 1m seems to work reliably enough to be useful in most situations,
>>> but I'll probably end up going back to 2m after this discussion.
>>>
>>> I don't believe that helps with the reset problem though, right? I
>>> retried the queries using 2m instead of 1m and they still exhibit the
>>> same problem.
>>>
>>> Is there any more data I can get you to help debug the problem? We see
>>> this happen multiple times per day, and it's making it difficult to
>>> monitor our systems in production.
>>>
>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote:
>>>
>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why,
>>>> though. I took your raw values:
>>>>
>>>> 225201
>>>> 225226
>>>> 225249
>>>> 225262
>>>> 225278
>>>> 225310
>>>> 225329
>>>> 225363
>>>> 225402
>>>> 225437
>>>> 225466
>>>> 225492
>>>> 225529
>>>> 225555
>>>> 225595
>>>>
>>>> $ awk '{print $1-225201}' values
>>>> 0
>>>> 25
>>>> 48
>>>> 61
>>>> 77
>>>> 109
>>>> 128
>>>> 162
>>>> 201
>>>> 236
>>>> 265
>>>> 291
>>>> 328
>>>> 354
>>>> 394
>>>>
>>>> I'm not seeing the reset there.
>>>>
>>>> One thing I noticed: your data interval is 60 seconds and you are doing
>>>> a rate(counter[1m]). This is not going to work reliably, because you
>>>> are likely to not have two samples in the same step window. This is
>>>> because Prometheus uses millisecond timestamps, so if you have samples
>>>> at these times:
>>>>
>>>> 5.335
>>>> 65.335
>>>> 125.335
>>>>
>>>> and you then do a rate(counter[1m]) at time 120 (Grafana attempts to
>>>> align queries to even minutes for consistency), the only sample you'll
>>>> get back is the one at 65.335.
>>>>
>>>> You need to do rate(counter[2m]) in order to avoid problems.
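[Editor's note: the window problem described above can be sketched in a few lines of Python. This is a toy model of PromQL range selection, not Prometheus's actual code; `samples_in_window` is a hypothetical helper.]

```python
# Sketch: a rate(counter[W]) evaluated at query time t only considers
# samples whose timestamp falls in the half-open interval (t - W, t].
def samples_in_window(timestamps, t, window):
    return [ts for ts in timestamps if t - window < ts <= t]

# 60s scrape interval with sub-second jitter, as in the example above.
scrapes = [5.335, 65.335, 125.335]

# Grafana aligns the query step to even minutes, so it evaluates at t=120.
print(samples_in_window(scrapes, 120, 60))   # [65.335] -> one sample, no rate
print(samples_in_window(scrapes, 120, 120))  # two samples -> rate can be computed
```

With a 60s window there is only ever one sample in range, so rate() returns nothing for that step; a 120s window reliably captures at least two.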
>>>>
>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>
>>>>> I just learned about the resets() function, and applying it does seem
>>>>> to show that a reset occurred:
>>>>>
>>>>> {
>>>>>   "request": {
>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>     "method": "GET",
>>>>>     "hideFromInspector": false
>>>>>   },
>>>>>   "response": {
>>>>>     "status": "success",
>>>>>     "data": {
>>>>>       "resultType": "matrix",
>>>>>       "result": [
>>>>>         {
>>>>>           "metric": {/* redacted */},
>>>>>           "values": [
>>>>>             [1649239200, "0"],
>>>>>             [1649239260, "0"],
>>>>>             [1649239320, "0"],
>>>>>             [1649239380, "0"],
>>>>>             [1649239440, "0"],
>>>>>             [1649239500, "0"],
>>>>>             [1649239560, "0"],
>>>>>             [1649239620, "0"],
>>>>>             [1649239680, "0"],
>>>>>             [1649239740, "1"],
>>>>>             [1649239800, "0"],
>>>>>             [1649239860, "0"],
>>>>>             [1649239920, "0"],
>>>>>             [1649239980, "0"],
>>>>>             [1649240040, "0"],
>>>>>             [1649240100, "0"]
>>>>>           ]
>>>>>         }
>>>>>       ]
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> I don't quite understand how, though.
>>>>>
>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> We're seeing really large spikes when using the `rate()` function on
>>>>>> some of our metrics. I've been able to isolate a single time series
>>>>>> that displays this problem, which I'm going to call `counter`. I
>>>>>> haven't attached the actual metric labels here, but all of the data
>>>>>> you see here is from `counter` over the same time period.
>>>>>>
>>>>>> This is the raw data, as obtained through a request to /api/v1/query:
>>>>>>
>>>>>> {
>>>>>>   "data": {
>>>>>>     "result": [
>>>>>>       {
>>>>>>         "metric": {/* redacted */},
>>>>>>         "values": [
>>>>>>           [1649239253.4, "225201"],
>>>>>>           [1649239313.4, "225226"],
>>>>>>           [1649239373.4, "225249"],
>>>>>>           [1649239433.4, "225262"],
>>>>>>           [1649239493.4, "225278"],
>>>>>>           [1649239553.4, "225310"],
>>>>>>           [1649239613.4, "225329"],
>>>>>>           [1649239673.4, "225363"],
>>>>>>           [1649239733.4, "225402"],
>>>>>>           [1649239793.4, "225437"],
>>>>>>           [1649239853.4, "225466"],
>>>>>>           [1649239913.4, "225492"],
>>>>>>           [1649239973.4, "225529"],
>>>>>>           [1649240033.4, "225555"],
>>>>>>           [1649240093.4, "225595"]
>>>>>>         ]
>>>>>>       }
>>>>>>     ],
>>>>>>     "resultType": "matrix"
>>>>>>   },
>>>>>>   "status": "success"
>>>>>> }
>>>>>>
>>>>>> The next query is taken from the Grafana query inspector, because for
>>>>>> reasons I don't understand I can't get Prometheus to give me any data
>>>>>> when I issue the same query to /api/v1/query_range. The query is the
>>>>>> same as the above query, but wrapped in a rate([1m]):
>>>>>>
>>>>>> "request": {
>>>>>>   "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>   "method": "GET",
>>>>>>   "hideFromInspector": false
>>>>>> },
>>>>>> "response": {
>>>>>>   "status": "success",
>>>>>>   "data": {
>>>>>>     "resultType": "matrix",
>>>>>>     "result": [
>>>>>>       {
>>>>>>         "metric": {/* redacted */},
>>>>>>         "values": [
>>>>>>           [1649239200, "0"],
>>>>>>           [1649239260, "0"],
>>>>>>           [1649239320, "0"],
>>>>>>           [1649239380, "0"],
>>>>>>           [1649239440, "0"],
>>>>>>           [1649239500, "0"],
>>>>>>           [1649239560, "0"],
>>>>>>           [1649239620, "0"],
>>>>>>           [1649239680, "0"],
>>>>>>           [1649239740, "9391.766666666665"],
>>>>>>           [1649239800, "0"],
>>>>>>           [1649239860, "0"],
>>>>>>           [1649239920, "0"],
>>>>>>           [1649239980, "0"],
>>>>>>           [1649240040, "0.03333333333333333"],
>>>>>>           [1649240100, "0"]
>>>>>>         ]
>>>>>>       }
>>>>>>     ]
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Given the gradual increase in the underlying counter, I have two
>>>>>> questions:
>>>>>>
>>>>>> 1. How come the rate is 0 for all except 2 datapoints?
>>>>>> 2. How come there is one enormous datapoint in the rate query that
>>>>>>    is seemingly unexplained in the raw data?
>>>>>>
>>>>>> For 2, I've seen in other threads that the explanation is an
>>>>>> unintentional counter reset, caused by scrapes a millisecond apart
>>>>>> that make the counter appear to go down for a single scrape interval.
>>>>>> I don't think I see this in our raw data, though.
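[Editor's note: the reset mechanism discussed in this thread can be sketched with a toy version of PromQL's counter handling. This is a simplification, not the real implementation: actual rate() also extrapolates to the edges of the window, and `simple_rate` is a hypothetical helper.]

```python
# Toy model of PromQL counter semantics: any decrease between adjacent
# samples is treated as a counter reset (a restart from 0), so the whole
# of the post-"reset" value is counted as increase.
def simple_rate(samples, window):
    """samples: (timestamp, value) pairs inside the window, oldest first."""
    if len(samples) < 2:
        return None  # fewer than two samples -> no result for this step
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # cur < prev looks like a restart, so all of cur counts as increase
        increase += (cur - prev) if cur >= prev else cur
    return increase / window

# A healthy minute from the raw data above: 39 increments over 60s.
print(simple_rate([(0, 225363), (60, 225402)], 60))  # 0.65

# A single sample dipping by just 1 (e.g. two scrapes landing out of
# order) is read as a reset, so nearly the entire counter value lands in
# the increase and the rate spikes by several orders of magnitude.
print(simple_rate([(0, 225402), (30, 225401), (60, 225437)], 60))
```

This shows how a dip of 1 that never appears in the stored samples you query back (for example, a duplicate scrape later dropped) can still produce a one-step spike on the order of the counter's absolute value.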
>>>>>>
>>>>>> We're using Prometheus version 2.26.0, revision
>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com.

