We do see a graph with rate(counter[1m]). It even looks pretty close to what we see with rate(counter[2m]). We definitely scrape every 60 seconds; I double-checked our config to make sure.
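The two-samples-per-window issue discussed below can be sketched numerically. This is a simplified model of PromQL's (t - range, t] window selection, not actual Prometheus code, and the scrape timestamps are hypothetical (60s apart, offset from the minute boundary like the data later in the thread):

```python
# Simplified model of PromQL range-window selection (not Prometheus source):
# a sample is inside the window for eval time t iff t - range < ts <= t.
def samples_in_window(sample_ts, eval_t, range_s):
    return [ts for ts in sample_ts if eval_t - range_s < ts <= eval_t]

# Hypothetical 60s scrapes, offset from the minute boundary as in the thread.
scrapes = [5.335 + 60 * i for i in range(10)]

for eval_t in (120, 180, 240):  # Grafana-style minute-aligned eval steps
    one_m = samples_in_window(scrapes, eval_t, 60)
    two_m = samples_in_window(scrapes, eval_t, 120)
    print(eval_t, len(one_m), len(two_m))  # [1m] holds 1 sample, [2m] holds 2
```

Every [1m] window catches exactly one sample, so rate() has nothing to work with, while a [2m] window reliably catches two.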
The exact query was `counter[15m]`. Counter is `django_http_responses_total_by_status_total` in reality, with a long list of labels attached to ensure I'm selecting a single time series. I didn't realise Grafana did that, thank you for the advice. I feel like we're drifting away from the original problem a little bit. Can I get you any additional data to make the original problem easier to debug? On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote: > If you are scraping at 1m intervals, then you definitely need > rate(counter[2m]). That's because rate() needs at least two data points to > fall within the range window. I would be surprised if you see any graph at > all with rate(counter[1m]). > > > This is the raw data, as obtained through a request to /api/v1/query > > What is the *exact* query you gave? Hopefully it is a range vector query, > like counter[15m]. A range vector expression sent to the simple query > endpoint gives you the raw data points with their raw timestamps from the > database. > > > and then we configure the minimum value of it to 1m per-graph > > Just in case you haven't realised: to set a minimum value of 1m, you must > set the data source scrape interval (in Grafana) to 15s - since Grafana > clamps the minimum value to 4 x Grafana-configured data source scrape > interval. > > Therefore if you are actually scraping at 1m intervals, and you want the > minimum of $__rate_interval to be 2m, then you must set the Grafana data > source interval to 30s. This is weird, but it is what it is. > https://github.com/grafana/grafana/issues/32169 > > On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote: > >> We do make use of that variable, and then we configure the minimum value >> of it to 1m per-graph. I didn't realise you could configure this >> per-datasource, thanks for pointing that out! 
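Brian's clamping rule above can be written out as a tiny helper. The 4x factor comes from the Grafana issue he links; the function name is mine, not Grafana's:

```python
# Grafana clamps $__rate_interval to at least 4 x the data source's
# configured scrape interval (per grafana/grafana#32169).
def min_rate_interval_seconds(ds_scrape_interval_s):
    return 4 * ds_scrape_interval_s

print(min_rate_interval_seconds(15))  # 60  -> minimum $__rate_interval of 1m
print(min_rate_interval_seconds(30))  # 120 -> minimum of 2m
```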
>> >> We used to scrape at 15s intervals, but we're using AWS's managed >> Prometheus workspaces, and each data point costs money, so we brought it >> down to 1m intervals. >> >> I'm not sure I understand the relationship between scrape interval and >> counter resets, especially considering there doesn't appear to be a counter >> reset in the raw data of the time series in question. >> >> You mentioned "true counter reset"; does Prometheus have some internal >> distinction between types of counter reset? >> >> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote: >> >>> I would recommend using the `$__rate_interval` magic variable in >>> Grafana. Note that Grafana assumes a default interval of 15s in the >>> datasource settings. >>> >>> If your data is mostly at 60s scrape intervals, you can configure this >>> setting in the Grafana datasource settings. >>> >>> If you want to be able to view 1m resolution rates, I recommend >>> decreasing your scrape interval to 15s. This makes sure you have several >>> samples in the rate window, which helps Prometheus better handle true >>> counter resets and lost scrapes. >>> >>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote: >>> >>>> Thanks for the heads up! We've flip-flopped a bit between using 1m or >>>> 2m. 1m seems to work reliably enough to be useful in most situations, but >>>> I'll probably end up going back to 2m after this discussion. >>>> >>>> I don't believe that helps with the reset problem though, right? I >>>> retried the queries using 2m instead of 1m and they still exhibit the same >>>> problem. >>>> >>>> Is there any more data I can get you to help debug the problem? We see >>>> this happen multiple times per day, and it's making it difficult to >>>> monitor >>>> our systems in production. >>>> >>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote: >>>> >>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why, >>>>> though. 
I took your raw values: >>>>> >>>>> 225201 >>>>> 225226 >>>>> 225249 >>>>> 225262 >>>>> 225278 >>>>> 225310 >>>>> 225329 >>>>> 225363 >>>>> 225402 >>>>> 225437 >>>>> 225466 >>>>> 225492 >>>>> 225529 >>>>> 225555 >>>>> 225595 >>>>> >>>>> $ awk '{print $1-225201}' values >>>>> 0 >>>>> 25 >>>>> 48 >>>>> 61 >>>>> 77 >>>>> 109 >>>>> 128 >>>>> 162 >>>>> 201 >>>>> 236 >>>>> 265 >>>>> 291 >>>>> 328 >>>>> 354 >>>>> 394 >>>>> >>>>> I'm not seeing the reset there. >>>>> >>>>> One thing I noticed: your data interval is 60 seconds and you are >>>>> doing a rate(counter[1m]). This is not going to work reliably, because >>>>> you >>>>> are likely to not have two samples in the same step window. This is >>>>> because >>>>> Prometheus uses millisecond timestamps, so you might have timestamps at >>>>> these >>>>> times: >>>>> >>>>> 5.335 >>>>> 65.335 >>>>> 125.335 >>>>> >>>>> If you then do a rate(counter[1m]) at time 120 (Grafana attempts to align >>>>> queries to even minutes for consistency), the only sample you'll get back >>>>> is the one at 65.335. >>>>> >>>>> You need to do rate(counter[2m]) in order to avoid this problem. 
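As a sanity check on the diff above: even the crudest possible rate over the full raw span (ignoring Prometheus's counter-reset handling and extrapolation, so this is a sketch of the idea, not PromQL's actual implementation) lands well under 1/s, nowhere near the 9391/s spike quoted later in the thread:

```python
# Simplified rate over a range window: (last - first) / (last_ts - first_ts).
# Ignores Prometheus's counter-reset detection and extrapolation.
def simple_rate(samples):
    """samples: list of (unix_ts, value) pairs; needs at least two points."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Raw data from the thread: one sample every 60s, value always increasing.
raw = [(1649239253.4, 225201), (1649239313.4, 225226), (1649239373.4, 225249),
       (1649239433.4, 225262), (1649239493.4, 225278), (1649239553.4, 225310),
       (1649239613.4, 225329), (1649239673.4, 225363), (1649239733.4, 225402),
       (1649239793.4, 225437), (1649239853.4, 225466), (1649239913.4, 225492),
       (1649239973.4, 225529), (1649240033.4, 225555), (1649240093.4, 225595)]

print(simple_rate(raw))  # ~0.47 per second over the whole 14-minute span
```

If the stored series really looked like this dump, no per-step rate could reach thousands per second, which suggests the data PromQL evaluated differs from the raw data returned here.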
>>>>> >>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote: >>>>> >>>>>> I just learned about the resets() function and applying it does seem >>>>>> to show that a reset occurred: >>>>>> >>>>>> { >>>>>> "request": { >>>>>> "url": >>>>>> "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60", >>>>>> "method": "GET", >>>>>> "hideFromInspector": false >>>>>> }, >>>>>> "response": { >>>>>> "status": "success", >>>>>> "data": { >>>>>> "resultType": "matrix", >>>>>> "result": [ >>>>>> { >>>>>> "metric": {/* redacted */}, >>>>>> "values": [ >>>>>> [ >>>>>> 1649239200, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239260, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239320, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239380, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239440, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239500, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239560, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239620, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239680, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239740, >>>>>> "1" >>>>>> ], >>>>>> [ >>>>>> 1649239800, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239860, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239920, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239980, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649240040, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649240100, >>>>>> "0" >>>>>> ] >>>>>> ] >>>>>> } >>>>>> ] >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> I don't quite understand how, though. >>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote: >>>>>> >>>>>>> Hi there, >>>>>>> >>>>>>> We're seeing really large spikes when using the `rate()` function on >>>>>>> some of our metrics. I've been able to isolate a single time series >>>>>>> that >>>>>>> displays this problem, which I'm going to call `counter`. I haven't >>>>>>> attached the actual metric labels here, but all of the data you see >>>>>>> here is >>>>>>> from `counter` over the same time period. 
>>>>>>> >>>>>>> This is the raw data, as obtained through a request to /api/v1/query: >>>>>>> >>>>>>> { >>>>>>> "data": { >>>>>>> "result": [ >>>>>>> { >>>>>>> "metric": {/* redacted */}, >>>>>>> "values": [ >>>>>>> [ >>>>>>> 1649239253.4, >>>>>>> "225201" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239313.4, >>>>>>> "225226" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239373.4, >>>>>>> "225249" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239433.4, >>>>>>> "225262" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239493.4, >>>>>>> "225278" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239553.4, >>>>>>> "225310" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239613.4, >>>>>>> "225329" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239673.4, >>>>>>> "225363" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239733.4, >>>>>>> "225402" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239793.4, >>>>>>> "225437" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239853.4, >>>>>>> "225466" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239913.4, >>>>>>> "225492" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239973.4, >>>>>>> "225529" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240033.4, >>>>>>> "225555" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240093.4, >>>>>>> "225595" >>>>>>> ] >>>>>>> ] >>>>>>> } >>>>>>> ], >>>>>>> "resultType": "matrix" >>>>>>> }, >>>>>>> "status": "success" >>>>>>> } >>>>>>> >>>>>>> The next query is taken from the Grafana query inspector, because >>>>>>> for reasons I don't understand I can't get Prometheus to give me any >>>>>>> data >>>>>>> when I issue the same query to /api/v1/query_range. 
The query is the >>>>>>> same >>>>>>> as the above query, but wrapped in a rate([1m]): >>>>>>> >>>>>>> "request": { >>>>>>> "url": >>>>>>> "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60", >>>>>>> "method": "GET", >>>>>>> "hideFromInspector": false >>>>>>> }, >>>>>>> "response": { >>>>>>> "status": "success", >>>>>>> "data": { >>>>>>> "resultType": "matrix", >>>>>>> "result": [ >>>>>>> { >>>>>>> "metric": {/* redacted */}, >>>>>>> "values": [ >>>>>>> [ >>>>>>> 1649239200, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239260, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239320, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239380, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239440, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239500, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239560, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239620, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239680, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239740, >>>>>>> "9391.766666666665" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239800, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239860, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239920, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239980, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240040, >>>>>>> "0.03333333333333333" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240100, >>>>>>> "0" >>>>>>> ] >>>>>>> ] >>>>>>> } >>>>>>> ] >>>>>>> } >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> Given the gradual increase in the underlying counter, I have two >>>>>>> questions: >>>>>>> >>>>>>> 1. How come the rate is 0 for all except 2 datapoints? >>>>>>> 2. How come there is one enormous datapoint in the rate query, that >>>>>>> is seemingly unexplained in the raw data? >>>>>>> >>>>>>> For 2 I've seen in other threads that the explanation is an >>>>>>> unintentional counter reset, caused by scrapes a millisecond apart that >>>>>>> make the counter appear to go down for a single scrape interval. 
I >>>>>>> don't >>>>>>> think I see this in our raw data, though. >>>>>>> >>>>>>> We're using Prometheus version 2.26.0, revision >>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2. >>>>>>> >>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "Prometheus Users" group. >>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>> send an email to [email protected]. >>>>>> To view this discussion on the web visit >>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com >>>>>> .
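For anyone following along: resets() essentially counts decreases between consecutive samples inside the window. A simplified version (a sketch of the semantics, not the Prometheus implementation) applied to the raw values posted above finds none, which is what makes the reported reset, and the accompanying rate spike, so surprising:

```python
# Simplified resets(): count how often a counter value drops between
# consecutive samples (Prometheus treats any decrease as a reset).
def count_resets(values):
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)

raw_values = [225201, 225226, 225249, 225262, 225278, 225310, 225329, 225363,
              225402, 225437, 225466, 225492, 225529, 225555, 225595]

print(count_resets(raw_values))   # 0 -- the dump is monotonically increasing
print(count_resets([10, 4, 12]))  # 1 -- any drop counts as one reset
```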

