We do see a graph with rate(counter[1m]). It even looks pretty close to what we see with rate(counter[2m]). We definitely scrape every 60 seconds; I double-checked our config to make sure.
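The two-samples-per-window issue discussed below can be sketched numerically. This is a simplified model of PromQL's (t - range, t] window selection, not actual Prometheus code, and the scrape timestamps are hypothetical (60s apart, offset from the minute boundary like the data later in the thread):

```python
# Simplified model of PromQL range-window selection (not Prometheus source):
# a sample is inside the window for eval time t iff t - range < ts <= t.
def samples_in_window(sample_ts, eval_t, range_s):
    return [ts for ts in sample_ts if eval_t - range_s < ts <= eval_t]

# Hypothetical 60s scrapes, offset from the minute boundary as in the thread.
scrapes = [5.335 + 60 * i for i in range(10)]

for eval_t in (120, 180, 240):  # Grafana-style minute-aligned eval steps
    one_m = samples_in_window(scrapes, eval_t, 60)
    two_m = samples_in_window(scrapes, eval_t, 120)
    print(eval_t, len(one_m), len(two_m))  # [1m] holds 1 sample, [2m] holds 2
```

Every [1m] window catches exactly one sample, so rate() has nothing to work with, while a [2m] window reliably catches two.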
The exact query was `counter[15m]`. Counter is `django_http_responses_total_by_status_total` in reality, with a long list of labels attached to ensure I'm selecting a single time series. I didn't realise Grafana did that, thank you for the advice. I feel like we're drifting away from the original problem a little bit. Can I get you any additional data to make the original problem easier to debug? On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote: > If you are scraping at 1m intervals, then you definitely need > rate(counter[2m]). That's because rate() needs at least two data points to > fall within the range window. I would be surprised if you see any graph at > all with rate(counter[1m]). > > > This is the raw data, as obtained through a request to /api/v1/query > > What is the *exact* query you gave? Hopefully it is a range vector query, > like counter[15m]. A range vector expression sent to the simple query > endpoint gives you the raw data points with their raw timestamps from the > database. > > > and then we configure the minimum value of it to 1m per-graph > > Just in case you haven't realised: to set a minimum value of 1m, you must > set the data source scrape interval (in Grafana) to 15s - since Grafana > clamps the minimum value to 4 x Grafana-configured data source scrape > interval. > > Therefore if you are actually scraping at 1m intervals, and you want the > minimum of $__rate_interval to be 2m, then you must set the Grafana data > source interval to 30s. This is weird, but it is what it is. > https://github.com/grafana/grafana/issues/32169 > > On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote: > >> We do make use of that variable, and then we configure the minimum value >> of it to 1m per-graph. I didn't realise you could configure this >> per-datasource, thanks for pointing that out! 
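Brian's clamping rule above can be written out as a tiny helper. The 4x factor comes from the Grafana issue he links; the function name is mine, not Grafana's:

```python
# Grafana clamps $__rate_interval to at least 4 x the data source's
# configured scrape interval (per grafana/grafana#32169).
def min_rate_interval_seconds(ds_scrape_interval_s):
    return 4 * ds_scrape_interval_s

print(min_rate_interval_seconds(15))  # 60  -> minimum $__rate_interval of 1m
print(min_rate_interval_seconds(30))  # 120 -> minimum of 2m
```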
>> >> We used to scrape at 15s intervals, but we're using AWS's managed >> Prometheus workspaces, and each data point costs money, so we brought it >> down to 1m intervals. >> >> I'm not sure I understand the relationship between scrape interval and >> counter resets, especially considering there doesn't appear to be a counter >> reset in the raw data of the time series in question. >> >> You mentioned "true counter reset"; does Prometheus have some internal >> distinction between types of counter reset? >> >> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote: >> >>> I would recommend using the `$__rate_interval` magic variable in >>> Grafana. Note that Grafana assumes a default interval of 15s in the >>> datasource settings. >>> >>> If your data is mostly at 60s scrape intervals, you can configure this >>> setting in the Grafana datasource settings. >>> >>> If you want to be able to view 1m resolution rates, I recommend >>> decreasing your scrape interval to 15s. This makes sure you have several >>> samples in the rate window, which helps Prometheus better handle true >>> counter resets and lost scrapes. >>> >>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote: >>> >>>> Thanks for the heads up! We've flip-flopped a bit between using 1m or >>>> 2m. 1m seems to work reliably enough to be useful in most situations, but >>>> I'll probably end up going back to 2m after this discussion. >>>> >>>> I don't believe that helps with the reset problem though, right? I >>>> retried the queries using 2m instead of 1m and they still exhibit the same >>>> problem. >>>> >>>> Is there any more data I can get you to help debug the problem? We see >>>> this happen multiple times per day, and it's making it difficult to >>>> monitor >>>> our systems in production. >>>> >>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote: >>>> >>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why, >>>>> though. 
I took your raw values: >>>>> >>>>> 225201 >>>>> 225226 >>>>> 225249 >>>>> 225262 >>>>> 225278 >>>>> 225310 >>>>> 225329 >>>>> 225363 >>>>> 225402 >>>>> 225437 >>>>> 225466 >>>>> 225492 >>>>> 225529 >>>>> 225555 >>>>> 225595 >>>>> >>>>> $ awk '{print $1-225201}' values >>>>> 0 >>>>> 25 >>>>> 48 >>>>> 61 >>>>> 77 >>>>> 109 >>>>> 128 >>>>> 162 >>>>> 201 >>>>> 236 >>>>> 265 >>>>> 291 >>>>> 328 >>>>> 354 >>>>> 394 >>>>> >>>>> I'm not seeing the reset there. >>>>> >>>>> One thing I noticed: your data interval is 60 seconds and you are >>>>> doing a rate(counter[1m]). This is not going to work reliably, because >>>>> you >>>>> are likely to not have two samples in the same step window. This is >>>>> because >>>>> Prometheus uses millisecond timestamps, so you might have timestamps at >>>>> these >>>>> times: >>>>> >>>>> 5.335 >>>>> 65.335 >>>>> 125.335 >>>>> >>>>> If you then do a rate(counter[1m]) at time 120 (Grafana attempts to align >>>>> queries to even minutes for consistency), the only sample you'll get back >>>>> is the one at 65.335. >>>>> >>>>> You need to do rate(counter[2m]) in order to avoid this problem. 
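As a sanity check on the diff above: even the crudest possible rate over the full raw span (ignoring Prometheus's counter-reset handling and extrapolation, so this is a sketch of the idea, not PromQL's actual implementation) lands well under 1/s, nowhere near the 9391/s spike quoted later in the thread:

```python
# Simplified rate over a range window: (last - first) / (last_ts - first_ts).
# Ignores Prometheus's counter-reset detection and extrapolation.
def simple_rate(samples):
    """samples: list of (unix_ts, value) pairs; needs at least two points."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Raw data from the thread: one sample every 60s, value always increasing.
raw = [(1649239253.4, 225201), (1649239313.4, 225226), (1649239373.4, 225249),
       (1649239433.4, 225262), (1649239493.4, 225278), (1649239553.4, 225310),
       (1649239613.4, 225329), (1649239673.4, 225363), (1649239733.4, 225402),
       (1649239793.4, 225437), (1649239853.4, 225466), (1649239913.4, 225492),
       (1649239973.4, 225529), (1649240033.4, 225555), (1649240093.4, 225595)]

print(simple_rate(raw))  # ~0.47 per second over the whole 14-minute span
```

If the stored series really looked like this dump, no per-step rate could reach thousands per second, which suggests the data PromQL evaluated differs from the raw data returned here.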
>>>>> >>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote: >>>>> >>>>>> I just learned about the resets() function and applying it does seem >>>>>> to show that a reset occurred: >>>>>> >>>>>> { >>>>>> "request": { >>>>>> "url": >>>>>> "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60", >>>>>> "method": "GET", >>>>>> "hideFromInspector": false >>>>>> }, >>>>>> "response": { >>>>>> "status": "success", >>>>>> "data": { >>>>>> "resultType": "matrix", >>>>>> "result": [ >>>>>> { >>>>>> "metric": {/* redacted */}, >>>>>> "values": [ >>>>>> [ >>>>>> 1649239200, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239260, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239320, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239380, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239440, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239500, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239560, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239620, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239680, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239740, >>>>>> "1" >>>>>> ], >>>>>> [ >>>>>> 1649239800, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239860, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239920, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649239980, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649240040, >>>>>> "0" >>>>>> ], >>>>>> [ >>>>>> 1649240100, >>>>>> "0" >>>>>> ] >>>>>> ] >>>>>> } >>>>>> ] >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> I don't quite understand how, though. >>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote: >>>>>> >>>>>>> Hi there, >>>>>>> >>>>>>> We're seeing really large spikes when using the `rate()` function on >>>>>>> some of our metrics. I've been able to isolate a single time series >>>>>>> that >>>>>>> displays this problem, which I'm going to call `counter`. I haven't >>>>>>> attached the actual metric labels here, but all of the data you see >>>>>>> here is >>>>>>> from `counter` over the same time period. 
>>>>>>> >>>>>>> This is the raw data, as obtained through a request to /api/v1/query: >>>>>>> >>>>>>> { >>>>>>> "data": { >>>>>>> "result": [ >>>>>>> { >>>>>>> "metric": {/* redacted */}, >>>>>>> "values": [ >>>>>>> [ >>>>>>> 1649239253.4, >>>>>>> "225201" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239313.4, >>>>>>> "225226" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239373.4, >>>>>>> "225249" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239433.4, >>>>>>> "225262" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239493.4, >>>>>>> "225278" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239553.4, >>>>>>> "225310" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239613.4, >>>>>>> "225329" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239673.4, >>>>>>> "225363" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239733.4, >>>>>>> "225402" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239793.4, >>>>>>> "225437" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239853.4, >>>>>>> "225466" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239913.4, >>>>>>> "225492" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239973.4, >>>>>>> "225529" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240033.4, >>>>>>> "225555" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240093.4, >>>>>>> "225595" >>>>>>> ] >>>>>>> ] >>>>>>> } >>>>>>> ], >>>>>>> "resultType": "matrix" >>>>>>> }, >>>>>>> "status": "success" >>>>>>> } >>>>>>> >>>>>>> The next query is taken from the Grafana query inspector, because >>>>>>> for reasons I don't understand I can't get Prometheus to give me any >>>>>>> data >>>>>>> when I issue the same query to /api/v1/query_range. 
The query is the >>>>>>> same >>>>>>> as the above query, but wrapped in a rate([1m]): >>>>>>> >>>>>>> "request": { >>>>>>> "url": >>>>>>> "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60", >>>>>>> "method": "GET", >>>>>>> "hideFromInspector": false >>>>>>> }, >>>>>>> "response": { >>>>>>> "status": "success", >>>>>>> "data": { >>>>>>> "resultType": "matrix", >>>>>>> "result": [ >>>>>>> { >>>>>>> "metric": {/* redacted */}, >>>>>>> "values": [ >>>>>>> [ >>>>>>> 1649239200, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239260, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239320, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239380, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239440, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239500, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239560, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239620, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239680, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239740, >>>>>>> "9391.766666666665" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239800, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239860, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239920, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649239980, >>>>>>> "0" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240040, >>>>>>> "0.03333333333333333" >>>>>>> ], >>>>>>> [ >>>>>>> 1649240100, >>>>>>> "0" >>>>>>> ] >>>>>>> ] >>>>>>> } >>>>>>> ] >>>>>>> } >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> Given the gradual increase in the underlying counter, I have two >>>>>>> questions: >>>>>>> >>>>>>> 1. How come the rate is 0 for all except 2 datapoints? >>>>>>> 2. How come there is one enormous datapoint in the rate query, that >>>>>>> is seemingly unexplained in the raw data? >>>>>>> >>>>>>> For 2 I've seen in other threads that the explanation is an >>>>>>> unintentional counter reset, caused by scrapes a millisecond apart that >>>>>>> make the counter appear to go down for a single scrape interval. 
I >>>>>>> don't >>>>>>> think I see this in our raw data, though. >>>>>>> >>>>>>> We're using Prometheus version 2.26.0, revision >>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2. >>>>>>> >>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "Prometheus Users" group. >>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>> send an email to [email protected]. >>>>>> To view this discussion on the web visit >>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com >>>>>> .
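For anyone following along: resets() essentially counts decreases between consecutive samples inside the window. A simplified version (a sketch of the semantics, not the Prometheus implementation) applied to the raw values posted above finds none, which is what makes the reported reset, and the accompanying rate spike, so surprising:

```python
# Simplified resets(): count how often a counter value drops between
# consecutive samples (Prometheus treats any decrease as a reset).
def count_resets(values):
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)

raw_values = [225201, 225226, 225249, 225262, 225278, 225310, 225329, 225363,
              225402, 225437, 225466, 225492, 225529, 225555, 225595]

print(count_resets(raw_values))   # 0 -- the dump is monotonically increasing
print(count_resets([10, 4, 12]))  # 1 -- any drop counts as one reset
```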

