If you are scraping at 1m intervals, then you definitely need 
rate(counter[2m]).  That's because rate() needs at least two data points to 
fall within the range window.  I would be surprised if you see any graph at 
all with rate(counter[1m]).
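To illustrate what a 2m window buys you: once two samples fall inside the window, rate() can compute a per-second slope between them. A minimal sketch (timestamps taken from the raw data later in this thread; real rate() also extrapolates to the window edges, which is omitted here):

```python
# Two samples 60s apart, as they would land in a 2m window.
window = [(1649239253.4, 225201.0), (1649239313.4, 225226.0)]
(t1, v1), (t2, v2) = window[0], window[-1]
per_second = (v2 - v1) / (t2 - t1)
print(round(per_second, 4))  # 25 increments over 60s, i.e. ~0.4167/s
```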

> This is the raw data, as obtained through a request to /api/v1/query

What is the *exact* query you gave? Hopefully it is a range vector query, 
like counter[15m].  A range vector expression sent to the simple query 
endpoint gives you the raw data points with their raw timestamps from the 
database.
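For reference, this is how such a matrix response reads; the JSON shape matches the responses quoted later in this thread (values abbreviated, metric labels redacted):

```python
import json

# Trimmed /api/v1/query response for a range vector query like counter[15m].
raw = """
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"__name__": "counter"},
        "values": [[1649239253.4, "225201"], [1649239313.4, "225226"]]
      }
    ]
  }
}
"""
response = json.loads(raw)
for series in response["data"]["result"]:
    for ts, value in series["values"]:
        # Raw sample timestamps and raw (string) values, straight from the TSDB.
        print(ts, value)
```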

> and then we configure the minimum value of it to 1m per-graph

Just in case you haven't realised: to set a minimum value of 1m, you must 
set the data source scrape interval (in Grafana) to 15s - since Grafana 
clamps the minimum value to 4 x Grafana-configured data source scrape 
interval.

Therefore if you are actually scraping at 1m intervals, and you want the 
minimum of $__rate_interval to be 2m, then you must set the Grafana data 
source interval to 30s.  This is weird, but it is what it is.
https://github.com/grafana/grafana/issues/32169
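A sketch of that clamping rule (not Grafana source code; the function name is mine):

```python
# The smallest value $__rate_interval can take is 4 x the scrape interval
# configured on the Grafana data source.
def min_rate_interval_seconds(datasource_scrape_interval_s: int) -> int:
    return 4 * datasource_scrape_interval_s

print(min_rate_interval_seconds(15))  # 60  -> a minimum of 1m
print(min_rate_interval_seconds(30))  # 120 -> a minimum of 2m
```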

On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote:

> We do make use of that variable, and then we configure the minimum value 
> of it to 1m per-graph. I didn't realise you could configure this 
> per-datasource, thanks for pointing that out!
>
> We used to scrape at 15s intervals, but we're using AWS's managed 
> Prometheus workspaces, and each data point costs money, so we brought it 
> down to 1m intervals.
>
> I'm not sure I understand the relationship between scrape interval and 
> counter resets, especially considering there doesn't appear to be a counter 
> reset in the raw data of the time series in question.
>
> You mentioned "true counter reset", does prometheus have some internal 
> distinction between types of counter reset?
>
> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>
>> I would recommend using the `$__rate_interval` magic variable in Grafana. 
>> Note that Grafana assumes a default interval of 15s in the datasource 
>> settings.
>>
>> If your data is mostly 60s scrape intervals, you can configure this 
>> setting in the Grafana datasource settings.
>>
>> If you want to be able to view 1m resolution rates, I recommend 
>> shortening your scrape interval to 15s. This makes sure you have several 
>> samples in the rate window, which helps Prometheus better handle true 
>> counter resets and lost scrapes.
>>
>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>
>>> Thanks for the heads up! We've flip-flopped a bit between using 1m and 
>>> 2m. 1m seems to work reliably enough to be useful in most situations, but 
>>> I'll probably end up going back to 2m after this discussion.
>>>
>>> I don't believe that helps with the reset problem though, right? I 
>>> retried the queries using 2m instead of 1m and they still exhibit the same 
>>> problem.
>>>
>>> Is there any more data I can get you to help debug the problem? We see 
>>> this happen multiple times per day, and it's making it difficult to monitor 
>>> our systems in production.
>>>
>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote:
>>>
>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why, 
>>>> though. I took your raw values:
>>>>
>>>> 225201
>>>> 225226
>>>> 225249
>>>> 225262
>>>> 225278
>>>> 225310
>>>> 225329
>>>> 225363
>>>> 225402
>>>> 225437
>>>> 225466
>>>> 225492
>>>> 225529
>>>> 225555
>>>> 225595
>>>>
>>>> $ awk '{print $1-225201}' values
>>>> 0
>>>> 25
>>>> 48
>>>> 61
>>>> 77
>>>> 109
>>>> 128
>>>> 162
>>>> 201
>>>> 236
>>>> 265
>>>> 291
>>>> 328
>>>> 354
>>>> 394
>>>>
>>>> I'm not seeing the reset there.
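The awk one-liner above can be reproduced in Python, together with the decrease check that resets() performs (it counts any sample that is lower than its predecessor):

```python
# The raw samples quoted above.
values = [225201, 225226, 225249, 225262, 225278, 225310, 225329,
          225363, 225402, 225437, 225466, 225492, 225529, 225555, 225595]

# Same output as: awk '{print $1-225201}' values
deltas = [v - values[0] for v in values]
print(deltas)

# resets() counts decreases between adjacent samples; none appear here.
resets = sum(1 for a, b in zip(values, values[1:]) if b < a)
print(resets)  # 0
```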
>>>>
>>>> One thing I noticed: your data interval is 60 seconds and you are doing 
>>>> a rate(counter[1m]). This is not going to work reliably, because you are 
>>>> unlikely to have two samples in the same range window. This is because 
>>>> Prometheus uses millisecond timestamps, so if you have samples at these 
>>>> times:
>>>>
>>>> 5.335
>>>> 65.335
>>>> 125.335
>>>>
>>>> then, when you do a rate(counter[1m]) at time 120 (Grafana attempts to 
>>>> align queries to even minutes for consistency), the only sample inside 
>>>> the window is the one at 65.335.
>>>>
>>>> You need to do rate(counter[2m]) in order to avoid problems.
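The window arithmetic described above can be sketched as follows, using window membership as rate() selects it and samples on a 60s grid with a sub-second offset:

```python
# Samples every 60s at a sub-second offset: 5.335, 65.335, 125.335, ...
samples = [5.335 + 60 * i for i in range(5)]

def samples_in_window(eval_time, window):
    """Samples in (eval_time - window, eval_time], as rate() selects them."""
    return [t for t in samples if eval_time - window < t <= eval_time]

# Evaluated at an aligned step of 120s:
print(samples_in_window(120, 60))   # one sample (65.335) -> no rate possible
print(samples_in_window(120, 120))  # two samples -> rate can be computed
```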
>>>>
>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>
>>>>> I just learned about the resets() function and applying it does seem 
>>>>> to show that a reset occurred:
>>>>>
>>>>> {
>>>>>   "request": {
>>>>>     "url": 
>>>>> "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>     "method": "GET",
>>>>>     "hideFromInspector": false
>>>>>   },
>>>>>   "response": {
>>>>>     "status": "success",
>>>>>     "data": {
>>>>>       "resultType": "matrix",
>>>>>       "result": [
>>>>>         {
>>>>>           "metric": {/* redacted */},
>>>>>           "values": [
>>>>>             [
>>>>>               1649239200,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239260,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239320,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239380,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239440,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239500,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239560,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239620,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239680,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239740,
>>>>>               "1"
>>>>>             ],
>>>>>             [
>>>>>               1649239800,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239860,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239920,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649239980,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649240040,
>>>>>               "0"
>>>>>             ],
>>>>>             [
>>>>>               1649240100,
>>>>>               "0"
>>>>>             ]
>>>>>           ]
>>>>>         }
>>>>>       ]
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> I don't quite understand how, though.
>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> We're seeing really large spikes when using the `rate()` function on 
>>>>>> some of our metrics. I've been able to isolate a single time series that 
>>>>>> displays this problem, which I'm going to call `counter`. I haven't 
>>>>>> attached the actual metric labels here, but all of the data you see here 
>>>>>> is 
>>>>>> from `counter` over the same time period.
>>>>>>
>>>>>> This is the raw data, as obtained through a request to /api/v1/query:
>>>>>>
>>>>>> {
>>>>>>     "data": {
>>>>>>         "result": [
>>>>>>             {
>>>>>>                 "metric": {/* redacted */},
>>>>>>                 "values": [
>>>>>>                     [
>>>>>>                         1649239253.4,
>>>>>>                         "225201"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239313.4,
>>>>>>                         "225226"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239373.4,
>>>>>>                         "225249"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239433.4,
>>>>>>                         "225262"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239493.4,
>>>>>>                         "225278"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239553.4,
>>>>>>                         "225310"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239613.4,
>>>>>>                         "225329"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239673.4,
>>>>>>                         "225363"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239733.4,
>>>>>>                         "225402"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239793.4,
>>>>>>                         "225437"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239853.4,
>>>>>>                         "225466"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239913.4,
>>>>>>                         "225492"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649239973.4,
>>>>>>                         "225529"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649240033.4,
>>>>>>                         "225555"
>>>>>>                     ],
>>>>>>                     [
>>>>>>                         1649240093.4,
>>>>>>                         "225595"
>>>>>>                     ]
>>>>>>                 ]
>>>>>>             }
>>>>>>         ],
>>>>>>         "resultType": "matrix"
>>>>>>     },
>>>>>>     "status": "success"
>>>>>> }
>>>>>>
>>>>>> The next query is taken from the Grafana query inspector, because for 
>>>>>> reasons I don't understand I can't get Prometheus to give me any data 
>>>>>> when 
>>>>>> I issue the same query to /api/v1/query_range. The query is the same as 
>>>>>> the 
>>>>>> above query, but wrapped in a rate([1m]):
>>>>>>
>>>>>>     "request": {
>>>>>>         "url": 
>>>>>> "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>         "method": "GET",
>>>>>>         "hideFromInspector": false
>>>>>>     },
>>>>>>     "response": {
>>>>>>         "status": "success",
>>>>>>         "data": {
>>>>>>             "resultType": "matrix",
>>>>>>             "result": [
>>>>>>                 {
>>>>>>                     "metric": {/* redacted */},
>>>>>>                     "values": [
>>>>>>                         [
>>>>>>                             1649239200,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239260,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239320,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239380,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239440,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239500,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239560,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239620,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239680,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239740,
>>>>>>                             "9391.766666666665"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239800,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239860,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239920,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649239980,
>>>>>>                             "0"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649240040,
>>>>>>                             "0.03333333333333333"
>>>>>>                         ],
>>>>>>                         [
>>>>>>                             1649240100,
>>>>>>                             "0"
>>>>>>                         ]
>>>>>>                     ]
>>>>>>                 }
>>>>>>             ]
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> Given the gradual increase in the underlying counter, I have two 
>>>>>> questions:
>>>>>>
>>>>>> 1. How come the rate is 0 for all except 2 datapoints?
>>>>>> 2. How come there is one enormous datapoint in the rate query, that 
>>>>>> is seemingly unexplained in the raw data?
>>>>>>
>>>>>> For 2 I've seen in other threads that the explanation is an 
>>>>>> unintentional counter reset, caused by scrapes a millisecond apart that 
>>>>>> make the counter appear to go down for a single scrape interval. I don't 
>>>>>> think I see this in our raw data, though.
>>>>>>
>>>>>> We're using Prometheus version 2.26.0, revision 
>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>

