What version of Prometheus are you running? With Prometheus, rate(counter[1m]) should give you no results at all when you are scraping at 1-minute intervals, unless something has changed very recently (I'm running 2.33.4): rate() needs at least two samples inside the range window, and a 60-second window can hold at most one sample when the samples are spaced 60 seconds apart. So the fact that you see any graph at all is a big red flag.
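To make that window arithmetic concrete, here is a small Python sketch. The helper function and timestamps are illustrative only; the sub-second offsets mirror the 5.335/65.335/125.335 example quoted later in this thread:

def samples_in_window(sample_times, eval_time, window=60.0):
    """Samples falling in the half-open window (eval_time - window, eval_time]."""
    return [t for t in sample_times if eval_time - window < t <= eval_time]

# Samples scraped every 60 seconds, with the sub-second offset a real scraper has:
scrapes = [5.335, 65.335, 125.335, 185.335]

for eval_time in [60, 120, 180]:
    in_window = samples_in_window(scrapes, eval_time)
    # rate() needs at least two samples in the window to compute anything.
    print(eval_time, in_window, "ok" if len(in_window) >= 2 else "no result")

Every evaluation step catches exactly one sample, so rate() over a 1m window has nothing to compute.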
Now, for driving the query API, you should be able to do it like this:

# curl -Ssg 'http://localhost:9090/api/v1/query?query=ifHCInOctets{instance="gw1",ifName="ether1"}[60s]' | python3 -m json.tool
{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {
                    "__name__": "ifHCInOctets",
                    "ifIndex": "16",
                    "ifName": "ether1",
                    "instance": "gw1",
                    "job": "snmp",
                    "module": "mikrotik_secret",
                    "netbox_type": "device"
                },
                "values": [
                    [1649264595.241, "117857843410"],
                    [1649264610.241, "117858063821"],
                    [1649264625.241, "117858075769"]
                ]
            }
        ]
    }
}

There I gave a range vector of 60 seconds and got 3 data points, because I'm scraping at 15-second intervals, so only 3 points fell within the window between (current time - 60s) and (current time).

Sending a query_range will sample the data at intervals. Only an actual range vector query (as shown above) will show you *all* the data points in the time series, wherever they lie. I think you should do this.

My guess - and it's only a guess at the moment - is that multiple points are being received for the same timeseries, and this is what is producing your spike. This could be due to overlapping scrape jobs for the same timeseries, or relabelling removing some distinguishing label, or some HA setup which scrapes the same timeseries multiple times but doesn't add external labels to distinguish the copies.

I do have some evidence for my guess. If you are storing the same data points twice, rate(counter[1m]) will give you a rate of zero most of the time, because most windows will contain two adjacent identical points (whereas with only a single data point, you'd get no rate at all). And you'll get a counter spike if two data points get transposed.
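That range-vector check can be scripted. The sketch below is illustrative only: the server address and the counter{...} selector are placeholders for your own Prometheus and the exact labels of the affected series. It pulls the raw points and flags duplicated timestamps, out-of-order samples, and apparent decreases:

import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090"            # placeholder: your Prometheus address
QUERY = 'counter{some="labels"}[15m]'     # placeholder: the affected series

url = PROM + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    points = [(float(ts), float(v)) for ts, v in series["values"]]
    for (t1, v1), (t2, v2) in zip(points, points[1:]):
        if t2 <= t1:
            print("duplicate or out-of-order timestamp:", t1, t2)
        if v2 < v1:
            print("counter appears to decrease:", (t1, v1), "->", (t2, v2))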
On Wednesday, 6 April 2022 at 14:37:57 UTC+1 [email protected] wrote:

> Here's the query inspector output from Grafana for rate(counter[2m]). It
> makes the answer to question 1 in my original post more clear. You're
> right, the graph for 1m is just plain wrong. We do still see the reset,
> though.
>
> {
>   "request": {
>     "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[2m])&start=1649239200&end=1649240100&step=60",
>     "method": "GET",
>     "hideFromInspector": false
>   },
>   "response": {
>     "status": "success",
>     "data": {
>       "resultType": "matrix",
>       "result": [
>         {
>           "metric": {/* redacted */},
>           "values": [
>             [1649239200, "0.2871886897537781"],
>             [1649239260, "0.3084619260318357"],
>             [1649239320, "0.26591545347572043"],
>             [1649239380, "0.2446422171976628"],
>             [1649239440, "0.13827603580737463"],
>             [1649239500, "0.1701858902244611"],
>             [1649239560, "0.3403717804489222"],
>             [1649239620, "0.20209574464154753"],
>             [1649239680, "0.3616450167269798"],
>             [1649239740, "2397.9404664989347"],
>             [1649239800, "2397.88728340824"],
>             [1649239860, "0.3084619260318357"],
>             [1649239920, "0.27655207161474926"],
>             [1649239980, "0.39355487114406623"],
>             [1649240040, "0.27655207161474926"],
>             [1649240100, "0.43610134370018155"]
>           ]
>         }
>       ]
>     }
>   }
> }
>
> On Wednesday, April 6, 2022 at 2:34:59 PM UTC+1 Sam Rose wrote:
>
>> We do see a graph with rate(counter[1m]). It even looks pretty close to
>> what we see with rate(counter[2m]). We definitely scrape every 60
>> seconds; we double-checked our config to make sure.
>>
>> The exact query was `counter[15m]`. Counter is
>> `django_http_responses_total_by_status_total` in reality, with a long
>> list of labels attached to ensure I'm selecting a single time series.
>>
>> I didn't realise Grafana did that, thank you for the advice.
>>
>> I feel like we're drifting away from the original problem a little bit.
>> Can I get you any additional data to make the original problem easier
>> to debug?
>>
>> On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote:
>>
>>> If you are scraping at 1m intervals, then you definitely need
>>> rate(counter[2m]). That's because rate() needs at least two data
>>> points to fall within the range window. I would be surprised if you
>>> see any graph at all with rate(counter[1m]).
>>>
>>> > This is the raw data, as obtained through a request to /api/v1/query
>>>
>>> What is the *exact* query you gave? Hopefully it is a range vector
>>> query, like counter[15m]. A range vector expression sent to the simple
>>> query endpoint gives you the raw data points with their raw timestamps
>>> from the database.
>>>
>>> > and then we configure the minimum value of it to 1m per-graph
>>>
>>> Just in case you haven't realised: to set a minimum value of 1m, you
>>> must set the data source scrape interval (in Grafana) to 15s, since
>>> Grafana clamps the minimum value to 4 x the Grafana-configured data
>>> source scrape interval.
>>>
>>> Therefore if you are actually scraping at 1m intervals, and you want
>>> the minimum of $__rate_interval to be 2m, then you must set the
>>> Grafana data source interval to 30s. This is weird, but it is what it
>>> is. https://github.com/grafana/grafana/issues/32169
>>>
>>> On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote:
>>>
>>>> We do make use of that variable, and then we configure the minimum
>>>> value of it to 1m per-graph. I didn't realise you could configure
>>>> this per-datasource, thanks for pointing that out!
>>>>
>>>> We used to scrape at 15s intervals, but we're using AWS's managed
>>>> Prometheus workspaces, and each data point costs money, so we brought
>>>> it down to 1m intervals.
>>>>
>>>> I'm not sure I understand the relationship between scrape interval
>>>> and counter resets, especially considering there doesn't appear to be
>>>> a counter reset in the raw data of the time series in question.
>>>>
>>>> You mentioned "true counter reset"; does Prometheus have some
>>>> internal distinction between types of counter reset?
>>>>
>>>> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>>>>
>>>>> I would recommend using the `$__rate_interval` magic variable in
>>>>> Grafana. Note that Grafana assumes a default interval of 15s in the
>>>>> datasource settings.
>>>>>
>>>>> If your data is mostly at 60s scrape intervals, you can configure
>>>>> this setting in the Grafana datasource settings.
>>>>>
>>>>> If you want to be able to view 1m resolution rates, I recommend
>>>>> decreasing your scrape interval to 15s. This makes sure you have
>>>>> several samples in the rate window, which helps Prometheus better
>>>>> handle true counter resets and lost scrapes.
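As an aside, the clamping rule described a couple of messages up reduces to simple arithmetic. The function below models only the "minimum is 4 x the Grafana-configured data source scrape interval" behaviour mentioned in this thread, not every detail of $__rate_interval:

def min_rate_interval(datasource_scrape_interval_s):
    # Smallest value $__rate_interval can take, per the 4x clamp.
    return 4 * datasource_scrape_interval_s

print(min_rate_interval(15))  # 60  -> a 15s data source setting gives a 1m floor
print(min_rate_interval(30))  # 120 -> a 30s data source setting gives the 2m floor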
>>>>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the heads up! We've flip-flopped a bit between using 1m
>>>>>> or 2m. 1m seems to work reliably enough to be useful in most
>>>>>> situations, but I'll probably end up going back to 2m after this
>>>>>> discussion.
>>>>>>
>>>>>> I don't believe that helps with the reset problem though, right? I
>>>>>> retried the queries using 2m instead of 1m and they still exhibit
>>>>>> the same problem.
>>>>>>
>>>>>> Is there any more data I can get you to help debug the problem? We
>>>>>> see this happen multiple times per day, and it's making it
>>>>>> difficult to monitor our systems in production.
>>>>>>
>>>>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected]
>>>>>> wrote:
>>>>>>
>>>>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure
>>>>>>> why, though. I took your raw values:
>>>>>>>
>>>>>>> 225201
>>>>>>> 225226
>>>>>>> 225249
>>>>>>> 225262
>>>>>>> 225278
>>>>>>> 225310
>>>>>>> 225329
>>>>>>> 225363
>>>>>>> 225402
>>>>>>> 225437
>>>>>>> 225466
>>>>>>> 225492
>>>>>>> 225529
>>>>>>> 225555
>>>>>>> 225595
>>>>>>>
>>>>>>> $ awk '{print $1-225201}' values
>>>>>>> 0
>>>>>>> 25
>>>>>>> 48
>>>>>>> 61
>>>>>>> 77
>>>>>>> 109
>>>>>>> 128
>>>>>>> 162
>>>>>>> 201
>>>>>>> 236
>>>>>>> 265
>>>>>>> 291
>>>>>>> 328
>>>>>>> 354
>>>>>>> 394
>>>>>>>
>>>>>>> I'm not seeing the reset there.
>>>>>>>
>>>>>>> One thing I noticed: your data interval is 60 seconds and you are
>>>>>>> doing rate(counter[1m]). This is not going to work reliably,
>>>>>>> because you are likely not to have two samples in the same step
>>>>>>> window. Prometheus uses millisecond timestamps, so if you have
>>>>>>> samples at these times:
>>>>>>>
>>>>>>> 5.335
>>>>>>> 65.335
>>>>>>> 125.335
>>>>>>>
>>>>>>> and you do rate(counter[1m]) at time 120 (Grafana attempts to
>>>>>>> align queries to even minutes for consistency), the only sample
>>>>>>> you'll get back is the one at 65.335.
>>>>>>>
>>>>>>> You need to do rate(counter[2m]) in order to avoid problems.
>>>>>>>
>>>>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>>>>
>>>>>>>> I just learned about the resets() function, and applying it does
>>>>>>>> seem to show that a reset occurred:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "request": {
>>>>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>>     "method": "GET",
>>>>>>>>     "hideFromInspector": false
>>>>>>>>   },
>>>>>>>>   "response": {
>>>>>>>>     "status": "success",
>>>>>>>>     "data": {
>>>>>>>>       "resultType": "matrix",
>>>>>>>>       "result": [
>>>>>>>>         {
>>>>>>>>           "metric": {/* redacted */},
>>>>>>>>           "values": [
>>>>>>>>             [1649239200, "0"],
>>>>>>>>             [1649239260, "0"],
>>>>>>>>             [1649239320, "0"],
>>>>>>>>             [1649239380, "0"],
>>>>>>>>             [1649239440, "0"],
>>>>>>>>             [1649239500, "0"],
>>>>>>>>             [1649239560, "0"],
>>>>>>>>             [1649239620, "0"],
>>>>>>>>             [1649239680, "0"],
>>>>>>>>             [1649239740, "1"],
>>>>>>>>             [1649239800, "0"],
>>>>>>>>             [1649239860, "0"],
>>>>>>>>             [1649239920, "0"],
>>>>>>>>             [1649239980, "0"],
>>>>>>>>             [1649240040, "0"],
>>>>>>>>             [1649240100, "0"]
>>>>>>>>           ]
>>>>>>>>         }
>>>>>>>>       ]
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> I don't quite understand how, though.
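The "how" comes down to Prometheus's reset detection: any decrease between adjacent samples in the window is treated as the counter having restarted from zero. Here is a deliberately simplified Python sketch (real rate() also extrapolates to the window edges, so the exact numbers differ, but the mechanism is the same):

def simple_rate(samples, seconds):
    # samples: counter values in stored timestamp order within the window.
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Apparent reset: assume the counter restarted from zero, so
            # the whole current value counts as new increase.
            increase += cur
    return increase / seconds

# Two identical adjacent samples (a duplicated point) -> rate of zero:
print(simple_rate([225402, 225402], 60))   # 0.0

# The same two raw samples stored in the wrong order (transposed points)
# -> an apparent "reset", and an enormous spike:
print(simple_rate([225402, 225363], 60))   # 3756.05, vs a true rate around 0.5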
>>>>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> We're seeing really large spikes when using the `rate()` function
>>>>>>>>> on some of our metrics. I've been able to isolate a single time
>>>>>>>>> series that displays this problem, which I'm going to call
>>>>>>>>> `counter`. I haven't attached the actual metric labels here, but
>>>>>>>>> all of the data you see here is from `counter` over the same
>>>>>>>>> time period.
>>>>>>>>>
>>>>>>>>> This is the raw data, as obtained through a request to
>>>>>>>>> /api/v1/query:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>   "data": {
>>>>>>>>>     "result": [
>>>>>>>>>       {
>>>>>>>>>         "metric": {/* redacted */},
>>>>>>>>>         "values": [
>>>>>>>>>           [1649239253.4, "225201"],
>>>>>>>>>           [1649239313.4, "225226"],
>>>>>>>>>           [1649239373.4, "225249"],
>>>>>>>>>           [1649239433.4, "225262"],
>>>>>>>>>           [1649239493.4, "225278"],
>>>>>>>>>           [1649239553.4, "225310"],
>>>>>>>>>           [1649239613.4, "225329"],
>>>>>>>>>           [1649239673.4, "225363"],
>>>>>>>>>           [1649239733.4, "225402"],
>>>>>>>>>           [1649239793.4, "225437"],
>>>>>>>>>           [1649239853.4, "225466"],
>>>>>>>>>           [1649239913.4, "225492"],
>>>>>>>>>           [1649239973.4, "225529"],
>>>>>>>>>           [1649240033.4, "225555"],
>>>>>>>>>           [1649240093.4, "225595"]
>>>>>>>>>         ]
>>>>>>>>>       }
>>>>>>>>>     ],
>>>>>>>>>     "resultType": "matrix"
>>>>>>>>>   },
>>>>>>>>>   "status": "success"
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The next query is taken from the Grafana query inspector,
>>>>>>>>> because for reasons I don't understand I can't get Prometheus to
>>>>>>>>> give me any data when I issue the same query to
>>>>>>>>> /api/v1/query_range.
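It may be worth noting here that /api/v1/query_range requires start, end and step parameters alongside the query; if any is missing, Prometheus returns an error rather than data, which is one possible reason the direct query_range attempt came back empty. A sketch mirroring the Grafana request from this thread (the host and the counter name are placeholders):

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "query": "rate(counter[1m])",   # placeholder expression, as in the thread
    "start": "1649239200",
    "end": "1649240100",
    "step": "60",
})
url = "http://localhost:9090/api/v1/query_range?" + params
with urllib.request.urlopen(url) as resp:
    print(json.dumps(json.load(resp), indent=2))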
>>>>>>>>> The query is the same as the one above, but wrapped in
>>>>>>>>> rate(...[1m]):
>>>>>>>>>
>>>>>>>>> "request": {
>>>>>>>>>   "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>>>   "method": "GET",
>>>>>>>>>   "hideFromInspector": false
>>>>>>>>> },
>>>>>>>>> "response": {
>>>>>>>>>   "status": "success",
>>>>>>>>>   "data": {
>>>>>>>>>     "resultType": "matrix",
>>>>>>>>>     "result": [
>>>>>>>>>       {
>>>>>>>>>         "metric": {/* redacted */},
>>>>>>>>>         "values": [
>>>>>>>>>           [1649239200, "0"],
>>>>>>>>>           [1649239260, "0"],
>>>>>>>>>           [1649239320, "0"],
>>>>>>>>>           [1649239380, "0"],
>>>>>>>>>           [1649239440, "0"],
>>>>>>>>>           [1649239500, "0"],
>>>>>>>>>           [1649239560, "0"],
>>>>>>>>>           [1649239620, "0"],
>>>>>>>>>           [1649239680, "0"],
>>>>>>>>>           [1649239740, "9391.766666666665"],
>>>>>>>>>           [1649239800, "0"],
>>>>>>>>>           [1649239860, "0"],
>>>>>>>>>           [1649239920, "0"],
>>>>>>>>>           [1649239980, "0"],
>>>>>>>>>           [1649240040, "0.03333333333333333"],
>>>>>>>>>           [1649240100, "0"]
>>>>>>>>>         ]
>>>>>>>>>       }
>>>>>>>>>     ]
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Given the gradual increase in the underlying counter, I have two
>>>>>>>>> questions:
>>>>>>>>>
>>>>>>>>> 1. How come the rate is 0 for all except 2 datapoints?
>>>>>>>>> 2. How come there is one enormous datapoint in the rate query,
>>>>>>>>>    seemingly unexplained by the raw data?
>>>>>>>>>
>>>>>>>>> For question 2, I've seen in other threads that the explanation
>>>>>>>>> is an unintentional counter reset, caused by scrapes a
>>>>>>>>> millisecond apart that make the counter appear to go down for a
>>>>>>>>> single scrape interval. I don't think I see this in our raw
>>>>>>>>> data, though.
>>>>>>>>>
>>>>>>>>> We're using Prometheus version 2.26.0, revision
>>>>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
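For reference, the successive differences of the raw values posted above bear out that reading: every delta is positive, so no reset is visible in the raw data itself.

# Raw values from the post above and their successive differences.
values = [225201, 225226, 225249, 225262, 225278, 225310, 225329, 225363,
          225402, 225437, 225466, 225492, 225529, 225555, 225595]

deltas = [b - a for a, b in zip(values, values[1:])]
print(deltas)   # [25, 23, 13, 16, 32, 19, 34, 39, 35, 29, 26, 37, 26, 40]
print(all(d > 0 for d in deltas))   # True: strictly increasing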

