Here's the query inspector output from Grafana for rate(counter[2m]). It
makes the answer to question 1 in my original post clearer. You're right,
the graph for 1m is just plain wrong. We do still see the reset, though.
{
  "request": {
    "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[2m])&start=1649239200&end=1649240100&step=60",
    "method": "GET",
    "hideFromInspector": false
  },
  "response": {
    "status": "success",
    "data": {
      "resultType": "matrix",
      "result": [
        {
          "metric": {/* redacted */},
          "values": [
            [1649239200, "0.2871886897537781"],
            [1649239260, "0.3084619260318357"],
            [1649239320, "0.26591545347572043"],
            [1649239380, "0.2446422171976628"],
            [1649239440, "0.13827603580737463"],
            [1649239500, "0.1701858902244611"],
            [1649239560, "0.3403717804489222"],
            [1649239620, "0.20209574464154753"],
            [1649239680, "0.3616450167269798"],
            [1649239740, "2397.9404664989347"],
            [1649239800, "2397.88728340824"],
            [1649239860, "0.3084619260318357"],
            [1649239920, "0.27655207161474926"],
            [1649239980, "0.39355487114406623"],
            [1649240040, "0.27655207161474926"],
            [1649240100, "0.43610134370018155"]
          ]
        }
      ]
    }
  }
}
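To make the spike concrete: values like the 2397.94 above are what you get when rate() believes the counter reset inside the window. A rough Python sketch of that reset handling (simplified; real Prometheus also extrapolates to the window edges, and all names here are mine, not Prometheus's):

```python
def simple_increase(samples):
    """Reset-aware increase over a window, roughly as rate()/increase()
    compute it: whenever a sample is lower than its predecessor, PromQL
    assumes the counter reset to zero and adds the previous value back in.
    (Simplified: no extrapolation to the window boundaries.)"""
    total = samples[-1][1] - samples[0][1]
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:  # apparent counter reset
            total += prev
    return total

# Hypothetical window: a duplicate scrape 1ms apart whose value is a hair lower
window = [(0.0, 225363), (60.0, 225402), (60.001, 225401)]
print(simple_increase(window))  # 38 real increase + 225402 phantom = 225440
```

Divide that by the window length and the rate lands in the thousands per second, the same order of magnitude as the spike above: a dip of even a single unit between samples a millisecond apart is enough.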
On Wednesday, April 6, 2022 at 2:34:59 PM UTC+1 Sam Rose wrote:
> We do see a graph with rate(counter[1m]). It even looks pretty close to
> what we see with rate(counter[2m]). We definitely scrape every 60 seconds,
> double checked our config to make sure.
>
> The exact query was `counter[15m]`. Counter is
> `django_http_responses_total_by_status_total` in reality, with a long list
> of labels attached to ensure I'm selecting a single time series.
>
> I didn't realise Grafana did that, thank you for the advice.
>
> I feel like we're drifting away from the original problem a little bit.
> Can I get you any additional data to make the original problem easier to
> debug?
>
> On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote:
>
>> If you are scraping at 1m intervals, then you definitely need
>> rate(counter[2m]). That's because rate() needs at least two data points to
>> fall within the range window. I would be surprised if you see any graph at
>> all with rate(counter[1m]).
>>
>> > This is the raw data, as obtained through a request to /api/v1/query
>>
>> What is the *exact* query you gave? Hopefully it is a range vector query,
>> like counter[15m]. A range vector expression sent to the simple query
>> endpoint gives you the raw data points with their raw timestamps from the
>> database.
>>
>> > and then we configure the minimum value of it to 1m per-graph
>>
>> Just in case you haven't realised: to set a minimum value of 1m, you must
>> set the data source scrape interval (in Grafana) to 15s, since Grafana
>> clamps the minimum value to 4 × the Grafana-configured data source scrape
>> interval.
>>
>> Therefore if you are actually scraping at 1m intervals, and you want the
>> minimum of $__rate_interval to be 2m, then you must set the Grafana data
>> source interval to 30s. This is weird, but it is what it is.
>> https://github.com/grafana/grafana/issues/32169
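The clamping arithmetic described above can be sketched as follows (the helper name is mine, not a Grafana API; the 4× factor is the one discussed in the linked issue):

```python
def min_rate_interval(datasource_interval_s: int, factor: int = 4) -> int:
    """Minimum value Grafana allows for $__rate_interval: factor times the
    data source's configured scrape interval (factor is 4)."""
    return factor * datasource_interval_s

# A 1m minimum requires configuring the data source interval as 15s:
assert min_rate_interval(15) == 60
# A 2m minimum (with real 1m scrapes) requires configuring it as 30s:
assert min_rate_interval(30) == 120
```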
>>
>> On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote:
>>
>>> We do make use of that variable, and then we configure the minimum value
>>> of it to 1m per-graph. I didn't realise you could configure this
>>> per-datasource, thanks for pointing that out!
>>>
>>> We used to scrape at 15s intervals, but we're using AWS's managed
>>> Prometheus workspaces, and each data point costs money, so we brought it
>>> down to 1m intervals.
>>>
>>> I'm not sure I understand the relationship between scrape interval and
>>> counter resets, especially considering there doesn't appear to be a counter
>>> reset in the raw data of the time series in question.
>>>
>>> You mentioned "true counter reset", does prometheus have some internal
>>> distinction between types of counter reset?
>>>
>>> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>>>
>>>> I would recommend using the `$__rate_interval` magic variable in
>>>> Grafana. Note that Grafana assumes a default interval of 15s in the
>>>> datasource settings.
>>>>
>>>> If your data is mostly 60s scrape intervals, you can configure this
>>>> setting in the Grafana datasource settings.
>>>>
>>>> If you want to be able to view 1m resolution rates, I recommend
>>>> reducing your scrape interval to 15s. That ensures you have several
>>>> samples in each rate window, which helps Prometheus better handle true
>>>> counter resets and lost scrapes.
>>>>
>>>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>>>
>>>>> Thanks for the heads up! We've flip-flopped a bit between using 1m or
>>>>> 2m. 1m seems to work reliably enough to be useful in most situations, but
>>>>> I'll probably end up going back to 2m after this discussion.
>>>>>
>>>>> I don't believe that helps with the reset problem though, right? I
>>>>> retried the queries using 2m instead of 1m and they still exhibit the
>>>>> same
>>>>> problem.
>>>>>
>>>>> Is there any more data I can get you to help debug the problem? We see
>>>>> this happen multiple times per day, and it's making it difficult to
>>>>> monitor
>>>>> our systems in production.
>>>>>
>>>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why,
>>>>>> though. I took your raw values:
>>>>>>
>>>>>> 225201
>>>>>> 225226
>>>>>> 225249
>>>>>> 225262
>>>>>> 225278
>>>>>> 225310
>>>>>> 225329
>>>>>> 225363
>>>>>> 225402
>>>>>> 225437
>>>>>> 225466
>>>>>> 225492
>>>>>> 225529
>>>>>> 225555
>>>>>> 225595
>>>>>>
>>>>>> $ awk '{print $1-225201}' values
>>>>>> 0
>>>>>> 25
>>>>>> 48
>>>>>> 61
>>>>>> 77
>>>>>> 109
>>>>>> 128
>>>>>> 162
>>>>>> 201
>>>>>> 236
>>>>>> 265
>>>>>> 291
>>>>>> 328
>>>>>> 354
>>>>>> 394
>>>>>>
>>>>>> I'm not seeing the reset there.
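For anyone following along, the same monotonicity check can be done without awk; a small Python sketch over the raw values quoted above:

```python
# The raw counter samples quoted above
values = [225201, 225226, 225249, 225262, 225278, 225310, 225329,
          225363, 225402, 225437, 225466, 225492, 225529, 225555, 225595]

# A "reset" from PromQL's point of view is any sample whose value is lower
# than its predecessor; list every adjacent pair that goes down.
resets = [(prev, cur) for prev, cur in zip(values, values[1:]) if cur < prev]
print(resets)  # → [] : the raw data is strictly increasing, no reset visible
```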
>>>>>>
>>>>>> One thing I noticed: your data interval is 60 seconds and you are
>>>>>> doing rate(counter[1m]). This is not going to work reliably, because
>>>>>> you are unlikely to have two samples fall within the same step window.
>>>>>> Prometheus uses millisecond timestamps, so if you have samples at
>>>>>> these times:
>>>>>>
>>>>>> 5.335
>>>>>> 65.335
>>>>>> 125.335
>>>>>>
>>>>>> then when you evaluate rate(counter[1m]) at time 120 (Grafana attempts
>>>>>> to align queries to even minutes for consistency), the only sample
>>>>>> inside the 60-second window ending at 120 is the one at 65.335, and a
>>>>>> rate over a single sample returns nothing.
>>>>>>
>>>>>> You need to do rate(counter[2m]) in order to avoid these problems.
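The window membership being described can be sketched like this (assuming the usual trailing-window range-selector semantics; function names are mine):

```python
# Which samples does a rate(counter[1m]) evaluated at t=120 see?
# A range selector picks samples with timestamps inside the trailing window.
samples = [5.335, 65.335, 125.335]  # sub-second scrape times, 60s apart

def in_window(ts: float, t: float = 120.0, rng: float = 60.0) -> bool:
    return t - rng <= ts <= t

selected = [ts for ts in samples if in_window(ts)]
print(selected)  # → [65.335]: one sample only, so rate() has nothing to diff
```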
>>>>>>
>>>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>>>
>>>>>>> I just learned about the resets() function and applying it does seem
>>>>>>> to show that a reset occurred:
>>>>>>>
>>>>>>> {
>>>>>>>   "request": {
>>>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>     "method": "GET",
>>>>>>>     "hideFromInspector": false
>>>>>>>   },
>>>>>>>   "response": {
>>>>>>>     "status": "success",
>>>>>>>     "data": {
>>>>>>>       "resultType": "matrix",
>>>>>>>       "result": [
>>>>>>>         {
>>>>>>>           "metric": {/* redacted */},
>>>>>>>           "values": [
>>>>>>>             [1649239200, "0"],
>>>>>>>             [1649239260, "0"],
>>>>>>>             [1649239320, "0"],
>>>>>>>             [1649239380, "0"],
>>>>>>>             [1649239440, "0"],
>>>>>>>             [1649239500, "0"],
>>>>>>>             [1649239560, "0"],
>>>>>>>             [1649239620, "0"],
>>>>>>>             [1649239680, "0"],
>>>>>>>             [1649239740, "1"],
>>>>>>>             [1649239800, "0"],
>>>>>>>             [1649239860, "0"],
>>>>>>>             [1649239920, "0"],
>>>>>>>             [1649239980, "0"],
>>>>>>>             [1649240040, "0"],
>>>>>>>             [1649240100, "0"]
>>>>>>>           ]
>>>>>>>         }
>>>>>>>       ]
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> I don't quite understand how, though.
>>>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> We're seeing really large spikes when using the `rate()` function
>>>>>>>> on some of our metrics. I've been able to isolate a single time series
>>>>>>>> that
>>>>>>>> displays this problem, which I'm going to call `counter`. I haven't
>>>>>>>> attached the actual metric labels here, but all of the data you see
>>>>>>>> here is
>>>>>>>> from `counter` over the same time period.
>>>>>>>>
>>>>>>>> This is the raw data, as obtained through a request to
>>>>>>>> /api/v1/query:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "data": {
>>>>>>>>     "result": [
>>>>>>>>       {
>>>>>>>>         "metric": {/* redacted */},
>>>>>>>>         "values": [
>>>>>>>>           [1649239253.4, "225201"],
>>>>>>>>           [1649239313.4, "225226"],
>>>>>>>>           [1649239373.4, "225249"],
>>>>>>>>           [1649239433.4, "225262"],
>>>>>>>>           [1649239493.4, "225278"],
>>>>>>>>           [1649239553.4, "225310"],
>>>>>>>>           [1649239613.4, "225329"],
>>>>>>>>           [1649239673.4, "225363"],
>>>>>>>>           [1649239733.4, "225402"],
>>>>>>>>           [1649239793.4, "225437"],
>>>>>>>>           [1649239853.4, "225466"],
>>>>>>>>           [1649239913.4, "225492"],
>>>>>>>>           [1649239973.4, "225529"],
>>>>>>>>           [1649240033.4, "225555"],
>>>>>>>>           [1649240093.4, "225595"]
>>>>>>>>         ]
>>>>>>>>       }
>>>>>>>>     ],
>>>>>>>>     "resultType": "matrix"
>>>>>>>>   },
>>>>>>>>   "status": "success"
>>>>>>>> }
>>>>>>>>
>>>>>>>> The next query is taken from the Grafana query inspector, because
>>>>>>>> for reasons I don't understand I can't get Prometheus to give me any
>>>>>>>> data when I issue the same query to /api/v1/query_range. The query is
>>>>>>>> the same as the one above, but wrapped in rate(counter[1m]):
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "request": {
>>>>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>>     "method": "GET",
>>>>>>>>     "hideFromInspector": false
>>>>>>>>   },
>>>>>>>>   "response": {
>>>>>>>>     "status": "success",
>>>>>>>>     "data": {
>>>>>>>>       "resultType": "matrix",
>>>>>>>>       "result": [
>>>>>>>>         {
>>>>>>>>           "metric": {/* redacted */},
>>>>>>>>           "values": [
>>>>>>>>             [1649239200, "0"],
>>>>>>>>             [1649239260, "0"],
>>>>>>>>             [1649239320, "0"],
>>>>>>>>             [1649239380, "0"],
>>>>>>>>             [1649239440, "0"],
>>>>>>>>             [1649239500, "0"],
>>>>>>>>             [1649239560, "0"],
>>>>>>>>             [1649239620, "0"],
>>>>>>>>             [1649239680, "0"],
>>>>>>>>             [1649239740, "9391.766666666665"],
>>>>>>>>             [1649239800, "0"],
>>>>>>>>             [1649239860, "0"],
>>>>>>>>             [1649239920, "0"],
>>>>>>>>             [1649239980, "0"],
>>>>>>>>             [1649240040, "0.03333333333333333"],
>>>>>>>>             [1649240100, "0"]
>>>>>>>>           ]
>>>>>>>>         }
>>>>>>>>       ]
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Given the gradual increase in the underlying counter, I have two
>>>>>>>> questions:
>>>>>>>>
>>>>>>>> 1. Why is the rate 0 for all but two data points?
>>>>>>>> 2. Why is there one enormous data point in the rate query that is
>>>>>>>> seemingly unexplained by the raw data?
>>>>>>>>
>>>>>>>> For question 2, I've seen in other threads that the explanation is an
>>>>>>>> unintentional counter reset, caused by scrapes a millisecond apart
>>>>>>>> that make the counter appear to go down for a single scrape interval.
>>>>>>>> I don't think I see this in our raw data, though.
>>>>>>>>
>>>>>>>> We're using Prometheus version 2.26.0, revision
>>>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Prometheus Users" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com
>>>>>>>
>>>>>>> <https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>