Hey Brian,
In the original post I included the raw time series output, gathered the way
you suggest. I'll copy it again below:
{
"data": {
"result": [
{
"metric": {/* redacted */},
"values": [
[
1649239253.4,
"225201"
],
[
1649239313.4,
"225226"
],
[
1649239373.4,
"225249"
],
[
1649239433.4,
"225262"
],
[
1649239493.4,
"225278"
],
[
1649239553.4,
"225310"
],
[
1649239613.4,
"225329"
],
[
1649239673.4,
"225363"
],
[
1649239733.4,
"225402"
],
[
1649239793.4,
"225437"
],
[
1649239853.4,
"225466"
],
[
1649239913.4,
"225492"
],
[
1649239973.4,
"225529"
],
[
1649240033.4,
"225555"
],
[
1649240093.4,
"225595"
]
]
}
],
"resultType": "matrix"
},
"status": "success"
}
The query was of the form `counter[15m]` at a given time. I don't see
duplicate scrape data in there.
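A quick Python version of the delta check on those raw values (a negative difference would indicate a counter reset):

```python
# Raw counter samples from the response above
values = [225201, 225226, 225249, 225262, 225278, 225310, 225329, 225363,
          225402, 225437, 225466, 225492, 225529, 225555, 225595]

# Successive differences; a negative delta would be a counter reset
deltas = [b - a for a, b in zip(values, values[1:])]
print(deltas)
print(all(d >= 0 for d in deltas))  # True: no reset visible in the raw data
```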
The Prometheus version is 2.26.0, revision
3cafc58827d1ebd1a67749f88be4218f0bab3d8d, Go version go1.16.2.
On Wednesday, April 6, 2022 at 6:13:10 PM UTC+1 Brian Candler wrote:
> What version of prometheus are you running?
>
> With prometheus, rate(counter[1m]) should give you no results at all when
> you are scraping at 1 minute intervals - unless something has changed very
> recently (I'm running 2.33.4). So this is a big red flag.
>
> Now, for driving the query API, you should be able to do it like this:
>
> # curl -Ssg 'http://localhost:9090/api/v1/query?query=ifHCInOctets{instance="gw1",ifName="ether1"}[60s]' | python3 -m json.tool
>
> {
> "status": "success",
> "data": {
> "resultType": "matrix",
> "result": [
> {
> "metric": {
> "__name__": "ifHCInOctets",
> "ifIndex": "16",
> "ifName": "ether1",
> "instance": "gw1",
> "job": "snmp",
> "module": "mikrotik_secret",
> "netbox_type": "device"
> },
> "values": [
> [
> 1649264595.241,
> "117857843410"
> ],
> [
> 1649264610.241,
> "117858063821"
> ],
> [
> 1649264625.241,
> "117858075769"
> ]
> ]
> }
> ]
> }
> }
>
> There I gave a range vector of 60 seconds, and I got 3 data points because
> I'm scraping at 15 second intervals, so only 3 points fell within the time
> window between (current time - 60s) and (current time).
>
> Sending a query_range will sample the data at intervals. Only an actual
> range vector query (as shown above) will show you *all* the data points in
> the time series, wherever they lie.
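Roughly, query_range's sampling behaviour can be sketched like this (a hypothetical, simplified helper; real Prometheus applies a 5-minute staleness lookback by default and more subtle semantics):

```python
def query_range_eval(samples, start, end, step, lookback=300):
    """Simplified query_range: at each step it evaluates an instant vector,
    i.e. picks the most recent sample within the lookback window. Raw points
    between steps are skipped; the same point can be reused at several steps."""
    out = []
    t = start
    while t <= end:
        in_lookback = [(ts, v) for ts, v in samples if t - lookback < ts <= t]
        if in_lookback:
            out.append((t, in_lookback[-1][1]))  # most recent sample wins
        t += step
    return out

# 15s scrapes queried at a 60s step: only 1 in 4 raw points shows up
samples = [(s, s) for s in range(0, 300, 15)]
print(query_range_eval(samples, 0, 240, 60))
```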
>
> I think you should do this. My guess - and it's only a guess at the
> moment - is that there are multiple points being received for the same
> timeseries, and this is giving your spike. This could be due to
> overlapping scrape jobs for the same timeseries, or relabelling removing
> some distinguishing label, or some HA setup which is scraping the same
> timeseries multiple times but not adding external labels to distinguish
> them.
>
> I do have some evidence for my guess. If you are storing the same data
> points twice, this will give you a rate of zero most of the time when doing
> rate[1m], because there are two adjacent identical points most of the time
> (whereas if there were only a single data point, you'd get no rate at all).
> And you'll get a counter spike if two data points get transposed.
>
> On Wednesday, 6 April 2022 at 14:37:57 UTC+1 [email protected] wrote:
>
>> Here's the query inspector output from Grafana for rate(counter[2m]). It
>> makes the answer to question 1 in my original post more clear. You're
>> right, the graph for 1m is just plain wrong. We do still see the reset,
>> though.
>>
>> {
>> "request": {
>> "url":
>> "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[2m])&start=1649239200&end=1649240100&step=60",
>>
>> "method": "GET",
>> "hideFromInspector": false
>> },
>> "response": {
>> "status": "success",
>> "data": {
>> "resultType": "matrix",
>> "result": [
>> {
>> "metric": {/* redacted */},
>> "values": [
>> [
>> 1649239200,
>> "0.2871886897537781"
>> ],
>> [
>> 1649239260,
>> "0.3084619260318357"
>> ],
>> [
>> 1649239320,
>> "0.26591545347572043"
>> ],
>> [
>> 1649239380,
>> "0.2446422171976628"
>> ],
>> [
>> 1649239440,
>> "0.13827603580737463"
>> ],
>> [
>> 1649239500,
>> "0.1701858902244611"
>> ],
>> [
>> 1649239560,
>> "0.3403717804489222"
>> ],
>> [
>> 1649239620,
>> "0.20209574464154753"
>> ],
>> [
>> 1649239680,
>> "0.3616450167269798"
>> ],
>> [
>> 1649239740,
>> "2397.9404664989347"
>> ],
>> [
>> 1649239800,
>> "2397.88728340824"
>> ],
>> [
>> 1649239860,
>> "0.3084619260318357"
>> ],
>> [
>> 1649239920,
>> "0.27655207161474926"
>> ],
>> [
>> 1649239980,
>> "0.39355487114406623"
>> ],
>> [
>> 1649240040,
>> "0.27655207161474926"
>> ],
>> [
>> 1649240100,
>> "0.43610134370018155"
>> ]
>> ]
>> }
>> ]
>> }
>> }
>> }
>>
>> On Wednesday, April 6, 2022 at 2:34:59 PM UTC+1 Sam Rose wrote:
>>
>>> We do see a graph with rate(counter[1m]). It even looks pretty close to
>>> what we see with rate(counter[2m]). We definitely scrape every 60 seconds,
>>> double checked our config to make sure.
>>>
>>> The exact query was `counter[15m]`. Counter is
>>> `django_http_responses_total_by_status_total` in reality, with a long list
>>> of labels attached to ensure I'm selecting a single time series.
>>>
>>> I didn't realise Grafana did that, thank you for the advice.
>>>
>>> I feel like we're drifting away from the original problem a little bit.
>>> Can I get you any additional data to make the original problem easier to
>>> debug?
>>>
>>> On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote:
>>>
>>>> If you are scraping at 1m intervals, then you definitely need
>>>> rate(counter[2m]). That's because rate() needs at least two data points
>>>> to fall within the range window. I would be surprised if you see any
>>>> graph at all with rate(counter[1m]).
>>>>
>>>> > This is the raw data, as obtained through a request to /api/v1/query
>>>>
>>>> What is the *exact* query you gave? Hopefully it is a range vector
>>>> query, like counter[15m]. A range vector expression sent to the simple
>>>> query endpoint gives you the raw data points with their raw timestamps
>>>> from
>>>> the database.
>>>>
>>>> > and then we configure the minimum value of it to 1m per-graph
>>>>
>>>> Just in case you haven't realised: to set a minimum value of 1m, you
>>>> must set the data source scrape interval (in Grafana) to 15s - since
>>>> Grafana clamps the minimum value to 4 x Grafana-configured data source
>>>> scrape interval.
>>>>
>>>> Therefore if you are actually scraping at 1m intervals, and you want
>>>> the minimum of $__rate_interval to be 2m, then you must set the Grafana
>>>> data source interval to 30s. This is weird, but it is what it is.
>>>> https://github.com/grafana/grafana/issues/32169
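Spelled out as arithmetic (a sketch of the clamping behaviour described above, not Grafana's exact formula):

```python
def min_rate_interval(grafana_scrape_interval_s):
    # Grafana clamps $__rate_interval to at least 4x the scrape interval
    # configured on the data source (which may differ from the interval
    # Prometheus actually scrapes at)
    return 4 * grafana_scrape_interval_s

print(min_rate_interval(15))  # 60  -> minimum $__rate_interval of 1m
print(min_rate_interval(30))  # 120 -> minimum $__rate_interval of 2m
```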
>>>>
>>>> On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote:
>>>>
>>>>> We do make use of that variable, and then we configure the minimum
>>>>> value of it to 1m per-graph. I didn't realise you could configure this
>>>>> per-datasource, thanks for pointing that out!
>>>>>
>>>>> We used to scrape at 15s intervals, but we're using AWS's managed
>>>>> Prometheus workspaces, and each data point costs money, so we brought it
>>>>> down to 1m intervals.
>>>>>
>>>>> I'm not sure I understand the relationship between scrape interval and
>>>>> counter resets, especially considering there doesn't appear to be a
>>>>> counter reset in the raw data of the time series in question.
>>>>>
>>>>> You mentioned "true counter reset"; does Prometheus have some internal
>>>>> distinction between types of counter reset?
>>>>>
>>>>> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected]
>>>>> wrote:
>>>>>
>>>>>> I would recommend using the `$__rate_interval` magic variable in
>>>>>> Grafana. Note that Grafana assumes a default interval of 15s in the
>>>>>> datasource settings.
>>>>>>
>>>>>> If your data is mostly 60s scrape intervals, you can configure this
>>>>>> setting in the Grafana datasource settings.
>>>>>>
>>>>>> If you want to be able to view 1m-resolution rates, I recommend
>>>>>> reducing your scrape interval to 15s. This makes sure you have several
>>>>>> samples in the rate window, which helps Prometheus better handle true
>>>>>> counter resets and lost scrapes.
>>>>>>
>>>>>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the heads up! We've flip flopped a bit between using 1m
>>>>>>> or 2m. 1m seems to work reliably enough to be useful in most
>>>>>>> situations,
>>>>>>> but I'll probably end up going back to 2m after this discussion.
>>>>>>>
>>>>>>> I don't believe that helps with the reset problem though, right? I
>>>>>>> retried the queries using 2m instead of 1m and they still exhibit the
>>>>>>> same
>>>>>>> problem.
>>>>>>>
>>>>>>> Is there any more data I can get you to help debug the problem? We
>>>>>>> see this happen multiple times per day, and it's making it difficult to
>>>>>>> monitor our systems in production.
>>>>>>>
>>>>>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected]
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure
>>>>>>>> why, though. I took your raw values:
>>>>>>>>
>>>>>>>> 225201
>>>>>>>> 225226
>>>>>>>> 225249
>>>>>>>> 225262
>>>>>>>> 225278
>>>>>>>> 225310
>>>>>>>> 225329
>>>>>>>> 225363
>>>>>>>> 225402
>>>>>>>> 225437
>>>>>>>> 225466
>>>>>>>> 225492
>>>>>>>> 225529
>>>>>>>> 225555
>>>>>>>> 225595
>>>>>>>>
>>>>>>>> $ awk '{print $1-225201}' values
>>>>>>>> 0
>>>>>>>> 25
>>>>>>>> 48
>>>>>>>> 61
>>>>>>>> 77
>>>>>>>> 109
>>>>>>>> 128
>>>>>>>> 162
>>>>>>>> 201
>>>>>>>> 236
>>>>>>>> 265
>>>>>>>> 291
>>>>>>>> 328
>>>>>>>> 354
>>>>>>>> 394
>>>>>>>>
>>>>>>>> I'm not seeing the reset there.
>>>>>>>>
>>>>>>>> One thing I noticed: your data interval is 60 seconds and you are
>>>>>>>> doing a rate(counter[1m]). This is not going to work reliably,
>>>>>>>> because you are likely to not have two samples in the same step
>>>>>>>> window. This is because Prometheus uses millisecond timestamps, so
>>>>>>>> if you have samples at these times:
>>>>>>>>
>>>>>>>> 5.335
>>>>>>>> 65.335
>>>>>>>> 125.335
>>>>>>>>
>>>>>>>> If you then do a rate(counter[1m]) at time 120 (Grafana attempts to
>>>>>>>> align queries to even minutes for consistency), the only sample
>>>>>>>> you'll get back is 65.335, and a single sample can't produce a rate.
>>>>>>>>
>>>>>>>> You need to do rate(counter[2m]) in order to avoid problems.
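The window arithmetic above can be sketched like this (boundary handling simplified):

```python
samples = [5.335, 65.335, 125.335]  # scrape timestamps, 60s apart

def in_window(ts, end, range_s):
    # A range selector picks samples within the last range_s seconds up to
    # the evaluation time `end` (exact boundary inclusion simplified here)
    return [t for t in ts if end - range_s < t <= end]

print(in_window(samples, 120, 60))   # [65.335] -> one sample, rate() is empty
print(in_window(samples, 120, 120))  # [5.335, 65.335] -> two samples, rate works
```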
>>>>>>>>
>>>>>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I just learned about the resets() function and applying it does
>>>>>>>>> seem to show that a reset occurred:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>> "request": {
>>>>>>>>> "url":
>>>>>>>>> "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>>> "method": "GET",
>>>>>>>>> "hideFromInspector": false
>>>>>>>>> },
>>>>>>>>> "response": {
>>>>>>>>> "status": "success",
>>>>>>>>> "data": {
>>>>>>>>> "resultType": "matrix",
>>>>>>>>> "result": [
>>>>>>>>> {
>>>>>>>>> "metric": {/* redacted */},
>>>>>>>>> "values": [
>>>>>>>>> [
>>>>>>>>> 1649239200,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239260,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239320,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239380,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239440,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239500,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239560,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239620,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239680,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239740,
>>>>>>>>> "1"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239800,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239860,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239920,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649239980,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649240040,
>>>>>>>>> "0"
>>>>>>>>> ],
>>>>>>>>> [
>>>>>>>>> 1649240100,
>>>>>>>>> "0"
>>>>>>>>> ]
>>>>>>>>> ]
>>>>>>>>> }
>>>>>>>>> ]
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I don't quite understand how, though.
>>>>>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> We're seeing really large spikes when using the `rate()` function
>>>>>>>>>> on some of our metrics. I've been able to isolate a single time
>>>>>>>>>> series that
>>>>>>>>>> displays this problem, which I'm going to call `counter`. I haven't
>>>>>>>>>> attached the actual metric labels here, but all of the data you see
>>>>>>>>>> here is
>>>>>>>>>> from `counter` over the same time period.
>>>>>>>>>>
>>>>>>>>>> This is the raw data, as obtained through a request to
>>>>>>>>>> /api/v1/query:
>>>>>>>>>>
>>>>>>>>>> {
>>>>>>>>>> "data": {
>>>>>>>>>> "result": [
>>>>>>>>>> {
>>>>>>>>>> "metric": {/* redacted */},
>>>>>>>>>> "values": [
>>>>>>>>>> [
>>>>>>>>>> 1649239253.4,
>>>>>>>>>> "225201"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239313.4,
>>>>>>>>>> "225226"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239373.4,
>>>>>>>>>> "225249"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239433.4,
>>>>>>>>>> "225262"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239493.4,
>>>>>>>>>> "225278"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239553.4,
>>>>>>>>>> "225310"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239613.4,
>>>>>>>>>> "225329"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239673.4,
>>>>>>>>>> "225363"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239733.4,
>>>>>>>>>> "225402"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239793.4,
>>>>>>>>>> "225437"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239853.4,
>>>>>>>>>> "225466"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239913.4,
>>>>>>>>>> "225492"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239973.4,
>>>>>>>>>> "225529"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649240033.4,
>>>>>>>>>> "225555"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649240093.4,
>>>>>>>>>> "225595"
>>>>>>>>>> ]
>>>>>>>>>> ]
>>>>>>>>>> }
>>>>>>>>>> ],
>>>>>>>>>> "resultType": "matrix"
>>>>>>>>>> },
>>>>>>>>>> "status": "success"
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> The next query is taken from the Grafana query inspector, because
>>>>>>>>>> for reasons I don't understand I can't get Prometheus to give me any
>>>>>>>>>> data
>>>>>>>>>> when I issue the same query to /api/v1/query_range. The query is the
>>>>>>>>>> same
>>>>>>>>>> as the above query, but wrapped in a rate([1m]):
>>>>>>>>>>
>>>>>>>>>> "request": {
>>>>>>>>>> "url":
>>>>>>>>>> "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>>>> "method": "GET",
>>>>>>>>>> "hideFromInspector": false
>>>>>>>>>> },
>>>>>>>>>> "response": {
>>>>>>>>>> "status": "success",
>>>>>>>>>> "data": {
>>>>>>>>>> "resultType": "matrix",
>>>>>>>>>> "result": [
>>>>>>>>>> {
>>>>>>>>>> "metric": {/* redacted */},
>>>>>>>>>> "values": [
>>>>>>>>>> [
>>>>>>>>>> 1649239200,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239260,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239320,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239380,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239440,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239500,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239560,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239620,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239680,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239740,
>>>>>>>>>> "9391.766666666665"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239800,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239860,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239920,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649239980,
>>>>>>>>>> "0"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649240040,
>>>>>>>>>> "0.03333333333333333"
>>>>>>>>>> ],
>>>>>>>>>> [
>>>>>>>>>> 1649240100,
>>>>>>>>>> "0"
>>>>>>>>>> ]
>>>>>>>>>> ]
>>>>>>>>>> }
>>>>>>>>>> ]
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Given the gradual increase in the underlying counter, I have two
>>>>>>>>>> questions:
>>>>>>>>>>
>>>>>>>>>> 1. How come the rate is 0 for all except 2 datapoints?
>>>>>>>>>> 2. How come there is one enormous datapoint in the rate query,
>>>>>>>>>> that is seemingly unexplained in the raw data?
>>>>>>>>>>
>>>>>>>>>> For 2 I've seen in other threads that the explanation is an
>>>>>>>>>> unintentional counter reset, caused by scrapes a millisecond apart
>>>>>>>>>> that
>>>>>>>>>> make the counter appear to go down for a single scrape interval. I
>>>>>>>>>> don't
>>>>>>>>>> think I see this in our raw data, though.
>>>>>>>>>>
>>>>>>>>>> We're using Prometheus version 2.26.0, revision
>>>>>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "Prometheus Users" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com
>>>>>>>>>
>>>>>>>>> <https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>