Maybe it's time to set up Thanos or Mimir: since they just use S3, they would be a lot cheaper than AWS Prometheus. :-D
Prometheus doesn't explicitly track counter resets yet. It's something we've
thought about with the OpenMetrics support for metric creation timestamps, but
it's not implemented. What I meant is that if something did restart and the
counter went to 0 while you have only 2 samples in a rate(), you might get
funky spikes or a little bit of inaccurate extrapolation.

On Wed, Apr 6, 2022 at 3:07 PM Sam Rose <[email protected]> wrote:

> We do make use of that variable, and we configure its minimum value to 1m
> per-graph. I didn't realise you could configure this per-datasource, thanks
> for pointing that out!
>
> We used to scrape at 15s intervals, but we're using AWS's managed
> Prometheus workspaces, and each data point costs money, so we brought it
> down to 1m intervals.
>
> I'm not sure I understand the relationship between scrape interval and
> counter resets, especially considering there doesn't appear to be a counter
> reset in the raw data of the time series in question.
>
> You mentioned "true counter reset"; does Prometheus have some internal
> distinction between types of counter reset?
>
> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>
>> I would recommend using the `$__rate_interval` magic variable in Grafana.
>> Note that Grafana assumes a default interval of 15s in the datasource
>> settings.
>>
>> If your data is mostly 60s scrape intervals, you can configure this
>> setting in the Grafana datasource settings.
>>
>> If you want to be able to view 1m-resolution rates, I recommend
>> decreasing your scrape interval to 15s. This makes sure you have several
>> samples in the rate window, which helps Prometheus better handle true
>> counter resets and lost scrapes.
>>
>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>
>>> Thanks for the heads up! We've flip-flopped a bit between using 1m or
>>> 2m. 1m seems to work reliably enough to be useful in most situations,
>>> but I'll probably end up going back to 2m after this discussion.
>>>
>>> I don't believe that helps with the reset problem though, right? I
>>> retried the queries using 2m instead of 1m and they still exhibit the
>>> same problem.
>>>
>>> Is there any more data I can get you to help debug the problem? We see
>>> this happen multiple times per day, and it's making it difficult to
>>> monitor our systems in production.
>>>
>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote:
>>>
>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why,
>>>> though. I took your raw values:
>>>>
>>>> 225201
>>>> 225226
>>>> 225249
>>>> 225262
>>>> 225278
>>>> 225310
>>>> 225329
>>>> 225363
>>>> 225402
>>>> 225437
>>>> 225466
>>>> 225492
>>>> 225529
>>>> 225555
>>>> 225595
>>>>
>>>> $ awk '{print $1-225201}' values
>>>> 0
>>>> 25
>>>> 48
>>>> 61
>>>> 77
>>>> 109
>>>> 128
>>>> 162
>>>> 201
>>>> 236
>>>> 265
>>>> 291
>>>> 328
>>>> 354
>>>> 394
>>>>
>>>> I'm not seeing the reset there.
>>>>
>>>> One thing I noticed: your data interval is 60 seconds and you are doing
>>>> a rate(counter[1m]). This is not going to work reliably, because you
>>>> are likely to not have two samples in the same step window. This is
>>>> because Prometheus uses millisecond timestamps, so if you have samples
>>>> at these times:
>>>>
>>>> 5.335
>>>> 65.335
>>>> 125.335
>>>>
>>>> and you then do a rate(counter[1m]) at time 120 (Grafana attempts to
>>>> align queries to even minutes for consistency), the only sample you'll
>>>> get back is the one at 65.335.
>>>>
>>>> You need to do rate(counter[2m]) in order to avoid problems.
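[Editor's note: the window problem described above can be sketched in a few lines of Python. This is a toy model of PromQL range selection, not Prometheus's actual code; `samples_in_window` is a hypothetical helper.]

```python
# Sketch: a rate(counter[W]) evaluated at query time t only considers
# samples whose timestamp falls in the half-open interval (t - W, t].
def samples_in_window(timestamps, t, window):
    return [ts for ts in timestamps if t - window < ts <= t]

# 60s scrape interval with sub-second jitter, as in the example above.
scrapes = [5.335, 65.335, 125.335]

# Grafana aligns the query step to even minutes, so it evaluates at t=120.
print(samples_in_window(scrapes, 120, 60))   # [65.335] -> one sample, no rate
print(samples_in_window(scrapes, 120, 120))  # two samples -> rate can be computed
```

With a 60s window there is only ever one sample in range, so rate() returns nothing for that step; a 120s window reliably captures at least two.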
>>>>
>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>
>>>>> I just learned about the resets() function, and applying it does seem
>>>>> to show that a reset occurred:
>>>>>
>>>>> {
>>>>>   "request": {
>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>     "method": "GET",
>>>>>     "hideFromInspector": false
>>>>>   },
>>>>>   "response": {
>>>>>     "status": "success",
>>>>>     "data": {
>>>>>       "resultType": "matrix",
>>>>>       "result": [
>>>>>         {
>>>>>           "metric": {/* redacted */},
>>>>>           "values": [
>>>>>             [1649239200, "0"],
>>>>>             [1649239260, "0"],
>>>>>             [1649239320, "0"],
>>>>>             [1649239380, "0"],
>>>>>             [1649239440, "0"],
>>>>>             [1649239500, "0"],
>>>>>             [1649239560, "0"],
>>>>>             [1649239620, "0"],
>>>>>             [1649239680, "0"],
>>>>>             [1649239740, "1"],
>>>>>             [1649239800, "0"],
>>>>>             [1649239860, "0"],
>>>>>             [1649239920, "0"],
>>>>>             [1649239980, "0"],
>>>>>             [1649240040, "0"],
>>>>>             [1649240100, "0"]
>>>>>           ]
>>>>>         }
>>>>>       ]
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> I don't quite understand how, though.
>>>>>
>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> We're seeing really large spikes when using the `rate()` function on
>>>>>> some of our metrics. I've been able to isolate a single time series
>>>>>> that displays this problem, which I'm going to call `counter`. I
>>>>>> haven't attached the actual metric labels here, but all of the data
>>>>>> you see here is from `counter` over the same time period.
>>>>>>
>>>>>> This is the raw data, as obtained through a request to /api/v1/query:
>>>>>>
>>>>>> {
>>>>>>   "data": {
>>>>>>     "result": [
>>>>>>       {
>>>>>>         "metric": {/* redacted */},
>>>>>>         "values": [
>>>>>>           [1649239253.4, "225201"],
>>>>>>           [1649239313.4, "225226"],
>>>>>>           [1649239373.4, "225249"],
>>>>>>           [1649239433.4, "225262"],
>>>>>>           [1649239493.4, "225278"],
>>>>>>           [1649239553.4, "225310"],
>>>>>>           [1649239613.4, "225329"],
>>>>>>           [1649239673.4, "225363"],
>>>>>>           [1649239733.4, "225402"],
>>>>>>           [1649239793.4, "225437"],
>>>>>>           [1649239853.4, "225466"],
>>>>>>           [1649239913.4, "225492"],
>>>>>>           [1649239973.4, "225529"],
>>>>>>           [1649240033.4, "225555"],
>>>>>>           [1649240093.4, "225595"]
>>>>>>         ]
>>>>>>       }
>>>>>>     ],
>>>>>>     "resultType": "matrix"
>>>>>>   },
>>>>>>   "status": "success"
>>>>>> }
>>>>>>
>>>>>> The next query is taken from the Grafana query inspector, because for
>>>>>> reasons I don't understand I can't get Prometheus to give me any data
>>>>>> when I issue the same query to /api/v1/query_range. The query is the
>>>>>> same as the above query, but wrapped in a rate([1m]):
>>>>>>
>>>>>> "request": {
>>>>>>   "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>   "method": "GET",
>>>>>>   "hideFromInspector": false
>>>>>> },
>>>>>> "response": {
>>>>>>   "status": "success",
>>>>>>   "data": {
>>>>>>     "resultType": "matrix",
>>>>>>     "result": [
>>>>>>       {
>>>>>>         "metric": {/* redacted */},
>>>>>>         "values": [
>>>>>>           [1649239200, "0"],
>>>>>>           [1649239260, "0"],
>>>>>>           [1649239320, "0"],
>>>>>>           [1649239380, "0"],
>>>>>>           [1649239440, "0"],
>>>>>>           [1649239500, "0"],
>>>>>>           [1649239560, "0"],
>>>>>>           [1649239620, "0"],
>>>>>>           [1649239680, "0"],
>>>>>>           [1649239740, "9391.766666666665"],
>>>>>>           [1649239800, "0"],
>>>>>>           [1649239860, "0"],
>>>>>>           [1649239920, "0"],
>>>>>>           [1649239980, "0"],
>>>>>>           [1649240040, "0.03333333333333333"],
>>>>>>           [1649240100, "0"]
>>>>>>         ]
>>>>>>       }
>>>>>>     ]
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Given the gradual increase in the underlying counter, I have two
>>>>>> questions:
>>>>>>
>>>>>> 1. How come the rate is 0 for all except 2 datapoints?
>>>>>> 2. How come there is one enormous datapoint in the rate query that
>>>>>>    is seemingly unexplained in the raw data?
>>>>>>
>>>>>> For 2, I've seen in other threads that the explanation is an
>>>>>> unintentional counter reset, caused by scrapes a millisecond apart
>>>>>> that make the counter appear to go down for a single scrape interval.
>>>>>> I don't think I see this in our raw data, though.
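[Editor's note: the reset mechanism discussed in this thread can be sketched with a toy version of PromQL's counter handling. This is a simplification, not the real implementation: actual rate() also extrapolates to the edges of the window, and `simple_rate` is a hypothetical helper.]

```python
# Toy model of PromQL counter semantics: any decrease between adjacent
# samples is treated as a counter reset (a restart from 0), so the whole
# of the post-"reset" value is counted as increase.
def simple_rate(samples, window):
    """samples: (timestamp, value) pairs inside the window, oldest first."""
    if len(samples) < 2:
        return None  # fewer than two samples -> no result for this step
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # cur < prev looks like a restart, so all of cur counts as increase
        increase += (cur - prev) if cur >= prev else cur
    return increase / window

# A healthy minute from the raw data above: 39 increments over 60s.
print(simple_rate([(0, 225363), (60, 225402)], 60))  # 0.65

# A single sample dipping by just 1 (e.g. two scrapes landing out of
# order) is read as a reset, so nearly the entire counter value lands in
# the increase and the rate spikes by several orders of magnitude.
print(simple_rate([(0, 225402), (30, 225401), (60, 225437)], 60))
```

This shows how a dip of 1 that never appears in the stored samples you query back (for example, a duplicate scrape later dropped) can still produce a one-step spike on the order of the counter's absolute value.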
>>>>>>
>>>>>> We're using Prometheus version 2.26.0, revision
>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com.

