I don't think we plan on migrating to anything like Thanos just yet, but 
it's good to hear that it's much cheaper.

As far as I can tell, no counters actually reset in this time period, and 
the specific container being monitored didn't restart. Do you have any 
advice on debugging why Prometheus believes we're having counter resets?
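
In the meantime, I'll try pulling the raw samples around the spike window 
directly, to see exactly what rate() saw at that step. Something like this 
(the endpoint here is a placeholder for our real one):

$ curl -s 'http://localhost:9090/api/v1/query?query=counter[5m]&time=1649239740' \
    | jq '.data.result[0].values'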

On Wednesday, April 6, 2022 at 2:26:27 PM UTC+1 [email protected] wrote:

> Maybe time to set up Thanos or Mimir, since they just use S3 and so would 
> be a lot cheaper than AWS Prometheus. :-D
>
> Prometheus doesn't explicitly track counter resets yet. It's something 
> we've thought about with the OpenMetrics support for metric creation 
> timestamps, but it's not implemented.
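>
> As far as I know the detection is purely value-based: a reset is assumed 
> wherever a sample within the window is lower than the one before it. In 
> terms of the values file from below, it's roughly:
>
> $ awk 'NR > 1 && $1 < prev {n++} {prev = $1} END {print n+0}' values
> 0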
>
> I meant that if you did restart something and the counter went to 0, and 
> you have only 2 samples in a rate() window, you might get funky spikes or 
> a little bit of inaccurate extrapolation.
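>
> And when PromQL does think it sees a reset inside the window, the increase 
> it computes across the dip is the new value itself (it assumes a restart 
> from 0), which is how a single out-of-order or duplicate sample can turn 
> into a huge spike. A simplified sketch of that adjustment, ignoring 
> extrapolation, using two of your raw values in the wrong order:
>
> $ awk 'BEGIN { prev = 225402; cur = 225363
>                if (cur < prev) inc = cur; else inc = cur - prev
>                print inc / 60 }'
> 3756.05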
>
> On Wed, Apr 6, 2022 at 3:07 PM Sam Rose <[email protected]> wrote:
>
>> We do make use of that variable, and we set its minimum value to 1m per 
>> graph. I didn't realise you could configure this per-datasource, thanks 
>> for pointing that out!
>>
>> We used to scrape at 15s intervals, but we're using AWS's managed 
>> Prometheus workspaces, where each data point costs money, so we brought 
>> it down to 1m intervals.
>>
>> I'm not sure I understand the relationship between scrape interval and 
>> counter resets, especially considering there doesn't appear to be a counter 
>> reset in the raw data of the time series in question.
>>
>> You mentioned "true counter reset". Does Prometheus have some internal 
>> distinction between types of counter reset?
>>
>> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>>
>>> I would recommend using the `$__rate_interval` magic variable in 
>>> Grafana. Note that Grafana assumes a default interval of 15s in the 
>>> datasource settings.
>>>
>>> If your data mostly uses a 60s scrape interval, you can change that 
>>> default in the Grafana datasource settings.
>>>
>>> If you want to be able to view 1m-resolution rates, I recommend 
>>> reducing your scrape interval to 15s. This makes sure you have several 
>>> samples in the rate window, which helps Prometheus better handle true 
>>> counter resets and lost scrapes.
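>>>
>>> For example, with the datasource scrape interval set correctly, a panel 
>>> query like
>>>
>>> rate(counter[$__rate_interval])
>>>
>>> never uses a window shorter than four scrape intervals; if I remember 
>>> right, Grafana computes $__rate_interval as max($__interval + 
>>> scrape_interval, 4 * scrape_interval), so there are always at least a 
>>> few samples per window.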
>>>
>>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>>
>>>> Thanks for the heads up! We've flip-flopped a bit between using 1m and 
>>>> 2m. 1m seems to work reliably enough to be useful in most situations, 
>>>> but I'll probably end up going back to 2m after this discussion.
>>>>
>>>> I don't believe that helps with the reset problem though, right? I 
>>>> retried the queries using 2m instead of 1m and they still exhibit the same 
>>>> problem.
>>>>
>>>> Is there any more data I can get you to help debug the problem? We see 
>>>> this happen multiple times per day, and it's making it difficult to 
>>>> monitor our systems in production.
>>>>
>>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote:
>>>>
>>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why, 
>>>>> though. I took your raw values:
>>>>>
>>>>> 225201
>>>>> 225226
>>>>> 225249
>>>>> 225262
>>>>> 225278
>>>>> 225310
>>>>> 225329
>>>>> 225363
>>>>> 225402
>>>>> 225437
>>>>> 225466
>>>>> 225492
>>>>> 225529
>>>>> 225555
>>>>> 225595
>>>>>
>>>>> $ awk '{print $1-225201}' values
>>>>> 0
>>>>> 25
>>>>> 48
>>>>> 61
>>>>> 77
>>>>> 109
>>>>> 128
>>>>> 162
>>>>> 201
>>>>> 236
>>>>> 265
>>>>> 291
>>>>> 328
>>>>> 354
>>>>> 394
>>>>>
>>>>> I'm not seeing the reset there.
>>>>>
>>>>> One thing I noticed: your data interval is 60 seconds and you are 
>>>>> doing a rate(counter[1m]). This is not going to work reliably, because 
>>>>> you are likely to not have two samples in a given window. Prometheus 
>>>>> uses millisecond timestamps, so if you have samples at these times:
>>>>>
>>>>> 5.335
>>>>> 65.335
>>>>> 125.335
>>>>>
>>>>> If you then do a rate(counter[1m]) at time 120 (Grafana attempts to 
>>>>> align queries to even minutes for consistency), the only sample you'll 
>>>>> get back is the one at 65.335.
>>>>>
>>>>> You need to do rate(counter[2m]) in order to avoid problems.
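>>>>>
>>>>> To make the window arithmetic concrete (treating the window as 
>>>>> inclusive of both boundaries, which is close enough for this 
>>>>> illustration):
>>>>>
>>>>> $ awk -v end=120 -v win=60 'BEGIN {
>>>>>       for (t = 5.335; t < 130; t += 60)
>>>>>           if (t >= end - win && t <= end) n++
>>>>>       print n+0, "sample(s) in the " win "s window" }'
>>>>> 1 sample(s) in the 60s window
>>>>> $ awk -v end=120 -v win=120 'BEGIN {
>>>>>       for (t = 5.335; t < 130; t += 60)
>>>>>           if (t >= end - win && t <= end) n++
>>>>>       print n+0, "sample(s) in the " win "s window" }'
>>>>> 2 sample(s) in the 120s window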
>>>>>
>>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>>
>>>>>> I just learned about the resets() function and applying it does seem 
>>>>>> to show that a reset occurred:
>>>>>>
>>>>>> {
>>>>>>   "request": {
>>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>     "method": "GET",
>>>>>>     "hideFromInspector": false
>>>>>>   },
>>>>>>   "response": {
>>>>>>     "status": "success",
>>>>>>     "data": {
>>>>>>       "resultType": "matrix",
>>>>>>       "result": [
>>>>>>         {
>>>>>>           "metric": {/* redacted */},
>>>>>>           "values": [
>>>>>>             [1649239200, "0"],
>>>>>>             [1649239260, "0"],
>>>>>>             [1649239320, "0"],
>>>>>>             [1649239380, "0"],
>>>>>>             [1649239440, "0"],
>>>>>>             [1649239500, "0"],
>>>>>>             [1649239560, "0"],
>>>>>>             [1649239620, "0"],
>>>>>>             [1649239680, "0"],
>>>>>>             [1649239740, "1"],
>>>>>>             [1649239800, "0"],
>>>>>>             [1649239860, "0"],
>>>>>>             [1649239920, "0"],
>>>>>>             [1649239980, "0"],
>>>>>>             [1649240040, "0"],
>>>>>>             [1649240100, "0"]
>>>>>>           ]
>>>>>>         }
>>>>>>       ]
>>>>>>     }
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> I don't quite understand how, though.
>>>>>>
>>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> We're seeing really large spikes when using the `rate()` function on 
>>>>>>> some of our metrics. I've been able to isolate a single time series 
>>>>>>> that displays this problem, which I'm going to call `counter`. I 
>>>>>>> haven't attached the actual metric labels here, but all of the data 
>>>>>>> you see here is from `counter` over the same time period.
>>>>>>>
>>>>>>> This is the raw data, as obtained through a request to /api/v1/query:
>>>>>>>
>>>>>>> {
>>>>>>>     "data": {
>>>>>>>         "result": [
>>>>>>>             {
>>>>>>>                 "metric": {/* redacted */},
>>>>>>>                 "values": [
>>>>>>>                     [1649239253.4, "225201"],
>>>>>>>                     [1649239313.4, "225226"],
>>>>>>>                     [1649239373.4, "225249"],
>>>>>>>                     [1649239433.4, "225262"],
>>>>>>>                     [1649239493.4, "225278"],
>>>>>>>                     [1649239553.4, "225310"],
>>>>>>>                     [1649239613.4, "225329"],
>>>>>>>                     [1649239673.4, "225363"],
>>>>>>>                     [1649239733.4, "225402"],
>>>>>>>                     [1649239793.4, "225437"],
>>>>>>>                     [1649239853.4, "225466"],
>>>>>>>                     [1649239913.4, "225492"],
>>>>>>>                     [1649239973.4, "225529"],
>>>>>>>                     [1649240033.4, "225555"],
>>>>>>>                     [1649240093.4, "225595"]
>>>>>>>                 ]
>>>>>>>             }
>>>>>>>         ],
>>>>>>>         "resultType": "matrix"
>>>>>>>     },
>>>>>>>     "status": "success"
>>>>>>> }
>>>>>>>
>>>>>>> The next query is taken from the Grafana query inspector, because 
>>>>>>> for reasons I don't understand I can't get Prometheus to give me any 
>>>>>>> data when I issue the same query to /api/v1/query_range. The query is 
>>>>>>> the same as the one above, but wrapped in a rate(...[1m]):
>>>>>>>
>>>>>>>     "request": {
>>>>>>>         "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>         "method": "GET",
>>>>>>>         "hideFromInspector": false
>>>>>>>     },
>>>>>>>     "response": {
>>>>>>>         "status": "success",
>>>>>>>         "data": {
>>>>>>>             "resultType": "matrix",
>>>>>>>             "result": [
>>>>>>>                 {
>>>>>>>                     "metric": {/* redacted */},
>>>>>>>                     "values": [
>>>>>>>                         [1649239200, "0"],
>>>>>>>                         [1649239260, "0"],
>>>>>>>                         [1649239320, "0"],
>>>>>>>                         [1649239380, "0"],
>>>>>>>                         [1649239440, "0"],
>>>>>>>                         [1649239500, "0"],
>>>>>>>                         [1649239560, "0"],
>>>>>>>                         [1649239620, "0"],
>>>>>>>                         [1649239680, "0"],
>>>>>>>                         [1649239740, "9391.766666666665"],
>>>>>>>                         [1649239800, "0"],
>>>>>>>                         [1649239860, "0"],
>>>>>>>                         [1649239920, "0"],
>>>>>>>                         [1649239980, "0"],
>>>>>>>                         [1649240040, "0.03333333333333333"],
>>>>>>>                         [1649240100, "0"]
>>>>>>>                     ]
>>>>>>>                 }
>>>>>>>             ]
>>>>>>>         }
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> Given the gradual increase in the underlying counter, I have two 
>>>>>>> questions:
>>>>>>>
>>>>>>> 1. How come the rate is 0 for all but two datapoints?
>>>>>>> 2. How come there is one enormous datapoint in the rate query, which 
>>>>>>> is seemingly unexplained in the raw data?
>>>>>>>
>>>>>>> For 2, I've seen in other threads that the explanation is an 
>>>>>>> unintentional counter reset, caused by scrapes a millisecond apart 
>>>>>>> that make the counter appear to go down for a single scrape interval. 
>>>>>>> I don't think I see this in our raw data, though.
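>>>>>>>
>>>>>>> (For what it's worth, scanning the raw samples for any decrease 
>>>>>>> prints nothing. Roughly what I ran, with the endpoint as a 
>>>>>>> placeholder for our real one:)
>>>>>>>
>>>>>>> $ curl -s 'http://localhost:9090/api/v1/query?query=counter[15m]' \
>>>>>>>     | jq -r '.data.result[0].values[] | @tsv' \
>>>>>>>     | awk 'NR > 1 && $2 < prev {print "dip at " $1} {prev = $2}'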
>>>>>>>
>>>>>>> We're using Prometheus version 2.26.0, revision 
>>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>>>
