Here's the query inspector output from Grafana for rate(counter[2m]). It
makes the answer to question 1 in my original post clearer. You're right,
the graph for 1m is just plain wrong. We do still see the reset, though.
{
  "request": {
    "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[2m])&start=1649239200&end=1649240100&step=60",
    "method": "GET",
    "hideFromInspector": false
  },
  "response": {
    "status": "success",
    "data": {
      "resultType": "matrix",
      "result": [
        {
          "metric": {/* redacted */},
          "values": [
            [1649239200, "0.2871886897537781"],
            [1649239260, "0.3084619260318357"],
            [1649239320, "0.26591545347572043"],
            [1649239380, "0.2446422171976628"],
            [1649239440, "0.13827603580737463"],
            [1649239500, "0.1701858902244611"],
            [1649239560, "0.3403717804489222"],
            [1649239620, "0.20209574464154753"],
            [1649239680, "0.3616450167269798"],
            [1649239740, "2397.9404664989347"],
            [1649239800, "2397.88728340824"],
            [1649239860, "0.3084619260318357"],
            [1649239920, "0.27655207161474926"],
            [1649239980, "0.39355487114406623"],
            [1649240040, "0.27655207161474926"],
            [1649240100, "0.43610134370018155"]
          ]
        }
      ]
    }
  }
}
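To make the spike concrete: values like the 2397.94 above are what you get when rate() believes the counter reset inside the window. A rough Python sketch of that reset handling (simplified; real Prometheus also extrapolates to the window edges, and all names here are mine, not Prometheus's):

```python
def simple_increase(samples):
    """Reset-aware increase over a window, roughly as rate()/increase()
    compute it: whenever a sample is lower than its predecessor, PromQL
    assumes the counter reset to zero and adds the previous value back in.
    (Simplified: no extrapolation to the window boundaries.)"""
    total = samples[-1][1] - samples[0][1]
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:  # apparent counter reset
            total += prev
    return total

# Hypothetical window: a duplicate scrape 1ms apart whose value is a hair lower
window = [(0.0, 225363), (60.0, 225402), (60.001, 225401)]
print(simple_increase(window))  # 38 real increase + 225402 phantom = 225440
```

Divide that by the window length and the rate lands in the thousands per second, the same order of magnitude as the spike above: a dip of even a single unit between samples a millisecond apart is enough.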
On Wednesday, April 6, 2022 at 2:34:59 PM UTC+1 Sam Rose wrote:
> We do see a graph with rate(counter[1m]). It even looks pretty close to
> what we see with rate(counter[2m]). We definitely scrape every 60 seconds,
> double checked our config to make sure.
>
> The exact query was `counter[15m]`. Counter is
> `django_http_responses_total_by_status_total` in reality, with a long list
> of labels attached to ensure I'm selecting a single time series.
>
> I didn't realise Grafana did that, thank you for the advice.
>
> I feel like we're drifting away from the original problem a little bit.
> Can I get you any additional data to make the original problem easier to
> debug?
>
> On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote:
>
>> If you are scraping at 1m intervals, then you definitely need
>> rate(counter[2m]). That's because rate() needs at least two data points to
>> fall within the range window. I would be surprised if you see any graph at
>> all with rate(counter[1m]).
>>
>> > This is the raw data, as obtained through a request to /api/v1/query
>>
>> What is the *exact* query you gave? Hopefully it is a range vector query,
>> like counter[15m]. A range vector expression sent to the simple query
>> endpoint gives you the raw data points with their raw timestamps from the
>> database.
>>
>> > and then we configure the minimum value of it to 1m per-graph
>>
>> Just in case you haven't realised: to set a minimum value of 1m, you must
>> set the data source scrape interval (in Grafana) to 15s, since Grafana
>> clamps the minimum value to 4 × the Grafana-configured data source scrape
>> interval.
>>
>> Therefore if you are actually scraping at 1m intervals, and you want the
>> minimum of $__rate_interval to be 2m, then you must set the Grafana data
>> source interval to 30s. This is weird, but it is what it is.
>> https://github.com/grafana/grafana/issues/32169
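The clamping arithmetic described above can be sketched as follows (the helper name is mine, not a Grafana API; the 4× factor is the one discussed in the linked issue):

```python
def min_rate_interval(datasource_interval_s: int, factor: int = 4) -> int:
    """Minimum value Grafana allows for $__rate_interval: factor times the
    data source's configured scrape interval (factor is 4)."""
    return factor * datasource_interval_s

# A 1m minimum requires configuring the data source interval as 15s:
assert min_rate_interval(15) == 60
# A 2m minimum (with real 1m scrapes) requires configuring it as 30s:
assert min_rate_interval(30) == 120
```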
>>
>> On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote:
>>
>>> We do make use of that variable, and then we configure the minimum value
>>> of it to 1m per-graph. I didn't realise you could configure this
>>> per-datasource, thanks for pointing that out!
>>>
>>> We used to scrape at 15s intervals, but we're using AWS's managed
>>> Prometheus workspaces, and each data point costs money, so we brought it
>>> down to 1m intervals.
>>>
>>> I'm not sure I understand the relationship between scrape interval and
>>> counter resets, especially considering there doesn't appear to be a counter
>>> reset in the raw data of the time series in question.
>>>
>>> You mentioned "true counter reset", does prometheus have some internal
>>> distinction between types of counter reset?
>>>
>>> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>>>
>>>> I would recommend using the `$__rate_interval` magic variable in
>>>> Grafana. Note that Grafana assumes a default interval of 15s in the
>>>> datasource settings.
>>>>
>>>> If your data is mostly 60s scrape intervals, you can configure this
>>>> setting in the Grafana datasource settings.
>>>>
>>>> If you want to be able to view 1m resolution rates, I recommend
>>>> reducing your scrape interval to 15s. That ensures you have several
>>>> samples in each rate window, which helps Prometheus better handle true
>>>> counter resets and lost scrapes.
>>>>
>>>> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>>>>
>>>>> Thanks for the heads up! We've flip-flopped a bit between using 1m or
>>>>> 2m. 1m seems to work reliably enough to be useful in most situations, but
>>>>> I'll probably end up going back to 2m after this discussion.
>>>>>
>>>>> I don't believe that helps with the reset problem though, right? I
>>>>> retried the queries using 2m instead of 1m and they still exhibit the
>>>>> same
>>>>> problem.
>>>>>
>>>>> Is there any more data I can get you to help debug the problem? We see
>>>>> this happen multiple times per day, and it's making it difficult to
>>>>> monitor
>>>>> our systems in production.
>>>>>
>>>>> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected]
>>>>> wrote:
>>>>>
>>>>>> Yup, PromQL thinks there's a small dip in the data. I'm not sure why,
>>>>>> though. I took your raw values:
>>>>>>
>>>>>> 225201
>>>>>> 225226
>>>>>> 225249
>>>>>> 225262
>>>>>> 225278
>>>>>> 225310
>>>>>> 225329
>>>>>> 225363
>>>>>> 225402
>>>>>> 225437
>>>>>> 225466
>>>>>> 225492
>>>>>> 225529
>>>>>> 225555
>>>>>> 225595
>>>>>>
>>>>>> $ awk '{print $1-225201}' values
>>>>>> 0
>>>>>> 25
>>>>>> 48
>>>>>> 61
>>>>>> 77
>>>>>> 109
>>>>>> 128
>>>>>> 162
>>>>>> 201
>>>>>> 236
>>>>>> 265
>>>>>> 291
>>>>>> 328
>>>>>> 354
>>>>>> 394
>>>>>>
>>>>>> I'm not seeing the reset there.
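For anyone following along, the same monotonicity check can be done without awk; a small Python sketch over the raw values quoted above:

```python
# The raw counter samples quoted above
values = [225201, 225226, 225249, 225262, 225278, 225310, 225329,
          225363, 225402, 225437, 225466, 225492, 225529, 225555, 225595]

# A "reset" from PromQL's point of view is any sample whose value is lower
# than its predecessor; list every adjacent pair that goes down.
resets = [(prev, cur) for prev, cur in zip(values, values[1:]) if cur < prev]
print(resets)  # → [] : the raw data is strictly increasing, no reset visible
```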
>>>>>>
>>>>>> One thing I noticed: your data interval is 60 seconds and you are
>>>>>> doing rate(counter[1m]). This is not going to work reliably, because
>>>>>> you are unlikely to have two samples fall within the same step window.
>>>>>> Prometheus uses millisecond timestamps, so if you have samples at
>>>>>> these times:
>>>>>>
>>>>>> 5.335
>>>>>> 65.335
>>>>>> 125.335
>>>>>>
>>>>>> then when you evaluate rate(counter[1m]) at time 120 (Grafana attempts
>>>>>> to align queries to even minutes for consistency), the only sample
>>>>>> inside the 60-second window ending at 120 is the one at 65.335, and a
>>>>>> rate over a single sample returns nothing.
>>>>>>
>>>>>> You need to do rate(counter[2m]) in order to avoid these problems.
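The window membership being described can be sketched like this (assuming the usual trailing-window range-selector semantics; function names are mine):

```python
# Which samples does a rate(counter[1m]) evaluated at t=120 see?
# A range selector picks samples with timestamps inside the trailing window.
samples = [5.335, 65.335, 125.335]  # sub-second scrape times, 60s apart

def in_window(ts: float, t: float = 120.0, rng: float = 60.0) -> bool:
    return t - rng <= ts <= t

selected = [ts for ts in samples if in_window(ts)]
print(selected)  # → [65.335]: one sample only, so rate() has nothing to diff
```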
>>>>>>
>>>>>> On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote:
>>>>>>
>>>>>>> I just learned about the resets() function and applying it does seem
>>>>>>> to show that a reset occurred:
>>>>>>>
>>>>>>> {
>>>>>>>   "request": {
>>>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>     "method": "GET",
>>>>>>>     "hideFromInspector": false
>>>>>>>   },
>>>>>>>   "response": {
>>>>>>>     "status": "success",
>>>>>>>     "data": {
>>>>>>>       "resultType": "matrix",
>>>>>>>       "result": [
>>>>>>>         {
>>>>>>>           "metric": {/* redacted */},
>>>>>>>           "values": [
>>>>>>>             [1649239200, "0"],
>>>>>>>             [1649239260, "0"],
>>>>>>>             [1649239320, "0"],
>>>>>>>             [1649239380, "0"],
>>>>>>>             [1649239440, "0"],
>>>>>>>             [1649239500, "0"],
>>>>>>>             [1649239560, "0"],
>>>>>>>             [1649239620, "0"],
>>>>>>>             [1649239680, "0"],
>>>>>>>             [1649239740, "1"],
>>>>>>>             [1649239800, "0"],
>>>>>>>             [1649239860, "0"],
>>>>>>>             [1649239920, "0"],
>>>>>>>             [1649239980, "0"],
>>>>>>>             [1649240040, "0"],
>>>>>>>             [1649240100, "0"]
>>>>>>>           ]
>>>>>>>         }
>>>>>>>       ]
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> I don't quite understand how, though.
>>>>>>> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> We're seeing really large spikes when using the `rate()` function
>>>>>>>> on some of our metrics. I've been able to isolate a single time series
>>>>>>>> that
>>>>>>>> displays this problem, which I'm going to call `counter`. I haven't
>>>>>>>> attached the actual metric labels here, but all of the data you see
>>>>>>>> here is
>>>>>>>> from `counter` over the same time period.
>>>>>>>>
>>>>>>>> This is the raw data, as obtained through a request to
>>>>>>>> /api/v1/query:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "data": {
>>>>>>>>     "result": [
>>>>>>>>       {
>>>>>>>>         "metric": {/* redacted */},
>>>>>>>>         "values": [
>>>>>>>>           [1649239253.4, "225201"],
>>>>>>>>           [1649239313.4, "225226"],
>>>>>>>>           [1649239373.4, "225249"],
>>>>>>>>           [1649239433.4, "225262"],
>>>>>>>>           [1649239493.4, "225278"],
>>>>>>>>           [1649239553.4, "225310"],
>>>>>>>>           [1649239613.4, "225329"],
>>>>>>>>           [1649239673.4, "225363"],
>>>>>>>>           [1649239733.4, "225402"],
>>>>>>>>           [1649239793.4, "225437"],
>>>>>>>>           [1649239853.4, "225466"],
>>>>>>>>           [1649239913.4, "225492"],
>>>>>>>>           [1649239973.4, "225529"],
>>>>>>>>           [1649240033.4, "225555"],
>>>>>>>>           [1649240093.4, "225595"]
>>>>>>>>         ]
>>>>>>>>       }
>>>>>>>>     ],
>>>>>>>>     "resultType": "matrix"
>>>>>>>>   },
>>>>>>>>   "status": "success"
>>>>>>>> }
>>>>>>>>
>>>>>>>> The next query is taken from the Grafana query inspector, because
>>>>>>>> for reasons I don't understand I can't get Prometheus to give me any
>>>>>>>> data when I issue the same query to /api/v1/query_range. The query is
>>>>>>>> the same as the one above, but wrapped in rate(counter[1m]):
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "request": {
>>>>>>>>     "url": "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60",
>>>>>>>>     "method": "GET",
>>>>>>>>     "hideFromInspector": false
>>>>>>>>   },
>>>>>>>>   "response": {
>>>>>>>>     "status": "success",
>>>>>>>>     "data": {
>>>>>>>>       "resultType": "matrix",
>>>>>>>>       "result": [
>>>>>>>>         {
>>>>>>>>           "metric": {/* redacted */},
>>>>>>>>           "values": [
>>>>>>>>             [1649239200, "0"],
>>>>>>>>             [1649239260, "0"],
>>>>>>>>             [1649239320, "0"],
>>>>>>>>             [1649239380, "0"],
>>>>>>>>             [1649239440, "0"],
>>>>>>>>             [1649239500, "0"],
>>>>>>>>             [1649239560, "0"],
>>>>>>>>             [1649239620, "0"],
>>>>>>>>             [1649239680, "0"],
>>>>>>>>             [1649239740, "9391.766666666665"],
>>>>>>>>             [1649239800, "0"],
>>>>>>>>             [1649239860, "0"],
>>>>>>>>             [1649239920, "0"],
>>>>>>>>             [1649239980, "0"],
>>>>>>>>             [1649240040, "0.03333333333333333"],
>>>>>>>>             [1649240100, "0"]
>>>>>>>>           ]
>>>>>>>>         }
>>>>>>>>       ]
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Given the gradual increase in the underlying counter, I have two
>>>>>>>> questions:
>>>>>>>>
>>>>>>>> 1. Why is the rate 0 for all but two data points?
>>>>>>>> 2. Why is there one enormous data point in the rate query that is
>>>>>>>> seemingly unexplained by the raw data?
>>>>>>>>
>>>>>>>> For question 2, I've seen in other threads that the explanation is an
>>>>>>>> unintentional counter reset, caused by scrapes a millisecond apart
>>>>>>>> that make the counter appear to go down for a single scrape interval.
>>>>>>>> I don't think I see this in our raw data, though.
>>>>>>>>
>>>>>>>> We're using Prometheus version 2.26.0, revision
>>>>>>>> 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2.
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Prometheus Users" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com
>>>>>>>
>>>>>>> <https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>