There are at least three different Prometheus servers running then?

- 2 x Prometheus in Kubernetes that you deployed yourself
- 1 x AWS managed Prometheus
Which of those are you querying from Grafana, and which were you querying for the direct queries you showed? How do you get data from the 2 x Prometheus into the 1 x AWS managed Prometheus, e.g. remote write, or federation?

If the two local Kubernetes servers are both scraping the same targets and both writing into the same AWS instance, then you need to set different "external_labels" on them, so that they create two distinct timeseries in AWS. If not, you'll get duplicate data points with different timestamps, which sounds very likely to cause the problem you describe.

On Wednesday, 6 April 2022 at 20:23:21 UTC+1 [email protected] wrote:

> I appreciate your time. I've logged off for the day but will get back to you tomorrow with more data.
>
> To answer the question I can: we aren't using any proxy software to my knowledge. We use the https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus Helm chart (version 13.8.0), hooked up to store data in AWS's managed Prometheus product.
>
> That said, we do run it in StatefulSet mode with 2 replicas. I wonder if that's causing problems.
>
> On Wed, 6 Apr 2022, at 19:49, Brian Candler wrote:
>
> Are you going through any middleware or proxy, like promxy?
>
> rate(foo[1m]) should definitely give no answer at all when the timeseries data is sampled at 1 minute intervals.
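To make the external_labels idea concrete, here is a minimal sketch of what each replica's server config could look like. The label name `__replica__`, and the remote-write URL shape, are illustrative assumptions on my part, not taken from this thread; adapt them to however your Helm chart exposes `global.external_labels`:

```yaml
# Sketch for replica 0 of the StatefulSet. The second replica would set
# a different value, e.g. "replica-1", so the two writers produce
# distinct series in the remote store instead of interleaved duplicates.
global:
  scrape_interval: 1m
  external_labels:
    __replica__: replica-0   # illustrative label name, not required by Prometheus

remote_write:
  - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
```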
> > Here is a working query_range for rate[1m] where the scrape interval is > 15s: > > # curl -Ssg ' > http://localhost:9090/api/v1/query_range?query=rate(ifHCInOctets{instance="gw1",ifName="ether1"}[60s])&start=1649264340&end=1649264640&step=60' > > | python3 -m json.tool > { > "status": "success", > "data": { > "resultType": "matrix", > "result": [ > { > "metric": { > "ifIndex": "16", > "ifName": "ether1", > "instance": "gw1", > "job": "snmp", > "module": "mikrotik_secret", > "netbox_type": "device" > }, > "values": [ > [ > 1649264340, > "578.6444444444444" > ], > [ > 1649264400, > "651.4222222222221" > ], > [ > 1649264460, > "135.17777777777778" > ], > [ > 1649264520, > "1699.4888888888888" > ], > [ > 1649264580, > "441.5777777777777" > ], > [ > 1649264640, > "39768.08888888888" > ] > ] > } > ] > } > } > > But if I make exactly the same query but with rate[15s] then there are no > answers: > > # curl -Ssg ' > http://localhost:9090/api/v1/query_range?query=rate(ifHCInOctets{instance="gw1",ifName="ether1"}[15s])&start=1649264340&end=1649264640&step=60' > > | python3 -m json.tool > { > "status": "success", > "data": { > "resultType": "matrix", > "result": [] > } > } > > I think the real reason for your problem is hidden; you're obfuscating the > query and metric names, and I suspect it's hidden behind that. Sorry, I > can't help you any further given what I can see, but hopefully you have an > idea where you can look further. > > On Wednesday, 6 April 2022 at 18:45:10 UTC+1 [email protected] wrote: > > Hey Brian, > > In the original post I put the output of the raw time series as gathered > the way you suggest. 
I'll copy it again below: > > { > "data": { > "result": [ > { > "metric": {/* redacted */}, > "values": [ > [ > 1649239253.4, > "225201" > ], > [ > 1649239313.4, > "225226" > ], > [ > 1649239373.4, > "225249" > ], > [ > 1649239433.4, > "225262" > ], > [ > 1649239493.4, > "225278" > ], > [ > 1649239553.4, > "225310" > ], > [ > 1649239613.4, > "225329" > ], > [ > 1649239673.4, > "225363" > ], > [ > 1649239733.4, > "225402" > ], > [ > 1649239793.4, > "225437" > ], > [ > 1649239853.4, > "225466" > ], > [ > 1649239913.4, > "225492" > ], > [ > 1649239973.4, > "225529" > ], > [ > 1649240033.4, > "225555" > ], > [ > 1649240093.4, > "225595" > ] > ] > } > ], > "resultType": "matrix" > }, > "status": "success" > } > > The query was of the form `counter[15m]` at a given time. I don't see > duplicate scrape data in there. > > The version of prometheus is 2.26.0, revision > 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2. > On Wednesday, April 6, 2022 at 6:13:10 PM UTC+1 Brian Candler wrote: > > What version of prometheus are you running? > > With prometheus, rate(counter[1m]) should give you no results at all when > you are scraping at 1 minute intervals - unless something has changed very > recently (I'm running 2.33.4). So this is a big red flag. 
> Now, for driving the query API, you should be able to do it like this:
>
> # curl -Ssg 'http://localhost:9090/api/v1/query?query=ifHCInOctets{instance="gw1",ifName="ether1"}[60s]' | python3 -m json.tool
> {
>     "status": "success",
>     "data": {
>         "resultType": "matrix",
>         "result": [
>             {
>                 "metric": {
>                     "__name__": "ifHCInOctets",
>                     "ifIndex": "16",
>                     "ifName": "ether1",
>                     "instance": "gw1",
>                     "job": "snmp",
>                     "module": "mikrotik_secret",
>                     "netbox_type": "device"
>                 },
>                 "values": [
>                     [1649264595.241, "117857843410"],
>                     [1649264610.241, "117858063821"],
>                     [1649264625.241, "117858075769"]
>                 ]
>             }
>         ]
>     }
> }
>
> There I gave a range vector of 60 seconds, and I got 3 data points because I'm scraping at 15 second intervals, so only 3 points fell within the time window between (current time) and (current time - 60s).
>
> Sending a query_range will sample the data at intervals. Only an actual range vector query (as shown above) will show you *all* the data points in the time series, wherever they lie.
>
> I think you should do this. My guess - and it's only a guess at the moment - is that there are multiple points being received for the same timeseries, and this is giving your spike. This could be due to overlapping scrape jobs for the same timeseries, or relabelling removing some distinguishing label, or some HA setup which is scraping the same timeseries multiple times but not adding external labels to distinguish them.
>
> I do have some evidence for my guess. If you are storing the same data points twice, this will give you a rate of zero most of the time when doing rate[1m], because there are two adjacent identical points most of the time (whereas if there were only a single data point, you'd get no rate at all). And you'll get a counter spike if two data points get transposed.
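That last paragraph can be sketched numerically. The following is a toy model of rate()'s reset handling (not Prometheus source code, and the sample data is hypothetical), showing why duplicated points yield a zero rate in a narrow window, while a single transposed pair looks like a counter reset and produces an enormous spike:

```python
# Toy model (not Prometheus source) of how PromQL's rate() treats a
# decrease in a counter as a reset. Sample data is hypothetical.

def simple_rate(samples):
    """samples: (timestamp, value) pairs sorted by timestamp.
    Any decrease between adjacent samples is treated as a counter
    reset, i.e. the counter is assumed to have restarted from 0."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])

# Normal case: one server, 60s scrapes.
one = [(0, 225201), (60, 225226), (120, 225249)]
print(simple_rate(one))      # 0.4 per second

# Two servers storing the same values ~100ms apart: a narrow [1m]-style
# window that catches only one duplicated pair sees a rate of zero.
pair = [(60.0, 225226), (60.1, 225226)]
print(simple_rate(pair))     # 0.0

# If one duplicated pair arrives transposed, the value appears to dip
# by 1, which looks like a reset to 0 and back up: a huge spike.
swapped = [(0, 225201), (60.0, 225226), (60.1, 225225), (120, 225249)]
print(simple_rate(swapped))  # ~1877 per second
```

Note the spike's magnitude is roughly (counter value) / (window width), which is why it dwarfs the true rate, just like the 9391.77 value in the original post.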
> > On Wednesday, 6 April 2022 at 14:37:57 UTC+1 [email protected] wrote: > > Here's the query inspector output from Grafana for rate(counter[2m]). It > makes the answer to question 1 in my original post more clear. You're > right, the graph for 1m is just plain wrong. We do still see the reset, > though. > > { > "request": { > "url": > "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[2m])&start=1649239200&end=1649240100&step=60", > > "method": "GET", > "hideFromInspector": false > }, > "response": { > "status": "success", > "data": { > "resultType": "matrix", > "result": [ > { > "metric": {/* redacted */}, > "values": [ > [ > 1649239200, > "0.2871886897537781" > ], > [ > 1649239260, > "0.3084619260318357" > ], > [ > 1649239320, > "0.26591545347572043" > ], > [ > 1649239380, > "0.2446422171976628" > ], > [ > 1649239440, > "0.13827603580737463" > ], > [ > 1649239500, > "0.1701858902244611" > ], > [ > 1649239560, > "0.3403717804489222" > ], > [ > 1649239620, > "0.20209574464154753" > ], > [ > 1649239680, > "0.3616450167269798" > ], > [ > 1649239740, > "2397.9404664989347" > ], > [ > 1649239800, > "2397.88728340824" > ], > [ > 1649239860, > "0.3084619260318357" > ], > [ > 1649239920, > "0.27655207161474926" > ], > [ > 1649239980, > "0.39355487114406623" > ], > [ > 1649240040, > "0.27655207161474926" > ], > [ > 1649240100, > "0.43610134370018155" > ] > ] > } > ] > } > } > } > On Wednesday, April 6, 2022 at 2:34:59 PM UTC+1 Sam Rose wrote: > > We do see a graph with rate(counter[1m]). It even looks pretty close to > what we see with rate(counter[2m]). We definitely scrape every 60 seconds, > double checked our config to make sure. > > The exact query was `counter[15m]`. Counter is > `django_http_responses_total_by_status_total` in reality, with a long list > of labels attached to ensure I'm selecting a single time series. > > I didn't realise Grafana did that, thank you for the advice. 
> I feel like we're drifting away from the original problem a little bit. Can I get you any additional data to make the original problem easier to debug?
>
> On Wednesday, April 6, 2022 at 2:31:27 PM UTC+1 Brian Candler wrote:
>
> If you are scraping at 1m intervals, then you definitely need rate(counter[2m]). That's because rate() needs at least two data points to fall within the range window. I would be surprised if you see any graph at all with rate(counter[1m]).
>
> > This is the raw data, as obtained through a request to /api/v1/query
>
> What is the *exact* query you gave? Hopefully it is a range vector query, like counter[15m]. A range vector expression sent to the simple query endpoint gives you the raw data points with their raw timestamps from the database.
>
> > and then we configure the minimum value of it to 1m per-graph
>
> Just in case you haven't realised: to set a minimum value of 1m, you must set the data source scrape interval (in Grafana) to 15s, since Grafana clamps the minimum value to 4 x the Grafana-configured data source scrape interval.
>
> Therefore if you are actually scraping at 1m intervals, and you want the minimum of $__rate_interval to be 2m, then you must set the Grafana data source interval to 30s. This is weird, but it is what it is. https://github.com/grafana/grafana/issues/32169
>
> On Wednesday, 6 April 2022 at 14:07:13 UTC+1 [email protected] wrote:
>
> We do make use of that variable, and then we configure the minimum value of it to 1m per-graph. I didn't realise you could configure this per-datasource, thanks for pointing that out!
>
> We used to scrape at 15s intervals, but we're using AWS's managed Prometheus workspaces, and each data point costs money, so we brought it down to 1m intervals.
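As I understand Grafana's documented behaviour, $__rate_interval resolves to max(panel interval + scrape interval, 4 x scrape interval), where "scrape interval" is the value configured on the datasource, not what Prometheus actually does. The clamping arithmetic Brian describes can be sketched as:

```python
# Sketch of Grafana's $__rate_interval clamp, per its documented formula:
# max(panel_interval + scrape_interval, 4 * scrape_interval), where
# scrape_interval is the Grafana datasource setting.

def rate_interval(panel_interval_s: int, datasource_scrape_interval_s: int) -> int:
    return max(panel_interval_s + datasource_scrape_interval_s,
               4 * datasource_scrape_interval_s)

# Datasource interval 15s: the smallest $__rate_interval you can get is 1m.
print(rate_interval(15, 15))   # 60
# To make the minimum 2m, the datasource interval must be set to 30s.
print(rate_interval(30, 30))   # 120
```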
> I'm not sure I understand the relationship between scrape interval and counter resets, especially considering there doesn't appear to be a counter reset in the raw data of the time series in question.
>
> You mentioned "true counter reset"; does Prometheus have some internal distinction between types of counter reset?
>
> On Wednesday, April 6, 2022 at 2:03:40 PM UTC+1 [email protected] wrote:
>
> I would recommend using the `$__rate_interval` magic variable in Grafana. Note that Grafana assumes a default interval of 15s in the datasource settings.
>
> If your data is mostly at 60s scrape intervals, you can configure this setting in the Grafana datasource settings.
>
> If you want to be able to view 1m-resolution rates, I recommend reducing your scrape interval to 15s. This makes sure you have several samples in the rate window, which helps Prometheus better handle true counter resets and lost scrapes.
>
> On Wed, Apr 6, 2022 at 2:56 PM Sam Rose <[email protected]> wrote:
>
> Thanks for the heads up! We've flip-flopped a bit between using 1m or 2m. 1m seems to work reliably enough to be useful in most situations, but I'll probably end up going back to 2m after this discussion.
>
> I don't believe that helps with the reset problem though, right? I retried the queries using 2m instead of 1m and they still exhibit the same problem.
>
> Is there any more data I can get you to help debug the problem? We see this happen multiple times per day, and it's making it difficult to monitor our systems in production.
>
> On Wednesday, April 6, 2022 at 1:53:26 PM UTC+1 [email protected] wrote:
>
> Yup, PromQL thinks there's a small dip in the data. I'm not sure why, though.
> I took your raw values: > > 225201 > 225226 > 225249 > 225262 > 225278 > 225310 > 225329 > 225363 > 225402 > 225437 > 225466 > 225492 > 225529 > 225555 > 225595 > > $ awk '{print $1-225201}' values > 0 > 25 > 48 > 61 > 77 > 109 > 128 > 162 > 201 > 236 > 265 > 291 > 328 > 354 > 394 > > I'm not seeing the reset there. > > One thing I noticed, your data interval is 60 seconds and you are doing a > rate(counter[1m]). This is not going to work reliably, because you are > likely to not have two samples in the same step window. This is because > Prometheus uses millisecond timestamps, so if you have timestamps at these > times: > > 5.335 > 65.335 > 125.335 > > Then you do a rate(counter[1m]) at time 120 (Grafana attempts to align > queries to even minutes for consistency), the only sample you'll get back > is 65.335. > > You need to do rate(counter[2m]) in order to avoid problems. > > > On Wed, Apr 6, 2022 at 2:45 PM Sam Rose <[email protected]> wrote: > > I just learned about the resets() function and applying it does seem to > show that a reset occurred: > > { > "request": { > "url": > "api/datasources/proxy/1/api/v1/query_range?query=resets(counter[1m])&start=1649239200&end=1649240100&step=60", > "method": "GET", > "hideFromInspector": false > }, > "response": { > "status": "success", > "data": { > "resultType": "matrix", > "result": [ > { > "metric": {/* redacted */}, > "values": [ > [ > 1649239200, > "0" > ], > [ > 1649239260, > "0" > ], > [ > 1649239320, > "0" > ], > [ > 1649239380, > "0" > ], > [ > 1649239440, > "0" > ], > [ > 1649239500, > "0" > ], > [ > 1649239560, > "0" > ], > [ > 1649239620, > "0" > ], > [ > 1649239680, > "0" > ], > [ > 1649239740, > "1" > ], > [ > 1649239800, > "0" > ], > [ > 1649239860, > "0" > ], > [ > 1649239920, > "0" > ], > [ > 1649239980, > "0" > ], > [ > 1649240040, > "0" > ], > [ > 1649240100, > "0" > ] > ] > } > ] > } > } > } > I don't quite understand how, though. 
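Ben's millisecond-timestamp point above can be sketched directly: with scrapes 60s apart but offset by a few seconds, a 60s window evaluated on even minutes usually contains only one sample, and rate() needs at least two. (The exact open/closed boundary convention of the range selector doesn't change the outcome for these timestamps.)

```python
# Sketch: which raw samples a PromQL range selector picks up, using the
# timestamps from Ben's example above.

def samples_in_window(timestamps, eval_time, window):
    """Samples falling in (eval_time - window, eval_time]."""
    return [t for t in timestamps if eval_time - window < t <= eval_time]

scrapes = [5.335, 65.335, 125.335]  # 60s apart, offset by 5.335s

print(samples_in_window(scrapes, 120, 60))    # [65.335] -> one sample, no rate
print(samples_in_window(scrapes, 120, 120))   # [5.335, 65.335] -> rate works
```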
> On Wednesday, April 6, 2022 at 1:40:12 PM UTC+1 Sam Rose wrote: > > Hi there, > > We're seeing really large spikes when using the `rate()` function on some > of our metrics. I've been able to isolate a single time series that > displays this problem, which I'm going to call `counter`. I haven't > attached the actual metric labels here, but all of the data you see here is > from `counter` over the same time period. > > This is the raw data, as obtained through a request to /api/v1/query: > > { > "data": { > "result": [ > { > "metric": {/* redacted */}, > "values": [ > [ > 1649239253.4, > "225201" > ], > [ > 1649239313.4, > "225226" > ], > [ > 1649239373.4, > "225249" > ], > [ > 1649239433.4, > "225262" > ], > [ > 1649239493.4, > "225278" > ], > [ > 1649239553.4, > "225310" > ], > [ > 1649239613.4, > "225329" > ], > [ > 1649239673.4, > "225363" > ], > [ > 1649239733.4, > "225402" > ], > [ > 1649239793.4, > "225437" > ], > [ > 1649239853.4, > "225466" > ], > [ > 1649239913.4, > "225492" > ], > [ > 1649239973.4, > "225529" > ], > [ > 1649240033.4, > "225555" > ], > [ > 1649240093.4, > "225595" > ] > ] > } > ], > "resultType": "matrix" > }, > "status": "success" > } > > The next query is taken from the Grafana query inspector, because for > reasons I don't understand I can't get Prometheus to give me any data when > I issue the same query to /api/v1/query_range. 
The query is the same as the > above query, but wrapped in a rate([1m]): > > "request": { > "url": > "api/datasources/proxy/1/api/v1/query_range?query=rate(counter[1m])&start=1649239200&end=1649240100&step=60", > "method": "GET", > "hideFromInspector": false > }, > "response": { > "status": "success", > "data": { > "resultType": "matrix", > "result": [ > { > "metric": {/* redacted */}, > "values": [ > [ > 1649239200, > "0" > ], > [ > 1649239260, > "0" > ], > [ > 1649239320, > "0" > ], > [ > 1649239380, > "0" > ], > [ > 1649239440, > "0" > ], > [ > 1649239500, > "0" > ], > [ > 1649239560, > "0" > ], > [ > 1649239620, > "0" > ], > [ > 1649239680, > "0" > ], > [ > 1649239740, > "9391.766666666665" > ], > [ > 1649239800, > "0" > ], > [ > 1649239860, > "0" > ], > [ > 1649239920, > "0" > ], > [ > 1649239980, > "0" > ], > [ > 1649240040, > "0.03333333333333333" > ], > [ > 1649240100, > "0" > ] > ] > } > ] > } > } > } > > Given the gradual increase in the underlying counter, I have two questions: > > 1. How come the rate is 0 for all except 2 datapoints? > 2. How come there is one enormous datapoint in the rate query, that is > seemingly unexplained in the raw data? > > For 2 I've seen in other threads that the explanation is an unintentional > counter reset, caused by scrapes a millisecond apart that make the counter > appear to go down for a single scrape interval. I don't think I see this in > our raw data, though. > > We're using Prometheus version 2.26.0, revision > 3cafc58827d1ebd1a67749f88be4218f0bab3d8d, go version go1.16.2. > > > -- > You received this message because you are subscribed to the Google Groups > "Prometheus Users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. 
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/c1b7568b-f7f9-4edc-943a-22412658975fn%40googlegroups.com

