Thank you both for the detailed explanation and the additional material on this topic. I now understand why I get the results I get. However, I think the calculation method could be improved to give more precise results that better match expectations.
I propose one change to the algorithm: align the window with the data points. That way, in most cases no extrapolation is needed at all, and where it is needed, it only applies to one edge. In the example above, a 1m window with a 15s scrape interval would then cover 5 data points from edge to edge, with the data points aligning perfectly with both edges. I understand this may add complexity when one query covers multiple series, especially series with different scrape intervals, but in general the results would match better, especially for smaller ranges. What do you think about that?

juliu...@promlabs.com wrote on Friday, 29 January 2021 at 18:59:37 UTC+1:
> Yup, thanks! Goes to show just how much there is to say about just this topic :)
>
> On Fri, Jan 29, 2021 at 6:29 PM Julien Pivotto <roidel...@prometheus.io> wrote:
>> There are other resources:
>>
>> https://www.youtube.com/watch?v=67Ulrq6DxwA (https://slideshare.net/brianbrazil/counting-with-prometheus-cloudnativeconkubecon-europe-2017)
>> https://www.robustperception.io/what-range-should-i-use-with-rate
>> https://grafana.com/go/grafanaconline/prometheus-rate-queries-in-grafana/
>>
>> On 29 Jan 18:24, Julius Volz wrote:
>>> This actually triggered me to write a blog post about exactly this :)
>>> https://promlabs.com/blog/2021/01/29/how-exactly-does-promql-calculate-rates
>>>
>>> On Fri, Jan 29, 2021 at 2:12 PM Julius Volz <juliu...@promlabs.com> wrote:
>>>> On Fri, Jan 29, 2021 at 1:56 PM Alex <alexander....@gmail.com> wrote:
>>>>> First of all, thank you very much for your detailed answer. Some things, however, are still not clear to me.
>>>>>
>>>>> Your drawn rate window is 75s in total (not 60s as I expect it to be), since it contains 4x 15s samples plus 2x 1/2 step to the next sample.
>>>>
>>>> The drawn rate window is indeed exactly 60s.
>>>> In the underlying grid of the drawing tool, I made 2 boxes represent 15s. You can count that there are exactly 4 of those double boxes under the rate window. And a 60s rate window will usually contain 4 samples that are 15 seconds apart.
>>>>
>>>>> I would expect the window to be exactly 60s including edges and aligned with data points.
>>>>
>>>> The range time windows are usually *not* aligned to the data points at all, but chosen completely separately (based on the evaluation timestamp and the size of the window stretching backwards from that timestamp), and then the range vector selector just selects any samples that happen to fall under the selected time window. Consider also that a single range vector selector like "foo[5m]" (with exactly one set of window boundaries) can select many different time series at once, where not even the samples of the different series are necessarily aligned with each other. And the window size will also almost never form a *perfect* multiple of your actual scrape timestamps, even if you configure your scrape intervals to match your rate window boundaries in that way.
>>>>
>>>> So you end up with a blanket pre-chosen time window, and then whatever samples happen to fall into it are used for the rate calculation under that window for each series.
>>>>
>>>>> 4x 15s samples in my mind sum up to a perfect 60s interval for the window. When the window is aligned with the data and matches a multiple of the interval, you should get nice/perfect results, right?
>>>>> Can you explain where the difference comes from? What is wrong with my understanding of a 60s rate window?
>>>>
>>>> Same explanation as above, I guess.
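To make the sample-counting point concrete, here is a minimal Python sketch (not Prometheus code) of which scrape timestamps a range selector such as foo[1m] picks at a given evaluation time. It assumes a perfectly regular 15s scrape interval starting at t=0 and treats the window as closed on both ends; the helper name `select_window` is hypothetical. Whether a sample lying exactly on the left edge is included has differed between Prometheus versions (Prometheus 3.x excludes it), so the aligned case below is an idealization.

```python
def select_window(timestamps, eval_ts, range_s):
    """Return the sample timestamps falling into [eval_ts - range_s, eval_ts].

    Hypothetical helper for illustration; the window is treated as closed on both ends.
    """
    start = eval_ts - range_s
    return [t for t in timestamps if start <= t <= eval_ts]


# Assumption: one scrape every 15s, at t = 0, 15, ..., 300 (seconds).
scrapes = list(range(0, 301, 15))

# Typical case: the evaluation time does not coincide with a scrape.
picked = select_window(scrapes, eval_ts=302, range_s=60)
print(picked, picked[-1] - picked[0])  # [255, 270, 285, 300] 45

# Idealized case: both window edges coincide exactly with scrapes.
aligned = select_window(scrapes, eval_ts=300, range_s=60)
print(aligned, aligned[-1] - aligned[0])  # [240, 255, 270, 285, 300] 60
```

In the typical case the 60s window holds 4 samples whose first-to-last span is only 45s, so the remaining 15s are not covered by data; only when both edges coincide with scrapes does the window hold 5 samples spanning the full 60s.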
>>>>
>>>>> Thank you a lot for explaining the semantic difference between [1m] and [1m:], that helped a lot. As a user I would not expect this, but it makes sense when it picks 1m for the subquery.
>>>>
>>>> Yeah... to be honest, subqueries don't even usually come up in the normal rate context and are mostly useful for situations where you want to pass a derived expression (vs. a raw series) into a function that expects a range vector, but you don't want to go through the route of using a recording rule to record the intermediary result into a materialized time series first.
>>>>
>>>> In general, the articles https://promlabs.com/blog/2020/06/18/the-anatomy-of-a-promql-query and https://promlabs.com/blog/2020/07/02/selecting-data-in-promql could be interesting to understand a bit more the execution semantics of instant/range queries, as well as of instant/range vectors.
>>>>
>>>> Hope this helps a bit!
>>>>
>>>>> In the GitHub issue I have a screenshot with some graphs of example data with different queries.
>>>>>
>>>>> juliu...@promlabs.com wrote on Thursday, 28 January 2021 at 22:22:56 UTC+1:
>>>>>> On Thu, Jan 28, 2021 at 5:29 PM Alex <alexander....@gmail.com> wrote:
>>>>>>> A rate series would never be exact, but a rate on a counter could be exact since the counter is exact. Do you agree?
>>>>>>
>>>>>> I'm not 100% sure what you mean with the rate being exact because the counter is exact. The tricky bit is that we have to try to guess how the counter behaves outside of the exact data points that we actually have.
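The [1m] vs [1m:] difference mentioned here comes from how a subquery produces its points (explained further down in the thread). A rough Python sketch, under simplifying assumptions: the inner instant query is evaluated at every timestamp within the window that is a multiple of the step, and each evaluation takes the latest raw sample within a lookback period (5 minutes by default in Prometheus) and re-stamps it with the step timestamp. The helper name `subquery_points` is hypothetical, timestamps are in seconds rather than milliseconds, and lookback boundary handling is simplified.

```python
def subquery_points(samples, eval_ts, range_s, step_s, lookback_s=300):
    """Approximate the points a subquery (foo)[range_s:step_s] feeds into rate().

    samples: list of (timestamp, value) tuples sorted by timestamp.
    Hypothetical helper; seconds instead of milliseconds, simplified lookback.
    """
    window_start = eval_ts - range_s
    # First step timestamp at or after the window start that is a multiple of step_s.
    ts = step_s * (window_start // step_s)
    if ts < window_start:
        ts += step_s
    points = []
    while ts <= eval_ts:
        # Latest raw sample at or before ts, no older than the lookback period.
        recent = [v for t, v in samples if ts - lookback_s < t <= ts]
        if recent:
            points.append((ts, recent[-1]))  # raw value, re-stamped with the step time
        ts += step_s
    return points


# Assumption: raw scrapes every 15s but offset by 5s (t = 5, 20, ..., 290), value = index.
raw = [(5 + 15 * k, float(k)) for k in range(20)]

print(subquery_points(raw, eval_ts=300, range_s=60, step_s=15))
# [(240, 15.0), (255, 16.0), (270, 17.0), (285, 18.0), (300, 19.0)]
print(subquery_points(raw, eval_ts=300, range_s=60, step_s=60))
# [(240, 15.0), (300, 19.0)]
```

The values still come from real scrapes, but the timestamps rate() sees are moved onto the step grid (here by 10s), and with a 60s step only two points remain in a 1m window.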
>>>>>>
>>>>>> I felt like a meditative exercise, so I drew this example of how "rate(foo[1m])" would work with an assumed scrape interval of 15s:
>>>>>>
>>>>>> [image: rate-extrapolated.png]
>>>>>>
>>>>>> I did the calculation purely visually, getting a value of around 0.25/s, which is close enough for me to the 0.29/s or so you are getting. You can see that the extrapolated slope is steeper than the actual samples around the rate window would yield, but rate() has no idea how the data around the window behaves and thus guesses that on average it will behave similarly to how it behaves under the window, which is not always precisely right.
>>>>>>
>>>>>>> I also agree with your argument about multiple counter increases within one window, but that's not the case here. There is only one counter value increase (by one) within the window.
>>>>>>
>>>>>> Yep, same as above.
>>>>>>
>>>>>>> All values are known and in the past. Also, the 1m window is larger than the series interval, so there is no extrapolation needed, but I would understand interpolation/downsampling due to the 1m. Am I missing something here?
>>>>>>>
>>>>>>> All this does not explain why the query results behave like they do. So let me phrase my questions again:
>>>>>>> a) Why does [1m] behave differently from [1m:]? The optional resolution should behave the same in both cases, right?
>>>>>>
>>>>>> This comes down to subtle evaluation semantics.
>>>>>>
>>>>>> - foo[1m] is a range vector selector, running in a single outer PromQL query. It passes all raw samples as-is into rate().
>>>>>> - foo[1m:] can also be written as (foo)[1m:].
>>>>>>   It's an instant vector selector that is being run as a subquery, where every subquery aligns its output points to the subquery resolution step, rather than giving you raw samples, so you get slightly shifted samples passed into rate(). Generally you want to run rate() on completely raw data (also to not lose any counter resets if subqueries run at coarser resolutions than the underlying data).
>>>>>>
>>>>>>> b) Why are [1m:15s] and [1m:30s] also different from [1m] and [1m:]? In my understanding, the result should be the same for those due to the data and series interval.
>>>>>>
>>>>>> I haven't looked more into these, but I assume they're similar issues.
>>>>>>
>>>>>>> c) Why is the result of [1m] (and most others) twice as high as it should be (or even other strange values not matching the function's description)?
>>>>>>
>>>>>> See the diagram.
>>>>>>
>>>>>>> d) Why is the result sometimes even stepped in an unexpected way, e.g. [5m:1m]?
>>>>>>
>>>>>> As far as I can see, there's no example data or image provided for this, so it's hard to answer this one.
>>>>>>
>>>>>>> juliu...@promlabs.com wrote on Thursday, 28 January 2021 at 15:08:34 UTC+1:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Regarding "For a counter increase by 1 I expect a rate() result/value of 1/60 = 0.01666666666.": rate()'s extrapolating behavior might be the thing surprising you here, which can be extra surprising for very slow-moving counters like yours. rate() tries to calculate the best approximation of the increase of a counter *on average*, but since it has to operate on sampled values over time, it can never know the "right" value for sure.
>>>>>>>> But imagine you provide a [5m] window, and the actual first+last samples under the window are only 4m apart, and thus don't 100% coincide with the beginning and end of your [5m] window. Thus the rate() (and increase()) functions operate on the question: "OK, but WHAT IF we had had data matching exactly the full window", and extrapolate the observed 4m-based slope to the whole 5m window. Thus even if you use increase() on a counter that only increases by an integer amount, you will typically get back a non-integer result that represents the whole window, and not just the raw increases seen between actual samples. You can find the exact details of how this works (including an exception when the first/last samples are too far away from the window boundaries) in the code here:
>>>>>>>> https://github.com/prometheus/prometheus/blob/275f7e7766f80648d6e63ed968685f3963b494e9/promql/functions.go#L55-L131
>>>>>>>>
>>>>>>>> Short summary is: rate() / increase() will give you an on-average decent approximation of the actual rate of increase by extrapolating to the window boundaries.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Julius
>>>>>>>>
>>>>>>>> On Thu, Jan 28, 2021 at 12:59 PM Alex <alexander....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I think rate/irate/delta/increase are not working correctly for most resolutions.
>>>>>>>>> However, I got redirected from GitHub to here, so please have a look and tell me what you think about this.
>>>>>>>>> Is this a bug or am I getting something wrong here?
>>>>>>>>> Please have a look at the details in:
>>>>>>>>> https://github.com/prometheus/prometheus/issues/8413
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/097e9b1d-8c12-49d4-8a78-66ff4c4f555fn%40googlegroups.com.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Julius Volz
>>>>>>>> PromLabs - promlabs.com
>> --
>> Julien Pivotto
>> @roidelapluie