Re: [prometheus-users] Extracting long queries from multiple histograms

2022-04-21 Thread Victor Sudakov
Julius Volz wrote:

[dd]
> >
> > The query `app1_response_duration_bucket{le="0.75"}` will return a
> > list of endpoints which have responded faster than 0.75s.
> >
> 
> This is not quite correct - this query gives you the le="0.75" bucket
> counter for *all* endpoints, 

OK, I stand corrected.

> and the value of each bucket counter tells you
> how many requests that endpoint has handled that completed within 0.75s
> since the exposing process started tracking things.

What if I want to see how many requests each endpoint has handled that
DID NOT complete within 0.75s since the exposing process started
tracking things?
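
A possible answer, sketched under assumptions: because the buckets are
cumulative, the requests slower than 0.75s are the total minus the
le="0.75" bucket. In PromQL this could presumably be written as
`app1_response_duration_count - ignoring(le) app1_response_duration_bucket{le="0.75"}`,
assuming the histogram also exposes the conventional `_count` series
(not confirmed in this thread). The arithmetic, with made-up endpoint
names and counts:

```python
# Cumulative histogram buckets per endpoint: each "le" bucket counts the
# requests that completed within that many seconds, and "+Inf" is the total.
# Endpoint names and numbers are hypothetical, for illustration only.
buckets = {
    "/login": {"0.25": 80, "0.75": 95, "+Inf": 100},
    "/report": {"0.25": 10, "0.75": 40, "+Inf": 90},
}


def slower_than(per_endpoint, le):
    """Requests per endpoint that did NOT complete within `le` seconds."""
    return {
        endpoint: counts["+Inf"] - counts[le]
        for endpoint, counts in per_endpoint.items()
    }


slow = slower_than(buckets, "0.75")
print(slow)
```

Like the bucket counters themselves, these numbers only grow since
process start, so in practice one would likely wrap both sides in
rate() or increase() over some window.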
> 
> 
> > How do I invert the "le" and find the endpoints slower than "le"?
> >
> 
> Hmm, histograms are usually used to tell you about the *distribution* of
> request latencies to a given endpoint (or other label combination). So it's
> unclear what you mean with an endpoint being slower than some "le" value.

Please see above.

> Do you want to find out whether some endpoint has handled any requests *at
> all* that took longer than some duration? Or only if that happened in the
> last X amount of time? 

Yes, I think I can put it like this. I would like to be informed if any
endpoint has become "slow" and the details may vary.


> Or only if a certain percentage of requests were too
> slow?
> 
> One thing people frequently do is to calculate percentiles / quantiles from
> a histogram, for example:
> 
> histogram_quantile(0.9, rate(app1_response_duration_bucket[5m]))
> 
> ...would tell you the approximated 90th percentile latency in seconds as
> averaged over a moving 5-minute window for a given label combination, which
> you can then combine with a filter operator to find slow endpoints (e.g.
> "... > 10" would give you those endpoints that have a 90th percentile
> latency above 10s).
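
To make "approximated" concrete, here is a simplified sketch of the
linear interpolation that histogram_quantile() performs inside the
bucket containing the requested rank. It is not the Prometheus source:
it assumes sorted cumulative buckets with positive upper bounds, at
least one finite bucket, a final +Inf bucket, and 0 < q < 1. The bucket
boundaries and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Approximate the q-quantile from cumulative (upper_bound, count) pairs."""
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls into the open-ended bucket: the best
                # available answer is the highest finite upper bound.
                return buckets[i - 1][0]
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical buckets: le=0.5, le=1, le=5 and le=+Inf
buckets = [(0.5, 50), (1.0, 80), (5.0, 95), (math.inf, 100)]
p90 = histogram_quantile(0.9, buckets)
print(p90)
```

The result can only be as precise as the bucket layout: any quantile
that lands between le=1 and le=5 is a guess within that 4-second range.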

I've tried to graph
`histogram_quantile(0.9, rate(app1_response_duration_bucket[5m])) > 3`
but the result is very hard to interpret visually; it makes almost no sense.

It's slightly more understandable as a table/list.

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YmHfziweOcQGpIjh%40admin.sibptus.ru.


Re: [prometheus-users] Extracting long queries from multiple histograms

2022-04-20 Thread Victor Sudakov
Victor Sudakov wrote:
> 
> There is a web app which exports its metrics as multiple histograms,
> one histogram per Web endpoint. So each set of histogram data is also
> labelled by the {endpoint} label. There are about 50 endpoints so
> about 50 histograms.
> 
> I would like to detect and graph slow endpoints, that is I would like
> to know the value of {endpoint} when its {le} is over 1s or something
> like that. 
> 
> Can you please help with a relevant PromQL query and an idea how to
> represent it in Grafana?
> 
> I don't actually want 50 heatmaps, there must be a clever way to make
> an overview of all the slow endpoints, or all the endpoints with a
> particular status code etc.

An example. The PromQL query
`app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="GET"}`
produces a histogram.

The PromQL query 
`app1_response_duration_bucket{external_endpoint="http://YY/XX",status_code="200",method="POST"}`
produces another histogram.

The query `app1_response_duration_bucket{le="0.75"}` will return a
list of endpoints which have responded faster than 0.75s. 

How do I invert the "le" and find the endpoints slower than "le"?

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YmBsN4x/u/Oe7ozF%40admin.sibptus.ru.


[prometheus-users] Extracting long queries from multiple histograms

2022-04-19 Thread Victor Sudakov
Dear Colleagues,

There is a web app which exports its metrics as multiple histograms,
one histogram per Web endpoint. So each set of histogram data is also
labelled by the {endpoint} label. There are about 50 endpoints so
about 50 histograms.

I would like to detect and graph slow endpoints, that is I would like
to know the value of {endpoint} when its {le} is over 1s or something
like that. 

Can you please help with a relevant PromQL query and an idea how to
represent it in Grafana?

I don't actually want 50 heatmaps, there must be a clever way to make
an overview of all the slow endpoints, or all the endpoints with a
particular status code etc.

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/Yl77deKJeKZuj7eU%40admin.sibptus.ru.


Re: [prometheus-users] Re: A query to find a burst?

2022-01-04 Thread Victor Sudakov
Brian Candler wrote:
> On Tuesday, 4 January 2022 at 10:51:45 UTC Victor Sudakov wrote:
> 
> > This "@" modifier seems quite useful. I had not had it enabled before 
> > this conversation with you. Now I'll be using it more often. 
> >
> > Do you happen to know why it is disabled by default?
> >
> 
> I'm guessing because it's experimental and might be withdrawn if it's 
> decided not to be worth the hassle of maintaining it going forward.
> 
> You don't need it when using the HTTP API 
> <https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries> 
> anyway: you specify the time you want the instant query to be evaluated 
> at.  The web interface, which is just a front-end onto the HTTP API, also 
> lets you specify the evaluation time.  So I was using "@timestamp" 
> generically to mean "expression evaluated at that time"; it doesn't have to 
> be literal PromQL.
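
As an illustration of the point about the HTTP API, the evaluation time
is passed as the `time` parameter of the instant-query endpoint
`/api/v1/query`. The sketch below only builds the URL and sends
nothing; the server address and the timestamp are assumptions.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, query, eval_time):
    """Build an instant-query URL evaluated at a fixed Unix timestamp."""
    params = urlencode({"query": query, "time": eval_time})
    return f"{base_url}/api/v1/query?{params}"


# Hypothetical server and timestamp: the raw samples of "foo" in the
# 15 seconds leading up to the given evaluation time.
url = instant_query_url("http://localhost:9090", "foo[15s]", 1641290000)
print(url)
```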

So, the "@timestamp" modifier and the "Evaluation time" selector in the
Prometheus Web UI are the same? I see. But the "@timestamp" modifier in
PromQL is more demonstrative IMHO. Also, if PromQL is a query language,
it should be self-sufficient.

Thanks again for clarification.

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YdUGXDfpJ2Svmy3A%40admin.sibptus.ru.


Re: [prometheus-users] Re: A query to find a burst?

2022-01-04 Thread Victor Sudakov
Brian Candler wrote:
> 
> > I've noticed that if I take a larger resampling interval, like 
> > "foo[2d:1h]", I lose all my peaks. Which is kind of understandable now 
> > but the question "how to better find peaks" kind of remains.
> 
> 
> (foo == N)[2d:15s] will find the peaks, with approximate timestamps within 
> 15 seconds of the actual time the data was sampled.
> 
> If you want, you can then hit the API with additional queries
> 
> foo[15s] @timestamp
> 
> to get the raw metrics with exact timestamps (it will return the raw 
> timeseries between timestamp-15s and timestamp).

This "@" modifier seems quite useful. I had not had it enabled before
this conversation with you. Now I'll be using it more often.

Do you happen to know why it is disabled by default?

> 
> But in many applications, you don't care about this.  You're only sampling 
> the data every 15 seconds anyway, which means you'll miss the exact time 
> when the state of the thing you're sampling changed; in other words, the 
> timestamp will already have between 0 and 15 seconds of error. So adding 
> another 0-15 seconds of error is probably not a big deal.

Thank you Brian, you've been able to help me achieve more clarity. The
PromQL and the CloudWatch approaches to queries are difficult to get
used to for a person who started graphing things with MRTG 20+ years
ago.


-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YdQmuuW/GqEUP6eO%40admin.sibptus.ru.


Re: [prometheus-users] Re: A query to find a burst?

2022-01-04 Thread Victor Sudakov
Brian Candler wrote:
> 
> > Hello Brian! I don't quite understand why "(foo == N)[2d:1m]" or even 
> > "(foo == N)[2d:]" is allowed while "(foo == N)[2d]" is not?
> 
> 
> foo[2d] is a range vector.  It gives you all the individual timestamped 
> data points belonging to all timeseries for metric "foo" within a time 
> period from evaluation time T to T-2d.
> 
> However, range vectors can *only* be applied to pure metrics, not to 
> expressions.  "foo == N" is an expression which generates an instant vector 
> at some evaluation time T.
> 
> The reason for this limitation becomes clear when you consider expressions 
> which calculate across multiple timeseries, such as
> sum(foo)
> or
> foo / bar
> 
> Metrics "foo" and "bar" compromise multiple timeseries, identified by 
> different label sets.  However within each timeseries, the data points have 
> their own unique timestamps: the data points in foo{bar="a"} were not 
> necessarily scraped at the same time as foo{bar="b"}.

You probably meant "comprise" ?

> 
> Therefore, the only possible way to do arithmetic across timeseries is to 
> pick some arbitrary evaluation time T, take the value of those timeseries 
> at that same point T, and give the result timestamped with T.  A subquery 
> lets you repeat that across a time window: it scans across the window at 
> intervals of some step S, repeating the calculation at those times.
> 
> What is the value of a timeseries at time T, given that it may not have a 
> data point at exactly T? It's the value of the most recent data point on 
> *or before* time T, looking back no more than the staleness window (by 
> default 5 minutes)
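
A minimal sketch of that lookup rule, to make it tangible. It is a
simplification that ignores explicit staleness markers; the sample data
and the function name are invented for illustration.

```python
STALENESS = 300  # default lookback window in seconds (5 minutes)


def value_at(samples, t, lookback=STALENESS):
    """Value of the most recent sample at or before t.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None if there is no such sample or if it is older than the
    lookback window.
    """
    for ts, value in reversed(samples):
        if ts <= t:
            return value if t - ts <= lookback else None
    return None


# Hypothetical series scraped at t=100, 115 and 130 seconds
samples = [(100, 1.0), (115, 7.0), (130, 2.0)]

print(value_at(samples, 120))  # uses the sample scraped at t=115
print(value_at(samples, 99))   # nothing scraped yet
print(value_at(samples, 500))  # newest sample is too old
```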

Thank you, this was very educational albeit a bit difficult to grasp.

> > > What this does is evaluate the expression foo == N at the current time
> > > T, at time T-1m, at time T-2m etc. In the results, this won't give you
> > > the *exact* time that the data point occurred: it will give you a
> > > timestamp of T-Nm, which will be up to 1 minute after the timestamp of
> > > the point itself. (The value of a timeseries at time T is the value of
> > > the most recent data point on or before time T).
> >
> > Sounds fine with me if it does not skip/hide peaks but shows the time 
> > nearest to the peak. Does it?
> 
> 
> It won't be the time "nearest" the peak, but the first sampling time 
> *after* the peak.  That is, if you have a 15 second step in your subquery, 
> and a 15 second sampling interval, then the timestamp could be up to 14.99 
> seconds after the event.
> 
> You can prove this to yourself by comparing the timestamps of
> 
> foo[2d]
> foo[2d:15s]
> 
> Look for the corresponding peaks / data points.

I've noticed that if I take a larger resampling interval, like
"foo[2d:1h]", I lose all my peaks. Which is kind of understandable now
but the question "how to better find peaks" kind of remains.


-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YdQUqkBITFtbgDho%40admin.sibptus.ru.


Re: [prometheus-users] Re: A query to find a burst?

2022-01-04 Thread Victor Sudakov
Brian Candler wrote:
> 
> > Isn't "(foo == N)[2d:]" what I'm looking for? I don't quite grok 
> > subqueries, but the resolution parameter seems to be optional. At 
> > least "(foo == N)[2d:]" seems to show the timestamps I was looking 
> > for.
> >
> 
> As it says here 
> <https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery>: 
> "<resolution> is optional. Default is the global evaluation interval."
> 
> So if your global evaluation interval is 1m, then that expression is the 
> same as (foo == N)[2d:1m]

Hello Brian! I don't quite understand why "(foo == N)[2d:1m]" or even
"(foo == N)[2d:]" is allowed while "(foo == N)[2d]" is not?

> 
> What this does is evaluate the expression foo == N at the current time T, 
> at time T-1m, at time T-2m etc.  In the results, this won't give you the 
> *exact* time that the data point occurred: it will give you a timestamp of 
> T-Nm, which will be up to 1 minute after the timestamp of the point 
> itself.  (The value of a timeseries at time T is the value of the most 
> recent data point on or before time T).

Sounds fine with me if it does not skip/hide peaks but shows the time
nearest to the peak. Does it?

> 
> Also, individual scrape jobs can use different scrape intervals.  If you 
> have a global eval interval of 1 minute but this particular scrape job uses 
> 15s, then the above expression will return (on average) 1 in every 4 data 
> points.
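
The resampling effect can be sketched as follows. This is a toy model
with invented data (scrapes every 15s, one short burst) and no
staleness handling: it only shows that each output point carries the
evaluation time rather than the scrape time, and that a step larger
than the scrape interval skips samples.

```python
from bisect import bisect_right


def subquery(samples, start, end, step):
    """Evaluate a series at start, start+step, ..., end, like foo[range:step].

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    Each output point is stamped with the evaluation time and takes the
    value of the most recent sample at or before that time.
    """
    timestamps = [ts for ts, _ in samples]
    out = []
    for t in range(start, end + 1, step):
        i = bisect_right(timestamps, t)
        if i:  # at least one sample exists at or before t
            out.append((t, samples[i - 1][1]))
    return out


# Hypothetical series: scraped every 15s starting at t=10, with a
# single-sample burst of 9.0 at t=85 and 1.0 everywhere else.
samples = [(ts, 9.0 if ts == 85 else 1.0) for ts in range(10, 250, 15)]

fine = subquery(samples, 0, 240, 15)    # step equals the scrape interval
coarse = subquery(samples, 0, 240, 60)  # step is four scrape intervals

print([p for p in fine if p[1] == 9.0])  # burst reported at t=90, not t=85
print(len(fine), len(coarse))            # the coarse scan keeps 1 point in 4
print([p for p in coarse if p[1] == 9.0])  # the burst is missed entirely
```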

I have 15s across all my prometheus instances as I've read somewhere
that it is the best practice to have a unified scrape interval
everywhere.

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YdQAK9k7SqXjimiz%40admin.sibptus.ru.


Re: [prometheus-users] Re: A query to find a burst?

2022-01-02 Thread Victor Sudakov
Brian Candler wrote:
> You can send the query "foo[2d]" and then filter the results in the client, 
> to just those points where the value is N.

Indeed, in the Prometheus Web UI I can use ^F in the browser to look
for N. Thank you for the hint. The problem is not to overwhelm the
browser with data.

> This is a use case where it would be nice to be able to build a range 
> vector directly out of a simple instant vector expression, i.e. "(foo == 
> N)[2d]".  However that isn't allowed.
> 
> A subquery doesn't cut it here, because it resamples the data.  The 
> subquery "(foo == N)[2d:1s]" gives an approximation, but for a given point 
> you'll see multiple points at 1 second intervals (until the time where foo 
> != N)

Isn't "(foo == N)[2d:]" what I'm looking for? I don't quite grok
subqueries, but the resolution parameter seems to be optional. At
least "(foo == N)[2d:]" seems to show the timestamps I was looking
for.


-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YdJtXrJ9cFdR7uii%40admin.sibptus.ru.


[prometheus-users] A query to find a burst?

2022-01-01 Thread Victor Sudakov
Colleagues,

If max_over_time(foo[2d]) returns N, how can I find the exact timestamp(s) in 
the past when foo=N?

In other words, if there have been very short bursts, how do I find the exact 
time of those bursts with a PromQL query?

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YdElah1k0yJkcrKo%40admin.sibptus.ru.


Re: [prometheus-users] limiting permissions for the prometheus ClusterRole?

2021-11-08 Thread Victor Sudakov
Hello Matthias,

I've tried the set of permissions as quoted below and discovery did
NOT work. So the desired set of permissions should be somewhere in
between.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: test-prometheus
rules:
- apiGroups: [""]
  resources:
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]

Matthias Rampke wrote:
> I think it should work with just get/list/watch on pods. Try it and see
> what happens?
> 
> /MR
> 
> On Mon, Nov 8, 2021, 06:38 Victor Sudakov wrote:
> 
> > Dear Colleagues,
> >
> > There is a good working example of RBAC setup in
> >
> > https://github.com/prometheus/prometheus/blob/main/documentation/examples/rbac-setup.yml
> > However if I want to discover and scrape only pods for metrics, these
> > permissions seem a bit excessive.
> >
> > What RBAC permissions can be safely removed from the prometheus
> > ClusterRole if only "role: pod" is required? There is also a discussion
> > open at https://github.com/prometheus/prometheus/discussions/9672 ,
> > you can comment there if you like.
> >
> > Thanks in advance for any input.
> >

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YYnbV4jBxoWo/1GJ%40admin.sibptus.ru.


[prometheus-users] limiting permissions for the prometheus ClusterRole?

2021-11-07 Thread Victor Sudakov
Dear Colleagues,

There is a good working example of RBAC setup in
https://github.com/prometheus/prometheus/blob/main/documentation/examples/rbac-setup.yml
However if I want to discover and scrape only pods for metrics, these
permissions seem a bit excessive. 

What RBAC permissions can be safely removed from the prometheus
ClusterRole if only "role: pod" is required? There is also a discussion
open at https://github.com/prometheus/prometheus/discussions/9672 ,
you can comment there if you like. 

Thanks in advance for any input.

-- 
Victor Sudakov VAS4-RIPE
http://vas.tomsk.ru/
2:5005/49@fidonet

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/YYi305hUXdhYBL/U%40admin.sibptus.ru.