[prometheus-users] Re: Graph Tab in Prometheus

2022-08-18 Thread kekr...@gmail.com
Thank you Brian.  This helps.

Kevin

On Thursday, August 18, 2022 at 4:27:01 AM UTC-5 Brian Candler wrote:

> BTW, I just did a quick test.  When setting my graph display range to 2w 
> in the Prometheus web interface, I found that adjacent data points were 
> just under 81 minutes apart.  So the query
>
> max_over_time(ALERTS[81m])
>
> was able to show lots of short-lived alerts, which the plain query
>
> ALERTS
>
> did not.  Setting it longer, e.g. to [3h], smears those alerts over 
> multiple graph points, as expected.
>
> On Thursday, 18 August 2022 at 09:46:40 UTC+1 Brian Candler wrote:
>
>> Presumably you are using the PromQL query browser built into prometheus? 
>> (Not some third party tool like Grafana etc?)
>>
>> When you draw a graph from time T1 to T2, you send the prometheus API a 
>> range query to repeatedly evaluate an instant vector query over a time 
>> range from T1 to T2 with some step S.  The step S is chosen by the client 
>> so that a suitable number of points fit in the display, e.g. if it wants 
>> 200 data points then it could choose step = (T2 - T1) / 200.  In the 
>> prometheus graph view you 
>> can see this by moving your mouse left and right over the graph; a pop-up 
>> shows you each data point, and you can see it switch from point to point as 
>> you move left to right.
>>
>> Therefore, it's showing the values of the timeseries at the instants T1, 
>> T1+S, T1+2S, ... T2-S,T2.
>>
>> When evaluating a timeseries at a given instant in time, it finds the 
>> closest value *at or before* that time (up to a maximum lookback interval, 
>> which by default is 5 minutes).
>>
>> Therefore, your graph is showing *samples* of the data in the TSDB.  If 
>> you zoom out too far, you may be missing "interesting" values.  For example:
>>
>> TSDB :  0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0  ...
>> Graph:   0 0 1 0 0 ...
>>
>> Counters make this less of a problem: you can get your graph to show how 
>> the counter has *increased* between two adjacent points (usually then 
>> divided by the step time, to get a rate).
>>
>> However, the problem for a metric like ALERTS is it's not a counter, and 
>> it doesn't even switch between 0 and 1, but the whole timeseries appears 
>> and disappears.  (In fact, it creates separate timeseries for when the 
>> alert is in state "pending" and "firing").  If your graph step is more than 
>> 5 minutes, you may not catch the alert's presence at all.
>>
>> What you could try is a query like this:
>>
>> max_over_time(ALERTS{alertname="CPUUtilization"}[1h])
>>
>> The inner query is a range vector: it returns all data points within a 1 
>> hour window, between 1 hour before the evaluation time up to the evaluation 
>> time.  Then if *any* data points exist in that window, the highest one is 
>> returned, forming an instant vector again.  When your graph sweeps this 
>> expression over a time period from T1 to T2, then each data point will 
>> cover one hour. That should catch the "missing" samples.
>>
>> Of course, the time window is fixed to 1h in that query, and you may need 
>> to adjust it depending on your graph zoom level, to match the time period 
>> between adjacent points on the graph.  If you're using grafana, there's a 
>> magic variable $__interval you can use.  I vaguely remember seeing a 
>> proposal for PromQL 
>> to have a way of referring to "the current step interval" in a range vector 
>> expression, but I don't know what happened to that.
>>
>> HTH,
>>
>> Brian.
>>
>> On Wednesday, 17 August 2022 at 23:21:03 UTC+1 kekr...@gmail.com wrote:
>>
>>> I am currently looking for all CPU alerts using a query of 
>>> ALERTS{alertname="CPUUtilization"}
>>>
>>> I am stepping through the graph time frame one click at a time.  
>>>
>>> At the 12h time, I get one entry.  At 1d I get zero entries.  At 2d, I 
>>> get 4 entries but not the one I found at 12h.  I would expect to get 
>>> everything from 2d to now.
>>>
>>> At 1w, I get 8 entries but at 2w, I only get 5 entries.  I would expect 
>>> to get everything from 2w to now.
>>>
>>> Last week I ran this same query and found the alert I was looking for 
>>> back in April.  Today I ran the same query and I cannot find that alert 
>>> from April.
>>>
>>> I see this behavior in multiple Prometheus environments.
>>>
>>> Is this a problem or the way the graphing works in Prometheus?
>>>
>>> Prometheus version is 2.29.1
>>> Prometheus retention period is 1y
>>> DB is currently 1.2TB.  There are DBs as large as 5TB in other 
>>> Prometheus environments.
>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googleg

[prometheus-users] Re: Graph Tab in Prometheus

2022-08-18 Thread Brian Candler
BTW, I just did a quick test.  When setting my graph display range to 2w in 
the Prometheus web interface, I found that adjacent data points were just 
under 81 minutes apart.  So the query

max_over_time(ALERTS[81m])

was able to show lots of short-lived alerts, which the plain query

ALERTS

did not.  Setting it longer, e.g. to [3h], smears those alerts over 
multiple graph points, as expected.
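
To make the "smearing" concrete, here is a rough Python sketch. It is only a simulation, not how Prometheus is implemented: the function name, the 10-minute alert and its timestamps are all made up for illustration, and the window boundary handling is simplified.

```python
def points_with_alert(alert_samples, t1, t2, step, window):
    """Count evaluation instants t whose window (t - window, t] holds a sample."""
    count = 0
    t = t1
    while t <= t2:
        if any(t - window < ts <= t for ts in alert_samples):
            count += 1
        t += step
    return count


STEP = 81 * 60  # seconds between adjacent graph points, as observed above
TWO_WEEKS = 14 * 86_400

# Hypothetical alert: present for 10 minutes, one sample every 60 seconds.
alert_samples = list(range(10_000, 10_600, 60))

one_step = points_with_alert(alert_samples, 0, TWO_WEEKS, STEP, 81 * 60)
three_hours = points_with_alert(alert_samples, 0, TWO_WEEKS, STEP, 3 * 3600)
print(one_step, three_hours)
```

With a window equal to the step, this alert lands on a single graph point; with the 3h window it shows up on more than one.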

On Thursday, 18 August 2022 at 09:46:40 UTC+1 Brian Candler wrote:

> Presumably you are using the PromQL query browser built into prometheus? 
> (Not some third party tool like Grafana etc?)
>
> When you draw a graph from time T1 to T2, you send the prometheus API a range 
> query to repeatedly evaluate an instant vector query over a time range from 
> T1 to T2 with some step S.  The step S is chosen by the client so that a 
> suitable number of points fit in the display, e.g. if it wants 200 data 
> points then it could choose step = (T2 - T1) / 200.  In the prometheus 
> graph view you 
> can see this by moving your mouse left and right over the graph; a pop-up 
> shows you each data point, and you can see it switch from point to point as 
> you move left to right.
>
> Therefore, it's showing the values of the timeseries at the instants T1, 
> T1+S, T1+2S, ... T2-S,T2.
>
> When evaluating a timeseries at a given instant in time, it finds the 
> closest value *at or before* that time (up to a maximum lookback interval, 
> which by default is 5 minutes).
>
> Therefore, your graph is showing *samples* of the data in the TSDB.  If 
> you zoom out too far, you may be missing "interesting" values.  For example:
>
> TSDB :  0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0  ...
> Graph:   0 0 1 0 0 ...
>
> Counters make this less of a problem: you can get your graph to show how 
> the counter has *increased* between two adjacent points (usually then 
> divided by the step time, to get a rate).
>
> However, the problem for a metric like ALERTS is it's not a counter, and 
> it doesn't even switch between 0 and 1, but the whole timeseries appears 
> and disappears.  (In fact, it creates separate timeseries for when the 
> alert is in state "pending" and "firing").  If your graph step is more than 
> 5 minutes, you may not catch the alert's presence at all.
>
> What you could try is a query like this:
>
> max_over_time(ALERTS{alertname="CPUUtilization"}[1h])
>
> The inner query is a range vector: it returns all data points within a 1 
> hour window, between 1 hour before the evaluation time up to the evaluation 
> time.  Then if *any* data points exist in that window, the highest one is 
> returned, forming an instant vector again.  When your graph sweeps this 
> expression over a time period from T1 to T2, then each data point will 
> cover one hour. That should catch the "missing" samples.
>
> Of course, the time window is fixed to 1h in that query, and you may need 
> to adjust it depending on your graph zoom level, to match the time period 
> between adjacent points on the graph.  If you're using grafana, there's a 
> magic variable $__interval you can use.  I vaguely remember seeing a 
> proposal for PromQL 
> to have a way of referring to "the current step interval" in a range vector 
> expression, but I don't know what happened to that.
>
> HTH,
>
> Brian.
>
> On Wednesday, 17 August 2022 at 23:21:03 UTC+1 kekr...@gmail.com wrote:
>
>> I am currently looking for all CPU alerts using a query of 
>> ALERTS{alertname="CPUUtilization"}
>>
>> I am stepping through the graph time frame one click at a time.  
>>
>> At the 12h time, I get one entry.  At 1d I get zero entries.  At 2d, I 
>> get 4 entries but not the one I found at 12h.  I would expect to get 
>> everything from 2d to now.
>>
>> At 1w, I get 8 entries but at 2w, I only get 5 entries.  I would expect 
>> to get everything from 2w to now.
>>
>> Last week I ran this same query and found the alert I was looking for 
>> back in April.  Today I ran the same query and I cannot find that alert 
>> from April.
>>
>> I see this behavior in multiple Prometheus environments.
>>
>> Is this a problem or the way the graphing works in Prometheus?
>>
>> Prometheus version is 2.29.1
>> Prometheus retention period is 1y
>> DB is currently 1.2TB.  There are DBs as large as 5TB in other Prometheus 
>> environments.
>>
>>
>>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6e2b25fa-105d-4428-8123-646718962ae7n%40googlegroups.com.


[prometheus-users] Re: Graph Tab in Prometheus

2022-08-18 Thread Brian Candler
Presumably you are using the PromQL query browser built into prometheus? 
(Not some third party tool like Grafana etc?)

When you draw a graph from time T1 to T2, you send the prometheus API a range 
query to repeatedly evaluate an instant vector query over a time range from 
T1 to T2 with some step S.  The step S is chosen by the client so that a 
suitable number of points fit in the display, e.g. if it wants 200 data 
points then it could choose step = (T2 - T1) / 200.  In the prometheus 
graph view you 
can see this by moving your mouse left and right over the graph; a pop-up 
shows you each data point, and you can see it switch from point to point as 
you move left to right.

Therefore, it's showing the values of the timeseries at the instants T1, 
T1+S, T1+2S, ... T2-S,T2.
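
As a rough illustration of the stepping described above, here is a minimal Python sketch. The function names are hypothetical and the 200-point target is just the example figure used above; the real UI picks its own resolution.

```python
def choose_step(t1: float, t2: float, points: int = 200) -> float:
    """Return a step in seconds so that about `points` steps span t1..t2."""
    return (t2 - t1) / points


def evaluation_instants(t1: float, t2: float, step: float) -> list[float]:
    """Return T1, T1+S, T1+2S, ... up to the last instant that is <= T2."""
    count = int((t2 - t1) // step) + 1
    return [t1 + i * step for i in range(count)]


two_weeks = 14 * 24 * 3600  # graph range in seconds
step = choose_step(0, two_weeks)
instants = evaluation_instants(0, two_weeks, step)

# Step in minutes, and how many instants get evaluated.
print(step / 60, len(instants))
```

Only the values at those instants end up on the graph; everything between two adjacent instants is invisible to a plain query.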

When evaluating a timeseries at a given instant in time, it finds the 
closest value *at or before* that time (up to a maximum lookback interval, 
which by default is 5 minutes).
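
The lookback rule can be sketched in a few lines of Python. This is a simplified model under my own assumptions: the function name and sample data are hypothetical, and it ignores details such as staleness handling.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 5 * 60  # the default lookback interval mentioned above


def value_at(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value at or before t, or None if it is too old.

    `samples` is a list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t)  # number of samples with timestamp <= t
    if i == 0:
        return None  # nothing at or before t
    ts, value = samples[i - 1]
    if t - ts > lookback:
        return None  # newest sample is outside the lookback window
    return value


samples = [(0, 1.0), (60, 1.0), (120, 1.0)]
print(value_at(samples, 130))   # the sample at 120 is 10 s old, so it is used
print(value_at(samples, 1000))  # the newest sample is 880 s old, so no value
```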

Therefore, your graph is showing *samples* of the data in the TSDB.  If you 
zoom out too far, you may be missing "interesting" values.  For example:

TSDB :  0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0  ...
Graph:   0 0 1 0 0 ...

Counters make this less of a problem: you can get your graph to show how 
the counter has *increased* between two adjacent points (usually then 
divided by the step time, to get a rate).

However, the problem for a metric like ALERTS is it's not a counter, and it 
doesn't even switch between 0 and 1, but the whole timeseries appears and 
disappears.  (In fact, it creates separate timeseries for when the alert is 
in state "pending" and "firing").  If your graph step is more than 5 
minutes, you may not catch the alert's presence at all.
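
Here is a small Python simulation of that effect. Everything in it is hypothetical (a 10-minute alert with one sample per minute, a 1-day graph, a 5-minute lookback), and whether a coarse step hits or misses a given alert depends on how the instants happen to line up.

```python
def visible_at(alert_samples, t, lookback=300):
    """True if some sample of the alert series lies within [t - lookback, t]."""
    return any(t - lookback <= ts <= t for ts in alert_samples)


def points_showing_alert(alert_samples, t1, t2, step):
    """Count the evaluation instants T1, T1+S, ... <= T2 that see the alert."""
    count = 0
    t = t1
    while t <= t2:
        if visible_at(alert_samples, t):
            count += 1
        t += step
    return count


# Hypothetical alert: the series exists for 10 minutes, one sample per minute.
alert_samples = list(range(10_000, 10_600, 60))

fine = points_showing_alert(alert_samples, 0, 86_400, 60)       # 1 minute step
coarse = points_showing_alert(alert_samples, 0, 86_400, 4_838)  # ~81 minute step
print(fine, coarse)
```

With the fine step the alert is seen at several points; with the coarse step the instants happen to fall on either side of it and it is never seen.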

What you could try is a query like this:

max_over_time(ALERTS{alertname="CPUUtilization"}[1h])

The inner query is a range vector: it returns all data points within a 1 
hour window, between 1 hour before the evaluation time up to the evaluation 
time.  Then if *any* data points exist in that window, the highest one is 
returned, forming an instant vector again.  When your graph sweeps this 
expression over a time period from T1 to T2, then each data point will 
cover one hour. That should catch the "missing" samples.
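
In Python terms, the idea is roughly the following. This is a sketch of the semantics only, with made-up sample data; the exact handling of the window boundaries in Prometheus may differ from what I assume here.

```python
HOUR = 3600


def max_over_time(samples, t, window):
    """Return the highest value among samples in (t - window, t], else None.

    `samples` is a list of (timestamp, value) tuples.
    """
    values = [v for ts, v in samples if t - window < ts <= t]
    return max(values) if values else None


# Hypothetical ALERTS series: present for 10 minutes with value 1.
samples = [(ts, 1.0) for ts in range(10_000, 10_600, 60)]

print(max_over_time(samples, 3 * HOUR, HOUR))  # window (7200, 10800] has samples
print(max_over_time(samples, 4 * HOUR, HOUR))  # window (10800, 14400] is empty
```

The evaluation at 3h still reports the alert even though the series had already disappeared by then, because the window reaches back over it.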

Of course, the time window is fixed to 1h in that query, and you may need 
to adjust it depending on your graph zoom level, to match the time period 
between adjacent points on the graph.  If you're using grafana, there's a 
magic variable $__interval you can use.  I vaguely remember seeing a 
proposal for PromQL 
to have a way of referring to "the current step interval" in a range vector 
expression, but I don't know what happened to that.
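
If you are building the query yourself, one way to keep the window in line with the zoom level is to derive it from the graph range. A minimal Python sketch follows; the helper name and the figure of 250 points are my own assumptions, not something Prometheus exposes, so check the actual spacing by hovering over the graph.

```python
import math


def windowed_alert_query(selector: str, t1: float, t2: float, points: int) -> str:
    """Build a max_over_time() query whose window is at least one graph step."""
    step = math.ceil((t2 - t1) / points)  # round up so no gap is left uncovered
    return f"max_over_time({selector}[{step}s])"


TWO_WEEKS = 14 * 86_400
query = windowed_alert_query('ALERTS{alertname="CPUUtilization"}', 0, TWO_WEEKS, 250)
print(query)
```

Rounding up rather than down is deliberate: a window slightly longer than the step may show an alert on two adjacent points, while a shorter one could miss it entirely.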

HTH,

Brian.

On Wednesday, 17 August 2022 at 23:21:03 UTC+1 kekr...@gmail.com wrote:

> I am currently looking for all CPU alerts using a query of 
> ALERTS{alertname="CPUUtilization"}
>
> I am stepping through the graph time frame one click at a time.  
>
> At the 12h time, I get one entry.  At 1d I get zero entries.  At 2d, I get 
> 4 entries but not the one I found at 12h.  I would expect to get everything 
> from 2d to now.
>
> At 1w, I get 8 entries but at 2w, I only get 5 entries.  I would expect to 
> get everything from 2w to now.
>
> Last week I ran this same query and found the alert I was looking for back 
> in April.  Today I ran the same query and I cannot find that alert from 
> April.
>
> I see this behavior in multiple Prometheus environments.
>
> Is this a problem or the way the graphing works in Prometheus?
>
> Prometheus version is 2.29.1
> Prometheus retention period is 1y
> DB is currently 1.2TB.  There are DBs as large as 5TB in other Prometheus 
> environments.
>
>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bc524790-336c-43bf-b187-9bbfd02bca02n%40googlegroups.com.