[prometheus-users] Re: CPU Usage

2022-12-09 Thread Brian Candler
Then you need avg_over_time and stddev_over_time to get the mean and 
standard deviation over time, and those links give you the exact queries 
to use, especially https://stackoverflow.com/a/73351330

Instead of sum(rate(http_server_requests_seconds_count[1m])) you'll use 1 - 
avg(irate(node_cpu_seconds_total{mode="idle",instance=~"$ip"}[5m])) by 
(instance)
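
As a rough illustration of the arithmetic only (this is Python, not a PromQL query), the Z-score calculation that avg_over_time and stddev_over_time make possible could be sketched as below. The sample values are hypothetical. stddev_over_time computes the population standard deviation, so the sketch divides by N rather than N-1. Returning 0 for a flat window is a choice made for this sketch; PromQL would perform the division by zero instead.

```python
import math

def z_score(samples, current):
    """Return (current - mean) / population standard deviation of samples."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    stddev = math.sqrt(variance)
    if stddev == 0:
        # Sketch-only assumption: treat a flat window as "no deviation".
        return 0.0
    return (current - mean) / stddev

# Hypothetical CPU usage percentages, one per scrape, over the lookback window
window = [2, 4, 4, 4, 5, 5, 7, 9]

# How unusual is the most recent sample compared with the window?
latest = window[-1]
score = z_score(window, latest)
print(score)
```

A common convention is to treat an absolute Z-score above some chosen threshold (2 or 3, say) as an anomaly; the threshold is a judgment call, not something Prometheus defines.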

If the expression you've built doesn't work, then you'll need to debug it.  
I'd approach that by breaking it into parts, and putting the parts 
individually into the PromQL expression browser in the web UI, and drawing 
graphs of each subexpression.  Then start to combine the subexpressions and 
look at the results of those, until you've built up the whole query.

Typical problems are that one subexpression produces an empty instant 
vector, or that you're trying to combine two subexpressions with different 
label sets, so the result set is empty unless you use appropriate "on" or 
"ignoring" qualifiers.

On Friday, 9 December 2022 at 10:06:16 UTC shivakuma...@gmail.com wrote:

> Hello Sir,
>
> Actually I wanted find anomaly using Prometheus for time series data of 
> Memory and CPU, so I thought of implementing Z-score.
> and, Z-score formula is (x-mean)/Standard deviation.
>
> We have already referred the following Links:
>
>
> https://stackoverflow.com/questions/71079726/why-does-stddev-over-time-increase-the-bigger-the-range-vector-is
>
>
> https://stackoverflow.com/questions/72832048/unit-test-for-z-score-with-prometheus
>
> but couldn't fetch any positive result.
>
> So I wanted Prometheus query which would help me to get anomaly for 
> specified time interval, whether it is Z-score or any other method.
>
>
> On Friday, 9 December 2022 at 14:13:16 UTC+5:30 Brian Candler wrote:
>
>> You're averaging across all the CPUs to get a single figure for the 
>> instance, instead of a separate figure per CPU.  You're not averaging over 
>> time.
>>
>> You're using irate(...) which uses the last two figures available for CPU 
>> usage.  I'd use rate(...[2m]) instead of irate(...[5m]) but they should 
>> give the same results when you have a 1 minute scrape interval.  Either of 
>> these will take the difference between node_cpu_seconds_total@now and 
>> node_cpu_seconds_total@1_minute_ago and use this to calculate the rate of 
>> CPU usage.
>>
>> So what exactly is wrong with this query - in other words, what do you 
>> want that's different?
>>
>> On Friday, 9 December 2022 at 05:23:04 UTC shivakuma...@gmail.com wrote:
>>
>>> Hello All,
>>>
>>>  
>>>
>>> I was scraping data for fetching CPU usage for every minute and the 
>>> query i'm using is 
>>> ((1 - avg(irate(node_cpu_seconds_total{mode="idle",instance=~"$ip"}[5m])) 
>>> by (instance)) * 100),
>>>
>>>  i'm getting the average data but i want the last data sample which is 
>>> evaluated.
>>>
>>> Could you please help me with a proper query to fetch cpu usage of 
>>> entire instance for a particular minute.
>>>
>>> Thanks and Regards,
>>>
>>> Sandesh S
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/22cc5aeb-5dcb-408d-a652-da5fd33afe11n%40googlegroups.com.


[prometheus-users] Re: CPU Usage

2022-12-09 Thread Sandesh Shivakumar
Hello Sir,

I want to find anomalies in Memory and CPU time series data using 
Prometheus, so I thought of implementing a Z-score.
The Z-score formula is (x - mean) / standard deviation.

We have already referred to the following links:

https://stackoverflow.com/questions/71079726/why-does-stddev-over-time-increase-the-bigger-the-range-vector-is

https://stackoverflow.com/questions/72832048/unit-test-for-z-score-with-prometheus

but couldn't get a working result.

So I want a Prometheus query that would help me detect anomalies over a 
specified time interval, whether by Z-score or any other method.


On Friday, 9 December 2022 at 14:13:16 UTC+5:30 Brian Candler wrote:

> You're averaging across all the CPUs to get a single figure for the 
> instance, instead of a separate figure per CPU.  You're not averaging over 
> time.
>
> You're using irate(...) which uses the last two figures available for CPU 
> usage.  I'd use rate(...[2m]) instead of irate(...[5m]) but they should 
> give the same results when you have a 1 minute scrape interval.  Either of 
> these will take the difference between node_cpu_seconds_total@now and 
> node_cpu_seconds_total@1_minute_ago and use this to calculate the rate of 
> CPU usage.
>
> So what exactly is wrong with this query - in other words, what do you 
> want that's different?
>
> On Friday, 9 December 2022 at 05:23:04 UTC shivakuma...@gmail.com wrote:
>
>> Hello All,
>>
>>  
>>
>> I was scraping data for fetching CPU usage for every minute and the query 
>> i'm using is 
>> ((1 - avg(irate(node_cpu_seconds_total{mode="idle",instance=~"$ip"}[5m])) by 
>> (instance)) * 100),
>>
>>  i'm getting the average data but i want the last data sample which is 
>> evaluated.
>>
>> Could you please help me with a proper query to fetch cpu usage of entire 
>> instance for a particular minute.
>>
>> Thanks and Regards,
>>
>> Sandesh S
>>
>>



[prometheus-users] Re: Null value in alerts

2022-12-09 Thread Brian Candler
On Friday, 9 December 2022 at 07:31:32 UTC sebag...@gmail.com wrote:

> expression:
> windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0 or 
> on() vector(0)
>
> The alert goes off non-stop.
>

Yes, that's correct.

PromQL expressions don't work like normal boolean expressions.  They return 
the presence or absence of values, not a true or false value.  The presence 
of *any* value will trigger an alert, and vector(0) generates a value all 
of the time.

For example, suppose you have 5 timeseries for the metric 
"node_filesystem_avail_bytes".

The PromQL expression "node_filesystem_avail_bytes" returns an instant 
vector containing 5 values.

The PromQL expression "node_filesystem_avail_bytes < 1000" returns an 
instant vector containing between 0 and 5 values; you have filtered down to 
just those timeseries whose values are less than the threshold.

If you use this as an alerting expression, then if the instant vector is 
not empty, i.e. if 1 or more machines have a value less than the threshold, 
then an alert is generated.
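
To make the filtering behaviour concrete, here is a small Python model of it (a sketch, not how Prometheus is implemented). An instant vector is modelled as a dict from label set to value; the helper names and the five host values are hypothetical.

```python
def labels(**pairs):
    """Build a hashable label set from keyword arguments."""
    return frozenset(pairs.items())

def less_than(vector, threshold):
    """Keep only series whose value is below the threshold, like `metric < threshold`."""
    return {ls: value for ls, value in vector.items() if value < threshold}

def alert_fires(vector):
    """An alerting expression fires when its result contains at least one series."""
    return len(vector) > 0

# Five hypothetical series for node_filesystem_avail_bytes
avail = {
    labels(instance="host1"): 5000,
    labels(instance="host2"): 800,
    labels(instance="host3"): 12000,
    labels(instance="host4"): 999,
    labels(instance="host5"): 1000,
}

low = less_than(avail, 1000)    # only host2 and host4 remain
none_low = less_than(avail, 1)  # nothing remains

print(alert_fires(low), alert_fires(none_low))
```

Note that the comparison does not produce a true/false value per series; series that fail the comparison simply disappear from the result.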

 

>  How can I set the metric to send an alert when the value is different 
> from 0 and is null?
>

There is no concept of "null" in PromQL.  (Well, you can store a floating 
point value of "NaN" in a timeseries, but that's not what we're discussing 
here).

Either a timeseries is present, or it is not.
 
Hence I'm not really sure what you're trying to alert on.  What do your 
metrics look like?

Let me guess: they look something like this:

windows_mscluster_resourcegroup_state{instance="foo",name="Available Storage"} 123
windows_mscluster_resourcegroup_state{instance="foo",name="Broken Storage"} 0
windows_mscluster_resourcegroup_state{instance="bar",name="Available Storage"} 0
windows_mscluster_resourcegroup_state{instance="bar",name="Broken Storage"} 4

Now, this alerting expression:

windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0

will only alert on the last one of these (it filters to series whose 
"name" label is not "Available Storage", then it filters to values which 
are not 0, and only the fourth series shown matches both conditions)

Similarly, "or" works differently to what you might expect.

foo or bar

will return a union of:
- all timeseries with metric name "foo", PLUS:
- all those timeseries with metric name "bar" which *don't* have exactly 
the same label sets as the timeseries on the LHS (foo)

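Since vector(0) has no labels, but the expression you gave on your LHS has 
labels, a plain "or" would *always* include vector(0) in the result set.  
With "or on()", vector(0) is included whenever the LHS is empty instead.  
Either way the result set is never empty, and therefore the expression will 
always generate alerts.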
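
The rule described above for a plain "or" can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions: label sets are frozensets that leave out the metric name, the "on" and "ignoring" modifiers are not modelled, and the helper names are hypothetical.

```python
def labels(**pairs):
    """Build a hashable label set from keyword arguments."""
    return frozenset(pairs.items())

def vector_or(lhs, rhs):
    """All of lhs, plus those rhs series whose label set does not appear in lhs."""
    result = dict(lhs)
    for label_set, value in rhs.items():
        if label_set not in lhs:
            result[label_set] = value
    return result

# vector(0) is a single series with no labels at all
vector_zero = {labels(): 0}

# Case 1: one resource group is in a non-zero state
problem = {labels(instance="bar", name="Broken Storage"): 4}
# Case 2: nothing is wrong, so the filtered LHS is empty
healthy = {}

with_problem = vector_or(problem, vector_zero)
without_problem = vector_or(healthy, vector_zero)

print(len(with_problem), len(without_problem))
```

In both cases the result contains at least the label-less series from vector(0), so an alert built on it has something to fire on at every evaluation.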

The question is, what sort of "missing" values do you want to look for?

For example, are you trying to alert on instance "baz", which doesn't 
generate *any* values for windows_mscluster_resourcegroup_state ?  If so, 
you either need to alert explicitly on this absence, or you need to 
cross-reference to some other timeseries which refers to "baz" (such a 
timeseries is often "up").  Otherwise, the PromQL expression for 
windows_mscluster_resourcegroup_state has no way of knowing that you 
*expect* a value for baz, but there isn't one.

So one possibility is:

absent(windows_mscluster_resourcegroup_state{instance="baz",name="Available Storage"})

which will alert explicitly if there is no timeseries with that metric name 
and those particular labels.  But you've hard-coded the existence of a 
machine called "baz" into your alerting rules.

Or are you trying to alert on any node which is being scraped by scrape job 
"windows_exporter" but is not returning 
windows_mscluster_resourcegroup_state with a particular label?  The "up" 
metric tells you whether something is being scraped, so the expression 
might be along the lines of "... or on (instance) up"

If you show the *actual* metrics you are scraping (including the full label 
sets), and an example of an *actual* condition you are trying to catch, 
then we can help you write the expression.

For more hints:

https://www.robustperception.io/absent-alerting-for-jobs/
https://www.robustperception.io/existential-issues-with-metrics/
https://www.robustperception.io/staleness-and-promql/
https://www.robustperception.io/functions-to-avoid/



Re: [prometheus-users] Null value in alerts

2022-12-09 Thread Stuart Clark

On 09/12/2022 08:49, sebagloc...@gmail.com wrote:


Thanks for advice,

So in this case I just need to use absent like this In alert?:

  - alert: Resource group in cluster is down
    expr: absent(windows_mscluster_resourcegroup_state {name!~"Available Storage"}) == 1


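You aren't naming a specific series here, since you are using !~: absent() 
only returns a value when nothing at all matches the selector. To detect a 
particular missing series, you need to ensure you are only using = in any 
labels.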


--
Stuart Clark



RE: [prometheus-users] Null value in alerts

2022-12-09 Thread sebaglock14
Thanks for the advice.

So in this case I just need to use absent() like this in the alert?

 

  - alert: Resource group in cluster is down
    expr: absent(windows_mscluster_resourcegroup_state {name!~"Available Storage"}) == 1
    for: 10s
    labels:
      severity: "[Cluster]"
    annotations:
      summary: "Resource group in cluster is down!"
      description: "{{ humanize $value }}"

 

Will this one send a message when the metric is missing?

 

From: Matthias Rampke  
Sent: Friday, December 9, 2022 8:57 AM
To: Sebastian Glock 
Cc: Prometheus Users 
Subject: Re: [prometheus-users] Null value in alerts

 

When you say "the value is missing", what condition exactly do you want to 
alert on?

 

To detect that there is *no* metric matching your selector, you can use the 
absent(…) function. It returns 1 when … matches nothing.

It gets more complicated and difficult if you want to detect that a single 
series has disappeared. In this case, you need to be very specific in telling 
Prometheus which series *should* exist. Common ways to do this are

 

- listing them all out with separate absent(x) clauses and specific positive 
matchers

- comparing to a previous time (x offset 15m unless x)

- use some other metric that lets you determine what should be there

- generate recording rules to create such a metric

 

The fundamental challenge here is to distinguish between "this went missing" 
and "this went away because of expected changes".
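
The "x offset 15m unless x" approach from the list above can be pictured with a small Python model. It is a sketch only: the two snapshots are hypothetical, label sets are modelled as frozensets, and the helper names are made up for illustration.

```python
def labels(**pairs):
    """Build a hashable label set from keyword arguments."""
    return frozenset(pairs.items())

def unless(lhs, rhs):
    """Keep the lhs series whose label set has no match in rhs."""
    return {ls: value for ls, value in lhs.items() if ls not in rhs}

# Hypothetical snapshot of the metric 15 minutes ago ("x offset 15m")
earlier = {
    labels(instance="foo", name="Broken Storage"): 0,
    labels(instance="bar", name="Broken Storage"): 0,
}

# Hypothetical snapshot now ("x"): the series for "bar" is gone
now = {
    labels(instance="foo", name="Broken Storage"): 0,
}

disappeared = unless(earlier, now)
print(disappeared)
```

The result keeps the full label set of the series that vanished, which absent() cannot provide. It cannot tell whether the series went away on purpose, and it stops reporting once the series has been missing for longer than the offset.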

 

In general, I prefer splitting "metric indicates there is a problem" and 
"metric is missing" into two different alerts with separate names and 
descriptions. To the one investigating, the difference matters. Additionally, 
using absent() often results in different label sets, because it cannot know 
the labels of a time series that is absent. This causes trouble with templating 
that you sidestep by using separate alert definitions to begin with.

 

/MR

 

On Fri, 9 Dec 2022, 08:31 Sebastian Glock <sebagloc...@gmail.com> wrote:

Hi,

 

I'm having trouble setting up an alert that will send a notification when a 
value is different from 0 and the value is missing (i.e. null).

 

expression:

windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0 or on() vector(0)

 

The alert goes off non-stop. How can I set the metric to send an alert when the 
value is different from 0 and is null?

 

I tried with sum(), but it doesn't work either:

sum(windows_mscluster_resourcegroup_state {name!~"Available Storage"} != 0) or on() vector(0)

 

Thanks for replies!



[prometheus-users] Re: CPU Usage

2022-12-09 Thread Brian Candler
You're averaging across all the CPUs to get a single figure for the 
instance, instead of a separate figure per CPU.  You're not averaging over 
time.

You're using irate(...) which uses the last two figures available for CPU 
usage.  I'd use rate(...[2m]) instead of irate(...[5m]) but they should 
give the same results when you have a 1 minute scrape interval.  Either of 
these will take the difference between node_cpu_seconds_total@now and 
node_cpu_seconds_total@1_minute_ago and use this to calculate the rate of 
CPU usage.
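
The calculation described above can be sketched in Python as follows. This is a simplified model, not what rate() or irate() do internally: it ignores counter resets and the extrapolation that rate() applies, and the sample values and function names are hypothetical.

```python
def per_second_rate(v_prev, t_prev, v_now, t_now):
    """Increase of a counter between two samples, divided by the elapsed seconds."""
    return (v_now - v_prev) / (t_now - t_prev)

def cpu_usage_percent(idle_samples):
    """idle_samples maps cpu -> ((v_prev, t_prev), (v_now, t_now)) for mode="idle".

    Averages the idle rate across CPUs (not over time), then converts the
    idle fraction into a busy percentage.
    """
    idle_rates = [
        per_second_rate(v_prev, t_prev, v_now, t_now)
        for (v_prev, t_prev), (v_now, t_now) in idle_samples.values()
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return (1 - avg_idle) * 100

# Hypothetical idle counters for two CPUs, scraped 60 seconds apart
samples = {
    "0": ((1000.0, 0.0), (1045.0, 60.0)),  # idle for 45 of 60 seconds
    "1": ((2000.0, 0.0), (2015.0, 60.0)),  # idle for 15 of 60 seconds
}

usage = cpu_usage_percent(samples)
print(usage)
```

The result is a single figure for the instance at one evaluation time, built from the two most recent samples per CPU; nothing in it is averaged over a longer time window.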

So what exactly is wrong with this query - in other words, what do you want 
that's different?

On Friday, 9 December 2022 at 05:23:04 UTC shivakuma...@gmail.com wrote:

> Hello All,
>
>  
>
> I was scraping data for fetching CPU usage for every minute and the query 
> i'm using is 
> ((1 - avg(irate(node_cpu_seconds_total{mode="idle",instance=~"$ip"}[5m])) by 
> (instance)) * 100),
>
>  i'm getting the average data but i want the last data sample which is 
> evaluated.
>
> Could you please help me with a proper query to fetch cpu usage of entire 
> instance for a particular minute.
>
> Thanks and Regards,
>
> Sandesh S
>
>
